Abstract:
Systems and methods for scheduling documents for crawling are disclosed in which sitemap information is updated for a first website identified by a sitemap by downloading updated sitemap information for the first website and scheduling documents for crawling in accordance with the updated sitemap information for the first website. The sitemap information includes one or more sitemap indexes, where each respective sitemap index in the one or more sitemap indices includes a list of URLs corresponding to documents stored at a corresponding website in a plurality of websites, the plurality of websites including the first website, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of: a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a document title, an authority of the document, and a priority of the document.
Abstract:
Systems and methods for scheduling documents for crawling are disclosed in which sitemap information is updated for a first website identified by a sitemap by downloading updated sitemap information for the first website and scheduling documents for crawling in accordance with the updated sitemap information for the first website. The sitemap information includes one or more sitemap indexes, where each respective sitemap index in the one or more sitemap indices includes a list of URLs corresponding to documents stored at a corresponding website in a plurality of websites, the plurality of websites including the first website, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of: a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a document title, an authority of the document, and a priority of the document.
Abstract:
Systems and method are provided for setting a respective reuse flag for a corresponding document in a plurality of documents based on a query-independent score associated with the corresponding document. A document crawling operation is performed on the plurality of documents in accordance with the reuse flag for respective documents in the plurality of documents. This document crawling operation includes reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer in accordance with a determination that the reuse flag associated with the respective document meets a predefined criterion.
Abstract:
Systems and method are provided for setting a respective reuse flag for a corresponding document in a plurality of documents based on a query-independent score associated with the corresponding document. A document crawling operation is performed on the plurality of documents in accordance with the reuse flag for respective documents in the plurality of documents. This document crawling operation includes reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer in accordance with a determination that the reuse flag associated with the respective document meets a predefined criterion.