Web Crawler Scheduler that Utilizes Sitemaps from Websites
    1.
    Invention application
    Web Crawler Scheduler that Utilizes Sitemaps from Websites (in force)

    Publication number: US20150242508A1

    Publication date: 2015-08-27

    Application number: US14606882

    Application date: 2015-01-27

    Applicant: GOOGLE INC.

    CPC classification number: G06F17/30864

    Abstract: Systems and methods for scheduling documents for crawling are disclosed in which sitemap information is updated for a first website identified by a sitemap by downloading updated sitemap information for the first website and scheduling documents for crawling in accordance with the updated sitemap information for the first website. The sitemap information includes one or more sitemap indexes, where each respective sitemap index in the one or more sitemap indexes includes a list of URLs corresponding to documents stored at a corresponding website in a plurality of websites, the plurality of websites including the first website, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of: a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a document title, an authority of the document, and a priority of the document.
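
The scheduling idea in this abstract can be illustrated with a minimal sketch. It assumes the sitemap follows the public sitemaps.org XML format (`<loc>`, `<lastmod>`, `<changefreq>`, `<priority>`), which the abstract does not state. The names `SitemapEntry`, `parse_sitemap`, `is_due`, `schedule`, and the interval table are hypothetical, chosen for illustration, and are not taken from the patent. The document title and authority fields mentioned in the abstract are left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed mapping from <changefreq> hints to recrawl intervals (illustrative values).
CHANGEFREQ_INTERVAL = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}


@dataclass
class SitemapEntry:
    url: str
    lastmod: Optional[datetime]
    changefreq: Optional[str]
    priority: float


def _parse_date(text: Optional[str]) -> Optional[datetime]:
    """Parse an ISO date or datetime; treat values without an offset as UTC."""
    if not text:
        return None
    parsed = datetime.fromisoformat(text.strip())
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def parse_sitemap(xml_text: str) -> List[SitemapEntry]:
    """Read the <url> elements of a sitemap document into SitemapEntry records."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", NS):
        entries.append(SitemapEntry(
            url=node.findtext("sm:loc", default="", namespaces=NS).strip(),
            lastmod=_parse_date(node.findtext("sm:lastmod", namespaces=NS)),
            changefreq=node.findtext("sm:changefreq", namespaces=NS),
            # The sitemaps.org protocol defines 0.5 as the default priority.
            priority=float(node.findtext("sm:priority", default="0.5", namespaces=NS)),
        ))
    return entries


def is_due(entry: SitemapEntry, last_crawl: Optional[datetime], now: datetime) -> bool:
    """A URL is due if never crawled, modified since the last crawl,
    or older than its assumed change-frequency interval."""
    if last_crawl is None:
        return True
    if entry.lastmod is not None and entry.lastmod > last_crawl:
        return True
    interval = CHANGEFREQ_INTERVAL.get(entry.changefreq or "")
    return interval is not None and now - last_crawl >= interval


def schedule(entries: List[SitemapEntry],
             last_crawled: Dict[str, datetime],
             now: datetime) -> List[str]:
    """Return the due URLs, highest sitemap priority first (ties broken by URL)."""
    due = [e for e in entries if is_due(e, last_crawled.get(e.url), now)]
    due.sort(key=lambda e: (-e.priority, e.url))
    return [e.url for e in due]
```

In this sketch the crawler keeps its own record of last crawl times (`last_crawled`) and treats the sitemap fields only as hints; how a production scheduler weighs them is not described in the abstract.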


    Web crawler scheduler that utilizes sitemaps from websites
    2.
    Granted patent
    Web crawler scheduler that utilizes sitemaps from websites (in force)

    Publication number: US09355177B2

    Publication date: 2016-05-31

    Application number: US14606882

    Application date: 2015-01-27

    Applicant: GOOGLE INC.

    CPC classification number: G06F17/30864

    Abstract: Systems and methods for scheduling documents for crawling are disclosed in which sitemap information is updated for a first website identified by a sitemap by downloading updated sitemap information for the first website and scheduling documents for crawling in accordance with the updated sitemap information for the first website. The sitemap information includes one or more sitemap indexes, where each respective sitemap index in the one or more sitemap indexes includes a list of URLs corresponding to documents stored at a corresponding website in a plurality of websites, the plurality of websites including the first website, and each sitemap index in the one or more sitemap indexes includes information identifying one or more of: a last modification date of a URL in the list of URLs, a change frequency of a document specified by the URL, a document title, an authority of the document, and a priority of the document.


    Document reuse in a search engine crawler

    Publication number: US10216847B2

    Publication date: 2019-02-26

    Application number: US15617634

    Application date: 2017-06-08

    Applicant: Google Inc.

    Abstract: Systems and methods are provided for setting a respective reuse flag for a corresponding document in a plurality of documents based on a query-independent score associated with the corresponding document. A document crawling operation is performed on the plurality of documents in accordance with the reuse flag for respective documents in the plurality of documents. This document crawling operation includes reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer in accordance with a determination that the reuse flag associated with the respective document meets a predefined criterion.
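
The reuse-flag mechanism in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the abstract does not say how the query-independent score maps to the flag, so the sketch assumes that documents scoring below a threshold, and already held in a local cache, are reused. The names `DocRecord`, `set_reuse_flags`, `crawl`, and the `fetch` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DocRecord:
    url: str
    score: float                  # query-independent score; the scale is an assumption
    cached: Optional[str] = None  # previously downloaded version, if any
    reuse: bool = False           # reuse flag consulted by the crawl step


def set_reuse_flags(docs: List[DocRecord], threshold: float) -> None:
    """Assumed rule: reuse the cached copy of documents scoring below the threshold."""
    for doc in docs:
        doc.reuse = doc.cached is not None and doc.score < threshold


def crawl(docs: List[DocRecord], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Return content per URL, downloading only documents whose flag is not set."""
    contents: Dict[str, str] = {}
    for doc in docs:
        if doc.reuse and doc.cached is not None:
            # Predefined criterion met: reuse the previously downloaded version.
            contents[doc.url] = doc.cached
        else:
            # Otherwise download the current version from the host and cache it.
            doc.cached = fetch(doc.url)
            contents[doc.url] = doc.cached
    return contents
```

Passing the downloader in as `fetch` keeps the sketch free of network code; a real crawler would supply an HTTP client here.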

    Document reuse in a search engine crawler

    Publication number: US09679056B2

    Publication date: 2017-06-13

    Application number: US14245806

    Application date: 2014-04-04

    Applicant: Google Inc.

    CPC classification number: G06F17/30864

    Abstract: Systems and methods are provided for setting a respective reuse flag for a corresponding document in a plurality of documents based on a query-independent score associated with the corresponding document. A document crawling operation is performed on the plurality of documents in accordance with the reuse flag for respective documents in the plurality of documents. This document crawling operation includes reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer in accordance with a determination that the reuse flag associated with the respective document meets a predefined criterion.
