Invention patent application
- Patent title: Techniques for clustering structurally similar web pages
- Patent title (Chinese): 聚类结构相似网页的技术
- Application number: US11481734
- Filing date: 2006-07-05
- Publication number: US20080010291A1
- Publication date: 2008-01-10
- Inventors: Krishna Leela Poola, Arun Ramanujapuram
- Applicants: Krishna Leela Poola, Arun Ramanujapuram
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
The web page clustering techniques described herein are URL clustering and page clustering, in which clustering algorithms group pages that are structurally similar. In URL clustering, because similarly structured pages have similar patterns in their URLs, grouping similar URL patterns groups structurally similar pages. Embodiments of URL clustering may involve (a) URL normalization and (b) URL variation computation. In page clustering, page feature-based techniques further cluster any given set of homogeneous clusters, reducing the number of clusters based on the underlying page code. Embodiments of page clustering may reduce the number of clusters based on tag probabilities and tag sequence, using an Approximate Nearest Neighborhood (ANN) graph along with an evaluation of intra-cluster and inter-cluster compactness.
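To illustrate the general idea behind URL clustering (similar URL patterns suggest similarly structured pages), the following is a minimal sketch, not the method claimed in the application. The function names `url_pattern` and `cluster_urls`, the `{num}` placeholder, and the specific normalization rules (lowercasing, replacing numeric path segments, keeping only query parameter names) are all assumptions made for illustration; the abstract does not specify how normalization or variation computation is performed.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse structural pattern (illustrative heuristic only)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: purely numeric path segments are treated as variable parts.
    segments = [
        "{num}" if seg.isdigit() else seg.lower()
        for seg in parts.path.split("/")
        if seg
    ]
    # Assumption: only the query parameter names matter, not their values.
    keys = sorted({key for key, _ in parse_qsl(parts.query, keep_blank_values=True)})
    pattern = host + "/" + "/".join(segments)
    if keys:
        pattern += "?" + "&".join(keys)
    return pattern


def cluster_urls(urls):
    """Group URLs that share the same normalized pattern."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)
```

Under these assumed rules, two product URLs that differ only in a numeric ID and in query values fall into one cluster, while a page with a different path shape lands in another. The page clustering stage described in the abstract (tag probabilities, tag sequence, ANN graph) is not covered by this sketch.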
Published/granted documents
- US07680858B2 Techniques for clustering structurally similar web pages (publication/grant date: 2010-03-16)