Abstract:
In one aspect, this disclosure relates to a method and associated apparatus that allows a user to obtain a semi-structured data input and a workload input. An improved semi-structured data storage schema is selected for a relational schema in response to the semi-structured data input and the workload input. The semi-structured data is segmented based on the selected improved semi-structured data storage schema. In one aspect, the semi-structured data is XML data.
Abstract:
A method and system for identifying schemas of web databases is provided. A schema matching system generates a mapping between an interface schema and a result schema of a web database, which is used to represent the underlying database schema. The schema matching system also generates a mapping of the interface attributes and the result attributes of the web database to global attributes of a global schema whose semantics are known. Using these mappings, a search engine service can formulate queries using the global attributes, map those queries to the corresponding interface attributes, submit the query, and retrieve the values from the result attributes that correspond to the desired global attributes.
Abstract:
A method and system for determining relatedness of images of pages based on link and page layout analysis. A link analysis system determines relatedness between images by first identifying blocks within web pages, and then analyzing the importance of the blocks to web pages, web pages to blocks, and images to blocks. Based on this analysis, the link analysis system determines the degree to which each image is related to each other image. The link analysis system may also use the relatedness of images to generate a ranking of the images. The link analysis system may also generate a vector representation of the images based on their relatedness and apply a clustering algorithm to the vector representations to identify clusters of related images.
Abstract:
Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
Abstract:
A method and system for identifying the importance of information areas of a display page. An importance system identifies information areas or blocks of a web page. A block of a web page represents an area of the web page that appears to relate to a similar topic. The importance system provides the characteristics or features of a block to an importance function that generates an indication of the importance of that block to its web page. The importance system “learns” the importance function by generating a model based on the features of blocks and the user-specified importance of those blocks. To learn the importance function, the importance system asks users to provide an indication of the importance of blocks of web pages in a collection of web pages.
Abstract:
The automated social networking graph mining and visualization technique described herein mines social connections and allows creation of a social networking graph from general (not necessarily social-application specific) Web pages. The technique uses the distances between a person's/entity's name and related people's/entities names on one or more Web pages to determine connections between people/entities and the strengths of the connections. In one embodiment, the technique lays out these connections, and then clusters them, in a 2-D layout of a social networking graph that represents the Web connection strengths among the related people's or entities' names, by using a force-directed model.
Abstract:
Described is a data-centric web search engine technology/architecture, in which document metadata, including offline-extracted metadata, is used as part of a search indexing and ranking pipeline. A web data management component receives crawled documents and extracts document metadata from the documents. An indexing component uses the document metadata to build an index for the documents. A serving component uses the index and the document metadata to serve content, e.g., search results. Also described is the use of query metadata extracted from queries of a query log for use in the pipeline.
Abstract:
A method and system for determining relevance of a document having text and images to a text string is provided. A scoring system identifies image text associated with an image of the document. The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, and document classification, image search, and image classification.
Abstract:
A content object indexing process including creating a content object knowledge index, calculating a description vector of a target content object, and indexing the target content object by searching for the description vector in the content object knowledge database. It may be difficult to search for an exact content object such as a music file or academic researcher as a conventional search index may not include related hierarchical information. A content object indexing process may add hierarchical information taken from a content object knowledge index and incorporate the hierarchical information to the index entry for a specific content object. An application of such a content object indexing process may be a world wide web search engine.
Abstract:
In one aspect, this disclosure relates to a method and associated apparatus that allows a user to obtain a semi-structured data input and a workload input. An improved semi-structured data storage schema is selected for a relational schema in response to the semi-structured data input and the workload input. The semi-structured data is segmented based on the selected improved semi-structured data storage schema. In one aspect, the semi-structured data is XML data.