Abstract:
A system and method for natural language processing of queries are provided. A lexicon includes text elements that are recognized as being a proper noun when capitalized. A natural language query includes a sequence of text elements including words. The query is processed. The processing includes a preprocessing step, in which part of speech features are assigned to the text elements in the query. This includes identifying, from a lexicon, a text element in the query which starts with a lowercase letter and assigning recapitalization information to the text element in the query, based on the lexicon. This information includes a part of speech feature of the capitalized form of the text element. Then parts of speech for the text elements in the query are disambiguated, which includes applying rules for recapitalizing text elements based on the recapitalization information.
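The preprocessing and recapitalization steps described above might look roughly like the following minimal sketch. The lexicon contents, the `annotate` and `recapitalize` names, and the preposition-based rule are all assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical lexicon: lowercase form -> part-of-speech features of the capitalized form.
PROPER_NOUN_LEXICON = {
    "paris": {"PROPER_NOUN"},
    "apple": {"PROPER_NOUN"},
}

# Assumed context words that trigger the toy recapitalization rule.
PREPOSITIONS = {"in", "to", "from", "near"}


def annotate(query):
    """Preprocessing: attach recapitalization info to lowercase tokens found in the lexicon."""
    tokens = []
    for word in query.split():
        info = None
        if word[:1].islower() and word in PROPER_NOUN_LEXICON:
            info = PROPER_NOUN_LEXICON[word]
        tokens.append({"text": word, "recap": info})
    return tokens


def recapitalize(tokens):
    """Toy disambiguation rule: recapitalize a flagged token that follows a preposition."""
    out = []
    for i, tok in enumerate(tokens):
        text = tok["text"]
        if tok["recap"] and i > 0 and tokens[i - 1]["text"].lower() in PREPOSITIONS:
            text = text.capitalize()
        out.append(text)
    return " ".join(out)
```

With these assumed rules, `recapitalize(annotate("hotels in paris"))` restores the capital in "Paris", while "apple pie recipe" is left unchanged because no rule fires for the first token.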
Abstract:
An approach is described for using a query expressed in a source language to retrieve information expressed in a target language. The approach uses a translation dictionary to convert terms in the query from the source language to appropriate terms in the target language. The approach determines viable transliterations for out-of-vocabulary (OOV) query terms by retrieving a body of information based on an in-vocabulary component of the query, and then mining the body of information to identify the viable transliterations for the OOV query terms. The approach then adds the viable transliterations to the translation dictionary. The retrieval, mining, and adding operations can be repeated one or more times.
Abstract:
Words having selected characteristics in a corpus of documents are found using a data processor arranged to execute queries. Memory stores an index structure in which entries map words, and marks for words having the selected characteristics, to locations within documents in the corpus. Some entries in the index structure represent words and other entries represent marks with the location information of a marked word. The entries for the marks can be tokens coalesced with prefixes of respective marked words or adjacent words. A query processor forms a modified query by adding a mark for a word to the query. The processor executes the modified query.
Abstract:
A familiarity level classifier comprises a stopwords engine for conducting a stopwords analysis of stopwords, e.g., introductory level stopwords and advanced level stopwords, in a document, e.g., a website; and a familiarity level classifier module for generating a document familiarity level based on the stopwords analysis. The classifier may be in an indexing module, a search engine, a user computer, or elsewhere in a computer network. The classifier may also include a reading level engine for conducting a reading level analysis of the document, in which case the familiarity level classifier module is configured to generate the document familiarity level also based on the reading level analysis. The classifier may also include a document features engine for conducting a feature analysis of the document, in which case the familiarity level classifier module is configured to generate the document familiarity level also based on the feature analysis.
Abstract:
Users in public forums often mention certain topics in the course of their discussions. Members' comments in messages to other members are analyzed to obtain terms that co-occur with topics. Frequencies of co-occurrence of a term with topics are normalized based on the frequency of the term in a random sample of messages. The terms are ranked by their normalized frequency of co-occurrence with a topic in messages. The top terms are selected based on their rank. Analysis of demographic information associated with members that mentioned top terms associated with a topic is displayed in a graphical format that highlights the relationship between the age, gender, and usage of the top terms over time. The demographic information presented includes the average age of members that mentioned a top term or their gender information within a selected time interval.
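The normalization and ranking steps above can be sketched as follows. This is only an illustration: the function name `top_terms`, whitespace tokenization, per-message counting, and the add-one smoothing in the denominator are assumptions, not details given in the abstract.

```python
from collections import Counter


def top_terms(topic, messages, background, k=3):
    """Rank terms by co-occurrence with `topic`, normalized by background frequency."""
    # Count, per message, each term that appears alongside the topic.
    co_occurrence = Counter()
    for msg in messages:
        words = set(msg.lower().split())
        if topic in words:
            co_occurrence.update(words - {topic})

    # Count how many messages of the random background sample contain each term.
    background_freq = Counter()
    for msg in background:
        background_freq.update(set(msg.lower().split()))

    # Add-one smoothing (an assumption) avoids division by zero for unseen terms.
    scores = {t: c / (background_freq[t] + 1) for t, c in co_occurrence.items()}

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Dividing by the background frequency pushes very common words such as "the" down the ranking, even when they co-occur with the topic in every message.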
Abstract:
Text searching is provided that accommodates a search criterion corresponding to a capitalization characteristic. One or more search terms are received, and a determination is made as to a capitalization characteristic of at least one search term. One or more documents are identified from a collection of documents. The identification is based at least in part on the determination of the capitalization characteristic of the search term, so that the search result satisfies the criterion of the capitalization characteristic.
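One possible reading of this behavior is sketched below. The three-way classification ("upper", "initial", "lower") and the policy that a capitalized term matches case-sensitively while a lowercase term matches any casing are assumptions chosen for illustration; the abstract does not fix a particular policy.

```python
def capitalization(term):
    """Classify a term's capitalization characteristic (assumed categories)."""
    if term.isupper():
        return "upper"
    if term[:1].isupper():
        return "initial"
    return "lower"


def search(term, documents):
    """Return documents containing the term, honoring its capitalization."""
    case_sensitive = capitalization(term) != "lower"
    hits = []
    for doc in documents:
        words = doc.split()  # naive whitespace tokenization, punctuation not stripped
        if case_sensitive:
            match = term in words
        else:
            match = term in (w.lower() for w in words)
        if match:
            hits.append(doc)
    return hits
```

Under this policy, a search for "Bill" returns only documents containing the capitalized name, whereas a search for "bill" returns documents containing the word in any casing.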
Abstract:
A method of stemming text and system therefor are described. The method comprises removing stop words from a document based on at least one stop word entry in an array of stop words and flagging as nouns words determined to be attached to definite articles and preceded by a noun array entry in an array of stop words preceding at least one noun; adding flagged nouns to a noun dictionary; flagging as verbs words determined to be preceded by a verb array entry in an array of stop words preceding at least one verb; adding flagged verbs to a verb dictionary; searching the document for nouns and verbs based on the flagged nouns and the flagged verbs; removing remaining stop words subsequent to searching the document; applying light stemming on the flagged nouns; applying a root-based stemming on the flagged verbs; and storing the stemmed document.
Abstract:
A method and apparatus for processing user-entered input and providing a response in a system for autonomously processing requests include rules. For each rule, whether the input is recognized is determined. If it is, a response is sent to the user. To determine recognized input, the method attempts to match the rule to a pattern. If a match is not found, the input is not recognized. If a match is found, the input is recognized and the response is sent. Alternatively, the input is conditionally recognized and a statement validator is executed which queries structured data to determine if a logic statement evaluates to true. Depending on how the statement evaluates: i) the input is recognized and the response is sent, ii) the structured data is queried again for the next statement validator, or iii) the input is not recognized and the method continues to the next rule.
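The rule loop with conditional recognition can be sketched roughly as below. The `INVENTORY` data, the rule layout, and the regular expressions are hypothetical; validators are modeled as plain functions that query the structured data, which is a simplification of whatever the actual system uses.

```python
import re

# Hypothetical structured data that the statement validators query.
INVENTORY = {"widget": 3, "gadget": 0}

# Each rule has a pattern, zero or more statement validators, and a response.
RULES = [
    {
        "pattern": re.compile(r"do you have (?:a |an )?(\w+)"),
        "validators": [lambda m: INVENTORY.get(m.group(1), 0) > 0],
        "response": "Yes, it is in stock.",
    },
    {
        "pattern": re.compile(r"do you have (?:a |an )?(\w+)"),
        "validators": [],
        "response": "Sorry, it is out of stock.",
    },
]


def respond(text):
    """Return the response of the first rule that recognizes the input, else None."""
    for rule in RULES:
        match = rule["pattern"].search(text.lower())
        if match is None:
            continue  # no pattern match: input not recognized by this rule
        # Conditionally recognized: validators run in order and stop at the first failure.
        if all(validator(match) for validator in rule["validators"]):
            return rule["response"]
        # A validator evaluated to false: fall through to the next rule.
    return None
```

Here `respond("Do you have a widget?")` is answered by the first rule, while a question about the out-of-stock "gadget" fails the first rule's validator and is answered by the second.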
Abstract:
Phrases in a corpus of documents including stopwords are found using a data processor arranged to execute phrase queries. Memory stores an index structure which maps entries in the index structure to documents in the corpus. Some entries in the index structure represent words and other entries represent stopwords found in the corpus coalesced with prefixes of respective words adjacent to the stopwords. The prefixes comprise one or more leading characters of the respective adjacent words. A query processor forms a modified query by substituting a stopword with a search token representing the stopword coalesced with a prefix of the next word in the query. The processor executes the modified query. Also, index structures including coalesced stopwords are created and maintained.
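A small sketch of the coalescing idea follows. The stopword list, the two-character prefix length, the `_` separator, and the positional index layout are assumptions made for illustration; the abstract only states that a stopword is coalesced with one or more leading characters of the adjacent word.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "to", "a"}  # illustrative list
PREFIX_LEN = 2                         # assumed prefix length


def tokens(words):
    """Yield (position, token); a stopword is coalesced with the next word's prefix."""
    for i, word in enumerate(words):
        if word in STOPWORDS:
            if i + 1 < len(words):
                yield i, word + "_" + words[i + 1][:PREFIX_LEN]
        else:
            yield i, word


def build_index(docs):
    """Map each token to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, tok in tokens(text.lower().split()):
            index[tok].add((doc_id, pos))
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the phrase at consecutive positions."""
    toks = list(tokens(phrase.lower().split()))
    if not toks:
        return set()
    first_pos, first_tok = toks[0]
    hits = set()
    for doc_id, start in index.get(first_tok, ()):
        # Every query token must appear at its matching offset from the start.
        if all((doc_id, start + p - first_pos) in index.get(t, ()) for p, t in toks):
            hits.add(doc_id)
    return hits
```

Because "of" is stored as `of_li` rather than as a bare stopword, its posting list is far shorter than one covering every occurrence of "of", which is the point of the coalesced entries.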
Abstract:
A system and method for improved name matching using regularized name forms are presented. A regularization rule engine uses culture-specific regularization rules to iteratively convert candidate names and query names to a canonical form, yielding regularized candidate names and regularized query names, respectively. The regularization rules are context-sensitive or context-free rules that pertain to a name's originating culture. Subsequently, a name search engine compares the regularized query name with the regularized candidate names and identifies the regularized candidate names that meet a particular regularization matching threshold. In turn, the name search engine selects the candidate names that correspond to the identified regularized candidate names and provides the selected candidate names to a user.
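The regularize-then-compare flow might be sketched as follows. The rewrite rules shown are toy examples rather than real linguistic rules, the culture key and function names are hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity measure the actual system uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical culture-specific rules: (regex pattern, replacement) pairs.
RULES = {
    "arabic": [
        (r"^(al|el)[- ]", ""),                       # context-sensitive: leading article only
        (r"ou", "u"),                                # context-free spelling variant
        (r"mohammed|mohamed|muhammed", "muhammad"),
    ],
}


def regularize(name, culture):
    """Apply the culture's rules repeatedly until the name stops changing."""
    current = name.lower().strip()
    while True:
        updated = current
        for pattern, replacement in RULES.get(culture, []):
            updated = re.sub(pattern, replacement, updated)
        if updated == current:
            return current
        current = updated


def match(query, candidates, culture, threshold=0.85):
    """Return the original candidates whose regularized form is close to the query's."""
    regular_query = regularize(query, culture)
    return [
        candidate
        for candidate in candidates
        if SequenceMatcher(None, regular_query, regularize(candidate, culture)).ratio() >= threshold
    ]
```

Note that `match` compares regularized forms but returns the original candidate strings, mirroring the final selection step in the abstract.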