Abstract:
A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.
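The two phases described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the suffix lists, the choice of numeric tokens as the eliminated token type, the minimum stem length, and the function names are all assumptions, and the noun-only restriction of the preferred implementation is not modeled.

```python
import re

# Hypothetical word endings per language; the abstract does not list the actual ones.
SUFFIXES = {
    "en": ["ies", "es", "s"],
    "de": ["en", "er", "e"],
}

def tokenize(text):
    """Split text into lowercase word tokens and drop purely numeric tokens
    (numbers stand in for the 'predetermined types' of eliminated tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if not t.isdigit()]

def stem(token, lang, min_stem=3):
    """Remove the longest matching known ending. No dictionary lookup is done,
    so the result need not be a real word."""
    for suffix in sorted(SUFFIXES.get(lang, []), key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

def index_terms(text, lang):
    """Tokenize, then stem every surviving token."""
    return [stem(t, lang) for t in tokenize(text)]
```

For instance, `index_terms("The 3 libraries store indexes", "en")` yields the stem `librar`, which is not a dictionary word. The same stemming is applied to index entries and search terms, so they still match.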
Abstract:
The present invention discloses a search engine system with coded information and a search method using the same. The system includes a key word input part, a database for storing information as word codes which are not real standard words, and a central processing unit that assigns, to each word input through the key word input part or a client system, the word code of the corresponding standard word, and searches the database for information corresponding to that word code. In the invention, when key word(s) relating to information to be searched are input through the information input system, the input words are coded and the search is performed on the database using the word codes, thereby searching the information more precisely. In addition, since a plurality of different words having similar or the same meanings are coded as one standard word code according to a simple coding rule and stored in the database, the processing time for searching the information can be greatly reduced.
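A minimal sketch of the coded search idea follows. The code values, the word table, and the stored records are invented for illustration; the abstract does not specify the actual coding rule.

```python
# Hypothetical coding table: words with similar or the same meaning share one code.
WORD_CODES = {
    "car": "W001", "automobile": "W001", "auto": "W001",
    "doctor": "W002", "physician": "W002",
}

# Hypothetical database: information is keyed by word code, not by the literal word.
DATABASE = {
    "W001": ["record about vehicles"],
    "W002": ["record about medical staff"],
}

def encode(word):
    """Return the standard word code for an input word, or None if it is unknown."""
    return WORD_CODES.get(word.lower())

def search(keywords):
    """Code each input keyword, then look the codes up in the database."""
    results = []
    for word in keywords:
        code = encode(word)
        if code is not None:
            results.extend(DATABASE.get(code, []))
    return results
```

Because `car`, `auto`, and `automobile` all map to the same code, each of them retrieves the same records with a single lookup.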
Abstract:
A method and system for retrieving information from an electronic dictionary. The system stores all information about words that have the same normalized form into a single entry within the electronic dictionary. The normalized form of a word has all lower case letters and no diacritical marks. When information is to be retrieved from the dictionary for a word, the word is first normalized and then the dictionary is searched for the entry corresponding to that normalized word. The entry that is found contains the information for that word.
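The normalization and lookup steps could look roughly like the sketch below. It assumes that diacritical marks can be removed through Unicode decomposition; the in-memory dictionary layout and the function names are illustrative only.

```python
import unicodedata

def normalize(word):
    """Return the normalized form: lower case, diacritical marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def build_dictionary(entries):
    """Group information for all words sharing a normalized form into one entry."""
    dictionary = {}
    for word, info in entries:
        dictionary.setdefault(normalize(word), {})[word] = info
    return dictionary

def lookup(dictionary, word):
    """Normalize the query word, then fetch the single entry for that form."""
    return dictionary.get(normalize(word), {})
```

With entries for `resume` and `résumé`, a lookup of `RESUME` returns one entry containing the information for both spellings.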
Abstract:
Phrase generation is a method of generating sequences of terms, such as phrases, that may occur within a database of subsets containing sequences of terms, such as text. A database is provided and a relational model of the database is created. A query is then input. The query includes a term, a sequence of terms, multiple individual terms, multiple sequences of terms, or combinations thereof. Next, several sequences of terms that are contextually related to the query are assembled from contextual relations in the model of the database. The sequences of terms are then sorted and output. Phrase generation can also be an iterative process used to produce sequences of terms from a relational model of a database.
Abstract:
A method of searching for information in a text database, comprising: receiving (s1) at least one user input, the user input(s) defining a natural language expression, converting (s2, s3) the natural language expression to a tagged form (50, 51) including part-of-speech tags, applying (s4) to the tagged form (51) one or more grammar rules of the language of the natural language expression (49), to derive a regular expression (52), and analyzing (s5) the text database to determine whether there is a match between said regular expression (52) and a portion of said text database. An apparatus for carrying out this technique is also disclosed. Users may find portions of a text which match multiword expressions given by the user. Matches include possible variations that are linguistically relevant to the initial criteria, including simple inflections like plural/singular, masculine/feminine or conjugated verbs, and even more complex variations like the insertion of additional adjectives, adverbs, etc. between the words specified by the user.
Abstract:
Sentence segmentation means performs sentence segmentation on the Japanese text data to be processed. Morpheme analysis means divides sentence-by-sentence data into morphemes and analyzes the resultant morphemes on the basis of information regarding morpheme-by-morpheme continuation contained in an analytical dictionary. Morpheme dictionary information development means develops the contents of the morpheme dictionary including part of speech information, semantic classification information, sentence pattern information and noted term information. Keyword candidate extraction means extracts keyword candidates from sentence-by-sentence data on the basis of the part of speech information and the like of each morpheme. Case information acquisition means acquires case information from information regarding the classes of case of keyword candidates immediately preceding noted terms stored in a noted term table and case class classification information stored in a case class conversion table. Frequency information acquisition means acquires the appearance frequency of each keyword candidate. Importance calculation means calculates the importance of each keyword candidate as a keyword. Keyword finalizing means finalizes as true keywords only those keyword candidates having degrees of importance above a designated level of importance.
Abstract:
An information retrieval system based on probabilities that documents meet information needs. The frequency of occurrence of a representation in a collection of documents is estimated by identifying the frequency of occurrence of the representation in a sample of documents and calculating the difference between the maximum and minimum probable frequencies of occurrence of the representation in the collection. If the difference does not exceed a limit, a midpoint of the maximum and minimum probable frequencies is the estimated frequency of occurrence of the representation. Document distribution probabilities are optimized and probability thresholds are established for the identification of documents. An initial probability threshold is established and is adjusted as the probabilities are scored for documents in samples. The document result list is iteratively adjusted through the samples.
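The midpoint rule for frequency estimation can be illustrated with the sketch below. The abstract does not say how the maximum and minimum probable frequencies are obtained; this example assumes a normal-approximation confidence interval around the sample proportion, and returning `None` when the range is too wide is likewise an assumption.

```python
import math

def probable_range(hits, sample_size, z=1.96):
    """Assumed method: normal-approximation interval around the sample proportion,
    clipped to [0, 1]. Returns (minimum, maximum) probable frequency."""
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return max(0.0, p - margin), min(1.0, p + margin)

def estimate_frequency(hits, sample_size, limit):
    """Return the midpoint of the probable range if the range is narrow enough,
    otherwise None (hypothetical signal that a larger sample is needed)."""
    low, high = probable_range(hits, sample_size)
    if high - low <= limit:
        return (low + high) / 2
    return None
```

With 50 occurrences in a sample of 1000 documents and a limit of 0.05, the range is narrow enough and the midpoint is returned as the estimate. With 5 occurrences in a sample of 10, the range is too wide and no estimate is produced.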
Abstract:
An apparatus and method for linguistic expression processing provides features for spelling verification, correction, and dictionary database storage. The system utilizes a linguistically salient word skeleton-forming process to correct both typographic and cognitive spelling errors. The system also uses a suspect expression modification sequence to recognize and correct typographical spelling errors. A linguistic expression database includes a master lexicon having expression blocks arranged in accord with respective collation ranges of skeletons of expressions contained therein. In one preferred embodiment, these linguistically salient word skeletons corresponding to the master lexicon expressions are not retained in the database.
Abstract:
A method for use in analyzing tenant-specific data is disclosed. First data for a first tenant and second data for a second tenant are stored in a multi-tenant data storage system. A first portion of the first data is selected. Based on the selection, the first portion of the first data is copied to a data store that is specific to the first tenant. Data analysis techniques are applied to the data store.
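The select, copy, and analyze steps might be sketched as follows. The record layout, the selection predicate, and the use of a simple average as the analysis are all hypothetical; the abstract does not name a storage format or a specific analysis technique.

```python
def copy_tenant_portion(shared_store, tenant_id, predicate):
    """Select the rows of one tenant that satisfy the predicate and copy them
    into a new list that serves as the tenant-specific data store."""
    return [dict(row) for row in shared_store
            if row["tenant"] == tenant_id and predicate(row)]

def average_value(tenant_store):
    """Placeholder analysis: mean of the 'value' field, or None if the store is empty."""
    if not tenant_store:
        return None
    return sum(row["value"] for row in tenant_store) / len(tenant_store)

# Hypothetical multi-tenant storage holding data for two tenants.
SHARED = [
    {"tenant": "A", "year": 2023, "value": 10},
    {"tenant": "A", "year": 2024, "value": 30},
    {"tenant": "B", "year": 2024, "value": 99},
]

tenant_a_store = copy_tenant_portion(SHARED, "A", lambda row: row["year"] >= 2023)
```

The analysis then runs only against `tenant_a_store`, so data belonging to the second tenant is never touched.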