Automatic stop word identification and compensation
    51.
    Granted patent
    Legal status: in force

    Publication number: US07720792B2

    Publication date: 2010-05-18

    Application number: US11348303

    Filing date: 2006-02-07

    CPC classification number: G06F17/30666 G06F17/30616

    Abstract: Disclosed are methods and computer program products for automatically identifying and compensating for stop words in a text processing system. This automatic stop word compensation allows such operations as performing queries on an abstract mathematical space built using all words from all texts, with the ability to compensate for the skew that the inclusion of the stop words may have introduced into the space. Documents are represented by document vectors in the abstract mathematical space. To compensate for stop words, a weight function is applied to a predetermined component of the document vectors associated with frequently occurring word(s) contained in the documents. The weight function may be applied dynamically during query processing. Alternatively, the weight function may be applied statically to all document vectors.
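
    The compensation step described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that component 0 of each vector is the "predetermined component" dominated by frequently occurring words, that the weight function is a constant scale factor of 0.1, and that the query vector is weighted the same way as the document vectors. The function names and sample vectors are hypothetical.

```python
import math


def apply_weight(vec, component=0, weight=0.1):
    """Return a copy of vec with one component scaled by weight.

    Assumption: a constant factor stands in for the weight function.
    """
    out = list(vec)
    out[component] *= weight
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_dynamic(query_vec, doc_vecs, component=0, weight=0.1):
    """Dynamic variant: weight at query time, stored vectors stay untouched."""
    q = apply_weight(query_vec, component, weight)
    return [cosine(q, apply_weight(d, component, weight)) for d in doc_vecs]


def reweight_static(doc_vecs, component=0, weight=0.1):
    """Static variant: rewrite every stored document vector once."""
    return [apply_weight(d, component, weight) for d in doc_vecs]


# Hypothetical vectors: component 0 is inflated by stop words in every text.
docs = [[9.0, 1.0, 0.0], [9.0, 0.0, 1.0]]
query = [9.0, 1.0, 0.0]

raw = [cosine(query, d) for d in docs]
compensated = query_dynamic(query, docs)
```

    Without compensation, both documents look almost equally similar to the query because the shared stop word component dominates. After down-weighting that component, the second document's similarity drops noticeably while the first remains a perfect match.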

    System and method for improved name matching using regularized name forms
    52.
    Granted patent
    Legal status: lapsed

    Publication number: US07599921B2

    Publication date: 2009-10-06

    Application number: US11681333

    Filing date: 2007-03-02

    CPC classification number: G06F17/30666 G06F17/30669 Y10S707/99933

    Abstract: A system and method for improved name matching using regularized name forms is presented. A regularization rule engine uses culture-specific regularization rules to iteratively convert candidate names and query names to canonical forms, yielding regularized candidate names and regularized query names, respectively. The regularization rules are context-sensitive or context-free rules that pertain to a name's originating culture. Subsequently, a name search engine compares the regularized query name with the regularized candidate names and identifies the regularized candidate names that meet a particular regularization matching threshold. In turn, the name search engine selects the candidate names that correspond to the identified regularized candidate names and provides the selected candidate names to a user.
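
    A rough sketch of the regularize-then-match flow follows. The rule set, the culture key "example", the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rules for one made-up culture key; real rule sets would be
# curated per originating culture.
RULES = {
    "example": [
        (r"ph", "f"),          # context-free rewrite
        (r"c(?=[ei])", "s"),   # context-sensitive: only before e or i
        (r"(.)\1", r"\1"),     # collapse doubled letters
    ],
}


def regularize(name, culture, max_passes=10):
    """Apply the culture's rules repeatedly until the form stops changing."""
    form = name.lower()
    for _ in range(max_passes):
        new = form
        for pattern, replacement in RULES[culture]:
            new = re.sub(pattern, replacement, new)
        if new == form:
            break
        form = new
    return form


def search(query, candidates, culture, threshold=0.9):
    """Return the original candidate names whose regularized form is close
    enough to the regularized query."""
    q = regularize(query, culture)
    hits = []
    for cand in candidates:
        score = SequenceMatcher(None, q, regularize(cand, culture)).ratio()
        if score >= threshold:
            hits.append(cand)
    return hits


matches = search("Philipp", ["Filip", "Felipe", "Marcus"], "example")
```

    Here "Philipp" and "Filip" both regularize to "filip", so the original candidate "Filip" is returned, while "Felipe" and "Marcus" fall below the threshold.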

    Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
    53.
    Granted patent
    Legal status: in force

    Publication number: US07409383B1

    Publication date: 2008-08-05

    Application number: US10813590

    Filing date: 2004-03-31

    Abstract: A stopword detection component detects stopwords (also stop-phrases) in search queries input to keyword-based information retrieval systems. Potential stopwords are initially identified by comparing the terms in the search query to a list of known stopwords. Context data is then retrieved based on the search query and the identified stopwords. In one implementation, the context data includes documents retrieved from a document index. In another implementation, the context data includes categories relevant to the search query. Sets of retrieved context data are compared to one another to determine if they are substantially similar. If the sets of context data are substantially similar, this fact may be used to infer that the removal of the potential stopword(s) is not material to the search. If the sets of context data are not substantially similar, the potential stopword can be considered material to the search and should not be removed from the query.
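
    The comparison of context data with and without a potential stopword might look like the sketch below. It assumes the context data is the set of matching document IDs from a tiny in-memory index, that "substantially similar" means a Jaccard overlap of at least 0.8, and that retrieval is a simple all-terms match. The stopword list, documents, and function names are hypothetical.

```python
STOPWORDS = {"the", "a", "of"}  # assumed list of known stopwords

# Hypothetical document index: id -> text
DOCS = {
    1: "the who live at leeds",
    2: "who wrote this song",
    3: "the who tour dates",
    4: "who is on tour",
    5: "a history of the roman empire",
    6: "the roman empire in decline",
}


def retrieve(terms):
    """Return IDs of documents containing every term (context data)."""
    wanted = set(terms)
    return {doc_id for doc_id, text in DOCS.items()
            if wanted <= set(text.split())}


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prune_stopwords(query, threshold=0.8):
    """Drop a potential stopword only if removing it leaves the retrieved
    context data substantially unchanged."""
    terms = query.lower().split()
    kept = list(terms)
    for term in terms:
        if term not in STOPWORDS:
            continue
        without = [t for t in kept if t != term]
        if not without:
            continue  # never reduce the query to nothing
        if jaccard(retrieve(kept), retrieve(without)) >= threshold:
            kept = without  # removal is not material to the search
    return kept


band_query = prune_stopwords("the who")
history_query = prune_stopwords("the roman empire")
```

    In this toy index, removing "the" from "the who" changes the result set substantially, so the word is kept as meaningful; removing it from "the roman empire" changes nothing, so it is dropped.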

    Client-server word-breaking framework
    54.
    Patent application
    Legal status: in force

    Publication number: US20070088677A1

    Publication date: 2007-04-19

    Application number: US11249623

    Filing date: 2005-10-13

    Abstract: Word-breaking of a query from a client machine in a client-server environment includes determining whether to use a first word breaking module operable with a client machine in the client-server environment and/or a second word breaking module operable with a server in the client-server environment.

    Method and system for information extraction
    55.
    Granted patent
    Legal status: in force

    Publication number: US07194406B2

    Publication date: 2007-03-20

    Application number: US11032075

    Filing date: 2005-01-11

    Abstract: A method and a system for extracting information from a natural language text corpus based on a natural language query are disclosed. In the method, the natural language text corpus is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents, and the analyzed natural language text corpus is then indexed and stored. Furthermore, a natural language query is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents. From the analyzed natural language query, one or more surface variants are then created, where these surface variants are equivalent to the natural language query with respect to lexical meaning of word tokens and surface syntactic roles of constituents. The surface variants are then compared with the indexed and stored analyzed natural language text corpus, and each portion of text comprising a string of word tokens that matches any one of the surface variants or the natural language query is extracted from the indexed and stored analyzed natural language text corpus.

    Automatic stop word identification and compensation
    57.
    Patent application
    Legal status: in force

    Publication number: US20060224572A1

    Publication date: 2006-10-05

    Application number: US11348303

    Filing date: 2006-02-07

    Applicant: Robert Price

    Inventor: Robert Price

    CPC classification number: G06F17/30666 G06F17/30616

    Abstract: Computer-based methods for automatically identifying and compensating for stop words contained in documents are described. The method for compensating for stop words includes: generating an abstract mathematical space based on documents included in a collection of documents, wherein each document has a representation in the abstract mathematical space; receiving a user query; generating a representation of the user query in the abstract mathematical space; computing a similarity between the representation of the user query and the representation of each document, wherein computing a similarity between the representation of the user query and the representation of a first document in the collection of documents comprises applying a weighting function to a value associated with a frequently occurring word contained in the first document, thereby automatically compensating for the frequently occurring word contained in the first document; and displaying a result based on the similarity computations.

    Information coding and retrieval system and method thereof
    59.
    Granted patent
    Legal status: lapsed

    Publication number: US06775663B1

    Publication date: 2004-08-10

    Application number: US09890365

    Filing date: 2001-07-30

    Applicant: Si Han Kim

    Inventor: Si Han Kim

    Abstract: A search engine system with coded information and a search method using the same are disclosed. The system includes a key word input part, a database for storing information as word codes which are not real standard words, and a central processing unit for assigning a word code assigned to a standard word to a word input through the key word input part or a client system, and searching information corresponding to the word code of the input word through the database. When key word(s) relating to information to be searched are input through the information input system, the input words are coded and the search is performed using the word codes through the database, thereby searching the information more precisely. In addition, since a plurality of different words having similar or same meanings are coded as one standard word code according to a simple coding rule and stored in the database, the processing time for searching the information can be greatly reduced.
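
    The idea of coding several words with the same meaning as one standard word code, then searching on codes rather than words, can be sketched as below. The code table, the code format ("W001"), and the sample records are invented for illustration; the abstract does not describe the actual coding rule.

```python
# Hypothetical coding table: several surface words share one standard code.
STANDARD_CODES = {
    "car": "W001", "automobile": "W001", "auto": "W001",
    "repair": "W002", "fix": "W002", "mend": "W002",
}


def encode(text):
    """Replace each word with its standard code; unknown words pass through."""
    return [STANDARD_CODES.get(word, word) for word in text.lower().split()]


def build_index(records):
    """Store records under word codes instead of the original words."""
    index = {}
    for record_id, text in records.items():
        for code in encode(text):
            index.setdefault(code, set()).add(record_id)
    return index


def search(index, query):
    """Code the query words, then intersect the postings for each code."""
    codes = encode(query)
    if not codes:
        return set()
    result = set(index.get(codes[0], set()))
    for code in codes[1:]:
        result &= index.get(code, set())
    return result


records = {1: "automobile repair", 2: "fix bicycle", 3: "car wash"}
index = build_index(records)
hits = search(index, "auto mend")
```

    The query "auto mend" shares no literal word with record 1 ("automobile repair"), yet it matches because both sides reduce to the same pair of codes.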

    Multi-language document search and retrieval system

    Publication number: US06654717B2

    Publication date: 2003-11-25

    Application number: US10080513

    Filing date: 2002-02-25

    Abstract: A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.
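
    A dictionary-free tokenize-and-stem pipeline of the kind outlined above could be sketched like this. The ending lists, the choice to discard purely numeric tokens, and the minimum stem length of 3 are assumptions for illustration; the sketch also omits the noun-only restriction of the preferred implementation, since that would need a part-of-speech tagger.

```python
import re

# Hypothetical per-language lists of known word endings.
ENDINGS = {
    "en": ["ings", "ing", "es", "s"],
    "de": ["ungen", "ung", "en", "e"],
}


def tokenize(text):
    """Split text into word tokens and drop one predetermined token type
    (here: purely numeric tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if not t.isdigit()]


def stem(token, lang, min_stem=3):
    """Strip the longest known ending; no dictionary check on the result."""
    for ending in sorted(ENDINGS[lang], key=len, reverse=True):
        if token.endswith(ending) and len(token) - len(ending) >= min_stem:
            return token[: -len(ending)]
    return token


def index_terms(text, lang):
    """Tokenize, then stem every remaining token."""
    return [stem(t, lang) for t in tokenize(text)]


english_terms = index_terms("Indexing 42 documents", "en")
german_terms = index_terms("Zeitungen", "de")
```

    Because the same function is applied to both index entries and search terms, the stems only need to be consistent with each other, not valid dictionary words.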
