Multi-language document search and retrieval system
    81.
    Granted patent
    Multi-language document search and retrieval system (status: active)

    Publication No.: US07174290B2

    Publication date: 2007-02-06

    Application No.: US10612936

    Filing date: 2003-07-07

    Abstract: A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.
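
The two phases described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the suffix lists, the `MIN_STEM` length guard, and the choice to drop purely numeric tokens are all assumptions made for the example, and the noun-only restriction of the preferred implementation is not modeled.

```python
import re

# Hypothetical ending lists per language; the abstract does not enumerate them.
SUFFIXES = {
    "en": ["ies", "es", "s"],
    "de": ["en", "er", "e"],
}
MIN_STEM = 3  # assumed guard so very short words are left untouched


def tokenize(text):
    """Split text into lowercase word tokens and drop purely numeric tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if not t.isdigit()]


def stem(token, lang):
    """Strip the longest known ending; the result is never checked against a dictionary."""
    for suffix in sorted(SUFFIXES.get(lang, []), key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM:
            return token[: -len(suffix)]
    return token


def index_terms(text, lang):
    """Tokenize, then stem every remaining token."""
    return [stem(t, lang) for t in tokenize(text)]


print(index_terms("Libraries store 42 books", "en"))
# ['librar', 'store', 'book']
```

Note that `librar` is not a dictionary word. Index entries and search terms only need to reduce to the same stem, which is the dictionary independence the abstract describes.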

    Information coding and retrieval system and method thereof
    82.
    Patent application
    Information coding and retrieval system and method thereof (status: lapsed)

    Publication No.: US20040267733A1

    Publication date: 2004-12-30

    Application No.: US10841271

    Filing date: 2004-05-07

    Inventor: Si Han Kim

    Abstract: The present invention discloses a search engine system with coded information and a search method using the same. The system includes a keyword input part, a database for storing information as word codes which are not real standard words, and a central processing unit for assigning a word code assigned to a standard word to a word input through the keyword input part or a client system, and searching the database for information corresponding to the word code of the input word. In the invention, when keyword(s) relating to information to be searched are input through the information input system, the input words are coded and the database is searched using the word codes, thereby searching the information more precisely. In addition, since a plurality of different words having similar or identical meanings are coded as one standard word code according to a simple coding rule and stored in the database, the processing time for searching the information can be greatly reduced.
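
The coding idea can be illustrated with a small sketch in which synonyms share one code and both indexing and search operate on codes rather than on the words themselves. The code table, the code format (`W001`), and all function names are hypothetical; the abstract does not specify the coding rule.

```python
# Hypothetical code table: words with similar or identical meanings share one code.
WORD_CODES = {
    "car": "W001", "automobile": "W001", "auto": "W001",
    "bicycle": "W002", "bike": "W002",
}


def encode(words):
    """Map input words to their word codes, ignoring words without a code."""
    lowered = (w.lower() for w in words)
    return {WORD_CODES[w] for w in lowered if w in WORD_CODES}


def build_index(docs):
    """Store documents under word codes instead of the original words."""
    index = {}
    for doc_id, text in docs.items():
        for code in encode(text.split()):
            index.setdefault(code, set()).add(doc_id)
    return index


def search(index, query):
    """Code the query words, then look the codes up in the index."""
    results = set()
    for code in encode(query.split()):
        results |= index.get(code, set())
    return results


docs = {1: "Used automobile for sale", 2: "Mountain bike trails"}
index = build_index(docs)
print(search(index, "car"))
```

Because "car" and "automobile" map to the same code, the query for "car" retrieves document 1 even though that word never appears in it.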

    Creating an electronic dictionary using source dictionary entry keys
    83.
    Granted patent
    Creating an electronic dictionary using source dictionary entry keys (status: active)

    Publication No.: US06651220B1

    Publication date: 2003-11-18

    Application No.: US09303992

    Filing date: 1999-05-03

    CPC classification number: G06F17/2795 G06F17/30663 G06F17/30666

    Abstract: A method and system for retrieving information from an electronic dictionary. The system stores all information about words that have the same normalized form into a single entry within the electronic dictionary. The normalized form of a word has all lower case letters and no diacritical marks. When information is to be retrieved from the dictionary for a word, the word is first normalized and then the dictionary is searched for the entry corresponding to that normalized word. The entry that is found contains the information for that word.
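
A minimal sketch of the normalized-key lookup follows, assuming Unicode decomposition is an acceptable way to strip diacritical marks. The function names and sample entries are hypothetical.

```python
import unicodedata


def normalize(word):
    """Return the word in lower case with diacritical marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def build_dictionary(entries):
    """Group all information for words sharing a normalized form into one entry."""
    dictionary = {}
    for word, info in entries:
        dictionary.setdefault(normalize(word), []).append((word, info))
    return dictionary


def lookup(dictionary, word):
    """Normalize the query word first, then fetch the single matching entry."""
    return dictionary.get(normalize(word), [])


entries = [
    ("résumé", "noun: summary of a career"),
    ("resume", "verb: to continue"),
    ("Polish", "adjective: of Poland"),
    ("polish", "verb: to make shiny"),
]
dictionary = build_dictionary(entries)
print(lookup(dictionary, "Résumé"))
```

Both "résumé" and "resume" end up under the key `resume`, so a single lookup returns the information for every word that shares the normalized form.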

    System, method and apparatus for generating phrases from a database
    84.
    Patent application
    System, method and apparatus for generating phrases from a database (status: lapsed)

    Publication No.: US20020188587A1

    Publication date: 2002-12-12

    Application No.: US09800313

    Filing date: 2001-03-02

    Abstract: Phrase generation is a method of generating sequences of terms, such as phrases, that may occur within a database of subsets containing sequences of terms, such as text. A database is provided and a relational model of the database is created. A query is then input. The query includes a term or a sequence of terms or multiple individual terms or multiple sequences of terms or combinations thereof. Next, several sequences of terms that are contextually related to the query are assembled from contextual relations in the model of the database. The sequences of terms are then sorted and output. Phrase generation can also be an iterative process used to produce sequences of terms from a relational model of a database.

    Multi-language document search and retrieval system

    Publication No.: US20020161570A1

    Publication date: 2002-10-31

    Application No.: US10080513

    Filing date: 2002-02-25

    Abstract: A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.

    Linguistic search system
    86.
    Granted patent
    Linguistic search system (status: lapsed)

    Publication No.: US06202064B1

    Publication date: 2001-03-13

    Application No.: US09099909

    Filing date: 1998-06-18

    Inventor: Laurent Julliard

    Abstract: A method of searching for information in a text database, comprising: receiving (s1) at least one user input, the user input(s) defining a natural language expression, converting (s2, s3) the natural language expression to a tagged form (50, 51) including part-of-speech tags, applying (s4) to the tagged form (51) one or more grammar rules of the language of the natural language expression (49), to derive a regular expression (52), and analyzing (s5) the text database to determine whether there is a match between said regular expression (52) and a portion of said text database. An apparatus for carrying out this technique is also disclosed. Users may find portions of a text which match multiword expressions given by the user. Matches include possible variations that are relevant to the initial criteria from a linguistic point of view, including simple inflections like plural/singular, masculine/feminine or conjugated verbs and even more complex variations like the insertion of additional adjectives, adverbs, etc. in between the words specified by the user.
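
The pipeline of tagging an expression, applying grammar rules, and deriving a regular expression can be sketched with a toy example. Everything specific here is an assumption: the lexicon, the single inflection rule (an optional plural `s` on nouns), and the allowance of up to two inserted words stand in for the real tagger and grammar rules, which the abstract does not detail.

```python
import re

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
LEXICON = {"red": "ADJ", "fast": "ADJ", "car": "NOUN", "house": "NOUN"}


def tag(expression):
    """Attach a part-of-speech tag to each word (unknown words default to NOUN)."""
    return [(w, LEXICON.get(w, "NOUN")) for w in expression.lower().split()]


def to_regex(tagged):
    """Derive a regular expression from the tagged form using two toy rules."""
    parts = []
    for word, pos in tagged:
        if pos == "NOUN":
            parts.append(re.escape(word) + r"s?")  # rule 1: singular or plural
        else:
            parts.append(re.escape(word))
    # Rule 2: allow up to two extra words (e.g. adjectives) between the given words.
    gap = r"(?:\s+\w+){0,2}\s+"
    return re.compile(r"\b" + gap.join(parts) + r"\b", re.IGNORECASE)


pattern = to_regex(tag("red car"))
match = pattern.search("She bought two red sports cars yesterday")
print(match.group(0) if match else None)
# red sports cars
```

The query "red car" matches "red sports cars" because the derived pattern tolerates both the plural inflection and the inserted word.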

    Keyword extraction apparatus for Japanese texts
    87.
    Granted patent
    Keyword extraction apparatus for Japanese texts (status: lapsed)

    Publication No.: US5619410A

    Publication date: 1997-04-08

    Application No.: US219530

    Filing date: 1994-03-29

    CPC classification number: G06F17/30666 G06F17/2795 G06F17/30616

    Abstract: Sentence segmentation means performs sentence segmentation on the Japanese text data to be processed. Morpheme analysis means divides sentence-by-sentence data into morphemes and analyzes the resultant morphemes on the basis of information regarding morpheme-by-morpheme continuation contained in an analytical dictionary. Morpheme dictionary information development means develops the contents of the morpheme dictionary including part of speech information, semantic classification information, sentence pattern information and noted term information. Keyword candidate extraction means extracts keyword candidates from sentence-by-sentence data on the basis of the part of speech information and the like of each morpheme. Case information acquisition means acquires case information from information regarding the classes of case of keyword candidates immediately preceding noted terms stored in a noted term table and case class classification information stored in a case class conversion table. Frequency information acquisition means acquires the appearance frequency of each keyword candidate. Importance calculation means calculates the importance of each keyword candidate as a keyword. Keyword finalizing means determines as true keywords only those keyword candidates having degrees of importance above a designated level of importance.

    System of document representation retrieval by successive iterated probability sampling
    88.
    Granted patent
    System of document representation retrieval by successive iterated probability sampling (status: lapsed)

    Publication No.: US5488725A

    Publication date: 1996-01-30

    Application No.: US39757

    Filing date: 1993-03-30

    Abstract: An information retrieval system based on probabilities that documents meet information needs. The frequency of occurrence of a representation in a collection of documents is estimated by identifying the frequency of occurrence of the representation in a sample of documents and calculating the difference between the maximum and minimum probable frequencies of occurrence of the representation in the collection. If the difference does not exceed a limit, a midpoint of the maximum and minimum probable frequencies is the estimated frequency of occurrence of the representation. Document distribution probabilities are optimized and probability thresholds are established for the identification of documents. An initial probability threshold is established and is adjusted as the probabilities are scored for documents in samples. The document result list is iteratively adjusted through the samples.
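
The frequency-estimation step can be sketched as below. The abstract does not say how the maximum and minimum probable frequencies are computed, so this example substitutes a normal-approximation confidence interval; that choice, the `z` value, and the function name are assumptions.

```python
import math


def estimate_frequency(sample_hits, sample_size, limit, z=1.96):
    """Estimate a representation's relative frequency from a document sample.

    Returns the midpoint of the probable range when the range is no wider
    than `limit`; otherwise returns None, signalling that a larger sample
    is needed before an estimate can be accepted.
    """
    p = sample_hits / sample_size
    # Assumed interval: normal approximation to the binomial proportion.
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    low = max(0.0, p - half_width)
    high = min(1.0, p + half_width)
    if high - low <= limit:
        return (low + high) / 2
    return None


print(estimate_frequency(50, 1000, limit=0.05))  # narrow range: estimate accepted
print(estimate_frequency(5, 10, limit=0.05))     # wide range: None, sample more
```

Returning `None` for a wide range models the iterative aspect: the caller would draw further samples and retry until the difference falls within the limit.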

    Apparatus and method for linguistic expression processing
    89.
    Granted patent
    Apparatus and method for linguistic expression processing (status: lapsed)

    Publication No.: US4771401A

    Publication date: 1988-09-13

    Application No.: US846366

    Filing date: 1986-03-31

    CPC classification number: G06F17/30666 G06F17/273

    Abstract: An apparatus and method for linguistic expression processing provides features for spelling verification, correction, and dictionary database storage. The system utilizes a linguistically salient word skeleton-forming process to correct both typographic and cognitive spelling errors. The system also uses a suspect expression modification sequence to recognize and correct typographical spelling errors. A linguistic expression database includes a master lexicon having expression blocks arranged in accord with respective collation ranges of skeletons of expressions contained therein. In one preferred embodiment, these linguistically salient word skeletons corresponding to the master lexicon expressions are not retained in the database.
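
Skeleton-based correction can be illustrated with a deliberately simple skeleton rule. The patent's actual skeleton-forming process is not described in this abstract; the rule used here (keep the first letter, drop later vowels, collapse doubled letters) and the sample lexicon are hypothetical.

```python
def skeleton(word):
    """Form a toy skeleton: first letter, then consonants with doubles collapsed."""
    word = word.lower()
    out = [word[0]] if word else []
    for ch in word[1:]:
        if ch in "aeiouy":
            continue  # vowels after the first letter are dropped
        if out and out[-1] == ch:
            continue  # collapse doubled letters
        out.append(ch)
    return "".join(out)


def build_lexicon(words):
    """Arrange lexicon words by skeleton so that look-alikes collate together."""
    blocks = {}
    for w in words:
        blocks.setdefault(skeleton(w), []).append(w)
    return blocks


def suggest(blocks, suspect):
    """Return lexicon words sharing the suspect word's skeleton."""
    return blocks.get(skeleton(suspect), [])


blocks = build_lexicon(["necessary", "separate", "receive"])
print(suggest(blocks, "neccesary"))
print(suggest(blocks, "seperate"))
```

Misspellings that differ from the intended word only in vowels or doubled letters collapse to the same skeleton, so the correct form is found by a single keyed lookup rather than by scanning the lexicon.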

    Analyzing tenant-specific data
    90.
    Granted patent

    Publication No.: US09684712B1

    Publication date: 2017-06-20

    Application No.: US12892069

    Filing date: 2010-09-28

    Inventor: Stephen J. Todd

    Abstract: A method for use in analyzing tenant-specific data is disclosed. First data for a first tenant and second data for a second tenant is stored in a multi-tenant data storage system. A first portion of the first data is selected. Based on the selection, the first portion of the first data is copied to a data store that is specific to the first tenant. Data analysis techniques are applied to the data store.
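
The steps in this abstract (select a portion of one tenant's data, copy it to a tenant-specific store, analyze the copy) can be sketched in a few lines. The record layout, the in-memory list standing in for the multi-tenant storage system, and the counting analysis are all assumptions for illustration.

```python
import copy
from collections import Counter

# Hypothetical shared store holding records for two tenants.
SHARED_STORE = [
    {"tenant": "A", "kind": "invoice", "amount": 120},
    {"tenant": "A", "kind": "log", "amount": 0},
    {"tenant": "B", "kind": "invoice", "amount": 75},
]


def select_portion(store, tenant, predicate):
    """Select the part of one tenant's data that satisfies the predicate."""
    return [r for r in store if r["tenant"] == tenant and predicate(r)]


def copy_to_tenant_store(records):
    """Copy the selection so analysis never touches the shared store."""
    return copy.deepcopy(records)


def analyze(tenant_store):
    """Toy analysis: count records by kind."""
    return Counter(r["kind"] for r in tenant_store)


portion = select_portion(SHARED_STORE, "A", lambda r: r["kind"] == "invoice")
tenant_store = copy_to_tenant_store(portion)
print(analyze(tenant_store))
```

The deep copy is the design point: analysis runs against an isolated, tenant-specific data store, so it cannot read or modify another tenant's records.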
