METHODS FOR OBTAINING IMPROVED TEXT SIMILARITY MEASURES
    1.
    Invention application
    Legal status: lapsed

    Publication number: US20090125805A1

    Publication date: 2009-05-14

    Application number: US11937550

    Filing date: 2007-11-09

    IPC classification: G06F17/24

    Abstract: Embodiments of the invention provide methods for obtaining improved text similarity measures. A method of measuring similarity between at least two electronic documents begins by identifying similar terms between the documents. Similarity between terms is based on patterns, which can include word patterns, letter patterns, numeric patterns, and/or alphanumeric patterns, and multiple pattern types may be identified between the documents. Pattern-based similarity also identifies terms within the documents that fall within a category of a hierarchy: the method consults a hierarchical data tree whose nodes represent terms within the documents, with lower nodes holding specific terms and higher nodes holding general terms.
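The pattern-based matching described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the pattern labels `<NUM>` and `<ALNUM>`, the helper names, and the use of Jaccard overlap as the similarity measure. Tokens that fit a numeric or alphanumeric pattern are replaced by the pattern label before the documents are compared, so terms that differ only in their specific values still count as similar.

```python
import re

# Hypothetical pattern classes, checked in order (labels are illustrative only).
PATTERNS = [
    ("<NUM>", re.compile(r"\d+")),  # purely numeric tokens
    # alphanumeric tokens containing at least one digit and one letter
    ("<ALNUM>", re.compile(r"(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]+")),
]


def tokenize(text):
    """Split text into runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def to_pattern(token):
    """Replace a token by its pattern label, or lowercase it if no pattern fits."""
    for label, regex in PATTERNS:
        if regex.fullmatch(token):
            return label
    return token.lower()


def jaccard(terms_a, terms_b):
    """Jaccard overlap of two term collections (an assumed similarity measure)."""
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pattern_similarity(doc_a, doc_b):
    """Compare two documents after mapping tokens to pattern labels."""
    return jaccard(
        [to_pattern(t) for t in tokenize(doc_a)],
        [to_pattern(t) for t in tokenize(doc_b)],
    )


doc_a = "Error on server srv01 port 8080"
doc_b = "Error on server srv77 port 9090"
print(jaccard(tokenize(doc_a), tokenize(doc_b)))  # overlap on raw tokens
print(pattern_similarity(doc_a, doc_b))           # overlap after pattern replacement
```

In this example the two documents differ only in a host name and a port number, so the raw-token overlap is lower than the overlap computed after pattern replacement.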


    Methods for obtaining improved text similarity measures which replace similar characters with a string pattern representation by using a semantic data tree
    2.
    Granted patent
    Legal status: lapsed

    Publication number: US07945525B2

    Publication date: 2011-05-17

    Application number: US11937550

    Filing date: 2007-11-09

    IPC classification: G06F17/00

    Abstract: Embodiments of the invention provide methods for obtaining improved text similarity measures. A method of measuring similarity between at least two electronic documents begins by identifying similar terms between the documents. Similarity between terms is based on patterns, which can include word patterns, letter patterns, numeric patterns, and/or alphanumeric patterns, and multiple pattern types may be identified between the documents. Pattern-based similarity also identifies terms within the documents that fall within a category of a hierarchy: the method consults a hierarchical data tree whose nodes represent terms within the documents, with lower nodes holding specific terms and higher nodes holding general terms.
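The hierarchical data tree mentioned in the abstract, with specific terms at lower nodes and general terms at higher nodes, might be sketched as follows. The tree contents, the child-to-parent dictionary representation, and the function names are all hypothetical choices for illustration; the patent text quoted here does not specify them.

```python
# Hypothetical child -> parent links; lower nodes are specific, higher nodes general.
PARENT = {
    "poodle": "dog",
    "beagle": "dog",
    "siamese": "cat",
    "dog": "animal",
    "cat": "animal",
}


def ancestors(term):
    """Return the path from a term up to its root, starting with the term itself."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def generalize(term, levels=1):
    """Replace a term by the node `levels` steps higher, stopping at the root."""
    path = ancestors(term)
    return path[min(levels, len(path) - 1)]


def lowest_common_category(term_a, term_b):
    """Return the most specific node shared by both terms, or None if there is none."""
    seen = set(ancestors(term_a))
    for node in ancestors(term_b):
        if node in seen:
            return node
    return None


print(generalize("poodle"))                         # one level up
print(lowest_common_category("poodle", "beagle"))   # shared specific category
print(lowest_common_category("poodle", "siamese"))  # shared general category
```

Under this sketch, two documents that mention different breeds could be treated as similar once both terms are replaced by their shared category, and how far up the tree to generalize is a tuning decision.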
