Methods and apparatuses for intra-document reference identification and resolution
    1.
    Type: Granted patent
    Status: Active

    Publication number: US08352857B2

    Publication date: 2013-01-08

    Application number: US12258627

    Filing date: 2008-10-27

    CPC classification number: G06F17/2235

    Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.
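
The pairing and associating steps in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the regular expressions, the rule that an object fragment starts with "Type N:" or "Type N.", and the names `resolve_references`, `MENTION_RE` and `OBJECT_RE` are all assumptions made for the example.

```python
import re

# Hypothetical surface forms: an object type word followed by a reference number.
TYPE = r"(Figure|Fig\.|Table|Section)"
MENTION_RE = re.compile(r"\b" + TYPE + r"\s+(\d+)")
# Assumption: an object fragment (e.g. a caption) starts with "Type N:" or "Type N.".
OBJECT_RE = re.compile(TYPE + r"\s+(\d+)\s*[:.]")


def profile(match):
    """Build a reference profile: (object type identifier, reference number)."""
    obj_type = "Figure" if match.group(1).startswith("Fig") else match.group(1)
    return (obj_type, int(match.group(2)))


def resolve_references(fragments):
    """Map each referencing fragment index to the object fragment indices it cites."""
    # Pairing step: profile -> index of the first fragment that starts with it.
    objects = {}
    for i, text in enumerate(fragments):
        m = OBJECT_RE.match(text)
        if m:
            objects.setdefault(profile(m), i)
    # Associating step: any other fragment that mentions a paired profile.
    links = {}
    for i, text in enumerate(fragments):
        for m in MENTION_RE.finditer(text):
            target = objects.get(profile(m))
            if target is not None and target != i:
                links.setdefault(i, []).append(target)
    return links
```

In this sketch a mention with no matching object fragment is simply left unresolved.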

    Method and apparatus for structuring documents utilizing recognition of an ordered sequence of identifiers
    2.
    Type: Granted patent
    Status: Active

    Publication number: US07991709B2

    Publication date: 2011-08-02

    Application number: US12020743

    Filing date: 2008-01-28

    CPC classification number: G06F17/211

    Abstract: A method is provided for operating a computing device to create a document structure model of a computer parsable text document utilizing recognition of at least one ordered sequence of identifiers in the document. The method includes converting a computer parsable text document of any format to an alternative structured language format to form a converted document. The text of the converted document is fragmented into an ordered sequence of text fragments within a text format. The text fragments are enumerated to obtain a sequence of terms. At least one optimal sub-sequence of terms is identified from among the sequence of terms, with an optimal sub-sequence being one or more longest increasing sub-sequence(s). The computer parsable text document is annotated with tags, with the tags including information derived from identification of the optimal sub-sequence(s). The annotated document is displayed on the graphical user interface.
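
The core step, finding a longest increasing sub-sequence among the enumerated terms, can be sketched as below. This is an illustration under assumptions: terms are taken to be integers at the start of a fragment, and the names `enumerate_terms` and `longest_increasing_subsequence` are hypothetical. The patent's own enumeration and tagging steps are not reproduced here.

```python
import re
from bisect import bisect_left

# Assumption: a term is an integer at the start of a fragment, e.g. "3 Method".
LEADING_NUMBER = re.compile(r"\s*(\d+)[.)]?\s")


def enumerate_terms(fragments):
    """Return (fragment index, number) for each fragment starting with a number."""
    terms = []
    for i, text in enumerate(fragments):
        m = LEADING_NUMBER.match(text)
        if m:
            terms.append((i, int(m.group(1))))
    return terms


def longest_increasing_subsequence(values):
    """Return indices into `values` of one longest strictly increasing run."""
    tails = []      # tails[k]: smallest tail value of an increasing run of length k+1
    tail_idx = []   # tail_idx[k]: position in `values` of tails[k]
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tail_idx.append(i)
        else:
            tails[k] = v
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the tail of the longest run.
    out = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


fragments = ["1 Introduction", "2 Background", "1999 was a busy year",
             "3 Method", "12 items were tested", "4 Results"]
terms = enumerate_terms(fragments)
picked = longest_increasing_subsequence([n for _, n in terms])
headings = [terms[j][0] for j in picked]   # fragment indices 0, 1, 3, 5
```

The stray numbers 1999 and 12 fall outside the longest increasing run, so only the fragments numbered 1 to 4 are kept as candidate structural identifiers.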

    Systems and methods for converting legacy and proprietary documents into extended mark-up language format
    3.
    Type: Granted patent
    Status: Lapsed

    Publication number: US07165216B2

    Publication date: 2007-01-16

    Application number: US10756313

    Filing date: 2004-01-14

    CPC classification number: G06F17/30914 G06F17/227

    Abstract: A system and method convert legacy and proprietary documents into extended mark-up language format by treating the conversion as transforming ordered trees of one schema and/or model into ordered trees of another schema and/or model. In embodiments, the tree transformers are coded using a learning method that decomposes the converting task into three components: path re-labeling, structural composition and input tree traversal, each of which involves learning approaches. The transformation of an input tree into an output tree may involve labeling components in the input tree with valid labels or paths from a particular output schema, composing the labeled elements into the output tree with a valid structure, and finding a traversal of the input tree that achieves the correct composition of the output tree and applies structural rules.

    Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
    4.

    Publication number: US20060156226A1

    Publication date: 2006-07-13

    Application number: US11032817

    Filing date: 2005-01-10

    CPC classification number: G06F17/217 G06F17/2745

    Abstract: A method is provided for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines composed of text blocks, including the different kinds of text blocks within each line, is analyzed. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
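
The low-variability idea can be illustrated with a simplified sketch. It assumes each page is already a list of text lines, measures variability as the share of distinct normalized texts at the same line position across pages, and uses hypothetical names and thresholds (`header_footer_depth`, `max_depth`, `threshold`). The abstract's analysis of text-block kinds within a line is not modeled.

```python
import re


def normalize(line):
    """Collapse digit runs so that changing page numbers add no variability."""
    return re.sub(r"\d+", "#", line.strip().lower())


def variability(lines):
    """Distinct normalized texts divided by the number of lines (1/n up to 1)."""
    return len({normalize(line) for line in lines}) / len(lines)


def header_footer_depth(pages, max_depth=3, threshold=0.5):
    """Return (header_lines, footer_lines): top/bottom line counts that repeat."""
    def depth(pick):
        d = 0
        for k in range(max_depth):
            rows = [pick(p, k) for p in pages if len(p) > k]
            # Stop when a page is too short or the position varies too much.
            if len(rows) < len(pages) or variability(rows) > threshold:
                break
            d += 1
        return d

    return depth(lambda p, k: p[k]), depth(lambda p, k: p[-1 - k])


pages = [
    ["Annual Report 2008", "Introduction", "This year was good.", "Page 1 of 3"],
    ["Annual Report 2008", "Sales grew in Europe.", "Costs fell.", "Page 2 of 3"],
    ["Annual Report 2008", "Outlook", "We expect growth.", "Page 3 of 3"],
]
zones = header_footer_depth(pages)   # (1, 1): one header line, one footer line
```

Normalizing digits is a design choice of this sketch: it lets "Page 1 of 3" and "Page 2 of 3" count as the same footer text.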

    TYPOGRAPHICAL BLOCK GENERATION
    5.
    Type: Patent application
    Status: Under examination (published)

    Publication number: US20130321867A1

    Publication date: 2013-12-05

    Application number: US13484708

    Filing date: 2012-05-31

    Applicant: Herve Dejean

    Inventor: Herve Dejean

    CPC classification number: G06F17/211

    Abstract: Embodiments of a computer-implemented method are disclosed for grouping one or more token elements comprising one or more characters in an input file. The method comprises computing a first leading distance between a first baseline of a first token element and a second baseline of a second token element. The method further comprises defining a block with the first token element and the second token element, and characterizing the first leading distance as a leading distance of the block. The method further comprises computing a second leading distance between the second baseline and a third baseline of a third token element. The method further comprises grouping the third token element into the block when a first difference between the second leading distance and the leading distance of the block lies within a first predefined threshold value.
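
A minimal sketch of the grouping rule follows. It assumes each token element is represented only by its baseline y-coordinate, listed top to bottom; the function name `group_by_leading` and the default threshold are hypothetical.

```python
def group_by_leading(baselines, threshold=1.0):
    """Group token indices into blocks with a consistent baseline-to-baseline gap."""
    blocks = []
    current = []     # indices of the block being built
    leading = None   # leading distance of the current block, once known
    for i, y in enumerate(baselines):
        if not current:
            current = [i]
        elif leading is None:
            # The first two tokens define the block and its leading distance.
            leading = y - baselines[current[-1]]
            current.append(i)
        elif abs((y - baselines[current[-1]]) - leading) <= threshold:
            # Next gap matches the block's leading within the threshold.
            current.append(i)
        else:
            # Gap differs too much: close the block and start a new one.
            blocks.append(current)
            current, leading = [i], None
    if current:
        blocks.append(current)
    return blocks


blocks = group_by_leading([100, 112, 124, 160, 172, 184])
# blocks == [[0, 1, 2], [3, 4, 5]]
```

As in the abstract, the first two tokens of a block are always paired, so this sketch cannot by itself detect a block of a single line followed by another block.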

    METHODS AND APPARATUSES FOR INTRA-DOCUMENT REFERENCE IDENTIFICATION AND RESOLUTION
    6.
    Type: Patent application
    Status: Active

    Publication number: US20100107045A1

    Publication date: 2010-04-29

    Application number: US12258627

    Filing date: 2008-10-27

    CPC classification number: G06F17/2235

    Abstract: Reference identification and resolution identifies reference text fragments in a document and associates referenced object text fragments in the document with the identified reference text fragments. Reference profiles are abstracted from the document. Each reference profile specifies at least a reference number and an object type identifier. A reference profile is paired with an object text fragment of the document containing the reference number of the reference profile. The pairing is repeated to associate reference profiles with object text fragments. A reference text fragment of the document satisfying one of the reference profiles is associated with the object text fragment paired with the satisfied reference profile. The associating is repeated to associate reference text fragments of the document with object text fragments.

    Versatile page number detector
    7.
    Type: Patent application
    Status: Active

    Publication number: US20080114757A1

    Publication date: 2008-05-15

    Application number: US11599947

    Filing date: 2006-11-15

    CPC classification number: G06F17/30569 G06K9/00469

    Abstract: A method for detection of page numbers in a document includes identifying a plurality of text fragments associated with a plurality of pages of a document. From the identified text fragments, at least one sequence is identified. Each identified sequence includes a plurality of terms. Each term of the sequence is derived from a text fragment selected from the plurality of text fragments. The terms of an identified sequence comply with at least one predefined numbering scheme which defines a form and an incremental state of the terms in a sequence. A subset of the identified sequences which cover at least some of the pages of the document is computed. Terms of at least some of the subset of the identified sequences are construed as page numbers of pages of the document. Additional page numbers may be identified by considering one or more features of the terms in the subset of identified sequences.
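
A much reduced sketch of the sequence idea is shown below. It assumes a single numbering scheme (Arabic numerals that increase by one per page) and keeps only the single longest run, whereas the abstract describes several schemes and a subset of sequences; `detect_page_numbers` is a hypothetical name.

```python
import re


def detect_page_numbers(pages):
    """pages: list of lists of fragment strings. Return {page index: page number}."""
    # Candidate terms: fragments that consist of digits only.
    candidates = [
        {int(f) for f in frags if re.fullmatch(r"\d+", f.strip())}
        for frags in pages
    ]
    best = {}
    # Try each candidate as the start of a run that grows by 1 on each next page.
    for start, nums in enumerate(candidates):
        for n in nums:
            run = {}
            p, v = start, n
            while p < len(pages) and v in candidates[p]:
                run[p] = v
                p += 1
                v += 1
            if len(run) > len(best):
                best = run
    return best


pages = [["Title", "2008"], ["Intro", "1", "42"], ["Body", "2"],
         ["More", "3", "7"], ["End", "4"]]
numbers = detect_page_numbers(pages)   # {1: 1, 2: 2, 3: 3, 4: 4}
```

Isolated numbers such as 2008, 42 and 7 form runs of length one and lose to the four-page run.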

    Captions detector
    8.
    Type: Patent application
    Status: Active

    Publication number: US20080077847A1

    Publication date: 2008-03-27

    Application number: US11528261

    Filing date: 2006-09-27

    Applicant: Herve Dejean

    Inventor: Herve Dejean

    CPC classification number: G06F17/2745

    Abstract: To detect captions in a document that includes text fragments and objects of interest, a signature is assigned to each text fragment. The signature is the value for that text fragment of a text fragment representation comprising at least one text fragment attribute. A caption signature is identified as a signature assigned to a substantial number of text fragments that are near at least one object of interest in the document. One or more captions are detected as one or more text fragments each assigned a caption signature.
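
The signature approach can be sketched as follows. The choices here are assumptions for illustration only: the signature is the pair (font, size), "near" means within `max_dist` vertical units on the same page, and "a substantial number" is read as at least `min_share` of the fragments carrying that signature. The `Fragment` class and `detect_captions` function are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    font: str
    size: float
    page: int
    y: float


def signature(fragment):
    """Signature: the fragment's value for the chosen attributes."""
    return (fragment.font, fragment.size)


def detect_captions(fragments, objects, max_dist=30.0, min_share=0.5):
    """objects: (page, y) positions of objects of interest such as images."""
    def near(f):
        return any(p == f.page and abs(y - f.y) <= max_dist for p, y in objects)

    total = Counter(signature(f) for f in fragments)
    near_counts = Counter(signature(f) for f in fragments if near(f))
    # A caption signature is one mostly found next to objects of interest.
    caption_sigs = {s for s, n in near_counts.items() if n / total[s] >= min_share}
    return [f for f in fragments if signature(f) in caption_sigs]


fragments = [
    Fragment("Body text A", "Times", 10.0, 1, 100.0),
    Fragment("Body text B", "Times", 10.0, 1, 300.0),
    Fragment("Figure 1: Pipeline", "Arial", 8.0, 1, 340.0),
    Fragment("Body text C", "Times", 10.0, 2, 100.0),
    Fragment("Figure 2: Results", "Arial", 8.0, 2, 520.0),
    Fragment("Body text D", "Times", 10.0, 2, 700.0),
]
captions = detect_captions(fragments, objects=[(1, 320.0), (2, 500.0)])
```

Body text that happens to sit next to an image ("Body text B") is not reported, because most fragments with its signature are far from any object.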

    Method and apparatus for structuring documents based on layout, content and collection
    9.
    Type: Patent application
    Status: Lapsed

    Publication number: US20060155700A1

    Publication date: 2006-07-13

    Application number: US11033016

    Filing date: 2005-01-10

    CPC classification number: G06F17/30914

    Abstract: A method and apparatus are provided for converting a document in a first format essentially comprising a flat layout structure into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to common textual properties and organizational content common to documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format. The document content elements are organized by hierarchical dependency from the predetermined categories, so that the organized document elements comprise the desired structured document format.

    Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
    10.
    Type: Granted patent
    Status: Active

    Publication number: US09218326B2

    Publication date: 2015-12-22

    Application number: US13032996

    Filing date: 2011-02-23

    CPC classification number: G06F17/217 G06F17/2745

    Abstract: A method is provided for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks as derived from the document. The textual variability of lines composed of text blocks, including the different kinds of text blocks within each line, is analyzed. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text-boxes for similarity and proximity and clustering the text boxes satisfying a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs.
