SYSTEMS AND METHODS TO AUTOMATICALLY CLASSIFY ELECTRONIC DOCUMENTS USING EXTRACTED IMAGE AND TEXT FEATURES AND USING A MACHINE LEARNING SUBSYSTEM
    1.
    Invention application
    Status: Under examination (published)

    Publication number: US20090116736A1

    Publication date: 2009-05-07

    Application number: US12266462

    Filing date: 2008-11-06

    CPC classification number: G06K9/00442 G06K9/6885

    Abstract: A document analysis system that automatically classifies documents by recognizing distinctive features in each document comprises a document acquisition system, a document recognition training system, a document classification system, a document recognition system, and a job organization system. The document acquisition system receives jobs, wherein each job contains at least one electronic document. The document feature recognition system automatically extracts image and text features from each received document. The document classification system automatically classifies recognized electronic documents by finding the best match between the extracted features of each document and the feature sets associated with each category of document. The document recognition training system automatically trains the feature set for each corresponding category of documents, wherein the training system uses the extracted features of unrecognized documents to automatically modify the feature set for a document category. The job organization system automatically organizes each job according to the document categories it contains.
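
    The best-match classification and feature-set training described in this abstract can be sketched roughly as follows. This is an illustration only, not the patented method: the Jaccard overlap score, the 0.5 threshold, and the names `best_category` and `classify_or_train` are all assumptions introduced here, and features are modeled as plain sets of strings.

```python
def best_category(features, category_feature_sets):
    """Return (category, score) for the feature set with the highest overlap.

    Overlap is measured with the Jaccard index, which is an assumption;
    the abstract only says "best match".
    """
    best, best_score = None, 0.0
    for category, feature_set in category_feature_sets.items():
        union = features | feature_set
        score = len(features & feature_set) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = category, score
    return best, best_score


def classify_or_train(features, category_feature_sets, threshold=0.5, label=None):
    """Classify a document, or fold its features into a category's set.

    If no category matches well enough, the document is treated as
    unrecognized; when a label is supplied, its features modify that
    category's feature set (the "training" step in this sketch).
    """
    category, score = best_category(features, category_feature_sets)
    if category is not None and score >= threshold:
        return category
    if label is not None:
        category_feature_sets.setdefault(label, set()).update(features)
    return label


# Hypothetical feature sets for two document categories.
feature_sets = {
    "invoice": {"invoice", "total", "logo:acme"},
    "w2": {"wages", "employer", "form:w2"},
}
recognized = classify_or_train({"invoice", "total", "due"}, feature_sets)
trained = classify_or_train({"receipt", "cash"}, feature_sets, label="receipt")
```

    In this sketch the first document shares two of four distinct features with the invoice set and is classified directly, while the second matches nothing and creates a new feature set under the supplied label.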

    Systems and methods for automatically processing electronic documents
    2.
    Granted invention patent
    Status: In force

    Publication number: US08897563B1

    Publication date: 2014-11-25

    Application number: US14064935

    Filing date: 2013-10-28

    CPC classification number: G06K9/00442 G06K9/48 G06K9/72 G06K2209/01

    Abstract: In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents, a method is provided for automatically pre-processing each received electronic document using a plurality of image transformation algorithms to improve subsequent data extraction from the document. The method includes: electronically partitioning each received electronic document page into pieces; automatically processing each piece of the received electronic document page using each of a plurality of image pre-processing algorithms to produce a plurality of image variations of each piece; and analyzing the outputs of subsequent processing and data extraction on each of the image variations of the pieces to determine, from the plurality of outputs for each piece, which output is best.
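
    The partition, transform, and select steps in this abstract can be illustrated with a small sketch. Everything below is hypothetical: a page is modeled as a 2-D list of grayscale values, the three transforms are toy stand-ins for real image pre-processing algorithms, and `toy_extract` replaces the downstream extraction step with a made-up score (the fraction of pure black or white pixels).

```python
def partition(page, rows, cols):
    """Split a 2-D list of pixels into rows x cols rectangular pieces."""
    h, w = len(page), len(page[0])
    pieces = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * h // rows, (r + 1) * h // rows
            left, right = c * w // cols, (c + 1) * w // cols
            pieces.append([row[left:right] for row in page[top:bottom]])
    return pieces


def identity(piece):
    return [row[:] for row in piece]


def invert(piece):
    return [[255 - p for p in row] for row in piece]


def binarize(piece, threshold=128):
    return [[255 if p >= threshold else 0 for p in row] for row in piece]


def toy_extract(piece):
    """Stand-in for extraction: returns (dark pixel count, quality score)."""
    pixels = [p for row in piece for p in row]
    dark = sum(1 for p in pixels if p < 128)
    score = sum(1 for p in pixels if p in (0, 255)) / len(pixels)
    return dark, score


def best_variation(piece, transforms, extract):
    """Apply every transform, run extraction on each, keep the top score."""
    results = []
    for transform in transforms:
        output, score = extract(transform(piece))
        results.append((score, transform.__name__, output))
    return max(results, key=lambda r: r[0])


page = [[10, 200, 0, 255],
        [30, 220, 0, 255]]
pieces = partition(page, rows=1, cols=2)
choices = [best_variation(p, [identity, invert, binarize], toy_extract)
           for p in pieces]
```

    Selecting per piece rather than per page lets different regions of the same page end up with different pre-processing, which is the point the abstract emphasizes.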

    SYSTEMS AND METHODS FOR AUTOMATICALLY EXTRACTING DATA FROM ELECTRONIC DOCUMENTS USING MULTIPLE CHARACTER RECOGNITION ENGINES
    4.
    Invention application
    Status: Under examination (published)

    Publication number: US20110255784A1

    Publication date: 2011-10-20

    Application number: US13007434

    Filing date: 2011-01-14

    CPC classification number: G06K9/00442 G06K9/48 G06K9/72 G06K2209/01

    Abstract: In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents, a method is provided for automatically extracting data from each received electronic document using a plurality of character recognition engines. The method includes: automatically processing each received electronic document page using each of a plurality of recognition engines to extract data; comparing the quality of the data extracted by each of the recognition engines to assign a confidence score to the extracted data; and selecting the extracted data having the highest confidence score as the correct extracted data.
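
    A minimal sketch of the multi-engine selection step follows. The abstract does not say how the confidence score is computed, so this example assumes a simple agreement vote: an engine's output scores higher when more engines produce the same text. The engine callables and the name `select_best_extraction` are hypothetical.

```python
from collections import Counter


def select_best_extraction(page, engines):
    """Run every engine on the page and pick the highest-confidence output.

    Confidence here is the fraction of engines that returned the same
    text, which is an assumed scoring rule, not one given in the abstract.
    Returns (text, confidence, per-engine scores).
    """
    if not engines:
        raise ValueError("at least one recognition engine is required")
    outputs = {name: engine(page) for name, engine in engines.items()}
    votes = Counter(outputs.values())
    scores = {name: votes[text] / len(outputs) for name, text in outputs.items()}
    best_engine = max(scores, key=scores.get)
    return outputs[best_engine], scores[best_engine], scores


# Hypothetical engines: two agree, one misreads two characters.
engines = {
    "engine_a": lambda page: "Total: 42",
    "engine_b": lambda page: "Total: 42",
    "engine_c": lambda page: "Tota1: 4Z",
}
text, confidence, per_engine = select_best_extraction("page-image", engines)
```

    A production system would more likely combine per-character confidences reported by each engine; the vote above only illustrates the compare-then-select shape of the method.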

    Systems and methods for automatically processing electronic documents using multiple image transformation algorithms
    5.
    Granted invention patent
    Status: In force

    Publication number: US08571317B2

    Publication date: 2013-10-29

    Application number: US13007452

    Filing date: 2011-01-14

    CPC classification number: G06K9/00442 G06K9/48 G06K9/72 G06K2209/01

    Abstract: In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to extract data from the electronic documents, a method is provided for automatically pre-processing each received electronic document using a plurality of image transformation algorithms to improve subsequent data extraction from the document. The method includes: electronically partitioning each received electronic document page into pieces; automatically processing each piece of the received electronic document page using each of a plurality of image pre-processing algorithms to produce a plurality of image variations of each piece; and analyzing the outputs of subsequent processing and data extraction on each of the image variations of the pieces to determine, from the plurality of outputs for each piece, which output is best.

    SYSTEMS AND METHODS FOR TRAINING DOCUMENT ANALYSIS SYSTEM FOR AUTOMATICALLY EXTRACTING DATA FROM DOCUMENTS
    6.
    Invention application
    Status: Under examination (published)

    Publication number: US20110258150A1

    Publication date: 2011-10-20

    Application number: US13007430

    Filing date: 2011-01-14

    CPC classification number: G06K9/00442 G06K9/48 G06K9/72 G06K2209/01

    Abstract: A method of training a document analysis system to extract data from documents is provided. The method includes: automatically analyzing image and text features extracted from a document to associate the document with a corresponding document category; comparing the extracted text features with a set of text features associated with the corresponding category of the document, in which the set of text features includes a set of characters, words, and phrases; if the extracted features are found to consist of the characters, words, and phrases belonging to the set of text features associated with the corresponding document category, storing the extracted text features as the data contained in the corresponding document; and, if the extracted text features are found to include at least one text feature that does not belong to the set of text features associated with the corresponding document category, submitting the unrecognized text features to a training phase.
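
    The store-or-train decision in this abstract amounts to a subset test, sketched below under simplifying assumptions: text features are plain strings held in sets, and the function name `route_extracted_features` and the returned dictionary layout are invented for illustration.

```python
def route_extracted_features(extracted, known_features):
    """Decide whether extracted text features are stored or sent to training.

    If every extracted feature belongs to the category's known set, the
    features are stored as the document's data. Otherwise the features
    that are not in the known set are submitted to a training phase.
    """
    unrecognized = extracted - known_features
    if not unrecognized:
        return {"action": "store", "data": sorted(extracted)}
    return {"action": "train", "unrecognized": sorted(unrecognized)}


# Hypothetical known text features for one document category.
known = {"wages", "employer", "federal income tax withheld"}
stored = route_extracted_features({"wages", "employer"}, known)
to_train = route_extracted_features({"wages", "emp1oyer"}, known)
```

    Here the misread token "emp1oyer" is the only feature outside the known set, so it alone is routed to training.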
