Determination of inputted image to be document or non-document
    1.
    Type: granted invention patent
    Status: in force

    Publication No.: US08385643B2

    Publication date: 2013-02-26

    Application No.: US12353440

    Filing date: 2009-01-14

    IPC classification: G06K9/00 G06K9/54 G06F17/00

    Abstract: A preprocessing section binarizes input image data and calculates a total black pixel ratio. A feature extracting section detects connected components included in the binary image data and detects circumscribing bounding boxes of the connected components. Predetermined connected components are removed from all of the connected components based on the sizes of the detected circumscribing bounding boxes and bounding box black pixel ratios. By using the connected components that remain after removing the unnecessary connected components, a histogram is generated by specifying the sizes of the circumscribing bounding boxes as classes and numbers of the connected components as the frequencies of occurrence. A determining section determines whether the input image data is document image data or non-document image data based on information related to the generated histogram and the total black pixel ratio.
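
    For illustration only, the pipeline described in this abstract might be sketched in Python as follows. The 8-connectivity, every threshold value, the size-class width, and the final decision rule in `is_document` are assumptions made for this sketch; the abstract does not specify them.

```python
from collections import Counter, deque

def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = black, 0 = white."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def black_ratio(binary):
    """Total black pixel ratio of the binary image."""
    total = sum(len(row) for row in binary)
    return sum(map(sum, binary)) / total if total else 0.0

def connected_components(binary):
    """Return (bounding_box, pixel_count) for each 8-connected black component."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right, count = y, x, y, x, 0
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                count += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(((top, left, bottom, right), count))
    return comps

def size_histogram(comps, min_side=2, max_side=40, min_fill=0.1, bin_width=4):
    """Drop unwanted components, then count the rest per box-size class."""
    hist = Counter()
    for (top, left, bottom, right), count in comps:
        box_h, box_w = bottom - top + 1, right - left + 1
        side = max(box_h, box_w)
        fill = count / (box_h * box_w)  # bounding-box black pixel ratio
        if min_side <= side <= max_side and fill >= min_fill:
            hist[side // bin_width] += 1
    return hist

def is_document(gray, max_black=0.5, min_components=3, min_peak_share=0.5):
    """Assumed rule: text yields many similarly sized components and little ink."""
    binary = binarize(gray)
    hist = size_histogram(connected_components(binary))
    kept = sum(hist.values())
    if kept < min_components or black_ratio(binary) > max_black:
        return False
    return max(hist.values()) / kept >= min_peak_share
```

    The intuition behind the assumed rule is that printed text produces many components whose bounding boxes fall into one or two size classes, whereas a photograph tends to produce a few very large or very irregular components.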

    IMAGE DETERMINATION APPARATUS, IMAGE SEARCH APPARATUS AND A RECORDING MEDIUM ON WHICH AN IMAGE SEARCH PROGRAM IS RECORDED
    2.
    Type: invention patent application
    Status: in force

    Publication No.: US20090245640A1

    Publication date: 2009-10-01

    Application No.: US12353440

    Filing date: 2009-01-14

    IPC classification: G06K9/34

    Abstract: A preprocessing section binarizes input image data and calculates a total black pixel ratio. A feature extracting section detects connected components included in the binary image data and detects circumscribing bounding boxes of the connected components. Predetermined connected components are removed from all of the connected components based on the sizes of the detected circumscribing bounding boxes and bounding box black pixel ratios. By using the connected components that remain after removing the unnecessary connected components, a histogram is generated by specifying the sizes of the circumscribing bounding boxes as classes and numbers of the connected components as the frequencies of occurrence. A determining section determines whether the input image data is document image data or non-document image data based on information related to the generated histogram and the total black pixel ratio.

    Document image processing apparatus
    3.
    Type: granted invention patent
    Status: in force

    Publication No.: US08160402B2

    Publication date: 2012-04-17

    Application No.: US11972477

    Filing date: 2008-01-10

    IPC classification: G06K9/03 G06K9/18

    Abstract: An image of a character string composed of M pieces of characters is clipped from a document image, and the image is divided character by character, and image features of each character image are extracted. On the basis of the image features, N (N>1, integer) pieces of character images in descending order of degree of similarity are selected as candidate characters from a character image feature dictionary which stores the image features of character image in units of character, and the first index matrix of M×N cells is prepared. A candidate character string composed of a plurality of candidate characters constituting the first column of the first index matrix, is subjected to a lexical analysis according to a predetermined language model, whereby a second index matrix adjusted into a character string which makes sense is prepared to be utilized for searching.
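
    As a rough illustration of the M×N index matrix idea, the Python sketch below ranks dictionary characters by feature similarity and then adjusts the first column so that it spells a meaningful string. Euclidean distance as the similarity measure, a plain word set standing in for the language model, and the function names are all assumptions of this sketch, not details taken from the patent.

```python
from itertools import product
from math import dist

def first_index_matrix(char_features, dictionary, n):
    """Row i lists the n dictionary characters most similar to character i."""
    matrix = []
    for feat in char_features:
        # Smaller distance means higher similarity (assumed measure).
        ranked = sorted(dictionary, key=lambda ch: dist(feat, dictionary[ch]))
        matrix.append(ranked[:n])
    return matrix

def second_index_matrix(matrix, lexicon):
    """Promote candidates so the first column spells a lexicon word, if any.

    Tries every combination of one candidate per row (N**M combinations),
    which is only practical for the short strings of a toy example.
    """
    for combo in product(*matrix):
        if "".join(combo) in lexicon:
            return [[ch] + [c for c in row if c != ch]
                    for row, ch in zip(matrix, combo)]
    return [row[:] for row in matrix]  # nothing matched: keep the ranking


# Hypothetical 2-D features for six dictionary characters.
dictionary = {"c": (0.0, 0.0), "e": (0.2, 0.0), "a": (1.0, 0.0),
              "o": (1.2, 0.0), "t": (3.0, 3.0), "l": (3.3, 3.0)}
# Noisy features extracted from an image of the word "cat".
features = [(0.15, 0.0), (1.05, 0.0), (3.1, 3.0)]

first = first_index_matrix(features, dictionary, n=2)
second = second_index_matrix(first, lexicon={"cat", "dog"})
```

    In this toy run the first column of `first` spells "eat" because the noisy first glyph is closer to "e"; the lexicon lookup then promotes "c", so the first column of `second` spells "cat".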

    Search and retrieval of documents indexed by optical character recognition
    4.
    Type: granted invention patent
    Status: in force

    Publication No.: US08208765B2

    Publication date: 2012-06-26

    Application No.: US11972446

    Filing date: 2008-01-10

    IPC classification: G06K9/00

    Abstract: An image of a character string composed of M pieces of characters is clipped from a document image, and the image is divided into separate characters. Image features of each character image are extracted. Based on the image features, N (N>1, integer) pieces of character images in descending order of degree of similarity are selected as candidate characters, from a character image feature dictionary which stores the image features of character image in units of character, and a first index matrix of M×N cells is prepared. A candidate character string composed of a plurality of candidate characters constituting a first column of the first index matrix, is subjected to a lexical analysis according to a language model, and whereby a second index matrix having a character string which makes sense is prepared. In the language model, statistics are taken and then, the lexical analysis is performed.

    CHARACTER IMAGE EXTRACTING APPARATUS AND CHARACTER IMAGE EXTRACTING METHOD
    5.
    Type: invention patent application
    Status: in force

    Publication No.: US20090028435A1

    Publication date: 2009-01-29

    Application No.: US11963613

    Filing date: 2007-12-21

    IPC classification: G06K9/46

    Abstract: In an extracting step, the extracting portion obtains a linked component composed of a plurality of mutually linking pixels from a character string region composed of a plurality of characters, and extracts section elements from the character string region, the section elements each being surrounded by a circumscribing figure circumscribing to the linked component. In the first altering step, the first altering portion combines section elements at least having a mutually overlapping part among the extracted section elements so as to prepare a new section element. In the first selecting step, the first selecting portion determines a reference size in advance and selects section elements having a size greater than the reference size, from among the section elements altered in the first altering step.
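
    The first altering and first selecting steps might look roughly like the Python sketch below. It assumes that section elements are axis-aligned boxes given as (left, top, right, bottom) in inclusive pixel coordinates and that "size" means the longer side of the box; neither assumption comes from the abstract, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True when two (left, top, right, bottom) boxes share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def merge_overlapping(boxes):
    """Combine mutually overlapping boxes until no two results overlap."""
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:  # a grown box may start overlapping further boxes
            changed = False
            for other in merged:
                if overlaps(box, other):
                    merged.remove(other)
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break  # restart the scan with the enlarged box
        merged.append(box)
    return merged

def select_large(boxes, reference_size):
    """Keep boxes whose longer side exceeds the reference size (assumed metric)."""
    return [b for b in boxes
            if max(b[2] - b[0] + 1, b[3] - b[1] + 1) > reference_size]


# Two overlapping strokes, one separate character, one small speck.
elements = [(0, 0, 4, 9), (3, 2, 8, 9), (12, 0, 18, 9), (20, 8, 21, 9)]
altered = merge_overlapping(elements)
selected = select_large(altered, reference_size=5)
```

    Here the first two boxes are combined into one new section element, and the 2×2 speck is dropped by the size filter.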

    Information processing device, information processing system, information processing method, program, and storage medium
    6.
    Type: invention patent application
    Status: under examination (published)

    Publication No.: US20080244378A1

    Publication date: 2008-10-02

    Application No.: US12002671

    Filing date: 2007-12-18

    IPC classification: G06F17/21

    CPC classification: G06K9/033 G06K9/00456

    Abstract: An information processing device includes: a feature extracting section for extracting, as format information, a format feature of a process-target document from image data of the process-target document, on which filling-in spaces of plural items are printed; a document recognizing section for comparing the format information of the process-target document with registered format information stored in a storage device, and specifying a registered document that corresponds to the process-target document, the registered format information regarding format features of registered documents; a data acquiring section for converting characters in the image data of the process-target document into text data; and a distributing section for grouping the image data and text data of the characters into plural groups according to a separation rule that is set for the registered document, the characters being written in the fill-in spaces of the items of the process-target document, and for transmitting the different groups to different external devices. With this, information such as personal information to be protected can be processed, preventing an operator dealing with the information from obtaining the whole information.
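
    The distributing section could be pictured with the small Python sketch below. The representation of a separation rule as a mapping from item name to destination device, the device names, and the sample form are hypothetical; the abstract only states that groups are formed per registered document and sent to different external devices.

```python
def distribute(fields, separation_rule):
    """Group recognized form items by the destination the rule assigns them."""
    groups = {}
    for item, value in fields.items():
        destination = separation_rule[item]  # KeyError flags an unmapped item
        groups.setdefault(destination, {})[item] = value
    return groups


# Hypothetical rule registered for one form layout.
rule = {"name": "terminal_a", "address": "terminal_b", "phone": "terminal_a"}
form = {"name": "J. Smith", "address": "1 Main St", "phone": "555-0100"}

batches = distribute(form, rule)
```

    Because the name and the address end up at different terminals, no single operator sees the complete record, which is the protective effect the abstract describes.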

    Character image extracting apparatus and character image extracting method
    7.
    Type: granted invention patent
    Status: in force

    Publication No.: US08750616B2

    Publication date: 2014-06-10

    Application No.: US11963613

    Filing date: 2007-12-21

    IPC classification: G06K9/34

    Abstract: In an extracting step, the extracting portion obtains a linked component composed of a plurality of mutually linking pixels from a character string region composed of a plurality of characters, and extracts section elements from the character string region, the section elements each being surrounded by a circumscribing figure circumscribing to the linked component. In the first altering step, the first altering portion combines section elements at least having a mutually overlapping part among the extracted section elements so as to prepare a new section element. In the first selecting step, the first selecting portion determines a reference size in advance and selects section elements having a size greater than the reference size, from among the section elements altered in the first altering step.

    Image document processing device, image document processing method, program, and storage medium
    9.
    Type: granted invention patent
    Status: in force

    Publication No.: US08290269B2

    Publication date: 2012-10-16

    Application No.: US11953695

    Filing date: 2007-12-10

    CPC classification: G06K9/6828 G06F17/30253

    Abstract: A headline-region initial processing section clips a headline-region image in an image document, divides the image into individual character images, and extracts features of the individual character images. Based on the features, a candidate-character-sequence generating section selects N (N is an integer more than 1) character images as candidate characters in the order of degree of matching from a font-feature dictionary for storing features of individual character images, and generates M×N index matrix where M is the number of characters in an extracted character sequence. Based on the index matrix, a document-name generating section generates a meaningful document name according to the image document. An image-document-DB management section manages accumulated image documents using the document name. This provides an image document processing device and an image document processing method each allowing automatically generating and managing the meaningful document name that represents the contents of the image document, without user's operation.

    DOCUMENT IMAGE PROCESSING APPARATUS, DOCUMENT IMAGE PROCESSING METHOD, DOCUMENT IMAGE PROCESSING PROGRAM, AND RECORDING MEDIUM ON WHICH DOCUMENT IMAGE PROCESSING PROGRAM IS RECORDED
    10.
    Type: invention patent application
    Status: in force

    Publication No.: US20090028446A1

    Publication date: 2009-01-29

    Application No.: US11972446

    Filing date: 2008-01-10

    IPC classification: G06K9/72

    Abstract: An image of a character string composed of M pieces of characters is clipped from a document image, and the image is divided into separate characters. Image features of each character image are extracted. Based on the image features, N (N>1, integer) pieces of character images in descending order of degree of similarity are selected as candidate characters, from a character image feature dictionary which stores the image features of character image in units of character, and a first index matrix of M×N cells is prepared. A candidate character string composed of a plurality of candidate characters constituting a first column of the first index matrix, is subjected to a lexical analysis according to a language model, and whereby a second index matrix having a character string which makes sense is prepared. In the language model, statistics are taken and then, the lexical analysis is performed.
