A Neural-Network-Based Method and Device for Parsing Complex PDF Structure

Granted Invention Patent

CN110598191B A Neural-Network-Based Method and Device for Parsing Complex PDF Structure (status: in force; assigned)


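Patent title: A Neural-Network-Based Method and Device for Parsing Complex PDF Structure
Application number: CN201911124192.6

Filing date: 2019-11-18
Publication (announcement) number: CN110598191B

Publication (announcement) date: 2020-04-07
Inventors: Song Yongsheng (宋永生), Tang Ming (汤铭), Wang Nan (王楠)
Applicant: Jiangsu Lianzhu Industrial Co., Ltd. (江苏联著实业股份有限公司)
Applicant address: Room 1502, Tongfu Building, 501 Zhongshan South Road, Nanjing, Jiangsu Province
Patentee: Jiangsu Lianzhu Industrial Co., Ltd. (江苏联著实业股份有限公司)
Current patentee: Wenling Technology (Beijing) Co., Ltd. (文灵科技(北京)有限公司)
Current patentee address: Room 1502, Tongfu Building, 501 Zhongshan South Road, Nanjing, Jiangsu Province
Agency: Lianyungang Lianchuang Patent Agency (连云港联创专利代理事务所)
Agent: Liu Gang (刘刚)
Main classification: G06F40/126
IPC classifications: G06F40/126 ; G06F40/205 ; G06F40/258 ; G06F40/30 ; G06N3/04 ; G06N3/08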

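Abstract:

The embodiments of this specification provide a neural-network-based method and device for parsing complex PDF structure. The method obtains feature information of a PDF document; performs a coarse-grained division of that feature information using a maximum entropy model to obtain the document's hierarchical paragraphs; converts those paragraphs into paragraph word vectors using a two-layer bidirectional language model trained on a large-scale corpus, and compresses the word vectors into paragraph semantic vectors; and feeds the paragraph semantic vectors into a multi-layer bidirectional long short-term memory (LSTM) network to obtain the hierarchy-level sequence of all paragraphs in the document. This addresses the technical problem of poor generalization caused by the varied structure of PDF documents, and achieves the technical effects of avoiding the limitations of hand-designed rule logic, parsing complex PDF document structure at a high level of quality, and generalizing well.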

Abstract (English):

Embodiments of the invention provide a complex PDF structure analysis method and device based on a neural network. The method comprises the steps of obtaining feature information of a PDF document; carrying out coarse-grained division on the feature information of the PDF document according to a maximum entropy model to obtain layered paragraphs of the PDF document; converting the layered paragraphs of the PDF document according to a two-layer bidirectional language model trained on a large-scale corpus to obtain paragraph word vectors, and compressing the paragraph word vectors to obtain paragraph semantic vectors; and inputting the paragraph semantic vectors into a multi-layer bidirectional long short-term memory network to obtain a hierarchical sequence of all paragraphs of the PDF document. The technical problem that generalization ability is poor because PDF document structure is not uniform is solved, and the technical effects are achieved that the limitations of manually designed rule logic are avoided, complex PDF document structure can be analyzed at a high level, and generalization ability is strong.
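
As a rough illustration of two of the stages named in the abstract, the sketch below shows a maximum entropy classifier (a softmax over linear feature scores) assigning a coarse label to a text block, and a pooling step that compresses a paragraph's word vectors into one vector. Everything concrete here is an assumption made for illustration: the feature names, the two labels, the hand-set weights, and the use of mean pooling as the "compression" are not taken from the patent, and the language model and bidirectional LSTM stages are omitted.

```python
import math

# Hypothetical hand-set weights standing in for a trained maximum entropy
# model; the feature names and labels are assumptions, not the patent's.
WEIGHTS = {
    "heading": {"font_size": 0.4, "is_bold": 1.5, "char_count": -0.02},
    "body": {"font_size": 0.1, "is_bold": 0.0, "char_count": 0.01},
}


def maxent_probs(features):
    """Return P(label | features) as a softmax over linear scores."""
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in WEIGHTS.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def coarse_label(features):
    """Pick the most probable coarse label for one text block."""
    probs = maxent_probs(features)
    return max(probs, key=probs.get)


def mean_pool(word_vectors):
    """Compress word vectors into one fixed-size vector (assumed: mean pooling)."""
    if not word_vectors:
        raise ValueError("paragraph has no word vectors")
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / len(word_vectors) for i in range(dim)]


if __name__ == "__main__":
    title_block = {"font_size": 18, "is_bold": 1, "char_count": 12}
    print(coarse_label(title_block))
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

In a full pipeline along the lines the abstract describes, the pooled vectors for consecutive paragraphs would form the input sequence to the bidirectional LSTM, which would emit one hierarchy level per paragraph.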

Publication/Grant Documents

CN110598191A A Neural-Network-Based Method and Device for Parsing Complex PDF Structure. Publication/grant date: 2019-12-20


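IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: G10L)
G06F40/10	. Text processing (natural language analysis: G06F 40/20; semantic analysis: G06F 40/30; processing or translation of natural language: G06F 40/40)
G06F40/12	.. Use of codes for handling textual entities
G06F40/126	... Character encoding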