An Integrated Automatic Lexical Analysis Method and System for Ancient Chinese Texts

Granted Invention Patent

CN109829159B An Integrated Automatic Lexical Analysis Method and System for Ancient Chinese Texts (in force)

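Patent title: An Integrated Automatic Lexical Analysis Method and System for Ancient Chinese Texts
Application number: CN201910085019.3

Filing date: 2019-01-29
Publication (announcement) number: CN109829159B

Publication (announcement) date: 2020-02-18
Inventors: 李斌, 程宁, 葛四嘉, 李成名, 郝星月, 冯敏萱, 许超
Applicant: Nanjing Normal University (南京师范大学)
Applicant address: No. 122 Ninghai Road, Gulou District, Nanjing, Jiangsu Province
Patentee: Nanjing Normal University (南京师范大学)
Current patentee: Nanjing Normal University (南京师范大学)
Current patentee address: No. 122 Ninghai Road, Gulou District, Nanjing, Jiangsu Province
Agency: 南京苏高专利商标事务所
Agent: 王恒静
Main classification: G06F40/284
IPC classification: G06F40/284; G06F40/295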

Abstract:

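The invention discloses an integrated automatic lexical analysis method for ancient Chinese texts, comprising the following steps: pre-training a Word2Vec model to obtain character vectors of ancient Chinese that carry semantic features; adding information data that appear in the documents of successive dynasties to a database of proper names from ancient texts, forming a number of proper-noun entries; and tuning the parameters of a Bi-LSTM-CRF neural network model, preprocessing the final training corpus into a form the model can read, loading it into the neural network model, learning iteratively, and automatically evaluating the labeling results on the test corpus. The invention adopts an integrated labeling method that covers sentence segmentation, word segmentation, and part-of-speech tagging, which eliminates the repeated labeling passes required by the separate subtasks of lexical analysis and prevents labeling errors from propagating across stages. The invention uses a deep learning model that learns rich linguistic features automatically, removing the manual design of feature templates required in traditional machine learning. The labeling model is accelerated with GPU hardware, which greatly shortens training time and makes it far more efficient than traditional machine learning models.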
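The integrated labeling idea in the abstract can be illustrated with a minimal sketch in which every character receives a single label that carries its word position, the part of speech of its word, and a sentence-boundary flag. The label format `position-POS-boundary`, the tag names, and the functions `encode` and `decode` are assumptions made for illustration; the abstract does not specify the actual tag set used by the patent.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (word, part-of-speech tag); tags must not contain "-"


def encode(sentences: List[List[Word]]) -> Tuple[str, List[str]]:
    """Flatten segmented, POS-tagged sentences into one label per character.

    Hypothetical label format: "<B|I|E|S>-<POS>-<L|O>", where B/I/E/S is the
    position of the character inside its word and L marks the last character
    of a sentence (O marks every other character).
    """
    chars: List[str] = []
    labels: List[str] = []
    for sent in sentences:
        for w_idx, (word, pos) in enumerate(sent):
            is_last_word = w_idx == len(sent) - 1
            for c_idx, ch in enumerate(word):
                is_last_char = c_idx == len(word) - 1
                if len(word) == 1:
                    position = "S"
                elif c_idx == 0:
                    position = "B"
                elif is_last_char:
                    position = "E"
                else:
                    position = "I"
                boundary = "L" if is_last_word and is_last_char else "O"
                chars.append(ch)
                labels.append(f"{position}-{pos}-{boundary}")
    return "".join(chars), labels


def decode(text: str, labels: List[str]) -> List[List[Word]]:
    """Recover sentences, words, and POS tags from well-formed labels."""
    sentences: List[List[Word]] = []
    sent: List[Word] = []
    buf = ""
    for ch, label in zip(text, labels):
        position, pos, boundary = label.split("-")
        buf += ch
        if position in ("E", "S"):  # the current word is complete
            sent.append((buf, pos))
            buf = ""
        if boundary == "L":  # the current sentence is complete
            sentences.append(sent)
            sent = []
    if sent:  # trailing words without a sentence-final label
        sentences.append(sent)
    return sentences


# Illustrative input; the segmentation and tags here are made up for the demo.
sample = [[("孔子", "nr"), ("曰", "v")],
          [("學", "v"), ("而", "c"), ("時", "d"), ("習", "v"), ("之", "r")]]
text, labels = encode(sample)
print(text)
print(labels)
print(decode(text, labels) == sample)
```

Because one label sequence answers all three questions at once, a single sequence model can be trained on it, which is the property the abstract credits with avoiding error propagation between separate stages.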

Abstract (English):

The invention discloses an integrated automatic lexical analysis method for ancient Chinese texts. The method includes the following steps: pre-training character vectors of ancient Chinese with semantic features by using the Word2Vec model; adding the information data appearing in historical documents to a database of ancient proper names to form a number of proper noun entries; adjusting each parameter of the Bi-LSTM-CRF neural network model, preprocessing the final training corpus into a model-readable form, loading it into the neural network model, learning iteratively, and automatically evaluating the labeling result on the test corpus. The method adopts an integrated tagging approach covering sentence segmentation, word segmentation and part-of-speech tagging, which omits the repeated tagging process of the multiple subtasks of lexical analysis and avoids the multi-stage propagation of tagging errors. A deep learning model is adopted, so rich language features can be learned automatically and the work of manually customizing feature templates in traditional machine learning is omitted. The labeling model is accelerated with GPU hardware, so the model training time can be greatly shortened and the efficiency is much higher than that of a traditional machine learning model.
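The abstract mentions automatic evaluation of the labeling result on a test corpus but does not say which metric is used. A common choice for joint word segmentation and part-of-speech tagging is span-level precision, recall, and F1, sketched below under that assumption; the function names `to_spans` and `precision_recall_f1` are hypothetical.

```python
from typing import List, Set, Tuple

Word = Tuple[str, str]  # (word, part-of-speech tag)
Span = Tuple[int, int, str]  # (start offset, end offset, part-of-speech tag)


def to_spans(words: List[Word]) -> Set[Span]:
    """Convert a word sequence into character-offset spans with POS tags."""
    spans: Set[Span] = set()
    start = 0
    for word, pos in words:
        end = start + len(word)
        spans.add((start, end, pos))
        start = end
    return spans


def precision_recall_f1(gold: List[Word],
                        pred: List[Word]) -> Tuple[float, float, float]:
    """Score a prediction: a word counts as correct only if both its
    boundaries and its POS tag match the gold annotation."""
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative data: the prediction wrongly splits the first word.
gold = [("孔子", "nr"), ("曰", "v")]
pred = [("孔", "n"), ("子", "n"), ("曰", "v")]
print(precision_recall_f1(gold, pred))
```

In this example only one of the three predicted words matches one of the two gold words, so precision is 1/3, recall is 1/2, and F1 is 0.4.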

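Publication/Grant Documents

CN109829159A An Integrated Automatic Lexical Analysis Method and System for Ancient Chinese Texts. Publication/grant date: 2019-05-31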

IPC classification:

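G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: G10L)
G06F40/20	.Natural language analysis (semantic analysis of natural language: G06F40/30)
G06F40/279	..Recognition of textual entities
G06F40/284	...Lexical analysis, e.g. tokenisation or collocates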