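A word vector representation method based on joint statistics of Chinese morphemes and pinyin

Invention publication

CN109815476A A word vector representation method based on joint statistics of Chinese morphemes and pinyin (patent in force)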

Patent title (original): 一种基于中文语素和拼音联合统计的词向量表示方法
Patent title (English): A word vector representation method based on joint statistics of Chinese morphemes and pinyin
Application number: CN201811465623.0

Filing date: 2018-12-03
Publication number: CN109815476A

Publication date: 2019-05-28
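Inventors: 潘坚跃 , 刘祝平 , 潘艺旻 , 王译田 , 陈文康 , 王汝英 , 李欣荣 , 赵光俊 , 周航帆 , 魏伟 , 刘畅 , 李艳
Applicants: 国网浙江省电力有限公司杭州供电公司 , 天津市普迅电力信息技术有限公司 , 国网信息通信产业集团有限公司
Applicant address: No. 219 Jianguo Middle Road, Shangcheng District, Hangzhou, Zhejiang Province
Patentees: 国网浙江省电力有限公司杭州供电公司 , 天津市普迅电力信息技术有限公司 , 国网信息通信产业集团有限公司
Current patentees: 国网浙江省电力有限公司杭州供电公司 , 天津市普迅电力信息技术有限公司 , 国网信息通信产业集团有限公司
Current patentee address: No. 219 Jianguo Middle Road, Shangcheng District, Hangzhou, Zhejiang Province
Patent agency: 天津盛理知识产权代理有限公司
Agent: 董一宁
Main classification: G06F17/27
IPC classification: G06F17/27 ; G06N3/04 ; G06N3/08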

Abstract:

A word vector representation method based on joint statistics of Chinese morphemes and pinyin, comprising the following steps: (1) collecting Internet text to build a corpus, then cleaning the body text and segmenting it into words; (2) converting the segmented Chinese corpus into pinyin without tone information, then computing term frequency and inverse document frequency for the morpheme features and the pinyin features over the training corpus and the full document collection, giving the statistical weights TFc, IDFc, TFp and IDFp; (3) constructing representation vectors for individual Chinese morphemes using a Chinese word representation model based on joint statistics of contextual morphemes and pinyin; (4) on the basis of step (3), training a three-layer neural network to predict the center target word. The method adapts to the scale of offline dictionaries and corpus data, can learn directly from large-scale unlabeled Internet text, makes conventional word embedding models account better for characteristics specific to the Chinese language, and improves the representation and recognition accuracy of words containing wrongly written characters.
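
The statistics in step (2) can be illustrated with a minimal sketch. It is not the patented implementation: the toy corpus, the `TONELESS_PINYIN` lookup table, the helper names `to_pinyin` and `tf_idf_stats`, and the plain `log(N / df)` form of the inverse document frequency are all assumptions made for illustration, since the abstract does not give the exact formulas. The sketch covers only the counting of TFc, IDFc, TFp and IDFp, not the vector construction or the neural network of steps (3) and (4).

```python
import math
from collections import Counter

# Hypothetical toy lookup from character to toneless pinyin (an assumption
# for illustration; a real pipeline would use a full pinyin converter).
TONELESS_PINYIN = {"电": "dian", "店": "dian", "力": "li", "网": "wang", "王": "wang"}


def to_pinyin(word):
    """Return the toneless pinyin syllables for one segmented word."""
    return [TONELESS_PINYIN.get(ch, ch) for ch in word]


def tf_idf_stats(docs):
    """Compute corpus-level TF and IDF for documents given as lists of units."""
    counts = Counter()    # occurrences of each unit over the whole corpus
    doc_freq = Counter()  # number of documents containing each unit
    for doc in docs:
        counts.update(doc)
        doc_freq.update(set(doc))
    total = sum(counts.values())
    tf = {u: c / total for u, c in counts.items()}
    # Plain log(N / df); the patent's exact formula is not given in the abstract.
    idf = {u: math.log(len(docs) / df) for u, df in doc_freq.items()}
    return tf, idf


# Already-segmented toy corpus (hypothetical): three short documents.
corpus = [["电力", "网"], ["电", "店"], ["王", "力"]]

# Flatten each document into morphemes (characters) and into pinyin syllables.
char_docs = [[ch for word in doc for ch in word] for doc in corpus]
pinyin_docs = [[syl for word in doc for syl in to_pinyin(word)] for doc in corpus]

tf_c, idf_c = tf_idf_stats(char_docs)    # TFc, IDFc
tf_p, idf_p = tf_idf_stats(pinyin_docs)  # TFp, IDFp

print(tf_c["电"], tf_p["dian"])
```

In this toy corpus the homophones 电 and 店 both map to `dian`, so the pinyin statistics pool their counts while the morpheme statistics keep them apart. That pooling is one plausible reason pinyin features could help with words containing homophone typos, as the abstract claims.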

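Publication/grant documents

CN109815476B A word vector representation method based on joint statistics of Chinese morphemes and pinyin. Publication/grant date: 2023-03-24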
