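Method and System for Clustering Chinese Text Data

Invention Publication

CN103218435A Method and System for Clustering Chinese Text Data (Lapsed - Rights Terminated)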

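Patent title: Method and System for Clustering Chinese Text Data (一种中文文本数据聚类方法及系统)
Application number: CN201310130406.7
Filing date: 2013-04-15
Publication number: CN103218435A
Publication date: 2013-07-24
Inventor: Zhao Xu (赵旭)
Applicant: 上海嘉之道企业管理咨询有限公司 (Shanghai Jiazhidao Enterprise Management Consulting Co., Ltd.)
Applicant address: No. 315 Husong Road, Songjiang District, Shanghai
Patentee: 上海嘉之道企业管理咨询有限公司
Current patentee: 上海嘉之道企业管理咨询有限公司
Current patentee address: No. 315 Husong Road, Songjiang District, Shanghai
Agency: 上海申新律师事务所 (Shanghai Shenxin Law Firm)
Agent: Zhu Luling (竺路玲)
Main classification: G06F17/30
IPC classification: G06F17/30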

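Abstract:

The invention discloses a method and system for clustering Chinese text data, belonging to the technical field of data mining. The method comprises: step 1, performing dimension reduction on each item of text data; step 2, dividing the text data into multiple batches as required; step 3, clustering the text data within each single batch according to text similarity; and step 4, completing the clustering across all batches to form a unified clustering. The dimension reduction in step 1 comprises: step a, selecting a feature character set; and step b, comparing each item of text data against the feature character set and recording the feature characters that occur in it, forming the feature set of that text data. The beneficial effects of the invention are that dimension reduction and batch processing of the text data effectively improve the system's running speed and efficiency and reduce space overhead, addressing both the processing-efficiency problem and the large space-footprint problem of large-scale Chinese text clustering.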

Abstract (English):

The invention discloses a method and a system for clustering Chinese text data, which belong to the technical field of data mining. The method comprises: step 1, carrying out a dimension reduction process on each item of text data; step 2, dividing the text data into a plurality of batches; step 3, clustering the text data in a single batch according to text similarity; and step 4, completing the clustering across all the batches so as to form a unified clustering. The dimension reduction process of step 1 comprises: step a, selecting a feature character set; and step b, comparing each item of text data with the feature character set, recording the feature characters that appear in the text data, and forming a feature set for the text data. The method has the beneficial effects that the operation speed and efficiency of the system are effectively improved by carrying out the dimension reduction process and the batch process on the text data, and the space overhead is lowered; the processing efficiency problem of large-scale Chinese text clustering and the performance problem of large space occupation can thereby be solved.
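The four steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the similarity measure, the threshold, or the clustering strategy, so everything specific here is an assumption made for illustration: Jaccard similarity over feature sets, a threshold of 0.5, greedy single-pass assignment, and using a cluster's first member as its representative. The function names (`reduce_text`, `jaccard`, `greedy_merge`, `cluster_texts`) are hypothetical and do not come from the patent.

```python
from typing import FrozenSet, List, Set, Tuple

# A cluster is (representative feature set, indices of member texts).
Cluster = Tuple[FrozenSet[str], List[int]]


def reduce_text(text: str, feature_chars: Set[str]) -> FrozenSet[str]:
    """Steps 1a/1b: keep only the feature characters that occur in the text."""
    return frozenset(ch for ch in text if ch in feature_chars)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed similarity measure: |a & b| / |a | b| (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_merge(clusters: List[Cluster], incoming: List[Cluster],
                 threshold: float) -> List[Cluster]:
    """Attach each incoming cluster to the first existing cluster whose
    representative is similar enough; otherwise start a new cluster."""
    result = [(rep, list(members)) for rep, members in clusters]
    for rep, members in incoming:
        for existing_rep, existing_members in result:
            if jaccard(rep, existing_rep) >= threshold:
                existing_members.extend(members)
                break
        else:
            result.append((rep, list(members)))
    return result


def cluster_texts(texts: List[str], feature_chars: Set[str],
                  batch_size: int, threshold: float = 0.5) -> List[List[int]]:
    """Run steps 1 to 4 and return clusters as lists of text indices."""
    # Step 1: dimension reduction of every text to its feature set.
    features = [reduce_text(t, feature_chars) for t in texts]
    merged: List[Cluster] = []
    # Step 2: process the texts batch by batch.
    for start in range(0, len(texts), batch_size):
        stop = min(start + batch_size, len(texts))
        singletons = [(features[i], [i]) for i in range(start, stop)]
        # Step 3: cluster within the batch.
        local = greedy_merge([], singletons, threshold)
        # Step 4: fold the batch's clusters into the unified clustering.
        merged = greedy_merge(merged, local, threshold)
    return [members for _, members in merged]


feature_chars = set("猫狗鱼车路")
texts = ["我家的猫和狗", "猫狗都很可爱", "这条路上车很多", "车在路上跑", "猫和狗在玩"]
print(cluster_texts(texts, feature_chars, batch_size=2))
# prints [[0, 1, 4], [2, 3]]
```

Because each text is reduced to a small set of feature characters and only one batch's clusters plus the cluster representatives are compared at a time, the sketch shows where the claimed time and space savings could come from; the actual selection of feature characters and the merge rule in the patent may differ.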

Publication/Grant Documents

CN103218435B Method and System for Clustering Chinese Text Data. Publication/grant date: 2017-01-25
