Corpus generation device and method, human-machine interaction system

    Publication No.: US10268678B2

    Publication date: 2019-04-23

    Application No.: US15694918

    Filing date: 2017-09-04

    Inventors: Nan Qiu, Haofen Wang

    IPC classification: G06F17/21, G06F17/27, G06F17/28

    Abstract: A corpus generation device and method. The device comprises: a segmentation module, connected to at least one monolingual parallel corpus, for segmenting each sentence into words and processing the segmented words by a knowledge-driven approach; a classification module for classifying sentences that have different tag sequences but the same meaning into the same sentence cluster; a mapping module for determining the categories of sentence structure of all the sentences in a sentence cluster, and for recording and storing the mapping mode by which tags are transformed when one category of sentence structure in that cluster is converted into another; a sentence structure generation module for generating sentence structures according to a first mapping mode between a first category of sentence structure in one of the sentence clusters and the other categories of sentence structure in the same cluster; and a corpus generation module for nesting the word corresponding to each sequence tag into the generated sentence structures to produce a new monolingual parallel corpus.
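    The five modules in the abstract can be read as a small pipeline. The Python sketch below is only an illustration under stated assumptions, not the patented implementation: the knowledge base is a hypothetical word-to-tag dictionary, segmentation is plain whitespace splitting, sentence clusters are taken directly from paraphrase groups that the parallel corpus is assumed to provide, and the "mapping mode" is reduced to matching tags by name. All function and variable names are invented for this example.

```python
# Hypothetical knowledge base (assumption): maps known words to tags.
KNOWLEDGE_BASE = {
    "weather": "TOPIC", "traffic": "TOPIC",
    "Beijing": "CITY", "Shanghai": "CITY",
}


def to_structure(sentence):
    """Segment a sentence and replace known words with tags.

    Returns the sentence structure (a tuple of ("WORD", w) / ("TAG", t)
    items) and the tag -> word slots found in the sentence.
    """
    structure, slots = [], {}
    for word in sentence.split():  # whitespace segmentation is an assumption
        tag = KNOWLEDGE_BASE.get(word)
        if tag is None:
            structure.append(("WORD", word))
        else:
            structure.append(("TAG", tag))
            slots[tag] = word  # a repeated tag keeps only its last word
    return tuple(structure), slots


def build_clusters(paraphrase_groups):
    """Turn each group of same-meaning sentences into a cluster of
    distinct sentence structures."""
    clusters = []
    for group in paraphrase_groups:
        structures = []
        for sentence in group:
            structure, _ = to_structure(sentence)
            if structure not in structures:
                structures.append(structure)
        clusters.append(structures)
    return clusters


def generate_paraphrases(sentence, clusters):
    """Map a sentence onto the other structures of its cluster and
    nest its words back into their tags."""
    structure, slots = to_structure(sentence)
    results = []
    for cluster in clusters:
        if structure not in cluster:
            continue
        for target in cluster:
            if target == structure:
                continue
            tags = [value for kind, value in target if kind == "TAG"]
            if any(tag not in slots for tag in tags):
                continue  # no word available for some tag in the target
            words = [slots[v] if k == "TAG" else v for k, v in target]
            results.append(" ".join(words))
    return results


# Toy monolingual parallel corpus: one group of same-meaning sentences.
GROUPS = [["how is the weather in Beijing", "what is the Beijing weather like"]]
CLUSTERS = build_clusters(GROUPS)

if __name__ == "__main__":
    for line in generate_paraphrases("how is the traffic in Shanghai", CLUSTERS):
        print(line)
```

    In this toy setup, a new sentence that shares a structure with one member of a cluster is rewritten into every other structure of that cluster, which is one way to picture how a new monolingual parallel corpus could be produced. A real system would need a proper segmenter, a richer knowledge source, and an explicit tag-to-tag mapping for clusters whose structures use different tag sets.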