Publication number: US20240046036A1
Publication date: 2024-02-08
Application number: US18258867
Filing date: 2021-12-07
Inventors: Aygul GARIFULLINA, Mathias KERN, Leonhard APPLIS
IPC: G06F40/284
CPC classification number: G06F40/284
Abstract: A computer-implemented method of pre-processing an input text for a natural language processing operation based on a training corpus of documents can include: accessing a set of stop words including predetermined words for de-emphasis in the text for the natural language processing operation, the set of stop words being separated into at least two subsets including a first subset and a second subset, the second subset containing stop words predetermined to be of potential semantic significance to documents in the training corpus; tokenizing documents in the training corpus to an ordered set of corpus tokens; removing, from the set of corpus tokens, tokens corresponding to stop words in the first subset of stop words; generating a set of n-grams by identifying n-grams from groups of tokens in the set of corpus tokens based on predetermined rules for n-gram identification; tokenizing the input text to an ordered set of input text tokens; identifying groups of tokens in the set of input text tokens corresponding to n-grams in the set of n-grams and replacing, in the set of input text tokens, each identified group of tokens by a single n-gram token; removing, from the set of input text tokens, tokens corresponding to stop words in the second subset of stop words; and processing the input text by the natural language processing operation based on the set of input text tokens for the input text.
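The steps recited in the abstract can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The function names, the regex tokenizer, the stop-word lists, the underscore-joined n-gram token, and the frequency threshold (`min_count`) standing in for the "predetermined rules for n-gram identification" are all assumptions made for illustration. The abstract does not state that first-subset stop words are removed from the input text, so the sketch does not remove them there.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into an ordered list of word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_ngrams(corpus_docs, first_subset, max_n=3, min_count=2):
    """Collect n-grams (n = 2..max_n) seen at least min_count times in the corpus.

    Assumption: a simple frequency threshold plays the role of the
    'predetermined rules for n-gram identification'.
    """
    counts = Counter()
    for doc in corpus_docs:
        # First-subset stop words are dropped BEFORE n-grams are identified.
        tokens = [t for t in tokenize(doc) if t not in first_subset]
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c >= min_count}

def merge_ngrams(tokens, ngrams, max_n=3):
    """Replace each matching group of tokens by one joined token (longest match first)."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and gram in ngrams:
                out.append("_".join(gram))
                i += n
                break
        else:
            # No n-gram starts at position i; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

def preprocess(text, ngrams, second_subset, max_n=3):
    """Tokenize, merge known n-grams, then drop second-subset stop words."""
    tokens = merge_ngrams(tokenize(text), ngrams, max_n)
    # Second-subset stop words are dropped AFTER merging, so occurrences
    # inside a recognised n-gram survive as part of the merged token.
    return [t for t in tokens if t not in second_subset]

# Hypothetical stop-word subsets and training corpus.
FIRST_SUBSET = {"the", "a", "is"}
SECOND_SUBSET = {"of"}  # potentially significant, e.g. inside "bank of england"
CORPUS = [
    "The Bank of England raised rates.",
    "A statement from the Bank of England.",
]

NGRAMS = build_ngrams(CORPUS, FIRST_SUBSET)
RESULT = preprocess("Rates of Bank of England", NGRAMS, SECOND_SUBSET)
print(RESULT)  # ['rates', 'bank_of_england']
```

In this example the standalone "of" is removed from the input text, while the "of" inside the recognised n-gram is preserved within the merged token. That ordering is the reason the sketch defers second-subset removal until after n-gram replacement.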