System for tokenizing text in languages without inter-word separation

    Publication (announcement) number: US10002128B2

    Publication (announcement) date: 2018-06-19

    Application number: US14927274

    Application date: 2015-10-29

    CPC classification number: G06F17/277 G06F17/2775 G06F17/2863

    Abstract: A computerized system for transforming an input string includes a dictionary with tokens and associated scores. A chart parser generates a chart parse of the input string by, for each position within the input string, (i) identifying a string of at least one consecutive character in the input string that begins at that position and matches one of the tokens and (ii) unless the identified string is a single character matching the start character for another entry in the chart parse, creating an entry corresponding to the identified string. A partition selection module determines a selected partition of the input string. The selected partition includes an array of tokens selected from the chart parse such that their concatenation matches the input string. The selected partition is a minimum score partition, where the score is based on a sum of the tokens' associated scores from the dictionary.
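The two stages described in the abstract, building a chart of dictionary matches and then choosing the lowest-scoring partition, can be illustrated with a short sketch. This is only one plausible reading, not the patented implementation: the function names, the toy dictionary and its scores are hypothetical, the single-character rule is interpreted as "skip a one-character match when a longer match starts at the same position", and the partition is chosen with ordinary dynamic programming over the chart.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy dictionary: token -> score (lower is better).
EXAMPLE_SCORES: Dict[str, float] = {
    "北京": 2.0, "大学": 2.0, "北京大学": 3.0, "大学生": 2.0,
    "北": 5.0, "京": 5.0, "大": 5.0, "学": 5.0, "生": 5.0,
}


def build_chart(text: str, scores: Dict[str, float]) -> List[List[str]]:
    """Return chart[i]: the dictionary tokens that match text starting at i."""
    max_len = max((len(tok) for tok in scores), default=0)
    chart: List[List[str]] = []
    for i in range(len(text)):
        end = min(len(text), i + max_len)
        matches = [text[i:j] for j in range(i + 1, end + 1) if text[i:j] in scores]
        # Assumed reading of the abstract's exception: drop the single-character
        # match when a longer entry begins with that same character here.
        if any(len(m) > 1 for m in matches):
            matches = [m for m in matches if len(m) > 1]
        chart.append(matches)
    return chart


def min_score_partition(
    text: str, scores: Dict[str, float]
) -> Optional[Tuple[float, List[str]]]:
    """Return (total score, tokens) for the cheapest partition, or None."""
    chart = build_chart(text, scores)
    n = len(text)
    # best[i] holds the cheapest partition of text[i:], filled right to left.
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[n] = (0.0, [])
    for i in range(n - 1, -1, -1):
        for tok in chart[i]:
            tail = best[i + len(tok)]
            if tail is None:
                continue  # this token leads to a dead end
            total = scores[tok] + tail[0]
            current = best[i]
            if current is None or total < current[0]:
                best[i] = (total, [tok] + tail[1])
    return best[0]


if __name__ == "__main__":
    print(min_score_partition("北京大学生", EXAMPLE_SCORES))
```

With the toy scores above, the input "北京大学生" is split into "北京" and "大学生" with a total score of 4.0. Under this simplified reading of the single-character rule, some inputs have no partition in the pruned chart, in which case the sketch returns None; how the actual system handles that case is not stated in the abstract.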
