MAPPING ENTITIES IN UNSTRUCTURED TEXT DOCUMENTS VIA ENTITY CORRECTION AND ENTITY RESOLUTION
Abstract:
Methods, systems, and non-transitory computer-readable storage media are disclosed for correcting entity detection errors, via entity correction and entity resolution, in optical character recognition used to digitize physical documents. Specifically, the disclosed system utilizes named entity recognition to extract entities from character strings (e.g., words) in a digital text document. The disclosed system also tokenizes the character strings in the digital text document based on attributes of the character strings. Furthermore, the disclosed system compares the extracted entities with the tokenized character strings to determine similarity metrics between them. The disclosed system also compares the extracted entities with character strings that include special or numerical characters to determine similarity metrics indicating correlation probabilities between entities and character strings. Finally, the disclosed system generates mappings between the tokens and the entities based on the similarity metrics, resolving each entity to its likely corresponding character string while correcting for errors introduced during entity extraction.
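The compare-and-map step described in the abstract can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names (`tokenize`, `similarity`, `map_entities`), the whitespace tokenizer, the 0.6 threshold, and the use of the Python standard library's `difflib.SequenceMatcher` ratio as the similarity metric are all assumptions made for illustration. The entities are supplied by hand, standing in for the output of a named entity recognition step.

```python
from difflib import SequenceMatcher


def tokenize(text):
    # Assumption: split on whitespace only, so that tokens containing
    # special or numerical characters (e.g. "INV-2021") stay intact.
    return text.split()


def similarity(entity, token):
    # Assumption: a case-insensitive SequenceMatcher ratio in [0, 1]
    # stands in for the similarity metric named in the abstract.
    return SequenceMatcher(None, entity.lower(), token.lower()).ratio()


def map_entities(entities, text, threshold=0.6):
    # Map each extracted entity to its most similar token, keeping the
    # mapping only when the best score reaches the (hypothetical) threshold.
    tokens = tokenize(text)
    mapping = {}
    for entity in entities:
        best_token, best_score = None, 0.0
        for token in tokens:
            score = similarity(entity, token)
            if score > best_score:
                best_token, best_score = token, score
        if best_token is not None and best_score >= threshold:
            mapping[entity] = (best_token, best_score)
    return mapping


# Hand-written entities with typical recognition errors ("rn" for "m",
# "O" for "0"), resolved against the tokens of the document text.
document_text = "Invoice INV-2021 issued to Acme Corp"
extracted_entities = ["Acrne", "INV-2O21", "Zzzz"]
resolved = map_entities(extracted_entities, document_text)
```

Here `resolved` maps each erroneous entity to the document token it most resembles, and entities with no sufficiently similar token are left unmapped. A production system would likely use a tokenizer and similarity metric tuned to the attributes of the character strings, as the abstract suggests.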