Sliding window to detect entities in corpus using natural language processing

    Publication No.: US11222165B1

    Publication date: 2022-01-11

    Application No.: US16996394

    Filing date: 2020-08-18

    IPC classification: G06F40/166 G06F40/279

    Abstract: According to one or more embodiments of the present invention, an input request to a natural language processing (NLP) system is optimized. A window-size is selected for annotating an input corpus. The corpus is divided into partitions of the window-size, each partition processed separately. Further, a first set of entities is identified in a first partition, and a second set of entities in a second partition. Further, a third partition containing a first segment and a second segment is determined. The first segment overlaps the first partition, and the second segment overlaps the second partition. The method further includes identifying a third set of entities in the third partition. In response to the third set of entities being distinct from a set of entities from the first segment and the second segment, the window-size is adjusted. The input request for the NLP system is generated using the adjusted window-size.
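
    The boundary check in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the toy lexicon, the `annotate` helper, the half-window straddling partition, and the doubling rule in `choose_window` are all assumptions made for illustration (the abstract only says the window-size is "adjusted").

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start token, end token, label), corpus positions

# Hypothetical toy lexicon standing in for a real NLP annotator.
LEXICON = {("heart", "attack"): "CONDITION", ("aspirin",): "DRUG"}


def annotate(tokens: List[str], offset: int) -> Set[Entity]:
    """Toy annotator: longest lexicon match first, shifted to corpus positions."""
    found: Set[Entity] = set()
    max_len = max(len(key) for key in LEXICON)
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in LEXICON:
                found.add((offset + i, offset + i + n, LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return found


def within(entities: Set[Entity], lo: int, hi: int) -> Set[Entity]:
    """Keep only entities that lie entirely inside the token span [lo, hi)."""
    return {e for e in entities if lo <= e[0] and e[1] <= hi}


def boundary_is_consistent(tokens: List[str], window: int) -> bool:
    """Check every partition boundary with a straddling 'third' partition."""
    half = window // 2
    for b in range(window, len(tokens), window):
        first = annotate(tokens[b - window:b], b - window)
        second = annotate(tokens[b:b + window], b)
        lo, hi = b - half, min(b + half, len(tokens))
        third = annotate(tokens[lo:hi], lo)
        from_segments = within(first, lo, b) | within(second, b, hi)
        if third != from_segments:
            return False  # an entity is split by the boundary
    return True


def choose_window(tokens: List[str], window: int, max_window: int) -> int:
    """Assumed adjustment rule: double the window until boundaries agree."""
    while window < max_window and not boundary_is_consistent(tokens, window):
        window *= 2
    return window


tokens = "patient had a heart attack and took aspirin".split()
print(choose_window(tokens, 4, 16))
```

    With a window of four tokens the phrase "heart attack" is split across the first two partitions, so only the straddling partition finds it and the window is enlarged.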

    Future potential natural language processing annotations

    Publication No.: US11520972B2

    Publication date: 2022-12-06

    Application No.: US16984245

    Filing date: 2020-08-04

    Abstract: Aspects of the invention include resolving future reference identifiers for documents. Aspects of the invention include processing a document including a reference to a future event, wherein processing includes performing natural language processing (NLP) on the document, and identifying the reference to the future event included in the document. Aspects of the invention also include generating a future reference identifier for the reference to the future event, and responsive to processing an occurrence of the future event, resolving the future reference identifier by providing data from a subsequent document for the future event associated with the future reference identifier.
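
    The lifecycle of a future reference identifier can be sketched as follows. This is a rough illustration only: the `FutureRegistry` class, the regular expression standing in for NLP-based detection, and the identifier format are hypothetical and do not come from the patent.

```python
import re
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical pattern standing in for real NLP detection of future events.
FUTURE_PATTERN = re.compile(r"will undergo an? (\w+)", re.IGNORECASE)


@dataclass
class FutureRegistry:
    pending: Dict[str, str] = field(default_factory=dict)   # identifier -> event
    resolved: Dict[str, str] = field(default_factory=dict)  # identifier -> data

    def process(self, text: str) -> List[str]:
        """Create one placeholder identifier per future event found in text."""
        ids = []
        for match in FUTURE_PATTERN.finditer(text):
            ref_id = f"future-{uuid.uuid4().hex[:8]}"
            self.pending[ref_id] = match.group(1).lower()
            ids.append(ref_id)
        return ids

    def resolve(self, event: str, data: str) -> List[str]:
        """Attach data from a later document to every pending identifier."""
        hits = [r for r, e in self.pending.items() if e == event.lower()]
        for ref_id in hits:
            del self.pending[ref_id]
            self.resolved[ref_id] = data
        return hits


registry = FutureRegistry()
ids = registry.process("The patient will undergo an MRI next week.")
registry.resolve("MRI", "MRI shows no lesion")
print(registry.resolved[ids[0]])
```

    The first document only yields a placeholder; the data arrives when a subsequent document reports that the event occurred.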

    Detecting and processing sections spanning processed document partitions

    Publication No.: US11347928B2

    Publication date: 2022-05-31

    Application No.: US16939283

    Filing date: 2020-07-27

    Abstract: Aspects of the invention include detecting and processing sections spanning processed document partitions by caching a document partition. The document partition includes metadata indicating that the document partition is a portion of a whole document. Aspects also include pairing a candidate paragraph from the document partition with a cached paragraph segment and determining, using a coherence model, a probability that the candidate paragraph and the cached paragraph segment constitute a semantically coherent paragraph. Aspects further include discarding the cached paragraph segment and processing the candidate paragraph and the cached paragraph segment separately based on a determination that the probability is less than a threshold level and processing the candidate paragraph and the cached paragraph segment together as a cross-partition paragraph based on a determination that the probability is greater than the threshold level.
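
    The threshold decision in this abstract can be sketched as below. The `toy_coherence` heuristic is a hypothetical stand-in for a trained coherence model, and treating a probability exactly equal to the threshold as "separate" is an assumption, since the abstract only covers the less-than and greater-than cases.

```python
from typing import Callable, List


def toy_coherence(cached: str, candidate: str) -> float:
    """Hypothetical stand-in for a coherence model.

    Scores higher when the cached segment lacks terminal punctuation and
    the candidate paragraph starts with a lowercase letter.
    """
    score = 0.0
    if not cached.rstrip().endswith((".", "!", "?")):
        score += 0.5
    if candidate.lstrip()[:1].islower():
        score += 0.5
    return score


def merge_across_partitions(
    cached: str,
    candidate: str,
    model: Callable[[str, str], float] = toy_coherence,
    threshold: float = 0.5,
) -> List[str]:
    """Return one joined paragraph or the two pieces, based on the threshold."""
    probability = model(cached, candidate)
    if probability > threshold:
        # Process together as a cross-partition paragraph.
        return [cached.rstrip() + " " + candidate.lstrip()]
    # Process the cached segment and the candidate separately.
    return [cached, candidate]


print(merge_across_partitions("The patient was advised to", "rest for two weeks."))
print(merge_across_partitions("Findings were normal.", "Follow-up in May."))
```

    Passing the model as a parameter keeps the threshold logic independent of how the probability is actually computed.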

    Handling form data errors arising from natural language processing

    Publication No.: US20220028502A1

    Publication date: 2022-01-27

    Application No.: US16934061

    Filing date: 2020-07-21

    Abstract: Aspects include receiving a document and classifying at least a subset of the document as having a first type of data. Features are extracted from the document. The extracting includes initiating processing of the at least a subset of the document by a first processing engine that was previously trained to extract features from the first type of data. The extracting also includes initiating processing of a remaining portion of the document not included in the at least a subset of the document by a second processing engine that was previously trained to extract features from a second type of data. The first type of data is different than the second type of data. Features are received from one or both of the first processing engine and the second processing engine. The received features are stored as features of the document.
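
    The routing described in this abstract can be sketched as follows, assuming for illustration that the first type of data is form-like "key: value" lines and the second type is free text. The `classify`, `form_engine`, and `text_engine` functions are hypothetical stand-ins for the trained classifier and processing engines.

```python
import re
from typing import Dict, List, Tuple

# Assumed definition of form data: a "label: value" line.
FORM_LINE = re.compile(r"^\s*([A-Za-z ]+):\s*(.+)$")


def classify(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split lines into the form-like subset and the remaining portion."""
    form = [line for line in lines if FORM_LINE.match(line)]
    rest = [line for line in lines if not FORM_LINE.match(line)]
    return form, rest


def form_engine(lines: List[str]) -> Dict[str, object]:
    """Stand-in first engine: turn each form line into a named feature."""
    features: Dict[str, object] = {}
    for line in lines:
        match = FORM_LINE.match(line)
        if match:
            features[match.group(1).strip().lower()] = match.group(2).strip()
    return features


def text_engine(lines: List[str]) -> Dict[str, object]:
    """Stand-in second engine: a trivial feature from the free text."""
    return {"word_count": len(" ".join(lines).split())}


def extract_features(document: str) -> Dict[str, object]:
    """Route each portion to its engine and store the combined features."""
    lines = [line for line in document.splitlines() if line.strip()]
    form, rest = classify(lines)
    features: Dict[str, object] = {}
    features.update(form_engine(form))
    features.update(text_engine(rest))
    return features


doc = "Name: Jane Doe\nAge: 42\nPatient reports mild headache."
print(extract_features(doc))
```

    Each engine only sees the portion it is suited to, and the results are merged into a single feature set for the document.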