Identification of reading order text segments with a probabilistic language model

    Publication No.: US10372821B2

    Publication Date: 2019-08-06

    Application No.: US15462684

    Application Date: 2017-03-17

    Applicant: Adobe Inc.

    Abstract: Certain embodiments identify a correct structured reading-order sequence of text segments extracted from a file. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments, which include a first set with a first text segment and a first continuation text segment as well as a second set with the first text segment and a second continuation text segment, are provided to the probabilistic model. A score indicative of a likelihood of the set providing a correct structured reading-order sequence is obtained for each set of text segments.
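
The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes a toy bigram model with add-one smoothing trained on a few sentences. The abstract does not name the model type, the smoothing method, or the size of the context window, so those choices and all function names (`train_bigram_model`, `splice_score`, `best_continuation`) are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def train_bigram_model(corpus_sentences):
    """Count unigrams and bigrams in a corpus (a stand-in for the 'large text corpus')."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def splice_score(model, first_segment, continuation, context=2):
    """Average log-probability of the words around the splice point.

    Assumption: only the last `context` words of the first segment and the
    first `context` words of the continuation are scored, using add-one
    (Laplace) smoothing so unseen bigrams get a small non-zero probability.
    """
    unigrams, bigrams = model
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen words
    left = first_segment.lower().split()[-context:]
    right = continuation.lower().split()[:context]
    tokens = left + right
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("-inf")
    log_prob = sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in pairs
    )
    return log_prob / len(pairs)


def best_continuation(model, first_segment, candidates):
    """Return the candidate whose splice with first_segment scores highest."""
    return max(candidates, key=lambda c: splice_score(model, first_segment, c))


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps in the sun",
    "a quick brown fox runs fast",
]
model = train_bigram_model(corpus)
candidates = ["in the sun", "fox jumps over"]
chosen = best_continuation(model, "the quick brown", candidates)
```

Here each pair (first segment, candidate continuation) plays the role of one "set of text segments" in the abstract, and the candidate with the highest score is treated as the more likely reading-order continuation. A production system would presumably use a far larger corpus and a stronger model than this toy bigram counter.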
