System and Method for Unsupervised Density Based Table Structure Identification

    Publication No.: US20210141781A1

    Publication Date: 2021-05-13

    Application No.: US16680302

    Filing Date: 2019-11-11

    Abstract: Embodiments described herein provide unsupervised density-based clustering to infer table structure from a document. Specifically, a number of words are identified from a block of text in a non-editable document, and the spatial coordinates of each word relative to the rectangular region bounding the block are identified. Based on the word density of the rectangular region, the words are grouped into clusters using a heuristic radius search method. Words that are grouped into the same cluster are determined to be elements that belong to the same cell. In this way, the cells of the table structure can be identified. Once the cells are identified based on the word density of the block of text, the identified cells can be expanded horizontally or grouped vertically to identify rows or columns of the table structure.
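
The cell-identification step described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method: the abstract does not specify how the search radius is derived, so the sketch assumes a radius equal to a multiple of the median word height, and it links words with a simple union-find instead of any particular clustering library. The names `Word`, `box_gap`, `cluster_words`, and the `scale` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Word:
    """A word and its bounding box, relative to the enclosing text region."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def box_gap(a: Word, b: Word) -> float:
    """Euclidean gap between two boxes; 0 if they touch or overlap."""
    dx = max(0.0, max(a.x, b.x) - min(a.x + a.w, b.x + b.w))
    dy = max(0.0, max(a.y, b.y) - min(a.y + a.h, b.y + b.h))
    return (dx * dx + dy * dy) ** 0.5


def cluster_words(words: list[Word], scale: float = 1.0) -> list[list[Word]]:
    """Group words into candidate cells.

    Assumption: the search radius is scale * median word height. Two words
    whose boxes lie within that radius are placed in the same cluster.
    """
    if not words:
        return []
    radius = scale * median(w.h for w in words)

    # Union-find over word indices.
    parent = list(range(len(words)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if box_gap(words[i], words[j]) <= radius:
                parent[find(i)] = find(j)

    clusters: dict[int, list[Word]] = {}
    for i, word in enumerate(words):
        clusters.setdefault(find(i), []).append(word)

    # Order words inside each cell top-to-bottom, then left-to-right.
    return [sorted(c, key=lambda w: (w.y, w.x)) for c in clusters.values()]
```

Under these assumptions, two words separated by a normal inter-word space fall into one cell, while words separated by a column gutter or a row gap larger than the radius become separate cells. The resulting cells could then be merged by overlapping vertical extent to form rows, or by overlapping horizontal extent to form columns, which is one possible reading of the expansion step the abstract mentions.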
