METHOD AND SYSTEM FOR EXTRACTING INFORMATION FROM INPUT DOCUMENT COMPRISING MULTI-FORMAT INFORMATION

    Publication No.: US20230315799A1

    Publication Date: 2023-10-05

    Application No.: US17806955

    Filing Date: 2022-06-15

    Applicant: Wipro Limited

    Abstract: Disclosed herein are a method and a system for extracting information from an input document comprising multi-format information. In an embodiment, a Hypertext Markup Language (HTML) document corresponding to the input document is created by analyzing the input document, which comprises documents of multiple data formats. Further, the HTML document is realigned based on the number of columns in each page of the HTML document. Furthermore, a document identifier (ID) associated with each of the documents is determined in the realigned HTML document by classifying the information in each document page using a pretrained Machine Learning (ML) model. Subsequently, a hierarchy configuration file corresponding to the realigned HTML document is generated based on the document IDs. Finally, information associated with each document ID is extracted from the hierarchy configuration file by orchestrating one or more data extractors that extract data attributes from the hierarchy configuration file.
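
The flow described in the abstract (realign by column, assign a document ID per page, group pages into a hierarchy, then run ID-specific extractors) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the publication: the `Page` type, the function names, the sample document IDs, and the regular expressions are hypothetical. The HTML conversion step is omitted (pages are modelled as lists of column text), a keyword lookup stands in for the pretrained ML classifier, and an in-memory dictionary stands in for the hierarchy configuration file.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical page model: the text of each column, left to right."""
    columns: List[str]


def realign(page: Page) -> str:
    """Flatten a multi-column page into single-column reading order."""
    return "\n".join(col.strip() for col in page.columns if col.strip())


def classify(text: str) -> str:
    """Keyword stand-in for the pretrained ML model; returns a document ID."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "purchase order" in lowered:
        return "purchase_order"
    return "unknown"


def build_hierarchy(pages: List[Page]) -> Dict[str, List[str]]:
    """Group realigned page text under its document ID (stand-in for the
    hierarchy configuration file)."""
    hierarchy: Dict[str, List[str]] = {}
    for page in pages:
        text = realign(page)
        hierarchy.setdefault(classify(text), []).append(text)
    return hierarchy


Extractor = Callable[[str], Dict[str, str]]


def regex_extractor(field: str, pattern: str) -> Extractor:
    """Build an extractor that returns {field: first capture group} on a match."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def extract(text: str) -> Dict[str, str]:
        match = compiled.search(text)
        return {field: match.group(1)} if match else {}

    return extract


# Hypothetical registry: which extractors to orchestrate for each document ID.
EXTRACTORS: Dict[str, List[Extractor]] = {
    "invoice": [
        regex_extractor("invoice_number", r"invoice\s*(?:no\.?|#)\s*:?\s*(\w+)"),
        regex_extractor("total", r"total\s*:?\s*([\d.,]+)"),
    ],
    "purchase_order": [
        regex_extractor("po_number", r"PO\s*(?:no\.?|#)\s*:?\s*(\w+)"),
    ],
}


def extract_all(hierarchy: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Run every extractor registered for a document ID over that ID's pages."""
    results: Dict[str, Dict[str, str]] = {}
    for doc_id, texts in hierarchy.items():
        attributes: Dict[str, str] = {}
        for text in texts:
            for extractor in EXTRACTORS.get(doc_id, []):
                attributes.update(extractor(text))
        results[doc_id] = attributes
    return results
```

Calling `extract_all(build_hierarchy(pages))` on a list of `Page` objects returns one attribute dictionary per document ID. Keeping the extractors in a registry keyed by document ID is one simple way to model the "orchestration" step; the publication may organise this differently.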