1.
Publication No.: US20230315799A1
Publication Date: 2023-10-05
Application No.: US17806955
Filing Date: 2022-06-15
Applicant: Wipro Limited
IPC Classification: G06F16/958, G06F40/221, G06F40/143
CPC Classification: G06F16/986, G06F40/143, G06F40/221
Abstract: Disclosed herein are a method and a system for extracting information from an input document comprising multi-format information. In an embodiment, a Hypertext Markup Language (HTML) document corresponding to the input document is created by analyzing the input document, which comprises documents of multiple data formats. Further, the HTML document is realigned based on the number of columns in each page of the HTML document. Furthermore, a document identifier (ID) associated with each of the documents is determined in the realigned HTML document by classifying the information in each document page using a pretrained Machine Learning (ML) model. Subsequently, a hierarchy configuration file corresponding to the realigned HTML document is generated based on the document IDs. Finally, information associated with each document ID is extracted from the hierarchy configuration file by orchestrating one or more data extractors that extract data attributes from the file.
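
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Page` structure, the keyword-based `classify` function (a stand-in for the pretrained ML model), the in-memory dictionary used as the "hierarchy configuration", and the regex extractors are all hypothetical names and simplifications introduced here for illustration only.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    """One page of the intermediate document; each inner list is one column."""
    columns: list[list[str]]


def realign(page: Page) -> list[str]:
    """Flatten a multi-column page into a single reading order (left to right)."""
    return [block for column in page.columns for block in column]


def classify(blocks: list[str]) -> str:
    """Assumption: a keyword rule standing in for the pretrained ML classifier."""
    text = " ".join(blocks).lower()
    if "invoice" in text:
        return "invoice"
    if "purchase order" in text:
        return "purchase_order"
    return "unknown"


def build_hierarchy(pages: list[Page]) -> dict[str, list[str]]:
    """Group realigned page content under the document ID assigned to each page."""
    hierarchy: dict[str, list[str]] = {}
    for page in pages:
        blocks = realign(page)
        hierarchy.setdefault(classify(blocks), []).extend(blocks)
    return hierarchy


def first_match(pattern: str) -> Callable[[str], Optional[str]]:
    """Build an extractor that returns the first capture group, or None."""
    def extractor(text: str) -> Optional[str]:
        match = re.search(pattern, text)
        return match.group(1) if match else None
    return extractor


# Hypothetical extractor registry: document ID -> attribute name -> extractor.
EXTRACTORS: dict[str, dict[str, Callable[[str], Optional[str]]]] = {
    "invoice": {"total": first_match(r"Total:\s*([\d.]+)")},
    "purchase_order": {"po_number": first_match(r"PO Number:\s*(\d+)")},
}


def extract(hierarchy: dict[str, list[str]]) -> dict[str, dict[str, Optional[str]]]:
    """Orchestrate the extractors registered for each document ID."""
    results: dict[str, dict[str, Optional[str]]] = {}
    for doc_id, blocks in hierarchy.items():
        text = "\n".join(blocks)
        extractors = EXTRACTORS.get(doc_id, {})
        results[doc_id] = {name: fn(text) for name, fn in extractors.items()}
    return results


pages = [
    Page(columns=[["Invoice #A-17"], ["Total: 120.50"]]),   # two-column page
    Page(columns=[["Purchase Order", "PO Number: 9921"]]),  # one-column page
]
hierarchy = build_hierarchy(pages)
attributes = extract(hierarchy)
```

Keeping the extractors in a registry keyed by document ID mirrors the orchestration step: once the classifier assigns an ID, only the extractors relevant to that document type run against its content.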