-
Publication No.: US20250157209A1
Publication Date: 2025-05-15
Application No.: US19002208
Application Date: 2024-12-26
Applicant: Oracle International Corporation
Inventor: Yakupitiyage Don Thanuja Samodhye Dharmasiri , Xu Zhong , Ahmed Ataallah Ataallah Abobakr , Hongtao Yang , Budhaditya Saha , Shaoke Xu , Shashi Prasad Suravarapu , Mark Edward Johnson , Thanh Long Duong
IPC: G06V10/82 , G06V30/148 , G06V30/412
Abstract: Techniques for extracting key information from a document using machine-learning models in a chatbot system are disclosed herein. In one particular aspect, a method is provided that includes receiving a set of data, which includes key fields, within a document at a data processing system that includes a table detection module, a key information extraction module, and a table extraction module. Text information and corresponding location data are extracted via optical character recognition. The table detection module detects whether one or more tables are present in the document and, if applicable, a location of each of the tables. The key information extraction module extracts text from the key fields. The table extraction module extracts each of the tables based on input from the optical character recognition and the table detection module. Extraction results that include the text from the key fields and each of the tables can be output.
-
Publication No.: US20250118398A1
Publication Date: 2025-04-10
Application No.: US18884459
Application Date: 2024-09-13
Applicant: Oracle International Corporation
Inventor: Shubham Pawankumar Shah , Syed Najam Abbas Zaidi , Xu Zhong , Poorya Zaremoodi , Srinivasa Phani Kumar Gadde , Arash Shamaei , Ganesh Kumar , Thanh Tien Vu , Nitika Mathur , Chang Xu , Shiquan Yang , Sagar Kalyan Gollamudi
Abstract: Techniques are disclosed for automatically generating Subjective, Objective, Assessment and Plan (SOAP) notes. Particularly, techniques are disclosed for training data collection and evaluation for automatic SOAP note generation. Training data is accessed, and an evaluation process is performed on the training data to produce evaluated training data. A fine-tuned machine-learning model is generated using the evaluated training data. The fine-tuned machine-learning model can be used to perform a task associated with generating a SOAP note.
-
Publication No.: US12056434B2
Publication Date: 2024-08-06
Application No.: US18150924
Application Date: 2023-01-06
Applicant: Oracle International Corporation
Inventor: Vishank Bhatia , Xu Zhong , Thanh Long Duong , Mark Johnson , Srinivasa Phani Kumar Gadde , Vishal Vishnoi , King-Hwa Lee , Christopher Kennewick
IPC: G06F40/117 , G06F16/9538 , G06F16/955 , G06F40/134 , G06F40/143 , G06F40/205 , G06T7/70
CPC classification number: G06F40/117 , G06F16/9538 , G06F16/9558 , G06F40/134 , G06F40/143 , G06F40/205 , G06T7/70 , G06T2207/30176
Abstract: Techniques for generating formatting tags for textual content obtained from a source electronic document are disclosed. A system parses a digital file to obtain information about characters in an electronic document. The system applies tags to text generated based on the textual content of the electronic document by creating segments of textually-consecutive characters and applying corresponding text formatting style tags to the segments. The system further identifies segments of text overlapping bounding boxes in the electronic document. The system generates textual content including a segment of text and a corresponding hyperlink associated with the segment of text. The system further generates textual content by selectively applying line breaks from the source electronic document in the textual content.
-
Publication No.: US20240169161A1
Publication Date: 2024-05-23
Application No.: US18452803
Application Date: 2023-08-21
Applicant: Oracle International Corporation
Inventor: Paria Jamshid Lou , Gioacchino Tangari , Jason Black , Bhagya Gayathri Hettige , Xu Zhong , Poorya Zaremoodi , Thanh Long Duong , Mark Edward Johnson
IPC: G06F40/40 , G06F40/284 , G06F40/289 , G10L15/06
CPC classification number: G06F40/40 , G06F40/284 , G06F40/289 , G10L15/063
Abstract: Techniques are provided for obtaining collections of sentences in different languages that are usable for training models in various applications of artificial intelligence. A method is provided that obtains, from a text corpus, webpages in a plurality of languages, each of the webpages corresponding to a URL; obtains annotations for each of the webpages based on its URL, to obtain annotated data entries corresponding to the webpages, each of the annotated data entries including a classification label corresponding to a sub-topic of one of a plurality of topics, where each of the plurality of topics includes a corresponding plurality of sub-topics; filters the annotated data entries to obtain topic-specific content in a target language based on the classification labels, the topic-specific content corresponding to one or more sub-topics; performs post-processing on the topic-specific content to obtain result data; and outputs the result data for the topic.
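Illustrative sketch (not taken from the patent): the filter and post-processing steps named in the abstract above could look roughly like the following. The `AnnotatedEntry` record, its field names, and the whitespace-normalizing, de-duplicating post-processing are all assumptions made for illustration; the abstract does not specify how annotations are derived from URLs or what post-processing is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedEntry:
    """Hypothetical annotated data entry for one webpage."""
    url: str
    language: str
    topic: str
    sub_topic: str  # the classification label
    text: str


def filter_entries(entries, target_language, topic, sub_topics):
    """Keep entries in the target language whose label is a wanted sub-topic."""
    wanted = set(sub_topics)
    return [
        e for e in entries
        if e.language == target_language
        and e.topic == topic
        and e.sub_topic in wanted
    ]


def post_process(entries):
    """Assumed post-processing: normalize whitespace, drop empty and duplicate text."""
    seen = set()
    result = []
    for e in entries:
        sentence = " ".join(e.text.split())
        if sentence and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


# Made-up sample data, for demonstration only.
SAMPLE = [
    AnnotatedEntry("https://example.com/a", "fr", "finance", "banking", "Ouvrir  un compte."),
    AnnotatedEntry("https://example.com/b", "fr", "finance", "banking", "Ouvrir un compte."),
    AnnotatedEntry("https://example.com/c", "fr", "finance", "insurance", "Assurance auto."),
    AnnotatedEntry("https://example.com/d", "de", "finance", "banking", "Konto eröffnen."),
    AnnotatedEntry("https://example.com/e", "fr", "travel", "hotels", "Réserver une chambre."),
]
```

Calling `post_process(filter_entries(SAMPLE, "fr", "finance", ["banking"]))` keeps only the French banking entries and collapses the two near-identical sentences into one.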
-
Publication No.: US20230139397A1
Publication Date: 2023-05-04
Application No.: US17819445
Application Date: 2022-08-12
Applicant: Oracle International Corporation
Inventor: Xu Zhong , Yakupitiyage Don Thanuja Samodhye Dharmasiri , Thanh Long Duong , Mark Edward Johnson
IPC: G06F40/35
Abstract: Deep learning techniques are disclosed for extraction of embedded data from documents. In an exemplary technique, a set of unstructured text data is received. One or more text groupings are generated by processing the set of unstructured text data. One or more text grouping embeddings are generated in a format for input to a machine learning model based on the one or more generated text groupings. One or more output predictions are generated by inputting the one or more text grouping embeddings into the machine learning model. Each output prediction of the one or more output predictions corresponds to a predicted aspect of a text grouping of the one or more text groupings.
-
Publication No.: US20230095673A1
Publication Date: 2023-03-30
Application No.: US17888300
Application Date: 2022-08-15
Applicant: Oracle International Corporation
Inventor: Yakupitiyage Don Thanuja Samodhye Dharmasiri , Xu Zhong , Ahmed Ataallah Ataallah Abobakr , Hongtao Yang , Budhaditya Saha , Shaoke Xu , Shashi Prasad Suravarapu , Mark Edward Johnson , Thanh Long Duong
IPC: G06V10/82 , G06V30/412 , G06V30/148
Abstract: Techniques for extracting key information from a document using machine-learning models in a chatbot system are disclosed herein. In one particular aspect, a method is provided that includes receiving a set of data, which includes key fields, within a document at a data processing system that includes a table detection module, a key information extraction module, and a table extraction module. Text information and corresponding location data are extracted via optical character recognition. The table detection module detects whether one or more tables are present in the document and, if applicable, a location of each of the tables. The key information extraction module extracts text from the key fields. The table extraction module extracts each of the tables based on input from the optical character recognition and the table detection module. Extraction results that include the text from the key fields and each of the tables can be output.
-
Publication No.: US12217497B2
Publication Date: 2025-02-04
Application No.: US17888300
Application Date: 2022-08-15
Applicant: Oracle International Corporation
Inventor: Yakupitiyage Don Thanuja Samodhye Dharmasiri , Xu Zhong , Ahmed Ataallah Ataallah Abobakr , Hongtao Yang , Budhaditya Saha , Shaoke Xu , Shashi Prasad Suravarapu , Mark Edward Johnson , Thanh Long Duong
IPC: G06V10/82 , G06V30/148 , G06V30/412
Abstract: Techniques for extracting key information from a document using machine-learning models in a chatbot system are disclosed herein. In one particular aspect, a method is provided that includes receiving a set of data, which includes key fields, within a document at a data processing system that includes a table detection module, a key information extraction module, and a table extraction module. Text information and corresponding location data are extracted via optical character recognition. The table detection module detects whether one or more tables are present in the document and, if applicable, a location of each of the tables. The key information extraction module extracts text from the key fields. The table extraction module extracts each of the tables based on input from the optical character recognition and the table detection module. Extraction results that include the text from the key fields and each of the tables can be output.
-
Publication No.: US20240338395A1
Publication Date: 2024-10-10
Application No.: US18298060
Application Date: 2023-04-10
Applicant: Oracle International Corporation
Inventor: Xu Zhong , Don Dharmasiri , Thanh Long Duong , Mark Johnson , Srinivasa Phani Kumar Gadde , Vishal Vishnoi
IPC: G06F16/332 , G06F40/205 , G06F40/284
CPC classification number: G06F16/3329 , G06F40/205 , G06F40/284
Abstract: Techniques for multi-layer training of a machine learning model are disclosed. A system pre-trains a machine learning model on training data obtained from unlabeled document graph data by executing unsupervised pre-training tasks on the unlabeled document graph data to generate a labeled pre-training data set. The system modifies document graphs to change attributes of nodes in the document graphs. The system pre-trains the machine learning model with a data set including the modified document graphs and unmodified document graphs to generate predictions associated with the modifications to the document graphs. Subsequent to pre-training, the system fine-tunes the machine learning model with a set of labeled training data to generate predictions associated with a specific attribute of a document graph.
-
Publication No.: US20240061989A1
Publication Date: 2024-02-22
Application No.: US18169740
Application Date: 2023-02-15
Applicant: Oracle International Corporation
Inventor: Xu Zhong , Vishank Bhatia , Thanh Long Duong , Mark Johnson , Srinivasa Phani Kumar Gadde , Vishal Vishnoi
IPC: G06F40/103 , G06F40/205 , G06F40/284 , G06F40/30
CPC classification number: G06F40/103 , G06F40/205 , G06F40/284 , G06F40/30
Abstract: Techniques for generating text content arranged in a consistent read order from a source document including text corresponding to different read orders are disclosed. A system parses a binary file representing an electronic document to identify characters and metadata associated with the characters. The system pre-sorts a character order of characters in each line of the electronic document to generate an ordered list of characters arranged according to the right-to-left reading order. The system performs a layout-mirroring operation to change a position of characters within the modified document relative to a right edge of the document and a left edge of the document. Subsequent to performing layout-mirroring, the system identifies native left-to-right reading-order text in-line with the native right-to-left reading-order text. The system flips the reading order of the native left-to-right read-order characters into the left-to-right reading order to be consistent with the native right-to-left read-order text.
-
Publication No.: US20230141853A1
Publication Date: 2023-05-11
Application No.: US18052694
Application Date: 2022-11-04
Applicant: Oracle International Corporation
Inventor: Thanh Tien Vu , Poorya Zaremoodi , Duy Vu , Mark Edward Johnson , Thanh Long Duong , Xu Zhong , Vladislav Blinov , Cong Duy Vu Hoang , Yu-Heng Hong , Vinamr Goel , Philip Victor Ogren , Srinivasa Phani Kumar Gadde , Vishal Vishnoi
IPC: G06F40/263 , G06F16/31
CPC classification number: G06F40/263 , G06F16/325 , H04L51/02
Abstract: Techniques disclosed herein relate generally to language detection. In one particular aspect, a method is provided that includes obtaining a sequence of n-grams of a textual unit; using an embedding layer to obtain an ordered plurality of embedding vectors for the sequence of n-grams; using a deep network to obtain an encoded vector that is based on the ordered plurality of embedding vectors; and using a classifier to obtain a language prediction for the textual unit that is based on the encoded vector. The deep network includes an attention mechanism, and using the embedding layer to obtain the ordered plurality of embedding vectors comprises, for each n-gram in the sequence of n-grams: obtaining hash values for the n-gram; based on the hash values, selecting component vectors from among the plurality of component vectors; and obtaining an embedding vector for the n-gram that is based on the component vectors.
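Illustrative sketch (not taken from the patent): the hashed embedding step described in the abstract above, where each n-gram is hashed, the hash values select component vectors from a shared table, and the selected vectors are combined into one embedding. The table size, dimension, number of hash functions, use of SHA-256, and combination by summation are all assumptions chosen for illustration. The table here is a deterministic stand-in for learned parameters, and the attention-based deep network and classifier are omitted.

```python
import hashlib

NUM_COMPONENTS = 64  # hypothetical size of the shared component-vector table
DIM = 8              # hypothetical embedding dimension
NUM_HASHES = 2       # hypothetical number of hash functions per n-gram


def char_ngrams(text, n=3):
    """Return the ordered sequence of character n-grams of a textual unit."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def make_component_table(num_components=NUM_COMPONENTS, dim=DIM):
    """Build a deterministic stand-in for a learned table of component vectors."""
    table = []
    for row in range(num_components):
        digest = hashlib.sha256(f"row-{row}".encode("utf-8")).digest()
        # Map the first `dim` bytes to floats in roughly [-1, 1].
        table.append([(b - 127.5) / 127.5 for b in digest[:dim]])
    return table


def hash_values(ngram, num_hashes=NUM_HASHES, num_components=NUM_COMPONENTS):
    """Obtain one table index per seeded hash function for the n-gram."""
    indices = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{ngram}".encode("utf-8")).digest()
        indices.append(int.from_bytes(digest[:8], "big") % num_components)
    return indices


def embed_ngram(ngram, table):
    """Select component vectors by hash value and sum them element-wise."""
    rows = [table[i] for i in hash_values(ngram)]
    return [sum(column) for column in zip(*rows)]


def embed_text(text, table, n=3):
    """Return the ordered embedding vectors for the n-grams of the text."""
    return [embed_ngram(gram, table) for gram in char_ngrams(text, n)]
```

Because the table is shared and indexed by hash, the vocabulary of n-grams never has to be enumerated: `embed_text("bonjour", make_component_table())` yields one vector per 3-gram, in order, ready to feed to a downstream encoder.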