Patent search: ap:("Oracle International Corporation") AND inv:"Budhaditya Saha"

1.

Invention application
SYSTEM AND TECHNIQUES FOR HANDLING LONG TEXT FOR PRE-TRAINED LANGUAGE MODELS (In force)

Publication (announcement) No.: US20250117585A1

Publication (announcement) date: 2025-04-10

Application No.: US18987825

Application date: 2024-12-19

Applicant: Oracle International Corporation

Inventor: Thanh Tien Vu, Tuyen Quang Pham, Mark Edward Johnson, Thanh Long Duong, Ying Xu, Poorya Zaremoodi, Omid Mohamad Nezami, Budhaditya Saha, Cong Duy Vu Hoang

IPC: G06F40/295, G06F40/284, H04L51/02

Abstract: In some aspects, a computing device may receive, at a data processing system, a set of utterances for training or inferencing with a named entity recognizer to assign a label to each token piece from the set of utterances. The computing device may determine a length of each utterance in the set and, when the length of the utterance exceeds a pre-determined threshold of token pieces: divide the utterance into a plurality of overlapping chunks of token pieces; assign a label together with a confidence score for each token piece in a chunk; determine a final label and an associated confidence score for each chunk of token pieces by merging two confidence scores; determine a final annotated label for the utterance based at least on the merging of the two confidence scores; and store the final annotated label in a memory.
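Illustrative sketch (not taken from the patent): the abstract describes splitting an over-long utterance into overlapping chunks, labeling each token piece with a confidence score, and merging the scores where chunks overlap. The Python below shows one way this could look. The abstract does not say how the two scores are merged, so keeping the higher-confidence label is an assumption here; `split_into_chunks`, `label_long_utterance`, `toy_tagger`, and the `max_len` and `overlap` parameters are hypothetical names chosen for the example.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (label, confidence score)


def split_into_chunks(tokens: List[str], max_len: int, overlap: int) -> List[Tuple[int, List[str]]]:
    """Return (start_index, chunk) pairs of overlapping windows that cover tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(tokens) <= max_len:
        return [(0, tokens)]  # short utterance: no chunking needed
    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += step
    return chunks


def label_long_utterance(
    tokens: List[str],
    tagger: Callable[[List[str]], List[Prediction]],
    max_len: int = 4,
    overlap: int = 2,
) -> List[Prediction]:
    """Label every token; in overlaps keep the higher-confidence prediction (assumed merge rule)."""
    best: List[Optional[Prediction]] = [None] * len(tokens)
    for start, chunk in split_into_chunks(tokens, max_len, overlap):
        for offset, (label, score) in enumerate(tagger(chunk)):
            i = start + offset
            if best[i] is None or score > best[i][1]:
                best[i] = (label, score)
    return [p for p in best if p is not None]


def toy_tagger(chunk: List[str]) -> List[Prediction]:
    """Hypothetical stand-in for a real NER model: capitalized tokens are entities,
    and confidence is higher near the middle of the chunk."""
    center = (len(chunk) - 1) / 2
    return [
        ("ENT" if tok[:1].isupper() else "O", 1.0 - 0.1 * abs(i - center))
        for i, tok in enumerate(chunk)
    ]


tokens = "please book a flight from San Francisco to New York".split()
for token, (label, score) in zip(tokens, label_long_utterance(tokens, toy_tagger)):
    print(f"{token}\t{label}\t{score:.2f}")
```

Other merge rules, such as averaging the two scores per label, would fit the same structure; only the comparison inside `label_long_utterance` would change.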

2.

Granted invention patent
Extracting key information from document using trained machine-learning models (In force)

Publication (announcement) No.: US12217497B2

Publication (announcement) date: 2025-02-04

Application No.: US17888300

Application date: 2022-08-15

Applicant: Oracle International Corporation

Inventor: Yakupitiyage Don Thanuja Samodhye Dharmasiri, Xu Zhong, Ahmed Ataallah Ataallah Abobakr, Hongtao Yang, Budhaditya Saha, Shaoke Xu, Shashi Prasad Suravarapu, Mark Edward Johnson, Thanh Long Duong

IPC: G06V10/82, G06V30/148, G06V30/412

Abstract: Techniques for extracting key information from a document using machine-learning models in a chatbot system are disclosed herein. In one particular aspect, a method is provided that includes receiving a set of data, which includes key fields, within a document at a data processing system that includes a table detection module, a key information extraction module, and a table extraction module. Text information and corresponding location data are extracted via optical character recognition. The table detection module detects whether one or more tables are present in the document and, if applicable, a location of each of the tables. The key information extraction module extracts text from the key fields. The table extraction module extracts each of the tables based on input from the optical character recognition and the table detection module. Extraction results, which include the text from the key fields and each of the tables, can be output.

3.

Invention application
FRAMEWORK FOR FOCUSED TRAINING OF LANGUAGE MODELS AND TECHNIQUES FOR END-TO-END HYPERTUNING OF THE FRAMEWORK (In force)

Publication (announcement) No.: US20230098783A1

Publication (announcement) date: 2023-03-30

Application No.: US17952116

Application date: 2022-09-23

Applicant: Oracle International Corporation

Inventor: Poorya Zaremoodi, Cong Duy Vu Hoang, Duy Vu, Dai Hoang Tran, Budhaditya Saha, Nagaraj N. Bhat, Thanh Tien Vu, Tuyen Quang Pham, Adam Craig Pocock, Katherine Silverstein, Srinivasa Phani Kumar Gadde, Vishal Vishnoi, Mark Edward Johnson, Thanh Long Duong

IPC: G10L15/06, G10L15/183

Abstract: Techniques are disclosed herein for focused training of language models and end-to-end hypertuning of the framework. In one aspect, a method is provided that includes obtaining a machine learning model pre-trained for language modeling, and post-training the machine learning model for various tasks to generate a focused machine learning model. The post-training includes: (i) training the machine learning model on an unlabeled set of training data pertaining to a task that the machine learning model was pre-trained for as part of the language modeling, and the unlabeled set of training data is obtained with respect to a target domain, a target task, or a target language, and (ii) training the machine learning model on a labeled set of training data that pertains to another task that is an auxiliary task related to a downstream task to be performed using the machine learning model or output from the machine learning model.

4.

Invention application
EXTRACTING KEY INFORMATION FROM DOCUMENT USING TRAINED MACHINE-LEARNING MODELS (In force)

Publication (announcement) No.: US20230095673A1

Publication (announcement) date: 2023-03-30

Application No.: US17888300

Application date: 2022-08-15

Applicant: Oracle International Corporation

Inventor: Yakupitiyage Don Thanuja Samodhye Dharmasiri, Xu Zhong, Ahmed Ataallah Ataallah Abobakr, Hongtao Yang, Budhaditya Saha, Shaoke Xu, Shashi Prasad Suravarapu, Mark Edward Johnson, Thanh Long Duong

IPC: G06V10/82, G06V30/412, G06V30/148

Abstract: Techniques for extracting key information from a document using machine-learning models in a chatbot system are disclosed herein. In one particular aspect, a method is provided that includes receiving a set of data, which includes key fields, within a document at a data processing system that includes a table detection module, a key information extraction module, and a table extraction module. Text information and corresponding location data are extracted via optical character recognition. The table detection module detects whether one or more tables are present in the document and, if applicable, a location of each of the tables. The key information extraction module extracts text from the key fields. The table extraction module extracts each of the tables based on input from the optical character recognition and the table detection module. Extraction results, which include the text from the key fields and each of the tables, can be output.

5.

Invention application
EXTRACTING KEY INFORMATION FROM DOCUMENT USING TRAINED MACHINE-LEARNING MODELS (In force)

Publication (announcement) No.: US20250157209A1

Publication (announcement) date: 2025-05-15

Application No.: US19002208

Application date: 2024-12-26

Applicant: Oracle International Corporation

Inventor: Yakupitiyage Don Thanuja Samodhye Dharmasiri, Xu Zhong, Ahmed Ataallah Ataallah Abobakr, Hongtao Yang, Budhaditya Saha, Shaoke Xu, Shashi Prasad Suravarapu, Mark Edward Johnson, Thanh Long Duong

IPC: G06V10/82, G06V30/148, G06V30/412

Abstract: Techniques for extracting key information from a document using machine-learning models in a chatbot system are disclosed herein. In one particular aspect, a method is provided that includes receiving a set of data, which includes key fields, within a document at a data processing system that includes a table detection module, a key information extraction module, and a table extraction module. Text information and corresponding location data are extracted via optical character recognition. The table detection module detects whether one or more tables are present in the document and, if applicable, a location of each of the tables. The key information extraction module extracts text from the key fields. The table extraction module extracts each of the tables based on input from the optical character recognition and the table detection module. Extraction results, which include the text from the key fields and each of the tables, can be output.

6.

Granted invention patent
Framework for focused training of language models and techniques for end-to-end hypertuning of the framework (In force)

Publication (announcement) No.: US12288550B2

Publication (announcement) date: 2025-04-29

Application No.: US17952116

Application date: 2022-09-23

Applicant: Oracle International Corporation

Inventor: Poorya Zaremoodi, Cong Duy Vu Hoang, Duy Vu, Dai Hoang Tran, Budhaditya Saha, Nagaraj N. Bhat, Thanh Tien Vu, Tuyen Quang Pham, Adam Craig Pocock, Katherine Silverstein, Srinivasa Phani Kumar Gadde, Vishal Vishnoi, Mark Edward Johnson, Thanh Long Duong

IPC: G10L15/06, G10L15/183

Abstract: Techniques are disclosed herein for focused training of language models and end-to-end hypertuning of the framework. In one aspect, a method is provided that includes obtaining a machine learning model pre-trained for language modeling, and post-training the machine learning model for various tasks to generate a focused machine learning model. The post-training includes: (i) training the machine learning model on an unlabeled set of training data pertaining to a task that the machine learning model was pre-trained for as part of the language modeling, and the unlabeled set of training data is obtained with respect to a target domain, a target task, or a target language, and (ii) training the machine learning model on a labeled set of training data that pertains to another task that is an auxiliary task related to a downstream task to be performed using the machine learning model or output from the machine learning model.

7.

Granted invention patent
System and techniques for handling long text for pre-trained language models (In force)

Publication (announcement) No.: US12210830B2

Publication (announcement) date: 2025-01-28

Application No.: US17750240

Application date: 2022-05-20

Applicant: Oracle International Corporation

Inventor: Thanh Tien Vu, Tuyen Quang Pham, Mark Edward Johnson, Thanh Long Duong, Ying Xu, Poorya Zaremoodi, Omid Mohamad Nezami, Budhaditya Saha, Cong Duy Vu Hoang

IPC: G06F40/30, G06F40/169, G06F40/284, G06F40/295

Abstract: In some aspects, a computing device may receive, at a data processing system, a set of utterances for training or inferencing with a named entity recognizer to assign a label to each token piece from the set of utterances. The computing device may determine a length of each utterance in the set and, when the length of the utterance exceeds a pre-determined threshold of token pieces: divide the utterance into a plurality of overlapping chunks of token pieces; assign a label together with a confidence score for each token piece in a chunk; determine a final label and an associated confidence score for each chunk of token pieces by merging two confidence scores; determine a final annotated label for the utterance based at least on the merging of the two confidence scores; and store the final annotated label in a memory.

8.

Invention publication
ADAPTIVE TRAINING DATA AUGMENTATION TO FACILITATE TRAINING NAMED ENTITY RECOGNITION MODELS (Under examination, published)

Publication (announcement) No.: US20240062112A1

Publication (announcement) date: 2024-02-22

Application No.: US18450678

Application date: 2023-08-16

Applicant: Oracle International Corporation

Inventor: Omid Mohamad Nezami, Thanh Tien Vu, Budhaditya Saha, Shubham Pawankumar Shah

IPC: G06N20/00, G06F40/295

CPC classification number: G06N20/00, G06F40/295, G10L15/1815

Abstract: Techniques are disclosed herein for adaptive training data augmentation to facilitate training named entity recognition (NER) models. Adaptive augmentation techniques are disclosed herein that take into consideration the distribution of different entity types within training data. The adaptive augmentation techniques generate adaptive numbers of augmented examples (e.g., utterances) based on the distribution of entities to ensure that enough examples for minority-class entities are generated during augmentation of the training data.
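Illustrative sketch (not taken from the patent): the abstract says the number of augmented examples adapts to the distribution of entity types so that minority classes get enough examples. One simple way to express that idea in Python is to compute a per-type budget that tops each type up toward the count of the most frequent type. The balancing target, the `augmentation_budget` function, and the optional `max_per_type` cap are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Each example is (utterance text, list of entity types that occur in it).
Example = Tuple[str, List[str]]


def augmentation_budget(
    examples: List[Example],
    max_per_type: Optional[int] = None,
) -> Dict[str, int]:
    """Return how many augmented examples to generate for each entity type.

    Assumed rule: raise every type toward the count of the most frequent type,
    optionally capped at max_per_type new examples per type.
    """
    counts = Counter(etype for _, etypes in examples for etype in etypes)
    if not counts:
        return {}
    target = max(counts.values())
    budget = {}
    for etype, n in counts.items():
        need = target - n  # minority types get a larger budget
        if max_per_type is not None:
            need = min(need, max_per_type)
        budget[etype] = need
    return budget


training = [
    ("book a flight to Paris", ["CITY"]),
    ("fly to Rome", ["CITY"]),
    ("from Oslo to Bern", ["CITY", "CITY"]),
    ("pay Alice", ["PERSON"]),
]
print(augmentation_budget(training))
```

The budget would then drive whichever generation step is used, so that the rare type receives more new utterances than the common one.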

9.

Invention publication
TRAINING DATA AUGMENTATION USING GAZETTEERS AND PERTURBATIONS TO FACILITATE TRAINING NAMED ENTITY RECOGNITION MODELS (Under examination, published)

Publication (announcement) No.: US20230325599A1

Publication (announcement) date: 2023-10-12

Application No.: US18185675

Application date: 2023-03-17

Applicant: Oracle International Corporation

Inventor: Omid Mohamad Nezami, Shivashankar Subramanian, Thanh Tien Vu, Tuyen Quang Pham, Budhaditya Saha, Aashna Devang Kanuga, Shubham Pawankumar Shah

IPC: G06F40/295, G06N3/006

CPC classification number: G06F40/295, G06N3/006

Abstract: Techniques are provided for augmenting training data using gazetteers and perturbations to facilitate training named entity recognition models. The training data can be augmented by generating additional utterances from original utterances in the training data and combining the generated additional utterances with the original utterances to form the augmented training data. The additional utterances can be generated by replacing the named entities in the original utterances with different named entities and/or perturbed versions of the named entities in the original utterances selected from a gazetteer. Gazetteers of named entities can be generated from the training data and expanded by searching a knowledge base and/or perturbing the named entities therein. The named entity recognition model can be trained using the augmented training data.
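Illustrative sketch (not taken from the patent): the abstract describes building gazetteers from the training data and generating new utterances by swapping named entities for other gazetteer entries or perturbed versions of them. The Python below is a minimal version of that idea under stated assumptions: entities are given as character spans `(start, end, type)`, the perturbation is a single adjacent-character swap, and knowledge-base expansion is left out. The names `build_gazetteer`, `perturb`, `augment`, and `perturb_prob` are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def build_gazetteer(data: List[Tuple[str, List[Span]]]) -> Dict[str, Set[str]]:
    """Collect the surface forms seen for each entity type in the training data."""
    gazetteer: Dict[str, Set[str]] = {}
    for text, spans in data:
        for start, end, etype in spans:
            gazetteer.setdefault(etype, set()).add(text[start:end])
    return gazetteer


def perturb(name: str, rng: random.Random) -> str:
    """Swap two adjacent characters (a simple, assumed perturbation)."""
    if len(name) < 2:
        return name
    i = rng.randrange(len(name) - 1)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


def augment(
    text: str,
    spans: List[Span],
    gazetteer: Dict[str, Set[str]],
    rng: random.Random,
    perturb_prob: float = 0.3,
) -> Tuple[str, List[Span]]:
    """Replace each entity with another gazetteer entry of the same type,
    optionally perturbed, and return the new text with updated spans."""
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0   # position in the original text
    out_len = 0  # length of the text built so far
    for start, end, etype in sorted(spans):
        original = text[start:end]
        candidates = sorted(gazetteer.get(etype, set()) - {original}) or [original]
        replacement = rng.choice(candidates)
        if rng.random() < perturb_prob:
            replacement = perturb(replacement, rng)
        prefix = text[cursor:start]
        pieces.append(prefix)
        out_len += len(prefix)
        new_spans.append((out_len, out_len + len(replacement), etype))
        pieces.append(replacement)
        out_len += len(replacement)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans


data = [("fly to Paris", [(7, 12, "CITY")]), ("fly to Rome", [(7, 11, "CITY")])]
gaz = build_gazetteer(data)
print(augment(*data[0], gaz, random.Random(0)))
```

The generated utterances would be combined with the originals to form the augmented training set, as the abstract describes.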

10.

Invention publication
SYSTEM AND TECHNIQUES FOR HANDLING LONG TEXT FOR PRE-TRAINED LANGUAGE MODELS (Under examination, published)

Publication (announcement) No.: US20230161963A1

Publication (announcement) date: 2023-05-25

Application No.: US17750240

Application date: 2022-05-20

Applicant: Oracle International Corporation

Inventor: Thanh Tien Vu, Tuyen Quang Pham, Mark Edward Johnson, Thanh Long Duong, Ying Xu, Poorya Zaremoodi, Omid Mohamad Nezami, Budhaditya Saha, Cong Duy Vu Hoang

IPC: G06F40/295, G06F40/284, G06F40/169

CPC classification number: G06F40/295, G06F40/284, G06F40/169

Abstract: In some aspects, a computing device may receive, at a data processing system, a set of utterances for training or inferencing with a named entity recognizer to assign a label to each token piece from the set of utterances. The computing device may determine a length of each utterance in the set and, when the length of the utterance exceeds a pre-determined threshold of token pieces: divide the utterance into a plurality of overlapping chunks of token pieces; assign a label together with a confidence score for each token piece in a chunk; determine a final label and an associated confidence score for each chunk of token pieces by merging two confidence scores; determine a final annotated label for the utterance based at least on the merging of the two confidence scores; and store the final annotated label in a memory.
