-
Publication No.: US20230153687A1
Publication date: 2023-05-18
Application No.: US17984717
Filing date: 2022-11-10
Applicant: Oracle International Corporation
Inventor: Duy Vu, Varsha Kuppur Rajendra, Shivashankar Subramanian, Ahmed Ataallah Ataallah Abobakr, Thanh Long Duong, Mark Edward Johnson
CPC classification number: G06N20/00, G06K9/6259, G06K9/6262
Abstract: Techniques for named entity bias detection and mitigation for sentence sentiment analysis. In one particular aspect, a method is provided that includes obtaining a training set of labeled examples for training a machine learning model to classify sentiment, preparing a list of named entities using one or more data sources, for each example in the training set of labeled examples with a named entity, replacing the named entity with a corresponding entity type tag to generate a labeled template data set, executing a sampling process for each entity type t within the labeled template data set to generate an augmented invariance data set comprising one or more invariance groups having labeled examples for each entity type t, and training the machine learning model using labeled examples from the augmented invariance data set.
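The template and sampling steps in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the `<PERSON>` tag format, the entity list, and the plain string replacement are all assumptions made for illustration.

```python
import random

def to_template(text, entities):
    """Replace each known named entity with its entity type tag, e.g. <PERSON>."""
    for name, etype in entities.items():
        text = text.replace(name, f"<{etype}>")  # naive substring match (assumption)
    return text

def invariance_group(template, label, etype, candidates, k, rng):
    """Fill the tag with k sampled entities; every variant keeps the original label."""
    tag = f"<{etype}>"
    if tag not in template:
        return []
    sampled = rng.sample(candidates, k)
    return [(template.replace(tag, name), label) for name in sampled]

# Hypothetical entity list and candidate pool for the PERSON type.
entities = {"Alice": "PERSON"}
people = ["Alice", "Bao", "Chidi", "Dmitri"]

template = to_template("Alice gave a great talk", entities)
group = invariance_group(template, "positive", "PERSON", people, 3, random.Random(0))
```

Each tuple in `group` is a sentence that differs only in the named entity and shares one sentiment label, which is what an invariance group is meant to capture.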
-
Publication No.: US20230100508A1
Publication date: 2023-03-30
Application No.: US17936679
Filing date: 2022-09-29
Applicant: Oracle International Corporation
Inventor: Ahmed Ataallah Ataallah Abobakr, Mark Edward Johnson, Thanh Long Duong, Vladislav Blinov, Yu-Heng Hong, Cong Duy Vu Hoang, Duy Vu
IPC: G06F40/295, G06F40/205, G06F40/263
Abstract: Techniques disclosed herein relate generally to text classification and include techniques for fusing word embeddings with word scores for text classification. In one particular aspect, a method for text classification is provided that includes obtaining an embedding vector for a textual unit, based on a plurality of word embedding vectors and a plurality of word scores. The plurality of word embedding vectors includes a corresponding word embedding vector for each of a plurality of words of the textual unit, and the plurality of word scores includes a corresponding word score for each of the plurality of words of the textual unit. The method also includes passing the embedding vector for the textual unit through at least one feed-forward layer to obtain a final layer output, and performing a classification on the final layer output.
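The abstract does not say how the word embeddings and word scores are combined. The sketch below assumes one simple option, a score-weighted average, followed by a single dense layer and an argmax. All names, the weights, and the choice of ReLU are hypothetical.

```python
def fuse(word_vecs, word_scores):
    """Score-weighted average of word vectors (one assumed way to fuse them)."""
    total = sum(word_scores)  # assumes the scores are positive
    dim = len(word_vecs[0])
    return [
        sum(s * v[i] for v, s in zip(word_vecs, word_scores)) / total
        for i in range(dim)
    ]

def feed_forward(x, weights, bias):
    """One dense layer with ReLU activation."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

def classify(layer_output):
    """Return the index of the largest output as the predicted class."""
    return max(range(len(layer_output)), key=lambda i: layer_output[i])

# Two words with toy 2-dimensional embeddings; the first word has the higher score.
unit = fuse([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
out = feed_forward(unit, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
predicted = classify(out)
```

Here the higher-scored word dominates the embedding of the textual unit, so its dimension wins the classification.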
-
Publication No.: US20210303798A1
Publication date: 2021-09-30
Application No.: US17217909
Filing date: 2021-03-30
Applicant: Oracle International Corporation
Inventor: Thanh Long Duong, Mark Edward Johnson, Vishal Vishnoi, Crystal C. Pan, Vladislav Blinov, Cong Duy Vu Hoang, Elias Luqman Jalaluddin, Duy Vu, Balakota Srinivas Vinnakota
IPC: G06F40/30, G06F40/289, H04L12/58, G06N20/00
Abstract: The present disclosure relates to techniques for identifying out-of-domain utterances. One particular technique includes receiving an utterance and a target domain of a chatbot, generating a sentence embedding for the utterance, obtaining an embedding representation for each cluster of in-domain utterances associated with the target domain, predicting, using a metric learning model, a first probability that the utterance belongs to the target domain based on a similarity or difference between the sentence embedding and each embedding representation for each cluster, predicting, using an outlier detection model, a second probability that the utterance belongs to the target domain based on a determined distance or density deviation between the sentence embedding and embedding representations for neighboring clusters, evaluating the first probability and the second probability to determine a final probability, and classifying the utterance as in-domain or out-of-domain for the chatbot based on the final probability.
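As a rough illustration of combining two probabilities into a final in-domain decision, the sketch below uses simple stand-ins for both models: cosine similarity to cluster centroids in place of the metric learning model, and an exponential decay over the distance to the nearest centroid in place of the outlier detection model. The averaging rule, the `scale` and `threshold` parameters, and all function names are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def p_metric(emb, centroids):
    """Stand-in for the metric learning model: best cosine similarity mapped to [0, 1]."""
    return (max(cosine(emb, c) for c in centroids) + 1) / 2

def p_outlier(emb, centroids, scale):
    """Stand-in for the outlier model: decays with distance to the nearest centroid."""
    nearest = min(math.dist(emb, c) for c in centroids)
    return math.exp(-nearest / scale)

def is_in_domain(emb, centroids, scale=1.0, threshold=0.5):
    """Average the two probabilities (assumed rule) and compare with a threshold."""
    final = 0.5 * (p_metric(emb, centroids) + p_outlier(emb, centroids, scale))
    return final >= threshold, final

# Toy centroids for two clusters of in-domain utterances.
centroids = [[1.0, 0.0], [0.0, 1.0]]
near = is_in_domain([1.0, 0.0], centroids)
far = is_in_domain([-5.0, -5.0], centroids)
```

An embedding that coincides with a centroid is classified as in-domain, while one far from every cluster is classified as out-of-domain.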
-
Publication No.: US20250095636A1
Publication date: 2025-03-20
Application No.: US18823371
Filing date: 2024-09-03
Applicant: Oracle International Corporation
Inventor: Duy Vu, Yu-Heng Hong, Ying Xu, Philip Arthur
Abstract: Techniques are disclosed herein for improving the performance of an end-to-end (E2E) Automatic Speech Recognition (ASR) model in a target domain. A set of test examples is generated. The set of test examples comprises multiple subsets of test examples, and each subset of test examples corresponds to a particular test category. A machine language model is then used to convert audio samples of the subset of test examples to text transcripts. A word error rate is determined for the subset of test examples. A test category is then selected based on the word error rates, and a set of training examples is generated for training the ASR model in a particular target domain from a selected subset of test examples. The training examples are used to fine-tune the model in the target domain. The trained model is then deployed in a cloud infrastructure of a cloud service provider.
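Word error rate is conventionally the word-level edit distance between the reference and the transcript, divided by the number of reference words. The sketch below computes it per test category and picks the category with the highest average rate. The abstract only says a category is selected "based on the word error rates", so choosing the worst one is an assumption, as are the function names and the sample data.

```python
def wer(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def worst_category(results):
    """results maps a category to a list of (reference, hypothesis) pairs."""
    rates = {
        cat: sum(wer(r, h) for r, h in pairs) / len(pairs)
        for cat, pairs in results.items()
    }
    return max(rates, key=rates.get), rates

# Hypothetical test categories with one transcribed example each.
results = {
    "numbers": [("call five five", "call by five")],
    "names": [("call anna", "call anna")],
}
selected, rates = worst_category(results)
```

In this toy run the "numbers" category has one substitution in three words, so it would be the one selected for generating fine-tuning examples.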
-
Publication No.: US20240143934A1
Publication date: 2024-05-02
Application No.: US18485700
Filing date: 2023-10-12
Applicant: Oracle International Corporation
Inventor: Poorya Zaremoodi, Duy Vu, Nagaraj N. Bhat, Srijon Sarkar, Varsha Kuppur Rajendra, Thanh Long Duong, Mark Edward Johnson, Pramir Sarkar, Shahid Reza
IPC: G06F40/30, G06F40/284, G06F40/289
CPC classification number: G06F40/30, G06F40/284, G06F40/289
Abstract: A method includes accessing a document including sentences, the document being associated with a configuration flag indicating whether ABSA, SLSA, or both are to be performed; inputting the document into a language model that generates chunks of token embeddings for the document; and, based on the configuration flag, performing at least one of the ABSA and the SLSA by inputting the chunks of token embeddings into a multi-task model. When performing the SLSA, a part of the token embeddings in each of the chunks is masked, and the masked token embeddings do not belong to the particular sentence on which the SLSA is performed.
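The masking step for SLSA can be pictured as follows: within a chunk, only the tokens of the sentence being scored stay visible. This sketch assumes each token carries a sentence index and that masking means zeroing the embedding; the abstract specifies neither, and the function names are hypothetical.

```python
def sentence_mask(sentence_ids, target):
    """1 for tokens of the target sentence, 0 for every other token in the chunk."""
    return [1 if s == target else 0 for s in sentence_ids]

def apply_mask(token_embeddings, mask):
    """Zero out the embeddings of masked tokens (assumed form of masking)."""
    return [[x * m for x in vec] for vec, m in zip(token_embeddings, mask)]

# A chunk of four tokens: the first two belong to sentence 0, the last two to sentence 1.
sentence_ids = [0, 0, 1, 1]
chunk = [[0.5, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

mask = sentence_mask(sentence_ids, target=1)
masked_chunk = apply_mask(chunk, mask)
```

Only the embeddings of sentence 1 survive in `masked_chunk`, so a downstream sentence-level head would see that sentence alone.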
-
Publication No.: US20240095454A1
Publication date: 2024-03-21
Application No.: US18521805
Filing date: 2023-11-28
Applicant: Oracle International Corporation
Inventor: Duy Vu, Tuyen Quang Pham, Cong Duy Vu Hoang, Srinivasa Phani Kumar Gadde, Thanh Long Duong, Mark Edward Johnson, Vishal Vishnoi
IPC: G06F40/295, G06F40/205, G06F40/279, G06F40/35, G06F40/40, G06V30/19
CPC classification number: G06F40/295, G06F40/205, G06F40/279, G06F40/35, G06F40/40, G06V30/19147
Abstract: Techniques are provided for using context tags in named-entity recognition (NER) models. In one particular aspect, a method is provided that includes receiving an utterance, generating embeddings for words of the utterance, generating a regular expression and gazetteer feature vector for the utterance, generating a context tag distribution feature vector for the utterance, concatenating or interpolating the embeddings with the regular expression and gazetteer feature vector and the context tag distribution feature vector to generate a set of feature vectors, generating an encoded form of the utterance based on the set of feature vectors, generating log-probabilities based on the encoded form of the utterance, and identifying one or more constraints for the utterance.
-
Publication No.: US20230376696A1
Publication date: 2023-11-23
Application No.: US18364298
Filing date: 2023-08-02
Applicant: Oracle International Corporation
Inventor: Thanh Long Duong, Mark Edward Johnson, Vishal Vishnoi, Crystal C. Pan, Vladislav Blinov, Cong Duy Vu Hoang, Elias Luqman Jalaluddin, Duy Vu, Balakota Srinivas Vinnakota
IPC: G06F40/30, G06N20/00, G06F40/289, H04L51/02
CPC classification number: G06F40/30, G06N20/00, G06F40/289, H04L51/02, G06F40/205
Abstract: The present disclosure relates to techniques for identifying out-of-domain utterances. One particular technique includes receiving an utterance and a target domain of a chatbot, generating a sentence embedding for the utterance, obtaining an embedding representation for each cluster of in-domain utterances associated with the target domain, predicting, using a metric learning model, a first probability that the utterance belongs to the target domain based on a similarity or difference between the sentence embedding and each embedding representation for each cluster, predicting, using an outlier detection model, a second probability that the utterance belongs to the target domain based on a determined distance or density deviation between the sentence embedding and embedding representations for neighboring clusters, evaluating the first probability and the second probability to determine a final probability, and classifying the utterance as in-domain or out-of-domain for the chatbot based on the final probability.
-
Publication No.: US20230141853A1
Publication date: 2023-05-11
Application No.: US18052694
Filing date: 2022-11-04
Applicant: Oracle International Corporation
Inventor: Thanh Tien Vu, Poorya Zaremoodi, Duy Vu, Mark Edward Johnson, Thanh Long Duong, Xu Zhong, Vladislav Blinov, Cong Duy Vu Hoang, Yu-Heng Hong, Vinamr Goel, Philip Victor Ogren, Srinivasa Phani Kumar Gadde, Vishal Vishnoi
IPC: G06F40/263, G06F16/31
CPC classification number: G06F40/263, G06F16/325, H04L51/02
Abstract: Techniques disclosed herein relate generally to language detection. In one particular aspect, a method is provided that includes obtaining a sequence of n-grams of a textual unit; using an embedding layer to obtain an ordered plurality of embedding vectors for the sequence of n-grams; using a deep network to obtain an encoded vector that is based on the ordered plurality of embedding vectors; and using a classifier to obtain a language prediction for the textual unit that is based on the encoded vector. The deep network includes an attention mechanism, and using the embedding layer to obtain the ordered plurality of embedding vectors comprises, for each n-gram in the sequence of n-grams: obtaining hash values for the n-gram; based on the hash values, selecting component vectors from among the plurality of component vectors; and obtaining an embedding vector for the n-gram that is based on the component vectors.
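The hashed embedding lookup in this abstract can be sketched as follows: each n-gram is hashed several times, each hash value selects one component vector from a shared table, and the selected vectors are combined. Summing the components, using salted SHA-256 as the hash, and all names and sizes below are assumptions. The attention-based deep network and the classifier are omitted.

```python
import hashlib

def char_ngrams(text, n=3):
    """Sequence of character n-grams of a textual unit."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hash_values(ngram, num_hashes, table_size):
    """Derive num_hashes table indices by salting a stable hash with the hash index."""
    return [
        int.from_bytes(
            hashlib.sha256(f"{k}:{ngram}".encode("utf-8")).digest()[:8], "big"
        ) % table_size
        for k in range(num_hashes)
    ]

def embed(ngram, table, num_hashes=2):
    """Select component vectors by hash value and sum them (assumed combination)."""
    rows = [table[h] for h in hash_values(ngram, num_hashes, len(table))]
    return [sum(col) for col in zip(*rows)]

# Hypothetical table of 16 component vectors with 2 dimensions each.
table = [[float(i), float(i) * 2.0] for i in range(16)]
ngrams = char_ngrams("hello")
vectors = [embed(g, table) for g in ngrams]
```

Because the table is shared across all n-grams, its size stays fixed no matter how many distinct n-grams appear, at the cost of occasional hash collisions.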
-
Publication No.: US20220229993A1
Publication date: 2022-07-21
Application No.: US17648376
Filing date: 2022-01-19
Applicant: Oracle International Corporation
Inventor: Duy Vu, Tuyen Quang Pham, Cong Duy Vu Hoang, Srinivasa Phani Kumar Gadde, Thanh Long Duong, Mark Edward Johnson, Vishal Vishnoi
IPC: G06F40/295, G06F40/205, G06F40/35, G06F40/40, G06V30/19
Abstract: Techniques are provided for using context tags in named-entity recognition (NER) models. In one particular aspect, a method is provided that includes receiving an utterance, generating embeddings for words of the utterance, generating a regular expression and gazetteer feature vector for the utterance, generating a context tag distribution feature vector for the utterance, concatenating or interpolating the embeddings with the regular expression and gazetteer feature vector and the context tag distribution feature vector to generate a set of feature vectors, generating an encoded form of the utterance based on the set of feature vectors, generating log-probabilities based on the encoded form of the utterance, and identifying one or more constraints for the utterance.