Publication No.: US20230141853A1
Publication Date: 2023-05-11
Application No.: US18052694
Filing Date: 2022-11-04
Applicant: Oracle International Corporation
Inventors: Thanh Tien Vu, Poorya Zaremoodi, Duy Vu, Mark Edward Johnson, Thanh Long Duong, Xu Zhong, Vladislav Blinov, Cong Duy Vu Hoang, Yu-Heng Hong, Vinamr Goel, Philip Victor Ogren, Srinivasa Phani Kumar Gadde, Vishal Vishnoi
IPC: G06F40/263, G06F16/31
CPC classification number: G06F40/263, G06F16/325, H04L51/02
Abstract: Techniques disclosed herein relate generally to language detection. In one particular aspect, a method is provided that includes obtaining a sequence of n-grams of a textual unit; using an embedding layer to obtain an ordered plurality of embedding vectors for the sequence of n-grams; using a deep network to obtain an encoded vector that is based on the ordered plurality of embedding vectors; and using a classifier to obtain a language prediction for the textual unit that is based on the encoded vector. The deep network includes an attention mechanism, and using the embedding layer to obtain the ordered plurality of embedding vectors comprises, for each n-gram in the sequence of n-grams: obtaining hash values for the n-gram; based on the hash values, selecting component vectors from among the plurality of component vectors; and obtaining an embedding vector for the n-gram that is based on the component vectors.
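The pipeline described in the abstract (n-grams, hash-based embedding, attention-based encoder, classifier) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration and is not taken from the patent: character trigrams, salted BLAKE2b hashes, summing the selected component vectors, a single self-attention layer with mean pooling standing in for the deep network, a linear softmax classifier, and all names (`char_ngrams`, `hash_values`, `HashEmbeddingLanguageDetector`). The weights are random and untrained, so the predictions carry no meaning; the sketch only shows how data might flow through the stages.

```python
import hashlib
import numpy as np


def char_ngrams(text, n=3):
    """Return the sequence of character n-grams of a space-padded textual unit."""
    padded = f" {text} "
    if len(padded) < n:  # very short input: fall back to a single unit
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hash_values(ngram, num_hashes, num_buckets):
    """Obtain several hash values for one n-gram (assumed: salted BLAKE2b)."""
    values = []
    for seed in range(num_hashes):
        digest = hashlib.blake2b(
            ngram.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        values.append(int.from_bytes(digest, "little") % num_buckets)
    return values


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class HashEmbeddingLanguageDetector:
    """Hypothetical, untrained model mirroring the stages in the abstract."""

    def __init__(self, languages, dim=16, num_buckets=1000, num_hashes=2, seed=0):
        rng = np.random.default_rng(seed)
        self.languages = list(languages)
        self.num_buckets = num_buckets
        self.num_hashes = num_hashes
        # Shared table of component vectors, indexed by hash value.
        self.components = rng.normal(size=(num_buckets, dim))
        # Projections for one self-attention layer (stand-in for the deep network).
        self.w_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        self.w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        # Linear classifier over the encoded vector.
        self.w_out = rng.normal(size=(dim, len(self.languages)))

    def embed(self, ngrams):
        """Ordered embedding vectors: one per n-gram, in sequence order."""
        rows = []
        for gram in ngrams:
            idx = hash_values(gram, self.num_hashes, self.num_buckets)
            # Assumed combination rule: sum the selected component vectors.
            rows.append(self.components[idx].sum(axis=0))
        return np.stack(rows)

    def encode(self, embeddings):
        """Self-attention over the sequence, then mean pooling to one vector."""
        q = embeddings @ self.w_q
        k = embeddings @ self.w_k
        v = embeddings @ self.w_v
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        return (weights @ v).mean(axis=0)

    def predict_proba(self, text):
        encoded = self.encode(self.embed(char_ngrams(text)))
        return softmax(encoded @ self.w_out)

    def predict(self, text):
        return self.languages[int(np.argmax(self.predict_proba(text)))]


if __name__ == "__main__":
    model = HashEmbeddingLanguageDetector(["en", "fr", "de"])
    print(model.predict_proba("hello world"))  # untrained: values are arbitrary
```

One reason to hash n-grams into a shared table of component vectors, rather than storing one embedding per n-gram, is that the table size stays fixed regardless of how many distinct n-grams appear across languages; using several hash values per n-gram reduces the chance that two n-grams collide on every component.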