Publication No.: US20240412011A1
Publication Date: 2024-12-12
Application No.: US18332672
Filing Date: 2023-06-09
Applicant: Microsoft Technology Licensing, LLC
Inventor: Hieu Trong Hoang, Marcin Junczys-Dowmunt, Anthony Aue
IPC: G06F40/58, G06F16/955, G06F40/205
Abstract: Systems and methods are provided for implementing URL embeddings for aligning parallel documents that are corresponding web pages in at least two different languages. A computing system uses a pre-trained model of an AI system to calculate URL embeddings for each URL among a plurality of URLs. The system identifies, based on closeness of the points represented by the URL embeddings, a set of candidate parallel URLs by analyzing the URL embeddings for the plurality of URLs or for a second plurality of URLs that has been partitioned into a cluster, using a clustering algorithm. A set of parallel URLs, associated with the parallel documents, is selected from the identified set of candidate parallel URLs. Document text and/or parallel sentences are extracted from web documents associated with the set of parallel URLs to train a machine translation model for translating between two or more languages.
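
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation: the abstract specifies a pre-trained AI model for the URL embeddings, whereas this sketch swaps in sparse character-trigram counts as a toy stand-in, measures closeness with cosine similarity, and omits the optional clustering step. The function names (`embed_url`, `cosine`, `candidate_parallel_urls`), the 0.7 threshold, and the example.com URLs are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def embed_url(url, n=3):
    """Toy stand-in for a pre-trained model: sparse character n-gram counts."""
    text = url.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_parallel_urls(source_urls, target_urls, threshold=0.7):
    """Pair each source URL with its closest target URL, if close enough.

    The threshold is an assumed value, not one taken from the patent.
    """
    target_vecs = [(url, embed_url(url)) for url in target_urls]
    pairs = []
    for src in source_urls:
        src_vec = embed_url(src)
        best_url, best_score = None, threshold
        for tgt, tgt_vec in target_vecs:
            score = cosine(src_vec, tgt_vec)
            if score >= best_score:
                best_url, best_score = tgt, score
        if best_url is not None:
            pairs.append((src, best_url, best_score))
    return pairs


# Hypothetical URLs for two language versions of the same site.
english = [
    "https://example.com/en/products/widget",
    "https://example.com/en/about/team",
]
french = [
    "https://example.com/fr/products/widget",
    "https://example.com/fr/about/team",
]

for src, tgt, score in candidate_parallel_urls(english, french):
    print(f"{src} <-> {tgt} (similarity {score:.2f})")
```

In this sketch the two URL lists are assumed to be already separated by language. A fuller pipeline along the lines of the abstract would first partition a large URL set into clusters, search for close pairs within each cluster, and then select the final parallel URLs from the candidates before extracting document text or parallel sentences.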