UNIFORM RESOURCE LOCATOR (URL) EMBEDDINGS FOR ALIGNING PARALLEL DOCUMENTS

    Publication No.: US20240412011A1

    Publication Date: 2024-12-12

    Application No.: US18332672

    Filing Date: 2023-06-09

    Abstract: Systems and methods are provided for implementing URL embeddings for aligning parallel documents, that is, corresponding web pages in at least two different languages. A computing system uses a pre-trained model of an AI system to calculate a URL embedding for each URL among a plurality of URLs. Based on the closeness of the points represented by the URL embeddings, the system identifies a set of candidate parallel URLs by analyzing the embeddings either for the plurality of URLs or for a second plurality of URLs that a clustering algorithm has partitioned into a cluster. A set of parallel URLs, associated with the parallel documents, is selected from the identified set of candidate parallel URLs. Document text and/or parallel sentences are extracted from the web documents associated with the set of parallel URLs to train a machine translation model for translating between two or more languages.
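
    Illustrative sketch (not part of the patent text): the candidate-identification step described in the abstract can be approximated as a nearest-neighbor search over URL embeddings. The abstract does not specify the pre-trained model, the distance measure, or any threshold, so everything below is an assumption: `embed_url` is a hypothetical stand-in that uses character trigram counts instead of a learned model, the language markers in `LANG_TOKEN` are invented for the example, and cosine similarity with a 0.8 cutoff is one plausible reading of "closeness".

```python
import math
import re
from collections import Counter

# Hypothetical language path segments; a real system would not hard-code these.
LANG_TOKEN = re.compile(r"(?<=/)(en|fr|de|es)(?=/|$)")


def embed_url(url, n=3):
    """Stand-in for the pre-trained model: sparse character n-gram counts.

    Language segments are masked so that URLs differing only by language
    land on the same point.
    """
    masked = LANG_TOKEN.sub("<lang>", url.lower())
    return Counter(masked[i:i + n] for i in range(len(masked) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(src_urls, tgt_urls, threshold=0.8):
    """For each source URL, keep its closest target URL if it is close enough."""
    if not tgt_urls:
        return []
    tgt_embeddings = [(t, embed_url(t)) for t in tgt_urls]
    pairs = []
    for s in src_urls:
        es = embed_url(s)
        best_url, best_score = max(
            ((t, cosine(es, et)) for t, et in tgt_embeddings),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((s, best_url, best_score))
    return pairs


if __name__ == "__main__":
    english = ["https://example.com/en/products/widgets", "https://example.com/en/about"]
    french = [
        "https://example.com/fr/about",
        "https://example.com/fr/products/widgets",
        "https://example.com/fr/contact",
    ]
    for src, tgt, score in candidate_pairs(english, french):
        print(f"{src} <-> {tgt} (similarity {score:.3f})")
```

    In this sketch the selection of final parallel URLs from the candidates is reduced to a single similarity threshold. The abstract treats selection as a separate step and also allows the search to run within a cluster of URLs rather than over the full set, which would replace the exhaustive comparison above with a per-cluster one.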
