Methods and systems for detecting duplicate document using document similarity measuring model based on deep learning

    Publication No.: US11631270B2

    Publication date: 2023-04-18

    Application No.: US17119028

    Filing date: 2020-12-11

    Abstract: Disclosed are a method and a system for detecting duplicate documents. The method includes: extracting a similar document pair set and a dissimilar document pair set from a document database, where the similar document pairs have a common attribute and the dissimilar document pairs are extracted randomly; calculating a mathematical similarity for each similar and dissimilar document pair using a mathematical measure, to obtain first and second mathematical similarities, respectively; calculating a semantic similarity for each similar and dissimilar document pair, to obtain first and second semantic similarities, respectively, the first semantic similarities being higher than the first mathematical similarities and the second semantic similarities being lower than the second mathematical similarities; training a similarity model based on the similar and dissimilar document pairs and the first and second semantic similarities, to obtain a trained similarity model; and detecting a duplicate document using the trained similarity model.
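The data-preparation steps in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract names neither the mathematical measure nor the rule that produces the semantic similarities, so the Jaccard word overlap, the `alpha` adjustment, and every function and field name below are assumptions chosen only to satisfy the stated ordering (raised for similar pairs, lowered for dissimilar pairs).

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Assumed 'mathematical measure': Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_pair_sets(
    docs: Dict[str, Tuple[str, str]], n_dissimilar: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """docs maps a document id to (attribute, text).

    Similar pairs share the (hypothetical) attribute; dissimilar pairs
    are drawn at random from the whole collection.
    """
    by_attr: Dict[str, List[str]] = {}
    for doc_id, (attr, _text) in docs.items():
        by_attr.setdefault(attr, []).append(doc_id)
    similar = [p for ids in by_attr.values() for p in combinations(ids, 2)]
    rng = random.Random(seed)
    all_ids = list(docs)
    dissimilar = [tuple(rng.sample(all_ids, 2)) for _ in range(n_dissimilar)]
    return similar, dissimilar


def semantic_target(math_sim: float, is_similar: bool, alpha: float = 0.5) -> float:
    """Assumed adjustment: move similar pairs toward 1 and dissimilar pairs toward 0.

    The result is strictly higher (similar) or lower (dissimilar) than
    math_sim, except at the boundary values 1.0 and 0.0 respectively.
    """
    if is_similar:
        return math_sim + alpha * (1.0 - math_sim)
    return math_sim * (1.0 - alpha)


# Hypothetical toy database: id -> (attribute, text).
docs = {
    "d1": ("A", "cheap flights to tokyo this spring"),
    "d2": ("A", "spring deals on tokyo flights"),
    "d3": ("B", "how to bake sourdough bread"),
    "d4": ("B", "sourdough bread baking guide"),
}
similar_pairs, dissimilar_pairs = build_pair_sets(docs, n_dissimilar=3)

# Training rows (text_a, text_b, target) that a similarity model could be fit on.
training_rows = []
for pairs, is_similar in ((similar_pairs, True), (dissimilar_pairs, False)):
    for a, b in pairs:
        math_sim = jaccard(docs[a][1], docs[b][1])
        training_rows.append((docs[a][1], docs[b][1], semantic_target(math_sim, is_similar)))
```

The resulting rows would then be passed to whatever model is being trained; the model itself is outside the scope of this sketch.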

    Method and system for detecting duplicate document using vector quantization

    Publication No.: US11550996B2

    Publication date: 2023-01-10

    Application No.: US17120693

    Filing date: 2020-12-14

    Abstract: Disclosed is a method and system for detecting a duplicate document using vector quantization. A duplicate document detection method may include acquiring, by processing circuitry, a respective vector expression for each of a plurality of documents using a similarity model, the similarity model being trained to output similar vector expressions for semantically similar documents, generating a key by performing a vector quantization on the respective vector expression, the key including a binary character string, and detecting a duplicate document from among the plurality of documents using the key.
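As a rough illustration of the key-generation step, the sketch below quantizes each vector to a binary character string and treats documents that share a key as duplicate candidates. The abstract does not specify the quantization scheme, so the per-component threshold rule is an assumption, and the embeddings are hypothetical values standing in for the similarity model's output.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def quantize_to_key(vector: Sequence[float], threshold: float = 0.0) -> str:
    """Assumed quantization: one bit per component, '1' if above the threshold."""
    return "".join("1" if x > threshold else "0" for x in vector)


def group_by_key(vectors: Dict[str, Sequence[float]]) -> List[List[str]]:
    """Bucket document ids by key and return buckets holding more than one id."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc_id, vec in vectors.items():
        buckets[quantize_to_key(vec)].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


# Hypothetical vector expressions; a real system would obtain these from the model.
embeddings = {
    "doc_a": [0.9, -0.2, 0.4, -0.7],
    "doc_b": [0.8, -0.1, 0.5, -0.6],
    "doc_c": [-0.3, 0.6, -0.2, 0.1],
}
duplicate_groups = group_by_key(embeddings)
```

Grouping by an exact key match replaces pairwise vector comparison with a dictionary lookup. The trade-off is that near-duplicates whose vectors fall on opposite sides of a threshold receive different keys.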
