- Patent title: Text classification using automatically generated seed data
- Application number: US16333359
- Filing date: 2018-03-22
- Publication number: US10671812B2
- Publication date: 2020-06-02
- Inventors: Rajkumar Bondugula, Allan Joshua, Hongchao Li, Hannah Wang
- Applicant: EQUIFAX INC.
- Applicant address: US GA Atlanta
- Assignee: EQUIFAX INC.
- Current assignee: EQUIFAX INC.
- Current assignee address: US GA Atlanta
- Agency: Kilpatrick Townsend & Stockton LLP
- International application: PCT/US2018/023686 WO 20180322
- International publication: WO2019/182593 WO 20190926
- Main classification: G06F40/295
- IPC classification: G06F40/295; G06N5/04; G06N20/00; G06F16/35
Abstract:
Certain aspects produce a scoring model that can automatically classify future text samples. In some examples, a processing device performs operations for producing a scoring model using active learning. The operations include receiving existing text samples and searching a stored, pre-trained corpus defining embedding vectors for selected words, phrases, or documents to produce nearest neighbor vectors for each embedding vector. Nearest neighbor selections are identified based on the distance between each nearest neighbor vector and the embedding vector for each selection to produce a text cloud. Text samples are selected from the text cloud to produce seed data that is used to train a text classifier. A scoring model can be produced based on the text classifier. The scoring model can receive a plurality of new text samples and provide, for each sample, a score indicative of its likelihood of being a member of a selected class.
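
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything concrete here is an assumption made for illustration: the two-dimensional toy `EMBEDDINGS` table standing in for the pre-trained corpus, the cosine distance metric, the `k` and `max_distance` thresholds, the rule that labels a sample positive when it mentions a text-cloud term, the averaged-embedding features, and the small hand-written logistic regression used as the text classifier. The active learning step mentioned in the abstract is omitted.

```python
import math

# Hypothetical toy stand-in for a stored, pre-trained embedding corpus.
EMBEDDINGS = {
    "refund": (0.9, 0.1), "chargeback": (0.85, 0.2), "dispute": (0.8, 0.3),
    "complaint": (0.7, 0.4), "weather": (0.1, 0.9), "sunny": (0.05, 0.95),
    "recipe": (0.2, 0.8),
}

def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed distance metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def build_text_cloud(selections, k=2, max_distance=0.05):
    """Collect each selection plus its k nearest neighbors within max_distance."""
    cloud = set(selections)
    for word in selections:
        ranked = sorted(
            (cosine_distance(EMBEDDINGS[word], vec), other)
            for other, vec in EMBEDDINGS.items() if other != word
        )
        cloud.update(other for dist, other in ranked[:k] if dist <= max_distance)
    return cloud

def featurize(text):
    """Average the embeddings of known words (assumed feature scheme)."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

def make_seed_data(samples, cloud):
    """Label a sample 1 if it mentions any text-cloud term, else 0 (assumed rule)."""
    return [
        (featurize(s), int(any(w in cloud for w in s.lower().split())))
        for s in samples
    ]

def train_classifier(seed, epochs=500, lr=0.5):
    """Fit a tiny logistic regression with per-sample gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in seed:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(text, model):
    """Scoring model: estimated likelihood that text belongs to the selected class."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, featurize(text))) + b
    return 1.0 / (1.0 + math.exp(-z))

existing_samples = [
    "please refund my order",
    "i filed a dispute yesterday",
    "sunny weather today",
    "a new recipe",
]
cloud = build_text_cloud(["refund"])
model = train_classifier(make_seed_data(existing_samples, cloud))

for new_text in ["chargeback on my card", "sunny recipe"]:
    print(new_text, round(score(new_text, model), 3))
```

In this sketch, the single selection "refund" expands into a small text cloud of nearby terms. The cloud then labels the existing samples automatically, so no hand-labeled seed data is needed before the classifier is trained.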