Text classification using automatically generated seed data

Granted invention patent

US10671812B2 Text classification using automatically generated seed data (status: under examination, published)

Patent title: Text classification using automatically generated seed data
Application number: US16333359

Filing date: 2018-03-22
Publication number: US10671812B2

Publication date: 2020-06-02
Inventors: Rajkumar Bondugula, Allan Joshua, Hongchao Li, Hannah Wang
Applicant: EQUIFAX INC.
Applicant address: US GA Atlanta
Patentee: EQUIFAX INC.
Current patentee: EQUIFAX INC.
Current patentee address: US GA Atlanta
Agency: Kilpatrick Townsend & Stockton LLP
International application: PCT/US2018/023686 WO 20180322
International publication: WO2019/182593 WO 20190926
Main classification: G06F40/295
IPC classification: G06F40/295; G06N5/04; G06N20/00; G06F16/35

Abstract:

Certain aspects produce a scoring model that can automatically classify future text samples. In some examples, a processing device performs operations for producing a scoring model using active learning. The operations include receiving existing text samples and searching a stored, pre-trained corpus defining embedding vectors for selected words, phrases, or documents to produce nearest neighbor vectors for each embedding vector. Nearest neighbor selections are identified based on distance between each nearest neighbor vector and the embedding vector for each selection to produce a text cloud. Text samples are selected from the text cloud to produce seed data that is used to train a text classifier. A scoring model can be produced based on the text classifier. The scoring model can receive a plurality of new text samples and provide a score indicative of a likelihood of being a member of a selected class.
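
The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The toy three-dimensional embeddings, the cosine distance measure, the distance threshold, the keyword-match labeling rule, and every function name below are assumptions made for illustration. A small bag-of-words logistic regression stands in for the text classifier, and the active learning loop is omitted.

```python
import math

# Hypothetical stand-in for a stored, pre-trained corpus of embedding vectors.
EMBEDDINGS = {
    "refund": [0.9, 0.1, 0.0],
    "rebate": [0.85, 0.15, 0.05],
    "chargeback": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.9, 0.3],
    "sunny": [0.05, 0.95, 0.2],
    "invoice": [0.6, 0.1, 0.6],
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def build_text_cloud(selections, embeddings, max_distance):
    """Collect every term whose vector lies within max_distance of a selection."""
    cloud = set(selections)
    for word in selections:
        for candidate, vector in embeddings.items():
            if cosine_distance(embeddings[word], vector) <= max_distance:
                cloud.add(candidate)
    return cloud


def select_seed_data(samples, cloud):
    """Assumed labeling rule: label 1 if a sample contains any cloud term, else 0."""
    return [
        (text, int(any(token in cloud for token in text.lower().split())))
        for text in samples
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(seed, epochs=200, learning_rate=0.5):
    """Fit a bag-of-words logistic regression with stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in seed:
            tokens = set(text.lower().split())
            z = bias + sum(weights.get(t, 0.0) for t in tokens)
            error = sigmoid(z) - label
            bias -= learning_rate * error
            for t in tokens:
                weights[t] = weights.get(t, 0.0) - learning_rate * error
    return weights, bias


def score(text, model):
    """Return a score in (0, 1): the modeled likelihood of class membership."""
    weights, bias = model
    tokens = set(text.lower().split())
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens))


# Usage: expand one selected word into a text cloud, derive seed data, train, score.
existing_samples = [
    "please process my refund today",
    "the rebate never arrived",
    "sunny weather all week",
    "lovely weather for a walk",
]
cloud = build_text_cloud(["refund"], EMBEDDINGS, max_distance=0.1)
seed_data = select_seed_data(existing_samples, cloud)
model = train_classifier(seed_data)
print(sorted(cloud))
print(score("where is my refund", model))
```

In this sketch the distance threshold controls how far the text cloud spreads from the selected word. A larger threshold pulls in looser neighbors such as "invoice", which gives more seed data at the cost of noisier labels.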

Published/granted documents

US20200034419A1 TEXT CLASSIFICATION USING AUTOMATICALLY GENERATED SEED DATA Publication/grant date: 2020-01-30

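IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: see G10L)
G06F40/20	.Natural language analysis (semantic analysis of natural language: see G06F40/30)
G06F40/279	..Recognition of textual entities
G06F40/289	...Phrasal analysis, e.g. finite state techniques or chunking
G06F40/295	....Named entity recognition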