Unsupervised topic modeling for short texts

    Publication number: US10241995B2

    Publication date: 2019-03-26

    Application number: US15888385

    Filing date: 2018-02-05

    Abstract: Topics are determined for short text messages using an unsupervised topic model. In a training corpus created from a number of short text messages, a vocabulary of words is identified, and for each word a distributed vector representation is obtained by processing windows of the corpus having a fixed length. The corpus is modeled as a Gaussian mixture model in which Gaussian components represent topics. To determine a topic of a sample short text message, a posterior distribution over the corpus topics is obtained using the Gaussian mixture model.
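The inference step in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the word vectors, the two topic names, and the mixture parameters are hypothetical toy values, the covariances are assumed diagonal, and the words of a message are assumed to be independent draws from a single topic's Gaussian.

```python
import math

# Hypothetical 2-D word vectors; real ones would be learned from
# fixed-length windows of the training corpus.
WORD_VECTORS = {
    "goal": (1.0, 0.9),
    "match": (0.9, 1.1),
    "team": (1.1, 1.0),
    "vote": (-1.0, -0.9),
    "senate": (-0.9, -1.1),
    "bill": (-1.1, -1.0),
}

# Hypothetical, already-fitted mixture:
# topic name -> (weight, mean, per-dimension variance).
TOPICS = {
    "sports": (0.5, (1.0, 1.0), (0.25, 0.25)),
    "politics": (0.5, (-1.0, -1.0), (0.25, 0.25)),
}


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def topic_posterior(message):
    """Posterior over topics for one short message.

    Out-of-vocabulary words are skipped; with no known words the
    posterior falls back to the mixture weights.
    """
    vectors = [
        WORD_VECTORS[w] for w in message.lower().split() if w in WORD_VECTORS
    ]
    log_scores = {
        name: math.log(weight)
        + sum(log_gaussian(v, mean, var) for v in vectors)
        for name, (weight, mean, var) in TOPICS.items()
    }
    # Normalise with log-sum-exp for numerical stability.
    peak = max(log_scores.values())
    total = sum(math.exp(s - peak) for s in log_scores.values())
    return {name: math.exp(s - peak) / total for name, s in log_scores.items()}


posterior = topic_posterior("Team wins the match")
```

Here `posterior` maps each topic name to its probability for the message. Working in log space keeps the product of many small densities from underflowing.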

    System and method for unsupervised text normalization using distributed representation of words

    Publication number: US11501066B2

    Publication date: 2022-11-15

    Application number: US16889609

    Filing date: 2020-06-01

    Abstract: A system, method and computer-readable storage devices for providing unsupervised normalization of noisy text using distributed representation of words. The system receives, from a social media forum, a word having a non-canonical spelling in a first language. The system determines a context of the word in the social media forum, identifies the word in a vector space model, and selects “n-best” vector paths in the vector space model, where the n-best vector paths are neighbors to the vector space path based on the context and the non-canonical spelling. The system can then select, based on a similarity cost, a best path from the n-best vector paths and identify a word associated with the best path as the canonical version.
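The two-stage selection described in this abstract can be sketched roughly as follows. This is a simplification under stated assumptions, not the patented method: vector paths are reduced to single word vectors, the embeddings and the canonical word list are hypothetical toy data, and the similarity cost is assumed to be an equal-weight mix of cosine distance and a spelling distance from `difflib.SequenceMatcher`.

```python
import math
from difflib import SequenceMatcher

# Hypothetical embeddings; in practice they would be trained on forum text,
# so that a noisy spelling lands near the words used in the same contexts.
VECTORS = {
    "tmrw": (0.90, 0.10, 0.20),
    "tomorrow": (0.88, 0.12, 0.22),
    "today": (0.80, 0.20, 0.10),
    "tonight": (0.85, 0.15, 0.30),
    "banana": (0.05, 0.90, 0.40),
}
# Hypothetical list of canonical spellings.
CANONICAL = {"tomorrow", "today", "tonight", "banana"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(word, n_best=3, alpha=0.5):
    """Return the canonical candidate with the lowest combined cost."""
    target = VECTORS[word]
    # Stage 1: keep the n-best canonical neighbours by contextual similarity.
    neighbours = sorted(
        CANONICAL, key=lambda c: cosine(target, VECTORS[c]), reverse=True
    )[:n_best]

    # Stage 2: rescore the neighbours with a cost mixing contextual
    # distance and spelling distance (alpha is an assumed weight).
    def cost(candidate):
        context_cost = 1.0 - cosine(target, VECTORS[candidate])
        spelling_cost = 1.0 - SequenceMatcher(None, word, candidate).ratio()
        return alpha * context_cost + (1.0 - alpha) * spelling_cost

    return min(neighbours, key=cost)


best = normalize("tmrw")
```

Splitting the work into a cheap neighbour search followed by a finer rescoring mirrors the abstract's "n-best, then best path" structure. The spelling term is what separates candidates that are all contextually close.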

    Unsupervised Topic Modeling For Short Texts
    Patent application

    Publication number: US20190179891A1

    Publication date: 2019-06-13

    Application number: US16268583

    Filing date: 2019-02-06

    Abstract: Topics are determined for short text messages using an unsupervised topic model. In a training corpus created from a number of short text messages, a vocabulary of words is identified, and for each word a distributed vector representation is obtained by processing windows of the corpus having a fixed length. The corpus is modeled as a Gaussian mixture model in which Gaussian components represent topics. To determine a topic of a sample short text message, a posterior distribution over the corpus topics is obtained using the Gaussian mixture model.

    System and method for locating bilingual web sites

    Publication number: US10114818B2

    Publication date: 2018-10-30

    Application number: US15294883

    Filing date: 2016-10-17

    Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for bootstrapping a language translation system. A system configured to practice the method performs a bidirectional web crawl to identify a bilingual website. The system analyzes data on the bilingual website to make a classification decision about whether the root of the bilingual website is an entry point for the bilingual website. The bilingual site can contain pairs of parallel pages. Each pair can include a first website in a first language and a second website in a second language, and a first portion of the first web page corresponds to a second portion of the second web page. Then the system analyzes the first and second web pages to identify corresponding information pairs in the first and second languages, and extracts the corresponding information pairs from the first and second web pages for use in a language translation model.

    Unsupervised topic modeling for short texts

    Publication number: US11030401B2

    Publication date: 2021-06-08

    Application number: US16268583

    Filing date: 2019-02-06

    Abstract: Topics are determined for short text messages using an unsupervised topic model. In a training corpus created from a number of short text messages, a vocabulary of words is identified, and for each word a distributed vector representation is obtained by processing windows of the corpus having a fixed length. The corpus is modeled as a Gaussian mixture model in which Gaussian components represent topics. To determine a topic of a sample short text message, a posterior distribution over the corpus topics is obtained using the Gaussian mixture model.

    UNSUPERVISED TOPIC MODELING FOR SHORT TEXTS
    Patent application (status: granted)

    Publication number: US20160110343A1

    Publication date: 2016-04-21

    Application number: US14519427

    Filing date: 2014-10-21

    CPC classification number: G06F17/2715 G06F17/2785 G10L25/30 H04W4/14

    Abstract: Topics are determined for short text messages using an unsupervised topic model. In a training corpus created from a number of short text messages, a vocabulary of words is identified, and for each word a distributed vector representation is obtained by processing windows of the corpus having a fixed length. The corpus is modeled as a Gaussian mixture model in which Gaussian components represent topics. To determine a topic of a sample short text message, a posterior distribution over the corpus topics is obtained using the Gaussian mixture model.

    SYSTEM AND METHOD FOR ENRICHING SPOKEN LANGUAGE TRANSLATION WITH DIALOG ACTS
    Patent application (status: granted)

    Publication number: US20130151232A1

    Publication date: 2013-06-13

    Application number: US13761549

    Filing date: 2013-02-07

    CPC classification number: G06F17/28 G06F17/279 G06F17/289

    Abstract: Disclosed herein are systems, computer-implemented methods, and tangible computer-readable media for enriching spoken language translation with dialog acts. The method includes receiving a source speech signal, tagging dialog acts associated with the received source speech signal using a classification model, dialog acts being domain independent descriptions of an intended action a speaker carries out by uttering the source speech signal, producing an enriched hypothesis of the source speech signal incorporating the dialog act tags, and outputting a natural language response of the enriched hypothesis in a target language. Tags can be grouped into sets such as statement, acknowledgement, abandoned, agreement, question, appreciation, and other. The step of producing an enriched translation of the source speech signal uses a dialog act specific translation model containing a phrase translation table.

    System and method for unsupervised text normalization using distributed representation of words

    Publication number: US10083167B2

    Publication date: 2018-09-25

    Application number: US14506156

    Filing date: 2014-10-03

    CPC classification number: G06F17/273 G06F17/289 G06Q50/01

    Abstract: A system, method and computer-readable storage devices for providing unsupervised normalization of noisy text using distributed representation of words. The system receives, from a social media forum, a word having a non-canonical spelling in a first language. The system determines a context of the word in the social media forum, identifies the word in a vector space model, and selects “n-best” vector paths in the vector space model, where the n-best vector paths are neighbors to the vector space path based on the context and the non-canonical spelling. The system can then select, based on a similarity cost, a best path from the n-best vector paths and identify a word associated with the best path as the canonical version.
