Domain constraint path based data record extraction
    1.
    Type: granted patent
    Status: in force

    Publication number: US09171080B2

    Publication date: 2015-10-27

    Application number: US13356241

    Filing date: 2012-01-23

    CPC classification number: G06F17/30864 G06F17/227 G06F17/30867

    Abstract: Described herein are techniques for extracting data records containing user-generated content from documents. The documents may be processed into document trees in which sub-trees represent the data records of the document. Domain constraints may be used to locate structured portions of the document tree. For example, anchor trees may be located as being sets of sibling sub-trees with similar tag paths that contain the domain constraints. The anchor trees may then be used to determine a record boundary (e.g., the start offset and length) of the data records. Finally, the data records may be extracted based on the anchor trees and the record boundaries.
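
The steps in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: it assumes the domain constraint is a post date matched by a regular expression, treats "similar tag paths" as exactly equal tag paths, and assumes every record starts at its anchor tree. The names `DATE`, `constraint_paths`, `find_anchor_trees`, and `extract_records` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed domain constraint: a text node that contains a post date.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def constraint_paths(node, prefix=()):
    """Yield the tag path to every node whose text satisfies the constraint."""
    path = prefix + (node.tag,)
    if node.text and DATE.search(node.text):
        yield path
    for child in node:
        yield from constraint_paths(child, path)

def find_anchor_trees(parent):
    """Return indices of the largest set of sibling sub-trees under `parent`
    that share one constraint-bearing tag path (exact match, a simplification)."""
    by_path = {}
    for i, child in enumerate(parent):
        for path in set(constraint_paths(child)):
            by_path.setdefault(path, []).append(i)
    groups = [idx for idx in by_path.values() if len(idx) >= 2]
    return max(groups, key=len) if groups else []

def extract_records(parent):
    """Cut sibling runs into records. Assumes start offset 0 (the record begins
    at its anchor) and a length equal to the smallest gap between anchors."""
    anchors = find_anchor_trees(parent)
    if len(anchors) < 2:
        return []
    length = min(b - a for a, b in zip(anchors, anchors[1:]))
    children = list(parent)
    return [children[a:a + length] for a in anchors]

PAGE = """
<div>
  <h1>Thread</h1>
  <div><span>2011-05-01</span></div><p>First post</p>
  <div><span>2011-05-02</span></div><p>Second post</p>
  <div><span>2011-05-03</span></div><p>Third post</p>
</div>
"""
root = ET.fromstring(PAGE.strip())
records = extract_records(root)
post_texts = [record[1].text for record in records]
```

Here each record is a list of two sibling elements: the anchor `div` that carries the date and the `p` that follows it.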

    Domain Constraint Based Data Record Extraction
    2.
    Type: patent application
    Status: in force

    Publication number: US20120124077A1

    Publication date: 2012-05-17

    Application number: US12945517

    Filing date: 2010-11-12

    CPC classification number: G06F17/227

    Abstract: Embodiments for a Mining Data Records based on Anchor Trees (MiBAT) process are disclosed. In accordance with at least one embodiment, the MiBAT process extracts data records containing user-generated content from web documents. The web document is processed into a Document Object Model (DOM) tree in which sub-trees of the DOM tree represent the data records of the web document. Domain constraints are used to locate structured portions of the DOM tree. Anchor trees are then located as being sets of sibling sub-trees which contain the domain constraints. The anchor trees are then used to determine a record boundary (i.e. the start offset and length) of the data records. Finally, the data records are extracted based on the anchor trees and the record boundaries.

    SEARCHING QUESTIONS BASED ON TOPIC AND FOCUS
    3.
    Type: patent application
    Status: in force

    Publication number: US20100030770A1

    Publication date: 2010-02-04

    Application number: US12185713

    Filing date: 2008-08-04

    CPC classification number: G06F17/30684

    Abstract: A method and system for determining the relevance of questions to a queried question based on topics and focuses of the questions is provided. A question search system provides a collection of questions with topics and focuses. Upon receiving a queried question, the question search system identifies a queried topic and queried focus of the queried question. The question search system generates a score indicating the relevance of a question of the collection to the queried question based on a language model of the topic of the question and a language model of the focus of the question.
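
One way to picture the scoring described in this abstract is sketched below. It assumes each question has already been split into topic terms and focus terms, uses unigram language models with add-one smoothing, and mixes the two log-likelihoods with a weight `lam`. The smoothing, the weight, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab_size, alpha=1.0):
    """Build a smoothed unigram model P(w) from a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab_size)

def log_likelihood(query_tokens, lm):
    """Log-probability of the query tokens under a language model."""
    return sum(math.log(lm(w)) for w in query_tokens)

def relevance(query, question, vocab_size, lam=0.5):
    """Mix the topic-model and focus-model log-likelihoods (lam is assumed)."""
    topic_lm = unigram_lm(question["topic"], vocab_size)
    focus_lm = unigram_lm(question["focus"], vocab_size)
    return (lam * log_likelihood(query["topic"], topic_lm)
            + (1 - lam) * log_likelihood(query["focus"], focus_lm))

collection = {
    "q1": {"topic": ["hanoi", "vietnam"], "focus": ["hotel", "cheap"]},
    "q2": {"topic": ["hanoi", "vietnam"], "focus": ["restaurant"]},
    "q3": {"topic": ["paris"], "focus": ["hotel"]},
}
vocab = {w for q in collection.values() for part in q.values() for w in part}
query = {"topic": ["hanoi"], "focus": ["hotel"]}

scores = {qid: relevance(query, q, len(vocab)) for qid, q in collection.items()}
best = max(scores, key=scores.get)
```

With this toy collection, the question that matches the query on both topic and focus ranks first.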

    RECOMMENDING QUESTIONS TO USERS OF COMMUNITY QUESTION ANSWERING
    4.
    Type: patent application
    Status: under examination (published)

    Publication number: US20090253112A1

    Publication date: 2009-10-08

    Application number: US12098457

    Filing date: 2008-04-07

    CPC classification number: G06Q10/10 G06F16/3329

    Abstract: The present system graphs topic terms in stored cQA questions and also converts a submitted question into a graph of topic terms. Topic terms that correspond to a question topic are delineated from topic terms that correspond to question focus. New questions are recommended to the user based on a comparison between the topics of the new questions and the topic of the submitted question as well as the focus of the new questions and the focus of the submitted question.

    Smart Sentiment Classifier for Product Reviews
    5.
    Type: patent application
    Status: under examination (published)

    Publication number: US20080249764A1

    Publication date: 2008-10-09

    Application number: US11950512

    Filing date: 2007-12-05

    CPC classification number: G06F17/2785

    Abstract: A sentiment classifier is described. In one implementation, a system applies both full text and complex feature analyses to sentences of a product review. Each analysis is weighted prior to linear combination into a final sentiment prediction. A full text model and a complex features model can be trained separately offline to support online full text analysis and complex features analysis. Complex features include opinion indicators, negation patterns, sentiment-specific sections of the product review, user ratings, sequence of text chunks, and sentence types and lengths. A Conditional Random Field (CRF) framework provides enhanced sentiment classification for each segment of a complex sentence to enhance sentiment prediction.
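
The weighted linear combination mentioned in this abstract might look like the toy sketch below. The two scorers stand in for the trained full text and complex feature models: one counts lexicon words, the other handles a single negation pattern. The lexicons, the weights `w_full` and `w_complex`, and the function names are hypothetical, and no CRF is involved here.

```python
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "poor"}

def polarity(token):
    """+1 for a positive lexicon word, -1 for a negative one, else 0."""
    return (token in POSITIVE) - (token in NEGATIVE)

def full_text_score(tokens):
    """Stand-in for the full text model: sum of word polarities."""
    return sum(polarity(t) for t in tokens)

def negation_score(tokens):
    """Stand-in for one complex feature: flip a word directly after 'not'."""
    return sum(-polarity(tok) for prev, tok in zip(tokens, tokens[1:])
               if prev == "not")

def predict(sentence, w_full=0.4, w_complex=0.6):
    """Weight each analysis, then combine linearly into one prediction."""
    tokens = sentence.lower().split()
    combined = (w_full * full_text_score(tokens)
                + w_complex * negation_score(tokens))
    if combined > 0:
        return "positive"
    if combined < 0:
        return "negative"
    return "neutral"

labels = [predict(s) for s in ("good battery", "not good", "not bad")]
```

The weights decide which analysis wins when the two disagree, as they do for "not good".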

    Determining utility of a question
    6.
    Type: granted patent
    Status: in force

    Publication number: US08112269B2

    Publication date: 2012-02-07

    Application number: US12197991

    Filing date: 2008-08-25

    CPC classification number: G06F17/277 G06F17/30654

    Abstract: A question search system provides a collection of questions having words for use in evaluating the utility of the questions based on a language model. The question search system calculates n-gram probabilities for words within the questions of the collection. The n-gram probability of a word for a sequence of n−1 words indicates the probability of that word being next after that sequence in the collection of questions. The n-gram probabilities for the words of the collection represent the language model of the collection. The question search system calculates a language model utility score for each question within a collection that indicates the likelihood that a question is repeatedly asked by users. The question search system derives the language model utility score for a question from the n-gram probabilities of the words within that question.
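
A small sketch of the idea in this abstract, for n = 2: estimate bigram probabilities from the question collection, then derive a utility score for a question from the probabilities of its words. The add-one smoothing, the start marker `<s>`, and the length normalization are assumptions chosen to keep the example well defined; the abstract does not state them.

```python
import math
from collections import Counter

def train_bigram(questions):
    """Return P(word | prev) estimated from the collection (add-one smoothed)."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for q in questions:
        toks = ["<s>"] + q.lower().split()
        vocab.update(toks)
        context.update(toks[:-1])              # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (context[prev] + len(vocab))
    return prob

def utility(question, prob):
    """Average log-probability per word; higher suggests a more typical question."""
    toks = ["<s>"] + question.lower().split()
    if len(toks) < 2:
        return float("-inf")
    logp = sum(math.log(prob(p, w)) for p, w in zip(toks, toks[1:]))
    return logp / (len(toks) - 1)

collection = [
    "where to stay in hanoi",
    "where to stay in paris",
    "where to eat in hanoi",
    "is the old quarter noisy at night",
]
prob = train_bigram(collection)
common = utility("where to stay in hanoi", prob)
rare = utility("is the old quarter noisy at night", prob)
```

Questions built from word sequences that recur across the collection score higher than one-off phrasings.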

    Clustering question search results based on topic and focus
    7.
    Type: granted patent
    Status: in force

    Publication number: US08024332B2

    Publication date: 2011-09-20

    Application number: US12185702

    Filing date: 2008-08-04

    CPC classification number: G06F17/30696

    Abstract: A method and system for presenting questions that are relevant to a queried question based on clusters of topics and clusters of focuses of the questions is provided. A question search system provides a collection of questions. Each question of the collection has an associated topic and focus. Upon receiving a queried question, the question search system identifies questions of the collection that may be relevant to the queried question and generates a score or ranking indicating relevance of the identified questions. The question search system clusters the identified questions into topic clusters of questions with similar topics. The question search system may also cluster the questions within each topic cluster into focus clusters of questions with similar focuses.
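
The two-level grouping described in this abstract can be pictured with the sketch below. It replaces similarity clustering with exact matching on a topic label and a focus label, which is a deliberate simplification, and it assumes each search result already carries `topic`, `focus`, and `score` fields. All names are hypothetical.

```python
from collections import defaultdict

def cluster_results(results):
    """Group results into topic clusters, then into focus clusters within each
    topic. Exact label match stands in for 'similar topics/focuses'."""
    clusters = defaultdict(lambda: defaultdict(list))
    # Visit results in descending relevance so each cluster stays ranked.
    for r in sorted(results, key=lambda r: r["score"], reverse=True):
        clusters[r["topic"]][r["focus"]].append(r["text"])
    return {topic: dict(by_focus) for topic, by_focus in clusters.items()}

results = [
    {"text": "Cheap hotel in Hanoi?", "topic": "hanoi", "focus": "hotel", "score": 0.9},
    {"text": "Best pho in Hanoi?", "topic": "hanoi", "focus": "food", "score": 0.7},
    {"text": "Hanoi hotel near the lake?", "topic": "hanoi", "focus": "hotel", "score": 0.8},
    {"text": "Hotel near the Louvre?", "topic": "paris", "focus": "hotel", "score": 0.4},
]
clusters = cluster_results(results)
```

A real system would need a similarity measure over topic and focus terms in place of the exact-match keys used here.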

    QUESTION AND ANSWER SEARCH
    8.
    Type: patent application
    Status: under examination (published)

    Publication number: US20100235311A1

    Publication date: 2010-09-16

    Application number: US12403560

    Filing date: 2009-03-13

    CPC classification number: G06F16/9535

    Abstract: Exemplary methods, computer-readable media, and systems are presented for leveraging question-answering knowledge from community sites by complementing product search services with a search of questions, answers, reviews and other Internet accessible content including user-generated content. Product or service information is obtained by crawling Internet-accessible Web sites including community sites. An integrated index of such information is generated. A user is able to browse questions by product or service feature, by topic, by identified comparative questions, and by question ranking (for example, interestingness or popularity).

    CLUSTERING QUESTION SEARCH RESULTS BASED ON TOPIC AND FOCUS
    9.
    Type: patent application
    Status: in force

    Publication number: US20100030769A1

    Publication date: 2010-02-04

    Application number: US12185702

    Filing date: 2008-08-04

    CPC classification number: G06F17/30696

    Abstract: A method and system for presenting questions that are relevant to a queried question based on clusters of topics and clusters of focuses of the questions is provided. A question search system provides a collection of questions. Each question of the collection has an associated topic and focus. Upon receiving a queried question, the question search system identifies questions of the collection that may be relevant to the queried question and generates a score or ranking indicating relevance of the identified questions. The question search system clusters the identified questions into topic clusters of questions with similar topics. The question search system may also cluster the questions within each topic cluster into focus clusters of questions with similar focuses.

    Domain constraint based data record extraction
    10.
    Type: granted patent
    Status: in force

    Publication number: US08983980B2

    Publication date: 2015-03-17

    Application number: US12945517

    Filing date: 2010-11-12

    CPC classification number: G06F17/227

    Abstract: Embodiments for a Mining Data Records based on Anchor Trees (MiBAT) process are disclosed. In accordance with at least one embodiment, the MiBAT process extracts data records containing user-generated content from web documents. The web document is processed into a Document Object Model (DOM) tree in which sub-trees of the DOM tree represent the data records of the web document. Domain constraints are used to locate structured portions of the DOM tree. Anchor trees are then located as being sets of sibling sub-trees which contain the domain constraints. The anchor trees are then used to determine a record boundary (i.e. the start offset and length) of the data records. Finally, the data records are extracted based on the anchor trees and the record boundaries.
