Document processing method, system and medium

    Publication number: US07046847B2

    Publication date: 2006-05-16

    Application number: US09891080

    Application date: 2001-06-25

    CPC classification number: G06F17/2745 G06F17/2229

    Abstract: A technique for extracting a meaningful text block from a document where tables, itemized lists, multiple columns, etc., are arbitrarily laid out. A document is input which is laid out using blanks or the like, then a symbol is acquired which is associated with a spatial coordinate of the document. Consecutive characters of the same type are extracted from the symbol to generate a token and a space. A stream is generated from consecutive spaces in the column direction, while a text block is generated from streams and tokens. A link is generated between the text blocks to form a document graph. Validity of a connection (link) between the text blocks in the document graph is evaluated using a language model, then the text blocks are merged if the connection is valid.
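    The token-and-space step described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the character categories in `char_type` and the `(kind, text, row, col)` tuple layout are assumptions made for illustration only.

```python
from itertools import groupby

def char_type(ch):
    """Classify a character. These categories are an assumption, not from the patent."""
    if ch.isspace():
        return "space"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "symbol"

def segment_line(line, row):
    """Group consecutive same-type characters into (kind, text, row, col) runs."""
    runs = []
    col = 0
    for kind, group in groupby(line, key=char_type):
        text = "".join(group)
        runs.append((kind, text, row, col))  # col is the run's starting column
        col += len(text)
    return runs

# Non-space runs play the role of tokens; "space" runs are the blanks
# that a later stage could stack in the column direction.
runs = segment_line("Name   Qty 12", row=0)
```

    Keeping the row and column with each run preserves the spatial coordinate that the later stream and text-block stages would need.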

    Social media guided authoring
    2.
    Granted patent

    Publication number: US09892103B2

    Publication date: 2018-02-13

    Application number: US12193148

    Application date: 2008-08-18

    CPC classification number: G06F17/241 G06F17/30867 G06Q10/10

    Abstract: Techniques and systems for assisting an author in creating content for social media (e.g., blog posts, microblogs, tweets, etc.) are disclosed, wherein hints are provided to the author as a function of social media stored in a social media knowledge store. Social media is collected and stored in a social media knowledge store according to some criteria. Upon the happening of some predetermined event, for example, relevant information is retrieved from the social media knowledge store. The relevancy of information may be a function of editing context (provided by the author) and/or social media behavior, for example. The relevant information may be translated into hints that provide an author with suggestions and/or corrections, for example. This information is provided to the author through a social media environment (e.g., an authoring tool) that may also be capable of receiving input from the author and outputting editing context.

    NAMED ENTITY RESOLUTION USING MULTIPLE TEXT SOURCES
    3.
    Patent application
    Status: under examination (published)

    Publication number: US20100094831A1

    Publication date: 2010-04-15

    Application number: US12251452

    Application date: 2008-10-14

    Inventor: Matthew F. Hurst

    CPC classification number: G06F17/278

    Abstract: An arrangement for resolving ambiguity among named entities in web based text documents is provided in which multiple documents are utilized that are of different genres and will thus typically use different degrees of precision when referring to named entities. When an ambiguous named entity is located in a document, any links contained in that document are followed to other documents. If a linked document includes a named entity that is fully specified (i.e., includes both a first and last name), then this information can be used to resolve the ambiguity of the named entity in the original document.
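    The link-following idea in this abstract can be sketched as follows. This is a simplified illustration, not the claimed arrangement: the function name `resolve_name`, the capitalized-word-pair heuristic for "fully specified" names, and the rule of resolving only when exactly one candidate is found are all assumptions.

```python
import re

# A capitalized word followed by another capitalized word. The lookahead keeps
# matches overlapping, so "Senator Jane Smith" yields both word pairs.
NAME_PAIR = re.compile(r"\b([A-Z][a-z]+)(?= ([A-Z][a-z]+)\b)")

def resolve_name(partial, linked_documents):
    """Return the single full name in the linked texts that contains `partial`."""
    candidates = set()
    for text in linked_documents:
        for first, last in NAME_PAIR.findall(text):
            if partial in (first, last):
                candidates.add(f"{first} {last}")
    # Resolve only when the linked documents agree on one full name.
    return candidates.pop() if len(candidates) == 1 else None

linked = ["Senator Jane Smith spoke on Tuesday.", "The weather was mild."]
resolved = resolve_name("Smith", linked)

conflicting = ["Jane Smith spoke.", "John Smith arrived."]
unresolved = resolve_name("Smith", conflicting)
```

    Here `linked` stands in for the text of documents reached by following links from the original document; fetching those documents is outside the sketch.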

    TOPICAL SENTIMENTS IN ELECTRONICALLY STORED COMMUNICATIONS
    4.
    Patent application
    Status: in force

    Publication number: US20110093417A1

    Publication date: 2011-04-21

    Application number: US12969356

    Application date: 2010-12-15

    CPC classification number: G06F17/274 G06Q30/02

    Abstract: The present application presents methods for performing topical sentiment analysis on electronically stored communications employing fusion of polarity and topicality. The present application also provides methods for utilizing shallow NLP techniques to determine the polarity of an expression. The present application also provides a method for tuning a domain-specific polarity lexicon for use in the polarity determination. The present application also provides methods for computing a numeric metric of the aggregate opinion about some topic expressed in a set of expressions.

    METHOD AND APPARATUS FOR WEB CRAWLING
    5.
    Patent application
    Status: in force

    Publication number: US20100250516A1

    Publication date: 2010-09-30

    Application number: US12413528

    Application date: 2009-03-28

    CPC classification number: G06F17/30864

    Abstract: A method and system for retrieving data from a webpage is described herein. A scheduler organizes, or rather orders, a group of webpage identifiers according to some predetermined criteria. Based upon this ordering, a fetcher may be configured to fetch data from webpages identified by the identifiers. To promote efficiency and reduce the latency between when a webpage is updated and when the fetcher retrieves data from the webpage, the scheduler may be configured to reorder the identifiers in such a manner that it causes an identifier that was less relevant, and would not have been sent to the fetcher, to become more relevant. In this way, the method and system may be particularly useful for retrieving data related to webpages that are updated frequently, such as social media webpages, for example.
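    The scheduler reordering described in this abstract can be sketched with a priority queue. This is an illustrative sketch under assumptions, not the claimed method: the `Scheduler` class, its method names, and the use of a numeric priority as the "predetermined criteria" are hypothetical.

```python
import heapq

class Scheduler:
    """Orders webpage identifiers by priority; higher priority is fetched first."""

    def __init__(self):
        self._heap = []      # entries are (-priority, identifier)
        self._priority = {}  # current priority for each pending identifier

    def add(self, identifier, priority):
        self._priority[identifier] = priority
        heapq.heappush(self._heap, (-priority, identifier))

    def reprioritize(self, identifier, new_priority):
        """Reorder lazily: push a fresh entry and let the old one go stale."""
        self.add(identifier, new_priority)

    def next_batch(self, n):
        """Return up to n identifiers to hand to the fetcher."""
        batch = []
        while self._heap and len(batch) < n:
            neg_priority, identifier = heapq.heappop(self._heap)
            if self._priority.get(identifier) == -neg_priority:  # skip stale entries
                batch.append(identifier)
                del self._priority[identifier]
        return batch

scheduler = Scheduler()
scheduler.add("page-a", 3)
scheduler.add("page-b", 2)
scheduler.add("page-c", 1)
# page-c would miss a batch of two; raising its priority gets it fetched first.
scheduler.reprioritize("page-c", 5)
first_batch = scheduler.next_batch(2)
second_batch = scheduler.next_batch(2)
```

    The lazy update avoids rebuilding the heap on every reorder, which suits pages whose relevance changes often, at the cost of keeping stale entries until they are popped.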

    Providing context for web articles
    6.
    Granted patent
    Status: in force

    Publication number: US08630972B2

    Publication date: 2014-01-14

    Application number: US12143765

    Application date: 2008-06-21

    CPC classification number: G06F17/30014

    Abstract: An overwhelming number of articles are available every day via the internet. Unfortunately, it is impossible to peruse more than a handful, and it is difficult to ascertain an article's social context. The techniques disclosed herein address this problem by harnessing implicit and explicit contextual information from social media. By extracting text surrounding a hyperlink to an article in a post and assessing the article as a function of content surrounding the hyperlink, an article's social context is determined and presented. Additionally, articles that are sufficiently similar in content may be grouped to establish a many-to-one relationship between posts and an article, creating a more accurate assessment.

    Topical sentiments in electronically stored communications

    Publication number: US08041669B2

    Publication date: 2011-10-18

    Application number: US12969356

    Application date: 2010-12-15

    CPC classification number: G06F17/274 G06Q30/02

    Abstract: The present application presents methods for performing topical sentiment analysis on electronically stored communications employing fusion of polarity and topicality. The present application also provides methods for utilizing shallow NLP techniques to determine the polarity of an expression. The present application also provides a method for tuning a domain-specific polarity lexicon for use in the polarity determination. The present application also provides methods for computing a numeric metric of the aggregate opinion about some topic expressed in a set of expressions.

    Topical sentiments in electronically stored communications
    8.
    Granted patent
    Status: in force

    Publication number: US07523085B2

    Publication date: 2009-04-21

    Application number: US11245542

    Application date: 2005-09-30

    CPC classification number: G06F17/274 G06Q30/02

    Abstract: The present application presents methods for performing topical sentiment analysis on electronically stored communications employing fusion of polarity and topicality. The present application also provides methods for utilizing shallow NLP techniques to determine the polarity of an expression. The present application also provides a method for tuning a domain-specific polarity lexicon for use in the polarity determination. The present application also provides methods for computing a numeric metric of the aggregate opinion about some topic expressed in a set of expressions.

    Method and apparatus for web crawling
    9.
    Granted patent
    Status: in force

    Publication number: US08712992B2

    Publication date: 2014-04-29

    Application number: US12413528

    Application date: 2009-03-28

    CPC classification number: G06F17/30864

    Abstract: A method and system for retrieving data from a webpage is described herein. A scheduler organizes, or rather orders, a group of webpage identifiers according to some predetermined criteria. Based upon this ordering, a fetcher may be configured to fetch data from webpages identified by the identifiers. To promote efficiency and reduce the latency between when a webpage is updated and when the fetcher retrieves data from the webpage, the scheduler may be configured to reorder the identifiers in such a manner that it causes an identifier that was less relevant, and would not have been sent to the fetcher, to become more relevant. In this way, the method and system may be particularly useful for retrieving data related to webpages that are updated frequently, such as social media webpages, for example.

    EXTRACTION OF CERTAIN TYPES OF ENTITIES
    10.
    Patent application
    Status: under examination (published)

    Publication number: US20110131244A1

    Publication date: 2011-06-02

    Application number: US12626905

    Application date: 2009-11-29

    CPC classification number: G06F16/367 G06F16/355

    Abstract: Certain types of entities may be extracted from a document. In one example, the entities to be recognized are cultural entities, such as the names of movies, video games, books, etc. For each such entity, a concept graph may be built that shows the relationship between the entity itself and other entities, such as the relationship between a movie and the actor(s) who act in the movie. When a candidate entity name is detected in the document, the concept graph may be used to look for other entities that appear in the context of the candidate entity. The presence of related entities in the context of the candidate may be used to disambiguate the meaning of the candidate. For example, a common word like “up” might be recognized as the name of a movie if the names of actors or characters in that movie appear near the word “up”.
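    The context check described in this abstract can be sketched as a lookup against a small concept graph. This is a toy illustration, not the claimed method: the graph contents, the function name `is_entity_mention`, the substring matching, and the `min_related` threshold are all assumptions.

```python
# Hypothetical concept graph: entity name -> names of related entities.
CONCEPT_GRAPH = {
    "Up": {"Ed Asner", "Carl Fredricksen", "Pixar"},
}

def is_entity_mention(candidate, context, graph, min_related=1):
    """Treat `candidate` as an entity name if enough related entities appear nearby."""
    related = graph.get(candidate, set())
    lowered = context.lower()
    hits = sum(1 for name in related if name.lower() in lowered)
    return hits >= min_related

movie_context = "Ed Asner is wonderful in Up, and the balloons are a delight."
plain_context = "Prices went up again today."

movie_hit = is_entity_mention("Up", movie_context, CONCEPT_GRAPH)
plain_hit = is_entity_mention("Up", plain_context, CONCEPT_GRAPH)
```

    A real system would presumably weight related entities and limit the context window; the sketch only shows how related names near a common word can tip the decision.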
