Abstract:
A document is received that has a plurality of lines with text. This document includes text associated with at least one topic of interest and text not associated with the at least one topic of interest. Thereafter, for each line in the document, a length of the line and a number of off-topic indicators are determined, with the off-topic indicators characterizing portions of the document as likely not being associated with the at least one topic of interest. Thereafter, a density for each line can be determined based on the determined line length and the determined number of off-topic indicators. The determined densities for each line are used to identify portions of the document likely associated with the at least one topic of interest so that data characterizing the identified portions of the document can be provided. Related apparatus, systems, techniques and articles are also described.
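The per-line density idea can be illustrated with a minimal sketch. The abstract does not say what counts as an off-topic indicator or how density is computed, so everything specific here is an assumption: markup tags and menu separators stand in for indicators, density is taken as indicators per character, and the names `OFF_TOPIC_PATTERN`, `line_density`, `on_topic_lines` and the 0.05 threshold are hypothetical.

```python
import re

# Assumed off-topic indicators: markup tags and menu-style separators.
# The abstract does not define the indicator set; this is illustrative only.
OFF_TOPIC_PATTERN = re.compile(r"<[^>]+>|\||»")


def line_density(line: str) -> float:
    """Return off-topic indicators per character (0.0 for an empty line)."""
    length = len(line)
    if length == 0:
        return 0.0
    return len(OFF_TOPIC_PATTERN.findall(line)) / length


def on_topic_lines(document: str, max_density: float = 0.05) -> list[str]:
    """Keep non-blank lines whose density is at or below the assumed threshold."""
    return [
        line
        for line in document.splitlines()
        if line.strip() and line_density(line) <= max_density
    ]
```

Under these assumptions, a navigation line such as "Home | News | Sports | Contact" has a high density and is dropped, while an ordinary sentence with no indicators has a density of zero and is kept.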
Abstract:
Data is received that comprises an entity name. Thereafter, it is determined (i) whether there are any punctuation variations for the entity name, (ii) whether there is at least one character to drop from the entity name, and (iii) whether there are alternative equivalents of at least a portion of the entity name. After such determinations have been made, a plurality of variants for the entity name is generated based on a combination of each determined punctuation variation, determined at least one character to drop, and determined alternative equivalent. Related apparatus, systems, techniques and articles are also described.
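As a rough illustration of combining the three determinations, the sketch below generates variants token by token and then takes every combination. The abstract does not specify the rules, so the punctuation set, the droppable characters in `DROPPABLE`, the `EQUIVALENTS` table, and the function names are all hypothetical.

```python
from itertools import product

# Hypothetical lookup of alternative equivalents, keyed by lowercase token.
EQUIVALENTS = {"inc": ["Incorporated"], "corp": ["Corporation"], "&": ["and"]}
# Hypothetical characters that may be dropped from a name.
DROPPABLE = "'-"


def token_variants(token: str) -> set[str]:
    """Return the variants of one whitespace-delimited token."""
    variants = {token}
    # (i) Punctuation variation: the token without periods and commas.
    no_punct = token.replace(".", "").replace(",", "")
    if no_punct:
        variants.add(no_punct)
    # (ii) Characters to drop: remove apostrophes and hyphens.
    for v in list(variants):
        dropped = "".join(ch for ch in v if ch not in DROPPABLE)
        if dropped:
            variants.add(dropped)
    # (iii) Alternative equivalents from the lookup table.
    for v in list(variants):
        variants.update(EQUIVALENTS.get(v.lower(), []))
    return variants


def name_variants(name: str) -> set[str]:
    """Combine the per-token variants into full entity-name variants."""
    per_token = [sorted(token_variants(t)) for t in name.split()]
    return {" ".join(combo) for combo in product(*per_token)}
```

With these assumed rules, `name_variants("Acme, Inc.")` yields six variants, including "Acme Inc" and "Acme Incorporated". Because the combination is a Cartesian product, the number of variants grows quickly for long names, so a real system would likely need to cap or filter it.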
Abstract:
A company is associated, in an enterprise resource planning system, with a plurality of business entities that each have at least one structured record used by the enterprise resource planning system to characterize the business entity. Thereafter, documents are obtained from a plurality of information sources that characterize events associated with each business entity. It is then determined, using pre-defined business rules, which of the events are pertinent to the company so that enhancement records can be generated for the events determined to be pertinent to the company. These enhancement records characterize the corresponding event and are linked to the structured record for the corresponding business entity. Related apparatus, systems, techniques and articles are also described.
Abstract:
An event type generator may provide a training set for classifying documents with respect to an event type. The event type generator may include a request handler to receive the event type and at least one example document, a text analyzer to extract first entities from the at least one example document, and a result manager to execute a first search against an indexed corpus of documents, to obtain first search results, and further to receive at least one selected document from the first search results. The request handler may extract second entities from the at least one selected document, and execute a second search against the indexed corpus of documents, to obtain second search results. The event type generator may thus provide the at least one example document, the first search results, and the second search results as the training set.
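The two-stage search can be sketched over a toy in-memory corpus. This is only an illustration under stated assumptions: a capitalized-word matcher stands in for the text analyzer, a dictionary stands in for the indexed corpus, and the `select` callback models the selection of documents from the first search results. All names are hypothetical.

```python
import re
from typing import Callable


def extract_entities(text: str) -> set[str]:
    """Stand-in entity extractor: capitalized words of three or more letters."""
    return set(re.findall(r"\b[A-Z][a-z]{2,}\b", text))


def search(corpus: dict[str, str], entities: set[str]) -> list[str]:
    """Return sorted ids of documents that mention any of the entities."""
    return sorted(
        doc_id for doc_id, text in corpus.items()
        if entities & extract_entities(text)
    )


def build_training_set(
    corpus: dict[str, str],
    example_docs: list[str],
    select: Callable[[list[str]], list[str]],
) -> dict[str, list[str]]:
    """Assemble example documents plus first and second search results."""
    first_entities = set().union(*(extract_entities(d) for d in example_docs))
    first_results = search(corpus, first_entities)
    selected = select(first_results)  # documents chosen from the first results
    second_entities = set().union(*(extract_entities(corpus[i]) for i in selected))
    second_results = search(corpus, second_entities)
    return {
        "examples": list(example_docs),
        "first": first_results,
        "second": second_results,
    }
```

The second search widens the set because entities found only in the selected documents can pull in documents that the original example never mentioned.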
Abstract:
A system and method of record matching use regular expressions and finite state representations, which reduces the time (or computational effort) involved in record matching.
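The abstract gives no detail on how the expressions are built or matched, so the following is only a sketch of one related idea: combining per-record regular expressions into a single precompiled alternation so that text is scanned once rather than once per record. Note that Python's `re` module is a backtracking matcher, not a true finite-state automaton, so it only approximates the approach. The records, patterns, and function name are hypothetical.

```python
import re

# Hypothetical records: record id -> regular expression for its name forms.
# Longer alternatives come first so "Corporation" is not cut short at "Corp".
RECORD_PATTERNS = {
    "acme": r"Acme,?\s+(?:Incorporated|Inc\.?)",
    "globex": r"Globex\s+(?:Corporation|Corp\.?)",
}

# One combined pattern with a named group per record, compiled once.
COMBINED = re.compile(
    "|".join(f"(?P<{rid}>{pat})" for rid, pat in RECORD_PATTERNS.items()),
    re.IGNORECASE,
)


def match_records(text: str) -> list[tuple[str, str]]:
    """Return (record id, matched text) pairs found in a single scan."""
    return [(m.lastgroup, m.group(0)) for m in COMBINED.finditer(text)]
```

Each match reports which record it belongs to through the named group, so adding a record only requires a new entry in the pattern table.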
Abstract:
A system may include a record generator to receive a plurality of documents associated with a plurality of suppliers and provide supplier-specific data records based thereon. The record generator may include an event classifier configured to execute a supplier-independent, event-based classification of each document, to thereby obtain event-classified documents. The record generator may include a supplier query generator configured to query the plurality of documents to obtain potential supplier matches from the plurality of suppliers, and a supplier match analyzer configured to analyze each potential supplier match of the potential supplier matches, to thereby obtain supplier matches. The record generator may include a supplier relevance analyzer configured to relate, for each event-classified document, any supplier identified therein to at least one event of the event-classified document, to thereby obtain supplier-event relationships. Thus, the record generator may provide supplier-specific data records, based on the supplier-event relationships.