Abstract:
Language analysis means 21 analyzes texts read from a text DB 11, and generates a sentence structure as the analysis result. Similar-structure generation adjustment means 25 generates, from an input of an input device, a determination item for determining whether or not the structures are identical every type of differences between the sentence structures. Similar-structure determination adjustment means 26 generates, from an input of the input device 6, a determination item for determining whether or not the difference between attribute values is ignored every type of attribute values. Similar-structure generating means 22 generates a similar structure of a partial structure forming the sentence structure obtained by language analysis means 21 in accordance with the determination item from the similar-structure generation adjustment means 25, and sets the generated similar structure as an equivalent class of the partial structure on the generation source. Frequent-similar-pattern detection means 24 ignores the attribute value in accordance with the determination item given from the similar-structure determination adjustment means 26, detects the frequent pattern on the basis of a set of equivalent classes from the similar-structure generating means 22, and outputs the frequent pattern to an output device 3.
Abstract:
A system for supervised automatic code generation and tuning for natural language interaction applications, comprising a build environment comprising a developer user interface, automated coding tools, automated testing tools, and automated optimization tools, and an analytics framework software module. Text samples are imported into the build environment and automated clustering is performed to assign them to a plurality of input groups, each input group comprising a plurality of semantically related inputs. Language recognition rules are generated by automated coding tools. Automated testing tools carry out automated testing of language recognition rules and generate recommendations for tuning language recognition rules. The analytics framework performs analysis of interaction log files to identify problems in a candidate natural language interaction application. Optimizations to the candidate natural language interaction application are carried out and an optimized natural language interaction application is deployed into production and stored in the solution data repository.
Abstract:
Embodiments of the present invention provide a method, system and computer program product for the domain specific normalization of a corpus of text. In an embodiment of the invention, a method for domain specific normalization of a corpus of text is provided, including an industrial, organization, demographic or geographic domain. The method includes loading a corpus of text in memory of a computer and determining a domain for the corpus of text. The method also includes retrieving a lexicon of replacement words for the determined domain. Finally, the method includes text simplifying the corpus of text using the retrieved lexicon. In one aspect of the embodiment, the domain is determined through inference based upon words already presence in the corpus of text. In another aspect of the embodiment, the domain is determined based upon meta-data provided with the corpus of text.
Abstract:
As a document analysis system to calculate a similarity degree between texts with high accuracy, an information processing device includes: a common character string calculation unit to extract character strings that are common between two texts and to determine whether or not the two texts are to be set as calculation objects based on a number of the extracted character strings that are common; and a similarity degree calculation unit to calculate, when the two texts are the determined calculation objects, a similarity degree therebetween by using an approximation of a Kolmogorov complexity, and when the two texts are not the calculation objects, handling the similarity degree between the two texts as being dissimilar.
Abstract:
Provided is a text mining device that performs an analysis properly with respect to a difference between plural related document data. Equipped are an element extracting section 140 that extracts language elements from related two or more document data respectively; a differential processing section 150 that extracts a difference between the document data by comparing the elements between the document data which were extracted by the element extracting means 140; and a statistical processing section 170 that performs statistical processing on the difference extracted by the differential processing section 150. The differential processing section 150 has: element associating section 151 that associates respective elements which are in identical, similar, synonymous, or analogous relation by comparing the elements of the document data between the document data which were extracted by the element extracting section 140; and differential element extracting section 152 that extracts an element with no corresponding element of a pair in the association by the element association section 151.
Abstract:
A method and system for splitting a text document into individual sentences using sentence boundary detection, and establishing co-relationships between terms which are present in the same sentence. A document corpus, or collection of text records, is provided, containing text with terms to be extracted. The text records in the document corpus are divided into individual sentences, using a set of rules for sentence boundary detection. The individual sentences are then analyzed to extract and correlate terms, such as parts and symptoms, symptoms and actions, or parts and failure modes. The correlated terms are then validated based on frequency of occurrence, with term pairs being considered valid if their frequency of occurrence exceeds a minimum frequency threshold. The validated term correlations can be used for fault model development, document classification, and document clustering.
Abstract:
A paraphrase model of a question text inputted by a user is learned, and a paraphrase expression is generated in real time. When information in text set storage unit is updated, text pair extracting unit extracts a paraphrase text pair from the text set storage unit and stores it in text pair storage unit. Model learning unit learns a question text paraphrase model from the paraphrase text pair in text pair storage unit, and stores it in model storage unit. Text pair extracting unit extracts a paraphrase text pair again from the text set storage unit by using the question text paraphrase model which the model storage unit possesses, and stores it in the text pair storage unit. In case where the stored paraphrase text pair is the same as the paraphrase text pair stored in the text pair storage unit, learning of the question text paraphrase model is ended. Candidate creating unit reads the question text paraphrase model from the model storage unit and generates a paraphrase candidate of the inputted question text.
Abstract:
A system and method for extraction of suggestions for improvement form a corpus of documents, such as customer reviews, are disclosed. A structured terminology provided or a topic includes a set of semantic classes, each including a set of terms. A thesaurus of terms relating to suggestions of improvement is provided. Text elements of text strings in the documents which are instances of terms in the structured terminology are labeled with the corresponding semantic class and text elements which are instances of terms in the thesaurus are also labeled. A set of patterns is applied to the labeled text strings to identify suggestions of improvement expressions. The patterns define syntactic relations between text elements, some of which are required to be instances of one of the terms in a particular semantic class or thesaurus. A set of suggestions for improvements is output based on the identified suggestions of improvement expressions.
Abstract:
An automatic answering device and an automatic answering method for automatically answering to a user utterance are configured: to prepare a conversation scenario that is a set of input sentences and replay sentences, the input sentences each corresponding to a user utterance assumed to be uttered by a user, the reply sentences each being an automatic reply to the inputted sentence; to accept a user utterance; to determine the reply sentence to the accepted user utterance on the basis of the conversation scenario; and to present the determined reply sentence to the user. Data of the conversation scenario have a data structure that enables the inputted sentences and the reply sentences to be expressed in a state transition diagram in which each of the inputted sentences is defined as a morphism and the reply sentence corresponding to the inputted sentence is defined as an object.
Abstract:
A system and method are described for generating semantically similar sentences for a statistical language model. A semantic class generator determines for each word in an input utterance a set of corresponding semantically similar words. A sentence generator computes a set of candidate sentences each containing at most one member from each set of semantically similar words. A sentence verifier grammatically tests each candidate sentence to determine a set of grammatically correct sentences semantically similar to the input utterance. Also note that the generated semantically similar sentences are not restricted to be selected from an existing sentence database.