Abstract:
Methods, systems, devices, and media for creating a plan through multimodal search inputs are provided. A first search request comprises a first input received via a first input mode and a second input received via a different second input mode. The second input identifies a geographic area. First search results, which are based on the first search request and correspond to the geographic area, are displayed. Each of the first search results is associated with a geographic location. A selection of one of the first search results is received and added to a plan. A second search request is received after the selection, and second search results are displayed in response to the second search request. The second search results are based on the second search request and correspond to the geographic location of the selected one of the first search results.
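As a rough illustration of how the two input modes might be combined into a single geographically scoped search, the following Python sketch pairs a spoken or typed query with a map region; the names (GeoArea, SearchRequest, search_places) and the crude planar distance check are illustrative assumptions, not part of the disclosure.

# A minimal sketch: one input supplies the query, the other the geographic area.
from dataclasses import dataclass

@dataclass
class GeoArea:
    lat: float
    lon: float
    radius_km: float

@dataclass
class SearchRequest:
    query: str      # first input, e.g. spoken "Italian restaurants"
    area: GeoArea   # second input, e.g. a region circled on a map

@dataclass
class Result:
    name: str
    lat: float
    lon: float

def within(area: GeoArea, lat: float, lon: float) -> bool:
    # Crude planar distance in km (~111 km per degree); a real system would
    # use haversine distance or a spatial index.
    dist_km = ((area.lat - lat) ** 2 + (area.lon - lon) ** 2) ** 0.5 * 111.0
    return dist_km <= area.radius_km

def search_places(request: SearchRequest, catalog: list[Result]) -> list[Result]:
    # Results match the query text and fall inside the requested area, so each
    # returned result carries its own geographic location.
    return [r for r in catalog
            if request.query.lower() in r.name.lower()
            and within(request.area, r.lat, r.lon)]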
Abstract:
A method of providing hybrid speech recognition between a local embedded speech recognition system and a remote speech recognition system relates to receiving speech from a user at a device communicating with the remote speech recognition system. The system recognizes a first part of the speech by performing a first recognition with the embedded speech recognition system, which accesses private user data that is not available to the remote speech recognition system. The system recognizes a second part of the speech by performing a second recognition with the remote speech recognition system. The final recognition result is a combination of the results of these two recognitions. The private data can be such local information as the user's location, playlists, frequently dialed or texted contacts, contact list information, and so forth.
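A minimal Python sketch of the local/remote split follows; local_recognize and remote_recognize are hypothetical placeholders, and the split point and the private data shown are illustrative only.

PRIVATE_CONTACTS = {"mom", "john smyth", "dr. patel"}  # stays on the device

def local_recognize(audio_segment: bytes) -> str:
    # Placeholder: an embedded recognizer would decode this segment against a
    # grammar or language model biased toward the private data above.
    return "call john smyth"

def remote_recognize(audio_segment: bytes) -> str:
    # Placeholder: a real implementation would send this segment to the
    # remote, general-purpose recognition service.
    return "and tell him I'm running late"

def hybrid_recognize(first_part: bytes, second_part: bytes) -> str:
    # The first part (e.g. "call <contact>") is recognized on the device so the
    # contact list never leaves it; the rest goes to the remote service.
    local_text = local_recognize(first_part)
    remote_text = remote_recognize(second_part)
    # The final result combines the two partial recognitions.
    return f"{local_text} {remote_text}"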
Abstract:
Methods, systems, devices, and media for creating a plan through multimodal search inputs are provided. A multimodal virtual assistant receives a first search request which comprises a geographic area. First search results are displayed in response to the first search request being received. The first search results are based on the first search request and correspond to the geographic area. Each of the first search results is associated with a geographic location. The multimodal virtual assistant receives a selection of one of the first search results, and adds the selected one of the first search results to a plan. A second search request is received after the selection, and second search results are displayed in response to the second search request being received. The second search results are based on the second search request and correspond to the geographic location of the selected one of the first search results.
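A minimal Python sketch of the plan flow follows, in which the most recently selected result anchors the next search; Plan, Place, and nearby are hypothetical names and the distance math is a crude planar approximation.

from dataclasses import dataclass, field

@dataclass
class Place:
    name: str
    lat: float
    lon: float

@dataclass
class Plan:
    items: list[Place] = field(default_factory=list)

    def add(self, place: Place) -> None:
        self.items.append(place)

    def anchor(self) -> Place:
        # The most recently added item anchors subsequent searches.
        return self.items[-1]

def nearby(catalog: list[Place], anchor: Place, query: str, max_km: float = 2.0) -> list[Place]:
    # Second search results correspond to the anchor's geographic location.
    def dist_km(a: Place, b: Place) -> float:
        return ((a.lat - b.lat) ** 2 + (a.lon - b.lon) ** 2) ** 0.5 * 111.0
    return [p for p in catalog
            if query.lower() in p.name.lower() and dist_km(p, anchor) <= max_km]

# Usage: the user selects a restaurant from the first results, then asks for
# "parking nearby"; the second search is centered on that restaurant.
plan = Plan()
plan.add(Place("Trattoria Roma", 40.7415, -73.9871))
parking = nearby([Place("Midtown Parking Garage", 40.7420, -73.9865)],
                 plan.anchor(), "parking")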
Abstract:
A system, method and computer-readable storage devices are disclosed for using targeted clarification (TC) questions in dialog systems in a multimodal virtual agent system (MVA) providing access to information about movies, restaurants, and musical events. In contrast with open-domain spoken systems, the MVA application covers a domain with a fixed set of concepts and uses a natural language understanding (NLU) component to mark concepts in automatically recognized speech. Instead of identifying an error segment, localized error detection (LED) identifies which of the concepts are likely to be present and correct using domain knowledge, automatic speech recognition (ASR), and NLU tags and scores. If at least one concept is identified as present but not correct, the TC component uses this information to generate a targeted clarification question. This approach computes probability distributions of concept presence and correctness for each user utterance, which can support automatic learning of clarification policies.
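A minimal Python sketch of localized error detection followed by a targeted clarification question; the way the ASR and NLU scores are combined, the thresholds, and the prompt strings are illustrative assumptions rather than the model described here.

CONCEPTS = ("movie_title", "location", "time")

def concept_scores(asr_conf: dict[str, float], nlu_conf: dict[str, float]) -> dict[str, dict[str, float]]:
    scores = {}
    for concept in CONCEPTS:
        p_present = nlu_conf.get(concept, 0.0)              # did NLU tag the concept at all?
        p_correct = p_present * asr_conf.get(concept, 0.0)  # was the tagged span heard correctly?
        scores[concept] = {"present": p_present, "correct": p_correct}
    return scores

def targeted_clarification(scores: dict[str, dict[str, float]]) -> str | None:
    # Ask about the concept most likely to be present yet least likely correct.
    suspects = [(c, s) for c, s in scores.items() if s["present"] > 0.5 and s["correct"] < 0.5]
    if not suspects:
        return None
    concept = min(suspects, key=lambda cs: cs[1]["correct"])[0]
    prompts = {
        "movie_title": "Which movie did you mean?",
        "location": "Where did you want to look?",
        "time": "For what time?",
    }
    return prompts[concept]

# Example: the movie title was tagged by NLU but recognized with low ASR confidence.
question = targeted_clarification(concept_scores(
    asr_conf={"movie_title": 0.3, "location": 0.9},
    nlu_conf={"movie_title": 0.8, "location": 0.9},
))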
Abstract:
A system, method and computer-readable storage devices are disclosed for multi-modal interactions with a system via a long-touch gesture on a touch-sensitive display. A system operating per this disclosure can receive a multi-modal input comprising speech and a touch on a display, wherein the speech comprises a pronoun. When the touch on the display has a duration longer than a threshold duration, the system can identify an object within a threshold distance of the touch, associate the object with the pronoun in the speech, to yield an association, and perform an action based on the speech and the association.
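A minimal Python sketch of the long-touch handling; the duration and distance thresholds and the catalog of on-screen objects are illustrative assumptions.

from dataclasses import dataclass

LONG_TOUCH_SECONDS = 0.8
MAX_DISTANCE_PX = 40.0

@dataclass
class Touch:
    x: float
    y: float
    duration: float

@dataclass
class ScreenObject:
    name: str
    x: float
    y: float

def resolve_pronoun(touch: Touch, objects: list[ScreenObject]) -> ScreenObject | None:
    # Only a touch longer than the threshold duration is treated as a deictic gesture.
    if touch.duration < LONG_TOUCH_SECONDS:
        return None
    # Pick the nearest object within the threshold distance of the touch point.
    def dist(o: ScreenObject) -> float:
        return ((o.x - touch.x) ** 2 + (o.y - touch.y) ** 2) ** 0.5
    candidates = [o for o in objects if dist(o) <= MAX_DISTANCE_PX]
    return min(candidates, key=dist) if candidates else None

# "Send this to Alice" plus a long press near a photo thumbnail: the object is
# associated with the pronoun, and the action uses both the speech and the association.
target = resolve_pronoun(Touch(x=120, y=300, duration=1.1),
                         [ScreenObject("photo_123.jpg", 118, 305)])
if target is not None:
    print(f"perform action: send {target.name} to Alice")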
Abstract:
Quantitative attributes and qualitative attributes collected for users having user profiles are extracted from user activity data. The quantitative attributes and the qualitative attributes are extracted during a specified time period determined before the user activity data is collected. Values for the quantitative attributes and the qualitative attributes are plotted, and subsets of the user profiles are clustered into separate groups of users based on the plotted values. Product-related content is delivered to the groups of users based on the clustering.
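A minimal Python sketch of the clustering step, assuming scikit-learn is available; the attribute names, the one-hot encoding of the qualitative attribute, and the cluster count are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Per-profile attributes collected during the specified time period:
# quantitative (sessions per week, minutes per session) and a qualitative
# attribute ("device") encoded as one-hot columns.
profiles = [
    {"sessions": 12, "minutes": 35, "device": "phone"},
    {"sessions": 2,  "minutes": 90, "device": "tv"},
    {"sessions": 9,  "minutes": 20, "device": "phone"},
    {"sessions": 1,  "minutes": 75, "device": "tv"},
]

devices = sorted({p["device"] for p in profiles})
X = np.array([[p["sessions"], p["minutes"]] +
              [1.0 if p["device"] == d else 0.0 for d in devices]
              for p in profiles])

# Cluster the attribute values into separate groups of users.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Each group can then be targeted with product-related content.
for group in set(labels):
    members = [i for i, g in enumerate(labels) if g == group]
    print(f"group {group}: profiles {members}")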
Abstract:
Personalization of speech recognition while maintaining privacy of user data is achieved by transmitting data associated with received speech to a speech recognition service and receiving a result from the speech recognition service. The speech recognition service result is generated from a general-purpose speech language model. The system generates an input finite state machine from the speech recognition result and composes the input finite state machine with a phone edit finite state machine, to yield a resulting finite state machine. The system composes the resulting finite state machine with a user data finite state machine to yield a second resulting finite state machine, and uses a best path through the second resulting finite state machine to yield a user specific speech recognition result.
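A minimal Python sketch of the personalization idea follows; where the disclosure composes finite state machines, the sketch substitutes a plain edit-distance match of word spans against on-device user data and keeps the best-scoring substitution as a stand-in for the best path.

from difflib import SequenceMatcher

USER_CONTACTS = ["John Smyth", "Jane Smith", "Dr. Patel"]  # never sent to the service

def personalize(generic_result: str, user_entries: list[str], min_ratio: float = 0.75) -> str:
    words = generic_result.split()
    best = generic_result
    best_score = min_ratio
    # Try replacing each word span with a user-specific entry; keep the
    # substitution whose surface form is closest to what was recognized
    # (a crude stand-in for the best path through the composed machines).
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            span = " ".join(words[start:end])
            for entry in user_entries:
                score = SequenceMatcher(None, span.lower(), entry.lower()).ratio()
                if score > best_score:
                    best_score = score
                    best = " ".join(words[:start] + [entry] + words[end:])
    return best

# "call jon smith" from the general-purpose model becomes the user-specific
# "call John Smyth" without the contact list ever leaving the device.
print(personalize("call jon smith", USER_CONTACTS))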