摘要:
An audio information system that may be used to form and convey an audio message having speech overlapped with non-speech audio is provided. The system has components to store a context indicator having non-speech audio to signify a characteristic of a speech content stream, to merge the context indicator with the speech content stream to form an integrated message, and to output the integrated message. The message has overlapping non-speech audio from the context indicator and speech audio. The system also has mechanisms to vary the format of integrated message generated in order to train the user on non-speech cues. In addition, other aspects of the present invention relating to the audio information system receiving content and generating an audio message are described.
摘要:
The waveform generation device can reproduce waveform data of various sounds stored in a memory at a reproduction velocity of reproducing waveforms at a real time and is provided with a storage system for storing the waveform data of a waveform sequence with the plural waveforms arrange in time series, a pitch information input system for entering pitch information, a time information generating system for generating time information changing at a velocity having no connection with the pitch to be reproduced, and a reproduction means for reproducing the waveform data to be read at the reproduction pitch waveform data of the pitch information input system, including a reading system for reading the waveform data from the storage system at a desired reading velocity corresponding to the time information and the pitch information yet having no connection with the velocity of changing the time information.
摘要:
A speech-generating computer apparatus for generating speech from electronic forms, a method of controlling a computer and a computer-readable media containing program code embodying an application program for performing a method of generating speech. The computer has a speech-generating function and at least one screen reader program. The at least one screen reader program generates human perceptible speech with the speech-generating function. The computer determines if a particular screen reader program is active and initializes an object in a format of a particular screen reader program that is active.
摘要:
A method of proofreading and correcting dictated text contained in an electronic document comprises the steps of: selecting proofreading criteria for identifying textual errors contained in the electronic document; playing back each word contained in the electronic document; and, marking as a textual error each played back word in nonconformity with at least one of the proofreading criteria. The method can further comprise the step of editing each the marked textual error identified in the marking step. In particular, the editing step can include reviewing each the marked textual error identified in the marking step; accepting user specified changes to each marked textual error reviewed in the reviewing step; and, unmarking each marked textual error corrected by the user in the accepting step. Also, the reviewing step can include highlighting each the word in the electronic document corresponding to the marked textual error marked in the marking step; and, displaying an explanation for each marked textual error in a user interface. Moreover, the reviewing step can further include suggesting a recommended change to the marked textual error; displaying the recommended change in the user interface; and, accepting a user specified preference to substitute the recommended change for the marked textual error.
摘要:
A method and system are provided for performing recorded word concatenation to create a natural sounding sequence of words, numbers, phrases, sounds, etc. for example. The method and system may include a tonal pattern identification unit that identifies tonal patterns, such as pitch accents, phrase accents and boundary tones, for utterances in a particular domain, such as telephone numbers, credit card numbers, the spelling of words, etc.; a script designer that designs a script for recording a string of words, numbers, sounds etc., based on an appropriate rhythm and pitch range in order to obtain natural prosody for utterances in the particular domain and with minimum coarticulation between concatenative units; a script recorder that records a speaker's utterances of the domain strings; a recording editor that edits the recorded strings by marking the beginning and end of each word, number etc. in the string and including or inserting pauses according to the tonal patterns; and a concatenation unit that concatenates the edited recording into a smooth and natural sounding string of words, numbers, letters of the alphabet, etc., for audio output.
摘要:
A multi-user e-mail reader system allows several users to access their e-mail accounts simultaneously and have the e-mail messages played back with speech synthesis. The user navigates through various functional states of the system using either touch-tone keypad commands or optionally voiced commands interpreted by a speech recognizer. Users can send reply e-mail messages without the use of a computer, by invoking the system's text processor. The text processor operates in conjunction with a keypad-to-ASCII conversion mechanism that allows fully punctuated and properly addressed e-mail messages to be composed from the touch-tone phone. Digital audio sound file attachments may be recorded through the telephone handset and attached to an outgoing e-mail message. A system for storing canned messages allows the user to quickly send pre-composed reply messages, either as stored or after editing using the text processor. The text processor uses a virtual cursor pointer that may be indexed forward and backward at different granularities, depending on whether the system is in play mode or record mode. The granularity can also be changed by the user.
摘要:
An apparatus and method for speech-text-transmit communication over data networks includes speech recognition devices and text to speech conversion devices that translate speech signals input to the terminal into text and text data received from a data network into speech output signals. The speech input signals are translated into text based on phonemes obtained from a spectral analysis of the speech input signals. The text data is transmitted to a receiving party over the data network as a plurality of text data packets such that a continuous stream of text data is obtained. The receiving party's terminal receives the text data and may immediately display the text data and/or translate it into speech output signals using the text to speech conversion device. The text to speech conversion device uses speech pattern data stored in a speech pattern database for synthesizing a human voice for playing of the speech output signals using a speech output device.
摘要:
A plurality of tasks are set in a speech synthesizing process, in which at least one of speakers, emotion or situation at the time speeches are made, and contents of the speeches, is different, and word dictionaries, prosody dictionaries, and waveform dictionaries corresponding to respective tasks are organized. When a character string to be synthesized is input with the task specified through, for example, a game system, a speech synthesizing process is performed using the word dictionary, the prosody dictionary, and the waveform dictionary corresponding to the specified task. Therefore, a speech message can be generated depending on the personality of a speaker, the emotion or situation at the time when a speech is made, and the contents of the speech.
摘要:
A frame erasure compensation method in a variable-rate speech coder includes quantizing, with a first encoder, a pitch lag value for a current frame and a first delta pitch lag value equal to the difference between the pitch lag value for the current frame and the pitch lag value for the previous frame. A second, predictive encoder quantizes only a second delta pitch lag value for the previous frame (equal to the difference between the pitch lag value for the previous frame and the pitch lag value for the frame prior to that frame). If the frame prior to the previous frame is processed as a frame erasure, the pitch lag value for the previous frame is obtained by subtracting the first delta pitch lag value from the pitch lag value for the current frame. The pitch lag value for the erasure frame is then obtained by subtracting the second delta pitch lag value from the pitch lag value for the previous frame. Additionally, a waveform interpolation method may be used to smooth discontinuities caused by changes in the coder pitch memory.
摘要:
The duration of speech varies according to the characteristics of pronounced speech and pronouncing habit of the speaker. In the speech duration processing method and apparatus of this invention, a large amount of natural speech was analyzed, and the following was known: Speech duration of monosyllables will vary according to factors, such as phonemes, tones, phrase construction, locations in the phrases, locations in the sentence, and front and rear connected phonemes, etc. of the syllables. Through the use of these varying factors, a “speech duration parameter storage portion” for speech duration parameters is constructed. By retrieving the speech duration parameters and combining the same with the basic speech duration of a syllable during syllable speech duration calculation, the speech duration of each monosyllable in any sentence can be accurately decided. As recognized from experimental results, a text-to-speech system using the speech duration processing apparatus of this invention can synthesize speech with natural speech duration.