Abstract:
A voice emphasizing device emphasizes in a speech a “strained rough voice” at a position where a speaker or user of the speech intends to generate emphasis or musical expression. Thereby, the voice emphasizing device can provide the position with emphasis of anger, excitement, tension, or an animated way of speaking, or musical expression of Enka (Japanese ballad), blues, rock, or the like. As a result, rich vocal expression can be achieved. The voice emphasizing device includes: an emphasis utterance section detection unit (12) detecting, from an input speech waveform, an emphasis section that is a time duration having a waveform intended by the speaker or user to be converted; and a voice emphasizing unit (13) increasing fluctuation of an amplitude envelope of the waveform in the detected emphasis section.
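The abstract does not state how the fluctuation of the amplitude envelope is increased. As a minimal sketch, assuming the emphasis section is already known as a sample range and that a sinusoidal amplitude modulation is an acceptable stand-in, the voice emphasizing unit could look as follows. The function name, `mod_freq_hz`, and `depth` are hypothetical choices for illustration, not values taken from the source.

```python
import math


def emphasize_section(samples, sample_rate, start, end,
                      mod_freq_hz=80.0, depth=0.4):
    """Return a copy of `samples` with amplitude modulation on [start, end).

    Assumption: a sine-shaped gain stands in for the unspecified method of
    increasing amplitude-envelope fluctuation. `mod_freq_hz` and `depth`
    are illustrative defaults only.
    """
    out = list(samples)
    for i in range(start, min(end, len(out))):
        t = (i - start) / sample_rate  # time since the section began
        gain = 1.0 + depth * math.sin(2.0 * math.pi * mod_freq_hz * t)
        out[i] = samples[i] * gain
    return out
```

Samples outside the detected section are returned unchanged, so only the emphasis section is altered.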
Abstract:
Speech parameters P_h and P_l are derived for consonant classification and recognition by separating a speech signal into low and high frequency bands and then, in each band, obtaining the time first-derivative and taking its min-max difference (the power dip). The position of (P_l, P_h) in a two-dimensional discriminant diagram classifies the consonant phoneme.
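The power-dip computation can be sketched as below, assuming each band has already been reduced to a sequence of per-frame power values (the band-splitting filter is not shown) and approximating the time derivative with a first difference. The function names are hypothetical, and the regions of the discriminant diagram are not given in the abstract, so the sketch stops at the feature pair.

```python
def power_dip(band_power):
    """Min-max difference of the first difference of a band-power sequence."""
    derivative = [b - a for a, b in zip(band_power, band_power[1:])]
    if not derivative:
        return 0
    return max(derivative) - min(derivative)


def dip_features(low_band_power, high_band_power):
    """Return the (P_l, P_h) point to be placed on the discriminant diagram."""
    return power_dip(low_band_power), power_dip(high_band_power)
```

For example, a band whose power drops by 3 and later recovers by 3 yields a dip value of 6, while a flat band yields 0.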
Abstract:
A vehicle control device (10) is provided that can predict a driver's driving operation earlier and respond to it quickly. The vehicle control device (10) includes: a posture measuring unit (11) to measure a posture indicating a state of at least one of the driver's buttock region, the upper pelvic region, and the leg opposite to the leg with which the driver operates a brake or an accelerator; a posture change detection unit (12) to detect a change in the measured posture; a preparatory movement identification unit (13) to identify, based on whether the detected posture change satisfies a predetermined condition, whether the posture change is caused by a preparatory movement that the driver makes spontaneously before operating the brake or accelerator; and a vehicle control unit (14) to control the vehicle when the posture change is identified as having been caused by the preparatory movement.
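The detection and identification steps can be illustrated with a small sketch. The abstract does not define the predetermined condition, so the one used here (the change stays beyond a threshold for several consecutive frames) is purely an assumption, as are the function names and the use of a single scalar posture signal.

```python
def detect_posture_change(readings, baseline_frames=3):
    """Per-frame change of a posture signal relative to an initial baseline."""
    baseline = sum(readings[:baseline_frames]) / baseline_frames
    return [r - baseline for r in readings]


def is_preparatory_movement(changes, threshold, min_frames):
    """Hypothetical condition: |change| >= threshold for min_frames in a row."""
    run = 0
    for change in changes:
        run = run + 1 if abs(change) >= threshold else 0
        if run >= min_frames:
            return True
    return False
```

In this sketch a vehicle control unit would act only when `is_preparatory_movement` returns `True`; a brief one-frame spike does not satisfy the condition.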
Abstract:
In an acoustic model producing apparatus, a plurality of noise samples are categorized into clusters such that the number of clusters is smaller than the number of noise samples. One noise sample is selected from each cluster, and the selected noise samples are set as second noise samples for training. Untrained acoustic models stored on a storage unit are then trained using the second noise samples for training, thereby producing trained acoustic models for speech recognition.
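The clustering and selection step can be sketched as follows. The abstract does not name a clustering method, so plain k-means with a deterministic initialization is swapped in here as an assumption, and each noise sample is assumed to be already summarized as a numeric feature vector. The function name is hypothetical.

```python
import math


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def select_representative_noise(samples, k, iterations=20):
    """Cluster `samples` into k groups and return one sample index per cluster.

    Assumption: k-means stands in for the unspecified clustering method, and
    the sample nearest each centroid is chosen as the representative.
    """
    centroids = [list(samples[i]) for i in range(k)]  # first k samples as seeds
    assignment = [0] * len(samples)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(s, centroids[c]))
            for s in samples
        ]
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                centroids[c] = _mean(members)
    chosen = []
    for c in range(k):
        member_ids = [i for i, a in enumerate(assignment) if a == c]
        if member_ids:
            chosen.append(
                min(member_ids, key=lambda i: math.dist(samples[i], centroids[c]))
            )
    return chosen
```

The returned indices identify the reduced set of noise samples that would then be used for training, which is smaller than the original set whenever `k` is smaller than the number of samples.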
Abstract:
A video retrieval data generation apparatus includes an extractor that is configured to extract a characteristic pattern from a voice signal synchronous with a video signal. The video retrieval data generation apparatus also includes an index generator that is configured to set the voice signal for a voice period as a processing target. The index generator is further configured to prepare a standard voice pattern for each of a plurality of subwords, detect, for each subword, a characteristic pattern similar to the standard voice pattern in each of the voice periods, and generate, for each subword, an index containing time synchronization information corresponding to the position where the similar characteristic pattern is detected. The video retrieval data generation apparatus also includes a multiplexer that is configured to multiplex video signals, voice signals and indexes for output in a data stream format.
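A minimal sketch of the index generator follows, assuming the characteristic patterns are per-frame feature vectors, that "similar" means an average Euclidean distance below a threshold, and that the time synchronization information is a start time in milliseconds. The function name and all parameters are hypothetical.

```python
import math


def build_subword_index(features, frame_period_ms, standard_patterns,
                        max_distance):
    """Map each subword to the start times (ms) where its pattern is detected.

    features: list of per-frame feature vectors from the voice signal.
    standard_patterns: dict of subword -> list of per-frame pattern vectors.
    Assumption: mean per-frame Euclidean distance is the similarity measure.
    """
    index = {subword: [] for subword in standard_patterns}
    for subword, pattern in standard_patterns.items():
        width = len(pattern)
        for start in range(len(features) - width + 1):
            window = features[start:start + width]
            distance = sum(
                math.dist(f, p) for f, p in zip(window, pattern)
            ) / width
            if distance <= max_distance:
                index[subword].append(start * frame_period_ms)
    return index
```

The resulting dictionary corresponds to the per-subword index that would be multiplexed with the video and voice signals.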
Abstract:
A set of "m" feature parameters is generated every frame from reference speech which is spoken by at least one speaker and which represents recognition-object words, where "m" denotes a preset integer. A set of "n" types of standard patterns is previously generated on the basis of speech data of a plurality of speakers, where "n" denotes a preset integer. Matching between the feature parameters of the reference speech and each of the standard patterns is executed to generate a vector of "n" reference similarities between the feature parameters of the reference speech and each of the standard patterns every frame. The reference similarity vectors of respective frames are arranged into temporal sequences corresponding to the recognition-object words respectively. The reference similarity vector sequences are previously registered as dictionary similarity vector sequences. Input speech to be recognized is analyzed to generate "m" feature parameters from the input speech. Matching between the feature parameters of the input speech and the standard patterns is executed to generate a vector of "n" input-speech similarities between the feature parameters of the input speech and the standard patterns every frame. The input-speech similarity vectors of respective frames are arranged into a temporal sequence. The input-speech similarity vector sequence is collated with the dictionary similarity vector sequences to recognize the input speech.
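The similarity-vector idea can be sketched as below. The abstract does not specify the similarity measure or the collation method, so negative Euclidean distance and dynamic time warping are used here as assumptions, and all function names are hypothetical.

```python
import math


def similarity_vector(frame, standard_patterns):
    """n similarities between one frame of m parameters and n standard patterns."""
    return [-math.dist(frame, pattern) for pattern in standard_patterns]


def to_similarity_sequence(frames, standard_patterns):
    """Temporal sequence of similarity vectors, one per frame."""
    return [similarity_vector(f, standard_patterns) for f in frames]


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two similarity vector sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[rows][cols]


def recognize(input_frames, dictionary, standard_patterns):
    """Return the dictionary word whose registered sequence is closest."""
    input_seq = to_similarity_sequence(input_frames, standard_patterns)
    return min(dictionary, key=lambda word: dtw_distance(input_seq, dictionary[word]))
```

Here `dictionary` maps each recognition-object word to the similarity vector sequence registered from the reference speech, so input and reference are compared in the n-dimensional similarity space rather than on the raw feature parameters.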
Abstract:
A video retrieval apparatus includes a retrieval data generator that is configured to extract a characteristic pattern from a voice signal synchronous with a video signal to generate an index for video retrieval. The video retrieval apparatus also includes a retrieval processor that is configured to receive a key word from a retriever and collate the key word with the index to retrieve a desired video. The retrieval data generator includes a multiplexer that is configured to multiplex video signals, voice signals and indexes for output in a data stream format. The retrieval processor includes a demultiplexer that is configured to demultiplex the multiplexed data stream into the video signals, the voice signals and the indexes. A video reproduction apparatus may collate a visual pattern of the key word with visual pattern data of the video signal at the time a person vocalizes a sound, using that data as the index for retrieval.
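Collating a key word with the index can be sketched as follows, assuming the key word has already been converted to a subword sequence and that the index maps each subword to sorted detection times in milliseconds. The matching rule (each subword must follow the previous one within `max_gap_ms`) and the function name are assumptions for illustration.

```python
def find_keyword(index, subwords, max_gap_ms):
    """Return start times where `subwords` occur in order in the index.

    index: dict of subword -> list of detection times in milliseconds.
    Assumption: consecutive subwords must be at most max_gap_ms apart.
    """
    hits = []
    for start in index.get(subwords[0], []):
        current = start
        for subword in subwords[1:]:
            later = [
                t for t in index.get(subword, [])
                if current < t <= current + max_gap_ms
            ]
            if not later:
                break  # chain broken: this start time is not a match
            current = min(later)
        else:
            hits.append(start)  # every subword was found in order
    return hits
```

Each returned time marks a position in the video where the key word is likely spoken, which a retrieval processor could use to seek to the desired scene.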
Abstract:
An inter-frame similarity between an input voice and a standard patterned word is calculated for each of frames and for each of standard patterned words, and a posterior probability similarity is produced by subtracting a constant value from each of the inter-frame similarities. The constant value is determined by analyzing voice data obtained from specified persons to set the posterior probability similarities to positive values when a word existing in the input voice matches with the standard patterned word and to set the posterior probability similarities to negative values when a word existing in the input voice does not match with the standard patterned word. Thereafter, an accumulated similarity having an accumulated value obtained by accumulating values of the posterior probability similarities according to a continuous dynamic programming matching operation for the frames of the input voice is calculated for each of the standard patterned words. Thereafter, a particular standard patterned word relating to an accumulated similarity having a maximum value among the accumulated similarities is output as a recognized word of the input voice.
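The effect of subtracting the constant can be illustrated with a simplified continuous dynamic programming sketch. The recursion below, the free start at any input frame, and the caller-supplied `similarity` function are assumptions, since the abstract does not give the exact matching rules. The constant is passed in as `bias` rather than estimated from voice data.

```python
def spot_word(input_frames, templates, bias, similarity):
    """Return (word, score) with the highest accumulated shifted similarity.

    templates: dict of word -> list of standard-pattern frames.
    bias: constant subtracted so matches score > 0 and mismatches < 0.
    Assumption: a word may start at any input frame (continuous DP).
    """
    best_word, best_score = None, float("-inf")
    for word, template in templates.items():
        prev = None  # accumulated scores at the previous input frame
        for frame in input_frames:
            cur = []
            for j, ref in enumerate(template):
                shifted = similarity(frame, ref) - bias
                if j == 0:
                    # Either start a new match here or extend the old one.
                    carry = max(0.0, prev[0]) if prev is not None else 0.0
                else:
                    candidates = [cur[j - 1]]
                    if prev is not None:
                        candidates += [prev[j], prev[j - 1]]
                    carry = max(candidates)
                cur.append(shifted + carry)
            prev = cur
            if cur[-1] > best_score:  # score of a match ending at this frame
                best_word, best_score = word, cur[-1]
    return best_word, best_score
```

Because mismatching frames contribute negative values, accumulated scores grow only along stretches of the input that actually match a standard pattern, so the maximum picks out the spoken word without explicit endpoint detection.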
Abstract:
A method of speech recognition includes the steps of analyzing input speech every frame and deriving feature parameters from the input speech, generating an input vector from the feature parameters of a plurality of frames, and periodically calculating partial distances between the input vector and partial standard patterns while shifting the frame one by one. Standard patterns correspond to recognition-object words respectively, and each of the standard patterns is composed of the partial standard patterns which represent parts of the corresponding recognition-object word respectively. The partial distances are accumulated into distances between the input speech and the standard patterns. The distances correspond to the recognition-object words respectively. The distances are compared with each other, and a minimum distance of the distances is selected when the input speech ends. One of the recognition-object words which corresponds to the minimum distance is decided to be a recognition result.
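A heavily simplified sketch of accumulating partial distances is given below. It assumes that the input vector is a concatenation of `width` consecutive frames, that each word supplies exactly one partial standard pattern per frame shift (so utterance and pattern lengths match), and that Euclidean distance is the partial distance. None of these details are stated in the abstract, and the function names are hypothetical.

```python
import math


def stack_frames(frames, width):
    """Build one input vector per one-frame shift from `width` frames each."""
    return [
        sum(frames[i:i + width], [])
        for i in range(len(frames) - width + 1)
    ]


def recognize_by_partial_distances(frames, word_patterns, width):
    """Accumulate partial distances per word and pick the minimum.

    word_patterns: dict of word -> list of partial standard patterns.
    Assumption: the k-th partial pattern is compared with the k-th shift.
    """
    vectors = stack_frames(frames, width)
    totals = {}
    for word, partials in word_patterns.items():
        totals[word] = sum(
            math.dist(vector, partial)
            for vector, partial in zip(vectors, partials)
        )
    return min(totals, key=totals.get), totals
```

The word with the smallest accumulated distance at the end of the input is returned as the recognition result, mirroring the final selection step described above.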