Deep networks for unit selection speech synthesis
    11.
    Granted invention patent
    Status: in force

    Publication number: US09460704B2

    Publication date: 2016-10-04

    Application number: US14019967

    Application date: 2013-09-06

    Applicant: Google Inc.

    CPC classification number: G10L13/06 G10L25/30

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for providing a representation based on structured data in resources. The methods, systems, and apparatus include actions of receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features. Additional actions include determining a distance between the target acoustic features and acoustic features of a stored acoustic sample. Further actions include selecting the acoustic sample to be used in speech synthesis based at least on the determined distance and synthesizing speech based on the selected acoustic sample.
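
    Sketch: the selection step in this abstract can be illustrated in a few lines of Python. The abstract does not name a distance measure, so Euclidean distance is an assumption here, and `select_unit`, the unit ids, and the feature values are hypothetical. In the described method the target features would come from the trained neural network rather than being hard-coded.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_unit(target_features, stored_units):
    """Return the id of the stored sample closest to the predicted target.

    stored_units maps a (hypothetical) unit id to that unit's acoustic features.
    """
    return min(
        stored_units,
        key=lambda uid: euclidean(target_features, stored_units[uid]),
    )

# Hypothetical data standing in for network output and a unit database.
target = [0.2, 0.9, 0.4]
units = {"unit_a": [0.0, 1.0, 0.5], "unit_b": [0.9, 0.1, 0.3]}
best = select_unit(target, units)
```

    A full unit selection system would typically weigh this target distance together with other costs; the sketch covers only the distance-based choice the abstract mentions.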

    Cluster specific speech model
    12.
    Granted invention patent
    Status: in force

    Publication number: US09401143B2

    Publication date: 2016-07-26

    Application number: US14663610

    Application date: 2015-03-20

    Applicant: Google Inc.

    CPC classification number: G10L15/063 G10L15/183 G10L2015/0631

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving data representing acoustic characteristics of a user's voice; selecting a cluster for the data from among a plurality of clusters, where each cluster includes a plurality of vectors, and where each cluster is associated with a speech model trained by a neural network using at least one or more vectors of the plurality of vectors in the respective cluster; and in response to receiving one or more utterances of the user, providing the speech model associated with the cluster for transcribing the one or more utterances.
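
    Sketch: one way to picture the cluster lookup in this abstract is a nearest-centroid search. The abstract does not say how a cluster is selected, so the centroid rule, the squared Euclidean distance, and the names `select_cluster`, `clusters`, and `models` are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def select_cluster(voice_vector, clusters):
    """Return the id of the cluster whose centroid is nearest to voice_vector."""
    centroids = {cid: centroid(vectors) for cid, vectors in clusters.items()}
    return min(
        centroids,
        key=lambda cid: squared_distance(voice_vector, centroids[cid]),
    )

# Hypothetical clusters of voice vectors and the speech model tied to each one.
clusters = {
    "cluster_0": [[0.0, 0.0], [0.2, 0.0]],
    "cluster_1": [[1.0, 1.0], [0.8, 1.0]],
}
models = {"cluster_0": "speech_model_0", "cluster_1": "speech_model_1"}

chosen = select_cluster([0.9, 0.8], clusters)
model_for_user = models[chosen]  # model provided for transcribing the user's utterances
```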

    GENERATING REPRESENTATIONS OF ACOUSTIC SEQUENCES USING PROJECTION LAYERS
    13.
    Invention application
    Status: in force

    Publication number: US20150161991A1

    Publication date: 2015-06-11

    Application number: US14557725

    Application date: 2014-12-02

    Applicant: Google Inc.

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating phoneme representations of acoustic sequences using projection sequences. One of the methods includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps, processing the acoustic feature representation through each of one or more long short-term memory (LSTM) layers; and for each of the plurality of time steps, processing the recurrent projected output generated by the highest LSTM layer for the time step using an output layer to generate a set of scores for the time step.
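
    Sketch: the per-time-step processing in this abstract can be approximated with a single LSTM layer whose cell output is passed through a projection before being fed back and forwarded to the output layer. This is a simplified illustration, not the claimed architecture: it uses one layer, random untrained weights, no biases or peephole connections, and a softmax output layer, all of which are assumptions. The class and function names are hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class ProjectedLSTMLayer:
    def __init__(self, n_in, n_cell, n_proj, rng):
        # One weight matrix per gate (input, forget, output, candidate),
        # applied to the input concatenated with the previous projected output.
        self.gates = {g: rand_matrix(n_cell, n_in + n_proj, rng) for g in "ifog"}
        self.proj = rand_matrix(n_proj, n_cell, rng)
        self.n_cell, self.n_proj = n_cell, n_proj

    def step(self, x, r_prev, c_prev):
        z = x + r_prev  # list concatenation
        i = [sigmoid(a) for a in matvec(self.gates["i"], z)]
        f = [sigmoid(a) for a in matvec(self.gates["f"], z)]
        o = [sigmoid(a) for a in matvec(self.gates["o"], z)]
        g = [math.tanh(a) for a in matvec(self.gates["g"], z)]
        c = [fi * cp + ii * gi for fi, cp, ii, gi in zip(f, c_prev, i, g)]
        m = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        r = matvec(self.proj, m)  # recurrent projected output
        return r, c

def score_sequence(frames, layer, out_weights):
    """Run the layer over all time steps and score each projected output."""
    r, c = [0.0] * layer.n_proj, [0.0] * layer.n_cell
    scores = []
    for x in frames:
        r, c = layer.step(x, r, c)
        scores.append(softmax(matvec(out_weights, r)))
    return scores

rng = random.Random(0)
layer = ProjectedLSTMLayer(n_in=3, n_cell=8, n_proj=4, rng=rng)
out_weights = rand_matrix(5, 4, rng)  # 5 hypothetical output classes
frames = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
scores = score_sequence(frames, layer, out_weights)
```

    The projection keeps the recurrent state (4 values here) smaller than the cell state (8 values), which is the usual motivation for a projection layer in this kind of model.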

    Speech recognition with acoustic models

    Publication number: US09818410B2

    Publication date: 2017-11-14

    Application number: US14983315

    Application date: 2015-12-29

    Applicant: Google Inc.

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; processing the sequence of modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final CTC output layer to generate a neural network output, wherein processing the sequence of modified frames of acoustic data comprises: subsampling the modified frames of acoustic data; and processing each subsampled modified frame of acoustic data through the acoustic modeling neural network.
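
    Sketch: the stacking and subsampling steps named in this abstract can be shown on plain lists. The abstract gives no stack size or subsampling rate, so the values below (3 and 3), the choice to stack each frame with the frames that follow it, and the function names are assumptions. The recurrent network and CTC output layer are not modeled.

```python
def stack_frames(frames, stack_size):
    """Concatenate each frame with the following stack_size - 1 frames."""
    stacked = []
    for t in range(len(frames) - stack_size + 1):
        window = frames[t:t + stack_size]
        stacked.append([value for frame in window for value in frame])
    return stacked

def subsample(frames, step):
    """Keep every step-th frame, starting with the first."""
    return frames[::step]

# Ten hypothetical 2-dimensional acoustic frames.
frames = [[float(t), float(t) + 0.5] for t in range(10)]

stacked = stack_frames(frames, stack_size=3)  # 8 modified frames, 6 values each
reduced = subsample(stacked, step=3)          # modified frames 0, 3 and 6
```

    Each retained frame still covers several original frames, so the network sees fewer time steps without discarding the intervening acoustic data outright.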

    SPEECH RECOGNITION WITH ACOUSTIC MODELS
    15.
    Invention application
    Status: in force

    Publication number: US20160372119A1

    Publication date: 2016-12-22

    Application number: US14983315

    Application date: 2015-12-29

    Applicant: Google Inc.

    Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media for learning pronunciations from acoustic sequences. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a sequence of multiple frames of acoustic data at each of a plurality of time steps; stacking one or more frames of acoustic data to generate a sequence of modified frames of acoustic data; processing the sequence of modified frames of acoustic data through an acoustic modeling neural network comprising one or more recurrent neural network (RNN) layers and a final CTC output layer to generate a neural network output, wherein processing the sequence of modified frames of acoustic data comprises: subsampling the modified frames of acoustic data; and processing each subsampled modified frame of acoustic data through the acoustic modeling neural network.

    PROCESSING MULTI-CHANNEL AUDIO WAVEFORMS
    16.
    Invention application
    Status: in force

    Publication number: US20160322055A1

    Publication date: 2016-11-03

    Application number: US15205321

    Application date: 2016-07-08

    Applicant: Google Inc.

    Abstract: Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription for the utterance that is determined.
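
    Sketch: the convolve-then-combine front end in this abstract can be illustrated with fixed filters. In the described method the filter parameters are learned jointly with the acoustic model; here they are hard-coded. Combining by summation across channels, the "valid" output length, and the function names are assumptions, and the deep neural network that consumes the result is not modeled.

```python
def convolve_valid(signal, kernel):
    """Time-domain convolution, keeping only fully overlapping positions."""
    k = len(kernel)
    flipped = kernel[::-1]
    return [
        sum(s * w for s, w in zip(signal[t:t + k], flipped))
        for t in range(len(signal) - k + 1)
    ]

def filter_and_combine(channels, filters):
    """For each filter, convolve it with every channel and sum the outputs."""
    combined = []
    for kernel in filters:
        per_channel = [convolve_valid(channel, kernel) for channel in channels]
        combined.append([sum(values) for values in zip(*per_channel)])
    return combined

# Two hypothetical audio channels and two hypothetical filters.
channels = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
filters = [[1.0, 0.0], [0.5, 0.5]]

features = filter_and_combine(channels, filters)  # one output sequence per filter
```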

    MULTILINGUAL PROSODY GENERATION
    17.
    Invention application

    Publication number: US20160071512A1

    Publication date: 2016-03-10

    Application number: US14942300

    Application date: 2015-11-16

    Applicant: Google Inc.

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multilingual prosody generation. In some implementations, data indicating a set of linguistic features corresponding to a text is obtained. Data indicating the linguistic features and data indicating the language of the text are provided as input to a neural network that has been trained to provide output indicating prosody information for multiple languages. The neural network can be a neural network having been trained using speech in multiple languages. Output indicating prosody information for the linguistic features is received from the neural network. Audio data representing the text is generated using the output of the neural network.
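
    Sketch: the abstract says the network receives both linguistic features and data indicating the language. One common way to supply the latter is a one-hot vector appended to the features; that encoding, the language inventory, and the function names below are assumptions, and the prosody network itself is not modeled.

```python
LANGUAGES = ["en-US", "es-ES", "fr-FR"]  # hypothetical language inventory

def language_one_hot(language):
    """Encode the language as a one-hot vector over LANGUAGES."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return [1.0 if language == code else 0.0 for code in LANGUAGES]

def build_network_input(linguistic_features, language):
    """Append the language indicator to the linguistic feature vector."""
    return list(linguistic_features) + language_one_hot(language)

# Hypothetical linguistic features for one unit of text.
x = build_network_input([0.3, 1.0, 0.0, 0.7], "es-ES")
```

    Because the language is part of the input, a single set of network weights can serve every language in the inventory.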

    CACHING SPEECH RECOGNITION SCORES
    18.
    Invention application
    Status: in force

    Publication number: US20150371631A1

    Publication date: 2015-12-24

    Application number: US14311557

    Application date: 2014-06-23

    Applicant: Google Inc.

    CPC classification number: G10L15/08 G10L15/285

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for caching speech recognition scores. In some implementations, one or more values comprising data about an utterance are received. An index value is determined for the one or more values. An acoustic model score for the one or more received values is selected, from a cache of acoustic model scores that were computed before receiving the one or more values, based on the index value. A transcription for the utterance is determined using the selected acoustic model score.
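
    Sketch: the lookup described in this abstract amounts to mapping incoming values to an index and reading a score that was computed ahead of time. The abstract does not say how the index is derived, so quantizing to a fixed grid is an assumption, and `stand_in_score` is a placeholder for an acoustic model rather than a real one. All names are hypothetical.

```python
import itertools

def index_for(values, step=0.5):
    """Map feature values to a hashable index by quantizing to a grid."""
    return tuple(round(v / step) for v in values)

def stand_in_score(values):
    """Placeholder for an expensive acoustic model evaluation."""
    return -sum(v * v for v in values)

def build_cache(grid_points, dims, step=0.5):
    """Precompute scores for every grid point before any utterance arrives."""
    cache = {}
    for idx in itertools.product(grid_points, repeat=dims):
        cache[idx] = stand_in_score([i * step for i in idx])
    return cache

cache = build_cache(grid_points=range(-4, 5), dims=2)

# At recognition time, only the index computation and a dictionary read remain.
score = cache[index_for([0.9, -0.6])]
```

    The trade-off is accuracy for speed: the cached score belongs to the nearest grid point, not to the exact input values.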

    Multilingual prosody generation
    19.
    Granted invention patent
    Status: in force

    Publication number: US09195656B2

    Publication date: 2015-11-24

    Application number: US14143627

    Application date: 2013-12-30

    Applicant: Google Inc.

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multilingual prosody generation. In some implementations, data indicating a set of linguistic features corresponding to a text is obtained. Data indicating the linguistic features and data indicating the language of the text are provided as input to a neural network that has been trained to provide output indicating prosody information for multiple languages. The neural network can be a neural network having been trained using speech in multiple languages. Output indicating prosody information for the linguistic features is received from the neural network. Audio data representing the text is generated using the output of the neural network.
