METHOD AND SYSTEM FOR ALIGNING NATURAL AND SYNTHETIC VIDEO TO SPEECH SYNTHESIS
    1.
    Patent Application
    METHOD AND SYSTEM FOR ALIGNING NATURAL AND SYNTHETIC VIDEO TO SPEECH SYNTHESIS (Active)

    Publication No.: US20080312930A1

    Publication Date: 2008-12-18

    Application No.: US12193397

    Filing Date: 2008-08-18

    IPC Class: G10L13/00 G10L13/08 G06T13/00

    Abstract: According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously: text, and Facial Animation Parameters. In this architecture, text input is sent to a Text-To-Speech converter at a decoder that drives the mouth shapes of the face. Facial Animation Parameters are sent from an encoder to the face over the communication channel. The present invention includes codes (known as bookmarks) in the text string transmitted to the Text-to-Speech converter; these bookmarks are placed between words as well as inside them. According to the present invention, the bookmarks carry an encoder time stamp. Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time and should be interpreted as a counter. In addition, the Facial Animation Parameter stream carries the same encoder time stamp found in the bookmark of the text. The system of the present invention reads the bookmark and provides the encoder time stamp as well as a real-time time stamp to the facial animation system. Finally, the facial animation system associates the correct facial animation parameter with the real-time time stamp, using the encoder time stamp of the bookmark as a reference.
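    The mechanism above can be sketched in a few lines of Python. This is a toy illustration, not the MPEG-4 encoding: the \bookmark{N} syntax, the helper names, and the FAP-stream representation are all assumptions. What it demonstrates is that the encoder time stamp (ETS) is a counter shared by the text bookmarks and the FAP stream, which lets the renderer attach a real-time stamp to each FAP.

        import re

        BOOKMARK = re.compile(r"\\bookmark\{(\d+)\}")  # hypothetical bookmark syntax

        def parse_bookmarked_text(text):
            """Split a bookmarked string into plain words plus {word_index: ETS}."""
            ets_at_word, words = {}, []
            for token in text.split():
                m = BOOKMARK.search(token)
                if m:
                    # The ETS is a counter, not wall-clock time.
                    ets_at_word[len(words)] = int(m.group(1))
                    token = BOOKMARK.sub("", token)
                if token:
                    words.append(token)
            return words, ets_at_word

        def align_faps(ets_at_word, word_start_times, fap_stream):
            """Give each FAP (tagged with an ETS) a real-time stamp, using the
            time at which the TTS engine actually spoke the bookmarked word."""
            ets_to_real = {ets: word_start_times[i] for i, ets in ets_at_word.items()}
            return [(ets_to_real[ets], fap) for ets, fap in fap_stream
                    if ets in ets_to_real]

        words, marks = parse_bookmarked_text(r"Hello \bookmark{1}world it is \bookmark{2}me")
        print(words, marks)   # ['Hello', 'world', 'it', 'is', 'me'] {1: 1, 4: 2}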


    Method and system for aligning natural and synthetic video to speech synthesis
    2.
    Granted Patent
    Method and system for aligning natural and synthetic video to speech synthesis

    Publication No.: US06567779B1

    Publication Date: 2003-05-20

    Application No.: US08905931

    Filing Date: 1997-08-05

    IPC Class: G10L 13/00

    Abstract: According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously: text, and Facial Animation Parameters. In this architecture, text input is sent to a Text-To-Speech converter at a decoder that drives the mouth shapes of the face. Facial Animation Parameters are sent from an encoder to the face over the communication channel. The present invention includes codes (known as bookmarks) in the text string transmitted to the Text-to-Speech converter; these bookmarks are placed between words as well as inside them. According to the present invention, the bookmarks carry an encoder time stamp. Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time and should be interpreted as a counter. In addition, the Facial Animation Parameter stream carries the same encoder time stamp found in the bookmark of the text. The system of the present invention reads the bookmark and provides the encoder time stamp as well as a real-time time stamp to the facial animation system. Finally, the facial animation system associates the correct facial animation parameter with the real-time time stamp, using the encoder time stamp of the bookmark as a reference.

    Methods and apparatus for rapid acoustic unit selection from a large speech corpus
    3.
    Granted Patent
    Methods and apparatus for rapid acoustic unit selection from a large speech corpus (Active)

    Publication No.: US08315872B2

    Publication Date: 2012-11-20

    Application No.: US13306157

    Filing Date: 2011-11-29

    IPC Class: G10L13/00 G10L13/06

    Abstract: A speech synthesis system can select recorded speech fragments, or acoustic units, from a very large database of acoustic units to produce artificial speech. The selected acoustic units are chosen to minimize a combination of target and concatenation costs for a given sentence. However, as concatenation costs, which are measures of the mismatch between sequential pairs of acoustic units, are expensive to compute, processing can be greatly reduced by pre-computing and caching the concatenation costs. Unfortunately, the number of possible sequential pairs of acoustic units makes such caching prohibitive. However, statistical experiments reveal that while about 85% of the acoustic units are typically used in common speech, less than 1% of the possible sequential pairs of acoustic units occur in practice. A method for constructing an efficient concatenation cost database is provided by synthesizing a large body of speech, identifying the acoustic unit sequential pairs generated and their respective concatenation costs, and storing those concatenation costs likely to occur. By constructing a concatenation cost database in this fashion, the processing power required at run-time is greatly reduced with negligible effect on speech quality.
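    The caching strategy is easy to express in Python. The sketch below uses placeholder unit IDs and a dummy cost function (assumptions, not the patent's actual distance measure); what it shows is the two-phase structure: an offline pass that synthesizes a large body of text and records the join costs of the pairs that actually occur, and a run-time lookup that only computes costs for unseen pairs.

        def concatenation_cost(unit_a, unit_b):
            """Placeholder for the expensive spectral/prosodic mismatch measure."""
            return (hash((unit_a, unit_b)) % 1000) / 1000.0

        def build_cost_database(training_sentences, select_units):
            """Offline pass: synthesize a large corpus and cache the join cost
            of every sequential unit pair that actually occurs."""
            cache = {}
            for sentence in training_sentences:
                units = select_units(sentence)        # sequence of acoustic unit IDs
                for a, b in zip(units, units[1:]):
                    if (a, b) not in cache:
                        cache[(a, b)] = concatenation_cost(a, b)
            return cache

        def cached_cost(cache, a, b):
            """Run-time lookup; only unseen pairs pay the full computation."""
            if (a, b) not in cache:
                cache[(a, b)] = concatenation_cost(a, b)
            return cache[(a, b)]

    Because, per the statistics quoted above, under 1% of the possible pairs occur in practice, the cached table stays small enough to store and search efficiently.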


    Advance TTS for facial animation
    4.
    Granted Patent
    Advance TTS for facial animation (Expired)

    Publication No.: US07076426B1

    Publication Date: 2006-07-11

    Application No.: US09238224

    Filing Date: 1999-01-27

    IPC Class: G10L13/08

    CPC Class: G10L13/10

    Abstract: An enhanced system is achieved by allowing bookmarks that specify that the stream of bits that follows corresponds to phonemes, together with a plurality of prosody information, including duration information, specified for times within the duration of the phonemes. Illustratively, such a stream comprises a flag to enable duration information, a flag to enable a pitch contour, a flag to enable an energy contour, a specification of the number of phonemes that follow, and, for each phoneme, one or more sets of specific prosody information that relates to the phoneme, such as a set of pitch values and their durations.
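    The stream layout lends itself to a small reader. The sketch below assumes a byte-aligned toy encoding (one byte per flag, count, and value) purely for illustration; the real bitstream syntax is defined by the patent and MPEG-4 TTS, not by this code.

        import io

        def read_prosody_bookmark(stream: io.BytesIO):
            """Parse a toy prosody payload: three enable flags, a phoneme
            count, and per-phoneme duration and pitch-contour data."""
            duration_flag = stream.read(1)[0]   # enables duration information
            pitch_flag = stream.read(1)[0]      # enables a pitch contour
            energy_flag = stream.read(1)[0]     # enables an energy contour
            n_phonemes = stream.read(1)[0]
            phonemes = []
            for _ in range(n_phonemes):
                entry = {"symbol": stream.read(1).decode("ascii")}
                if duration_flag:
                    entry["duration_ms"] = stream.read(1)[0] * 4   # toy scaling
                if pitch_flag:
                    n_points = stream.read(1)[0]
                    # Pitch values paired with the times, within the phoneme's
                    # duration, at which they apply.
                    entry["pitch"] = [(stream.read(1)[0], stream.read(1)[0])
                                      for _ in range(n_points)]
                phonemes.append(entry)
            return {"energy_contour": bool(energy_flag), "phonemes": phonemes}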


    Method and system for aligning natural and synthetic video to speech synthesis
    5.
    Granted Patent
    Method and system for aligning natural and synthetic video to speech synthesis (Active)

    Publication No.: US06862569B1

    Publication Date: 2005-03-01

    Application No.: US10350225

    Filing Date: 2003-01-23

    Abstract: According to MPEG-4's TTS architecture, facial animation can be driven by two streams simultaneously: text, and Facial Animation Parameters. In this architecture, text input is sent to a Text-To-Speech converter at a decoder that drives the mouth shapes of the face. Facial Animation Parameters are sent from an encoder to the face over the communication channel. The present invention includes codes (known as bookmarks) in the text string transmitted to the Text-to-Speech converter; these bookmarks are placed between words as well as inside them. According to the present invention, the bookmarks carry an encoder time stamp. Due to the nature of text-to-speech conversion, the encoder time stamp does not relate to real-world time and should be interpreted as a counter. In addition, the Facial Animation Parameter stream carries the same encoder time stamp found in the bookmark of the text. The system of the present invention reads the bookmark and provides the encoder time stamp as well as a real-time time stamp to the facial animation system. Finally, the facial animation system associates the correct facial animation parameter with the real-time time stamp, using the encoder time stamp of the bookmark as a reference.


    Integration of talking heads and text-to-speech synthesizers for visual TTS
    6.
    Granted Patent
    Integration of talking heads and text-to-speech synthesizers for visual TTS (Active)

    Publication No.: US06839672B1

    Publication Date: 2005-01-04

    Application No.: US09224583

    Filing Date: 1998-12-31

    IPC Class: G10L21/06 G10L13/08

    CPC Class: G10L21/06 G10L2021/105

    Abstract: An enhanced arrangement for a talking head driven by text is achieved by sending FAP information to a rendering arrangement, which employs the received FAPs in synchronism with the speech that is synthesized. In accordance with one embodiment, FAPs that correspond to visemes which can be derived from the phonemes generated by a TTS synthesizer in the rendering arrangement are not included in the sent FAPs, allowing such FAPs to be generated locally. In a further enhancement, the rendering arrangement includes a process for creating a smooth transition from one FAP specification to the next. This transition can follow any selected function. In accordance with one embodiment, a separate FAP value is evaluated for each of the rendered video frames.
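    The per-frame transition is a straightforward interpolation. The sketch below assumes a linear blend and represents a FAP specification as a plain dict of parameter values (both are assumptions; the patent allows the transition to follow any selected function).

        def interpolate_faps(fap_a, fap_b, t):
            """Blend two FAP specifications; t runs from 0.0 (fap_a) to 1.0 (fap_b)."""
            return {name: (1.0 - t) * fap_a[name] + t * fap_b[name] for name in fap_a}

        def faps_per_frame(fap_a, fap_b, start_time, end_time, frame_rate=25.0):
            """Evaluate a separate FAP value for each rendered video frame
            between two consecutive FAP specifications."""
            n = int(round((end_time - start_time) * frame_rate))
            return [interpolate_faps(fap_a, fap_b, i / n if n else 0.0)
                    for i in range(n + 1)]

        # e.g., opening the jaw over 200 ms at 25 frames per second:
        print(faps_per_frame({"open_jaw": 0.0}, {"open_jaw": 1.0}, 0.0, 0.2))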


    System and method for cloud-based text-to-speech web services
    7.
    Granted Patent
    System and method for cloud-based text-to-speech web services (Active)

    Publication No.: US09009050B2

    Publication Date: 2015-04-14

    Application No.: US12956354

    Filing Date: 2010-11-30

    IPC Class: G10L13/04

    Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating speech. One variation of the method is from the server side, and another variation is from the client side. The server-side method, as implemented by a network-based automatic speech processing system, includes first receiving, from a network client independent of knowledge of internal operations of the system, a request to generate a text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The system extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides the back-end processing implementation from the network client. The system provides the network client with access to the interactive demonstration.
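    From the client's side, a voice-building request of the kind described might look like the sketch below. The endpoint path, field names, and response shape are all invented for illustration; the patent's point is precisely that the client needs no knowledge of the back-end processing behind them.

        import json
        import urllib.request

        def request_voice_build(server_url, sample_paths, transcriptions, metadata):
            """POST speech samples, their transcriptions, and descriptive metadata;
            the server extracts sound units and returns a link to an interactive
            demonstration of the new text-to-speech voice."""
            samples = []
            for path in sample_paths:
                with open(path, "rb") as f:
                    samples.append(f.read().hex())      # toy wire encoding
            payload = {
                "samples": samples,
                "transcriptions": transcriptions,       # one string per sample
                "metadata": metadata,                   # e.g. speaker, language, mic
            }
            req = urllib.request.Request(
                server_url + "/voices",                 # hypothetical endpoint
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)["demo_url"]      # hypothetical response field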


    System and method for automatic detection of abnormal stress patterns in unit selection synthesis
    8.
    Granted Patent
    System and method for automatic detection of abnormal stress patterns in unit selection synthesis (Active)

    Publication No.: US08965768B2

    Publication Date: 2015-02-24

    Application No.: US12852146

    Filing Date: 2010-08-06

    Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for detecting and correcting abnormal stress patterns in unit-selection speech synthesis. A system practicing the method detects incorrect stress patterns in selected acoustic units representing speech to be synthesized, and corrects the incorrect stress patterns in the selected acoustic units to yield corrected stress patterns. The system can further synthesize speech based on the corrected stress patterns. In one aspect, the system also classifies the incorrect stress patterns using a machine learning algorithm such as a classification and regression tree, adaptive boosting, a support vector machine, or maximum entropy. In this way a text-to-speech unit selection speech synthesizer can produce more natural sounding speech with suitable stress patterns regardless of the stress of units in a unit selection database.
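    The detect-and-correct loop can be sketched as follows. The feature set, the classifier interface, and the reselect correction step are stand-ins (the patent names classification and regression trees, adaptive boosting, support vector machines, and maximum entropy as candidate classifiers).

        def stress_features(unit):
            """Stand-in features for one selected unit: the lexical stress its
            context calls for versus acoustic evidence of realized stress."""
            return [unit["expected_stress"], unit["f0_mean"],
                    unit["energy"], unit["duration"]]

        def detect_abnormal_stress(units, classifier):
            """Flag the units whose realized stress the classifier judges incorrect."""
            return [i for i, u in enumerate(units)
                    if classifier.predict(stress_features(u)) == "incorrect"]

        def correct_stress(units, classifier, reselect):
            """Replace flagged units, e.g. by re-querying the unit database with a
            stress constraint, then synthesize from the corrected sequence."""
            for i in detect_abnormal_stress(units, classifier):
                units[i] = reselect(units[i])
            return units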


    SYSTEM AND METHOD FOR LOW-LATENCY WEB-BASED TEXT-TO-SPEECH WITHOUT PLUGINS
    9.
    Patent Application
    SYSTEM AND METHOD FOR LOW-LATENCY WEB-BASED TEXT-TO-SPEECH WITHOUT PLUGINS (Active)

    Publication No.: US20130144624A1

    Publication Date: 2013-06-06

    Application No.: US13308860

    Filing Date: 2011-12-01

    IPC Class: G10L13/08

    CPC Class: G10L13/04 G10L13/10

    Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
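    The cache described above is essentially a phrase-keyed dictionary. In the sketch below, synthesize stands in for the round trip to the TTS server, and a hash of the normalized phrase text serves as the "unique identifier" (an illustrative choice; the patent does not mandate a particular scheme).

        import hashlib

        class PhraseAudioCache:
            """Serve repeated intonational phrases from a cache so identical
            text never has to be re-synthesized by the TTS server."""

            def __init__(self, synthesize):
                self.synthesize = synthesize    # phrase text -> audio bytes
                self.store = {}

            def key(self, phrase):
                return hashlib.sha256(
                    phrase.strip().lower().encode("utf-8")).hexdigest()

            def get_audio(self, phrase):
                k = self.key(phrase)
                if k not in self.store:         # miss: one round trip to the server
                    self.store[k] = self.synthesize(phrase)
                return self.store[k]            # hit: served without re-synthesis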


    SYSTEM AND METHOD FOR AUTOMATIC DETECTION OF ABNORMAL STRESS PATTERNS IN UNIT SELECTION SYNTHESIS
    10.
    Patent Application
    SYSTEM AND METHOD FOR AUTOMATIC DETECTION OF ABNORMAL STRESS PATTERNS IN UNIT SELECTION SYNTHESIS (Active)

    Publication No.: US20120035917A1

    Publication Date: 2012-02-09

    Application No.: US12852146

    Filing Date: 2010-08-06

    IPC Class: G10L19/00

    Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for detecting and correcting abnormal stress patterns in unit-selection speech synthesis. A system practicing the method detects incorrect stress patterns in selected acoustic units representing speech to be synthesized, and corrects the incorrect stress patterns in the selected acoustic units to yield corrected stress patterns. The system can further synthesize speech based on the corrected stress patterns. In one aspect, the system also classifies the incorrect stress patterns using a machine learning algorithm such as a classification and regression tree, adaptive boosting, a support vector machine, or maximum entropy. In this way a text-to-speech unit selection speech synthesizer can produce more natural sounding speech with suitable stress patterns regardless of the stress of units in a unit selection database.
