Speech recognition using topic-specific language models

US 9,324,323 B1
Filed: 12/14/2012
Issued: 04/26/2016
Est. Priority Date: 01/13/2012
Status: Active Grant


Abstract

Speech recognition techniques may include: receiving audio; identifying one or more topics associated with audio; identifying language models in a topic space that correspond to the one or more topics, where the language models are identified based on proximity of a representation of the audio to representations of other audio in the topic space; using the language models to generate recognition candidates for the audio, where the recognition candidates have scores associated therewith that are indicative of a likelihood of a recognition candidate matching the audio; and selecting a recognition candidate for the audio based on the scores.
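
The final selection step can be sketched as follows. This is a minimal illustration, not the patented implementation: the claims name three inputs to the aggregated score (the recognizer confidence, the relevance of the language model's topic, and the proximity in the topic space) but give no formula for combining them, so the plain product used here is an assumption. The `Candidate` class, the function names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    transcription: str
    confidence: float       # speech recognizer confidence score
    topic_relevance: float  # relevance of the language model's topic to the nearby item
    proximity: float        # closeness of the audio to that item, scaled to (0, 1]


def aggregated_score(c: Candidate) -> float:
    # Assumption: combine the three inputs as a product.
    # A weighted sum or a log-linear mix would fit the wording equally well.
    return c.confidence * c.topic_relevance * c.proximity


def select_transcription(candidates):
    # Pick the transcription whose aggregated score is highest.
    return max(candidates, key=aggregated_score).transcription


# One candidate per topic-specific language model (made-up values).
candidates = [
    Candidate("recognize speech", 0.80, 0.9, 0.5),
    Candidate("wreck a nice beach", 0.85, 0.2, 0.5),
]
best = select_transcription(candidates)
print(best)
```

In this toy input the second candidate has the higher raw recognizer confidence, but its topic is weakly relevant to the nearby item, so the first candidate is selected.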

19 Claims

1. A method comprising:
- receiving audio;
  
  determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
  
  determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
  
  identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
  
  obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
  
  generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
  
  selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.
  - 2. The method of claim 1, further comprising:
    - classifying documents by topic;
      
      classifying other audio by topic based on transcriptions of the other audio; and
      
      using the documents and the transcriptions of the other audio as training data to train at least the language models that are each associated with a different topic.
  - 3. The method of claim 1, wherein determining that the representation of the one or more features of the audio is proximate to the representation of the one or more corresponding features of the other item of content comprises:
    - mapping the representation of the one or more features of the audio into the vector space; and
      
      identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on a distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space.
  - 4. The method of claim 3, wherein identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on the distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space comprises:
    - determining that the representation of the one or more features of the audio is within a range of the representation of the one or more corresponding features of the other item of content.
  - 5. The method of claim 3, wherein identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on the distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space comprises:
    - determining that the distance is one of a predetermined number of closest distances between the representation of the one or more features of the audio and representations of one or more corresponding features of other items of content, wherein the representations of one or more corresponding features of other items of content include the representation of the one or more corresponding features of the other item of content.
  - 6. The method of claim 3, wherein the vector space is an n-dimensional topic space, and wherein the representation of the one or more features of the audio is an n-dimensional vector.
  - 7. The method of claim 6, wherein each of the dimensions of the n-dimensional topic space corresponds to a topic.
  - 8. The method of claim 1, comprising identifying one or more topics associated with the audio.
  - 9. The method of claim 8, wherein the one or more topics associated with the audio are identified based on metadata associated with the audio.
  - 10. The method of claim 8, wherein the one or more topics associated with the audio are identified based on a transcription of the audio that is generated using a general language model that is not topic-specific.
  - 11. The method of claim 1, wherein the representation of the one or more features of the audio comprises a vector representation of the one or more features of the audio, and wherein the representation of the one or more corresponding features of the other content comprises a vector representation of the one or more corresponding features of the other content.
  - 12. The method of claim 1, wherein the other item of content is audio content or written language content.
  - 13. The method of claim 1, wherein the topics that are each associated with a different language model are part of a topic hierarchy, at least one of the topics associated with a language model being at a higher level in the topic hierarchy than another one of the topics associated with a language model.
  - 14. The method of claim 1, wherein the representation of the one or more features of the audio comprises a vector representation of the one or more features of the audio in which the elements of the vector representation of the one or more features of the audio each indicate a relevance of the audio to a different topic, and wherein the representation of the one or more corresponding features of the other content comprises a vector representation of the one or more corresponding features of the other content in which the elements of the vector representation of the one or more corresponding features of the other content each indicate a relevance of the other item of content to a different topic.
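
The two proximity criteria in claims 4 and 5 (within a range, or among a predetermined number of closest distances) can be sketched as below, using the n-dimensional topic vectors of claims 6, 7, and 14. This is an illustrative sketch only: the claims do not specify a distance metric, so Euclidean distance is an assumption, and the item names, topic dimensions, and vector values are hypothetical.

```python
import math


def within_range(audio_vec, items, radius):
    # Claim 4 style: keep items whose distance to the audio vector is within a range.
    return [name for name, vec in items.items()
            if math.dist(audio_vec, vec) <= radius]


def k_closest(audio_vec, items, k):
    # Claim 5 style: keep the k items with the smallest distances.
    ranked = sorted(items, key=lambda name: math.dist(audio_vec, items[name]))
    return ranked[:k]


# Each dimension corresponds to a topic, here (sports, politics, cooking);
# each element indicates the relevance of the item to that topic.
items = {
    "doc_a": (0.9, 0.1, 0.0),
    "doc_b": (0.1, 0.8, 0.1),
    "clip_c": (0.6, 0.3, 0.1),
}
audio_vec = (0.8, 0.1, 0.1)

print(within_range(audio_vec, items, radius=0.3))
print(k_closest(audio_vec, items, k=2))
```

Either test returns the neighboring items whose associated language models would then be used to transcribe the audio. `math.dist` requires Python 3.8 or later.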

15. One or more non-transitory machine-readable media storing instructions that are executable by one or more processing devices to perform operations comprising:
- receiving audio;
  
  determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
  
  determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
  
  identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
  
  obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
  
  generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
  
  selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.
  - 16. The non-transitory machine-readable media of claim 15, wherein determining that the representation of the one or more features of the audio is proximate to the representation of the one or more corresponding features of the other item of content comprises:
    - mapping the representation of the one or more features of the audio into the vector space; and
      
      identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on a distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space.
  - 17. The non-transitory machine-readable media of claim 15, wherein the operations comprise identifying one or more topics associated with the audio.

18. A system comprising:
- memory storing instructions that are executable; and
  
  one or more processing devices to execute the instructions to perform operations comprising:
  
  receiving audio;
  
  determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
  
  determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
  
  identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
  
  obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
  
  generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
  
  selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.
  - 19. The system of claim 18, wherein determining that the representation of the one or more features of the audio is proximate to the representation of the one or more corresponding features of the other item of content comprises:
    - mapping the representation of the one or more features of the audio into the vector space; and
      
      identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on a distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Bikel, Daniel M.; Pereira, Fernando; Biadsy, Fadi; Thadini, Kapil R.; Shugrina, Maria
Primary Examiner(s): Wozniak, James
Application Number: US13/715,139
Time in Patent Office: 1,229 Days
Field of Search: 704/9, 704/251, 704/255, 704/257
US Class Current: 1/1
CPC Class Codes:
G10L 15/183 using context dependencies,...
G10L 15/197 Probabilistic grammars, e.g...
