System, method and program product for providing automatic speech recognition (ASR) in a shared resource environment

US 9,043,208 B2
Filed: 08/10/2012
Issued: 05/26/2015
Est. Priority Date: 07/18/2012
Status: Expired due to Fees

Abstract

A speech recognition system, method of recognizing speech and a computer program product therefor. A client device identified with a context for an associated user selectively streams audio to a provider computer, e.g., a cloud computer. Speech recognition receives streaming audio, maps utterances to specific textual candidates and determines a likelihood of a correct match for each mapped textual candidate. A context model selectively winnows candidates to resolve recognition ambiguity according to context whenever multiple textual candidates are recognized as potential matches for the same mapped utterance. Matches are used to update the context model, which may be used for multiple users in the same context.

25 Claims

1. An Automatic Speech Recognition (ASR) method comprising:
- extracting utterances from each of one or more audio streams, each audio stream being associated with a particular context;
  
  one or more computers extracting said utterances;
  
  generating textual candidates for each extracted utterance, one or more utterances having a plurality of textual candidates generated as potential matches;
  
  winnowing potential matches automatically with a context model within each particular context to adjust the likelihood of potential matches;
  
  selecting a single textual candidate for said each extracted utterance as a match, any selected textual candidate not having been previously matched for the current context being a new match; and
  
  updating said context model responsive to each match, wherein updating either adds the selected said textual candidate to said context model for the new match, or increases the likelihood for the previously selected said single textual candidate for the same particular context in the updated said context model, and said updated context model is used for winnowing subsequently extracted utterances.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. An ASR method as in claim 1, wherein context is described by a cross-relation between a plurality of situational variables, and any said utterances having a single textual candidate generated are matched.
  - 3. An ASR method as in claim 1, wherein generating textual candidates generates a probability for each potential match, said probability indicating the likelihood that the respective textual candidate is a match and wherein for each of said one or more utterances any single one of said plurality of textual candidates having a probability exceeding the probability of every other one of said plurality of textual candidates by a selected threshold is a match.
  - 4. An ASR method as in claim 3, wherein winnowing potential matches automatically comprises:
    - providing from previously recognized candidates weights indicating the likelihood of each said previously recognized candidate occurring in a respective particular context; and
      
      weighting each said probability, said weights adjusting said likelihood.
  - 5. An ASR method as in claim 4, wherein said context model maintains a count of each previously recognized word occurring within each previously encountered context, and updating said context model comprises updating a respective count, said each new candidate having the lowest count (1) of all previously recognized words within each said particular context.
  - 6. An ASR method as in claim 5, wherein each said probability is weighted by the count of a respective said textual candidate for the respective context normalized to the highest count for said plurality of textual candidates.
  - 7. An ASR method as in claim 1, wherein speech recognition is untrained speech recognition, said method further comprising receiving said one or more audio streams from one or more client devices.
  - 8. An ASR method as in claim 7, wherein one or more provider computers are recognizing speech in said one or more audio streams, said one or more provider computers sharing capacity, resources and recognizing speech in a cloud environment, and context is device local context.
  - 9. An ASR method as in claim 7, wherein said context model is a group context model, said one or more client devices being associated with group members, corresponding ones of said one or more audio streams being recognized in the same context.
  - 10. An ASR method as in claim 2, wherein said situational variables include time, location, activity and social setting at the receiving end of a respective audio stream.
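
The steps of claims 1 and 3 to 6 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class and function names, the default threshold of 0.2, and the choice to treat a candidate never matched in a context as having count 1 are all assumptions made for illustration.

```python
from collections import defaultdict


class ContextModel:
    """Hypothetical per-context count table (names are not from the patent)."""

    def __init__(self):
        # counts[context][text] = times `text` was selected in `context`
        self.counts = defaultdict(dict)

    def winnow(self, context, probs):
        """Weight each probability by its count normalized to the highest count."""
        seen = self.counts[context]
        # Assumption: a candidate never matched in this context weighs as count 1.
        counts = {text: seen.get(text, 1) for text in probs}
        highest = max(counts.values())
        return {text: p * counts[text] / highest for text, p in probs.items()}

    def update(self, context, text):
        """A new match enters with count 1; a repeat match has its count raised."""
        seen = self.counts[context]
        seen[text] = seen.get(text, 0) + 1


def clear_winner(probs, threshold):
    """Return the candidate that beats every other by `threshold`, else None."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if len(ranked) == 1 or probs[ranked[0]] - probs[ranked[1]] > threshold:
        return ranked[0]
    return None


def recognize(model, context, probs, threshold=0.2):
    """Select one textual candidate for an utterance, then update the model."""
    match = clear_winner(probs, threshold)
    if match is None:
        # Ambiguous: let the context model adjust the likelihoods.
        weighted = model.winnow(context, probs)
        match = max(weighted, key=weighted.get)
    model.update(context, match)
    return match


model = ContextModel()
for _ in range(3):
    model.update("kitchen", "flour")

candidates = {"flower": 0.50, "flour": 0.45}
print(recognize(model, "kitchen", candidates))  # flour
print(recognize(model, "garden", candidates))   # flower
```

In the "kitchen" context the three earlier matches of "flour" outweigh the slightly higher raw probability of "flower"; in the unseen "garden" context all weights are equal, so the raw probabilities decide.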

11. An Automatic Speech Recognition (ASR) method for recognizing speech without prior training, said ASR method comprising:
- receiving one or more audio streams from one or more client devices, each client device being associated with a particular context;
  
  extracting utterances from each of said one or more audio streams;
  
  generating textual candidates for each extracted utterance and a probability that each candidate is a match, every utterance having a single textual candidate generated is matched, remaining utterances having a plurality of textual candidates generated are unmatched, each of said plurality of textual candidates being a potential match;
  
  winnowing said plurality of textual candidates automatically for each of said remaining utterances with a context model within each said particular context to adjust likelihood of potential matches;
  
  selecting a single textual candidate for said extracted utterance as a match, any selected textual candidate not having been previously matched for the current context being a new match; and
  
  updating said context model responsive to each match, wherein updating either adds the selected said textual candidate to said context model for the new match, or increases the likelihood for the previously selected said single textual candidate for the same particular context in the updated said context model and said updated context model is used for winnowing subsequently extracted utterances.
- Dependent Claims (12, 13, 14, 15, 16)
  - 12. An ASR method as in claim 11 wherein context is described by a cross-relation between a plurality of situational variables, and for each utterance any textual candidate having a probability exceeding the probability of every other textual candidate by a selected threshold is a match.
  - 13. An ASR method as in claim 12, wherein said context model maintains a count of each previously recognized word occurring within each said particular context, updating said context model comprises updating a respective count, and winnowing potential matches comprises said context model weighting each said probability, said each new candidate having the lowest count (1) of all previously recognized words within each said particular context.
  - 14. An ASR method as in claim 13, wherein said situational variables include time, location, activity and social setting of a respective client device, and each weighted said probability is weighted by the count of a respective said textual candidate for the respective context normalized to the highest count for said plurality of textual candidates.
  - 15. An ASR method as in claim 11, wherein one or more provider computers are recognizing speech in said one or more audio streams, said one or more provider computers sharing capacity, resources and recognizing speech in a cloud environment, and context is context at a respective one of said one or more provider computers at the receiving end of a respective audio stream.
  - 16. An ASR method as in claim 11, wherein said context model is a group context model, said one or more client devices being associated with group members, corresponding ones of said one or more audio streams being recognized in the same context.
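
Two details of claims 11, 12 and 14 lend themselves to a short sketch: a context described by several situational variables, and the split between utterances matched outright (a single candidate) and those left for winnowing (several candidates). The `Context` fields follow the variables named in claim 14; the type itself, the `partition` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical context key built from the situational variables."""
    time: str            # e.g. a coarse bucket such as "morning"
    location: str
    activity: str
    social_setting: str


def partition(utterances):
    """Split utterances into already-matched and still-ambiguous ones.

    `utterances` maps an utterance id to {candidate text: probability}.
    """
    matched, remaining = {}, {}
    for uid, probs in utterances.items():
        if len(probs) == 1:
            # A single generated candidate is taken as the match.
            matched[uid] = next(iter(probs))
        else:
            # Several candidates: leave them for the context model to winnow.
            remaining[uid] = probs
    return matched, remaining


ctx = Context("morning", "kitchen", "cooking", "family")
counts_by_context = {ctx: {"flour": 3}}  # a frozen dataclass is hashable

matched, remaining = partition({
    "u1": {"add": 0.9},
    "u2": {"flower": 0.50, "flour": 0.45},
})
print(matched)    # {'u1': 'add'}
print(remaining)  # {'u2': {'flower': 0.5, 'flour': 0.45}}
```

Making the context an immutable value means two devices reporting the same combination of variables resolve to the same table entry.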

17. A computer program product for Automatic Speech Recognition (ASR), said computer program product comprising a non-transitory computer usable medium having computer readable program code stored thereon, said computer readable program code causing one or more computer executing said code to:
- extract utterances from each of one or more audio streams, each audio stream being associated with a particular context;
  
  generate textual candidates for each extracted utterance, each textual candidate including a probability of matching said each extracted utterance, one or more utterances having a plurality of textual candidates generated as potential matches;
  
  winnowing potential matches for said one or more utterances automatically with a context model within each respective said particular context for the respective utterance to adjust the likelihood of each of the respective potential matches;
  
  select a single textual candidate for said each extracted utterance as a match, any selected textual candidate not having been previously matched for the current context being a new match; and
  
  update said context model responsive to each match, wherein updating either adds the selected said textual candidate to said context model for the new match, or increases the likelihood for the previously selected said single textual candidate for the same particular context in the updated said context model and said updated context model is used for winnowing subsequently extracted utterances.
- Dependent Claims (18, 19, 20, 21)
  - 18. A computer program product for ASR as in claim 17, wherein for each utterance any single textual candidate having a probability exceeding the probability of every other textual candidate by a selected threshold is a match, said one or more utterances having at least two textual candidates with a probability within said selected threshold of each other, and wherein context is described by a cross-relation between a plurality of situational variables.
  - 19. A computer program product for ASR as in claim 18, wherein said context model maintains a count of each previously recognized word occurring within each said particular context, updating said context model comprises updating a respective count, and winnowing potential matches comprises said context model weighting each said probability, said each new candidate having the lowest count (1) of all previously recognized words within each said particular context.
  - 20. A computer program product for ASR as in claim 19, wherein said situational variables include time, location, activity and social setting and each said probability is weighted by the count of a respective said textual candidate for the respective context normalized to the highest count for said plurality of textual candidates.
  - 21. A computer program product for ASR as in claim 20, wherein said context model is a group context model, said one or more audio streams are streamed from client devices associated with group members, said one or more computer is a plurality of provider computers sharing capacity, resources and recognizing speech in a cloud environment and corresponding ones of said one or more audio streams being recognized in the same context.
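
The group context model of claim 21 can be pictured as one count table shared by every device in a group, so that a match from one member influences recognition for the others. The sketch below is one possible reading, not the patented design; the class, its methods, and the device and group identifiers are hypothetical.

```python
from collections import Counter


class GroupContextModel:
    """Hypothetical shared model: one count table per group, not per device."""

    def __init__(self):
        self.group_of = {}  # device id -> group id
        self.counts = {}    # group id -> Counter of matched text

    def join(self, device_id, group_id):
        """Associate a client device with a group."""
        self.group_of[device_id] = group_id
        self.counts.setdefault(group_id, Counter())

    def record_match(self, device_id, text):
        """A match from any member updates the group's shared counts."""
        self.counts[self.group_of[device_id]][text] += 1

    def count(self, device_id, text):
        """Count seen by this device, including matches from other members."""
        return self.counts[self.group_of[device_id]][text]


model = GroupContextModel()
model.join("phone-a", "team")
model.join("phone-b", "team")
model.join("phone-c", "other")

model.record_match("phone-a", "kubernetes")
print(model.count("phone-b", "kubernetes"))  # 1
print(model.count("phone-c", "kubernetes"))  # 0
```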

22. A computer program product for Automatic Speech Recognition (ASR), said computer program product comprising a non-transitory computer usable medium having computer readable program code stored thereon, said computer readable program code causing a plurality of computers including provider computers executing said code to:
- receive one or more audio streams from one or more client devices, each client device being associated with a particular context;
  
  extract utterances from each of said one or more audio streams;
  
  generate textual candidates for each extracted utterance and a probability that each candidate is a match, every utterance having a single textual candidate generated is matched, remaining utterances having a plurality of textual candidates generated are unmatched, each of said plurality of textual candidates being a potential match;
  
  winnow said plurality of textual candidates for each of said remaining utterances automatically with a context model within each said particular context to adjust likelihood of potential matches;
  
  select a single textual candidate for said extracted utterance as a match, any selected textual candidate not having been previously matched for the current context being a new match; and
  
  update said context model responsive to each match, wherein updating either adds the selected said textual candidate to said context model for the new match, or increases the likelihood for the previously selected said single textual candidate for the same particular context in the updated said context model and said updated context model is used for winnowing subsequently extracted utterances.
- Dependent Claims (23, 24, 25)
  - 23. A computer program product for ASR as in claim 22, wherein:
    - context is described by a cross-relation between a plurality of situational variables;
      
      for each utterance any single textual candidate having a probability exceeding the probability of every other textual candidate by a selected threshold is a match, said one or more utterances having at least two textual candidates with a probability within said selected threshold of each other;
      
      said context model maintains a count of each previously recognized word occurring within each said particular context, updating said context model comprises updating a respective count, said each new candidate having the lowest count (1) of all previously recognized words within each said particular context; and
      
      winnowing potential matches comprises said context model weighting each said probability.
  - 24. A computer program product for ASR as in claim 23, wherein said situational variables include time, location, activity and social setting and each said probability is weighted by the count of a respective said textual candidate for the respective context normalized to the highest count for said plurality of textual candidates.
  - 25. A computer program product for ASR as in claim 24, wherein said context model is a group context model, said one or more audio streams are streamed from client devices associated with group members, said one or more computer is a plurality of provider computers sharing capacity, resources and recognizing speech in a cloud environment and corresponding ones of said one or more audio streams being recognized in the same context.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Koch, Fernando Luiz; Nogima, Julio
Primary Examiner(s): ROBERTS, SHAUN A
Application Number: US13/571,409
Publication Number: US 20140025377A1
Time in Patent Office: 1,019 Days
Field of Search: 704/249, 704/231, 704/251, 704/244, 704/255
US Class Current: 704/255
CPC Class Codes:
- G10L 15/1822   Parsing for meaning underst...
- G10L 15/30   Distributed recognition, e....
- G10L 15/32   Multiple recognisers used i...
- G10L 2015/228   of application context
