Fast out-of-vocabulary search in automatic speech recognition systems

US 9,542,936 B2
Filed: 05/02/2013
Issued: 01/10/2017
Est. Priority Date: 12/29/2012
Status: Active Grant

9 Assignments

0 Petitions

Abstract

A method including: receiving, on a computer system, a text search query, the query including one or more query words; generating, on the computer system, for each query word in the query, one or more anchor segments within a plurality of speech recognition processed audio files, the one or more anchor segments identifying possible locations containing the query word; post-processing, on the computer system, the one or more anchor segments, the post-processing including: expanding the one or more anchor segments; sorting the one or more anchor segments; and merging overlapping ones of the one or more anchor segments; and searching, on the computer system, the post-processed one or more anchor segments for instances of at least one of the one or more query words using a constrained grammar.
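
The sort-and-merge part of the post-processing described in the abstract can be illustrated with a minimal sketch. The `Anchor` record, its field names, and the rule that segments touching at a boundary count as overlapping are assumptions made for illustration; the patent text does not specify a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    # Hypothetical record: which audio file, and where in it (seconds).
    file_id: str
    start: float
    end: float


def merge_overlapping(anchors):
    """Sort anchors by (file, start time), then merge overlaps within a file."""
    merged = []
    for a in sorted(anchors, key=lambda x: (x.file_id, x.start)):
        last = merged[-1] if merged else None
        if last is not None and last.file_id == a.file_id and a.start <= last.end:
            # Overlap in the same file: extend the previous segment.
            merged[-1] = Anchor(last.file_id, last.start, max(last.end, a.end))
        else:
            merged.append(a)
    return merged


segments = [Anchor("a", 5, 8), Anchor("a", 1, 6), Anchor("b", 2, 3), Anchor("a", 10, 11)]
result = merge_overlapping(segments)
```

Merging before the constrained-grammar pass means each stretch of audio is re-recognized at most once, even when several query words point at the same region.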

83 Citations

19 Claims

1. A method comprising:
- receiving, on a computer system, a text search query, the query comprising one or more query words;
  
  generating, on the computer system, for each query word in the query, a set of one or more anchor segments from searching metadata corresponding to a plurality of speech recognition processed audio files, the metadata including representations of one or more words detected in the audio files, wherein, for each detected word, the metadata includes a reference to each audio file in which the word was detected, a temporal location of the detected word in the audio file, and a confidence measure for the word as detected within the audio file, where each anchor segment includes a query word, an identifier for an audio file, and a temporal location of the query word within the audio file, where generating anchor segments includes, for each query word:
  
  determining, on the computer system, if the query word is included in a vocabulary of a learning model for a speech recognizer engine of the computer system;
  
  on the computer system, when the query word is in the vocabulary, searching the metadata to identify one or more high confidence anchor segments corresponding to the query word; and
  
  on the computer system, when the query word is not in the vocabulary:
  
  generating a search list of one or more sub-words of the query word, searching the metadata to identify one or more audio files containing at least one of the one or more sub-words to identify one or more anchor segments corresponding to one or more of the sub-words;
  
  post-processing, on the computer system, the one or more anchor segments, the post-processing comprising:
  
  expanding the one or more anchor segments;
  
  sorting the one or more anchor segments; and
  
  merging overlapping ones of the one or more anchor segments; and
  
  performing, on the computer system, speech recognition on the post-processed one or more expanded anchor segments for instances of at least one of the one or more query words using a constrained grammar.
- - 2. The method of claim 1, wherein the generating the one or more anchor segments further comprises:
    - collecting low confidence words in the audio files, the low confidence words having word confidences below a threshold, and wherein the searching the metadata to identify one or more audio files containing at least one of the one or more sub-words comprises searching the low confidence words for only the sub-words of the query word when the query word is not in the vocabulary.
  - 3. The method of claim 1, wherein the constrained grammar comprises one or more out-of-vocabulary query words of the query, wherein each of the out-of-vocabulary query words is not in the vocabulary.
  - 4. The method of claim 1, wherein the speech recognition includes computing one or more event confidence levels, each of the event confidence levels corresponding to a confidence that an anchor segment of the one or more anchor segments contains a particular query word of the one or more query words of the query.
  - 5. The method of claim 4, further comprising outputting, from the computer system, a result of the speech recognition, wherein the result comprises the instances of the one or more query words in the audio file, sorted by event confidence level.
  - 6. The method of claim 1, further comprising:
    - applying, on the computer system, a utility function to each of the one or more anchor segments to compute one or more corresponding anchor utility values; and
      
      sorting, on the computer system, the one or more anchor segments in accordance with the one or more anchor utility values.
  - 7. The method of claim 6, wherein the speech recognition performed on the one or more post-processed anchor segments only searches the one or more anchor segments having best anchor utility values of the one or more anchor utility values.
  - 8. The method of claim 1, wherein the expanding the one or more anchor segments comprises:
    - for each query word in the query:
      
      counting a first number of characters in the query before the query word and a second number of characters after the query word;
      
      multiplying the first number of characters by an average character duration to obtain a first expansion amount; and
      
      multiplying the second number of characters by the average character duration to obtain a second expansion amount; and
      
      for each anchor segment, each anchor segment being identified by an anchor word, a start time, and an end time:
      
      subtracting the first expansion amount and a first constant expansion duration from the start time; and
      
      adding the second expansion amount and a second constant expansion duration to the end time.
  - 9. The method of claim 1, wherein the speech recognition performed on the one or more post-processed expanded anchor segments includes, when the query word is not in the vocabulary, re-processing, on the computer system, audio data in the audio file at the temporal location identified in the anchor segment and computing a confidence level corresponding to a confidence that the anchor segment contains the query word.
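
The expansion arithmetic recited in claim 8 can be worked through in a short sketch. The function names, the default average character duration, the two constant padding values, the choice to count spaces as characters, the use of the first occurrence of the word in the query, and the clamp at zero are all assumptions for illustration and are not taken from the claims.

```python
def expansion_amounts(query, word, avg_char_dur):
    """Return (first, second) expansion amounts in seconds for `word` in `query`."""
    idx = query.index(word)                  # assumption: first occurrence only
    chars_before = idx                       # characters preceding the query word
    chars_after = len(query) - idx - len(word)
    return chars_before * avg_char_dur, chars_after * avg_char_dur


def expand_anchor(start, end, query, word,
                  avg_char_dur=0.07, pad_before=0.5, pad_after=0.5):
    """Widen an anchor so the whole query phrase plausibly fits inside it."""
    first, second = expansion_amounts(query, word, avg_char_dur)
    new_start = max(0.0, start - first - pad_before)   # assumption: clamp at file start
    new_end = end + second + pad_after
    return new_start, new_end


# "my" has 6 characters before it and 9 after it in this 17-character query.
new_start, new_end = expand_anchor(10.0, 10.4, "reset my password", "my",
                                   avg_char_dur=0.1)
```

With an average character duration of 0.1 s and 0.5 s of constant padding on each side, the anchor at 10.0 to 10.4 s widens to roughly 8.9 to 11.8 s.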

10. A system comprising a computer system comprising a processor, memory, and storage, the system being configured to:
- receive a text search query, the query comprising one or more query words;
  
  generate, for each query word in the query, a set of one or more anchor segments from searching metadata corresponding to a plurality of speech recognition processed audio files, the metadata including representations of one or more words detected in the audio files, wherein, for each detected word, the metadata includes a reference to each audio file in which the word was detected, a temporal location of the detected word in the audio file, and a confidence measure for the word as detected within the audio file, where each anchor segment includes a query word, an identifier for an audio file, and a temporal location of the query word within the audio file, where generating anchor segments includes, for each query word, the computer system:
  
  determining if the query word is included in a vocabulary of a learning model for a speech recognizer engine of the computer system;
  
  when the query word is in the vocabulary, searching the metadata to identify one or more high confidence anchor segments corresponding to the query word; and
  
  when the query word is not in the vocabulary:
  
  generating a search list of one or more sub-words of the query word, searching the metadata to identify one or more audio files containing at least one of the one or more sub-words to identify one or more anchor segments corresponding to one or more of the sub-words;
  
  post-process the one or more anchor segments, the post-process comprising:
  
  expanding the one or more anchor segments;
  
  sorting the one or more anchor segments; and
  
  merging overlapping ones of the one or more anchor segments; and
  
  perform speech recognition on the post-processed one or more expanded anchor segments for instances of at least one of the one or more query words using a constrained grammar.
- - 11. The system of claim 10, wherein the system is further configured to collect low confidence words in the audio files, the low confidence words having word confidences below a threshold, and wherein the searching the metadata to identify one or more audio files containing at least one of the one or more sub-words comprises searching the low confidence words for only the sub-words of the query word when the query word is not in the vocabulary.
  - 12. The system of claim 10, wherein the constrained grammar comprises one or more out-of-vocabulary query words of the query, wherein each of the out-of-vocabulary query words is not in the vocabulary.
  - 13. The system of claim 10, wherein the system is further configured to search the one or more post-processed anchor segments by computing one or more event confidence levels, each of the event confidence levels corresponding to a confidence that an anchor segment of the one or more anchor segments contains a particular query word of the one or more query words of the query.
  - 14. The system of claim 13, wherein the system is further configured to output a result of the speech recognition, wherein the result comprises the instances of the query words in the audio file, sorted by event confidence level.
  - 15. The system of claim 10, wherein the system is further configured to:
    - apply a utility function to each of the one or more anchor segments to compute one or more corresponding anchor utility values; and
      
      sort the one or more anchor segments in accordance with the one or more anchor utility values.
  - 16. The system of claim 15, wherein the system is configured to search the one or more post-processed anchor segments by only searching the one or more anchor segments having best anchor utility values of the one or more anchor utility values.
  - 17. The system of claim 10, wherein the system is further configured to expand the one or more anchor segments by:
    - for each query word in the query:
      
      counting a first number of characters in the query before the query word and a second number of characters after the query word;
      
      multiplying the first number of characters by an average character duration to obtain a first expansion amount; and
      
      multiplying the second number of characters by the average character duration to obtain a second expansion amount; and
      
      for each anchor segment, each anchor segment being identified by an anchor word, a start time, and an end time:
      
      subtracting the first expansion amount and a first constant expansion duration from the start time; and
      
      adding the second expansion amount and a second constant expansion duration to the end time.
  - 18. The system of claim 10, wherein the speech recognition performed by the system on the one or more post-processed expanded anchor segments includes, when the query word is not in the vocabulary, re-processing audio data in the audio file at the temporal location identified in the anchor segment and computing a confidence level corresponding to a confidence that the anchor segment contains the query word.
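
Claims 15 and 16 describe ranking anchor segments by a utility value and searching only the best ones. A minimal sketch follows; the claims leave the utility function open, so using the stored word confidence as the utility, the tuple layout, and the `keep` cutoff are all hypothetical choices.

```python
def select_best_anchors(anchors, utility, keep):
    """Rank anchors by descending utility and return the top `keep` of them."""
    ranked = sorted(anchors, key=utility, reverse=True)
    return ranked[:keep]


# Hypothetical anchors: (query word, audio file id, start time in s, confidence).
anchors = [
    ("refund", "call_01", 12.4, 0.55),
    ("refund", "call_02", 3.1, 0.91),
    ("refund", "call_03", 40.0, 0.72),
]

# Assumption: utility is simply the confidence stored with the anchor.
best = select_best_anchors(anchors, utility=lambda a: a[3], keep=2)
```

Limiting the expensive constrained-grammar recognition to the top-ranked anchors is what lets the search trade recall for speed.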

19. A system comprising:
- means for receiving a text search query, the query comprising one or more query words;
  
  means for generating, for each query word in the query, a set of one or more anchor segments from searching metadata corresponding to a plurality of speech recognition processed audio files, the metadata including representations of one or more words detected in the audio files, wherein, for each detected word, the metadata includes a reference to each audio file in which the word was detected, a temporal location of the detected word in the audio file, and a confidence measure for the word as detected within the audio file, where each anchor segment includes a query word, an identifier for an audio file, and a temporal location of the query word within the audio file, where the means for generating anchor segments includes, for each query word:
  
  means for determining if the query word is included in a vocabulary of a learning model for a speech recognizer engine of the computer system;
  
  when the query word is in the vocabulary, means for searching the metadata to identify one or more high confidence anchor segments corresponding to the query word; and
  
  means for, when the query word is not in the vocabulary:
  
  generating a search list of one or more sub-words of the query word, searching the metadata to identify one or more audio files containing at least one of the one or more sub-words to identify one or more anchor segments corresponding to one or more of the sub-words;
  
  means for post-processing the one or more anchor segments comprising:
  
  means for expanding the one or more anchor segments;
  
  means for sorting the one or more anchor segments; and
  
  means for merging overlapping ones of the one or more anchor segments; and
  
  means for searching the post-processed one or more expanded anchor segments for instances of at least one of the one or more query words using a constrained grammar.
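
The anchor-generation branch shared by the independent claims (vocabulary check, high-confidence lookup for in-vocabulary words, sub-word lookup for out-of-vocabulary words) can be sketched as follows. The `Detected` record, both thresholds, and the definition of sub-words as contiguous substrings of at least three characters are assumptions; the claims do not say how sub-words are derived. Restricting the out-of-vocabulary search to low-confidence words follows dependent claims 2 and 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detected:
    # Hypothetical metadata entry produced by an earlier recognition pass.
    word: str
    file_id: str
    start: float       # temporal location in seconds
    confidence: float


def sub_words(word, min_len=3):
    """Assumption: sub-words are proper contiguous substrings of >= min_len chars."""
    subs = {word[i:j] for i in range(len(word))
            for j in range(i + min_len, len(word) + 1)}
    return subs - {word}


def generate_anchors(query_word, vocabulary, metadata, high=0.8, low=0.5):
    """Return (query word, file id, start time) anchors for one query word."""
    if query_word in vocabulary:
        # In-vocabulary: keep only high-confidence detections of the word itself.
        hits = [d for d in metadata
                if d.word == query_word and d.confidence >= high]
    else:
        # Out-of-vocabulary: look for sub-words among low-confidence detections.
        subs = sub_words(query_word)
        hits = [d for d in metadata
                if d.confidence < low and d.word in subs]
    return [(query_word, d.file_id, d.start) for d in hits]


metadata = [
    Detected("account", "f1", 3.0, 0.90),
    Detected("account", "f2", 7.0, 0.40),
    Detected("gene", "f3", 1.5, 0.30),
    Detected("gene", "f3", 9.0, 0.95),
]
vocabulary = {"account", "gene"}
in_vocab = generate_anchors("account", vocabulary, metadata)
oov = generate_anchors("genesys", vocabulary, metadata)
```

In this toy data the out-of-vocabulary query "genesys" yields an anchor only at the low-confidence "gene" detection, the kind of spot where a recognizer may have misheard an unknown word as a known one.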

Current Assignee: Genesys Telecommunications Laboratories Incorporated (Genesys Cloud Services Incorporated)
Original Assignee: Genesys Telecommunications Laboratories Incorporated (Genesys Cloud Services Incorporated)
Inventors: Lev-Tov, Amir; Faizakof, Avi; Konig, Yochai
Primary Examiner(s): Vo, Huyen
Assistant Examiner(s): Le, Thuykhanh
Application Number: US13/886,205
Publication Number: US 20140188475A1
Time in Patent Office: 1,349 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/242   Query formulation
G06F 16/638   Presentation of query results
G06F 16/685   using automatically derived...
G10L 15/02   Feature extraction for spee...
G10L 15/08   Speech classification or se...
G10L 15/1815   Semantic context, e.g. disa...
G10L 15/19   Grammatical context, e.g. d...
G10L 15/30   Distributed recognition, e....
