Method and system for efficient spoken term detection using confusion networks

US 9,196,243 B2
Filed: 03/31/2014
Issued: 11/24/2015
Est. Priority Date: 03/31/2014
Status: Active Grant


Abstract

Systems and methods for spoken term detection are provided. A method for spoken term detection comprises receiving phone level out-of-vocabulary (OOV) keyword queries, converting the phone level OOV keyword queries to words, generating a confusion network (CN) based keyword searching (KWS) index, and using the CN based KWS index for both in-vocabulary (IV) keyword queries and the OOV keyword queries.

20 Claims

1. A method for spoken term detection, comprising:
- receiving phone level out-of-vocabulary (OOV) keyword queries;
  
  converting the phone level OOV keyword queries to words;
  
  generating a confusion network (CN) based keyword searching (KWS) index; and
  
  using the CN based KWS index for both in-vocabulary (IV) keyword queries and the OOV keyword queries;
  
  wherein converting the phone level OOV keyword queries to words comprises:
  
  converting the phone level OOV keyword queries to phonetic finite state acceptors, wherein phone sequences for IV terms are looked up in a recognition lexicon and phone sequences for OOV terms are generated with a grapheme-to-phoneme model;
  
  expanding the phone level OOV keyword queries through composition with a weighted finite state transducer (WFST) that models probabilities of confusions between different phones;
  
  extracting N-best hypotheses represented by each expanded WFST; and
  
  mapping back the N-best hypotheses to a set of N or fewer word sequences through composition with a finite state transducer that maps from phone sequences to word sequences; and
  
  wherein the receiving, converting, generating and using steps are performed by a computer system comprising a memory and at least one processor coupled to the memory.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The method according to claim 1, wherein generating the CN based KWS index comprises constructing the CN based KWS index from a plurality of confusion networks by compiling each confusion network into a weighted finite state transducer having the same topology as the confusion network.
  - 3. The method according to claim 2, wherein each weighted finite state transducer includes input labels that are words on each arc in the corresponding confusion network.
  - 4. The method according to claim 2, wherein each weighted finite state transducer includes output labels that encode a start time (T_start) and an end time (T_end) of each arc in the corresponding confusion network as T_start-T_end strings.
  - 5. The method according to claim 2, wherein each weighted finite state transducer includes costs that are negative log CN posteriors for each arc in the confusion network.
  - 6. The method according to claim 2, wherein for each weighted finite state transducer, the method further comprises adding a new start node, S_i, with zero-cost epsilon-arcs connecting S_i to each node in the weighted finite state transducer.
  - 7. The method according to claim 2, wherein for each weighted finite state transducer, the method further comprises adding a new end node, E_i, with zero-cost epsilon-arcs connecting each node in the weighted finite state transducer to E_i.
  - 8. The method according to claim 6, further comprising obtaining a final single index by creating a new start node, S, that is connected to each S_i by the zero-cost epsilon arcs.
  - 9. The method according to claim 7, further comprising obtaining a final single index by creating a new end node, E, that is connected to each E_i by the zero-cost epsilon arcs.
  - 10. The method according to claim 1, wherein using the CN based KWS index for an IV query comprises:
    - converting the query into a word automaton;
      
      composing the query automaton with an index transducer; and
      
      if overlapping hits are produced, keeping only a highest scoring hit.
  - 11. The method according to claim 1, wherein using the CN based KWS index for an OOV query comprises searching for the resulting word sequences via composition with the CN based KWS index.
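
The index arcs described in claims 2 to 5 and the IV search of claim 10 can be illustrated with a minimal Python sketch. It is not the patented implementation: it uses plain tuples instead of a WFST library, skips epsilon arcs inside the confusion network, and replaces transducer composition with a direct walk over consecutive arcs. The names `compile_cn`, `search`, and `drop_overlaps`, and the bin layout `(t_start, t_end, {word: posterior})`, are assumptions made for illustration.

```python
import math

def compile_cn(cn):
    """Turn one confusion network into arcs with the same linear topology.

    cn is a list of bins (t_start, t_end, {word: posterior}); this layout is
    an assumption. Each arc is (src, dst, input word, "T_start-T_end" output
    label, cost), where cost is the negative log posterior.
    """
    arcs = []
    for node, (t_start, t_end, words) in enumerate(cn):
        for word, posterior in words.items():
            arcs.append((node, node + 1, word,
                         f"{t_start}-{t_end}", -math.log(posterior)))
    return arcs

def search(arcs, query):
    """Find every place where the query words label consecutive arcs."""
    if not query:
        return []
    step_from = {(src, word): (dst, times, cost)
                 for src, dst, word, times, cost in arcs}
    hits = []
    for start in sorted({src for src, *_ in arcs}):
        node, total, labels = start, 0.0, []
        for word in query:
            step = step_from.get((node, word))
            if step is None:
                break
            node, times, cost = step
            total += cost
            labels.append(times)
        else:
            # Decode the time strings of the first and last matched arcs.
            t0 = float(labels[0].split("-")[0])
            t1 = float(labels[-1].split("-")[1])
            hits.append((t0, t1, math.exp(-total)))
    return hits

def drop_overlaps(hits):
    """Among hits that overlap in time, keep only the highest scoring one."""
    kept = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(hit[1] <= k[0] or k[1] <= hit[0] for k in kept):
            kept.append(hit)
    return sorted(kept)
```

Calling `drop_overlaps(search(compile_cn(cn), ["new", "york"]))` on a network whose first two bins contain "new" and "york" returns a single hit carrying the start time of the first bin, the end time of the second, and the product of the two posteriors as its score.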

12. A system for spoken term detection, comprising:
- a query module capable of receiving phone level out-of-vocabulary (OOV) keyword queries;
  
  a mapping module capable of:
  
  converting the phone level OOV keyword queries to words;
  
  converting the phone level OOV keyword queries to phonetic finite state acceptors, wherein phone sequences for IV terms are looked up in a recognition lexicon and phone sequences for OOV terms are generated with a grapheme-to-phoneme model;
  
  expanding the phone level OOV keyword queries through composition with a weighted finite state transducer (WFST) that models probabilities of confusions between different phones;
  
  extracting N-best hypotheses represented by each expanded WFST; and
  
  mapping back the N-best hypotheses to a set of N or fewer word sequences through composition with a finite state transducer that maps from phone sequences to word sequences;
  
  an indexing module capable of generating a confusion network (CN) based keyword searching (KWS) index; and
  
  a search module capable of using the CN based KWS index for both in-vocabulary (IV) keyword queries and the OOV keyword queries;
  
  wherein the query module, the mapping module, the indexing module, and the search module are implemented in at least one processor device coupled to a memory.
- View Dependent Claims (13)
- - 13. The system according to claim 12, wherein the indexing module is further capable of constructing the CN based KWS index from a plurality of confusion networks by compiling each confusion network into a weighted finite state transducer having the same topology as the confusion network.
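
The work of the mapping module (phone confusion expansion, N-best extraction, and mapping back to words) can be sketched in plain Python under strong simplifying assumptions. The claims describe WFST composition; the sketch below instead models substitution-only confusions with a dictionary, finds the N-best phone sequences with a beam, and segments each hypothesis against a toy lexicon. The function names, the confusion table, and the lexicon entries are all hypothetical.

```python
import heapq
import math

def expand_nbest(phones, confusions, n):
    """Return the n lowest-cost phone sequences as (cost, sequence) pairs.

    confusions maps a phone to {alternative: probability}. Only substitutions
    are modeled (an assumption); a phone without an entry maps to itself.
    """
    beams = [(0.0, ())]
    for phone in phones:
        alternatives = confusions.get(phone, {phone: 1.0})
        candidates = [(cost - math.log(prob), seq + (alt,))
                      for cost, seq in beams
                      for alt, prob in alternatives.items()]
        # Costs are additive per position, so pruning prefixes to n is exact.
        beams = heapq.nsmallest(n, candidates)
    return beams

def phones_to_words(seq, lexicon):
    """Segment a phone tuple into lexicon words, or return None."""
    inverse = {tuple(pron): word for word, pron in lexicon.items()}
    best = {0: []}
    for end in range(1, len(seq) + 1):
        for start in range(end):
            if start in best and seq[start:end] in inverse:
                best[end] = best[start] + [inverse[seq[start:end]]]
                break
    return best.get(len(seq))

def oov_to_word_queries(phones, confusions, lexicon, n):
    """Map an OOV phone query to at most n in-vocabulary word sequences."""
    queries = []
    for _cost, seq in expand_nbest(phones, confusions, n):
        words = phones_to_words(seq, lexicon)
        if words is not None and words not in queries:
            queries.append(words)
    return queries
```

Hypotheses that cannot be segmented into lexicon words are dropped and duplicate word sequences are merged, which is why the result can contain fewer than N word sequences.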

14. A computer program product for spoken term detection, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:
- receiving phone level out-of-vocabulary (OOV) keyword queries;
  
  converting the phone level OOV keyword queries to words;
  
  generating a confusion network (CN) based keyword searching (KWS) index; and
  
  using the CN based KWS index for both in-vocabulary (IV) keyword queries and the OOV keyword queries;
  
  wherein converting the phone level OOV keyword queries to words comprises:
  
  converting the phone level OOV keyword queries to phonetic finite state acceptors, wherein phone sequences for IV terms are looked up in a recognition lexicon and phone sequences for OOV terms are generated with a grapheme-to-phoneme model;
  
  expanding the phone level OOV keyword queries through composition with a weighted finite state transducer (WFST) that models probabilities of confusions between different phones;
  
  extracting N-best hypotheses represented by each expanded WFST; and
  
  mapping back the N-best hypotheses to a set of N or fewer word sequences through composition with a finite state transducer that maps from phone sequences to word sequences.
- View Dependent Claims (15, 16, 17, 18, 19, 20)
- - 15. The computer program product according to claim 14, wherein generating the CN based KWS index comprises constructing the CN based KWS index from a plurality of confusion networks by compiling each confusion network into a weighted finite state transducer having the same topology as the confusion network.
  - 16. The computer program product according to claim 15, wherein each weighted finite state transducer includes input labels that are words on each arc in the corresponding confusion network.
  - 17. The computer program product according to claim 15, wherein each weighted finite state transducer includes output labels that encode a start time (T_start) and an end time (T_end) of each arc in the corresponding confusion network as T_start-T_end strings.
  - 18. The computer program product according to claim 15, wherein each weighted finite state transducer includes costs that are negative log CN posteriors for each arc in the confusion network.
  - 19. The computer program product according to claim 15, wherein for each weighted finite state transducer, the method further comprises adding a new start node, S_i, with zero-cost epsilon-arcs connecting S_i to each node in the weighted finite state transducer.
  - 20. The computer program product according to claim 15, wherein for each weighted finite state transducer, the method further comprises adding a new end node, E_i, with zero-cost epsilon-arcs connecting each node in the weighted finite state transducer to E_i.
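
The start and end node construction in the dependent claims (a per-network S_i and E_i, then a global S and E, all joined by zero-cost epsilon arcs) can be pictured with a small Python sketch. It is an illustration only: arcs are plain tuples `(src, dst, input, output, cost)`, and the node naming scheme (`cn0:S`, `cn0:E`), the `<eps>` symbol, and the function names are assumptions rather than anything stated in the claims.

```python
EPS = "<eps>"  # assumed spelling of the epsilon label

def add_start_end(arcs, num_nodes, prefix):
    """Add S_i and E_i to one compiled confusion network.

    S_i gets a zero-cost epsilon arc to every node, and every node gets a
    zero-cost epsilon arc to E_i, so a match may begin and end anywhere.
    """
    s_i, e_i = f"{prefix}:S", f"{prefix}:E"
    out = [(f"{prefix}:{src}", f"{prefix}:{dst}", ilabel, olabel, cost)
           for src, dst, ilabel, olabel, cost in arcs]
    for node in range(num_nodes):
        out.append((s_i, f"{prefix}:{node}", EPS, EPS, 0.0))
        out.append((f"{prefix}:{node}", e_i, EPS, EPS, 0.0))
    return s_i, e_i, out

def build_index(transducers):
    """Merge per-network transducers into a single index.

    transducers is a list of (arcs, num_nodes) pairs. A global start node S
    is connected to each S_i and each E_i is connected to a global end node E.
    """
    index = []
    for i, (arcs, num_nodes) in enumerate(transducers):
        s_i, e_i, sub = add_start_end(arcs, num_nodes, f"cn{i}")
        index.extend(sub)
        index.append(("S", s_i, EPS, EPS, 0.0))
        index.append((e_i, "E", EPS, EPS, 0.0))
    return index
```

Prefixing node names keeps the networks disjoint inside the merged index, so a path from S to E always stays within one confusion network.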

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Kingsbury, Brian E. D.; Kuo, Hong-Kwang; Mangu, Lidia; Soltau, Hagen
Primary Examiner(s)
AZAD, ABUL K

Application Number

US14/230,790
Publication Number

US 20150279358A1
Time in Patent Office

603 Days
Field of Search

704/200-276
US Class Current

1/1
CPC Class Codes

G10L 13/08   Text analysis or generation...

G10L 15/02   Feature extraction for spee...

G10L 15/083   Recognition networks G10L15...

G10L 2015/025   Phonemes, fenemes or fenone...

G10L 2015/085   Methods for reducing search...
