Dynamic match lattice spotting for indexing speech content

US 20070179784A1
Filed: 03/16/2006
Published: 08/02/2007
Est. Priority Date: 02/02/2006
Status: Abandoned Application

Abstract

A system for indexing and searching speech content includes two distinct stages: a speech indexing stage (100) and a speech retrieval stage (200). A phone lattice (103) is generated by passing speech content (101) through a speech recogniser (102). The resulting phone lattice is then processed to produce a set of observed sequences Q=(Θ,i), where Θ is the set of observed phone sequences for each node i in the phone lattice. During the retrieval stage (200), a user first inputs a target word (205), which the system reduces to a target phone sequence P=(p₁, p₂, . . . , p_N) (207). The system then compares the target sequence P with the set of observed sequences Q (208), suitably by scoring each observed sequence against the target sequence using a Minimum Edit Distance (MED) calculation, to produce a set of matching sequences R (209).
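The MED scoring mentioned above can be written as a standard edit-distance recurrence. The block below is a sketch under stated assumptions, not the filing's own formulation: the cost-matrix symbol D, the observed-sequence notation q=(q_1, ..., q_M), and the convention that the observed sequence is transformed into the target are introduced here for illustration. The costs C_i, C_d, C_s and the threshold S_max are the symbols used in the claims.

```latex
\begin{aligned}
D(0,0) &= 0, \qquad D(j,0) = D(j-1,0) + C_d, \qquad D(0,k) = D(0,k-1) + C_i, \\
D(j,k) &= \min\bigl\{\, D(j-1,k) + C_d,\;\; D(j,k-1) + C_i,\;\; D(j-1,k-1) + C_s(q_j, p_k) \,\bigr\}, \\
S &= D(M,N), \qquad R = \{\, q \in Q \;:\; S < S_{\max} \,\}.
\end{aligned}
```

Here D(j,k) is the minimum cost of transforming the first j observed phones into the first k target phones, so an observed sequence is reported in R only when its total cost S stays below S_max.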

32 Claims

1. A computer implemented method of indexing speech content, the method comprising the steps of:
- generating a phone lattice from said speech content;
- processing the phone lattice to generate a set of observed sequences Q=(Θ,i), wherein Θ are the observed sequences for each node i in said phone lattice; and
- storing said set of observed sequences Q=(Θ,i) for each node.
- Dependent claims (2, 3, 4, 5, 6, 7, 32):
  - 2. The method of claim 1 wherein the step of generating the phone lattice further comprises the steps of:
    - performing a feature based extraction process to construct a phone recognition network; and
    - performing an N-best decoding on said phone recognition network to produce the phone lattice.
  - 3. The method of claim 2 wherein the phone recognition network is constructed using phone loop or phone sequence fragment loop techniques.
  - 4. The method of claim 3 wherein the N-best decoding utilises a set of well trained acoustic models and a language model.
  - 5. The method of claim 4 wherein the set of well trained acoustic models are tri-phone Hidden Markov Models (HMM) and the language model is an N-gram language model.
  - 6. The method of claim 1 wherein the step of generating the phone lattice further comprises optimising lattice size and complexity by selecting from the following sub-steps:
    - (a) tuning the number of tokens U used to generate the phone lattice;
    - (b) pruning less likely paths outside a pruning beamwidth W; and
    - (c) tuning the number of lattice traversals V.
  - 7. The method of any one of the preceding claims wherein said set of observed sequences Q=(Θ,i) is generated in accordance with Q(Θ,i)={Q¹,Q², . . . }={θ^k∈Θ|θ_N^k=i}, where Θ={θ₁ⁱ,θ₂ⁱ, . . . } is the set of all N-length sequences θⁱ=(θ₁ⁱ,θ₂ⁱ, . . . ) that exist in the lattice, and wherein each element θ_kⁱ corresponds to a node within the lattice.
  - 32. Computer readable media having stored thereon instructions for executing, on at least one processor, the steps of the method of indexing speech content of claim 1 or the method of searching indexed speech content of claim 8.
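
As an illustration of the indexing step in claims 1 and 7, the following is a minimal sketch, not the implementation described in the filing. It assumes a toy lattice given as a list of directed edges between node ids; `observed_sequences`, `edges`, and `phones` are hypothetical names, and a real phone lattice would also carry timing and acoustic scores.

```python
from collections import defaultdict


def observed_sequences(edges, n):
    """Return {node i: list of n-length node paths that end at node i}.

    `edges` is a list of (source, destination) node-id pairs.
    This mirrors Q(Theta, i): every stored path has node i as its last element.
    """
    preds = defaultdict(list)  # node -> list of predecessor nodes
    nodes = set()
    for src, dst in edges:
        preds[dst].append(src)
        nodes.update((src, dst))

    def paths_ending_at(node, length):
        # A path of length 1 is just the node itself.
        if length == 1:
            return [(node,)]
        # Otherwise extend every shorter path that ends at a predecessor.
        return [
            path + (node,)
            for prev in preds[node]
            for path in paths_ending_at(prev, length - 1)
        ]

    return {i: paths_ending_at(i, n) for i in sorted(nodes)}


# Toy lattice (assumed data): two competing vowel hypotheses between "k" and "t".
phones = {0: "k", 1: "ae", 2: "eh", 3: "t"}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

Q = observed_sequences(edges, 3)
# Translate the node paths ending at node 3 into phone sequences.
phone_seqs = [tuple(phones[node] for node in path) for path in Q[3]]
```

In this toy lattice only node 3 has 3-length paths ending at it; nodes nearer the start of the lattice get an empty list, which a real index might handle with padding or shorter sequences.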

8. A method for searching indexed speech content wherein said indexed speech content is stored in the form of a phone lattice, the method comprising the steps of:
- obtaining a target sequence P=(p₁, p₂, p₃, . . . p_N);
- comparing the target sequence P with a set of observed sequences Q=(Θ,i) generated for each node i in said phone lattice, wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance (MED) calculation; and
- outputting a set of sequences R from said set of observed sequences that match said target sequence.
- Dependent claims (9, 10, 11, 12, 13, 14, 15, 16, 17, 18):
  - 9. The method of claim 8 wherein said set of observed sequences Q=(Θ,i) is generated in accordance with Q(Θ,i)={Q¹,Q², . . . }={θ^k∈Θ|θ_N^k=i}, wherein Θ={θ₁ⁱ,θ₂ⁱ, . . . } is the set of all N-length sequences θⁱ=(θ₁ⁱ,θ₂ⁱ, . . . ) that exist in the phone lattice, and wherein each element θ_kⁱ corresponds to a node within the phone lattice.
  - 10. The method of claim 9 wherein the step of scoring each observed sequence against the target sequence further comprises the step of generating a MED cost matrix.
  - 11. The method of claim 10 wherein the MED cost matrix is generated in accordance with a Levenshtein algorithm.
  - 12. The method of claim 10 wherein the MED calculation comprises calculating the minimum cost S of transforming each observed sequence within the set of observed sequences into the target sequence in accordance with a set of insertion C_i, deletion C_d and substitution C_s costs, where S is defined by S=BESTMED(P,Q,C_i,C_d,C_s) and wherein BESTMED( . . . ) returns the last column MED cost matrix that is less than a maximum score threshold S_max.
  - 13. The method of claim 12 wherein C_i and C_d are fixed and C_s is varied according to the following substitution rules:
    - C_s=0 for same letter consonant phone substitutions;
    - C_s=1 for vowel substitutions;
    - C_s=1 for closure and stop substitutions; and
    - C_s=∞ for all other substitutions.
  - 14. The method of claim 12 wherein the MED calculations are optimised by only calculating successive columns of the MED cost matrix if the minimum element of the current column is less than S_max.
  - 15. The method of claim 8 further comprising the step of:
    - processing the set of observed sequences Q=(Θ,i) to produce a set of hypersequences, wherein each hypersequence represents a particular group of observed sequences Q=(Θ,i).
  - 16. The method of claim 15 wherein the hypersequences are produced by mapping the observed sequences to a hypersequence domain in accordance with a predetermined mapping function.
  - 17. The method of claim 16 wherein the mapping of the observed sequences to the hypersequence domain is performed on an element by element basis using a mapping method selected from:
    - (a) a linguistic knowledge based mapping;
    - (b) a data driven acoustic mapping; and
    - (c) a context dependent mapping.
  - 18. The method of claim 16 wherein the step of comparing the target sequence and the observed sequences comprises:
    - comparing the target sequence with each hypersequence to identify sequence groups most likely to yield a match for the target sequence; and
    - comparing said target sequence with the set of observed sequences Q=(Θ,i) contained within the identified hypersequence sequence groups.
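
The two-stage comparison in claims 15 to 18 can be sketched as follows. This is an illustrative reading, not the filing's implementation: the phone-to-class table `PHONE_CLASS` stands in for a "linguistic knowledge based mapping" and is invented for this example, as are the function names, the `hyper_max` threshold on the coarse stage, and the use of unit-cost edit distance at both stages.

```python
from collections import defaultdict

# Hypothetical element-by-element mapping from phones to broad classes.
PHONE_CLASS = {
    "p": "STOP", "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP", "g": "STOP",
    "m": "NASAL", "n": "NASAL",
    "s": "FRIC", "z": "FRIC", "f": "FRIC", "v": "FRIC",
    "ae": "VOWEL", "eh": "VOWEL", "ih": "VOWEL", "aa": "VOWEL",
}


def to_hypersequence(seq):
    """Map a phone sequence into the hypersequence domain, one element at a time."""
    return tuple(PHONE_CLASS[phone] for phone in seq)


def build_hyper_index(observed):
    """Group observed sequences under the hypersequence that represents them."""
    index = defaultdict(list)
    for seq in observed:
        index[to_hypersequence(seq)].append(seq)
    return dict(index)


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def search(target, index, hyper_max, s_max):
    """Coarse pass over hypersequences, then a fine pass inside the surviving groups."""
    target_hyper = to_hypersequence(target)
    matches = []
    for hyper, group in index.items():
        # Stage 1: skip whole groups whose hypersequence is too far from the target's.
        if edit_distance(hyper, target_hyper) > hyper_max:
            continue
        # Stage 2: score the individual observed sequences in the group.
        for seq in group:
            score = edit_distance(seq, target)
            if score < s_max:
                matches.append((seq, score))
    return sorted(matches, key=lambda match: match[1])


observed = [
    ("k", "ae", "t"), ("k", "eh", "t"), ("g", "ae", "t"),
    ("m", "ae", "n"), ("s", "ih", "t"),
]
index = build_hyper_index(observed)
results = search(("k", "ae", "t"), index, hyper_max=0, s_max=2)
```

With `hyper_max=0` only the STOP-VOWEL-STOP group is scored in detail, so two of the five observed sequences are never compared phone by phone. The saving grows with the number of sequences that share a hypersequence.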

19. A system for indexing and searching speech content, the system comprising:
- a speech recognition engine for generating a phone lattice from said speech content;
- a first database for storing said phone lattice generated by said speech recognition engine;
- an input device for obtaining a target sequence P=(p₁, p₂, p₃, . . . p_N);
- at least one processor coupled to said input device and said first database, which processor is configured to:
  - process said phone lattice to generate a set of observed sequences Q=(Θ,i), wherein Θ are the observed sequences for each node i in said phone lattice;
  - store said observed sequences Q=(Θ,i) in a second database;
  - compare said target sequence P with the set of observed sequences Q=(Θ,i) wherein the comparison between the target sequence and observed sequences includes scoring each observed sequence against the target sequence using a Minimum Edit Distance (MED) calculation; and
  - output a set of sequences R from said set of observed sequences Q=(Θ,i) that match said target sequence.
- Dependent claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31):
  - 20. The system of claim 19 wherein the speech recognition engine is configured to:
    - construct a phone recognition network utilising a feature based extraction process;
    - perform an N-best decoding operation on said phone recognition network to produce the phone lattice; and
    - store said phone lattice in the first database.
  - 21. The system of claim 20 wherein the feature based extraction process is performed by a speech recognition program.
  - 22. The system of claim 20 wherein the phone recognition network is constructed using phone loop or phone sequence fragment loop techniques.
  - 23. The system of claim 19 wherein the N-best decoding utilises a set of well trained acoustic models and an appropriate language model.
  - 24. The system of claim 23 wherein the set of well trained acoustic models are tri-phone Hidden Markov Models (HMM) and the language model is an N-gram language model.
  - 25. The system of claim 19 wherein phone lattice size and complexity is optimised by said at least one processor selecting from the following sub-steps:
    - (a) tuning the number of tokens U used to generate the phone lattice;
    - (b) pruning less likely paths outside a pruning beamwidth W; and
    - (c) tuning the number of lattice traversals V.
  - 26. The system of claim 19 wherein said set of observed sequences Q=(Θ,i) is generated in accordance with Q(Θ,i)={Q¹,Q², . . . }={θ^k∈Θ|θ_N^k=i}, where Θ={θ₁ⁱ,θ₂ⁱ, . . . } is the set of all N-length sequences θⁱ=(θ₁ⁱ,θ₂ⁱ, . . . ) that exist in the lattice, and wherein each element θ_kⁱ corresponds to a node within the lattice.
  - 27. The system of claim 19 wherein scoring each observed sequence against the target sequence further includes generating a MED cost matrix.
  - 28. The system of claim 27 wherein generating the MED cost matrix comprises calculating the minimum cost S of transforming each observed sequence within the set of observed sequences into the target sequence in accordance with a set of insertion C_i, deletion C_d and substitution C_s costs, where S is defined by S=BESTMED(P,Q,C_i,C_d,C_s) and wherein BESTMED( . . . ) returns the last column MED cost matrix that is less than a maximum score threshold S_max.
  - 29. The system of claim 27 wherein C_i and C_d are fixed and C_s is varied according to the following substitution rules:
    - C_s=0 for same letter consonant phone substitutions;
    - C_s=1 for vowel substitutions;
    - C_s=1 for closure and stop substitutions; and
    - C_s=∞ for all other substitutions.
  - 30. The system of claim 28 wherein the MED calculations are optimised by only calculating successive columns of the MED cost matrix if the minimum element of the current column is less than S_max.
  - 31. The system of claim 28 wherein the MED cost matrix is generated in accordance with a Levenshtein algorithm.
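
The cost rules and column-wise pruning in claims 28 to 30 might look like the sketch below. Several details are assumptions made for illustration and are not stated in the claims: the phone sets, treating every non-vowel as a consonant, reading "same letter" as "same first letter", the order in which the rules are tested, identical phones costing 0, the fixed values C_i = C_d = 1, and returning `None` when the score is not below S_max. The name `best_med` is a stand-in for BESTMED, whose exact return value the claims do not spell out.

```python
import math

# Hypothetical phone inventories used only for this example.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw"}
STOPS_AND_CLOSURES = {
    "p", "b", "t", "d", "k", "g",
    "pcl", "bcl", "tcl", "dcl", "kcl", "gcl",
}


def substitution_cost(a, b):
    """Variable substitution cost C_s following the four rules in the claims."""
    if a == b:
        return 0  # assumption: an exact match is free
    if a in VOWELS and b in VOWELS:
        return 1  # vowel substitutions
    if a not in VOWELS and b not in VOWELS and a[0] == b[0]:
        return 0  # same-letter consonants (assumed to mean same first letter)
    if a in STOPS_AND_CLOSURES and b in STOPS_AND_CLOSURES:
        return 1  # closure and stop substitutions
    return math.inf  # all other substitutions


def best_med(target, observed, c_i=1, c_d=1, s_max=2):
    """Column-by-column MED; gives up once a whole column is at or above s_max.

    Column j holds the costs of turning the first j observed phones into each
    prefix of the target. Returns the final score if it is below s_max, else None.
    """
    # Column 0: empty observed prefix, so every target phone must be inserted.
    column = [k * c_i for k in range(len(target) + 1)]
    for q in observed:
        # Costs are non-negative, so column minima never decrease; if this one
        # is already at or above s_max the final score cannot be below it.
        if min(column) >= s_max:
            return None
        nxt = [column[0] + c_d]
        for k, p in enumerate(target, 1):
            nxt.append(min(
                column[k] + c_d,                           # delete q
                nxt[k - 1] + c_i,                          # insert p
                column[k - 1] + substitution_cost(q, p),   # substitute q with p
            ))
        column = nxt
    score = column[-1]
    return score if score < s_max else None


target = ("k", "ae", "t")
score_close = best_med(target, ("k", "eh", "t"))  # one vowel substitution
score_far = best_med(target, ("s", "ih", "s"))    # abandoned before the last column
```

For the second call the column minimum reaches S_max after two observed phones, so the third column is never computed. That is the saving described in claim 30.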

Current Assignee: Queensland University of Technology
Original Assignee: Queensland University of Technology
Inventors: Sridharan, Subramanian; Thambiratnam, Albert Joseph
Application Number: US11/377,327
Publication Number: US 20070179784A1
US Class Current: 704/255
CPC Class Codes:
G10L 15/26 Speech to text systems G10L...
G10L 2015/025 Phonemes, fenemes or fenone...
