Language recognition using sequence frequency

US 6,882,970 B1
Filed: 10/25/2000
Issued: 04/19/2005
Est. Priority Date: 10/28/1999
Status: Expired due to Fees

Abstract

A system is provided for comparing an input query with a number of stored annotations to identify information to be retrieved from a database. The comparison technique divides the input query into a number of fixed-size fragments and identifies how many times each of the fragments occurs within each annotation using a dynamic programming matching technique. The frequencies of occurrence of the fragments in both the query and the annotation are then compared to provide a measure of the similarity between the query and the annotation. The information to be retrieved is then determined from the similarity measures obtained for all the annotations.
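
The counting-and-comparison idea in the abstract can be illustrated with a minimal Python sketch. It makes two simplifying assumptions that are not the patent's method: fragments are matched exactly rather than with a dynamic programming technique, and the two frequency vectors are compared with a cosine measure (one option, named in claim 33). The function names `fragments` and `cosine_similarity` and the default fragment size `n=3` are hypothetical.

```python
from collections import Counter
from math import sqrt


def fragments(labels, n):
    """Return every overlapping fixed-size fragment (n-gram) of a label sequence."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]


def cosine_similarity(query, annotation, n=3):
    """Compare fragment frequencies of a query and an annotation.

    Assumption: exact matching stands in for the dynamic programming
    matching described in the abstract.
    """
    q_counts = Counter(fragments(query, n))
    a_counts = Counter(fragments(annotation, n))
    # Only fragments found in the query define the vector dimensions.
    q_vec = [q_counts[f] for f in q_counts]
    a_vec = [a_counts[f] for f in q_counts]
    dot = sum(q * a for q, a in zip(q_vec, a_vec))
    norm = sqrt(sum(q * q for q in q_vec)) * sqrt(sum(a * a for a in a_vec))
    return dot / norm if norm else 0.0
```

Under these assumptions, retrieval would amount to computing the score for every stored annotation and selecting the annotation with the highest value.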

130 Claims

1. A comparison apparatus comprising:
- a receiver operable to receive first and second sequences of labels;
  
  an identifier operable to identify a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first determiner operable to determine and to output the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a definer operable to define a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determiner operable to determine and to output the number of times each of said different first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  a similarity measure calculator operable to calculate a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determiner with the numbers output from said second determiner;
  
  wherein said second determiner comprises;
  
  a comparator operable to compare a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels, to provide a set of sub-sequence similarity measures; and
  
  a counter operable to count the number of times the current first sub-sequence of labels occurs within the second sequence of labels in dependence upon the set of sub-sequence similarity measures provided by said comparator for the current first sub-sequence of labels.
- - 2. An apparatus according to claim 1, wherein each of said first sub-sequences comprises the same number of labels.
  - 3. An apparatus according to claim 1, wherein each of said second sub-sequences comprises the same number of labels.
  - 4. An apparatus according to claim 1, wherein said second sub-sequences of labels comprise the same number of labels as said first sub-sequences of labels.
  - 5. An apparatus according to claim 1, wherein said first determiner comprises a comparator operable to perform a Boolean match between each first sub-sequence of labels and the first sequence of labels and a counter operable to increment a count associated with the current first sub-sequence of labels each time the current first sub-sequence of labels matches a sub-sequence of labels from the first sequence of labels.
  - 6. An apparatus according to claim 1, wherein said second determiner further comprises an aligner operable to align labels of the current first sub-sequence of labels with labels of a current second sub-sequence of labels to form a number of aligned pairs of labels;
    - wherein said comparator is operable to compare the labels of each aligned pair of labels using said confusion information to generate a comparison score representative of the similarity between the aligned pair of labels; and
      
      wherein said comparator further comprises a comparison score combiner operable to combine the comparison scores for all the aligned pairs of labels for the current first and second sub-sequences to provide the sub-sequence similarity measure for the current first sub-sequence of labels and the current second sub-sequence of labels.
  - 7. An apparatus according to claim 6, wherein said comparator comprises:
    - a first sub-comparator operable to compare, for each aligned pair, the first sub-sequence label in the aligned pair with each of a plurality of labels taken from a set of predetermined labels to provide a corresponding plurality of intermediate comparison scores representative of the similarity between said first sub-sequence label and the respective labels from the set;
      
      a second sub-comparator operable to compare, for each aligned pair, the second sub-sequence label in the aligned pair with each of said plurality of labels from the set to provide a further corresponding plurality of intermediate comparison scores representative of the similarity between said second sub-sequence label and the respective labels from the set; and
      
      a comparison score calculator operable to calculate said comparison score for the aligned pair by combining said pluralities of intermediate comparison scores.
  - 8. An apparatus according to claim 7, wherein said first and second sub-comparators are operable to compare the first sub-sequence label and the second sub-sequence label of the aligned pair respectively with each of the labels in said set of predetermined labels.
  - 9. An apparatus according to claim 7, wherein said comparator is operable to generate a comparison score for an aligned pair of labels which represents a probability of confusing the second sub-sequence label of the aligned pair as the first sub-sequence label of the aligned pair.
  - 10. An apparatus according to claim 9, wherein said first and second sub-comparators are operable to provide intermediate comparison scores which are indicative of a probability of confusing the corresponding label taken from the set of predetermined labels as the label in the aligned pair.
  - 11. An apparatus according to claim 10, wherein said comparison score calculator is operable (i) to multiply the intermediate scores obtained when comparing the first and second sub-sequence labels in the aligned pair with the same label from the set to provide a plurality of multiplied intermediate comparison scores; and (ii) to add the resulting multiplied intermediate scores, to calculate said comparison score for the aligned pair.
  - 12. An apparatus according to claim 11, wherein each of said labels in said set of predetermined labels has a predetermined probability of occurring within a sequence of labels and wherein said comparison score calculator is operable to weigh each of said multiplied intermediate comparison scores with the respective probability of occurrence for the label from the set used to generate the multiplied intermediate comparison scores.
  - 13. An apparatus according to claim 12, wherein said comparison score calculator is operable to calculate:
    - $\sum_{r = 1}^{n} P(q_{j} | p_{r}) P(a_{i} | p_{r}) P(p_{r})$ where $q_j$ and $a_i$ are the aligned pair of first and second sub-sequence labels respectively;
      
      $P(q_j | p_r)$ is the probability of confusing set label $p_r$ as first sub-sequence label $q_j$;
      
      $P(a_i | p_r)$ is the probability of confusing set label $p_r$ as second sub-sequence label $a_i$; and
      
      $P(p_r)$ represents the probability of set label $p_r$ occurring in a sequence of labels.
  - 14. An apparatus according to claim 13, wherein the confusion probabilities for the first and second sequence labels are determined in advance and depend upon the recognition system that was used to generate the respective first and second sequences.
  - 15. An apparatus according to claim 11, wherein said intermediate scores represent log probabilities and wherein said comparison score calculator is operable to perform said multiplication by adding the respective intermediate scores and is operable to perform said addition of said multiplied scores by performing a log addition calculation.
  - 16. An apparatus according to claim 15, wherein said comparison score combiner is operable to add the comparison scores for all the aligned pairs to determine said sub-sequence similarity measure.
  - 17. An apparatus according to claim 6, wherein said aligner is operable to identify label deletions and insertions in said first and second sequences of labels and wherein said comparator is operable to generate said comparison score for an aligned pair of labels in dependence upon label deletions and insertions identified by said aligner which occur in the vicinity of the labels in the aligned pair.
  - 18. An apparatus according to claim 6, wherein said aligner comprises a dynamic programming unit operable to align said first and second sequences of labels using a dynamic programming technique.
  - 19. An apparatus according to claim 18, wherein said dynamic programming unit is operable to determine progressively a plurality of possible alignments between said current first sub-sequence of labels and said current second sub-sequence of labels and wherein said comparator is operable to determine a comparison score for each of the possible aligned pairs of labels determined by said dynamic programming unit.
  - 20. An apparatus according to claim 19, wherein said comparator is operable to generate said comparison score during the progressive determination of said possible alignments.
  - 21. An apparatus according to claim 18, wherein said dynamic programming unit is operable to determine an optimum alignment between said current first sub-sequence of labels and said current second sub-sequence of labels and wherein said comparison score combiner is operable to provide said sub-sequence similarity measure by combining the comparison scores only for the optimum aligned pairs of labels.
  - 22. An apparatus according to claim 19, wherein said comparison score combiner is operable to provide said sub-sequence similarity measure by combining all the comparison scores for all the possible aligned pairs of labels.
  - 23. An apparatus according to claim 7, wherein each of the labels in said first and second sub-sequences of labels belongs to said set of predetermined labels and wherein said confusion information comprises, for each label in the set of labels, a probability for confusing that label with each of the other labels in the set of labels.
  - 24. An apparatus according to claim 23, wherein said confusion probabilities are determined in advance and depend upon the system used to generate the first and second sub-sequences of labels.
  - 25. An apparatus according to claim 23, wherein said predetermined data further includes, for each label in the set of labels, a probability of inserting the label in a sequence of labels.
  - 26. An apparatus according to claim 23, wherein said predetermined data further includes, for each label in the set of labels, a probability of deleting the label from a sequence of labels.
  - 27. An apparatus according to claim 6, wherein said second determiner further comprises a normalising unit operable to normalise each of said sub-sequence similarity measures.
  - 28. An apparatus according to claim 27, wherein said normalising unit is operable to normalise each sub-sequence similarity measure by dividing each sub-sequence similarity measure by a respective normalisation score which varies in dependence upon the length of the corresponding first and second sub-sequences of labels.
  - 29. An apparatus according to claim 27, wherein the respective normalisation scores vary in dependence upon the sequence of labels in the corresponding first and second sub-sequences of labels.
  - 30. An apparatus according to claim 27, wherein a dynamic programming unit is operable to determine progressively a plurality of possible alignments between said current first sub-sequence of labels and said current second sub-sequence of labels and wherein said comparator is operable to determine a comparison score for each of the possible aligned pairs of labels determined by said dynamic programming unit and wherein said normalising unit is operable to calculate the respective normalisation scores during the progressive calculation of said possible alignments by said dynamic programming unit.
  - 31. An apparatus according to claim 1, wherein said definer is operable to define said plurality of second sub-sequences as successive portions of the second sequence of labels.
  - 32. An apparatus according to claim 31, wherein said successive portions are separated from each other by a single label.
  - 33. An apparatus according to claim 1, wherein said similarity measure calculator calculates said measure of the similarity by treating the numbers output by said first determiner as a first vector and the numbers output from said second determiner as a second vector and by determining a cosine measure of the angle between the two vectors.
  - 34. An apparatus according to claim 1, wherein said first and second sequences of labels represent time sequential signals.
  - 35. An apparatus according to claim 1, wherein said first and second sequences of labels represent audio signals.
  - 36. An apparatus according to claim 35, wherein said first and second sequences of labels represent speech.
  - 37. An apparatus according to claim 36, wherein each of said labels represents a sub-word unit of speech.
  - 38. An apparatus according to claim 37, wherein each of said labels represents a phoneme.
  - 39. An apparatus according to claim 1, wherein said first sequence of labels comprises a plurality of sub-word units generated from a typed input and wherein said confusion information comprises mis-typing probabilities and/or mis-spelling probabilities.
  - 40. An apparatus according to claim 1, wherein said second sequence of labels comprises a sequence of sub-word units generated from a spoken input and wherein said confusion information comprises mis-recognition probabilities.
  - 41. An apparatus according to claim 1, wherein said receiver is operable to receive a plurality of second sequences of labels, wherein said second determiner is operable to determine and output the number of times each of said first sub-sequences of labels occurs within each of said second sequences of labels and wherein said similarity measure calculator is operable to compute a respective measure of the similarity between the first sequence of labels and said plurality of second sequences of labels.
  - 42. An apparatus according to claim 41, further comprising a sequence determiner operable to compare said plurality of similarity measures output by said similarity measure calculator and for outputting a signal indicative of the second sequence of labels which is most similar to said first sequence of labels.
  - 43. An apparatus for searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of labels, the apparatus comprising:
    - a receiver operable to receive an input query comprising a sequence of labels;
      
      an apparatus according to claim 1 operable to compare the query sequence of labels with the labels of each annotation to provide a set of comparison results; and
      
      an identifier operable to identify said information to be retrieved from said database using said comparison results.
  - 44. An apparatus for searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of speech labels, the apparatus comprising:
    - a receiver operable to receive an input query comprising a sequence of speech labels;
      
      an apparatus according to claim 1 operable to compare said query sequence of speech labels with the speech labels of each annotation to provide a set of comparison results; and
      
      an identifier operable to identify said information to be retrieved from said database using said comparison results;
      
      wherein said apparatus according to claim 1 has a plurality of different comparison modes of operation and in that the apparatus further comprises;
      
      a determiner operable to determine (i) if the query sequence of speech labels was generated from an audio signal or from text; and
      
      (ii) if the sequence of speech labels of a current annotation was generated from an audio signal or from text, and to output a determination result; and
      
      a selector operable to select, for the current annotation, the mode of operation of said apparatus according to claim 1 in dependence upon said determination result.
  - 45. An apparatus according to claim 1, wherein said counter is operable to:
    - threshold each intermediate similarity measure in the set of intermediate similarity measures with a predetermined threshold value to provide a threshold result; and
      
      increment said count associated with the current first sub-sequence of labels in dependence upon said threshold result.
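
The confusion-based comparison score of claim 13 and the thresholded counting of claim 45 can be sketched together in Python. This is an illustration under stated assumptions rather than the patented implementation: labels are aligned position by position (no insertions or deletions, so no dynamic programming), second sub-sequences are windows shifted by one label, per-pair scores are combined by adding their logarithms, and no normalisation is applied. The names `pair_score` and `soft_count`, and the toy tables `CONFUSION` and `PRIOR`, are invented for the example.

```python
import math

# Toy data (invented): CONFUSION[p][x] is P(x | p), the probability that
# set label p is decoded as label x. PRIOR[p] is P(p).
CONFUSION = {
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.2, "b": 0.8},
}
PRIOR = {"a": 0.5, "b": 0.5}


def pair_score(q, a, confusion, prior):
    """Compute sum over set labels p of P(q | p) * P(a | p) * P(p)."""
    return sum(confusion[p][q] * confusion[p][a] * prior[p] for p in prior)


def soft_count(fragment, sequence, confusion, prior, log_threshold):
    """Count windows of `sequence` whose similarity to `fragment` passes a threshold.

    Assumptions: position-by-position alignment and strictly positive
    pair scores, so that the logarithm is defined.
    """
    n = len(fragment)
    count = 0
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # Combine the per-pair comparison scores by adding log scores.
        log_score = sum(
            math.log(pair_score(q, a, confusion, prior))
            for q, a in zip(fragment, window)
        )
        if log_score >= log_threshold:
            count += 1
    return count
```

With the toy tables, `soft_count(["a", "b"], ["a", "b", "a", "a", "b"], CONFUSION, PRIOR, math.log(0.1))` counts only the two exact occurrences, while a looser threshold of `math.log(0.05)` also admits the near-match window `a a`.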

46. A comparison apparatus comprising:
- a receiver operable to receive first and second sequences of labels;
  
  an identifier operable to identify a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first determiner operable to determine the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a second determiner operable to determine the number of times each of said different first sub-sequences occurs within said second sequence of labels; and
  
  a similarity measure calculator operable to calculate a similarity score measure representative of the similarity between the first and second sequences of labels using the numbers obtained from said first and second determiners;
  
  wherein the apparatus further comprises a third determiner operable to determine the total number of sub-sequences of labels in said second sequence; and
  
  in that said similarity score calculator comprises;
  
  a first sub-calculator operable to calculate a measure of the probability of each of said first sub-sequences occurring in said second sequence of labels using the numbers obtained from said second determiner and the number obtained from said third determiner; and
  
  a second sub-calculator operable to calculate said similarity score by taking products of said computed probability measures in dependence upon said numbers obtained from said first determiner.
- - 47. An apparatus according to claim 46, wherein the probability measure which is computed for each first sub-sequence occurring in said second sequence by said first sub-calculator is proportional to the number of times the first sub-sequence occurs within said second sequence obtained from said second determiner and is inversely proportional to the total number of sub-sequences of labels in the second sequence obtained from said third determiner.
  - 48. An apparatus according to claim 46, wherein said similarity score calculator is operable to calculate the similarity measure by calculating:
    - $\prod_{i;\, Q_{i} \neq 0} \prod_{j = 0}^{Q_{i} - 1} \left[ \frac{A_{i} + j + \alpha}{D + j_{s} + m \alpha} \right]$ where the term in brackets is the probability measure calculated by said first sub-calculator for the $i$-th sub-sequence;
      
      $A_i$ is the number of times the $i$-th sub-sequence occurs in the second sequence of labels;
      
      $j$ is a loop counter used to ensure that the probability measure in brackets is multiplied for each of the occurrences of the $i$-th sub-sequence in the first sequence of labels;
      
      $D$ is the total number of sub-sequences of labels in the second sequence of labels obtained by said third determiner;
      
      $j_s$ is an index which is incremented at each calculation of the probability measure in brackets; and
      
      $\alpha$ and $m\alpha$ are constants to ensure that the probability measure in square brackets does not go below a predetermined lower limit.
  - 49. An apparatus according to claim 48, wherein $\alpha$ lies between zero and one.
  - 50. An apparatus according to claim 46, wherein said first determiner comprises a comparator operable to perform a Boolean match between each first sub-sequence of labels and the first sequence of labels and a counter operable to increment a count associated with a current first sub-sequence of labels each time the current first sub-sequence of labels matches a sub-sequence of labels in the first sequence of labels.
  - 51. An apparatus according to claim 46, wherein said second determiner comprises a comparator operable to perform a Boolean match between each first sub-sequence of labels and the second sequence of labels and a counter operable to increment a count associated with a current first sub-sequence of labels each time the current first sub-sequence of labels matches a sub-sequence of labels in the second sequence of labels.
  - 52. An apparatus according to claim 46, further comprising a definer operable to define a plurality of second sub-sequences of labels from said second sequence of labels and wherein said second determiner is operable to determine said numbers by comparing each first sub-sequence of labels with each second sub-sequence of labels.
  - 53. An apparatus according to claim 52, wherein said second determiner comprises;
    - a comparator operable to compare a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels to provide a set of sub-sequence similarity measures; and
      
      a counter operable to count the number of times the current sub-sequence of labels occurs within the second sequence of labels in dependence upon the set of sub-sequence similarity measures provided by said comparator for the current first sub-sequence of labels.
  - 54. An apparatus according to claim 53, wherein said second determiner further comprises an aligner operable to align labels of the current first sub-sequence of labels with labels of a current second sub-sequence of labels to form a number of aligned pairs of labels;
    - wherein said comparator is operable to compare the labels of each aligned pair of labels using said confusion information to generate a comparison score representative of the similarity between the aligned pair of labels; and
      
      wherein said comparator further comprises a comparison score combiner operable to combine the comparison scores for all the aligned pairs of labels to provide the sub-sequence similarity measure for the current first sub-sequence of labels and the current second sub-sequence of labels.
  - 55. An apparatus according to claim 48, wherein each of the labels in said first and second sequences of labels belongs to a set of predetermined labels and wherein m is the number of possible sub-sequences of labels which can be formed from the set of predetermined labels.
  - 56. An apparatus according to claim 46, wherein said receiver is operable to receive a plurality of second sequences of labels, wherein said second determiner is operable to determine the number of times each of said first sub-sequences of labels occurs within each of said second sequences of labels and wherein said similarity score calculator is operable to calculate a respective measure of the similarity between the first sequence of labels and said plurality of second sequences of labels.
  - 57. An apparatus according to claim 56, further comprising a sequence determiner operable to compare said plurality of similarity measures output by said similarity score calculator and for outputting a signal indicative of the second sequence of labels which is most similar to said first sequence of labels.
  - 102. A method of searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of speech labels, the method comprising the steps of:
    - receiving an input query comprising a sequence of speech labels;
      
      a method according to claim 58 for comparing said query sequence of speech labels with the speech labels of each annotation to provide a set of comparison results; and
      
      identifying said information to be retrieved from said database using said comparison results;
      
      wherein said method according to claim 58 has a plurality of different comparison modes of operation and in that the method further comprises the steps of;
      
      determining if the query sequence of speech labels was generated from an audio signal or from text and if the sequence of speech labels of a current annotation was generated from an audio signal or from text and outputting a determination result; and
      
      selecting, for the current annotation, the mode of operation of said method according to claim 58 in dependence upon said determination result.
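
The product in claim 48 can be evaluated as in the following Python sketch, which works in the log domain to avoid underflow when many factors are multiplied. Several points are assumptions, not statements from the claims: $Q_i$ is read as the number of times the $i$-th sub-sequence occurs in the first sequence, the index $j_s$ starts at zero, and the function name `log_similarity`, its parameter names, and the default `alpha=0.5` are hypothetical.

```python
import math


def log_similarity(first_counts, second_counts, total_second, m, alpha=0.5):
    """Return the logarithm of the claim 48 product.

    first_counts:  Q_i, occurrences of each sub-sequence in the first sequence
    second_counts: A_i, occurrences of each sub-sequence in the second sequence
    total_second:  D, total number of sub-sequences in the second sequence
    m:             number of possible sub-sequences (see claim 55)
    alpha:         smoothing constant, assumed here to lie between zero and one
    """
    log_score = 0.0
    j_s = 0  # assumption: starts at zero, incremented after every factor
    for sub_sequence, q_i in first_counts.items():
        if q_i == 0:
            continue  # the product only runs over sub-sequences with Q_i != 0
        a_i = second_counts.get(sub_sequence, 0)
        for j in range(q_i):
            numerator = a_i + j + alpha
            denominator = total_second + j_s + m * alpha
            log_score += math.log(numerator / denominator)
            j_s += 1
    return log_score
```

Because `alpha` is positive, a sub-sequence that never occurs in the second sequence still contributes a small non-zero factor, which is the lower-limit behaviour the claim describes.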

58. A comparison method comprising the steps of:
- receiving first and second sequences of labels;
  
  identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first determining step of determining and outputting the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determining step of determining and outputting the number of times each of said different first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining step with the numbers output from said second determining step;
  
  wherein said second determining step comprises the steps of;
  
  comparing a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels, to provide a set of sub-sequence similarity measures; and
  
  counting the number of times the current first sub-sequence of labels occurs within the second sequence of labels in dependence upon the set of sub-sequence similarity measures provided by the comparing step for the current first sub-sequence of labels.
- - 59. A method according to claim 58, wherein each of said first sub-sequences comprises the same number of labels.
  - 60. A method according to claim 58, wherein each of said second sub-sequences comprises the same number of labels.
  - 61. A method according to claim 58, wherein said second sub-sequences of labels comprise the same number of labels as said first sub-sequences of labels.
  - 62. A method according to claim 58, wherein said first determining step comprises the step of performing a Boolean match between a current first sub-sequence of labels and the first sequence of labels and the step of incrementing a count associated with the current first sub-sequence of labels each time the current first sub-sequence of labels matches a sub-sequence of labels from the first sequence of labels.
  - 63. A method according to claim 58, wherein said second determining step further comprises the step of aligning labels of the current first sub-sequence of labels with labels of a current second sub-sequence of labels to form a number of aligned pairs of labels;
    - wherein said comparing step compares the labels of each aligned pair of labels using said similarity information to generate a comparison score representative of the similarity between the aligned pair of labels; and
      
      wherein said comparing step further comprises the step of combining the comparison scores for all the aligned pairs of labels for the current first and second sub-sequences to provide the sub-sequence similarity measure for the current first sub-sequence of labels and the current second sub-sequence of labels.
  - 64. A method according to claim 63, wherein said comparing step comprises:
    - a first comparing step of comparing, for each aligned pair, the first sub-sequence label in the aligned pair with each of a plurality of labels taken from a set of predetermined labels to provide a corresponding plurality of intermediate comparison scores representative of the similarity between said first sub-sequence label and the respective labels from the set;
      
      a second comparing step of comparing, for each aligned pair, the second sub-sequence label in the aligned pair with each of said plurality of labels from the set to provide a further corresponding plurality of intermediate comparison scores representative of the similarity between said second sub-sequence label and the respective labels from the set; and
      
      a step of calculating said comparison score for the aligned pair by combining said pluralities of intermediate comparison scores.
  - 65. A method according to claim 64, wherein said first and second comparing steps compare the first sub-sequence label and the second sub-sequence label of the aligned pair respectively with each of the labels in said set of predetermined labels.
  - 66. A method according to claim 64, wherein said comparing step generates a comparison score for an aligned pair of labels which represents a probability of confusing the second sub-sequence label of the aligned pair as the first sub-sequence label of the aligned pair.
  - 67. A method according to claim 66, wherein said first and second comparing steps provide intermediate comparison scores which are indicative of a probability of confusing the corresponding label taken from the set of predetermined labels as the label in the aligned pair.
  - 68. A method according to claim 67, wherein said calculating step comprises the steps of (i) multiplying the intermediate scores obtained when comparing the first and second sub-sequence labels in the aligned pair with the same label from the set to provide a plurality of multiplied intermediate comparison scores; and (ii) adding the resulting multiplied intermediate scores, to calculate said comparison score for the aligned pair.
  - 69. A method according to claim 68, wherein each of said labels in said set of predetermined labels has a predetermined probability of occurring within a sequence of labels and wherein said calculating step weighs each of said multiplied intermediate comparison scores with the respective probability of occurrence for the label from the set used to generate the multiplied intermediate comparison scores.
  - 70. A method according to claim 69, wherein said calculating step calculates:
    - $\sum_{r = 1}^{n} P(q_{j} | p_{r}) P(a_{i} | p_{r}) P(p_{r})$ where $q_j$ and $a_i$ are the aligned pair of first and second sub-sequence labels respectively;
      
      $P(q_j | p_r)$ is the probability of confusing set label $p_r$ as first sub-sequence label $q_j$;
      
      $P(a_i | p_r)$ is the probability of confusing set label $p_r$ as second sub-sequence label $a_i$; and
      
      $P(p_r)$ represents the probability of set label $p_r$ occurring in a sequence of labels.
  - 71. A method according to claim 70, wherein the confusion probabilities for the first and second sequence labels are determined in advance and depend upon the recognition system that was used to generate the respective first and second sequences.
  - 72. A method according to claim 68, wherein said intermediate scores represent log probabilities and wherein said calculating step performs said multiplication by adding the respective intermediate scores and performs said addition of said multiplied scores by performing a log addition calculation.
  - 73. A method according to claim 72, wherein said combining step adds the comparison scores for all the aligned pairs to determine said similarity measure.
  - 74. A method according to claim 63, wherein said aligning step identifies label deletions and insertions in said first and second sequences of labels and wherein said comparing step is operable to generate said comparison score for an aligned pair of labels in dependence upon label deletions and insertions identified by said aligning step which occur in the vicinity of the labels in the aligned pair.
  - 75. A method according to claim 63, wherein said aligning step uses a dynamic programming alignment algorithm to align said first and second sequences of labels.
  - 76. A method according to claim 75, wherein said dynamic programming algorithm progressively determines a plurality of possible alignments between said current first sub-sequence of labels and said current second sub-sequence of labels and wherein said comparing step determines a comparison score for each of the possible aligned pairs of labels determined by said dynamic programming algorithm.
  - 77. A method according to claim 76, wherein said comparing step generates said comparison score during the progressive determination of said possible alignments.
  - 78. A method according to claim 75, wherein said dynamic programming algorithm determines an optimum alignment between said current first sub-sequence of labels and said current second sub-sequence of labels and wherein said combining step provides said sub-sequence similarity measure by combining the comparison scores only for the optimum aligned pairs of labels.
  - 79. A method according to claim 76, wherein said combining step provides said sub-sequence similarity measure by combining all the comparison scores for all the possible aligned pairs of labels.
  - 80. A method according to claim 58, wherein each of the labels in said first and second sub-sequences of labels belongs to said set of predetermined labels and wherein said confusion information comprises, for each label in the set of labels, a probability for confusing that label with each of the other labels in the set of labels.
  - 81. A method according to claim 80, wherein said confusion probabilities are determined in advance and depend upon the system used to generate the first and second sub-sequences of labels.
  - 82. A method according to claim 80, wherein said predetermined data further includes, for each label in the set of labels, a probability of inserting the label in a sequence of labels.
  - 83. A method according to claim 80, wherein said predetermined data further includes, for each label in the set of labels, a probability of deleting the label from a sequence of labels.
  - 84. A method according to claim 63, wherein said second determining step further comprises the step of normalising each of said sub-sequence similarity measures.
  - 85. A method according to claim 84, wherein said normalising step normalises each sub-sequence similarity measure by dividing each similarity measure by a respective normalisation score which varies in dependence upon the length of the corresponding first and second sub-sequences of labels.
  - 86. A method according to claim 84, wherein the respective normalisation scores vary in dependence upon the sequence of labels in the corresponding first and second sub-sequences of labels.
  - 87. A method according to claim 84, wherein said aligning step uses a dynamic programming alignment algorithm to align said first and second sequences of labels and wherein said normalising step calculates the respective normalisation scores during the progressive calculation of said possible alignments by said dynamic programming algorithm.
  - 88. A method according to claim 58, wherein said defining step defines said plurality of second sub-sequences as successive portions of the second sequence of labels.
  - 89. A method according to claim 88, wherein successive portions are separated from each other by a single label.
  - 90. A method according to claim 58, wherein said computing step computes said measure of the similarity between the first and second sequences of labels by treating the numbers output by the first determining step as a first vector and by treating the numbers output by said second determining step as a second vector and by determining a cosine measure of the angle between the two vectors.
  - 91. A method according to claim 58, wherein said first and second sequences of labels represent time sequential signals.
  - 92. A method according to claim 58, wherein said first and second sequences of labels represent audio signals.
  - 93. A method according to claim 92, wherein said first and second sequences of labels represent speech.
  - 94. A method according to claim 93, wherein each of said labels represents a sub-word unit of speech.
  - 95. A method according to claim 94, wherein each of said labels represents a phoneme.
  - 96. A method according to claim 58, wherein said first sequence of labels comprises a plurality of sub-word units generated from a typed input and wherein said similarity information comprises mis-typing probabilities and/or mis-spelling probabilities.
  - 97. A method according to claim 58, wherein said second sequence of labels comprises a sequence of sub-word units generated from a spoken input and wherein said similarity information comprises mis-recognition probabilities.
  - 98. A method according to claim 58, wherein said receiving step receives a plurality of second sequences of labels, wherein said second determining step determines and outputs the number of times each of said first sub-sequences of labels occurs within each of said second sequences of labels and wherein said computing step computes a respective measure of the similarity between the first sequence of labels and said plurality of second sequences of labels.
  - 99. A method according to claim 98, further comprising the step of comparing said plurality of similarity measures output by said computing step and the step of outputting a signal indicative of the second sequence of labels which is most similar to said first sequence of labels.
  - 100. A method of searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of labels, the method comprising the steps of:
    - receiving an input query comprising a sequence of labels;
      
      a method according to claim 58 for comparing the query sequence of labels with the labels of each annotation to provide a set of comparison results; and
      
      identifying said information to be retrieved from said database using said comparison results.
  - 101. A method according to claim 100, wherein one or more of said information entries is the associated annotation.
  - 103. A method according to claim 58, wherein the steps of the claimed method are performed in the order given in the claims.
  - 104. A method according to claim 58, wherein said counting step comprises:
    - thresholding each intermediate similarity measure in the set of intermediate similarity measures with a predetermined threshold value to provide a threshold result; and
      
      incrementing a count associated with the current first sub-sequence of labels in dependence upon said threshold result.
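
The per-pair score of claim 70 and the thresholded counting of claim 104 can be illustrated together. The following is only a minimal sketch under stated assumptions, not the patented implementation: the two-label set, the `CONFUSION` and `PRIOR` tables, the one-to-one alignment without insertions or deletions, the window shift of one label, and the `>=` threshold test are all hypothetical choices made for illustration. Normalisation (claims 84 to 87) is omitted.

```python
from typing import Dict, Sequence

# Hypothetical toy data. CONFUSION[p][x] stands in for P(x | p), the probability
# of decoding set label p as label x; PRIOR[p] stands in for P(p).
CONFUSION: Dict[str, Dict[str, float]] = {
    "a": {"a": 0.8, "b": 0.2},
    "b": {"a": 0.3, "b": 0.7},
}
PRIOR: Dict[str, float] = {"a": 0.5, "b": 0.5}


def pair_score(q: str, a: str) -> float:
    """Claim 70 formula: sum over set labels p of P(q|p) * P(a|p) * P(p)."""
    return sum(CONFUSION[p][q] * CONFUSION[p][a] * PRIOR[p] for p in PRIOR)


def subsequence_score(first: Sequence[str], second: Sequence[str]) -> float:
    """Combine pair scores by multiplication over a one-to-one alignment.

    Assumption: both sub-sequences have equal length and are aligned
    position by position (no dynamic programming, no insertions/deletions).
    """
    score = 1.0
    for q, a in zip(first, second):
        score *= pair_score(q, a)
    return score


def soft_count(first: Sequence[str], sequence: Sequence[str], threshold: float) -> int:
    """Claim 104 idea: increment a count for each second sub-sequence whose
    similarity measure passes the threshold (here: score >= threshold)."""
    n = len(first)
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    return sum(1 for w in windows if subsequence_score(first, w) >= threshold)


if __name__ == "__main__":
    # Counts how often the fragment "a b" is judged to occur in "a b a b".
    print(soft_count(["a", "b"], ["a", "b", "a", "b"], threshold=0.05))
```

With these toy tables the exact windows `a b` score higher than the swapped window `b a`, so lowering the threshold admits the confusable window as an additional occurrence.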

105. A comparison method comprising the steps of:
- receiving first and second sequences of labels;
  
  identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first obtaining step of obtaining the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a second obtaining step of obtaining the number of times each of said different first sub-sequences occurs within said second sequence of labels; and
  
  computing a similarity score representative of the similarity between the first and second sequences of labels using the numbers obtained from said first and second obtaining steps;
  
  wherein the method further comprises a third obtaining step of obtaining the total number of sub-sequences of labels in said second sequence; and
  
  in that said computing step comprises;
  
  a first computing step of computing a measure of the probability of each of said first sub-sequences occurring in said second sequence of labels using the numbers obtained from said second obtaining step and the number obtained from said third obtaining step; and
  
  a second computing step of computing said similarity score by taking products of said computed probability measures in dependence upon said numbers obtained from said first obtaining step.
- View Dependent Claims (106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116)
- - 106. A method according to claim 105, wherein the probability measure which is computed for each first sub-sequence occurring in said second sequence in said first computing step is proportional to the number of times the first sub-sequence occurs within said second sequence obtained in said second obtaining step and is inversely proportional to the total number of sub-sequences of labels in the second sequence obtained in said third obtaining step.
  - 107. A method according to claim 105, wherein said computing step computes the similarity measure by calculating:
    - $\prod_{i; Q_{i} \neq 0} \prod_{j = 0}^{Q_{i} - 1} \left[ \frac{A_{i} + j + \alpha}{D + j_{s} + m \alpha} \right]$ where the term in brackets is the probability measure calculated in said first computing step for the $i$-th sub-sequence;
      
      $A_i$ is the number of times the $i$-th sub-sequence occurs in the second sequence of labels;
      
      $j$ is a loop counter used to ensure that the probability measure in brackets is multiplied for each of the occurrences of the $i$-th sub-sequence in the first sequence of labels;
      
      $D$ is the total number of sub-sequences of labels in the second sequence of labels obtained in said third obtaining step;
      
      $j_s$ is an index which is incremented at each calculation of the probability measure in brackets; and
      
      $\alpha$ and $m\alpha$ are constants to ensure that the probability measure in square brackets does not go below a predetermined lower limit.
  - 108. A method according to claim 107, wherein α lies between zero and one.
  - 109. A method according to claim 105, wherein said first obtaining step comprises the step of performing a Boolean match between each first sub-sequence of labels and the first sequence of labels and the step of incrementing a count associated with a current first sub-sequence of labels each time the current first sub-sequence of labels matches a sub-sequence of labels in the first sequence of labels.
  - 110. A method according to claim 105, wherein said second obtaining step comprises the step of performing a Boolean match between each first sub-sequence of labels and the second sequence of labels and the step of incrementing a count associated with a current first sub-sequence of labels each time the current first sub-sequence of labels matches a sub-sequence of labels in the second sequence of labels.
  - 111. A method according to claim 105, further comprising the step of defining a plurality of second sub-sequences of labels from said second sequence of labels and wherein said second obtaining step obtains said numbers by comparing each first sub-sequence of labels with each second sub-sequence of labels.
  - 112. A method according to claim 111, wherein said second obtaining step comprises the steps of;
    - comparing a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels to provide a set of sub-sequence similarity measures;
      
      thresholding each intermediate similarity measure of the set with a predetermined threshold value and outputting a threshold result; and
      
      determining the number of times the current sub-sequence of labels occurs within the second sequence of labels in dependence upon the threshold results output for the corresponding set of sub-sequence similarity measures.
  - 113. A method according to claim 112, wherein said second obtaining step further comprises the step of aligning labels of the current first sub-sequence of labels with labels of a current second sub-sequence of labels to form a number of aligned pairs of labels;
    - wherein said comparing step compares the labels of each aligned pair of labels using said similarity information to generate a comparison score representative of the similarity between the aligned pair of labels; and
      
      wherein said comparing step further comprises the step of combining the comparison scores for all the aligned pairs of labels to provide the sub-sequence similarity measure for the current first sub-sequence of labels and the current second sub-sequence of labels.
  - 114. A method according to claim 107, wherein each of the labels in said first and second sequences of labels belongs to a set of predetermined labels and wherein m is the number of possible sub-sequences of labels which can be formed from the set of predetermined labels.
  - 115. A method according to claim 105, wherein said receiving step receives a plurality of second sequences of labels, wherein said second obtaining step obtains and outputs the number of times each of said first sub-sequences of labels occurs within each of said second sequences of labels and wherein said computing step computes a respective measure of the similarity between the first sequence of labels and said plurality of second sequences of labels.
  - 116. A method according to claim 115, further comprising the step of comparing said plurality of similarity measures output by said computing step and outputting a signal indicative of the second sequence of labels which is most similar to said first sequence of labels.
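
The product in claim 107 can be evaluated with two nested loops. The sketch below is illustrative only and rests on assumptions not stated in the claim: Q_i is read as the number of times the i-th sub-sequence occurs in the first sequence, the index j_s is taken to start at zero and to increase by one after every factor, and the function and parameter names are hypothetical.

```python
from typing import Dict


def similarity_score(
    first_counts: Dict[str, int],   # Q_i: occurrences in the first sequence
    second_counts: Dict[str, int],  # A_i: occurrences in the second sequence
    total: int,                     # D: total sub-sequences in the second sequence
    m: int,                         # number of possible sub-sequences (claim 114)
    alpha: float = 0.5,             # smoothing constant, between 0 and 1 (claim 108)
) -> float:
    """Evaluate prod_i prod_{j=0}^{Q_i-1} (A_i + j + alpha) / (D + j_s + m*alpha)."""
    score = 1.0
    j_s = 0  # assumed to start at zero; incremented after every factor
    for sub, q in first_counts.items():
        if q == 0:
            continue  # the outer product runs only over Q_i != 0
        a = second_counts.get(sub, 0)
        for j in range(q):
            score *= (a + j + alpha) / (total + j_s + m * alpha)
            j_s += 1
    return score


if __name__ == "__main__":
    q = {"ab": 2, "bc": 1}
    a = {"ab": 1, "ba": 2}
    print(similarity_score(q, a, total=3, m=4, alpha=0.5))
```

Because α is strictly positive, a sub-sequence that never occurs in the second sequence (such as `bc` above) contributes a small non-zero factor instead of forcing the whole product to zero. For long queries a practical variant might sum logarithms of the factors to avoid underflow; that is a design suggestion, not something the claim recites.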

117. A computer readable medium storing processor implementable process steps for carrying out a comparison method, the process steps comprising:
- a step of receiving first and second sequences of labels;
  
  a step of identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first determining step of determining and outputting the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a step of defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determining step of determining and outputting the number of times each of said different first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  a step of computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining step with the numbers output from said second determining step;
  
  wherein said second determining step comprises;
  
  a step of comparing a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels, to provide a set of sub-sequence similarity measures; and
  
  a step of counting the number of times the current first sub-sequence of labels occurs within the second sequence of labels in dependence upon the set of sub-sequence similarity measures provided by the comparing step for the current first sub-sequence of labels.
- View Dependent Claims (119)
- - 119. A computer readable medium storing processor implementable instructions for carrying out a method of searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of labels, the process steps comprising:
    - a step of receiving an input query comprising a sequence of labels;
      
      the process steps stored on the computer readable medium according to claim 117 or 118 for comparing the query sequence of labels with the labels of each annotation to provide a set of comparison results; and
      
      a step of identifying said information to be retrieved from said database using said comparison results.
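
The database search recited in claim 119 amounts to scoring the query against every annotation and selecting entries from the resulting comparison results. The sketch below shows only that outer loop, under assumptions: `search`, `shared_bigrams`, and the entry identifiers are hypothetical names, and `shared_bigrams` is a deliberately simple stand-in, not the similarity measure of claim 117 or 118.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Labels = Sequence[str]


def search(
    query: Labels,
    annotations: Dict[str, Labels],
    similarity: Callable[[Labels, Labels], float],
) -> List[Tuple[str, float]]:
    """Compare the query with each annotation and rank entries, best first."""
    results = [(entry_id, similarity(query, labels))
               for entry_id, labels in annotations.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)


def shared_bigrams(first: Labels, second: Labels) -> float:
    """Stand-in similarity (an assumption): number of distinct label pairs
    that appear in both sequences."""
    def grams(seq: Labels) -> Set[Tuple[str, ...]]:
        return {tuple(seq[i:i + 2]) for i in range(len(seq) - 1)}
    return float(len(grams(first) & grams(second)))


if __name__ == "__main__":
    # Letters are used as labels here purely for illustration.
    annotations = {"entry1": list("kat"), "entry2": list("dog")}
    ranked = search(list("kats"), annotations, shared_bigrams)
    print(ranked[0])  # the entry whose annotation best matches the query
```

Passing the similarity function as a parameter keeps the retrieval loop independent of which comparison method is plugged in.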

118. A computer readable medium storing processor implementable process steps for carrying out a comparison method, the process steps comprising:
- a step of receiving first and second sequences of labels;
  
  a step of identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first obtaining step of obtaining the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a second obtaining step of obtaining the number of times each of said different first sub-sequences occurs within said second sequence of labels; and
  
  a step of computing a similarity score representative of the similarity between the first and second sequences of labels using the numbers obtained from said first and second obtaining steps;
  
  wherein the process steps further comprise a third obtaining step of obtaining the total number of sub-sequences of labels in said second sequence; and
  
  in that said computing step comprises;
  
  a first computing step of computing a measure of the probability of each of said first sub-sequences occurring in said second sequence of labels using the numbers obtained from said second obtaining step and the number obtained from said third obtaining step; and
  
  a second computing step of computing said similarity score by taking products of said computed probability measures in dependence upon said numbers obtained from said first obtaining step.

120. Processor implementable instructions for carrying out a comparison method, the process steps comprising:
- a step of receiving first and second sequences of labels;
  
  a step of identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first determining step of determining and outputting the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a step of defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determining step of determining and outputting the number of times each of said different first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  a step of computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining step with the numbers output from said second determining step;
  
  wherein said second determining step comprises;
  
  a step of comparing a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels, to provide a set of sub-sequence similarity measures; and
  
  counting the number of times the current first sub-sequence of labels occurs within the second sequence of labels in dependence upon the set of sub-sequence similarity measures provided by the comparing step for the current first sub-sequence of labels.
- View Dependent Claims (122)
- - 122. Processor implementable instructions for carrying out a method of searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of labels, the process steps comprising:
    - a step of receiving an input query comprising a sequence of labels;
      
      the process steps according to claim 120 or 121 for comparing the query sequence of labels with the labels of each annotation to provide a set of comparison results; and
      
      a step of identifying said information to be retrieved from said database using said comparison results.

121. Processor implementable instructions for carrying out a comparison method, comprising:
- a step of receiving first and second sequences of labels;
  
  a step of identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  a first obtaining step of obtaining the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  a second obtaining step of obtaining the number of times each of said different first sub-sequences occurs within said second sequence of labels; and
  
  a step of computing a similarity score representative of the similarity between the first and second sequences of labels using the numbers obtained from said first and second obtaining steps;
  
  wherein the process steps further comprise a third obtaining step of obtaining the total number of sub-sequences of labels in said second sequence; and
  
  in that said computing step comprises;
  
  a first computing step of computing a measure of the probability of each of said first sub-sequences occurring in said second sequence of labels using the numbers obtained from said second obtaining step and the number obtained from said third obtaining step; and
  
  a second computing step of computing said similarity score by taking products of said computed probability measures in dependence upon said numbers obtained from said first obtaining step.

123. A comparison apparatus comprising:
- means for receiving first and second sequences of labels;
  
  means for identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  first determining means for determining and outputting the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  means for defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  second determining means for determining and outputting the number of times each of said different first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  means for computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining means with the numbers output from said second determining means;
  
  wherein said second determining means comprises;
  
  means for comparing a current first sub-sequence of labels with each second sub-sequence of labels using predetermined data including confusion information which defines confusability between different labels, to provide a set of intermediate similarity measures; and
  
  means for counting the number of times the current first sub-sequence of labels occurs within the second sequence of labels in dependence upon the set of intermediate similarity measures provided by said comparing means for the current first sub-sequence of labels.
- View Dependent Claims (124, 125)
- - 124. An apparatus for searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of labels, the apparatus comprising:
    - means for receiving an input query comprising a sequence of labels;
      
      an apparatus according to claim 123 for comparing the query sequence of labels with the labels of each annotation to provide a set of comparison results; and
      
      means for identifying said information to be retrieved from said database using said comparison results.
  - 125. An apparatus for searching a database comprising a plurality of information entries to identify information to be retrieved therefrom, each of said plurality of information entries having an associated annotation comprising a sequence of speech labels, the apparatus comprising:
    - means for receiving an input query comprising a sequence of speech labels;
      
      an apparatus according to claim 123 for comparing said query sequence of speech labels with the speech labels of each annotation to provide a set of comparison results; and
      
      means for identifying said information to be retrieved from said database using said comparison results;
      
      wherein said apparatus according to claim 123 has a plurality of different comparison modes of operation and in that the apparatus further comprises;
      
      means for determining (i) if the query sequence of speech labels was generated from an audio signal or from text; and
      
      (ii) if the sequence of speech labels of a current annotation was generated from an audio signal or from text, and for outputting a determination result; and
      
      means for selecting, for the current annotation, the mode of operation of said apparatus according to claim 123 in dependence upon said determination result.

126. A comparison apparatus comprising:
- means for receiving first and second sequences of labels;
  
  means for identifying a plurality of different first sub-sequences of labels within said first sequence of labels;
  
  first obtaining means for obtaining the number of times each of said different first sub-sequences occurs within said first sequence of labels;
  
  second obtaining means for obtaining the number of times each of said different first sub-sequences occurs within said second sequence of labels; and
  
  means for computing a similarity score representative of the similarity between the first and second sequences of labels using the numbers obtained from said first and second obtaining means;
  
  wherein the apparatus further comprises third obtaining means for obtaining the total number of sub-sequences of labels in said second sequence; and
  
  in that said computing means comprises;
  
  first computing means for computing a measure of the probability of each of said first sub-sequences occurring in said second sequence of labels using the numbers obtained from said second obtaining means and the number obtained from said third obtaining means; and
  
  second computing means for computing said similarity score by taking products of said computed probability measures in dependence upon said numbers obtained from said first obtaining means.

127. A comparison apparatus comprising:
- a receiver operable to receive first and second sequences of labels;
  
  an identifier operable to identify a plurality of first sub-sequences of labels within said first sequence of labels;
  
  a first determiner operable to determine and to output the number of times each of said first sub-sequences occurs within said first sequence of labels;
  
  a definer operable to define a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determiner operable to determine and to output the number of times each of said first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  a similarity measure calculator operable to calculate a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determiner with the numbers output from said second determiner.

128. A comparison method comprising:
- receiving first and second sequences of labels;
  
  identifying a plurality of first sub-sequences of labels within said first sequence of labels;
  
  a first determining step of determining and outputting the number of times each of said first sub-sequences occurs within said first sequence of labels;
  
  defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determining step of determining and outputting the number of times each of said first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining step with the numbers output from said second determining step.
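
The steps of claim 128 can be sketched end to end. This is a minimal illustration under assumptions, not the claimed implementation: sub-sequences are taken to be fixed-size windows shifted by one label, occurrences are counted by exact (Boolean) matching as in claims 109 and 110, and the two sets of numbers are compared with the cosine measure mentioned in claim 90. The function names and the default window size `n=2` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(labels: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Successive fixed-size sub-sequences, each shifted by one label."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]


def cosine_similarity(first: Sequence[str], second: Sequence[str], n: int = 2) -> float:
    """Count each first sub-sequence in both sequences, then compare the counts."""
    first_counts = Counter(ngrams(first, n))
    second_counts = Counter(ngrams(second, n))
    # Only sub-sequences identified in the first sequence form the vectors.
    q = [first_counts[g] for g in first_counts]
    a = [second_counts[g] for g in first_counts]  # missing keys count as 0
    dot = sum(x * y for x, y in zip(q, a))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in a))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cosine_similarity(list("abab"), list("abba")))
```

Exact matching is the simplest choice; the independent claims that recite confusion information would replace the second `Counter` with counts derived from thresholded similarity measures.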

129. A computer readable medium storing processor implementable process steps for carrying out a comparison method, the process steps comprising:
- a step of receiving first and second sequences of labels;
  
  a step of identifying a plurality of first sub-sequences of labels within said first sequence of labels;
  
  a first determining step of determining and outputting the number of times each of said first sub-sequences occurs within said first sequence of labels;
  
  a step of defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  a second determining step of determining and outputting the number of times each of said first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  a step of computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining step with the numbers output from said second determining step.

130. Processor implementable instructions for carrying out a comparison method, the instructions comprising:
- instructions for receiving first and second sequences of labels;
  
  instructions for identifying a plurality of first sub-sequences of labels within said first sequence of labels;
  
  instructions for a first determining step of determining and outputting the number of times each of said first sub-sequences occurs within said first sequence of labels;
  
  instructions for defining a plurality of second sub-sequences of labels from said second sequence of labels;
  
  instructions for a second determining step of determining and outputting the number of times each of said first sub-sequences occurs within said second sequence of labels by comparing each first sub-sequence of labels with each second sub-sequence of labels; and
  
  instructions for computing a measure of the similarity between the first and second sequences of labels by comparing the numbers output from said first determining step with the numbers output from said second determining step.

Current Assignee: Canon Kabushiki Kaisha (Canon Inc.)
Original Assignee: Canon Kabushiki Kaisha (Canon Inc.)
Inventors: Garner, Philip Neil; Charlesworth, Jason Peter Andrew; Higuchi, Asako
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Lerner, Martin
Application Number: US09/695,078
Time in Patent Office: 1,637 Days
Field of Search: 704/238, 704/239, 704/240, 704/241, 704/242, 704/236, 704/234, 707/3, 707/6
US Class Current: 704/236
CPC Class Codes:
G06F 16/632   Query formulation
G06F 16/685   using automatically derived...
G10L 15/26   Speech to text systems G10L...
Y10S 707/99936   Pattern matching access
