Word clustering for input data

US 8,249,871 B2
Filed: 11/18/2005
Issued: 08/21/2012
Est. Priority Date: 11/18/2005
Status: Expired due to Fees

Abstract

A clustering tool to generate word clusters. In embodiments described, the clustering tool includes a clustering component that generates word clusters for words or word combinations in input data. In illustrated embodiments, the word clusters are used to modify or update a grammar for a closed vocabulary speech recognition application.

20 Claims

1. A linguistic tool assembly including computer executable instructions on a computer storage media and the computer executable instructions being executed by a processing unit to implement one or more components of the linguistic tool comprising:
- a clustering component which receives input data indicative of a plurality of input utterances and generates word clusters indicative of at least two or more words co-occurring in the utterances in the input data wherein the clustering component includes a word occurrence or co-occurrence vector generator which processes the input data to generate at least one word occurrence or co-occurrence vector v = (v₁, v₂) for one or more words in the input data where v₁ corresponds to a first input utterance in the input data and v₂ corresponds to a second input utterance in the input data and the v₁ is either “1” or “0” depending if the one or more words occur in the first input utterance or if the one or more words do not occur in the first input utterance and the v₂ is either “1” or “0” depending if the one or more words occur in the second input utterance or if the one or more words do not occur in the second input utterance and the clustering component including a vector dot product component which receives at least two word occurrence or co-occurrence vectors and calculates a vector dot product between the at least two word occurrence or co-occurrence vectors to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions, and to provide a frequency of the co-occurrence of the words of the at least two word occurrence or co-occurrence vectors and the clustering component processing the output of the vector dot product component to output the word clusters for the co-occurrence of the at least two or more words above a threshold frequency in the plurality of input utterances, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors; and
  
  a grammar generator component configured to receive the output word clusters from the clustering component and generate grammar elements for the output word clusters above the threshold frequency and configured to add the grammar elements for the word clusters to a closed set of grammar elements stored on the computer storage media and used by a speech recognition component for transcribing speech.
- Dependent claims (2, 3, 4, 5, 10, 19):
  - 2. The linguistic tool assembly of claim 1 and further comprising a merge component to merge a first word cluster and a second word cluster to form a merged word cluster including both the first and second word clusters based upon at least one of an acoustic similarity of one or more words of the first and second word clusters or n-Best recognition results for the one or more words of the first and second word clusters.
  - 3. The linguistic tool assembly of claim 1 and further comprising a pruning component configured to prune the word clusters to reduce the word clusters based upon at least one of a compactness setting or a data store of trivial words.
  - 4. The linguistic tool assembly of claim 1 and further including a reporting component which labels the word clusters utilizing word position information.
  - 5. The linguistic tool assembly of claim 1 wherein the word occurrence or co-occurrence vector generator receives at least two word occurrence or co-occurrence vectors and generates a word co-occurrence vector for the at least two word occurrence or co-occurrence vectors using a bit-wise AND function.
  - 10. The linguistic tool assembly of claim 1 and comprising a merge component to process the word clusters and generate a merge likelihood matrix based upon alternate recognition results.
  - 19. The linguistic tool assembly of claim 1 and comprising:
    - a pruning component configured to prune the word clusters using a compactness threshold based upon separation between the two or more words of the word clusters to eliminate one or more of the word clusters where the co-occurring words are not in sufficient proximity.
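
The clustering step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the sample utterances, and the choice of `>=` for the threshold comparison are all assumptions made for illustration. Each word gets a 0/1 vector with one entry per utterance, the dot product of two such vectors counts the utterances in which both words appear, and only pairs above the diagonal of the symmetric matrix are visited.

```python
from itertools import combinations


def occurrence_vectors(utterances):
    """Map each word to a 0/1 vector with one entry per utterance."""
    tokenized = [set(u.lower().split()) for u in utterances]
    vocab = sorted(set().union(*tokenized))
    return {w: [1 if w in toks else 0 for toks in tokenized] for w in vocab}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pair_clusters(utterances, threshold):
    """Return word pairs whose co-occurrence frequency reaches the threshold.

    combinations() yields each unordered pair exactly once, which corresponds
    to walking only the upper triangle of the symmetric co-occurrence matrix
    and never touching the diagonal (a word paired with itself).
    """
    vectors = occurrence_vectors(utterances)
    clusters = {}
    for w1, w2 in combinations(list(vectors), 2):
        freq = dot(vectors[w1], vectors[w2])
        if freq >= threshold:  # assumption: "at least" the threshold
            clusters[(w1, w2)] = freq
    return clusters


# Hypothetical utterances, invented for this example.
utterances = [
    "company store hours",
    "company store location",
    "store hours today",
]
clusters = pair_clusters(utterances, threshold=2)
print(clusters)  # {('company', 'store'): 2, ('hours', 'store'): 2}
```

Visiting only one triangle halves the work, since entry (i, j) equals entry (j, i), and skipping the diagonal avoids reporting a word as co-occurring with itself.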

6. A computer-implemented method comprising the steps of:
- recognizing, by a computer, words in speech input utterances;
  
  generating, by the computer, word occurrence or co-occurrence vectors for the words in the speech input utterances;
  
  computing, by the computer, a vector dot product between at least one first word occurrence vector or co-occurrence vector and at least one second word occurrence or co-occurrence vector to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions, and to identify word clusters being indicative of the words that co-occur in clusters in the speech input utterances and provide a co-occurrence frequency of the word clusters, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors;
  
  identifying, by the computer, one or more word clusters in the input utterances having at least a threshold word co-occurrence frequency using the co-occurrence frequency provided by the computation of the vector dot product;
  
  receiving user input selections to input one or more word clusters having the threshold word co-occurrence frequency into a grammar or lexicon for a speech recognition component; and
  
  generating, by the computer, one or more grammar inputs corresponding to the one or more word clusters selected for input to the grammar or lexicon.
- Dependent claims (7, 8, 9, 11, 12):
  - 7. The method of claim 6 and further comprising the steps of:
    - generating a word co-occurrence vector for the one or more word clusters having the at least threshold word co-occurrence frequency; and
      
      computing a vector dot product between the word co-occurrence vector and another word occurrence vector or co-occurrence vector to identify additional word clusters having at least the threshold word co-occurrence frequency.
  - 8. The method of claim 7 and comprising the step of:
    - computing a pi-product to generate the word co-occurrence vectors for the word clusters.
  - 9. The method of claim 6 and further comprising the steps of:
    - logging unrecognized speech data from a closed vocabulary speech recognition component; and
      
      recognizing the logged unrecognized speech data using a free-form speech recognition system to recognize the words in the speech input utterances.
  - 11. The method of claim 7 and comprising:
    - repeating the steps of computing the vector dot product for the word co-occurrence vector and the other word occurrence or co-occurrence vector to identify word co-occurrence between the word co-occurrence vector and the other word occurrence or co-occurrence vector; and
      
      generating an additional word co-occurrence vector for the threshold word co-occurrence frequency between the word co-occurrence vector and the other word occurrence or co-occurrence vector.
  - 12. The method of claim 7 wherein the step of generating the word co-occurrence vector uses a bit-wise AND function.
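
Claims 7, 8, 11 and 12 describe building a co-occurrence vector for a cluster and reusing it to find larger clusters. The sketch below is one possible reading, not the method as specified: it assumes the occurrence vectors are packed into Python integers (one bit per utterance), and every name in it is hypothetical. For 0/1 vectors the element-wise product of two entries equals their bit-wise AND, and the dot product equals the number of set bits in the AND result.

```python
def to_bits(vector):
    """Pack a 0/1 occurrence vector into an int; bit i stands for utterance i."""
    bits = 0
    for i, v in enumerate(vector):
        if v:
            bits |= 1 << i
    return bits


def cluster_vector(*member_bits):
    """Co-occurrence vector of a cluster: bit-wise AND of its members' vectors."""
    result = member_bits[0]
    for b in member_bits[1:]:
        result &= b
    return result


def frequency(bits):
    """Number of utterances in which every cluster member occurs."""
    return bin(bits).count("1")


def extend_clusters(cluster_bits, word_bits, threshold):
    """Try to grow each cluster by one word, keeping results at or above threshold."""
    extended = {}
    for cluster, cbits in cluster_bits.items():
        for word, wbits in word_bits.items():
            if word in cluster:
                continue
            combined = cbits & wbits
            if frequency(combined) >= threshold:
                extended[tuple(sorted(cluster + (word,)))] = combined
    return extended


# Hypothetical occurrence vectors over four utterances.
word_bits = {
    "company": to_bits([1, 1, 1, 0]),
    "store": to_bits([1, 1, 1, 1]),
    "hours": to_bits([1, 1, 0, 1]),
}
pair_bits = {
    ("company", "store"): cluster_vector(word_bits["company"], word_bits["store"])
}
triples = extend_clusters(pair_bits, word_bits, threshold=2)
print(triples)  # {('company', 'hours', 'store'): 3}
```

Calling `extend_clusters` again on its own output would repeat the step for four-word clusters, and so on until no candidate reaches the threshold.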

13. A computer-implemented method comprising the steps of:
- recognizing, by a computer, words in a plurality of speech input utterances using a speech recognition component stored on the computer storage media to recognize the words in the plurality of speech input utterances;
  
  generating, by the computer, word occurrence or co-occurrence vectors for the words in the plurality of speech input utterances;
  
  computing, by the computer, a vector dot product between at least one first word occurrence vector or co-occurrence vector and at least one second word occurrence or co-occurrence vector to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions;
  
  processing, by the computer, the results of the vector dot product computation for the words of the at least one first word occurrence vector or co-occurrence vector and the at least one second word occurrence or co-occurrence vector having at least a threshold co-occurrence frequency to identify word clusters including two or more words co-occurring in the input utterances, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors;
  
  pruning, by the computer, the word clusters using a compactness threshold based upon separation between the two or more co-occurring words in the input utterances to eliminate one or more of the word clusters where the co-occurring words are not in sufficient proximity in the input utterances;
  
  identifying, by the computer, two or more related word clusters based upon a similarity metric;
  
  merging, by the computer, the two or more related word clusters in the plurality of speech input utterances to form a merged word cluster for the two or more related word clusters; and
  
  reporting, by the computer, the word clusters in the plurality of speech input utterances.
- Dependent claims (14, 15, 16, 17, 18, 20):
  - 14. The method of claim 13 and comprising:
    - using the reported word clusters in the plurality of speech input utterances to generate or update a grammar or lexicon for the speech recognition component.
  - 15. The method of claim 13 and comprising the steps of:
    - recognizing the speech input utterances using a closed vocabulary speech recognition component;
      
      logging unrecognized speech input utterances and recognizing the logged speech input utterances using a free-form speech recognition system; and
      
      generating the word occurrence or co-occurrence vectors for the words in the input utterances of the logged unrecognized speech input utterances.
  - 16. The method of claim 13 wherein the step of identifying the two or more related word clusters comprises:
    - processing alternate recognition results from the speech recognition component; and
      
      generating a merge likelihood matrix using the alternate recognition results.
  - 17. The method of claim 13 wherein the step of identifying the two or more related word clusters comprises:
    - determining an acoustic similarity between one or more words of the word clusters; and
      
      merging the acoustically similar word clusters to form the merged word cluster.
  - 18. The method of claim 13 and comprising the step of:
    - processing the plurality of input utterances for the word clusters; and
      
      using order and separation of words of the word clusters in the plurality of speech input utterances to report the word clusters using wildcards to illustrate the separation of the words in the word clusters.
  - 20. The method of claim 14 and comprising the step of:
    - receiving user input selections of the reported word clusters to include in the grammar or lexicon; and
      
      generating an input to the grammar or lexicon for the user input selections of the reported word clusters.
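
The pruning step of claim 13 and the wildcard reporting of claim 18 can be sketched as follows. The claims do not define how compactness is measured, so this example assumes a simple rule: a cluster is kept only if, in every utterance containing all of its words, no more than `max_gap` other words separate neighbouring cluster words. The function names, the `*` wildcard symbol, and the sample utterances are likewise assumptions.

```python
def placements(cluster, utterances):
    """For each utterance containing every cluster word, list (position, word) in order.

    Only the first occurrence of each word in an utterance is considered.
    """
    found = []
    for u in utterances:
        tokens = u.lower().split()
        if all(w in tokens for w in cluster):
            found.append(sorted((tokens.index(w), w) for w in cluster))
    return found


def is_compact(cluster, utterances, max_gap):
    """True if neighbouring cluster words are never more than max_gap words apart."""
    located = placements(cluster, utterances)
    if not located:
        return False
    for placed in located:
        for (i, _), (j, _) in zip(placed, placed[1:]):
            if j - i - 1 > max_gap:
                return False
    return True


def wildcard_label(placed):
    """Render one placement, using '*' for each intervening non-cluster word."""
    parts = [placed[0][1]]
    for (i, _), (j, word) in zip(placed, placed[1:]):
        parts.extend(["*"] * (j - i - 1))
        parts.append(word)
    return " ".join(parts)


# Hypothetical utterances, invented for this example.
utterances = [
    "company store hours",
    "the company online store",
    "hours for the main downtown store",
]
candidates = [("company", "store"), ("hours", "store")]
kept = [c for c in candidates if is_compact(c, utterances, max_gap=1)]
labels = [wildcard_label(p) for p in placements(("company", "store"), utterances)]
print(kept)    # [('company', 'store')]
print(labels)  # ['company store', 'company * store']
```

Here ("hours", "store") is pruned because four words separate the pair in the third utterance, while the surviving cluster is reported once per observed word order and spacing.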

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Mukerjee, Kunal
Primary Examiner(s)
Godbold, Douglas
Assistant Examiner(s)
Guerra-Erazo, Edgar X

Application Number
US11/283,149
Publication Number
US 20070118376A1
Time in Patent Office
2,468 Days
Field of Search
704/1, 704/9, 704/10, 704/251, 704/257, 704/250, 704/270.1, 704/270, 704/275, 704/4, 704/237, 704/245, 704/2, 704/235, 704/246, 707/102
US Class Current
704/245
CPC Class Codes
G10L 15/063   Training
G10L 15/183   using context dependencies,...
G10L 15/19   Grammatical context, e.g. d...
G10L 2015/0631   Creating reference template...
