Word clustering for input data

  • US 8,249,871 B2
  • Filed: 11/18/2005
  • Issued: 08/21/2012
  • Est. Priority Date: 11/18/2005
  • Status: Expired due to Fees
First Claim

1. A linguistic tool assembly including computer executable instructions on a computer storage media and the computer executable instructions being executed by a processing unit to implement one or more components of the linguistic tool comprising:

  • a clustering component which receives input data indicative of a plurality of input utterances and generates word clusters indicative of at least two or more words co-occurring in the utterances in the input data wherein the clustering component includes a word occurrence or co-occurrence vector generator which processes the input data to generate at least one word occurrence or co-occurrence vector v=(v1, v2) for one or more words in the input data where v1 corresponds to a first input utterance in the input data and v2 corresponds to a second input utterance in the input data and the v1 is either “1” or “0” depending if the one or more words occur in the first input utterance or if the one or more words do not occur in the first input utterance and the v2 is either “1” or “0” depending if the one or more words occur in the second input utterance or if the one or more words do not occur in the second input utterance and the clustering component including a vector dot product component which receives at least two word occurrence or co-occurrence vectors and calculates a vector dot product between the at least two word occurrence or co-occurrence vectors to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions, and to provide a frequency of the co-occurrence of the words of the at least two word occurrence or co-occurrence vectors and the clustering component processing the output of the vector dot product component to output the word clusters for the co-occurrence of the at least two or more words above a threshold frequency in the plurality of input utterances, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors; and

  • a grammar generator component configured to receive the output word clusters from the clustering component and generate grammar elements for the output word clusters above the threshold frequency and configured to add the grammar elements for the word clusters to a closed set of grammar elements stored on the computer storage media and used by a speech recognition component for transcribing speech.
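
The clustering step recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`occurrence_vectors`, `dot`, `word_clusters`), the whitespace tokenization, and the sample utterances are all assumptions made for illustration, and the grammar generator component is omitted. Each word gets a binary occurrence vector with one entry per utterance, the dot product of two vectors gives the pair's co-occurrence frequency, and only the upper triangular part of the symmetric matrix is visited, so the diagonal is never computed.

```python
from itertools import combinations


def occurrence_vectors(utterances):
    """Map each word to a binary vector with one entry per utterance.

    Assumption: utterances are plain strings tokenized on whitespace.
    """
    tokenized = [set(u.lower().split()) for u in utterances]
    vocab = sorted(set().union(*tokenized))
    return {w: [1 if w in toks else 0 for toks in tokenized] for w in vocab}


def dot(a, b):
    """Dot product of two equal-length binary vectors."""
    return sum(x * y for x, y in zip(a, b))


def word_clusters(utterances, threshold):
    """Return word pairs whose co-occurrence frequency exceeds the threshold."""
    vectors = occurrence_vectors(utterances)
    words = sorted(vectors)
    clusters = {}
    # combinations() yields each unordered pair exactly once, which
    # corresponds to the upper triangular part of the symmetric matrix.
    # A word is never paired with itself, so the diagonal is skipped.
    for w1, w2 in combinations(words, 2):
        freq = dot(vectors[w1], vectors[w2])
        if freq > threshold:
            clusters[(w1, w2)] = freq
    return clusters


# Hypothetical sample utterances, used only for illustration.
utterances = [
    "book a flight to boston",
    "book a flight to denver",
    "cancel my flight",
]
clusters = word_clusters(utterances, threshold=1)
```

With these sample utterances and a threshold of 1, the pair ("book", "flight") is kept with a frequency of 2, while ("cancel", "flight") co-occurs only once and is dropped.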
