Word clustering for input data
Abstract
A clustering tool to generate word clusters. In embodiments described, the clustering tool includes a clustering component that generates word clusters for words or word combinations in input data. In illustrated embodiments, the word clusters are used to modify or update a grammar for a closed vocabulary speech recognition application.
47 Citations
20 Claims
1. A linguistic tool assembly including computer executable instructions on a computer storage media and the computer executable instructions being executed by a processing unit to implement one or more components of the linguistic tool comprising:
- a clustering component which receives input data indicative of a plurality of input utterances and generates word clusters indicative of at least two or more words co-occurring in the utterances in the input data wherein the clustering component includes a word occurrence or co-occurrence vector generator which processes the input data to generate at least one word occurrence or co-occurrence vector v=(v1, v2) for one or more words in the input data where v1 corresponds to a first input utterance in the input data and v2 corresponds to a second input utterance in the input data and the v1 is either “1” or “0” depending if the one or more words occur in the first input utterance or if the one or more words do not occur in the first input utterance and the v2 is either “1” or “0” depending if the one or more words occur in the second input utterance or if the one or more words do not occur in the second input utterance and the clustering component including a vector dot product component which receives at least two word occurrence or co-occurrence vectors and calculates a vector dot product between the at least two word occurrence or co-occurrence vectors to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions, and to provide a frequency of the co-occurrence of the words of the at least two word occurrence or co-occurrence vectors and the clustering component processing the output of the vector dot product component to output the word clusters for the co-occurrence of the at least two or more words above a threshold frequency in the plurality of input utterances, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors; and
- a grammar generator component configured to receive the output word clusters from the clustering component and generate grammar elements for the output word clusters above the threshold frequency and configured to add the grammar elements for the word clusters to a closed set of grammar elements stored on the computer storage media and used a speech recognition component for transcribing speech.
Dependent claims: 2, 3, 4, 5, 10, 19
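The clustering step recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`occurrence_vectors`, `cooccurring_pairs`), the whitespace tokenization, the sample utterances, and the use of "greater than or equal to" for the threshold test are all assumptions made for illustration. Each word gets a binary vector with one entry per utterance, and the dot product of two such vectors counts the utterances in which both words occur. Only pairs with i < j are visited, which corresponds to the upper triangular part of the symmetric matrix and skips the diagonal.

```python
from itertools import combinations

def occurrence_vectors(utterances):
    """Map each word to a binary vector with one entry per utterance."""
    tokenized = [set(u.lower().split()) for u in utterances]
    vocab = sorted(set().union(*tokenized))
    return {w: [1 if w in toks else 0 for toks in tokenized] for w in vocab}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cooccurring_pairs(utterances, threshold):
    """Return {(word_i, word_j): frequency} for pairs meeting the threshold.

    combinations() yields each unordered pair once (i < j), so only the
    upper triangle of the co-occurrence matrix is computed and the
    diagonal (a word paired with itself) is never visited.
    """
    vectors = occurrence_vectors(utterances)
    pairs = {}
    for w1, w2 in combinations(list(vectors), 2):
        freq = dot(vectors[w1], vectors[w2])
        if freq >= threshold:  # assumption: "at least" the threshold
            pairs[(w1, w2)] = freq
    return pairs

# Hypothetical sample utterances, not taken from the patent.
utterances = [
    "book a flight to boston",
    "cancel my flight to denver",
    "book a hotel in boston",
]
pairs = cooccurring_pairs(utterances, threshold=2)
```

With these sample utterances, "flight" and "to" both occur in the first two utterances, so their dot product is 2 and the pair passes the threshold.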
6. A computer-implemented method comprising the steps of:
- recognizing, by a computer, words in speech input utterances;
- generating, by the computer, word occurrence or co-occurrence vectors for the words in the speech input utterances;
- computing, by the computer, a vector dot product between at least one first word occurrence vector or co-occurrence vector and at least one second word occurrence or co-occurrence vector to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions, and to identify word clusters being indicative of the words that co-occur in clusters in the speech input utterances and provide a co-occurrence frequency of the word clusters, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors;
- identifying the computer, one or more word clusters in the input utterances having at least a threshold word co-occurrence frequency using the co-occurrence frequency provided by the computation of the vector dot product;
- receiving user input selections to input one or more word clusters having the threshold word co-occurrence frequency into a grammar or lexicon for a speech recognition component; and
- generating, by the computer, one or more grammar inputs corresponding to the one or more word clusters selected for input to the grammar or lexicon.
Dependent claims: 7, 8, 9, 11, 12
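Claim 6 allows the dot product to be taken between occurrence vectors or co-occurrence vectors. One possible reading, offered here only as an assumption, is that a co-occurrence vector for a group of words has a 1 wherever all of those words occur in the same utterance. Under that reading, dotting a two-word co-occurrence vector with a third word's occurrence vector gives the frequency of the three-word cluster. The helper name `cooccurrence_vector` and the sample vectors below are hypothetical.

```python
def cooccurrence_vector(*vectors):
    """Element-wise AND of binary occurrence vectors (one entry per utterance)."""
    return [int(all(bits)) for bits in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical occurrence vectors over four utterances.
flight = [1, 1, 0, 1]
to = [1, 1, 0, 1]
boston = [1, 0, 1, 1]

# 1 where both "flight" and "to" occur in the same utterance.
flight_to = cooccurrence_vector(flight, to)

# Number of utterances containing "flight", "to", and "boston" together.
triple_frequency = dot(flight_to, boston)
```

Here `flight_to` is `[1, 1, 0, 1]` and `triple_frequency` is 2, because all three words appear together only in the first and fourth utterances.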
13. A computer-implemented method comprising the steps of:
- recognizing, by a computer, words in a plurality of speech input utterances using a speech recognition component stored on the computer storage media to recognize the words in the plurality of speech input utterances;
- generating, by the computer, word occurrence or co-occurrence vectors for the words in the plurality of speech input utterances;
- computing, by the computer, a vector dot product between at least one first word occurrence vector or co-occurrence vector and at least one second word occurrence or co-occurrence vector to generate a symmetrical word co-occurrence matrix having an upper triangular part, a lower triangular part, and a plurality of diagonal positions;
- processing, by the computer, the results of the vector dot product computation for the words of the at least one first word occurrence vector or co-occurrence vector and the at least one second word occurrence or co-occurrence vector having at least a threshold co-occurrence frequency to identify word clusters including two or more words co-occurring in the input utterances, wherein processing of the word co-occurrence matrix is restricted to the upper or lower triangular part of the matrix and the plurality of diagonal positions are not computed to exclude repeated word clusters stemming from recognition errors;
- pruning, by the computer, the word clusters using a compactness threshold based upon separation between the two or more co-occurring words in the input utterances to eliminate one or more of the word clusters where the co-occurring words are not in sufficient proximity in the input utterances;
- identifying, by the computer, two or more related word clusters based upon a similarity metric;
- merging, by the computer, the two or more related word clusters in the plurality of speech input utterances to form a merged word cluster for the two or more related word clusters; and
- reporting, by the computer, the word clusters in the plurality of speech input utterances.
Dependent claims: 14, 15, 16, 17, 18, 20
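The pruning and merging steps of claim 13 can be sketched as follows. The claim does not define the compactness measure or the similarity metric, so both choices here are assumptions: compactness is taken as the mean token distance between the two words (using each word's first position in an utterance), and similarity is taken as Jaccard overlap between the word sets of two clusters. All function names and sample data are hypothetical.

```python
def mean_separation(pair, utterances):
    """Average token distance between two words over utterances containing both."""
    w1, w2 = pair
    gaps = []
    for u in utterances:
        toks = u.lower().split()
        if w1 in toks and w2 in toks:
            # index() returns the first position of each word
            gaps.append(abs(toks.index(w1) - toks.index(w2)))
    return sum(gaps) / len(gaps) if gaps else float("inf")

def prune(pairs, utterances, max_separation):
    """Keep only pairs whose words are, on average, close enough together."""
    return [p for p in pairs if mean_separation(p, utterances) <= max_separation]

def jaccard(a, b):
    """Jaccard similarity of two non-empty word collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def merge_related(clusters, min_similarity):
    """Greedily merge each cluster into the first sufficiently similar one."""
    merged = []
    for cluster in clusters:
        words = set(cluster)
        for existing in merged:
            if jaccard(existing, words) >= min_similarity:
                existing |= words
                break
        else:
            merged.append(words)
    return [tuple(sorted(m)) for m in merged]

# Hypothetical sample data, not taken from the patent.
utterances = [
    "book a flight to boston",
    "cancel my flight to denver",
    "book a hotel in boston",
]
candidates = [("a", "book"), ("a", "boston"), ("book", "boston"), ("flight", "to")]
compact = prune(candidates, utterances, max_separation=1)
merged = merge_related([("a", "book"), ("book", "flight"), ("to", "boston")], 0.3)
```

In this example, "book" and "boston" are four tokens apart in both utterances where they co-occur, so that pair is pruned, while the adjacent pairs ("a", "book") and ("flight", "to") survive. A greedy single pass is the simplest merge strategy; its result can depend on cluster order, which a real tool would likely need to address.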
Specification