Method for building linguistic models from a corpus
Abstract
A method iteratively integrates clustering techniques with phrase acquisition techniques to build complex linguistic models from a corpus. A set of features is initialized from the corpus. Thereafter, the method determines, according to a predetermined cost function, whether to process the features by phrase clustering processing or by phrase grammar learning processing. If phrase clustering processing is performed, the method then processes an interstitial set of features, comprising both the old features and the newly established clusters, by phrase grammar learning processing. The features obtained as output of phrase grammar learning are re-indexed as the set of features for a subsequent iteration. The method may be repeated over several iterations to build a hierarchical linguistic model.
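The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `build_model`, `should_cluster`, `cluster`, and `acquire_phrases` are hypothetical, the cost-function test is passed in as an opaque callable because the record does not give its formula, and a fixed iteration count stands in for whatever stopping rule the method actually uses.

```python
from typing import Callable, List

Sentence = List[str]
Corpus = List[Sentence]


def build_model(
    corpus: Corpus,
    should_cluster: Callable[[Corpus], bool],   # hypothetical cost-function test
    cluster: Callable[[Corpus], Corpus],         # hypothetical clustering step
    acquire_phrases: Callable[[Corpus], Corpus], # hypothetical phrase grammar learning step
    iterations: int = 3,                         # assumed stopping rule
) -> Corpus:
    # Initialize the feature set from the corpus (copied so the input is not mutated).
    features = [list(sentence) for sentence in corpus]
    for _ in range(iterations):
        # The cost function decides whether clustering runs in this iteration.
        if should_cluster(features):
            # Interstitial set: old features together with newly established clusters.
            features = cluster(features)
        # Phrase grammar learning runs in either case; its output is
        # re-indexed as the feature set for the next iteration.
        features = acquire_phrases(features)
    return features
```

Passing the three steps in as callables keeps the sketch agnostic about how clustering, phrase acquisition, and the cost function are actually defined.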
28 Claims
1. A method of building a linguistic model from a corpus of features, comprising:
determining whether to perform clustering based upon a cost function
2. A method of building a linguistic model from a corpus of features, comprising:
determining whether to perform clustering based upon a predetermined cost function;
if so, clustering upon the features into classes;
thereafter, performing phrase acquisition upon the features and the classes; and
storing the acquired phrases in a computer readable medium;
wherein the clustering step comprises:
identifying context symbols within the features,
for each non-context symbol in the features, counting occurrences of predetermined relationships between the non-context symbol and a context symbol,
generating frequency vectors for each non-context symbol based upon the counted occurrences, and
clustering non-context symbols based upon the frequency vectors.
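The clustering step recited in claim 2 can be illustrated with a small sketch. Several choices here are assumptions and are not taken from the claim: the context symbols are supplied by the caller, the "predetermined relationships" are taken to be adjacency at offsets -1 and +1, similarity is cosine similarity, and the grouping is a simple greedy threshold pass. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Relation = Tuple[str, int]  # (context symbol, relative offset); an assumed encoding


def frequency_vectors(
    sentences: Iterable[Sequence[str]],
    context: Set[str],
    offsets: Sequence[int] = (-1, 1),  # assumed relationships: immediate neighbours
) -> Dict[str, Counter]:
    """Count, for each non-context symbol, how often each context symbol
    appears at each of the given offsets."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok in context:
                continue
            for off in offsets:
                j = i + off
                if 0 <= j < len(sent) and sent[j] in context:
                    counts[tok][(sent[j], off)] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    # Counter lookups return 0 for missing keys, so this is a sparse dot product.
    dot = sum(v * b[k] for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_symbols(vectors: Dict[str, Counter], threshold: float = 0.8) -> List[List[str]]:
    """Greedy grouping (an assumed stand-in): a symbol joins the first cluster
    whose first member is at least `threshold` similar, else starts a new one."""
    clusters: List[List[str]] = []
    for sym in sorted(vectors):
        for members in clusters:
            if cosine(vectors[sym], vectors[members[0]]) >= threshold:
                members.append(sym)
                break
        else:
            clusters.append([sym])
    return clusters
```

With "to" and "from" as context symbols, symbols that appear in the same slots relative to them (for example, city names after "to" and before "from") receive similar vectors and tend to land in the same cluster.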
9. A method of building a linguistic model from a set of features, the features initially constituting text from an input corpus, comprising:
applying a cost function to the set of features, the cost function defined by
10. A method of building a linguistic model from a set of features, the features initially constituting text from an input corpus, comprising:
applying a cost function to the set of features,
based upon the results of the cost function, performing clustering on the set of features, the clustering resulting in classes that are included in the set of features;
performing phrase acquisition on the set of features; and
storing the acquired phrases in a computer readable medium;
wherein the clustering step comprises:
identifying context tokens within the set of features,
for each non-context token in the set of features, counting occurrences of predetermined relationships between the non-context token and a context token,
generating frequency vectors for each non-context token based upon the counted occurrences, and
clustering non-context tokens based upon the frequency vectors.
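Claim 10 states that the classes produced by clustering are included in the set of features. One way to picture this, offered only as an assumption, is to rewrite each clustered token as a class label so that later phrase acquisition sees the classes alongside the remaining tokens. The function name `apply_classes`, the `<C0>` label format, and the rule that singleton clusters are left unlabelled are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def apply_classes(
    sentences: Sequence[Sequence[str]],
    clusters: Sequence[Sequence[str]],
    min_size: int = 2,  # assumed: a single-member cluster is not treated as a class
) -> Tuple[List[List[str]], Dict[str, str]]:
    """Replace every token that belongs to a multi-member cluster with a
    class label; return the rewritten sentences and the token-to-label map."""
    label_of: Dict[str, str] = {}
    for idx, members in enumerate(clusters):
        if len(members) >= min_size:
            for tok in members:
                label_of[tok] = f"<C{idx}>"  # hypothetical label format
    rewritten = [[label_of.get(tok, tok) for tok in sent] for sent in sentences]
    return rewritten, label_of
```

Returning the token-to-label map alongside the rewritten sentences keeps the original tokens recoverable, which matters if the old features are meant to persist with the new classes.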
17. A method of building a linguistic model from a corpus, comprising:
initializing a set of features based upon the corpus;
iteratively:
determining, based upon a predetermined cost function, whether to perform clustering upon the set of features and, if so, performing clustering on the set of features to identify classes therefrom,
performing phrase acquisition on the set of features and any classes that exist to obtain phrases therefrom,
re-initializing the set of features for a subsequent iteration to include the set of features, the classes and the phrases; and
storing the set of features, the classes, and the phrases in a computer readable medium;
wherein the cost function is;
18. A method of building a linguistic model from a corpus, comprising:
initializing a set of features based upon the corpus;
iteratively:
determining, based upon a predetermined cost function, whether to perform clustering upon the set of features and, if so, performing clustering on the set of features to identify classes therefrom,
performing phrase acquisition on the set of features and any classes that exist to obtain phrases therefrom,
re-initializing the set of features for a subsequent iteration to include the set of features, the classes and the phrases; and
storing the set of features, the classes, and the phrases in a computer readable medium;
wherein the clustering step comprises:
identifying context tokens within the set of features,
for each non-context token in the set of features, counting occurrences of predetermined relationships between the non-context token and a context token,
generating frequency vectors for each non-context token based upon the counted occurrences, and
clustering non-context tokens based upon the frequency vectors.
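The phrase acquisition and re-initialization steps of claims 17 and 18 can be pictured with a deliberately simple stand-in. The claims do not say how phrases are selected, so this sketch swaps in frequent-adjacent-pair merging: any pair of neighbouring features seen at least `min_count` times is joined into one phrase token. The function name, the threshold, and the underscore joiner are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def acquire_phrases(
    sentences: Sequence[Sequence[str]],
    min_count: int = 2,  # assumed frequency threshold
    joiner: str = "_",   # assumed way of writing a phrase as one feature
) -> List[List[str]]:
    """Merge frequent adjacent pairs into single phrase features."""
    pair_counts: Counter = Counter()
    for sent in sentences:
        pair_counts.update(zip(sent, sent[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged: List[List[str]] = []
    for sent in sentences:
        out: List[str] = []
        i = 0
        while i < len(sent):
            # Left-to-right greedy merge; a merged pair consumes both tokens.
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in frequent:
                out.append(sent[i] + joiner + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged
```

Feeding the returned sentences back in as the next iteration's features lets merged phrases and class labels combine into longer units on later passes, which is one reading of the re-initialization step.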
25. A machine-readable medium having stored thereon executable instructions that when executed by a processor, cause the processor to build a linguistic model from a corpus of features by:
determining whether to perform clustering based upon a cost function
26. A machine-readable medium having stored thereon executable instructions that when executed by a processor, cause the processor to build a linguistic model from a corpus of features, by:
determining whether to perform clustering based upon a predetermined cost function;
if so, clustering upon the features into classes;
thereafter, performing phrase acquisition upon the features and the classes; and
storing the acquired phrases in a computer readable medium;
wherein the clustering step comprises:
identifying context symbols within the features,
for each non-context symbol in the features, counting occurrences of predetermined relationships between the non-context symbol and a context symbol,
generating frequency vectors for each non-context symbol based upon the counted occurrences, and
clustering non-context symbols based upon the frequency vectors.
27. A machine-readable medium having stored thereon a linguistic model generated from a corpus of features according to the process of:
determining whether to perform clustering based upon a cost function
28. A machine-readable medium having stored thereon a linguistic model generated from a corpus of features according to the process of:
determining whether to perform clustering based upon a predetermined cost function;
if so, clustering upon the features into classes;
thereafter, performing phrase acquisition upon the features and the classes; and
storing the acquired phrases in a computer readable medium;
wherein the clustering step comprises:
identifying context symbols within the features,
for each non-context symbol in the features, counting occurrences of predetermined relationships between the non-context symbol and a context symbol,
generating frequency vectors for each non-context symbol based upon the counted occurrences, and
clustering non-context symbols based upon the frequency vectors.
Specification