Behavior-driven multilingual stemming
First Claim
1. A computer-implemented method of stemming terms using behavioral data, comprising:
under control of one or more computer systems configured with executable instructions,
capturing behavioral data for a plurality of users with respect to a plurality of terms;
obtaining a rule set for stemming in a language, the language including the plurality of terms;
obtaining a word to be stemmed;
in response to determining that only one rule of the rule set is to be used to stem the obtained word, stemming the obtained word using only one rule; or
in response to determining that more than one rule of the rule set is to be used to stem the obtained word;
determining a set of forms of the obtained word;
determining an output set of forms corresponding to the set of forms, wherein each rule of the more than one rule corresponds to one of the forms in the output set of forms,
determining, based at least in part upon the captured behavioral data, a relative measurement value of each form in the output set of forms, wherein each of the relative measurement values corresponds to an indication of a frequency of use of a corresponding one of the forms in the output set of forms, and
selecting, based at least in part upon the relative measurement values, at least one form in the output set of forms to be used as a stem for the obtained word.
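The two branches of the claimed method (one applicable rule versus several) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the suffix-rewrite rule format, the `RULES` and `USAGE_COUNTS` sample data, and the fallback to the first candidate when no behavioral data exists.

```python
from typing import Dict, List, Tuple

# Hypothetical rule format: (suffix to strip, replacement text).
Rule = Tuple[str, str]

def applicable_rules(word: str, rules: List[Rule]) -> List[Rule]:
    """Return every rule whose suffix matches the end of the word."""
    return [r for r in rules if r[0] and word.endswith(r[0])]

def apply_rule(word: str, rule: Rule) -> str:
    """Strip the rule's suffix and append its replacement."""
    suffix, replacement = rule
    return word[: len(word) - len(suffix)] + replacement

def stem(word: str, rules: List[Rule], usage_counts: Dict[str, int]) -> str:
    matches = applicable_rules(word, rules)
    if not matches:
        return word  # assumption: leave the word unchanged
    if len(matches) == 1:
        # Only one rule applies: stem with that rule alone.
        return apply_rule(word, matches[0])
    # Several rules apply: each rule yields one candidate form.
    candidates = [apply_rule(word, r) for r in matches]
    total = sum(usage_counts.get(c, 0) for c in candidates)
    if total == 0:
        return candidates[0]  # assumption: no data, take the first rule
    # Relative measurement: each candidate's share of observed usage.
    relative = {c: usage_counts.get(c, 0) / total for c in candidates}
    return max(candidates, key=lambda c: relative[c])

# Illustrative data only.
RULES: List[Rule] = [("ies", "y"), ("es", ""), ("s", "")]
USAGE_COUNTS: Dict[str, int] = {"pony": 900, "ponie": 3, "poni": 1}
```

With this sample data, "ponies" matches all three rules and the candidate with the largest share of observed usage is chosen, while "cats" matches only the plain "s" rule and is stemmed directly.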
Abstract
User behavior data can be used with language-specific rule sets to generate stemming databases useful for such tasks as indexing and search query processing. The terms contained in user queries, as well as user behavior with respect to those queries or results returned for those queries, can be analyzed to determine a relative measure (e.g., relative frequency) of various forms of those terms. When generating a stemming database, language-specific rule sets can be used to determine appropriate stemming rules, and where more than one potential rule is identified the user behavior data can be used to select what is likely the appropriate rule, at least for the respective environment. Whitelists or other such components can be used to handle specific or irregular forms that do not follow the general rules or otherwise are exceptions that might not otherwise be processed correctly.
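As a rough sketch of how captured query behavior might be turned into a relative measure, and how a whitelist could override rule-based stemming for irregular forms, consider the following. The log record shape, the `click_weight` parameter, and the function names are hypothetical and are not taken from the patent.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical log record: (query term, whether the user acted on a result).
LogEntry = Tuple[str, bool]

def relative_frequencies(
    log: Iterable[LogEntry], click_weight: float = 2.0
) -> Dict[str, float]:
    """Weight each observed term, then normalize so the values sum to 1."""
    weights: Counter = Counter()
    for term, acted in log:
        # Assumption: a query followed by a user action counts for more.
        weights[term.lower()] += click_weight if acted else 1.0
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else {}

def stem_with_whitelist(
    word: str, whitelist: Dict[str, str], stemmer: Callable[[str], str]
) -> str:
    """Consult the whitelist first; fall back to the rule-based stemmer."""
    if word in whitelist:
        return whitelist[word]
    return stemmer(word)

# Illustrative data only.
LOG = [("shoes", True), ("shoes", False), ("shoe", False), ("Shoe", False)]
FREQ = relative_frequencies(LOG)
WHITELIST = {"mice": "mouse"}
```

Here `FREQ` gives "shoes" a larger share than "shoe" because one of its queries led to a user action, and `stem_with_whitelist("mice", WHITELIST, ...)` returns the whitelisted form without consulting the general rules.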
24 Claims
1. A computer-implemented method of stemming terms using behavioral data, comprising:
under control of one or more computer systems configured with executable instructions,
capturing behavioral data for a plurality of users with respect to a plurality of terms;
obtaining a rule set for stemming in a language, the language including the plurality of terms;
obtaining a word to be stemmed;
in response to determining that only one rule of the rule set is to be used to stem the obtained word, stemming the obtained word using only one rule; or
in response to determining that more than one rule of the rule set is to be used to stem the obtained word;
determining a set of forms of the obtained word;
determining an output set of forms corresponding to the set of forms, wherein each rule of the more than one rule corresponds to one of the forms in the output set of forms,
determining, based at least in part upon the captured behavioral data, a relative measurement value of each form in the output set of forms, wherein each of the relative measurement values corresponds to an indication of a frequency of use of a corresponding one of the forms in the output set of forms, and
selecting, based at least in part upon the relative measurement values, at least one form in the output set of forms to be used as a stem for the obtained word.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for stemming terms using behavioral data, comprising:
a processor; and
a memory device including instructions that, when executed by the processor, cause the system to;
capture behavioral data for a plurality of users with respect to a plurality of terms;
obtain a rule set for stemming in a language, the language including the plurality of terms;
obtain a word to be stemmed;
in response to determining that only one rule of the rule set is to be used to stem the obtained word, stemming the obtained word using only one rule; or
in response to determining that more than one rule of the rule set is to be used in stemming the obtained word;
determine a set of forms of the obtained word;
determine an output set of forms corresponding to the set of forms, wherein each rule of the more than one rule corresponds to one of the forms in the output set of forms,
determine, based at least in part upon the captured behavioral data, a relative measurement value of each form in the set of output forms, and
select, based at least in part upon the relative measurement values, at least one form in the output set of forms to be used as a stem for the obtained word.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A non-transitory computer-readable storage medium including instructions that, when executed by at least one processor, cause the at least one processor to:
capture behavioral data for a plurality of users with respect to a plurality of terms;
obtain a rule set for stemming in a language corresponding to the plurality of terms;
obtain a word to be stemmed;
in response to determining that only one rule of the rule set is to be used to stem the obtained word, stemming the obtained word using only one rule; or
in response to determining that more than one rule of the rule set is to be used in stemming the obtained word;
determine a set of forms of the obtained word;
determine an output set of forms corresponding to the set of forms, wherein each rule of the more than one rule corresponds to one of the forms in the output set of forms,
determine, based at least in part upon the captured behavioral data, a relative measurement value of each form in the set of output forms, and
select, based at least in part upon the relative measurement values, at least one form in the output set of forms to be used as a stem for the obtained word.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification