System and method to retrieving information with natural language queries
Abstract
A search machine finds and ranks documents in a database based on a set of rules that match characteristics of the database with a natural language query. The system includes a lexicon component which may parse the query and the database into words and word stems. Thereafter, the query and documents may be vectorized such that the elements of the vector correspond to a given word stem, and the value of the element in the vector corresponds to the number of occurrences of the word in the document. The vectorized query is then compared and evaluated against each of the vectorized documents of the database to obtain a ranked list of documents from the database. The user may evaluate the documents found and provide information back to the search machine in order to adjust, for example, the ranking produced by the search machine. In this way, the search machine can fine-tune its search and ranking technique to meet the user's specific criteria.
96 Citations
23 Claims
-
1. A search machine for retrieving information from documents of a database based on a query of a user, the search machine comprising:
-
a lexicon generator for deriving a lexicon database of word stems from the database and from the query; and
an evaluation component, the evaluation component including a document vectorizer for creating document representation vectors for respective documents of the database and a query representation vector for the query using the lexicon database, the document representation vectors containing data on the word stems located in the respective documents and the query representation vector containing data on the word stems located in the query, the evaluation component further including a vector rule base, and a vector evaluator, the vector evaluator comparing the document representation vectors to the query representation vector based on the vector rule base, wherein the evaluation component generates the information from the database based on the query; and
a fine-tuner, the fine-tuner modifying the vector rule base based on external feedback from the user, the feedback associated with the information generated by the evaluation component.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
(i) a word stem does not occur in the query and also does not occur in a database document;
(ii) a word stem does not occur in the query but does occur in the database document;
(iii) a word stem occurs less frequently in the query than in the database document;
(iv) a word stem occurs in the query and equally as often occurs in the database document;
(v) a word stem occurs more frequently in the query than in the database document; and
(vi) a word stem occurs in the query but does not occur in the database document.
-
-
7. The search machine of claim 1, wherein the lexicon database includes word stem frequency data related to a frequency of occurrence of the word stem in the documents of the database.
-
8. The search machine of claim 1, wherein the lexicon database includes a significance value corresponding to each word stem.
-
9. The search machine of claim 1, wherein the feedback comprises a user ranked list of documents.
-
10. The search machine of claim 1, wherein the vector rule base contains at least one condition that provides an output value based on whether the at least one condition is met by a document representation vector, and wherein the fine-tuner modifies the vector rule base by randomly changing the output value of the at least one condition.
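The random modification described in claim 10 can be illustrated with a short Python sketch. It is only one possible reading: the rule base is assumed to be a dict of output values, the `agreement` callable is a hypothetical stand-in for a score derived from user feedback, and the rule of keeping a change only when the score does not drop is an assumption that the claim does not state.

```python
import random

def fine_tune(rules, agreement, steps=200, scale=0.5, seed=0):
    """Randomly perturb rule output values, keeping changes that do not hurt.

    rules: dict mapping a condition name to its output value (assumed layout).
    agreement: hypothetical callable scoring how well a rule set matches the
               user's feedback (higher is better).
    """
    rng = random.Random(seed)
    best = dict(rules)
    best_score = agreement(best)
    for _ in range(steps):
        candidate = dict(best)
        name = rng.choice(sorted(candidate))           # pick one condition
        candidate[name] += rng.uniform(-scale, scale)  # random change of its output
        score = agreement(candidate)
        if score >= best_score:                        # assumed acceptance test
            best, best_score = candidate, score
    return best

# Toy stand-in for user feedback: prefer output values close to a target.
target = {"iv": 1.0, "vi": -1.0}

def toy_agreement(rules):
    return -sum((rules[k] - target[k]) ** 2 for k in target)

start = {"iv": 0.0, "vi": 0.0}
tuned = fine_tune(start, toy_agreement)
```

In a real system the agreement score would presumably compare the machine's ranking with the user ranked list of claim 9; the quadratic target used here only keeps the example self-contained.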
-
11. A method of retrieving documents from a database corresponding to a query document, the method including the steps of:
-
deriving a lexical database of word stems from each document of the database and the query document;
creating a document representation vector corresponding to each document of the database and a query representation vector for the query document, each document representation vector containing information about the word stems of the lexical database that are contained in the document to which the document representation vector corresponds, the query representation vector containing information about the word stems that are contained in the query document;
evaluating each document representation vector relative to the query representation vector with vector evaluation rules;
creating output reflecting the evaluation of the document representation vectors;
displaying the output;
receiving feedback that specifies preferences for the output; and
modifying the vector evaluation rules based on the feedback such that the output more closely reflects the preferences provided in the feedback.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
generating an n-dimensional representation vector with each element of the representation vector initialized to zero, where n is the number of word stems contained in the lexical database;
deriving the word stem for each word in the document;
retrieving the index of the word stem from the lexical database; and
incrementing the element stored at the index of the representation vector each time the index of a word stem is retrieved.
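The vectorization steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the lexical database is assumed to be a dict mapping each word stem to its index, and `stem_of` is a hypothetical placeholder stemmer that only lowercases the word.

```python
def vectorize(words, lexicon, stem_of=str.lower):
    """Build an n-dimensional count vector for one document.

    lexicon: dict mapping each word stem to its index (assumed layout).
    stem_of: hypothetical stemmer; here it only lowercases the word.
    """
    vector = [0] * len(lexicon)      # n elements, each initialized to zero
    for word in words:
        stem = stem_of(word)         # derive the word stem
        index = lexicon.get(stem)    # retrieve the index of the stem
        if index is not None:        # assumption: ignore stems not in the lexicon
            vector[index] += 1       # increment the element at that index
    return vector

lexicon = {"search": 0, "query": 1, "rank": 2}
doc_vector = vectorize(["Search", "query", "search", "other"], lexicon)
```

The same function could be applied to the query document, so that document and query vectors share one index space.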
-
-
13. The method of claim 11, wherein the step of deriving a lexical database of word stems from each document of the database and the query document includes the steps of:
-
parsing each document into a list of words;
deriving word stems from the words by matching words that have similar beginning letter sequences but different ending letter sequences; and
storing each of the word stems in the lexical database.
-
-
14. The method of claim 13, wherein the step of deriving word stems includes the steps of:
-
setting a minimal word length;
setting a similarity threshold;
determining which words have matching letters for the minimal word length, and have a difference in length of not more than the similarity threshold, and taking the shorter of the words as the word stem.
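One way to read the stem derivation of claim 14 is sketched below in Python. The function name, the default values for the minimal word length and similarity threshold, and the shortest-first processing order are all assumptions made for illustration.

```python
def stems_shorter_word(words, min_length=4, threshold=3):
    """Map each word to a stem, taking the shorter word of a similar pair."""
    stem_for = {w: w for w in words}
    ordered = sorted(set(words), key=lambda w: (len(w), w))  # shortest first
    for i, short in enumerate(ordered):
        if len(short) < min_length:
            continue                          # too short to act as a stem
        for longer in ordered[i + 1:]:
            same_start = longer[:min_length] == short[:min_length]
            close_length = len(longer) - len(short) <= threshold
            # Only assign a stem to words that do not have one yet.
            if same_start and close_length and stem_for[longer] == longer:
                stem_for[longer] = stem_for[short]
    return stem_for

stem_for = stems_shorter_word(["search", "searches", "searching", "seal"])
```

Here "searches" and "searching" share their first four letters with "search" and are at most three letters longer, so both map to "search", while "seal" keeps itself as its stem.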
-
-
15. The method of claim 13, wherein the step of deriving word stems includes the steps of:
-
setting a minimal word length;
setting a similarity threshold;
determining which words have matching letters for the minimal word length, and have a difference in length of not more than the similarity threshold, and taking the matching letters as the word stem.
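The variant in claim 15 takes the matching letters themselves as the stem. The Python sketch below interprets "matching letters" as the longest common prefix of two words, which is an assumption; the function name and default parameters are likewise illustrative.

```python
import os

def stem_from_common_prefix(a, b, min_length=4, threshold=3):
    """Return the shared leading letters of a and b as a stem, or None."""
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings and not only on file paths.
    prefix = os.path.commonprefix([a, b])
    if len(prefix) >= min_length and abs(len(a) - len(b)) <= threshold:
        return prefix
    return None

stem = stem_from_common_prefix("running", "runner")
no_stem = stem_from_common_prefix("search", "seal")
```

Under these assumptions "running" and "runner" yield the stem "runn", whereas "search" and "seal" share only three letters and yield no stem.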
-
-
16. The method of claim 13, wherein the step of deriving word stems includes the steps of:
-
setting a minimal word length;
setting a similarity threshold;
determining when words of equal length have matching letters for at least the minimal word length, and are non-equal for the last letters equal to or less than the similarity threshold, and taking the word without the non-equal part as the word stem.
-
-
17. The method of claim 11, wherein the step of evaluating each document representation vector relative to the query representation vector includes the steps of:
assigning a relevance value to each document representation vector based on whether conditions are met, the conditions including at least one of the following:
(i) a word stem does not occur in the query representation vector and also does not occur in the document representation vector;
(ii) a word stem does not occur in the query representation vector but does occur in the document representation vector;
(iii) a word stem occurs less frequently in the query representation vector than in the document representation vector;
(iv) a word stem occurs in the query representation vector and equally as often occurs in the document representation vector;
(v) a word stem occurs more frequently in the query representation vector than in the document representation vector; and
(vi) a word stem occurs in the query representation vector but does not occur in the document representation vector.
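The six conditions above partition every pair of query count and document count, which the following Python sketch makes explicit. The rule base values and the choice to sum the per-stem outputs into a relevance value are hypothetical; the claims only require that a relevance value be assigned based on the conditions.

```python
def condition(q, d):
    """Return which of conditions (i)-(vi) holds for query count q and document count d."""
    if q == 0 and d == 0:
        return "i"
    if q == 0:
        return "ii"    # stem only in the document
    if d == 0:
        return "vi"    # stem only in the query
    if q < d:
        return "iii"
    if q == d:
        return "iv"
    return "v"         # q > d > 0

# Hypothetical rule base: one output value per condition (illustrative numbers).
RULE_BASE = {"i": 0, "ii": -1, "iii": 8, "iv": 10, "v": 5, "vi": -10}

def relevance(query_vec, doc_vec, rules=RULE_BASE):
    """Sum the rule outputs over all stems (summation is an assumption)."""
    return sum(rules[condition(q, d)] for q, d in zip(query_vec, doc_vec))

score = relevance([1, 0, 2, 1], [1, 3, 1, 0])
```

In this example the four stems fall under conditions (iv), (ii), (v) and (vi), so the score is 10 - 1 + 5 - 10 = 4.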
-
18. The method of claim 11, further including the step of:
assigning a significance value to at least one of the word stems of the lexical database and using the significance value to influence the evaluation of document representation vectors that contain the at least one word stem.
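How a significance value might influence the evaluation can be sketched in Python as a per-stem weight. The per-stem score used here (the smaller of the two counts) is an assumed stand-in for the output of the vector evaluation rules, and the multiplicative weighting is likewise an assumption rather than something the claim specifies.

```python
def weighted_relevance(query_vec, doc_vec, significance):
    """Weight each stem's contribution by its significance value."""
    total = 0.0
    for q, d, s in zip(query_vec, doc_vec, significance):
        overlap = min(q, d)      # assumed per-stem score: shared occurrences
        total += s * overlap     # significance scales the contribution
    return total

weighted = weighted_relevance([1, 1, 0], [2, 1, 4], [0.5, 2.0, 1.0])
```

With these inputs the second stem counts four times as much as the first, giving 0.5 * 1 + 2.0 * 1 + 1.0 * 0 = 2.5.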
-
19. A search machine for retrieving information from documents of a database based on a query of a user, the search machine comprising:
-
means for generating a lexical database, the lexical database comprising word stems of the documents and of the query; and
means for evaluating the documents relative to the query, the means for evaluating including a vectorizing means for creating document representation vectors for the documents of the database and a query representation vector for the query using the lexicon database, the document representation vectors containing data on the word stems located in the documents and the query representation vector containing data on the word stems located in the query, the means for evaluating further including a vector rule base, and means for comparing vectors that compares the data on the word stems using rules from the vector rule base, wherein the means for evaluating the documents provides a resultant comprising at least some of the documents from the database ranked according to the documents' relation to the query; and
means for modifying the vector rule base based on external feedback from the user, the feedback including a modified resultant.
Dependent claims: 20, 21
-
-
22. A computer-readable medium having stored thereupon a plurality of instructions, the plurality of instructions including instructions which, when executed by a processor, cause the processor to perform the steps of:
-
deriving a lexical database of word stems from documents of a database and from a query document;
creating a representation vector corresponding to each of the documents of the database where each representation vector contains information about the word stems of the lexical database that are contained in the document of the database to which the representation vector corresponds;
creating a query representation vector corresponding to the query document, where the query representation vector contains information about the word stems that are contained in the query document;
evaluating each representation vector relative to the query representation vector with vector evaluation rules to form an evaluation of the representation vectors;
creating output reflecting the evaluation of the representation vectors;
displaying the output;
receiving feedback that specifies preferences for the output; and
modifying the vector evaluation rules based on the feedback such that the output more closely reflects the preferences provided in the feedback.
Dependent claims: 23
generating an n-dimensional representation vector having n elements, each element of the representation vector being initialized to zero, where n is the number of word stems contained in the lexical database;
deriving the word stem for each word in the document;
retrieving an index corresponding to the word stem from the lexical database;
incrementing the element stored at the index of the representation vector each time the index is retrieved.
-
Specification