Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query
First Claim
1. A computer-implemented process for forming a search query for searching a document database by a computer-implemented search process, the search process identifying documents likely to match the search query by matching individual terms of the search query to individual terms and sequences of terms in the document database, the process for forming the search query comprising:
- a) providing a first database containing a plurality of phrases derived from domain specific natural-language phrases, each of said phrases consisting of a plurality of stemmed terms in original order;
b) input to a computer an input query composed in natural language and comprising a plurality of unstemmed terms arranged in a user-selected order;
c) parsing said input query into separate terms;
d) stemming the terms of said input query to form an ordered sequence of stemmed terms, the order of the stemmed terms in the sequence being the same as the order of the unstemmed terms in the input query;
e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
f) comparing each group of stemmed terms to each phrase in said first database to identify each group of stemmed terms of the input query that matches a phrase in said first database;
g) for each identified group of stemmed terms, identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal; and
h) replacing each identified group of stemmed terms of the input query by the matching phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query.
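Steps (c) through (h) of claim 1 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `PHRASES` table, the toy suffix-stripping `stem` function (a stand-in, since the claim names no stemming algorithm), and the space-joined string used to represent a substituted phrase. The sketch also assumes that a group which loses its shared term no longer counts as a phrase, so its other terms stay in the query as ordinary stemmed terms.

```python
import re

# Hypothetical phrase database: tuples of stemmed terms in original order.
PHRASES = {
    ("product", "liability"),
    ("liability", "insurance"),
    ("liability", "insurance", "claim"),
}

def tokenize(query):
    """Step (c): parse the input query into separate terms."""
    return re.findall(r"[A-Za-z0-9]+", query)

def stem(term):
    """Step (d): toy stemmer (assumption), strips a few common suffixes."""
    term = term.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def find_matches(stems, phrases):
    """Steps (e)-(f): spans of two or more successive stems matching a phrase."""
    matches = []
    for start in range(len(stems)):
        for end in range(start + 2, len(stems) + 1):
            if tuple(stems[start:end]) in phrases:
                matches.append((start, end))
    return matches

def resolve_overlaps(matches):
    """Step (g): a shared term goes to the longer group, or to the first on a tie."""
    kept = []
    for span in sorted(matches):
        if kept and span[0] < kept[-1][1]:  # shares a term with the previous group
            prev = kept[-1]
            if (span[1] - span[0]) > (prev[1] - prev[0]):
                kept[-1] = span  # the later group is longer, so it wins
            # equal or shorter: the earlier group keeps the shared term
        else:
            kept.append(span)
    return kept

def build_query(query, phrases=PHRASES):
    """Step (h): substitute matched phrases; other stems stay as single terms."""
    stems = [stem(t) for t in tokenize(query)]
    terms, i = [], 0
    for start, end in resolve_overlaps(find_matches(stems, phrases)):
        terms.extend(stems[i:start])
        terms.append(" ".join(stems[start:end]))
        i = end
    terms.extend(stems[i:])
    return terms
```

With this table, `build_query("product liability insurance claims")` gives the shared term "liability" to the three-term group, while `build_query("product liability insurance")` hits the tie case and leaves it with the first group.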
Abstract
A computer implemented process for creating a search query for an information retrieval system in which a database is provided containing a plurality of stopwords and phrases. A natural language input query defines the composition of the text of documents to be identified. Each word of the natural language input query is compared to the database in order to remove stopwords from the query. The remaining words of the input query are stemmed to their basic roots, and the sequence of stemmed words in the list is compared to phrases in the database to identify phrases in the search query. The phrases are substituted for the sequence of stemmed words from the list so that the remaining elements, namely the substituted phrases and unsubstituted stemmed words, form the search query. The completed search query elements are query nodes of a query network used to match representation nodes of a document network of an inference network. The database includes as options a topic and key database for finding numerical keys, and a synonym database for finding synonyms, both of which are employed in the query as query nodes.
46 Claims
-
-
5. A computer-implemented process for forming a search query according to claim 1 wherein the input query may include one or more groups of terms forming citations, each citation including numerical terms, said process further includes:
-
i) identifying each group of terms forming a citation in said input query, and j) replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number.
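The citation handling of claim 5 can be sketched as follows. This is only an illustration under stated assumptions: the regular expression `CITATION_RE` (a number, one to three capitalized abbreviation tokens, then another number), the `CitationWord` record, and the default proximity value of 5 are all hypothetical, since the claim specifies only that a citation word combines the numerical terms with a predetermined word-level proximity number.

```python
import re
from typing import NamedTuple, Tuple

class CitationWord(NamedTuple):
    """Hypothetical record: the citation's numerical terms plus a proximity number."""
    numbers: Tuple[str, ...]
    proximity: int

# Assumed citation shape: "<number> <1-3 capitalized abbreviations> <number>".
CITATION_RE = re.compile(r"\b(\d+)\s+((?:[A-Z][A-Za-z.]*\s+){1,3}?)(\d+)\b")

PROXIMITY = 5  # assumed value; the claim only says "predetermined"

def replace_citations(query, proximity=PROXIMITY):
    """Return (citation words, remaining terms) for a natural-language query."""
    citations = []

    def _collect(match):
        # Keep only the numerical terms of the citation.
        citations.append(CitationWord((match.group(1), match.group(3)), proximity))
        return " "

    remainder = CITATION_RE.sub(_collect, query)
    return citations, remainder.split()
```

For example, `replace_citations("Is 410 U.S. 113 still good law")` yields one citation word holding the numbers 410 and 113, and leaves the other words for the ordinary parsing and stemming steps.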
-
-
6. A computer-implemented process for forming a search query according to claim 1 further including before step e, removing stopwords from the input query.
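The stopword removal of claim 6 might look like the short sketch below. The `STOPWORDS` set is a hypothetical sample, not the list used by the patent. Because removal happens before groups are selected in step (e), terms separated only by stopwords become successive and can then match a stored phrase.

```python
# Hypothetical stopword list for illustration only.
STOPWORDS = frozenset({"a", "an", "the", "of", "is", "what", "for", "in", "to"})

def remove_stopwords(terms, stopwords=STOPWORDS):
    """Drop every term that matches a stopword, preserving the original order."""
    return [t for t in terms if t.lower() not in stopwords]
```

Here `remove_stopwords("What is the statute of limitations for fraud".split())` keeps only "statute", "limitations", and "fraud", in that order.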
-
-
7. A computer system for forming a search query for searching a document database by a computer-implemented search process, the search process identifying documents likely to match the search query by matching individual terms of the search query to individual terms and sequences of terms in the document database, said system comprising:
-
a) a read only memory containing a first database consisting of a plurality of phrases, each of said phrases derived from domain specific natural-language phrases consisting of a plurality of stemmed terms in original order;
b) register means for storing an input query composed in natural language, the input query comprising a plurality of unstemmed terms arranged in a user-selected order;
c) parsing means responsive to said register means for parsing said input query into separate terms;
d) first processing means for stemming each term in said register means to form an ordered sequence of stemmed terms, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query;
e) selecting means for selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
f) first comparing means for comparing each group of stemmed terms in said register means to each phrase in said first database to identify each group of stemmed terms in the register means which matches a phrase in said first database;
g) second processing means for replacing each identified group of stemmed terms in said register means by the matching phrase in said first database; and
h) third processing means for identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, and for identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, and fourth processing means for assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal.
-
11. A computer system for forming a search query according to claim 7 wherein said input query may include one or more groups of terms forming citations, each citation having numerical terms, said computer system further including:
-
i) seventh processing means for identifying each group of terms forming a citation in said input query, and j) eighth processing means for replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number.
-
-
12. A computer system for forming a search query according to claim 7 wherein the first database further includes a plurality of stopwords, fifth comparing means for comparing each term in said register means to the stopwords in the first database, and deleting means responsive to the fifth comparing means for deleting each term from said register means that matches a stopword.
-
-
13. A computer-implemented process for identifying documents of a document database likely to match a search query defining the composition of the text of documents sought to be identified by matching individual terms of the search query to individual terms and sequences of terms in the document database, comprising:
-
a) providing a first database containing a plurality of phrases, derived from domain specific natural-language phrases each of said phrases consisting of a plurality of stemmed terms in original order, and providing said document database containing representations of the contents of the texts of a plurality of documents to be searched, the text of each document containing a plurality of terms;
b) input to a computer an input query composed in natural language and comprising a plurality of unstemmed terms in a user-selected order;
c) parsing said input query into separate terms;
d) stemming the terms of said input query to form an ordered sequence of stemmed terms for the search query, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query;
e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
f) comparing each group of stemmed terms to each phrase in said first database and identifying each group of stemmed terms that matches a phrase in said first database;
g) replacing each identified group of stemmed terms by the matching phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query;
h) after step (g), comparing each term of the search query to the terms in said document database to identify the frequency of occurrences of the stemmed search query terms for individual documents in the document database;
i) assigning a statistical weight to individual documents representing the probability that the document matches the search query based on the number of occurrences of the stemmed search query terms in the representations for each document; and
j) ranking the documents based on the statistical weight assigned in step (i).
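Steps (h) through (j) of claim 13 can be sketched as below. The weighting is a stand-in: the claim requires only a statistical weight based on the number of occurrences, so this sketch assumes a simple occurrence count normalized by document length. The document representation (a list of stemmed terms per document) and the space-joined phrase terms are also assumptions.

```python
def count_occurrences(term, doc_terms):
    """Step (h): count matches of a single term or a phrase (successive terms)."""
    words = term.split()
    n = len(words)
    return sum(
        1 for i in range(len(doc_terms) - n + 1) if doc_terms[i:i + n] == words
    )

def rank_documents(query_terms, docs):
    """Steps (i)-(j): weight each document, then rank from best to worst."""
    weights = {}
    for doc_id, doc_terms in docs.items():
        total = sum(count_occurrences(t, doc_terms) for t in query_terms)
        # Assumed weight: occurrences normalized by document length.
        weights[doc_id] = total / max(len(doc_terms), 1)
    return sorted(weights, key=lambda d: weights[d], reverse=True)
```

A query such as `["product liability", "claim"]` then ranks a document containing the phrase twice ahead of one that contains only "claim".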
-
18. A computer-implemented process for identifying documents according to claim 13 wherein the input query may include one or more groups of terms forming citations, each citation including numerical terms, said process further includes:
-
k) identifying each group of terms forming a citation in said input query, and l) replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number, step (h) includes comparing the identified citation words to terms and sequences of terms in the representations for each document, and step (i) includes assigning a statistical weight to each document concerning the probability that the document matches the search query based on the frequency of occurrences of the identified citation words in the representations for each document.
-
-
19. A computer-implemented process for identifying documents according to claim 13 further including displaying the texts of selected ones of said documents.
-
20. A computer-implemented process for identifying documents according to claim 13 further including before step e, removing stopwords from the input query.
-
-
21. A computer system for identifying documents of a document database likely to match a search query defining the composition of the text of documents sought to be identified by matching individual terms of the search query to individual terms and sequences of terms in the document database, the system comprising:
-
a) a first read only memory containing a first database containing a plurality of phrases, derived from domain specific natural-language phrases each of said phrases consisting of a plurality of stemmed terms in original order;
b) a second memory containing the document database containing representations of the contents of the texts of a plurality of documents to be searched, each document text containing a plurality of terms;
c) register means for storing an input query composed in natural language and comprising a plurality of unstemmed terms arranged in a user-selected order;
d) parsing means responsive to said register means for parsing said input query into separate terms;
e) first processing means for stemming each term in said register means to form an ordered sequence of stemmed terms, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query;
f) selecting means for selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
g) first comparing means for comparing each group of stemmed terms in said register means to each phrase in said first database to identify each group of stemmed terms that matches a phrase in said first database; and
h) second processing means for replacing each identified group of stemmed terms in said register means by the matched phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query;
i) second comparing means for comparing each term of the search query in said register means to the representations for the terms of each document in said second memory to identify the frequency of occurrences of the stemmed query terms for individual documents in the second memory;
j) third processing means responsive to said second comparing means for assigning a statistical weight to the individual document representing the probability that the document matches the search query based on the number of occurrences of the stemmed query terms in the representations for each document; and
k) fourth processing means responsive to said third processing means for ranking the documents according to statistical weight.
-
26. A computer system for identifying documents according to claim 21 wherein said input query may include one or more groups of terms forming citations, each citation having numerical terms, said computer system further including ninth processing means for identifying each group of terms forming a citation in said input query, tenth processing means for replacing each identified group of terms forming a citation in said register by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number, sixth comparing means for comparing said citation words in said register means to representations in said second memory to identify the frequency of occurrences of the citation words in the representations for documents;
- said third processing means being further responsive to said sixth comparing means for assigning statistical weights to documents concerning the probability that the document matches the search query.
-
27. A computer system for identifying documents according to claim 21 further including display means for displaying the texts of selected ones of said documents.
-
28. A computer system for identifying documents according to claim 21 wherein said first read-only memory contains a third database containing a plurality of stopwords, seventh comparing means for comparing each term in said register means to the stopwords in the third database, and deleting means responsive to the seventh comparing means for deleting each term from said register means that matches a stopword.
-
- 29. In a computer-implemented process employing an inference network for identifying documents in a first database likely to match a search query defining the composition of the text of documents sought to be identified, said inference network being implemented in computer means forming a query network and a document network, the document network having a first database containing a plurality of terms representing the texts of a plurality of documents to be searched, each term being represented by a node, the computer means comparing each term, i, of the search query, c, to each of the nodes of each document, j, to determine the probability that the individual term of the search query, $c_i$, is a correct descriptor of the document in accordance with the relationship
$$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
where $\mathrm{idf}_i$ is based on the frequency of documents in the entire collection of documents in the first database containing the term i, and $\mathrm{tf}_{ij}$ is based on the frequency with which the term, i, appears in the respective document, j, said computer means adding, for each document in the first database, the probabilities for each term of the search query and normalizing the sum of the probabilities that the terms of the search query are correct descriptors of the document by the number of terms in the search query, said computer means ranking the documents in accordance with the sum of the probabilities for each document, the improvement comprising establishing a query network by:
a) providing a second database containing a plurality of phrases derived from domain specific natural-language phrases each consisting of a plurality of stemmed terms in original order,
b) input to the computer means an input query composed in natural language and comprising a plurality of unstemmed words arranged in a user-selected order,
c) parsing said input query into separate terms,
d) stemming the terms of said input query to form an ordered sequence of stemmed terms for a search query, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query,
e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence,
f) comparing each group of stemmed terms in said search query to each phrase in said second database and identifying each group of stemmed terms that matches a phrase in said second database, and
g) replacing each identified group of stemmed terms by the matching phrase from the second database to form the search query comprising a plurality of individual terms, i, consisting of matched phrases substituted for groups of stemmed terms of the input query and of stemmed terms of the input query not substituted by matched phrases.
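The scoring recited in the preamble of claim 29 (a belief of 0.4 + 0.6 · idf · tf per query term, summed per document, normalized by the number of query terms, then ranked) can be sketched as follows. The claim says only what idf and tf are "based on", so the concrete forms used here are assumptions: idf as log(N/df)/log(N) and tf as the term count divided by the largest count in the document, both scaled to the range 0 to 1. Documents are represented as hypothetical `Counter` objects keyed by term or phrase.

```python
import math
from collections import Counter

def belief(tf, idf):
    """P(c_i | d_j) = 0.4 + 0.6 * idf_i * tf_ij, as recited in the claim."""
    return 0.4 + 0.6 * idf * tf

def score_documents(query_terms, docs):
    """Sum the beliefs per document, normalize by query length, and rank."""
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in docs.items():
        max_count = max(counts.values(), default=1)
        total = 0.0
        for term in query_terms:
            df = sum(1 for c in docs.values() if term in c)
            # Assumed idf: log(N / df) / log(N), zero if the term is unseen.
            idf = math.log(n_docs / df) / math.log(n_docs) if df and n_docs > 1 else 0.0
            # Assumed tf: count relative to the most frequent term in the document.
            tf = counts[term] / max_count
            total += belief(tf, idf)
        scores[doc_id] = total / max(len(query_terms), 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions a document containing none of the query terms scores exactly the default belief of 0.4, which places it below any document that contains a term appearing in only some of the collection.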
- 34. A computer-implemented process according to claim 29 wherein the input query may include one or more groups of terms forming citations, each citation including numerical terms, said process further includes:
-
h) identifying each group of terms forming a citation in said input query, and i) replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number so that each citation word becomes a term, i, and the computer means compares the identified citation words to the nodes of each document to determine the probability that the word is the correct descriptor of the document.
- 35. A computer-implemented process according to claim 29 further including displaying the texts of selected ones of said documents.
- 36. A computer-implemented process according to claim 29 wherein said second database further includes a plurality of stemmed synonyms of terms, said process including, after step g, comparing the stemmed terms of the input query remaining after substituting matching phrases to the stemmed synonyms in the second database and adding stemmed synonyms of remaining stemmed words to the input query to form the search query c, each individual term i of the search query being a remaining stemmed term or a respective synonym or a matching phrase.
- 37. A computer-implemented process according to claim 29 further including before step e, removing stopwords from the input query.
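The synonym step of claim 36 can be sketched in a few lines. The `SYNONYMS` table and its stemmed entries are hypothetical, and the sketch assumes that substituted phrases are represented as space-joined strings so they can be told apart from the remaining single stemmed terms, which are the only terms expanded.

```python
# Hypothetical table of stemmed synonyms, for illustration only.
SYNONYMS = {"car": ["automobil", "vehicl"], "attorney": ["lawyer"]}

def add_synonyms(terms, synonyms=SYNONYMS):
    """Append stemmed synonyms after each remaining single stemmed term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        if " " in term:
            continue  # a substituted phrase is not expanded
        for synonym in synonyms.get(term, ()):
            if synonym not in terms and synonym not in expanded:
                expanded.append(synonym)
    return expanded
```

For instance, `add_synonyms(["product liability", "car"])` leaves the phrase untouched and adds the two synonyms of "car" as further query terms.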
- 38. In a system for identifying documents in a first database likely to match a search query defining the composition of the text of documents sought to be identified, said system including computer means and a read only memory arranged in an inference network forming a query network and a document network, said document network comprising the first database containing a plurality of terms representing the text of each of a plurality of documents, each term being represented by a node, said computer means having first compare means for comparing each term, i, of the search query, c, to each of the nodes of each document, j, in said first database, first processing means for determining the probability that the individual term of the search query, $c_i$, is a correct descriptor of the document, j, in accordance with the following relationship
$$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
where $\mathrm{idf}_i$ is based on the frequency of documents in the entire collection of documents in the first database containing the term i, and $\mathrm{tf}_{ij}$ is based on the frequency with which the term, i, appears in the respective document, j, adding means for adding the probabilities determined by said first processing means for each term of the search query for each document in said first database, normalizing means responsive to said adding means for normalizing the sums of probabilities that the terms of the search query are correct descriptors of the document by the number of terms in the search query, and ranking means responsive to said normalizing means for ranking the documents in said first database in accordance with the values of the normalized sums of probabilities, the improvement of the query network comprising:
a) a second database recorded on said read only memory, said second database containing a plurality of phrases derived from domain specific natural-language phrases each consisting of a plurality of stemmed terms in original order,
b) input means connected to said computer means to input an input query to said computer means, the input query being composed in natural language and comprising a plurality of unstemmed words arranged in a user-selected order,
c) said computer means including
i) parse means for parsing said input query into separate terms,
ii) stem means responsive to said input means for stemming each term of said input query to form an ordered sequence of stemmed terms for a search query, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query,
iii) selecting means for selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence,
iv) second compare means for comparing each group of stemmed terms to each phrase in said second database and for identifying each group of stemmed terms that matches a phrase in said second database, and
v) substitution means responsive to said second compare means for replacing each identified group of stemmed terms from said sequence by the matched phrase to form a search query, c, comprising a plurality of terms, i, each term i consisting of a phrase substituted for a group of stemmed terms in the input query or of stemmed terms in the input query not substituted by phrases.
- 43. A computer system for identifying documents according to claim 38 wherein said input query may include one or more groups of terms forming citations, each citation having numerical terms, and said computer means of said query network further includes:
-
vi) seventh processing means for identifying each group of terms forming a citation in said input query, and vii) second substitution means for replacing each identified group of terms by a citation word which comprises the numerical terms of the group of terms forming the citation word and a predetermined word-level proximity number so that each citation word becomes a term, i, of search query, c, and said second compare means being further responsive to said second substitution means for comparing each term, i, of the search query, c, to nodes, j, in said second database.
- 44. A computer system for identifying documents according to claim 38 further including display means for displaying the texts of selected ones of said documents.
- 45. A computer system for identifying documents according to claim 38 further wherein said read-only memory contains a fourth database containing a plurality of stemmed synonyms of terms, sixth comparing means for comparing to the stemmed synonyms those stemmed terms of the input query that remain after identified groups of stemmed terms have been substituted by the matching phrases, and eighth processing means for adding stemmed synonyms of remaining stemmed terms to the input query to form the search query c, each individual term i of the search query being a remaining stemmed term or a respective synonym or a matching phrase.
- 46. A computer system for identifying documents according to claim 38 wherein said read-only memory contains a fifth database containing a plurality of stopwords, seventh comparing means for comparing each term in said register means to the stopwords in the fifth database, and deleting means responsive to the seventh comparing means for deleting each term from said register means that matches a stopword.
Specification