Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query

US 5,265,065 A
Filed: 10/08/1991
Issued: 11/23/1993
Est. Priority Date: 10/08/1991
Status: Expired due to Term

First Claim

1. A computer-implemented process for forming a search query for searching a document database by a computer-implemented search process, the search process identifying documents likely to match the search query by matching individual terms of the search query to individual terms and sequences of terms in the document database, the process for forming the search query comprising:

a) providing a first database containing a plurality of phrases derived from domain specific natural-language phrases, each of said phrases consisting of a plurality of stemmed terms in original order;

b) input to a computer an input query composed in natural language and comprising a plurality of unstemmed terms arranged in a user-selected order;

c) parsing said input query into separate terms;

d) stemming the terms of said input query to form an ordered sequence of stemmed terms, the order of the stemmed terms in the sequence being the same as the order of the unstemmed terms in the input query;

e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;

f) comparing each group of stemmed terms to each phrase in said first database to identify each group of stemmed terms of the input query that matches a phrase in said first database;

g) for each identified group of stemmed terms, identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal; and

h) replacing each identified group of stemmed terms of the input query by the matching phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query.
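
The overlap rule of step (g) can be illustrated with a short sketch. It is not taken from the patent: it assumes each identified group is recorded as a hypothetical `(start, end)` span over the stemmed sequence (end exclusive), and it assumes that the group which loses the shared term is dropped so that its other terms stay in the query as ordinary stemmed terms. The claim does not say what happens to the losing group.

```python
from typing import List, Tuple

# (start, end) positions in the stemmed sequence, end exclusive (assumed representation)
Span = Tuple[int, int]


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Give each shared term to one of two successive matching groups.

    Unequal sizes: the larger group keeps the shared term.
    Equal sizes: the first group keeps it.
    Assumption: the losing group is discarded entirely.
    """
    kept: List[Span] = []
    for span in sorted(spans):
        if kept and span[0] < kept[-1][1]:  # shares a term with the previous group
            prev = kept[-1]
            if (span[1] - span[0]) > (prev[1] - prev[0]):
                kept[-1] = span  # later group is larger, so it wins
            # otherwise the earlier group wins (larger, or equal and first)
        else:
            kept.append(span)
    return kept


if __name__ == "__main__":
    print(resolve_overlaps([(0, 2), (1, 4)]))
    print(resolve_overlaps([(0, 2), (1, 3)]))
```

In the first call the three-term group wins the shared term; in the second the groups are the same size, so the first one wins.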

Abstract

A computer implemented process for creating a search query for an information retrieval system in which a database is provided containing a plurality of stopwords and phrases. A natural language input query defines the composition of the text of documents to be identified. Each word of the natural language input query is compared to the database in order to remove stopwords from the query. The remaining words of the input query are stemmed to their basic roots, and the sequence of stemmed words in the list is compared to phrases in the database to identify phrases in the search query. The phrases are substituted for the sequence of stemmed words from the list so that the remaining elements, namely the substituted phrases and unsubstituted stemmed words, form the search query. The completed search query elements are query nodes of a query network used to match representation nodes of a document network of an inference network. The database includes as options a topic and key database for finding numerical keys, and a synonym database for finding synonyms, both of which are employed in the query as query nodes.
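
The sequence described in the abstract (remove stopwords, stem, substitute phrases) can be sketched as follows. Everything named here is an assumption made for illustration: the `STOPWORDS`, `STEMS`, and `PHRASES` tables are tiny hypothetical stand-ins for the patent's database, the lookup-table stemmer replaces a real stemming algorithm, and the left-to-right longest-match scan is a simplification that ignores the overlap rule of claim 1.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical miniature database contents (not from the patent).
STOPWORDS: Set[str] = {"what", "is", "the", "of", "for", "a", "in"}
STEMS: Dict[str, str] = {
    "statute": "statut",
    "limitations": "limit",
    "liability": "liabil",
    "damages": "damag",
}
PHRASES: Set[Tuple[str, ...]] = {("statut", "limit"), ("product", "liabil")}
MAX_PHRASE_LEN = max(len(p) for p in PHRASES)


def build_query(text: str) -> List[str]:
    """Turn a natural-language query into phrases and leftover stems."""
    words = re.findall(r"[a-z0-9]+", text.lower())        # parse into terms
    kept = [w for w in words if w not in STOPWORDS]       # remove stopwords
    stems = [STEMS.get(w, w) for w in kept]               # stem, order preserved
    query: List[str] = []
    i = 0
    while i < len(stems):
        # Try the longest candidate group of successive stems first.
        for n in range(min(MAX_PHRASE_LEN, len(stems) - i), 1, -1):
            group = tuple(stems[i:i + n])
            if group in PHRASES:
                query.append(" ".join(group))             # substitute the phrase
                i += n
                break
        else:
            query.append(stems[i])                        # unsubstituted stem
            i += 1
    return query


if __name__ == "__main__":
    print(build_query("What is the statute of limitations for product liability damages?"))
```

With these tables the example query reduces to two phrases and one remaining stem.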

46 Claims

1. A computer-implemented process for forming a search query for searching a document database by a computer-implemented search process, the search process identifying documents likely to match the search query by matching individual terms of the search query to individual terms and sequences of terms in the document database, the process for forming the search query comprising:
- a) providing a first database containing a plurality of phrases derived from domain specific natural-language phrases, each of said phrases consisting of a plurality of stemmed terms in original order;
  
  b) input to a computer an input query composed in natural language and comprising a plurality of unstemmed terms arranged in a user-selected order;
  
  c) parsing said input query into separate terms;
  
  d) stemming the terms of said input query to form an ordered sequence of stemmed terms, the order of the stemmed terms in the sequence being the same as the order of the unstemmed terms in the input query;
  
  e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
  
  f) comparing each group of stemmed terms to each phrase in said first database to identify each group of stemmed terms of the input query that matches a phrase in said first database;
  
  g) for each identified group of stemmed terms, identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal; and
  
  h) replacing each identified group of stemmed terms of the input query by the matching phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query.
- View Dependent Claims (2, 3, 4, 5, 6)
- - 2. A computer-implemented process for forming a search query according to claim 1 further including providing a second database containing a plurality of topics each having a descriptive topical text and an associated unique numerical key, each topical text being composed of a plurality of terms, comparing the terms of the input query or the search query to each of the terms of the topical texts in the second database, assigning a statistical weight to each topical text reflecting the probability that the topical text matches the query, ranking the topical texts based on the statistical weight, and inserting into the search query the numerical keys associated with up to n highest ranked topical texts, where n is a predetermined integer.
  - 3. A computer-implemented process for forming a search query according to claim 2 wherein the step of inserting the numerical keys into the search query includes comparing the statistical weights of the topical texts to a predetermined threshold, and inserting the numerical keys into the search query which are associated with topical texts having statistical weights which exceed the predetermined threshold.
  - 4. A computer-implemented process for forming a search query according to claim 2 wherein the statistical weight for each topical text is determined by comparing each term of the query to each term of the topical text, determining the probability that the query term is a correct descriptor of the topical text in accordance with the relationship
    
    $$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
    where idf_i is based on the frequency of texts in the second database containing the query term and tf_ij is based on the frequency with which the query term appears in the respective topical text, and for each topical text adding the probabilities for all terms of the query and normalizing the sum of the probabilities by the number of terms in the query.
- 5. A computer-implemented process for forming a search query according to claim 1 wherein the input query may include one or more groups of terms forming citations, each citation including numerical terms, said process further includes:
  - i) identifying each group of terms forming a citation in said input query, and
    
    j) replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number.
- 6. A computer-implemented process for forming a search query according to claim 1 further including before step e, removing stopwords from the input query.
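
The topic-key steps of claims 2 through 4 can be sketched as below. The claims say only that idf_i and tf_ij are "based on" frequencies, so the normalizations used here are assumptions chosen to keep both values between 0 and 1, and the function name `rank_topic_keys` and the sample topics are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def rank_topic_keys(query_terms: List[str], topics: Dict[int, List[str]],
                    n: int, threshold: float) -> List[int]:
    """Return up to n numerical keys whose topical texts best match the query."""
    num_texts = len(topics)
    # Number of topical texts that contain each term.
    doc_freq = Counter(term for terms in topics.values() for term in set(terms))
    scores: Dict[int, float] = {}
    for key, terms in topics.items():
        counts = Counter(terms)
        max_count = max(counts.values())
        total = 0.0
        for term in query_terms:
            tf = counts[term] / max_count                    # assumed tf normalization
            if doc_freq[term]:
                idf = math.log((num_texts + 0.5) / doc_freq[term]) / math.log(num_texts + 1)
            else:
                idf = 0.0                                    # term appears in no text
            total += 0.4 + 0.6 * idf * tf                    # relationship from claim 4
        scores[key] = total / len(query_terms)               # normalize by query length
    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    # Keep the n highest ranked keys whose weight exceeds the threshold (claim 3).
    return [k for k in ranked[:n] if scores[k] > threshold]


if __name__ == "__main__":
    topics = {
        101: ["product", "liabil", "manufactur"],
        202: ["statut", "limit"],
        303: ["contract", "breach"],
    }
    print(rank_topic_keys(["product", "liabil"], topics, n=2, threshold=0.5))
```

A topical text that shares no term with the query still scores 0.4 under this relationship, which is why the threshold test of claim 3 is useful in practice.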

7. A computer system for forming a search query for searching a document database by a computer-implemented search process, the search process identifying documents likely to match the search query by matching individual terms of the search query to individual terms and sequences of terms in the document database, said system comprising:
- a) a read only memory containing a first database consisting of a plurality of phrases, each of said phrases derived from domain specific natural-language phrases consisting of a plurality of stemmed terms in original order;
  
  b) register means for storing an input query composed in natural language, the input query comprising a plurality of unstemmed terms arranged in a user-selected order;
  
  c) parsing means responsive to said register means for parsing said input query into separate terms;
  
  d) first processing means for stemming each term in said register means to form an ordered sequence of stemmed terms, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query;
  
  e) selecting means for selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
  
  f) first comparing means for comparing each group of stemmed terms in said register means to each phrase in said first database to identify each group of stemmed terms in the register means which matches a phrase in said first database;
  
  g) second processing means for replacing each identified group of stemmed terms in said register means by the matching phrase in said first database; and
  
  h) third processing means for identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, and for identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, and fourth processing means for assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal.
- View Dependent Claims (8, 9, 10, 11, 12)
- - 8. A computer system for forming a search query according to claim 7 wherein said read only memory further contains a second database consisting of a plurality of topics each having a descriptive topical text and an associated unique numerical key, each topical text being composed of a plurality of terms, second comparing means for comparing the terms of the input query or the search query to each of the terms of the topical texts in the second database, fifth processing means for assigning a statistical weight to each topical text reflecting the probability that the topical text matches the query, ranking means for ranking the topical texts based on the statistical weight, said register means being responsive to the ranking means to store the numerical keys associated with up to n highest ranked topical texts, where n is a predetermined integer.
  - 9. A computer system for forming a search query according to claim 8 further including third comparing means for comparing the statistical weight of the topical texts to a predetermined threshold, said register means being responsive to the third comparing means to store numerical keys which are associated with topical texts having statistical weights which exceed the predetermined threshold.
  - 10. A computer system for forming a search query according to claim 8 further including fourth comparing means for comparing each term of the query to each term of the topical text, sixth processing means for determining the probability that the query term is a correct descriptor of the topical text in accordance with the relationship
    
    $$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
    where idf_i is based on the frequency of texts in the second database containing the query term and tf_ij is based on the frequency with which the query term appears in the respective topical text, adding means for adding for each topical text the probabilities for all terms of the query, and normalizing means responsive to the adding means for normalizing the sum of the probabilities by the number of terms in the query.
- 11. A computer system for forming a search query according to claim 7 wherein said input query may include one or more groups of terms forming citations, each citation having numerical terms, said computer system further including:
  - i) seventh processing means for identifying each group of terms forming a citation in said input query, and
    
    j) eighth processing means for replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number.
- 12. A computer system for forming a search query according to claim 7 wherein the first database further includes a plurality of stopwords, fifth comparing means for comparing each term in said register means to the stopwords in the first database, and deleting means responsive to the fifth comparing means for deleting each term from said register means that matches a stopword.
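
The citation handling of claims 5 and 11 might look like the sketch below. The claims state only that a citation word comprises the numerical terms of the citation and a predetermined word-level proximity number. The regular expression for spotting citations, the `PROXIMITY` value of 5, and the `#5(42 1983)` notation are all assumptions made for illustration.

```python
import re

PROXIMITY = 5  # hypothetical predetermined word-level proximity number

# Assumed citation shape: number, one or more capitalized abbreviations, number,
# for example "42 U.S.C. 1983" or "410 U.S. 113".
CITATION = re.compile(r"\b(\d+)\s+(?:[A-Z][\w.]*\s+)+(\d+)\b")


def replace_citations(text: str) -> str:
    """Replace each recognized citation with a single citation word."""
    def to_citation_word(match: "re.Match[str]") -> str:
        # Keep only the numerical terms and attach the proximity number.
        return f"#{PROXIMITY}({match.group(1)} {match.group(2)})"

    return CITATION.sub(to_citation_word, text)


if __name__ == "__main__":
    print(replace_citations("rights under 42 U.S.C. 1983 claims"))
```

Dropping the abbreviation and keeping the numbers with a proximity window lets variant spellings of the same citation match the same documents.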

13. A computer-implemented process for identifying documents of a document database likely to match a search query defining the composition of the text of documents sought to be identified by matching individual terms of the search query to individual terms and sequences of terms in the document database, comprising:
- a) providing a first database containing a plurality of phrases, derived from domain specific natural-language phrases each of said phrases consisting of a plurality of stemmed terms in original order, and providing said document database containing representations of the contents of the texts of a plurality of documents to be searched, the text of each document containing a plurality of terms;
  
  b) input to a computer an input query composed in natural language and comprising a plurality of unstemmed terms in a user-selected order;
  
  c) parsing said input query into separate terms;
  
  d) stemming the terms of said input query to form an ordered sequence of stemmed terms for the search query, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query;
  
  e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
  
  f) comparing each group of stemmed terms to each phrase in said first database and identifying each group of stemmed terms that matches a phrase in said first database;
  
  g) replacing each identified group of stemmed terms by the matching phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query;
  
  h) after step (g), comparing each term of the search query to the terms in said document database to identify the frequency of occurrences of the stemmed search query terms for individual documents in the document database;
  
  i) assigning a statistical weight to individual documents representing the probability that the document matches the search query based on the number of occurrences of the stemmed search query terms in the representations for each document; and
  
  j) ranking the documents based on the statistical weight assigned in step (i).
- View Dependent Claims (14, 15, 16, 17, 18, 19, 20)
- - 14. A computer-implemented process for identifying documents according to claim 13 further including, for each identified group of stemmed terms, identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal.
  - 15. A computer-implemented process for identifying documents according to claim 13 further including providing a second database containing a plurality of topics each having a descriptive topical text and an associated unique numerical key, each topical text being composed of a plurality of terms, comparing the terms of the input query or the search query to each of the terms of the topical texts in the second database, assigning a statistical weight to each topical text reflecting the probability that the topical text matches the query, ranking the topical texts based on the statistical weight, and inserting into the search query the numerical keys associated with up to n highest ranked topical texts, where n is a predetermined integer.
  - 16. A computer-implemented process for identifying documents according to claim 15 wherein the step of inserting the numerical keys into the search query includes comparing the statistical weights of the topical texts to a predetermined threshold, and inserting the numerical keys into the search query which are associated with topical texts having statistical weights which exceed the predetermined threshold.
  - 17. A computer-implemented process for identifying documents according to claim 15 wherein the statistical weight for each topical text is determined by comparing each term of the query to each term of the topical text, determining the probability that the query term is a correct descriptor of the topical text in accordance with the relationship
    
    $$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
    where idf_i is based on the frequency of texts in the second database containing the query term and tf_ij is based on the frequency with which the query term appears in the respective topical text, for each topical text adding the probabilities for all terms of the query and normalizing the sum of the probabilities by the number of terms in the query.
- 18. A computer-implemented process for identifying documents according to claim 13 wherein the input query may include one or more groups of terms forming citations, each citation including numerical terms, said process further includes:
  - k) identifying each group of terms forming a citation in said input query, and
    
    l) replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number,
    
    step (h) includes comparing the identified citation words to terms and sequences of terms in the representations for each document, and step (i) includes assigning a statistical weight to each document concerning the probability that the document matches the search query based on the frequency of occurrences of the identified citation words in the representations for each document.
- 19. A computer-implemented process for identifying documents according to claim 13 further including displaying the texts of selected ones of said documents.
- 20. A computer-implemented process for identifying documents according to claim 13 further including before step e, removing stopwords from the input query.
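
Steps (h) through (j) of claim 13 can be sketched as follows. The sketch assumes each document is stored as a list of stemmed terms and that a phrase in the search query is a space-separated string that must match successive document terms. The weight is a plain relative frequency used as a stand-in for the probability named in the claim, so it is not a true probability, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def count_occurrences(term: str, doc: List[str]) -> int:
    """Count matches of a single term or a multi-term phrase in a document."""
    parts = term.split()
    width = len(parts)
    return sum(1 for i in range(len(doc) - width + 1) if doc[i:i + width] == parts)


def rank_documents(query: List[str], docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Weight each document by query-term frequency and rank by weight."""
    weights: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        hits = sum(count_occurrences(term, tokens) for term in query)   # step (h)
        weights[doc_id] = hits / len(tokens) if tokens else 0.0         # step (i), assumed weight
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)  # step (j)


if __name__ == "__main__":
    docs = {
        "A": ["product", "liabil", "claim", "product", "liabil"],
        "B": ["liabil", "insur", "product", "recall"],
        "C": ["contract", "breach"],
    }
    for doc_id, weight in rank_documents(["product liabil", "claim"], docs):
        print(doc_id, weight)
```

Document B contains both words of the phrase but not in succession, so it receives no credit for the phrase term.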

21. A computer system for identifying documents of a document database likely to match a search query defining the composition of the text of documents sought to be identified by matching individual terms of the search query to individual terms and sequences of terms in the document database, the system comprising:
- a) a first read only memory containing a first database containing a plurality of phrases, derived from domain specific natural-language phrases each of said phrases consisting of a plurality of stemmed terms in original order;
  
  b) a second memory containing the document database containing representations of the contents of the texts of a plurality of documents to be searched, each document text containing a plurality of terms;
  
  c) register means for storing an input query composed in natural language and comprising a plurality of unstemmed terms arranged in a user-selected order;
  
  d) parsing means responsive to said register means for parsing said input query into separate terms;
  
  e) first processing means for stemming each term in said register means to form an ordered sequence of stemmed terms, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query;
  
  f) selecting means for selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence;
  
  g) first comparing means for comparing each group of stemmed terms in said register means to each phrase in said first database to identify each group of stemmed terms that matches a phrase in said first database; and
  
  h) second processing means for replacing each identified group of stemmed terms in said register means by the matched phrase from said first database, the individual terms of the search query comprising each matching phrase substituted for groups of stemmed terms of the input query and each remaining stemmed term of the input query;
  
  i) second comparing means for comparing each term of the search query in said register means to the representations for the terms of each document in said second memory to identify the frequency of occurrences of the stemmed query terms for individual documents in the second memory;
  
  j) third processing means responsive to said second comparing means for assigning a statistical weight to the individual document representing the probability that the document matches the search query based on the number of occurrences of the stemmed query terms in the representations for each document; and
  
  k) fourth processing means responsive to said third processing means for ranking the documents according to statistical weight.
- View Dependent Claims (22, 23, 24, 25, 26, 27, 28)
- - 22. A computer system for identifying documents according to claim 21 further including fifth processing means for identifying those stemmed terms which are shared by two successive identified groups of stemmed terms and for identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, and sixth processing means for assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal.
  - 23. A computer system for identifying documents according to claim 21 wherein said first read only memory further includes a second database containing a plurality of topics each having a descriptive text and an associated unique numerical key, each topical text being composed of a plurality of terms, third comparing means for comparing each of the terms of the input query or the search query to each of the terms of the texts of the topics in the second database, seventh processing means for assigning a statistical weight to each topical text reflecting the probability that the topical text matches the query, ranking means for ranking the topical texts based on the statistical weight, said register means being responsive to said ranking means to store the numerical keys associated with up to n highest ranked topical texts, where n is a predetermined integer.
  - 24. A computer system for identifying documents according to claim 23 further including fourth comparing means for comparing the statistical weight of the topical texts to a predetermined threshold, said register means being responsive to the fourth comparing means to store numerical keys which are associated with topical texts having statistical weights which exceed the predetermined threshold.
  - 25. A computer system for identifying documents according to claim 23 further including fifth comparing means for comparing each term of the query to each term of the topical text, eighth processing means for determining the probability that the query term is a correct descriptor of the topical text in accordance with the relationship
    
    $$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
    where idf_i is based on the frequency of texts in the second database containing the query term and tf_ij is based on the frequency with which the query term appears in the respective topical text, adding means for adding for each topical text the probabilities for all terms of the query, and normalizing means responsive to the adding means for normalizing the sum of the probabilities by the number of terms in the query.
- 26. A computer system for identifying documents according to claim 21 wherein said input query may include one or more groups of terms forming citations, each citation having numerical terms, said computer system further including ninth processing means for identifying each group of terms forming a citation in said input query, tenth processing means for replacing each identified group of terms forming a citation in said register by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number, sixth comparing means for comparing said citation words in said register means to representations in said second memory to identify the frequency of occurrences of the citation words in the representations for documents;
  - said third processing means being further responsive to said sixth comparing means for assigning statistical weights to documents concerning the probability that the document matches the search query.
- 27. A computer system for identifying documents according to claim 21 further including display means for displaying the texts of selected ones of said documents.
- 28. A computer system for identifying documents according to claim 21 wherein said first read-only memory contains a third database containing a plurality of stopwords, seventh comparing means for comparing each term in said register means to the stopwords in the third database, and deleting means responsive to the seventh comparing means for deleting each term from said register means that matches a stopword.

29. In a computer-implemented process employing an inference network for identifying documents in a first database likely to match a search query defining the composition of the text of documents sought to be identified, said inference network being implemented in computer means forming a query network and a document network, the document network having a first database containing a plurality of terms representing the texts of a plurality of documents to be searched, each term being represented by a node, the computer means comparing each term, i, of the search query, c, to each of the nodes of each document, j, to determine the probability that the individual term of the search query, c_i, is a correct descriptor of the document in accordance with the relationship

$$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
where idf_i is based on the frequency of documents in the entire collection of documents in the first database containing the term i, and tf_ij is based on the frequency with which the term, i, appears in the respective document, j, said computer means adding, for each document in the first database, the probabilities for each term of the search query and normalizing the sum of the probabilities that the terms of the search query are correct descriptors of the document by the number of terms in the search query, said computer means ranking the documents in accordance with the sum of the probabilities for each document, the improvement comprising establishing a query network by:

a) providing a second database containing a plurality of phrases derived from domain specific natural-language phrases each consisting of a plurality of stemmed terms in original order,

b) input to the computer means an input query composed in natural language and comprising a plurality of unstemmed words arranged in a user-selected order,

c) parsing said input query into separate terms,

d) stemming the terms of said input query to form an ordered sequence of stemmed terms for a search query, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query,

e) selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence,

f) comparing each group of stemmed terms in said search query to each phrase in said second database and identifying each group of stemmed terms that matches a phrase in said second database, and

g) replacing each identified group of stemmed terms by the matching phrase from the second database to form the search query comprising a plurality of individual terms, i, consisting of matched phrases substituted for groups of stemmed terms of the input query and of stemmed terms of the input query not substituted by matched phrases.
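
The document score described in the preamble of claim 29 (add the per-term probabilities, then normalize by the number of query terms) can be written out as below. The symbol bel(d_j) and the numerical values are assumptions chosen only to illustrate the arithmetic; n is the number of terms in the search query.

```latex
\[
\mathrm{bel}(d_j) = \frac{1}{n} \sum_{i=1}^{n} P(c_i \mid d_j),
\qquad
P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij}
\]

% Illustrative values: a two-term query where term 1 has idf_1 = 0.5 and
% tf_{1j} = 0.8, and term 2 does not occur in document j (tf_{2j} = 0).
\[
P(c_1 \mid d_j) = 0.4 + 0.6 \cdot 0.5 \cdot 0.8 = 0.64,
\qquad
P(c_2 \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_2 \cdot 0 = 0.4
\]
\[
\mathrm{bel}(d_j) = \frac{0.64 + 0.4}{2} = 0.52
\]
```

A term that is absent from the document still contributes the default value 0.4, so scores differ only through the terms that do occur.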
View Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37)

30. A computer-implemented process according to claim 29 further including, for each identified group of stemmed terms, identifying those stemmed terms which are shared by two successive identified groups of stemmed terms, identifying whether the number of stemmed terms in the two successive groups sharing a stemmed term is equal or unequal, assigning the shared stemmed term to only that group of the two successive groups containing the greatest number of stemmed terms if the number of terms is unequal, or assigning the shared stemmed term to only the first group of the two successive groups if the number of terms is equal.

31. A computer-implemented process according to claim 29 further including providing a third database containing a plurality of topics each having a descriptive topical text and an associated unique numerical key, each topical text being composed of a plurality of terms, comparing the terms of the input query or the search query to each of the terms of the topical texts in the third database, assigning a statistical weight to each topical text reflecting the probability that the topical text matches the query, ranking the topical texts based on the statistical weight, and inserting into the search query the numerical keys associated with up to n highest ranked topical texts, where n is a predetermined integer.

32. A computer-implemented process according to claim 31 wherein the step of inserting the numerical keys into the search query includes comparing the statistical weights of the topical texts to a predetermined threshold, and inserting the numerical keys into the search query which are associated with topical texts having statistical weights which exceed the predetermined threshold.

33. A computer-implemented process according to claim 31 wherein the statistical weight for each topical text is determined by comparing each term of the query to each term of the topical text, determining the probability that the query term is a correct descriptor of the topical text in accordance with the relationship

$$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
where idf_i is based on the frequency of texts in the third database containing the query term and tf_ij is based on the frequency with which the query term appears in the respective topical text, for each topical text adding the probabilities for all terms of the query and normalizing the sum of the probabilities that the topical text is a correct descriptor of the query by the number of terms in the query.
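The weighting in claim 33 can be worked through in a short Python sketch. The claim says only that idf_i and tf_ij are "based on" the respective frequencies, so the specific normalizations below (a log-scaled inverse frequency and a count divided by the text's maximum term count) are assumptions made for illustration, as are all function names.

```python
import math
from collections import Counter


def term_belief(idf, tf):
    """P(c_i | d_j) = 0.4 + 0.6 * idf_i * tf_ij, as recited in the claim."""
    return 0.4 + 0.6 * idf * tf


def score_text(query_terms, text_terms, all_texts):
    """Sum the per-term probabilities and normalize by the number of query terms."""
    counts = Counter(text_terms)
    max_count = max(counts.values()) if counts else 1
    n_texts = len(all_texts)
    total = 0.0
    for term in query_terms:
        n_containing = sum(1 for t in all_texts if term in t)
        if n_containing == 0 or n_texts < 2:
            idf = 0.0  # assumed convention for unseen terms or a single text
        else:
            # Assumed normalization: log(N / n_i) / log(N), in [0, 1].
            idf = math.log(n_texts / n_containing) / math.log(n_texts)
        tf = counts[term] / max_count  # assumed normalization, in [0, 1]
        total += term_belief(idf, tf)
    return total / len(query_terms)


texts = [["product", "liability"], ["contract", "breach"]]  # toy topical texts
print(score_text(["product", "tort"], texts[0], texts))
```

With these assumed normalizations, a query term absent from a text contributes the default belief of 0.4, so every text receives a score of at least 0.4.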

34. A computer-implemented process according to claim 29 wherein the input query may include one or more groups of terms forming citations, each citation including numerical terms, said process further includes:
h) identifying each group of terms forming a citation in said input query, and

i) replacing each identified group of terms forming a citation by a citation word which comprises the numerical terms of the group of terms forming the citation and a predetermined word-level proximity number so that each citation word becomes a term, i, and

the computer means compares the identified citation words to the nodes of each document to determine the probability that the word is the correct descriptor of the document.
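Steps h) and i) of claim 34 can be sketched with a regular expression. The citation pattern (volume number, a single reporter token, page number) and the output format `volume /k page` are both hypothetical; the claim states only that the citation word comprises the numerical terms and a predetermined word-level proximity number.

```python
import re

# Assumed citation shape: "<volume> <reporter token> <page>", e.g. "410 U.S. 113".
CITATION = re.compile(r"\b(\d+)\s+[A-Za-z][A-Za-z.\d]*\s+(\d+)\b")


def replace_citations(text, proximity=3):
    """Replace each citation by a citation word built from its numerical terms.

    The "/k" notation for the proximity number is an illustrative choice.
    """
    return CITATION.sub(
        lambda m: f"{m.group(1)} /{proximity} {m.group(2)}", text
    )


print(replace_citations("abortion 410 U.S. 113 privacy"))
```

Dropping the reporter abbreviation and keeping only the numbers with a proximity constraint means the citation can match however the reporter name is punctuated in the stored documents.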

35. A computer-implemented process according to claim 29 further including displaying the texts of selected ones of said documents.

36. A computer-implemented process according to claim 29 wherein said second database further includes a plurality of stemmed synonyms of terms, said process including, after step g, comparing the stemmed terms of the input query remaining after substituting matching phrases to the stemmed synonyms in the second database and adding stemmed synonyms of remaining stemmed words to the input query to form the search query c, each individual term i of the search query being a remaining stemmed term or a respective synonym or a matching phrase.
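The synonym expansion of claim 36 can be sketched as below. The dictionary layout of the synonym store and the use of an embedded space to recognize a matched phrase are assumptions made for this example.

```python
def add_synonyms(query_terms, synonym_db):
    """Add stemmed synonyms for the stems that remain after phrase substitution.

    query_terms: list of terms, where a matched phrase is a multi-word string.
    synonym_db: dict mapping a stem to a list of stemmed synonyms (assumed layout).
    """
    expanded = []
    for term in query_terms:
        expanded.append(term)
        if " " not in term:  # matched phrases are left unexpanded
            expanded.extend(
                s for s in synonym_db.get(term, []) if s not in expanded
            )
    return expanded


synonym_db = {"car": ["automobil", "vehicl"]}  # illustrative entries only
print(add_synonyms(["product liability", "car", "defect"], synonym_db))
```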

37. A computer-implemented process according to claim 29 further including before step e, removing stopwords from the input query.

38. In a system for identifying documents in a first database likely to match a search query defining the composition of the text of documents sought to be identified, said system including computer means and a read only memory arranged in an inference network forming a query network and a document network, said document network comprising the first database containing a plurality of terms representing the text of each of a plurality of documents, each term being represented by a node, said computer means having first compare means for comparing each term, i, of the search query, c, to each of the nodes of each document, j, in said first database, first processing means for determining the probability that the individual term of the search query, c_i, is a correct descriptor of the document, j, in accordance with the following relationship

$$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
where idf_i is based on the frequency of documents in the entire collection of documents in the first database containing the term i, and tf_ij is based on the frequency with which the term, i, appears in the respective document, j, adding means for adding the probabilities determined by said first processing means for each term of the search query for each document in said first database, normalizing means responsive to said adding means for normalizing the sums of probabilities that the terms of the search query are correct descriptors of the document by the number of terms in the search query, and ranking means responsive to said normalizing means for ranking the documents in said first database in accordance with the values of the normalized sums of probabilities, the improvement of the query network comprising;

a) a second database recorded on said read only memory, said second database containing a plurality of phrases derived from domain specific natural-language phrases each consisting of a plurality of stemmed terms in original order,

b) input means connected to said computer means to input an input query to said computer means, the input query being composed in natural language and comprising a plurality of unstemmed words arranged in a user-selected order,

c) said computer means including

i) parse means for parsing said input query into separate terms,

ii) stem means responsive to said input means for stemming each term of said input query to form an ordered sequence of stemmed terms for a search query, the order of the stemmed terms being the same as the order of the unstemmed terms in the input query,

iii) selecting means for selecting groups of stemmed terms, each group consisting of a plurality of successive stemmed terms of the sequence,

iv) second compare means for comparing each group of stemmed terms to each phrase in said second database and for identifying each group of stemmed terms that matches a phrase in said second database, and

v) substitution means responsive to said second compare means for replacing each identified group of stemmed terms from said sequence by the matched phrase to form a search query, c, comprising a plurality of terms, i, each term i consisting of a phrase substituted for a group of stemmed terms in the input query or of stemmed terms in the input query not substituted by phrases.

39. A computer system for identifying documents according to claim 38 further including second processing means for identifying those stemmed terms which are shared by two successive identified groups of stemmed terms and for identifying whether the number of stemmed terms in the two successive groups is equal or unequal, and third processing means for assigning the shared stemmed term to only that group of the two successive groups of stemmed terms containing the greatest number of stemmed terms if the number of terms is unequal or to only the first group of the two successive groups if the number of stemmed terms is equal.

40. A computer system for identifying documents according to claim 38 wherein said read only memory further includes a third database containing a plurality of topics each having a descriptive text and an associated unique numerical key, each text of the topics being composed of a plurality of terms, third compare means for comparing each of the terms of the input query or the search query to each of the terms of the texts of the topics in the third database, fourth processing means for assigning a statistical weight to each topical text reflecting the probability that the topical text matches the query, second ranking means for ranking the topical texts based on the statistical weight, and fifth processing means responsive to the second ranking means for storing into the register means the numerical keys associated with up to n highest ranked topical texts, where n is a predetermined integer.

41. A computer system for identifying documents according to claim 40 further including fourth compare means for comparing the statistical weight of the topical texts to a predetermined threshold, said register means being responsive to the fourth compare means to store numerical keys which are associated with topical texts having statistical weight which exceed the predetermined threshold.

42. A computer system for identifying documents according to claim 40 further including fifth compare means for comparing each term of the query to each term of the topical text, sixth processing means for determining the probability that the query term is a correct descriptor of the topical text in accordance with the relationship

$$P(c_i \mid d_j) = 0.4 + 0.6 \cdot \mathrm{idf}_i \cdot \mathrm{tf}_{ij},$$
where idf_i is based on the frequency of texts in the third database containing the query term and tf_ij is based on the frequency with which the query term appears in the respective topical text, second adding means for adding for each topical text the probabilities for all terms of the query and second normalizing means responsive to the second adding means for normalizing the sum of the probabilities that the topical text matches the query by the number of terms in the query.

43. A computer system for identifying documents according to claim 38 wherein said input query may include one or more groups of terms forming citations, each citation having numerical terms, and said computer means of said query network further includes:
vi) seventh processing means for identifying each group of terms forming a citation in said input query, and

vii) second substitution means for replacing each identified group of terms by a citation word which comprises the numerical terms of the group of terms forming the citation word and a predetermined word-level proximity number so that each citation word becomes a term, i, of search query, c, and

said second compare means being further responsive to said second substitution means for comparing each term, i, of the search query, c, to nodes, j, in said second database.

44. A computer system for identifying documents according to claim 38 further including display means for displaying the texts of selected ones of said documents.

45. A computer system for identifying documents according to claim 38 further wherein said read-only memory contains a fourth database containing a plurality of stemmed synonyms of terms, sixth comparing means for comparing to the stemmed synonyms those stemmed terms of the input query that remain after identified groups of stemmed terms have been substituted by the matching phrases, and eighth processing means for adding stemmed synonyms of remaining stemmed terms to the input query to form the search query c, each individual term i of the search query being a remaining stemmed term or a respective synonym or a matching phrase.

46. A computer system for identifying documents according to claim 38 wherein said read-only memory contains a fifth database containing a plurality of stopwords, seventh comparing means for comparing each term in said register means to the stopwords in the fifth database, and deleting means responsive to the seventh comparing means for deleting each term from said register means that matches a stopword.

Current Assignee: West Services
Original Assignee: West Publishing Corporation (The Woodbridge Co. Ltd.)
Inventors: Turtle, Howard R.
Primary Examiner(s): Lee, Thomas C.
Assistant Examiner(s): AMSBURY, WAYNE P
Application Number: US07/773,101
Time in Patent Office: 777 Days
Field of Search: 364/200, 364/300, 364/419, 364/513, 395/600
US Class Current: 1/1
CPC Class Codes:
G06F 16/3335   Syntactic pre-processing, e...
G06F 16/3346   using probabilistic model
G06F 16/93   Document management systems
Y10S 707/917   Text
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
