Document information retrieval using global word co-occurrence patterns

US 5,675,819 A
Filed: 06/16/1994
Issued: 10/07/1997
Est. Priority Date: 06/16/1994
Status: Expired due to Term

Abstract

A method and apparatus accesses relevant documents based on a query. A thesaurus of word vectors is formed for the words in the corpus of documents. The word vectors represent global lexical co-occurrence patterns and relationships between word neighbors. Document vectors, which are formed from the combination of word vectors, are in the same multi-dimensional space as the word vectors. A singular value decomposition is used to reduce the dimensionality of the document vectors. A query vector is formed from the combination of word vectors associated with the words in the query. The query vector and document vectors are compared to determine the relevant documents. The query vector can be divided into several factor clusters to form factor vectors. The factor vectors are then compared to the document vectors to determine the ranking of the documents within the factor cluster.
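
As an illustration of the co-occurrence counting that underlies the thesaurus, the following is a minimal sketch and not the patented implementation. The function name `cooccurrence_vectors`, the whitespace tokenizer, and the tiny `window` value are assumptions made for the example; the source only speaks of a "predetermined range".

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(documents, window=2):
    """Hypothetical helper: for each word, count how often every other word
    appears within `window` tokens of it inside the same document."""
    counts = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()  # naive tokenizer, assumed for the sketch
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word's own position
                    counts[word][tokens[j]] += 1
    # Each word's vector is its sparse map of neighbor -> count.
    return {word: dict(c) for word, c in counts.items()}

docs = ["the cat sat on the mat", "the dog sat on the rug"]
vectors = cooccurrence_vectors(docs, window=1)
```

Here `vectors["sat"]` accumulates counts across both documents, which is what makes the pattern "global" rather than per-document.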

47 Claims

1. A method, using a processor and memory, for generating a thesaurus of word vectors based on lexical co-occurrence of words within documents of a corpus of documents, the corpus stored in the memory, the method comprising:
- retrieving into the processor a retrieved word from the corpus;
  
  recording a number of times a co-occurring word co-occurs in a same document within a predetermined range of the retrieved word;
  
  repeating the recording step for every co-occurring word located within the predetermined range for each occurrence of the retrieved word in the corpus;
  
  generating a word vector for the word based on every recorded number;
  
  repeating the retrieving, recording, recording repeating and generating steps for each word in the corpus, and storing the generated word vectors in the memory as the thesaurus.
- - 2. The method of claim 1, wherein a matrix of word vectors is formed using all of the generated word vectors of the corpus.
  - 3. The method of claim 2, wherein forming the matrix comprises the steps of:
    - forming a first matrix from a first subset of words within the corpus, each element of the first matrix recording the number of times that two words within the first subset co-occur in the predetermined range;
      
      clustering the first matrix into groups to form a set of low coverage word clusters;
      
      forming a second matrix from the low coverage word clusters and a second subset of words within the corpus, the second subset containing more words than the first subset, each element of the second matrix recording a number of times that each word of the second subset co-occurs with each cluster of the set of low coverage word clusters within the predetermined range;
      
      clustering the second matrix into groups to form a set of high coverage word clusters;
      
      forming a third matrix from the high coverage word clusters and all of the words of the corpus, each element of the third matrix recording a number of times that each word of the corpus co-occurs with each cluster of the set of high coverage word clusters within the predetermined range; and
      
      reducing dimensionality of the third matrix to represent each element of the third matrix as a compact vector.
  - 4. The method of claim 3, further comprising, before forming the first matrix:
    - computing word frequency and word pair frequency for the words of the corpus; and
      
      removing words and word pairs with low frequency.
  - 5. The method of claim 3, wherein a number of words selected from the corpus to form the first matrix is 3000 words.
  - 6. The method of claim 5, wherein the selected words are medium frequency words.
  - 7. The method of claim 3, wherein the clustering of the first matrix and the second matrix use a Buckshot fast clustering algorithm.
  - 8. The method of claim 3, wherein a clustering algorithm is based on a cosine similarity between columns of the first matrix.
  - 9. The method of claim 3, wherein a number of low coverage word clusters is 200.
  - 10. The method of claim 3, wherein the second subset contains 20,000 words.
  - 11. The method of claim 10, wherein the words within the second subset are the most frequent words excluding stop words.
  - 12. The method of claim 3, wherein the reducing step uses a singular value decomposition method.
  - 13. The method of claim 1, wherein the similarity between two words uses a cosine similarity function of:
    - ##EQU18## where V is a set of words;
      
      w_i is the word;
      
      w_j is the co-occurring word;
      
      φ is a word encoding vector of the word and the co-occurring word.
  - 14. The method of claim 1, wherein the word vector is generated based on:
    - ##EQU19## where φ₁ is an order-1 vector for word w_i;
      
      tokens t_j,k are unique words in document d_j within the corpus D;
      
      tokens t_j,l are neighbors of one of the tokens t_j,k;
      
      W is a number of intervening words between tokens t_j,l and t_j,k; and
      
      φ₀ is an order-0 vector for tokens t_j,l.
  - 15. The method of claim 14, wherein the word vector is generated based on:
    - ##EQU20## where φ₂ is an order-2 vector for word w_i;
      
      tokens t_j,k are unique words in document d_j in the corpus D;
      
      v_n is a frequent content word;
      
      tokens t_j,l are neighbors of one of the tokens t_j,k;
      
      W is a number of intervening words between tokens t_j,l and t_j,k; and
      
      φ₁ is an order-1 vector for token t_j,l.
  - 16. The method of claim 15, wherein the number of intervening words is 50.
  - 17. The method of claim 14, wherein the number of intervening words is 50.
  - 18. The method of claim 1, further comprising generating a context vector for each document of the corpus, the context vector for a document based on the word vectors from the thesaurus for each word located in the document.
  - 19. The method of claim 18, wherein generating the context vector comprises:
    - extracting each word from the document;
      
      retrieving the word vector from the thesaurus for each extracted word;
      
      adding the retrieved word vectors together to form the context vector for the document.
  - 20. The method of claim 19, wherein the word vectors are weighted before being added together.
  - 21. The method of claim 19 further comprising the step of normalizing the context vector.
  - 22. The method of claim 18, wherein generating the context vector uses the equation:
    - ##EQU21## where d_j is the vector for document j;
      
      w_ij is the weight for word i in document j; and
      
      v_i is the thesaurus vector for word i.
  - 23. The method of claim 22 wherein weighting the words in the document is by using an augmented term frequency-inverse document frequency method:
    - ##EQU22## where tf_ij is the frequency of word i in document j;
      
      N is the total number of documents; and
      
      n_i is the document frequency of word i.
  - 24. The method of claim 22, wherein the context vector is normalized using the equation:
    - ##EQU23## where p is the number of dimensions in the corpus; and
      
      d_j is the context vector.
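
The context-vector steps of claims 18 to 24 (look up each word's thesaurus vector, weight it, add the vectors, normalize) can be sketched as follows. This is a hedged illustration only: the names `context_vector` and `normalize` are hypothetical, the default weight of 1.0 per token stands in for the weighting formula that is not reproduced in this listing, and unit Euclidean length is an assumed normalization.

```python
import math

def context_vector(tokens, thesaurus, weight=lambda word: 1.0):
    """Sum the thesaurus vectors of the tokens, each scaled by weight(word).
    `weight` is a placeholder for whatever term weighting is actually used."""
    dims = len(next(iter(thesaurus.values())))
    total = [0.0] * dims
    for word in tokens:
        vec = thesaurus.get(word)
        if vec is None:  # word missing from the thesaurus: skip it
            continue
        w = weight(word)
        for k in range(dims):
            total[k] += w * vec[k]
    return total

def normalize(vec):
    """Scale to unit Euclidean length (an assumed choice of normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

# Toy two-dimensional thesaurus, invented for the example.
thesaurus = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "pet": [1.0, 1.0]}
d = context_vector(["cat", "pet", "pet", "unknown"], thesaurus)
d_unit = normalize(d)
```

Because every token occurrence contributes once, repeated words count in proportion to their frequency even with the default weight.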

25. A method, using a processor and memory, for determining relevant documents in a corpus of documents, the memory including a thesaurus of word vectors and a context vector for each document, the word vectors based on co-occurrence of words within each of the documents of the corpus of documents and each context vector based on the word vectors from the thesaurus for each word located in the corresponding document, the method comprising:
- inputting a query;
  
  generating a context vector for the query based on the word vectors from the thesaurus for each word in the query;
  
  determining a correlation coefficient for each document based on the context vectors for that document and the query;
  
  ranking each document based on the determined correlation coefficients; and
  
  outputting the ranking for at least one of the documents.
- - 26. The method of claim 25, wherein computing the correlation coefficient uses the following formula:
    - ##EQU24## where ψ(d_i) is the query vector and ψ(d_j) is the document vector.
  - 27. The method of claim 25, wherein ranking of documents is from highest correlation coefficients to lowest correlation coefficients.
  - 28. The method of claim 25, wherein the step of outputting comprises outputting, instead of the ranking for at least one document, at least one document according to its corresponding ranking.
  - 29. The method of claim 25, further comprising outputting at least one document according to its ranking.
  - 30. The method of claim 25, wherein the step of outputting comprises outputting, instead of the ranking for at least one document, at least one document according to its corresponding ranking.
  - 31. The method of claim 25, further comprising outputting at least one document according to its ranking.
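
The ranking steps of claims 25 to 27 can be illustrated with a short sketch. It assumes the correlation coefficient is the cosine of the angle between the query and document context vectors; the formula itself is not reproduced in this listing, so that choice, along with the names `cosine` and `rank_documents` and the toy vectors, is an assumption.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs ordered from highest to lowest correlation."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy context vectors, invented for the example.
doc_vecs = {"d1": [3.0, 2.0], "d2": [0.0, 1.0], "d3": [1.0, 0.0]}
ranking = rank_documents([1.0, 0.0], doc_vecs)
```

With these toy vectors the document aligned with the query direction comes first and the orthogonal one last.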

32. A method, using a processor and memory, for determining relevant documents in a corpus of documents, the memory including a thesaurus of word vectors and a context vector for each document, the word vectors based on co-occurrence of words within each of the documents of the corpus of documents and each context vector based on the word vectors from the thesaurus for each word located in the corresponding document, the method comprising:
- inputting a query;
  
  generating a factor vector for the query based on a clustering of the word vectors of the query;
  
  determining a correlation coefficient for each document based on the factor vector and the context vector for that document;
  
  ranking each document within a factor cluster based on the determined correlation coefficients; and
  
  determining a maximum rank of each document based on a combination of the ranks of the documents in each factor cluster; and
  
  outputting a final rank for at least one of the documents.
- - 33. The method of claim 32, wherein generating the factor vectors comprises:
    - retrieving the thesaurus word vectors for each word in the query;
      
      forming factor clusters of the query;
      
      generating a factor vector for each factor cluster.
  - 34. The method of claim 32, wherein generating the factor vector uses the equation:
    - ##EQU25## where f_m is the factor vector for factor cluster m;
      
      w_im is the weight for word i in factor cluster m; and
      
      v_i is the thesaurus vector for word i.
  - 35. The method of claim 32, wherein computing the correlation coefficient for each document uses the following formula:
    - ##EQU26## where ψ(f_m) is the factor vector for factor cluster f_m and ψ(d_j) is the context vector for document d_j.
  - 36. The method of claim 32, wherein ranking of the documents within a factor cluster is based on correlation:
    - corr(f_m, d_j)
      where the rank of d_j according to this ranking is r_m(j); and
      
      corr(f_m,d_j) is the correlation of factor m and document j.
  - 37. The method of claim 32, wherein computing the maximum rank for each document uses the following equation:
    - r(j) = max_m(r_m(j))
      where r(j) is the ranking of document j; and
      
      r_m(j) is the rank of document j for factor cluster m.
  - 38. The method of claim 32, wherein the rank of the most relevant document is output.
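
The rank combination r(j) = max_m(r_m(j)) from claims 36 and 37 can be sketched as below. The sketch assumes cosine as the correlation, treats rank 1 as the best rank within a factor cluster, and takes the factor vectors as given rather than deriving them by clustering the query words; the function names and toy data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def factor_ranks(factor_vec, doc_vecs):
    """r_m(j): 1-based rank of each document for one factor vector."""
    ordered = sorted(doc_vecs, key=lambda d: cosine(factor_vec, doc_vecs[d]), reverse=True)
    return {doc_id: pos for pos, doc_id in enumerate(ordered, start=1)}

def max_rank(factor_vecs, doc_vecs):
    """r(j) = max over factors m of r_m(j), i.e. each document's worst rank."""
    per_factor = [factor_ranks(f, doc_vecs) for f in factor_vecs]
    return {d: max(r[d] for r in per_factor) for d in doc_vecs}

# Two toy factor vectors and three toy documents, invented for the example.
doc_vecs = {"d1": [1.0, 1.0], "d2": [1.0, 0.0], "d3": [0.0, 1.0]}
factors = [[1.0, 0.0], [0.0, 1.0]]
final = max_rank(factors, doc_vecs)
```

Taking the maximum means a document scores well overall only if it ranks reasonably for every factor, so `d1`, which relates to both factors, ends up ahead of documents that match just one.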

39. An apparatus for generating a thesaurus of word vectors for each word in a corpus of documents, the word vectors being based on the lexical co-occurrence of words within each of the documents, comprising:
- a memory containing the corpus of documents;
  
  an extractor for retrieving a word from the corpus;
  
  a counter recording the number of times the word co-occurs with a co-occurring word located within a predetermined range, the co-occurring word being any word located before and after the word within the predetermined range, the counter recording the number for every co-occurring word within the predetermined range;
  
  a generator generating a word vector for the word based on every recorded number; and
  
  an output for outputting the word vectors in the thesaurus.

40. An apparatus for retrieving relevant documents from a corpus of documents based on a query, comprising:
- a memory containing the corpus of documents;
  
  a thesaurus of word vectors for each word of the corpus, the word vectors being based on lexical co-occurrence of words within each of the documents;
  
  a processor for generating document vectors and a query vector based on the word vectors, each document vector being a summation of the word vectors that are associated with the words located within the document;
  
  the query vector being a summation of the word vectors that are associated with the words located within the query;
  
  determining means for determining a co-occurrence correlation relationship between the document vectors and the query vector; and
  
  output means for outputting the relevant documents based on the correlation relationship determined by the determining means.
- - 41. The apparatus of claim 40, wherein the thesaurus comprises:
    - an extractor for retrieving a word from the corpus;
      
      a counter recording the number of times the word co-occurs with a co-occurring word located within a predetermined range, the co-occurring word being any word located before and after the word within the predetermined range, the counter recording the number for every co-occurring word within the predetermined range; and
      
      a generator generating a word vector for the word based on every recorded number.
  - 42. The apparatus of claim 40, wherein the determining means ranks the documents based on the correlation relationship.
  - 43. The apparatus of claim 40, wherein the processing means weights the word vectors before forming the document vectors and the query vector.
  - 44. The apparatus of claim 40, wherein the memory is one of a RAM, a ROM and an external storage unit.
  - 45. The apparatus of claim 40, wherein the query vector is a plurality of query factor vectors, each query factor vector based on a cluster factor of the query.
  - 46. The apparatus of claim 45, wherein the determining means ranks the documents in factor clusters based on the correlation relationship between the document vectors and the query factor vectors of the query vector.
  - 47. The apparatus of claim 46, wherein the relevant documents are determined by combining the ranking of the document within each factor cluster.

Current Assignee: Technology Licensing Corporation
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Schuetze, Hinrich
Primary Examiner(s): Hayes, Gail O.
Assistant Examiner(s): KALIDINDI, KRISHNA V
Application Number: US08/260,575
Time in Patent Office: 1,209 Days
Field of Search: 395/600, 395/760, 395/759, 364/419.19, 364/419.11, 364/419.08
US Class Current: 704/10
CPC Class Codes:
- G06F 16/334   Query execution G06F16/335 ...
- G06F 40/253   Grammatical analysis; Style...
- Y10S 707/99933   Query processing, i.e. sear...
