Document information retrieval using global word co-occurrence patterns
First Claim
1. A method, using a processor and memory, for generating a thesaurus of word vectors based on lexical co-occurrence of words within documents of a corpus of documents, the corpus stored in the memory, the method comprising:
retrieving into the processor a retrieved word from the corpus;
recording a number of times a co-occurring word co-occurs in a same document within a predetermined range of the retrieved word;
repeating the recording step for every co-occurring word located within the predetermined range for each occurrence of the retrieved word in the corpus;
generating a word vector for the word based on every recorded number;
repeating the retrieving, recording, recording repeating and generating steps for each word in the corpus, and
storing the generated word vectors in the memory as the thesaurus.
Abstract
A method and apparatus accesses relevant documents based on a query. A thesaurus of word vectors is formed for the words in the corpus of documents. The word vectors represent global lexical co-occurrence patterns and relationships between word neighbors. Document vectors, which are formed from the combination of word vectors, are in the same multi-dimensional space as the word vectors. A singular value decomposition is used to reduce the dimensionality of the document vectors. A query vector is formed from the combination of word vectors associated with the words in the query. The query vector and document vectors are compared to determine the relevant documents. The query vector can be divided into several factor clusters to form factor vectors. The factor vectors are then compared to the document vectors to determine the ranking of the documents within the factor cluster.
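The pipeline in the abstract can be written compactly as follows. This is a sketch under stated assumptions, not the patent's own notation: $\mathbf{w}_t$ is the thesaurus vector of word $t$, $D_j$ is the collection of word occurrences in document $j$, $Q$ is the collection of words in the query, $A$ is the matrix whose dimensionality is being reduced, $n$ is its original number of dimensions, and $k$ is the number of dimensions retained. The abstract does not name a specific comparison measure; the cosine shown in the last line is one common choice and is an assumption here.

```latex
\begin{aligned}
\mathbf{d}_j &= \sum_{t \in D_j} \mathbf{w}_t, \qquad \mathbf{q} = \sum_{t \in Q} \mathbf{w}_t \\
A &= U \Sigma V^{\top} \approx U_k \Sigma_k V_k^{\top}, \qquad k \ll n \\
\operatorname{corr}(\mathbf{q}, \mathbf{d}_j) &= \frac{\mathbf{q} \cdot \mathbf{d}_j}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d}_j \rVert}
\end{aligned}
```

Because document and query vectors are both sums of word vectors, they live in the same space and can be compared directly.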
47 Claims
1. A method, using a processor and memory, for generating a thesaurus of word vectors based on lexical co-occurrence of words within documents of a corpus of documents, the corpus stored in the memory, the method comprising:
retrieving into the processor a retrieved word from the corpus;
recording a number of times a co-occurring word co-occurs in a same document within a predetermined range of the retrieved word;
repeating the recording step for every co-occurring word located within the predetermined range for each occurrence of the retrieved word in the corpus;
generating a word vector for the word based on every recorded number;
repeating the retrieving, recording, recording repeating and generating steps for each word in the corpus, and
storing the generated word vectors in the memory as the thesaurus.
Dependent claims: 2-24
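The steps of claim 1 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `build_thesaurus`, the `window` parameter (standing in for the "predetermined range"), whitespace tokenization, and the use of sparse `Counter` objects as word vectors are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_thesaurus(documents, window=2):
    """Map each word to a sparse co-occurrence vector (a Counter).

    `window` is a hypothetical stand-in for the claim's "predetermined
    range": neighbors up to `window` positions before or after a word,
    within the same document, are counted.
    """
    thesaurus = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()  # assumption: naive whitespace tokenizer
        for i, word in enumerate(tokens):
            vec = thesaurus[word]  # ensures every word gets an entry
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1  # record one co-occurrence
    return dict(thesaurus)


corpus = [
    "the tank fired on the enemy",
    "the fish tank needs clean water",
]
thesaurus = build_thesaurus(corpus, window=1)
```

With `window=1`, the vector for "tank" counts only its immediate neighbors across both documents, so it mixes the military and aquarium senses of the word.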
25. A method, using a processor and memory, for determining relevant documents in a corpus of documents, the memory including a thesaurus of word vectors and a context vector for each document, the word vectors based on co-occurrence of words within each of the documents of the corpus of documents and each context vector based on the word vectors from the thesaurus for each word located in the corresponding document, the method comprising:
inputting a query;
generating a context vector for the query based on the word vectors from the thesaurus for each word in the query;
determining a correlation coefficient for each document based on the context vectors for that document and the query;
ranking each document based on the determined correlation coefficients; and
outputting the ranking for at least one of the documents.
Dependent claims: 26-31
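The retrieval steps of claim 25 can be illustrated as follows. This is a sketch under assumptions: the claim does not name the correlation coefficient, so cosine similarity is used here as one plausible choice; the names `context_vector`, `cosine`, and `rank_documents` are hypothetical; and the tiny `thesaurus` is hand-written for the example rather than derived from a real corpus.

```python
import math
from collections import Counter


def context_vector(text, thesaurus):
    """Sum the word vectors of every known word in the text."""
    total = Counter()
    for word in text.lower().split():
        total.update(thesaurus.get(word, {}))  # unknown words add nothing
    return total


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed correlation measure)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(query, documents, thesaurus):
    """Return (score, document index) pairs, best match first."""
    q = context_vector(query, thesaurus)
    scored = [
        (cosine(q, context_vector(doc, thesaurus)), i)
        for i, doc in enumerate(documents)
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical hand-written thesaurus, for illustration only.
thesaurus = {
    "tank": {"army": 2, "water": 1},
    "fish": {"water": 3},
    "soldier": {"army": 3},
    "aquarium": {"water": 2, "glass": 1},
}
documents = ["soldier tank", "fish aquarium"]
ranking = rank_documents("fish", documents, thesaurus)
```

Here the second document ranks first for the query "fish" because its context vector is dominated by the same "water" dimension as the query vector.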
32. A method, using a processor and memory, for determining relevant documents in a corpus of documents, the memory including a thesaurus of word vectors and a context vector for each document, the word vectors based on co-occurrence of words within each of the documents of the corpus of documents and each context vector based on the word vectors from the thesaurus for each word located in the corresponding document, the method comprising:
inputting a query;
generating a factor vector for the query based on a clustering of the word vectors of the query;
determining a correlation coefficient for each document based on the factor vector and the context vector for that document;
ranking each document within a factor cluster based on the determined correlation coefficients; and
determining a maximum rank of each document based on a combination of the ranks of the documents in each factor cluster; and
outputting a final rank for at least one of the documents.
Dependent claims: 33-38
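The factor-based ranking of claim 32 might be sketched as below. Several choices are assumptions rather than anything the claim specifies: the greedy threshold clustering in `factor_vectors` is a simple stand-in for whatever clustering the patent uses, cosine similarity stands in for the correlation coefficient, and the "maximum rank" is interpreted as the best (numerically smallest) rank a document achieves in any factor cluster. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed correlation measure)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def factor_vectors(query, thesaurus, threshold=0.5):
    """Greedily group query word vectors; each group sum is a factor vector."""
    factors = []
    for word in query.lower().split():
        vec = Counter(thesaurus.get(word, {}))
        if not vec:
            continue  # word not in the thesaurus
        for factor in factors:
            if cosine(factor, vec) >= threshold:
                factor.update(vec)  # merge into the first similar factor
                break
        else:
            factors.append(vec)  # start a new factor cluster
    return factors


def best_ranks(query, doc_vectors, thesaurus):
    """Rank documents per factor, then keep each document's best rank."""
    best = {}
    for factor in factor_vectors(query, thesaurus):
        order = sorted(
            range(len(doc_vectors)),
            key=lambda i: (-cosine(factor, doc_vectors[i]), i),
        )
        for rank, doc_id in enumerate(order, start=1):
            best[doc_id] = min(rank, best.get(doc_id, rank))
    return best


# Hypothetical thesaurus and precomputed document context vectors.
thesaurus = {"soldier": {"army": 3}, "fish": {"water": 3}}
doc_vectors = [
    {"army": 5, "water": 1},
    {"water": 5, "glass": 1},
    {"glass": 4},
]
ranks = best_ranks("soldier fish", doc_vectors, thesaurus)
```

In this toy example the query splits into two factors, and the first two documents each reach rank 1 in one of them, so a document matching only one sense of the query is not penalized by the other sense.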
39. An apparatus for generating a thesaurus of word vectors for each word in a corpus of documents, the word vectors being based on the lexical co-occurrence of words within each of the documents, comprising:
a memory containing the corpus of documents;
an extractor for retrieving a word from the corpus;
a counter recording the number of times the word co-occurs with a co-occurring word located within a predetermined range, the co-occurring word being any word located before and after the word within the predetermined range, the counter recording the number for every co-occurring word within the predetermined range;
a generator generating a word vector for the word based on every recorded number; and
an output for outputting the word vectors in the thesaurus.
40. An apparatus for retrieving relevant documents from a corpus of documents based on a query, comprising:
a memory containing the corpus of documents;
a thesaurus of word vectors for each word of the corpus, the word vectors being based on lexical co-occurrence of words within each of the documents;
a processor for generating document vectors and a query vector based on the word vectors, each document vector being a summation of the word vectors that are associated with the words located within the document;
the query vector being a summation of the word vectors that are associated with the words located within the query;
determining means for determining a co-occurrence correlation relationship between the document vectors and the query vector; and
output means for outputting the relevant documents based on the correlation relationship determined by the determining means.
Dependent claims: 41-47
Specification