System and method for context-dependent probabilistic modeling of words and documents
Abstract
A computer-implemented system and method are disclosed for retrieving documents using context-dependent probabilistic modeling of words and documents. The present invention uses multiple overlapping vectors to represent each document. Each vector is centered on one of the words in the document and consists of that word's local environment, i.e., the words that occur close to it. The vectors are used to build probability models that are used for predictions. In one aspect of the invention, a method of context-dependent probabilistic modeling of documents is provided wherein the text of one or more documents is input into the system, each document including human readable words. Context windows are then created around each word in each document. A statistical evaluation of the characteristics of each window is then generated, where the results of the statistical evaluation are not a function of the order of the appearance of words within each window. The statistical evaluation includes counting the occurrences of particular words and particular documents and tabulating the totals of the counts. The results of the statistical evaluation for each window are then combined. These results are then used for retrieving a document, for extracting features from a document, or for finding a word within a document based on its resulting statistics.
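The windowing and counting steps described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the whitespace tokenizer, the `radius` parameter, the exclusion of the center word from its own window, and the names `context_windows` and `count_statistics` are all assumptions made for illustration.

```python
from collections import Counter


def context_windows(tokens, radius=2):
    """Yield (center_word, neighbours) for every token position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - radius):i]
        right = tokens[i + 1:i + 1 + radius]
        # A Counter is a bag of words, so the order of the neighbours is discarded.
        yield center, Counter(left + right)


def count_statistics(documents, radius=2):
    """Tabulate word-in-window counts per document, plus per-document totals."""
    pair_counts = Counter()  # (word, doc_id) -> occurrences inside windows
    doc_totals = Counter()   # doc_id -> total word occurrences over all windows
    for doc_id, text in documents.items():
        tokens = text.lower().split()  # assumed tokenizer: lowercase + whitespace
        for _center, bag in context_windows(tokens, radius):
            for word, n in bag.items():
                pair_counts[(word, doc_id)] += n
                doc_totals[doc_id] += n
    return pair_counts, doc_totals


docs = {"d1": "the cat sat on the mat", "d2": "the dog sat"}
pair_counts, doc_totals = count_statistics(docs, radius=1)
```

Because each window is stored as a bag of words, permuting the words inside a window leaves every count unchanged, which is the order-independence property the abstract describes.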
27 Claims
1. A computer-implemented method for retrieving documents comprising:
inputting the text of one or more documents, wherein each document includes human readable words;
creating context windows around each said word in each document;
generating a statistical evaluation of the characteristics of all of the windows, wherein the results are not a function of the order of the appearance of words within each window; and
combining the results of the statistical evaluation for each window.
Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 27)
18. A computer system comprising:
storage unit for receiving and storing a plurality of documents, wherein each document includes human readable words;
means for creating context windows around each said word in each document;
means for generating a statistical evaluation of the content of each window, wherein the order of the appearance of words within each window is not used in the statistical evaluation;
means for combining the results of the statistical evaluation for each window; and
means for determining the probabilities of documents having predetermined characteristics based on the combined statistical evaluation for each window.
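Claim 18 does not say how the document probabilities are computed from the combined window evaluations. The following is one possible sketch under stated assumptions: each window is scored with a naive-Bayes-style model (uniform document prior, Laplace smoothing with a hypothetical `alpha`), and the per-window posteriors are combined by simple averaging. All function names are illustrative, not taken from the patent.

```python
import math
from collections import Counter


def windows(tokens, radius=2):
    """Return one bag of neighbouring words per token position."""
    return [Counter(tokens[max(0, i - radius):i] + tokens[i + 1:i + 1 + radius])
            for i in range(len(tokens))]


def train(documents, radius=2):
    """Count how often each word appears inside the windows of each document."""
    counts = {d: Counter() for d in documents}
    for d, text in documents.items():
        for bag in windows(text.lower().split(), radius):
            counts[d].update(bag)
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab


def window_posterior(bag, counts, vocab, alpha=1.0):
    """P(doc | window), assuming a uniform prior and Laplace-smoothed likelihoods."""
    log_scores = {}
    for d, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        log_scores[d] = sum(n * math.log((c[w] + alpha) / total)
                            for w, n in bag.items())
    peak = max(log_scores.values())  # subtract the peak for numerical stability
    exp_scores = {d: math.exp(s - peak) for d, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {d: s / norm for d, s in exp_scores.items()}


def rank_documents(query, counts, vocab, radius=2):
    """Combine the per-window posteriors of a query by averaging them."""
    bags = windows(query.lower().split(), radius)
    combined = Counter()
    for bag in bags:
        for d, p in window_posterior(bag, counts, vocab).items():
            combined[d] += p / len(bags)
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


docs = {"cats": "the cat chased the mouse", "dogs": "the dog chased the ball"}
counts, vocab = train(docs)
ranking = rank_documents("cat and mouse", counts, vocab)
```

Here `ranking` lists the documents from most to least probable for the query. Averaging is only one way to combine windows; multiplying the per-window posteriors would be an equally plausible reading of the claim.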
26. A computer program product comprising:
a computer program storage device;
computer-readable instructions on the storage device for causing a computer to undertake method acts to facilitate retrieving documents, the method acts comprising:
inputting the text of one or more documents, wherein each document includes human readable words;
creating context windows around each said word in each document;
generating a statistical evaluation of the characteristics of each window, wherein the results are not a function of the order of the appearance of words within each window; and
combining the results of the statistical evaluation for each window.
Specification