Method for indexing for retrieving documents using particles
First Claim
1. A computer implemented method for indexing and retrieving documents in a database, comprising the steps of:
constructing a set of particles and a simultaneously optimized particle-based language model using training documents, in which a perplexity of the particle-based language model is at least ten times lower than the perplexity of a word-based language model constructed from the same training documents, wherein the set of particles is constructed by applying expectation maximization to an objective function, and where the objective function considers any combination of:
a size of the set of particles;
errors in representing all documents in a document training set and a query training set;
a retrieval accuracy of using the set of particles;
an entropy of statistical models that represent the set of particles; and
a particle-level language model derived from the documents and the queries in the training sets;
converting each document in a collection of documents to a document particle graph, the document particle graph including particles selected from the set of the particles;
extracting, for each document, a set of document keys from the corresponding particle graph;
storing the document keys for each document in an index to a database storing the collection of documents;
converting a query to a query particle graph including a set of query particles, the query graph including particles selected from the set of the particles;
extracting a set of query keys from the query particle graph;
retrieving relevant documents from the database according to the query keys and the document keys stored in the index; and
outputting the relevant documents to a user.
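The indexing and retrieval steps of the claim can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the claimed method itself: the particle set `PARTICLES` is hand-picked rather than learned by expectation maximization, each particle graph is reduced to a single path produced by greedy longest-match segmentation, keys are particle unigrams and bigrams, and relevance is simply the count of shared keys. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def to_particles(text, particles):
    """Segment each word into particles by greedy longest match.
    Assumption: a single path stands in for the full particle graph."""
    max_len = max(len(p) for p in particles)
    path = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            for n in range(min(max_len, len(word) - i), 0, -1):
                piece = word[i:i + n]
                # Fall back to a single character if no particle matches.
                if n == 1 or piece in particles:
                    path.append(piece)
                    i += n
                    break
    return path

def extract_keys(path, max_n=2):
    """Keys are particle n-grams (as tuples) up to length max_n."""
    keys = set()
    for n in range(1, max_n + 1):
        for i in range(len(path) - n + 1):
            keys.add(tuple(path[i:i + n]))
    return keys

def build_index(docs, particles):
    """Inverted index: key -> set of document ids containing that key."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for key in extract_keys(to_particles(text, particles)):
            index[key].add(doc_id)
    return index

def retrieve(query, index, particles):
    """Rank documents by the number of keys shared with the query."""
    scores = Counter()
    for key in extract_keys(to_particles(query, particles)):
        for doc_id in index.get(key, ()):
            scores[doc_id] += 1
    return [doc_id for doc_id, _ in scores.most_common()]

# Hypothetical particle set and toy collection, for illustration only.
PARTICLES = {"in", "dex", "ing", "re", "triev", "al", "par", "ti", "cle", "s"}
DOCS = {"d1": "indexing particles", "d2": "retrieval"}
index = build_index(DOCS, PARTICLES)
```

In this toy setup, `retrieve("retrieving", index, PARTICLES)` ranks `d2` ahead of `d1`, because the query shares the particles `re` and `triev` (and their bigram) with "retrieval" but only `ing` with "indexing".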
Abstract
An information retrieval system stores and retrieves documents using particles and a particle-based language model. A set of particles for a collection of documents in a particular language is constructed from training documents such that a perplexity of the particle-based language model is substantially lower than the perplexity of a word-based language model constructed from the same training documents. The documents can then be converted to document particle graphs from which particle-based keys are extracted to form an index to the documents. Users can then retrieve relevant documents using queries also in the form of particle graphs.
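To make the perplexity comparison concrete, here is a small sketch of how perplexity might be computed for a token sequence, whether the tokens are words or particles. It assumes a maximum-likelihood unigram model evaluated on its own training tokens; the source does not say which model order or evaluation set is used, so `unigram_perplexity` is purely illustrative.

```python
import math
from collections import Counter

def unigram_perplexity(tokens):
    """Perplexity of a maximum-likelihood unigram model on its own tokens:
    exp(-(1/N) * sum over tokens of log p(token))."""
    counts = Counter(tokens)
    total = len(tokens)
    log_prob = sum(c * math.log(c / total) for c in counts.values())
    return math.exp(-log_prob / total)
```

Four equally frequent distinct tokens give a perplexity of 4, and a sequence that repeats a single token gives 1. A lower value means the model is less uncertain about the next token.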
21 Claims
21. An information retrieval system, comprising:
means for constructing a set of particles and a simultaneously optimized particle-based language model using training documents, in which a perplexity of the particle-based language model is at least ten times lower than the perplexity of a word-based language model constructed from the same training documents, wherein the set of particles is constructed by applying expectation maximization to an objective function, and where the objective function considers any combination of:
a size of the set of particles;
errors in representing all documents in a document training set and a query training set;
a retrieval accuracy of using the set of particles;
an entropy of statistical models that represent the set of particles; and
a particle-level language model derived from the documents and the queries in the training sets;
means for converting each document in a collection of documents to a document particle graph, the document particle graph including particles selected from the set of the particles;
means for extracting, for each document, a set of document keys from the corresponding particle graph;
means for storing the document keys for each document in an index to a database storing the collection of documents;
means for converting a query to a query particle graph including a set of query particles, the query graph including particles selected from the set of the particles;
means for extracting a set of query keys from the query particle graph;
means for retrieving relevant documents from the database according to the query keys and the document keys stored in the index; and
means for outputting the relevant documents to a user.
Specification