Method for indexing for retrieving documents using particles

US 8,229,921 B2
Filed: 02/25/2008
Issued: 07/24/2012
Est. Priority Date: 02/25/2008
Status: Expired due to Fees

Abstract

An information retrieval system stores and retrieves documents using particles and a particle-based language model. A set of particles for a collection of documents in a particular language is constructed from training documents such that a perplexity of the particle-based language model is substantially lower than the perplexity of a word-based language model constructed from the same training documents. The documents can then be converted to document particle graphs from which particle-based keys are extracted to form an index to the documents. Users can then retrieve relevant documents using queries also in the form of particle graphs.
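
The flow described in the abstract (particle graph, key extraction, index, retrieval) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the particle inventory `PARTICLES` is a hand-picked toy set, the particles are matched against spelling rather than pronunciation, and the choice of particle unigrams and adjacent pairs as keys is a hypothetical one made only for this example.

```python
from collections import defaultdict

# Hypothetical toy inventory; the patent derives its particle set from training documents.
PARTICLES = {"in", "for", "ma", "tion", "re", "triev", "al", "sys", "tem"}

def particle_graph(text, particles=PARTICLES):
    """Return edges (start, end, particle) over the letters of `text`.

    Whitespace is dropped first, so a particle may span a word boundary.
    """
    s = "".join(text.lower().split())
    edges = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in particles:
                edges.append((i, j, s[i:j]))
    return edges

def extract_keys(edges):
    """Keys are single particles plus pairs of adjacent particles (an assumption)."""
    keys = {(p,) for _, _, p in edges}
    starting_at = defaultdict(list)
    for start, _, p in edges:
        starting_at[start].append(p)
    for _, end, p in edges:
        for q in starting_at[end]:
            keys.add((p, q))
    return keys

def build_index(docs):
    """Map each key to the set of document ids whose particle graph contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for key in extract_keys(particle_graph(text)):
            index[key].add(doc_id)
    return index

def retrieve(index, query):
    """Rank documents by the number of keys shared with the query."""
    scores = defaultdict(int)
    for key in extract_keys(particle_graph(query)):
        for doc_id in index.get(key, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "information retrieval",
    "d2": "retrieval system",
    "d3": "formation",
}
index = build_index(docs)
ranked = retrieve(index, "information system")
```

Counting shared keys is only one possible scoring rule; the abstract does not fix how relevance is computed.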

21 Claims

1. A computer implemented method for indexing and retrieving documents in a database, comprising the steps of:
- constructing a set of particles and a simultaneously optimized particle-based language model using training documents, and in which a perplexity of the particle-based language model is at least ten times lower than the perplexity of a word-based language model constructed from the same training documents, wherein the set of particles applies expectation maximization to an objective function, and where the objective function considers any combination of:
  
  a size of the set of particles;
  
  errors in representing all documents in a document training set and a query training set;
  
  a retrieval accuracy of using the set of particles;
  
  an entropy of a statistical model that represents the set of particles; and
  
  a particle-level language model derived from the documents and the queries in the training sets;
  
  converting each document in a collection of documents to a document particle graph, the document particle graph including particles selected from the set of the particles;
  
  extracting, for each document, a set of document keys from the corresponding particle graph;
  
  storing the document keys for each document in an index to a database storing the collection of documents;
  
  converting a query to a query particle graph including a set of query particles, the query graph including particles selected from the set of the particles;
  
  extracting a set of query keys from the query particle graph;
  
  retrieving relevant documents from the database according to the query keys and the document keys stored in the index; and
  
  outputting the relevant documents to a user.
  - 2. The method of claim 1, in which the set of particles is substantially larger than a number of phonemes in a language of the documents, and substantially smaller than a number of words in the language.
  - 3. The method of claim 1, in which a particular particle traverses word boundaries.
  - 4. The method of claim 1, in which the document and the query are in a form of text words.
  - 5. The method of claim 1, in which the document is in a form of text words, and the query is in a form of spoken words.
  - 6. The method of claim 1, in which the document and the query are in a form of spoken words.
  - 7. The method of claim 1, in which the document is in a form of spoken words, and the query is in a form of text words.
  - 8. The method of claim 1, in which the query is spoken and the query particle graph is a lattice representing alternate sequential groupings of sound sequences in the spoken query.
  - 9. The method of claim 1, in which the set of particles represent all possible sound sequences that can occur in any query.
  - 10. The method of claim 1, in which the set of particles is derived from a pronunciation of any sequence of words from the document.
  - 11. The method of claim 1, in which the set of particles identifies the key in any document that distinguishes the document from other documents.
  - 12. The method of claim 1, in which the document particle graph and the query particle graph are normalized by a spelling-to-pronunciation mechanism.
  - 13. The method of claim 1, in which the particles in the set of particles are acoustically distinct and self-contained.
  - 14. The method of claim 1, in which a predictability of the occurrence of particles must be high.
  - 15. The method of claim 1, in which each particle has a distinctive acoustic structure that distinguishes the particle from all other particles, and has a relatively low acoustic variability between different instances of the same particle.
  - 16. The method of claim 1, in which a predictability of an occurrence of a particular particle is relatively high.
  - 17. The method of claim 1, in which the particle set is determined manually.
  - 18. The method of claim 1, in which the particle set is determined heuristically.
  - 19. The method of claim 1, in which each word in each document is first converted to a phonetic graph that represents all the possible pronunciations of the word, and the phonetic graph is then converted to the document particle set.
  - 20. The method of claim 1, further comprising:
    - rank ordering the relevant documents.
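
Claims 3 and 8 describe particles that cross word boundaries and a query lattice of alternate sequential groupings of a sound sequence. The sketch below enumerates every such grouping for a short phoneme sequence. It is an illustration only: the function name `segmentations`, the ARPAbet-style phoneme labels, and the four-particle inventory are assumptions introduced here, not taken from the patent.

```python
def segmentations(phones, particles):
    """Yield every way to cover the tuple `phones` with particles from the set.

    Each result is one path through the lattice: a list of particles whose
    concatenation equals the full phoneme sequence.
    """
    if not phones:
        yield []
        return
    for end in range(1, len(phones) + 1):
        head = phones[:end]
        if head in particles:
            for rest in segmentations(phones[end:], particles):
                yield [head] + rest

# Hypothetical phoneme string shared by "ice cream" and "I scream".
phones = ("AY", "S", "K", "R", "IY", "M")

# Hypothetical particle inventory; each particle is a tuple of phonemes.
particles = {
    ("AY",),
    ("AY", "S"),
    ("S", "K", "R", "IY", "M"),
    ("K", "R", "IY", "M"),
}

paths = list(segmentations(phones, particles))
```

Both paths cover the same sounds with different groupings, which is why keys drawn from the whole lattice, rather than from a single best path, can match a document regardless of where the word boundary was placed.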

21. An information retrieval system, comprising:
- means for constructing a set of particles and a simultaneously optimized particle-based language model using training documents, and in which a perplexity of the particle-based language model is at least ten times lower than the perplexity of a word-based language model constructed from the same training documents, wherein the set of particles applies expectation maximization to an objective function, and where the objective function considers any combination of:
  
  a size of the set of particles;
  
  errors in representing all documents in a document training set and a query training set;
  
  a retrieval accuracy of using the set of particles;
  
  an entropy of a statistical model that represents the set of particles; and
  
  a particle-level language model derived from the documents and the queries in the training sets;
  
  means for converting each document in a collection of documents to a document particle graph, the document particle graph including particles selected from the set of the particles;
  
  means for extracting, for each document, a set of document keys from the corresponding particle graph;
  
  means for storing the document keys for each document in an index to a database storing the collection of documents;
  
  means for converting a query to a query particle graph including a set of query particles, the query graph including particles selected from the set of the particles;
  
  means for extracting a set of query keys from the query particle graph;
  
  means for retrieving relevant documents from the database according to the query keys and the document keys stored in the index; and
  
  means for outputting the relevant documents to a user.
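
Both independent claims compare the perplexity of a particle-based language model with that of a word-based model. As a reminder of what is being compared, the sketch below computes per-token perplexity, exp(-(1/N) Σ log p(t_i)), for a unigram model. The unigram model, the add-one smoothing, and the function name `unigram_perplexity` are assumptions chosen for brevity; the claims do not specify the model order or the smoothing, and toy data will not reproduce the claimed tenfold ratio.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Per-token perplexity of an add-one-smoothed unigram model.

    The vocabulary is the union of training and test tokens, so unseen test
    tokens still receive non-zero probability (a simplification for a toy).
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log((counts[t] + 1) / total) for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

# The same helper applies to either tokenization: pass word tokens to score
# a word-based model, or particle tokens to score a particle-based model.
ppl = unigram_perplexity(["a", "a", "b", "b"], ["a", "b"])
```

A model that spreads probability evenly over V tokens has perplexity V, so a smaller and more predictable unit inventory tends to lower the figure.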

Current Assignee: Mitsubishi Electric Research Laboratories, Inc. (Mitsubishi Electric Corporation)
Original Assignee: Mitsubishi Electric Research Laboratories, Inc. (Mitsubishi Electric Corporation)
Inventors: Ramakrishnan, Bhiksha; Schmidt-Nielsen, Bent; Weinberg, Garrett; Harsham, Bret A.; Gouvêa, Evandro B.
Primary Examiner(s): Chen, Susan
Application Number: US12/036,681
Publication Number: US 20090216740A1
Time in Patent Office: 1,611 Days
Field of Search: 704/231, 704/251, 704/270
US Class Current: 707/715
CPC Class Codes:
G06F 16/3343 using phonetics
G10L 15/26 Speech to text systems G10L...
