Method and system for optimally searching a document database using a representative semantic space

US 7,483,892 B1
Filed: 01/24/2005
Issued: 01/27/2009
Est. Priority Date: 04/24/2002
Status: Expired due to Term

Abstract

A term-by-document matrix is compiled from a corpus of documents representative of a particular subject matter that represents the frequency of occurrence of each term per document. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-document index. The k largest singular concept values are kept and all others are set to zero thereby reducing to the concept dimensions in the term vector matrix and a singular value concept matrix. The reduced term vector matrix, reduced singular value concept matrix and weighted term-document dictionary can be used to project pseudo-document vectors representing documents not appearing in the original document corpus in a representative semantic space. The similarities of those documents can be ascertained from the position of their respective pseudo-document vectors in the representative semantic space.
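
The steps in the abstract can be sketched in standard latent semantic indexing notation. The symbols below ($W$, $T$, $S$, $D$, $k$, $a_d$) are labels chosen here for illustration, and cosine similarity is one common comparison measure assumed for the last line; the abstract itself does not name one.

```latex
% W: weighted term-by-document matrix (t terms by n corpus documents)
% Singular value decomposition of W:
W = T \, S \, D^{\mathsf{T}}

% Keep the k largest singular values and set the rest to zero:
W_k = T_k \, S_k \, D_k^{\mathsf{T}}, \qquad S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k)

% Project a document outside the corpus, with weighted term vector a_d,
% into the k-dimensional space as a pseudo-document vector:
\hat{d} = a_d^{\mathsf{T}} \, T_k \, S_k^{-1}

% Compare two pseudo-document vectors (cosine similarity, assumed here):
\operatorname{sim}(\hat{d}_1, \hat{d}_2) =
  \frac{\hat{d}_1 \cdot \hat{d}_2}{\lVert \hat{d}_1 \rVert \, \lVert \hat{d}_2 \rVert}
```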

31 Claims

1. A method for comparing documents for similarity comprising:
- generating a first pseudo-document vector utilizing a first document and a term-by-concept matrix determined from a set of documents that does not include the first document;
  
  generating a second pseudo-document vector utilizing a second document and said term-by-concept matrix; and
  
  determining similarity of said first document and said second document based on said first pseudo-document vector and said second pseudo-document vector.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The method of claim 1, further comprising:
    - selecting the set of documents based on at least one similarity criterion, wherein the at least one similarity criterion is one of document language, document type, subject matter and a category of a subject matter.
  - 3. The method of claim 1, further comprising:
    - generating a weighted term-by-document matrix; and
      
      decomposing the weighted term-by-document matrix to obtain the term-by-concept matrix and a reduced singular value matrix S.
  - 4. The method of claim 3, wherein said generating a weighted term-by-document matrix comprises:
    - identifying terms occurring in the set of documents;
      
      finding a global term weight for each identified term based on a frequency of occurrences for said each term in the group of documents;
      
      finding a local term weight for each identified term based on a frequency of occurrences for said each term in each document in the group of documents;
      
      organizing said local term weights in a term-by-document matrix for the identified terms; and
      
      creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
  - 5. The method of claim 4, wherein the first pseudo-document vector is a product of a term vector A_d^T with the term-by-document matrix T and an inverse reduced concepts matrix S⁻¹, said term vector A_d^T being formed from identified terms occurring in said first document, and said identified occurring terms being weighted based on frequency of occurrence in said first document and scaled based on the global weight of said each of said identified occurring term.
  - 6. The method of claim 5, further comprising:
    - compiling a collection of documents;
      
      generating a pseudo-document vector for each of the documents in the collection of documents, each pseudo-document vector being formed using a document from the collection of documents and the term-by-concept matrix; and
      
      determining similarity of each document in the collection of documents to the first document based on a comparison of said first pseudo-document vector to each pseudo-document vector for the documents in the collection of documents.
  - 7. The method of claim 1, further comprising:
    - adding a document to the collection of documents; and
      
      generating a pseudo-document vector for the added document, the added pseudo-document vector being formed using the added document and the term-by-concept matrix.
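
As a rough illustration of claims 1, 4 and 5, the sketch below weights a document's terms, projects the weighted term vector through a term-by-concept matrix and the inverse singular values, and compares two resulting pseudo-document vectors. Everything named here is an assumption made for the example: the function names are hypothetical, the global weight is an idf-style `log(N / df)`, the local weight is a raw count, the comparison is cosine similarity, and `T_k` and `s_k` are hand-written placeholder values rather than the output of a real decomposition.

```python
import math
from collections import Counter

def global_weights(corpus_tokens):
    """Global weight per term: log(N / df), an assumed idf-style choice."""
    n_docs = len(corpus_tokens)
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def weighted_term_vector(tokens, vocab, gweights):
    """Local weight (raw count) scaled by the global weight, in vocab order.
    Terms outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return [counts[t] * gweights[t] for t in vocab]

def pseudo_document(term_vec, T_k, s_k):
    """Project a term vector: d_hat = a^T T_k S_k^{-1}.
    T_k is a terms-by-k list of rows; s_k holds the k singular values."""
    return [
        sum(term_vec[i] * T_k[i][j] for i in range(len(term_vec))) / s_k[j]
        for j in range(len(s_k))
    ]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Reference set of documents used to define the weights and the space.
corpus = [
    ["patent", "claim", "search"],
    ["resume", "skill", "search"],
    ["patent", "matrix"],
]
gw = global_weights(corpus)
vocab = sorted(gw)  # claim, matrix, patent, resume, search, skill

# Placeholder term-by-concept matrix (6 terms x 2 concepts) and singular values.
T_k = [[0.5, 0.1], [0.4, 0.0], [0.6, 0.1], [0.0, 0.7], [0.3, 0.3], [0.1, 0.6]]
s_k = [2.0, 1.5]

# Documents that are not part of the reference set.
doc_a = ["patent", "claim", "claim"]
doc_b = ["resume", "skill"]
doc_c = ["matrix", "patent", "patent"]

vec_a, vec_b, vec_c = (
    pseudo_document(weighted_term_vector(d, vocab, gw), T_k, s_k)
    for d in (doc_a, doc_b, doc_c)
)
print(cosine(vec_a, vec_c), cosine(vec_a, vec_b))
```

With these placeholder values the two patent-related documents score as more similar to each other than to the resume-related one.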

8. A method for searching a collection of documents, the method comprising:
- defining a representative semantic space, wherein the defining includes;
  
  compiling a corpus having a plurality of documents;
  
  identifying terms occurring in the corpus;
  
  constructing a weighted term-by-document matrix from weighted term frequencies of the corpus;
  
  decomposing the weighted term-by-document matrix into at least a term-by-concept matrix and a singular value matrix; and
  
  forming a reduced term-by-concept matrix and a reduced singular value matrix;
  
  compiling a document collection having at least one document that is not in the corpus; and
  
  using the representative semantic space to generate a pseudo-document vector for each document in a set of multiple documents in the document collection.
- View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
- - 9. The method of claim 8, wherein the set of multiple documents comprises all documents in the document collection.
  - 10. The method of claim 8, wherein compiling a corpus includes applying a random selection method to documents in the document collection.
  - 11. The method of claim 8, further comprising:
    - receiving at least one query document;
      
      using the representative semantic space to generate a query pseudo-document vector for the query document; and
      
      comparing the query pseudo-document vector to the pseudo-document vectors for the set of multiple documents in the document collection.
  - 12. The method of claim 11, further comprising:
    - ranking documents in the set of multiple documents based on said comparing the query pseudo-document vector to the pseudo-document vectors.
  - 13. The method of claim 8, further comprising:
    - obtaining a query having search terms;
      
      forming a query vector from the search terms and weighted term frequencies of the corpus;
      
      generating a query pseudo-document vector using the representative semantic space; and
      
      comparing the query pseudo-document vector to the pseudo-document vectors for the set of multiple documents in the document collection.
  - 14. The method of claim 13, further comprising:
    - constructing an index of terms identifying, for each term, documents from the document collection in which that term appears; and
      
      determining the set of multiple documents for which to generate a pseudo-document vector, wherein the determining includes performing a Boolean search of the document collection using the search terms and the index of terms.
  - 15. The method of claim 8, further comprising:
    - constructing an index of terms identifying, for each term, documents from the set of multiple documents in which that term appears;
      
      obtaining a query having search terms;
      
      performing a Boolean search of the set of multiple documents using query search terms and the index of terms to obtain a document subset;
      
      forming a query vector from the search terms and weighted term frequencies of the corpus;
      
      generating a query pseudo-document vector using the representative semantic space; and
      
      ranking documents in the document subset based on comparisons of the query pseudo-document vector to the pseudo-document vectors for the document subset.
  - 16. The method of claim 8, wherein documents are added to the document collection without re-defining the representative semantic space.
  - 17. The method of claim 8, wherein documents are removed from the document collection without re-defining the representative semantic space.
  - 18. The method of claim 8, wherein the document collection comprises resumes.
  - 19. The method of claim 8, wherein the document collection comprises patents or patent applications.
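
Claims 14 and 15 combine a Boolean search over an index of terms with ranking by pseudo-document vector comparison. A minimal sketch of that two-stage flow follows. The function names are hypothetical, the Boolean query is assumed to be a conjunction (AND) of the search terms, cosine similarity is an assumed comparison, and the vectors in `doc_vectors` and `query_vector` are placeholders standing in for vectors already produced by the representative semantic space.

```python
import math

def build_index(docs):
    """Map each term to the set of document ids in which it appears."""
    index = {}
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, terms):
    """Ids of documents containing every search term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(candidates, doc_vectors, query_vector):
    """Order candidate ids by similarity to the query, most similar first."""
    return sorted(
        candidates,
        key=lambda doc_id: cosine(doc_vectors[doc_id], query_vector),
        reverse=True,
    )

docs = {
    "d1": ["patent", "search", "matrix"],
    "d2": ["patent", "search", "resume"],
    "d3": ["resume", "skill"],
}
# Placeholder pseudo-document vectors, assumed to be computed beforehand.
doc_vectors = {"d1": [0.9, 0.1], "d2": [0.5, 0.5], "d3": [0.1, 0.9]}
query_vector = [0.8, 0.2]

index = build_index(docs)
subset = boolean_and(index, ["patent", "search"])  # Boolean stage
ranked = rank(subset, doc_vectors, query_vector)   # semantic ranking stage
print(ranked)
```

Filtering first means only the documents that survive the Boolean stage need to be compared against the query vector.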

20. A system comprising:
- a network;
  
  a server connected to said network, said server comprising;
  
  an interface for interfacing with the network;
  
  a memory, said memory containing instructions for;
  
  identifying terms occurring in a plurality of documents;
  
  determining for each document a local term weight for each identified term;
  
  creating a weighted term-by-document matrix using at least the local term weights for each of the respective identified terms;
  
  decomposing the weighted term-by-document matrix into at least a term-by-concept matrix and a singular value matrix;
  
  using the term-by-concept matrix and the singular value matrix to determine a pseudo-document vector for a document not included in said plurality of documents; and
  
  a processor for executing instructions from memory.
- View Dependent Claims (21)
- - 21. The system of claim 20, further comprising:
    - a client connected to said network, said client passing a plurality of documents to said server, said client indicating for said server to generate a plurality of pseudo-document vectors for said plurality of documents.

22. A program product stored on a computer readable storage medium for performing a method, the program product comprising:
- instructions for generating a first pseudo-document vector using a first document and a term-by-concept matrix determined from a set of documents that does not include the first document;
  
  instructions for generating a second pseudo-document vector using a second document and the term-by-concept matrix; and
  
  instructions for determining similarity of said first document and said second document based on said first pseudo-document vector and said second pseudo-document vector.
- View Dependent Claims (23, 24, 25)
- - 23. The program product of claim 22, further comprising:
    - instructions for compiling a collection of documents;
      
      instructions for generating a pseudo-document vector for each of the documents in the collection of documents, each of said pseudo-document vectors being formed using a document from the collection of documents and the term-by-concept matrix; and
      
      instructions for determining similarity of each document in the collection of documents to said first document based on a comparison of said first pseudo-document vector to each pseudo-document vector for the documents in the collection of documents.
  - 24. The program product of claim 23, wherein the first document is a query document.
  - 25. The program product of claim 23, wherein the collection of documents includes at least one document outside the set of documents used to determine the term-by-concept matrix.

26. A computer implemented search method that comprises:
- applying a representative semantic space to a document set to obtain a set of pseudo-document vectors;
  
  applying the representative semantic space to a query to obtain a query pseudo-document vector; and
  
  ranking documents from the document set by comparing the corresponding pseudo-document vectors to the query pseudo-document vector.
- View Dependent Claims (27, 28, 29, 30, 31)
- - 27. The search method of claim 26, further comprising:
    - before said ranking, performing a Boolean keyword search on the document set to identify a subset of documents to be ranked.
  - 28. The search method of claim 26, further comprising:
    - before said ranking, performing a field search to identify a subset of documents to be ranked.
  - 29. The search method of claim 26, wherein the query comprises a set of keywords.
  - 30. The search method of claim 26, further comprising:
    - defining the representative semantic space based on a subset of the document set, wherein the defining comprises;
      
      constructing a term-by-document matrix having weighted term frequencies for the subset;
      
      decomposing the term-by-document matrix to obtain at least a term-by-concept matrix and a singular value matrix; and
      
      forming a reduced term-by-concept matrix by retaining that portion of the term-by-concept matrix corresponding to a predetermined number of largest singular values.
  - 31. The search method of claim 26, further comprising:
    - adding a document to the document set after the ranking is performed;
      
      applying the representative semantic space to the added document to obtain a corresponding pseudo-document vector; and
      
      ranking the new document by comparing the corresponding pseudo-document vector to the query pseudo-document vector.
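
The reduction step in claim 30, retaining the portion of the term-by-concept matrix that corresponds to a predetermined number of largest singular values, can be sketched as a column selection. The function name `reduce_space` is hypothetical, and the matrix `T` and singular values `s` are placeholder numbers chosen for illustration, not the result of an actual decomposition.

```python
def reduce_space(T, s, k):
    """Keep the columns of T (terms x r) paired with the k largest values in s.

    Returns the reduced term-by-concept matrix and the retained singular
    values, ordered from largest to smallest.
    """
    if not 0 < k <= len(s):
        raise ValueError("k must be between 1 and the number of singular values")
    # Indices of the k largest singular values, largest first.
    keep = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
    T_k = [[row[j] for j in keep] for row in T]
    s_k = [s[j] for j in keep]
    return T_k, s_k

# Placeholder 3-term by 3-concept matrix and unsorted singular values.
T = [[0.6, 0.1, 0.3], [0.5, 0.2, 0.1], [0.1, 0.9, 0.2]]
s = [3.0, 0.4, 1.2]

T_k, s_k = reduce_space(T, s, 2)
print(T_k, s_k)
```

Because the reduced matrices are fixed once computed, a document added later (as in claim 31) can be projected with the same `T_k` and `s_k` without recomputing the space.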

Current Assignee: Kldiscovery Ontrack, LLC (KLDiscovery, Inc.)
Original Assignee: Kroll Ontrack Incorporated (KLDiscovery, Inc.)
Inventors: Thompson, Kevin B.; Sommer, Matthew S.
Primary Examiner(s): LEWIS, CHERYL RENEA
Application Number: US11/041,799
Time in Patent Office: 1,464 Days
Field of Search: 707/1-6, 707/100, 715/500, 715/513, 704/1, 704/202
US Class Current: 1/1
CPC Class Codes:
G06F 16/3347   using vector based model
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access