Method and system for optimally searching a document database using a representative semantic space

US 6,847,966 B1
Filed: 04/24/2002
Issued: 01/25/2005
Est. Priority Date: 04/24/2002
Status: Expired due to Term

Abstract

A term-by-document matrix is compiled from a corpus of documents representative of a particular subject matter; the matrix represents the frequency of occurrence of each term per document. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix, forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-document index. The k largest singular concept values are kept and all others are set to zero, thereby reducing the concept dimensions in the term vector matrix and the singular value concept matrix. The reduced term vector matrix, reduced singular value concept matrix and weighted term-document dictionary can be used to project pseudo-document vectors, representing documents not appearing in the original document corpus, into a representative semantic space. The similarities of those documents can be ascertained from the position of their respective pseudo-document vectors in the representative semantic space.
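
The pipeline summarized in the abstract can be illustrated with a short NumPy sketch. It is a minimal illustration under stated assumptions, not the patented implementation: the function names (`build_semantic_space`, `project`, `cosine`), the toy term counts, and the smoothed inverse-document-frequency global weight are all hypothetical choices, since the abstract does not fix a weighting formula.

```python
import numpy as np

def build_semantic_space(counts, k):
    """Derive a k-concept space from a t x d matrix of raw term counts."""
    d = counts.shape[1]
    # Global weight per term (assumed here: smoothed inverse document frequency).
    doc_freq = np.count_nonzero(counts, axis=1)
    global_w = np.log((1 + d) / (1 + doc_freq)) + 1.0
    weighted = counts * global_w[:, None]
    # Singular value decomposition: weighted = T @ diag(s) @ Dt.
    T, s, _ = np.linalg.svd(weighted, full_matrices=False)
    # Keep the k largest singular values and the matching term columns.
    return T[:, :k], s[:k], global_w

def project(term_counts, T_k, s_k, global_w):
    """Project a document's term counts into the semantic space."""
    a_d = term_counts * global_w      # apply the stored global weights
    return (a_d @ T_k) / s_k          # same as a_d^T @ T_k @ inv(diag(s_k))

def cosine(u, v):
    """Cosine of the angle between two pseudo-document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy representative corpus. Rows: cat, dog, pet, stock, market. Columns: 4 documents.
corpus_counts = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)
T_k, s_k, global_w = build_semantic_space(corpus_counts, k=2)

# Three documents that are not part of the corpus.
pets_a = project(np.array([1, 0, 1, 0, 0], dtype=float), T_k, s_k, global_w)
pets_b = project(np.array([0, 1, 1, 0, 0], dtype=float), T_k, s_k, global_w)
finance = project(np.array([0, 0, 0, 1, 1], dtype=float), T_k, s_k, global_w)

sim_pets = cosine(pets_a, pets_b)     # close to 1: both load on the same concept
sim_cross = cosine(pets_a, finance)   # close to 0: the documents share no concept
```

Note that only `T_k`, `s_k` and `global_w` are needed to place new documents in the space; the document-side factor of the decomposition is never used after the space is built.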

100 Claims

1. A method for comparing documents for similarity by representative latent semantic analysis comprising:
- generating a weighted term-by-document matrix for a group of selected documents, said group of selected documents being representative of a similarity criteria;
  
  decomposing said weighted term-by-document matrix into a term matrix of terms occurring in said group of selected documents and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said selected documents;
  
  generating a first pseudo-document vector utilizing a first document and said reduced concepts matrix;
  
  generating a second pseudo-document vector utilizing a second document and said reduced concepts matrix;
  
  comparing said first pseudo-document vector to said second pseudo-document vector; and
  
  determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
- - 2. The method recited in claim 1 above wherein generating the first pseudo-document vector further comprises utilizing the term matrix.
  - 3. The method recited in claim 1 above wherein generating the second pseudo-document vector further comprises utilizing the term matrix.
  - 4. The method recited in claim 1 above wherein generating the first and the second pseudo-document vectors comprise utilizing the term matrix.
  - 5. The method recited in claim 1 above wherein the similarity criteria is one of document language, document type, subject matter and a category of a subject matter.
  - 6. The method recited in claim 1 above wherein generating the weighted term-by-document matrix further comprises:
    - compiling the group of selected documents representative of a similarity criteria;
      
      identifying terms occurring in the group of documents;
      
      finding a global term weight for each identified term based on a frequency of occurrences for said each term in the group of documents;
      
      finding a local term weight for each identified term based on a frequency of occurrences for said each term in each document in the group of documents;
      
      organizing said local term weights in a term-by-document matrix for the identified terms; and
      
      creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
  - 7. The method recited in claim 6 above, wherein identifying terms occurring in the group of documents further comprises eliminating stop words from the identified terms.
  - 8. The method recited in claim 7 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix S and the transpose of an other matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said other matrix is a v row by k column matrix.
  - 9. The method recited in claim 8 above, wherein generating a first pseudo-document vector D further comprises:
    - finding the product of a transpose of a term vector $A_d$, said term matrix $T$, and an inverse of said reduced concepts matrix $S$, such that $D = A_d^{T} T S^{-1}$, wherein said term vector $A_d$ being formed from identified terms occurring in said first document and said identified occurring terms being weighted locally based on frequency of occurrence in said first document and scaled based on the global weight of said each of said identified occurring term.
  - 10. The method recited in claim 9 above further comprises:
    - compiling a collection of documents;
      
      generating a pseudo-document vector for each of the documents in the collection of documents, each of said pseudo-document vectors being formed utilizing a document from the collection of documents, said reduced concepts matrix and said term matrix;
      
      comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
      
      determining similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for the documents in the collection of documents.
  - 11. The method recited in claim 10 further comprises:
    - constructing an index of term-to-document mappings, each of said mappings correlate a term occurring in the collection of documents to documents in which said term occurs; and
      
      finding a subset of documents in the collection of document in which a query term occurs by searching the inverted index for the query term.
  - 12. The method recited in claim 11 further comprises:
    - generating a pseudo-document vector for each document in the subset of documents utilizing a document from the subset, said term matrix, and said reduced concepts matrix;
      
      comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents; and
      
      determining document similarity of each of the documents in the subset of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents.
  - 13. The method recited in claim 12 above, wherein the first document is a list of terms.
  - 14. The method recited in claim 12 above, wherein the first document is a query document.
  - 15. The method recited in claim 10 further comprises:
    - adding a document to the collection of documents; and
      
      generating a pseudo-document vector for the added document, said added pseudo-document vector being formed utilizing said added document, said reduced concepts matrix and said term matrix.
  - 16. The method recited in claim 15 above further comprises:
    - comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
      
      determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
  - 17. The method recited in claim 10 further comprises:
    - deleting a document from the collection of documents; and
      
      eliminating a pseudo-document vector for the deleted document.
  - 18. The method recited in claim 17 above further comprises:
    - comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
      
      determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
  - 19. The method recited in claim 1 above, wherein decomposing said weighted term-by-document matrix further comprises:
    - establishing a predetermined quantity of strong concepts;
      
      deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
      
      ceasing decomposition.
  - 20. The method recited in claim 1 above, wherein decomposing said weighted term-by-document matrix further comprises applying a singular value decomposition algorithm to said weighted term-by-document matrix.
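
The weighting steps recited in claims 6 and 7 (stop-word removal, a local weight per term per document, and a global weight per term across the group) can be sketched as follows. The claims do not name specific formulas, so this sketch assumes the log-entropy scheme that is common in latent semantic analysis; the stop-word list, `tokenize`, and `weighted_term_by_document` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def tokenize(text):
    """Lower-case the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def weighted_term_by_document(docs):
    """Return (terms, global weights, weighted term-by-document matrix)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    n = len(docs)
    global_w = {}
    for term in terms:
        total = sum(c[term] for c in counts)
        # Assumed global weight: 1 + sum(p * log p) / log n (entropy weighting).
        entropy = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total
                entropy += p * math.log(p)
        global_w[term] = 1.0 + entropy / math.log(n) if n > 1 else 1.0
    # Assumed local weight: log(1 + term frequency), scaled by the global weight.
    matrix = [[math.log(1 + c[term]) * global_w[term] for c in counts]
              for term in terms]
    return terms, global_w, matrix

docs = [
    "The cat sat in the garden",
    "The cat ran to the garden",
    "Stock market report",
]
terms, global_w, matrix = weighted_term_by_document(docs)
```

Under this assumed scheme a term confined to a single document keeps a global weight of 1, while a term spread evenly over every document is driven toward 0, which is one way to make the global weight depend on frequency of occurrence across the group.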

21. A method for comparing documents for similarity by representative latent semantic analysis comprising:
- selecting documents based on a similarity criteria;
  
  generating a weighted term-by-document matrix for said selected documents;
  
  decomposing said weighted term-by-document matrix into a reduced concepts matrix, a term matrix, and a document matrix;
  
  discarding said document matrix;
  
  generating a first pseudo-document vector utilizing a first document, said reduced concepts matrix and said term matrix;
  
  generating a second pseudo-document vector utilizing a second document, said reduced concepts matrix and said term matrix;
  
  comparing said first pseudo-document vector to said second pseudo-document vector, and determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
- - 22. The method recited in claim 21 above, wherein decomposing said weighted term-by-document matrix further comprises applying a singular value decomposition algorithm to said weighted term-by-document matrix.
  - 23. The method recited in claim 21 above, wherein decomposing said weighted term-by-document matrix further comprises:
    - establishing a predetermined quantity of strong concepts;
      
      deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
      
      ceasing decomposition.
  - 24. The method recited in claim 23 above, wherein the predetermined quantity of strong concepts is k number of concepts.
  - 25. The method recited in claim 24 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix, and the transpose of an other matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said other matrix is a v row by k column matrix.
  - 26. The method recited in claim 21 above wherein the similarity criteria is one of document language, document type, subject matter and subject matter category.
  - 27. The method recited in claim 21 above wherein said selected documents exclude said first document and further exclude said second document.
  - 28. The method recited in claim 21 above wherein generating the weighted term-by-document matrix further comprises:
    - identifying terms occurring in the selected documents;
      
      finding a global term weight for each identified term based on a frequency of occurrences for said each identified term in the selected documents;
      
      finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the selected documents;
      
      organizing said local term weights in a term-by-document matrix for the identified terms; and
      
      creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
  - 29. The method recited in claim 28 above, wherein identifying terms occurring in the group of documents further comprises eliminating stop words from the identified terms.
  - 30. The method recited in claim 29 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix, and the transpose of a document matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said matrix is a d row by k column matrix.
  - 31. The method recited in claim 30 above, wherein generating a first pseudo-document vector $D_1$ further comprises:
    - finding the product of a transpose of a term vector $A_{d1}$, said term matrix $T$, and an inverse of said reduced concepts matrix $S$, such that $D_1 = A_{d1}^{T} T S^{-1}$, wherein said term vector $A_{d1}$ being formed from identified terms occurring in said first document, wherein said identified occurring terms are terms being weighted locally based on frequency of occurrence in said first document and scaled based on the global weight of said each of said identified occurring term.
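
Claim 21 discards the document matrix, and claim 31 builds each pseudo-document vector from the term matrix and the inverse of the reduced concepts matrix alone. The NumPy sketch below suggests why nothing is lost: folding a corpus column back in through that product reproduces the corresponding row of the discarded document matrix. The random stand-in matrix, the variable names, and the helper `fold_in` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))    # stand-in weighted term-by-document matrix (t=6, d=4)
k = 2                     # number of strong concepts to keep

# Decompose A into term matrix, singular values, and document matrix.
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k = T[:, :k]            # t x k term matrix
s_k = s[:k]               # diagonal of the k x k reduced concepts matrix
D_k = Dt[:k, :].T         # d x k document matrix, kept here only for comparison

def fold_in(a_d, T_k, s_k):
    """Compute a_d^T @ T_k @ inv(diag(s_k)) for one weighted term vector."""
    return (a_d @ T_k) / s_k

# Folding each corpus column back in recovers the matching document-matrix row.
recovered = np.array([fold_in(A[:, j], T_k, s_k) for j in range(A.shape[1])])

# A document outside the corpus is projected in exactly the same way.
new_vec = fold_in(rng.random(6), T_k, s_k)
```

Dividing by `s_k` is used instead of forming an explicit inverse because the reduced concepts matrix is diagonal; this is a numerical convenience rather than something the claims require.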

32. A method for comparing documents for similarity by representative latent semantic analysis comprising:
- creating a concept database comprising;
  
  compiling a corpus from a plurality of documents representative of a similarity criteria;
  
  identifying terms occurring in the corpus;
  
  finding a global term weight for each identified term based on a frequency of occurrences in the corpus;
  
  finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus;
  
  organizing said local term weights in a term-by-document matrix for the identified terms;
  
  creating a weighted term-by-document matrix by applying global term weights to corresponding local term weights for each of the respective identified terms in the term-by-document matrix;
  
  decomposing the weighted term-by-document matrix into at least a term matrix and a concepts matrix;
  
  forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts; and
  
  forming a reduced term matrix by eliminating indices in said term matrix corresponding to the indices eliminated from the concept matrix;
  
  creating a document database from a collection of documents and said concept database comprising;
  
  compiling said collection of documents;
  
  generating a plurality of pseudo-document vectors utilizing said collection of documents and each of said reduced concepts matrix and said term matrix of said concept database; and
  
  determining document similarity between documents in the collection of documents comprising;
  
  comparing pseudo-document vectors from the document database for respective documents in the collection of documents; and
  
  determining similarity of said respective documents based on the comparison of the respective pseudo-document vectors from the document database.
- - 33. The method recited in claim 32 above wherein the weighted term-by-document matrix is comprised of a product of said term matrix, said concepts matrix, and the transpose of a document matrix, wherein said term matrix is a t row by m column matrix, said concepts matrix is an m row by m column matrix, and said document matrix is a d row by m column matrix.
  - 34. The method recited in claim 33 above wherein forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts further comprises;
    - selecting the k strongest concepts;
      
      eliminating (k+1)th through mth column indices of said concepts matrix; and
      
      eliminating (k+1)th through mth row indices of said concepts matrix, thereby truncating said m row by m column concepts matrix to a k row by k column reduced concepts matrix.
  - 35. The method recited in claim 34 above wherein forming a reduced term matrix by eliminating indices in said concept matrix identifying weak concepts further comprises:
    - eliminating (k+1)th through mth column indices of said term matrix, thereby truncating said t row by m column term matrix to a t row by k column reduced term matrix.
  - 36. The method recited in claim 33 above wherein forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts further comprises;
    - establishing a predetermined quantity of strong concepts;
      
      deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
      
      ceasing decomposition.
  - 37. The method recited in claim 36 above, wherein generating a plurality of pseudo-document vectors further comprises:
    - finding the product of a transpose of a term vector $A_{d_i}$, said term matrix $T$, and an inverse of said reduced concepts matrix $S$, for each document $d_i$ in the collection of documents such that $D_i = A_{d_i}^{T} T S^{-1}$, said term vector $A_{d_i}$ being formed from identified terms occurring in an i-th document, wherein each of said identified occurring terms are weighted locally based on occurrences in said each document $d_i$ of the collection of documents and scaled based on the global weight of said each of said identified occurring term.
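
The dimension bookkeeping in claims 33 through 35 and the projection in claim 37 can be summarized in one place. In the block below, $A$ denotes the weighted term-by-document matrix and $V$ the document matrix; both symbols are assumptions introduced here, since the claims leave those matrices unnamed, while $T$, $S$, $A_{d_i}$ and $D_i$ follow the claims.

```latex
\begin{aligned}
A_{t \times d} &= T_{t \times m}\; S_{m \times m}\; V_{d \times m}^{\mathsf{T}}
  && \text{(claim 33)} \\
S_{m \times m} &\longrightarrow S_{k \times k}, \qquad
T_{t \times m} \longrightarrow T_{t \times k}
  && \text{(claims 34 and 35)} \\
D_i &= A_{d_i}^{\mathsf{T}}\; T_{t \times k}\; S_{k \times k}^{-1}
  && \text{(claim 37)}
\end{aligned}
```

Since $A_{d_i}^{\mathsf{T}}$ is $1 \times t$, each pseudo-document vector $D_i$ comes out as a $1 \times k$ row, one coordinate per retained concept.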

38. A method for comparing documents for similarity by representative latent semantic analysis comprising:
- defining a representative semantic space;
  
  creating a first pseudo-document vector for a first document using at least a portion of the definition for the representative semantic space;
  
  creating a second pseudo-document vector for a second document using at least a portion of the definition for the representative semantic space;
  
  comparing the first pseudo-document vector to the second pseudo-document vector, and determining similarity of the first document and the second document based on the comparison of the first pseudo-document vector to the second pseudo-document vector.
- - 39. The method recited in claim 38 above, wherein defining a representative semantic space further comprises:
    - determining a similarity criteria;
      
      selecting parameters for said similarity criteria;
      
      compiling a corpus of documents representative of said similarity criteria based on said selected parameters, wherein said corpus of documents excludes said first and said second documents; and
      
      deriving concepts from the corpus of documents representative of said similarity criteria.
  - 40. The method recited in claim 39 above, wherein deriving concepts from the corpus of documents further comprises:
    - identifying terms occurring in the corpus of documents;
      
      finding a global term weight for each identified term based on a frequency of occurrences for said each term in the corpus of documents;
      
      finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus of documents;
      
      organizing said local term weights in a term-by-document matrix for the identified terms;
      
      creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix; and
      
      decomposing the weighted term-by-document matrix into concepts representative of said similarity criteria.
  - 41. The method recited in claim 40 above wherein said concepts representative of said similarity criteria further comprise a concept database including a concept index and a term index.
  - 42. The method recited in claim 41 above, wherein said concept index and term index being organized as a concept matrix and a term matrix respectively.
  - 43. The method recited in claim 42 above, wherein said concept matrix and said term matrix being products of the decomposition of said weighted term-by-document matrix, said decomposition comprises:
    - establishing a predetermined quantity of strongest concepts;
      
      decomposing said weighted term-by-document matrix into said term matrix, said concept matrix and a document matrix;
      
      deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
      
      ceasing decomposition.
  - 44. The method recited in claim 43 above, wherein creating a first pseudo-document vector for a first document using at least a portion of the definition for the representative semantic space further comprises:
    - creating a first document vector for the first document based on the co-occurrence of terms in the first document and the corpus of documents, the local term weight of each co-occurring term based on a frequency of occurrences in said first document and a global term weight for each co-occurring term based on a frequency of occurrences for said each term in the corpus of documents; and
      
      transposing said first document vector to a first pseudo-document vector using a product of said concept matrix and said term matrix.

45. A method for comparing resumès for similarity by representative latent semantic analysis comprising;
- generating a weighted term-by-resumè matrix for a group of resumès;
  
  decomposing said weighted term-by-resumè matrix into a term matrix of terms occurring in said group of resumès and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said resumès;
  
  identifying a collection of resumès, wherein at least one resumè in the collection of resumès is not included in said group of resumès;
  
  generating a plurality of pseudo-resumè vectors utilizing said collection of resumès and said reduced concepts matrix;
  
  generating a query pseudo-document vector utilizing a query document and said reduced concepts matrix;
  
  comparing said query pseudo-document vector to said plurality of pseudo-resumè vectors; and
  
  ranking said plurality of resumè based on similarity between said query pseudo-document vector to said plurality of pseudo-resumè vectors.
- - 46. The method recited in claim 45 above wherein generating the pseudo-document vector further comprises utilizing the term matrix.
  - 47. The method recited in claim 45 above wherein generating the plurality of pseudo-resumè vectors further comprises utilizing the term matrix.
  - 48. The method recited in claim 45 above wherein generating the query pseudo-document vector and the plurality of pseudo-resumè vectors comprise utilizing the term matrix.
  - 49. The method recited in claim 45 above wherein said query document is a resumè.
  - 50. The method recited in claim 45 above wherein said query document is a list of query terms.
  - 51. The method recited in claim 45 above wherein generating the weighted term-by-resumè matrix for a group of resumès further comprises;
    - selecting the group of resumès;
      
      identifying terms occurring in the group of resumès;
      
      finding a global term weight for each term occurring in the group of resumès based on a frequency of occurrences in the group of resumès;
      
      finding a local term weight for each term identified as occurring in the group of resumès based on a frequency of occurrences in each of the selected resumès of the group of resumès;
      
      creating a term-by-document matrix of local term weights for the terms identified as occurring in the group of resumès; and
      
      creating a weighted term-by-document matrix by applying a global term weight to corresponding local term weight for each of the respective terms in the term-by-document matrix.
  - 52. The method recited in claim 51 further comprises:
    - adding a resumè to the collection of resumès; and
      
      generating a pseudo-resumè vector for said added resumè, said added pseudo-resumè vector being formed by utilizing said added resumè, said reduced concepts matrix and said term matrix.
  - 53. The method recited in claim 52 above further comprises:
    - comparing said query pseudo-document vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès; and
      
      determining similarity of each of the resumès in the collection of resumès to said query document based on the comparison of said query pseudo-resumè vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès.
  - 54. The method recited in claim 45 further comprises:
    - deleting a resumè from the collection of resumès; and
      
      eliminating a pseudo-resumè vector for the deleted resumè.
  - 55. The method recited in claim 54 above further comprises:
    - comparing said query pseudo-document vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès; and
      
      determining similarity of each of the resumès in the collection of resumès to said query document based on the comparison of said query pseudo-document vector to each of the pseudo-resumè vectors for each of the resumès in the collection of resumès.
  - 56. The method recited in claim 45 further comprises:
    - constructing an index of term-to-resumè mappings, each of said mappings correlate a term occurring in the collection of resumès to resumès in which said each term occurs; and
      
      finding a subset of resumès in the collection of resumè in which a query term occurs by searching the inverted index for the query term.
  - 57. The method recited in claim 56 further comprises:
    - generating a pseudo-resumè vector for each resumè in the subset of resumès utilizing a resumè from the subset, said term matrix, and said reduced concepts matrix;
      
      comparing said first pseudo-resumè vector to each of the pseudo-resumè vectors pseudo-resumè vectors for each of the resumès in the subset of resumès; and
      
      determining resumè similarity of each of the resumès in the subset of resumès to said first resumè based on the comparison of said first pseudo-resumè vector to each of the pseudo-resumè vectors for each of the resumès in the subset of resumès.
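
Claims 56 and 57 combine a term-level inverted index with the semantic comparison: the index first narrows the collection to resumès that contain a query term, and only that subset is compared by pseudo-resumè vector. A minimal sketch of this two-stage search follows. The helper names, the whitespace tokenizer, the sample resumè texts, and the hard-coded two-dimensional vectors are all hypothetical stand-ins; in practice the vectors would come from projecting each resumè into the semantic space.

```python
from collections import defaultdict
import numpy as np

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def rank_candidates(query_vec, doc_vecs, candidate_ids):
    """Rank only the prefiltered candidates by cosine similarity to the query."""
    scores = []
    for doc_id in candidate_ids:
        v = doc_vecs[doc_id]
        denom = np.linalg.norm(query_vec) * np.linalg.norm(v)
        score = float(query_vec @ v / denom) if denom else 0.0
        scores.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scores, reverse=True)]

# Hypothetical collection and made-up pseudo-resumè vectors (k = 2).
resumes = {
    "r1": "python developer numpy",
    "r2": "java developer",
    "r3": "python data analyst",
}
pseudo_vectors = {
    "r1": np.array([0.9, 0.1]),
    "r2": np.array([0.1, 0.9]),
    "r3": np.array([0.7, 0.3]),
}

index = build_inverted_index(resumes)
candidates = index.get("python", set())   # stage 1: keyword prefilter
query_vec = np.array([1.0, 0.0])          # stand-in query pseudo-document vector
ranked = rank_candidates(query_vec, pseudo_vectors, candidates)  # stage 2
```

The prefilter trades recall for speed: a resumè that is semantically close to the query but lacks the literal query term never reaches the ranking stage, which is presumably why the claims present the index as an optional refinement rather than part of the base method.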

58. A method for comparing patent disclosures for similarity by representative latent semantic analysis comprising:
- generating a weighted term-by-patent matrix for a group of patent disclosures;
  
  decomposing said weighted term-by-patent disclosure matrix into a term matrix of terms occurring in said group of patent disclosures and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said patent disclosures;
  
  generating a plurality of pseudo-patent vectors utilizing said reduced concepts matrix and a plurality of patent disclosures;
  
  generating a query pseudo-document vector utilizing a query document and said reduced concepts matrix;
  
  comparing said query pseudo-document vector to said plurality of pseudo-patent disclosure vectors; and
  
  ranking said plurality of patent disclosures based on similarity between said query pseudo-document vector to said plurality of pseudo-patent vectors.
- - 59. The method recited in claim 58 above wherein generating the pseudo-document vector further comprises utilizing the term matrix.
  - 60. The method recited in claim 58 above wherein generating the plurality of pseudo-patent vectors further comprises utilizing the term matrix.
  - 61. The method recited in claim 58 above wherein generating the query pseudo-document vector and the plurality of pseudo-patent vectors comprise utilizing the term matrix.
  - 62. The method recited in claim 58 above wherein said query document is a patent disclosure.
  - 63. The method recited in claim 58 above wherein said query document is a patent application.
  - 64. The method recited in claim 58 above wherein said query document is a list of query terms.
  - 65. The method recited in claim 58 above wherein generating the weighted term-by-patent matrix for a group of patent disclosures further comprises:
    - selecting the group of patent disclosures;
      
      identifying terms occurring in the group of patent disclosures;
      
      finding a global term weight for each term occurring in the group of patent disclosures based on a frequency of occurrences in the group of patent disclosures;
      
      finding a local term weight for each term identified as occurring in the group of patent disclosures based on a frequency of occurrences in each of the selected patent disclosures of the group of patent disclosures;
      
      creating a term-by-document matrix of local term weights for the terms identified as occurring in the group of patent disclosures; and
      
      creating a weighted term-by-document matrix by applying a global term weight to corresponding local term weight for each of the respective terms in the term-by-document matrix.
  - 66. The method recited in claim 65 further comprises:
    - adding a patent disclosure to the collection of patent disclosures; and
      
      generating a pseudo-patent vector for said added patent disclosure, said added pseudo-patent vector being formed utilizing said added patent disclosure, said reduced concepts matrix and said term matrix.
  - 67. The method recited in claim 66 above further comprises:
    - comparing said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures; and
      
      determining similarity of each of the patent disclosures in the collection of patent disclosures to said query document based on the comparison of said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures.
  - 68. The method recited in claim 58 further comprises:
    - deleting a patent from the collection of patent disclosures; and
      
      eliminating a pseudo-patent vector for the deleted patent.
  - 69. The method recited in claim 68 above further comprises:
    - comparing said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures; and
      
      determining similarity of each of the patent disclosures in the collection of patent disclosures to said query document based on the comparison of said query pseudo-document vector to each of the pseudo-patent vectors for each of the patent disclosures in the collection of patent disclosures.
  - 70. The method recited in claim 58 further comprises:
    - constructing an index of term-to-patent mappings, each of said mapping correlates a term occurring in the collection of patent disclosures to patent disclosures in which said each term occurs; and
      
      finding a subset of patent disclosures in the collection of patent in which a query term occurs by searching the inverted index for the query term.
  - 71. The method recited in claim 70 further comprises:
    - generating a pseudo-patent vector for each patent in the subset of patent disclosures utilizing a patent from the subset, said term matrix, and said reduced concepts matrix;
      
      comparing said first pseudo-patent vector to each of the pseudo-patent vectors for each of the patent disclosures in the subset of patent disclosures; and
      
      determining similarity of each of the patent disclosures in the subset of patent disclosures to said first patent based on the comparison of said first pseudo-patent vector to each of the pseudo-patent vectors for each of the patent disclosures in the subset of patent disclosures.
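
Claims 70 and 71 narrow the comparison to the patent disclosures in which a query term actually occurs, using an index of term-to-patent mappings. The following is a minimal sketch of that lookup, not the patented implementation. The function names, the whitespace tokenizer, and the sample patents are assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(collection):
    """Map each term to the set of document ids in which it occurs.

    `collection` is a hypothetical dict of {doc_id: text}; tokenizing on
    whitespace is an assumption, not something the claims specify.
    """
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def candidates(index, query_terms):
    """Return the ids of documents containing at least one query term."""
    found = set()
    for term in query_terms:
        found |= index.get(term.lower(), set())
    return found


# Illustrative data only.
patents = {
    "P1": "semantic search of document database",
    "P2": "engine valve timing control",
    "P3": "latent semantic analysis of patents",
}
index = build_inverted_index(patents)
subset = candidates(index, ["semantic"])
```

Only the documents in `subset` would then need pseudo-patent vectors compared against the query vector, which is the reduction in work that claim 71 relies on.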

72. A system for comparing documents for similarity by representative latent semantic analysis comprising:
- a network containing a plurality of network elements;
  
  a server connected to said network, said server comprising;
  
  an interface for interfacing with the network;
  
  a memory, said memory containing instructions for;
  
  identifying terms occurring in the plurality of documents;
  
  determining a global term weight for each identified term;
  
  determining a local term weight for each identified term;
  
  organizing said local term weights in a term-by-document matrix for the identified terms;
  
  creating a weighted term-by-document matrix for global term weights and local term weight for each of the respective identified terms in the weighted term-by-document matrix;
  
  decomposing the weighted term-by-document matrix into at least a term matrix and a reduced concepts matrix;
  
  generating a pseudo-document vector utilizing a document, said reduced concepts matrix, and said term matrix;
  
  comparing pseudo-document vectors; and
  
  determining similarity of respective documents based on the comparison of their respective pseudo-document vectors; and
  
  a processor for executing instructions from memory.
- View Dependent Claims (73, 74, 75, 76, 77, 78)
- - 73. The system recited in claim 72 above, wherein said server further comprises:
    - a client connected to said network, said client passing the plurality of documents to said server.
  - 74. The system recited in claim 72 above further comprises:
    - a client connected to said network, said client passing a document to said server, said client indicating to said server to generate a pseudo-document vector for said document.
  - 75. The system recited in claim 72 above further comprises:
    - a client connected to said network, said client passing a plurality of documents to said server, said client indicating for said server to generate a plurality of pseudo-document vectors for respective said plurality of documents.
  - 76. The system recited in claim 72 above, wherein said server further comprises:
    - a local interface for passing the plurality of documents to said server.
  - 77. The system recited in claim 72 above, wherein said server further comprises:
    - a local interface for passing a document to said server, said client indicating to said server to generate a pseudo-document vector for said document.
  - 78. The system recited in claim 72 above, wherein said server further comprises:
    - a local interface for passing a plurality of documents to said server, said client indicating for said server to generate a plurality of pseudo-document vectors for respective said plurality of documents.
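
Claim 72 recites comparing pseudo-document vectors and determining similarity from that comparison, without fixing a particular measure. The sketch below assumes cosine similarity, a common choice for vector-space comparison. The function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative pseudo-document vectors in a two-concept space.
doc_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
ranking = rank_by_similarity([2.0, 0.0], doc_vecs)
```

Because cosine similarity ignores vector length, a short query and a long document that point in the same direction of the concept space score as highly similar.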

79. A search engine method comparing documents for similarity by representative latent semantic analysis comprising:
- receiving a corpus of documents;
  
  creating a concept database from said corpus of documents, said concept database including at least a term matrix and a concepts matrix;
  
  receiving a collection of documents, wherein at least one document in said collection of documents is not in said corpus of documents;
  
  creating a document database from said collection of documents and said concept database, said documents database including at least one pseudo-document vector for each of at least two documents in said collection of documents, said at least one pseudo-document vector being created utilizing said each of at least two documents in said collection of documents and each of said reduced concepts matrix and said term matrix of said concept database; and
  
  determining document similarity between at least said each of at least two documents in said collection of documents using said pseudo-document vectors in said document database.
- View Dependent Claims (80)
- - 80. The system recited in claim 79 above, wherein creating a concept database from said corpus of documents further comprises:
    - identifying terms occurring in the corpus of documents;
      
      finding a global term weight for each identified term based on a frequency of occurrences for said each term in the corpus of documents;
      
      finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus of documents;
      
      organizing said local term weights in a term-by-document matrix for the identified terms;
      
      creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix;
      
      determining a number of stronger concepts; and
      
      decomposing the weighted term-by-document matrix into at least a term matrix and a concepts matrix, said concepts matrix having indices for stronger concepts.
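
Claim 80 builds the weighted term-by-document matrix by applying a global term weight to each local term weight. The claims require only that both weights be frequency-based, so the sketch below assumes one common pairing: a logarithmic local weight and an entropy-based global weight. The function name, the tokenizer, and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def weighted_term_document_matrix(docs):
    """Return (terms, global_weights, matrix) for a list of document strings.

    Assumed scheme (not taken from the claims):
      local weight  = log2(1 + tf)
      global weight = 1 + sum_j p_j * log2(p_j) / log2(n), with p_j = tf_j / total
    """
    counts = [Counter(d.lower().split()) for d in docs]
    terms = sorted(set().union(*counts))
    n = len(docs)
    global_w = {}
    for t in terms:
        total = sum(c[t] for c in counts)
        entropy = 0.0
        for c in counts:
            if c[t]:
                p = c[t] / total
                entropy += p * math.log2(p)
        # A term spread evenly over all documents gets weight 0;
        # a term confined to one document gets weight 1.
        global_w[t] = 1.0 + entropy / math.log2(n) if n > 1 else 1.0
    # Rows are terms, columns are documents.
    matrix = [[global_w[t] * math.log2(1 + c[t]) for c in counts] for t in terms]
    return terms, global_w, matrix


terms, global_w, matrix = weighted_term_document_matrix(["alpha beta", "alpha gamma"])
```

In this toy corpus "alpha" occurs equally in both documents, so its global weight is zero and its row contributes nothing to the decomposition, while "beta" and "gamma" keep their full local weights.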

81. A data processing system implemented program product for performing a method for comparing documents for similarity by representative latent semantic analysis, the program product comprising:
- instructions for generating a weighted term-by-document matrix for a group of selected documents, said group of selected documents being representative of a similarity criteria;
  
  instructions for decomposing said weighted term-by-document matrix into a term matrix of terms occurring said group of selected documents and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said selected documents;
  
  instructions for generating a first pseudo-document vector utilizing a first document and said reduced concepts matrix;
  
  instructions for generating a second pseudo-document vector utilizing a second document and said reduced concepts matrix;
  
  comparing said first pseudo-document vector to said second pseudo-document vector; and
  
  instructions for determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
- View Dependent Claims (82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
- - 82. The program product recited in claim 81 above wherein said instructions for generating the first pseudo-document vector further comprises instructions for utilizing the term matrix.
  - 83. The program product recited in claim 81 above wherein said instructions for generating the second pseudo-document vector further comprises instructions for utilizing the term matrix.
  - 84. The program product recited in claim 81 above wherein said instructions for generating the first and the second pseudo-document vectors comprise instructions for utilizing the term matrix.
  - 85. The program product recited in claim 81 above wherein the similarity criteria is one of document language, document type, subject matter and a category of a subject matter.
  - 86. The program product recited in claim 81 above wherein said instructions for generating the weighted term-by-document matrix further comprises:
    - instructions for compiling the group of selected documents representative of a similarity criteria;
      
      instructions for identifying terms occurring in the group of documents;
      
      instructions for finding a global term weight for each identified term based on a frequency of occurrences for said each term in the group of documents;
      
      instructions for finding a local term weight for each identified term based on a frequency of occurrences for said each term in each document in the group of documents;
      
      instructions for organizing said local term weights in a term-by-document matrix for the identified terms; and
      
      instructions for creating a weighted term-by-document matrix by applying a global term weight to a corresponding local term weight for each of the respective identified terms in the term-by-document matrix.
  - 87. The program product recited in claim 86 above, wherein said instructions for identifying terms occurring in the group of documents further comprises instructions for eliminating stop words from the identified terms.
  - 88. The program product recited in claim 87 above wherein said weighted term-by-document matrix is comprised of a product of said term matrix, said reduced concepts matrix and the transpose of an other matrix, wherein said term matrix is a t row by k column matrix, said reduced concepts matrix is a k row by k column matrix, and said other matrix is a v row by k column matrix.
  - 89. The program product recited in claim 88 above, wherein said instructions for generating a first pseudo-document vector D further comprises:
    - instructions for finding the product of a transpose of a term vector A_d, said term matrix T, and an inverse of said reduced concepts matrix S, such that D = A_d^T T S^{-1}, and said term vector A_d being formed from identified terms occurring in said first document, wherein said identified occurring terms are weighted locally based on frequency of occurrence in said first document and scaled by the global weight of said each of said identified occurring term.
  - 90. The program product recited in claim 89 above further comprises:
    - instructions for compiling a collection of documents;
      
      instructions for generating a pseudo-document vector for each of the documents in the collection of documents, each of said pseudo-document vectors being formed utilizing a document from the collection of documents d_i, said reduced concepts matrix S and said term matrix T;
      
      instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
      
      instructions for determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for the documents in the collection of documents.
  - 91. The program product recited in claim 90 further comprises:
    - instructions for constructing an index of term-to-document mappings, each of said mapping correlates a term occurring in the collection of documents to documents in which said each term occurs; and
      
      instructions for finding a subset of documents in the collection of documents in which a query term occurs by searching the inverted index for the query term.
  - 92. The program product recited in claim 91 further comprises:
    - instructions for generating a pseudo-document vector for each document in the subset of documents utilizing a document from the subset, said term matrix, and said reduced concepts matrix;
      
      instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents; and
      
      instructions for determining document similarity of each of the documents in the subset of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the subset of documents.
  - 93. The program product recited in claim 92 above, wherein the first document is a list of terms.
  - 94. The program product recited in claim 92 above, wherein the first document is a query document.
  - 95. The program product recited in claim 90 further comprises:
    - instructions for adding a document to the collection of documents; and
      
      instructions for generating a pseudo-document vector for the added document, said added pseudo-document vector being formed utilizing said added document, said reduced concepts matrix and said term matrix.
  - 96. The program product recited in claim 95 above further comprises:
    - instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
      
      instructions for determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
  - 97. The program product recited in claim 90 further comprises:
    - instructions for deleting a document from the collection of documents; and
      
      instructions for eliminating a pseudo-document vector for the deleted document.
  - 98. The program product recited in claim 97 above further comprises:
    - instructions for comparing said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents; and
      
      instructions for determining document similarity of each of the documents in the collection of documents to said first document based on the comparison of said first pseudo-document vector to each of the pseudo-document vectors for each of the documents in the collection of documents.
  - 99. The program product recited in claim 81 above, wherein said instructions for decomposing said weighted term-by-document matrix further comprises:
    - instructions for establishing a predetermined quantity of strong concepts;
      
      instructions for deriving the predetermined quantity of strong concepts from said weighted term-by-document matrix; and
      
      instructions for ceasing decomposition.
  - 100. The program product recited in claim 81 above, wherein said instructions for decomposing said weighted term-by-document matrix further comprises instructions for applying a singular value decomposition algorithm to said weighted term-by-document matrix.
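
Claim 89 defines the pseudo-document vector as D = A_d^T T S^{-1}, where T is the t-by-k term matrix and S is the k-by-k reduced concepts matrix of claim 88. The sketch below evaluates that product under the assumption that S is diagonal, as it is when it comes from a singular value decomposition (claim 100), so that it can be stored as a list of k singular values. The function name and the numeric values are hypothetical and are not taken from the patent.

```python
def pseudo_document_vector(a_d, T, singular_values):
    """Compute D = A_d^T T S^{-1} for a diagonal S.

    a_d             : weighted term vector of length t
    T               : term matrix as t rows of k columns
    singular_values : the k diagonal entries of S (all nonzero)
    """
    k = len(singular_values)
    # A_d^T T: project the term vector onto each of the k concept columns.
    projected = [sum(a_d[i] * T[i][j] for i in range(len(a_d))) for j in range(k)]
    # Multiplying by S^{-1} divides each coordinate by its singular value.
    return [projected[j] / singular_values[j] for j in range(k)]


# Illustrative values: t = 3 terms, k = 2 concepts.
T = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
S = [2.0, 0.5]
a_d = [4.0, 1.0, 3.0]
D = pseudo_document_vector(a_d, T, S)
```

The same function would serve for a document added to the collection under claim 95, since T and S stay fixed and only a new A_d needs to be projected.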

Current Assignee
Kldiscovery Ontrack, LLC (KLDiscovery, Inc.)
Original Assignee
Engenium Corporation (KLDiscovery, Inc.)
Inventors
Thompson, Kevin B., Sommer, Matthew S.
Primary Examiner(s)
Robinson, Greta
Assistant Examiner(s)
LEWIS, CHERYL RENEA

Application Number

US10/131,888
Time in Patent Office

1,007 Days
Field of Search

707/2, 707/3, 707/4, 707/5, 707/6, 707/100, 704/1, 704/202, 715/513
US Class Current

707/739
CPC Class Codes

G06F 16/3347   using vector based model

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

680 Citations

100 Claims
