
Methods and systems to efficiently find similar and near-duplicate emails and files

  • US 10,083,176 B1
  • Filed: 02/29/2016
  • Issued: 09/25/2018
  • Est. Priority Date: 01/23/2006
  • Status: Active Grant
First Claim

1. A method for generating and using a semantic space in a computer system, comprising:

  • receiving, via at least one computer processor, a plurality of documents and a plurality of random document vectors, wherein each random document vector in the plurality of random document vectors is generated based on random indexing and is associated with a corresponding document in the plurality of documents;

    selecting, via the at least one computer processor, a term for which to generate a term vector;

    for a first document in the plurality of documents, determining, via the at least one computer processor, if the term appears in the document;

    if the term appears in the document:

    determining, via the at least one computer processor, a frequency of the term in the document; and

    adding, via the at least one computer processor, an associated random document vector of the document to the term vector, wherein the associated random document vector is scaled by the term frequency;

    determining, via the at least one computer processor, if the term appears in any remaining documents in the plurality of documents;

    if the term does not appear in any remaining documents in the plurality of documents, generating, via the at least one computer processor, a normalized version of the term vector with the added associated random document vector;

    outputting, via the at least one computer processor, the normalized term vector;

    receiving a query; and

    generating, based at least in part on the semantic space including the normalized version of the term vector, a user interface displaying similar and near-duplicate documents.
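
The term-vector steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the helper names (random_index_vector, build_term_vector), the 16-dimensional vectors with four nonzero entries, the whitespace tokenization, and the sample documents are all assumptions chosen for illustration. The sketch adds each document's random vector to the term vector, scaled by the term's frequency in that document, and normalizes the result once every document has been checked.

```python
import math
import random


def random_index_vector(dim, nonzeros, rng):
    """Hypothetical helper: sparse random vector with a few +1/-1 entries."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def build_term_vector(term, documents, doc_vectors):
    """Sum the random vectors of documents containing `term`, scaled by
    term frequency, then return the unit-length (normalized) result."""
    dim = len(doc_vectors[0])
    term_vec = [0.0] * dim
    for tokens, doc_vec in zip(documents, doc_vectors):
        freq = tokens.count(term)  # frequency of the term in this document
        if freq == 0:
            continue  # term does not appear: move on to the next document
        for i, value in enumerate(doc_vec):
            term_vec[i] += freq * value
    norm = math.sqrt(sum(v * v for v in term_vec))
    if norm == 0.0:
        return term_vec  # term appears nowhere: leave the zero vector as-is
    return [v / norm for v in term_vec]


# Assumed sample data: three tiny tokenized "documents".
documents = [
    "invoice attached please review invoice".split(),
    "meeting notes attached".split(),
    "lunch on friday".split(),
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc_vectors = [random_index_vector(16, 4, rng) for _ in documents]

invoice_vec = build_term_vector("invoice", documents, doc_vectors)
attached_vec = build_term_vector("attached", documents, doc_vectors)
```

Because "invoice" occurs only in the first sample document, its normalized term vector points in the same direction as that document's random vector. A term such as "attached", which occurs in two documents, yields a blend of both document vectors, which is what lets terms that co-occur across documents end up close together in the semantic space.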
