Lightweight document clustering

  • US 6,654,739 B1
  • Filed: 01/31/2000
  • Issued: 11/25/2003
  • Est. Priority Date: 01/31/2000
  • Status: Expired due to Fees
First Claim

1. A method for clustering similar documents together comprising the steps of:

    a) creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;

    b) creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;

    c) selecting a document in the document list;

    d) determining a specified number of documents that are most similar to the document selected in step c), wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);

    e) repeating steps c) and d) for each document in the document list;

    f) arranging the specified documents determined in each iteration of step d) into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster;

    j) assigning each document to a cluster; and

    k) determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document;

    for each match pair;

    i) if the match score is less than a threshold minimum score, proceed to the next match pair;

    ii) if the match pair is already in the same cluster, proceed to the next match pair;

    iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;

    iv) if the first document and the second document are in separate clusters;

        if option is no merging, go to the next match pair;

        if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;

        or merging the two separate clusters into one cluster, and go to the next match pair;

    l) repeating step k) for each document in the document list.
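
The steps of the claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The claim does not fix an IDF formula, so log(N / document frequency) is assumed. Each shared keyword is assumed to contribute 1 plus its IDF bonus to the match score. A first document that is not yet in any cluster is assumed to start a new one. The function names, the option strings "none", "repeat" and "merge", and the sample data are all hypothetical. The sketch does not enforce the step f) requirement that every document in a cluster share a keyword with all others.

```python
import math
from collections import defaultdict


def build_wordlist(doc_list):
    """Step b: for each keyword, list the documents whose keyword list has it."""
    wordlist = defaultdict(list)
    for doc_id, keywords in doc_list.items():
        for kw in dict.fromkeys(keywords):  # de-duplicate, keep order
            wordlist[kw].append(doc_id)
    return wordlist


def top_matches(doc_id, doc_list, wordlist, k):
    """Steps c-d: rank the other documents against doc_id, keep the top k."""
    n_docs = len(doc_list)
    scores = defaultdict(float)
    for kw in dict.fromkeys(doc_list[doc_id]):
        idf = math.log(n_docs / len(wordlist[kw]))  # assumed IDF formula
        for other in wordlist[kw]:
            if other != doc_id:
                scores[other] += 1.0 + idf  # one shared keyword plus its bonus
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def cluster(doc_list, k=2, min_score=1.5, option="merge"):
    """Steps j-l; option is "none", "repeat" or "merge" (hypothetical names)."""
    wordlist = build_wordlist(doc_list)
    clusters = {}               # cluster id -> set of document ids
    member = defaultdict(set)   # document id -> ids of clusters holding it
    next_id = 0
    for first in doc_list:
        if not member[first]:   # assumed reading of step j: start a new cluster
            clusters[next_id] = {first}
            member[first].add(next_id)
            next_id += 1
        for second, score in top_matches(first, doc_list, wordlist, k):
            if score < min_score:                  # i) below threshold
                continue
            if member[first] & member[second]:     # ii) already share a cluster
                continue
            if not member[second] or option == "repeat":
                # iii) second is unclustered, or iv) with "repeat documents":
                # place second in every cluster that first is in
                for cid in member[first]:
                    clusters[cid].add(second)
                member[second] |= member[first]
            elif option == "merge":                # iv) merge separate clusters
                target = min(member[first])
                for cid in list(member[second]):
                    for doc in clusters.pop(cid):
                        member[doc].discard(cid)
                        member[doc].add(target)
                        clusters[target].add(doc)
            # option == "none": leave both clusters untouched
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical sample data: d5 shares one keyword with each of two groups.
docs = {
    "d1": ["apple", "banana"],
    "d2": ["apple", "banana", "cherry"],
    "d3": ["kiwi", "lemon"],
    "d4": ["kiwi", "lemon", "mango"],
    "d5": ["apple", "kiwi"],
}

for opt in ("none", "repeat", "merge"):
    print(opt, cluster(docs, option=opt))
```

Walking the wordlist rather than comparing every pair of documents means only documents that share at least one keyword with the selected document are ever scored, which is presumably what keeps the approach lightweight.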
