Method and platform for term extraction from large collection of documents

  • US 7,395,256 B2
  • Filed: 06/20/2003
  • Issued: 07/01/2008
  • Est. Priority Date: 06/20/2003
  • Status: Expired due to Fees
First Claim

1. A method for statistically extracting terms from a set of documents comprising the steps of:

    determining for each document in the set of documents an importance vector based on importance values for words in each document;

    forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;

    building an infrastructure for the set of documents by generalizing the binary document classification tree;

    analyzing distribution of the importance vectors to determine a cutting threshold;

    cutting the generalized tree of the infrastructure according to the cutting threshold to further cluster the documents;

    extracting statistically significant individual key words from the further clusters of similar documents; and

    extracting terms using the key words as seeds, wherein the cutting threshold is defined by S1*, that satisfies the equation;
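
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that do not come from the claim: TF-IDF stands in for the importance values, cosine similarity with greedy agglomerative merging stands in for the clustering, the tree generalization step is omitted, a top-k ranking stands in for the statistical significance test, and adjacent word pairs stand in for term extraction. The threshold `cut_threshold` is a hypothetical parameter, since the equation defining S1* is not reproduced above. All function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def importance_vectors(docs):
    """One sparse importance vector (word -> weight) per document.
    TF-IDF is an assumed importance measure, not taken from the claim."""
    tokens = [d.lower().split() for d in docs]
    n = len(tokens)
    df = Counter(w for t in tokens for w in set(t))
    return [
        {w: (c / len(t)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
         for w, c in Counter(t).items()}
        for t in tokens
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_binary_tree(vectors):
    """Greedy agglomerative clustering (assumed method): repeatedly merge
    the two most similar clusters until a single root remains.
    Expects at least one vector."""
    nodes = [{"docs": [i], "vec": v, "sim": 1.0, "kids": None}
             for i, v in enumerate(vectors)]
    while len(nodes) > 1:
        a, b = max(combinations(nodes, 2),
                   key=lambda p: cosine(p[0]["vec"], p[1]["vec"]))
        vec = dict(a["vec"])
        for w, v in b["vec"].items():
            vec[w] = vec.get(w, 0.0) + v
        nodes = [x for x in nodes if x is not a and x is not b]
        nodes.append({"docs": a["docs"] + b["docs"], "vec": vec,
                      "sim": cosine(a["vec"], b["vec"]), "kids": (a, b)})
    return nodes[0]

def cut_tree(node, cut_threshold):
    """Split every node whose merge similarity is below the threshold.
    cut_threshold is hypothetical; the claim defines it through S1*."""
    if node["kids"] is None or node["sim"] >= cut_threshold:
        return [sorted(node["docs"])]
    left, right = node["kids"]
    return cut_tree(left, cut_threshold) + cut_tree(right, cut_threshold)

def key_words(cluster, vectors, top_k=2):
    """Top-k words by summed importance inside one cluster
    (a stand-in for a statistical significance test)."""
    total = Counter()
    for i in cluster:
        total.update(vectors[i])
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]

def seed_terms(cluster, docs, seeds, min_count=2):
    """Adjacent word pairs that contain a seed key word and occur
    at least min_count times in the cluster's documents."""
    counts = Counter()
    for i in cluster:
        t = docs[i].lower().split()
        counts.update(p for p in zip(t, t[1:])
                      if p[0] in seeds or p[1] in seeds)
    return sorted(" ".join(p) for p, c in counts.items() if c >= min_count)

# Toy corpus, invented for illustration only.
docs = [
    "neural network training neural network model",
    "neural network layers deep neural network",
    "stock market prices stock market crash",
    "stock market trading stock market index",
]
vectors = importance_vectors(docs)
for cluster in cut_tree(build_binary_tree(vectors), cut_threshold=0.3):
    seeds = set(key_words(cluster, vectors))
    print(cluster, sorted(seeds), seed_terms(cluster, docs, seeds))
```

The sketch keeps the claimed order of operations (vectors, tree, cut, key words, seeded terms) so that each function maps onto one step; any of the stand-ins can be swapped out without touching the others.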
