System and method for classifying electronically posted documents

  • US 7,137,065 B1
  • Filed: 02/24/2000
  • Issued: 11/14/2006
  • Est. Priority Date: 02/24/2000
  • Status: Expired due to Fees
First Claim

1. A method for classifying electronically posted documents, the method comprising:

    receiving a first document and a second document;

    generating a first metadata summary for said first document and a second metadata summary for the second document, wherein the first metadata summary includes a first plurality of sub-trees and the second metadata summary includes a second plurality of sub-trees, and wherein each of the sub-trees includes a plurality of nodes;

    comparing the first and second metadata summaries on a structural level by comparing a structure of the sub-trees of the first metadata summary with a structure of the sub-trees of the second metadata summary;

    identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent;

    if the structures of the sub-trees of the first and second metadata summaries are equivalent, performing a further comparison of the first and second metadata summaries, wherein the further comparison of the first and second metadata summaries includes the sub-steps of:

    comparing the first and second metadata summaries on a textual level by comparing textual content from the first document that is contained in the sub-trees of the first metadata summary with textual content from the second document that is contained in the sub-trees of the second metadata summary; and

    identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent, wherein the method further comprises:

    defining a first equivalence metadata table comprising:

    a first row corresponding to the first metadata summary;

    a second row corresponding to the second metadata summary;

    a first column corresponding to the first metadata summary; and

    a second column corresponding to the second metadata summary, and wherein the step of identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table; and

    wherein the step of identifying the first and second documents as duplicates comprises removing the second metadata summary.
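
The two-stage comparison recited in claim 1 can be sketched as follows. This is an illustrative reading only, not the patented implementation: the `Node` class, the helper names `structure`, `texts`, and `compare_summaries`, the use of tag names to represent structure, and the value 1 for equivalent summaries are all assumptions. The claim itself specifies only the zero stored on a structural mismatch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a sub-tree (hypothetical representation)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def structure(node):
    """Shape of a sub-tree: tags and nesting only, text ignored."""
    return (node.tag, tuple(structure(child) for child in node.children))


def texts(node):
    """Text carried by a sub-tree, collected in document order."""
    collected = [node.text.strip()]
    for child in node.children:
        collected.extend(texts(child))
    return collected


def compare_summaries(first, second):
    """Compare two metadata summaries, each given as a list of sub-trees.

    Returns (label, table, kept), where table is a 2x2 equivalence table
    whose rows and columns correspond to the first and second summary.
    """
    table = [[None, None], [None, None]]

    # Stage 1: structural comparison of the sub-trees.
    if [structure(t) for t in first] != [structure(t) for t in second]:
        table[0][1] = 0  # zero in first row, second column, as in the claim
        return "distinct", table, [first, second]

    # Stage 2: textual comparison, reached only when structures match.
    if [texts(t) for t in first] != [texts(t) for t in second]:
        table[0][1] = 0  # assumption: the claim does not give a value here
        return "distinct", table, [first, second]

    table[0][1] = 1  # assumption: 1 marks equivalent summaries
    return "duplicate", table, [first]  # second summary is removed


original = [Node("div", children=[Node("p", "For sale: bicycle")])]
repost = [Node("div", children=[Node("p", "For sale: bicycle")])]
label, table, kept = compare_summaries(original, repost)
print(label, len(kept))
```

Running the structural check first means the text of two documents is only compared when their sub-tree shapes already agree, which mirrors the ordering of the steps in the claim.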
