System and method for classifying electronically posted documents

US 7,137,065 B1
Filed: 02/24/2000
Issued: 11/14/2006
Est. Priority Date: 02/24/2000
Status: Expired due to Fees

Abstract

A method for classifying electronically posted documents includes receiving two posted documents and generating corresponding metadata summaries for each, wherein each of the metadata summaries includes at least one sub-tree structure. The structures of the two summary sub-trees within the respective metadata summaries are subsequently compared. If the two summary sub-trees are different, the two documents are deemed distinct. If the two summary sub-trees are the same, attribute values and text content of the metadata summaries are compared over a portion of the metadata summaries. If the compared attribute values and text content are determined to be the same, the documents are deemed duplicative.
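
The staged comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes each metadata summary is available as a small XML tree, and the helper names `structure`, `content`, and `classify` are hypothetical. For simplicity it compares attribute values and text over the whole summary rather than over a selected portion.

```python
import xml.etree.ElementTree as ET

def structure(node):
    """Shape of a sub-tree: tag names only, recursively (no attributes, no text)."""
    return (node.tag, tuple(structure(child) for child in node))

def content(node):
    """Attribute values and stripped text of every node, in document order."""
    return [(sorted(n.attrib.items()), (n.text or "").strip()) for n in node.iter()]

def classify(summary_a, summary_b):
    # Stage 1: differing sub-tree structure means the documents are distinct.
    if structure(summary_a) != structure(summary_b):
        return "distinct"
    # Stage 2: same structure, so compare attribute values and text content.
    if content(summary_a) != content(summary_b):
        return "distinct"
    return "duplicate"

# Hypothetical summaries of four posted documents.
a = ET.fromstring('<job><title lang="en">Engineer</title><city>Austin</city></job>')
b = ET.fromstring('<job><title lang="en">Engineer</title><city>Austin</city></job>')
c = ET.fromstring('<job><title lang="en">Engineer</title><city>Boston</city></job>')
d = ET.fromstring('<job><title lang="en">Engineer</title></job>')

print(classify(a, b))  # duplicate
print(classify(a, c))  # distinct (same structure, different text)
print(classify(a, d))  # distinct (different structure)
```

Running the structural check first means that documents with different shapes are separated without any text being compared.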

108 Citations

3 Claims

1. A method for classifying electronically posted documents, the method comprising:
- receiving a first document and a second document;
  
  generating a first metadata summary for said first document and a second metadata summary for the second document, wherein the first metadata summary includes a first plurality of sub-trees and the second metadata summary includes a second plurality of sub-trees, and wherein each of the sub-trees includes a plurality of nodes;
  
  comparing the first and second metadata summaries on a structural level by comparing a structure of the sub-trees of the first metadata summary with a structure of the sub-trees of the second metadata summary;
  
  identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent;
  
  if the structures of the sub-trees of the first and second metadata summaries are equivalent, performing a further comparison of the first and second metadata summaries, wherein the further comparison of the first and second metadata summaries includes the sub-steps of:
  
  comparing the first and second metadata summaries on a textual level by comparing textual content from the first document that is contained in the sub-trees of the first metadata summary with textual content from the second document that is contained in the sub-trees of the second metadata summary; and
  
  identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent, wherein the method further comprises:
  
  defining a first equivalence metadata table comprising:
  
  a first row corresponding to the first metadata summary;
  
  a second row corresponding to the second metadata summary;
  
  a first column corresponding to the first metadata summary; and
  
  a second column corresponding to the second metadata summary, and wherein the step of identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table; and
  
  wherein the step of identifying the first and second documents as duplicates comprises removing the second metadata summary.

3. The method of claim 1, wherein the step of identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table.
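
The equivalence metadata table recited in claims 1 and 3 could be modelled along these lines. The sketch rests on several assumptions that are not in the claims: a summary is represented as a list of sub-trees, each node is a hypothetical `(tag, text, children)` tuple, a value of 1 marks equivalence (the claims only specify the zero value), and the table is filled symmetrically.

```python
def structure(node):
    """Tag-only shape of a (tag, text, children) node."""
    tag, _text, children = node
    return (tag, tuple(structure(c) for c in children))

def texts(node):
    """Text of a node and all of its descendants, in order."""
    _tag, text, children = node
    out = [text]
    for c in children:
        out.extend(texts(c))
    return out

def equivalent(s1, s2):
    # Structural level first; textual level only if the structures match.
    if [structure(t) for t in s1] != [structure(t) for t in s2]:
        return False
    return [texts(t) for t in s1] == [texts(t) for t in s2]

def build_table(summaries):
    """One row and one column per summary; 0 = distinct, 1 = equivalent (assumed)."""
    n = len(summaries)
    table = [[1 if i == j else None for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = 1 if equivalent(summaries[i], summaries[j]) else 0
            table[i][j] = table[j][i] = value  # symmetric fill is an assumption
    return table

def drop_duplicates(summaries, table):
    """Remove every summary that is equivalent to an earlier one."""
    return [s for j, s in enumerate(summaries)
            if not any(table[i][j] == 1 for i in range(j))]

# Hypothetical summaries, each with two sub-trees.
s1 = [("title", "Engineer", []), ("loc", "", [("city", "Austin", [])])]
s2 = [("title", "Engineer", []), ("loc", "", [("city", "Austin", [])])]
s3 = [("title", "Engineer", []), ("loc", "", [("city", "Boston", [])])]

table = build_table([s1, s2, s3])
print(table)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(len(drop_duplicates([s1, s2, s3], table)))  # 2
```

Here the zero stored at row 1, column 3 records that the first and third summaries are distinct, and the second summary is removed as a duplicate of the first.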

2. A method for classifying electronically posted documents, the method comprising:
- receiving a first document and a second document;
  
  generating a first metadata summary for said first document and a second metadata summary for the second document, wherein the first metadata summary includes a first plurality of sub-trees and the second metadata summary includes a second plurality of sub-trees, and wherein each of the sub-trees includes a plurality of nodes;
  
  comparing the first and second metadata summaries on a structural level by comparing a structure of the sub-trees of the first metadata summary with a structure of the sub-trees of the second metadata summary;
  
  identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent;
  
  if the structures of the sub-trees of the first and second metadata summaries are equivalent, performing a further comparison of the first and second metadata summaries, wherein the further comparison of the first and second metadata summaries includes the sub-steps of:
  
  comparing the first and second metadata summaries on a textual level by comparing textual content from the first document that is contained in the sub-trees of the first metadata summary with textual content from the second document that is contained in the sub-trees of the second metadata summary; and
  
  identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent, wherein the further comparison of the first and second metadata summaries further includes the sub-steps of:
  
  before comparing the first and second metadata summaries on a textual level, comparing the first and second metadata summaries on an attribute level by comparing attribute values within the sub-trees of the first metadata summary with attribute values within the sub-trees of the second metadata summary; and
  
  identifying the first and second documents as distinct if the attribute values within the sub-trees of the first and second metadata summaries are not equivalent, wherein the method further comprises:
  
  defining a first equivalence metadata table comprising:
  
  a first row corresponding to the first metadata summary;
  
  a second row corresponding to the second metadata summary;
  
  a first column corresponding to the first metadata summary; and
  
  a second column corresponding to the second metadata summary, and wherein the step of identifying the first and second documents as distinct if the attribute values within the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table; and
  
  wherein the step of identifying the first and second documents as duplicates comprises removing the second metadata summary.
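
Claim 2 orders the comparison as structure, then attribute values, then textual content. One way to picture that ordering is the sketch below, which reports the first level at which two summaries differ. It assumes the summaries are XML trees whose child elements play the role of sub-trees; `STAGES` and `first_difference` are illustrative names, not terms from the patent.

```python
import xml.etree.ElementTree as ET

def structure(node):
    """Tag-only shape of the tree."""
    return (node.tag, tuple(structure(child) for child in node))

def attributes(node):
    """Attribute name/value pairs of every node, in document order."""
    return [sorted(n.attrib.items()) for n in node.iter()]

def text(node):
    """Stripped text of every node, in document order."""
    return [(n.text or "").strip() for n in node.iter()]

# Comparison levels in the order recited in claim 2.
STAGES = [("structure", structure), ("attribute", attributes), ("text", text)]

def first_difference(summary_a, summary_b):
    """Name of the first level at which the summaries differ, or None if none does."""
    for name, view in STAGES:
        if view(summary_a) != view(summary_b):
            return name  # documents are distinct; later levels are skipped
    return None

a = ET.fromstring('<job><title lang="en">Engineer</title></job>')
b = ET.fromstring('<job><title lang="de">Engineer</title></job>')
c = ET.fromstring('<job><title lang="en">Manager</title></job>')
d = ET.fromstring('<job><name lang="en">Engineer</name></job>')

print(first_difference(a, d))  # structure
print(first_difference(a, b))  # attribute
print(first_difference(a, c))  # text
print(first_difference(a, a))  # None
```

A `None` result corresponds to the duplicate case, in which the second summary would be removed.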

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Huang, Anita Wai-Ling; Sundaresan, Neelakantan
Primary Examiner(s): Hong, Stephen
Assistant Examiner(s): Basehoar, Adam L
Application Number: US09/513,058
Time in Patent Office: 2,455 Days
Field of Search: 715/513, 715/501.1, 707/102, 707/104.1, 707/10, 707/3, 707/6
US Class Current: 715/205
CPC Class Codes:
G06F 16/38   Retrieval characterised by ...
G06F 16/951   Indexing; Web crawling tech...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99936   Pattern matching access
