System and method for classifying electronically posted documents
First Claim
1. A method for classifying electronically posted documents, the method comprising:
receiving a first document and a second document;
generating a first metadata summary for said first document and a second metadata summary for the second document, wherein the first metadata summary includes a first plurality of sub-trees and the second metadata summary includes a second plurality of sub-trees, and wherein each of the sub-trees includes a plurality of nodes;
comparing the first and second metadata summaries on a structural level by comparing a structure of the sub-trees of the first metadata summary with a structure of the sub-trees of the second metadata summary;
identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent;
if the structures of the sub-trees of the first and second metadata summaries are equivalent, performing a further comparison of the first and second metadata summaries, wherein the further comparison of the first and second metadata summaries includes the sub-steps of:
comparing the first and second metadata summaries on a textual level by comparing textual content from the first document that is contained in the sub-trees of the first metadata summary with textual content from the second document that is contained in the sub-trees of the second metadata summary; and
identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent, wherein the method further comprises:
defining a first equivalence metadata table comprising:
a first row corresponding to the first metadata summary;
a second row corresponding to the second metadata summary;
a first column corresponding to the first metadata summary; and
a second column corresponding to the second metadata summary, and
wherein the step of identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table; and
wherein the step of identifying the first and second documents as duplicates comprises removing the second metadata summary.
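The steps recited in this claim can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the `Node` class, the reading of "structure" as matching tags and child layout, the initial table value of 1, and the value stored when only the text differs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a summary sub-tree (hypothetical representation)."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def same_structure(a: Node, b: Node) -> bool:
    # Assumption: "structure" means matching tags and matching child layout.
    if a.tag != b.tag or len(a.children) != len(b.children):
        return False
    return all(same_structure(x, y) for x, y in zip(a.children, b.children))


def same_text(a: Node, b: Node) -> bool:
    # Only called once the structures are known to match.
    if a.text.strip() != b.text.strip():
        return False
    return all(same_text(x, y) for x, y in zip(a.children, b.children))


def classify(first: list[Node], second: list[Node]):
    """Return (label, equivalence_table, summaries_kept)."""
    # Rows and columns 0 and 1 stand for the first and second summaries.
    # Assumption: cells start at 1 ("equivalent") until a check fails.
    table = [[1, 1], [1, 1]]
    structural = len(first) == len(second) and all(
        same_structure(x, y) for x, y in zip(first, second)
    )
    if not structural:
        table[0][1] = 0  # zero in first row, second column, as the claim recites
        return "distinct", table, [first, second]
    if not all(same_text(x, y) for x, y in zip(first, second)):
        table[0][1] = 0  # assumption: the claim does not say what is stored here
        return "distinct", table, [first, second]
    return "duplicate", table, [first]  # second metadata summary removed
```

The structural check runs first so that the text comparison is only attempted on sub-trees whose nodes can be paired one to one.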
Abstract
A method for classifying electronically posted documents includes receiving two posted documents and generating corresponding metadata summaries for each, wherein each of the metadata summaries includes at least one sub-tree structure. The structures of the two summary sub-trees within the respective metadata summaries are subsequently compared. If the two summary sub-trees are different, the two documents are deemed distinct. If the two summary sub-trees are the same, attribute values and text content of the metadata summaries are compared over a portion of the metadata summaries. If the compared attribute values and text content are determined to be the same, the documents are deemed duplicative.
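As a rough illustration of the first two steps of the abstract, the sketch below builds a metadata summary and compares sub-tree structures. It assumes, purely for illustration, that a posted document is available as XML and that each top-level child element serves as one sub-tree; the function names `metadata_summary`, `shape`, and `structures_match` are hypothetical and do not come from the patent.

```python
import xml.etree.ElementTree as ET


def metadata_summary(document: str) -> list[ET.Element]:
    # Assumption: each top-level child of the root is one summary sub-tree.
    return list(ET.fromstring(document))


def shape(node: ET.Element) -> tuple:
    # Structure only: tag names and nesting, ignoring attributes and text.
    return (node.tag, tuple(shape(child) for child in node))


def structures_match(first: list[ET.Element], second: list[ET.Element]) -> bool:
    return [shape(t) for t in first] == [shape(t) for t in second]


post_a = "<post><item><title>Bike</title></item><contact><name>Ann</name></contact></post>"
post_b = "<post><item><title>Car</title></item><contact><name>Bob</name></contact></post>"
post_c = "<post><item><title>Bike</title><price>100</price></item></post>"

same = structures_match(metadata_summary(post_a), metadata_summary(post_b))
different = structures_match(metadata_summary(post_a), metadata_summary(post_c))
```

Here `post_a` and `post_b` share a structure despite different text, so they would proceed to the attribute and text comparison, while `post_c` would be deemed distinct at the structural step.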
3 Claims
1. A method for classifying electronically posted documents, the method comprising:
receiving a first document and a second document;
generating a first metadata summary for said first document and a second metadata summary for the second document, wherein the first metadata summary includes a first plurality of sub-trees and the second metadata summary includes a second plurality of sub-trees, and wherein each of the sub-trees includes a plurality of nodes;
comparing the first and second metadata summaries on a structural level by comparing a structure of the sub-trees of the first metadata summary with a structure of the sub-trees of the second metadata summary;
identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent;
if the structures of the sub-trees of the first and second metadata summaries are equivalent, performing a further comparison of the first and second metadata summaries, wherein the further comparison of the first and second metadata summaries includes the sub-steps of:
comparing the first and second metadata summaries on a textual level by comparing textual content from the first document that is contained in the sub-trees of the first metadata summary with textual content from the second document that is contained in the sub-trees of the second metadata summary; and
identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent, wherein the method further comprises:
defining a first equivalence metadata table comprising:
a first row corresponding to the first metadata summary;
a second row corresponding to the second metadata summary;
a first column corresponding to the first metadata summary; and
a second column corresponding to the second metadata summary, and
wherein the step of identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table; and
wherein the step of identifying the first and second documents as duplicates comprises removing the second metadata summary.
2. A method for classifying electronically posted documents, the method comprising:
receiving a first document and a second document;
generating a first metadata summary for said first document and a second metadata summary for the second document, wherein the first metadata summary includes a first plurality of sub-trees and the second metadata summary includes a second plurality of sub-trees, and wherein each of the sub-trees includes a plurality of nodes;
comparing the first and second metadata summaries on a structural level by comparing a structure of the sub-trees of the first metadata summary with a structure of the sub-trees of the second metadata summary;
identifying the first and second documents as distinct if the structures of the sub-trees of the first and second metadata summaries are not equivalent;
if the structures of the sub-trees of the first and second metadata summaries are equivalent, performing a further comparison of the first and second metadata summaries, wherein the further comparison of the first and second metadata summaries includes the sub-steps of:
comparing the first and second metadata summaries on a textual level by comparing textual content from the first document that is contained in the sub-trees of the first metadata summary with textual content from the second document that is contained in the sub-trees of the second metadata summary; and
identifying the first and second documents as distinct if the textual content within the sub-trees of the first and second metadata summaries are not equivalent, wherein the further comparison of the first and second metadata summaries further includes the sub-steps of:
before comparing the first and second metadata summaries on a textual level, comparing the first and second metadata summaries on an attribute level by comparing attribute values within the sub-trees of the first metadata summary with attribute values within the sub-trees of the second metadata summary; and
identifying the first and second documents as distinct if the attribute values within the sub-trees of the first and second metadata summaries are not equivalent, wherein the method further comprises:
defining a first equivalence metadata table comprising:
a first row corresponding to the first metadata summary;
a second row corresponding to the second metadata summary;
a first column corresponding to the first metadata summary; and
a second column corresponding to the second metadata summary, and
wherein the step of identifying the first and second documents as distinct if the attribute values within the sub-trees of the first and second metadata summaries are not equivalent comprises storing a zero value in the first row and second column position of the first equivalence metadata table; and
wherein the step of identifying the first and second documents as duplicates comprises removing the second metadata summary.
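Claim 2 places an attribute-level comparison between the structural and textual comparisons. One way to picture that ordering is sketched below, under the assumption that each summary is held as an XML element whose children are the sub-trees. The helper names `shape`, `paired_nodes`, and `first_mismatch` are hypothetical, as is the initial table value of 1.

```python
import xml.etree.ElementTree as ET


def shape(node: ET.Element) -> tuple:
    # Tag names and nesting only.
    return (node.tag, tuple(shape(child) for child in node))


def paired_nodes(a: ET.Element, b: ET.Element):
    """Yield corresponding nodes of two trees already known to share a shape."""
    yield a, b
    for x, y in zip(a, b):
        yield from paired_nodes(x, y)


def first_mismatch(first: ET.Element, second: ET.Element):
    """Return (stage that failed or None, equivalence table)."""
    table = [[1, 1], [1, 1]]  # assumption: 1 marks "equivalent so far"
    if shape(first) != shape(second):
        return "structure", table  # claim 2 recites no table update here
    pairs = list(paired_nodes(first, second))
    if any(x.attrib != y.attrib for x, y in pairs):
        table[0][1] = 0  # zero in first row, second column, per claim 2
        return "attributes", table
    if any((x.text or "").strip() != (y.text or "").strip() for x, y in pairs):
        return "text", table  # claim 2 recites no table update here
    return None, table  # no mismatch: the documents are duplicates
```

Checking attributes before text means that a cheap dictionary comparison can rule out a pair before any text content is read. The table is only written at the attribute step because that is the only write claim 2 recites.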
Specification