SYSTEM AND METHOD FOR NEAR AND EXACT DE-DUPLICATION OF DOCUMENTS

US 20090276467A1
Filed: 04/30/2008
Published: 11/05/2009
Est. Priority Date: 04/30/2008
Status: Active Grant

Abstract

A system, method and computer program product for identifying near and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near and exact-duplicate documents are identified in the document collection.
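
The steps in the abstract can be sketched in Python as follows. This is only an illustrative reading, not the patented implementation: it assumes that a quorum search matches any document containing at least M of the N query words, it uses a hypothetical noise-word list as the only filter, and it ranks by raw hit count as a stand-in for relevancy. All function names are invented for the example.

```python
import re
from collections import Counter

# Hypothetical noise-word list; the source leaves filtering to user settings.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def filtered_words(text):
    """Lowercase the text, keep alphabetic words, and drop noise words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in NOISE_WORDS]

def top_n_words(text, n):
    """Return the n most frequent words of the filtered text."""
    return [word for word, _ in Counter(filtered_words(text)).most_common(n)]

def quorum_search(query_words, collection, m):
    """Return (doc_id, hits) for documents containing at least m query words."""
    results = []
    for doc_id, text in collection.items():
        vocabulary = set(filtered_words(text))
        hits = sum(1 for w in query_words if w in vocabulary)
        if hits >= m:
            results.append((doc_id, hits))
    # Assumption: rank by hit count, best match first, ties broken by id.
    return sorted(results, key=lambda r: (-r[1], r[0]))

def find_duplicates(collection, n, m):
    """Map each document id to the ids of its near and exact duplicates."""
    duplicates = {}
    for doc_id, text in collection.items():
        matches = quorum_search(top_n_words(text, n), collection, m)
        duplicates[doc_id] = [d for d, _ in matches if d != doc_id]
    return duplicates
```

Under this reading, setting M equal to N admits only documents that share all N frequent words, while lowering M relative to N admits progressively looser near-duplicates.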

33 Claims

1. A method for identifying near and exact-duplicate documents in a document collection, the method comprising:
- for each document in the collection;
  - reading textual content from the document;
  - filtering the textual content based on user settings;
  - determining N most frequent words from the filtered textual content of the document;
  - performing a quorum search of the N most frequent words in the document with a threshold M; and
  - sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
- Dependent claims 2 to 11:
  - 2. The method of claim 1, further comprising associating a respective XML wrapper for each document in the collection, wherein the XML wrapper includes a unique document identification for the document, and unique document identifications for near and exact-duplicate documents of the document.
  - 3. The method of claim 1, further comprising reading user preferences for the values of N and M, and content filtering settings for the filtering of the textual content options.
  - 4. The method of claim 3, wherein the content filtering settings include filtering of numbers, keyfields, noise words, and optical character recognition errors.
  - 5. The method of claim 2, wherein the reading of the textual content from the document includes reading the XML wrapper for the document.
  - 6. The method of claim 1, wherein the reading of the textual content from the document includes reading user settings to determine a text to read from the document.
  - 7. The method of claim 6, wherein the user settings to determine the text to read from the document include settings for reading the entire text from the document, reading only the first x Kb from the document, reading random sections of text from the document, and reading only the first x Kb after a key phrase or key word from the document.
  - 8. The method of claim 1, further comprising calculating N based on a recall percentage value user setting by multiplying the percentage value times a number of total words in the filtered text.
  - 9. The method of claim 1, further comprising calculating N based on a recall percentage value user setting by multiplying the percentage value times a number of unique words in the filtered text.
  - 10. The method of claim 1, further comprising calculating M based on a precision value percentage user setting by multiplying the percentage value times the N value.
  - 11. The method of claim 1, further comprising determining the relevancy by taking a number of hits for the quorum search in the document and dividing the number of hits by a size of the document in kilobytes of text in the document or a size in kilobytes for the entire document.
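
The arithmetic in claims 8 to 11 can be illustrated with a short Python sketch. The claims do not say how fractional results are rounded or whether a minimum applies, so the integer floor and the minimum of 1 below are assumptions, as are the function names and the use of whole-number percentages.

```python
def compute_n(recall_pct, filtered_words, use_unique=False):
    """N = recall percentage times total words (claim 8) or unique words (claim 9)."""
    base = len(set(filtered_words)) if use_unique else len(filtered_words)
    # Assumption: round down, but never search with fewer than one word.
    return max(1, (recall_pct * base) // 100)

def compute_m(precision_pct, n):
    """M = precision percentage times N (claim 10)."""
    # Assumption: round down, with a minimum threshold of one word.
    return max(1, (precision_pct * n) // 100)

def relevancy(hits, size_bytes):
    """Relevancy = quorum-search hits divided by document size in kilobytes (claim 11)."""
    return hits / (size_bytes / 1024)
```

For example, a filtered text of 10 words with a 50 percent recall setting gives N = 5, and an 80 percent precision setting then gives M = 4.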

12. A computer program product for identifying near and exact-duplicate documents in a document collection and including one or more computer readable instructions embedded on a computer readable medium and configured to cause one or more computer processors to perform the steps of:
- for each document in the collection;
  - reading textual content from the document;
  - filtering the textual content based on user settings;
  - determining N most frequent words from the filtered textual content of the document;
  - performing a quorum search of the N most frequent words in the document with a threshold M; and
  - sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
- Dependent claims 13 to 22:
  - 13. The computer program product of claim 12, further comprising associating a respective XML wrapper for each document in the collection, wherein the XML wrapper includes a unique document identification for the document, and unique document identifications for near and exact-duplicate documents of the document.
  - 14. The computer program product of claim 12, further comprising reading user preferences for the values of N and M, and content filtering settings for the filtering of the textual content options.
  - 15. The computer program product of claim 14, wherein the content filtering settings include filtering of numbers, keyfields, noise words, and optical character recognition errors.
  - 16. The computer program product of claim 13, wherein the reading of the textual content from the document includes reading the XML wrapper for the document.
  - 17. The computer program product of claim 12, wherein the reading of the textual content from the document includes reading user settings to determine a text to read from the document.
  - 18. The computer program product of claim 17, wherein the user settings to determine the text to read from the document include settings for reading the entire text from the document, reading only the first x Kb from the document, reading random sections of text from the document, and reading only the first x Kb after a key phrase or key word from the document.
  - 19. The computer program product of claim 12, further comprising calculating N based on a recall percentage value user setting by multiplying the percentage value times a number of total words in the filtered text.
  - 20. The computer program product of claim 12, further comprising calculating N based on a recall percentage value user setting by multiplying the percentage value times a number of unique words in the filtered text.
  - 21. The computer program product of claim 12, further comprising calculating M based on a precision value percentage user setting by multiplying the percentage value times the N value.
  - 22. The computer program product of claim 12, further comprising determining the relevancy by taking a number of hits for the quorum search in the document and dividing the number of hits by a size of the document in kilobytes of text in the document or a size in kilobytes for the entire document.
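
The four reading settings listed in claims 7 and 18 could be modeled as below. This is a sketch under stated assumptions: 1 Kb is treated as 1024 characters, an absent key phrase yields no text, and random sections are equal-sized chunks. The mode names, parameters, and `select_text` itself are hypothetical and do not come from the source.

```python
import random

def select_text(text, mode="entire", x_kb=1, key_phrase=None, sections=3, seed=None):
    """Return the portion of a document's text to analyze for a given setting."""
    limit = x_kb * 1024  # assumption: 1 Kb is treated as 1024 characters
    if mode == "entire":
        return text
    if mode == "first":
        return text[:limit]
    if mode == "after_key":
        if not key_phrase:
            raise ValueError("after_key mode needs a key phrase")
        start = text.find(key_phrase)
        if start == -1:
            return ""  # assumption: read nothing when the phrase is absent
        start += len(key_phrase)
        return text[start:start + limit]
    if mode == "random":
        rng = random.Random(seed)  # a seed makes the selection repeatable
        chunk = max(1, limit // sections)
        last_start = max(1, len(text) - chunk + 1)
        starts = sorted(rng.randrange(0, last_start) for _ in range(sections))
        return " ".join(text[s:s + chunk] for s in starts)
    raise ValueError(f"unknown mode: {mode}")
```

Passing a fixed `seed` keeps the random selection stable between runs, which matters if the same document must yield the same frequent words each time it is processed.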

23. A system for identifying near and exact-duplicate documents in a document collection, the system comprising:
- for each document in the collection;
  - means for reading textual content from the document;
  - means for filtering the textual content based on user settings;
  - means for determining N most frequent words from the filtered textual content of the document;
  - means for performing a quorum search of the N most frequent words in the document with a threshold M; and
  - means for sorting results from the quorum search based on relevancy, whereby based on the values of N and M near and exact-duplicate documents are identified in the document collection.
- Dependent claims 24 to 33:
  - 24. The system of claim 23, further comprising means for associating a respective XML wrapper for each document in the collection, wherein the XML wrapper includes a unique document identification for the document, and unique document identifications for near and exact-duplicate documents of the document.
  - 25. The system of claim 23, further comprising means for reading user preferences for the values of N and M, and content filtering settings for the filtering of the textual content options.
  - 26. The system of claim 25, wherein the content filtering settings include filtering of numbers, keyfields, noise words, and optical character recognition errors.
  - 27. The system of claim 24, wherein the reading of the textual content from the document includes reading the XML wrapper for the document.
  - 28. The system of claim 23, wherein the reading of the textual content from the document includes reading user settings to determine a text to read from the document.
  - 29. The system of claim 28, wherein the user settings to determine the text to read from the document include settings for reading the entire text from the document, reading only the first x Kb from the document, reading random sections of text from the document, and reading only the first x Kb after a key phrase or key word from the document.
  - 30. The system of claim 23, further comprising means for calculating N based on a recall percentage value user setting by multiplying the percentage value times a number of total words in the filtered text.
  - 31. The system of claim 23, further comprising means for calculating N based on a recall percentage value user setting by multiplying the percentage value times a number of unique words in the filtered text.
  - 32. The system of claim 23, further comprising means for calculating M based on a precision value percentage user setting by multiplying the percentage value times the N value.
  - 33. The system of claim 23, further comprising means for determining the relevancy by taking a number of hits for the quorum search in the document and dividing the number of hits by a size of the document in kilobytes of text in the document or a size in kilobytes for the entire document.
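
The XML wrapper described in claims 2, 13 and 24 carries a unique document identification plus the identifications of the document's near and exact duplicates. The claims do not define a schema, so the element and attribute names in this Python sketch (`document`, `duplicates`, `duplicate`, `id`) are hypothetical, chosen only to show one possible shape.

```python
import xml.etree.ElementTree as ET

def build_wrapper(doc_id, duplicate_ids):
    """Serialize a wrapper holding a document id and its duplicates' ids."""
    root = ET.Element("document", {"id": doc_id})
    duplicates = ET.SubElement(root, "duplicates")
    for dup_id in duplicate_ids:
        ET.SubElement(duplicates, "duplicate", {"id": dup_id})
    return ET.tostring(root, encoding="unicode")

def read_wrapper(xml_text):
    """Parse a wrapper back into (doc_id, list of duplicate ids)."""
    root = ET.fromstring(xml_text)
    return root.get("id"), [d.get("id") for d in root.iter("duplicate")]
```

Keeping the duplicate identifiers inside each document's own wrapper means a reader of one wrapper can find the related documents without querying the whole collection again.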

Current Assignee: MSC Intellectual Properties BV
Original Assignee: MSC Intellectual Properties BV
Inventors: Bloembergen, Siebe; Scholtes, Johannes C.

Granted Patent: US 7,930,306 B2
US Class Current: 1/1
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
G06F 16/93 Document management systems
