Document Comparison Method And Apparatus

US 20090198677A1
Filed: 12/12/2008
Published: 08/06/2009
Est. Priority Date: 02/05/2008
Status: Abandoned Application

Abstract

A document comparison and identification method comprises the steps of: identifying (S210), in a source document, words of a predetermined number of characters or greater; generating a list containing the identified words (S220), and excluding (S220) identified words occurring with a predetermined frequency or greater throughout a set of documents to be searched; searching (S230) each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list; for each of the plurality of documents, determining (S230) how many identified words from the list occur in the document; and calculating (S240) a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
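
Steps S210 to S230 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the regular-expression tokenizer, and the reading of "frequency" as the fraction of corpus documents containing a word are all hypothetical choices. The 0.5 cutoff is likewise an assumed value; only the 6-character word length comes from the claims (claim 2).

```python
import re

MIN_WORD_LENGTH = 6       # claim 2 names 6 as one value for the character threshold
MAX_DOC_FREQUENCY = 0.5   # assumed cutoff, not taken from the source


def long_words(text, min_length=MIN_WORD_LENGTH):
    """Return the set of lowercase words with at least min_length characters."""
    return {w for w in re.findall(r"[A-Za-z]+", text.lower()) if len(w) >= min_length}


def build_word_list(source_text, corpus, max_doc_frequency=MAX_DOC_FREQUENCY):
    """Identify long words in the source, then drop those common across the corpus.

    A word is excluded when the fraction of corpus documents containing it is
    max_doc_frequency or greater (assumed interpretation of "frequency").
    """
    doc_words = [long_words(doc) for doc in corpus]
    kept = set()
    for word in long_words(source_text):
        frequency = sum(word in words for words in doc_words) / max(len(corpus), 1)
        if frequency < max_doc_frequency:
            kept.add(word)
    return kept


def count_matches(word_list, corpus):
    """For each document, count how many words from the list occur in it."""
    return [len(word_list & long_words(doc)) for doc in corpus]
```

The per-document counts returned by `count_matches`, together with the size of the word list, are the inputs the similarity calculation (S240) works from.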

21 Claims

1. A document comparison and identification method, the method comprising the steps of:
- identifying, in a source document, words of a predetermined number of characters or greater;
  
  generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched;
  
  searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
  
  for each of the plurality of documents, determining how many identified words from the list occur in the document; and
  
  calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
  - 2. The document comparison and identification method according to claim 1, wherein the predetermined number of characters is 6.
  - 3. The document comparison and identification method according to claim 1, wherein the predetermined minimum required number of matches is calculated according to the formula:
    - $M = \mathrm{Floor}(((T - N) \times X) + N)$, wherein:
      
      M is the minimum required number of matches;
      
      T is the number of words in the list;
      
      N is a constant coefficient;
      
      X is a similarity ranking value; and
      
      the number of identified words in the list is less than or equal to the constant coefficient.
  - 4. The document comparison and identification method according to claim 3, wherein a document is determined to have high similarity with the source document if the number of identified words in the list occurring in the document is greater than, or equal to, the predetermined minimum required number of matches when X=0.9.
  - 5. The document comparison and identification method according to claim 3, wherein a document is determined to have medium similarity with the source document if the number of identified words in the list occurring in the document is greater than, or equal to, the predetermined minimum required number of matches when X=0.7.
  - 6. The document comparison and identification method according to claim 3, wherein a document is determined to have low similarity with the source document if the number of identified words in the list occurring in the document is greater than, or equal to, the predetermined minimum required number of matches when X=0.5.
  - 7. The document comparison and identification method according to claim 1, wherein the document is determined not to be similar with the source document if the number of identified words in the list occurring in the document is less than the predetermined minimum required number of matches when X=0.5.
  - 8. The document comparison method according to claim 1, wherein the predetermined minimum required number of matches is equal to the number of identified words in the list.
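
The threshold formula of claim 3 and the grading of claims 4 to 7 can be sketched as below. The ranking values 0.9, 0.7 and 0.5 come from the claims; the function names and the idea of testing the thresholds from highest to lowest are assumptions, and the source gives no value for N, so the caller must supply one. The sketch does not enforce the claim's stated relationship between the list size and N, and it does not model the alternative in claim 8.

```python
import math

# Similarity ranking values X named in claims 4, 5 and 6, highest first.
RANKING_VALUES = (("high", 0.9), ("medium", 0.7), ("low", 0.5))


def minimum_matches(total_words, x, n):
    """M = Floor(((T - N) * X) + N), with T = total_words and N = n."""
    return math.floor(((total_words - n) * x) + n)


def grade(matches, total_words, n):
    """Return the highest similarity label whose threshold the match count meets."""
    for label, x in RANKING_VALUES:
        if matches >= minimum_matches(total_words, x, n):
            return label
    return "not similar"  # below the X = 0.5 threshold, as in claim 7
```

For example, with an assumed N of 5 and a list of 23 words, `minimum_matches(23, 0.9, 5)` evaluates Floor((18 × 0.9) + 5) = Floor(21.2) = 21. Because X is a binary floating-point number here, inputs whose exact result is a whole number can in principle land just below it; exact arithmetic such as `fractions.Fraction` is one way to avoid that.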

9. A document comparison and identification method, comprising the steps of:
- performing a first search to identify documents identical to a source document;
  
  performing a second search to identify documents having an identical or a similar document name to the source document;
  
  performing a third search to identify documents of similar content to the source document;
  
  determining a ranking for the results of each of the first, second, and third searches; and
  
  presenting results of the first, second, and third searches in accordance with the determined ranking.
  - 10. The document comparison and identification method according to claim 9, wherein the documents identified by the first and second searches are deemed to have a high similarity ranking.
  - 11. The document comparison and identification method according to claim 9, wherein the third search comprises identifying, in a source document, words of a predetermined number of characters or greater;
    - generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched;
      
      searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
      
      for each of the plurality of documents, determining how many identified words from the list occur in the document; and
      
      calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
  - 12. The document comparison and identification method according to claim 11, wherein the similarity of documents identified by the third search is determined in accordance with the formula:
    - $M = \mathrm{Floor}(((T - N) \times X) + N)$, wherein:
      
      M is the minimum required number of matches;
      
      T is the number of words in the list;
      
      N is a constant coefficient;
      
      X is a similarity ranking value; and
      
      the number of identified words in the list is less than or equal to the constant coefficient.
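
The three-search method of claims 9 and 10 can be sketched as follows. Everything beyond the claim wording is an assumption made for illustration: the `Document` type, detecting identical documents by comparing SHA-256 digests, judging name similarity with `difflib.SequenceMatcher` and a 0.8 cutoff, and passing in the content-search grades as a ready-made mapping.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher

RANK_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass(frozen=True)
class Document:
    name: str
    content: str


def digest(doc):
    """Hash the content so identical documents compare equal."""
    return hashlib.sha256(doc.content.encode("utf-8")).hexdigest()


def identical_documents(source, documents):
    """First search: documents whose content digest equals the source's."""
    target = digest(source)
    return [d for d in documents if digest(d) == target]


def similar_names(source, documents, cutoff=0.8):
    """Second search: identical or similar names (assumed string-ratio test)."""
    return [
        d for d in documents
        if SequenceMatcher(None, source.name.lower(), d.name.lower()).ratio() >= cutoff
    ]


def ranked_results(source, documents, content_grades):
    """Merge the three searches and order the results by ranking.

    content_grades maps a document name to "high", "medium" or "low" and
    stands in for the output of the third (content) search.
    """
    best = {}
    for doc in identical_documents(source, documents) + similar_names(source, documents):
        best[doc.name] = "high"  # claim 10: first and second searches rank high
    for name, label in content_grades.items():
        if name not in best or RANK_ORDER[label] < RANK_ORDER[best[name]]:
            best[name] = label
    return sorted(best.items(), key=lambda item: (RANK_ORDER[item[1]], item[0]))
```

When a document is found by more than one search, this sketch keeps its best label, which is one possible way to determine a single ranking before presenting the results.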

13. A document comparison and identification apparatus comprising:
- a memory unit for storing data and program instructions; and
  
  a processing unit coupled to said memory unit;
  
  wherein said processing unit is programmed to:
  
  identify, in a source document, words of a predetermined number of characters or greater;
  
  generate a list containing the identified words, and exclude identified words from the list that occur with a predetermined frequency or greater in a set of documents to be searched;
  
  search each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
  
  determine, for each of the plurality of documents, how many identified words from the list occur in the document; and
  
  calculate a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
  - 14. The document comparison and identification apparatus according to claim 13, wherein the processing unit is programmed to calculate the predetermined minimum required number of matches according to the formula:
    - $M = \mathrm{Floor}(((T - N) \times X) + N)$, wherein:
      
      M is the minimum required number of matches;
      
      T is the number of words in the list;
      
      N is a constant coefficient;
      
      X is a similarity ranking value; and
      
      the number of identified words in the list is less than or equal to the constant coefficient.
  - 15. The document comparison apparatus according to claim 13, wherein the predetermined minimum required number of matches is equal to the number of identified words in the list.

16. A document comparison and identification apparatus, comprising:
- a memory unit for storing data and program instructions; and
  
  a processing unit coupled to said memory unit;
  
  wherein said processing unit is programmed to:
  
  perform a first search to identify documents identical to a source document;
  
  perform a second search to identify documents having an identical or a similar document name to the source document;
  
  perform a third search to identify documents of similar content to the source document;
  
  determine a ranking for the results of each of the first, second, and third searches; and
  
  present results of the first, second, and third searches in accordance with the determined ranking.
  - 17. The document comparison and identification apparatus according to claim 16, wherein for performing the third search, the processing unit is programmed to:
    - identify, in a source document, words of a predetermined number of characters or greater;
      
      generate a list containing the identified words, and exclude identified words from the list that occur with a predetermined frequency or greater in a set of documents to be searched;
      
      search each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
      
      determine, for each of the plurality of documents, how many identified words from the list occur in the document; and
      
      calculate a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.
  - 18. The document comparison and identification apparatus according to claim 17, wherein the processing unit is programmed to calculate the predetermined minimum required number of matches in accordance with the formula:
    - $M = \mathrm{Floor}(((T - N) \times X) + N)$, wherein:
      
      M is the minimum required number of matches;
      
      T is the number of words in the list;
      
      N is a constant coefficient;
      
      X is a similarity ranking value; and
      
      the number of identified words in the list is less than or equal to the constant coefficient.

19. A computer program product comprising a computer readable medium comprising a computer program recorded therein for document comparison and identification, said computer program product comprising:
- computer program code means for identifying, in a source document, words of a predetermined number of characters or greater;
  
  computer program code means for generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched;
  
  computer program code means for searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
  
  computer program code means for, for each of the plurality of documents, determining how many identified words from the list occur in the document; and
  
  computer program code means for calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.

20. A computer program product comprising a computer readable medium comprising a computer program recorded therein for document comparison and identification, said computer program product comprising:
- computer program code means for performing a first search to identify documents identical to a source document;
  
  computer program code means for performing a second search to identify documents having an identical or a similar document name to the source document;
  
  computer program code means for performing a third search to identify documents of similar content to the source document;
  
  computer program code means for determining a ranking for the results of each of the first, second, and third searches; and
  
  presenting results of the first, second, and third searches in accordance with the determined ranking.
  - 21. A computer program product according to claim 20, wherein said computer program code means for performing a third search comprises:
    - computer program code means for identifying, in a source document, words of a predetermined number of characters or greater;
      
      computer program code means for generating a list containing the identified words, and excluding identified words from said list that occur with a predetermined frequency or greater in a set of documents to be searched;
      
      computer program code means for searching each of the plurality of documents in the set of documents for occurrences of the identified words stored in the list;
      
      computer program code means for each of the plurality of documents, determining how many identified words from the list occur in the document; and
      
      computer program code means for calculating a similarity of each of the plurality of documents to the source document based on the total number of identified words in the list, the number of identified words in the list occurring in the document, and a predetermined minimum required number of matches.

Current Assignee: Nuix Pty Ltd.
Original Assignee: Nuix Pty Ltd.
Inventors: Sitsky, David; Sheehy, Edward; Noll, Daniel
Application Number: US12/334,357
Publication Number: US 20090198677A1
US Class Current: 1/1
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
