
Method and apparatus for removing redundant information from digital documents

  • US 7,017,113 B2
  • Filed: 12/05/2002
  • Issued: 03/21/2006
  • Est. Priority Date: 01/25/2002
  • Status: Expired due to Term
First Claim

1. A software program comprising instructions, stored on computer-readable media, wherein said instructions, when executed by a computer, perform the necessary steps for removing redundant information from digital documents, comprising:

  • organizing text into sentences and paragraphs;

    analyzing said sentences and said paragraphs;

    comparing said sentences and paragraphs with other documents; and

    identifying redundancies between said documents;

    wherein said step of analyzing further comprises the steps of:

    extracting statistical features selected from the group consisting of:

    size of a paragraph in characters;

    character histograms;

    number of words in each sentence;

    word histograms;

    starting word of each sentence; and

    ending word of a paragraph;

    determining whether similar said statistical features exist;

    IF similar statistical features exist, THEN

    deciding paragraphs are similar,

    removing redundant paragraph, and

    proceeding to said step of comparing said sentences and paragraphs with other documents

    OTHERWISE,

    postponing removal of paragraph;

    analyzing corresponding image and data parts of said paragraph;

    determining whether said paragraphs are placed in a different order;

    IF said paragraphs are placed in a different order, THEN 

    analyzing the starting word of each sentence, 

    analyzing the length of each said sentence; and

    proceeding to said step of comparing said sentences and paragraphs with other documents

    OTHERWISE,

    proceeding to said step of comparing said sentences and paragraphs with other documents.
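
The feature-extraction and comparison steps recited in the claim can be illustrated with a minimal Python sketch. This is not the patented implementation. The claim does not say how "similar" is measured, so the cosine similarity, the 0.9 threshold, the regex-based sentence and word splitting, and every function and class name below are assumptions made purely for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class ParagraphFeatures:
    """The six statistical features listed in the claim (names are hypothetical)."""
    size_chars: int            # size of a paragraph in characters
    char_hist: Counter         # character histogram
    words_per_sentence: list   # number of words in each sentence
    word_hist: Counter         # word histogram
    starting_words: list       # starting word of each sentence
    ending_word: str           # ending word of the paragraph


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def tokenize(sentence):
    # Lowercased alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def extract_features(paragraph):
    sentence_words = [tokenize(s) for s in split_sentences(paragraph)]
    all_words = [w for ws in sentence_words for w in ws]
    return ParagraphFeatures(
        size_chars=len(paragraph),
        char_hist=Counter(paragraph.lower()),
        words_per_sentence=[len(ws) for ws in sentence_words],
        word_hist=Counter(all_words),
        starting_words=[ws[0] for ws in sentence_words if ws],
        ending_word=all_words[-1] if all_words else "",
    )


def cosine(a, b):
    # Cosine similarity between two histograms (an assumed similarity measure).
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def features_similar(f1, f2, threshold=0.9):
    # Assumed decision rule: both histograms are close and the ending words match.
    return (
        cosine(f1.word_hist, f2.word_hist) >= threshold
        and cosine(f1.char_hist, f2.char_hist) >= threshold
        and f1.ending_word == f2.ending_word
    )


def remove_redundant_paragraphs(document, other_document, threshold=0.9):
    # Keep only the paragraphs of `document` with no similar paragraph in `other_document`.
    seen = [extract_features(p) for p in split_paragraphs(other_document)]
    kept = []
    for paragraph in split_paragraphs(document):
        features = extract_features(paragraph)
        if any(features_similar(features, s, threshold) for s in seen):
            continue  # similar features exist: treat the paragraph as redundant
        kept.append(paragraph)
    return kept
```

Calling `remove_redundant_paragraphs(doc_a, doc_b)` returns the paragraphs of `doc_a` that have no close match in `doc_b`. The sketch omits the image and data analysis and the reordering check recited in the claim, and it drops non-matching paragraphs into the kept list rather than modeling the "postponing removal" branch separately.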
