Method and apparatus for removing redundant information from digital documents
First Claim
1. A software program comprising instructions, stored on computer-readable media, wherein said instructions, when executed by a computer, perform the necessary steps for removing redundant information from digital documents, comprising:
organizing text into sentences and paragraphs;
analyzing said sentences and said paragraphs;
comparing said sentences and paragraphs with other documents; and
identifying redundancies between said documents;
wherein said step of analyzing further comprises the steps of;
extracting statistical features selected from the group consisting of;
size of a paragraph in characters;
character histograms;
number of words in each sentence;
word histograms;
starting word of each sentence; and
ending word of a paragraph;
determining whether similar said statistical features exist;
IF similar statistical features exist, THEN
deciding paragraphs are similar,
removing redundant paragraph, and
proceeding to said step of comparing said sentences and paragraphs with other documents
OTHERWISE,
postponing removal of paragraph;
analyzing corresponding image and data parts of said paragraph;
determining whether said paragraphs are placed in a different order;
IF said paragraphs are placed in a different order, THEN
analyzing the starting word of each sentence,
analyzing the length of each said sentence; and
proceeding to said step of comparing said sentences and paragraphs with other documents
OTHERWISE,
proceeding to said step of comparing said sentences and paragraphs with other documents.
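The feature-extraction and similarity steps recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the sentence splitter, the tokenizer, the histogram-overlap measure, the 0.9 threshold, and all function names are assumptions chosen for illustration, since the claim names the features but does not say how they are compared.

```python
import re
from collections import Counter

def split_sentences(paragraph):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def words(text):
    # Lowercased alphanumeric tokens (assumption: no stemming or stop words).
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def paragraph_features(paragraph):
    # The six statistical features listed in the claim.
    sentences = split_sentences(paragraph)
    all_words = words(paragraph)
    return {
        "size_chars": len(paragraph),
        "char_hist": Counter(paragraph.lower()),
        "words_per_sentence": [len(words(s)) for s in sentences],
        "word_hist": Counter(all_words),
        "starting_words": [words(s)[0] for s in sentences if words(s)],
        "ending_word": all_words[-1] if all_words else "",
    }

def histogram_overlap(a, b):
    # Shared counts divided by the larger total; 1.0 means identical histograms.
    total = max(sum(a.values()), sum(b.values()))
    if total == 0:
        return 1.0
    return sum((a & b).values()) / total

def paragraphs_similar(p1, p2, threshold=0.9):
    # Hypothetical decision rule: every feature must agree closely.
    f1, f2 = paragraph_features(p1), paragraph_features(p2)
    size_ratio = min(f1["size_chars"], f2["size_chars"]) / max(
        f1["size_chars"], f2["size_chars"], 1
    )
    return (
        size_ratio >= threshold
        and histogram_overlap(f1["char_hist"], f2["char_hist"]) >= threshold
        and histogram_overlap(f1["word_hist"], f2["word_hist"]) >= threshold
        and f1["starting_words"] == f2["starting_words"]
        and f1["ending_word"] == f2["ending_word"]
    )
```

When `paragraphs_similar` returns False, the claim postpones removal and goes on to examine the associated image and data parts, which this sketch does not model.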
Abstract
Method and apparatus for reconstructing new documents from a group of old ones by removing existing redundant information. Redundant information (images, text paragraphs) is removed from retrieved multimedia documents. Each document consists of two main parts stored in different databases: the first part holds the text paragraphs, and the second holds the images and drawings related to those paragraphs. An information reduction methodology first examines the text paragraphs of each document related to a specific topic and removes redundant information, such as identical or similar paragraphs, while keeping pointers useful for a future reconstruction of the original documents. The remaining text paragraphs and the set of pointers are used to compose the first version of a new document. The invention also examines all the images related to the set of original documents and removes identical or similar images while keeping pointers that could assist a future reconstruction of the original documents. The invention then merges the text paragraphs and images to create the first-stage new document.
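The idea of removing redundant paragraphs while keeping pointers for later reconstruction can be sketched as follows. This is a hedged illustration only: the abstract does not describe the pointer format, so the per-document index lists, the function names `reduce_documents` and `reconstruct`, and the default whitespace- and case-insensitive equality test are all assumptions.

```python
def normalize(paragraph):
    # Collapse whitespace and case so trivially different copies compare equal.
    return " ".join(paragraph.lower().split())

def reduce_documents(documents, is_similar=None):
    """documents: dict mapping a document id to its list of paragraph strings."""
    if is_similar is None:
        # Placeholder similarity test (assumption); a feature-based test could be passed in.
        is_similar = lambda a, b: normalize(a) == normalize(b)
    kept = []       # unique paragraphs, in first-seen order
    pointers = {}   # document id -> indexes into `kept`, in original paragraph order
    for doc_id, paragraphs in documents.items():
        refs = []
        for paragraph in paragraphs:
            for index, existing in enumerate(kept):
                if is_similar(existing, paragraph):
                    refs.append(index)  # redundant: keep only a pointer
                    break
            else:
                kept.append(paragraph)  # new information: keep the text
                refs.append(len(kept) - 1)
        pointers[doc_id] = refs
    return kept, pointers

def reconstruct(doc_id, kept, pointers):
    # Rebuild a document from the reduced paragraph set and its pointers.
    return [kept[i] for i in pointers[doc_id]]
```

With this layout, a paragraph removed as merely similar is reconstructed as the retained representative rather than its exact original wording; preserving the exact text would require storing the differences as well.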
6 Claims
4. A computer apparatus for removing redundant information from digital documents, comprising:
a computer workstation;
a search engine software program residing in said computer workstation;
a plurality of information databases; and
an information redundancy removal software program residing in said computer workstation;
wherein said search engine software program comprises instructions, stored on computer-readable media, and wherein said instructions, when executed by said computer workstation, provide means to perform the necessary steps for retrieving digital documents from said plurality of information databases;
wherein said information redundancy removal software program comprises instructions, stored on computer-readable media, and wherein said instructions, when executed by said computer workstation, provide means to perform the necessary steps for removing redundant information from said retrieved digital documents; and
wherein said computer-executable instructions within said information redundancy removal software program further provide means for;
organizing text into sentences and paragraphs;
analyzing said sentences and said paragraphs;
comparing said sentences and paragraphs with other documents;
identifying redundancies between said documents;
extracting statistical features selected from the group consisting of;
size of a paragraph in characters;
character histograms;
number of words in each sentence;
word histograms;
starting word of each sentence; and
ending word of a paragraph;
determining whether similar said statistical features exist;
IF similar statistical features exist, THEN
deciding paragraphs are similar,
removing redundant paragraph, and
proceeding to means for comparing said sentences and paragraphs with other documents
OTHERWISE,
postponing removal of paragraph;
analyzing corresponding image and data parts of said paragraph;
determining whether said paragraphs are placed in a different order;
IF said paragraphs are placed in a different order, THEN
analyzing the starting word of each sentence,
analyzing the length of each said sentence; and
comparing said sentences and paragraphs with other documents
OTHERWISE,
comparing said sentences and paragraphs with other documents.
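The reordering branch at the end of this claim (checking the starting word and length of each sentence when paragraphs appear in a different order) might look like the sketch below. The claim does not define how sentence length is measured or how paragraphs are paired, so measuring length in words, the signature tuple, and the name `reordered_matches` are assumptions.

```python
import re

def sentence_signature(paragraph):
    # One (starting word, length in words) pair per sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    signature = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z0-9']+", sentence.lower())
        if tokens:
            signature.append((tokens[0], len(tokens)))
    return tuple(signature)

def reordered_matches(doc_a, doc_b):
    """Return (i, j) pairs where paragraph i of doc_a has the same signature
    as paragraph j of doc_b but sits at a different position."""
    positions = {}
    for j, paragraph in enumerate(doc_b):
        signature = sentence_signature(paragraph)
        if signature:  # ignore paragraphs with no usable sentences
            positions.setdefault(signature, []).append(j)
    matches = []
    for i, paragraph in enumerate(doc_a):
        candidates = positions.get(sentence_signature(paragraph), [])
        if candidates:
            j = candidates.pop(0)  # pair each paragraph of doc_b at most once
            if i != j:
                matches.append((i, j))
    return matches
```

A matching signature only marks a pair as a candidate; under the claim, the pair still goes through the full comparison of sentences and paragraphs before anything is treated as redundant.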
5. A computer apparatus and a set of information redundancy removal software code, said software code being executable therein so as to remove redundant information from digital documents input thereinto by providing means for:
analyzing each image in each of said documents;
extracting statistical features from each said image, wherein said features are selected from the group consisting of;
number of image regions;
relative size of regions;
texture of regions; and
weighted regions graph;
determining whether same features exist;
IF same features exist, THEN
deciding that images are similar;
removing redundant image; and
terminating said means for analyzing each image;
OTHERWISE,
postponing removal of image;
analyzing corresponding text and data parts of image;
determining whether there is an ambiguity;
IF there is an ambiguity, THEN
performing image understanding;
making a final decision on removal of image; and
returning to removing redundant image;
OTHERWISE,
terminating analyzing each image.
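Three of the four image features named in this claim (number of regions, relative region sizes, and a weighted regions graph) can be illustrated on a pre-segmented image. This is a sketch under stated assumptions: the image is given as a 2D grid of region ids with one id per connected region, segmentation and texture analysis are assumed to happen elsewhere, edge weights are taken to be shared-border lengths, and the 0.05 tolerance is arbitrary. None of these details come from the claim.

```python
from collections import Counter

def image_features(labels):
    """labels: 2D list of region ids, one per pixel (segmentation assumed done upstream)."""
    height, width = len(labels), len(labels[0])
    sizes = Counter(label for row in labels for label in row)
    total = height * width
    graph = Counter()  # edge {region, region} -> number of adjacent pixel pairs
    for y in range(height):
        for x in range(width):
            here = labels[y][x]
            # Look right and down so each neighbouring pair is counted once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width and labels[ny][nx] != here:
                    graph[frozenset((here, labels[ny][nx]))] += 1
    return {
        "num_regions": len(sizes),
        "relative_sizes": sorted(count / total for count in sizes.values()),
        "graph_weights": sorted(graph.values()),
    }

def images_similar(a, b, tolerance=0.05):
    # Compare label-independent summaries, since region ids differ between images.
    fa, fb = image_features(a), image_features(b)
    if fa["num_regions"] != fb["num_regions"]:
        return False
    sizes_close = all(
        abs(x - y) <= tolerance
        for x, y in zip(fa["relative_sizes"], fb["relative_sizes"])
    )
    return sizes_close and fa["graph_weights"] == fb["graph_weights"]
```

Because only sorted summaries are compared, mirrored or rotated layouts can look identical here; that is the kind of ambiguity the claim resolves with the image-understanding step before a final removal decision.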
Specification