System and method for identifying useless documents

US 6,397,211 B1
Filed: 01/03/2000
Issued: 05/28/2002
Est. Priority Date: 01/03/2000
Status: Expired due to Fees

Abstract

A system and method are disclosed for identifying useless or insignificant documents in a document hit list assembled from documents stored in one or more document collection databases. A search engine is used to compose the document hit list based on a query presented by a user. A text extraction algorithm run by a processor is then used to process the documents identified by the document hit list to produce a table of terms and their corresponding collection-level importance ranking called the IQ or Information Quotient. The text algorithm also produces a table of the most important terms per document. The documents are also scanned independently and a table of documents with filenames and lengths is also produced. A summarizing text algorithm is also run by a processor against the documents of the document hit list to produce a table of terms having a high tf*idf value for each document. All of the tables are stored in a relational database, which allows the system of the present invention to generate a table of terms per document ranked by decreasing IQ. To determine whether a document is useful or useless, the table of terms and IQs, the table of most important terms per document, the table of documents with filename and lengths, and the table of high tf*idf values are examined.
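The abstract names a per-document table of tf*idf values but does not give the formula used. As a rough illustration only, the sketch below assumes the common textbook weighting (raw term count times the natural log of the number of documents divided by the number of documents containing the term). The function name `tfidf_table` and the token-list input are hypothetical. The scale of the resulting values depends on the variant chosen, so these numbers should not be read as comparable to the thresholds recited in the claims.

```python
import math
from collections import Counter


def tfidf_table(docs):
    """Build {filename: {term: tf*idf}} from {filename: [token, ...]}.

    Assumed weighting (not taken from the patent):
    tf = raw count of the term in the document,
    idf = ln(N / df), where N is the number of documents and
    df is the number of documents containing the term.
    """
    n_docs = len(docs)

    # Document frequency: count each term once per document.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    table = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        table[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return table


# Hypothetical two-document hit list.
hit_list = {
    "a.txt": ["search", "search", "engine"],
    "b.txt": ["engine", "index"],
}
scores = tfidf_table(hit_list)
```

A term that appears in every document of the hit list gets a weight of zero under this variant, which is why only terms concentrated in a few documents end up with high values.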

23 Claims

1. A processing system for identifying useless documents, comprising:
- at least one document database storing a plurality of documents;
  
  at least one processor;
  
  a search engine, executed by the at least one processor, that accesses at least one document stored within the at least one document database satisfying a query; and
  
  a useless document identifier engine, executed by the at least one processor, for identifying useless documents from the at least one accessed document, the useless document identifier engine determining if the at least one accessed document is useless by determining if one of the following two conditions is true;
  
  (i) a length of the at least one accessed document is less than a first predetermined amount of bytes;
  
  or (ii) the length of the at least one accessed document is less than a second predetermined amount of bytes, the at least one accessed document has less than a predetermined number of terms with an Intelligent Quotient (IQ) greater than a first predetermined number, and the at least one accessed document has less than a predetermined number of appearances of terms having a tf*idf value of greater than a second predetermined number.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The processing system of claim 1, wherein the at least one document database is a workstation.
  - 3. The processing system of claim 1, wherein the useless document identifier engine is an algorithm translated to programmable instructions.
  - 4. The processing system of claim 1, wherein the first predetermined amount of bytes is 2,000 bytes.
  - 5. The processing system of claim 1, wherein the second predetermined amount of bytes is 40,000 bytes.
  - 6. The processing system of claim 1, wherein the predetermined number of terms is five.
  - 7. The processing system of claim 1, wherein the first predetermined number is 60.
  - 8. The processing system of claim 1, wherein the predetermined number of appearances of terms is six.
  - 9. The processing system of claim 1, wherein the second predetermined number is 2.2.
  - 10. The processing system of claim 1, wherein the useless document identifier engine removes all identified useless documents from a document hit list listing all documents satisfying the query which includes the at least one accessed document.
  - 11. The processing system of claim 1, wherein the useless document identifier engine assigns a corresponding ranking to the at least one accessed document.
  - 12. The processing system of claim 11, wherein the useless document identifier engine downgrades the corresponding ranking of all identified useless documents.
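The two-condition test of claim 1, with the example thresholds recited in dependent claims 4 through 9, can be sketched as a single predicate. This is a minimal illustration, not the patented implementation. The names `Thresholds` and `is_useless` are hypothetical, and the inputs are assumed to be precomputed: one IQ value per term in the document, and one tf*idf value per term appearance.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    """Example values taken from dependent claims 4 through 9."""
    first_bytes: int = 2_000       # claim 4
    second_bytes: int = 40_000     # claim 5
    num_terms: int = 5             # claim 6
    iq_cutoff: float = 60          # claim 7
    num_appearances: int = 6       # claim 8
    tfidf_cutoff: float = 2.2      # claim 9


def is_useless(
    length_bytes: int,
    term_iqs: Sequence[float],
    appearance_tfidfs: Sequence[float],
    t: Thresholds = Thresholds(),
) -> bool:
    """Return True if either condition of claim 1 holds."""
    # Condition (i): the document is shorter than the first byte threshold.
    if length_bytes < t.first_bytes:
        return True

    # Condition (ii): shorter than the second byte threshold AND
    # few high-IQ terms AND few appearances of high-tf*idf terms.
    high_iq_terms = sum(1 for iq in term_iqs if iq > t.iq_cutoff)
    high_tfidf_hits = sum(1 for v in appearance_tfidfs if v > t.tfidf_cutoff)
    return (
        length_bytes < t.second_bytes
        and high_iq_terms < t.num_terms
        and high_tfidf_hits < t.num_appearances
    )
```

All three parts of condition (ii) must hold together, so a mid-sized document with at least five high-IQ terms is kept even if it has few high tf*idf appearances.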

13. A method for identifying if a document is useless, the method comprising the steps of:
- determining if a length of the document is less than a first predetermined amount of bytes; and
  
  determining if the length of the document is less than a second predetermined amount of bytes, determining if the document has less than a predetermined number of terms with an Intelligent Quotient (IQ) greater than a first predetermined number, and determining if the document has less than a predetermined number of appearances of terms having a tf*idf value of greater than a second predetermined number, if it is determined that the length of the document is greater than or equal to the first predetermined amount of bytes.
- Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22)
  - 14. The method of claim 13, wherein the first predetermined amount of bytes is 2,000 bytes.
  - 15. The method of claim 13, wherein the second predetermined amount of bytes is 40,000 bytes.
  - 16. The method of claim 13, wherein the predetermined number of terms is five.
  - 17. The method of claim 13, wherein the first predetermined number is 60.
  - 18. The method of claim 13, wherein the predetermined number of appearances of terms is six.
  - 19. The method of claim 13, wherein the second predetermined number is 2.2.
  - 20. The method of claim 13, wherein the useless document identifier engine removes the document from a document hit list listing all documents satisfying a query if the document is determined to be useless.
  - 21. The method of claim 13, wherein the useless document identifier engine assigns a corresponding ranking to the document.
  - 22. The method of claim 21, wherein the useless document identifier engine downgrades the corresponding ranking of the document if the document is determined to be useless.
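Claims 20 through 22 describe two ways of acting on a useless verdict: removing the document from the hit list, or downgrading its ranking. The sketch below illustrates both under stated assumptions. The `Hit` record, the numeric `rank_score`, and the multiplicative `penalty` are hypothetical, since the claims do not say how rankings are represented or by how much a ranking is downgraded.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    filename: str
    rank_score: float   # assumed: higher means more relevant
    useless: bool       # verdict from the useless-document test


def remove_useless(hits: List[Hit]) -> List[Hit]:
    """Claim 20 style: drop useless documents from the hit list."""
    return [h for h in hits if not h.useless]


def downgrade_useless(hits: List[Hit], penalty: float = 0.5) -> List[Hit]:
    """Claim 22 style: lower the ranking of useless documents.

    The multiplicative penalty is an illustrative choice only.
    """
    rescored = [
        h._replace(rank_score=h.rank_score * penalty) if h.useless else h
        for h in hits
    ]
    # Re-sort so the downgraded documents sink in the list.
    return sorted(rescored, key=lambda h: h.rank_score, reverse=True)


hits = [
    Hit("stub.txt", 0.9, True),
    Hit("report.txt", 0.8, False),
    Hit("notes.txt", 0.3, False),
]
kept = remove_useless(hits)
reranked = downgrade_useless(hits)
```

Downgrading keeps the document reachable for a user who scrolls far enough, while removal shortens the list outright. The claims cover both behaviours.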

23. A processing system for identifying documents, comprising:
- at least one document database storing a plurality of documents;
  
  at least one processor;
  
  a search engine, executed by the at least one processor, that accesses at least one document stored within the at least one document database satisfying a query; and
  
  a document identifier engine, executed by the at least one processor, for identifying documents from the at least one accessed document, the document identifier engine determining if the at least one accessed document is pertinent to a user by determining and analyzing user-defined parameters of the at least one accessed document, wherein said user-defined parameters includes;
  
  (i) a length of the at least one accessed document is less than a first predetermined amount of bytes;
  
  or (ii) the length of the at least one accessed document is less than a second predetermined amount of bytes, the at least one accessed document has less than a predetermined number of terms with an Intelligent Quotient (IQ) greater than a first predetermined number, and the at least one accessed document has less than a predetermined number of appearances of terms having a tf*idf value of greater than a second predetermined number.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Cooper, James W.
Primary Examiner(s): Alam, Hosain T.
Assistant Examiner(s): Alam, Shahid Al
Application Number: US09/476,943
Time in Patent Office: 876 Days
Field of Search: 707/1,2,3,4,5,6,7,9,10,102,202,204,500,513 704/9 709/219 706/12
US Class Current: 707/706
CPC Class Codes:
G06F 16/345   Summarisation for human users
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99953   Recoverability