System and method for identifying useless documents
Abstract
A system and method are disclosed for identifying useless or insignificant documents in a document hit list assembled from documents stored in one or more document collection databases. A search engine composes the document hit list based on a query presented by a user. A text extraction algorithm run by a processor then processes the documents identified by the hit list to produce a table of terms and their corresponding collection-level importance ranking, called the Information Quotient (IQ). The text extraction algorithm also produces a table of the most important terms per document. The documents are also scanned independently to produce a table of documents with filenames and lengths. A summarizing text algorithm, also run by a processor, is applied to the documents of the hit list to produce a table of terms having a high tf*idf value for each document. All of the tables are stored in a relational database, which allows the system to generate a table of terms per document ranked by decreasing IQ. To determine whether a document is useful or useless, the table of terms and IQs, the table of most important terms per document, the table of documents with filenames and lengths, and the table of high tf*idf values are examined.
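The relational step described in the abstract (joining the term/IQ table with the per-document term table to obtain terms per document ranked by decreasing IQ) might look roughly like the sketch below. It uses Python's standard `sqlite3` module purely for illustration; the table names `term_iq` and `doc_terms`, the column names, and the function name are assumptions and do not come from the patent.

```python
import sqlite3


def ranked_terms_per_document(term_iq, doc_terms):
    """Return {filename: [(term, iq), ...]} with terms sorted by decreasing IQ.

    term_iq:   iterable of (term, iq) rows        (hypothetical schema)
    doc_terms: iterable of (filename, term) rows  (hypothetical schema)
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE term_iq (term TEXT PRIMARY KEY, iq REAL)")
    conn.execute("CREATE TABLE doc_terms (filename TEXT, term TEXT)")
    conn.executemany("INSERT INTO term_iq VALUES (?, ?)", term_iq)
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", doc_terms)

    # Join each document's terms with their collection-level IQ and
    # order them so the highest-IQ terms come first within a document.
    rows = conn.execute(
        "SELECT d.filename, d.term, t.iq "
        "FROM doc_terms AS d JOIN term_iq AS t ON t.term = d.term "
        "ORDER BY d.filename, t.iq DESC, d.term"
    ).fetchall()
    conn.close()

    ranked = {}
    for filename, term, iq in rows:
        ranked.setdefault(filename, []).append((term, iq))
    return ranked
```

Keeping the tables separate and joining on demand mirrors the abstract's point that the relational database is what makes the ranked per-document view possible; an in-memory database is used here only to keep the example self-contained.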
23 Claims
1. A processing system for identifying useless documents, comprising:
at least one document database storing a plurality of documents;
at least one processor;
a search engine, executed by the at least one processor, that accesses at least one document stored within the at least one document database satisfying a query; and
a useless document identifier engine, executed by the at least one processor, for identifying useless documents from the at least one accessed document, the useless document identifier engine determining if the at least one accessed document is useless by determining if one of the following two conditions is true:
(i) a length of the at least one accessed document is less than a first predetermined amount of bytes; or
(ii) the length of the at least one accessed document is less than a second predetermined amount of bytes, the at least one accessed document has less than a predetermined number of terms with an Intelligent Quotient (IQ) greater than a first predetermined number, and the at least one accessed document has less than a predetermined number of appearances of terms having a tf*idf value of greater than a second predetermined number.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
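Condition (ii) of claim 1 depends on counting appearances of terms whose tf*idf value exceeds a threshold. The claim does not define the tf*idf formula, so the sketch below assumes the common textbook form, tf(t, d) * ln(N / df(t)), and assumes that "appearances" means the total number of occurrences of qualifying terms in the document. The function name and parameters are hypothetical.

```python
import math
from collections import Counter


def count_high_tfidf_appearances(doc_tokens, corpus, threshold):
    """Count occurrences in doc_tokens of terms whose tf*idf exceeds threshold.

    doc_tokens: list of terms for the document being tested
    corpus:     list of token lists, one per document in the hit list
    threshold:  the "second predetermined number" from the claim
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    term_freq = Counter(doc_tokens)
    appearances = 0
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term unseen in the corpus: idf is undefined, skip it
        tfidf = tf * math.log(n_docs / df)  # assumed formula, not from the patent
        if tfidf > threshold:
            appearances += tf
    return appearances
```

A term that occurs in every document gets an idf of zero under this formula and therefore never counts, which matches the intuition that ubiquitous terms say little about a particular document.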
13. A method for identifying if a document is useless, the method comprising the steps of:
determining if a length of the document is less than a first predetermined amount of bytes; and
determining if the length of the document is less than a second predetermined amount of bytes, determining if the document has less than a predetermined number of terms with an Intelligent Quotient (IQ) greater than a first predetermined number, and determining if the document has less than a predetermined number of appearances of terms having a tf*idf value of greater than a second predetermined number, if it is determined that the length of the document is greater than or equal to the first predetermined amount of bytes.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22
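The two-step test of claim 13 can be read as a short-circuit check: the length alone decides very short documents, and the remaining three comparisons are evaluated only when the document is at least the first length threshold. The following is a minimal sketch under that reading; the class names, field names, and any threshold values a caller supplies are illustrative assumptions, since the claim only speaks of "predetermined" amounts.

```python
from dataclasses import dataclass


@dataclass
class DocumentStats:
    """Per-document measurements (field names are hypothetical)."""
    length_bytes: int
    high_iq_term_count: int       # terms with IQ above the first predetermined number
    high_tfidf_appearances: int   # appearances of terms with tf*idf above the second


@dataclass
class Thresholds:
    """The "predetermined" limits from the claim (names are hypothetical)."""
    first_length: int
    second_length: int
    min_high_iq_terms: int
    min_high_tfidf_appearances: int


def is_useless(stats: DocumentStats, limits: Thresholds) -> bool:
    # Step 1: a document shorter than the first length limit is useless outright.
    if stats.length_bytes < limits.first_length:
        return True
    # Step 2: otherwise all three remaining conditions must hold together.
    return (
        stats.length_bytes < limits.second_length
        and stats.high_iq_term_count < limits.min_high_iq_terms
        and stats.high_tfidf_appearances < limits.min_high_tfidf_appearances
    )
```

Under this reading, a document at or above the second length limit is never flagged, however few significant terms it contains.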
23. A processing system for identifying documents, comprising:
at least one document database storing a plurality of documents;
at least one processor;
a search engine, executed by the at least one processor, that accesses at least one document stored within the at least one document database satisfying a query; and
a document identifier engine, executed by the at least one processor, for identifying documents from the at least one accessed document, the document identifier engine determining if the at least one accessed document is pertinent to a user by determining and analyzing user-defined parameters of the at least one accessed document, wherein said user-defined parameters includes:
(i) a length of the at least one accessed document is less than a first predetermined amount of bytes; or
(ii) the length of the at least one accessed document is less than a second predetermined amount of bytes, the at least one accessed document has less than a predetermined number of terms with an Intelligent Quotient (IQ) greater than a first predetermined number, and the at least one accessed document has less than a predetermined number of appearances of terms having a tf*idf value of greater than a second predetermined number.
Specification