System for information discovery
First Claim
1. A method for analyzing and characterizing a database of electronically formatted natural language based documents comprising the steps of:
a) subjecting the database to a sequence of word filters to eliminate terms in the database which do not discriminate document content, resulting in a filtered word set whose members are highly predictive of content;
b) defining a subset of the filtered word set as the topic set, said topic set being characterized as the set of filtered words which best discriminate the content of the documents which contain them;
c) forming a two dimensional matrix with the words contained within the topic set defining one dimension of said matrix and the words contained within the filtered word set comprising the other dimension of said matrix;
d) calculating matrix entries as the conditional probability that a document in the database will contain each word in the topic set given that it contains each word in the filtered word set; and
e) providing said matrix entries as vectors to interpret the document contents of said database.
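Steps a) through e) of the claim can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the claim does not say which word filters are applied or how "best discriminate" is measured, so the stop-word and document-frequency filters, the frequency-based topic ranking, and all function and parameter names below (`filter_words`, `choose_topics`, `conditional_matrix`, `min_df`, `max_df_ratio`) are hypothetical stand-ins.

```python
from collections import Counter


def document_word_sets(docs):
    """Reduce each document to the set of lowercase words it contains."""
    return [set(doc.lower().split()) for doc in docs]


def filter_words(doc_sets, stop_words, min_df=2, max_df_ratio=0.8):
    """Step a): apply a sequence of word filters.

    Assumed filters (not specified by the claim): drop stop words, words
    found in fewer than min_df documents, and words found in more than
    max_df_ratio of all documents.
    """
    n_docs = len(doc_sets)
    df = Counter(w for words in doc_sets for w in words)
    return sorted(
        w for w, count in df.items()
        if w not in stop_words and min_df <= count <= max_df_ratio * n_docs
    )


def choose_topics(doc_sets, filtered, n_topics):
    """Step b): pick a topic subset of the filtered words.

    Assumption: rank by document frequency (ties broken alphabetically)
    as a placeholder for the claim's "best discriminate" criterion.
    """
    df = Counter(w for words in doc_sets for w in words)
    ranked = sorted(filtered, key=lambda w: (-df[w], w))
    return ranked[:n_topics]


def conditional_matrix(doc_sets, topics, filtered):
    """Steps c) and d): rows are topic words, columns are filtered words.

    matrix[i][j] estimates P(topics[i] in doc | filtered[j] in doc)
    from document counts.
    """
    matrix = []
    for topic in topics:
        row = []
        for word in filtered:
            docs_with_word = [d for d in doc_sets if word in d]
            both = sum(1 for d in docs_with_word if topic in d)
            row.append(both / len(docs_with_word) if docs_with_word else 0.0)
        matrix.append(row)
    return matrix


# Step e): each column of the matrix is the vector for one filtered word.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog sat on the log",
]
doc_sets = document_word_sets(docs)
filtered = filter_words(doc_sets, stop_words={"the", "on"})
topics = choose_topics(doc_sets, filtered, n_topics=2)
matrix = conditional_matrix(doc_sets, topics, filtered)
for topic, row in zip(topics, matrix):
    print(topic, row)
```

Counting documents rather than word occurrences matches the claim's wording, which conditions on whether a document contains a word, not on how often the word appears.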
Abstract
A sequence of word filters is used to eliminate terms in the database that do not discriminate document content, resulting in a filtered word set and a topic word set whose members are highly predictive of content. These two word sets are then formed into a two-dimensional matrix whose entries are calculated as the conditional probability that a document will contain the word in a row given that it contains the word in a column. The matrix representation allows the resulting vectors to be used to interpret document contents.
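The matrix entry described in the abstract can be written out as follows. The notation is an assumption introduced here for illustration: t_i is the topic word for row i, w_j is the filtered word for column j, and D is the set of documents in the database. The source does not state how the probability is estimated; the ratio of document counts on the right is one natural choice.

```latex
M_{ij} = P(t_i \in d \mid w_j \in d)
       \approx \frac{\lvert \{\, d \in D : t_i \in d \text{ and } w_j \in d \,\} \rvert}
                    {\lvert \{\, d \in D : w_j \in d \,\} \rvert}
```

Under this reading, column j of M is the vector associated with filtered word w_j, with one component per topic word.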
Citations
5 Claims
Specification