System and method for interpreting document contents
Abstract
A sequence of word filters is used to eliminate terms in the database that do not discriminate document content, resulting in a filtered word set and a topic word set whose members are highly predictive of content. These two word sets are then formed into a two-dimensional matrix whose entries are calculated as the conditional probability that a document will contain the word in a row given that it contains the word in a column. The matrix representation allows the resultant vectors to be used to interpret document contents.
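The matrix described in the abstract can be illustrated with a minimal sketch. It assumes that documents are already tokenized into word lists, that rows correspond to topic words and columns to filtered words (one reading of the abstract together with claim 1), and that each entry is estimated by simple document counts, P(row word | column word) = N(both) / N(column word). The function name, the toy documents, and the word sets are hypothetical and do not come from the patent.

```python
def conditional_probability_matrix(docs, topic_words, filtered_words):
    """Return matrix[t][w] = estimated P(document contains t | document contains w).

    docs: list of token lists (hypothetical toy input).
    topic_words: words assumed to index the rows.
    filtered_words: words assumed to index the columns.
    """
    doc_sets = [set(d) for d in docs]
    matrix = {}
    for t in topic_words:
        row = {}
        for w in filtered_words:
            # Number of documents containing the column word w.
            n_w = sum(1 for d in doc_sets if w in d)
            # Number of documents containing both w and the row word t.
            n_both = sum(1 for d in doc_sets if w in d and t in d)
            # Guard against words that never occur (an assumption of this sketch).
            row[w] = n_both / n_w if n_w else 0.0
        matrix[t] = row
    return matrix


# Toy corpus, invented for illustration only.
docs = [
    ["engine", "fuel", "car"],
    ["engine", "car", "road"],
    ["fuel", "price", "market"],
    ["market", "price", "stock"],
]
topic_words = ["car", "market"]
filtered_words = ["engine", "fuel", "price"]

matrix = conditional_probability_matrix(docs, topic_words, filtered_words)
for t in topic_words:
    print(t, matrix[t])
```

In this toy corpus every document containing "engine" also contains "car", so that entry is 1.0, while "fuel" appears equally often with "car" and with "market", so both of its entries are 0.5.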
119 Citations
15 Claims
1. A method, comprising the steps of:
a) semantically filtering a set of documents in a database to extract a set of semantic concepts, to improve an efficiency of a predictive relationship to its content, based on at least one of word frequency, overlap and topicality;
b) defining a topic set, said topic set being characterized as the set of semantic concepts which best discriminate the content of the documents containing them, said topic set being defined based on at least one of word frequency, overlap and topicality;
c) forming a matrix with the semantic concepts contained within the topic set defining one dimension of said matrix and the semantic concepts contained within the filtered set of documents comprising another dimension of said matrix;
d) calculating matrix entries as the conditional probability that a document in the database will contain each semantic concept in the topic set given that it contains each semantic concept in the filtered set of documents; and
e) providing said matrix entries as vectors to interpret the document contents of said database.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
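Step a) of claim 1 names word frequency as one possible basis for filtering. A minimal sketch of a frequency-based filter follows. It assumes that "word frequency" is measured as document frequency and that words occurring in too few or too many documents are dropped; the thresholds `min_df` and `max_df_ratio`, the function name, and the toy corpus are hypothetical, and the claim's other criteria (overlap and topicality) are not modeled here.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Keep words that appear in at least min_df documents and in at most
    max_df_ratio of all documents. Both thresholds are assumptions of this
    sketch, not values taken from the patent."""
    n_docs = len(docs)
    # Count each word once per document.
    df = Counter(w for d in docs for w in set(d))
    return sorted(
        w for w, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


# Toy corpus, invented for illustration only.
docs = [
    ["the", "engine", "fuel", "car"],
    ["the", "engine", "car", "road"],
    ["the", "fuel", "price", "market"],
    ["the", "market", "price", "stock"],
]

filtered = filter_by_document_frequency(docs)
print(filtered)
```

Here "the" is dropped because it occurs in every document, and "road" and "stock" are dropped because each occurs in only one document. A topic set as in step b) would presumably be chosen by a stricter criterion applied to the surviving words.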
14. A system, comprising:
a) a filter for semantically filtering a set of documents in a database to extract a set of semantic concepts, to improve an efficiency of a predictive relationship to its content;
b) means for defining a topic set, said topic set being characterized as the set of semantic concepts which best discriminate the content of the documents containing them, said topic set being defined based on at least one of word frequency, overlap and topicality;
c) a matrix, with the semantic concepts contained within the topic set defining one dimension of said matrix and the semantic concepts contained within the filtered set of documents comprising another dimension of said matrix, wherein matrix entries are calculated as the conditional probability that a document in the database will contain each semantic concept in the topic set given that it contains each semantic concept in the filtered set of documents; and
e) means for interpreting the document contents of said database based on vectors derived from said matrix.
Dependent claims: 15
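The claim does not say how the vectors derived from the matrix are used to interpret a document. One possible reading, offered only as a sketch, is to take the vector of conditional probabilities associated with each filtered word present in a document and average those vectors into a single topic profile. The averaging rule, the function name, and the hard-coded vectors below are all assumptions made for illustration.

```python
def interpret_document(doc_words, vectors):
    """Average the topic vectors of the known words in a document.

    vectors: maps a filtered word w to {topic word t: P(t | w)}.
    Averaging is an assumption of this sketch; the claim only requires
    that interpretation be based on vectors derived from the matrix.
    """
    present = [w for w in set(doc_words) if w in vectors]
    if not present:
        return {}
    topics = {t for w in present for t in vectors[w]}
    return {
        t: sum(vectors[w].get(t, 0.0) for w in present) / len(present)
        for t in topics
    }


# Hypothetical vectors, one per filtered word.
vectors = {
    "engine": {"car": 1.0, "market": 0.0},
    "fuel": {"car": 0.5, "market": 0.5},
    "price": {"car": 0.0, "market": 1.0},
}

# "unknown" has no vector and is ignored.
scores = interpret_document(["engine", "fuel", "unknown"], vectors)
print(scores)
```

With these invented vectors, a document containing "engine" and "fuel" scores 0.75 for the topic word "car" and 0.25 for "market".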
Specification