System and method for dynamically evaluating latent concepts in unstructured documents
Abstract
A system and method for dynamically evaluating latent concepts in unstructured documents is disclosed. A multiplicity of concepts are extracted from a set of unstructured documents into a lexicon. The lexicon uniquely identifies each concept and a frequency of occurrence. A frequency of occurrence representation is created for the documents set. The frequency representation provides an ordered corpus of the frequencies of occurrence of each concept. A subset of concepts is selected from the frequency of occurrence representation filtered against a pre-defined threshold. A group of weighted clusters of concepts selected from the concepts subset is generated. A matrix of best fit approximations is determined for each document weighted against each group of weighted clusters of concepts.
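The first steps of the abstract (building a lexicon with frequencies of occurrence, then selecting a subset of concepts against a threshold) can be illustrated with a minimal sketch. It rests on several assumptions not found in the abstract: a "concept" is treated as a single lowercase word, and the function names `tabulate_frequencies` and `select_concepts`, the sample documents, and the `lower`/`upper` threshold values are all hypothetical.

```python
from collections import Counter
import re


def tabulate_frequencies(documents):
    """Return (lexicon, per_doc): corpus-wide and per-document term counts.

    Assumption: a concept is a single lowercase alphabetic word.
    """
    per_doc = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    lexicon = Counter()
    for counts in per_doc:
        lexicon.update(counts)
    return lexicon, per_doc


def select_concepts(lexicon, lower, upper):
    """Keep concepts whose corpus frequency lies within [lower, upper]."""
    return sorted(term for term, n in lexicon.items() if lower <= n <= upper)


# Hypothetical sample documents, for illustration only.
docs = [
    "the pump failed and the valve leaked",
    "the valve leaked again",
    "the invoice for the pump is overdue",
]
lexicon, per_doc = tabulate_frequencies(docs)

# Thresholds are illustrative: drop very rare and very common terms.
selected = select_concepts(lexicon, lower=2, upper=3)
print(selected)  # ['leaked', 'pump', 'valve']
```

Here the very frequent term "the" (5 occurrences) and the terms that occur only once fall outside the band, leaving a small subset of candidate concepts.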
26 Claims
1. A computer-readable storage medium storing computer executable program code to be executed by a computer system, the computer executable program code comprising the method of:
storing data on a computer-readable storage medium, comprising;
tabulating frequencies of occurrence of terms from within a set of documents;
correlating and selecting two or more of the terms to generate weighted groups of themes based on at least one of upper and lower edge conditions on the frequencies of occurrence;
determining an inner product for each theme as a function of the frequencies of occurrence and the weighted groups of themes; and
assigning each document into a set of clusters based on the inner product of the theme comprising those terms from the document; and
accessing code for the data stored on the computer-readable storage medium.
Dependent claims: 2, 3, 4
5. A system for identifying clustered groups of semantically-related documents, comprising:
a memory, comprising;
frequencies of occurrence of terms tabulated from within a set of documents; and
weighted groups of themes that include two or more of the terms, which are correlated and selected to generate the themes based on at least one of upper and lower edge conditions on the frequencies of occurrence; and
a processor, comprising;
a text analyzer to determine an inner product for each theme as a function of the frequencies of occurrence and the weighted groups of themes; and
a visualization module to assign each document into a cluster based on the inner product of the theme comprising those terms from the document.
Dependent claims: 6, 7, 8, 9, 10, 11
12. A computer-readable storage medium storing computer executable program code to be executed by a computer system, the computer executable program code comprising the method of:
collectively organizing structured data as a database record, comprising;
tabulating frequencies of occurrence of terms from within a set of documents;
correlating and selecting two or more of the terms to generate weighted groups of themes based on at least one of upper and lower edge conditions on the frequencies of occurrence;
determining an inner product for each theme as a function of the frequencies of occurrence and the weighted groups of themes; and
assigning each document into a cluster based on the inner product of the theme comprising those terms from the document; and
accessing code for the structured data of the database record on the computer-readable storage medium.
Dependent claims: 13, 14, 15
16. A method for identifying clustered groups of semantically-related documents, comprising:
tabulating frequencies of occurrence of terms within a set of documents;
correlating weighted groups of themes including two or more of the terms;
selecting the terms to generate the themes based on at least one of upper and lower edge conditions on the frequencies of occurrence;
determining an inner product for each theme as a function of the frequencies of occurrence and the weighted groups of themes; and
assigning each document into a cluster based on the inner product of the theme comprising those terms from the document.
Dependent claims: 17, 18, 19, 20, 21, 22
23. A system for identifying clustered groups of semantically-related documents, comprising:
a set of stored documents that each comprise at least one term, wherein each term corresponds to a different dimension within a logically-defined multi-dimensional concept space;
a database comprising a memory to store records that each comprises;
frequencies of occurrence of terms tabulated from within a set of documents; and
weighted groups of themes that include two or more of the terms, which are correlated and selected to generate the themes based on at least one of upper and lower edge conditions on the frequencies of occurrence;
a processor, comprising;
a text analyzer to determine an inner product for each theme as a function of the frequencies of occurrence and the weighted groups of themes; and
a visualization module to assign each document into a cluster based on the inner product of the theme comprising those terms from the document; and
a display system to render each cluster for output as a two-dimensional visualization of the themes by semantic relatedness.
Dependent claims: 24, 25, 26
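The inner-product and assignment steps recited in the independent claims above can be sketched as follows. This is one possible reading, not the claimed implementation: the claims say only that assignment is "based on the inner product", so choosing the theme with the largest inner product is an assumption, and the names `inner_product` and `assign_clusters`, the theme weights, and the per-document counts are all hypothetical.

```python
def inner_product(doc_counts, theme):
    """Dot product of a document's term counts with a theme's term weights."""
    return sum(doc_counts.get(term, 0) * weight for term, weight in theme.items())


def assign_clusters(per_doc_counts, themes):
    """Assign each document to the theme with the largest inner product.

    Assumption: "based on the inner product" is read here as arg-max.
    """
    assignments = []
    for counts in per_doc_counts:
        scores = {name: inner_product(counts, theme) for name, theme in themes.items()}
        assignments.append(max(scores, key=scores.get))
    return assignments


# Hypothetical weighted groups of terms (themes); weights are illustrative.
themes = {
    "maintenance": {"pump": 0.6, "valve": 0.8},
    "billing": {"invoice": 0.9, "pump": 0.1},
}

# Hypothetical per-document frequencies of occurrence for the selected terms.
per_doc_counts = [
    {"pump": 1, "valve": 1},
    {"valve": 1},
    {"invoice": 1, "pump": 1},
]

assignments = assign_clusters(per_doc_counts, themes)
print(assignments)  # ['maintenance', 'maintenance', 'billing']
```

Each term acts as one dimension of the concept space, so a theme is a weighted vector over terms and the inner product measures how strongly a document's frequencies align with that theme.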
Specification