System and method for scoring concepts in a document set
Abstract
A system and method for scoring concepts in a document set is provided. Concepts including two or more terms extracted from the document set are identified. Each document having one or more of the concepts is designated as a candidate seed document. A score is calculated for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight. A vector is formed for each candidate seed document. The vector is compared with a center of one or more clusters each comprising thematically-related documents. At least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents is selected as a seed document for a new cluster. Each of the unselected candidate seed documents is placed into one of the clusters having a most similar cluster center.
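The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented method: the score is assumed to be the plain product of the four values, the corpus weight is passed in as a number because the governing equation is not reproduced here, cosine similarity stands in for the vector comparison, each cluster center is approximated by its seed document's vector, and the names `ConceptStats`, `score`, `cosine_similarity`, `select_seeds_and_cluster`, and the `threshold` parameter are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConceptStats:
    """Per-document values for one concept (hypothetical container)."""
    frequency: int            # occurrences of the concept in the document
    concept_weight: float     # specificity of meaning
    structural_weight: float  # significance of the concept's location
    corpus_weight: float      # supplied by the caller; equation not reproduced


def score(stats: ConceptStats) -> float:
    # Assumption: the four values are combined by multiplication.
    return (stats.frequency * stats.concept_weight
            * stats.structural_weight * stats.corpus_weight)


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Similarity of two sparse concept-score vectors; 0.0 if either is empty.
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_seeds_and_cluster(
    candidates: Dict[str, Dict[str, float]], threshold: float = 0.5
) -> Dict[str, List[str]]:
    # Pass 1: a candidate becomes a seed when it is sufficiently distinct
    # (similarity below the assumed threshold) from every seed chosen so far.
    seeds: List[str] = []
    for doc_id, vector in candidates.items():
        if all(cosine_similarity(vector, candidates[s]) < threshold
               for s in seeds):
            seeds.append(doc_id)

    # Pass 2: every unselected candidate joins the cluster whose center
    # (approximated here by the seed's vector) is most similar.
    clusters: Dict[str, List[str]] = {s: [s] for s in seeds}
    for doc_id, vector in candidates.items():
        if doc_id in clusters:
            continue
        best = max(seeds,
                   key=lambda s: cosine_similarity(vector, candidates[s]))
        clusters[best].append(doc_id)
    return clusters
```

A document vector in this sketch is simply a mapping from each concept to its score, for example `{"data mining": score(stats)}`. A fuller implementation would likely recompute each cluster center as documents are added rather than reusing the seed vector.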
16 Claims
1. A system for scoring concepts in a document set, comprising:
a database to maintain a set of documents;
a concept identification module to identify concepts comprising two or more terms extracted from the document set and to designate each document having one or more of the concepts as a candidate seed document;
a value module to determine for each of the concepts identified within each candidate seed document, values for a frequency of occurrence of that concept within that candidate seed document, a concept weight reflecting a specificity of meaning for that concept within that candidate seed document, a structural weight reflecting a degree of significance based on a location of that concept within that candidate seed document, and a corpus weight inversely weighing a reference count of the occurrence for that concept within the document set according to the equation;
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method for scoring concepts in a document set, comprising:
maintaining a set of documents;
identifying concepts comprising two or more terms extracted from the document set and designating each document having one or more of the concepts as a candidate seed document;
determining for each of the concepts identified within each candidate seed document, values for a frequency of occurrence of that concept within that candidate seed document, a concept weight reflecting a specificity of meaning for that concept within that candidate seed document, a structural weight reflecting a degree of significance based on a location of that concept within that candidate seed document, and a corpus weight inversely weighing a reference count of the occurrence for that concept within the document set according to the equation;
Dependent claims: 10, 11, 12, 13, 14, 15, 16
Specification