System And Method For Scoring Concepts In A Document Set
First Claim
1. A system for scoring concepts in a document set, comprising:
a database to maintain a set of documents;
a concept identification module to identify concepts comprising two or more terms extracted from the document set and to designate each document having one or more of the concepts as a candidate seed document;
a scoring module to calculate a score for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight;
a vector module to form a vector for each candidate seed document comprising the concepts located in that candidate seed document and the associated concept scores;
a document comparison module to compare the vector for each candidate seed document with a center of one or more clusters each comprising thematically-related documents and to select at least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents as a seed document for a new cluster; and
a clustering module to place each of the unselected candidate seed documents into one of the clusters having a most similar cluster center.
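The scoring module and vector module recited above can be illustrated with a minimal Python sketch. The claim names the four inputs to a concept score (frequency of occurrence, concept weight, structural weight, and corpus weight) but does not say how they are combined; multiplying them is an assumption made here for illustration only. The names `ConceptOccurrence`, `concept_score`, and `document_vector` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConceptOccurrence:
    """One concept found in one candidate seed document (hypothetical record)."""
    concept: str
    frequency: int            # occurrences of the concept in the document
    concept_weight: float
    structural_weight: float
    corpus_weight: float


def concept_score(o: ConceptOccurrence) -> float:
    # Assumption: the four factors are multiplied. The claim only states
    # that the score is "based on" them.
    return o.frequency * o.concept_weight * o.structural_weight * o.corpus_weight


def document_vector(occurrences: list[ConceptOccurrence]) -> dict[str, float]:
    # The vector pairs each concept located in the document with its score.
    return {o.concept: concept_score(o) for o in occurrences}


vector = document_vector([
    ConceptOccurrence("document set", 3, 0.5, 2.0, 1.0),
    ConceptOccurrence("cluster center", 2, 1.0, 1.0, 0.25),
])
```

A sparse mapping from concept to score is used here so that each document stores only the concepts it actually contains.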
Abstract
A system and method for scoring concepts in a document set is provided. Concepts including two or more terms extracted from the document set are identified. Each document having one or more of the concepts is designated as a candidate seed document. A score is calculated for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight. A vector is formed for each candidate seed document. The vector is compared with a center of one or more clusters each comprising thematically-related documents. At least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents is selected as a seed document for a new cluster. Each of the unselected candidate seed documents is placed into one of the clusters having a most similar cluster center.
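The seed selection and cluster placement described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: cosine similarity stands in for the vector comparison, a fixed `threshold` stands in for "sufficiently distinct", and each cluster center is taken to be its seed document's vector. The function names are hypothetical.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse concept-score vectors."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(vectors: dict[str, dict[str, float]], threshold: float = 0.5):
    centers = {}      # seed document id -> cluster center (assumed: seed vector)
    unselected = []
    for doc_id, vec in vectors.items():
        # Assumption: a candidate is "sufficiently distinct" when its
        # similarity to every existing center is below the threshold.
        if all(cosine(vec, center) < threshold for center in centers.values()):
            centers[doc_id] = vec
        else:
            unselected.append(doc_id)
    clusters = {seed: [seed] for seed in centers}
    for doc_id in unselected:
        # Place each unselected candidate with the most similar center.
        best = max(centers, key=lambda s: cosine(vectors[doc_id], centers[s]))
        clusters[best].append(doc_id)
    return clusters


vectors = {
    "a": {"x": 1.0},
    "b": {"x": 1.0, "y": 0.1},
    "c": {"z": 2.0},
    "d": {"z": 1.0, "y": 0.2},
}
clusters = cluster(vectors)
```

With these inputs, documents `a` and `c` become seeds, and `b` and `d` join the clusters of `a` and `c` respectively.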
20 Claims
1. A system for scoring concepts in a document set, comprising:
a database to maintain a set of documents;
a concept identification module to identify concepts comprising two or more terms extracted from the document set and to designate each document having one or more of the concepts as a candidate seed document;
a scoring module to calculate a score for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight;
a vector module to form a vector for each candidate seed document comprising the concepts located in that candidate seed document and the associated concept scores;
a document comparison module to compare the vector for each candidate seed document with a center of one or more clusters each comprising thematically-related documents and to select at least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents as a seed document for a new cluster; and
a clustering module to place each of the unselected candidate seed documents into one of the clusters having a most similar cluster center.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method for scoring concepts in a document set, comprising:
maintaining a set of documents;
identifying concepts comprising two or more terms extracted from the document set and designating each document having one or more of the concepts as a candidate seed document;
determining a score for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight;
forming a vector for each candidate seed document comprising the concepts located in that candidate seed document and the associated concept scores;
comparing the vector for each candidate seed document with a center of one or more clusters each comprising thematically-related documents and selecting at least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents as a seed document for a new cluster; and
placing each of the unselected candidate seed documents into one of the clusters having a most similar cluster center.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
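The identifying and designating steps of claim 11 can be illustrated with a short Python sketch. The claim requires only that a concept comprise two or more terms extracted from the document set; treating a concept as an adjacent word pair that appears in at least `min_docs` documents is an assumption made here as a simple stand-in, and `bigrams` and `identify_concepts` are hypothetical names.

```python
import re
from collections import Counter


def bigrams(text: str) -> set[str]:
    """Adjacent lowercase word pairs in a text (assumed stand-in for concepts)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {f"{a} {b}" for a, b in zip(words, words[1:])}


def identify_concepts(documents: dict[str, str], min_docs: int = 2):
    doc_bigrams = {doc_id: bigrams(text) for doc_id, text in documents.items()}
    # Count how many documents contain each two-term phrase.
    counts = Counter(bg for bgs in doc_bigrams.values() for bg in bgs)
    # Assumption: a phrase is a concept if it appears in enough documents.
    concepts = {bg for bg, n in counts.items() if n >= min_docs}
    # A document with one or more concepts is a candidate seed document.
    candidates = {
        doc_id: bgs & concepts
        for doc_id, bgs in doc_bigrams.items()
        if bgs & concepts
    }
    return concepts, candidates


documents = {
    "d1": "Cluster center update",
    "d2": "the cluster center moved",
    "d3": "unrelated memo",
}
concepts, candidates = identify_concepts(documents)
```

Here `d1` and `d2` share the two-term phrase "cluster center" and are designated candidate seed documents, while `d3` contains no concept and is not.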
Specification