System and method for performing discovery of digital information in a subject area
Abstract
A system and method for performing discovery of digital information in a subject area is provided. Each of topics in a subject area, training material for the topics, and a corpus comprising digital information are designated. Topic models for each of the topics are built. The topic models are evaluated against the training material. The digital information from the corpus is organized by the topics using the topic models into an evergreen index.
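The last step of the abstract, organizing corpus items by topic, can be sketched as follows. This is only an illustration under stated assumptions: the record does not say how a topic model is represented, so here a model is a hypothetical tuple of words that must all occur in a document, and `organize` is an invented name rather than anything defined by the patent.

```python
import re
from collections import defaultdict


def organize(corpus, index):
    """Group document ids under every topic whose model matches the document.

    corpus: dict mapping a document id to its text.
    index:  dict mapping a topic name to its topic model. A model is assumed
            (hypothetically) to be a tuple of words that must all be present.
    """
    by_topic = defaultdict(list)
    for doc_id, text in corpus.items():
        present = set(re.findall(r"[a-z]+", text.lower()))
        for topic, model in index.items():
            if set(model) <= present:
                by_topic[topic].append(doc_id)
    return dict(by_topic)
```

Because the index stores models rather than a fixed list of documents, the same call can be repeated whenever new items arrive in the corpus, which is one plausible reading of "evergreen".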
25 Claims
1. A system for performing discovery of digital information in a subject area, comprising:
an information collection maintained in a storage device; and
a computer comprising a processor and memory within which code for execution by the processor is stored, comprising:
  a user interface of the computer configured to designate each of topics in a subject area, training material for the topics, and a corpus comprising electronically-stored digital information;
  a topic modeler configured to build candidate topic models on the computer, comprising:
    a seed word selector configured to select seed words for each of the topics, and
    a pattern generator configured to generate patterns from the seed words for each topic as candidate topic models for that topic;
  an index trainer to evaluate the topic models against the training material comprising:
    a pattern tester configured to match the patterns in each candidate topic model to the training material and to rate the candidate topic model based on topical prediction; and
  an index builder configured to build an evergreen index comprising topic models for each of the topics by pairing each topic to the candidate topic model that was best rated.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
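The seed word selector and pattern generator recited in claim 1 can be illustrated with a minimal sketch. Everything specific here is an assumption, since the claim does not fix these details: seed words are chosen by a hypothetical document-frequency heuristic, and patterns are simple "all of these words" or "any of these words" tests. The function names are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_seed_words(on_topic, off_topic, k=3):
    """Pick k words seen in more on-topic than off-topic documents.

    Hypothetical heuristic: rank by (on-topic document count minus
    off-topic document count), breaking ties alphabetically.
    """
    on = Counter(w for doc in on_topic for w in tokenize(doc))
    off = Counter(w for doc in off_topic for w in tokenize(doc))
    ranked = sorted(on, key=lambda w: (off[w] - on[w], w))
    return ranked[:k]


def generate_patterns(seed_words):
    """Build candidate patterns: single words, word pairs, and one disjunction."""
    patterns = [("all", (w,)) for w in seed_words]
    patterns += [("all", pair) for pair in combinations(seed_words, 2)]
    patterns.append(("any", tuple(seed_words)))
    return patterns


def matches(pattern, text):
    """Test whether a ("all" | "any", words) pattern matches a text."""
    mode, words = pattern
    present = tokenize(text)
    test = all if mode == "all" else any
    return test(w in present for w in words)
```

Each generated pattern stands as one candidate topic model for the topic; which one is kept depends on how it rates against the training material.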
13. A method for performing discovery of digital information in a subject area, comprising:
designating through a user interface of a computer a corpus comprising electronically-stored digital information, which are maintained in a storage device;
selecting one or more topics and training material for the selected topics comprising on topic information and off topic information;
building candidate topic models on the computer comprising:
  selecting seed words for each of the selected topics; and
  generating patterns from the seed words for each topic as candidate topic models for that topic;
evaluating the candidate topic models for each selected topic against the training material comprising:
  matching the patterns in each candidate topic model to the training material;
  rating each candidate topic model for the selected topic comprising:
    assigning a higher score to each candidate topic model that matches the on topic information for the selected topic;
    assigning a lower score to each candidate topic model that does not match the on topic information for the selected topic;
    assigning a higher score to each candidate topic model that does not match the off topic information for the selected topic; and
    assigning a lower score to each candidate topic model that matches the off topic information for the selected topic; and
building an evergreen index comprising topic models for each of the selected topics by pairing each topic to the candidate topic model that has the best overall score.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
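The four scoring rules and the final pairing step of claim 13 can be sketched as below. The claim only says "higher" and "lower", so the +1 and -1 weights are an assumption, as is the representation of a candidate model as a tuple of words that must all occur. The names `rate` and `build_index` are hypothetical.

```python
import re


def model_matches(model, text):
    """A candidate model (assumed: a tuple of required words) matches a text
    when every one of its words occurs in the text."""
    return set(model) <= set(re.findall(r"[a-z]+", text.lower()))


def rate(model, on_topic, off_topic):
    """Score a candidate model against on-topic and off-topic training texts."""
    score = 0
    for doc in on_topic:
        # Higher score for matching on-topic material, lower for missing it.
        score += 1 if model_matches(model, doc) else -1
    for doc in off_topic:
        # Lower score for matching off-topic material, higher for avoiding it.
        score += -1 if model_matches(model, doc) else 1
    return score


def build_index(candidates_by_topic, training):
    """Pair each topic with its best-scoring candidate model.

    candidates_by_topic: dict mapping topic -> list of candidate models.
    training:            dict mapping topic -> (on_topic_docs, off_topic_docs).
    """
    index = {}
    for topic, candidates in candidates_by_topic.items():
        on_topic, off_topic = training[topic]
        index[topic] = max(candidates, key=lambda m: rate(m, on_topic, off_topic))
    return index
```

With equal weights, a broad model that catches all on-topic texts but also some off-topic ones can lose to a narrower model that avoids the off-topic texts, which is the trade-off the four rules describe. Ties are resolved here by list order, a detail the claim does not address.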
25. An apparatus for performing discovery of digital information in a subject area, comprising:
means for designating through a user interface of a computer a corpus comprising electronically-stored digital information, which are maintained in a storage device;
means for selecting one or more topics and training material for the selected topics comprising on topic information and off topic information;
means for building candidate topic models on the computer comprising:
  means for selecting seed words for each of the selected topics; and
  means for generating patterns from the seed words for each topic as candidate topic models for that topic;
means for evaluating the candidate topic models for each selected topic against the training material comprising:
  means for matching the patterns in each candidate topic model to the training material;
  means for rating each candidate topic model for the selected topic comprising:
    means for assigning a higher score to each candidate topic model that matches the on topic information for the selected topic;
    means for assigning a lower score to each candidate topic model that does not match the on topic information for the selected topic;
    means for assigning a higher score to each candidate topic model that does not match the off topic information for the selected topic; and
    means for assigning a lower score to each candidate topic model that matches the off topic information for the selected topic; and
means for building an evergreen index comprising topic models for each of the selected topics by pairing each topic to the candidate topic model that has the best overall score.
Specification