System and method for classifying legal concepts using legal topic scheme
Abstract
An economic, scalable machine learning system and process perform document (concept) classification with high accuracy using large topic schemes, including large hierarchical topic schemes. One or more highly relevant classification topics are suggested for a given document (concept) to be classified. The invention includes training and concept classification processes. The invention also provides methods that may be used as part of the training and/or concept classification processes, including: a method of scoring the relevance of features in training concepts, a method of ranking concepts based on relevance score, and a method of voting on topics associated with an input concept. In a preferred embodiment, the invention is applied to the legal (case law) domain, classifying legal concepts (rules of law) according to a proprietary legal topic classification scheme (a hierarchical scheme of areas of law).
199 Citations
20 Claims
-
1. A computer-implemented method of building a knowledge base for a legal topic classification system, the method comprising:
-
inputting a plurality of training documents;
parsing the plurality of training documents to extract classified legal concepts;
extracting features from the legal concepts;
generating relevance scores for each feature; and
storing features, topics, and relevance scores in a knowledge base, using an inverted index.
-
-
2. The method as set forth in claim 1, [...]:
-
partitioning the text by section; and
partitioning the text by legal concept.
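The storing step of claim 1 names an inverted index but does not describe its layout. The following is a minimal sketch of one possible layout, assuming each term maps to the training concepts that contain it together with a relevance score, and each training concept maps to its assigned topics. The class name `KnowledgeBase`, its methods, and the sample data are hypothetical and do not come from the patent text.

```python
from collections import defaultdict


class KnowledgeBase:
    """Toy knowledge base built around an inverted index (illustrative only)."""

    def __init__(self):
        # Inverted index: term -> {concept_id: relevance score}
        self.postings = defaultdict(dict)
        # concept_id -> topics assigned to that training concept
        self.topics = {}

    def add_concept(self, concept_id, topics, term_scores):
        """Store one classified training concept and its scored terms."""
        self.topics[concept_id] = list(topics)
        for term, score in term_scores.items():
            self.postings[term][concept_id] = score

    def lookup(self, term):
        """Return {concept_id: score} for all training concepts containing term."""
        return dict(self.postings.get(term, {}))


# Hypothetical sample data.
kb = KnowledgeBase()
kb.add_concept("c1", ["Torts > Negligence"], {"duty": 1.5, "breach": 0.5})
kb.add_concept("c2", ["Contracts > Breach"], {"breach": 0.75, "damages": 1.0})
```

With this layout, `kb.lookup("breach")` returns the scores for both sample concepts in a single dictionary access, which is the property that makes an inverted index attractive for the later search step.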
-
-
3. The method as set forth in claim 1, the step of extracting features comprising the steps of:
-
extracting terms, excluding stop words;
extracting legal phrases; and
extracting embedded case citations.
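The three kinds of features recited in claim 3 (terms excluding stop words, legal phrases, and embedded case citations) can be illustrated with a short sketch. The stop-word set, the phrase list, and the citation pattern below are small hypothetical stand-ins; the claim does not specify how any of them is built.

```python
import re

# Hypothetical resources; a production system would use far larger lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "that", "for"}
LEGAL_PHRASES = ["summary judgment", "due process", "probable cause"]
# Simplified pattern for reporter citations such as "477 U.S. 242".
CITATION_RE = re.compile(r"\b\d+\s+(?:U\.S\.|F\.[23]d)\s+\d+\b")


def extract_features(text):
    """Return terms, legal phrases, and case citations found in text."""
    citations = CITATION_RE.findall(text)
    # Remove citations first so their numbers and reporters are not counted as terms.
    remainder = CITATION_RE.sub(" ", text).lower()
    phrases = [p for p in LEGAL_PHRASES if p in remainder]
    terms = [w for w in re.findall(r"[a-z]+", remainder) if w not in STOP_WORDS]
    return {"terms": terms, "phrases": phrases, "citations": citations}
```

For example, `extract_features("Summary judgment is proper under Anderson, 477 U.S. 242.")` yields the phrase `summary judgment`, the citation `477 U.S. 242`, and the remaining words with `is` removed.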
-
-
4. The method as set forth in claim 1, the step of generating relevance scores including the steps of:
-
converting features to terms;
generating, for each training concept, term frequency (TF) for each term, as number of occurrences of that term in that training concept;
generating, for each training concept, document frequency (DF) for each term, as total number of training concepts in which term appears;
generating inverse document frequency (IDF) for each term; and
generating a relevance score for each term for each concept.
-
-
5. A computer-implemented method of building a knowledge base for a legal topic classification system, the method comprising:
-
analyzing previously classified legal concepts to determine distinguishing features for each concept;
generating relevance scores for each feature in each training concept; and
storing features, topics, and relevance scores in a knowledge base, using an inverted index.
-
-
6. The method as set forth in claim 5, [...]:
-
converting features to terms;
generating, for each training concept, term frequency (TF) for each term, as number of occurrences of that term in that training concept;
generating, for each training concept, average term frequency of terms;
generating, for each training concept, document frequency (DF) for each term, as total number of training concepts in which term appears;
determining DBSIZE as total number of training concepts in knowledge base;
generating inverse document frequency (IDF) for each term; and
generating a relevance score for each term for each concept.
-
-
7. The method as set forth in claim 6, wherein the step of generating IDF is performed using the formula log((DBSIZE − DF + 0.5)/(DF + 0.05)).
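The TF, DF, DBSIZE, and IDF steps recited above can be sketched as follows. The IDF function uses the formula exactly as written in claim 7. Two points are assumptions made only for illustration: the logarithm is taken as the natural log, and the raw term frequency is used as the term-frequency weight. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def idf(dbsize, df):
    """IDF = log((DBSIZE - DF + 0.5) / (DF + 0.05)); natural log is an assumption."""
    return math.log((dbsize - df + 0.5) / (df + 0.05))


def score_concepts(concepts):
    """concepts: {concept_id: list of terms} -> {concept_id: {term: score}}."""
    dbsize = len(concepts)  # DBSIZE: total number of training concepts
    # TF: occurrences of each term within each training concept
    tf = {cid: Counter(terms) for cid, terms in concepts.items()}
    # DF: number of training concepts in which each term appears
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    # Assumption: raw TF stands in for the term-frequency weight.
    return {
        cid: {term: n * idf(dbsize, df[term]) for term, n in counts.items()}
        for cid, counts in tf.items()
    }
```

Under this formula a term that appears in more than roughly half of the training concepts receives a negative IDF, because the numerator DBSIZE − DF + 0.5 then falls below the denominator DF + 0.05.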
-
8. A computer-implemented method of processing an input concept from a document text to provide, from a topic scheme, a list of one or more topics that are relevant to the input concept, the method comprising:
-
analyzing the input concept to arrive at a set of distinguishing features;
converting candidate concept features to candidate terms;
searching a database of concepts, previously classified according to the topic scheme, for concepts similar to the input concept based on features;
ranking the similar concepts based on relevance score; and
voting on topics associated with the concepts within the database to form the list of topics relevant to the input concept.
-
-
9. The method as set forth in claim 8, [...]:
-
retrieving, for each training concept, relevance scores from a knowledge base for all candidate terms;
calculating total relevance score for each training concept, as a sum of candidate term relevance scores for that concept; and
sorting training concepts by total relevance scores.
-
-
10. The method as set forth in claim 9, the step of ranking further including, before the step of retrieving, the steps of:
-
sorting candidate terms by document frequency (DF) of each term, as number of knowledge base training concepts in which term occurs; and
reducing candidate term list to least common terms.
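The ranking steps (reducing the candidate terms to the least common ones, summing the stored relevance scores per training concept, and sorting) might look like the sketch below. It assumes the inverted index is a plain dictionary of the form term -> {concept_id: score}, so that a term's DF is the length of its posting dictionary. The function name, the `max_terms` cutoff, and the tie-breaking rule are hypothetical.

```python
def rank_concepts(candidate_terms, postings, max_terms=3):
    """Rank training concepts by the sum of their scores for the rarest candidate terms."""
    # Keep only candidate terms that occur in the knowledge base.
    known = [t for t in set(candidate_terms) if t in postings]
    # Sort by DF (number of training concepts containing the term), rarest first.
    rarest = sorted(known, key=lambda t: (len(postings[t]), t))[:max_terms]
    totals = {}
    for term in rarest:
        for concept_id, score in postings[term].items():
            totals[concept_id] = totals.get(concept_id, 0.0) + score
    # Highest total relevance score first; ties broken by concept id.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

Trimming the term list before retrieval keeps very common terms from pulling in most of the knowledge base, which appears to be the purpose of the DF sort in claim 10.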
-
-
11. The method as set forth in claim 8, the step of voting including the steps of:
-
retrieving topics associated with each training concept from a knowledge base;
grouping training concepts and scores by associated topics;
calculating a total topic relevance score for each topic, as a sum of training concept scores for each topic; and
sorting topics by total topic relevance score to create a topic list.
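The voting steps of claim 11 amount to grouping training-concept scores by topic, summing them, and sorting. A minimal sketch follows, assuming the ranked concepts arrive as (concept_id, score) pairs and the knowledge base exposes a mapping from concept id to topics. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def vote_on_topics(ranked_concepts, concept_topics):
    """Sum training-concept scores per topic and return topics sorted by total score."""
    topic_scores = defaultdict(float)
    for concept_id, score in ranked_concepts:
        # Each training concept votes for every topic it was classified under.
        for topic in concept_topics.get(concept_id, []):
            topic_scores[topic] += score
    # Highest total topic relevance score first; ties broken alphabetically.
    return sorted(topic_scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A topic supported by several moderately similar training concepts can therefore outrank a topic supported by a single highly similar one.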
-
-
12. The method as set forth in claim 11, further comprising, within a hierarchical topic scheme, the steps of:
-
grouping topics by tier;
weighting the topic list according to number of occurrences of each tier topic;
generating a final topic list using the weighted topic list; and
sorting the final topic list by tier.
-
-
13. The method as set forth in claim 11, the step of sorting including comparing each total topic relevance score to a threshold and eliminating from the topic list those topics having a total topic relevance score below the threshold.
-
14. The method as set forth in claim 11, the step of sorting including the steps of:
-
determining a number of times each topic occurs;
comparing the number to a threshold; and
eliminating from the topic list those topics having a number of occurrences below the threshold.
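Claims 13 and 14 describe two pruning rules for the topic list: one on total topic relevance score and one on how many times a topic occurs. The sketch below applies either or both, assuming the topic list is a sequence of (topic, score) pairs and the occurrence counts are supplied separately. The function name, parameter names, and threshold values are hypothetical; the claims do not state how thresholds are chosen.

```python
def filter_topics(topic_list, occurrences, min_score=None, min_count=None):
    """Drop topics whose total score or occurrence count is below a threshold."""
    kept = []
    for topic, score in topic_list:
        if min_score is not None and score < min_score:
            continue  # total topic relevance score below threshold (claim 13)
        if min_count is not None and occurrences.get(topic, 0) < min_count:
            continue  # topic occurs too few times (claim 14)
        kept.append((topic, score))
    return kept
```

Because the function preserves input order, a topic list that was already sorted by score stays sorted after filtering.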
-
-
15. A computer-implemented method of processing an input concept from a document text to provide, from a topic scheme incorporating a plurality of training concepts, a list of one or more topics that are relevant to the input concept, the method comprising:
-
retrieving topics associated with the training concepts from a knowledge base, the training concepts having been previously classified and scored in accordance with the topic scheme;
grouping training concepts and scores by associated topics;
calculating a total topic relevance score for each topic, as a sum of training concept scores for each topic; and
sorting topics by total topic relevance score to create a topic list relevant to the input concept.
-
-
16. The method as set forth in claim 15, [...]:
-
grouping topics by tier;
weighting the topic list according to number of occurrences of each tier topic;
generating a final topic list using the weighted topic list; and
sorting the final topic list by tier.
-
-
17. A computer-implemented method of processing an input concept from a document text to identify, within a knowledge base incorporating a plurality of training concepts, concepts similar to the input concept and to rank these similar concepts, the method comprising:
-
identifying features of the input concept as candidate terms;
retrieving, from the knowledge base, relevance scores for training concepts similar to the input concept;
calculating a total relevance score for each retrieved training concept, as a sum of candidate term relevance scores for that concept; and
sorting retrieved training concepts by total relevance scores.
-
-
18. A computer-implemented method of building a knowledge base for a legal topic classification system by identifying features within previously classified training concepts and generating relevance scores for these features, the method comprising the steps of:
-
converting the features into terms;
generating, for each training concept, term frequency (TF) for each term, as number of occurrences of that term in that training concept;
generating, for each training concept, average term frequency (AVE_TF) of terms;
generating, for each training concept, document frequency (DF) for each term, as total number of training concepts in which term appears;
determining training set DBSIZE as total number of training concepts in the knowledge base;
generating inverse document frequency (IDF) for each term; and
generating a relevance score for each term for each concept.
-
-
19. The method as set forth in claim 18, [...] where [...] and IDF = log((DBSIZE − DF + 0.5)/(DF + 0.05)).
-
-
20. The method as set forth in claim 18, wherein when a length of a current concept, doclength, is less than or equal to an average length of concepts in a set, aveDocLength, the relevance score is calculated using the formula TFwt × IDF, where [...] and IDF = log((DBSIZE − DF + 0.5)/(DF + 0.05)).
Specification