System and method for identifying facts and legal discussion in court case law documents
Abstract
A computer-implemented method of gathering large quantities of training data from case law documents (especially suitable for use as input to a learning algorithm that is used in a subsequent process of recognizing and distinguishing fact passages and discussion passages in additional case law documents) has steps of: partitioning text in the documents by headings in the documents, comparing the headings in the documents to fact headings in a fact heading list and to discussion headings in a discussion heading list, filtering from the documents the headings and text that is associated with the headings, and storing (on persistent storage in a manner adapted for input into the learning algorithm) fact training data and discussion training data that are based on the filtered headings and the associated text. Another method (of extracting features that are independent of specific machine learning algorithms needed to accurately classify case law text passages as fact passages or as discussion passages) has steps of: determining a relative position of the text passages in an opinion segment in the case law text, parsing the text passages into text chunks, comparing the text chunks to predetermined feature entities for possible matched feature entities, and associating the relative position and matched feature entities with the text passages for use by one of the learning algorithms. Corresponding apparatus and computer-readable memories are also provided.
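The training-data gathering steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the rule that a heading is a line written entirely in upper case, the two short heading sets, the JSON Lines storage format, and every function name are hypothetical choices made for the example.

```python
import json

# Hypothetical, deliberately tiny heading lists (assumption for illustration).
FACT_HEADINGS = {"facts", "background"}
DISCUSSION_HEADINGS = {"discussion", "analysis"}


def partition_by_headings(lines):
    """Group passages under the most recent heading.

    Assumption: a heading is a non-empty line written entirely in upper case.
    Text that appears before the first heading is ignored.
    """
    sections = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.isupper():
            sections.append((text, []))
        elif sections:
            sections[-1][1].append(text)
    return sections


def gather_training_data(lines):
    """Keep only passages whose heading matches one of the two lists."""
    records = []
    for heading, passages in partition_by_headings(lines):
        key = heading.lower()
        if key in FACT_HEADINGS:
            label = "fact"
        elif key in DISCUSSION_HEADINGS:
            label = "discussion"
        else:
            continue  # heading matches neither list, so it is filtered out
        records.extend({"label": label, "text": p} for p in passages)
    return records


def store_training_data(records, path):
    """Write one JSON object per line to persistent storage."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


opinion = [
    "FACTS",
    "The plaintiff was injured on May 1, 1999.",
    "",
    "DISCUSSION",
    "We review the ruling de novo.",
    "CONCLUSION",
    "Affirmed.",
]
records = gather_training_data(opinion)
```

In this example the passage under the CONCLUSION heading is dropped, because that heading appears in neither list; only passages under recognized headings become labeled training data.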
14 Claims
-
1. A method of gathering large quantities of training data from case law documents and of extracting features that are independent of specific machine learning algorithms needed to accurately classify case law text passages as fact passages or as discussion passages, the method comprising:
-
a) partitioning text passages within an opinion segment of a case law document by headings contained therein;
b) comparing the headings in the document;
1) to fact headings in a fact heading list, said fact headings in said fact heading list representing a specific set of predefined terms and phrases; and
2) to discussion headings in a discussion heading list, said discussion headings in said discussion heading list representing a specific set of predefined terms and phrases;
c) filtering from out of the document;
1) the headings in said document that match at least one of said fact headings and said discussion headings set forth in said fact heading list and said discussion heading list, respectively; and
2) text passages that are associated with the filtered headings;
d) categorizing the text passages as fact training data or as discussion training data based on the filtered headings associated with said text passages, and storing the fact training data and the discussion training data on persistent storage;
e) determining a relative position of the text passages in said opinion segment;
f) parsing the text passages into text chunks;
g) comparing the text chunks to predetermined feature entities for possible matched feature entities, said predetermined feature entities including at least five of;
i) a Case Cite format;
ii) a Statute Cite format;
iii) entities in a Past Tense Verb list;
iv) a Date format;
v) entities in a Signal Word list;
vi) entities in a This Court Phrases list;
vii) entities in a Lower Court Phrases list;
viii) entities in a Defendant Words list;
ix) entities in a Plaintiff Words list; and
x) entities in a Legal Phrases list;
h) associating the relative position and matched feature entities with the text passages, for use by one of the learning algorithms; and
i) classifying each of the text passages as at least one of a fact passage or a discussion passage based on the relative position and matched feature entities.
-
2. The method of claim 1, wherein the associating step includes:
associating the relative position and matched feature entities with the text passages, for use by a logistical regression learning algorithm.
-
3. The method of claim 1, wherein the associating step includes:
associating the relative position and matched feature entities with the text passages, for use by a naive Bayes learning algorithm.
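As an illustration of how associated features might be consumed by a naive Bayes learner, the following is a minimal sketch of a Bernoulli naive Bayes classifier with add-one smoothing. It assumes that each passage has already been reduced to a dictionary of binary features; the feature names, the hypothetical `early` flag standing in for a binarized relative position, and the function names are assumptions for the example and are not taken from the claims.

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Fit a Bernoulli naive Bayes model with add-one smoothing.

    rows: list of dicts mapping feature name -> truthy if the feature is present.
    labels: list of class labels, one per row.
    """
    features = sorted({name for row in rows for name in row})
    class_counts = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for name in features:
            if row.get(name):
                present[label][name] += 1
    model = {"features": features, "classes": {}}
    for label, count in class_counts.items():
        probs = {
            name: (present[label][name] + 1) / (count + 2) for name in features
        }
        model["classes"][label] = {"prior": count / len(labels), "probs": probs}
    return model


def classify(model, row):
    """Return the label with the highest log posterior for one passage."""
    best_label, best_score = None, -math.inf
    for label, params in model["classes"].items():
        score = math.log(params["prior"])
        for name in model["features"]:
            p = params["probs"][name]
            score += math.log(p if row.get(name) else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data; "early" is a hypothetical binarized relative position.
training_rows = [
    {"date": 1, "past_tense_verb": 1, "early": 1},
    {"date": 1, "past_tense_verb": 1, "early": 1},
    {"case_cite": 1, "this_court_phrase": 1},
    {"case_cite": 1, "legal_phrase": 1},
]
training_labels = ["fact", "fact", "discussion", "discussion"]
model = train_naive_bayes(training_rows, training_labels)
```

Because the features are kept separate from the learner, the same rows could be handed to a different algorithm without changing the extraction step.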
-
4. The method of claim 1, wherein each of the method steps is performed using computer-readable code.
-
5. The method of claim 1, wherein each fact heading in said fact heading list used in said step b) of comparing includes at least one word selected from the group consisting of:
- background, facts, factual, history, procedural, procedure, proceedings, nature, case and underlying.
-
6. The method of claim 1, wherein each discussion heading in said discussion heading list used in said step b) of comparing includes at least one word selected from the group consisting of:
- discussion, rule, issues and analysis.
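The heading comparison of step b), constrained by the word groups recited in claims 5 and 6, might be sketched as below. The two word groups are copied from those claims; the specific heading phrases, the normalization rule that strips outline numbering such as "I." or "B.", and the function names are hypothetical and serve only to illustrate the lookup.

```python
import re

# Word groups recited in claims 5 and 6.
FACT_WORDS = {
    "background", "facts", "factual", "history", "procedural",
    "procedure", "proceedings", "nature", "case", "underlying",
}
DISCUSSION_WORDS = {"discussion", "rule", "issues", "analysis"}

# Hypothetical heading lists; each phrase contains at least one group word.
FACT_HEADINGS = {
    "facts", "factual background", "procedural history", "nature of the case",
}
DISCUSSION_HEADINGS = {
    "discussion", "analysis", "issues", "discussion and analysis",
}


def normalize(heading):
    """Lower-case a heading and strip outline numbering and end punctuation."""
    text = heading.lower().strip()
    # Drop a leading enumerator such as "i.", "iv)", "b." or "2." (assumption).
    text = re.sub(r"^(?:[ivxlc]+|[a-z]|\d+)[.)]\s+", "", text)
    text = text.strip(" .:")
    return " ".join(text.split())


def heading_label(heading):
    """Return 'fact', 'discussion', or None when the heading is in neither list."""
    key = normalize(heading)
    if key in FACT_HEADINGS:
        return "fact"
    if key in DISCUSSION_HEADINGS:
        return "discussion"
    return None


def satisfies_word_group(headings, words):
    """Check that every heading phrase contains at least one word of the group."""
    return all(set(h.split()) & words for h in headings)
```

A heading such as "Conclusion" yields `None` here, which corresponds to a heading that is not filtered into either training set.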
-
7. The method of claim 1, wherein the step g) of comparing the text chunks to predetermined feature entities for possible matched feature entities includes comparing the text chunks to all ten of said predetermined feature entities listed in step g).
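Steps e) through h) can be illustrated with a small feature extractor covering all ten entity types named in step g). This is a sketch under stated assumptions only: the regular expressions, the short word and phrase lists, the definition of relative position as a value from 0.0 (first passage) to 1.0 (last passage), and all names are hypothetical. The chunking of step f) is not modeled; the patterns are applied directly to each passage.

```python
import re

_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Format-style entities, as hypothetical regular expressions.
FORMATS = {
    "case_cite": re.compile(r"\b\d+\s+[A-Z][A-Za-z.0-9]*\s+\d+\b"),
    "statute_cite": re.compile(r"\b\d+\s+U\.S\.C\.\s+§+\s*\d+"),
    "date": re.compile(r"\b(?:" + _MONTHS + r")\s+\d{1,2},\s+\d{4}\b"),
}

# List-style entities, as hypothetical (and far from complete) lists.
LISTS = {
    "past_tense_verb": {"was", "were", "filed", "alleged", "testified"},
    "signal_word": {"see", "accord", "compare"},
    "this_court_phrase": {"we hold", "we conclude", "this court"},
    "lower_court_phrase": {"the trial court", "the district court"},
    "defendant_word": {"defendant", "appellee"},
    "plaintiff_word": {"plaintiff", "appellant"},
    "legal_phrase": {"de novo", "summary judgment", "as a matter of law"},
}


def count_phrases(text, phrases):
    """Count whole-word, case-insensitive occurrences of any listed phrase."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", lowered)) for p in phrases
    )


def extract_features(passages):
    """Return one feature dict per passage of an opinion segment."""
    total = len(passages)
    rows = []
    for index, text in enumerate(passages):
        # Relative position: 0.0 for the first passage, 1.0 for the last.
        row = {"relative_position": index / max(total - 1, 1)}
        for name, pattern in FORMATS.items():
            row[name] = len(pattern.findall(text))
        for name, phrases in LISTS.items():
            row[name] = count_phrases(text, phrases)
        rows.append(row)
    return rows


passages = [
    "The plaintiff filed suit on May 1, 1999.",
    "We hold that summary judgment was proper. "
    "See Roe v. Wade, 410 U.S. 113; 42 U.S.C. § 1983.",
]
rows = extract_features(passages)
```

Each row holds plain numbers keyed by feature name, so the output does not depend on which learning algorithm later consumes it.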
-
8. An apparatus for gathering large quantities of training data from case law documents and for extracting features that are independent of specific machine learning algorithms needed to accurately classify case law text passages as fact passages or as discussion passages, the apparatus comprising:
-
a) means for partitioning text passages within an opinion segment of a case law document by headings contained therein;
b) means for comparing the headings in the document;
1) to fact headings in a fact heading list, said fact headings in said fact heading list representing a specific set of predefined terms and phrases; and
2) to discussion headings in a discussion heading list, said discussion headings in said discussion heading list representing a specific set of predefined terms and phrases;
c) means for filtering from out of the document;
1) the headings in said document that match at least one of said fact headings and said discussion headings set forth in said fact heading list and said discussion heading list, respectively;
2) text passages that are associated with the filtered headings; and
d) means for categorizing the text passages as fact training data or as discussion training data based on the filtered headings associated with said text passages, and storing the fact training data and the discussion training data on persistent storage;
e) means for determining a relative position of the text passages in said opinion segment;
f) means for parsing the text passages into text chunks;
g) means for comparing the text chunks to a list of predetermined feature entities for possible matched feature entities, said list of predetermined feature entities including at least five of;
i) a Case Cite format;
ii) a Statute Cite format;
iii) entities in a Past Tense Verb list;
iv) a Date format;
v) entities in a Signal Word list;
vi) entities in a This Court Phrases list;
vii) entities in a Lower Court Phrases list;
viii) entities in a Defendant Words list;
ix) entities in a Plaintiff Words list; and
x) entities in a Legal Phrases list; and
h) means for associating the relative position and matched feature entities with the text passages, for use by one of the learning algorithms to classify each of the text passages as at least one of a fact passage or a discussion passage based on the relative position and matched feature entities.
-
9. The apparatus of claim 8, wherein the associating means includes:
means for associating the relative position and matched feature entities with the text passages, for use by a logistical regression learning algorithm.
-
10. The apparatus of claim 8, wherein the associating means includes:
means for associating the relative position and matched feature entities with the text passages, for use by a naive Bayes learning algorithm.
-
11. The method of claim 8, wherein each of the method steps is performed using computer-readable code.
-
12. The apparatus of claim 8, wherein each fact heading in said fact heading list includes at least one word selected from the group consisting of:
- background, facts, factual, history, procedural, procedure, proceedings, nature, case and underlying.
-
13. The apparatus of claim 8, wherein each discussion heading in said discussion heading list includes at least one word selected from the group consisting of:
- discussion, rule, issues and analysis.
-
14. The apparatus of claim 8, wherein said list of predetermined feature entities includes all ten of a Case Cite format, a Statute Cite format, entities in a Past Tense Verb list, a Date format, entities in a Signal Word list, entities in a This Court Phrases list, entities in a Lower Court Phrases list, entities in a Defendant Words list, entities in a Plaintiff Words list, and entities in a Legal Phrases list.
Specification