Textual data classification method and apparatus
Abstract
A method and apparatus for classifying textual data are provided. The invention is adapted to automatically classify text. In particular, the invention utilizes a sparse vector framework to evaluate natural language text and to accurately and automatically assign that text to a predetermined classification. This can be done even where the disclosed system has not seen an example of the exact text before. The disclosed method and apparatus are particularly well-suited for coding adverse event reports, commonly referred to as “verbatims,” generated during clinical trials of pharmaceuticals. The invention also provides a method and apparatus that can be used to translate verbatims that have already been classified according to one coding scheme to another coding scheme in a highly automated process.
30 Claims
-
1. A system for assigning a natural language text to a class within a classification system, comprising:
-
inputting a natural language text to be classified;
identifying chunks within said natural language text having at least a first rank, wherein said chunks comprise n-grams including at least one of a complete natural language word and an abbreviated natural language word;
assigning a weight vector to identified n-grams for each of multiple classifications;
determining a count vector for each of said identified n-grams;
computing a scalar product of each of the count vectors and weight vectors assigned to identified n-grams for each of the multiple classifications;
computing a sum of said scalar products for each of the multiple classifications;
assigning the natural language text to the classification for which the highest sum of scalar products is computed;
wherein weight vectors are represented as sparse vectors;
wherein the weight vectors are determined by a process comprising initialization and iteration, wherein said classifications are related to a meaning of said chunks, and wherein the assigned classification is related to a meaning of the natural language text.
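As an illustration of the scoring steps recited in this claim, the following is a minimal Python sketch, not the patented implementation. It assumes that n-grams are whitespace-delimited word sequences of length one or two, that the counts and weights for a class can be combined as a single dot product indexed by n-gram, and that a sparse weight vector is a dictionary that simply omits zero entries. The names `extract_ngrams`, `classify`, and `WEIGHTS`, and the sample weight values, are hypothetical.

```python
from collections import Counter

def extract_ngrams(text, max_n=2):
    """Return a Counter of word n-grams (n = 1..max_n) found in the text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def classify(text, weights):
    """Score each class as a sparse dot product of counts and weights.

    weights maps class label -> {ngram: weight}; n-grams absent from a
    class's dictionary contribute zero. Returns (best_label, scores).
    """
    counts = extract_ngrams(text)
    scores = {}
    for label, sparse_w in weights.items():
        scores[label] = sum(c * sparse_w.get(g, 0.0) for g, c in counts.items())
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical weights for two illustrative classes (assumed values).
WEIGHTS = {
    "HEADACHE": {"headache": 2.0, "head": 0.5, "severe headache": 1.0},
    "NAUSEA": {"nausea": 2.0, "sick": 0.5},
}
```

Calling `classify("severe headache", WEIGHTS)` under these assumed weights selects the class with the highest summed score; only n-grams that actually occur in the text are visited, which is the practical benefit of the sparse representation.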
-
-
7. A method for determining weight vectors for use in classifying a natural language text within a classification system, comprising:
-
providing a set of natural language descriptions correctly classified in the classification system;
parsing the correctly classified natural language text into chunks;
ranking the chunks into hypothesized hierarchical importance;
identifying n-grams occurring more than once in a subset of highest ranking chunks;
creating weight vectors based on a frequency analysis for n-grams occurring more than once;
provisionally classifying the natural language text to classes based on the created weight vectors;
comparing the provisional classifications with the correct classifications; and
adjusting weight vectors to improve the accuracy of classification, wherein said classes relate to a meaning of said natural language text.
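The initialize-then-adjust procedure recited above can be sketched as follows. This is a hedged illustration only: the claim does not specify the frequency formula or the adjustment rule, so the sketch assumes a relative-frequency initialization and a perceptron-style update (raise the correct class, lower the wrongly chosen class). Chunk ranking is omitted, and all function names, the step size, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(text, max_n=2):
    """Count word n-grams of length 1..max_n in the text."""
    words = text.lower().split()
    return Counter(
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    )

def init_weights(examples):
    """Assumed initialization: share of an n-gram's occurrences in each class.

    N-grams that occur only once across all examples are dropped.
    """
    total = Counter()
    per_class = defaultdict(Counter)
    for text, label in examples:
        grams = ngrams(text)
        total.update(grams)
        per_class[label].update(grams)
    return {
        label: {g: c / total[g] for g, c in grams.items() if total[g] > 1}
        for label, grams in per_class.items()
    }

def score(text, weights):
    """Sum count * weight per class; missing weights count as zero."""
    counts = ngrams(text)
    return {
        label: sum(c * w.get(g, 0.0) for g, c in counts.items())
        for label, w in weights.items()
    }

def train(examples, epochs=10, step=0.1):
    """Provisionally classify, compare with the truth, adjust on errors."""
    weights = init_weights(examples)
    for _ in range(epochs):
        errors = 0
        for text, label in examples:
            scores = score(text, weights)
            guess = max(scores, key=scores.get)
            if guess != label:
                errors += 1
                for g, c in ngrams(text).items():
                    weights[label][g] = weights[label].get(g, 0.0) + step * c
                    weights[guess][g] = weights[guess].get(g, 0.0) - step * c
        if errors == 0:
            break
    return weights

# Hypothetical, hand-made training examples.
EXAMPLES = [
    ("severe headache", "HEADACHE"),
    ("mild headache", "HEADACHE"),
    ("nausea and vomiting", "NAUSEA"),
    ("mild nausea", "NAUSEA"),
    ("mild dizziness", "DIZZINESS"),
]
```

On this toy data the last example initially ties across classes and is provisionally misclassified, so the loop performs one adjustment before every example is classified correctly.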
-
-
13. A method for classifying the contents of natural language text, comprising:
-
receiving natural language text;
parsing said natural language text into at least a first chunk, wherein said at least a first chunk comprises a chunk of said natural language text that does not include any punctuation marks;
counting a number of instances of said first chunk to obtain a frequency for said first chunk;
multiplying said frequency for said first chunk by a first vector weight assigned to said first chunk with respect to a first classification related to a first meaning of said chunk to obtain a first product;
multiplying said frequency for said first chunk by a second vector weight assigned to said first chunk with respect to a second classification related to a second meaning of said chunk to obtain a second product;
determining a first sum, wherein said first sum is equal to a sum of all products associated with said first classification;
determining a second sum, wherein said second sum is equal to a sum of all products associated with said second classification;
in response to said first sum being greater than said second sum, assigning said natural language text to said first classification; and
in response to said second sum being greater than said first sum, assigning said natural language text to said second classification.
14. The method of claim 13, further comprising:
parsing said natural language text into at least a second chunk, wherein said at least a second chunk comprises at least a first punctuation mark;
counting a number of instances of said second chunk to obtain a frequency for said second chunk;
multiplying said frequency for said second chunk by a third vector weight assigned to said second chunk with respect to said first classification to obtain a third product;
multiplying said frequency for said second chunk by a fourth vector weight assigned to said second chunk with respect to said second classification to obtain a fourth product;
wherein said step of determining a first sum comprises calculating a sum of said first and third products; and
wherein said step of determining a second sum comprises calculating a sum of said second and fourth products.
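For the two-chunk case recited above, the arithmetic can be written compactly. The symbols are introduced here for illustration and do not appear in the claims: $f_1$ and $f_2$ are the frequencies of the first and second chunks, $w_1$ and $w_2$ are the first chunk's weights for the first and second classifications, and $w_3$ and $w_4$ are the second chunk's weights for the first and second classifications.

```latex
S_1 = f_1 w_1 + f_2 w_3, \qquad S_2 = f_1 w_2 + f_2 w_4
```

The text is assigned to the first classification when $S_1 > S_2$ and to the second when $S_2 > S_1$. With assumed values $f_1 = 2$, $f_2 = 1$, $w_1 = 0.5$, $w_2 = 0.1$, $w_3 = 0.2$, $w_4 = 0.6$, the sums are $S_1 = 1.2$ and $S_2 = 0.8$, so the first classification would be chosen.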
-
-
15. The method of claim 13, wherein said first and said second vector weights are represented as sparse vectors.
-
16. The method of claim 13, further comprising:
in response to a difference between said first sum and said second sum that is less than a predetermined amount, assigning said natural language text to a human coder for classification.
-
17. The method of claim 16, wherein at least one of said first sum and said second sum is non-zero, said method further comprising:
presenting said human coder with said classifications for which a corresponding sum is non-zero as a list of suggested classifications.
-
18. The method of claim 17, wherein said list is ordered according to a magnitude of each of said sums.
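The fallback to a human coder described in the three preceding claims might look like the sketch below. It is an assumption-laden illustration: the function name `route`, the return convention, and the threshold value are hypothetical, the comparison is made between the two highest sums, the suggestion list is ordered by descending sum, and the score dictionary is assumed to be non-empty.

```python
def route(scores, min_margin):
    """Decide between automatic assignment and human review.

    scores maps class label -> summed score. Returns ("auto", label) when
    the top score leads the runner-up by at least min_margin, otherwise
    ("human", suggestions), where suggestions lists the labels with a
    non-zero sum, highest sum first.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        suggestions = [label for label, s in ranked if s != 0]
        return "human", suggestions
    return "auto", ranked[0][0]
```

For example, `route({"A": 3.0, "B": 2.9, "C": 0.0}, 0.5)` defers to a human coder and suggests only the classes with non-zero sums, while a clear lead is assigned automatically.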
-
19. The method of claim 13, wherein each of said chunks comprises at least a first n-gram, wherein an n-gram is at least one of a natural language word and an abbreviation of a natural language word comprising at least one textual character.
-
20. The method of claim 13, wherein said first classification comprises a first adverse event category and wherein said second classification comprises a second adverse event category.
-
21. The method of claim 13, wherein said natural language text comprises an adverse event report.
-
22. A method for assigning a relevancy weight to a chunk of natural language text with respect to a plurality of classifications, comprising:
-
receiving a plurality of examples of natural language text, wherein each of said examples of natural language text belongs to at least one of a first classification and a second classification, and wherein said classifications are related to a meaning of said examples of natural language text;
with respect to each of said examples of natural language text, parsing said natural language text into at least a first chunk, wherein said at least a first chunk comprises a natural language word having at least one textual character;
assigning a rank to each chunk parsed from each of said examples of natural language text; and
assigning a weight value to each chunk having at least a selected rank, wherein a chunk found in an example of natural language text belonging to said first classification and in an example of natural language text belonging to said second classification is assigned a first weight with respect to said first classification and a second weight corresponding to said second classification.
23. The method of claim 22, further comprising:
storing at least a first weight value having a non-zero value for an associated chunk.
-
-
24. The method of claim 23, wherein weight values equal to zero are discarded.
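Storing only non-zero weights and discarding zeros, as the two preceding claims describe, corresponds to a sparse representation. A minimal sketch follows, assuming weights for one classification are held in a dictionary keyed by chunk; the helper names `to_sparse` and `weight_of` are hypothetical.

```python
def to_sparse(dense):
    """Drop zero-valued weights so only non-zero entries are stored."""
    return {chunk: w for chunk, w in dense.items() if w != 0}

def weight_of(sparse, chunk):
    """Look up a weight; chunks that were discarded read back as zero."""
    return sparse.get(chunk, 0.0)
```

Because a discarded entry reads back as zero, scoring results are unchanged while storage grows only with the number of chunks that carry information for a given classification.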
-
25. The method of claim 22, wherein chunks obtained from text belonging to a one of said classifications that are unique are discarded.
-
26. The method of claim 22, wherein chunks having more than one punctuation mark are discarded.
-
27. The method of claim 22, wherein only chunks having at most one punctuation mark are assigned a rank.
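The chunk-filtering conditions in the three preceding claims can be illustrated with a short sketch. It assumes chunks are already available as strings, treats the ASCII characters in Python's `string.punctuation` as the punctuation marks, and reads "unique" as "occurring only once in the list supplied for one classification"; the claims do not fix any of these details, and the function names are hypothetical.

```python
import string
from collections import Counter

def punctuation_count(chunk):
    """Count ASCII punctuation characters in a chunk."""
    return sum(1 for ch in chunk if ch in string.punctuation)

def rankable_chunks(chunks, max_punct=1):
    """Keep only chunks with at most max_punct punctuation marks."""
    return [c for c in chunks if punctuation_count(c) <= max_punct]

def drop_unique(chunks):
    """Discard chunks that occur only once in the supplied list."""
    counts = Counter(chunks)
    return [c for c in chunks if counts[c] > 1]
```

For instance, `rankable_chunks(["severe headache", "headache, mild", "pain, nausea, vomiting"])` keeps the first two chunks and drops the third, which contains two commas.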
-
28. The method of claim 22, further comprising:
-
providing a natural language text belonging to said first classification;
dividing said provided natural language text into chunks;
totaling weight values associated with said chunks obtained from said provided natural language text, wherein a total weight value for said chunks with respect to said first classification is less than said total weight value for said chunks with respect to said second classification; and
adjusting said assigned weight values.
-
-
29. The method of claim 28, wherein said step of adjusting said assigned weight values comprises at least one of raising a weight value for at least a first of said chunks with respect to said first classification and lowering a weight value for said at least a first of said chunks with respect to said second classification.
-
30. The method of claim 28, wherein said first classification comprises a plurality of subclassifications, wherein said first classification is directed to a first human physiological system and wherein said plurality of subclassifications are directed to conditions related to said first human physiological system.
Specification