Textual data classification method and apparatus

US 6,507,829 B1
Filed: 01/17/2000
Issued: 01/14/2003
Est. Priority Date: 06/18/1999
Status: Expired due to Fees

Abstract

A method and apparatus for classifying textual data is provided. The invention is adapted to automatically classify text. In particular, the invention utilizes a sparse vector framework to evaluate natural language text and to accurately and automatically assign that text to a predetermined classification. This can be done even where the disclosed system has not seen an example of the exact text before. The disclosed method and apparatus are particularly well-suited for coding adverse event reports, commonly referred to as “verbatims,” generated during clinical trials of pharmaceuticals. The invention also provides a method and apparatus for translating verbatims that have already been classified according to one coding scheme to another coding scheme in a highly automated process.

214 Citations

30 Claims

1. A system for assigning a natural language text to a class within a classification system, comprising:
- inputting a natural language text to be classified;
  
  identifying chunks within said natural language text having at least a first rank, wherein said chunks comprise n-grams including at least one of a complete natural language word and an abbreviated natural language word;
  
  assigning a weight vector to identified n-grams for each of multiple classifications determining a count vector for each of said identified n-grams;
  
  computing a scalar product of each of the count vectors and weight vectors assigned to identified n-grams for each of the multiple classifications;
  
  computing a sum of said scalar products for each of the multiple classifications;
  
  assigning the natural language text to the classification for which the highest sum of scalar products is computed;
  
  wherein weight vectors are represented as sparse vectors;
  
  wherein the weight vectors are determined by a process comprising initialization and iteration, wherein said classifications are related to a meaning of said chunks, and wherein the assigned classification is related to a meaning of the natural language text.
- Dependent claims (2, 3, 4, 5, 6, 12):
  - 2. A system, as claimed in claim 1, wherein chunks including the entire inputted natural language text are ranked with the highest importance.
  - 3. A system, as claimed in claim 1, wherein a subset of the highest ranking chunks is used to identify n-grams occurring more than once in the subset of highest ranking chunks.
  - 4. A system, as claimed in claim 1, wherein the process of determining weight vectors includes n-gram frequency analysis of the chunks for chunks occurring more than once.
  - 5. A system, as claimed in claim 4, wherein the process of n-gram frequency analysis is implemented for chunks longer than a given threshold.
  - 6. A system, as claimed in claim 1, further comprising, before the step of computing n-grams from the natural language description, normalizing the natural language description.
  - 12. The system of claim 1, wherein a weight vector assigned to at least one of said identified n-grams is zero.
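
The scoring recited in claim 1 can be illustrated with a minimal sketch. It assumes word-level n-grams, a plain per-text count for each n-gram (a simplification of the claimed count vectors), and a nested dictionary as the sparse weight store. The names `ngrams`, `classify`, and `WEIGHTS`, and all weight values, are hypothetical and are not taken from the patent.

```python
from collections import Counter

def ngrams(tokens, max_n=2):
    """Yield word n-grams (as tuples) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def classify(text, weights, max_n=2):
    """Return (best_class, scores) for `text`.

    `weights` maps n-gram -> {class: weight}. Missing entries count as
    zero, which is what keeps the representation sparse.
    """
    counts = Counter(ngrams(text.lower().split(), max_n))
    scores = {}
    for gram, count in counts.items():
        for cls, w in weights.get(gram, {}).items():
            # Accumulate count * weight into the per-class sum.
            scores[cls] = scores.get(cls, 0.0) + count * w
    best = max(scores, key=scores.get) if scores else None
    return best, scores

# Hypothetical weights, for illustration only.
WEIGHTS = {
    ("headache",): {"HEADACHE": 2.0},
    ("severe",): {"HEADACHE": 0.25, "PAIN": 0.25},
    ("pain",): {"PAIN": 1.5},
    ("chest", "pain"): {"CHEST PAIN": 3.0, "PAIN": 0.5},
}

best, scores = classify("severe chest pain", WEIGHTS)
```

Here the bigram `("chest", "pain")` carries enough weight for the text to land in the hypothetical class `CHEST PAIN`, even though the unigrams alone favor `PAIN`. N-grams with no stored weight, such as `("chest",)`, contribute nothing to any sum.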

7. A method for determining weight vectors for use in classifying a natural language text within a classification system, comprising:
- providing a set of natural language descriptions correctly classified in the classification system;
  
  parsing the correctly classified natural language text into chunks;
  
  ranking the chunks into hypothesized hierarchical importance;
  
  identifying n-grams occurring more than once in a subset of highest ranking chunks;
  
  creating weight vectors based on a frequency analysis for n-grams occurring more than once;
  
  provisionally classifying the natural language text to classes based on the created weight vectors;
  
  comparing the provisional classifications with the correct classifications; and
  
  adjusting weight vectors to improve the accuracy of classification, wherein said classes relate to a meaning of said natural language text.
- Dependent claims (8, 9, 10, 11):
  - 8. A method, as claimed in claim 7, wherein the n-grams are used in classification for n values higher than 3.
  - 9. A method, as claimed in claim 7, wherein the frequency analysis is implemented for chunks longer than a given threshold.
  - 10. A method, as claimed in claim 7, wherein chunks including an entire corresponding natural language description are ranked with the highest importance.
  - 11. A method, as claimed in claim 7, further comprising, before the step of parsing the natural language text, normalizing the text.
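
The first steps of claim 7 (parsing, ranking, and frequency-based weight creation) might be sketched as follows. This is only one possible reading: the three-level ranking (whole description, punctuation-delimited segment, single word), the treatment of chunks themselves as the counted units, and the use of per-class occurrence share as the initial weight are all assumptions made for illustration, as are the names `ranked_chunks`, `initial_weights`, and `EXAMPLES`.

```python
import re
from collections import Counter, defaultdict

def ranked_chunks(text):
    """Return (rank, chunk) pairs; a lower rank means higher assumed importance."""
    norm = " ".join(text.lower().split())            # simple normalization
    candidates = [(0, norm)]                          # whole description ranks highest
    candidates += [(1, seg.strip()) for seg in re.split(r"[,;.]", norm)]
    candidates += [(2, w) for w in re.findall(r"[a-z0-9]+", norm)]
    seen, chunks = set(), []
    for rank, chunk in candidates:
        if chunk and chunk not in seen:               # keep each chunk once, at its best rank
            seen.add(chunk)
            chunks.append((rank, chunk))
    return chunks

def initial_weights(examples, max_rank=2):
    """Build sparse weights: chunk -> {class: share of the chunk's occurrences}."""
    per_class = defaultdict(Counter)
    total = Counter()
    for text, cls in examples:
        for rank, chunk in ranked_chunks(text):
            if rank <= max_rank:
                per_class[chunk][cls] += 1
                total[chunk] += 1
    return {
        chunk: {cls: n / total[chunk] for cls, n in by_cls.items()}
        for chunk, by_cls in per_class.items()
        if total[chunk] > 1                           # only chunks seen more than once
    }

# Hypothetical, correctly classified training examples.
EXAMPLES = [
    ("headache, severe", "HEADACHE"),
    ("severe headache", "HEADACHE"),
    ("severe chest pain", "CHEST PAIN"),
]
weights = initial_weights(EXAMPLES)
```

With these three examples only `headache` and `severe` occur more than once, so they are the only chunks that receive initial weights. The remaining steps of the claim (provisional classification, comparison, and adjustment) would then refine these values.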

13. A method for classifying the contents of natural language text, comprising:
- receiving natural language text;
  
  parsing said natural language text into at least a first chunk, wherein said at least a first chunk comprises a chunk of said natural language text that does not include any punctuation marks;
  
  counting a number of instances of said first chunk to obtain a frequency for said first chunk;
  
  multiplying said frequency for said first chunk by a first vector weight assigned to said first chunk with respect to a first classification related to a first meaning of said chunk to obtain a first product;
  
  multiplying said frequency for said first chunk by a second vector weight assigned to said first chunk with respect to a second classification related to a second meaning of said chunk to obtain a second product;
  
  determining a first sum, wherein said first sum is equal to a sum of all products associated with said first classification;
  
  determining a second sum, wherein said second sum is equal to a sum of all products associated with said second classification;
  
  in response to said first sum being greater than said second sum, assigning said natural language text to said first classification; and
  
  in response to said second sum being greater than said first sum, assigning said natural language text to said second classification.
- Dependent claims (14, 15, 16, 17, 18, 19, 20, 21):
  - 14. The method of claim 13, further comprising:
  - 15. The method of claim 13, wherein said first and said second vector weights are represented as sparse vectors.
  - 16. The method of claim 13, further comprising:
    - in response to a difference between said first sum and said second sum that is less than a predetermined amount, assigning said natural language text to human coder for classification.
  - 17. The method of claim 16, wherein at least one of said first sum and said second sum is non-zero, said method further comprising:
    - presenting said human coder with said classifications for which a corresponding sum is non-zero as a list of suggested classifications.
  - 18. The method of claim 17, wherein said list is ordered according to a magnitude of each of said sums.
  - 19. The method of claim 13, wherein each of said chunks comprises at least a first n-gram, wherein an n-gram is at least one of a natural language word and an abbreviation of a natural language word comprising at least one textual character.
  - 20. The method of claim 13, wherein said first classification comprises a first adverse event category and wherein said second classification comprises a second adverse event category.
  - 21. The method of claim 13, wherein said natural language text comprises an adverse event report.
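
The referral logic of claims 16 to 18 can be pictured with a short sketch. It assumes the per-class sums have already been computed and compares only the two highest sums. The function name `route`, the `min_margin` parameter, and its default value are hypothetical stand-ins for the claimed "predetermined amount".

```python
def route(scores, min_margin=0.5):
    """Decide between automatic coding and referral to a human coder.

    `scores` maps classification -> summed products for one text.
    Returns ("auto", best_class) or ("human", suggestions), where the
    suggestions are the classes with non-zero sums, largest sum first.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return "human", []                  # nothing matched: no suggestions to offer
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        # Top two sums are too close: refer, with an ordered suggestion list.
        return "human", [cls for cls, total in ranked if total != 0]
    return "auto", ranked[0][0]

# Hypothetical sums for two texts.
clear = route({"CHEST PAIN": 3.0, "PAIN": 2.25, "HEADACHE": 0.25})
close = route({"PAIN": 1.5, "CHEST PAIN": 1.25, "HEADACHE": 0.0})
```

In the first call the margin between the two highest sums is 0.75, so the text is coded automatically. In the second the margin is 0.25, so the text goes to a human coder with `PAIN` and `CHEST PAIN` suggested in that order, and the zero-sum class is omitted.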

22. A method for assigning a relevancy weight to a chunk of natural language text with respect to a plurality of classifications, comprising:
- receiving a plurality of examples of natural language text, wherein each of said examples of natural language text belongs to at least one of a first classification and a second classification, and wherein said classifications are related to a meaning of said examples of natural language text;
  
  with respect to each of said examples of natural language text, parsing said natural language text into at least a first chunk, wherein said at least a first chunk comprises a natural language word having at least one textual character;
  
  assigning a rank to each chunk parsed from each of said examples of natural language text; and
  
  assigning a weight value to each chunk having at least a selected rank, wherein a chunk found in an example of natural language text belonging to said first classification and in an example of natural language text belonging to said second classification is assigned a first weight with respect to said first classification and a second weight corresponding to said second classification.
- Dependent claims (23, 24, 25, 26, 27, 28, 29, 30):
  - 23. The method of claim 22, further comprising:
  - 24. The method of claim 23, wherein weight values equal to zero are discarded.
  - 25. The method of claim 22, wherein chunks obtained from text belonging to a one of said classifications that are unique are discarded.
  - 26. The method of claim 22, wherein chunks having more than one punctuation mark are discarded.
  - 27. The method of claim 22, wherein only chunks having at most one punctuation mark are assigned a rank.
  - 28. The method of claim 22, further comprising:
    - providing a natural language text belonging to said first classification;

      dividing said provided natural language text into chunks;

      totaling weight values associated with said chunks obtained from said provided natural language text, wherein a total weight value for said chunks with respect to said first classification is less than said total weight value for said chunks with respect to said second classification; and

      adjusting said assigned weight values.
  - 29. The method of claim 28, wherein said step of adjusting said assigned weight values comprises at least one of raising a weight value for at least a first of said chunks with respect to said first classification and lowering a weight value for said at least a first of said chunks with respect to said second classification.
  - 30. The method of claim 28, wherein said first classification comprises a plurality of subclassifications, wherein said first classification is directed to a first human physiological system and wherein said plurality of subclassifications are directed to conditions related to said first human physiological system.
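
The adjustment described in claims 28 and 29 might look like the sketch below. The claims do not give an update rule here, so the fixed step size and the choice to both raise and lower weights on every chunk are assumptions; the result resembles a perceptron-style correction and is not presented as the patented procedure. The names `total`, `adjust`, and `step` are hypothetical.

```python
def total(chunks, weights, cls):
    """Sum the weights of `chunks` with respect to class `cls` (missing = 0)."""
    return sum(weights.get(c, {}).get(cls, 0.0) for c in chunks)

def adjust(chunks, correct, weights, classes, step=0.25):
    """One correction pass for an example whose true class is `correct`.

    For each rival class whose total exceeds that of `correct`, raise every
    chunk's weight for `correct` and lower it for the rival by `step`.
    Mutates and returns `weights`.
    """
    right = total(chunks, weights, correct)
    for rival in classes:
        if rival != correct and total(chunks, weights, rival) > right:
            for c in chunks:
                entry = weights.setdefault(c, {})
                entry[correct] = entry.get(correct, 0.0) + step
                entry[rival] = entry.get(rival, 0.0) - step
    return weights

# Hypothetical starting weights that misclassify the example below.
weights = {
    "severe": {"HEADACHE": 0.75, "CHEST PAIN": 0.25},
    "chest": {"CHEST PAIN": 0.25},
}
chunks = ["severe", "chest"]
adjust(chunks, "CHEST PAIN", weights, ["HEADACHE", "CHEST PAIN"])
```

Before the pass the example totals 0.75 for `HEADACHE` and 0.5 for `CHEST PAIN`; afterwards it totals 0.25 and 1.0, so the correct class wins. Repeating such passes over a training set until no example is misclassified is one way to realize the "iteration" mentioned in claim 1.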

Current Assignee: PPD Development (Smolensk) LLC (Thermo Fisher Scientific Incorporated)
Original Assignee: PPD Development LP (Thermo Fisher Scientific Incorporated)
Inventors: Kornai, Andras; Richards, Jon Michael
Primary Examiner(s): Black, Thomas
Assistant Examiner(s): Hirl, Joseph P.
Application Number: US09/483,828
Time in Patent Office: 1,093 Days
Field of Search: 706/45, 706/46, 706/12, 700/49
US Class Current: 706/45
CPC Class Codes:
- G06F 16/353: into predefined classes
- G06F 40/216: using statistical methods
- G06F 40/289: Phrasal analysis, e.g. fini...
