Text-classification system and method

US 7,016,895 B2
Filed: 02/25/2003
Issued: 03/21/2006
Est. Priority Date: 07/05/2002
Status: Expired due to Fees

Abstract

Disclosed are computer-readable code, a system, and a method for classifying a target document in the form of a digitally encoded natural-language text as belonging to one or more of two or more different classes. Each of a plurality of non-generic words and, optionally, word groups characterizing the target document is selected as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in a field, where the selectivity value of a term is a measure of the field-specificity of that term. There is then determined, for each of a plurality of sample texts having associated classification identifiers, a match score related to the number of descriptive terms present in or derived from that text that match those in the target text. From the selected matched texts and the associated classification identifiers, a classification determination of the target document is made.
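
The selection of descriptive terms described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy libraries, the generic-word list, the use of per-text document frequency, the `floor` that avoids division by zero, and the threshold of 2.0 are not taken from the patent, which only requires that selectivity relate a term's frequency in one library to its frequency in the others.

```python
from collections import Counter

def term_frequencies(library):
    """Fraction of texts in a library that contain each word."""
    counts = Counter()
    for text in library:
        counts.update(set(text.lower().split()))
    return {w: c / len(library) for w, c in counts.items()}

def selectivity(word, field, freqs, floor=0.01):
    """Frequency of `word` in `field` relative to its mean frequency in the other fields.

    `floor` is an illustrative guard against division by zero.
    """
    own = freqs[field].get(word, 0.0)
    others = [f.get(word, 0.0) for name, f in freqs.items() if name != field]
    return own / max(sum(others) / len(others), floor)

def descriptive_terms(target, freqs, generic, threshold=2.0):
    """Keep non-generic target words whose selectivity exceeds the threshold in any field."""
    selected = {}
    for word in set(target.lower().split()) - generic:
        best = max(selectivity(word, field, freqs) for field in freqs)
        if best > threshold:
            selected[word] = best
    return selected

# Hypothetical two-field corpus; real libraries would hold many texts per field.
LIBRARIES = {
    "surgery": ["catheter with balloon tip", "balloon catheter for a stent",
                "suture device with a needle"],
    "computing": ["memory device with a cache", "cache for a processor",
                  "processor with a memory bus"],
}
GENERIC = {"a", "with", "for", "the", "and"}
FREQS = {name: term_frequencies(texts) for name, texts in LIBRARIES.items()}
TERMS = descriptive_terms("a catheter device with a balloon", FREQS, GENERIC)
print(sorted(TERMS))
```

In this toy corpus "device" occurs equally often in both fields, so its ratio is 1 and it is not selected, while "catheter" and "balloon" occur only in the surgery library and pass the threshold.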

26 Claims

1. A computer-executed method for classifying a target document in the form of a digitally encoded natural-language text into one or more of two or more different classes, comprising the steps of:
- (a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the target document, selecting a term as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in a field, where the selectivity value of the term in the library of texts in the field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively,
- (b) determining for each of a plurality of sample texts, a match score related to the number of descriptive terms present in or derived from the text that match those in the target document, where each of the plurality of sample texts has an associated classification identifier that identifies the one or more different classes to which the text belongs,
- (c) selecting one or more of the sample texts having the highest match scores,
- (d) recording the one or more classification identifiers associated with the one or more sample texts having the highest match scores, and
- (e) associating the one or more classification identifiers from step (d) with the target document, thereby to classify the target document as belonging to one or more classes represented by at least one of the classification identifiers from step (d).
- - 2. The method of claim 1, wherein the sample texts are texts in the libraries of texts from which the selectivity values of target terms are determined.
  - 3. The method of claim 2, wherein said steps (a) and (b) each includes accessing a text database containing said libraries of texts, in processed or unprocessed form, where each text is associated with a text identifier, a library identifier, and a classification identifier.
  - 4. The method of claim 2, wherein said steps (a) and (b) each includes accessing a database of word records, where each record includes text identifiers of the library texts that contain the word, associated library and classification identifiers for each text, and optionally, one or more selectivity values associated with the word.
  - 5. The method of claim 4, wherein carrying out the step of selecting descriptive words in target text includes (i) accessing said database to identify at least one selectivity value associated with each non-generic target word, and (ii) selecting the word as the descriptive word if at least one of its selectivity values is above the threshold value.
  - 6. The method of claim 4, wherein carrying out the step of selecting descriptive word groups in a target text includes (i) accessing the database to identify text and library identifiers for each non-generic word in the target text, (ii) using the identified text and library identifiers to calculate one or more selectivity values for the word, and (iii) selecting the word as the descriptive word if at least one of the calculated selectivity values is above the threshold value.
  - 7. The method of claim 4, wherein carrying out the step of determining match scores includes (i) accessing the database, to identify library texts associated with each descriptive word in the target text, and (ii) from the identified texts recorded in step (i), determining a text match score based on the number of descriptive words in the text, weighted by the selectivity values of the matching words.
  - 8. The method of claim 4, wherein said database further includes, for each word record, word-position identifiers, and carrying out the step of selecting a word group with an above-threshold selectivity value includes (i) accessing said database to identify texts and associated library and word-position identifiers associated with the word group, (ii) from the identified texts, library identifiers, and word-position identifiers recorded in step (i) determining one or more selectivity values for the word group, and (iii) identifying a wordpair as a descriptive word group if at least one of its selectivity values is above a selected threshold value.
  - 9. The method of claim 8, wherein carrying out the step of determining match scores includes (i) recording the texts associated with each descriptive word group, and (ii) determining a text match score based, at least in part, on the number of descriptive word groups in the text, weighted by the selectivity values of such word groups.
  - 10. The method of claim 1, wherein the selectivity value associated with a term is related to the greatest selectivity value determined with respect to each of a plurality N>2 of libraries of texts in different fields.
  - 11. The method of claim 1, wherein the selectivity value assigned to the descriptive term is a root function of the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and the match score is weighted by the selectivity values of the matching terms.
  - 12. The method of claim 11, wherein the root function is between 2, the square root function, and 3, the cube root function.
  - 13. The method of claim 4, wherein said database further includes, for each word record, word-position identifiers, and carrying out the step of selecting a word group with the above-threshold selectivity value includes (i) accessing said database to identify texts and associated library and word-position identifiers associated with the word group, (ii) from the identified texts, library identifiers, and word-position identifiers recorded in step (i) determining one or more selectivity values for the word group, and (iii) identifying a wordpair as a descriptive word group if at least one of its selectivity values is above a selected threshold value.
  - 14. The method of claim 1, wherein each library of texts contains texts with multiple different classification identifiers.
  - 15. The method of claim 1, wherein said sample texts and corresponding classification identifiers are selected from the group consisting of:
    - (a) libraries of different-field patent texts, and said classification identifier includes at least one patent class and, optionally, at least one patent subclass;
    - (b) libraries of different-field research grant proposals or reports, and said classification identifier includes a research funding class within an agency;
    - (c) libraries of case reports or head notes relating to different legal topics, and said classification identifier includes one or more different legal topics; and
    - (d) libraries of different-field scientific or technical texts, and said classification identifier includes at least one of a plurality of different science or technology field classifications.
  - 21. The system of claim 1, wherein said library texts and corresponding classification identifiers are selected from the group consisting of:
    - (a) libraries of different-field patent texts, and said classification identifier includes at least one patent class and, optionally, at least one patent subclass;
    - (b) libraries of different-field research grant proposals or reports, and said classification identifier includes a research funding class within the agency;
    - (c) libraries of case reports or head notes relating to different legal topics, and said classification identifier includes one or more different legal topics; and
    - (d) libraries of different-field scientific or technical texts, and said classification identifier includes at least one of a plurality of different science or technology field classifications.
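
Steps (b) through (e) of claim 1, with the root-function weighting of claims 11 and 12, might be sketched as follows. This is an illustrative reading, not the patented implementation: the function names, the default root of 2.5, the `top_n` cutoff, the sample texts, and the class identifiers are all hypothetical.

```python
def root_selectivity(ratio, root=2.5):
    """Damp a raw frequency ratio with a root function (2 = square root, 3 = cube root)."""
    return ratio ** (1.0 / root)

def match_score(target_terms, sample_words):
    """Sum the weights of the descriptive target terms that also occur in the sample text."""
    return sum(weight for term, weight in target_terms.items() if term in sample_words)

def classify(target_terms, samples, top_n=2):
    """Score every sample, keep the top matches, and return their class identifiers."""
    scored = sorted(samples, key=lambda s: match_score(target_terms, s["words"]),
                    reverse=True)
    best = [s for s in scored[:top_n] if match_score(target_terms, s["words"]) > 0]
    return sorted({s["class_id"] for s in best})

# Hypothetical descriptive terms of a target document, weighted by damped selectivity.
TARGET_TERMS = {
    "catheter": root_selectivity(64.0, root=2),  # square root
    "balloon": root_selectivity(27.0, root=3),   # cube root
}
# Hypothetical sample texts, each reduced to a word set plus a made-up class identifier.
SAMPLES = [
    {"class_id": "604", "words": {"balloon", "catheter", "stent"}},
    {"class_id": "606", "words": {"catheter", "suture"}},
    {"class_id": "711", "words": {"memory", "cache"}},
]
print(classify(TARGET_TERMS, SAMPLES))
```

Taking a root of the ratio compresses very large selectivity values, so a single highly field-specific term does not dominate the match score.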

16. An automated system for classifying a target document in the form of a digitally encoded text as belonging to one or more of a plurality of different classes, comprising
- (1) a computer,
- (2) accessible by said computer, a database of word records, where each record includes text identifiers of the library texts that contain the word, associated library and classification identifiers for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively,
- (3) a computer readable code which is operable, under the control of said computer, to perform the steps of
  - (a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups characterizing the target document, selecting the term as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in a field, by (i) accessing said database and (ii) calculating or recording from the database, the selectivity value associated with the term,
  - (b) determining for each of the plurality of library texts, a match score related to the number of descriptive terms present in or derived from the text that match those in the target document,
  - (c) selecting one or more of the library texts having the highest match scores,
  - (d) recording the one or more classification identifiers associated with the one or more library texts having the highest match scores, and
  - (e) associating the one or more classification identifiers from step (d) with the target document, thereby to classify the target document as belonging to at least one class represented by the classification identifiers from step (d).
- - 17. The system of claim 16, wherein said code is operable, in carrying out the step of selecting a descriptive word in a target text, to (i) access the database to identify text and library identifiers for each non-generic word in the target text, (ii) use the identified text and library identifiers to calculate one or more selectivity values for the word, and (iii) select the word as the descriptive word if at least one of the selectivity values is above the threshold value.
  - 19. The system of claim 16, wherein said database further includes, for each word record, word-position identifiers, and said code is operable, in carrying out the step of selecting the word group with the above-threshold selectivity value, to (i) access said database to identify texts and associated word-position identifiers associated with the word group, (ii) from the identified texts and word-position identifiers recorded in step (i) determine one or more selectivity values for the word group, and (iii) identify the word group as a descriptive wordpair if at least one of its selectivity values is above the selected threshold value.
  - 20. The system of claim 19, wherein said code is operable, in carrying out the step of determining match scores, to (i) record the texts associated with each descriptive wordpair, and (ii) determine a text match score based, at least in part, on the number of descriptive word groups in the text, weighted by the selectivity values for such word groups.

18. The system of claim 16, wherein said code is operable, in carrying out the step of determining match scores, to (i) access the database, to identify library texts associated with each descriptive word in the target text, and (ii) from the identified texts recorded in step (i), determine a text match score based on the number of descriptive words in a text, weighted by the selectivity values of the matching words.
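
The database of word records recited in claim 16, and the score accumulation of claim 18, resemble an inverted index. A minimal sketch under that assumption follows; the record layout (a set of `(text_id, library_id, class_id)` triples per word), the function names, and the sample data are hypothetical.

```python
from collections import defaultdict

def build_word_records(texts):
    """Map each word to the (text_id, library_id, class_id) triples of texts containing it."""
    records = defaultdict(set)
    for text_id, (library_id, class_id, body) in texts.items():
        for word in set(body.lower().split()):
            records[word].add((text_id, library_id, class_id))
    return records

def match_scores(descriptive_words, records):
    """Accumulate, per library text, the weights of the descriptive words it contains."""
    scores = defaultdict(float)
    for word, weight in descriptive_words.items():
        for text_id, _library_id, _class_id in records.get(word, ()):
            scores[text_id] += weight
    return dict(scores)

# Hypothetical library texts: text_id -> (library_id, class_id, body).
TEXTS = {
    "t1": ("surgery", "604", "balloon catheter"),
    "t2": ("surgery", "606", "suture needle catheter"),
    "t3": ("computing", "711", "cache memory"),
}
RECORDS = build_word_records(TEXTS)
SCORES = match_scores({"catheter": 2.0, "balloon": 3.0}, RECORDS)
print(SCORES)
```

Looking up each descriptive word once and adding its weight to every text that contains it avoids scanning the full library for each target document.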

22. Computer readable code for use with an electronic computer and a database of word records in classifying a target document in the form of a digitally encoded text as belonging to one or more of a plurality of different classes, where each record in the word records database includes text identifiers of the library texts that contain the word, an associated library identifier for each text, an associated classification identifier for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, said code being operable, under the control of said computer, to perform the steps of
- (a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups characterizing the target document, selecting the term as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in the field, by (i) accessing said database and (ii) calculating or recording from the database, the selectivity value associated with the term,
- (b) determining for each of the plurality of library texts, a match score related to the number of descriptive terms present in or derived from the text that match those in the target document,
- (c) selecting one or more of the library texts having the highest match scores,
- (d) recording the one or more classification identifiers associated with the one or more library texts having the highest match scores, and
- (e) associating the one or more classification identifiers from step (d) with the target document, thereby to classify the target document as belonging to at least one class represented by the classification identifiers from step (d).
- - 23. The code of claim 22, which is operable, in carrying out the step of selecting a descriptive word in a target text, to (i) access the database to identify text and library identifiers for each non-generic word in the target text, (ii) use the identified text and library identifiers to calculate one or more selectivity values for the word, and (iii) select a word as a descriptive word if at least one of the selectivity values is above the threshold value.
  - 24. The code of claim 22, which is operable, in carrying out the step of determining match scores, to (i) access the database, to identify library texts associated with each descriptive word in the target text, and (ii) from the identified texts recorded in step (i), determine a text match score based on the number of descriptive words in a text, weighted by the selectivity values of the matching words.
  - 25. The code of claim 22, wherein said database further includes, for each word record, word-position identifiers, and said code is operable, in carrying out the step of selecting a word group with an above-threshold selectivity value, to (i) access said database to identify texts and associated word-position identifiers associated with the word group, (ii) from the identified texts and word-position identifiers recorded in step (i) determine one or more selectivity values for the word group, and (iii) identify the word group as a descriptive wordpair if at least one of the selectivity values is above the selected threshold value.
  - 26. The code of claim 25, which is operable, in carrying out the step of determining match scores, to (i) record the texts associated with each descriptive wordpair, and (ii) determine a text match score based, at least in part, on the number of descriptive word groups in a text, weighted by the selectivity values for such word groups.
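
The use of word-position identifiers to find word groups (claims 8, 19, and 25) can be sketched with a positional index. The proximity rule here is an assumption: the claims say only "proximately arranged", so the ordered match and the `window` of two words are illustrative choices, as are the function names and sample texts.

```python
from collections import defaultdict

def build_positions(texts):
    """Map each word to {text_id: [word positions]} for every text containing it."""
    index = defaultdict(lambda: defaultdict(list))
    for text_id, body in texts.items():
        for pos, word in enumerate(body.lower().split()):
            index[word][text_id].append(pos)
    return index

def texts_with_pair(first, second, index, window=2):
    """Return ids of texts where `second` follows `first` within `window` words."""
    hits = set()
    for text_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(text_id, [])
        if any(0 < q - p <= window for p in first_positions for q in second_positions):
            hits.add(text_id)
    return hits

# Hypothetical library texts keyed by text identifier.
PAIR_TEXTS = {
    "t1": "inflatable balloon catheter",
    "t2": "catheter shaft near the balloon",
    "t3": "balloon angioplasty catheter tip",
}
POSITIONS = build_positions(PAIR_TEXTS)
print(sorted(texts_with_pair("balloon", "catheter", POSITIONS)))
```

Once the texts containing a word pair are known, their library identifiers give the pair's frequency per library, from which a selectivity value for the pair can be computed in the same way as for a single word.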

Current Assignee: Word Data Corporation
Original Assignee: Word Data Corporation
Inventors: Chin, Shao; Dehlinger, Peter J.
Primary Examiner(s): Corrielus, Jean M.
Application Number: US10/374,877
Publication Number: US 20040006457A1
Time in Patent Office: 1,120 Days
Field of Search: 704/9, 707/5, 707/2, 707/6, 707/101
US Class Current: 707/750
CPC Class Codes:
G06F 16/35   Clustering; Classification
G06F 16/353   into predefined classes
G06F 40/216   using statistical methods
Y10S 707/917   Text
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
