Natural-language processing system using a large corpus

US 20040024583A1
Filed: 09/20/2002
Published: 02/05/2004
Est. Priority Date: 03/20/2000
Status: Active Grant

Abstract

A computer-parsing system based upon using vectors (lists) to represent natural-language elements, providing a robust, distributed way to score the grammaticality of an input string by using a large corpus of natural-language text as source material. The system recombines asymmetric associations of syntactically similar strings to form the vectors. It uses equivalence lists for the subparts of the string to build equivalence lists for the longer strings, in an order controlled by the potential parse to be scored. The power of recombining vector elements in building longer strings provides a means of representing collocational complexity. Grammaticality scoring is based upon the number and similarity of the vector elements.
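The abstract does not state how syntactically similar strings are identified. Purely as an illustration, the following Python sketch assumes that two words are similar when they occur between the same left and right neighbours in the corpus, and weights each entry of a word's vector by the share of that word's context occurrences the other word also fills. The toy corpus and the names `context_counts` and `equivalence_vector` are hypothetical and do not come from the patent.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption); a real system would use a large body of text.
CORPUS = [
    "the dog runs",
    "the cat runs",
    "the cat sleeps",
    "a dog sleeps",
]


def context_counts(corpus):
    """Map each word to a Counter of its (left, right) neighbour contexts."""
    contexts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]][(tokens[i - 1], tokens[i + 1])] += 1
    return contexts


def equivalence_vector(word, contexts):
    """Return {other_word: weight} for words sharing a context with `word`.

    The weight is the fraction of `word`'s context occurrences in which
    `other_word` is also attested. It is asymmetric by construction:
    the weight from a to b need not equal the weight from b to a.
    """
    own = contexts[word]
    total = sum(own.values())
    vector = {}
    for other, theirs in contexts.items():
        if other == word:
            continue
        shared = sum(count for ctx, count in own.items() if ctx in theirs)
        if shared:
            vector[other] = shared / total
    return vector


if __name__ == "__main__":
    ctx = context_counts(CORPUS)
    for w in ("dog", "a", "the"):
        print(w, equivalence_vector(w, ctx))
```

On this toy corpus, `a` maps to `the` with weight 1.0, while `the` maps to `a` with weight 1/3, which illustrates one possible reading of "asymmetric associations".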

26 Claims

1. ) A computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing, comprising, in combination:
- a) for a first adjoining pair, comprising a first pair element and a second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a first listing of each such element syntactically equivalent to such first pair element and a second listing of each such element syntactically equivalent to such second pair element;
  
  b) from matching each such first-listing element with each such second-listing element, making a matched-pairs third listing by finding which matched pairs of said matching are found in such string data from such corpus; and
  
  c) for such matched pairs of such matched-pairs third listing, finding, from such string data from such corpus, a fourth listing of each fourth such natural-language element syntactically equivalent to any such matched pair of said third listing.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 23, 24)
- - 2. ) The computer system according to claim 1 further comprising:
    - a) scoring each such natural-language element of such fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fourth listing in such string data from such corpus.
  - 3. ) The computer system according to claim 1 further comprising:
    - a) for such fourth natural-language elements of such fourth listing, finding, from such string data from such corpus, a fifth listing of each such natural-language element syntactically equivalent to any such fourth natural-language element.
  - 4. ) The computer system according to claim 3 further comprising:
    - a) scoring each such natural-language element of such fifth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fifth listing in such string data from such corpus.
  - 5. ) The computer system according to claim 3 further comprising:
    - a) for such nth natural-language elements of such nth listing, finding, from such string data from such corpus, an (n+1)th listing of each such natural-language element syntactically equivalent to any such nth natural-language element.
  - 6. ) The computer system according to claim 5 further comprising:
    - a) scoring each such natural-language element of such (n+1)th listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th listing in such string data from such corpus.
  - 7. ) The computer system according to claim 1 further comprising:
    - a) for a second adjoining pair, comprising such first adjoining pair as a second first pair element and another natural-language element adjoining such first adjoining pair as a second second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a second first listing of each such element syntactically equivalent to such second first pair element and a second second listing of each such element syntactically equivalent to such second second pair element;
      
      b) from matching each such second first-listing element with each such second second-listing element, making a matched-pairs second third listing by finding which matched pairs of said matching are found in such string data from such corpus; and
      
      c) for such matched pairs of such matched-pairs second third listing, finding, from such string data from such corpus, a second fourth listing of each second fourth such natural-language element syntactically equivalent to any such matched pair of such second third listing.
  - 8. ) The computer system according to claim 7 further comprising:
    - a) scoring each such natural-language element of such fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such fourth listing in such string data from such corpus.
  - 9. ) The computer system according to claim 7 further comprising:
    - a) for an (n+1)th adjoining pair, comprising such nth adjoining pair as an (n+1)th first pair element and another natural-language element adjoining such nth adjoining pair as an (n+1)th second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, an (n+1)th first listing of each such element syntactically equivalent to such (n+1)th first pair element and an (n+1)th second listing of each such element syntactically equivalent to such (n+1)th second pair element;
      
      b) from matching each such (n+1)th first-listing element with each such (n+1)th second-listing element, making a matched-pairs (n+1)th third listing by finding which matched pairs of said matching are found in such string data from such corpus; and
      
      c) for such matched pairs of such matched-pairs (n+1)th third listing, finding, from such string data from such corpus, an (n+1)th fourth listing of each (n+1)th fourth such natural-language element syntactically equivalent to any such matched pair of such (n+1)th third listing.
  - 10. ) The computer system according to claim 9 further comprising:
    - a) scoring each such natural-language element of such (n+1)th fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th fourth listing in such string data from such corpus.
  - 11. ) The computer system according to claim 1 further comprising:
    - a) repeating such steps of claim 1 while considering i) such original first adjoining pair as a new first pair element in such repeating, ii) such original fourth listing as a new first listing in such repeating, and iii) a new natural-language element adjoining, in such input string, such new first pair element as a new second pair element, thereby providing a new first adjoining pair, iv) thereby providing a new fourth listing in association with such new first adjoining pair.
  - 12. ) The computer system according to claim 11 further comprising:
    - a) re-performing steps a)i) through a)iv) of claim 11 while considering i) such new first adjoining pair as a first replacement first pair element in such re-performing, ii) such new fourth listing as a first replacement first listing in such re-performing, and iii) a further new natural-language element adjoining, in such input string, such first replacement first pair element as a first replacement second pair element, thereby providing a first replacement first adjoining pair, iv) thereby providing a first replacement fourth listing in association with such first replacement first adjoining pair.
  - 13. ) The computer system according to claim 12 further comprising:
    - a) further continuing to perform, for such entire input string, steps a)i) through a)iv) of claim 12 while considering i) such nth first adjoining pair as an (n+1)th replacement first pair element in such further performing, ii) such nth fourth listing as an (n+1)th replacement first listing in such further performing, and iii) a further new natural-language element adjoining, in such input string, such (n+1)th replacement first pair element as an (n+1)th replacement second pair element, thereby providing an (n+1)th replacement first adjoining pair, iv) thereby providing an (n+1)th replacement fourth listing in association with such (n+1)th replacement first adjoining pair.
  - 14. ) The computer system according to claim 13 further comprising:
    - a) for an (n+1)th adjoining pair, comprising such nth adjoining pair as an (n+1)th first pair element and another natural-language element adjoining such nth adjoining pair as an (n+1)th second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, an (n+1)th first listing of each such element syntactically equivalent to such (n+1)th first pair element and an (n+1)th second listing of each such element syntactically equivalent to such (n+1)th second pair element;
      
      b) from matching each such (n+1)th first-listing element with each such (n+1)th second-listing element, making a matched-pairs (n+1)th third listing by finding which matched pairs of said matching are found in such string data from such corpus; and
      
      c) for such matched pairs of such matched-pairs (n+1)th third listing, finding, from such string data from such corpus, an (n+1)th fourth listing of each (n+1)th fourth such natural-language element syntactically equivalent to any such matched pair of such (n+1)th third listing.
  - 15. ) The computer system according to claim 14 further comprising:
    - a) scoring each such natural-language element of such (n+1)th fourth listing, such scoring comprising counting the number of occurrences of each such natural-language element of such (n+1)th fourth listing in such string data from such corpus;
      
      b) wherein said scoring comprises a similarity measure for statistical similarity between such scored natural-language element and such string data from such corpus; and
      
      c) wherein such scores for each such natural language element of such (n+1)th fourth listing are essentially added to determine a scoring for a string comprising such (n+1)th replacement first adjoining pair.
  - 16. ) The computer system according to claim 15 wherein such computer system is applied to possible ordered string subcombinations of at least two potential parses of such natural-language elements of such input string and a highest such scoring among such potential parses is used to determine maximum grammaticality among such potential parses.
  - 17. ) The computer system according to each of claims 2, 4, 6, 8, 10, and 15 wherein said scoring comprises:
    - a) a similarity measure for statistical similarity between such scored natural-language element and such string data from such corpus.
  - 18. ) The computer system according to each of claims 2, 4, 6, 8, 10, and 15 wherein such scoring of each such fourth list element comprises:
    - a) the product of i) a measure of statistical similarity between each such element (of such first listing) syntactically equivalent to such first pair element and such first pair element;
      
      ii) a measure of statistical similarity between each such element (of such second listing) syntactically equivalent to such second pair element and such second pair element;
      
      iii) a measure of statistical association between such first and second pair elements; and
      
      iv) a measure of statistical similarity between each matched pair of such matched-pairs third listing and each fourth such natural-language element of such fourth listing; and
      
      b) the sum of each such product for each such third list element.
  - 23. ) The computer system according to each of claims 1-21 wherein each such pair element comprises at least one word.
  - 24. ) The computer system according to each of claims 1-21 wherein each such pair element comprises at least two words.
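Steps a) through c) of claim 1, together with the occurrence-count scoring of claim 2, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: "syntactically equivalent" is approximated here as "occurs in at least one identical (left, right) context in the corpus", candidate elements are limited to `MAX_LEN` words, and the toy corpus and all function names are hypothetical.

```python
from collections import defaultdict

# Toy corpus (an assumption); the claims presuppose a large provided corpus.
CORPUS = ["the dog runs", "the cat runs", "a cat sleeps", "she runs", "he sleeps"]
MAX_LEN = 2  # longest candidate element, in words (an assumption)
SENTENCES = [["<s>"] + s.split() + ["</s>"] for s in CORPUS]


def contexts_of(element):
    """List the (left, right) context of every corpus occurrence of element."""
    words = element.split()
    n = len(words)
    return [
        (toks[i - 1], toks[i + n])
        for toks in SENTENCES
        for i in range(1, len(toks) - n)
        if toks[i:i + n] == words
    ]


# Index every corpus n-gram of up to MAX_LEN words by the context it fills.
BY_CONTEXT = defaultdict(set)
for toks in SENTENCES:
    for n in range(1, MAX_LEN + 1):
        for i in range(1, len(toks) - n):
            BY_CONTEXT[(toks[i - 1], toks[i + n])].add(" ".join(toks[i:i + n]))


def equivalents(element):
    """Corpus elements attested in at least one context of `element`."""
    found = set()
    for ctx in contexts_of(element):
        found |= BY_CONTEXT.get(ctx, set())
    return found


def claim1_listings(first, second):
    first_listing = equivalents(first)      # step a), first pair element
    second_listing = equivalents(second)    # step a), second pair element
    third_listing = {                       # step b): matched pairs found in corpus
        f"{x} {y}"
        for x in first_listing
        for y in second_listing
        if contexts_of(f"{x} {y}")
    }
    fourth_listing = set()                  # step c): equivalents of matched pairs
    for pair in third_listing:
        fourth_listing |= equivalents(pair)
    # Claim 2: score each fourth-listing element by its corpus occurrence count.
    scores = {e: len(contexts_of(e)) for e in fourth_listing}
    return first_listing, second_listing, third_listing, fourth_listing, scores


if __name__ == "__main__":
    for listing in claim1_listings("the", "dog"):
        print(listing)
```

For the adjoining pair "the" and "dog", the fourth listing on this toy corpus contains the single words "she" and "he" alongside "the dog", "the cat", and "a cat", so the pair as a whole acquires its own list of substitutable elements. Claims 11 through 13 then reuse such a fourth listing as the first listing for the next adjoining element of the input string.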

19. ) A computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string, to be parsed, of natural-language elements in the subject language, for assisting natural-language parsing, comprising, in combination:
- a) for each of at least two natural-language input subcombinations which are potential subparses of such input string, building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination;
  
  b) from such equivalence lists, in different orders for each potential parse of said input string, building to a final equivalence list for each such potential parse of such input string; and
  
  c) from the number and quality of entries in each respective such final equivalence list, scoring the grammaticality of such respective potential parse.
- View Dependent Claims (20)
- - 20. ) The computer system according to claim 19 wherein such scoring comprises essentially adding scores for each such entry to obtain a score for such potential parse.
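The parse scoring of claims 19 and 20 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's own method: equivalence is approximated by shared (left, right) corpus contexts, the "quality" of an entry is taken to be the number of shared contexts multiplied up through the combination steps, and a potential parse is represented as nested tuples. The corpus, the weighting, and all names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption); the claims presuppose a large provided corpus.
CORPUS = ["the dog runs", "the cat runs", "a cat sleeps", "she runs", "he sleeps"]
MAX_LEN = 3  # longest candidate corpus string, in words (an assumption)
SENTENCES = [["<s>"] + s.split() + ["</s>"] for s in CORPUS]


def contexts_of(element):
    """Set of distinct (left, right) contexts in which element occurs."""
    words = element.split()
    n = len(words)
    return {
        (toks[i - 1], toks[i + n])
        for toks in SENTENCES
        for i in range(1, len(toks) - n)
        if toks[i:i + n] == words
    }


# Index every corpus n-gram of up to MAX_LEN words by the context it fills.
BY_CONTEXT = defaultdict(set)
for toks in SENTENCES:
    for n in range(1, MAX_LEN + 1):
        for i in range(1, len(toks) - n):
            BY_CONTEXT[(toks[i - 1], toks[i + n])].add(" ".join(toks[i:i + n]))


def equivalence_list(element):
    """Return Counter {corpus string: number of contexts shared with element}."""
    weights = Counter()
    for ctx in contexts_of(element):
        for candidate in BY_CONTEXT.get(ctx, ()):
            weights[candidate] += 1
    return weights


def combine(left, right):
    """Build the equivalence list of a concatenation from its parts' lists."""
    merged = Counter()
    for x, wx in left.items():
        for y, wy in right.items():
            # Pairs absent from the corpus have no contexts and add nothing.
            for candidate, w in equivalence_list(f"{x} {y}").items():
                merged[candidate] += wx * wy * w
    return merged


def final_list(parse):
    """parse is a word (str) or a (left, right) tuple of sub-parses."""
    if isinstance(parse, str):
        return equivalence_list(parse)
    left, right = parse
    return combine(final_list(left), final_list(right))


def grammaticality(parse):
    """Claim 20 style score: add the weights of the final list's entries."""
    return sum(final_list(parse).values())


if __name__ == "__main__":
    left_branching = (("the", "dog"), "runs")
    right_branching = ("the", ("dog", "runs"))
    print(grammaticality(left_branching), grammaticality(right_branching))
```

On this toy corpus the parse that groups "the dog" first scores higher than the one that groups "dog runs" first, because "the dog" shares its context with more corpus strings (including "she"). The order in which the equivalence lists are combined is the only difference between the two computations.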

21. ) A computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing, comprising, in combination:
- a) for a first adjoining pair, comprising a first pair element and a second pair element, of such natural-language elements of such input string, finding, from such string data from such corpus, a first listing of each such element syntactically equivalent to such first pair element and a second listing of each such element syntactically equivalent to such second pair element; and
  
  b) from matching each such first-listing element with each such second-listing element, making a matched-pairs third listing by finding which matched pairs of said matching are found in such string data from such corpus;
  
  c) wherein at least one of said first adjoining pair comprises at least a pair of natural-language elements.
- View Dependent Claims (22)
- - 22. ) The computer system according to claim 21 wherein:
    - a) at least one of such first pair element and such second pair element comprises at least a pair of words.

25. ) A computer-readable medium (for a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing) whose contents cause a computer system to determine a grammatical parse by:
- a) for each of at least two natural-language input subcombinations which are potential subparses of such input string, building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination;
  
  b) from such equivalence lists, in different orders for each potential parse of said input string, building to a final equivalence list for each such potential parse of such input string; and
  
  c) from the number and quality of entries in each respective such final equivalence list, scoring the grammaticality of such respective potential parse.

26. ) A computer-implemented natural-language system (for a computer system, using a provided corpus of linear natural-language elements of natural language text string data in a subject language and an input string of natural-language elements in the subject language, for assisting natural-language processing) comprising:
- a) for each of at least two natural-language input subcombinations which are potential subparses of such input string, means for building an equivalence list of all corpus strings syntactically equivalent to such each input string subcombination;
  
  b) means for building, from such equivalence lists, in different orders for each potential parse of said input string, to a final equivalence list for each such potential parse of such input string; and
  
  c) means for scoring, from the number and quality of entries in each respective such final equivalence list, the grammaticality of such respective potential parse.

Current Assignee: Robert J. Freeman
Original Assignee: Robert J. Freeman
Inventors: Freeman, Robert J

Granted Patent: US 7,392,174 B2
US Class Current: 704/4
CPC Class Codes: G06F 40/216 using statistical methods
