System, method, program product, and networking use for recognizing words and their parts of speech in one or more natural languages

US 7,680,649 B2
Filed: 06/17/2002
Issued: 03/16/2010
Est. Priority Date: 06/17/2002
Status: Active Grant

Abstract

A system, method, and computer program are disclosed for recognizing one or more words not listed in a dictionary database. One or more sequences of characters in the word are checked to determine a probability that the word is valid. A prefix removal process removes any prefixes from a word, and obtains information about the removed prefix. A suffix removal process removes any suffixes from the word, and obtains information about the removed suffix. A root process obtains information about a root word from the dictionary database. A combination process then determines if the prefix, the root, and the suffix can be combined into a valid word as defined by one or more combination rules, obtains one or more of the possible parts of speech of the valid word, and stores the parts of speech with the valid word in the dictionary database.
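
The flow described in the abstract (strip a prefix, strip a suffix, look up the root, then apply combination rules) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the prefix list, suffix list, dictionary entries, and combination rules are all hypothetical toy data, and the sketch removes at most one prefix and one suffix.

```python
# Hypothetical toy data; none of these entries come from the patent.
PREFIXES = {"un": "negation", "re": "repetition"}
SUFFIXES = {"ness": "noun", "able": "adjective", "ly": "adverb"}
DICTIONARY = {"read": {"verb"}, "kind": {"adjective", "noun"}}

# Hypothetical combination rules:
# (part of speech of the root, suffix) -> part of speech of the combined word.
COMBINATION_RULES = {
    ("verb", "able"): "adjective",
    ("adjective", "ness"): "noun",
    ("adjective", "ly"): "adverb",
}


def strip_prefix(word):
    """Return (prefix, rest); prefix is '' when no listed prefix applies."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            return prefix, word[len(prefix):]
    return "", word


def strip_suffix(word):
    """Return (rest, suffix); suffix is '' when no listed suffix applies."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], suffix
    return word, ""


def analyze(word):
    """Return the parts of speech of a derivable word, or None.

    A word that is successfully derived is stored in DICTIONARY
    together with its parts of speech.
    """
    if word in DICTIONARY:
        return DICTIONARY[word]
    prefix, rest = strip_prefix(word)
    root, suffix = strip_suffix(rest)
    root_pos = DICTIONARY.get(root)
    if root_pos is None:
        return None  # unknown root: a statistical check would take over here
    if suffix:
        result = {
            COMBINATION_RULES[(pos, suffix)]
            for pos in root_pos
            if (pos, suffix) in COMBINATION_RULES
        }
    else:
        # Toy assumption: a prefix alone keeps the root's parts of speech.
        result = set(root_pos)
    if not result:
        return None
    DICTIONARY[word] = result
    return result
```

With this toy data, `analyze("unreadable")` strips `un` and `able`, finds `read` as a verb, and records the word as an adjective; a word whose root is unknown returns `None`.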

21 Claims

1. A computer-implemented system for recognizing one or more words not listed in a dictionary database, the system comprising:
- at least one central processing unit;
  
  a memory operably associated with the at least one processing unit; and
  
  a dictionary augmentation system storable in memory and executable by the at least one processing unit, the dictionary augmentation system comprising:
  
  a root process that searches the dictionary database to obtain root information about a root word, the root word being a word with no prefix and suffix; and
  
  a statistical process that, if the root word is not found in the dictionary database, checks one or more proper substrings of the root word comprising two or more characters in the root word and every proper substring having fewer characters than the root word, against a complete database of each and every possible subset of individual valid words within the dictionary database, to determine, from the likelihood that the proper substring of the root word occurs in a sequence in the subsets of the individual valid words, a probability that the root word is a valid word that was previously unknown, wherein each character in the root word and in the individual valid words is an alphabet-based character and wherein the dictionary database is distinct from the complete database.
  - 2. A system, as in claim 1, where the probability comprises a measure of a likelihood that a substring of the one or more substrings is correctly placed adjacent to one or more other characters in the root word.
  - 3. A system, as in claim 2, where the one or more other characters precedes the substring.
  - 4. A system, as in claim 2, where the one or more other characters follows the substring.
  - 5. A system, as in claim 4, where the substring and one or more other characters form a trigram.
  - 6. A system, as in claim 2, where the probability is determined by:
    - comparing, for each of the one or more substrings and the one or more adjacent characters in the root word, a string of the substring and the adjacent character to a database of strings associated with a respective probability to yield a set of string probabilities;
      
      multiplying each string probability in the set of string probabilities by a log₂ of the string probability to yield a set of log string probabilities; and
      
      summing the log string probabilities in the set of log string probabilities to yield the probability that the root word is a valid word.
  - 7. A system, as in claim 6, where the respective probability of the strings in the database is determined by finding one or more possible strings of characters and counting the frequency of occurrence of the possible strings of characters in a database of valid words.
  - 8. A system, as in claim 2, further comprising one or more rules that define a part of speech of the word, the rules having a rule probability based on the frequency of occurrence, greater than a threshold, that the rule correctly applies to a database of valid words.
  - 9. A system, as in claim 8, where the part of speech of the root word is determined by one of the rules.
  - 10. A system, as in claim 8, where the rules apply to the ending of the root words.
  - 11. A system, as in claim 1, further comprising:
    - a compound word process that breaks the word into two components, the root word being the second component.
  - 12. A system, as in claim 10, where the compound word process further determines a part of speech of the root word.
  - 13. A system, as in claim 1, where once the word is determined to be a valid word, the word is stored in a new word dictionary memory.
  - 14. A system, as in claim 1, further comprising a word counting process that counts the frequency of occurrence of the word in one or more documents to determine an importance of the word if the word is determined as the valid word.
  - 15. A system, as in claim 1, further comprising:
    - a prefix removal process that removes one or more prefixes from the word, the prefixes being in a prefix list, the prefix removal being constrained by one or more prefix removal rules, the prefix removal process further obtaining prefix information about the removed prefix.
  - 16. A system, as in claim 15, where the prefix information is obtained from any one or more of the following:
    - a dictionary database and a prefix list.
  - 17. A system, as in claim 1, further comprising:
    - a suffix removal process that removes one or more suffixes from the word, the suffixes being in a suffix list, the suffix removal being constrained by one or more suffix removal rules, the suffix removal process further obtaining suffix information about the removed suffix.
  - 18. A system, as in claim 17, where the suffix information is obtained from any one or more of the following:
    - a dictionary database and a suffix list.
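
The calculation recited in claims 5 through 7 (trigram strings, probabilities taken from frequency counts over a database of valid words, and a sum of each probability multiplied by its log₂) can be sketched as below. The function names are hypothetical, and treating an unseen trigram as contributing zero is an assumption of this sketch (it is the limiting value of p·log₂ p as p approaches 0). The claims as listed here do not say how the resulting sum is turned into an accept or reject decision, so the sketch only computes the quantity.

```python
import math
from collections import Counter


def trigram_probabilities(valid_words):
    """Estimate trigram probabilities by counting occurrences in valid words."""
    counts = Counter()
    for word in valid_words:
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def word_score(root, probabilities):
    """Sum p * log2(p) over the trigrams of the root word.

    Assumption of this sketch: a trigram that never occurs in the
    valid-word database contributes 0 to the sum.
    """
    score = 0.0
    for i in range(len(root) - 2):
        p = probabilities.get(root[i:i + 3], 0.0)
        if p > 0.0:
            score += p * math.log2(p)
    return score
```

For example, a database holding only `cat` and `cab` gives each of the two trigrams a probability of 0.5, so `word_score("cat", ...)` evaluates to 0.5 × log₂ 0.5 = −0.5.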

19. A computer-implemented method for recognizing one or more words not listed in a dictionary database, the method comprising the steps of:
- identifying a root word in a document, wherein the document is stored on one of a hard disk and a network, and wherein the root word is a word with no prefix and no suffix;
  
  using at least one processing unit, searching the dictionary database to obtain root information about the root word; and
  
  if the root word is not found in the dictionary database, checking one or more proper substrings of the root word comprising two or more characters in the root word, and every proper substring having fewer characters than the root word, against a complete database of each and every possible subset of individual valid words within the dictionary database, to determine, from the likelihood that the substrings of the root word occurs in a sequence in the subsets of the individual valid words, a probability that the root word is a valid word that was previously unknown, wherein each character in the root word and in the individual valid words is an alphabet-based character and wherein the dictionary database is distinct from the complete database.

20. A computer-implemented system for recognizing one or more words not listed in a dictionary database, the system comprising:
- at least one central processing unit;
  
  a memory operably associated with the at least one processing unit; and
  
  a dictionary augmentation system storable in memory and executable by the at least one processing unit, the dictionary augmentation system comprising:
  
  means for searching the dictionary database to obtain root information about a root word, the root word being a word with no prefix and suffix; and
  
  means for checking one or more proper substrings of the root word comprising two or more characters in the root word, and every proper substring having fewer characters than the root word, against a complete database of each and every possible subset of individual valid words within the dictionary database, to determine, from the likelihood that the substrings of the root word occurs in a sequence in the subsets of the individual valid words, a probability that the root word is a valid word that was previously unknown, if the root word is not found in the dictionary database, wherein each character in the root word and in the individual valid words is an alphabet-based character and wherein the dictionary database is distinct from the complete database.

21. A computer memory storage device storing a dictionary augmentation system, the dictionary augmentation system comprising a computer program that causes a computer system to perform the steps of:
- identifying a root word in a document, wherein the document is stored on one of a hard disk and a network, and wherein the root word is a word with no prefix and no suffix;
  
  using at least one processing unit, searching the dictionary database to obtain root information about the root word; and
  
  checking one or more proper substrings of the root word comprising two or more characters in the root word, and every proper substring having fewer characters than the root word, against a complete database of each and every possible subset comprising individual valid words within the dictionary database, to determine, from the likelihood that the subsets of the root word occurs in a sequence in the subsets of the individual valid words, a probability that the root word is a valid word that was previously unknown, if the root word is not found in the dictionary database, wherein each character in the root word and in the individual valid words is an alphabet-based character and wherein the dictionary database is distinct from the complete database.
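
The independent claims each check proper substrings of the root word (two or more characters, fewer characters than the root word) against a separate database built from the valid words. A rough sketch of that enumeration and lookup follows. It reads "subset of individual valid words" as "contiguous substring of a valid word", which is an interpretation rather than something the claims state, and the `coverage` fraction is only a crude stand-in for the claimed probability. All function names are hypothetical.

```python
def proper_substrings(word, min_len=2):
    """Contiguous substrings with min_len <= length < len(word)."""
    n = len(word)
    return [
        word[i:j]
        for i in range(n)
        for j in range(i + min_len, n + 1)
        if j - i < n
    ]


def substring_index(valid_words, min_len=2):
    """Build the separate lookup set of every substring of every valid word."""
    index = set()
    for word in valid_words:
        n = len(word)
        for i in range(n):
            for j in range(i + min_len, n + 1):
                index.add(word[i:j])
    return index


def coverage(root, index):
    """Fraction of the root's proper substrings found in the index.

    Stand-in measure only; the claims describe a probability whose
    exact formula is not reproduced here.
    """
    subs = proper_substrings(root)
    if not subs:
        return 0.0
    return sum(s in index for s in subs) / len(subs)
```

With valid words `cat` and `tab`, the unknown root `cab` has proper substrings `ca` and `ab`, both of which appear in the index, so its coverage is 1.0.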

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Park, Youngja
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Yen, Eric L
Application Number: US10/173,931
Publication Number: US 20030233235A1
Time in Patent Office: 2,829 Days
Field of Search: 704/1, 704/8, 704/9, 704/10, 704/257
US Class Current: 704/10
CPC Class Codes: G06F 40/268 Morphological analysis; G06F 40/295 Named entity recognition
