Enhancing QA system cognition with improved lexical simplification using multilingual resources

US 10,318,634 B2
Filed: 01/02/2017
Issued: 06/11/2019
Est. Priority Date: 01/02/2017
Status: Active Grant



Abstract

An approach is provided that returns a simplified set of text to a user of a natural language processing (NLP) system, with the simplified set of text having a complexity appropriate to the reading level of the user. The approach receives a word that belongs to a first natural language and retrieves a first set of complexity data pertaining to the word in the first natural language. The approach translates the word to one or more translated words, with each of the translated words corresponding to one or more second natural languages. The approach then retrieves one or more second sets of complexity data, with each second set corresponding to a different one of the translated words. The approach determines a complexity of the word in the first natural language based on an analysis of the first and second sets of complexity data.
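The complexity computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ComplexityData` structure, the use of an arithmetic mean for the "overall" length and frequency, the equal weighting, the `max_length` cap, and the assumption that frequencies are already scaled to [0, 1] are all hypothetical choices that the source does not specify.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ComplexityData:
    """Complexity data for one word in one language (hypothetical structure)."""
    word: str
    language: str
    length: int        # word length in characters
    frequency: float   # assumed to be pre-scaled to [0, 1]; 1 = very common


def overall_length(first: ComplexityData, seconds: list[ComplexityData]) -> float:
    # Assumption: "overall" means the arithmetic mean across all languages.
    return mean([first.length] + [s.length for s in seconds])


def overall_frequency(first: ComplexityData, seconds: list[ComplexityData]) -> float:
    # Same assumption as above, applied to frequencies.
    return mean([first.frequency] + [s.frequency for s in seconds])


def compute_complexity(first: ComplexityData,
                       seconds: list[ComplexityData],
                       max_length: float = 20.0) -> float:
    """Return a score in [0, 1]; higher means more complex.

    Equal weighting of length and rarity is an illustrative assumption.
    """
    length_term = min(overall_length(first, seconds) / max_length, 1.0)
    rarity_term = 1.0 - overall_frequency(first, seconds)
    return 0.5 * length_term + 0.5 * rarity_term


def add_to_mapping(mapping: dict, first: ComplexityData,
                   seconds: list[ComplexityData]) -> float:
    # Store the computed complexity keyed by (word, language).
    score = compute_complexity(first, seconds)
    mapping[(first.word, first.language)] = score
    return score


if __name__ == "__main__":
    mapping: dict = {}
    first = ComplexityData("utilize", "en", 7, 0.2)
    seconds = [ComplexityData("utiliser", "fr", 8, 0.4),
               ComplexityData("benutzen", "de", 8, 0.3)]
    print(add_to_mapping(mapping, first, seconds))
```

Under these assumptions, a long word that is rare in every language scores close to 1, while a short word that is common in every language scores close to 0.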


9 Claims

1. An information handling system comprising:
- one or more processors;
  
  a memory coupled to at least one of the processors; and
  
  a set of computer program instructions stored in the memory and executed by at least one of the processors to return a simplified set of text to a user of a natural language processing (NLP) system, wherein the simplified set of text comprises text appropriate to a reading level of the user, wherein a lexical simplification process selects the simplified set of text from a corpus of plurality of words that have a complexity level appropriate to the reading level, and wherein the complexity level is based on a multi-language word mapping performed on at least a selected one of the plurality of words using a process comprising;
  
  creating the multi-language word mapping by a multi-language word mapping generator executing on the information handling system, wherein the creating further comprises;
  
  retrieving the selected word that belongs to a first natural language;
  
  retrieving a first set of complexity data pertaining to the selected word in the first natural language, wherein the first set of complexity data comprises a first word length and a first word frequency;
  
  translating the selected word to one or more translated words, wherein each of the translated words corresponds to one or more second natural languages;
  
  retrieving one or more second sets of complexity data, wherein each of the second sets of complexity data correspond to a different one of the translated words, and wherein the one or more second sets of complexity data comprises one or more second word lengths and one or more second word frequencies;
  
  computing a complexity of the selected word in the first natural language based on an overall word length and an overall word frequency, wherein the overall word length is based on the first word length and the one or more second word lengths, and wherein the overall word frequency is based on the first word frequency and the one or more second word frequencies; and
  
  storing the computed complexity of the word in the multi-language word mapping; and
  
  wherein the lexical simplification process selects, based on the computed complexity of the word, one of the one or more translated words to replace the selected word.
  - 2. The information handling system of claim 1 wherein the first set of complexity data includes a first word n-gram of the selected word in the first natural language, wherein the second sets of complexity data includes one or more second word n-grams of the selected word in each of the second natural languages, and wherein the actions further comprise:
    - determining an overall word n-gram based on the first word n-gram and the second one or more word n-grams, wherein the complexity of the selected word is based on the overall word n-gram.
  - 3. The information handling system of claim 1 wherein the first set of complexity data includes a first word encyclopedia entry of the selected word in the first natural language, wherein the second sets of complexity data includes one or more second encyclopedia entries of the selected word in each of the second natural languages, and wherein the actions further comprise:
    - determining an overall word n-gram based on the first word encyclopedia entry and the second one or more encyclopedia entries, wherein the complexity of the selected word is based on the overall word n-gram.
  - 4. The information handling system of claim 1 wherein the complexity of the selected word is based on an average length of characters of the selected word and the translated words in each of the first and second natural languages, a total number of translated words, a frequency of the selected word in the first natural language, a sum of the normalized frequencies of the one or more translated words in the second natural languages, an existence of an encyclopedia entry of the selected word, a number of encyclopedia entries of the translated words in the second natural languages, and a vector value of possible character n-grams in the second natural languages collectively.
  - 5. The information handling system of claim 1 wherein the translated words include synonyms of the translated words in the second natural languages.
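The features enumerated in claim 4 could be assembled along the following lines. This is a hedged sketch only: the `Translation` structure, the `build_features` and `char_ngrams` helpers, the choice of character trigrams with boundary padding, and the use of a count table for the "vector value" are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """One translated word and its complexity data (hypothetical structure)."""
    word: str
    language: str
    normalized_frequency: float
    has_encyclopedia_entry: bool


def char_ngrams(word: str, n: int = 3) -> list[str]:
    # Pad with boundary markers so word-initial and word-final n-grams are kept.
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_features(word: str, frequency: float, has_entry: bool,
                   translations: list[Translation], n: int = 3) -> dict:
    """Collect the claim 4 features for one selected word."""
    all_words = [word] + [t.word for t in translations]

    # Character n-gram counts over all translated words collectively.
    ngram_counts: Counter = Counter()
    for t in translations:
        ngram_counts.update(char_ngrams(t.word, n))

    return {
        "avg_length": sum(len(w) for w in all_words) / len(all_words),
        "num_translations": len(translations),
        "frequency": frequency,
        "sum_translated_frequency": sum(t.normalized_frequency for t in translations),
        "has_encyclopedia_entry": int(has_entry),
        "num_translated_entries": sum(t.has_encyclopedia_entry for t in translations),
        "char_ngrams": ngram_counts,
    }


if __name__ == "__main__":
    feats = build_features(
        "cat", 0.6, True,
        [Translation("chat", "fr", 0.5, True),
         Translation("gato", "es", 0.25, False)],
    )
    print(feats)
```

How these features are combined into a single complexity value (for example, by a trained model or a hand-tuned formula) is not stated in the claims shown here, so the sketch stops at feature extraction.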

6. A computer program product stored in a non-transitory computer readable storage medium, comprising computer program code that, when executed by an information handling system, performs actions comprising:
- returning a simplified set of text to a user of a natural language processing (NLP) system, wherein the simplified set of text comprises text appropriate to a reading level of the user, wherein a lexical simplification process selects the simplified set of text from a corpus of plurality of words that have a complexity level appropriate to the reading level, and wherein the complexity level is based on a multi-language word mapping performed on at least a selected one of the plurality of words using a process comprising;
  
  creating the multi-language word mapping by a multi-language word mapping generator executing on the information handling system, wherein the creating further comprises;
  
  retrieving the selected word that belongs to a first natural language;
  
  retrieving a first set of complexity data pertaining to the selected word in the first natural language, wherein the first set of complexity data comprises a first word length and a first word frequency;
  
  translating the selected word to one or more translated words, wherein each of the translated words corresponds to one or more second natural languages;
  
  retrieving one or more second sets of complexity data, wherein each of the second sets of complexity data correspond to a different one of the translated words, and wherein the one or more second sets of complexity data comprises one or more second word lengths and one or more second word frequencies;
  
  computing a complexity of the selected word in the first natural language based on an overall word length and an overall word frequency, wherein the overall word length is based on the first word length and the one or more second word lengths, and wherein the overall word frequency is based on the first word frequency and the one or more second word frequencies; and
  
  storing the computed complexity of the word in the multi-language word mapping; and
  
  wherein the lexical simplification process selects, based on the computed complexity of the word, one of the one or more translated words to replace the selected word.
  - 7. The computer program product of claim 6 wherein the first set of complexity data includes a first word n-gram of the selected word in the first natural language, wherein the second sets of complexity data includes one or more second word n-grams of the selected word in each of the second natural languages, and wherein the actions further comprise:
    - determining an overall word n-gram based on the first word n-gram and the second one or more word n-grams, wherein the complexity of the selected word is based on the overall word n-gram.
  - 8. The computer program product of claim 6 wherein the first set of complexity data includes a first word encyclopedia entry of the selected word in the first natural language, wherein the second sets of complexity data includes one or more second encyclopedia entries of the selected word in each of the second natural languages, and wherein the actions further comprise:
    - determining an overall word n-gram based on the first word encyclopedia entry and the second one or more encyclopedia entries, wherein the complexity of the selected word is based on the overall word n-gram.
  - 9. The computer program product of claim 6 wherein the complexity of the selected word is based on an average length of characters of the selected word and the translated words in each of the first and second natural languages, a total number of translated words, a frequency of the selected word in the first natural language, a sum of the normalized frequencies of the one or more translated words in the second natural languages, an existence of an encyclopedia entry of the selected word, a number of encyclopedia entries of the translated words in the second natural languages, and a vector value of possible character n-grams in the second natural languages collectively.
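The final selection step recited in claims 1 and 6, where the lexical simplification process picks a replacement based on computed complexity, might look like the sketch below. The function names, the flat `complexity` dictionary standing in for the multi-language word mapping, the numeric `max_complexity` threshold representing the user's reading level, and the "lowest complexity wins" rule are all assumptions for illustration; the claims do not specify how the choice is made.

```python
def select_replacement(word: str, candidates: list[str],
                       complexity: dict[str, float],
                       max_complexity: float) -> str:
    """Pick a replacement whose stored complexity fits the reading level.

    Returns the original word if it already fits, or if no candidate fits.
    """
    current = complexity.get(word)
    if current is not None and current <= max_complexity:
        return word  # already appropriate for this reader

    # Keep only candidates with a stored score at or below the threshold.
    eligible = [(complexity[c], c) for c in candidates
                if c in complexity and complexity[c] <= max_complexity]
    if not eligible:
        return word
    return min(eligible)[1]  # assumption: the simplest eligible candidate wins


def simplify(tokens: list[str], candidates_for: dict[str, list[str]],
             complexity: dict[str, float], max_complexity: float) -> list[str]:
    # Apply the selection rule token by token.
    return [select_replacement(t, candidates_for.get(t, []),
                               complexity, max_complexity)
            for t in tokens]


if __name__ == "__main__":
    complexity = {"utilize": 0.8, "use": 0.2, "employ": 0.5,
                  "the": 0.05, "tool": 0.1}
    candidates_for = {"utilize": ["use", "employ"]}
    print(simplify(["utilize", "the", "tool"], candidates_for, complexity, 0.4))
```

A real system would also need to check that the replacement preserves meaning and grammatical fit in context, which this sketch does not attempt.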


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Dandala, Bharath; Sinha, Ravi S.
Primary Examiner(s): Guerra-Erazo, Edgar X

Application Number: US15/396,712
Publication Number: US 20180189262A1
Time in Patent Office: 890 Days

CPC Class Codes

G06F 16/243   Natural language query form...
G06F 40/157   using dictionaries or tables
G06F 40/247   Thesauruses; Synonyms
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/55   Rule-based translation
