DOMAIN-SPECIFIC COMPUTATIONAL LEXICON FORMATION

US 20160179783A1
Filed: 03/05/2015
Published: 06/23/2016
Est. Priority Date: 12/23/2014
Status: Active Grant

Abstract

According to an aspect, a candidate token sequence including one or more word tokens is extracted from an unstructured domain glossary that includes entries associated with a domain. A look-up operation is performed to retrieve language data for each word token in the candidate token sequence, and each word token found by the look-up operation is annotated with the corresponding retrieved language data to form an annotated sequence. A pattern match of the annotated sequence is performed relative to a repository of patterns, and a best matching pattern is identified from the repository based on matching criteria. The annotated sequence is refined with lexical information associated with the best matching pattern to form a refined annotated sequence. The candidate token sequence and the refined annotated sequence are output to a domain-specific computational lexicon file.
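
The steps summarized in the abstract can be illustrated with a minimal Python sketch. Nothing in it is taken from the specification: the toy lexical resource `LEXICON`, the pattern repository `PATTERNS`, the "term: definition" glossary layout, the matching criterion (first same-length pattern whose part-of-speech tags are all permitted by the annotations), and the JSON output are all assumptions chosen for illustration.

```python
import json

# Hypothetical lexical resource: word -> candidate parts of speech.
LEXICON = {
    "acute": {"ADJ"},
    "renal": {"ADJ"},
    "failure": {"NOUN"},
    "bypass": {"NOUN", "VERB"},
    "graft": {"NOUN", "VERB"},
}

# Hypothetical pattern repository: a POS sequence plus lexical information.
PATTERNS = [
    {"pos": ["ADJ", "NOUN"], "info": {"phrase_type": "NP", "head_index": 1}},
    {"pos": ["ADJ", "ADJ", "NOUN"], "info": {"phrase_type": "NP", "head_index": 2}},
    {"pos": ["NOUN", "NOUN"], "info": {"phrase_type": "NP", "head_index": 1}},
]

def extract_candidate(entry):
    """Take the term before the first colon and split it into word tokens."""
    term = entry.split(":", 1)[0]
    return term.lower().split()

def annotate(tokens):
    """Look up each token; tokens missing from the lexicon get an empty POS list."""
    return [{"token": t, "pos": sorted(LEXICON.get(t, set()))} for t in tokens]

def best_pattern(annotated):
    """Assumed criterion: first same-length pattern whose tags are all permitted."""
    for pattern in PATTERNS:
        if len(pattern["pos"]) != len(annotated):
            continue
        if all(tag in item["pos"] for tag, item in zip(pattern["pos"], annotated)):
            return pattern
    return None

def refine(annotated, pattern):
    """Fix each token to the pattern's tag and attach the pattern's lexical information."""
    tokens = [{"token": item["token"], "pos": tag}
              for item, tag in zip(annotated, pattern["pos"])]
    return {"tokens": tokens, **pattern["info"]}

def build_lexicon(entries):
    """Run extract, look-up/annotate, match, and refine for each glossary entry."""
    records = []
    for entry in entries:
        tokens = extract_candidate(entry)
        annotated = annotate(tokens)
        pattern = best_pattern(annotated)
        if pattern is not None:
            records.append({"candidate": tokens, "refined": refine(annotated, pattern)})
    return records

glossary = [
    "Acute renal failure: sudden loss of kidney function.",
    "Bypass graft: a vessel used to reroute blood flow.",
]
lexicon_records = build_lexicon(glossary)
# A real pipeline would write this to the lexicon file; here it is printed.
print(json.dumps(lexicon_records, indent=2))
```

In this sketch, "bypass" and "graft" each carry two candidate parts of speech, and the match against the noun-noun pattern selects one for each, which is one simple way to read the ambiguity resolution described in claim 7.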

8 Claims

1. A method comprising:
  - extracting a candidate token sequence comprising one or more word tokens from an unstructured domain glossary comprising a plurality of entries associated with a domain;
  - performing a look-up operation to retrieve language data for each word token in the candidate token sequence;
  - annotating each word token in the candidate token sequence found by the look-up operation with corresponding retrieved language data to form an annotated sequence;
  - performing a pattern match of the annotated sequence relative to a repository of patterns;
  - identifying a best matching pattern from the repository of patterns to the annotated sequence based on matching criteria;
  - refining the annotated sequence with lexical information associated with the best matching pattern as a refined annotated sequence; and
  - outputting the candidate token sequence and the refined annotated sequence to a domain-specific computational lexicon file.

2. The method of claim 1, further comprising:
  - pre-filtering the unstructured domain glossary to prevent an entry that fails to meet one or more well-formed entry criteria from being extracted as the candidate token sequence.

3. The method of claim 1, further comprising:
  - performing a morphological decomposition of the candidate token sequence to identify a base form of at least one of the word tokens in the candidate token sequence; and
  - using the base form in the look-up operation.

4. The method of claim 1, further comprising:
  - annotating a word token in the candidate token sequence as being an unknown word token based on an inability of the look-up operation to locate language data about the word token.

5. The method of claim 4, further comprising:
  - applying a set of heuristics to attempt to resolve the unknown word token as a part of speech; and
  - adding an entry to a lexical resource used for the look-up operation based on resolving the unknown word token.

6. The method of claim 1, further comprising:
  - performing a lexical phrase analysis of the annotated sequence to identify a candidate head of phrase; and
  - augmenting the refined annotated sequence based on identifying the candidate head of phrase of the annotated sequence.

7. The method of claim 1, wherein the pattern matching attempts to resolve a part of speech for at least one word token in the annotated sequence having multiple candidate parts of speech.

8. The method of claim 1, further comprising:
  - repeating the extracting, the look-up operation, the annotating, and the pattern matching for additional candidate token sequences; and
  - recording instances of the additional candidate token sequences with corresponding annotated sequences in a pattern analysis file based on determining that a matching pattern cannot be located in the repository of patterns.
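
The dependent claims can be sketched in the same spirit. The Python below is a hedged illustration only: the well-formedness test (claim 2), the suffix-stripping stand-in for morphological decomposition (claim 3), the suffix heuristics for unknown tokens (claims 4 and 5), and the in-memory list standing in for the pattern analysis file (claim 8) are all assumptions, as are the names `LEXICON`, `PATTERNS`, `is_well_formed`, `base_form`, and `guess_pos`.

```python
import re

# Hypothetical lexical resource keyed by base form, and a tiny pattern repository.
LEXICON = {"valve": {"NOUN"}, "stenosis": {"NOUN"}, "aortic": {"ADJ"}}
PATTERNS = [["ADJ", "NOUN"], ["NOUN", "NOUN"]]

def is_well_formed(entry):
    """Assumed criteria: 'term: definition' with a term of 1-5 alphabetic words."""
    term, sep, definition = entry.partition(":")
    words = term.split()
    return (bool(sep) and bool(definition.strip())
            and 1 <= len(words) <= 5 and all(w.isalpha() for w in words))

def base_form(token):
    """Try the surface form, then naive suffix stripping against the lexicon."""
    if token in LEXICON:
        return token
    for suffix, replacement in (("ies", "y"), ("es", ""), ("s", "")):
        if token.endswith(suffix):
            candidate = token[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return token

def guess_pos(token):
    """Assumed suffix heuristics for tokens the look-up cannot find."""
    if re.search(r"(ic|al|ous|ary)$", token):
        return "ADJ"
    if re.search(r"(itis|osis|oma|tion|ectomy)$", token):
        return "NOUN"
    return None

def annotate(tokens):
    annotated = []
    for token in tokens:
        base = base_form(token)
        pos = LEXICON.get(base)
        if pos is None:
            guess = guess_pos(base)
            if guess is None:
                # Look-up and heuristics both failed: mark the token as unknown.
                annotated.append({"token": token, "base": base, "pos": ["UNKNOWN"]})
                continue
            LEXICON[base] = {guess}  # add the resolved token to the lexical resource
            pos = LEXICON[base]
        annotated.append({"token": token, "base": base, "pos": sorted(pos)})
    return annotated

def matches(annotated, pattern):
    return (len(annotated) == len(pattern)
            and all(tag in item["pos"] for tag, item in zip(pattern, annotated)))

def process(entries):
    """Return matched candidates and the sequences recorded for pattern analysis."""
    matched, pattern_analysis = [], []
    for entry in entries:
        if not is_well_formed(entry):
            continue  # pre-filtered out
        tokens = entry.split(":", 1)[0].lower().split()
        annotated = annotate(tokens)
        if any(matches(annotated, p) for p in PATTERNS):
            matched.append(tokens)
        else:
            pattern_analysis.append(annotated)
    return matched, pattern_analysis

glossary = [
    "Aortic valves: valves between the left ventricle and the aorta.",
    "Mitral stenosis: narrowing of the mitral valve.",
    "Valve xyzzy: placeholder entry with an unresolvable token.",
    "see also section 4.2",
]
matched, pattern_analysis = process(glossary)
print(matched)
print(pattern_analysis)
```

Sequences that end up in `pattern_analysis` are the ones a maintainer would review when deciding whether the pattern repository needs a new pattern.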

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Boguraev, Branimir K.; Manandise, Esme; Segal, Benjamin P.
Granted Patent: US 9,684,647 B2
US Class Current: 1/1
CPC Class Codes:
G06F 16/3329   Natural language query form...
G06F 40/242   Dictionaries
G06F 40/268   Morphological analysis
