DOMAIN-SPECIFIC COMPUTATIONAL LEXICON FORMATION
First Claim
1. A method comprising:
extracting a candidate token sequence comprising one or more word tokens from an unstructured domain glossary comprising a plurality of entries associated with a domain;
performing a look-up operation to retrieve language data for each word token in the candidate token sequence;
annotating each word token in the candidate token sequence found by the look-up operation with corresponding retrieved language data to form an annotated sequence;
performing a pattern match of the annotated sequence relative to a repository of patterns;
identifying a best matching pattern from the repository of patterns to the annotated sequence based on matching criteria;
refining the annotated sequence with lexical information associated with the best matching pattern as a refined annotated sequence; and
outputting the candidate token sequence and the refined annotated sequence to a domain-specific computational lexicon file.
Abstract
According to an aspect, a candidate token sequence including one or more word tokens is extracted from an unstructured domain glossary that includes entries associated with a domain. A look-up operation is performed to retrieve language data for each word token in the candidate token sequence, and each word token found by the look-up operation is annotated with the corresponding retrieved language data to form an annotated sequence. A pattern match of the annotated sequence is performed relative to a repository of patterns, and a best matching pattern for the annotated sequence is identified from the repository based on matching criteria. The annotated sequence is refined with lexical information associated with the best matching pattern to form a refined annotated sequence. The candidate token sequence and the refined annotated sequence are output to a domain-specific computational lexicon file.
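The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the LANGUAGE_DATA table, the PATTERNS repository, the "term: definition" glossary layout, the part-of-speech tags, and the scoring rule (count of agreeing tags among same-length patterns, ties broken by repository order) are all hypothetical choices made for the example.

```python
import json
import re

# Hypothetical language data: word -> part-of-speech tag (stand-in for a real look-up).
LANGUAGE_DATA = {"cardiac": "ADJ", "arrest": "NOUN", "failure": "NOUN"}

# Hypothetical pattern repository: a tag sequence paired with lexical information.
PATTERNS = [
    {"pos": ["NOUN"], "lexical": {"category": "noun", "head": 0}},
    {"pos": ["NOUN", "NOUN"], "lexical": {"category": "compound_noun", "head": 1}},
    {"pos": ["ADJ", "NOUN"], "lexical": {"category": "noun_phrase", "head": 1}},
]


def extract_candidate(entry):
    """Take the term before the first colon of an entry and split it into word tokens."""
    term = entry.split(":", 1)[0]
    return re.findall(r"[A-Za-z]+", term.lower())


def annotate(tokens):
    """Attach looked-up language data to each token; tokens not found get pos=None."""
    return [{"token": t, "pos": LANGUAGE_DATA.get(t)} for t in tokens]


def best_pattern(annotated):
    """Assumed matching criteria: same length, most agreeing tags, first pattern wins ties."""
    best, best_score = None, 0
    for pattern in PATTERNS:
        if len(pattern["pos"]) != len(annotated):
            continue
        score = sum(1 for a, p in zip(annotated, pattern["pos"]) if a["pos"] == p)
        if score > best_score:
            best, best_score = pattern, score
    return best


def refine(annotated, pattern):
    """Fill missing tags from the pattern and attach the pattern's lexical information."""
    tokens = [
        {"token": a["token"], "pos": a["pos"] or p}
        for a, p in zip(annotated, pattern["pos"])
    ]
    return {"tokens": tokens, **pattern["lexical"]}


def build_lexicon(entries):
    """Run extraction, look-up, annotation, matching, and refinement over all entries."""
    records = []
    for entry in entries:
        tokens = extract_candidate(entry)
        annotated = annotate(tokens)
        pattern = best_pattern(annotated)
        if pattern is not None:  # skip sequences with no matching pattern
            records.append({"sequence": tokens, "refined": refine(annotated, pattern)})
    return records


def write_lexicon(records, path):
    """Write one JSON record per line to the lexicon file (file format is an assumption)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


glossary = [
    "Cardiac arrest: sudden loss of heart function.",
    "Stent failure: loss of stent function.",
]
lexicon = build_lexicon(glossary)
```

In this sketch, "stent" is absent from the look-up table, so it is left unannotated until the best matching pattern supplies a tag during refinement. Calling write_lexicon(lexicon, "lexicon.jsonl") would then persist both the candidate sequence and its refined annotation; the file name is likewise only an example.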
8 Claims
1. A method as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
Specification