System and method of generating dictionary entries
Abstract
A system for automatically generating a dictionary from full text articles extracts <term, definition> pairs from full text articles and stores the <term, definition> pairs as dictionary entries. The system includes a computer readable corpus having a plurality of documents therein. A pattern processing module (120) and a grammar processing module (125) are provided for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in a dictionary database (145). A routing processing module selectively routes sentences in the corpus to at least one of the pattern processing module or the grammar processing module. In one embodiment, the routing module is incorporated into the pattern processing module, which then selectively routes a portion of the sentences to the grammar processing module. A bootstrapping processing module (150) can be used to apply <term, definition> entries against the corpus to identify and extract additional <term, definition> entries.
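The abstract does not say how the bootstrapping module (150) applies existing entries against the corpus. One plausible reading, offered purely as an illustration, is that known pairs are used to discover the connecting text between a term and its definition, and that connecting text is then matched against other sentences. The function names, the 30-character connector limit, and the sample sentences below are all assumptions, not details from the patent.

```python
import re


def learn_connectors(seed_pairs, sentences):
    """Collect the text found between a known term and its known definition."""
    connectors = set()
    for term, definition in seed_pairs:
        # Assumption: a connector is at most 30 characters long.
        pattern = re.compile(
            re.escape(term) + r"(.{1,30}?)" + re.escape(definition),
            re.IGNORECASE,
        )
        for sentence in sentences:
            match = pattern.search(sentence)
            if match:
                connectors.add(match.group(1))
    return connectors


def apply_connectors(connectors, sentences, known_terms):
    """Split sentences on learned connectors to propose new pairs."""
    found = []
    for sentence in sentences:
        body = sentence.rstrip(".")
        for connector in connectors:
            left, sep, right = body.partition(connector)
            if sep and left and right and left.lower() not in known_terms:
                found.append((left, right))
    return found


seeds = [("Angina", "chest pain")]
corpus = [
    "Angina refers to chest pain.",
    "Tachycardia refers to a rapid heart rate.",
]
connectors = learn_connectors(seeds, corpus)
new_pairs = apply_connectors(connectors, corpus, {t.lower() for t, _ in seeds})
```

In this toy run the seed pair yields the connector " refers to ", which in turn proposes one new entry from the second sentence. A real system would need to validate such proposals before storing them.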
22 Claims
-
1. A method for automatically generating a dictionary based on a corpus of full text articles, comprising:
applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs;
applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs;
storing the extracted <term, definition> pairs in a dictionary database;
wherein the linguistic pattern analysis further comprises identifying sentences including text markers, and wherein the sentences including text markers are subjected to filtering to remove sentences not likely to include <term, definition> pairs, and wherein the filtering includes rules selected from the group including sentences with conjunctions at the beginning of a text marker, sentences having phrases indicative of explanation, and sentences having patterns indicative of enumeration.
Dependent claims: 2, 3, 4, 5, 6, 7
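The three filtering rules in claim 1 can be sketched roughly as follows. The claim names the rule categories but not their contents, so everything concrete here is an assumption: the text marker is taken to be a parenthetical, and the conjunction list, explanation phrases, and enumeration test are hypothetical stand-ins.

```python
import re

# Hypothetical rule data; the claim does not enumerate these.
CONJUNCTIONS = {"and", "or", "but", "nor"}
EXPLANATION_PHRASES = ("i.e.", "e.g.", "for example", "such as")
LIST_LABEL = re.compile(r"^(?:\d+|[a-z])[.)]?$", re.IGNORECASE)

# Assumed text marker: a parenthetical such as "term (definition)".
MARKER = re.compile(r"\(([^()]*)\)")


def marker_contents(sentence):
    """Return the text inside each parenthetical marker in the sentence."""
    return [m.group(1).strip() for m in MARKER.finditer(sentence)]


def rejection_reason(content):
    """Name the rule that rejects this marker content, or None if it passes."""
    lower = content.lower()
    words = lower.split()
    if words and words[0] in CONJUNCTIONS:
        return "conjunction"
    if any(phrase in lower for phrase in EXPLANATION_PHRASES):
        return "explanation"
    # Enumeration: a bare list label like "1)" or three or more comma items.
    if LIST_LABEL.match(content) or content.count(",") >= 2:
        return "enumeration"
    return None


def keep_candidates(sentences):
    """Keep sentences that have a text marker and survive every rule."""
    kept = []
    for sentence in sentences:
        contents = marker_contents(sentence)
        if contents and all(rejection_reason(c) is None for c in contents):
            kept.append(sentence)
    return kept


sentences = [
    "Angina (chest pain caused by reduced blood flow) is common.",
    "The drug was stopped (and later resumed).",
    "Several symptoms (e.g. nausea) were reported.",
    "Risk factors (smoking, obesity, diabetes) were recorded.",
    "No marker here.",
]
kept = keep_candidates(sentences)
```

Only the first sentence survives in this toy run: the next three are each removed by one rule, and the last has no text marker to begin with.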
-
-
8. A method for automatically generating a dictionary based on a corpus of full text articles, comprising:
applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs;
applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs;
storing the extracted <term, definition> pairs in a dictionary database;
wherein the linguistic pattern analysis further comprises identifying sentences including cue phrases and wherein sentences including predetermined cue phrases are parsed to identify left hand side context and right hand side context of the cue phrase, and wherein if the left hand side and right hand side contexts are noun phrases, then the left hand side context is considered the term and the right hand side context is considered the definition for the term.
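A minimal sketch of the cue-phrase step in claim 8 might look like the following. The cue phrases are hypothetical examples, and `looks_like_noun_phrase` is a deliberately crude placeholder for a real noun-phrase chunker (it only rejects text containing a few verb or clause words); neither comes from the patent.

```python
import re

# Hypothetical cue phrases; the claim only says they are predetermined.
CUE_PHRASES = ("is defined as", "is the term for", "refers to")

# Toy stand-in for a noun-phrase chunker: words that signal a clause.
NON_NP_WORDS = {"is", "are", "was", "were", "be", "has", "have",
                "that", "which", "because", "when", "if"}


def looks_like_noun_phrase(text):
    """Crude placeholder check, not a real chunker."""
    words = text.lower().split()
    return bool(words) and not any(w in NON_NP_WORDS for w in words)


def extract_by_cue(sentence, cues=CUE_PHRASES, is_np=looks_like_noun_phrase):
    """Return (term, definition) if both sides of a cue phrase are noun phrases."""
    body = sentence.strip().rstrip(".")
    for cue in cues:
        match = re.search(r"\s" + re.escape(cue) + r"\s", body, re.IGNORECASE)
        if match is None:
            continue
        left = body[:match.start()].strip()    # left hand side context
        right = body[match.end():].strip()     # right hand side context
        if is_np(left) and is_np(right):
            return (left, right)               # left is term, right is definition
    return None


pair = extract_by_cue("Myocardial infarction is the term for a heart attack.")
rejected = extract_by_cue(
    "It is unclear which drug is defined as safe when taken daily."
)
```

Passing the noun-phrase test in as a parameter keeps the cue logic separate from whatever chunker is actually available.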
-
-
9. A method for automatically generating a dictionary based on a corpus of full text articles, comprising:
applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs;
applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs;
storing the extracted <term, definition> pairs in a dictionary database;
wherein grammar analysis of sentences including candidate complex <term, definition> pairs includes processing of sentences exhibiting apposition wherein apposition processing includes;
identifying the left conjunction and right conjunction in the sentence; and
if the right conjunction is a noun phrase and does not include a further conjunction, the right conjunction is considered the definition and the left conjunction is considered the term of a <term, definition> pair.
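The apposition rule of claim 9 could be sketched as below, assuming an upstream parser has already identified the left and right parts of the apposition and tagged each word. The Penn Treebank style tags, the noun-phrase test, and the reading of "further conjunction" as any word tagged `CC` are all assumptions made for illustration.

```python
# Tags allowed inside a simple noun phrase (Penn Treebank style, assumed).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def is_noun_phrase(tagged):
    """Simplified test: only NP tags, and the last word is a noun."""
    return (
        bool(tagged)
        and all(tag in NP_TAGS for _, tag in tagged)
        and tagged[-1][1].startswith("NN")
    )


def has_conjunction(tagged):
    """Assumption: a 'further conjunction' shows up as a CC-tagged word."""
    return any(tag == "CC" for _, tag in tagged)


def text_of(tagged):
    return " ".join(word for word, _ in tagged)


def pair_from_apposition(left, right):
    """Left part is the term, right part the definition, if the rule holds."""
    if has_conjunction(right) or not is_noun_phrase(right):
        return None
    return (text_of(left), text_of(right))


# "Arrhythmia, an irregular heartbeat, ..."
left = [("Arrhythmia", "NN")]
right = [("an", "DT"), ("irregular", "JJ"), ("heartbeat", "NN")]
pair = pair_from_apposition(left, right)

# "Side effects, nausea and vomiting, ..." is rejected: the right part
# contains a further conjunction.
rejected = pair_from_apposition(
    [("Side", "NN"), ("effects", "NNS")],
    [("nausea", "NN"), ("and", "CC"), ("vomiting", "NN")],
)
```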
-
-
10. A method for automatically generating a dictionary based on a corpus of full text articles, comprising:
applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs;
applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs;
storing the extracted <term, definition> pairs in a dictionary database;
wherein grammar analysis of sentences including candidate complex <term, definition> pairs includes processing of sentences in the form <term> is <definition> and wherein sentences of the form <term> is <definition> are further processed by the steps;
if the root of the sentence is a present tense form of the verb “to be” and the subject of the sentence is a noun then evaluate the predicate of the sentence;
evaluate daughters of subject to determine if daughters are left conjunctions and right conjunctions;
if daughters are not in the form of left conjunctive and right conjunctive, evaluate predicate, and if the predicate of the sentence is a noun, the subject is considered the term and a subtree rooted at the predicate is considered the definition of the <term, definition> pair.
Dependent claims: 11
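The parse-tree walk described in claim 10 can be illustrated with a small hand-built tree. The `Node` structure, the role labels (`subj`, `pred`, `lconj`, `rconj`), and the coarse part-of-speech tags are hypothetical; a real system would take them from its parser. The sketch also returns nothing when the subject has left and right conjunct daughters, since claim 10 itself does not say what happens in that branch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    pos: str      # coarse tag such as "NOUN" or "VERB" (assumed tag set)
    role: str     # "root", "subj", "pred", "lconj", "rconj", "mod" (assumed)
    index: int    # position of the word in the sentence
    children: list = field(default_factory=list)


PRESENT_BE = {"is", "are", "am"}


def child(node, role):
    """First daughter with the given role, or None."""
    return next((c for c in node.children if c.role == role), None)


def subtree_text(node):
    """Words of the subtree rooted at node, in sentence order."""
    nodes, stack = [], [node]
    while stack:
        current = stack.pop()
        nodes.append(current)
        stack.extend(current.children)
    return " ".join(n.word for n in sorted(nodes, key=lambda n: n.index))


def extract_is_definition(root):
    """Apply the '<term> is <definition>' rule to a parsed sentence."""
    if root.word.lower() not in PRESENT_BE:
        return None
    subj = child(root, "subj")
    if subj is None or subj.pos != "NOUN":
        return None
    if child(subj, "lconj") and child(subj, "rconj"):
        return None  # conjunct daughters: not handled in this sketch
    pred = child(root, "pred")
    if pred is None or pred.pos != "NOUN":
        return None
    # Assumption: the term is the subject with its modifiers.
    return (subtree_text(subj), subtree_text(pred))


# "Tachycardia is a rapid heart rate"
pred = Node("rate", "NOUN", "pred", 5, [
    Node("a", "DET", "mod", 2),
    Node("rapid", "ADJ", "mod", 3),
    Node("heart", "NOUN", "mod", 4),
])
root = Node("is", "VERB", "root", 1,
            [Node("Tachycardia", "NOUN", "subj", 0), pred])
result = extract_is_definition(root)
```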
-
-
12. A system for automatically generating a dictionary from full text articles comprising:
a computer readable corpus having a plurality of documents therein;
a computer readable dictionary database;
a pattern processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database;
a grammar processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database; and
a routing processing module which routes sentences in the corpus to the pattern processing module;
wherein the pattern processing module further comprises a filtering module to remove sentences not likely to include <term, definition> pairs and wherein the filtering module applies at least one rule selected from the group including sentences with conjunctions at the beginning of a text marker, sentences having phrases indicative of explanation, and sentences having patterns indicative of enumeration.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
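The module arrangement can be sketched as a simple loop, following the embodiment in which every sentence goes to the pattern module first and only flagged sentences are passed on to the grammar module. The two toy modules, the in-memory `dict` standing in for the dictionary database, and the sample corpus are assumptions for illustration only.

```python
import re


def build_dictionary(corpus, pattern_module, grammar_module):
    """Route each sentence through the modules and collect the pairs.

    corpus is an iterable of documents, each a list of sentences.
    pattern_module returns (pairs, needs_grammar) for one sentence.
    grammar_module returns a list of pairs for one sentence.
    """
    dictionary = {}  # stand-in for the dictionary database
    for document in corpus:
        for sentence in document:
            pairs, needs_grammar = pattern_module(sentence)
            if needs_grammar:
                pairs = pairs + grammar_module(sentence)
            for term, definition in pairs:
                dictionary.setdefault(term, []).append(definition)
    return dictionary


def toy_pattern_module(sentence):
    """Toy rule: 'term (definition)' at the start of the sentence."""
    match = re.match(r"^([A-Za-z ]+?) \(([^()]+)\)", sentence)
    if match:
        return [(match.group(1), match.group(2))], False
    # Flag sentences with " is " as candidates for grammar analysis.
    return [], " is " in sentence


def toy_grammar_module(sentence):
    """Toy rule: split on the first ' is '."""
    term, _, definition = sentence.rstrip(".").partition(" is ")
    return [(term, definition)] if term and definition else []


corpus = [
    ["Angina (chest pain) often follows exertion.",
     "Tachycardia is a rapid heart rate."],
    ["The patient went home."],
]
dictionary = build_dictionary(corpus, toy_pattern_module, toy_grammar_module)
```

Storing a list of definitions per term allows the same term to be defined in several articles without one entry overwriting another.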
-
-
21. A system for automatically generating a dictionary from full text articles comprising:
a computer readable corpus having a plurality of documents therein;
a computer readable dictionary database;
a pattern processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database;
a grammar processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database; and
a routing processing module which routes sentences in the corpus to the pattern processing module, wherein the pattern processing module identifies sentences including cue phrases as <term, definition> pairs, and wherein the pattern processing module parses sentences including predetermined cue phrases to identify left hand side context and right hand side context of the cue phrase, and wherein if the left hand side and right hand side contexts are noun phrases, then the left hand side context is assigned as the term and the right hand side context is assigned as the definition for the term.
-
-
22. Computer readable media encoded with instructions to direct a computer system to generate a dictionary based on a corpus of full text articles, comprising the steps of:
applying linguistic pattern analysis to sentences in a corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs;
applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs; and
storing the extracted <term, definition> pairs in a dictionary database;
wherein the linguistic pattern analysis further comprises identifying sentences including text markers and, wherein the sentences including text markers are subjected to filtering to remove sentences not likely to include <term, definition> pairs, and wherein the filtering includes rules selected from the group including sentences with conjunctions at the beginning of a text marker, sentences having phrases indicative of explanation, and sentences having patterns indicative of enumeration.
-
Specification