Providing capitalization correction for unstructured excerpts
Abstract
Providing capitalization correction for unstructured excerpts is described. An excerpt of unstructured content is tokenized into a set of words. The set of words is analyzed for correct capitalization. Individual characters constituting at least one such word in the set of words are evaluated. The at least one such word is skipped if determined to be of a predefined type.
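The flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LEXICON` contents, the function names, and the rule used for the "predefined type" (any token containing a digit, `@`, or `/`) are all assumptions made for the example.

```python
# Hypothetical lexicon: lowercase word -> preferred capitalization.
LEXICON = {"iphone": "iPhone", "nasa": "NASA", "ebay": "eBay"}


def is_skipped_type(token):
    """Assumed 'predefined type': the token contains a digit, '@', or '/'.

    Each character of the token is evaluated individually, so URLs,
    e-mail addresses, and numbers are left untouched.
    """
    return any(ch.isdigit() or ch in "@/" for ch in token)


def correct_capitalization(excerpt, lexicon=LEXICON):
    """Tokenize an excerpt on whitespace and correct each word's capitalization."""
    corrected = []
    for token in excerpt.split():
        if is_skipped_type(token):
            corrected.append(token)  # skip tokens of the predefined type
            continue
        # Separate trailing punctuation so "nasa," still matches "nasa".
        core = token.rstrip(".,;:!?")
        tail = token[len(core):]
        corrected.append(lexicon.get(core.lower(), core) + tail)
    return " ".join(corrected)


print(correct_capitalization("the IPHONE from nasa, see http://ebay.com"))
# prints: the iPhone from NASA, see http://ebay.com
```

Words missing from the lexicon pass through unchanged, and the URL is skipped because it contains `/`.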
24 Claims
1. A computer system for building a lexicon for use in capitalization correction for unstructured excerpts, comprising:
at least one processing unit;
at least one storage device being coupled with the at least one processing unit and storing program code;
a ripper adapted to assemble a list of word sets from unstructured content, at least one of the word sets comprising a word and at least two non-standard capitalization variations for the word; and
an aggregator adapted to aggregate at least one of the word sets, the aggregator including an analyzer adapted to identify non-standard capitalization variations based on at least one criteria; and
a non-standard capitalization selector adapted to select at least one of the identified non-standard capitalization variations within one of the at least one word sets, and adding the selected at least one of the identified non-standard capitalization variations to the lexicon, wherein the lexicon includes records, each record including a word, wherein the lexicon is indexed by the words included in the records, and wherein at least one of the records includes more than one non-standard capitalization variation.
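One way to picture the ripper, analyzer, and selector of claim 1 is the sketch below. It is an illustration under stated assumptions, not the claimed system: the function names are hypothetical, and the "criteria" is assumed to be that a form is non-standard when it is neither all lowercase nor lowercase with an initial capital.

```python
import re
from collections import defaultdict


def is_non_standard(form):
    """Analyzer (assumed criterion): not all-lowercase and not initial-capital-only."""
    return form != form.lower() and form != form.capitalize()


def rip(documents):
    """Ripper: gather, for each lowercase word, every surface form seen in the content."""
    word_sets = defaultdict(set)
    for doc in documents:
        for form in re.findall(r"[A-Za-z]+", doc):
            word_sets[form.lower()].add(form)
    return word_sets


def aggregate(word_sets):
    """Aggregator: select the non-standard variations and add them to the lexicon.

    The lexicon is indexed by the lowercase word; a record may hold
    more than one non-standard capitalization variation.
    """
    lexicon = {}
    for word, forms in word_sets.items():
        selected = sorted(f for f in forms if is_non_standard(f))
        if selected:
            lexicon[word] = selected
    return lexicon


docs = ["I sold my iPhone on eBay.", "The IPhone and the iphone differ."]
lexicon = aggregate(rip(docs))
print(lexicon)
# prints: {'iphone': ['IPhone', 'iPhone'], 'ebay': ['eBay']}
```

Here the record for `iphone` carries two non-standard variations, while `the` and `i` produce no record because all of their observed forms are standard under the assumed criterion.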
15. A computer-implemented method comprising:
a) generating a plurality of word sets from a text corpus, at least one of the word sets including a word identified from the text corpus, at least one non-standard capitalization variation of the word included in the word set, and a frequency of occurrence of each of the at least one non-standard capitalization variation of the word included in the word set;
b) generating a lexicon using the generated plurality of word sets, wherein the lexicon includes, for each of a plurality of words, at least one capitalization variation identified using at least one criteria, wherein at least one of the words of the lexicon includes more than one non-standard capitalization variation identified using the at least one criteria; and
c) storing the generated lexicon.
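Steps a) through c) of the method can be sketched as below. The claim does not say what the criteria are or how the lexicon is stored, so this example assumes a minimum frequency threshold (`min_count`) as the criterion and a JSON file as the storage format; all function names are hypothetical.

```python
import json
import re
from collections import Counter, defaultdict


def generate_word_sets(corpus):
    """Step a): per word, count each non-standard capitalization variation."""
    word_sets = defaultdict(Counter)
    for form in re.findall(r"[A-Za-z]+", corpus):
        # Assumed definition of non-standard: not lowercase, not initial-capital-only.
        if form != form.lower() and form != form.capitalize():
            word_sets[form.lower()][form] += 1
    return word_sets


def generate_lexicon(word_sets, min_count=2):
    """Step b): keep variations meeting the assumed frequency criterion."""
    lexicon = {}
    for word, counts in word_sets.items():
        kept = [form for form, n in counts.most_common() if n >= min_count]
        if kept:
            lexicon[word] = kept
    return lexicon


def store_lexicon(lexicon, path):
    """Step c): store the lexicon (JSON is an assumption of this sketch)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(lexicon, fh, indent=2, sort_keys=True)


corpus = "eBay eBay EBAY EBAY EBAY ebaY iPhone"
print(generate_lexicon(generate_word_sets(corpus)))
# prints: {'ebay': ['EBAY', 'eBay']}
```

With `min_count=2`, the single occurrences of `ebaY` and `iPhone` are dropped, and the `ebay` entry keeps two variations ordered by frequency.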
20. Apparatus comprising:
a) means for generating a plurality of word sets from a text corpus, at least one of the word sets including a word identified from the text corpus, at least one non-standard capitalization variation of the word included in the word set, and a frequency of occurrence of each of the at least one non-standard capitalization variation of the word included in the word set; and
b) means for generating a lexicon using the generated plurality of word sets, wherein the lexicon includes, for each of a plurality of words, at least one capitalization variation identified using at least one criteria, wherein at least one of the words of the lexicon includes more than one non-standard capitalization variation identified using the at least one criteria.
Specification