CREATING A TERMS DICTIONARY WITH NAMED ENTITIES OR TERMINOLOGIES INCLUDED IN TEXT DATA
First Claim
1. A method of creating a terms dictionary with named entities or terminologies included in text data, comprising:
acquiring token sequence data by performing morphological analysis for the text data;
distinguishing tokens of the token sequence data by using a category dictionary to extract uncategorized words;
comparing each of the extracted uncategorized words with an uncategorized-word comparison rule to extract an uncategorized word matching the uncategorized-word comparison rule as a registration candidate word, wherein the uncategorized-word comparison rule includes a token composed of a first character string and a first regular expression for use in extracting the matching uncategorized word;
comparing a token sequence of the token sequence data with a token-sequence comparison rule to extract a token sequence matching the token-sequence comparison rule as registration candidate words, wherein the token-sequence comparison rule includes a token sequence including a second character string and a second regular expression for use in extracting the matching token sequence; and
permitting a user to select whether to register the registration candidate words in the category dictionary.
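The two comparison rules recited in the claim can be illustrated with a minimal Python sketch. The claim does not fix a rule format, so everything here is an assumption made for illustration: the class names WordRule and SequenceRule, the choice to treat the "character string" as a literal suffix that follows the regular expression, the category labels, and the sample tokens (including "Dr." being a single token).

```python
import re
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class WordRule:
    """Hypothetical uncategorized-word rule: a regex part followed by a literal string."""
    regex: str      # plays the role of the "first regular expression"
    literal: str    # plays the role of the "first character string"
    category: str   # category proposed for a matching word (assumption)

    def matches(self, word: str) -> bool:
        # The whole token must consist of the regex part followed by the literal part.
        pattern = f"(?:{self.regex}){re.escape(self.literal)}"
        return re.fullmatch(pattern, word) is not None


@dataclass(frozen=True)
class SequenceRule:
    """Hypothetical token-sequence rule: one element per token position."""
    elements: Tuple[Union[str, re.Pattern], ...]  # str = literal token, Pattern = regex token
    category: str

    def find(self, tokens: Sequence[str]) -> List[str]:
        # Slide a window of len(elements) tokens over the token sequence.
        n = len(self.elements)
        hits = []
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if all(_element_matches(e, t) for e, t in zip(self.elements, window)):
                hits.append(" ".join(window))
        return hits


def _element_matches(element: Union[str, re.Pattern], token: str) -> bool:
    if isinstance(element, str):
        return element == token                  # literal character string
    return element.fullmatch(token) is not None  # regular expression


# Illustrative rules and tokens (not taken from the patent).
word_rule = WordRule(regex=r"[a-z]+", literal="itis", category="disease")
seq_rule = SequenceRule(elements=("Dr.", re.compile(r"[A-Z][a-z]+")), category="person")
tokens = ["We", "met", "Dr.", "Smith", "and", "Dr.", "who"]

word_candidates = [w for w in ["bronchitis", "itis", "Bronchitis"] if word_rule.matches(w)]
sequence_candidates = seq_rule.find(tokens)
```

In this sketch a word rule inspects one uncategorized token at a time, while a sequence rule inspects consecutive tokens, which is how it can pick up multi-token names that no single-token rule would see.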
Abstract
A computer system of an embodiment of the disclosure can be used to automatically create or populate a terms dictionary using a set of computing units. A morphological analysis unit can acquire token sequence data by performing morphological analysis for the text data. A category distinguishing unit can distinguish tokens of the token sequence data by using a category dictionary to extract uncategorized words. An uncategorized-word comparing unit can compare each of the extracted uncategorized words with an uncategorized-word comparison rule to extract an uncategorized word matching the uncategorized-word comparison rule as a registration candidate word. A token-sequence comparing unit can compare a token sequence of the token sequence data with a token-sequence comparison rule to extract a token sequence matching the token-sequence comparison rule as registration candidate words. A permission unit can permit a user to select whether to register the registration candidate words in the category dictionary.
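The flow through the five units named in the abstract can be sketched end to end as follows. This is a sketch under stated assumptions, not the disclosed implementation: the tokenizer is a plain regular-expression stand-in for a real morphological analyzer, the function names, the sample dictionary, the rules, and the sample sentence are all hypothetical, and the user's choice is modelled as a confirm callback.

```python
import re
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]  # (candidate text, proposed category)


def morphological_analysis(text: str) -> List[str]:
    """Stand-in for the morphological analysis unit (assumption: word-character runs)."""
    return re.findall(r"\w+", text)


def extract_uncategorized(tokens: List[str], category_dict: Dict[str, str]) -> List[str]:
    """Category distinguishing unit: keep tokens absent from the category dictionary."""
    return [t for t in tokens if t not in category_dict]


def compare_words(words: List[str], word_rules) -> List[Candidate]:
    """Uncategorized-word comparing unit: each rule is (compiled pattern, category)."""
    return [(w, cat) for w in words for pat, cat in word_rules if pat.fullmatch(w)]


def compare_sequences(tokens: List[str], sequence_rules) -> List[Candidate]:
    """Token-sequence comparing unit: each rule is (list of compiled patterns, category)."""
    found = []
    for patterns, cat in sequence_rules:
        n = len(patterns)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if all(p.fullmatch(t) for p, t in zip(patterns, window)):
                found.append((" ".join(window), cat))
    return found


def register_with_permission(candidates: List[Candidate],
                             category_dict: Dict[str, str],
                             confirm: Callable[[str, str], bool]) -> List[str]:
    """Permission unit: register only the candidates the user accepts."""
    accepted = []
    for text, cat in candidates:
        if confirm(text, cat):
            category_dict[text] = cat
            accepted.append(text)
    return accepted


# Hypothetical dictionary, rules, and input text.
category_dict = {"Patient": "noun", "shows": "verb", "after": "preposition",
                 "taking": "verb", "mg": "unit"}
word_rules = [(re.compile(r"[a-z]+" + re.escape("itis")), "disease")]
sequence_rules = [([re.compile(r"[A-Z][a-z]+"), re.compile(r"\d+"),
                    re.compile(re.escape("mg"))], "prescription")]

tokens = morphological_analysis("Patient 12 shows bronchitis after taking Zentol 50 mg")
uncategorized = extract_uncategorized(tokens, category_dict)
candidates = compare_words(uncategorized, word_rules) + compare_sequences(tokens, sequence_rules)
# The lambda simulates a user who accepts only disease terms.
accepted = register_with_permission(candidates, category_dict,
                                    confirm=lambda text, cat: cat == "disease")
```

Passing the confirmation step in as a callback keeps candidate extraction separate from registration, so the same candidates could be shown in an interactive prompt or a review screen without changing the matching code.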
68 Citations
20 Claims
1. A method of creating a terms dictionary with named entities or terminologies included in text data, comprising:
acquiring token sequence data by performing morphological analysis for the text data;
distinguishing tokens of the token sequence data by using a category dictionary to extract uncategorized words;
comparing each of the extracted uncategorized words with an uncategorized-word comparison rule to extract an uncategorized word matching the uncategorized-word comparison rule as a registration candidate word, wherein the uncategorized-word comparison rule includes a token composed of a first character string and a first regular expression for use in extracting the matching uncategorized word;
comparing a token sequence of the token sequence data with a token-sequence comparison rule to extract a token sequence matching the token-sequence comparison rule as registration candidate words, wherein the token-sequence comparison rule includes a token sequence including a second character string and a second regular expression for use in extracting the matching token sequence; and
permitting a user to select whether to register the registration candidate words in the category dictionary.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer program for creating a terms dictionary with named entities or terminologies included in text data, wherein the computer program is stored in a tangible storage medium and when executed causes a computer system to:
acquire token sequence data by performing morphological analysis for the text data;
distinguish tokens of the token sequence data by using a category dictionary to extract uncategorized words;
compare each of the extracted uncategorized words with an uncategorized-word comparison rule to extract an uncategorized word matching the uncategorized-word comparison rule as a registration candidate word, wherein the uncategorized-word comparison rule includes a token composed of a first character string and a first regular expression for use in extracting the matching uncategorized word;
compare a token sequence of the token sequence data with a token-sequence comparison rule to extract a token sequence matching the token-sequence comparison rule as registration candidate words, wherein the token-sequence comparison rule includes a token sequence including a second character string and a second regular expression for use in extracting the matching token sequence; and
permit a user to select whether to register the registration candidate words in the category dictionary.
14. A computer system for creating a terms dictionary with named entities or terminologies included in text data, the computer system comprising:
a morphological analysis unit for acquiring token sequence data by performing morphological analysis for the text data;
a category distinguishing unit for distinguishing tokens of the token sequence data by using a category dictionary to extract uncategorized words;
an uncategorized-word comparing unit for comparing each of the extracted uncategorized words with an uncategorized-word comparison rule to extract an uncategorized word matching the uncategorized-word comparison rule as a registration candidate word, wherein the uncategorized-word comparison rule includes a token composed of a first character string and a first regular expression for use in extracting the matching uncategorized word;
a token-sequence comparing unit for comparing a token sequence of the token sequence data with a token-sequence comparison rule to extract a token sequence matching the token-sequence comparison rule as registration candidate words, wherein the token-sequence comparison rule includes a token sequence including a second character string and a second regular expression for use in extracting the matching token sequence; and
a permission unit for permitting a user to select whether to register the registration candidate words in the category dictionary.
Dependent claims: 15, 16, 17, 18, 19, 20
Specification