Keyword extraction apparatus, keyword extraction method, and computer readable recording medium storing keyword extraction program
Abstract
Disclosed are a keyword extraction apparatus and method that address a problem in conventional automatic keyword extraction: character strings in the sentence being processed are used as they are to index a document by keywords, so words with a similar meaning but different written expressions cannot be retrieved. The keyword extraction apparatus comprises technical term storage means for storing technical terms with proper expressions and different expressions thereof, and basic word storage means for storing general basic words of high frequency. Technical-term segmentation point setting means cuts out, from an input sentence, a range of any of the technical terms stored in the technical term storage means. When the cut-out technical term is written in a different expression, proper-expression replacing means replaces the different expression with the corresponding proper expression. Character-type segmentation point setting means detects a difference in character type in the input sentence. Basic-word segmentation point setting means cuts out, from the input sentence, a range of any of the basic words stored in the basic word storage means. Partial character string cutting means cuts out, as keywords, all relevant partial character strings based on the segmentation points set by the technical-term segmentation point setting means, the character-type segmentation point setting means and the basic-word segmentation point setting means.
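The technical-term matching and proper-expression replacement described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `replace_variants`, the table layout (every stored expression mapped to its proper expression), the longest-match rule and the sample entries are all assumptions made for illustration.

```python
def replace_variants(sentence, term_table):
    """Replace stored variant spellings with their proper expression.

    term_table (hypothetical layout) maps every stored expression, proper
    or variant, to the proper expression. Returns the normalized sentence
    and the (start, end) spans of the technical terms within it.
    """
    # Assumption: when several stored expressions match, the longest wins.
    expressions = sorted((e for e in term_table if e), key=len, reverse=True)
    pieces, spans = [], []
    i = 0    # read position in the input sentence
    pos = 0  # write position in the normalized sentence
    while i < len(sentence):
        match = next((e for e in expressions if sentence.startswith(e, i)), None)
        if match is None:
            pieces.append(sentence[i])
            i += 1
            pos += 1
        else:
            proper = term_table[match]
            spans.append((pos, pos + len(proper)))  # segmentation points
            pieces.append(proper)
            i += len(match)
            pos += len(proper)
    return "".join(pieces), spans


TERM_TABLE = {  # hypothetical entries
    "database": "database",
    "data base": "database",
    "data-base": "database",
}

normalized, spans = replace_variants("a data base index", TERM_TABLE)
```

Because the spans are reported against the normalized sentence, later steps can treat them directly as segmentation points.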
9 Claims
1. A keyword extraction apparatus comprising:
a technical term storage means for storing technical terms with proper expressions and different expressions thereof;
a basic word storage means for storing general basic words of high frequency;
an input means through which a sentence is input;
a technical-term segmentation point setting means for, when any of the technical terms stored in said technical term storage means exists in the sentence input through said input means, cutting out a range of that technical term from the input sentence;
a proper-expression replacing means for, when the technical term cut out by said technical-term segmentation point setting means is written in a different expression, replacing the different expression by a corresponding proper expression;
a character-type segmentation point setting means for detecting a difference in character type in the input sentence;
a basic-word segmentation point setting means for cutting out, from the input sentence, a range of any of the basic words stored in said basic word storage means;
a partial character string cutting means for cutting out partial character strings based on segmentation points set by said technical-term segmentation point setting means, said character-type segmentation point setting means and said basic-word segmentation point setting means; and
an output means for outputting, as keywords, the partial character strings cut out by said partial character string cutting means.
2. A keyword extraction method comprising:
an input step for inputting a sentence;
a technical-term segmentation point setting step for, when any of technical terms in a technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in said input step, cutting out a range of that technical term from the input sentence;
a proper-expression replacing step for, when the technical term cut out in said technical-term segmentation point setting step is written in a different expression, replacing a range of said technical term in the input sentence with a corresponding proper expression;
a character-type segmentation point setting step for detecting a difference in character type in the input sentence;
a basic-word segmentation point setting step for, when any of basic words in a basic word storage means for storing, as the basic words, general words of a high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence; and
a partial character string cutting step for cutting out, as keywords, partial character strings based on segmentation points set in said technical-term segmentation point setting step, said character-type segmentation point setting step and said basic-word segmentation point setting step.
3. A keyword extraction method according to claim 2, further comprising:
a prefix segmentation point setting step for cutting out a range of any of prefixes in the Japanese input sentence by referring to a prefix storage means for storing the prefixes, wherein said partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in said technical-term segmentation point setting step, said character-type segmentation point setting step, said basic-word segmentation point setting step, and said prefix segmentation point setting step.
4. A keyword extraction method according to claim 3, further comprising, when the sentence input in said input step is written in Japanese:
a suffix segmentation point setting step for cutting out a range of any of suffixes in the Japanese input sentence by referring to a suffix storage means for storing the suffixes, wherein said partial character string cutting step cuts out, as keywords, all relevant partial character strings based on the segmentation points set in said technical-term segmentation point setting step, said character-type segmentation point setting step, said basic-word segmentation point setting step, said prefix segmentation point setting step, and said suffix segmentation point setting step.
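The prefix and suffix segmentation point setting steps can be sketched as follows. This is a minimal illustration under stated assumptions: affixes are only checked at the start and end of an already-cut segment, and the function name `affix_points`, the `offset` parameter and the sample affix sets are hypothetical.

```python
def affix_points(segment, offset, prefixes, suffixes):
    """Return segmentation points that split stored affixes off a segment.

    offset is the index of the segment's first character in the sentence,
    so the returned points are sentence positions.
    """
    points = set()
    for prefix in prefixes:
        # Only split when something remains after the prefix.
        if prefix and segment.startswith(prefix) and len(segment) > len(prefix):
            points.add(offset + len(prefix))
    for suffix in suffixes:
        # Only split when something remains before the suffix.
        if suffix and segment.endswith(suffix) and len(segment) > len(suffix):
            points.add(offset + len(segment) - len(suffix))
    return points


PREFIXES = {"非", "再"}  # hypothetical prefix storage
SUFFIXES = {"化", "的"}  # hypothetical suffix storage

prefix_result = affix_points("非公開", 0, PREFIXES, SUFFIXES)
suffix_result = affix_points("自動化", 4, PREFIXES, SUFFIXES)
```

With these points added, the cutting step can also emit "公開" and "自動" as keywords in addition to the full strings.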
5. A keyword extraction method according to claim 2, further comprising a number-of-characters limiting step for deleting the keywords extracted in said partial character string cutting step which have a character string length outside a predetermined range, thereby providing redetermined keywords.
6. A keyword extraction method according to claim 5, further comprising a frequency totalizing step for counting an appearance frequency of each of the keywords or the redetermined keywords extracted in said partial character string cutting step or said number-of-characters limiting step.
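The number-of-characters limiting step of claim 5 and the frequency totalizing step of claim 6 can be sketched together in Python. The claims do not state the length range, so the bounds `min_len=2` and `max_len=12`, the function names and the candidate list are assumptions.

```python
from collections import Counter


def limit_length(keywords, min_len=2, max_len=12):
    """Drop keywords whose length falls outside the (assumed) range."""
    return [k for k in keywords if min_len <= len(k) <= max_len]


def count_frequency(keywords):
    """Count how often each keyword appears."""
    return Counter(keywords)


candidates = ["a", "index", "database", "index", "x" * 20]  # hypothetical
redetermined = limit_length(candidates)
frequency = count_frequency(redetermined)
```

Filtering before counting keeps one-character fragments and overlong strings out of the frequency table.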
7. A keyword extraction method according to claim 5, further comprising:
a symbolic-character segmentation point setting step for, when any of prescribed symbolic characters appears in the input sentence, cutting out the symbolic character; and
a symbolic character deleting step for deleting the symbolic character cut out in said symbolic-character segmentation point setting step when said symbolic character is contained as one character in any of the keywords or the redetermined keywords extracted in said partial character string cutting step or said number-of-characters limiting step.
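One possible reading of claim 7 is sketched below: segmentation points are set on both sides of each prescribed symbol, and any keyword that consists of exactly one such symbol is then deleted. The symbol set `SYMBOLS`, the function names and this interpretation of "contained as one character" are assumptions.

```python
SYMBOLS = frozenset("、。・,.()/")  # hypothetical prescribed symbols


def symbol_points(sentence, symbols=SYMBOLS):
    """Set a segmentation point before and after every prescribed symbol."""
    points = set()
    for i, ch in enumerate(sentence):
        if ch in symbols:
            points.update((i, i + 1))
    return points


def drop_symbol_keywords(keywords, symbols=SYMBOLS):
    """Delete keywords that are a single prescribed symbol."""
    return [k for k in keywords if not (len(k) == 1 and k in symbols)]


points = symbol_points("ab,cd")
kept = drop_symbol_keywords(["ab", ",", "cd"])
```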
8. A keyword extraction method according to claim 2, wherein said technical term storage means stores technical terms which are created in a different expression adding step with the aid of different expressions registered in non-technical-term different expression storage means for storing different expressions of general words of high frequency and different expressions of the technical terms registered in said technical term storage means, said different expression adding step comprising:
a word dividing step for, when a technical term in the input sentence is a compound word, dividing the compound word into partial character strings composing said compound word;
a different expression developing step for combining different expressions of said partial character strings with each other to create different expressions of said compound word; and
a registering step for creating pairs of each of said created different expressions and a proper expression of said compound word, and registering the pairs in said technical term storage means.
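The different expression developing and registering steps of claim 8 amount to combining the variants of each part of a compound word. The Python sketch below assumes the compound has already been divided into parts, since the claim does not say how the division is done. The function name `develop_variants`, the table layout and the katakana sample entries are hypothetical.

```python
from itertools import product


def develop_variants(parts, variants_of, proper):
    """Pair every combined variant of a compound word with its proper form.

    parts       -- partial character strings composing the compound word
    variants_of -- maps a part to its known different expressions
    proper      -- proper expression of the whole compound word
    """
    # Each part may appear as itself or as any of its stored variants.
    options = [
        [part] + [v for v in variants_of.get(part, []) if v != part]
        for part in parts
    ]
    combined = {"".join(choice) for choice in product(*options)}
    combined.discard(proper)  # the proper expression is not a variant
    return {variant: proper for variant in sorted(combined)}


PARTS = ["コンピュータ", "インタフェース"]
VARIANTS_OF = {  # hypothetical different-expression entries
    "コンピュータ": ["コンピューター"],
    "インタフェース": ["インターフェース", "インターフェイス"],
}
PROPER = "コンピュータインタフェース"

pairs = develop_variants(PARTS, VARIANTS_OF, PROPER)
```

Two spellings of the first part and three of the second give six combinations, five of which are registered as different expressions of the proper form.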
9. A computer readable recording medium storing a program which enables a keyword extraction process to be executed in a computer, said keyword extraction process comprising:
an input sequence for inputting a sentence;
a technical-term segmentation point setting sequence for, when any of technical terms in technical term storage means for storing technical terms with proper expressions and different expressions thereof exists in the sentence input in said input sequence, cutting out a range of that technical term from the input sentence;
a proper-expression replacing sequence for, when the technical term cut out in said technical-term segmentation point setting sequence is written in a different expression, replacing a range of said technical term in the input sentence by a corresponding proper expression;
a character-type segmentation point setting sequence for detecting a difference in character type in the input sentence;
a basic-word segmentation point setting sequence for, when any of basic words in basic word storage means for storing, as the basic words, general words of high frequency exists in the input sentence, cutting out a range of any of the basic words from the input sentence; and
a partial character string cutting sequence for cutting out, as keywords, all relevant partial character strings based on segmentation points set in said technical-term segmentation point setting sequence, said character-type segmentation point setting sequence and said basic-word segmentation point setting sequence.
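The basic-word segmentation point setting sequence and the final cutting sequence can be sketched in Python as follows. The phrase "all relevant partial character strings" is read here as every substring bounded by two segmentation points, which is an assumption. The function names and the sample basic words are also hypothetical.

```python
from itertools import combinations


def basic_word_points(sentence, basic_words):
    """Set points at both ends of every occurrence of a stored basic word."""
    points = set()
    for word in basic_words:
        if not word:
            continue
        start = sentence.find(word)
        while start != -1:
            points.update((start, start + len(word)))
            start = sentence.find(word, start + 1)
    return points


def cut_all(sentence, points):
    """Cut every substring whose two ends are segmentation points.

    The sentence start and end always count as points (assumption).
    """
    inner = {p for p in points if 0 < p < len(sentence)}
    bounds = sorted({0, len(sentence)} | inner)
    return [sentence[a:b] for a, b in combinations(bounds, 2)]


word_points = basic_word_points("index of the database", {"of", "the"})
substrings = cut_all("abcd", {1, 3})
```

Under this reading the number of candidate strings grows quadratically with the number of points, which is one reason a length limit and symbol deletion are useful afterwards.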
Specification