Keyword extraction apparatus for Japanese texts

US 5,619,410 A
Filed: 03/29/1994
Issued: 04/08/1997
Est. Priority Date: 03/29/1993
Status: Expired due to Fees

Abstract

Sentence segmentation means performs sentence segmentation on the Japanese text data to be processed. Morpheme analysis means divides the sentence-by-sentence data into morphemes and analyzes the resulting morphemes on the basis of information regarding morpheme-by-morpheme continuation contained in an analytical dictionary. Morpheme dictionary information development means develops the contents of the morpheme dictionary, including part of speech information, semantic classification information, sentence pattern information, and noted term information. Keyword candidate extraction means extracts keyword candidates from the sentence-by-sentence data on the basis of the part of speech information and the like of each morpheme. Case information acquisition means acquires case information from information, stored in a noted term table, regarding the case classes of keyword candidates immediately preceding noted terms, and from case class classification information stored in a case class conversion table. Frequency information acquisition means acquires the appearance frequency of each keyword candidate. Importance calculation means calculates the importance of each keyword candidate as a keyword. Keyword finalizing means determines as keywords only those keyword candidates whose importance exceeds a designated level.
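
The first stages of the abstract (sentence segmentation, morpheme division, and part-of-speech based candidate extraction) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the toy dictionary `MORPHEME_DICT`, the function names, and the rule that nouns are the candidates. Greedy longest-match lookup is swapped in for the analysis based on continuation information between morphemes, which the record does not spell out.

```python
# Hypothetical toy morpheme dictionary: surface form -> part of speech.
# A fuller dictionary would also carry semantic classification,
# sentence pattern, and noted term information.
MORPHEME_DICT = {
    "計算機": "noun",
    "言語": "noun",
    "処理": "noun",
    "が": "particle",
    "を": "particle",
    "する": "verb",
}


def split_sentences(text):
    """Split text on the Japanese full stop and drop empty pieces."""
    return [s for s in text.split("。") if s.strip()]


def split_morphemes(sentence):
    """Greedy longest-match segmentation against MORPHEME_DICT.

    This stands in for analysis driven by continuation information.
    """
    max_len = max(len(surface) for surface in MORPHEME_DICT)
    morphemes = []
    i = 0
    while i < len(sentence):
        # Try the longest possible dictionary entry first.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            surface = sentence[i:i + n]
            if surface in MORPHEME_DICT:
                morphemes.append((surface, MORPHEME_DICT[surface]))
                i += n
                break
        else:
            # No entry matched: emit one character as an unknown morpheme.
            morphemes.append((sentence[i], "unknown"))
            i += 1
    return morphemes


def keyword_candidates(morphemes):
    """Assumed rule: every noun is a keyword candidate."""
    return [surface for surface, pos in morphemes if pos == "noun"]


if __name__ == "__main__":
    for sentence in split_sentences("計算機が言語を処理する。言語を処理する。"):
        print(keyword_candidates(split_morphemes(sentence)))
```

Longest-match is chosen only because it fits in a few lines; it ignores which morphemes may legally follow one another, which is what the analytical dictionary in the abstract is for.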

3 Claims

1. A keyword extraction apparatus for extracting keywords from Japanese text data, comprising:
- sentence segmentation means for segmenting the Japanese text data into sentence-by-sentence data;
  
  analytical information storage means for storing information regarding mutual continuation between morphemes;
  
  morpheme analysis means for dividing the sentence-by-sentence data segmented by the sentence segmentation means into morphemes and for analyzing the morphemes;
  
  morpheme information storage means for storing morpheme information on a morpheme-by-morpheme basis, the morpheme information including part of speech information, semantic classification information, sentence pattern information, and noted term information;
  
  morpheme information development means for developing morpheme information with respect to each morpheme analyzed by the morpheme analysis means, on a basis of the morpheme information stored in the morpheme information storage means;
  
  keyword candidate extraction means for extracting keyword candidates from the sentence-by-sentence data, on a basis of the morpheme information developed by the morpheme information development means;
  
  noted term information storage means for storing information regarding case classes of keyword candidates, among all of the keyword candidates, that immediately precede noted terms;
  
  case class conversion information storage means for storing relational information between case types and the case classes;
  
  case information acquisition means for acquiring case classes of the keyword candidates on a basis of the information stored in the noted term information storage means, and for acquiring case types corresponding to the acquired case classes on a basis of the relational information stored in the case class conversion information storage means;
  
  frequency information acquisition means for acquiring an appearance frequency of each keyword candidate by classifying each keyword candidate into the case types obtained from the case information acquisition means, and for acquiring a number of all morphemes in the Japanese text data, the number of all morphemes being indicative of a length of the Japanese text data;
  
  importance calculating means for calculating a frequency score on a basis of the appearance frequency of each keyword candidate and the number of all morphemes in the Japanese text data, for calculating a class-by-class appearance frequency of each keyword candidate in the Japanese text data, and for calculating an overall importance of each keyword candidate on a basis of the corresponding frequency score and the class-by-class appearance frequency; and
  
  keyword finalizing means for determining keywords from the keyword candidates, wherein the keywords have a corresponding overall importance obtained from the importance calculating means which exceeds a predetermined value.

2. A keyword extraction apparatus, as claimed in claim 1, wherein the appearance frequency of each keyword candidate based on the case type, as determined by the frequency information acquisition means, indicates a semantic role of each keyword candidate in the sentence-by-sentence data.
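
The two-table lookup in claims 1 and 2 (noted term to case class, then case class to case type, then a frequency per case type) can be sketched as follows. The table contents, the class and type labels, and the function name are all hypothetical; the claims do not list what the tables contain. The sketch assumes that noted terms are case particles and that a candidate's case class is read from the noted term that immediately follows it.

```python
from collections import Counter

# Hypothetical noted term table: noted term -> case class of the
# keyword candidate that immediately precedes it.
NOTED_TERM_TABLE = {"が": "ga-class", "を": "wo-class", "に": "ni-class"}

# Hypothetical case class conversion table: case class -> case type.
CASE_CLASS_CONVERSION = {
    "ga-class": "agent",
    "wo-class": "object",
    "ni-class": "goal",
}


def case_type_frequencies(sentences, candidates):
    """Count appearances of each candidate per case type.

    `sentences` is a list of morpheme lists (surface forms only);
    `candidates` is the set of keyword candidates.
    """
    counts = Counter()
    for morphemes in sentences:
        # Look at each morpheme together with the one that follows it.
        for current, following in zip(morphemes, morphemes[1:]):
            if current in candidates and following in NOTED_TERM_TABLE:
                case_class = NOTED_TERM_TABLE[following]
                case_type = CASE_CLASS_CONVERSION[case_class]
                counts[(current, case_type)] += 1
    return counts


if __name__ == "__main__":
    sentences = [
        ["計算機", "が", "言語", "を", "処理", "する"],
        ["言語", "を", "処理", "する"],
    ]
    print(case_type_frequencies(sentences, {"計算機", "言語", "処理"}))
```

Under these assumptions the per-case-type count is what carries the semantic role mentioned in claim 2: a candidate counted mostly as "object" plays a different role in the text from one counted mostly as "agent".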

3. A keyword extraction method for extracting keywords from Japanese text data, comprising the steps of:
- a) segmenting the Japanese text data into sentence-by-sentence data;
  
  b) storing information regarding mutual continuation between morphemes;
  
  c) partitioning the segmented sentence-by-sentence data into morphemes and analyzing the morphemes;
  
  d) storing morpheme information on a morpheme-by-morpheme basis, the morpheme information including part of speech information, semantic classification information, sentence pattern information, and noted term information;
  
  e) developing morpheme information with respect to each analyzed morpheme on a basis of the stored morpheme information;
  
  f) extracting keyword candidates from segmented sentence-by-sentence data on a basis of the stored morpheme information;
  
  g) storing information regarding case classes of keyword candidates, among all of the keyword candidates, that immediately precede noted terms;
  
  h) storing relational information between case types and the case classes;
  
  i) acquiring case classes of the keyword candidates on a basis of the information regarding case classes;
  
  j) acquiring case types corresponding to the acquired case classes on a basis of the relational information;
  
  k) acquiring an appearance frequency of each keyword candidate by classifying each keyword candidate into the acquired case types;
  
  l) acquiring a number of all morphemes in the Japanese text data, the number of all morphemes being indicative of a length of the Japanese text data;
  
  m) calculating a frequency score on a basis of the appearance frequency of each keyword candidate and the number of all morphemes in the Japanese text data;
  
  n) calculating a class-by-class appearance frequency of each keyword candidate in the Japanese text data;
  
  o) calculating an overall importance of each keyword candidate on a basis of the corresponding frequency score and the class-by-class appearance frequency; and
  
  p) determining keywords from all of the keyword candidates, wherein the keywords have a corresponding overall importance which exceeds a threshold value.
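
Steps m) through p) can be sketched as a small scoring routine. The claims state which quantities feed the overall importance but give no formulas, so the arithmetic below is an assumption: the frequency score is taken as the candidate's count divided by the number of all morphemes, the class-by-class frequencies are combined with hypothetical weights `CASE_TYPE_WEIGHTS`, and the two parts are simply added.

```python
# Hypothetical weights per case type; not taken from the source.
CASE_TYPE_WEIGHTS = {"agent": 2.0, "object": 1.0, "goal": 0.5}


def select_keywords(total_counts, case_counts, n_morphemes, threshold):
    """Score each candidate and keep those above the threshold.

    total_counts: candidate -> appearance frequency in the whole text
    case_counts:  (candidate, case type) -> class-by-class frequency
    n_morphemes:  number of all morphemes, used as the text length
    """
    scores = {}
    for word, count in total_counts.items():
        # Step m): normalize the raw frequency by text length (assumed form).
        frequency_score = count / n_morphemes
        # Step n): weighted sum of the class-by-class frequencies.
        class_score = sum(
            CASE_TYPE_WEIGHTS.get(case_type, 0.0) * c
            for (w, case_type), c in case_counts.items()
            if w == word
        ) / n_morphemes
        # Step o): combine both parts (simple addition is an assumption).
        scores[word] = frequency_score + class_score
    # Step p): keep candidates whose importance exceeds the threshold.
    keywords = [w for w, s in scores.items() if s > threshold]
    return keywords, scores


if __name__ == "__main__":
    totals = {"計算機": 1, "言語": 2, "処理": 2}
    by_case = {("計算機", "agent"): 1, ("言語", "object"): 2}
    keywords, scores = select_keywords(totals, by_case, 10, 0.25)
    print(keywords)
    print(scores)
```

Dividing by the number of all morphemes keeps scores comparable between short and long texts, which matches the claim's remark that this number indicates the length of the text; the additive combination and the weights would need tuning against real data.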

Current Assignee: NEC Corporation
Original Assignee: NEC Corporation
Inventors: Emori, Kiyoshi; Ohtsuki, Noriko
Primary Examiner(s): Hayes, Gail O.
Assistant Examiner(s): Kyle, Charles
Application Number: US08/219,530
Time in Patent Office: 1,106 Days
Field of Search: 364/419.02, 364/419.04, 364/419.07
US Class Current: 704/7
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/3335   Syntactic pre-processing, e...
G06F 40/247   Thesauruses; Synonyms