System and method for analyzing language using supervised machine learning method

US 7,542,894 B2
Filed: 07/08/2002
Issued: 06/02/2009
Est. Priority Date: 10/09/2001
Status: Expired due to Fees

Abstract

A system for analyzing language using a supervised learning method. The system extracts portions that match the structures of problem expressions from a raw corpus that carries no analysis information, converts the extracted portions into supervised data consisting of problems and solutions, and stores that data in data storage. The system then extracts sets of solutions and features from the stored supervised data, performs machine learning on those sets, and stores in a learning results database the learned results as to which solution is most likely for which features. Finally, the system extracts sets of features from input object data and, based on the learning results database, extrapolates the analysis information that is most suitable for those features.
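
The first step described in the abstract, turning a raw corpus into problem/solution pairs, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function name `make_supervised_data`, the marker `<CASE>` and the particle list are hypothetical, and plain substring matching stands in for the "predetermined language analysis" named in the claims.

```python
from typing import List, Tuple


def make_supervised_data(
    sentences: List[str],
    problem_expression: str,
    expressions: List[str],
) -> List[Tuple[str, str]]:
    """Build (problem, solution) pairs from unannotated sentences.

    Each occurrence of a known expression is replaced by the problem
    expression; the replaced text becomes the solution.
    """
    pairs = []
    for sentence in sentences:
        for expr in expressions:
            start = sentence.find(expr)
            while start != -1:
                problem = (
                    sentence[:start]
                    + problem_expression
                    + sentence[start + len(expr):]
                )
                pairs.append((problem, expr))
                start = sentence.find(expr, start + 1)
    return pairs


# Assumed toy corpus: the "problem" is to recover a masked case particle.
corpus = ["りんごを食べる", "本を読む", "学校に行く"]
pairs = make_supervised_data(corpus, "<CASE>", ["を", "に"])
for problem, solution in pairs:
    print(problem, "->", solution)
```

Because the solutions are borrowed from the text itself, no manually annotated corpus is needed for this step.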

7 Claims

1. A system for analyzing Japanese language using supervised learning method, the system comprising:
- sentence data storage means for storing sentence data which do not include solutions for a target problem;
  
  problem expression storage means for storing problem expression data comprising a problem expression which indicates an object of a language analysis and information of expressions corresponding to said problem expression;
  
  problem expression extraction processing means for extracting a portion which corresponds to any one of the expressions corresponding to the problem expression from said sentence data by using a predetermined language analysis and replacing the extracted portion of the sentence data with the problem expression;
  
  supervised data creation processing means for creating a plurality of supervised data, which is formed as a pair of a problem and either a solution or a solution candidate, wherein the pair comprises the sentence data in which the portion is replaced with the problem expression as the problem and either the portion extracted from said sentence data by the problem expression extracting processing means as the solution or the portion extracted from other sentence data except said sentence data, which are stored in said sentence data storage means as the solution candidate;
  
  supervised data features obtaining processing means for obtaining a plurality of predetermined syntactic supervised data features, which include one or more of a part of speech, root form, lexical category, dependency structure and modification structure from each sentence of the supervised data using syntactic analysis and then generating solution/features pairs of each sentence of the supervised data, wherein the solution/features pairs are a positive example having the plurality of supervised data features and the solution and negative examples having the plurality of supervised data features and each one of the solution candidates;
  
  machine learning processing means for performing machine learning processing on the solution/features pairs using a kernel function executed as a support vector machine, by classifying the solution based upon generating a hyperplane which maximizes an interval of the positive and negative examples and divides these two examples by the hyperplane on a space having dimensions determined by the plurality of obtained features and storing the hyperplane as the result of the machine learning processing in the learning result storing database;
  
  object sentence data obtaining processing means for inputting object sentence data and obtaining a plurality of syntactic object sentence features, which include one or more of a part of speech, root form, lexical category, dependency structure and modification structure from the input object sentence data using the syntactic analysis; and
  
  solution extrapolation processing means for using the stored hyperplane to determine which divided part of the space does the plurality of the syntactic object sentence features belong to, and estimates a determined part with highest probability as the solution as classified for the plurality of syntactic object sentence features.
  - 2. The system according to claim 1, wherein the machine learning processing means processes machine learning on the solution/features pairs according to importance of each feature in relation to the dependency relation among features obtained using the analysis.
  - 3. The system according to claim 1, further comprising:
    - solution data storage means for storing supervised data as a pair of a sentence data and a solution corresponding to a target problem, wherein the machine learning processing means performs machine learning using borrowing-type supervised data which is created by said supervised data creation processing means, and non-borrowing-type supervised data which is stored in said solution data storage means.
  - 4. The system according to claim 1, wherein
    - the machine learning processing means generates rules which is constituted from the solution/features pair, arranges the rules in a rule list according to a predetermined order, and stores the rule list as the result of the machine learning processing in the learning result storing database; and
    - the solution extrapolating processing means searches, in said rule list, for a rule from the stored rule list, matches the solution/features pair with the plurality of object sentence features from input object sentence data, and estimates the solution of the rule as the solution classified according to the plurality of object sentence features from input object sentence data.
  - 5. The system according to claim 1, wherein
    - the machine learning processing means specifies a classification which can serve as a solution of supervised data, calculates a probability distribution made of two terms of the classifications, each term being a solution/features pair, when said plurality of features fulfills a predetermined condition and maximizes a value of a predetermined formula representing an entropy, and stores said probability distribution as the result of the machine learning processing in the learning result storing database; and
    - the solution extraction processing means specifies the classification according to which the solution has the largest value of the formula based on the result and extrapolates the specified classification as the most suitable solution for the plurality of object sentence features.
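
The rule-list variant of claim 4 can be illustrated with a small decision-list learner. This is a hedged sketch only: the claim does not specify the "predetermined order", so ordering rules by relative frequency and then by count is an assumption, and the names `learn_rule_list` and `apply_rule_list` and the feature strings are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Optional, Tuple

Rule = Tuple[str, str]  # (feature, solution)


def learn_rule_list(pairs: Iterable[Tuple[str, List[str]]]) -> List[Rule]:
    """Turn solution/features pairs into an ordered list of rules."""
    counts = defaultdict(Counter)
    for solution, features in pairs:
        for feature in features:
            counts[feature][solution] += 1

    scored = []
    for feature, by_solution in counts.items():
        solution, hits = by_solution.most_common(1)[0]
        total = sum(by_solution.values())
        scored.append((hits / total, hits, feature, solution))

    # Assumed ordering: most reliable rule first, ties broken by count,
    # then by feature name so the result is deterministic.
    scored.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [(feature, solution) for _, _, feature, solution in scored]


def apply_rule_list(
    rules: List[Rule], features: Iterable[str], default: Optional[str] = None
) -> Optional[str]:
    """Return the solution of the first rule whose feature is present."""
    present = set(features)
    for feature, solution in rules:
        if feature in present:
            return solution
    return default


training = [
    ("を", ["verb=食べる", "prev_pos=noun"]),
    ("を", ["verb=読む", "prev_pos=noun"]),
    ("に", ["verb=行く", "prev_pos=noun"]),
]
rules = learn_rule_list(training)
print(apply_rule_list(rules, ["verb=行く", "prev_pos=noun"]))
```

An object sentence whose verb was never seen falls through to the more general `prev_pos=noun` rule, which is the usual reason for keeping low-precision rules at the end of the list.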

6. A Japanese language ellipsis analysis processing method for carrying out ellipsis analysis including transformation by paraphrasing using machine learning method, the method comprising:
- storing sentence data, which do not include solutions for a target problem, in a sentence data storage;
  
  storing problem expression data, each data comprising a problem expression that is the object of language analysis and information of expressions corresponding to that problem expression, in a problem expression storage;
  
  extracting a portion of each sentence data that matches any of the expressions corresponding to the problem expression using a predetermined language analysis method and replacing the extracted portion of the sentence data with the problem expression;
  
  creating supervised data as a pair of a problem and either a solution or a solution candidate for each sentence data, the problem being the sentence data in which the extracted portion has been replaced with the problem expression, the solution being the extracted portion of the sentence data, and the solution candidate being extracted from other sentence data;
  
  obtaining a plurality of predetermined syntactic supervised data features, which include one or more of a part of speech, root form, lexical category, dependency structure and modification structure, from each sentence of the supervised data using syntactic analysis and then generating solution/features pairs, for each sentence of the supervised data, wherein the solution/features pairs are a positive example having the plurality of supervised data features and the solution and negative examples having the plurality of supervised data features and each one of the solution candidates;
  
  performing machine learning on the solution/features pairs using a kernel function executed as a support vector machine, by classifying the solution based upon generating a hyperplane which maximizes an interval of the positive and negative examples and divides these two examples by the hyperplane on a space having dimensions determined by the plurality of obtained features and storing the hyperplane as a result of the machine learning in a learning result database;
  
  inputting object sentence data and obtaining a plurality of syntactic object sentence features, which include one or more of a part of speech, root form, lexical category, dependency structure and modification structure from the input object sentence data using syntactic analysis; and
  
  using the stored hyperplane to determine which divided part of the space does the plurality of the syntactic object sentence features belong to, and estimates a determined part with highest probability as the solution as classified for the plurality of syntactic object sentence features.
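
The pair-generation step of the method above, one positive example per true solution and negative examples for the solution candidates taken from other sentences, might look like the sketch below. It rests on loud assumptions: `context_features` is a hypothetical stand-in that looks only at the characters next to the marker, whereas the claims call for syntactic features such as part of speech or dependency structure, and the marker `<CASE>` is invented for illustration.

```python
from typing import List, Tuple

Example = Tuple[List[str], str, int]  # (features, candidate, label)


def context_features(problem: str, marker: str = "<CASE>") -> List[str]:
    """Toy features: the characters adjacent to the problem expression."""
    left, _, right = problem.partition(marker)
    return ["left=" + left[-1:], "right=" + right[:1]]


def make_examples(supervised: List[Tuple[str, str]]) -> List[Example]:
    """Label the true solution +1 and every other candidate -1."""
    # Candidates are all solutions observed anywhere in the data.
    candidates = sorted({solution for _, solution in supervised})
    examples = []
    for problem, solution in supervised:
        features = context_features(problem)
        for candidate in candidates:
            label = 1 if candidate == solution else -1
            examples.append((features, candidate, label))
    return examples


supervised = [("本<CASE>読む", "を"), ("学校<CASE>行く", "に")]
examples = make_examples(supervised)
for features, candidate, label in examples:
    print(features, candidate, label)
```

Each problem therefore contributes exactly one positive example and one negative example per competing candidate.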

7. An apparatus analyzing Japanese language using supervised learning method, the system comprising:
- sentence data storage storing sentence data which do not include solutions for a target problem;
  
  problem expression storage storing problem expression data comprising a problem expression which indicates an object of a language analysis and information of expressions corresponding to said problem expression; and
  
  a controller,
  
  extracting a portion which corresponds to any one of the expressions corresponding to the problem expression from the sentence data by using a predetermined language analysis and replacing the extracted portion of the sentence data with the problem expression,
  
  creating a plurality of supervised data which is formed as a pair of a problem and either a solution or a solution candidate, wherein the pair comprises the sentence data in which the portion is replaced with the problem expression as the problem and the portion extracted from said sentence data by the problem expression extracting processing means as the solution or the portion extracted from other sentence data as the solution candidate,
  
  obtaining a plurality of predetermined syntactic supervised data features, which include one or more of a part of speech, root form, lexical category, dependency structure and modification structure from each supervised data using syntactic analysis and then generating solution/features pairs for each sentence of the supervised data, wherein the solution/features pairs are a positive example having the plurality of supervised data features and the solution and negative examples having the plurality of supervised data features and each one of the solution candidates,
  
  performing machine learning processing on the solution/features pairs using a kernel function executed as a support vector machine, classifying the solution based upon generating a hyperplane which maximizes an interval of the positive and negative examples and divides these two examples by the hyperplane on a space having dimensions determined by the plurality of obtained features and storing the hyperplane in a learning result database,
  
  inputting object sentence data and obtaining a plurality of syntactic object sentence features, which include one or more of a part of speech, root form, lexical category, dependency structure and modification structure from the input object sentence data using the syntactic analysis, and
  
  using the stored hyperplane in determining which divided part of the space does the plurality of the syntactic object sentence features belong to, and estimating a determined part with highest probability as the solution as classified for the plurality of syntactic object sentence features.
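
The learning and extrapolation steps recited in the claims above (fit a separating hyperplane to positive and negative examples, store it, then pick the candidate on the positive side with the highest score) can be sketched with a tiny linear classifier. Several things here are assumptions rather than the patent's method: a linear kernel and a simple hinge-loss subgradient trainer are swapped in for a full support vector machine solver, and `featurize`, the hyperparameters and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def featurize(verb: str, candidate: str) -> Vector:
    # Hypothetical features; a real system would use syntactic analysis.
    return {f"verb={verb}&cand={candidate}": 1.0, f"cand={candidate}": 1.0}


def score(w: Vector, b: float, x: Vector) -> float:
    """Signed distance proxy: positive means the 'solution' side."""
    return sum(w.get(k, 0.0) * v for k, v in x.items()) + b


def train(
    examples: List[Tuple[Vector, int]],
    epochs: int = 200,
    lr: float = 0.1,
    lam: float = 0.01,
) -> Tuple[Vector, float]:
    """Hinge-loss subgradient descent with L2 shrinkage (linear kernel)."""
    w: Vector = {}
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            violated = y * score(w, b, x) < 1.0
            for k in w:  # shrink weights to favor a wide margin
                w[k] *= 1.0 - lr * lam
            if violated:  # push the hyperplane away from this example
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + lr * y * v
                b += lr * y
    return w, b


def best_candidate(w: Vector, b: float, verb: str, candidates: List[str]) -> str:
    return max(candidates, key=lambda c: score(w, b, featurize(verb, c)))


gold = [("食べる", "を"), ("読む", "を"), ("行く", "に")]
candidates = ["を", "に"]
examples = [
    (featurize(verb, c), 1 if c == solution else -1)
    for verb, solution in gold
    for c in candidates
]
w, b = train(examples)  # (w, b) is the stored "learning result"
print(best_candidate(w, b, "行く", candidates))
```

Only the pair `(w, b)` needs to be stored after training, which mirrors the claims' learning result database holding the hyperplane.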

Current Assignee: National Institute of Information and Communications Technology
Original Assignee: National Institute of Information and Communications Technology
Inventors: Murata, Masaki
Primary Examiner(s): Edouard; Patrick N
Assistant Examiner(s): YEN, ERIC L
Application Number: US10/189,580
Publication Number: US 20030083859A1
Time in Patent Office: 2,521 Days
Field of Search: 704/1-10, 706/12, 706/47, 706/20, 707/6
US Class Current: 704/9
CPC Class Codes: G06F 40/20 Natural language analysis s...
