Structured document type determination system and structured document type determination method
Abstract
A structured document type determination system is provided with a feature value extraction unit and a determination rule creating unit. The feature value extraction unit extracts, from each of a plurality of structured documents, a value of each of a plurality of features included in a feature list prepared in advance. The determination rule creating unit creates a determination rule from the extracted feature values by using a data mining tool. The system evaluates the determination rule by comparing the results of determining the types of structured documents according to the determination rule with teacher data, and repeatedly delivers a tuning parameter to the data mining tool so as to create a plurality of determination rules and to derive an optimum determination rule.
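The feature value extraction described in the abstract can be illustrated with a minimal sketch. It assumes the structured documents are HTML and that the feature list contains four simple counts; the feature names, `FEATURE_LIST`, and `extract_feature_values` are hypothetical, since the abstract does not enumerate the features.

```python
from html.parser import HTMLParser

# Hypothetical feature list prepared in advance (not taken from the patent).
FEATURE_LIST = ["num_links", "num_images", "num_forms", "text_length"]


class _FeatureCounter(HTMLParser):
    """Counts a few tag types and the amount of visible text in one document."""

    def __init__(self):
        super().__init__()
        self.counts = {name: 0 for name in FEATURE_LIST}

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.counts["num_links"] += 1
        elif tag == "img":
            self.counts["num_images"] += 1
        elif tag == "form":
            self.counts["num_forms"] += 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so layout does not inflate the count.
        self.counts["text_length"] += len(data.strip())


def extract_feature_values(document: str) -> list[int]:
    """Return one feature value per entry in FEATURE_LIST, in list order."""
    parser = _FeatureCounter()
    parser.feed(document)
    parser.close()
    return [parser.counts[name] for name in FEATURE_LIST]
```

Each document then becomes a fixed-length vector of feature values that can be paired with its teacher data and passed to whatever rule-creation tool is used.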
22 Claims
1. A structured document type determination system comprising:
a structured document database for storing a plurality of structured documents collected by way of a network;
a teacher data input means for inputting, as teacher data, a type of each of the plurality of structured documents stored in said structured document database;
a determination rule creating means for creating a determination rule used for determining a type of each of the plurality of structured documents based on a plurality of structured documents stored in said structured document database and the teacher data; and
a determination rule applying means for determining the type of a structured document that exists on said network according to the determination rule created by said determination rule creating means.
Dependent claims: 2 to 21.
22. A structured document type determination method comprising the steps of:
sampling a plurality of arbitrary structured documents from a structured document database for storing structured documents so as to create a sampled structured document database;
providing a list of features each of which is a measure to classify a plurality of structured documents into a plurality of predetermined types and each of which is to be extracted from each of the plurality of structured documents;
by extracting a value of each of the plurality of features (referred to as a feature value from here on) from each of the plurality of structured documents stored in said sampled structured document database according to the list of features and by inputting teacher data which is a result of determining which one of the plurality of types each of the plurality of structured documents stored in said sampled structured document database is classified into, creating a feature value and teacher data database including the input teacher data and extracted feature values for each of the plurality of structured documents stored in said sampled structured document database;
by dividing said feature value and teacher data database into two portions, creating both a made-for-machine-learning feature value and teacher data database and a made-for-verification feature value and teacher data database;
creating a determination rule used for determining which one of the plurality of types a structured document is classified into based on said made-for-machine-learning feature value and teacher data database by using a data mining tool;
determining which one of the plurality of types each of a plurality of structured documents whose feature values and teacher data are stored in said made-for-verification feature value and teacher data database is classified into according to the determination rule so as to produce determination results;
making an evaluation of the determination rule by comparing the determination results with the teacher data stored in said made-for-verification feature value and teacher data database; and
selecting a tuning pattern from a list of tuning patterns used for tuning of the creation of the determination rule one by one so as to deliver the selected tuning pattern to said determination rule creating step, and repeating a series of processes, such as causing said determination rule creating step to create a determination rule again according to the selected tuning pattern, causing said determining step to make a determination of the type of each of the plurality of structured documents stored in said made-for-verification feature value and teacher data database again according to the created determination rule and causing said determination rule evaluation step to make an evaluation of the created determination rule, until the determination rule creation and the evaluation are completed for all the tuning patterns in said tuning pattern list, so as to derive an optimum determination rule from among a plurality of determination rules acquired during the above processes.
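The steps of claim 22 (dividing the database, creating a rule, evaluating it against teacher data, and repeating for every tuning pattern) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the data mining tool is replaced by a simple nearest-centroid rule, a tuning pattern is assumed to be a tuple of feature indices to use, and the database is divided by alternating records. All function names are hypothetical.

```python
from collections import defaultdict


def split_database(records):
    """Divide (feature_values, teacher_type) records into a learning portion
    and a verification portion by alternating records (an assumed scheme)."""
    return records[0::2], records[1::2]


def create_rule(learning_records, tuning_pattern):
    """Stand-in for the data mining tool: a nearest-centroid rule that uses
    only the feature indices named by the tuning pattern."""
    sums = defaultdict(lambda: [0.0] * len(tuning_pattern))
    counts = defaultdict(int)
    for features, doc_type in learning_records:
        counts[doc_type] += 1
        for j, idx in enumerate(tuning_pattern):
            sums[doc_type][j] += features[idx]
    centroids = {t: [s / counts[t] for s in sums[t]] for t in counts}

    def rule(features):
        def squared_distance(doc_type):
            return sum((features[idx] - c) ** 2
                       for idx, c in zip(tuning_pattern, centroids[doc_type]))
        # Sorting makes ties resolve to the alphabetically first type.
        return min(sorted(centroids), key=squared_distance)

    return rule


def evaluate_rule(rule, verification_records):
    """Fraction of verification documents whose determined type matches the teacher data."""
    if not verification_records:
        return 0.0
    correct = sum(1 for features, doc_type in verification_records
                  if rule(features) == doc_type)
    return correct / len(verification_records)


def derive_optimum_rule(records, tuning_patterns):
    """Create and evaluate one rule per tuning pattern; keep the best-scoring one."""
    learning, verification = split_database(records)
    best = None
    for pattern in tuning_patterns:
        rule = create_rule(learning, pattern)
        score = evaluate_rule(rule, verification)
        if best is None or score > best[0]:
            best = (score, pattern, rule)
    return best  # (score, tuning_pattern, rule), or None if no patterns were given
```

Evaluating on the verification portion rather than the learning portion is what lets the loop compare tuning patterns fairly: a rule is scored only on documents it was not created from.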
Specification