Apparatus, method, and program for text classification using frozen pattern
Abstract
A document is classified by document style through textual analysis, without depending on morphological analysis. A style-specific frozen pattern is prepared as a reference dictionary for each document style. A frozen pattern list is extracted for an input document on the basis of the state of appearance of the style-specific frozen patterns present in the document. Confidence for each document style is calculated from the frozen pattern list, and the style of the input document is decided on the basis of that confidence.
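The flow summarized in the abstract (dictionary of frozen patterns, pattern list extraction, confidence, decision) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the example patterns, the names `STYLE_PATTERNS`, `extract_pattern_list`, `confidences`, and `decide_style`, and the scoring rule (fraction of a style's patterns found in the text) are all assumptions made for the sketch.

```python
# Hypothetical style dictionaries: each style maps to frozen patterns
# (fixed character strings). The patterns are invented for illustration.
STYLE_PATTERNS = {
    "letter": ["Dear ", "Sincerely,", "Best regards"],
    "report": ["In conclusion", "As shown in", "Table "],
}


def extract_pattern_list(text, patterns):
    """Collate the text against the patterns by plain substring search.

    Returns (pattern, occurrence count) pairs for patterns that occur,
    so no morphological analysis is involved.
    """
    found = []
    for pattern in patterns:
        count = text.count(pattern)
        if count:
            found.append((pattern, count))
    return found


def confidences(text, style_patterns):
    """Assumed scoring rule: share of a style's patterns seen in the text."""
    return {
        style: len(extract_pattern_list(text, patterns)) / len(patterns)
        for style, patterns in style_patterns.items()
    }


def decide_style(text, style_patterns):
    """Return the highest-scoring style together with all scores."""
    scores = confidences(text, style_patterns)
    return max(scores, key=scores.get), scores
```

For example, `decide_style("Dear Ann,\nThank you.\nSincerely,\nBob", STYLE_PATTERNS)` selects `"letter"`, because two of the three hypothetical letter patterns occur and none of the report patterns do.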
12 Claims
-
1. A document classification apparatus for classifying an input document in accordance with a document style, comprising a processor arrangement for:
-
(a) generating a style-specific frozen pattern for characterizing the document style;
(b) extracting a frozen pattern list from the input document by collating the input document with the style-specific frozen pattern;
(c) calculating confidence of the document style of the input document on the basis of the frozen pattern list; and
(d) deciding the document style to which the input document belongs on the basis of the calculated confidence.
Dependent claims: 2, 3, 4, 5, 6
-
-
7. A style-specific frozen pattern generating apparatus for generating a style-specific frozen pattern characterizing a document style, comprising an arrangement for (a) generating the style-specific frozen pattern by using a set of documents that belong to known document styles and (b) targeting an arbitrary character string present in a document on the basis of entropy of an occurrence probability of character sets appearing in the front and the rear of the character string.
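As an illustration of the entropy measure named in claim 7, the following sketch computes the Shannon entropy of the characters that immediately precede and follow a candidate string in a set of documents. The function names, the boundary markers, and the use of single neighboring characters are assumptions; the claim itself does not say how the entropy values are thresholded or combined. A common heuristic, assumed here and not taken from the claim, is that high entropy on both sides suggests the string behaves as a self-contained unit.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def neighbor_entropy(corpus, candidate):
    """Return (front_entropy, rear_entropy) for a candidate string.

    front counts the character just before each occurrence, rear the
    character just after it. "^" and "$" are assumed markers for the
    start and end of a document; they would collide with real "^" or
    "$" characters, which this sketch ignores.
    """
    front, rear = Counter(), Counter()
    for doc in corpus:
        start = doc.find(candidate)
        while start != -1:
            end = start + len(candidate)
            front[doc[start - 1] if start > 0 else "^"] += 1
            rear[doc[end] if end < len(doc) else "$"] += 1
            start = doc.find(candidate, start + 1)
    return _entropy(front), _entropy(rear)
```

With the toy corpus `["aXb", "cXd"]` and candidate `"X"`, both neighbors vary evenly between two characters, so both entropies are 1 bit; with `["aXb", "aXb"]` the neighbors never vary and both entropies are 0.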
-
8. A document classification apparatus for classifying an input document having plural sentences in accordance with a document style, comprising a processor arrangement for:
-
(a) generating a style-specific frozen pattern corresponding to a document style;
(b) dividing the style-specific frozen pattern into plural groups;
(c) generating plural decision trees for document style from the style-specific frozen pattern divided into the plural groups by using a set of documents that belong to known document styles;
(d) extracting for the input document separate frozen pattern lists using the respective style-specific frozen pattern group;
(e) calculating confidence for each of the decision trees for document style corresponding to the input document on the basis of the respective frozen pattern list by using the plural decision trees for document style; and
(f) deciding document styles to which the input document belongs on the basis of the confidences.
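The grouping in steps (b) through (f) of claim 8 can be illustrated with a rough sketch. It rests on several assumptions: patterns are split round-robin, each group is scored by a simple stand-in (the share of its patterns found in the text) rather than by a trained decision tree, and per-group confidences are combined by averaging against a fixed threshold. All names and parameters are hypothetical.

```python
def split_into_groups(patterns, n_groups):
    """Round-robin split of a pattern list into n_groups groups (assumed rule)."""
    return [patterns[i::n_groups] for i in range(n_groups)]


def group_confidence(text, group):
    """Stand-in for one per-group decision tree.

    Returns the fraction of the group's patterns that occur in the text;
    an empty group contributes a confidence of 0.0.
    """
    if not group:
        return 0.0
    return sum(1 for pattern in group if pattern in text) / len(group)


def decide_styles(text, style_patterns, n_groups=2, threshold=0.5):
    """Average the per-group confidences for each style.

    Returns every style whose average reaches the threshold, plus the
    averages themselves, so a document may be assigned several styles.
    """
    averages = {}
    for style, patterns in style_patterns.items():
        groups = split_into_groups(patterns, n_groups)
        scores = [group_confidence(text, group) for group in groups]
        averages[style] = sum(scores) / len(scores)
    chosen = [style for style, conf in averages.items() if conf >= threshold]
    return chosen, averages
```

Scoring the groups separately means a style can still be selected when one group of patterns is absent from the document, as long as the other groups score well enough under the assumed averaging rule.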
-
-
9. A method of classifying an input document in accordance with a document style, comprising:
-
(a) generating a style-specific frozen pattern that characterizes the document style;
(b) extracting a frozen pattern list from the input document by collating the input document with the style-specific frozen pattern;
(c) calculating confidence of the document style of the input document on the basis of the frozen pattern list; and
(d) deciding the document style to which the input document belongs on the basis of the confidence.
Dependent claims: 11
-
-
10. A method of classifying an input document in accordance with a document style, comprising:
-
(a) generating a style-specific frozen pattern characterizing the document style;
(b) finding a decision tree for the document style by using a set of documents that belong to known document styles;
(c) extracting a frozen pattern list from the input document by collating the input document with the style-specific frozen pattern;
(d) calculating confidence of the document style of the input document on the basis of the frozen pattern list by using the decision tree for the document style; and
(e) deciding the document style to which the input document belongs on the basis of the calculated confidence.
Dependent claims: 12
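To make the decision-tree steps of claim 10 concrete, here is a minimal sketch that learns a one-level tree (a decision stump) from documents of known style and reports a confidence at classification time. The choice of Gini impurity, the single split, and the definition of confidence as the share of training documents at the reached leaf that carry the winning label are all assumptions for illustration; the claim does not specify how the tree is built. The sketch expects non-empty `docs` and `patterns`.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0.0 for an empty or pure list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def train_stump(docs, labels, patterns):
    """Pick the pattern whose presence/absence split is purest.

    The split quality is the size-weighted Gini impurity of the two
    leaves; label counts are stored at each leaf.
    """
    best = None
    for pattern in patterns:
        present = [lab for doc, lab in zip(docs, labels) if pattern in doc]
        absent = [lab for doc, lab in zip(docs, labels) if pattern not in doc]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(docs)
        if best is None or score < best[0]:
            best = (score, pattern, Counter(present), Counter(absent))
    _, pattern, present_counts, absent_counts = best
    return {"pattern": pattern, "present": present_counts, "absent": absent_counts}


def classify(stump, text):
    """Return (style, confidence) for a text, or (None, 0.0) at an empty leaf."""
    leaf = stump["present"] if stump["pattern"] in text else stump["absent"]
    if not leaf:
        return None, 0.0
    style, count = leaf.most_common(1)[0]
    return style, count / sum(leaf.values())
```

A usage sketch: train on two hypothetical letters containing `"Dear "` and two hypothetical reports that do not, then call `classify(stump, "Dear Bob, see you.")`; the stump splits on `"Dear "` and returns `"letter"` with the leaf's label share as the confidence.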
-
Specification