Document classification system and method for classifying a document according to contents of the document
Abstract
A document classification system and method reflect an operator's intention in the result of classifying a document, so that an accurate classification result can be achieved. The document to be classified has contents containing a plurality of items. At least one of the items contained in the document is designated. The document data is converted into converted data that contains only the data corresponding to the designated item. The document is classified by using the converted data.
24 Claims
1. A document classification system for classifying a document according to contents of the document, said document classification system comprising:
input means for inputting document data of the document;
analyzing means for analyzing the document data so as to obtain analysis information;
vector producing means for producing a document feature vector with respect to the document data based on the analysis information;
transforming function calculating means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected with a dimensional number different from a dimensional number of the document feature vector, the transforming function calculating means calculating the representation transforming function by using an inner product calculated between the document feature vectors;
vector transforming means for transforming the document feature vector by using the representation transforming function;
classification means for classifying the document based on similarity between the document feature vectors transformed by the vector transforming means; and
classification result storing means for storing a result of classification performed by the classification means.
2. The document classification system as claimed in claim 1, further comprising inner product calculating means for calculating an inner product between the document feature vectors, wherein said representation transforming function calculating means calculates the representation transforming function by using the inner product.
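Claims 1 and 2 can be illustrated with a small sketch. The claims do not name a particular algorithm, so the code below assumes one possible reading: term-count feature vectors, a matrix of pairwise inner products, and a projection onto the leading eigen-directions of that matrix (a latent-semantic-style reduction). Every function name, the whitespace tokenizer, and the use of power iteration are illustrative assumptions, not the patented method.

```python
import math
from collections import Counter

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def feature_vectors(docs):
    """Assumed analysis step: whitespace tokens counted over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vecs = []
    for d in docs:
        counts = Counter(d.lower().split())
        vecs.append([float(counts[w]) for w in vocab])
    return vocab, vecs

def gram_matrix(vecs):
    """Inner products between every pair of document feature vectors."""
    return [[dot(a, b) for b in vecs] for a in vecs]

def top_eigenpairs(G, k, iters=200):
    """Power iteration with deflation on a symmetric matrix (assumed solver)."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy
    pairs = []
    for j in range(k):
        v = [1.0 + 0.1 * ((i + j) % 3) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = [dot(row, v) for row in G]
            norm = math.sqrt(dot(w, w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in G])
        pairs.append((lam, v))
        for r in range(n):  # remove the component just found
            for c in range(n):
                G[r][c] -= lam * v[r] * v[c]
    return pairs

def transforming_function(vecs, k):
    """Return a k x m matrix mapping m-dimensional vectors to k dimensions."""
    n, m = len(vecs), len(vecs[0])
    W = []
    for lam, v in top_eigenpairs(gram_matrix(vecs), k):
        if lam <= 1e-12:
            continue
        scale = 1.0 / math.sqrt(lam)
        W.append([scale * sum(v[i] * vecs[i][t] for i in range(n)) for t in range(m)])
    return W

def transform(W, vec):
    return [dot(row, vec) for row in W]

docs = ["apple banana apple", "banana apple fruit",
        "engine car wheel", "car wheel road"]
vocab, vecs = feature_vectors(docs)
W = transforming_function(vecs, k=2)
projected = [transform(W, v) for v in vecs]
```

In this toy input the seven-dimensional count vectors are mapped to two dimensions, and documents that share vocabulary end up pointing in nearly the same direction, which is the property a similarity-based classifier would then rely on.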
3. A document classification system for classifying a document according to contents of the document, said document classification system comprising:
input means for inputting document data of the document;
analyzing means for analyzing the document data so as to obtain analysis information;
vector producing means for producing a document feature vector with respect to the document data based on the analysis information;
transforming function calculating means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
vector transforming means for transforming the document feature vector by using the representation transforming function;
classification means for classifying the document based on similarity between the document feature vectors transformed by the vector transforming means;
classification result storing means for storing a result of classification performed by the classification means;
inner product calculating means for calculating an inner product between the document feature vectors, wherein said representation transforming function calculating means calculates the representation transforming function by using the inner product; and
document similarity information setting means for setting document similarity setting information including data representing an author of the document and a date of production of the document, wherein said representation transforming function calculating means calculates the representation transforming function by using the inner product and the document similarity information.
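Claim 3 does not say how the author and date information is combined with the inner product. As one hedged possibility, the sketch below adds a same-author bonus and a date-closeness term to the plain inner product. The function name, the dictionary layout, the weights and the half-life decay are all assumptions made for illustration.

```python
from datetime import date

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adjusted_inner_product(doc_a, doc_b, author_weight=1.0,
                           date_weight=1.0, half_life_days=30.0):
    """Inner product plus assumed bonuses for shared author and close dates.

    Each document is a dict with hypothetical keys 'vector', 'author', 'date'.
    """
    base = dot(doc_a["vector"], doc_b["vector"])
    same_author = 1.0 if doc_a["author"] == doc_b["author"] else 0.0
    gap_days = abs((doc_a["date"] - doc_b["date"]).days)
    closeness = 0.5 ** (gap_days / half_life_days)  # 1.0 on the same day
    return base + author_weight * same_author + date_weight * closeness

a = {"vector": [1.0, 1.0], "author": "X", "date": date(2000, 1, 1)}
b = {"vector": [1.0, 1.0], "author": "X", "date": date(2000, 1, 1)}
c = {"vector": [1.0, 0.0], "author": "Y", "date": date(2000, 1, 31)}

same = adjusted_inner_product(a, b)   # 2 + 1 + 1
cross = adjusted_inner_product(a, c)  # 1 + 0 + 0.5
```

Both added terms are themselves valid kernels, so a matrix built from these adjusted values stays positive semidefinite and could be fed to an eigen-decomposition in place of the plain inner-product matrix.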
4. The document classification system as claimed in claim 1, further comprising:
vector storing means for storing the document feature vector produced by said vector producing means; and
transforming function storing means for storing the representation transforming function calculated by said representation transforming function calculating means.
5. A document classification system for classifying a document according to contents of the document, said document classification system comprising:
input means for inputting document data of the document;
analyzing means for analyzing the document data so as to obtain analysis information;
vector producing means for producing a document feature vector with respect to the document data based on the analysis information;
transforming function calculating means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
vector transforming means for transforming the document feature vector by using the representation transforming function;
classification means for classifying the document based on similarity between the document feature vectors transformed by the vector transforming means;
classification result storing means for storing a result of classification performed by the classification means; and
vector correcting means for correcting the document feature vector before the document feature vector is transformed by said vector transforming means, a correction being performed by processing one of the document feature vector and a feature dimension constituting the document feature vector in accordance with a rule established by characteristics of words extracted by said analyzing means.
6. The document classification system as claimed in claim 5, further comprising transforming function correcting means for correcting the representation transforming function calculated by said transforming function calculating means when the feature dimension is changed due to a correction of the document feature vector by said vector correcting means so that the document feature vector is transformed by said vector transforming means in accordance with the changed feature dimension.
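The interplay between claims 5 and 6 can be sketched as follows. The example assumes a very simple word-based rule (drop stop words and one-letter words) and a representation transforming function stored as a list of rows, one weight per feature dimension. The rule, the stop-word set and all names are hypothetical.

```python
STOP_WORDS = {"the", "a", "of"}  # assumed rule data, not from the claims

def dimensions_to_keep(vocab, min_length=2):
    """Indices of feature dimensions that survive the assumed word rule."""
    return [i for i, w in enumerate(vocab)
            if w not in STOP_WORDS and len(w) >= min_length]

def correct_vector(vec, keep):
    """Vector correction: remove the dimensions the rule rejected."""
    return [vec[i] for i in keep]

def correct_transform(W, keep):
    """Transforming function correction: drop the same columns from every
    row so the function still matches the corrected vector's dimensions."""
    return [[row[i] for i in keep] for row in W]

def transform(W, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in W]

vocab = ["a", "cat", "of", "dog"]
vec = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]

keep = dimensions_to_keep(vocab)
vec2 = correct_vector(vec, keep)
W2 = correct_transform(W, keep)
out = transform(W2, vec2)
```

Without the second correction step, the two-dimensional corrected vector would no longer line up with the four-column transforming function, which is the mismatch claim 6 addresses.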
7. A document classification system for classifying a document according to contents of the document, said document classification system comprising:
input means for inputting document data of the document;
analyzing means for analyzing the document data so as to obtain analysis information;
vector producing means for producing a document feature vector with respect to the document data based on the analysis information;
transforming function calculating means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
vector transforming means for transforming the document feature vector by using the representation transforming function;
classification means for classifying the document based on similarity between the document feature vectors transformed by the vector transforming means;
classification-result storing means for storing a result of classification performed by the classification means;
transforming function correction instructing means for sending an instruction regarding a process to be applied on a feature dimension of the representation transforming function; and
transforming function correcting means for correcting the representation transforming function based on a content of the instruction sent from said transforming function correction instructing means.
8. The document classification system as claimed in claim 7, wherein the process indicated in the content of the instruction is performed by using data of an arbitrary document vector.
9. The document classification system as claimed in claim 7, wherein the process indicated in the content of the instruction is performed by using the document feature vectors.
10. The document classification system as claimed in claim 7, wherein the process indicated in the content of the instruction is performed by using the analysis information obtained by said analyzing means.
11. The document classification system as claimed in claim 7, wherein the process indicated in the content of the instruction is performed by using the result of classification stored in said classification-result storing means.
12. A document classification system for classifying a document according to contents of the document, said document classification system comprising:
input means for inputting document data of the document;
analyzing means for analyzing the document data so as to obtain analysis information;
vector producing means for producing a document feature vector with respect to the document data based on the analysis information;
transforming function calculating means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
vector transforming means for transforming the document feature vector by using the representation transforming function;
classification means for classifying the document based on similarity between the document feature vectors transformed by the vector transforming means;
classification result storing means for storing a result of classification performed by the classification means;
an initial cluster centroid designating means for designating an initial cluster centroid; and
initial cluster centroid registering means for registering the initial cluster centroid designated by said initial cluster centroid designating means, wherein said classification means classifies the document in accordance with the initial cluster centroid registered by said initial cluster centroid registering means.
13. The document classification system as claimed in claim 12, wherein the initial cluster centroid designated by said initial cluster centroid designating means is arbitrary document vector data.
14. The document classification system as claimed in claim 12, wherein the initial cluster centroid designated by said initial cluster centroid designating means is the document feature vector.
15. The document classification system as claimed in claim 12, wherein the initial cluster centroid designated by said initial cluster centroid designating means is the analysis information obtained by said analyzing means.
16. The document classification system as claimed in claim 12, wherein the initial cluster centroid designated by said initial cluster centroid designating means is the result of classification stored by said classification-result storing means.
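Classification seeded by designated initial cluster centroids, as in claims 12 to 16, might look like the sketch below. It assumes a k-means-style loop with cosine similarity and takes the seeds from existing document feature vectors (the claim 14 case). The claims do not specify the clustering procedure, so the loop and every name here are assumptions.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na and nb else 0.0

def classify_with_seeds(vectors, initial_centroids, max_iters=20):
    """Assign each vector to its most similar centroid, then update the
    centroids, starting from operator-designated seeds instead of random ones."""
    centroids = [c[:] for c in initial_centroids]
    labels = None
    for _ in range(max_iters):
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable, stop early
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seeds = [vectors[0], vectors[2]]  # designated seeds taken from documents
labels, centroids = classify_with_seeds(vectors, seeds)
```

Because the seeds fix which cluster index corresponds to which topic, the operator's choice directly shapes the resulting grouping, rather than leaving it to a random initialization.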
17. A processor readable medium storing program code causing a computer to classify a document according to contents of the document, comprising:
first program code means for inputting document data of the document;
second program code means for analyzing the document data so as to obtain analysis information;
third program code means for producing a document feature vector with respect to the document data based on the analysis information;
fourth program code means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected with a dimensional number different from a dimensional number of the document feature vector, the fourth program code means calculating the representation transforming function by using an inner product calculated between the document feature vectors;
fifth program code means for transforming the document feature vector by using the representation transforming function;
sixth program code means for classifying the document based on similarity between the document feature vectors transformed by the fifth program code means; and
seventh program code means for storing a result of classification performed by the classification means.
18. The processor readable medium as claimed in claim 17, further comprising eighth program code means for calculating an inner product between the document feature vectors, wherein the representation transforming function is calculated by using the inner product.
19. A processor readable medium storing program code causing a computer to classify a document according to contents of the document, comprising:
first program code means for inputting document data of the document;
second program code means for analyzing the document data so as to obtain analysis information;
third program code means for producing a document feature vector with respect to the document data based on the analysis information;
fourth program code means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
fifth program code means for transforming the document feature vector by using the representation transforming function;
sixth program code means for classifying the document based on similarity between the document feature vectors transformed by the fifth program code means;
seventh program code means for storing a result of classification performed by the classification means;
eighth program code means for calculating an inner product between the document feature vectors, wherein the representation transforming function is calculated by using the inner product; and
ninth program code means for setting document similarity setting information including data representing an author of the document and a date of production of the document, wherein the representation transforming function is calculated by using the inner product and the document similarity information.
20. The processor readable medium as claimed in claim 17, further comprising:
tenth program code means for storing the document feature vector produced by the third program code means; and
eleventh program code means for storing the representation transforming function calculated by the fourth program code means.
21. A processor readable medium storing program code causing a computer to classify a document according to contents of the document, comprising:
first program code means for inputting document data of the document;
second program code means for analyzing the document data so as to obtain analysis information;
third program code means for producing a document feature vector with respect to the document data based on the analysis information;
fourth program code means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
fifth program code means for transforming the document feature vector by using the representation transforming function;
sixth program code means for classifying the document based on similarity between the document feature vectors transformed by the fifth program code means;
seventh program code means for storing a result of classification performed by the classification means; and
eighth program code means for correcting the document feature vector before the document feature vector is transformed by the fifth program code means, a correction being performed by processing one of the document feature vector and a feature dimension constituting the document feature vector in accordance with a rule established by characteristics of words extracted by the second program code means.
22. The processor readable medium as claimed in claim 21, further comprising ninth program code means for correcting the representation transforming function calculated by the fourth program code means when the feature dimension is changed due to a correction of the document feature vector by the eighth program code means so that the document feature vector is transformed by the fifth program code means in accordance with the changed feature dimension.
23. A processor readable medium storing program code causing a computer to classify a document according to contents of the document, comprising:
first program code means for inputting document data of the document;
second program code means for analyzing the document data so as to obtain analysis information;
third program code means for producing a document feature vector with respect to the document data based on the analysis information;
fourth program code means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
fifth program code means for transforming the document feature vector by using the representation transforming function;
sixth program code means for classifying the document based on similarity between the document feature vectors transformed by the fifth program code means;
seventh program code means for storing a result of classification performed by the classification means;
eighth program code means for sending an instruction regarding a process to be applied on a feature dimension of the representation transforming function; and
ninth program code means for correcting the representation transforming function based on a content of the instruction sent by the eighth program code means.
24. A processor readable medium storing program code causing a computer to classify a document according to contents of the document, comprising:
first program code means for inputting document data of the document;
second program code means for analyzing the document data so as to obtain analysis information;
third program code means for producing a document feature vector with respect to the document data based on the analysis information;
fourth program code means for calculating a representation transforming function used for projecting the document feature vector onto a space in which similarity between the document feature vectors is reflected;
fifth program code means for transforming the document feature vector by using the representation transforming function;
sixth program code means for classifying the document based on similarity between the document feature vectors transformed by the fifth program code means;
seventh program code means for storing a result of classification performed by the classification means;
eighth program code means for designating an initial cluster centroid; and
ninth program code means for registering the initial cluster centroid designated by the eighth program code means, wherein the document is classified in accordance with the initial cluster centroid registered by the ninth program code means.
Specification