System and method for clustering documents
First Claim
1. A clustering device having software and hardware for clustering documents, the clustering device comprising:
a document storage unit storing documents;
a document retrieving unit to retrieve at least two selected documents from a plurality of documents stored in the document storage unit;
a document feature writing unit to acquire text from at least one identical or similar specific field among a plurality of fields in each of the selected documents, and to create index files of the selected documents, and to determine a feature vector for each of the selected documents using the index files, wherein the feature vector includes occurrence frequencies of a plurality of keywords located in the specific field of the selected document;
a representative vector calculator to extract representative keywords having highest occurrence frequencies among the plurality of keywords in the index file of the selected document, to calculate percentages of occurrence frequencies of each of the representative keywords in each selected document, to choose a group of keywords from the representative keywords of the selected documents to be representative keywords used in clustering of the documents, the representative keywords corresponding to a group of keywords having greatest values obtained by adding the percentage of occurrence frequencies among same keywords in each of the selected documents, wherein the representative keywords are chosen to be components of the representative vector;
a cluster database storage unit to store representative keywords constituting components of the representative vector with the representative vector; and
a clustering unit comprising:
a similarity calculator to determine a similarity of the specific field of a stored document in the document storage unit through an inner product value between a feature vector of the stored document and stored representative vector in the cluster database storage unit;
receiving a stored document as a target document for clustering;
clustering the stored document using the specific field of the stored document through an inner product value between a feature vector of the stored document and stored representative vector in the cluster database storage unit; and
providing the user with a clustering result.
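The feature vector and representative keyword elements of the claim can be illustrated with a minimal sketch. It is not the patented implementation. The function names, the regular-expression tokenizer, the `per_doc` and `size` parameters, and the choice to compute each percentage against the total keyword count of the field are all assumptions made for illustration; the claim does not fix any of them.

```python
from collections import Counter
import re


def feature_vector(field_text):
    """Keyword occurrence frequencies for one document's specific field.

    Assumption: keywords are lowercase alphabetic tokens.
    """
    return Counter(re.findall(r"[a-z]+", field_text.lower()))


def representative_keywords(field_texts, per_doc=3, size=2):
    """Choose the keywords with the greatest summed percentages.

    For each document, take its `per_doc` most frequent keywords, convert
    their counts to percentages of that document's total keyword count
    (an assumed denominator), and add the percentages of the same keyword
    across documents. The `size` keywords with the largest sums are returned.
    """
    totals = Counter()
    for text in field_texts:
        freqs = feature_vector(text)
        doc_total = sum(freqs.values())
        if doc_total == 0:
            continue  # empty field: nothing to contribute
        for keyword, count in freqs.most_common(per_doc):
            totals[keyword] += 100.0 * count / doc_total
    return [keyword for keyword, _ in totals.most_common(size)]
```

For two hypothetical abstracts, "battery battery battery anode cell" and "battery anode anode cathode", the summed percentages are 85 for "battery" and 70 for "anode", so those two keywords would become the components of the representative vector.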
Abstract
Provided are a system and method of clustering documents. The system includes a document DB, a document feature writing unit, a document retrieving unit, a clustering unit, and a cluster DB. The document DB stores documents. The document feature writing unit extracts attribute information of the documents stored in the document DB and writes an index for each document on the basis of the attribute information. The document retrieving unit uses the indexes to retrieve documents that include a query input by a user. The clustering unit includes a representative vector calculator, which calculates feature vectors and a representative vector of the retrieved documents, and a similarity calculator, which calculates similarities between the documents using the feature vectors and the representative vector. The cluster DB stores the documents clustered by the clustering unit.
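The indexing and retrieval steps described in the abstract can be sketched as follows. The abstract does not say how the indexes are organized, so the inverted index used here, the tokenizer, and the names `build_index` and `retrieve` are assumptions chosen only to make the idea concrete.

```python
from collections import defaultdict
import re


def build_index(documents):
    """Map each keyword to the ids of the documents whose text contains it.

    `documents` is a dict of document id -> text (a hypothetical layout).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def retrieve(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Requiring every query term to be present is one possible reading of "documents including a query"; a system could equally rank partial matches.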
307 Citations
18 Claims
1. A clustering device having software and hardware for clustering documents (independent claim, recited in full under First Claim above). Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. A method for clustering documents, being employed in a clustering device that comprises a microprocessor, the method comprising:
retrieving at least two selected documents from a plurality of documents stored in a document storage unit;
acquiring, using the microprocessor, text from at least one identical or similar specific field among a plurality of fields in each of the selected documents, and to create index files of the selected documents, and to determine a feature vector for each of the selected documents using the index files, wherein the feature vector includes occurrence frequencies of a plurality of keywords located in the specific field of the selected document;
extracting, using the microprocessor, representative keywords having highest occurrence frequencies among the plurality of keywords in the index file of the selected document, to calculate percentages of occurrence frequencies of each of the representative keywords in each selected document, to choose a group of keywords from the representative keywords of the selected documents to be representative keywords used in clustering of the documents, the representative keywords corresponding to a group of keywords having greatest values obtained by adding the percentage of occurrence frequencies among same keywords in each of the selected documents, wherein the representative keywords are chosen to be components of the representative vector;
storing, in a cluster database storage unit, representative keywords constituting components of the representative vector with the representative vector;
providing, using the microprocessor, a user with a document list which includes a plurality of documents;
receiving, using the microprocessor, a document selection input which designates target documents for clustering;
clustering, using the microprocessor, using identical or similar specific field of a document through an inner product value between a feature vector of the target document and stored representative vector in the cluster database storage unit.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
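The final clustering step, which compares a target document with stored representative vectors through an inner product, might look like the sketch below. It rests on assumptions: vectors are plain dicts of keyword to weight, each representative keyword has weight 1, and the target is assigned to the cluster with the largest inner product. The claim itself only recites that the inner product value is used; the names `inner_product` and `assign_cluster` are hypothetical.

```python
def inner_product(feature_vector, representative_vector):
    """Sum of products over the keywords of the representative vector.

    Keywords missing from the feature vector contribute zero.
    """
    return sum(weight * feature_vector.get(keyword, 0)
               for keyword, weight in representative_vector.items())


def assign_cluster(feature_vector, clusters):
    """Return (cluster name, score) for the best-matching stored vector.

    Assumption: the best match is the representative vector with the
    largest inner product against the target's feature vector.
    """
    scores = {name: inner_product(feature_vector, vector)
              for name, vector in clusters.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With stored vectors `{"battery": 1, "anode": 1}` and `{"solar": 1, "panel": 1}`, a target whose field contains "battery" three times, "anode" once, and "panel" once scores 4 against the first and 1 against the second, so it would join the first cluster.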
Specification