System and method for clustering documents

US 8,046,363 B2
Filed: 01/10/2007
Issued: 10/25/2011
Est. Priority Date: 04/13/2006
Status: Active Grant

Abstract

Provided are a system and method for clustering documents. The system includes a document DB, a document feature writing unit, a document retrieving unit, a clustering unit, and a cluster DB. The document DB stores documents. The document feature writing unit extracts attribute information from the documents stored in the document DB and writes an index for each document on the basis of that attribute information. The document retrieving unit uses the indexes to retrieve documents that include a query input by a user. The clustering unit includes a representative vector calculator, which calculates feature vectors and a representative vector for the retrieved documents, and a similarity calculator, which calculates similarities between the documents using the feature vectors and the representative vector. The cluster DB stores the documents clustered by the clustering unit.
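
The indexing and retrieval steps described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the tokenizer, the function names `build_index` and `retrieve`, and the rule that a document matches only when it contains every query term are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_index(field_text: str) -> Counter:
    """Count keyword occurrences in the text of one document field.

    The lowercase alphanumeric tokenizer is an assumption; the source
    does not say how keywords are extracted.
    """
    return Counter(re.findall(r"[a-z0-9]+", field_text.lower()))


def retrieve(indexes: dict, query: str) -> list:
    """Return ids of documents whose index contains every query term.

    Matching on all terms is an assumed retrieval rule.
    """
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return [
        doc_id
        for doc_id, index in indexes.items()
        if all(index[term] > 0 for term in terms)
    ]


# Hypothetical documents, each reduced to the text of one field.
documents = {
    "d1": "Battery charger for a mobile phone battery",
    "d2": "Display panel for a television",
}
indexes = {doc_id: build_index(text) for doc_id, text in documents.items()}
matches = retrieve(indexes, "battery")
```

Each index here doubles as a frequency-based feature source: the count stored for a keyword is its occurrence frequency in the chosen field.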

308 Citations

18 Claims

1. A clustering device having software and hardware for clustering documents, the clustering device comprising:
- a document storage unit storing documents;
  
  a document retrieving unit to retrieve at least two selected documents from a plurality of documents stored in the document storage unit;
  
  a document feature writing unit to acquire text from at least one identical or similar specific field among a plurality of fields in each of the selected documents, and to create index files of the selected documents, and to determine a feature vector for each of the selected documents using the index files, wherein the feature vector includes occurrence frequencies of a plurality of keywords located in the specific field of the selected document;
  
  a representative vector calculator to extract representative keywords having highest occurrence frequencies among the plurality of keywords in the index file of the selected document, to calculate percentages of occurrence frequencies of each of the representative keywords in each selected document, to choose a group of keywords from the representative keywords of the selected documents to be representative keywords used in clustering of the documents, the representative keywords corresponding to a group of keywords having greatest values obtained by adding the percentage of occurrence frequencies among same keywords in each of the selected documents, wherein the representative keywords are chosen to be components of the representative vector;
  
  a cluster database storage unit to store representative keywords constituting components of the representative vector with the representative vector; and
  
  a clustering unit comprising:
  
  a similarity calculator to determine a similarity of the specific field of a stored document in the document storage unit through an inner product value between a feature vector of the stored document and stored representative vector in the cluster database storage unit;
  
  receiving a stored document as a target document for clustering;
  
  clustering the stored document using the specific field of the stored document through an inner product value between a feature vector of the stored document and stored representative vector in the cluster database storage unit; and
  
  providing the user with a clustering result.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The clustering device according to claim 1, wherein the representative vector calculator sets a vector component of the feature vector to one if a part of an index of the stored document includes the representative keyword, and sets the vector component to zero if the part of the index does not include the representative keyword, so that the clustering unit generates the feature vector for each stored document.
  - 3. The clustering device according to claim 1, wherein the documents are patent documents, and the clustering unit further comprises a field clustering unit clustering documents that are identical or similar to each other in the specific field defined in terms of an identification item in the patent document.
  - 4. The clustering device according to claim 1, wherein the document storage unit stores a new document provided to the clustering device, and when the new document is provided to the document storage unit, the clustering unit clusters the new document using the feature vector with respect to the new document and the representative vector stored in the cluster database.
  - 5. The clustering device according to claim 4, wherein the clustering unit further comprises a cluster database manager managing clustered documents stored in the cluster database, and the representative vector used for clustering, and the cluster database manager performs clustering of the new document.
  - 6. The clustering device according to claim 1, wherein the clustering unit clusters the designated target documents among the plurality of documents in the document list when the clustering unit receives the document selection input from the user.
  - 7. The clustering device according to claim 1, wherein the clustering unit generates the clustering groups whose number corresponds to the clustering group number input when the clustering unit receives the clustering group number input from the user.
  - 8. The clustering device according to claim 1, wherein the clustering unit clusters the documents such that each clustering group has documents whose number corresponds to the document number input when the clustering unit receives the document number input from the user.
  - 9. The clustering device according to claim 1, wherein the stored documents are documents acquired as a result of document retrieval.
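
The representative vector of claim 1 and the binary feature vector of claim 2 can be sketched as follows. This is a hedged illustration, not the claimed device: the function names, the `top_k` and `size` parameters, the choice to define "percentage" as a keyword's count divided by the total keyword count of its document, and the use of the summed percentages as the component values of the representative vector are all assumptions that the claims do not spell out.

```python
from collections import Counter


def representative_vector(indexes, top_k=3, size=2):
    """Pick representative keywords and their weights from document indexes.

    For each document, take its top_k most frequent keywords, convert each
    frequency to a share of the document's total keyword count (assumed
    meaning of "percentage"), and add the shares of the same keyword
    across documents. The `size` keywords with the greatest sums become
    the components of the representative vector.
    """
    totals = Counter()
    for index in indexes:
        total_count = sum(index.values())
        for keyword, freq in index.most_common(top_k):
            totals[keyword] += freq / total_count
    keywords = [keyword for keyword, _ in totals.most_common(size)]
    return keywords, [totals[keyword] for keyword in keywords]


def binary_feature_vector(index, keywords):
    """Claim 2 style vector: 1 if the index contains the keyword, else 0."""
    return [1 if index.get(keyword, 0) > 0 else 0 for keyword in keywords]


def inner_product(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical indexes (keyword -> occurrence frequency) of two selected documents.
doc_a = Counter({"battery": 4, "charger": 3, "phone": 2, "cable": 1})
doc_b = Counter({"battery": 6, "phone": 3, "charger": 1})

keywords, rep = representative_vector([doc_a, doc_b])
stored_doc = Counter({"phone": 1, "screen": 2})
score = inner_product(binary_feature_vector(stored_doc, keywords), rep)
```

With this data the summed shares are 1.0 for "battery", 0.5 for "phone", and 0.4 for "charger", so the two representative keywords are "battery" and "phone". A stored document that mentions only "phone" scores 0.5 against that vector.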

10. A method for clustering documents, being employed in a clustering device that comprises a microprocessor, the method comprising:
- retrieving at least two selected documents from a plurality of documents stored in a document storage unit;
  
  acquiring, using the microprocessor, text from the at least one identical or similar specific field among a plurality of fields in each of the selected documents, and to create index files of the selected documents, and to determine a feature vector for each of the selected documents using the index files, wherein the feature vector includes occurrence frequencies of a plurality of keywords located in the specific field of the selected document;
  
  extracting, using the microprocessor, representative keywords having highest occurrence frequencies among the plurality of keywords in the index file of the selected document, to calculate percentages of occurrence frequencies of each of the representative keywords in each selected document, to choose a group of keywords from the representative keywords of the selected documents to be representative keywords used in clustering of the documents, the representative keywords corresponding to a group of keywords having greatest values obtained by adding the percentage of occurrence frequencies among same keywords in each of the selected documents, wherein the representative keywords are chosen to be components of the representative vector;
  
  storing, in a cluster database storage unit, representative keywords constituting components of the representative vector with the representative vector;
  
  providing, using the microprocessor, a user with a document list which includes a plurality of documents;
  
  receiving, using the microprocessor, a document selection input which designates target documents for clustering;
  
  clustering, using the microprocessor, using identical or similar specific field of a document through an inner product value between a feature vector of the target document and stored representative vector in the clustering database storage unit.
- Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
- - 11. The method according to claim 10, wherein the similarities are determined by comparing a preset reference value with a value obtained by dividing an inner product value between the representative vector and the feature vector by a square of an absolute value of the representative vector.
  - 12. The method according to claim 10, wherein the clustering of the documents comprises storing representative vectors used for clustering of the documents.
  - 13. The method according to claim 12, wherein when a new document is stored in the document storage unit, the feature vector with respect to the new document is calculated, and clustering of the new document is performed automatically using a value obtained by inner product between the pre-stored representative vectors and the feature vector of the new document.
  - 14. The method according to claim 10, wherein the documents are patent documents, and the feature vector and the representative vector are calculated with respect to a specific field defined in terms of identification items of the patent documents.
  - 15. The method according to claim 10, wherein the step of generating the feature vector comprises:
    - setting a vector component of the feature vector to one if a part of an index of a document includes the representative keyword; and
      
      setting the vector component to zero if the part of the index does not include a representative keyword.
  - 16. The method according to claim 10, wherein the step of clustering comprises:
    - clustering the designated target documents among the plurality of documents in the document list when the clustering device receives the document selection input from the user.
  - 17. The method according to claim 10, wherein the step of clustering comprises:
    - generating the clustering groups whose number corresponds to the clustering group number input when the clustering device receives the clustering group number input from the user.
  - 18. The method according to claim 10, wherein the step of clustering comprises:
    - clustering the documents such that each clustering group has documents whose number corresponds to the document number input when the clustering device receives the document number input from the user.
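
The similarity test of claim 11 and the automatic clustering of a new document in claim 13 can be sketched in Python. Claim 11 compares a preset reference value with the inner product of the representative vector and the feature vector divided by the square of the representative vector's magnitude. Everything else below is an assumption for illustration: the function names, the cluster names, the default `threshold`, the binary feature vector (borrowed from claim 15), and the rule of picking the best-scoring cluster when several pass the threshold.

```python
def normalized_similarity(rep, feat):
    """Inner product of rep and feat divided by the squared magnitude of rep."""
    dot = sum(r * f for r, f in zip(rep, feat))
    norm_sq = sum(r * r for r in rep)
    return dot / norm_sq if norm_sq else 0.0


def assign_cluster(index, clusters, threshold=0.5):
    """Assign a new document to a stored cluster, or return None.

    `clusters` maps a cluster name to (keywords, representative vector).
    Choosing the highest score at or above `threshold` is an assumed
    tie-breaking policy; the claims only require a comparison with a
    preset reference value.
    """
    best_name, best_score = None, threshold
    for name, (keywords, rep) in clusters.items():
        # Binary feature vector over this cluster's representative keywords.
        feat = [1 if index.get(keyword, 0) > 0 else 0 for keyword in keywords]
        score = normalized_similarity(rep, feat)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical stored representative vectors.
clusters = {
    "power": (["battery", "charger"], [1.0, 0.5]),
    "display": (["panel", "pixel"], [0.8, 0.6]),
}
new_doc = {"battery": 2, "cable": 1}
assigned = assign_cluster(new_doc, clusters)
```

For the new document above, the "power" cluster scores 1.0 / 1.25 = 0.8 and the "display" cluster scores 0, so the document is assigned to "power". A document sharing no representative keyword with any cluster is left unassigned.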


Current Assignee: LG Electronics, Inc. (LG Corporation)
Original Assignee: LG Electronics, Inc. (LG Corporation)
Inventors: Ahn, Han Joon; Kim, Jeong Joong; Cha, Wan Kyu
Primary Examiner(s): Ali; Mohammad
Assistant Examiner(s): DARNO, PATRICK A
Application Number: US11/621,817
Publication Number: US 20070244915A1
Time in Patent Office: 1,749 Days
Field of Search: 707/737, 707/739, 707/999.004, 707/930, 707/937
US Class Current: 707/739
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
Y10S 707/93   intellectual property analysis
Y10S 707/937   intellectual property searc...
