Apparatus for clustering a plurality of documents

  • US 9,342,591 B2
  • Filed: 02/14/2013
  • Issued: 05/17/2016
  • Est. Priority Date: 02/14/2012
  • Status: Expired due to Fees
First Claim

1. An apparatus comprising:

  • a selection section for selecting a plurality of sample documents from a plurality of documents;

    a first parameter generation section for analyzing the plurality of sample documents using preset values as initial values to generate an initial parameter matrix expressing a probability that each of a plurality of words included in the plurality of sample documents is included in each of a plurality of topics; and

    a second parameter generation section for analyzing the plurality of documents by using each value included in the initial parameter matrix as an initial value to generate a parameter matrix expressing a probability that each of a plurality of words included in the plurality of documents is included in each of a plurality of topics, wherein the second parameter generation section includes:

    a division section for dividing the plurality of documents into a plurality of groups, the division section including:

    a pre-clustering section for clustering the plurality of documents into a plurality of clusters based on the initial parameter matrix; and

    an allocation section for allocating each of the documents in the plurality of clusters to each of the plurality of groups, wherein the allocation section allocates the plurality of documents to the plurality of groups in such a manner that each of the plurality of clusters obtained by clustering the plurality of documents based on the initial parameter matrix is included in equal proportions among groups;

    a plurality of element parameter generation sections, each of which is provided to correspond to each of the plurality of groups to analyze a plurality of documents included in a corresponding group by using each value in the initial parameter matrix as an initial value in order to generate an element parameter matrix expressing a probability that each of a plurality of words included in the plurality of documents included in the corresponding group is included in each of a plurality of topics; and

    a synthesis section for synthesizing a plurality of element parameter matrices to generate the parameter matrix, wherein the synthesis section:

    feeds the calculated parameter matrix back to each of the plurality of element parameter generation sections, each of the plurality of element parameter generation sections uses, as an initial value, each value in the parameter matrix fed back to analyze a plurality of documents included in a corresponding group in order to generate the element parameter matrix again,

    synthesizes the plurality of element parameter matrices generated by using, as the initial value, each value in the parameter matrix fed back to generate the parameter matrix again, and

    feeds the calculated parameter matrix back to the division section,

    wherein the division section divides the plurality of documents into the plurality of groups again in such a manner that each of the plurality of clusters obtained by clustering the plurality of documents based on the parameter matrix fed back is included in equal proportions among groups, and

    each of the plurality of element parameter generation sections analyzes a plurality of documents included in a corresponding group divided again to generate the element parameter matrix.
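
The allocation and synthesis steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names `allocate_equal_proportions` and `synthesize` are hypothetical, the matrix layout (one row per topic, one column per word) is an assumption, and the claim does not say how element parameter matrices are combined, so the element-wise averaging with row renormalization used here is an assumed stand-in.

```python
from collections import defaultdict


def allocate_equal_proportions(cluster_of_doc, num_groups):
    """Spread the documents of every cluster across the groups round-robin.

    cluster_of_doc[i] is the pre-clustering label of document i (assumed to
    have been derived from the current parameter matrix). The returned groups
    each hold a near-equal share of every cluster.
    """
    by_cluster = defaultdict(list)
    for doc_id, cluster in enumerate(cluster_of_doc):
        by_cluster[cluster].append(doc_id)

    groups = [[] for _ in range(num_groups)]
    next_group = 0  # carried across clusters so group sizes stay balanced
    for cluster in sorted(by_cluster):
        for doc_id in by_cluster[cluster]:
            groups[next_group].append(doc_id)
            next_group = (next_group + 1) % num_groups
    return groups


def synthesize(element_matrices):
    """Combine per-group topic-by-word matrices into one parameter matrix.

    Assumption: element-wise mean, then each topic row is renormalized so it
    remains a probability distribution over words.
    """
    count = len(element_matrices)
    num_topics = len(element_matrices[0])
    num_words = len(element_matrices[0][0])
    merged = [
        [sum(m[t][w] for m in element_matrices) / count for w in range(num_words)]
        for t in range(num_topics)
    ]
    return [[value / sum(row) for value in row] for row in merged]


# Eight documents in two clusters, split into two groups.
groups = allocate_equal_proportions([0, 0, 0, 0, 1, 1, 1, 1], num_groups=2)
merged = synthesize([[[0.5, 0.5]], [[1.0, 0.0]]])
```

In the feedback loop the claim describes, a matrix such as `merged` would serve as the initial value for the next round of per-group analysis, and also as the basis for re-clustering the documents before they are allocated to groups again.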
