High dimension sparse data clustering system and method
High dimension sparse data clustering system and method
 CN 101,266,621 B
 Filed: 04/24/2008
 Issued: 08/10/2011
 Est. Priority Date: 04/24/2008
 Status: Active Grant
First Claim
1. a high dimension sparse data clustering method is characterized in that, comprises the steps:
 Step 1;
extract the sparse features of each data object, calculate the diversity factor between the data object in twos according to the sparse features of each data object;
Step 2;
data object and ant are randomly dispersed on the two dimensional surface, and distribute a data object for ant, ant bears this data object;
This two dimensional surface is the net region that boundary condition is arranged;
Step 3;
when ant moves to a net region, calculate the average similarity of other data object in current data object that it is born and the current net region according to diversity factor;
Calculate ant to bearing the probability that puts down of data object according to average similarity, meet first preconditionedly if put down probability, then ant is put down it and bears data object;
Step 4;
when ant does not bear data object, select a data object of not born at random, according to diversity factor calculate this data object in it self net region, place with the average similarity of other data object, and according to the pick up probability of average similarity calculating ant to this data object, meet second preconditioned if this picks up probability, then execution in step 2, do not born and select data object otherwise choose one again;
Wherein, in step 3 and the step 4, calculate average similarity according to the following equation;
Chinese PRB Reexamination
Abstract
The invention relates to a high dimensional sparse data clustering system and a high dimensional sparse data clustering method. The method comprises: sparse feature of each data object is extracted, and difference degree between each two data objects is calculated; the data object and an ant are scattered in a two dimensional plane and the data object is distributed for the ant; the two dimensional plane is a mesh region with boundary conditions; when the ant moves into the mesh region, a dropping probability of the ant for the carried data object is calculated according to the difference degree, if according with condition, the ant drops down the carried data object; when the ant does not carry the data object, the data object which is not carried is chose to calculate a picking up probability of the ant for the data object according to the difference degree, if non according with the condition, the data object which is not carried is chose again, or the ant carries the data object and moves to another mesh so that the dropping probability of the data object is calculated. The invention effectively avoids '"'"'dimension curse'"'"' via mapping a '"'"'similarity'"'"' of the high dimensional sparse data in a high dimensional attribute space into an '"'"'adjacency'"'"' of the high dimensional sparse data in the two dimensional plane.
12 Claims

1. a high dimension sparse data clustering method is characterized in that, comprises the steps:

Step 1;
extract the sparse features of each data object, calculate the diversity factor between the data object in twos according to the sparse features of each data object;Step 2;
data object and ant are randomly dispersed on the two dimensional surface, and distribute a data object for ant, ant bears this data object;
This two dimensional surface is the net region that boundary condition is arranged;Step 3;
when ant moves to a net region, calculate the average similarity of other data object in current data object that it is born and the current net region according to diversity factor;
Calculate ant to bearing the probability that puts down of data object according to average similarity, meet first preconditionedly if put down probability, then ant is put down it and bears data object;Step 4;
when ant does not bear data object, select a data object of not born at random, according to diversity factor calculate this data object in it self net region, place with the average similarity of other data object, and according to the pick up probability of average similarity calculating ant to this data object, meet second preconditioned if this picks up probability, then execution in step 2, do not born and select data object otherwise choose one again;Wherein, in step 3 and the step 4, calculate average similarity according to the following equation;


2. high dimension sparse data clustering method as claimed in claim 1 is characterized in that, data object o _{i}, o _{j}Diversity factor distance be:

3. high dimension sparse data clustering method as claimed in claim 1 or 2 is characterized in that, in the step 3, calculates according to the following equation and puts down probability:

4. high dimension sparse data clustering method as claimed in claim 1 or 2 is characterized in that, in the step 4, calculates according to the following equation and picks up probability:

5. high dimension sparse data clustering method as claimed in claim 3 is characterized in that, in the step 3, first is preconditioned for putting down probability greater than the probable value that produces at random.

6. high dimension sparse data clustering method as claimed in claim 3 is characterized in that, in the step 4, second is preconditioned for picking up probability greater than the probable value that produces at random.

7. high dimension sparse data clustering method as claimed in claim 1 or 2 is characterized in that, in step 3 and the step 4, progressively increases radius of neighbourhood S when calculating average similarity.

8. high dimension sparse data clustering method as claimed in claim 1 or 2 is characterized in that, in the step 1, compresses storage also according to diversity factor structural differences degree matrix, and to the diversity factor matrix.

9. a high dimension sparse data clustering system is characterized in that, comprising:

The data object pretreatment module is used to extract the sparse features of data object, calculates the diversity factor between the data object in twos according to the sparse features of each data object; The data object carrier module is used for data object and ant are randomly dispersed in two dimensional surface, and is data object of ant distribution;
This two dimensional surface is the net region that boundary condition is arranged;Average similarity calculation module is used for calculating according to diversity factor the average similarity of other data object in data object that current ant bears or select and the current net region; Put down probability and pick up the probability calculation module, be used for calculating the probability that puts down of data object that current ant bears according to average similarity, the selected data object of perhaps current ant pick up probability; The data objects processing module is used to put down probability and meets first when preconditioned, puts down current ant and bears data object, perhaps picks up probability and meets second when preconditioned, picks up the selected data object of current ant; Wherein, average similarity calculation module is calculated average similarity according to the following equation;


10. high dimension sparse data clustering system as claimed in claim 9 is characterized in that, puts down probability and picks up the probability calculation module and calculate according to the following equation and put down probability:

Wherein, p _{Drop}(o _{i}) be data object o _{i}Put down probability, k _{2}Be threshold constant;
Put down probability and pick up the probability calculation module and calculate according to the following equation and pick up probability;


11. high dimension sparse data clustering system as claimed in claim 10 is characterized in that, first is preconditioned for putting down probability greater than the probable value that produces at random;
 Second is preconditioned for picking up probability greater than the probable value that produces at random.

12. high dimension sparse data clustering system as claimed in claim 9 is characterized in that, the data object pretreatment module also is used for according to diversity factor structural differences degree matrix, and the diversity factor matrix is compressed storage.
Specification(s)