Methods and systems for using map-reduce for large-scale analysis of graph-based data

US 8,943,011 B2
Filed: 06/12/2012
Issued: 01/27/2015
Est. Priority Date: 06/28/2011
Status: Active Grant

Abstract

Embodiments are described for a method for processing graph data by executing a Markov Clustering algorithm (MCL) to find clusters of vertices of the graph data, organizing the graph data by column by calculating a probability percentage for each column of a similarity matrix of the graph data to produce column data, generating a probability matrix of states of the column data, performing an expansion of the probability matrix by computing a power of the matrix using a Map-Reduce model executed in a processor-based computing device, and organizing the probability matrix into a set of sub-matrices to find the least amount of data needed for the Map-Reduce model given that two lines of data in the matrix are required to compute a single value for the power of the matrix. One of at least two strategies may be used to compute the power of the matrix (the matrix square, M²), chosen for either simplicity of execution or improved memory usage.
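The column normalization and expansion steps named in the abstract can be illustrated with a small in-memory sketch. It assumes a dense square similarity matrix stored as a list of lists; the function names `normalize_columns` and `expand` and the sample data are hypothetical and do not come from the patent. Each entry of M² is the dot product of one row and one column of M, which is one plausible reading of the "two lines of data" the abstract mentions.

```python
def normalize_columns(sim):
    """Scale each column of a square similarity matrix so that it sums to 1."""
    n = len(sim)
    col_sums = [sum(sim[r][c] for r in range(n)) for c in range(n)]
    return [
        [sim[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]
        for r in range(n)
    ]


def expand(m):
    """Expansion as an ordinary matrix square: entry (r, c) uses row r and column c."""
    n = len(m)
    return [
        [sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
        for r in range(n)
    ]


# Hypothetical 3-vertex graph with self-loops (illustrative data only).
similarity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
prob = normalize_columns(similarity)   # column-stochastic probability matrix
squared = expand(prob)                 # M squared, still column-stochastic
```

Because the product of two column-stochastic matrices is itself column-stochastic, the columns of `squared` still sum to 1, which makes a convenient sanity check for any distributed version of the same computation.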

22 Claims

1. A computer-implemented method for processing graph data, the method comprising:
- executing a Markov Clustering algorithm (MCL) to find clusters of vertices of the graph data;
  
  organizing the graph data by column by calculating a probability percentage for each column of a similarity matrix of the graph data to produce column data;
  
  generating a probability matrix of states of the column data;
  
  performing an expansion of the probability matrix by computing a power of the probability matrix using a Map-Reduce model executed in a processor-based computing device; and
  
  organizing the probability matrix into a set of sub-matrices to find the least amount of data needed for the Map-Reduce model given that two lines of data in the probability matrix are required to compute a single value for the power of the probability matrix.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1 wherein the MCL simulates random walks of the graph data using matrices computations.
  - 3. The method of claim 2 wherein the step of performing the expansion of the probability matrix comprises computing a Hadamard power of the matrix.
  - 4. The method of claim 1 wherein generating the probability matrix comprises:
    - performing a map function of the Map-Reduce model to read the similarity matrix in a column-by-column manner;
      
      calculating the probability value of each element of the similarity matrix;
      
      collecting a column ID or row ID as a key; and
      
      assigning a respective probability as a value.
  - 5. The method of claim 4 further comprising:
    - performing a reduce function of the Map-Reduce model that fetches all of the values under a same key; and
      
      writes new row and columns with corresponding probabilities values to a file system used by the computing device.
  - 6. The method of claim 5 wherein the probability matrix is generated using a first strategy that uses all of the columns and rows of the similarity matrix and the power of the matrix is calculated using a defined matrix multiplication process, a result is collected using both the column ID as the key and the row ID and respective new probability value as the value, and wherein the reduce function obtains all values of a result by column ID.
  - 7. The method of claim 5 wherein the probability matrix is generated using a second strategy that represents the probability data in column format only, organizing the similarity matrix into sub-blocks such that each sub-matrix referenced by a sub-block ID, and calculating the power of the matrix by multiplying sub-matrix pairs to calculate units of sub-matrices, and wherein the reduce function sums received units of the sub-matrices.
  - 8. The method of claim 5 further comprising implementing a partitioner function to arrange a range of key values pairs generated by the map function to form input to a specific reduce function in order to facilitate calculation of sub-matrices in a single reduction step.
  - 9. The method of claim 8 further comprising terminating the processing of graph data upon one of the following:
    - upon a determination that no elements of the probability matrix are changed by further map function operations or reduce function operations, or upon a determination that a number of output records in the reduce function is equal to the number of vertices in the graph data.
  - 10. The method of claim 1 further comprising implementing the Map-Reduce model using a Hadoop distributed file system platform.
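The map and reduce functions recited in claims 4 and 5 can be sketched in memory as follows. This is a minimal illustration, not the patented implementation: it assumes each column of the similarity matrix arrives as a dictionary from row ID to similarity, uses the column ID as the key, and replaces the framework's shuffle phase with a plain dictionary. The names `map_column`, `shuffle`, and `reduce_column` are hypothetical.

```python
from collections import defaultdict


def map_column(col_id, column):
    """Map step: emit (column ID, (row ID, probability)) for one column."""
    total = sum(column.values())
    if total == 0:
        return  # nothing to emit for an all-zero column
    for row_id, sim in column.items():
        yield col_id, (row_id, sim / total)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_column(col_id, values):
    """Reduce step: collect every value under one key into an output record."""
    return col_id, dict(values)


# Hypothetical similarity matrix stored column by column.
similarity_columns = {
    "a": {"a": 1.0, "b": 1.0},
    "b": {"a": 1.0, "b": 1.0, "c": 2.0},
    "c": {"b": 2.0, "c": 2.0},
}
pairs = [
    kv
    for col_id, column in similarity_columns.items()
    for kv in map_column(col_id, column)
]
probability_columns = dict(
    reduce_column(key, values) for key, values in shuffle(pairs).items()
)
```

In a real Map-Reduce job the reduce output would be written to the distributed file system rather than returned; here it is kept in `probability_columns` so the result can be inspected directly.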

11. A system for processing graph data in a distributed computing network coupling one or more server computers to a plurality of workstation computers, comprising:
- a first component executing a Markov Clustering algorithm (MCL) to find clusters of vertices of the graph data and organizing the graph data by column by calculating a probability percentage for each column of a similarity matrix of the graph data to produce column data; and
  
  a Map-Reduce component implemented on a Hadoop distributed file system to generate a probability matrix of states of the column data and performing an expansion of the probability matrix by computing a power of the probability matrix using, and to organize the probability matrix into a set of sub-matrices to find the least amount of data needed for the Map-Reduce model given that two lines of data in the probability matrix are required to compute a single value for the power of the matrix.
- Dependent claims (12, 13, 14, 15, 16, 17, 18):
  - 12. The system of claim 11 wherein the MCL simulates random walks of the graph data using matrices computations, and wherein the expansion of the probability matrix is performed by computing a Hadamard power of the matrix.
  - 13. The system of claim 11 wherein a map function of the Map-Reduce model reads the similarity matrix in a column-by-column manner, calculates the probability value of each element of the similarity matrix, collects column ID or row ID as a key, and assigns a respective probability as a value.
  - 14. The system of claim 13 wherein a reduce function of the Map-Reduce model fetches all of the values under a same key; and writes new row and columns with corresponding probabilities values to a file system used by the computing device.
  - 15. The system of claim 14 wherein the probability matrix is generated using one of:
    - a first strategy that uses all of the columns and rows of the similarity matrix and the power of the matrix is calculated using a defined matrix multiplication process, a result is collected using both the column ID as the key and the row ID and respective new probability value as the value, and wherein the reduce function obtains all values of a result by column ID; and
      
      a second strategy that represents the probability data in column format only, organizing the similarity matrix into sub-blocks such that each sub-matrix referenced by a sub-block ID, and calculating the power of the matrix by multiplying sub-matrix pairs to calculate units of sub-matrices, and wherein the reduce function sums received units of the sub-matrices.
  - 16. The system of claim 15 further comprising a partitioner component to arrange a range of key values pairs generated by the map function to form input to a specific reduce function in order to facilitate calculation of sub-matrices in a single reduction step of the reducer component.
  - 17. The system of claim 16 wherein the processing of the graph data is terminated upon one of the following:
    - upon a determination that no elements of the probability matrix are changed by further map function operations or reduce function operations, or upon a determination that a number of output records in the reduce function is equal to the number of vertices in the graph data.
  - 18. The system of claim 17 wherein the Map-Reduce component is implemented on the plurality of workstations each performing respective calculations on the graph data and utilizing the distributed file system as coordinated by the one or more server computers.
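The second strategy in claim 15, in which sub-matrix pairs are multiplied into "units" that the reduce function sums, resembles ordinary blocked matrix multiplication. The sketch below is one way to picture it under stated assumptions: the matrix is dense, its dimension is a multiple of the block size, and a sub-block ID is a `(block_row, block_col)` tuple. All function names are hypothetical, and the grouping dictionary stands in for the shuffle phase.

```python
from collections import defaultdict


def split_blocks(m, size):
    """Cut a square matrix into size x size sub-matrices keyed by sub-block ID."""
    n = len(m)
    return {
        (bi // size, bj // size): [row[bj:bj + size] for row in m[bi:bi + size]]
        for bi in range(0, n, size)
        for bj in range(0, n, size)
    }


def multiply(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def map_block_pairs(blocks, nblocks):
    """Map: emit one partial product (a unit) per sub-matrix pair."""
    for i in range(nblocks):
        for j in range(nblocks):
            for k in range(nblocks):
                yield (i, j), multiply(blocks[(i, k)], blocks[(k, j)])


def reduce_units(units):
    """Reduce: element-wise sum of all units that target the same sub-block."""
    total = [row[:] for row in units[0]]
    for unit in units[1:]:
        for r, row in enumerate(unit):
            for c, value in enumerate(row):
                total[r][c] += value
    return total


def block_square(m, size):
    """Square a matrix block by block; assumes len(m) is a multiple of size."""
    nblocks = len(m) // size
    blocks = split_blocks(m, size)
    grouped = defaultdict(list)
    for key, unit in map_block_pairs(blocks, nblocks):
        grouped[key].append(unit)
    return {key: reduce_units(units) for key, units in grouped.items()}


# Hypothetical 4 x 4 integer matrix, squared using 2 x 2 sub-blocks.
M = [
    [1, 2, 0, 0],
    [0, 1, 3, 0],
    [0, 0, 1, 4],
    [5, 0, 0, 1],
]
result = block_square(M, 2)
```

Each mapper in this arrangement needs only two sub-matrices at a time rather than whole rows and columns, which may be the memory advantage the claims attribute to the second strategy.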

19. A non-volatile, non-transitory machine-readable medium containing one or more sequences of instructions for processing large-scale graph data in a distributed computing environment through a computer network coupling client computers to a server computer, the instructions configured to cause a processor to:
- execute a Markov Clustering algorithm (MCL) to find clusters of vertices of the graph data;
  
  organize the graph data by column by calculating a probability percentage for each column of a similarity matrix of the graph data to produce column data;
  
  generate a probability matrix of states of the column data;
  
  perform an expansion of the probability matrix by computing a power of the probability matrix using a Map-Reduce model executed in a processor-based computing device; and
  
  organize the probability matrix into a set of sub-matrices to find the least amount of data needed for the Map-Reduce model given that two lines of data in the probability matrix are required to compute a single value for the power of the probability matrix.
- Dependent claims (20, 21, 22):
  - 20. The medium of claim 19 further comprising instructions configured to cause the processor to:
    - perform a map function of the Map-Reduce model to read the similarity matrix in a column-by-column manner;
      
      calculate the probability value of each element of the similarity matrix;
      
      collect column ID or row ID as a key;
      
      assign a respective probability as a value;
      
      perform a reduce function of the Map-Reduce model that fetches all of the values under a same key; and
      
      write new row and columns with corresponding probabilities values to a file system used by the computing device.
  - 21. The medium of claim 20 wherein the probability matrix is generated using one of:
    - a first strategy that uses all of the columns and rows of the similarity matrix and the power of the matrix is calculated using a defined matrix multiplication process, a result is collected using both the column ID as the key and the row ID and respective new probability value as the value, and wherein the reduce function obtains all values of a result by column ID;
      
      or a second strategy that represents the probability data in column format only, organizing the similarity matrix into sub-blocks such that each sub-matrix referenced by a sub-block ID, and calculating the power of the matrix by multiplying sub-matrix pairs to calculate units of sub-matrices, and wherein the reduce function sums received units of the sub-matrices.
  - 22. The medium of claim 21 further comprising instructions configured to cause the processor to arrange a range of key values pairs generated by the map function through a partitioner function to form input to a specific reduce function in order to facilitate calculation of sub-matrices in a single reduction step.
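The partitioner function of claims 8, 16, and 22 routes a range of key-value pairs to a specific reduce function. A minimal range-partitioning sketch follows, assuming keys are `(block_row, block_col)` sub-block IDs and reducers are numbered from 0. The names `partition` and `route` are hypothetical; in Hadoop the comparable hook is the `getPartition(key, value, numPartitions)` method of a `Partitioner` subclass, but this Python code only imitates that role.

```python
from collections import defaultdict


def partition(block_key, nblocks, num_reducers):
    """Assign a sub-block ID to a reducer so that contiguous key ranges stay together."""
    block_row, block_col = block_key
    linear = block_row * nblocks + block_col      # position in row-major order
    return linear * num_reducers // (nblocks * nblocks)


def route(pairs, nblocks, num_reducers):
    """Deliver each (key, unit) pair to the reducer chosen by the partitioner."""
    reducers = defaultdict(lambda: defaultdict(list))
    for key, unit in pairs:
        reducers[partition(key, nblocks, num_reducers)][key].append(unit)
    return reducers


# Hypothetical map output: placeholder strings stand in for sub-matrix units.
pairs = [((0, 0), "u1"), ((0, 0), "u2"), ((1, 1), "u3"), ((0, 1), "u4")]
reducers = route(pairs, nblocks=2, num_reducers=2)
```

Because every unit for a given sub-block lands on the same reducer, that reducer can sum the units in a single reduction step, which is the purpose the claims give for the partitioner.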

Current Assignee: Salesforce.com, Inc.
Original Assignee: Salesforce.com, Inc.
Inventors: Gong, Nan; Koister, Jari
Primary Examiner(s): Gaffin, Jeffrey A
Assistant Examiner(s): Bharadwaj, Kalpana
Application Number: US13/494,594
Publication Number: US 20130024412A1
Time in Patent Office: 959 Days

Field of Search
US Class Current: 706/46
CPC Class Codes: G06N 5/00 Computing arrangements usin...
