MapGraph is Massively Parallel Graph processing on GPUs. (Previously known as "MPGraph").
MapGraph is under the Apache 2 license. You can download MapGraph from http://sourceforge.net/projects/mpgraph/ . For the lastest version of this documentation, see http://mapgraph.io. You can subscribe to receive notice for future updates on the project home page. For open source support, please ask a question on the MapGraph mailing lists or file a ticket. To inquire about commercial support, please email us at firstname.lastname@example.org@.email@example.com@firstname.lastname@example.org. You can follow MapGraph and the bigdata graph database platform at http://www.bigdata.com/blog.
This work was (partially) funded by the DARPA XDATA program under AFRL
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under
Contract No. D14PC00029.
MapGraph is up to two orders of magnitude faster than parallel CPU implementations on up to 24 CPU cores and has performance comparable to a state-of-the-art manually optimized GPU implementation. For example, the diagram below shows the speedups of MapGraph versus GraphLab for SSSP.
For our GPU evaluations we used a NVIDIA c2075 (Fermi architecture), but performance is similar on other NVIDIA cards. The CPU platform was a machine containing a 3.33 GHz X5680 CPU chipset. This is a dual-socket Westmere chipset that contains 12 physical cores and 12 MB of cache. The machine contains 24 GB of 1333 MHz ECC memory. The software environment is RedHat 6.2 Beta. CPU code was compiled with gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3). The results were obtained using the synchronous engine for GraphLab due to core faults with some data sets when using the asynchronous engine.
MapGraph is implemented as a set of templates following a design pattern that is similar to the Gather-Apply-Scatter (GAS) API. GAS is a vertex-centric API, similar to the API first popularized by Pregel. The GAS API breaks down operations into the following phases:
The GAS API has been extended in order to: (a) maximize parallelism; (b) manage simultaneous discovery of duplicate vertices (this is not an issue in multi-core CPU code); (c) provide appropriate memory barriers (each kernel provides a memory barrier); (d) optimize memory layout; and (e) allow "push" style scatter operators are similar to "signal" with a message value to create a side-effect in GraphLab.
MapGraph defines the following kernels and supports their invocation from templated CUDA programs. Each kernel may have one or more device functions that it invokes. User code (a) provides implementations of those device functions to customize the behavior of the algorithm; and (b) provides custom data structures for the vertices and links (see below).
Gather Phase Kernels::
Apply Phase Kernels::
Scatter Phase Kernels::
In order to write code to the MapGraph API, it helps to have a high-level understanding of the data structures used to maintain the frontier, the topology, and the user data associated with the vertices and edges of the graph.
MapGraph uses frontier queues to maintain a dense list of the active vertices. These queues are managed by the MapGraph kernels, but user data may be allocated and accessed that is 1:1 with the frontier (see below). The frontier array dimensions are determined by the number of vertices in the graph times the frontier queue size multiplier. The frontier is in global memory, but is buffered in shared memory by some kernels.
The frontier is populated by the expand() kernel. The contract() kernel may be used to eliminate some or all of the duplicates depending on the strategy and the needs of the graph algorithm. There are actually two two frontier queues - this is for double-buffering.
In addition to the frontier, you may allocate optional user data arrays that are 1:1 with the active vertices in the frontier. These arrays provides an important scratch area for many calculations and benefit from dense, coalesced access. The arrays are accessed from the same kernels that operate on the vertex frontier. For example, BFS uses a scratch array to store the predecessor value.
The topology of the graph is modeled by a forward and reverse sparse matrix and is constructed at runtime from the sparse matrix data file. The digrams below illustrate the use of CSR and CSC data structures to model the graph topology. However, these topology data structures are not directly exposed to user algorithms and their internals may change. Users write device functions that are invoked from kernels that process the topology using a variety of different strategies. Users do not need to access or understand the internals of the topology data structures to write graph algorithms.
The forward topology index is currently a Compressed Sparse Row (CSR) matrix that provides row based indexing into the graph. This data structure is not directly exposed to user algorithms. CSR is used to access (traverse) the out-edges of the graph.
The reverse topology index is currently a Compressed Sparse Column (CSC) matrix that provides column based indexing into the graph. This data structure is not directly exposed to user algorithms. CSC is used to access (traverse) the in-edges of the graph. The CSC edgeId array gives the index into the EdgeList arrays. The CSC data structure is only maintained if the algorithm will read over the in-edges.
User data is specific to a given algorithm. It is laid out in a structure of arrays format in order to maximize coalesced memory access. (User data can also be 1:1 with the frontier - see above.) There are two basic user data structures: The vertex list and the edge list.
The VertexList is a structure of named arrays that provides data for each vertex in the graph. The index into each array is the vertexId. See the VertexData structure in one of the existing algorithms for examples.
The EdgeList is a structure of named arrays that provides data for each edge in the graph. The index into each array is the edgeId. The CSR.colind and the CSC.edgeId are both 1:1 with the edge list arrays. Again, see an EdgeData structure in one of the existing algorithms for examples.
The vertex list and edge list are laid out in vertical stripes using a Structures of Arrays pattern for optimal memory access patterns on the GPU. To add your own data, you add a field to the vertex data struct or the edge data struct. That field will be an array that is 1:1 with the vertex identifiers. You will need to initialize your array. MapGraph will provide you with access to your data from within the appropriate device functions.
MapGraph is based on templates. This means that there is no interface or super class from which you can derive your code. Instead, you need to start with one of the existing implementations that uses the MapGraph template "pattern". You then need to review and modify the function that initializes the user data structures (the vertex list and the edge list) and the device functions that implement the user code for the Gather, Apply, and Scatter primitives. You can also define functions that will extract the results from the GPU.