DistArray

Think globally, act locally.

DistArray provides general multidimensional NumPy-like distributed arrays to Python. It intends to bring the strengths of NumPy to data-parallel high-performance computing. DistArray has a similar API to NumPy.

DistArray is ready for real-world testing and deployment; however, the project is still evolving rapidly, and we appreciate continued input from the scientific-Python community.

DistArray is for users who

  • know and love Python and NumPy,
  • want to scale NumPy to larger distributed datasets,
  • want to work with distributed data interactively but also run batch-oriented distributed programs,
  • want an easier way to drive and coordinate existing MPI-based codes,
  • have a lot of data that may already be distributed,
  • want a global view ("think globally") with local control ("act locally"),
  • need to tap into existing parallel libraries like Trilinos, PETSc, or Elemental,
  • want the interactivity of IPython and the performance of MPI.

DistArray is designed to work with other packages that implement the Distributed Array Protocol.

Getting Started

To see some examples of what DistArray can do, check out our IPython notebooks on nbviewer (also in the examples directory of the DistArray source).

Overview

NumPy is at the foundation of the scientific Python stack for good reason: NumPy arrays are easy to use, they have many powerful features like ufuncs, slicing, and broadcasting, and they work easily with external libraries.

As data sets grow and parallel hardware becomes more widely available, wouldn't it be great if NumPy easily supported parallel execution, without losing its nice interface in a miasma of low-level parallel coordination? What would that look like?

What we want is transparent distribution of NumPy arrays across CPUs, clusters, and supercomputers. We want to interact with distributed NumPy arrays the way we think about them and get the benefit of all that parallelism. We also want to be able to drop down a level to control what's going on at the data-local level when performance demands it.

Such a NumPy opens doors to providing a high-level NumPy-like interface to distributed libraries like Trilinos, PETSc, Global Arrays, Elemental, and ScaLAPACK, among others.

All this coordination has overhead and is at risk of becoming a performance bottleneck. This NumPy will need a way to allow direct execution at a data-local level. We will also need a way to communicate directly between local processes when needed, rather than doing everything at a global level.
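To make the idea of direct local-to-local communication concrete, here is a minimal sketch in plain NumPy. It does not use DistArray's API: the function names `local_diff_with_halo` and `distributed_diff` are hypothetical, and the "processes" are simulated as a list of blocks in one Python process. Each block needs one boundary value from its right-hand neighbor to compute a forward difference, which is the kind of exchange that a real MPI program would presumably handle with a point-to-point message instead of a round trip through a global coordinator.

```python
import numpy as np


def local_diff_with_halo(local_block, right_halo):
    """Forward difference on one block.

    right_halo is the first value of the neighboring block to the right,
    or None when this is the last block and has no neighbor.
    """
    if right_halo is None:
        extended = local_block
    else:
        extended = np.append(local_block, right_halo)
    return np.diff(extended)


def distributed_diff(global_array, n_blocks):
    """Simulate a block-distributed np.diff (assumes n_blocks <= len(global_array))."""
    # Hypothetical distribution: contiguous blocks, one per simulated process.
    blocks = np.array_split(global_array, n_blocks)
    results = []
    for i, block in enumerate(blocks):
        # Stand-in for a neighbor-to-neighbor message: fetch one boundary value.
        halo = blocks[i + 1][0] if i + 1 < len(blocks) else None
        results.append(local_diff_with_halo(block, halo))
    return np.concatenate(results)


x = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0])
blockwise = distributed_diff(x, n_blocks=3)
```

In this sketch `blockwise` matches `np.diff(x)`, even though each block only ever sees its own data plus a single value from its neighbor.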

This distributed NumPy should be a good citizen and work easily with regular NumPy arrays, with MPI, with IPython parallel, and with external distributed algorithms.

DistArray is our vision of what distributed NumPy can be. It brings the best parts of NumPy to data-parallel computing. We want to think globally about our arrays, interacting with them as if they are just really big NumPy arrays, all the while acting locally on them for performance and control.
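As a rough illustration of "think globally, act locally", the sketch below uses only NumPy and simulates the distribution inside a single process. It is not DistArray's API: `block_distribute`, `local_sum_of_squares`, and `global_sum_of_squares` are hypothetical names chosen for this example. The caller asks one global question about the whole array, while the work happens block by block on local pieces and only small partial results are combined.

```python
import numpy as np


def block_distribute(global_array, n_blocks):
    """Split an array along axis 0 into contiguous local blocks."""
    return np.array_split(global_array, n_blocks, axis=0)


def local_sum_of_squares(local_block):
    """Work done on one block, using only that block's data."""
    return float(np.sum(local_block ** 2))


def global_sum_of_squares(global_array, n_blocks=4):
    """Global view: one call, one answer, computed from local partial results."""
    local_blocks = block_distribute(global_array, n_blocks)
    partials = [local_sum_of_squares(block) for block in local_blocks]
    # Only one float per block is combined here, not the block data itself.
    return sum(partials)


data = np.arange(10, dtype=float)
total = global_sum_of_squares(data, n_blocks=4)
```

In a real distributed setting each block would live on a different process and the final `sum` would be a reduction across processes; here it is an ordinary Python sum over a list.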

Installation

DistArray requires the following Python libraries:

Optionally, DistArray can make use of:

  • h5py built against a parallel-enabled build of HDF5 (for HDF5 IO), and
  • matplotlib (for making plots of DistArray distributions).

If you have the above, you should be able to install DistArray with:

python setup.py install

or:

pip install distarray
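After installing, a quick way to see which of the packages named above are visible to your Python interpreter is a small check like the following. This is only a sketch using the standard library; the helper names `is_importable` and `report_packages` are made up for this example and are not part of DistArray.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report_packages(names):
    """Map each top-level module name to whether it can be located."""
    return {name: is_importable(name) for name in names}


if __name__ == "__main__":
    # distarray itself, plus the optional packages mentioned above.
    status = report_packages(["distarray", "h5py", "matplotlib"])
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Note that finding h5py this way says nothing about whether it was built against a parallel-enabled HDF5; that has to be checked separately.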

Experimental quickstart scripts

Alternatively, we have experimental installation scripts in the quickstart directory at the root of this source tree. Given a Canopy or Anaconda installation and a couple of other prerequisites, these scripts attempt to install DistArray and its dependencies for you. See the readme files in that directory for more information.

Testing Your Installation

To test your installation, you will first need to start an IPython.parallel cluster with MPI enabled. The easiest way is to use the dacluster command that comes with DistArray:

dacluster start

See dacluster's help for more:

dacluster --help

You should then be able to run all the tests from the DistArray source directory with:

make test

If you've installed DistArray with python setup.py develop, you should be able to run the tests from anywhere with:

python -m distarray.run_tests

History

DistArray was started by Brian Granger in 2008 and is currently being developed by Enthought in partnership with Bill Spotz from Sandia's (Py)Trilinos project and Brian Granger and Min RK from the IPython project.