MPI Parallelisation

If you want to speed up time-consuming tangos operations from your command line, such as tangos link and tangos write, you can run them in parallel provided you have MPI and mpi4py on your machine. This is straightforward with, for example, anaconda python distributions – just type conda install mpi4py.
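For reference, the two most common installation routes are shown below; the pip route assumes an MPI implementation and a compiler are already available on your system:

```bash
# With anaconda, as described above:
conda install mpi4py

# Or with pip, assuming MPI and a C compiler are already installed:
pip install mpi4py
```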

Once this has installed successfully, you can run tangos link or tangos write under MPI with

```bash
mpirun -np N [normal tangos command here] --backend mpi4py
```

Here,

* mpirun -np N instructs MPI on your machine to run N processes. For simple setups, this would normally be the number of processors plus one; for example, you'd choose 5 for a 4-core machine. The extra process is optimal because there is always one manager process (which requires relatively little CPU) alongside, in this example, 4 worker processes. If you want to run across multiple nodes, only the first node should have one more process running than cores available; consult your system administrator to figure this out.
* Next you type your normal set of commands, e.g. tangos link ... or tangos write ....
* --backend mpi4py crucially instructs tangos to parallelise using the mpi4py library. Alternatively you can use the pypar library. If you specify no backend, tangos will default to running in single-processor mode, which means MPI will launch N processes that are not aware of each other's presence. This is very much not what you want. Limitations in the MPI library mean it's not possible for tangos to reliably auto-detect that it has been MPI-launched.
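Putting this together, a complete invocation might look like the following sketch. It assumes a 4-core machine and that the tutorial_changa simulation used elsewhere in this documentation has already been imported into your database:

```bash
# 1 manager process + 4 workers on a 4-core machine:
mpirun -np 5 tangos link --for tutorial_changa --backend mpi4py
```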

Advanced options and memory implications: tangos write

For tangos write, there are multiple parallelisation modes. The default mode parallelises at the snapshot level. Each worker loads an entire snapshot at once, then works through the halos within that snapshot. This is highly efficient in terms of limiting disk access and communication requirements. However, for large simulations it can cause memory usage to be overly high.

To control the parallelisation, tangos write accepts a --load-mode argument; the server mode, used in the example below, asks the first MPI process to load each snapshot and serve the relevant data to the workers on request.
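For example, the flag is simply appended to an MPI-launched write command (my_property and my_simulation below are placeholder names, not objects defined in this documentation):

```bash
mpirun -np 5 tangos write my_property --for my_simulation \
       --backend mpi4py --load-mode server
```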

tangos write example

Let's consider the longest-running process in the tutorials, which involves writing images and more for the changa+AHF tutorial simulation.

Some of the underlying pynbody manipulations are already parallelised. One can therefore experiment with different configurations, but experience suggests the best option is to switch off all pynbody parallelisation (i.e. set its number of threads to 1; one way to do this is sketched at the end of this example) and give tangos complete control. This is because only some pynbody routines are parallelised, whereas tangos is close to embarrassingly parallel. With pynbody threading disabled, the most efficient version of the command is:

```bash
mpirun -np 5 tangos write dm_density_profile gas_density_profile uvi_image \
       --with-prerequisites --include-only="NDM()>5000" \
       --include-only="contamination_fraction<0.01" \
       --for tutorial_changa --backend mpi4py --load-mode server
```

for a machine with 4 processors. Why did we specify --load-mode=server? With the default load mode, each of the four workers would hold an entire snapshot in memory at the same time; in server mode, only the first MPI process loads each snapshot in full and serves the relevant data to the workers on request, so only one full copy of the snapshot is held in memory at a time.
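As noted above, this assumes pynbody's own threading has been disabled. A minimal sketch of one way to do that is shown below; it assumes that your version of pynbody reads a number_of_threads setting from the [general] section of ~/.pynbodyrc, so check the pynbody documentation before relying on the exact key name:

```bash
# Assumption: pynbody picks up its thread count from ~/.pynbodyrc;
# the key name may differ between pynbody versions.
cat >> ~/.pynbodyrc <<'EOF'
[general]
number_of_threads: 1
EOF
```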

Memory implications: tangos link

For tangos link, parallelisation is currently implemented only at the snapshot level. Suppose you have a simulation with M particles. Each worker needs to store at least 2M integers at any one time (possibly more, depending on the underlying formats) in order to work out the merger tree.
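To make this concrete: if those integers are stored as 64-bit values (an illustrative assumption; the exact width depends on the format), a simulation with 1024³ particles requires at least 2 × 1024³ × 8 bytes ≈ 17 GB of memory per worker.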

Consequently for large simulations, you may need to use a machine with lots of memory and/or use fewer processes than you have cores available.

It would be possible to implement algorithms where the data is more distributed – or to load in halo trees generated by other tools. If demand is sufficient for either of these abilities, support could be prioritised for future versions of tangos.