[ Blocking unidirectional throughput test ]
The blocking stream (unidirectional) throughput test is the
default mode for mpitest. We start by testing Gigabit Ethernet and then
Myrinet:
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -r 5 -o output
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size blocking stream (unidirectional) test
Test result: "output"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 16:32:53 2004
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 515899392
# Message size (Bytes): 1048576
# Iteration : 492
# Test time: 5.000000
# Test repetition: 5
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 822.8124 5.02 0.11 1.59 5.02 0.70 3.00
2 822.7987 5.02 0.09 1.86 5.02 0.83 3.08
3 822.8318 5.02 0.07 1.76 5.02 0.84 2.99
4 822.2579 5.02 0.03 1.87 5.02 0.79 2.98
5 822.2205 5.02 0.04 1.71 5.02 0.82 2.96
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -r 5 -o output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size blocking stream (unidirectional) test
Test result: "output.txt"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 16:39:33 2004
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 1196425216
# Message size (Bytes): 1048576
# Iteration : 1141
# Test time: 5.000000
# Test repetition: 5
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 1914.6484 5.00 5.00 0.00 5.00 5.00 0.00
2 1914.4683 5.00 5.00 0.00 5.00 5.00 0.00
3 1913.2747 5.00 5.00 0.00 5.00 5.00 0.00
4 1914.4117 5.00 5.00 0.00 5.00 5.00 0.00
5 1914.5514 5.00 5.00 0.00 5.00 5.00 0.00
[mk1 ~/hpcbench/mpi]$
The results show that the Myrinet interconnect's throughput is about
double that of Gigabit Ethernet.
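For reference, the core of such a blocking stream test reduces to a timed
MPI_Send/MPI_Recv loop between the two ranks. The following is only a minimal
sketch with an assumed fixed message size and iteration count, not the actual
mpitest source:

/* Minimal blocking stream (unidirectional) sketch -- illustrative only.
 * Build: mpicc stream.c -o stream
 * Run:   mpirun -np 2 -machinefile machine-file ./stream */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_size = 1048576;   /* 1 MB messages, as in the runs above */
    const int iters    = 500;       /* assumed fixed count; mpitest adapts this */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0)              /* master streams data one way */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else                        /* secondary node only receives */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("Throughput: %.4f Mbps\n", (double)msg_size * iters * 8 / elapsed / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}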
[ Blocking bidirectional throughput test ]
The blocking bidirectional throughput test is the so-called
"ping-pong" test: the slave (secondary) node receives a message and sends it
back to the master node:
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -i -r 6 -o output
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size blocking ping-pong (bidirectional) test
Test result: "output"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 16:48:46 2004
# Test mode: Fixed-size ping-pong (bidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 254803968
# Message size (Bytes): 1048576
# Iteration : 243
# Test time: 5.000000
# Test repetition: 6
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 817.5592 4.99 0.39 3.12 4.99 0.43 3.16
2 817.5251 4.99 0.33 2.85 4.99 0.42 3.10
3 818.5469 4.98 0.51 2.53 4.98 0.34 3.10
4 818.4082 4.98 0.42 2.41 4.98 0.42 2.94
5 818.6736 4.98 0.43 2.49 4.98 0.45 2.86
6 818.7928 4.98 0.45 2.33 4.98 0.41 3.11
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -ir 6 -o output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size blocking ping-pong (bidirectional) test
Test result: "output.txt"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 16:51:39 2004
# Test mode: Fixed-size ping-pong (bidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 597688320
# Message size (Bytes): 1048576
# Iteration : 570
# Test time: 5.000000
# Test repetition: 6
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 1910.7564 5.00 5.00 0.00 5.00 5.01 0.00
2 1912.5686 5.00 5.00 0.00 5.00 5.00 0.00
3 1910.3087 5.01 5.01 0.00 5.01 5.00 0.00
4 1909.9115 5.01 5.00 0.00 5.01 5.01 0.00
5 1912.1223 5.00 5.00 0.00 5.00 5.00 0.00
6 1912.5028 5.00 5.00 0.00 5.00 5.00 0.00
[mk1 ~/hpcbench/mpi]$
The results show that the throughputs of the ping-pong tests and the stream
tests are almost the same.
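The ping-pong exchange itself amounts to a send followed by a matching receive
on the master, mirrored on the secondary node. A minimal illustrative sketch
(assuming the reported throughput counts the data moved in both directions):

/* Minimal blocking ping-pong (bidirectional) sketch -- illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_size = 1048576, iters = 250;  /* assumed values */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {            /* master: send, then wait for the echo */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                    /* secondary: receive, then send back */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)                  /* count the data flowing in both directions */
        printf("Throughput: %.4f Mbps\n", 2.0 * msg_size * iters * 8 / elapsed / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}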
[ Blocking exponential throughput test ]
In exponential tests, the message size increases
exponentially from 1 Byte to 2^n Bytes, where n is set by the (-e) option.
The following tests define a maximum message size of 64 MBytes (2^26):
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -e 26 -o output
mk3(Master-node) <--> mk4(Secondary-node)
Exponential blocking stream (unidirectional) test
Test result: "output"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 16:54:02 2004
# Test mode: Exponential stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
#
# Message Overall Master-node M-process M-process Slave-node S-process S-process
# Size Throughput Iteration Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Bytes Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 0.0081 10000 9.92 0.00 0.00 9.92 0.01 0.00
2 0.0160 2520 2.52 0.00 0.00 2.52 0.00 0.00
4 0.0317 2499 2.52 0.00 0.00 2.52 0.00 0.00
8 0.0640 2479 2.48 0.00 0.00 2.48 0.00 0.00
16 12.0482 2498 0.03 0.00 0.02 0.03 0.00 0.02
32 24.0612 10000 0.11 0.01 0.04 0.11 0.01 0.09
64 46.9700 10000 0.11 0.01 0.06 0.11 0.03 0.08
128 112.5437 10000 0.09 0.00 0.06 0.09 0.03 0.06
256 197.8421 10000 0.10 0.00 0.01 0.10 0.03 0.07
512 232.4170 10000 0.18 0.01 0.02 0.18 0.03 0.09
1024 251.0281 10000 0.33 0.01 0.00 0.33 0.03 0.13
2048 258.4786 10000 0.63 0.02 0.04 0.63 0.08 0.22
4096 261.9986 10000 1.25 0.01 0.12 1.25 0.13 0.36
8192 264.0463 10000 2.48 0.06 0.09 2.48 0.06 0.52
16384 264.3353 10000 4.96 0.03 0.21 4.96 0.15 0.86
32768 264.7443 5041 4.99 0.03 0.33 4.99 0.10 0.99
65536 259.5392 2524 5.10 0.01 0.39 5.10 0.12 0.82
131072 249.8145 1237 5.19 0.04 0.26 5.19 0.18 1.10
262144 247.5106 595 5.04 0.04 0.29 5.04 0.27 1.15
524288 247.4411 295 5.00 0.02 0.27 5.00 0.25 1.01
1048576 248.6544 147 4.96 0.02 0.18 4.96 0.21 1.00
2097152 244.4746 74 5.08 0.01 0.18 5.08 0.23 1.04
4194304 243.7699 36 4.96 0.04 0.28 4.96 0.25 0.88
8388608 244.1473 18 4.95 0.03 0.34 4.95 0.26 1.32
16777216 243.7410 9 4.96 0.00 0.30 4.96 0.23 1.28
33554432 242.4236 5 5.54 0.07 0.32 5.54 0.30 1.24
67108864 244.1236 5 11.00 0.03 0.61 11.00 0.53 2.58
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -e 26 -o output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Exponential blocking stream (unidirectional) test
Test result: "output.txt"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 16:58:04 2004
# Test mode: Exponential stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
#
# Message Overall Master-node M-process M-process Slave-node S-process S-process
# Size Throughput Iteration Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Bytes Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 1.3492 10000 0.06 0.06 0.00 0.06 0.06 0.00
2 2.7097 10000 0.06 0.06 0.00 0.06 0.06 0.00
4 5.4210 10000 0.06 0.06 0.00 0.06 0.06 0.00
8 10.8340 10000 0.06 0.06 0.00 0.06 0.06 0.00
16 21.6515 10000 0.06 0.06 0.00 0.06 0.06 0.00
32 43.2567 10000 0.06 0.05 0.01 0.06 0.06 0.00
64 86.1781 10000 0.06 0.06 0.00 0.06 0.06 0.00
128 161.0577 10000 0.06 0.04 0.01 0.06 0.06 0.00
256 336.5160 10000 0.06 0.06 0.01 0.06 0.06 0.00
512 650.0364 10000 0.06 0.05 0.01 0.06 0.07 0.00
1024 1222.6667 10000 0.07 0.05 0.02 0.07 0.06 0.00
2048 1922.8257 10000 0.09 0.06 0.02 0.09 0.09 0.00
4096 1949.6520 10000 0.17 0.16 0.01 0.17 0.17 0.00
8192 1969.3541 10000 0.33 0.32 0.01 0.33 0.33 0.00
16384 1976.0888 10000 0.66 0.65 0.02 0.66 0.66 0.00
32768 1532.8212 10000 1.71 1.71 0.00 1.71 1.71 0.00
65536 1711.6819 10000 3.06 3.06 0.00 3.06 3.06 0.00
131072 1820.3357 8161 4.70 4.70 0.00 4.70 4.70 0.00
262144 1874.0833 4340 4.86 4.85 0.01 4.86 4.86 0.00
524288 1899.8430 2234 4.93 4.93 0.00 4.93 4.93 0.00
1048576 1912.2904 1132 4.97 4.95 0.00 4.97 4.96 0.00
2097152 1945.7018 569 4.91 4.91 0.00 4.91 4.90 0.00
4194304 1962.1807 289 4.94 4.94 0.00 4.94 4.94 0.00
8388608 1970.1613 146 4.97 4.97 0.01 4.97 4.97 0.01
16777216 1974.6538 73 4.96 4.96 0.00 4.96 4.96 0.00
33554432 1976.3894 36 4.89 4.87 0.01 4.89 4.88 0.01
67108864 1977.6266 18 4.89 4.86 0.02 4.89 4.84 0.03
[mk1 ~/hpcbench/mpi]$
The result of the GE communication looks unreasonable: the throughputs are
much lower than those of the fixed-size tests. I couldn't figure out the
problem. I suspect there may be some mechanism (bug?) in the MPICH
implementation's handling of small-message exchanges that introduces the
delay. The difference in the mpitest implementation between the fixed-size
and exponential tests is that in exponential tests the program allocates one
buffer of the maximum test size (64 MBytes in this case) and reuses it for
all message sizes (2^0 ~ 2^26). We can see that the results of the Myrinet
test are normal.
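A minimal sketch of that allocation strategy is shown below: one buffer of the
maximum size is allocated up front and reused for every message size. This is
only an illustration of the idea described above (with an assumed per-size
iteration count), not the actual mpitest code:

/* Exponential-mode sketch: one max-size buffer reused for all message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int max_exp = 26, iters = 20;       /* -e 26: up to 64 MB; assumed count */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc((size_t)1 << max_exp); /* single 64 MB allocation for all sizes */

    for (int e = 0; e <= max_exp; e++) {
        int msg = 1 << e;
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0)
                MPI_Send(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0)
            printf("%10d Bytes  %12.4f Mbps\n", msg,
                   (double)msg * iters * 8 / elapsed / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}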
[ Non-blocking unidirectional throughput test ]
Several MPI function calls support non-blocking
communication. We only test the MPI_Isend()/MPI_Irecv() pair:
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -n -r 6 -o output
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size non-blocking stream (unidirectional) test
Test result: "output"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 18:51:25 2004
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Non-blocking communication (MPI_Isend/MPI_Irecv)
# Total data size of each test (Bytes): 509607936
# Message size (Bytes): 1048576
# Iteration : 486
# Test time: 5.000000
# Test repetition: 6
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 819.4042 4.98 0.04 1.83 4.98 1.04 2.95
2 819.4323 4.98 0.12 1.66 4.98 1.02 2.76
3 819.4075 4.98 0.08 1.69 4.98 0.86 3.11
4 819.4816 4.97 0.06 1.66 4.98 1.00 3.01
5 818.4036 4.98 0.09 1.71 4.98 0.92 3.09
6 819.4279 4.98 0.07 1.64 4.98 0.97 3.02
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -n -r 6 -o output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size non-blocking stream (unidirectional) test
Test result: "output.txt"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 18:54:45 2004
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Non-blocking communication (MPI_Isend/MPI_Irecv)
# Total data size of each test (Bytes): 1192230912
# Message size (Bytes): 1048576
# Iteration : 1137
# Test time: 5.000000
# Test repetition: 6
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 1911.0491 4.99 4.99 0.00 4.99 4.99 0.00
2 1908.1163 5.00 4.99 0.00 5.00 5.00 0.00
3 1911.2969 4.99 4.99 0.00 4.99 4.99 0.00
4 1910.9985 4.99 4.99 0.00 4.99 4.99 0.00
5 1910.7160 4.99 5.00 0.00 4.99 4.99 0.00
6 1910.9235 4.99 4.99 0.00 4.99 4.99 0.00
[mk1 ~/hpcbench/mpi]$
Compared with the blocking unidirectional tests above, there is no
significant difference in throughput between the blocking and non-blocking
MPI function calls.
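Relative to the blocking stream sketch shown earlier, only the inner loop
changes: each blocking call becomes an immediate MPI_Isend()/MPI_Irecv()
followed by MPI_Wait(). A minimal illustrative sketch (assumed message size
and iteration count, not the mpitest source):

/* Minimal non-blocking stream sketch using MPI_Isend/MPI_Irecv + MPI_Wait. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_size = 1048576, iters = 500;  /* assumed values */
    int rank;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0)
            MPI_Isend(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);      /* complete before reusing the buffer */
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("Throughput: %.4f Mbps\n", (double)msg_size * iters * 8 / elapsed / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}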
[ Non-blocking bidirectional throughput test ]
In non-blocking bidirectional MPI communication, the master node
and the slave node keep sending and receiving simultaneously, and the
MPI_Wait() function is used to complete each exchange.
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -n -i -r 6 -o output
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size non-blocking ping-pong (bidirectional) test
Test result: "output"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 18:59:30 2004
# Test mode: Fixed-size ping-pong (bidirectional) test
# Hosts: mk3 <----> mk4
# Non-blocking communication (MPI_Isend/MPI_Irecv)
# Total data size of each test (Bytes): 253755392
# Message size (Bytes): 1048576
# Iteration : 242
# Test time: 5.000000
# Test repetition: 6
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 876.3000 4.63 0.53 2.44 4.63 0.55 3.98
2 876.7293 4.63 0.56 2.27 4.63 0.49 4.02
3 876.2711 4.63 0.51 2.30 4.63 0.54 4.00
4 875.8637 4.64 0.55 2.42 4.64 0.46 4.02
5 876.6733 4.63 0.43 2.25 4.63 0.51 4.00
6 876.2160 4.63 0.47 2.46 4.63 0.51 3.99
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -n -i -r 6 -o output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size non-blocking ping-pong (bidirectional) test
Test result: "output.txt"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 19:01:13 2004
# Test mode: Fixed-size ping-pong (bidirectional) test
# Hosts: mk3 <----> mk4
# Non-blocking communication (MPI_Isend/MPI_Irecv)
# Total data size of each test (Bytes): 594542592
# Message size (Bytes): 1048576
# Iteration : 567
# Test time: 5.000000
# Test repetition: 6
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 3776.2298 2.52 2.52 0.00 2.52 2.50 0.00
2 3776.9762 2.52 2.51 0.01 2.52 2.52 0.00
3 3776.4214 2.52 2.52 0.00 2.52 2.52 0.00
4 3772.0101 2.52 2.51 0.00 2.52 2.52 0.00
5 3775.7260 2.52 2.52 0.00 2.52 2.52 0.00
6 3778.5458 2.52 2.52 0.00 2.52 2.52 0.00
[mk1 ~/hpcbench/mpi]$
In the non-blocking bidirectional MPI tests, the Gigabit Ethernet throughput
increases slightly over its blocking mode, while Myrinet achieves roughly
double its blocking-mode throughput.
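A minimal sketch of the simultaneous exchange follows (illustrative only;
separate send and receive buffers are used so that the incoming message does
not overwrite the one still being sent, and MPI_Waitall() stands in for the
per-request waits the text mentions):

/* Minimal non-blocking bidirectional sketch: both ranks send and receive at once. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_size = 1048576, iters = 250;  /* assumed values */
    int rank;
    MPI_Request reqs[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;
    char *sbuf = malloc(msg_size), *rbuf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Irecv(rbuf, msg_size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sbuf, msg_size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* wait for both directions */
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)                  /* data moves both ways in every iteration */
        printf("Throughput: %.4f Mbps\n", 2.0 * msg_size * iters * 8 / elapsed / 1e6);
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}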
[ Non-blocking exponential throughput test ]
In exponential tests, the message size increases
exponentially from 1 Byte to 2^n Bytes, where n is set by the (-e) option.
We examine bidirectional exponential tests with a maximum message size of
64 MBytes (2^26) in the following examples:
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -in -e 26 -o output
mk3(Master-node) <--> mk4(Secondary-node)
Exponential non-blocking ping-pong (bidirectional) test
Test result: "output"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 19:08:32 2004
# Test mode: Exponential ping-pong (bidirectional) test
# Hosts: mk3 <----> mk4
# Non-blocking communication (MPI_Isend/MPI_Irecv)
#
# Message Overall Master-node M-process M-process Slave-node S-process S-process
# Size Throughput Iteration Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Bytes Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 0.4059 10000 0.39 0.05 0.11 0.39 0.02 0.12
2 0.8127 10000 0.39 0.04 0.12 0.39 0.04 0.12
4 1.6235 10000 0.39 0.03 0.12 0.39 0.05 0.12
8 3.2465 10000 0.39 0.05 0.10 0.39 0.03 0.16
16 6.1180 10000 0.42 0.05 0.14 0.42 0.06 0.18
32 11.8581 10000 0.43 0.07 0.13 0.43 0.02 0.12
64 23.1937 10000 0.44 0.04 0.16 0.44 0.11 0.12
128 45.0759 10000 0.45 0.00 0.02 0.45 0.03 0.10
256 84.0895 10000 0.49 0.07 0.08 0.49 0.05 0.13
512 154.2349 10000 0.53 0.04 0.19 0.53 0.07 0.16
1024 244.6253 10000 0.67 0.07 0.11 0.67 0.09 0.15
2048 372.3746 10000 0.88 0.05 0.15 0.88 0.04 0.18
4096 652.1813 10000 1.00 0.08 0.57 1.00 0.08 0.50
8192 793.8965 10000 1.65 0.09 0.83 1.65 0.17 0.78
16384 1143.5172 10000 2.29 0.19 1.81 2.29 0.15 1.82
32768 1013.1281 10000 5.17 0.29 3.74 5.18 0.32 3.47
65536 932.4241 4830 5.43 0.34 3.47 5.43 0.39 4.06
131072 885.8682 2223 5.26 0.37 3.35 5.26 0.41 4.08
262144 872.6263 1056 5.08 0.34 3.31 5.08 0.46 4.32
524288 874.1935 520 4.99 0.45 3.17 4.99 0.56 4.25
1048576 878.5305 260 4.97 0.39 3.26 4.97 0.66 4.14
2097152 873.4249 130 4.99 0.48 3.21 4.99 0.47 4.49
4194304 873.2915 65 4.99 0.38 3.33 5.00 0.52 4.46
8388608 873.6141 32 4.92 0.38 3.08 4.92 0.45 4.44
16777216 874.1411 16 4.91 0.37 3.06 4.91 0.62 4.30
33554432 874.0057 8 4.91 0.37 2.97 4.91 0.53 4.37
67108864 868.5910 5 6.18 0.42 3.85 6.18 0.64 5.52
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -ine 26 -o output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Exponential non-blocking ping-pong (bidirectional) test
Test result: "output.txt"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 19:17:54 2004
# Test mode: Exponential ping-pong (bidirectional) test
# Hosts: mk3 <----> mk4
# Non-blocking communication (MPI_Isend/MPI_Irecv)
#
# Message Overall Master-node M-process M-process Slave-node S-process S-process
# Size Throughput Iteration Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Bytes Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 1.6293 10000 0.10 0.10 0.00 0.10 0.09 0.00
2 3.2537 10000 0.10 0.10 0.00 0.10 0.10 0.00
4 6.4752 10000 0.10 0.10 0.00 0.10 0.10 0.00
8 13.0232 10000 0.10 0.10 0.00 0.10 0.10 0.00
16 25.7786 10000 0.10 0.10 0.00 0.10 0.10 0.00
32 51.1105 10000 0.10 0.10 0.00 0.10 0.10 0.00
64 100.3291 10000 0.10 0.10 0.00 0.10 0.10 0.00
128 187.9520 10000 0.11 0.11 0.00 0.11 0.11 0.00
256 311.8096 10000 0.13 0.13 0.00 0.13 0.13 0.00
512 547.8572 10000 0.15 0.15 0.00 0.15 0.15 0.00
1024 850.1319 10000 0.19 0.19 0.00 0.19 0.19 0.00
2048 1282.9666 10000 0.26 0.26 0.00 0.26 0.26 0.00
4096 1694.9692 10000 0.39 0.39 0.00 0.39 0.39 0.00
8192 2268.3214 10000 0.58 0.57 0.00 0.58 0.57 0.00
16384 2585.6188 10000 1.01 1.02 0.00 1.01 1.02 0.00
32768 2860.4786 10000 1.83 1.83 0.00 1.83 1.83 0.00
65536 3290.2138 10000 3.19 3.19 0.00 3.19 3.19 0.00
131072 3540.9363 7844 4.65 4.64 0.00 4.65 4.64 0.00
262144 3677.4967 4221 4.81 4.82 0.00 4.81 4.82 0.00
524288 3733.4466 2191 4.92 4.90 0.00 4.92 4.92 0.00
1048576 3775.4940 1112 4.94 4.95 0.00 4.94 4.94 0.00
2097152 3845.2595 562 4.90 4.90 0.00 4.90 4.91 0.00
4194304 3876.0652 286 4.95 4.94 0.00 4.95 4.95 0.00
8388608 3888.2378 144 4.97 4.96 0.01 4.97 4.96 0.00
16777216 3080.4510 72 6.27 6.26 0.02 6.27 6.26 0.01
33554432 3050.5405 28 4.93 4.89 0.03 4.93 4.90 0.03
67108864 3073.8384 14 4.89 4.85 0.04 4.89 4.82 0.06
[mk1 ~/hpcbench/mpi]$
[ Test with system log ]
Currently the system resource tracing functionality is only
available for Linux boxes. To enable system logging, you should specify both
the write option (-o) and the CPU logging option (-c). In the following
example, the file "output" records the test results, "output.m_log" logs the
master node's system information, and "output.s_log" logs the slave
(secondary) node's system information. The system logs have two more entries
than the test repetition count: the first entry records pre-test system
information and the last one records post-test system information.
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -r 5 -co output
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size blocking stream (unidirectional) test
Message-size: 1048576 Bytes iteration: 490 test-time: 5.000000 Seconds
(1) Throughput(Mbps): 817.6352 Message-size(Bytes): 1048576 Test-time: 5.03
(2) Throughput(Mbps): 819.4548 Message-size(Bytes): 1048576 Test-time: 5.02
(3) Throughput(Mbps): 819.4652 Message-size(Bytes): 1048576 Test-time: 5.02
(4) Throughput(Mbps): 819.4164 Message-size(Bytes): 1048576 Test-time: 5.02
(5) Throughput(Mbps): 819.4528 Message-size(Bytes): 1048576 Test-time: 5.02
Test result: "output"
Secondary node's syslog: "output.s_log"
Master node's syslog: "output.m_log"
Test done!
[mk1 ~/hpcbench/mpi]$ cat output
# MPI communication test -- Wed Jul 14 20:10:55 2004
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 513802240
# Message size (Bytes): 1048576
# Iteration : 490
# Test time: 5.000000
# Test repetition: 5
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 817.6352 5.03 0.05 1.84 5.03 0.88 3.06
2 819.4548 5.02 0.09 1.71 5.02 0.81 3.04
3 819.4652 5.02 0.08 1.81 5.02 0.83 3.18
4 819.4164 5.02 0.03 1.64 5.02 0.83 3.11
5 819.4528 5.02 0.08 1.74 5.02 0.90 2.90
[mk1 ~/hpcbench/mpi]$ cat output.m_log
# mk3 syslog -- Wed Jul 14 20:10:55 2004
# Watch times: 7
# Network devices (interface): 2 ( loop eth0 )
# CPU number: 4
##### System info, statistics of network interface <loop> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <loop> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 12 175 24 0 104 0 0 0 0 0 0 0 0
1 18 0 18 12 181172 16 0 91453 0 0 0 0 0 0 0 0
2 14 0 13 12 181996 64 0 90028 0 0 0 0 0 0 0 0
3 15 0 14 12 181872 16 0 89691 0 0 0 0 0 0 0 0
4 14 0 14 12 181886 16 0 89903 0 0 0 0 0 0 0 0
5 14 0 14 12 181996 16 0 89858 0 0 0 0 0 0 0 0
6 0 0 0 12 172 0 0 104 0 0 0 0 0 0 0 0
##### System info, statistics of network interface <eth0> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <eth0> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 12 175 24 0 104 114 31262 135 35056 51 0 0 0
1 18 0 18 12 181172 16 0 91453 194432 14712281 387683 586438922 180568 0 0 0
2 14 0 13 12 181996 64 0 90028 171696 12064155 343253 519623902 181392 0 0 0
3 15 0 14 12 181872 16 0 89691 172493 12118663 344843 522033042 181295 0 0 0
4 14 0 14 12 181886 16 0 89903 171803 12072089 343458 520026646 181288 0 0 0
5 14 0 14 12 181996 16 0 89858 171785 12069006 343450 519975620 181385 0 0 0
6 0 0 0 12 172 0 0 104 28902 2031079 57721 87384210 55 0 0 0
## CPU workload distribution:
##
## CPU0 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 1.0 0.0 1.0 99.0 0.2 0.0 0.2 99.8
1 49.1 0.4 48.7 50.9 18.4 0.2 18.1 81.6
2 56.7 1.8 54.9 43.3 14.2 0.4 13.7 85.8
3 60.4 1.6 58.8 39.6 15.2 0.4 14.8 84.8
4 58.6 0.6 58.1 41.4 14.7 0.1 14.5 85.3
5 59.6 1.6 58.1 40.4 14.9 0.4 14.5 85.1
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU1 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
1 0.0 0.0 0.0 100.0 18.4 0.2 18.1 81.6
2 0.0 0.0 0.0 100.0 14.2 0.4 13.7 85.8
3 0.0 0.0 0.0 100.0 15.2 0.4 14.8 84.8
4 0.0 0.0 0.0 100.0 14.7 0.1 14.5 85.3
5 0.0 0.0 0.0 100.0 14.9 0.4 14.5 85.1
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU2 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
1 0.0 0.0 0.0 100.0 18.4 0.2 18.1 81.6
2 0.0 0.0 0.0 100.0 14.2 0.4 13.7 85.8
3 0.2 0.0 0.2 99.8 15.2 0.4 14.8 84.8
4 0.0 0.0 0.0 100.0 14.7 0.1 14.5 85.3
5 0.0 0.0 0.0 100.0 14.9 0.4 14.5 85.1
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU3 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
1 24.4 0.6 23.8 75.6 18.4 0.2 18.1 81.6
2 0.0 0.0 0.0 100.0 14.2 0.4 13.7 85.8
3 0.2 0.0 0.2 99.8 15.2 0.4 14.8 84.8
4 0.0 0.0 0.0 100.0 14.7 0.1 14.5 85.3
5 0.0 0.0 0.0 100.0 14.9 0.4 14.5 85.1
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
[mk1 ~/hpcbench/mpi]$ cat output.s_log
# mk4 syslog -- Wed Jul 14 20:09:59 2004
# Watch times: 7
# Network devices (interface): 2 ( loop eth0 )
# CPU number: 4
##### System info, statistics of network interface <loop> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <loop> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 10 197 0 0 124 0 0 0 0 0 0 0 0
1 38 4 33 11 339496 24 0 505587 0 0 0 0 0 0 0 0
2 37 4 33 11 340876 16 0 508579 0 0 0 0 0 0 0 0
3 38 4 34 11 341520 16 0 507959 0 0 0 0 0 0 0 0
4 37 4 33 11 341452 16 0 510991 0 0 0 0 0 0 0 0
5 37 4 33 11 341590 16 0 510688 0 0 0 0 0 0 0 0
6 0 0 0 11 194 0 0 124 0 0 0 0 0 0 0 0
##### System info, statistics of network interface <eth0> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <eth0> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 10 197 0 0 124 140 39002 141 34492 52 0 0 0
1 38 4 33 11 339496 24 0 505587 390093 590152444 195707 14799182 338843 0 0 0
2 37 4 33 11 340876 16 0 508579 343136 519496213 171651 12059178 340231 0 0 0
3 38 4 34 11 341520 16 0 507959 344903 522102320 172491 12117566 340862 0 0 0
4 37 4 33 11 341452 16 0 510991 343470 520085567 171836 12072898 340819 0 0 0
5 37 4 33 11 341590 16 0 510688 343460 519928360 171810 12070238 340942 0 0 0
6 0 0 0 11 194 0 0 124 55319 83722883 27656 1943130 53 0 0 0
## CPU workload distribution:
##
## CPU0 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
1 73.5 0.0 73.5 26.5 38.2 4.4 33.8 61.8
2 73.1 0.0 73.1 26.9 37.6 4.2 33.4 62.4
3 72.8 0.0 72.8 27.2 38.1 4.1 34.0 61.9
4 70.7 0.0 70.7 29.3 37.2 4.1 33.1 62.8
5 75.2 0.0 75.2 24.8 37.6 4.5 33.2 62.4
6 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
## CPU1 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 1.0 0.0 1.0 99.0 0.2 0.0 0.2 99.8
1 1.0 0.2 0.8 99.0 38.2 4.4 33.8 61.8
2 1.0 0.6 0.4 99.0 37.6 4.2 33.4 62.4
3 0.2 0.0 0.2 99.8 38.1 4.1 34.0 61.9
4 0.0 0.0 0.0 100.0 37.2 4.1 33.1 62.8
5 0.0 0.0 0.0 100.0 37.6 4.5 33.2 62.4
6 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
## CPU2 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
1 78.2 17.4 60.8 21.8 38.2 4.4 33.8 61.8
2 76.2 16.0 60.2 23.8 37.6 4.2 33.4 62.4
3 79.6 16.5 63.1 20.4 38.1 4.1 34.0 61.9
4 78.0 16.4 61.6 22.0 37.2 4.1 33.1 62.8
5 75.4 17.9 57.5 24.6 37.6 4.5 33.2 62.4
6 1.0 0.0 1.0 99.0 0.2 0.0 0.2 99.8
## CPU3 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
1 0.0 0.0 0.0 100.0 38.2 4.4 33.8 61.8
2 0.0 0.0 0.0 100.0 37.6 4.2 33.4 62.4
3 0.0 0.0 0.0 100.0 38.1 4.1 34.0 61.9
4 0.0 0.0 0.0 100.0 37.2 4.1 33.1 62.8
5 0.0 0.0 0.0 100.0 37.6 4.5 33.2 62.4
6 0.0 0.0 0.0 100.0 0.2 0.0 0.2 99.8
[mk1 ~/hpcbench/mpi]$
In the GE blocking stream MPI tests, the master node has about 15% CPU
usage and mainly consumes CPU0 system time, while the slave (secondary) node
has about 37% CPU usage and distributes its workload over CPU0 and CPU2.
Let's examine the Myrinet interconnect:
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -r 5 -co output.txt
mk3(Master-node) <--> mk4(Secondary-node)
Fixed-size blocking stream (unidirectional) test
Message-size: 1048576 Bytes iteration: 1138 test-time: 5.000000 Seconds
(1) Throughput(Mbps): 1914.4257 Message-size(Bytes): 1048576 Test-time: 4.99
(2) Throughput(Mbps): 1914.5681 Message-size(Bytes): 1048576 Test-time: 4.99
(3) Throughput(Mbps): 1914.1915 Message-size(Bytes): 1048576 Test-time: 4.99
(4) Throughput(Mbps): 1914.5889 Message-size(Bytes): 1048576 Test-time: 4.99
(5) Throughput(Mbps): 1914.4461 Message-size(Bytes): 1048576 Test-time: 4.99
Test result: "output.txt"
Master node's syslog: "output.txt.m_log"
Test done!
Secondary node's syslog: "output.txt.s_log"
[mk1 ~/hpcbench/mpi]$ cat output.txt
# MPI communication test -- Wed Jul 14 20:20:44 2004
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: mk3 <----> mk4
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 1193279488
# Message size (Bytes): 1048576
# Iteration : 1138
# Test time: 5.000000
# Test repetition: 5
#
# Overall Master-node M-process M-process Slave-node S-process S-process
# Throughput Elapsed-time User-mode Sys-mode Elapsed-time User-mode Sys-mode
# Mbps Seconds Seconds Seconds Seconds Seconds Seconds
1 1914.4257 4.99 4.98 0.00 4.99 4.99 0.00
2 1914.5681 4.99 4.98 0.00 4.99 4.99 0.00
3 1914.1915 4.99 4.98 0.00 4.99 4.99 0.00
4 1914.5889 4.99 4.98 0.00 4.99 5.00 0.00
5 1914.4461 4.99 4.98 0.00 4.99 4.99 0.00
[mk1 ~/hpcbench/mpi]$ cat output.txt.m_log
# mk3 syslog -- Wed Jul 14 20:20:44 2004
# Watch times: 7
# Network devices (interface): 2 ( loop eth0 )
# CPU number: 4
##### System info, statistics of network interface <loop> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <loop> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 14 180 0 0 106 0 0 0 0 0 0 0 0
1 24 24 0 14 827 16 0 455 0 0 0 0 0 0 0 0
2 24 24 0 14 851 16 0 493 0 0 0 0 0 0 0 0
3 24 24 0 14 884 16 0 527 0 0 0 0 0 0 0 0
4 25 24 0 14 876 16 0 521 0 0 0 0 0 0 0 0
5 24 24 0 14 885 16 0 537 0 0 0 0 0 0 0 0
6 0 0 0 14 177 0 0 104 0 0 0 0 0 0 0 0
##### System info, statistics of network interface <eth0> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <eth0> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 14 180 0 0 106 27 3633 30 3360 51 0 0 0
1 24 24 0 14 827 16 0 455 100 13491 106 12428 237 0 0 0
2 24 24 0 14 851 16 0 493 99 13427 106 12428 249 0 0 0
3 24 24 0 14 884 16 0 527 99 13427 106 12428 248 0 0 0
4 25 24 0 14 876 16 0 521 83 11162 88 10352 250 0 0 0
5 24 24 0 14 885 16 0 537 96 13235 106 12428 244 0 0 0
6 0 0 0 14 177 0 0 104 26 3547 28 4520 51 0 0 0
## CPU workload distribution:
##
## CPU0 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
2 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
3 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
4 0.2 0.0 0.2 99.8 25.0 24.9 0.1 75.0
5 0.0 0.0 0.0 100.0 24.9 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU1 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
2 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
3 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
4 0.0 0.0 0.0 100.0 25.0 24.9 0.1 75.0
5 0.0 0.0 0.0 100.0 24.9 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU2 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
2 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
3 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.1
4 0.0 0.0 0.0 100.0 25.0 24.9 0.1 75.0
5 0.2 0.0 0.2 99.8 24.9 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU3 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 99.6 99.6 0.0 0.4 24.9 24.9 0.0 75.1
2 99.6 99.6 0.0 0.4 24.9 24.9 0.0 75.1
3 99.4 99.4 0.0 0.6 24.9 24.9 0.0 75.1
4 99.8 99.6 0.2 0.2 25.0 24.9 0.1 75.0
5 99.6 99.6 0.0 0.4 24.9 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
[mk1 ~/hpcbench/mpi]$ cat output.txt.s_log
# mk4 syslog -- Wed Jul 14 20:19:47 2004
# Watch times: 7
# Network devices (interface): 2 ( loop eth0 )
# CPU number: 4
##### System info, statistics of network interface <loop> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <loop> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 12 191 40 0 126 0 0 0 0 0 0 0 0
1 24 24 0 12 951 16 0 580 0 0 0 0 0 0 0 0
2 24 24 0 12 961 16 0 600 0 0 0 0 0 0 0 0
3 24 24 0 12 953 16 0 590 0 0 0 0 0 0 0 0
4 25 24 0 12 925 16 0 562 0 0 0 0 0 0 0 0
5 25 24 0 12 928 16 0 556 0 0 0 0 0 0 0 0
6 0 0 0 12 198 0 0 122 0 0 0 0 0 0 0 0
##### System info, statistics of network interface <eth0> and its interrupts to each CPU #####
# CPU(%) Mem(%) Interrupt Page Swap Context <eth0> information
# Load User Sys Usage Overall In/out In/out Swtich RecvPkg RecvByte SentPkg SentByte Int-CPU0 Int-CPU1 Int-CPU2 Int-CPU3
0 0 0 0 12 191 40 0 126 27 3633 30 3360 51 0 0 0
1 24 24 0 12 951 16 0 580 99 13421 105 12230 249 0 0 0
2 24 24 0 12 961 16 0 600 98 13357 105 12230 248 0 0 0
3 24 24 0 12 953 16 0 590 79 10724 84 9784 245 0 0 0
4 25 24 0 12 925 16 0 562 99 13421 105 12230 247 0 0 0
5 25 24 0 12 928 16 0 556 95 13165 105 12230 242 0 0 0
6 0 0 0 12 198 0 0 122 21 2761 21 2446 52 0 0 0
## CPU workload distribution:
##
## CPU0 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
2 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
3 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
4 0.0 0.0 0.0 100.0 25.0 25.0 0.0 75.0
5 0.2 0.0 0.2 99.8 25.1 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU1 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
2 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
3 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
4 0.2 0.0 0.2 99.8 25.0 25.0 0.0 75.0
5 0.0 0.0 0.0 100.0 25.1 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU2 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
2 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
3 0.0 0.0 0.0 100.0 24.9 24.9 0.0 75.0
4 0.0 0.0 0.0 100.0 25.0 25.0 0.0 75.0
5 0.2 0.0 0.2 99.8 25.1 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
## CPU3 workload (%) Overall CPU workload (%)
# < load user system idle > < load user system idle >
0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
1 99.8 99.8 0.0 0.2 24.9 24.9 0.0 75.0
2 99.8 99.8 0.0 0.2 24.9 24.9 0.0 75.0
3 99.8 99.8 0.0 0.2 24.9 24.9 0.0 75.0
4 99.8 99.8 0.0 0.2 25.0 25.0 0.0 75.0
5 99.8 99.8 0.0 0.2 25.1 24.9 0.1 75.0
6 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0
[mk1 ~/hpcbench/mpi]$
Wow, with Myrinet communication the master node's and the slave (secondary)
node's CPU usages are almost the same! Both exclusively occupy CPU3's clock
cycles, and all process time is spent in user mode: no system calls at all in
the Myrinet communication! Is this the power of the zero-copy technique?
[ Latency (Roundtrip time) test ]
This test is like an MPI version of "ping". We test the MPI
roundtrip time with the default message size (64 Bytes) and with 1 KByte of
data:
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -a -r 5
mk3(Master-node) <--> mk4(Secondary-node)
MPI communicaiton Round Trip Time (latency) test
MPI Round Trip Time (1) : 72.694 usec
MPI Round Trip Time (2) : 72.716 usec
MPI Round Trip Time (3) : 72.763 usec
MPI Round Trip Time (4) : 72.541 usec
MPI Round Trip Time (5) : 72.749 usec
Message size (Bytes) : 64
MPI RTT min/avg/max = 72.541/72.693/72.763 usec
Test done!
[mk1 ~/hpcbench/mpi]$ ge-mpirun -np 2 -machinefile machine-file ge-mpitest -A 1k -r 5
mk3(Master-node) <--> mk4(Secondary-node)
MPI communicaiton Round Trip Time (latency) test
MPI Round Trip Time (1) : 124.785 usec
MPI Round Trip Time (2) : 124.601 usec
MPI Round Trip Time (3) : 124.611 usec
MPI Round Trip Time (4) : 124.756 usec
MPI Round Trip Time (5) : 124.623 usec
Message size (Bytes) : 1024
MPI RTT min/avg/max = 124.601/124.675/124.785 usec
Test done!
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -a -r 5
mk3(Master-node) <--> mk4(Secondary-node)
MPI communicaiton Round Trip Time (latency) test
MPI Round Trip Time (1) : 14.247 usec
MPI Round Trip Time (2) : 14.245 usec
MPI Round Trip Time (3) : 14.243 usec
MPI Round Trip Time (4) : 14.245 usec
MPI Round Trip Time (5) : 14.255 usec
Message size (Bytes) : 64
MPI RTT min/avg/max = 14.243/14.247/14.255 usec
Test done!
[mk1 ~/hpcbench/mpi]$ mpirun -np 2 -machinefile machine-file mpitest -A 1k -r 5
mk3(Master-node) <--> mk4(Secondary-node)
MPI communicaiton Round Trip Time (latency) test
MPI Round Trip Time (1) : 32.261 usec
MPI Round Trip Time (2) : 32.312 usec
MPI Round Trip Time (3) : 32.328 usec
MPI Round Trip Time (4) : 32.261 usec
MPI Round Trip Time (5) : 32.246 usec
Message size (Bytes) : 1024
MPI RTT min/avg/max = 32.246/32.282/32.328 usec
Test done!
[mk1 ~/hpcbench/mpi]$
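Conceptually, each reported RTT is the wall-clock time of one small blocking
ping-pong exchange. The sketch below averages over many round trips; the
message size, loop count, and averaging are assumptions for illustration, not
the actual mpitest implementation:

/* Minimal MPI round-trip time sketch -- illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int msg_size = 64, loops = 1000;  /* assumed values */
    char buf[1024];                         /* large enough for the sizes used here */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < loops; i++) {
        if (rank == 0) {                    /* master: ping, then wait for the pong */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                            /* secondary: echo the message back */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt_usec = (MPI_Wtime() - start) / loops * 1e6;

    if (rank == 0)
        printf("Average RTT: %.3f usec (message size %d Bytes)\n", rtt_usec, msg_size);
    MPI_Finalize();
    return 0;
}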
[ Plot data ]
If the write option (-o) and the plot option (-P) are both specified,
a plotting configuration file named "output.plot" will be created. Use
gnuplot to plot the data or to create PostScript files of the plots: