Research journal

Table of Contents

1 2017

1.1 2017-02 February

1.1.1 2017-02-06 Monday

  1. DONE Read An Evaluation of Network Architectures for Next Generation Supercomputers   PAPER

    Bibtex: Chen:2016:ENA:3019057.3019059

    • The authors roughly do what we want to do: they use a simulator to do performance evaluation of different topologies, with different workloads and routing algorithms.
    • In a first part, they describe these topologies, routing algorithms and workloads. This could give us some ideas of what to test. Maybe we could try to reproduce their results?
    • They focus on topologies that:
      • Provide full uniform bandwidth.
      • Have good partitionability and can be grown modularly.
      • Come at a lower cost than a 3-level fat tree (the state of the art in terms of pure performance).
    • They test adversarial traffic (task i sends to task (i+D) mod G, with D tuned to “be bad”); a sketch of this pattern follows these notes.
      • The fat tree performs very well, regardless of the routing algorithm.
      • The other topologies (Dragonfly+, Stacked all-to-all, Stacked 2D HyperX) perform terribly with direct routing. With indirect or adaptive routing, performance is much better (but still a factor of 2 below the fat tree).
    • Then, they test neighbor traffic (the logical topology is a grid, for instance).
      • Again, the fat tree achieves nearly full performance, regardless of the routing algorithm.
      • The other topologies have lower performance with indirect routing. Their performance is acceptable with direct or adaptive routing.
    • Next, they look at AMR.
      • Here, all topologies and routing algorithms perform poorly.
      • The average throughput is high at the beginning, but decreases very quickly to nearly 0. This long tail with low throughput accounts for the major part of the execution time.
      • Thus, AMR seems to be inherently hard to parallelize efficiently.
    • To sum up, the best routing algorithm is adaptive routing (except maybe for the fat tree), and the best topology is the fat tree.
    • The authors then had a look at random mappings of the processes to the nodes (until now, the mapping was ideal). This could reflect what a scheduler that is not topology-aware would do. In general, with adaptive routing, the fat tree and the Dragonfly+ are very robust to irregular placements: the completion time is not impacted too much. This is not the case for the stacked topologies (due to a lack of path diversity). Thus, we should use a topology-aware job scheduler, especially for stacked topologies. With non-adaptive routing, all topologies suffer performance degradation.
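
    A minimal sketch of the adversarial shift pattern above (my own illustrative code, not the authors’ benchmark): every rank sends a large message to rank (i+D) mod G. The shift D = G/2 and the message size are hypothetical choices.

      #include <stdlib.h>
      #include <mpi.h>

      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          int rank, G;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &G);
          int D = G / 2;                  /* shift chosen to stress the top links */
          int dst = (rank + D) % G;
          int src = (rank - D + G) % G;
          int len = 1 << 20;              /* 1 MiB message, arbitrary */
          char *out = calloc(len, 1), *in = malloc(len);
          /* Each rank sends to (rank+D) mod G and receives from (rank-D) mod G. */
          MPI_Sendrecv(out, len, MPI_CHAR, dst, 0, in, len, MPI_CHAR, src, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          free(out); free(in);
          MPI_Finalize();
          return 0;
      }
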
  2. DONE Talk with Arnaud about the internship.   MEETING
    1. Two possible things to have a look at.
      • Simulate the impact of network failures on performance.
        • May need to work on Simgrid implementation, to handle failures.
        • A recent paper has shown that, in their case, removing one of the two root switches of their fat tree did not impact significantly the performances.
        • A reason is that jobs rarely occupy the full tree; they are localized in one of its sub-trees. Thus, nearly no communication goes through the top switches.
      • Model the Stampede supercomputer in Simgrid.
        • It uses a fat tree topology.
        • We have access to real benchmark results.
        • We have access to its configuration (e.g. OS and compiler used).
    2. A first step for this internship would be to run HPL on a fat tree, with Simgrid.
    3. Some features of Simgrid to speed up a simulation.
      • A macro to run only the first iterations of a loop and infer the total time from them.
      • An allocator (replacing malloc/free) to share memory between processes.
    4. Some technical details about Simgrid
      • For every process, we run each piece of code until we reach an MPI operation. This gives us the execution time of this code block.
      • We know all the communication flows of the “current step”, thanks to the routing. We thus have a list of linear constraints (e.g. the bandwidths of all flows going through the same link must not exceed the capacity of that link). We solve this by maximizing the minimum bandwidth of any flow (empirically, this is close to reality, where flows get a fair share of the resources); see the sketch after this list.
      • Routing is done with an AS hierarchy. There are local routing decisions (within an AS) and global routing decisions (between two ASes).
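
      A toy version of this bandwidth sharing, to fix ideas (my own sketch of the classical max-min water-filling computation, not Simgrid’s actual solver; the flows, links and capacities are made up):

        #include <stdio.h>

        #define NB_FLOWS 3
        #define NB_LINKS 2

        int main(void) {
            double capacity[NB_LINKS] = {10.0, 10.0};
            /* uses[f][l] is 1 iff flow f goes through link l */
            int uses[NB_FLOWS][NB_LINKS] = {{1, 0}, {1, 1}, {0, 1}};
            double bw[NB_FLOWS] = {0};
            int done[NB_FLOWS] = {0};
            int remaining = NB_FLOWS;
            while (remaining > 0) {
                /* Find the bottleneck link: smallest capacity per active flow. */
                int bottleneck = -1;
                double share = 0;
                for (int l = 0; l < NB_LINKS; l++) {
                    int active = 0;
                    for (int f = 0; f < NB_FLOWS; f++)
                        if (!done[f] && uses[f][l])
                            active++;
                    if (active == 0)
                        continue;
                    double s = capacity[l] / active;
                    if (bottleneck == -1 || s < share) {
                        bottleneck = l;
                        share = s;
                    }
                }
                if (bottleneck == -1)
                    break;
                /* Freeze the flows crossing the bottleneck at this share, and
                   subtract their bandwidth from every link they traverse. */
                for (int f = 0; f < NB_FLOWS; f++) {
                    if (done[f] || !uses[f][bottleneck])
                        continue;
                    bw[f] = share;
                    done[f] = 1;
                    remaining--;
                    for (int l = 0; l < NB_LINKS; l++)
                        if (uses[f][l])
                            capacity[l] -= share;
                }
            }
            for (int f = 0; f < NB_FLOWS; f++)
                printf("flow %d gets %.2f\n", f, bw[f]);
            return 0;
        }
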
    5. Other simulators exist.

      Mainly CODES/ROSS. These are discrete event simulators, so they consider the problem at a lower level. But being too precise has some drawbacks:

      • The exact version of every piece of code can have noticeable impact → tedious to calibrate.
      • The simulation takes much more time and does not scale as well as Simgrid.

1.1.2 2017-02-07 Tuesday

  1. DONE Begin writing a journal \o/
  2. DONE Read Characterizing Parallel Scientific Applications on Commodity Clusters: An Empirical Study of a Tapered Fat-Tree   PAPER

    Bibtex: Leon:2016:CPS:3014904.3015009

    • The authors want to characterize the behavior of applications that run on their clusters, with an emphasis on communication requirements.
    • This should help to make more informed choices when building new clusters (should we use our budget to get more links or more nodes?).
    • They measured the utilization of their cluster during one week. It has a fat tree topology. The measurements show that the network is not used very much: the maximal link utilization is approximately 50%, the average link utilization is 2.4%.
    • They did the same measurements with a tapered fat tree (they removed one of the root switches). Except for some outliers reaching a 90% link utilization at some point, this modification did not have a major impact on the link utilization, which was 3.9%.
    • The authors recorded which types of jobs were submitted. A great majority of them were really small: 95% of jobs use at most 16 nodes, and 76% use only one node. Jobs of less than 64 nodes consume 75% of the time. Thus, if the jobs are well placed, the need for distant communications is very low, which explains the good performance of the tapered fat tree. Of course this may change from one cluster to another, so we should reproduce these measurements and draw our own conclusions.
    • Then, the authors removed one of their two top switches.
    • A first micro-benchmark shows that it only impacts the aggregate bisection bandwidth, for large messages (> 32kB).
    • Then, they evaluated the impact of the tapering on the performances of several “real-life” applications.
    • They found that only one of these applications was sensitive to the tapering. This application does collective communications as well as point-to-point communications of large messages.
    • However, the impact on the execution time of this application remains small: only 1-2% (it impacts its communication time by 6-7.5%, which itself accounts for only 9-15% of the execution time). Furthermore, this only happens for a large number of nodes (> 512).
    • Finally, the authors claim that next generation hardware (faster CPU, memory and network, accelerators…) will lead to some rewriting of the applications to leverage this new hardware. In some applications, message sizes will be larger. Thus, a tapered fat tree may have more impact with this new hardware; new experiments will be needed to find out.
  3. DONE Some thoughts regarding previous paper, to discuss with Arnaud   PAPER MEETING
    1. Can we have data about the utilization of clusters we will work on (Stampede, Bull)?
      • It would help us to formulate relevant hypotheses (e.g. “pruning the fat-tree will not have any impact”).
      • We need this for the simulation. What kind of jobs should we run? Small ones? Large ones?
    2. Can we have data about the nature of the jobs submitted on these clusters?
      • What are these applications?
      • What fraction of the time do they use for communications?
      • Small or large messages?
      • Again, it will help us to make hypotheses and perform meaningful experiments.

      → It changes a lot from one cluster to another, or even across time. It is also hard to record (a batch scheduler does not know the nature of the jobs that it handles).

    3. How to simulate “big nodes”?
      • Can we simulate MPI+OpenMP programs with Simgrid?
      • The paper from Christian explains briefly how Simgrid simulates multi-core machines (with one MPI process per core, no threads). Why don't they talk about it in the other paper? Both papers are from the same year.

      → It would be very hard to support OpenMP in Simgrid, the standard is quite big. Also, in OpenMP, communications are made through shared memory, so they are much more difficult to track than MPI communications.

    4. Are time-independent traces larger than the “classical” traces?

      → No, same thing.

    5. With ScalaTrace, traces have “near-constant size”. How?

      → Compression, lossless or lossy.

    6. What is “detached mode” in point-to-point communication?
      • Does the OS of the sender interrupt it, to ask it to send the data?
      • If so, why is large mode slower for the sender? In detached mode, the sender has to stop what it is doing, whereas in synchronous mode it is waiting.

      → Yes, the kernel interrupts the sender when the receiver is ready. Simgrid does not model the small messages used for the synchronization.

    7. What does the community think of closed source simulators, like xSim? Researchers behind xSim are doing strong claims that cannot be verified by independent researchers…
    8. Why are there never confidence intervals in the plots of Simgrid's papers?

      → They are often not needed, because the variability is too small.

    9. About the paper Git/Org-mode
      • Is there an implementation somewhere? Creating custom git commands seems really easy. → Yes, but not packaged yet. To test the “beta version”, ask Vincent.
      • Was not convinced by subsection 4.3.3 (“fixing code”). When terminating an experiment, they revert all the changes made to the source code, since these may be ad hoc changes. Then the user has to cherry-pick the changes (s)he wants to keep. Sounds really dirty… It seems better to have generic scripts that you configure by giving command line arguments and/or configuration files. Then you can simply put these arguments/files in the journal.
    10. Routing in Simgrid (according to the doc)
      • Routing tables are static (to achieve high performance). → Does it mean that handling link failures and dynamic re-routing will require a large code refactoring? What about the performance penalty?
      • Routing algorithms are either based on shortest paths (e.g. Floyd, Dijkstra) or manually entered. What about “classical” algorithms like D-mod-K (sketched below, after this item)? An example is provided on Github. The example implements a two-level fat-tree with D-mod-K. However, D-mod-K is not specified in the XML; it seems to be implicit. Does it mean that we are forced to use this routing algorithm for fat trees?

      → Read the code. Shortest path routing is a feature introduced by some Belgian researchers. For specific topologies like fat-trees, the routing algorithm is hard-coded.
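
      For reference, a hedged sketch of the D-mod-K idea (my reconstruction of the principle, not Simgrid’s code): the up-port taken at each level is a deterministic function of the destination, so that destinations spread evenly over the upper switches.

        /* Toy D-mod-K port selection for a fat-tree: shift the destination
           index by the number of up ports of the lower levels, then take it
           modulo the number of up ports at the current level. */
        static int dmodk_up_port(int dest, const int *nb_up_ports, int level) {
            for (int l = 0; l < level; l++)
                dest /= nb_up_ports[l];
            return dest % nb_up_ports[level];
        }
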

  4. DONE Read Simulating MPI applications: the SMPI approach   PAPER

    Bibtex: degomme:hal-01415484

    • This paper is about simulation of HPC systems.
    • The authors claim that some research papers are based on simulations made with one-off programs that have poor documentation and make simplifying assumptions. Worse, these programs are sometimes not public. This is a big issue for reproducibility.
    • The whole paper considers several important aspects that a good simulator should take care of.
    • Several use cases for simulation.
      • Quantitative performance evaluation (what would the performance be with a bigger version of our hardware?).
      • Qualitative performance evaluation (what would the performance be with different hardware?).
      • Detection of hardware misconfiguration (leading to unexpected performance behaviors).
      • MPI runtime tuning (e.g. choosing the algorithms of MPI collective operations).
      • Teaching (supercomputers are expensive, we cannot let the students play with them).
    1. Capturing the behavior of an application.
      • Off-line simulation. A trace of MPI communication events is first obtained and then replayed.
        • We measure the durations of the CPU bursts. Then, when replaying the application, we modify them to account for the performance differences between the target platform and the platform used to get the traces.
        • One problem is the size of the traces, which can be very large. To fix this, we may only record aggregated statistics. They can be enough to detect some anomalies, but we cannot do more in-depth analysis.
        • Another issue is extrapolation. Being able to extrapolate in the general case requires assumptions that are hard to justify.
        • In SMPI, they use “time-independent traces”. Instead of recording time durations, they log the number of instructions and the number of bytes transferred by MPI primitives. These are independent of the hardware, so the extrapolation issue is fixed.
        • It does not solve anything for applications that adapt their behavior to the platform. But this is hopeless with off-line simulation.
        • There is still the issue of very large traces, they grow linearly with the problem size and the number of processes. It seems to be fixed by ScalaTrace, but no explanation is given.
      • On-line simulation. The actual application code is executed, part of the instruction stream is intercepted and passed to a simulator.
        • Several challenges: intercepting MPI calls, interactions between the application and the simulation kernel, obtaining full coverage of the MPI standard, over-subscribing resources.
        • Several possibilities to capture MPI calls. Use the PMPI interface (provided by every MPI implementation), but it is limited to the high-level calls (see the sketch after this list). Design a specific MPICH or OpenMPI driver, but this ties the solution to a specific implementation. One can also develop an ad hoc implementation of the MPI standard.
        • Many tools fold the application into a single process with several threads. This raises an issue for global variables: they must be protected. One can duplicate the memory area of the global variables, or use a trick based on the Global Offset Table (GOT).
        • SMPI is based on a complete reimplementation of the MPI standard. No full coverage yet (e.g. remote memory access or multithreaded MPI applications).
        • They run MPICH's internal compliance tests as part of their automatic testing.
        • To protect global variables, they duplicate their memory zone using mmap (smart, and much more efficient thanks to COW).
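
      To make the PMPI mechanism concrete, here is a minimal interception sketch (illustrative only; the logging is a placeholder for whatever a tracer or simulator would do). Linking such a wrapper before the MPI library shadows MPI_Send, while the real implementation stays reachable as PMPI_Send.

        #include <stdio.h>
        #include <mpi.h>

        int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm) {
            int size;
            MPI_Type_size(datatype, &size);
            /* A tracer would record the event here (bytes, destination...). */
            fprintf(stderr, "MPI_Send: %d bytes to rank %d\n", count * size, dest);
            return PMPI_Send(buf, count, datatype, dest, tag, comm);
        }
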
    2. Modeling the infrastructure (network and CPU)
      • Network modeling.
        • Several solutions exist to model the network.
        • Packet-level simulation: here we look at individual packets. It is very precise, but it is hard to know precisely what we are modeling. Being precise with a wrong model is useless. Moreover, this model is very costly in terms of simulation time.
        • Flow model. The finest grain here is the communication. Time to transfer a message of size S from i to j: L(i,j) + S/B(i,j). The B(i,j) are not constant; they need to be re-evaluated at every moment. This model catches some complex behaviors (e.g. RTT unfairness of TCP). It is quite complex to implement and more costly than the delay model. Also, until recently, contention could be neglected.
        • Delay model: we have some equations to describe the communication times (e.g. LogP, LogGPS). It is elegant and cheap in terms of simulation time, but very imprecise. It does not take into account the network topology (and potential contention), and it assumes a processor can only send one message at a time (single-port model).
        • SMPI uses a hybrid network model. Point-to-point communications are divided into three modes: asynchronous, detached and synchronous. Each mode has different values of bandwidth and latency, estimated by running benchmarks and then a linear regression (a toy version is sketched at the end of this item).
        • To model network contention, SMPI has three logical links for each physical link: a downlink, an uplink, and a limiter link. The bandwidth of uploads (resp. downloads) must be lower than the capacity of the uplink (resp. downlink). The sum of the bandwidths must be lower than the capacity of the limiter link.
      • CPU modeling.
        • Like network modeling, several solutions.
        • Microscopic models, very precise, but also very costly.
        • Models with a coarser grain. For instance, we neglect the CPU load induced by communications → focus on Sequential Execution Blocks (SEB).
        • Most simplistic model: “CPU A is x times faster than CPU B”. Results are ok for similar architectures, but not precise at all if they are too different (e.g. number of registers, number of hyperthreading cores, speed of floating point computations, bandwidth to memory, etc.).
        • Thus, impossible to predict precisely without a perfect knowledge of the system state (and therefore a microscopic model).
        • Approach of SMPI: run the SEBs on a processor of the target architecture. Predict the performance of similar architectures by applying a constant factor.
        • Also, not all the code logic is data dependent. We can therefore greatly decrease the simulation time with two tricks.
          • Kernel sampling. Annotate some regions with macros. Execute them only a few times to obtain estimations, then skip them.
          • Memory folding. Share some data structures across processes.
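
      A toy rendition of the hybrid point-to-point model above (the thresholds and values are made up; SMPI estimates the real ones by benchmarking and piecewise linear regression):

        #include <stdio.h>

        /* Estimated transfer time for a message of the given size in bytes,
           with mode-dependent latency and bandwidth. */
        static double comm_time(double size) {
            double latency, bandwidth;
            if (size < 65536) {            /* asynchronous mode (eager) */
                latency = 1e-6; bandwidth = 8e8;
            } else if (size < 327680) {    /* detached mode */
                latency = 2e-6; bandwidth = 9e8;
            } else {                       /* synchronous mode (rendezvous) */
                latency = 4e-6; bandwidth = 1e9;
            }
            return latency + size / bandwidth;
        }

        int main(void) {
            printf("1 KiB: %g s\n", comm_time(1024.0));
            printf("1 MiB: %g s\n", comm_time(1048576.0));
            return 0;
        }
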
    3. Modeling the collective operations
      • Again, several solutions for the modelization.
      • More analytical ones: each collective operation has a cost equation (depending for instance on the message size and the number of processes); see the sketch after this list. As discussed for the network modeling, such approaches do not capture potential network contention.
      • Another approach is to benchmark each collective operation on the target platform, with various parameters and communicators. Then, the obtained timings are reinjected in the simulation. We cannot do performance extrapolation with this approach. Also, the benchmarking phase may be very long.
      • Some replace every collective operation by the corresponding sequence of point-to-point communications (at compile time). This does not capture the logic of selecting the right algorithm.
      • Others capture this decomposition into point-to-point communication during the execution, then replay it. But this is limited to off-line analysis.
      • Simgrid implements all the collective algorithms and selection logic of both OpenMPI and MPICH. We are sure to capture the behavior of the operations correctly, but this represents a substantial amount of work. Another interesting feature is that the user can choose the selector or the algorithm from the command line.
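
      As a concrete example of the analytical approach (my illustration, not from the paper): under a simple alpha-beta delay model, a binomial-tree broadcast among P processes costs ceil(log2(P)) rounds of alpha + S*beta each, and contention is ignored entirely.

        #include <math.h>
        #include <stdio.h>

        /* Cost of a binomial-tree broadcast under a delay model:
           alpha = per-message latency, beta = per-byte cost. */
        static double bcast_cost(int nb_procs, double msg_size,
                                 double alpha, double beta) {
            return ceil(log2((double)nb_procs)) * (alpha + msg_size * beta);
        }

        int main(void) {
            /* 64 processes, 1 MB message, 1 us latency, 1 GB/s links (made up) */
            printf("estimated broadcast time: %g s\n",
                   bcast_cost(64, 1e6, 1e-6, 1e-9));
            return 0;
        }
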
    4. Efficient simulation engine
      • Rely on an efficient Discrete Event Simulation (DES) kernel.
      • Some simulators parallelize this part (using MPI). But this results in a more complex implementation.
      • In the way Simgrid works, there is not much potential parallelism. They therefore decided to keep a sequential DES.
      • Simulation cost comes from the application itself (which can be greatly reduced, cf. CPU modeling) and from the flow-level model.
    5. Evaluation

      Here, the authors show that the use cases mentioned at the beginning of the paper are all achieved by Simgrid.

      • Simgrid is very scalable, more than xSim, which is already one of the most scalable simulators (self-proclaimed).
      • Kernel sampling and memory folding enable simulations of non-trivial applications with a very large number of cores.
      • Then, the ability to make good predictions is demonstrated with a Mont-Blanc project example. Here, Simgrid is much closer to reality than the LogGPS model. However, no comparison is done with other simulators, so this result is hard to evaluate.
      • A quantitative performance extrapolation is demonstrated, showing good results.
      • Empirically, the largest error made by SMPI in terms of time prediction is 5%. This allows using SMPI to detect hardware misconfigurations. Indeed, this already happened to the Simgrid team.
      • Similarly to the previous point, the good accuracy of SMPI allows investigating which MPI parameters lead to the best performance.
      • Finally, for obvious reasons, using a simulator is great for teaching MPI (rather than using a real cluster).
    6. Conclusion
      • The paper focused on MPI applications.
      • But Simgrid has other use cases: formal verification of HPC applications, hybrid applications (MPI+CUDA).
  5. DONE Read Predicting the Performance and the Power Consumption of MPI Applications With SimGrid   PAPER

    Bibtex: heinrich:hal-01446134

    • The paper is about using Simgrid to predict energy consumption.
    • This is a challenging question; the modeling is tricky.
      • Power consumption of nodes has a static part (consumption of the node when idle) and a dynamic part.
      • The static part is very significant (~50%), so we should really do something when the core is idle.
      • A first solution is to power off the node, but powering it back on has a large latency.
      • Another solution is to use Dynamic Voltage and Frequency Scaling (DVFS). This is not limited to the case where the core is idle; it can also be used when the load is low but non-zero. Performance loss is linear in the decrease of the frequency, but the power consumption is quadratic.
      • No HPC simulator other than Simgrid embeds a power model yet.
    1. Modeling multi-core architecture
      • If two processes are on the same node (either the same core, or two cores of the same CPU), the simulation becomes tricky.
        • The “naive” approach is to simply give a fair share to each of these processes. But it does not take into account some memory effects.
        • Simgrid can be pessimistic for processes heavily exploiting the L1 cache. In the simulation, the cache will be cold after each MPI call; in reality the cache would be hot.
        • Simgrid can be optimistic for processes heavily exploiting the L3 cache and the memory. In the simulation, they will have exclusive access; in reality they will interfere with each other.
    2. Modeling energy consumption
      • The instantaneous power consumption is P(i,f,w)(u) = Pstatic(i,f) + Pdynamic(i,f,w) * u, for a machine i, a frequency f, a computational workload w and a usage u (a toy instantiation is sketched at the end of this item).
      • In general, we assume that Pstatic(i,f) = Pstatic(i) (in the idle state, the frequency does not matter).
      • Users can specify an arbitrary relation (linear in the usage) for each possible frequency (in general, these should be quadratic in the frequency, but this may change with new technologies).
      • Each machine can have its own model, accounting for heterogeneity in the platform.
      • Power consumption of each host is exposed to the application, allowing it to dynamically decide to change (or not) the current frequency.
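
      A toy instantiation of this model (the numbers are made up), with a dynamic part quadratic in the frequency as suggested above:

        #include <stdio.h>

        /* P(u) = Pstatic + Pdynamic(f) * u, with Pdynamic quadratic in the
           frequency ratio f (f = 1 means full frequency). */
        static double power(double p_static, double p_dyn_full,
                            double f, double u) {
            return p_static + p_dyn_full * f * f * u;
        }

        int main(void) {
            /* e.g. 95 W static, 125 W dynamic at full frequency and usage */
            printf("full frequency, full load: %.1f W\n", power(95, 125, 1.0, 1.0));
            printf("half frequency, full load: %.1f W\n", power(95, 125, 0.5, 1.0));
            return 0;
        }
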
    3. Modeling multicore computation
      • A first step is to run the target application with a small workload using all the cores of a single node, on the target platform.
      • Then, re-execute the application with the same workload on top of the simulator (hence using a single core).
      • From these measures, associate to each code region a speedup factor that should be applied when emulating.
      • In some applications, speedups are very close to 1. In other applications, some regions have a speedup of 0.16 while other regions have a speedup of 14. Not taking this into account can result in a large inaccuracy (~20-30%).
    4. Modeling the network
      • See the other paper for the details on the network model of SMPI.
      • The authors also speak about local communications, within a node. They are implemented with shared memory. The model here is also piecewise linear, but with less variability and higher speed. However, they did not implement this model, they kept the classical network model since local communications were rare enough.
    5. Validation
      • The authors obtain a very good accuracy for performance estimations (as stated in the previous paper).
      • For two of the three applications, they also have a very good accuracy for energy consumption estimations.
      • With the last application, the accuracy is bad. The reason is that the application (HPL) does busy waiting on communications (with MPI_Probe). The current model assumes that busy waiting does not cost energy.
    6. Experimental environment

      Minor modifications to the setup can have a major impact on the performance and/or the power consumption. The authors therefore give a list of settings to track.

      • Hardware. If we suppose that the cluster is homogeneous, it has to actually be the case. Two CPUs of the same type can still exhibit different performance (e.g. if they come from two different batches/factories).
      • Date of the measurements. A lot of things having an impact can change in time: temperature of the machine room, vibrations, BIOS and firmware version, etc.
      • Operating system. The whole software stack and how it is compiled can have a huge impact. Also, always observe a delay between the boot and the beginning of experiments.
      • Kernel configuration. For instance, its version, the scheduling algorithm, technologies like hyperthreading, etc.
      • The application itself and the runtime (e.g. the algorithms used for collective operations).
    7. Conclusion / future work
      • The approach to simulating power consumption is accurate only if the application is regular in time. To handle applications with very different computation patterns, we could specify the power consumption for each code region. But to do so, Simgrid has to be modified, and very precise measurements are needed to instantiate the model (impossible with the hardware of Grid'5000, whose sampling rate is only 1 Hz).
      • In Simgrid, we currently cannot have different network models at the same time, to account for local and remote communications. A refactoring of the code is underway to fix this.

1.1.3 2017-02-08 Wednesday

  1. DONE Paper reading.
    • Notes have been added in the relevant section.
    • One paper read today: “Simulating MPI applications: the SMPI approach”.

1.1.4 2017-02-09 Thursday

  1. TODO Read Scheduling for Large Scale Distributed Computing Systems: Approaches and Performance Evaluation Issues   PAPER

    Bibtex: legrand:tel-01247932

  2. DONE Read An Effective Git And Org-Mode Based Workflow For Reproducible Research   ORGMODE GIT PAPER

    Bibtex: stanisic:hal-01112795

    • A branching scheme for git, based on four types of branches.

      • One src branch, where the code to run the experiments is located. This branch is quite light.
      • One xp branch per experiment, which exists only during the period of the experiment. We can find here all the data specific to this experiment. Also a light branch, since it is limited to one experiment.
      • One data branch, in which all xp branches are merged when they are terminated. Quite a heavy branch, containing a lot of data.
      • One art branch per article, where only the code and data related to the article are pulled from the data branch.

      When an xp branch is merged in data and deleted, a tag is added. Then, we can easily checkout to this experiment in the future.

    • Org-mode used as a laboratory notebook. All details about the experiments (what, why, how…) are written here. Thanks to literate programming, the command lines to execute are also contained in the notebook.
  3. Presentation about org-mode by Christian.   ORGMODE
    • Have a per-day entry in the journal. If you work more than an hour without writing anything in the journal, there is an issue.
    • Put tags in the headlines, to be able to search them (e.g. :SMPI: or :PRESENTATION:). Search with the “match” keyword. Hierarchy of tags, described in the headline.
    • About papers: tags READ/UNREAD. Also include the bibtex in the file. Attach files to the org file (different from a simple link). Use C-a, then move.
    • Spacemacs: add a lot of stuff to evil mode.
    • Can also use tags to have links to entries; use the CUSTOM_ID property.
    • Can use org-mode to include code blocks.
  4. DONE Paper reading.
    • One paper read today: “Predicting the Performance and the Power Consumption of MPI Applications With SimGrid”.
    • Notes have been added in the relevant section.
  5. DONE Apply the things learnt at the org-mode presentation.

1.1.5 2017-02-10 Friday

  1. Tried to get good org-mode settings.   ORGMODE
  2. DONE Paper reading.
    • One paper read today: “An Effective Git and Org-Mode Based Workflow for Reproducible Research”.
    • Notes have been added in the relevant section.
  3. Begin looking at the documentation.
  4. Run a matrix product MPI code in a fat tree
    • Code from the parallel system course.
    • Tried Github example (fat tree 2;4,4;1,2;1,2, 2 levels and 16 nodes).
    • Tried a personal example (fat tree 3;4,4,4;1,4,2;1,1,1, 3 levels and 64 nodes).
  5. DONE Find something to automatically draw a fat tree.
    • Maybe some tools exist for this? I did not find one, however.
    • Maybe Simgrid has a way to export a topology in a graphical way? Would be very nice.
    • Could I adapt the Tikz code I wrote during my 2015 internship?

1.1.6 2017-02-13 Monday

  1. Keep working on the matrix product.   SMPI C BUG
    • Observe strange behavior.

      • Commit: 719a0fd1775340628ef8f1ec0e7aa4033470356b
      • Compilation: smpicc -O4 matmul.c -o matmul
      • Execution: smpirun --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 -np 64 -hostfile ./hostfile_64.txt -platform ./cluster_fat_tree_64.xml ./matmul 2000

      Then, processes 0 and 63 behave very differently than others.

      • Processes 0 and 63 have a communication time of about 0.21 and a computation time of about 1.52.
      • Other processes have a communication time of about 0.85 and a computation time of about 0.75.

      With other topologies and/or matrix sizes, we still have this behavior (more or less accentuated).

    • If we change the order of the loops of the sequential matrix product from i-j-k to k-i-j:
      • The execution time is shorter. Hypothesis: this solution has a better usage of the cache.
      • The computation times are decreased (expected), but the communication times are also decreased (unexpected).
      • We still observe the same trend as above for processes 0 and 63.
    • Checked with some printf calls: each process is the root of a row broadcast and of a column broadcast exactly once (expected).
    • Tried several broadcast algorithms (default, mpich, ompi), still have the same behavior.
    • Adding a call to MPI_Barrier at the beginning of the for loop fixes the issue for the communication (all processes now have a communication time of about 0.22) but not for the computation (still the same differences for processes 0 and 63).
    • When using a smaller number of processes (16 or 4), communication times and computation times are more consistent (with still some variability).
    • With one process and a matrix size of 250, we have a computation time of 0.10 to 0.12. When we have 64 processes and a matrix size of 2000, each block has a size of 250. Thus, we can extrapolate that the “normal” computation time in this case should be about 0.8 (8 iterations, so 8*0.10). Thus, processes 0 and 63 have an abnormal behavior; the others are ok.
    • Also tried other topologies, e.g. a simple cluster. Still have the same behavior (with different times).
      • Again, normal behavior with less processes (e.g. 16).
      • We get a normal behavior if we take hostfile_1600.txt, very strange.
    • Bug fixed, the problem came from the hostfile. For some unknown reason, it was missing an end-of-line character on the last line. I suspect that two processes (0 and 63) were therefore mapped to the same host, because the last host was not parsed correctly by smpi. The two versions of the file have been added to the repository.
    • Issue reported on Github.
  2. Try to optimize the matrix product code.   SMPI C
    • For the record, the following command yields communication times between 0.27 and 0.31 and computation times between 0.78 and 0.83, for a total time of about 1.14: smpirun --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 -np 64 -hostfile ./hostfile_64.txt -platform ./cluster_fat_tree_64.xml ./matmul 2000
    • Replaced malloc/free by SMPI_SHARED_MALLOC/SMPI_SHARED_FREE. Got similar times (approximately).
    • Added SMPI_SAMPLE_GLOBAL(0.5*size, 0.01) to the outer loop of the sequential matrix product. Got similar times (approximately). A sketch of both macros follows this item.
    • Remark: we should verify more rigorously that these optimizations do not change the estimated time.
    • Greatly reduced simulation time (from 8.2s to 0.5s).
    • Other optimization: stop initializing the content of the matrices (since we do not care about their content).
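
    A minimal sketch of how these two optimizations look in the code (a hypothetical fragment in the spirit of matmul.c, not the actual file; the SMPI_SAMPLE_GLOBAL parameters are the ones used above):

      #include <mpi.h>
      #include <smpi/smpi.h>  /* SMPI_* macros, available when building with smpicc */

      static void block_product(double *A, double *B, double *C, int size) {
          for (int k = 0; k < size; k++)
              /* Benchmark only a sample of the iterations (here 0.5*size of
                 them, with a 1% precision target) and extrapolate the rest. */
              SMPI_SAMPLE_GLOBAL(0.5 * size, 0.01) {
                  for (int i = 0; i < size; i++)
                      for (int j = 0; j < size; j++)
                          C[i * size + j] += A[i * size + k] * B[k * size + j];
              }
      }

      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          int size = 250;
          /* Shared allocations: all simulated processes map the same physical
             memory, so the contents are meaningless but the footprint is small. */
          double *A = SMPI_SHARED_MALLOC(size * size * sizeof(double));
          double *B = SMPI_SHARED_MALLOC(size * size * sizeof(double));
          double *C = SMPI_SHARED_MALLOC(size * size * sizeof(double));
          block_product(A, B, C, size);
          SMPI_SHARED_FREE(A); SMPI_SHARED_FREE(B); SMPI_SHARED_FREE(C);
          MPI_Finalize();
          return 0;
      }
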
  3. Meeting with Arnaud.   MEETING
    • There exist some visualization tools for Simgrid, to see the bandwidth that goes over each link. May be very useful in the future, to get a better understanding of what is going on.
    • The characteristics of the jobs (number of nodes, communication patterns) have an important impact on performance. However, it is difficult for us to have access to this, as we do not own a supercomputer… Maybe Matthieu can have more information (e.g. from Bull's clients)?
  4. DONE Add supervisors on Github for the journal.
  5. Some quick performance tests.   SMPI EXPERIMENTS
    • Run my matrix product code, with SMPI optimizations.
    • Use a 2-level fat-tree made with switches of 48 ports.
    • First case: non-tapered. We use all the switches. The fat-tree is 2;24,48;1,24;1,1 (total of 1152 nodes).
      • Use 1089 processes, matrix size of 4950.
      • Time: 1.75s.
      • Communication time: 0.94s.
      • Computation time: 0.81s.
    • Second case: tapered. We remove half of the root switches. The fat-tree is 2;24,48;1,12;1,1 (still 1152 nodes).
      • Still uses 1089 processes, matrix size of 4950.
      • Time: 1.78s.
      • Communication time: 0.94s.
      • Computation time: 0.82s.
    • The observed difference does not seem significant, but we should check with a carefully designed experiment and analysis.
    • For the record, running the same application on the same topology with only one process takes 3607s. Thus, we have a speedup of about 2026, so an efficiency of 1.86. This is a very nice (superlinear) speedup, certainly due to cache effects.
    • These quick tests suggest that we could remove root switches without impacting the performances, even if we use nearly the whole fat-tree (this is obvious if we use a small subtree).
  6. DONE Run another benchmark (e.g. HPL), with more carefully designed experiments.
  7. DONE The 3-level fat-tree took very long to load (aborted). Find out why.

1.1.7 2017-02-14 Tuesday

  1. Work on experiment automation.   PYTHON
    • Add Python functions to generate topology and host files from a given fat-tree description.
    • Adapt Python script and Jupyter notebook from parallel system course to run experiments.
    • The matrix size and the number of processes are fixed. We compute matrix products for various numbers of root switches (we test fat-trees (2;24,48;1,n;1,1) for n in [1, 24]).
    • Results seem very promising. For a matrix size of 6600, we can have as few as 10 root switches without an important impact on performance (recall that a typical 2-level fat tree with 48-port switches would have 24 root switches). If we keep removing switches, then performance is quickly impacted.
    • Repeated the experiment with the same topology and the same matrix size, but with only 576 processes. We observe the same trend: we can remove a lot of root switches without having an impact.
  2. DONE Ask if it would be possible to have SSH access to some dedicated computer.
    • Does not need to have a lot of cores (Simgrid is not a parallel program), but it would be nice if it had a fast core.
    • Needs to be dedicated so as not to perturb the experiments.
  3. Webinar on reproducible research: Publication modes favoring reproducible research   MEETING
    • Speakers: Konrad Hinsen and Nicolas Rougier.
    • Two parts in research: dissemination (of the results/ideas) and evaluation (of the researcher).
    • If we want reproducible research to become a norm, researchers should be rewarded for this (their reputation should also depend on the reproducibility of their research, not only the number of citations or the impact factor).
    • The speaker compares reproducible research from two points of view, the human part and the computer part, both for dissemination and evaluation.
    1. ActivePapers
      • Not a tool that one should use (yet), nor a proposal for a new standard. It is mainly an idea for computer-aided research.
      • How to have more trust in the software? The “ideal” approach is reimplementation (e.g. ReScience). The speaker tried this on a dozen projects; he never got identical results. Other good ideas: good practices like version control and testing, and keeping track of the software stack (hardware, OS, tools, etc.).
      • ActivePapers groups scripts, software dependencies and data into a single archive.
    2. ReScience
      • Idea: replicate science.
      • For a great majority of papers, we can neither replicate them nor reuse their code.
      • It is hard to publish a replication of an original paper; most journals will reject it since it is not original.
      • This is why ReScience was born. It is (currently) used on Github.
      • To publish a new study, do a pull-request on ReScience repository. Then it is reviewed openly by reviewers selected by the editor. The replication is improved until it is publishable.

1.1.8 2017-02-15 Wednesday

  1. Use Christian’s config files for org mode   ORGMODE
  2. Work on the experiment script
    • Parsing more generic fat-tree descriptions. For instance, our current topology description would be 2;24,48;1,1:24;1,1. It means that the L1 switches can have between 1 and 24 up ports.
    • Modify the script for experiments to be more generic.
      • Can give as command line arguments the fat-tree description, the (unique) matrix size, the (unique) number of processes.
      • Use Python’s argparse for a cleaner interface.
  3. Re-run experiments with this new script
    • Still observe the same trend: we can afford to remove a lot of up-ports for the L1 switches.
    • Some points seem to be outliers. But we do not have a lot of points, so it is difficult to say. We should do more experiments to see if these points are still significantly separated from the rest.

1.1.9 2017-02-16 Thursday

  1. DONE Enhance/fix Emacs configuration   ORGMODE
    • Translate days and months in English.
    • Increase the line length limit (120 columns?).
    • Reformat the whole document with such limit.
    • Add tags where relevant.
    • Attach files, instead of putting a link.
  2. Try to use even more SMPI optimizations   SMPI
    • Currently, we use the macro SMPI_SAMPLE_GLOBAL only once: for the outer for loop of the sequential matrix product.
    • Maybe we can also use it for the two other loops? We could also reduce the number of iterations (currently, it is 0.5*size). Let’s try.
    • Currently, we get communication times of about 0.14s and computation times of about 0.42s, for a total time of 0.57s, with the following command: smpirun --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 -np 64 -hostfile ./hostfile_1152.txt -platform ./big_tapered_fat_tree.xml ./matmul 1600
    • FAIL. It seems we cannot use nested sample blocks. Quite strange, I do not understand why…
  3. Try to run HPL with Simgrid   SMPI HPL
    • Copied from Christian’s repository.
    • Compilation fails, don’t know why. But binaries are stored in the git repository (don’t know why either), so I can use them to do some first tests. In fact, the file Make.SMPI needed to be modified. Changed mkdir to mkdir -p, ln to ln -f and cp to cp -f. Changed the top directory. Also, the Makefile couldn’t find the shared library atlas. It was in /usr/lib, but named libatlas.so.3. Added a symbolic link to libatlas.so.
    • Tested on my laptop (with MPI, not SMPI). With a problem size of 10000 and 12 processes, it corresponds to 16.51 Gflops.
    • Tested with SMPI, with a problem size of 10000 and 4 processes. Command: smpirun --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 -platform ../../../small_tests/cluster_fat_tree_64.xml -hostfile ../../../small_tests/hostfile_64.txt -np 4 ./xhpl Result: 1.849Gflops.
    • Same thing, with 12 processes. Very similar: 1.847Gflops. Why is it not faster?
    • Same thing, with 64 processes. Very similar: 1.858Gflops. Why is it not faster?
    • Retried with a freshly compiled program. Still the same thing.
    • Understood the issue: it is not enough to specify the number of processes with -np 12; we also have to set it in the file HPL.dat.
    • Tried with -np 4, P=4 and Q=1. Now, 6.6224Gflops. We have a speedup of 3.59, which seems reasonable.
    • The number of processes given with -np must be greater than or equal to P × Q.
    • Tried with -np 4, P=1 and Q=4. Did not have a noticeable impact on performances (in comparison with P=4, Q=1).
    • Tried with -np 4, P=2 and Q=2. Did not have a noticeable impact on performances (in comparison with P=4, Q=1).
    • Tried with -np 64, P=8 and Q=8. Now, 22.46Gflops. Speedup of 12, very disappointing.
    • Tried with -np 64, P=8 and Q=8 again, but with a problem size of 20000 (it was 10000). Now 52.2Gflops (speedup of 28.3).
  4. Comparison with top 500
    • For the record, the order of magnitude for Intel desktop CPU of today is between 10 and 100 Gflops, according to this website, this website and this website. My laptop supposedly has a speed of 3.84 Gflops per core and 15.21 Gflops in total according to the last two websites.
    • According to Wikipedia, the first generation Raspberry Pi has a speed of 0.041 Gflops, a 64 nodes cluster made of those has a speed of 1.14 Gflops.
    • The first supercomputer of the list has a speed of about 93 Pflops, or 93,000,000 Gflops.
    • The last one has a speed of about 349 Tflops, or 349,000 Gflops.
    • In June 2005, the first one had a speed of about 136 Tflops, the last one 1.2 Tflops.
    • In our settings with 64 nodes, each node has one core that computes at 1 Gflops. Thus, our Rpeak is 64 Gflops. We have an efficiency of 52.2/64 = 0.81. This is not bad, compared to the first three supercomputers of the top 500 (respectively at 0.74, 0.61 and 0.63). But we should maybe not compare the efficiency of a 64-node cluster with these supercomputers, since it becomes harder to be efficient with a large topology.
  5. DONE SMPI optimization of HPL   SMPI HPL
    • It seems that no SMPI optimization is done in the code obtained from Christian’s repository. Maybe we could speed things up?
    • Need to check what the algorithm behind HPL is, whether it is regular (to use SMPI_SAMPLE) and data-independent (to use SMPI_SHARED).
  6. DONE Adapt the experiment script to run HPL
    • Parse the output (quite ugly to parse, but easy; use the methods str.split and list.index).
    • Run the same kind of experiments as for the matrix product. It will be much longer if we cannot use SMPI optimizations.

1.1.10 2017-02-17 Friday

  1. Refactor the experiment script   PYTHON
    • Aim: reuse for HPL the code already done for the matrix product.
    • Now, we have a class AbstractRunner, which runs the common logic (e.g. some basic checks on the parameters, or running the desired number of experiments).
    • We also have classes MatrixProduct and HPL, containing the pieces of code specific to the matrix product or HPL (e.g. running one experiment).
  2. Some strange things with HPL   SMPI BUG HPL
    • The output has the following format:

      ================================================================================
      T/V                N    NB     P     Q               Time                 Gflops
      --------------------------------------------------------------------------------
      WR00L2L2        2000   120     1     1               3.17              1.683e+00
      
    • Sometimes, the last line is missing, so we do not have any information on time and flops.
    • Quite often it is present, but with wrong values: the time is 0.00 and the Gflops are absurdly high (e.g. 2.302e+03 Gflops for a cluster made of 96 machines of 1 Gflops). It may come from an erroneous measurement of the time.
    • For instance, with the script of commit dbdfeabbef3f90a3d4e2ecfbe5e8f505738cac23, the following command line: ./run_measures.py --global_csv /tmp/bla --nb_runs 10 --size 5000 --nb_proc 64 --fat_tree "2;24,48;1,24;1,1" --experiment HPL

      • It may get this output in one experiment:
      ================================================================================
      T/V                N    NB     P     Q               Time                 Gflops
      --------------------------------------------------------------------------------
      WR00L2L2        5000   120     8     8               0.00              1.108e+05
      
      • And this output in another one:
      ================================================================================
      T/V                N    NB     P     Q               Time                 Gflops
      --------------------------------------------------------------------------------
      WR00L2L2        5000   120     8     8               5.35              1.560e+01
      

      Note that, for the two experiments, nothing has changed. The file HPL.dat is the same, the number of processes given to the option -np is the same, the topology file and the host file are the same.

1.1.11 2017-02-20 Monday

  1. Keep investigating the HPL anomaly
  2. Found the issue with HPL   SMPI BUG HPL
    • Debugging with Christian, to understand what was going on.
    • This was a concurrency issue. The private variables of the processes were in fact not private. This caused two processes to write to the same variable, which led to an inconsistent value when measuring time.
    • The function is HPL_ptimer, in file testing/ptest/HPL_pdtest.c.
    • When using simgrid, need to use option --cfg=smpi/privatize-global-variables:yes to fix this.
    • Used a tool to search for a word, looks nice: cg and vg (package cgvg).
    • Another nice thing: ctags (command ctags --fields=+l -R -f ./ctags src testing).

1.1.12 2017-02-21 Tuesday

  1. Test the experiment script for HPL   EXPERIMENTS HPL
    • It seems to work well, the bug is fixed.
    • Scalability issue. Testing for a size of 20k already takes a lot of time, and it is still too small to have a good efficiency with 1000 processes (performance is worse than with 100 processes).
    • Definitely need to use SMPI optimizations if we want to do anything with HPL.
  2. Re-do experiments with matrix product
    • Stuck with HPL…
    • We also output the speed of the computation, in Gflops (this is redundant with the time, but we can use it for comparison with other algorithms like HPL).
    • The plot looks nice, but nothing new.
  3. Work on the drawing of fat-trees
    • Generate all nodes and edges of a fat-tree.
    • No drawing yet.
    • Will try to output Tikz code.
  4. DONE Look at where to put SMPI macros in HPL, with Christian
    • Have a look at a trace, to see where most of the time is spent.
  5. Keep working on the drawing of fat-trees.
    • Now produce working Tikz code.
    • Figure quickly becomes unreadable for large fat-trees (not surprising).

1.1.13 2017-02-22 Wednesday

  1. Terminate the work on fat-tree drawing   PYTHON
    • We can now do ./draw_topo.py bla.pdf "2;8,16;1,1:8;1,1" "2;4,8;1,1:4;1,1" to draw all the fat-trees in the file bla.pdf. Very useful to visualize the differences between the trees.
    • No limit on the fat-tree size, they should fit on the pdf (a very large page is generated, then cropped to the right dimension). However, a large fat-tree may not be very readable.
  2. Tried to move the SMPI_SAMPLE of the matrix product
    • Cannot use one SMPI_SAMPLE per loop (don’t know why, but it seems to be forbidden).
    • It was used for the outer loop. Tried the inner loops, but performance was greatly degraded (about ×50 in simulation time).
    • Reverting the change.
  3. DONE Cannot use more than 1024 processes with Simgrid (need to fix)   SMPI BUG
    • The open() system call fails with EMFILE error code.
    • It used to work, don’t understand what changed in the meantime.
  4. Talk with Christian about SMPI optimizations in HPL   PERFORMANCE HPL
    • He gave me a trace of HPL execution obtained with Simgrid.
    • The parts taking most of the time are the following:

      50        /home/cheinrich/src/hpl-2.2/src/pgesv/hpl_rollt.c       242          /home/cheinrich/src/hpl-2.2/src/comm/hpl_recv.c     136 190.785263    498
      51        /home/cheinrich/src/hpl-2.2/src/pgesv/hpl_rollt.c       242          /home/cheinrich/src/hpl-2.2/src/comm/hpl_sdrv.c     180 372.272945    996
      52        /home/cheinrich/src/hpl-2.2/src/pgesv/hpl_rollt.c       242          /home/cheinrich/src/hpl-2.2/src/comm/hpl_send.c     133 179.711679    498
      
  5. Let’s track these pieces of code   PERFORMANCE HPL
    • HPL_rollT.c has only one function: HPL_rollT.
    • This function is called only once: at the end of function HPL_pdlaswp01T (eponym file).
    • This function is called once in function HPL_pdupdateNT and once in function HPL_pdupdateTT (eponymous files). There are very few differences between these two functions (4 relevant line changes, which are small variations in the arguments of a function, HPL_dtrsm). These files have 443 lines: this is a huge copy-paste, very dirty.
    • A candidate for the long function we are looking for is HPL_dlaswp10N (found by Christian). Has two nested loops. This function is also a good candidate for the most terrible piece of code ever written.
    • Added a SMPI_SAMPLE_GLOBAL after the outer loop; it did not reduce the simulation time. Also tried to remove the whole code of the function, which did not reduce the simulation time either. So we can say this function is not our big consumer.
    • Functions HPL_recv and HPL_sdrv are both called only in HPL_pdmxswp and HPL_pdlaswp00N.
    • Function HPL_pdlaswp00N is used only in HPL_pdupdateTN and HPL_pdupdateNN, which are nearly identical. These two functions are then used in the testing folder, with something like algo.upfun = HPL_pdupdateNN. Might be hard to track…
    • Function HPL_pdmxswp is used in HPL_pdpancrT, HPL_pdpanllT, HPL_pdpanllN, HPL_pdpanrlT, HPL_pdpanrlN and HPL_pdpancrN. These functions are used in the testing folder, with something like algo.pffun = HPL_pdpancrN.
    • Trying to put some printf. We use the command:

      smpirun --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 --cfg=smpi/display-timing:yes
      --cfg=smpi/privatize-global-variables:yes -np 16 -hostfile ../../../small_tests/hostfile_64.txt -platform
      ../../../small_tests/cluster_fat_tree_64.xml ./xhpl
      
      • Function HPL_pdupdateNN is never used.
      • Function HPL_pdupdateTN is never used.
      • Thus, function HPL_pdlaswp00N is also never used (verified with a printf in this function).
      • Function HPL_pdmxswp is used and takes a significant (albeit not huge) amount of time (about 2 seconds when the total time is 41 seconds (virtual time)).

1.1.14 2017-02-23 Thursday

  1. Try to increase the file limit   SMPI BUG
    • First try, following this question and this question from Stackoverflow.
      • Added the following to /etc/security/limits.conf:

        *     soft    nofile          40000
        *     hard    nofile          40000
        
      • Added the following to /etc/pam.d/common-session:

        session required pam_limits.so
        
      • Rebooting.
    • Success, ulimit -Sn shows 40000 and we can now run experiments with more than 1024 processes.
  2. Keep tracking the time-consuming pieces of code in HPL   PERFORMANCE HPL
    • Function HPL_pdmxswp is used in some functions which are chosen with algo.pffun (see above).
    • They are then used (through a call to algo.pffun) in functions HPL_pdrpancrN, HPL_pdrpanrlN, HPL_pdrpanllN, HPL_pdrpanrlT, HPL_pdrpancrT and HPL_pdrpanllT.
    • Again, these functions are not used directly in src, there is something like algo.rffun = HPL_pdrpancrT in the testing folder.
    • This rffun is used only once, in HPL_pdfact.
    • Function HPL_pdfact takes between 2.5 and 2.8 seconds when the total time is 41 seconds (virtual time). This time includes the time spent in HPL_pdmxswp.
    • Function HPL_pdfact is used in functions HPL_pdgesvK1, HPL_pdgesvK2 and HPL_pdgesv0. These functions are then called in HPL_pdgesv.
    • Function HPL_pdgesv takes a time of about 3 seconds when the total time is 41 seconds (virtual time).
    • Strange thing: deleting the content of this function gives a very short run-time. Maybe the way I measured time (using MPI_Wtime) is not consistent with the way HPL measures time.
    • Identified the long loop in HPL_pdgesv0. But cannot put a SMPI_SAMPLE here, there are calls to MPI primitives in the block.
    • Found the right function to measure time: use HPL_timer_walltime, not MPI_Wtime.
    • Instrumented the code of HPL_pdgesv0 to get an idea of what takes time. Measurements are taken with HPL_timer_walltime. What takes time is the part “factor and broadcast current panel” in the loop. Within this part, the calls to HPL_bcast and HPL_pdupdate take most of the (virtual) time. In an execution of 40.96 seconds:

      pdfact = 2.907908, binit = 0.002633, bcast = 11.013843, bwait = 0.000669, pdupdate = 26.709408
      

      Obviously there is nothing to do for the broadcast, but there may be hope for pdupdate.

    • Several versions exist for this function:

      • HPL_pdupdateTN
      • HPL_pdupdateNT
      • HPL_pdupdateTT
      • HPL_pdupdateNN

      Only HPL_pdupdateTT seems to be used (with our settings). Removed body of function HPL_pdupdateTT, the simulation time becomes about 8 seconds (was 69 seconds).

    • Might be tricky to optimize with SMPI macros, this function mixes computations and communications.
    • Tried to insert a return at line 208 (before the comment “The panel has been forwarded at that point, finish the update”). The time is not impacted and the correctness tests pass, so the part of the code after this point seems useless here. Verified by inserting a printf: this part is never executed.
    • Line 143 is executed (just after comment “1 x Q case”).
    • Adding a return statement at line 136 (just before the comment “Enable/disable the column panel probing mechanism”) gives a simulation time of 8 seconds. Same thing at line 140, after the broadcast.
    • The if block of lines 143-258 is never executed in our settings. This explains why acting on line 208 did not have any effect.
    • Adding a return statement at line 358 (just before the comment “The panel has been forwarded at that point, finish the update”) gives a simulation time of 9.7 seconds.
    • The if block of lines 360-414 seems to always be executed. The if block of lines 366-390 is executed sometimes, but not always. In this block, we execute the #else part of the #ifdef.
    • In this block, removing the call to HPL_dgemm greatly reduces the simulation time (from 68s to 13s).
    • Several definitions exist for HPL_dgemm: there is an implementation in src/blas/HPL_dgemm.c, but also a #define HPL_dgemm cblas_dgemm in include/hpl_blas.h.
    • Can disable this #define by removing the line HPL_OPTS = -DHPL_CALL_CBLAS in the file Make.SMPI. Then, HPL_dgemm is executed, but not the others (HPL_dgemm0, HPL_dgemmTT, HPL_dgemmTN, HPL_dgemmNT, HPL_dgemmNN). It seems that HPL_dgemm can call HPL_dgemm0 which can itself call the four others, but this only happens when HPL_CALL_VSIPL is defined.
    • In fact, there is maybe no need to insert the SMPI_SAMPLE macro in the dgemm function. We can put it inside HPL_pdupdateTT, for instance at line 360, just above the big if block. However, this performs really badly. With SMPI_SAMPLE_GLOBAL(10, 0.1), the real time becomes about 10 seconds (speedup of ×4) but the virtual time becomes about 90 seconds (×2 error). If we increase one of the two numbers, the real time quickly becomes as large as it was before. Same thing with SMPI_SAMPLE_LOCAL. Maybe this code is too irregular? Or we should “zoom in” and insert the SMPI optimizations in dgemm (which is in an external library, so not that easy).

1.1.15 2017-02-27 Monday

  1. Try running matrix product experiment with big fat-trees   SMPI BUG
    • Run a medium number of processes on a big fat-tree.

        ./run_measures.py --global_csv big_global.csv --local_csv big_local.csv --nb_runs 3 --size 9300 --nb_proc 961
        --fat_tree "3;24,24,48;1,24,1:24;1,1,1" --experiment matrix_product
      Seems to work properly: one CPU core is quickly loaded at 100% and one experiment takes approximately two minutes.
    • Try a larger number of processes with the same topology and the same matrix size.

        ./run_measures.py --global_csv big_global.csv --local_csv big_local.csv --nb_runs 3 --size 9300 --nb_proc 8649
        --fat_tree "3;24,24,48;1,24,1:24;1,1,1" --experiment matrix_product

      The CPU is loaded at about 3% for quite a long time with the script smpirun. It finally launches matmul and becomes loaded at 100%. Then it quickly terminates with a non-zero exit code: Could not map fd 8652 with size 80000: Cannot allocate memory. The memory consumption was only 3% of the total memory, which is strange. This happens in the function shm_map, called by SMPI_SHARED_MALLOC.

    • Retrying the same command, with malloc instead of SMPI_SHARED_MALLOC and free instead of SMPI_SHARED_FREE. As expected, larger memory consumption (10.9% of total memory). There is no error this time. The first experiment terminates in about 20 minutes. For the record, it achieved 1525 Gflops, with communication and computation times of approximately 0.48 seconds.
    • Reverted the changes to get the SMPI_SHARED macros back. Retried running smpirun with the same settings, except that the option --cfg=smpi/privatize-global-variables:yes is not passed here. No error this time either; the run took 13 minutes. Also a large memory consumption (13.5%); maybe the 3% we observed was not the final memory consumption, since the process exited with an error?
    • Remark: for the matrix product, there are no global variables, so maybe we can safely remove this option in this case? This does not solve the problem though, since we need it for HPL.
    • Try the initial command with a smaller matrix size (size=93, i.e. all processes have a sub-matrix of size 1×1). Observed the same error.
    • Also try to reproduce this with HPL, with this command:

      ./run_measures.py --global_csv big_global.csv --nb_runs 3 --size 5000 --nb_proc 8649 --fat_tree
      "3;24,24,48;1,24,1:24;1,1,1" --experiment HPL
      

      No error, although we have a memory consumption of 71.2%.

    • Try the initial command, still with a size of 93, but commenting out the call to matrix_product in matmul.c. Thus, there is no allocation of temporary buffers, only the initial matrices (3 allocations instead of 5). No error.
    • Same thing, with the call to matrix_product uncommented, but a return statement placed just after the temporary buffer allocations. We get the mmap error.
    • Create a MWE from this, called mmap_error.c.
  2. Work on a MWE for the mmap error   SMPI BUG
    • File mmap_error.c is a MWE for the mmap error. It consists of 5 calls to SMPI_SHARED_MALLOC with a size of 1, launched with 8652 processes. We also get an error if we do 100k calls to SMPI_SHARED_MALLOC with only one process. The total number of calls to this macro seems to be the issue. We get the error with or without the option smpi/privatize-global-variables:yes.
    • Here is mmap_error.c:

      #include <stdio.h>
      #include <mpi.h>
      
      #define N 65471
      
      int main(int argc, char *argv[]) {
      
          MPI_Init(&argc, &argv);
      
          /* each SMPI_SHARED_MALLOC creates a new memory mapping, so this
             loop consumes one entry of the kernel's map table per call */
          for(int i = 0; i < N; i++) {
              float *a = SMPI_SHARED_MALLOC(1);
          }
      
          MPI_Barrier(MPI_COMM_WORLD);
          printf("Success\n");
          MPI_Finalize();
          return 0;
      }
      

      With the following command (commit 8eb0cf0b6993e174df58607e9492a134b85a4669 of Simgrid):

      smpicc -O4 mmap_error.c -o mmap_error
      smpirun -np 1 -hostfile hostfile_64.txt -platform cluster_fat_tree_64.xml ./mmap_error
      

      Yields an error. Note that the host and topology files are irrelevant here.

      • For N<65471, we have no error (Success is printed).
      • For N>65471, we have the error Could not map fd 3 with size 1: Cannot allocate memory.
      • For N=65471, we have the error Memory callocation of 524288 bytes failed.
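
      To check that the number of memory mappings is indeed the culprit, we can count the mappings of the running process (a Linux-specific diagnostic sketch; the pgrep pattern is an assumption):

      # number of memory mappings of the running mmap_error process
      wc -l /proc/$(pgrep -f mmap_error | head -1)/maps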
    • Retried with the latest version of Simgrid (commit c8db21208f3436c35d3fdf5a875a0059719bff43). We now get the message:

      Could not map folded virtual memory (Cannot allocate memory). Do you perhaps need to increase
      the STARPU_MALLOC_SIMULATION_FOLD environment variable or the sysctl vm.max_map_count?
      

      Found the issue:

      $ sysctl vm.max_map_count
      vm.max_map_count = 65530
      

      This matches the observed threshold: our 65471 allocations, plus the few dozen mappings the process already has, exceed the limit of 65530. To modify the value of a sysctl variable, follow this link. Temporary fix:

      sudo sysctl -w vm.max_map_count=100000
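
      To make this persistent across reboots, the standard sysctl mechanism can be used (a sketch; the exact configuration path may vary by distribution):

      echo 'vm.max_map_count=100000' | sudo tee -a /etc/sysctl.conf
      sudo sysctl -p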
      
  3. Run the matrix product experiment with 8649 processes
    • Using the command:

      ./run_measures.py --global_csv big_global.csv --local_csv big_local.csv --nb_runs 3 --size 9300 --nb_proc 8649
      --fat_tree "3;24,24,48;1,24,1:24;1,1,1" --experiment matrix_product
      
    • The experiments are very long, about 30 minutes. The code is already heavily optimized (SMPI macros, no initialization of the matrices), and a large part of this time is spent outside of the application, so there is not much hope of running it faster without modifying Simgrid.
    • This shows that we really need to optimize HPL if we want to run it with a large number of processes.
    • Anyway, without SMPI macros, every floating-point operation of the application is actually performed. Thus, if we are simulating a computation made on a 1000 Gflops cluster, using a 1 Gflops laptop, the simulation should take at least 1000 times longer than the same computation on a real 1000 Gflops cluster.
    • First results show no large difference in the total time for small or large numbers of roots. The communication time is about twice as large as the computation time, so maybe we should take a larger matrix. When we had 961 processes, each one had a sub-matrix of size 300×300. With 8649 processes, they have a sub-matrix of size 100×100. Problem: if we want to get back to the 300×300 sub-matrices, we need to multiply the size by 3 and thus the memory consumption by 9. It was already about 25%, so this is not feasible on this laptop. But this is strange: we should pay the memory of only one process, and we successfully ran 300×300 sub-matrices before; need to check.

1.1.16 2017-02-28 Tuesday

  1. Other benchmarks on Simgrid   SMPI EXPERIMENTS
    • The paper “Simulating MPI application: the SMPI approach” uses the benchmark NAS EP to demonstrate the scalability of SMPI. With SMPI optimizations, they ran it with 16384 processes in 200 to 400 seconds (depending on the topology). Where is the code for this?
      • Found an old repository. Not clear if it is relevant.
      • Also a (shorter) version in the official Simgrid repository. Executable located in simgrid/build/examples/smpi/NAS/. Launch with two arguments: number of processes (don’t know what it does, we already have -np option given to smpirun) and the class to use (S, W, A, B, C, D, E, F).
    • The NAS EP benchmark from the Simgrid repository seems promising. Added a new class to have a larger problem (maybe we could instead give the size as an argument). With a large enough size, we can go to about 3.5 Gflops per process, i.e. an efficiency of 3.5 (recall that we use 1 Gflops nodes). This seems large; is it normal?
    • Longer than the matrix product, 745 seconds for 1152 processes and class F (custom class with m=42). Only 93 seconds were spent in the application, so the code is already correctly optimized (one call to SMPI_SAMPLE_GLOBAL).
    • Apparently not impacted by a tapered fat tree: roughly the same speed for 2;24,48;1,24;1,1 and 2;24,48;1,1;1,1, with 1152 processes and class F (about 3.5 Gflops). The application consists of a computation followed by three MPI_Allreduce of only one double each, so very few communications (hence the name “embarrassingly parallel”); a minimal sketch of this pattern follows.
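
      For illustration, a minimal MPI program with the same communication pattern (a deliberate simplification, not the actual NAS EP code):

      #include <mpi.h>

      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          double local = 1.0, global;  /* stand-in for EP's partial results */
          /* the long, communication-free computation would happen here */
          for (int i = 0; i < 3; i++)
              MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
          MPI_Finalize();
          return 0;
      }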
  2. Talk with Christian about benchmarks
    • Get access to Grid5000.
    • Profile the code, with something like smpirun -wrapper “valgrind <param>”.
    • To use SMPI macros, run the HPL_dgemm implemented in HPL, not the one from the external library.

1.2 2017-03 March

1.2.1 2017-03-01 Wednesday

  1. Trying to use HPL without external BLAS library   HPL
    • Failed.
    • It seems that three options are available for compilation, according to this page:
      • BLAS Fortran 77 interface (the default),
      • BLAS C interface (option -DHPL_CALL_CBLAS),
      • VSIPL library (option -DHPL_CALL_VSIPL).
    • We currently use the C interface, which relies on an external library (e.g. Atlas).
    • There is an implementation of HPL_dgemm in HPL, but it seems to need either code from Fortran 77 or from VSIPL.
    • According to the HPL homepage:

      The HPL software package requires the availibility on your system of an implementation of the Message Passing
      Interface MPI (1.1 compliant). An implementation of either the Basic Linear Algebra Subprograms BLAS or the Vector
      Signal Image Processing Library VSIPL is also needed. Machine-specific as well as generic implementations of MPI, the
      BLAS and VSIPL are available for a large variety of systems.
      

      So it seems hopeless to get rid of a BLAS library.

  2. Idea: trace calls to HPL_dgemm (Arnaud’s idea)   SMPI TRACING HPL
    • To do so, surround them by calls to trivial MPI primitives (e.g. MPI_Initialized). For instance:

      #define HPL_dgemm(...) ({int simgrid_test; MPI_Initialized(&simgrid_test); cblas_dgemm(__VA_ARGS__);\
      MPI_Initialized(&simgrid_test);})
      
    • Then, trace the execution (output in /tmp/trace):

      smpirun -trace -trace-file /tmp/trace --cfg=smpi/trace-call-location:1 --cfg=smpi/bcast:mpich\
      --cfg=smpi/running-power:6217956542.969 --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes -np 16\
      -hostfile ../../../small_tests/hostfile_64.txt -platform ../../../small_tests/cluster_fat_tree_64.xml ./xhpl\
      
    • Finally, dump this trace in CSV format:

      pj_dump --user-defined --ignore-incomplete-links trace > trace.dump
      
    • Did not work, no MPI_Initialized in the trace. In fact, this primitive is currently not traced. We could modify SMPI to achieve this behavior, or use another MPI primitive that is already traced.

1.2.2 2017-03-02 Thursday

  1. Keep trying to trace calls to HPL_dgemm   SMPI TRACING HPL
    • An MPI primitive is traced ⇔ the functions new_pajePushState and new_pajePopState are called (not sure, this is an intuition).
    • These functions are not called by MPI_Initialized or MPI_Wtime.
    • They are called by MPI_Test, but only if the MPI_Request object passed as argument is non-null, so we would need to do a fake asynchronous communication just before, which is probably not a good idea.
    • Anyway, it looks dirty to use an MPI primitive like this. Wouldn’t it be better to have a custom no-op primitive that forces the introduction of a trace entry? For instance, something like

      SMPI_Trace {
          HPL_dgemm();
      }
      

      or like

      SMPI_BeginTrace();
      HPL_dgemm();
      SMPI_EndTrace();
      
    • Every MPI primitive is defined by a #define with a call to smpi_trace_set_call_location followed by a call to the function. For instance:

      #define MPI_Test(...) ({ smpi_trace_set_call_location(__FILE__,__LINE__); MPI_Test(__VA_ARGS__); })
      

      However, this only records the file name and the line number; I do not think it dumps anything in the trace.

  2. Arnaud’s keynote: reproducible research   MEETING
    • Intro: article we had in exam, “Is everything we eat associated with cancer?”.
    • In most articles, we can read formulae and trust results, but much less often reproduce the results.
    • Reproducibility crisis, several scandals with falsified results (intentionally or not).
    • Video: Brendan Gregg, shouting in the data center.
  3. Discussion with Arnaud   MEETING
    • Regarding the matrix product:
      • Compare the (tapered) fat-tree with a “perfect” topology (cluster with no latency and infinite bandwidth).
      • Run it with larger matrices for the same number of processes. Do not aim at spending as much time in communication as in computation; we want the communication time to become nearly negligible. In practice, users of a supercomputer try to fill the memory of their nodes.
    • Regarding HPL:
      • As discussed yesterday, we want to trace the calls to HPL_dgemm by putting calls to a MPI primitive just before and after.
      • The short-term goal is to have an idea of the behavior of HPL regarding this function. Are there a lot of different calls to HPL_dgemm coming from different locations? Do these calls always take the same amount of time (i.e. do we always multiply matrices of the same size)?
      • It seems that there is some variability in the duration of HPL_dgemm (to be verified with the trace). If HPL really uses the function to multiply matrices of different sizes, we cannot do something like SMPI_SAMPLE(){HPL_dgemm()}; it will not be precise. What we could do, however, is to generalize SMPI_SAMPLE: we could parametrize it by a number representing the size of the problem that is sampled. If this size is always the same, then we could do what we are doing now and simply take the average. If this size changes over time, we could do something more elaborate for the prediction, like a linear regression (a hypothetical sketch is given after this list).
      • Using MPI functions like MPI_Test is not very “clean”, but we do not want to waste time on this currently, so we stick with existing MPI primitives. We could try to change this in the future.
      • It is always safe to call smpi_process_index. Thus, we could modify PMPI_Test to call TRACE_smpi_testing functions even when the given request is NULL.
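
      To make the parametrized-sample idea concrete, here is a minimal sketch of the machinery such a macro would need (hypothetical helper code, not an existing SimGrid feature):

      /* Benchmark the first SAMPLES calls, recording (size, duration) pairs,
         then fit duration ≈ a*size + b by least squares; the fit can then be
         used to predict (and inject) the duration of all subsequent calls. */
      #define SAMPLES 32
      static double xs[SAMPLES], ys[SAMPLES];
      static int n_obs = 0;
      static double coef_a, coef_b;

      static void record_sample(double size, double duration) {
          if (n_obs < SAMPLES) { xs[n_obs] = size; ys[n_obs] = duration; n_obs++; }
          if (n_obs == SAMPLES) { /* ordinary least-squares fit */
              double sx = 0, sy = 0, sxx = 0, sxy = 0;
              for (int i = 0; i < SAMPLES; i++) {
                  sx += xs[i]; sy += ys[i]; sxx += xs[i] * xs[i]; sxy += xs[i] * ys[i];
              }
              coef_a = (SAMPLES * sxy - sx * sy) / (SAMPLES * sxx - sx * sx);
              coef_b = (sy - coef_a * sx) / SAMPLES;
          }
      }

      static double predict_duration(double size) { return coef_a * size + coef_b; }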

1.2.3 2017-03-03 Friday

  1. Tracing calls to HPL_dgemm   SMPI C PYTHON R EXPERIMENTS TRACING PERFORMANCE HPL
    • Modification of the function PMPI_Test of Simgrid so that MPI_Test is traced even when the MPI_Request handle is NULL. To do that, we need to get the rank of the process with smpi_process_index. The value returned is always 0 in this case. This is a problem, since we could not distinguish between calls to MPI_Test from different processes; thus it would be impossible to measure time. Reverting the changes.
    • To get a non-null MPI_Request, did a MPI_Isend followed by a MPI_Recv:

      #define    HPL_dgemm(...)      ({\
        int my_rank, buff=0;\
        MPI_Request request;\
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);\
        MPI_Isend(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, &request);\
        MPI_Recv(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, NULL);\
        MPI_Wait(&request, MPI_STATUS_IGNORE);\
        cblas_dgemm(__VA_ARGS__);\
        MPI_Isend(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, &request);\
        MPI_Recv(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, NULL);\
        MPI_Wait(&request, MPI_STATUS_IGNORE);\
      })
      
    • Forget this: HPL was executed with only one process (-np 16, but P and Q were 1 in HPL.dat). This is why we only had rank 0 when giving NULL as the MPI_Request. Let’s revert this and use a simple MPI_Test with NULL.
    • Calls to MPI_Test seem to be correctly traced, but the post-processing of the trace with pj_dump crashes:

      terminate called after throwing an instance of 'std::out_of_range'
      what():  vector::_M_range_check: __n (which is 4) >= this->size() (which is 4)
      

      It also happened with the more complex piece of code that is shown above (with MPI_Test instead of MPI_Wait). Reverting again, to use the bigger piece of code above.

    • Now, the call to pj_dump succeeds, and we can see calls to MPI_Wait in the trace.
    • The call to smpirun was:
    smpirun -trace -trace-file /tmp/trace --cfg=smpi/trace-call-location:1 --cfg=smpi/bcast:mpich\
    --cfg=smpi/running-power:6217956542.969 --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes -np 16\
    -hostfile ../../../small_tests/hostfile_64.txt -platform ../../../small_tests/cluster_fat_tree_64.xml ./xhpl
    
    • Processing of the trace. Clean the file:
    pj_dump --user-defined --ignore-incomplete-links /tmp/trace > /tmp/trace.csv
    grep "State," /tmp/trace.csv | grep MPI_Wait | sed -e 's/()//' -e 's/MPI_STATE, //ig'  -e 's/State, //ig' -e 's/rank-//' -e\
    's/PMPI_/MPI_/' | grep MPI_  | tr 'A-Z' 'a-z' > /tmp/trace_processed.csv
    

    Clean the paths:

    import re
    reg = re.compile('((?:[^/])*)(?:/[a-zA-Z0-9_-]*)*((?:/hpl-2.2(?:/[a-zA-Z0-9_-]*)*).*)')
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                for line in in_f:
                    match = reg.match(line)
                    out_f.write('%s%s\n' % (match.group(1), match.group(2)))
    process('/tmp/trace_processed.csv', '/tmp/trace_cleaned.csv')
    
    df <- read.csv("/tmp/trace_cleaned.csv", header=F, strip.white=T, sep=",");
    names(df) = c("rank", "start", "end", "duration", "level", "state", "Filename", "Linenumber");
    head(df)
    
      rank    start      end duration level    state
    1    8 2.743960 2.743960        0     0 mpi_wait
    2    8 2.744005 2.744005        0     0 mpi_wait
    3    8 2.744005 2.744005        0     0 mpi_wait
    4    8 2.744005 2.744005        0     0 mpi_wait
    5    8 2.744005 2.744005        0     0 mpi_wait
    6    8 2.744005 2.744005        0     0 mpi_wait
                                Filename Linenumber
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    
    # Rebuild per-rank "Computing" intervals: the gap between the end of one
    # traced MPI call and the start of the next one is compute time.
    duration_compute = function(df) {
        ndf = data.frame();
        df = df[with(df,order(rank,start)),];
        #origin = unique(df$origin)
        for(i in (sort(unique(df$rank)))) {
    	start     = df[df$rank==i,]$start;
    	end       = df[df$rank==i,]$end;
    	l         = length(end);
    	end       = c(0,end[1:(l-1)]); # Computation starts at time 0
    
    	startline = c(0, df[df$rank==i,]$Linenumber[1:(l-1)]);
    	startfile = c("", as.character(df[df$rank==i,]$Filename[1:(l-1)]));
    	endline   = df[df$rank==i,]$Linenumber;
    	endfile   = df[df$rank==i,]$Filename;
    
    	ndf       = rbind(ndf, data.frame(rank=i, start=end, end=start,
    	    duration=start-end, state="Computing",
    	    startline=startline, startfile=startfile, endline=endline,
    	    endfile=endfile));
        }
        ndf$idx = 1:length(ndf$duration)
        ndf;
    }
    durations = duration_compute(df);
    durations = durations[durations["startfile"] == "/hpl-2.2/src/pgesv/hpl_pdupdatett.c" & durations["endfile"] == "/hpl-2.2/src/pgesv/hpl_pdupdatett.c" &
        durations["startline"] == durations["endline"],]
    
    library(dplyr)
    options(width=200)
    group_by(durations, startfile, startline, endfile, endline) %>% summarise(duration=sum(duration), count=n()) %>% as.data.frame()
    
                                startfile startline                             endfile endline  duration count
    1 /hpl-2.2/src/pgesv/hpl_pdupdatett.c       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387  683.6677   659
    2 /hpl-2.2/src/pgesv/hpl_pdupdatett.c       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 2115.8129  1977
    
    library(ggplot2)
    ggplot(durations, aes(x=idx, y=duration, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dgemm")
    

    trace1_16.png

    ggplot(durations, aes(x=start, y=duration, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dgemm")
    

    trace2_16.png

    Same results, with four processes:

    trace1_4.png

    trace2_4.png

  2. Seminar   MEETING

    On the asymptotic behavior of the price of anarchy, how bad is selfish routing in highly congested networks?

    • For instance, cars on a road make their own routing decisions, hence the “selfish” routing. This is not optimal (in comparison with a centralized routing).
  3. Discussion with Arnaud & Christian   MEETING
    • According to the plots, it is impossible to use SMPI_SAMPLE as is, since there are huge variations in the duration of HPL_dgemm.
    • The idea of a parametrized SMPI_SAMPLE is not great either. Every process does consecutive calls to HPL_dgemm, each call being shorter than the previous one, so we would still have to actually execute the expensive (early) calls.
    • A long term idea may be to have a “SimBLAS” library, that simulates the calls to HPL_dgemm (and other BLAS primitives). Christian will work on this.
    • Answers to all my questions from the paper readings.
  4. TODO New tasks [3/4]
    • [X] Do the linear regression by hand, off-line. Output the sizes of the matrices given to HPL_dgemm (with printf).
    • [X] Register on Grid5000. Compile HPL on one Grid5000 machine.
    • [X] Try to run HPL with a very large matrix, by using SMPI_SHARED_MALLOC (thus look at where all the allocations of matrices are done).
    • [ ] Have a look at the code of Simgrid, in particular the routing in fat-trees.

1.2.4 2017-03-06 Monday

  1. Output the matrix sizes   C PYTHON TRACING HPL
    • Add the following before the relevant calls to HPL_dgemm:

      printf("line=%d rank=%d m=%d n=%d k=%d\n", __LINE__+3, rank, mp, nn, jb);
      

      Then, run HPL by redirecting stdout to /tmp/output.

    • Process the output, to get a CSV file:
    import re
    import csv
    reg = re.compile('line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) k=([0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('line', 'rank', 'n', 'm', 'k'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        csv_writer.writerow(tuple(match.group(i) for i in range(1,6)))
    process('/tmp/output', '/tmp/sizes.csv')
    
  2. Merge the sizes with the durations   R EXPERIMENTS PERFORMANCE
    • Run smpirun as stated above, then process the output and the trace as before.
    • Process the data:
    df <- read.csv("/tmp/trace_cleaned.csv", header=F, strip.white=T, sep=",");
    names(df) = c("rank", "start", "end", "duration", "level", "state", "Filename", "Linenumber");
    head(df)
    
      rank    start      end duration level    state                           Filename Linenumber
    1    8 2.743960 2.743960        0     0 mpi_wait /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    2    8 2.744005 2.744005        0     0 mpi_wait /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    3    8 2.744005 2.744005        0     0 mpi_wait /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    4    8 2.744005 2.744005        0     0 mpi_wait /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    5    8 2.744005 2.744005        0     0 mpi_wait /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    6    8 2.744005 2.744005        0     0 mpi_wait /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    
    sizes <- read.csv("/tmp/sizes.csv");
    head(sizes)
    
      line rank    n    m   k
    1  411   12 4920 4920 120
    2  387    0 4920 4920 120
    3  411    8 5000 4920 120
    4  411    4 5040 4920 120
    5  411   13 4920 5040 120
    6  387    1 4920 5040 120
    
    durations = duration_compute(df); # same function as above
    durations = durations[durations["startfile"] == "/hpl-2.2/src/pgesv/hpl_pdupdatett.c" & durations["endfile"] == "/hpl-2.2/src/pgesv/hpl_pdupdatett.c" &
        durations["startline"] == durations["endline"],]
    head(durations)
    
        rank     start       end duration     state startline                           startfile endline                             endfile idx
    481    0  3.153899  6.271075 3.117176 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 481
    486    0  7.047247 10.063367 3.016120 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 486
    491    0 10.648367 13.716045 3.067678 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 491
    496    0 14.104534 17.155418 3.050884 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 496
    977    0 17.557080 20.430869 2.873789 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 977
    982    0 21.104026 24.044767 2.940741 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 982
    
    insert_sizes = function(durations, sizes) {
        stopifnot(nrow(durations)==nrow(sizes))
        ndf = data.frame();
        for(i in (sort(unique(durations$rank)))) {
    	tmp_dur = durations[durations$rank == i,]
    	tmp_sizes = sizes[sizes$rank == i,]
    	stopifnot(nrow(tmp_dur) == nrow(tmp_sizes))
    	stopifnot(tmp_dur$startline == tmp_sizes$line)
    	storage.mode(tmp_sizes$m) <- "double" # avoiding integer overflow when taking the product
    	storage.mode(tmp_sizes$n) <- "double"
    	storage.mode(tmp_sizes$k) <- "double"
    	tmp_dur$m = tmp_sizes$m
    	tmp_dur$n = tmp_sizes$n
    	tmp_dur$k = tmp_sizes$k
    	tmp_dur$size_product = tmp_sizes$m * tmp_sizes$n * tmp_sizes$k
    	ndf = rbind(ndf, tmp_dur)
        }
        return(ndf);
    }
    
    result = insert_sizes(durations, sizes)
    head(result)
    
        rank     start       end duration     state startline                           startfile endline                             endfile idx    m    n   k size_product
    481    0  3.153899  6.271075 3.117176 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 481 4920 4920 120   2904768000
    486    0  7.047247 10.063367 3.016120 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 486 4920 4920 120   2904768000
    491    0 10.648367 13.716045 3.067678 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 491 4920 4920 120   2904768000
    496    0 14.104534 17.155418 3.050884 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 496 4920 4920 120   2904768000
    977    0 17.557080 20.430869 2.873789 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 977 4800 4800 120   2764800000
    982    0 21.104026 24.044767 2.940741 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 982 4800 4800 120   2764800000
    
  3. Plot and linear regression   R EXPERIMENTS PERFORMANCE
    library(ggplot2)
    ggplot(result, aes(x=size_product, y=duration, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dgemm as a function of the sizes")
    

    trace3_16.png

    reg <- lm(duration~I(m*n*k), data=result)
    summary(reg)
    
    
    Call:
    lm(formula = duration ~ I(m * n * k), data = result)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.10066 -0.01700 -0.00085  0.00351  0.57745 
    
    Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)    
    (Intercept)  -2.476e-03  1.235e-03   -2.005   0.0451 *  
    I(m * n * k)  1.062e-09  9.220e-13 1151.470   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.04205 on 2634 degrees of freedom
    Multiple R-squared:  0.998,	Adjusted R-squared:  0.998 
    F-statistic: 1.326e+06 on 1 and 2634 DF,  p-value: < 2.2e-16
    
    layout(matrix(c(1,2,3,4),2,2))
    plot(reg)
    

    reg_16.png

  4. Comments on the linear regression   EXPERIMENTS
    • The plot of the duration as a function of m*n*k looks great. Maybe a bit of heteroscedasticity, but not much. It is clearly linear.
    • The linear regression, however, is not so good. We have a high R-squared (0.998), but the diagnostic plots look bad. The residuals-vs-fitted plot shows that the results are clearly heteroscedastic. The normal Q-Q plot suggests that the relation is not linear (in m*n*k) but rather exponential.
    • The diagnostic plots of the linear regression seem to contradict the first plot; this is strange.
  5. Investigating the linear regression   C
    • We can print other relevant parameters of HPL_dgemm:

      printf("line=%d rank=%d m=%d n=%d k=%d a=%f lead_A=%d lead_B=%d lead_C=%d\n", __LINE__+3,
        rank, mp, nn, jb, -HPL_rone, ldl2, LDU, lda);
      

      Here, a is a scaling factor applied to the matrix; lead_A, lead_B and lead_C are the leading dimensions of the matrices A, B and C. A sample of what we get (only some lines are reported here):

      line=411 rank=2 m=2240 n=2160 k=120 a=-1.000000 lead_A=2480 lead_B=2160 lead_C=2480
      line=387 rank=3 m=1640 n=1641 k=120 a=-1.000000 lead_A=2480 lead_B=1641 lead_C=2480
      line=387 rank=2 m=680 n=720 k=120 a=-1.000000 lead_A=680 lead_B=720 lead_C=2480
      line=387 rank=2 m=200 n=240 k=120 a=-1.000000 lead_A=200 lead_B=240 lead_C=2480
      line=411 rank=1 m=480 n=441 k=120 a=-1.000000 lead_A=2520 lead_B=441 lead_C=2520
      

      This trend seems to roughly repeat: a is always -1, and lead_C is always either 2480 or 2520. For small enough values, lead_A is equal to m and lead_B to n. For larger values, they are not equal anymore, but all are large. However, there are still some noticeable variations. For instance:

      line=387 rank=0 m=600 n=600 k=120 a=-1.000000 lead_A=2520 lead_B=600 lead_C=2520
      line=411 rank=0 m=600 n=600 k=120 a=-1.000000 lead_A=600 lead_B=600 lead_C=2520
      

      In this last example, all parameters are equal except lead_A, which is more than four times larger in one case.

    • A small leading dimension means better locality and thus better performance. These differences in the leading dimensions could explain the non-linearity and the heteroscedasticity; the toy program below illustrates the effect.
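
      To check this intuition independently of HPL, we can time the same multiplication with two different leading dimensions (illustrative code, not part of HPL; the values 600 and 2520 are taken from the output above):

      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <cblas.h>

      /* Time C += -A*B for fixed m, n, k, varying only lda: in column-major
         storage, consecutive columns of A are lda doubles apart, so a larger
         lda means a larger stride and worse locality. */
      static double timed_dgemm(int m, int n, int k, int lda, const double *A,
                                const double *B, double *C) {
          clock_t t0 = clock();
          cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                      -1.0, A, lda, B, k, 1.0, C, m);
          return (double)(clock() - t0) / CLOCKS_PER_SEC;
      }

      int main(void) {
          int m = 600, n = 600, k = 120, lda_small = 600, lda_big = 2520;
          double *A = calloc((size_t)lda_big * k, sizeof(double));  /* big enough for both lda values */
          double *B = calloc((size_t)k * n, sizeof(double));
          double *C = calloc((size_t)m * n, sizeof(double));
          printf("lda=%d: %f s\n", lda_small, timed_dgemm(m, n, k, lda_small, A, B, C));
          printf("lda=%d: %f s\n", lda_big, timed_dgemm(m, n, k, lda_big, A, B, C));
          free(A); free(B); free(C);
          return 0;
      }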

1.2.5 2017-03-07 Tuesday

  1. And the leading dimensions?   C PYTHON R EXPERIMENTS TRACING PERFORMANCE
    • We have this printf before the calls to HPL_dgemm (same as before, except that a is removed):

      printf("line=%d rank=%d m=%d n=%d k=%d lead_A=%d lead_B=%d lead_C=%d\n", __LINE__+3,
        rank, mp, nn, jb, ldl2, LDU, lda);
      
    • The trace is in the file /tmp/trace, we process it as before. The output is redirected in the file /tmp/output.
    • Processing of the output:
    import re
    import csv
    reg = re.compile('line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) k=([0-9]+) lead_A=([0-9]+) lead_B=([0-9]+) lead_C=([0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('line', 'rank', 'n', 'm', 'k', 'lead_A', 'lead_B', 'lead_C'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        csv_writer.writerow(tuple(match.group(i) for i in range(1,9)))
    process('/tmp/output', '/tmp/sizes.csv')
    

    We have the durations dataframe, obtained as before:

    head(durations)
    
        rank     start       end duration     state startline                           startfile endline                             endfile idx
    481    0  4.111176  7.158459 3.047283 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 481
    486    0  7.827329 10.848572 3.021243 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 486
    491    0 11.411456 14.445789 3.034333 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 491
    496    0 14.837377 17.868118 3.030741 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 496
    977    0 18.268679 21.142146 2.873467 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 977
    982    0 21.809954 24.699182 2.889228 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 982
    

    Then we get the sizes dataframe:

    sizes <- read.csv("/tmp/sizes.csv");
    head(sizes)
    
      line rank    n    m   k lead_A lead_B lead_C
    1  387    0 4920 4920 120   5040   4920   5040
    2  411    8 5000 4920 120   5000   4920   5000
    3  411    4 5040 4920 120   5040   4920   5040
    4  411   12 4920 4920 120   4920   4920   4920
    5  387    1 4920 5040 120   4920   5040   5040
    6  411    5 5040 5040 120   5040   5040   5040
    
    insert_sizes = function(durations, sizes) {
        stopifnot(nrow(durations)==nrow(sizes))
        ndf = data.frame();
        for(i in (sort(unique(durations$rank)))) {
    	tmp_dur = durations[durations$rank == i,]
    	tmp_sizes = sizes[sizes$rank == i,]
    	stopifnot(nrow(tmp_dur) == nrow(tmp_sizes))
    	stopifnot(tmp_dur$startline == tmp_sizes$line)
    	storage.mode(tmp_sizes$m) <- "double" # avoiding integer overflow when taking the product
    	storage.mode(tmp_sizes$n) <- "double"
    	storage.mode(tmp_sizes$k) <- "double"
    	storage.mode(tmp_sizes$lead_A) <- "double"
    	storage.mode(tmp_sizes$lead_B) <- "double"
    	storage.mode(tmp_sizes$lead_C) <- "double"
    	tmp_dur$m = tmp_sizes$m
    	tmp_dur$n = tmp_sizes$n
    	tmp_dur$k = tmp_sizes$k
    	tmp_dur$lead_A = tmp_sizes$lead_A
    	tmp_dur$lead_B = tmp_sizes$lead_B
    	tmp_dur$lead_C = tmp_sizes$lead_C
    	tmp_dur$lead_product = tmp_sizes$lead_A * tmp_sizes$lead_B * tmp_sizes$lead_C
    	tmp_dur$size_product = tmp_sizes$m * tmp_sizes$n * tmp_sizes$k
    	tmp_dur$ratio = tmp_dur$lead_product/tmp_dur$size_product
    	ndf = rbind(ndf, tmp_dur)
        }
        return(ndf);
    }
    
    result = insert_sizes(durations, sizes)
    head(result)
    
        rank     start       end duration     state startline                           startfile endline                             endfile idx    m    n   k lead_A lead_B lead_C lead_product
    481    0  4.111176  7.158459 3.047283 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 481 4920 4920 120   5040   4920   5040 124975872000
    486    0  7.827329 10.848572 3.021243 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 486 4920 4920 120   4920   4920   5040 122000256000
    491    0 11.411456 14.445789 3.034333 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 491 4920 4920 120   4920   4920   5040 122000256000
    496    0 14.837377 17.868118 3.030741 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 496 4920 4920 120   4920   4920   5040 122000256000
    977    0 18.268679 21.142146 2.873467 Computing       387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     387 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 977 4800 4800 120   5040   4800   5040 121927680000
    982    0 21.809954 24.699182 2.889228 Computing       411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     411 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 982 4800 4800 120   4800   4800   5040 116121600000
        size_product    ratio
    481   2904768000 43.02439
    486   2904768000 42.00000
    491   2904768000 42.00000
    496   2904768000 42.00000
    977   2764800000 44.10000
    982   2764800000 42.00000
    
    library(ggplot2)
    ggplot(result, aes(x=lead_product, y=duration, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dgemm as a function of the leading dimensions")
    

    trace4_16.png

    library(ggplot2)
    ggplot(result, aes(x=lead_product, y=size_product, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Size of the matrices of HPL_dgemm as a function of the leading dimensions")
    

    trace5_16.png

    ggplot(result, aes(x=idx, y=ratio, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Ratios of the leading dimensions by the sizes over time")
    

    trace6_16.png

    reg <- lm(duration~ I(m*n*k) + lead_A+lead_B+lead_C, data=result)
    summary(reg)
    
    
    Call:
    lm(formula = duration ~ I(m * n * k) + lead_A + lead_B + lead_C, 
        data = result)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.09477 -0.01804 -0.00439  0.00850  1.39992 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -7.741e-01  9.915e-02  -7.807 8.37e-15 ***
    I(m * n * k)  1.069e-09  4.431e-12 241.217  < 2e-16 ***
    lead_A        2.965e-06  7.744e-07   3.828 0.000132 ***
    lead_B       -7.048e-06  2.799e-06  -2.518 0.011863 *  
    lead_C        1.547e-04  1.981e-05   7.810 8.16e-15 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.04981 on 2631 degrees of freedom
    Multiple R-squared:  0.9972,	Adjusted R-squared:  0.9972 
    F-statistic: 2.361e+05 on 4 and 2631 DF,  p-value: < 2.2e-16
    
    layout(matrix(c(1,2,3,4),2,2))
    plot(reg)
    

    reg2_16.png

  2. Discussion about the leading dimensions   EXPERIMENTS
    • In the three previous plots, we see that the leading dimensions have two modes, which are directly observable in the durations of HPL_dgemm.
      • One of the modes seems to be linear in the sizes, we observe a straight line.
      • The other mode is clearly non-linear. Maybe quadratic? Exponential?
    • The linear regression shows that the variables lead_A, lead_B and lead_C have a non-negligible impact on the performance, albeit smaller than the sizes. We still have terrible diagnostic plots; adding parameters to the model did not change anything.
    • This could explain the “bad” plots of the linear regression.
  3. Performance analysis of dgemm outside of HPL   C EXPERIMENTS PERFORMANCE
    • In the above analysis, the raw results come from a trace of HPL. Thus, we cannot control the sizes and/or leading dimensions. We only have observational data and not experimental data.
    • To fix this, let’s write a short C program, called dgemm_test, that calls cblas_dgemm (the function to which HPL_dgemm is aliased).
    • Currently, this program takes six arguments: the three sizes and the three leading dimensions. Be careful: the meaning of these sizes and leading dimensions changes depending on how dgemm is called (CblasColMajor or CblasRowMajor, and CblasNoTrans or CblasTrans). In the current code, these are fixed to be the same as in HPL.
    • Then, a Python script (called runner.py) samples random sizes and leading dimensions (taking care of the constraints between the sizes and dimensions) and calls dgemm_test. It then writes the results to a CSV file (a sketch of such a script is given after this list).
    • Quick analysis of these results in R:
      • We got plots with the same shape (both the plot of the raw results and the plot of the linear regression).
      • The call to dgemm is 10 times faster in dgemm_test than in HPL. Need to find out why. First, what is the time obtained in the HPL traces? Is it virtual or real?
      • As with HPL, the linear regression shows that the ratio has a significant impact, but lower than the sizes.
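
      A sketch of what runner.py does (the sampling strategy, the CLI of dgemm_test and its output format are assumptions; here dgemm_test is assumed to print the measured time on stdout):

      import csv
      import random
      import subprocess

      def sample_params(max_size=2000):
          m, n, k = (random.randint(1, max_size) for _ in range(3))
          # leading dimensions must be at least as large as the corresponding size
          # (which size constrains which dimension depends on the transposition flags)
          return (m, n, k, random.randint(m, 2*m), random.randint(n, 2*n),
                  random.randint(k, 2*k))

      with open('result.csv', 'w') as f:
          writer = csv.writer(f)
          writer.writerow(('time', 'm', 'n', 'k', 'lead_A', 'lead_B', 'lead_C'))
          for _ in range(1000):
              params = sample_params()
              out = subprocess.run(['./dgemm_test'] + [str(p) for p in params],
                                   stdout=subprocess.PIPE, universal_newlines=True,
                                   check=True)
              writer.writerow((out.stdout.strip(),) + params)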

1.2.6 2017-03-08 Wednesday

  1. Keep looking at dgemm outside of HPL   C EXPERIMENTS PERFORMANCE
    • Use dgemm_test at commit 0455edcb0af1eb673725959d216137997fc40fd2. Run 1000 experiments.
    • Here, the variable product is sampled uniformly at random in [1, 2000³]. Then, the three sizes are set to ⌊product^(1/3)⌋.
    • The leading dimensions are equal to the sizes.
    • Analysis in R:

      result <- read.csv('~/tmp/3/result.csv')
      head(result)
      
            time size_product lead_product ratio    m    n    k lead_A lead_B lead_C
      1 0.160235    843908625    843908625     1  945  945  945    945    945    945
      2 0.719003   4298942376   4298942376     1 1626 1626 1626   1626   1626   1626
      3 0.783674   4549540393   4549540393     1 1657 1657 1657   1657   1657   1657
      4 0.472595   2656741625   2656741625     1 1385 1385 1385   1385   1385   1385
      5 0.319670   1874516337   1874516337     1 1233 1233 1233   1233   1233   1233
      6 1.131936   6676532387   6676532387     1 1883 1883 1883   1883   1883   1883
      
      library(ggplot2)
      ggplot(result, aes(x=size_product, y=time)) +
          geom_point(shape=1) + ggtitle("Durations of cblas_dgemm as a function of the sizes product.")
      

      dgemm_test_raw.png

      reg <- lm(time ~ size_product, result)
      summary(reg)
      
      
      Call:
      lm(formula = time ~ size_product, data = result)
      
      Residuals:
            Min        1Q    Median        3Q       Max 
      -0.027295 -0.008640 -0.002781  0.005900  0.229935 
      
      Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  1.172e-02  1.087e-03   10.78   <2e-16 ***
      size_product 1.666e-10  2.353e-13  707.87   <2e-16 ***
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      Residual standard error: 0.01716 on 998 degrees of freedom
      Multiple R-squared:  0.998,	Adjusted R-squared:  0.998 
      F-statistic: 5.011e+05 on 1 and 998 DF,  p-value: < 2.2e-16
      
    layout(matrix(c(1,2,3,4),2,2))
    plot(reg)
    

    dgemm_test_lm.png

    • In the above plots, we can observe trends similar to those with HPL_dgemm, albeit less pronounced. The data is slightly heteroscedastic and the residuals do not exactly follow a normal distribution. It seems that there are several “outliers” where dgemm takes significantly more time, i.e. the distribution of the residuals is skewed to the “right”.
    • For instance, entry n°208 was obtained with sizes of 1503 and took 0.807207 seconds. Let’s run this experiment again 100 times (with the command ./dgemm_test 1503 1503 1503 1503 1503 1503). The min and max over all observed times are 0.5813 and 0.6494 respectively. The mean is 0.5897 and the standard deviation is 0.0082.
    • Thus, it seems that this point is a real outlier. We can suppose that this is also true for the other similar points.
    • This outlier is 0.2 seconds larger than the average we got and 0.15 seconds larger than the max, which seems very large. Maybe the process had a “bad” context switch (e.g. if it was moved to another core), but the execution time is not that high, so it seems unlikely.
    • There seems to be a pattern: the outliers appear to happen at regular intervals.

      # df is the result dataframe loaded above; keep the points far from the fit
      x = df[abs(df$time - (1.666e-10*df$size_product + 1.172e-2)) > 5e-2, ]
      x$id = which(abs(df$time - (1.666e-10*df$size_product + 1.172e-2)) > 5e-2)
      x$prev_id = c(0, x$id[1:(length(x$id)-1)])
      x$id_diff = x$id - x$prev_id
      x
      
              time size_product lead_product ratio    m    n    k lead_A lead_B
      37  0.674633   3602686437   3602686437     1 1533 1533 1533   1533   1533
      38  0.409866   2053225511   2053225511     1 1271 1271 1271   1271   1271
      207 1.295097   7055792632   7055792632     1 1918 1918 1918   1918   1918
      208 0.807207   3395290527   3395290527     1 1503 1503 1503   1503   1503
      381 1.079795   5535839609   5535839609     1 1769 1769 1769   1769   1769
      558 0.453775   1869959168   1869959168     1 1232 1232 1232   1232   1232
      657 0.917557   4699421875   4699421875     1 1675 1675 1675   1675   1675
      748 1.233466   6414120712   6414120712     1 1858 1858 1858   1858   1858
      753 0.708934   3884701248   3884701248     1 1572 1572 1572   1572   1572
      914 1.337868   7166730752   7166730752     1 1928 1928 1928   1928   1928
          lead_C  id prev_id id_diff
      37    1533  37       0      37
      38    1271  38      37       1
      207   1918 207      38     169
      208   1503 208     207       1
      381   1769 381     208     173
      558   1232 558     381     177
      657   1675 657     558      99
      748   1858 748     657      91
      753   1572 753     748       5
      914   1928 914     753     161
      

      We see here that the differences between the ids do not seem to be uniformly random. Some of them are small (1, 5), others are large (161, 169, 173, 177), or in between (37, 91, 99).

    • This pattern has been reproduced by running 1000 experiments with a size of 1503. Among the results, 26 are larger than 0.7 (mean of 0.6024, standard deviation of 0.0249, min of 0.5811, max of 0.8363). Here is the sorted list of the differences between the indices of these elements (a sketch of this computation follows the list):

      [1, 1, 1, 1, 1, 1, 2, 4, 4, 5, 7, 7, 10, 15, 20, 25, 28, 32, 42, 42, 43, 53, 108, 200, 201]
      

      A lot of them are small or medium, and two are much larger.
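
      For reference, a small sketch of how such a gap list can be computed (assuming the 1000 measured times are stored in a file, one per line):

      # indices of the measurements above the threshold, then gaps between them
      times = [float(line) for line in open('/tmp/times')]
      idx = [i for i, t in enumerate(times) if t > 0.7]
      print(sorted(b - a for a, b in zip(idx, idx[1:])))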

  2. Time prediction in HPL   C PYTHON R EXPERIMENTS PERFORMANCE HPL
    • Let’s try to predict the time that will be spent in HPL_dgemm, and compare it with the real time. The aim is then to have a cheap SimBLAS: replacing calls to the function by a sleep of the predicted time. We have this printf before the calls to HPL_dgemm:

      printf("line=%d rank=%d m=%d n=%d k=%d lead_A=%d lead_B=%d lead_C=%d expected_time=%f\n",
              __LINE__+3, rank, mp, nn, jb, ldl2, LDU, lda, expected_time);
      

      We do as before: we run HPL with P=Q=4 and N=20000. The trace is dumped in /tmp/trace and stdout is redirected to /tmp/output.

    • Processing of the output:
    import re
    import csv
    reg = re.compile('line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) k=([0-9]+) lead_A=([0-9]+) lead_B=([0-9]+) lead_C=([0-9]+) expected_time=(-?[0-9]+.[0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('line', 'rank', 'n', 'm', 'k', 'lead_A', 'lead_B', 'lead_C', 'expected_time'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        csv_writer.writerow(tuple(match.group(i) for i in range(1,10)))
    process('/tmp/output', '/tmp/sizes.csv')
    
    • We process the trace as before and get a dataframe durations.

      head(durations)
      
          rank     start      end duration     state startline                           startfile endline                             endfile idx
      481    0  3.480994  6.54468 3.063686 Computing       388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 481
      486    0  7.225255 10.24889 3.023633 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 486
      491    0 10.803780 13.82799 3.024215 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 491
      496    0 14.230774 17.26467 3.033897 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 496
      977    0 17.676746 20.58197 2.905229 Computing       388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 977
      982    0 21.258337 24.16961 2.911277 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 982
      
      sizes <- read.csv("/tmp/sizes.csv");
      head(sizes)
      
        line rank    n    m   k lead_A lead_B lead_C expected_time
      1  413    8 5000 4920 120   5000   4920   5000      3.132548
      2  413   12 4920 4920 120   4920   4920   4920      3.082388
      3  413    4 5040 4920 120   5040   4920   5040      3.157628
      4  388    0 4920 4920 120   5040   4920   5040      3.082388
      5  413    5 5040 5040 120   5040   5040   5040      3.234704
      6  413    9 5000 5040 120   5000   5040   5000      3.209012
      
      insert_sizes = function(durations, sizes) {
          stopifnot(nrow(durations)==nrow(sizes))
          ndf = data.frame();
          for(i in (sort(unique(durations$rank)))) {
      	tmp_dur = durations[durations$rank == i,]
      	tmp_sizes = sizes[sizes$rank == i,]
      	stopifnot(nrow(tmp_dur) == nrow(tmp_sizes))
      	stopifnot(tmp_dur$startline == tmp_sizes$line)
      	storage.mode(tmp_sizes$m) <- "double" # avoiding integer overflow when taking the product
      	storage.mode(tmp_sizes$n) <- "double"
      	storage.mode(tmp_sizes$k) <- "double"
      	storage.mode(tmp_sizes$lead_A) <- "double"
      	storage.mode(tmp_sizes$lead_B) <- "double"
      	storage.mode(tmp_sizes$lead_C) <- "double"
      	tmp_dur$m = tmp_sizes$m
      	tmp_dur$n = tmp_sizes$n
      	tmp_dur$k = tmp_sizes$k
      	tmp_dur$lead_A = tmp_sizes$lead_A
      	tmp_dur$lead_B = tmp_sizes$lead_B
      	tmp_dur$lead_C = tmp_sizes$lead_C
      	tmp_dur$lead_product = tmp_sizes$lead_A * tmp_sizes$lead_B * tmp_sizes$lead_C
      	tmp_dur$size_product = tmp_sizes$m * tmp_sizes$n * tmp_sizes$k
      	tmp_dur$ratio = tmp_dur$lead_product/tmp_dur$size_product
      	tmp_dur$expected_time = tmp_sizes$expected_time
      	tmp_dur$absolute_time_diff = tmp_dur$expected_time - tmp_dur$duration
      	tmp_dur$relative_time_diff = (tmp_dur$expected_time - tmp_dur$duration)/tmp_dur$expected_time
      	ndf = rbind(ndf, tmp_dur)
          }
          return(ndf);
      }
      
      result = insert_sizes(durations, sizes)
      head(result)
      
          rank     start      end duration     state startline                           startfile endline                             endfile idx    m    n   k lead_A lead_B lead_C lead_product
      481    0  3.480994  6.54468 3.063686 Computing       388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 481 4920 4920 120   5040   4920   5040 124975872000
      486    0  7.225255 10.24889 3.023633 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 486 4920 4920 120   4920   4920   5040 122000256000
      491    0 10.803780 13.82799 3.024215 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 491 4920 4920 120   4920   4920   5040 122000256000
      496    0 14.230774 17.26467 3.033897 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 496 4920 4920 120   4920   4920   5040 122000256000
      977    0 17.676746 20.58197 2.905229 Computing       388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     388 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 977 4800 4800 120   5040   4800   5040 121927680000
      982    0 21.258337 24.16961 2.911277 Computing       413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c     413 /hpl-2.2/src/pgesv/hpl_pdupdatett.c 982 4800 4800 120   4800   4800   5040 116121600000
          size_product    ratio expected_time absolute_time_diff relative_time_diff
      481   2904768000 43.02439      3.082388           0.018702        0.006067374
      486   2904768000 42.00000      3.082388           0.058755        0.019061520
      491   2904768000 42.00000      3.082388           0.058173        0.018872705
      496   2904768000 42.00000      3.082388           0.048491        0.015731634
      977   2764800000 44.10000      2.933742           0.028513        0.009718987
      982   2764800000 42.00000      2.933742           0.022465        0.007657456
      
    ggplot(result, aes(x=idx, y=absolute_time_diff, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Absolute difference between the expected time and the real time")
    

    trace7_16.png

    ggplot(result, aes(x=start, y=absolute_time_diff, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Absolute difference between the expected time and the real time")
    

    trace8_16.png

    ggplot(result, aes(x=start, y=relative_time_diff, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Relative difference between the expected time and the real time")
    

    trace9_16.png

    ggplot(result[result$start < 200,], aes(x=start, y=relative_time_diff, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Relative difference between the expected time and the real time\n“Large enough” matrices")
    

    trace10_16.png

    for(i in (sort(unique(result$rank)))) {
        print(sum(result[result$rank == i,]$absolute_time_diff))
    }
    
    [1] 1.494745
    [1] 1.343339
    [1] -2.940891
    [1] -1.11672
    [1] 0.466087
    [1] 1.90049
    [1] -3.441326
    [1] -1.564635
    [1] -2.708597
    [1] -1.647053
    [1] 0.027765
    [1] -4.653833
    [1] 2.878523
    [1] 3.572304
    [1] 1.124928
    [1] 3.749203
    
    • We can see several things.
      • There are very large differences between the ranks. We could already see it in the first plots (duration vs size_product), but it is even more obvious here. We should find out why.
      • There are some outliers that may have a very significant impact on the aggregated difference between prediction and reality.
      • The prediction ability of this approach is better than SMPI_SAMPLE’s, but still far from perfect.
  3. Let’s try a cheap SimBLAS   SMPI C PERFORMANCE HPL
    • We can replace the call to HPL_dgemm by the following:

      double expected_time = (1.062e-09)*(double)mp*(double)nn*(double)jb - 2.476e-03;
      if(expected_time > 0)
          smpi_usleep((useconds_t)(expected_time*1e6));
      
    • First test: it works pretty well. We roughly get the same results as with the true call to HPL_dgemm: 2.329e+01 Gflops, against 2.332e+01, 2.305e+01 and 2.315e+01 Gflops. The simulation time is much shorter: about 46 seconds, against about 495 seconds (8 minutes and 15 seconds). Note that, with or without a real call to HPL_dgemm, the time spent outside of the application is much lower: between 6 and 8 seconds. Thus, there is room for new optimizations.
  4. Tracking the other expensive BLAS functions   PERFORMANCE HPL
    • In the file hpl_blas.h, several functions are defined like HPL_dgemm, with #define aliasing them to the real cblas function.
    • We can try to replace them by a no-op, to see if it changes the simulation time significantly.
    • The following table sums up the (very approximate) gain in simulation time we get if we remove each of the functions. We use the same parameters as above for HPL.

      Function     Gain (s)
      HPL_dswap    0.5
      HPL_dcopy    N/A
      HPL_daxpy    0
      HPL_dscal    N/A
      HPL_idamax   N/A
      HPL_dgemv    1
      HPL_dtrsv    0
      HPL_dger     0.5
      HPL_dtrsm    10
      • The function HPL_idamax cannot be removed, since it returns an integer used to index an array.
      • The functions HPL_dscal and HPL_dcopy cannot be removed either, since removing them causes the following error:

        /home/tom/simgrid/src/simix/smx_global.cpp:557: [simix_kernel/CRITICAL] Oops ! Deadlock or code not perfectly clean.
        
    • It is clear that we should now focus on HPL_dtrsm. This function solves a triangular system of equations (an illustrative call is shown after this list).
    • It is also clear that the time spent in the application is not entirely spent in the BLAS functions; we should look for something else.
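
      For reference, this is the kind of operation HPL_dtrsm performs (an illustrative cblas call; the flags and sizes are made up, not the exact ones HPL uses):

      #include <stdlib.h>
      #include <cblas.h>

      int main(void) {
          int n = 4, nrhs = 2;
          /* solve L*X = B in place of B, with L lower triangular, unit diagonal */
          double *L = calloc((size_t)n * n, sizeof(double));
          double *B = calloc((size_t)n * nrhs, sizeof(double));
          for (int i = 0; i < n; i++) L[i + i * n] = 1.0;  /* column-major diagonal */
          cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                      n, nrhs, 1.0, L, n, B, n);
          free(L); free(B);
          return 0;
      }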
  5. Forgot a call to HPL_dgemm   PERFORMANCE HPL
    • I found out that I forgot a place where HPL_dgemm was used.
    • If we remove all additional occurrences of HPL_dgemm, we gain 6 seconds (in addition to the large gain we already had).
    • I thought that it was used only in HPL_pdupdateTT, but it appears that it is also used in HPL_pdrpanllT.
    • The call to HPL_dgemm was correctly traced, but I filtered the results in the R script and kept only those of HPL_pdupdateTT.
    • The printf with the parameters was only present in HPL_pdupdateTT.
    • Consequently, all the visualizations and linear regressions were done with missing data. We should redo them to check whether this changes anything.
  6. Looking at HPL_dtrsm   PERFORMANCE HPL
    • This function is used in a lot of functions: HPL_pdrpan*** and HPL_pdupdate** (each has several variants).
    • By aliasing this function to printf("%s\n", __FILE__) and filtering the output with awk '!a[$0]++' (which removes duplicates), we know that, in our settings, HPL_dtrsm is only used in HPL_pdrpanllT and HPL_pdupdateTT. By sorting with sort and then counting duplicates with uniq -dc, we know that HPL_pdrpanllT (resp. HPL_pdupdateTT) calls our function 78664 times (resp. 2636 times). The pipeline is sketched below.
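
      A sketch of the counting pipeline (assuming the aliased printf output is redirected to a file and nothing else writes to it):

      smpirun ... ./xhpl > /tmp/dtrsm_sites
      awk '!a[$0]++' /tmp/dtrsm_sites    # distinct call sites, in order of first appearance
      sort /tmp/dtrsm_sites | uniq -dc   # count the occurrences of each duplicated site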

1.2.7 2017-03-09 Thursday

  1. Fix HPL_dgemm trace   C TRACING HPL
    • In the old version, the calls to MPI_Wait were made inside the #define, so we were sure that every call to HPL_dgemm was traced by Simgrid. However, the printf for the parameters had to be placed before every call to HPL_dgemm, which is why I missed some of them.
    • Now, the printf is also done inside the #define. Because we need the arguments given to HPL_dgemm there, we can no longer use variadic arguments: we have to spell out all the parameters.
    • The code is now as follows:

      #define  HPL_dgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc)  ({\
          int my_rank, buff=0;\
          MPI_Request request;\
          MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);\
          double expected_time = (1.062e-09)*(double)M*(double)N*(double)K - 2.476e-03;\
          printf("file=%s line=%d rank=%d m=%d n=%d k=%d lead_A=%d lead_B=%d lead_C=%d expected_time=%f\n", __FILE__, __LINE__+3, my_rank, M, N, K, lda, ldb, ldc, expected_time);\
          /* self-send, so that Simgrid records an MPI_Wait (with its call location) just before the dgemm */\
          MPI_Isend(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, &request);\
          MPI_Recv(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, NULL);\
          MPI_Wait(&request, MPI_STATUS_IGNORE);\
          cblas_dgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);\
          /* second self-send: the two MPI_Wait entries bracket the dgemm in the trace */\
          MPI_Isend(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, &request);\
          MPI_Recv(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, NULL);\
          MPI_Wait(&request, MPI_STATUS_IGNORE);\
      })
      
  2. Attempted linear regression of HPL_dgemm: failed, there is a bug somewhere   PYTHON R EXPERIMENTS PERFORMANCE BUG
    • In the previous linear regressions, some calls to HPL_dgemm were missing. Thus, the analysis needs to be done again, just to check whether this changes anything.
    • I tried to run roughly the same process as above, but failed: there seems to be a bug somewhere.
    • Every piece of code is written here. The trace and the output have been obtained with N=5000 and P=Q=4.

    Clean the file:

    pj_dump --user-defined --ignore-incomplete-links /tmp/trace > /tmp/trace.csv
    grep "State," /tmp/trace.csv | grep MPI_Wait | sed -e 's/()//' -e 's/MPI_STATE, //ig'  -e 's/State, //ig' -e 's/rank-//' -e\
    's/PMPI_/MPI_/' | grep MPI_  | tr 'A-Z' 'a-z' > /tmp/trace_processed.csv
    

    Clean the paths:

    import re
    reg = re.compile('((?:[^/])*)(?:/[a-zA-Z0-9_-]*)*((?:/hpl-2.2(?:/[a-zA-Z0-9_-]*)*).*)')
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                for line in in_f:
                    match = reg.match(line)
                    out_f.write('%s%s\n' % (match.group(1), match.group(2)))
    process('/tmp/trace_processed.csv', '/tmp/trace_cleaned.csv')
    
    df <- read.csv("/tmp/trace_cleaned.csv", header=F, strip.white=T, sep=",");
    names(df) = c("rank", "start", "end", "duration", "level", "state", "Filename", "Linenumber");
    head(df)
    
      rank    start      end duration level    state
    1    8 0.207257 0.207257        0     0 mpi_wait
    2    8 0.207275 0.207275        0     0 mpi_wait
    3    8 0.207289 0.207289        0     0 mpi_wait
    4    8 0.207289 0.207289        0     0 mpi_wait
    5    8 0.207309 0.207309        0     0 mpi_wait
    6    8 0.207309 0.207309        0     0 mpi_wait
                                Filename Linenumber
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c        222
    
    duration_compute = function(df) {
        # For each rank, the gap between two consecutive trace events is the
        # time spent computing between two traced MPI_Wait calls.
        ndf = data.frame();
        df = df[with(df,order(rank,start)),];
        #origin = unique(df$origin)
        for(i in (sort(unique(df$rank)))) {
    	start     = df[df$rank==i,]$start;
    	end       = df[df$rank==i,]$end;
    	l         = length(end);
    	end       = c(0,end[1:(l-1)]); # Computation starts at time 0
    
    	startline = c(0, df[df$rank==i,]$Linenumber[1:(l-1)]);
    	startfile = c("", as.character(df[df$rank==i,]$Filename[1:(l-1)]));
    	endline   = df[df$rank==i,]$Linenumber;
    	endfile   = df[df$rank==i,]$Filename;
    
    	ndf       = rbind(ndf, data.frame(rank=i, start=end, end=start,
    	    duration=start-end, state="Computing",
    	    startline=startline, startfile=startfile, endline=endline,
    	    endfile=endfile));
        }
        ndf$idx = 1:length(ndf$duration)
        ndf;
    }
    durations = duration_compute(df);
    durations = durations[as.character(durations$startfile) == as.character(durations$endfile) &
        durations$startline == durations$endline,]
    
    head(durations)
    
      rank    start      end duration     state startline
    2    0 0.207097 0.207149  5.2e-05 Computing       222
    3    0 0.207149 0.207179  3.0e-05 Computing       222
    4    0 0.207179 0.207179  0.0e+00 Computing       222
    5    0 0.207179 0.207194  1.5e-05 Computing       222
    6    0 0.207194 0.207194  0.0e+00 Computing       222
    7    0 0.207194 0.207207  1.3e-05 Computing       222
                               startfile endline                            endfile
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     222 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     222 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     222 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     222 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     222 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    7 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     222 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
      idx
    2   2
    3   3
    4   4
    5   5
    6   6
    7   7
    
    unique(durations[c("startfile", "startline")])
    
                                  startfile startline
    2    /hpl-2.2/src/pfact/hpl_pdrpanllt.c       222
    14         /hpl-2.2/src/comm/hpl_sdrv.c       191
    478      /hpl-2.2/src/pgesv/hpl_rollt.c       242
    481 /hpl-2.2/src/pgesv/hpl_pdupdatett.c       384
    486 /hpl-2.2/src/pgesv/hpl_pdupdatett.c       407
    

    We need to check each of these to see whether it is indeed a call to HPL_dgemm, or something else. It appears that HPL_rollT and HPL_sdrv do not call HPL_dgemm; they just call MPI_Wait. Thus, we have to remove them.

    durations = durations[durations$startfile != "/hpl-2.2/src/comm/hpl_sdrv.c" & durations$startfile != "/hpl-2.2/src/pgesv/hpl_rollt.c",]
    unique(durations[c("startfile", "startline")])
    
                                  startfile startline
    2    /hpl-2.2/src/pfact/hpl_pdrpanllt.c       222
    481 /hpl-2.2/src/pgesv/hpl_pdupdatett.c       384
    486 /hpl-2.2/src/pgesv/hpl_pdupdatett.c       407
    

    Now, let us get what was output by the printf.

    Processing the output:

    import re
    import csv
    reg = re.compile('file=([a-zA-Z0-9/_.-]+) line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) k=([0-9]+) lead_A=([0-9]+) lead_B=([0-9]+) lead_C=([0-9]+) expected_time=(-?[0-9]+.[0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                # note: the printf emits m before n, so the 'n' and 'm' labels below are
                # swapped; this is harmless here since only the product m*n*k is used
                csv_writer.writerow(('file', 'line', 'rank', 'n', 'm', 'k', 'lead_A', 'lead_B', 'lead_C', 'expected_time'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        result = list(match.group(i) for i in range(1, 11))
                        result[0] = result[0][result[0].index('/hpl'):].lower()
                        csv_writer.writerow(result)
    process('/tmp/output', '/tmp/parameters.csv')
    
    parameters <- read.csv("/tmp/parameters.csv");
    head(parameters)
    
                                    file line rank    n  m k lead_A lead_B lead_C
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 1320 60 0   1320    120   1320
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    8 1200 60 0   1200    120   1200
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 1320 30 0   1320    120   1320
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    4 1280 60 0   1280    120   1280
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 1320 16 0   1320    120   1320
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 1320  8 0   1320    120   1320
      expected_time
    1     -0.002476
    2     -0.002476
    3     -0.002476
    4     -0.002476
    5     -0.002476
    6     -0.002476
    

    A first remark: we see that some rows have k=0, which is a bit surprising. I double-checked by adding some printf in the files: this is not a bug. It only happens in HPL_pdrpanllT, so it went unnoticed until now.

    nrow(parameters)
    nrow(durations)
    nrow(parameters[parameters$file == "/hpl-2.2/src/pfact/hpl_pdrpanllt.c",])
    nrow(durations[durations$startfile == "/hpl-2.2/src/pfact/hpl_pdrpanllt.c",])
    
    [1] 20300
    [1] 29964
    [1] 19664
    [1] 29328
    
    • There is obviously something wrong. We should have a one-to-one correspondence between the elements of the parameters dataframe and the elements of the durations dataframe. It seems that SMPI has produced additional entries in the trace, or that some of the printf I put disappeared.
    • This is not an error in parsing the output (e.g. some lines not parsed because of a wrong format/regexp). The output file has 20359 lines.
    • Tried putting a printf("blabla\n") just before HPL_dgemm in the file HPL_pdrpanllT.c and counted the number of times it appeared. Exactly the same number, so this is definitely not an issue with the parsing or with the #define.
    • Checked the durations dataframe. Nothing apparently wrong: all the entries for this file are at the same line, so I did not miss a hidden MPI_Wait somewhere else in this same file.
  3. Using another way to measure durations   C PYTHON R EXPERIMENTS TRACING PERFORMANCE HPL
    • Let’s use something other than the SMPI trace to measure durations: we will measure the time directly in the code. But first we need to check that this new measurement is consistent with what we got from the traces.
    • Now, HPL_dgemm is defined as:
    #define  HPL_dgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc)  ({\
      int my_rank, buff=0;\
      MPI_Request request;\
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);\
      double expected_time = (1.062e-09)*(double)M*(double)N*(double)K - 2.476e-03;\
      struct timeval before = {};\
      struct timeval after = {};\
      gettimeofday(&before, NULL);\
      MPI_Isend(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, &request);\
      MPI_Recv(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, NULL);\
      MPI_Wait(&request, MPI_STATUS_IGNORE);\
      cblas_dgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);\
      gettimeofday(&after, NULL);\
      MPI_Isend(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, &request);\
      MPI_Recv(&buff, 1, MPI_INT, my_rank, 0, MPI_COMM_WORLD, NULL);\
      MPI_Wait(&request, MPI_STATUS_IGNORE);\
      double time_before = (double)(before.tv_sec) + (double)(before.tv_usec)*1e-6;\
      double time_after = (double)(after.tv_sec) + (double)(after.tv_usec)*1e-6;\
      double real_time = time_after-time_before;\
      printf("file=%s line=%d rank=%d m=%d n=%d k=%d lead_A=%d lead_B=%d lead_C=%d real_time=%f expected_time=%f\n", __FILE__, __LINE__, my_rank, M, N, K, lda, ldb, ldc, real_time, expected_time);\
    })
    
    • We run the same code as above to get the durations frame.
    head(durations)
    
      rank    start      end duration     state startline
    2    0 0.275856 0.275896  4.0e-05 Computing       224
    3    0 0.275896 0.275929  3.3e-05 Computing       224
    4    0 0.275929 0.275929  0.0e+00 Computing       224
    5    0 0.275929 0.275948  1.9e-05 Computing       224
    6    0 0.275948 0.275948  0.0e+00 Computing       224
    7    0 0.275948 0.275965  1.7e-05 Computing       224
                               startfile endline                            endfile
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     224 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     224 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     224 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     224 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     224 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
    7 /hpl-2.2/src/pfact/hpl_pdrpanllt.c     224 /hpl-2.2/src/pfact/hpl_pdrpanllt.c
      idx
    2   2
    3   3
    4   4
    5   5
    6   6
    7   7
    

    Now, we process the parameters:

    import re
    import csv
    reg = re.compile('file=([a-zA-Z0-9/_.-]+) line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) k=([0-9]+) lead_A=([0-9]+) lead_B=([0-9]+) lead_C=([0-9]+) real_time=(-?[0-9]+.[0-9]+) expected_time=(-?[0-9]+.[0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('file', 'line', 'rank', 'n', 'm', 'k', 'lead_A', 'lead_B', 'lead_C', 'real_time', 'expected_time'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        result = list(match.group(i) for i in range(1, 12))
                        result[0] = result[0][result[0].index('/hpl'):].lower()
                        csv_writer.writerow(result)
    process('/tmp/output', '/tmp/parameters.csv')
    
    parameters <- read.csv("/tmp/parameters.csv");
    head(parameters)
    
                                    file line rank    n  m k lead_A lead_B lead_C
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  224    0 1320 60 0   1320    120   1320
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  224    0 1320 30 0   1320    120   1320
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  224    0 1320 16 0   1320    120   1320
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  224    0 1320  8 0   1320    120   1320
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  224    0 1320  4 0   1320    120   1320
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  224    0 1320  2 0   1320    120   1320
      real_time expected_time
    1   8.1e-05     -0.002476
    2   0.0e+00     -0.002476
    3   0.0e+00     -0.002476
    4   0.0e+00     -0.002476
    5   1.0e-06     -0.002476
    6   0.0e+00     -0.002476
    

    We merge the durations and parameters dataframes, keeping only the entries for the file hpl_pdupdatett.c (we cannot do it for the other file since we have a mismatch).

    insert_sizes = function(durations, sizes) {
        stopifnot(nrow(durations)==nrow(sizes))
        ndf = data.frame();
        for(i in (sort(unique(durations$rank)))) {
    	tmp_dur = durations[durations$rank == i,]
    	tmp_sizes = sizes[sizes$rank == i,]
    	stopifnot(nrow(tmp_dur) == nrow(tmp_sizes))
    	stopifnot(tmp_dur$startline == tmp_sizes$line)
    	storage.mode(tmp_sizes$m) <- "double" # avoiding integer overflow when taking the product
    	storage.mode(tmp_sizes$n) <- "double"
    	storage.mode(tmp_sizes$k) <- "double"
    	storage.mode(tmp_sizes$lead_A) <- "double"
    	storage.mode(tmp_sizes$lead_B) <- "double"
    	storage.mode(tmp_sizes$lead_C) <- "double"
    	tmp_dur$m = tmp_sizes$m
    	tmp_dur$n = tmp_sizes$n
    	tmp_dur$k = tmp_sizes$k
    	tmp_dur$lead_A = tmp_sizes$lead_A
    	tmp_dur$lead_B = tmp_sizes$lead_B
    	tmp_dur$lead_C = tmp_sizes$lead_C
    	tmp_dur$lead_product = tmp_sizes$lead_A * tmp_sizes$lead_B * tmp_sizes$lead_C
    	tmp_dur$size_product = tmp_sizes$m * tmp_sizes$n * tmp_sizes$k
    	tmp_dur$ratio = tmp_dur$lead_product/tmp_dur$size_product
    	tmp_dur$real_time = tmp_sizes$real_time
    	tmp_dur$expected_time = tmp_sizes$expected_time
    	tmp_dur$absolute_time_diff = tmp_dur$expected_time - tmp_dur$duration
    	tmp_dur$relative_time_diff = (tmp_dur$expected_time - tmp_dur$duration)/tmp_dur$expected_time
    	ndf = rbind(ndf, tmp_dur)
        }
        return(ndf);
    }
    
    result = insert_sizes(durations[durations$startfile == "/hpl-2.2/src/pgesv/hpl_pdupdatett.c",], parameters[parameters$file == "/hpl-2.2/src/pgesv/hpl_pdupdatett.c",])
    

    Now we plot the time measured by SMPI traces against the time measured by gettimeofday.

    library(ggplot2)
    ggplot(result, aes(x=duration, y=real_time)) +
        geom_point(shape=1) + ggtitle("Time measured by SMPI against time measured by gettimeofday")
    

    gettimeofday.png

    Checking with a linear regression, just to be sure:

    summary(lm(duration~real_time, data=result))
    
    
    Call:
    lm(formula = duration ~ real_time, data = result)
    
    Residuals:
           Min         1Q     Median         3Q        Max 
    -4.917e-05 -4.088e-06  1.075e-06  5.261e-06  6.181e-05 
    
    Coefficients:
                  Estimate Std. Error    t value Pr(>|t|)    
    (Intercept) -2.617e-06  6.285e-07     -4.163 3.57e-05 ***
    real_time    9.999e-01  7.058e-06 141678.252  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.034e-05 on 634 degrees of freedom
    Multiple R-squared:      1,	Adjusted R-squared:      1 
    F-statistic: 2.007e+10 on 1 and 634 DF,  p-value: < 2.2e-16
    

    It is not perfect, but it looks pretty great. So, let’s use this to measure time.

  4. Now we can finally re-do the analysis of HPL_dgemm   R EXPERIMENTS PERFORMANCE HPL
    • There are fewer things to do, since all the data now come from the output file.
    • Recall the aim of doing this again: in the previous analysis, some calls to HPL_dgemm were missing. Thus, it needs to be done again, just to check whether this changes anything.
    • Generate the CSV file by running the same Python script as in the previous section (the output format did not change).
    • Then, analysis in R:
    results <- read.csv("/tmp/parameters.csv");
    head(results)
    
                                    file line rank    n  m k lead_A lead_B lead_C
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040 60 0   5040    120   5040
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040 30 0   5040    120   5040
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040 16 0   5040    120   5040
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040  8 0   5040    120   5040
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040  4 0   5040    120   5040
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    8 5000 60 0   5000    120   5000
      real_time expected_time
    1   5.7e-05     -0.002476
    2   7.0e-06     -0.002476
    3   0.0e+00     -0.002476
    4   0.0e+00     -0.002476
    5   0.0e+00     -0.002476
    6   9.0e-06     -0.002476
    
    process_results = function(results) {
        storage.mode(results$m) <- "double" # avoiding integer overflow when taking the product
        storage.mode(results$n) <- "double"
        storage.mode(results$k) <- "double"
        storage.mode(results$lead_A) <- "double"
        storage.mode(results$lead_B) <- "double"
        storage.mode(results$lead_C) <- "double"
        results$lead_product = results$lead_A * results$lead_B * results$lead_C
        results$size_product = results$m * results$n * results$k
        results$ratio = results$lead_product/results$size_product
        results$absolute_time_diff = results$expected_time - results$real_time
        results$relative_time_diff = (results$expected_time - results$real_time)/results$expected_time
        results$idx = 1:length(results$rank)
        return(results);
    }
    
    results = process_results(results)
    head(results)
    
                                    file line rank    n  m k lead_A lead_B lead_C
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040 60 0   5040    120   5040
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040 30 0   5040    120   5040
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040 16 0   5040    120   5040
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040  8 0   5040    120   5040
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    0 5040  4 0   5040    120   5040
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  222    8 5000 60 0   5000    120   5000
      real_time expected_time lead_product size_product ratio absolute_time_diff
    1   5.7e-05     -0.002476   3048192000            0   Inf          -0.002533
    2   7.0e-06     -0.002476   3048192000            0   Inf          -0.002483
    3   0.0e+00     -0.002476   3048192000            0   Inf          -0.002476
    4   0.0e+00     -0.002476   3048192000            0   Inf          -0.002476
    5   0.0e+00     -0.002476   3048192000            0   Inf          -0.002476
    6   9.0e-06     -0.002476   3000000000            0   Inf          -0.002485
      relative_time_diff idx
    1           1.023021   1
    2           1.002827   2
    3           1.000000   3
    4           1.000000   4
    5           1.000000   5
    6           1.003635   6
    
    library(ggplot2)
    ggplot(results, aes(x=idx, y=real_time, color=factor(file))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dgemm")
    

    trace_gettimeofday1_16.png

    This is the plot of the duration of HPL_dgemm over time (analogous to the duration vs start plot that we had). The part for hpl_pdupdatett looks exactly as before. We see that the calls to HPL_dgemm in hpl_pdrpanllt are always very short.

    library(ggplot2)
    ggplot(results, aes(x=size_product, y=real_time, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dgemm")
    

    trace_gettimeofday2_16.png

    Unsurprisingly, we find exactly the same kind of plot as before, since all the new calls to HPL_dgemm are very short and thus hidden in the left part of the graph.

    reg <- lm(duration~I(m*n*k), data=result)
    summary(reg)
    
    
    Call:
    lm(formula = duration ~ I(m * n * k), data = result)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.004843 -0.001337 -0.000024  0.000280  0.055746 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  2.393e-04  2.182e-04   1.097    0.273    
    I(m * n * k) 1.064e-09  2.615e-12 406.932   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.003594 on 634 degrees of freedom
    Multiple R-squared:  0.9962,	Adjusted R-squared:  0.9962 
    F-statistic: 1.656e+05 on 1 and 634 DF,  p-value: < 2.2e-16
    
    layout(matrix(c(1,2,3,4),2,2))
    plot(reg)
    

    reg_gettimeofday_16.png

    The summary of the linear regression shows that the factor for m*n*k barely changed. The intercept is very different, but its t-value is too low, so it is not meaningful. The residuals vs fitted plot seems to look better, with no more heteroscedasticity. My guess is that we added a lot of points with very low values, so their weight hides the problem. The QQ-plot still looks problematic.

  5. Replacing HPL_dgemm by smpi_usleep again   SMPI PERFORMANCE HPL
    • As for the printf, we will put the smpi_usleep in the #define (a sketch is given below). We take the coefficients of the latest linear regression.
    • Testing: we still get the same number of Gflops (about 23 Gflops) but the simulation runs in 41 seconds now.
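
    A plausible sketch of the new definition (an assumption: I take both coefficients from the latest fit above, even though the intercept's significance is doubtful; smpi_usleep converts the modeled duration into simulated time):

      #define HPL_dgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc) ({\
          double expected_time = (1.064e-09)*(double)M*(double)N*(double)K + 2.393e-04;\
          if(expected_time > 0)\
              smpi_usleep((useconds_t)(expected_time*1e6));\
      })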

1.2.8 2017-03-10 Friday

  1. Tracing HPL_dtrsm   C PYTHON R EXPERIMENTS TRACING PERFORMANCE
    • The goal is to do something similar for HPL_dtrsm. As a first step, we will trace the parameters used to call it and its durations; then we will do a linear regression, to finally replace it by a smpi_usleep.
    • Recall that this function solves a triangular system of equations: it updates an m × n matrix using a triangular matrix. Since one of the dimensions stays bounded by the block size in our runs, we expect the duration to scale roughly as O(m*n).
    • Replace the definition of HPL_dtrsm in hpl_blas.h by the following:
    #define HPL_dtrsm(layout, Side, Uplo, TransA, Diag, M, N, alpha, A, lda, B, ldb) ({\
        int my_rank, buff=0;\
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);\
        struct timeval before = {};\
        struct timeval after = {};\
        gettimeofday(&before, NULL);\
        cblas_dtrsm(layout, Side, Uplo, TransA, Diag, M, N, alpha, A, lda, B, ldb);\
        gettimeofday(&after, NULL);\
        double time_before = (double)(before.tv_sec) + (double)(before.tv_usec)*1e-6;\
        double time_after = (double)(after.tv_sec) + (double)(after.tv_usec)*1e-6;\
        double real_time = time_after-time_before;\
        printf("file=%s line=%d rank=%d m=%d n=%d lead_A=%d lead_B=%d real_time=%f\n", __FILE__, __LINE__, my_rank, M, N, lda, ldb, real_time);\
    })
    
    • Run the simulation:
    smpirun --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 --cfg=smpi/display-timing:yes\
    --cfg=smpi/privatize-global-variables:yes -np 16 -hostfile ../../../small_tests/hostfile_64.txt -platform\
    ../../../small_tests/cluster_fat_tree_64.xml ./xhpl > /tmp/output
    
    • Process the output file:
    import re
    import csv
    reg = re.compile('file=([a-zA-Z0-9/_.-]+) line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) lead_A=([0-9]+) lead_B=([0-9]+) real_time=(-?[0-9]+.[0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('file', 'line', 'rank', 'n', 'm', 'lead_A', 'lead_B', 'real_time'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        result = list(match.group(i) for i in range(1, 9))
                        result[0] = result[0][result[0].index('/hpl'):].lower()
                        csv_writer.writerow(result)
    process('/tmp/output', '/tmp/parameters.csv')
    
    • Analysis in R:
    results <- read.csv("/tmp/parameters.csv");
    head(results)
    
                                    file line rank  n m lead_A lead_B real_time
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8 60 0    120    120  0.000102
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8 30 0    120    120  0.000013
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8 16 0    120    120  0.000000
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8  8 0    120    120  0.000000
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8  4 0    120    120  0.000000
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8  2 0    120    120  0.000000
    
    process_results = function(results) {
        storage.mode(results$m) <- "double" # avoiding integer overflow when taking the product
        storage.mode(results$n) <- "double"
        storage.mode(results$lead_A) <- "double"
        storage.mode(results$lead_B) <- "double"
        results$lead_product = results$lead_A * results$lead_B
        results$size_product = results$m * results$n
        results$ratio = results$lead_product/results$size_product
     #  results$absolute_time_diff = results$expected_time - results$real_time
     #  results$relative_time_diff = (results$expected_time - results$real_time)/results$expected_time
        results$idx = 1:length(results$rank)
        return(results);
    }
    
    results = process_results(results)
    head(results)
    
                                    file line rank  n m lead_A lead_B real_time
    1 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8 60 0    120    120  0.000102
    2 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8 30 0    120    120  0.000013
    3 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8 16 0    120    120  0.000000
    4 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8  8 0    120    120  0.000000
    5 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8  4 0    120    120  0.000000
    6 /hpl-2.2/src/pfact/hpl_pdrpanllt.c  171    8  2 0    120    120  0.000000
      lead_product size_product ratio idx
    1        14400            0   Inf   1
    2        14400            0   Inf   2
    3        14400            0   Inf   3
    4        14400            0   Inf   4
    5        14400            0   Inf   5
    6        14400            0   Inf   6
    
    library(ggplot2)
    ggplot(results, aes(x=idx, y=real_time, color=factor(file))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dtrsm")
    

    trace_dtrsm1_16.png

    We can observe a trend similar to HPL_dgemm. The function is only used in two places, HPL_pdrpanllT and HPL_pdupdateTT. In the former, all the calls are very short, whereas in the latter, the calls are long at the beginning and become shorter throughout the execution. We also have some outliers.

    library(ggplot2)
    ggplot(results, aes(x=size_product, y=real_time, color=factor(rank))) +
        geom_point(shape=1) + ggtitle("Durations of HPL_dtrsm")
    

    trace_dtrsm2_16.png

    As expected, the duration looks proportional to the product of the sizes.

    reg <- lm(real_time~I(m*n), data=results)
    summary(reg)
    
    
    Call:
    lm(formula = real_time ~ I(m * n), data = results)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.002999  0.000010  0.000010  0.000010  0.043651 
    
    Coefficients:
                  Estimate Std. Error  t value Pr(>|t|)    
    (Intercept) -1.042e-05  2.445e-06   -4.263 2.02e-05 ***
    I(m * n)     9.246e-08  3.915e-11 2361.957  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.0006885 on 81298 degrees of freedom
    Multiple R-squared:  0.9856,	Adjusted R-squared:  0.9856 
    F-statistic: 5.579e+06 on 1 and 81298 DF,  p-value: < 2.2e-16
    
    layout(matrix(c(1,2,3,4),2,2))
    plot(reg)
    

    reg_dtrsm_16.png

    The R-squared is high and both the intercept and the sizes have a significant impact. However, the outliers are even more concerning than with HPL_dgemm. The Q-Q plot shows a large tail, and the residuals vs leverage plot shows that these outliers are non-negligible in the linear regression (i.e. if we removed them, the coefficients would change significantly).

  2. Replacing HPL_dtrsm by smpi_usleep   SMPI PERFORMANCE HPL
    • Similarly to what has been done with HPL_dgemm, we use the coefficients found with the linear regression to replace the function by a sleep.
    #define HPL_dtrsm(layout, Side, Uplo, TransA, Diag, M, N, alpha, A, lda, B, ldb) ({\
        double expected_time = (9.246e-08)*(double)M*(double)N - 1.042e-05;\
        if(expected_time > 0)\
            smpi_usleep((useconds_t)(expected_time*1e6));\
    })
    
    • Running HPL again. We get the expected speed (about 23 Gflops) and a simulation time of 29 seconds (a gain of 12 seconds).
  3. Having a look at malloc   PYTHON R PERFORMANCE HPL
    • To run HPL with larger matrices, we need to replace some calls to malloc (resp. free) by SMPI_SHARED_MALLOC (resp. SMPI_SHARED_FREE).
    • Firstly, let’s see where the big allocations are.
    • Define MY_MALLOC in hpl.h as follows:
    #define MY_MALLOC(n) ({\
        int my_rank;\
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);\
        printf("file=%s line=%d rank=%d size=%lu\n", __FILE__, __LINE__, my_rank, n);\
        malloc(n);\
    })
    
    • Replace all the calls to malloc in the files by MY_MALLOC:
    grep -l malloc testing/**/*.c src/**/*.c | xargs sed -i 's/malloc/MY_MALLOC/g'
    
    • Run smpirun (N=20000, P=Q=4) and redirect the output to /tmp/output.
    • Process the output file:
    import re
    import csv
    reg = re.compile('file=([a-zA-Z0-9/_.-]+) line=([0-9]+) rank=([0-9]+) size=([0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('file', 'line', 'rank', 'size'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        result = list(match.group(i) for i in range(1, 5))
                        result[0] = result[0][result[0].index('/hpl'):].lower()
                        csv_writer.writerow(result)
    process('/tmp/output', '/tmp/malloc.csv')
    
    • Analysis in R:
    results <- read.csv("/tmp/malloc.csv");
    head(results)
    
                                file line rank size
    1 /hpl-2.2/src/grid/hpl_reduce.c  127    0    4
    2 /hpl-2.2/src/grid/hpl_reduce.c  127    1    4
    3 /hpl-2.2/src/grid/hpl_reduce.c  127    2    4
    4 /hpl-2.2/src/grid/hpl_reduce.c  127    3    4
    5 /hpl-2.2/src/grid/hpl_reduce.c  127    4    4
    6 /hpl-2.2/src/grid/hpl_reduce.c  127    5    4
    
    library(ggplot2)
    ggplot(results, aes(x=file, y=size)) +
        geom_boxplot() + ggtitle("Sizes of malloc") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
    

    trace_malloc1_16.png

    storage.mode(results$size) <- "double" # avoiding integer overflow when taking the sum
    aggregated_results = aggregate(results$size, by=list(file=results$file), FUN=sum)
    head(aggregated_results)
    
                                       file           x
    1         /hpl-2.2/src/comm/hpl_packl.c     9034816
    2        /hpl-2.2/src/grid/hpl_reduce.c     3200736
    3 /hpl-2.2/src/panel/hpl_pdpanel_init.c 11592866048
    4  /hpl-2.2/src/panel/hpl_pdpanel_new.c        3456
    5     /hpl-2.2/src/pauxil/hpl_pdlange.c     2560032
    6       /hpl-2.2/src/pfact/hpl_pdfact.c     2645504
    
    library(ggplot2)
    ggplot(aggregated_results, aes(x=file, y=x)) +
        geom_boxplot() + ggtitle("Sizes of malloc") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
    

    trace_malloc2_16.png

    There are several things to notice:

    • The biggest chunks are allocated in HPL_pdtest. These are the local matrices of each process.
    • However, regarding the total quantity of allocated memory, HPL_pdpanel_init is the clear winner.
    • In these tests, htop reported that about 20% of the 16GB of my laptop’s memory was used, i.e. about 3.2GB. We use a matrix of size 20000 and each element is a double (8 bytes), so the total amount of memory for the whole matrix is 20000² × 8 B = 3.2GB.
    • Thus, it seems that the mallocs done in HPL_pdpanel_init are in fact negligible. A hypothesis is that they are quickly followed by a free.
    • Verifying that every process allocates the same thing:
    library(ggplot2)
    ggplot(results[results$file == "/hpl-2.2/testing/ptest/hpl_pdtest.c",], aes(x="", y=size, fill=factor(rank))) +
        coord_polar("y", start=0) +
        geom_bar(width=1, stat="identity") +
        ggtitle("Sizes of malloc in HPL_pdtest")
    

    trace_malloc3_16.png

    res_pdtest = results[results$file == "/hpl-2.2/testing/ptest/hpl_pdtest.c",]
    unique(res_pdtest[order(res_pdtest$size),]$size)
    
    [1] 193729992 196879432 198454152 200080072 201680392 203293512
    
    • The different calls to malloc in HPL_pdtest have approximately the same size, but not exactly. This is understandable: P and Q may not divide the matrix size evenly. Maybe this could cause SMPI_SHARED_MALLOC to not work properly?
  4. Attempt to use SMPI_SHARED_MALLOC and SMPI_SHARED_FREE in HPL   SMPI PERFORMANCE BUG HPL
    • Revert the previous changes regarding malloc.
    • In file hpl_pdtest.c, replace malloc by SMPI_SHARED_MALLOC and free by SMPI_SHARED_FREE.
    • Run HPL with Simgrid. Two issues:

      • The memory consumption stays the same, about 20% of my laptop’s memory. A first guess would be that the SHARED_MALLOC did not work and a new allocation was made for every process. Maybe because different sizes were given?
      • The execution time (both virtual and real) decreased significantly. The virtual time dropped from 233 to 223 seconds, the real time from 28 to 15 seconds. Setting the first point aside, a guess could be that SHARED_MALLOC worked properly and resulted in a lower number of cache misses (since all processes share the same sub-matrix), thus improving performances. This is an experimental bias; we should avoid it.

      The fact that we have these two issues combined is very surprising.

    • Let’s try to see whether the SHARED_MALLOC makes only one allocation or not, by adding some printf in its implementation (a sketch of the mechanism is given at the end of this entry).
      • The path shmalloc_global is taken.
      • The bogusfile is created only once, as expected.
      • Then, every process maps the file in memory, chunk by chunk. The base address is not the same for every process, but this is not an issue (we are speaking of virtual memory here).
    • Tested my matrix product program. Got 34% memory utilization, 44 virtual seconds and 8 real seconds with SMPI_SHARED_MALLOC, but 11% memory utilization, 81 virtual seconds and 7 real seconds with malloc. Very strange.
    • Hypothesis: either the measure of the memory consumption is broken, or SHARED_MALLOC is broken.
    • Try to use something other than htop:

      watch -n 0.1 cat /proc/meminfo
      
      • With malloc and free, the available memory drops from 14.4 GB to 11.0 GB.
      • With SMPI_SHARED_MALLOC and SMPI_SHARED_FREE, the available memory drops from 14.4 GB to 14.1 GB.

      This seems more coherent, so htop is a poor tool for measuring memory consumption when using SMPI_SHARED_MALLOC. But this does not solve the time issue.
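
    For reference, a minimal sketch of the mechanism observed above (an assumption: heavily simplified, not Simgrid's actual code, which maps the file chunk by chunk). Every "allocation" maps the same backing file with MAP_SHARED, so all processes share the same physical pages even though their virtual addresses differ, which would explain why the available memory barely moves:

      #include <fcntl.h>
      #include <stddef.h>
      #include <sys/mman.h>
      #include <unistd.h>

      static void *shared_malloc_sketch(size_t size) {
          /* In Simgrid, the bogusfile is created only once and then reused. */
          int fd = open("/tmp/bogusfile", O_RDWR | O_CREAT, 0600);
          if (fd == -1) return NULL;
          if (ftruncate(fd, (off_t)size) == -1) { close(fd); return NULL; }
          /* MAP_SHARED: all mappings of this file share the same physical pages. */
          void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          close(fd);  /* the mapping survives the close */
          return mem == MAP_FAILED ? NULL : mem;
      }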

1.2.9 2017-03-12 Sunday

  1. Experiment with SMPI macros in the matrix product code   C R EXPERIMENTS PERFORMANCE
    • Use the matrix product code, at commit 91633ea99463109736b900c92f2eacc84630e5b5. Run 10 tests with or without SMPI_SHARED_MALLOC and SMPI_SAMPLE with a matrix size of 4000 and 64 processes, by running the command:

      ./smpi_macros.py 10 /tmp/results.csv
      
    • Analysis, in R:
    results <- read.csv("/tmp/results.csv");
    head(results)
    
          time size smpi_sample smpi_malloc
    1 2.134820 4000           1           1
    2 2.608971 4000           0           0
    3 3.767625 4000           1           0
    4 2.412387 4000           0           1
    5 3.767162 4000           1           0
    6 2.497480 4000           0           0
    

    We already see that the case where we use SMPI_SAMPLE but not SMPI_SHARED_MALLOC seems to be different from the others.

    res_aov = aov(time~(smpi_sample + smpi_malloc)^2, data=results)
    summary(res_aov)
    
                            Df Sum Sq Mean Sq F value   Pr(>F)    
    smpi_sample              1  1.202   1.202   9.227  0.00442 ** 
    smpi_malloc              1  4.579   4.579  35.163 8.62e-07 ***
    smpi_sample:smpi_malloc  1  8.332   8.332  63.981 1.68e-09 ***
    Residuals               36  4.688   0.130                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    suppressWarnings(suppressMessages(library(FrF2))) # FrF2 outputs a bunch of useless messages...
    MEPlot(res_aov, abbrev=4, select=c(1, 2), response="time")
    

    smpi_macros_1.png

    IAPlot(res_aov, abbrev=4, show.alias=FALSE, select=c(1, 2))
    

    smpi_macros_2.png

    mean(results[results$smpi_sample == 0 & results$smpi_malloc == 0,]$time)
    mean(results[results$smpi_sample == 0 & results$smpi_malloc == 1,]$time)
    mean(results[results$smpi_sample == 1 & results$smpi_malloc == 0,]$time)
    mean(results[results$smpi_sample == 1 & results$smpi_malloc == 1,]$time)
    
    [1] 2.513953
    [1] 2.750056
    [1] 3.773385
    [1] 2.183901
    
    • In this small experiment, we see that both macros have a non-negligible impact on the time estimated by SMPI. When neither of the optimizations is used, adding one of them decreases the application’s performances. When one of the optimizations is already used, adding the other one increases the application’s performances.
    • When I added the SMPI macros in matmul.c, I first added SMPI_SHARED_MALLOC and then SMPI_SAMPLE_GLOBAL (see the entry for 13/02/2017). According to the tests above, the variation here is not huge (I did not try the configuration with SMPI_SAMPLE_GLOBAL and without SMPI_SHARED_MALLOC). Furthermore, I did not perform extensive tests. This may explain why I did not notice this sooner.

1.2.10 2017-03-13 Monday

  1. Let’s play with Grid 5000   G5K
    • Connect to Grenoble’s site:

      ssh tocornebize@access.grid5000.fr
      ssh grenoble
      
    • Reserve a node and deploy:

      oarsub -I -l nodes=1,walltime=7 -t deploy
      kadeploy3 -f $OAR_NODE_FILE -e jessie-x64-big -k
      
    • Connect as root on the new node:

      ssh root@genepi-33.grenoble.grid5000.fr
      
    • Install Simgrid:

      wget https://github.com/simgrid/simgrid/archive/c8db21208f3436c35d3fdf5a875a0059719bff43.zip -O simgrid.zip
      unzip simgrid.zip
      cd simgrid-*
      mkdir build
      cd build
      cmake -Denable_documentation=OFF ..
      make -j 8
      make install
      
    • Copy HPL on the machine, with scp.
    • Change the TOPdir variable in the file Make.SMPI.
    • Do not forget to clean the HPL directory when copying it; otherwise, the modification of the TOPdir variable will not be applied to the sub-makefiles.
    • Success of compilation and execution of HPL with Simgrid on one Grid5000 node.
    • Strange thing: the virtual time did not change much (228 seconds, or 23.3 Gflops), although the simulation time changed a lot (50 seconds, against 15 seconds on my laptop) and I used the same value for the option running-power.
  2. Script for automatic installation   SHELL G5K
    • A small bash script to install Simgrid and compile HPL. Store it in the file deploy.sh. It assumes that the archives for Simgrid and HPL are located in /home/tocornebize.

      function abort {
          echo -e "\e[1;31m Error:" $1 "\e[0m"
          exit 1
      }
      
      rm -rf hpl* simgrid*
      cp /home/tocornebize/{hpl,simgrid}.zip . &&\
      unzip hpl.zip &&\
      unzip simgrid.zip
      if [ $? -ne 0 ]
      then
          abort "Could not copy or extract the archives."
      fi
      
      echo ""
      echo -e "\e[1;34m Installing Simgrid\e[0m"
      cd simgrid* &&\
      mkdir build &&\
      cd build &&\
      cmake -Denable_documentation=OFF .. &&\
      make -j 8 &&\
      make install &&\
      cd ../..
      if [ $? -ne 0 ]
      then
          abort "Could not install Simgrid."
      fi
      
      echo ""
      echo -e "\e[1;34m Installing HPL\e[0m"
      # fix the TOPdir variable before building (the comment cannot follow the backslash)
      cd hpl* &&\
      sed -ri "s|TOPdir\s+=.+|TOPdir="`pwd`"|g" Make.SMPI &&\
      make startup -j 8 arch=SMPI &&\
      make -j 8 arch=SMPI &&\
      cd ..
      if [ $? -ne 0 ]
      then
          abort "Could not compile HPL."
      fi
      
      echo ""
      echo -e "\e[1;32m Everything was ok\e[0m"
      
    • Given a node obtained with oarsub and kadeploy3, connect in ssh to it. Then, just run:

      /home/tocornebize/deploy.sh
      
  3. Recurrent failure in HPL with SMPI_SHARED_MALLOC   SMPI BUG HPL
    • The following error often happens when running HPL with SMPI_SHARED_MALLOC:

      src/simix/smx_global.cpp:557: [simix_kernel/CRITICAL] Oops ! Deadlock or code not perfectly clean.
      
    • It does not seem to happen without SMPI_SHARED_MALLOC.
    • It does not always happen with SMPI_SHARED_MALLOC.
    • I do not understand what is happening.
  4. Another failure in HPL with SMPI_SHARED_MALLOC   SMPI BUG HPL
    • Similarly, the tests on the matrix at the end of HPL are never computed when we use SMPI_SHARED_MALLOC, because of an error. For instance:

      HPL ERROR from process # 0, on line 331 of function HPL_pdtest:
      >>> Error code returned by solve is 1021, skip <<<
      
    • Examples of error codes: 1021, 1322, 1324, 1575… These values appear nowhere in the code.
  5. Tracking the error in HPL
    • Put some printf to track the error.

1.2.11 2017-03-14 Tuesday

  1. Keep tracking the error.   SMPI GIT TRACING BUG
    • Add the option --cfg=smpi/simulate-computation:0 to have a deterministic execution.
    • The error code is the field info of the matrix. It is modified in the execution path HPL_pdgesv → HPL_pdgesv0 → HPL_pdpanel_free, by the following line:

      if( PANEL->pmat->info == 0 ) PANEL->pmat->info = *(PANEL->DINFO);
      

      Thus, we now have to track the values of the DINFO field in the panel.

    • Strange thing: the field DINFO is a pointer to a floating-point value (a double), not an integer.
    • To track this, use this function:

      void print_info(HPL_T_panel *PANEL, int line) {
         if(PANEL->grid->myrow == 0 && PANEL->grid->mycol == 0) {
              printf("info = %f, line = %d\n", *PANEL->DINFO, line);
         }
      }
      

      Put some calls to it at nearly every line of the target file (when you are done with a file, remove these calls).

    • Field DINFO is modified in the execution path HPL_pdgesv0 → HPL_pdfact → panel->algo->rffun. The pointer rffun is one of the functions HPL_pdrpan***. In our settings, HPL_pdrpanllT is used.
    • Field DINFO is modified by PANEL->algo->pffun, which is one of the functions HPL_pdpan***. In our settings, HPL_pdpanllT is used.
    • Then it is modified by the first call to HPL_dlocswpT. This function directly modifies the value of DINFO with the line:

      if( *(PANEL->DINFO) == 0.0 )
         *(PANEL->DINFO) = (double)(PANEL->ia + JJ + 1);
      
    • If we remove this line, the message about the error code disappears, as expected. So it confirms that the error code comes from here.
    • Looking at HPL_pdpanel_init.c,

      • DINFO is a pointer to a part of DPIV:

        PANEL->DINFO = PANEL->DPIV + JB;
        
      • DPIV is a pointer to a part of L1:

        PANEL->DPIV  = PANEL->L1    + JB * JB;
        
      • L1 is an (aligned) alias for WORK, which is itself a block of memory allocated with malloc:

        PANEL->WORK = (void*) malloc((size_t)(lwork) * sizeof(double));
        // [...]
        PANEL->L1    = (double *)HPL_PTR( PANEL->WORK, dalign );
        

      L1 is the jb × jb upper block of the local matrix. It is used for computations. Thus, it seems that HPL expects a particular cell of this local matrix to have the value 0. This cell is not always the same. Interpretation: HPL is checking that the matrix is correctly factorized (it uses LU factorization, so it computes L and U such that A=LU, with L lower-triangular and U upper-triangular). Since we use shared memory, it is not surprising that the correctness check no longer passes. What is more surprising is that this particular check was still passing when the two BLAS functions were replaced by smpi_usleep. A guess: the fact that the resulting matrices are triangular only depends on the correctness of the swapping of rows. A consolidated sketch of the workspace layout is given at the end of this entry.

    • Thus, it seems that the error code is explained. This is normal behavior, considering what we are doing.
    • The deadlock happening in some executions is not explained however.
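
    For reference, a consolidated sketch of the workspace layout described above (assumption: simplified from HPL_pdpanel_init.c; lwork, JB and dalign are HPL's variables, and HPL_PTR rounds a pointer up to a multiple of dalign):

      double *WORK  = malloc((size_t)lwork * sizeof(double)); /* raw workspace          */
      double *L1    = (double *)HPL_PTR(WORK, dalign);        /* aligned JB x JB block  */
      double *DPIV  = L1 + JB * JB;                           /* pivot information      */
      double *DINFO = DPIV + JB;                              /* one double, error flag */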
  2. Webinar   MEETING
    • Enabling open and reproducible research at computer system’s conferences: good, bad and ugly
    • Grigori Fursin
    • The speaker created an organization about reproducible research.
    • Artifact evaluation is about peer review of experiments.
    • How it works: authors of papers accepted at a conference can ask for an artifact evaluation. If they pass it, they get a nice stamp on the paper. If they fail it, nobody will know. View this as a bonus for a paper. For the evaluation of the artifacts, the conference nominates several reviewers.
    • ACM conferences are also starting to use this kind of thing, with several different stamps.
    • But artifact evaluation is not easy to do. Firstly, there are a lot of artifacts to evaluate, which makes it hard to scale. Some artifact evaluations require proprietary software and/or rare hardware (e.g. supercomputers). It is also hard to find a reviewer with suitable skills in some cases.
    • It is also difficult to reproduce empirical results (software and hardware change). Everyone has their own scripts, so it is hard to standardize a universal workflow.

1.2.12 2017-03-15 Wednesday

  1. Hunting the deadlock   SMPI PYTHON R TRACING BUG HPL
    • With N=40000, P=Q=4 and the option --cfg=smpi/simulate-computation:0, it seems we always get a deadlock.
    • Let’s trace it, with the options -trace -trace-file /tmp/trace --cfg=smpi/trace-call-location:1.
    • Processing the trace file:
    pj_dump --user-defined --ignore-incomplete-links /tmp/trace > /tmp/trace.csv
    grep "State," /tmp/trace.csv | sed -e 's/()//' -e 's/MPI_STATE, //ig'  -e 's/State, //ig' -e 's/rank-//' -e\
    's/PMPI_/MPI_/' | grep MPI_  | tr 'A-Z' 'a-z' > /tmp/trace_processed.csv
    

    Clean the paths:

    import re
    reg = re.compile('((?:[^/])*)(?:/[a-zA-Z0-9_-]*)*((?:/hpl-2.2(?:/[a-zA-Z0-9_-]*)*).*)')
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                for line in in_f:
                    match = reg.match(line)
                    out_f.write('%s%s\n' % (match.group(1), match.group(2)))
    process('/tmp/trace_processed.csv', '/tmp/trace_cleaned.csv')
    

    Analysis:

    trace <- read.csv("/tmp/trace_cleaned.csv", header=F, strip.white=T, sep=",");
    names(trace) = c("rank", "start", "end", "duration", "level", "state", "Filename", "Linenumber");
    trace$idx = 1:length(trace$rank)
    head(trace)
    
      rank    start      end duration level    state
    1    8 0.000000 0.000000 0.000000     0 mpi_init
    2    8 0.000000 0.000202 0.000202     0 mpi_recv
    3    8 0.000202 0.000403 0.000201     0 mpi_recv
    4    8 0.000403 0.000806 0.000403     0 mpi_recv
    5    8 0.000806 0.000806 0.000000     0 mpi_send
    6    8 0.000806 0.001612 0.000806     0 mpi_recv
                                   Filename Linenumber idx
    1 /hpl-2.2/testing/ptest/hpl_pddriver.c        109   1
    2        /hpl-2.2/src/grid/hpl_reduce.c        165   2
    3        /hpl-2.2/src/grid/hpl_reduce.c        165   3
    4        /hpl-2.2/src/grid/hpl_reduce.c        165   4
    5        /hpl-2.2/src/grid/hpl_reduce.c        159   5
    6     /hpl-2.2/src/grid/hpl_broadcast.c        130   6
    
    get_last_event = function(df) {
        result = data.frame()
        for(rank in (sort(unique(df$rank)))) {
            tmp_trace = df[df$rank == rank,]
            result = rbind(result, tmp_trace[which.max(tmp_trace$idx),])
        }
        return(result)
    }
    get_last_event(trace)[c(1, 2, 3, 6, 7, 8)]
    
          rank    start      end    state                         Filename
    18756    0 67.01313 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    9391     1 66.84201 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    7865     2 66.92821 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    7048     3 67.01313 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    6242     4 67.08334 67.10575 mpi_send   /hpl-2.2/src/pgesv/hpl_rollt.c
    4699     5 66.93228 67.10575 mpi_wait   /hpl-2.2/src/pgesv/hpl_rollt.c
    3174     6 67.02313 67.10575 mpi_wait   /hpl-2.2/src/pgesv/hpl_rollt.c
    2358     7 67.08334 67.10575 mpi_send   /hpl-2.2/src/pgesv/hpl_rollt.c
    1554     8 67.08334 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    17201    9 66.93228 67.10575 mpi_send /hpl-2.2/src/pgesv/hpl_spreadt.c
    15675   10 67.02313 67.10575 mpi_send /hpl-2.2/src/pgesv/hpl_spreadt.c
    14858   11 67.08334 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    14053   12 67.06093 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    12516   13 66.88778 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    10998   14 66.97831 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
    10189   15 67.06093 67.10575 mpi_recv /hpl-2.2/src/pgesv/hpl_spreadt.c
          Linenumber
    18756        321
    9391         321
    7865         321
    7048         321
    6242         235
    4699         242
    3174         242
    2358         235
    1554         321
    17201        351
    15675        351
    14858        321
    14053        321
    12516        321
    10998        321
    10189        321
    

    If the trace is correct, the deadlock happens in functions HPL_rollT and HPL_spreadT. Some printf confirm that the deadlock is indeed happening in these places.

  2. Found the deadlock   SMPI C BUG HPL
    • Let’s add some printf in files HPL_spreadT.c and HPL_rollT.c. First, add the functions:

      int local_rank_to_global(int local_rank, MPI_Comm local_communicator) {
          int result;
          MPI_Group local_group, world_group;
          MPI_Comm_group(local_communicator, &local_group);
          MPI_Comm_group(MPI_COMM_WORLD, &world_group);
          MPI_Group_translate_ranks(local_group, 1, &local_rank, world_group, &result);
          return result;
      }
      void print_info(int src_rank, int dst_rank, char *function, int line, char *file) {
          printf("src=%d dst=%d function=%s line=%d file=%s\n",
                 src_rank, dst_rank, function, line, file);
      }
      

      Then, add a call to print_info before each of the four lines we found:

      • HPL_spreadT.c, line 321:

        int local_rank = local_rank_to_global(IPMAP[SRCDIST+partner], comm);
        print_info(my_rank, local_rank, "mpi_recv", __LINE__, __FILE__);
        
      • HPL_spreadT.c, line 351:

        int local_rank = local_rank_to_global(IPMAP[SRCDIST+partner], comm);
        print_info(my_rank, local_rank, "mpi_send", __LINE__, __FILE__);
        
      • HPL_rollT.c, line 235:

        int local_rank = local_rank_to_global(partner, comm);
        print_info(my_rank, local_rank, "mpi_send", __LINE__, __FILE__);
        
      • HPL_rollT.c, line 242:

        int local_rank = local_rank_to_global(partner, comm);
        print_info(my_rank, local_rank, "mpi_wait", __LINE__, __FILE__);
        
    • Then, run HPL with stdout redirected to a file /tmp/output.
    • For each rank, look for the last time this rank was the caller of a blocking MPI primitive. For instance, for rank 15:

      RANK="15 " && grep "src="$RANK /tmp/output | tail -n 1
      

      Observe the destination and the function. With P=Q=4, we had these dependencies:

               12
                |
      mpi_recv  |
                |
                v     mpi_recv
                4 <——————————————+
                |                |
                |                |
      mpi_wait  |                |
                |                |
                v                |
                8 —————————————> 0
                     mpi_send
      

      There is the same pattern for {1, 5, 9, 13}, {2, 6, 10, 14} and {3, 7, 11, 15}.

    • This exact deadlock has been reproduced on Grid 5000, with the same parameters.

1.2.13 2017-03-16 Thursday

  1. Still looking for the deadlock   SMPI BUG HPL
    • When HPL is run with smpi_usleep but without SMPI_SHARED_{MALLOC,FREE}, there is no deadlock, even with the same parameters (N=40000, P=Q=4). Warning: testing with N=40000 requires a lot of memory, about 12GB.
    • When HPL is run with SMPI_SHARED_{MALLOC,FREE} but without smpi_usleep, there is a deadlock. Note that we still use the option --cfg=smpi/simulate-computation:0. It happens in the same location, but the deadlock is different. Now, it looks like this (and is located only in HPL_spreadT):

                4
                |
      mpi_recv  |
                |
                v     mpi_recv
                0 <——————————————+
                |                |
                |                |
      mpi_send  |                |
                |                |
                v                |
               12 —————————————> 8
                     mpi_send
      

      There is the same pattern for {1, 5, 9, 13}, {2, 6, 10, 14} and {3, 7, 11, 15}.

  2. Understanding HPL code   SMPI TRACING BUG HPL
    • In file HPL_spreadT.c.
    • In our settings, the following if statement is never taken:

      if(SIDE == HplLeft)
      
    • In the else part, there is a big do while loop. Some initializations happen before this loop.
    • npm1: initialized to nprow - SRCDIST - 1, not modified during the loop.
    • ip2: initialized to the largest power of 2 smaller than or equal to npm1. Divided by 2 at each step. The loop stops when ip2 reaches 0.
    • mask: initialized to ip2*2-1 (ip2 is a single 1 bit followed by a run of 0s; mask is that same 1 bit followed by a run of 1s). At the beginning of each step, the leading 1 of mask is flipped, so mask equals ip2-1 after this statement.
    • IPMAP: mapping of the processes.
    • IPMAPM1: inverse mapping (IPMAPM1[IPMAP[i]] is equal to i).
    • mydist: initialized to IPMAPM1[myrow], not modified afterwards.
    • partner: at each step, set to mydist^ip2, i.e. we flip exactly one bit of mydist.
    • We do the communications only when mydist & mask is 0 and lbuf > 0 (a standalone sketch of this loop is given at the end of this item).
      • If mydist & ip2 is not 0, we receive.
      • If mydist & ip2 is 0, we send.
    • Print the content of IPMAP. Add the following line before the do while:

      printf("IPMAP: my_rank=%d, %d %d %d %d \n", my_rank,
        local_rank_to_global(IPMAP[0], comm), local_rank_to_global(IPMAP[1], comm),
        local_rank_to_global(IPMAP[2], comm), local_rank_to_global(IPMAP[3], comm));
      

      We get this output:

      IPMAP: my_rank=0, 0 4 12 8
      IPMAP: my_rank=12, 0 4 12 8
      IPMAP: my_rank=8, 0 4 8 12
      IPMAP: my_rank=4, 0 4 12 8
      IPMAP: my_rank=0, 0 4 12 8
      IPMAP: my_rank=4, 0 4 12 8
      IPMAP: my_rank=1, 1 5 13 9
      IPMAP: my_rank=5, 1 5 13 9
      IPMAP: my_rank=13, 1 5 13 9
      IPMAP: my_rank=9, 1 5 9 13
      IPMAP: my_rank=5, 1 5 13 9
      IPMAP: my_rank=1, 1 5 13 9
      IPMAP: my_rank=2, 2 6 14 10
      IPMAP: my_rank=3, 3 7 15 11
      IPMAP: my_rank=6, 2 6 14 10
      IPMAP: my_rank=7, 3 7 15 11
      IPMAP: my_rank=10, 2 6 10 14
      IPMAP: my_rank=11, 3 7 11 15
      IPMAP: my_rank=14, 2 6 14 10
      IPMAP: my_rank=15, 3 7 15 11
      IPMAP: my_rank=6, 2 6 14 10
      IPMAP: my_rank=2, 2 6 14 10
      IPMAP: my_rank=7, 3 7 15 11
      IPMAP: my_rank=3, 3 7 15 11
      

      Recall that our communicators are {n, n+4, n+8, n+12} for n in {0, 1, 2, 3}. We see a pattern here: when processes have a local rank in {0, 1, 3}, their IPMAP is {0, 1, 3, 2} (local ranks), but when the local rank is 2, then IPMAP is {0, 1, 2, 3}.

    • Now, let’s print the other parameters. Add the following line just after the modification of mask at the beginning of the do while:

      printf("### my_rank=%d (%d) id_func=%d mask=%d ip2=%d mydist=%d", my_rank,
        my_local_rank, id_func, mask, ip2, mydist);
      

      Here, id_func is a static variable initialized to -1 and incremented at the beginning of every function call. Later in the code, add these:

      printf(" partner=%d", partner);
      

      and

      printf(" mpi_recv(%d)\n", IPMAP[SRCDIST+partner]);
      

      or

      printf(" mpi_send(%d)\n", IPMAP[SRCDIST+partner]);
      

      (depending on whether we do a send or a receive). We have this output for {0, 4, 8, 12} (the output is similar for the other communicators):

      grep "my_rank=0 " output | grep "###"
      ### my_rank=0 (0) id_func=0 mask=1 ip2=2 mydist=0 partner=2 mpi_send(3)
      ### my_rank=0 (0) id_func=0 mask=0 ip2=1 mydist=0 partner=1 mpi_send(1)
      ### my_rank=0 (0) id_func=1 mask=1 ip2=2 mydist=0 partner=2 mpi_send(3)
      grep "my_rank=4 " output | grep "###"
      ### my_rank=4 (1) id_func=0 mask=1 ip2=2 mydist=1
      ### my_rank=4 (1) id_func=0 mask=0 ip2=1 mydist=1 partner=0 mpi_recv(0)
      ### my_rank=4 (1) id_func=1 mask=1 ip2=2 mydist=1
      ### my_rank=4 (1) id_func=1 mask=0 ip2=1 mydist=1 partner=0 mpi_recv(0)
      grep "my_rank=8 " output | grep "###"
      ### my_rank=8 (2) id_func=0 mask=1 ip2=2 mydist=2 partner=0 mpi_recv(0)
      grep "my_rank=12 " output | grep "###"
      ### my_rank=12 (3) id_func=0 mask=1 ip2=2 mydist=2 partner=0 mpi_recv(0)
      ### my_rank=12 (3) id_func=0 mask=0 ip2=1 mydist=2 partner=3 mpi_send(2)
      

      We see that the communication pattern looks like a binary tree. At each function call, in the first step 0 sends to 12; in the second step, 0 sends to 4 and 12 sends to 8. The problem is that every mpi_recv matches an mpi_send except for node 8: it calls mpi_recv with node 0 as the source, whereas we would expect the source to be 12. The same pattern is observed for the other communicators.

    • We saw that the nodes with local rank 2 call MPI_Recv with an unexpected source. These nodes also have a different IPMAP. Hypothesis: these different IPMAP are a bug.
    • Doing the same experiment without SMPI_SHARED_{MALLOC,FREE} (the case where we do not have a deadlock). Here, we observe that the values of IPMAP are the same in all processes. Also, there is a matching MPI_Recv for every MPI_Send, as expected.
    • Thus, to fix the deadlock, we should find where IPMAP is defined.
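    • To make the loop structure above concrete, here is a minimal standalone sketch of it (our reconstruction, not HPL’s actual code: the buffers and the lbuf > 0 condition are omitted, and the four processes of a communicator are simulated one after the other):

      #include <stdio.h>

      /* Model of the communication loop in the else branch of HPL_spreadT,
       * with nprow = 4, SRCDIST = 0 and the IPMAP observed for local ranks
       * {0, 1, 3}. */
      int main(void)
      {
          const int nprow = 4, SRCDIST = 0;
          const int IPMAP[] = {0, 1, 3, 2};   /* distance -> local rank */
          int IPMAPM1[4];                     /* local rank -> distance */
          for (int i = 0; i < nprow; i++)
              IPMAPM1[IPMAP[i]] = i;

          for (int myrow = 0; myrow < nprow; myrow++) {
              int npm1 = nprow - SRCDIST - 1;
              int ip2 = 1;
              while ((ip2 << 1) <= npm1)      /* largest power of 2 <= npm1 */
                  ip2 <<= 1;
              int mask = (ip2 << 1) - 1;
              int mydist = IPMAPM1[myrow];
              do {
                  mask ^= ip2;                /* clear the ip2 bit: mask becomes ip2-1 */
                  if ((mydist & mask) == 0) {
                      int partner = mydist ^ ip2;   /* flip exactly one bit of mydist */
                      if (mydist & ip2)
                          printf("local rank %d (mydist %d): recv from local rank %d\n",
                                 myrow, mydist, IPMAP[SRCDIST + partner]);
                      else if (partner <= npm1)     /* HPL also bounds-checks the partner */
                          printf("local rank %d (mydist %d): send to local rank %d\n",
                                 myrow, mydist, IPMAP[SRCDIST + partner]);
                  }
                  ip2 >>= 1;
              } while (ip2 > 0);
          }
          return 0;
      }

      With a consistent IPMAP, every receive has a matching send. With its divergent IPMAP ({0, 1, 2, 3}), local rank 2 computes mydist=2 instead of 3 and therefore posts a receive from the wrong source, which is exactly the mismatch observed above.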
  3. Seminar   MEETING
    • Taking advantage of application structure for visual performance analysis
    • Lucas Mello Schnorr
    • Context: two models. Explicit programming (e.g. MPI) or task-based programming (e.g. Cilk).
    • In task-based programming, no clear phases (contrarily to things like MPI, where we have communication phases and computation phases). Thus, hard to understand the performances when visualizing a trace.
    • The scheduler has to assign tasks, anticipate the critical path and minimize data movements. The difficulty is that it does not know the whole DAG at the beginning.
    • Workflow based on several tools: pj_dump, R, tidyverse, ggplot2, plotly. Everything can be done in org-mode. Agile workflow, fail fast if the idea is not working, easily share experiments with colleagues.

1.2.14 2017-03-17 Friday

  1. Let’s look at IPMAP   SMPI C TRACING BUG HPL
    • IPMAP is given as an argument to HPL_spreadT.
    • The function HPL_spreadT is used in HPL_pdlaswp01T and HPL_equil.
    • In our settings, all processes begin with a call to HPL_pdlaswp01T. Then, all processes with local ranks 0 and 1 make a call to HPL_equil (local ranks 2 and 3 are already deadlocked). The values of IPMAP are the same between the two calls. We thus have to look at HPL_pdlaswp01T.
    • IPMAP is defined in this function with other variables. They are all a contiguous block in PANEL->IWORK:

      iflag  = PANEL->IWORK;
      // [...]
      k = (int)((unsigned int)(jb) << 1);  ipl = iflag + 1; ipID = ipl + 1;
      ipA     = ipID + ((unsigned int)(k) << 1); lindxA = ipA + 1;
      lindxAU = lindxA + k; iplen = lindxAU + k; ipmap = iplen + nprow + 1;
      ipmapm1 = ipmap + nprow; permU = ipmapm1 + nprow; iwork = permU + jb;
      
    • PANEL->IWORK is allocated in HPL_pdpanel_init with a simple malloc, so the bug does not come from here.
    • The content of IPMAP is defined in the function HPL_plindx10.
    • Function HPL_plindx10 first computes the content of the array IPLEN, then calls HPL_logsort to compute IPMAP (the content of IPMAP depends on the content of IPLEN).
    • Printing the content of IPLEN just after its initialization. Add this code just before the call to HPL_logsort:

      int my_rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      printf(">> my_rank=%d, icurrow=%d, IPLEN =", my_rank, icurrow);
      for(i = 0; i <= nprow; i++) {
           printf(" %d", IPLEN[i]);
      }
      printf("\n");
      

      Here are the contents of IPLEN for ranks {0, 4, 8, 12}.

      • With SMPI_SHARED_{MALLOC,FREE}:

        | Rank | IPLEN[0] | IPLEN[1] | IPLEN[2] | IPLEN[3] | IPLEN[4] |
        |------+----------+----------+----------+----------+----------|
        |    0 |        0 |      103 |       14 |        1 |        2 |
        |    4 |        0 |      102 |       15 |        1 |        2 |
        |    8 |        0 |      102 |       14 |        2 |        2 |
        |   12 |        0 |      102 |       14 |        1 |        3 |
      • Without SMPI_SHARED_{MALLOC,FREE}:

        | Rank | IPLEN[0] | IPLEN[1] | IPLEN[2] | IPLEN[3] | IPLEN[4] |
        |------+----------+----------+----------+----------+----------|
        |    0 |        0 |       31 |       24 |       26 |       39 |
        |    4 |        0 |       31 |       24 |       26 |       39 |
        |    8 |        0 |       31 |       24 |       26 |       39 |
        |   12 |        0 |       31 |       24 |       26 |       39 |

      We can note two things. First, without SMPI_SHARED_{MALLOC,FREE}, all processes have an IPLEN with the same content; this is not the case with SMPI_SHARED_{MALLOC,FREE}. Furthermore, the values in IPLEN are much more balanced in the malloc/free case. Thus, the issue very likely comes from IPLEN.

  2. Let’s look at IPLEN and IPID   SMPI TRACING BUG HPL
    • The content of IPLEN depends on the content of IPID.
    • Add a printf to get its content. Every element it contains is present exactly twice in the array.
    • With SHARED_{MALLOC,FREE},
      • IPID has a size of 300 for local rank 0, 302 for the others.
      • IPID of local rank 1 is equal to IPID of local rank 0 plus twice the element 120.
      • IPID of local rank 2 is equal to IPID of local rank 0 plus twice the element 240.
      • IPID of local rank 3 is equal to IPID of local rank 0 plus twice the element 360.
    • Without SHARED_{MALLOC,FREE},
      • IPID has a size of 478 for all ranks.
      • All IPID are equal.
    • IPID is computed in function HPL_pipid.
    • The content of IPID depends on the content of the array PANEL->DPIV. This array is made of 120 elements of type double. The function casts them to int and does some comparisons with them, which is strange.
    • Add a printf to get its content.
    • With SHARED_{MALLOC,FREE},
      • The DPIV of the processes having the same local rank are equal.
      • The first 30 elements of the DPIV arrays of the processes of a same communicator are equal; the following elements differ.
      • For the processes of local rank 0, these following elements are 30, 31, 32, …, 119. In other words, for i > 29, we have DPIV[i] equal to i.
      • For the processes of local rank 1, these elements are all equal to 120. For local rank 2, they are equal to 240. For local rank 3, they are equal to 360.
    • Without SHARED_{MALLOC,FREE},
      • The DPIV arrays of all processes are equal.
      • All their elements are present exactly once, except 4143, which is present twice.

    • Thus, it seems that the issue comes from PANEL->DPIV.

  3. Summing up   SMPI BUG HPL
    • The values of IPMAP depend on the values of IPLEN.
    • The values of IPLEN depend on the values of IPID.
    • The values of IPID depend on the values of PANEL->DPIV.
    • For all these arrays, we observe some strange things in the SMPI_SHARED_{MALLOC,FREE} case (compared to the malloc/free case):
      • The content of the arrays is not the same for different ranks.
      • The content itself looks kind of strange (e.g. DPIV has a lot of identical values).
  4. So, why do we have these DPIV?   SMPI BUG HPL
    • The content of DPIV is defined at the end of function HPL_pdmxswp, by the line:

      (PANEL->DPIV)[JJ] = WORK[2];
      

      With some printf, we see that DPIV is filled in order. The values are the same as the ones already observed for DPIV.

1.2.15 2017-03-20 Monday

  1. Write a small Python script to monitor memory usage   PYTHON
    • Based on command smemstat.
    • Run the command every second in quiet mode with JSON output. Parse the JSON file and print the information on screen, nicely formatted.
    • Future work:
      • Different sampling rate passed as a command line argument.
      • Export to CSV. This would allow plotting the memory consumption over time.
  2. Failed attempts for DPIV   SMPI BUG HPL
    • Tried to hard-code the values of DPIV with something like this:

      (PANEL->DPIV)[JJ] = 42;
      

      Got a segmentation fault.

  3. Discussion with Arnaud   SMPI BUG HPL MEETING
    • Had a look at HPL code.
    • Next steps to try to find the issue:
      • Try another block size for global SMPI_SHARED_MALLOC.
      • Retry local SMPI_SHARED_MALLOC.
      • Try other matrix sizes, other process grids.
      • In HPL_pdmxswp, print the values of WORK[{0,1,2,3}] before and after the execution.
  4. Looking at WORK[{0,1,2,3}]   SMPI C BUG HPL
    • Meaning of the first values of this array:
      • WORK[0] : local maximum absolute value scalar,
      • WORK[1] : corresponding local row index,
      • WORK[2] : corresponding global row index,
      • WORK[3] : coordinate of process owning this max.
    • Just before the call to HPL_pdmxswp, these values are computed locally. Then, HPL_pdmxswp does some computations to get the global values (conceptually, a MAXLOC-style reduction; see the sketch at the end of this item).
    • Adding some printf.
    • Without SHARED_{MALLOC,FREE}, the absolute value of WORK[0] increases at each call and quickly becomes very large. It reaches 3.8e+302, then becomes NaN. This happens regardless of whether we replace the BLAS operations by smpi_usleep.
    • With SHARED_{MALLOC,FREE}, the absolute value of WORK[0] is relatively small.
    • If we replace the value of WORK[0] by a (small) constant, the simulation terminates without deadlock.
    • Recall that we run the simulation with N=40000 and P=Q=4.
    • The simulation takes 197 seconds, of which 170 seconds are actual computations of the application (thus, there is still room for optimization).
    • The estimated performance is 27.5 Gflops. This is a bit higher than what we had before with the matrix of size 20000. We need to check whether this difference is due to the larger matrix size (expected and ok) or to our dirty hack (not ok).
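    • For reference, combining the local candidates into the global values is conceptually a MAXLOC-style reduction. A minimal MPI sketch of that idea (our illustration only: HPL_pdmxswp uses its own exchange scheme, and the random values below are made up):

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          struct { double val; int rank; } local, global;
          srand(rank + 1);
          local.val  = (double)rand() / RAND_MAX;  /* stand-in for the local max |pivot| */
          local.rank = rank;                       /* coordinate of the owner */

          /* Every process learns the largest candidate and who owns it. */
          MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
          printf("rank %d: global max %f owned by rank %d\n", rank, global.val, global.rank);

          MPI_Finalize();
          return 0;
      }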

1.2.16 2017-03-21 Tuesday

  1. Checking the consistency of IPMAP   SMPI PYTHON TRACING BUG HPL
    • Before the modification on WORK[0], the IPMAP were not consistent on the different processes of a same communicator (see the entry of 16/03/2017).
    • Let’s check if this issue is fixed.
    • Add the following in HPL_spreadT.c:

      int my_rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      printf("my_rank=%d IPMAP=%d,%d,%d,%d\n", my_rank, IPMAP[0], IPMAP[1], IPMAP[2], IPMAP[3]);
      
    • Run HPL with stdout redirected to /tmp/output.
    • Check that at each step, the values of IPMAP are the same for the processes of a same communicator. Recall that the communicators are {0,4,8,12}, {1,5,9,13}, {2,6,10,14} and {3,7,11,15}.
    import re
    reg = re.compile('my_rank=([0-9]+) IPMAP=(.+)')
    
    def process(in_file):
        results = {n: [] for n in range(16)}
        with open(in_file, 'r') as in_f:
            for line in in_f:
                match = reg.match(line)
                if match is not None:
                    n = int(match.group(1))
                    ipmap = match.group(2)
                    results[n].append(ipmap)
        for comm in range(4):
            print('Number of entries for communicator %d: %d' % (comm, len(results[comm])))
            for rank in range(1, 4):
                assert results[comm] == results[comm+4*rank]
        print('OK')
    process('/tmp/output')
    
    Number of entries for communicator 0: 913
    Number of entries for communicator 1: 904
    Number of entries for communicator 2: 907
    Number of entries for communicator 3: 910
    OK
    
    • We see here that the values of IPMAP are consistent.
  2. Comparison with previous code   SMPI HPL
    • Let’s compare with the previous version of the code (without the modification on WORK[0]). We use N=20000, P=Q=4.

      | Code   | Virtual time | Gflops    | Total simulation time | Time for application computations |
      |--------+--------------+-----------+-----------------------+-----------------------------------|
      | Before |       222.27 | 2.400e+01 |               19.2529 |                           10.0526 |
      | After  |       258.28 | 2.065e+01 |               48.2851 |                           41.7249 |
    • We find that both the virtual time and the real time are longer, due to a higher amount of time spent in the application.
    • Do not forget to remove the option --cfg=smpi/simulate-computation:0 when testing things like that. At first, I did not remove it: the real time was higher but the virtual time was unchanged.
    • It seems that the modification of WORK[0] has led to a modification of the behavior of the application, which yields significant differences in terms of performances.
  3. Having a look at what takes time   SMPI PYTHON R EXPERIMENTS PERFORMANCE HPL
    • Using the settings N=20000, P=Q=4. Recall that with these settings, the simulation time was nearly 52 seconds.
    • The simulation time drops to 30 seconds if we disable the calls to HPL_dgemv (this was not the case before, according to the experiments of 08/03/2017).
    • There was no deadlock with N=20000, so we can compare the cases with and without the modification of WORK[0].
    • Modify the definition of HPL_dgemv in HPL_blas.h for both cases:

      #define    HPL_dgemv(Order, TransA, M, N, alpha, A, lda, X, incX, beta, Y, incY) ({\
          int my_rank, buff=0;\
          MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);\
          struct timeval before = {};\
          struct timeval after = {};\
          gettimeofday(&before, NULL);\
          cblas_dgemv(Order, TransA, M, N, alpha, A, lda, X, incX, beta, Y, incY);\
          gettimeofday(&after, NULL);\
          double time_before = (double)(before.tv_sec) + (double)(before.tv_usec)*1e-6;\
          double time_after = (double)(after.tv_sec) + (double)(after.tv_usec)*1e-6;\
          double real_time = time_after-time_before;\
          printf("file=%s line=%d rank=%d m=%d n=%d lead_A=%d inc_X=%d inc_Y=%d real_time=%f\n", __FILE__, __LINE__, my_rank, M, N, lda, incX, incY, real_time);\
      })
      
    • Run HPL for both cases, with stdout redirected to some file (/tmp/output_before when WORK[0] is unmodified, /tmp/output_after when it is modified).
    • Process the outputs:
    import re
    import csv
    reg = re.compile('file=([a-zA-Z0-9/_.-]+) line=([0-9]+) rank=([0-9]+) m=([0-9]+) n=([0-9]+) lead_A=([0-9]+) inc_X=([0-9]+) inc_Y=([0-9]+) real_time=(-?[0-9]+.[0-9]+)')
    
    def process(in_file, out_file):
        with open(in_file, 'r') as in_f:
            with open(out_file, 'w') as out_f:
                csv_writer = csv.writer(out_f)
                csv_writer.writerow(('file', 'line', 'rank', 'm', 'n', 'lead_A', 'inc_X', 'inc_Y', 'real_time'))
                for line in in_f:
                    match = reg.match(line)
                    if match is not None:
                        result = list(match.group(i) for i in range(1, 10))
                        result[0] = result[0][result[0].index('/hpl'):].lower()
                        csv_writer.writerow(result)
    process('/tmp/output_before', '/tmp/parameters_before.csv')
    process('/tmp/output_after', '/tmp/parameters_after.csv')
    
    • Analysis with R:
    parameters_before <- read.csv("/tmp/parameters_before.csv");
    parameters_after <- read.csv("/tmp/parameters_after.csv");
    head(parameters_before)
    head(parameters_after)
    
                                   file line rank    m n lead_A inc_X inc_Y
    1 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    4 5040 1   5040   120     1
    2 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207   12 4920 1   4920   120     1
    3 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    0 5039 1   5040   120     1
    4 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    8 5000 1   5000   120     1
    5 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207   12 4920 1   4920   120     1
    6 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    4 5040 1   5040   120     1
      real_time
    1  0.000034
    2  0.000034
    3  0.000030
    4  0.000156
    5  0.000043
    6  0.000031
                                   file line rank    m n lead_A inc_X inc_Y
    1 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    4 5040 1   5040   120     1
    2 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207   12 4920 1   4920   120     1
    3 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    0 5039 1   5040   120     1
    4 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    8 5000 1   5000   120     1
    5 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207   12 4920 1   4920   120     1
    6 /hpl-2.2/src/pfact/hpl_pdpanllt.c  207    8 5000 1   5000   120     1
      real_time
    1  0.000026
    2  0.000030
    3  0.000030
    4  0.000123
    5  0.000030
    6  0.000028
    
    sum(parameters_before$real_time)
    sum(parameters_after$real_time)
    
    [1] 0.127207
    [1] 2.61086
    
    • There is a clear difference between the two cases. When WORK[0] is modified, the time spent in the function HPL_dgemv is 20 times higher. However, this makes a difference of about 2.5 seconds, whereas a difference of 20 seconds was observed when removing HPL_dgemv.
    • Therefore, it seems that removing the calls to HPL_dgemv triggers a modification of the behavior of the application, resulting in a lower time; it is not this function itself that takes the time.
    • Note that this was not the case for the functions HPL_dgemm and HPL_dtrsm: there, it was the calls to these functions themselves that took time, not a consequence of the calls (just tested: summing all the times gives a total of about 75 seconds for HPL_dtrsm and about 2896 seconds for HPL_dgemm).
    • In the experiment of 08/03/2017, removing HPL_dgemv only resulted in a drop of 1 second in the execution time.
    • Thus, it seems that modifying WORK[0] has increased the execution time, an increase which is cancelled if we then remove HPL_dgemv. Therefore, we should not treat this function like HPL_dgemm and HPL_dtrsm (replacing it by an smpi_usleep); we should simply remove it.
    • If we remove it, we get a virtual time of 226 seconds, i.e. 23.6 Gflops. This is much closer to what we used to have. The simulation time is now 26 seconds: worse than what we used to have, but still better than what we had after the modification of WORK[0].

1.2.17 2017-03-22 Wednesday

  1. Better usability of HPL   SMPI C HPL
    • Before, HPL code had to be changed by hand to enable or disable SMPI optimizations (SHARED_{MALLOC,FREE} and smpi_usleep) and to enable or disable the tracing of BLAS function calls.
    • Now, thanks to some preprocessor macros, these different settings can be configured on the command line at compile time (a sketch of the mechanism is given at the end of this item):

      # Compile vanilla HPL for SMPI
      make arch=SMPI
      # Compile HPL for SMPI with the tracing of BLAS function calls
      make SMPI_OPTS=-DSMPI_MEASURE
      # Compile HPL for SMPI with the SMPI optimizations (shared malloc/free, smpi_usleep)
      make SMPI_OPTS=-DSMPI_OPTIMIZATION
      
    • Next step: automation of the computation of the linear regression coefficients, to also pass these coefficients as preprocessor variables.
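    • For illustration, the kind of compile-time switch involved looks like the following standalone sketch (the flag names follow the Makefile above, but the macro bodies are only indicative, and the stubs stand in for the real smpi_usleep and cblas_dgemm):

      #include <stdio.h>

      static void smpi_usleep_stub(double usecs) { printf("sleeping %.0f us\n", usecs); }
      static void cblas_dgemm_stub(int m, int n, int k) { printf("real dgemm %dx%dx%d\n", m, n, k); }

      #ifndef SMPI_DGEMM_COEFF
      #  define SMPI_DGEMM_COEFF 1.097757e-09   /* seconds per flop, from the regression */
      #endif

      #ifdef SMPI_OPTIMIZATION
      /* Optimized build: replace the BLAS call by a delay proportional to the flop count. */
      #  define HPL_dgemm(M, N, K) smpi_usleep_stub((double)(M)*(N)*(K)*SMPI_DGEMM_COEFF*1e6)
      #else
      /* Vanilla build: forward to the real BLAS call. */
      #  define HPL_dgemm(M, N, K) cblas_dgemm_stub((M), (N), (K))
      #endif

      int main(void)
      {
          HPL_dgemm(1000, 1000, 120);  /* behavior chosen by -DSMPI_OPTIMIZATION at compile time */
          return 0;
      }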
  2. Script to parse the output file and do the linear regression   SMPI PYTHON EXPERIMENTS TRACING HPL
    • Everything is done in Python (linear regression included) to simplify the procedure for the user.
    • Given an output file /tmp/output as produced by a call to HPL (compiled with -DSMPI_MEASURE option), call the script:

      python3 ../scripts/linear_regression.py /tmp/output
      
      -DSMPI_DGEMM_COEFF=1.097757e-09 -DSMPI_DTRSM_COEFF=9.134754e-08
      

      It outputs the list of the coefficients found by the linear regressions for the relevant BLAS functions. This list should then be passed to the variable SMPI_OPTS when compiling with -DSMPI_OPTIMIZATION.
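    • For a single BLAS kernel, the fit presumably reduces to a one-parameter least squares over a model like time ≈ coeff × (M·N·K). A minimal standalone sketch of that computation (the sample points are made up, and the script’s actual model may differ):

      #include <stdio.h>

      /* One-parameter least squares for time = coeff * x (no intercept):
       * coeff = sum(x_i * t_i) / sum(x_i * x_i). */
      static double fit_coefficient(const double *x, const double *t, int n)
      {
          double num = 0.0, den = 0.0;
          for (int i = 0; i < n; i++) {
              num += x[i] * t[i];
              den += x[i] * x[i];
          }
          return num / den;
      }

      int main(void)
      {
          /* x = M*N*K of each dgemm call, t = measured duration in seconds. */
          const double x[] = {1e9, 8e9, 2.7e10};
          const double t[] = {1.1, 8.8, 29.7};
          printf("-DSMPI_DGEMM_COEFF=%e\n", fit_coefficient(x, t, 3));
          return 0;
      }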

  3. Discussion with Arnaud   SMPI ORGMODE PERFORMANCE HPL MEETING
    • Possible trip to Bordeaux on the week of [2017-04-10 Mon]-[2017-04-14 Fri]. The goal is to discuss with contributors of Simgrid.
    • Found the time difference we get when we modify WORK[0] very strange, especially because it is computation time (it would be more understandable if it were communication time, since the communication patterns for the pivot exchange are very likely to be impacted). We should do some profiling.
    • Some tips regarding org-mode (tags).
  4. DONE Profile HPL
    • Use Valgrind with Callgrind and Kcachegrind or Gprof.
    • Do the profiling on unmodified HPL and modified HPL, to see if there is any obvious difference.

1.2.18 2017-03-23 Thursday

  1. Profiling vanilla HPL   EXPERIMENTS PERFORMANCE PROFILING VALGRIND SMPI HPL
    • Profiling with Valgrind of vanilla HPL (no time measurements nor SMPI optimizations). Add the option -g in the Makefile.
    • HPL commit: 4494976bc0dd67e04e54abec2520fd468792527a.
    • Settings: N=5000, P=Q=4.
    • Compile with the command:

      make -j 4 arch=SMPI
      
    • Run with the command:

      smpirun -wrapper "valgrind --tool=callgrind" --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969
      --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes -np 16 -hostfile ./hostfile_64.txt -platform
      ./cluster_fat_tree_64.xml ./xhpl
      
    • At first, the package libatlas3-base was used for the BLAS functions. The names of these functions were not shown in Kcachegrind.
    • Then, removed this package and installed libatlas-cpp-0.6-2-dbg.
    • So now we have the names of the BLAS functions, but the layout is very different.
    • Also, the executions with this library take more time, especially with Valgrind. It also impacts the virtual time and the Gflops.
    • What we observe with this new library seems to be consistent with what we observed previously: dgemm is the most time-consuming function (by far), dtrsm comes after. So maybe this library is good enough to understand what happens, and then we could switch back to the previous library to get good performance.
  2. Profiling modified HPL   EXPERIMENTS PERFORMANCE PROFILING VALGRIND SMPI HPL
    • Profiling with Valgrind of modified HPL. Add the option -g in the Makefile.
    • HPL commit: 4494976bc0dd67e04e54abec2520fd468792527a. Then for each case, a small piece of the code has been modified.
    • Settings: N=5000, P=Q=4.
    • Compile with the command:

      make SMPI_OPTS=-DSMPI_OPTIMIZATION -j 4 arch=SMPI
      
    • Run with the command:

      smpirun -wrapper "valgrind --tool=callgrind" --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969
      --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes -np 16 -hostfile ./hostfile_64.txt -platform
      ./cluster_fat_tree_64.xml ./xhpl
      
    • Using the library from libatlas-cpp-0.6-2-dbg.
    • First experiment, the call to HPL_dgemv is a no-op and WORK[0] is set to a constant.
    • Second experiment, the call to HPL_dgemv is aliased to cblas_dgemv and WORK[0] is set to a constant.
    • Third experiment, the call to HPL_dgemv is aliased to cblas_dgemv and WORK[0] is not modified.
    • It is clear that we can shrink the simulation even further by removing the code that initializes the matrices (the code that calls the function HPL_rand).
    • There is no explanation for the differences observed with HPL_dgemv and WORK[0]; the figures look similar. However, the differences observed between the three cases are quite small (in terms of execution time or Gflops).
  3. Comparison of the code   SMPI HPL
    • Let’s compare again the different versions of the code, this time with the new CBLAS library (package libatlas-cpp-0.6-2-dbg). We use N=20000, P=Q=4.

      | Code                           | Virtual time | Gflops    | Total simulation time | Time for application computations |
      |--------------------------------+--------------+-----------+-----------------------+-----------------------------------|
      | WORK[0] unmodified, real dgemv |       223.81 | 2.383e+01 |               15.5049 |                            9.5045 |
      | WORK[0] modified, real dgemv   |       223.74 | 2.384e+01 |               25.9935 |                           20.0480 |
      | WORK[0] modified, no-op dgemv  |       226.28 | 2.357e+01 |               26.3907 |                           20.3201 |

      Remark: for the first version of the code, the experiment had to be run twice, since the first run ended in a deadlock.

    • The first two rows correspond to the two rows of the table of [2017-03-21 Tue].
    • There is no significant difference in the virtual time and the Gflops.
    • There is a significant difference in the total simulation time and in the time spent in application computations, but it is less important than what we previously observed.
    • It is strange that this difference in the computation time does not appear in the virtual time. Note that the option --cfg=smpi/simulate-computation:0 was not used, so it does not come from here.
  4. Seminar   MEETING
    • Decaf: Decoupled Dataflows for In Situ High-Performance Workflows
    • Mathieu Dreher
    • They do some physics experiment (with a particle collider). Then, they analyze the results and build a model. Thus, the whole process has three major steps.
    • In current systems, the bottleneck is the I/O. It will be even worse for future systems (computation speed will be increased, not I/O speed). This is why we should have in-situ workflows (less data to move).
    • In the “classical workflow”, we compute all the iterations, then we analyze them.
    • In the “in situ workflow”, two things are possible. Time partitioning: we compute one iteration, then analyze it, then go back to the computation. Space partitioning: the analysis is done in parallel on other nodes.
  5. Profiling modified HPL, bigger matrices   EXPERIMENTS PERFORMANCE PROFILING VALGRIND SMPI HPL
    • Profiling with Valgrind of modified HPL. Add the option -g in the Makefile.
    • HPL commit: 4494976bc0dd67e04e54abec2520fd468792527a. Then for each case, a small piece of the code has been modified.
    • Settings: N=20000, P=Q=4.
    • Compile with the command:

      make SMPI_OPTS=-DSMPI_OPTIMIZATION -j 4 arch=SMPI
      
    • Run with the command:

      smpirun -wrapper "valgrind --tool=callgrind" --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969
      --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes -np 16 -hostfile ./hostfile_64.txt -platform
      ./cluster_fat_tree_64.xml ./xhpl
      
    • Using the library from libatlas-cpp-0.6-2-dbg.
    • First experiment, the call to HPL_dgemv is a no-op and WORK[0] is set to a constant.
    • Second experiment, the call to HPL_dgemv is aliased to cblas_dgemv and WORK[0] is set to a constant.
    • Third experiment, the call to HPL_dgemv is aliased to cblas_dgemv and WORK[0] is not modified.
    • The three figures have roughly the same pattern.
    • However, some numbers in the first two figures are twice as large as the corresponding numbers in the third figure. For instance, the biggest HPL_rand has 70803540000 in the third figure and 141607080000 in the first two.
    • The reason is that, in the first two cases, HPL_pdmatgen is called 32 times, whereas in the last case it is called only 16 times. In our settings, we would expect this function to be called 16 times, since we simulate 16 processes.
    • It is very strange that WORK[0] has an impact on the behavior of the matrix generation.

1.2.19 2017-03-24 Friday

  1. Why WORK[0] impacts the number of calls to HPL_pdmatgen   SMPI HPL
    • Everything happens in the file HPL_pdtest.c. This is related to the error code issue discussed on [2017-03-14 Tue].
    • When we use SMPI optimizations (smpi_usleep and SMPI_SHARED_MALLOC) without modifying WORK[0], HPL detects an error in the data of the matrices and returns an error code. If we also fix WORK[0] to some constant, HPL does not detect this error.
    • After doing the factorization, if no error code has been returned, HPL does some additional tests on the values of the matrix. These tests are quite long and require re-generating the matrix by calling HPL_pdmatgen (see the toy model at the end of this item).
    • This explains why WORK[0] has an impact on the simulation time and on the number of times HPL_pdmatgen is called.
    • This does not explain the difference in terms of virtual time observed on [2017-03-21 Tue], since this is only the time needed for the factorization and not the time for the initialization and the checks.
    • This difference of virtual time was not re-observed on [2017-03-23 Thu]. Note that another BLAS library was used.
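    • Schematically, the control flow we observed behaves like this toy model (the real function names are kept, but the bodies here are stubs):

      #include <stdio.h>

      static int  HPL_pdgesv(int data_corrupted) { return data_corrupted ? 1 : 0; }
      static void HPL_pdmatgen(void) { puts("  HPL_pdmatgen (matrix generation)"); }

      static void HPL_pdtest(int data_corrupted)
      {
          HPL_pdmatgen();                          /* initial generation of the matrix */
          int info = HPL_pdgesv(data_corrupted);   /* factorization + solve */
          if (info == 0) {
              puts("  no error code: running the residual checks");
              HPL_pdmatgen();                      /* the checks re-generate the matrix */
          } else {
              puts("  error code returned: checks skipped");
          }
      }

      int main(void)
      {
          puts("WORK[0] unmodified (corrupted data detected):");
          HPL_pdtest(1);    /* HPL_pdmatgen called once per process: 16 calls in total */
          puts("WORK[0] fixed to a constant (no error detected):");
          HPL_pdtest(0);    /* HPL_pdmatgen called twice per process: 32 calls in total */
          return 0;
      }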
  2. Comparison of the code   SMPI HPL
    • Let’s compare again the different versions of the code, with the “old” version of the CBLAS library (package libatlas3-base). We use N=20000, P=Q=4.
    • This is the same experiment as [2017-03-23 Thu], except for the BLAS library.
    • The first two rows correspond to the two rows of the table of [2017-03-21 Tue].

      | Code                           | Virtual time | Gflops    | Total simulation time | Time for application computations |
      |--------------------------------+--------------+-----------+-----------------------+-----------------------------------|
      | WORK[0] unmodified, real dgemv |       223.68 | 2.385e+01 |               15.8909 |                            9.5658 |
      | WORK[0] modified, real dgemv   |       257.79 | 2.069e+01 |               47.9488 |                           41.5125 |
      | WORK[0] modified, no-op dgemv  |       225.91 | 2.361e+01 |               26.2768 |                           20.1776 |
    • The experiment of [2017-03-21 Tue] is replicated: the first two rows look similar.
    • There is still the big gap in terms of both simulation time and virtual time. The former could be explained by the checks done at the end of HPL, but not the latter (see the previous entry of the journal).
    • Interestingly, the first and the last rows look very similar to the first and the last row of the [2017-03-23 Thu] experiment, although the BLAS library has changed. The middle row however is very different.
    • These gaps are not replicated when using Valgrind. For all simulations, we have a virtual time of about 262 to 263 seconds, which is about 2.03e+01 Gflops.
  3. Removing initialization and checks   SMPI PERFORMANCE PROFILING HPL
    • The previous experiments demonstrated that the matrix initialization (which is even re-done after the factorization for the checks) and the checks themselves take a significant amount of time. They are not accounted for in the Gflops estimation, so we can safely remove them.
    • Quick experiment, with HPL at commit cb54a92b8304e0cd2f1728b887cc4cc615334c2d, N=20000 and P=Q=4, using library from package libatlas3-base.
    • We get a virtual time of 227.35 seconds, which is 2.346e+01 Gflops. This confirms that the initialization and the checks are not accounted for in this measure.
    • The simulation time is now 9.63 seconds, with 3.53 seconds spent for actual computations of the application.
    • We see here that the simulation is already well optimized, there is not much room for additional gains.
    • Profiling with Valgrind: we see that a large part of the time is spent in functions called by Simgrid (e.g. memcpy).
  4. Work on the experiment script   PYTHON
    • Add three features:
      • “Dump simulation and application times in the CSV.”
      • “Dump physical and virtual memory in the CSV.”
      • “Experiments with random sizes and number of processors.”
    • Example of usage:

      ./run_measures.py --global_csv /tmp/bla.csv --nb_runs 10 --size 1000:2000,4000:5000,20000:21000 --nb_proc 1:4,8,16,32,64
      --fat_tree "2;8,8;1,1:4;1,1" --experiment HPL
      

      This will run 10 times, in a random order, all combinations of the parameters:

      • Matrix size in [1000,2000]∪[4000,5000]∪[20000,21000]
      • Number of processes in {1,2,3,4,8,16,32,64}
      • Fat-trees 2;8,8;1,1;1,1 and 2;8,8;1,2;1,1 and 2;8,8;1,3;1,1 and 2;8,8;1,4;1,1.

      The results are dumped in a CSV file. For each experiment, we store all the parameters (topology, size, number of processes) as well as the interesting metrics (virtual time, Gflops, simulation time, time spent in application, peak physical and virtual memory used).

1.2.20 2017-03-25 Saturday

  1. Time and memory efficiency of HPL simulation   SMPI R EXPERIMENTS PERFORMANCE HPL
    • HPL commit: cb54a92b8304e0cd2f1728b887cc4cc615334c2d
    • Script commit: 8af35470776a0b6f2041cf8e0121739f94fdc34d
    • Command line to run the experiment:

      ./run_measures.py --global_csv hpl2.csv --nb_runs 3 --size 100,5000,10000,15000,20000,25000,30000,35000,40000
      --nb_proc 1,8,16,24,32,40,48,56,64 --fat_tree "2;8,8;1,8;1,1" --experiment HPL
      
    • Plots:

      library(ggplot2)
      do_plot <- function(my_plot, title) {
          return(my_plot +
      	stat_summary(geom="line", fun.y=mean)+
      	stat_summary(fun.data = mean_sdl)+
      	ggtitle(title)
          )
      }
      results <- read.csv('hpl_analysis/hpl.csv')
      head(results)
      
             topology nb_roots nb_proc  size    time Gflops simulation_time
      1 2;8,8;1,8;1,1        8      48 40000  593.10 71.940        60.75480
      2 2;8,8;1,8;1,1        8      40 20000  144.88 36.820        24.53460
      3 2;8,8;1,8;1,1        8       8 30000 1290.99 13.940        13.39820
      4 2;8,8;1,8;1,1        8      56 10000   37.93 17.580        12.92780
      5 2;8,8;1,8;1,1        8       1 30000 9609.94  1.873         3.67895
      6 2;8,8;1,8;1,1        8      64 10000   27.20 24.510         9.96141
        application_time       uss         rss
      1         14.47840 701091840 13509701632
      2          6.44959 327905280  3533713408
      3          6.14242 217612288  7422472192
      4          2.55716 211193856  1016156160
      5          3.58312   5619712  7209476096
      6          2.10660 179879936   984698880
      
      do_plot(ggplot(results, aes(x=size, y=simulation_time, group=nb_proc, color=nb_proc)),
         "Simulation time vs size")
      

      1.png

      do_plot(ggplot(results, aes(x=nb_proc, y=simulation_time, group=size, color=size)),
          "Simulation time vs number of processes")
      

      2.png

      do_plot(ggplot(results, aes(x=size, y=uss, group=nb_proc, color=nb_proc)),
          "Physical memory consumption vs size")
      

      3.png

      do_plot(ggplot(results, aes(x=nb_proc, y=uss, group=size, color=size)),
         "Physical memory consumption vs number of processes")
      

      4.png

    • We see here that despite all the optimizations:
      • The simulation time seems to be quadratic in the matrix size.
      • The simulation time seems to be (roughly) linear in the number of processes.
      • The memory consumption seems to be linear in the matrix size.
      • The memory consumption seems to be (roughly) linear in the number of processes.
    • There are some irregularities in the time and the memory vs the number of processes. A hypothesis is that this is due to different virtual topologies. In this experiment, the numbers of processes are multiples of 8, so some of them are perfect squares and others are not. It seems that we achieve the best performance when the number of processes is a square. To generate P and Q, the dimensions of the process grid, we try to find two divisors of the number of processes that are reasonably close (if possible); thus, when the number of processes is a square, we have P=Q. One way to implement this divisor search is sketched below.
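    • A minimal sketch of such a divisor search (the script’s actual rule may differ): taking the largest divisor not exceeding the square root makes the grid as square as possible.

      #include <stdio.h>

      /* Pick P and Q with P*Q == n and P as close as possible to sqrt(n). */
      static void choose_grid(int n, int *p, int *q)
      {
          int best = 1;
          for (int d = 1; d * d <= n; d++)
              if (n % d == 0)
                  best = d;         /* largest divisor <= sqrt(n) */
          *p = best;
          *q = n / best;
      }

      int main(void)
      {
          for (int n = 8; n <= 64; n += 8) {
              int p, q;
              choose_grid(n, &p, &q);
              printf("n=%2d -> P=%d, Q=%d\n", n, p, q);
          }
          return 0;
      }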

1.2.21 2017-03-27 Monday

  1. DONE Remaining work on HPL (following discussion with Arnaud) [3/3]   SMPI HPL MEETING
    • [X] Do not look further regarding the WORK[0] anomaly.
    • [X] Do careful experiments to validate the optimizations.
    • [X] Currently, the simulation will not scale in memory. Track the sizes of the mallocs in HPL_pdpanel_init.
  2. More detailed analysis of malloc   R TRACING PERFORMANCE HPL
    • We saw that the memory consumption is still too high; we need to reduce it.
    • Let’s go back to the results from [2017-03-17 Fri]. The corresponding CSV file has been copied into the repository hpl_malloc.
    • Recall that this is a trace of all the mallocs, with N=20000 and P=Q=4.
    • We will focus on the file HPL_pdpanel_init.c, since we assume these are the biggest allocations (after the allocation of the matrix).

      results <- read.csv("hpl_malloc/malloc.csv");
      results <- results[results$file == "/hpl-2.2/src/panel/hpl_pdpanel_init.c",]
      results$idx <- 0:(length(results$size)-1)
      head(results)
      
                                           file line rank    size idx
      99  /hpl-2.2/src/panel/hpl_pdpanel_init.c  245    0 4839432   0
      100 /hpl-2.2/src/panel/hpl_pdpanel_init.c  339    0    5344   1
      101 /hpl-2.2/src/panel/hpl_pdpanel_init.c  245    0 4839432   2
      102 /hpl-2.2/src/panel/hpl_pdpanel_init.c  339    0    5344   3
      106 /hpl-2.2/src/panel/hpl_pdpanel_init.c  245    2 9640392   4
      107 /hpl-2.2/src/panel/hpl_pdpanel_init.c  339    2    5344   5
      
      library(ggplot2)
      ggplot(results, aes(x=idx, y=size, color=factor(line))) + geom_point(alpha=.2) + ggtitle("Sizes of malloc in HPL_pdpanel_init (N=20000, P=Q=4)")
      

      1.png

    • Now that we have removed the matrix allocation, the panel allocation is clearly the one responsible for the high memory consumption. Here, for 16 processes and a matrix of size 20000, this allocation accounts for 160MB of memory.
    • The malloc at line 245 is the one that is a concern. It is made for the WORK attribute.
    • The malloc at line 339 is not a concern. It is made for the IWORK attribute.
    • It is strange that all these allocations are made. Why not allocate the panel once and then reuse it until the end?
    • It may be difficult to split the panel in two parts (one SHARED_MALLOC and one classical malloc). In HPL_pdpanel_init.c, we can find this comment:

      * L1:    JB x JB in all processes
      * DPIV:  JB      in all processes
      * DINFO: 1       in all processes
      * We make sure that those three arrays are contiguous in memory for the
      * later panel broadcast.  We  also  choose  to put this amount of space
      * right  after  L2 (when it exist) so that one can receive a contiguous
      * buffer.
      
  3. Validation of the optimizations   SMPI R EXPERIMENTS HPL
    • Let’s compare vanilla HPL with optimized HPL, to see if the simulation is still faithful.
    • Results for optimized HPL are those of [2017-03-25 Sat].
    • Results for vanilla HPL have been freshly generated:
      • Using HPL commit 6cc643a5c2a123fa549d02a764bea408b5ad6114
      • Using script commit 7a9e467f9446c65a9dbc2a76c4dab7a3d8209148
      • Command:

        ./run_measures.py --global_csv hpl_vanilla.csv --nb_runs 1 --size 100,5000,10000,15000,20000 --nb_proc
        1,8,16,24,32,40,48,56,64 --fat_tree "2;8,8;1,8;1,1" --experiment HPL
        
    • Analysis:

      library(ggplot2)
      optimized_results <- read.csv('hpl_analysis/hpl.csv')
      vanilla_results <- read.csv('hpl_analysis/hpl_vanilla.csv')
      optimized_results$hpl = 'optimized_hpl'
      vanilla_results$hpl = 'vanilla_hpl'
      results = rbind(optimized_results, vanilla_results)
      head(results)
      
             topology nb_roots nb_proc  size    time Gflops simulation_time
      1 2;8,8;1,8;1,1        8      48 40000  593.10 71.940        60.75480
      2 2;8,8;1,8;1,1        8      40 20000  144.88 36.820        24.53460
      3 2;8,8;1,8;1,1        8       8 30000 1290.99 13.940        13.39820
      4 2;8,8;1,8;1,1        8      56 10000   37.93 17.580        12.92780
      5 2;8,8;1,8;1,1        8       1 30000 9609.94  1.873         3.67895
      6 2;8,8;1,8;1,1        8      64 10000   27.20 24.510         9.96141
        application_time       uss         rss           hpl
      1         14.47840 701091840 13509701632 optimized_hpl
      2          6.44959 327905280  3533713408 optimized_hpl
      3          6.14242 217612288  7422472192 optimized_hpl
      4          2.55716 211193856  1016156160 optimized_hpl
      5          3.58312   5619712  7209476096 optimized_hpl
      6          2.10660 179879936   984698880 optimized_hpl
      
      plot_results <- function(nb_proc) {
          ggplot(results[results$nb_proc==nb_proc,], aes(x=size, y=Gflops, color=hpl)) +
      	geom_point() + geom_line() +
      	expand_limits(x=0, y=0) +
      	ggtitle(paste("Gflops vs size, nb_proc = ", nb_proc))
      }
      
      plot_results(1)
      

      5.png

      plot_results(8)
      

      6.png

      plot_results(16)
      

      7.png

      plot_results(24)
      

      8.png

      plot_results(32)
      

      9.png

      plot_results(40)
      

      10.png

      plot_results(48)
      

      11.png

      plot_results(56)
      

      12.png

      plot_results(64)
      

      13.png

    • From the above plots, it seems that optimized HPL is always too optimistic in terms of performance. However, the difference is not that large.

      merged_results = merge(x=vanilla_results, y=optimized_results, by=c("nb_proc", "size"))
      merged_results$error = abs((merged_results$Gflops.x - merged_results$Gflops.y)/merged_results$Gflops.y)
      ggplot(merged_results, aes(x=factor(size), y=error)) +
          geom_boxplot() + geom_jitter(aes(color=nb_proc)) +
          ggtitle("Error vs size")
      

      14.png

      ggplot(merged_results, aes(x=factor(nb_proc), y=error)) +
          geom_boxplot() + geom_jitter(aes(color=size)) +
          ggtitle("Error vs nb_proc")
      

      15.png

    • We see here that the biggest errors made by the simulation are for a size of 100 and 1 process. For larger sizes and numbers of processes, the error never goes above 10%. On average, it is lower than 5%.

      ggplot(results[results$nb_proc==64,], aes(x=size, y=simulation_time, color=hpl)) +
          geom_point() + geom_line() +
          expand_limits(x=0, y=0) +
          ggtitle("Simulation time vs size, P=Q=8")
      

      16.png

      ggplot(results[results$nb_proc==64,], aes(x=size, y=uss, color=hpl)) +
          geom_point() + geom_line() +
          expand_limits(x=0, y=0) +
          ggtitle("Real memory vs size, P=Q=8")
      

      17.png

    • There is a very significant gain in terms of memory consumption and simulation time.

1.2.22 2017-03-28 Tuesday

  1. Booked the plane tickets for Bordeaux
  2. Attempt at an allocation hack in HPL_pdpanel_init   SMPI C PERFORMANCE HPL
    • Greatly inspired by what is done for the global SMPI_SHARED_MALLOC.
    • The idea is to reserve a large block of virtual addresses. The first part is mapped onto a (short) buffer in a cyclic way. The second part is kept private.
    • Currently there are some bugs (invalid writes, leading to a segmentation fault).

1.2.23 2017-03-29 Wednesday

  1. Keep trying to use some shared memory for PANEL->WORK   SMPI C PERFORMANCE BUG HPL
    • The invalid writes of yesterday were on accesses to the WORK buffer. We had forgotten the space needed for the buffer U at the end of WORK; now fixed.
    • Add some printf to see the start and end addresses of the different buffers. Everything seems fine.
    • Add a check: we fill the private zone (DPIV and DINFO) with 0, then fill the shared zone with garbage, and finally check that the private zone is still full of 0.
    • Now, there is an invalid write of 4 bytes, by HPL_plindx1, located just after the buffer IWORK (the allocation of this buffer did not change).
    • Test for N=5000, P=Q=4. Found that in file HPL_plindx1.c, variable ipU reaches 120 in the buggy case, but only 119 in the normal case. So it is likely that the array is not too short, but rather that this variable is incremented too much.
    • In the for loop where this happens, ipU is incremented when some conditions are fulfilled. One of these conditions is the combination of these two if:

      if( srcrow == icurrow ) {
          if( ( dstrow == icurrow ) && ( dst - ia < jb ) )
      // [...]
      

      When ipU reaches 120, the illegal write is:

      iwork[ipU] = IPLEN[il];
      

      When this happens, the variable dst is 0 and thus the condition dst-ia<jb is met. But intuitively, this condition should not be met like this (jb is always positive). A bit earlier in the loop, dst is set with:

      dst = IPID[i+1];
      

      Printing this array in the buggy case and in the normal case, we see that the last element of the array is sometimes 0 in the buggy case, but never in the normal case. Thus, it seems that there is an issue with IPID.

    • Note that we also had issues with IPID when using SHARED_MALLOC.
  2. Looking at PANEL->DPIV (again)   SMPI BUG HPL
    • Add a printf in HPL_pipid.c (function which compute IPID, using DPIV) to see the content of DPIV.
    • In the buggy case, the array DPIV is sometimes full of 0. This does not happen in the normal case. If we put something else in DPIV when it is allocated (e.g. 0, 1, 2, …), then this content is shown instead of the zeroes. Thus, in these cases, DPIV is never filled after its initialization.
    • Hypothesis: when the panels are sent with MPI, the size is too short and DPIV is not sent.
  3. Discussion with Arnaud and Augustin   MEETING
    • Instead of putting an empty space between the shared block and the private block (for alignment), make them really contiguous (and do not share the last page of the “shared” block).
  4. Reimplement the shared allocation   SMPI C PERFORMANCE HPL
    • The code was a mess; let’s restart with something better, using Augustin’s idea.
    • The interface is as follows:

      void *allocate_shared(int size, int start_private, int stop_private);
      

      It allocates a contiguous block of virtual addresses of the given size that all fit in a small block of physical memory, except for a contiguous block located between the indices start_private (included) and stop_private (excluded). Calling allocate_shared(size, size, size) is (semantically) equivalent to calling SMPI_SHARED_MALLOC(size). Calling allocate_shared(size, 0, size) is (semantically) equivalent to calling malloc(size).

    • Similarly to SHARED_MALLOC, we map the shared zones block by block onto the same range of addresses (a sketch of the mechanism is given at the end of this item). The “best” block size remains to be determined.
    • Since every call to mmap is a syscall, we should avoid too small a block size. We used 0x1000 at the beginning; the performance was terrible.
    • Still for performance reasons, if the size is too small, we should simply do a malloc (and thus not have any shared zone).
    • Valgrind does not report any error (it did with the previous implementation). There are, however, some small memory leaks.
    • Performance is good. Tested with N=40000, P=Q=8: the simulation time increased from 85 seconds to 112 seconds, the memory consumption decreased from 675 MB to 95 MB, and the virtual time and the Gflops were not impacted.
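    • A minimal sketch of the idea, assuming POSIX mmap (our illustration, not the actual implementation: error handling is simplified, the temporary file is just one possible backing for the shared buffer, blocks straddling the private zone are simply left private, and a trailing partial block also stays private):

      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/mman.h>

      #define BLOCK_SIZE 0x10000  /* all shared pages alias this much physical memory */

      void *allocate_shared(size_t size, size_t start_private, size_t stop_private)
      {
          /* Reserve the whole range with a private anonymous mapping. */
          char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (mem == MAP_FAILED)
              return NULL;

          /* Create a small file to serve as the shared physical buffer. */
          char path[] = "/tmp/shared_malloc_XXXXXX";
          int fd = mkstemp(path);
          if (fd == -1)
              return NULL;
          unlink(path);                      /* the file vanishes when fd is closed */
          if (ftruncate(fd, BLOCK_SIZE) == -1)
              return NULL;

          /* Remap every block fully outside [start_private, stop_private)
           * onto the single shared buffer. */
          for (size_t off = 0; off + BLOCK_SIZE <= size; off += BLOCK_SIZE)
              if (off + BLOCK_SIZE <= start_private || off >= stop_private)
                  if (mmap(mem + off, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED)
                      return NULL;

          close(fd);                         /* the mappings keep the buffer alive */
          return mem;                        /* release with munmap(mem, size): the size must be correct */
      }

      int main(void)
      {
          /* 8 blocks, with a private zone covering the third block. */
          char *p = allocate_shared(8 * BLOCK_SIZE, 2 * BLOCK_SIZE, 3 * BLOCK_SIZE);
          p[0] = 42;                         /* visible through every shared block */
          p[2 * BLOCK_SIZE] = 7;             /* private zone: not aliased */
          printf("p[0]=%d, seen through block 7: %d, private byte: %d\n",
                 p[0], p[7 * BLOCK_SIZE], p[2 * BLOCK_SIZE]);
          return 0;
      }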
  5. DONE Remaining work for shared allocation [4/4]
    • [X] Track the memory leaks (unclosed file?).
    • [X] Clean the block size definition. Put it somewhere accessible by both HPL_pdpanel_init and HPL_pdpanel_free. Maybe use two different values for the block size and the condition to switch to a simple malloc.
    • [X] Find the best value(s) for the block size (and maybe the malloc condition).
    • [X] Contribute this function to Simgrid.

1.2.24 2017-03-30 Thursday

  1. Quick work on shared allocations   SMPI C HPL
    • Clean the size definitions.
      • Use a separate file that is imported in HPL_pdpanel_init.c and HPL_pdpanel_free.c.
      • Use two different sizes: the block size, and the size at which we switch for malloc.
    • Quick look at the possibilities for the sizes
      • Some quick experiments with N=40000, P=Q=8.
      • With BLOCK_SIZE and MALLOC_MAX_SIZE equal to 0x10000:
        • Simulation time: 112 seconds
        • Physical memory: 96 MB
      • With BLOCK_SIZE equal to 0x10000 and MALLOC_MAX_SIZE equal to 0 (never do a malloc):
        • Simulation time: also 112 seconds
        • Physical memory: also 96 MB
      • With BLOCK_SIZE equal to 0x10000 and MALLOC_MAX_SIZE equal to 0x40000 (4 times greater):
        • Simulation time: 137 seconds
        • Physical memory: 93 MB
      • Thus, it seems that the gain from using malloc is not so important. Worse: it can yield a significant loss. Let’s remove it.
      • With BLOCK_SIZE equal to 0x100000 and malloc removed: execution cancelled, all the physical memory was used.
    • Stop using malloc. Also move the size definition back into HPL_pdpanel_init.c.
    • The code is simpler like this, and the malloc trick did not give better performance.
    • Do not bother with the memory leak. It was already there before the shared allocation.
    • Warning: calling munmap with a size of 0 leads to huge memory consumption. It should be called with the correct size.
  2. Implement the partial shared_malloc in Simgrid
    • Even more generic implementation than the one done in HPL. Now, we give a list of offsets of the blocks that should be shared. Thus, we can have an arbitrary mix of shared and private zones inside an allocated block.
    • Tests currently fail. To run a single test and see its output, run:

      ctest --verbose -R tesh-smpi-macro-shared-thread
      

      I suspect (but did not check) that this is because we currently share only blocks aligned on the block size. It would be better to share blocks aligned on the page size (this needs to be fixed). But this does not change the fact that some parts will not be shared; this is expected, so we should modify the tests.

  3. Time and memory efficiency of the partial shared_malloc   SMPI R EXPERIMENTS PERFORMANCE HPL
    • We switch back to the implementation of partial shared_malloc done in HPL, to measure its performances.
    • Simgrid commit: c8db21208f3436c35d3fdf5a875a0059719bff43 (the same commit as for the previous performance analysis)
    • HPL commit: 7af9eb0ec54418bf1521c5eafa9acda1b150446f
    • Script commit: 7a9e467f9446c65a9dbc2a76c4dab7a3d8209148
    • Command line to run the experiment:

      ./run_measures.py --global_csv hpl_partial_shared.csv --nb_runs 1 --size 100,5000,10000,15000,20000,25000,30000,35000,40000
      --nb_proc 1,8,16,24,32,40,48,56,64 --fat_tree "2;8,8;1,8;1,1" --experiment HPL
      
    • Analysis:

      library(ggplot2)
      partial_shared_results <- read.csv('hpl_analysis/hpl_partial_shared.csv')
      optimized_results <- read.csv('hpl_analysis/hpl.csv')
      vanilla_results <- read.csv('hpl_analysis/hpl_vanilla.csv')
      partial_shared_results$hpl = 'partial_shared_hpl'
      optimized_results$hpl = 'optimized_hpl'
      vanilla_results$hpl = 'vanilla_hpl'
      results = rbind(partial_shared_results, optimized_results, vanilla_results)
      head(results)
      
             topology nb_roots nb_proc  size     time    Gflops simulation_time
      1 2;8,8;1,8;1,1        8      24 25000   319.37 32.620000      25.8119000
      2 2;8,8;1,8;1,1        8      24  5000    13.03  6.399000       2.7273300
      3 2;8,8;1,8;1,1        8      24 35000   781.76 36.570000      49.3234000
      4 2;8,8;1,8;1,1        8      40   100     0.23  0.003028       0.0779319
      5 2;8,8;1,8;1,1        8       1 35000 15257.68  1.873000       5.8686300
      6 2;8,8;1,8;1,1        8      64 40000   488.99 87.260000     111.7290000
        application_time      uss         rss                hpl
      1        8.0867100 55365632  5274730496 partial_shared_hpl
      2        0.6131710 14643200   257220608 partial_shared_hpl
      3       16.0733000 74350592 10180751360 partial_shared_hpl
      4        0.0196671        0           0 partial_shared_hpl
      5        5.7156200  4775936  9809465344 partial_shared_hpl
      6       29.3046000 95391744 13475909632 partial_shared_hpl
      
    plot_results <- function(nb_proc) {
        ggplot(results[results$nb_proc==nb_proc,], aes(x=size, y=Gflops, color=hpl)) +
    	geom_point() + geom_line() +
    	expand_limits(x=0, y=0) +
    	ggtitle(paste("Gflops vs size, nb_proc = ", nb_proc))
    }
    
    plot_results(32)
    

    18.png

    plot_results(64)
    

    19.png

    • It seems that this new optimization did not change the accuracy of the simulation. Let’s have a look at the time and memory.

      ggplot(results[results$nb_proc==64,], aes(x=size, y=simulation_time, color=hpl)) +
          geom_point() + geom_line() +
          expand_limits(x=0, y=0) +
          ggtitle("Simulation time vs size, P=Q=8")
      

      20.png

      ggplot(results[results$nb_proc==64,], aes(x=size, y=uss, color=hpl)) +
          geom_point() + geom_line() +
          expand_limits(x=0, y=0) +
          ggtitle("Real memory vs size, P=Q=8")
      

      21.png

    • We see here that sharing some parts of the PANEL->WORK buffer has two effects. The simulation time is a bit larger, but the memory consumption is much lower.
    • Let’s have a look at this version of HPL in more detail.

      do_plot <- function(my_plot, title) {
          return(my_plot +
      	geom_point() + geom_line() +
      	ggtitle(title)
          )
      }
      
      do_plot(ggplot(partial_shared_results, aes(x=size, y=simulation_time, group=nb_proc, color=nb_proc)),
         "Simulation time vs size")
      

      22.png

      do_plot(ggplot(partial_shared_results, aes(x=nb_proc, y=simulation_time, group=size, color=size)),
          "Simulation time vs number of processes")
      

      23.png

      do_plot(ggplot(partial_shared_results, aes(x=size, y=uss, group=nb_proc, color=nb_proc)),
          "Physical memory consumption vs size")
      

      24.png

    do_plot(ggplot(partial_shared_results, aes(x=nb_proc, y=uss, group=size, color=size)),
        "Physical memory consumption vs number of processes")
    

    25.png

    • The trend for the simulation time looks similar to what we got previously.
    • The memory consumption still looks linear in the size. As a function of the number of processes it is also roughly linear, but with a very small slope (almost flat).
  4. Regression of Time and memory efficiency of the partial shared_malloc (Arnaud)   SMPI R EXPERIMENTS PERFORMANCE HPL
    results$hpl=factor(results$hpl)
    data = results[results$hpl=="partial_shared_hpl" & 
    	       results$nb_proc > 1 & results$size > 1000, # get rid of particularly small values
    	       c("nb_proc","size","Gflops","simulation_time","uss")]
    head(data)
    
      nb_proc  size Gflops simulation_time      uss
    1      24 25000 32.620        25.81190 55365632
    2      24  5000  6.399         2.72733 14643200
    3      24 35000 36.570        49.32340 74350592
    6      64 40000 87.260       111.72900 95391744
    7      24 10000 16.600         6.22743 26472448
    8      40 40000 55.990       100.31300 91209728
    
    plot(data)
    

    26.png

    reg_rss = lm(data=data,uss ~ size+nb_proc) # Interactions do not bring much
    summary(reg_rss)
    
    
    Call:
    lm(formula = uss ~ size + nb_proc, data = data)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -6941093 -1573650  -348763  1611008  8790400 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 7.827e+05  1.030e+06   0.760     0.45    
    size        2.054e+03  3.045e+01  67.449  < 2e-16 ***
    nb_proc     1.717e+05  1.903e+04   9.022 7.85e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 2791000 on 61 degrees of freedom
    Multiple R-squared:  0.987,	Adjusted R-squared:  0.9866 
    F-statistic:  2315 on 2 and 61 DF,  p-value: < 2.2e-16
    
    par(mfrow=c(2,3)) ; 
      plot(data=data,uss~size); 
      plot(data=data,uss~nb_proc);
      plot(reg_rss); 
    par(mfrow=c(1,1))
    

    27.png

    The Stampede HPL output indicates:

    The following parameter values will be used:
    
    N        : 3875000 
    NB       :    1024 
    PMAP     : Column-major process mapping
    P        :      77 
    Q        :      78 
    PFACT    :   Right 
    NBMIN    :       4 
    NDIV     :       2 
    RFACT    :   Crout 
    BCAST    :  BlongM 
    DEPTH    :       0 
    SWAP     : Binary-exchange
    L1       : no-transposed form
    U        : no-transposed form
    EQUIL    : no
    ALIGN    :    8 double precision words
    

    We aim at size=3875000 and nb_proc=77*78=6006.

    data[data$nb_proc==64 & data$size==40000,]
    data[data$nb_proc==64 & data$size==40000,]$uss/1E6 # in MB
    example=data.frame(size=c(3875000,40000), nb_proc=c(77*78,64));
    predict(reg_rss, example, interval="prediction", level=0.95)/1E6
    
      nb_proc  size Gflops simulation_time      uss
    6      64 40000  87.26         111.729 95391744
    [1] 95.39174
             fit        lwr        upr
    1 8991.32610 8664.69163 9317.96056
    2   93.93216   88.10931   99.75501
    

    So we should need around 8 to 9 GB (indeed, evaluating the model by hand: 7.827e5 + 2.054e3 × 3875000 + 1.717e5 × 6006 ≈ 9.0e9 bytes). Good.

    reg_time = lm(data=data,simulation_time ~ poly(size,3)*poly(nb_proc,2)) # full model, with all interactions
    summary(reg_time)
    reg_time = lm(data=data,simulation_time ~ poly(size,3)+poly(nb_proc,2)+I(size*nb_proc)) # only the size:nb_proc interaction matters
    summary(reg_time)
    reg_time = lm(data=data,simulation_time ~ poly(size,2)+poly(nb_proc,1)+I(size*nb_proc)) # drop the non-significant high-order terms
    summary(reg_time)
    
    
    Call:
    lm(formula = simulation_time ~ poly(size, 3) * poly(nb_proc, 
        2), data = data)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -14.6972  -2.8188   0.1211   1.4618  23.6037 
    
    Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                       34.3882     0.8715  39.458  < 2e-16 ***
    poly(size, 3)1                   200.7402     6.9721  28.792  < 2e-16 ***
    poly(size, 3)2                    37.6113     6.9721   5.395 1.71e-06 ***
    poly(size, 3)3                     0.9386     6.9721   0.135   0.8934    
    poly(nb_proc, 2)1                110.2551     6.9721  15.814  < 2e-16 ***
    poly(nb_proc, 2)2                 -9.0383     6.9721  -1.296   0.2006    
    poly(size, 3)1:poly(nb_proc, 2)1 619.6089    55.7771  11.109 2.43e-15 ***
    poly(size, 3)2:poly(nb_proc, 2)1 101.1174    55.7771   1.813   0.0756 .  
    poly(size, 3)3:poly(nb_proc, 2)1  -2.3618    55.7771  -0.042   0.9664    
    poly(size, 3)1:poly(nb_proc, 2)2 -54.5865    55.7771  -0.979   0.3323    
    poly(size, 3)2:poly(nb_proc, 2)2 -13.4280    55.7771  -0.241   0.8107    
    poly(size, 3)3:poly(nb_proc, 2)2  -6.7984    55.7771  -0.122   0.9035    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 6.972 on 52 degrees of freedom
    Multiple R-squared:  0.9597,	Adjusted R-squared:  0.9511 
    F-statistic: 112.5 on 11 and 52 DF,  p-value: < 2.2e-16
    
    Call:
    lm(formula = simulation_time ~ poly(size, 3) + poly(nb_proc, 
        2) + I(size * nb_proc), data = data)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -11.9992  -3.5157   0.0224   2.7090  25.8055 
    
    Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
    (Intercept)       -2.954e+00  3.452e+00  -0.856  0.39567    
    poly(size, 3)1     4.863e+01  1.527e+01   3.184  0.00236 ** 
    poly(size, 3)2     3.761e+01  6.930e+00   5.427 1.22e-06 ***
    poly(size, 3)3     9.386e-01  6.930e+00   0.135  0.89275    
    poly(nb_proc, 2)1 -4.186e+01  1.527e+01  -2.740  0.00818 ** 
    poly(nb_proc, 2)2 -9.038e+00  6.930e+00  -1.304  0.19742    
    I(size * nb_proc)  4.610e-05  4.125e-06  11.176 5.47e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 6.93 on 57 degrees of freedom
    Multiple R-squared:  0.9563,	Adjusted R-squared:  0.9517 
    F-statistic:   208 on 6 and 57 DF,  p-value: < 2.2e-16
    
    Call:
    lm(formula = simulation_time ~ poly(size, 2) + poly(nb_proc, 
        1) + I(size * nb_proc), data = data)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -11.8123  -3.6614   0.2628   2.4029  25.7019 
    
    Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
    (Intercept)       -2.954e+00  3.444e+00  -0.858  0.39442    
    poly(size, 2)1     4.863e+01  1.524e+01   3.191  0.00227 ** 
    poly(size, 2)2     3.761e+01  6.914e+00   5.440 1.07e-06 ***
    poly(nb_proc, 1)  -4.186e+01  1.524e+01  -2.747  0.00797 ** 
    I(size * nb_proc)  4.610e-05  4.115e-06  11.202 3.08e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 6.914 on 59 degrees of freedom
    Multiple R-squared:  0.955,	Adjusted R-squared:  0.952 
    F-statistic: 313.1 on 4 and 59 DF,  p-value: < 2.2e-16
    
    par(mfrow=c(2,3)) ; 
      plot(data=data,simulation_time~size); 
      plot(data=data,simulation_time~nb_proc);
      plot(reg_time); 
    par(mfrow=c(1,1))
    

    28.png

    data[data$nb_proc==64 & data$size==40000,]
    predict(reg_time, example, interval="prediction", level=0.95)/3600 # in hours
    
      nb_proc  size Gflops simulation_time      uss
    6      64 40000  87.26         111.729 95391744
               fit          lwr          upr
    1 467.31578577 385.82615026 548.80542127
    2   0.03431702   0.03008967   0.03854438
    

    Ouch. This would be a three-week simulation (≈467 hours ≈ 19.5 days). :( We need to speed things up.

1.2.25 2017-03-31 Friday

  1. Found a bug in the latest commits of Simgrid   SMPI BUG HPL
    • Issue reported on Github.
    • Bug fixed.
    • There are still some problems with HPL: some uninitialized values are used in comparisons:

      ==3320== Memcheck, a memory error detector
      ==3320== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
      ==3320== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
      ==3320== Command: ./xhpl --cfg=surf/precision:1e-9 --cfg=network/model:SMPI --cfg=network/TCP-gamma:4194304 --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes --cfg=smpi/shared-malloc:local --cfg=smpi/privatize-global-variables:1 ./cluster_fat_tree_64.xml smpitmp-apprXPdW8
      ==3320== 
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'surf/precision' to '1e-9'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'network/model' to 'SMPI'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'network/TCP-gamma' to '4194304'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'smpi/bcast' to 'mpich'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'smpi/running-power' to '6217956542.969'
      [0.000000] [xbt_cfg/INFO] Option smpi/running-power has been renamed to smpi/host-speed. Consider switching.
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'smpi/display-timing' to 'yes'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'smpi/privatize-global-variables' to 'yes'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'smpi/shared-malloc' to 'local'
      [0.000000] [xbt_cfg/INFO] Configuration change: Set 'smpi/privatize-global-variables' to '1'
      [0.000000] [smpi_coll/INFO] Switch to algorithm mpich for collective bcast
      ================================================================================
      HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
      Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
      Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
      Modified by Julien Langou, University of Colorado Denver
      ================================================================================
      
      An explanation of the input/output parameters follows:
      T/V    : Wall time / encoded variant.
      N      : The order of the coefficient matrix A.
      NB     : The partitioning blocking factor.
      P      : The number of process rows.
      Q      : The number of process columns.
      Time   : Time in seconds to solve the linear system.
      Gflops : Rate of execution for solving the linear system.
      
      The following parameter values will be used:
      
      N      :      29       30       34       35 
      NB     :       1        2        3        4 
      PMAP   : Row-major process mapping
      P      :       2        1        4 
      Q      :       2        4        1 
      PFACT  :    Left    Crout    Right 
      NBMIN  :       2        4 
      NDIV   :       2 
      RFACT  :    Left    Crout    Right 
      BCAST  :   1ring 
      DEPTH  :       0 
      SWAP   : Mix (threshold = 64)
      L1     : transposed form
      U      : transposed form
      EQUIL  : yes
      ALIGN  : 8 double precision words
      
      --------------------------------------------------------------------------------
      
      - The matrix A is randomly generated for each test.
      - The following scaled residual check will be computed:
            ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
      - The relative machine precision (eps) is taken to be               1.110223e-16
      - Computational tests pass if scaled residuals are less than                16.0
      
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x42447D: HPL_pipid (HPL_pipid.c:144)
      ==3320==    by 0x418ED8: HPL_pdlaswp00T (HPL_pdlaswp00T.c:171)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x42476D: HPL_plindx0 (HPL_plindx0.c:246)
      ==3320==    by 0x418EF6: HPL_pdlaswp00T (HPL_pdlaswp00T.c:172)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4247A9: HPL_plindx0 (HPL_plindx0.c:250)
      ==3320==    by 0x418EF6: HPL_pdlaswp00T (HPL_pdlaswp00T.c:172)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Use of uninitialised value of size 8
      ==3320==    at 0x420413: HPL_dlaswp01T (HPL_dlaswp01T.c:240)
      ==3320==    by 0x418BDD: HPL_pdlaswp00T (HPL_pdlaswp00T.c:194)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4E779CC: idamax_ (in /usr/lib/libblas/libblas.so.3.6.0)
      ==3320==    by 0x4E779FA: idamaxsub_ (in /usr/lib/libblas/libblas.so.3.6.0)
      ==3320==    by 0x4E4796F: cblas_idamax (in /usr/lib/libblas/libblas.so.3.6.0)
      ==3320==    by 0x4134F0: HPL_dlocmax (HPL_dlocmax.c:125)
      ==3320==    by 0x40B277: HPL_pdpanllT (HPL_pdpanllT.c:167)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x417083: HPL_pdmxswp (HPL_pdmxswp.c:238)
      ==3320==    by 0x40B4C2: HPL_pdpanllT (HPL_pdpanllT.c:221)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x417098: HPL_pdmxswp (HPL_pdmxswp.c:238)
      ==3320==    by 0x40B4C2: HPL_pdpanllT (HPL_pdpanllT.c:221)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4170A2: HPL_pdmxswp (HPL_pdmxswp.c:239)
      ==3320==    by 0x40B4C2: HPL_pdpanllT (HPL_pdpanllT.c:221)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4170A4: HPL_pdmxswp (HPL_pdmxswp.c:239)
      ==3320==    by 0x40B4C2: HPL_pdpanllT (HPL_pdpanllT.c:221)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4170A6: HPL_pdmxswp (HPL_pdmxswp.c:239)
      ==3320==    by 0x40B4C2: HPL_pdpanllT (HPL_pdpanllT.c:221)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4150D5: HPL_dlocswpT (HPL_dlocswpT.c:134)
      ==3320==    by 0x40B4D2: HPL_pdpanllT (HPL_pdpanllT.c:222)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4150D7: HPL_dlocswpT (HPL_dlocswpT.c:134)
      ==3320==    by 0x40B4D2: HPL_pdpanllT (HPL_pdpanllT.c:222)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x40B4DF: HPL_pdpanllT (HPL_pdpanllT.c:223)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x40B4E1: HPL_pdpanllT (HPL_pdpanllT.c:223)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x42483B: HPL_plindx0 (HPL_plindx0.c:255)
      ==3320==    by 0x418EF6: HPL_pdlaswp00T (HPL_pdlaswp00T.c:172)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x424877: HPL_plindx0 (HPL_plindx0.c:269)
      ==3320==    by 0x418EF6: HPL_pdlaswp00T (HPL_pdlaswp00T.c:172)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Use of uninitialised value of size 8
      ==3320==    at 0x420B90: HPL_dlaswp02N (HPL_dlaswp02N.c:199)
      ==3320==    by 0x418570: HPL_pdlaswp00T (HPL_pdlaswp00T.c:198)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Use of uninitialised value of size 8
      ==3320==    at 0x422901: HPL_dlaswp04T (HPL_dlaswp04T.c:259)
      ==3320==    by 0x418CC3: HPL_pdlaswp00T (HPL_pdlaswp00T.c:329)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x41F06D: HPL_pdpanel_free (HPL_pdpanel_free.c:79)
      ==3320==    by 0x41AF31: HPL_pdgesv0 (HPL_pdgesv0.c:141)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4248A5: HPL_plindx0 (HPL_plindx0.c:258)
      ==3320==    by 0x418EF6: HPL_pdlaswp00T (HPL_pdlaswp00T.c:172)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4203FF: HPL_dlaswp01T (HPL_dlaswp01T.c:237)
      ==3320==    by 0x418BDD: HPL_pdlaswp00T (HPL_pdlaswp00T.c:194)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Use of uninitialised value of size 8
      ==3320==    at 0x4205A0: HPL_dlaswp01T (HPL_dlaswp01T.c:245)
      ==3320==    by 0x418BDD: HPL_pdlaswp00T (HPL_pdlaswp00T.c:194)
      ==3320==    by 0x40E878: HPL_pdupdateTT (HPL_pdupdateTT.c:271)
      ==3320==    by 0x41AF9F: HPL_pdgesv0 (HPL_pdgesv0.c:152)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x4170B5: HPL_pdmxswp (HPL_pdmxswp.c:240)
      ==3320==    by 0x40B4C2: HPL_pdpanllT (HPL_pdpanllT.c:221)
      ==3320==    by 0x4243C8: HPL_pdfact (HPL_pdfact.c:129)
      ==3320==    by 0x41AF61: HPL_pdgesv0 (HPL_pdgesv0.c:146)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      ==3320== Conditional jump or move depends on uninitialised value(s)
      ==3320==    at 0x41F06D: HPL_pdpanel_free (HPL_pdpanel_free.c:79)
      ==3320==    by 0x41F040: HPL_pdpanel_disp (HPL_pdpanel_disp.c:89)
      ==3320==    by 0x41AFCD: HPL_pdgesv0 (HPL_pdgesv0.c:161)
      ==3320==    by 0x40EFC4: HPL_pdgesv (HPL_pdgesv.c:103)
      ==3320==    by 0x406F64: HPL_pdtest (HPL_pdtest.c:197)
      ==3320==    by 0x401D38: smpi_simulated_main_ (HPL_pddriver.c:223)
      ==3320==    by 0x525BCDA: smpi_main_wrapper (smpi_global.cpp:366)
      ==3320==    by 0x5129B8D: operator() (functional.hpp:48)
      ==3320==    by 0x5129B8D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
      ==3320==    by 0x5151BB1: operator() (functional:2136)
      ==3320==    by 0x5151BB1: operator() (Context.hpp:92)
      ==3320==    by 0x5151BB1: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:303)
      ==3320== 
      [0.884470] /home/degomme/simgrid/src/simix/smx_global.cpp:567: [simix_kernel/CRITICAL] Oops ! Deadlock or code not perfectly clean.
      [0.884470] [simix_kernel/INFO] 16 processes are still running, waiting for something.
      [0.884470] [simix_kernel/INFO] Legend of the following listing: "Process <pid> (<name>@<host>): <status>"
      [0.884470] [simix_kernel/INFO] Process 1 (0@host-0.hawaii.edu): waiting for communication synchro 0xfb4beb0 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 2 (1@host-1.hawaii.edu): waiting for communication synchro 0xfb4b0c0 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 3 (2@host-2.hawaii.edu): waiting for communication synchro 0xfb49760 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 4 (3@host-3.hawaii.edu): waiting for communication synchro 0xfb47590 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 5 (4@host-4.hawaii.edu): waiting for synchronization synchro 0xf8a1ae0 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 6 (5@host-5.hawaii.edu): waiting for synchronization synchro 0xf8a1f10 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 7 (6@host-6.hawaii.edu): waiting for synchronization synchro 0xf897500 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 8 (7@host-7.hawaii.edu): waiting for synchronization synchro 0xf89b190 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 9 (8@host-8.hawaii.edu): waiting for synchronization synchro 0xf8a3680 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 10 (9@host-9.hawaii.edu): waiting for synchronization synchro 0xf896280 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 11 (10@host-10.hawaii.edu): waiting for synchronization synchro 0xf8970d0 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 12 (11@host-11.hawaii.edu): waiting for synchronization synchro 0xf89b5c0 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 13 (12@host-12.hawaii.edu): waiting for synchronization synchro 0xf89ce30 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 14 (13@host-13.hawaii.edu): waiting for synchronization synchro 0xf89f530 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 15 (14@host-14.hawaii.edu): waiting for synchronization synchro 0xf89f100 () in state 0 to finish
      [0.884470] [simix_kernel/INFO] Process 16 (15@host-15.hawaii.edu): waiting for synchronization synchro 0xf8a0ca0 () in state 0 to finish
      ==3320== 
      ==3320== Process terminating with default action of signal 6 (SIGABRT)
      ==3320==    at 0x5619428: raise (raise.c:54)
      ==3320==    by 0x561B029: abort (abort.c:89)
      ==3320==    by 0x52347B8: xbt_abort (xbt_main.cpp:167)
      ==3320==    by 0x52F4768: SIMIX_run.part.110 (smx_global.cpp:569)
      ==3320==    by 0x52F6204: SIMIX_run (stl_algobase.h:224)
      ==3320==    by 0x5263E66: smpi_main (smpi_global.cpp:474)
      ==3320==    by 0x560482F: (below main) (libc-start.c:291)
      ==3320== 
      ==3320== HEAP SUMMARY:
      ==3320==     in use at exit: 136,159,788 bytes in 7,560 blocks
      ==3320==   total heap usage: 39,378 allocs, 31,818 frees, 140,230,437 bytes allocated
      ==3320== 
      ==3320== LEAK SUMMARY:
      ==3320==    definitely lost: 321 bytes in 4 blocks
      ==3320==    indirectly lost: 0 bytes in 0 blocks
      ==3320==      possibly lost: 134,294,280 bytes in 96 blocks
      ==3320==    still reachable: 1,865,187 bytes in 7,460 blocks
      ==3320==         suppressed: 0 bytes in 0 blocks
      ==3320== Rerun with --leak-check=full to see details of leaked memory
      ==3320== 
      ==3320== For counts of detected and suppressed errors, rerun with: -v
      ==3320== Use --track-origins=yes to see where uninitialised values come from
      ==3320== ERROR SUMMARY: 1147 errors from 24 contexts (suppressed: 0 from 0)
      valgrind --track-origins:yes ./xhpl --cfg=surf/precision:1e-9 --cfg=network/model:SMPI --cfg=network/TCP-gamma:4194304 --cfg=smpi/bcast:mpich --cfg=smpi/running-power:6217956542.969 --cfg=smpi/display-timing:yes --cfg=smpi/privatize-global-variables:yes --cfg=smpi/shared-malloc:local --cfg=smpi/privatize-global-variables:1 ./cluster_fat_tree_64.xml smpitmp-apprXPdW8
      Execution failed with code 134.
      
    • Note that this output was obtained with a nearly-vanilla HPL (see the Github issue): no smpi_usleep, and shared malloc only for the matrix (no partial shared malloc for PANEL->WORK). It is thus quite strange to see such errors. Also, the exit code 134 is simply 128+6, i.e. the SIGABRT raised by xbt_abort above.
    • The first error (HPL_pipid.c:144) happens because PANEL->ia is uninitialized (checked by modifying the two operands one after the other to see whether the error persists).

1.3 2017-04 April

1.4 2017-05 May

1.5 2017-06 June

1.5.1 2017-06-01 Thursday

  1. Redo validation of huge pages   SMPI EXPERIMENTS HPL REPORT
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: eb071f09d822e1031ea0776949058bf2f55cb94a
    • Compilation and execution for optimized HPL (made on nova-10 without the huge pages, nova-11 with the huge pages)

      make SMPI_OPTS="-DSMPI_OPTIMIZATION_LEVEL=4 -DSMPI_DGEMM_COEFFICIENT=1.742435e-10
      -DSMPI_DTRSM_COEFFICIENT=8.897459e-11" arch=SMPI
      
      sysctl -w vm.overcommit_memory=1 && sysctl -w vm.max_map_count=40000000
      
      mount none /root/huge -t hugetlbfs -o rw,mode=0777 && echo 1 >> /proc/sys/vm/nr_hugepages
      
      ./run_measures.py --global_csv result_size.csv --nb_runs 3 --size 50000,100000,150000,200000,250000,300000 --nb_proc
      64 --topo "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8
      
      ./run_measures.py --global_csv result_size.csv --nb_runs 3 --size 50000,100000,150000,200000,250000,300000 --nb_proc
      64 --topo "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
    • Analysis

      library(ggplot2)
      library(gridExtra)
      old <- rbind(read.csv("validation/result_size_L4_big_nohugepage.csv"), read.csv("validation/result_size_L4_big_nohugepage_2.csv"))
      new <- read.csv("validation/result_size_L4_big_hugepage.csv")
      old$hugepage = FALSE
      new$hugepage =  TRUE
      results = rbind(old, new)
      
      do_plot(results, "size", "simulation_time", "hugepage", "Huge page", 64)
      

      validation/hugepage/1.pdf

      do_plot(results, "size", "memory_size", "hugepage", "Huge page", 64)
      

      validation/hugepage/3.pdf

      do_plot(results, "size", "Gflops", "hugepage", "Huge page", 64)
      

      validation/hugepage/5.pdf

      grid_arrange_shared_legend(
          do_plot(results, "size", "simulation_time", "hugepage", "Huge page", 64),
          do_plot(results, "size", "memory_size", "hugepage", "Huge page", 64),
          nrow=1, ncol=2
      )
      

      validation/hugepage/report_plot.pdf

      plot1 = generic_do_plot(ggplot(results, aes(x=size, y=cpu_utilization, color=hugepage))) +
          ggtitle("CPU utilization for different matrix sizes\nUsing 64 MPI processes")
      plot2 = generic_do_plot(ggplot(results, aes(x=size, y=minor_page_fault, color=hugepage))) +
          ggtitle("Number of page faults for different matrix sizes\nUsing 64 MPI processes")
      grid.arrange(plot1, plot2, ncol=2)
      

      2.png

    library(data.table)
    aggregate_results <- function(results) {
        x = data.table(results)
        x = as.data.frame(x[, list(simulation_time=mean(simulation_time), Gflops=mean(Gflops), application_time=mean(application_time)), by=c("size", "nb_proc")])
        return(x[with(x, order(size, nb_proc)),])
    }
    aggr_old = aggregate_results(old)
    aggr_new = aggregate_results(new)
    aggr_new$Gflops_error = (aggr_new$Gflops - aggr_old$Gflops)/aggr_new$Gflops
    
    generic_do_plot(ggplot(aggr_new, aes(x=size, y=Gflops_error)))
    

    3.png

    • The Gflops error is negligible.
    • The gain of using huge pages is pretty neat for both the simulation time and the memory consumption.
    • Very large variability of the CPU utilization; something weird has happened.
  2. Scalability test   SMPI EXPERIMENTS HPL REPORT
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: 8cfd8d16787f39a29342b64599cf02166af6d632
    • Compilation and execution for optimized HPL (made on nova-10 and nova-11)

      make SMPI_OPTS="-DSMPI_OPTIMIZATION_LEVEL=4 -DSMPI_DGEMM_COEFFICIENT=1.742435e-10
      -DSMPI_DTRSM_COEFFICIENT=8.897459e-11" arch=SMPI
      
      sysctl -w vm.overcommit_memory=1 && sysctl -w vm.max_map_count=40000000
      
      mount none /root/huge -t hugetlbfs -o rw,mode=0777 && echo 1 >> /proc/sys/vm/nr_hugepages
      
      ./run_measures.py --global_csv result_size_1000000_512.csv --nb_runs 1 --size 1000000 --nb_proc 512 --topo
      "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_1000000_1024.csv --nb_runs 1 --size 1000000 --nb_proc 1024 --topo
      "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_1000000_2048.csv --nb_runs 1 --size 1000000 --nb_proc 2048 --topo
      "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_2000000_512.csv --nb_runs 1 --size 2000000 --nb_proc 512 --topo
      "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_2000000_1024.csv --nb_runs 1 --size 2000000 --nb_proc 1024 --topo
      "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_2000000_2048.csv --nb_runs 1 --size 2000000 --nb_proc 2048 --topo
      "2;16,32;1,16;1,1" --experiment HPL --running_power 5004882812.500 --nb_cpu 8 --hugepage /root/huge
      
    • Results:

      rbind(
          read.csv('scalability/result_1000000_512.csv'),
          read.csv('scalability/result_1000000_1024.csv'),
          read.csv('scalability/result_1000000_2048.csv'),
          read.csv('scalability/result_2000000_512.csv'),
          read.csv('scalability/result_2000000_1024.csv'),
          read.csv('scalability/result_2000000_2048.csv')
      )
      
                topology nb_roots nb_proc    size full_time      time Gflops
      1 2;16,32;1,16;1,1       16     512 1000000    716521  716521.0  930.4
      2 2;16,32;1,16;1,1       16    1024 1000000    363201  363201.0 1836.0
      3 2;16,32;1,16;1,1       16    2048 1000000    186496  186495.7 3575.0
      4 2;16,32;1,16;1,1       16     512 2000000   5685080 5685077.7  938.1
      5 2;16,32;1,16;1,1       16    1024 2000000   2861010 2861012.5 1864.0
      6 2;16,32;1,16;1,1       16    2048 2000000   1448900 1448899.1 3681.0
        simulation_time application_time user_time system_time major_page_fault
      1         2635.10           500.97   2367.19      259.91                0
      2         6037.89          1036.96   5515.36      515.05                0
      3        12391.90          2092.95  11389.36      995.39                0
      4         6934.86          1169.66   6193.80      683.73                0
      5        15198.30          2551.10  13714.01     1430.93                0
      6        32263.60          5236.56  29357.92     2844.89                0
        minor_page_fault cpu_utilization        uss        rss page_table_size
      1          1916208            0.99  153665536 2317279232        10600000
      2          2002989            0.99  369676288 4837175296        21252000
      3          2154982            0.99 1010696192 7774138368        42908000
      4          3801905            0.99  150765568 2758770688        10604000
      5          3872820            0.99  365555712 5273034752        21220000
      6          4038099            0.99 1009606656 7415914496        42884000
        memory_size
      1   894443520
      2  1055309824
      3  1581170688
      4  3338420224
      5  3497111552
      6  4027408384
      
  3. Add the Stampede output file to the repository   HPL

1.5.2 2017-06-02 Friday

  1. DONE New scalability tests to run [6/6]   SMPI HPL
    • [X] N=1000000, nbproc=4096, expected time ≈ 206min × 2.2 ≈ 7.5h
    • [X] N=2000000, nbproc=4096, expected time ≈ 537min × 2.2 ≈ 19.7h
    • [X] N=4000000, nbproc=512, expected time ≈ 115min × 2.6 ≈ 5h
    • [X] N=4000000, nbproc=1024, expected time ≈ 253min × 2.6 ≈ 11h
    • [X] N=4000000, nbproc=2048, expected time ≈ 537min × 2.6 ≈ 23.3h
    • [X] N=4000000, nbproc=4096, expected time ≈ 537min × 2.6 × 2.2 ≈ 51h
    The factors ≈2.2 and ≈2.6 are the simulation-time growth ratios observed on [2017-06-01 Thu] when doubling the number of processes and the matrix size, respectively.
  2. Cannot connect anymore to G5K nodes in Lyon   BUG G5K
    • Reserved a job and made a deployment in Lyon. Then, could not connect to the node (neither as tocornebize nor as root).
    • Reserved a job and made a deployment in Grenoble. Then, could connect to the node (both as tocornebize and as root).
    • Looked at the .ssh directories on Grenoble and Lyon; they look the same.
    • Can ssh from Lyon to Grenoble (or any other site), but cannot ssh from Grenoble (or any other site) to Lyon.
    • Fixed by replacing the .ssh folder on Lyon with the .ssh folder from Grenoble (might have messed up something…).
  3. First capacity planning test   SMPI EXPERIMENTS HPL REPORT
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: 4ff3ccbcbb77e126e454a16dea0535493ff1ff0b
    • Compilation and execution (on nova-6 and nova-8).

      make SMPI_OPTS="-DSMPI_OPTIMIZATION_LEVEL=4 -DSMPI_DGEMM_COEFFICIENT=1.742435e-10
      -DSMPI_DTRSM_COEFFICIENT=8.897459e-11" arch=SMPI
      
      sysctl -w vm.overcommit_memory=1 && sysctl -w vm.max_map_count=40000000
      
      mount none /root/huge -t hugetlbfs -o rw,mode=0777 && echo 1 >> /proc/sys/vm/nr_hugepages
      
      ./run_measures.py --global_csv result_capacity_50000.csv --nb_runs 1 --size 50000 --nb_proc 512 --topo "2;16,32;1,1:16;1,1"
      --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_capacity_100000.csv --nb_runs 1 --size 100000 --nb_proc 512 --topo "2;16,32;1,1:16;1,1"
      --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
    • Results:

      library(ggplot2)
      results <- rbind(read.csv("capacity_planning/result_capacity_50000.csv"), read.csv("capacity_planning/result_capacity_100000.csv"))
      
      ggplot(results, aes(x=nb_roots, y=Gflops, color=size, group=size)) +
          stat_summary(fun.y = mean, geom="line")+
          stat_summary(fun.y = mean, geom="point")+
          expand_limits(x=0, y=0)+
          ggtitle("Gflops estimation for different number of root switches and matrix sizes\nUsing 512 MPI processes")
      

      1.png

    • In this experiment, we use a fat-tree with a total of 512 nodes, each having a single core. We use 512 processes, one per node. We vary the number of up-ports of the L1 switches, and therefore the number of L2 switches.
    • It is strange: there is apparently no impact on the performance of HPL, we get the same performance with only one L2 switch as with 16 L2 switches.
    • Maybe we could try a bigger matrix, to get some network contention? But the experiment might take some time.
    • We could also try a randomly shuffled hostfile, to get a worse mapping and thus more traffic going through the L2 switches.
    • We could also try a “taller” and less “wide” fat-tree: add a third layer of switches, but decrease the number of ports to keep the same number of nodes, e.g. 3;8,8,8;1,8,16;1,1,1 instead of 2;16,32;1,16;1,1 (both have 512 nodes). But it is a bit artificial, such a topology would certainly never occur in “real life”.

1.5.3 2017-06-03 Saturday

  1. New scalability tests   SMPI EXPERIMENTS HPL REPORT
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: 4ff3ccbcbb77e126e454a16dea0535493ff1ff0b
    • Compilation and execution (made on nova-5, nova-11, nova-13, nova-14):

      make SMPI_OPTS="-DSMPI_OPTIMIZATION_LEVEL=4 -DSMPI_DGEMM_COEFFICIENT=1.742435e-10
      -DSMPI_DTRSM_COEFFICIENT=8.897459e-11" arch=SMPI
      
      sysctl -w vm.overcommit_memory=1 && sysctl -w vm.max_map_count=2000000000
      
      mount none /root/huge -t hugetlbfs -o rw,mode=0777 && echo 1 >> /proc/sys/vm/nr_hugepages
      
      ./run_measures.py --global_csv result_size_1000000_4096.csv --nb_runs 1 --size 1000000 --nb_proc 4096 --topo
      "2;16,32;1,16;1,1;8" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_4000000_512.csv --nb_runs 1 --size 4000000 --nb_proc 512 --topo
      "2;16,32;1,16;1,1;8" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_4000000_1024.csv --nb_runs 1 --size 4000000 --nb_proc 1024 --topo
      "2;16,32;1,16;1,1;8" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_2000000_4096.csv --nb_runs 1 --size 2000000 --nb_proc 4096 --topo
      "2;16,32;1,16;1,1;8" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_4000000_2048.csv --nb_runs 1 --size 4000000 --nb_proc 2048 --topo
      "2;16,32;1,16;1,1;8" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      ./run_measures.py --global_csv result_size_4000000_4096.csv --nb_runs 1 --size 4000000 --nb_proc 4096 --topo
      "2;16,32;1,16;1,1;8" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge
      
      
      rbind(
          read.csv('scalability/result_500000_512.csv'),
          read.csv('scalability/result_500000_1024.csv'),
          read.csv('scalability/result_500000_2048.csv'),
          read.csv('scalability/result_500000_4096.csv'),
          read.csv('scalability/result_1000000_4096.csv'),
          read.csv('scalability/result_2000000_4096.csv'),
          read.csv('scalability/result_4000000_512.csv'),
          read.csv('scalability/result_4000000_1024.csv'),
          read.csv('scalability/result_4000000_2048.csv'),
          read.csv('scalability/result_4000000_4096.csv')
      )
      
                   topology nb_roots nb_proc    size  full_time        time Gflops
      1  2;16,32;1,16;1,1;8       16     512  500000    91246.1    91246.02  913.3
      2  2;16,32;1,16;1,1;8       16    1024  500000    46990.1    46990.02 1773.0
      3  2;16,32;1,16;1,1;8       16    2048  500000    24795.5    24795.50 3361.0
      4  2;16,32;1,16;1,1;8       16    4096  500000    13561.0    13561.01 6145.0
      5  2;16,32;1,16;1,1;8       16    4096 1000000    97836.6    97836.54 6814.0
      6  2;16,32;1,16;1,1;8       16    4096 2000000   742691.0   742690.59 7181.0
      7  2;16,32;1,16;1,1;8       16     512 4000000 45305100.0 45305083.56  941.8
      8  2;16,32;1,16;1,1;8       16    1024 4000000 22723800.0 22723820.45 1878.0
      9  2;16,32;1,16;1,1;8       16    2048 4000000 11432900.0 11432938.62 3732.0
      10 2;16,32;1,16;1,1;8       16    4096 4000000  5787160.0  5787164.09 7373.0
         simulation_time application_time user_time system_time major_page_fault
      1          1191.99          204.992   1098.25       93.12                0
      2          2482.28          441.897   2296.51      184.70                0
      3          5091.97          872.425   4741.26      349.79                0
      4         11321.60         1947.320  10640.63      679.53                0
      5         26052.50         4362.660  24082.38     1966.10                0
      6         64856.30        10643.600  59444.40     5402.24                0
      7         17336.50         3030.400  15090.31     1945.23                0
      8         38380.90         6435.870  34249.71     3827.36                0
      9         83535.20        13080.500  75523.95     7684.52                0
      10       169659.00        26745.400 154314.76    15085.08                0
         minor_page_fault cpu_utilization        uss         rss page_table_size
      1            960072            0.99  155148288  2055086080        10604000
      2           1054062            0.99  369696768  4383203328        21240000
      3           1282294            0.99 1012477952  9367576576        42912000
      4           1852119            0.99 3103875072 15318568960        87740000
      5           2768705            0.99 3103895552 16934834176        87748000
      6           4704339            0.99 3102445568 19464646656        87748000
      7           7663911            0.98  151576576  2056916992        10604000
      8           7725625            0.99  369872896  4120702976        21212000
      9           7917525            0.99 1012191232  9221050368        42880000
      10          8550745            0.99 3113381888 20408209408        87808000
         memory_size
      1    282558464
      2    429948928
      3    962826240
      4   2814042112
      5   3425406976
      6   5910134784
      7  13079060480
      8  13275557888
      9  13825183744
      10 15763668992
      
    • Memory measurement failed for the experiments with 4096 nodes: smpimain took too long to start, so run_measures.py could not find its PID at the beginning and assumed the process had already terminated… I really need to find something more robust.
    • For the record, ran this command on the nodes (same command used in the script to estimate the memory consumption):

      python3 -c "import psutil; print(psutil.virtual_memory().available)"
      
    • Result:
      • For size=2000000 and nbproc=4096: 60468817920
      • For size=4000000 and nbproc=1024: 53105373184
      • For size=4000000 and nbproc=2048: 52539293696
      • For size=4000000 and nbproc=4096: 50614239232
    • On a freshly deployed node, the same command returns 66365100032

1.5.4 2017-06-04 Sunday

  1. Investigate capacity planning: small test program   SMPI HPL
    • As mentioned in [2017-06-02 Fri], the duration of HPL does not seem to be impacted by the topology, which is strange.
    • Implemented a small test program, called network_test. It takes as arguments a message size and a number of iterations. Every process sends the given number of messages, each of the given size, to the next process (and thus receives from the previous one). A minimal sketch of such a program is given at the end of this item.
    • Tested with the following topology (only changing the fat-tree description):

      <?xml version='1.0' encoding='ASCII'?>
      <!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
      <!--2-level fat-tree with 16 nodes-->
      <platform version="4">
        <AS id="AS0" routing="Full">
          <cluster id="cluster0" prefix="host-" suffix=".hawaii.edu" radical="0-15" speed="1Gf" bw="10Gbps" lat="2.4E-5s" loopback_bw="5120MiBps" loopback_lat="1.5E-9s" core="1" topology="FAT_TREE" topo_parameters="2;4,4;1,4;1,1"/>
        </AS>
      </platform>
      
    • Results for one iteration:
      • With a size of 200000000 and the fat-tree 2;4,4;1,4;1,1, takes a time of 1.28 seconds.
      • With a size of 200000000 and the fat-tree 2;4,4;1,1;1,1, takes a time of 2.69 seconds.
      • With a size of 200000 and the fat-tree 2;4,4;1,4;1,1, takes a time of 0.0025 seconds.
      • With a size of 200000 and the fat-tree 2;4,4;1,1;1,1, takes a time of 0.0040 seconds.
      • With a size of 2000 and the fat-tree 2;4,4;1,4;1,1, takes a time of 0.0004 seconds.
      • With a size of 2000 and the fat-tree 2;4,4;1,1;1,1, takes a time of 0.0004 seconds.
    • Thus, for large enough messages, the difference is very clear: the topology does have a high impact. For small messages, however, this is not the case.
    • Running several iterations does not seem to change this.
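    • For reference, here is a minimal sketch of such a ring exchange. This is not the actual network_test source (which lives in the scripts repository); in particular, the use of MPI_Sendrecv to avoid ring deadlocks is an assumption.

      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Each process sends `iterations` messages of `size` bytes to the next
       * rank and receives as many from the previous one, in a ring. */
      int main(int argc, char *argv[]) {
          MPI_Init(&argc, &argv);
          int rank, np;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &np);
          int size = atoi(argv[1]);       /* message size in bytes */
          int iterations = atoi(argv[2]); /* number of messages per process */
          char *sendbuf = calloc(size, 1), *recvbuf = malloc(size);
          int next = (rank + 1) % np, prev = (rank + np - 1) % np;
          double start = MPI_Wtime();
          for (int i = 0; i < iterations; i++)
              /* a combined send/receive avoids the deadlock that naive
               * blocking sends would cause on a ring */
              MPI_Sendrecv(sendbuf, size, MPI_CHAR, next, 0,
                           recvbuf, size, MPI_CHAR, prev, 0,
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          if (rank == 0)
              printf("time: %g seconds\n", MPI_Wtime() - start);
          free(sendbuf); free(recvbuf);
          MPI_Finalize();
          return 0;
      }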
  2. TODO Check what the sizes of the messages in HPL are.   SMPI HPL
  3. Investigate capacity planning: odd networks   SMPI HPL
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: 4ff3ccbcbb77e126e454a16dea0535493ff1ff0b
    • Try several topologies for HPL with absurdly good or bad networks (e.g. high/null bandwidth and/or high/null latency).
    • The idea is that if doing so has little impact on performance, then it is hopeless to observe any impact from adding or removing switches.
    • Quick and dirty experiments: do not add any option to the script, just modify the values in topology.py (lines 161-164).
    • Note that in the previous experiments, where nearly no impact was observed, the different values were:

      bw = '10Gbps'
      lat = '2.4E-5s'
      loopback_bw = '5120MiBps'
      loopback_lat = '1.5E-9s'
      
    • Run this command, which outputs the Gflops:

      ./run_measures.py --global_csv /tmp/bla.csv --nb_runs 1 --size 10000 --nb_proc 16 --topo "2;4,4;1,4;1,1" --experiment
      HPL --running_power 6217956542.969 && tail -n 1 /tmp/bla.csv | cut -f10 -d','
      
    • Result with the default network characteristics above: 21.96
    • Results with other characteristics:
      • Very high bandwidth: 22.15

        bw = '1000000Gbps'
        lat = '2.4E-5s'
        loopback_bw = '1000000GBps'
        loopback_lat = '1.5E-9s'
        
      • Very low bandwidth: 1.505

        bw = '10Mbps'
        lat = '2.4E-5s'
        loopback_bw = '10Mbps'
        loopback_lat = '1.5E-9s'
        
      • Low bandwidth: 19.95

        bw = '1Gbps'
        lat = '2.4E-5s'
        loopback_bw = '512MiBps'
        loopback_lat = '1.5E-9s'
        
      • Very low latency: 25.95

        bw = '10Gbps'
        lat = '0s'
        loopback_bw = '5120MiBps'
        loopback_lat = '0s'
        
      • Very high latency: 0.1534

        bw = '10Gbps'
        lat = '2.4E-2s'
        loopback_bw = '5120MiBps'
        loopback_lat = '1.5E-5s'
        
      • High latency: 9.477

        bw = '10Gbps'
        lat = '2.4E-4s'
        loopback_bw = '5120MiBps'
        loopback_lat = '1.5E-8s'
        
    • Improving the network performance has a limited impact. Using a nearly infinite bandwidth increases the Gflops by less than 1%. Using a null latency has more impact, but it is still limited: it increases the Gflops by 18%.
    • Degrading the network performance has more impact. Using a bandwidth 1000 times lower divides the Gflops by 15, but using a bandwidth 10 times lower decreases the Gflops by only 9%. Both the very high latency and the high latency have a great impact.
    • To sum up, the latency seems to have a higher impact on HPL performance than the bandwidth. A back-of-envelope model illustrating why is sketched below.
    • It is not clear whether the contention created by using fewer switches will only decrease the bandwidth, or also increase the latency. It depends on whether the switches have one queue per port or one queue shared by all the ports (in the former case, contention has a much lower impact on the latency than in the latter).
    • Hypothesis: with a one-queue-per-port model, removing switches will not increase the latency too much, and will therefore have a very limited impact on HPL performance.
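    • The following sketch compares the two terms of the first-order transfer-time model T(S) = lat + S/bw (an assumption: it ignores contention, routing and MPI protocol switching). The crossover sits at S = lat × bw ≈ 30 kB, consistent with the network_test observations above: below it the latency dominates, above it the bandwidth does.

      #include <stdio.h>

      /* First-order model T(S) = lat + S/bw for a single message of S bytes,
       * using the default values from topology.py. */
      int main(void) {
          const double lat = 2.4e-5;      /* seconds */
          const double bw  = 10e9 / 8.0;  /* 10Gbps, in bytes per second */
          const double sizes[] = {2e3, 2e5, 2e8};
          for (int i = 0; i < 3; i++)
              printf("S = %.0e bytes: latency term %.1e s, bandwidth term %.1e s\n",
                     sizes[i], lat, sizes[i] / bw);
          return 0;
      }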
  4. TODO Ask what model is used in Simgrid’s switches   SIMGRID
    • Is it one queue per port, or one single queue for all the ports?
  5. More thoughts on capacity planning   SMPI HPL
    • The plot of the Gflops as a function of the bandwidth (resp. inverse of latency) seems to look like the plot of the Gflops as a function of the number of processes or the size. It is a concave function converging to some finite limit.
    • In the settings currently used for HPL, the bandwidth of 10Gbps seems to be already very close to the limit (since using a bandwidth thousands of times larger has little to no impact). This is why slightly decreasing the bandwidth has very little impact. If we want to observe something when removing switches, we should use lower bandwidths.
    • Quick test, using the same command as in the previous section, with these values:

      bw = '10Mbps'
      lat = '2.4E-5s'
      loopback_bw = '5120MiBps'
      loopback_lat = '1.5E-9s'
      
      • With 2;4,4;1,4;1,1, 1.505 Gflops.
      • With 2;4,4;1,1;1,1, 1.025 Gflops.
      • With 2;4,4;1,4;1,1 and a random mapping, 1.268 Gflops.
      • With 2;4,4;1,1;1,1 and a random mapping, 0.6464 Gflops.
    • The hypothesis seems to be confirmed. With a lower bandwidth, a change of bandwidth has much more impact; thus, removing a switch and/or using a random mapping also has much more impact (see the model sketched below).
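    • This concave shape is consistent with a very simple model (a sketch only, assuming a fixed computation time T_comp, a total communication volume C, no overlap, and F floating-point operations in total):

      Gflops(bw) = F / (T_comp + C/bw)

      This function is increasing and concave in bw and converges to F/T_comp. At 10Gbps, HPL sits close to that limit, so adding or removing bandwidth changes almost nothing; at 10Mbps, it sits on the steep part of the curve, where removed switches or bad mappings are clearly visible.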

1.5.5 2017-06-05 Monday

  1. Comparison with real Taurus experiment   SMPI EXPERIMENTS HPL REPORT
    • File hpl_analysis/taurus/real.csv holds real experiment data. It has been created manually, thanks to the energy paper repository.

      library(ggplot2)
      library(reshape2)
      library(gridExtra)
      
      get_results <- function(nb_proc) {
          result <- read.csv(paste('hpl_analysis/taurus/hpl_paper_', nb_proc, '.csv', sep=''))
          result$full_time = max(result$time)
          result$total_energy = sum(result$power_consumption)
      
          result = result[with(result, order(-power_consumption)),] # sort by power consumption
          # keep only the nodes actually used (12 processes per node)
          result$used_energy = sum(head(result, nb_proc/12)$power_consumption)
          result$nb_proc = nb_proc
          return(unique(result[c('nb_proc', 'full_time', 'total_energy', 'used_energy')]))
      }
      simulation_vanilla_results = data.frame()
      #  for(i in (c(1,4,8,12,48,96,144))) {
      for(i in (c(12,48,96,144))) {
        simulation_vanilla_results = rbind(simulation_vanilla_results, get_results(i))
      }
      simulation_vanilla_results$type = 'Vanilla simulation'
      simulation_vanilla_results$time = -1 # do not have it
      simulation_vanilla_results$Gflops = -1 # do not have it
      
      real_results = read.csv('hpl_analysis/taurus/real.csv')
      real_results$type = 'Real execution'
      real_results$used_energy = real_results$used_energy * 1e3 # kJ -> J
      sim_results <- read.csv('hpl_analysis/taurus/hpl2.csv')
      sim_results$type = 'Optimized simulation'
      results = rbind(real_results[c('nb_proc', 'full_time', 'time', 'Gflops', 'used_energy', 'type')],
      		sim_results[c('nb_proc', 'full_time', 'time', 'Gflops', 'used_energy', 'type')],
      		simulation_vanilla_results[c('nb_proc', 'full_time', 'time', 'Gflops', 'used_energy', 'type')])
      results$type <- factor(results$type, levels = c('Optimized simulation', 'Vanilla simulation', 'Real execution'))
      
      p1 = generic_do_plot(ggplot(results, aes(x=nb_proc, y=full_time, color=type, shape=type)), fixed_shape=FALSE) +
      	 xlab("Number of processes")+
      	 ylab("Duration (seconds)")+
      	 scale_shape_manual(values = c(0, 1, 2))+
      	 labs(colour="Experiment type")+
      	 labs(shape="Experiment type")+
      	 ggtitle("HPL duration for different numbers of processes\nMatrix size: 20,000")
      p2 = generic_do_plot(ggplot(results, aes(x=nb_proc, y=used_energy, color=type, shape=type)), fixed_shape=FALSE) +
      	 xlab("Number of processes")+
      	 ylab("Energy consumption (joules)")+
      	 scale_shape_manual(values = c(0, 1, 2))+
      	 labs(colour="Experiment type")+
      	 labs(shape="Experiment type")+
      	 ggtitle("HPL energy consumption for different numbers of processes\nMatrix size: 20,000")
      grid_arrange_shared_legend(p1, p2, nrow=1, ncol=2)
      

      hpl_analysis/taurus/validation.pdf

      tmp_results = results[results$type != "Vanilla simulation",]
      grid_arrange_shared_legend(
          generic_do_plot(ggplot(tmp_results, aes(x=nb_proc, y=time, color=type))) +
            xlab("Number of processes")+
            ylab("Duration (seconds)")+
            labs(colour="Simulated")+
            ggtitle("HPL “short” duration for different numbers of processes\nMatrix size: 20,000"),
          generic_do_plot(ggplot(tmp_results, aes(x=nb_proc, y=Gflops, color=type))) +
            xlab("Number of processes")+
            ylab("Energy consumption (joules)")+
            labs(colour="Simulated")+
            ggtitle("HPL performances for different numbers of processes\nMatrix size: 20,000"),
          nrow=1, ncol=2
      )
      

      hpl_analysis/taurus/validation2.pdf

      library(data.table)
      aggregate_results <- function(results) {
          x = data.table(results)
          x = x[x$nb_proc %in% c(12, 48, 96, 144)]
          x = as.data.frame(x[, list(time=mean(full_time), energy=mean(used_energy)), by=c("nb_proc")])
          return(x[with(x, order(nb_proc)),])
      }
      aggr_real = aggregate_results(real_results)
      aggr_sim = aggregate_results(sim_results)
      aggr_vanilla = aggregate_results(simulation_vanilla_results)
      aggr_sim$time_error = (aggr_sim$time - aggr_real$time)/aggr_real$time * 100
      aggr_sim$energy_error = (aggr_sim$energy - aggr_real$energy)/aggr_real$energy * 100
      aggr_sim$optimized = TRUE
      aggr_vanilla$time_error = (aggr_vanilla$time - aggr_real$time)/aggr_real$time * 100
      aggr_vanilla$energy_error = (aggr_vanilla$energy - aggr_real$energy)/aggr_real$energy * 100
      aggr_vanilla$optimized = FALSE
      aggr_results = rbind(aggr_vanilla, aggr_sim)
      aggr_results$optimized <- factor(aggr_results$optimized, levels = c(TRUE, FALSE))
      
    • Retrieve the three colors used in the previous plots, to reuse the ones corresponding to the vanilla and optimized simulations.

      x = unique(ggplot_build(p1)$data[[1]]$colour)
      x
      colors = x[c(1, 2)]
      colors
      
      [1] "#F8766D" "#00BA38" "#619CFF"
      [1] "#F8766D" "#00BA38"
      
      grid_arrange_shared_legend(
          generic_do_plot(ggplot(aggr_results, aes(x=nb_proc, y=time_error, color=optimized))) +
            geom_hline(yintercept=0) +
            scale_color_manual(values=colors) +
            xlab("Number of processes")+
            ylab("Relative error (percent)")+
            labs(colour="Optimized simulation")+
            ggtitle("Error on the duration prediction")+
            expand_limits(y=15)+
            expand_limits(y=-15),
          generic_do_plot(ggplot(aggr_results, aes(x=nb_proc, y=energy_error, color=optimized))) +
            geom_hline(yintercept=0) +
            scale_color_manual(values=colors) +
            xlab("Number of processes")+
            ylab("Relative error (percent)")+
            labs(colour="Optimized simulation")+
            ggtitle("Error on the energy consumption prediction")+
            expand_limits(y=15)+
            expand_limits(y=-15),
          nrow=1, ncol=2
      )
      

      hpl_analysis/taurus/errors.pdf

    • The plots are funny. The shapes of the error plots for optimized and vanilla look similar, but shifted: they both reach some high errors (~10%), but not for the same numbers of processes. Also, the optimized version is always above 0, while the vanilla one is below 0 for some points.
    • There are some mismatches between time prediction and energy prediction. For instance, optimized has a large error for the time prediction of 144 processes, but nearly no error for the energy prediction. Similarly, vanilla over-estimates the duration for 48 processes but under-estimates the energy consumption, which seems odd.
  2. Plots for scalability test   SMPI EXPERIMENTS HPL REPORT
    library(ggplot2)
    library(ggrepel)
    library(reshape2)
    library(gridExtra)
    results = rbind(
        read.csv('scalability/result_500000_512.csv'),
        read.csv('scalability/result_500000_1024.csv'),
        read.csv('scalability/result_500000_2048.csv'),
        read.csv('scalability/result_500000_4096.csv'),
        read.csv('scalability/result_1000000_512.csv'),
        read.csv('scalability/result_1000000_1024.csv'),
        read.csv('scalability/result_1000000_2048.csv'),
        read.csv('scalability/result_1000000_4096.csv'),
        read.csv('scalability/result_2000000_512.csv'),
        read.csv('scalability/result_2000000_1024.csv'),
        read.csv('scalability/result_2000000_2048.csv'),
        read.csv('scalability/result_2000000_4096.csv'),
        read.csv('scalability/result_4000000_512.csv'),
        read.csv('scalability/result_4000000_1024.csv'),
        read.csv('scalability/result_4000000_2048.csv'),
        read.csv('scalability/result_4000000_4096.csv')
    )
    results$simulation_time = results$simulation_time/3600
    results$memory_size = results$memory_size * 1e-9
    number_verb <- function(n) {
        return(format(n,big.mark=",",scientific=FALSE))
    }
    results$size_verb = factor(unlist(lapply(results$size, number_verb)), levels = c('500,000','1,000,000','2,000,000','4,000,000'))
    results$nb_proc_verb = factor(unlist(lapply(results$nb_proc, number_verb)), levels = c('512', '1,024', '2,048', '4,096'))
    results
    
                 topology nb_roots nb_proc    size  full_time        time Gflops
    1  2;16,32;1,16;1,1;8       16     512  500000    91246.1    91246.02  913.3
    2  2;16,32;1,16;1,1;8       16    1024  500000    46990.1    46990.02 1773.0
    3  2;16,32;1,16;1,1;8       16    2048  500000    24795.5    24795.50 3361.0
    4  2;16,32;1,16;1,1;8       16    4096  500000    13561.0    13561.01 6145.0
    5    2;16,32;1,16;1,1       16     512 1000000   716521.0   716521.00  930.4
    6    2;16,32;1,16;1,1       16    1024 1000000   363201.0   363201.04 1836.0
    7    2;16,32;1,16;1,1       16    2048 1000000   186496.0   186495.70 3575.0
    8  2;16,32;1,16;1,1;8       16    4096 1000000    97836.6    97836.54 6814.0
    9    2;16,32;1,16;1,1       16     512 2000000  5685080.0  5685077.72  938.1
    10   2;16,32;1,16;1,1       16    1024 2000000  2861010.0  2861012.55 1864.0
    11   2;16,32;1,16;1,1       16    2048 2000000  1448900.0  1448899.09 3681.0
    12 2;16,32;1,16;1,1;8       16    4096 2000000   742691.0   742690.59 7181.0
    13 2;16,32;1,16;1,1;8       16     512 4000000 45305100.0 45305083.56  941.8
    14 2;16,32;1,16;1,1;8       16    1024 4000000 22723800.0 22723820.45 1878.0
    15 2;16,32;1,16;1,1;8       16    2048 4000000 11432900.0 11432938.62 3732.0
    16 2;16,32;1,16;1,1;8       16    4096 4000000  5787160.0  5787164.09 7373.0
       simulation_time application_time user_time system_time major_page_fault
    1        0.3311083          204.992   1098.25       93.12                0
    2        0.6895222          441.897   2296.51      184.70                0
    3        1.4144361          872.425   4741.26      349.79                0
    4        3.1448889         1947.320  10640.63      679.53                0
    5        0.7319722          500.970   2367.19      259.91                0
    6        1.6771917         1036.960   5515.36      515.05                0
    7        3.4421944         2092.950  11389.36      995.39                0
    8        7.2368056         4362.660  24082.38     1966.10                0
    9        1.9263500         1169.660   6193.80      683.73                0
    10       4.2217500         2551.100  13714.01     1430.93                0
    11       8.9621111         5236.560  29357.92     2844.89                0
    12      18.0156389        10643.600  59444.40     5402.24                0
    13       4.8156944         3030.400  15090.31     1945.23                0
    14      10.6613611         6435.870  34249.71     3827.36                0
    15      23.2042222        13080.500  75523.95     7684.52                0
    16      47.1275000        26745.400 154314.76    15085.08                0
       minor_page_fault cpu_utilization        uss         rss page_table_size
    1            960072            0.99  155148288  2055086080        10604000
    2           1054062            0.99  369696768  4383203328        21240000
    3           1282294            0.99 1012477952  9367576576        42912000
    4           1852119            0.99 3103875072 15318568960        87740000
    5           1916208            0.99  153665536  2317279232        10600000
    6           2002989            0.99  369676288  4837175296        21252000
    7           2154982            0.99 1010696192  7774138368        42908000
    8           2768705            0.99 3103895552 16934834176        87748000
    9           3801905            0.99  150765568  2758770688        10604000
    10          3872820            0.99  365555712  5273034752        21220000
    11          4038099            0.99 1009606656  7415914496        42884000
    12          4704339            0.99 3102445568 19464646656        87748000
    13          7663911            0.98  151576576  2056916992        10604000
    14          7725625            0.99  369872896  4120702976        21212000
    15          7917525            0.99 1012191232  9221050368        42880000
    16          8550745            0.99 3113381888 20408209408        87808000
       memory_size size_verb nb_proc_verb
    1    0.2825585   500,000          512
    2    0.4299489   500,000        1,024
    3    0.9628262   500,000        2,048
    4    2.8140421   500,000        4,096
    5    0.8944435 1,000,000          512
    6    1.0553098 1,000,000        1,024
    7    1.5811707 1,000,000        2,048
    8    3.4254070 1,000,000        4,096
    9    3.3384202 2,000,000          512
    10   3.4971116 2,000,000        1,024
    11   4.0274084 2,000,000        2,048
    12   5.9101348 2,000,000        4,096
    13  13.0790605 4,000,000          512
    14  13.2755579 4,000,000        1,024
    15  13.8251837 4,000,000        2,048
    16  15.7636690 4,000,000        4,096
    
    size_time = generic_do_plot(ggplot(results, aes(x=size, y=simulation_time, color=nb_proc_verb))) +
        xlab("Matrix size") +
        ylab("Simulation time (hours)") +
        labs(colour="Number of processes")+
        ggtitle("Simulation time for different matrix sizes")+
        theme(legend.position = "none")+
        geom_text_repel(
    	data = subset(results, size == max(size)),
    	aes(label = nb_proc_verb),
    	nudge_x = 45,
    	segment.color = NA,
    	show.legend = FALSE
          )
    size_time
    

    scalability/1.pdf

    nbproc_time = generic_do_plot(ggplot(results, aes(x=nb_proc, y=simulation_time, color=size_verb))) +
        xlab("Number of processes") +
        ylab("Simulation time (hours)") +
        labs(colour="Matrix size")+
        ggtitle("Simulation time for different number of processes")+
        theme(legend.position = "none")+
        geom_text_repel(
    	data = subset(results, nb_proc == max(nb_proc)),
    	aes(label = size_verb),
    	nudge_x = 45,
    	segment.color = NA,
    	show.legend = FALSE
          )
    nbproc_time
    

    scalability/2.pdf

    size_mem = generic_do_plot(ggplot(results, aes(x=size, y=memory_size, color=nb_proc_verb))) +
        xlab("Matrix size") +
        ylab("Memory consumption (gigabytes)") +
        labs(colour="Number of processes")+
        ggtitle("Memory consumption for different matrix sizes")+
        theme(legend.position = "none")+
        geom_text_repel(
    	data = subset(results, size == max(size)),
    	aes(label = nb_proc_verb),
    	nudge_x = 45,
    	segment.color = NA,
    	show.legend = FALSE
          )
    size_mem
    

    scalability/3.pdf

    nbproc_mem = generic_do_plot(ggplot(results, aes(x=nb_proc, y=memory_size, color=size_verb))) +
        xlab("Number of processes") +
        ylab("Memory consumption (gigabytes)") +
        labs(colour="Matrix size")+
        ggtitle("Memory consumption for different number of processes")+
        theme(legend.position = "none")+
        geom_text_repel(
    	data = subset(results, nb_proc == max(nb_proc)),
    	aes(label = size_verb),
    	nudge_x = 45,
    	segment.color = NA,
    	show.legend = FALSE
        )
    nbproc_mem
    

    scalability/4.pdf

    grid_arrange_shared_legend(size_time, size_mem, nrow=1, ncol=2)
    

    scalability/plot_size.pdf

    grid_arrange_shared_legend(nbproc_time, nbproc_mem, nrow=1, ncol=2)
    

    scalability/plot_nbproc.pdf

1.5.6 2017-06-06 Tuesday

  1. Discussion about the report   MEETING REPORT
    1. State of the art
      1. Important features
        • offline vs. online, in particular for HPL (probes for the communication pipeline)
        • if online: language scope, need to modify the code to make it work
        • models: notion of topology and accounting for contention (very important a priori), accounting for the specifics of MPI communications (synchronization semantics, different performance ranges, probably not too serious in the case of HPL), collectives (irrelevant in the case of HPL)
          • the classical model in this context = LogP*, but it accounts for contention poorly (at the node level, and not at all at the topology level)
          • two main approaches: packet level and flow level
        • scaling up: this motivates the use of parallel DES and of somewhat “system-level” MPI application emulation techniques
      2. Projects:
        • Dimemas (Barcelona Supercomputing center), offline (extrae/paraver), “performance debugging” (sensibility analysis, what if, performance prediction)
        • LogGOPSim (Torsten Hoefler), offline (DAG, GOAL), collective algorithms @ scale
        • SST macro, online/offline (DUMPI), MPI only, skeletonization/templating, more robust but more specialized (C++)
        • BigSIM (?), offline, possibly PDES, dead project. Source-to-source transformation for privatization for CHARM++/AMPI
        • xSim, online, PDES with underlying models of more than questionable validity, but scalable; privatization by copying the data segment rather than using mmap
        • CODES, offline, PDES, new kid on the block
    2. Validation and capacity planning
      • For the comparison with a real execution (Taurus), get the data for the real experiment by executing the org-file. This is long (~5 minutes).
      • On capacity planning, it is expected that removing switches has little to no impact: computation is in O(n³) while communication is in O(n²) (and most of the communications are asynchronous, so they happen during computations), so the communication-to-computation ratio shrinks as the matrix grows.
  2. Webinar   MEETING

1.5.7 2017-06-07 Wednesday

  1. DONE Some text is displayed in the pdf but not in the printed version   REPORT
    • It seems that text entered between = signs (translated to \texttt in LaTeX) does not appear in the printed version of the report, although it is displayed correctly in the PDF. Fix this.
    • Reprinted the first page of the same file: it is fixed now. The difference is that this time it was printed over the network rather than from a USB stick plugged into the printer.
  2. Network printer setup   TOOLS
    • Make sure the right package is installed:

      sudo aptitude install cups-browsed
      
    • Add these lines to the file /etc/cups/cups-browsed.conf:

      BrowseRemoteProtocols cups
      BrowsePoll print.imag.fr:631
      
    • Enable the service:

      sudo systemctl enable cups-browsed
      
    • Restart the service:

      sudo service cups-browsed restart
      

1.5.8 2017-06-08 Thursday

  1. Capacity planning: components   SMPI EXPERIMENTS HPL REPORT
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: c2d1d734c80f084157ad70d702e8c669772fb2e4
    • Command (used on nova-21, configured as above experiments):

      bash run_capacity_planning.sh 100000 512
      
      bash run_capacity_planning.sh 50000 512
      
    • Results:

      library(ggplot2)
      library(reshape2)
      library(gridExtra)
      
      get_results <- function(directory, name) {
          result <- read.csv(paste('capacity_planning/', directory, '/', name, '.csv', sep=''))
          result$name = name
          return(result)
      }
      get_all_results <- function(directory) {
          results = data.frame()
          for(type in c('bandwidth', 'latency', 'speed')) {
      	for(subtype in c('high', 'low')) {
      	    name = paste(type, subtype, sep='_')
      	    tmp = get_results(directory, name)
      	    tmp$type = type
      	    if(type == 'latency'){
      		if(subtype == 'high')
      		    tmp$subtype = 'bad'
      		else
      		    tmp$subtype = 'good'
      	    }
      	    else {
      		if(subtype == 'high')
      		    tmp$subtype = 'good'
      		else
      		    tmp$subtype = 'bad'
      	    }
      	    results = rbind(results, tmp)
      	}
      	default = get_results(directory, 'default')
      	default$type = type
      	default$subtype = 'default'
      	results = rbind(results, default)
          }
          return(results[c('size', 'Gflops', 'type', 'subtype')])
      }
      results_1E5 = get_all_results('exp_100000_512')
      results_5E4 = get_all_results('exp_50000_512')
      results_1E5
      results_5E4
      
          size  Gflops      type subtype
      1 100000  710.40 bandwidth    good
      2 100000  702.20 bandwidth     bad
      3 100000  722.70 bandwidth default
      4 100000  349.10   latency     bad
      5 100000  823.70   latency    good
      6 100000  722.70   latency default
      7 100000 3419.00     speed    good
      8 100000   83.94     speed     bad
      9 100000  722.70     speed default
         size  Gflops      type subtype
      1 50000  458.80 bandwidth    good
      2 50000  477.00 bandwidth     bad
      3 50000  475.20 bandwidth default
      4 50000  127.30   latency     bad
      5 50000  697.60   latency    good
      6 50000  475.20   latency default
      7 50000 1346.00     speed    good
      8 50000   71.95     speed     bad
      9 50000  475.20     speed default
      
    do_plot <- function(results, type) {
        tmp = results[results$type == type,]
        title = paste('HPL performance estimation for different components\nMatrix size of',
    	format(unique(results$size),big.mark=",",scientific=FALSE))
        plot = ggplot(results, aes(x=type, y=Gflops, color=subtype, shape=subtype)) +
    	geom_point(size=4, stroke=1) +
    	scale_shape_manual(values = c(0, 1, 2))+
    	theme_bw()+
    	expand_limits(x=0, y=0)+
    	ggtitle(title)+
    	xlab('Component')+
    	ylab('Performance estimation (Gflops)')+
    	labs(colour='Metric')+
    	labs(shape='Metric')
        return(plot)
    }
    
    grid_arrange_shared_legend(
        do_plot(results_5E4, 'bandwidth') + expand_limits(x=0, y=max(results_1E5$Gflops)),
        do_plot(results_1E5, 'bandwidth') + expand_limits(x=0, y=max(results_1E5$Gflops)),
        nrow=1,
        ncol=2
    )
    

    capacity_planning/components_perf.pdf

  2. Capacity planning: topology   SMPI EXPERIMENTS HPL REPORT
    • Simgrid commit: 9a8e2f5bce8c6758d4367d21a66466a497d136fe
    • HPL commit: 41774905395aebcb73650defaa7e2aa462e6e1a3
    • Script commit: c2d1d734c80f084157ad70d702e8c669772fb2e4
    • Four series of experiments:
      • Bandwidth of 10Gbps, sequential mapping of the processes
      • Bandwidth of 10Gbps, random mapping of the processes
      • Bandwidth of 10Mbps, sequential mapping of the processes
      • Bandwidth of 10Mbps, random mapping of the processes
    • For the series with a bandwidth of 10Mbps, the file topology.py has been locally modified to use a bandwidth 1000 times lower:

      176 % git diff                                                        -- INSERT -- 15:42:08
      diff --git a/topology.py b/topology.py
      index 2d7d76c..1a3cd67 100644
      --- a/topology.py
      +++ b/topology.py
      @@ -158,7 +158,7 @@ class FatTree:
           prefix = 'host-'
           suffix = '.hawaii.edu'
           speed = '1Gf'
      -    bw = '10Gbps'
      +    bw = '10Mbps'
           lat = '2.4E-5s'
           loopback_bw = '5120MiBps'
           loopback_lat = '1.5E-9s'
      
    • Command (used on nova-2, nova-8, nova-15 and nova-16 configured as above experiments):
      • For sequential mapping:

        ./run_measures.py --global_csv result_capacity_50000.csv --nb_runs 3 --size 50000 --nb_proc 512 --topo
        "2;16,32;1,1:16;1,1" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge 
        
      • For random mapping:

        ./run_measures.py --global_csv result_capacity_50000.csv --nb_runs 3 --size 50000 --nb_proc 512 --topo
        "2;16,32;1,1:16;1,1" --experiment HPL --running_power 5004882812.500 --hugepage /root/huge --shuffle_hosts
        
    • For the random mapping with 10Mbps bandwidth, more runs have been done (8 instead of 3) to get rid of any bias.
    • Results:

      library(ggplot2)
      results_highbw_sequential <- read.csv("capacity_planning/exp_topo_50000_512/result_capacity_50000.csv")
      results_highbw_random <- read.csv("capacity_planning/exp_topo_50000_512/result_capacity_50000_shuffled.csv")
      results_lowbw_sequential <- read.csv("capacity_planning/exp_topo_50000_512/result_capacity_50000_lowbw.csv")
      results_lowbw_random <- read.csv("capacity_planning/exp_topo_50000_512/result_capacity_50000_lowbw_shuffled.csv")
      results_highbw_sequential$mapping = "Sequential"
      results_highbw_random$mapping = "Random"
      results_lowbw_sequential$mapping = "Sequential"
      results_lowbw_random$mapping = "Random"
      results_highbw = rbind(results_highbw_sequential, results_highbw_random)
      results_highbw$bandwidth = '10Gbps'
      results_lowbw = rbind(results_lowbw_sequential, results_lowbw_random)
      results_lowbw$bandwidth = '10Mbps'
      
      do_plot <- function(results) {
          title = paste('HPL performance estimation for different topologies\nBandwidth of', unique(results$bandwidth))
          plot = generic_do_plot(ggplot(results, aes(x=nb_roots, y=Gflops, color=mapping, shape=mapping)), fixed_shape=FALSE) +
      	ggtitle(title)+
      	xlab('Number of L2 switches')+
      	ylab('Performance estimation (Gflops)')+
      	scale_shape_manual(values = c(1, 2))+
      	labs(colour='Mapping')+
      	labs(shape='Mapping')
          return(plot)
      }
      
      grid_arrange_shared_legend(
          do_plot(results_lowbw),
          do_plot(results_highbw),
          nrow=1, ncol=2
      )
      

      capacity_planning/topology.pdf

    • The results for 10Mbps are somewhat expected. Removing switches degrades the performance, and using a random mapping of the processes makes things even worse. Also, we can observe some performance peaks for 4, 8 and 16 root switches. Maybe this is due to the D mod K algorithm (TODO: check that this is indeed the algorithm used). For instance, 16 divides 512 but 15 does not, so the load of all messages should be spread more uniformly with 16 root switches than with 15 (the sketch below quantifies this).
    • For 10Gbps however, this is stranger. The number of switches has no impact, but this had already been observed in previous experiments (see [2017-06-02 Fri]). What is more surprising is that the random mapping yields better performance than the sequential mapping. Is it a bug?
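    • Quick back-of-envelope check of the D mod K intuition (an assumption: the root switch is chosen as destination mod K). Counting how many of the 512 destinations land on each root switch shows a perfectly balanced load only when K divides 512:

      #include <stdio.h>

      /* For K root switches, count the per-switch load when destination d
       * is routed through root switch d mod K. */
      int main(void) {
          const int nb_proc = 512;
          for (int k = 14; k <= 16; k++) {
              int load[16] = {0};
              for (int d = 0; d < nb_proc; d++)
                  load[d % k]++;
              int min = nb_proc, max = 0;
              for (int s = 0; s < k; s++) {
                  if (load[s] < min) min = load[s];
                  if (load[s] > max) max = load[s];
              }
              printf("K = %d: per-root load between %d and %d\n", k, min, max);
          }
          return 0;
      }

    • Note that the imbalance for K = 15 is mild (34 vs 35 destinations per root), so the uniformity argument alone may not fully explain the peaks.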

      tmp = rbind(results_lowbw, results_highbw)
      tmp$bandwidth <- factor(tmp$bandwidth, levels = c('10Mbps', '10Gbps'))
      generic_do_plot(ggplot(tmp, aes(x=nb_roots, y=simulation_time, color=mapping, shape=mapping, linetype=bandwidth)), fixed_shape=FALSE)+
      	ggtitle('Simulation time for different networks')+
      	xlab('Number of L2 switches')+
      	ylab('Simulation time (seconds)')+
      	scale_shape_manual(values = c(1, 2))+
      	labs(colour='Mapping')+
      	labs(shape='Mapping')+
      	labs(linetype='Bandwidth')
      

      capacity_planning/topology_sim_time.pdf

      results_lowbw$simgrid_time = results_lowbw$simulation_time - results_lowbw$application_time
      generic_do_plot(ggplot(results_lowbw, aes(x=nb_roots, y=simgrid_time, color=mapping)))
      

      2.png

      library(data.table)
      aggregate_results <- function(results) {
          x = data.table(results)
          x = as.data.frame(x[, list(Gflops=mean(Gflops)), by=c("nb_roots")])
          return(x[with(x, order(nb_roots)),])
      }
      aggr_seq = aggregate_results(results_lowbw_sequential)
      aggr_rand = aggregate_results(results_lowbw_random)
      aggr_rand$gflops_ratio = aggr_seq$Gflops / aggr_rand$Gflops
      
      generic_do_plot(ggplot(aggr_rand, aes(x=nb_roots, y=gflops_ratio)))
      

      3.png

    • There are huge differences (a factor of 10) in the simulation time depending on the mapping and the number of root switches. This time is spent in Simgrid. It certainly comes from more complex communication behaviors (congestion) that give much more work to the network part of Simgrid.

1.5.9 2017-06-09 Friday

  1. TODO Work on run_measures.py script [0/5]   PYTHON
    • [ ] Clean the code. In particular, remove the stuff related to the small matrix product test.
    • [ ] Write some unit tests.
    • [ ] Add options, e.g. to set the bandwidth or the latency without modifying the code.
    • [ ] Add flexibility in the way the series of experiments are described. Maybe describe them with Python code in a separate file? Or a JSON file?
    • [ ] Parallelism: allow launching experiments on remote machines via ssh.

1.5.10 2017-06-12 Monday

  1. Add The LINPACK Benchmark: Past, Present and Future   PAPER

    Bibtex: Dongarra03thelinpack

1.5.11 2017-06-14 Wednesday

  1. Add Versatile, Scalable and Accurate Simulation of Distributed Applications and Platforms   PAPER

    Bibtex: casanova:hal-01017319

  2. Add LogP: Towards a Realistic Model of Parallel Computation   PAPER

    Bibtex: Culler1993

  3. Finally found a grammar checker \o/   TOOLS

1.5.12 2017-06-19 Monday

  1. Discussion about the slides   MEETING
    • Large part on the context (~10 min).
      • Top500 supercomputers (including Stampede), with their topologies. Also Piz Daint (dragonfly, in Switzerland). Show the variability of the topologies. Photos of the supercomputers and a diagram of the topology.
      • Routing, workload, process placement.
      • HPL, HPL on Stampede, the fight against n³, and also against p (p·n² in the complexity).
    • No state of the art (or only on-line vs. off-line).
    • Drawings: inkscape or xfig.
    • Contribution: not too long (~7 min).
    • Validation: final goal, compare with Stampede.
    • Perspectives: capacity planning, topology study.

1.5.13 2017-06-20 Tuesday

  1. Tom's practice defense   MEETING
    • Number the slides (1/18)
    • Try to include the general diagram of the modifications
    • Distill some information about the type of gain.

    Slides:

    1. Rank?
      • We cannot speed up processors anymore?
      • Information about the scale, the topology, the diversity

        As an answer to the power and heat challenges, processor manufacturers have increased the number of computing units (or cores) per processor. Modern High Performance Computing (HPC) systems comprise thousands of nodes, each of them holding several multi-core processors. For example, one of the world's fastest computers, the IBM Sequoia system at Lawrence Livermore National Laboratory (USA), contains 96 racks of 98,304 nodes interconnected through a 5-dimensional torus, each node comprising 16 cores, for a total of 1,572,864 cores. The Cray Titan system at Oak Ridge National Laboratory is made of 18,688 AMD Opteron 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs interconnected through a Gemini three-dimensional torus. Another recent Cray machine, Piz Daint at the Swiss National Supercomputing Centre, comprises 5,272 nodes (each with 8 cores and a Nvidia Tesla K20X GPU) interconnected through a custom Aries dragonfly topology. More recently, the Tianhe-2 was built with 32,000 Intel Xeon (12-core) processors and 48,000 Xeon Phi 31S1P accelerators interconnected through a TH-Express fat tree. Finally, the Sunway TaihuLight (Jiangsu, China), which is currently the fastest supercomputer in the world, is made of 40,950 nodes interconnected through a custom five-level hierarchy of cabinets, each node comprising 260 custom RISC cores, for a total of 10,649,600 cores.

    2. HPL:
      • where N is the order of the matrix
      • it works like Gaussian elimination
        • search for the maximum, small factorization, broadcast, update, and repeat
        • in the code this is somewhat interleaved, to properly overlap the computations and the communications
    3. Link with slide 1…
      • Transitions a bit clumsy. Explain that it is a very active field.
    4. SimGrid Simulation of HPC applications
      • Trace.
        • Two problems (the scale needed to obtain the trace, and dynamic applications; make the link with HPL)
        • SimGrid-style emulation: mutual exclusion. Advantage = emulation without modification, but it does not scale. Hybrid approaches are needed.
      • Lots of projects. Mostly offline. SimGrid allows both.
    5. 10:36. Why Stampede? We do not care. We are here to give an order of magnitude.
      • 500 days of computation, without even counting the simulation of the application itself.
      • Laboratory notebook and scripts
      • Modified HPL
      • Modifications to SimGrid
    6. Integrate the "To sum up" into this series of slides.

      • Tdgemm = .4354*M*N*L
      • Tdtrsm = .234234*M*N²

      Gain = ??? Order of magnitude. Illustration on a given configuration (a small one and a large one?).

      • Negligible, but an important gain?
      • Amount of modifications to HPL?
      • At this point we hardly do any computation anymore, but the memory consumption remains high.
      • The application accesses these memory areas from time to time, so we cannot simply remove the allocations…
    7. Panel = information exchanged between the processes during the execution.
    8. 10:44
      • This allocation poses a "problem".
      • HPL modification?
      • Consequences = observation at large scale.
    9. 10:50 A difficult case: the error is mostly on 1 node and decreases afterwards; systematic under-estimation

      • Small-scale experiment
      • Systematic under-estimation
      • Factor of 2 on the outliers?

      Optimistic after dgemm.

    10. Conclusion
      • Light modifications to HPL
      • New features in SimGrid
      • Demonstrated that we can simulate at this scale while taking into account the fine characteristics of the topology, the placement, …
    11. Add capacity planning?
  2. Last remarks from Arnaud   MEETING
    • Various functions → swap, max, …
    • Simulation of HPC applications → talk about Simgrid
    • Slide 7: write that it is very optimistic (replace ≈ with ≥)
    • Slide 18: add a word on the failure and energy aspects
    • Slide 16 → systematically
    • Slide 1: add name, rank, number of nodes and cores, topology

1.5.14 2017-06-21 Wednesday

  1. Tom's practice defense V2   MEETING

    Intro: large-scale MPI simulation… and capacity planning?

    1. Explain the algorithm before the animation
      • mention the overlapping
    2. There are questions… There are several levers one can pull to make things go faster.
      • There are "recipes". People say "I want this", but that is their experience/opinion, and the supporting arguments are limited.
      • adaptive applications. This is actually the case of HPL.
      • advantages/drawbacks of the emulation approach?
      • several optimizations (some quite natural, others less obvious)
      • why not simply remove the mallocs?
    3. So, now we have removed all the computations and all the allocations; almost only the control flow remains. And yet, at large scale, it still does not go through.
      • You see, the quadratic effect in N and in P is still there, and that is what was hard.
      • Explain the curve! It is very small.
      • No outliers. Rather say: no variability, and therefore no outliers. Problem: this has consequences because of the synchronizations.
      • Optimistic model (no variability injected, perfect bandwidth sharing)

1.5.15 2017-06-23 Friday

  1. Trying to understand the low CPU utilization for large allocations   C EXPERIMENTS
    • According to Olivier, the low CPU utilization when doing large allocations (without huge pages) is not expected. Let’s investigate.
    • Script commit: 80c6cd6f0853821a08da3994ce89572c9996b5ea
    • Command (the size corresponds to an allocation of a matrix of size at most 600,000):

      ./cpu_utilization.py 8 2880000000000 /tmp/cpu_exp.csv
      
    • Analysis:

      library(ggplot2)
      results <- read.csv('cpu_utilization/cpu_exp.csv')
      
      ggplot(results, aes(x=size, y=cpu_utilization)) +
          geom_point() + geom_line()
      
    • So we reproduce this behavior outside of HPL and Simgrid.
  2. DONE Draw a flame graph with this small program and a large allocation.
  3. Flame graph for the CPU utilization   C EXPERIMENTS
    • Script commit: 80c6cd6f0853821a08da3994ce89572c9996b5ea
    • Command (the size corresponds to an allocation of a matrix of size 600,000):

      sudo perf record -F1000 --call-graph dwarf ./page_faults 1 2880000000000 1
      
      sudo perf script | ~/Documents/FlameGraph/stackcollapse-perf.pl --kernel | ~/Documents/FlameGraph/flamegraph.pl > /tmp/flame_2880000000000.svg
      
    • Kernel version:

      uname -r
      
    • Result: the flame graph /tmp/flame_2880000000000.svg (SVG not rendered here).
    • This flame graph is very interesting, although incomplete. First, note that the function main accounts for less than 40% of the samples, which is approximately equal to the CPU utilization. It means that this approach also captures what is done when the process is not executed.
    • Most of the time spent in the function main is spent in a function do_page_fault.
    • The remaining 60% of the whole execution time is spent in two functions: one unknown, and one called native_irq_return_iret.
    • It is also strange to see the very large function page_faults located below main (and _start) rather than beside them, although these functions are (a priori) not called by page_faults. Maybe a perf bug?
  4. TODO Next steps in the investigation of low CPU utilization [4/7]
    • [X] Plot the CPU utilization for different number of calls to memset (including 0).
    • [ ] Draw the flame graph with more calls to memset.
    • [ ] Draw the flame graph with no call to memset.
    • [X] Try other flags for the mmap, try adding the flag MAP_POPULATE.
    • [X] Try with another kernel version.
    • [X] Try with huge pages, to see the difference.
    • [ ] Speak with someone (Olivier? Samuel? Vincent? Stack Overflow?).

1.5.16 2017-06-24 Saturday

  1. Small test: several calls to memset   C EXPERIMENTS
    • Script commit: b8a110e9a57c821b37a3843738b97bc0affb52f6
    • No call to memset:

      /usr/bin/time ./page_faults 1 2880000000000 0
      
      2.00202
      0.04user 1.95system 0:02.00elapsed 99%CPU (0avgtext+0avgdata 5108maxresident)k
      0inputs+4096outputs (0major+521minor)pagefaults 0swaps
      
    • One call to memset:

      /usr/bin/time ./page_faults 1 2880000000000 1
      
      2013.29
      158.71user 604.73system 33:33.29elapsed 37%CPU (0avgtext+0avgdata 2812501956maxresident)k
      0inputs+102400outputs (0major+703125270minor)pagefaults 0swaps
      
    • Ten calls to memset:

      /usr/bin/time ./page_faults 1 2880000000000 10
      
      23344.3
      1622.97user 5224.14system 6:29:04elapsed 29%CPU (0avgtext+0avgdata 2812502520maxresident)k
      0inputs+958464outputs (0major+7031250411minor)pagefaults 0swaps
      
    • No call to memset, but using the flag MAP_POPULATE:

      /usr/bin/time ./page_faults 1 2880000000000 0
      
      136.016
      0.04user 103.22system 2:16.01elapsed 75%CPU (0avgtext+0avgdata 2812501680maxresident)k
      0inputs+4096outputs (0major+43946592minor)pagefaults 0swaps
      
    • When no accesses are made and the flag MAP_POPULATE is not used, the execution is very fast: there are nearly no page faults and the CPU utilization is high.
    • With one access, we get the very low CPU utilization and the very large time.
    • With ten accesses, the CPU utilization is even lower, the number of page faults is ten times higher, and both the user time and the system time are also about ten times higher. This is very strange.
    • With no access but with the flag MAP_POPULATE, the time and the number of page faults are much larger, but still about ten times lower than with one access and no MAP_POPULATE. A minimal sketch of this kind of test program is given below.
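    • For reference, a minimal sketch of this kind of test (an assumption: the actual page_faults.c is in the scripts repository and also prints the elapsed time; the meaning of its first argument is not reproduced here):

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>

      /* Usage: ./page_faults <unused> <size_in_bytes> <nb_memset>
       * mmap an anonymous region and walk it entirely with memset. */
      int main(int argc, char *argv[]) {
          if (argc < 4) return 1;
          size_t size = strtoull(argv[2], NULL, 10);
          int accesses = atoi(argv[3]);
          int flags = MAP_PRIVATE | MAP_ANONYMOUS;
      #ifdef HUGEPAGE
          flags |= MAP_HUGETLB;   /* assumption on how -DHUGEPAGE takes effect */
      #endif
          /* flags |= MAP_POPULATE;  uncomment to pre-fault all pages at mmap time */
          char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
          if (buf == MAP_FAILED) { perror("mmap"); return 1; }
          for (int i = 0; i < accesses; i++)
              memset(buf, i, size);  /* each call touches every page once */
          munmap(buf, size);
          return 0;
      }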

1.5.17 2017-06-25 Sunday

  1. More experiments about the low CPU utilization for large allocations   C R EXPERIMENTS
    • Script commit: b8a110e9a57c821b37a3843738b97bc0affb52f6, modified to have between 0 and 3 calls to memset.
    • With these results, the flag MAP_POPULATE is not used when calling mmap.
    • Command (the size corresponds to an allocation of a matrix of size at most 300,000):

      ./cpu_utilization.py 100 720000000000 cpu_exp2.csv
      
    • Run on nova-17 with kernel 4.9.0-2-amd64.
    • Analysis:

      library(gridExtra)
      library(ggplot2)
      results <- read.csv('cpu_utilization/cpu_exp2.csv')
      
      p1 = ggplot(results, aes(x=size, y=cpu_utilization, color=factor(mem_access))) +
          geom_point() + geom_line()
      p2 = ggplot(results, aes(x=size, y=total_time, color=factor(mem_access))) +
          geom_point() + geom_line()
      grid.arrange(p1, p2, ncol=2)
      
      p1 = ggplot(results, aes(x=size, y=user_time, color=factor(mem_access))) +
          geom_point() + geom_line()
      p2 = ggplot(results, aes(x=size, y=system_time, color=factor(mem_access))) +
          geom_point() + geom_line()
      grid.arrange(p1, p2, ncol=2)
      
      p1 = ggplot(results, aes(x=size, y=memory_size, color=factor(mem_access))) +
          geom_point() + geom_line()
      p2 = ggplot(results, aes(x=size, y=nb_page_faults, color=factor(mem_access))) +
          geom_point() + geom_line()
      grid.arrange(p1, p2, ncol=2)
      
    • Finally, the number of accesses to the buffer does not seem to impact the CPU utilization. The difference observed on [2017-06-24 Sat] was probably only noise.
    • The user time seems to be proportional to both the allocation size and the number of calls to memset. This is expected.
    • The system time also seems to be proportional to them. The impact of the allocation size is expected, but the impact of the number of accesses is not trivial. It seems to come from the number of page faults (which is also proportional to both the allocation size and the number of accesses). But the plot of the number of page faults is hard to understand: why would more accesses cause more page faults, when the page table is already initialized?
    • Another strange thing is that the memory consumption is lower with only one access than with two or three. They should all have the same page table size and thus the same memory consumption.

1.5.18 2017-06-26 Monday

  1. Small test: several calls to memset with huge pages   C EXPERIMENTS
    • Script commit: 005461dad4c06a2e2463d54eec228e65c07b1015
    • Compilation:

      gcc -DHUGEPAGE -std=gnu11 -ggdb3 -O3 -o page_faults page_faults.c -Wall
      
    • So, this is the same experiment as on [2017-06-24 Sat], except that huge pages and the MAP_POPULATE flag are used.
    • No call to memset, but using the flag MAP_POPULATE:

      3.34278
      0.04user 3.29system 0:03.34elapsed 99%CPU (0avgtext+0avgdata 1476maxresident)k
      0inputs+0outputs (0major+65minor)pagefaults 0swaps
      

      Much lower number of page faults and system time. Higher CPU utilization.

    • One call to memset:

      /usr/bin/time ./page_faults 1 2880000000000 1
      
      102.2
      98.77user 3.26system 1:42.20elapsed 99%CPU (0avgtext+0avgdata 1492maxresident)k
      0inputs+0outputs (0major+67minor)pagefaults 0swaps
      

      In comparison with the case where no huge pages are used, the number of page faults and the time are much lower. Also, the system time and the number of page faults are the same as in the previous test, where no memset was done; only the user time increased. It is strange that the number of page faults is so low: with such an allocation size, we have about 1.3M huge pages (2,880,000,000,000 B / 2 MiB ≈ 1,373,291).

    • Ten calls to memset:

      /usr/bin/time ./page_faults 1 2880000000000 10
      
      988.682
      984.74user 3.45system 16:28.68elapsed 99%CPU (0avgtext+0avgdata 1488maxresident)k
      0inputs+0outputs (0major+66minor)pagefaults 0swaps
      

      Same system time and number of page faults as with only one call to memset; only the user time increases. This is the expected behavior.

    • Let’s try without MAP_POPULATE flag.
    • One call to memset:

      /usr/bin/time ./page_faults 1 2880000000000 1
      
      102.302
      99.10user 3.18system 1:42.30elapsed 99%CPU (0avgtext+0avgdata 1520maxresident)k
      0inputs+0outputs (0major+1373356minor)pagefaults 0swaps
      

      The number of page faults is now as expected, but this did not change the system time.

    • Ten calls to memset:

      /usr/bin/time ./page_faults 1 2880000000000 10
      
      1001.42
      997.40user 3.30system 16:41.41elapsed 99%CPU (0avgtext+0avgdata 1572maxresident)k
      0inputs+0outputs (0major+1373359minor)pagefaults 0swaps
      

      We observe the same behavior as with the flag MAP_POPULATE: going from 1 call to memset to 10 does not impact the number of page faults or the system time, it only changes the user time.

    1. Conclusion

      Using classical pages or huge pages does not only change the page size (and thus the page table size): it actually changes the behavior of the OS. With classical pages, the system time and the number of page faults are proportional to both the allocation size and the number of accesses, whereas with huge pages they are only proportional to the allocation size.

  2. Flame graph for the CPU utilization   C EXPERIMENTS
    • Script commit: 005461dad4c06a2e2463d54eec228e65c07b1015 (the file has been modified to remove the flag MAP_POPULATE).
    • Command (the size correspond to an allocation of a matrix of size 600,000):

      sudo perf record -F1000 --call-graph dwarf ./page_faults 1 2880000000000 1
      
      sudo perf script | ~/Documents/FlameGraph/stackcollapse-perf.pl --kernel | ~/Documents/FlameGraph/flamegraph.pl > /tmp/flame_2880000000000_hugepage.svg
      
    • Kernel version:

      uname -r
      
    • Result: the flame graph /tmp/flame_2880000000000_hugepage.svg (SVG not rendered here).
    • This flame graph is hard to relate to the previous results.
    • We saw a high CPU utilization (99%) and that most of the time was spent in user mode. But the graph shows that a very large part of the time is spent in some other function, outside of the program scope. My guess would be that such a function should not be counted in the program execution time, and that we should therefore have observed a very low CPU utilization.
  3. Segmented regression   R
    • Wikipedia page
    • Example on StackExchange
    • Let’s try with dummy data.

      NB = 100
      A1 = 2 # coeff for first part
      A2 = 1 # coeff for second part
      B1 = 0 # intercept for first part
      B2 = 100 # intercept for second part
      df = data.frame(n=1:NB)
      df$n = sample(500, size=NB, replace=TRUE)
      df$noise = sample(20, size=NB, replace=TRUE)-10
      my_func <- function(n, noise) {
          if(n < 100) {
      	return(A1*n+B1 + noise)
          }
          else {
      	return(A2*n+B2 + noise)
          }
      }
      df$fn = mapply(my_func, df$n, df$noise)
      
      library(ggplot2)
      ggplot(df, aes(x=n, y=fn)) + geom_point()
      
    • The two modes are clearly visible, let’s try some regressions.

      library(segmented)
      lm = segmented(lm(fn~n, data=df), seg.Z = ~ n)
      summary(lm)
      
      
              ***Regression Model with Segmented Relationship(s)***
      
      Call: 
      segmented.lm(obj = lm(fn ~ n, data = df), seg.Z = ~n)
      
      Estimated Break-Point(s):
         Est. St.Err 
      99.197  3.361 
      
      Meaningful coefficients of the linear terms:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  1.22041    4.02077   0.304    0.762    
      n            1.99373    0.06389  31.208   <2e-16 ***
      U1.n        -0.98928    0.06420 -15.409       NA    
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      Residual standard error: 6.183 on 96 degrees of freedom
      Multiple R-Squared: 0.9985,  Adjusted R-squared: 0.9985 
      
      Convergence attained in 6 iterations with relative change 2.230614e-15
      
      plot(lm)
      
    • Need to check, but it seems that:
      • It expects the underlying “function” to be “continuous”, which is not the case for what we have with dgemm on Stampede. If there is a discontinuity at the break point, the estimation fails.
      • The intercept value is B1.
      • The n coefficient is A1.
      • The U1.n coefficient is A2-A1.

1.5.19 2017-06-27 Tuesday

  1. Keep trying the segmented regression   R
    • Using code from stackoverflow
    • Asked a question on stackoverflow.
    • Let’s try with dummy data.

      NB = 100
      A1 = 2 # coeff for first part
      A2 = 1 # coeff for second part
      B1 = 0 # intercept for first part
      B2 = 300 # intercept for second part
      df = data.frame(n=1:NB)
      df$n = sample(500, size=NB, replace=TRUE)
      df$noise = sample(20, size=NB, replace=TRUE)-10
      my_func <- function(n, noise) {
          if(n < 100) {
      	return(A1*n+B1 + noise)
          }
          else {
      	return(A2*n+B2 + noise)
          }
      }
      df$fn = mapply(my_func, df$n, df$noise)
      
      library(ggplot2)
      ggplot(df, aes(x=n, y=fn)) + geom_point()
      
    • First, using segmented package.

      library(segmented)
      model_segmented = segmented(lm(fn~n, data=df), seg.Z = ~ n)
      summary(model_segmented)
      
      
              ***Regression Model with Segmented Relationship(s)***
      
      Call: 
      segmented.lm(obj = lm(fn ~ n, data = df), seg.Z = ~n)
      
      Estimated Break-Point(s):
          Est.  St.Err 
      136.566   5.677 
      
      Meaningful coefficients of the linear terms:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) -61.0463    11.7827  -5.181 1.22e-06 ***
      n             3.6374     0.1534  23.706  < 2e-16 ***
      U1.n         -2.6332     0.1593 -16.525       NA    
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      Residual standard error: 33.92 on 96 degrees of freedom
      Multiple R-Squared: 0.9804,  Adjusted R-squared: 0.9798 
      
      Convergence attained in 4 iterations with relative change -7.90412e-16
      
      predict_segmented = data.frame(n = df$n, fn = broken.line(model_segmented)$fit)
      ggplot(df, aes(x = n, y = fn)) +
      geom_point() + geom_line(data = predict_segmented, color = 'blue')
      
    • Then, doing the segmentation by hand.

      Break<-sort(unique(df$n))
      Break<-Break[2:(length(Break)-1)]
      d<-numeric(length(Break))
      for (i in 1:length(Break)) {
          model_manual<-lm(fn~(n<Break[i])*n + (n>=Break[i])*n, data=df)
          d[i]<-summary(model_manual)[[6]]
      }
      plot(d)
      
      # Smallest breakpoint
      breakpoint = Break[which.min(d)]
      breakpoint
      df$group = df$n >= breakpoint
      model_manual<-lm(fn~n*group, data=df)
      summary(model_manual)
      
      [1] 100
      
      Call:
      lm(formula = fn ~ n * group, data = df)
      
      Residuals:
          Min      1Q  Median      3Q     Max 
      -9.6223 -5.0330 -0.5436  4.7791 10.4031 
      
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
      (Intercept)   1.02021    2.39788   0.425    0.671    
      n             1.98517    0.04128  48.090   <2e-16 ***
      groupTRUE   300.21629    3.07455  97.646   <2e-16 ***
      n:groupTRUE  -0.98826    0.04174 -23.678   <2e-16 ***
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      Residual standard error: 5.984 on 96 degrees of freedom
      Multiple R-squared:  0.9994,	Adjusted R-squared:  0.9994 
      F-statistic: 5.248e+04 on 3 and 96 DF,  p-value: < 2.2e-16
      
      dat_pred = data.frame(n = df$n, fn = predict(model_manual, df))
      ggplot(df, aes(x = n, y = fn)) +
          geom_point() +
          geom_line(data=dat_pred[dat_pred$n < breakpoint,], color = 'blue')+
          geom_line(data=dat_pred[dat_pred$n >= breakpoint,], color = 'blue')
      
    • The segmented package fails when the data is discontinuous.
    • The dirty method works great.
