# HDF5 checkpoint corruption on Vulcan and Hypatia
An issue with the HDF5 library has been reported that corrupts checkpoint files when a job is interrupted. On Vulcan and Hypatia, jobs are not preempted (there is no such policy yet), so the risk is low; maintenance windows, however, do cause jobs to be held, with the same effect.
Since neither the appearance nor the implications of this (mis)behaviour are fully understood yet, let's collect some data points here.
* It all started with a request by Michael on 2019-06-06:
The lalinference MCMC code is currently affected by checkpointing problems that could stem from changes in the HDF5 library. This is a problem on CIT and other LDG clusters, AFAIK, and I’m not sure whether it affects vulcan / hypatia, but to avoid having LIGO analyses crash under us and resorting to rescue procedures that may not always work reliably, can I ask what the preemption policy is? If the MCMC runs were not preempted then we can avoid the issue altogether (modulo cluster maintenance or hardware issue) while a better solution is being found. Obviously this needs to be better understood and fixed.
* First occurrence (?) of the issue, reported by Carl Haster and communicated by Michael:
“I'm having problems with lalinference_mcmc runs that are done on CIT with recent builds of LALSuite/master (from April 3). All the samples to the higher-T chains are written as they should, but the 0-temperature samples (i.e. the ones written to the group that is called posterior_samples) are not written to file…”
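- To check whether a checkpoint file actually contains the cold-chain samples, its groups can be listed with the h5ls tool from hdf5-tools; the file name below is just a placeholder:

      # Recursively list all groups/datasets in the checkpoint and
      # look for the posterior_samples group:
      h5ls -r lalinference_checkpoint.hdf5 | grep posterior_samples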
- Ian suggests setting
      export HDF5_USE_FILE_LOCKING=FALSE
  and claims:
We had similar HDF5 issues on CIT, which were completely solved by setting this; from some discussion with John and Vivien it was felt that this might avoid the problem here as well…. That said, we have never observed similar problems on Vulcan/Hypatia, but John believes there is no downside to using this setting (given what the codes are doing).
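- For batch jobs, Ian's suggested variable has to end up in the job's environment. A minimal sketch of a wrapper script (script name and arguments are only illustrative):

      #!/bin/bash
      # Disable HDF5 file locking before the analysis starts; the variable
      # is honoured by recent HDF5 1.10.x releases and ignored otherwise.
      export HDF5_USE_FILE_LOCKING=FALSE
      exec lalinference_mcmc "$@"

  For HTCondor jobs, the same setting should also work via the submit description, e.g. environment = "HDF5_USE_FILE_LOCKING=FALSE".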
- Abhirup has seen the issue, though:
I have encountered this problem on Hypatia and have been trying to debug it, but without success so far.
- Roberto hasn't found this issue on Vulcan.
- Michael:
This discussion on Stack Overflow seems relevant: https://stackoverflow.com/questions/47696840/error-opening-a-file-with-h5fopen-f-in-fortran-mpi-hdf5-1-10-1. Here is another link: http://hdf-forum.184993.n3.nabble.com/HDF5-files-on-NFS-td4029577.html. For a deeper understanding, one would need to read the HDF5 source code.
- Steffen checked the installed HDF5 package versions:
- Vulcan (Jessie):
      ii  hdf5-helpers               1.8.13+docs-15+deb8u1  amd64  Hierarchical Data Format 5 (HDF5) - Helper tools
      ii  lalsuite-extra-hdf5-links  1.3.0-5.1+deb8u0       all    LIGO Algorithm Library Extra Data - hdf5 links
      ii  libhdf5-8:amd64            1.8.13+docs-15+deb8u1  amd64  Hierarchical Data Format 5 (HDF5) - runtime files - serial version
      ii  libhdf5-cpp-8:amd64        1.8.13+docs-15+deb8u1  amd64  Hierarchical Data Format 5 (HDF5) - C++ libraries
      ii  libhdf5-dev                1.8.13+docs-15+deb8u1  amd64  Hierarchical Data Format 5 (HDF5) - development files - serial version
      ii  python-h5py                2.5.0-3                amd64  general-purpose Python interface to hdf5 (Python 2)
      ii  python3-h5py               2.5.0-3                amd64  general-purpose Python interface to hdf5 (Python 3)
- Hypatia (Stretch, installed on Mar 27, no upgrades to libhdf5-100 since):
      ii  hdf5-helpers           1.10.0-patch1+docs-3+deb9u1  amd64  Hierarchical Data Format 5 (HDF5) - Helper tools
      ii  hdf5-tools             1.10.0-patch1+docs-3+deb9u1  amd64  Hierarchical Data Format 5 (HDF5) - Runtime tools
      ii  libhdf5-100:amd64      1.10.0-patch1+docs-3+deb9u1  amd64  Hierarchical Data Format 5 (HDF5) - runtime files - serial version
      ii  libhdf5-cpp-100:amd64  1.10.0-patch1+docs-3+deb9u1  amd64  Hierarchical Data Format 5 (HDF5) - C++ libraries
      ii  libhdf5-dev            1.10.0-patch1+docs-3+deb9u1  amd64  Hierarchical Data Format 5 (HDF5) - development files - serial version
      ii  python-h5py            2.7.0-1                      amd64  general-purpose Python interface to hdf5 (Python 2)
      ii  python3-h5py           2.7.0-1                      amd64  general-purpose Python interface to hdf5 (Python 3)
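- For reference, such a listing can be regenerated on any node with dpkg (presumably how the above was produced):

      # Show installed HDF5-related Debian packages and their versions:
      dpkg -l | grep -i -E 'hdf5|h5py'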
- The Debian bug page doesn't seem to contain anything matching: https://bugs.debian.org/cgi-bin/pkgreport.cgi?archive=both;src=hdf5
- Is this in any way related to the type of filesystem, on the client side or the server side?
  - on the compute nodes, /home is mounted via NFS, /work via BeeGFS
  - on the server side, the underlying filesystems differ: before June 4, /home was XFS and is now ZFS; /work uses EXT4 for metadata and ZFS for storage
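  - the client-side view can be double-checked directly on a compute node; a minimal sketch:

        # Show mount point, source and filesystem type for both paths:
        findmnt -T /home -o TARGET,SOURCE,FSTYPE
        findmnt -T /work -o TARGET,SOURCE,FSTYPE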