A plugin implementing a wrapper around the User-Level Failure-Mitigation (ULFM) feature of the upcoming MPI 4 standard. This plugin and the accompanying example are tested with OpenMPI 5.0.2.
|
| UserLevelFailureMitigation () |
| Default constructor; sets the error handler of MPI_COMM_WORLD (!) to MPI_ERRORS_RETURN. Although the standard allows setting the error handler for only a specific communicator, neither MPICH nor OpenMPI currently (March 2024) supports this.
|
|
void | revoke () |
| Revokes the current communicator.
|
|
uint32_t | ack_failed (uint32_t const num_to_ack) |
| Acknowledges that the application intends to ignore the effect of currently known failures on wildcard receive completions and agreement return values.
|
|
uint32_t | num_ack_failed () |
| Gets the number of acknowledged failures.
|
|
uint32_t | ack_all_failed () |
| Acknowledges all failures.
|
|
Comm | shrink () |
| Creates a new communicator from this communicator, excluding the failed processes.
|
|
int | agree (int flag) |
| Agrees on a flag from all live processes and distributes the result back to all live processes, even after process failures.
|
|
bool | agree (bool flag) |
| Agrees on a boolean flag from all live processes and distributes the result back to all live processes, even after process failures.
|
|
Group | get_failed () |
| Obtains the group of currently failed processes.
|
|
bool | is_revoked () |
| Checks if this communicator has been revoked.
|
|
void | mpi_error_handler (int const ret, std::string const &callee) const |
| Overrides the MPI error handler so that it throws appropriate exceptions when hardware faults have occurred.
|
|
template<typename Comm, template<typename...> typename DefaultContainerType>
class kamping::plugin::UserLevelFailureMitigation< Comm, DefaultContainerType >
A plugin implementing a wrapper around the User-Level Failure-Mitigation (ULFM) feature of the upcoming MPI 4 standard. This plugin and the accompanying example are tested with OpenMPI 5.0.2.