People who deal with NVIDIA GPUs on Linux for computation: do you have a favorite stress test program that will reliably make marginal GPUs fail (e.g., drop off the PCIe bus)? I thought I had found one, but then it failed to consistently reproduce our problem. (People here can frequently reproduce it, but only with opaque SLURM jobs that we can't really pick up and use for diagnostics, plus graduate students are busy people.)
