I've tested the RAM using memtest86+ and found no errors. The rig uses a 700 W Corsair PSU, which /should/ provide enough power, and I see no evidence of overheating based on a cron job which runs every 2 minutes. Anyway, the first step in troubleshooting is finding a way of reproducing the error reliably, and prime95 is what the Windows overclockers use for stress testing.
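For reference, a temperature check run from cron could look roughly like the sketch below. It is not the actual job used on this rig: the sysfs sensor path, the log location and the script name are all assumptions, and many systems expose temperatures through lm-sensors instead.

```shell
#!/bin/sh
# Sketch of a temperature logger intended for cron, e.g. every 2 minutes:
#   */2 * * * * $HOME/bin/log_temp.sh      (hypothetical script path)
# SENSOR and LOG are assumptions, not taken from the original setup.
SENSOR="${SENSOR:-/sys/class/thermal/thermal_zone0/temp}"
LOG="${LOG:-$HOME/temp.log}"

# Convert a non-negative millidegree reading (e.g. 45500) to degrees Celsius
# with one decimal place (45.5), truncating rather than rounding.
millideg_to_c() {
    printf '%d.%d\n' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 100 ))"
}

# Append one timestamped reading to the log, but only if the sensor exists.
if [ -r "$SENSOR" ]; then
    printf '%s %s C\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(millideg_to_c "$(cat "$SENSOR")")" >> "$LOG"
fi
```

Logging to a file rather than printing means the readings survive a hard lock-up, so the last line before a crash shows how hot things were.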
Turns out prime95 (actually the GIMPS client, called mprime on Linux) can run in a few different modes which test different aspects of your system, which makes it sound like a pretty good program for my purposes.
See here for more information: http://www.mersenne.org/freesoft/
mkdir ~/tmp/mprime -p
cd ~/tmp/mprime
wget http://www.mersenne.info/gimps/p95v279.linux64.tar.gz
tar xvf p95v279.linux64.tar.gz
./mprime

And so on.

Welcome to GIMPS, the hunt for huge prime numbers.
You will be asked a few simple questions and then the program will
contact the primenet server to get some work for your computer. Good luck!

Attention OVERCLOCKERS!! Mprime has gained a reputation as a useful
stress testing tool for people that enjoy pushing their hardware to the
limit. You are more than welcome to use this software for that purpose.
Please select the stress testing choice below to avoid interfering with
the PrimeNet server. Use the Options/Torture Test menu choice for your
stress tests. Also, read the stress.txt file.

If you want to both join GIMPS and run stress tests, then Join GIMPS and
answer the questions. After the server gets some work for you, stop
mprime, then run mprime -m and choose Options/Torture Test.

Join Gimps? (Y=Yes, N=Just stress testing) (Y): N
Number of torture test threads to run (3): 2
Choose a type of torture test to run.
1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much).
2 = In-place large FFTs (maximum heat and power consumption, some RAM tested).
3 = Blend (tests some of everything, lots of RAM tested).
11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the
small FFT test then your problem is likely bad memory or a bad memory controller.
Type of torture test to run (3): 1
Accept the answers above? (Y): Y
[Main thread Sep 20 11:06] Starting workers.
[Worker #1 Sep 20 11:06] Worker starting
[Worker #1 Sep 20 11:06] Setting affinity to run worker on any logical CPU.
[Worker #2 Sep 20 11:06] Worker starting
[Worker #2 Sep 20 11:06] Setting affinity to run worker on any logical CPU.
[Worker #1 Sep 20 11:06] Beginning a continuous self-test to check your computer.
[Worker #1 Sep 20 11:06] Please read stress.txt. Hit ^C to end this test.
[Worker #2 Sep 20 11:06] Beginning a continuous self-test to check your computer.
[Worker #2 Sep 20 11:06] Please read stress.txt. Hit ^C to end this test.
[Worker #1 Sep 20 11:06] Test 1, 180000 Lucas-Lehmer iterations of M580673 using AMD K10 type-1 FFT length 28K, Pass1=112, Pass2=256.
[Worker #2 Sep 20 11:06] Test 1, 180000 Lucas-Lehmer iterations of M580673 using AMD K10 type-1 FFT length 28K, Pass1=112, Pass2=256.
CTRL+C
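For longer runs it can be handy to save the output to a file and check it afterwards instead of watching the terminal. The sketch below assumes the transcript was saved to a file (torture.log is a hypothetical name) and that failures show up as lines containing "FATAL ERROR" or "Hardware failure". Those patterns are an assumption about mprime's wording and should be checked against stress.txt.

```shell
# Print the number of lines in a saved torture-test log that look like failures.
# The patterns are assumptions about mprime's messages, not a documented list.
count_failures() {
    # grep -c prints the count; it exits non-zero when nothing matches,
    # which is not an error here, hence the "|| true".
    grep -c -i -E 'FATAL ERROR|Hardware failure' "$1" || true
}

# Possible usage (kept as comments so that nothing runs by accident):
#   ./mprime 2>&1 | tee torture.log     # answer the prompts as above, ^C to stop
#   count_failures torture.log          # 0 means no suspicious lines were found
```

A count of zero only says that none of the assumed patterns appeared. It does not prove the run was clean, so the log is still worth a skim.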