Date: Thu, 23 Nov 2017 01:05:58 +0100
From: "Jeroen" <spam@...lab.nl>
To: <john-users@...ts.openwall.com>
Subject: Re: OpenMPI and .rec files?

magnum wrote:
<SNAP>
>This sounds like either a bug or PEBCAK but it may well be a bug - I'm pretty
>sure I have never tested that many nodes at once.
>
>> Same result for OpenMPI tasks with (more OR less than 640) AND more
>> than 100 subtasks.
>>
>> Is all the resume data in 100 recovery files, don't matter the number
>> of tasks or is there something going wrong?
>
>You should get one session file per node. What "exact" command line did you
>use to start the job?

For example, when submitted with prun (a control framework for cluster job management):

prun -np 18 -32 -t 5:00 -script openmpi-config /home/john/run/test hashes

where -np 18 is the number of hosts, -32 means 32 processes per host, and openmpi-config is a basic bash script that loads OpenMPI (gcc, 64-bit) on the workers.

The number of jobs started is, as mentioned before, OK, and the benchmark (--test) also works fine (... (640xMPI) DONE). The number of .rec files never exceeds 100.
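
A rough sketch of the check described above: count the .rec files in the run directory and compare against the expected number of processes. It assumes every MPI process counts as one "node" with its own session file, and that the .rec files sit directly in the run directory; the function names are illustrative and not part of John.

```python
from pathlib import Path


def count_rec_files(run_dir):
    """Count files ending in .rec directly inside run_dir (non-recursive)."""
    return sum(
        1 for p in Path(run_dir).iterdir()
        if p.is_file() and p.suffix == ".rec"
    )


def check_session_files(run_dir, hosts, procs_per_host):
    """Return (expected, found, missing) session-file counts.

    Assumption: one session file per MPI process, so the expected
    count is hosts * procs_per_host.
    """
    expected = hosts * procs_per_host
    found = count_rec_files(run_dir)
    return expected, found, expected - found


if __name__ == "__main__":
    # Values taken from the prun example: 18 hosts, 32 processes each.
    expected, found, missing = check_session_files(".", hosts=18, procs_per_host=32)
    print(f"expected {expected} session files, found {found}, missing {missing}")
```

Run from the directory holding the session files; a non-zero "missing" value corresponds to the symptom reported here.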
 
>Are all nodes running in the same $JOHN directory, eg.
>using NFS?

Yes.

>What happens if you try to resume such a session? It should fail and complain
>about missing files unless the bug is deeper than I can imagine.

It resumes like any other normal job, with no complaints as far as I can see.

Please let me know if you need specific debug info.


Thanks,

Jeroen


