Thursday, 15 August 2013

Can Ubuntu "Sometimes" Fail to Generate Core Files?


I've configured an Amazon EC2 instance to generate core files when
processes crash, and for the most part it works as expected. The problem
is that it doesn't always work. The program I'm having trouble with
consists of 9 concurrent processes working in concert via MPI. When this
program crashes, I almost always get a core dump, but in rare cases no
core dump is generated, even though a segmentation fault (signal 11) is
reported in the logs that capture stderr. In other, very rare, cases the
resulting core file is truncated.

I have not configured my core pattern, so only one core file (named
"core") can exist in the directory my process is launched from. Further
details are below my question.

How can a core dump fail to be generated only "sometimes"? Is it possible
that two processes are attempting to dump a core file at once, and both
fail because they conflict? Are core dumps just not a reliable method of
tracing bugs?
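
The file-name concern in the question can at least be inspected on the machine. On Linux the core file name comes from /proc/sys/kernel/core_pattern, and when /proc/sys/kernel/core_uses_pid is non-zero the kernel appends the PID to a pattern that has no %p. The following is a minimal sketch in C, not a diagnosis: the helper names read_setting, pattern_has_pid and cores_may_collide are hypothetical, and whether simultaneous dumps to the same name really explain the missing cores is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a /proc setting into buf, without the newline.
 * Returns 0 on success, -1 if the file cannot be opened. */
int read_setting(const char *path, char *buf, size_t n)
{
    assert(buf != NULL && n > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)n, f) == NULL)
        buf[0] = '\0';              /* empty file: report an empty string */
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline, if any */
    return 0;
}

/* Return 1 if a core_pattern string contains the %p (PID) specifier.
 * "%%" is a literal percent sign, so "%%p" does not count. */
int pattern_has_pid(const char *pattern)
{
    for (const char *c = pattern; *c != '\0'; c++) {
        if (*c != '%')
            continue;
        c++;                        /* look at the specifier character */
        if (*c == '\0')
            break;                  /* a lone trailing '%' */
        if (*c == 'p')
            return 1;
    }
    return 0;
}

/* Return 1 if every crashing process in one directory would be asked to
 * write the same file name, given the two settings as strings. */
int cores_may_collide(const char *pattern, const char *uses_pid)
{
    if (pattern[0] == '|')
        return 0;   /* piped to a helper program; the kernel picks no name */
    if (pattern_has_pid(pattern))
        return 0;   /* each process gets its own name */
    return strcmp(uses_pid, "0") == 0;
}
```

A caller would pass the two /proc paths above to read_setting and feed the results to cores_may_collide. With the pattern "core" and core_uses_pid set to 0, the function returns 1, which matches the "only one core" situation described in the question. Setting a pattern such as core.%e.%p (this requires root) gives each process its own file.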

.bash_profile:

    export LD_LIBRARY_PATH=/usr/local/lib
    source ./.bashrc
    ulimit -c unlimited

/etc/security/limits.conf:

    * soft core unlimited
    root hard core unlimited

ulimit -a:

    core file size          (blocks, -c) unlimited
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 29879
    max locked memory       (kbytes, -l) 64
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 29879
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
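
The ulimit output above describes the shell it was run in. .bash_profile is read only by login shells, so a process started by an MPI launcher may not inherit the same core limit; that this happens here is an assumption, not something the post confirms. A minimal sketch in C of checking and raising the limit from inside the program itself, using getrlimit and setrlimit with RLIMIT_CORE (the function names raise_core_limit and describe_core_limit are hypothetical):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft core-size limit to the hard limit for this process.
 * Returns 0 on success, -1 on failure. */
int raise_core_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;      /* soft limit may go up to the hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    return 0;
}

/* Write the current soft core limit into buf as text, for logging.
 * RLIMIT_CORE is expressed in bytes. Returns 0 on success, -1 on failure. */
int describe_core_limit(char *buf, size_t n)
{
    struct rlimit rl;
    assert(buf != NULL && n > 0);
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        snprintf(buf, n, "unlimited");
    else
        snprintf(buf, n, "%llu bytes", (unsigned long long)rl.rlim_cur);
    return 0;
}
```

Calling raise_core_limit early in each process and logging the result of describe_core_limit to stderr would show, per process, whether the limit seen at crash time matches what the shell reports. If the hard limit is itself 0, raising the soft limit cannot help, and the logged value will say so.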
EDIT1
I've found a bug that reliably fails to produce a core file. To force a
core dump whenever the program enters a suspicious state, I inserted the
following lines in several places:

    if (value > FLT_MAX) {
        int *i = NULL;
        *i = 1;
    }

About half of my processes reach one of these lines and segfault, probably
within a few milliseconds of each other, since they follow almost identical
code paths. I don't simply call raise(SIGSEGV) because I've seen my program
swallow that signal and continue running; perhaps because a raised signal
doesn't technically require the process to quit?
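
One possible explanation for raise(SIGSEGV) being swallowed, offered as an assumption rather than a finding: if something in the process has installed a SIGSEGV handler or set the signal to be ignored, a raised SIGSEGV runs that disposition and execution continues after raise() returns. A minimal sketch in C that restores the default action before raising, so the process terminates with SIGSEGV and is eligible for a core dump. All function names are hypothetical; run_in_child exists only to demonstrate the difference, and it uses SIG_IGN as a stand-in for whatever swallows the signal in the real program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Terminate the calling process with SIGSEGV even if the signal currently
 * has a handler, is ignored, or is blocked in the calling thread. */
void force_core_dump(void)
{
    struct sigaction sa;
    sigset_t set;

    sa.sa_handler = SIG_DFL;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);          /* restore the default action */

    sigemptyset(&set);
    sigaddset(&set, SIGSEGV);
    sigprocmask(SIG_UNBLOCK, &set, NULL);   /* make sure it can be delivered */

    raise(SIGSEGV);
    abort();                                /* fallback if raise() returns */
}

/* The approach from the post, for comparison. */
void plain_raise(void)
{
    raise(SIGSEGV);
}

/* Demo helper: run fn in a child that ignores SIGSEGV. Returns the signal
 * that killed the child, 0 if it exited normally, or -1 on error. */
int run_in_child(void (*fn)(void))
{
    int status;
    pid_t pid;

    assert(fn != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);   /* demo only: write no core file */
        signal(SIGSEGV, SIG_IGN);           /* stand-in for swallowing code */
        fn();
        _exit(0);                           /* reached only if fn returned */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

In this sketch, run_in_child(plain_raise) returns 0 because the ignored signal lets the child exit normally, while run_in_child(force_core_dump) returns SIGSEGV. In the real program only force_core_dump would be needed, called in place of the null-pointer write, with the core limit left enabled.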
