OpenMP causes heisenbug segfault


Question


I'm trying to parallelize a pretty massive for-loop in OpenMP. About 20% of the time it runs through fine, but the rest of the time it crashes with various segfaults such as:

*** glibc detected *** ./execute: double free or corruption (!prev): <address> ***

*** glibc detected *** ./execute: free(): invalid next size (fast): <address> ***

[2] <PID> segmentation fault ./execute

My general code structure is as follows:

<declare and initialize shared variables here>
#pragma omp parallel private(list of private variables which are initialized in for loop) shared(much shorter list of shared variables)
{
   #pragma omp for
   for (index = 0 ; index < end ; index++) {

     // Lots of functionality (science!)
     // Calls to other deep functions which manipulate private variables
     // Finally generates some calculated_values

     shared_array1[index] = calculated_value1;
     shared_array2[index] = calculated_value2;
     shared_array3[index] = calculated_value3;

   } // end for
} // end parallel region

// final tidy up

} // end of enclosing function

In terms of what's going on, each loop iteration is completely independent of every other iteration, apart from the fact that they pull data from shared matrices (but read different columns on each iteration). Where I call other functions they only change private variables (although they occasionally read shared variables), so I'd assume they're thread safe, since they're only touching data local to a specific thread. The only writes to shared variables happen right at the end, where various calculated values are written to shared arrays whose elements are indexed by the for-loop index. This code is in C++, although the code it calls is both C and C++.
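For illustration, here is a stripped-down, compilable sketch of what that pattern boils down to (the compute_value* functions and the matrix/array names are placeholders, not my real code):

// Sketch of the pattern: iterations are independent and only write to their
// own index of the shared output arrays. All names here (compute_value1/2,
// shared_matrix, shared_array1/2) are invented placeholders.
#include <cmath>
#include <vector>

double compute_value1(const std::vector<double>& column) {
    double sum = 0.0;                       // automatic local: private to each thread
    for (double x : column) sum += x;
    return sum;
}

double compute_value2(const std::vector<double>& column) {
    double sum_sq = 0.0;
    for (double x : column) sum_sq += x * x;
    return std::sqrt(sum_sq);
}

int main() {
    const int end = 1000;
    // shared, read-only input: each iteration reads a different "column"
    std::vector<std::vector<double>> shared_matrix(end, std::vector<double>(64, 1.0));
    // shared outputs, indexed by the loop variable so no element is written twice
    std::vector<double> shared_array1(end), shared_array2(end);

    #pragma omp parallel for
    for (int index = 0; index < end; index++) {
        double calculated_value1 = compute_value1(shared_matrix[index]);
        double calculated_value2 = compute_value2(shared_matrix[index]);

        shared_array1[index] = calculated_value1;
        shared_array2[index] = calculated_value2;
    }
    return 0;
}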

I've been trying to identify the source of the problem, but no luck so far. If I set num_threads(1) it runs fine, as it does if I enclose the entire contents of the for-loop in a single critical section:

#pragma omp for
for(index = 0 ; index < end ; index++) { 
  #pragma omp critical(whole_loop)
  { 
      // loop body
  }
}

which presumably gives the same effect (i.e. only one thread can pass through the loop at any one time).

If, on the other hand, I enclose the for-loop's contents in two critical directives e.g.

#pragma omp for
for(index = 0 ; index < end ; index++) { 
  #pragma omp critical(whole_loop)
  { 
      // first half of loop body
  }

 #pragma omp critical(whole_loop2)
  { 
      // second half of loop body
  }

}

I get the unpredictable segfaulting. Similarly, if I enclose EVERY function call in a critical directive it still doesn't work.

The reason I think the problem may be linked to a function call is that when I profile with Valgrind's DRD tool (using valgrind --tool=drd --check-stack-var=yes --read-var-info=yes ./execute), as well as SIGSEGVing I get an insane number of load and store errors, such as:

Conflicting load by thread 2 at <address> size <number>
   at <address> : function which is ultimately called from within my for loop

Which, according to the Valgrind manual, is exactly what you'd expect with race conditions. Certainly this kind of weirdly appearing/disappearing issue is consistent with the non-deterministic errors race conditions would give, but I don't understand how that can be when every call that reports an apparent race condition is inside a critical section.

Things which could be wrong but I don't think are include:

  • All private() variables are initialized inside the for-loops (because they're thread local).

  • I've checked that shared variables have the same memory address while private variables have different memory addresses.

  • I'm not sure synchronization would help, but given that critical directives imply a flush on entry and exit, and that I've tried versions of my code where every function call is enclosed in a (uniquely named) critical section, I think we can rule that out.

Any thoughts on how to best proceed would be hugely appreciated. I've been banging my head against this all day. Obviously I'm not looking for a, "Oh - here's the problem" type answer, but more how best to proceed in terms of debugging/deconstructing.

Things which could be an issue, or might be helpful:

  • There are some std::vectors in the code which use push_back() to add elements. I remember reading that resizing vectors isn't thread safe, but these vectors are only ever private variables, so not shared between threads. I figured this would be OK? (See the sketch after this list.)

  • If I enclose the entire for-loop body in a critical directive and slowly shrink back the end of the code block (so an ever-growing region at the end of the for-loop is outside the critical section), it runs fine until I expose one of the function calls, at which point the segfaulting resumes. Analyzing this binary with Valgrind shows race conditions in many other function calls, not just the one I exposed.

  • One of the function calls is to a GSL function, which doesn't trigger any race conditions according to Valgrind.

  • Do I need to go and explicitly define private and shared variables in the functions being called? If so, this seems somewhat limiting for OpenMP - would this not mean you need to have OpenMP compatibility for any legacy code you call?

  • Is parallelizing a big for-loop just not something that works?

  • If you've read this far, thank you and Godspeed.
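To make the vector and legacy-function points above concrete, here is a small compilable sketch (legacy_helper, scratch and results are invented names): a std::vector declared inside the loop body is a distinct object per iteration, so push_back on it involves no sharing, and a called function whose variables are all automatic locals is thread safe without any OpenMP clauses, because each thread has its own stack.

// Sketch, assuming the called code keeps all its state in automatic locals.
#include <vector>

double legacy_helper(double x) {
    double tmp = x * 2.0;      // automatic local: one copy per call, per thread
    return tmp + 1.0;
}

int main() {
    const int end = 100;
    std::vector<double> results(end);   // shared output, one slot per iteration

    #pragma omp parallel for
    for (int index = 0; index < end; index++) {
        std::vector<double> scratch;    // private: constructed afresh in each iteration
        for (int k = 0; k < 10; k++)
            scratch.push_back(legacy_helper(index + k));  // no other thread sees 'scratch'
        results[index] = scratch.back();
    }
    return 0;
}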


Answer 1:


So there was no way anyone could have answered this, but having figured it out I hope it helps someone, given how bizarre my system's behaviour was.

One of the (C) functions I was ultimately calling (my_function->intermediate_function->lower_function->BAD_FUNCTION) declared a number of its variables as static, which meant they kept the same memory address across calls and so were essentially acting as shared variables. Interestingly, static storage trumps OpenMP's data sharing: a static local in a called function is shared between all threads regardless of how the caller's variables are scoped.
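To illustrate the failure mode (with invented names; bad_function and workspace stand in for the real culprit), a static local behaves like this:

// A static local has exactly ONE copy for the whole program, so every thread
// shares it even though everything in the calling loop is private.
#include <cstdio>

double bad_function(double x) {
    static double workspace;     // static storage: one copy shared by all threads
    workspace = x * 2.0;         // thread A can overwrite this...
    return workspace + 1.0;      // ...before thread B reads it back: a data race
}

double fixed_function(double x) {
    double workspace = x * 2.0;  // automatic local: each thread gets its own copy
    return workspace + 1.0;
}

int main() {
    const int end = 1000;
    double out[1000];

    #pragma omp parallel for
    for (int index = 0; index < end; index++) {
        out[index] = bad_function(index);   // racy; swap in fixed_function to fix it
    }
    std::printf("%f\n", out[0]);
    return 0;
}

If the static state really does need to persist between calls, OpenMP's threadprivate directive (placed after the static declaration) gives each thread its own copy; otherwise making the variable an automatic local, as in fixed_function above, is the simplest fix.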

I discovered all this by:

  • Using Valgrind to identify where errors were happening, and looking at the specific variables involved.

  • Defining the entire for-loop as a critical section and then exposing more code at the top and bottom.

  • Talking to my boss. More sets of eyes always help, not least because you're forced to verbalize the problem (which ended up with me opening the culprit function and pointing at the declarations).



Source: https://stackoverflow.com/questions/10729732/openmp-causes-heisenbug-segfault
