I am trying to figure out why a modified C program is running faster than its unmodified counterpart (I am adding very few lines of code to perform some additional work).
You seem to think that the cache-misses event is the sum of all the other kinds of cache misses (L1-dcache-load-misses, and so on). That is not the case.
The cache-misses event counts the memory accesses that could not be served by any of the caches (on most CPUs this means last-level cache misses).
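To make the distinction concrete, here is a toy model of a two-level hierarchy. It is only a sketch: the struct, the function names and the four-access trace are hypothetical, and real hardware (prefetchers, more levels, vendor-specific event definitions) is considerably messier.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters for a toy two-level hierarchy (not perf's API). */
struct toy_counters {
    unsigned long l1_misses;   /* accesses not found in L1 */
    unsigned long llc_misses;  /* accesses not found in any level */
};

/* Record one access, given whether each level holds the requested line. */
static void toy_access(struct toy_counters *c, bool in_l1, bool in_llc)
{
    if (in_l1)
        return;            /* served by L1: no miss anywhere */
    c->l1_misses++;        /* missed L1 ... */
    if (!in_llc)
        c->llc_misses++;   /* ... and also missed the last level */
}

/* Replay a small made-up trace and return the resulting counters. */
static struct toy_counters toy_run(void)
{
    struct toy_counters c = {0, 0};
    toy_access(&c, true, true);    /* L1 hit */
    toy_access(&c, false, true);   /* L1 miss, served by the last level */
    toy_access(&c, false, true);   /* L1 miss, served by the last level */
    toy_access(&c, false, false);  /* misses every level, goes to memory */
    return c;
}
```

With this trace, toy_run() reports three L1 misses but a single last-level miss: an access that misses L1 and hits a lower level is counted by the L1 event only, so the two numbers are not expected to add up.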
I admit that perf's documentation is not the best around.
However, you can learn quite a lot by reading the documentation of the perf_event_open() function, assuming you already have a good knowledge of how a CPU and a performance monitoring unit work (this is clearly not a computer architecture course):
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
For example, by reading it you can see that the cache-misses event shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.
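If you want to count that event yourself, a minimal Linux-only sketch along the lines of the perf_event_open() documentation could look like this. The helper names cache_miss_attr and open_cache_miss_counter are mine, and the choice to exclude kernel and hypervisor activity is an assumption you may want to change.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill in an attr describing the generic hardware cache-misses event. */
static void cache_miss_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CACHE_MISSES;
    attr->disabled = 1;        /* start stopped; enable it around the code of interest */
    attr->exclude_kernel = 1;  /* assumption: user-space misses only */
    attr->exclude_hv = 1;
}

/* glibc provides no wrapper for perf_event_open, so use syscall(2).
 * Returns a file descriptor, or -1 with errno set (for example when
 * perf_event_paranoid forbids access or the PMU is unavailable). */
static int open_cache_miss_counter(void)
{
    struct perf_event_attr attr;
    cache_miss_attr(&attr);
    /* pid = 0, cpu = -1: the calling thread on any CPU; no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}
```

Once you have the descriptor, you would typically call ioctl() with PERF_EVENT_IOC_RESET and PERF_EVENT_IOC_ENABLE before the code you want to measure, PERF_EVENT_IOC_DISABLE after it, and then read() a 64-bit count from the descriptor. Always check the return value, since the call fails on systems where hardware counters are not accessible.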