I am trying to figure out why a modified C program runs faster than its unmodified counterpart (I added only a few lines of code that perform some additional work).
Some answers:
L1 is the Level-1 cache, the smallest and fastest one. LLC, on the other hand, refers to the last level of the cache hierarchy, and thus denotes the largest but slowest cache.
i vs. d distinguishes the instruction cache from the data cache. On most processors only L1 is split in this way; the other caches are shared between data and instructions.
TLB refers to the translation lookaside buffer, a cache used when mapping virtual addresses to physical ones.
You seem to think that the cache-misses event is the sum of all the other kinds of cache misses (L1-dcache-load-misses, and so on). That is actually not true.
The cache-misses event represents the number of memory accesses that could not be served by any of the caches.
I admit that perf's documentation is not the best around.
However, you can learn quite a lot about it by reading the documentation of the perf_event_open() function, assuming you already have a good understanding of how a CPU and a performance monitoring unit work (it is clearly not a computer architecture course):
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
For example, by reading it you can see that the cache-misses event shown by perf list corresponds to PERF_COUNT_HW_CACHE_MISSES.
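To make that mapping concrete, here is a minimal sketch of counting the generic cache-misses event directly through perf_event_open(), following the pattern in its man page. The helper names cache_miss_counter_start and cache_miss_counter_stop are made up for this illustration, and the choice to exclude kernel and hypervisor activity is an assumption, not something perf requires. The call can fail (for instance inside a VM or with a restrictive perf_event_paranoid setting), so the sketch reports failure instead of assuming a counter is available.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc ships no wrapper for this system call, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper: start counting the generic cache-misses event for the
 * calling thread on any CPU. Returns a file descriptor, or -1 on failure. */
static int cache_miss_counter_start(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* what perf list calls cache-misses */
    attr.disabled = 1;                        /* enabled explicitly below */
    attr.exclude_kernel = 1;                  /* assumption: user space only */
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    return fd;
}

/* Hypothetical helper: stop the counter, store its value in *count and close
 * the descriptor. Returns 0 on success, -1 if the value could not be read. */
static int cache_miss_counter_stop(int fd, uint64_t *count)
{
    assert(count != NULL);

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

To use it, call cache_miss_counter_start() just before the code you want to measure and cache_miss_counter_stop() right after it. Running both the modified and the unmodified program this way gives numbers you can compare with what perf stat reports.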
According to the perf tutorial, Performance Monitoring Unit (PMU) events, or hardware events, are those that map directly to CPU-specific events defined by the CPU vendor. Hardware cache events, by contrast, are event monikers provided by perf that may be mapped to actual events provided by the CPU. To list perf's cache events, run perf list cache in a Linux terminal.
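As a rough illustration of how such a moniker is expressed at the perf_event_open() level: the man page says the config of a PERF_TYPE_HW_CACHE event combines a cache id, an operation id and a result id. The sketch below builds the value that, as far as I can tell, corresponds to L1-dcache-load-misses. The helper names hw_cache_config and init_l1d_load_miss_attr are invented for this example, and whether a given CPU actually supports the event is a separate matter.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a hardware cache event as described in
 * perf_event_open(2): cache id | (operation id << 8) | (result id << 16). */
static uint64_t hw_cache_config(uint64_t cache_id, uint64_t op_id,
                                uint64_t result_id)
{
    return cache_id | (op_id << 8) | (result_id << 16);
}

/* Hypothetical helper: fill attr for read misses in the L1 data cache, which
 * is assumed here to be the event perf list shows as L1-dcache-load-misses. */
static void init_l1d_load_miss_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HW_CACHE;
    attr->size = sizeof(*attr);
    attr->config = hw_cache_config(PERF_COUNT_HW_CACHE_L1D,
                                   PERF_COUNT_HW_CACHE_OP_READ,
                                   PERF_COUNT_HW_CACHE_RESULT_MISS);
    attr->disabled = 1;       /* the caller enables the counter later */
    attr->exclude_kernel = 1; /* assumption: user space only */
}
```

Swapping PERF_COUNT_HW_CACHE_L1D for PERF_COUNT_HW_CACHE_LL or PERF_COUNT_HW_CACHE_DTLB would select the last-level cache or the data TLB instead. The kernel then decides whether the combination maps to a real event on the CPU, and perf_event_open() fails if it does not.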