Best way of “looping over a 2-D array”, using Repa

后端未结

关注

 3  978

I find the array library Repa for Haskell very interesting, and wanted to make a simple program, to try to understand how to use it. I also made a simple implementation usin

相关标签:

3条回答

既然无缘

2020-12-23 12:25
Code Review Notes
- 47.8% of your time is spent in GC.
- 1.5G is allocated on the heap (!)
- The repa code looks a lot more complicated than the list code.
- Lots of parallel GC is occuring
- I can get up to 300% efficiency on a -N4 machine
- Putting in more type signatures will make it easier to analyze...
- rsize isn't used (looks expensive!)
- You convert repa arrays to vectors, why?
- All your uses of (**) could be replaced by the cheaper (^) on Int.
- There's a suspicious number of large, constant lists. Those all have to be converted to arrays -- that seems expensive.
- any (==True) is the same as or
Time profiling
```
COST CENTRE                    MODULE               %time %alloc

squared_diff                   Main                  25.0   27.3
insideParticle                 Main                  13.8   15.3
sum_squared_diff               Main                   9.8    5.6
rcoords                        Main                   7.4    5.6
particle_extended              Main                   6.8    9.0
particle_slice                 Main                   5.0    7.6
insideParticles                Main                   5.0    4.4
yslice                         Main                   3.6    3.0
xslice                         Main                   3.0    3.0
ssd_vec                        Main                   2.8    2.1
**^                            Main                   2.6    1.4
```
shows that, your function squared_diff is a bit suspicious:
```
squared_diff :: Array DIM2 Double
squared_diff = deepSeqArrays [rcoords,particle_extended]
                    ((force2 rcoords) -^ (force2 particle_extended)) **^ 2    
```
though I don't see any obvious fix.

Space profiling

Nothing too amazing in the space profile: you clearly see the list phase, then the vector phase. The list phase allocates a lot, which gets reclaimed.

Breaking down the heap by type, we see initially a lot of lists and tuples being allocated (on demand), then a big chunk of arrays are allocated and held:

Again, kinda what we expected to see... the array stuff isn't allocating especially more than the list code (in fact, a bit less overall), but it is just taking a lot longer to run.

Checking for space leaks with retainer profiling:

There's a few interesting things there, but nothing startling. zcoords gets retained for the length of the list program execution, then some arrays (SYSTEM) are being allocated for the repa run.

Inspecting the Core

So at this point I'm firstly assuming that you really did implement the same algorithms in lists and arrays (i.e. no extra work is being done in the array case), and there's no obvious space leak. So my suspicion is badly-optimized repa code. Let's look at the core (with ghc-core.
- The list-based code looks fine.
- The array code looks reasonable (i.e. unboxed primitives appear), but very complex, and a lot of it.
Inlining all the CAFs

I added inline pragmas to all the top level array definitions, in a hope to remove some of the CAfs, and get GHC to optimize the array code a bit harder. This really made GHC struggle to compile the module (allocating up to 4.3G and 10 minutes while working on it). This is a clue to me that GHC wasn't able to optimize this program well before, since there's new stuff for it to do when I increase the thresholds.

Actions
- Using -H can decrease the time spent in GC.
- Try to eliminate the conversions from lists to repas to vectors.
- All those CAFs (top level constant data structures) are kinda weird -- a real program wouldn't be a list of top level constants -- in fact, this module is pathologically so, causing lots of values to be retained over long periods, instead of being optimized away. Float local definitions inwards.
- Ask for help from Ben Lippmeier, the author of Repa, particularly since there's some funky optimization stuff happening.
0 讨论(0)
发布评论:

提交评论
- 加载中...
星月不相逢

2020-12-23 12:31

I've added some advice about how to optimise Repa programs to the Haskell wiki: http://www.haskell.org/haskellwiki/Numeric_Haskell:_A_Repa_Tutorial#Optimising_Repa_programs

0 讨论(0)
发布评论:

提交评论
- 加载中...
终归单人心

2020-12-23 12:34
I changed the code to force rcoords and particle_extended, and disovered we were losing the lion's share of time within them directly:
```
COST CENTRE                    MODULE               %time %alloc

rcoords                        Main                  32.6   34.4
particle_extended              Main                  21.5   27.2
**^                            Main                   9.8   12.7
```
The biggest single improvement to this code would clearly be to generate those two constant inputs in a better fashion.

Note that this is basically a lazy, streaming algorithm, and where you're losing time is the sunk cost of allocating at least two 24361803-element arrays all in one go, and then probably allocating at least once or twice more or giving up sharing. The very best case for this code, I think, with a very good optimizer and a zillion rewrite rules, will be to roughly match the list version (which can also parallelize very easily).

I think dons is right that Ben & co. will be interested in this benchmark, but my overwhelming suspicion is that this is not a good use case for a strict array library, and my suspicion is that matlab is hiding some clever optimizations behind its ngrid function (optimizations, I'll grant, which it might be useful to port to repa).]

Edit:

Here's a quick and dirty way to parallelize the list code. Import Control.Parallel.Strategies and then write numberInsideParticles as:
```
numberInsideParticles particles coords = length $ filter id $ 
    withStrategy (parListChunk 2000 rseq) $ P.map (insideParticles particles) coords
```
This shows good speedup as we scale up cores (12s at one core to 3.7s at 8), but the overhead of spark creation means that even a 8 cores we only match the single core non-parallel version. I tried a few alternate strategies and got similar results. Again, I'm not sure how much better we can possibly do than a single-threaded list version here. Since the computations on each individual particle are so cheap, we're mainly stressing allocation, not computation. The big win on something like this I imagine would be vectorized computation more than anything else, and as far as I know that pretty much requires hand-coding.

Also note that the parallel version spends roughly 70% of its time in GC, while the one-core version spend 1% of its time there (i.e. the allocation is, to the extent possible, is effectively fused away.).
0 讨论(0)
发布评论:

提交评论
- 加载中...