Question
I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.
I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.
- What is the canonical way of wrapping an existing array in a Halide::Image?
- How should the function copy be scheduled to perform the copy efficiently?
Minimal working example
#include <Halide.h>
using namespace Halide;
void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
    Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
    Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
    Var x,y;
    Func copy;
    copy(x,y) = in(x,y);
    copy.realize(out);
}
int main(void) {
    uint8_t in[10000], out[10000];
    _copy(in, out, 100, 100);
}
Compilation Flags
clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp
Answer 1:
Let me start with your second question: _copy takes a long time because it needs to compile Halide code to x86 machine code. IIRC, Func caches the machine code, but since copy is local to _copy, that cache cannot be reused. Anyway, scheduling copy is pretty simple because it's a pointwise operation: first, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example:
copy.vectorize(x, 32).parallel(y);
will vectorize along x with a vector size of 32 and parallelize along y. (I am making this up from memory, so there might be some confusion about the correct names.) Of course, doing all this might also increase compile times...
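To make the shape of that schedule concrete, here is a hand-written plain C++ sketch of the loop nest it roughly describes. It is not Halide output. The names copy_row, copy_rows and kLanes are made up for illustration, the rows are visited sequentially where Halide would hand them to different threads, and the scalar tail loop is a simplification of how widths that are not a multiple of the vector size get handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative vector width, matching the 32 in vectorize(x, 32).
constexpr int kLanes = 32;

// Copy one row in fixed-size chunks of kLanes elements.
// The fixed-trip-count inner loop is the part a compiler can turn
// into vector loads and stores.
void copy_row(const uint8_t* in, uint8_t* out, int width) {
    int x = 0;
    for (; x + kLanes <= width; x += kLanes) {
        for (int lane = 0; lane < kLanes; ++lane) {
            out[x + lane] = in[x + lane];
        }
    }
    // Scalar tail for the leftover elements (a simplification).
    for (; x < width; ++x) {
        out[x] = in[x];
    }
}

// Copy a dense row-major image of `height` rows and `width` columns.
// Each iteration of the y loop is independent of the others, which is
// why this is the loop that parallel(y) would distribute across threads.
void copy_rows(const uint8_t* in, uint8_t* out, int width, int height) {
    assert(in != nullptr && out != nullptr && width >= 0 && height >= 0);
    for (int y = 0; y < height; ++y) {
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        copy_row(in + offset, out + offset, width);
    }
}
```

For the 100x100 array in the question, each row would be three chunks of 32 plus a tail of 4 elements.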
There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and profiling the code. I also use the AOT compilation provided by Halide::Generator, which makes sure that I only measure the runtime of the code and not the compile time.
Your other question was how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called buffer_t for everything image related. There is also a C++ wrapper called Halide::Buffer that makes using buffer_t a little easier; I think it can also be used in Func::realize instead of Halide::Image. The point is: if you understand buffer_t, you can wrap almost everything into something digestible by Halide.
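The idea behind such a descriptor can be sketched without Halide at all: a pointer to existing memory plus a size and a stride per dimension. The struct below is a hypothetical stand-in written for illustration, not Halide's buffer_t; the names ArrayView2D, wrap_row_major and at are assumptions, and the real structure has more fields whose exact layout should be taken from the Halide headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 2-D descriptor over memory owned by someone else.
// NOT Halide's buffer_t; it only illustrates the same idea.
struct ArrayView2D {
    uint8_t* host;  // first element of the existing array
    int extent[2];  // extent[0] = width (x), extent[1] = height (y)
    int stride[2];  // elements to skip per step in x and per step in y
};

// Wrap a dense row-major array with `rows` rows and `cols` columns.
// Nothing is copied; the view just records how to index the array.
ArrayView2D wrap_row_major(uint8_t* data, int rows, int cols) {
    assert(data != nullptr && rows > 0 && cols > 0);
    return ArrayView2D{data, {cols, rows}, {1, cols}};
}

// Element access in (x, y) order, x being the fastest-varying index.
uint8_t& at(const ArrayView2D& v, int x, int y) {
    assert(x >= 0 && x < v.extent[0] && y >= 0 && y < v.extent[1]);
    const std::ptrdiff_t index =
        static_cast<std::ptrdiff_t>(x) * v.stride[0] +
        static_cast<std::ptrdiff_t>(y) * v.stride[1];
    return v.host[index];
}
```

With this layout, the question's 100x100 array would be wrapped as wrap_row_major(in_ptr, M, N), and writes through at() land directly in the caller's array.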
Answer 2:
To emphasize the first thing Florian mentioned, which I think is the key point of misunderstanding here: you appear to be timing the compilation of the copy operation ("pipeline," in common Halide terms), not just its execution. Your code size estimate is presumably also for the whole binary resulting from copy.cpp, not just the code in the Halide-generated copy function (which won't actually even appear in the binary you're compiling with clang, since it is only constructed by JITing at runtime in this program).
You can observe the actual cost of your pipeline here by first calling copy.compile_jit() before realize (realize implicitly calls compile_jit the first time it is run, so it's not necessary, but it's valuable to factor apart the runtime from the compile overhead). You would then put your timer exclusively around realize.
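A minimal timing sketch along those lines, in plain C++ so it does not depend on Halide: one untimed warm-up call absorbs any one-time cost (such as JIT compilation on the first realize), and only the later calls are measured. The helper names time_ms and time_after_warmup_ms are made up for this example.

```cpp
#include <cassert>
#include <chrono>

// Time a single call of `fn`, in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Call `fn` once without timing it, then time `runs` further calls
// and return the mean time per call in milliseconds.
template <typename Fn>
double time_after_warmup_ms(Fn&& fn, int runs) {
    assert(runs > 0);
    fn();  // warm-up: any one-time setup cost is paid here, untimed
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        total += time_ms(fn);
    }
    return total / runs;
}
```

Assuming the copy and out objects from the question, this could be used as time_after_warmup_ms([&] { copy.realize(out); }, 10), or the warm-up could be replaced by an explicit copy.compile_jit() followed by time_ms around realize.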
If you actually want to pre-compile this (or any other) pipeline for static linking into your ultimate program, which is what it seems you might be expecting, what you really want to do is use Func::compile_to_file in one program to compile and emit the code (as copy.h and copy.o), and then link and call these in another program. Check out tutorial lesson 10 to see this in more detail:
https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_generate.cpp
https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_run.cpp
Source: https://stackoverflow.com/questions/31063064/c-array-to-halide-image-and-back