The average filter is a windowed filter of the linear class that smooths a signal (image). It works as a low-pass filter. The basic idea behind filter is for any elemen
As long as the filter window is reasonably small, the average filter is a very good fit for CUDA. You can launch square thread blocks in which every thread is responsible for one output pixel: it sums the pixel's neighbors and divides by their count.
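As a rough illustration of the one-thread-per-pixel idea, here is a minimal sketch that reads the image directly from global memory. The names boxFilterGlobal and runBoxFilterGlobal, the 16x16 block size, the single-channel float image layout, and the border policy (neighbors outside the image are simply skipped) are all assumptions made for the example, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread computes one output pixel.
// The window is (2*radius+1) x (2*radius+1), centered on the pixel.
__global__ void boxFilterGlobal(const float* in, float* out,
                                int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // thread falls outside the image

    float sum = 0.0f;
    int count = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = x + dx;
            int ny = y + dy;
            // Assumed border policy: ignore neighbors outside the image.
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += in[ny * width + nx];
                ++count;
            }
        }
    }
    out[y * width + x] = sum / count;  // count >= 1, the pixel itself
}

// Hypothetical host helper: copies the image to the GPU, runs the kernel,
// and copies the result back. CUDA error checking is omitted for brevity.
std::vector<float> runBoxFilterGlobal(const std::vector<float>& img,
                                      int width, int height, int radius)
{
    size_t bytes = img.size() * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // square block, one thread per pixel
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    boxFilterGlobal<<<grid, block>>>(dIn, dOut, width, height, radius);

    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling runBoxFilterGlobal(img, width, height, 1) would apply a 3x3 average. Every thread re-reads its whole neighborhood from global memory, which is simple but means neighboring threads fetch many of the same pixels.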
If you keep the image in global memory, this is easy to program. One possible optimization is to load tiles of the image into each block's shared memory. By padding each tile with phantom (halo) elements, so that you never index outside the shared array when looking up neighboring pixels, you can compute the averages for all the pixels within a block.
The only thing you have to be careful about is how the "stitching" is done at the end: the shared-memory tiles overlap (because of the extra "padding" pixels), and you don't want to calculate their values twice.
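One way to sketch the shared-memory variant is shown below. The names boxFilterShared and runBoxFilterShared, the TILE and RADIUS constants, and the clamp-to-edge border policy are assumptions for illustration, not something fixed by the approach. Each block loads a padded tile (its own pixels plus the phantom border) into shared memory, but every thread writes only its own interior pixel, so the overlapping borders are read by several blocks and never written twice.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;                   // output pixels per block side (assumed)
constexpr int RADIUS = 1;                  // 3x3 window (assumed)
constexpr int PADDED = TILE + 2 * RADIUS;  // tile side including phantom border

// Hypothetical kernel; must be launched with TILE x TILE threads per block.
__global__ void boxFilterShared(const float* in, float* out,
                                int width, int height)
{
    __shared__ float tile[PADDED][PADDED];

    // Image coordinates of the padded tile's top-left corner.
    int x0 = blockIdx.x * TILE - RADIUS;
    int y0 = blockIdx.y * TILE - RADIUS;

    // Cooperative load: the TILE*TILE threads stride over all PADDED*PADDED
    // elements. Out-of-image coordinates are clamped to the nearest edge
    // pixel (an assumed border policy).
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < PADDED * PADDED; i += TILE * TILE) {
        int ty = i / PADDED;
        int tx = i % PADDED;
        int gx = min(max(x0 + tx, 0), width - 1);
        int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = in[gy * width + gx];
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // The thread's pixel sits at tile[threadIdx.y + RADIUS][threadIdx.x + RADIUS],
    // so its window starts at tile[threadIdx.y][threadIdx.x].
    float sum = 0.0f;
    for (int dy = 0; dy <= 2 * RADIUS; ++dy)
        for (int dx = 0; dx <= 2 * RADIUS; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];

    // Only interior pixels are written; phantom pixels belong to other blocks.
    out[y * width + x] = sum / ((2 * RADIUS + 1) * (2 * RADIUS + 1));
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
std::vector<float> runBoxFilterShared(const std::vector<float>& img,
                                      int width, int height)
{
    size_t bytes = img.size() * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    boxFilterShared<<<grid, block>>>(dIn, dOut, width, height);

    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this layout the stitching takes care of itself: the output pixels of different blocks are disjoint, and only the input reads overlap. For a 3x3 window the gain over plain global-memory reads may be modest, so it is worth measuring before committing to the extra complexity.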