I am trying to calculate a matrix using C++ AMP. I use a 3000 x 3000 array and repeat the calculation 20000 times:
    //_height=_width=3000
    extent<2> ext(_height,_width);
    array<int, 2> GPU_main(ext,gpuDevice.default_view);
    array<int, 2> GPU_res(ext,gpuDevice.default_view);
    copy(_main, GPU_main);
    array_view<int,2> main(GPU_main);
    array_view<int,2> res(GPU_res);
    res.discard_data();
    number=20000;
    for(int i=0;i<number;i++)
    {
        parallel_for_each(e,[=](index<2> idx)restrict(amp)
        {
           res(idx)=main(idx)+idx[0];//not depend from calculation type
        }
    array_view<TYPE, 2>  temp=res;
    res=main;
    main=temp;
    }
    copy(main, _main);
Before the calculation I copy my matrix from host memory to GPU memory and create the array_views (the first lines of the code above).
After that I start a loop that repeats the operation 20000 times. In every iteration I launch a parallel_for_each that does the calculation with C++ AMP.
The GPU calculates very fast, but when I copy the result back to the host array _main, that operation takes a lot of time. I also found that if I decrease number from 20000 to 2000, the time for the copy decreases as well.
Why does this happen? Is it some synchronization issue?
Your code (as is) doesn't compile; below is a fixed version which I think has the same intent. If you want to break out the time for copying from the compute time, then the simplest thing to do is to use array<> and explicit copies.
        int _height, _width;
        _height = _width = 3000;
        std::vector<int> _main(_height * _width); // host data.
        concurrency::extent<2> ext(_height, _width);
        // Start timing data copy
        concurrency::array<int, 2> GPU_main(ext /* default accelerator */);
        concurrency::array<int, 2> GPU_res(ext);
        concurrency::array<int, 2> GPU_temp(ext);
        concurrency::copy(begin(_main), end(_main), GPU_main);
        // Finish timing data copy
        int number = 20000;
        // Start timing compute
        for(int i=0; i < number; ++i)
        {
            concurrency::parallel_for_each(ext,
                [=, &GPU_res, &GPU_main](concurrency::index<2> idx) restrict(amp)
            {
               GPU_res(idx) = GPU_main(idx) + idx[0];
            });
            concurrency::copy(GPU_res, GPU_temp);       // Swap arrays on GPU
            concurrency::copy(GPU_main, GPU_res);
            concurrency::copy(GPU_temp, GPU_main);
        }
        GPU_main.accelerator_view.wait(); // Wait for compute
        // Finish timing compute
        // Start timing data copy
        concurrency::copy(GPU_main, begin(_main));
        // Finish timing data copy
Note the wait() call to force the compute to finish. Remember that C++ AMP commands usually queue work on the GPU, and that work is only guaranteed to have executed once you wait for it, either explicitly with wait() or implicitly by calling (for example) synchronize() on an array_view<>. That is why the final copy in your code appeared slow and scaled with the iteration count: it had to wait for all the queued kernels to finish first. To get a good idea of timing you should really time the compute and data copies separately (as shown above). You can find some basic timing code in Timer.h here: http://ampbook.codeplex.com/SourceControl/changeset/view/100791#1983676. There are examples of its use in the same folder.
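As an aside, the queuing effect can be imitated in plain standard C++, without C++ AMP. The sketch below is only an analogy under stated assumptions: `DeferredQueue`, `elapsed_ms` and `demo_queued_work` are hypothetical names, not part of any library, and the queue runs its work inside `wait()`, whereas a real GPU runtime executes asynchronously in the background. The consequence for timing is similar, though: submitting is cheap, and whoever blocks first pays for everything still outstanding.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for an accelerator command queue (not a C++ AMP type).
// submit() only records the work; wait() runs everything still pending.
class DeferredQueue {
public:
    void submit(std::function<void()> work) { pending_.push(std::move(work)); }

    void wait() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};

// Runs f and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Submits `launches` tiny jobs, checks that none has run before wait(),
// then waits and returns how many jobs were executed.
inline int demo_queued_work(int launches) {
    DeferredQueue queue;
    int executed = 0;
    for (int i = 0; i < launches; ++i) {
        queue.submit([&executed] { ++executed; });
    }
    assert(executed == 0);  // nothing has run yet, the loop only queued work
    queue.wait();           // all of the cost is paid here
    return executed;
}
```

Wrapping the submit loop and the `wait()` call in separate `elapsed_ms` calls would attribute the time to the waiting side, which mirrors what happens when a copy is the first blocking call after 20000 queued kernels.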
However, I'm not sure I would really write the code with explicit array<> copies unless I wanted to break out the copy and compute times. It is far simpler to use array<> for data that lives purely on the GPU and array_view<> for data that is copied to and from the GPU.
This would look like the code below.
        int _height, _width;
        _height = _width = 3000;
        std::vector<int> _main(_height * _width); // host data.
        concurrency::extent<2> ext(_height, _width);
        concurrency::array_view<int, 2> _main_av(ext, _main); // 2D view over the host data
        concurrency::array<int, 2> GPU_res(ext);
        concurrency::array<int, 2> GPU_temp(ext);
        concurrency::copy(begin(_main), end(_main), _main_av);
        int number = 20000;
        // Start timing compute and possibly copy
        for(int i=0; i < number; ++i)
        {
            concurrency::parallel_for_each(ext,
                // array<> is captured by reference, array_view<> by value (via [=])
                [=, &GPU_res](concurrency::index<2> idx) restrict(amp)
            {
               GPU_res(idx) = _main_av(idx) + idx[0];
            });
            concurrency::copy(GPU_res, GPU_temp);  // Swap arrays on GPU
            concurrency::copy(_main_av, GPU_res);
            concurrency::copy(GPU_temp, _main_av);
        }
        _main_av.synchronize();  // Will wait for all work to finish
        // Finish timing compute & copy
Now the data that is only required on the GPU is declared to be on the GPU and the data that needs to be synchronized is declared as such. Clearer and less code.
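To check the GPU result, a CPU-only reference of the same update can help. The sketch below is plain standard C++, not C++ AMP, and `iterate_add_row` is a hypothetical helper name. It assumes a row-major layout, so `idx[0]` in the kernel corresponds to the row index `r` here. It also shows the ping-pong pattern with a handle swap (`std::swap` on two vectors exchanges their internal pointers) rather than three data copies per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Applies next[r][c] = current[r][c] + r to a row-major height x width
// matrix `iterations` times and returns the final buffer.
std::vector<int> iterate_add_row(std::vector<int> current,
                                 std::size_t height,
                                 std::size_t width,
                                 int iterations) {
    assert(current.size() == height * width);
    std::vector<int> next(current.size());
    for (int i = 0; i < iterations; ++i) {
        for (std::size_t r = 0; r < height; ++r) {
            for (std::size_t c = 0; c < width; ++c) {
                next[r * width + c] = current[r * width + c] + static_cast<int>(r);
            }
        }
        // Constant-time swap of the two buffers; no element is copied.
        std::swap(current, next);
    }
    return current;
}
```

For a small case such as a 2 x 3 matrix of ones and 4 iterations, row 0 stays at 1 and row 1 ends at 5, which is easy to compare against the values copied back from the GPU.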
You can find out more about this by reading my book on C++ AMP :)
How did you measure the timing? You need to wait on the accelerator_view after parallel_for_each before doing the copy for accurate timing of computation and copy. You may want to check out the following blog posts for some tips on measuring the performance of C++ AMP programs:
Source: https://stackoverflow.com/questions/13936994/copy-data-from-gpu-to-cpu