For the purpose of testing printf() call on device, I wrote a simple program which copies an array of moderate size to device and print the value of device array to screen.
printf from the device has a limited queue. It's intended for small scale debug-style output, not large scale output.
referring to the programmer's guide:
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten.
Your in-kernel printf output overran the buffer, and so the first printed elements were lost (overwritten) before the buffer was dumped into the standard I/O queue.
The linked documentation indicates that the buffer size can be increased, also.