The tutorial you are looking at is quite old (2008?), so it might not be compatible with the version of CUDA you are using.
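You can use __global__, but note that it is not the same as __host__ __device__: it marks a kernel, a function that executes on the device and is launched from host code with the <<<...>>> syntax. This works: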
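__global__ void f()
{
    // Global thread index across the whole grid.
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
}

int main()
{
    f<<<1, 1>>>();            // launch one block of one thread
    cudaDeviceSynchronize();  // wait for the kernel to finish
    return 0;
}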
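
If it helps to see how __global__ differs from __host__ __device__, here is a minimal sketch. The names global_index and fill_indices are made up for illustration, and the launch configuration (2 blocks of 4 threads) is an arbitrary assumption. The helper is compiled for both sides, so the kernel and main can both call it directly, whereas the kernel itself can only be launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: __host__ __device__ compiles it for both the CPU
// and the GPU, so it can be called from either side.
__host__ __device__ int global_index(int thread, int block, int block_dim)
{
    return thread + block * block_dim;
}

// Hypothetical kernel: __global__ means it runs on the device and is
// launched from the host with <<<...>>>.
__global__ void fill_indices(int *out, int n)
{
    const int tid = global_index(threadIdx.x, blockIdx.x, blockDim.x);
    if (tid < n) {
        out[tid] = tid;  // each thread writes its own global index
    }
}

int main()
{
    const int n = 8;
    int host_out[n] = {0};
    int *dev_out = nullptr;

    if (cudaMalloc(&dev_out, n * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Assumed launch configuration: 2 blocks of 4 threads covers n = 8.
    fill_indices<<<2, 4>>>(dev_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host_out, dev_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        std::printf("%d ", host_out[i]);
    }
    // The same helper called directly on the host: 2 + 3 * 4 = 14.
    std::printf("\nhost call: %d\n", global_index(2, 3, 4));
    return 0;
}
```

Build it with nvcc (for example as a .cu file). Calling fill_indices like an ordinary function from main would not compile, which is the practical difference between the two specifiers.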