Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work.
I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.