Depends on what the native hardware does.
If the hardware implements double (like the x86 does), then float is emulated by extending it there, and the conversion will cost time. In this case, double will be faster.
If the hardware implements float only, then emulating double with it will cost even more time. In this case, float will be faster.
And if the hardware implements neither, and both have to be implemented in software. In this case, both will be slow, but double will be slightly slower (more load and store operations at the least).
The quote you mention is probably referring to the x86 platform, where the first case was given. But this doesn't hold true in general.