Recently I was profiling a program in which the hotspot is definitely this
double d = somevalue();
double d2=d*d;
double c = 1.0/d2 // HOT SPOT
For your current program you have identified the hotspot - good. As an alternative to speeding up 1/d^2, you have the option of changing the program so that it does not compute 1/d^2 so often. Can you hoist it out of an inner loop? For how many different values of d do you compute 1/d^2? Could you pre-compute all the values you need and then look up the results? This is a bit cumbersome for 1/d^2, but if 1/d^2 is part of some larger chunk of code, it might be worthwhile applying this trick to that. You say that if you lower the precision, you don't get good enough answers. Is there any way you can rephrase the code, that might provide better behaviour? Numerical analysis is subtle enough that it might be worth trying a few things and seeing what happened.
Ideally, of course, you would find some optimised routine that draws on years of research - is there anything in lapack or linpack that you could link to?