I'm working with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the high 64 bits as fast as possible. Since AVX2 has no such instruction
It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which supports 64x64=128 with the same throughput as most AVX2 instructions (but is not vectorized). The drawback is that you need to move the operands into general-purpose registers if they were in YMM registers. That would give something like LOAD + MUL + STORE for a single 64x64=128.
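As a rough sketch of the scalar route: with GCC or Clang you can usually get the compiler to emit a single MUL (or MULX) by going through `unsigned __int128`, which is a compiler extension and not ISO C. The names `mulhi_u64` and `mulhi_u64_array` are made up for illustration, and the array helper assumes the operands have already been stored from the YMM registers to memory.

```c
#include <stddef.h>
#include <stdint.h>

/* High 64 bits of the unsigned 64x64 -> 128-bit product.
   Relies on unsigned __int128 (GCC/Clang extension, 64-bit targets). */
static inline uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Hypothetical helper: one LOAD + MUL + STORE per element pair. */
static inline void mulhi_u64_array(const uint64_t *a, const uint64_t *b,
                                   uint64_t *hi, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hi[i] = mulhi_u64(a[i], b[i]);
}
```

For four 64-bit lanes you would call `mulhi_u64_array(a, b, hi, 4)` on the stored vectors. Whether this beats a vectorized version depends on how expensive the moves out of and back into YMM registers are in your code, so it is worth measuring.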
If you can vectorize Karatsuba in AVX2, try both the AVX2 version and MUL and measure which is faster. If you can't vectorize, a single MUL will probably be faster. If you can avoid the load and store through general-purpose registers altogether, a single MUL will definitely be faster.
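For comparison, here is a scalar model of what each 64-bit lane would have to compute in a vectorized version. Note that this is the plain schoolbook decomposition with four 32x32 -> 64 partial products, not Karatsuba (which would trade one multiply for extra additions). In AVX2 each partial product would map onto `_mm256_mul_epu32` (VPMULUDQ), with shifts and masks to pick out the 32-bit halves. The function name `mulhi_u64_32x32` is just an illustrative choice.

```c
#include <stdint.h>

/* High 64 bits of a 64x64 product, built only from 32x32 -> 64 multiplies,
   i.e. the kind of multiply AVX2 offers per 64-bit lane. */
static inline uint64_t mulhi_u64_32x32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;   /* contributes at bit 0   */
    uint64_t hi_lo = a_hi * b_lo;   /* contributes at bit 32  */
    uint64_t lo_hi = a_lo * b_hi;   /* contributes at bit 32  */
    uint64_t hi_hi = a_hi * b_hi;   /* contributes at bit 64  */

    /* Sum everything that lands in bits 32..63; this cannot overflow
       64 bits, and its upper half is the carry into the high word. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
```

Counting the multiplies, shifts, masks and additions here gives a feel for how many AVX2 instructions one vector of four results costs, which is the number to weigh against four scalar MULs plus the register moves.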
Both MUL and AVX2 instructions can take one operand directly from memory with the same throughput, which may let you drop one of the loads for MUL.