BTW, you might find it useful to save the Fourier-transformed image and kernels to disk and read them back in. You don't need to recompute them unless they change. I don't know if that helps at your image size, but it might be worth testing.
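For anyone who wants to test that idea, here is a minimal sketch in Python with NumPy, which is an assumption since the thread doesn't say what language the simulation is written in. The helper name `load_or_compute_fft`, the cache file name, and the small random stand-in image are all hypothetical.

```python
import os
import tempfile

import numpy as np


def load_or_compute_fft(image, cache_path):
    """Return the 2-D FFT of `image`, reusing a copy cached on disk if one exists."""
    if os.path.exists(cache_path):
        # Cache hit: skip the transform and read the stored spectrum.
        return np.load(cache_path)
    # Cache miss: transform each color plane over the two spatial axes.
    spectrum = np.fft.fft2(image, axes=(0, 1))
    np.save(cache_path, spectrum)
    return spectrum


# Small stand-in for the real target image (hypothetical data).
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

cache_path = os.path.join(tempfile.mkdtemp(), "target_fft.npy")
first = load_or_compute_fft(image, cache_path)   # computes and saves
second = load_or_compute_fft(image, cache_path)  # loads from disk
```

Whether the load beats the recompute depends on disk speed versus FFT speed, so it would need timing at the real image size.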
The FFT'd target is about 13 GB (16384x16384x3x8x2 bytes), so it's probably faster to read in the image, pad it, and compute the FFT than to read in the FFT'd version. It wouldn't make much difference either way, though, since the FFT'd target is now computed only once per batch run. The kernels change on every iteration because they are, in general, a function of both the f-stop and the pixel pitch. At roughly 1 minute per iteration, I think it's fast enough now. But thanks for the advice.
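The size figure can be checked with a few lines of Python. This sketch assumes the factors mean 16384x16384 pixels, 3 color planes, 8 bytes per double-precision value, and 2 values (real and imaginary) per complex sample; the variable names are only for illustration.

```python
side = 16384            # padded image width and height, in pixels
channels = 3            # color planes (assumed meaning of the "3")
bytes_per_double = 8    # double-precision float (assumed meaning of the "8")
parts_per_complex = 2   # real + imaginary parts (assumed meaning of the "2")

fft_bytes = side * side * channels * bytes_per_double * parts_per_complex

print(fft_bytes)            # total bytes
print(fft_bytes / 1e9)      # decimal gigabytes, roughly 12.9
print(fft_bytes / 2**30)    # binary gibibytes, exactly 12.0
```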