FPGAs for defense

To gain the maximum advantage over CPUs and GPUs, maximize the number of cores times the number of pipeline stages per core (if the cores are pipelined).
Thus, each core should be as small as possible.
Rationale: a CPU has a limited number of relatively feature-rich execution units.
By keeping our cores very simple, we leave more of the logic in those execution units unused when a CPU computes the same function.
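The cores-times-stages argument above can be illustrated with a little arithmetic. This is only a sketch: the logic budget, core sizes, and pipeline depth are hypothetical numbers chosen for illustration, not measurements of any real device, and the function name `parallelism` is made up here.

```python
def parallelism(logic_budget, core_size, pipeline_stages=1):
    """Independent computations in flight: number of cores times pipeline stages."""
    cores = logic_budget // core_size  # how many whole cores fit in the budget
    return cores * pipeline_stages


# Hypothetical figures: a budget of 100,000 logic cells, 4-stage pipelines.
BUDGET = 100_000
small = parallelism(BUDGET, core_size=500, pipeline_stages=4)
large = parallelism(BUDGET, core_size=5_000, pipeline_stages=4)

print(small, large)
```

Under these assumed numbers, shrinking the core by 10x yields 10x more computations in flight for the same logic budget, which is the reason to keep each core small.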
Difficulty: SIMD instructions can operate on many narrow values in parallel.
Making the S-boxes variable is one way to defeat implementations of small S-boxes with SIMD byte permute instructions (in Cell, SSSE3, XOP) or with bitslicing.
However, parallel S-box lookups may still be performed with gather loads (AVX2 VSIB addressing, to be available in 2013+).
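The difference between the two lookup styles can be sketched in plain Python. This is a simplified emulation, not real SIMD code: `pshufb` models the per-byte selection rule of the SSSE3 byte permute, `gather` models a gather load in which each lane reads from its own address, and `make_sbox`, the seeds, and the lane count are hypothetical choices made for illustration. Real AVX2 gathers load 32-bit or 64-bit elements, which the sketch ignores.

```python
import random


def pshufb(table, indices):
    """Emulate a 16-byte byte permute: each index byte selects
    table[index & 0x0F], or yields 0 when its top bit is set."""
    assert len(table) == 16 and len(indices) == 16
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]


def gather(memory, bases, indices):
    """Emulate a gather load: lane k reads memory[bases[k] + indices[k]]."""
    return [memory[b + i] for b, i in zip(bases, indices)]


def make_sbox(seed, size=16):
    """Hypothetical variable S-box: a permutation derived from a seed."""
    rng = random.Random(seed)
    sbox = list(range(size))
    rng.shuffle(sbox)
    return sbox


# Fixed S-box (an arbitrary 4-bit permutation): the whole table fits in one
# 16-byte register, so one permute performs 16 lookups at once.
FIXED_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
              0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
fixed_out = pshufb(FIXED_SBOX, list(range(16)))

# Variable S-boxes: each of 8 lanes has its own table in memory, so a
# constant register no longer works, but a gather still looks them up together.
LANES = 8
sboxes = [make_sbox(seed) for seed in range(LANES)]
memory = [value for sbox in sboxes for value in sbox]  # tables laid end to end
bases = [16 * k for k in range(LANES)]                 # start of each lane's table
inputs = [(3 * k + 1) & 0x0F for k in range(LANES)]    # one 4-bit input per lane
gathered = gather(memory, bases, inputs)
```

With the fixed table, the S-box is a constant that the attacker can keep in a register. With per-lane tables, the contents are only known at run time, so the lookups have to go through memory, though gather loads keep them parallel.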
Alternatively, focus on making optimal use of the available resources without trying to slow down CPU/GPU implementations.