Why is bitslice DES faster?

A single instance of DES uses at most 12 bits per machine word when doing 12-to-8 dual S-box lookups, and those lookup tables also typically exceed the size of the L1 data cache. With bitslicing, we compute e.g. 64 instances of DES in parallel on a 64-bit CPU, making full use of every bit in the 64-bit machine words. We could instead compute multiple non-bitsliced instances of DES side by side and more fully use the machine word width that way, but an efficient implementation of this approach requires support for vectorized table lookups. In the 2000s, some CPUs got SIMD permute instructions that are potentially usable for this: PowerPC AltiVec VPERM, Cell SPU SHUFB, Intel SSSE3 PSHUFB, AMD XOP VPPERM. 2013+: Intel's Haswell microarchitecture is expected to include AVX2 gather instructions (VSIB addressing).