Why is bitslice DES faster?

A single instance of DES uses at most 12 bits per machine word when doing 12-to-8 dual S-box lookups, and those lookup tables also typically exceed the size of the L1 data cache. With bitslicing, we compute e.g. 64 instances of DES in parallel on a 64-bit CPU, making full use of every bit in the 64-bit machine words. We could instead compute multiple non-bitsliced instances of DES side by side and more fully use the machine word width that way, but an efficient implementation of this approach requires support for vectorized table lookups. In the 2000s, some CPUs got SIMD permute instructions that are potentially usable for this: PowerPC AltiVec VPERM, Cell SPU SHUFB, Intel SSSE3 PSHUFB, AMD XOP VPPERM. 2013+: Intel's Haswell microarchitecture is expected to include AVX2 gather instructions (VSIB addressing).