Commit 76268ac6 authored Dec 22, 2012 by Benjamin Kramer

X86: Turn mul of <4 x i32> into pmuludq when no SSE4.1 is available.

pmuludq is slow, but it turns out that all the unpacking and packing of the
scalarized mul is even slower. 10% speedup on loop-vectorized paq8p.

llvm-svn: 170985

parent b2f0a2bd
