See the following results:
SSE::alignedMalloc is our implementation, which works in the standard way: it allocates the requested size plus the alignment, then moves the returned pointer up so that it is aligned, while storing the original pointer so it can be freed later.
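For readers who haven't seen the trick, here is a minimal sketch of that over-allocate-and-align approach. It is not the actual SSE::alignedMalloc code: the names alignedMallocSketch and alignedFreeSketch are made up for illustration, the alignment is assumed to be a power of two, and the sketch reserves an extra sizeof(void*) bytes so there is always room to stash the original pointer just before the aligned block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for SSE::alignedMalloc; the real implementation may differ.
// Assumes `alignment` is a non-zero power of two.
void* alignedMallocSketch(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // Keep the saved-pointer slot itself properly aligned.
    if (alignment < sizeof(void*)) alignment = sizeof(void*);

    // Over-allocate: payload + worst-case padding + room for the saved pointer.
    void* original = std::malloc(size + alignment + sizeof(void*));
    if (original == nullptr) return nullptr;

    // Skip the slot for the saved pointer, then round up to the alignment.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(original) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;

    // Store the original pointer immediately before the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = original;
    return reinterpret_cast<void*>(aligned);
}

// Matching free: recover the original pointer and hand it back to free().
void alignedFreeSketch(void* ptr) {
    if (ptr != nullptr) std::free(static_cast<void**>(ptr)[-1]);
}
```

Memory obtained this way has to be released with the matching free function, since the aligned pointer is generally not the one malloc returned.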
_mm_malloc is the VS function that returns aligned memory. It compiles to a call to the same function as _aligned_malloc.
Finally, malloc is of course the standard, boring memory allocation function.
As you can see, for some weird reason, VS's _mm_malloc is much slower than malloc and our aligned alloc implementation (SSE::alignedMalloc).
Issue filed with MS here.