I did not realize that the AMD64 ABI specifies a silly calling convention rule: structs with 3 or more word-sized (i.e. 8-byte) fields, meaning anything larger than 16 bytes, are passed through memory rather than in registers, even when the function takes them by value. I learned this from: Speed up your code: don’t pass structs bigger than 16 bytes on AMD64 (HN discussion).

For example:

struct Vector { double x; double y; double z; };
void f(Vector v);
...
f(Vector{ .x = 1, .y = 2, .z = 3 });

One would hope that this gets optimized into something like:

void f(double x, double y, double z);
...
f(1, 2, 3);

And indeed, this does happen, but only if your struct contains at most 2 doubles, not 3 or more. It also happens if the compiler is able to inline the function call, e.g. with the help of link-time optimization. The rest of the time, though, the struct is written out to memory and the callee reads it back from there, roughly as if you had written:

void f(Vector*);
...
auto v = Vector{ .x = 1, .y = 2, .z = 3 };
f(&v);
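
To make the size threshold concrete, here is a minimal sketch. The names `Vec2`, `kRegisterLimit`, `fits_in_registers`, and `length_squared` are hypothetical and not from the linked article. The 16-byte check is a deliberate simplification of the ABI's classification rules, which also depend on the field types and on the struct being trivially copyable, and it assumes an 8-byte `double`. The last two functions show one possible workaround: keep the hot function's parameters scalar and wrap it for callers that hold a `Vector`.

```cpp
#include <cstddef>

struct Vec2   { double x; double y; };            // 2 * 8 = 16 bytes
struct Vector { double x; double y; double z; };  // 3 * 8 = 24 bytes

// Assumption: simplified form of the AMD64 rule discussed above.
// Anything larger than 16 bytes is not split across registers.
constexpr std::size_t kRegisterLimit = 16;

template <typename T>
constexpr bool fits_in_registers() {
    return sizeof(T) <= kRegisterLimit;
}

static_assert(fits_in_registers<Vec2>(), "two doubles stay within the limit");
static_assert(!fits_in_registers<Vector>(), "three doubles exceed the limit");

// Possible workaround: the function that does the work takes scalars,
// so each argument can travel in its own floating-point register.
inline double length_squared(double x, double y, double z) {
    return x * x + y * y + z * z;
}

// Thin wrapper for callers that already hold a Vector; being inline,
// it is a good candidate for the compiler to fold into the call site.
inline double length_squared(const Vector& v) {
    return length_squared(v.x, v.y, v.z);
}
```

Whether this is worth doing depends on how hot the call is; when the call gets inlined anyway, the difference disappears.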

The calling convention used on Windows is even worse: it prevents structs and arrays from being broken up for argument passing at all (Microsoft docs). Yikes!

Only tangentially related, but one of the comments on GitHub points out that GCC and Clang have support for efficient vector structs: Clang docs, GCC docs. Godbolt shows that compilers generate vastly better code with these than with naive struct passing, thanks to more aggressive SIMD optimizations, which also happen to bypass the above issue. Stack Overflow has some additional explanation of what these attributes do.
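
As a rough illustration of those vector types, the sketch below uses the `vector_size` attribute, which both GCC and Clang accept. The names `Vec4f`, `make_vec3`, `add`, and `dot3` are hypothetical. Using four `float` lanes (16 bytes, with the last lane as padding) to hold a 3D vector is an assumption made here to keep the type within one SSE register; it is not what the linked comment necessarily does. This is a compiler extension and will not build with MSVC.

```cpp
// GCC/Clang extension: a 16-byte vector of four floats.
typedef float Vec4f __attribute__((vector_size(16)));

// Pack x, y, z into the first three lanes; the fourth lane is padding.
inline Vec4f make_vec3(float x, float y, float z) {
    Vec4f v = {x, y, z, 0.0f};
    return v;
}

// Element-wise addition; the extension defines + on whole vectors.
inline Vec4f add(Vec4f a, Vec4f b) {
    return a + b;
}

// Dot product over the first three lanes only.
inline float dot3(Vec4f a, Vec4f b) {
    Vec4f p = a * b;  // element-wise multiply
    return p[0] + p[1] + p[2];
}
```

Because the whole value is a single vector type rather than a struct of separate fields, the struct-size rule above does not come into play for it.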