Working through Robert Nystrom’s excellent Crafting Interpreters book, I came across something I hadn’t thought about for a long time, as someone who generally sticks to dynamic languages at work: type punning and strict aliasing.

In Crafting Interpreters, this arises in the context of an optimization called NaN boxing that I won’t get into (though you can read more about it in the book here, or what is the purpose of NaN boxing?), but basically the idea is:

  • the basic Value type of the Lox language from the book takes up 16 bytes: 8 bytes for the largest possible values (doubles and Obj pointers), plus 4 bytes for the type tag (since a dynamic language needs to keep track of this at runtime), and 4 bytes of padding.
  • can we make this smaller somehow?
  • well certain IEEE 754 NaN layouts have a bunch of unused bits, maybe we can use those, since a IEEE 754 float is also 64 bits, enough for floats and pointers, and since most 64-bit architectures actually only use 48 bits for a pointer, we still have 16 bits to store type information.

To do this, we’ll need to convince the C compiler (since this particular Lox implementation is written in C) to treat this 64 bit value as variously a float or uint64_t, and this is the essence of type punning.

However, in a side note Nystrom mentions

Spec authors don’t like type punning because it makes optimization harder. A key optimization technique is reordering instructions to fill the CPU’s execution pipelines. A compiler can reorder code only when doing so doesn’t have a user-visible effect, obviously.

Pointers make that harder. If two pointers point to the same value, then a write through one and a read through the other cannot be reordered. But what about two pointers of different types? If those could point to the same object, then basically any two pointers could be aliases to the same value. That drastically limits the amount of code the compiler is free to rearrange.

To avoid that, compilers want to assume strict aliasing—pointers of incompatible types cannot point to the same value. Type punning, by nature, breaks that assumption.

https://craftinginterpreters.com/optimization.html#numbers

Unstated here is that “cannot point to the same value” means that if they do, this will invoke dreaded undefined behaviour (there are many thorough references to UB, see John Regehr’s Guide to undefined behaviour in C and C++ for a particularly good one, also Garbage In, Garbage Out: Arguing about UB (youtube presentation), and also Falsehoods Programmers Believe about UB).

This reminded me of having worked through the classic Beej’s Guide to Network Programming, which I still think is a great (if perhaps outdated) intro to sockets and programming in C. (It was also the basis for my attempt to write a basic tcp server in Fortran). An essential element of the magic of BSD sockets you are introduced to in that guide is the addrinfo struct, which looks like this:

struct addrinfo {
    int              ai_flags;     // AI_PASSIVE, AI_CANONNAME, etc.
    int              ai_family;    // AF_INET, AF_INET6, AF_UNSPEC
    int              ai_socktype;  // SOCK_STREAM, SOCK_DGRAM
    int              ai_protocol;  // use 0 for "any"
    size_t           ai_addrlen;   // size of ai_addr in bytes
    struct sockaddr *ai_addr;      // struct sockaddr_in or _in6
    char            *ai_canonname; // full canonical hostname

    struct addrinfo *ai_next;      // linked list, next node
};

Note that the ai_addr pointer to struct sockaddr is said to be either sockaddr_in or sockaddr_in6 – this allows the code to work with either ipv4 or ipv6.

When we are setting up a server to listen on a port, we first need to call bind and pass it, among other things, a pointer to a struct sockaddr – but we’ll actually have either a sockaddr_in for ipv4 or sockaddr_in6 for ipv6, which will have to be cast first to a generic sockaddr.

Beej notes that the “old” way (maybe this is what the guide had when I first went through it?) was like this:

bind(sockfd, (struct sockaddr *)&my_addr, sizeof my_addr);

But this seems to violate the rule of strict aliasing outlined above: we now have two pointers to the same memory of different types, one is a struct sockaddr_in (assuming we’re using ipv4) and the one passed to bind is a cast to a struct sockaddr. How is this not introducing undefined behaviour?

Am I alone in my confusion?

Turns out some other folks on the internet had the same questions:

One Stack Overflow response states:

The cast itself in that line does not break the strict aliasing rule. The rule is only broken if the implementation of bind() dereferences that pointer without converting it back to the right type.

Any strict aliasing problems there are problems for the implementer of bind(), not the user.

https://stackoverflow.com/a/17988634

So this means perhaps our code isn’t causing UB, but the bind implementation itself may be? (Looking at __inet_bind, which seems to be what ultimately does the heavy lifting for ipv4 sockets, the code is definitely dereferencing and reading from the pointer.) That still seems like a very undesirable situation.

Responses to the reddit post above (“Am I crazy…”) echo a similar idea:

And bind() gets a free pass for either one of two reasons: it’s written in C and thus is only bound by the rules of C, and/or it’s provided by your implementation, and those don’t have to abide by the same rules as us mere mortals.

If *you* try to read or write it through the aliased type, yes, it is UB.

Fortunately, you’re not, bind() is a black box to the compiler.

I didn’t find either of these responses entirely satisfying.

  • Just because bind is a syscall provided by the kernel doesn’t seem like it should mean it is exempt from C’s aliasing rules.
  • Again, even though bind is in this case a “black box” to the compiler, are we to just trust that the compiled bind code will not be subject to UB?

But it turns out that there is some truth to these responses, as per this rant from Linus on aliasing, in which he says the kernel uses the GCC flag -fno-strict-aliasing precisely to avoid issues like this. (Of course, “us mere mortals” are free to use this flag too.)

Linus also comes out strongly in support of using C unions to do type punning, which has apparently been the approved way to do this sort of thing in GCC for some time.

So does this mean that it is not possible to write bind (and likely related kernel networking code) without using potentially-standard-violating compiler extensions? It seems like the answer would have been “yes”, prior to the C99 and C11 standards, which added official support for type punning through unions (though there appears to still be disagreement by some over how official the support is).

Additional reading

Tags

Leave a comment