Today I encountered a statement in a random blog post about the IBM PowerPC 600 series that broke my brain for a minute while I was trying to figure out what it could possibly mean, so now I’m going to subject you to it as well. Here goes:

Moving a register to itself is functionally a nop, but the processor overloads it to signal information about priority.

The blog post does go on to explain this… well, for certain values of “explain” anyway:

 or      r1, r1, r1       ; low priority
 or      r6, r6, r6       ; medium-low priority
 or      r2, r2, r2       ; normal priority

A program can voluntarily set itself to low priority if it is waiting for a spin lock.

Ok, let’s back up a step and see if we can make sense of this. The code we’re looking at is assembly (the human-readable form of machine code) for this particular IBM CPU architecture. or executes a bit-wise logical OR, that is, an independent logical OR operation for each bit in a machine word (however many bits that is): ra | rb with the result stored in rd would be written as or rd, ra, rb. (The r prefixes denote that we are talking about CPU registers; think of them as hardware-level variables if you’re not sure what that means.)

Based on this, or r1, r1, r1 would calculate r1 | r1 and store the result in r1. It should be easy to see that r1 | r1 always just yields r1 unchanged (since 1 | 1 = 1 and 0 | 0 = 0), so this simplifies down to: store r1 into r1, i.e. do nothing at all. Instructions with no actual effect can be useful in a few cases (such as ensuring memory alignment of code, or as patching points for debuggers), and there are lots of ways of writing code that has no effect: here, for example, we could use any register in place of r1 and it would work the same way. (x86-64, the dominant architecture for PCs today, has a dedicated nop instruction for this purpose.)
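If you want to convince yourself of the idempotence argument, here’s a trivial check in plain portable C (nothing PowerPC-specific about it) that x | x leaves every bit pattern unchanged:

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        /* OR is idempotent bit by bit (1 | 1 = 1, 0 | 0 = 0),
           so x | x == x for any machine word x. */
        uint32_t samples[] = { 0x00000000u, 0xFFFFFFFFu, 0xDEADBEEFu };
        for (int i = 0; i < 3; i++)
            assert((samples[i] | samples[i]) == samples[i]);
        return 0;
    }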

Apparently, the PowerPC 600 series architecture uses these different spellings of “no operation” to signal priority: using r1 signals low priority, and r2 normal priority (I’m not sure what to make of the “medium-low priority” denoted by the use of r6). But how is “priority” even a concept in CPU land? One might think that a CPU core has a single stream of instructions to execute, and it just goes through them one by one. This turns out not to be exactly the case, even putting aside more complicated optimizations such as out-of-order execution: the PowerPC architecture uses simultaneous multi-threading (SMT), running two “threads” per core (similarly to Intel’s hyper-threading technology for x86-64).
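To make this concrete, here is roughly what these hints look like when emitted from C through inline assembly, in the style of the priority macros the Linux kernel defines for PowerPC (the function names are my own invention; only the or instructions come from the blog post, and of course this only assembles when targeting PowerPC):

    /* SMT priority hints on PowerPC, as GCC/Clang inline assembly.
       Each one is an architectural no-op whose only effect is to
       change the hardware thread's priority. */
    static inline void smt_low_priority(void)    { __asm__ volatile("or 1,1,1"); }
    static inline void smt_medium_low(void)      { __asm__ volatile("or 6,6,6"); }
    static inline void smt_normal_priority(void) { __asm__ volatile("or 2,2,2"); }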

To understand why this is useful, consider that each CPU core contains multiple execution units (e.g. 8 per core for the POWER5), which means that each core is, by itself, able to perform multiple calculations concurrently. These execution units are specialized: for example, there are different execution units for integer and floating-point (fractional) arithmetic, usually called the ALU (Arithmetic-Logic Unit) and the FPU (Floating-Point Unit). This means that if, for example, the running application (more precisely, the currently scheduled operating system thread) is performing lots of integer arithmetic, then the core’s ALUs are fully utilized while its FPUs sit idle. This is a waste, especially if we know that there’s another thread that could be utilizing those FPUs.

Beyond this, there are other situations where allowing two threads to run per CPU core is useful: the most common one is when one thread is waiting for the result of a memory operation. CPUs are much, much, much faster than memory access is, so a CPU spends subjective lifetimes simply waiting for two numbers to arrive so that it can finally add them together. (This is the entire justification for multi-level caches.) While one thread is waiting for memory (perhaps due to a cache miss), the other may be able to proceed.

Of course, two threads sharing the same CPU core must also share many of that core’s hardware resources (such as the register file or the instruction decoder), so they will never run as concurrently as two threads on two separate cores can. The performance of the individual threads therefore suffers, even though the overall amount of work done by the whole system improves. (This is sometimes actually undesirable, such as when running latency-sensitive applications like high-frequency trading algorithms; in these scenarios, you’ll want to disable this hardware feature.)

Now that we understand all of this, we can see where the concept of priority comes into play when two threads run on the same CPU core: the higher-priority thread gets more of the shared resources:

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. … Instruction fetches alternate between the two threads. After fetching, the Power5 places instructions in the predicted path in separate instruction fetch queues for the two threads. […] On the basis of thread priorities, the processor selects instructions from one of the instruction fetch queues […]. … The Power5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. If both threads are at the lowest running priority, the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.

The article even gives examples of when this can be especially useful:

  • A thread is in a spin loop waiting for a lock. Software would give the thread lower priority, because it is not doing useful work while spinning.
  • A thread has no immediate work to do and is waiting in an idle loop. Again, software would give this thread lower priority.
  • One application must run faster than another. For example, software would give higher priority to real-time tasks over concurrently running background tasks.
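Putting the first of those bullet points into practice, a spin lock on such a CPU could drop the hardware thread’s priority while spinning and restore it once the lock is acquired. Here’s a minimal sketch in C11 (my own illustration, not code from the article; the inline assembly is PowerPC-specific):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void spin_lock(void) {
        /* While someone else holds the lock, spin at low priority so
           the sibling SMT thread gets the lion's share of the core. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            __asm__ volatile("or 1,1,1");  /* low priority while spinning */
        __asm__ volatile("or 2,2,2");      /* lock acquired: back to normal */
    }

    static void spin_unlock(void) {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }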

Implementations of the x86-64 architecture also feature similar simultaneous multi-threading capability: Intel brands it Hyper-Threading, while AMD’s Zen cores simply call it SMT (the earlier Bulldozer family used a related but distinct technique called Clustered Multi-Threading). x86 has a dedicated nop instruction for doing nothing, but to my knowledge nothing for signalling priority. There is, however, a separate pause instruction specifically for spin-wait loops; while it helps with SMT, it is an important performance optimization even in single-threaded operation, because it shields against the branch misprediction penalty incurred when the wait finally ends.
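For comparison, here’s what a minimal x86-64 spin-wait looks like with pause, via the _mm_pause() intrinsic (again a sketch of my own, under the same caveats as above):

    #include <immintrin.h>
    #include <stdatomic.h>

    /* Spin until *flag becomes nonzero. pause tells the CPU this is a
       spin-wait loop: it eases pressure on resources shared with a
       sibling SMT thread and softens the branch misprediction penalty
       when the wait finally ends. */
    static void spin_wait(atomic_int *flag) {
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            _mm_pause();
    }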