This post is focused on improving performance and measuring the tail latency of the read and write operations of the lockless implementation of a bounded MPMC ring buffer. That said, this entry will be structured differently than the last1; rather than reading like detective notes, it will compare the initial naive implementation with the final result and leave out the parts where I’m still figuring things out, in an effort to create an artifact that reads more coherently. The writing is not generated by AI and if any code is, it’s called out.
Clarifying Changes
I wanted to take a moment to walk through the core changes from the original naive implementation to the most recent1 to clarify how things work and what the problems were along the way.
The first core change was relatively obvious, removing the size variable from the array to remove race conditions on the indices. This created a new problem, where the index calculations for full and empty could become ambiguous states:
let capacity = 10;
let ridx = 0; // read index
let widx = 0; // write index The question was clearing up whether or not the above meant empty, or whether widx had already wrapped around making it full; the wasted slot solution where empty was equality and full was when widx + 1 == ridx was what followed, also including a flag marking whether or not the particular space in the buffer was ready for a read or write, preventing race conditions. This got rid of some ambiguity but not all:
Slot {
initialization: AtomicBoolean,
data: ...
} This made it possible for the ABA problem to occur:
block-beta
columns 1
block:Idx
R["Read Index"]
W["Write Index"]
end
block:Arr1
A["1 {init: true, val: 1}"]
B["2 {init: true, val: 2}"]
C["3 {init: true, val: 3}"]
D["4 {init: false, val: 0x0}"]
end
block:Threads
t1["Thread 1: sleep @ CAS op"]
t2["Thread 2: working"]
t3["Thread 3: working"]
tn["Thread n: working"]
end
space
block:Threads2
tt1["Thread 1: wake @ CAS op"]
tt2["Thread 2: finished"]
tt3["Thread 3: finished"]
ttn["Thread n: finished"]
end
block:Arr2
A2["1 {init: false, val: 1}"]
B2["2 {init: false, val: 2}"]
C2["3 {init: false, val: 3}"]
D2["4 {init: false, val: 4}"]
end
block:Idx2
R2["Read Index"]
W2["Write Index"]
end
space
R --> A
W --> D
t1 --> A
R2 --> A2
W2 --> A2
tt1 --> A2 As shown above, in the case of multiple read threads, it was possible to put a thread to sleep after checking its initialization state (Thread 1 in the above example). If it were to sleep long enough, represented by the upper portion of the example, for a full loop to have occurred only to wake up before a write happened to reinitialize it, the bottom portion of the example, it would have been able to read uninitialized data. The fix here was to add more information to the initialization state in the form of a stamp, removing the remaining ambiguity. This was done ultimately by marking a single bit on the index to indicate whether or not it wrapped; the stamp was responsible for marking the space a read or write, encoding the proper state as well as when it could be used. As an example:
0b101 == 5 // capacity
0b1000 == 8 // next power of two aka mbc
0b0101 // read index at capacity
0b1101 // read index at capacity on another loop Every lap performs an xor operation against the index and mbc to flip the bit at the non-counting position. By doing this, it encodes whether or not a lap had occurred so that the stamp could initialize the state for the next operation and the index would be protected from a local value where the counter was in the right position but a lap had occurred, removing all ambiguity.
Branchless Index Computation
The actual indexing strategy is solid but it can be computed efficiently in all branching cases. We left the last post with a conditional statement checking whether or not the Ring Buffer had reached its capacity before deciding to wrap back to zero. That same calculation can be performed in a branchless manner such that the conditional check is a series of math operations; this matters because a computer will do pre-work in loading a set of instructions to be ready to execute but in order for it to work best it needs to load the right set of operations. Conditional branches can upset this balance by making the CPU guess which branch is more likely to execute then fill up its execution pipeline with those instructions. When it guesses wrong, the CPU throws out its pipeline of instructions and has to refill them which costs cycles. A branchless computation performs tricks to remove branch instructions such that a conditional, in either case, performs the same computation even though the result may be different. For example:
- IF (Instruction Fetch) - get opcode from PC (program counter)
- ID (Instruction Decode) - prepare register file for execution
- EX (Execute) - execute opcode
- MEM (Memory Access) - perform memory load/store operations as required
- WB (Write Back) - update internal registers
block-beta
columns 5
%% Define Style Classes
classDef gap stroke-width:2px,stroke-dasharray: 5 5;
classDef title fill:transparent,stroke:none,font-weight:bold,font-size:16px;
Title1["Mispredicted Pipeline"]:5
block:S1:5
columns 5
IF1["IF<br>Instr D"]
ID1["ID<br>Instr C"]
EX1["EX<br>Branch"]
MEM1["MEM<br>Instr B"]
WB1["WB<br>Instr A"]
end
Title2["Misprediction Detected"]:5
block:S2:5
columns 5
IF2["IF<br>Instr E"]
ID2["ID<br>Instr D"]
EX2["EX<br>Instr C"]
MEM2["MEM<br>Branch"]
WB2["WB<br>Instr B"]
end
Title3["Pipeline Flushed"]:5
block:S3:5
columns 5
IF3["IF<br>Correct Path 1"]
ID3["ID<br>--- GAP ---"]
EX3["EX<br>--- GAP ---"]
MEM3["MEM<br>--- GAP ---"]
WB3["WB<br>Branch"]
end
Title4["Refill"]:5
block:S4:5
columns 5
IF4["IF<br>Correct Path 2"]
ID4["ID<br>Correct Path 1"]
EX4["EX<br>--- GAP ---"]
MEM4["MEM<br>--- GAP ---"]
WB4["WB<br>--- GAP ---"]
end
%% --- APPLY STYLES
class Title1,Title2,Title3,Title4 title
class ID3,EX3,MEM3,EX4,MEM4,WB4 gap Since a misprediction happens every operations the impact is felt more frequently when is small, so it’s possible to argue this change is not very meaningful; it does increase tail latency though, which is the more famous issue with this particular style of ring buffer. In order to resolve it, the conditional check if i + 1 >= self.capacity {0 and flip bit} else {index + 1} needs to become branchless. Here’s how it’s done:
let idx = self.read_idx.load(Ordering::Acquire);
let mcb = (self.capacity + 1).next_power_of_two();
// mask the lower bits responsible for the counter
let i = idx & (mcb - 1);
// save true or false to 0 or 1 in branchless computation
let at_capacity = (i + 1 >= self.capacity) as usize;
let new_idx = ((idx + 1) & at_capacity.wrapping_sub(1))
| (at_capacity * ((idx & mcb) ^ mcb)); - save the result of the conditional operation as an integer
- perform the
wrapping_subon the result- when we’ve reached capacity the result is which paired with the logical and operation flips all the
idxbits to zero resulting in:0 | (at_capacity * ((idx & mcb) ^ mcb))- since
at_capacityis it letsmcb & idxset all the counter bits back to then set the lapping bit with thexoroperation pairing it with the logicalorat the end - the total result here is either or based on pure math alone
- since
- when capacity is not reached the result is which is binary for
0b111111111...so the first part results in(idx + 1)the second part becomes and so(idx + 1) | 0is justidx + 1
- when we’ve reached capacity the result is which paired with the logical and operation flips all the
In both cases the same exact math is used to determine the result of the index which means no branch predictions in the CPU pipeline which can cause gaps in execution.
I didn’t have the foresight to add USDT probes to measure tail latency before I implemented the change but moving forward they’ll be included.
Adding USDT Probes
Since I am running on Apple Silicon, I can use DTrace to measure tail latency with USDT probes but they will also work with tools like perf and ebpf in the Linux ecosystem. Next is defining the probes and then placing them in the right locations. This brings up the question, how much overhead is there to the probes? To my understanding, negligible amounts!
The probes when turned off are essentially nop instructions which are so cheap they’re effectively meaningless. This is paired with some metadata that names the probes so when the probes are turned on, the nop instructions act as locations for DTrace to inject a breakpoint while running its script. This happens via atomic instructions which are fast and safe. This is the limit to my understanding but when the DTrace script quits running it sets the injected instruction back to a nop. If you are worried about performance costs of DTrace, it is possible to use it without injecting probes into your program using almost the same method I describe below, but it requires that you know the name of your function and get its entry and exit points which change dynamically in rust. For me running sudo dtrace -ln 'pid$target:::entry' -c './target/release/limitless' results in:
...
659944 pid20512 limitless limitless::RingBuffer$LT$T$GT$::read::h1e77ee054a38d049 entry
659945 pid20512 limitless limitless::RingBuffer$LT$T$GT$::write::hdc9e78a6b8c28aaf entry
... I will not use those and instead create deterministically named probes at the exact points measurement is desired (which are probably the same tbh):
#[usdt::provider]
mod limitless_probes {
fn read__start(_: &usdt::UniqueId, idx: u64) {}
fn read__done(_: &usdt::UniqueId, idx: u64, ok: u8) {}
fn write__start(_: &usdt::UniqueId, idx: u64) {}
fn write__done(_: &usdt::UniqueId, idx: u64, ok: u8) {}
} The first step is defining the providers, which I think of as probe data types with names, using the usdt library’s macro. Next is placing the probes at the read and write functions’ entry and return points.
let probe_id = usdt::UniqueId::new();
limitless_probes::read__start!(|| &probe_id); This is placed at the function’s entry site which defines a new probe_id letting DTrace know it’s an event that should be tied to that ID.
limitless_probes::read__done!(|| (&probe_id, i as u64, 0u8)); This is set right before returning, tracking the same probe_id as the start to tie the events together. It then records i which is the index that was used in the buffer, and the 0 here indicates that we’re on the error path whereas the 1 used before the other return call indicates the success path. The same exact setup is done for the writes.
Lastly, the library sets it up to export the module for use in the main binary so anyone can use the probes without needing to add the usdt crate as a dependency:
// lib.rs
pub use usdt::register_probes;
// main.rs
fn main() {
// top of main
limitless::register_probes().expect("failed to register USDT probes");
} The script to measure tail latency bundles both error and success paths:
limitless_probes*:::read-start {
self->read_ts = timestamp;
}
limitless_probes*:::read-done /self->read_ts/ {
@read_lat = quantize(timestamp - self->read_ts);
self->read_ts = 0;
}
limitless_probes*:::write-start {
self->write_ts = timestamp;
}
limitless_probes*:::write-done /self->write_ts/ {
@write_lat = quantize(timestamp - self->write_ts);
self->write_ts = 0;
} Not being an expert with DTrace I used AI to help generate the script. The self->read_ts is basically a variable where the self bit makes it thread safe by making it local to the thread then each one is assigned a monotonic timestamp. The read-done side is checking that self->read_ts is not 0 and then performs the quantize function which is bucketing individual results by a power of two in @read_lat before resetting the local variable back to zero. The same exact thing is done for the writes and the end result when run is a histogram like the following:
sudo dtrace -c ./target/release/limitless -s tail_latency.d value ------------- Distribution ------------- count
64 | 0
128 | 25
256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 17932
512 |@@@@@@@@ 4943
1024 |@@@ 1818
2048 | 2
4096 | 3
8192 | 1
16384 | 3
32768 | 0
value ------------- Distribution ------------- count
64 | 0
128 | 2
256 |@@@@@@@@@ 2950
512 |@@@@@@@@@@@@@@@@@@@@@@@@ 8316
1024 |@@@@@@@ 2510
2048 | 3
4096 | 2
8192 | 2
16384 | 0
32768 | 2
65536 | 0
131072 | 1
262144 | 0 The one on top shows the read distribution and the one below is the write distribution. The value is which timestamp bucket it fell in, recorded in nanoseconds. This is great because it’s showing the outliers which is what tail latency is all about, the tail of the distribution. The reads mostly operate within and writes fall largely in which does not look too surprising. However, the outliers show reads at and writes at which tell the story.
The read tail latency is worse than where it fits in the distribution and the write tail latency is worse by . The reason to use USDT probes and DTrace is that it’s measuring the release build’s tail latency and can be toggled at anytime without affecting the program’s normal running state2. It can be distributed to anyone this way which means the library can be instrumented seamlessly into anyone’s production builds if they wish to gain access to the probes for measuring them against their workloads.
This is a good mechanism to look at tail latency but the results above don’t paint a complete picture. For one, it’s measuring a specific workload (4 read 4 write threads) for a specific architecture. Second, it’s just a single run and really should be performed multiple times to get statistical relevance. Lastly, given our knowledge that this is a lockless and not a lock free data structure, the tail latency is technically unbounded—the OS could sleep a thread that others are waiting on and spin wait indefinitely. There is no guarantee that any thread could make forward progress. That said, it is still better to understand the general characteristics of a system’s performance given what kind of workload it will operate under and the tail latency script gives anyone a mechanism to do that3. If having a bound tail latency is a strict requirement, then a different implementation is needed.
Running the criterion benchmarks, it shows a very modest regression which indicates that the probes are worth the cost!
Adding Backoff
After adding the Backoff mechanism from crossbeam, both libraries have effectively the same performance and latency on the same benchmark now. The backoff is useful because it prevents false sharing storms as core counts increase and contention grows. With respect to multi-threading, we’ve essentially hit our limits for work happening in parallel with this implementation. Higher core counts will increase contention resulting in higher latencies particularly at the tail. This makes intuitive sense:
flowchart
core1["Core"]
core2["Core"]
core3["Core"]
core4["Core"]
core5["Core"]
core6["Core"]
core7["Core"]
core8["Core"]
read["Read Index"]
write["Write Index"]
read ~~~ write
core1 ~~~ core2 ~~~ core3 ~~~ core4 ~~~ core1
core5 ~~~ core6 ~~~ core7 ~~~ core8 ~~~ core5
core1 & core2 & core3 & core4 ---- read
core5 & core6 & core7 & core8 ---- write Updating Loom
The last thing I’d like to do before moving on is to update the loom tests. Last post spent a good amount of time learning how to run and understand loom where we ran into the problem of it running endlessly. I’d like to modify the tests to specifically check three different combinations that all use two threads instead of three, to significantly cut down on the combination of testable conditions:
- a read/write test which starts empty
- a write/write test which starts empty
- a read/read test which starts full
What’s Next
After this implementation, I hope it is very clear why you would want different implementations for different purposes. The limitation on this lockless MPMC bounded ring buffer is that it only scales with threads so far until it becomes a contention nightmare.
An SPSC buffer can perform even faster reads and writes by optimizing a couple things:
- pinning to specific CPU cores so that caches are maintained and hot
- removing index checks since there is only ever one other thread
Since there is no contention it removes issues with tail latency but it sacrifices throughput. So we’ll either explore this next but more likely we’ll move onto a lock-free version. Thanks for reading!