Inter-thread communication with 2-digit nanosecond latency


With the proliferation of multi-core processors and the high cost of colocation, it is tempting to run more than one critical application on the same machine, using a thread affinity library to pin each application to its own isolated core. However, multithreading can be a big source of latency due to locking. The solution is to make your threads communicate by exchanging messages through a lock-free queue.

A lock-free queue is not just super fast; it can also reduce the complexity of your code by removing the need for synchronization and avoiding common pitfalls of multithreaded programming: race conditions and deadlocks. It also maintains an internal pool of objects that you can use as transfer objects for raw data (i.e. primitives), which leads to zero garbage creation.

For our benchmarks I am going to use MentaQueue, a lock-free queue I created, inspired by the ideas behind the Disruptor. The example below shows how to send a message from thread A (producer) to thread B (consumer):

		final BatchingQueue<StringBuilder> queue = new AtomicQueue<StringBuilder>(1024,
				new Builder<StringBuilder>() {
			@Override
			public StringBuilder newInstance() {
				return new StringBuilder(1024);
			}
		});

		Thread a = new Thread(new Runnable() {

			@Override
			public void run() {

				StringBuilder sb;

				while(true) { // the main loop of the thread

					// (...) do whatever you have to do here...

					// and whenever you want to send a message to
					// the other thread you can just do:
					sb = queue.nextToDispatch();
					sb.setLength(0);
					sb.append("Hello!");
					queue.flush();

					// you can also send in batches to increase throughput:
					sb = queue.nextToDispatch();
					sb.setLength(0);
					sb.append("Hi!");

					sb = queue.nextToDispatch();
					sb.setLength(0);
					sb.append("Hi again!");

					queue.flush(); // dispatch the two messages above...
				}
			}
		}, "Thread-Producer");

		Thread b = new Thread(new Runnable() {

			@Override
			public void run() {

				while (true) { // main loop of the thread

					// (...) do whatever you have to do here...

					// and whenever you want to check if the producer
					// has sent you a message you just do:

					long avail = queue.availableToPoll();
					if (avail > 0) {
						for (int i = 0; i < avail; i++) {
							StringBuilder sb = queue.poll();
							// (...) do whatever you want to do with the data
						}
						queue.donePolling();
					}
				}
			}
		}, "Thread-Consumer");

Benchmarking

For the benchmarks I coded the simple test described below. The source code can be seen here.

  • A long (8 bytes) with the timestamp (System.nanoTime()) is sent from thread A to thread B. Thread B gets the long and calculates the one-way trip time.
  • The full process is timed: the time it takes for thread A (i.e. producer) to get a transfer object from the queue, copy the timestamp into it and flush the queue; plus the time it takes for thread B (i.e. consumer) to poll the object from the queue, read the long and call donePolling().
  • The test uses another queue to send the message back to thread A. That’s necessary so that thread A only dispatches the next message when it hears back from thread B, which guarantees that we are timing the latency of a single message in isolation. (Because of batching, sending two messages together with a single flush() would otherwise be faster than sending one message at a time.) A sketch of this ping-pong loop appears right after this list.
  • I am warming up with 100 million messages before the test begins.
  • I am using thread affinity to isolate thread A and thread B in their own cores.
  • The machine specification used for the tests can be seen here.
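
To make the procedure concrete, below is a minimal sketch of that ping-pong measurement loop, written against the queue API shown above. It is illustrative only: the LongHolder transfer object, the queue sizes, the iteration count and the omission of the warm-up phase are my own choices, not the actual benchmark source (which is linked above).

	// Illustrative ping-pong sketch, not the actual benchmark source.
	// MentaQueue imports (BatchingQueue, AtomicQueue, Builder) are omitted,
	// as in the example above; LongHolder is a hypothetical transfer object.
	public class PingPongSketch {

		static class LongHolder { long value; }

		static Builder<LongHolder> builder() {
			return new Builder<LongHolder>() {
				@Override
				public LongHolder newInstance() { return new LongHolder(); }
			};
		}

		public static void main(String[] args) throws InterruptedException {

			final int runs = 1000000; // no warm-up here, unlike the real test
			final BatchingQueue<LongHolder> toB = new AtomicQueue<LongHolder>(1024, builder());
			final BatchingQueue<LongHolder> toA = new AtomicQueue<LongHolder>(1024, builder());

			Thread b = new Thread(new Runnable() {
				@Override
				public void run() {
					long count = 0, sum = 0;
					while (count < runs) {
						long avail = toB.availableToPoll();
						if (avail == 0) continue; // busy-spin for lowest latency
						for (int i = 0; i < avail; i++) {
							LongHolder ts = toB.poll();
							sum += System.nanoTime() - ts.value; // one-way trip time
							count++;
						}
						toB.donePolling();
						toA.nextToDispatch(); // echo back so A sends the next message
						toA.flush();
					}
					System.out.println("Avg one-way: " + (sum / (double) count) + " nanos");
				}
			}, "Thread-Consumer");

			b.start();

			// thread A (the producer) is the main thread in this sketch
			for (int i = 0; i < runs; i++) {
				LongHolder ts = toB.nextToDispatch();
				ts.value = System.nanoTime();
				toB.flush();
				while (toA.availableToPoll() == 0); // wait for the echo
				toA.poll();
				toA.donePolling();
			}

			b.join();
		}
	}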

The results:

Messages: 100,000,000
Avg Time: 87.1 nanos
Min Time: 54 nanos
Max Time: 4364 nanos
75%: avg=80 max=93 nanos
90%: avg=83 max=111 nanos
99%: avg=86 max=119 nanos
99.9%: avg=87 max=130 nanos
99.99%: avg=87 max=186 nanos
99.999%: avg=87 max=1144 nanos

Hyper-threading

For maximum speed it is important to hit the processor caches (L1, L2 and L3) as much as possible. On Intel processors, each core gets its own dedicated L1 cache, the fastest one. Hyper-threading allows you to run two threads on the same core so that they share the same L1 cache. We can try to pin both threads, the producer and the consumer, to the same core and see what happens.
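
The affinity code itself is not shown in this post, so here is a hedged sketch of one way to pin the two threads to the two hyper-threads of the same physical core, using the open-source OpenHFT Java-Thread-Affinity library as an illustrative substitute for whatever affinity library the benchmark actually used:

	import net.openhft.affinity.AffinityLock;
	import net.openhft.affinity.AffinityStrategies;

	// Illustrative only: pin the producer to a free CPU and the consumer to
	// the hyper-threaded sibling on the same physical core.
	public class SameCorePinning {

		public static void main(String[] args) throws InterruptedException {

			// pin the producer (the main thread in this sketch) to a free CPU
			final AffinityLock producerLock = AffinityLock.acquireLock();

			Thread consumer = new Thread(new Runnable() {
				@Override
				public void run() {
					// SAME_CORE requests a CPU on the same physical core as the
					// producer, so both threads share that core's L1 cache
					AffinityLock consumerLock =
							producerLock.acquireLock(AffinityStrategies.SAME_CORE);
					try {
						// (...) consumer loop from the first example goes here
					} finally {
						consumerLock.release();
					}
				}
			}, "Thread-Consumer");

			consumer.start();

			// (...) producer loop from the first example goes here
			consumer.join();
			producerLock.release();
		}
	}

With both threads pinned to the same core, the results are below: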

Messages: 100,000,000
Avg Time: 49.33 nanos
Min Time: 26 nanos
Max Time: 81602 nanos
75%: avg=47 max=52 nanos
90%: avg=48 max=54 nanos
99%: avg=49 max=58 nanos
99.9%: avg=49 max=63 nanos
99.99%: avg=49 max=135 nanos
99.999%: avg=49 max=1088 nanos

Even though hyper-threading gives a big boost to inter-thread communication, it might not be the best approach, because it makes two threads share the processing power of a single core. In other words, care must be taken not to end up in a situation where your threads are very fast at talking to each other but slower at doing everything else.

Conclusion

Instead of using locks and synchronization, you have the option of exchanging messages among your threads through a super-fast, lock-free batching queue. The goal is to reduce complexity, bugs, garbage and latency in multithreaded programming. With that approach, latencies below 100 nanoseconds can be obtained. If you pin both threads to the same core to take advantage of hyper-threading, even smaller latencies, below 50 nanoseconds, are possible. Another important feature of this approach is that it produces zero garbage, as all the transfer objects are pooled inside the queue's circular array.
