MentaBlog

Simplicity is everything

Inter-socket communication with less than 2 microseconds latency


Non-blocking I/O through selectors is the part of networking that I like the most. The Java NIO API is not easy, but once you understand the reactor pattern and abstract away its complexities you end up with a powerful and re-usable network multiplexer. The classic one-thread-per-socket approach does not scale, has a lot of overhead and almost always leads to complex code. It does not scale because threads have to compete for a limited number of CPU cores. Having 32 threads competing for 4 logical processors in a quad-core CPU does not make your code any faster; instead it sends your latencies through the roof. Threads are great for human beings, who are incapable of noticing latencies below 100 milliseconds, but for high-performance software that wants to operate at the microsecond level, uncontrolled threads are a recipe for bad latencies. The performance overhead caused by threads is huge, due to the context switches performed by the kernel scheduler and the lock contention incurred as threads try to share information without stepping on each other. Multithreaded programming is also hard to debug, because the thread execution order is unpredictable. Making changes when every corner of your code can hide a race condition is not just unproductive but unsafe. So to conclude: the classic one-thread-per-socket approach is inadequate for high-performance networking. Instead of handling every socket on its own thread, a socket multiplexer that serializes network access will do a better job. That’s the reactor pattern.
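To make the reactor pattern concrete, here is a minimal sketch of a single-threaded selector loop reading from a non-blocking DatagramChannel. The port and the message handling are placeholders of mine, not the code used for the numbers below:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

// Single-threaded reactor: one selector multiplexes every channel, so there are
// no locks and no context switches on the critical path.
public class ReactorSketch {

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();

        DatagramChannel channel = DatagramChannel.open();
        channel.configureBlocking(false);                   // selectors require non-blocking channels
        channel.bind(new InetSocketAddress(9999));          // placeholder port
        channel.register(selector, SelectionKey.OP_READ);

        ByteBuffer buffer = ByteBuffer.allocateDirect(256); // reused for every message: zero garbage

        while (true) {
            selector.select();                              // wait until a channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isReadable()) {
                    buffer.clear();
                    ((DatagramChannel) key.channel()).receive(buffer);
                    buffer.flip();
                    // process the message right here, on this same thread
                }
            }
        }
    }
}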

Disclaimer: There are good cases for using threads, as sometimes you cannot escape from slow work that needs to be done asynchronously. Other times you have a need for real parallelism. In these scenarios, thread pinning, lock-free queues, demultiplexers and other techniques can be used to maximize performance and simplify inter-thread communication. In a past article I talked about asynchronous logging using two threads and a lock-free queue. As you can see, threads are not always evil; just refrain from using them for high-performance network I/O. :)
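As an illustration of that kind of handoff, here is a simplified sketch using java.util.concurrent.ConcurrentLinkedQueue (which is lock-free) rather than the custom queue from that article:

import java.util.concurrent.ConcurrentLinkedQueue;

// The critical (selector) thread hands slow work to a consumer thread through a
// lock-free queue, so it never blocks on things like disk I/O.
public class AsyncHandoffSketch {

    private static final ConcurrentLinkedQueue<String> QUEUE = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            while (true) {
                String event = QUEUE.poll();     // non-blocking
                if (event != null) {
                    System.out.println(event);   // the slow work happens off the hot path
                } else {
                    Thread.yield();              // a real implementation might busy-spin on a pinned core
                }
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // called from the critical thread: just an enqueue, no lock taken
        QUEUE.offer("logging a latency outlier asynchronously");

        Thread.sleep(100);                       // give the consumer a chance to drain before the JVM exits
    }
}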

Exchanging UDP messages across JVMs

For communication among JVMs, in the same machine or across a network, we can use UDP packets. UDP has many advantages over TCP for inter-host communication inside a private network. Of course, if you cannot guarantee the quality of the network, the unreliability of UDP becomes unbearable, as packet drops are expensive to recover from. But on a controlled network, where you can guarantee the quality of your switches and keep packet drops to a minimum, the extra overhead of TCP becomes unnecessary. However, it is impossible to keep packet loss at 0%, even inside your private network, so a reliable UDP protocol is mandatory. There are many options for a reliable UDP protocol, but the one I like the most, which is little known outside the trading industry, is MoldUDP. What a reliable UDP protocol really does is transfer the responsibility of retransmission from the network layer (TCP) to the application layer (MoldUDP), removing the extra TCP overhead for good packets in exchange for a little extra overhead on the very few packets that do get dropped. It is a very good deal in terms of performance. Another great advantage of UDP over TCP is the support for broadcasting/multicasting packets at the switch level. TCP does not support broadcasting, and doing so at the application level is impractical.
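For completeness, the sending side can be as simple as the sketch below. The loopback address and port are placeholders; on a real network the target could be a multicast group address so the switch does the fan-out:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// JVM A pushes one fixed-size UDP message to JVM B.
public class UdpSenderSketch {

    public static void main(String[] args) throws IOException {
        DatagramChannel channel = DatagramChannel.open();

        // loopback placeholder; a multicast group address would let the switch
        // deliver the same packet to every subscriber
        InetSocketAddress target = new InetSocketAddress("127.0.0.1", 9999);

        ByteBuffer buffer = ByteBuffer.allocateDirect(256); // reused on every send: zero garbage
        buffer.clear();
        for (int i = 0; i < 256; i++) buffer.put((byte) i); // placeholder payload
        buffer.flip();

        channel.send(buffer, target);
        channel.close();
    }
}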

So if you code a single-threaded selector, take care of the GC by producing zero garbage, warm up your code so that the JIT kicks in, pin each thread to an isolated core, and so on, you can get the following numbers when sending a UDP message through loopback from JVM A to JVM B running on the same machine:

Iterations: 1,000,000 messages
Message Size: 256 bytes
Avg Time: 12,367 nanos
Min Time: 4525 nanos
Max Time: 104873 nanos
75%: avg=12257 max=12410 nanos
90%: avg=12293 max=12570 nanos
99%: avg=12338 max=13477 nanos
99.9%: avg=12356 max=20468 nanos
99.99%: avg=12364 max=22495 nanos
99.999%: avg=12366 max=64250 nanos
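The sketch below shows one way such figures can be produced from the raw per-message latencies. It is not the harness used above, and it assumes each percentile line reports the average and maximum over the best N% of samples:

import java.util.Arrays;

// Collects one latency sample per message and prints min/max plus percentile lines,
// assuming "N%: avg/max" means avg and max over the best N% of samples.
public class LatencyStatsSketch {

    private final long[] samples;
    private int count;

    public LatencyStatsSketch(int iterations) {
        samples = new long[iterations];
    }

    public void record(long nanos) {           // called once per message with (receiveTime - sendTime)
        samples[count++] = nanos;
    }

    public void print() {
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        System.out.println("Min Time: " + sorted[0] + " nanos");
        System.out.println("Max Time: " + sorted[count - 1] + " nanos");
        for (double pct : new double[] { 75, 90, 99, 99.9, 99.99, 99.999 }) {
            int upTo = Math.max(1, (int) Math.round(count * pct / 100.0));
            long sum = 0;
            for (int i = 0; i < upTo; i++) sum += sorted[i];
            System.out.printf("%s%%: avg=%d max=%d nanos%n", pct, sum / upTo, sorted[upTo - 1]);
        }
    }
}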

Optimizing your selector for maximum performance

There are many tricks you can use to optimize your selector. I will leave them for the second part of this article and just state my latest results for now:

Iterations: 1,000,000
Message Size: 256 bytes
Avg Time: 1,680 nanos
Min Time: 1379 nanos
Max Time: 7020 nanos
75%: avg=1618 max=1782 nanos
90%: avg=1653 max=1869 nanos
99%: avg=1675 max=1964 nanos
99.9%: avg=1678 max=2166 nanos
99.99%: avg=1679 max=5094 nanos
99.999%: avg=1680 max=5638 nanos

Note: If you are actually sending packets across the network, the wire time must be taken into account. Doing the math, a 256-byte packet traveling through 10 Gigabit Ethernet will take at least 382 nanoseconds to go from NIC to NIC (ignoring the switch hop). If your Ethernet is 1 Gigabit, then we are talking about a considerable delay of almost 4 microseconds. Another important optimization that can save a couple of microseconds when going over the wire is kernel bypass, supported by some good network cards on the market.
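For the curious, the serialization-delay arithmetic looks like the sketch below. The framing overhead constant is an assumption of mine, so the printed figure will differ from the 382 nanoseconds quoted above, which presumably counts additional per-packet overhead:

// Serialization delay only: the time the packet spends going onto the wire.
// The overhead constant is an assumption (UDP + IP headers plus Ethernet framing,
// preamble and inter-frame gap); the exact figure depends on what you include.
public class WireTimeSketch {

    public static void main(String[] args) {
        int payloadBytes  = 256;
        int overheadBytes = 8 + 20 + 18 + 20;          // assumed UDP, IP, Ethernet framing, preamble + gap
        long linkBitsPerSecond = 10_000_000_000L;      // 10 Gigabit Ethernet; use 1_000_000_000L for 1 GbE

        double bits  = (payloadBytes + overheadBytes) * 8.0;
        double nanos = bits / linkBitsPerSecond * 1_000_000_000.0;

        System.out.printf("wire time ~ %.0f nanos per hop%n", nanos);
    }
}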

Conclusion

UDP messaging is the way to go if you want ultra-low-latency inter-socket communication in Java. By using a Java NIO selector you are not just sustaining the best performance but also making your code simpler and safer. Threads can introduce a lot of latency if left untamed, and your code can gain a lot by adopting the single-threaded reactor pattern. When you do need to process work asynchronously, a proper demux strategy with minimal impact on the critical selector thread must be used; that will be the topic of future articles.

3 Comments

  1. Pingback: Inter-socket communication with less than 2 microseconds latency | MentaBlog | G3nt00r's Blog

  2. Sergio, those are really interesting results you’ve got, congrats :-)

    Can you give a little bit more info on your test environment (CPU, RAM, OS, Java version, etc.)? Also, I wonder if those 256 bytes of data were always the same.

    Also, by saying loopback, do you mean the lo interface, or did you use a machine with two cross-connected network interfaces?

  3. Thanks, Pavel. I have been to Minsk once. :)

    The machine is here: http://mentablog.soliveirajr.com/lab/

    Same bytes. I might be wrong but I believe it won’t make a difference if you send random bytes.

    lo interface. cross connected NICs would be an interesting test, after subtracting the wire time from the numbers.
