
Asynchronous logging versus Memory Mapped Files


One of the challenges of HFT systems is to minimize I/O latency. Whenever you want to write something to disk or send something over the network, you risk introducing latency in your critical thread. One solution to this problem is NIO (non-blocking I/O). For disk I/O you can use a memory-mapped file and for network I/O you can use a non-blocking channel. NIO is supposed to provide asynchronous output (i.e. writes), meaning you pass the bytes to the OS (i.e. copy them to a direct byte buffer) and hope that the OS will make its best effort to do its job.

One common requirement for a trade gateway is to record everything it does in a log file. For that task, the time it takes for the log message to reach the file matters little, but the time it takes for the gateway to write the log message becomes of paramount importance. That is the latency we must eliminate.

In this article I will investigate how asynchronous a write to a memory-mapped file really is, and then show how you can use a lock-free queue to make the write operation as asynchronous as possible.

ByteBuffer Latency

The minimum that needs to be done for logging is copying bytes. To establish the minimum latency that must be paid when logging a message, we start by benchmarking how long it takes to copy a message to a byte buffer. This operation is illustrated below:

		// requires: import java.nio.ByteBuffer;
		String msg = "This is a log message not so small that you can use to test. I hope you have fun and it works well!";
		byte[] msgBytes = msg.getBytes();
		
		ByteBuffer bb = ByteBuffer.allocate(1024); // big enough for any test message
		
		long start = System.nanoTime();
		bb.put(msgBytes); // the copy being measured
		long end = System.nanoTime();
		
		System.out.println("Time: " + (end - start));

When you allocate a big enough ByteBuffer and repeat this operation a couple of million times after proper warm-up, you get the following results: (the full test code is here)

Log Messages: 4,012,000
Avg Time: 26.42 nanos
Min Time: 22 nanos
Max Time: 9828 nanos
75%: avg=24 max=26 nanos
90%: avg=25 max=28 nanos
99%: avg=25 max=54 nanos
99.9%: avg=26 max=143 nanos
99.99%: avg=26 max=190 nanos
99.999%: avg=26 max=532 nanos
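
The benchmark structure itself is simple: run the copy a million times to warm up the JIT, then record a few million measured passes and report percentiles. Since the linked test code carries the details, here is only a minimal sketch of that kind of harness; the class name, iteration counts and percentile math are illustrative, not taken from the original test:

		import java.nio.ByteBuffer;
		import java.util.Arrays;
		
		public class CopyBench {
		
			private static final int WARMUP = 1000000;   // passes discarded before measuring
			private static final int MEASURED = 4012000; // passes actually recorded
		
			public static void main(String[] args) {
				byte[] msgBytes = ("This is a log message not so small that you can use"
						+ " to test. I hope you have fun and it works well!").getBytes();
				ByteBuffer bb = ByteBuffer.allocate(1024);
				long[] times = new long[MEASURED];
		
				for (int i = 0; i < WARMUP + MEASURED; i++) {
					bb.clear(); // reuse the same buffer; reset position outside the timed section
					long start = System.nanoTime();
					bb.put(msgBytes);
					long end = System.nanoTime();
					if (i >= WARMUP) times[i - WARMUP] = end - start;
				}
		
				Arrays.sort(times);
				long sum = 0;
				for (long t : times) sum += t;
				System.out.println("Avg Time: " + (sum / (double) MEASURED) + " nanos");
				System.out.println("Min Time: " + times[0] + " nanos");
				System.out.println("Max Time: " + times[MEASURED - 1] + " nanos");
				System.out.println("99%: max=" + times[(int) (MEASURED * 0.99) - 1] + " nanos");
			}
		}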

MappedByteBuffer Latency

Modern operating systems support memory-mapped I/O, where a large file can be made available to a program as if it were loaded in memory. Of course the file is not entirely loaded in memory and the OS has to perform some kind of paging under the hood, but the programmer sees it as just a big byte array. The way this is accomplished in Java is through a MappedByteBuffer. After you map a file to a MappedByteBuffer, you can not only read from it but also write to it, and the data will later be persisted to disk by the OS asynchronously. Below is an example:

		// requires: import java.io.File; import java.io.RandomAccessFile;
		// import java.nio.MappedByteBuffer; import java.nio.channels.FileChannel.MapMode;
		File file = new File("mMapFile.txt");
		RandomAccessFile raf = new RandomAccessFile(file, "rw");
		MappedByteBuffer mappedBuffer = raf.getChannel().map(MapMode.READ_WRITE, 0, 1024);
		
		long start = System.nanoTime();
		mappedBuffer.put(msgBytes); // same copy as before, but the target is a file
		long end = System.nanoTime();
		
		System.out.println("Time: " + (end - start));

As you can see, the file becomes a ByteBuffer. When you map a big enough MappedByteBuffer and repeat the operation above a couple of million times after proper warm-up, you get the following results: (the full test code is here)

Log Messages: 4,012,000
Avg Time: 53.52 nanos
Min Time: 19 nanos
Max Time: 20061 nanos
75%: avg=22 max=24 nanos
90%: avg=23 max=25 nanos
99%: avg=34 max=803 nanos
99.9%: avg=42 max=1866 nanos
99.99%: avg=52 max=14301 nanos
99.999%: avg=53 max=16237 nanos
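
One detail the snippet above glosses over: a 1024-byte mapping fills up after the first few messages. For a multi-million-message run, the mapping must be big enough for everything you intend to write (or you remap a fresh region whenever the current one fills). A minimal sketch of the sizing, assuming an upper bound of 128 bytes per message; the constants are illustrative, and note that FileChannel.map() is limited to 2 GB per region:

		// requires the same imports as the snippet above
		int maxMsgSize = 128;   // assumed upper bound per log message
		int messages = 4012000; // how many messages the test will write
		long fileSize = (long) messages * maxMsgSize; // ~490 MB, below the 2 GB map limit
		
		RandomAccessFile raf = new RandomAccessFile(new File("mMapFile.txt"), "rw");
		MappedByteBuffer mappedBuffer = raf.getChannel().map(MapMode.READ_WRITE, 0, fileSize);
		// every put() now just advances the buffer position; the OS
		// writes the dirty pages back to disk on its own schedule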

Notice that the MappedByteBuffer is on average twice as slow as a regular ByteBuffer, with many more outliers above the 99th percentile.

Lock-free Queue Latency

Instead of relying on the OS for non-blocking file I/O, let's take matters into our own hands and implement our own asynchronous logging system. You may have noticed by now that asynchronous logging is just a regular producer/consumer problem, and for that we can use a super-fast, garbage-free and lock-free queue. The asynchronous logging problem was actually the motivation behind MentaQueue. Below is an example:

		Builder<ByteBuffer> builder = new Builder<ByteBuffer>() {
			@Override
			public ByteBuffer newInstance() {
				// max log message size is 1024
				return ByteBuffer.allocate(1024);
			}
		};
		
		final BatchingQueue<ByteBuffer> queue = 
				new AtomicQueue<ByteBuffer>(1024 * 1024, builder);

		final WaitStrategy consumerWaitStrategy = new ParkWaitStrategy();
		
		Thread asyncLogger = new Thread(new Runnable() {

			@Override
			public void run() {
				
				while (true) {

					long avail;
					while((avail = queue.availableToPoll()) == 0) {
						consumerWaitStrategy.waitForOtherThread();
					}
					
					consumerWaitStrategy.reset();
					
					for(int i = 0; i < avail; i++) {
						ByteBuffer bb = queue.poll();
						bb.flip();
						if (bb.remaining() == 0) {
							// EOF
							return;
						} else {
							// log your byte buffer here
							// any way you want since you
							// are not disrupting the main
							// thread anymore...				
						}
						bb.clear();
					}
					
					queue.donePolling();
				}
			}
		}, "AsyncLogger");
		
		asyncLogger.start();
		
		long start = System.nanoTime();
		
		ByteBuffer bb;
		while((bb = queue.nextToDispatch()) == null) {
			// busy spin in case queue is full
		}
		bb.put(msgBytes);
		queue.flush(true); // true = lazySet()
		
		long end = System.nanoTime();
		
		System.out.println("Time: " + (end - start));
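
On the consumer side you are free to use plain blocking I/O, since only the background thread pays for it. A minimal sketch of what could replace the "log your byte buffer here" comment above, assuming a FileChannel opened once before the loop (the channel setup is illustrative, not part of MentaQueue):

		// opened once before the while(true) loop, e.g.:
		// FileChannel channel = new FileOutputStream("async.log", true).getChannel();
		
		try {
			while (bb.hasRemaining()) {
				channel.write(bb); // blocking write: only the logger thread waits here
			}
		} catch (IOException e) {
			e.printStackTrace(); // never let an I/O error reach the producer thread
		}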

If you execute the producer code above (the section between the two System.nanoTime() calls) a couple of million times after proper warm-up, you get the following results: (the full test code is here)

Log Messages: 4,012,000
Avg Time: 30.95 nanos
Min Time: 23 nanos
Max Time: 5194 nanos
75%: avg=27 max=31 nanos
90%: avg=28 max=39 nanos
99%: avg=30 max=67 nanos
99.9%: avg=30 max=117 nanos
99.99%: avg=30 max=163 nanos
99.999%: avg=30 max=522 nanos

Note that it comes very close to the raw ByteBuffer latencies. One of the reasons it is able to deliver this performance is the lazySet() operation behind the queue.flush(true) call above, which I will describe in the next section.

Atomic variable lazySet operation

If you change the flush(true) flag in the code above to false, the queue uses the AtomicLong set() operation instead of the faster lazySet() one. Below are the results with this change:

Log Messages: 4,012,000
Avg Time: 44.78 nanos
Min Time: 29 nanos
Max Time: 6118 nanos
75%: avg=38 max=48 nanos
90%: avg=41 max=63 nanos
99%: avg=44 max=97 nanos
99.9%: avg=44 max=120 nanos
99.99%: avg=44 max=163 nanos
99.999%: avg=44 max=1561 nanos

To understand the difference between set() and lazySet() we must examine what the Java memory model says about a primitive variable shared among different CPU cores. Each core has its own local cache where it can store variables for faster access. If another thread, running on another core, comes along and changes this variable, the thread with the cached version will not see the change. To solve this visibility problem, the Java memory model offers the volatile keyword to force reads and writes to "bypass" the cache and go to main memory. Of course bypassing the cache imposes a latency cost, and lazySet() can be used to mitigate it. (Note: the cache is not really bypassed; instead a memory barrier is used.)

The Java atomic variables (AtomicLong, AtomicInteger, etc.) have get() and set() methods that behave like reads and writes of volatile variables, with the additional feature of a semi-volatile write. When you change the value of an atomic variable through its lazySet() method, the value is written to the local cache of the core but not immediately to main memory. As a result, it can take an indefinite amount of time for the change to be flushed out to main memory so that other threads can see the new value.
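
In code, the difference is a single method call on the producer's side when it publishes its sequence. A minimal single-writer sketch (the field names are illustrative; inside the queue, flush(true) versus flush(false) boils down to the same choice):

		// requires: import java.util.concurrent.atomic.AtomicLong;
		final AtomicLong offerSequence = new AtomicLong(0); // read by the consumer thread
		
		long next = offerSequence.get() + 1;
		// ... fill the queue slot for sequence 'next' ...
		
		offerSequence.set(next); // volatile write: full store fence, immediately visible
		// ...or, trading visibility lag for producer speed:
		// offerSequence.lazySet(next); // ordered write: no expensive fence, visible "eventually"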

This is an important distinction that serves two very different requirements. Sometimes you want the message to reach the consumer as quickly as possible; for that you can refer to my inter-thread communication article. Other times you don't care about the in-transit time of the message; you just want the producer to get rid of it as quickly as possible.

But how long will it take for the write to go through? Well, for that I will quote Doug Lea here:

You can be sure only if you can be sure the process makes progress. But this is a common limitation that most people are comfortable ignoring given the practical considerations that compilers always eventually emit the writes, and processors can only buffer a finite number of them before committing them.

Note: MentaLog is a logging framework developed with the ideas above. Its average producer latency is around 65 nanoseconds as you can see here.

Conclusion

Asynchronous logging is the way to go to minimize file I/O latencies. A memory-mapped file through a MappedByteBuffer is a great way to approach the problem without having to dive into multithreaded programming. The latencies the MappedByteBuffer imposes on the producer are low, and although the outliers above the 99th percentile are slow, it should be enough for most HFT systems. However, for ultra-low latencies as close as possible to the raw ByteBuffer, a lock-free queue with the lazySet() operation provides the best performance.

Comments


  1. Just did a post about lazySet performance wins for single writers (and just generally) that is relevant in the context of your article: http://psy-lob-saw.blogspot.co.uk/2012/12/atomiclazyset-is-performance-win-for.html
    Also note a previous post covered a fast and low-garbage way of encoding Strings into byte buffers that is of interest to log implementations (as in it's significantly faster).
    Nice blog :)

  2. Hi Sergio,

    Looks like this “asynchronous logging system” is only intended to be used where there is just one producer thread.
    Have you given any thought to what it would take (in a lock-free manner) to support multiple producer threads?

    Thanks

  3. You simply use two queues to connect two producers to the same consumer. The consumer will poll from the two queues and write to disk however it wants to. :)

  4. I see, thank you.

    Finally, am I correct that using flush() above ensures that the new ByteBuffer added to the plain array will be visible to your “AsyncLogger” thread (assuming AtomicLong’s set() creates a memory store fence)?

    Cheers

  5. Good post!

    I just wrote a big, fast and persistent queue by leveraging a memory-mapped file; the speed to enqueue and dequeue is close to in-memory access, and at the same time it is thread-safe.

    https://github.com/bulldog2011/bigqueue

