NanoLog – a nanosecond scale logging system for C++
I spent a lot of time trying to build a fast logging system in my last couple of jobs. The basic lesson (and I'm only talking about C/C++/C# here) is that you will spend most of your time formatting strings if you do your file I/O asynchronously.
Since this system has a preprocessor mode, I assume they learnt the same lesson.
The bigger lesson is that it really doesn't matter how many millions of logs you can generate per second if you don't have the infrastructure to store and analyze them easily. No one is going to enjoy digging through these things and the more you generate, the harder it is to extract meaningful information.
In other words, pay a lot more attention to what you'll do with the logs rather than how fast you can spew them out. I have rarely needed more than a few thousand log messages/sec even for a loaded MMO server. I spent way more time creating ways to look at these logs and making them accessible in near real time.
The code overall is pretty clever, but at the core of all this they completely ignore strict aliasing to dump stuff into a char* buffer indiscriminately... Look at this function:
https://github.com/PlatformLab/NanoLog/blob/master/runtime/N...
A total of 6 reinterpret_casts in that file alone:

    T argument = *reinterpret_cast<T*>(*in);
    ...
    uint32_t stringBytes = *reinterpret_cast<uint32_t*>(*in);

I didn't see any indication that you have to turn off strict aliasing to use this library (and would you really want to make all your remaining code slower just so your logging has better benchmark scores?). Which makes all this code UB as far as the C++ standard is concerned (no, their allocator does not magically placement-new the correct object types into the right places).
They could just inspect their T values as char* and/or memcpy them around instead of all this pointer aliasing, at virtually no additional cost (well, as long as their values are trivial types - I couldn't figure out whether you can log non-trivial stuff here?).
Edit: And this right here convinces me that I definitely wouldn't want to use this library in production... Adding a volatile to fix a threading bug is just a big no-no.
https://github.com/PlatformLab/NanoLog/commit/e9691246ede6da...
Author of mini-async-log and mini-async-log-c here.
https://github.com/RafaGago/mini-async-log
https://github.com/RafaGago/mini-async-log-c
I wrote both some time ago and AFAIK they were the fastest at the time. How does NanoLog compare against them?
At least before the source-file preprocessor existed, NanoLog was slower.
I had a benchmark project that included NanoLog on an old version; maybe you can PR an update:
I did something very similar to the “preprocessor” version of this back in 2001 for Microsoft application crash reporting (the problem then being upload size, not so much logging overhead). It’s very cool to see how you can do the whole thing with constexpr now. As I recall, people really didn’t like using the preprocessor!
The description says "the compilation builds a version of the NanoLog library that is non-portable, even between compilations of the same application". They also mention that a decompressor application also gets built and can be used to reconstitute a human-readable version of the log file. My question is, is the decompressor application also tightly coupled to a particular compilation? It sounds like the answer is yes, but I'd really like to be wrong about that.
That would really suck for cases where you want to store old log files for later use. You'd have to also store a copy of the decompressor program that corresponds to each log file. Or you'd have to convert to human-readable format before storing the log files, which loses a number of the benefits of the binary log format (compact size, faster to parse).
I built a similar system with a focus on profiling events with extremely low overhead. When taking a timestamp, the processor cycle count register is read and pushed into a ring buffer with a tiny description including a static string label. A background thread periodically writes the buffer contents to disk. An offline tool visualizes the recorded events and intervals over time and can do a bit of statistics.
This system has served me quite well, but I need to rework the UI because zooming to microsecond resolution on a 30 second trace overflows the range values in the Qt scrollbars.
Recently built similar stuff, even using the same __rdtsc() for time. I didn't require any preprocessing; instead I require that the strings come from the read-only section of a module, and I log the pointer values. I also log into a circular buffer in shared memory instead of a file, as I was only interested in knowing what the app did immediately before a rare crash which takes hours to reproduce.
From a quick reading of the paper, it sounds like Nanolog is internally translating logging messages into a compacted format, which must be further processed to become human readable. This further processing is not included in the benchmark.
section 4.3, "Decompressor/Aggregator"
> The final component of the NanoLog system is the decompressor/aggregator, which takes as input the compacted log file generated by the runtime and either outputs a human-readable log file or runs aggregations over the compacted log messages. The decompressor reads the dictionary information from the log header, then it processes each of the log messages in turn. For each message, it uses the log id embedded in the file to find the corresponding dictionary entry. It then decompresses the log data as indicated in the dictionary entry and combines that data with static information from the dictionary to generate a human-readable log message. If the decompressor is being used for aggregation, it skips the message formatting step and passes the decompressed log data, along with the dictionary information, to an aggregation function.
section 5.1.2, "Throughput":
> [...] NanoLog performs best when there is little dynamic information in the log message. This is reflected by staticString, a static message, in the throughput benchmark. Here, NanoLog only needs to output about 3-4 bytes per log message due to its compaction and static extraction techniques. Other systems require over an order of magnitude more bytes to represent the messages (41-90 bytes).
> [...] Overall, NanoLog is faster than all other logging systems tested. This is primarily due to NanoLog consistently outputting fewer bytes per message and secondarily because NanoLog defers the formatting and sorting of log messages.
section 5.1.3, "Latency":
> All of the other systems except ETW require the logging thread to either fully or partially materialize the human-readable log message before transferring control to the background thread, resulting in higher invocation latencies. NanoLog on the other hand, performs no formatting and simply pushes all arguments to the staging buffer. This means less computation and fewer bytes copied, resulting in a lower invocation latency.
So NanoLog performs less work and therefore has much higher throughput. This separation of logging and formatting may well be a great idea, but it should probably be mentioned more prominently wherever the benchmark table is posted.
Looks interesting but how does one reproduce the comparative benchmarks from the paper? I see the code that benchmarks nanolog but how do I reproduce the baseline for glog?
Is this a new incarnation of this project?
https://github.com/Iyengar111/NanoLog
Otherwise it seems that the name is already taken.
Cool stuff. Judging by his code, this guy's mind is probably going to get blown when he discovers mmap() and BCD arithmetic.