After exploring the basic profiling story, let’s turn to ETW events. On Windows, the CLR is instrumented with a myriad of ETW events, which can be used to tackle very hard problems at runtime. Here are some examples of these events:
- Garbage collections
- Assembly load/unload
- Thread start/stop (including thread pool threads)
- Object allocations
- Exceptions thrown, caught, filtered
- Methods compiled (JIT)
By collecting all of, or a subset of, these events, you can get a very nice picture of what your .NET application is doing. By combining these with Windows kernel events for CPU sampling, file accesses, process creations and more — you have a fairly complete tool for performance investigations. You might find my recorded DotNext talk on using PerfView for .NET performance investigations useful — it shows how ETW and pretty much nothing else but ETW can be used to solve a huge variety of performance problems.
Unfortunately, ETW is a Windows-only mechanism. Event Tracing for Windows, you know? It is implemented in the Windows kernel, and that’s partly why it is so efficient and powerful. While looking for an equivalent Linux framework, the CLR team considered multiple alternatives, and decided to go with LTTng. it’s not that Linux doesn’t have enough tracers — quite the opposite — they had to choose which tracer is most appropriate for the CLR’s needs.
Hello, LTTNG
LTTng is similar in spirit to ETW. It is a lightweight tracing framework that can process events in real-time or record them to a file for later processing. It supports multiple simultaneous trace sessions, and each session can have multiple providers enabled — a system call provider (which requires a kernel module to be installed) alongside with user-space providers, such as the CoreCLR. Traces can be analyzed on the same machine or on another machine, and there are viewers available for more advanced visualization. Custom EventSource-based providers are also supported.
The massive, unfortunate, painful, unforgivable downside of this choice made by the CLR team is that LTTng doesn’t have stack trace support for user-space events. And this hurts more than you can imagine:
- You can collect GC events, but you can’t aggregate stack traces to figure out where the GCs are coming from
- You can collect exception events, but you don’t have the exception call stack
- You can collect assembly load events (and potential failures), but you don’t know what’s triggering that assembly load
- You can collect object allocations, but you can’t aggregate statistics to indicate which code paths are causing lots of allocations (and thereby garbage collections)
Effectively, not having stack trace support makes LTTng for CoreCLR a logging framework, which can be used to record and investigate logs, but not a tracing framework, which can help diagnose performance problems and troubleshoot hard issues in the field.
Collecting LTTng Traces
Let’s take a look at collecting LTTng traces from CoreCLR. As in the previous post, you could use the perfcollect tool — but it is typically overkill. First, perfcollect’s current implementation always turns on CPU sampling, which takes a massive amount of space and introduces a certain overhead. Second, perfcollect doesn’t have an event filter — it has exactly three modes for CoreCLR events: everything, GC only, and GC collect only. If you only care about assembly load events, or GC events, or JIT events, you’re not in luck.
Fortunately, it’s very easy to roll your own LTTng collection (you’ll need to install lttng-tools to record, and babeltrace to view):
# lttng create exceptions-trace # lttng add-context --userspace --type vpid # lttng add-context --userspace --type vtid # lttng add-context --userspace --type procname # lttng enable-event --userspace --tracepoint DotNETRuntime:Exception* # lttng start
In the preceding commands, lttng create creates a trace session that you then add providers to. The add-context command makes sure each event will have, in addition to the provider data, the PID, TID, and process name. Then, enable-event adds a specific event set from the CoreCLR provider — note that LTTNG doesn’t need any metadata about these events ahead of time. To get a list of all the possible events, one easy way is to just read the perfcollect script — it declares them all, e.g.:
declare -a DotNETRuntime_NoKeyword=( DotNETRuntime:ExceptionThrown DotNETRuntime:Contention DotNETRuntime:RuntimeInformationStart DotNETRuntime:EventSource )
Finally, lttng start starts the trace session with the enabled providers. By default, the traces are written to ~/session-name-timestamp — in our case, it’s going to be /root/exceptions-trace-20170330-something.
Now you run your scenario (the target application should have the COMPlus_EnableEventLog environment variable set to 1), and stop the trace when you’re done:
# lttng stop # lttng destroy # babeltrace ~/exceptions-trace [07:31:11.751548909] (+?.?????????) ubuntu-16 DotNETRuntime:ExceptionThrown_V1: { cpu_id = 0 }, { ExceptionType = "System.NotSupportedException", ExceptionMessage = "Sample exception.", ExceptionEIP = 139767278604807, ExceptionHRESULT = 2148734229, ExceptionFlags = 16, ClrInstanceID = 0 } [07:31:11.751603953] (+0.000055044) ubuntu-16 DotNETRuntime:ExceptionCatchStart: { cpu_id = 0 }, { EntryEIP = 139765244804131, MethodID = 139765233785640, MethodName = "void [Runny] Runny.Program::Main(string[])", ClrInstanceID = 0 }
babeltrace is a simple trace viewer, but there are also UI tools like Trace Compass and other visualization tools that can parse the CTF (Common Trace Format) specification.
What Now?
We have the fundamentals for collecting LTTng traces from CoreCLR and from custom providers, but the lack of stack traces is gnawing at me. In the next post, we will explore a hacky way of getting stack traces from interesting events, and even using BCC/SystemTap for tracing them.
You can also follow me on Twitter, where I put stuff that doesn’t necessarily deserve a full-blown blog post.