Exploration and Instrumentation of Per-Process Windows Telemetry via ETW

Contents

  1. Motivation
  2. ETW Background
  3. Exploring ETW Providers
  4. Instrumenting Per-Process DNS and TCP Telemetry

Motivation

Tracing anomalous behavior across disparate data sources back to a “Patient Zero Process” on a single machine is often challenging for defenders. Imagine a scenario where defenders have found a pseudo-periodic beacon to an attacker-controlled IP while looking at network logs. Identifying the compromised machine is relatively straightforward; identifying the culprit process and gaining context around it, however, can be a time-consuming affair.

One potential solution to this challenge is to instrument per-process optics for network, inter-process, and other activity, allowing a defender to trace attack lifecycles even through evasion-heavy TTPs like process injection and encryption. This post will explore how a detections engineer can approach instrumentation to support such optics using Event Tracing for Windows (ETW), specifically:

  1. Exploring interesting ETW providers
  2. Ascertaining if there is useful signal from a given provider
  3. Rigging callback functions to handle the event and extract the signal from the noise

ETW Background

ETW is a native kernel-level tracing infrastructure originally provided by Microsoft for debugging purposes. The security community has begun repurposing it for its detailed tracing optics, which are otherwise unavailable in classic Windows Events. ETW relies on three core components: Controllers, Providers, and Consumers.

For the purposes of this post, Controllers are responsible for starting and stopping trace sessions and for enabling or disabling Providers, Providers generate the event data, and Consumers consume and interpret the events. More information on the structure of the ETW API can be found in the official docs here:
https://docs.microsoft.com/en-us/windows/desktop/etw/about-event-tracing

Historically, instrumenting ETW in order to rapidly perform detections research, investigate Providers of interest, and tailor callback functions to extract usable signal has been a source of friction.

To reduce this friction and focus on the goal of obtaining usable detections from ETW events, we will utilize a Microsoft tool called Message Analyzer to extract ETW provider information and a Python ETW wrapper from our friends at FireEye called Pywintrace to take a deeper dive into the events themselves.

Exploring ETW Providers

We start by asking a simple question: what ETW providers exist out of the box that could provide interesting data for engineering new detections?

To answer that, we will obtain a list of default ETW providers from Message Analyzer, loop through each one to gauge a baseline, non-stimulated data volume, then focus on the providers that generate data by default without additional tuning.

After downloading and installing Message Analyzer, click New Session > Live Trace > Add New Provider. You will be presented with a list of available System ETW Providers.

Instead of individually running traces from the Message Analyzer GUI, it is much more useful to copy/paste the thousands of rows of providers into a separate file, where they will appear as tab-separated values. (The provider list could also have been obtained by running the command “logman query providers” on a command line; however, extracting them directly from Message Analyzer allows us to immediately grab them in TSV form.)
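As a rough illustration of working with the pasted rows, the sketch below pulls (name, GUID) pairs out of tab-separated text. The exact column layout of the Message Analyzer paste is an assumption here, so rather than relying on column positions the code treats the first GUID-shaped cell on each row as the provider GUID and the first other non-empty cell as the provider name. The function name parse_providers is hypothetical.

```python
import re

# Matches a GUID, with or without surrounding braces.
GUID_RE = re.compile(
    r"\{?([0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})\}?"
)


def parse_providers(tsv_text):
    """Return (name, guid) pairs from pasted tab-separated provider rows."""
    providers = []
    for line in tsv_text.splitlines():
        cells = [c.strip() for c in line.split("\t") if c.strip()]
        name = None
        guid = None
        for cell in cells:
            match = GUID_RE.fullmatch(cell)
            if match:
                # Keep the first GUID on the row, normalised to {UPPERCASE}.
                guid = guid or "{" + match.group(1).upper() + "}"
            elif name is None:
                # Assumption: the first non-GUID cell is the provider name.
                name = cell
        if name and guid:
            providers.append((name, guid))
    return providers
```

Rows without a recognisable GUID (for example a header row) are skipped silently, which keeps the function tolerant of whatever else ends up in the pasted file.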

We will now iterate through the list of ETW providers, listen to each for 10 seconds, and output how many events were captured. This will allow us to understand which providers generate events out of the box and which require additional instrumentation. Assuming Pywintrace is downloaded and installed, the code is as follows:
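A minimal sketch of such a harness might look like the following. It assumes Pywintrace exposes etw.ETW, etw.ProviderInfo, and etw.GUID as shown in its README, and that event callbacks receive one object per event; the CSV column names, the -1 marker for providers that fail to enable, and the function names count_events and survey are illustrative choices rather than anything prescribed by the library.

```python
import csv
import time


def count_events(name, guid, seconds=10):
    """Listen to one provider for `seconds` and return how many events arrived."""
    import etw  # Pywintrace; imported lazily because it only works on Windows

    seen = []
    job = etw.ETW(
        providers=[etw.ProviderInfo(name, etw.GUID(guid))],
        event_callback=seen.append,  # one list entry per delivered event
    )
    job.start()
    try:
        time.sleep(seconds)
    finally:
        job.stop()
    return len(seen)


def survey(providers, out_path, capture=count_events, seconds=10):
    """Write one CSV row (provider, guid, event_count) per provider."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["provider", "guid", "event_count"])
        for name, guid in providers:
            try:
                count = capture(name, guid, seconds)
            except Exception as exc:  # some providers may refuse to enable
                print(f"{name}: skipped ({exc})")
                count = -1  # marker for "could not be traced"
            writer.writerow([name, guid, count])
            fh.flush()  # keep partial results if the run is interrupted
```

Passing the capture function as a parameter keeps the loop testable without a live trace session: a stub that returns fixed counts can stand in for count_events. At 10 seconds per provider, a full pass over 1000+ providers takes roughly three hours, so flushing after every row is worthwhile.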

Observing the traces run for each provider, we notice that some providers are not capturing anything (at least without surgical stimulus) whereas others are actively capturing events. Note that both WMI and the LSA subsystem produce events which may be of future interest:

After the script has run through all 1000+ providers that come out of the box, we can utilize Pandas inside a Jupyter Notebook to do some quick exploratory analysis and find the top talkers.

First, we will load the output CSV into a Pandas dataframe and look at the first few rows using df.head().

Next we’ll remove the providers that are not outputting any events and sort the others by event count:
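The two Pandas steps can be sketched as below. The provider names, GUIDs, and counts are made-up sample data standing in for the real output file, and the column names are assumptions about how the harness output was laid out.

```python
import io

import pandas as pd

# Hypothetical sample of the harness output; a real run has 1000+ rows.
sample_csv = io.StringIO(
    "provider,guid,event_count\n"
    "Provider-A,{00000000-0000-0000-0000-00000000000A},0\n"
    "Provider-B,{00000000-0000-0000-0000-00000000000B},412\n"
    "Provider-C,{00000000-0000-0000-0000-00000000000C},17\n"
)

# In practice this would be pd.read_csv("<path to the harness output>").
df = pd.read_csv(sample_csv)
print(df.head())

# Drop providers that produced no events, then rank the rest by volume.
talkers = (
    df[df["event_count"] > 0]
    .sort_values("event_count", ascending=False)
    .reset_index(drop=True)
)
print(talkers)
```

Filtering on a count greater than zero also discards any rows where a negative value was used to mark a provider that could not be traced.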

Great. Now that we know which providers are generating data and have a representative volume of each over a 10 second interval, we have a nice starting point from which to investigate what kind of signal the individual providers have to offer. We can also address the initial problem presented in the discussion – how to generate interesting per-process optics.

Instrumenting Per-Process DNS and TCP Telemetry

Investigating a compromised machine routinely leads to finding one or two nefarious processes that kicked off the remainder of the attack chain. The goal of a defender at that point is to understand context around that process across multiple dimensions: what did it spawn, where did it connect, what handles did it open, and the like. One question that comes in handy here is: what DNS requests has a particular process made?

To facilitate answering this question as well as prototype some live per-process DNS telemetry, we can utilize the Microsoft-Windows-DNS-Client ETW provider to rig up some exploratory tooling.

To set up the capture, we’ll utilize code that is similar to the testing harness used above:

Note that when it’s time to handle an ETW event, the process() callback function will be used to perform additional work on the event. This can be used to parse, transform, or aggregate the data into more useful representations.
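A capture set up along these lines might look like the sketch below. It assumes the Pywintrace API from its README (etw.ETW, etw.ProviderInfo, etw.GUID) and that each event arrives at the callback as an (event_id, fields) pair. The GUID shown is the one we believe belongs to Microsoft-Windows-DNS-Client; it is worth confirming against “logman query providers” on the target machine. The captured list and the capture_dns function are hypothetical names.

```python
import time

# Provider name and GUID; verify the GUID locally before relying on it.
DNS_CLIENT = (
    "Microsoft-Windows-DNS-Client",
    "{1C95126E-7EEA-49A9-A3FE-A378B03DDB4D}",
)

captured = []  # raw events, kept so they can be inspected after the run


def process(event):
    """Callback invoked once per ETW event; extend it to parse or aggregate."""
    event_id, fields = event  # assumed shape: (event_id, dict_of_fields)
    captured.append((event_id, fields))


def capture_dns(seconds=30, callback=process):
    """Trace the DNS client provider for `seconds`, feeding events to `callback`."""
    import etw  # Pywintrace; imported lazily because it only works on Windows

    name, guid = DNS_CLIENT
    job = etw.ETW(
        providers=[etw.ProviderInfo(name, etw.GUID(guid))],
        event_callback=callback,
    )
    job.start()
    try:
        time.sleep(seconds)
    finally:
        job.stop()
```

Starting a trace session generally requires an elevated prompt, so capture_dns is expected to be run as Administrator.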

Let’s instrument the callback function to:

  1. Exclude ETW setup prologue events and localhost lookups
  2. Only return events querying for a specific site in order to focus on how the events are sequenced, in this case ‘www.twitter.com’
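The two filtering steps above can be sketched as a small predicate plus a callback. The QueryName field name and the (event_id, fields) event shape are assumptions about what Pywintrace delivers for this provider, and treating "no QueryName present" as the marker for setup events is a simplification.

```python
TARGET = "www.twitter.com"
LOCAL_NAMES = {"localhost", "127.0.0.1", "::1"}


def interesting(event, target=TARGET):
    """Return True for DNS events worth printing.

    With target=None every non-local query passes; otherwise only
    queries for exactly `target` do.
    """
    event_id, fields = event  # assumed shape: (event_id, dict_of_fields)
    query = fields.get("QueryName")
    if not query:
        return False  # setup/prologue events carry no query name
    query = query.rstrip(".").lower()  # normalise case and trailing dot
    if query in LOCAL_NAMES:
        return False  # ignore localhost lookups
    return target is None or query == target


def process(event):
    """ETW callback: print only the events that pass the filter."""
    if interesting(event):
        print(event)
```

Keeping the filter separate from the callback makes it easy to widen the capture later by passing target=None once the event sequence for a single site is understood.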

Running the program and navigating to “www.twitter.com”, we see the ETW provider yield several different events of interest. It is left as an exercise to the reader to understand what the various event types represent, but a hint is that both network and cached DNS lookups are involved.

Let us take a close look at the very first event that is fired, Event 3006. It is here that we finally get the connective tissue we want – both process information and DNS query information in the same place:

Modifying our callback function, we can put together some logic for real time monitoring of DNS queries by various PIDs, resolve those PIDs to executable paths using a WMI helper call, and then look for anomalies in the data.
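One way to sketch that logic is shown below. Several things here are assumptions: that the process ID is found under fields["EventHeader"]["ProcessId"], that the query name is in QueryName, and that the WMI helper is built on the third-party wmi package (with pythoncom from pywin32). The class and function names are hypothetical, and anomaly detection itself is left out; the sketch only builds the per-process view that such logic would consume.

```python
from collections import defaultdict


def wmi_exe_path(pid):
    """Resolve a PID to an executable path via WMI (Windows only)."""
    import pythoncom  # from pywin32
    import wmi  # third-party package; an assumption, not named in the post

    # COM must be initialised on the calling thread, and ETW callbacks
    # may not run on the main thread.
    pythoncom.CoInitialize()
    matches = wmi.WMI().Win32_Process(ProcessId=pid)
    return matches[0].ExecutablePath if matches else None


class DnsByProcess:
    """Collect the distinct DNS names queried by each process."""

    def __init__(self, resolve=wmi_exe_path):
        self.resolve = resolve
        self.paths = {}                  # pid -> cached executable path
        self.queries = defaultdict(set)  # pid -> names seen so far

    def process(self, event):
        event_id, fields = event
        if event_id != 3006 or not fields.get("QueryName"):
            return
        pid = fields["EventHeader"]["ProcessId"]
        if pid not in self.paths:
            # Short-lived processes may already be gone by the time we ask.
            self.paths[pid] = self.resolve(pid) or "<unknown>"
        name = fields["QueryName"]
        if name not in self.queries[pid]:
            self.queries[pid].add(name)
            print(f"{pid}\t{self.paths[pid]}\t{name}")
```

An instance's process method can be handed to the trace session as the event callback. Caching the path per PID avoids a WMI round trip on every event, at the cost of being wrong if a PID is reused during a long capture.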

The result is a streaming feed of DNS calls mapped to processes:

With slight modifications to the ETW provider we are targeting, we can also obtain per-process TCP data indicating bytes sent and received (although it would be prudent to aggregate these with some bucketing function):
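A bucketing function of the kind suggested here could be as simple as the sketch below, which sums bytes per process and direction within fixed time windows. The input is a hypothetical list of (timestamp_seconds, pid, direction, size) tuples; how those values are extracted depends on the TCP provider and event fields chosen, which are not specified here.

```python
from collections import defaultdict


def bucket_bytes(samples, width=60):
    """Sum bytes per (window_start, pid, direction) in `width`-second windows.

    `samples` is an iterable of (timestamp_seconds, pid, direction, size)
    tuples; this shape is illustrative, not a provider schema.
    """
    totals = defaultdict(int)
    for ts, pid, direction, size in samples:
        window = int(ts // width) * width  # start of the enclosing window
        totals[(window, pid, direction)] += size
    return dict(totals)


samples = [
    (0.5, 10, "sent", 100),
    (59.9, 10, "sent", 50),
    (60.0, 10, "sent", 7),
    (10.0, 10, "recv", 1),
]
print(bucket_bytes(samples))
```

Fixed windows keep the output volume proportional to the number of active processes rather than the number of packets, which matters when the data is shipped off the host.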

For a production environment, we can now utilize a collector to grab the logs and ship them to a centralized location or SIEM, knowing that we have a good sense of the data volume involved in network-to-process relationships.

Hopefully this has been a useful overview of the ETW investigation process and how it can be utilized to engineer useful detections. Some areas of potential future research:

  • What other ETW providers provide useful info out of the box?
  • How can we automate stimulating ETW providers to see if they generate data given certain conditions?
  • What kind of correlation can be performed amongst the various providers?
  • How can we use ETW to enrich classic Windows Events?