Exploration and Instrumentation of Per-Process Windows Telemetry via ETW


  1. Motivation
  2. ETW Background
  3. Exploring ETW Providers
  4. Instrumenting Per-Process DNS and TCP Telemetry


Tracking anomalous behavior across disparate data sources back to a "Patient Zero" process on a single machine is often a challenging task for defenders. Imagine a scenario where defenders, while reviewing network logs, have found a pseudo-periodic beacon to an attacker-controlled IP. Identifying the compromised machine is relatively straightforward; identifying the culprit process and gaining context around it, however, can be a time-consuming affair.

One potential solution to this challenge is to instrument per-process optics for network, inter-process, and other activity, allowing a defender to trace attack lifecycles even through evasion-heavy TTPs like process injection and encryption. This post will explore how a detections engineer can approach instrumentation to support such optics using Event Tracing for Windows (ETW), specifically:

  1. Exploring interesting ETW providers
  2. Ascertaining if there is useful signal from a given provider
  3. Rigging callback functions to handle the event and extract the signal from the noise

ETW Background

ETW is a native kernel-level tracing infrastructure originally provided by Microsoft for debugging purposes. The security community has increasingly repurposed it for its detailed tracing optics, which are otherwise unavailable in classic Windows Events. ETW relies on three core components: Controllers, Providers, and Consumers.

For the purposes of this post, Controllers are responsible for starting and stopping trace sessions, Providers are responsible for enabling or disabling event data, and Consumers consume and interpret the events. More information on the structure of the ETW API can be found in the official Microsoft documentation.

Historically, instrumenting ETW in order to rapidly perform detections research, investigate Providers of interest, and tailor callback functions to extract usable signal has been a source of friction.

To reduce this friction and focus on the goal of obtaining usable detections from ETW events, we will utilize a Microsoft tool called Message Analyzer to extract ETW provider information and a Python ETW wrapper from our friends at FireEye called Pywintrace to take a deeper dive into the events themselves.

Exploring ETW Providers

We start by asking a simple question: what ETW providers exist out of the box that could provide interesting data for engineering new detections?

To answer that, we will obtain a list of default ETW providers from Message Analyzer, loop through each one to gauge a baseline, non-stimulated data volume, then focus on the providers that generate data by default without additional tuning.

After downloading and installing Message Analyzer, click New Session > Live Trace > Add New Provider. You will be presented with a list of available System ETW Providers:

Instead of individually running traces from the Message Analyzer GUI, it is much more useful to copy/paste the thousands of rows of providers into a separate file, where they will appear as tab-separated values. (The provider list could also be obtained by running logman query providers at a command line; however, extracting it directly from Message Analyzer lets us immediately grab it in TSV form.)

We will now iterate through the list of ETW providers, listen to each for 10 seconds, and output how many events were captured. This will allow us to understand which providers generate events out of the box and which require additional instrumentation. Assuming Pywintrace is downloaded and installed, the code is as follows:
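A minimal sketch of such a harness, assuming pywintrace's documented etw.ETW / etw.ProviderInfo interface and a providers.tsv file holding the Name/GUID pairs copied out of Message Analyzer (the file names and output CSV layout here are illustrative, not the post's originals):

```python
import csv
import time

class EventCounter:
    """Callback object that simply counts events as they arrive."""
    def __init__(self):
        self.count = 0

    def __call__(self, event):
        self.count += 1

def survey_provider(name, guid, seconds=10):
    """Listen to a single provider for `seconds` seconds; return the event count."""
    import etw  # pywintrace; Windows-only, run from an elevated prompt

    counter = EventCounter()
    job = etw.ETW(
        providers=[etw.ProviderInfo(name, etw.GUID(guid))],
        event_callback=counter,
    )
    job.start()
    time.sleep(seconds)
    job.stop()
    return counter.count

def survey_all(tsv_path="providers.tsv", out_path="provider_counts.csv"):
    """Sweep every provider in the TSV and record its baseline event volume."""
    with open(tsv_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["provider", "guid", "event_count"])
        for name, guid in csv.reader(fin, delimiter="\t"):
            writer.writerow([name, guid, survey_provider(name, guid)])

# survey_all() kicks off the full sweep on a Windows host
```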

Observing the traces run for each provider, we notice that some providers are not capturing anything (at least without surgical stimulus) whereas others are actively capturing events. Note that both WMI and the LSA subsystem produce events which may be of future interest:

After the script has run through all 1000+ providers that come out of the box, we can utilize Pandas inside a Jupyter Notebook to do some quick exploratory analysis and find the top talkers.

First, we will load the output CSV into a Pandas dataframe and look at the first few rows using df.head().

Next we’ll remove the providers that are not outputting any events and sort the others by event count:
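As a sketch of that exploratory pass (the column names mirror the survey CSV above and are illustrative; substitute whatever your harness actually emits):

```python
import pandas as pd

# Stand-in for pd.read_csv("provider_counts.csv")
df = pd.DataFrame(
    {
        "provider": [
            "Microsoft-Windows-Kernel-Process",
            "Microsoft-Windows-DNS-Client",
            "Microsoft-Windows-WMI-Activity",
            "Microsoft-Windows-Kernel-Memory",
        ],
        "event_count": [1423, 87, 210, 0],
    }
)

print(df.head())

# Drop the silent providers and rank the rest by volume
top_talkers = (
    df[df["event_count"] > 0]
    .sort_values("event_count", ascending=False)
    .reset_index(drop=True)
)
print(top_talkers)
```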

Great. Now that we know which providers generate data out of the box, and have a representative volume for each over a 10-second interval, we have a nice starting point from which to investigate what kind of signal the individual providers have to offer. We can also return to the problem posed at the outset: how to generate interesting per-process optics.

Instrumenting Per-Process DNS and TCP Telemetry

Investigating a compromised machine routinely leads to finding one or two nefarious processes that kicked off the remainder of the attack chain. The goal of a defender at that point is to understand context around that process across multiple dimensions: what did it spawn, where did it connect, what handles did it open, and the like. One question that comes in handy here is: what DNS requests has a particular process made?

To facilitate answering this question as well as prototype some live per-process DNS telemetry, we can utilize the Microsoft-Windows-DNS-Client ETW provider to rig up some exploratory tooling.

To set up the capture, we’ll utilize code that is similar to the testing harness used above:
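A sketch of that setup, reusing the pattern from the survey harness. The Microsoft-Windows-DNS-Client GUID shown is the commonly published one, but it is worth confirming with logman query providers on your own host; pywintrace hands the callback a tuple of (event ID, parsed field dict):

```python
import time

def process(event):
    """Callback invoked by pywintrace for every event: (event_id, fields)."""
    event_id, fields = event
    print(event_id, fields)
    return event_id, fields

def capture_dns(seconds=60):
    """Trace the DNS-Client provider, routing each event through process()."""
    import etw  # pywintrace; Windows-only

    provider = etw.ProviderInfo(
        "Microsoft-Windows-DNS-Client",
        etw.GUID("{1C95126E-7EEA-49A9-A3FE-A378B03DDB4D}"),
    )
    job = etw.ETW(providers=[provider], event_callback=process)
    job.start()
    time.sleep(seconds)
    job.stop()
```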

Note that when it’s time to handle an ETW event, the process() callback function will be used to perform additional work on the event. This can be used to parse, transform, or aggregate the data into more useful representations.

Let’s instrument the callback function to:

  1. Exclude ETW setup prologue events and localhost lookups
  2. Only return events querying for a specific site in order to focus on how the events are sequenced, in this case ‘www.twitter.com’
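A sketch of that filtering callback. The QueryName field follows the DNS-Client event schema, but the exact prologue event IDs vary by Windows build, so the exclusions here key off missing query fields rather than hard-coded IDs:

```python
TARGET = "www.twitter.com"
LOCALHOST_NAMES = {"localhost", "127.0.0.1", "::1"}

def process(event):
    """Return (event_id, query) for events of interest, None otherwise."""
    event_id, fields = event
    query = fields.get("QueryName", "")
    if not query:                          # setup/prologue events carry no query
        return None
    if query.lower() in LOCALHOST_NAMES:   # drop localhost lookups
        return None
    if query.lower() != TARGET:            # focus on the one site under study
        return None
    return (event_id, query)
```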

Running the program and navigating to "www.twitter.com", we see the ETW provider yield several different events of interest. It is left as an exercise to the reader to understand what the various event types represent, but a hint is that both network and cached DNS are involved.

Let us take a closer look at the very first event that fires, Event 3006. It is here that we finally get the connective tissue we want: both process information and DNS query information in the same place.

Modifying our callback function, we can put together some logic for real time monitoring of DNS queries by various PIDs, resolve those PIDs to executable paths using a WMI helper call, and then look for anomalies in the data.
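One way to sketch that PID-to-path resolution (the wmi package's Win32_Process query is Windows-only, and the EventHeader/ProcessId field path is an assumption about where your pywintrace version surfaces the PID; inspect a raw event to confirm). The lookup is cached and injectable, since WMI round-trips are comparatively expensive:

```python
_path_cache = {}

def resolve_pid(pid, lookup=None):
    """Resolve a PID to an executable path, caching results across events."""
    if pid in _path_cache:
        return _path_cache[pid]
    if lookup is None:
        lookup = _wmi_lookup
    path = lookup(pid) or "<unknown>"
    _path_cache[pid] = path
    return path

def _wmi_lookup(pid):
    import wmi  # pip install wmi; Windows-only
    for proc in wmi.WMI().Win32_Process(ProcessId=pid):
        return proc.ExecutablePath
    return None

def process(event):
    """Stream each DNS query tagged with its originating process."""
    event_id, fields = event
    if event_id != 3006:  # query-issued events only
        return
    pid = fields.get("EventHeader", {}).get("ProcessId")
    print(f"{resolve_pid(pid)} (pid {pid}) -> {fields.get('QueryName')}")
```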

The result is a streaming feed of DNS calls mapped to processes:

And with slight modifications to the ETW provider we are targeting, per-process TCP data indicating bytes sent and received (although it would be prudent to aggregate these with some bucketing function):
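One such bucketing function might aggregate send/receive counts per (PID, time bucket) pair, since kernel network events fire per segment and streaming each one is noisy. Field names and the 30-second bucket size are illustrative:

```python
from collections import defaultdict

BUCKET_SECONDS = 30

def bucket_of(timestamp):
    """Collapse a Unix timestamp into the start of its 30-second bucket."""
    return int(timestamp) - int(timestamp) % BUCKET_SECONDS

class TcpAggregator:
    def __init__(self):
        # (pid, bucket_start) -> [bytes_sent, bytes_received]
        self.buckets = defaultdict(lambda: [0, 0])

    def add(self, pid, timestamp, sent=0, received=0):
        totals = self.buckets[(pid, bucket_of(timestamp))]
        totals[0] += sent
        totals[1] += received

agg = TcpAggregator()
agg.add(pid=4321, timestamp=1000, sent=512)
agg.add(pid=4321, timestamp=1010, received=2048)  # lands in the same 30s bucket
agg.add(pid=4321, timestamp=1035, sent=64)        # next bucket
```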

For a production environment, we can now utilize a collector to grab the logs and ship them to a centralized location or SIEM, knowing that we have a good sense of the data volume involved for network-to-process relationships.

Hopefully this has been a useful overview of the ETW investigation process and how it can be utilized to engineer useful detections. Some areas of potential future research:

  • What other ETW providers provide useful info out of the box?
  • How can we automate stimulating ETW providers to see if they generate data given certain conditions?
  • What kind of correlation can be performed amongst the various providers?
  • How can we use ETW to enrich classic Windows Events?

Intuitive Detections Research With Graph Analytics and Neo4J

By Nik Seetharaman


I explore using graph analytics tools to visualize simulated attacks during the course of detections research to gain an intuition of what an attack is doing across multiple axes. I also discuss streamlining detections research workflows by automating Sysmon setup, teardown, and log export.


On the heels of DEFCON, Black Hat, and the overall 2018 security research season, there are a number of new offensive techniques I’m psyched to begin researching through the lens of the Blue Team, especially some of SpecterOps’ and Will Schroeder’s new C#-based tooling. 

One of the challenges in doing Blue Team and Detections research is being able to easily digest relationships among various entities on a system during the execution of a particular attack. Event Viewer isn't sufficient to gain such contextual understanding, and tools like Splunk and ELK aren't much better when trying to execute link-based analysis. Additionally, it's a pain to time-bound logs to the exact events you care about after running a simulated attack. Ideally this constraining would happen upstream of the analysis, such that a high percentage of the events we receive are directly related to the attack we are attempting to research.

In this post I’ll walk through how to use graph analytics with Neo4J to visualize what happens during execution of an attack, as well as how I think about the overall workflow of:

  • Firing the attack while only capturing the events we want
  • Extracting objects and relationships out of the event logs
  • Loading the objects and relationships into Neo4J to visualize them

The ultimate goal is to be able to execute a desired simulated attack (e.g. one of the many in Red Canary's Atomic Red Team) and then quickly and intuitively evaluate the footprint of that attack along various dimensions.


Developing the graph-based research workflow is reliant on the following:

1. A VM on which we will be running our atomic attack test.
2. An analysis machine separate from the victim VM.
3. Sysmon with a modified version of Swift on Security’s configuration file.
4. Batch scripts to automate Sysmon configuration, log clearing, and exporting of logs.
5. Python to parse the exported Sysmon logs and generate queries that we’ll need to use in Neo4J.
6. Neo4J Desktop and Neo4J Browser to visualize our resulting graphs.

Automating Sysmon Setup and Teardown

We can obtain a victim VM from Chris Long’s Detection Lab project and then install Sysmon with the config located here. 

We can also download additional configuration files for more tailored research from Olaf Hartong’s modular Sysmon project.

We can then utilize a batch script on the victim VM to automate the process of setting up Sysmon after specifying the configuration file that we want depending on our research goals. The modified Swift on Security configuration provided above is a good start. After we run the attack that we want to research and capture the relevant logs, another batch file can be used to tear down that configuration and return Sysmon to a baseline state.

The reason for this is that we may want to use several versions of a Sysmon configuration in sequence to evaluate the footprint of a given attack under each specific set of Sysmon rules. Some of those rulesets may cause high system load, so in order to preserve VM and host system performance, we'll want to revert Sysmon to a base state of not collecting anything when we're not actively running attacks.

That no-capture configuration will look like this:
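One way to express a no-capture baseline is an EventFiltering section whose rules are all empty include lists, which match nothing. A sketch (adjust schemaversion and the rule list to your Sysmon build):

```xml
<Sysmon schemaversion="4.1">
  <EventFiltering>
    <!-- "include" with no rules matches nothing: capture is effectively off -->
    <ProcessCreate onmatch="include" />
    <NetworkConnect onmatch="include" />
    <ImageLoad onmatch="include" />
    <ProcessAccess onmatch="include" />
    <FileCreate onmatch="include" />
    <RegistryEvent onmatch="include" />
  </EventFiltering>
</Sysmon>
```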

The following batch commands can then be used for the pre-attack setup:
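A sketch of that setup script (the file name and the choice of -c, which updates the config of an already-installed Sysmon, are assumptions about your environment):

```bat
REM setup.bat <config.xml> - clear old logs and apply the research config
REM (use "sysmon -accepteula -i" instead of -c if Sysmon is not yet installed)
wevtutil cl Microsoft-Windows-Sysmon/Operational
set CONFIG=%1
sysmon -c %CONFIG%
```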

In the first command, we use the Windows Events command-line utility (wevtutil) to clear any existing logs. Line 2 sets up a variable to capture the first user-specified command-line argument: the Sysmon configuration file we want to use. Line 3 applies that configuration to Sysmon.

After we run the atomic attack that we intend to research, we use the next set of batch commands in order to export logs, tear down and revert Sysmon to a no-capture baseline:
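A sketch of that teardown script (the output file name is illustrative):

```bat
REM teardown.bat - export logs, revert Sysmon to baseline, clear the channel
wevtutil qe Microsoft-Windows-Sysmon/Operational /f:xml /e:Events > sysmon_log.xml
sysmon -c nocapture.xml
wevtutil cl Microsoft-Windows-Sysmon/Operational
```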

Line 1 again uses wevtutil, this time to query the Sysmon logs and output an XML file that we will use later on. Line 2 reverts Sysmon to a no-capture state using a special configuration file called "nocapture.xml", and finally line 3 clears the logs.

(Note: instead of reverting to a no-capture state, we could simply uninstall Sysmon by running Sysmon -u. This, however, can take longer and induce latency into your workflow.)

We’ll now run the setup script, execute Casey Smith’s remote COM scriptlet execution attack (MITRE T1117) as our simulated attack, and then tear down Sysmon:

We now have the log data we want to visualize in the form of an XML file. The next step will be to extract entities and relationships between them in order to populate the graph.

Modeling Objects and Relationships

Moving from log analysis to graph analysis requires that we take the event logs in the XML output and extract entities and relationships from them that can then be displayed on the graph in the form of nodes and edges (links) between them. To do this, we'll use a Python script to turn each Sysmon event into a Neo4J Cypher command that expresses the relationships between the entities in that Sysmon event.

Cypher is Neo4J’s graph query language that can create graph objects and describe relationships between those objects using ASCII art syntax. For example, to express that calc.exe is a child process of cmd.exe, we would say:
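A sketch of that statement, using the Spawned relationship type that the resulting graph displays (the property keys are illustrative):

```cypher
MERGE (parent:Process {name: "cmd.exe"})
MERGE (child:Process {name: "calc.exe"})
MERGE (parent)-[:Spawned]->(child)
```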

This would show up in the Neo4J Graph as:

To do this for a Sysmon Process Create event, we would write the following to translate from the XML to Cypher:
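A sketch of the Process Create (Sysmon Event ID 1) translation, where fields is one event's EventData flattened into a dict. The property names follow Sysmon's schema (ProcessGuid, ParentProcessGuid, Image, ParentImage); the inline rsplit does what the post's getFilename helper does:

```python
def process_create_to_cypher(fields):
    """Render one Sysmon Process Create event as a Cypher MERGE statement."""
    child = fields["Image"].rsplit("\\", 1)[-1]        # getFilename-style trim
    parent = fields["ParentImage"].rsplit("\\", 1)[-1]
    return (
        f'MERGE (p:Process {{guid: "{fields["ParentProcessGuid"]}", name: "{parent}"}}) '
        f'MERGE (c:Process {{guid: "{fields["ProcessGuid"]}", name: "{child}"}}) '
        f"MERGE (p)-[:Spawned]->(c)"
    )
```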

In this case, we are using the command MERGE instead of CREATE so that Neo4J will bind relationships to objects if they already exist, or create objects and then bind relationships to them if they do not.

Each Sysmon Event ID will require merging potentially new types of objects (files for example) and then modeling the relationship between the Process GUID and the resultant object.

Let’s write XML to Cypher translations for most major Sysmon event types that we’d care about:
– Process Creates
– Network Connections
– Image Loads
– Process Accesses
– File Creates
– Registry Value Sets

We’ll also have a small helper function called getFilename to parse out a display-friendly name from the full path shown by Sysmon.
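A sketch of that helper; ntpath handles Windows-style backslash paths regardless of the platform the script runs on:

```python
import ntpath

def getFilename(path):
    """Return a display-friendly basename from a full image path."""
    return ntpath.basename(path)
```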


Parsing the XML Logfile

After creating our Cypher translation functions, we’re ready to parse the XML logfile containing the event data for Casey Smith’s Squiblydoo attack.
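A sketch of that parsing pass. The namespace is the standard Windows event schema used in wevtutil XML output; translators is a dict mapping Sysmon Event IDs to the translation functions described above (an assumption about how you wire the pieces together):

```python
import collections
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}
SUPPRESSED = ("MsMpEng.exe", "VBoxService.exe")  # Defender + Guest Additions noise

def parse_sysmon_xml(xml_text, translators):
    """Yield one Cypher statement per translatable Sysmon event."""
    root = ET.fromstring(xml_text)
    for event in root.findall("e:Event", NS):
        event_id = int(event.find("e:System/e:EventID", NS).text)
        # defaultdict(str) so missing fields read as empty strings
        fields = collections.defaultdict(str)
        for data in event.findall("e:EventData/e:Data", NS):
            fields[data.get("Name")] = data.text or ""
        # Event ID 10 (Process Access) is dominated by AV / Guest Additions
        if event_id == 10 and any(
            s in fields["SourceImage"] or s in fields["TargetImage"]
            for s in SUPPRESSED
        ):
            continue
        translator = translators.get(event_id)
        if translator:
            yield translator(fields)
```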

In the above code, we create a dictionary-like object using the collections.defaultdict() class to hold the Sysmon Event Data we need to extract entities and links. We then select the proper Cypher translation function depending on the Event ID and in the case of Process Access events, suppress anything with MsMpEng.exe (a Windows Defender process) and because I’m using VirtualBox, VBoxService.exe (VirtualBox’s Guest Additions service).

Our resulting Cypher queries look like so:


Generating the Graph

After downloading Neo4J and opening the Desktop client, we are presented with something similar to the following:

Click on New Graph and select “Create Local Graph.” Enter a name for the graph and set a password. Once the graph card populates with the new graph, click “Start.” Then click “Manage” and in the resulting window “Open Browser.”

This will bring you to the meat of the operation, the Neo4J Graph Browser:

Cypher commands may be run from the command bar at the top, with results of any commands / queries sequentially appearing in the results area below. New results will show up at the top of the result stack. To clear the stack, you can type “:clear” in the command bar.

To generate our graph, we’ll keep it simple and copy / paste our generated Cypher code into the command bar, and then hit CTRL+Enter to execute it. Each query will then run in sequence.

To view the resulting graph, we can click on the button labeled “*(9)” on the left under “Node Labels.”

The above graph serves as an intuitive visual reference for the different interactions between the objects involved when we ran our Squiblydoo attack, namely various Process Accesses, cmd.exe spawning regsvr32.exe, and regsvr32 subsequently spawning calc.exe. Finally, we have the fairly obvious network connection from regsvr32.exe to the external IP address from which it pulled down the remote scriptlet on GitHub.

The relationship type label on each link lets us know what the nature of the relationship between nodes is. In the case of this attack, we observe types of:

  • Accessed
  • Spawned
  • ConnectedTo

Note that you can constrain what is visualized on the graph by selecting Node Labels or a certain Relationship Type on the left. The result will be generated in a new panel. So if I only wanted to see Process Spawns, I would select the “Spawned” relationship type and only be presented with the cmd->regsvr32->calc flow.

To clear your work and start from scratch (i.e. to try a new cypher import), clear the current database by typing
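The standard Cypher statement for wiping every node and relationship is:

```cypher
MATCH (n) DETACH DELETE n
```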

Then type
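This is presumably the Browser's :clear command mentioned earlier, which empties the result stack:

```cypher
:clear
```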

It is left as an exercise to the reader to:

  • Implement the automation discussed
  • Run several other atomic red team tests and capture the event logs via Sysmon
  • Export the logs to XML and translate them to Cypher
  • Load the translated Cypher into Neo4J to visualize each attack

In the coming weeks I'll be working on a follow-on to this post focused on utilizing graph analysis to research the nuances of unmanaged PowerShell.