By José Enrique Hernandez December 04, 2020

It was over a month ago that I promised we would tie together Splunk Security Content and the Splunk Attack Range to automatically test detections. Ultimately, using these projects together in a Continuous Integration / Continuous Delivery (CI/CD) workflow with CircleCI brings the rigors of software development to the SOC and truly treats detection as code.

Well, I want to share how we failed at achieving this goal. Not many in our industry talk about failures, but in my opinion, if you are not failing, you are not making progress. Let me share what our original plan was, how we realized it was going to fail in the long term, and why we decided to scrap it.

In 'CI/CD Detection Engineering: Splunk's Security Content, Part 1' we shared how the Splunk Security Content project can be used as a repository for treating Splunk detections as code. In 'CI/CD Detection Engineering: Splunk's Attack Range, Part 2' we discussed how the Attack Range allowed us to test these detections in a replicable environment. Our original goal for part 3 of this series was to tie these two projects together using the newly released Attack Range test files and eventually test detections in a CI/CD workflow. Spoiler alert: we failed.

Here are the three main reasons why the approach failed:

  1. Testing detections per Pull Request caused CI jobs to queue up and rendered the testing CI pipeline unusable.
  2. Putting all tests into a single nightly job resulted in a testing time that surpassed the CircleCI job timeout limit.
  3. When multiple test job executions failed, Attack Range components were not properly cleaned up, which caused us to hit AWS resource limits.

Let's dig into how our first approach was architected. First, a new argument was added to the Attack Range that would ingest a test file with predefined configurations. You can see an example of a test file below.
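
The sketch below is illustrative rather than the exact schema, but a test file has roughly this shape:

```yaml
# Hypothetical Attack Range test file -- field names, values, and paths are illustrative
simulation_technique: T1003.002                     # ATT&CK technique to simulate with Atomic Red Team
target: attack-range-windows-domain-controller      # Attack Range machine to run the simulation against
detections:                                         # detections to test, each with a pass/fail condition
  - name: Detect Credential Dumping through LSASS access
    file: detections/endpoint/detect_credential_dumping_through_lsass_access.yml
    pass_condition: "| stats count | where count > 0"
```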

Specifically, the key arguments are:

  • Target: the Attack Range machine to attack
  • Simulation Technique: the ATT&CK technique to launch
  • Detections: an array of detections to test, each with a pass/fail condition

We created a few of these test files under their respective MITRE ATT&CK techniques in the Security Content repo as we slowly tested them.

The Attack Range was modified to ingest these test files and run through the following process for testing: build an environment, simulate the technique associated with the detection using Atomic Red Team, run the detection, and evaluate its results against the pass/fail condition. Below is a visual representation of this process:

The final piece of our plan was generating a CircleCI task for each of these test files that executed the above process. For this, we created a simple script called ci-generate.py that reads every file under the /test folder in Security Content and creates a CircleCI task for it under the CircleCI job test-detections. The task looks like this:
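
Below is a rough sketch; the executor, step names, and Attack Range command-line flags are illustrative:

```yaml
# Sketch of the generated test-detections job -- executor, names, and flags are illustrative
jobs:
  test-detections:
    docker:
      - image: circleci/python:3.7
    steps:
      - checkout
      - run:
          name: T1003.002 credential dumping          # one task generated per test file
          command: python attack_range.py test -tf tests/T1003.002/credential_dumping.yml
          no_output_timeout: 60m
      - run:
          name: T1053.005 scheduled task
          command: python attack_range.py test -tf tests/T1053.005/scheduled_task.yml
          no_output_timeout: 60m
```

Each additional test file found under /test becomes another run step appended by ci-generate.py.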


The First Failure ⏲❌

We could not run each of these detection tests per PR since execution took over 30 minutes per test file.

To circumvent this, we first started queuing incoming jobs per PR, but this quickly became unusable once we had 10+ jobs queued and a test wait time of 16 hours. On our second attempt, we decided to test the detections nightly instead of per PR. To run our detection tests daily, we added a scheduled workflow to our CircleCI configuration file that runs the detections. The workflow definition looks like this:
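
Here is a rough sketch, with an illustrative cron expression and branch filter:

```yaml
# Nightly workflow sketch -- schedule time and branch are illustrative
workflows:
  version: 2
  nightly-detection-testing:
    triggers:
      - schedule:
          cron: "0 6 * * *"          # run once a day
          filters:
            branches:
              only:
                - develop
    jobs:
      - test-detections
```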

In short, we planned to have our Threat Research team (or anyone in the community) open a PR for new detections with their corresponding test files. After merging the PR, we would run the ci-generate.py script and update the /.circleci/config file with a new task under the test-detections job so that the corresponding test file would be executed in the nightly workflow. Note that each task simply executes our newly created Attack Range test flags. The overall logical process that we expected was:

The Second Failure ⏰

When we started building our library of tested detections, it became obvious that our current approach would not scale. After 12 detection test files, our nightly test-detections CI job started failing consistently. This particular run tells the full story of why:

The job ran for 5 full hours, tested only 10 detections, and then timed out. We learned that day that CircleCI has a maximum job time limit of 5 hours. After much analysis, I was content at this point with calling this approach a failure, but the truth was we were not done dealing with issues.

The Third Failure 🧟‍♂️

An after-effect of moving to nightly jobs was that we did not catch when things had gone wrong until our next working day ☀️. When nightly jobs failed, there were occasions when a test would crash or fail and the next test would simply begin. Each failed or crashed test left behind a tainted Attack Range environment on AWS ⛈. After several job failures, our AWS account started hitting limits ❌ on available resources like VPCs, EIPs, and the EC2 instances allowed in the region. These zombie Attack Ranges were extremely labor-intensive to clean up; it entailed an engineer manually removing all the pieces created by Terraform during the build process. To circumvent this, we added a reaping task that only executed if a test failed or crashed. This reaper ran at the end of all the tests using the condition when: on_fail. You can see an example below:
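
Here is a rough sketch of such a reaper, attached as a final step of the job; the exact destroy command and its flags are illustrative:

```yaml
# Reaper step sketch -- only executes when a previous step has failed
- run:
    name: reap zombie attack range
    when: on_fail
    command: python attack_range.py -a destroy      # illustrative; tear down whatever Terraform built
```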

Lessons Learned

Even after addressing the zombie 🧟‍♀️ Attack Ranges and moving to nightly jobs to avoid exploding our job queue, we still could not get around the CircleCI maximum job time limit of 5 hours. At this point, we realized that our attempt at using CircleCI to automate our tests was a failure and started thinking of a better solution. Furthermore, we learned a few lessons on how to improve the stability of our jobs and reduce their execution time.

In part 4 of this series, we will share how we solved the problems above by drastically changing how we approached our testing workflow. For starters, we decided to leave behind the idea of needing a complete Splunk Attack Range for every test and instead broke attack data generation off into its own project. During Splunk .conf20 in October, Patrick Bareiss and I announced the Splunk Attack Data Repository in our talk 'SEC1392C Simulated Adversary Techniques Datasets for Splunk.' If you want a preview of the next part of this series, I highly recommend you watch it. Splunk Threat Research is now in the process of testing this new service to work out its bugs. Stay tuned for part 4 of this series once testing is completed.

