By Paul Davies January 15, 2021

With Amazon Kinesis Data Firehose being Splunk's preferred option for collecting logs at scale from AWS CloudWatch Logs, we've seen plenty of posts on setting it up, automating it, and transforming event content. But what about when things go wrong?

When Kinesis Firehose fails to write to Splunk via the HTTP Event Collector (HEC), whether due to a connection timeout, an HEC token issue or another connectivity problem, it will eventually write the affected logs to a 'splashback' S3 bucket to ensure that no data is lost. However, if you wish to retry sending the contents of that bucket back into Splunk, you will find that each record Firehose writes there is wrapped in JSON containing additional information about the failure, with the original message base64 encoded.

This makes re-ingesting these 'failed' logs a little more complex than simply using the Splunk Add-On for AWS, for instance, as the wrapped, encoded messages cannot be decoded directly on ingest into Splunk. Also note that Firehose cannot ingest directly from S3.
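To illustrate, here is a minimal Python sketch of decoding the original payload from one such failure record. The field names (rawData, errorCode and so on) reflect the wrapper format Firehose writes for failed Splunk deliveries, but the values below are made up for the example; check an object in your own bucket to confirm the exact layout.

```python
import base64
import json

# Example failure record, roughly as Firehose writes it to the 'splashback'
# bucket. The values here are made up; 'rawData' holds the original event,
# base64 encoded.
failed_record = json.loads("""
{
  "attemptsMade": 4,
  "errorCode": "Splunk.ConnectionTimeout",
  "errorMessage": "Failed to deliver to Splunk or to receive an acknowledgment.",
  "rawData": "eyJtZXNzYWdlIjogIm9yaWdpbmFsIGxvZyBldmVudCJ9"
}
""")

# Decode the base64 payload back into the event Firehose originally tried to send via HEC.
original_event = base64.b64decode(failed_record["rawData"]).decode("utf-8")
print(original_event)  # {"message": "original log event"}
```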

This blog describes two simple options for re-ingesting these logs using Lambda functions:

  1. one routing the events through the Splunk Add-On for AWS
  2. the other sending the messages back into a Firehose data stream

These solutions can work with both Splunk Enterprise (on-premises or in your own cloud) and Splunk Cloud.

The Splunk Add-On for AWS route

The main component of this solution is a simple Lambda function that makes ingest via the Add-On possible. Once set up, the function is triggered when objects containing the failed logs from Firehose are written to the S3 bucket. The function reads the contents of the object, extracting and decoding the 'raw content' that Firehose attempted to send via HEC, then writes the output back into S3. It is written back to the same bucket, but as an object under the SplashbackRawFailed/ prefix.
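The full sample function is in the GitHub repo linked at the end of this post, but as a rough sketch of the idea (assuming the splashback objects contain one JSON failure record per line), it could look something like this:

```python
import base64
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# Simplified sketch only: assumes each splashback object holds one JSON failure
# record per line, and that the S3 notification may also fire for objects this
# function itself writes.
def lambda_handler(event, context):
    for notification in event["Records"]:
        bucket = notification["s3"]["bucket"]["name"]
        key = unquote_plus(notification["s3"]["object"]["key"])

        # Guard against re-processing objects this function has already written.
        if key.startswith("SplashbackRawFailed/"):
            continue

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Decode the base64 'rawData' of each failure record back into the
        # original payload Firehose tried to send to HEC.
        decoded_events = []
        for line in body.splitlines():
            if line.strip():
                record = json.loads(line)
                decoded_events.append(base64.b64decode(record["rawData"]).decode("utf-8"))

        # Write the recovered events back to the same bucket under the prefix
        # the Add-On's S3 input will be configured to read from.
        s3.put_object(
            Bucket=bucket,
            Key="SplashbackRawFailed/" + key,
            Body="\n".join(decoded_events).encode("utf-8"),
        )
```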

These objects can then be ingested by the Splunk Add-On for AWS using the standard inputs and configuration for S3 ingest - we would recommend using the SQS-based S3 input.

The flow of data for a 'failed' scenario, as shown in the diagram above, is as follows:

  1. Initial logs are generated and written to a CloudWatch log group, and a subscription filter on the log group streams them into the Firehose delivery stream (a minimal subscription-filter sketch follows this list). Optionally (1b), a Lambda function does some processing/transformation on the log events.
  2. Firehose attempts to write a batch of events to Splunk via HEC. For this example, there's a failure to connect.
  3. After a retry and timeout period, the failed events are written to the 'splashback' S3 bucket.
  4. An object 'Put' notification from S3 triggers the Lambda function, which reads the failed events from the object, decodes the content and writes the events back into S3 in their original format.
  5. An S3 object 'Put' notification is sent to SNS and subsequently into an SQS subscription.
  6. The SQS-based S3 input in the Add-On for AWS reads the logs from the S3 object and writes them to Splunk. (The Add-On would usually run either on a Heavy Forwarder or on an Inputs Data Manager in Splunk Cloud.)
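For completeness, step 1 above (wiring a log group into Firehose with a subscription filter) can be scripted. A minimal boto3 sketch, with placeholder names and ARNs, might look like this:

```python
import boto3

logs = boto3.client("logs")

# Placeholder log group name, delivery stream ARN and IAM role ARN; substitute your own.
logs.put_subscription_filter(
    logGroupName="/my/application/log-group",
    filterName="firehose-to-splunk",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/splunk-firehose",
    roleArn="arn:aws:iam::123456789012:role/CWLtoFirehoseRole",
)
```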

The Firehose Re-ingest route

This solution is very similar to the previous method and uses a Lambda function to read from the S3 'splashback' bucket. However, rather than writing the output into S3, the function writes back into a Kinesis Firehose data stream. The advantage of this method over the first is that the data collection method into Splunk doesn't change, and no Add-On configuration is required.

For this method, although technically it would be possible to re-ingest back into the same Firehose, a separate dedicated 're-ingest' Firehose data stream is recommended. This has two advantages: it gives you the option of sending the events to a separate Splunk HEC token input (or even a separate instance), and it can also provide a 'generic' retry capability shared by any number of Firehoses. (Note that the sample code takes this generic approach.)
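As a rough sketch of this variant (again, the full sample is in the GitHub repo linked at the end of this post), the function could decode the failure records as before and push the recovered events into the retry delivery stream; the RETRY_STREAM environment variable name here is an assumption for illustration:

```python
import base64
import json
import os
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

# Hypothetical environment variable naming the dedicated retry delivery stream.
RETRY_STREAM = os.environ.get("RETRY_STREAM", "splunk-retry-firehose")

def lambda_handler(event, context):
    for notification in event["Records"]:
        bucket = notification["s3"]["bucket"]["name"]
        key = unquote_plus(notification["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Recover the original payloads from the base64 'rawData' fields
        # (again assuming one JSON failure record per line).
        records = []
        for line in body.splitlines():
            if line.strip():
                failed = json.loads(line)
                records.append({"Data": base64.b64decode(failed["rawData"])})

        # PutRecordBatch accepts at most 500 records per call, so send in chunks.
        for i in range(0, len(records), 500):
            firehose.put_record_batch(
                DeliveryStreamName=RETRY_STREAM,
                Records=records[i:i + 500],
            )
```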

The flow of data for a 'failed' scenario, as shown in the diagram above, is as follows:

  1. Initial logs are generated and written to a CloudWatch log group, and a subscription filter on the log group streams them into the Firehose delivery stream. Optionally (1b), a Lambda function does some processing/transformation on the log events.
  2. Firehose attempts to write a batch of events to Splunk via HEC. For this example, there's a failure to connect.
  3. After a retry and timeout period, the failed events are written to the 'splashback' S3 bucket.
  4. An object 'Put' notification from S3 triggers the Lambda function, which reads the failed events from the object and decodes the content.
  5. The function writes the events back into a 'retry' Firehose (this separate Firehose data stream is not shown on the diagram).
  6. By now, Firehose connectivity to Splunk has hopefully recovered. If not, the events loop back through the retry Firehose, following the same process until the number of retry attempts has been exceeded. (At that point the messages are written to the original S3 bucket under the SplashbackRawFailed/ prefix, as in the first solution.)

This solution is the recommended option, although note that during a very prolonged disconnect between Firehose and Splunk HEC, the re-ingest volume, and therefore the data load on the retry Firehose, may be significant and exceed a single Firehose's capacity. This is unlikely in most cases, as disconnects (especially to Splunk Cloud) rarely last long. The example function provides a 'timeout' mechanism for looping retries (a maximum of 9 attempts, which could take up to 18 hours); this prevents continuous looping when there is a total loss of connectivity to Splunk. In the event of a full timeout, the events are written (decoded, not base64 encoded) to S3 in the same way as in the first option.

Full details of the setup instructions and the source code for the sample Lambda functions can be found here: https://github.com/pauld-splunk/aws-splunk-firehose-error-reingest

Happy Splunking!
