By Bill Emmett | August 26, 2021

This post was co-authored by Sharmin Yousuf, a summer 2021 product marketing management intern focused specifically on search engine optimization for Splunk's Application Performance Monitoring and Digital Experience Monitoring pages.

There is nothing worse than waking up to an angry customer complaining that your website is failing to accept their payment at checkout. Failed payments translate directly into lost revenue, so the pressure is on; with Tag Spotlight, however, this should be a relatively quick problem to dissect. The key question is whether all of our customers are affected or whether this is an isolated event.

What is Tag Spotlight?

Before we go any further, let's address the elephant in the room: what is Tag Spotlight, and how will it help solve problems? Tag Spotlight is a critical component of Splunk APM, enabled by its infinite-cardinality, full-fidelity architecture. Essentially, it is a one-stop shop for understanding service behavior, allowing users to visualize errors and latency for any given service through the lens of all the tags affiliated with it.
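Where do these tags come from? In an OpenTelemetry-instrumented service, each tag is simply a span attribute set by the application or its auto-instrumentation. The sketch below is a hypothetical Python example, not Splunk's or the demo app's actual code: the charge function, the payment_gateway object and the attribute names are assumptions made purely to show the kind of tags Tag Spotlight later aggregates.

```python
# Hypothetical example of how span tags reach Tag Spotlight: each tag is a
# span attribute set by the instrumented service. The function, attribute
# names and payment_gateway object are assumptions made for illustration.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("paymentservice")  # assumed instrumentation name

def charge(payment_gateway, card, amount, tenant_level):
    """Charge a card inside a span tagged with the customer's tenant level."""
    with tracer.start_as_current_span("/Charge") as span:
        span.set_attribute("tenant.level", tenant_level)  # e.g. gold / silver / bronze
        try:
            return payment_gateway.charge(card, amount)
        except Exception as exc:
            # Record the failure so it surfaces as an error (a pink peak) in Tag Spotlight.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, "failed to charge card"))
            raise
```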

Want to skip the how-to and see for yourself? Start a free trial of Splunk Observability Cloud instantly. No credit card required.

Now that we have a fundamental understanding of Tag Spotlight, let's look at our service map to see precisely where the problem lies. Splunk's service maps provide a high-level view of all the services in an application and visually display how those services interact with one another. The service map updates automatically, allowing DevOps teams to understand their apps in real time.

In the example below, we have an online store with several interconnected microservices. The service map uses color-coding to highlight the services currently experiencing errors, enabling SREs to identify the root cause of an issue at a glance.


Image 1: Interactive service map displaying errors that have occurred in the selected time frame (200 min).


Image 2: Service map legend.

Based on our service map (Image 1), we can immediately identify the culprit as the path from our frontend service through the checkout service and finally to the payment service. By selecting the payment service, we get a top-level view of Tag Spotlight (Image 3), where key details such as errors, latencies and the version number are displayed. In our case, several errors exist in the latest version (v350.10), which could be the source of our problem (Image 4).

Next, let's leverage Tag Spotlight to conduct a deep dive and determine exactly what is causing the error and whether our DevOps teams will need to allocate resources to resolve it.


Image 3: Paymentservice's top-level view of Tag Spotlight in the right pane.


Image 4: Paymentservice's top-level view of Tag Spotlight in the right pane, filtered by version number.


Image 5: Tag Spotlight view of the frontend service.

We can start by clicking into the Tag Spotlight view of the frontend service. Here we see several errors (the pink peaks on the graph in Image 5) that our customers experienced over the past 200 minutes. To dissect these errors further, we can filter out the successful requests and specify the exact span tags we are interested in, i.e., cart/checkout failures (Images 6, 7).


Image 6: Tag Spotlight view of the frontend service, filtered by errors.


Image 7: Tag Spotlight view of the frontend service, filtered by the cart/checkout endpoint.

Upon selecting an error (a pink peak), we can see the exact trace ID, the operation that initiated the error, the start time and all of the impacted services (Image 6). In an alert-storm situation, where there are errors galore, Tag Spotlight also links to the full set of traces via 'View More Traces in Full Trace Search.' There, you can slice and dice your trace data with any combination of tags you choose to better understand and assess the errors. With Splunk APM's unique NoSample™ full-fidelity tracing, you can rest assured that all your traces will be ingested and stored with no sampling whatsoever, allowing you to conduct a thorough analysis.
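Conceptually, "slicing and dicing by tags" is just grouping spans by tag combinations and counting. The snippet below is a minimal, made-up illustration of that idea in plain Python; it is not the Splunk APM API, and the span records are invented for the example.

```python
# Illustrative only: the kind of tag-based aggregation that Full Trace Search
# and Tag Spotlight perform for you. The span records below are made up.
from collections import Counter

spans = [
    {"endpoint": "/Charge", "version": "v350.10", "tenant.level": "gold",   "error": True},
    {"endpoint": "/Charge", "version": "v350.10", "tenant.level": "bronze", "error": True},
    {"endpoint": "/Charge", "version": "v350.9",  "tenant.level": "silver", "error": False},
]

# Count errors for every combination of tags we care about.
errors_by_tags = Counter(
    (s["endpoint"], s["version"], s["tenant.level"]) for s in spans if s["error"]
)

for tags, count in errors_by_tags.most_common():
    print(tags, count)
```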

For our purposes, let's track trace #130f2bb7928cbfd0 across the impacted services. From the frontend service itself, we can see that a customer hit the cart/checkout endpoint at 10:10 a.m. and that the entire process lasted 54.7 minutes (Image 6). Following the trace downstream, the same trace appears exactly 54.7 minutes later, at 11:04 a.m., in the checkout service (Image 8).


Image 8: Tag Spotlight view of the checkout service.


Image 9: Failure-to-charge-card errors in the checkout service.

Taking a closer look at the errors in the checkout service, we can see that they are all affiliated with the /placeOrder endpoint. Hovering over the error message tag specifically, we can confirm that the errors are identical and all stem from a failure to charge customer credit cards (Image 9).

Similarly, the latency view of Tag Spotlight provides detailed insight into how long a process has been running (Image 10). The percentiles (p50, p90, p99) help quantify how many users are affected by a slow endpoint and serve as SLIs (service level indicators). In our case, the checkout service has been awaiting a response for 54.7 minutes.


Image 10: Latency corresponding to the errors in the checkout service.
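To make the percentile view concrete, here is a minimal sketch of what p50, p90 and p99 represent, computed with a simple nearest-rank method over a made-up set of request durations (the values, including the 54.7-minute outlier, are invented for illustration).

```python
# Illustrative only: how latency percentiles (p50/p90/p99) summarize a set of
# request durations. The sample values below are made up; the last one mimics
# the 54.7-minute outlier seen in the checkout service.
latencies_ms = [120, 135, 150, 180, 210, 260, 320, 400, 900, 3_282_000]

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```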


Image 11: Tag Spotlight view of the payment service.

Continuing our investigation into the payment service, we are immediately greeted with a storm of errors. To better visualize the behavior of trace #130f2bb7928cbfd0, we can zoom down to the minute of origin (10:10 a.m.) to get a detailed view of the trace (Images 11, 12, 13).


Image 12: Tag Spotlight view of errors in the selected time frame (highlighting trace #130f2bb7928cbfd0).


Image 13: Tag Spotlight view of latency in the selected time frame (highlighting trace #130f2bb7928cbfd0).

Now that we have a granular view of our trace, we can filter for the exact endpoint causing the error, /Charge. Here, we see that the error is not restricted to a single tenant level: it persists across all three tiers (gold, silver and bronze). This is definitely a cause for concern. With all tenant levels facing errors, it is clear that this is not an isolated event but an issue with our payment service itself.

Zooming back out to the original time frame (the last 200 minutes), we see that the errors exist only in v350.10 (Image 14). We can go one step further and filter by the version tag to verify the behavior of each version. Version 350.9 appears spotless, with 0% errors (Image 15), whereas v350.10 shows a 100% error rate (Image 16).


Image 14: Tag Spotlight view of the payment service in the past 200 minutes, including all active versions.


Image 15: Tag Spotlight view of v350.9 in the past 200 minutes (0 errors in 396 requests).


Image 16: Tag Spotlight view of v350.10 in the past 200 minutes (818 errors in 818 requests).
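A per-version comparison like this only works if every span carries a version tag. One common way to get there with OpenTelemetry is to attach the version as a resource attribute when the tracer is set up, as in the hypothetical Python sketch below (the service and attribute names are assumptions, not the demo app's actual configuration).

```python
# Hypothetical setup: attaching a version tag to every span via an
# OpenTelemetry resource attribute, so Tag Spotlight can break errors
# and latency down per version. Names here are assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "paymentservice",  # assumed service name
    "version": "v350.10",              # the tag compared in Images 15 and 16
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```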

In conclusion, it appears that the root cause of the issue is in v350.10 of the payment service across all tenant levels. We can temporarily address the situation by diverting customer traffic to v350.9 while our payment service owners work on remediating v350.10.

How Can I Get Started with Tag Spotlight?

Tag Spotlight is a key feature of Splunk APM, which is part of the Splunk Observability Cloud. You can sign up for a free trial of the full suite of products, from Infrastructure Monitoring and APM to Real User Monitoring and Log Observer. Get a real-time view of your environment and start solving problems with your microservices faster today!
