This blog series unpacks everything from best practices to guiding philosophies for delivering great analytics experiences.
Monitoring comes up often with our customers. While monitoring is common practice in other parts of the tech world, particularly wherever software engineering efforts are present, it generally receives less emphasis on the data side of things.
However, we have seen many times that effective monitoring (and the proactive response it enables) can be the differentiator between successful self-service analytics and a slow, potentially painful transition back to report-factory patterns. Understandably, each of you may have individual requirements and stack components, but the general principles are very similar.
The Advantages of Stack Monitoring
So, why do we do it? What sort of tangible benefits can we get out of this? To answer these questions, let us go back to the reason we do analytics: we analyse data to get value from it. We believe most of that value comes from the proactive response that monitoring enables, which in turn lets us utilise and promote self-service analytics.
Self-service analytics can bring tremendous value to a company, but it can be challenging to do at scale. Most issues stem from analytics platforms becoming too convoluted to maintain and increasingly difficult to use over time. This can lead to governance and performance problems, which could significantly slow down the development of the overall platform and cause further centralisation. You may be able to see how this could become a bit of a death spiral.
Furthermore, tech constantly evolves, and companies need the ability to stay on top of it. Effective automated monitoring allows us to make better decisions around which tools work best for each use case and to perform more accurate capacity planning for both licensing and infrastructure. This is by no means an exhaustive list of benefits, but hopefully it sheds some light on the most important reasons our customers are looking into this topic.
Implementing Your Monitoring
To better understand the components that go into a reference architecture for BI, we suggest reading James Austin’s blog from the “What Makes Good Analytics” series. It outlines a few example components that we consider best-of-breed for BI. These are a great starting point for greenfield deployments or if you are just looking for some inspiration. Preferences for specific tools aside, your BI stack should at least support the following processes:
- ELT/ETL
- Warehousing
- Analytics/BI
Each of these processes performs fundamentally different tasks, so we need to keep that in mind when determining our monitoring standards, like SLOs and SLIs. Usually, it is best to start with the simplest metrics. A few examples are listed below, followed by a sketch of how you might compute some of them:
ETL/ELT
- How many jobs ran successfully/unsuccessfully?
- The average duration of these jobs and their standard deviation
- Rows of data processed and their standard deviation
Warehousing
- Average/median query time + top 10 users with the worst-performing queries
- Average/median number of unique users
- Overall warehouse load
Analytics/BI
- How many active users in the last 1w, 1m, 3m + max concurrency?
- Backgrounder utilisation
- Underlying infrastructure metrics (e.g., CPU, RAM, disk space)
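To make this concrete, here is a minimal sketch of computing the ETL/ELT metrics above with pandas, assuming your tool can export a job-history table. The file name and columns (job_name, status, duration_seconds, rows_processed) are hypothetical stand-ins for whatever your platform actually exposes:

```python
import pandas as pd

# Minimal sketch: baseline ETL job metrics from a job-history export.
# Column names below are hypothetical -- substitute whatever your
# ELT/ETL tool actually provides.
jobs = pd.read_csv("job_history.csv")

# How many jobs ran successfully/unsuccessfully, per job
outcome_counts = jobs.groupby(["job_name", "status"]).size().unstack(fill_value=0)

# Average duration and its standard deviation, per job
duration_stats = jobs.groupby("job_name")["duration_seconds"].agg(["mean", "std"])

# Rows processed and their standard deviation, per job
row_stats = jobs.groupby("job_name")["rows_processed"].agg(["mean", "std"])

print(outcome_counts)
print(duration_stats.join(row_stats, lsuffix="_duration", rsuffix="_rows"))
```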
Your list of metrics will more than likely grow over time, but just like with any analytics, remember to utilise a combination of subscriptions and alerts for the most important ones to ensure timely responses.
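For the alerting half, one lightweight pattern is a threshold check that posts to a Slack incoming webhook. A rough sketch, with a placeholder webhook URL and an arbitrary threshold:

```python
import requests

# Hypothetical threshold alert: post to a Slack incoming webhook when
# the daily ETL failure rate crosses a limit. The webhook URL is a
# placeholder; the failure rate would come from your own pipeline.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
FAILURE_RATE_THRESHOLD = 0.05  # alert when more than 5% of jobs fail

def alert_if_needed(failure_rate: float) -> None:
    if failure_rate > FAILURE_RATE_THRESHOLD:
        message = (f":rotating_light: ETL failure rate at {failure_rate:.1%} "
                   f"(threshold {FAILURE_RATE_THRESHOLD:.0%})")
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

alert_if_needed(failure_rate=0.08)
```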
How to Implement Your Monitoring Tools
The tools you might use to monitor your stack can be split into two camps, both arguably equally important:
- Infrastructure monitoring
- Application monitoring
We suspect that most companies already employ some sort of infrastructure monitoring. Generally speaking, if a company uses on-premises virtualisation like Hyper-V or VMware, a monitoring system for those virtual machines is usually already in place.
Some companies simply use whatever tools come natively with these virtualisation platforms or employ some other internal tool to collect machine data like availability, performance and resource utilisation of hosts. When a company uses a public cloud like AWS, Azure or GCP, the situation is usually more clear cut, as each of the major public clouds has a native monitoring service: Amazon has CloudWatch, Microsoft simply calls theirs Azure Monitor, and GCP has what used to be called Stackdriver, now Cloud Monitoring & Logging as part of Google Cloud’s operations suite.
Each of these has robust documentation and will let you get a plethora of information out of your cloud resources. Ultimately, an internal investigation should be made to figure out which tool is being leveraged and who has access to it and, most importantly, to set alerts so the responsible engineers know when something isn’t quite right.
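As an example of what “set alerts” can look like on AWS, here is a minimal sketch that creates a CloudWatch alarm with boto3; the instance ID and SNS topic ARN are placeholders:

```python
import boto3

# Sketch: alarm on EC2 CPU utilisation so the responsible engineers
# are notified before users complain. Instance ID and SNS topic ARN
# are placeholders.
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="bi-server-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                # evaluate in 5-minute windows
    EvaluationPeriods=2,       # two consecutive breaches before alarming
    Threshold=80.0,            # alert above 80% average CPU
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:bi-alerts"],
)
```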
Typically, engineers use an infrastructure monitoring platform to assess whether their backend components are causing a user-facing problem, such as errors when attempting to load a dashboard or degraded performance. Ideally, application monitoring is employed so that engineering can get in front of the problem before users even complain.
There are several options here. Some BI tools, like Tableau, have native tooling: Tableau’s Resource Monitoring Tool (part of the Server Management add-on), or the older open-source TabMon, which might be more suitable for smaller Windows deployments. These types of tools are great because they are tailor-made solutions that usually give you the greatest amount of detail about what is going on inside the application.
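For lighter-weight questions, you do not necessarily need a dedicated tool. As one illustration (not a Tableau-prescribed approach), here is a rough sketch that pulls user sign-in recency over Tableau’s REST API with the tableauserverclient library; the server URL and personal access token details are placeholders:

```python
import tableauserverclient as TSC

# Rough sketch: list users and their last sign-in via Tableau's REST
# API, a quick proxy for the "active users" metric mentioned earlier.
# The server URL and token values are placeholders.
server = TSC.Server("https://tableau.example.com", use_server_version=True)
auth = TSC.PersonalAccessTokenAuth("monitoring-token", "token-secret-value", site_id="")

with server.auth.sign_in(auth):
    for user in TSC.Pager(server.users):
        print(user.name, user.last_login)
```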
Other parts of the stack usually do not have something native available, so we have to get a bit more creative. Some common options include a tool like the open-source Grafana and a variety of data sources that feed data into it (e.g., Prometheus or Telegraf as data collectors, InfluxDB for time-series data storage).
Grafana is very customisable; you can even use the paid Snowflake plugin to query Snowflake stats directly. Notifications are, of course, a built-in feature that lets you raise alerts via Teams, Slack or email. This wonderful customisability, however, lends itself to more advanced use cases, as Grafana can be a challenge to set up correctly without previous experience. It certainly helps if Grafana already exists in the business and only the BI integration piece is required.
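On the collection side, here is a minimal sketch of exposing a custom BI metric for Prometheus to scrape (and Grafana to chart) using the official Python client; the metric name and its placeholder value are illustrative only:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Sketch: expose a custom BI metric for Prometheus to scrape. The
# metric name is hypothetical, and in practice you would query your
# BI platform's repository or API rather than set a random value.
ACTIVE_SESSIONS = Gauge("bi_active_sessions", "Current active BI sessions")

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        ACTIVE_SESSIONS.set(random.randint(0, 50))  # placeholder value
        time.sleep(15)
```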
Alternatively, companies like Datadog or Splunk approach this problem in their own unique ways. If it all sounds a bit convoluted, there’s always the option of using Tableau, or the BI tool of your choice, to build custom dashboards that do something similar. If Matillion and Snowflake are present, InterWorks has already built such dashboards for our customers, so please get in touch if you’d like to see some examples. And since subscriptions and alerts are native functionality in Tableau, it is relatively easy to leverage those features for notifications.
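If you do go down the custom-dashboard route with Snowflake, the underlying queries can stay simple. A hedged sketch against Snowflake’s ACCOUNT_USAGE share, mirroring the “top 10 worst-performing users” metric from earlier (connection details are placeholders):

```python
import snowflake.connector

# Sketch: pull warehousing metrics straight from Snowflake's
# ACCOUNT_USAGE share -- the kind of query a custom monitoring
# dashboard could sit on. Connection details are placeholders.
conn = snowflake.connector.connect(
    account="your_account",  # placeholder
    user="your_user",        # placeholder
    password="...",          # use a proper secret store in practice
)

query = """
    SELECT user_name,
           AVG(total_elapsed_time) / 1000 AS avg_query_seconds,
           COUNT(*)                       AS query_count
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    GROUP BY user_name
    ORDER BY avg_query_seconds DESC
    LIMIT 10
"""

for row in conn.cursor().execute(query):
    print(row)
```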
Ethics in Monitoring
Last but not least, we believe in transparency. It’s usually a good idea to expose relevant information to interested parties in a self-service, governed way, both to speed up troubleshooting and to foster a sense of community. Ideally, there’s also a centralised, automated status page (e.g., Statuspage) that displays current and historic availability along with any upcoming maintenance windows.
This can foster innovation and trust, allowing you to truly benefit from self-service analytics. Plus, it never hurts to reduce some of the unnecessary tickets that might be raised because the user missed a maintenance notification email. 🙂