In part one of this cloud app observability blog series, we provided a general 101 overview, defining important terms, the basic monitoring stack and the most common cloud metric types before discussing implementation and best practices. This post will compare different cloud monitoring tools to help readers determine the correct choice for their cloud project.
Once you start reading about the various tools and platforms, the simple picture quickly gets complicated, because the tools a service offers overlap heavily in functionality. With most cloud metrics and cloud monitoring tools, the functional boxes we discussed in the last post aren’t as sharply defined, and in some cases one tool may span the entire stack. Consequently, those responsible for designing, describing and managing the technical solution often struggle to determine what data to produce and where to transform and store it.
Comparing Cloud Monitoring Tools
Let’s look at some specific cloud monitoring tools at our disposal. This article will focus on three particular packages: Datadog, Prometheus and Grafana. While including AWS CloudWatch and Google Cloud’s operations suite would be tempting, this blog post is focused on cloud service independence and will therefore not discuss these options. However, many of the concepts are similar.
Easily the cat’s meow of cloud monitoring-as-a-service offerings, Datadog is designed as a turnkey solution that encompasses the entire basic stack and then some. Before we dive into details on each layer, here is the summary: Datadog provides a full-stack solution that includes data ingest, transform, database/storage, query, transform, display and sourcing to other tools. The platform also provides more advanced features, including application tracing, performance monitoring and alerting.
|A tag in Datadog means the same thing as a tag in our common language: a simple key/value pair associated with a metric, span, log or other Datadog type.
|Attributes are prescribed key-value pairs used by the Datadog toolchain. Attributes can include things like host, source, status, etc.
|A facet is an application-specific, user-defined tag used in the context of Datadog spans, logs and metrics.
Data Sourcing on Datadog
Ignoring tracing, which is a different use case from metrics, Datadog provides three primary data options: logs, metrics and events. Data is collected in several ways, but the most common is installing the Datadog cluster agent into Kubernetes or on an EC2 instance. The agent automatically collects logs from stdout on all instances, and aggregates and uploads metrics across running instances on the host machine. Datadog also provides SDKs so services can send application-specific metrics. Other architecture options include a Datadog agent on each “hardware” instance (again scraping stdout of hosted services) or a software SDK that allows a service to log directly.
Naturally, we’ll go with the cluster or “hardware” agent options, as these require no changes at the code level and are easy to install and manage. The agents can be configured to automatically append tags to all logs and metrics, such as instance ID, environment, service name, etc.
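To make the tagging idea concrete, here is a minimal sketch of what an agent conceptually does when it appends tags to a record. It is plain Python rather than the Datadog agent or its API; the function name `enrich`, the tag keys and the choice to let application tags override agent defaults are all assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical host-level tags an agent might be configured with.
DEFAULT_TAGS = {"env": "prod", "service": "checkout", "instance_id": "i-0abc"}


def enrich(record: Dict[str, Any], default_tags: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of the record with agent-level tags merged in.

    In this sketch, tags already set by the application win over agent defaults.
    """
    merged = {**default_tags, **record.get("tags", {})}
    return {**record, "tags": merged}


metric = {"name": "orders.processed", "value": 1, "tags": {"region": "us-east"}}
enriched = enrich(metric, DEFAULT_TAGS)
```

The point of doing this in the agent rather than in the service is that the application code never needs to know which environment or instance it is running on.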
Datadog also provides some unique metric types and other constructs that applications may use, such as those below.
|Events indicate the occurrence of an action that may be useful to alert on or otherwise track. An event could also be implemented as a gauge metric using a value of 0 or 1 to indicate something happened, but events are a more direct way for Datadog to show that something has occurred or gone wrong. Generating these events across the streaming pipeline with shared tags lets you aggregate the data easily and present it as a list, so support can immediately see the problem and where it is in the chain.
|A rate metric is a counter normalized to the number of occurrences per second. A rate could be emulated using a counter and grouping the counts by second, but rates are so commonly used in dashboards and alerts that having a dedicated type is quite useful and saves some time.
|A set counts the number of unique values submitted for a given key. For example, you could pass a user’s ID on each request, and the set would report how many distinct user IDs were seen in the interval. Sets are useful but should be used cautiously when tracking a large number of keys.
|A distribution is basically a histogram. Unlike a histogram, a distribution is not aggregated agent side. Instead, the raw data is sent server side and aggregated there, which makes a distribution useful for aggregating information across a service as a logical whole rather than as per-instance subsets. Distributions allow percentiles to be calculated across the whole service and enable customized tagging options.
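The semantics of these types are easier to see with numbers. The sketch below emulates a rate, a set and the difference between per-instance and pooled percentiles in plain Python. It does not use the Datadog SDK; the function names, the sample data and the nearest-rank percentile definition are assumptions chosen for illustration.

```python
import math
from typing import Dict, Iterable, List


def rate_per_second(count: int, interval_seconds: float) -> float:
    """Rate: occurrences in an interval, normalized to per-second."""
    return count / interval_seconds


def set_cardinality(values: Iterable[str]) -> int:
    """Set: number of distinct values seen in the interval."""
    return len(set(values))


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Raw latency samples (ms) from two hypothetical service instances.
per_instance: Dict[str, List[float]] = {
    "instance-a": [10, 12, 11, 13],
    "instance-b": [200, 220],
}

# Histogram-style: each instance aggregates its own samples first.
p50_by_instance = {name: percentile(vals, 50) for name, vals in per_instance.items()}

# Distribution-style: raw samples are pooled, then aggregated once.
pooled = [s for vals in per_instance.values() for s in vals]
p50_global = percentile(pooled, 50)
```

Here the per-instance medians are 11 and 200, while the pooled median is 12. Neither per-instance number, nor any simple average of the two, recovers the service-wide figure, which is why aggregating raw data server side matters for percentiles.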
The Datadog Database
Datadog manages the database and everything required to store logs and metrics, and provides a query language to view results on its website. The developer’s primary responsibility here is to set retention and indexing policies that fit within the desired budget.
Datadog provides many transforms you can apply to raw data to put it in a format that’s easier to work with. Here are some common ones:
|Datadog prefers a JSON log format to describe log properties on ingress. The JSON schema includes standard properties such as level (e.g., INFO), message and other common attributes. Logging is far from standard across service implementations, and retrofitting an existing application to a standard format could be daunting. To help with this, Datadog provides pipelines and processors that normalize and enrich log data server side. A pipeline processes incoming log records to normalize them into Datadog’s internal format. Once logs are in the preferred internal format, the entire suite of search, filter and other capabilities opens in the dashboard. For more details about log transforms, visit Datadog’s Log Configuration page.
|Synthetic metrics are server-side-generated metrics constructed from metrics, events and logs. The process is relatively straightforward. Simply set up a search+filter query for metrics, events or logs you want to synthesize a metric from, add some configuration details and save. Datadog will start creating new metrics cloud side based on your specified inputs.
|Datadog provides a collection of dashboard widgets (outlined below) that allow users to assemble a visual display with navigation to other dashboards, etc. The widgets are powered by base data elements, which are pulled from the database using a query language. Unsurprisingly, the query language operates on type, tags and other properties according to the source (e.g., logs). Transforms can then be applied to the resulting data for display.
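The two server-side ideas above, normalizing logs into a structured format and then deriving a metric from them, can be sketched in a few lines. This is a conceptual illustration in plain Python, not Datadog’s pipeline or processor configuration; the legacy line format, the field names and both function names are hypothetical.

```python
import json
import re
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical legacy log format: "<timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def normalize(line: str) -> Optional[Dict[str, str]]:
    """Parse a raw text line into a structured record, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["level"] = record["level"].lower()  # normalize casing across services
    return record


def count_matching(records: List[Dict[str, str]], level: str) -> int:
    """Log-derived metric: number of records at the given level."""
    return Counter(r["level"] for r in records)[level]


raw_lines = [
    "2024-01-01T00:00:00Z INFO order accepted",
    "2024-01-01T00:00:01Z ERROR payment declined",
    "2024-01-01T00:00:02Z ERROR inventory timeout",
    "not a structured line",
]
records = [r for r in (normalize(line) for line in raw_lines) if r is not None]
error_count = count_matching(records, "error")
as_json = json.dumps(records[0], sort_keys=True)
```

Once every service’s logs share one structure, a single filter such as "level is error" works across all of them, and counting the matches per interval yields a metric without any change to application code.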
Datadog Dashboards and More
Datadog comes with a full suite of graphic and data display widgets that can be easily dragged and dropped onto a screen and linked to other dashboards. From standard line plots, bar plots, GeoMaps and tables to counters, pie charts, heat maps, alert views and everything in between, Datadog uses queries and transform functions to assign a data set to the widget in classic MVC fashion.
|Datadog provides a rules engine that reuses the query and transform functionality available to dashboards to define the thresholds and rules on which an alert should trigger. This is also known as setting up a monitor, and there are many monitor types to choose from.
|Datadog offers several notification options that can be triggered by alerts, including Jira, PagerDuty, Slack and webhooks. Many third parties, like Zendesk, have also integrated with Datadog’s API.
|Datadog enjoys a massive library of service integrations covering most organizations’ resource monitoring needs. Datadog can also ingest OpenMetrics and recently added OpenTelemetry as well.
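Conceptually, a threshold monitor is a query result compared against limits. The sketch below shows that idea only; it is not Datadog’s monitor API, and the function name, the state names and the choice to average the window are assumptions for illustration.

```python
from statistics import mean
from typing import List


def evaluate_monitor(values: List[float], warning: float, critical: float) -> str:
    """Return a monitor state based on the average of the recent values."""
    if not values:
        return "NO DATA"  # a missing series is worth surfacing too
    avg = mean(values)
    if avg >= critical:
        return "ALERT"
    if avg >= warning:
        return "WARN"
    return "OK"


# Latency samples (ms) returned by a hypothetical query over the last few minutes.
state = evaluate_monitor([120, 180, 260], warning=150, critical=250)
```

In this example the average is roughly 187 ms, so the monitor lands in the warning state; a notification rule would then decide who gets told about it.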
In sum, Datadog is a complete cloud monitoring solution that offers one of the most comprehensive sets of tools in the market. The critical constraints are the cost of the service and how tightly bound your organization would be to Datadog if you want to switch providers later. I strongly recommend implementing your application-level metrics in a standard like OpenMetrics or OpenTelemetry. Doing so positions you to use alternate solutions without touching your code. Software-as-a-service solutions tend to become cost prohibitive at immense scale; at that point, hiring a support team and operating a self-hosted solution can be more cost-effective.
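To give a feel for what a standard metrics format looks like on the wire, here is a minimal sketch that renders counters in the general shape of the Prometheus text exposition format, on which OpenMetrics is based. It is deliberately incomplete (no label escaping and none of the other details the specifications require), and `CounterRegistry` is a hypothetical class written for this post. In a real service you would use an existing client library rather than hand-rolling this.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class CounterRegistry:
    """A tiny in-memory counter store keyed by metric name and label set."""

    def __init__(self) -> None:
        self._help: Dict[str, str] = {}
        self._values: Dict[Tuple[str, LabelSet], float] = {}

    def inc(self, name: str, help_text: str, labels: Dict[str, str], amount: float = 1.0) -> None:
        """Increase the counter identified by name and labels."""
        self._help[name] = help_text
        key = (name, tuple(sorted(labels.items())))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Render all counters as exposition-style text, one sample per line."""
        lines = []
        for name in sorted(self._help):
            lines.append(f"# HELP {name} {self._help[name]}")
            lines.append(f"# TYPE {name} counter")
            for (metric, labels), value in sorted(self._values.items()):
                if metric != name:
                    continue
                label_text = ",".join(f'{k}="{v}"' for k, v in labels)
                lines.append(f"{name}{{{label_text}}} {value}")
        return "\n".join(lines) + "\n"


registry = CounterRegistry()
registry.inc("http_requests_total", "Total HTTP requests.", {"method": "GET"})
registry.inc("http_requests_total", "Total HTTP requests.", {"method": "GET"})
registry.inc("http_requests_total", "Total HTTP requests.", {"method": "POST"})
text = registry.render()
```

Because the output is plain text in a widely understood format, any backend that can read it can collect these metrics, which is the portability argument made above.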
Prometheus is a free, self-hosted option (AWS now has a managed offering) that leverages a subset of the OpenMetrics standard. The current favorite in terms of adoption, Prometheus, unlike Datadog, doesn’t attempt to be a complete solution. In terms of the monitoring stack from our first post, Prometheus specializes in the middle, providing data collection, transforms, storage and query capability.
Prometheus, at its core, is a purpose-built time series database that provides similar functionality to Datadog for metrics collection, storage and query but differs in its approach to metrics collection and processing.
Prometheus Data Sourcing
Prometheus is not a logging solution. It only handles metrics, providing all the common metric types and adding the summary type. A summary is similar to a histogram, but it calculates quantiles over a sliding time window instead of counting observations in fixed buckets.
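The difference between the two types is easier to see side by side. The sketch below is a conceptual model in plain Python, not the Prometheus client library: `BucketHistogram` and `WindowSummary` are hypothetical names, and the summary here keeps raw samples and sorts them, whereas real implementations use streaming estimates.

```python
import math
from bisect import bisect_left
from collections import deque
from typing import Deque, List, Tuple


class BucketHistogram:
    """Counts observations into fixed upper-bound buckets."""

    def __init__(self, upper_bounds: List[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)  # last slot is +Inf

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def cumulative(self) -> List[int]:
        """Cumulative counts: each bucket includes everything below it."""
        total, out = 0, []
        for count in self.counts:
            total += count
            out.append(total)
        return out


class WindowSummary:
    """Keeps observations for a sliding time window and computes quantiles."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.samples: Deque[Tuple[float, float]] = deque()  # (timestamp, value)

    def observe(self, timestamp: float, value: float) -> None:
        self.samples.append((timestamp, value))
        # Drop observations that have aged out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_seconds:
            self.samples.popleft()

    def quantile(self, q: float) -> float:
        ordered = sorted(value for _, value in self.samples)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


hist = BucketHistogram([0.1, 0.5, 1.0])
for v in [0.05, 0.3, 0.3, 2.0]:
    hist.observe(v)

summary = WindowSummary(window_seconds=60)
summary.observe(0, 5.0)    # ages out once the window moves past it
summary.observe(30, 0.2)
summary.observe(70, 0.4)
```

The histogram remembers every observation forever but only at bucket resolution, while the summary gives exact values but forgets anything older than its window.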
Prometheus’ philosophy is built around a pull model: the Prometheus server requests metrics data from service instances rather than having instances push data to it. Aside from the simplicity and scalability of this approach, the primary motivation is that a failed data request is in itself a critical data point. If an instance is so bogged down that it is nonresponsive (or simply cannot be found), Prometheus can surface that error condition as a failsafe.
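Here is a minimal sketch of why a failed pull is useful information. The scraper below records an `up` value of 1 or 0 alongside whatever it managed to collect, which mirrors the per-target `up` series Prometheus itself records. Everything else is an assumption for illustration: the `fetch` callable stands in for an HTTP request, and the parser handles only the simplest "name value" lines.

```python
from typing import Callable, Dict


def parse_samples(text: str) -> Dict[str, float]:
    """Parse 'name value' lines, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def scrape(fetch: Callable[[], str]) -> Dict[str, float]:
    """Pull metrics from one target; record up=0 if the pull fails."""
    try:
        samples = parse_samples(fetch())
    except Exception:
        # The failure itself becomes a data point that can be alerted on.
        return {"up": 0.0}
    samples["up"] = 1.0
    return samples


def healthy_target() -> str:
    """Stand-in for a service that answers its metrics request."""
    return "# TYPE jobs_processed_total counter\njobs_processed_total 42\n"


def dead_target() -> str:
    """Stand-in for a service that is down or unreachable."""
    raise ConnectionError("connection refused")


healthy = scrape(healthy_target)
dead = scrape(dead_target)
```

With a push model, a dead instance simply goes quiet, and silence is hard to tell apart from "nothing happened". With a pull model, the collector knows it asked and got no answer.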
Prometheus provides both ingress and egress transforms on collected data.
|Ingress transforms are set up in the recording rules. The basic concept is simple: A rule is set up that performs a query on incoming raw metrics and stores the query result as a new metric. Therefore, any of the egress query transforms are also available to ingress. This is useful if you need to compress the amount of data you store or your dev team doesn’t have time to implement raw metrics in the form desired.
|Prometheus provides its own query language, PromQL. It includes functions to transform data as well as basic arithmetic operators (+, -, /, *, ^, %), where ^ is power, not XOR. Basic comparison operators are also available (==, !=, >, <, >=, <=).
Functions can also aggregate and otherwise manipulate data, covering aggregations, predictions, absolute values, etc. As Prometheus is a time series database, operators and functions are designed to work with scalar and vector elements where applicable.
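The ingress idea, running a query over raw metrics and storing the result as a new metric, can be modeled in a few lines. This is a conceptual sketch in plain Python, not Prometheus configuration or PromQL: `simple_rate` is a simplified stand-in that ignores counter resets and the extrapolation a real rate calculation performs, and the rule name and sample data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]      # (timestamp_seconds, value)
Store = Dict[str, List[Sample]]   # series name -> samples


def simple_rate(samples: List[Sample], now: float, window: float) -> float:
    """Per-second increase of a counter over the trailing window (simplified)."""
    in_window = [s for s in samples if now - window <= s[0] <= now]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def apply_recording_rule(
    store: Store, record: str, expr: Callable[[Store, float], float], now: float
) -> None:
    """Evaluate expr against stored data and append the result as a new series."""
    store.setdefault(record, []).append((now, expr(store, now)))


store: Store = {"http_requests_total": [(0, 100), (60, 160), (120, 280), (180, 400)]}

apply_recording_rule(
    store,
    record="job:http_requests:rate2m",  # hypothetical rule name
    expr=lambda s, now: simple_rate(s["http_requests_total"], now, window=120),
    now=180,
)
```

After the rule runs, the store holds a second, much smaller series containing the precomputed rate, so dashboards can read that directly instead of recomputing it from raw samples on every refresh.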
Prometheus has some basic dashboard tools but is nowhere near as robust as Datadog or Grafana. There are two Prometheus-native options for dashboards:
- Expression Browser: This tool plots queries and is meant for debugging and exploring PromQL queries to achieve the desired results, not for use as a production console.
- Console Templates: These use Go templates to describe a more full-featured dashboard with different types of plots and associated PromQL. Only line and stacked line graphs are available, so this option is limited.
Prometheus also supports basic alert rules as part of its configuration. Like Datadog, Prometheus can send notifications based on alerts, with targets including PagerDuty, SNS, email, webhook and Slack, to name a few. Alerts are set up via rules, much like a Datadog monitor, and are configured in the same way as ingress transforms. One key difference from Datadog is that Prometheus doesn’t provide a very robust UI for managing alert rules, so setup requires a bit more skill and effort.
To sum Prometheus up, the open-source solution focuses on data collection, storage and basic transforms. Prometheus serves the bottom of our monitoring stack better than the top and has basic alert and dashboard capabilities. Short of using a managed solution, Prometheus requires more operations care and feeding to scale and maintain properly.
Grafana specializes more in the top area of our monitoring stack. One of the premier open-source dashboard solutions available, Grafana offers many integrations, visualizations and tools for application monitoring. When combined with Prometheus, Grafana replicates most of the functionality provided by Datadog in terms of dashboard visualization and alerts around metrics.
Grafana Data Sourcing
Grafana uses a plug-in model for data sources and offers many plug-in options. We’re mainly interested in Prometheus content in this 101, but data may also be pulled from MySQL, PostgreSQL, Datadog (ironically), Snowflake and BigQuery, to name a few. Grafana’s ability to use data from traditional databases is somewhat unique and makes it a powerful tool if the data you need is already hosted in a more traditional data store. Grafana also has no qualms about mixing and matching sources for display.
Grafana Data Transforms
Grafana offers many data transforms that can be used to manipulate data sources. The complete list is in Grafana’s documentation on data transforms. To summarize:
- Add field from calculation: Use this transformation to add a new field calculated from two other fields. Each transformation allows you to add one new field.
- Concatenate fields: This transformation combines all fields from all frames into one result.
- Config from query results: This transformation allows you to select one query, extract standard options like Min, Max, Unit and Thresholds and apply it to other query results. This enables dynamic query-driven visualization configuration.
- Convert field type: This transformation changes the field type of the specified field.
- Filter data by name: This transformation removes portions of the query results.
- Filter data by query: This transformation can be used in panels with multiple queries to hide one or more of the queries.
- Filter data by value: This transformation filters your data directly in Grafana and removes some data points from your query result. You can include or exclude data that match one or more conditions you define. The conditions are applied to a selected field.
- Group by: This transformation groups the data by a specified field (column) value and processes calculations on each group.
- Join by field (outer join): This transformation can be used to join multiple time series from a result set by field, which is especially useful if you want to combine queries so that you can calculate results from the fields.
- Labels to fields: This transformation changes time series results that include labels or tags into a table where each label’s key and value are included in the results. The labels can be displayed either as columns or as row values.
- Merge: This transformation combines the result from multiple queries into one result, which is helpful when using the table panel visualization. Values that can be merged are combined into the same row. Values are mergeable if the shared fields contain the same data.
- Organize fields: This transformation can be used to rename, reorder or hide fields returned by the query.
- Reduce: This transformation applies a calculation to each field in the frame and returns a single value. Time fields are removed when using this transformation.
- Rename by regex: This transformation renames parts of the query results using a regular expression and replacement pattern.
- Rows to fields: This transformation converts rows into separate fields, which can be helpful as fields can be styled and configured individually. It can also use additional fields as sources for dynamic field configuration or map them to field labels. The extra labels can then be used to define better display names for the resulting fields.
- Prepare time series: This transformation is useful when a data source returns time series data in a format that isn’t supported by the panel you want to use.
- Series to rows: This transformation combines the result from multiple time series data queries into one single result, which is helpful when using the table panel visualization. The result from this transformation will contain three columns: Time, Metric and Value.
- Sort by: This transformation sorts each frame by the configured field.
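To show how a couple of these transformations chain together, here is a small sketch of "filter data by value" followed by "group by" on rows of plain Python dictionaries. It only models the concepts; Grafana applies transformations to its own data frames through the panel editor, and the function names, field names and data below are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def filter_by_value(rows: List[Row], field: str, keep: Callable[[Any], bool]) -> List[Row]:
    """Keep only rows whose selected field satisfies the condition."""
    return [row for row in rows if keep(row[field])]


def group_by(
    rows: List[Row], key: str, value: str, calc: Callable[[List[float]], float]
) -> Dict[Any, float]:
    """Group rows by one field and apply a calculation to another field."""
    groups: Dict[Any, List[float]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: calc(values) for name, values in groups.items()}


rows = [
    {"service": "api", "status": 200, "latency_ms": 12.0},
    {"service": "api", "status": 500, "latency_ms": 250.0},
    {"service": "worker", "status": 200, "latency_ms": 40.0},
    {"service": "api", "status": 200, "latency_ms": 18.0},
]

ok_rows = filter_by_value(rows, "status", lambda status: status == 200)
max_latency = group_by(ok_rows, "service", "latency_ms", max)
```

The order matters: filtering first means the failed request’s 250 ms latency never reaches the grouping step, so the result reflects successful requests only.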
Many of these transforms apply a calculation type. The built-in calculation types are in the table below.
|Calculation|Description
|All nulls|True when all values are null
|All zeros|True when all values are 0
|Change count|Number of times the field’s value changes
|Count|Number of values in a field
|Delta|Cumulative change in value, only counts increments
|Difference|Difference between first and last value of a field
|Difference percent|Percentage change between first and last value of a field
|Distinct count|Number of unique values in a field
|First (not null)|First, not null value in a field
|Max|Maximum value of a field
|Mean|Mean value of all values in a field
|Min|Minimum value of a field
|Min (above zero)|Minimum, positive value of a field
|Range|Difference between maximum and minimum values of a field
|Step|Minimal interval between values of a field
|Total|Sum of all values in a field
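Several of these calculations sound alike, so a worked example helps separate them. The functions below are my reading of the one-line descriptions above, written in plain Python; they are not Grafana’s implementation, and edge cases such as nulls or empty fields are not handled.

```python
from typing import List


def change_count(values: List[float]) -> int:
    """Number of times the value differs from the previous one."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


def delta(values: List[float]) -> float:
    """Cumulative change in value, counting only increments."""
    return sum(cur - prev for prev, cur in zip(values, values[1:]) if cur > prev)


def difference(values: List[float]) -> float:
    """Last value minus first value."""
    return values[-1] - values[0]


def value_range(values: List[float]) -> float:
    """Maximum minus minimum."""
    return max(values) - min(values)


def step(values: List[float]) -> float:
    """Smallest interval between consecutive values (typically timestamps)."""
    return min(b - a for a, b in zip(values, values[1:]))


series = [3, 5, 5, 2, 6]
results = {
    "change_count": change_count(series),
    "delta": delta(series),
    "difference": difference(series),
    "range": value_range(series),
}
min_step = step([0, 10, 15, 30])
```

For the series 3, 5, 5, 2, 6 the value changes three times, the increments add up to 6 (2 plus 4), the first-to-last difference is 3 and the range is 4. Four different numbers from the same five points is a good reminder to pick the calculation deliberately.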
Grafana: Dashboards and Other Features
|Grafana offers most of the same types of visualizations, aka widgets, as Datadog, such as line, bar and gauge. The process for setting up a dashboard and attaching data is also similar: a set of queries to data sources is used in conjunction with the transform functions above and attached to a widget for display.
|Grafana Alerting integrates with Prometheus alerting right out of the box. There is also a provision for Grafana’s Alert manager, which relies on Grafana services that come with additional costs. Conceptually, Grafana’s alerting works like Datadog’s, with monitoring rules and notification policies — although it is interesting that I couldn’t find a full list of Grafana-integrated services.
The Grafana cloud monitoring tool specializes more in data visualization from numerous sources, including a robust set of transforms and operations. Grafana is roughly equivalent to Datadog concerning visualizing metrics. Grafana also supports visualization of logs and traces in a fashion similar to Datadog. However, you need to set up services to collect this data and provide it to Grafana — this isn’t a feature out of the box. Common integrations include Jaeger for traces and Fluentd for logs, but these are by no means the only options.
When you need to leverage raw data from an app’s services to create meaningful dashboards and alerts, several cloud metrics and cloud monitoring tools can help you get the job done. However, developers must understand the differences between app and resource monitoring as well as logs and metrics, their basic monitoring stack, different cloud monitoring metrics and data collection best practices to select the right cloud metrics and cloud monitoring tool for their application.
Strategically, standards are best, and OpenMetrics or OpenTelemetry should be considered whenever possible when implementing your application-level metrics. Standards keep your options open and make migration to alternate solutions easier as needed. As your application grows, so too will your needs.