
Telemetry: a big opportunity to approach scaling in the cloud – Part 2


Image by Emma Gossett – Unsplash

by AI Insider team

What’s under the hood? How to instrument your deployed application using Azure Application Insights? What assets does OpenTelemetry bring? How to extract the component for scaling using telemetry?

You’re back. Great! This article is part two of a two-part series on using telemetry to approach scaling in the cloud. So if you forgot about Part 1 or didn’t have the chance to read it, we invite you to do a quick recap and read our first article in the series, Telemetry: a big opportunity to approach scaling in the cloud – Part 1.

Azure Application Insights (AI)

If your application is hosted in Azure, we have good news. You can instrument your already deployed application with no code modifications at all using Application Insights.

Fig. 7. Default page in Azure Application Insights

It attaches its own profiler to the required services and spots calls to known APIs, such as database, HTTP, or gRPC queries, as well as incoming HTTP request handling. The profiling itself is as fast as it is locally. If another service is involved in the request, the call to that service is automatically enriched with context information, so the spans of the called service are attached to the calling span.

Fig. 8. Span with many attached child spans
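The context enrichment described before Fig. 8 can be pictured with the W3C Trace Context `traceparent` header, one common format for passing trace context between services. The sketch below uses only the Python standard library; the helper names (`start_root_span`, `make_traceparent`, `start_child_span`) and the span names are hypothetical and are not part of any SDK.

```python
import secrets


def start_root_span(name):
    """Create the first span of a trace with fresh random identifiers."""
    return {
        "name": name,
        "trace_id": secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),    # 16 hex characters
        "parent_id": None,
    }


def make_traceparent(span):
    """Build the header value: version-traceid-parentid-flags."""
    return f"00-{span['trace_id']}-{span['span_id']}-01"


def start_child_span(name, traceparent):
    """Continue the caller's trace and remember the caller as the parent."""
    _version, trace_id, parent_id, _flags = traceparent.split("-")
    return {
        "name": name,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
    }


root = start_root_span("GET /orders")               # calling service
header = make_traceparent(root)                     # travels with the call
child = start_child_span("SELECT orders", header)   # called service
```

Because the child keeps the trace id and stores the caller's span id as its parent, a tracing backend can rebuild the tree of spans from the individual records.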

It also aggregates data and can display various charts:

Fig. 9. One of the dashboards of the Application Insights service in Azure

In Fig. 9, arrow 1 marks the slow request. Directly below it, arrow 2 shows a peak on the Dependency Call Duration chart. Knowing that the dependency is a database, we can conclude that the problem is a slow response from that database. Of course, we can click through and dig deeper to find the exact query that is slow.

Fig. 10. AI can show separate spans, the relations between them, and their details
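The same kind of drill-down can be done by hand on exported span data. The sketch below assumes a simplified, hypothetical span record (`name`, `kind`, `duration_ms`) with made-up values; real tools export far richer data.

```python
from collections import defaultdict

# Hypothetical span records for illustration only.
spans = [
    {"name": "GET /report", "kind": "request", "duration_ms": 1250},
    {"name": "SQL: SELECT * FROM orders", "kind": "dependency", "duration_ms": 1100},
    {"name": "HTTP: GET /rates", "kind": "dependency", "duration_ms": 90},
    {"name": "SQL: SELECT * FROM orders", "kind": "dependency", "duration_ms": 1300},
]


def slowest_dependencies(spans):
    """Return (name, average duration in ms) pairs, slowest first."""
    durations = defaultdict(list)
    for span in spans:
        if span["kind"] == "dependency":
            durations[span["name"]].append(span["duration_ms"])
    averages = {name: sum(values) / len(values) for name, values in durations.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = slowest_dependencies(spans)
```

The first entry of `ranking` is the dependency call that deserves attention first, in this made-up data the SQL query.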

With AI, we can define alerts to warn developers when something goes wrong (for example, memory consumption is too high or an exception has been thrown). Azure even has a feature called Smart Detection that, once activated, starts learning the regular metrics of your application. Once it has learned them, it can raise an alert when it detects abnormal behavior.
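To give a feeling for what "learning regular metrics" can mean, here is a deliberately naive sketch that flags a value when it lies far from a learned baseline. It is not the algorithm Azure uses; the function name, the threshold, and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` if it is more than `threshold` standard deviations from the mean."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: anything different is unusual.
        return value != baseline
    return abs(value - baseline) / spread > threshold


# Hypothetical response times (ms) observed during normal operation.
response_times_ms = [120, 130, 125, 118, 122, 127, 121, 124]

normal = is_anomaly(response_times_ms, 126)   # close to the baseline
spike = is_anomaly(response_times_ms, 400)    # far above the baseline
```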

AI correctly recognizes already scaled services as different instances of the same service. The component map is very useful for getting an overview of how the services relate to each other.

Fig. 11. Application Map in Azure Application Insights

AI has many more interesting features, e.g., tracing failures and exceptions, geo-distribution of users, and custom charts. Data can be queried with Kusto Query Language (KQL), a dedicated query language that will feel familiar to SQL users.

Applications outside of Azure can send traces with just a little coding, and SDKs are available for all the major platforms: Java, .NET, Node.js, etc.

Another popular tool worth comparing is Datadog.

OpenTelemetry

As the number of tracing tools keeps growing, so does the demand to standardize the data format and the API for tracing. OpenTelemetry is being developed for that reason: it is a set of open standards, SDKs, and tools. Check the huge number of supported products in the registry.

OpenTelemetry components can be containerized and reside in the Kubernetes cluster for testing and debugging purposes. For example, adding Zipkin (a collection and visualization tool) takes only two lines in the configuration file:

Fig. 12. Zipkin added to Docker-compose configuration

The general concept of OpenTelemetry components is very similar to what we have in other SDKs, but at the time of writing, automatic instrumentation for .NET was in alpha. However, manual instrumentation is very simple:

Fig. 13. Application manual instrumentation with OpenTelemetry

We provide the address of the data collector, common properties (service name and version), and the APIs that we want to observe: HttpClient, ASP.NET Core request processing, and SQL querying. One idea is to set up the console as the telemetry output; this way, we satisfy the 11th factor (Logs) of the twelve-factor app methodology, which treats logs as a stream of events written to standard output.
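The console-output idea can be illustrated without any SDK. The following standard-library sketch is not OpenTelemetry code; the `span` helper, the service name, and the version are hypothetical. It only shows the principle: time a unit of work, attach the common properties, and write one event per line to standard output, leaving collection to the environment.

```python
import json
import sys
import time
from contextlib import contextmanager

# Hypothetical common properties attached to every record.
SERVICE = {"service.name": "orders-api", "service.version": "1.0.0"}


@contextmanager
def span(name, stream=sys.stdout):
    """Time a block of work and write it as one JSON line to `stream`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 3)
        record = {**SERVICE, "span": name, "duration_ms": duration_ms}
        stream.write(json.dumps(record) + "\n")


with span("load-orders"):
    sum(range(1000))  # stand-in for real work
```

In a real application, the OpenTelemetry SDK takes over this role and lets you swap the console for another exporter through configuration.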

Data can be exported to a tool of our choice. Here is what Zipkin looks like:

Fig. 14. Zipkin data visualization

OpenTelemetry also recommends another visualization tool called Jaeger. Please access more information on modularity here.

Fig. 15. Components concept of OpenTelemetry


Now that we know what slows down our users, what is next?

Extracting the component for scaling

Next, we take the code of that slow process and move it into a separate, newly created service. It is necessary to add some code for serialization, validation, and application domain maintenance. Processing a single request may become slightly slower, but we can now scale it! To control that, we introduce the Load Balancer, which adds instances as the number of simultaneous requests grows.

With that in place, more users can make more requests and still be served within the same period. The only output metric that grows with the number of users is the hosting cost.
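The scaling rule described above can be sketched as a single function. This is a simplified assumption of how such a rule might look, not the behavior of any particular cloud product; the function name, the capacity figure, and the limits are hypothetical.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=1, max_instances=10):
    """Return how many instances keep each one within its capacity."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Clamp to the configured pool limits.
    return max(min_instances, min(max_instances, needed))


quiet = desired_instances(0, capacity_per_instance=50)
busy = desired_instances(120, capacity_per_instance=50)
overloaded = desired_instances(10_000, capacity_per_instance=50)
```

The upper limit matters because the hosting cost grows with the instance count; without it, a traffic spike translates directly into an unbounded bill.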

Imagine the scenario of a video conversion service: users upload videos, the server converts them to another format (we have already measured this step, and it happens to be slow), and, finally, users download the converted videos. Here we extract the conversion code into a separate service. Once a video has been uploaded, our Load Balancer spins up a new instance of the conversion service, which handles the video without disturbing the other instances. It does not matter how many videos have been uploaded; each will be available for download after a nearly constant time. Furthermore, having 0 users at a given moment means we also have 0 running instances, which, in the end, generates zero cost.
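A back-of-the-envelope model makes the effect visible. It assumes every video takes the same time to convert and that instances are billed only while they run; the function names, durations, and prices are made up for illustration.

```python
import math


def completion_time_s(videos, instances, seconds_per_video):
    """Time until the last video is done when work is split evenly."""
    if videos == 0:
        return 0
    return math.ceil(videos / instances) * seconds_per_video


def hosting_cost(videos, seconds_per_video, price_per_instance_second):
    """One instance per video, each billed only while it converts."""
    return videos * seconds_per_video * price_per_instance_second


single_instance = completion_time_s(100, instances=1, seconds_per_video=60)
one_per_video = completion_time_s(100, instances=100, seconds_per_video=60)
idle_cost = hosting_cost(0, seconds_per_video=60, price_per_instance_second=0.001)
```

With a single instance, the last of 100 uploads waits 6000 seconds; with one instance per video, every upload is ready after 60 seconds, and with no uploads the cost is zero.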

Besides scaling, we also improve recovery after a critical failure in the conversion: we no longer have to restart the entire application. Instead, we restart only the conversion service. Furthermore, we can plan to grow or shrink the pool of instances depending on the predictions we make. This way, the recovery is either immediate or invisible to the end user. As an exercise, try to plan the separation of file storage, so that one user can upload and download files independently of another.

Later, once the application can scale, we should also keep testing the scaling. Azure has a service called Azure Load Testing for that. If we know our users’ behavior, we can define appropriate test data to check whether the application still supports that behavior under heavy load. This testing can run before a pre-release deployment or even as part of pull request validation.
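The principle of a load test fits in a few lines: fire many requests in parallel and look at a high percentile of the latencies rather than the average. The sketch below is a local toy, not a description of how Azure Load Testing works; `handle_request` is a hypothetical stand-in for a call to the service under test.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(_):
    """Hypothetical stand-in for one call to the service under test."""
    start = time.perf_counter()
    sum(range(10_000))  # simulated work
    return time.perf_counter() - start


def run_load_test(total_requests, concurrency):
    """Fire requests in parallel and return the sorted latencies in seconds."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sorted(pool.map(handle_request, range(total_requests)))


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted list."""
    index = max(0, math.ceil(fraction * len(sorted_values)) - 1)
    return sorted_values[index]


latencies = run_load_test(total_requests=50, concurrency=10)
p95 = percentile(latencies, 0.95)
```

A check such as "p95 stays below an agreed limit" is the kind of assertion that can gate a pull request or a pre-release deployment.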

How do you approach scaling?

Now that we’ve discussed how to use telemetry to approach scaling in the cloud, where is your scaling on the map? Is it where you thought it was?

Tell us more about your experience with scaling and how you approach it, especially if you are new to the topic.

We hope this two-part article has helped you get insights into using telemetry in scaling in the cloud. If you’re looking to dive deeper into the matter, look no further. You’re welcome to get in touch with us and take the conversation forward together.