Performance Resources

DZone's Featured Performance Resources

The Art of Being Ready: Reliability in Extreme Conditions

By Eugene Retunsky

When it comes to online services, uptime is crucial, but it’s not the only thing to consider. Imagine running an online store — having your site available 99.9% of the time might sound good, but what if that 0.1% of downtime happens during the holiday shopping season? That could mean losing out on big sales. And what if most of your customers are only interested in a few popular items? If those pages aren’t available, it doesn’t matter that the rest of your site is working fine. Sometimes, being available during peak moments can make or break your business. It’s not just e-commerce — a small fraction of airports handle most of the air traffic, just a tiny minority of celebrities are household names, and only a handful of blockbuster movies dominate the box office each year. It’s the same distribution pattern everywhere. To be successful, it’s important to not only maintain uptime but also be ready for significant events. Some teams implement change freezes before key times, such as Prime Day, Black Friday, or Cyber Monday. This approach is reasonable, but it can be limiting as it doesn’t allow teams to quickly respond to unexpected opportunities or critical situations. Additionally, not all demand can be predicted, and it’s not always clear when those high-impact events will happen. This is where “Reliability when it matters” comes in. We need to be able to adapt and respond quickly to changes in customer demand without being held back by code freeze periods and being prepared for unforeseen situations. By considering time as a valuable resource and understanding the relative significance of different moments, organizations can better translate customer value and adjust risk and availability budgets accordingly. This approach allows organizations to be flexible and responsive to changes in demand without missing out on crucial features or opportunities. In the end, it’s about being ready when luck comes your way. It’s important to note that a system is not static and is constantly changing. The system itself, the infrastructure it’s hosted on, and the engineering organization all change over time. This means that knowledge about the system also changes, which can impact reliability. Besides that, incidents and outages are inevitable, no matter how much we try to prevent them. Bugs will be shipped, bad configurations will be deployed, and human error will occur. There can also be interdependencies that amplify outages. An incident rarely has a single cause and is often a combination of factors coming together. The same goes for solutions, which are most effective when they involve a combination of principles and practices working together to mitigate the impact of outages. Operating a system often means dealing with real-world pressures, such as time, market, and management demands to deliver faster. This can lead to shortcuts being taken and potentially compromise the reliability of the system. Growth and expansion of the user base and organization can also bring additional complexity and result in unintended or unforeseen behaviors and failure modes. However, by adopting a holistic approach and utilizing the principles and practices of engineering I’m going to cover below, we can have the best of both worlds — speed, and reliability. It’s not an either-or scenario but rather a delicate balance between the two. What Is Reliability? Reliability is a vital component of any system, as it guarantees not only availability but also proper functioning. A system may be accessible, yet if it fails to operate accurately, it lacks reliability. The objective is to achieve both availability and precision within the system, which entails containing failures and minimizing their impact. However, not all failures carry equal weight. For instance, an issue preventing checkout and payment is far more crucial than a minor glitch in image loading. It’s important to focus on ensuring important functions work correctly during critical moments. In other words, we want to focus on being available and functioning correctly during peak times, serving the most important functionality, whether it be popular pages or critical parts of the process. Making sure systems work well during busy times is tough, but it’s important to approach it in a thoughtful and thorough way. This includes thinking about the technical, operational, and organizational aspects of the system. Key parts of this approach include: Designing systems that are resilient, fault-tolerant, and self-healing. Proactively testing systems under extreme conditions to identify potential weak spots and prevent regressions. Effective operational practices: defining hosting topology, auto-scaling, automating deployment/rollbacks, implementing change management, monitoring, and incident response protocols. Navigating the competing pressures of growth, market demands, and engineering quality. Cultivating a culture that values collaboration, knowledge sharing, open-mindedness, simplicity, and craftsmanship. It also requires a focus on outcomes in order to avoid indecision and provide the best possible experience for customers. Further, we’re going to expand on the concept of “Reliability when it matters” and provide practical steps for organizations to ensure availability and functionality during critical moments. We’ll discuss key elements such as designing systems for reliability, proactively testing and monitoring, and also delve into practical steps like automating deployment and incident response protocols. Reliability Metrics: A Vital Tool for Optimization When optimizing a service or system, it's essential to initially define your objectives and establish a method for monitoring progress. The metrics you choose should give you a comprehensive view of the system’s reliability, be easy to understand, share, and highlight areas for improvement. Here are some common reliability metrics: Incident frequency: the number of incidents per unit of time. Incident duration: the total amount of time incidents last. While these metrics are a good starting point, they don’t show the impact of incidents on customers. Let’s consider the following graph: Blue — the number of requests per five minutes, Red — errors, Green — reliability in 0..1 Suppose we have two incidents, one at 1 am and one at 2 pm, each causing about 10% of requests to fail for an equal duration of 30 minutes. Treating these incidents as equally impactful on reliability wouldn’t reflect their true effects on customers. By considering traffic volume, the reliability metric can better show that an incident during peak traffic has a bigger impact and deserves higher priority. Our goal is to have a clear signal that an incident during peak traffic is a major problem that should be fixed. This distinction helps prioritize tasks and make sure resources are used effectively. For example, it can prevent the marketing team’s efforts to bring more visitors from being wasted. Additionally, tracking the incident frequency per release can help improve the deployment and testing processes and reduce unexpected issues. In the end, this should lead to faster delivery with lower risks. Digging Deeper Into Metrics To get a deeper understanding of these metrics and find areas for improvement, try tracking the following: Time to detection: how long it takes to notice an incident. Time to notification: how long it takes to notify relevant parties. Time to repair: how long it takes to fix an incident. Time between incidents: this can reveal patterns or trends in system failures. Action item completion rate: the percentage of tasks completed. Action item resolution time: the time it takes to implement solutions. Percentage of high-severity incidents: this measures the overall reliability of the system. Finally, regularly reviewing these metrics during weekly operations can help focus on progress, recognize successes, and prioritize. By making this a regular part of your culture, you can use the data from these metrics to drive better decisions and gradually optimize the system. Remember, the usefulness of metrics lies in the actions taken from them and their ability to drive progress. It’s a continuous feedback loop of refining both the data and the action items to keep the system improving. Designing for Resilience A system that isn’t designed to be resilient probably won’t handle peak times as smoothly. Here are some considerations that can help ensure a system’s reliability under a variety of conditions: Do’s: Prepare for component failure: By partitioning the service or using isolation, you can limit the blast radius and reduce the impact of failures. Implement fault-tolerance: Implementing mechanisms like retries, request hedging, and backpressure will improve the system’s availability and performance. Use rate-limiting and traffic quotas: Don’t rely solely on upstream dependencies to protect themselves. Use rate-limiting and traffic quotas to ensure that your system remains reliable. Categorize functionality: Prioritize functions by categorizing them into “critical,” “normal,” and “best-effort” categories. This will help keep essential functions available at all costs during high demand. Implement error-pacing and load-shedding: These mechanisms help prevent or mitigate traffic misuse or abuse. Continuously challenge the system: Continuously challenge the system and consider potential failures to identify areas for improvement. Plan for recovery: Implement fail-over mechanisms and plan for recovery in the event of a failure. This will help reduce downtime and ensure that essential services are available during challenging conditions. Make strategic trade-offs: Make strategic trade-offs and prioritize essential services during challenging external conditions. Dont’s: Don’t assume callers will use your service as intended. Don’t neglect rare but potential failures; plan and design prevention measures. Don’t overlook the possibility of hardware failures. I explored some of the ideas in the following blog posts: Ensuring Predictable Performance in Distributed Systems Navigating the Benefits and Risks of Request Hedging for Network Services FIFO vs. LIFO: Which Queueing Strategy Is Better for Availability and Latency? Isolating Noisy Neighbors in Distributed Systems: The Power of Shuffle-Sharding Reliability Testing Reliability testing is essential for maintaining the availability and functionality of a system during high demand. To ensure a reliable system, it is important to: Design for testability so each component can be tested individually. Have good enough testing coverage as a prerequisite for being agile. Calibrate testing by importance, focusing on essential functions and giving a bit of slack to secondary or experimental features. Perform extensive non-functional testing, such as load testing, stress testing, failure-injection testing, soak testing, and fuzzing/combinatorial testing. It’s crucial to avoid: Blindly pursuing high coverage numbers. Assuming that a single data point provides a comprehensive understanding. Ensure that results are robustly reproducible. Underinvesting in testing environments and tooling. Proper testing not only ensures correctness, serves as living documentation, and prevents non-functional regressions but also helps engineers to understand the system deeper, flex their creative muscles while trying to challenge them, and ultimately create more resilient, reliable systems for the benefit of all stakeholders. Remember, if you don’t deliberately stress test your system, your users will do it for you. And you won’t be able to choose when that moment comes. Reliability Oriented Operations Operating a distributed system is like conducting an orchestra, a delicate art that requires a high level of skill and attention to detail. Many engineers tend to underestimate the importance of operations or view it as secondary to software development. However, in reality, operations can have a significant impact on the reliability of a system. Just like a conductor’s skill and understanding of the orchestra is vital to ensure a harmonious performance. For example, cloud computing providers often offer services built on open-source products. It’s not just about using the software but how you use it. This is a big part of the cloud computing provider business. To ensure reliability, there are three key aspects of operations to consider: Running the service: This involves hosting configuration, deployment procedures, and regular maintenance tasks like security patches, backups, and more. Incident prevention: Monitoring systems in real-time to quickly detect and resolve issues, regularly testing the system for performance and reliability, capacity planning, etc. Incident response: Having clear incident response protocols that define the roles and responsibilities of team members during an incident, as well as effective review, communication, and follow-up mechanisms to address issues and prevent similar incidents from happening or minimize their impact in the future. The incident response aspect is particularly crucial, as it serves as a reality check. After all, all taken measures were insufficient. It’s a moment of being humble and realizing that the world is much more complex than we thought. And we need to try to be as honest as possible to identify all the engineering and procedural weaknesses that enabled the incident and see what we could do better in the future. To make incident retrospectives effective, consider incorporating the following practices: Assume the reader doesn’t have prior knowledge of your service. First of all, you write this retrospective to share knowledge and write clearly so that others can understand. Define the impact of the incident. It helps to calibrate the amount of effort needed to invest in the follow-up measures. Only relatively severe incidents require a deep process, do not normalize retrospectives by having them for every minor issue that doesn’t have the potential to have a lasting impact. Don’t stop at comfortable answers. Dig deeper without worrying about personal egos. The goal is to improve processes, not blame individuals or feel guilt. Prioritize action items that would have prevented or greatly reduced the severity of the incident. Aim to have as few action items as possible, each with critical priority. In terms of not stopping at the “comfortable answers,” it’s important to identify and address underlying root causes for long-term reliability. Here are a few examples of surface-level issues that can cause service disruptions: Human error while pushing configuration. Unreliable upstream dependency causes unresponsiveness. Traffic spike leading to the temporary unavailability of our service. It can be difficult to come up with action items to improve reliability in the long term based on these diagnoses. On the other hand, deeper underlying root causes may sound like: Our system allowed the deployment of an invalid configuration to the whole fleet without safety checks. Our service didn’t handle upstream unavailability and amplified the outage. Our service didn’t protect itself from excessive traffic. Addressing underlying root causes can be more challenging, but it is essential for achieving long-term reliability. This is just a brief overview of what we should strive for in terms of operations, but there is much more to explore and consider. From incident response protocols to capacity planning, there are many nuances and best practices to be aware of. The Human Factor in System Reliability While procedures and mechanisms play a vital role in ensuring system reliability, it is ultimately the humans who bring them to life. So, it’s not just about having the right tools but also cultivating the right mindset to breathe life into those mechanisms and make them work effectively. Here are some of the key qualities and habits that contribute to maintaining reliability (and not only): Collaboration with other teams and organizations in order to share knowledge and work towards a common goal. A degree of humility and an open-minded approach to new information in order to adapt and evolve the system. A focus on simplicity and craftsmanship in order to create evolvable and maintainable systems. An action-driven and outcome-focused mindset, avoiding stagnation and indecision. A curious and experimental approach akin to that of a child, constantly seeking to understand how the system works and finding ways to improve it. Conclusion Ensuring reliability in a system is a comprehensive effort that involves figuring out the right metrics, designing with resilience in mind, and implementing reliability testing and operations. With a focus on availability, functionality, and serving the most important needs, organizations can better translate customer value and adjust risks and priorities accordingly. Building and maintaining a system that can handle even the toughest conditions not only helps drive business success and pleases customers but also brings a sense of accomplishment to those who work on it. Reliability is a continuous journey that requires attention, skill, and discipline. By following best practices, continuously challenging the system, and fostering a resilient mindset, teams, and organizations can create robust and reliable systems that can withstand any challenges that come their way. More

Load Management With Istio Using FluxNinja Aperture

By Sudhanshu Prajapati

Service meshes are becoming increasingly popular in cloud-native applications as they provide a way to manage network traffic between microservices. Istio, one of the most popular service meshes, uses Envoy as its data plane. However, to maintain the stability and reliability of modern web-scale applications, organizations need more advanced load management capabilities. This is where Aperture comes in. It offers several features, including: Prioritized load shedding: Drops traffic that is deemed less important to ensure that the most critical traffic is served. Distributed rate-limiting: Prevents abuse and protects the service from excessive requests. Intelligent autoscaling: Adjusts resource allocation based on demand and performance. Monitoring and telemetry: Continuously monitors service performance and request attributes using an in-built telemetry system. Declarative policies: Provides a policy language that enables teams to define how to react to different situations. These capabilities help manage network traffic in a microservices architecture, prioritize critical requests, and ensure reliable operations at scale. Furthermore, the integration with Istio for flow control is seamless and without the need for application code changes. In this blog post, we will dive deeper into what a Service Mesh is, the role of Istio and Envoy, and how they work together to provide traffic management capabilities. Finally, we will show you how to manage loads with Aperture in an Istio-configured environment. What Is a Service Mesh? Since the advent of microservices architecture, managing and securing service-to-service communication has been a significant challenge. As the number of microservices grows, the complexity of managing and securing the communication between them increases. In recent years, a new approach has emerged that aims to address these challenges: the service mesh. The concept of a service mesh was first introduced in 2016 when a team of engineers from Buoyant, a startup focused on cloud-native infrastructure, released Linkerd, an open-source service mesh for cloud-native applications. Linkerd was designed to be lightweight and unobtrusive, providing a way to manage service-to-service communication without requiring significant changes to the application code. Istio as a Service Mesh Source: Istio Istio is an open-source service mesh that is designed to manage and secure communication between microservices in cloud-native applications. It provides a number of features for managing and securing service-to-service communication, including traffic management, security, and observability. At its core, Istio works by deploying a sidecar proxy alongside each service instance. This proxy intercepts all incoming and outgoing traffic for the service and provides a number of capabilities, including traffic routing, load balancing, service discovery, security, and observability. When a service sends a request to another service, the request is intercepted by the sidecar proxy, which then applies a set of policies and rules that are defined in Istio's configuration. These policies and rules dictate how traffic should be routed, how the load should be balanced, and how security should be enforced. For example, Istio can be used to implement traffic routing rules based on request headers, such as routing requests from a specific client to a specific service instance. Istio can also be used to apply circuit-breaking and rate-limiting policies to ensure that a single misbehaving service does not overwhelm the entire system. Istio also provides strong security capabilities, including mutual TLS authentication between services, encryption of traffic between services, and fine-grained access control policies that can be used to restrict access to services based on user identity, IP address, or other factors. What Is Envoy? Envoy is an open-source, high-performance edge and service proxy developed by Lyft. It was created to address the challenges of modern service architectures, such as microservices, cloud-native computing, and containerization. Envoy provides a number of features for managing and securing network traffic between services, including traffic routing, load balancing, service discovery, health checks, and more. Envoy is designed to be a universal data plane that can be used with a wide variety of service meshes and API gateways, and it can be deployed as a sidecar proxy alongside service instances. One of the key benefits of Envoy is its high-performance architecture, which makes it well-suited for managing large volumes of network traffic in distributed systems. Envoy uses a multi-threaded, event-driven architecture that can handle tens of thousands of connections per host with low latency and high throughput. Traffic Management in Istio Istio's traffic management capabilities enable the control of network traffic flow between microservices. This refers to the ability to manage how the traffic flows between them. In Istio, Traffic management is primarily done through the use of Envoy sidecar proxies that are deployed alongside your microservices. Envoy proxies are responsible for handling incoming and outgoing network traffic and can perform a variety of functions such as load balancing, routing, and security. Let's talk about EnvoyFilter and how it is used in Aperture. Aperture's EnvoyFilter One way to customize the behavior of Envoy proxies in Istio is through the use of Envoy filters. Envoy filters are a powerful feature of Envoy that allow you to modify, route, or terminate network traffic based on various conditions such as HTTP headers, request/response bodies, or connection metadata. In Istio, Envoy filters can be added to your Envoy sidecar proxies by creating custom EnvoyFilter resources in Kubernetes. These resources define the filter configuration, the filter type, and the filter order, among other parameters. There are several types of filters that can be used in Istio, including HTTP filters, network filters, and access log filters. Here are some of the Envoy filters that Apertures uses. HTTP filters: You can use HTTP filters to modify or route HTTP traffic based on specific criteria. For example, you can add an HTTP filter to strip or add headers, modify request/response bodies, or route traffic to different services based on HTTP headers or query parameters. Network filters: You can use network filters to modify network traffic at the transport layer, such as TCP or UDP. For example, you can use a network filter to add or remove SSL/TLS encryption or to redirect traffic to a different IP address. Filters are loaded dynamically into Envoy and can be applied globally to all traffic passing through the proxy or selectively to specific services or routes. Aperture uses EnvoyFilter to implement its flow control capabilities. The Aperture Agent is integrated with Envoy using EnvoyFilter, which allows the agent to use the External authorization API (we will learn more about this in the next section). This API allows Aperture to extract metadata from requests and makes flow control decisions based on that metadata. With EnvoyFilter, Aperture can intercept, inspect, and modify the traffic flowing through Envoy, providing more advanced and flexible flow control capabilities. External Authorization API The External Authorization API is a feature provided by Envoy Proxy that allows external authorization services to make access control decisions based on metadata extracted from incoming requests. It provides a standard interface for Envoy to call external authorization services for making authorization decisions, which allows for a more flexible and centralized authorization control for microservices. Control Using Envoy's External Authorization API enables Aperture to make flow control decisions based on a variety of request metadata beyond basic authentication and authorization. By extracting and analyzing data from request headers, paths, query parameters, and other attributes, Aperture can gain a more comprehensive understanding of the flow of traffic between microservices. This capability allows Aperture to prioritize critical requests over others, ensuring reliable operations at scale. Moreover, by utilizing Aperture flow control, applications can degrade gracefully in real-time, meaning Aperture can prioritize the different workloads. Aperture uses Envoy's External Authorization definition to describe request metadata, specifically the AttributeContext. However, the ability to extract values from the request body depends on how External Authorization in Envoy was configured. Access Logs In addition to flow control, Aperture also provides access logs that developers can use to gain insight into the flow of network traffic in their microservices architecture. These logs capture metadata about each request, such as the request method, path, headers, and response status code, which can be used to optimize application performance and reliability. By analyzing traffic patterns and identifying potential issues or performance bottlenecks, developers can make informed decisions to improve their microservices architecture. Aperture extract fields from access logs containing high-cardinality attributes that represent key attributes of requests and features. You can find all these extracted fields here. These fields allow for a detailed analysis of system performance and behavior and provide a comprehensive view of individual requests or features within services. Additionally, this data can be stored and visualized using FluxNinja ARC, making it available for querying using an OLAP backend (Druid). Let’s jump into the implementation of Aperture with Istio. Flow Control Using Aperture With Istio Load management is important in web-scale applications to ensure stability and reliability. Aperture provides flow control mechanisms such as weighted fair queuing, distributed rate-limiting, and prioritization of critical features to regulate the flow of requests and prevent overloading. These techniques help to manage the flow of requests and enable load management. Demo Prerequisites The following are prerequisites before going ahead with Aperture Flow Control Integration with Istio. A Kubernetes cluster: For the purposes of this demo, you can use the Kubernetes-in-Docker (Kind) environment. Aperture Controller and Aperture Agent installed in your Kubernetes cluster. Istio is installed in your Kubernetes cluster. Step 1: Install Aperture Controller and Aperture Agent To install Aperture Controller and Aperture Agent running as Daemon Set in your Kubernetes cluster, follow the instructions provided in the Aperture documentation. Step 2: Install Istio To install Istio in your Kubernetes cluster, follow the instructions provided in the Istio documentation. Step 3: Enable Sidecar Injection To enable Istio sidecar injection for your applications deployments in a specific namespace, run the following command: kubectl label namespace <namespace-name> istio-injection=enabled This will add the istio-injection=enabled label to the specified namespace, allowing Istio to inject sidecars automatically into the pods. Example: kubectl create namespace demoapp kubectl label namespace demoapp istio-injection=enabled Step 4: Apply Envoy Filters Patches To configure Aperture with Istio, four patches have to be applied via Envoy Filter. These patches essentially serve the following purposes: To merge the values extracted from the filter with the Open Telemetry Access Log configuration to the HTTP Connection Manager filter for the outbound listener in the Istio sidecar running with the application. This is called the NETWORK_FILTER Patch. To merge the Open Telemetry Access Log configuration with the HTTP Connection Manager filter for the inbound listener in the Istio sidecar running with the application. This is also a NETWORK_FILTER Patch. To insert the External Authorization before the Router sub-filter of the HTTP Connection Manager filter for the inbound listener in the Istio sidecar running with the application. This is called the HTTP_FILTER Patch. To insert the External Authorization before the Router sub-filter of the HTTP Connection Manager filter, but for the outbound listener in the Istio sidecar running with the application. This is another HTTP_FILTER Patch. In simpler terms, the patches are applied to the Istio sidecar running alongside the application. They help to modify the HTTP Connection Manager filter, which manages incoming and outgoing traffic to the application. The patches enable Aperture to extract metadata, such as request headers, paths, and query parameters, to make flow control decisions for each request. This allows Aperture to prioritize critical requests over others, ensure reliable operations at scale, and prevent overloading the application. Refer to Aperture Envoy Configuration documentation to know more about each patch. For convenience, Aperture provides two ways to do this, Helm: Using Aperture istioconfig Helm chart Aperturectl: Own Aperture cli provides a way to do istioconfig install in ISTIOD namespace. Using Helm: helm repo add aperture https://fluxninja.github.io/aperture/ helm repo update helm upgrade --install aperture-envoy-filter aperture/istioconfig --namespace ISTIOD_NAMESPACE_HERE Using aperturectl aperturectl install istioconfig --version v0.26.1 --namespace ISTIOD_NAMESPACE_HERE Example: aperturectl install istioconfig --version v0.26.1 --namespace istio-system The default values for the Aperture Agent service namespace are aperture-agent, the port is 8080, and sidecar mode is false. This means that the Aperture Agent target URL is aperture-agent.aperture-agent.svc.cluster.local:8080. If you have installed the Aperture Agent in a different namespace or port, you can create or update the values.yaml file and pass it with the install command. Please refer to the Aperture documentation for more information on how to configure Aperture for custom values and how to create or update the values.yaml file and pass it with the install command. Step 5: Verify Integration kubectl get envoyfilter aperture-envoy-filter -n ISTIOD_NAMESPACE_HERE Example Command: kubectl get envoyfilter aperture-envoy-filter -n istio-system Output: NAME AGE aperture-envoy-filter 46s Deploy Demo-App To demonstrate whether Istio integration is working, install a demo deployment in demoapp namespace with istio-injection enabled. kubectl create namespace demoapp kubectl label namespace demoapp istio-injection=enabled kubectl create -f https://gist.githubusercontent.com/sudhanshu456/10b8cfd09629ae5bbce9900d8055603e/raw/cfb8f0850cc91364c1ed27f6904655d9b338a571/demoapp-and-load-generator.yaml -n demoapp Check pods are up and running. kubectl get pods -n demoapp Output: NAME READY STATUS RESTARTS AGE service1-demo-app-686bbb9f64-rhmpp 2/2 Running 0 14m service2-demo-app-7f8dd75749-zccx8 2/2 Running 0 14m service3-demo-app-6899c6bc6b-r4mpd 2/2 Running 0 14m wavepool-generator-7bd868947d-fmn2w 1/1 Running 0 2m36s Apply Policy Using aperturectl To demonstrate control points and their functionality, we will use a basic concurrency limiting policy. Below is the values.yaml file we will apply using aperturectl. values.yaml common: policy_name: "basic-concurrency-limiting" policy: flux_meter: flow_selector: service_selector: agent_group: default service: service1-demo-app.demoapp.svc.cluster.local flow_matcher: control_point: ingress concurrency_controller: flow_selector: service_selector: agent_group: default service: service1-demo-app.demoapp.svc.cluster.local flow_matcher: control_point: ingress Use the below command to apply the above policy values.yaml. aperturectl blueprints generate --name=policies/latency-aimd-concurrency-limiting --values-file values.yaml --output-dir=policy-gen --version=v0.26.1 --apply Validate Policy Let’s check if the policy is applied or not. kubectl get policy -n aperture-controller Output: NAME STATUS AGE basic-concurrency-limiting uploaded 20s List Active Control Points Using aperturectl Control points are the targets for flow control decision-making. By defining control points, Aperture can regulate the flow of traffic between services by analyzing request metadata, such as headers, paths, and query parameters. Aperture provides a command-line interface called aperturectl that allows you to view active control points and live traffic samples. Using aperturectl flow-control control-points, you can list active control points. aperturectl flow-control control-points --kube Output: AGENT GROUP SERVICE NAME default egress service1-demo-app.demoapp.svc.cluster.local default egress service2-demo-app.demoapp.svc.cluster.local default ingress service1-demo-app.demoapp.svc.cluster.local default ingress service2-demo-app.demoapp.svc.cluster.local default ingress service3-demo-app.demoapp.svc.cluster.local Live Previewing Requests Using aperturectl aperturectl provides a flow-control preview feature that enables you to view live traffic and visualize it. This feature helps you understand the incoming request attributes, which is an added benefit on top of Istio. You can use this feature to preview incoming traffic and ensure that Aperture is making the correct flow control decisions based on the metadata extracted from the request headers, paths, query parameters, and other attributes. aperturectl flow-control preview --kube service1-demo-app.demoapp.svc.cluster.local ingress --http --samples 1 Output: { "samples": [ { "attributes": { "destination": { "address": { "socketAddress": { "address": "10.244.1.11", "portValue": 8099 } } }, "metadataContext": {}, "request": { "http": { "headers": { ":authority": "service1-demo-app.demoapp.svc.cluster.local", ":method": "POST", ":path": "/request", ":scheme": "http", "content-length": "201", "content-type": "application/json", "cookie": "session=eyJ1c2VyIjoia2Vub2JpIn0.YbsY4Q.kTaKRTyOIfVlIbNB48d9YH6Q0wo", "user-agent": "k6/0.43.1 (https://k6.io/)", "user-id": "52", "user-type": "subscriber", "x-forwarded-proto": "http", "x-request-id": "a8a2684d-9acc-4d28-920d-21f2a86c0447" }, "host": "service1-demo-app.demoapp.svc.cluster.local", "id": "14105052106838702753", "method": "POST", "path": "/request", "protocol": "HTTP/1.1", "scheme": "http" }, "time": "2023-03-15T11:50:44.900643Z" }, "source": { "address": { "socketAddress": { "address": "10.244.1.13", "portValue": 33622 } } } }, "parsed_body": null, "parsed_path": [ "request" ], "parsed_query": {}, "truncated_body": false, "version": { "encoding": "protojson", "ext_authz": "v3" } } ] } What's Next? Generating and Applying Policies: Check out our tutorials on flow-control and load-management techniques here. Conclusion In this blog post, we have explored the concept of a service mesh, which is a tool used to manage the communication between microservices in a distributed system. We have also discussed how Aperture can integrate with Istio, an open-source service mesh framework, to provide enhanced load management capabilities. By following a few simple steps, you can get started with using these load management techniques, which include concurrency limiting, dynamic rate limiting, and prioritized load shedding, to improve the performance and reliability of your microservices. To learn more about Aperture, please visit our GitHub repository and documentation site. You can also join our Slack community to discuss best practices, ask questions, and engage in discussions on reliability management. More

5 DNS Troubleshooting Tips for Network Teams

By Terry Bernstein

Apollo Router Performance Monitoring with OpenTelemetry and Splunk APM

By Venkata Thota

Loop Device in Linux

By Priyanka Nawalramka

Measuring Service Performance: The Whys and Hows

I enjoy improving application performance. After all, the primary purpose of computers is to execute tasks efficiently and swiftly. When you consider the fundamentals of computing, it seems almost magical — at its core, it involves simple arithmetic and logical operations, such as addition and comparison of binary numbers. Yet, by rapidly performing countless such operations, computers enable us to enjoy video games, watch endless videos, explore the vast expanse of human knowledge and culture, and even unlock the secrets of the universe and life itself. That is why optimizing applications is so important — it allows us to make better use of our most precious resource: time. To paraphrase a famous quote: In the end, it’s not the number of years you live that matter, but the number of innovations that transpire within those years. However, application performance often takes a backseat until it becomes a pressing issue. Why is this the case? Two primary reasons are at play: prioritization and the outdated notion of obtaining performance improvements for free. Software Development Priorities When it comes to system architecture, engineers typically rank their priorities as follows: 1. Security 2. Reliability 3. Performance This hierarchy is logical, as speed is only valuable if tasks can be executed securely and reliably. Crafting high-quality software demands considerable effort and discipline. The larger the scale of service, the more challenging it becomes to ensure security and reliability. In such scenarios, teams often grapple with bugs, new feature requests, and occasional availability issues stemming from various factors. Consequently, performance is frequently perceived as a luxury, with attention given only if it impacts the bottom line. This pattern is not unique to software development, though it may not be as apparent in other industries. For example, we rarely evaluate a car based on its "reliability" nowadays. In the past, this was not the case — it took several decades for people to shift their focus to aspects like design, comfort, and fuel efficiency, which were once considered "luxuries" or non-factors. A similar evolution is bound to occur within the software industry. As it matures, performance will increasingly become a key differentiator in a growing number of product areas, further underscoring its importance. The End of Free Performance Gains Another factor contributing to the low prioritization of performance is that it was once relatively effortless to achieve. As computers continually became faster, there was little incentive to invest in smarter engineering when a simple hardware upgrade could do the trick. However, this is no longer the case: According to Karl Rupp's microprocessor trend analysis data, CPU frequency has plateaued for over 15 years, with only marginal improvements in single-core speed. Although the number of CPU cores has increased to compensate for stagnating clock speeds, harnessing the power of multiple cores or additional machines demands more sophisticated engineering. Moreover, the law of diminishing returns comes into play rather quickly, emphasizing the need to prioritize performance in software development. Optimizing Application Performance Achieving exceptional results is rarely a matter of chance. If you want your product to meet specific standards and continually improve, you need a process that consistently produces such outcomes. For example, to develop secure applications, your process should include establishing security policies, conducting regular security reviews, and performing scans. Ensuring reliability involves multiple layers of testing (unit tests, integration tests, end-to-end tests), as well as monitoring, alerting, and training organizations to respond effectively to incidents. Owners of large systems also integrate stress and failure injection testing into their practices, often referred to as "chaos engineering." In essence, to build secure and reliable systems, software engineers must adhere to specific rules and practices applicable to all stages of a system development life cycle. Performance optimization is no exception to this principle. Application code is subject to various forces — new features, bug fixes, version upgrades, scaling, and more. While most code changes are performance-neutral, they are more likely to hinder performance than enhance it unless the modification is a deliberate optimization. Therefore, it is essential to implement a mechanism that encourages performance improvements while safeguarding against inadvertent regressions: It sounds simple in theory — we run a performance test suite and compare the results to previous runs. If a regression is detected, we block the release and fix it. In practice, it is a bit more complicated than that, especially for large/old projects. There could be different use cases that involve multiple independent components. Only some of them may regress, or such regressions only manifest under particular circumstances. These components communicate over the network, which may have noise levels high enough to hide regressions, especially for tail latency and outliers. The application may show no signs of performance degradation under a low load but substantially slow down under stress. Performance measurements may not capture general tendencies because some pay attention only to specific latency percentiles. On top of that, it’s impossible to model real-life conditions in non-production environments. So, without proper metrics and profiling, it would be difficult to identify and attribute performance issues. In other words, it requires a deep understanding of the system and the infrastructure, as well as having the right tools and knowledge of how to effectively put all of that together. In this post, I will touch only the tip of this iceberg. Essential Factors for Measuring Service Performance When aiming to measure the performance of an isolated network service, it is crucial to establish a well-defined methodology. This ensures that the results are accurate and relevant to real-world scenarios. The following criteria are critical for assessing service performance: Avoid network latency/noise Prevent “noisy neighbor” issues from a Virtual Machine (VM) co-location Achieve latency resolution down to microseconds Control request rate Minimize overhead from the benchmarking tool Saturate CPU up to 100% to determine the limit of your service Provide comprehensive latency statistics Let's delve deeper into each of the essential criteria for effectively measuring service performance. Avoid Network Latency/Noise Network latency is significantly larger than in-memory operations and is subject to various random events or states, such as packet loss, re-transmissions, and network saturation. These factors can impact results, particularly tail latency. Prevent “Noisy Neighbor” Issues From VM Co-Location Cloud providers often place multiple virtual machines (VMs) on the same hardware. To ensure accurate measurements, use either a dedicated "bare-metal" instance or a VM without over-subscription, providing exclusive access to all available CPUs. Control Request Rate Comparing different solutions at low, moderate, and high request rates allows for a more comprehensive evaluation. Thread contention and implementation deficiencies may only become evident at certain load levels. Additionally, ramping up traffic linearly helps analyze latency degradation patterns. Achieve Latency Resolution Down to Microseconds Millisecond latency resolution is suitable for measuring end-user latency but may be insufficient for a single service. To minimize noise-contributing factors and measure pure operation costs, a higher resolution is necessary. Although measuring batches of operations in milliseconds can be informative, this approach only calculates the average, losing visibility into percentiles and variance. Saturate CPU up to 100% To Determine the Service Limits The true test for a system is its performance at high request rates. Understanding a system's maximum throughput is crucial for allocating hardware resources and handling traffic spikes effectively. Minimize Overhead From the Benchmarking Tool Ensuring the precision and stability of measurements is paramount. The benchmarking tool itself should not be a significant source of overhead or latency, as this would undermine the reliability of the results. Comprehensive Latency Statistics To effectively measure performance, it is essential to utilize a range of metrics and statistics. The following list outlines some key indicators to consider: p50 (the median) — a value that is greater than 50% of observed latency samples. p90 — the 90th percentile, a value better than 9 out of 10 latency samples. This is typically a good proxy for latency perceivable by humans. p99 (tail latency) — the 99th percentile, the threshold for the worst 1% of samples. Outliers: p99.9 and p99.99 — crucial for systems with multiple network hops or large fan-outs (e.g., a request gathering data from tens or hundreds of microservices). max — the worst-case scenario should not be overlooked. tm99.9 — (trimmed mean), the mean value of all samples, excluding the best and worst 0.1%. This is more informative than the traditional mean, as it eliminates the potentially disproportionate influence of outliers. stddev — (standard deviation) — a measure of stability and predictability. While most of these metrics are considered standard in evaluating performance, it is worth taking a closer look at two less popular ones: outliers and the trimmed mean Outliers (P99.9 and P99.99) In simple terms, p99.9 and p99.99 represent the worst request per 1,000 and per 10,000 requests, respectively. While this may not seem significant to humans, it can considerably impact sub-request latency. Consider this example: To process a user request, your service makes 100 sub-requests to other services (whether sequential or parallel doesn't matter for this example). Assume the following latency distribution: 999 requests — 1ms 1 request — 1,000ms What is the probability your users will see 1,000+ms latency? The chance of a sub-request encountering 1s latency is 0.1%, but the probability of it occurring for at least one of the 100 sub-requests is: (1 — (1–0.001)¹⁰⁰)=0.095 , or 9.5%! Consequently, this may impact your end-user p90, which is certainly noticeable. The following image and table illustrate how the number of sub-requests translates p99, p99.9, and p99.99 into the end-user latency percentiles: x-axis — number of sub-requests, y-axis — the end-user percentile Or in the table form (for a few data points): As demonstrated, p99 may not be an accurate measure for systems with a large number of sub-requests. Therefore, it is crucial to monitor outliers like p99.9 or p99.99, depending on your system architecture, scale, and objectives. Trimmed Mean Are percentiles sufficient? The mean is commonly considered a flawed metric due to its sensitivity to outliers, which is why the median, also known as p50, is generally recommended as a better alternative. It's true that the mean can be sensitive to outliers, which can distort the metric. Let's assume we have the following latency measurements with a single outlier: The median gives a more precise picture of the general tendency Indeed, 1318.2 is not representative, and the median is much better at capturing the average user experience. However, the issue with percentiles is that they act as fences and are not sensitive to what happens inside the fence. In other words, this metric might not be proportional to the user experience. Let's assume our service degraded (from the top row to the bottom one): The median didn’t change, but users noticed the service became slower. That’s pretty bad. We may be degrading the user experience, and our p50/p90thresholds won’t detect it. Let's examine how the trimmed mean would handle this. It discards the best and the worst samples and calculates the average: The trimmed mean captures the general tendency better than the median. As shown, the trimmed mean reflects the performance regression while still providing a good sense of the average user experience. With the trimmed mean, we can enjoy the best of both worlds. Practice Exercise Now that we've explored the theoretical aspects of performance measurement let's dive into a practical exercise to put this knowledge to use. One interesting area to investigate is the baseline I/O performance of TCP proxies implemented in various languages such as C, C++, Rust, Golang, Java, and Python. A TCP proxy serves as an ideal test case because it focuses on the fundamental I/O operations without any additional processing. In essence, no TCP-based network service can be faster than a TCP proxy, making it an excellent subject for comparing the baseline performance across different programming languages. Upon starting my search for a benchmarking tool, I was somewhat surprised by the challenge in finding an appropriate option, especially within the open-source domain. Several existing tools are available for performance measurements and stress testing. However, tools written in Java/Python/Golang may suffer from additional overhead and noise generated by garbage-collection pauses. Apache Benchmarking tool ab doesn't allow the user to control the rate, and its latency statistics aren't as detailed. Also, it measures latency in milliseconds. To solve the first issue, I had to create a performance-gauging tool in Rust that satisfies all of the requirements mentioned earlier. It can produce output like this: Test: http-tunnel-rust Duration 60.005715353s Requests: 1499912 Request rate: 24996.152 per second Success rate: 100.000% Total bytes: 15.0 GB Bitrate: 1999.692 Mbps Summary: 200 OK: 1499912 Latency: Min : 109µs p50 : 252µs p90 : 392µs p99 : 628µs p99.9 : 1048µs p99.99 : 1917µs Max : 4690µs Mean : 276µs StdDev : 105µs tm95 : 265µs tm99 : 271µs tm99.9 : 274µs This output is useful for a single run. Unfortunately, a single run is never enough if you want to measure performance properly. It is also difficult to use this output for anything but taking a glance and forgetting the actual numbers. Reporting metrics to Prometheus and running a test continuously seems to be a natural solution to gather comprehensive performance data over time. Below, you can see the ramping up of traffic from 10k to 25k requests per second (rps) and the main percentiles (on the left — latency in µs; on the right — the request rate in 1,000 rps): Sending 25,000 requests per second and measuring the latency to µs resolution Why Prometheus? It’s an open-source framework integrated with Grafana and Alertmanager, so the tool can be easily incorporated into existing release pipelines. There are plenty of Docker images that allow you to quickly start Prom stack with Grafana (I installed them manually, which was also quick and straightforward.) Then I configured the following TCP proxies: HAProxy— in TCP-proxy mode. To compare to a mature solution written in C: http://www.haproxy.org/ draft-http-tunnel — a simple C++ solution with very basic functionality (trantor) (running in TCP mode): https://github.com/cmello/draft-http-tunnel/ (thanks to Cesar Mello, who coded it to make this benchmark possible). http-tunnel — a simple HTTP-tunnel/TCP-proxy written in Rust (tokio) (running in TCP mode): https://github.com/xnuter/http-tunnel/ tcp-proxy — a Golang solution: https://github.com/jpillora/go-tcp-proxy NetCrusher — a Java solution (Java NIO). Benchmarked on JDK 11, with G1: https://github.com/NetCrusherOrg/NetCrusher-java/ pproxy — a Python solution based on asyncio (running in TCP Proxy mode): https://pypi.org/project/pproxy/ All of the solutions above use Non-blocking I/O, it’s the best method to handle network communication if you need highly available services with low latency and large throughput. All these proxies were compared in the following modes: 25k rps, reusing connections for 50 requests. Max possible rate to determine the max throughput. 3.5k connections per second serving a single request (to test resource allocation and dispatching). If I were asked to summarize the test result in a single image, I would pick this one (from the max throughput test, the CPU utilization is close to 100%): Left axis: latency overhead. Right axis: throughput The blue line is tail latency (Y-axis on the left) — the lower, the better. The grey bars are throughput (Y-axis on the right) — the higher, the better. As you can see, Rust is indeed on par with C/C++, and Golang did well too. Java and Python are way behind. You can read more about the methodology and detailed results in the perf-gauge's wiki: Benchmarking low-level I/O: C, C++, Rust, Golang, Java, Python, in particular, measuring by these dimensions: Moderate RPS Max RPS Establishing connections (TLS handshake) Closing Thoughts In conclusion, understanding and accurately measuring the performance of network services is crucial for optimizing and maintaining the quality of your systems. As the software industry matures and performance becomes a key differentiator, it is essential to prioritize it alongside security and reliability. Historically, performance improvements were easily achieved through hardware upgrades. However, as CPU frequency plateaus and harnessing the power of multiple cores or additional machines demands more sophisticated engineering, it is increasingly important to focus on performance within the software development process itself. The evolution of performance prioritization is inevitable, and it is essential for software development teams to adapt and ensure that their systems can deliver secure, reliable, and high-performing services to their users. By emphasizing key metrics and statistics and applying accurate performance assessment methods, engineers can successfully navigate this shifting landscape and continue to create high-quality, competitive software.

By Eugene Retunsky

Scaling IBM App Connect Enterprise Integrations That Use MQ Request/Reply

In the initial "How to Move IBM App Connect Enterprise to Containers" post, a single MQ queue was used in place of an actual MQ-based back-end system used by an HTTP Input/Reply flow, which allowed for a clean example of moving from local MQ connections to remote client connections. In this post, we will look at what happens when an actual back-end is used with multiple client containers and explore solutions to the key challenge: how do we ensure that reply messages return to the correct server when all the containers have identical starting configurations? For a summary of the use of MQ correlation IDs; see Correlation ID Solution below. Issues With Multiple Containers and Reply Messages Consider the following scenario, which involves an HTTP-based flow that calls an MQ back-end service using one queue for requests and another for replies (some nodes not shown): The HTTP input message is sent to the MQ service by the top branch of the flow using the input queue, and the other branch of the flow receives the replies and sends them back to the HTTP client.The picture is complicated slightly by HTTPReply nodes requiring a “reply identifier” in order to know which HTTP client should receive a reply (there could be many clients connected simultaneously), with the identifier being provided by the HTTPInput node. The reply identifier can be saved in the flow (using ESQL shared variables or Java static variables) in the outbound branch and restored in the reply branch based on MQ-assigned message IDs, or else sent to the MQ service as a message or correlation ID to be passed back to the reply branch, or possibly sent as part of the message body; various solutions will work in this case.This behaves well with only one copy of the flow running, as all replies go through one server and back to the calling client. If the ACE container is scaled up, then there will be a second copy of the flow running with an identical configuration, and it might inadvertently pick up a message intended for the original server. At that point, it will attempt to reply but discover that the TCPIP socket is connected to the original server: This situation can arise even with only a single copy of the container deployed: a Kubernetes rolling update will create the new container before stopping the old one, leading to the situation shown above due to both containers running at the same time. While Kubernetes does have a “Recreate” deploy strategy that eliminates the overlap, it would clearly be better to solve the problem itself rather than restricting solutions to only one container. Containers present extra challenges when migrating from on-prem integration nodes: the scaling and restarts in the container world are often automated and not directly performed by administrators, and all of the replica containers have the same flows with the same nodes and options. There is also no “per-broker listener” in the container case, as each server has a separate listener. Solutions The underlying problem comes down to MQInput nodes picking up messages intended for other servers, and the solutions come in two general categories: Use correlation IDs for reply messages so that the MQInput nodes only get the messages for their server.Each server has a specific correlation ID, the messages sent from the MQOutput node are specific to that correlation ID, and the back-end service copies the correlation ID into the response message. The MQInput node only listens for messages with that ID, and so no messages for other servers will be picked up. This solution requires some way of preserving the HTTP reply identifier, which can be achieved in various ways. Create a separate queue for each server and configure the MQInput nodes for each container to use a separate queue.In this case, there is no danger of messages going back to the wrong server, as each server has a distinct queue for replies. These queues need to be created and the flows configured for this to work with ACE. The second category requires custom scripting in the current ACE v12 releases and so will not be covered in this blog post, but ACE v12 does have built-in support for the first category, with several options for implementing solutions that will allow for scaling and redeploying without messages going to the wrong server. Variations in the first category include the “message ID to correlation ID” pattern and synchronous flows, but the idea is the same. While containers show this problem (and therefore the solutions) nicely, the examples described here can be run on a local server also, and do not need to be run in containers. Scaling integration solutions to use multiple servers is much easier with containers, however, and so the examples focus on those scenarios. Example Solution Scenario Overview Building on the previous blog post, the examples we shall be using are at this GitHub repo and follow this pattern: The previous blog post showed how to create the MQ container and the ACE container for the simple flow used in that example, and these examples follow the same approach but with different queues and ACE applications. Two additional queues are needed, with backend-queues.mqsc showing the definitions: DEFINE QLOCAL(BACKEND.SHARED.INPUT) REPLACE DEFINE QLOCAL(BACKEND.SHARED.REPLY) REPLACE Also, the MQSC ConfigMap shown in a previous MQ blog post can be adjusted to include these definitions. The same back-end is used for all of the client flows, so it should be deployed once using the “MQBackend” application. A pre-built BAR file is available that can be used in place of the BAR file used in the post, "From IBM Integration Bus to IBM App Connect Enterprise in Containers (Part 4b)," and the IS-github-bar-MQBackend.yaml file can be used to deploy the backend service. See the README.md for details of how the flow works. Correlation ID Solution The CorrelIdClient example shows one way to use correlation IDs: This client flow relies on: The back-end flow honoring the MQMD Report settings A unique per-server correlation ID being available as a user variable The HTTP RequestIdentifier being usable as an MQ message ID The MQBackend application uses an MQReply node, which satisfies requirement 1, and requirement 3 is satisfied by the design of the ACE product itself: the request identifier is 24 bytes (which is the size of MQ’s MsgId and CorrelId) and is unique for a particular server. Requirement 2 is met in this case by setting a user variable to the SHA-1 sum of the HOSTNAME environment variable. SHA-1 is not being used in this case for cryptographic purposes but rather to ensure that the user variable is a valid hex number (with only letters from A-F and numbers) that is 24 bytes or less (20 in this case). The sha1sum command is run from a server startup script (supported from ACE 12.0.5) using the following server.conf.yaml setting (note that the spaces are very important due to it being YAML): YAML StartupScripts: EncodedHostScript: command: 'export ENCODED_VAR=`echo $HOSTNAME | sha1sum | tr -d "-" | tr -d " "` && echo UserVariables: && /bin/echo -e " script-encoded-hostname: \\x27$ENCODED_VAR\\x27"' readVariablesFromOutput: true This server.conf.yaml setting will cause the server to run the script and output the results: YAML 2022-11-10 13:04:57.226548: BIP9560I: Script 'EncodedHostScript' is about to run using command 'export ENCODED_VAR=`echo $HOSTNAME | sha1sum | tr -d "-" | tr -d " "` && echo UserVariables: && /bin/echo -e " script-encoded-hostname: \\x27$ENCODED_VAR\\x27"'. UserVariables: script-encoded-hostname:'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc' 2022-11-10 13:04:57.229552: BIP9567I: Setting user variable 'script-encoded-hostname'. 2022-11-10 13:04:57.229588: BIP9565I: Script 'EncodedHostScript' has run successfully. Once the user variable is set, it can be used for the MQInput node (to ensure only matching messages are received) and the MQOutput node (to provide the correct ID for the outgoing message). The MQInput node can accept user variable references in the correlation ID field (ignore the red X): The MQOutput node will send the contents of the MQMD parser from the flow, and so this is set in the “Create Outbound Message” Compute node using ESQL, converting the string in the user variable into a binary CorrelId: SQL -- Set the CorrelId of the outgoing message to match the MQInput node. DECLARE encHost BLOB CAST("script-encoded-hostname" AS BLOB); SET OutputRoot.MQMD.CorrelId = OVERLAY(X'000000000000000000000000000000000000000000000000' PLACING encHost FROM 1); SET OutputRoot.Properties.ReplyIdentifier = OutputRoot.MQMD.CorrelId; The ReplyIdentifier field in the Properties parser overwrites the MQMD CorrelId in some cases, so both are set to ensure the ID is picked up. The “script-encoded-hostname” reference is the name of the user variable, declared as EXTERNAL in the ESQL to cause the server to read the user variable when the ESQL is loaded: DECLARE "script-encoded-hostname" EXTERNAL CHARACTER; Other sections of the ESQL set the Report options, the HTTP request identifier, and the ReplyToQ: SQL -- Store the HTTP reply identifier in the MsgId of the outgoing message. -- This works because HTTP reply identifiers are the same size as an MQ -- correlation/message ID (by design). SET OutputRoot.MQMD.MsgId = InputLocalEnvironment.Destination.HTTP.RequestIdentifier; -- Tell the backend flow to send us the MsgId and CorrelId we send it. SET OutputRoot.MQMD.Report = MQRO_PASS_CORREL_ID + MQRO_PASS_MSG_ID; -- Tell the backend flow to use the queue for our MQInput node. SET OutputRoot.MQMD.ReplyToQ = 'BACKEND.SHARED.REPLY'; Deploying the flow requires a configuration for the server.conf.yaml mentioned above, and this must include the remote default queue manager setting as well as the hostname encoding script due to only one server.conf.yaml configuration being allowed by the operator. See this combined file and an encoded form ready for CP4i deployment. Once the configurations are in place, the flow itself can be deployed using the IS-github-bar-CorrelIdClient.yaml file to the desired namespace (“cp4i” in this case): kubectl apply -n cp4i -f IS-github-bar-CorrelIdClient.yaml (or using a direct HTTP URL to the Git repo). This will create two replicas, and once the servers are running then curl can be used to verify the flows operating successfully, and the CorrelId field should alternate between requests to show both servers sending and receiving messages correctly: YAML $ curl http://http-mq-correlidclient-http-cp4i.apps.cp4i-domain/CorrelIdClient {"originalMessage": {"MsgId":"X'455648540000000000000000c6700e78d900000000000000'","CorrelId":"X'20fa1e68cb59a328f559cc306aa52df3e58ffd3200000000'","ReplyToQ":"BACKEND.SHARED.REPLY ","jsonData":{"Data":{"test":"CorrelIdClient message"}},"backendFlow":{"application":"MQBackend"} $ curl http://http-mq-correlidclient-http-cp4i.apps.cp4i-domain/CorrelIdClient {"originalMessage": {"MsgId":"X'455648540000000000000000be900e78d900000000000000'","CorrelId":"X'd7cb7b30c4775d27aabbb4997020ebafd14f775700000000'","ReplyToQ":"BACKEND.SHARED.REPLY ","jsonData":{"Data":{"test":"CorrelIdClient message"}},"backendFlow":{"application":"MQBackend"} $ curl http://http-mq-correlidclient-http-cp4i.apps.cp4i-domain/CorrelIdClient {"originalMessage":{"MsgId":"X'455648540000000000000000c6700e78d900000000000000'","CorrelId":"X'20fa1e68cb59a328f559cc306aa52df3e58ffd3200000000'","ReplyToQ":"BACKEND.SHARED.REPLY ","jsonData":{"Data":{"test":"CorrelIdClient message"}},"backendFlow":{"application":"MQBackend"} Variations Other possible solutions exist: CorrelIdClientUsingBody shows a solution storing the HTTP reply identifier in the body of the message rather than using the MsgId field of the MQMD. This avoids a situation where the same MsgId is used twice (once for the request and the second time for the reply), but requires the back-end service to copy the identifier from the request to the reply message, and not all real-life services will do this. Similar flows can be built using RFH2 headers to contain the HTTP reply identifier, with the same requirement that the back-end service copies the RFH2 information. This flow uses a separate CorrelId padding value of “11111111” instead of “00000000” used by CorrelIdClient to ensure that the flows do not collide when using the same reply queue. Like the CorrelIdClient, this solution relies on startup scripts supported in ACE 12.0.5 and later. Sha1HostnameClient is a variant of CorrelIdClient that uses a predefined user variable called “sha1sum-hostname” provided by the server in ACE 12.0.6 and later fixpacks; this eliminates the need for startup scripts to set user variables. SyncClient shows a different approach, using a single flow that waits for the reply message. No user variables are needed, and the back-end service needs only to copy the request MsgId to the CorrelId in the reply (which is the default Report option), but the client flow will block in the MQGet until the reply message is received. This is potentially a very resource-intensive way to implement request-reply messaging, as every message will block a thread for as long as it takes the back-end to reply, and each thread will consume storage while running. Summary Existing integration solutions using MQ request/reply patterns may work unchanged in the scalable container world if they implement mechanisms similar to those described above, but many solutions will need to be modified to ensure correct operation. This is especially true for integrations that rely on the per-broker listener to handle the matching of replies to clients, but solutions are possible as is shown in the example above. Further conversation in the comments or elsewhere is always welcome! Acknowledgments: Thanks to Amar Shah for creating the original ACE-with-MQ blog post on which this is based, and for editorial help.

By Trevor Dolby

Assessment of Scalability Constraints (and Solutions)

This is an article from DZone's 2023 Software Integration Trend Report.For more: Read the Report Our approach to scalability has gone through a tectonic shift over the past decade. Technologies that were staples in every enterprise back end (e.g., IIOP) have vanished completely with a shift to approaches such as eventual consistency. This shift introduced some complexities with the benefit of greater scalability. The rise of Kubernetes and serverless further cemented this approach: spinning a new container is cheap, turning scalability into a relatively simple problem. Orchestration changed our approach to scalability and facilitated the growth of microservices and observability, two key tools in modern scaling. Horizontal to Vertical Scaling The rise of Kubernetes correlates with the microservices trend as seen in Figure 1. Kubernetes heavily emphasizes horizontal scaling in which replications of servers provide scaling as opposed to vertical scaling in which we derive performance and throughput from a single host (many machines vs. few powerful machines). Figure 1: Google Trends chart showing correlation between Kubernetes and microservice (Data source: Google Trends ) In order to maximize horizontal scaling, companies focus on the idempotency and statelessness of their services. This is easier to accomplish with smaller isolated services, but the complexity shifts in two directions: Ops – Managing the complex relations between multiple disconnected services Dev – Quality, uniformity, and consistency become an issue. Complexity doesn't go away because of a switch to horizontal scaling. It shifts to a distinct form handled by a different team, such as network complexity instead of object graph complexity. The consensus of starting with a monolith isn't just about the ease of programming. Horizontal scaling is deceptively simple thanks to Kubernetes and serverless. However, this masks a level of complexity that is often harder to gauge for smaller projects. Scaling is a process, not a single operation; processes take time and require a team. A good analogy is physical traffic: we often reach a slow junction and wonder why the city didn't build an overpass. The reason could be that this will ease the jam in the current junction, but it might create a much bigger traffic jam down the road. The same is true for scaling a system — all of our planning might make matters worse, meaning that a faster server can overload a node in another system. Scalability is not performance! Scalability vs. Performance Scalability and performance can be closely related, in which case improving one can also improve the other. However, in other cases, there may be trade-offs between scalability and performance. For example, a system optimized for performance may be less scalable because it may require more resources to handle additional users or requests. Meanwhile, a system optimized for scalability may sacrifice some performance to ensure that it can handle a growing workload. To strike a balance between scalability and performance, it's essential to understand the requirements of the system and the expected workload. For example, if we expect a system to have a few users, performance may be more critical than scalability. However, if we expect a rapidly growing user base, scalability may be more important than performance. We see this expressed perfectly with the trend towards horizontal scaling. Modern Kubernetes systems usually focus on many small VM images with a limited number of cores as opposed to powerful machines/VMs. A system focused on performance would deliver better performance using few high-performance machines. Challenges of Horizontal Scale Horizontal scaling brought with it a unique level of problems that birthed new fields in our industry: platform engineers and SREs are prime examples. The complexity of maintaining a system with thousands of concurrent server processes is fantastic. Such a scale makes it much harder to debug and isolate issues. The asynchronous nature of these systems exacerbates this problem. Eventual consistency creates situations we can't realistically replicate locally, as we see in Figure 2. When a change needs to occur on multiple microservices, they create an inconsistent state, which can lead to invalid states. Figure 2: Inconsistent state may exist between wide-sweeping changes Typical solutions used for debugging dozens of instances don't apply when we have thousands of instances running concurrently. Failure is inevitable, and at these scales, it usually amounts to restarting an instance. On the surface, orchestration solved the problem, but the overhead and resulting edge cases make fixing such problems even harder. Strategies for Success We can answer such challenges with a combination of approaches and tools. There is no "one size fits all," and it is important to practice agility when dealing with scaling issues. We need to measure the impact of every decision and tool, then form decisions based on the results. Observability serves a crucial role in measuring success. In the world of microservices, there's no way to measure the success of scaling without such tooling. Observability tools also serve as a benchmark to pinpoint scalability bottlenecks, as we will cover soon enough. Vertically Integrated Teams Over the years, developers tended to silo themselves based on expertise, and as a result, we formed teams to suit these processes. This is problematic. An engineer making a decision that might affect resource consumption or might impact such a tradeoff needs to be educated about the production environment. When building a small system, we can afford to ignore such issues. Although as scale grows, we need to have a heterogeneous team that can advise on such matters. By assembling a full-stack team that is feature-driven and small, the team can handle all the different tasks required. However, this isn't a balanced team. Typically, a DevOps engineer will work with multiple teams simply because there are far more developers than DevOps. This is logistically challenging, but the division of work makes more sense in this way. As a particular microservice fails, responsibilities are clear, and the team can respond swiftly. Fail-Fast One of the biggest pitfalls to scalability is the fail-safe approach. Code might fail subtly and run in non-optimal form. A good example is code that tries to read a response from a website. In a case of failure, we might return cached data to facilitate a failsafe strategy. However, since the delay happens, we still wait for the response. It seems like everything is working correctly with the cache, but the performance is still at the timeout boundaries. This delays the processing. With asynchronous code, this is hard to notice and doesn't put an immediate toll on the system. Thus, such issues can go unnoticed. A request might succeed in the testing and staging environment, but it might always fall back to the fail-safe process in production. Failing fast includes several advantages for these scenarios: It makes bugs easier to spot in the testing phase. Failure is relatively easy to test as opposed to durability. A failure will trigger fallback behavior faster and prevent a cascading effect. Problems are easier to fix as they are usually in the same isolated area as the failure. API Gateway and Caching Internal APIs can leverage an API gateway to provide smart load balancing, caching, and rate limiting. Typically, caching is the most universal performance tip one can give. But when it comes to scale, failing fast might be even more important. In typical cases of heavy load, the division of users is stark. By limiting the heaviest users, we can dramatically shift the load on the system. Distributed caching is one of the hardest problems in programming. Implementing a caching policy over microservices is impractical; we need to cache an individual service and use the API gateway to alleviate some of the overhead. Level 2 caching is used to store database data in RAM and avoid DB access. This is often a major performance benefit that tips the scales, but sometimes it doesn't have an impact at all. Stack Overflow recently discovered that database caching had no impact on their architecture, and this was because higher-level caches filled in the gaps and grabbed all the cache hits at the web layer. By the time a call reached the database layer, it was clear this data wasn't in cache. Thus, they always missed the cache, and it had no impact. Only overhead. This is where caching in the API gateway layer becomes immensely helpful. This is a system we can manage centrally and control, unlike the caching in an individual service that might get polluted. Observability What we can't see, we can't fix or improve. Without a proper observability stack, we are blind to scaling problems and to the appropriate fixes. When discussing observability, we often make the mistake of focusing on tools. Observability isn't about tools — it's about questions and answers. When developing an observability stack, we need to understand the types of questions we will have for it and then provide two means to answer each question. It is important to have two means. Observability is often unreliable and misleading, so we need a way to verify its results. However, if we have more than two ways, it might mean we over-observe a system, which can have a serious impact on costs. A typical exercise to verify an observability stack is to hypothesize common problems and then find two ways to solve them. For example, a performance problem in microservice X: Inspect the logs of the microservice for errors or latency — this might require adding a specific log for coverage. Inspect Prometheus metrics for the service. Tracking a scalability issue within a microservices deployment is much easier when working with traces. They provide a context and a scale. When an edge service runs into an N+1 query bug, traces show that almost immediately when they're properly integrated throughout. Segregation One of the most important scalability approaches is the separation of high-volume data. Modern business tools save tremendous amounts of meta-data for every operation. Most of this data isn't applicable for the day-to-day operations of the application. It is meta-data meant for business intelligence, monitoring, and accountability. We can stream this data to remove the immediate need to process it. We can store such data in a separate time-series database to alleviate the scaling challenges from the current database. Conclusion Scaling in the age of serverless and microservices is a very different process than it was a mere decade ago. Controlling costs has become far harder, especially with observability costs which in the case of logs often exceed 30 percent of the total cloud bill. The good news is that we have many new tools at our disposal — including API gateways, observability, and much more. By leveraging these tools with a fail-fast strategy and tight observability, we can iteratively scale the deployment. This is key, as scaling is a process, not a single action. Tools can only go so far and often we can overuse them. In order to grow, we need to review and even eliminate unnecessary optimizations if they are not applicable. This is an article from DZone's 2023 Software Integration Trend Report.For more: Read the Report

By Shai Almog CORE

The Power of Zero-Knowledge Proofs: Exploring the New ConsenSys zkEVM

It’s well-known that Ethereum needs support in order to scale. A variety of L2s (layer twos) have launched or are in development to improve Ethereum’s scalability. Among the most popular L2s are zero-knowledge-based rollups (also known as zk-rollups). Zk-rollups offer a solution that has both high scalability and minimal costs. In this article, we’ll define what zk-rollups are and review the latest in the market, the new ConsenSys zkEVM. This new zk-rollup—a fully EVM-equivalent L2 by ConsenSys— makes building with zero-knowledge proofs easier than ever. ConsenSys achieves this by allowing developers to port smart contracts easily, stay with the same toolset they already use, and bring users along with them smoothly—all while staying highly performant and cost-effective. If you don’t know a lot about zk-rollups, you’ll find how they work fascinating. They’re at the cutting edge of computer science. And if you do already know about zk-rollups, and you’re a Solidity developer, you’ll be interested in how the new ConsenSys zkEVM makes your dApp development a whole lot easier. It’s zk-rollup time! So let’s jump in. The Power of Zero-Knowledge Proofs Zk-rollups depend on zero-knowledge proofs. But what is a zero-knowledge proof? A zero-knowledge proof allows you to prove a statement is true—without sharing what the actual statement is, or how the truth was discovered. At its most basic, a prover passes secret information to an algorithm to compute the zero-knowledge proof. Then a verifier uses this proof with another algorithm to check that the prover actually knows the secret information. All this happens without revealing the actual information. There are a lot of details behind that above statement. Check out this article if you want to understand the cryptographic magic behind how it all works. But for our purpose, what’s important are the use cases of zero-knowledge proofs. A few examples: Anonymous payments—Traditional digital payments are not private, and even most crypto payments are on public blockchains. Zero-knowledge proofs offer a way to make truly private transactions. You can prove you paid for something … without revealing any details of the transaction. Identity protection—With zero-knowledge proofs, you can prove details of your personal identity while still keeping them private. For example, you can prove citizenship … without revealing your passport. And the most important use case for our purposes: Verifiable computation. What Is Verifiable Computation? Verifiable computation means you can have some other entity process computations for you and trust that the results are true … without knowing any of the details of the transaction. That means a layer 2 blockchain, such as the ConsenSys zkEVM, can become the outsourced computation layer for Ethereum. It can process a batch of transactions (much faster than Ethereum), create the proof for the validity of the transactions, and submit just the results and the proof to Ethereum. Ethereum, since it has the proof, doesn’t need the details—nor does it need a way to prove that the results are true. So instead of processing every transaction, Ethereum offloads the work to a separate chain. All Ethereum has to do is apply the results to its state. This vastly improves the speed and scalability of Ethereum. Exploring the New ConsenSys zkEVM and Why It’s Important Several zk-rollup L2s for Ethereum have already been released or are in progress. But the ConsenSys zkEVM could be the king. Let’s look at why: Type 2 ZK-EVM For one thing, it’s a Type 2 ZK-EVM—an evolution of zk-rollups. It’s faster and easier to use than Type 1 zk solutions. It offers better scalability and performance while still being fully EVM-equivalent. Traditionally with zk-proofs, it’s computationally expensive and slow for the prover to create proofs, which limits the capabilities and usefulness of the rollup. However, the ConsenSys zkEVM uses a recursion-friendly, lattice-based zkSNARK prover—which means faster finality and seamless withdraws, all while retaining the security of Ethereum settlements. And it delivers ultra-low gas fees. Solves the Problems of Traditional L2s Second, the ConsenSys zkEVM solves many of the practical problems of other L2s: Zero switching costs - It’s super easy to port smart contracts to the zkEVM. The zkEVM is EVM-equivalent down to the bytecode. So no rewriting code or smart contracts. You already know what you need to know to get started, and your current smart contracts already work. Easy to move your dApp users to the L2 - The zkEVM is supported by MetaMask, the leading web3 wallet. So most of your users are probably already able to access the zkEVM. Easy for devs - The zkEVM supports most popular tools out of the box. You can build, test, debug, and deploy your smart contracts with Hardhat, Infura, Truffle, etc. All the tools you use now, you can keep using. And there is already a bridge to move tokens onto and off the network. It uses ETH for gas - There’s no native token to the zkEVM, so you don’t need to worry about new tokens, third-party transpilers, or custom middleware. It’s all open source! How To Get Started Using the ConsenSys zkEVM The zkEVM private testnet was released in December 2022 and is moving to public testnet on March 28th, 2023. It’s already processed 774,000 transactions(and growing). There are lots of dApps already: uniswap, the graph, hop, and others. You can read the documentation for the zkEVM and deploy your own smart contract. Conclusion It’s definitely time for zk-rollups to shine. They are evolving quickly and leading the way in helping Ethereum to scale. It’s a great time to jump in and learn how they work—and building with the ConsenSys zkEVM is a great place to start! Have a really great day!

By John Vester CORE

11 Observability Tools You Should Know

When organizations move toward the cloud, their systems also lean toward distributed architectures. One of the most common examples is the adoption of microservices. However, this also creates new challenges when it comes to observability. You need to find the right tools to monitor, track and trace these systems by analyzing outputs through metrics, logs, and traces. It enables teams to quickly pinpoint the root cause of issues, fix them and optimize the application performance, giving them the confidence to deliver code faster. So, this article looks at the features, limitations, and important selling points of eleven popular observability tools to help you select the best one for your project. Helios Helios is a developer-observability solution that provides actionable insight into the end-to-end application flow. It incorporates OpenTelemetry's context propagation framework and provides visibility across microservices, serverless functions, databases, and 3rd party APIs. You can check out their sandbox or use it for free by signing up here. Key Features Provide a complete overview: Helios provides distributed tracing information in full context, showing how data flows through your entire application in any environment. Visualization: Enables users to collect and visualize trace data from multiple data sources to drill down and troubleshoot potential issues. Multi-language support: Supports multiple languages and frameworks, including Python, JavaScript, Node.js, Java, Ruby, .NET, Go, C++, and Collector. Share and reuse: You can easily collaborate with team members by sharing traces, tests, and triggers through Helios. In addition, Helios allows reusing requests, queries, and payloads with team members. Automatic test generation: Automatically generate tests based on trace data. Easy integrations: Integrates with your existing ecosystem, including logs, tests, error monitoring, and more. Workflow reproduction: Helios allows you to reproduce an exact workflow, including HTTP requests, Kafka and RabbitMQ messages, and Lambda invocations, in just a few clicks. Popular Use Cases Distributed tracing Multi-language application trace integration Serverless application observability Test troubleshooting API call automation Bottleneck analysis Prometheus Prometheus is an open-source tool broadly used to enable observability in cloud-native environments. It can collect and store time-series data and provides visualization tools to analyze and visualize the data collected. Key Features Data Collection: It can scrape metrics from various sources, including applications, services, and systems. It also supports many data formats supported out of the box, including logs, traces, and metrics. Data Storage: It stores the data collected in a time-series database, allowing efficient querying and aggregating of data over time. Alerting: Includes a built-in alerting system that can trigger alerts based on queries. Service Discovery: It can automatically detect and scrape metrics from services running in multiple environments, such as Kubernetes and other container orchestration systems. Grafana Integration: The tool has flexible integrations with Grafana, allowing it to create dashboards to display and analyze Prometheus metrics. Limitations Limited root cause analysis capabilities: The tool is primarily designed for monitoring and alerting. Therefore, it does not provide built-in root cause analysis. Scaling: Although the tool can handle many metrics, it can become resource intensive since Prometheus stores all data in memory. Data modeling: Contains a key-value pair-based data model and does not support nested fields and joins. Popular Use Cases Metrics collection and storage Alerting Service Discovery Grafana Grafana is an open-source tool predominantly used for data visualization and monitoring. It allows users to easily create and share interactive dashboards to visualize and analyze data from various sources. Key Features Data visualization: Creates customizable and interactive dashboards to visualize metrics and logs from various data sources. Alerting: Allows users to set up alerts based on the state of their metrics to indicate potential issues. Anomaly detection: Allows users to set up anomaly detection to automatically detect and alert based on abnormal behavior in their metrics. Root cause analysis: Allows users to drill down into the metrics to analyze the root cause by providing detailed information with historical context. Limitations Data storage: Its design does not support long-term storage and requires additional tools such as Prometheus or Elasticsearch to store metrics and logs. Data modeling: Grafana does not provide advanced data modeling capabilities. Hence, it is to model specific data types and perform complicated queries. Data aggregation: Grafana does not include built-in data aggregation capabilities. Popular Use Cases Metrics visualization Alerting Anomaly detection Elasticsearch, Logstash, and Kibana (ELK) The ELK stack is a popular open-source solution that helps to manage logs and analyze data. It comprises three components: Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed search and analytics engine that can handle large volumes of structured and unstructured data enabling users to store, index, and search large amounts of data. Logstash is a data collection and processing pipeline that allows users to collect, process, and enrich data from numerous sources, such as log files. Kibana is a data visualization and exploration tool that enables users to create interactive dashboards and visualizations based on the data within Elasticsearch. Key Features Log management: ELK allows users to collect, process, store and analyze log data and metrics from multiple sources while providing a centralized console to search through the logs. Search and analysis: Allows users to search and analyze relevant log data crucial in resolving and drilling down the root cause of issues. Data visualization: Kibana allows users to create customizable dashboards which can visualize log data and metrics from multiple data sources. Anomaly detection: Kibana allows the creation of alerts for abnormal activity within the log data. Root cause analysis: ELK stack allows users to drill down into the log data to better understand the root causes by providing detailed logs and historical context. Limitations Tracing: ELK does not natively support distributed tracing. Therefore, users may need to use additional tools such as Jaeger. Real-time monitoring: The design of ELK allows it to perform well as a log management and data analysis platform. But, there is a slight delay in the log reporting, and users will experience minor latencies. Complicated setup and maintenance: The platform involves a complex setup and maintenance process. Also, it requires specific knowledge to manage large amounts of data and numerous data sources. Popular Use Cases Log management Data visualization Compliance and security InfluxDB and Telegraf InfluxDB and Telegraf are open-source tools that are popular for their time-series data storage and monitoring capabilities. InfluxDB is a time-series database that stores and queries large amounts of time-series data using its SQL-like query language. On the other hand, Telegraf is a well-known data collection agent that can collect and send metrics and events to a wide range of receivers, such as InfluxDB. It also supports many data sources. Key Features The combination of InfluxDB and Telegraf brings in many features that benefit applications' observability. Metrics collection and storage: Telegraf allows users to collect metrics from many sources and sends them to InfluxDB for storage and analysis. Data visualization: InfluxDB can be integrated with third-party visualization tools such as Grafana to create interactive dashboards. Scalability: InfluxDB's design allows it to handle large amounts of time-series data and scale horizontally. Multiple data source support: Telegraf supports over 200 input plugins to collect metrics. Limitations Limited alerting capabilities: Both tools lack alerting capabilities and require a third-party integration to provide alerting. Limited root cause analysis: These tools lack native root cause analysis capabilities and require third-party integrations. Popular Use Cases Metrics collection and storage Monitoring Datadog Datadog is a popular cloud-based monitoring and analytics platform. It is widely used to get insights into the health and performance of distributed systems to troubleshoot issues beforehand. Key Features Multi-cloud support: Users can monitor applications running on multi-vendor cloud platforms such as AWS, Azure, GCP, etc. Service maps: Allows visualization of service dependencies, locations, services, and containers. Trace Analytics: Users can analyze traces while providing detailed information about application performance. Root cause analysis: Allows users to drill down into the metrics and traces to understand the root cause of the issues by providing detailed information with historical context. Anomaly detection: Can set up anomaly detection that can automatically detect and alert on abnormal behavior in metrics. Limitations Cost: Datadog is a cloud-based paid service, and charges are known to increase with large-scale deployments. Limited log ingestion, retention, and indexing support: Datadog does not provide log analysis support by default. You have to purchase log ingestion and indexing support for that separately. Hence, most organizations decide only to keep a limited number of logs retained, which can cause issues in troubleshooting since you can't access the complete history of the issue. Lack of control over data storage: Datadog stores data on its own servers and doesn't allow users to store data locally or in their own data centers. Popular Use Cases Observability pipelines Distributed tracing Container monitoring New Relic New Relic is a cloud-based monitoring and analytics platform that allows users to monitor applications and systems within a distributed environment. It uses the "New Relic Edge" service for distributed tracing and can observe 100% of an application's traces. Key Features Application performance monitoring: Provides a comprehensive APM solution to monitor and troubleshoot application performance. Multi-cloud support: Supports monitoring applications on multiple cloud platforms such as AWS, Azure, GCP, and more. Trace analytics: Enables users to analyze traces while providing detailed information about system and application performance. Root cause analysis: Allows users to drill down into the metrics and traces to analyze the root cause of issues. Log management: Collect, process, and analyze log data from various sources, providing a holistic view of the logs. Limitations Limited open-source integration: New Relic is a closed-source platform, and its integration with other open-source tools may be limited. Cost: New Relic can be costly compared to other solutions when working with large-scale deployments. Popular Use Cases Application performance monitoring Multi-cloud monitoring Trace analytics AppDynamics AppDynamics is a monitoring and analytics platform that allows you to observe, visualize, and manage each component of your application. In addition, it provides root cause analysis to identify underlying issues that may impact the application's performance. Key Features Data collection: Users can collect metrics and traces from numerous sources such as hosts, containers, cloud services, and applications. Anomaly detection: Enables users to set up anomaly detection, which can detect and alert on abnormal behavior. Trace Analytics: Users can analyze traces and provide detailed performance information. Application performance monitoring: Provides a comprehensive APM solution that allows users to monitor and troubleshoot the application's performance. Limitations Limited open-source integration: The vendor maintains the tool. Therefore, there may be limited open-source integrations. Limited customization: Customization options are not flexible compared to other tools since the users can not customize the solution themselves. Popular Use Cases Application performance monitoring Multi-cloud monitoring Business transaction management Selecting the Best Observability Tool Observability is an integral part of modern software development and operations. It helps organizations monitor the health and performance of their system and quickly solve problems before they become critical. This article discussed the 11 best observability tools developers should know when working with distributed systems. As you can see, each tool has its features and limitations. Therefore, evaluating them against your requirements is important to find the right fit for your organization. The best observability tool for your organization will depend on your specific needs, such as your environments, tech stack, developer experience, user profiles, monitoring and troubleshooting requirements, and workflow. I hope you have found this helpful. Thank you for reading!

By Lahiru Hewawasam

Testing Your Monitoring Configurations

Monitoring is a small aspect of our operational needs; configuring, monitoring, and checking the configuration of tools such as Fluentd and Fluentbit can be a bit frustrating, particularly if we want to validate more advanced configuration that does more than simply lift log files and dump the content into a solution such as OpenSearch. Fluentd and Fluentbit provide us with some very powerful features that can make a real difference operationally. For example, the ability to identify specific log messages and send them to a notification service rather than waiting for the next log analysis cycle to be run by a log store like Splunk. If we want to test the configuration, we need to play log events in as if the system was really running, which means realistic logs at the right speed so we can make sure that our configuration prevents alerts or mail storms. The easiest way to do this is to either take a real log and copy the events into a new log file at the speed they occurred or create synthetic events and play them in at a realistic pace. This is what the open-source LogGenerator (aka LogSimulator) does. I created the LogGenerator a couple of years ago, having addressed the same challenges before and wanting something that would help demo Fluentd configurations for a book (Logging in Action with Fluentd, Kubernetes, and more). Why not simply copy the log file for the logging mechanism to read? Several reasons for this. For example, if you're logging framework can send the logs over the network without creating back pressure, then logs can be generated without being impacted by storage performance considerations. But there is nothing tangible to copy. If you want to simulate into your monitoring environment log events from a database, then this becomes even harder as the DB will store the logs internally. The other reason for this is that if you have alerting controls based on thresholds over time, you need the logs to be consumed at the correct pace. Just allowing logs to be ingested whole is not going to correctly exercise such time-based controls. Since then, I've seen similar needs to pump test events into other solutions, including OCI Queue and other Oracle Cloud services. The OCI service support has been implemented using a simple extensibility framework, so while I've focused on OCI, the same mechanism could be applied as easily to AWS' SQS, for example. A good practice for log handling is to treat each log entry as an event and think of log event handling as a specialized application of stream analytics. Given that the most common approach to streaming and stream analytics these days is based on Kafka, we're working on an adaptor for the LogSimulator that can send the events to a Kafka API point. We built the LogGenerator so it can be run as a script, so modifying it and extending its behavior is quick and easy. we started out with developing using Groovy on top of Java8, and if you want to create a Jar file, it will compile as Java. More recently, particularly with the extensions we've been working with, Java11 and its ability to run single file classes from the source. We've got plans to enhance the LogGenerator so we can inject OpenTelementry events into Fluentbit and other services. But we'd love to hear about other use cases see for this. For more on the utility: Read the posts on my blog See the documentation on GitHub

By Phil Wilkins

Green Software and Carbon Hack

In November 2022, the Green Software Foundation organized its first hackathon, “Carbon Hack 2022,” with the aim of supporting software projects whose objective is to reduce carbon emissions. I participated in this hackathon with the Carbon Optimised Process Scheduler project along with my colleagues Kamlesh Kshirsagar and Mayur Andulkar, in which we developed an API to optimize job scheduling in order to reduce carbon emissions, and we won the “Most Insightful” project prize. In this article, I will summarize the key concepts of “green software” and explain how software engineers can help reduce carbon emissions. I will also talk about the Green Software Foundation hackathon, Carbon Hack, and its winners. What Is “Green Software”? According to this research by Malmodin and Lundén (2018), the global ICT sector is responsible for 1.4% of carbon emissions and 4% of electricity use. In another article, it is estimated that the ICT sector’s emissions in 2020 were between 1.8% and 2.8% of global greenhouse gas emissions. Even though these estimates carry some uncertainty, they give a reasonable idea of the impact of the ICT sector. Green Software Foundation defines “green software” as a new field that combines climate science, hardware, software, electricity markets, and data center design to create carbon-efficient software that emits the least amount of carbon possible. Green software focuses on three crucial areas to do this: hardware efficiency, carbon awareness, and energy efficiency. Green software practitioners should be aware of these six key points: Carbon Efficiency: Emit the least amount of carbon Energy Efficiency: Use the least amount of energy Carbon Awareness: Aim to utilize “cleaner” sources of electricity when possible Hardware Efficiency: Use the least amount of embodied carbon Measurement: You can’t get better at something that you don’t measure Climate Commitments: Understand the mechanism of carbon reduction What Can We Do as Software Engineers? Fighting global warming and climate change involves all of us, and since we can do it by changing our code, we might start by reading Ismael Velasco’s advice, an expert in this field. These principles are extracted from his presentation at the Code For All Summit 2022: 1. Green By Default We should move our applications to a greener cloud provider or zone. This article collects three main cloud providers. Google Cloud has matched 100% of its electricity consumption with renewable energy purchases since 2017 and has recently committed to fully decarbonize its electricity supply by 2030. Azure has been 100% carbon-neutral since 2012, meaning they remove as much carbon each year as they emit, either by removing carbon or reducing carbon emissions. AWS purchases and retires environmental attributes like renewable energy credits and Guarantees of Origin to cover the non-renewable energy used in specific regions. Also, only a handful of their data centers have achieved carbon neutrality through offsets. Make sure the availability zone where your app is hosted is green This can be checked on the website of the Green Web Foundation. Transfers of data should be optional, minimal, and sent just once. Prevent pointless data transfers. Delete useless information (videos, special fonts, unused JavaScript, and CSS). Optimize media and minify assets. Reduce page loads and data consumption with service workers’ focused caching solutions Make use of a content delivery network (CDN) You can handle all requests from servers that are currently using renewable energy thanks to Cloudfront Reduce the number of HTTP requests and data exchanges in your API designs Track your app’s environmental impact Start out quickly and simply, then gradually increase complexity. 2. Green Mode Design Users have the option to reduce functionality for less energy using the Green Mode design. Sound-only videos, Transcript-only audio Cache-only web app Zero ads/trackers Images optional: click-to-view, Grayscale images Green Mode is a way of designing software that prioritizes the extension of the device life and digital inclusion over graceful degradation. To achieve this, it suggests designing for maximum backward compatibility with operating systems and web APIs, as well as offering minimal versions of CSS. 3. Green Partnerships We should ponder three questions: What knowledge are we lacking? What missing networks are there? What can we provide to partners? What Is the Green Software Foundation? Accenture, GitHub, Microsoft, and ThoughtWorks launched the Green Software Foundation with the Linux Foundation to put software engineering’s focus on sustainability. The Green Software Foundation is a non-profit organization created under the Linux Foundation with the goal of creating a reliable ecosystem of individuals, standards, tools, and “green software best practices”. It focuses on lowering the carbon emissions that software is responsible for and reducing the adverse effects of software on the environment. Moreover, it was established for those who work in the software industry and has the aim to provide them with information on what they may do to reduce the software emissions that their work in the software industry is responsible for. Carbon Hack 2022 Carbon Hack 2022 took place for the first time between October 13th and November 10th, 2022, and was supported by the GSF member organizations Accenture, Avanade, Intel, Thoughtworks, Globant, Goldman Sachs, UBS, BCG, and VMware. The aim of Carbon Hack was to create carbon-aware software projects using The GSF Carbon Aware SDK, which has two parts, a Hosted API and a client library available for 40 languages. The hackathon had 395 participants and 51 qualified projects from all over the world. Carbon-aware software refers to when an application is executed at different times or in regions where electricity is generated from greener sources — like wind and solar — as this can reduce its carbon footprint. When the electricity is clean, carbon-aware software works harder; when the electricity is dirty, it works less. By including carbon-aware features in an application, we can partially offset our carbon footprint and lower greenhouse gas emissions. Carbon Hack 2022 Winners The total prize pool of $100,000 USD was divided between the first three winners and 4 category winners: First place – Lowcarb Lowcarb is a plugin that enables carbon-aware scheduling of training jobs on geographically distributed clients for the well-known federated learning framework Flower. The results of this plugin displayed 13% lower carbon emissions without any negative impacts. Second place – Carbon-Aware DNN Training with Zeus This energy optimization framework adjusts the power limit of the GPU and can be integrated into any DNN training job. The use case for Zeus showed a 24% reduction in carbon emissions and only a 3% decrease in learning time. Third place – Circa Circa is a lightweight library – written in C – that can be installed from a release using a procedure of configuring and making install instructions. It chooses the most effective time to run a program within a predetermined window of time and also contains a simple scripting command that waits for the energy with the lowest carbon density over a specified period of time. Most Innovative – Sustainable UI A library that provides a set of base primitives for building carbon-aware UIs to any React application; in the future, the developers would like to offer versions for other popular frameworks as well. The developers predicted that Facebook’s monthly emissions would be reduced by 1,800 metric tons of gross CO2 emissions if they were to use SUI Headless by reducing a tenth of a gram of CO2e every visit while gradually degrading its user interface. This is comparable to the fuel used by 24 tanker trucks or the annual energy consumption of 350 houses. Most Polished – GreenCourier Scheduling plugin implemented for Kubernetes. To deploy carbon-aware scheduling across geographically connected Kubernetes clusters, the authors developed a scheduling policy based on marginal carbon emission statistics obtained from the Carbon-aware SDK. Most Insightful – Carbon Optimised Process Scheduler Disclosure: This was my Carbon Hack team! In order to reduce carbon emissions, we developed an API service with a UI application that optimizes job scheduling. The problem was modeled using mixed-integer linear programming and solved using Open Source Solver. If it were possible to optimize hundreds of high-energy industrial processes, carbon emissions could be reduced by up to 2 million tons per year. An example scenario from the IT sector demonstrates how moving work by just three hours can reduce CO2 emissions by almost 18.5%. This results in a savings of roughly 300 thousand tons of CO2 per year when applied to a million IT processes. Most Actionable – HEDGE.earth 83% of the carbon emissions on the web come from API requests. This team developed a reverse proxy — an application that sits in front of back-end applications and forwards client requests to those apps — to maximize the amount of clean energy needed to complete API requests (also accessible in NPM). Take a look at all the projects from Carbon Hack 2022 here. Conclusion The collective effort and cross-disciplinary cooperation across industries and within engineering are really important to achieve global climate goals, and we can start with these two courses that the Linux Foundation and Microsoft offer regarding green software and sustainable software engineering. Also, we can begin debating with our colleagues on how to lower the carbon emissions produced by our applications. In addition, we could follow the people on their social media accounts who have knowledge on this topic; I would recommend the articles of Ismael Velasco to start. Regarding this article, if we achieve to write our codes greener, our software projects will be more robust, reliable, faster, and brand resilient. Sustainable software applications will not only help to reduce our carbon footprint with our applications but will also help to sustain our applications with fewer dependencies, better performance, low-resource usage, cost savings, and energy-efficient features.

By Beste Bayhan

Getting Started With Prometheus Workshop: Introduction to Prometheus

Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? If so, this workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting tool kit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is and is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. In this article, you'll be introduced to some basic concepts and learn what Prometheus is and is not before you start getting hands-on with it in the rest of the workshop. Introduction to Prometheus I'm going to get you started on your learning path with this first lab that provides a quick introduction to all things needed for metrics monitoring with Prometheus. Note this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself: The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is fairly simple: This lab introduces you to the Prometheus project and provides you with an understanding of its role in the cloud-native observability community. The start is with background on the beginnings of the Prometheus project and how it came to be part of the Cloud Native Computing Foundation (CNCF) as a graduated project. This leads to some basic outlining of what a data point is, how they are gathered, what makes them a metric and all using a high-level metaphor. You are then walked through what Prometheus is, why we are looking at this project as an open-source solution for your cloud-native observability solution, and more importantly, what Prometheus can not do for you. A basic architecture is presented walking you through the most common usage and components of a Prometheus metrics deployment. Below you see the final overview of the Prometheus architecture: You are then presented with an overview of all the powerful features and tools you'll find in your new Prometheus toolbox: Dimensional data model - For multi-faceted tracking of metrics Query language - PromQL provides a powerful syntax to gather flexible answers across your gathered metrics data. Time series processing - Integration of metrics time series data processing and alerting Service discovery - Integrated discovery of systems and services in dynamic environments Simplicity and efficiency - Operational ease combined with implementation in Go language Finally, you'll touch on the fact that Prometheus has a very simple design and functioning principle and that this has an impact on running it as a highly available (HA) component in your architecture. This aspect is only briefly touched upon, but don't worry: we cover this in more depth later in the workshop. At the end of each lab, including this one, you are presented with the end state (in this case we have not yet done anything), a list of references for further reading, a list of ways to contact me for questions, and a link to the next lab. Missed Previous Labs? This is one lab in the more extensive free online workshop. Feel free to start from the very beginning of this workshop here if you missed anything previously: You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Perses to pick up where you left off. Coming up Next I'll be taking you through the following lab in this workshop where you'll learn how to install and set up Prometheus on your own local machine. Stay tuned for more hands-on material to help you with your cloud-native observability journey.

By Eric D. Schabell

Key Elements of Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. This article discusses the key elements of SRE, including reliability goals and objectives, reliability testing, workload modeling, chaos engineering, and infrastructure readiness testing. The importance of SRE in improving user experience, system efficiency, scalability, and reliability, and achieving better business outcomes is also discussed. Site Reliability Engineering (SRE) is an emerging field that seeks to address the challenge of delivering high-quality, highly available systems. It combines the principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. SRE is a proactive and systematic approach to reliability optimization characterized by the use of data-driven models, continuous monitoring, and a focus on continuous improvement. SRE is a combination of software engineering and IT operations, combining the principles of DevOps with a focus on reliability. The goal of SRE is to automate repetitive tasks and to prioritize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The benefits of adopting SRE include increased reliability, faster resolution of incidents, reduced mean time to recovery, improved efficiency through automation, and increased collaboration between development and operations teams. In addition, organizations that adopt SRE principles can improve their overall system performance, increase the speed of innovation, and better meet the needs of their customers. SRE 5 Why's 1. Why Is SRE Important for Organizations? SRE is important for organizations because it ensures high availability, performance, and scalability of complex systems, leading to improved user experience and better business outcomes. 2. Why Is SRE Necessary in Today's Technology Landscape? SRE is necessary for today's technology landscape because systems and infrastructure have become increasingly complex and prone to failures, and organizations need a reliable and efficient approach to manage these systems. 3. Why Does SRE Involve Combining Software Engineering and Systems Administration? SRE involves combining software engineering and systems administration because both disciplines bring unique skills and expertise to the table. Software engineers have a deep understanding of how to design and build scalable and reliable systems, while systems administrators have a deep understanding of how to operate and manage these systems in production. 4. Why Is Infrastructure Readiness Testing a Critical Component of SRE? Infrastructure Readiness Testing is a critical component of SRE because it ensures that the infrastructure is prepared to support the desired system reliability goals. By testing the capacity and resilience of infrastructure before it is put into production, organizations can avoid critical failures and improve overall system performance. 5. Why Is Chaos Engineering an Important Aspect of SRE? Chaos Engineering is an important aspect of SRE because it tests the system's ability to handle and recover from failures in real-world conditions. By proactively identifying and fixing weaknesses, organizations can improve the resilience and reliability of their systems, reducing downtime and increasing confidence in their ability to respond to failures. Key Elements of SRE Reliability Metrics, Goals, and Objectives: Defining the desired reliability characteristics of the system and setting reliability targets. Reliability Testing: Using reliability testing techniques to measure and evaluate system reliability, including disaster recovery testing, availability testing, and fault tolerance testing. Workload Modeling: Creating mathematical models to represent system reliability, including Little's Law and capacity planning. Chaos Engineering: Intentionally introducing controlled failures and disruptions into production systems to test their ability to recover and maintain reliability. Infrastructure Readiness Testing: Evaluating the readiness of an infrastructure to support the desired reliability goals of a system. Reliability Metrics In SRE Reliability metrics are used in SRE is used to measure the quality and stability of systems, as well as to guide continuous improvement efforts. Availability: This metric measures the proportion of time a system is available and functioning correctly. It is often expressed as a percentage and calculated as the total uptime divided by the total time the system is expected to be running. Response Time: This measures the time it takes for the infrastructure to respond to a user request. Throughput: This measures the number of requests that can be processed in a given time period. Resource Utilization: This measures the utilization of the infrastructure's resources, such as CPU, memory, Network, Heap, caching, and storage. Error Rate: This measures the number of errors or failures that occur during the testing process. Mean Time to Recovery (MTTR): This metric measures the average time it takes to recover from a system failure or disruption, which provides insight into how quickly the system can be restored after a failure occurs. Mean Time Between Failures (MTBF): This metric measures the average time between failures for a system. MTBF helps organizations understand how reliable a system is over time and can inform decision-making about when to perform maintenance or upgrades. Reliability Testing In SRE Performance Testing: This involves evaluating the response time, processing time, and resource utilization of the infrastructure to identify any performance issues under BAU scenario 1X load. Load Testing: This technique involves simulating real-world user traffic and measuring the performance of the infrastructure under heavy loads 2X Load. Stress Testing: This technique involves applying more load than the expected maximum to test the infrastructure's ability to handle unexpected traffic spikes 3X Load. Chaos or Resilience Testing: This involves simulating different types of failures (e.g., network outages, hardware failures) to evaluate the infrastructure's ability to recover and continue operating. Security Testing: This involves evaluating the infrastructure's security posture and identifying any potential vulnerabilities or risks. Capacity Planning: This involves evaluating the current and future hardware, network, and storage requirements of the infrastructure to ensure it has the capacity to meet the growing demand. Workload Modeling In SRE Workload Modeling is a crucial part of SRE, which involves creating mathematical models to represent the expected behavior of systems. Little's Law is a key principle in this area, which states that the average number of items in a system, W, is equal to the average arrival rate (λ) multiplied by the average time each item spends in the system (T): W = λ * T. This formula can be used to determine the expected number of requests a system can handle under different conditions. Example: Consider a system that receives an average of 200 requests per minute, with an average response time of 2 seconds. We can calculate the average number of requests in the system using Little's Law as follows: W = λ * T W = 200 requests/minute * 2 seconds/request W = 400 requests This result indicates that the system can handle up to 400 requests before it becomes overwhelmed and reliability degradation occurs. By using the right workload modeling, organizations can determine the maximum workload that their systems can handle and take proactive steps to scale their infrastructure and improve reliability and allow them to identify potential issues and design solutions to improve system performance before they become real problems. Tools and techniques used for modeling and simulation: Performance Profiling: This technique involves monitoring the performance of an existing system under normal and peak loads to identify bottlenecks and determine the system's capacity limits. Load Testing: This is the process of simulating real-world user traffic to test the performance and stability of an IT system. Load testing helps organizations identify performance issues and ensure that the system can handle expected workloads. Traffic Modeling: This involves creating a mathematical model of the expected traffic patterns on a system. The model can be used to predict resource utilization and system behavior under different workload scenarios. Resource Utilization Modeling: This involves creating a mathematical model of the expected resource utilization of a system. The model can be used to predict resource utilization and system behavior under different workload scenarios. Capacity Planning Tools: There are various tools available that automate the process of capacity planning, including spreadsheet tools, predictive analytics tools, and cloud-based tools. Chaos Engineering and Infrastructure Readiness in SRE Chaos Engineering and Infrastructure Readiness are important components of a successful SRE strategy. They both involve intentionally inducing failures and stress into systems to assess their strength and identify weaknesses. Infrastructure readiness testing is done to verify the system's ability to handle failure scenarios, while chaos engineering tests the system's recovery and reliability under adverse conditions. The benefits of chaos engineering include improved system reliability, reduced downtime, and increased confidence in the system's ability to handle real-world failures and proactively identify and fix weaknesses; organizations can avoid costly downtime, improve customer experience, and reduce the risk of data loss or security breaches. Integrating chaos engineering into DevOps practices (CI\CD) can ensure their systems are thoroughly tested and validated before deployment. Methods of chaos engineering typically involve running experiments or simulations on a system to stress and test its various components, identify any weaknesses or bottlenecks, and assess its overall reliability. This is done by introducing controlled failures, such as network partitions, simulated resource exhaustion, or random process crashes, and observing the system's behavior and response. Example Scenarios for Chaos Testing Random Instance Termination: Selecting and terminating an instance from a cluster to test the system response to the failure. Network Partition: Partitioning the network between instances to simulate a network failure and assess the system's ability to recover. Increased Load: Increasing the load on the system to test its response to stress and observing any performance degradation or resource exhaustion. Configuration Change: Altering a configuration parameter to observe the system's response, including any unexpected behavior or errors. Database Failure: Simulating a database failure by shutting it down and observing the system's reaction, including any errors or unexpected behavior. By conducting both chaos experiments and infrastructure readiness testing, organizations can deepen their understanding of system behavior and improve their resilience and reliability. Conclusion In conclusion, SRE is a critical discipline for organizations that want to deliver highly reliable, highly available systems. By adopting SRE principles and practices, organizations can improve system reliability, reduce downtime, and improve the overall user experience.

By Srikarthick Vijayakumar

5 Reasons You Need to Care About API Performance Monitoring

Connectivity is so daunting. By far, we are all used to instant connectivity that puts the world at our fingertips. We can purchase, post, and pick anything, anywhere, with the aid of desktops and devices. But how does it happen? How do different applications in different devices connect with each other? Allowing us to place an order, plan a vacation, make a reservation, etc., with just a few clicks. API—Application Programming Interface—the unsung hero of the modern world, which is often underrated. What Is an API? APIs are building blocks of online connectivity. They are a medium for multiple applications, data, and devices to interact with each other. Simply put, an API is a messenger that takes a request and tells the system what you want to do and then returns the response to the user. Documentation is drafted for every API, including specifications regarding how the information gets transferred between two systems. Why Is API Important? APIs can interact with third-party applications publicly. Ultimately, upscaling the reach of an organization’s business. So, when we book a ticket via “Bookmyshow.com,” we fill in details regarding the movie we plan to watch, like: Movie name Locality 3D/2D Language These details are fetched by API and are taken to servers associated with different movie theatres to bring back the collected response from multiple third-party servers. Providing users the convenience of choosing which theatre fits best. This is how different applications interact with each other. Instead of making a large application and adding more functionalities via code in it. The present time demands microservice architecture wherein we create multiple individually focused modules with well-defined interfaces and then combine them to make a scalable, testable product. The product or software, which might have taken a year to deliver, can now be delivered in weeks with the help of microservice architecture. API serves as a necessity for microservice architecture. Consider an application that delivers music, shopping, and bill payments service to end users under a single hood. The user needs to log into the app and select the service for consumption. API is needed for collaborating different services for such applications, contributing to an overall enhanced UX. API also enables an extra layer of security to the data. Neither the user’s data is overexposed to the server: nor the server data is overexposed to the user. Say, in the case of movies, API tells the server what the user would like to watch and then the user what they have to give to redeem the service. Ultimately, you get to watch your movie, and the service provider is credited accordingly. API Performance Monitoring and Application Performance Monitoring Differences As similar as these two terms sound, they perform distinctive checks on the overall application connectivity: Application performance monitoring: is compulsory for high-level analytics regarding how well the app is executing on the integral front. It facilitates an internal check on the internal connectivity of software. The following are key data factors that must be monitored: Server loads User adoption Market share Downloads Latency Error logging API performance monitoring: is required to check if there are any bottlenecks outside the server; it could be in the cloud or load balancing service. These bottlenecks are not dependent on your application performance monitoring but are still considered to be catastrophic as they may abrupt the service for end users. It facilitates a check on the external connectivity of the software, aiding its core functionalities: Back-end business operations Alert operations Web services Why Is API Performance Monitoring a Necessity? 1. Functionality With the emergence of modern agile practices, organizations are adopting a virtuous cycle of developing, testing, delivering, and maintaining by monitoring the response. It is integral to involve API monitoring as part of the practice. A script must be maintained in relevance to the appropriate and latest versions of the functional tests for ensuring a flawless experience of services to the end user. Simply put, if your API goes south, your app goes with it. For instance, in January 2016, a worldwide outage was met by the Twitter API. This outage lasted more than an hour, and within that period, it impacted thousands of websites and applications. 2. Performance Organizations are open to performance reckoning if they neglect to thoroughly understand the process involved behind every API call. Also, API monitoring helps acknowledge which APIs are performing better and how to improvise on the APIs with weaker performance displays. 3. Speed/Responsiveness Users can specify the critical API calls in the performance monitoring tool. Set their threshold (acceptable response time) to ensure they get alerted if the expected response time deteriorates. 4. Availability With the help of monitoring, we can realize whether all the services hosted by our applications are accessible 24×7. Why Monitor API When We Can Test it? Well, an API test can be highly composite, considering the large number of multi-steps that get involved. This creates a problem in terms of the frequency required for the test to take place. This is where monitoring steps in! Allowing every hour band check regarding the indispensable aspects. Helping us focus on what’s most vital to our organization. How To Monitor API Performance Identify your dependable APIs—Recognize your employed APIs, whether they are third-party or partner APIs. Internally connecting or externally? Comprehend the functional and transactional use cases for facilitating transparency towards the services being hosted — improves performance and MTTR (Mean Time to Repair). Realize whether you have test cases required to monitor. Whether you have existing test cases that need to be altered, or is there an urgency for new ones to be developed? Know the right tool—API performance monitoring is highly dependent on the tool being used. You need an intuitive, user-friendly, result-optimizing tool with everything packed in. Some commonly well known platforms to perform API performance testing are: CA Technologies(Now Broadcom Inc.) AlertSite Rigor Runscope One more factor to keep a note of is API browser compatibility to realize how well your API can aid different browsers. To know more about this topic, follow our blog about “API and Browser Compatibility.” Conclusion API performance monitoring is a need of modern times that gives you a check regarding the internal as well as external impact of the services hosted by a product. Not everyone cares to bother about APIs, but we are glad you did! Hoping this article will help expand your understanding of the topic. Cheers!

By Harshit Paul

Performance

DZone's Featured Performance Resources

Top Performance Experts

The Latest Performance Topics