Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
This is an article from DZone's 2023 Software Integration Trend Report. For more: Read the Report

Our approach to scalability has gone through a tectonic shift over the past decade. Technologies that were staples in every enterprise back end (e.g., IIOP) have vanished completely with a shift to approaches such as eventual consistency. This shift introduced some complexities with the benefit of greater scalability. The rise of Kubernetes and serverless further cemented this approach: spinning a new container is cheap, turning scalability into a relatively simple problem. Orchestration changed our approach to scalability and facilitated the growth of microservices and observability, two key tools in modern scaling.

Horizontal to Vertical Scaling

The rise of Kubernetes correlates with the microservices trend as seen in Figure 1. Kubernetes heavily emphasizes horizontal scaling in which replications of servers provide scaling as opposed to vertical scaling in which we derive performance and throughput from a single host (many machines vs. few powerful machines).

Figure 1: Google Trends chart showing correlation between Kubernetes and microservice (Data source: Google Trends)

In order to maximize horizontal scaling, companies focus on the idempotency and statelessness of their services. This is easier to accomplish with smaller isolated services, but the complexity shifts in two directions:

- Ops – Managing the complex relations between multiple disconnected services
- Dev – Quality, uniformity, and consistency become an issue.

Complexity doesn't go away because of a switch to horizontal scaling. It shifts to a distinct form handled by a different team, such as network complexity instead of object graph complexity. The consensus of starting with a monolith isn't just about the ease of programming. Horizontal scaling is deceptively simple thanks to Kubernetes and serverless. However, this masks a level of complexity that is often harder to gauge for smaller projects.
Scaling is a process, not a single operation; processes take time and require a team. A good analogy is physical traffic: we often reach a slow junction and wonder why the city didn't build an overpass. The reason could be that this will ease the jam in the current junction, but it might create a much bigger traffic jam down the road. The same is true for scaling a system — all of our planning might make matters worse, meaning that a faster server can overload a node in another system. Scalability is not performance! Scalability vs. Performance Scalability and performance can be closely related, in which case improving one can also improve the other. However, in other cases, there may be trade-offs between scalability and performance. For example, a system optimized for performance may be less scalable because it may require more resources to handle additional users or requests. Meanwhile, a system optimized for scalability may sacrifice some performance to ensure that it can handle a growing workload. To strike a balance between scalability and performance, it's essential to understand the requirements of the system and the expected workload. For example, if we expect a system to have a few users, performance may be more critical than scalability. However, if we expect a rapidly growing user base, scalability may be more important than performance. We see this expressed perfectly with the trend towards horizontal scaling. Modern Kubernetes systems usually focus on many small VM images with a limited number of cores as opposed to powerful machines/VMs. A system focused on performance would deliver better performance using few high-performance machines. Challenges of Horizontal Scale Horizontal scaling brought with it a unique level of problems that birthed new fields in our industry: platform engineers and SREs are prime examples. The complexity of maintaining a system with thousands of concurrent server processes is fantastic. 
Such a scale makes it much harder to debug and isolate issues. The asynchronous nature of these systems exacerbates this problem. Eventual consistency creates situations we can't realistically replicate locally, as we see in Figure 2. When a change needs to occur on multiple microservices, they create an inconsistent state, which can lead to invalid states. Figure 2: Inconsistent state may exist between wide-sweeping changes Typical solutions used for debugging dozens of instances don't apply when we have thousands of instances running concurrently. Failure is inevitable, and at these scales, it usually amounts to restarting an instance. On the surface, orchestration solved the problem, but the overhead and resulting edge cases make fixing such problems even harder. Strategies for Success We can answer such challenges with a combination of approaches and tools. There is no "one size fits all," and it is important to practice agility when dealing with scaling issues. We need to measure the impact of every decision and tool, then form decisions based on the results. Observability serves a crucial role in measuring success. In the world of microservices, there's no way to measure the success of scaling without such tooling. Observability tools also serve as a benchmark to pinpoint scalability bottlenecks, as we will cover soon enough. Vertically Integrated Teams Over the years, developers tended to silo themselves based on expertise, and as a result, we formed teams to suit these processes. This is problematic. An engineer making a decision that might affect resource consumption or might impact such a tradeoff needs to be educated about the production environment. When building a small system, we can afford to ignore such issues. Although as scale grows, we need to have a heterogeneous team that can advise on such matters. By assembling a full-stack team that is feature-driven and small, the team can handle all the different tasks required. 
However, this isn't a balanced team. Typically, a DevOps engineer will work with multiple teams simply because there are far more developers than DevOps engineers. This is logistically challenging, but the division of work makes more sense in this way. When a particular microservice fails, responsibilities are clear, and the team can respond swiftly.

Fail-Fast

One of the biggest pitfalls to scalability is the fail-safe approach. Code might fail subtly and run in non-optimal form. A good example is code that tries to read a response from a website. In a case of failure, we might return cached data to facilitate a fail-safe strategy. However, the fallback only kicks in after the request times out, so we still wait for the response. It seems like everything is working correctly with the cache, but the performance is still at the timeout boundaries. This delays the processing.

With asynchronous code, this is hard to notice and doesn't put an immediate toll on the system. Thus, such issues can go unnoticed. A request might succeed in the testing and staging environment, but it might always fall back to the fail-safe process in production.

Failing fast has several advantages in these scenarios:

- It makes bugs easier to spot in the testing phase. Failure is relatively easy to test as opposed to durability.
- A failure will trigger fallback behavior faster and prevent a cascading effect.
- Problems are easier to fix as they are usually in the same isolated area as the failure.

API Gateway and Caching

Internal APIs can leverage an API gateway to provide smart load balancing, caching, and rate limiting. Typically, caching is the most universal performance tip one can give. But when it comes to scale, failing fast might be even more important. In typical cases of heavy load, the division of users is stark: a small group of users generates most of the traffic. By limiting the heaviest users, we can dramatically shift the load on the system. Distributed caching is one of the hardest problems in programming.
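The point above about limiting the heaviest users can be sketched as a per-user token bucket, which is one common way a gateway might enforce rate limits. This is a minimal illustration, not the behavior of any particular API gateway: the class name, the `capacity` and `refillPerSecond` parameters, and the injectable `now` clock are all assumptions made for the example.

```javascript
// Hypothetical per-user token bucket; names and numbers are illustrative only.
class TokenBucketLimiter {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;               // maximum burst size per user
    this.refillPerSecond = refillPerSecond; // sustained request rate per user
    this.now = now;                         // injectable clock, handy for testing
    this.buckets = new Map();               // userId -> { tokens, updatedAt }
  }

  // Returns true if the request may proceed, false if it should be rejected.
  allow(userId) {
    const t = this.now();
    let bucket = this.buckets.get(userId);
    if (!bucket) {
      bucket = { tokens: this.capacity, updatedAt: t };
      this.buckets.set(userId, bucket);
    }
    // Refill in proportion to the time elapsed since the last request.
    const elapsedSeconds = (t - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond
    );
    bucket.updatedAt = t;
    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return true;
    }
    return false; // reject immediately instead of queueing
  }
}

// Usage with a fake clock so the behavior is deterministic.
let fakeTime = 0;
const limiter = new TokenBucketLimiter({
  capacity: 3,
  refillPerSecond: 1,
  now: () => fakeTime,
});
const heavyUser = [1, 2, 3, 4, 5].map(() => limiter.allow('heavy'));
const lightUser = limiter.allow('light');
fakeTime += 2000; // two seconds pass
const heavyAfterWait = limiter.allow('heavy');
console.log({ heavyUser, lightUser, heavyAfterWait });
```

Rejecting the request as soon as the bucket is empty, rather than queueing it, keeps this in line with the fail-fast approach: the heavy user gets an immediate answer, and the light user is unaffected.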
Implementing a caching policy over microservices is impractical; we need to cache an individual service and use the API gateway to alleviate some of the overhead. Level 2 caching is used to store database data in RAM and avoid DB access. This is often a major performance benefit that tips the scales, but sometimes it doesn't have an impact at all. Stack Overflow recently discovered that database caching had no impact on their architecture, and this was because higher-level caches filled in the gaps and grabbed all the cache hits at the web layer. By the time a call reached the database layer, it was clear this data wasn't in cache. Thus, they always missed the cache, and it had no impact. Only overhead. This is where caching in the API gateway layer becomes immensely helpful. This is a system we can manage centrally and control, unlike the caching in an individual service that might get polluted. Observability What we can't see, we can't fix or improve. Without a proper observability stack, we are blind to scaling problems and to the appropriate fixes. When discussing observability, we often make the mistake of focusing on tools. Observability isn't about tools — it's about questions and answers. When developing an observability stack, we need to understand the types of questions we will have for it and then provide two means to answer each question. It is important to have two means. Observability is often unreliable and misleading, so we need a way to verify its results. However, if we have more than two ways, it might mean we over-observe a system, which can have a serious impact on costs. A typical exercise to verify an observability stack is to hypothesize common problems and then find two ways to solve them. For example, a performance problem in microservice X: Inspect the logs of the microservice for errors or latency — this might require adding a specific log for coverage. Inspect Prometheus metrics for the service. 
Tracking a scalability issue within a microservices deployment is much easier when working with traces. They provide a context and a scale. When an edge service runs into an N+1 query bug, traces show that almost immediately when they're properly integrated throughout. Segregation One of the most important scalability approaches is the separation of high-volume data. Modern business tools save tremendous amounts of meta-data for every operation. Most of this data isn't applicable for the day-to-day operations of the application. It is meta-data meant for business intelligence, monitoring, and accountability. We can stream this data to remove the immediate need to process it. We can store such data in a separate time-series database to alleviate the scaling challenges from the current database. Conclusion Scaling in the age of serverless and microservices is a very different process than it was a mere decade ago. Controlling costs has become far harder, especially with observability costs which in the case of logs often exceed 30 percent of the total cloud bill. The good news is that we have many new tools at our disposal — including API gateways, observability, and much more. By leveraging these tools with a fail-fast strategy and tight observability, we can iteratively scale the deployment. This is key, as scaling is a process, not a single action. Tools can only go so far and often we can overuse them. In order to grow, we need to review and even eliminate unnecessary optimizations if they are not applicable.
In recent years, the term MLOps has become a buzzword in the world of AI, often discussed in the context of tools and technology. However, while much attention is given to the technical aspects of MLOps, what's often overlooked is the importance of the operations. There is often a lack of discussion around the operations needed for machine learning (ML) in production and monitoring specifically. Things like accountability for AI performance, timely alerts for relevant stakeholders, and the establishment of necessary processes to resolve issues are often disregarded for discussions about specific tools and tech stacks. ML teams have traditionally been research-oriented, focusing heavily on training models to achieve high testing scores. However, once the model is ready to be deployed in real business processes and applications, the culture around establishing production-oriented operations is lacking. As a consequence, there is a lack of clarity regarding who is responsible for the models' outcomes and performance. Without the right operations in place, even the most advanced tools and technology won't be enough to ensure healthy governance for your AI-driven processes. 1. Cultivate a Culture of Accountability As previously stated, data science and ML teams have traditionally been research-oriented and were measured on model evaluation scores and not on real-world, business-related outcomes. In such an environment, there is no way monitoring will be done correctly because frankly - no one cares sufficiently. To fix this situation, the team responsible for building AI models must take ownership and feel accountable for the models' success or failure in serving the business function it was designed for. 
The best way to achieve this is by measuring the individual's and the team's performance based on production-oriented KPIs and creating an environment that fosters a sense of ownership over the model's overall performance rather than just in controlled testing environments. While some team members may remain focused on research, it's important to recognize that achieving good test scores in experiments is not sufficient to ensure the model's success in production. The ultimate success of the model lies in its effectiveness in real-world business processes and applications. 2. Make a "Monitoring Plan" Part of Your Release Checklist To ensure the ongoing success of an AI-driven application, planning how it is going to be monitored is a critical factor that should not be overlooked. In healthy engineering organizations, there is always a release checklist that entails setting up a monitoring plan whenever a new component is released. AI teams should follow that pattern. The person or team responsible for building a model must have a clear understanding of how it fits into the overall system and should be able to predict potential issues that could arise, as well as identify who needs to be alerted and what actions should be taken in the event of an issue. While some potential issues may be more research-oriented, such as data or concept drift, there are many other factors to consider, such as a broken feature pipeline or a third-party data provider changing input formats. Therefore, it is important to anticipate as many of these issues as possible and set up a plan to effectively deal with them should they arise. Although it's very likely that there are potential issues that will remain unforeseen, it's still better to do something rather than nothing, and typically, the first 80% of issues can be anticipated with 20% of the work. 3. 
Establish an On-Call Rotation Sharing the responsibility among team members may be necessary or helpful, depending on the size of your team and the number of models or systems under your control. By setting up an "on-call" rotation, everyone can have peace of mind knowing that there is at least one knowledgeable person available to handle any issues the moment they arise. It's important to note that taking care of an issue doesn't necessarily mean solving the problem immediately. Sometimes, it might mean triaging and deferring it to a later time or waking up the person who is best equipped to solve the problem. Sharing an on-call rotation with pre-existing engineering teams can also be an option in some instances. However, this is use-case dependent and may not be possible for every team. Regardless of the approach, it is imperative to establish a shared knowledge base that the person on-call can utilize so that your team can be well-prepared to take care of emerging issues. 4. Set up a Shared Knowledge Base To maintain healthy monitoring operations, it is essential to have accessible resources that detail how your system works and its main components. This is where wikis and playbooks come in. Wikis can provide a central location for documentation on your system, including its architecture, data sources, and model dependencies. Playbooks can be used to document specific procedures for handling common issues or incidents that may arise. Having these resources in place can help facilitate knowledge sharing and ensure that everyone on the team is equipped to troubleshoot and resolve issues quickly. It also allows for smoother onboarding of new team members who can quickly get up to speed on the system. In addition, having well-documented procedures and protocols can help reduce downtime and improve response times when issues transpire. 5. Implement Post Mortems Monitoring is an iterative process. It is impossible to predict everything that might go wrong in advance. 
But when an issue does occur and goes undetected or unresolved for too long, it is important to conduct a thorough analysis of the issue and identify the root cause. Once a root cause is understood, the built monitoring plan can be amended and improved accordingly. Post mortems also help in building a culture of accountability, which, as discussed earlier, is the key factor in having successful monitoring operations. 6. Get the Right Tools for Effective Monitoring Once you have established the need of maintaining healthy monitoring operations and addressed any cultural considerations, the next critical step is to equip your team members with the appropriate tools to empower them to be accountable for the model's performance in the business function it serves. This means implementing tools that enable timely alerts for issues (which is difficult due to issues typically starting small and hidden), along with capabilities for root cause analysis and troubleshooting. Integrations with your existing tools, such as ticketing systems, as well as issue tracking and management capabilities, are also essential for seamless coordination and collaboration among team members. Investing in the right tools will empower your team to take full ownership and accountability, ultimately leading to better outcomes for the business. Conclusion By following these guidelines, you can be sure that your AI team will be set up for successful production-oriented operations. Monitoring is a crucial aspect of MLOps, involving accountability, timely alerts, troubleshooting, and much more. Taking the time to set up healthy monitoring practices leads to continuous improvements.
As with back-end development, observability is becoming increasingly crucial in front-end development, especially when it comes to troubleshooting. For example, imagine a simple e-commerce application that includes a mobile app, web server, and database. If a user reports that the app is freezing while attempting to make a purchase, it can be challenging to determine the root cause of the problem. That's where OpenTelemetry comes in. This article will dive into how front-end developers can leverage OpenTelemetry to improve observability and efficiently troubleshoot issues like this one. Why Front-End Troubleshooting? Similar to back-end development, troubleshooting is a crucial aspect of front-end development. For instance, consider a straightforward e-commerce application structure that includes a mobile app, a web server, and a database. Suppose a user reported that the app is freezing while attempting to purchase a dark-themed mechanical keyboard. Without front-end tracing, we wouldn't have enough information about the problem since it could be caused by different factors such as the front-end or back-end, latency issues, etc. We can try collecting logs to get some insight, but it's challenging to correlate client-side and server-side logs. We might attempt to reproduce the issue from the mobile application, but it could be time-consuming and impossible if the client-side conditions aren't available. However, if the issue isn't reproduced, we need more information to identify the specific problem. This is where front-end tracing comes in handy because, with the aid of front-end tracing, we can stop making assumptions and instead gain clarity on the location of the issue. Front-End Troubleshooting With Distributed Tracing Tracing data is organized in spans, which represent individual operations like an HTTP request or a database query. 
By displaying spans in a tree-like structure, developers can gain a comprehensive and real-time view of their system, including the specific issue they are examining. This allows them to investigate further and identify the cause of the problem, such as bottlenecks or latency issues. Tracing can be a valuable tool for pinpointing the root cause of an issue.

Consider an example with three simple components: a front-end, a back-end, and a database. When there is an issue, the trace encompasses spans from both the front-end app and the back-end service. By reviewing the trace, it's possible to identify the data that was transmitted between the components, allowing developers to follow the path from the specific user click in the front-end to the DB query.

Rather than relying on guesswork to identify the issue, with tracing, you can have a visual representation of it. For example, you can determine whether the request was sent out from the device, whether the back-end responded, whether certain components were missing from the response, and other factors that may have caused the app to become unresponsive.

Suppose we need to determine if a delay caused a problem. In Helios, there's a functionality that displays the span's duration, so you can simply analyze the trace to pinpoint the bottleneck. In addition, each span in the trace is timestamped, allowing you to see exactly when each action took place and whether there were any delays in processing the request.

Helios comes with a span explorer that was created explicitly for this purpose. The explorer enables the sorting of spans based on their duration or timestamp. The trace visualization provides information on the time taken by each operation, which can help identify areas that require optimization. A default view available in Jaeger is also an effective method to explore all the bottlenecks by displaying a trace breakdown.
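To make the tree view and duration sorting described above more concrete, here is a small sketch that rebuilds a span tree from flat span records and ranks spans by the time spent in the span itself. It is not how Helios or Jaeger implement their views. The record shape (`id`, `parentId`, `name`, `startMs`, `endMs`) and the sample data are assumptions made for illustration, and the self-time calculation assumes that sibling spans do not overlap.

```javascript
// Hypothetical flat span records; field names are assumptions for this sketch.
const spans = [
  { id: 'a', parentId: null, name: 'click: Purchase', startMs: 0, endMs: 950 },
  { id: 'b', parentId: 'a', name: 'HTTP POST /purchase', startMs: 20, endMs: 930 },
  { id: 'c', parentId: 'b', name: 'validate user', startMs: 40, endMs: 90 },
  { id: 'd', parentId: 'b', name: 'DB query: update orders', startMs: 100, endMs: 900 },
];

// Link each span to its parent; spans whose parent is unknown become roots.
function buildTree(flat) {
  const nodes = new Map(
    flat.map((s) => [s.id, { ...s, durationMs: s.endMs - s.startMs, children: [] }])
  );
  const roots = [];
  for (const node of nodes.values()) {
    const parent = node.parentId === null ? undefined : nodes.get(node.parentId);
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}

// Self time: own duration minus the time covered by direct children.
function selfTime(node) {
  const childTime = node.children.reduce((sum, c) => sum + c.durationMs, 0);
  return node.durationMs - childTime;
}

// Depth-first walk producing one row per span, in display order.
function flatten(nodes, depth = 0, out = []) {
  for (const n of nodes) {
    out.push({ name: n.name, depth, durationMs: n.durationMs, selfMs: selfTime(n) });
    flatten(n.children, depth + 1, out);
  }
  return out;
}

const rows = flatten(buildTree(spans));
const bottleneck = [...rows].sort((x, y) => y.selfMs - x.selfMs)[0];

for (const r of rows) {
  console.log(`${'  '.repeat(r.depth)}${r.name} (${r.durationMs} ms, self ${r.selfMs} ms)`);
}
console.log('Largest self time:', bottleneck.name);
```

Sorting by self time rather than total duration is a deliberate choice here: the root span is always the longest, so total duration alone would point at the user click instead of the operation that is actually slow.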
Adding Front-End Instrumentation to Your Traces in OpenTelemetry: Advanced Use Cases

It's advised to include front-end instrumentation in your traces to enhance the ability to analyze bottlenecks. While many SDKs provided by OpenTelemetry are designed for back-end services, it's worth noting that OpenTelemetry has also developed an SDK for JavaScript. Additionally, they plan to release more client libraries in the future. Below, we will look at how to integrate these libraries.

Aggregating Traces

Aggregating multiple traces from different requests into one large trace can be useful for analyzing a flow as a whole. For instance, imagine a purchasing process that involves three REST requests, such as validating the user, billing the user, and updating the database. To see this flow as a single trace for all three requests, developers can create a custom span that encapsulates all three into one flow. This can be achieved using a code example like the one below.

    const { createCustomSpan } = require('@heliosphere/web-sdk');

    // Wrap the three requests so they are reported under one custom span.
    // `user` and `purchase` are assumed to be in scope.
    const purchaseFunction = () => {
      validateUser(user.id);
      chargeUser(user.cardToken);
      updateDB(user.id);
    };

    createCustomSpan("purchase", {'id': purchase.id}, purchaseFunction);

From now on, the trace will include all the spans generated under the validateUser, chargeUser, and updateDB categories. This will allow us to see the entire flow as a single trace rather than separate ones for each request.

Adding Span Events

Adding information about particular events can be beneficial when investigating and analyzing front-end bottlenecks. With OpenTelemetry, developers can utilize a feature called Span Event, which allows them to include a report about an event and associate it with a specific span. A Span Event is a message on a span that describes a specific event with no duration and can be identified by a single time stamp.
It can be seen as a basic log and appears in this format:

    const opentelemetry = require('@opentelemetry/api');
    // Attach a timestamped event to the span that is currently active.
    const activeSpan = opentelemetry.trace.getActiveSpan();
    activeSpan.addEvent('User clicked Purchase button');

Span Events can gather various data, such as clicks, device events, networking events, and so on.

Adding Baggage

Baggage is a useful feature provided by OpenTelemetry that allows adding contextual information to traces. This information can be propagated across all spans in a trace and can be helpful in transferring user data, such as user identification, preferences, and Stripe tokens, among other things. This feature can benefit front-end development since user data is a crucial element in this area. More information about Baggage is available in the OpenTelemetry documentation.

Deploying Front-End Instrumentation

Deploying the instrumentation added to your traces is straightforward, just like deploying any other OpenTelemetry SDK. Additionally, you can use Helios's SDK to visualize and gain more insights without setting up your own infrastructure. To do this, simply visit the Helios website, register, and follow the steps to install the SDK and add the code snippet to your application.

Where to Go From Here: Next Steps for Front-End Developers

Enabling front-end instrumentation is a simple process that unlocks a plethora of new troubleshooting capabilities for full-stack and front-end developers. It allows you to map out a transaction, starting from a UI click and leading up to a specific database query or scheduled job, providing unique insights for bottleneck identification and issue analysis. Both OpenTelemetry and Helios support front-end instrumentation, making it even more accessible for developers. Begin utilizing these tools today to enhance your development workflow.
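As a supplement to the Baggage section above, the sketch below shows roughly what travels between services when baggage is propagated over HTTP: a `baggage` header holding comma-separated, percent-encoded key/value pairs, as described by the W3C Baggage specification. In a real application the OpenTelemetry propagation API would normally handle this. The helper names `encodeBaggage` and `decodeBaggage` and the sample keys are hypothetical, and the sketch ignores optional member properties and size limits.

```javascript
// Serialize entries into a W3C-style `baggage` header value.
// Simplified: no per-member properties, no size limits.
function encodeBaggage(entries) {
  return Object.entries(entries)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join(',');
}

// Parse a `baggage` header value back into a plain object.
function decodeBaggage(headerValue) {
  const result = {};
  for (const member of headerValue.split(',')) {
    const [pair] = member.split(';'); // drop optional properties such as ";ttl=..."
    const idx = pair.indexOf('=');
    if (idx === -1) continue; // skip empty or malformed members
    const key = decodeURIComponent(pair.slice(0, idx).trim());
    const value = decodeURIComponent(pair.slice(idx + 1).trim());
    result[key] = value;
  }
  return result;
}

// Front-end side: attach user context to an outgoing request (keys are made up).
const header = encodeBaggage({ 'user.id': 'u-123', theme: 'dark mode' });

// Back-end side: read the same context from the incoming header.
const received = decodeBaggage(header);
console.log(header, received);
```

Because every downstream service can read these values, this mechanism is best suited to small pieces of context that help correlate a user action with back-end spans.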
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability, but lack the knowledge to just dive right in? This workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting tool kit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is, what it is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. Previously, I shared an introduction to Prometheus in a lab that kicked off this workshop. In this article, you'll be installing Prometheus from either a pre-built binary from the project or using a container image. I'm going to get you started on your learning path with this first lab that provides a quick introduction to all things needed for metrics monitoring with Prometheus. Note this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself: The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is fairly simple: This lab guides you through installing Prometheus on your local machine, configuring, and running it to start gathering metrics. You are confronted right from the start with two possible paths to installing the Prometheus tooling locally on your machine: using a pre-compiled binary for your machine's architecture, or using a container image. 
Installing Binaries

The first path you can take to install Prometheus on your local machine is to obtain the right version of the pre-compiled binaries for your machine architecture. I've provided the links to directly obtain Mac OSX, Linux, and Windows binaries. The installation is straightforward. You'll learn what a basic configuration looks like while creating your own to get started with scraping your first metrics from the Prometheus server itself.

Once it's up and running, you'll explore the basic information available to you through the Prometheus status pages, a web console. You'll explore how to verify that your configured scraping target is up and running, then go and break your configuration to see what a broken target looks like on the web console status page.

Next, you'll browse the available configuration flags for running your Prometheus server, look at the time series database status, explore your active configuration, and finish up by playing with some yet-to-be-explained query expressions in the provided tooling. That last exercise is more extensive than just pasting in queries; you'll learn about built-in validation mechanisms and explore the graphing visualization offered out of the box.

This lab completes with you having an installed binary package for your machine's architecture, a running Prometheus with a basic configuration, and an understanding of the available tooling in the provided web console.

Installing Container Image

The second path you can take is to install Prometheus using a container image. This lab path is provided using an Open Container Initiative (OCI) standards-compliant tool known as Podman. The default requirement will be to use Podman Desktop, a graphical tool that also includes the command line tooling referred to in the rest of this lab. I've chosen to avoid the more complex issues of mounting a volume for your local configuration file to be made available to your running Prometheus container image.
Instead, I am choosing to walk you through a few short steps to building your own local container image with your workshop configuration file. Once all of this is done, you are up and running with your Prometheus server just like in the previous section. The rest of this path covers the same ground as the previous section, where you explore all the basic information available to you through the Prometheus status pages in its web console.

Missed Previous Labs?

This is one lab in the more extensive free online workshop. Feel free to start from the very beginning of this workshop if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Prometheus to pick up where you left off.

Coming Up Next

I'll be taking you through the following lab in this workshop where you'll start learning about the Prometheus Query Language and how to gain insights into your collected metrics. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
The Current State of AWS Log Management Security professionals have used log data to detect cyber threats for many years. It was in the late 1990s when organizations first started to use Syslog data to detect attacks by identifying and tracking malicious activity. Security teams rely on log data to detect threats because it provides a wealth of information about what is happening on their networks and systems. By analyzing this data, they can identify patterns that may indicate an attack is taking place. Migration to the cloud has complicated how security teams use log data to protect their networks and systems. The cloud introduces new complexities into the environment, as well as new attack vectors. A cloud-centric infrastructure changes how data is accessed and stored, impacting how security teams collect and analyze log data. Finally, the cloud makes it more difficult to correlate log data with other data sources, limiting the effectiveness of security analysis. Today, security teams have hundreds of AWS-specific tools and services available to consider and potentially implement. Once an organization has chosen a set of services, the logs produced by those same services can be extensive—and the challenges associated with ingesting and normalizing cloud log data can tax the abilities of even experienced security professionals. Security teams must adapt their cloud log management approach to overcome these challenges. First, it can be difficult to redirect or copy logs out of AWS into an external log management solution. According to Panther's recent State of AWS Log Management survey and report, 48.8% of security practitioners find it challenging to do so. Additionally, each AWS environment produces unique data that can come from a variety of sources. This data can often be staggering in size and complexity. 
While the data coming from AWS is complicated enough, it is often siloed in the AWS environment, too — unlinked and uncorrelated with the rest of an organization's data. AWS customers often find their security teams overwhelmed with the amount of data they need to process in order to detect threats effectively. This data is spread across various AWS services, and teams have little guidance on implementing an effective and sustainable threat detection strategy. As a result, security teams can struggle to identify and respond to threats promptly. Last year a Google Cloud Blog post stated, "Developing cloud-based data ingestion pipelines that replicate data from various sources into your cloud data warehouse can be a massive undertaking that requires significant investment of staffing resources." This means that most organizations need an easy way to cost-effectively centralize organized AWS logs into a system that has visibility across the rest of their environment. They need a solution that will scale alongside a growing AWS footprint and perform quickly across massive amounts of log data. Why Continuous Monitoring Is Critical Organizations must monitor AWS log data to ensure their infrastructure runs securely and protects sensitive information. This is because the infrastructure that runs an organization's application or software may be on AWS and can reveal sensitive information, such as customer credit card data. And in the case of health technology companies, health records, and history are stored in AWS. Security teams must also continuously monitor their AWS log data in order to detect threats and prevent damage to their networks and systems. By identifying and analyzing patterns in the data, they can identify malicious activity before it causes damage. 
In addition to quickly identifying and responding to threats, continuous monitoring enables security teams to correlate AWS log data with other data sources for a complete view of an organization's security posture. The right log management solution will offer features specifically designed to address the challenges associated with AWS log data. It will also help teams ingest, normalize, and search their AWS logs quickly and effectively. Conclusion AWS has increasingly become the go-to provider for cloud infrastructure in the past decade, with more and more companies placing their crown jewels in its hands. This includes most of their regular IT operations, as the cloud provider has become a staple of modern business. Modern organizations need a cloud security platform that offers a log management solution specifically designed for AWS environments. They need a solution that can support a wide range of AWS data sources with the ability to quickly and effectively ingest and normalize large volumes of data.
As more and more companies move to the cloud, it’s becoming essential to keep track of their resource usage to ensure cost-effectiveness. Amazon Web Services (AWS) is a leading platform among cloud providers, but its extensive range of services can pose a challenge when monitoring resource consumption efficiently. This article delves into the significance of tracking AWS resource utilization for cost optimization and offers practical tips on accomplishing this. What Is AWS Resource Utilization? As an AWS professional, it’s essential to understand the concept of AWS resource utilization. Essentially, it refers to the computing resources that your website or application consumes on the AWS platform. These resources may include CPU, memory, disk I/O, and network usage, among others. Fortunately, AWS offers several tools you can utilize to monitor your resource utilization. These tools include Amazon CloudWatch, AWS Trusted Advisor, and AWS Cost Explorer. By leveraging these services, you can keep track of your resource consumption and optimize your AWS usage for maximum efficiency. Why Monitoring AWS Resource Utilization Is Critical for Cost Optimization Identify Underutilized Resources Maximizing resource efficiency is crucial to keeping AWS costs under control. Oversized EC2 instances can result in unnecessary expenses for compute resources, whereas storing data in an S3 bucket that isn’t required can lead to expenses for unused storage. It’s important to optimize resource usage to avoid paying more than necessary. Apart from directly impacting costs, inefficient resource utilization can also harm performance. For instance, using an RDS database that is not appropriately sized for your requirements can result in sluggish query response times or even downtime during peak traffic periods. Identifying Opportunities for Optimization Monitoring resource usage can aid in identifying optimization opportunities that can lead to cost reduction and enhanced performance. 
One potential optimization method is reducing the size of an EC2 instance or combining multiple instances into one, which can reduce costs. Additionally, utilizing S3 lifecycle policies to relocate data to lower-cost storage tiers as it becomes less frequently accessed can be another cost-saving option. It’s essential to note that optimization is an ongoing process. As your usage habits evolve, your resource consumption will also fluctuate. Consistent monitoring and optimization practices can guarantee your resource usage remains efficient at all times. Identify Overutilized Resources Keeping track of your resource usage can aid in detecting excessively utilized resources. Take, for instance, an RDS database that persistently operates at maximum CPU utilization. In such a scenario, it might be necessary to upgrade to a larger instance size to maintain the seamless functioning of your application and forestall any probable periods of system unavailability. Forecast Future Resource Needs Keeping a tab on resource utilization can also aid in predicting future resource requirements. By comprehending the rate at which your resource usage increases, you can forecast when you will have to scale up or down your resources. This proactive approach can prevent the risk of running out of resources when you require them the most, as well as steer clear of over-provisioning and paying excessively for resources that you do not essentially need. Leveraging Third-Party Tools There are many third-party tools available for monitoring AWS resource utilization. These tools can provide additional insights and analytics and automate AWS cloud cost optimization workflows. Some popular third-party tools include CloudCheckr, CloudHealth, and ParkMyCloud. As you choose a third-party tool, it’s paramount to consider your unique needs and financial limitations. While some tools focus exclusively on cost reduction, others provide a broader range of capabilities. 
Additionally, pricing models may vary depending on the number of resources you use or the level of functionality you require. Therefore, conducting a thorough assessment of your needs before selecting a tool is crucial to guarantee it satisfies your precise requirements while fitting within your budget. Efficient monitoring of AWS resource utilization is crucial for cost optimization and optimal efficiency. You can leverage native AWS tools, such as Cost Explorer, Trusted Advisor, and CloudWatch, along with third-party options, to obtain valuable insights into your resource utilization. Acting upon these insights can enable you to optimize costs and enhance overall performance. Tips for Monitoring AWS Resource Utilization To optimize the utilization of your AWS resources while keeping your expenses low, implementing a set of best practices for resource monitoring is crucial. To help you achieve this, below are some practical suggestions for effective resource utilization monitoring: Use AWS Cost Explorer As an AWS user, keeping track of your costs and usage can be daunting. However, AWS Cost Explorer is a robust solution to alleviate this concern. Utilizing this tool lets you delve into the intricacies of your AWS spending patterns and usage statistics. The feature-rich Cost Explorer provides detailed reports on various resource utilization, such as EC2 instances, S3 buckets, and RDS databases. It allows you to create personalized reports that cater to your specific needs. With cost alerts, you can proactively monitor your expenses and receive timely notifications to ensure you stay within your budget. Set Up AWS Trusted Advisor AWS Trusted Advisor could be an excellent choice if you are looking for a reliable resource monitoring tool. This tool offers real-time guidance and suggestions on enhancing cost optimization, security, and performance. 
By analyzing your resource usage, Trusted Advisor can help you identify opportunities to reduce costs and recommend AWS best practices accordingly. This feature-packed tool could be a valuable addition to your AWS toolkit. Use AWS CloudWatch CloudWatch, a monitoring and logging service offered by AWS, enables real-time monitoring of AWS resources. It allows monitoring of key metrics like CPU utilization and network traffic of EC2 instances, RDS databases, and other resources. Additionally, you can configure alarms to alert you when metrics cross set thresholds. This way, you can proactively address any issues and ensure the optimal performance of your AWS resources. Set Up Alerts One effective way to promptly detect and address issues in your system is by configuring alerts in CloudWatch. For instance, you can establish an alarm notifying you when your CPU usage surpasses a threshold. By doing this, you can take necessary measures to prevent any application downtime from occurring. Conclusion Efficiently managing AWS resource utilization is pivotal for achieving cost optimization goals. It involves analyzing which resources are most frequently utilized, identifying areas for optimization, and predicting future resource requirements, enabling informed decision-making regarding resource allocation and cost optimization. To obtain valuable insights into resource utilization and enhance overall efficiency, AWS-native tools, such as Cost Explorer, Trusted Advisor, and CloudWatch, as well as third-party tools, can be utilized. Leveraging these tools enables you to take necessary actions for optimizing costs and improving the overall efficiency of your AWS resources.
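To make the ideas of underutilized resources, overutilized resources, and threshold alerts more concrete, here is a minimal Python sketch. It does not call any AWS API. It assumes you have already exported a list of CPU utilization percentages (for example, from CloudWatch), and the function name `assess_cpu`, the 10% and 80% thresholds, and the three-datapoint evaluation window are all illustrative assumptions rather than AWS defaults.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class UtilizationReport:
    average: float  # mean CPU utilization across all samples, in percent
    status: str     # "underutilized", "overutilized", or "ok"
    alarm: bool     # True when the most recent datapoints all breach `high`


def assess_cpu(samples, low=10.0, high=80.0, evaluation_periods=3):
    """Classify CPU samples (percentages) and evaluate a simple alarm.

    The thresholds and window size are hypothetical; tune them to your workload.
    """
    if not samples:
        raise ValueError("samples must not be empty")

    average = mean(samples)
    if average < low:
        status = "underutilized"   # candidate for a smaller instance
    elif average > high:
        status = "overutilized"    # candidate for a larger instance
    else:
        status = "ok"

    # Alarm only when the last `evaluation_periods` datapoints are all above
    # the high threshold, so a single spike does not trigger a notification.
    recent = samples[-evaluation_periods:]
    alarm = len(recent) == evaluation_periods and all(s > high for s in recent)

    return UtilizationReport(average, status, alarm)


if __name__ == "__main__":
    print(assess_cpu([5, 4, 6, 3]))       # mostly idle instance
    print(assess_cpu([50, 85, 90, 95]))   # average looks fine, recent trend does not
```

The second call shows why an average alone can mislead: the long-run mean stays within bounds while the most recent datapoints all exceed the threshold, which is the situation an alert is meant to catch.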
Monitoring data stream applications is a critical component of enterprise operations, as it allows organizations to ensure that their applications are functioning optimally and delivering value to their customers. In this article, we will discuss in detail the importance of monitoring data stream applications and why it is critical for enterprises. Data stream applications are those that handle large volumes of data in real time, such as those used in financial trading, social media analytics, or IoT (Internet of Things) devices. These applications are critical to the success of many businesses, as they allow organizations to make quick decisions based on real-time data. However, these applications can be complex, and any issues or downtime can have significant consequences. By monitoring data stream applications, enterprises can proactively identify and address issues before they impact the business. This includes identifying performance issues, detecting errors and anomalies, and ensuring that the application is meeting its service level agreements (SLAs). Monitoring also allows organizations to track key metrics, such as data throughput, latency, and error rates, and to make adjustments to optimize the application's performance. Reference data stream system: Unlocking the Potential of IoT Applications. In addition to these benefits, monitoring data stream applications is critical for ensuring regulatory compliance. Many industries, such as finance and healthcare, have strict regulations governing data privacy and security. By monitoring these applications, organizations can ensure that they are meeting these regulatory requirements and avoid costly fines and legal penalties. Another key benefit of monitoring data stream applications is that it allows organizations to optimize their infrastructure and resource usage. 
By monitoring resource utilization, enterprises can identify areas of inefficiency, such as overprovisioned resources or bottlenecks, and make adjustments to improve performance and reduce costs. Prometheus: Prometheus is an open-source monitoring system that is designed for collecting and querying time-series data. It can be used to monitor metrics from a variety of sources, including data stream applications. Prometheus provides a range of tools for data visualization and alerting and integrates with a variety of popular tools and platforms. Splunk: Splunk is a popular data analytics and monitoring platform that can be used to monitor data stream applications. It provides real-time monitoring and alerting and can be used to track metrics such as data volume, latency, and error rates. Splunk also includes a range of machine learning and data analysis tools that can be used to identify anomalies and optimize performance. Amazon CloudWatch: Amazon CloudWatch is a monitoring and management service offered by Amazon Web Services (AWS). It can be used to monitor a variety of AWS resources, including data stream applications running on AWS. CloudWatch provides a range of metrics, logs, and alerts and can be integrated with other AWS tools, such as AWS Lambda. If your data streams run on AWS, CloudWatch is the best option. Datadog: Datadog is a cloud-based monitoring and analytics platform that can be used to monitor data stream applications. It provides real-time monitoring and alerting and can be used to track a wide range of metrics, including data volume, latency, and error rates. Datadog also includes a range of visualization and collaboration tools that can be used to improve communication and collaboration across teams. Finally, monitoring data stream applications is critical for maintaining customer satisfaction. In today's fast-paced, digital world, customers expect instant responses and seamless experiences. 
Any issues or downtime can have a significant impact on customer satisfaction and brand reputation. By proactively monitoring these applications, organizations can ensure that their customers are receiving the expected level of service and address any issues quickly and efficiently. In conclusion, monitoring data stream applications is critical for enterprise success. It allows organizations to proactively identify and address issues, ensure regulatory compliance, optimize resource utilization, and maintain customer satisfaction. By investing in monitoring tools and processes, enterprises can ensure that their applications are delivering value to their customers and stay ahead of the competition in today's fast-paced digital landscape.
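The key metrics mentioned in this article (data throughput, latency, and error rates) can be computed from raw event records before they are handed to any of the tools above. The following Python sketch shows one way to do that for a fixed time window. The `StreamEvent` type, the `summarize` function, and the choice of a nearest-rank 95th percentile are assumptions made for illustration, not the behavior of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class StreamEvent:
    latency_ms: float  # time taken to process the event, in milliseconds
    ok: bool           # False if processing the event failed


def summarize(events, window_seconds):
    """Summarize one observation window of stream events.

    Returns throughput (events per second), the nearest-rank 95th percentile
    latency, and the fraction of events that failed.
    """
    if not events:
        raise ValueError("events must not be empty")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    count = len(events)
    latencies = sorted(e.latency_ms for e in events)

    # Nearest-rank percentile using integer arithmetic: ceil(0.95 * count).
    rank = max(1, (95 * count + 99) // 100)
    p95 = latencies[rank - 1]

    failures = sum(1 for e in events if not e.ok)

    return {
        "throughput_per_s": count / window_seconds,
        "p95_latency_ms": p95,
        "error_rate": failures / count,
    }


if __name__ == "__main__":
    # Twenty events observed over ten seconds, one of which failed.
    sample = [StreamEvent(latency_ms=float(i), ok=(i != 7)) for i in range(1, 21)]
    print(summarize(sample, window_seconds=10))
```

A summary like this can then be compared against SLA targets, for example by alerting when the error rate or the 95th percentile latency exceeds an agreed limit.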
When organizations move toward the cloud, their systems also lean toward distributed architectures. One of the most common examples is the adoption of microservices. However, this also creates new challenges when it comes to observability. You need to find the right tools to monitor, track and trace these systems by analyzing outputs through metrics, logs, and traces. It enables teams to quickly pinpoint the root cause of issues, fix them and optimize the application performance, giving them the confidence to deliver code faster. So, this article looks at the features, limitations, and important selling points of eleven popular observability tools to help you select the best one for your project. Helios Helios is a developer-observability solution that provides actionable insight into the end-to-end application flow. It incorporates OpenTelemetry's context propagation framework and provides visibility across microservices, serverless functions, databases, and 3rd party APIs. You can check out their sandbox or use it for free by signing up here. Key Features Provide a complete overview: Helios provides distributed tracing information in full context, showing how data flows through your entire application in any environment. Visualization: Enables users to collect and visualize trace data from multiple data sources to drill down and troubleshoot potential issues. Multi-language support: Supports multiple languages and frameworks, including Python, JavaScript, Node.js, Java, Ruby, .NET, Go, C++, and Collector. Share and reuse: You can easily collaborate with team members by sharing traces, tests, and triggers through Helios. In addition, Helios allows reusing requests, queries, and payloads with team members. Automatic test generation: Automatically generate tests based on trace data. Easy integrations: Integrates with your existing ecosystem, including logs, tests, error monitoring, and more. 
Workflow reproduction: Helios allows you to reproduce an exact workflow, including HTTP requests, Kafka and RabbitMQ messages, and Lambda invocations, in just a few clicks. Popular Use Cases Distributed tracing Multi-language application trace integration Serverless application observability Test troubleshooting API call automation Bottleneck analysis Prometheus Prometheus is an open-source tool broadly used to enable observability in cloud-native environments. It can collect and store time-series data and provides visualization tools to analyze and visualize the data collected. Key Features Data Collection: It can scrape metrics from various sources, including applications, services, and systems. It is focused on metrics, which targets expose over HTTP in the Prometheus text format; logs and traces require separate tools. Data Storage: It stores the data collected in a time-series database, allowing efficient querying and aggregating of data over time. Alerting: Includes a built-in alerting system that can trigger alerts based on queries. Service Discovery: It can automatically detect and scrape metrics from services running in multiple environments, such as Kubernetes and other container orchestration systems. Grafana Integration: The tool has flexible integrations with Grafana, allowing it to create dashboards to display and analyze Prometheus metrics. Limitations Limited root cause analysis capabilities: The tool is primarily designed for monitoring and alerting. Therefore, it does not provide built-in root cause analysis. Scaling: Although the tool can handle many metrics, it can become resource intensive, since Prometheus keeps recent data in memory and runs as a single node, so memory use grows with the number of active time series. Data modeling: Contains a key-value pair-based data model and does not support nested fields and joins. 
Popular Use Cases Metrics collection and storage Alerting Service Discovery Grafana Grafana is an open-source tool predominantly used for data visualization and monitoring. 
It allows users to easily create and share interactive dashboards to visualize and analyze data from various sources. Key Features Data visualization: Creates customizable and interactive dashboards to visualize metrics and logs from various data sources. Alerting: Allows users to set up alerts based on the state of their metrics to indicate potential issues. Anomaly detection: Allows users to set up anomaly detection to automatically detect and alert based on abnormal behavior in their metrics. Root cause analysis: Allows users to drill down into the metrics to analyze the root cause by providing detailed information with historical context. Limitations Data storage: Its design does not support long-term storage and requires additional tools such as Prometheus or Elasticsearch to store metrics and logs. Data modeling: Grafana does not provide advanced data modeling capabilities. Hence, it is difficult to model specific data types and perform complicated queries. Data aggregation: Grafana does not include built-in data aggregation capabilities. Popular Use Cases Metrics visualization Alerting Anomaly detection Elasticsearch, Logstash, and Kibana (ELK) The ELK stack is a popular open-source solution that helps to manage logs and analyze data. It comprises three components: Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed search and analytics engine that can handle large volumes of structured and unstructured data, enabling users to store, index, and search large amounts of data. Logstash is a data collection and processing pipeline that allows users to collect, process, and enrich data from numerous sources, such as log files. Kibana is a data visualization and exploration tool that enables users to create interactive dashboards and visualizations based on the data within Elasticsearch. 
Key Features Log management: ELK allows users to collect, process, store and analyze log data and metrics from multiple sources while providing a centralized console to search through the logs. Search and analysis: Allows users to search and analyze relevant log data crucial in resolving and drilling down the root cause of issues. Data visualization: Kibana allows users to create customizable dashboards which can visualize log data and metrics from multiple data sources. Anomaly detection: Kibana allows the creation of alerts for abnormal activity within the log data. Root cause analysis: ELK stack allows users to drill down into the log data to better understand the root causes by providing detailed logs and historical context. Limitations Tracing: ELK does not natively support distributed tracing. Therefore, users may need to use additional tools such as Jaeger. Real-time monitoring: The design of ELK allows it to perform well as a log management and data analysis platform. But, there is a slight delay in the log reporting, and users will experience minor latencies. Complicated setup and maintenance: The platform involves a complex setup and maintenance process. Also, it requires specific knowledge to manage large amounts of data and numerous data sources. Popular Use Cases Log management Data visualization Compliance and security InfluxDB and Telegraf InfluxDB and Telegraf are open-source tools that are popular for their time-series data storage and monitoring capabilities. InfluxDB is a time-series database that stores and queries large amounts of time-series data using its SQL-like query language. On the other hand, Telegraf is a well-known data collection agent that can collect and send metrics and events to a wide range of receivers, such as InfluxDB. It also supports many data sources. Key Features The combination of InfluxDB and Telegraf brings in many features that benefit applications' observability. 
Metrics collection and storage: Telegraf allows users to collect metrics from many sources and sends them to InfluxDB for storage and analysis. Data visualization: InfluxDB can be integrated with third-party visualization tools such as Grafana to create interactive dashboards. Scalability: InfluxDB's design allows it to handle large amounts of time-series data and scale horizontally. Multiple data source support: Telegraf supports over 200 input plugins to collect metrics. Limitations Limited alerting capabilities: Both tools lack alerting capabilities and require a third-party integration to provide alerting. Limited root cause analysis: These tools lack native root cause analysis capabilities and require third-party integrations. Popular Use Cases Metrics collection and storage Monitoring Datadog Datadog is a popular cloud-based monitoring and analytics platform. It is widely used to get insights into the health and performance of distributed systems to troubleshoot issues beforehand. Key Features Multi-cloud support: Users can monitor applications running on multi-vendor cloud platforms such as AWS, Azure, GCP, etc. Service maps: Allows visualization of service dependencies, locations, services, and containers. Trace Analytics: Users can analyze traces while providing detailed information about application performance. Root cause analysis: Allows users to drill down into the metrics and traces to understand the root cause of the issues by providing detailed information with historical context. Anomaly detection: Can set up anomaly detection that can automatically detect and alert on abnormal behavior in metrics. Limitations Cost: Datadog is a cloud-based paid service, and charges are known to increase with large-scale deployments. Limited log ingestion, retention, and indexing support: Datadog does not provide log analysis support by default. You have to purchase log ingestion and indexing support for that separately. 
Hence, most organizations decide only to keep a limited number of logs retained, which can cause issues in troubleshooting since you can't access the complete history of the issue. Lack of control over data storage: Datadog stores data on its own servers and doesn't allow users to store data locally or in their own data centers. Popular Use Cases Observability pipelines Distributed tracing Container monitoring New Relic New Relic is a cloud-based monitoring and analytics platform that allows users to monitor applications and systems within a distributed environment. It uses the "New Relic Edge" service for distributed tracing and can observe 100% of an application's traces. Key Features Application performance monitoring: Provides a comprehensive APM solution to monitor and troubleshoot application performance. Multi-cloud support: Supports monitoring applications on multiple cloud platforms such as AWS, Azure, GCP, and more. Trace analytics: Enables users to analyze traces while providing detailed information about system and application performance. Root cause analysis: Allows users to drill down into the metrics and traces to analyze the root cause of issues. Log management: Collect, process, and analyze log data from various sources, providing a holistic view of the logs. Limitations Limited open-source integration: New Relic is a closed-source platform, and its integration with other open-source tools may be limited. Cost: New Relic can be costly compared to other solutions when working with large-scale deployments. Popular Use Cases Application performance monitoring Multi-cloud monitoring Trace analytics AppDynamics AppDynamics is a monitoring and analytics platform that allows you to observe, visualize, and manage each component of your application. In addition, it provides root cause analysis to identify underlying issues that may impact the application's performance. 
Key Features Data collection: Users can collect metrics and traces from numerous sources such as hosts, containers, cloud services, and applications. Anomaly detection: Enables users to set up anomaly detection, which can detect and alert on abnormal behavior. Trace Analytics: Users can analyze traces and provide detailed performance information. Application performance monitoring: Provides a comprehensive APM solution that allows users to monitor and troubleshoot the application's performance. Limitations Limited open-source integration: The vendor maintains the tool. Therefore, there may be limited open-source integrations. Limited customization: Customization options are not flexible compared to other tools since the users can not customize the solution themselves. Popular Use Cases Application performance monitoring Multi-cloud monitoring Business transaction management Selecting the Best Observability Tool Observability is an integral part of modern software development and operations. It helps organizations monitor the health and performance of their system and quickly solve problems before they become critical. This article discussed the 11 best observability tools developers should know when working with distributed systems. As you can see, each tool has its features and limitations. Therefore, evaluating them against your requirements is important to find the right fit for your organization. The best observability tool for your organization will depend on your specific needs, such as your environments, tech stack, developer experience, user profiles, monitoring and troubleshooting requirements, and workflow. I hope you have found this helpful. Thank you for reading!
I had the opportunity to catch up with Andi Grabner, DevOps Activist at Dynatrace, during day two of Dynatrace Perform. I've known Andi for seven years, and he's one of the people who has helped me understand DevOps since I began writing for DZone. We covered several topics that I'll share in a series of articles. Do Developers Want to Expand Beyond Just Coding? There will always be developers that just want to code and do what they're told. They're great at coding and that's perfect. But I think everyone who creates something, including developers, is a creative engineer. I think it's in every human's interest to see the impact they have with what they create. The impact can only be seen if that piece of code gets into the hands of the beneficiary. It could be an end user, it could be a third party that is calling an API. In order to know if the beneficiary is actually getting the value out of the code, you need observability. I think we have the obligation to actually educate engineers to think about how they can create something that makes a positive impact on society. How can you get insights on your code in a fast feedback loop? In the end, if I'm a developer, and I just write code, I never know if what I'm creating actually has any impact. This is a really boring life. When I spoke to Kelsey Hightower last year, he told me a story about when he was working for a company. They were managing SNAP payments for grocery stores. If this system goes down and the family's SNAP card is declined, people do not eat. Developers and engineers need to know when something bad is happening. That should be the main motivation. Put observability in to figure out if the stuff they are building has the desired impact. In this case, the desired impact is that everybody can purchase food when they need it. If I'm an artist, I want to know if anybody likes my painting. 
I would probably look into the museum to see if people actually stop by my painting or not. It's the same thing. I want to know if my code reaches the right people, and if it has the desired effect, because if not, if nobody looks at it, maybe I'm just wasting my time. We need to educate developers, because over the last 15 to 20 years, we educated them to evolve from just coding to test-driven development. I think now it's about observability-driven development. So whatever you do, whatever you build, you need to have observability in mind, because if you cannot observe the impact that you have with the software, then you're just flying blind.
Monitoring is a small aspect of our operational needs; configuring, monitoring, and checking the configuration of tools such as Fluentd and Fluentbit can be a bit frustrating, particularly if we want to validate more advanced configuration that does more than simply lift log files and dump the content into a solution such as OpenSearch. Fluentd and Fluentbit provide us with some very powerful features that can make a real difference operationally. For example, the ability to identify specific log messages and send them to a notification service rather than waiting for the next log analysis cycle to be run by a log store like Splunk. If we want to test the configuration, we need to play log events in as if the system was really running, which means realistic logs at the right speed so we can make sure that our configuration prevents alert or mail storms. The easiest way to do this is to either take a real log and copy the events into a new log file at the speed they occurred or create synthetic events and play them in at a realistic pace. This is what the open-source LogGenerator (aka LogSimulator) does. I created the LogGenerator a couple of years ago, having addressed the same challenges before and wanting something that would help demo Fluentd configurations for a book (Logging in Action with Fluentd, Kubernetes, and more). Why not simply copy the log file for the logging mechanism to read? There are several reasons for this. For example, if your logging framework can send the logs over the network without creating back pressure, then logs can be generated without being impacted by storage performance considerations. But there is nothing tangible to copy. If you want to simulate log events from a database in your monitoring environment, then this becomes even harder, as the DB will store the logs internally. The other reason for this is that if you have alerting controls based on thresholds over time, you need the logs to be consumed at the correct pace. 
Just allowing logs to be ingested whole is not going to correctly exercise such time-based controls. Since then, I've seen similar needs to pump test events into other solutions, including OCI Queue and other Oracle Cloud services. The OCI service support has been implemented using a simple extensibility framework, so while I've focused on OCI, the same mechanism could be applied as easily to AWS' SQS, for example. A good practice for log handling is to treat each log entry as an event and think of log event handling as a specialized application of stream analytics. Given that the most common approach to streaming and stream analytics these days is based on Kafka, we're working on an adaptor for the LogSimulator that can send the events to a Kafka API point. We built the LogGenerator so it can be run as a script, so modifying it and extending its behavior is quick and easy. We started out developing in Groovy on top of Java 8, and if you want to create a Jar file, it will compile as Java. More recently, particularly with the extensions we've been working on, we have used Java 11 and its ability to run single-file classes directly from source. We've got plans to enhance the LogGenerator so we can inject OpenTelemetry events into Fluentbit and other services. But we'd love to hear about other use cases you see for this. For more on the utility: Read the posts on my blog See the documentation on GitHub
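To illustrate the idea of replaying log events at the pace they originally occurred, here is a minimal Python sketch. It is not the LogGenerator's implementation (which is written in Groovy and Java) and does not reflect its configuration. The `timestamp | message` line layout, the timestamp format, and the function names `replay_schedule` and `replay` are all assumptions chosen for the example.

```python
import time
from datetime import datetime

# Assumed layout of each source log line: "<timestamp> | <message>"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_line(line):
    """Split a log line into its timestamp and message parts."""
    stamp, _, message = line.partition(" | ")
    return datetime.strptime(stamp, TIMESTAMP_FORMAT), message


def replay_schedule(lines, speedup=1.0):
    """Yield (delay_seconds, message) pairs that preserve the original gaps.

    A speedup of 2.0 replays the log twice as fast as it was recorded.
    """
    previous = None
    for line in lines:
        stamp, message = parse_line(line)
        if previous is None:
            delay = 0.0
        else:
            # Never wait a negative time if source events are out of order.
            delay = max(0.0, (stamp - previous).total_seconds() / speedup)
        previous = stamp
        yield delay, message


def replay(lines, emit=print, speedup=1.0, sleep=time.sleep):
    """Emit each message after waiting for the gap seen in the source log."""
    for delay, message in replay_schedule(lines, speedup):
        sleep(delay)
        emit(message)


if __name__ == "__main__":
    source = [
        "2023-01-01 10:00:00 | service started",
        "2023-01-01 10:00:02 | disk usage warning",
        "2023-01-01 10:00:05 | disk full",
    ]
    # Replay ten times faster than real time; `emit` could instead append to
    # a file that Fluentd or Fluentbit is tailing.
    replay(source, speedup=10.0)
```

Separating the schedule from the sleeping makes the timing logic easy to check without waiting, and lets the same schedule drive different outputs, such as a file, a network socket, or a queue.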
Joana Carvalho
Performance Engineer,
Postman
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep