Site Reliability Engineering (SRE) is a systematic, data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles from software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. This article discusses the key elements of SRE, including reliability goals and objectives, reliability testing, workload modeling, chaos engineering, and infrastructure readiness testing, as well as the role SRE plays in improving user experience, system efficiency, scalability, and reliability, and in achieving better business outcomes. SRE is an emerging field that seeks to address the challenge of delivering high-quality, highly available systems. It is a proactive and systematic approach to reliability optimization characterized by the use of data-driven models, continuous monitoring, and a focus on continuous improvement. SRE blends the principles of DevOps with a focus on reliability: its goal is to automate repetitive tasks and to prioritize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The benefits of adopting SRE include increased reliability, faster resolution of incidents, reduced mean time to recovery, improved efficiency through automation, and increased collaboration between development and operations teams. In addition, organizations that adopt SRE principles can improve their overall system performance, increase the speed of innovation, and better meet the needs of their customers.

SRE 5 Whys

1. Why Is SRE Important for Organizations?
SRE is important for organizations because it ensures high availability, performance, and scalability of complex systems, leading to improved user experience and better business outcomes.

2. Why Is SRE Necessary in Today's Technology Landscape?

SRE is necessary in today's technology landscape because systems and infrastructure have become increasingly complex and prone to failures, and organizations need a reliable and efficient approach to manage these systems.

3. Why Does SRE Involve Combining Software Engineering and Systems Administration?

SRE involves combining software engineering and systems administration because both disciplines bring unique skills and expertise to the table. Software engineers have a deep understanding of how to design and build scalable and reliable systems, while systems administrators have a deep understanding of how to operate and manage these systems in production.

4. Why Is Infrastructure Readiness Testing a Critical Component of SRE?

Infrastructure readiness testing is a critical component of SRE because it ensures that the infrastructure is prepared to support the desired system reliability goals. By testing the capacity and resilience of infrastructure before it is put into production, organizations can avoid critical failures and improve overall system performance.

5. Why Is Chaos Engineering an Important Aspect of SRE?

Chaos engineering is an important aspect of SRE because it tests the system's ability to handle and recover from failures in real-world conditions. By proactively identifying and fixing weaknesses, organizations can improve the resilience and reliability of their systems, reducing downtime and increasing confidence in their ability to respond to failures.

Key Elements of SRE

Reliability Metrics, Goals, and Objectives: Defining the desired reliability characteristics of the system and setting reliability targets.
Reliability Testing: Using reliability testing techniques to measure and evaluate system reliability, including disaster recovery testing, availability testing, and fault tolerance testing.
Workload Modeling: Creating mathematical models to represent system reliability, including Little's Law and capacity planning.
Chaos Engineering: Intentionally introducing controlled failures and disruptions into production systems to test their ability to recover and maintain reliability.
Infrastructure Readiness Testing: Evaluating the readiness of an infrastructure to support the desired reliability goals of a system.

Reliability Metrics In SRE

Reliability metrics are used in SRE to measure the quality and stability of systems, as well as to guide continuous improvement efforts.

Availability: This metric measures the proportion of time a system is available and functioning correctly. It is often expressed as a percentage and calculated as the total uptime divided by the total time the system is expected to be running.
Response Time: This measures the time it takes for the infrastructure to respond to a user request.
Throughput: This measures the number of requests that can be processed in a given time period.
Resource Utilization: This measures the utilization of the infrastructure's resources, such as CPU, memory, network, heap, caching, and storage.
Error Rate: This measures the number of errors or failures that occur during the testing process.
Mean Time to Recovery (MTTR): This metric measures the average time it takes to recover from a system failure or disruption, which provides insight into how quickly the system can be restored after a failure occurs.
Mean Time Between Failures (MTBF): This metric measures the average time between failures for a system. MTBF helps organizations understand how reliable a system is over time and can inform decision-making about when to perform maintenance or upgrades.
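As a rough illustration of how these metrics relate, here is a minimal sketch that derives availability, MTTR, and MTBF from a list of outage windows. The incident data and function name are hypothetical, invented purely for this example:

```python
from datetime import datetime, timedelta

def reliability_metrics(period: timedelta, incidents: list[tuple[datetime, datetime]]):
    """Derive availability, MTTR, and MTBF from (start, end) outage windows."""
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = period - downtime
    availability = uptime / period            # fraction of time the system was up
    mttr = downtime / len(incidents)          # mean time to recovery
    mtbf = uptime / len(incidents)            # mean operational time between failures
    return availability, mttr, mtbf

# Illustrative data: two outages (30 min and 90 min) in a 30-day window.
period = timedelta(days=30)
incidents = [
    (datetime(2023, 3, 5, 10, 0), datetime(2023, 3, 5, 10, 30)),
    (datetime(2023, 3, 18, 2, 0), datetime(2023, 3, 18, 3, 30)),
]
availability, mttr, mtbf = reliability_metrics(period, incidents)
print(f"Availability: {availability:.4%}")  # 99.7222%
print(f"MTTR: {mttr}, MTBF: {mtbf}")
```

Two hours of downtime in a 720-hour month yields 99.72% availability, which is why even "three nines" targets leave very little room for error.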
Reliability Testing In SRE

Performance Testing: This involves evaluating the response time, processing time, and resource utilization of the infrastructure to identify any performance issues under a business-as-usual (BAU) scenario at 1X load.
Load Testing: This technique involves simulating real-world user traffic and measuring the performance of the infrastructure under heavy loads (2X load).
Stress Testing: This technique involves applying more load than the expected maximum to test the infrastructure's ability to handle unexpected traffic spikes (3X load).
Chaos or Resilience Testing: This involves simulating different types of failures (e.g., network outages, hardware failures) to evaluate the infrastructure's ability to recover and continue operating.
Security Testing: This involves evaluating the infrastructure's security posture and identifying any potential vulnerabilities or risks.
Capacity Planning: This involves evaluating the current and future hardware, network, and storage requirements of the infrastructure to ensure it has the capacity to meet growing demand.

Workload Modeling In SRE

Workload modeling is a crucial part of SRE, which involves creating mathematical models to represent the expected behavior of systems. Little's Law is a key principle in this area. It states that the average number of items in a system, W, is equal to the average arrival rate (λ) multiplied by the average time each item spends in the system (T): W = λ * T. This formula can be used to determine the concurrency a system must sustain under different conditions.

Example: Consider a system that receives an average of 200 requests per minute, with an average response time of 2 seconds. To apply Little's Law, the units must agree, so first convert the arrival rate to requests per second: λ = 200/60 ≈ 3.33 requests/second. Then: W = λ * T = 3.33 requests/second * 2 seconds ≈ 6.7 requests. This result indicates that, on average, about 7 requests are in the system at any given moment; if the system has fewer than that many workers (threads, connections, or instances) available to process them, queues build up and reliability degrades.
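Little's Law is simple to compute, but the arrival rate and the time-in-system must share the same time unit, which is easy to get wrong. A minimal sketch of the calculation (the function name is illustrative):

```python
def avg_in_system(arrivals_per_minute: float, time_in_system_s: float) -> float:
    """Little's Law, W = λ * T, with λ converted to per-second to match T."""
    arrivals_per_second = arrivals_per_minute / 60.0
    return arrivals_per_second * time_in_system_s

# 200 requests/minute with a 2-second average response time:
w = avg_in_system(200, 2.0)
print(f"Average requests in system: {w:.2f}")  # 6.67
```

Comparing this figure against the number of available workers (threads, connections, instances) gives a quick first-pass capacity estimate before any load testing is run.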
By using the right workload modeling, organizations can determine the maximum workload that their systems can handle, take proactive steps to scale their infrastructure and improve reliability, and identify potential issues and design solutions to improve system performance before they become real problems.

Tools and techniques used for modeling and simulation:

Performance Profiling: This technique involves monitoring the performance of an existing system under normal and peak loads to identify bottlenecks and determine the system's capacity limits.
Load Testing: This is the process of simulating real-world user traffic to test the performance and stability of an IT system. Load testing helps organizations identify performance issues and ensure that the system can handle expected workloads.
Traffic Modeling: This involves creating a mathematical model of the expected traffic patterns on a system, which can be used to predict system behavior under different workload scenarios.
Resource Utilization Modeling: This involves creating a mathematical model of the expected resource utilization of a system, which can be used to predict resource consumption under different workload scenarios.
Capacity Planning Tools: There are various tools available that automate the process of capacity planning, including spreadsheet tools, predictive analytics tools, and cloud-based tools.

Chaos Engineering and Infrastructure Readiness in SRE

Chaos engineering and infrastructure readiness are important components of a successful SRE strategy. Both involve intentionally inducing failures and stress in systems to assess their strength and identify weaknesses. Infrastructure readiness testing is done to verify the system's ability to handle failure scenarios, while chaos engineering tests the system's recovery and reliability under adverse conditions.
The benefits of chaos engineering include improved system reliability, reduced downtime, and increased confidence in the system's ability to handle real-world failures. By proactively identifying and fixing weaknesses, organizations can avoid costly downtime, improve customer experience, and reduce the risk of data loss or security breaches. Integrating chaos engineering into DevOps practices (CI/CD) ensures that systems are thoroughly tested and validated before deployment. Methods of chaos engineering typically involve running experiments or simulations on a system to stress and test its various components, identify any weaknesses or bottlenecks, and assess its overall reliability. This is done by introducing controlled failures, such as network partitions, simulated resource exhaustion, or random process crashes, and observing the system's behavior and response.

Example Scenarios for Chaos Testing

Random Instance Termination: Selecting and terminating an instance from a cluster to test the system's response to the failure.
Network Partition: Partitioning the network between instances to simulate a network failure and assess the system's ability to recover.
Increased Load: Increasing the load on the system to test its response to stress, observing any performance degradation or resource exhaustion.
Configuration Change: Altering a configuration parameter to observe the system's response, including any unexpected behavior or errors.
Database Failure: Simulating a database failure by shutting it down and observing the system's reaction, including any errors or unexpected behavior.

By conducting both chaos experiments and infrastructure readiness testing, organizations can deepen their understanding of system behavior and improve their resilience and reliability.

Conclusion

In conclusion, SRE is a critical discipline for organizations that want to deliver highly reliable, highly available systems.
By adopting SRE principles and practices, organizations can improve system reliability, reduce downtime, and improve the overall user experience.
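The "random instance termination" scenario described above can be sketched in a few lines. The cluster model here is a toy stand-in invented for illustration; a real experiment would act on actual infrastructure through an orchestration API:

```python
import random

class Cluster:
    """Toy stand-in for a replicated service; not a real orchestration API."""

    def __init__(self, replicas: int):
        self.healthy = set(range(replicas))

    def terminate_random_instance(self) -> int:
        """Pick a random healthy replica and kill it (the chaos injection)."""
        victim = random.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def serves_traffic(self) -> bool:
        """A request succeeds as long as at least one replica is healthy."""
        return len(self.healthy) > 0

cluster = Cluster(replicas=3)
victim = cluster.terminate_random_instance()
# Steady-state hypothesis: losing one replica must not take the service down.
assert cluster.serves_traffic(), "service down after losing one instance"
print(f"Terminated replica {victim}; still serving with {len(cluster.healthy)} replicas")
```

The essential shape is the same at any scale: define a steady-state hypothesis, inject a controlled failure, then verify the hypothesis still holds.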
Software has become an essential part of our daily lives, from the apps on our phones to the programs we use at work. However, software, like all things, has a lifecycle, and as it approaches its end of life (EOL), it poses risks to the security, privacy, and performance of the system on which it runs. End-of-life software is software that no longer receives security updates, bug fixes, or technical support from the vendor. This article will look at reducing end-of-life software risks while protecting your systems and data.

Tips to Minimize End-of-Life Software Risks

Organizations can mitigate the risk associated with end-of-life technologies by researching modern technologies, developing a timeline for transitioning to the latest technology, training staff on new features and capabilities, and creating budget plans. Additionally, organizations should consider buying technology with longer life cycles and investing in backup or redundant systems in case of any problems or delays in transitioning to the new technology. Beyond that, the following tips can help organizations retire end-of-life software more quickly and efficiently:

1. Track the Status of End-of-Life Software

By knowing their software, teams understand details like its working patterns, operation, and dependencies. But understanding how to keep it running after support has ended is critical. That is why developers should prepare a clear plan for end-of-life software. This plan should include the following:

Identifying which software is at risk
Assessing the challenges
Implementing mitigation strategies
Switching to open-source alternatives

2. Give Adequate Time to Planning

While planning for an end-of-life software life cycle, it is necessary to consider a few core aspects, like training, implementation, and adoption. For this, one should carefully plan the timeline by accounting for supply-chain issues that often cause unnecessary delays.
While dealing with end-of-life support issues, one should begin planning early and set dates for the projects one wants to execute in the future. Knowing important dates will assist an organization in better planning, risk management, and reducing unforeseen budget expenses. Apart from this, one can have an accurate maintenance plan in place. Third-party maintenance support is beneficial here: third-party service providers offer valuable services like hardware replacement and repair of critical parts to keep end-of-life products running even after their official support date.

3. Evaluate Your Investments

Planning for EOL solutions allows organizations to rethink how they use existing technology. It also helps organizations ascertain the viability of transitioning to an alternative solution like the cloud. Further, it becomes easier to review business challenges and understand how alternative solutions may resolve them efficiently and cost-effectively. Companies can boost employee effectiveness with a hybrid workforce and simplify network management by transitioning to cloud-based software. Because the conversion may take significant time and money, planning should give IT managers enough time to make an informed strategic decision.

4. Try to Keep Tech Debt Low

Developers often have the wrong notion that their legacy applications or software keep running smoothly even without upgrading. The reality is quite the opposite. EOL software ceases to communicate with modern technologies, and upgrading it may require new hardware for compatibility, a firmware update, or third-party application compatibility work, resulting in high technical debt.
Here are some effective tips to reduce tech debt:

Find the codebase areas that increase maintenance costs
Restructure the codebase by breaking it into small pieces
Invest in automated testing to make changes to the codebase safely
Keep proper documentation to track code changes
Use low-code development platforms to build complex software

5. Use Compatibility Testing

Compatibility tests ensure that software performs successfully across all platforms, lowering the risk of failure and avoiding the release of an application full of compatibility bugs. Compatibility testing examines how well software performs with modern operating systems, networks, databases, hardware platforms, etc. It also allows developers to detect and eliminate errors before the release of the final product. LambdaTest, BrowserStack, BrowseEmAll, and TestingBot are popular compatibility testing tools that developers widely use.

6. Adopt Best Cybersecurity Practices

Organizations need to identify any possible vulnerabilities and take appropriate measures to minimize risks. A good place to start is by evaluating their current IT policies to determine whether they include strategies for disposing of software. Additionally, it is important to ensure that any sensitive data files are safely removed from the system and stored or transmitted using encryption. To enhance their cybersecurity, it is recommended that individuals adhere to password strength policies, regularly change passwords, and comply with relevant regulations.

7. Avoid Waiting Too Long

End-of-life software is a time bomb waiting to explode, and waiting until the last minute leads to disaster. The sooner you identify obsolete software and replace it, the better.
If you are not sure where to begin, you can try a few things to get started:

Stay updated on relevant industry trends and news
Constantly track new update releases from the vendor
Visit the vendor’s website and check the software lifecycle section
Scan your EOL software environment by using a tool such as AppDetectivePro

8. Get Ready for the Price Hike

A vendor will often increase its price as the software approaches its end-of-life date because it knows that customers are less likely to switch to a new product when their current product is about to end. Therefore, companies must be prepared for a price increase by budgeting for it in advance. Additionally, they should conduct extensive research to find better alternatives to end-of-life software to avoid overpaying.

In a Nutshell

Organizations can implement any of the above solutions to combat EOL software risks. Additionally, they can adopt software development practices that prioritize maintainability and long-term support to avoid end-of-life scenarios. These practices can include code maintainability reviews, regular software updates, and documentation that aids in the long-term support of the software. Overall, taking proactive steps to mitigate the risks associated with end-of-life software is critical to reducing the likelihood of security breaches, system failures, and other issues.
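Tip 1 above, tracking the status of end-of-life software, can be sketched as a simple inventory check. The product names, dates, and function below are made up for illustration; real EOL dates should come from vendor lifecycle pages:

```python
from datetime import date, timedelta

# Hypothetical inventory: software name -> vendor end-of-life date.
inventory = {
    "legacy-crm": date(2022, 6, 30),
    "payments-api": date(2026, 1, 15),
    "reporting-tool": date(2024, 11, 1),
}

def eol_report(inventory: dict[str, date], today: date, warn_days: int = 180):
    """Split the inventory into already-expired and soon-to-expire software."""
    expired = sorted(name for name, eol in inventory.items() if eol <= today)
    at_risk = sorted(name for name, eol in inventory.items()
                     if today < eol <= today + timedelta(days=warn_days))
    return expired, at_risk

expired, at_risk = eol_report(inventory, today=date(2024, 6, 1))
print("Expired:", expired)   # ['legacy-crm']
print("At risk:", at_risk)   # ['reporting-tool']
```

Running a report like this on a schedule turns EOL tracking from a last-minute scramble into routine planning input for the budgeting and migration timelines discussed above.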
Incident management has evolved considerably over the last couple of decades. Traditionally limited to just an on-call team and an alerting system, today it has evolved to include automated incident response combined with a complex set of SRE workflows.

Importance of Reliability

While the number of active internet users and people consuming digital products has been on the rise for a while, it is actually the combination of increased user expectations and competitive digital experiences that has led organizations to deliver highly reliable products and services. The bottom line is, customers have the right to expect reliable software, and the right to expect the product to work when they really need it. And it is the responsibility of organizations to build reliable products. Having said that, no software can be 100% reliable; even achieving 99.9% reliability is a monumental task. As engineering infrastructure grows more complex by the day, incidents become inevitable. But triaging and remediating issues quickly with minimal impact is what makes all the difference.

From the Vault: Recapping Incidents and Outages From the Past

Let’s look back at some notable outages from the past that have had a major impact on both businesses and end users alike.

October 2021: A mega outage took down Facebook, WhatsApp, Messenger, Instagram, and Oculus VR for almost five hours, and no one could use any of those products during that time.
November 2021: A downstream effect of a Google Cloud outage led to outages across multiple GCP products. This also indirectly impacted many non-Google companies.
December 2022: An incident corresponding to Amazon’s Search issue impacted at least 20% of all global users for almost an entire day.
January 2023: Most recently, the Federal Aviation Administration (FAA) suffered an outage due to failed scheduled maintenance, causing 32,578 flights to be delayed and a further 409 to be cancelled.
And needless to say, the monetary impact was massive. Share prices of numerous U.S. air carriers fell steeply in the immediate aftermath.

Reliability Trends as of 2023

These are just a few of the major outages that have impacted users on a global scale. In reality, incidents such as these are not uncommon and are far more frequent. While businesses and business owners bear the brunt of such outages, the impact is experienced by end users too, resulting in a poor user/customer experience (UX/CX). Here are some interesting stats on the consequences of poor CX/UX:

It takes 12 positive user experiences to make up for one unresolved negative experience
88% of web visitors are less likely to return to a site after a bad experience
Even a 1-second delay in page load can cause a 7% loss in customers

And that is why resolving incidents quickly is CRITICAL! But (literally :p) the million-dollar question is how to effectively deal with incidents. Let’s address this by probing into the challenges of incident management in the first place.

State of Incident Management Today

Evolving business and user needs have directly impacted incident management practices.

Increasingly complex systems have led to increasingly complex incidents. The use of public cloud and microservices architecture has made it difficult to find out what went wrong, e.g., which service is impacted and whether the outage has an upstream/downstream effect on other services. Hence, incidents are complex too.
User expectations have grown considerably due to increased dependency on technology. The widespread adoption of technology has made users more comfortable with it, and as a result, they are unwilling to put up with any kind of downtime or bad experience.
Tool sprawl amid evolving business needs adds to the complexity. The increasing number of tools within the tech stack to address complex requirements and use cases only adds to the complexity of incident management.

“...you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they think they might need in the future.” - Steve McGhee, Reliability Advocate, SRE, Google Cloud

Evolution of Incident Management

Over the years, the scope of activities associated with incident management has only been growing. Most of the evolution that’s taken place can be bucketed into one of four categories: technology, people, process, and tools.

Technology
15 years ago: Most teams ran monolithic applications; these were easy-to-operate systems with little sophistication.
7 years ago: Sophisticated distributed systems were the norm in medium-to-large organizations, with growing adoption of microservices architecture and public clouds.
Today: Even the smallest teams run complex, distributed apps, with widespread adoption of microservices architecture and public cloud services.

People
15 years ago: Large operations teams with manual workloads; a basic on-call team with low-skilled labor.
7 years ago: Smaller, more efficient ops teams with partially automated workloads; dedicated incident response teams with basic automation to notify on-call.
Today: Fewer members in operations, but fully automated workloads; dedicated response teams with instant and diverse notifications for on-call.

Process
15 years ago: Manual processes (with very little or no automation); less stringent SLAs; customers more accepting of outages.
7 years ago: Improved automation in systems architecture; more stringent SLAs; customers less accepting of outages.
Today: Heavy reliance on automation due to prevailing system complexity; strict SLAs; little or no tolerance toward outages.

Tools
15 years ago: Less tooling involved; basic monitoring/alerting solutions in place.
7 years ago: Improved operations tooling with IaC; advanced monitoring/alerting with increased automation.
Today: Heavy operations tooling; specialized tools associated with the observability world.

Problems Adjusting to Modern Incident Management

Now is the ideal time to address issues that are holding engineering teams back from doing incident management the right way.

Managing Complexity

Service ownership and visibility are the foremost factors preventing engineering teams from maximizing their time during incident triage. This is a result of the adoption of distributed applications, in particular microservices. A sprawling number of services makes it hard to track service health and the respective owners, and tool sprawl (a great number of tools within the tech stack) makes it even more difficult to track dependencies and ownership.

Lack of Automation

Achieving a respectable amount of automation is still a distant dream for most incident response teams. Automating their entire infrastructure stack through incident management would make a great difference in improving MTTA and MTTR. The tasks that are still manual, with great potential for automation during incident response, are:

The ability to quickly notify the on-call team of service outages or service degradation
The ability to automate incident escalations to more senior or experienced responders and stakeholders
Providing the appropriate conference bridge for communication and documenting incident notes

Poor Collaboration

Poor collaboration during an incident is a major factor keeping response teams from doing what they do best. The process of informing members within the team, across teams, within the organization, and outside of the organization must be simplified and organized.
Activities that can improve with better collaboration are:

Bringing visibility of service health to team members, internal and external stakeholders, customers, etc., with a status page
Maintaining a single source of truth in regard to incident impact and incident response
Conducting root cause analyses, postmortems, and incident retrospectives in a blameless way

Lack of Visibility Into Service Health

One of the most important (and responsible) activities for the response team is to facilitate complete transparency about incident impact, triage, and resolution to internal and external stakeholders as well as business owners. The problems:

The absence of a platform, such as a status page, that can keep all stakeholders informed of impact timelines and resolution progress
The inability to track the health of the dependent upstream/downstream services, and not just the affected service

Now, the timely question to probe is: what should engineering teams start doing? And how can organizations support them in their reliability journey?

What Can Engineering Leaders/Teams Do to Mitigate the Problem?

The facets of incident management today can be broadly classified into three categories:

On-call alerting
Incident response (automated and collaborative)
Effective SRE

Addressing the difficulties and devising appropriate processes and strategies around these categories can help engineering teams improve their incident management by 90%. Certainly sounds ambitious, so let's understand this in more detail.

On-Call Alerting and Routing

On-call is the foundation of a good reliability practice. There are two main aspects to on-call alerting, and they are highlighted below.

a. Centralizing Incident Alerting and Monitoring

The crucial aspect of on-call alerting is the ability to bring all alerts into a single, centralized command center.
This is important because a typical tech stack is made up of multiple alerting tools monitoring different services (or parts of the infrastructure), put in place by different users. An ecosystem that can bring such alerts together will make incident management much more organized.

b. On-Call Scheduling and Intelligent Routing

While organized alerting is a great first step, effective incident response is all about having an on-call schedule in place and routing alerts to the concerned on-call responder, and, in case of non-resolution or inaction, escalating to the most appropriate engineer (or user).

Incident Response (Automated and Collaborative)

While on-call scheduling and alert routing are the fundamentals, it is incident response that gives structure to incident management.

a. Alert Noise Reduction and Correlation

Oftentimes, teams get notified of unnecessary events. More commonly, during the process of resolution, engineers tend to get notified of similar and related alerts when it would be better to address the collective incident rather than each individual alert. With the right practices in place, incident and alert fatigue can be handled with automation rules for suppressing and deduplicating alerts.

b. Integration and Collaboration

Integrating the infrastructure stack with tools within the response process is possibly the simplest way to organize incident response. Collaboration can improve by establishing integrations with:

ITSM tools for ticket management
ChatOps tools for communication
CI/CD tools for deployment and quick rollback

Effective SRE

Engineering reliability into a product requires the entire organization to adopt the SRE mindset and buy into the ideology. While on-call is at one end of the spectrum, SRE (site reliability engineering) can be thought of as being at the other end. But what exactly is SRE? For starters, SRE should not be confused with what DevOps stands for.
While DevOps focuses on principles, SRE emphasizes activities. SRE is fundamentally about taking an engineering approach to systems operations in order to achieve better reliability and performance. It puts a premium on monitoring, tracking bugs, and creating systems and automation that solve problems for the long term. While Google was the birthplace of SRE, many top technology companies such as LinkedIn, Netflix, Amazon, Apple, and Facebook have adopted it and benefited greatly from doing so.

POV: Gartner predicts that, by 2027, 75% of enterprises will use SRE practices organization-wide, up from 10% in 2022.

What difference will SRE make? Today, users are expecting nothing but the very best, and an exclusive focus on SRE practices will help in:

Providing a delightful user experience (or customer experience)
Improving feature velocity
Providing fast and proactive issue resolution

How does SRE add value to the business? SRE adds a ton of value to any business that is digital-first. Below are some of the key points:

Provides an engineering-driven and data-driven approach to improve customer satisfaction
Enables you to measure toil and save time for strategic tasks
Leverages automation
Encourages learning from incident retrospectives
Encourages communication with status pages

The bottom line is, reliability has evolved. You have to be proactive and preventive. Teams will have to fix things faster and keep getting better at it. On that note, let's look at the different SRE aspects that engineering teams can adopt for better incident management:

a. Automated Response Actions

Automating manual tasks and eliminating toil is one of the fundamental truths on which SRE is built. Be it automating workflows with runbooks or automating response actions, SRE is a big advocate of automation, and response teams will benefit widely from having this in place.

b.
Transparency SRE advocates providing complete visibility into the health status of services, which can be achieved through the use of status pages. It also puts a premium on greater transparency and visibility of service ownership within the organization. c. Blameless Culture During an incident, SRE stresses blaming the process, not the individuals involved. This blameless culture goes a long way in fostering a healthy team culture and promoting team harmony. The process of doing RCAs in this spirit is called incident retrospectives or postmortems. d. SLO and Error Budget Tracking This is all about using a metric-driven approach to balance reliability and innovation. It encourages the use of SLIs to keep track of service health. By actively tracking SLIs, SLOs and error budgets can be kept in check, thus avoiding breaches of customer SLAs.
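The error-budget arithmetic behind SLO tracking is simple enough to sketch. The numbers below are hypothetical; a real setup would pull the SLI from a monitoring system:

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compare a success-rate SLI against an SLO target and report
    how much of the error budget has been burned."""
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = total_requests * (1 - slo_target)  # the error budget
    budget_burned = failed_requests / allowed_failures
    return {
        "sli": round(sli, 5),
        "slo_met": sli >= slo_target,
        "budget_burned_pct": round(budget_burned * 100, 1),
    }

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures.
report = error_budget_report(0.999, 1_000_000, 400)
print(report)  # SLO met; 40% of the error budget burned
```

When the burned percentage approaches 100, the error budget is exhausted: the metric-driven signal to slow feature releases and prioritize reliability work.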
In a world where quality is at the forefront of every business, you want to ensure that yours is providing excellent quality results to customers. It is the most important factor in establishing, preserving, and expanding the brand. So, what is QA, and what makes it so critical to the quality and reliability of a product or service? Quality assurance is a method of verifying and validating a product or service: it entails testing the software, using various testing techniques, to identify and correct any issues before the project is released to users. In other words, it assures that your customers will always be satisfied. Quality assurance as a service is one of the many QA practices. What Is Quality Assurance as a Service? Quality assurance as a service (QAaaS) is an on-demand approach to quality assurance, designed to ensure that quality is built into a product or service and that it meets customer requirements by incorporating an outsourced or crowdsourced QA team into development in order to move a project into production faster. It is offered by a variety of independent testers or third-party vendors with a team of qualified QA engineers and the appropriate infrastructure to conduct different types of testing. This service model is adaptable, with the correct combination of people, technology, and procedures to provide maximum project coverage. QAaaS can seamlessly integrate a QA service into various organization programs, adding competent resources when they are most needed and successfully reducing them as the project gradually draws to a close. So how should it be done? And why would you do it? How Does QA as a Service Model Work? Quality assurance as a service can be introduced at the very beginning of the development lifecycle, perform testing the entire time, be used to check a concrete prototype build, or be applied only to validate a single feature of a bigger project.
It gives the team complete control over the testing process, allowing it to be as comprehensive as the project requires. QAaaS can be either outsourced (via a third-party provider) or crowdsourced (via a community of independent testers), depending on how it is provided. Both types have advantages in terms of cost, expertise, and scalability. Compared to a traditional, fixed team, a team hired through QAaaS is more flexible and adapts to whatever type of development process your company uses. In short, you hire a group of people with different testing backgrounds across various industries, locations, and technologies. As a result, people with a wide knowledge base and different points of view on bug detection bring new perspectives to the testing process. In addition, testing in software QA is usually performed via cloud-based tools and platforms, so scaling testing resources up or down based on requirements becomes much easier. A quality assurance engineer is not just about testing whatever they are working on but also about advocating for product quality. QA engineers ensure that all possible cases and scenarios have been considered. With quality assurance as a service, testing capabilities and scenarios are broader and more flexible. Because of the large number of testers involved in crowdsourced QAaaS platforms, the most vulnerable areas of the project may be given extra attention and retested by a number of individuals in real-world environments at the same time. Crowdsourcing in QA as a service means bringing a large number of independent testers into the process and having their feedback analyzed by a central platform or service. If outsourcing is involved, the quality assurance provider is a company that offers testing solutions and QA services to other companies as a cloud-based service.
All software needs are outsourced to a third-party vendor, which offers various testing approaches, tools, and frameworks, allowing for time and resource savings in technology and training. Benefits of QA as a Service QAaaS will benefit organizations by providing increased productivity, staff flexibility, considerable cost savings, and far more effective solutions. In addition, defects are identified and removed on time with minimal losses. The following are the primary benefits of using QA services for a business: Simpler and More Detailed QA Services New technologies and app enhancements can be adopted faster with QAaaS. The team participating in the process is experienced and proficient enough to go straight into the specifics of a given project, so the basics can be skipped. QAaaS also provides access to the most recent testing tools and technologies, which improve testing efficiency and effectiveness. Higher Quality Since a software quality assurance service provider has extensive experience delivering testing services in many industries for various clients, the quality of QA procedures improves. Instead of adopting superior technologies and approaches on your own, you produce better software by incorporating a quality assurance as a service team into the process. As a result, at a fixed or reasonable cost, you can build better software while substantially boosting efficiency. In addition, new people in a team with good expertise in the area can add fresh perspectives and ideas to the entire testing process. Faster Results With Automation Solutions QAaaS providers can supply automation solutions that match sprint cycles and speed up the process. As a result, you can allocate your team’s resources elsewhere, where they might be needed. It is an efficient method of reducing time-to-market.
Saved Costs The ability to attract an outsourced team as needed during the project eliminates the necessity to maintain a large testing team and infrastructure running on a daily basis. The process is more customized in accordance with the original budget and requirements. In quality assurance as a service, you pay for the actual outcomes, which are newly discovered defects, rather than the hours invested in teamwork. Overall, QA as a service shifts the focus, enabling essential competencies in a company to take priority over less critical activities. In addition, it helps to reorganize the company’s internal development process in order to maximize the results produced by the team.
Writing clean and maintainable code is crucial for successful software development projects. Clean and maintainable code ensures that the software is easy to read, understand, and modify, which can save time and effort in the long run. This article will discuss some best practices for writing clean and maintainable code. Follow a Consistent Coding Style Consistency is key when it comes to writing clean and maintainable code. Following a consistent coding style makes the code easier to read and understand. It also helps ensure that the code is formatted correctly, which can prevent errors and make debugging easier. Some common coding styles include the Google Style Guide, the Airbnb Style Guide, and the PEP 8 Style Guide for Python. Keep It Simple Simplicity is another important aspect of writing clean and maintainable code. Simple code is easier to understand, modify, and debug. Avoid adding unnecessary complexity to your code; use clear and concise variable and function names. Additionally, use comments to explain complex code or algorithms. Use Meaningful Variable and Function Names Using meaningful variable and function names is essential for writing clean and maintainable code. Descriptive names help make the code more readable and understandable. Use names that accurately describe the purpose of the variable or function. Avoid using single-letter variable names or abbreviations that may be confusing to others who read the code. I recently applied this on a project, a suite of general-purpose calculators such as GST and VAT calculators, and it helped me retain the client for the long term. Write Modular Code Modular code is divided into smaller, independent components or modules. This approach makes the code easier to read, understand, and modify. Each module should have a clear and specific purpose and should be well-documented. Additionally, modular code can be reused in different parts of the software, which can save time and effort.
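The naming and modularity advice can be illustrated with the calculator example mentioned above. This is a sketch; the tax rates are placeholders, not real statutory rates:

```python
# Unclear: def c(a, r): return a * r  -- what are a and r?
# Clear, modular version: one well-named helper, one small function per tax.

def apply_tax(amount: float, rate: float) -> float:
    """Return the tax due on `amount` at the given fractional rate."""
    if amount < 0 or rate < 0:
        raise ValueError("amount and rate must be non-negative")
    return round(amount * rate, 2)  # round to currency precision

def gst(amount: float, rate: float = 0.18) -> float:
    """Goods and Services Tax (placeholder 18% rate)."""
    return apply_tax(amount, rate)

def vat(amount: float, rate: float = 0.20) -> float:
    """Value Added Tax (placeholder 20% rate)."""
    return apply_tax(amount, rate)

print(gst(100.0))  # 18.0
print(vat(100.0))  # 20.0
```

Each function has a single purpose and a descriptive name, so the shared `apply_tax` module can be reused by any new calculator added later.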
Write Unit Tests Unit tests verify the functionality of individual units or components of the software. Writing unit tests helps ensure that the code is working correctly and can help prevent bugs from appearing. Unit tests should be written for each module or component of the software and be automated to ensure that they are run regularly. Use Version Control Version control is a system that tracks changes to the code over time. Using version control is essential for writing clean and maintainable code because it allows developers to collaborate on the same codebase without overwriting each other's work. Additionally, version control will enable developers to revert to previous versions of the code if necessary. Document Your Code Documentation is an essential part of writing clean and maintainable code. Documentation helps others understand the code, and it helps ensure that the code is well-documented for future modifications. Include comments in your code to explain how it works and what it does. Additionally, write documentation outside of the code, such as README files or user manuals, to explain how to use the software. Refactor Regularly Refactoring is the process of improving the code without changing its functionality. Regularly refactoring your code can help keep it clean and maintainable. Refactoring can help remove unnecessary code, simplify complex code, and improve performance. Additionally, refactoring can help prevent technical debt, which is the cost of maintaining code that is difficult to read, understand, or modify. Conclusion In conclusion, writing clean and maintainable code is crucial for the success of software development projects. Following best practices such as consistent coding styles, simplicity, meaningful variable and function names, modular code, unit tests, version control, documentation, and regular refactoring can help ensure that the code is easy to read, understand, and modify. 
By following these best practices, developers can save time and effort in the long run and ensure that the software is of high quality.
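To make the unit-testing advice above concrete, here is a minimal example using Python's built-in unittest module; the `slugify` function and its tests are illustrative, not from any particular project:

```python
import unittest

def slugify(title: str) -> str:
    """Turn an article title into a URL slug (the unit under test)."""
    return "-".join(title.lower().split())

class SlugifyTests(unittest.TestCase):
    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Clean Code Matters"), "clean-code-matters")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  Hello   World "), "hello-world")
```

Saved as, say, test_slugify.py, the tests run with `python -m unittest test_slugify.py`, and a CI job can run them automatically on every commit.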
I attended Dynatrace Perform 2023. This was my sixth “Perform User Conference,” but the first in the last three years. Rick McConnell, CEO of Dynatrace, kicked off the event by sharing his thoughts on the company’s momentum and vision. The company is focused on adding value to the IT ecosystem and the cloud environment. As the world continues to change rapidly, this enables breakout opportunities to occur. Dynatrace strives to position clients well for the upcoming post-Covid and post-recession recovery. The cloud delivers undeniable benefits for companies and their customers. It enables companies to deliver software and infrastructure much faster, which is why we continue to see the growth of hyper-scale cloud providers. Companies rely on the cloud to take advantage of category and business growth opportunities. However, with the growth comes complexity. 71% of CIOs say it is increasingly difficult to manage all of the data being produced; it is beyond human ability to manage and make sense of it all. This creates a need and opportunity for automation and observability. There is an increased focus on cloud optimization on multiple fronts. Key areas of focus are reducing costs and driving more reliability and availability to ultimately deliver more value for customers. Observability is moving from an optional “nice to have” to a mandatory “must-have.” The industry is at an inflection point, with an opportunity to drive change right now. Organizations need end-to-end observability, and Dynatrace approaches the problem in a radically different way: data types need to be looked at collectively and holistically to be more powerful in the management of the ecosystem. Observability + Security + Business Analytics The right software intelligence platform can provide end-to-end observability to drive transformational change in businesses by delivering answers and intelligent automation from data.
End users are no longer willing to accept poor performance from applications. If your application doesn’t work or provides an inferior user experience, your customer will find another provider. As such, it is incumbent on businesses to deliver flawless, secure digital interactions that are performant all of the time. New Product Announcements Consistent with the vision of “a world where software works perfectly” (not having an incident in the first place), Dynatrace announced four new products today: Grail data lakehouse expansion: The Dynatrace Platform’s central data lakehouse technology, which stores, contextualizes, and queries data, now expands beyond logs and business events to encompass metrics, distributed traces, and multi-cloud topology and dependencies. This enhances the platform’s ability to store, process, and analyze the tremendous volume and variety of data from modern cloud environments while retaining full data context. Enhanced user experience: New UX features, such as built-in dashboard functionality and a visual interface, help foster teamwork between technical and business personnel. These features include Dynatrace Notebooks, an interactive document capability that allows IT, development, security, and business users to work together using code, text, and multimedia to construct, analyze, and disseminate insights from exploratory, causal-AI analytics projects, ensuring better coordination and decision-making throughout the company. Dynatrace AutomationEngine: Features an interactive user interface and no-code and low-code tools that empower teams to make use of Dynatrace’s causal-AI analytics for observability and security insights to automate BizDevSecOps procedures across their multi-cloud environments.
This automation platform enables IT teams to detect and solve issues proactively or direct them to the right personnel, thus saving time and allowing them to concentrate on complex matters that only humans can handle. Dynatrace AppEngine: Provides IT, development, security, and business teams with the capability of designing tailored, consistent, and knowledge-informed applications with a user-friendly, minimal-code method. Clients and associates can build personalized links to sync the Dynatrace platform with technologies across hybrid and multi-cloud environments, unify segregated solutions, and equip more personnel in their businesses with smart apps that rely on observability, security, and business insights from their ecosystems. Client Feedback I had the opportunity to speak with Michael Cabrera, Site Reliability Engineering Leader at Vivint. Michael brought SRE to Vivint after bringing SRE to Home Depot and Delta. Vivint realized they were spending more time firefighting than optimizing, and SRE helps solve this problem. Michael evaluated more than a dozen solutions, comparing features, ease of use, and comprehensiveness of the platform. Dynatrace was a clear winner. It enables SRE and provides a view into what customers are experiencing that is not available with any other tool. By seeing what customers feel, Michael and his team can be proactive rather than reactive. The SRE team at Vivint has 12 engineers and 200 developers servicing thousands of employees. Field technicians are in customers’ homes, helping them create and live in smarter homes. Technicians are key stakeholders since they are front-facing to end users. Dynatrace provides Vivint with a tighter loop between what customers experience and what teams can see in the tech stack. It reduces the time spent troubleshooting and firefighting rather than optimizing. Development teams can see how their code is performing. Engineers can see how the infrastructure is performing. Michael feels Grail is a game changer.
It allows Vivint to combine logs with business analytics to achieve full end-to-end observability into their entire business. Vivint was a beta tester of the new technology. The tighter feedback loops with deployment showed how the company’s engineering policies could further improve. They were able to scale and review the performance of apps and infrastructure and see more interconnected services and how things align with each other. Dynatrace is helping Vivint to manage apps and software through SLOs, which they have been able to set up in a couple of minutes. It’s easy to install with one agent, without enabling plug-ins or buying add-ons. SREs can sit with engineering and product teams and show the experience from the tech stack to the customer. It’s great for engineering teams to have real-time feedback on performance: they can release code and see the performance before, during, and after. The biggest challenge is having so much more information than before. They are training team members to know what to do with the information and how to drill down as needed. Conclusion I hope you have taken away some helpful information from my day one experience at Dynatrace Perform. To read more about my day two experience, read here.
Are you looking to move your workloads from your on-premises environment to the cloud, but don't know where to start? Migrating your business applications and data to a new environment can be a daunting task, but it doesn't have to be. With the right strategy, you can execute a successful lift and shift migration in no time. Whether you're migrating to a cloud environment or just updating your on-premises infrastructure, this comprehensive guide will cover everything from planning and preparation to ongoing maintenance and support. In this article, I have provided the essential steps to execute a smooth lift and shift migration and make the transition to your new environment as seamless as possible. Preparation for Lift and Shift Migration Assess the Workloads for Migration Before starting the lift and shift migration process, it is important to assess the workloads that need to be migrated. This involves identifying the applications, data, and resources that are required for the migration. This assessment will help in determining the migration strategy, resource requirements, and timeline for the migration. Identify Dependencies and Potential Roadblocks This involves understanding the relationships between the workloads and identifying any dependencies that might impact the migration process. Potential roadblocks could include compatibility issues, security and data privacy concerns, and network limitations. By identifying these dependencies and roadblocks, you can plan and ensure a smooth migration process. Planning for Network and Security Changes Lift and shift migration often involves changes to the network and security configurations. It is important to plan for these changes in advance to ensure the integrity and security of the data being migrated. This includes defining the network architecture, creating firewall rules, and configuring security groups to ensure secure data transfer during the migration process. 
Lift and Shift Migration Lift and shift migration is a method used to transfer applications and data from one infrastructure to another. The goal is to recreate the current environment with minimum changes, making it easier for users and reducing downtime. Migration Strategies There are several strategies to migrate applications and data. A common approach is to use a combination of tools to ensure accurate and efficient data transfer. One strategy is to utilize a data migration tool. These tools automate the process of transferring data from one environment to another, reducing the risk of data loss or corruption. Some popular data migration tools include AWS Database Migration Service, Azure Database Migration Service, and Google Cloud Data Transfer. Another strategy is to use a cloud migration platform. These platforms simplify the process of moving the entire infrastructure, including applications, data, and networks, to the cloud. Popular cloud migration platforms include AWS Migration Hub, Azure Migrate, and Google Cloud Migrate. Testing and Validation Testing and validation play a crucial role in any migration project, including lift and shift migrations. To ensure success, it's essential to test applications and data before, during, and after the migration process. Before migration, test applications and data in the current environment to identify potential issues. During migration, conduct ongoing testing and validation to ensure accurate data transfer. After the migration is complete, final testing and validation should be done to confirm everything is functioning as expected. Managing and Monitoring Managing and monitoring the migration process is crucial for success. A project plan should be in place outlining the steps, timeline, budget, and resources needed. 
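The "accurate data transfer" validation described above can be sketched as a checksum comparison between source and target objects after each batch. This is a generic illustration, not tied to any particular migration tool; the file names and contents are made up:

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest used to compare source and target copies."""
    return hashlib.sha256(data).hexdigest()

def validate_transfer(source: dict, target: dict) -> list:
    """Return the keys whose content is missing or differs after migration."""
    mismatched = []
    for key, blob in source.items():
        if key not in target or checksum(blob) != checksum(target[key]):
            mismatched.append(key)
    return mismatched

source = {"orders.csv": b"id,total\n1,9.99\n", "users.csv": b"id,name\n1,Ada\n"}
target = {"orders.csv": b"id,total\n1,9.99\n", "users.csv": b"id,name\n1,Bob\n"}
print(validate_transfer(source, target))  # ['users.csv'] needs re-transfer
```

Managed services such as AWS DMS perform equivalent row- or object-level validation internally; the point is that every migrated object should be verified against its source before cutover.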
Understanding the tools and technologies used to manage and monitor the migration process is important, such as migration tools and platforms, and monitoring tools like AWS CloudTrail, Azure Monitor, and Google Cloud Stackdriver. Post-Migration Considerations Once your lift and shift migration is complete, it's important to turn your attention to the post-migration considerations. These considerations will help you optimize your migrated workloads, handle ongoing maintenance and updates, and address any lingering issues or challenges. Optimizing the Migrated Workloads for Performance One of the key post-migration considerations is optimizing the migrated workloads for performance. This is an important step because it ensures that your migrated applications and data are running smoothly and efficiently in the new environment. After a successful migration, it's crucial to ensure that your applications and data perform optimally in the new environment. To achieve this, you need to evaluate their performance in the new setup. This can be done through various performance monitoring tools like AWS CloudWatch, Azure Monitor, and Google Cloud Stackdriver. Upon examining the performance, you can identify areas that need improvement and make the necessary adjustments. This may include modifying the configuration of your applications and data or adding more resources to guarantee efficient performance. Handling Ongoing Maintenance and Updates Another important post-migration consideration is handling ongoing maintenance and updates. This is important because it ensures that your applications and data continue to run smoothly and efficiently, even after the migration is complete. To handle ongoing maintenance and updates, it's important to have a clear understanding of your infrastructure and the tools and technologies that you are using. You should also have a plan in place for how you will handle any updates or changes that may arise in the future. 
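Whichever monitoring tool you use, the performance evaluation described above boils down to comparing observed metrics against thresholds and flagging workloads that need adjustment. A minimal sketch, where the metric names and limits are assumptions rather than recommended values:

```python
# Assumed thresholds; in practice these come from your pre-migration baselines.
THRESHOLDS = {"cpu_pct": 80.0, "p95_latency_ms": 500.0}

def flag_workloads(metrics):
    """Return workloads whose post-migration metrics exceed the thresholds,
    along with the metrics that breached."""
    flagged = {}
    for name, observed in metrics.items():
        breaches = [m for m, limit in THRESHOLDS.items()
                    if observed.get(m, 0) > limit]
        if breaches:
            flagged[name] = breaches
    return flagged

metrics = {
    "billing-api": {"cpu_pct": 91.0, "p95_latency_ms": 340.0},
    "reporting":   {"cpu_pct": 45.0, "p95_latency_ms": 120.0},
}
print(flag_workloads(metrics))  # {'billing-api': ['cpu_pct']}
```

A flagged workload is then the candidate for resizing, configuration changes, or adding resources, as described above.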
One of the key things to consider when it comes to maintenance and updates is having a regular schedule for updating your applications and data. This will help you stay on top of any changes that may need to be made, and will ensure that your workloads are running optimally at all times. Addressing Any Lingering Issues or Challenges It's crucial to resolve any unresolved problems or difficulties that occurred during the migration process. This guarantees that your applications and data run smoothly and efficiently and that any issues overlooked during migration are dealt with before they become bigger problems. To resolve lingering issues, it is necessary to have a good understanding of your infrastructure and the tools you use. Having a plan in place for handling future issues is also important. A key aspect of resolving lingering issues is to have a monitoring system in place for your applications and data. This helps to identify any problems and respond promptly. When Should You Consider the Lift and Shift Approach? The lift and shift approach allows you to convert capital expenses into operational ones by moving your applications and data to the cloud with little to no modification. This method can be beneficial in several scenarios, such as: When you need a complete cloud migration: The lift and shift method is ideal for transferring your existing applications to a more advanced and flexible cloud platform to manage future risks. When you want to save on costs: The lift and shift approach helps you save money by migrating your workloads to the cloud from on-premises with little modifications, avoiding the need for expensive licenses or hiring professionals. When you have limited expertise in cloud-native solutions: This approach is suitable when you need to move your data to the cloud quickly and with minimal investment and you have limited expertise in cloud-native solutions. 
When you don’t have proper documentation: The lift and shift method is also useful if you lack proper documentation, as it allows you to move your application to the cloud first, and optimize or replace it later. Conclusion Lift and shift migration is a critical step in modernizing legacy applications and taking advantage of the benefits of the cloud. The process can be complex and time-consuming, but careful planning and working with a knowledgeable vendor or using a reliable cloud migration tool can ensure a smooth and successful migration. Organizations can minimize downtime and risk of data loss, while increasing scalability, reliability, and reducing costs. Lift and shift migration is a smart choice for organizations looking to upgrade their technology and benefit from cloud computing. By following the best practices outlined in this article, organizations can achieve their goals and execute a successful lift and shift migration.
In November 2022, the Green Software Foundation organized its first hackathon, “Carbon Hack 2022,” with the aim of supporting software projects whose objective is to reduce carbon emissions. I participated in this hackathon with the Carbon Optimised Process Scheduler project along with my colleagues Kamlesh Kshirsagar and Mayur Andulkar, in which we developed an API to optimize job scheduling in order to reduce carbon emissions, and we won the “Most Insightful” project prize. In this article, I will summarize the key concepts of “green software” and explain how software engineers can help reduce carbon emissions. I will also talk about the Green Software Foundation hackathon, Carbon Hack, and its winners. What Is “Green Software”? According to this research by Malmodin and Lundén (2018), the global ICT sector is responsible for 1.4% of carbon emissions and 4% of electricity use. In another article, it is estimated that the ICT sector’s emissions in 2020 were between 1.8% and 2.8% of global greenhouse gas emissions. Even though these estimates carry some uncertainty, they give a reasonable idea of the impact of the ICT sector. Green Software Foundation defines “green software” as a new field that combines climate science, hardware, software, electricity markets, and data center design to create carbon-efficient software that emits the least amount of carbon possible. Green software focuses on three crucial areas to do this: hardware efficiency, carbon awareness, and energy efficiency. Green software practitioners should be aware of these six key points: Carbon Efficiency: Emit the least amount of carbon Energy Efficiency: Use the least amount of energy Carbon Awareness: Aim to utilize “cleaner” sources of electricity when possible Hardware Efficiency: Use the least amount of embodied carbon Measurement: You can’t get better at something that you don’t measure Climate Commitments: Understand the mechanism of carbon reduction What Can We Do as Software Engineers? 
Fighting global warming and climate change involves all of us, and since we can contribute by changing our code, we might start with the advice of Ismael Velasco, an expert in this field. These principles are extracted from his presentation at the Code For All Summit 2022: 1. Green By Default We should move our applications to a greener cloud provider or zone. This article compares the three main cloud providers. Google Cloud has matched 100% of its electricity consumption with renewable energy purchases since 2017 and has recently committed to fully decarbonize its electricity supply by 2030. Azure has been 100% carbon-neutral since 2012, meaning they remove as much carbon each year as they emit, either by removing carbon or reducing carbon emissions. AWS purchases and retires environmental attributes like renewable energy credits and Guarantees of Origin to cover the non-renewable energy used in specific regions; only a handful of their data centers have achieved carbon neutrality through offsets. Make sure the availability zone where your app is hosted is green; this can be checked on the website of the Green Web Foundation. Transfers of data should be optional, minimal, and sent just once. Prevent pointless data transfers. Delete useless information (videos, special fonts, unused JavaScript, and CSS). Optimize media and minify assets. Reduce page loads and data consumption with service workers’ focused caching solutions. Make use of a content delivery network (CDN); with CloudFront, you can handle all requests from servers that are currently using renewable energy. Reduce the number of HTTP requests and data exchanges in your API designs. Track your app’s environmental impact. Start out quickly and simply, then gradually increase complexity. 2. Green Mode Design Green Mode design gives users the option to reduce functionality in exchange for less energy use.
Examples include sound-only videos, transcript-only audio, cache-only web apps, zero ads/trackers, and optional images (click-to-view, grayscale). Green Mode is a way of designing software that prioritizes extending device life and digital inclusion over graceful degradation. To achieve this, it suggests designing for maximum backward compatibility with operating systems and web APIs, as well as offering minimal versions of CSS. 3. Green Partnerships We should ponder three questions: What knowledge are we lacking? What networks are missing? What can we provide to partners? What Is the Green Software Foundation? Accenture, GitHub, Microsoft, and ThoughtWorks launched the Green Software Foundation with the Linux Foundation to put software engineering’s focus on sustainability. The Green Software Foundation is a non-profit organization created under the Linux Foundation with the goal of creating a reliable ecosystem of individuals, standards, tools, and green software best practices. It focuses on lowering the carbon emissions that software is responsible for and reducing the adverse effects of software on the environment. Moreover, it was established for those who work in the software industry, with the aim of providing them with information on what they can do to reduce the emissions their work is responsible for. Carbon Hack 2022 Carbon Hack 2022 took place for the first time between October 13th and November 10th, 2022, and was supported by the GSF member organizations Accenture, Avanade, Intel, Thoughtworks, Globant, Goldman Sachs, UBS, BCG, and VMware. The aim of Carbon Hack was to create carbon-aware software projects using the GSF Carbon Aware SDK, which has two parts: a hosted API and a client library available for 40 languages. The hackathon had 395 participants and 51 qualified projects from all over the world.
Carbon-aware software runs an application at different times or in regions where electricity is generated from greener sources, like wind and solar, which can reduce its carbon footprint. When the electricity is clean, carbon-aware software works harder; when the electricity is dirty, it works less. By including carbon-aware features in an application, we can partially offset our carbon footprint and lower greenhouse gas emissions.

Carbon Hack 2022 Winners

The total prize pool of $100,000 was divided between the first three winners and four category winners:
- First place – Lowcarb: a plugin that enables carbon-aware scheduling of training jobs on geographically distributed clients for the well-known federated learning framework Flower. The results showed 13% lower carbon emissions without any negative impacts.
- Second place – Carbon-Aware DNN Training with Zeus: an energy optimization framework that adjusts the power limit of the GPU and can be integrated into any DNN training job. The use case for Zeus showed a 24% reduction in carbon emissions with only a 3% impact on training time.
- Third place – Circa: a lightweight library, written in C, that can be installed from a release with the usual configure-and-make-install procedure. It chooses the most effective time to run a program within a predetermined window of time and also includes a simple scripting command that waits for the lowest-carbon-intensity energy over a specified period.
- Most Innovative – Sustainable UI: a library that provides a set of base primitives for building carbon-aware UIs in any React application; in the future, the developers would like to offer versions for other popular frameworks as well.
The developers predicted that if Facebook were to use SUI Headless, its monthly gross CO2 emissions would be reduced by 1,800 metric tons, shaving a tenth of a gram of CO2e off every visit while gracefully degrading its user interface. This is comparable to the fuel used by 24 tanker trucks or the annual energy consumption of 350 houses.
- Most Polished – GreenCourier: a scheduling plugin for Kubernetes. To deploy carbon-aware scheduling across geographically distributed Kubernetes clusters, the authors developed a scheduling policy based on marginal carbon emission statistics obtained from the Carbon Aware SDK.
- Most Insightful – Carbon Optimised Process Scheduler (disclosure: this was my Carbon Hack team!): an API service with a UI application that optimizes job scheduling to reduce carbon emissions. The problem was modeled using mixed-integer linear programming and solved with an open-source solver. If hundreds of high-energy industrial processes could be optimized this way, carbon emissions could be reduced by up to 2 million tons per year. An example scenario from the IT sector demonstrates how moving work by just three hours can reduce CO2 emissions by almost 18.5%, which amounts to savings of roughly 300 thousand tons of CO2 per year when applied to a million IT processes.
- Most Actionable – HEDGE.earth: according to this team, 83% of the carbon emissions on the web come from API requests. They developed a reverse proxy (an application that sits in front of back-end applications and forwards client requests to those apps) to maximize the amount of clean energy used to complete API requests (also available on NPM).

Take a look at all the projects from Carbon Hack 2022 here.
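Several of the winning projects share the same core idea: consult a carbon-intensity signal and shift work toward cleaner electricity, either in space (pick a greener region) or in time (pick a greener hour). The following is a minimal sketch of that pattern; the region names and intensity numbers are invented stand-ins for a real data source such as the Carbon Aware SDK's hosted API.

```python
# Stub carbon-intensity data (gCO2eq/kWh); a real implementation would
# query a live source such as the GSF Carbon Aware SDK's hosted API.
REGION_INTENSITY = {"eu-north": 45.0, "us-east": 410.0, "ap-south": 630.0}

def pick_greenest_region(regions):
    """Spatial shifting: run where the grid is cleanest right now."""
    return min(regions, key=REGION_INTENSITY.__getitem__)

# Hypothetical hourly forecast for the next 8 hours.
FORECAST = [380, 350, 290, 210, 180, 220, 310, 400]

def best_start_hour(forecast, job_hours):
    """Temporal shifting: start a fixed-length job in the cleanest window
    that still finishes inside the forecast horizon."""
    starts = range(len(forecast) - job_hours + 1)
    return min(starts, key=lambda s: sum(forecast[s:s + job_hours]))

print(pick_greenest_region(["eu-north", "us-east", "ap-south"]))  # eu-north
print(best_start_hour(FORECAST, job_hours=3))  # 3: hours 3-5 sum lowest
```

A brute-force scan over start slots is enough here; the MILP formulation used by the Carbon Optimised Process Scheduler generalizes this to many jobs with shared resource constraints.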
Conclusion

Collective effort and cross-disciplinary cooperation, across industries and within engineering, are essential to achieving global climate goals. We can start with the two courses on green software and sustainable software engineering that the Linux Foundation and Microsoft offer, and we can begin discussing with our colleagues how to lower the carbon emissions our applications produce. We could also follow people who are knowledgeable on this topic on social media; I would recommend the articles of Ismael Velasco as a starting point. If we manage to write greener code, our software projects will be more robust, reliable, fast, and resilient. Sustainable software will not only reduce our applications' carbon footprint but also help sustain the applications themselves through fewer dependencies, better performance, lower resource usage, cost savings, and energy-efficient features.
Tracking Mean Time To Restore (MTTR) is standard industry practice for incident response and analysis, but should it be? Courtney Nash, an Internet Incident Librarian, argues that MTTR is not a reliable metric — and we think she's got a point. We caught up with Courtney at the DevOps Enterprise Summit in Las Vegas, where she was making her case against MTTR in favor of alternative metrics (SLOs and cost-of-coordination data), practices (near-miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents.

Episode Highlights

(1:54) The end of MTTR?
(4:50) Library of incidents
(13:20) What is an incident?
(19:41) Cost of coordination
(22:13) Near misses
(24:21) Mental models
(28:16) Role of language in shaping public discourse
(29:33) Learnings from The Void

Episode Excerpt

Dan: Hey, everyone; welcome to Dev Interrupted. My name is Dan Lines, and I'm here with Courtney Nash, who has one of the coolest possibly made-up titles, but possibly real: Internet Incident Librarian.

Courtney: Yep, that's right, yeah, you got it.

Dan: Welcome to the show.

Courtney: Thank you for having me on.

Dan: I love that title.

Courtney: Still possibly made up, possibly, possibly...

Dan: Still possibly made up.

Courtney: We'll just leave that one out there for the listeners to decide.

Dan: Let everyone decide what that could possibly mean. We have a, I think, maybe a spicy show, a spicy topic.

Courtney: It's a hot topic show.

Dan: Hot topic, especially since we're at DevOps Enterprise Summit, where we hear a lot about the DORA metrics, one of them being MTTR.

Courtney: Yes.

Dan: And you might have a hot take on that. The end of MTTR? Or how would you describe it?

Courtney: Yeah, I feel a little like the fox in the henhouse here, but Gene accepted the talk. So you know, there's that.

Dan: So it's on him.

Courtney: [laughing] It's all Gene's fault!
Courtney: So I have been interested in complex systems for a long time; I used to study the brain. And I got sucked down an internet rabbit hole quite a while ago. I've had beliefs for a long time that I haven't necessarily had data to back up. And we see these sort of perverted behaviors (not that kind of perverted) where we take metrics in the industry, and then, per Goodhart's Law, whichever metric you pick, people incentivize it, and then weird things happen. But I think we spend too little time looking at the humans in the system and a lot of time focusing on the technical aspects and the data that come out of the technical side of systems.

So I started a project about a year ago called The Void. It's the Verica Open Incident Database (actually a real, not made-up name), and it's the largest collection of public incident reports. So, if you have an outage, you hopefully go and figure out and talk about what happened, and then you write that up and it's out in the world. I'm not writing these; I'm curating and collecting them. I'm a librarian. I have about 10,000 of them now, and a bunch of metadata associated with all these incident reports.
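A core statistical problem with MTTR, one that analyses of incident data like The Void's highlight, is that incident durations tend to be heavily skewed: a mean is dominated by a few very long outages and says little about a typical incident. A toy numerical illustration (the durations below are invented):

```python
import statistics

# Invented incident durations in minutes: many short incidents
# plus one day-long outlier, a shape common in real incident data.
durations = [12, 18, 25, 30, 35, 40, 45, 55, 70, 1440]

mttr = statistics.mean(durations)
median = statistics.median(durations)
print(round(mttr, 1))  # 177.0: dominated by the single outlier
print(median)          # 37.5: closer to what a typical incident looks like
```

A team "improving" this MTTR could do so simply by having one fewer freak outage, which is why distribution-aware views (or different metrics entirely, such as SLOs) tell a more honest story.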
The Southwest Airlines fiasco from December 2022 and the FAA NOTAM database failure from January 2023 had one thing in common: their respective root causes were mired in technical debt.

At its most basic, technical debt represents some kind of technology mess that someone has to clean up. In many cases, technical debt results from poorly written code, but more often than not, it is the result of evolving requirements that older software simply cannot keep up with. Both the Southwest and FAA debacles centered on legacy systems that may have met their respective business needs at the time they were implemented but, over the years, became increasingly fragile in the face of changing requirements. Such fragility is a surefire result of technical debt.

The coincidental occurrence of these two high-profile failures mere weeks apart lit a fire under organizations across both the public and private sectors to finally do something about their technical debt. It's time to modernize, the pundits proclaimed, regardless of the cost. Ironically, at the same time, a different set of pundits, responding to the economic slowdown and the prospect of a looming recession, recommended that enterprises delay modernization efforts to reduce costs in the short term. After all, modernization can be expensive and rarely delivers the type of flashy, top-line benefits the public markets favor.

How, then, should executives make decisions about cleaning up the technical debt in their organizations? Just how important is such modernization in the context of all the other priorities facing the C-suite?

Understanding and Quantifying Technical Debt Risk

Some technical debt is worse than others. Just as a low-interest mortgage is a much better idea than loan-shark money, so too with technical debt; sometimes shortcuts when writing code are a good thing. Quantifying technical debt, however, isn't a matter of somehow measuring how messy legacy code might be.
The real question is one of risk to the organization. Two separate examples of technical debt might be just as messy and equally worthy of refactoring, but the first may be working just fine, with a low chance of causing problems in the future, while the other could be a bomb waiting to go off. Measuring the risks inherent in technical debt, therefore, is far more important than any measure of the debt itself — and places this discussion into the broader area of risk measurement or, more specifically, risk scoring.

Risk scoring begins with risk profiling, which determines the importance of a system to the mission of the organization. Risk scoring provides a basis for quantitative, risk-based analysis that gives stakeholders a relative understanding of the risks from one system to another — or from one area of technical debt to another. The overall risk score is the sum of all the risk profiles across the system in question and thus gives stakeholders a way of comparing risks in an objective, quantifiable manner.

One particularly useful (and free to use) resource for calculating risk profiles and scores is Cyber Risk Scoring (CRS) from NIST, an agency of the US Department of Commerce. CRS focuses on cybersecurity risk, but the folks at NIST have intentionally structured it to apply to other forms of risk, including technical debt risk.

Comparing Risks Across the Enterprise

As long as an organization has a quantitative approach to risk profiling and scoring, it's possible to compare one type of risk to another — and, furthermore, to make decisions about mitigating risks across the board. Among the types of risks particularly well-suited to this kind of analysis are operational risk (i.e., the risk of downtime), which includes network risk; cybersecurity risk (the risk of breaches); compliance risk (the risk of out-of-compliance situations); and technical debt risk (the risk that legacy assets will adversely impact the organization).
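As a toy illustration of summing risk profiles into a comparable score (the profile names, 1-5 rating scales, and numbers below are invented, not taken from NIST CRS): rate each risk area's likelihood and impact, treat the product as that profile's contribution, and sum the contributions into a system-level score.

```python
# Invented likelihood/impact ratings (1-5 scale) for one system.
PROFILES = {
    "operational":    (3, 4),  # (likelihood, impact)
    "cybersecurity":  (2, 5),
    "compliance":     (2, 3),
    "technical_debt": (4, 4),
}

def risk_score(profiles):
    """Sum of likelihood x impact across all risk profiles."""
    return sum(likelihood * impact for likelihood, impact in profiles.values())

print(risk_score(PROFILES))  # 12 + 10 + 6 + 16 = 44
```

Because every system's score is built the same way, two systems (or two areas of technical debt) can be ranked against each other even when their dominant risks are of different types.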
The primary reason to bring these various sorts of risk onto a level playing field is to give the organization an objective approach to deciding how much time and money to spend on mitigating them. Instead of having different departments decide how to use their respective budgets to mitigate the risks within their scope of responsibility, organizations need a way to coordinate risk mitigation efforts that leads to an optimal balance between risk mitigation and the cost of achieving it.

Calculating the Threat Budget

Once an organization looks at its risks holistically, one uncomfortable fact emerges: it's impossible to mitigate them all. There simply isn't enough money or time to address every possible threat to the organization. Risk mitigation, therefore, isn't about eliminating risk; it's about optimizing the amount of risk we can't mitigate.

Optimizing the balance between mitigation and the cost of achieving it across multiple types of risk requires a new approach to managing risk. We can find this approach in the practice of Site Reliability Engineering (SRE). SRE focuses on managing reliability risk, a type of operational risk concerned with reducing system downtime. Given that the goal of zero downtime is too expensive and time-consuming to achieve in practice, SRE calls for an error budget: a measure of how far short of perfect reliability the organization targets, given the cost of mitigating the threat of downtime.

If we generalize the idea of error budgets to other types of risk, we can postulate a threat budget: a quantitative measure of how far short of eliminating a particular risk the organization is willing to fall. Intellyx calls this quantitative, best-practice approach to managing threat budgets across different types of risk "threat engineering."
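The error budget arithmetic, and one way to generalize it into spending a threat budget across risk types, can be sketched as follows. Everything here is an invented illustration: the SLO figure, project names, risk-score reductions, and costs are hypothetical, and real threat engineering would weigh far more factors than a greedy ratio.

```python
def error_budget_minutes(slo_target, window_minutes=30 * 24 * 60):
    """Downtime an availability SLO permits per (30-day) window."""
    return (1 - slo_target) * window_minutes

# A 99.9% monthly SLO leaves roughly 43 minutes of acceptable downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2

# Generalizing: spend a fixed mitigation budget across risk types by
# funding projects greedily in order of risk reduction per unit cost.
PROJECTS = [
    # (name, risk-score reduction, cost in $k) -- all invented
    ("refactor legacy scheduler", 16, 400),
    ("ransomware hardening",      10, 250),
    ("compliance audit tooling",   6, 300),
]

def fund(projects, budget_k):
    ranked = sorted(projects, key=lambda p: p[1] / p[2], reverse=True)
    funded = []
    for name, _, cost in ranked:
        if cost <= budget_k:
            funded.append(name)
            budget_k -= cost
    return funded

print(fund(PROJECTS, budget_k=700))  # the two best-ratio projects fit
```

The point of the sketch is the framing, not the algorithm: once every department's risks and mitigation costs live on the same scale, a technical debt project can compete for budget on equal footing with a security or compliance project.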
Assuming an organization has adopted the risk scoring approach from NIST (or some alternative), it's now possible to engineer risk mitigation across all types of threats and optimize the organization's response to them.

Applying Threat Engineering to Technical Debt

Resolving technical debt requires some kind of modernization effort. Sometimes this modernization is a simple matter of refactoring some code; in other cases, it's a complex, difficult migration process. There are several other approaches to modernization with varying risk/reward profiles as well.

Risk scoring provides a quantitative assessment of just how important a particular modernization effort is to the organization, given the threats inherent in the technical debt in question. Threat engineering, in turn, gives the organization a way of placing the cost of mitigating technical debt risk in the context of all the other risks it faces — regardless of which department or budget is responsible for mitigating any one of them.

Applying threat engineering to technical debt risk is especially important because other types of risk, namely cybersecurity and compliance risk, get more attention and thus a greater emotional reaction. It's difficult to be scared of spaghetti code when ransomware is in the headlines. As the Southwest and FAA debacles show, however, technical debt risk is every bit as risky as other, sexier forms of risk. With threat engineering, organizations finally have a way of approaching risk holistically in a dispassionate, best-practice-based manner.

The Intellyx Take

Threat engineering provides a proactive, best-practice-based approach to breaking down the organizational silos that naturally form around different types of risk. Breaking down such silos has been a priority for several years now, leading to practices like NetSecOps and DevSecOps that seek to leverage common data and better tooling to break down the divisions between departments.
Such efforts have always been a struggle because these different teams have long had different priorities — and everyone ends up fighting for a slice of the budget pie. Threat engineering can align these priorities. Once everybody realizes that their primary mission is to manage and mitigate risk, then real organizational change can occur.

Copyright © Intellyx LLC. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. As of the time of writing, none of the organizations mentioned in this article is an Intellyx customer. No AI was used to produce this article.
Samir Behara
Senior Cloud Infrastructure Architect,
AWS
Shai Almog
OSS Hacker, Developer Advocate and Entrepreneur,
Codename One
JJ Tang
Co-Founder,
Rootly
Sudip Sengupta
Technical Writer,
Javelynn