8 Reasons the Consequences of Bad Data Are More Severe
You may not have heard the term data downtime, but I’m willing to bet you’ve experienced it, and the cost of bad data, firsthand.
Urgent ping from your CEO about “missing data” in a critical report? Duplicate tables wreaking havoc in your Snowflake warehouse, all titled some variation of “Mikes_Table_GOOD-V3”? Or perhaps you’ve unintentionally made a decision based on bad data from last year’s forecasts?
Data downtime is when data is missing, erroneous, or otherwise inaccurate. The term intentionally recalls the early days of the internet, when websites would go down with what today would be alarming frequency.
It makes sense now, looking back on it. Not only were there infrastructure challenges, but not that many people were using the web, and sites were not nearly as valuable. As that changed with the rise of the cloud, e-commerce, and SaaS, ensuring reliability became mission-critical to a business, and site reliability engineering (SRE) was born.
Data is at a similar moment in time. Technologies are advancing, companies are moving to the cloud, and data is becoming more widespread and valuable than ever before.
The corollary to this is that as data becomes more valuable, the consequences of poor data quality become more severe. The best practices, technologies, and investments that were sufficient just a year or two ago will now jeopardize an organization’s ability to compete.
By 2025, 80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance, according to Gartner.
In this post, we will cover 8 reasons why the cost of bad data is rising. Let's get into it.
Data Is Moving Downstream
Each stage of a data pipeline acts as a filter that prevents bad data from moving downstream. The challenge is that several current trends, including data democratization, data products, and reverse ETL, are accelerating the pace at which data moves downstream.
Data Stacks Are Becoming More Complex
The further downstream that bad data travels, the more expensive the fix. It is much easier and quicker to have a data engineer troubleshoot an ETL pipeline than for a data scientist to re-train a machine learning model that has been fed bad data.
It’s not just the remediation that is expensive either. As companies increasingly rely on complex data assets to help execute business decisions, the opportunity costs of bad data rise as well.
For example, I spoke with an investment company with a machine learning model that would buy bonds automatically when they met certain criteria. Schema errors would take the model offline for days or weeks and, as a result, ground this part of their business to a standstill. Yikes.
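One way to keep a schema error from reaching a model like this is to validate incoming records before they are passed on. The following is a minimal sketch, not the company’s actual setup: the column names, types, and the `schema_errors` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical expected schema for an incoming bond record (illustrative only).
EXPECTED_SCHEMA = {"bond_id": str, "price": float, "maturity_years": int}


def schema_errors(row, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable schema problems found in one record."""
    errors = []
    for column, expected_type in expected.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            errors.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    # Flag columns the downstream consumer does not know about.
    for column in sorted(row.keys() - expected.keys()):
        errors.append(f"unexpected column: {column}")
    return errors


good = {"bond_id": "B-1", "price": 99.5, "maturity_years": 10}
bad = {"bond_id": "B-2", "price": "99.5", "coupon": 0.03}

for record in (good, bad):
    problems = schema_errors(record)
    print(record["bond_id"], "OK" if not problems else problems)
```

A check like this does not fix the upstream change, but it turns a silent failure into an explicit one that can be caught before the model consumes the data.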
As data stacks become more complex, there are also more data “handoffs,” introducing the opportunity for more issues. For example, one gaming company I spoke with noticed a drift in their new user acquisition data.
The social media platform they were advertising on changed their data schedule, so they were delivering data every 12 hours instead of 24. The company’s ETLs were set to pick up data only once per day, so this meant that suddenly half of the campaign data that was being sent to them wasn’t getting processed or passed downstream, skewing their new user metrics away from “paid” and towards “organic.”
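The effect of that schedule change can be sketched in a few lines. This is a simplified simulation under stated assumptions, not the gaming company’s pipeline: the vendor drops batches at a fixed interval, and a naive daily job (the hypothetical `daily_pickup`) keeps only the latest batch for each calendar day. Comparing batches delivered with batches processed is a basic volume check.

```python
from datetime import datetime, timedelta


def delivery_times(start, days, interval_hours):
    """Timestamps at which the (hypothetical) vendor drops a batch."""
    count = days * 24 // interval_hours
    return [start + timedelta(hours=interval_hours * i) for i in range(count)]


def daily_pickup(deliveries):
    """Naive daily ETL: keeps only the latest batch seen on each calendar day."""
    latest_per_day = {}
    for ts in sorted(deliveries):
        latest_per_day[ts.date()] = ts
    return sorted(latest_per_day.values())


def processed_ratio(deliveries, processed):
    """Share of delivered batches that the ETL actually picked up."""
    return len(processed) / len(deliveries)


start = datetime(2023, 1, 1)
for interval in (24, 12):
    delivered = delivery_times(start, days=7, interval_hours=interval)
    processed = daily_pickup(delivered)
    ratio = processed_ratio(delivered, processed)
    status = "OK" if ratio == 1.0 else "ALERT: batches dropped"
    print(f"every {interval}h: {len(processed)}/{len(delivered)} processed -> {status}")
```

With a 24-hour delivery interval every batch is processed; at 12 hours, half of them never make it downstream, which is the kind of drift a volume or freshness check is meant to surface.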
Increased Data Adoption
According to a Google Cloud and Harvard Business Review report, 97% of surveyed industry leaders believe organization-wide access to data and analytics is critical to the success of their business. In fact, more than half of AutoTrader UK’s employees regularly engage with the data in their Looker dashboards at least once a month.
This is an awesome trend. However, more data consumers and more data analysts mean more people sitting on their hands when data downtime strikes.
Expectations of Data Consumers Are Increasing
And they have higher expectations than ever. They are accustomed to SaaS products that guarantee five 9’s of availability (99.999%), meaning they are down for only about five minutes a year. Honestly, I don’t know any data teams clearing that bar.
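For reference, the arithmetic behind an availability target is simple. The sketch below assumes a 365-day year and uses a hypothetical helper, `allowed_downtime_minutes`, to convert a number of nines into a yearly downtime budget.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (5 -> 99.999%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):,.2f} minutes per year")
```

Five nines leaves a budget of a little over five minutes per year, while three nines allows almost nine hours.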
Unfortunately, most data teams are evaluated based on a feeling. Either data consumers and executive leadership “feel” the team is doing well or poorly. That’s because nearly 60% of organizations don’t measure the annual financial cost of bad data, according to Gartner.
With high data consumer expectations and little quantitative data measuring performance, data downtime has severe consequences not just for organizations but for data teams as well.
Data Engineers Are Harder to Find
This isn’t just anecdotal evidence either. The Dice 2020 Tech Job Report said data engineer was the fastest-growing job in technology, with 50% year-over-year growth in the number of open positions, and the 2022 report puts the average salary at $117,295.
Data engineers are quickly becoming one of the most valuable assets a company has. Taking them offline to fix downtime is expensive, and doing it repeatedly risks them leaving for somewhere they can work on more interesting projects.
Data Quality Responsibilities Are Becoming Distributed
Distributing these responsibilities has the advantage of bringing data teams closer to the business and to the purpose of all relevant data operations; however, by its very nature, it also diffuses responsibility.
A decentralized setup puts a larger burden on clear communication and crisp processes. There is not one inbox or Slack channel to frantically ping when things are going wrong, and that’s scary.
Without strong processes, the diffusion of responsibility can extend the time it takes to resolve bad data or data downtime when these problems cross domains.
Crumbling of the Cookie
Companies are going to become more reliant on first-party rather than third-party data as a result of tightening regulations such as GDPR and the industry’s move away from the cookie.
That means they will need to collect more data, which will become more valuable since they can no longer rely on Google algorithms to help their ads find the right consumers. Therefore, data downtime is starting to have a larger impact on marketing operations, and the cost of bad data is rising.
Data Is Becoming a Product, and It’s Super Competitive
Data teams are creating sophisticated data products that are quickly becoming part of the customer offering and unlocking new value for their companies.
In some industries, this has become super competitive. If your team is not producing actionable insights, you will quickly get outperformed by someone who is.
I see this most frequently in the media space, where data has become a complete arms race. The scale of data teams and the investment in them is astronomical. It has been breathtaking to watch these companies move from hourly batching to 15 minutes, to every 5 minutes, and now starting to stream.
There is no room for bad data in this environment. During your data downtime, someone else is publishing the scoop, getting clicks, and gaining valuable insights into their audience.
An Ounce of Treatment Is Worth a Pound of Pain
When you consider the increasing cost of bad data and that most organizations have more data quality issues than they think, increasing your investment in data quality or data observability seems like a smart move to make.
Some data teams are very perceptive when it comes to internal signals that it’s time to invest in data quality (from migrating to a cloud data warehouse like Snowflake or Redshift to having the CEO yell), but external drivers like the ones mentioned above can get lost in the shuffle.
I recommend taking a proactive approach. Consider how you can invest in your people, processes, and technologies to mitigate the rising cost of bad data.
Published at DZone with permission of Will Robins. See the original article here.
Opinions expressed by DZone contributors are their own.