The cost of cleaning data is often beyond the comfort zone of businesses swamped with potentially dirty data. That clogs the pathways to trustworthy and compliant corporate data flow.
Few companies have the resources needed to develop tools for challenges like data observability at scale, according to Kyle Kirwan, co-founder and CEO of data observability platform Bigeye. As a result, many companies are essentially flying blind, reacting when something goes wrong rather than proactively addressing data quality.
A data trust provides a legal framework for managing shared data. It promotes collaboration through common rules for data security, privacy, and confidentiality, and it enables organizations to securely connect their data sources in a shared repository of data.
Bigeye brings data engineers, analysts, scientists, and stakeholders together to build trust in data. Its platform helps companies automate monitoring and anomaly detection and create SLAs to ensure data quality and reliable pipelines.
With complete API access, a user-friendly interface, and automated yet flexible customization, data teams can monitor quality, proactively detect and resolve issues, and ensure that every user can rely on the data.
Uber Data Experience
Two early members of the data team at Uber, Kirwan and Bigeye Co-founder and CTO Egor Gryaznov, set out to use what they learned building at Uber’s scale to create easier-to-deploy SaaS tools for data engineers.
Kirwan was one of Uber’s first data scientists and the first metadata product manager. Gryaznov was a staff-level engineer who managed Uber’s Vertica data warehouse and developed several internal data engineering tools and frameworks.
They realized the tools their teams were building to manage Uber’s massive data lake and thousands of internal data users were far ahead of what was available to most data engineering teams.
Automatically monitoring and detecting reliability issues within thousands of tables in data warehouses is no easy task. Companies like Instacart, Udacity, Docker, and Clubhouse use Bigeye to keep their analytics and machine learning working continually.
A Growing Field
Founding Bigeye in 2019, they recognized the growing problem enterprises face in deploying data into high-ROI use cases like operations workflows, machine learning-powered products and services, and strategic analytics and business intelligence-driven decision making.
The data observability space saw a number of entrants in 2021. Bigeye separated itself from that pack by providing users the ability to automatically assess customer data quality with more than 70 unique data quality metrics.
Thousands of separate anomaly detection models are trained on these metrics so that data quality problems, even the hardest to detect, never make it past the data engineers.
Last year, data observability burst onto the scene, with no fewer than 10 data observability startups announcing significant funding rounds.
This year, data observability will become a priority for data teams as they seek to balance the demand of managing complex platforms with the need to ensure data quality and pipeline reliability, Kirwan predicted.
Bigeye’s data platform is no longer in beta. Some enterprise-grade features are still on the roadmap, like complete role-based access control. But others, like SSO and in-VPC deployments, are available today.
The app is closed source, and so are the proprietary models used for anomaly detection. Bigeye is a big fan of open-source options but decided to develop its own models to achieve the performance goals it set internally.
Machine learning is used in a few key places to bring a unique blend of metrics to each table in a customer’s connected data sources. The anomaly detection models are trained on each of those metrics to detect abnormal behavior.
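Bigeye's models are proprietary, so the following is only a minimal sketch of the general idea of anomaly detection on a single table metric, not the company's method. It assumes a hypothetical daily row-count metric and swaps in a simple robust statistic, the modified z-score based on the median absolute deviation, in place of a trained model. The function name, the sample values, and the 3.5 threshold are all illustrative assumptions.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return indices of points whose modified z-score exceeds `threshold`.

    Uses the median and the median absolute deviation (MAD), which are
    less sensitive to the outliers we are trying to find than mean/stdev.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical points are identical; flag anything that differs.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical metric: rows loaded into one table per day.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 120, 1015]
print(mad_anomalies(daily_row_counts))  # [5], the day the load dropped to 120
```

A production system would track many such metrics per table and fit a separate model to each, which is how the count quickly reaches the thousands of models mentioned above.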
Three features built in at the end of 2021 automatically detect and alert on data quality issues and enable data quality SLAs.
The first, Deltas, makes it easy to compare and validate multiple versions of any dataset.
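How Deltas works internally is not described here. As a rough illustration of what comparing two versions of a dataset can involve, the sketch below assumes each version is a list of rows sharing a unique key column and reports which keys were added, removed, or changed. The function name, the `id` key, and the sample rows are hypothetical.

```python
def dataset_delta(old_rows, new_rows, key):
    """Compare two versions of a dataset keyed by a unique column."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Keys present in both versions whose row contents differ.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical versions of an orders table before and after a pipeline change.
v1 = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]
v2 = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 4, "total": 40}]
print(dataset_delta(v1, v2, key="id"))
# {'added': [4], 'removed': [3], 'changed': [2]}
```

At warehouse scale a comparison like this would more plausibly run as aggregate queries (row counts, null rates, per-column statistics) than as a row-by-row diff in memory.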
Issues, the second, brings multiple alerts together into a single timeline with valuable context about related issues. This makes it simpler to document past fixes and speeds up resolution.
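The grouping logic behind Issues is not public. One simple way to picture rolling related alerts into a single timeline is sketched below, under the assumption that alerts on the same table arriving within an hour of each other belong to the same issue. The alert fields, table names, and the one-hour window are hypothetical.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(hours=1)):
    """Group alerts on the same table into issues.

    An alert joins the table's open issue if it arrives within `window`
    of that issue's most recent alert; otherwise it starts a new issue.
    """
    issues = []
    open_issue = {}  # table name -> issue currently collecting alerts
    for alert in sorted(alerts, key=lambda a: a["time"]):
        issue = open_issue.get(alert["table"])
        if issue and alert["time"] - issue["alerts"][-1]["time"] <= window:
            issue["alerts"].append(alert)
        else:
            issue = {"table": alert["table"], "alerts": [alert]}
            issues.append(issue)
            open_issue[alert["table"]] = issue
    return issues


alerts = [
    {"table": "orders", "metric": "row_count", "time": datetime(2022, 1, 5, 9, 0)},
    {"table": "users", "metric": "null_rate", "time": datetime(2022, 1, 5, 9, 10)},
    {"table": "orders", "metric": "freshness", "time": datetime(2022, 1, 5, 9, 30)},
    {"table": "orders", "metric": "row_count", "time": datetime(2022, 1, 5, 12, 0)},
]
for issue in group_alerts(alerts):
    print(issue["table"], [a["metric"] for a in issue["alerts"]])
# orders ['row_count', 'freshness']
# users ['null_rate']
# orders ['row_count']
```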
The third, Dashboard, provides an overall view of the health of the data, helping to identify data quality hotspots, close gaps in monitoring coverage, and quantify a team’s improvements to reliability.
Eyeballing Data Warehouses
TechNewsWorld spoke with Kirwan to demystify some of the complexities his company’s data sniffing platform offers data scientists.
TechNewsWorld: What makes Bigeye’s approach innovative or cutting edge?
Kyle Kirwan: Data observability requires constant and complete knowledge of what is happening inside all the tables and pipelines in your data stack. It is similar to what SRE [site reliability engineering] and DevOps teams use to keep applications and infrastructure working around the clock. But it is reimagined for the world of data engineering and data science.
While data quality and data reliability have been an issue for decades, data applications are now critical to how many leading businesses run, because any loss of data, outage, or degradation can quickly result in lost revenue and customers.
Without data observability, data teams must constantly react to data quality issues and have to wrangle the data as they go to use it. A better solution is identifying the issues proactively and fixing the root causes.
How does trust impact the data?
Kirwan: Often, problems are discovered by stakeholders like executives who do not trust their often-broken dashboard. Or users get confusing results from in-product machine learning models. The data engineers can better get ahead of the problems and prevent business impact if they are alerted early enough.
How is this concept different from similar-sounding technologies such as unified data management?
Kirwan: Data observability is one core function within data operations (think: data management). Many customers look for best-of-breed solutions for each of the functions within data operations. This is why technologies like Snowflake, Fivetran, Airflow, and dbt have been exploding in popularity. Each is considered an important part of “the modern data stack” rather than a one-size-fits-none solution.
Data observability, data SLAs, ETL [extract, transform, load] code version control, data pipeline testing, and other techniques should be used in tandem to keep modern data pipelines all working smoothly, just as high-performing software engineering and DevOps teams use their sister techniques.
What role do data pipelines and DataOps play with data observability?
Kirwan: Data observability is closely related to DataOps and the emerging practice of data reliability engineering. DataOps refers to the broader set of all operational challenges that data platform owners will face. Data reliability engineering is a part of DataOps, but only a part, just as site reliability engineering is related to, but does not encompass all of, DevOps.
Data observability could have benefits to data security, as it could be used to identify unexpected changes in query volume on different tables or changes in behavior to ETL pipelines. However, data observability would not likely be a complete data security solution on its own.
What challenges does this technology face?
Kirwan: These challenges cover problems like data discovery and governance, cost tracking and management, and access controls. They also cover how to manage an ever-growing number of queries, dashboards, and ML features and models.
Reliability and uptime are certainly challenges for which many DevOps teams are responsible. But they are often also charged with other aspects like developer velocity and security considerations. Within these two areas, data observability enables data teams to know whether their data and data pipelines are error-free.
What are the challenges of implementing and maintaining data observability technology?
Kirwan: Effective data observability systems should integrate into the workflows of the data team. This enables them to focus on growing their data platforms rather than constantly reacting to data issues and putting out data fires. A poorly tuned data observability system, however, can result in a deluge of false positives.
An effective data observability system should also take much of the maintenance out of testing for data quality issues by automatically adapting to changes in the business. A poorly optimized system, however, may fail to correct for changes in the business, or may overcorrect for them, requiring time-consuming manual tuning.
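The difference between a check that adapts to the business and one that needs manual tuning can be illustrated with a small sketch. It is an assumption-laden example, not Bigeye's approach: a fixed threshold is compared with a rolling baseline that flags a value only when it strays more than three standard deviations from the previous seven observations. The metric values, the threshold of 150, the window size, and the function name are all hypothetical.

```python
from statistics import mean, stdev


def rolling_alerts(values, window=7, k=3.0):
    """Flag points more than k standard deviations from the trailing window."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Hypothetical daily order volume that doubles on day 7 after a product launch.
volume = [100, 102, 98, 101, 99, 103, 100,
          200, 202, 198, 201, 199, 203, 200, 201]

# A hand-set threshold keeps firing every day after the business changes.
static = [i for i, v in enumerate(volume) if v > 150]
print(static)                  # [7, 8, 9, 10, 11, 12, 13, 14]

# The rolling baseline flags the shift once, then adapts to the new level.
print(rolling_alerts(volume))  # [7]
```

The same mechanism shows the overcorrection risk: while the window still contains the old level, its standard deviation is inflated, so a genuine problem in the days right after the shift could go unnoticed.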
Data observability can also be taxing on the data warehouse if not optimized properly. The Bigeye teams have experience optimizing data observability at scale to ensure that the platform does not impact data warehouse performance.