You don’t know Jack (about your data)

Ruben Orduz
5 min read · Jul 30, 2022


You think you do, you believe you do, except, likely, you actually don’t.

Photo by: Mika Baumeister

Unless you have a large army of talented data engineers, data scientists, ETL specialists and seasoned practitioners in your company, there is a really good chance you actually know very little about your data. You may have been lucky and gotten away with it so far. But when you look at the litany of high-profile “bad data” cases resulting in losses of, literally, hundreds of millions of dollars (see: Zillow, Unity Software), it’s not a question of IF, but WHEN and HOW incomplete or illusory knowledge about your data will rear its ugly head and adversely affect your business.

A multifaceted problem

The challenge of gaining a deep understanding of your data is not insignificant. It would be erroneous and disingenuous to think you can just add this tool here or that smoke-test check there, or configure your data pipelines this way or some such, and believe you have solid backing for your understanding of your data. In reality, the solution is a combination of those things plus several others, including organizational and human factors. For example, having an organizational data policy is for naught if you don’t have the right processes and tooling to measure and analyze status and progress. On the other hand, having all the tooling in the world won’t help if there aren’t clear and explicit data quality goals, posture and contingencies. And none of that matters if your ML/AI or analysis models make (implicit or explicit) fundamental assumptions about what the data “looks like”. And even that won’t matter if you’re not looking at how your data is “mutating” or “evolving” over time (APIs change, downstream providers change, data sources change, behaviors change, etc.). In short, you cannot solve a multifaceted problem with a single-dimensional solution.

A multifaceted solution

Because every organization is different, its business is different, and its gathering and consumption of data is very different, it would be a disservice to give a prescriptive solution here. Instead, below I will attempt to lay out what is, in my opinion, a generally advisable approach to tackling the data-knowledge problem.

  • Data Policy/Governance: while there is a bit of a chicken-and-egg problem between setting organizational data quality goals and knowing what your data looks like, you could start by setting goals that make sense for the organization; for example: “No less than X% of these records must meet certain criteria at Y stage in our data pipeline”, or “Unless X and Y tables meet Z criteria, they should not be moved to model serving” (a minimal sketch of how such a gate might be encoded follows this list). You could alternatively provide higher-level guidance on what the data flow should look like, the gating points therein, and the roles and teams involved. A sensible policy also describes ownership and responsibilities, both for the data itself and for the processes around it. As you add relevant tooling to your data pipelines and gain additional insight into the “shape” and scope of your data, you should recalibrate your organization’s goals and metrics to reflect that.
  • Establish business requirements (regarding data): this should be done in collaboration with product management, engineering, data ops, data science, ML ops as appropriate, and any other stakeholder in the data-to-business relationship. The outcome of this stage is an agreed-upon “contract” about the data needed for optimal product functionality and the minimum acceptable quality therein. It should also be well understood what priority different kinds of data have and what the blast radius in the product is if they fail. For example, perhaps a “who to follow” recommendation engine isn’t super important for the business and the product can function without it, but the inferred shopping-preference data and model are business critical.
  • Make use of data observability and monitoring tooling. You may already have such tooling and instrumentation, but it should be focused on the data pertaining to the priorities established above. You shouldn’t try to monitor all your data, or random aspects of it. Monitoring should be deliberate and targeted, and it should produce metadata that can be consumed by both the business and the data teams to meet the goals and priorities set forth. It should also feed back anomalies, repeated failures, etc. to the pertinent team or teams (see the monitoring sketch below).
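
To make the kind of goal in the first bullet concrete (“no less than X% of records must meet certain criteria at Y stage”), here is a minimal sketch of how such a gate could be encoded. The 95% threshold, the order_total column and the load_stage_output helper are all hypothetical, and it assumes the stage output fits in a pandas DataFrame:

```python
import pandas as pd

def load_stage_output(stage: str) -> pd.DataFrame:
    # Placeholder: in practice this would read from your warehouse, lake or pipeline stage.
    raise NotImplementedError

def meets_policy(df: pd.DataFrame, min_pass_rate: float = 0.95) -> bool:
    """Return True if enough records satisfy the agreed criteria."""
    # Criterion (hypothetical): order_total must be present and positive.
    valid = df["order_total"].notna() & (df["order_total"] > 0)
    pass_rate = float(valid.mean())  # fraction of rows meeting the criterion
    print(f"pass rate: {pass_rate:.2%} (required: {min_pass_rate:.0%})")
    return pass_rate >= min_pass_rate

# Gate the stage: refuse to promote data that misses the agreed goal.
# df = load_stage_output("post-transform")
# if not meets_policy(df):
#     raise RuntimeError("Stage output is below the agreed data-quality threshold")
```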

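For the monitoring point, the emphasis is on deliberate, targeted metadata rather than watching everything. A minimal sketch, assuming batches arrive as pandas DataFrames; the metric names, the 5% null-rate threshold and the notify_team helper are hypothetical:

```python
import json
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_observability")

def emit_batch_metadata(df: pd.DataFrame, dataset: str) -> dict:
    """Emit a small, deliberate set of metrics for a prioritized dataset."""
    metadata = {
        "dataset": dataset,
        "row_count": int(len(df)),
        "null_rate": {col: float(df[col].isna().mean()) for col in df.columns},
        "duplicate_rows": int(df.duplicated().sum()),
    }
    log.info(json.dumps(metadata))  # consumable by both business and data teams
    return metadata

def null_rate_anomalies(metadata: dict, max_null_rate: float = 0.05) -> list:
    """Flag columns whose null rate exceeds the agreed threshold."""
    return [col for col, rate in metadata["null_rate"].items() if rate > max_null_rate]

# offenders = null_rate_anomalies(emit_batch_metadata(batch_df, "user_preferences"))
# if offenders:
#     notify_team("data-ops", f"null-rate anomaly in columns: {offenders}")  # hypothetical helper
```
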
These three stages should be in a feedback loop, at least until teams and stakeholders can “grok” what the data is doing, where the hotspots are, and what shortcomings, if any, need to be addressed. After a few iterations, the direction, the data “contracts” between teams and product, and so on will come into alignment.

(To be noted: I’m assuming the org has different teams for different roles in this scenario for the sake of illustration; in smaller shops or upstarts it will most likely be a few people wearing different hats.)

But this is only half the equation. While establishing the underpinnings of your data posture and data science/engineering discipline is crucial, so is gaining more insight into the shape and other aspects of your data, so that you have a much more complete picture and can make intelligent decisions about it. I like to think of this area as a source of both insight and assurance.

  • Data Lineage: instrumentation and tooling should be added to your data pipelines so that you can see how the shape, size and characteristics of the data change through the dataflow. For instance, you may be using a “black box” off-the-shelf transformation tool and want to observe how your data changes from input to output (the profiling sketch after this list can be pointed at both sides of such a transform).
  • Data profiling: possibly the most valuable of all the steps mentioned thus far. This is when you deploy tooling that looks at your “at-rest” data and creates metadata and documentation about it. For example, it will output table cardinality, column types, how many empty records or empty columns there are, any unexpected nulls (and how many), typical numerical value ranges, etc. You can take a deep look at what your data looks like pre- and post-processing (assuming the “shape” doesn’t change between those steps). A minimal sketch follows this list.
  • Data quality tests: with the insight gained by profiling your data, you can then write automated tests against it. You can set pass/fail criteria depending on what’s important to the teams and the business. You can gate releases on all the tests passing. You can catch and stop anomalous data before it’s fed to the models and analysis, again, depending on the agreed-upon parameters discussed above. For example, it might be OK in an online store’s data for someone’s online-status field to be null, whereas it’s a show stopper if the inferred sentiment is not populated, or is populated with a sentiment outside the expected set (say, it should only be ‘POS’, ‘NEG’ or ‘NEU’ and it shows up as ‘OK’). The second sketch below turns exactly that example into tests.
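
A minimal profiling sketch, assuming the “at-rest” data (or a sample of it) fits in a pandas DataFrame. Dedicated profilers produce far richer output, but the metadata looks roughly like this, and pointing it at both the input and output of a black-box transform gives a crude lineage view:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Produce simple per-column metadata: type, nulls, cardinality and value ranges."""
    rows = []
    for col in df.columns:
        series = df[col]
        numeric = pd.api.types.is_numeric_dtype(series)
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
            "distinct_values": int(series.nunique(dropna=True)),
            "min": series.min() if numeric else None,
            "max": series.max() if numeric else None,
        })
    return pd.DataFrame(rows)

# Profile both sides of a "black box" transform to see how the data changed:
# print(profile(raw_df))          # what went in
# print(profile(transformed_df))  # what came out
```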

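And the sentiment example from the last bullet, turned into minimal automated tests (the column names and allowed values are the ones assumed above; in practice these checks would live in your test suite or in a data-quality framework):

```python
import pandas as pd

ALLOWED_SENTIMENTS = {"POS", "NEG", "NEU"}

def test_online_status_may_be_null(df: pd.DataFrame) -> None:
    # Nulls are acceptable here per the agreed contract; the column just has to exist.
    assert "online_status" in df.columns

def test_sentiment_is_populated_and_valid(df: pd.DataFrame) -> None:
    # Show stopper: every record must carry one of the expected sentiment labels.
    assert df["inferred_sentiment"].notna().all(), "unpopulated sentiment found"
    unexpected = set(df["inferred_sentiment"].dropna().unique()) - ALLOWED_SENTIMENTS
    assert not unexpected, f"unexpected sentiment values: {unexpected}"

# Gate the release (e.g. run these via pytest) before the data reaches the models.
```
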
While having all these processes and tooling in place is not a bulletproof guarantee that a catastrophic data error won’t happen, it will considerably lower the chances of such an event, while at the same time giving everyone involved solid insight into their data, so that decisions and statements about the data are fact-based and metadata-driven, not based on assumptions, casual inspection, or defensive coding techniques alone inside whatever component consumes your data.


Written by Ruben Orduz

Software, 3D Printing, product reviews, data, and all things AI/ML.
