Garbage In, Garbage Out: Why Data Prep is the Real Work of Machine Learning

There is a romanticised version of Machine Learning (ML) that exists in movies and marketing pitch decks. In this version, the hard work is the “AI” itself: complex neural networks, cutting-edge algorithms, and futuristic code.

The reality, as most data scientists will tell you, is quite different. A common rule of thumb puts it at roughly 80% data preparation and 20% modeling.

If you are embarking on an ML project, your instinct might be to rush toward import sklearn or import tensorflow. This series is here to tell you to stop. Before you tune a single hyperparameter, you need to fix your foundation.

The Principle of GIGO (Garbage In, Garbage Out)

Machine Learning models are not magic; they are math. They do not “understand” the world; they find patterns in the numbers you feed them.

If you feed a model noisy, biased, or broken data (“Garbage In”), it will confidently give you wrong predictions (“Garbage Out”). It doesn’t matter if you use the most expensive GPU cluster or the latest Transformer architecture: a model trained on bad data is simply a powerful engine in a broken car.
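To make the principle concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the toy one-dimensional dataset, the "midpoint between class means" classifier, and the way the labels are corrupted (every true positive with a value between 50 and 79 is recorded as a negative). It is not a realistic model, just a way to watch the same learning rule produce a worse result when the labels are wrong.

```python
from statistics import mean


def fit_threshold(xs, ys):
    """Learn a 1-D decision threshold: the midpoint between the two class means."""
    mean_neg = mean(x for x, y in zip(xs, ys) if y == 0)
    mean_pos = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean_neg + mean_pos) / 2


def accuracy(threshold, xs, ys):
    """Share of points where 'x above threshold' matches the given label."""
    correct = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return correct / len(xs)


# Toy data (assumed): the true label is 1 whenever x >= 50.
xs = list(range(100))
true_labels = [1 if x >= 50 else 0 for x in xs]

# "Garbage In": positives with x in 50..79 were wrongly recorded as negatives.
garbage_labels = [1 if x >= 80 else 0 for x in xs]

clean_threshold = fit_threshold(xs, true_labels)
garbage_threshold = fit_threshold(xs, garbage_labels)

# Both models are scored against the true labels.
clean_acc = accuracy(clean_threshold, xs, true_labels)
garbage_acc = accuracy(garbage_threshold, xs, true_labels)

print(f"clean labels:   threshold={clean_threshold}, accuracy={clean_acc}")
print(f"garbage labels: threshold={garbage_threshold}, accuracy={garbage_acc}")
```

The algorithm is identical in both runs. Only the labels changed, and the learned threshold moved from 49.5 to 64.5, dropping accuracy from 1.0 to 0.85.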

Step 1: Define the Problem (Not the Code)

Before you collect a single byte of data, you must define the problem you are trying to solve.

  • What are we predicting? (e.g., Customer Churn)

  • Does the data actually capture this? (e.g., Do we have historical data on customers who actually churned, or just those who complained?)
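The second question can be checked before any modeling starts. The sketch below assumes your records are plain dictionaries; the field names ("churned", "complained", "cancelled") and the helper name summarise_target are hypothetical. It simply counts how many rows carry the label you want to predict and whether more than one outcome is present.

```python
from collections import Counter


def summarise_target(records, target_key):
    """Count label values and missing labels for a candidate prediction target."""
    labels = [record.get(target_key) for record in records]
    missing = sum(label is None for label in labels)
    counts = Counter(label for label in labels if label is not None)
    return {
        "rows": len(records),
        "missing": missing,
        "counts": dict(counts),
        # A target with fewer than two observed outcomes cannot be learned.
        "usable": len(counts) >= 2,
    }


# Hypothetical customer records: "complained" is always known, "churned" is not.
customers = [
    {"id": 1, "complained": True, "churned": True},
    {"id": 2, "complained": True, "churned": False},
    {"id": 3, "complained": False, "churned": None},
    {"id": 4, "complained": False, "churned": False},
]

churn_summary = summarise_target(customers, "churned")
cancelled_summary = summarise_target(customers, "cancelled")  # field never recorded

print(churn_summary)
print(cancelled_summary)
```

If the field you care about is missing everywhere, or only ever takes one value, the data does not capture the problem, no matter how much of it you have.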

Know Your Data’s Lineage

Data rarely arrives on a silver platter. It comes from messy SQL databases, user-generated logs, or third-party APIs. Understanding Data Lineage, where your data was born and how it traveled to you, is critical.
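One lightweight way to keep track of lineage is to write it down next to the dataset. The sketch below is only an illustration: the LineageRecord class, its fields, and the example source and step names are all hypothetical, and a real project might use a data catalogue or pipeline tool instead.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """A minimal, hypothetical note of where a dataset came from and what touched it."""
    dataset: str
    source: str          # where the data was born
    collected_via: str   # how it was captured
    steps: list = field(default_factory=list)  # how it traveled to you

    def add_step(self, description):
        self.steps.append(description)
        return self  # allow chaining

    def describe(self):
        return " -> ".join([self.source, *self.steps, self.dataset])


record = (
    LineageRecord("churn_training_set", "crm_sql_export", "nightly batch export")
    .add_step("drop test accounts")
    .add_step("join support tickets")
)

lineage_summary = record.describe()
print(lineage_summary)
```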

Ask yourself:

  1. Is this data a proxy? (Are you using “zip code” as a proxy for “income”? A proxy like that can carry significant bias into the model.)

  2. How was it collected? (Was it a mandatory form field? If users were forced to select an option, did they just pick the first one?)
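Both questions can be given a rough first look with a few lines of code. The sketch below uses made-up data and hypothetical helper names (mean_by_group, first_option_share), and the "twice the uniform share" cut-off is an arbitrary assumption rather than an established rule. The first check shows how strongly a field like zip code tracks income; the second shows how often users picked the first option in a forced-choice field.

```python
from collections import Counter, defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average of value_key per distinct group_key (a quick proxy check)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


def first_option_share(responses, options):
    """Share of responses equal to the first option shown in the form."""
    return Counter(responses)[options[0]] / len(responses)


# Made-up rows: if income differs sharply by zip code, zip code is acting as a proxy.
rows = [
    {"zip_code": "10001", "income": 90000},
    {"zip_code": "10001", "income": 110000},
    {"zip_code": "20002", "income": 30000},
    {"zip_code": "20002", "income": 40000},
]
income_by_zip = mean_by_group(rows, "zip_code", "income")

# Made-up answers to a mandatory "industry" dropdown.
options = ["Agriculture", "Finance", "Retail", "Tech"]
responses = ["Agriculture"] * 6 + ["Finance"] * 2 + ["Retail", "Tech"]
share = first_option_share(responses, options)

# Assumed rule of thumb: flag the field if the first option is chosen
# more than twice as often as an even spread would suggest.
suspicious = share > 2 * (1 / len(options))

print(income_by_zip)
print(share, suspicious)
```

Neither check proves anything on its own. A high first-option share might be genuine, and a zip code pattern might be harmless for your use case, but both are signals worth investigating before the field goes into a model.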

The Road Ahead

In this series, we are going to walk through the complete “Raw to Ready” pipeline. We won’t just talk theory; we will cover the practical steps of turning messy real-world data into a clean signal.

Next up, we dive into the most underrated part of the ML stack: Data Labeling. How do you teach a machine what “true” looks like?
