Garbage In, Garbage Out: Why Data Prep is the Real Work of Machine Learning

January 20th, 2026

There is a romanticised version of Machine Learning (ML) that exists in movies and marketing pitch decks. In this version, the hard work is the “AI” itself: complex neural networks, cutting-edge algorithms, and futuristic code.

The reality, as most Data Scientists will tell you, is quite different. A common rule of thumb is that a project is roughly 80% data preparation and 20% modeling.

If you are embarking on an ML project, your instinct might be to rush toward import sklearn or import tensorflow. This series is here to tell you to stop. Before you tune a single hyperparameter, you need to fix your foundation.

The Principle of GIGO (Garbage In, Garbage Out)

Machine Learning models are not magic; they are math. They do not “understand” the world; they find patterns in the numbers you feed them.

If you feed a model noisy, biased, or broken data (“Garbage In”), it will confidently give you wrong predictions (“Garbage Out”). It doesn’t matter if you use the most expensive GPU cluster or the latest Transformer architecture: a model trained on bad data is simply a powerful engine in a broken car.
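To make GIGO concrete, here is a minimal sketch using only the Python standard library. The toy data, and the idea that one missing value was silently stored as 0, are assumptions made up for illustration. The same least-squares fit is run on a clean and a corrupted copy of the data, and the model has no way of knowing which one it was given.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Evaluate a fitted (slope, intercept) pair at x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data: the true relationship is y = 2x.
xs = [1, 2, 3, 4, 5]
clean_ys = [2, 4, 6, 8, 10]
# Same data, but the last value was missing and got stored as 0.
dirty_ys = [2, 4, 6, 8, 0]

clean_model = fit_line(xs, clean_ys)
dirty_model = fit_line(xs, dirty_ys)

print(predict(clean_model, 6))  # 12.0
print(predict(dirty_model, 6))  # 4.0
```

One bad record out of five is enough to flatten the fitted line completely in this example. The code is identical in both runs; only the data changed.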

Step 1: Define the Problem (Not the Code)

Before you collect a single byte of data, you must define the problem you are actually trying to solve.

What are we predicting? (e.g., Customer Churn)
Does the data actually capture this? (e.g., Do we have historical data on customers who actually churned, or just those who complained?)
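The two questions above can be turned into a quick check before any modeling starts. This is a rough sketch, not a prescribed method: the records and the field names `churned` and `complained` are hypothetical. It counts how many rows actually carry the outcome you want to predict, and measures how often a tempting stand-in field agrees with it.

```python
from collections import Counter

# Hypothetical customer records; None means the outcome was never recorded.
customers = [
    {"id": 1, "complained": True, "churned": True},
    {"id": 2, "complained": True, "churned": False},
    {"id": 3, "complained": False, "churned": True},
    {"id": 4, "complained": False, "churned": False},
    {"id": 5, "complained": False, "churned": None},
]


def label_audit(records, label):
    """Count how often each value of the label field occurs (None = missing)."""
    return Counter(r.get(label) for r in records)


def agreement(records, candidate, label):
    """Fraction of labelled rows where the candidate field equals the label."""
    labelled = [r for r in records if r.get(label) is not None]
    if not labelled:
        return None  # nothing to compare against
    matches = sum(r.get(candidate) == r[label] for r in labelled)
    return matches / len(labelled)


print(label_audit(customers, "churned"))
print(agreement(customers, "complained", "churned"))  # 0.5
```

In this made-up sample, complaints match real churn only half the time, so training on complaints would answer a different question from the one you asked.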

Know Your Data’s Lineage

Data rarely arrives on a silver platter. It comes from messy SQL databases, user-generated logs, or third-party APIs. Understanding Data Lineage, where your data was born and how it traveled to you, is critical.

Ask yourself:

Is this data a proxy? (Are you using “zip code” as a proxy for “income”? That can introduce serious bias.)
How was it collected? (Was it a mandatory form field? If users were forced to select an option, did they just pick the first one?)
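The collection question can be partly checked from the data itself. The sketch below assumes you know the order in which options appeared on the form. The option names, the responses, and the "twice the uniform share" threshold are all hypothetical choices for illustration, not an established rule.

```python
from collections import Counter


def option_shares(responses, options):
    """Return the share of responses that chose each option."""
    if not responses:
        raise ValueError("no responses to analyse")
    counts = Counter(responses)
    total = len(responses)
    return {opt: counts[opt] / total for opt in options}


def first_option_suspicious(responses, options, factor=2.0):
    """Flag the field if the first listed option is chosen more than
    `factor` times as often as a uniform spread would predict."""
    shares = option_shares(responses, options)
    uniform_share = 1 / len(options)
    return shares[options[0]] > factor * uniform_share


# Hypothetical mandatory dropdown, in the order shown to users.
options = ["Agriculture", "Energy", "Finance", "Retail"]
responses = ["Agriculture"] * 7 + ["Energy", "Finance", "Retail"]

print(option_shares(responses, options)["Agriculture"])  # 0.7
print(first_option_suspicious(responses, options))       # True
```

A flag here is a prompt to investigate, not proof of bad data: the first option may simply be the most common honest answer.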

The Road Ahead

In this series, we are going to walk through the complete “Raw to Ready” pipeline. We won’t just talk theory; we will cover the practical steps of turning messy real-world data into a clean signal.

Next up: We dive into the most underrated part of the ML stack, Data Labeling. How do you teach a machine what “true” looks like?

Get in touch if you would like help getting your data ready for Data Engineering or ML.
