Exploratory Data Analysis (EDA)

Sherlock Holmes Mode: Getting to Know Your Data

In Part 2, we discussed the importance of labelling your data to create a ground truth. Now that we have our dataset, the temptation is to immediately start “fixing” it.

However, you cannot fix what you do not understand.

This stage is called Exploratory Data Analysis (EDA). Think of this as the interview phase. You are interviewing your data to learn its secrets, its quirks, and its potential problems.

Visualising the Distribution

The first step is always visualisation. You need to see the shape of your data. If you simply calculate the average (mean) of a column, you might miss the full story.

For example, if you are analysing house prices, a few massive mansions could skew your average significantly. By using Histograms and Box Plots, you can visualise the spread. Is the data bell-shaped (normal distribution)? Or is it heavily skewed to one side? Understanding this shape helps you choose the right statistical tools later on.
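To make the mansion effect concrete, here is a minimal sketch using only the Python standard library. The prices are invented for illustration (in thousands), and `text_histogram` is a hypothetical helper, not a library function. In practice you would more likely reach for a plotting library, but the idea is the same: compare the mean with the median, then look at how the values are spread across bins.

```python
from statistics import mean, median

# Hypothetical house prices in thousands; the last two are the "mansions".
prices = [210, 225, 240, 250, 255, 260, 275, 290, 2400, 3100]

BIN_WIDTH = 500

def text_histogram(values, bin_width):
    """Count values per fixed-width bin and return {bin_start: count}."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

avg = mean(prices)    # pulled upwards by the two mansions
mid = median(prices)  # stays close to the "typical" house
hist = text_histogram(prices, BIN_WIDTH)

print(f"mean = {avg:.1f}, median = {mid:.1f}")
for start, count in hist.items():
    # One '#' per house in the bin gives a rough picture of the shape.
    print(f"{start:>5}-{start + BIN_WIDTH - 1:<5} {'#' * count}")
```

A mean far above the median, together with a long thin tail in the histogram, is a quick sign that the data is skewed to the right rather than bell-shaped.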

The Outlier Dilemma: Error or Insight?

During EDA, you will almost certainly find data points that do not fit. These are outliers.

If you are analysing customer ages and find a value of “200”, that is almost certainly a data-entry error to be corrected or removed. However, if you are analysing credit card transactions and see a massive purchase, that might not be an error. That might be the fraud you are trying to predict.

Key takeaway: Never delete outliers blindly. Investigate them. They often hold the most valuable information in the entire dataset.
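One common way to surface candidates for investigation is the interquartile range (IQR) rule, which is also what a box plot uses to draw its whiskers. The sketch below is an assumption about how you might apply it, not a prescribed method: the ages are made up, `flag_outliers` is a hypothetical helper, and the multiplier of 1.5 is a convention rather than a law. Note that it only flags values; deciding whether each one is an error or an insight is still your job.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, flagged) using the IQR rule.

    Values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged for review.
    Nothing is deleted here.
    """
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical customer ages, including one impossible value.
ages = [23, 31, 35, 38, 41, 44, 47, 52, 58, 200]

lower, upper, flagged = flag_outliers(ages)
print(f"fences: {lower:.2f} to {upper:.2f}")
print("review these values:", flagged)
```

The same function applied to transaction amounts would flag the massive purchase too, which is exactly why the flagged list should go to a human (or a domain rule) instead of straight to the bin.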

Finding Relationships

Finally, we look for correlations. How do different variables interact?

A Correlation Matrix (often visualised as a heatmap) allows you to spot redundant features. If “variable A” and “variable B” move in near-perfect sync, you likely do not need both. Feeding the model redundant information can slow down training and, in some models, contribute to overfitting.
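As a rough illustration of spotting redundant features, the sketch below computes the Pearson correlation for every pair of columns and reports the pairs above a threshold. Everything here is assumed for the example: the column names, the data, the `redundant_pairs` helper, and the 0.95 cut-off. A dataframe library would give you the full matrix in one call, but the logic underneath looks like this.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical housing features; the two area columns are the same
# measurement in different units, so they should move together.
columns = {
    "area_sq_m": [50, 65, 80, 95, 120, 150],
    "area_sq_ft": [538, 700, 861, 1023, 1292, 1615],
    "age_years": [30, 5, 42, 12, 8, 25],
}

pairs = redundant_pairs(columns)
for a, b, r in pairs:
    print(f"{a} and {b} look redundant (r = {r})")
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove two variables are unrelated.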

Once you understand the shape and structure of your data, you are finally ready to pick up the mop and bucket.

Next up: Part 4 covers Data Cleaning. We look at how to handle missing values and scrub the dataset until it shines.
