Data cleaning is an essential first step in any data analysis pipeline: it ensures that data is consistent, well formatted, and usable. This guide walks you through ten quick Python one-liners for common data cleaning tasks using sample data. Each one tackles a typical data quality issue such as duplicates, inconsistent formats, or missing entries.

To follow along, you should have some familiarity with list and dictionary comprehensions in Python. Let’s dive in!


Sample Data

We’ll start with some sample data that has various common issues to address:

data = [
    {"name": "alice smith", "age": 30, "email": "[email protected]", "salary": 50000.00, "join_date": "2022-03-15"},
    {"name": "bob gray", "age": 17, "email": "bob@not-an-email", "salary": 60000.00, "join_date": "invalid-date"},
    {"name": "charlie brown", "age": None, "email": "[email protected]", "salary": -1500.00, "join_date": "15-09-2022"},
    {"name": "dave davis", "age": 45, "email": "[email protected]", "salary": 70000.00, "join_date": "2021-07-01"},
    {"name": "eve green", "age": 25, "email": "[email protected]", "salary": None, "join_date": "2023-12-31"},
]

1. Capitalize Strings

Ensuring consistency in string formats is key to clean data. Let’s convert the name fields to title case, so that each word starts with a capital letter:

data = [{**d, "name": d["name"].title()} for d in data]

2. Convert Data Types

It’s often necessary to ensure that data types are correct across your dataset. Here, we’ll convert numeric age values to integers, defaulting to 25 when the value is missing or not numeric:

data = [{**d, "age": int(d["age"]) if isinstance(d["age"], (int, float)) else 25} for d in data]
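The one-liner above only accepts values that are already numeric, so an age stored as a string such as "41" would also fall back to the default. A minimal sketch of a conversion that actually attempts `int()` and falls back on failure is shown below. The helper name `to_int` and the small `records` list are assumptions made for illustration; they are not part of the original one-liners.

```python
def to_int(value, default=25):
    """Return value converted to int, or default if the conversion fails."""
    # `to_int` is a hypothetical helper introduced for illustration only.
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        # TypeError: None; ValueError: "n/a" or NaN; OverflowError: infinity
        return default

# Illustrative records, not the article's sample data
records = [{"age": 30}, {"age": "41"}, {"age": None}, {"age": "n/a"}]
converted = [{**r, "age": to_int(r["age"])} for r in records]
```

Keeping the try/except in a small function lets the comprehension stay a one-liner while still handling string input.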

3. Validate Numeric Ranges

Make sure numeric values like age fall within realistic ranges. Here, we restrict ages to 18-60, using a default if out of range:

data = [{**d, "age": d["age"] if isinstance(d["age"], int) and 18 <= d["age"] <= 60 else 25} for d in data]

4. Validate Email

To check that email addresses have a basic valid shape (they contain both an "@" and a "."), we’ll replace invalid ones with a placeholder:

data = [{**d, "email": d["email"] if "@" in d["email"] and "." in d["email"] else "[email protected]"} for d in data]
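The check above accepts any string that contains both characters, including values like "@." or "a.b@c". If you need something slightly stricter, a regular expression is one option. The sketch below uses a deliberately simple pattern and a hypothetical helper named `is_plausible_email`; both are assumptions for illustration and not a complete validator.

```python
import re

# Simple pattern: non-empty local part, "@", and a domain containing a dot.
# This is an illustrative assumption, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_plausible_email(value):
    """Return True if value is a string that matches the simple pattern."""
    return isinstance(value, str) and EMAIL_RE.fullmatch(value) is not None

# Illustrative inputs, not the article's sample data
samples = ["user@example.com", "bob@not-an-email", "a.b@c", "@.", None]
results = [is_plausible_email(s) for s in samples]
```

In the one-liner, the condition `"@" in d["email"] and "." in d["email"]` could then be swapped for `is_plausible_email(d["email"])`.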

5. Handle Missing Values

Fill missing values in the dataset. Here, we replace missing salary values with a default of 30,000:

data = [{**d, "salary": d["salary"] if d["salary"] is not None else 30000.00} for d in data]
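A fixed default is simple, but it can distort later statistics. One common alternative is to fill gaps with a value derived from the data itself, such as the median of the known salaries. The following is a rough sketch under that assumption; the `records` list and the names `known`, `fallback`, and `filled` are illustrative.

```python
from statistics import median

# Illustrative records, not the article's sample data
records = [
    {"name": "a", "salary": 50000.0},
    {"name": "b", "salary": None},
    {"name": "c", "salary": 70000.0},
    {"name": "d", "salary": 60000.0},
]

# Collect the salaries that are present, then use their median as the fill value
known = [r["salary"] for r in records if r["salary"] is not None]
fallback = median(known) if known else 0.0  # 0.0 only if no salary is known

filled = [
    {**r, "salary": r["salary"] if r["salary"] is not None else fallback}
    for r in records
]
```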

6. Standardize Date Formats

When working with dates, it’s helpful to have a consistent format. Let’s standardize the join_date fields:

from datetime import datetime

data = [{**d, "join_date": (lambda x: (datetime.strptime(x, '%Y-%m-%d').date() if len(x) == 10 and x[4] == '-' else datetime.strptime(x, '%d-%m-%Y').date()) if x and 'invalid-date' not in x else '2023-01-01')(d['join_date'])} for d in data]
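A one-liner that branches on the string's shape gets hard to read and only handles the literal text "invalid-date". A more forgiving approach is to try each known format in turn and fall back to a default when none match. The sketch below assumes a hypothetical helper `parse_join_date` and a `KNOWN_FORMATS` tuple; unlike the one-liner, it returns a `date` object for the default as well, so every record ends up with the same type.

```python
from datetime import date, datetime

# Formats that appear in the sample data; extend as needed
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%m-%Y")

def parse_join_date(text, default=date(2023, 1, 1)):
    """Try each known format in turn; return default if none match."""
    # `parse_join_date` is a hypothetical helper introduced for illustration.
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except (TypeError, ValueError):
            # TypeError: text is not a string; ValueError: format mismatch
            continue
    return default

raw = ["2022-03-15", "15-09-2022", "invalid-date", None]
parsed = [parse_join_date(x) for x in raw]
```

With this helper, the comprehension shrinks to `[{**d, "join_date": parse_join_date(d["join_date"])} for d in data]`.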

7. Remove Negative Values

Ensure that numeric values, like salary, are non-negative by setting negative values to zero:

data = [{**d, "salary": max(d["salary"], 0)} for d in data]

8. Check for Duplicates

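Remove duplicate records. Note that this approach only drops records whose fields are all identical, requires every value to be hashable, and does not preserve the original order: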

data = {tuple(d.items()) for d in data}  # Convert to set to remove duplicates
data = [dict(t) for t in data]  # Convert back to list of dictionaries
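If you want to treat records as duplicates when a single field matches, such as the name, and keep the original order, a small loop with a set of seen keys is one way to do it. The sketch below assumes `name` is the unique field and keeps the first occurrence; the `records` list is illustrative.

```python
# Illustrative records, not the article's sample data
records = [
    {"name": "Alice Smith", "age": 30},
    {"name": "Bob Gray", "age": 25},
    {"name": "Alice Smith", "age": 31},  # same name, different age
]

seen = set()
unique = []
for r in records:
    key = r["name"]  # assumed unique field; change to suit your data
    if key not in seen:
        seen.add(key)      # remember the key so later repeats are skipped
        unique.append(r)   # keep the first record seen for each key
```

This is no longer a one-liner, but it makes the choice of key and the keep-first policy explicit.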

9. Scale Numeric Values

Scaling numeric fields can be useful for relative comparisons. Here, we scale salaries as a percentage of the maximum salary in the dataset:

max_salary = max(d["salary"] for d in data)
data = [{**d, "salary": (d["salary"] / max_salary * 100) if max_salary > 0 else 0} for d in data]

10. Trim Whitespace

Trimming leading and trailing whitespace helps ensure string fields are clean:

data = [{**d, "name": d["name"].strip()} for d in data]
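The same idea extends to every string field, not just `name`. The sketch below also collapses runs of internal whitespace into a single space by splitting and rejoining each string. The `records` list is illustrative, and collapsing internal whitespace is an extra assumption beyond what `strip()` does.

```python
# Illustrative record with stray whitespace in several fields
records = [{"name": "  alice   smith ", "email": " user@example.com\n", "age": 30}]

cleaned = [
    # split() with no arguments drops leading/trailing whitespace and
    # splits on any run of whitespace; non-string values pass through
    {k: " ".join(v.split()) if isinstance(v, str) else v for k, v in r.items()}
    for r in records
]
```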

Final Cleaned Data

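After running these one-liners in order, the cleaned dataset might look like this (salaries are shown rounded to one decimal place, and the record order can vary because the set-based deduplication in step 8 does not preserve order):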

[{'name': 'Bob Gray', 'age': 25, 'email': '[email protected]', 'salary': 85.7, 'join_date': '2023-01-01'},
 {'name': 'Alice Smith', 'age': 30, 'email': '[email protected]', 'salary': 71.4, 'join_date': datetime.date(2022, 3, 15)},
 {'name': 'Charlie Brown', 'age': 25, 'email': '[email protected]', 'salary': 0.0, 'join_date': datetime.date(2022, 9, 15)},
 {'name': 'Dave Davis', 'age': 45, 'email': '[email protected]', 'salary': 100.0, 'join_date': datetime.date(2021, 7, 1)},
 {'name': 'Eve Green', 'age': 25, 'email': '[email protected]', 'salary': 42.9, 'join_date': datetime.date(2023, 12, 31)}]

Wrapping Up

These Python one-liners provide a compact way to handle common data cleaning tasks, making your data ready for analysis. For more data processing techniques, you can explore libraries like pandas, which can simplify some of these operations further. Happy cleaning!
