Data cleaning is an essential first step in any data analysis pipeline, ensuring that data is consistent, formatted, and usable. This guide will walk you through ten quick Python one-liners for common data cleaning tasks using sample data, each tackling a typical data quality issue like duplicates, inconsistent formats, missing entries, and more.
To follow along, you should have some familiarity with list and dictionary comprehensions in Python. Let’s dive in!
Sample Data
We’ll start with some sample data that has various common issues to address:
data = [
{"name": "alice smith", "age": 30, "email": "[email protected]", "salary": 50000.00, "join_date": "2022-03-15"},
{"name": "bob gray", "age": 17, "email": "bob@not-an-email", "salary": 60000.00, "join_date": "invalid-date"},
{"name": "charlie brown", "age": None, "email": "[email protected]", "salary": -1500.00, "join_date": "15-09-2022"},
{"name": "dave davis", "age": 45, "email": "[email protected]", "salary": 70000.00, "join_date": "2021-07-01"},
{"name": "eve green", "age": 25, "email": "[email protected]", "salary": None, "join_date": "2023-12-31"},
]
1. Capitalize Strings
Ensuring consistency in string formats is key to clean data. Let’s title-case the name fields so that each word starts with a capital letter:
data = [{**d, "name": d["name"].title()} for d in data]
2. Convert Data Types
It’s often necessary to ensure that data types are correct across your dataset. Here, we’ll convert numeric age fields to integers, defaulting to 25 when the value is missing or not a number:
data = [{**d, "age": int(d["age"]) if isinstance(d["age"], (int, float)) else 25} for d in data]
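The one-liner above only accepts values that are already int or float, so a numeric string such as "42" would fall back to the default. If your ages may arrive as strings, a small helper could attempt the conversion and catch the failure instead. This is a minimal sketch: `to_int` and the `records` sample are hypothetical names used only for illustration.

```python
def to_int(value, default=25):
    """Return value converted to int, or default if conversion is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers strings like "n/a".
        return default

# Hypothetical sample, separate from the guide's dataset.
records = [{"age": 30}, {"age": "42"}, {"age": None}, {"age": "n/a"}]
cleaned = [{**r, "age": to_int(r["age"])} for r in records]
print([r["age"] for r in cleaned])  # [30, 42, 25, 25]
```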
3. Validate Numeric Ranges
Make sure numeric values like age fall within realistic ranges. Here, we restrict ages to 18-60, using a default if out of range:
data = [{**d, "age": d["age"] if isinstance(d["age"], int) and 18 <= d["age"] <= 60 else 25} for d in data]
4. Validate Email
To check that email addresses have a basic valid format (an "@" and a "."), we’ll replace invalid ones with a placeholder:
data = [{**d, "email": d["email"] if "@" in d["email"] and "." in d["email"] else "[email protected]"} for d in data]
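Checking for "@" and "." anywhere in the string is very permissive: a value like "a.b@c" passes even though the dot comes before the "@". If you want something a little stricter, one option is a loose regular expression. The pattern below is an assumption chosen for illustration, not a full email validator, and `is_valid_email` is a hypothetical helper name.

```python
import re

# Deliberately loose: text, "@", text, ".", text, with no spaces or extra "@".
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_valid_email(value):
    """Return True if value is a string shaped like local@domain.tld."""
    return isinstance(value, str) and EMAIL_RE.fullmatch(value) is not None

print(is_valid_email("user@example.com"))   # True
print(is_valid_email("bob@not-an-email"))   # False
print(is_valid_email("a.b@c"))              # False
```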
5. Handle Missing Values
Fill missing values in the dataset. Here, we replace missing salary values with a default of 30,000:
data = [{**d, "salary": d["salary"] if d["salary"] is not None else 30000.00} for d in data]
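A fixed default like 30,000 is simple, but it can skew later statistics. One common alternative is to fill gaps with the median of the known values. The sketch below uses the standard library's `statistics.median` on a hypothetical `records` sample, and the 0.0 fallback for an all-missing column is an assumption.

```python
from statistics import median

# Hypothetical sample, separate from the guide's dataset.
records = [
    {"salary": 50000.0},
    {"salary": None},
    {"salary": 70000.0},
    {"salary": 60000.0},
]

# Compute the median from the values that are present.
known = [r["salary"] for r in records if r["salary"] is not None]
fallback = median(known) if known else 0.0

filled = [
    {**r, "salary": r["salary"] if r["salary"] is not None else fallback}
    for r in records
]
print(filled[1]["salary"])  # 60000.0
```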
6. Standardize Date Formats
When working with dates, it’s helpful to have a consistent format. Let’s standardize the join_date fields, parsing both YYYY-MM-DD and DD-MM-YYYY strings and falling back to a default for invalid values:
from datetime import datetime
data = [{**d, "join_date": (lambda x: datetime.strptime(x, '%Y-%m-%d' if x[4] == '-' else '%d-%m-%Y').date() if x and len(x) == 10 and 'invalid-date' not in x else '2023-01-01')(d['join_date'])} for d in data]
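Rather than inspecting the string before parsing, a slightly longer helper can simply try each known format in turn and fall back to a default when none of them match. This is a sketch under stated assumptions: `parse_date` and `KNOWN_FORMATS` are hypothetical names, and the default here is a date object rather than the string used in the one-liner, so that every result has the same type.

```python
from datetime import date, datetime

# Formats to try, in order; extend this tuple for other layouts.
KNOWN_FORMATS = ("%Y-%m-%d", "%d-%m-%Y")

def parse_date(text, default=date(2023, 1, 1)):
    """Return the first successful parse of text, or default if none succeed."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except (TypeError, ValueError):
            # TypeError: text is not a string; ValueError: format mismatch.
            continue
    return default

print(parse_date("2022-03-15"))    # 2022-03-15
print(parse_date("15-09-2022"))    # 2022-09-15
print(parse_date("invalid-date"))  # 2023-01-01
```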
7. Remove Negative Values
Ensure that numeric values, like salary, are non-negative by setting negative values to zero:
data = [{**d, "salary": max(d["salary"], 0)} for d in data]
8. Check for Duplicates
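Remove exact duplicate records by converting each dictionary to a hashable tuple of its items. Note that passing the records through a set does not preserve their original order: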
data = {tuple(d.items()) for d in data} # Convert to set to remove duplicates
data = [dict(t) for t in data] # Convert back to list of dictionaries
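The set-based approach only drops records that match on every field. If you would rather treat two records as duplicates when a single field such as the name matches, and keep the original order, a short loop is one way to do it. `dedupe_by` and the `records` sample are hypothetical names for illustration, and the helper keeps the first occurrence of each key value.

```python
def dedupe_by(records, key):
    """Keep the first record for each distinct value of key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        marker = record[key]
        if marker not in seen:
            seen.add(marker)
            unique.append(record)
    return unique

# Hypothetical sample, separate from the guide's dataset.
records = [
    {"name": "Alice Smith", "age": 30},
    {"name": "Bob Gray", "age": 25},
    {"name": "Alice Smith", "age": 31},
]
print(dedupe_by(records, "name"))
# [{'name': 'Alice Smith', 'age': 30}, {'name': 'Bob Gray', 'age': 25}]
```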
9. Scale Numeric Values
Scaling numeric fields can be useful for relative comparisons. Here, we scale salaries as a percentage of the maximum salary in the dataset:
max_salary = max(d["salary"] for d in data)
data = [{**d, "salary": (d["salary"] / max_salary * 100) if max_salary > 0 else 0} for d in data]
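To make the arithmetic concrete, here is the same calculation on a plain list of the salaries the sample data ends up with after the earlier steps (30,000 filled in for the missing value, and the negative value set to zero). Rounding to one decimal place with `round` is an addition for readability and is not part of the one-liner above.

```python
# Salaries after filling the missing value and zeroing the negative one.
salaries = [50000.0, 60000.0, 0, 70000.0, 30000.0]

max_salary = max(salaries)  # 70000.0

# Each salary as a percentage of the maximum, rounded to one decimal place.
scaled = [
    round(s / max_salary * 100, 1) if max_salary > 0 else 0
    for s in salaries
]
print(scaled)  # [71.4, 85.7, 0.0, 100.0, 42.9]
```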
10. Trim Whitespace
Trimming leading and trailing whitespace helps ensure string fields are clean:
data = [{**d, "name": d["name"].strip()} for d in data]
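Note that `strip()` only removes whitespace at the start and end of a string. If names may also contain repeated spaces or tabs between words, splitting and re-joining is a common idiom for collapsing them. `normalize_spaces` and the `records` sample below are hypothetical names for illustration.

```python
def normalize_spaces(text):
    """Trim the ends and collapse any run of whitespace into a single space."""
    return " ".join(text.split())

# Hypothetical sample, separate from the guide's dataset.
records = [{"name": "  Alice   Smith "}, {"name": "Bob\tGray\n"}]
cleaned = [{**r, "name": normalize_spaces(r["name"])} for r in records]
print([r["name"] for r in cleaned])  # ['Alice Smith', 'Bob Gray']
```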
Final Cleaned Data
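After running these one-liners in order, the cleaned dataset might look like this (salaries are shown rounded to one decimal place, and the record order may vary because the duplicate check passes through a set):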
[{'name': 'Bob Gray', 'age': 25, 'email': '[email protected]', 'salary': 85.7, 'join_date': '2023-01-01'},
{'name': 'Alice Smith', 'age': 30, 'email': '[email protected]', 'salary': 71.4, 'join_date': datetime.date(2022, 3, 15)},
{'name': 'Charlie Brown', 'age': 25, 'email': '[email protected]', 'salary': 0.0, 'join_date': datetime.date(2022, 9, 15)},
{'name': 'Dave Davis', 'age': 45, 'email': '[email protected]', 'salary': 100.0, 'join_date': datetime.date(2021, 7, 1)},
{'name': 'Eve Green', 'age': 25, 'email': '[email protected]', 'salary': 42.9, 'join_date': datetime.date(2023, 12, 31)}]
Wrapping Up
These Python one-liners provide a compact way to handle common data cleaning tasks, making your data ready for analysis. For more data processing techniques, you can explore libraries like pandas, which can simplify some of these operations further. Happy cleaning!