Using Python to Automate Data Cleaning Before a CRM Migration

The Problem

Migrating customer data into a new CRM — whether Salesforce, HubSpot, or Zoho — sounds straightforward until you open the export file. Real-world data from legacy systems is messy: phone numbers in five different formats, duplicate contacts, blank required fields, emails with typos, and company names written six different ways. Importing dirty data doesn’t just cause errors — it corrupts your new system from day one, breaks automations, and forces manual cleanup that costs more than the migration itself. The fix is a repeatable Python pipeline that standardizes, deduplicates, validates, and flags bad records before a single row touches your new CRM. Done right, you import clean data once and move on.

Tech Stack & Prerequisites

  • Python 3.10+
  • pandas 2.2.0 — data manipulation
  • phonenumbers 8.13.x — phone normalization
  • email-validator 2.1.x — email validation
  • python-dotenv 1.0.x — environment variable management
  • openpyxl 3.1.x — reading/writing Excel files
  • A raw CRM export in .csv or .xlsx format
  • A terminal and a virtual environment (venv or conda)

Step-by-Step Implementation

Step 1: Setup

Create your project folder and install dependencies.

```bash
mkdir crm-cleaner && cd crm-cleaner
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install pandas==2.2.0 phonenumbers==8.13.39 email-validator==2.1.1 python-dotenv==1.0.1 openpyxl==3.1.2
```
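
Optionally, pin the exact environment so the pipeline is reproducible on another machine:

```bash
pip freeze > requirements.txt
```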

Create the following project structure:
```
crm-cleaner/
├── .env
├── clean.py
├── validators.py
├── data/
│   ├── raw_export.csv
│   └── cleaned_output.csv
```
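
The cleaning script assumes raw_export.csv has at least the columns first_name, last_name, email, phone, and company. A few invented rows illustrate the kind of mess it handles:

```csv
first_name,last_name,email,phone,company
jane,doe,Jane.Doe@EXAMPLE.com,(202) 555-0123,acme corp
JOHN,smith,john.smith@example,202-555-0199,ACME CORP
,doe,Jane.Doe@example.com,5550123,Acme Corp
```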

Step 2: Configuration

Never hardcode file paths or API keys. Store them in a .env file.

.env

```env
RAW_FILE=data/raw_export.csv
OUTPUT_FILE=data/cleaned_output.csv
DEFAULT_COUNTRY=US
```

Load them at the top of every script using python-dotenv:

clean.py (top of file)

```python
import os
from dotenv import load_dotenv

load_dotenv()

RAW_FILE = os.getenv("RAW_FILE")
OUTPUT_FILE = os.getenv("OUTPUT_FILE")
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY", "US")
```
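
A missing or mistyped path in .env would otherwise surface later as a FileNotFoundError deep inside pandas. A fail-fast check right after loading the config (using the variables above) is cheap insurance:

```python
from pathlib import Path

# Abort early with a clear message if the configured input file is missing
if not RAW_FILE or not Path(RAW_FILE).exists():
    raise SystemExit(f"[ERROR] RAW_FILE not found: {RAW_FILE!r}; check your .env")
```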

Add .env to your .gitignore immediately:

```bash
echo ".env" >> .gitignore
```

Step 3: Core Logic

3a — Validators Module

validators.py

```python
import phonenumbers
from email_validator import validate_email, EmailNotValidError


def normalize_phone(raw: str, country: str = "US") -> str | None:
    """
    Parse and format phone numbers to E.164 standard (+12025551234).
    Returns None if the number is invalid.
    """
    try:
        parsed = phonenumbers.parse(raw, country)
        if phonenumbers.is_valid_number(parsed):
            return phonenumbers.format_number(
                parsed, phonenumbers.PhoneNumberFormat.E164
            )
    except phonenumbers.NumberParseException:
        pass
    return None


def normalize_email(raw: str) -> str | None:
    """
    Lowercase and validate an email address.
    Returns None if invalid.
    """
    try:
        valid = validate_email(raw.strip(), check_deliverability=False)
        # .normalized lowercases only the domain; lowercase the whole address
        # so dedup treats John.Doe@example.com and john.doe@example.com as one
        return valid.normalized.lower()
    except EmailNotValidError:
        return None


def normalize_name(raw: str) -> str:
    """
    Title-case a name and strip extra whitespace.
    """
    return " ".join(raw.strip().split()).title()

3b — Main Cleaning Pipeline

clean.py

```python
import os
import pandas as pd
from dotenv import load_dotenv
from validators import normalize_phone, normalize_email, normalize_name

load_dotenv()

RAW_FILE = os.getenv("RAW_FILE")
OUTPUT_FILE = os.getenv("OUTPUT_FILE")
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY", "US")

# ── 1. Load ──────────────────────────────────────────────────────────────────
df = pd.read_csv(RAW_FILE, dtype=str).fillna("")
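# For an .xlsx export, swap in pd.read_excel (openpyxl provides the engine):
# df = pd.read_excel(RAW_FILE, dtype=str).fillna("")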
print(f"[INFO] Loaded {len(df)} rows from {RAW_FILE}")

# ── 2. Normalize columns ─────────────────────────────────────────────────────
df["first_name"] = df["first_name"].apply(normalize_name)
df["last_name"]  = df["last_name"].apply(normalize_name)
df["company"]    = df["company"].str.strip().str.title()

df["email_clean"] = df["email"].apply(
    lambda x: normalize_email(x) if x else None
)
df["phone_clean"] = df["phone"].apply(
    lambda x: normalize_phone(x, DEFAULT_COUNTRY) if x else None
)

# ── 3. Flag invalid records ──────────────────────────────────────────────────
df["flag_invalid_email"] = df["email_clean"].isna() & df["email"].ne("")
df["flag_invalid_phone"] = df["phone_clean"].isna() & df["phone"].ne("")
df["flag_missing_name"]  = df["first_name"].eq("") | df["last_name"].eq("")

# ── 4. Deduplicate on normalized email (keep first occurrence) ───────────────
before_dedup = len(df)
# Mask only rows that share a non-null email_clean; a plain drop_duplicates
# would treat all NaN emails as duplicates of each other and collapse them.
dupe_mask = df.duplicated(subset=["email_clean"], keep="first") & df["email_clean"].notna()
df = df[~dupe_mask]
print(f"[INFO] Removed {before_dedup - len(df)} duplicate rows by email")

# ── 5. Split: clean records vs. flagged records ──────────────────────────────
any_flag = df[["flag_invalid_email", "flag_invalid_phone", "flag_missing_name"]].any(axis=1)
clean_df  = df[~any_flag].copy()
flagged_df = df[any_flag].copy()

print(f"[INFO] Clean rows:   {len(clean_df)}")
print(f"[INFO] Flagged rows: {len(flagged_df)}")

# ── 6. Write outputs ─────────────────────────────────────────────────────────
clean_df.to_csv(OUTPUT_FILE, index=False)
flagged_df.to_csv(OUTPUT_FILE.replace(".csv", "_flagged.csv"), index=False)
print(f"[OK] Clean data saved to {OUTPUT_FILE}")

Step 4: Testing

Run the pipeline

```bash
python clean.py
```

**Expected terminal output:**
```
[INFO] Loaded 3842 rows from data/raw_export.csv
[INFO] Removed 214 duplicate rows by email
[INFO] Clean rows:   3401
[INFO] Flagged rows: 227
[OK] Clean data saved to data/cleaned_output.csv
```

Spot-check the output

Open a quick Python shell to inspect results:

```python
import pandas as pd

df = pd.read_csv("data/cleaned_output.csv")

# Confirm no invalid emails slipped through
assert df["email_clean"].notna().all(), "Invalid emails found in clean output!"

# Confirm all phones are E.164 format
assert df["phone_clean"].dropna().str.match(r"^\+\d{10,15}$").all(), "Bad phone format!"

# Preview first 5 rows
print(df[["first_name", "last_name", "email_clean", "phone_clean"]].head())

Check the flagged file

```bash
# Count flagged rows per issue
python -c "
import pandas as pd
df = pd.read_csv('data/cleaned_output_flagged.csv')
print('Invalid emails:', df['flag_invalid_email'].sum())
print('Invalid phones:', df['flag_invalid_phone'].sum())
print('Missing names: ', df['flag_missing_name'].sum())
"

Common Errors & Troubleshooting

Gotcha 1 — NumberParseException: (0) Missing or invalid default region

This happens when a phone number has no country code and DEFAULT_COUNTRY is not set.

Fix: Ensure your .env has DEFAULT_COUNTRY=US (or the appropriate ISO country code). Also confirm load_dotenv() is called before os.getenv().
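
To surface the problem immediately instead of mid-run, you can add a guard right after load_dotenv() (a minimal sketch):

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Fail fast with a clear message instead of a NumberParseException mid-run
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY")
if not DEFAULT_COUNTRY:
    raise SystemExit("[ERROR] DEFAULT_COUNTRY is not set; add it to .env (e.g. DEFAULT_COUNTRY=US)")
```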

Gotcha 2 — UnicodeDecodeError when reading the CSV

Legacy CRM exports are often saved in Windows-1252 or Latin-1 encoding, not UTF-8.

Fix: Detect and specify encoding explicitly:

```python
# Auto-detect encoding
import chardet

with open(RAW_FILE, "rb") as f:
    result = chardet.detect(f.read(100_000))
    detected = result["encoding"]

df = pd.read_csv(RAW_FILE, dtype=str, encoding=detected).fillna("")
```

Install chardet with: pip install chardet

Gotcha 3 — Deduplication removes the wrong record (wrong keep order)

Keeping the first occurrence (keep="first") preserves whichever row appears first in the file, which might be the older, less complete record.

Fix: Sort by a completeness score before deduplicating so the most complete record survives:

```python
# Score each row by how many key fields are non-empty; fillna("") so a null
# phone_clean doesn't count as "complete"
key_cols = ["first_name", "last_name", "phone_clean", "company"]
df["completeness"] = df[key_cols].fillna("").ne("").sum(axis=1)

# Sort descending so the most complete record is "first", then apply the same
# null-safe dedup as the main pipeline
df = df.sort_values("completeness", ascending=False, kind="stable")
dupe_mask = df.duplicated(subset=["email_clean"], keep="first") & df["email_clean"].notna()
df = df[~dupe_mask].drop(columns=["completeness"])
```

Security Checklist

  • Never commit .env — add it to .gitignore before your first commit
  • Never log raw PII — avoid print(df.head()) in production; log row counts only
  • Restrict file permissions on output CSVs containing customer data (chmod 600 data/*.csv; a Python equivalent follows this list)
  • Delete raw export files after cleaning if they are no longer needed
  • Use a dedicated service account if pulling exports from a CRM API — never your personal admin credentials
  • Validate column names on load — assert expected columns exist before processing to catch schema drift early:
```python
REQUIRED_COLS = {"first_name", "last_name", "email", "phone", "company"}
assert REQUIRED_COLS.issubset(df.columns), f"Missing columns: {REQUIRED_COLS - set(df.columns)}"
```
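
To enforce the file-permission rule from the pipeline itself instead of remembering the chmod, a short sketch (assuming the default data/ layout):

```python
import os
import stat
from pathlib import Path

# Owner read/write only (equivalent to chmod 600) on every generated CSV
for path in Path("data").glob("*.csv"):
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```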
