Using Python to Automate Data Cleaning Before a CRM Migration

The Problem

Migrating customer data into a new CRM — whether Salesforce, HubSpot, or Zoho — sounds straightforward until you open the export file. Real-world data from legacy systems is messy: phone numbers in five different formats, duplicate contacts, blank required fields, emails with typos, and company names written six different ways. Importing dirty data doesn’t just cause errors — it corrupts your new system from day one, breaks automations, and forces manual cleanup that costs more than the migration itself. The fix is a repeatable Python pipeline that standardizes, deduplicates, validates, and flags bad records before a single row touches your new CRM. Done right, you import clean data once and move on.

Tech Stack & Prerequisites

  • Python 3.10+
  • pandas 2.2.0 — data manipulation
  • phonenumbers 8.13.x — phone normalization
  • email-validator 2.1.x — email validation
  • python-dotenv 1.0.x — environment variable management
  • openpyxl 3.1.x — reading/writing Excel files
  • A raw CRM export in .csv or .xlsx format
  • A terminal and a virtual environment (venv or conda)

Step-by-Step Implementation

Step 1: Setup

Create your project folder and install dependencies.

```bash
mkdir crm-cleaner && cd crm-cleaner
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install pandas==2.2.0 phonenumbers==8.13.39 email-validator==2.1.1 python-dotenv==1.0.1 openpyxl==3.1.2
```
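
Optionally, pin the exact environment so the pipeline is reproducible on another machine:

```bash
pip freeze > requirements.txt
```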

Create the following project structure:
```
crm-cleaner/
├── .env
├── clean.py
├── validators.py
├── data/
│   ├── raw_export.csv
│   └── cleaned_output.csv
```
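
The cleaning script assumes raw_export.csv has at least the columns first_name, last_name, email, phone, and company. A few invented rows illustrate the kind of mess it handles:

```csv
first_name,last_name,email,phone,company
jane,doe,Jane.Doe@EXAMPLE.com,(202) 555-0123,acme corp
JOHN,smith,john.smith@example,202-555-0199,ACME CORP
,doe,Jane.Doe@example.com,5550123,Acme Corp
```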

Step 2: Configuration

Never hardcode file paths or API keys. Store them in a .env file.

.env

```env
RAW_FILE=data/raw_export.csv
OUTPUT_FILE=data/cleaned_output.csv
DEFAULT_COUNTRY=US
```

Load them at the top of every script using python-dotenv:

clean.py (top of file)

```python
import os
from dotenv import load_dotenv

load_dotenv()

RAW_FILE = os.getenv("RAW_FILE")
OUTPUT_FILE = os.getenv("OUTPUT_FILE")
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY", "US")
```
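
A missing or mistyped path in .env would otherwise surface later as a FileNotFoundError deep inside pandas. A fail-fast check right after loading the config (using the variables above) is cheap insurance:

```python
from pathlib import Path

# Abort early with a clear message if the configured input file is missing
if not RAW_FILE or not Path(RAW_FILE).exists():
    raise SystemExit(f"[ERROR] RAW_FILE not found: {RAW_FILE!r}; check your .env")
```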

Add .env to your .gitignore immediately:

```bash
echo ".env" >> .gitignore
```

Step 3: Core Logic

3a — Validators Module

validators.py

```python
import phonenumbers
from email_validator import validate_email, EmailNotValidError


def normalize_phone(raw: str, country: str = "US") -> str | None:
    """
    Parse and format phone numbers to E.164 standard (+12025551234).
    Returns None if the number is invalid.
    """
    try:
        parsed = phonenumbers.parse(raw, country)
        if phonenumbers.is_valid_number(parsed):
            return phonenumbers.format_number(
                parsed, phonenumbers.PhoneNumberFormat.E164
            )
    except phonenumbers.NumberParseException:
        pass
    return None


def normalize_email(raw: str) -> str | None:
    """
    Lowercase and validate an email address.
    Returns None if invalid.
    """
    try:
        valid = validate_email(raw.strip(), check_deliverability=False)
        # .normalized lowercases only the domain; lowercase the whole address
        # so dedup treats John.Doe@example.com and john.doe@example.com as one
        return valid.normalized.lower()
    except EmailNotValidError:
        return None


def normalize_name(raw: str) -> str:
    """
    Title-case a name and strip extra whitespace.
    """
    return " ".join(raw.strip().split()).title()

3b — Main Cleaning Pipeline

clean.py

```python
import os
import pandas as pd
from dotenv import load_dotenv
from validators import normalize_phone, normalize_email, normalize_name

load_dotenv()

RAW_FILE = os.getenv("RAW_FILE")
OUTPUT_FILE = os.getenv("OUTPUT_FILE")
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY", "US")

# ── 1. Load ──────────────────────────────────────────────────────────────────
df = pd.read_csv(RAW_FILE, dtype=str).fillna("")
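# For an .xlsx export, swap in pd.read_excel (openpyxl provides the engine):
# df = pd.read_excel(RAW_FILE, dtype=str).fillna("")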
print(f"[INFO] Loaded {len(df)} rows from {RAW_FILE}")

# ── 2. Normalize columns ─────────────────────────────────────────────────────
df["first_name"] = df["first_name"].apply(normalize_name)
df["last_name"]  = df["last_name"].apply(normalize_name)
df["company"]    = df["company"].str.strip().str.title()

df["email_clean"] = df["email"].apply(
    lambda x: normalize_email(x) if x else None
)
df["phone_clean"] = df["phone"].apply(
    lambda x: normalize_phone(x, DEFAULT_COUNTRY) if x else None
)

# ── 3. Flag invalid records ──────────────────────────────────────────────────
df["flag_invalid_email"] = df["email_clean"].isna() & df["email"].ne("")
df["flag_invalid_phone"] = df["phone_clean"].isna() & df["phone"].ne("")
df["flag_missing_name"]  = df["first_name"].eq("") | df["last_name"].eq("")

# ── 4. Deduplicate on normalized email (keep first occurrence) ───────────────
before_dedup = len(df)
# Mask only rows that share a non-null email_clean; a plain drop_duplicates
# would treat all NaN emails as duplicates of each other and collapse them.
dupe_mask = df.duplicated(subset=["email_clean"], keep="first") & df["email_clean"].notna()
df = df[~dupe_mask]
print(f"[INFO] Removed {before_dedup - len(df)} duplicate rows by email")

# ── 5. Split: clean records vs. flagged records ──────────────────────────────
any_flag = df[["flag_invalid_email", "flag_invalid_phone", "flag_missing_name"]].any(axis=1)
clean_df  = df[~any_flag].copy()
flagged_df = df[any_flag].copy()

print(f"[INFO] Clean rows:   {len(clean_df)}")
print(f"[INFO] Flagged rows: {len(flagged_df)}")

# ── 6. Write outputs ─────────────────────────────────────────────────────────
clean_df.to_csv(OUTPUT_FILE, index=False)
flagged_df.to_csv(OUTPUT_FILE.replace(".csv", "_flagged.csv"), index=False)
print(f"[OK] Clean data saved to {OUTPUT_FILE}")

Step 4: Testing

Run the pipeline

```bash
python clean.py
```

**Expected terminal output:**
```
[INFO] Loaded 3842 rows from data/raw_export.csv
[INFO] Removed 214 duplicate rows by email
[INFO] Clean rows:   3401
[INFO] Flagged rows: 227
[OK] Clean data saved to data/cleaned_output.csv
```

Spot-check the output

Open a quick Python shell to inspect results:

```python
import pandas as pd

df = pd.read_csv("data/cleaned_output.csv")

# Confirm no invalid emails slipped through
assert df["email_clean"].notna().all(), "Invalid emails found in clean output!"

# Confirm all phones are E.164 format
assert df["phone_clean"].dropna().str.match(r"^\+\d{10,15}$").all(), "Bad phone format!"

# Preview first 5 rows
print(df[["first_name", "last_name", "email_clean", "phone_clean"]].head())

Check the flagged file

```bash
# Count flagged rows per issue
python -c "
import pandas as pd
df = pd.read_csv('data/cleaned_output_flagged.csv')
print('Invalid emails:', df['flag_invalid_email'].sum())
print('Invalid phones:', df['flag_invalid_phone'].sum())
print('Missing names: ', df['flag_missing_name'].sum())
"

Common Errors & Troubleshooting

Gotcha 1 — NumberParseException: (0) Missing or invalid default region

This happens when a phone number has no country code and DEFAULT_COUNTRY is not set.

Fix: Ensure your .env has DEFAULT_COUNTRY=US (or the appropriate ISO country code). Also confirm load_dotenv() is called before os.getenv().
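
To surface the problem immediately instead of mid-run, you can add a guard right after load_dotenv() (a minimal sketch):

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Fail fast with a clear message instead of a NumberParseException mid-run
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY")
if not DEFAULT_COUNTRY:
    raise SystemExit("[ERROR] DEFAULT_COUNTRY is not set; add it to .env (e.g. DEFAULT_COUNTRY=US)")
```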

Gotcha 2 — UnicodeDecodeError when reading the CSV

Legacy CRM exports are often saved in Windows-1252 or Latin-1 encoding, not UTF-8.

Fix: Detect and specify encoding explicitly:

```python
# Auto-detect encoding
import chardet

with open(RAW_FILE, "rb") as f:
    result = chardet.detect(f.read(100_000))
    detected = result["encoding"]

df = pd.read_csv(RAW_FILE, dtype=str, encoding=detected).fillna("")
```

Install chardet with: pip install chardet

Gotcha 3 — Deduplication removes the wrong record (wrong keep order)

Keeping the first occurrence (keep="first") preserves whichever row appears first in the file, which might be the older, less complete record.

Fix: Sort by a completeness score before deduplicating so the most complete record survives:

```python
# Score each row by how many key fields are non-empty; fillna("") so a null
# phone_clean doesn't count as "complete"
key_cols = ["first_name", "last_name", "phone_clean", "company"]
df["completeness"] = df[key_cols].fillna("").ne("").sum(axis=1)

# Sort descending so the most complete record is "first", then apply the same
# null-safe dedup as the main pipeline
df = df.sort_values("completeness", ascending=False, kind="stable")
dupe_mask = df.duplicated(subset=["email_clean"], keep="first") & df["email_clean"].notna()
df = df[~dupe_mask].drop(columns=["completeness"])
```

Security Checklist

  • Never commit .env — add it to .gitignore before your first commit
  • Never log raw PII — avoid print(df.head()) in production; log row counts only
  • Restrict file permissions on output CSVs containing customer data (chmod 600 data/*.csv; a Python equivalent follows this list)
  • Delete raw export files after cleaning if they are no longer needed
  • Use a dedicated service account if pulling exports from a CRM API — never your personal admin credentials
  • Validate column names on load — assert expected columns exist before processing to catch schema drift early:
```python
REQUIRED_COLS = {"first_name", "last_name", "email", "phone", "company"}
assert REQUIRED_COLS.issubset(df.columns), f"Missing columns: {REQUIRED_COLS - set(df.columns)}"
```
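
To enforce the file-permission rule from the pipeline itself instead of remembering the chmod, a short sketch (assuming the default data/ layout):

```python
import os
import stat
from pathlib import Path

# Owner read/write only (equivalent to chmod 600) on every generated CSV
for path in Path("data").glob("*.csv"):
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```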
