The Problem
Migrating customer data into a new CRM — whether Salesforce, HubSpot, or Zoho — sounds straightforward until you open the export file. Real-world data from legacy systems is messy: phone numbers in five different formats, duplicate contacts, blank required fields, emails with typos, and company names written six different ways. Importing dirty data doesn’t just cause errors — it corrupts your new system from day one, breaks automations, and forces manual cleanup that costs more than the migration itself. The fix is a repeatable Python pipeline that standardizes, deduplicates, validates, and flags bad records before a single row touches your new CRM. Done right, you import clean data once and move on.
Tech Stack & Prerequisites
- Python 3.10+
- pandas 2.2.0 — data manipulation
- phonenumbers 8.13.x — phone normalization
- email-validator 2.1.x — email validation
- python-dotenv 1.0.x — environment variable management
- openpyxl 3.1.x — reading/writing Excel files
- A raw CRM export in `.csv` or `.xlsx` format
- A terminal and a virtual environment (venv or conda)
Step-by-Step Implementation
Step 1: Setup
Create your project folder and install dependencies.
```
mkdir crm-cleaner && cd crm-cleaner
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install pandas==2.2.0 phonenumbers==8.13.39 email-validator==2.1.1 python-dotenv==1.0.1 openpyxl==3.1.2
```
Create the following project structure:
```
crm-cleaner/
├── .env
├── clean.py
├── validators.py
├── data/
│   ├── raw_export.csv
│   └── cleaned_output.csv
```
Step 2: Configuration
Never hardcode file paths or API keys. Store them in a .env file.
**.env**
```
RAW_FILE=data/raw_export.csv
OUTPUT_FILE=data/cleaned_output.csv
DEFAULT_COUNTRY=US
```
Load them at the top of every script using python-dotenv:
**clean.py (top of file)**
```
import os
from dotenv import load_dotenv

load_dotenv()

RAW_FILE = os.getenv("RAW_FILE")
OUTPUT_FILE = os.getenv("OUTPUT_FILE")
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY", "US")
```
Add .env to your .gitignore immediately:
```
echo ".env" >> .gitignore
```
Step 3: Core Logic
3a — Validators Module
**validators.py**
```
import phonenumbers
from email_validator import validate_email, EmailNotValidError


def normalize_phone(raw: str, country: str = "US") -> str | None:
    """
    Parse and format phone numbers to the E.164 standard (+12025551234).
    Returns None if the number is invalid.
    """
    try:
        parsed = phonenumbers.parse(raw, country)
        if phonenumbers.is_valid_number(parsed):
            return phonenumbers.format_number(
                parsed, phonenumbers.PhoneNumberFormat.E164
            )
    except phonenumbers.NumberParseException:
        pass
    return None


def normalize_email(raw: str) -> str | None:
    """
    Validate an email address and return its normalized form
    (email-validator lowercases the domain). Returns None if invalid.
    """
    try:
        valid = validate_email(raw.strip(), check_deliverability=False)
        return valid.normalized
    except EmailNotValidError:
        return None


def normalize_name(raw: str) -> str:
    """
    Title-case a name and strip extra whitespace.
    """
    return " ".join(raw.strip().split()).title()
```
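The pipeline below only title-cases company names; if your export shows the "written six different ways" problem from the intro, a suffix-standardizing helper fits naturally in validators.py. A minimal sketch, assuming a deliberately small suffix map that you would extend for your own data:
```
# Hypothetical suffix map; extend with whatever variants appear in your export
_SUFFIXES = {"inc": "Inc.", "llc": "LLC", "ltd": "Ltd.", "corp": "Corp.", "co": "Co."}


def normalize_company(raw: str) -> str:
    """
    Collapse whitespace, title-case, and standardize trailing legal suffixes,
    so "acme inc", "Acme INC." and "Acme Inc" all become "Acme Inc.".
    """
    words = " ".join(raw.strip().split()).title().split()
    if words:
        key = words[-1].lower().strip(".,")
        if key in _SUFFIXES:
            words[-1] = _SUFFIXES[key]
    return " ".join(words)
```
Wire it in like the other helpers: `df["company"] = df["company"].apply(normalize_company)`.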
3b — Main Cleaning Pipeline
**clean.py**
```
import os

import pandas as pd
from dotenv import load_dotenv

from validators import normalize_phone, normalize_email, normalize_name

load_dotenv()

RAW_FILE = os.getenv("RAW_FILE")
OUTPUT_FILE = os.getenv("OUTPUT_FILE")
DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY", "US")

# ── 1. Load ──────────────────────────────────────────────────────────────────
df = pd.read_csv(RAW_FILE, dtype=str).fillna("")
print(f"[INFO] Loaded {len(df)} rows from {RAW_FILE}")

# ── 2. Normalize columns ─────────────────────────────────────────────────────
df["first_name"] = df["first_name"].apply(normalize_name)
df["last_name"] = df["last_name"].apply(normalize_name)
df["company"] = df["company"].str.strip().str.title()
df["email_clean"] = df["email"].apply(
    lambda x: normalize_email(x) if x else None
)
df["phone_clean"] = df["phone"].apply(
    lambda x: normalize_phone(x, DEFAULT_COUNTRY) if x else None
)

# ── 3. Flag invalid records ──────────────────────────────────────────────────
# Email is the dedup key, so a blank email is flagged along with invalid ones
df["flag_invalid_email"] = df["email_clean"].isna()
df["flag_invalid_phone"] = df["phone_clean"].isna() & df["phone"].ne("")
df["flag_missing_name"] = df["first_name"].eq("") | df["last_name"].eq("")

# ── 4. Deduplicate on normalized email (keep first occurrence) ───────────────
# drop_duplicates treats NaN values as equal, so dedupe only rows that actually
# have an email; otherwise every email-less contact would collapse into one row
before_dedup = len(df)
has_email = df["email_clean"].notna()
df = pd.concat([
    df[has_email].drop_duplicates(subset=["email_clean"], keep="first"),
    df[~has_email],
])
print(f"[INFO] Removed {before_dedup - len(df)} duplicate rows by email")

# ── 5. Split: clean records vs. flagged records ──────────────────────────────
any_flag = df[["flag_invalid_email", "flag_invalid_phone", "flag_missing_name"]].any(axis=1)
clean_df = df[~any_flag].copy()
flagged_df = df[any_flag].copy()
print(f"[INFO] Clean rows: {len(clean_df)}")
print(f"[INFO] Flagged rows: {len(flagged_df)}")

# ── 6. Write outputs ─────────────────────────────────────────────────────────
clean_df.to_csv(OUTPUT_FILE, index=False)
flagged_df.to_csv(OUTPUT_FILE.replace(".csv", "_flagged.csv"), index=False)
print(f"[OK] Clean data saved to {OUTPUT_FILE}")
```
Step 4: Testing
Run the pipeline
```
python clean.py
```
**Expected terminal output:**
```
[INFO] Loaded 3842 rows from data/raw_export.csv
[INFO] Removed 214 duplicate rows by email
[INFO] Clean rows: 3401
[INFO] Flagged rows: 227
[OK] Clean data saved to data/cleaned_output.csv
```
Verify the output with spot-checks
Open a quick Python shell to inspect results:
```
import pandas as pd

# dtype=str keeps E.164 phones like +12025551234 from being parsed as integers
df = pd.read_csv("data/cleaned_output.csv", dtype=str)

# Confirm no invalid emails slipped through
assert df["email_clean"].notna().all(), "Invalid emails found in clean output!"

# Confirm all phones are E.164 format
assert df["phone_clean"].dropna().str.match(r"^\+\d{10,15}$").all(), "Bad phone format!"

# Preview first 5 rows
print(df[["first_name", "last_name", "email_clean", "phone_clean"]].head())
```
Check the flagged file
```
# Count flagged rows per issue
python -c "
import pandas as pd
df = pd.read_csv('data/cleaned_output_flagged.csv')
print('Invalid emails:', df['flag_invalid_email'].sum())
print('Invalid phones:', df['flag_invalid_phone'].sum())
print('Missing names: ', df['flag_missing_name'].sum())
"
```
Common Errors & Troubleshooting
Gotcha 1 — NumberParseException: (0) Missing or invalid default region
This happens when a phone number has no country code and DEFAULT_COUNTRY is not set.
Fix: Ensure your .env has DEFAULT_COUNTRY=US (or the appropriate ISO country code). Also confirm load_dotenv() is called before os.getenv().
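Note that normalize_phone swallows NumberParseException, so in this pipeline a missing region usually shows up as every phone being flagged invalid rather than as a crash. If you'd rather fail fast than silently fall back to "US", drop the default and guard at startup; a sketch (the exit message is illustrative):
```
import os
from dotenv import load_dotenv

load_dotenv()  # must run before any os.getenv() call

DEFAULT_COUNTRY = os.getenv("DEFAULT_COUNTRY")
if not DEFAULT_COUNTRY:
    raise SystemExit("DEFAULT_COUNTRY is not set; add it to .env (e.g. DEFAULT_COUNTRY=US)")
```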
Gotcha 2 — UnicodeDecodeError when reading the CSV
Legacy CRM exports are often saved in Windows-1252 or Latin-1 encoding, not UTF-8.
Fix: Detect and specify encoding explicitly:
```
# Auto-detect encoding from the first 100 KB of the file
import chardet

with open(RAW_FILE, "rb") as f:
    result = chardet.detect(f.read(100_000))

detected = result["encoding"]
df = pd.read_csv(RAW_FILE, dtype=str, encoding=detected).fillna("")
```
Install chardet with: pip install chardet
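If you'd rather not add a dependency, a simple fallback loop over the encodings legacy exports typically use also works, as a sketch:
```
# Try common legacy encodings in order; latin-1 decodes any byte stream,
# so it acts as the final fallback and the loop always succeeds
for enc in ("utf-8", "windows-1252", "latin-1"):
    try:
        df = pd.read_csv(RAW_FILE, dtype=str, encoding=enc).fillna("")
        break
    except UnicodeDecodeError:
        continue
```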
Gotcha 3 — Deduplication removes the wrong record (wrong keep order)
drop_duplicates(keep="first") keeps whichever row appears first in the file — which might be the older, less complete record.
Fix: Sort by a completeness score before deduplicating so the most complete record survives:
```
# Score each row by how many key fields are non-empty
# (notna() guards phone_clean, which holds NaN rather than "" when invalid)
key_cols = ["first_name", "last_name", "phone_clean", "company"]
df["completeness"] = df[key_cols].apply(
    lambda row: (row.notna() & row.ne("")).sum(), axis=1
)

# Sort descending so the most complete record is "first"
df = df.sort_values("completeness", ascending=False)
# (reuse the notna() email guard from Step 4 so email-less rows aren't collapsed)
df = df.drop_duplicates(subset=["email_clean"], keep="first")
df = df.drop(columns=["completeness"])
```
Security Checklist
- Never commit `.env` — add it to `.gitignore` before your first commit
- Never log raw PII — avoid `print(df.head())` in production; log row counts only
- Restrict file permissions on output CSVs containing customer data (`chmod 600 data/*.csv`)
- Delete raw export files after cleaning if they are no longer needed
- Use a dedicated service account if pulling exports from a CRM API — never your personal admin credentials
- Validate column names on load — assert expected columns exist before processing to catch schema drift early:
```
REQUIRED_COLS = {"first_name", "last_name", "email", "phone", "company"}
assert REQUIRED_COLS.issubset(df.columns), f"Missing columns: {REQUIRED_COLS - set(df.columns)}"
```

Muhammad Abdullah is a SaaS systems researcher and CRM workflow specialist focused on practical implementation and scalable software configuration. He writes in-depth guides on CRM architecture, automation logic, API integrations, and data organization strategies for modern SaaS platforms. His work helps teams understand how CRM systems function internally, from custom field structuring to workflow engineering and secure integrations. Abdullah’s goal is to simplify complex SaaS concepts into clear, implementation-ready documentation.

