
MySQL/Python for data extraction

Updated: May 16

Use case: processing up to 3 million records (30 lakh) in MySQL with name/address preprocessing and splitting logic. Here is a breakdown of the best approach based on performance, memory, scalability, and flexibility:




Recommended Approach: Use Python + Pandas + Batched MySQL Inserts





🔍 Why not MySQL-only (Option 1)?



  • SQL is not ideal for complex string operations like:


    • Named Entity Recognition

    • Address parsing and splitting logic


  • SQL-based regex is limited and hard to maintain for logic-heavy string matching

  • With stored procedures, the logic gets locked into the database and becomes harder to scale

  • Processing in the database causes higher DB load (slow and resource-heavy)





✅ Why Python (Option 2) is Better:


| Factor | Python Pandas + Batched Insert |
| --- | --- |
| String Processing | Rich (regex, ML, NER, fuzzy matching) |
| Memory Handling | Process in batches (e.g., 10k at a time) |
| Speed | Fast using vectorized ops (Pandas) |
| DB Load | Minimal if batched INSERTs are used |
| Flexibility | Easy to plug in spaCy NER or rules |
| Scalability | Can run on larger infra (Dask, Spark if needed) |
| Debuggability | Easier to inspect intermediate outputs |





✅ Proposed Workflow




Step 1: Batch Fetch from MySQL


import pandas as pd
import pymysql
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@host/db")

chunk_size = 10000
offset = 0
while True:
    query = f"SELECT * FROM raw_table LIMIT {chunk_size} OFFSET {offset}"
    df = pd.read_sql(query, engine)
    if df.empty:
        break

    # → Step 2: Process Data
    processed_df = process_df(df)

    # → Step 3: Insert into preprocessed table
    processed_df.to_sql('preprocessed_name_and_address', engine, if_exists='append', index=False)

    offset += chunk_size
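
Note: OFFSET pagination slows down as the offset grows, because MySQL still scans the skipped rows. If that becomes a bottleneck near 3 million records, keyset pagination on the primary key is a common alternative. A minimal sketch, assuming primary_key is an indexed integer column as in the schema later in this post:

last_key = 0
while True:
    query = (
        "SELECT * FROM raw_table "
        f"WHERE primary_key > {last_key} "
        f"ORDER BY primary_key LIMIT {chunk_size}"
    )
    df = pd.read_sql(query, engine)
    if df.empty:
        break

    processed_df = process_df(df)
    processed_df.to_sql('preprocessed_name_and_address', engine, if_exists='append', index=False)

    # Resume from the largest key seen so far instead of re-scanning skipped rows
    last_key = int(df['primary_key'].max())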




Step 2: Process Data (process_df function)



This is where you:


  • Detect if field is name-only, address-only, or name+address

  • Extract name_extract, address_extract using:


    • Rule-based splitting

    • Or a trained spaCy/NER model


  • Create new columns like name1, address1, name_and_address1, name_extract1, address_extract1, etc.



You can implement rules like:
def is_name(text):
    # Heuristic: short strings containing at least one letter look like names
    return len(text.split()) < 4 and any(char.isalpha() for char in text)

def is_address(text):
    # Heuristic: common address keywords indicate an address
    return any(keyword in text.lower() for keyword in ["street", "road", "avenue", "po", "city"])
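
Putting these rules together, a minimal rule-based process_df for the workflow above might look like the sketch below. It assumes the raw_table rows expose a name_and_address1 column, as in the schema used later in this post; no NER yet:

def process_df(df):
    """Rule-based sketch: classify each value and split combined name/address fields."""
    names, addresses = [], []
    for text in df["name_and_address1"].fillna(""):
        if "," in text:
            # Treat "something, rest" as a possible name + address pair
            head, tail = text.split(",", 1)
            if is_name(head) and is_address(tail):
                names.append(head.strip())
                addresses.append(tail.strip())
                continue
        # Fall back to single-field classification
        names.append(text.strip() if is_name(text) else None)
        addresses.append(text.strip() if is_address(text) else None)
    df["name_extract1"] = names
    df["address_extract1"] = addresses
    return df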




Step 3: Write to Final Table



Once processed, append to MySQL in batches using pandas.to_sql() or raw INSERT for better control.

processed_df.to_sql("preprocessed_name_and_address", engine, if_exists="append", index=False)

You can also use executemany for higher speed if needed.
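
For example, a sketch of a batched executemany insert through the same SQLAlchemy engine (the column list here is illustrative; adjust it to match the processed DataFrame):

# Raw DBAPI connection from the SQLAlchemy engine, so executemany is available
conn = engine.raw_connection()
try:
    cursor = conn.cursor()
    insert_sql = (
        "INSERT INTO preprocessed_name_and_address "
        "(primary_key, apn_key, name_extract1, address_extract1) "
        "VALUES (%s, %s, %s, %s)"
    )
    # NOTE: convert NaN values to None beforehand if the DataFrame has missing data
    rows = processed_df[
        ["primary_key", "apn_key", "name_extract1", "address_extract1"]
    ].values.tolist()
    cursor.executemany(insert_sql, rows)
    conn.commit()
finally:
    conn.close()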




💡 Advanced Options (if performance needs grow):



  • Use Dask instead of Pandas for processing larger-than-memory datasets (a sketch follows this list)

  • Use PySpark if the processing gets truly massive

  • Use job queues (Celery, Prefect) for distributed processing

  • Apply NER models via spaCy inside process_df
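
For reference, a rough Dask sketch of the same chunked pattern (argument names vary slightly between Dask versions; primary_key is assumed to be an indexed numeric column):

import dask.dataframe as dd

# Read the table lazily, partitioned on the indexed primary key
ddf = dd.read_sql_table("raw_table", "mysql+pymysql://user:pass@host/db",
                        index_col="primary_key")

# Apply the same per-chunk logic to each partition
processed = ddf.map_partitions(process_df)

# Write back; Dask handles the partitions one at a time
processed.to_sql("preprocessed_name_and_address",
                 "mysql+pymysql://user:pass@host/db",
                 if_exists="append", index=False)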





✅ Summary


| Option | Best For | Verdict |
| --- | --- | --- |
| MySQL queries only | Simple filters, minimal logic | ❌ Not scalable or flexible |
| Python + Pandas | Complex logic, ML models, large data | ✅ Recommended |
| Python + Dask/PySpark | 10M+ records or big infra | 🟡 Only if needed |





Here is a detailed, step-by-step plan for processing your MySQL dataset using Python and Pandas. The plan is optimized for clarity, performance, and memory usage, and is suitable for presenting in a technical showcase.




🧩 Goal



Preprocess a MySQL table with up to 3 million records by:


  1. Identifying:


    • Name-only fields

    • Address-only fields

    • Combined Name + Address fields


  2. Splitting combined fields into separate name and address

  3. Creating a new MySQL table:


    • preprocessed_name_and_address

    • Fields: primary_key, apn_key, name1, address1, name_and_address1, name_extract1, address_extract1, etc.






✅ Step-by-Step Logic Using Python + Pandas





🔹 Step 1: Set Up Environment


import pandas as pd
import mysql.connector  # DBAPI driver behind the mysqlconnector SQLAlchemy dialect used below
from sqlalchemy import create_engine
import re

✅ Use SQLAlchemy for bulk inserts and memory-efficient reading.



🔹 Step 2: Connect to MySQL


DB_URI = "mysql+mysqlconnector://username:password@localhost:3306/db_name"
engine = create_engine(DB_URI)




🔹 Step 3: Load Data in Chunks



🔄 Read data in chunks to avoid memory overload.

chunksize = 100_000  # Tune based on available memory
query = "SELECT primary_key, apn_key, name1, address1, name_and_address1 FROM original_table"

data_iter = pd.read_sql(query, engine, chunksize=chunksize)




🔹 Step 4: Define Preprocessing Rules



Use heuristic rules to classify and split fields.

def is_likely_name(text):
    return bool(re.fullmatch(r"[A-Za-z\s.,'-]{3,100}", text or ""))

def is_likely_address(text):
    return bool(re.search(r"\d+|Street|St\.|Avenue|Ave|Lane|Ln|Road|Rd|Block|PO Box", str(text), re.IGNORECASE))

def split_name_address(text):
    # Simplified split logic — replace with model later
    if "," in text:
        parts = [part.strip() for part in text.split(",")]
        if is_likely_name(parts[0]) and is_likely_address(",".join(parts[1:])):
            return parts[0], ",".join(parts[1:])
    return None, None
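
A quick sanity check of these rules against the sample row shown in the output schema at the end of this post:

print(is_likely_name("John Smith"))                   # True
print(is_likely_address("21 Road Ave"))               # True
print(split_name_address("John Smith, 21 Road Ave"))  # ('John Smith', '21 Road Ave')
print(split_name_address("21 Road Ave"))              # (None, None) -- no comma to split on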




🔹 Step 5: Process Each Chunk


processed_chunks = []

for chunk in data_iter:
    chunk["name_extract1"] = None
    chunk["address_extract1"] = None

    for i, row in chunk.iterrows():
        combined = row.get("name_and_address1")

        # Split combined field
        if pd.notnull(combined):
            name, address = split_name_address(combined)
            chunk.at[i, "name_extract1"] = name
            chunk.at[i, "address_extract1"] = address

    processed_chunks.append(chunk)
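
The row loop above is easy to follow but slow on large chunks. The same split can be written with apply(), as suggested in the optimization notes further down. A sketch (re-creating the chunk iterator, since data_iter can only be consumed once):

processed_chunks = []

for chunk in pd.read_sql(query, engine, chunksize=chunksize):
    # apply() returns (name, address) tuples; zip(*...) unpacks them into two columns
    extracted = chunk["name_and_address1"].fillna("").apply(split_name_address)
    chunk["name_extract1"], chunk["address_extract1"] = zip(*extracted)
    processed_chunks.append(chunk)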




🔹 Step 6: Create Final DataFrame


final_df = pd.concat(processed_chunks, ignore_index=True)




🔹 Step 7: Save to MySQL



Use if_exists='replace' for the first run and 'append' for subsequent runs.

final_df.to_sql("preprocessed_name_and_address", con=engine, index=False, if_exists="replace")




🔧 Optimization Suggestions




💡 Processing



  • Use vectorized operations where possible (e.g., apply() on columns instead of iterrows).

  • Replace regex splitting with an ML/NLP-based model (such as spaCy NER) for better accuracy, as sketched below.
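
If the heuristics are not accurate enough, a pretrained spaCy pipeline can supply the entities. A rough sketch of a model-based splitter, assuming the en_core_web_sm model is installed (the stock model has no street-level address label, so a custom-trained NER is usually needed for full addresses):

import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def split_name_address_ner(text):
    doc = nlp(text or "")
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    # GPE/LOC/FAC cover cities, locations, and facilities; street-level parts
    # usually need a custom-trained model
    addr_parts = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")]
    name = names[0] if names else None
    address = ", ".join(addr_parts) if addr_parts else None
    return name, address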




💡 Storage



  • Only select required columns when reading from MySQL.

  • Drop unused intermediate columns after processing.




💡 Memory



  • Use chunksize with care: monitor RAM usage via psutil or memory_profiler.

  • Explicitly call del chunk and gc.collect() between loop iterations if memory is tight.
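
If holding every processed chunk for the final concat uses too much memory, an alternative is to write each chunk to MySQL as soon as it is processed, keeping only one chunk in memory at a time. A sketch combining the suggestions above:

import gc

for i, chunk in enumerate(pd.read_sql(query, engine, chunksize=chunksize)):
    extracted = chunk["name_and_address1"].fillna("").apply(split_name_address)
    chunk["name_extract1"], chunk["address_extract1"] = zip(*extracted)

    # Replace the table on the first chunk, append afterwards
    mode = "replace" if i == 0 else "append"
    chunk.to_sql("preprocessed_name_and_address", con=engine, index=False, if_exists=mode)

    del chunk
    gc.collect()  # reclaim memory before fetching the next chunk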





📝 Output Table Schema (Example)


| primary_key | apn_key | name1 | address1 | name_and_address1 | name_extract1 | address_extract1 |
| --- | --- | --- | --- | --- | --- | --- |
| 1001 | 503 | NULL | NULL | John Smith, 21 Road Ave | John Smith | 21 Road Ave |




