MySQL/Python for data extraction
- Mohammed Jassim Jasmin

- May 15
- 4 min read
Updated: May 16
Use case: processing up to 3 million records (30 lakh) in MySQL with name/address preprocessing and splitting logic. Here is a breakdown of the best approach based on performance, memory, scalability, and flexibility:
✅ Recommended Approach: Use Python + Pandas + Batched MySQL Inserts
🔍 Why not MySQL-only (Option 1)?
SQL is not ideal for complex string operations like:
Named Entity Recognition
Address parsing and splitting logic
SQL-based regex is limited and hard to maintain for logic-heavy string matching
With stored procedures, the logic gets locked into the database and becomes harder to scale
In-database processing also causes higher DB load (slow and resource-heavy)
✅ Why Python (Option 2) is Better:
| Factor | Python Pandas + Batched Insert |
| --- | --- |
| String Processing | Rich (regex, ML, NER, fuzzy matching) |
| Memory Handling | Process in batches (e.g., 10k at a time) |
| Speed | Fast using vectorized ops (Pandas) |
| DB Load | Minimal if batched INSERTs used |
| Flexibility | Easy to plug in spaCy NER or rules |
| Scalability | Can run on larger infra (Dask, Spark if needed) |
| Debuggability | Easier to inspect intermediate outputs |
✅ Proposed Workflow
Step 1: Batch Fetch from MySQL
```python
import pandas as pd
import pymysql  # driver behind the mysql+pymysql SQLAlchemy URL
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@host/db")

chunk_size = 10000
offset = 0

while True:
    query = f"SELECT * FROM raw_table LIMIT {chunk_size} OFFSET {offset}"
    df = pd.read_sql(query, engine)
    if df.empty:
        break
    # → Step 2: Process Data
    processed_df = process_df(df)
    # → Step 3: Insert into preprocessed table
    processed_df.to_sql('preprocessed_name_and_address', engine, if_exists='append', index=False)
    offset += chunk_size
```
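Note that LIMIT ... OFFSET re-scans the skipped rows, so later batches slow down on a 3-million-row table. A sketch of keyset (seek) pagination as an alternative, assuming raw_table has an indexed auto-increment column named id (hypothetical name):

```python
# Keyset pagination sketch: filter on an indexed id instead of using OFFSET.
last_id = 0
while True:
    df = pd.read_sql(
        f"SELECT * FROM raw_table WHERE id > {last_id} "
        f"ORDER BY id LIMIT {chunk_size}",
        engine,
    )
    if df.empty:
        break
    processed_df = process_df(df)
    processed_df.to_sql("preprocessed_name_and_address", engine,
                        if_exists="append", index=False)
    last_id = int(df["id"].max())  # resume after the last key seen
```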
Step 2: Process Data (process_df function)
This is where you:
Detect if field is name-only, address-only, or name+address
Extract name_extract, address_extract using:
Rule-based splitting
Or a trained spaCy/NER model
Create new columns like name1, address1, name_and_address1, name_extract1, address_extract1, etc.
You can implement rules like:
```python
def is_name(text):
    # Heuristic: short strings containing letters are likely names
    return len(text.split()) < 4 and any(char.isalpha() for char in text)

def is_address(text):
    # Heuristic: presence of common address keywords
    return any(keyword in text.lower() for keyword in ["street", "road", "avenue", "po", "city"])
```
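Putting the rules together, here is a minimal process_df sketch; the classification order and the assumption that combined fields read "name, address" are simplifications to refine later:

```python
def process_df(df):
    """Classify name_and_address1 and fill name_extract1 / address_extract1."""
    def split_field(text):
        if not isinstance(text, str) or not text.strip():
            return None, None
        if is_name(text):
            return text.strip(), None              # name-only field
        if "," in text and is_address(text):
            head, _, tail = text.partition(",")    # assume "name, address" order
            return head.strip() or None, tail.strip() or None
        if is_address(text):
            return None, text.strip()              # address-only field
        return None, None

    pairs = df["name_and_address1"].apply(split_field)
    df[["name_extract1", "address_extract1"]] = pd.DataFrame(
        pairs.tolist(), index=df.index)
    return df
```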
Step 3: Write to Final Table
Once processed, append to MySQL in batches using pandas.to_sql() or raw INSERT statements for better control.
```python
processed_df.to_sql("preprocessed_name_and_address", engine, if_exists="append", index=False)
```
You can also use executemany for higher speed if needed.
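For example, a sketch of a batched raw insert through PyMySQL's DB-API cursor (connection parameters are placeholders):

```python
# executemany() sends one parameterized INSERT covering many rows per round trip.
conn = pymysql.connect(host="host", user="user", password="pass", database="db")
rows = list(processed_df[["primary_key", "name_extract1", "address_extract1"]]
            .itertuples(index=False, name=None))  # plain tuples for the driver
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO preprocessed_name_and_address "
        "(primary_key, name_extract1, address_extract1) VALUES (%s, %s, %s)",
        rows,
    )
conn.commit()
conn.close()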
💡 Advanced Options (if performance needs grow):
Use Dask instead of Pandas for processing larger-than-memory datasets
Use PySpark if the processing gets truly massive
Use job queues (Celery, Prefect) for distributed processing
Apply NER models via spaCy inside process_df, as sketched below
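A sketch of the spaCy option (assumes the en_core_web_sm model is installed; spaCy has no built-in ADDRESS label, so GPE/LOC/FAC entities are only a rough proxy):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

def ner_split(text):
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    # Rough address proxy; a custom-trained component would do better
    places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")]
    return " ".join(names) or None, " ".join(places) or None
```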
✅ Summary
| Option | Best For | Verdict |
| --- | --- | --- |
| MySQL queries only | Simple filters, minimal logic | ❌ Not scalable or flexible |
| Python + Pandas | Complex logic, ML models, large data | ✅ Recommended |
| Python + Dask/PySpark | 10M+ records or big infra | 🟡 Only if needed |
Here is a detailed step-by-step plan to process your MySQL dataset using Python and Pandas. The plan is optimized for clarity, performance, and memory usage, and is suitable for presenting in a technical showcase.
🧩 Goal
Preprocess a MySQL table with up to 3 million records by:
Identifying:
Name-only fields
Address-only fields
Combined Name + Address fields
Splitting combined fields into separate name and address
Creating a new MySQL table:
preprocessed_name_and_address
Fields: primary_key, apn_key, name1, address1, name_and_address1, name_extract1, address_extract1, etc.
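A hypothetical DDL for the target table, run once through SQLAlchemy (column types are assumptions; adjust to your source schema):

```python
from sqlalchemy import text

with engine.begin() as conn:  # engine is created in Step 2 below
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS preprocessed_name_and_address (
            primary_key        BIGINT PRIMARY KEY,
            apn_key            BIGINT,
            name1              TEXT,
            address1           TEXT,
            name_and_address1  TEXT,
            name_extract1      TEXT,
            address_extract1   TEXT
        )
    """))
```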
✅ Step-by-Step Logic Using Python + Pandas
🔹 Step 1: Set Up Environment
```python
import pandas as pd
import mysql.connector  # driver behind the mysql+mysqlconnector SQLAlchemy URL
from sqlalchemy import create_engine
import re
```
✅ Use sqlalchemy for bulk inserts and memory-efficient chunked reading.
🔹 Step 2: Connect to MySQL
```python
DB_URI = "mysql+mysqlconnector://username:password@localhost:3306/db_name"
engine = create_engine(DB_URI)
```
🔹 Step 3: Load Data in Chunks
🔄 Read data in chunks to avoid memory overload.
```python
chunksize = 100_000  # Tune based on available memory
query = "SELECT primary_key, apn_key, name1, address1, name_and_address1 FROM original_table"
data_iter = pd.read_sql(query, engine, chunksize=chunksize)
```
🔹 Step 4: Define Preprocessing Rules
Use heuristic rules to classify and split fields.
```python
def is_likely_name(text):
    return bool(re.fullmatch(r"[A-Za-z\s.,'-]{3,100}", text or ""))

def is_likely_address(text):
    return bool(re.search(r"\d+|Street|St\.|Avenue|Ave|Lane|Ln|Road|Rd|Block|PO Box", str(text), re.IGNORECASE))

def split_name_address(text):
    # Simplified split logic; replace with a model later
    if "," in text:
        parts = [part.strip() for part in text.split(",")]
        if is_likely_name(parts[0]) and is_likely_address(",".join(parts[1:])):
            return parts[0], ",".join(parts[1:])
    return None, None
```
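A quick sanity check of these heuristics, using the example row from the output schema at the end:

```python
print(is_likely_name("John Smith"))                   # True
print(is_likely_address("21 Road Ave"))               # True
print(split_name_address("John Smith, 21 Road Ave"))  # ('John Smith', '21 Road Ave')
```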
🔹 Step 5: Process Each Chunk
```python
processed_chunks = []

for chunk in data_iter:
    chunk["name_extract1"] = None
    chunk["address_extract1"] = None
    for i, row in chunk.iterrows():
        combined = row.get("name_and_address1")
        # Split combined field
        if pd.notnull(combined):
            name, address = split_name_address(combined)
            chunk.at[i, "name_extract1"] = name
            chunk.at[i, "address_extract1"] = address
    processed_chunks.append(chunk)
```
🔹 Step 6: Create Final DataFrame
```python
final_df = pd.concat(processed_chunks, ignore_index=True)
```
🔹 Step 7: Save to MySQL
Use if_exists='replace' for the first run, 'append' for later runs.
```python
final_df.to_sql("preprocessed_name_and_address", con=engine, index=False, if_exists="replace")
```
🔧 Optimization Suggestions
💡 Processing
Prefer column-wise operations where possible (e.g., .apply() on whole columns instead of iterrows()), as sketched after these tips.
Replace regex splitting with an ML/NLP-based model (like spaCy NER) for better accuracy.
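For instance, the row loop in Step 5 can be collapsed into one pass over the column; a minimal sketch using the same column names:

```python
# split_name_address returns (name, address) tuples, unpacked into two columns.
# fillna("") keeps NaN values from reaching the string logic.
pairs = chunk["name_and_address1"].fillna("").apply(split_name_address)
chunk[["name_extract1", "address_extract1"]] = pd.DataFrame(
    pairs.tolist(), index=chunk.index)
```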
💡 Storage
Only select required columns when reading from MySQL.
Drop unused intermediate columns after processing.
💡 Memory
Use chunksize with care: monitor RAM usage via psutil or memory_profiler.
Explicitly call del chunk and gc.collect() between loop iterations if memory is tight, as sketched below.
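A minimal per-chunk sketch combining both tips; it writes each chunk immediately instead of concatenating at the end, reuses a per-chunk processing function like the process_df sketched earlier, and uses the third-party psutil package for monitoring:

```python
import gc
import os
import psutil

proc = psutil.Process(os.getpid())
for chunk in data_iter:
    processed = process_df(chunk)
    processed.to_sql("preprocessed_name_and_address", con=engine,
                     index=False, if_exists="append")
    del chunk, processed   # release references before the next batch
    gc.collect()
    print(f"RSS: {proc.memory_info().rss / 1e6:.0f} MB")  # per-batch memory check
```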
📝 Output Table Schema (Example)
| primary_key | apn_key | name1 | address1 | name_and_address1 | name_extract1 | address_extract1 |
| --- | --- | --- | --- | --- | --- | --- |
| 1001 | 503 | NULL | NULL | John Smith, 21 Road Ave | John Smith | 21 Road Ave |