
MySQL/Python for data extraction

Updated: May 16

Use case: processing up to 3 million records (30 lakh) in MySQL with name/address preprocessing and splitting logic. Here is a breakdown of the best approach based on performance, memory, scalability, and flexibility:




Recommended Approach: Use Python + Pandas + Batched MySQL Inserts





🔍 Why not MySQL-only (Option 1)?



  • SQL is not ideal for complex string operations like:


    • Named Entity Recognition

    • Address parsing and splitting logic


  • SQL-based regex is limited and hard to maintain for logic-heavy string matching

  • With stored procedures, the logic gets locked into the database and becomes harder to scale

  • Processing in the database causes higher DB load (slow and resource-heavy)





✅ Why Python (Option 2) is Better:


| Factor | Python Pandas + Batched Insert |
| --- | --- |
| String Processing | Rich (regex, ML, NER, fuzzy matching) |
| Memory Handling | Process in batches (e.g., 10k at a time) |
| Speed | Fast using vectorized ops (Pandas) |
| DB Load | Minimal if batched INSERTs are used |
| Flexibility | Easy to plug in spaCy NER or rules |
| Scalability | Can run on larger infra (Dask, Spark if needed) |
| Debuggability | Easier to inspect intermediate outputs |





✅ Proposed Workflow




Step 1: Batch Fetch from MySQL


import pandas as pd
import pymysql
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@host/db")

chunk_size = 10000
offset = 0
while True:
    query = f"SELECT * FROM raw_table LIMIT {chunk_size} OFFSET {offset}"
    df = pd.read_sql(query, engine)
    if df.empty:
        break

    # → Step 2: Process Data
    processed_df = process_df(df)

    # → Step 3: Insert into preprocessed table
    processed_df.to_sql('preprocessed_name_and_address', engine, if_exists='append', index=False)

    offset += chunk_size
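
Note: OFFSET pagination slows down as the offset grows, because MySQL still scans the skipped rows. If that becomes a bottleneck near 3 million records, keyset pagination on the primary key is a common alternative. A minimal sketch, assuming primary_key is an indexed integer column as in the schema later in this post:

last_key = 0
while True:
    query = (
        "SELECT * FROM raw_table "
        f"WHERE primary_key > {last_key} "
        f"ORDER BY primary_key LIMIT {chunk_size}"
    )
    df = pd.read_sql(query, engine)
    if df.empty:
        break

    processed_df = process_df(df)
    processed_df.to_sql('preprocessed_name_and_address', engine, if_exists='append', index=False)

    # Resume from the largest key seen so far instead of re-scanning skipped rows
    last_key = int(df['primary_key'].max())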




Step 2: Process Data (process_df function)



This is where you:


  • Detect if field is name-only, address-only, or name+address

  • Extract name_extract, address_extract using:


    • Rule-based splitting

    • Or a trained spaCy/NER model


  • Create new columns like name1, address1, name_and_address1, name_extract1, address_extract1, etc.



You can implement rules like:
def is_name(text):
    # Heuristic: short strings containing at least one letter look like names
    return len(text.split()) < 4 and any(char.isalpha() for char in text)

def is_address(text):
    # Heuristic: common address keywords indicate an address
    return any(keyword in text.lower() for keyword in ["street", "road", "avenue", "po", "city"])
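
Putting these rules together, a minimal rule-based process_df for the workflow above might look like the sketch below. It assumes the raw_table rows expose a name_and_address1 column, as in the schema used later in this post; no NER yet:

def process_df(df):
    """Rule-based sketch: classify each value and split combined name/address fields."""
    names, addresses = [], []
    for text in df["name_and_address1"].fillna(""):
        if "," in text:
            # Treat "something, rest" as a possible name + address pair
            head, tail = text.split(",", 1)
            if is_name(head) and is_address(tail):
                names.append(head.strip())
                addresses.append(tail.strip())
                continue
        # Fall back to single-field classification
        names.append(text.strip() if is_name(text) else None)
        addresses.append(text.strip() if is_address(text) else None)
    df["name_extract1"] = names
    df["address_extract1"] = addresses
    return df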




Step 3: Write to Final Table



Once processed, append to MySQL in batches using pandas.to_sql() or raw INSERT for better control.

processed_df.to_sql("preprocessed_name_and_address", engine, if_exists="append", index=False)

You can also use executemany for higher speed if needed.
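
For example, a sketch of a batched executemany insert through the same SQLAlchemy engine (the column list here is illustrative; adjust it to match the processed DataFrame):

# Raw DBAPI connection from the SQLAlchemy engine, so executemany is available
conn = engine.raw_connection()
try:
    cursor = conn.cursor()
    insert_sql = (
        "INSERT INTO preprocessed_name_and_address "
        "(primary_key, apn_key, name_extract1, address_extract1) "
        "VALUES (%s, %s, %s, %s)"
    )
    # NOTE: convert NaN values to None beforehand if the DataFrame has missing data
    rows = processed_df[
        ["primary_key", "apn_key", "name_extract1", "address_extract1"]
    ].values.tolist()
    cursor.executemany(insert_sql, rows)
    conn.commit()
finally:
    conn.close()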




💡 Advanced Options (if performance needs grow):



  • Use Dask instead of Pandas for processing larger-than-memory datasets (a sketch follows this list)

  • Use PySpark if the processing gets truly massive

  • Use job queues (Celery, Prefect) for distributed processing

  • Apply NER models via spaCy inside process_df
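
For reference, a rough Dask sketch of the same chunked pattern (argument names vary slightly between Dask versions; primary_key is assumed to be an indexed numeric column):

import dask.dataframe as dd

# Read the table lazily, partitioned on the indexed primary key
ddf = dd.read_sql_table("raw_table", "mysql+pymysql://user:pass@host/db",
                        index_col="primary_key")

# Apply the same per-chunk logic to each partition
processed = ddf.map_partitions(process_df)

# Write back; Dask handles the partitions one at a time
processed.to_sql("preprocessed_name_and_address",
                 "mysql+pymysql://user:pass@host/db",
                 if_exists="append", index=False)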





✅ Summary


| Option | Best For | Verdict |
| --- | --- | --- |
| MySQL queries only | Simple filters, minimal logic | ❌ Not scalable or flexible |
| Python + Pandas | Complex logic, ML models, large data | ✅ Recommended |
| Python + Dask/PySpark | 10M+ records or big infra | 🟡 Only if needed |





Here is a detailed, step-by-step plan for processing your MySQL dataset using Python and Pandas. The plan is optimized for clarity, performance, and memory usage, and is suitable for presenting in a technical showcase.




🧩 Goal



Preprocess a MySQL table with up to 3 million records by:


  1. Identifying:


    • Name-only fields

    • Address-only fields

    • Combined Name + Address fields


  2. Splitting combined fields into separate name and address

  3. Creating a new MySQL table:


    • preprocessed_name_and_address

    • Fields: primary_key, apn_key, name1, address1, name_and_address1, name_extract1, address_extract1, etc.






✅ Step-by-Step Logic Using Python + Pandas





🔹 Step 1: Set Up Environment


import pandas as pd
import mysql.connector  # DBAPI driver behind the mysqlconnector SQLAlchemy dialect used below
from sqlalchemy import create_engine
import re

✅ Use SQLAlchemy for bulk inserts and memory-efficient reading.



🔹 Step 2: Connect to MySQL


DB_URI = "mysql+mysqlconnector://username:password@localhost:3306/db_name"
engine = create_engine(DB_URI)




🔹 Step 3: Load Data in Chunks



🔄 Read data in chunks to avoid memory overload.

chunksize = 100_000  # Tune based on available memory
query = "SELECT primary_key, apn_key, name1, address1, name_and_address1 FROM original_table"

data_iter = pd.read_sql(query, engine, chunksize=chunksize)




🔹 Step 4: Define Preprocessing Rules



Use heuristic rules to classify and split fields.

def is_likely_name(text):
    return bool(re.fullmatch(r"[A-Za-z\s.,'-]{3,100}", text or ""))

def is_likely_address(text):
    return bool(re.search(r"\d+|Street|St\.|Avenue|Ave|Lane|Ln|Road|Rd|Block|PO Box", str(text), re.IGNORECASE))

def split_name_address(text):
    # Simplified split logic — replace with model later
    if "," in text:
        parts = [part.strip() for part in text.split(",")]
        if is_likely_name(parts[0]) and is_likely_address(",".join(parts[1:])):
            return parts[0], ",".join(parts[1:])
    return None, None
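
A quick sanity check of these rules against the sample row shown in the output schema at the end of this post:

print(is_likely_name("John Smith"))                   # True
print(is_likely_address("21 Road Ave"))               # True
print(split_name_address("John Smith, 21 Road Ave"))  # ('John Smith', '21 Road Ave')
print(split_name_address("21 Road Ave"))              # (None, None) -- no comma to split on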




🔹 Step 5: Process Each Chunk


processed_chunks = []

for chunk in data_iter:
    chunk["name_extract1"] = None
    chunk["address_extract1"] = None

    for i, row in chunk.iterrows():
        combined = row.get("name_and_address1")

        # Split combined field
        if pd.notnull(combined):
            name, address = split_name_address(combined)
            chunk.at[i, "name_extract1"] = name
            chunk.at[i, "address_extract1"] = address

    processed_chunks.append(chunk)
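
The row loop above is easy to follow but slow on large chunks. The same split can be written with apply(), as suggested in the optimization notes further down. A sketch (re-creating the chunk iterator, since data_iter can only be consumed once):

processed_chunks = []

for chunk in pd.read_sql(query, engine, chunksize=chunksize):
    # apply() returns (name, address) tuples; zip(*...) unpacks them into two columns
    extracted = chunk["name_and_address1"].fillna("").apply(split_name_address)
    chunk["name_extract1"], chunk["address_extract1"] = zip(*extracted)
    processed_chunks.append(chunk)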




🔹 Step 6: Create Final DataFrame


final_df = pd.concat(processed_chunks, ignore_index=True)




🔹 Step 7: Save to MySQL



Use if_exists='replace' for the first run and 'append' for subsequent runs.

final_df.to_sql("preprocessed_name_and_address", con=engine, index=False, if_exists="replace")




🔧 Optimization Suggestions




💡 Processing



  • Use vectorized operations where possible (e.g., apply() on columns instead of iterrows).

  • Replace regex splitting with an ML/NLP-based model (such as spaCy NER) for better accuracy, as sketched below.
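
If the heuristics are not accurate enough, a pretrained spaCy pipeline can supply the entities. A rough sketch of a model-based splitter, assuming the en_core_web_sm model is installed (the stock model has no street-level address label, so a custom-trained NER is usually needed for full addresses):

import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def split_name_address_ner(text):
    doc = nlp(text or "")
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    # GPE/LOC/FAC cover cities, locations, and facilities; street-level parts
    # usually need a custom-trained model
    addr_parts = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC", "FAC")]
    name = names[0] if names else None
    address = ", ".join(addr_parts) if addr_parts else None
    return name, address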




💡 Storage



  • Only select required columns when reading from MySQL.

  • Drop unused intermediate columns after processing.




💡 Memory



  • Use chunksize with care: monitor RAM usage via psutil or memory_profiler.

  • Explicitly call del chunk and gc.collect() between loop iterations if memory is tight.
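
If holding every processed chunk for the final concat uses too much memory, an alternative is to write each chunk to MySQL as soon as it is processed, keeping only one chunk in memory at a time. A sketch combining the suggestions above:

import gc

for i, chunk in enumerate(pd.read_sql(query, engine, chunksize=chunksize)):
    extracted = chunk["name_and_address1"].fillna("").apply(split_name_address)
    chunk["name_extract1"], chunk["address_extract1"] = zip(*extracted)

    # Replace the table on the first chunk, append afterwards
    mode = "replace" if i == 0 else "append"
    chunk.to_sql("preprocessed_name_and_address", con=engine, index=False, if_exists=mode)

    del chunk
    gc.collect()  # reclaim memory before fetching the next chunk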





📝 Output Table Schema (Example)


| primary_key | apn_key | name1 | address1 | name_and_address1 | name_extract1 | address_extract1 |
| --- | --- | --- | --- | --- | --- | --- |
| 1001 | 503 | NULL | NULL | John Smith, 21 Road Ave | John Smith | 21 Road Ave |




