Python and MySQL

I was given a small task to write a Python/MySQL application. The instructions were as follows:

Given a table ‘mailing’:

CREATE TABLE mailing (addr VARCHAR(255) NOT NULL );

The mailing table will initially be empty.  New addresses will be added on a daily basis.  It is expected that the table will store at least 10,000,000 email addresses and 100,000 domains.

Write a Python script that updates another table which holds a daily count of email addresses by their domain name.

Use this table to report the top 50 domains by count sorted by percentage growth of the last 30 days compared to the total.

** NOTE **
– The original mailing table should not be modified.
– All processing must be done in Python (e.g. no complex queries or sub-queries)

The ‘mailing’ table did not have a ‘date’ column, so I had to assume that this field existed; otherwise there would be no way to determine when an email arrived. MySQL does not keep an internal record of when a row was added. So I created the following table:

CREATE TABLE Mailing (entryDate DATE NOT NULL, addr VARCHAR(255) NOT NULL);
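
The assignment does not specify a schema for the daily-count table, so the layout below is my own choice; the table name domain_counts, its columns and the connection details are all assumptions. A minimal sketch using MySQL Connector/Python:

    import mysql.connector

    # Hypothetical daily-count table; one row per (day, domain).
    conn = mysql.connector.connect(user='root', password='secret',
                                   host='localhost', database='maildb')
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS domain_counts ("
        "  countDate DATE NOT NULL,"
        "  domain VARCHAR(255) NOT NULL,"
        "  tally INT NOT NULL,"
        "  PRIMARY KEY (countDate, domain))")
    conn.commit()
    conn.close()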

My thinking was that they wanted most of the heavy lifting done in Python, so I tested various Python scripts that read all the records from a CSV dump file, processed the data and then created a report. I tested with 10,000,000 emails. The format of the CSV file was <entrydate>,<addr> (e.g. '20150520','testing@hotmail.com').
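
Pulling the domain out of each address is just a string split. A minimal sketch of the per-row parsing, assuming the single-quoted dump format shown above (the file name is also an assumption):

    import csv

    def parse_row(row):
        """Turn one ('20150520','testing@hotmail.com') row into (date, domain)."""
        entry_date, addr = [field.strip("'") for field in row]
        domain = addr.rsplit('@', 1)[1].lower()   # text after the last '@'
        return entry_date, domain

    with open('mailing_dump.csv', newline='') as f:
        for row in csv.reader(f):
            entry_date, domain = parse_row(row)
            # ... aggregate counts per (date, domain) here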

Some relevant information about the environment I tested on:

Python version: 3.4.2
MySQL version: 5.6.2 (MySQL Community Server)
Operating System: Windows 8.1 64-bit, Core i7 with 16 GB RAM

The first script was a single-threaded application that read the entire CSV file, processed the data and then created a report. The 10 million e-mail records used about 220 MB of hard drive space, and this method took about 5 minutes to process, which was too long. The bottleneck was I/O (disk reading and writing).
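
The core of that first script was nothing more than a read-and-count loop. A simplified sketch, reusing the parsing and the domain_counts layout assumed above:

    import csv
    from collections import Counter

    counts = Counter()   # (entryDate, domain) -> number of addresses

    with open('mailing_dump.csv', newline='') as f:   # file name assumed
        for entry_date, addr in csv.reader(f):
            domain = addr.strip("'").rsplit('@', 1)[1].lower()
            counts[(entry_date.strip("'"), domain)] += 1

    # counts now feeds both the INSERTs into domain_counts and the report.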

The second script was a multi-threaded application that still read all the information into memory first. One thread took the list, processed the data and then used a bulk INSERT to write it into MySQL. Another thread used the same list to create the report. This reduced the time to about 1 minute 20 seconds. However, I felt I could get a better time: I still had only one thread writing to MySQL, and my thought was that multiple threads writing to the database would increase efficiency. At least, that was the thought.
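
Roughly, the two-thread version looked like the sketch below. The connection details are placeholders, and the report thread here just prints plain top-50 counts rather than the full growth ranking:

    import csv
    import threading
    from collections import Counter
    import mysql.connector

    # Read the whole dump into memory first, as described above.
    with open('mailing_dump.csv', newline='') as f:
        rows = [(d.strip("'"), a.strip("'")) for d, a in csv.reader(f)]

    def insert_worker():
        # One thread bulk-inserts the raw rows into the Mailing table.
        conn = mysql.connector.connect(user='root', password='secret',
                                       database='maildb')
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO Mailing (entryDate, addr) VALUES (%s, %s)", rows)
        conn.commit()
        conn.close()

    def report_worker():
        # The other thread builds the report from the same in-memory list.
        counts = Counter(addr.rsplit('@', 1)[1].lower() for _, addr in rows)
        for domain, n in counts.most_common(50):
            print(domain, n)

    threads = [threading.Thread(target=insert_worker),
               threading.Thread(target=report_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()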

The MySQL connector for Java has a LOAD DATA INFILE facility that acts as a high-speed bulk load; the Python MySQL connector does not. Instead, I used Python's memory mapping to read the file and then wrote the data to the database directly from memory. In other words, I bypassed bulk loading: I never wrote the data out to an intermediate file and then bulk-loaded that file into the database.
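
The memory-mapped read looks roughly like this; insert_batch is a hypothetical helper standing in for the executemany() call shown earlier:

    import mmap

    def insert_batch(batch):
        """Hypothetical helper wrapping cursor.executemany() for one chunk."""
        ...

    with open('mailing_dump.csv', 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        batch = []
        for line in iter(mm.readline, b''):        # b'' marks the end of the map
            entry_date, addr = line.rstrip(b'\r\n').decode('ascii').split(',', 1)
            batch.append((entry_date.strip("'"), addr.strip("'")))
            if len(batch) >= 50000:                # flush in large chunks
                insert_batch(batch)
                batch = []
        if batch:
            insert_batch(batch)
        mm.close()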

The modified Python script now reads 10,000,000 emails, writes the data to MySQL and then generates a report, all in about 20 seconds.
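
For the report step, which per the rules stays in Python (the only SQL is a plain SELECT, no sub-queries), my reading of ‘percentage growth of the last 30 days compared to the total’ is the last-30-day count as a share of each domain's all-time total. A rough sketch against the domain_counts layout assumed earlier:

    from datetime import date, timedelta
    import mysql.connector

    conn = mysql.connector.connect(user='root', password='secret',
                                   database='maildb')
    cur = conn.cursor()
    cur.execute("SELECT countDate, domain, tally FROM domain_counts")

    totals, recent = {}, {}
    cutoff = date.today() - timedelta(days=30)
    for count_date, domain, tally in cur:
        totals[domain] = totals.get(domain, 0) + tally
        if count_date >= cutoff:
            recent[domain] = recent.get(domain, 0) + tally

    # Rank by last-30-day share of the total and keep the top 50.
    top = sorted(totals, key=lambda d: recent.get(d, 0) / totals[d], reverse=True)
    for domain in top[:50]:
        print(domain, totals[domain],
              round(100.0 * recent.get(domain, 0) / totals[domain], 2))
    conn.close()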

I have added a zip file with all the information you will need to try this yourself. I would like to know your approach, so please leave your comments or post a revised copy.

    Python/MySQL Integration