Cleaning Data with Python and Excel: A Brief Example

I want to show a simple, real world problem and solution I made today with Python.

Problem: My data has been corrupted (over 8,000 rows in Excel). Somewhere in my haste, I have scrambled first names and last names in the ‘first name’ and ‘last name’ field.

office_scrambled

Solution: Find unique words from both fields, which leads to unique first and last names.

Copy the columns from Excel into a Notepad text file. Then run the below code on the text file in command prompt.

run_command

with open('file_name.txt',"r") as f:
    lines = f.readlines()
    lines = [line.split() for line in lines]
    new_line = list()
    edited_lines = list()
    for line in lines:
        # keep only the first appearance of a word on each line
        new_line = [word for word in words if word.strip() not in new_line]
        edited_lines.append(new_line)

with open("edited_lines.txt",'w) as f:
    for line in edited_lines:
        line = ' '.join(line) +'\n'
        f.write(line)
Output
unique_words

As you can see to the right, this has done a good job of repairing the damage, but we still need to get separation of first and last names. Luckily, I had already written some very ugly code tthat I probably shoudn’t share (oh well!) that tries to match first and last names to columns appropriately and insert a semi-colon as a delimiter. Here is the code.

form_TextNote: You may need to use Excel’s “From Text” feature if pasting in data from text files does not work.

Once we paste the data from our program in, we can use Excel’s Text to Columns feature and split on the semi-colon delimiter.
After running our new file through the code linked above, we arrive at an output closer to what is desired: split name columns. A bug in my name_columnizer.py program accidentally inserted an extra ‘&’ into some records. (I’m not perfect :)) Thankfully it could be fixed by a simple find and replace in Excel.
find_replace_2
This takes us to the below data, which is not ideal but suited my purposes for the job at hand.
finish
I enjoyed this, because it was a custom solution to a problem created by my own carelessness in Excel. But I fixed it using relatively few lines of code. I want to start sharing more actual code on this blog so here is a start. I strive to be ‘Pythonic’, but my code is not always the best way. Often, it’s far from it. Sometimes I just write code until I find a solution. I try to solve problems and learn to the best of my ability.
Hope you enjoyed 😀
This solution was used before I discovered the Pandas library, which I highly recommend looking into for doing Excel tasks with Python.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.