This is the second of a series of posts as a part of my daily blogging journey. The goal is simple – write and post daily for as long as I can or $\geq 60$ days. Longer articles have been moved to “Essays”. To receive updates as and when I post something new, make sure you subscribe to my newsletter. The sign up form is at the end of this article. Without further ado, let’s get started.

I want to talk about Chinese Kiwis in this article. What is the first thing that you thought after reading that line? Some sort of fruit that has to do with China? Well, you’re not the first. I was talking to some friends about it, and they thought somewhat similar, lol. I wish to clarify a few things in this article. I’ll also talk about Python. No, not the snake. I meant the programming language. Specifically, I will talk about a project I did recently.

So first, about Chinese Kiwis. Well, I’m definitely not talking the kiwi fruit. Yes, sure it is tasty but that’s not the point. I’m not even talking about the bird. So, wtf am I talking about? I’m talking about the people of New Zealand. Yeah, they are called Kiwis, if you did not know.

What does Chinese have to do with New Zealand? Oh, that. By that I mean people of Chinese decent. So, what exactly is a Chinese-Kiwi? It would be more appropriate to ask, “Who is a Chinese-Kiwi?” It is easy to make out by this point, a Chinese-Kiwi is a person of Chinese decent but who holds a New Zealand citizenship. Think of it like Shirley Setia but Chinese.

What is the point of this? Trust me there is a point. I wrote about the two lifelong dreams that I here. I’ll just list them here again, and add a third one.

  1. To officiate someone else’s wedding just so that I can say, “You may start by kissing the bride” before anything else.
  2. I want to hand cash out to people instead of bullshit welfare schemes and witness more humans do amazing things.
  3. Fall in love with a Chinese-Kiwi, and have a great life.

Am I serious? No, not really. But, does it hurt to build castles in the air? I don’t think so. I find Chinese-Kiwis cute. Very specific, you might think. Look man, what else is a 21-year-old who isn’t getting called back to campus supposed to do? You might ask, “Do you casually talk about getting married to your friends?” Yes, I do. Hentai too. Didn’t get the reference? Watch this.

Are you speechless yet? My friends were. I got responses like:

  • “You want to get married to a Chinese-Kiwi? That’s sad.”
  • Lol.
  • “New Zealand kab jayega?” Soon?

Okay, enough about the Chinese-Kiwis. Oh by the way, Jacinda Ardern is awesome. :heart: We’ll see what happens when New Zealand opens up its borders again. Let me talk about this Python project I did recently. The goal was to take a messy .txt file, clean it, and write the clean data to a .csv file.

Sounds simple right? Well, not quite. It turns out the data itself is in tab-separated format. All is well so far, but the real problem arises because of the way the data is entered. Let me explain. The length of the rows were unequal, in that some had only 10 columns, while some others had 15 columns. That makes things tricky. Because, it means lots of exceptions while extracting the data.

As an example, take a look at these two rows. Try to write an algorithm to:

  1. Remove numbers from first name.
  2. Get the last name.
  3. Get address, city, zip code.
  4. Get year of birth, gender, party affiliation.
['413594258HOSSAIN', 'TASFIA', '39-06', '5A', '63', 'STREET', 'WOODSIDE', '113771137', '20010907FDEM', '07134142612021120191127A', 'R201912022020646-462-8154', 'tasfia971@gmail.com']

['413396645ALI', 'SABAH', 'M', '1967', '2A', 'OCEAN', 'AVENUE', 'BROOKLYN', '112301123', '19870122FBLK', '00141094817080220190312A', 'R', '2020']

It would have been a very simple solution if the rows were uniform. But, they are not. So, how did I solve it? I tried looking for something that’s common in every row, and then see what to do from there.

It turns out data like 20010907FDEM is common in every row. This data is YYYYMMDD[Gender][party]. Once I figured out the index of this data in every row, I can extract useful info relative to this index.

I’m attaching the final clean data file, and the initial messy text file. The entire solution is given below. You can also download this .py file if you want to run it yourself.

If you think you have a more efficient solution to solve the problem, please let me know. You can send me an email or a DM on Twitter.

import csv
import re

# open a new file so that we can write clean data
file = open('clean_data.csv', 'w', newline='')

# open the messy text file

# enter the path where you have the messy file
with open("/Users/reang/Documents/BK1.txt") as tsv:
    with file:
        writer = csv.writer(file)
        # write the column names
        writer.writerow(["first_name", "last_name", "address", "city", "zip_code", "year_born", "gender", "party"])

        # clean the messy data line by line
        for line in csv.reader(tsv, delimiter="\t"):
            raw_data = re.sub(' +', ' ', line[0]).strip()
            raw_data = raw_data.split()
            print(raw_data)
            
            # since the party names are sort of unique, it makes sense to use this
            matching_dems = [s for s in raw_data if "DEM" in s]
            matching_reps = [s for s in raw_data if "REP" in s]
            matching_blks = [s for s in raw_data if "BLK" in s]
            matching_gres = [s for s in raw_data if "GRE" in s]
            matching_inds = [s for s in raw_data if "IND" in s]

            # find parties
            if len(matching_dems) != 0:
                index = raw_data.index(matching_dems[0])
            elif len(matching_reps) != 0:
                index = raw_data.index(matching_reps[0])
            elif len(matching_blks) != 0:
                index = raw_data.index(matching_blks[0])
            elif len(matching_gres) != 0:
                index = raw_data.index(matching_gres[0])
            elif len(matching_inds) != 0:
                index = raw_data.index(matching_inds[0])
            
            # once you find the column containing party info, extract other useful info
            first_name = ''.join([i for i in raw_data[0] if not i.isdigit()])
            last_name = raw_data[1].strip()
            address = ' '.join(raw_data[2:index-2]).strip()
            city = raw_data[index-2]
            zip_code = raw_data[index-1][:5]
            year_born = raw_data[index][:4]
            gender = raw_data[index][8]
            party = raw_data[index][-3:]


            # write the extracted clean info to a CSV file
            writer.writerow([first_name, last_name, address, city, zip_code, year_born, gender, party])

If you found this post or even parts of it useful then you should subscribe to the newsletter. Just enter your email address below, and hit that subscribe button. I promise that I’ll not spam your inbox.

You should leave an anonymous feedback here.


For suggestions, corrections etc. please send me an email, or reach out on Twitter. Please read this before contacting me.