A bunch of tab-delimited text files arrived on my FTP server the other day. The data was in 3NF and my small model was largely denormalised, so I only needed a small fraction of it, which was to be converted into a JSON fixture. I thought about using external tables to load the data into an RDBMS and simply selecting what I wanted. However, some of the files were positively bursting with data, at 130+ columns and around 900,000 rows, and I'm far too lazy to map 130 columns to an external table definition.
Luckily, a short Python program will do the trick, and it's one hell of a lot more fun to write.
Skipping to the interesting bit, I had a function to convert CSV to a list of dicts:
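import csv
def csv2dict2(file_, delimiter):
    reader = csv.reader(open(file_, 'rb'), delimiter=delimiter)
    col_names = reader.next()
    # col_names.index(col) searches the header list again for every cell
    return [dict([(col.lower(), row[col_names.index(col)])
                  for col in col_names]) for row in reader]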
It performed acceptably for files up to about 5000 rows, but was frustratingly slow for larger files. Using Python's cProfile library I was able to see why:
import cProfile
cProfile.run("csv2dict2('BigFile.txt', '\t')")
Resulted in:
13636726 function calls in 41.319 CPU seconds
Ordered by: standard name
 ncalls  tottime  percall  filename:lineno(function)
      1    0.786   41.319  <string>:1(<module>)
      1   16.271   40.533  autopinions2.py:40(csv2dict2)
      1    0.000    0.000  {method 'disable' of '...
6818361   20.746    0.000  {method 'index' of 'list'...
6818361    3.515    0.000  {method 'lower' of 'str' ...
      1    0.000    0.000  {open}
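With the default "standard name" ordering the expensive calls sit in the middle of the listing. As a rough sketch (Python 3 syntax, with a made-up stand-in workload rather than the real file), the pstats module can sort the same kind of report by tottime so the hot spot comes first. The slow_lookup function and its data are assumptions for illustration only.

```python
import cProfile
import io
import pstats


def slow_lookup(col_names, rows):
    """Hypothetical stand-in workload: one list.index call per cell."""
    return [[row[col_names.index(col)] for col in col_names] for row in rows]


# Made-up data: 50 columns, 200 rows.
col_names = ["c%d" % i for i in range(50)]
rows = [list(range(50)) for _ in range(200)]

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
slow_lookup(col_names, rows)
profiler.disable()

# Write the report to a string, sorted by time spent inside each function.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(5)
print(report.getvalue())
```

Sorting by tottime ranks functions by the time spent in their own bodies, which is the column that gave the game away in the output above.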
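Since the whole point of the function was to convert the CSV to a data structure highly optimised for referencing values by name, I really should have smacked myself about the head. The profile shows index being called 6,818,361 times, once per cell, and each call is a linear scan of the column-name list. That alone accounts for 20.7 of the 41.3 seconds.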
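To put a number on that, here is a small sketch (Python 3 syntax) that counts how many list items a front-to-back search examines when every column of one row is looked up by name, next to a dict built once. The linear_index helper is hypothetical and only mimics what list.index does, the column names are invented, and the figure of 107 columns is inferred from the number of index calls in the second profile.

```python
def linear_index(items, target):
    """Scan from the front like list.index; return (position, items examined)."""
    for position, item in enumerate(items):
        if item == target:
            return position, position + 1
    raise ValueError("%r is not in list" % (target,))


# Hypothetical header row with 107 columns.
col_names = ["COL%d" % i for i in range(107)]

# Looking up every column once, as the first version does for each row.
examined_per_row = sum(linear_index(col_names, col)[1] for col in col_names)

# A dict built once gives the same positions with one hash lookup each.
positions = {col: i for i, col in enumerate(col_names)}
same_answers = all(positions[col] == col_names.index(col) for col in col_names)

print(examined_per_row)  # 5778 items examined per row
print(len(positions))    # 107 dict entries, built once
print(same_answers)      # True
```

So each row costs thousands of comparisons with the list search, against 107 constant-time lookups with the dict.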
In version 2, I used a dict to look up each column's position by its name (the same principle of using the highly optimised data structure), and for good measure moved the lower call into the list comprehension that generates the dict, so it runs once per column rather than once per cell.
def csv2dict(file_, delimiter):
    reader = csv.reader(open(file_, 'rb'), delimiter=delimiter)
    col_names = reader.next()
    # create a dictionary such that {column_name: index position}
    col_names = dict([(col.lower(), col_names.index(col))
                      for col in col_names])
    return [dict([(col, row[col_names[col]]) for col in col_names])
            for row in reader]
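For anyone on Python 3, the same idea can be sketched without any position lookup at all by zipping the lower-cased header against each row. This is an alternative I'm assuming would behave equivalently, not the function profiled here. The csv_to_dicts name and the sample data are made up for illustration.

```python
import csv
import io


def csv_to_dicts(lines, delimiter="\t"):
    """Return one dict per data row, keyed by lower-cased column name."""
    reader = csv.reader(lines, delimiter=delimiter)
    # Lower-case the header once, not once per cell.
    col_names = [col.lower() for col in next(reader)]
    # zip pairs each name with the value in the same position.
    return [dict(zip(col_names, row)) for row in reader]


# Hypothetical tab-delimited sample standing in for a real file.
sample = io.StringIO("Make\tModel\tYear\nFord\tFiesta\t2008\nFiat\tPanda\t2006\n")
records = csv_to_dicts(sample)
print(records[0])  # {'make': 'Ford', 'model': 'Fiesta', 'year': '2008'}
```

For a real file, passing open(path, newline='') in place of the StringIO object should work the same way, since csv.reader accepts any iterable of lines.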
Running cProfile again:
218 function calls in 8.574 CPU seconds
Ordered by: standard name
 ncalls  tottime  percall  filename:lineno(function)
      1    0.405    8.574  <string>:1(<module>)
      1    8.168    8.168  autopinions2.py:28(csv2dict)
      1    0.000    0.000  {method 'disable' of ...
    107    0.000    0.000  {method 'index' of 'list' ...
    107    0.000    0.000  {method 'lower' of 'str'...
      1    0.000    0.000  {open}
Thanks to cProfile, the second version provides more than acceptable performance for my one-shot job: 8.6 seconds against 41.3 in the profiles above.
BTW, I really have to move to a left nav layout so the code doesn't look so crap...
