incorporating dejaMoo: best of breed bull…

Text processing fun

Using cProfile for optimisation

A bunch of tab-delimited text files arrived on my FTP server the other day. Since the data was in 3NF, and my small model was largely denormalised, I only needed a small fraction of the data, which was to be converted into a JSON fixture. I thought about using external tables to load the data into an RDBMS and simply selecting what I wanted. However, some of the files were positively bursting with data, at 130+ columns and around 900,000 rows, and I'm far too lazy to map 130 columns to an external table definition.

Luckily, a short Python program will do the trick, and it's one hell of a lot more fun to write.

Skipping to the interesting bit, I had a function to convert CSV to a list of dicts:

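   import csv

   def csv2dict2(file_, delimiter):
       # Python 2: the csv module wants the file opened in binary mode
       reader = csv.reader(open(file_, 'rb'), delimiter=delimiter)
       col_names = reader.next()  # the first row holds the column names
       return [dict([(col.lower(), row[col_names.index(col)])
                     for col in col_names]) for row in reader]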
   

It performed acceptably for files up to about 5000 rows, but was frustratingly slow for larger files. Using Python's cProfile library I was able to see why:

   import cProfile
   cProfile.run("csv2dict2('BigFile.txt', '\t')")
   

Resulted in:

         13636726 function calls in 41.319 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  filename:lineno(function)
        1    0.786   41.319  <string>:1(<module>)
        1   16.271   40.533  autopinions2.py:40(csv2dict2)
        1    0.000    0.000  {method 'disable' of '...
  6818361   20.746    0.000  {method 'index' of 'list'...
  6818361    3.515    0.000  {method 'lower' of 'str' ...
        1    0.000    0.000  {open}
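
A listing sorted by name buries the expensive call mid-table. As a rough sketch, assuming Python 3, the same kind of profile can be sorted by the time spent inside each function using pstats. The function slow_lookup is a made-up stand-in for the hot loop, not the code from this post:

```python
import cProfile
import io
import pstats


def slow_lookup(n_rows=2000, n_cols=50):
    """Hypothetical stand-in for the hot loop: one list.index() call per cell."""
    col_names = ["col%d" % i for i in range(n_cols)]
    row = list(range(n_cols))
    return [[row[col_names.index(c)] for c in col_names] for _ in range(n_rows)]


profiler = cProfile.Profile()
profiler.enable()
slow_lookup()
profiler.disable()

# Sort by time spent inside each function itself ("tottime"), show the top 5.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

With this ordering the worst offender sits at the top of the table instead of somewhere in the middle.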

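The culprit is the list's index method: it is called once per cell, 6,818,361 times in all, and every call is a linear scan of the column names. That alone accounts for 20.7 of the 41.3 seconds. Since the whole point of the function was to convert the CSV to a data structure highly optimised for referencing values by name, I really should have smacked myself about the head.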
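
To make the cost concrete, here is a rough sketch (Python 3) comparing a positional lookup via list.index with a lookup in a pre-built dict. The 130-column width mirrors the files described above; the column names and the repeat count are made up for illustration, and the timings will differ from machine to machine:

```python
import timeit

# 130 hypothetical column names, matching the width of the widest files.
col_names = ["col%03d" % i for i in range(130)]
# Built once: maps each name to its position.
col_index = {name: pos for pos, name in enumerate(col_names)}


def lookup_with_list():
    # list.index scans from the front until it finds a match: O(n) per call.
    return [col_names.index(name) for name in col_names]


def lookup_with_dict():
    # A dict lookup hashes the key: O(1) on average per call.
    return [col_index[name] for name in col_names]


list_seconds = timeit.timeit(lookup_with_list, number=200)
dict_seconds = timeit.timeit(lookup_with_dict, number=200)
print("list.index: %.4fs  dict: %.4fs" % (list_seconds, dict_seconds))
```

Both functions return the same positions; only the time taken to find them differs.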

In version 2, I used a dict to look up each column's position by its name (the same principle of using the highly optimised data structure), and for good measure moved the call to lower into the list comprehension that generates the dict, so it runs once per column rather than once per cell.

   def csv2dict(file_, delimiter):
       reader = csv.reader(open(file_, 'rb'), delimiter=delimiter)
       col_names = reader.next()
       # create a dictionary such that {column_name: index position}
       col_index = dict([(col.lower(), col_names.index(col))
                         for col in col_names])
       return [dict([(col, row[col_index[col]]) for col in col_index])
               for row in reader]
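
A related sketch, not the code I actually ran: the positional lookup can be dropped altogether by pairing the header with each row using zip. This assumes Python 3 (text input and next(reader) rather than reader.next()); the name csv_to_dicts and the sample data are hypothetical:

```python
import csv
import io


def csv_to_dicts(lines, delimiter="\t"):
    """Return one dict per data row, keyed by lower-cased column name."""
    reader = csv.reader(lines, delimiter=delimiter)
    # Lower-case the header once, not once per cell.
    col_names = [col.lower() for col in next(reader)]
    # zip pairs each name with the value in the same position, so no
    # index lookup is needed at all.
    return [dict(zip(col_names, row)) for row in reader]


# Hypothetical sample standing in for one of the tab-delimited files.
sample = io.StringIO(
    "Make\tModel\tYear\nFord\tFalcon\t1998\nHolden\tCommodore\t2004\n"
)
rows = csv_to_dicts(sample)
print(rows[0]["model"])
```

For a real file, the object returned by open(path, newline="") can be passed in place of the StringIO.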
   

Running cProfile again:

         218 function calls in 8.574 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  filename:lineno(function)
        1    0.405    8.574  <string>:1(<module>)
        1    8.168    8.168  autopinions2.py:28(csv2dict)
        1    0.000    0.000  {method 'disable' of ...
      107    0.000    0.000  {method 'index' of 'list' ...
      107    0.000    0.000  {method 'lower' of 'str'...
        1    0.000    0.000  {open}

Thanks to cProfile, the second version runs in 8.6 seconds instead of 41.3, which is more than acceptable performance for my one-shot.

BTW, I really have to move to a left nav layout so the code doesn't look so crap...

This is the personal website of Cam MacRae. Any opinions expressed here are entirely my own, and have jack to do with my employer.

It's the product of a little elbow grease, the news.ycombinator noprocrast feature, and a healthy dose of Django.

Creative Commons License This work is licensed under a
Creative Commons Attribution-Share Alike 3.0 Unported License.