Code in The OpenAddress Machine

by

Nelson's log

I’m trying to help with the OpenAddress project, particularly Mike Migurski’s “machine” code for running a full analysis of the data, and more particularly still the ditch-node branch. I managed to get the tests running after some version hassle.

Now I’m taking inventory of what all code is in the repo to understand where to work. These notes are probably only useful right about now (December 2014) and only for the ditch-node branch.

openaddr/*.py

This is the big thing, about 1700 lines of Python code. The architecture is a bit confusing. I don’t really understand how it’s managing threads / processes for parallelism. Also the actual ETL code is a bit tangled up with the download and cache code, including S3.

  • __init__.py: grab bag of top level stuff. Key methods:
    • cache() and conform() are wrappers around Python code to download and process data from sources. This code wraps Python modules and uses multiprocessing.Process to run them…
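
To make that wrapper pattern concrete, here is a minimal sketch of running a plain Python function in a child process with multiprocessing.Process and getting its result back over a queue. This is not the code in openaddr; run_in_process, _worker, fake_conform and the source name "us-ca-oakland" are all hypothetical names made up for illustration, and the sketch assumes the Unix "fork" start method.

```python
import multiprocessing
import queue

# Assumption: the "fork" start method (Unix only). It keeps this sketch
# free of an `if __name__ == "__main__":` guard.
_ctx = multiprocessing.get_context("fork")


def _worker(func, args, results):
    # Runs in the child process: report either the return value or the error.
    try:
        results.put(("ok", func(*args)))
    except Exception as exc:
        results.put(("error", repr(exc)))


def run_in_process(func, *args, timeout=30):
    """Hypothetical wrapper: run func(*args) in a child process."""
    results = _ctx.Queue()
    proc = _ctx.Process(target=_worker, args=(func, args, results))
    proc.start()
    try:
        # Read the result before joining, so a large payload cannot block
        # the child while it is still writing to the queue.
        status, value = results.get(timeout=timeout)
    except queue.Empty:
        proc.terminate()
        proc.join()
        raise RuntimeError("worker timed out")
    proc.join()
    if status == "error":
        raise RuntimeError(value)
    return value


def fake_conform(source_name):
    # Stand-in for real processing work; purely illustrative.
    return "conformed " + source_name
```

Calling run_in_process(fake_conform, "us-ca-oakland") runs the stand-in in its own process and hands back its return value. The appeal of this shape is isolation: a job that fails or hangs can be reported or terminated without taking the parent down with it.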

