I’m trying to help with the OpenAddresses project, particularly Mike Migurski’s “machine” code for running a full analysis of the data. Particularly particularly the ditch-node branch. I managed to get the tests running after some version hassle.
Now I’m taking inventory of what all code is in the repo to understand where to work. These notes are probably only useful right about now (December 2014) and only for the ditch-node branch.
This is the big thing, about 1700 lines of Python code. The architecture is a bit confusing. I don’t really understand how it’s managing threads / processes for parallelism. Also the actual ETL code is a bit entangled with the download and cache code, including S3.
- `__init__.py`: grab bag of top-level stuff. Key methods:
  - cache() and conform() are wrappers around Python code to download and process data from sources. This code wraps Python modules and uses multiprocessing.Process to run them…
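
To make the wrapper pattern concrete for myself, here is a minimal sketch of running a function in a child process with multiprocessing.Process and getting its result back over a queue. This is not the code from the repo: `run_in_process`, `_worker`, and `fake_cache` are hypothetical names I made up, and I'm assuming a POSIX system where the "fork" start method is available.

```python
import multiprocessing
import queue

# Assumption: POSIX, where the "fork" start method exists.
_ctx = multiprocessing.get_context("fork")


def _worker(func, args, results):
    """Run func(*args) in the child and report the outcome to the parent."""
    try:
        results.put(("ok", func(*args)))
    except Exception as exc:
        # Send a plain string so the message always pickles cleanly.
        results.put(("error", repr(exc)))


def run_in_process(func, *args, timeout=60):
    """Hypothetical helper: run func in a separate process, return its result."""
    results = _ctx.Queue()
    proc = _ctx.Process(target=_worker, args=(func, args, results))
    proc.start()
    try:
        # Read the result before join() so a large payload cannot block the child.
        status, value = results.get(timeout=timeout)
    except queue.Empty:
        proc.terminate()
        proc.join()
        raise TimeoutError("child process produced no result in time")
    proc.join()
    if status == "error":
        raise RuntimeError(value)
    return value


def fake_cache(source_url):
    """Stand-in for a download step; the real cache() does far more."""
    return "cached:" + source_url
```

Calling `run_in_process(fake_cache, "http://example.com/data.csv")` runs `fake_cache` in a child process and hands back its return value in the parent. The appeal of this design is isolation: a crash or hang in one source's download only costs that one child, and the parent can enforce a timeout.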