Download genomes from GenBank


Python Adventures


For a project, I had to download a bunch of records from the NCBI (National Center for Biotechnology Information) website. Here is the beginning of one such record, CP002059.1 (almost 5 MB):

LOCUS       CP002059             5354700 bp    DNA ...
DEFINITION  'Nostoc azollae' 0708, complete genome.
ACCESSION   CP002059 ACIR01000000 ACIR01000001-ACIR01000216
VERSION     CP002059.1  GI:298231532
DBLINK      Project: 30807

I needed this data in text format.

Solution #1
My first idea was to download the page with wget. However, I was surprised to see that the downloaded file was less than 100 KB instead of 5 MB! When I looked at the source, it turned out that the page is full of AJAX calls. That is, the browser downloads this short HTML first and then fills in the rest with additional requests. If you save the page with File -> Save as…, you get the complete HTML, but how do you automate the download process? How to get the post-AJAX…
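
The size mismatch described above can be reproduced from Python. This is only a rough sketch of the problem, not the solution: it fetches the page with a single HTTP GET, just as wget does, so no JavaScript runs, and then applies a size heuristic. The URL pattern in `RECORD_URL`, the function names and the 1 MB threshold are my own assumptions for illustration; they are not part of any NCBI API.

```python
import urllib.request

# Assumption: the record page is reachable under this URL pattern.
RECORD_URL = "https://www.ncbi.nlm.nih.gov/nuccore/{accession}"


def fetch_raw_page(accession, timeout=30.0):
    """Fetch the page like wget would: one GET request, no JavaScript executed."""
    url = RECORD_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def looks_like_ajax_shell(page, expected_min_bytes=1_000_000):
    """Heuristic: the full record is several MB, the pre-AJAX HTML is under 100 KB.

    The 1 MB default threshold is an arbitrary value between those two sizes.
    """
    return len(page) < expected_min_bytes
```

For example, `looks_like_ajax_shell(fetch_raw_page("CP002059.1"))` tells you whether the response is too small to contain the whole genome. A plain size check is crude, but it is enough to catch this failure in a batch download before the incomplete files get used anywhere.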

