Download data with Python

Set up required libraries

import pandas as pd
import os
import rasterio
from matplotlib import pyplot as plt
from PIL import Image
import numpy as np
import tarfile
import pyprojroot
import requests
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

Set up work and output directories

We use the pyprojroot library to define paths relative to our repository root folder:

repodir = pyprojroot.find_root(pyprojroot.has_dir(".git"))
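
Where pyprojroot is not installed, the same idea can be sketched with the standard library: walk up from a starting directory until a folder containing .git is found. The function name find_repo_root is hypothetical and is not part of pyprojroot.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = ".git") -> Path:
    """Return `start` or its closest ancestor that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")
```

Calling find_repo_root(Path.cwd()) from anywhere inside the repository should then give a path that can stand in for repodir.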

Profile information from OSF

A workbook with profile content for the Ecosystem Functional Groups of the IUCN Global Ecosystem Typology (Level 3 units) is available at https://osf.io/4dcea

We can read this workbook directly from the URL using pandas:

profiles_url='https://osf.io/download/4dcea/'
EFG_profiles = pd.read_excel(profiles_url, sheet_name = 'Short description')

Here is a peek at the downloaded file:

EFG_profiles[['name','short description', 'url']].head()
name short description url
0 F1.5 Seasonal lowland rivers These medium to large rivers in tropical, subt... https://global-ecosystems.org/explore/groups/F1.5
1 F1.4 Seasonal upland streams Seasonal rainfall patterns in large parts of t... https://global-ecosystems.org/explore/groups/F1.4
2 F1.3 Freeze-thaw rivers and streams In cold climates at high latitudes or altitude... https://global-ecosystems.org/explore/groups/F1.3
3 F1.2 Permanent lowland rivers Lowland rivers with slow continuous flows up t... https://global-ecosystems.org/explore/groups/F1.2
4 F1.1 Permanent upland streams These small rivers or streams in mountainous o... https://global-ecosystems.org/explore/groups/F1.1

Indicative map data download from Zenodo

Indicative maps are available from different Zenodo repositories.

Bundle of indicative maps

One repository holds the bundle of maps in compressed tar archives. The record ID used below is the concept ID, which automatically resolves to the latest version:

record_url = 'https://zenodo.org/api/records/3546513'
r = requests.get(record_url)

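Note the difference between id and conceptrecid, and between the two DOIs (digital object identifiers). In each pair, the first value identifies this specific version and the second identifies the record across all versions: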

response = r.json()
print((response['id'],
    response['conceptrecid'],
    response['doi'],
    response['conceptdoi']))
(10081251, '3546513', '10.5281/zenodo.10081251', '10.5281/zenodo.3546513')

We create a folder to hold the latest version of the bundle, downloaded directly from Zenodo:

datadir = repodir / "gisdata" / "indicative-maps-bundle" / "latest"
datadir.mkdir(parents=True, exist_ok=True)

The Zenodo record has four files:

response['files']
[{'id': 'fe0bbd7a-bc60-415f-a8f6-6848acd8c220',
  'key': 'README.md',
  'size': 19383,
  'checksum': 'md5:e6cdcb4b46b261fecd5fce5c31fb5805',
  'links': {'self': 'https://zenodo.org/api/records/10081251/files/README.md/content'}},
 {'id': '437d31b3-6502-454a-9511-da1432b22d1f',
  'key': 'all-maps-vector-geojson.tar.bz2',
  'size': 853479699,
  'checksum': 'md5:8490adfd971bc3a5b02cc72832e2df3e',
  'links': {'self': 'https://zenodo.org/api/records/10081251/files/all-maps-vector-geojson.tar.bz2/content'}},
 {'id': 'd2cb3226-e68c-452d-a6f8-a65d0686380b',
  'key': 'all-maps-raster-geotiff.tar.bz2',
  'size': 157378816,
  'checksum': 'md5:245d09d535609d1c010eff52e9c8bfaa',
  'links': {'self': 'https://zenodo.org/api/records/10081251/files/all-maps-raster-geotiff.tar.bz2/content'}},
 {'id': 'a42361fc-e4bc-48d3-8c8a-0deafa722f80',
  'key': 'map-details.xml',
  'size': 114612,
  'checksum': 'md5:5177a7a6fc5aa4981088093a68b2964e',
  'links': {'self': 'https://zenodo.org/api/records/10081251/files/map-details.xml/content'}}]

The loop below downloads every file in the record, skipping any that already exist locally:

for i in range(len(response['files'])):
    zenodo_url = response['files'][i]['links']['self']
    outputfile = datadir / response['files'][i]['key']
    if not outputfile.exists():
        zdownload = requests.get(zenodo_url)
        if zdownload.ok:
            with open(outputfile, mode="wb") as file:
                file.write(zdownload.content)
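
Each file entry in the record carries a checksum field of the form md5:<hex digest>, which can be used to confirm that a download is complete. The following is a minimal sketch using only the standard library; the helper name md5_matches is hypothetical.

```python
import hashlib
from pathlib import Path


def md5_matches(path: Path, checksum: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's MD5 digest with a Zenodo-style 'md5:<hex>' string."""
    algorithm, _, expected = checksum.partition(":")
    if algorithm != "md5":
        raise ValueError(f"Unsupported checksum algorithm: {algorithm!r}")
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

For an entry in response['files'], the check would look like md5_matches(datadir / entry['key'], entry['checksum']).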

We extract maps from the tar archives in a sandbox folder:

sandbox = repodir / "sandbox" / "latest"
sandbox.mkdir(parents=True, exist_ok=True)
maps_tar = datadir / 'all-maps-raster-geotiff.tar.bz2'
with tarfile.open(maps_tar) as my_tarfile:
    if hasattr(tarfile, 'data_filter'):
        my_tarfile.extractall(path=sandbox,filter='data')
    else:
        # remove this when no longer needed, see https://docs.python.org/3/library/tarfile.html#tarfile-extraction-filter
        print('Extracting may be unsafe; consider updating Python')
        my_tarfile.extractall(path=sandbox)
Extracting may be unsafe; consider updating Python
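
Before extracting a large archive it can help to inspect its contents first. This sketch lists the raster files in an archive without extracting anything. The helper name list_tif_members is hypothetical, and it assumes raster files end in .tif, as the file names later in this document suggest.

```python
import tarfile
from pathlib import Path


def list_tif_members(archive: Path) -> list:
    """Return the sorted names of regular files ending in .tif inside a tar archive."""
    # The default read mode detects the compression (here bz2) automatically.
    with tarfile.open(archive) as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(".tif")
        )
```

Calling list_tif_members(maps_tar) would then show which files extractall is about to write into the sandbox folder.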

Repositories for single EFG maps

Map details are stored in an XML file that is part of the map bundle downloaded from Zenodo.

Check the file was downloaded:

map_details_file = datadir / "map-details.xml"
map_details_file.exists()
True

We’ll use Python's built-in xml.etree.ElementTree module (imported above as ET) to read the XML file:

tree = ET.parse(map_details_file)
#maplist_root = tree.getroot()

We can query the details for a specific map:

node=tree.find(".//Map[@efg_code='T1.1']")
for child in node.iter():
    print(ET.tostring(child))
b'<Map efg_code="T1.1" map_code="T1.1.web.mix" map_version="v2.0" update="2020-11-08">\n         <Functional_group>T1.1 Tropical/Subtropical lowland rainforests</Functional_group>\n         <Description>Major and minor occurrences were initially identified using consensus land-cover maps (Tuanmu et al. 2014) and then cropped to selected terrestrial ecoregions (Dinerstein et al. 2017; DAWE 2012) at 30 arc seconds spatial resolution. Ecoregions were selected if: i) their descriptions mentioned features consistent with those identified in the profile of the Ecosystem Functional Group; and ii) if their location was consistent with the ecological drivers described in the profile.</Description>\n         <Contributors>\n            <map-contributor>JR Ferrer-Paris</map-contributor>\n            <map-contributor>DA Keith</map-contributor>\n         </Contributors>\n         <Dataset-doi>10.5281/zenodo.5090450</Dataset-doi>\n      </Map>\n      '
b'<Functional_group>T1.1 Tropical/Subtropical lowland rainforests</Functional_group>\n         '
b'<Description>Major and minor occurrences were initially identified using consensus land-cover maps (Tuanmu et al. 2014) and then cropped to selected terrestrial ecoregions (Dinerstein et al. 2017; DAWE 2012) at 30 arc seconds spatial resolution. Ecoregions were selected if: i) their descriptions mentioned features consistent with those identified in the profile of the Ecosystem Functional Group; and ii) if their location was consistent with the ecological drivers described in the profile.</Description>\n         '
b'<Contributors>\n            <map-contributor>JR Ferrer-Paris</map-contributor>\n            <map-contributor>DA Keith</map-contributor>\n         </Contributors>\n         '
b'<map-contributor>JR Ferrer-Paris</map-contributor>\n            '
b'<map-contributor>DA Keith</map-contributor>\n         '
b'<Dataset-doi>10.5281/zenodo.5090450</Dataset-doi>\n      '

The DOI for each individual map is stored in the Dataset-doi tag. We can query all elements with this tag and collect their text using a list comprehension:

all_dois = [doi.text for doi in tree.findall( ".//Dataset-doi")]
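
A flat list loses the link between each DOI and its functional group. The sketch below builds a dictionary keyed by efg_code instead. It runs on a shortened, hypothetical stand-in for map-details.xml: the root tag name Maps and the second Map entry (X0.0) are invented for illustration, while the T1.1 values come from the query output shown earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for map-details.xml (root tag name is assumed).
sample = """<Maps>
  <Map efg_code="T1.1" map_code="T1.1.web.mix" map_version="v2.0">
    <Dataset-doi>10.5281/zenodo.5090450</Dataset-doi>
  </Map>
  <Map efg_code="X0.0" map_code="X0.0.demo" map_version="v1.0"/>
</Maps>"""

root = ET.fromstring(sample)
# findtext returns None when a Map element has no Dataset-doi child.
doi_by_efg = {m.get("efg_code"): m.findtext("Dataset-doi") for m in root.iter("Map")}
print(doi_by_efg)
```

With the real file, tree.iter("Map") could replace root.iter("Map"); entries mapped to None would flag maps without their own repository.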

Now we use this list of DOIs to download a copy of each of the repositories containing files for each ecosystem functional group:

for doi in all_dois:
    out_folder = repodir / "gisdata" / "indicative-maps" / doi
    out_folder.mkdir(parents=True, exist_ok=True)
    if len(os.listdir(out_folder)) ==0:
        # query doi to get the corresponding zenodo API url
        r = requests.get('https://doi.org/'+doi, allow_redirects=True) 
        parsed_uri = urlparse(r.url) 
        zenodo_url = '{uri.scheme}://{uri.netloc}/api{uri.path}'.format(uri=parsed_uri)
        r.close()
        # now query the API
        r = requests.get(zenodo_url)
        response = r.json()
        for i in range(len(response['files'])):
            zenodo_url = response['files'][i]['links']['self']
            outputfile = out_folder / response['files'][i]['key']
            if not outputfile.exists():
                zdownload = requests.get(zenodo_url)
                if zdownload.ok:
                    with open(outputfile, mode="wb") as file:
                        file.write(zdownload.content)
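
The step that turns a resolved DOI into an API address can be isolated and checked without any network access. This is a minimal sketch: the function name zenodo_api_url is hypothetical, and the example URL is an assumption about what https://doi.org/ redirects to, based on the API address used for the bundle record above.

```python
from urllib.parse import urlparse


def zenodo_api_url(record_url: str) -> str:
    """Insert '/api' before the path of a resolved Zenodo record URL."""
    parsed = urlparse(record_url)
    return f"{parsed.scheme}://{parsed.netloc}/api{parsed.path}"


# Assumed shape of the landing-page URL after the DOI redirect.
print(zenodo_api_url("https://zenodo.org/records/10081251"))
```

If Zenodo changes the layout of its landing-page URLs, this is the one place that would need adjusting.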

Compare downloaded files

Now we have two copies of each map file, one in the sandbox folder (extracted from the map bundle), and one downloaded directly from the corresponding record.

Map code and version are described in the map attributes in the xml file:

node=tree.find(".//Map[@efg_code='T1.3']")
node.attrib
{'efg_code': 'T1.3',
 'map_code': 'T1.3.WM.nwx',
 'map_version': 'v1.0',
 'update': '2021-11-26'}

The file extracted from the tar archive uses the same code and version combination for the file name:

for k in sandbox.glob("%s*.tif" % node.attrib['efg_code']):
    print(k.name)
T1.3.WM.nwx_v1.0.tif

The file downloaded from the specific repository follows the same naming:

out_folder = repodir / "gisdata" / "indicative-maps" / node.find('Dataset-doi').text
for k in out_folder.glob("%s*.tif" % node.attrib['efg_code']):
    print(k.name)
T1.3.WM.nwx_v1.0.tif
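
Matching file names do not guarantee matching contents. A byte-for-byte comparison of the two copies can be sketched with the standard library; the helper name same_content is hypothetical, and whether the two copies are actually identical is not something this document establishes.

```python
import filecmp
from pathlib import Path


def same_content(first: Path, second: Path) -> bool:
    """Return True when both files exist and are byte-for-byte identical."""
    if not (first.is_file() and second.is_file()):
        return False
    # shallow=False forces a content comparison instead of trusting os.stat().
    return filecmp.cmp(first, second, shallow=False)
```

For the example above, the call would be same_content(sandbox / k.name, out_folder / k.name).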

Raster maps

We’ll use rasterio to read and plot the raster file:

# k still points to the last .tif file found by the loop above
src = rasterio.open(k)
plt.imshow(src.read(1), cmap='pink')
plt.show()

Compare this with the thumbnail downloaded from Zenodo:

pngfile = list(out_folder.glob("*.png"))
img = np.asarray(Image.open(pngfile[0]))
plt.imshow(img)
plt.show()