Data-Science-Unit/GeoFlo

GeoFlo — An Insights Pipeline for Admin-Level ML Tasks

A modular pipeline for downloading, preprocessing, and extracting geospatial features from Earth Observation (EO) and survey data sources, designed to support admin-level machine learning tasks such as poverty prediction and socioeconomic analysis.


Overview

GeoFlo consists of four sequential pipeline stages:

  1. Data Download — Submits batch export tasks to Google Earth Engine (GEE) and downloads OpenStreetMap (OSM) data. GEE outputs are saved to your Google Drive and must be manually downloaded to your local machine before proceeding.
  2. Data Preprocessing — Merges the batch CSV files exported by GEE into consolidated datasets per data source and admin level, then applies geographic preprocessing.
  3. Feature Extraction — Aggregates all preprocessed data sources into a single feature CSV per administrative level.
  4. Feature Labelling (optional) — Joins the extracted features with a label dataset (e.g. population, wealth index, survey data) on administrative unit codes. A population labelling script is provided as a reference implementation, but any label can be used provided it can be joined on ADM{level}_PCODE.

Pipeline Architecture

1_data_download.py   →   [Google Drive]   →   2_data_preprocessing.py   →   3_create_admin_features.py   →   label_features_with_population.py
(GEE batch tasks          (manual download      (merge batch files &            (feature extraction               (optional: join any label
 + OSM download)           to local machine)     geographic preprocessing)       & merge into CSV)                dataset on ADM{level}_PCODE)

Requirements

Recommended Environment

We recommend using an environment that includes geoai-py, which bundles many of the geospatial dependencies this pipeline relies on (including geopandas, earthengine-api, and related libraries):

pip install geoai-py

Then install any remaining dependencies:

pip install pyyaml

Alternatively, install all dependencies individually:

pip install earthengine-api geopandas pandas pyyaml

Configuration

The pipeline uses three YAML configuration files:

Config file                        Used by
download_config.yml                Steps 1 (download) and 2 (preprocessing)
feature_extraction_config.yml      Step 3 (feature extraction)
population_labelling_config.yml    Step 4 (population labelling)

download_config.yml

Used by 1_data_download.py and 2_data_preprocessing.py.

earth_engine_info:
  project_name: "<your-project-name>"
  drive_folder: "<your-drive-folder>"

boundary_information:
  boundary_path: "<path-to-boundary-shapefile>"
  admin_level_path: "<path-to-admin-level-shapefile>"

country: "<country-name>"
country_iso: "<ISO-code>"
year: <year>

landsat_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>

land_class_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>

buildings_download:
  download: true
  admin_levels:
    - <level-1>
    - <level-2>
  ee_file_name: "<ee-file-name>"

night_time_light_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>

osm_download:
  download: true
  output_path: "<path-to-output-directory>"

pre_processing_info:
  crs: "EPSG:4326"
  gee_downloads_dir: "<path-to-gee-downloads-dir>"
  base_downloads_dir: "<path-to-base-downloads-dir>"
  admin_levels:
    - <level-1>
    - <level-2>
  year: <year>
  delete_original_data: false
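
Because each source is toggled by its own download flag, it can help to list the active sections before submitting any tasks. The following is a minimal sketch and not part of GeoFlo: the names enabled_downloads and DOWNLOAD_SECTIONS are hypothetical, and the function assumes the config has already been parsed into a dict (for example with yaml.safe_load).

```python
from typing import Any

# Section names copied from the download_config.yml template above.
DOWNLOAD_SECTIONS = (
    "landsat_download",
    "land_class_download",
    "buildings_download",
    "night_time_light_download",
    "osm_download",
)


def enabled_downloads(config: dict[str, Any]) -> list[str]:
    """Return the download sections whose `download` flag is true."""
    enabled = []
    for name in DOWNLOAD_SECTIONS:
        section = config.get(name) or {}  # tolerate missing or empty sections
        if section.get("download") is True:
            enabled.append(name)
    return enabled


# Example with an already-parsed config (illustrative values only).
example = {
    "landsat_download": {"download": True, "year": 2020, "admin_levels": [1, 2]},
    "osm_download": {"download": False, "output_path": "data/osm"},
}
print(enabled_downloads(example))
```

Sections that are absent from the config are treated as disabled here; whether GeoFlo itself does the same is an assumption.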

feature_extraction_config.yml

Used by 3_create_admin_features.py.

admin_information:
  admin_level: <admin-level>
  boundary_path: "<path-to-boundary-shapefile>"
  admin_code_column: "<admin-code-column>"
  admin_name_column: "<admin-name-column>"

data_information:
  base_dir: "<path-to-base-directory>"
  year: <year>
  output_extracted_features_dir: "<path-to-output-extracted-features-dir>"

# existing_extracted_features:
#   buildings: ""
#   landsat: ""

population_labelling_config.yml

Used by label_features_with_population.py — the reference implementation for feature labelling. If you write a custom labelling script, adapt this config structure to suit your label data.

features_csv: "<path-to-extracted-features-csv>"
population_csv: "<path-to-population-csv>"
admin_level: <admin-level>
admin_code_column: "<admin-code-column>"
output_fpath: "<path-to-output-csv>"

The population CSV must contain a T_TL column with total population values, and a column matching admin_code_column that can be joined to the features data on ADM{level}_PCODE.

Usage

Run the scripts in order, passing each one the path to its config file.

Step 1 — Download Data

python 1_data_download.py path/to/download_config.yml

Optional flags (--verbose enables DEBUG-level logging):

python 1_data_download.py path/to/download_config.yml --verbose

Downloads the following data sources (each toggleable via the config):

Source               Description
Landsat              Satellite imagery statistics per admin unit
Land Cover           ESRI land classification ratios
Buildings            VIDA building footprint aggregations
Night Time Lights    Annual NTL statistics
OSM                  OpenStreetMap road, amenity, and infrastructure data

Important: GEE downloads (Landsat, Land Cover, Buildings, Night Time Lights) are submitted as batch tasks and exported directly to your Google Drive account — they do not download to your local machine immediately. You must monitor the tasks and wait for them to complete before proceeding to Step 2.

Monitor task progress at: https://code.earthengine.google.com/tasks

Once complete, manually download the output files from Google Drive to your local machine. Set gee_downloads_dir in download_config.yml to the local path where you have saved these files — this is required for Step 2 to locate them.


Step 2 — Preprocess & Merge Downloaded Data

python 2_data_preprocessing.py path/to/download_config.yml

GEE exports each data source as multiple batch CSV files (one per task/tile). This step merges those batch files into one consolidated dataset per data source and admin level, and applies geographic preprocessing (reprojection, standardisation, etc.). The merged outputs are written to the directory set by pre_processing_info.base_downloads_dir in download_config.yml, ready for feature extraction in Step 3.
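
The merge part of this step can be illustrated with a dependency-free sketch. It is not the code in preprocessing/preprocessing.py: the function name merge_batch_csvs is hypothetical, it assumes every batch file for a source shares the same header, and it leaves out the geographic preprocessing entirely.

```python
import csv
from pathlib import Path


def merge_batch_csvs(batch_dir: Path, merged_path: Path) -> int:
    """Concatenate every .csv in batch_dir into merged_path; return the row count.

    Write merged_path outside batch_dir so a rerun does not pick it up as input.
    """
    batch_files = sorted(batch_dir.glob("*.csv"))
    if not batch_files:
        raise FileNotFoundError(f"No CSV files found in {batch_dir}")

    rows_written = 0
    header = None
    writer = None
    with merged_path.open("w", newline="", encoding="utf-8") as out:
        for batch_file in batch_files:
            with batch_file.open(newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first non-empty file defines the output header.
                    header = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{batch_file.name} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing on a header mismatch is a deliberate choice in this sketch: silently aligning columns could hide a batch exported with different settings.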


Step 3 — Extract Admin Features

python 3_create_admin_features.py path/to/feature_extraction_config.yml

Extracts and merges all features for the configured administrative level into a single CSV file:

output/features/admin_{level}_extracted_features.csv

Features extracted include:

  • Admin unit area (km²)
  • Landsat statistics
  • Land cover ratios
  • Annual Night Time Lights statistics
  • VIDA building footprint statistics
  • OSM-derived features

Step 4 — Label Features (optional)

The extracted feature CSV can be labelled with any external dataset — population, wealth index, survey-based poverty scores, or any other indicator — as long as the label data includes a column that can be joined to ADM{level}_PCODE.

A reference implementation for population labelling is provided:

python label_features_with_population.py path/to/population_labelling_config.yml

This joins the extracted features with a population CSV and adds two new columns, with one value per admin unit:

Column              Description
T_TL                Raw total population
log_plus_one_pop    Log-transformed population: log(T_TL + 1), useful for modelling

The script handles both UTF-8 and latin-1 encoded CSV files automatically. The labelled output is saved to the path specified by output_fpath in the config.
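
The encoding fallback and the log transform can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in label_features_with_population.py: the helper names read_rows and add_log_population are hypothetical, and the natural logarithm is assumed for log(T_TL + 1).

```python
import csv
import math
from pathlib import Path


def read_rows(path: Path) -> list[dict[str, str]]:
    """Read a CSV as UTF-8, retrying as latin-1 if decoding fails."""
    try:
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this retry cannot fail to decode.
        with path.open(newline="", encoding="latin-1") as handle:
            return list(csv.DictReader(handle))


def add_log_population(rows: list[dict]) -> list[dict]:
    """Add log_plus_one_pop = ln(T_TL + 1) to each row (natural log assumed)."""
    for row in rows:
        row["log_plus_one_pop"] = math.log1p(float(row["T_TL"]))
    return rows
```

math.log1p is used rather than math.log(x + 1) because it stays accurate for very small populations and returns 0 for empty units.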

For other label types, write a script that reads your label data, joins it to the feature CSV on ADM{level}_PCODE, and saves the result. The population labelling script can be used as a template.
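
A custom labelling script mostly comes down to a keyed join. The sketch below shows one way to do it on rows already loaded as dicts (for example with csv.DictReader). The function name join_labels, the wealth_index column, and the choice of a left join that keeps unlabelled units are all assumptions made for illustration; the reference script may behave differently.

```python
def join_labels(
    features: list[dict],
    labels: list[dict],
    level: int,
    label_code_column: str,
    label_columns: list[str],
) -> list[dict]:
    """Left-join label columns onto feature rows by admin unit code."""
    feature_key = f"ADM{level}_PCODE"
    # Index the label rows by their admin code for constant-time lookup.
    labels_by_code = {row[label_code_column]: row for row in labels}

    joined = []
    for feature_row in features:
        label_row = labels_by_code.get(feature_row[feature_key], {})
        merged = dict(feature_row)  # copy so the input rows are not mutated
        for column in label_columns:
            merged[column] = label_row.get(column)  # None when no label matches
        joined.append(merged)
    return joined


# Illustrative data only.
features = [
    {"ADM2_PCODE": "X1", "area_km2": "10"},
    {"ADM2_PCODE": "X2", "area_km2": "20"},
]
labels = [{"pcode": "X1", "wealth_index": "0.4"}]
labelled = join_labels(features, labels, 2, "pcode", ["wealth_index"])
```

Keeping unmatched units with an empty label makes coverage gaps visible; drop those rows before training if the model cannot handle missing targets.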


Output

The final output of the pipeline (before optional labelling) is a CSV file in the directory specified by output_extracted_features_dir in feature_extraction_config.yml. For admin level 2, for example:

admin_2_extracted_features.csv   # one row per admin unit, one column per feature

Each row corresponds to an administrative unit identified by its ADM{level}_PCODE.


Logging

Step 1 (1_data_download.py) writes a timestamped log file to the working directory:

geoflo_download_YYYYMMDD_HHMMSS.log

Logs are also printed to the console. Use --verbose for DEBUG-level output.
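
A logging setup with this behaviour can be sketched with the standard logging module. This is an assumption about how such a setup might look, not the code in 1_data_download.py: the function names, the log format string, and the INFO default level are all illustrative.

```python
import logging
from datetime import datetime
from pathlib import Path
from typing import Optional


def log_file_name(now: Optional[datetime] = None) -> str:
    """Build a name of the form geoflo_download_YYYYMMDD_HHMMSS.log."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"geoflo_download_{stamp}.log"


def configure_logging(verbose: bool = False, log_dir: Path = Path(".")) -> Path:
    """Send log records to both a timestamped file and the console."""
    log_path = log_dir / log_file_name()
    logging.basicConfig(
        # DEBUG with --verbose; the INFO default is an assumption.
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[logging.FileHandler(log_path), logging.StreamHandler()],
    )
    return log_path
```

Calling configure_logging once at startup, before any task is submitted, keeps every message from a run in a single file.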


Project Structure

.
├── 1_data_download.py                 # Stage 1: data download
├── 2_data_preprocessing.py            # Stage 2: preprocessing & merging
├── 3_create_admin_features.py         # Stage 3: feature extraction
├── label_features_with_population.py  # Stage 4: population labelling (optional)
├── data_download/
│   ├── get_landsat_stats.py
│   ├── get_esri_land_class_ratios.py
│   ├── get_ntl_stats.py
│   ├── aggregated_buildings_download.py
│   └── osm_download.py
├── preprocessing/
│   └── preprocessing.py
├── feature_extraction/
│   ├── area_extractor.py
│   └── osm_data_extractor.py
├── download_config.yml
├── feature_extraction_config.yml
└── population_labelling_config.yml

Notes

  • The pipeline expects the downloaded data under base_dir (set in feature_extraction_config.yml) to follow this directory structure:
    base_dir/
    ├── LandsatStats/{year}/adm{level}/
    ├── LandCoverRatios/{year}/adm{level}/
    ├── AnnualNTL_Stats/{year}/adm{level}/
    ├── VidaBuildings/{year}/adm{level}/
    └── osm/{year}/
    
  • Each data directory should contain one or more .csv files, which are concatenated automatically.
  • Admin boundary files must include a code column and a name column, which are remapped to a standardised ADM{level}_PCODE / ADM{level}_PT format.
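
Before running Step 3, the expected layout can be checked with a short script. The sketch below is not part of GeoFlo and its function names are hypothetical. It requires at least one .csv file in each per-level directory, but only checks that the osm directory exists, since the file format of the OSM data is not assumed here.

```python
from pathlib import Path

# Folder names copied from the layout above; osm has no adm{level} subfolder.
PER_LEVEL_SOURCES = ("LandsatStats", "LandCoverRatios", "AnnualNTL_Stats", "VidaBuildings")


def expected_dirs(base_dir, year, level) -> list[Path]:
    """Return the per-level source directories for one year and admin level."""
    base = Path(base_dir)
    return [base / source / str(year) / f"adm{level}" for source in PER_LEVEL_SOURCES]


def missing_inputs(base_dir, year, level) -> list[Path]:
    """Return expected directories that are absent or hold no .csv files."""
    missing = [
        directory
        for directory in expected_dirs(base_dir, year, level)
        if not directory.is_dir() or not any(directory.glob("*.csv"))
    ]
    osm_dir = Path(base_dir) / "osm" / str(year)
    if not osm_dir.is_dir():  # existence check only
        missing.append(osm_dir)
    return missing
```

An empty list from missing_inputs means every expected directory is present; anything else names the paths to fix first.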
