GeoFlo — An Insights Pipeline for Admin-Level ML Tasks

A modular pipeline for downloading, preprocessing, and extracting geospatial features from Earth Observation (EO) and survey data sources, designed to support admin-level machine learning tasks such as poverty prediction and socioeconomic analysis.

Overview

GeoFlo consists of four sequential pipeline stages:

Data Download — Submits batch export tasks to Google Earth Engine (GEE) and downloads OpenStreetMap (OSM) data. GEE outputs are saved to your Google Drive and must be manually downloaded to your local machine before proceeding.
Data Preprocessing — Merges the batch CSV files exported by GEE into consolidated datasets per data source and admin level, then applies geographic preprocessing.
Feature Extraction — Aggregates all preprocessed data sources into a single feature CSV per administrative level.
Feature Labelling (optional) — Joins the extracted features with a label dataset (e.g. population, wealth index, survey data) on administrative unit codes. A population labelling script is provided as a reference implementation, but any label can be used provided it can be joined on ADM{level}_PCODE.

Pipeline Architecture

1_data_download.py   →   [Google Drive]   →   2_data_preprocessing.py   →   3_create_admin_features.py   →   label_features_with_population.py
(GEE batch tasks          (manual download      (merge batch files &            (feature extraction               (optional: join any label
 + OSM download)           to local machine)     geographic preprocessing)       & merge into CSV)                dataset on ADM{level}_PCODE)

Requirements

Python 3.x
Google Earth Engine Python API (ee)
geopandas
pandas
pyyaml
A valid Google Cloud project with Earth Engine access

Recommended Environment

We recommend using an environment that includes geoai-py, which bundles many of the geospatial dependencies this pipeline relies on (including geopandas, earthengine-api, and related libraries):

pip install geoai-py

Then install any remaining dependencies:

pip install pyyaml

Alternatively, install all dependencies individually:

pip install earthengine-api geopandas pandas pyyaml

Configuration

The pipeline uses three YAML configuration files:

Config file	Used by
`download_config.yml`	Steps 1 (download) and 2 (preprocessing)
`feature_extraction_config.yml`	Step 3 (feature extraction)
`population_labelling_config.yml`	Step 4 (population labelling)

`download_config.yml`

Used by 1_data_download.py and 2_data_preprocessing.py.

earth_engine_info:
  project_name: "<your-project-name>"
  drive_folder: "<your-drive-folder>"

boundary_information:
  boundary_path: "<path-to-boundary-shapefile>"
  admin_level_path: "<path-to-admin-level-shapefile>"

country: "<country-name>"
country_iso: "<ISO-code>"
year: <year>

landsat_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>

land_class_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>

buildings_download:
  download: true
  admin_levels:
    - <level-1>
    - <level-2>
  ee_file_name: "<ee-file-name>"

night_time_light_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>

osm_download:
  download: true
  output_path: "<path-to-output-directory>"

pre_processing_info:
  crs: "EPSG:4326"
  gee_downloads_dir: "<path-to-gee-downloads-dir>"
  base_downloads_dir: "<path-to-base-downloads-dir>"
  admin_levels:
    - <level-1>
    - <level-2>
  year: <year>
  delete_original_data: false
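
As a quick sanity check before running Step 1, the sketch below lists which sources are switched on in a parsed download_config.yml. It assumes the config has already been loaded into a dict (for example with `yaml.safe_load`). The helper name `enabled_sources` and the example values are hypothetical and not part of GeoFlo.

```python
# Hypothetical helper, not part of GeoFlo: report which download
# sections of a parsed download_config.yml have `download: true`.

# Section names as they appear in download_config.yml.
DOWNLOAD_SECTIONS = (
    "landsat_download",
    "land_class_download",
    "buildings_download",
    "night_time_light_download",
    "osm_download",
)


def enabled_sources(config: dict) -> list[str]:
    """Return the download sections whose `download` flag is true."""
    enabled = []
    for section in DOWNLOAD_SECTIONS:
        # A missing or empty section is treated as disabled.
        settings = config.get(section) or {}
        if settings.get("download") is True:
            enabled.append(section)
    return enabled


if __name__ == "__main__":
    # In practice the dict would come from:
    #   import yaml
    #   with open("path/to/download_config.yml") as f:
    #       config = yaml.safe_load(f)
    example = {  # illustrative values only
        "landsat_download": {"download": True, "year": 2020, "admin_levels": [1, 2]},
        "osm_download": {"download": False, "output_path": "osm/"},
    }
    print(enabled_sources(example))
```

Printing this list before submitting tasks makes it easier to spot a source that was left enabled or disabled by mistake.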

`feature_extraction_config.yml`

Used by 3_create_admin_features.py.

admin_information:
  admin_level: <admin-level>
  boundary_path: "<path-to-boundary-shapefile>"
  admin_code_column: "<admin-code-column>"
  admin_name_column: "<admin-name-column>"

data_information:
  base_dir: "<path-to-base-directory>"
  year: <year>
  output_extracted_features_dir: "<path-to-output-extracted-features-dir>"

# existing_extracted_features:
#   buildings: ""
#   landsat: ""

`population_labelling_config.yml`

Used by label_features_with_population.py — the reference implementation for feature labelling. If you write a custom labelling script, adapt this config structure to suit your label data.

features_csv: "<path-to-extracted-features-csv>"
population_csv: "<path-to-population-csv>"
admin_level: <admin-level>
admin_code_column: "<admin-code-column>"
output_fpath: "<path-to-output-csv>"

The population CSV must contain a T_TL column with total population values, and a column matching admin_code_column that can be joined to the features data on ADM{level}_PCODE.

Run the scripts in order, passing each one the path to its config file.

Step 1 — Download Data

python 1_data_download.py path/to/download_config.yml

Optional flags:

python 1_data_download.py path/to/download_config.yml --verbose

Downloads the following data sources (each toggleable via the config):

Source	Description
Landsat	Satellite imagery statistics per admin unit
Land Cover	ESRI land classification ratios
Buildings	VIDA building footprint aggregations
Night Time Lights	Annual NTL statistics
OSM	OpenStreetMap road, amenity, and infrastructure data

Important: GEE downloads (Landsat, Land Cover, Buildings, Night Time Lights) are submitted as batch tasks and exported to your Google Drive account; nothing is written to your local machine at this stage. Monitor the tasks and wait for all of them to complete before proceeding to Step 2.

Monitor task progress at: https://code.earthengine.google.com/tasks

Once complete, manually download the output files from Google Drive to your local machine. Set gee_downloads_dir in download_config.yml to the local path where you have saved these files — this is required for Step 2 to locate them.
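
Task progress can also be checked from Python rather than the web console. The sketch below is an assumption about how one might do this, not something GeoFlo provides: it uses `ee.batch.Task.list()` and `task.status()` from the earthengine-api package, and the helper names are hypothetical. Note that the task list covers all recent tasks for your account, not only those submitted by GeoFlo.

```python
# Sketch only: summarise Earth Engine batch task states before Step 2.
from collections import Counter

# States in which a task will make no further progress.
FINISHED_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def summarise_states(statuses: list[dict]) -> Counter:
    """Count tasks per state from a list of task status dicts."""
    return Counter(status.get("state", "UNKNOWN") for status in statuses)


def all_finished(statuses: list[dict]) -> bool:
    """True when there is at least one task and none is still pending."""
    return bool(statuses) and all(
        status.get("state") in FINISHED_STATES for status in statuses
    )


def fetch_statuses(project_name: str) -> list[dict]:
    """Fetch status dicts for recent tasks (needs Earth Engine credentials)."""
    import ee  # imported lazily so the helpers above work without it

    # project_name corresponds to earth_engine_info.project_name in the config.
    ee.Initialize(project=project_name)
    return [task.status() for task in ee.batch.Task.list()]
```

Calling `summarise_states(fetch_statuses("<your-project-name>"))` gives a per-state count; a non-zero `FAILED` count means some exports need to be resubmitted before the Drive download.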

Step 2 — Preprocess & Merge Downloaded Data

python 2_data_preprocessing.py path/to/download_config.yml

GEE exports each data source as multiple batch CSV files (one per task/tile). This step merges them into one consolidated dataset per data source and admin level and applies geographic preprocessing (reprojection, standardisation, etc.). The merged outputs are written to the directory specified by base_downloads_dir in download_config.yml, ready for feature extraction in Step 3.
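
To illustrate the merge part of this step, here is a minimal sketch that concatenates batch CSVs sharing the same header. It is not the code in preprocessing.py: it uses only the standard library, the function name `merge_batch_csvs` is hypothetical, and it assumes the output file is written outside the batch directory.

```python
# Sketch only: concatenate batch CSVs that share a header into one file.
import csv
from pathlib import Path


def merge_batch_csvs(batch_dir: Path, output_path: Path) -> int:
    """Merge every .csv in batch_dir into output_path; return rows written."""
    batch_files = sorted(batch_dir.glob("*.csv"))
    if not batch_files:
        raise FileNotFoundError(f"No .csv files found in {batch_dir}")

    rows_written = 0
    fieldnames = None
    writer = None
    with output_path.open("w", newline="", encoding="utf-8") as out_file:
        for batch_file in batch_files:
            with batch_file.open(newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first non-empty file defines the column order.
                    fieldnames = reader.fieldnames
                    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
                    writer.writeheader()
                elif reader.fieldnames != fieldnames:
                    raise ValueError(f"Header mismatch in {batch_file.name}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing on a header mismatch is a deliberate choice in this sketch: silently aligning columns could hide an export that was run with different settings.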

Step 3 — Extract Admin Features

python 3_create_admin_features.py path/to/feature_extraction_config.yml

Extracts and merges all features for the configured administrative level into a single CSV file:

output/features/admin_{level}_extracted_features.csv

Features extracted include:

Admin unit area (km²)
Landsat statistics
Land cover ratios
Annual Night Time Lights statistics
VIDA building footprint statistics
OSM-derived features

Step 4 — Label Features (optional)

The extracted feature CSV can be labelled with any external dataset — population, wealth index, survey-based poverty scores, or any other indicator — as long as the label data includes a column that can be joined to ADM{level}_PCODE.

A reference implementation for population labelling is provided:

python label_features_with_population.py path/to/population_labelling_config.yml

This joins the extracted features with the population CSV and adds two columns to the output:

Column	Description
`T_TL`	Raw total population
`log_plus_one_pop`	Log-transformed population: `log(T_TL + 1)`, useful for modelling

The script handles both UTF-8 and latin-1 encoded CSV files automatically. The labelled output is saved to the path specified by output_fpath in the config.
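
One common way to handle both encodings is to try UTF-8 first and fall back to latin-1 on a decode error. The sketch below shows that pattern with the standard library; it is an assumption about the approach, not a copy of the script, and `read_csv_rows` is a hypothetical name.

```python
# Sketch only: read a CSV as UTF-8, falling back to latin-1.
import csv
from pathlib import Path


def read_csv_rows(path: Path) -> list[dict]:
    """Return CSV rows as dicts, trying UTF-8 first and then latin-1."""
    try:
        with path.open(newline="", encoding="utf-8") as csv_file:
            return list(csv.DictReader(csv_file))
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this read cannot
        # fail on decoding, but it may mis-render text in other encodings.
        with path.open(newline="", encoding="latin-1") as csv_file:
            return list(csv.DictReader(csv_file))
```

The order matters: latin-1 accepts any byte sequence, so trying it first would silently garble valid UTF-8 place names.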

For other label types, write a script that reads your label data, joins it to the feature CSV on ADM{level}_PCODE, and saves the result. The population labelling script can be used as a template.
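
A custom labelling script can be quite small. The following sketch joins label rows onto feature rows by admin code and adds a log(x + 1) column, mirroring what the population script is described as doing. It works on lists of dicts (for example from `csv.DictReader`) to stay dependency-free; with pandas the same join is a single `merge` call. The function name, the default `log_col` value, and the choice to drop units without a label are assumptions of this sketch.

```python
# Sketch only: join label rows onto feature rows by admin code.
import math


def label_features(
    features: list[dict],
    labels: list[dict],
    feature_code_col: str,
    label_code_col: str,
    label_col: str,
    log_col: str = "log_plus_one_pop",
) -> list[dict]:
    """Attach label_col and log(label + 1) to each feature row with a match."""
    # Index the label values by admin code for constant-time lookup.
    label_by_code = {row[label_code_col]: float(row[label_col]) for row in labels}

    labelled = []
    for row in features:
        value = label_by_code.get(row[feature_code_col])
        if value is None:
            continue  # assumption: units without a label are dropped
        labelled.append({**row, label_col: value, log_col: math.log1p(value)})
    return labelled
```

For population, `feature_code_col` would be the relevant ADM{level}_PCODE column, `label_code_col` the column named by admin_code_column, and `label_col` would be T_TL.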

Output

The main output of the pipeline is the feature CSV written by Step 3 to the directory specified by output_extracted_features_dir in feature_extraction_config.yml. For admin level 2, for example:

admin_2_extracted_features.csv   # one row per admin unit, one column per feature

Each row corresponds to an administrative unit identified by its ADM{level}_PCODE.

Logging

Step 1 (1_data_download.py) writes a timestamped log file to the working directory:

geoflo_download_YYYYMMDD_HHMMSS.log

Logs are also printed to the console. Use --verbose for DEBUG-level output.
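
For anyone adding similar logging to a custom script, the behaviour described above can be approximated with the standard `logging` module. This is a sketch under assumptions, not the code in 1_data_download.py: the function names and the log message format are hypothetical.

```python
# Sketch only: console + timestamped file logging, as described above.
import logging
from datetime import datetime
from pathlib import Path


def log_file_name(now: datetime) -> str:
    """Build a name of the form geoflo_download_YYYYMMDD_HHMMSS.log."""
    return f"geoflo_download_{now:%Y%m%d_%H%M%S}.log"


def configure_logging(verbose: bool, log_dir: Path = Path(".")) -> Path:
    """Log to the console and a timestamped file; DEBUG level if verbose."""
    log_path = log_dir / log_file_name(datetime.now())
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler(log_path)],
        force=True,  # replace any handlers configured earlier
    )
    return log_path
```

A script would call `configure_logging(verbose=args.verbose)` once at start-up, where `args.verbose` comes from its own argument parser.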

Project Structure

.
├── 1_data_download.py               # Stage 1: data download
├── 2_data_preprocessing.py          # Stage 2: preprocessing & merging
├── 3_create_admin_features.py       # Stage 3: feature extraction
├── label_features_with_population.py  # Stage 4: population labelling (optional)
├── data_download/
│   ├── get_landsat_stats.py
│   ├── get_esri_land_class_ratios.py
│   ├── get_ntl_stats.py
│   ├── aggregated_buildings_download.py
│   └── osm_download.py
├── preprocessing/
│   └── preprocessing.py
├── feature_extraction/
│   ├── area_extractor.py
│   └── osm_data_extractor.py
├── download_config.yml
├── feature_extraction_config.yml
└── population_labelling_config.yml

Notes

The pipeline expects the data used for feature extraction (GEE outputs and OSM) to follow a specific directory structure under base_dir, as set in feature_extraction_config.yml:

base_dir/
├── LandsatStats/{year}/adm{level}/
├── LandCoverRatios/{year}/adm{level}/
├── AnnualNTL_Stats/{year}/adm{level}/
├── VidaBuildings/{year}/adm{level}/
└── osm/{year}/

Each data directory should contain one or more .csv files, which are concatenated automatically.
Admin boundary files must include a code column and a name column, which are remapped to a standardised ADM{level}_PCODE / ADM{level}_PT format.
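
Before running Step 3 it can help to confirm that this layout is in place. The sketch below builds the expected directories for a given year and admin level and reports those that do not exist. The directory names are taken from the structure shown above; the helper names are hypothetical and not part of GeoFlo.

```python
# Sketch only: check the expected data layout under base_dir.
from pathlib import Path

# Per-source directories that are split by year and admin level.
ADMIN_LEVEL_SOURCES = (
    "LandsatStats",
    "LandCoverRatios",
    "AnnualNTL_Stats",
    "VidaBuildings",
)


def expected_data_dirs(base_dir: Path, year: int, level: int) -> list[Path]:
    """Return the directories the layout above implies for one year and level."""
    dirs = [
        base_dir / name / str(year) / f"adm{level}" for name in ADMIN_LEVEL_SOURCES
    ]
    dirs.append(base_dir / "osm" / str(year))  # OSM is split by year only
    return dirs


def missing_data_dirs(base_dir: Path, year: int, level: int) -> list[Path]:
    """Return the expected directories that do not exist."""
    return [d for d in expected_data_dirs(base_dir, year, level) if not d.is_dir()]
```

An empty list from `missing_data_dirs` only confirms the folders exist; it does not check that they contain .csv files.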
