A modular pipeline for downloading, preprocessing, and extracting geospatial features from Earth Observation (EO) and survey data sources, designed to support admin-level machine learning tasks such as poverty prediction and socioeconomic analysis.
GeoFlo consists of four sequential pipeline stages:
- Data Download: Submits batch export tasks to Google Earth Engine (GEE) and downloads OpenStreetMap (OSM) data. GEE outputs are saved to your Google Drive and must be manually downloaded to your local machine before proceeding.
- Data Preprocessing: Merges the batch CSV files exported by GEE into consolidated datasets per data source and admin level, then applies geographic preprocessing.
- Feature Extraction: Aggregates all preprocessed data sources into a single feature CSV per administrative level.
- Feature Labelling (optional): Joins the extracted features with a label dataset (e.g. population, wealth index, survey data) on administrative unit codes. A population labelling script is provided as a reference implementation, but any label can be used provided it can be joined on ADM{level}_PCODE.
1_data_download.py → [Google Drive] → 2_data_preprocessing.py → 3_create_admin_features.py → label_features_with_population.py

- 1_data_download.py: GEE batch tasks + OSM download
- [Google Drive]: manual download to local machine
- 2_data_preprocessing.py: merge batch files & geographic preprocessing
- 3_create_admin_features.py: feature extraction & merge into CSV
- label_features_with_population.py: optional; join any label dataset on ADM{level}_PCODE

- Python 3.x
- Google Earth Engine Python API (ee)
- geopandas
- pandas
- pyyaml
- A valid Google Cloud project with Earth Engine access

We recommend using an environment that includes geoai-py, which bundles many of the geospatial dependencies this pipeline relies on (including geopandas, earthengine-api, and related libraries):
pip install geoai-py

Then install any remaining dependencies:

pip install pyyaml

Alternatively, install all dependencies individually:

pip install earthengine-api geopandas pandas pyyaml

The pipeline uses three YAML configuration files:

| Config file | Used by |
|---|---|
| download_config.yml | Steps 1 (download) and 2 (preprocessing) |
| feature_extraction_config.yml | Step 3 (feature extraction) |
| population_labelling_config.yml | Step 4 (population labelling) |

download_config.yml is used by 1_data_download.py and 2_data_preprocessing.py.
earth_engine_info:
  project_name: "<your-project-name>"
  drive_folder: "<your-drive-folder>"
boundary_information:
  boundary_path: "<path-to-boundary-shapefile>"
  admin_level_path: "<path-to-admin-level-shapefile>"
  country: "<country-name>"
  country_iso: "<ISO-code>"
  year: <year>
landsat_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>
land_class_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>
buildings_download:
  download: true
  admin_levels:
    - <level-1>
    - <level-2>
  ee_file_name: "<ee-file-name>"
night_time_light_download:
  download: true
  year: <year>
  admin_levels:
    - <level-1>
    - <level-2>
osm_download:
  download: true
  output_path: "<path-to-output-directory>"
pre_processing_info:
  crs: "EPSG:4326"
  gee_downloads_dir: "<path-to-gee-downloads-dir>"
  base_downloads_dir: "<path-to-base-downloads-dir>"
  admin_levels:
    - <level-1>
    - <level-2>
  year: <year>
  delete_original_data: false

feature_extraction_config.yml is used by 3_create_admin_features.py.
admin_information:
  admin_level: <admin-level>
  boundary_path: "<path-to-boundary-shapefile>"
  admin_code_column: "<admin-code-column>"
  admin_name_column: "<admin-name-column>"
data_information:
  base_dir: "<path-to-base-directory>"
  year: <year>
  output_extracted_features_dir: "<path-to-output-extracted-features-dir>"
  # existing_extracted_features:
  #   buildings: ""
  #   landsat: ""

population_labelling_config.yml is used by label_features_with_population.py, the reference implementation for feature labelling. If you write a custom labelling script, adapt this config structure to suit your label data.
features_csv: "<path-to-extracted-features-csv>"
population_csv: "<path-to-population-csv>"
admin_level: <admin-level>
admin_code_column: "<admin-code-column>"
output_fpath: "<path-to-output-csv>"

The population CSV must contain a T_TL column with total population values, and a column matching admin_code_column that can be joined to the features data on ADM{level}_PCODE.
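
The column requirement above can be checked before running the labelling step. The following is a minimal sketch using only the standard library; the function name `missing_population_columns` is hypothetical and not part of the pipeline, and it assumes the CSV header is UTF-8 decodable.

```python
import csv


def missing_population_columns(population_csv: str, admin_code_column: str) -> list[str]:
    """Return the required columns that are absent from the population CSV header."""
    required = ["T_TL", admin_code_column]
    with open(population_csv, newline="", encoding="utf-8") as handle:
        # Only the first row (the header) is needed for this check.
        header = next(csv.reader(handle), [])
    return [column for column in required if column not in header]
```

An empty list means both the T_TL column and the admin code column are present.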
Run each script in order, passing the path to your config file.
python 1_data_download.py path/to/download_config.yml

Optional flags:

python 1_data_download.py path/to/download_config.yml --verbose

Downloads the following data sources (each toggleable via the config):

| Source | Description |
|---|---|
| Landsat | Satellite imagery statistics per admin unit |
| Land Cover | ESRI land classification ratios |
| Buildings | VIDA building footprint aggregations |
| Night Time Lights | Annual NTL statistics |
| OSM | OpenStreetMap road, amenity, and infrastructure data |
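
As an illustration of the per-source toggles, the sketch below lists which download sections are switched on in an already parsed config (for example, the dictionary returned by `yaml.safe_load`). The helper name `enabled_sources` and the example values are hypothetical; the section names follow download_config.yml.

```python
from typing import Any


def enabled_sources(config: dict[str, Any]) -> list[str]:
    """Return the names of the *_download sections whose 'download' flag is true."""
    return [
        name
        for name, section in config.items()
        if name.endswith("_download")
        and isinstance(section, dict)
        and section.get("download") is True
    ]


# Hypothetical parsed config; real values come from download_config.yml.
example_config = {
    "earth_engine_info": {"project_name": "my-project", "drive_folder": "geoflo"},
    "landsat_download": {"download": True, "year": 2020, "admin_levels": [1, 2]},
    "buildings_download": {"download": False, "admin_levels": [2]},
    "osm_download": {"download": True, "output_path": "data/osm"},
}

print(enabled_sources(example_config))
```

With the example config, only the Landsat and OSM sections are reported as enabled.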
Important: GEE downloads (Landsat, Land Cover, Buildings, Night Time Lights) are submitted as batch tasks and exported directly to your Google Drive account — they do not download to your local machine immediately. You must monitor the tasks and wait for them to complete before proceeding to Step 2.
Monitor task progress at: https://code.earthengine.google.com/tasks
Once complete, manually download the output files from Google Drive to your local machine. Set gee_downloads_dir in download_config.yml to the local path where you have saved these files; this is required for Step 2 to locate them.
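
Before moving on to Step 2, it can help to confirm that the manual download actually put CSV files where gee_downloads_dir points. This is a rough sketch, not part of the pipeline: `find_gee_csvs` is a hypothetical helper, and searching subdirectories recursively is an assumption about how the files were saved.

```python
from pathlib import Path


def find_gee_csvs(gee_downloads_dir: str) -> list[Path]:
    """Return every CSV file found under gee_downloads_dir, sorted by path."""
    root = Path(gee_downloads_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"gee_downloads_dir does not exist: {root}")
    # rglob also searches subdirectories (an assumption about the local layout).
    return sorted(root.rglob("*.csv"))
```

An empty result suggests the exports have not finished or were saved somewhere else.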
python 2_data_preprocessing.py path/to/download_config.yml

GEE exports each data source as multiple batch CSV files (one per task/tile). This step merges those batch files together into consolidated datasets per data source and admin level, and applies geographic preprocessing (reprojection, standardisation, etc.). The merged outputs are written to the directory specified by base_downloads_dir in your config, ready for feature extraction in Step 3.
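
To make the merge step concrete, here is a minimal standard-library sketch of concatenating batch CSVs into one file. It is an illustration only: the pipeline's own preprocessing code may work differently (for instance with pandas), and `merge_batch_csvs` is a hypothetical name. The geographic preprocessing is not covered.

```python
import csv
from pathlib import Path


def merge_batch_csvs(batch_dir: str, merged_path: str) -> int:
    """Concatenate every CSV in batch_dir into merged_path and return the row count."""
    batch_files = sorted(Path(batch_dir).glob("*.csv"))
    if not batch_files:
        raise FileNotFoundError(f"No CSV files found in {batch_dir}")

    rows: list[dict[str, str]] = []
    fieldnames: list[str] = []
    for batch_file in batch_files:
        with batch_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of all columns, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    with open(merged_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given batch file did not contain.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Sorting the batch files keeps the row order reproducible between runs.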
python 3_create_admin_features.py path/to/feature_extraction_config.yml

Extracts and merges all features for the configured administrative level into a single CSV file:
output/features/admin_{level}_extracted_features.csv
Features extracted include:
- Admin unit area (km²)
- Landsat statistics
- Land cover ratios
- Annual Night Time Lights statistics
- VIDA building footprint statistics
- OSM-derived features
The extracted feature CSV can be labelled with any external dataset — population, wealth index, survey-based poverty scores, or any other indicator — as long as the label data includes a column that can be joined to ADM{level}_PCODE.
A reference implementation for population labelling is provided:
python label_features_with_population.py path/to/population_labelling_config.yml

This joins the extracted features with a population CSV, adding two new columns per admin unit:

| Column | Description |
|---|---|
| T_TL | Raw total population |
| log_plus_one_pop | Log-transformed population: log(T_TL + 1), useful for modelling |

The script handles both UTF-8 and latin-1 encoded CSV files automatically. The labelled output is saved to the path specified by output_fpath in the config.
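
One common way to handle both encodings is to try UTF-8 first and fall back to latin-1 when decoding fails. The sketch below shows that pattern with the standard library; it is an assumption about the approach, not a copy of the script's code, and `read_csv_rows` is a hypothetical helper.

```python
import csv


def read_csv_rows(path: str) -> list[dict[str, str]]:
    """Read a CSV into a list of dicts, trying UTF-8 first and latin-1 second."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this retry cannot fail on decoding.
        with open(path, newline="", encoding="latin-1") as handle:
            return list(csv.DictReader(handle))
```

Because latin-1 accepts any byte sequence, a file in some other encoding would be read without error but with wrong characters, so the fallback is best treated as a convenience rather than detection.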
For other label types, write a script that reads your label data, joins it to the feature CSV on ADM{level}_PCODE, and saves the result. The population labelling script can be used as a template.
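
A custom labelling script could follow the outline below. This is a sketch under stated assumptions: rows are plain dictionaries (for example from `csv.DictReader`), feature rows without a matching label are dropped (an inner join, which may differ from the reference script), and the names `label_features`, `XX0101`, and `area_km2` are hypothetical.

```python
import math


def label_features(features, labels, code_column, label_column="T_TL",
                   log_column="log_plus_one_pop"):
    """Join feature rows to label rows on code_column and add a log(x + 1) column."""
    label_by_code = {row[code_column]: float(row[label_column]) for row in labels}
    labelled = []
    for row in features:
        value = label_by_code.get(row[code_column])
        if value is None:
            continue  # no label for this admin unit: drop it (inner join)
        labelled.append({**row, label_column: value, log_column: math.log1p(value)})
    return labelled


# Hypothetical example rows; real rows come from the feature and label CSVs.
features = [
    {"ADM2_PCODE": "XX0101", "area_km2": "12.5"},
    {"ADM2_PCODE": "XX0102", "area_km2": "30.0"},
]
labels = [{"ADM2_PCODE": "XX0101", "T_TL": "99"}]
labelled = label_features(features, labels, code_column="ADM2_PCODE")
```

`math.log1p(value)` computes log(value + 1), matching the log_plus_one_pop definition for the population label.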
The final output of the pipeline is a CSV file written to the directory specified by output_extracted_features_dir in your config:
admin_2_extracted_features.csv # one row per admin unit, one column per feature
Each row corresponds to an administrative unit identified by its ADM{level}_PCODE.
Step 1 (1_data_download.py) writes a timestamped log file to the working directory:
geoflo_download_YYYYMMDD_HHMMSS.log
Logs are also printed to the console. Use --verbose for DEBUG-level output.
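
The logging behaviour described above can be approximated with the standard `logging` module. This is a sketch, not the script's actual setup: the logger name, the message format, and INFO as the non-verbose level are assumptions.

```python
import logging
from datetime import datetime


def log_file_name(now: datetime) -> str:
    """Build a name of the form geoflo_download_YYYYMMDD_HHMMSS.log."""
    return now.strftime("geoflo_download_%Y%m%d_%H%M%S.log")


def configure_logging(verbose: bool, log_path: str) -> logging.Logger:
    """Log to both a file and the console; DEBUG when verbose, INFO otherwise (assumed)."""
    logger = logging.getLogger("geoflo_download")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A typical call would be `configure_logging(verbose=True, log_path=log_file_name(datetime.now()))`.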
.
├── 1_data_download.py # Stage 1: data download
├── 2_data_preprocessing.py # Stage 2: preprocessing & merging
├── 3_create_admin_features.py # Stage 3: feature extraction
├── label_features_with_population.py # Stage 4: population labelling (optional)
├── data_download/
│ ├── get_landsat_stats.py
│ ├── get_esri_land_class_ratios.py
│ ├── get_ntl_stats.py
│ ├── aggregated_buildings_download.py
│ └── osm_download.py
├── preprocessing/
│ └── preprocessing.py
├── feature_extraction/
│ ├── area_extractor.py
│ └── osm_data_extractor.py
├── download_config.yml
├── feature_extraction_config.yml
└── population_labelling_config.yml
- The pipeline expects downloaded GEE data to follow a specific directory structure under base_dir:
  base_dir/
  ├── LandsatStats/{year}/adm{level}/
  ├── LandCoverRatios/{year}/adm{level}/
  ├── AnnualNTL_Stats/{year}/adm{level}/
  ├── VidaBuildings/{year}/adm{level}/
  └── osm/{year}/
- Each data directory should contain one or more .csv files, which are concatenated automatically.
- Admin boundary files must include a code column and a name column, which are remapped to a standardised ADM{level}_PCODE/ADM{level}_PT format.
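
The expected layout can be turned into a quick pre-flight check before running Step 3. The sketch below builds the directory paths listed above and reports which ones are missing; the helper names are hypothetical, and the check only tests that each directory exists, not what it contains.

```python
from pathlib import Path

# Directory names taken from the layout described above.
GEE_SOURCE_DIRS = ("LandsatStats", "LandCoverRatios", "AnnualNTL_Stats", "VidaBuildings")


def expected_data_dirs(base_dir: str, year: int, level: int) -> dict[str, Path]:
    """Map each data source to the directory the pipeline expects under base_dir."""
    base = Path(base_dir)
    dirs = {name: base / name / str(year) / f"adm{level}" for name in GEE_SOURCE_DIRS}
    dirs["osm"] = base / "osm" / str(year)  # OSM data has no adm{level} subdirectory
    return dirs


def missing_data_dirs(base_dir: str, year: int, level: int) -> list[str]:
    """Return the names of the sources whose expected directory does not exist."""
    return [
        name
        for name, path in expected_data_dirs(base_dir, year, level).items()
        if not path.is_dir()
    ]
```

An empty list from `missing_data_dirs` means every expected directory is in place for the given year and admin level.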
