
Remove keep_duplicates from vertical observation and model result#657

Open
ecomodeller wants to merge 1 commit into main from remove-vertical-keep-duplicates

@ecomodeller
Member

Summary

VerticalObservation and VerticalModelResult exposed a keep_duplicates argument that deduplicated rows sharing a (time, z) pair, emitting only a warning. This PR removes the argument and instead raises a ValueError when duplicates are present.

Why this is out of scope for modelskill

A vertical profile is, by definition, a set of measurements at unique depths for a given timestamp. Two values at the same (time, z) are not a sampling artifact the library can sensibly resolve; they are a data-quality issue that only the caller has context to fix. The choice between which duplicate to keep, whether to average them, or whether to discard them entirely depends on the instrument, the QC pipeline, and the scientific question. Hard-coding "first" as the default silently picked one for the user, which is the worst of all options: it hid a real problem behind a warning that's easy to miss.

modelskill's job is to compare observations and model results, not to perform input cleaning that pandas already does better. Users who genuinely want to dedupe can do so with one line of pd.DataFrame.drop_duplicates(...) or groupby(...).mean() before constructing the object, with full control over the semantics. Failing loudly on malformed input is the right default for a comparison library.
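As a rough illustration of the caller-side cleanup described above, the sketch below shows three ways to resolve duplicate (time, z) rows with plain pandas before constructing the object. The column names (`time`, `z`, `temperature`) and the sample values are made up for the example; they are not taken from modelskill or from this PR.

```python
import pandas as pd

# Hypothetical profile data: two rows share (time, z) = (2024-01-01, -5.0).
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"]
        ),
        "z": [-1.0, -5.0, -5.0, -1.0],
        "temperature": [10.0, 8.0, 9.0, 11.0],
    }
)

# Option 1: keep the first row of each (time, z) pair.
first = df.drop_duplicates(subset=["time", "z"], keep="first")

# Option 2: average the values within each (time, z) pair.
mean = df.groupby(["time", "z"], as_index=False)["temperature"].mean()

# Option 3: discard every row that belongs to a duplicated pair.
unique_only = df[~df.duplicated(subset=["time", "z"], keep=False)]
```

Which option is appropriate depends on the instrument and QC pipeline, which is the point of leaving the decision to the caller.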

The same argument arguably applies to TrackObservation / TrackModelResult, where keep_duplicates was originally introduced: duplicate timestamps in track data have a real-world cause (sub-second rounding in altimetry, satellite crossings), but the resolution is still a caller concern. That's a separate discussion and not touched here.

Behaviour change

Before: VerticalObservation(df_with_duplicates) emits UserWarning("Removed N duplicate (time, z) entries with keep=first") and returns a deduped object.

After: VerticalObservation(df_with_duplicates) raises ValueError("Input contains N duplicate (time, z) entries. Vertical profiles must have a unique depth per timestamp; deduplicate the input before constructing the object.")
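For readers who want to see what such a check could look like, here is a minimal sketch using pandas. The helper `check_unique_time_z`, its parameters, and the way N is counted (rows beyond the first in each pair) are assumptions made for illustration; this is not the code in the PR, and only the error message text is taken from the description above.

```python
import pandas as pd


def check_unique_time_z(
    df: pd.DataFrame, time_col: str = "time", z_col: str = "z"
) -> pd.DataFrame:
    """Hypothetical validator: raise if any (time, z) pair occurs more than once."""
    # Count rows that repeat an earlier (time, z) pair (assumed definition of N).
    n_dup = int(df.duplicated(subset=[time_col, z_col], keep="first").sum())
    if n_dup > 0:
        raise ValueError(
            f"Input contains {n_dup} duplicate (time, z) entries. "
            "Vertical profiles must have a unique depth per timestamp; "
            "deduplicate the input before constructing the object."
        )
    return df


# Made-up input with one repeated (time, z) pair.
dup = pd.DataFrame(
    {
        "time": pd.to_datetime(["2024-01-01", "2024-01-01"]),
        "z": [-5.0, -5.0],
        "temperature": [8.0, 9.0],
    }
)

try:
    check_unique_time_z(dup)
except ValueError as err:
    message = str(err)  # the caller sees the failure immediately
```

Once the caller has deduplicated the frame, the same check passes and returns the input unchanged.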

The vertical surface is still alpha and has not been exposed to users, so no deprecation path is needed.

Raise ValueError on duplicate (time, z) entries instead of silently
deduplicating. Vertical profiles must have a unique depth per timestamp;
duplicates indicate a data-quality issue that the caller should resolve
before construction, not a sampling artifact the library should paper over.