Remove keep_duplicates from vertical observation and model result#657
Open
ecomodeller wants to merge 1 commit into
Open
Remove keep_duplicates from vertical observation and model result#657ecomodeller wants to merge 1 commit into
ecomodeller wants to merge 1 commit into
Conversation
Raise ValueError on duplicate (time, z) entries instead of silently deduplicating. Vertical profiles must have a unique depth per timestamp; duplicates indicate a data-quality issue that the caller should resolve before construction, not a sampling artifact the library should paper over.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
VerticalObservationandVerticalModelResultexposed akeep_duplicatesargument that silently deduplicates rows sharing a(time, z)pair (with a warning). This PR removes the argument and instead raises aValueErrorwhen duplicates are present.Why this is out of scope for modelskill
A vertical profile is, by definition, a set of measurements at unique depths for a given timestamp. Two values at the same
(time, z)is not a sampling artifact the library can sensibly resolve — it's a data-quality issue that only the caller has context to fix. The choice between which duplicate to keep, whether to average them, or whether to discard them entirely depends on the instrument, the QC pipeline, and the scientific question. Hard-coding\"first\"as the default silently picked one for the user, which is the worst of all options: it hid a real problem behind a warning that's easy to miss.modelskill's job is to compare observations and model results — not to perform input cleaning that pandas already does better. Users who genuinely want to dedupe can do so with one line of
pd.DataFrame.drop_duplicates(...)orgroupby(...).mean()before constructing the object, with full control over the semantics. Failing loudly on malformed input is the right default for a comparison library.The same argument arguably applies to
TrackObservation/TrackModelResult, wherekeep_duplicateswas originally introduced — duplicate timestamps in track data have a real-world cause (sub-second rounding in altimetry, satellite crossings) but the resolution is still a caller concern. That's a separate discussion and not touched here.Behaviour change
Before:
VerticalObservation(df_with_duplicates)→UserWarning("Removed N duplicate (time, z) entries with keep=first"), returns a deduped object.After:
VerticalObservation(df_with_duplicates)→ValueError("Input contains N duplicate (time, z) entries. Vertical profiles must have a unique depth per timestamp; deduplicate the input before constructing the object.")The vertical surface is still alpha and has not been exposed to users, so no deprecation path is needed.