StationLookup: static data inconsistency & API limitations #4044

@dcamron

Description

This arrives from a support ticket, characterized by the following example. On its face, it just demonstrates that StationLookup (here powering add_station_lat_lon) won't identify stations by anything other than the 4-letter ICAO station id we use to index our lookup, and doesn't offer much in the way of errors or info when it fails to find a station:

from io import StringIO
import pandas as pd
from metpy.io import add_station_lat_lon, station_info, StationLookup

# example dataframe
df_str = """station  height      u_wind        v_wind  temperature  dewpoint
68004    1403   21.000000     3.857637e-15         17.3     -17.9
67221     546   -4.698463    -1.710101e+00         21.8      13.8
67221     546   -4.698463    -1.710101e+00         21.8      13.8
67221     546   -4.698463    -1.710101e+00         21.8      13.8
68007     530    9.848078     1.736482e+00          2.3      -7.9"""

df = pd.read_fwf(StringIO(df_str))

print(add_station_lat_lon(df))
# produces
"""   station  height     u_wind  ...  dewpoint  latitude  longitude
0    68004    1403  21.000000  ...     -17.9       NaN        NaN
1    67221     546  -4.698463  ...      13.8       NaN        NaN
2    67221     546  -4.698463  ...      13.8       NaN        NaN
3    67221     546  -4.698463  ...      13.8       NaN        NaN
4    68007     530   9.848078  ...      -7.9       NaN        NaN"""

As our Station namedtuples are populated with plenty of other information, I figured it wouldn't be too hard to expand StationLookup to support something like this. Come along for my exploration:


We index on ICAO id, e.g. "KDEN", from four static data files, in order of precedence: sfstns.tbl (from GEMPAK?), master.txt (from ???), stations.txt (credited to NSF NCAR RAL & NWS AWC), and airport_codes.csv (a 5 MB file, from ???).

  1. airport_codes.csv is not indexed the same way as the other files; instead, it is constructed as a dataframe indexed on its column titles. This means it is never used in lookup (as publicly documented and used in add_station...) and never creates Stations. This throws off a lot of expected behavior for StationLookup, even for users who try to use it directly, and hurts our ability to, e.g., iterate over the lookup without discarding that data. I believe we are also reading in, converting to dict, and caching all(?) of this data on any metpy.io import.
print([station for station in station_info if station_info[station].synop_id == '68004'])
"""AttributeError: 'dict' object has no attribute 'synop_id'"""

Stations can be defined in multiple files, and properties like WMO identifier, lat/lon/alt, and name can differ or be missing inconsistently across files. Stations are not updated or overwritten; instead we store each as a snapshot of an individual data source. I think that's sane.

print(station_info['FYOW'])
# produces
"""Station(id='FYOW', synop_id=999999, name='Otjiwarongo Arpt', state='--', country='NA',
longitude=16.67, latitude=-20.43, altitude=1481, source='.../sfstns.tbl')"""

print(station_info.tables.parents['FYOW'])
# produces
"""Station(id='FYOW', synop_id='68004', name='Otjiwarongo', state='--', country='NA',
longitude=16.633333333333333, latitude=-20.45, altitude=1455, source='.../master.txt')"""
  2. However, this establishes "defaults" based on the precedence set in our underlying ChainMap, with no indication that additional definitions exist. We provide no API or documentation for checking beyond the first data source in which StationLookup finds a key, and there's no functionality for add_station_lat_lon to search deeper or to look up on Station properties.
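To make the precedence mechanics concrete, here's a stdlib-only sketch (hypothetical cut-down station tables, not the real files) of how a ChainMap hides later definitions behind the first hit, and how .parents or walking .maps exposes them:

```python
from collections import ChainMap, namedtuple

Station = namedtuple('Station', ['id', 'synop_id', 'source'])

# Two snapshots of the same station from different "files";
# the first map in the chain wins on plain lookup.
sfstns = {'FYOW': Station('FYOW', 999999, 'sfstns.tbl')}
master = {'FYOW': Station('FYOW', '68004', 'master.txt')}
lookup = ChainMap(sfstns, master)

print(lookup['FYOW'].source)          # sfstns.tbl  (the "default")
print(lookup.parents['FYOW'].source)  # master.txt  (shadowed definition)

# Every snapshot of a key, in precedence order:
snapshots = [m['FYOW'] for m in lookup.maps if 'FYOW' in m]
print([s.source for s in snapshots])  # ['sfstns.tbl', 'master.txt']
```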

Combining these, here's what it takes to find a solution right now:

# remove `airport_codes.csv` dict map
del station_info.tables.maps[-1]

print([station for station in station_info if station_info[station].synop_id == '68004'])
# produces
"""[]"""

# while station 'FYOW' exists in the "default" keys, that WMO id
# cannot be found in any of the "default" Stations, so we
# search in keys from later-specific data files
print([station for station in station_info.tables.parents if station_info.tables.parents[station].synop_id == '68004'])
# produces
"""['FYOW']"""

I think those are the main issues I've dredged up, and I think there are quite a few separate sub-issues we could choose to take on from these:

  • Document the data sources, justifying and documenting the precedence in which we search them, and doing something about keeping them updated. Did you know there's an entirely "new" WMO identifier system?
  • Fix how airport_codes.csv is indexed, if we use it at all. Avoid creating the dataframe up front? Similarly, confirm the caching behavior we want and be more explicit about when this data is read/cached.
  • Expose the underlying data tables to users. They're each unique, and some are a pain to parse. If we parse and provide the tables themselves, users can do neat things like joining to their existing dataframes.
  • API enhancements
    • Add public API to retrieve multiple/"next" station definitions from StationLookup.
    • Add public API to lookup Stations based on their Station properties, eg find from WMO id, find from a name (sub)string, or find stations at given lats and/or lons.
    • Use these in dependent functionality, e.g. have add_station_lat_lon search based on WMO id/name or search deeper on NaN values. Or add some other helpers to get there.
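As a sketch of that last bullet, here's one way an add_station_lat_lon-style fallback could fill coordinates by WMO/synop id when the ICAO lookup comes up empty (everything here, lat_lon_for_synop and the sample table included, is hypothetical illustration, not MetPy code):

```python
import math
from collections import namedtuple

Station = namedtuple('Station', ['id', 'synop_id', 'latitude', 'longitude'])

# Hypothetical parsed tables, keyed by ICAO id as today.
tables = [{'FYOW': Station('FYOW', '68004', -20.45, 16.63)}]

def lat_lon_for_synop(synop_id, tables):
    """Fall back to scanning every table for a matching WMO/synop id;
    return NaNs when no table defines it, rather than raising."""
    for table in tables:
        for stn in table.values():
            if stn.synop_id == str(synop_id):
                return stn.latitude, stn.longitude
    return math.nan, math.nan

print(lat_lon_for_synop(68004, tables))  # (-20.45, 16.63)
print(lat_lon_for_synop(12345, tables))  # (nan, nan)
```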

I've bolded what I think would address the ticketed issue. I just learned Pandas has some chunking/iteration functionality that could help with some of this. I'm glad to take a look at this with some help prioritizing, after the CI overhaul.

Labels

  • Area: IO (pertains to reading data)
  • Type: Bug (something is not working like it should)
  • Type: Enhancement (enhancement to existing functionality)
  • Type: Feature (new functionality)