This arrives from a support ticket characterized by the following example, on its face just demonstrating that `StationLookup` (here, powering `add_station_lat_lon`) won't identify stations by anything other than the 4-letter ICAO station id we use to index our lookup, and doesn't offer much in the way of errors/info when it fails to find one:
```python
from io import StringIO

import pandas as pd

from metpy.io import add_station_lat_lon, station_info, StationLookup

# example dataframe
df_str = """station height u_wind v_wind temperature dewpoint
68004 1403 21.000000 3.857637e-15 17.3 -17.9
67221 546 -4.698463 -1.710101e+00 21.8 13.8
67221 546 -4.698463 -1.710101e+00 21.8 13.8
67221 546 -4.698463 -1.710101e+00 21.8 13.8
68007 530 9.848078 1.736482e+00 2.3 -7.9"""
df = pd.read_fwf(StringIO(df_str))
print(add_station_lat_lon(df))
```

which produces:

```
  station height     u_wind ... dewpoint latitude longitude
0   68004   1403  21.000000 ...    -17.9      NaN       NaN
1   67221    546  -4.698463 ...     13.8      NaN       NaN
2   67221    546  -4.698463 ...     13.8      NaN       NaN
3   67221    546  -4.698463 ...     13.8      NaN       NaN
4   68007    530   9.848078 ...     -7.9      NaN       NaN
```
As our `Station` namedtuples are populated with plenty of other information, I figured it wouldn't be too hard to expand `StationLookup` to support something like this. Come along for my exploration:
We index on ICAO id, e.g. "KDEN", from four `staticdata` files, in order of precedence: `sfstns.tbl` (from GEMPAK?), `master.txt` (from ???), `stations.txt` (credited to NSF NCAR RAL & NWS AWC), and `airport_codes.csv` (a 5 MB file, from ???).
`airport_codes.csv` is not indexed like the other files; instead we construct a dataframe from it and index on its column titles. This means it is never used in lookup (as publicly documented and used in `add_station...`) and never creates `Station`s. That throws off a lot of expected behavior of `StationLookup`, even for users who would try to use it directly, and damages our ability to, e.g., iterate over the lookup without throwing out that data. I believe we are also reading in, converting to dict, and caching all(?) of this data on any `metpy.io` import.
```python
print([station for station in station_info if station_info[station].synop_id == '68004'])
# AttributeError: 'dict' object has no attribute 'synop_id'
```
Stations can be defined in multiple files, and properties like the WMO identifier, lat/lon/altitude, and name can differ or be missing inconsistently across files. `Station`s are not updated or overwritten; instead we store each as a snapshot of an individual data source. I think that's sane.
```python
print(station_info['FYOW'])
# Station(id='FYOW', synop_id=999999, name='Otjiwarongo Arpt', state='--', country='NA',
#         longitude=16.67, latitude=-20.43, altitude=1481, source='.../sfstns.tbl')

print(station_info.tables.parents['FYOW'])
# Station(id='FYOW', synop_id='68004', name='Otjiwarongo', state='--', country='NA',
#         longitude=16.633333333333333, latitude=-20.45, altitude=1455, source='.../master.txt')
```
However, that establishes "defaults" based on the precedence set in our underlying `ChainMap`, with no indication that additional definitions exist. We provide no API or documentation for checking beyond the first data source in which a key is found in `StationLookup`, and there's no functionality for `add_station_lat_lon` to search deeper or to look up on `Station` properties.
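One shape such an API could take; `all_definitions` is a hypothetical helper over toy snapshots, not existing MetPy API:

```python
from collections import ChainMap


def all_definitions(chain, key):
    """Yield every definition of `key` across the chain, in precedence order."""
    for mapping in chain.maps:
        if key in mapping:
            yield mapping[key]


# Toy snapshots standing in for the per-file station tables
lookup = ChainMap({'FYOW': 'sfstns.tbl snapshot'}, {'FYOW': 'master.txt snapshot'})
print(list(all_definitions(lookup, 'FYOW')))
# ['sfstns.tbl snapshot', 'master.txt snapshot']
```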
Combining these, here's what it takes to find a solution right now:
```python
# remove the `airport_codes.csv` dict map
del station_info.tables.maps[-1]

print([station for station in station_info if station_info[station].synop_id == '68004'])
# []

# While station 'FYOW' exists in the "default" keys, that WMO id
# cannot be found in any of the "default" Stations, so we search
# the keys from the later, lower-precedence data files instead.
print([station for station in station_info.tables.parents if station_info.tables.parents[station].synop_id == '68004'])
# ['FYOW']
```
I think those are the main issues I've dredged up, and there are quite a few separate sub-issues we could choose to take on from these:
- Document the data sources, justify and document the precedence in which we search them, and do something about keeping them updated. Did you know there's an entirely "new" WMO identifier system?
- Fix how `airport_codes.csv` is indexed, if we use it at all. Avoid creating the dataframe up front? Similarly, confirm the caching behavior we want and be more explicit about when this data is read/cached.
- Expose the underlying data tables to users. They're each unique, and some are a pain to parse. If we parse and provide the tables themselves, users can do neat things like joining to their existing dataframes.
- API enhancements
  - Add public API to retrieve multiple/"next" station definitions from `StationLookup`.
  - Add public API to look up `Station`s based on their `Station` properties, e.g. find from WMO id, find from a name (sub)string, or find stations at given lats and/or lons.
  - Use these in dependent functionality, e.g. have `add_station_lat_lon` search based on WMO id/name or search deeper on NaN values. Or add some other helpers to get there.
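For that last point, if the tables were exposed as dataframes, the NaN fallback could be a plain merge. A sketch with made-up frames; none of these names are existing MetPy API:

```python
import pandas as pd

# Hypothetical table of station snapshots, as the "expose the tables" item imagines
stations = pd.DataFrame({'synop_id': ['68004'],
                         'latitude': [-20.45],
                         'longitude': [16.6333]})

# What the current add_station_lat_lon leaves us with: all NaN
df = pd.DataFrame({'station': ['68004', '67221'],
                   'latitude': [float('nan')] * 2,
                   'longitude': [float('nan')] * 2})

# Second pass: match the still-NaN rows against WMO/synop ids instead
matched = df.merge(stations, left_on='station', right_on='synop_id',
                   how='left', suffixes=('', '_wmo'))
df['latitude'] = df['latitude'].fillna(matched['latitude_wmo'])
df['longitude'] = df['longitude'].fillna(matched['longitude_wmo'])
print(df)  # '68004' now resolved; '67221' still NaN (absent from the toy table)
```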
I've bolded what I think would address the ticketed issue. I just learned Pandas has some chunking/iteration functionality that could help with some of this. I'm glad to take a look at this with some help prioritizing, after the CI overhaul.