StationLookup: static data inconsistency & API limitations #4044

@dcamron

Description

This arrives from a support ticket, characterized by the following example. On its face, it just demonstrates that StationLookup (here powering add_station_lat_lon) won't identify stations by anything other than the 4-letter ICAO station id we use to index our lookup, and doesn't offer much in the way of errors or info when it fails to find a station:

from io import StringIO
import pandas as pd
from metpy.io import add_station_lat_lon, station_info, StationLookup

# example dataframe
df_str = """station  height      u_wind        v_wind  temperature  dewpoint
68004    1403   21.000000     3.857637e-15         17.3     -17.9
67221     546   -4.698463    -1.710101e+00         21.8      13.8
67221     546   -4.698463    -1.710101e+00         21.8      13.8
67221     546   -4.698463    -1.710101e+00         21.8      13.8
68007     530    9.848078     1.736482e+00          2.3      -7.9"""

df = pd.read_fwf(StringIO(df_str))

print(add_station_lat_lon(df))
# produces
"""   station  height     u_wind  ...  dewpoint  latitude  longitude
0    68004    1403  21.000000  ...     -17.9       NaN        NaN
1    67221     546  -4.698463  ...      13.8       NaN        NaN
2    67221     546  -4.698463  ...      13.8       NaN        NaN
3    67221     546  -4.698463  ...      13.8       NaN        NaN
4    68007     530   9.848078  ...      -7.9       NaN        NaN"""

As our Station namedtuples are populated with plenty of other information, I figured it wouldn't be too hard to expand StationLookup to support something like this. Come along for my exploration:


We index on ICAO id, e.g. "KDEN", from four static data files, in order of precedence: sfstns.tbl (from GEMPAK?), master.txt (from ???), stations.txt (credited to NSF NCAR RAL & NWS AWC), and airport_codes.csv (a 5 MB file, from ???).

  1. airport_codes.csv is not indexed the same way as the other files; instead, it is constructed as a dataframe indexed on its column titles. This means it is never used in lookup (as publicly documented and used in add_station...) and never creates Stations. This throws off a lot of expected behavior for StationLookup, even for users who try to use it directly, and hurts our ability to, e.g., iterate over the lookup without discarding that data. I believe we are also reading in, converting to dict, and caching all(?) of this data on any metpy.io import.
print([station for station in station_info if station_info[station].synop_id == '68004'])
"""AttributeError: 'dict' object has no attribute 'synop_id'"""

Stations can be defined in multiple files, and properties like WMO identifier, lat/lon/alt, and name can differ or be missing inconsistently across files. Stations are not updated or overwritten; instead we store each as a snapshot of an individual data source. I think that's sane.

print(station_info['FYOW'])
# produces
"""Station(id='FYOW', synop_id=999999, name='Otjiwarongo Arpt', state='--', country='NA',
longitude=16.67, latitude=-20.43, altitude=1481, source='.../sfstns.tbl')"""

print(station_info.tables.parents['FYOW'])
# produces
"""Station(id='FYOW', synop_id='68004', name='Otjiwarongo', state='--', country='NA',
longitude=16.633333333333333, latitude=-20.45, altitude=1455, source='.../master.txt')"""
  2. However, this establishes "defaults" based on the precedence set in our underlying ChainMap, with no indication that additional definitions exist. We provide no API or documentation for checking beyond the first data source in which StationLookup finds a key, and there's no functionality for add_station_lat_lon to search deeper or to look up on Station properties.
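To make the precedence mechanics concrete, here's a stdlib-only sketch (hypothetical cut-down station tables, not the real files) of how a ChainMap hides later definitions behind the first hit, and how .parents or walking .maps exposes them:

```python
from collections import ChainMap, namedtuple

Station = namedtuple('Station', ['id', 'synop_id', 'source'])

# Two snapshots of the same station from different "files";
# the first map in the chain wins on plain lookup.
sfstns = {'FYOW': Station('FYOW', 999999, 'sfstns.tbl')}
master = {'FYOW': Station('FYOW', '68004', 'master.txt')}
lookup = ChainMap(sfstns, master)

print(lookup['FYOW'].source)          # sfstns.tbl  (the "default")
print(lookup.parents['FYOW'].source)  # master.txt  (shadowed definition)

# Every snapshot of a key, in precedence order:
snapshots = [m['FYOW'] for m in lookup.maps if 'FYOW' in m]
print([s.source for s in snapshots])  # ['sfstns.tbl', 'master.txt']
```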

Combining these, here's what it takes to find a solution right now:

# remove `airport_codes.csv` dict map
del station_info.tables.maps[-1]

print([station for station in station_info if station_info[station].synop_id == '68004'])
# produces
"""[]"""

# while station 'FYOW' exists in the "default" keys, that WMO id
# cannot be found in any of the "default" Stations, so we
# search in keys from later-specific data files
print([station for station in station_info.tables.parents if station_info.tables.parents[station].synop_id == '68004'])
# produces
"""['FYOW']"""

I think those are the main issues I've dredged up, and I think there are quite a few separate sub-issues we could choose to take on from these:

  • Document the data sources, justifying and documenting the precedence in which we search them, and doing something about keeping them updated. Did you know there's an entirely "new" WMO identifier system?
  • Fix how airport_codes.csv is indexed, if we use it at all. Avoid creating the dataframe up front? Similarly, confirm the caching behavior we want and be more explicit about when this data is read/cached.
  • Expose the underlying data tables to users. They're each unique, and some are a pain to parse. If we parse and provide the tables themselves, users can do neat things like joining to their existing dataframes.
  • API enhancements
    • Add public API to retrieve multiple/"next" station definitions from StationLookup.
    • Add public API to lookup Stations based on their Station properties, eg find from WMO id, find from a name (sub)string, or find stations at given lats and/or lons.
    • Use these in dependent functionality, e.g. have add_station_lat_lon search based on WMO id/name or search deeper on NaN values. Or add some other helpers to get there.
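As a sketch of that last bullet, here's one way an add_station_lat_lon-style fallback could fill coordinates by WMO/synop id when the ICAO lookup comes up empty (everything here, lat_lon_for_synop and the sample table included, is hypothetical illustration, not MetPy code):

```python
import math
from collections import namedtuple

Station = namedtuple('Station', ['id', 'synop_id', 'latitude', 'longitude'])

# Hypothetical parsed tables, keyed by ICAO id as today.
tables = [{'FYOW': Station('FYOW', '68004', -20.45, 16.63)}]

def lat_lon_for_synop(synop_id, tables):
    """Fall back to scanning every table for a matching WMO/synop id;
    return NaNs when no table defines it, rather than raising."""
    for table in tables:
        for stn in table.values():
            if stn.synop_id == str(synop_id):
                return stn.latitude, stn.longitude
    return math.nan, math.nan

print(lat_lon_for_synop(68004, tables))  # (-20.45, 16.63)
print(lat_lon_for_synop(12345, tables))  # (nan, nan)
```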

I've bolded what I think would address the ticketed issue. I just learned Pandas has some chunking/iteration functionality that could help with some of this. I'm glad to take a look at this with some help prioritizing, after the CI overhaul.

Labels

  • Area: IO (pertains to reading data)
  • Type: Bug (something is not working like it should)
  • Type: Enhancement (enhancement to existing functionality)
  • Type: Feature (new functionality)