My 2022 Tech Highlights: 5 Python libraries that water resources data professionals should know

Since writing my first line of code a little more than three years ago, I have been enthralled by the potential of Python-based data science to create cutting-edge environmental analyses and products. Looking back, 2022 will forever stand out as the year in which I dramatically leveled up my data science and software development skills. I attribute this both to excellent technical mentorship from my colleagues Paul Tomasula and Anthony Aufdenkampe and to opportunities to “learn on the job” by working on projects from early stages to completion at LimnoTech.

I'm using this post to share some of my learning by highlighting the top five Python libraries I’ve found most useful for environmental data science in 2022. Enjoy!

  • Xarray: For the uninitiated, Xarray can best be described as a pandas-like interface for working with labeled, multi-dimensional datasets in memory. An obvious use case in environmental data science is taking the pain out of saving and manipulating gigantic NetCDF files and GeoTIFFs locally. Beyond that, the ability to relate disparate datasets along a common axis (like time) is particularly interesting as a way to seamlessly integrate time-series data with gridded datasets. I had the opportunity to do a deep dive with Xarray while modernizing a suite of tools for our USGS clients, and I left thoroughly convinced it will play a role in almost all my work going forward.
  • HyRiver: I have to tip my hat to PhD candidate Taher Chegini for making a set of Python tools that is nothing short of game-changing for anyone working with US water data. HyRiver allows a wide variety of US water-related APIs to be accessed programmatically through an easy-to-use suite of tools. Whether you need streamflow gage time-series data, a watershed boundary shapefile, or gridded meteorological data, HyRiver vastly simplifies the process. Better yet, HyRiver allows some datasets, such as US 3DEP Digital Elevation Models (DEMs), to be read directly into in-memory Xarray objects! Frankly, it feels like I’m cheating when I use this library, and I cringe at the many hours I previously spent manually navigating data portals or writing custom scripts to query APIs.
  • hvPlot: If you’ve used Python for data science, you’re likely familiar with making plots using Matplotlib. However, I recommend jumping ship to hvPlot. hvPlot is a high-level library that can use Matplotlib (or Bokeh, or Plotly) under the hood, but lets you skip much of the manual effort previously involved in generating data visualizations. The developers behind hvPlot are attuned to industry trends and have closely integrated their API with performant big-data libraries such as Xarray and Dask. For example, after “import hvplot.xarray”, a simple “xarray.DataArray.hvplot()” call will rapidly generate an interactive visualization of a gridded dataset! Additionally, you can always fine-tune a plot by passing keyword arguments through to the underlying “engine” of choice. Since a picture is worth a thousand words, I plan on sharing some of the power of hvPlot soon, so stay tuned for that.
  • CatBoost: CatBoost is simply my machine learning (ML) model of choice for tabular datasets. Similar to the ever-popular XGBoost, CatBoost uses gradient-boosted decision trees under the hood. This year I was able to spend time on a project where I directly compared different ML methods, and across the board CatBoost outperformed the alternatives! That said, it is by no means the fastest model to train, and it is most appropriate where pure prediction accuracy is the goal.
  • PySWMM: Those working in the water resources industry are aware that US EPA’s Storm Water Management Model (SWMM) is the gold standard for sewer and watershed modeling. However, to effectively integrate 21st-century technology into our water infrastructure, we must first be able to model it, in ways that the standard SWMM desktop interface can’t support. PySWMM applies the same underlying SWMM model, but allows practitioners to control model inputs and Real Time Controls (RTCs, e.g. a pump that turns on/off to prevent flooding) using arbitrarily complex Python code. This enables us to design, model, and implement next-generation solutions for some of our most pressing stormwater management issues. For example, where the capital expenditures required to upgrade a stormwater system would be too costly, we can envision a future where flooding is prevented by retrofitting the system with IoT-style sensors that inform AI-controlled RTCs. The potential here is incredibly exciting and something I am proud to be involved with in the new year!
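To make the Xarray point about relating datasets along a common axis a bit more concrete, here is a minimal sketch. All values are synthetic, and the variable names (precip_mm, flow_cms) are made up for illustration; it only assumes that numpy, pandas, and xarray are installed. It lines up a gridded precipitation array with a single-gage streamflow series on their shared timestamps.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily precipitation grid with dims (time, y, x); values are made up.
precip = xr.DataArray(
    np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3),
    coords={
        "time": pd.date_range("2022-01-01", periods=4, freq="D"),
        "y": [0.0, 1.0],
        "x": [0.0, 1.0, 2.0],
    },
    dims=("time", "y", "x"),
    name="precip_mm",
)

# Synthetic streamflow at one gage, deliberately starting one day later.
flow = xr.DataArray(
    [10.0, 12.0, 11.0, 9.0],
    coords={"time": pd.date_range("2022-01-02", periods=4, freq="D")},
    dims="time",
    name="flow_cms",
)

# Keep only the timestamps present in both datasets.
precip_aligned, flow_aligned = xr.align(precip, flow, join="inner")

# Store both variables in one Dataset that shares a single time coordinate.
combined = xr.Dataset({"precip_mm": precip_aligned, "flow_cms": flow_aligned})

# Spatial mean of the grid at each time step, now directly comparable to flow.
basin_mean = combined["precip_mm"].mean(dim=("y", "x"))
```

Once both variables share one time coordinate, comparisons such as a rainfall-runoff correlation become one-liners, with no manual index bookkeeping.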
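As a rough illustration of the kind of Python-driven Real Time Control described in the PySWMM item, the sketch below defines a simple hysteresis rule for a pump and shows one way it might be wired into a PySWMM simulation loop. The thresholds, the function names, and the model, node, and pump identifiers are all hypothetical. The PySWMM part is wrapped in a function that is not called here, since it needs a real SWMM input file; the rule itself is exercised on made-up depths.

```python
def pump_should_run(depth, pump_on, on_depth=2.0, off_depth=0.5):
    """Hysteresis rule: start at or above on_depth, stop at or below
    off_depth, otherwise keep the current state. Thresholds are
    illustrative and in whatever depth units the model uses."""
    if depth >= on_depth:
        return True
    if depth <= off_depth:
        return False
    return pump_on


def run_with_control(inp_path, node_id, pump_id):
    """Apply the rule at every routing step of a SWMM model.

    inp_path, node_id, and pump_id are placeholders for a real model.
    """
    # Imported lazily so the rule above can be tested without PySWMM.
    from pyswmm import Simulation, Nodes, Links

    with Simulation(inp_path) as sim:
        wet_well = Nodes(sim)[node_id]
        pump = Links(sim)[pump_id]
        pump_on = False
        for _ in sim:  # each iteration advances the model one step
            pump_on = pump_should_run(wet_well.depth, pump_on)
            pump.target_setting = 1.0 if pump_on else 0.0


# Dry run of the rule on made-up wet-well depths (no SWMM model needed).
depths = [0.2, 1.0, 2.3, 1.5, 0.8, 0.4, 1.0]
state = False
states = []
for d in depths:
    state = pump_should_run(d, state)
    states.append(state)
```

Because the rule is an ordinary Python function, it could in principle be swapped for anything from a lookup table to a trained model without touching the simulation loop.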

Michael Di Matteo

Team Lead - Water Resources at KBR | PhD CPEng | Stormwater SA Treasurer

2 years ago

Thanks, Xavier, for posting!
