Bushfire Simulation in the Cloud – Leading the future

It’s a story as old as the internet itself: complex, specialist software that requires specialist hardware, has a steep learning curve, is built and deployed on legacy principles, and survives on a die-hard base of specialist users who can whisper away its myriad error messages is having its hour of reckoning. Bushfire simulation is both a science and a specialist art, and so too is disrupting the space to derive the maximum possible benefit from new technology.


This article begins a deep dive into how we are approaching the challenges that fire management agencies Australia-wide have faced when moving on from legacy bushfire simulators, how critical technology choices will enable bushfire practitioners and risk managers into the future, and, importantly, what those choices are so that you can consider them too. We also know from many years of feedback that legacy software, with its myriad issues and legacy management practices, is a large part of why we lose the interest of bright and enquiring minds who want to explore bushfire science in today’s digital world. By making smarter choices we open new ways to engage our volunteers, our staff, our partner agencies, our risk managers, and our students.


Before we get technical, let’s define the areas where fire management agencies, and the fire management community at large, need to see value for money:

- A reduction in IT administrative overhead
- An increase in accessibility
- An increase in system interconnectedness
- A decrease in user complexity
- A simplification of data pipeline complexity
- Improved systems redundancy
- A decrease in hardware management and user management overhead
- Improved security


All of these lead to three keystone measures of success:

- An opportunity for improving informed decision making across the organisation
- Greater retention of fire science learners in key programs
- Improved return on investment


To deliver this, we need to think carefully about what we actually want. We have come from software that requires every user to have a machine with FTP access, firewall permissions, installed executables and supporting DLL libraries, a suite of often hand-coded helper applications to collect, prepare, and validate data, and a messy routine to keep each of the inputs updated at least twice daily, keep outputs backed up, and maintain an ever-growing list of if-then caveats in training and procedures. It's safe to say we approached the problem with some clear metrics in mind.


Our solution should be able to:

- Make use of loosely coupled APIs, separating the front end and back end, allowing us flexibility of use cases, the ability to scale, the ability to integrate, and future proofing.
- Remove the need for 99% of users to install software or manage downloads and dependencies, yet still permit them to perform work as detailed as their role requires.
- Permit users to access and operate the software from low-bandwidth areas, including rural areas, with as little data transfer as possible.
- Permit users to use their choice of device, including mobile devices.
- Remain isolated and independent of the enterprise network for security purposes, but be able to talk to the network over the open internet using appropriate APIs where possible.
- Be as fast as reasonably possible, and nimble and adaptive enough that as advances in hardware and architecture become available we can adopt them quickly, without being beholden to archaic server build and reverence practices.
- Give us greater visibility, granularity, certainty, and observability of who is creating what, and at what level. Standard .exe builds of previous-generation fire simulators do not enable this at the infrastructure level.
- Scale. The impact of scalability and performance is critical. We are operating simulations with complex physics modelling including not just fire, but weather, atmospherics, ember transport, convection, and more. The hardware driving the AI revolution can be used to support this, and is generally available.


With this in mind, the development of the Spark Operational bushfire behaviour simulator by AFAC and CSIRO has yielded a number of things we can leverage to make this a reality. Two of the design choices, which were aimed at desktop usage but can be extended to other implementation patterns, are critical:

- The back end simulator is accessed via a secured REST API, making use of the Python FastAPI module.
- The underlying simulation framework wraps a number of spatial libraries in OpenCL using Geostack, a CSIRO initiative.


Note: Some network elements such as ports have been lightly edited to dissuade targeting.


Scaling Part One

Out of the box, the Spark Operational application launcher starts a single instance of the back end simulation API on port 6000, giving us our first look at how to scale this application. More on this soon. The front end of the application works effectively as a standalone website, with JavaScript handling much of the functionality, including things like user session identification and communication with our API. This part of the application is almost entirely executed in the client's browser and thus presents an exceptionally low-load scenario to our server – quite the opposite of what is happening with full and complex server-side physics simulations.
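As a quick aside, because the back end is a FastAPI application it is easy to check that an instance is actually up and listening. A small sketch, assuming the FastAPI defaults (including the auto-generated docs page at /docs) have not been disabled, and using the lightly edited port numbering from this article:

```bash
# Confirm something is listening on the back end port (edited numbering).
ss -ltn | grep ':6000'

# FastAPI applications serve auto-generated OpenAPI docs at /docs by default;
# if that default is untouched, a 200 here is a cheap health check that
# doesn't touch any simulation endpoints.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:6000/docs
```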


One thing we learned during testing is that, when built on a suitably sized server, the application and physics processing can support a number of users, and that the API has a fairly basic queuing algorithm, largely handled by the WSGI server itself. We have no control over the WSGI component (we cannot convert the application to ASGI at this late stage of development), but we did notice that even very large simulations would rarely consume more than a few GB of memory. This is interesting, because if we want to support more than a single user undertaking a simulation at once, WSGI is not an ideal choice. With a little bit of thinking, and talking to colleagues in other jurisdictions (shoutout to Sam Fergusson from Tasmania), we identified that running multiple back end simulation API endpoints on different ports, then load balancing the incoming requests with a solution such as Nginx, is a viable alternative. For the purposes of this demonstration, let's assume that we are now operating Nginx on port 6000 and internally load balancing requests to 6001, 6002, 6003, and 6004, each of which is operating a copy of the simulation API.
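To make that concrete, below is a minimal sketch of what the Nginx layer might look like, written as the kind of heredoc a bash deployment script could use. The upstream name, config path, and timeout value are illustrative assumptions rather than the project's actual configuration.

```bash
# Minimal sketch of the load-balancing layer described above: listen on 6000
# and distribute requests across four back end simulation API instances.
# The upstream name and config path are illustrative, not Spark's actual files.
sudo tee /etc/nginx/conf.d/spark-api.conf > /dev/null <<'EOF'
upstream spark_backends {
    # One entry per back end simulation API instance.
    server 127.0.0.1:6001;
    server 127.0.0.1:6002;
    server 127.0.0.1:6003;
    server 127.0.0.1:6004;
}

server {
    listen 6000;

    location / {
        proxy_pass http://spark_backends;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Simulation requests can run for a long time; allow generous timeouts.
        proxy_read_timeout 3600s;
    }
}
EOF

# Validate the configuration and reload Nginx.
sudo nginx -t && sudo systemctl reload nginx
```

Nginx's default round-robin distribution is fine for a demonstration; for long-running simulation requests, the least_conn directive is worth considering so that a busy instance isn't handed a second job while an idle one waits.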


The above achieves a scenario where multiple simulation jobs can run concurrently, and requests during busy periods beyond the four API instances will be queued by their respective WSGI instances awaiting simulation.


Performance Part One

We noted earlier the wonderful use of OpenCL to wrap some of the geospatial and physical libraries used by Spark Operational. Whilst OpenCL could always do with some more development love, it is excellent at exposing libraries that would otherwise be CPU-only to GPU processing. Graphics Processing Units are the biggest driver behind today's AI revolution, powering tools you may have heard of such as ChatGPT, DALL-E, Bing Copilot, Claude, and more. One of the reasons such tools have gravitated towards GPUs for computationally heavy tasks is that GPUs are exceptionally efficient at certain types of maths, which for our example today includes physics and vector operations.


What would happen if we asked our OpenCL environment to use our GPU instead of the CPU?

By default, Spark Operational will use a machine's CPU to undertake processing. During testing we found that switching our OpenCL environment over to the GPU resulted in a 6-8 fold reduction in simulation time, occasionally much higher. In practice, this means a simulation that took 2 minutes to run on a CPU took 15-20 seconds, or less, on a GPU. This is critically important for a number of reasons:

- Technology companies often talk about failing fast. This is because it allows you to either iterate quicker or divest from a particular investment quicker. During times of high pressure, such as the 2019-20 bushfires, an analyst looking to simulate a complex fire needs to iterate through scenarios as fast as possible to reach reliable answers.
- Sometimes we do complex things such as running ensemble modelling, where many models with uncertainty built into their variables are run at the same time and combined to give a probabilistic outcome. Speeding through these many iterations gets you probabilities faster.
- Job backlog can be an issue, so clearing jobs as quickly as possible removes potential bottlenecks.
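Before pointing anything at a GPU, it is worth confirming which OpenCL platforms and devices the machine actually exposes. A quick sketch using two standard command-line tools (assuming clinfo and the NVIDIA driver utilities are installed):

```bash
# List the OpenCL platforms and devices the machine exposes; the device IDs
# shown here are what the simulator's device-selection environment variable
# (described in the next section) needs to point at.
clinfo --list

# On an NVIDIA-equipped instance, cross-check which GPUs the driver can see.
nvidia-smi -L
```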


Performance and Scaling, Part Two

To do this we got a little creative with the implementation. The application can read the desired OpenCL ID (graphics card ID) from an environment variable. We also want to make sure the simulator starts up, runs, and stays running. Thankfully we are using a Linux-based system, so we can use systemctl to operate each instance of the simulator as a daemon. When doing this, each process can be handed the same environment variable but with a different value, a neat little trick. So, on a small machine with one graphics card we can create our four APIs on ports 6001, 6002, 6003, and 6004 as supervised processes that are automatically restarted on failure and all look at the same GPU. Using the same setup script on a larger machine with four graphics cards, we could either (a sketch of this setup follows below):

- Assign each API to its own graphics card by pointing the environment variables at the correct IDs (using a loop in our shell setup script – thanks bash!)
- Introduce some particularly complex load balancing where our first graphics card (ID = 0) gets 4 API instances attached to it (e.g. ports 6001-6004), card ID = 1 gets 4 API instances (ports 6005-6008), and so on, eventually creating 16 API instances, and then updating our Nginx load balancer to reflect this.


Whew! Given the second option above, we could now support 16 full-physics bushfire simulations at exactly the same point in time without any queuing on the WSGI service.
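For illustration, here is a rough sketch of how those per-port daemons and GPU assignments could be generated. The unit names, install paths, launch command, and the environment variable name (OPENCL_DEVICE_ID) are assumptions made for the example; Spark Operational's actual variable and launcher will differ.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: create one systemd service per simulation API
# instance and spread the instances across the available graphics cards.
# Unit names, paths, and the OPENCL_DEVICE_ID variable are assumed for the
# example, not taken from Spark Operational itself.
NUM_GPUS=4            # 1 on the small machine, 4 on the larger one
INSTANCES_PER_GPU=4   # four API instances per card, as described above

port=6001
for gpu in $(seq 0 $((NUM_GPUS - 1))); do
  for _ in $(seq 1 "$INSTANCES_PER_GPU"); do
    sudo tee "/etc/systemd/system/spark-api-${port}.service" > /dev/null <<EOF
[Unit]
Description=Spark Operational simulation API instance on port ${port}
After=network.target

[Service]
# Same environment variable for every instance, a different value per card.
Environment=OPENCL_DEVICE_ID=${gpu}
ExecStart=/opt/spark/venv/bin/python /opt/spark/run_api.py --port ${port}
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
    sudo systemctl enable --now "spark-api-${port}.service"
    port=$((port + 1))
  done
done
```

With 16 units in place, the Nginx upstream block from earlier just needs twelve more server lines (ports 6005-6016) to match.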


Reproducibility

One of the things we have been very keen to avoid is building monolithic servers that require reverence and may only be treated with kid gloves. Instead, we wanted this deployment pattern to be updatable and deployable across a variety of cloud providers with minimal change. Whilst we are working on AWS, and it has become a specific project requirement due to the availability of particular graphics hardware, we wanted to be able to offer the basic principles to other fire and government land management agencies who have been part of the Spark Operational group. To this end, we've produced a deployment script written in bash that can be dropped into AWS to spin up a new Spark Operational server in under 15 minutes.


Nope, we didn’t use Terraform, Elastic Beanstalk, Docker or build AMIs (though we could have for each), for the simple fact that we wanted the tooling to be as accessible and as transparent as possible, allowing our national colleagues to create their own deployment patterns off a proven high-performance model. We’re quite aware of the restrictions that most agencies place on their users in terms of network, cloud, and application access – but what remains is the ability to copy and paste some plaintext script in a web browser.


This script downloads and installs the Python environment, Python modules and dependencies, OpenCL, and NVIDIA drivers, survives a machine restart, sets up services and monitoring, starts the web server, and establishes weather model downloads, among a litany of other functions. It has been tested on 8 different AWS instances and deployed over 150 times during development to date. At this stage, it's quite battle-tested.
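We won't reproduce the real script here, but for readers building their own, a heavily abridged sketch of the general shape, suitable for pasting into an instance's user data, looks something like this. Package names, versions, and paths are assumptions for an Ubuntu-based instance, not the production script.

```bash
#!/usr/bin/env bash
# Heavily abridged sketch of a Spark Operational bootstrap script. Package
# names, versions, and paths are illustrative assumptions (Ubuntu-based
# instance), not the production deployment script.
set -euo pipefail

# 1. Base packages: Python tooling, Nginx, the OpenCL ICD loader and clinfo.
sudo apt-get update
sudo apt-get install -y python3-venv python3-pip nginx ocl-icd-libopencl1 clinfo

# 2. NVIDIA driver, which provides the GPU OpenCL runtime.
sudo apt-get install -y nvidia-driver-535

# 3. Python environment for the simulation API and its dependencies.
python3 -m venv /opt/spark/venv
/opt/spark/venv/bin/pip install --upgrade pip fastapi uvicorn

# 4. Lay down the per-port systemd services and the Nginx load balancer
#    (see the earlier sketches); enabling the services means they survive
#    a machine restart.

# 5. Schedule weather model downloads, e.g. with cron or a systemd timer.

echo "Bootstrap complete"
```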


Next Steps

The next steps in this journey are still to come; however, complex monitoring, including the development of telemetry dashboards, job recording, and the introduction of a comprehensive weather archive, are all within scope and should be online in the coming months. We've also not touched on cybersecurity in this example, and the diagrams provided hint at only the basics covered for a basic install; additional measures should be put in place.


Please reach out in the comments if there’s anything more you would like to hear about or would like guidance on.

James Hilton

Principal Software Developer at Covey Associates/Fire Intel

11 months ago

It's fantastic to see Spark being configured, tested and put through its paces like this, especially the containerisation and GPU parallel instances. On the CPU/GPU thing - we found that the default CPU library POCL is around 4x slower than Intel CPU library. Unfortunately there are some issues with the current Intel driver so it can't really be used at the moment. GPUs, as found here, are way faster if available. It's a real shame OpenCL support has fallen behind. Also wanted to mention Geostack is open-source (thanks to CSIRO) and can provide GPU acceleration to loads of different geospatial operations, including the fire simulations and processing used in Spark.

Nikhil Garg

Researcher

11 months ago

Nice write up Simon.
