Keeping Databases In Sync via Python

When I first started at my new job, my mind was blown by the complexity of the process by which all the database data was moved around and kept in sync. There were 13 databases in the production environment alone and another half dozen in the test environment, for starters. Data was moved between all of these databases through 12 batch server tasks, 32 configuration files, 24 batch files, and 13 Python scripts. When it was broached to me that I would be moving all of our data into a new database design, I knew I needed to do something, and quickly, to wrangle all of these data movements into something simple and easy to manage.

This is an example of an old configuration file that was used to shift data around the enterprise. This configuration file was consumed by a Python script that did not use any of the standard ArcPy libraries. Instead, it used an open source Python library for which, at the time I started writing the new system, I could not find updates corresponding to the most current version of Esri's software offerings. In all honesty, I have an aversion to home-brewed libraries touching any database running the Esri Spatial Data Engine on the back end; there are too many potential issues with non-SDE-compliant methods being used, and too much impending fallout to deal with. I know there are some great libraries out there, but I have always felt that Esri adds too many variables into the mix to ever be sure.

The old configuration file used a single table to determine when and how to publish data. It published data on a daily, weekly, monthly, quarterly, and yearly schedule, and it could handle feature classes, tables, views, and so on; pretty much everything you would normally expect.

This method, however, led to a complication beyond my aversion to home-brewed libraries: time. During off-service-level-agreement hours, the database and other infrastructure needed time to perform daily maintenance. Run times for publishing typically ranged from 20 minutes to 97 minutes for just one of the data movements. A layer would get republished regardless of actual need, and when you are potentially moving 400+ layers to other databases, the time required becomes an ever-expanding window. How the data actually changed in the main editing database was rather different from how the publishing script behaved, though: of the data being published on the intervals mentioned above, only 173 layers actually needed to be published between 1/9/2017 and 6/12/2018.

When I started developing the new process, the areas to address included finding a way to determine whether a layer actually needed to be updated, removing the reliance on XML configuration files, removing (or rather reducing) the developer-heavy dependencies required to resolve issues, removing the reliance on third-party Python libraries for basic functions, and simply making the whole thing nearly bulletproof. Key to the design was placing the workload on a simple Python script, three views, a publishing control table, and a table of connections used by ArcPy to make the connections. From there, I got to work.
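To make the pieces concrete, here is a purely illustrative sketch of how a publishing control table along these lines could be created with standard arcpy tools. The table name, field names, and connection path are my assumptions for the example, not the production schema.

```python
# Illustrative only: create a hypothetical LAYER_SYNC_CONTROL table with
# standard arcpy geoprocessing tools. All names and paths are assumptions.
import os
import arcpy

SDE = r"C:\connections\gisprod.sde"  # hypothetical SDE connection file

arcpy.CreateTable_management(SDE, "LAYER_SYNC_CONTROL")
control_table = os.path.join(SDE, "LAYER_SYNC_CONTROL")

# One row per managed layer: where it comes from, whether it detours through a
# scratch database for reprojection, and where it finally lands.
fields = [
    ("LAYER_NAME", "TEXT", 128),
    ("SOURCE_DB", "TEXT", 255),
    ("USE_SCRATCH", "TEXT", 1),    # simple Y / N flag
    ("SCRATCH_DB", "TEXT", 255),
    ("TARGET_DB", "TEXT", 255),
    ("TARGET_SRID", "LONG", None),
]
for name, field_type, length in fields:
    arcpy.AddField_management(control_table, name, field_type, field_length=length)
```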

This simple diagram generally shows how the entire process works. The Python script queries a single view in a database designed to answer a single question: "What changed?" If no changes are recorded, the script terminates for the day and runs again the following day. If there are changes, the script then queries a table called layer_sync_control.
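As a minimal sketch of that first step (assuming a view and connection file named as below, which are not the article's actual names), the daily run can bail out immediately when the view returns no rows:

```python
# Minimal sketch: ask the "what changed?" view whether anything needs publishing.
# The connection file and view name are assumptions for illustration.
import sys
import arcpy

CHANGES_VIEW = r"C:\connections\gisprod.sde\GIS.DBO.VW_LAYER_CHANGES"

def get_changed_layers():
    """Return the layer names the view reports as changed since the last run."""
    with arcpy.da.SearchCursor(CHANGES_VIEW, ["LAYER_NAME"]) as cursor:
        return [layer_name for (layer_name,) in cursor]

if __name__ == "__main__":
    changed = get_changed_layers()
    if not changed:
        print("No changes recorded; nothing to publish today.")
        sys.exit(0)
    print("Layers needing publication: " + ", ".join(changed))
```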

Via simple Y or N flags, layer_sync_control determines how a layer moves through the script. In the example shown, it identifies which database is the source, whether a scratch database is required, and where the final resting place will be, along with its final projection. If no reprojection is required, we simply leave the scratch database fields NULL and the scratch database is not used.
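A hedged sketch of that lookup might look like the following; the field names (SOURCE_DB, USE_SCRATCH, SCRATCH_DB, TARGET_DB, TARGET_SRID) are the same illustrative assumptions used above, not the real schema.

```python
# Read one layer's row from the control table and turn the Y/N flags and NULLs
# into a routing decision. Table path and field names are assumptions.
import arcpy

SYNC_CONTROL = r"C:\connections\gisprod.sde\GIS.DBO.LAYER_SYNC_CONTROL"

def get_sync_rule(layer_name):
    """Return the control row for one layer as a dict, or None if it is not managed."""
    fields = ["LAYER_NAME", "SOURCE_DB", "USE_SCRATCH", "SCRATCH_DB", "TARGET_DB", "TARGET_SRID"]
    where = "LAYER_NAME = '{0}'".format(layer_name)
    with arcpy.da.SearchCursor(SYNC_CONTROL, fields, where_clause=where) as cursor:
        for row in cursor:
            rule = dict(zip(fields, row))
            # NULL scratch fields mean no reprojection step is required.
            rule["needs_reprojection"] = (rule["USE_SCRATCH"] == "Y"
                                          and rule["SCRATCH_DB"] is not None)
            return rule
    return None
```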

If it is determined that the updated layer does indeed need to be published, the Python script then queries the destination database connections and runs either a straight copy or a copy, reproject, and rename process.
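Building on the control-row dict from the sketch above, the publish step could branch like this; again, the paths, field names, and helper are assumptions, with any renaming falling out of simply writing the output under the target path:

```python
# Copy a layer straight through, or project it into a scratch workspace first,
# based on the (assumed) control-row fields. Not the original production code.
import os
import arcpy

def publish_layer(layer_name, rule):
    """Copy one layer to its destination, reprojecting via scratch if flagged."""
    source_fc = os.path.join(rule["SOURCE_DB"], layer_name)
    target_fc = os.path.join(rule["TARGET_DB"], layer_name)

    if rule.get("needs_reprojection"):
        # Reproject into the scratch database, then copy that result onward.
        scratch_fc = os.path.join(rule["SCRATCH_DB"], layer_name)
        target_sr = arcpy.SpatialReference(int(rule["TARGET_SRID"]))
        if arcpy.Exists(scratch_fc):
            arcpy.Delete_management(scratch_fc)
        arcpy.Project_management(source_fc, scratch_fc, target_sr)
        source_fc = scratch_fc

    # Copy_management requires that the output not already exist.
    if arcpy.Exists(target_fc):
        arcpy.Delete_management(target_fc)
    arcpy.Copy_management(source_fc, target_fc)
```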

Outside of the two tables, here is the only other configuration required to run this script.
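In a setup like this, that remaining configuration can be as small as a handful of constants at the top of the script; the values below are stand-ins, not the article's actual paths or names:

```python
# Illustrative script-level configuration: where the .sde connection files live,
# where to log, and the names of the view and control table. All values are
# assumptions for the sake of the example.
import os

CONNECTION_FOLDER = r"C:\connections"             # .sde connection files
LOG_FILE = r"C:\logs\layer_sync.log"              # daily run log
CHANGES_VIEW_NAME = "GIS.DBO.VW_LAYER_CHANGES"    # the "what changed?" view
SYNC_CONTROL_NAME = "GIS.DBO.LAYER_SYNC_CONTROL"  # the publishing control table

CHANGES_VIEW = os.path.join(CONNECTION_FOLDER, "gisprod.sde", CHANGES_VIEW_NAME)
SYNC_CONTROL = os.path.join(CONNECTION_FOLDER, "gisprod.sde", SYNC_CONTROL_NAME)
```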

The script itself is fairly self-explanatory. In general, the biggest issue anyone would have running it is database locking, because it uses arcpy.Copy_management(input_connection, output_connection) to make all the necessary changes. However, with some small retooling, it would not take much to swap that out for a truncate-and-append method when the feature class being copied is identical in configuration to the target.
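For reference, that truncate-and-append variant is a small change; arcpy.TruncateTable_management and arcpy.Append_management are standard geoprocessing tools, though the inputs here are, as before, only placeholders:

```python
# Sketch of the truncate-and-append alternative: avoids dropping and recreating
# the target feature class, so only its rows are replaced.
import arcpy

def refresh_in_place(source_fc, target_fc):
    """Replace the target's rows without dropping and recreating the feature class."""
    arcpy.TruncateTable_management(target_fc)
    # "NO_TEST" skips field-mapping checks; only safe when the schemas match exactly.
    arcpy.Append_management(source_fc, target_fc, "NO_TEST")
```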

All in all, this was an interesting experiment that turned out to be a really good method of keeping data in sync. I foresee needing to make a few tweaks before long, but it is working like a champ and keeps total run times low. The script is currently managing the synchronization of data between 4 databases with a total average run time of about 45 minutes. Once we drop the 2 older databases, 1 of which requires a different projection than the other, the time will drop to about 25 minutes per day. Not too shabby.
