Keeping Databases In Sync via Python

When I first started at my new job, my mind was blown by the complexity of the process by which all the database data was moved around and kept in sync. There were 13 databases in the production environment alone and another half dozen in the test environment, for starters. Data was moved between all of these databases through 12 batch server tasks, 32 configuration files, 24 batch files, and 13 Python scripts. When it was broached to me that I would be moving all of our data into a new database design, I knew I needed to do something, and quickly, to wrangle all of these data movements into something simple and easy to manage.

This is an example of an old configuration file that was used to shift data around the enterprise. This configuration file was consumed by a Python script that did not use any of the standard ArcPy libraries. Instead, it used an open source Python library for which, at the time I started writing the new system, I could not find updates corresponding to the most current version of Esri's software offerings. In all honesty, I have an aversion to home-brewed libraries touching any database running the Esri Spatial Data Engine on the back end; there are too many potential issues with non-SDE-compliant methods being used, and too much impending fallout to deal with. I know there are some great libraries out there, but I have always felt that Esri adds too many variables into the mix to ever be sure.

The old configuration file used a single table to determine when and how to publish data. It published data on a daily, weekly, monthly, quarterly, and yearly schedule, and it could handle feature classes, tables, views, and so on; pretty much everything you would normally expect.

This method, however, led to a complication beyond my aversion to home-brewed libraries: time. During off-service-level-agreement hours, the database and other infrastructure needed time to perform daily maintenance. Run times for publishing typically ranged from 20 minutes to 97 minutes for just one of the data movements. A layer would get republished regardless of actual need, and when you are potentially moving 400+ layers to other databases, the time required becomes an ever-expanding window. How the data actually changed in the main editing database was rather different from how the publishing script behaved, though: of the data being published on the intervals mentioned above, only 173 layers actually needed to be published between 1/9/2017 and 6/12/2018.

When I started developing the new process, the areas to address included finding a way to determine whether a layer actually needed to be updated, removing the reliance on XML configuration files, removing (or rather reducing) the developer-heavy dependencies required to resolve issues, removing the reliance on third-party Python libraries for basic functions, and simply making the whole thing nearly bulletproof. Key to the design was placing the workload on a simple Python script, three views, a publishing control table, and a table of connections used by ArcPy to make the connections. From there, I got to work.
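To make the pieces concrete, here is a purely illustrative sketch of how a publishing control table along these lines could be created with standard arcpy tools. The table name, field names, and connection path are my assumptions for the example, not the production schema.

```python
# Illustrative only: create a hypothetical LAYER_SYNC_CONTROL table with
# standard arcpy geoprocessing tools. All names and paths are assumptions.
import os
import arcpy

SDE = r"C:\connections\gisprod.sde"  # hypothetical SDE connection file

arcpy.CreateTable_management(SDE, "LAYER_SYNC_CONTROL")
control_table = os.path.join(SDE, "LAYER_SYNC_CONTROL")

# One row per managed layer: where it comes from, whether it detours through a
# scratch database for reprojection, and where it finally lands.
fields = [
    ("LAYER_NAME", "TEXT", 128),
    ("SOURCE_DB", "TEXT", 255),
    ("USE_SCRATCH", "TEXT", 1),    # simple Y / N flag
    ("SCRATCH_DB", "TEXT", 255),
    ("TARGET_DB", "TEXT", 255),
    ("TARGET_SRID", "LONG", None),
]
for name, field_type, length in fields:
    arcpy.AddField_management(control_table, name, field_type, field_length=length)
```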

This simple diagram generally shows how the entire process works. The Python script queries a single view in a database designed to answer a single question: "What changed?" If no changes are recorded, the script terminates for the day and runs again the following day. If there are changes, the script then queries a table called layer_sync_control.
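As a minimal sketch of that first step (assuming a view and connection file named as below, which are not the article's actual names), the daily run can bail out immediately when the view returns no rows:

```python
# Minimal sketch: ask the "what changed?" view whether anything needs publishing.
# The connection file and view name are assumptions for illustration.
import sys
import arcpy

CHANGES_VIEW = r"C:\connections\gisprod.sde\GIS.DBO.VW_LAYER_CHANGES"

def get_changed_layers():
    """Return the layer names the view reports as changed since the last run."""
    with arcpy.da.SearchCursor(CHANGES_VIEW, ["LAYER_NAME"]) as cursor:
        return [layer_name for (layer_name,) in cursor]

if __name__ == "__main__":
    changed = get_changed_layers()
    if not changed:
        print("No changes recorded; nothing to publish today.")
        sys.exit(0)
    print("Layers needing publication: " + ", ".join(changed))
```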

Via simple Y or N flags, layer_sync_control determines how a layer moves through the script. In the example shown, it identifies which database is the source, whether a scratch database is required, and where the final resting place will be, along with its final projection. If no reprojection is required, we simply leave the scratch database fields NULL and the scratch database is not used.
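A hedged sketch of that lookup might look like the following; the field names (SOURCE_DB, USE_SCRATCH, SCRATCH_DB, TARGET_DB, TARGET_SRID) are the same illustrative assumptions used above, not the real schema.

```python
# Read one layer's row from the control table and turn the Y/N flags and NULLs
# into a routing decision. Table path and field names are assumptions.
import arcpy

SYNC_CONTROL = r"C:\connections\gisprod.sde\GIS.DBO.LAYER_SYNC_CONTROL"

def get_sync_rule(layer_name):
    """Return the control row for one layer as a dict, or None if it is not managed."""
    fields = ["LAYER_NAME", "SOURCE_DB", "USE_SCRATCH", "SCRATCH_DB", "TARGET_DB", "TARGET_SRID"]
    where = "LAYER_NAME = '{0}'".format(layer_name)
    with arcpy.da.SearchCursor(SYNC_CONTROL, fields, where_clause=where) as cursor:
        for row in cursor:
            rule = dict(zip(fields, row))
            # NULL scratch fields mean no reprojection step is required.
            rule["needs_reprojection"] = (rule["USE_SCRATCH"] == "Y"
                                          and rule["SCRATCH_DB"] is not None)
            return rule
    return None
```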

If it is determined that the updated layer does indeed need to be published, the Python script then queries the destination database connections and runs either a straight copy or a copy, reproject, and rename process.
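Building on the control-row dict from the sketch above, the publish step could branch like this; again, the paths, field names, and helper are assumptions, with any renaming falling out of simply writing the output under the target path:

```python
# Copy a layer straight through, or project it into a scratch workspace first,
# based on the (assumed) control-row fields. Not the original production code.
import os
import arcpy

def publish_layer(layer_name, rule):
    """Copy one layer to its destination, reprojecting via scratch if flagged."""
    source_fc = os.path.join(rule["SOURCE_DB"], layer_name)
    target_fc = os.path.join(rule["TARGET_DB"], layer_name)

    if rule.get("needs_reprojection"):
        # Reproject into the scratch database, then copy that result onward.
        scratch_fc = os.path.join(rule["SCRATCH_DB"], layer_name)
        target_sr = arcpy.SpatialReference(int(rule["TARGET_SRID"]))
        if arcpy.Exists(scratch_fc):
            arcpy.Delete_management(scratch_fc)
        arcpy.Project_management(source_fc, scratch_fc, target_sr)
        source_fc = scratch_fc

    # Copy_management requires that the output not already exist.
    if arcpy.Exists(target_fc):
        arcpy.Delete_management(target_fc)
    arcpy.Copy_management(source_fc, target_fc)
```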

Outside of the two tables, here is the only other configuration required to run this script.
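In a setup like this, that remaining configuration can be as small as a handful of constants at the top of the script; the values below are stand-ins, not the article's actual paths or names:

```python
# Illustrative script-level configuration: where the .sde connection files live,
# where to log, and the names of the view and control table. All values are
# assumptions for the sake of the example.
import os

CONNECTION_FOLDER = r"C:\connections"             # .sde connection files
LOG_FILE = r"C:\logs\layer_sync.log"              # daily run log
CHANGES_VIEW_NAME = "GIS.DBO.VW_LAYER_CHANGES"    # the "what changed?" view
SYNC_CONTROL_NAME = "GIS.DBO.LAYER_SYNC_CONTROL"  # the publishing control table

CHANGES_VIEW = os.path.join(CONNECTION_FOLDER, "gisprod.sde", CHANGES_VIEW_NAME)
SYNC_CONTROL = os.path.join(CONNECTION_FOLDER, "gisprod.sde", SYNC_CONTROL_NAME)
```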

The script itself is fairly self-explanatory. In general, the biggest issue anyone would have running it is database locking, because it uses arcpy.Copy_management(input_connection, output_connection) to make all the necessary changes. However, with some small retooling, it would not take much to swap that out for a truncate-and-append method when the feature class being copied is identical in configuration to the target.
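For reference, that truncate-and-append variant is a small change; arcpy.TruncateTable_management and arcpy.Append_management are standard geoprocessing tools, though the inputs here are, as before, only placeholders:

```python
# Sketch of the truncate-and-append alternative: avoids dropping and recreating
# the target feature class, so only its rows are replaced.
import arcpy

def refresh_in_place(source_fc, target_fc):
    """Replace the target's rows without dropping and recreating the feature class."""
    arcpy.TruncateTable_management(target_fc)
    # "NO_TEST" skips field-mapping checks; only safe when the schemas match exactly.
    arcpy.Append_management(source_fc, target_fc, "NO_TEST")
```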

All in all, this was an interesting experiment that turned out to be a really good method of keeping data in sync. I foresee needing to make a few tweaks before long, but it is working like a champ and keeps total run times low. The script is currently managing the synchronization of data between 4 databases with a total average run time of about 45 minutes. Once we drop the 2 older databases, 1 of which requires a different projection than the other, the time will drop to about 25 minutes per day. Not too shabby.
