Course: Data Pipeline Automation with GitHub Actions Using R and Python
Data backfilling
- [Instructor] So far, we have reviewed the different components of the data pipeline. In this video, we'll review the data backfilling process. Let's first define what a data backfill is and why we need it. A data backfill is typically defined as the initial loading of the historical data of the dataset, which in our case means loading all the historical data of the four sub-region series. As we have close to six years of hourly data, this is a pull of about 50,000 observations per series, or an overall pull of about 200,000 data points. For comparison, the regular refresh process loads about 24 observations per call if the refresh process runs daily. This means that the magnitude of the data load of the backfill is more than 2,000 times bigger than that of the regular refresh process. And this is also why you would typically prefer to run the backfill process locally and not on the server. The backfill process follows fairly similar steps to the data refresh process we saw earlier. The main difference is that…
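To make the scale comparison concrete, here is a minimal Python sketch of what a backfill loop like the one described could look like. Everything in it is illustrative, not the course's actual code: the `get_series` helper stands in for whatever API call the pipeline uses, and the series names, date range, chunk size, and output path are assumptions. Real APIs usually cap how many rows one request can return, which is why the sketch pulls the history in monthly chunks rather than in a single call.

```python
from datetime import datetime, timedelta
import pandas as pd

def get_series(series_id: str, start: datetime, end: datetime) -> pd.DataFrame:
    """Pull hourly observations for one series over [start, end).

    Placeholder: in a real pipeline this would be a (paginated) API request.
    Here we synthesize one row per hour so the sketch runs end to end.
    """
    idx = pd.date_range(start, end, freq="h", inclusive="left")
    return pd.DataFrame({"series": series_id, "time": idx, "value": 0.0})

# Illustrative names and range: four sub-region series, ~6 years of history.
SERIES = ["sub_region_1", "sub_region_2", "sub_region_3", "sub_region_4"]
START = datetime(2019, 1, 1)
END = datetime(2025, 1, 1)
CHUNK = timedelta(days=30)  # pull monthly chunks to respect API page limits

frames = []
for series_id in SERIES:
    cursor = START
    while cursor < END:
        chunk_end = min(cursor + CHUNK, END)
        frames.append(get_series(series_id, cursor, chunk_end))
        cursor = chunk_end

backfill = pd.concat(frames, ignore_index=True)

# Sanity check on the numbers from the video: ~6 years of hourly data is
# roughly 8,760 hours/year * 6 years ~= 52,000 rows per series, so about
# 210,000 rows across the four series -- versus ~24 rows per series per
# daily refresh run, a factor of roughly 2,000.
print(len(backfill))
backfill.to_csv("data/backfill.csv", index=False)
```

Because this one-time load is orders of magnitude heavier than the daily refresh, it is the kind of job you would run once from a local machine rather than inside a scheduled GitHub Actions workflow, which is the point the instructor makes above.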