Side Project - Staging view: Cheaper is Better





Here we are again, trying to scrape together every possible Argentine peso.

Last week I described how, to move data out of the "bronze" layer (a.k.a. a bunch of JSON files generated every 30 seconds), I had used a Durable Azure Function with a timer trigger. Once a day, it would load the ~0.14 GB into a staging layer within the data lake in a more compact format. This move was convenient for two reasons:

  • First, the goal of this whole experiment is to move the data and make it available at the lowest possible cost, which means cutting storage costs as much as possible. The cost is low either way, but there's a significant difference between storing 0.14 GB per day and 0.4 MB per day after compaction (roughly 350 times less).
  • Second, I have a long-forgotten Raspberry Pi that, although it's currently acting as compute for the data extraction, still has capacity to run other processes. Furthermore, if I containerize both the extractor and the staging application, I can deploy them on other services if needed.
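To give an idea of what the compaction step can look like, here is a minimal sketch using only the Python standard library. It is not the actual staging job: the post doesn't name the compact format, so gzip-compressed JSON Lines is an assumption, and the function name `compact_json_files` and any paths passed to it are hypothetical.

```python
import gzip
import json
from pathlib import Path


def compact_json_files(source_dir: Path, target_file: Path) -> int:
    """Merge every *.json file in source_dir into one gzip-compressed
    JSON Lines file and return the number of records written.

    Assumption: each source file holds exactly one JSON document.
    """
    count = 0
    # Create the staging folder if it does not exist yet.
    target_file.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(target_file, "wt", encoding="utf-8") as out:
        # Sort by file name so the output order is deterministic.
        for path in sorted(source_dir.glob("*.json")):
            with path.open(encoding="utf-8") as src:
                record = json.load(src)
            # Compact separators drop the whitespace between tokens.
            out.write(json.dumps(record, separators=(",", ":")) + "\n")
            count += 1
    return count
```

Calling it once a day with that day's bronze folder and a dated target file would leave one small file per day in staging. How much space it saves depends entirely on the data, so the 0.4 MB figure above should not be expected from this sketch.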




Using a Raspberry Pi directly for compute involves much the same considerations as using a VM. So, to make use of it, I run the applications as Podman containers and schedule them with cron.
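As a rough illustration of how Podman and cron fit together, the sketch below builds the crontab line that would launch a containerized job. The image name, host path, mount point, and schedule are all hypothetical and do not come from the real setup. Only `podman run`, `--rm`, and `-v` are real Podman options.

```python
import shlex


def podman_command(image: str, host_dir: str, container_dir: str = "/data") -> list[str]:
    """Build a `podman run` invocation that mounts host_dir into the container.

    --rm removes the container once the job exits, so daily runs do not pile up.
    """
    return ["podman", "run", "--rm", "-v", f"{host_dir}:{container_dir}", image]


def crontab_line(schedule: str, command: list[str]) -> str:
    """Join a five-field cron schedule and a command into one crontab entry."""
    # shlex.join quotes any argument that needs it for the shell.
    return f"{schedule} {shlex.join(command)}"


# Hypothetical example: run the staging job every day at 03:00.
line = crontab_line(
    "0 3 * * *",
    podman_command("localhost/staging:latest", "/home/pi/lake"),
)
print(line)
```

The printed line can be pasted into `crontab -e` on the Pi. The extractor would get its own entry with its own schedule.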



As a result, while execution times didn't improve and costs didn't drop significantly, we made better use of the compute resources we already had available, and we had more fun tinkering with the Raspberry Pi (which is always a good time).

Now, with the staging layer consolidating at a rate of 0.4 MB per day, we can continue modeling on top of it with the idea of actually putting the data to use, rather than just watching it grow day by day.







Steven Moore

Enterprise Data Solutions: Business Intelligence and Analytics | Microsoft Azure Data | Microsoft Fabric & Power BI | CCH Tagetik | ERP and CPM


That's awesome, Ignacio. I also agree. It's fun to get hours-deep into a fun project, fix an issue, build something great, or just try something different.
