DATA Pill #033 - 4 ways to optimize BigQuery, 30 data models in DBT, 4 enablers of being data-driven, and a look back at the 2022 predictions
Hi,
Holiday time is almost over but I hope you will find some time to read the next Data Pill!
We managed to find some really “meaty” content for you. Enjoy!
ARTICLES
What I Got Wrong: Looking Back at My 2022 Predictions for the Modern Data Stack | 14 min | Modern Data Stack | Prukalpa | Personal Blog
Before the predictions for 2023, let’s step back and revisit the six major 2022 trends that Prukalpa wrote about at the beginning of this year. What did we get right? What didn’t quite go as expected? What did we completely miss? Read more about where we started and where we are now with:
1. Data Mesh
2. Metrics Layer
3. Reverse ETL
4. Active Metadata & Third-Gen Data Catalogs
5. Data Teams as Product Teams
6. Data Observability
T-Mobile Supports 5G Rollout with Azure Synapse Analytics and Power BI | 7 min | Data | Microsoft Blog
A short story about building a nationwide 5G network. Read how T-Mobile, which uses Power BI, built a centralized source of data while maintaining high levels of performance and functionality, using a data lakehouse supported by Microsoft Azure Data Factory, Azure Synapse Analytics and Azure Databricks.
Migrating over 30 data models from plain SQL to DBT in just 5 days | 4 min | dbt | Ramtin Javanmardi | Mentimeter Blog
This post explains why the company felt compelled to migrate over 30 of its models and sunset the old ones. The migration was done in three distinct steps, each described in the post. A great example of how big migrations of business-critical models do not have to be boring or feel stressful.
AWS Disaster Recovery Strategies – PoC with Terraform | 10 min | AWS | Martin Perez Rodrigues | Xebia Blog
In this article you can explore a proof of concept written in Terraform in which, for example, the authors create the front-end layer of a three-tier architecture.
How Einride is taking road freight to new places—on the cloud and on the road | 12 min | Google Cloud | Matt Chaban | Google Cloud Blog
Einride is rethinking every piece of the freight system, from trailers to local deliveries to the remote and autonomous platforms to operate them. If you want to check how they plan to create a sustainable, resilient delivery network using AI and tech, read this blog post.
Improving Video Voice Dubbing Through Deep Learning | 12 min | TensorFlow | Paul McCartney, Vivek Kwatra, Yu Zhang, Brian Colonna, Mor Miller | Google Developers
Did you know that most of the videos on YouTube are in English, but less than 20% of the world’s population speaks English as a first or second language? This is why voice dubbing is increasingly used to translate videos into other languages. In this blog post you can read about research on improving voice dubbing quality using deep learning.
TUTORIALS
Meshing MLOPS on Azure with MLFlow | 6 min | MLOps | Keshav Singh | Personal Blog
In this blog post Keshav establishes the ML life cycle using MLFlow, an open source machine learning platform and framework for managing the ML life cycle. It is a short, hands-on demonstration of MLOps standardization on a Mesh Platform.
4 ways to optimize your BigQuery tables for faster queries | 15 min | BigQuery | Kelvin Gakuo | Airbyte Blog
This step-by-step tutorial explores design patterns for your BigQuery storage that you can use to speed up your queries. To optimize your workloads on BigQuery, you can tune your storage by:
1. Partitioning your tables.
2. Clustering your tables.
3. Pre-aggregating your data into materialized views.
4. Denormalizing your data.
In this blog post you will also read about BigQuery storage and compute costs, how to investigate BigQuery performance issues, and more.
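To make the four techniques a bit more concrete, here is a minimal sketch in Python that builds one illustrative BigQuery DDL statement per technique. It is not taken from the tutorial: the dataset, table and column names (`my_dataset`, `events`, `event_ts`, `customer_id`, `amount`, `orders`, `order_items`) and the helper `storage_optimization_ddl` are all hypothetical, and the statements are only printed, never sent to BigQuery.

```python
def storage_optimization_ddl(dataset: str = "my_dataset") -> dict[str, str]:
    """Return one illustrative BigQuery DDL statement per technique.

    All table and column names are made up for this example.
    """
    return {
        # 1. Partitioning: split the table by day so queries that filter
        #    on the date only scan the matching partitions.
        "partitioning": (
            f"CREATE TABLE {dataset}.events_partitioned\n"
            "PARTITION BY DATE(event_ts)\n"
            f"AS SELECT * FROM {dataset}.events"
        ),
        # 2. Clustering: co-locate rows with the same customer_id inside
        #    each partition so filters on that column read fewer blocks.
        "clustering": (
            f"CREATE TABLE {dataset}.events_clustered\n"
            "PARTITION BY DATE(event_ts)\n"
            "CLUSTER BY customer_id\n"
            f"AS SELECT * FROM {dataset}.events"
        ),
        # 3. Pre-aggregation: keep daily totals in a materialized view.
        "materialized_view": (
            f"CREATE MATERIALIZED VIEW {dataset}.daily_totals AS\n"
            "SELECT DATE(event_ts) AS event_date, customer_id,\n"
            "       SUM(amount) AS total_amount\n"
            f"FROM {dataset}.events\n"
            "GROUP BY event_date, customer_id"
        ),
        # 4. Denormalizing: nest order items inside each order row as an
        #    array of structs instead of joining at query time.
        "denormalizing": (
            f"CREATE TABLE {dataset}.orders_denormalized AS\n"
            "SELECT o.order_id, o.customer_id,\n"
            "       ARRAY_AGG(STRUCT(i.sku, i.quantity)) AS items\n"
            f"FROM {dataset}.orders AS o\n"
            f"JOIN {dataset}.order_items AS i ON i.order_id = o.order_id\n"
            "GROUP BY o.order_id, o.customer_id"
        ),
    }


if __name__ == "__main__":
    for technique, statement in storage_optimization_ddl().items():
        print(f"-- {technique}\n{statement};\n")
```

Running the script prints the four statements, which you could adapt to your own schema and paste into the BigQuery console. Which technique pays off depends on your query patterns, so treat these as starting points rather than recommendations.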
NEWS
Snowflake introduces Add-On for Microsoft Visual Studio | 2 min | Snowflake | Christian Lauer | Snowflake
The add-on gives developers access to Snowflake from within VS Code. The extension connects the user to Snowflake and lets them write and execute SQL queries and see the results without ever leaving VS Code. After signing in, users can see and change their active database, schema, role and warehouse.
Grafana Releases New Frontend Observability SDK and Backend Profiling Database | 6 min | Grafana | Matt Campbell | InfoQ
Grafana recently announced two new additions to its suite of observability and monitoring tools: a frontend observability SDK and a backend profiling database.
Debezium 2.1.0.Final Released | 5 min | Database | Jiri Pechanec | Debezium Blog
You might have noticed that Debezium went a bit silent for the last few weeks. No, we are not going away. In fact, the elves at Google worked furiously to bring you a present under the Christmas tree: the Debezium Spanner connector.
PODCAST
Update your model’s view of the world in Real Time with streaming Machine Learning using River | 1 h 16 min | ML | The Python Podcast.__init__
River is a framework for building streaming machine learning projects that can constantly adapt to new information. Listen to the podcast episode, where Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.
Top 6 Worst Apache Kafka JIRA Bugs | 1 h 10 min | guest: Anna McDonald | Confluent
After listening to this episode you will know the details of how batching works, the replication protocol, how Kafka’s networking stack dances with Linux’s, and which Scala class is the most important to read if you’re only going to read one.
Anna gives Kris the details about the bugs that she found and about some of the scariest, most surprising and most enlightening corner cases.
DATA TUBE
Customer showcase: Miro (hosted by dbt Labs) | 60 min | Modern Data Stack | dbt Labs
In this video, Felipe Leite and Stephen Pastan from Miro unpack their shift to a Modern Data Stack and share the vital technical changes they made to build a scalable and tech-forward data stack. Watch this to discover how to efficiently scale your analytics stack when your data and data team grow 10x in 2 years, and how to prioritize what gets done when there's that much growth.
CONFS, EVENTS AND MEETUPS
Near Real-Time Anomaly Detection With Delta Live Tables and Databricks Machine Learning | 9 January 2023 at 9am GMT; 10am CET | Live webinar
Join the webinar featuring Achraf Hamid, Data Scientist at Mailinblack, who will explore the importance of anomaly detection for businesses. The session will also examine how to solve common anomaly challenges, and achieve a near real-time anomaly detection system using the Databricks Lakehouse Platform.
________________________
Have any interesting content to share in the DATA Pill newsletter?
Join us on GitHub
Adam from GetInData | Part of Xebia