DATA Pill #019 - GCP, dbt, AWS and Sociopaths in the Modern Data Stack

Hello!

Before we begin, we have a huge announcement to make!

Our community has become a partner of the biggest Big Data Conference in Northern Poland:

DataMass Summit.

This is another step in our project towards knowledge-sharing, so we are kind of… proud ;)

And for this occasion, we have a 10% discount code (ON code, no No-code ;)):

DATAPILL10

More about DataMass in the conference section.

But before we get to that, a looooot about dbt, GCP and “The Office” meme.


ARTICLES

The Modern Data Stack Through ‘The Gervais Principle’ | 17 min read | Data Flow | Lauren Balik from Upright Analytics on Medium

This one gets off to an interesting start: Data doesn’t move left-to-right in an organization, it moves through Losers, the Clueless and Sociopaths.

What if we looked at data flow in terms of the pathological nature of organizations on a vertical axis, not a horizontal one?


Orchestrating dbt Google Cloud PART 2 | 10 min read | dbt & GCP | Enrique Lopez de Lara | Pythian Blog

In this article Enrique defines and demonstrates how to deploy some Google Workflows to orchestrate tasks.


End-to-End DBT project in Google Cloud Platform (Part 1) | 11 min read | dbt & GCP | Mohamed Dhaoui | Dev Genius Blog

One more series of posts - very detailed! All about running dbt projects on GCP and building a dbt-based data platform!

Part 1: Main concepts around DBT and how to organize a DBT project and run it on Google BigQuery

Part 2: How to package the DBT project and deploy it onto the Google Cloud Platform

Part 3: Gives precise details about running dbt with Workflows.


Serverless Messaging: Latency Compared | 5 min | AWS | Bite-Sized Serverless

A comparison of the AWS serverless messaging systems.

SQS Standard can deliver a message to a consumer in as little as 14 ms and is seldom slower than 100 ms, assuming low batch sizes. Kinesis with Enhanced Fan-Out is only slightly slower and allows for multiple consumers and a long history of events.

Since we're talking about AWS, here's a role in an interesting AWS project.


The Modern Metadata Platform: What, Why, and How? | 13 min read | Data Stock | Mars Lan | Metaphor Blog

Metadata management seemed to be a solved problem. With the Modern Data Platform and the democratisation of data, we let a bunch of new folks into this candy store of data, which means new challenges. Metadata started to look and smell like a Big Data problem. The proposed way to keep everything intact is a Modern Metadata Platform. Written by the authors of DataHub (now developing their own product: metaphor.io), with a nice walkthrough from the need to the solution.

By clicking MORE LINKS you will find LinkedIn, Allegro and McDonald's case studies.

{ MORE LINKS }


NEWS

Announcing Public Preview of Data Lineage in Unity Catalog | 5 min read | Data Lineage | Paul Roome, Sachin Thakur and Tao Feng | Databricks Blog

Better late than never ;) Databricks have finally announced the public preview of data lineage in Unity Catalog, available on AWS and Azure.


Announcing the GetInData Modern Data Platform - a self-service solution for Analytics Engineers | 10 min read | Data Platform | Michał Rudko | GetInData Blog

The Modern Data Platform (or Modern Data Stack) is on the lips of basically everyone in the data world right now. We have observed the need for a more self-service approach to data-driven insight development among many of our clients for some time now.

  • What’s the deal with MDP and what was the motivation to create such a platform?
  • Architecture and Data Platform framework.

{ MORE LINKS }


DATA LIBRARY

Best Resources for DevOps | 5 min read | DevOps | Java Revisited | Twitter

A collection of meaty DevOps materials, like this Road Map by Vrashabh Sontakke.


TOOLS

Modin: Scale your Pandas workflows by changing a single line of code | 10 min to dig | GitHub

Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.

By simply replacing the import statement, Modin offers users effortless speed and scale for their pandas workflows.
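To make the "single line" concrete, here is a minimal sketch of the import swap. It assumes Modin is installed together with one of its execution engines; the fallback to plain pandas and the sample data are our additions for illustration, so the snippet still runs where Modin is missing.

```python
# Assumption: Modin is installed with an execution engine (e.g. Ray or Dask).
# The fallback to plain pandas is our addition so the sketch runs without Modin.
try:
    import modin.pandas as pd  # the single changed line
except ImportError:
    import pandas as pd  # original import, used when Modin is unavailable

# Hypothetical sample data; everything below is ordinary pandas-style code
# and stays the same regardless of which import succeeded.
df = pd.DataFrame({"team": ["a", "b", "a", "b"], "events": [10, 20, 30, 40]})

# Sum the events per team and convert the result to a plain dict.
totals = df.groupby("team")["events"].sum()
result = {team: int(value) for team, value in totals.items()}
print(result)
```

On a frame this small there is nothing to gain; any speed-up should only be expected on the larger datasets mentioned above.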


DataTube

From Nothing to Something: Klarna’s Journey With Recommendation Systems | 24 min | Anil Sharma | GAIA

Klarna’s journey from zero recommendation models to a state of five use cases in one year.

A recording from the GAIA Conference 2022.



CONFS AND MEETUPS

DATAMASS SUMMIT | 29-30 September | Gdańsk

The subject of the summit: Big Data, Data Science, Machine Learning and AI, all in the context of cloud solutions.

One-day workshops, a one-day conference. A lot of case studies are planned for this event.

A few points from the agenda:

  • How to process 33bln events from set top boxes in under 4 minutes?
  • Data engineering at the scale of PepsiCo eCommerce, 3 years of experience
  • Data Platform - what does it take to be called a modern one? A new stack with well-known best practices
  • The Data Mesh concept, executed by Trino
  • Spark vs. Bigquery vs. Trino: Shopify’s journey of SQL transformations at scale

Remember the 10% discount with the code DATAPILL10!

PS, maybe there will be a chance to meet and network in our community?


___________________________


See you next week!

Adam Kawa from GetInData

PS, are there any “The Office” fans in here?
