Secret sauce in data warehousing

There is no secret sauce in data warehousing. It’s just hard work, good design and, most importantly, a good team. But this week I was chatting with my 24-year-old son about all the years I spent working in data warehousing. When he asked if there were keys to success in DW, I said “Oh yes there are!”. So here it is, the “secret sauce” in data warehousing.

1. Denormalised design
2. Preprocess the output
3. Gradual delivery
4. Automated testing
5. Packaged deployment
6. Good team

The first one I’d like to call out as the key contributing factor to the successful delivery of a data warehouse is denormalised design. What I mean by that is a star schema. Dimensional modelling. In the source systems, which are usually various different types of applications, the data is highly normalised. The source databases consist of many tables, connected to each other by primary keys and foreign keys. When querying this kind of database, we need to join many tables, so it’s not reporting friendly. This is why many companies build a data warehouse, i.e. to make it easy to report on and analyse data across several different systems/applications.

When building a data warehouse, you might be thinking of something like this: source systems feeding an Enterprise Data Warehouse, which in turn feeds the data marts that the users report from.

Forget about the Enterprise Data Warehouse for a moment, and focus on the Data Mart part. When building a data warehouse from scratch, your number one opponent is time. You only have a few months to prove that it’s worth it. Think of it from the board’s perspective: they give you several hundred thousand dollars, and within a few months they expect to see something tangible. Something that gives the business a real advantage, like ways to increase revenue, or to save costs.

On the one hand you have a few source systems with complicated tables, and on the other hand you have a group of users hungry for data. You haven’t got time to build an Enterprise Data Warehouse, which can take years.

So the trick here (or the “secret sauce”) is to take a thin horizontal slice across the board. You ask the users what the most important analysis is that they need to do right now. Like client reporting, or cost analysis, or CRM. Something really specific. Something they badly need. Then find the source data for that in the source system(s), and build just the piece of data mart and EDW needed for it, like the long red box below. That’s what you are going to build in the next few months to prove your worth.

But as I said, time is always against you. Just preparing the infrastructure alone could take you months, let alone setting up the team and gathering the requirements. At first, 6 months seems like ample time, but once you start, it’s not much at all. Infrastructure alone could take 2 months, maybe even 3. Gathering requirements could take a month, if you’re lucky. So if anyone said to you “Hey, I know a way to save us 2 months”, you would want to take that route. And that route is to skip the EDW and go straight to the Data Mart:

You still build just 1 piece of crucial functionality, but you bring the source system straight into the mart for the users to query, report on and analyse. And for this, a denormalised model is ideal. The dimensional model has been my “secret sauce” for years. Decades, even. Users understand it, because it is simple. And it speaks their language. They instinctively recognise the data model. They recognise the terms used, I mean the table names and attribute names. Whether they build the reports themselves, or we build the reports for them, they are familiar with that representation of the data.

So in that first phase, when you have to prove your worth, you build just one fact table, and the few dimensions required by that fact table. This is so, so important that I’ll write it again: just build one fact table.
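To make that concrete, here is a minimal sketch of a “one fact table plus a few dimensions” star schema. It is purely illustrative: the sales mart, table names and columns are all hypothetical (not from any project described here), and SQLite is used only so the example runs anywhere.

    # Minimal sketch of a "one fact table plus a few dimensions" star schema.
    # Everything here is hypothetical; SQLite is used purely for illustration.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimensions: descriptive, denormalised, in business language
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT, month_name TEXT, year INTEGER);
        CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT, region TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

        -- One fact table: the measures, plus foreign keys to the dimensions
        CREATE TABLE fact_sales (
            date_key     INTEGER REFERENCES dim_date(date_key),
            store_key    INTEGER REFERENCES dim_store(store_key),
            product_key  INTEGER REFERENCES dim_product(product_key),
            quantity     INTEGER,
            sales_amount REAL
        );
    """)

    # A typical business question becomes a few simple joins from fact to dimensions:
    report_sql = """
        SELECT d.month_name, s.region, SUM(f.sales_amount) AS revenue
        FROM fact_sales f
        JOIN dim_date  d ON d.date_key  = f.date_key
        JOIN dim_store s ON s.store_key = f.store_key
        GROUP BY d.month_name, s.region
    """
    print(con.execute(report_sql).fetchall())   # [] until the mart is loaded

The point is that a business question becomes one fact table joined to a handful of dimensions, in the users’ own vocabulary, instead of a 20-table join across a normalised source model.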

So that’s the first secret sauce: dimensional model.

PS. Show it to the users in Dev. Don’t bother deploying it to Test or Prod. Just show them those tables and reports in Dev. When you’re reading LinkedIn, like today, it all seems peaceful. But when you’re in the thick of it, everybody is pressuring you. Time is something you do not have. Take infrastructure out of the equation by doing it all in the Dev environment. You’ll save a lot of time. And a lot of headaches.

2. Preprocessing the output

I was working for one of the biggest banks in the world. They do investment banking, retail banking and commercial banking too. They are a huge corporation. They operate in every country in the world. I can’t name them because of a confidentiality agreement, but I can talk about the data warehousing project, as long as I keep it at a high level so that the bank can’t be identified. It was a transcontinental data warehouse team, spanning the US, Europe and Asia. The data warehouse was in Oracle, and the BI tool was Tibco Spotfire. It was an investment banking data warehouse, containing derivatives transactions. “Derivatives” are financial instruments such as options, futures and swaps.

The queries that Spotfire sent to the Oracle database took minutes to complete. Sometimes 2-3 minutes, but sometimes 15-20 minutes. They were complicated queries, and the data warehouse was not in a dimensional model. Hint, hint! See the first secret sauce above. There were a lot of joins in the queries. And it was a huge database. And not well tuned either.

….

Alas, time is always against us! I’ve got to go to work now, unfortunately, so I have to stop here. But I promise I will continue writing this in a day or two. So check back. And I have another experience to tell you about preprocessing (for a big pharma). I’ll write that one as well. I might end up writing it like this, i.e. just one or two secret sauces at a time, as it takes a lot of time to write the whole article. So please bear with me. Thank you for reading!

Right, part 2 now.

So we came up with a plan. The night before, after the daily ingestion into the warehouse, after all the data quality rules, all the transformations and data processing, we precalculated everything in those Spotfire queries. We did all the joins, all the aggregations, all the calculations that Spotfire would be using the next day, the night before. And we stored the results in a few “flattened” tables. So all those complicated joins and aggregations which took 2-3 minutes, and sometimes 10-15 minutes, had already been done, and the outputs were stored in the database. Therefore, the next day, when the users at the trading desks opened the Spotfire BI tool, the results came back within a second. Because everything had been prepared beforehand.
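The pattern is simple enough to sketch. Below is a minimal, hypothetical version of that nightly “preprocess the output” step; the trades/desks schema and the rpt_desk_exposure table are invented for illustration and are not the bank’s actual model.

    # Hypothetical nightly precalculation: do the expensive joins and aggregations
    # once, after the nightly load, and store the results in a flattened table
    # that the BI tool can read the next morning with a trivial query.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE trade   (trade_id INTEGER, desk_id INTEGER, product_id INTEGER,
                              trade_date TEXT, notional REAL);
        CREATE TABLE desk    (desk_id INTEGER, desk_name TEXT, region TEXT);
        CREATE TABLE product (product_id INTEGER, product_type TEXT);  -- option, future, swap...

        -- The flattened output table the BI tool will query the next morning
        CREATE TABLE rpt_desk_exposure (
            report_date    TEXT,
            desk_name      TEXT,
            region         TEXT,
            product_type   TEXT,
            total_notional REAL,
            trade_count    INTEGER
        );
    """)

    def precalculate(report_date: str) -> None:
        """Run the expensive joins/aggregations once, the night before."""
        con.execute("DELETE FROM rpt_desk_exposure WHERE report_date = ?", (report_date,))
        con.execute("""
            INSERT INTO rpt_desk_exposure
            SELECT ?, d.desk_name, d.region, p.product_type, SUM(t.notional), COUNT(*)
            FROM trade t
            JOIN desk d    ON d.desk_id    = t.desk_id
            JOIN product p ON p.product_id = t.product_id
            WHERE t.trade_date <= ?
            GROUP BY d.desk_name, d.region, p.product_type
        """, (report_date, report_date))
        con.commit()

    precalculate("2024-03-31")
    # Next morning the BI tool only runs:
    #   SELECT * FROM rpt_desk_exposure WHERE report_date = '2024-03-31'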

That was a big success. And we won the CIO award for it. It was a big relief for many users. Fortunately for us, we knew exactly what the BI tool wanted. But the second case was a little bit trickier. It was at a big pharma company in Europe. The data warehouse was on Netezza MPP. They used Analysis Services cubes. They had a myriad of different BI tools at the front end, as each customer had different preferences: Tableau, Strategy Companion Analyzer, Panorama, Qlikview, etc. Everything was processed on the fly, based on the customers’ requests. A customer ordered some data, then Integration Services fetched that data from the data warehouse in the Netezza database and fed it to Analysis Services, which would then create the cubes. For small cubes the process finished in minutes, the cube files were sent to the corporate customers, and they used their BI tool to visualise and interrogate the data.

But for big cubes the process could take well over an hour. And there were thousands of data variations that a customer could order, because they placed their data orders based on different parameters, such as different pathways, regional variations or time parameters. Probably even millions of data variations. So it was not possible to preprocess the data that the cubes were going to process, because there were so many different things that the customers could ask for. In the end the DW team solved the problem like this: they prepared a few “template cubes” (say 10) which between them could satisfy every customer request. And these template cubes were small enough to be processed in minutes. So if the request was data A, we’d use cube 3. If the request was data B, we’d use cube 8, and so on.

And this is the trick: the data required by each template cube was preprocessed in advance. Each template cube needs different datasets, depending on the customer order. So the night before, after all the data ingestion into the Netezza DW, all the data quality rules, all the data transformations and processing, at the end of the batch we ran the SSIS packages that output all the data each of the template cubes needed. Take template cube 5 for instance. It could need dataset 1, or dataset 2, … or dataset 12. All 12 of these datasets had been prepared and created beforehand. So when an order came in the next day, and that order used template cube 5 and dataset 11, the cube processing would finish within minutes. Because the dataset had already been prepared, and the template cube was small enough.
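A rough sketch of that routing idea is below. The order types, cube numbers and dataset names are all invented; the real mapping was far richer, but the mechanics are the same: pick a small template cube, and pair it with a dataset that was already built overnight.

    # Hypothetical sketch of the "template cube" routing: each order maps to one
    # of a handful of small template cubes, and every dataset a template cube
    # might need has already been built by the end-of-batch step the night before.
    TEMPLATE_CUBE_FOR_ORDER = {
        "sales_by_pathway": "cube_3",
        "regional_uptake":  "cube_8",
        "time_series":      "cube_5",
    }

    PREBUILT_DATASETS = {  # produced overnight by the end-of-batch packages
        "cube_3": ["dataset_01", "dataset_02"],
        "cube_5": [f"dataset_{i:02d}" for i in range(1, 13)],  # datasets 01..12
        "cube_8": ["dataset_04", "dataset_07"],
    }

    def resolve(order_type: str, dataset: str) -> str:
        cube = TEMPLATE_CUBE_FOR_ORDER[order_type]
        if dataset not in PREBUILT_DATASETS[cube]:
            raise ValueError(f"{dataset} was not prepared for {cube} in last night's batch")
        # Only the (small) cube processing is left to do at request time.
        return f"process {cube} against pre-built {dataset}"

    print(resolve("time_series", "dataset_11"))  # finishes in minutes, not hours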

Another example of preprocessing was for client reporting, at a large asset management company in Europe. They had a team of about 10 people whose job was to send reports to clients. The reports contained the client’s portfolio breakdown, such as breakdown by credit rating, by country, by maturity, and so on. The reports also contained the performance of the client’s portfolio (top level and allocation), the transactions (which securities were purchased or sold), and the risks associated with each portfolio, such as credit risk, etc.

This time the problem was not performance. It was accuracy, consistency and control over the data. We used specialised software (which I can’t name due to confidentiality) to produce thousands of client reports automatically from the data warehouse. The data warehouse was on SQL Server. It contained performance data, valuation data, risk data, and tons of reference data such as clients, countries, currencies, portfolios, benchmarks, securities (as in financial instruments, including derivatives), maturity bands, credit ratings (from S&P, Moody’s and Fitch), prices, transactions, and a lot of other data.

It was a good thing that all that data was in one data warehouse, and could be queried within seconds to satisfy thousands of client reports. Some of the client reports were monthly, some quarterly, some twice a year and some annual. Let’s take one report as an example. It’s for a client called Pension Fund A, and it’s quarterly. Say today is the end of the quarter, 31st March, and the report is due in 8 days, on the 8th of April. Tomorrow, on 1st April, that specialised software (let’s call it CR, for Client Reports) will automatically run, querying the 30 or so different datasets that this report needs from the data warehouse, and produce a draft report. The CR tool would then mark it as “Stage 1 complete”, meaning all the required data has been obtained.

Sometimes stage 1 wouldn’t be complete until day 3 or even day 4. Yes, it would take only seconds for the tool to query each dataset from the warehouse, but particular datasets, like risk and finance, were not available until day 3. That’s why it could take a few days for stage 1 to complete. Once it completed, CR would send an email to the people who were required to sign the report off.

Say that for this report it takes 3 responsible officers to sign off, each signing off a different part of the report. They each got an email from CR, reviewed the draft report and clicked the Sign Off button in the CR tool, and the status in CR would become “Stage 2 complete”, meaning the first level of sign-off was done. Then CR sent an email to the second level approver (usually at manager level). The manager reviewed the report and clicked the Sign Off button. Of course, in reality it’s not that simple, because there were often problems with the numbers, which didn’t match the approvers’ expectations. In those cases there were queries, checks and investigations around the numbers, even corrections, before they could sign off the report. So the original 30 datasets might be different from the approved datasets.

And that was the core of the issue. So we prepared those datasets, versioned them, and stored them in a set of output tables, clearly labelled: who the report was for (the client ID of Pension Fund A), what period/date it was for (31st March 20xx), and which stage the dataset was at (original, approved, or sent). We didn’t store it in flattened tables. We stored it in “Entity, Attribute, Value” tables, aka “Key Value Pair” tables. So we have 3 columns:

  • Entity column, which stores the dataset name, like “Performance Allocation by Sector” or “Portfolio Breakdown by Country”.
  • Attribute column, which stores the sector name or country name, like “Argentina” or “Healthcare”.
  • Value column, which stores the performance value or valuation value, like 1.3% or USD 30k.

And we also have a few columns on the left hand side, such as Report Name, Reporting Date, Client ID, Section, Sub Section and Order. The report name is something like “Pension Fund A Quarterly Investment Report”, the Reporting Date is 31st March 20xx, and the Client ID is the client ID of Pension Fund A. A client report consists of several “Sections”, such as portfolio performance, portfolio risk, portfolio valuation, portfolio breakdown, and so on.

The “Portfolio Performance” section contains the overall performance of the portfolio (covering the periods of 1 month, 3 months, 6 months, 1 year, 3 years and 5 years) compared to the benchmark(s). Plural, as there could be multiple benchmarks. So the Portfolio Performance section has no sub-sections. But the “Portfolio Breakdown” section contains several sub-sections: Breakdown by country, Breakdown by currency, Breakdown by maturity bucket, Breakdown by … and so on.

And the last column is the Order of the attributes within the dataset. So for “Portfolio Breakdown by Country”, the order could be alphabetical. But for some clients the countries are grouped by “region”, such as APAC (Asia Pacific), EMEA (Europe, Middle East and Africa) and Americas (US, Canada and Latin America), and within EMEA the countries are in alphabetical order. For that kind of thing, this Order column comes in handy. And of course for some datasets the order is not alphabetical, but most important first.

And we also have a timestamp column and User ID column on the right hand side of those “Entity, Attribute, Value” tables. So we know exactly when each row was produced, and by whom.
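Putting all of that together, the main EAV table looked something like the sketch below. This is a simplified, hypothetical version: the table name, column names and the sample row are illustrative, not the real production schema.

    # Simplified, hypothetical sketch of the main EAV ("Key Value Pair") output table.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE client_report_output (
            report_name    TEXT,     -- e.g. 'Pension Fund A Quarterly Investment Report'
            reporting_date TEXT,     -- e.g. '2024-03-31'
            client_id      TEXT,
            section        TEXT,     -- e.g. 'Portfolio Breakdown'
            sub_section    TEXT,     -- e.g. 'Breakdown by Country'
            sort_order     INTEGER,  -- position of the attribute within the dataset
            entity         TEXT,     -- dataset name, e.g. 'Portfolio Breakdown by Country'
            attribute      TEXT,     -- e.g. 'Argentina' or 'Healthcare'
            value          TEXT,     -- e.g. '1.3%' or 'USD 30k'
            version_stage  TEXT,     -- 'original', 'approved' or 'sent'
            created_at     TEXT,     -- when the row was produced
            created_by     TEXT      -- who produced it
        )
    """)

    con.execute("""
        INSERT INTO client_report_output VALUES
        ('Pension Fund A Quarterly Investment Report', '2024-03-31', 'PFA001',
         'Portfolio Breakdown', 'Breakdown by Country', 1,
         'Portfolio Breakdown by Country', 'Argentina', '1.3%',
         'approved', '2024-04-03T09:12:00', 'jsmith')
    """)
    print(con.execute("SELECT attribute, value FROM client_report_output").fetchall())

One row per entity/attribute/value combination means thousands of very different reports can share the same few tables, and every version of every number is kept.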

Anyway, the point of telling you all the above is: we needed to store the output of the calculations, exactly as it is seen on the report/screen, for each version. We cannot rely on the same query producing exactly the same numbers when it runs again later.

So yes, precalculation in that client reporting project helped performance, but most importantly it was for data accuracy, consistency and control. Every version of every dataset was stored, so we could reproduce them exactly as they were seen on the screen/report. And we did not store them in “flattened” tables, aka OBT (One Big Table). We stored them in EAV (Entity Attribute Value) tables, aka KVP (Key Value Pair) tables. That way, all those thousands of client reports were stored in just 3 EAV tables. Almost all reports were stored in the first EAV table, the structure of which I described above. The second and third EAV tables had more value columns and more attribute columns; these two were only used by a few complex datasets that had multiple attributes and multiple values per attribute.

Well, as promised, that was the second secret sauce done: Preprocess the output.

I’ll continue again in a day or two with the third secret sauce: gradual delivery. In the early part of my career, I delivered in a “big bang” approach, because the projects were small, like a Microsoft Access project for an FMCG manufacturer, and a FoxPro project for a textile company. But when I did data warehousing projects I started to deliver in an iterative fashion, as the projects were big and complicated. They needed to be delivered gradually to minimise the risk. And there are challenges in doing that. First of all, we had to keep testing the same parts over and over again, and that takes time and resources. Secondly, because we deployed into production many times (went live many times), we could miss some bits, so what worked in the Test environment did not work in production. And that’s where the next secret sauces come in: automated testing and packaged deployment.

Ok I’ll continue again in a day or two, so check back here. Thanks for reading!

….

Right! This is the third part of the article, and hopefully the last one. I’ll cover the next 3 “secret sauces” today:

  • Incremental delivery
  • Automated testing
  • Packaged deployment

When building a data warehouse it is essential that we deliver it piece by piece. Not in one go, but bit by bit. First of all, because we need to prove our value to all the stakeholders. Secondly, to keep it simple. If you let it build up until it is very complicated, it will be difficult to release, and it will be difficult to test too. By delivering bit by bit, the users see more and more functionality being delivered over time, so they can prioritise the next most important business functionality that we need to deliver for them.

In the first go-live, or release to production, avoid delivering any functionality. Just flow a piece of data end to end, from the source system to the warehouse, and release it to production. That alone is already a big enough challenge. You need to jump through many hoops to get that done. First of all, there will be support questions, i.e. “Oh, you’re putting something into production? Who’s going to support it? Have you done a handover session with the production support team? What happens if there are issues with the application? Have you arranged for the servers to be monitored? Did you write documentation? Is the helpdesk aware of this?”

You need to face up to a forum of architects, explaining how it all works, and how it complies with the company’s security standards. They will ask you things like: “What authentication do you use? How do you maintain the authorisation for each user? Is the data encrypted when it’s being transmitted? Where are the architecture document and diagram?” Believe me, even with a “no functionality” release, there will be a myriad of questions like these that you have to go through. It will take a significant amount of time and effort.

One way of tackling that is to say that it is only a “proof of concept”. But even so, as you are putting things into production, there will be strict measures on security, data privacy and architecture. And support. I’d advise you not to use the “proof of concept” trick, because the point of going live early is to get something that technically works properly into the production system. Yes, it may be just one table that you bring over from the source system to the data warehouse, but it is real data, it is in the production environment, and it is flowing end-to-end. And that, my friends, is enough to scare everyone, from the head of risk to the head of IT. The head of compliance too. The head of change delivery too. They will try to persuade you to back down. “What is the point of going live if you have no functionality to deliver?” they will say to you.

That is exactly the point! Over the next few months, you will be putting in a new BI system on Looker, for cost analysis say. And to pave the way you have this first release. If you manage to convince everyone, from the architects to the head of production support, then you have “permission to go live”. You have a pass! A green light. And you can use this pass to put all those functionalities into the live system. It means that this new application (the new BI system) has met the corporate standards on security, architecture and operational management. And it means that everyone in IT and outside IT is aware of it, and in agreement with you.

Now, once that first big hurdle is over, you can focus on “going live” with the next thing, and the next thing. Try to go live every 2 weeks if possible; if not, every 3 weeks. Certainly don’t leave it too long: 5 weeks is too long. As soon as something is working in Dev, put it into production. Going live every Monday is a good idea. It provides a good, regular cadence for the testers, for the users, and for the application support team.

The delivery must be requirement driven. You do not pick tables from the source system and bring them to the warehouse. Instead, you take one important requirement from the users, and get all the necessary data for it from the source system. I repeat: one important requirement. Not two, not three, but only one requirement. And that requirement must be a very important one. Something that makes a difference to the business operations. Something that gives the business a real advantage over its competitors. So that, even if the DW/BI project gets cut short after this requirement, the business will be left with something tangible, something that benefits them a great deal.

Now, this “incremental delivery” goes hand-in-hand with automated testing. Automated testing is something really incredible in software development, including DW/BI projects. Say you’ve delivered one piece of functionality to the business: footfall analysis. For those of you who are not in retail, footfall is the number of people passing through our shops. We track the number of people getting in and out of each store, and we use it to optimise the store layout and to identify shopping trends. Which times of day are the peaks and the lows, which days in a week, and which dates in a month. And of course, as you are aware, we are approaching Christmas, the busiest time of the year! Not only does this help improve the quality of customer service during peak times (and the customer experience), it also helps manage labour costs more effectively, because we can plan staff schedules around the busy periods.

During low periods, management can reduce staffing levels, and therefore save costs without jeopardising customer service. And it helps us manage inventory too. You know exactly what increase in sales to expect, and by when, and therefore you can ensure that the most popular items are well stocked ahead of those busy times. Apologies, I got carried away. The point is: you have delivered one important piece of functionality to the business, footfall analysis. And next week you are delivering another important piece of functionality: price optimisation.

When you delivered the footfall analysis, your team tested it. For example: when a store has zero traffic, the footfall calculation must not fail with a “division by zero” error. And tons of other tests. Let’s say there were 25 tests. Now, when you deliver the “price optimisation” piece next week, you need to retest that the previously delivered functionality is not adversely affected. You have to retest the footfall analysis functionality. You have to rerun those 25 tests. And when you deliver the next new piece of functionality, you have to rerun those 25 footfall analysis tests, plus another 25 price optimisation tests. Slowly but surely this builds up, and in the end your testers are not going to cope with the volume if they test manually.

With automated testing, those 50 tests are run programmatically. And they are run during deployment. So just before the code is deployed to the Dev environment, all 50 tests are executed programmatically. You just press a button and within a minute or so all 50 tests are done. If 2 tests fail, you fix them, and you run the 50 tests again. Wait a minute or two, and this time you see all green. Wonderful, isn’t it? I’ve seen it working for real, and it is a massive help. Not only do you save hundreds of hours (thousands even), you also ensure that your code is of the highest quality. And it makes a big difference to delivery times. You are not held back for days because you haven’t tested; you can bring new functionality to the business users much more quickly than if you test manually.
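To make that concrete, here is a tiny, hypothetical example of one such automated test, written with Python’s built-in unittest. The footfall function itself is invented for illustration; in a real pipeline dozens of tests like this run on every deployment.

    # Hypothetical regression test: the footfall conversion rate must not blow up
    # with "division by zero" when a store has zero traffic.
    import unittest

    def footfall_conversion_rate(transactions: int, visitors: int) -> float:
        """Share of visitors who bought something; defined as 0 when nobody visited."""
        if visitors == 0:
            return 0.0
        return transactions / visitors

    class FootfallTests(unittest.TestCase):
        def test_zero_traffic_does_not_divide_by_zero(self):
            self.assertEqual(footfall_conversion_rate(0, 0), 0.0)

        def test_normal_day(self):
            self.assertAlmostEqual(footfall_conversion_rate(30, 120), 0.25)

    if __name__ == "__main__":
        unittest.main()  # wired into the deployment pipeline so it runs on every release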

You see, if you deliver incrementally, the amount of testing increases significantly. First release: 25 tests. Second release: 50 tests. Third release: 75 tests. And so on. So every week the amount of testing grows. You won’t be able to deliver incrementally if you don’t do automated testing. At the very least it will drag you down, slow you right down.

When you deliver incrementally, you have to deploy into Dev, Test and Prod often. Like once a week. And that is a problem in itself. You can’t manually put tables, stored procedures and reports into Dev, then into the Test environment, then into the Prod environment. It would be a huge amount of work. And you will miss a component. I guarantee you, you will miss a component. You’ll find that it works well in Dev but doesn’t work in Test, because there are differences between what’s in Dev and what’s in Test. It could be a row in a mapping table. It could be the wrong version of a stored procedure. Or a new column in a table.

And it could be much worse: the application behaves differently in Production than it does in Test. What works in Test does not work the same way in Production. In other words, your brand new shiny BI app is buggy. And that can bring user trust right down. “Do not believe the BI numbers”, they would whisper among themselves. “Don’t believe the BI numbers” is the worst nightmare a data warehouse practitioner can experience. Because nothing matters any more. All that piping work, all that design work, all those best practices: they are all for delivering the numbers to the users. We are in the data business. If the users don’t believe the data we deliver, we are as good as dead. It’s like we don’t exist. Our existence, the data warehouse team, the analytics team, the BI team, whatever you want to call it, is for delivering data to the business. If they don’t believe our data, we are screwed!

So back to the point: you have to do a release to production (aka going live). To do that, you put all your code into one package, and release it into Dev. If it’s all OK in Dev, you release the same package, I repeat: the same package, into the Test environment. And do automated testing, plus some manual testing/sanity checks. And when all is green, you deploy the same package, I repeat: the same package, into the Production environment.

The key here is “the same package”. That’s the secret sauce. That way you can’t miss a thing. It is the same package. It contains everything that you deployed to Dev. You deploy the same package into Test, then into Prod. And guess who deploys it, and how? Not by hand, but automated. The deployment into Dev, Test and Prod is automated. You could use Azure DevOps, Jenkins, Terraform or Bicep, take your pick, but your deployment has got to be automated.
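The “same package” principle is easy to sketch: build the artefact once, record its checksum, and have the pipeline refuse to promote anything whose checksum differs. Below is a hypothetical, tool-agnostic illustration; in practice this logic lives inside Azure DevOps, Jenkins or similar.

    # Hypothetical sketch of the "same package" rule: one artefact is built once,
    # and the very same bytes are promoted Dev -> Test -> Prod. The checksum check
    # guarantees nothing was rebuilt or hand-edited along the way.
    import hashlib
    import pathlib

    def checksum(package: pathlib.Path) -> str:
        return hashlib.sha256(package.read_bytes()).hexdigest()

    def promote(package: pathlib.Path, expected_sha: str, environment: str) -> None:
        if checksum(package) != expected_sha:
            raise RuntimeError(f"Refusing to deploy to {environment}: package differs from the one tested")
        print(f"Deploying {package.name} ({expected_sha[:12]}...) to {environment}")
        # ... hand over to the real deployment tooling here ...

    if __name__ == "__main__":
        artefact = pathlib.Path("release_2024_03_31.zip")                # built once by CI
        artefact.write_bytes(b"tables, stored procedures, reports ...")  # stand-in content
        sha = checksum(artefact)                                         # recorded at build time
        for env in ("Dev", "Test", "Prod"):
            promote(artefact, sha, env)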

Humans make mistakes. And by automating the deployment, not only do you save a lot of time, you also ensure that the same package gets deployed the same way into each environment. And that, my friends, is the last secret sauce in data warehousing: packaged deployment. It ensures that what works in Dev and Test will work the same way in Prod. It enables you to go live frequently, like once a week. It saves you a lot of time and effort. And at the same time, it increases the quality of your code, the quality of your deliveries, the quality of the whole BI project.

And that leaves us with one thing. One secret sauce: a good team.

By far the most important thing in building a data warehouse is having a good team. In the software development industry there is a lot of deadwood, aka “passengers”: people who are not delivering, who are just in it for the ride. You want multi-functional people. Someone who can design, but can also code. And can also test. And do deployments and releases. And can present in an architecture forum. And understands security. Do not use single-function people, especially if you are a small team. If your data warehouse team consists of 5 people, every one of those people must be multi-functional. You can’t afford a “DBA” in the team who only does database administration. You need a good developer who can also design. Your BA will need to gather requirements, and write specs, and test as well.

If you are lucky enough to run a big DW/BI team, say 25 people, then you can have segregation of duties. But it is still to your advantage to have multi-functional people in your team, because that enables you to assign tasks to people flexibly.

In any development team, the infrastructure is tricky. There are a lot of infrastructure tasks, such as setting up servers and connectivity, but all you have are analysts, developers and testers. Who is going to do those infrastructure tasks then? And who is going to do the DevOps tasks? You know, things like setting up release pipelines, Terraform to create Snowflake databases and roles, or ARM and Bicep to create Databricks clusters. It is best to allocate someone within your team to do those infrastructure and DevOps tasks. Preferably a developer, because the infra/DevOps world is closely related to development (skill-wise), and so that later on, when the infra and DevOps workload subsides, that person can help with the coding. Infra and DevOps work peaks at the beginning, then decreases to almost zero towards the end of the project.

But more importantly, you need to have good people. And by that I don’t mean their skills. I mean people with initiative. People who have drive within them. In meetings, you’ll notice that most people are quiet. They don’t say anything; some don’t even turn their camera on. But there are a few who talk. Those are the ones I mean by good people. They always come forward. They bring initiatives and ideas to you. They work, and work hard. They want the project to succeed.

By their nature, team members only do what they are asked. They only do the tasks assigned to them. But there are some who do things without being asked. That’s what I mean by good people. And in a small team you need everybody to be good people. You can’t afford people who are like a gong: they only make a sound when you hit them. I mean, they only work when you ask them to. You need all 5 people to have initiative. In a small team of 5 people, you can’t afford to have a pure manager either. The leader must also do some of the work. Like 70% working/delivering, and 30% managing the team.

And even when you have a big DW/BI team (like 25 people), it makes a big difference if you have good people. People with initiative and drive. On top of that, if they have good skills too, then they are ideal. A team like that can deliver: they have the drive, and they have the necessary skills and experience. And you need to set the team up right at the very start, when you are recruiting people into the DW team. Both from inside the company, and from outside.

Even if you are 3 months in and you find people without any initiative or drive in your team, don’t hesitate to replace them. It takes time because of all the HR procedures, but it is worth it in the end.

Having a team of good people is the key to the successful delivery of any development project, not just in data warehousing, BI or AI, but any software, any application. Data or not, people are the key. People are the one factor that makes all the difference. People are more important than any of the secret sauces I explained above. Even if you don’t have good processes or good technology in place, having good people can make the difference, because with a team of good people you can build good processes and adopt good technology. But on the other hand, if you have good processes and a good tech stack but you don’t have good people, you won’t be able to deliver. And that, my friends, is the most important secret sauce in data warehousing and BI: good people!

Thank you for reading all the way to the end.

As always, I would appreciate your comments, opinions and thoughts. Write them below in the comments. Thank you.

Dmytro Andriychenko

Senior Data Engineer / Architect, Business Intelligence MCSE, Power BI, Fabric, Azure and DevOps Expert

2 months ago

I would only change the first point to be a "Good enough design": it does not necessarily have to be denormalised, although there are good reasons for it to be so in many cases, but it has to be suitable for the subject area...

Dmytro Andriychenko

Senior Data Engineer / Architect, Business Intelligence MCSE, Power BI, Fabric, Azure and DevOps Expert

2 months ago

That was such a click-bait!!! You are right of course, as usual!!! Love your posts!!!
