Four tips for building a data warehouse

I think my top tip would be that we should design the warehouse based on the usage, not the sources. Many DW projects start with the data sources, ingestion into stage, etc. and creating the warehouse based on the data they receive. I’d advise that you start from the other end. Sit down with the business users, find out what analysis they need to do, what analytics they need to have, and design the mart to support their work. When you have not designed a warehouse before, creating 40 hubs, sats & links could be a daunting task (or 40 fact & dim tables). By focusing only on what the users use, it would reduce your present complexity considerably.

My second tip is that we should start from production environment. Meaning that from the get go we should be aiming for getting production env up and running asap. Yes, even before we start developing. It takes a lot of time to overcome all environmental issues, such as security settings, user access, and network configs. Transporting the data alone could take months to setup. Don’t linger on “paper state” too long. Just get on with the implementation and learn from it. You’d need to get architectural approval of course, but start small, learn and build up from there. Don’t spend months trying to design the whole thing and not doing anything for real. You can’t see many of the environments and architectural issues just by writing documents. Just build something and learn from that experience. You’ll be surprised by the number of issues you encounter when you start Terraforming, configuring EC2, setting up dbt and Snowflake, etc. Or Microsoft Fabric with OneLake and ADF (and key vaults, linked services, resource groups). Probably half of the work is outside development, i.e. setting up all the infrastructure. So do not underestimate them, start tackling them early.

The third tip would be about optimising development effort, from management point of view. In a warehouse project, a developer could have nothing to do because the specs are not ready. It could be a BI developer, it could be an ETL developer (analytics engineer and data engineer if you want to modern terms). The worse case is that BI developers need to wait months because there is no data. So the tip is this: when you (the DW architect/designer) sit down with the business users (see my first tip above), bring your BI developers with you. Then you create a star schema prototype, with dummy data, to satisfy those user needs. This way, the BI developer can create a dashboard mockup, with some data, and go through it with the users in an iterative fashion. And while they are doing that, in the meantime sit down with your data engineers, and figure out where to get that required data from.

If you are lucky enough to be the leader of the pack, for example if you are an IT manager, then this next tip is for you. As a manager, you need to secure the budget for creating a warehouse. Not just for a few months, but for two three years. The trick here is to demonstrate the business value as early as possible. As the one who control the purse string, you have control over the resources/staff allocation. Use those resources to pick one low hanging fruit, and deliver it. Something easy enough to deliver, but is critical for the business user to do their job. Say it’s the competition analysis. Instruct your staff to devote all their attention to satisfy that one requirement. Sit down with your staff and explain why delivering that one low hanging fruit is important: because unless you deliver it you won’t get more budget for that DW project. Make them realise that even though the functionality is simple (the dev is easy), but delivering the infra is a biggie. So ask them to help each other to deliver the infra. When building the infra, please don’t go for a Rolls Royce. Just a Ford would do, as long as it gets you from A to B. I know you can automate DevOps, you can set all the biceps and ARMs for everything, but please, for this first iteration refrain yourself. Act like you are a startup, not a big bank. Once you create something concrete that the users can use, and the board can see that the company revenue gets better, then you’ll secure the next 12 months funding. And that’s when you do proper DevOps, proper development processes, metadata driven stuff and so on.

I know that some companies got lucky, they had like a quarter of a million dollars to spend on data platform with no questions asked. But most of us are not like that. In most of companies they are tight with the purse strings. If they give you $25k and you don’t deliver after 6 months, the likelihood is that your can kiss your DW project good bye as the project would be canned and the money allocated somewhere else by the board. Make sure your team realise this. That on the first iteration you guys don’t have the luxury of on-boarding everything from the data sources and create a grand shiny warehouse. Your team will need to be laser focused on delivering just competitive analytics (as per the above example), and forget everything else.

Nathan Jones

Capture consumer brand M&A value 60% sooner ?? ?????? | Data Strategy ??| Data Integration ?? | M&A ?? | Founder Calon Analytics ?? | AI ??? | Keynote speaker ?? | Dachshund dad ??

6 个月

Agree with a lot of this. One thing I in particular agree with is about getting prod ready from the start. If possible, I wouldn’t let non-prod data anywhere near a data warehouse dev or test environment. That is a big ‘if possible’ but companies have so many differences between test and prod that it’s often a real waste of time using anything other than real prod to dev and test with.

Nimesh Parikh

Enterprise Data Architect | Microsoft Data Technologies (Azure & MSBI) | TOGAF 9? Certified | MCSE

6 个月

Agree, but at many places business stakeholders don't know what they really wants or may be can't explain what they want. Prob is you may creat prototype with dummy data but it is possible that require data may not be captured by source system at all.

要查看或添加评论,请登录

Vincent Rainardi的更多文章

  • Data Architecture, Information Architecture, Data Governance, Data Engineer

    Data Architecture, Information Architecture, Data Governance, Data Engineer

    Just a quick one on the difference between Data Architecture and Information Architecture. Information Architecture is…

    6 条评论
  • Data engineer and pizza

    Data engineer and pizza

    As a data engineer we have two jobs. The first one is to build data pipelines, and the second one is to build data…

    1 条评论
  • Power BI Data Analyst Certification

    Power BI Data Analyst Certification

    Perhaps your friend or yourself are interested to work in data, but don’t know where to start. Well one way is to get a…

    1 条评论
  • Data Catalog for Snowflake

    Data Catalog for Snowflake

    You have a data platform on Snowflake. And you are looking for a data catalog.

  • Building a Data Lake on AWS

    Building a Data Lake on AWS

    Principles: reliable storage, serverless, data governance, orchestration, IAM, IAC, audit trail, DevOps Decentralise…

    2 条评论
  • Transformation Engineering

    Transformation Engineering

    ETL is Dead We no longer do ETL, Extract > Transform > Load. No.

    230 条评论
  • A good company

    A good company

    I've been working for 15 companies in my career. I’m a contractor.

    2 条评论
  • Advice for computer science students

    Advice for computer science students

    Someone who teaches in university just asked me a question: his students who completed CSE undergraduate degree…

    1 条评论
  • A few quick tips on dbt.

    A few quick tips on dbt.

    1. dbt build error The line number on dbt build error is 2 lines too many.

  • Every data scientist needs to read this book

    Every data scientist needs to read this book

    Every data scientist needs to read this book: Big book of data science https://www.databricks.

社区洞察

其他会员也浏览了