Excel and Power BI from a DevOps guy: Part 1
Data and DevOps is a Desert...

Over the past month I've been getting my hands dirty with Power BI because I needed some dashboards for internal reporting. I'm no Power BI expert, but coming from a more 'traditional' software development background, where things like CI/CD and version control are normal, I've found the experience interesting, so this is my view from that perspective. A small team is working together to pull the data into shape. This is a classic business use case where the data comes, in part or sometimes entirely, from Excel sheets. Here I'll work through the Excel side in particular, but similar rules apply to other data sources.

Version Control is a good idea... Power BI disagrees

The first thing that hit me was the complete lack of version control. The manageable entity is the PBIX file; for anyone reading in the Java world, this is like having a JAR file as the only manageable entity. This really is the first problem, as it means two people can't be working in the same area at the same time. The lack of version control leaves you with two choices:

  1. One developer per PBIX
  2. Cut and paste into files and version control the files

So we went for the first approach, with MS Teams acting as the way to ask who was working on what and to hand over between the guys in India and me in the US. This really isn't very nice.

Data v Dashboards: a decision to be made

The next decision was how to add some degree of parallelism, and another challenge limited our options here. In Power BI, when you publish a data model to powerbi.com you can connect to it from a new dashboard in "Live Query" mode. The problem is that you can only connect to one data source in Live Query mode. This prevents you from splitting the data model work across multiple people and then unifying at the data level. You can either have everyone create their own variation on the data model for their own dashboard, which isn't ideal, or have everyone build their dashboards on the same data source.


This means that getting that first data model at least vaguely schema-correct is important. The data doesn't have to be correct straight away, but the schema does. For us this part was about five days of work.

Building the Data Model

Using Teams to manage Data

With Power BI Desktop, if you are connecting to large data sources it can take a LONG time to refresh the model after you've made some modifications. The way around this is a reduced testing data set (also a good idea from a security perspective) that is quick to refresh and supports more iterative development. There were two ways we could have managed the files:

  1. Via Teams (Sharepoint)
  2. Via version control (Git)

The advantage of option 1 is that you can refresh the model on powerbi.com without requiring a new deployment: as new files become available you can refresh cloud-to-cloud, leveraging Microsoft's infrastructure. I'd really recommend this.

So under your SharePoint you should create two different Teams (which also create the associated Power BI workspaces):

  1. Development Teams
  2. Production Teams

General users are added to the Production version; the Development one is where you do your development and testing. You could add more environments this way if you want additional stages, but for what we were doing two were sufficient. We stored all of the files under the Development Teams environment, which reduced the number of people who could access the raw files, and the production files were stored in an area with even fewer people able to access them. Excel isn't nice; it's always a security challenge, which is why it was separated off.

So, as with 'normal' software development, we had a /Data and a /Test Data directory. The Test Data covered a broader set of data challenges than the production data (to help with debugging) but was less than 10% of the total data volume, because a remote, or even local, pull of large volumes (a couple of hundred files) takes a long time to refresh. To differentiate between the two areas we defined a parameter and a helper.
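As a sketch of how such a reduced test set might be assembled (this is illustrative Python, not necessarily how our actual test files were produced: keep every known edge case, then top up with a random sample until you hit roughly 10% of the total volume):

```python
import random

def build_test_subset(records, is_edge_case, fraction=0.1, seed=42):
    """Build a reduced test data set: keep every record flagged as an
    edge case, then fill the remaining budget (up to ~`fraction` of the
    total volume) with a reproducible random sample."""
    edge = [r for r in records if is_edge_case(r)]
    rest = [r for r in records if not is_edge_case(r)]
    budget = max(0, int(len(records) * fraction) - len(edge))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return edge + rng.sample(rest, min(budget, len(rest)))
```

The fixed seed matters: everyone on the team refreshes against the same subset, so a dashboard number that differs between two developers points at the model, not the data.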

Parameters, Helpers, config files and loading via powerbi.com

To enable the switch we created a simple config file, one per user. As the developer you just create a config file called "<userid>_config.xlsx" and set that as a parameter. Then a simple helper function reads the root directory from it:

[Image: the helper function]

This way we can define a "Root Directory" value that enables the switch in the development environment. For me, on my funky fast home internet (thanks Cox Gigablast), it turned a 60+ minute load into a 20 minute load, while for a colleague on limited bandwidth it turned "I'll run it overnight" into a 20 minute load.
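The helper itself is a Power Query function in the PBIX; since it was only shared as a screenshot, here is the equivalent logic sketched in Python (with JSON standing in for the .xlsx config so the sketch needs only the standard library; `get_root_directory` and the "/Data" fallback are illustrative names, not our actual values):

```python
import json
from pathlib import Path

def get_root_directory(user_id, config_dir=".", default="/Data"):
    """Resolve the data root for this developer.

    Looks for "<userid>_config.json" (the article uses an .xlsx config;
    JSON stands in here) and falls back to the shared production data
    directory when no per-user override exists.
    """
    cfg_path = Path(config_dir) / f"{user_id}_config.json"
    if cfg_path.exists():
        return json.loads(cfg_path.read_text())["RootDirectory"]
    return default
```

Because the per-user file is keyed on the user id, checking it in alongside the data does no harm: each developer's copy of the model picks up their own override, and anyone without a config file silently gets the full production path.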

Creating a Testing Dashboard

One problem with the "only one Live Connection" limit is that it makes automated testing a bit of a challenge. There are really three types of testing you can do, and one of them is heavily impacted here:

  1. Rules-based testing - for instance, one rule in my data set is that although we receive a CDC feed, there should only be one instance of each record in the data set.
  2. Cross-reference validation - where you load two sources and they should align
  3. Calculated results

This last one is where it becomes difficult. What you'd like to do is use the dashboard and compare the calculated results from your test data set against the results you know you should be getting for that test data set. But you really don't want that validation data as part of your core model, so it's either "download to Excel and compare" or bite the bullet and make data validation a standard, loaded part of the data model, then mark it as "not loaded" when you promote from development to production.

This is what we are now doing. It really is a bit ugly, and it's certainly no JUnit, but having that testing dashboard and the results you'd expect for a given test data set is very important if you want to move towards a proper CI/CD approach. It has an added advantage, though: within the production load we can do the same thing, which gives users confidence that the dashboards (which replace current Excel reports) are correct.
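The "expected results for a given test data set" comparison can be sketched like this (a Python stand-in for what the testing dashboard does against the model's calculated measures; `compare_to_expected` and the measure names are made up for illustration):

```python
def compare_to_expected(calculated, expected, tol=1e-6):
    """Compare measures calculated from the test data set against the
    known-good results for that data set. Returns a mismatch report
    mapping measure name to (got, want); empty means the dashboard
    logic is correct for this test data."""
    report = {}
    for measure, want in expected.items():
        got = calculated.get(measure)
        if got is None:
            report[measure] = ("missing", want)
        elif abs(got - want) > tol:
            report[measure] = (got, want)
    return report
```

The expected-results table is exactly the validation data discussed above: it ships with the development model, drives a red/green page on the testing dashboard, and is flagged as "not loaded" for the production release.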

I'll post a few more pieces as I take this journey of learning Power BI, coming at it from the software development rather than the BI development side.

Nicholas Hughes

Great insights! Especially with regard to the lack of version control and live query data source limitations. Hopefully these and other shortcomings will be addressed in the near future.

Kumar Chinnakali

Nice one Steve Jones, thanks for sharing; looking forward to the next edition.

Subir Sarbabidya

Hey Steve - long time no see - nice article. We are deep in this rut with an e2e native Azure setup, so here are my $0.02 on your current misery with Power BI. I recognize you may not be able to do some of the things I suggest because of issues within the Capgemini Azure Cloud setup (which I knew well), but take it away anyway. (1) Collaborative development using Power BI Desktop: Power BI can ride seamlessly on the Office 365 collaborative workspace foundation if you pair a Power BI subscription with an Office 365 tenant. Once you do so, you can create Power BI collaborative workspaces and publish Power BI Desktop files to a workspace where all members of that workspace can collaborate on the same Power BI file. If you upgrade the Power BI classic workspace to the new Power BI workspace, you can unlock O365 RBAC features in your workspaces, with assigned roles like administrators, contributors, publishers and viewers. (2) Version control in Power BI: version control applies to three entities, (i) datasets, (ii) models and (iii) Power BI projects/files. For data you have your CDC mechanisms already, I presume. For models you can use Azure Analysis Services paired with VSTS/Azure DevOps; it is very effective. For version control of PBIX files, consider one of two approaches: (i) create a data analytics project in your Azure DevOps/VSTS environment, give your developers access to the project, and check your PBIX files into the project, or (ii) convert the PBIX files to PBIT (Power BI template) and check them into Git. Yes, I agree with you, version management features even by this mechanism will not be very great with Power BI, because both PBIX and PBIT are essentially zip binary files and hence by nature they can't really merge changes. However, most authored dashboard desktop software in the peer group, including Tableau or Qlik, is not great at version management either.
So you may have to change the way you look at version control for authored dashboards as an output. Finally, (3) publishing Power BI dashboards: I recommend not using PowerBI.com and instead using the Power BI Embedded framework. Embedded gives you a great deal of flexibility in controlling the performance of output by (i) managing performance tiers of the service on Azure, (ii) optimizing publishing aspects like iframe bootstrapping, query caching, embed parameters and permissions, and (iii) using Performance Analyzer inside Power BI Desktop to monitor the performance of your embedded content. Microsoft has done tons of enhancements to the Power BI framework, particularly if you go Azure native, so it may be worth a try. Let me know if you find this useful.

Ben Brandwood

The version control thing is very interesting. Excel is getting there faster, at least in the sense that it's "multiplayer" and keeps things nice. On the other hand, we are using Azure Data Factory, and its integration with Azure DevOps and its almost-native branching/merging design is one of the better ones I've come across. With the inevitable merging of toolsets (probably under that new Synapse banner) there's going to be a bunch of uplift at some point.

Do you want team collaboration, CICD and version control while building your data visualization and reports? You should try Looker :-)
