Choosing a BI Tool
Data reporting and visualisation ‘BI’ tools come in many flavours, with a bewildering variety of features to confuse and confound. So how can you see past the glossy sales facades and decide which one is right for your organisation?
Whatever else they do, all BI tools begin by extracting data from a data source, and because the way they do this dictates most of their other capabilities, the first thing to understand about any tool is how it takes source data and makes it available to end users.
On this basis, BI tools can be divided into five categories.
1. Basic SQL Generation
The most basic tools are those that require the user to write SQL to select the data to include in the report. The user needs to hand-code the joins between the relevant tables, select the columns, apply any transformations and then apply the relevant filters or other functions.
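As a minimal sketch of what this looks like in practice (the table and column names here are invented for illustration), a report in such a tool might be driven by a hand-written query like the following:

```sql
-- Hypothetical schema: a sales fact joined by hand to its lookup tables.
-- Every join, calculation and filter is the report author's responsibility.
SELECT
    c.customer_name,
    p.product_category,
    SUM(s.quantity * s.unit_price) AS total_revenue   -- transformation applied in the query
FROM sales s
JOIN customers c ON c.customer_id = s.customer_id      -- joins hand-coded by the user
JOIN products  p ON p.product_id  = s.product_id
WHERE s.sale_date >= DATE '2024-01-01'                  -- filters applied directly in SQL
GROUP BY c.customer_name, p.product_category
ORDER BY total_revenue DESC;
```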
Users need to be very familiar both with the databases involved and with SQL. Objects defined within a report are generally not easily re-usable or available to other users, and there is a high dependency on individual users to construct their queries in the right way. There is typically little or no metadata recorded centrally about the content or construction of reports, and no way to ensure that different people seek similar answers in the same way.
SQL interfaces are best used when you have a small number of power users looking to run ad-hoc reports against operational or highly changeable systems that they are very familiar with and comfortable using. Looked at from this perspective, they are barely ‘BI’ tools at all.
2. Advanced SQL Generation
Early BI tools (like Hyperion's old Brio product) took a similar approach but presented the user with an interface that allowed SQL to be generated by dragging and dropping items from lists of available tables and by drawing lines between them to define relationships. The data relationships, which are key to generating reports, are defined in this way by the user, as are any filters, calculated data items or transformations.
This configuration actually builds a SQL statement in the background, which is then stored with each report, and the user can usually view the SQL if they wish.
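As a rough illustration (table names invented), the SQL stored behind such a report might look something like this - structurally the same as a hand-written query, but assembled by the tool from the tables, relationships, filters and calculated items the user defined on screen:

```sql
-- Sketch of tool-generated SQL: the aliases, calculated item and filter
-- all come from the drag-and-drop configuration rather than hand coding.
SELECT
    T1.region_name        AS "Region",
    T2.order_month        AS "Order Month",
    SUM(T3.net_amount)    AS "Net Sales",
    SUM(T3.net_amount) - SUM(T3.cost_amount) AS "Margin"   -- calculated data item
FROM region_lu  T1
JOIN orders     T3 ON T3.region_id = T1.region_id          -- relationship drawn by the user
JOIN date_lu    T2 ON T2.date_id   = T3.order_date_id
WHERE T2.order_year = 2024                                  -- filter defined in the report
GROUP BY T1.region_name, T2.order_month;
```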
The user can then define a report by selecting fields from the result set, and can format these, develop graphs, charts and so on as desired.
Sometimes a single report container can contain several 'reports', which are essentially just different visualisations of the same data set. In order to ‘re-use’ the data definitions, the configuration can be saved as a template report, which can then be copied and modified to make additional reports.
3. Cube Builders
Some of the most popular power user or departmental BI tools (like Power BI, Qlik and Tableau) use a ‘cube’-based approach to address data issues. A cube is a multi-dimensional data source rather than a relational one - imagine a single big spreadsheet rather than multiple tables joined by lines. In order to populate the cube you have to set up a link to your relational data source, define the tables and columns you are interested in and the joins between them, and then click the ‘build’ button to populate the cube with data.
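Conceptually (again with invented table names), the extract that feeds the cube build is a single denormalised result set - the relational tables are joined once at load time and the result is stored in the cube's own structures, something like:

```sql
-- Sketch of the kind of extract that feeds a cube build: dimensions and
-- measures are flattened into one wide result set at load time.
SELECT
    d.calendar_year,
    d.calendar_month,
    p.product_category,
    p.product_name,
    st.store_region,
    s.quantity,
    s.net_amount
FROM sales s
JOIN date_dim    d  ON d.date_id    = s.date_id
JOIN product_dim p  ON p.product_id = s.product_id
JOIN store_dim   st ON st.store_id  = s.store_id;   -- run once per refresh, not per query
```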
The build can take a long time depending on how much data is being imported, but once the cube has been loaded query response times can be very good. If you have ever used a pivot table in Excel then you know how user friendly this type of interface can be. You can interact very dynamically with a cube to create nice visualisations and dashboards ‘on the fly’, without waiting for any more queries to run against the database – i.e. no switching between ‘design’ and ‘run’ modes.
There are two main styles of cube use, personal and shared, and most vendors offer both options, with ‘shared’ typically being the ‘Enterprise’ or ‘Server’ option.
Personal Cubes
For personal use the ‘cube’ that a user builds is a one-off and exists in a report file on their PC or personal storage. Each time they open this report, the user can choose to refresh the cube from the source (or not). The report acts as an independent micro-database, tailor-made for the user. You can save the file to a shared folder so that other users can access it, but this approach suffers from the same issues as any 'spread-mart' – data is not governed, multiple copies exist, people have to know what they are doing to construct the cube in the first place, and so on.
Shared Cubes
For shared use a cube can be placed on a server, like IBM's Cognos or Microsoft’s SQL Server Analysis Services. In this situation a skilled architect/administrator sets up the cube and schedules the regular refresh of data. Users connect their reports to the cube and only the output is stored locally, with the cube data itself residing on the server. As with any other system, if there are many large concurrent queries then performance will degrade.
While they are very user-friendly, cubes do have significant limitations. Generally they allow only simple equality-based inner joins and require that full referential integrity be enforced in the source data, which restricts to some degree what data can be included and how. Their monolithic nature means that cubes don’t scale very well to large data sets or complex data models, and it can take a long time to refresh a cube to include new data.
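For example (using invented tables), a plain equality join like the first query below maps neatly onto a cube model, whereas the second - an outer join with a range condition - is the kind of relationship most cube builders cannot express directly and that typically has to be resolved in the source or ETL layer before the data is loaded:

```sql
-- Fits the cube model: a plain equality-based inner join.
SELECT p.product_category, SUM(s.net_amount) AS net_sales
FROM sales s
JOIN product_dim p ON p.product_id = s.product_id
GROUP BY p.product_category;

-- Harder to model in a cube: an outer join with a non-equality condition.
-- This usually has to be pre-resolved before the cube is built.
SELECT c.customer_name, SUM(s.net_amount) AS net_sales
FROM customers c
LEFT JOIN sales s
       ON s.customer_id = c.customer_id
      AND s.sale_date BETWEEN c.contract_start AND c.contract_end
GROUP BY c.customer_name;
```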
It can be tricky to extend an existing cube to include additional data items, and cubes are queried using less common query languages like MDX or DAX that are quite different from SQL, so it can be difficult to find developers who are skilled in these languages. Cubes are also fairly ‘closed’ applications, so there are limited means by which a developer can enhance functionality, tune performance and so on.
4. Semantic Layer SQL Generation
Some tools (like Business Objects and OBIEE) take advanced SQL interfaces a step further by adding a re-usable/shareable layer of metadata that adds significant power and functionality.
A designer specifies the joins between objects, categorises and renames the fields within the model that users are likely to report on, defines standard calculations and so on, and the result is stored in a shared ‘semantic layer’, hosted on a server application.
End users don’t need to be familiar with or understand the database design, they just need to access the semantic model via the report designer interface and drag and drop the desired elements into their report. Behind the scenes the tool will construct the SQL required to retrieve the data, using only the elements of the previously defined join paths that are needed to fulfil each query. This means that many users can run many different queries using the same shared 'semantic' model.
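As a rough sketch (object and table names invented), a user might drag just 'Region' and 'Net Sales' into a report; even though the semantic model may describe dozens of tables and joins, the generated SQL touches only the objects needed to answer that question:

```sql
-- Sketch of semantic-layer-generated SQL: the model may describe many
-- tables, but only the join path needed for the selected objects is used.
SELECT
    r.region_name      AS "Region",       -- business name defined in the semantic layer
    SUM(s.net_amount)  AS "Net Sales"     -- standard measure defined once, reused everywhere
FROM sales s
JOIN region_lu r ON r.region_id = s.region_id
GROUP BY r.region_name;
```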
Users can also develop multiple reports from a single query, but if they need to change the query, they will need to switch back to design mode, make the change and run the query again. Multi-pass query behaviour can be mimicked by combining the output of two queries within the reporting tool. In this case query performance will be limited by the capability of, and resources available to, the reporting tool rather than the source database, which can be problematic.
For straightforward 'tidy' data sources the semantic layer approach works very well. However, complications can arise if the data model is complex and different users need to use different join paths to access the same parts of the data model - for example, if they want to use an aggregate table instead of the detail level, if the finance team join tables in a different way than marketing do, or if they want to do something unusual requiring an outer join.
The more advanced tools facilitate some degree of ‘aggregate awareness’, meaning the tool will modify the query SQL to use summary tables unless data is required that only exists at the detail level. However, in many cases system administrators might need to make alternate versions of the semantic model available to cater for the different styles of use, which is a big administration overhead. End users also find having multiple models confusing and may be unsure which model is the right one to use to best answer their question.
5. Object-Oriented SQL Generation
The approaches previously described largely attempt to manage complexity through simplification - by forcing the user or designer to make up-front binding decisions about object selection and join paths. Simplicity is good in many ways, but it imposes limitations that become more apparent as the scale and complexity of the data increases and as user requirements grow and diverge.
Tools like MicroStrategy take a different 'object oriented' approach, requiring designers to define concepts, and then allowing them to ascribe the same concept to multiple database objects simultaneously. This allows the tool to leverage the richness of the underlying data model in a more sophisticated way.
As an example, the 'category' attribute might be defined as existing in multiple locations - the primary/master location is the category lookup table, but the attribute also exists as a foreign key in the child sub-category lookup, and in four fact tables - a targets fact and the sales year, quarter and month aggregates.
When a user creates a report, the decision engine will look at the attributes and facts required and will make an ‘intelligent’ decision at report run time about where and how to source the data. If the user selects items that exist on several tables, the engine will use other information, such as the minimum number of tables that can be used to assemble all of the data and statistics about the size of each table, to determine which join path is likely to be most efficient.
Using the previous example, if the user includes category and sales in their report without specifying a time period, the engine will get the data from the smallest table, i.e. the yearly sales summary. If the user selects sales and month, the engine will go to the monthly summary instead.
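As a sketch of that behaviour (the table names below are assumed for illustration), the same report definition can resolve to two quite different SQL statements depending on the level of detail requested:

```sql
-- Report asks for category and sales with no time period:
-- the engine picks the smallest table that can answer the question.
SELECT c.category_name, SUM(y.sales_amount) AS sales
FROM   sales_year_agg  y
JOIN   category_lu     c ON c.category_id = y.category_id
GROUP BY c.category_name;

-- Report asks for category, month and sales:
-- the engine switches to the monthly aggregate instead.
SELECT c.category_name, m.month_id, SUM(m.sales_amount) AS sales
FROM   sales_month_agg m
JOIN   category_lu     c ON c.category_id = m.category_id
GROUP BY c.category_name, m.month_id;
```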
Tools like this are extremely powerful because a single model can encompass all objects within a data warehouse without the need to specify multiple (or indeed any) specific join paths or usage contexts. The tool can also adapt to use outer joins where appropriate.
The ability to nimbly switch between detail and aggregate tables in a way that is seamless and invisible to end users is a huge advantage, and it is clear that this 'object oriented' approach can be extremely effective in delivering self-service BI for large/complex data sources to large/diverse user groups. The drawback has traditionally been the somewhat 'black-box' nature of how these tools assemble the SQL they need to run. Users have to be careful to configure the attributes and metrics correctly, and there can be some frustrating times until they tune in to how things work.
Conclusion
When choosing a tool, you need to consider your use cases and determine your requirements based on them.
- Does your tool need to service a large and diverse user community, or a small niche one?
- Are your source data sets big or small (breadth, depth, diversity, complexity)?
- Will your BI tool be used to deliver only IT-produced pre-defined reports, or will users also need to produce their own reports via 'self-service'?
- Are your self-service users skilled in data modelling and SQL or are they data novices?
- Do you have the IT staff/infrastructure resources to support server-based BI or are you restricted to personal/desktop BI solutions?
- Will your users need to slice and dice interactively within their reports or are they happy to export to Excel to do this?
- Will users often need to drill to view lower levels of detail for specific results, or to combine data from multiple business areas?
The answers to these questions will go a long way towards telling you which kind of BI tool will do the job for you, and can help narrow the field to a more manageable list of candidates.
Questions on Data Strategy, Data Warehousing, Data Integration, Data Quality, Business Intelligence, Data Management or Data Governance? Click Here to begin a conversation.
John Thompson is a Managing Partner with Client Solutions Data Insights Division. His primary focus for the past 15 years has been the effective design, management and optimal utilisation of large analytic data systems.