The Surprising Power of Web File Data Connections in Qlik Sense
A screenshot of a Qlik Sense Cloud dashboard, created using data gathered in this tutorial. Credit: Austin Spivey

The Surprising Power of Web File Data Connections in Qlik Sense

Originally published on Medium

One of my favorite aspects of my job is that I get to play with business intelligence software all day long, learning how to eek out every bit of cool functionality that a program has to offer. One that I’ve been using quite a bit recently is RStudio, using it to write R code that scrapes the web for data. Though I love using RStudio, I wanted to try loading data into my BI tool of choice, Qlik Sense, using only a single data connection and a small amount of Qlik scripting. Thus, I came up with this small tutorial to help beginning (and advanced) users of Qlik Sense to scrape data from a website.

With this quick tutorial, we’ll be using the Web File data connector in Qlik Sense Cloud to scrape the HTML tables from a couple of pages of stats on the ESPN website. This will involve (with sample scripts below to help) two For loops, four variables, and about 70 total lines of script.

To build this app, we’ll first sign in to Qlik Sense Cloud here.

After creating a new app, we are prompted to load data in, either by loading in files via the Data Manager or via the Data Load Editor. We will need to use the Data Load Editor in order to write our own script. After selecting the Editor, we will select the ‘Create new connection’ button on the right and then ‘Web file’ (doc). We will now paste in our URL, https://www.espn.com/mens-college-basketball/statistics/team/_/stat/scoring, and our connection name, ‘ESPN.’

Our new data connection appears in the right panel. Now, let’s create a new section for our script: in the left panel, select the button with the plus + sign. Name this section ‘Load in Data.’ On the right panel, where it says ‘ESPN,’ under the words ‘Web file,’ select the first icon. This will allow us to select data from the connection.

On the page that pops up, we can select which tables to include. In our case, only one table was found on the page and it has been automatically named ‘@1.’ The settings should read:

  • File format: HTML
  • Field names: Embedded field names
  • Character set: 28591 (ISO 8859-1 Latin I)

After confirming the above settings, select ‘Insert script’ on the bottom-right. You should now have a load script in the editor, which we are going to tweak by adding a table name and making the URL dynamic. See the script below to see these changes, added in bold font:

Here, we are naming the table ‘Scoring Stats load.’ We are also adding the 'URL is' specifier (doc), which will allow us to create a dynamically-changing URL (doc). A dynamically-changing URL will allow Qlik Sense to change the URL on-the-fly while it loops.

Since we want to get all of the information from each page, as well as each NCAA season between 2002–2019, we need to loop through this URL. To do this, we will create two variables that will change the URL: $(szn) and $(page).

In the same Editor section, ‘Load in Data,’ add the bolded script:

We have now added two For loops. These allow us to do something to our data for each item in a list, or sequence. Our variables are no good unless they’re being used to change the URL, so let’s add them in like this (bolded):

Now when we run this script, it will pull in the data from the table of each page it goes to. The pattern it’s following is resembled verbosely as:

  1. https://www.espn.com/.../year/2002/count/1
  2. https://www.espn.com/.../year/2002/count/41
  3. https://www.espn.com/.../year/2002/count/81
  4. https://www.espn.com/.../year/2002/count/121
  5. https://www.espn.com/.../year/2002/count/161

These two loops will go through each page for each season, just by changing the variables each time they loop.

Now, we don’t want to have to run through this loop every time we use this app. For one thing, it’s bad form to put unnecessary strain on companies’ servers, even those of large companies like ESPN. Besides, we don’t want to wait 25–35 seconds every time we load the data anyways.

To address this issue, we’ll add in some logic that will check to see if we already have a file saved with our data in it; if no such file exists, we will save our data to a QVD to use next time we run the script. This file can be conveniently saved as an attached file, so we don’t need to save it to our device.

Let’s add this to our script (additions bolded):

Let’s say we want our file to be a QVD (doc) and named ‘scoring_stats.qvd.’ The function QvdCreateTime() will check to see what time our file was created. If there is no creation time (read: it hasn’t been created yet), then it will go through our loop to pull the data in from ESPN using the dynamic URL we created and then store the QVD as an attached file.

To add some script that will clean our data a bit, start a new section and paste in the following:

[Scoring Stats B]:
Load *,
RecNo() as [Record ID]
Resident [Scoring Stats A];

Drop Table [Scoring Stats A];

[Scoring Stats C]:
NoConcatenate Load *
Resident [Scoring Stats B]
where [RK] <> 'RK'
Order BY [Record ID] asc;

Drop Table [Scoring Stats B];

[Scoring Stats D]:
Load *,
SubField([FGM-FGA], '-', 1) as [FGM],
SubField([FGM-FGA], '-', 2) as [FGA],
SubField([3PM-3PA], '-', 1) as [3PM],
SubField([3PM-3PA], '-', 2) as [3PA],
SubField([FTM-FTA], '-', 1) as [FTM],
SubField([FTM-FTA], '-', 2) as [FTA]
Resident [Scoring Stats C];

Drop Table [Scoring Stats C];
Drop Field [FGM-FGA];
Drop Field [3PM-3PA];
Drop Field [FTM-FTA];
Rename Field [RK] to [RK_prev];

[Scoring Stats]:
Load *,
If( [RK_prev] = '', Peek([RK], -1), [RK_prev] ) as [RK],
Trim([TEAM]) as [Team]
Resident [Scoring Stats D];

Drop Table [Scoring Stats D];
Drop Field [RK_prev];
Drop Field [TEAM];

There you have it: a set of NCAA Men’s Basketball scoring data from 2002 to present day, collected by grabbing the HTML table from each page of ESPN’s URL sequence. Now you’re ready to start building out visualizations and, most importantly, finding insights.

要查看或添加评论,请登录

Austin Spivey的更多文章

  • How to Un-pivot a Table in Qlik Sense

    How to Un-pivot a Table in Qlik Sense

    One of my biggest annoyances when it comes to cleaning data is needing to make a transformation that is seemingly super…

  • Encoding strings in Base64 using Qlik script

    Encoding strings in Base64 using Qlik script

    I recently had a project requirement that involved needing to encode a few strings in Base64 to be sent over a REST API…

    2 条评论
  • Making My Own Website Using GitHub Pages

    Making My Own Website Using GitHub Pages

    Note: This article was originally posted to Medium and can be found here. At this point in my job search, I think it’s…

    1 条评论
  • My Foray into Data Visualization using Qlik Sense

    My Foray into Data Visualization using Qlik Sense

    Note: This article was originally posted to Medium and can be found here. For this post, I am going to detail my…

    1 条评论

社区洞察

其他会员也浏览了