advertools SEO Crawler - Analytics UI

OR: How to interactively explore and analyze large datasets with Plotly's Dash and Apache Parquet

This is a first step toward a no-code UI for the analytics part of the advertools crawler.

Code repo:

A crawl dataset typically has 100-200 columns. You usually want to analyze only a tiny subset of them, and loading them all into memory can make the app unnecessarily slow.

Once you convert the dataset to the parquet format, you can load only the column(s) that you want.

Parquet also allows you to load only metadata about your file/dataset. So, here's how the app works:

Assumptions:

  1. Your dataset is in parquet format (CSV can also work, but not as efficiently)
  2. The dataset is on the same computer as your Dash app (the app runs locally), or it has already been uploaded to the server, in which case you can use this server file explorer to let users browse those files:

You can easily add a simple callback to upload and convert to parquet if these assumptions are not met.

What the app does:

  • The user "uploads" the dataset. Actually what they are doing is simply locating the file of interest, so no data is really loaded.
  • Once the file is selected, the relevant metadata is loaded (column names and their types). Those columns are split in two (numeric and non-numeric), and they populate the options for the dropdowns of the respective data types. Still no data is loaded to memory, only some tiny metadata.


  • Once the user selects a numeric column, only that column is loaded to memory, and the relevant charts/summary are displayed:


Here we have a short summary about the column's data, as well as two charts.

  1. The histogram on the left shows how the data are distributed
  2. The cumulative distribution chart (ECDF) on the right shows what share of pages falls at or below any given value. For example, 80% of my pages are smaller than 500 bytes, or 75% of my pages load in 3 seconds or less (therefore 25% of pages take more than 3 seconds to load)
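To make the ECDF reading concrete, here is a small sketch with NumPy. The page sizes are invented, and `ecdf_at` is a hypothetical helper rather than part of the app.

```python
import numpy as np

# Hypothetical page sizes in bytes for ten crawled pages.
sizes = np.array([120, 250, 300, 310, 400, 450, 480, 490, 900, 1500])


def ecdf_at(values: np.ndarray, threshold: float) -> float:
    """Fraction of values less than or equal to the threshold."""
    return float(np.mean(values <= threshold))


# Full ECDF curve: sorted values on x, cumulative share of pages on y.
x = np.sort(sizes)
y = np.arange(1, len(x) + 1) / len(x)

share_small = ecdf_at(sizes, 500)
print(share_small)      # share of pages at or below 500 bytes
print(1 - share_small)  # share of pages above 500 bytes
```

Each point on the curve answers the question "what fraction of pages has a value at or below x?", which is why the complement gives the share of pages above that value.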

  • Selecting a text (non-numeric) column shows a different summary


The advertools crawler places multiple elements that exist on the same page in the same cell, separated by two @ signs, e.g. "val_1@@val_2@@val_3".

It is useful to know how many pages have zero elements. In some cases that is a problem, and in others it isn't. The histogram on the left shows ~500 pages with no <title> tag. The majority of pages (~2,500) have one, and some pages have two or three title tags. You might want to look into those pages.
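Counting elements per page from "@@"-separated cells could look like the following sketch with pandas. The sample values are invented, and counting a missing cell as zero elements is an assumption of this example.

```python
import pandas as pd

# Hypothetical title column: missing, single, and multiple titles per page.
title = pd.Series([None, "Home", "Home@@Welcome", "About", "A@@B@@C"])

# Split on the "@@" separator and count the pieces; missing cells count as zero.
elements_per_page = title.str.split("@@").str.len().fillna(0).astype(int)

# Number of pages having 0, 1, 2, ... elements (the bars of the histogram).
pages_by_count = elements_per_page.value_counts().sort_index()
print(pages_by_count.to_dict())
```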

You also get a table of the most commonly used text elements, in the form of an exportable table. This can help analyze the content of the website, and what it focuses on the most. The summary card has some data as well, showing the number of elements, uniques, elements per page, and missing elements.
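A rough sketch of how such a table and summary might be computed with pandas follows. The sample data and the exact definitions of the four summary figures are assumptions made for illustration; the app may define them differently.

```python
import pandas as pd

# Hypothetical heading column with "@@"-separated elements per page.
h2 = pd.Series([None, "Pricing@@Contact", "Pricing", "Pricing@@About@@Contact"])

# One row per individual element; pages with no value are dropped.
elements = h2.dropna().str.split("@@").explode()

# Table of the most frequent elements, most common first.
top_elements = (
    elements.value_counts()
    .rename_axis("element")
    .reset_index(name="count")
)

summary = {
    "elements": len(elements),
    "unique": int(elements.nunique()),
    "elements_per_page": len(elements) / len(h2),
    "pages_missing": int(h2.isna().sum()),
}
print(top_elements.head(10))
print(summary)
```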

Again, those numbers could mean completely different things based on the column you are analyzing. Having no <h4> elements on a page is not an issue, but missing a <title> should definitely be fixed.

Here is the summary card of the link URLs on this website:


This is a very simple exploration of the (non)numeric columns. As a next step, more specialized analysis tools/charts will be created to analyze text, links, content, performance, and more.

Please share any feedback, suggested improvements, or bug reports.

You can find the full code here:

Suresh Kumar Gondi

Technical SEO Consultant | Semantic SEO Practitioner

2 years ago

Thanks, professor. Can we save the plots as SVG as well? They scale to any size without pixelation.

Christopher Gutknecht

Data & Analytics | Leading the data team at Bergzeit

2 years ago

Wow amazing, thanks!
