advertools SEO Crawler - Analytics UI
OR: How to interactively explore and analyze large datasets with Plotly's Dash and Apache Parquet
First step in creating a no-code UI for the analytics part of the advertools crawler.
Code repo:
A crawl dataset typically has 100-200 columns. You usually want to analyze only a small subset of those columns, and loading them all into memory can make the app unnecessarily slow.
Once you convert the dataset to the Parquet format, you can load only the column(s) that you want.
Parquet also allows you to load only metadata about your file/dataset. So, here's how the app works:
Assumptions:
You can easily add a simple callback to upload and convert to parquet if these assumptions are not met.
What the app does:
Here we have a short summary about the column's data, as well as two charts.
The advertools crawler places multiple elements that exist on the same page in the same cell, separated by two @ signs, e.g. "val_1@@val_2@@val_3".
It is useful to know how many pages have zero elements. In some cases that is a problem, and in others it isn't. The histogram on the left shows ~500 pages with no <title> tag. The majority of pages (~2,500) have one, and some pages have two or three title tags. You might want to look into those pages.
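The per-page counts behind such a histogram can be derived from the "@@" separator. This is a rough sketch with made-up values, not necessarily how the app computes them: it counts the separators in each cell, adds one, and treats a missing cell as zero elements.

```python
import pandas as pd

# Hypothetical "title" column: one cell per page, multiple values joined
# by "@@", and a missing value where the page has no such element.
title = pd.Series(["Home", "A@@B", None, "C@@D@@E", None, "F"])

# Elements per page = number of separators + 1; missing cells count as 0.
per_page = title.str.count("@@").add(1).fillna(0).astype(int)

# How many pages have 0, 1, 2, ... elements (the data behind a histogram).
distribution = per_page.value_counts().sort_index()
```

Here `distribution` maps each element count to the number of pages with that count, which can be passed to a bar or histogram chart.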
You also get an exportable table of the most commonly used text elements. This can help analyze the content of the website and what it focuses on the most. The summary card has some data as well, showing the number of elements, uniques, elements per page, and missing elements.
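One way to compute numbers like these is sketched below. It splits each cell on "@@", puts every element on its own row, and counts. The sample values are invented, and the exact definitions the app uses for each summary figure are assumptions here.

```python
import pandas as pd

# Hypothetical heading column: one cell per page, values joined by "@@".
h2 = pd.Series(["Pricing@@Contact", "Pricing", None, "Blog@@Pricing@@Contact"])

# One row per individual element; pages with no element are dropped.
elements = h2.dropna().str.split("@@").explode()

# Most commonly used text elements, ready to show in a table.
top_elements = elements.value_counts().head(10)

# Assumed definitions for the summary card figures.
summary = {
    "elements": len(elements),
    "unique": elements.nunique(),
    "elements_per_page": len(elements) / len(h2),
    "missing_pages": int(h2.isna().sum()),
}
```

Dividing by `len(h2)` counts pages without the element in the average. Dividing by the number of non-missing pages instead would be an equally reasonable choice.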
Again, those numbers could mean completely different things based on the column you are analyzing. Having no <h4> elements on a page is not an issue, but missing a <title> should definitely be fixed.
Here is the summary card of the link URLs on this website:
This is a very simple exploration of the (non)numeric columns. As a next step, more specialized analysis tools/charts will be created to analyze text, links, content, performance, and more.
Please share any feedback, suggested improvements, or bugs you find.
You can find the full code here: