advertools v0.13.0 new features
#advertools v0.13.0 is out!
Main highlights:
- crawl_headers function to crawl with the HEAD method
- reverse_dns_lookup to run reverse DNS lookups on a massive scale
- New crawling rules for following links: include/exclude by regex and URL parameters
All new features can be explored in this notebook.
crawl_headers can be used as an efficient status code checker, because it gets data about the page without downloading its content.
This is very light on servers, especially for images.
It also gets all available response headers and redirect information. Here are sample column names:
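A minimal sketch of the status-check workflow, assuming advertools is installed and that the crawl output is a JSON-lines (.jl) file with a `status` field per URL. The helper names (`run_header_crawl`, `read_jsonlines`, `status_counts`) and the output file name are hypothetical, chosen for illustration only.

```python
import json
from collections import Counter


def run_header_crawl(url_list, output_file="headers_output.jl"):
    """Send HEAD requests to each URL and save the results as JSON lines."""
    # Imported here so the helpers below also work without advertools installed.
    import advertools as adv

    adv.crawl_headers(url_list, output_file)


def read_jsonlines(path):
    """Read a .jl file into a list of dicts, one per crawled URL."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def status_counts(rows):
    """Count responses per status code (assumes a `status` field per row)."""
    return Counter(row.get("status") for row in rows)
```

After running `run_header_crawl`, something like `status_counts(read_jsonlines("headers_output.jl"))` would give a quick overview of status codes. If you use pandas, `pd.read_json(path, lines=True)` loads the same file into a DataFrame.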
Yes, your page returns a 200 OK status code, but what about the components of the page?
Checking the headers of those components (images, for example) can enrich your crawling workflow.
You can also analyze redirects.
It would be good to know if pages have been redirected.
This is especially important for external links, which might have been removed, or redirected to another page or to a home page.
In this example (nytimes.com) 17% of the external links have been redirected.
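A rough sketch of how such a redirect share could be computed from crawl rows. It assumes each row is a dict where `url` holds the final URL and a `redirect_urls` field is present only for redirected requests; both field names and the helper names are assumptions, so check them against your own output.

```python
from urllib.parse import urlsplit


def is_home_page(url):
    """True when the URL has no path beyond the site root."""
    return urlsplit(url).path in ("", "/")


def redirect_summary(rows, redirect_field="redirect_urls"):
    """Share of redirected rows, and how many of them ended on a home page."""
    redirected = [row for row in rows if row.get(redirect_field)]
    share = len(redirected) / len(rows) if rows else 0.0
    to_home = sum(1 for row in redirected if is_home_page(row["url"]))
    return {"redirected_share": share, "redirected_to_home": to_home}
```

Links that redirect to a home page are often worth a closer look, since the original content has probably been removed.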
If you get the Content-Length header you can analyze the size of pages/images, without downloading them.
HTML size + image sizes can give an improved (but not exact) estimate for the full size:
You can also group images into size groups for a better understanding:
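One possible way to do the size estimate and the grouping, as a sketch. The bin boundaries and labels below are arbitrary choices for illustration, and the sizes are assumed to be Content-Length values already converted to integers (bytes).

```python
from collections import Counter

# Upper bounds in bytes (exclusive) with a label for each size group.
SIZE_BINS = [
    (10_000, "under 10 KB"),
    (100_000, "10-100 KB"),
    (1_000_000, "100 KB-1 MB"),
]


def size_group(num_bytes):
    """Return the label of the first bin whose upper bound exceeds the size."""
    for upper, label in SIZE_BINS:
        if num_bytes < upper:
            return label
    return "1 MB or more"


def estimate_page_bytes(html_bytes, image_bytes):
    """HTML size plus image sizes: an improved but not exact page size."""
    return html_bytes + sum(image_bytes)


def group_counts(sizes):
    """Count how many sizes fall into each group."""
    return Counter(size_group(size) for size in sizes)
```

The estimate still ignores scripts, stylesheets, fonts and compression, which is why it is only approximate.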
reverse_dns_lookup is a new function that works like the "host" command, but for a very large number of IPs.
It takes a list of IPs with duplicates (as you typically have in a log file), gets host information, and produces a report of counts.
Here 22.8K IPs were processed in a little over a minute:
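With advertools the call itself is short, along the lines of `adv.reverse_dns_lookup(ip_list)`. To show the idea behind it (count duplicates, resolve each unique IP once, run the lookups concurrently), here is a standard-library sketch. It is not the advertools implementation; `reverse_dns_report`, `lookup_host` and the `resolve` parameter are hypothetical names.

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def lookup_host(ip, resolve=socket.gethostbyaddr):
    """Return the hostname for an IP, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolve(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None


def reverse_dns_report(ip_list, max_workers=50, resolve=socket.gethostbyaddr):
    """Count each IP, resolve unique IPs in threads, sort by count (descending)."""
    counts = Counter(ip_list)
    unique_ips = list(counts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hosts = list(pool.map(lambda ip: lookup_host(ip, resolve), unique_ips))
    report = [
        {"ip_address": ip, "count": counts[ip], "hostname": host}
        for ip, host in zip(unique_ips, hosts)
    ]
    return sorted(report, key=lambda row: row["count"], reverse=True)
```

Resolving only the unique IPs matters with log files, where a handful of addresses can account for most of the lines.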
Crawling rules to control which links to follow:
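The logic of such rules can be sketched with the standard library. This is an illustration of include/exclude filtering by regex and URL parameters, not the advertools code; `should_follow` and its parameter names are hypothetical. In advertools the rules are passed to the crawl function, assuming parameter names such as `include_url_regex`, `exclude_url_regex`, `include_url_params` and `exclude_url_params`; please verify them in the documentation for your version.

```python
import re
from urllib.parse import parse_qs, urlsplit


def should_follow(url, include_regex=None, exclude_regex=None, exclude_params=None):
    """Decide whether a discovered link should be followed."""
    # An exclude pattern always wins.
    if exclude_regex and re.search(exclude_regex, url):
        return False
    # With an include pattern, only matching URLs are followed.
    if include_regex and not re.search(include_regex, url):
        return False
    # Skip URLs that carry any of the unwanted query parameters.
    if exclude_params:
        params = parse_qs(urlsplit(url).query, keep_blank_values=True)
        if any(name in params for name in exclude_params):
            return False
    return True
```

For example, `should_follow(url, include_regex=r"/blog/", exclude_params=["utm_source"])` keeps a crawl inside the blog section and away from tracking URLs.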
Check out all the new features in this notebook and see the updated documentation:
Happy to get any suggestions, feedback, bug reports, or issues.
Hope you like it!
pip install --upgrade advertools