advertools v0.13.0 new features

#advertools v0.13.0 is out!

Main highlights:

  • crawl_headers: a function to crawl with the HEAD method
  • reverse_dns_lookup: reverse DNS lookups on a massive scale
  • New crawling rules for following links: include/exclude by regex and URL parameters

All new features can be explored in this notebook.

crawl_headers can be used as an efficient status code checker, because it gets data about the page without downloading its content.

This is very light on servers, especially for images.

It also gets all available response headers and redirect information.
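
As a minimal sketch of the status-check workflow, the snippet below sends HEAD requests with crawl_headers and counts the status codes that come back. The helper names (crawl_status_codes, count_status_codes) and the output file name are hypothetical, and it assumes the output contains a "status" column.

```python
from collections import Counter


def crawl_status_codes(url_list, output_file="headers_output.jl"):
    """Send HEAD requests to url_list and return the results as a DataFrame.

    Needs advertools and pandas installed, and makes network requests
    when called. The output file name is an arbitrary example; it has
    to end with ".jl" (jsonlines).
    """
    import advertools as adv
    import pandas as pd

    # One JSON object per URL is written to the jsonlines file.
    adv.crawl_headers(url_list, output_file)
    return pd.read_json(output_file, lines=True)


def count_status_codes(records):
    """Count responses per status code from an iterable of row dicts.

    Assumes each row has a "status" key.
    """
    return Counter(row["status"] for row in records)
```

With a crawl result in hand, something like count_status_codes(df.to_dict("records")) would give a quick overview of 200s, 404s, and so on.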

Yes, your page returns a 200 OK status code, but what about the components of the page?

  • links
  • images
  • hreflang
  • CSS files
  • script URLs
  • iframes
  • metatags with URLs (Twitter, OG)
  • JSON-LD elements containing links
  • etc...

Checking the headers of these component URLs can enrich your crawling workflow.
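
One possible way to do this is to collect the component URLs from a regular crawl and pass them to crawl_headers. The sketch below assumes the crawl output stores multiple values in one cell joined by "@@" and uses column names such as "img_src" or "links_url"; the helper collect_component_urls is hypothetical.

```python
def collect_component_urls(cells, sep="@@"):
    """Split delimited URL cells and return unique URLs in first-seen order.

    `cells` is any iterable of cell values, for example a DataFrame column
    from a crawl. Non-string values (missing data) are skipped.
    """
    seen = {}
    for cell in cells:
        if not isinstance(cell, str):
            continue
        for url in cell.split(sep):
            url = url.strip()
            if url:
                # dict keys keep insertion order and drop duplicates
                seen.setdefault(url, None)
    return list(seen)


# Possible usage (needs advertools and an existing crawl DataFrame):
# import advertools as adv
# image_urls = collect_component_urls(crawl_df["img_src"])
# adv.crawl_headers(image_urls, "image_headers.jl")
```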

You can also analyze redirects.

It is useful to know which pages have been redirected.

This is especially important for external links, which might have been removed or redirected to another page or to a home page.

In this example (nytimes.com), 17% of the external links had been redirected.
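
A share like this could be computed along the following lines. This is a sketch that assumes the crawl_headers output has a "redirect_urls" column that is empty or missing for URLs that were not redirected; the function name redirect_share is hypothetical.

```python
def redirect_share(records, column="redirect_urls"):
    """Return the fraction of rows with a non-empty redirect value.

    `records` is an iterable of row dicts. A row counts as redirected
    when `column` holds a non-empty string or list; missing values
    (None, NaN) count as not redirected.
    """
    total = 0
    redirected = 0
    for row in records:
        total += 1
        value = row.get(column)
        if isinstance(value, (str, list, tuple)) and len(value) > 0:
            redirected += 1
    return redirected / total if total else 0.0
```

For a DataFrame of results, redirect_share(df.to_dict("records")) would return a value between 0 and 1.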

If the response includes the Content-Length header, you can analyze the size of pages and images without downloading them.

HTML size plus image sizes gives an improved (but not exact) estimate of the full page size.
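
The arithmetic can be sketched as follows. The helper names are hypothetical, and the header values are assumed to come from a column such as "resp_headers_Content-Length" in the crawl_headers output. Servers do not always send this header, so the sketch also reports how many image sizes were unknown.

```python
def parse_content_length(value):
    """Convert a Content-Length header value to int, or None if unusable."""
    try:
        size = int(value)
    except (TypeError, ValueError):
        return None
    return size if size >= 0 else None


def estimate_page_bytes(html_length, image_lengths):
    """Estimate page weight as HTML size plus all known image sizes.

    Returns (estimated_bytes, number_of_images_with_unknown_size).
    """
    html_size = parse_content_length(html_length) or 0
    image_sizes = [parse_content_length(v) for v in image_lengths]
    known = [s for s in image_sizes if s is not None]
    return html_size + sum(known), len(image_sizes) - len(known)
```

The estimate stays a lower bound because CSS, scripts, fonts, and images without a Content-Length header are not counted.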

You can also group images into size groups for a better understanding.
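
A simple way to build such groups is to bin the byte sizes. The boundaries and labels below are arbitrary assumptions chosen for illustration, not the ones used in the original analysis.

```python
from bisect import bisect_right
from collections import Counter

KB = 1024
# Upper bounds (exclusive) of each group except the last one.
BOUNDS = [10 * KB, 50 * KB, 100 * KB, 500 * KB]
LABELS = ["<10 KB", "10-50 KB", "50-100 KB", "100-500 KB", ">=500 KB"]


def size_group(num_bytes):
    """Return the label of the size group that num_bytes falls into."""
    return LABELS[bisect_right(BOUNDS, num_bytes)]


def count_size_groups(sizes):
    """Count how many sizes fall into each group."""
    return Counter(size_group(size) for size in sizes)
```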

reverse_dns_lookup is a new function that works like the "host" command, but for a large number of IPs.

It takes a list of IPs that may contain duplicates (as you typically have in a log file), gets host information, and produces a counts report.

In this example, 22.8K IPs were processed in a little over a minute.
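
To illustrate the idea (not the library's actual implementation), here is a standard-library sketch that counts duplicate IPs, resolves each unique IP once in a thread pool, and returns rows sorted by count. The names lookup_host and reverse_dns_report are hypothetical; with advertools installed, the equivalent call would be adv.reverse_dns_lookup(ip_list).

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def lookup_host(ip):
    """Return the hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def reverse_dns_report(ip_list, resolver=lookup_host, max_workers=50):
    """Count IPs and resolve each unique IP once, most frequent first."""
    counts = Counter(ip_list)
    unique_ips = list(counts)
    # Lookups are I/O-bound, so threads let many of them wait in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hostnames = list(pool.map(resolver, unique_ips))
    rows = [
        {"ip_address": ip, "count": counts[ip], "hostname": host}
        for ip, host in zip(unique_ips, hostnames)
    ]
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows
```

Resolving only the unique IPs is what keeps this fast on log files, where a few addresses usually account for most of the lines.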

Crawling rules to control which links to follow:

  • exclude_url_params: Don't follow a link if it contains any of these parameters. Set it to True to exclude any URL that has any parameter at all.
  • include_url_params: Follow a link only if it contains any of these parameters.
  • exclude_url_regex & include_url_regex: The full power and flexibility (and danger!) of regular expressions to determine which links to follow.
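
The rules above can be approximated with a small predicate. This is an illustrative sketch of how such rules might behave, not the crawler's internal logic; the function should_follow and the order in which the rules are checked are assumptions.

```python
import re
from urllib.parse import parse_qs, urlsplit


def should_follow(url, exclude_url_params=None, include_url_params=None,
                  exclude_url_regex=None, include_url_regex=None):
    """Decide whether a link would be followed under the given rules."""
    params = set(parse_qs(urlsplit(url).query, keep_blank_values=True))

    # Exclusion rules are checked first (an assumption of this sketch).
    if exclude_url_params is True:
        if params:
            return False
    elif exclude_url_params and params & set(exclude_url_params):
        return False
    if exclude_url_regex and re.search(exclude_url_regex, url):
        return False

    # Inclusion rules: the URL has to match each rule that is set.
    if include_url_params and not params & set(include_url_params):
        return False
    if include_url_regex and not re.search(include_url_regex, url):
        return False
    return True


# With advertools installed, the same parameter names are passed to crawl:
# import advertools as adv
# adv.crawl("https://example.com", "output.jl", follow_links=True,
#           exclude_url_params=["price"], include_url_regex="/blog/")
```

Testing a pattern against a handful of known URLs like this before starting a large crawl is a cheap way to avoid the "danger" part of regular expressions.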

Check out all the new features in this notebook and see the updated documentation.

Happy to get any suggestions, feedback, bug reports, or issues.

Hope you like it!

pip install --upgrade advertools


