
advertools v0.13.0 new features

Elias Dabbas

Digital Marketing meets Data Science ––> advertools

Published: February 11, 2022


#advertools v0.13.0 is out!


Main highlights:

crawl_headers function to crawl with the HEAD method

reverse_dns_lookup function to run reverse DNS lookups on a massive scale

New crawling rules for following links: include/exclude, regex & URL parameters

All new features can be explored in this notebook.

crawl_headers can be used as an efficient status code checker, because it gets data about the page without downloading its content.

This is very light on servers, especially for images.
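As a rough sketch of the status code check described above, the helpers below read the jsonlines file that crawl_headers writes and count the status codes in it. The file name, the helper names, and the "status" column are assumptions for illustration. The crawl itself needs network access, so it is wrapped in a function and not run here.

```python
import json
from collections import Counter


def crawl_status_codes(urls, output_file="headers_output.jl"):
    """Send HEAD requests with advertools and save the results (needs network)."""
    import advertools as adv  # lazy import: the helpers below work without it

    # The output file name is hypothetical; advertools expects a .jl extension.
    adv.crawl_headers(urls, output_file=output_file)
    return output_file


def read_jsonlines(path):
    """Load a .jl file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_status_codes(rows):
    """Count rows per status code; assumes each row has a "status" key."""
    return Counter(row.get("status") for row in rows)
```

After a crawl, count_status_codes(read_jsonlines("headers_output.jl")) would give a quick overview of how many URLs returned each status code.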

It also gets all available response headers and redirect information. Here are sample column names:

Yes, your page returns a 200 OK status code, but what about the components of the page?

links
images
hreflang
CSS files
script URLs
iframes
metatags with URLs (Twitter, OG)
JSON-LD elements containing links
etc...

Checking these components as well can enrich your crawling workflow.
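One way to approach this, as a minimal sketch: collect the component URLs from the rows of a regular crawl, then pass them to crawl_headers. The column names "links_url" and "img_src" and the "@@" separator for multi-value cells are assumptions about the crawl output, and the helper names are hypothetical.

```python
def split_multi(value, sep="@@"):
    """Split a delimiter-joined cell into a list; empty or missing gives []."""
    if not isinstance(value, str) or not value:
        return []
    return [item for item in value.split(sep) if item]


def component_urls(rows, columns=("links_url", "img_src")):
    """Collect unique URLs from the given columns, keeping first-seen order."""
    seen = {}
    for row in rows:
        for col in columns:
            for url in split_multi(row.get(col)):
                seen.setdefault(url, None)
    return list(seen)
```

The resulting list can be extended with other columns (script URLs, CSS files, and so on) by changing the columns argument.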

You can also analyze redirects.

It would be good to know if pages have been redirected.

This is especially important for external links, which might have been removed or redirected to another page or to a home page.

In this example (nytimes.com) 17% of the external links have been redirected.
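A percentage like this can be computed with a few lines, sketched here under the assumption that each crawled row carries a "redirect_urls" value only when the request was redirected. The column name and the helper are illustrative, not taken from the original notebook.

```python
def redirect_share(rows, redirect_col="redirect_urls"):
    """Return the fraction of rows whose redirect column is non-empty."""
    if not rows:
        return 0.0
    redirected = 0
    for row in rows:
        value = row.get(redirect_col)
        # Missing keys, None, and NaN (a float) all count as "not redirected".
        if isinstance(value, (str, list)) and value:
            redirected += 1
    return redirected / len(rows)
```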


If you get the Content-Length header you can analyze the size of pages and images without downloading them.

HTML size + image sizes can give an improved (but not exact) estimate for the full size:

You can also group images into size groups for a better understanding:
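A minimal sketch of such a grouping, assuming the header value is available under a column named "resp_headers_Content-Length". The column name, the size thresholds, and the group labels are arbitrary choices for illustration.

```python
from collections import Counter

# Upper bounds in bytes (decimal units) and the label for each group.
SIZE_GROUPS = [
    (10_000, "under 10 KB"),
    (100_000, "10-100 KB"),
    (1_000_000, "100 KB-1 MB"),
]


def size_group(num_bytes):
    """Return the label of the first group whose upper bound exceeds the size."""
    for upper, label in SIZE_GROUPS:
        if num_bytes < upper:
            return label
    return "1 MB or more"


def group_sizes(rows, col="resp_headers_Content-Length"):
    """Count rows per size group; unparseable or missing values are "unknown"."""
    counts = Counter()
    for row in rows:
        try:
            size = int(row.get(col))
        except (TypeError, ValueError):
            counts["unknown"] += 1
            continue
        counts[size_group(size)] += 1
    return counts
```

Servers do not always send Content-Length, which is why the sketch keeps an explicit "unknown" group instead of dropping those rows.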

reverse_dns_lookup is a new function that works like the "host" command, but for a very large number of IPs.

It takes a list of IPs that may contain duplicates (as you typically have in a log file), gets host information, and produces a counts report.

Here 22.8K IPs were finished in a little over a minute:
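The idea behind the function can be illustrated with the standard library alone: count the duplicated IPs, resolve each unique IP once in a thread pool, and report counts next to host names. This is a sketch of the concept, not advertools's implementation; the function names, report keys, and the injectable resolver argument are hypothetical. With advertools itself the call would be along the lines of adv.reverse_dns_lookup(ip_list).

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def lookup_host(ip, resolver=socket.gethostbyaddr):
    """Resolve one IP; return (ip, hostname, error) and never raise."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return ip, hostname, None
    except OSError as exc:  # socket.herror and socket.gaierror subclass OSError
        return ip, None, str(exc)


def reverse_dns_report(ip_list, resolver=socket.gethostbyaddr, max_workers=50):
    """Count duplicated IPs, resolve each unique IP once, sort by count."""
    counts = Counter(ip_list)
    unique_ips = list(counts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda ip: lookup_host(ip, resolver), unique_ips))
    report = [
        {"ip_address": ip, "count": counts[ip], "hostname": host, "errors": err}
        for ip, host, err in results
    ]
    return sorted(report, key=lambda r: r["count"], reverse=True)
```

Resolving only the unique IPs is what keeps the work manageable for log files, where a small number of addresses usually account for most of the requests.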

Crawling rules to control which links to follow:

exclude_url_params: Don't follow a link if it contains any of these parameters. Set it to True to exclude any URL that has any parameter at all.
include_url_params: Follow a link if it contains any of these parameters.
exclude_url_regex & include_url_regex: The full power and flexibility (and danger!) of regular expressions to determine which links to follow.
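To make the rules above concrete, here is a small standalone sketch of how such a decision could be made for a single URL. The parameter names match the ones listed, but the function itself is hypothetical, and the assumption that a link must pass every supplied rule is mine rather than something stated in the release notes.

```python
import re
from urllib.parse import parse_qs, urlsplit


def should_follow(url, exclude_url_params=None, include_url_params=None,
                  exclude_url_regex=None, include_url_regex=None):
    """Return True if the URL passes every rule that was supplied."""
    # Names of the query parameters present in the URL, e.g. {"page", "color"}.
    params = set(parse_qs(urlsplit(url).query, keep_blank_values=True))

    if exclude_url_params is True:
        if params:  # True means: skip any URL that has parameters at all
            return False
    elif exclude_url_params:
        if params & set(exclude_url_params):
            return False

    if include_url_params is not None and not params & set(include_url_params):
        return False
    if exclude_url_regex is not None and re.search(exclude_url_regex, url):
        return False
    if include_url_regex is not None and not re.search(include_url_regex, url):
        return False
    return True
```

In an actual crawl these options would be passed to the crawl function together with follow_links=True, for example exclude_url_params=["color"] to skip faceted navigation links.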

Check out all the new features in this notebook and see the updated documentation:

Happy to get any suggestions, feedback, bug reports, or issues.

Hope you like it!

pip install --upgrade advertools
