
Crawling and Parsing JSON-LD Data

Elias Dabbas

Digital Marketing meets Data Science –> advertools

Published: December 24, 2022


JSON-LD can be the best and easiest structured data to handle while crawling, provided the website uses it properly.

The #advertools #crawler extracts and parses JSON-LD by default; you don't have to do anything.

But you need to know a bit about the structure:

Column names:

All start with jsonld_ followed by the name of the key, e.g.

jsonld_@context

jsonld_@type

Nested items are also flattened, with a dot for each level of nesting (note more than one dot):

jsonld_mainEntity.description

jsonld_mainEntity.publisher.name

jsonld_mainEntity.mentions
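The dotted names follow the same convention that pandas.json_normalize uses. Here is a minimal sketch with a made-up JSON-LD object. The values are illustrative, and adding the prefix with add_prefix is only an assumption used to mimic the crawler's column naming, not its actual implementation:

```python
import pandas as pd

# Hypothetical JSON-LD object, as it might appear in a
# <script type="application/ld+json"> tag on a page
jsonld = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "mainEntity": {
        "description": "An example article",
        "publisher": {"name": "Example Publisher"},
    },
}

# json_normalize flattens nested dicts, joining the keys with a dot by default;
# add_prefix then prepends "jsonld_" to every column name
flat = pd.json_normalize(jsonld).add_prefix("jsonld_")
columns = flat.columns.tolist()
print(columns)
```

Each level of nesting adds one dot, so mainEntity -> publisher -> name ends up as jsonld_mainEntity.publisher.name.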

Multiple scripts on the same page:

Additional JSON-LD scripts get a number.

1st script:

jsonld_name

2nd script:

jsonld_1_name

3rd script:

jsonld_2_name

etc.

This lets you know how scripts are organized.
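To make the numbering concrete, here is a small sketch that reproduces the naming scheme described above. The flatten_scripts helper and the sample scripts are hypothetical and only imitate the column names; they are not how the crawler is implemented:

```python
import pandas as pd

def flatten_scripts(scripts):
    """Flatten a list of JSON-LD objects into one row (hypothetical helper).

    The first script gets the plain "jsonld_" prefix, and each additional
    script gets its position as a number: "jsonld_1_", "jsonld_2_", etc.
    """
    row = {}
    for i, script in enumerate(scripts):
        prefix = "jsonld_" if i == 0 else f"jsonld_{i}_"
        flat = pd.json_normalize(script).iloc[0].to_dict()
        for key, value in flat.items():
            row[prefix + key] = value
    return row

# Three made-up scripts found on the same page
scripts = [
    {"@type": "Organization", "name": "Example Org"},
    {"@type": "WebSite", "name": "Example Site"},
    {"@type": "BreadcrumbList", "name": "Example Breadcrumbs"},
]

row = flatten_scripts(scripts)
df = pd.DataFrame([row])

# Select only the columns that came from the 2nd script on the page
second = df.filter(regex=r"^jsonld_1_")
print(second.columns.tolist())
```

The same filter pattern (for example r"^jsonld_" for all JSON-LD columns) should work on a crawl DataFrame, assuming the columns are named as described above.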

JSON-LD is flexible & allows for unlimited configs & nesting ...

Scripts that contain a list of items:

You might have sameAs containing a list of entities (Twitter profile, website, Facebook, etc.), for example.

In this case the values are saved as a list in a single column.

Sometimes sub-items are themselves nested or in lists!

You should make friends with pandas.json_normalize
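As a rough sketch of what that looks like in practice, the example below uses a mocked-up crawl DataFrame. The URLs and values are invented, and the column names are assumed to follow the naming described above. It unpacks a list of plain values with explode and a list of sub-items with json_normalize:

```python
import pandas as pd

# Mock crawl output: one row per URL, with list-valued JSON-LD columns
# (hypothetical data for illustration)
crawl_df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "jsonld_sameAs": [
        ["https://twitter.com/example", "https://facebook.com/example"],
        ["https://example.org"],
    ],
    "jsonld_mainEntity.mentions": [
        [{"@type": "Thing", "name": "Python"}, {"@type": "Thing", "name": "pandas"}],
        [{"@type": "Thing", "name": "JSON-LD"}],
    ],
})

# A list of plain values: explode gives one row per list element
same_as = crawl_df[["url", "jsonld_sameAs"]].explode("jsonld_sameAs", ignore_index=True)

# A list of dicts: json_normalize with record_path gives one row per dict,
# and meta carries the page URL along. Rows without the item are dropped
# first, because record_path expects a list in every record.
records = crawl_df.dropna(subset=["jsonld_mainEntity.mentions"]).to_dict("records")
mentions = pd.json_normalize(
    records,
    record_path="jsonld_mainEntity.mentions",
    meta=["url"],
)

print(same_as)
print(mentions)
```

Both results are in long format, one row per item, with the url column telling you which page each item came from.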

Install #advertools:

python3 -m pip install advertools

json_normalize documentation:

https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html…

Enjoy!
