Crawling and Parsing JSON-LD Data

When a website implements it properly, JSON-LD can be the best and easiest data to handle while crawling.

The #advertools #crawler extracts and parses JSON-LD by default; you don't have to do anything.

But you need to know a bit about the structure:

Column names:

All start with jsonld_ followed by the name of the key, e.g.

jsonld_@context

jsonld_@type
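
To see these columns in practice, here is a minimal sketch. It assumes you crawl with adv.crawl and read the resulting .jl file with pandas; the function names crawl_and_load and jsonld_columns, and the output file name, are hypothetical helpers made up for illustration.

```python
import pandas as pd


def crawl_and_load(urls, output_file="crawl_output.jl"):
    """Crawl `urls` with advertools and load the output into a DataFrame.

    Hypothetical helper: the output file name is an assumption.
    """
    import advertools as adv  # imported here so the rest works without it

    # JSON-LD extraction happens by default, no extra settings needed
    adv.crawl(urls, output_file, follow_links=False)
    return pd.read_json(output_file, lines=True)


def jsonld_columns(df):
    """Return the names of all columns that hold JSON-LD data."""
    return [col for col in df.columns if col.startswith("jsonld_")]
```

Something like jsonld_columns(crawl_and_load(["https://example.com"])) would then list whatever JSON-LD columns the crawled pages produced.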

Nested items are also flattened, with a dot between each level (note that there can be more than one dot):

jsonld_mainEntity.description

jsonld_mainEntity.publisher.name

jsonld_mainEntity.mentions
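
The dotted names follow the same pattern that pandas.json_normalize produces. The sketch below uses a made-up script (all values are invented) and adds the jsonld_ prefix by hand to mimic the column names above; it is not necessarily how the crawler does it internally.

```python
import pandas as pd

# A made-up JSON-LD script with two levels of nesting
script = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "mainEntity": {
        "description": "An example description",
        "publisher": {"name": "Example Publisher"},
        "mentions": ["Python", "pandas"],
    },
}

# Nested dicts become dotted column names; lists are kept as lists
flat = pd.json_normalize(script).add_prefix("jsonld_")
print(flat.columns.tolist())
```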

Multiple scripts on the same page:

Columns from additional JSON-LD scripts get a number after the jsonld_ prefix, starting at 1 for the second script.

1st script:

jsonld_name

2nd script:

jsonld_1_name

3rd script:

jsonld_2_name

etc.

This lets you know how scripts are organized.
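
One way to use this numbering is to group columns by the script they came from. This is a rough sketch: script_position and group_by_script are hypothetical helpers, and it assumes no key name itself starts with digits followed by an underscore.

```python
import re
from collections import defaultdict

# Matches the number that additional scripts get, e.g. jsonld_1_name
SCRIPT_NUMBER = re.compile(r"^jsonld_(\d+)_")


def script_position(column):
    """Return 0 for the first script's columns, 1 for jsonld_1_..., etc."""
    match = SCRIPT_NUMBER.match(column)
    return int(match.group(1)) if match else 0


def group_by_script(columns):
    """Map each script position to its JSON-LD column names."""
    groups = defaultdict(list)
    for col in columns:
        if col.startswith("jsonld_"):  # skip non-JSON-LD columns
            groups[script_position(col)].append(col)
    return dict(groups)


columns = ["url", "jsonld_name", "jsonld_@type", "jsonld_1_name", "jsonld_2_name"]
groups = group_by_script(columns)
```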

JSON-LD is flexible and allows for unlimited configurations and levels of nesting.

Scripts that contain a list of items:

For example, you might have sameAs containing a list of entities (Twitter profile, website, Facebook page, etc.).

In this case these are saved as a list.
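
If you want one row per list item, DataFrame.explode is one option. A small sketch with made-up data, assuming the jsonld_sameAs column holds Python lists:

```python
import pandas as pd

# Made-up crawl output: one page whose sameAs holds a list of profiles
crawl_df = pd.DataFrame({
    "url": ["https://example.com/"],
    "jsonld_sameAs": [[
        "https://twitter.com/example",
        "https://www.facebook.com/example",
        "https://example.com/",
    ]],
})

# One row per list item; the url value is repeated for each
same_as = crawl_df.explode("jsonld_sameAs", ignore_index=True)
print(same_as)
```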

Sometimes sub-items are themselves nested or in lists!

You should make friends with pandas.json_normalize
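
As a quick illustration of why: a list of nested sub-items can be turned into a table with the record_path and meta parameters. The FAQ data below is invented, and the page_ prefix is an arbitrary choice to keep the page-level @type from clashing with the @type of each question.

```python
import pandas as pd

# Made-up FAQPage script: mainEntity is a list of nested items
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is JSON-LD?",
            "acceptedAnswer": {"@type": "Answer", "text": "A JSON format for linked data."},
        },
        {
            "@type": "Question",
            "name": "Do I need to configure anything?",
            "acceptedAnswer": {"@type": "Answer", "text": "No, it is extracted by default."},
        },
    ],
}

# One row per question; nested answers become dotted columns,
# and the page-level @type is repeated on every row as page_@type
questions = pd.json_normalize(
    faq, record_path="mainEntity", meta=["@type"], meta_prefix="page_"
)
print(questions.columns.tolist())
```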

Install #advertools:

python3 -m pip install advertools

json_normalize documentation:

https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html…

Enjoy!
