Crawling and Parsing JSON-LD Data
When a website implements it properly, JSON-LD can be the best and easiest data to handle while crawling.
The #advertools #crawler extracts and parses JSON-LD by default; you don't have to do anything.
But you need to know a bit about the structure:
Column names:
All start with jsonld_ followed by the name of the entity e.g.
jsonld_@context
jsonld_@type
Nested items are also flattened, with dots separating the levels (deeper nesting gives more than one dot):
jsonld_mainEntity.description
jsonld_mainEntity.mentions
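To get a feel for the naming, here is a minimal sketch that flattens a made-up JSON-LD object with pandas.json_normalize and adds the jsonld_ prefix by hand. The object and its values are hypothetical, and this only imitates the column names; it is not necessarily how the crawler builds them internally.

```python
import pandas as pd

# Hypothetical JSON-LD object, as it might appear in a page's script tag.
jsonld = {
    "@context": "https://schema.org",
    "@type": "Article",
    "mainEntity": {
        "description": "An example description",
        "author": {"name": "Jane Doe"},
    },
}

# json_normalize flattens nested dicts and joins the keys with a dot by default.
flat = pd.json_normalize(jsonld)

# Add the prefix by hand to imitate the crawl column names (illustration only).
flat = flat.add_prefix("jsonld_")

columns = sorted(flat.columns)
print(columns)
```

Note how the doubly nested author name ends up with two dots: jsonld_mainEntity.author.name.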
Multiple scripts on the same page:
Additional JSON-LD scripts get a number.
1st script:
jsonld_name
2nd script:
jsonld_1_name
3rd script:
jsonld_2_name
etc.
This lets you know how scripts are organized.
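A rough sketch of how you could pick out the columns that belong to one script, based on the numbering above. The DataFrame and the script_columns helper are made up for illustration; they are not part of advertools.

```python
import re
import pandas as pd

# Hypothetical crawl output with columns coming from three JSON-LD scripts.
crawl_df = pd.DataFrame({
    "url": ["https://example.com"],
    "jsonld_@type": ["Organization"],
    "jsonld_name": ["Example Inc"],
    "jsonld_1_@type": ["WebSite"],
    "jsonld_1_name": ["Example"],
    "jsonld_2_name": ["Example Blog"],
})

def script_columns(df, index=0):
    """Return the JSON-LD column names that belong to one script.

    index 0 is the first script (no number), 1 is the second (jsonld_1_), etc.
    Assumption: no key in the first script itself starts with digits
    followed by an underscore.
    """
    if index == 0:
        # First script: "jsonld_" NOT followed by "<digits>_".
        pattern = re.compile(r"^jsonld_(?!\d+_)")
    else:
        pattern = re.compile(rf"^jsonld_{index}_")
    return [col for col in df.columns if pattern.search(col)]

first = script_columns(crawl_df, 0)
second = script_columns(crawl_df, 1)
third = script_columns(crawl_df, 2)
print(first, second, third)
```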
JSON-LD is flexible & allows for unlimited configs & nesting ...
Scripts that contain a list of items:
For example, you might have sameAs containing a list of entities (Twitter profile, website, Facebook page, etc.).
In this case these are saved as a list.
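If you want one row per list item, DataFrame.explode is one option. A small sketch, assuming the column is called jsonld_sameAs and holds Python lists (the URLs are invented):

```python
import pandas as pd

# Hypothetical crawl output where sameAs was saved as a list per page.
crawl_df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "jsonld_sameAs": [
        ["https://twitter.com/example", "https://facebook.com/example"],
        ["https://example.org"],
    ],
})

# One row per (page, profile) pair; the url value is repeated for each item.
same_as = crawl_df[["url", "jsonld_sameAs"]].explode(
    "jsonld_sameAs", ignore_index=True
)
print(same_as)
```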
Sometimes sub-items are themselves nested or in lists!
You should make friends with pandas.json_normalize
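Here is a small sketch of what that friendship can look like: a made-up FAQ object where mainEntity is a list of nested items. The record_path and meta parameters are real json_normalize parameters; the data is hypothetical.

```python
import pandas as pd

# Hypothetical JSON-LD object: a list of items, each with a nested sub-item.
data = {
    "@type": "FAQPage",
    "url": "https://example.com/faq",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is JSON-LD?",
            "acceptedAnswer": {"@type": "Answer", "text": "JSON for linked data."},
        },
        {
            "@type": "Question",
            "name": "Is it hard to parse?",
            "acceptedAnswer": {"@type": "Answer", "text": "Not really."},
        },
    ],
}

# record_path: which list to turn into rows.
# meta: top-level keys to repeat on every row.
questions = pd.json_normalize(data, record_path="mainEntity", meta=["url"])
print(questions[["name", "acceptedAnswer.text", "url"]])
```

Each question becomes a row, the nested answer is flattened with a dot, and the page url is repeated on every row.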
Install #advertools:
python3 -m pip install advertools
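Once installed, a crawl plus a quick look at the JSON-LD columns could go roughly like this. The crawl_and_load and jsonld_only helpers, the file name, and the sample DataFrame are made up for illustration; adv.crawl and pd.read_json(lines=True) are the real calls. Only the helper on the sample data runs here, since the crawl itself needs network access.

```python
import pandas as pd

def crawl_and_load(url, output_file="crawl_output.jl"):
    """Crawl a single URL and load the output file into a DataFrame.

    Not called in this sketch because it needs network access.
    """
    import advertools as adv  # imported here so the rest runs without it
    adv.crawl(url, output_file, follow_links=False)
    # The crawl output is a jsonlines file: one JSON object per line.
    return pd.read_json(output_file, lines=True)

def jsonld_only(df):
    """Keep the url column plus every column that starts with jsonld_."""
    return df.filter(regex=r"^(url$|jsonld_)")

# Hypothetical stand-in for a crawl DataFrame.
sample = pd.DataFrame({
    "url": ["https://example.com"],
    "title": ["Example"],
    "jsonld_@type": ["Organization"],
    "jsonld_1_name": ["Example"],
})

jsonld_df = jsonld_only(sample)
print(jsonld_df.columns.tolist())
```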
json_normalize documentation:
Enjoy!