Using OPAL, Virtuoso, and OpenAI LLMs to Generate a Recipes Knowledge Graph via Web Crawling

This guide outlines the steps to integrate Large Language Model (LLM) batch operations into the Virtuoso crawler via the OpenLink AI Layer (OPAL). The goal is to generate a collection of RDF documents that can be accessed publicly or under access control, over HTTP/WebDAV or SPARQL. As a worked example, it creates RDF renditions of recipe pages from the BBC Good Food website by batching prompts sent to OpenAI's gpt-4o-mini.

Conceptual Overview

Sequence Flow Diagram

Step 1: Create Crawler Descriptor

Insert the crawler descriptor for the target website to define its crawling behavior.

INSERT SOFT WS.WS.VFS_SITE (
    VS_DESCR, VS_HOST, VS_URL, VS_INX, VS_OWN, VS_ROOT, VS_NEWER,
    VS_DEL, VS_FOLLOW, VS_NFOLLOW, VS_SRC, VS_OPTIONS, VS_METHOD,
    VS_OTHER, VS_OPAGE, VS_REDIRECT, VS_STORE, VS_UDATA,
    VS_DLOAD_META, VS_INST_ID, VS_EXTRACT_FN, VS_STORE_FN, VS_DEPTH,
    VS_CONVERT_HTML, VS_XPATH, VS_BOT, VS_IS_SITEMAP, VS_ACCEPT_RDF,
    VS_THREADS, VS_ROBOTS, VS_DELAY, VS_TIMEOUT, VS_HEADERS
) VALUES (
    'BBC Good Food Recipes',
    'https://www.bbcgoodfood.com',
    '/recipes/',
    NULL,
    1,
    'www.bbcgoodfood.com/recipes/',
    '1900-01-01 00:00:00',
    NULL,
    'https://www.bbcgoodfood.com/recipes/%',
    '',
    NULL,
    '',
    NULL,
    NULL,
    NULL,
    1,
    1,
    serialize(vector('follow-meta', 0, 'store-type', 'dav')),
    0,
    NULL,
    'OAI.DBA.REGISTER_BATCH_ITEM',
    '',
    2,
    0,
    NULL,
    1,
    0,
    0,
    1,
    '',
    0.000000,
    NULL,
    NULL
);
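As a rough way to reason about which links the descriptor admits, the sketch below treats the VS_FOLLOW value as a SQL LIKE-style pattern, where % matches any run of characters. That reading is an assumption made for illustration, and the helper names are hypothetical; the crawler's own matching rules take precedence.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (% = any run, _ = one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

# Pattern copied from the VS_FOLLOW value in the descriptor above.
FOLLOW = like_to_regex("https://www.bbcgoodfood.com/recipes/%")

def would_follow(url: str) -> bool:
    """True if the whole URL matches the follow pattern (assumed LIKE semantics)."""
    return FOLLOW.fullmatch(url) is not None

print(would_follow("https://www.bbcgoodfood.com/recipes/air-fryer-apple-cinnamon-tarts"))
```

A check like this is handy for testing a handful of sample URLs before starting a crawl.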
        

Step 2: Initialize the Crawl

Run the following command to make the crawl job visible in Virtuoso's Conductor UI, then start the crawl from there and let it run to completion.

WS.WS.VFS_INIT_QUEUE();

[Figure: Virtuoso Content Crawler interface]

Imported BBC Good Food Recipe Pages

[Figure: Folder containing crawled documents]

Step 3: Create Task ID for the LLM Batch Task

Create a unique Task ID by utilizing a distinct label or generating a hash (e.g., SHA-256).

-- Generate Hash for use as unique Task ID which, for this example, returns: nvsO6XMexfveKk-r4Ovb4Ok8iRM2stxSMYTuI5LxHmA

SELECT encode_base64url(
    xenc_digest(concat('https://www.bbcgoodfood.com', 'www.bbcgoodfood.com/recipes/#this'), 'sha256')
);        
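The same idea can be sketched outside the database. The Python below hashes the concatenated host and root strings with SHA-256 and encodes the digest as unpadded base64url. It assumes that xenc_digest yields the raw digest bytes and that encode_base64url drops the trailing padding (the 43-character example above suggests this), so it may not reproduce the exact value Virtuoso returns. The function name make_task_id is hypothetical.

```python
import base64
import hashlib

def make_task_id(host: str, root: str) -> str:
    """SHA-256 over host + root, encoded as unpadded base64url (assumed format)."""
    digest = hashlib.sha256((host + root).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

# Same inputs as the concat(...) call in the SQL above.
task_id = make_task_id("https://www.bbcgoodfood.com",
                       "www.bbcgoodfood.com/recipes/#this")
print(task_id)
```

Because the ID is derived from the site host and root, re-running the step for the same site yields the same Task ID.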

Step 4: Create LLM Batch Task

Create the batch task from the generated Task ID and the batch task instructions file (/tmp/batch_sys.jsonl). The file's contents, shown below, form the basis of the batch sent to the gpt-4o-mini model:

{
  "task": "Generate a comprehensive representation of this information in JSON-LD using valid terms from https://schema.org. Set @base to {page_url}, denote terms using a hyperlink, and expand @context accordingly.",
  "guidelines": [ 
    { "rule": "Use @vocab appropriately." },
    { "rule": "If applicable, include at least 5 Questions and associated Answers, 10 Defined Terms, and 3 HowTos (where HowToSteps are labeled and then indexed using the schema:position attribute, with no need for itemListElement), and associate these entity types coherently with the main article." },
    { "rule": "Utilize annotation properties to enhance the representations of Questions, Answers, Defined Term Set, HowTos, and HowToSteps, if they are included in the response and associate them with article using schema:hasPart." },
    { "rule": "Where relevant, add article body, but no more than 20 words." },
    { "rule": "Where relevant, add article sections and fleshed out body comprising no more than 20 words." },
    { "rule": "Where possible, align images with relevant article and howto step sections." },
    { "rule": "Add a label to each how-to step." },
    { "rule": "Add descriptions of any other relevant entity types." },
    { "rule": "If not using JSON-LD, triple quote literal values containing more than 20 words." },
    { "rule": "Whenever you encounter inline double quotes within the value of an annotation attribute, change the inline double quotes to single quotes." },
    { "rule": "Whenever you encounter video, handle using the VideoObject type, specifying properties such as name, description, thumbnailUrl, uploadDate, contentUrl, and embedUrl. Do not guess or insert non-existent information." },
    { "rule": "Whenever you encounter audio, handle using the AudioObject type, specifying properties such as name, description, thumbnailUrl, uploadDate, contentUrl, and embedUrl. Do not guess or insert non-existent information." },
    { "rule": "Where relevant, include additional entity types when discovered, e.g., Product, Offer, and Service, etc." },
    { "rule": "Language tag the values of annotation attributes." },
    { "rule": "Describe article authors and publishers in detail." },
    { "rule": "Fix all JSON-LD usage errors." }
  ]
}
        
OAI.DBA.CREATE_BATCH_TASK(
    'nvsO6XMexfveKk-r4Ovb4Ok8iRM2stxSMYTuI5LxHmA',
    'gpt-4o-mini',
    file_to_string('/tmp/batch_sys.jsonl'),
    0.1,
    0.5
);
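Before registering the task, it can be worth confirming that the instructions file parses and has the shape shown above: a task string plus a list of guideline objects, each with a rule string. The following is a minimal sketch of such a check. Whether OPAL itself requires this structure is not stated here, and check_instructions is a hypothetical helper, not part of OPAL.

```python
import json

def check_instructions(text: str) -> list[str]:
    """Return a list of problems found in the instructions JSON (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(doc, dict):
        return ["top level is not a JSON object"]
    problems = []
    task = doc.get("task")
    if not isinstance(task, str) or not task.strip():
        problems.append("missing or empty 'task'")
    rules = doc.get("guidelines")
    if not isinstance(rules, list) or not rules:
        problems.append("missing or empty 'guidelines'")
    else:
        for i, item in enumerate(rules):
            if not isinstance(item, dict) or not isinstance(item.get("rule"), str):
                problems.append(f"guideline {i} has no 'rule' string")
    return problems

example = '{"task": "Generate JSON-LD", "guidelines": [{"rule": "Use @vocab appropriately."}]}'
print(check_instructions(example))
```

In practice the text would come from reading /tmp/batch_sys.jsonl rather than from an inline string.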
        

Step 5: Create LLM Batch

Associate tasks with a batch.

OAI.DBA.CREATE_BATCH(
    'nvsO6XMexfveKk-r4Ovb4Ok8iRM2stxSMYTuI5LxHmA',
    0
);
        

Step 6: Manage Batch Processing

Start Batch Processing, which requires an OpenAI API Key

Begin processing tasks.

SELECT OAI.DBA.BATCH_START(1, 'sk-xxxxxx');
        

Check State of LLM Batch Tasks

List undefined tasks.

SELECT B_ID FROM OAI.DBA.BATCH WHERE B_STATE = 'undefined';        

Monitor Batch Status

Check batch progress periodically until a 'complete' response is returned.

OAI.DBA.BATCH_CHECK(1, 'sk-xxxxxx');
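The periodic check can be automated with a small polling loop. The sketch below is client-agnostic: check is a hypothetical callable that you would implement to run OAI.DBA.BATCH_CHECK through your own database connection and return the status as a string. The exact return format of BATCH_CHECK is not documented here, so only the 'complete' value mentioned above is treated specially.

```python
import time
from typing import Callable

def wait_for_batch(check: Callable[[], str], interval_s: float = 60.0,
                   max_checks: int = 100, sleep=time.sleep) -> str:
    """Call check() until it returns 'complete' or max_checks is exhausted."""
    state = ""
    for attempt in range(max_checks):
        state = check()
        if state == "complete":
            return state
        if attempt < max_checks - 1:
            sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"batch still in state {state!r} after {max_checks} checks")

# Usage with a stand-in for the real status call:
fake_states = iter(["in_progress", "complete"])
print(wait_for_batch(lambda: next(fake_states), interval_s=0))
```

Batch jobs can take a long time to finish, so a generous interval keeps the number of status calls low.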
        

Mark Batch as Complete

OAI.DBA.BATCH_COMPLETE(1, 'sk-xxxxxx');
        

Optionally Cancel Batch

OAI.DBA.BATCH_CANCEL(1, 'sk-xxxxxx');
        

Step 7: Query Batch Results

List Active and Completed Batches

OAI.DBA.BATCH_LIST('sk-xxxxxx');
        

View Generated JSON-LD

SELECT BI_RESULT FROM OAI.DBA.BATCH_ITEMS;
        

Step 8: Analyze Token Usage

Query batch processing tokens for cost analysis.

SELECT B_ID, B_TASK_ID, B_BATCH_ID, B_TS, B_STARTED, B_LAST_TS, B_STATE, B_TOKENS, B_ERROR 
FROM OAI.DBA.BATCH;
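Once the token counts are available, turning them into a cost estimate is simple arithmetic. The sketch below assumes B_TOKENS holds the total token count for each batch and may be NULL for batches that have not finished. The rate of 2.0 USD per million tokens is a placeholder, not a real price; substitute the current rate for your model. The function estimate_cost is hypothetical.

```python
def estimate_cost(rows, usd_per_million_tokens):
    """rows: iterable of (B_ID, B_TOKENS) pairs; returns {B_ID: cost in USD}."""
    costs = {}
    for batch_id, tokens in rows:
        used = tokens or 0  # treat NULL/None as zero tokens (assumption)
        costs[batch_id] = used / 1_000_000 * usd_per_million_tokens
    return costs

# Made-up token counts, for illustration only.
rows = [(1, 1_000_000), (2, 500_000), (3, None)]
costs = estimate_cost(rows, usd_per_million_tokens=2.0)
print(costs)
print(sum(costs.values()))
```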
        

Step 9: Export RDF Results

Export results to a specified directory.

OAI.DBA.BATCH_EXPORT_RESULTS(1, '/DAV/www.bbcgoodfood.com/rdf');

[Figure: Exported RDF documents comprising JSON-LD content]


Sample Snippet of RDF Generated, using JSON-LD

{
    "@context": {
        "@vocab": "https://schema.org/",
        "page_url": "https://www.bbcgoodfood.com/recipes/air-fryer-apple-cinnamon-tarts",
        "Recipe": "https://schema.org/Recipe",
        "Question": "https://schema.org/Question",
        "Answer": "https://schema.org/Answer",
        "HowTo": "https://schema.org/HowTo",
        "HowToStep": "https://schema.org/HowToStep",
        "DefinedTermSet": "https://schema.org/DefinedTermSet",
        "DefinedTerm": "https://schema.org/DefinedTerm",
        "VideoObject": "https://schema.org/VideoObject",
        "AudioObject": "https://schema.org/AudioObject"
    },
    "@type": "Recipe",
    "name": "Air fryer apple & cinnamon tarts",
    "author": {
        "@type": "Person",
        "name": "Samuel Goldsmith",
        "url": "https://www.bbcgoodfood.com/author/samuelgoldsmith"
    },
    "image": "https://images.immediate.co.uk/production/volatile/sites/30/2024/10/AirFryerAppleCinnamonTarts-07ae2eb.jpg?quality=90&resize=440,400",
    "description": "Use a sheet of ready-rolled puff pastry to make these easy tarts filled with apples and cinnamon.",
    "prepTime": "PT20M",
    "cookTime": "PT25M",
    "totalTime": "PT45M",
    "recipeYield": "6 servings",
    "recipeIngredient": [
        "320g sheet ready-rolled puff pastry",
        "3 apples",
        "1 tsp lemon juice",
        "1 ?? tsp cinnamon",
        "40g caster sugar",
        "1 egg or 3 tbsp milk, to glaze",
        "Icing sugar to decorate (optional)"
    ],
    "recipeInstructions": {
        "@type": "HowTo",
        "name": "Method",
        "step": [
            {
                "@type": "HowToStep",
                "name": "Step 1",
                "text": "Line the base of your air fryer with foil. Cut the pastry sheet in half lengthways, then cut each half into three pieces, so you have six in total. Score an edge of about 2cm around each piece, being careful not to cut all the way through. Put in the fridge to chill while you prepare the apple.",
                "position": 1
            },
            {
                "@type": "HowToStep",
                "name": "Step 2",
                "text": "Core and quarter the apple, then cut each quarter into thin slices. Put the apple in a bowl, add the lemon juice and toss. Add the cinnamon and caster sugar, then mix well to coat.",
                "position": 2
            },
            {
                "@type": "HowToStep",
                "name": "Step 3",
                "text": "Put the apple slices into the centre of the pastry, being careful not to go over the border you scored earlier. Chill in the fridge for 20 mins.",
                "position": 3
            },
            {
                "@type": "HowToStep",
                "name": "Step 4",
                "text": "Heat the air fryer to 180C. Cook the tarts for 25-30 mins until the pastry is cooked and the apple has started to turn light golden.",
                "position": 4
            }
        ]
    },
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "267 kcal",
        "fatContent": "15 g",
        "saturatedFatContent": "7 g",
        "carbohydrateContent": "27 g",
        "sugarContent": "11 g",
        "fiberContent": "2 g",
        "proteinContent": "4 g",
        "saltContent": "0.5 g"
    },
    "recipeCategory": "Dessert",
    "recipeCuisine": "Vegetarian",
    "video": {
        "@type": "VideoObject",
        "name": "Air fryer apple & cinnamon tarts",
        "description": "Watch how to make these delicious tarts.",
        "thumbnailUrl": "https://images.immediate.co.uk/production/volatile/sites/30/2024/10/AirFryerAppleCinnamonTarts-07ae2eb.jpg?quality=90&resize=440,400",
        "uploadDate": "2024-10-01",
        "contentUrl": "https://www.bbcgoodfood.com/recipes/air-fryer-apple-cinnamon-tarts",
        "embedUrl": "https://www.bbcgoodfood.com/recipes/air-fryer-apple-cinnamon-tarts#video"
    },
    "hasPart": [
        {
            "@type": "Question",
            "name": "Can I use other fruits?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes, you can substitute apples with pears or berries."
            }
        },
        {
            "@type": "Question",
            "name": "How long do these tarts last?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "They can be stored in an airtight container for up to 3 days."
            }
        },
        {
            "@type": "Question",
            "name": "Can I freeze these tarts?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes, they freeze well before baking."
            }
        },
        {
            "@type": "Question",
            "name": "What can I serve with these tarts?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Serve with whipped cream or vanilla ice cream."
            }
        },
        {
            "@type": "Question",
            "name": "Are these tarts suitable for vegans?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "No, they contain egg. Use a vegan alternative for glazing."
            }
        }
    ],
    "definedTermSet": {
        "@type": "DefinedTermSet",
        "name": "Cooking Terms",
        "hasDefinedTerm": [
            {
                "@type": "DefinedTerm",
                "name": "Puff Pastry",
                "description": "A light, flaky pastry made from layers of dough and fat."
            },
            {
                "@type": "DefinedTerm",
                "name": "Glaze",
                "description": "A coating applied to food to give it a shiny appearance."
            },
            {
                "@type": "DefinedTerm",
                "name": "Chill",
                "description": "To cool food in the refrigerator."
            },
            {
                "@type": "DefinedTerm",
                "name": "Core",
                "description": "To remove the central part of a fruit."
            },
            {
                "@type": "DefinedTerm",
                "name": "Slice",
                "description": "To cut food into thin pieces."
            },
            {
                "@type": "DefinedTerm",
                "name": "Score",
                "description": "To make shallow cuts in the surface of food."
            },
            {
                "@type": "DefinedTerm",
                "name": "Tart",
                "description": "A baked dish consisting of a pastry base filled with sweet or savory ingredients."
            },
            {
                "@type": "DefinedTerm",
                "name": "Air Fryer",
                "description": "A kitchen appliance that cooks by circulating hot air."
            },
            {
                "@type": "DefinedTerm",
                "name": "Icing Sugar",
                "description": "A finely powdered sugar used for decoration."
            },
            {
                "@type": "DefinedTerm",
                "name": "Caster Sugar",
                "description": "A fine sugar used in baking."
            }
        ]
    }
}
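A quick way to compare generated documents against the prompt's guidelines is to count the entity types they contain. The sample above contains 5 Questions and 10 DefinedTerms, meeting those minimums, but only one HowTo where the guidelines ask for three. The sketch below walks a parsed JSON-LD document and reports such shortfalls. It inspects plain @type values only and does no JSON-LD expansion, and the helper names are hypothetical.

```python
import json
from collections import Counter

def count_types(node, counts=None):
    """Recursively count @type values in a parsed JSON-LD structure."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        t = node.get("@type")
        if isinstance(t, str):
            counts[t] += 1
        elif isinstance(t, list):
            counts.update(x for x in t if isinstance(x, str))
        for value in node.values():
            count_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_types(item, counts)
    return counts

def shortfalls(counts, minimums):
    """Return {type: (found, required)} for types below the required minimum."""
    return {t: (counts[t], need) for t, need in minimums.items() if counts[t] < need}

# Minimums taken from the guidelines in the batch instructions file.
MINIMUMS = {"Question": 5, "DefinedTerm": 10, "HowTo": 3}

# A trimmed stand-in for a generated document.
doc = json.loads("""{
  "@type": "Recipe",
  "hasPart": [{"@type": "Question", "acceptedAnswer": {"@type": "Answer"}}],
  "recipeInstructions": {"@type": "HowTo", "step": [
    {"@type": "HowToStep", "position": 1},
    {"@type": "HowToStep", "position": 2}]}
}""")
counts = count_types(doc)
print(shortfalls(counts, MINIMUMS))
```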

        

Step 10: Export RDF Results to a Knowledge Graph in the Virtuoso RDF Quad Store

Copy the generated RDF from its SQL table into a Knowledge Graph (denoted by the IRI urn:www.bbcgoodfood.com) by running the following.

OAI.DBA.BATCH_RDF_JSONLD_IMPORT(1,'urn:www.bbcgoodfood.com');
        

Step 11: Explore

Explore the Knowledge Graph via the page returned for the following SPASQL query.

SPARQL
SELECT DISTINCT (SAMPLE(?s) AS ?sample)
                (COUNT(*) AS ?count) 
                (?o AS ?entityType)
FROM <urn:www.bbcgoodfood.com> 
WHERE {?s a ?o} 
GROUP BY ?o
ORDER BY DESC (?count)
LIMIT 50
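The same query can also be sent over HTTP using the standard SPARQL protocol. The sketch below only builds the request and does not send it. The endpoint URL is an assumption (a local Virtuoso instance on its default port), so replace it with your own. The leading SPARQL keyword is omitted because it is needed only when running the query through the SQL interface.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

ENDPOINT = "http://localhost:8890/sparql"  # assumption: adjust to your instance

QUERY = """SELECT DISTINCT (SAMPLE(?s) AS ?sample)
                (COUNT(*) AS ?count)
                (?o AS ?entityType)
FROM <urn:www.bbcgoodfood.com>
WHERE {?s a ?o}
GROUP BY ?o
ORDER BY DESC (?count)
LIMIT 50
"""

def build_request(endpoint: str, query: str) -> Request:
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_request(ENDPOINT, QUERY)
print(req.full_url)
```

Passing req to urllib.request.urlopen would execute the query and return the results as JSON, provided the endpoint is reachable.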
        

The query result is also available as a live Knowledge Graph exploration page generated from the SPARQL query.


Notes

  • Replace sk-xxxxxx with your actual OpenAI API key.
  • Modify paths and file names (/tmp/batch_sys.jsonl) as per your configuration.
  • Use Virtuoso's Conductor UI to monitor and control crawler jobs.

Tools That Make It Possible

  1. OpenLink AI Layer (OPAL): a Virtuoso add-on layer that integrates LLMs with Virtuoso for natural language interactions with Data Spaces (databases, knowledge bases or graphs, and other document collections).
  2. OpenLink Virtuoso: a high-performance and scalable system for managing Data Spaces (databases, knowledge graphs, and other document collections) that leverages hyperlinks as super-keys for unambiguous entity naming, access, and exploration.

Conclusion

In the new era of AI, high-quality data and data marketplaces will serve as essential fuel for AI Agents. These data sources will be curated and published by domain experts, protected, and monetized using fine-grained Attribute-Based Access Controls (ABAC).

As demonstrated in this article, OPAL combined with the Virtuoso Data Spaces platform makes this vision achievable immediately and cost-effectively, through a loosely coupled architecture built on existing open standards.
