Tackling a small AI use case with a local multi-modal LLM

For the enterprise, AI is really all about business process automation. Local LLMs can be very useful for automating discrete tasks that are either difficult or tedious to do by hand. For example, let's say you work in a design or marketing agency and you have a lot of photos stored on your Nasuni network share. Finding the right photo for a new campaign or design can be difficult because the photos are not named in a way you can easily search, and browsing can take a long time before you stumble upon the photo you had in mind.

What if we could automatically rename each photo so that its name describes the photo? That sounds like a job for AI!

To do this we are going to use something I have posted about previously: Llamafile, and the multi-modal LLaVA model (https://www.dhirubhai.net/pulse/llamafile-enables-on-device-large-language-models-without-jim-liddle-j0vdf/). Llamafile essentially enables an LLM to be run from a single file.

We are going to run the Llamafile LLM on WSL2 (Windows Subsystem for Linux) on Windows 11.

Before the script and model will work under WSL, there is some preparation we have to do first.

Llamafile executables use something called APE (Actually Portable Executable), a binary format from the Cosmopolitan Libc project that lets the same file run unmodified on Linux, Windows, macOS, and the BSDs. An APE file begins with the same "MZ" magic bytes as a Windows executable.

WSL has a feature called "WSLInterop" that allows users to run Windows executables directly from within the Linux environment. It works by registering a binfmt_misc rule that hands any file starting with those "MZ" bytes over to Windows. While this is convenient in many cases, it means WSL will mistake an APE binary such as a Llamafile for a Windows program and fail to run it correctly.

We therefore need to take some steps to ensure that APE binaries are not misrouted by WSL.

We should be able to just run the following command within the Linux WSL instance:

sudo sh -c "echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop"        

This can, however, often result in a 'permission denied' error, which would effectively stop the Llamafile LLM from executing.

To resolve this, enter the following at the terminal prompt (for reference, I am running the Ubuntu Linux WSL instance):

sudo sh -c 'echo :WSLInterop:M::MZ::/init:PF > /usr/lib/binfmt.d/WSLInterop.conf'
sudo systemctl unmask systemd-binfmt.service
sudo systemctl restart systemd-binfmt
sudo systemctl mask systemd-binfmt.service
sudo sh -c 'echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop'        

The combined effect of these commands is to re-register the WSLInterop rule in a persistent config file, using WSL's own /init program as the interpreter (the P flag preserves argv[0] and the F flag opens the interpreter at registration time), and then to disable the live rule, so that APE binaries such as Llamafile are executed by Linux itself rather than being handed off to Windows. (You would of course not need to go through these steps if you were running this directly on a native Linux system rather than under WSL.)
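For reference, the rule written to WSLInterop.conf is a standard Linux binfmt_misc registration string with colon-separated fields. This small illustrative snippet (not part of the renaming script) just splits the rule apart and labels each field:

```python
# Illustrative only: split the binfmt_misc rule used above into its fields.
rule = ":WSLInterop:M::MZ::/init:PF"

# Fields: name, type (M = match by magic bytes), offset, magic bytes,
# mask, interpreter, and flags (P = preserve argv[0],
# F = open the interpreter binary at registration time).
_, name, rtype, offset, magic, mask, interpreter, flags = rule.split(":")

print(f"name={name} type={rtype} magic={magic!r} interpreter={interpreter} flags={flags}")
# prints: name=WSLInterop type=M magic='MZ' interpreter=/init flags=PF
```

Note that the magic bytes are "MZ", which is exactly why APE binaries collide with this rule.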

Now that our Linux WSL instance is in good shape, let's create a directory structure where we will keep our Llamafile and script. This can be anywhere of your choosing, but I have simply created a 'Llamafile' sub-directory within a 'Projects' directory.

Next you will need to download the LLaVA Llamafile and place it in this directory. You can download that from here. If you downloaded it to another directory and want to move it to your newly created Llamafile WSL directory, you can simply copy and paste it using Windows Explorer. To navigate to your WSL directory from Windows Explorer you will need to enter '\\wsl$' into the address bar (or you can just use wget to download it through the terminal!).

I downloaded some free stock photos from the internet that we can use for our test:

These are the images that we are going to have our LLM rename.

Now let's create our Python code to interact with the LLaVA LLM:

import os
import subprocess
import re

# Define the network directory containing images
image_directory = r"/mnt/c/Users/jimli/Downloads/images"  

# Define the path to the LLaVA llamafile (I am assuming it is in the same directory as the script)
llava_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), "llava-v1.5-7b-q4.llamafile")

# Function to sanitize and format the description to use as a filename
def sanitize_filename(filename):
    # Remove any characters that are not alphanumeric, spaces, or underscores
    filename = re.sub(r'[^a-zA-Z0-9_ ]', '', filename)
    # Replace underscores with spaces
    filename = filename.replace('_', ' ')
    # Truncate the filename to a reasonable length (e.g., 120 characters)
    return filename[:120]

# Function to process each image
def process_image(image_path):
    try:
        # Run the LLaVA command to get the description of the image
        result = subprocess.run(
            ['sh', '-c', f'HIP_VISIBLE_DEVICES=0 "{llava_file}" --cli --image "{image_path}" --temp 0 -ngl 999 -e -p "### User: What do you see?\n### Assistant:" --silent-prompt'],
            capture_output=True,
            text=True,
            check=True
        )
        description = result.stdout.strip()
        if description:
            # Sanitize the description to create a valid filename
            sanitized_description = sanitize_filename(description)
            # Construct the new file path
            new_image_path = os.path.join(os.path.dirname(image_path), sanitized_description + os.path.splitext(image_path)[1])
            # Rename the image
            os.rename(image_path, new_image_path)
            print(f'{os.path.basename(image_path)} renamed to {os.path.basename(new_image_path)}')
            return True
        else:
            print(f'No description returned for {image_path}')
            return False
    except subprocess.CalledProcessError as e:
        print(f'Error processing {image_path}: {e}')
        return False
    except Exception as e:
        print(f'An error occurred: {e}')
        return False

# Count the number of processed images
processed_count = 0

# Check that the image directory is accessible before walking it
if not os.path.exists(image_directory):
    print(f'Error: The directory {image_directory} does not exist or is not accessible.')
else:
    print(f'Accessing directory: {image_directory}')
    # Iterate through the image directory
    for root, _, files in os.walk(image_directory):
        print(f'Checking directory: {root}')  # Debugging: Print each directory being checked
        for file in files:
            print(f'Found file: {file}')  # Debugging: Print each file found
            if file.lower().endswith(('.jpg', '.jpeg', '.png', '.tga', '.bmp', '.psd', '.gif', '.hdr', '.pic', '.pnm')):
                image_path = os.path.join(root, file)
                print(f'Processing file: {image_path}')  # Debugging: Print each file being processed
                if process_image(image_path):
                    processed_count += 1

print(f'Completed processing {processed_count} images')        

Things to note about the code:

  • I have left my original mount path in so you can see the format, but you will need to replace this with your own
  • The script assumes the LLaVA model is in the same directory as the script, so you will need to change the location if this is not the case
  • This line defines the silent CLI interaction with the LLaVA model, and is where you would change the prompt that is submitted to the model ('What do you see?'):

 ['sh', '-c', f'HIP_VISIBLE_DEVICES=0 "{llava_file}" --cli --image "{image_path}" --temp 0 -ngl 999 -e -p "### User: What do you see?\n### Assistant:" --silent-prompt'],        

  • There is some sanitization of the LLM results to insert spaces between words (as we want to ensure we can search against the individual words).
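To see what that sanitization does in practice, here is the same sanitize_filename function from the script applied to a made-up model response (the input string is just an example, not real LLaVA output):

```python
import re

def sanitize_filename(filename):
    # Keep only alphanumerics, spaces, and underscores, then swap
    # underscores for spaces and cap the length at 120 characters.
    filename = re.sub(r'[^a-zA-Z0-9_ ]', '', filename)
    filename = filename.replace('_', ' ')
    return filename[:120]

print(sanitize_filename("A black-and-white photo of a steam_train, at night!"))
# prints: A blackandwhite photo of a steam train at night
```

Punctuation is stripped rather than replaced, which is why hyphenated words run together; the important part is that the individual searchable words survive.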

Running the script iterates through the images in the defined directory:

Now let's take a look at the images:

Each image is now renamed with a description of what it shows:

Let's say I am a user who wants an atmospheric picture of a train for a new design project, and I remember that I had used a black and white photo of a train in the past. If I now type that into Windows Search I can easily find the image.

It would also be fairly easy to extend this, perhaps to automatically add the top three metadata tags for each image, for example.
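As a hedged sketch of that extension: if the prompt were changed to ask the model for a comma-separated tag list, the reply could be parsed with a few lines of Python. The "Tags:" reply format below is an assumption for illustration, not something LLaVA guarantees, so real code would need to tolerate replies that don't match it:

```python
def extract_top_tags(reply: str, n: int = 3) -> list[str]:
    # Expect a reply like "Tags: train, night, station, steam" and keep
    # the first n tags, lower-cased and stripped of surrounding spaces.
    # Returns an empty list if the reply doesn't contain "Tags:".
    _, _, tail = reply.partition("Tags:")
    tags = [t.strip().lower() for t in tail.split(",") if t.strip()]
    return tags[:n]

print(extract_top_tags("Tags: Train, Night, Station, Steam"))
# prints: ['train', 'night', 'station']
```

The resulting tags could then be written into the file's metadata or appended to the filename.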

If we wanted to productionize this we would need to harden the code, add some logging and monitoring, and perhaps containerize it to make it a service.

Now, you can achieve similar results using hyperscaler AI models, but there will be an API/token cost for every image processed. A discrete local LLM microservice would be cheaper and would also offer greater privacy for the assets being identified by the LLM.
