Exploring CogVideo: Practical Tips and Personal Experiences

Introduction

Generative AI is becoming more and more popular and accessible nowadays, so I was interested in a locally runnable, open-source text-to-video solution that can be launched on a laptop like my Dell G15 5535 with an RTX 4060 (8 GB VRAM). I found CogVideo interesting and promising. Here is my experience with it.


CogVideo logo

CogVideo is a model that creates high-quality videos from text prompts. It uses a 3D Variational Autoencoder (VAE) to compress videos along both the spatial and temporal dimensions, improving both the compression rate and video fidelity, and it leverages a pre-trained large language model as a text encoder to turn text prompts into video sequences.

The current version can create videos 6 seconds long at a frame rate of 8 fps, with a single supported resolution of 720x480. The maximum prompt length is 226 tokens. There are two model sizes: CogVideoX-2B with 2 billion parameters and CogVideoX-5B with 5 billion parameters. CogVideo can also generate video not only from text, but from text + image and from text + another video. For image-to-video it has dedicated models, while video-to-video uses the main models. More details about the CogVideo architecture and technology can be found in the paper.
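To give an idea of what running the model looks like in practice, here is a minimal text-to-video sketch using the CogVideoXPipeline from the diffusers library, which the CogVideo demos are built on. The parameters follow the defaults described above (49 frames for roughly 6 seconds at 8 fps, 50 inference steps); treat it as an illustration rather than the official demo script:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B text-to-video checkpoint in bfloat16.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Keep only the currently active submodule on the GPU so the model fits into 8 GB of VRAM.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A casually dressed man in his late twenties walks down a brightly lit supermarket aisle ..."

video = pipe(
    prompt=prompt,
    num_frames=49,             # ~6 seconds at 8 fps
    num_inference_steps=50,    # or 100 for the more detailed variant
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)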

CogVideo in Docker

I was surprised that the CogVideo GitHub repo has no Dockerfile or any other instructions on how to launch the project in Docker, but I was curious whether it was possible. I decided to run CogVideo inside Docker because containers are easy to manage and easy to move to any cloud environment if necessary. Docker also isolates the environment, so it is much easier to share and install on other laptops: just share images via a Docker registry. Here are my instructions:

  1. First, it is important to install the NVIDIA Container Toolkit and, of course, the NVIDIA drivers with CUDA. Here is the manual on how to do it: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html. I tested it on Kubuntu 24.04.1 LTS with docker-ce version 27.2.1, but in theory it should work with any version of Docker and any OS supported by the Container Toolkit.
  2. As a base image I decided to use nvidia/cuda:12.6.1-cudnn-devel-ubuntu24.04. It may be a little redundant, but it contains all the NVIDIA drivers and libraries for NVIDIA cards. Only one thing needs to be done: add Python to the image.
  3. Here is my Dockerfile:

FROM nvidia/cuda:12.6.1-cudnn-devel-ubuntu24.04
RUN apt-get update &&\
    DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get install -y python3 python3.12 python3.12-venv python3.12-dev libjpeg-dev zlib1g-dev
RUN python3 -m venv /opt/cog_video_venv
ENV PATH="/opt/cog_video_venv/bin:$PATH"
COPY ./requirements.txt /requirements.txt
RUN pip install -r /requirements.txt
RUN pip install jupyter
RUN pip install torchao==0.4.0        

  • tzdata is installed separately because it is required by Python, and configuring it non-interactively without extra steps is a common problem.
  • libjpeg-dev and zlib1g-dev are important for PIL, which is used heavily in CogVideo.
  • Because in the newest Ubuntu it is hard to overwrite the system's Python packages, I decided to create a venv inside the image. It looks redundant, but it doesn't add any performance overhead.
  • jupyter is added for launching code interactively.
  • torchao==0.4.0 is important if you want to launch the demo with quantization. It isn't included in requirements.txt, and it seems the latest torchao (0.5.0) has some breaking changes that haven't yet been reflected in CogVideo, but with torchao 0.4.0 everything works fine.

  4. I launch the container via Docker Compose. It is important to attach the video card to the container for proper inference. Here is my docker-compose file:

services:
  runner:
    build:
      context: .
      dockerfile: ./Dockerfile
    working_dir: /learn
    env_file:
      - .env
    network_mode: "host"
    command: jupyter notebook --allow-root --ip=0.0.0.0 --port 8890
    volumes:
      - ".:/learn"
      - "/dev/dri:/dev/dri"
      - "./huggingface_checkpoint:/root/.cache/huggingface" # we are saving checkpints in local folder not inside docker container
      - "./torch_hub_checkpoints:/root/.cache/torch/" # same for torch checkpoints
    ulimits:
      rtprio: 95
      memlock: -1
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [ gpu ]        

For easy access I put it all together on my GitHub. You can find the Dockerfile, the docker-compose file, and detailed launch instructions here.
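After a docker compose up --build, Jupyter is available on port 8890 (as set in the command above). A quick sanity check that the GPU is actually visible inside the container might look like this; it assumes torch is already pulled in via requirements.txt:

import torch

print(torch.__version__)
print(torch.cuda.is_available())            # should be True if the GPU is attached correctly
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. an RTX 4060 Laptop GPU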

Prompts and results

I explored the prompt recommendations and tips that are available. As I mentioned before, the maximum prompt length is 226 tokens. That is enough for a small scenario (roughly 904 characters of text). I also want to mention that this model doesn't like short prompts: prompts should be detailed and highly descriptive. I tried a few short prompts and got very weak results, so weak that I don't want to show them here.

CogVideo's developers recommend using a GPT model for proper prompt creation. They have a script for prompt optimization that uses the OpenAI API for this purpose. But I assume not everyone uses the OpenAI API, so I refactored their prompt a little for ChatGPT. The prompt for prompt creation :))) turned out to be quite long, so I'm giving a link to my GitHub instead. Here is my prompt configuration. As for the ChatGPT model, I used GPT-4o, but I think it doesn't really matter which model you use.
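For completeness, here is a condensed sketch of how such prompt "upscaling" can be scripted with the OpenAI Python client. The system prompt below is only a short stand-in for my full prompt configuration linked above, and the function name is mine, not from the CogVideo repo:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholder system prompt; the real prompt configuration is much longer.
SYSTEM_PROMPT = (
    "You rewrite short video ideas into detailed, highly descriptive prompts "
    "for a text-to-video model. Keep the result under roughly 226 tokens."
)

def expand_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("Guy is buying beer in supermarket"))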

I decided to test three scenarios with the two available models, but unfortunately I couldn't launch the smaller model (2B, 2 billion parameters). So here are the results of testing the 5B model with 50 inference steps (the default recommended by CogVideo) and with 100 steps:

Very casual and simple:

  • Initial prompt: Guy is buying beer in supermarket
  • ChatGPT prompt: A casually dressed man in his late twenties, wearing a gray hoodie and faded jeans, walks down a brightly lit supermarket aisle lined with various bottles and cans of beer. He pauses in front of a well-stocked display, eyes scanning the rows of craft brews and familiar brands. Reaching for a tall amber bottle, he inspects the label thoughtfully, tapping it against his palm. A moment later, he drops it into his shopping basket with a clink, which already holds a few snack items. The overhead lights reflect off the glass as he continues down the aisle, his expression one of casual decision-making, with the hum of distant conversations and beeping registers creating a familiar backdrop.

Result with number of inference steps = 50

Result with number of inference steps = 100

You can see that both results are good enough, but the 100-step result is better than the 50-step one. The environment and the guy look good and correspond to the prompt. Of course, the actions don't look natural, but I think that is mostly an issue of the prompt: it should be a little "easier" in describing actions for a 6-second video. I would also say that in the 100-step version the man's face has more detail and is better drawn.


Something more creative

  • Initial prompt: The old man in the see couch a big fish and tries to drag it into the boat.
  • ChatGPT prompt: An elderly man with weathered skin and a white beard, dressed in a worn-out fishing vest and faded cap, stands on a small wooden boat rocking gently in the ocean. His hands grip a thick, fraying fishing line, muscles straining as he struggles to pull a massive, glistening fish toward the boat. The fish, shimmering silver with deep blue accents, thrashes wildly, its powerful tail splashing water into the air. The sun casts a golden glow across the scene, reflecting off the old man's determined face. He plants his feet firmly, sweat dripping down his brow, as the boat sways, creaking under the tension. The sea around them is vast and calm, contrasting the intense battle between man and nature, while the horizon stretches endlessly behind him.

Result with number of inference steps = 50

Result with number of inference steps = 100

I would say that the 50-step result looks much better than the 100-step one. At 100 steps the model made a very big mistake: it placed the fish in the wrong position ))) and the old man looked far too calm for this scene. But I like the old man's image more in the 100-step version. The 100-step version is also more detailed and smoother than the 50-step one. Still, the 50-step version reflects the scenario more precisely.


Super surreal thing

  • Initial prompt: Donald Trump, drink vodka with one reptiloid
  • ChatGPT prompt: In a luxurious, dimly lit room adorned with ornate gold accents, Donald Trump, wearing a sharp black suit with a red tie, sits at a polished mahogany table. Across from him is a tall reptilian figure, with iridescent green scales, dressed in an elegant black robe. The creature's sharp, golden eyes are fixed on Trump as they each hold crystal-clear vodka shots in small, intricately designed glasses. Trump smiles confidently, raising his glass in a friendly toast while the reptiloid’s clawed hand carefully mirrors the motion. The atmosphere is one of quiet intrigue, with soft classical music playing in the background, as they take a synchronized sip, exchanging subtle glances that hint at an otherworldly understanding.

Result with number of inference steps = 50

Result with number of inference steps = 100

I like both of the results ))). But the 100-step version is much more detailed: the main character's chair, the glasses, and the reptiloid look impressive. And of course Donald Trump is impressive as always )))

As for inference time: on my RTX 4060 (8 GB VRAM), a 50-step inference took 25 minutes on average and a 100-step inference took 49 minutes on average. The 100-step results are much more detailed, but in one case the 50-step version stuck to the scenario much better.

Issues and Notes

CogVideo is developing very rapidly and doesn't have exhaustive documentation, so you need to look at the code to figure out how things work.

I also ran into a few issues:

  • I couldn't get any results from the image-to-video model; it always returned a None object. I spent a few hours investigating, didn't find the reason, and decided to open a ticket in the CogVideo repository.
  • The 2B version of the model also didn't launch because it consumed too much memory during loading; 32 GB of RAM was not enough for it. Maybe it is a bug that will be fixed later.
  • I also couldn't launch the quantization example, even after I fixed the library issue (CogVideo doesn't work with the latest torchao). In both cases (quantization and the 2B model) I hit memory issues that happened while the models were being loaded into RAM (not VRAM).

During my tests I noticed that CogVideo didn't consume more than 3.5 GB of VRAM, so in theory CogVideo should be able to run on video cards with 4 GB of VRAM. But I didn't test that.
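The modest VRAM footprint is most likely thanks to the memory-saving options used during inference. Here is a short sketch of the toggles I mean, plus the optional torchao int8 quantization that the quantization demo needs torchao 0.4.0 for; this illustrates the APIs rather than reproducing the exact demo code:

import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only  # works with torchao==0.4.0

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Optional: int8 weight-only quantization of the transformer. Note that loading and
# quantizing still need plenty of system RAM, which is where I hit problems.
quantize_(pipe.transformer, int8_weight_only())

# Trade speed for a much smaller VRAM peak: offload idle submodules to the CPU
# and decode the VAE in slices/tiles.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()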

There are a few Colab notebooks provided by the CogVideo team. I tried launching them and got very unstable results inside Colab (it crashed on every second retry). I also tried to launch the notebooks on my laptop with the camenduru/cogvideox-5b-float16 model, and they crashed due to VRAM issues. I didn't analyze the problems deeply; maybe they are just temporary (and will be fixed in new CogVideo releases), so you should try them on your laptop. Here are the links:

Conclusion

Modern text-to-video models are very impressive, and CogVideo is one of them. It allows you to create videos on a wide range of themes and understands prompts well enough. I can't imagine the size of the dataset that was used to train this model. CogVideo is a good starting point if you want to explore video generation technology and try something on your own laptop.


Useful links

