Exploring CogVideo: Practical Tips and Personal Experiences
Vitalii Bychenkov
Senior Python Developer | AI & ML Architect | Scalable Systems at Production Level
Introduction
Generative AI is becoming more popular and accessible every day, so I got interested in a locally runnable, open-source text-to-video solution that could launch on a laptop like my Dell G15 5535 with an RTX 4060 (8 GB VRAM). I found CogVideo interesting and promising. Here is my experience with it.
CogVideo is a model that creates high-quality videos from text prompts. It uses a 3D Variational Autoencoder (VAE) to compress videos along both the spatial and temporal dimensions, improving both the compression rate and video fidelity, and it leverages pre-trained large language models (GPT-like models) to generate video sequences from text prompts.
The current version can create 6-second videos at a frame rate of 8. The only supported resolution is 720x480. The maximum prompt length is 226 tokens. There are two model sizes: CogVideoX-2B with 2 billion parameters and CogVideoX-5B with 5 billion parameters. CogVideo can also generate video not only from text but from text + image and from text + another video. For image-to-video CogVideo has dedicated models, while for video-to-video it uses its main models. More details about CogVideo’s architecture and technology can be found in the paper.
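To make this concrete, here is a minimal sketch of what text-to-video generation looks like through the Hugging Face diffusers pipeline, which is the standard way to run CogVideoX locally. The model ID, frame count, and guidance scale follow the values documented for CogVideoX at the time of writing; treat the exact numbers as assumptions if the API has moved on since.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B text-to-video model; bfloat16 keeps the memory footprint manageable
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)
# Offload idle submodules to the CPU so the model fits in 8 GB VRAM
pipe.enable_sequential_cpu_offload()

prompt = "A detailed, highly descriptive scene ..."  # remember: up to 226 tokens

# 49 frames at 8 fps gives the ~6-second clip the model is built for
video = pipe(
    prompt=prompt,
    num_frames=49,
    num_inference_steps=50,  # I also tested 100; see the comparisons below
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)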
CogVideo in Docker
I was surprised that CogVideo’s GitHub repo has no Dockerfile or other instructions for launching the project in Docker, but I was curious whether it was possible. I decided to run CogVideo inside Docker because containers are easy to manage and easy to move to any cloud environment if necessary. Docker also lets you isolate your environment, so it is much easier to share and install on other laptops: just share images via a Docker registry. Here are my instructions, starting with the Dockerfile:
# CUDA 12.6 + cuDNN devel image on Ubuntu 24.04
FROM nvidia/cuda:12.6.1-cudnn-devel-ubuntu24.04

# Install tzdata separately to avoid the interactive timezone prompt
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata
RUN apt-get install -y python3 python3.12 python3.12-venv python3.12-dev libjpeg-dev zlib1g-dev

# Isolate Python dependencies in a venv and put it first on PATH
RUN python3 -m venv /opt/cog_video_venv
ENV PATH="/opt/cog_video_venv/bin:$PATH"

COPY ./requirements.txt /requirements.txt
RUN pip install -r /requirements.txt
RUN pip install jupyter
# torchao for optional quantized inference
RUN pip install torchao==0.4.0
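The requirements.txt that the Dockerfile copies is essentially the dependency list from CogVideo’s repo. As a rough sketch of what mine contains (the exact pins are illustrative assumptions and will drift as the project evolves):

torch>=2.4.0            # CUDA build to match the base image
diffusers>=0.30.0       # first release with the CogVideoX pipelines
transformers>=4.44.0
accelerate>=0.33.0
sentencepiece           # tokenizer backend for the text encoder
imageio-ffmpeg          # used by export_to_video
opencv-python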
I launch Docker via docker compose. It is important to attach the video card to Docker for proper inference. Here is my docker-compose file:
services:
  runner:
    build:
      context: .
      dockerfile: ./Dockerfile
    working_dir: /learn
    env_file:
      - .env
    network_mode: "host"
    command: jupyter notebook --allow-root --ip=0.0.0.0 --port 8890
    volumes:
      - ".:/learn"
      - "/dev/dri:/dev/dri"
      - "./huggingface_checkpoint:/root/.cache/huggingface" # keep checkpoints in a local folder, not inside the container
      - "./torch_hub_checkpoints:/root/.cache/torch/" # same for torch checkpoints
    ulimits:
      rtprio: 95
      memlock: -1
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [ gpu ]
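With both files in place, bringing everything up is a single command. Since the container uses host networking, Jupyter is reachable directly on the port from the compose file:

docker compose up --build
# Jupyter prints a tokenized URL on startup, e.g. http://localhost:8890/tree?token=...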
For easy access I put it all together in my GitHub. You can find the Dockerfile, the docker-compose file, and detailed instructions on how to launch everything here.
Prompts and results
I explored the prompt recommendations and tips that are available. As I mentioned before, the maximum prompt length is 226 tokens. That is enough for a small scenario (roughly 904 characters of text). I also want to mention that this model doesn’t like short prompts: prompts should be detailed and highly descriptive. I tried a few short prompts and got very weak results, so weak that I don’t want to show them here.
CogVideo’s developers recommend using a GPT model for proper prompt creation. They have a prompt-optimization script that uses the OpenAI API for this purpose. But I assume not everyone uses the OpenAI API, so I refactored their prompt a little bit for ChatGPT. The prompt for prompt creation :))) turned out to be quite long, so I’m giving a link to my GitHub instead. Here is my prompt configuration. As for ChatGPT models, I used GPT-4o, but I think it doesn’t matter which model you use.
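For those who do want to script it, the idea is straightforward; here is a hedged sketch of the same flow using the official openai Python client. The system prompt below is a heavily shortened stand-in for my full prompt configuration on GitHub, not the real thing:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A heavily shortened stand-in for the full prompt configuration in my repo
SYSTEM_PROMPT = (
    "You turn a short idea into a single, highly descriptive video prompt: "
    "describe the scene, subjects, actions, lighting, and camera in rich detail, "
    "and keep the result under 226 tokens."
)

def expand_prompt(short_idea: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_idea},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("an old fisherman catches a huge fish at sunset"))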
I decided to test three scenarios with the two available models, but unfortunately I couldn’t launch the lower version of the model (2B, 2 billion params). So here are the results of testing the 5B model with 50 inference steps (the default recommended by CogVideo) and with 100:
Very casual and simple:
Result with number of inference steps = 50
Result with number of inference steps = 100
You can see that both results are good enough, but the 100-step result is better than the 50-step one. The environment and the guy look good and correspond to the prompt. Of course the actions don’t look natural, but I think that is mostly an issue of the prompt: it should be a little bit “easier” in describing actions for a 6-second video. And I would say that in the 100-step version the man’s face has more detail and is better drawn.
Something more creative:
Result with number of inference steps = 50
Result with number of inference steps = 100
I would say that the 50-step result looks much better than the 100-step one. At 100 steps the model made a very big mistake: it placed the fish in the wrong position ))) and the old man looked very calm for this scene. But I like the image of the old man more in the 100-step version. The 100-step version is also more detailed and smoother than the 50-step one. But again, the 50-step version reflects the scenario more precisely.
A super surreal thing:
Result with number of inference steps = 50
Result with number of inference steps = 100
I like both of the results ))). But the 100-step version has much better detail. The main character’s chair, the glasses, and the reptiloid look impressive. And of course Donald Trump is impressive as always )))
As for inference time: on my RTX 4060 with 8 GB VRAM, a 50-step inference took 25 minutes on average and a 100-step inference took 49 minutes on average. The 100-step results have much better detail, but in one case the 50-step version stuck to the scenario much better.
Issues and Notes
CogVideo develops very rapidly and doesn’t have exhaustive documentation, so you need to look at the code to figure out how things work.
I also ran into a few issues:
During my tests I noticed that CogVideo didn’t consume more than 3.5 GB of VRAM, so in theory CogVideo should be able to run on video cards with 4 GB of VRAM. But I didn’t test that.
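My assumption is that diffusers’ memory optimizations are what keep consumption this low; these are the switches I would enable on the pipeline (all three exist in diffusers for CogVideoX, though whether they are enough for a 4 GB card is untested):

# Memory-saving switches on a loaded CogVideoXPipeline ("pipe" from the earlier sketch)
pipe.enable_sequential_cpu_offload()  # keep only the active submodule on the GPU
pipe.vae.enable_slicing()             # decode latents in slices to cut peak memory
pipe.vae.enable_tiling()              # decode frames tile by tile for the same reason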
The CogVideo team provides a few notebooks for Colab. I tried launching them and got very unstable results inside Colab (it crashed on every second retry). I also tried to launch the notebooks on my laptop with the camenduru/cogvideox-5b-float16 model, and they crashed due to VRAM issues. I didn’t analyze the issues deeply; maybe they are just temporary (to be fixed soon in new CogVideo releases), so you should try them on your own laptop. Here are the links:
Conclusion
Modern text-to-video models are very impressive, and CogVideo is one of them. It allows you to create videos across a wide range of themes and understands prompts well enough. I can’t imagine the size of the dataset that was used to train this model. CogVideo is a good starting point if you want to explore video-generation technology and try something on your own laptop.
Useful links