Fun With Wan 2.2 Text-to-Video AI
This blog post is pretty much also a YouTube video, so if you want, you can just watch that instead!
I recently got interested in text-to-video AI generators after reading a post about Wan 2.2. I hadn’t really been interested in them before because they all kind of sucked, and they either took too much VRAM or were proprietary models like Veo or Sora that you couldn’t run locally.
But after looking at some of the videos that Wan 2.2 could generate, and seeing that it could run on some pretty standard hardware, I decided to look into it a bit more.
Making a LoRA
Whenever I use these AI image generators, I always like to make images of myself. Call it narcissism or whatever, but I think it’s really fun to throw myself into impossible and weird situations, or make paintings of myself as a Roman emperor or whatever. But to make the AI generate images that look even the least bit like me, I need to train a LoRA. I’ve done this before with image generators like Stable Diffusion and, more recently, Flux.
I used a tool called ai-toolkit to train the Flux LoRA on Runpod around a year ago. The quality of that image generation model was quite good, but I quickly forgot about it after I tried making a bunch of art for my Samsung Frame TV of myself which didn’t turn out that great. When researching ways to train a LoRA for Wan 2.2, I found that ai-toolkit already had support for it. I still had the config files and image dataset that I used to train the Flux model, so I figured it would be pretty easy to do the Wan 2.2 one too.
The training took about 2 hours running on a 24GB VRAM GPU. I think it was an A5000. The interesting thing about training is that I didn’t need to use videos to train the video model, just images. After I loaded the LoRA into the default ComfyUI workflow for Wan 2.2, I was able to put myself into a bunch of different situations!
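I did all of my generation in the default ComfyUI workflow, but if you’re curious what the same idea looks like in code, here’s a rough sketch using the Hugging Face diffusers library’s Wan pipeline with a character LoRA loaded on top. The model repo ID, LoRA filename, prompt, resolution, and frame settings below are placeholders and assumptions, not my exact setup, so treat it as illustrative only.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Placeholder names -- swap in the actual Wan 2.2 repo and the LoRA file ai-toolkit produced.
model_id = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
lora_path = "my_wan22_lora.safetensors"

# The Wan VAE is typically kept in float32; the rest of the pipeline can run in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load the character LoRA (the prompt should include whatever trigger word you trained with).
pipe.load_lora_weights(lora_path)

prompt = (
    "1980s sitcom intro, a man in a blazer with rolled-up sleeves talking on a corded phone, "
    "warm studio lighting, film grain, he turns to the camera and smiles"
)

video = pipe(
    prompt=prompt,
    negative_prompt="blurry, distorted, low quality",
    height=480,
    width=832,
    num_frames=81,       # roughly five seconds; the exact fps depends on the Wan variant
    guidance_scale=5.0,
).frames[0]

export_to_video(video, "sitcom_phone.mp4", fps=16)
```

ComfyUI is doing essentially the same thing under the hood, just wired up as nodes instead of a script.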
Making Use of AI Videos
The main issue that pops up with these video generation AI models is that the videos have to be pretty short. The other problem is that it’s hard to keep characters and scenes consistent between clips, especially if you’re only using text prompts to generate the video. I’ve seen some really shitty videos that people were clearly proud of, where the characters all look slightly different between scenes and the vehicle they’re riding in changes wildly. Yet somehow they think their AI video is the next summer blockbuster.
When Google’s Veo came out (I think it was Veo 3 or 4?) I saw a video that someone made that I thought was really clever. It was a fake news story about a synchronized swimming team of cats. I thought it was really smart because news stories consist of a bunch of short clips, and there wouldn’t really need to be a whole lot of consistency between the scenes.
Any good artist knows to work within the constraints of their tools. I had a grand idea to create an 80s sitcom intro where every character was myself. You know how those intros always had someone looking at the camera and smiling? Or doing something wacky? Those scenes are always just a few seconds long, and they don’t really need to be consistent with each other since they’re usually super random anyway. So I started prompting. Here are a few examples that I really liked:
I made this revolving door shot of myself, and every time I see this video I’m just amazed by the look on his (my?) face. I mean, it doesn’t even look exactly like me, but I just think it’s so funny.
I like how the AI de-aged me and the setting really does look like an 80s television show.
I had AI make me talk on a corded phone while wearing a blazer with rolled-up sleeves. It doesn’t get more 80s than this! I really liked how the lighting worked in this one, where you can see the shadow cast by both the person and the phone cord itself!
I also made AI make me totally ripped at the beach. This actually closely resembles the training images that I fed to the AI, so it hardly had to do any work at all to make this video.
AI made me a doctor, which is something my parents can finally be proud of. I also made myself eat a sandwich, which I thought was an original idea until I realized I stole it from Weird Al’s Like a Surgeon video (or is it Weird AI?). I love this clip because the way he (I?) chews is so, so funny to me.
Anyway, here’s the whole video if you want to watch it!
The Music
Since I wanted to make an 80s sitcom intro, and all 80s sitcom intros have banger music, I was thinking I could write a banger myself. But then I realized that I’m not really good at writing songs or singing, and I only know how to play the trumpet, so I figured I’d just use AI for this, too. I’m honestly not a huge fan of AI music, as a musician myself, but I figured it would be appropriate given that the video is completely AI too.
I wrote the lyrics to my song myself, and then I used Suno to generate the music. It took quite a few generations and a lot of prompting with different styles to get a halfway decent song. I think Suno doesn’t really know how to write a unique melody.
My theory for this is that there’s probably some code or something embedded in the model that prevents Suno from using a copyrighted melody from a real song. So while it can come up with a decent chord progression and the instruments and vocals all sound real, the thing is so scared of getting sued that it won’t come up with a melody. Like, can you imagine if you asked Suno for a song and it just gave you Careless Whisper?
What’s Next?
I make a lot of videos, whether that’s for Hoagie’s YouTube, or my YouTube, or my other YouTube for stupid stuff. I genuinely like making videos, so it’s sometimes weird to me that the people making these AI videos think they’re going to replace everyone currently involved in video production. I know that “AI replacing everyone” is a trope, but another popular complaint is that AI is off making art while people are stuck doing repetitive, boring tasks, when it should be the other way around.
There have been a lot of complaints about AI slop too, whether that’s text slop or image slop or video slop. I mentioned before that I’ve seen so many cringey videos that people post thinking that they’ve made something incredible when it’s about the worst thing I’ve ever seen.
I think the problem is that when you make art, it can suck or it can be good, and as you learn the craft you eventually develop a sense of which one you’ve made. Usually people start off bad at it, and then they get better. Through that process, you develop taste and an understanding of what works and what doesn’t. The experience you get from putting in the time shapes your ability to filter your own work and be your own harshest critic, which is what helps you make art that’s actually worth sharing with people.
When someone prompts an AI to make an image and the image comes out sufficiently pleasing, they assume there’s skill involved on the same level as making that image from scratch. I don’t think that’s the case, and while I’m not even anti-AI, I do think there’s something to be said for learning a craft that you don’t get by asking an AI to do it (obviously).
I’m not saying that I’d never use AI as part of my usual workflow for making videos (in fact, I’ve used AI to clean up audio from my poor microphone conditions before, and it turned out pretty good). But I think AI is still at the point that CGI in movies was at some time in the late 90s. People will never stop talking about how bad the CGI was in The Scorpion King! And I’m sure there are uses of AI still to come that people will talk about in the same way! But at some point, maybe it’ll be good enough!