Are you looking to create engaging faceless short videos for platforms like YouTube or TikTok but want to avoid the hassle of complex video editing? This article will walk you through how to automate the entire process using OpenAI, ElevenLabs, and MoviePy.
By the end of this tutorial, you'll know how to automatically generate visuals and voiceovers for short videos based on any script. Whether you’re creating educational content, storytelling, or meme videos, this workflow will save you tons of time.
Prerequisites
Before getting started, you’ll need:
- API keys for both OpenAI (for generating visuals) and ElevenLabs (for voiceovers).
- Basic Python knowledge.
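- The required Python libraries installed: moviepy, openai, elevenlabs, and captacity (used for captions in Step 8). The code uses the MoviePy 1.x API (moviepy.editor), which was removed in MoviePy 2.0.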
Step 1: Setting Up API Keys
import openai
from elevenlabs import ElevenLabs
# Set up your OpenAI and ElevenLabs API keys
openai.api_key = "your_openai_api_key"
elevenlabs_client = ElevenLabs(api_key="your_elevenlabs_api_key")
Start by getting API keys from OpenAI and ElevenLabs. Replace the placeholders in the code with your actual API keys.
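Hard-coded keys are easy to leak through version control. As one possible alternative, the sketch below reads them from environment variables. The helper name read_api_key and the variable names OPENAI_API_KEY and ELEVENLABS_API_KEY are assumptions chosen for illustration, not part of the original project.

```python
import os


def read_api_key(env_var: str) -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the script.")
    return key


# Hypothetical usage with the clients from Step 1:
# openai.api_key = read_api_key("OPENAI_API_KEY")
# elevenlabs_client = ElevenLabs(api_key=read_api_key("ELEVENLABS_API_KEY"))
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.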
Step 2: Preparing the Script
Your video starts with a story or script. You can replace the story_script variable with the text you want to turn into a video. Here’s an example script about Dogecoin:
story_script = """
Dogecoin began as a joke in 2013, inspired by the popular 'Doge' meme featuring a Shiba Inu dog. It unexpectedly gained a massive following thanks to its community's charitable initiatives, eventually evolving into a legitimate cryptocurrency with support from Elon Musk.
"""
The script is split into sentences, and each sentence gets its own image and audio segment.
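The article does not show how the split is done, so the following is only a minimal sketch that assumes a simple regular expression is good enough. The function name split_into_sentences is hypothetical, and the pattern will also split after abbreviations such as "Dr.", so longer scripts may need a proper sentence tokenizer.

```python
import re
from typing import List


def split_into_sentences(script: str) -> List[str]:
    """Split a script into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", script.strip())
    # Drop empty pieces left over from blank input or stray whitespace
    return [part.strip() for part in parts if part.strip()]


story_script = """
Dogecoin began as a joke in 2013. It unexpectedly gained a massive following.
"""
sentences = split_into_sentences(story_script)
for idx, sentence in enumerate(sentences):
    print(idx, sentence)
```

Each index from enumerate can then serve as the idx argument used to name the image and audio files in the next steps.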
Step 3: Generating Images with OpenAI’s DALL-E
For each sentence, we generate a corresponding image using OpenAI’s DALL-E model.
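import base64
import os

def generate_image_from_text(sentence, context, idx):
    prompt = f"Generate an image without any text that describes: {sentence}. Context: {context}"
    response = openai.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1792",  # portrait format for vertical video
        response_format="b64_json"
    )
    os.makedirs("images", exist_ok=True)  # make sure the output folder exists
    # DALL-E returns PNG data, so the file is saved with a .png extension
    image_filename = f"images/image_{idx}.png"
    with open(image_filename, "wb") as f:
        # The image arrives as base64 text; decode it before writing to disk
        f.write(base64.b64decode(response.data[0].b64_json))
    return image_filename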
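This function sends each sentence to DALL-E and saves the generated image. The 1024x1792 size produces a portrait image suited to vertical short-form video, and the context argument (for example, the full script) helps keep the visuals consistent with the video's theme.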
Step 4: Generating Voiceovers with ElevenLabs
Once we have the visuals, we need voiceovers. ElevenLabs converts each sentence into speech.
import os
from elevenlabs import VoiceSettings

def generate_audio_from_text(sentence, idx):
    audio = elevenlabs_client.text_to_speech.convert(
        voice_id="pqHfZKP75CvOlQylNhV4",
        model_id="eleven_multilingual_v2",
        text=sentence,
        voice_settings=VoiceSettings(stability=0.2, similarity_boost=0.8)
    )
    os.makedirs("audio", exist_ok=True)  # make sure the output folder exists
    audio_filename = f"audio/audio_{idx}.mp3"
    with open(audio_filename, "wb") as f:
        # The response is streamed in chunks; write them out in order
        for chunk in audio:
            f.write(chunk)
    return audio_filename
This function generates an audio file for each sentence. You can select different voice settings to customize the narration style.
Step 5: Combining Audio and Video
Next, we pair each image with its corresponding voiceover using MoviePy:
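from moviepy.editor import ImageClip, AudioFileClip
# image_path and audio_path are the files generated for one sentence;
# video_clips is a list created before looping over the sentences
audio_clip = AudioFileClip(audio_path)
# Show the image for exactly as long as its narration lasts
image_clip = ImageClip(image_path, duration=audio_clip.duration)
image_clip = image_clip.set_audio(audio_clip)
video_clips.append(image_clip.set_fps(30))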
Each image is displayed for the duration of its audio clip, ensuring synchronization.
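To see how this plays out across a whole script, the sketch below computes where each segment would start and end once the clips are placed back to back. It uses plain Python rather than MoviePy, the function name plan_timeline is hypothetical, and the durations are made-up example values rather than real narration lengths.

```python
from typing import List, Tuple


def plan_timeline(audio_durations: List[float]) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for segments played back to back."""
    timeline = []
    start = 0.0
    for duration in audio_durations:
        end = start + duration
        timeline.append((start, end))
        start = end  # the next segment begins where this one ends
    return timeline


durations = [2.0, 3.5, 1.5]  # example narration lengths in seconds
for idx, (start, end) in enumerate(plan_timeline(durations)):
    print(f"segment {idx}: {start:.1f}s to {end:.1f}s")
```

The end time of the last segment is the total video length, which is handy for checking that a script fits the target platform's limit before spending API credits.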
Step 6: Applying Video Effects
To make the video more dynamic, we apply zoom and fade effects to each image. For example, the apply_zoom_in_center effect slowly zooms into the center of the image:
def apply_zoom_in_center(image_clip, duration):
    # The scale factor grows linearly by 4% per second of clip time
    return image_clip.resize(lambda t: 1 + 0.04 * t)
Other effects include zooming in from the upper part or zooming out. These effects are applied randomly to each clip to keep the video visually engaging.
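The other effects and the random selection are not shown in the article, so the following is a hedged sketch of how the scale curves and the random choice could look. The names ZOOM_RATE, zoom_in_scale, make_zoom_out_scale, and pick_scale_function are assumptions, and only zoom-in and zoom-out are covered; zooming from the upper part would also need the clip to be repositioned.

```python
import random
from typing import Callable

ZOOM_RATE = 0.04  # same 4%-per-second rate as apply_zoom_in_center


def zoom_in_scale(t: float) -> float:
    """Scale factor that starts at 1.0 and grows as the clip plays."""
    return 1 + ZOOM_RATE * t


def make_zoom_out_scale(duration: float) -> Callable[[float], float]:
    """Build a scale curve that starts enlarged and returns to 1.0 at the end."""
    def scale(t: float) -> float:
        return 1 + ZOOM_RATE * (duration - t)
    return scale


def pick_scale_function(duration: float, rng: random.Random) -> Callable[[float], float]:
    """Choose one of the zoom curves at random for a clip of the given length."""
    return rng.choice([zoom_in_scale, make_zoom_out_scale(duration)])


# With MoviePy 1.x, the chosen curve could be passed to resize, for example:
# image_clip = image_clip.resize(pick_scale_function(image_clip.duration, random.Random()))
```

Passing in a random.Random instance makes the choice reproducible when a seed is supplied, which helps when re-rendering the same video.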
Step 7: Final Video Assembly
We concatenate all clips into a single video and write it to disk:
from moviepy.editor import concatenate_videoclips
final_video = concatenate_videoclips(video_clips, method="compose")
final_video.write_videofile(output_video_path, codec="libx264", audio_codec="aac", fps=30)
Step 8: Adding Captions
Captions improve video accessibility and engagement. We use Captacity to automatically add captions based on the audio.
import captacity

# Reads the video written in Step 7 and saves a captioned copy
captacity.add_captions(
    video_file=output_video_path,
    output_file="captioned_video.mp4",
    font_size=130,
    font_color="yellow",
    stroke_width=3
)
Step 9: Adding Background Music
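To finish the video, background music is added. The volume is reduced so that it doesn't overpower the narration. Because set_audio returns a new clip instead of modifying the original, the result must be reassigned, and the video must be written out again for the music to appear in the file. The subclip call also assumes the music track is at least as long as the video.
from moviepy.editor import AudioFileClip, CompositeAudioClip

# Trim the music to the video length and lower it to 20% volume
background_music = AudioFileClip(music_filename).subclip(0, final_video.duration).volumex(0.2)
narration_audio = final_video.audio.volumex(1.5)
combined_audio = CompositeAudioClip([narration_audio, background_music])
# set_audio returns a new clip, so keep the result
final_video = final_video.set_audio(combined_audio)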
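To build intuition for the 0.2 and 1.5 volume factors, the sketch below mixes two lists of samples by hand. It is a conceptual illustration only, not MoviePy's actual implementation: the function name mix_samples is hypothetical, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
from typing import List


def mix_samples(narration: List[float], music: List[float],
                narration_gain: float = 1.5, music_gain: float = 0.2) -> List[float]:
    """Scale both tracks, add them sample by sample, and clamp to [-1.0, 1.0]."""
    mixed = []
    for n, m in zip(narration, music):
        value = narration_gain * n + music_gain * m
        # Values outside the valid range would distort, so clamp them
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed


quiet = mix_samples([0.5], [0.5])   # stays inside the valid range
loud = mix_samples([0.9], [1.0])    # exceeds the range and gets clamped
print(quiet, loud)
```

The second call shows why boosting narration by 1.5 deserves a listen-through: already loud narration can hit the ceiling once the music is added on top.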
Conclusion
The GitHub repository for this project is available here.
By using OpenAI and ElevenLabs, we’ve automated the creation of faceless videos from text. You can now quickly generate YouTube Shorts or TikToks without needing a camera or microphone.
This automated process has allowed us to create a Faceless Shorts Video service on our Robopost software, offering content creators a seamless way to produce high-quality videos. Whether you are creating educational videos, short stories, or even meme-style content, this service handles everything from visuals to voiceovers with minimal effort.
Now, you can focus on creativity and storytelling while Robopost handles the heavy lifting of video production. Happy creating!
