
Wav2Lip iOS Project

A UNCC Ai4Health Research Project

This project leverages the Wav2Lip AI model to create a comprehensive tool for lip-syncing video content. Wav2Lip generates realistic lip movements synchronized with a given audio track. For integration into iOS applications, the model has been converted to CoreML using ONNX, ensuring efficient performance on Apple devices. The project is public and aims to provide an end-to-end solution for video processing, audio extraction, and lip-syncing.

Detailed Project Description: 
Deploying Wav2Lip: An AI Model Converted to CoreML

Project Components: 

1. Wav2Lip Conversion

CoreML Conversion: The original Wav2Lip model, typically in PyTorch, is converted to CoreML using ONNX (Open Neural Network Exchange). This conversion ensures compatibility with Apple's machine learning framework, allowing efficient on-device processing.
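
As a rough illustration, the sketch below shows how the converted model might be loaded and run on-device with CoreML. The compiled resource name Wav2Lip.mlmodelc and the feature names "audio", "video_frames", and "output" are assumptions about the generated interface, not identifiers confirmed by the project.

```swift
import CoreML

/// A minimal sketch of on-device inference with the converted model.
/// Resource and feature names below are assumptions, not the project's actual identifiers.
func runWav2Lip(melSpectrogram: MLMultiArray, faceFrames: MLMultiArray) throws -> MLMultiArray {
    // Let CoreML pick the best compute units (Neural Engine, GPU, or CPU).
    let config = MLModelConfiguration()
    config.computeUnits = .all

    guard let modelURL = Bundle.main.url(forResource: "Wav2Lip", withExtension: "mlmodelc") else {
        throw NSError(domain: "Wav2Lip", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "Model not found in bundle"])
    }
    let model = try MLModel(contentsOf: modelURL, configuration: config)

    // Wrap the preprocessed audio and face frames as model inputs.
    let inputs = try MLDictionaryFeatureProvider(dictionary: [
        "audio": melSpectrogram,
        "video_frames": faceFrames
    ])
    let prediction = try model.prediction(from: inputs)

    guard let output = prediction.featureValue(for: "output")?.multiArrayValue else {
        throw NSError(domain: "Wav2Lip", code: 2,
                      userInfo: [NSLocalizedDescriptionKey: "Unexpected model output"])
    }
    return output
}
```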

2. Wav2Lip Setup

The setup involves installing Homebrew and Pyenv, creating a Python virtual environment, and installing necessary dependencies. Users clone the Wav2Lip repository and download the pre-trained model weights. This ensures that the environment is ready to run the model and generate lip-synced videos.

3. Video Processor

  • Capture and Frame Extraction: The app utilizes AVFoundation to capture video frames. Frames are extracted using AVAssetReader and CMSampleBuffer.

  • Preprocessing: Frames are preprocessed to ensure compatibility with the Wav2Lip model. This includes resizing and normalizing the images, typically done using CoreImage or Metal for efficient GPU-accelerated processing (see the sketch below).
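
The sketch below illustrates this path under a couple of assumptions: the model consumes 96×96 face crops (the input size used by the original Wav2Lip), and face detection and pixel normalization are left out for brevity.

```swift
import AVFoundation
import CoreImage

/// A minimal sketch: reads frames from a video file and scales each one to the
/// model's assumed 96x96 input size. Face detection and normalization are omitted.
func extractAndResizeFrames(from url: URL,
                            targetSize: CGSize = CGSize(width: 96, height: 96)) throws -> [CIImage] {
    let asset = AVURLAsset(url: url)
    guard let videoTrack = asset.tracks(withMediaType: .video).first else {
        throw NSError(domain: "VideoProcessor", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "No video track found"])
    }

    let reader = try AVAssetReader(asset: asset)
    let output = AVAssetReaderTrackOutput(
        track: videoTrack,
        outputSettings: [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA]
    )
    reader.add(output)
    reader.startReading()

    var frames: [CIImage] = []
    // Pull decoded frames one CMSampleBuffer at a time and rescale them.
    while let sampleBuffer = output.copyNextSampleBuffer(),
          let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
        let image = CIImage(cvPixelBuffer: pixelBuffer)
        let scaleX = targetSize.width / image.extent.width
        let scaleY = targetSize.height / image.extent.height
        frames.append(image.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY)))
    }
    return frames
}
```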

4. Audio Extractor

  • Extraction: The app uses AVFoundation to extract audio from video files. AVAssetReader and AVAssetTrack are employed to read audio tracks.

  • Processing: Extracted audio is processed to match the input requirements of the Wav2Lip model, involving resampling and normalization. This ensures the audio is in the correct format and sample rate for accurate lip-syncing (see the sketch below).
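
The sketch below covers the extraction step only: it decodes the audio track to uncompressed PCM, and assumes that resampling to the model's expected rate and mel-spectrogram conversion happen in a later processing step.

```swift
import AVFoundation

/// A minimal sketch of the extraction step: decodes the first audio track of a
/// video into Float32 linear PCM sample buffers. Resampling and normalization
/// to the model's expected format would follow as a separate processing step.
func extractAudioBuffers(from url: URL) throws -> [CMSampleBuffer] {
    let asset = AVURLAsset(url: url)
    guard let audioTrack = asset.tracks(withMediaType: .audio).first else {
        throw NSError(domain: "AudioExtractor", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "No audio track found"])
    }

    let reader = try AVAssetReader(asset: asset)
    // Decode the (possibly compressed) track to uncompressed Float32 PCM.
    let outputSettings: [String: Any] = [
        AVFormatIDKey: kAudioFormatLinearPCM,
        AVLinearPCMBitDepthKey: 32,
        AVLinearPCMIsFloatKey: true,
        AVLinearPCMIsBigEndianKey: false,
        AVLinearPCMIsNonInterleaved: false
    ]
    let output = AVAssetReaderTrackOutput(track: audioTrack, outputSettings: outputSettings)
    reader.add(output)
    reader.startReading()

    var buffers: [CMSampleBuffer] = []
    while let sampleBuffer = output.copyNextSampleBuffer() {
        buffers.append(sampleBuffer)
    }
    return buffers
}
```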

5. Video Picker

  • File Selection: A user-friendly interface built with UIKit allows users to browse and select video files from their device. The UIDocumentPickerViewController is used for this purpose.

  • Playback: Selected videos can be previewed using AVKit. The app leverages AVPlayerViewController to provide a seamless video playback experience, allowing users to verify their selection before processing (see the sketch below).
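
A rough sketch of this flow is shown below, assuming iOS 14 or later for the content-type based document picker initializer; the class name VideoPickerViewController is illustrative rather than taken from the project.

```swift
import UIKit
import AVKit
import UniformTypeIdentifiers

/// A minimal sketch of the picker flow: choose a movie file, then preview it
/// with AVPlayerViewController. Class and method names are illustrative.
final class VideoPickerViewController: UIViewController, UIDocumentPickerDelegate {

    func presentVideoPicker() {
        // Restrict the picker to movie files on the device or in iCloud Drive.
        let picker = UIDocumentPickerViewController(forOpeningContentTypes: [.movie])
        picker.delegate = self
        present(picker, animated: true)
    }

    func documentPicker(_ controller: UIDocumentPickerViewController,
                        didPickDocumentsAt urls: [URL]) {
        guard let url = urls.first else { return }
        // Files opened through the document picker are security-scoped;
        // stopAccessingSecurityScopedResource() should be called when done.
        guard url.startAccessingSecurityScopedResource() else { return }
        preview(videoAt: url)
    }

    private func preview(videoAt url: URL) {
        // AVPlayerViewController provides standard playback controls for the preview.
        let playerController = AVPlayerViewController()
        playerController.player = AVPlayer(url: url)
        present(playerController, animated: true) {
            playerController.player?.play()
        }
    }
}
```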

6. Content Viewer

  • Display: The content viewer presents the processed, lip-synced video within the app once processing is complete.

  • Playback: Output videos are played back using AVKit's AVPlayerViewController, allowing users to review the quality of the lip synchronization.

What is Wav2Lip?

Wav2Lip is an AI model designed to generate realistic lip movements synchronized with an audio track. It uses deep learning techniques to match the lip movements of a speaker in a video with the given audio. This technology is ideal for applications such as dubbing, video editing, and creating engaging content. By converting Wav2Lip to CoreML, the project ensures it runs efficiently on iOS devices, leveraging Apple's powerful machine learning capabilities.

Conclusion: 

This project provides a comprehensive iOS app for video processing and lip-syncing using the Wav2Lip model. Each component, built with Swift and utilizing Apple's frameworks, ensures users can easily set up the environment, process video and audio files, and achieve high-quality lip synchronization on their iOS devices.
