ACM 35th Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV’25)
31 March – 3 April 2025 | Stellenbosch, South Africa
Emanuele Artioli (Alpen-Adria Universität Klagenfurt, Austria), Farzad Tashtarian (Alpen-Adria Universität Klagenfurt, Austria), Christian Timmerer (Alpen-Adria Universität Klagenfurt, Austria)
Abstract: The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client.
This paper introduces ELVIS (End-to-end Learning-based Video Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, in-painting models, and quality metrics, making it adaptable to future innovations.
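As a rough illustration of the remove-then-reconstruct idea, the following toy sketch drops the least complex blocks of a frame on the server side and fills them back in on the client side. It is not the ELVIS implementation: the function names, the variance-based complexity score, and the `drop_ratio` parameter are all assumptions made for this example, and the mean-value fill is only a placeholder for a generative in-painting model.

```python
from statistics import mean, pvariance


def block_complexity(frame, block):
    """Score each block by pixel variance (an assumed stand-in for a real complexity metric)."""
    height, width = len(frame), len(frame[0])
    scores = {}
    for by in range(0, height, block):
        for bx in range(0, width, block):
            pixels = [
                frame[y][x]
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            ]
            scores[(by, bx)] = pvariance(pixels)
    return scores


def server_remove(frame, block, drop_ratio):
    """Server side: blank out the lowest-complexity blocks before encoding."""
    height, width = len(frame), len(frame[0])
    scores = block_complexity(frame, block)
    n_drop = int(len(scores) * drop_ratio)
    dropped = set(sorted(scores, key=scores.get)[:n_drop])
    reduced = [row[:] for row in frame]
    for by, bx in dropped:
        for y in range(by, min(by + block, height)):
            for x in range(bx, min(bx + block, width)):
                reduced[y][x] = None  # marks removed data
    return reduced, dropped


def client_inpaint(reduced):
    """Client side: fill removed pixels (placeholder for a generative in-painting model)."""
    known = [p for row in reduced for p in row if p is not None]
    fill = round(mean(known)) if known else 0
    return [[fill if p is None else p for p in row] for row in reduced]


# A 4x4 grayscale frame whose top-left 2x2 block is flat and therefore cheap to drop.
frame = [
    [10, 10, 0, 255],
    [10, 10, 255, 0],
    [0, 50, 90, 20],
    [100, 7, 3, 200],
]
reduced, dropped = server_remove(frame, block=2, drop_ratio=0.25)
restored = client_inpaint(reduced)
```

In a modular pipeline of the kind described above, the complexity score, the codec step between the two functions, and the in-painting function would each be swappable components.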
Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.
By leveraging generative AI, we aim to develop a client-side tool, intended for integration into a dedicated video streaming player, that combines the accessibility of multilingual dubbing with the authenticity of the original speaker’s performance, effectively allowing a single actor to deliver their voice in any language. To the best of our knowledge, no current streaming system can capture the speaker’s unique voice or emotional tone.
Index Terms— HTTP adaptive streaming, Generative AI, End-to-end architecture, Quality of Experience.