Student Works

Reverb – AI in Podcast

Shikha Shah

MDes 2023

Metamorphosis of One-way Audio into Dynamic and Interactive Conversations through Conversational AI

Introduction

In the contemporary digital age, encouraging curiosity through conversational elements in audio media has become a pivotal focus, especially given the predominantly monologic nature of current audio content such as audiobooks and podcasts. Listening is typically passive and often paired with other tasks such as driving or cooking, which leaves the listener unable to look up information or jot down notes about discussed topics or unfamiliar terms. Prior art already uses Artificial Intelligence (AI) tools for content generation and summarization (startups like Descript and Speechify) and for AI character generation (groups like the Human AI Interaction team at the MIT Media Lab). Even so, a gap persists in creating a truly interactive and conversational podcast experience.

The crux of the problem is making podcasts interactive and conversational, and designing interaction methods that enhance accessibility, especially in situations where multitasking temporarily mimics a disability (pseudo-disabled instances). The project combines Voice Cloning and Text-to-Speech, Speech-to-Text conversion through the Whisper API, the ChatGPT API for text summarization and note-taking, and LlamaIndex to retain the information context. Together, these methods aim to transform monologic audio content into a dialogic, conversational format, enriching the user experience by making it more engaging and informative.

Final Design

The final Reverb design focuses on four pillars aligned to key user needs: Conversations, Note-taking, Summarization and Cross-Referencing of content. Natural language processing capabilities facilitate free-flowing dialogues with a virtual podcast host. Note-taking features allow easy capture of key moments. Summarization provides quick recaps of podcast segments. Embeddings index the podcast content library so that the AI can cross-reference relevant episodes. Beyond the technical implementation, careful prompt design categorizes queries to simplify usage, and voice cloning aims to produce natural-sounding responses, which are vital for an immersive experience.
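The cross-referencing pillar can be illustrated with a minimal sketch. It is not the project's implementation: the `embed` function below is a toy bag-of-words stand-in for a learned embedding model, and the function names and the `EPISODES` data are hypothetical, introduced only for illustration. The idea it shows is the general one of ranking episodes by the similarity between a listener's question and each episode's indexed text.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_reference(query, episodes, top_k=2):
    """Return titles of the episodes most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), title) for title, text in episodes.items()]
    scored.sort(reverse=True)
    # Drop episodes that share nothing with the query.
    return [title for score, title in scored[:top_k] if score > 0]

# Hypothetical episode descriptions, for illustration only.
EPISODES = {
    "Episode 1": "voice cloning and synthetic speech ethics",
    "Episode 2": "sourdough baking and fermentation basics",
    "Episode 3": "speech recognition models for podcast transcripts",
}

if __name__ == "__main__":
    print(cross_reference("how does speech recognition work", EPISODES))
```

In a real system the count vectors would be replaced by embeddings from a trained model, so that episodes can match a question on meaning rather than on shared words.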

Ultimately, realizing the promise of enhanced interactivity and curiosity through Conversational AI requires balancing captivating user experiences against ethical safeguards for the potential harms of emerging synthetic voice and media technologies. Holding innovation and responsibility in focus together remains an evolving, multifaceted challenge in steering such advancements towards societal good.

Reducing the lag in conversational flow poses meaningful technical challenges but promises much more natural interactions once achieved. This thesis also presents a vision for future applications, notably in distance learning. Imagine an environment where students listening to prerecorded online lectures can interact, asking questions as though they were physically present in the classroom, or taking notes directly from the video.
