I developed a web application that automatically transcribes audio files and identifies different speakers in conversations using OpenAI’s Whisper model and speaker diarization technologies, creating a powerful tool for transcription needs.

Text to Audio Transcription with Speaker Diarization


This project combines modern web technologies with state-of-the-art AI models to create an easy-to-use tool for transcribing speech with speaker identification. Key features include:

  • Modern User Interface:
    • Clean, intuitive web interface with drag-and-drop file uploads
    • Real-time progress tracking during transcription processing
    • Responsive design that works on various devices
    • Support for 18 languages in the user interface
  • Powerful Transcription Engine:
    • Integration with OpenAI’s Whisper speech recognition models
    • Multiple model sizes (tiny, base, small, medium, large) to balance speed and accuracy
    • Speaker diarization to identify and separate different speakers in the conversation
    • Support for various audio formats (MP3, WAV, OGG, FLAC, M4A)
  • Flexible Output Options:
    • Download transcriptions in multiple formats (TXT, JSON, SRT, VTT)
    • Timestamped transcripts with speaker identification
    • Clean formatting for easy reading and post-processing

This tool significantly improves productivity for anyone working with audio transcription - from journalists and researchers to content creators and accessibility professionals.

The application is built using Flask for the backend, with a modern JavaScript frontend. It incorporates several advanced technologies including PyTorch, OpenAI Whisper, and Pyannote Audio for speaker diarization.


Audio Transcription Tool Interface Transcription Results Example

GitHub Repository