Open-source AI Transcription Tool for Long Audio

Qwen3-ASR-Toolkit is an open-source command-line audio transcription tool that uses Alibaba’s Qwen-ASR API to convert speech to text with remarkable accuracy across multiple languages.

The toolkit works around the API’s 3-minute limit per request by splitting each file into chunks shorter than three minutes and transcribing them in parallel. This lets you transcribe hours of content in a single run.
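As a rough illustration of the split-and-parallelize idea, the Python sketch below plans chunks of at most 180 seconds and runs a placeholder transcription function over them with a thread pool. The names plan_chunks, fake_transcribe, and transcribe_all are hypothetical and are not taken from the toolkit. The sketch uses fixed-length cuts for simplicity, and a real run would cut the audio and call the Qwen-ASR API where fake_transcribe is.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK_SECONDS = 180  # the 3-minute API limit mentioned above


def plan_chunks(total_seconds, max_chunk=MAX_CHUNK_SECONDS):
    """Return (start, end) pairs covering the audio in pieces of at most max_chunk seconds."""
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def fake_transcribe(chunk):
    """Hypothetical stand-in for one API call; returns a label instead of real text."""
    start, end = chunk
    return f"[text for {start:.0f}s-{end:.0f}s]"


def transcribe_all(total_seconds, workers=4, transcribe=fake_transcribe):
    """Transcribe every chunk in parallel and join the pieces in their original order."""
    chunks = plan_chunks(total_seconds)
    # pool.map yields results in input order, even if a later chunk finishes first.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return " ".join(pool.map(transcribe, chunks))
```

Keeping the results in input order matters more than finishing order here, because the final transcript has to read in the same sequence as the recording.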


Features

Unlimited Audio Length Processing: Bypasses the official API’s 3-minute restriction through intelligent chunking and parallel processing.
Voice Activity Detection (VAD): Analyzes the audio stream to find natural pauses and splits there, so chunks are not cut in the middle of a word or sentence.
Multi-threaded Parallel Processing: Sends multiple audio chunks to the API simultaneously, dramatically reducing total processing time for long files.
Automatic Hallucination Removal: Post-processes transcriptions to detect and eliminate common ASR artifacts and repetitive text patterns.
Universal Format Support: Handles virtually any audio or video format, including MP4, MOV, MKV, MP3, WAV, and M4A through FFmpeg integration.
Automatic Audio Resampling: Converts input audio to the required 16 kHz mono format without manual preprocessing.
Remote URL Processing: Transcribes audio files directly from web URLs without requiring local downloads.
Context-aware Recognition: Accepts custom context strings to improve accuracy for domain-specific terminology and proper names.
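To make the VAD-based chunking above more concrete, here is a minimal Python sketch of how cut points might be chosen once silent intervals are known. It assumes a VAD has already produced a sorted list of (start, end) silences in seconds. The function name choose_split_points and this input format are hypothetical and do not describe the toolkit's internals.

```python
def choose_split_points(silences, total_seconds, max_chunk=180.0):
    """Pick cut times inside silent gaps so that no chunk exceeds max_chunk seconds.

    silences: sorted list of (start, end) silent intervals in seconds,
    in the form a VAD might report them (hypothetical input format).
    """
    midpoints = [(start + end) / 2 for start, end in silences]
    cuts = []
    chunk_start = 0.0
    while total_seconds - chunk_start > max_chunk:
        limit = chunk_start + max_chunk
        candidates = [m for m in midpoints if chunk_start < m <= limit]
        # Prefer the latest pause before the limit; fall back to a hard cut
        # at the limit when the speaker never pauses within the window.
        cut = candidates[-1] if candidates else limit
        cuts.append(cut)
        chunk_start = cut
    return cuts
```

Choosing the latest pause inside each window keeps chunks close to the maximum length, which means fewer API calls, while still cutting only where nobody is speaking.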

Use Cases

Academic Research: Transcribing lengthy interviews, focus groups, or lecture recordings that often exceed standard API limits. Researchers can process hours of qualitative data without manual intervention.
Podcast Production: Converting entire podcast episodes into searchable text for show notes, blog posts, or accessibility compliance. The parallel processing significantly reduces turnaround time.
Corporate Documentation: Transcribing board meetings, training sessions, or client calls where accuracy and efficiency are paramount. The context feature helps with company-specific terminology.
Content Creation: Converting video content to text for subtitle generation, blog post creation, or SEO optimization. Media files reachable at a direct URL can be processed without downloading them first.
Legal and Medical Applications: Processing depositions, patient interviews, or consultation recordings where precision matters. The multi-language support accommodates diverse client bases.

How to Use It

1. Install Python 3.8 or higher and FFmpeg on your system. Ubuntu users can install FFmpeg with sudo apt update && sudo apt install ffmpeg, while macOS users can use brew install ffmpeg. Windows users need to download FFmpeg from the official website and add it to their system PATH.

2. Obtain a DashScope API key from Alibaba Cloud’s DashScope Console. For security and convenience, set this as an environment variable.

On Linux and macOS, add export DASHSCOPE_API_KEY="your_api_key_here" to your shell profile. Windows users can set this through the system environment variables interface.

3. Install the toolkit directly from PyPI using pip install qwen3-asr-toolkit. This adds the qwen3-asr command to your Python environment. Alternatively, clone the GitHub repository and install from source for the latest development version.

4. The simplest transcription command is qwen3-asr -i "/path/to/your/audio.mp4". This processes a local file using default settings with 4 parallel threads. For remote files, use qwen3-asr -i "https://example.com/audio.mp3" to process URLs directly.

5. Increase processing speed with more threads using qwen3-asr -i "audio.wav" -j 8 for 8 parallel threads. Improve accuracy for specialized content by providing context: qwen3-asr -i "tech_talk.mp4" -c "Python, machine learning, neural networks". Use silence mode with the -s flag to suppress progress output; the transcript file is still written.

6. The tool automatically saves transcriptions to a text file in the same directory as the input file, using the same base filename with a .txt extension.
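The steps above can be tied together in a small Python wrapper, which may be handy when transcribing many files from a script. This is a sketch under stated assumptions: the helper names check_setup, build_command, expected_output, and transcribe are hypothetical, and the only command-line flags used are the ones described in the steps (-i, -j, -c, -s). The expected_output helper models local files only, since the steps do not say where transcripts of remote URLs are saved.

```python
import os
import shutil
import subprocess
import sys
from pathlib import Path


def check_setup(env=os.environ, min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or higher is required" % min_python)
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if shutil.which("qwen3-asr") is None:
        problems.append("qwen3-asr was not found; try: pip install qwen3-asr-toolkit")
    if not env.get("DASHSCOPE_API_KEY", "").strip():
        problems.append("DASHSCOPE_API_KEY is not set")
    return problems


def build_command(source, threads=None, context=None, silent=False):
    """Assemble the argument list using only the flags described in the steps above."""
    cmd = ["qwen3-asr", "-i", source]
    if threads is not None:
        cmd += ["-j", str(threads)]
    if context:
        cmd += ["-c", context]
    if silent:
        cmd.append("-s")
    return cmd


def expected_output(local_path):
    """Per step 6: same directory and base filename, with a .txt extension."""
    return Path(local_path).with_suffix(".txt")


def transcribe(source, **options):
    """Verify the setup, run the CLI, and return where the transcript should be."""
    problems = check_setup()
    if problems:
        raise RuntimeError("; ".join(problems))
    subprocess.run(build_command(source, **options), check=True)
    return expected_output(source)
```

For example, transcribe("tech_talk.mp4", threads=8, context="Python, machine learning, neural networks") would run the same command as in step 5 and return the path tech_talk.txt. Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.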

Pros

No Duration Limitations: Processes audio files of any length without requiring manual splitting or preprocessing.
Exceptional Speed: Multi-threaded processing dramatically reduces transcription time compared to sequential processing methods.
High Accuracy: Leverages Alibaba’s advanced Qwen-ASR model trained on millions of hours of diverse audio data.
Smart Segmentation: VAD technology ensures natural break points, maintaining sentence integrity and readability.
Format Flexibility: Supports virtually any audio or video format through robust FFmpeg integration.
Automated Quality Control: Built-in post-processing removes common transcription artifacts and repetitive patterns.
Simple Installation: Single command installation from PyPI with straightforward dependency management.
Cost Effective: Utilizes Alibaba’s competitive API pricing while maximizing efficiency through parallel processing.
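The post-processing described under Automated Quality Control is not documented in detail here, but one common kind of ASR artifact is a word or short phrase repeated many times in a row. The Python sketch below shows one simple way such loops could be collapsed. The function collapse_repeats and its thresholds are assumptions made for illustration, not the toolkit's actual cleanup logic.

```python
def collapse_repeats(text, max_phrase_words=5, min_repeats=3):
    """Collapse a word or short phrase repeated back-to-back min_repeats or more times."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        collapsed = False
        for n in range(1, max_phrase_words + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break  # not enough words left for a phrase of this length
            count = 1
            # Count how many times the phrase repeats immediately after itself.
            while words[i + count * n:i + (count + 1) * n] == phrase:
                count += 1
            if count >= min_repeats:
                out.extend(phrase)  # keep a single copy
                i += count * n
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

Requiring at least three repetitions leaves natural doubles such as "very very" untouched, at the cost of missing shorter loops.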

Cons

API Dependency: Requires an active internet connection and DashScope API access, making it unsuitable for offline environments.
Cloud-based Processing: Audio data is sent to Alibaba’s servers, which may not comply with certain data privacy requirements.
Limited Customization: Cannot fine-tune the underlying model for specific domains or accents beyond context hints.
FFmpeg Dependency: Requires separate installation and configuration of FFmpeg, which can be challenging for non-technical users.
No Real-time Processing: Designed for batch processing rather than live transcription applications.

Related Resources

DashScope Console: The official platform for managing API keys and monitoring usage quotas for Qwen services.
Qwen3 ASR: Learn more about Alibaba’s Qwen3 ASR model.
Qwen3-ASR-Demo: Try the Qwen3-ASR model on Hugging Face.
FFmpeg Documentation: Official guide for audio and video format conversion, useful for preprocessing media files.
Alibaba Cloud Model Studio: Official documentation for all Qwen model APIs and integration examples.

FAQs

Q: Is the Qwen3-ASR-Toolkit completely free?
A: The toolkit itself is free and open-source. However, it requires a DashScope API key, and the API has its own pricing and usage limits.

Q: What languages does the Qwen-ASR API support?
A: The Qwen-ASR API supports 11 languages: English, Chinese, Arabic, French, German, Spanish, Italian, Portuguese, Russian, Japanese, and Korean.

Q: How does the context feature work?
A: You can provide a list of specific terms, names, or acronyms using the -c flag. This helps the ASR model recognize them more accurately in the audio.

Q: What happens if some chunks fail during parallel processing?
A: The toolkit includes error handling for failed API calls. If a chunk fails, the toolkit retries the request for that chunk.
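The toolkit's exact retry policy is not described here. As a general illustration of retrying a failed chunk, the Python sketch below retries a call a fixed number of times with exponential backoff. The function call_with_retry, the attempt count, and the delays are all assumptions chosen for the example.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure, wait and try again, up to `attempts` tries in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower error types
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, then 2x, then 4x, and so on.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Because each chunk is an independent request, a retry like this only repeats the failed chunk and leaves the chunks that already succeeded alone.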