@0xmuon 0xmuon commented Apr 13, 2025

Below are the 8 important Audio Generation and Speech Generation models that are missing from the templates section and should be added; at the moment no speech/audio model exists there:

What this PR brings to the templates section

Note: I have added the proper tags and recommended on Discord that these templates be added to the website (https://dashboard.nosana.com/deploy) for proper syncing.

1. Meta MusicGen Medium

A state-of-the-art text-to-music generation model that transforms text descriptions into high-quality music samples.

Key Features:

  • 1.5B parameter model
  • 32kHz audio output
  • Single-stage auto-regressive Transformer architecture
  • Trained on 20K hours of licensed music data
  • Supports various music styles and genres

Technical Specs:

  • Sampling Rate: 32kHz
  • Token Sampling Rate: 50 Hz
  • GPU Requirements: 16GB+ VRAM
  • Performance Metrics:
    • Frechet Audio Distance: 5.14
    • Kullback-Leibler Divergence: 1.38
    • Text Consistency: 0.28
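
For reference, a minimal generation sketch with this checkpoint via the Hugging Face transformers library; the prompt text and token budget below are illustrative only:

    from transformers import AutoProcessor, MusicgenForConditionalGeneration
    import scipy.io.wavfile

    # Load the medium checkpoint (facebook/musicgen-medium on the Hugging Face Hub)
    processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

    # Illustrative prompt; 256 new tokens is roughly 5 seconds at the 50 Hz token rate
    inputs = processor(text=["lo-fi beat with a mellow piano melody"], padding=True, return_tensors="pt")
    audio = model.generate(**inputs, max_new_tokens=256)

    rate = model.config.audio_encoder.sampling_rate  # 32000 Hz
    scipy.io.wavfile.write("musicgen_sample.wav", rate=rate, data=audio[0, 0].cpu().numpy())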

2. Microsoft Phi-4 Multimodal

A lightweight open multimodal foundation model supporting text, image, and audio inputs.

Key Features:

  • 128K token context window
  • Multi-modal input support
  • Lightweight architecture
  • Comprehensive instruction tuning

Technical Specs:

  • Input Types: Text, images, audio
  • Context Length: 128K tokens
  • GPU Requirements: 16GB+ VRAM
  • API Endpoint: Port 9000
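
If the template serves the model behind an OpenAI-compatible endpoint on port 9000 as listed above, a request might look like the sketch below; the path, model name, and payload fields follow the OpenAI chat-completions convention and are assumptions rather than the template's confirmed schema:

    import requests

    # Hypothetical OpenAI-compatible chat-completions request; host, port,
    # and model name mirror the specs above and are not confirmed values.
    resp = requests.post(
        "http://localhost:9000/v1/chat/completions",
        json={
            "model": "microsoft/Phi-4-multimodal-instruct",
            "messages": [{"role": "user", "content": "Summarize multimodal models in one sentence."}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])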

3. OpenAI Whisper Large V3 Turbo

Type: Speech Recognition

  • Supports 99 languages
  • Optimized for faster inference
  • Speech recognition and translation
  • Categories: Audio Generation, Multimodal, New, Speech Generation, API
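
A quick transcription check with this checkpoint through the transformers pipeline (the audio path is a placeholder):

    from transformers import pipeline

    # Automatic speech recognition with the turbo checkpoint; "sample.wav" is a placeholder path
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
    result = asr("sample.wav", return_timestamps=True)
    print(result["text"])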

4. MIT AST Speech Commands v2

Type: Audio Classification

  • 98.12% accuracy on keyword spotting
  • Fine-tuned on Speech Commands v2 dataset
  • Audio Spectrogram Transformer architecture
  • Categories: Audio Classification, API, New
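
Keyword spotting with this checkpoint can be sanity-checked through the audio-classification pipeline (the clip path is a placeholder):

    from transformers import pipeline

    # Audio Spectrogram Transformer fine-tuned on Speech Commands v2
    classifier = pipeline("audio-classification", model="MIT/ast-finetuned-speech-commands-v2")
    print(classifier("keyword.wav", top_k=3))  # top 3 predicted keywords with scores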

5. NVIDIA BigVGAN v2

Type: Neural Vocoder

  • 44kHz sampling rate
  • Universal vocoder architecture
  • High-quality audio generation
  • Categories: API, New, Audio Generation

6. MIT Audio Spectrogram Transformer

Type: Audio Classification

  • AudioSet category classification
  • Transformer architecture
  • State-of-the-art performance
  • Categories: Audio Classification, API, New

7. Coqui XTTS-v2

Type: Text-to-Speech

  • Supports 17 languages
  • Voice cloning from a 6-second reference audio clip
  • Multilingual capabilities
  • Categories: Speech Generation, API, New
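
Voice cloning with this model through the Coqui TTS Python API follows the pattern on its model card; file paths and the prompt text below are placeholders:

    from TTS.api import TTS

    # Load XTTS-v2 and clone a voice from a short reference clip
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text="Hello, this is a cloned voice speaking.",
        speaker_wav="reference_6s.wav",  # ~6-second reference recording (placeholder path)
        language="en",
        file_path="xtts_output.wav",
    )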

8. F5-TTS

Type: Text-to-Speech

  • Flow matching architecture
  • Expressive voice synthesis
  • Natural-sounding output
  • Categories: API, Speech Generation, New

Implementation Details

Template Structure

Each template includes:

  • info.json: Metadata and categorization
  • job-definition.json: Deployment configuration
  • README.md: Comprehensive documentation
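
As a rough illustration of that layout, a minimal check that every template folder ships all three files could look like the sketch below; the templates/ directory name is hypothetical, and the repository's own npm run validate remains the source of truth:

    from pathlib import Path

    REQUIRED = ("info.json", "job-definition.json", "README.md")

    def missing_files(folder: Path) -> list[str]:
        """Return the required template files that are absent from a folder."""
        return [name for name in REQUIRED if not (folder / name).is_file()]

    # Hypothetical layout: one sub-folder per template under templates/
    for template_dir in sorted(p for p in Path("templates").iterdir() if p.is_dir()):
        missing = missing_files(template_dir)
        if missing:
            print(f"{template_dir.name}: missing {', '.join(missing)}")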

API Standardization

  • OpenAI-compatible API where applicable
  • Standardized error handling
  • Comprehensive documentation
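
Where a template does expose an OpenAI-compatible surface, a text-to-speech call could be sketched as below; the endpoint path and payload fields mirror OpenAI's audio/speech convention and are assumptions, not the templates' confirmed schema:

    import requests

    # Hypothetical OpenAI-style text-to-speech request; base URL, model name,
    # and voice are placeholders, not values taken from the templates.
    resp = requests.post(
        "http://localhost:9000/v1/audio/speech",
        json={"model": "xtts-v2", "input": "Welcome to Nosana.", "voice": "default"},
        timeout=120,
    )
    with open("speech.wav", "wb") as f:
        f.write(resp.content)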

Impact

Enhanced Capabilities

  • Advanced music generation
  • Multimodal AI processing
  • Improved speech recognition
  • Better audio classification
  • Enhanced text-to-speech

@0xmuon 0xmuon (Author) commented Apr 18, 2025

Hey maintainers, if any changes are required, such as reducing the number of template models or adding audio/speech-related tags to validate.js, please let me know.

@djmbritt djmbritt (Member) commented:

Please run npm run validate and make sure that all issues have been resolved.
Also, provide evidence that these templates work: links and screenshots.
