@0xmuon 0xmuon commented Apr 13, 2025

Below are the 8 important Audio Generation and Speech Generation models that are missing from the templates section and should be added; at the moment no speech/audio model exists there:

What this PR brings to the templates section

Note: I have added the proper tags and recommended on Discord that these templates be added to the website (https://dashboard.nosana.com/deploy) for proper syncing.

1. Meta MusicGen Medium

A state-of-the-art text-to-music generation model that transforms text descriptions into high-quality music samples.

Key Features:

  • 1.5B parameter model
  • 32kHz audio output
  • Single-stage auto-regressive Transformer architecture
  • Trained on 20K hours of licensed music data
  • Supports various music styles and genres

Technical Specs:

  • Sampling Rate: 32kHz
  • Token Sampling Rate: 50 Hz
  • GPU Requirements: 16GB+ VRAM
  • Performance Metrics:
    • Frechet Audio Distance: 5.14
    • Kullback-Leibler Divergence: 1.38
    • Text Consistency: 0.28
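
For reference, a minimal generation sketch with this checkpoint via the Hugging Face transformers library; the prompt text and token budget below are illustrative only:

    from transformers import AutoProcessor, MusicgenForConditionalGeneration
    import scipy.io.wavfile

    # Load the medium checkpoint (facebook/musicgen-medium on the Hugging Face Hub)
    processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

    # Illustrative prompt; 256 new tokens is roughly 5 seconds at the 50 Hz token rate
    inputs = processor(text=["lo-fi beat with a mellow piano melody"], padding=True, return_tensors="pt")
    audio = model.generate(**inputs, max_new_tokens=256)

    rate = model.config.audio_encoder.sampling_rate  # 32000 Hz
    scipy.io.wavfile.write("musicgen_sample.wav", rate=rate, data=audio[0, 0].cpu().numpy())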

2. Microsoft Phi-4 Multimodal

A lightweight open multimodal foundation model supporting text, image, and audio inputs.

Key Features:

  • 128K token context window
  • Multi-modal input support
  • Lightweight architecture
  • Comprehensive instruction tuning

Technical Specs:

  • Input Types: Text, images, audio
  • Context Length: 128K tokens
  • GPU Requirements: 16GB+ VRAM
  • API Endpoint: Port 9000
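
If the template serves the model behind an OpenAI-compatible endpoint on port 9000 as listed above, a request might look like the sketch below; the path, model name, and payload fields follow the OpenAI chat-completions convention and are assumptions rather than the template's confirmed schema:

    import requests

    # Hypothetical OpenAI-compatible chat-completions request; host, port,
    # and model name mirror the specs above and are not confirmed values.
    resp = requests.post(
        "http://localhost:9000/v1/chat/completions",
        json={
            "model": "microsoft/Phi-4-multimodal-instruct",
            "messages": [{"role": "user", "content": "Summarize multimodal models in one sentence."}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])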

3. OpenAI Whisper Large V3 Turbo

Type: Speech Recognition

  • Supports 99 languages
  • Optimized for faster inference
  • Speech recognition and translation
  • Categories: Audio Generation, Multimodal, New, Speech Generation, API
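
A quick transcription check with this checkpoint through the transformers pipeline (the audio path is a placeholder):

    from transformers import pipeline

    # Automatic speech recognition with the turbo checkpoint; "sample.wav" is a placeholder path
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
    result = asr("sample.wav", return_timestamps=True)
    print(result["text"])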

4. MIT AST Speech Commands v2

Type: Audio Classification

  • 98.12% accuracy on keyword spotting
  • Fine-tuned on Speech Commands v2 dataset
  • Audio Spectrogram Transformer architecture
  • Categories: Audio Classification, API, New
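
Keyword spotting with this checkpoint can be sanity-checked through the audio-classification pipeline (the clip path is a placeholder):

    from transformers import pipeline

    # Audio Spectrogram Transformer fine-tuned on Speech Commands v2
    classifier = pipeline("audio-classification", model="MIT/ast-finetuned-speech-commands-v2")
    print(classifier("keyword.wav", top_k=3))  # top 3 predicted keywords with scores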

5. NVIDIA BigVGAN v2

Type: Neural Vocoder

  • 44kHz sampling rate
  • Universal vocoder architecture
  • High-quality audio generation
  • Categories: API, New, Audio Generation

6. MIT Audio Spectrogram Transformer

Type: Audio Classification

  • AudioSet category classification
  • Transformer architecture
  • State-of-the-art performance
  • Categories: Audio Classification, API, New

7. Coqui XTTS-v2

Type: Text-to-Speech

  • Supports 17 languages
  • Voice cloning from a 6-second reference audio clip
  • Multilingual capabilities
  • Categories: Speech Generation, API, New
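
Voice cloning with this model through the Coqui TTS Python API follows the pattern on its model card; file paths and the prompt text below are placeholders:

    from TTS.api import TTS

    # Load XTTS-v2 and clone a voice from a short reference clip
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text="Hello, this is a cloned voice speaking.",
        speaker_wav="reference_6s.wav",  # ~6-second reference recording (placeholder path)
        language="en",
        file_path="xtts_output.wav",
    )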

8. F5-TTS

Type: Text-to-Speech

  • Flow matching architecture
  • Expressive voice synthesis
  • Natural-sounding output
  • Categories: API, Speech Generation, New

Implementation Details

Template Structure

Each template includes:

  • info.json: Metadata and categorization
  • job-definition.json: Deployment configuration
  • README.md: Comprehensive documentation
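
As a rough illustration of that layout, a minimal check that every template folder ships all three files could look like the sketch below; the templates/ directory name is hypothetical, and the repository's own npm run validate remains the source of truth:

    from pathlib import Path

    REQUIRED = ("info.json", "job-definition.json", "README.md")

    def missing_files(folder: Path) -> list[str]:
        """Return the required template files that are absent from a folder."""
        return [name for name in REQUIRED if not (folder / name).is_file()]

    # Hypothetical layout: one sub-folder per template under templates/
    for template_dir in sorted(p for p in Path("templates").iterdir() if p.is_dir()):
        missing = missing_files(template_dir)
        if missing:
            print(f"{template_dir.name}: missing {', '.join(missing)}")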

API Standardization

  • OpenAI-compatible API where applicable
  • Standardized error handling
  • Comprehensive documentation
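
Where a template does expose an OpenAI-compatible surface, a text-to-speech call could be sketched as below; the endpoint path and payload fields mirror OpenAI's audio/speech convention and are assumptions, not the templates' confirmed schema:

    import requests

    # Hypothetical OpenAI-style text-to-speech request; base URL, model name,
    # and voice are placeholders, not values taken from the templates.
    resp = requests.post(
        "http://localhost:9000/v1/audio/speech",
        json={"model": "xtts-v2", "input": "Welcome to Nosana.", "voice": "default"},
        timeout=120,
    )
    with open("speech.wav", "wb") as f:
        f.write(resp.content)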

Impact

Enhanced Capabilities

  • Advanced music generation
  • Multimodal AI processing
  • Improved speech recognition
  • Better audio classification
  • Enhanced text-to-speech

@0xmuon 0xmuon (Author) commented Apr 18, 2025

Hey maintainers, if any changes are required, such as reducing the number of template models or adding audio/speech-related tags to validate.js, please let me know.

@djmbritt djmbritt (Member) commented:

Please run npm run validate and make sure that all issues have been resolved.
Also, provide evidence that these templates work: links and screenshots.
