Enhancement: Custom endpoints support #245
HenkieTenkie62 started this conversation in Ideas
Replies: 1 comment · 1 reply
-
llama-swap will stay as close as possible to the OpenAI API. We have allowed some llama-server-specific endpoints, but these are handled on a case-by-case basis.
-
Like most people, I'm working with limited resources and thus need model swapping.
For ASR/STT and document OCR, I'm currently unloading models manually to free up VRAM for processing.
Would it be an idea to support custom endpoints to automate this?
These could reside under llamahost:8080/custom/.
This way llama-swap would be in full control of the process.
Endpoints could be configured like models and placed inside groups, to avoid large models/applications being loaded at the same time and causing an OOM.
Proxy addresses can be kept the same, and the start/stop command lines and the health check endpoint can also be repurposed for this exact goal (see the sketch below).
I guess these requests probably don't follow any OpenAI-style API and thus need no reprocessing, only rerouting.
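Roughly what I'm imagining, as a purely hypothetical config sketch — the `custom` section and its field names don't exist in llama-swap today (they're assumed here, modeled on the existing model and group settings), and the whisper server command is just a placeholder:

```yaml
# Hypothetical sketch -- a "custom" section does not exist in llama-swap;
# the field names below are assumed, mirroring the existing model settings.
custom:
  whisper:
    # placeholder command: any process that serves HTTP would do
    cmd: whisper-server --host 127.0.0.1 --port 9001 -m models/ggml-large-v3.bin
    proxy: http://127.0.0.1:9001
    checkEndpoint: /health    # assumed, mirroring the model health check option

groups:
  # assumed semantics: members swap, so the ASR process and the big LLM
  # are never resident in VRAM at the same time
  vram-heavy:
    swap: true
    members:
      - whisper
      - my-large-llm          # placeholder entry defined under models:
```

A request to llamahost:8080/custom/whisper/inference would then be rerouted unchanged to http://127.0.0.1:9001/inference, with llama-swap only handling the load/unload ordering.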
Would this be an interesting option, or would it go beyond the scope of the project?
I think it would be a great addition.
I've now glued the function in, but the implementation is not very flexible:
main...HenkieTenkie62:llama-swap:main#diff-36e6873dfbfa67366814a9233117652a77b868fa2de6082e6d060fcbb91a5fb8