Self-hosting LLMs

posted 2 months ago

I’d like to self-host a large language model (LLM).

I don’t mind if I need a GPU and all that; at least it will be running on my own hardware, and it will probably even be cheaper than the $20 per month everyone is charging.

What LLMs are you self-hosting? And what are you using to do it?

The Hobbyist@lemmy.zip

23 points

2 months ago

I run Mistral-Nemo (12B) and Mistral-Small (22B) on my GPU and they are pretty good. As others have said, GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good, and 70B+ models are excellent (based solely on my own experience). Go for q4_K models, as they will run many times faster than higher-precision quantizations with little quality degradation. They typically come in S (Small), M (Medium) and L (Large) variants; take the largest that fits in your GPU memory. If you go below q4, you may see more severe and noticeable quality degradation.
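To make the "take the largest which fits" rule concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximate, rounded values that I am assuming for common GGUF quantizations, and the flat 1.5 GB overhead for context and buffers is a guess, so treat the output as a starting estimate rather than a guarantee.

```python
# Approximate bits per weight for some GGUF quantizations.
# These are assumed, rounded figures; real files vary by model.
BITS_PER_WEIGHT = {
    "q4_K_S": 4.6,
    "q4_K_M": 4.8,
    "q5_K_M": 5.7,
    "q8_0": 8.5,
}


def estimate_vram_gb(params_billion, quant, overhead_gb=1.5):
    """Rough VRAM need in GB: weights plus a flat allowance (assumed)
    for the KV cache and runtime buffers."""
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb


def largest_fitting_quant(params_billion, vram_gb):
    """Return the highest-precision quantization that fits, or None."""
    fitting = [
        quant for quant in BITS_PER_WEIGHT
        if estimate_vram_gb(params_billion, quant) <= vram_gb
    ]
    return max(fitting, key=BITS_PER_WEIGHT.get) if fitting else None
```

For example, `largest_fitting_quant(12, 12)` asks which quantization of a 12B model fits in 12 GB under these assumptions. A `None` result means even the smallest listed variant would not fit entirely in VRAM.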

If you only need to serve one user at a time, Ollama + Open WebUI works great. If you need to serve multiple users at the same time, check out vLLM.

Edit: I’m simplifying a lot, but hopefully this is simple and actionable as a starting point. I’ve also seen great results from Gemma2-27B.
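For the single-user setup, a minimal sketch of talking to a local Ollama server from a script might look like the following. It assumes Ollama is listening on its default port 11434 and that a model has already been pulled; the model tag `mistral-nemo` in the usage note is only an example, not something the thread specifies.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("mistral-nemo", "Why is the sky blue?")` should return the completion as a string. Open WebUI talks to the same server, so this is mostly useful for scripting.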

Edit2: added links

Edit3: a decent GPU in terms of bang for the buck, IMO, is the RTX 3060 with 12GB. It may be available on the used market for a decent price and offers a good amount of VRAM and GPU performance for the cost. I would like to recommend AMD GPUs, as they offer much more GPU memory for the price, but not all of them are equally well supported by ROCm and I’m not sure how compatible they are with these tools, so perhaps others can chime in.

Edit4: you can also use Open WebUI with VS Code via the continue.dev extension, so that you have a Copilot-style LLM in your editor.

dukatos@lemm.ee

2 points

2 months ago

I run ollama:rocm with the deepseek-coder model on a Radeon 6700XT. I only had to set the GPU via environment variables because it is not officially supported by ROCm, but it works.
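The comment does not say which variable was set. A common workaround for cards like this is `HSA_OVERRIDE_GFX_VERSION`, so the sketch below assumes that variable with the value `10.3.0`, and builds a Docker command for the `ollama/ollama:rocm` image. The volume name, container name and port mapping are likewise assumptions to adapt.

```python
import subprocess


def docker_run_command(gfx_version="10.3.0"):
    """Build a `docker run` command for the ROCm Ollama image.

    HSA_OVERRIDE_GFX_VERSION is an assumption here: it tells ROCm to treat
    the card as a supported GPU generation.
    """
    return [
        "docker", "run", "-d",
        "--device", "/dev/kfd",   # ROCm compute device
        "--device", "/dev/dri",   # GPU render nodes
        "-e", f"HSA_OVERRIDE_GFX_VERSION={gfx_version}",
        "-v", "ollama:/root/.ollama",
        "-p", "11434:11434",
        "--name", "ollama",
        "ollama/ollama:rocm",
    ]


def start_container(gfx_version="10.3.0"):
    """Start the container; raises CalledProcessError if docker fails."""
    subprocess.run(docker_run_command(gfx_version), check=True)
```

Calling `start_container()` launches the server in the background. If the override value does not match your card's generation, the GPU may not be detected, so check the container logs after the first start.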

Avid Amoeba@lemmy.ca

1 point

2 months ago

If you only need to serve one user at a time, Ollama + Open WebUI works great. If you need to serve multiple users at the same time, check out vLLM.

Why can’t it serve multiple users? Open WebUI seems to support multiple users.

The Hobbyist@lemmy.zip

3 points

2 months ago

I didn’t say it can’t, but I’m not sure how well it is optimized for it. In my initial testing it queued queries and submitted them to the model one after another; I have not seen it batch-compute the queries, but maybe it’s a setup issue on my side. vLLM, on the other hand, is designed specifically for the multiple-concurrent-user use case and has several optimizations for it.
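One way to check whether a backend queues requests or handles them together is to fire several prompts at once and compare the total wall time with the fastest single request. This is only a sketch: `send` is a hypothetical callable you supply (for example, a function that posts one prompt to your server), and the interpretation assumes the requests take roughly equal time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def serialization_factor(send, prompts):
    """Send all prompts concurrently and return wall time divided by the
    fastest single-request latency.

    A value near 1 suggests requests are processed together; a value near
    len(prompts) suggests they are queued one after another.
    """
    def timed(prompt):
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        latencies = list(pool.map(timed, prompts))
    wall = time.perf_counter() - wall_start
    return wall / min(latencies)
```

Use prompts of similar length and a fixed output length so the requests are comparable. Very short requests make the ratio noisy.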

Avid Amoeba@lemmy.ca

1 point

2 months ago

I see. Makes sense.

Scrubbles@poptalk.scrubbles.tech

9 points

2 months ago

LLMs use a ton of VRAM; the more you have, the better.

If you just need an API, then TabbyAPI is pretty great.

If you need a full UI, then Oobabooga’s Text Generation WebUI is a good place to start.

InverseParallax@lemmy.world

7 points

2 months ago

Ollama, llama3.2, deepcode and a bunch of others.

I’m using a GPU, but man, these tools are picky: they mostly want Nvidia GPUs.

Do NOT be afraid to run on the CPU. It’s slow, but for one user it’s actually mostly fine.

Deckweiss@lemmy.world

5 points

2 months ago

GPT4All is a nice and easy start.

Showroom7561@lemmy.ca

5 points

2 months ago

You can run this right from Windows: https://jan.ai/

You’ll need a lot of RAM, and processing is decently fast, even on a basic laptop.

edit: holy hell. Grammar.

dangling_cat@lemmy.blahaj.zone

3 points

2 months ago

Tip: you can copy and paste the Hugging Face link directly into the search box, and it will download the model automatically! Also, it’s pretty smart. It will load into your VRAM first, then your RAM. If you can fit everything into VRAM, you get the fastest speed. But even if you are using RAM, it’s not terribly bad; it’s still faster than you can read.
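The VRAM-first, RAM-second behaviour can be pictured as a layer split. The thread does not say how the app actually decides, so this is only an illustration under simple assumptions: all layers are the same size, and about 1 GB of VRAM is reserved for context and buffers.

```python
def layers_on_gpu(model_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Estimate how many equally sized layers fit in VRAM.

    Layers that do not fit stay in system RAM and run on the CPU.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))
```

For a 13 GB model with 40 layers on a 12 GB card, `layers_on_gpu(13.0, 40, 12.0)` estimates how many layers stay on the GPU. The closer that number is to the full layer count, the closer you get to the all-in-VRAM speed.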

GreenSofaBed@lemmy.zip (OP)

1 point

2 months ago

This is pretty cool!

Selfhosted

!selfhosted@lemmy.world
