feddit.org

23 points

4 months ago

IMO the more interesting models are the 70B and 8B: they're the ones you can actually host yourself, and (for basically the first time) they're open models distilled from such a large “parent” model.

But the release is a total dud among testers because they’re bugged with llama.cpp, lol.

tonyn@lemmy.ml

12 points

4 months ago

I’ve got llama 3.1 8b running locally in open webui. What do you mean it’s bugged with llama.cpp?

sunzu@kbin.run

3 points

4 months ago

Does anyone know what it takes to run 70b?

Seems like a minimum of 32 GB RAM and a 4070?

2 points

4 months ago

I mean, I have a 24GB GPU, and it's almost too slow for me. If someone makes an AQLM quant I may run it some.

sunzu@kbin.run

1 point

4 months ago

You were able to load the 70B entirely onto the GPU?

2 points

4 months ago

Yeah, an AQLM 70B will fit in 24GB with very short context, but decent quality.

You never hear about it, mostly because it’s so hard to quantize in the first place, but also because it’s not a GGUF so most people ignore the format, lol.
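For a rough sense of why a heavily quantized 70B can squeeze into 24GB, here's a back-of-the-envelope sketch. It only counts the weights; KV cache, activations, and runtime overhead are ignored, which is why context has to stay short. The function name is made up for illustration, and the "about 2 bits per weight" figure for AQLM is an assumption, not something measured here.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Ignores KV cache, activations, and framework overhead.
    """
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    n_params = 70e9  # nominal 70B parameter count
    gpu_gib = 24     # the 24GB card mentioned above
    # Bit widths are illustrative; ~2 bits is the assumed AQLM-style setting.
    for label, bits in [("fp16", 16), ("4-bit", 4), ("~2-bit", 2)]:
        need = weight_memory_gib(n_params, bits)
        verdict = "fits" if need < gpu_gib else "does not fit"
        print(f"{label:>6}: {need:6.1f} GiB for weights, {verdict} in {gpu_gib} GiB")
```

By this estimate a 4-bit 70B is already over 24GB on weights alone, so it's only the roughly 2-bit range that leaves any room for context.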

9 points

4 months ago

llama.cpp, the underlying engine, doesn’t support extended RoPE yet. Basically this means long context doesn't work, and short context could be messed up too.
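To make "extended RoPE" a bit more concrete, here's a minimal sketch of frequency-dependent RoPE scaling: fast-rotating dimensions are left alone, slow-rotating ones are stretched by a factor, and the band in between is blended. The function names and every default value (factor 8, the low/high frequency factors, an 8192-token original context, base 500000, head dimension 128) are assumptions for illustration, not values taken from this thread or from llama.cpp. The point is just that an engine which skips this step computes different rotation frequencies than the model was trained with.

```python
import math


def rope_inv_freqs(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def scale_inv_freqs(
    inv_freqs: list[float],
    factor: float = 8.0,            # assumed context-extension factor
    low_freq_factor: float = 1.0,   # assumed
    high_freq_factor: float = 4.0,  # assumed
    original_ctx: int = 8192,       # assumed pre-extension context length
) -> list[float]:
    """Frequency-dependent scaling of RoPE inverse frequencies."""
    low_freq_wavelen = original_ctx / low_freq_factor
    high_freq_wavelen = original_ctx / high_freq_factor
    scaled = []
    for f in inv_freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # Short wavelengths (high frequencies) are kept unchanged.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Long wavelengths (low frequencies) are stretched by the factor.
            scaled.append(f / factor)
        else:
            # In between: interpolate smoothly between the two regimes.
            smooth = (original_ctx / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled


if __name__ == "__main__":
    plain = rope_inv_freqs(128, base=500000.0)
    extended = scale_inv_freqs(plain)
    changed = sum(1 for a, b in zip(plain, extended) if a != b)
    print(f"{changed} of {len(plain)} frequencies differ after scaling")
```

If a backend uses the plain frequencies where the model expects the scaled ones, nothing errors out; the positions are just encoded differently, which is the "runs without errors, but not quite right" failure mode.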

I am also hearing rumblings of a messed up chat template?

Basically with any LLM in any UI that uses a GGUF, you have to be very careful of bugs you wouldn’t get in the huggingface-based backends. A lot of models run without errors, but not quite right.

FaceDeer@fedia.io

1 point

4 months ago

I wouldn’t call it a “dud” on that basis. Lots of models come out with lagging support on the various inference engines; it’s a fast-moving field.

1 point

4 months ago

Yeah, but it leaves a bad initial impression when all the frontends ship it and the users aren’t aware it's bugged.
