Technically correct ™
Before you get your hopes up: Anyone can download it, but very few will be able to actually run it.
When the RTX 9090 Ti comes, anyone who can afford it will be able to run it.
This would probably run on an A6000, right?
Edit: nope I think I’m off by an order of magnitude
What are the resource requirements for the 405B model? I did some digging but couldn’t find any documentation during my cursory search.
405B ain’t running local unless you got a proper enterprise-grade setup lol
I think 70B is possible but I haven’t found anyone confirming it yet
Would also like to know the specs from anyone who’s done it
I regularly run Llama 3 70B unquantized on two P40s and CPU at like 7 tokens/s. It’s usable but not very fast.
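In case anyone wants to try a similar GPU+CPU split, here’s a minimal sketch using the Hugging Face transformers/accelerate stack. The commenter’s exact setup isn’t stated (llama.cpp with partial GPU offload is another common route), and the model id and memory caps below are assumptions for two 24 GB P40s plus a large pool of system RAM.

```python
# Hypothetical sketch: split a 70B model across two 24 GB GPUs and spill the rest to CPU RAM.
# Assumes transformers + accelerate; model id and memory caps are illustrative, not the commenter's config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed checkpoint; use whatever you have locally

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights: ~2 bytes per parameter, ~140 GB total for 70B
    device_map="auto",          # let accelerate place layers across the GPUs, then CPU
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "120GiB"},  # leave headroom on each 24 GB card
)

prompt = "Explain KV caching in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")  # first layers land on GPU 0
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

With most of the layers living in system RAM, single-digit tokens/s is about what you’d expect.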
Typically you need about 1 GB of graphics RAM for each billion parameters (i.e. one byte per parameter). This is a 405B-parameter model. Ouch.
Edit: you can try quantizing it. This reduces the amount of memory required per parameter to 4 bits, 2 bits or even 1 bit. As you reduce the size, the performance of the model can suffer. So in the extreme case you might be able to run this in under 64 GB of graphics RAM.
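To put rough numbers on it, here’s the back-of-the-envelope math for the weights only. This ignores KV cache, activations and framework overhead, which add more on top:

```python
# Rough weight-memory estimate for a 405B-parameter model at different precisions.
# Weights only; KV cache, activations and runtime overhead are not included.
params = 405e9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4), ("2-bit", 2), ("1-bit", 1)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> GB (decimal)
    print(f"{label:>5}: ~{gb:,.0f} GB")

# fp16 : ~810 GB
# int8 : ~405 GB
# 4-bit: ~203 GB
# 2-bit: ~101 GB
# 1-bit: ~51 GB   <- the "under 64 GB" extreme case mentioned above
```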
Hmm, I probably have that much distributed across my network… maybe I should look into some way of distributing it across multiple GPUs.
Frak, just counted and I only have 270 GB installed. Approx 40 GB more if I install some of the deprecated cards in any spare PCIe slots I can find.
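For what it’s worth, a quick check of whether roughly 270 GB plus the extra ~40 GB would even hold the weights, using the same weights-only approximation as above (KV cache and per-card fragmentation ignored):

```python
# Does ~310 GB of pooled VRAM hold the 405B weights? Weights only, no KV cache or overhead.
params = 405e9
pool_gb = 270 + 40  # current cards plus the deprecated ones in spare PCIe slots

for label, bits in [("int8", 8), ("4-bit", 4), ("2-bit", 2)]:
    need_gb = params * bits / 8 / 1e9
    verdict = "fits" if need_gb <= pool_gb else "does not fit"
    print(f"{label:>5}: need ~{need_gb:,.0f} GB -> {verdict} in {pool_gb} GB")

# int8 : need ~405 GB -> does not fit
# 4-bit: need ~203 GB -> fits
# 2-bit: need ~101 GB -> fits
```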