If AI can now speak Italian, it can certainly replace us... (sopuli.xyz)

posted 7 months ago

lseif@sopuli.xyz

programmerhumor@lemmy.ml

50 comments


stingpie@lemmy.world

5 points

7 months ago

This might be happening because of the ‘elegant’ (incredibly hacky) way OpenAI encodes multiple languages into their models. Instead of using all character sets, they apply a modulo operator to each character, so that every Unicode character is represented by a small range of values. On the back end, it somehow detects which language is being spoken and uses that character set for the response. Seeing as the last line seems to be the same mathematical expression as the one you asked about, my guess is that your equation just happened to perfectly match some sentence that would make sense in the weird language.


NeatNit@discuss.tchncs.de

1 point

7 months ago

I suppose it’s conceivable that there’s a bug in converting between different representations of Unicode, but I’m not buying any of this “detected which language is being spoken” nonsense or the use of character sets. It would just use Unicode.

The modulo idea makes absolutely no sense, as LLMs use tokens, not characters, and there’s soooooo many tokens. It would make no sense to make those tokens ambiguous.
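
To illustrate why a modulo would make the input ambiguous, here is a minimal sketch. The `mod_fold` helper is hypothetical (it is the scheme being proposed, not anything taken from OpenAI's code), and the modulus of 256 is an arbitrary assumption:

```python
def mod_fold(text, modulus=256):
    """Hypothetical scheme: fold every code point into a small range."""
    return [ord(ch) % modulus for ch in text]

latin = "A"                 # U+0041
glagolitic = chr(0x2C41)    # a letter from the Glagolitic block (U+2C00 to U+2C5F)

# Both characters fold to the same value, so the original text cannot be recovered.
collides = mod_fold(latin) == mod_fold(glagolitic)

# Plain UTF-8 keeps them distinct: one byte versus three bytes.
utf8_latin = latin.encode("utf-8")
utf8_glagolitic = glagolitic.encode("utf-8")
```

With a fold like this, a Latin letter and a Glagolitic letter become indistinguishable, which is the ambiguity I mean.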


stingpie@lemmy.world

1 point

7 months ago

I completely agree that it’s a stupid way of doing things, but it is how OpenAI reduced the vocab size of GPT-2 and GPT-3. As far as I know (I have only read the comments in the source code), the conversion is done as a preprocessing step. Here’s the code for GPT-2: https://github.com/openai/gpt-2/blob/master/src/encoder.py I did apparently make a mistake, as the vocab reduction is done through a lookup table (LUT) instead of a simple mod.
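
For anyone who doesn't want to open the file, this is a rough sketch of what a byte-level lookup table of that kind can look like. It is loosely modeled on the `bytes_to_unicode` helper in the linked encoder.py, but the names `byte_lut`, `to_symbols`, and `from_symbols` are my own, and it is an illustration under that assumption rather than the actual implementation:

```python
def byte_lut():
    """Map each of the 256 byte values to one printable Unicode character."""
    # Bytes that already render as visible characters keep their own code point.
    keep = set(
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {}
    shift = 0
    for b in range(256):
        if b in keep:
            table[b] = chr(b)
        else:
            # Whitespace and control bytes are moved above U+00FF so they stay visible.
            table[b] = chr(256 + shift)
            shift += 1
    return table


def to_symbols(text):
    """Encode text as UTF-8, then replace every byte with its table character."""
    lut = byte_lut()
    return "".join(lut[b] for b in text.encode("utf-8"))


def from_symbols(symbols):
    """Invert the table to get the original bytes, then decode them as UTF-8."""
    inverse = {ch: b for b, ch in byte_lut().items()}
    return bytes(inverse[ch] for ch in symbols).decode("utf-8")
```

The table is one-to-one over all 256 byte values, so `from_symbols(to_symbols(s))` gives back `s` for any string, whatever the script. Nothing in the table depends on which language the text is in.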


PlexSheep@infosec.pub

4 points

7 months ago

Do you have a source for that? It seems like an internal detail a corpo wouldn’t publish.


stingpie@lemmy.world

1 point

7 months ago

Can’t find the exact source (I’m on mobile right now), but the code for the GPT-2 encoder uses a UTF-8-to-Unicode lookup table to shrink the vocab size. https://github.com/openai/gpt-2/blob/master/src/encoder.py


crispy_kilt@feddit.de

2 points

6 months ago

Seriously? Python for massive amounts of data? It’s a nice scripting language, but it’s excruciatingly slow


2 points

7 months ago

Well, it certainly doesn’t overflow on 32 bit systems


Vitaly@feddit.uk

0 points

7 months ago

Kind of looks like the Georgian writing system, but I’m not sure.


Allero@lemmy.today

1 point

7 months ago

No, this is Glagolitic script, an alternative to Cyrillic. It was mostly used in old Slavic scriptures and was later replaced by Cyrillic and Latin.

Most Slavs themselves don’t know how to read this.


TwilightKiddy@programming.dev

1 point

7 months ago

It’s a dead script that was not that common in the first place. In Kievan Rus’ it was even used as a form of encryption in the 11th to 16th centuries because so few people could read it. It is also very different from modern Cyrillic. So, saying “most Slavs don’t know how to read it” is a bit of an understatement. No one knows how to read it, apart from some linguists and overzealous Witcher fans.
