How reliable are modern LLMs?

posted 21 days ago

I wanted to extract some crime statistics broken down by type of crime and by population group, all of course normalized by population size. I got a nice set of tables summarizing the data for each year that I requested.

When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?

I compared results from ChatGPT (GPT-4), Copilot and Grok, and the results are the same (Gemini says the data is unavailable, btw :) ).

So are LLMs reliable for research like that?
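For reference, the normalization step described in the post (counts divided by population size) is plain arithmetic that can be done deterministically once the raw counts come from a trusted source. A minimal sketch, assuming rates are expressed per 100,000 people; the function name `rate_per_100k` and every number below are made-up placeholders for illustration, not real statistics.

```python
def rate_per_100k(count: int, population: int) -> float:
    """Return incidents per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so round inputs give exact results.
    return count * 100_000 / population


# Placeholder figures for illustration only; NOT real crime statistics.
example_rows = [
    {"group": "A", "crime_type": "burglary", "count": 50, "population": 200_000},
    {"group": "B", "crime_type": "burglary", "count": 30, "population": 60_000},
]

# Map (group, crime type) to its population-normalized rate.
rates = {
    (row["group"], row["crime_type"]): rate_per_100k(row["count"], row["population"])
    for row in example_rows
}

for key, rate in rates.items():
    print(key, round(rate, 1))
```

Because the calculation is reproducible, anyone can rerun it against the same source tables and check the output, which is not possible with a table produced directly by a chatbot.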

jet@hackertalks.com

27 points

21 days ago

LLMs are totally unreliable for research. They are just probable token generators.

Especially if you’re looking for new data that nobody has talked about before, then you’re just going to get convincing hallucinations, like talking to a slightly drunk professor at a loud bar who can’t ever admit they don’t know something.

Example: ask an LLM this: “What open source software developer died in the September 11th attacks?”

It will give you names, and when you try to verify those names, you’ll find out those people didn’t die. It’s just generating probable tokens.

mods_mum@lemmy.today (OP)

9 points

21 days ago

That seems pretty fucking important :) Thanks for educating me. I’ll stick to raw R for now.

INeedMana@lemmy.world

5 points

21 days ago

Asking an LLM for raw R code that accomplishes some task and then fixing the bugs it hallucinates can be a time saver, though.

LANIK2000@lemmy.world

4 points

20 days ago

Tried the example, got 2 names of people who did die in the attacks, but they sure as hell weren’t developers or anywhere near the open source sphere. Also love the classic “that’s not correct” with the AI response being “ah yes, of course”. Shit has absolutely 0 reflection. I mean it makes sense, people usually have doubts in their head BEFORE they write something down. The training data completely skips the thought process, so LLMs can’t learn to doubt.

ViaFedi@lemmy.ml

0 points

21 days ago

Solutions exist where you give the LLM a bunch of files, e.g. PDFs, which it will then base its answers on exclusively.

jet@hackertalks.com

5 points

21 days ago

It’s still a probable token generator, you’re just training it on your local data. Hallucinations will absolutely happen.

slacktoid@lemmy.ml

0 points

20 days ago

This isn’t training; it’s called a RAG (retrieval-augmented generation) workflow, as there is no training step per se.
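To make the distinction concrete: in a RAG workflow the model’s weights stay untouched, and relevant passages from your files are looked up and pasted into the prompt at question time. A rough sketch of just the retrieval and prompt-building steps, assuming a simple word-overlap score as a stand-in for the embedding search that real systems typically use. All function names and the sample passages are hypothetical, and no actual LLM is called here.

```python
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> int:
    """Count word occurrences shared by the query and the passage."""
    return sum((tokenize(query) & tokenize(passage)).values())


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str], k: int = 1) -> str:
    """Paste the retrieved passages into the prompt; nothing is trained."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical passages standing in for text extracted from local files.
passages = [
    "The report covers burglary counts for 2019.",
    "The appendix lists population estimates by region.",
]
prompt = build_prompt("What does the appendix list?", passages)
print(prompt)
```

The prompt produced this way would still be handed to the same token generator, so retrieval narrows what the model sees but does not by itself rule out hallucinations.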

simplymath@lemmy.world

18 points

21 days ago

Absolutely not.

Fern@lemmy.world

3 points

21 days ago

Definitely. The thing you might want to consider as well is what you are using it for. Is it professional? Not reliable enough. Is it to try to understand things a bit better? Well, it’s hard to say if it’s reliable enough, but it’s heavily biased just as any source might be, so you have to take that into account.

I don’t have the experience to tell you how to suss out its biases. Sometimes, you can push it in one direction or another with your wording. Or with follow-up questions. Hallucinations are a thing but not the only concern. There’s also cherry-picking, lack of expertise, the bias of the company behind the LLM, what data the LLM was trained on, etc.

I have a hard time understanding what a good way to double-check your LLM is. I think this is a skill we are currently learning, just as we have been learning how to suss out the bias in a headline or an article based on its author, publication, platform, etc. But for LLMs, it feels fuzzier right now. It may also be less reliable for certain issues than for others. Anyways, that’s my ramble on the issue. Wish I had a better answer, if only I could ask someone smarter than me.

Oh, here’s GPT-4o’s take.

When considering the accuracy and biases of large language models (LLMs) like GPT, there are several key factors to keep in mind:

1. Training Data and Biases

Source of Data: LLMs are trained on vast amounts of data from the internet, books, articles, and other text sources. The quality and nature of this data can greatly influence the model’s output. Biases present in the training data can lead to biased outputs. For example, if the data contains biased or prejudiced views, the model may unintentionally reflect these biases in its responses.
Historical and Cultural Biases: Since data often reflects historical contexts and cultural norms, models might reproduce or amplify existing stereotypes and biases related to gender, race, religion, or other social categories.

2. Accuracy and Hallucinations

Factual Inaccuracies: LLMs do not have an understanding of facts; they generate text based on patterns observed during training. They may provide incorrect or misleading information if the topic is not well represented in their training data or if the data is outdated.
Hallucinations: LLMs can “hallucinate” details, meaning they can generate plausible-sounding information that is entirely fabricated. This can occur when the model attempts to fill in gaps in its knowledge or when asked about niche or obscure topics.

3. Context and Ambiguity

Understanding Context: While LLMs can generate contextually appropriate responses, they might struggle with nuanced understanding, especially in cases where subtle differences in wording or context significantly change the meaning. Ambiguity in a prompt or query can lead to varied interpretations and outputs.
Context Window Limitations: LLMs have a fixed context window, meaning they can only “remember” a certain amount of preceding text. This limitation can affect their ability to maintain context over long conversations or complex topics.

4. Updates and Recency

Outdated Information: Because LLMs are trained on static datasets, they may not have up-to-date information about recent events, scientific discoveries, or new societal changes unless explicitly fine-tuned or updated.

5. Mitigating Biases and Ensuring Accuracy

Awareness and Critical Evaluation: Users should be aware of potential biases and inaccuracies and approach the output critically, especially when discussing sensitive or fact-based topics.
Diverse and Balanced Data: Developers can mitigate biases by training models on more diverse and balanced datasets and employing techniques such as debiasing algorithms or fine-tuning with carefully curated data.
Human Oversight and Expertise: Where high accuracy is critical (e.g., in legal, medical, or scientific contexts), human oversight is necessary to verify the information provided by LLMs.

6. Ethical Considerations

Responsible Use: Users should consider the ethical implications of using LLMs, especially in contexts where biased or inaccurate information could cause harm or reinforce stereotypes.

In summary, while LLMs can provide valuable assistance in generating text and answering queries, their accuracy is not guaranteed, and their outputs may reflect biases present in their training data. Users should use them as tools to aid in tasks, but not as infallible sources of truth. It is essential to apply critical thinking and, when necessary, consult additional reliable sources to verify information.

PerogiBoi@lemmy.ca

7 points

21 days ago

They aren’t. They’re a party trick.

rickdg@lemmy.ml

7 points

21 days ago

Treat it like an eager impressionable intern with a confident stride.

jeffhykin@lemm.ee

3 points

21 days ago

Who is also 15 yrs old and has brain damage

rickdg@lemmy.ml

1 point

18 days ago

Proof that you can do anything if somebody piles billions of money on you.

xia@lemmy.sdf.org

4 points

20 days ago

How reliable is autocorrect?

AI

!artificial_intel@lemmy.ml

Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality. The distinction between the former and the latter categories is often revealed by the acronym chosen.
