I wanted to extract some crime statistics broken down by type of crime and by population group, all of course normalized by population size. I got a nice set of tables summarizing the data for each year that I requested.
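For reference, the normalization itself is simple arithmetic that can be done directly on raw counts. Here is a minimal sketch in Python (not the R mentioned later in the thread). The function name `rate_per_100k`, the group labels, and every number below are invented placeholders for illustration, not real statistics.

```python
def rate_per_100k(count: int, population: int) -> float:
    """Return incidents per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so the example values divide exactly.
    return count * 100_000 / population


# Placeholder rows: (group label, incident count, population).
# These numbers are made up purely to show the calculation.
rows = [
    ("group_a", 50, 200_000),
    ("group_b", 30, 60_000),
]

rates = {group: rate_per_100k(count, pop) for group, count, pop in rows}

for group, rate in rates.items():
    print(f"{group}: {rate:.1f} per 100k")
```

Computing the rates from the source counts this way means every figure in the final table can be traced back to a row in the raw data.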

When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?

I compared results from ChatGPT (GPT-4), Copilot, and Grok, and they are the same. Gemini says the data is unavailable, btw :)

So are LLMs reliable for research like that?

27 points

LLMs are totally unreliable for research. They are just probable token generators.

Especially if you’re looking for new data that nobody has talked about before, you’re just going to get convincing hallucinations. It’s like talking to a slightly drunk professor at a loud bar who can’t ever admit they don’t know something.

Example: ask an LLM, “What open source software developer died in the September 11th attacks?”

It will give you names, and when you try to verify those names, you’ll find out those people didn’t die. It’s just generating probable tokens.

9 points

That seems pretty fucking important :) Thanks for educating me. I’ll stick to raw R for now.

5 points

Asking an LLM for raw R code that accomplishes some task, and then fixing the bugs it hallucinates, can be a time saver, though.

4 points

Tried the example, got 2 names that did die in the attacks, but they sure as hell weren’t developers or anywhere near the open source sphere. Also love the classic “that’s not correct” with the AI response being “ah yes, of course”. Shit has absolutely 0 reflection. I mean it makes sense, people usually have doubts in their head BEFORE they write something down. The training data completely skips the thought process, LLMs can’t learn to doubt.

0 points

Solutions exist where you give the LLM a bunch of files, e.g., PDFs, on which it will then solely base its knowledge.

5 points

It’s still a probable token generator, you’re just training it on your local data. Hallucinations will absolutely happen.

0 points

This isn’t training; it’s called a RAG (retrieval-augmented generation) workflow, and there is no training step per se.
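To make the distinction concrete, here is a toy sketch of the retrieval half of such a workflow, in Python. It is an illustration under loud assumptions, not how any particular product works: the function names (`tokenize`, `retrieve`, `build_prompt`), the word-overlap scoring, and the sample text chunks are all made up for this example, and real systems typically rank chunks with embedding models instead. The point is that the model’s weights are never touched. Relevant text is looked up and pasted into the prompt, and the model still generates the answer, so it can still get things wrong.

```python
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    q = tokenize(question)

    def overlap(chunk: str) -> int:
        c = tokenize(chunk)
        return sum(min(q[word], c[word]) for word in q)

    return sorted(chunks, key=overlap, reverse=True)[:k]


def build_prompt(question: str, chunks: list[str], k: int = 1) -> str:
    """Paste the retrieved chunks into the prompt. No training happens."""
    context = "\n".join(retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical chunks standing in for text extracted from your PDFs.
chunks = [
    "The report covers burglary counts for each district.",
    "Appendix B lists the population of each district.",
]

question = "Where are the burglary counts?"
prompt = build_prompt(question, chunks)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; everything the model “knows” about your files comes from that pasted context.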

18 points

Absolutely not.

3 points

Definitely. The thing you might want to consider as well is what you are using it for. Is it professional? Not reliable enough. Is it to try to understand things a bit better? Well, it’s hard to say if it’s reliable enough, but it’s heavily biased just as any source might be, so you have to take that into account.

I don’t have the experience to tell you how to suss out its biases. Sometimes you can push it in one direction or another with your wording, or with follow-up questions. Hallucinations are a thing, but they’re not the only concern. There’s also cherry-picking, lack of expertise, the bias of the company behind the LLM, what data the LLM was trained on, etc.

I have a hard time figuring out what a good way to double-check your LLM would be. I think this is a skill we are currently learning, just as we have been learning how to suss out the bias in a headline or an article based on its author, publication, platform, etc. But for LLMs, it feels fuzzier right now. It may also be less reliable for certain issues than for others. Anyway, that’s my ramble on the issue. Wish I had a better answer, if only I could ask someone smarter than me.


Oh, here’s GPT-4o’s take.

When considering the accuracy and biases of large language models (LLMs) like GPT, there are several key factors to keep in mind:

1. Training Data and Biases

  • Source of Data: LLMs are trained on vast amounts of data from the internet, books, articles, and other text sources. The quality and nature of this data can greatly influence the model’s output. Biases present in the training data can lead to biased outputs. For example, if the data contains biased or prejudiced views, the model may unintentionally reflect these biases in its responses.
  • Historical and Cultural Biases: Since data often reflects historical contexts and cultural norms, models might reproduce or amplify existing stereotypes and biases related to gender, race, religion, or other social categories.

2. Accuracy and Hallucinations

  • Factual Inaccuracies: LLMs do not have an understanding of facts; they generate text based on patterns observed during training. They may provide incorrect or misleading information if the topic is not well represented in their training data or if the data is outdated.
  • Hallucinations: LLMs can “hallucinate” details, meaning they can generate plausible-sounding information that is entirely fabricated. This can occur when the model attempts to fill in gaps in its knowledge or when asked about niche or obscure topics.

3. Context and Ambiguity

  • Understanding Context: While LLMs can generate contextually appropriate responses, they might struggle with nuanced understanding, especially in cases where subtle differences in wording or context significantly change the meaning. Ambiguity in a prompt or query can lead to varied interpretations and outputs.
  • Context Window Limitations: LLMs have a fixed context window, meaning they can only “remember” a certain amount of preceding text. This limitation can affect their ability to maintain context over long conversations or complex topics.

4. Updates and Recency

  • Outdated Information: Because LLMs are trained on static datasets, they may not have up-to-date information about recent events, scientific discoveries, or new societal changes unless explicitly fine-tuned or updated.

5. Mitigating Biases and Ensuring Accuracy

  • Awareness and Critical Evaluation: Users should be aware of potential biases and inaccuracies and approach the output critically, especially when discussing sensitive or fact-based topics.
  • Diverse and Balanced Data: Developers can mitigate biases by training models on more diverse and balanced datasets and employing techniques such as debiasing algorithms or fine-tuning with carefully curated data.
  • Human Oversight and Expertise: Where high accuracy is critical (e.g., in legal, medical, or scientific contexts), human oversight is necessary to verify the information provided by LLMs.

6. Ethical Considerations

  • Responsible Use: Users should consider the ethical implications of using LLMs, especially in contexts where biased or inaccurate information could cause harm or reinforce stereotypes.

In summary, while LLMs can provide valuable assistance in generating text and answering queries, their accuracy is not guaranteed, and their outputs may reflect biases present in their training data. Users should use them as tools to aid in tasks, but not as infallible sources of truth. It is essential to apply critical thinking and, when necessary, consult additional reliable sources to verify information.

7 points

They aren’t. They’re a party trick.

7 points

Treat it like an eager impressionable intern with a confident stride.

3 points

Who is also 15 yrs old and has brain damage

1 point

Proof that you can do anything if somebody piles billions of dollars on you.

4 points

How reliable is autocorrect?
