Microsoft and Reddit Are Fighting About Why Bing’s Crawler Is Blocked on Reddit (www.404media.co)

posted 3 months ago

coyotino [he/him]@beehaw.org

technology@beehaw.org


coyotino [he/him]@beehaw.org (OP)

60 points

3 months ago

The beef between Microsoft and Reddit came to light after I published a story revealing that Reddit is currently blocking every crawler from every search engine except Google, which earlier this year agreed to pay Reddit $60 million a year to scrap the site for its generative AI products.

I know the author meant “scrape”, but sometimes it really does feel like AI is just scrapping the old internet for parts.


cybermass@lemmy.ca

15 points

3 months ago

Yeah, aren’t like over half of reddit comments/posts by bots these days?


originalucifer@moist.catsweat.com

13 points

3 months ago

yep, and the longer that happens, the less valuable the dataset. it's becoming aged.


KeriKitty (They(/It))@pawb.social

13 points

3 months ago

[Joke] See, Reddit’s doing a nice thing here! They’re making sure nobody ends up toxifying their own dataset by using Reddit’s garbage heap of bot posts!


doctortofu@reddthat.com

44 points

3 months ago

I can see why spez is upset about scrappers and search engines - imagine a company profiting from people creating lots of data, just hoarding it and using it for free, and not paying those people a cent, preposterous, right? :)


Ilandar@aussie.zone

28 points

3 months ago

“This was Microsoft’s choice, not ours,” Reddit spokesperson Tim Rathschmidt told me in an email. “We are and have been open to agreements with companies who are open about their intentions and commit to treat us and our users fairly. If Bing or others want access within our policies, without training, without summarization, and without selling it to others, we are and have always been open to that. If they want to build a business selling Reddit data or using the data for training, we could be open to that, but it’s a commercial conversation.”

Mojeek, the search engine that initially told me that Reddit was blocking all search engines but Google, and which was unable to get in touch with Reddit at the time, told me Reddit got in touch after that story was published. Mojeek said it was unable to share any details about the deal because of an NDA, but confirmed that Reddit wanted to get paid for letting Mojeek crawl the site, even though Mojeek does not have any AI products.

This doesn’t add up and it makes me wonder what else Google and reddit agreed upon. This situation benefits no one except Google, as far as I can tell. If reddit wants to milk search engines, and Microsoft is willing and able to pay (which I assume they are), there is no reason for the deal to not go ahead like it did with Google. Kinda makes my brain start going down the conspiracy path, but then again it’s hardly unbelievable that Google would pursue anti-competitive business strategies, particularly when it comes to generative AI.


Moonrise2473@feddit.it

28 points

3 months ago

A search engine can’t pay a website for having the honor of bringing them visits and ad views.

Fuck reddit, get delisted, no problem.

Weird that google is ignoring their robots.txt though.

Even if they pay them for being able to say that glue is perfect on pizza, having

User-agent: *
Disallow: /

should block googlebot too. That means google programmed an exception on googlebot to ignore robots.txt on that domain and that shouldn’t be done. What’s the purpose of that file then?

Because robots.txt is entirely honor-based (a bot has no need to pretend to be another bot; it could just ignore the file), it should be

User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
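
As a rough check of the two rule sets above, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name `allowed`, the constant names, the "bingbot" token, and the example URL are illustrative assumptions, not anything Reddit or Google actually runs. It only shows how a parser that honors robots.txt would read each file.

```python
from urllib.robotparser import RobotFileParser

# First rule set from the comment: every crawler is disallowed everywhere.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Second rule set: an empty Disallow for Googlebot means "nothing is disallowed",
# while every other crawler falls through to the catch-all group.
GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
"""

# Hypothetical URL used only for the check.
EXAMPLE_URL = "https://www.reddit.com/r/technology/"


def allowed(rules: str, user_agent: str, url: str = EXAMPLE_URL) -> bool:
    """Return True if a compliant crawler with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in (("block all", BLOCK_ALL), ("google only", GOOGLE_ONLY)):
        for agent in ("Googlebot", "bingbot"):
            print(label, agent, allowed(rules, agent))
```

Under the first file a compliant Googlebot is refused like everyone else; only the second file carves out the exception. Neither file stops a crawler that chooses not to consult it.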


ssm@lemmy.sdf.org

21 points

3 months ago

I hope all big corporate SEO trash follows suit, once they’ve all filtered themselves out for profit we can hopefully get some semblance of an unshittified search experience.


tal@lemmy.today

7 points

3 months ago

The reason robots.txt generally worked was that nobody was really trying to leverage it against bot operators. I’m not sure this won’t just kill robots.txt. Historically, search engines wanted to index stuff and websites wanted to be indexed. Their interests were aligned, so the convention worked. This no longer holds if things like the Google-Reddit partnership become common.

Reddit can also try to detect and block crawlers; robots.txt isn’t the only tool in their toolbox.

Microsoft, unlike most companies, does actually have a technical counter that Reddit probably cannot stop, if it comes to that and Microsoft wants to do a “hostile index” of Reddit.

Microsoft’s browser, Edge, is used by a bunch of people, and Microsoft can probably rig it up to send content of Reddit pages requested by their browser’s users sufficient to build their index. Reddit can’t stop that without blocking Edge users. I expect that that’d probably be exploring a lot of unexplored legal territory under the laws of many countries. It also wouldn’t be as good as Google’s (I assume real-time) access to the comments, but they’d get to them.

Browsers do report the host-referrer, which would permit Reddit to detect that a given user has arrived from Bing and block them:

https://en.wikipedia.org/wiki/HTTP_referer

In HTTP, “Referer” (a misspelling of “Referrer”[1]) is an optional HTTP header field that identifies the address of the web page (i.e., the URI or IRI), from which the resource has been requested. By checking the referrer, the server providing the new web page can see where the request originated.

In the most common situation, this means that when a user clicks a hyperlink in a web browser, causing the browser to send a request to the server holding the destination web page, the request may include the Referer field, which indicates the last page the user was on (the one where they clicked the link).

Web sites and web servers log the content of the received Referer field to identify the web page from which the user followed a link, for promotional or statistical purposes.[2] This entails a loss of privacy for the user and may introduce a security risk.[3] To mitigate security risks, browsers have been steadily reducing the amount of information sent in Referer. As of March 2021, by default Chrome,[4] Chromium-based Edge, Firefox,[5] Safari[6] default to sending only the origin in cross-origin requests, stripping out everything but the domain name.

Reddit could block browsers with a host-referrer off bing.com, killing the ability of Bing to link to them. I don’t know if there’s a way for a linking site to ask a browser to not give or forge the host-referrer. For Edge users – not all Bing users – Microsoft could modify the browser to do so, forcing Reddit to decide whether to block all Edge users or not.
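
A server-side check of the kind described above might look roughly like this. It is a sketch under assumptions: the function name, the blocklist constant, and the subdomain-matching rule are hypothetical, and nothing here reflects what Reddit actually does.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; the thread only discusses bing.com.
BLOCKED_REFERRER_HOSTS = ("bing.com",)


def is_blocked_referrer(referer_header, blocked_hosts=BLOCKED_REFERRER_HOSTS):
    """Return True if the Referer value points at a blocked host or one of its subdomains."""
    if not referer_header:
        # The header is optional, so a missing value gives nothing to match on.
        return False
    try:
        host = urlsplit(referer_header).hostname
    except ValueError:
        # Malformed value (for example a broken IPv6 literal): treat as no match.
        return False
    if host is None:
        return False
    return any(
        host == blocked or host.endswith("." + blocked)
        for blocked in blocked_hosts
    )
```

Because browsers now send only the origin on cross-origin requests, the value seen for a visitor arriving from Bing would typically be something like `https://www.bing.com/`, which this check matches. A request that arrives with no Referer at all passes straight through, which is the weakness the rest of the comment points at.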


i_am_not_a_robot@discuss.tchncs.de

5 points

3 months ago

It is possible to remove the referer header:
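
One standard mechanism, which may or may not be what this comment had in mind, is the Referrer-Policy response header: a linking page served with `Referrer-Policy: no-referrer` asks browsers to omit the Referer header on navigations away from it. A minimal sketch with Python's standard-library http.server follows; the handler name, header constant, and page body are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

# Headers for a hypothetical results page. "no-referrer" asks the browser
# not to send a Referer header when the user follows links from this page.
PAGE_HEADERS = {
    "Content-Type": "text/html; charset=utf-8",
    "Referrer-Policy": "no-referrer",
}

# Placeholder page body with a single outbound link.
PAGE_BODY = b'<a href="https://example.com/">example result</a>'


class LinkPageHandler(BaseHTTPRequestHandler):
    """Serves one page whose outbound links should carry no Referer."""

    def do_GET(self):
        self.send_response(200)
        for name, value in PAGE_HEADERS.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(PAGE_BODY)))
        self.end_headers()
        self.wfile.write(PAGE_BODY)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8000), LinkPageHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) would serve the page. The policy is a request to the browser, so it only helps with browsers that honor it.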


Ace! _SL/S@ani.social

2 points

3 months ago

They can try to block crawlers all they want

They will not succeed without restricting access to Reddit to an unusable degree, since crawlers can be coded to imitate real users closely enough. Combine that with enough proxies and they can’t do jack shit

Also you could get around the Referer header quite easily via redirects (unless Reddit went ahead and used a whitelist for those, which again would be a very stupid decision) and some more methods


CanadaPlus@lemmy.sdf.org

2 points

3 months ago

Man, wouldn’t that be nice. There’s too much money in appearing on searches for me to ever expect that to happen, though.

