I saw this post and I was curious what was out there.
https://neuromatch.social/@jonny/113444325077647843
I'd like to put my lab servers to work archiving US federal data that's likely to get pulled. Climate and biomed data seem most likely. The most obvious strategy to me seems like setting up mirror torrents on academictorrents. Anyone compiling a list of at-risk data yet?
I have a script that archives to:
- Internet Archive (Wayback Machine)
- archive.today (webpage archive)
- Ghostarchive (website archive)
- Self-hosted ArchiveBox: https://archivebox.io/
I used to depend solely on archive.org, but after the recent attacks on it, I expanded my options.
Script: https://gist.github.com/YasserKa/9a02bc50e75e7239f6f0c8f04fe4cfb1
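If you'd rather see the shape of it before opening the gist, here's a minimal sketch of the multi-target idea in Python. This is not the gist's code: the Wayback Machine Save Page Now GET endpoint is the documented public one, but the archive.today form POST is an assumption and may be rate limited or captcha-gated.

import sys
import requests

def save_wayback(url: str) -> None:
    # Save Page Now: a GET to /save/<url> asks the Wayback Machine to capture it
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    resp.raise_for_status()
    print(f"wayback: {resp.status_code}")

def save_archive_today(url: str) -> None:
    # archive.today takes a form POST; exact fields and anti-bot checks may vary
    resp = requests.post("https://archive.ph/submit/", data={"url": url}, timeout=60)
    print(f"archive.today: {resp.status_code}")

if __name__ == "__main__":
    target = sys.argv[1]
    # try every backend independently so one outage doesn't block the rest
    for save in (save_wayback, save_archive_today):
        try:
            save(target)
        except requests.RequestException as exc:
            print(f"archiver failed: {exc}", file=sys.stderr)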
EDIT: Added script. Note that the script doesn't include archiving to ArchiveBox, since its API isn't available in a stable version yet. You can add a function depending on your setup. Personally, I depend on Caddy and Docker, so I am using a Caddy module [1] to execute commands, with this in my Caddyfile:
route /add {
    # match only requests that actually carry a url query parameter
    @params query url=*
    # hand the url off to archivebox inside the container
    exec @params docker exec --user=archivebox archivebox archivebox add {http.request.uri.query.url} {
        timeout 0
    }
}
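Once that's up, a request like this kicks off an archive (the hostname is just a placeholder for wherever your Caddy instance lives):

curl "https://caddy.example.com/add?url=https://example.org/page-to-save"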
Isn't this prone to a "|| rm -rf /" or something similar at the end of the URL?
If you can docker exec, you already have a lot of privileges, so make sure this isn't a danger.
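For what it's worth, as far as I know both the Caddy exec module and docker exec pass arguments as an argv list rather than through a shell, so "|| rm -rf /" arrives as a literal string. If you want a belt-and-suspenders guard anyway, a hypothetical wrapper could validate the URL before it reaches the container (a sketch in Python, reusing the container and user names from the config above):

import subprocess
import sys
from urllib.parse import urlparse

url = sys.argv[1]
parsed = urlparse(url)
# refuse anything that isn't a plain http(s) URL before it goes near exec
if parsed.scheme not in ("http", "https") or not parsed.netloc:
    sys.exit(f"refusing suspicious input: {url!r}")

# list-form subprocess never invokes a shell, so shell metacharacters like
# "||" stay literal arguments instead of becoming commands
subprocess.run(
    ["docker", "exec", "--user=archivebox", "archivebox", "archivebox", "add", url],
    check=True,
)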