[ home / rules / faq ] [ overboard / sfw / alt ] [ leftypol / edu / siberia / latam / hobby / tech / games / anime / music / draw / AKM ] [ meta ] [ wiki / tv / twitter / tiktok ] [ GET / ref / marx / booru ]

/tech/ - Technology

"Technology reveals the active relation of man to nature" - Karl Marx


File: 1665654819264.png (95.93 KB, 640x480, wget where.png)

 

how do i scrape images/webm/videos from imageboards like this one or any other using wget, how?

bin doin 4cha using this
wget --wait=1 -P qwerty -nd -r -l 1 -H -D archived.moe -A png,gif,jpg,jpeg,webm https://archived.moe/w/thread/2202966/


pls halp

File: 1685856515034.jpg (104.33 KB, 1200x675, doeg.jpg)

Kind of opposite problem for me but how do I archive threads here with media links? Preferably without having to make an account. I'll do a bunch if someone can tell me what site to use. Sorry couldn't find a specific thread on archiving links here and didn't wanna make a new one yet.

>>19802
Are you scraping them as a personal download? Or do you want them on web.archive.org Wayback Machine?
wget is probably a good tool if you're personally storing a thread and all the media in it. I think you'd want a recursive, 1 level download, restricted to leftypol.org domain, I forget the exact options but they won't be hard for you to find.
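A rough sketch of those options, reusing the flags from OP's archived.moe command. The `.html` thread URL and the `media` output directory are assumptions for illustration, and `thread_media_args` is a made-up helper name, not part of wget. It only prints the argument list, so you can check it before downloading anything:

```shell
#!/bin/bash
# Sketch: build wget arguments for a one-level recursive media grab,
# restricted to the thread's own host.
thread_media_args() {
    local url=$1 outdir=$2
    local host=${url#*://}   # strip the scheme (https://)
    host=${host%%/*}         # keep only the host part
    # -r -l 1: recurse one level; -H -D host: follow links but stay on this domain
    # -A: accept only these extensions; -nd: no directory tree; -P: output dir
    printf '%s\n' --wait=1 -P "$outdir" -nd -r -l 1 -H -D "$host" \
        -A png,gif,jpg,jpeg,webm "$url"
}

# Dry run: print the arguments, one per line
thread_media_args https://leftypol.org/tech/res/1280.html media

# To actually download (uncomment):
# mapfile -t args < <(thread_media_args https://leftypol.org/tech/res/1280.html media)
# wget "${args[@]}"
```

If the media turns out to be served from a different host than the thread page, the `-D` value would need that host added as well.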

>>19803
>Or do you want them on web.archive.org Wayback Machine?
This one I mean.
I think OP was doing the former.

>>19804
The two options I've seen are:
1) ArchiveTeam's ArchiveBot - https://wiki.archiveteam.org/index.php?title=ArchiveBot
2) DIY script - download the list of links with wget, and automate submitting them to Wayback Machine's save address (make sure to have the script pause between visits so it doesn't say Too Many Requests)

I'd recommend trying the first if appropriate, at least try once and see if it grabs all the media. But I think it puts you in a queue so idk if it's good for short-lived threads.

From >>18433
wget -O - https://leftypol.org/tech/res/1280.json \
    | jq -r '.posts[].files[]?.file_path' \
    | sed 's_^_https://leftypol.org/_' \
    | wget -i -

>>19802
This seems to work:
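#!/bin/bash
# Pass the thread URL as the first argument.
# Submits the thread page, then each attached file, to the Wayback Machine.

URL=$1
END=${URL##*/}      # last path component of the URL
NAME=${END%%.*}     # same, with the extension stripped
BASE=${URL%/*}      # everything before the last slash
DOMAIN=$(echo "$URL" | cut -d'/' -f1-3)   # scheme + host

# retry forever, wait between requests, generous timeouts
TO=(--retry-connrefused --wait=60 --random-wait --read-timeout=60 --timeout=60 -t 0)

wget "https://web.archive.org/save/$URL" -O /dev/null --spider "${TO[@]}"

wget -O - "$BASE/$NAME.json" \
    | jq -r '.posts[].files[]?.file_path' \
    | sed "s_^_https://web.archive.org/save/$DOMAIN/_" \
    | while read -r u; do
    wget "$u" "${TO[@]}" --spider -O /dev/null
    sleep 1m
done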
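
If you want to see what that loop would submit before actually hitting the Wayback Machine, something like this prints the save URLs and requests nothing. It is only a sketch: `save_urls` is a made-up helper name and the sample paths are placeholders standing in for the jq output, not real files.

```shell
#!/bin/bash
# Sketch: print the Wayback Machine "save" URLs instead of requesting them.
# Reads media paths (one per line) on stdin; $1 is the site origin.
save_urls() {
    local origin=${1%/}          # drop a trailing slash if present
    local path
    while IFS= read -r path; do
        [ -n "$path" ] || continue      # skip blank lines
        # ${path#/} drops a leading slash so the URL has exactly one
        printf 'https://web.archive.org/save/%s/%s\n' "$origin" "${path#/}"
    done
}

# Placeholder paths, standing in for the jq output
printf '%s\n' tech/src/example1.png /tech/src/example2.webm \
    | save_urls https://leftypol.org
```

To preview a real thread, swap the sample `printf` for the `wget -O - ... | jq -r ...` part of the pipeline.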

>>17234
>>19802
ask chatgpt to write a python script for it

>>17234
>>19802
can somebody please tell me how to archive the threads to archive.today? it always gives me an error, but archive.org works fine. is it because of the craptcha? please suggest a workaround

>>19805
different anon but cool thank you
>>20068
does archive.is/archive.ph work?

>>8534
Someone archive this shit and jannies merge redundant threads I reported so I can post mine.
billions must hoard

>>20113
nah, none of them work since they require a captcha and my script can't get past any of them. i need fixes bros

>>20266
bump of shame

>>22747
you have to hire iranians to do the captchas for your bot



Unique IPs: 8
