scraping images from imageboards


File: 1665654819264.png (95.93 KB, 640x480, wget where.png)

scraping images from imageboards Anonymous 13-10-22 09:53:39 No.17234

How do I scrape images, WebMs, and other videos from imageboards like this one (or any other) using wget?

I've been doing 4chan archives using this:
wget --wait=1 -P qwerty -nd -r -l 1 -H -D archived.moe -A png,gif,jpg,jpeg,webm https://archived.moe/w/thread/2202966/

pls halp

What site to archive threads with media links? Anonymous 04-06-23 05:28:35 No.19802

File: 1685856515034.jpg (104.33 KB, 1200x675, doeg.jpg)

Kind of opposite problem for me but how do I archive threads here with media links? Preferably without having to make an account. I'll do a bunch if someone can tell me what site to use. Sorry couldn't find a specific thread on archiving links here and didn't wanna make a new one yet.

Anonymous 04-06-23 05:36:39 No.19803

>>19802
Are you scraping them as a personal download? Or do you want them on web.archive.org Wayback Machine?
wget is probably a good tool if you're personally storing a thread and all the media in it. I think you'd want a recursive, one-level download restricted to the leftypol.org domain. I forget the exact options, but they won't be hard for you to find.
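
A minimal sketch of the kind of invocation described above, not a tested recipe for this site. The function name thread_wget_args, the default output directory thread_media, and the extension list are assumptions for illustration; the flags themselves are standard GNU wget options. The function only prints the argument list, so it can be inspected before anything is downloaded.

```shell
#!/bin/bash
# Print one wget argument per line for a one-level recursive media grab.
# The output directory and the extension list are illustrative assumptions.
thread_wget_args() {
    local url=$1 dir=${2:-thread_media}
    local host=${url#*://}   # strip the scheme
    host=${host%%/*}         # keep only the host part
    printf '%s\n' \
        --recursive --level=1 \
        --no-directories "--directory-prefix=$dir" \
        "--domains=$host" \
        --accept=png,gif,jpg,jpeg,webm,mp4 \
        --wait=1 \
        "$url"
}

# To actually run the download (bash 4+), something like:
#   mapfile -t args < <(thread_wget_args "$thread_url")
#   wget "${args[@]}"
```

Keeping the argument list separate from the wget call makes it easy to check the domain restriction and accepted extensions first. If a board serves its media from a different host than the thread page, --span-hosts and a wider --domains list would also be needed.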

Anonymous 04-06-23 06:30:06 No.19804

>>19803
>Or do you want them on web.archive.org Wayback Machine?
This one I mean.
I think OP was doing the former.

Anonymous 04-06-23 06:52:55 No.19805

>>19804
The two options I've seen are:
1) ArchiveTeam's ArchiveBot - https://wiki.archiveteam.org/index.php?title=ArchiveBot
2) DIY script - download the list of links with wget, and automate submitting them to Wayback Machine's save address (make sure to have the script pause between visits so it doesn't say Too Many Requests)

I'd recommend trying the first if appropriate, at least try once and see if it grabs all the media. But I think it puts you in a queue so idk if it's good for short-lived threads.

Anonymous 04-06-23 08:51:28 No.19806

From >>18433

wget -O - https://leftypol.org/tech/res/1280.json \
    | jq -r '.posts[].files[]?.file_path' \
    | sed 's_^_https://leftypol.org/_' \
    | wget -i -

Anonymous 04-06-23 09:27:35 No.19807

>>19802
This seems to work:

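#!/bin/bash
# Submit a thread and its media files to the Wayback Machine.
# Pass the thread URL as the first argument.

URL=$1
END=${URL##*/}      # last path component
NAME=${END%%.*}     # same, without the extension
BASE=${URL%/*}      # URL up to the last slash
DOMAIN=$(echo "$URL" | cut -d'/' -f1-3)   # scheme and host

# Retry indefinitely, waiting between attempts.
TO=(--retry-connrefused --wait=60 --random-wait --read-timeout=60 --timeout=60 -t 0)

# Save the thread page itself.
wget "https://web.archive.org/save/$URL" -O /dev/null --spider "${TO[@]}"

# Save each media file listed in the thread's JSON.
wget -O - "$BASE/$NAME.json" \
    | jq -r '.posts[].files[]?.file_path' \
    | sed "s_^_https://web.archive.org/save/$DOMAIN/_" \
    | while read -r u; do
    wget "$u" "${TO[@]}" --spider -O /dev/null
    sleep 1m
done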
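
To see what the parameter expansions in the script derive from a thread URL, here is a small sketch that runs without touching the network. The function name thread_parts is made up for illustration, and the sample URL assumes thread pages end in .html next to a .json file with the same number, which is a guess based on the JSON URL quoted earlier in the thread.

```shell
#!/bin/bash
# Derive the JSON endpoint and the site origin from a thread URL,
# using the same parameter expansions as the script.
thread_parts() {
    local url=$1
    local end=${url##*/}      # everything after the last slash
    local name=${end%%.*}     # drop the extension
    local base=${url%/*}      # everything before the last slash
    local origin
    origin=$(printf '%s\n' "$url" | cut -d/ -f1-3)   # scheme and host
    printf 'json=%s\norigin=%s\n' "$base/$name.json" "$origin"
}

# Sample URL is an assumption, not taken from the thread.
thread_parts https://leftypol.org/tech/res/1280.html
# prints:
#   json=https://leftypol.org/tech/res/1280.json
#   origin=https://leftypol.org
```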

Anonymous 09-06-23 03:11:32 No.20066

>>17234
>>19802
ask chatgpt to write a python script for it

Anonymous 09-06-23 03:17:09 No.20068

>>17234
>>19802
can somebody please tell me how to archive the threads to archive.today? it always gives me an error, but archive.org works fine. is it because of the craptcha? please suggest a workaround

Anonymous 11-06-23 11:30:16 No.20113

>>19805
different anon but cool thank you
>>20068
does archive.is/archive.ph work?

Anonymous 11-06-23 13:05:57 No.20128

>>8534
Someone archive this shit and jannies merge redundant threads I reported so I can post mine.
billions must hoard

Anonymous 12-06-23 03:24:50 No.20266

>>20113
nah, none of them work since they require a captcha and my script can't get past any of them. i need fixes bros

Anonymous 08-12-23 14:25:40 No.22747

>>20266
bump of shame

Anonymous 08-12-23 22:00:37 No.22749

>>22747
you have to hire iranians to do the captchas for your bot

Anonymous 15-04-24 13:53:09 No.24459

>>22747
