Saturday, November 14, 2015

WGET to clone webpage

An easy way to crawl web pages or applications and store their static content.

wget --keep-session-cookies --save-cookies c.txt --load-cookies c.txt --no-check-certificate -T 10 -x -r -nc <url-to-page>

The flags: --keep-session-cookies keeps session cookies that wget would otherwise discard; --save-cookies and --load-cookies persist them in c.txt between runs; --no-check-certificate ignores TLS certificate errors; -T 10 sets a 10-second timeout; -x forces a directory hierarchy matching the URL; -r crawls recursively; -nc (no-clobber) skips files that already exist locally.

To set a different user agent, add --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"

The cookie file must be in Netscape format. If the site has a logout link, the crawler will eventually follow it and invalidate the session; in that case, crawl until the logout page has been stored, delete all other crawled pages, and crawl again with a fresh session. Because of -nc, wget does not download the same page twice. With wget 1.14 or newer you can instead exclude the logout URL up front with --reject-regex 'logout'.

To bypass the login entirely, you can create a cookie file containing the session cookies of an already signed-in browser session.
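As a minimal sketch, such a cookie file can be written by hand. The cookie name JSESSIONID, its value, and the domain example.com below are placeholders; substitute the session cookie your browser actually holds. Netscape-format cookie lines are TAB-separated with the fields: domain, include-subdomains flag, path, secure flag, expiry (0 = session cookie), name, value.

```shell
# Write a Netscape-format cookie file for a hypothetical session cookie.
# Field order: domain, include-subdomains, path, secure, expiry, name, value.
printf '# HTTP cookie file.\n' > c.txt
printf 'example.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tdeadbeef1234\n' >> c.txt
cat c.txt
```

Running wget with --load-cookies c.txt then sends that cookie with every request to the matching domain.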

# HTTP cookie file.
# Generated by Wget on 2015-11-14 09:18:02.
# Edit at your own risk.
<domain>	FALSE	/	FALSE	0	name	cookie-value

Each cookie line is TAB-separated and must begin with the cookie's domain (<domain> above).