How to download an entire website for offline use
Where it all started
In a chat I read about how to download a website for offline use. There are several ways to do that. If it's just a single page, open the browser menu and choose "File" -> "Save page as", or use the hotkey Ctrl + S (in Firefox). For other browsers this may vary slightly, but it's quite similar.
The browser saves the current HTML in a file and creates a folder of the same name to store all the assets referenced by that file. When you view this file offline (i.e. open it in the browser), the layout should look pretty much like the original.
This works for a single page only. You would have to visit every desired page and repeat the "Save page as" step.
Crawling with wget
On Linux systems there is a small program called wget that makes the crawling task quite easy.
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=unix \
--domains docs.moodle.org \
https://docs.moodle.org/500/en/Main_page
The wget arguments are:
- --recursive: go down the entire tree and fetch all linked documents.
- --no-clobber: do not overwrite files that already exist, e.g. when the download was interrupted by a bad internet connection and has to be restarted.
- --page-requisites: load all assets that the page requires, such as CSS, JavaScript, images and so on.
- --html-extension: adds the appropriate suffix, e.g. the wiki's Main_page above would be stored as Main_page.html.
- --convert-links: convert all links inside the HTML to local links, so that when you browse the offline HTML page and click a link, you are taken to the other downloaded offline page.
- --restrict-file-names: when naming the files, follow a certain scheme and convert file names that contain special characters not allowed by that scheme. Possible values are "unix", "windows", "nocontrol", "ascii", "lowercase", "uppercase", and "none".
- --domains: download only from within these domains (comma separated in case there is more than one). This prevents the crawl from following external links to other websites.
Finally, the URL to start from is given. In this case it's the Moodle documentation in English.
Cloudflare captcha
The above site is protected by a Cloudflare captcha to prevent automatic crawling by bots. Wget is such a bot and is recognized as one by Cloudflare, so the above command fails miserably. To test this, I used curl and sent a custom user agent along with the request to pretend to be a browser:
curl -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36' \
https://docs.moodle.org/500/en/Main_page
However, that didn't help much.
Also, because that website is built with MediaWiki, you could try to access the page via the MediaWiki API:
https://docs.moodle.org/500/en/api.php?action=parse&page=Main_page&format=json
Unfortunately, these calls are also protected by Cloudflare. When you open that URL in a browser, you get a proper result and can see that the call is correctly parameterized and works. When you do the same on the command line, it does not: the response basically contains the JavaScript that handles the captcha.
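To illustrate the difference, here is a minimal sketch of the same API call issued from the browser's developer console on docs.moodle.org, where the Cloudflare challenge has already been passed. The page name and the response shape are assumptions based on the default MediaWiki parse API:
// Run in the dev tools console on docs.moodle.org; the same request from curl or wget is blocked.
const res = await fetch('/500/en/api.php?action=parse&page=Main_page&format=json');
const data = await res.json();
// With the default format, the rendered page HTML should be in data.parse.text['*'].
console.log(data.parse.text['*']);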
Use the browser with Greasemonkey
For Firefox there is an extension called Greasemonkey (the link leads to the Firefox extension); similar extensions exist for Chrome and other browsers as well. There are also a few related projects (possibly forks) with similar names that serve the same purpose.
The idea of Greasemonkey is that you can define user scripts that run in a sandbox and have access to the current DOM document in the browser. Scripts can also be limited to run only on certain URL patterns.
In my case, I created a script that traverses the current DOM document, looks up all links
and assets, loads the data and stores it. It's pretty much the same thing that wget
does, except that it offers hardly any options to influence the crawling behaviour. Also, the
link replacement is a bit buggy (documents deeper in the hierarchy do not get the correct links
to the assets - you'll notice missing styles).
The script is triggered by Ctrl + Shift + F.
// ==UserScript==
// @name fetch-site
// @namespace canjanix
// @match *://*/*
// @grant GM.xmlHttpRequest
// @include *
// @version 0.1
// ==/UserScript==
(function() {
'use strict';
let config = {
cssjs: '0',
img: '0',
extAssets: '0',
maxDept: 2,
saveEndpoint: 'http://localhost:3000/?file=',
root: document.location.host + document.location.pathname,
path: document.location.pathname,
};
// Remember all urls that we have fetched so far in order not to fetch them twice.
const fetchedUrls = new Set();
// Count number of resources fetched.
let itemCnt = 0;
async function fetchResource(url, target) {
if (!fetchedUrls.has(url)) {
fetchedUrls.add(url);
try {
await saveResource(target, url);
} catch (err) {
console.error("Error saving resource:", url, err);
}
}
}
function getTargetUrl(url, node) {
const tag = node?.tagName || 'A';
let path = url.match(/^https?:\/\//i) ? new URL(url).pathname : url;
path = path.replace(config.path, '');
if (path.indexOf('/') === 0) {
path = path.substring(1);
}
if (path.substring(path.length - 1) === '/') {
path = path.substring(0, path.length -1);
}
if (tag === 'A') {
if (path === '') {
path = 'index.html';
} else if (!path.match(/\.html?$/i)) {
path += '.html';
}
}
return path;
}
function isUrlToFetch(url, node) {
const tag = node?.tagName || 'A';
// External URL to other domain, ignore all links and assets when external assets is off.
if (url.match(/^https?:\/\//i)) {
if (url.indexOf(document.location.host) === -1 && config.extAssets === '0' && tag !== 'A') {
return false;
}
if (tag === 'A' && url.indexOf(config.root) === -1) {
return false;
}
}
// Internal image but we don't want images.
if (config.img === '0' && tag === 'IMG') {
return false;
}
// Internal script or link (css) but cssjs is off.
if (config.cssjs === '0' && (tag === 'SCRIPT' || tag === 'LINK')) {
return false;
}
// Any link that contains a bookmark or query params.
if (tag === 'A' && (url.indexOf('#') > -1 || url.indexOf('?') > -1)) {
return false;
}
// Assume that here we are good to fetch the url.
return true;
}
// Wrapper around GM.xmlHttpRequest
function gmFetch(url, responseType = "arraybuffer") {
return new Promise((resolve, reject) => {
GM.xmlHttpRequest({
method: "GET",
url: url,
responseType: responseType,
onload: function(res) {
if (res.status >= 200 && res.status < 300) {
resolve({
body: res.response,
headers: res.responseHeaders,
status: res.status
});
} else {
reject(new Error("HTTP " + res.status + " for " + url));
}
},
onerror: reject
});
});
}
// Save a resource (text or binary) to PHP
async function saveResource(path, url) {
try {
const res = await gmFetch(url, "arraybuffer");
// Detect content-type
let contentType = "application/octet-stream";
const ctMatch = res.headers.match(/content-type:\s*([^\n]+)/i);
if (ctMatch) {
contentType = ctMatch[1].trim();
}
// Wrap ArrayBuffer in a Blob
const blob = new Blob([res.body], { type: contentType });
// Build target PHP endpoint
const targetUrl = config.saveEndpoint + encodeURIComponent(path);
await fetch(targetUrl, {
method: "POST",
headers: { "Content-Type": contentType },
body: blob
});
itemCnt += 1;
console.log("Saved:", path, contentType);
} catch (err) {
console.error("Failed to save", url, err);
}
}
// Main function: save HTML and all resources
async function savePageAndAssets(data) {
// Merge config with injected setup.
config = {...config, ...data};
// Check valid root.
config.root = config.root.replace(/^https?:\/\//, '');
if (config.root.indexOf(document.location.host) !== 0) {
alert('Invalid root ' + config.root);
return;
}
config.path = config.root.replace(document.location.host, '');
// Construct a base path for the target storage, where all the data is stored underneath.
let basePath = config.root;
if (basePath.charAt(basePath.length - 1) !== '/') {
const p = basePath.lastIndexOf('/');
basePath = (p > -1) ? basePath.substring(0, p + 1) : '/';
}
// This is the list of pages to fetch. The first page is the current location.
let todo = [{u: '__##start##__', d: 0}];
while (todo.length > 0) {
// Current html to fetch
let {u, d} = todo.shift();
const currentUrl = u;
const currentDept = d;
let currentTarget = '';
let doc;
if (currentUrl === '__##start##__') {
currentTarget = getTargetUrl(document.location.pathname);
doc = document.documentElement.cloneNode(true);
} else {
let pageHtml = '';
try {
const res = await fetch(currentUrl, { method: "GET" });
pageHtml = await res.text(); // raw HTML string
} catch(err) {
console.log('Error fetching: ' + currentUrl);
continue;
}
try {
const parser = new DOMParser();
doc = parser.parseFromString(pageHtml, "text/html");
} catch (err) {
console.log('Error parsing result from: ' + currentUrl);
continue;
}
currentTarget = getTargetUrl(currentUrl);
}
// Images
if (config.img === '1') {
for (const img of doc.querySelectorAll("img[src]")) {
if (isUrlToFetch(img.src, img)) {
const target = getTargetUrl(img.src, img);
await fetchResource(img.src, basePath + target);
img.src = target;
}
}
}
if (config.cssjs === '1') {
// Stylesheets
for (const link of doc.querySelectorAll("link[rel='stylesheet'][href]")) {
if (isUrlToFetch(link.href, link)) {
const target = getTargetUrl(link.href, link);
await fetchResource(link.href, basePath + target);
link.href = target;
}
}
// JavaScript files
for (const script of doc.querySelectorAll("script[src]")) {
if (isUrlToFetch(script.src, script)) {
const target = getTargetUrl(script.src, script);
await fetchResource(script.src, basePath + target);
script.src = target;
}
}
// Preloaded assets (fonts, scripts, styles, etc.)
for (const link of doc.querySelectorAll("link[rel='preload'][href]")) {
if (isUrlToFetch(link.href, link)) {
const target = getTargetUrl(link.href, link);
await fetchResource(link.href, basePath + target);
link.href = target;
}
}
}
// All internal links
doc.querySelectorAll("a[href]").forEach(a => {
if (!isUrlToFetch(a.href, a)) return;
if (!fetchedUrls.has(a.href) && currentDept < config.maxDept) {
fetchedUrls.add(a.href);
todo.push({u: a.href, d: currentDept + 1});
}
a.href = getTargetUrl(a.href, a);
});
// Save HTML itself
await fetch(config.saveEndpoint + encodeURIComponent(basePath + currentTarget), {
method: "POST",
headers: { "Content-Type": "text/html; charset=UTF-8" },
body: doc.nodeType === Node.DOCUMENT_NODE ? doc.documentElement.outerHTML : doc.outerHTML,
});
itemCnt += 1;
console.log("Saved HTML:", basePath + currentTarget);
}
alert(`Page and resources saved (${itemCnt})!`);
}
// Hotkey trigger: Ctrl+Shift+F
document.addEventListener("keydown", function(e) {
if (e.ctrlKey && e.shiftKey && e.key === "F") {
e.preventDefault();
showDialog();
}
});
function showDialog() {
// Remove existing dialog if already present
const old = document.getElementById('gm-save-site-dialog');
if (old) old.remove();
// Create dialog container
const div = document.createElement('div');
div.id = 'gm-save-site-dialog';
div.style.position = 'fixed';
div.style.top = '0';
div.style.left = '0';
div.style.width = '100%';
div.style.height = '100%';
div.style.backgroundColor = 'rgba(0,0,0,0.5)';
div.style.display = 'flex';
div.style.alignItems = 'center';
div.style.justifyContent = 'center';
div.style.zIndex = '9999';
// Defaults in the form
let root = document.location.host + document.location.pathname;
// Inner box
div.innerHTML = `
<div style="background:#fff; padding:20px; border-radius:8px; min-width:300px;">
<h3>Download Site</h3>
<form>
<div>
<label>
Start fetching elements below:<br/>
<input type="text" name="root" value="${root}" size="${root.length}"/>
</label>
</div>
<div>
<label>
Save endpoint:<br/>
<input type="text" name="saveEndpoint" value="http://localhost:3000?file=" size="${root.length}"/>
</label>
</div>
<div>
<label>
<input type="checkbox" name="img" value="1" checked="checked"> Include images
</label>
</div>
<div>
<label>
<input type="checkbox" name="cssjs" value="1" checked="checked"> Include scripts and styles
</label>
</div>
<div>
<label>
<input type="checkbox" name="extAssets" value="0" checked="checked"> Include external assets
</label>
</div>
<div style="margin-bottom: 0.8em">
<label>
<input type="text" name="maxDept" value="2" size="2"> Max dept to traverse links
</label>
</div>
<button type="submit">Start</button>
<button type="button" data-cancel="true">Cancel</button>
</form>
</div>
`;
document.body.appendChild(div);
// Handle form submission
const form = div.querySelector('form');
form.addEventListener('submit', function(e) {
e.preventDefault();
const fd = new FormData(form);
const data = {};
// This avoids relying on the iterator protocol
fd.forEach((value, key) => {
if (data.hasOwnProperty(key)) {
// Support multiple values per key
if (!Array.isArray(data[key])) data[key] = [data[key]];
data[key].push(value);
} else {
data[key] = value;
}
});
div.remove();
savePageAndAssets(data);
});
// Handle cancel
div.querySelector('button[data-cancel]').addEventListener('click', () => {
div.remove();
});
}
})();
The script is loaded on all sites. You could change that by changing the @match directive
in the top section.
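For example, replacing the wildcard pattern with something like the following would limit the script to the Moodle documentation (the pattern is just an example, adjust it to your target site):
// @match        https://docs.moodle.org/*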
After an HTML page is loaded, the script is executed and an event listener is added that
listens for the hotkey Ctrl + Shift + F. When this event is observed, the dialogue
is inserted into the existing DOM document, overlaying the existing content because of its z-index.
The dialogue itself has two buttons, each with a click event listener attached.
In either case the dialogue is removed on the click. When the submit button is clicked,
the form data is collected and passed to savePageAndAssets(), which then
does the actual work.
Saving the downloaded pages and assets
Whenever a page or asset is downloaded, it must be saved somewhere. As far as I know, JavaScript in the browser cannot write to the file system, so the idea was to post the data to some other URL. In this case, I run a local PHP webserver with a script that handles the POST requests and stores the data locally.
<?php
/**
* Accepts a path and file name in the query string parameter "file" and saves the content
* of the POST request body.
* Examples:
* curl -X POST "http://localhost:3000?file=logs/test.json" \
* -H "Content-Type: application/json" \
* -d '{"hello":"world","time":"2025-09-12"}'
* curl -X POST "http://localhost:3000?file=uploades/2025/picture.jpg" \
* --data-binary @my-pricture.jpg
*/
$dataDir = __DIR__ . DIRECTORY_SEPARATOR . 'data' . DIRECTORY_SEPARATOR;
header('Content-Type: application/json');
// Check if form was submitted via POST
if ($_SERVER["REQUEST_METHOD"] !== "POST") {
echo json_encode([
'res' => 3,
'msg' => "Invalid request method.",
]);
exit;
}
// Check for filename in query string
if (!isset($_GET['file']) || empty($_GET['file'])) {
echo json_encode([
"res" => 2,
"msg" => "Missing file parameter."
]);
exit;
}
$filename = $_GET['file'];
// Resolve safe path
$targetDir = $dataDir . dirname($filename);
$realBase = realpath($dataDir);
if ($realBase === false) {
mkdir($dataDir, 0777, true);
$realBase = realpath($dataDir);
}
$realTarget = realpath($targetDir);
if ($realTarget === false) {
mkdir($targetDir, 0777, true);
$realTarget = realpath($targetDir);
}
// Security check: prevent directory traversal
if (strpos($realTarget, $realBase) !== 0) {
echo json_encode([
"res" => 1,
"msg" => "Invalid file path."
]);
exit;
}
$filepath = $dataDir . $filename;
// Get raw POST body
$rawBody = file_get_contents("php://input");
if ($rawBody === false || $rawBody === '') {
echo json_encode([
"res" => 4,
"msg" => "Empty request body."
]);
exit;
}
// Try saving the raw body
if (file_put_contents($filepath, $rawBody, LOCK_EX) !== false) {
echo json_encode([
"res" => 0,
"msg" => "Data saved successfully.",
"file" => $filename,
"size" => strlen($rawBody)
]);
} else {
echo json_encode([
"res" => 5,
"msg" => "Failed to save data."
]);
}
Ideally, you'd save that file as index.php and start a webserver in the same directory
with php -S localhost:3000. That's it. Inside the script there are two sample calls that
help you check that the server works and actually stores the data that is sent.
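If you prefer to test it from a script instead of curl, here is a minimal sketch using Node 18+ and its built-in fetch (the file name is just an example; it assumes the PHP server from above is running on localhost:3000):
// save as check-endpoint.mjs and run with: node check-endpoint.mjs
const res = await fetch('http://localhost:3000/?file=test/hello.txt', {
  method: 'POST',
  headers: { 'Content-Type': 'text/plain' },
  body: 'hello from the test script',
});
// Expect something like { res: 0, msg: "Data saved successfully.", file: "test/hello.txt", size: 26 }
console.log(await res.json());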
Conclusion
This project was a proof of concept. In the end it didn't bypass the Cloudflare bot detection; I still can't download a captcha-protected site. However, it was a good chance to try out something new, and even if it didn't work, it may provide inspiration for solving other problems in the future.