Download HTML versions of archived domains from archive.org’s WayBack Machine
Archive.org Downloader attempts to correct all internal linking so that the site is saved on your computer in a browsable form*
*Due to dynamic page content, exact copies cannot be guaranteed
Current Version 1.5
Portable version, no installer.
If updating, please note down your license email, as you will need to enter it again.
- When you first run the program it should activate your 7 day trial period. If it does not then please enter “7 DAY TRIAL” as the registration email.
- To begin scraping a site from Archive.org, first click the “save as…” button to select the destination folder for saving (e.g. Documents).
- Then enter the start page to begin crawling. It’s worth noting that www.domain.com and domain.com can give different results. Before downloading a site, you should decide which one was used when the site was live (you can use majesticSEO or Ahrefs metrics to decide this).
- The final setting needed is the target date. While downloading, Archive Downloader will attempt to fetch the cached copy of each page that is nearest to this date.
- Now click “scrape domain” to start the scraper.
- To speed up the process you can increase the number of threads, but be aware that archive.org does ban IP addresses, so for more than 10 threads you should load some proxies first. You will see a substantial speed increase when using proxies, due to the throttling applied by archive.org.
- Once the scraper has finished you should have a local, and hopefully browsable, copy of everything that is available.
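The “nearest to target date” behaviour described above can be sketched in a few lines. The Wayback Machine identifies each capture by a 14-digit YYYYMMDDhhmmss timestamp, so picking the closest copy is a minimum-distance search over those timestamps (the sample values and helper names here are illustrative, not the tool’s actual implementation):

```python
from datetime import datetime

def nearest_snapshot(timestamps, target):
    """Pick the 14-digit Wayback timestamp closest to the target date.

    timestamps: iterable of 'YYYYMMDDhhmmss' strings.
    target:     a datetime to aim for.
    """
    def to_dt(ts):
        return datetime.strptime(ts, "%Y%m%d%H%M%S")
    # Choose the capture with the smallest absolute time difference.
    return min(timestamps, key=lambda ts: abs(to_dt(ts) - target))

# Example: three captures, aiming for May 2010.
snaps = ["20090115120000", "20100601000000", "20120320093000"]
print(nearest_snapshot(snaps, datetime(2010, 5, 1)))  # 20100601000000
```

This is why choosing a sensible target date matters: a date far from any real capture will still return *something*, just from a very different era of the site.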
Latest Updates and Features in 1.5
- <base> tag bug fixed!
- Added option to retry time out errors caused by too many threads.
Latest Updates and Features in 1.4
- You can now enter an exact date and the software will return the copy nearest to it.
- Strips all archive.org inserted html and comments.
- Right-click output window for clear/copy contents.
- Private and semi-private proxies supported fully (for formats – see end of page)
- Multiple bug fixes!
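The “strips all archive.org inserted html” feature targets markup like the Wayback Machine’s toolbar, which is wrapped in BEGIN/END comment markers in served pages. A rough sketch of that kind of clean-up (the marker text matches what the Wayback Machine inserts, but treat this as an illustration rather than the tool’s actual code):

```python
import re

# The Wayback Machine wraps its injected toolbar in comment markers;
# removing everything between them restores something close to the
# original page source.
TOOLBAR = re.compile(
    r"<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->.*?"
    r"<!--\s*END WAYBACK TOOLBAR INSERT\s*-->",
    re.DOTALL,
)

def strip_wayback_toolbar(html):
    return TOOLBAR.sub("", html)

page = ("<html><!-- BEGIN WAYBACK TOOLBAR INSERT --><div>toolbar</div>"
        "<!-- END WAYBACK TOOLBAR INSERT --><body>real content</body></html>")
print(strip_wayback_toolbar(page))  # <html><body>real content</body></html>
```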
The best method to get the most complete working download from archive.org
First find a working page to start from on archive.org. Copy its URL (the original page URL, not the web.archive.org URL) into the first text box.
For large sites this might take a while. Obviously, the more proxies and threads you use, the faster it goes. If you are running too many threads per proxy you might start to see “time out” errors. To rectify this, lower the thread count. From v1.5 onwards, auto retry of time out errors is enabled by default.
If you do see a lot of time out and similar errors, you may want to run the download again with the same settings (untick “Overwrite files”) to retry the timed-out items. You can quickly do this by clicking the red “Reset URL Cache” button and then clicking “Start at URL” again. Tick “Don’t show skipped files” to only see the downloaded items on a second run through. (This is not needed if “Auto Retry Time Out items” is checked.)
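The auto-retry behaviour added in v1.5 amounts to re-requesting any URL that timed out, pausing a little longer between each attempt. A minimal sketch of that pattern (the function names and timings are hypothetical, not the tool’s internals):

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on TimeoutError with a growing delay."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(delay * (attempt + 1))  # back off before retrying

# Simulated fetcher that times out twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError(url)
    return "page body"

print(fetch_with_retry(flaky_fetch, "http://example.com", delay=0))  # page body
```

The same idea explains why lowering the thread count helps: fewer simultaneous requests per proxy means fewer timeouts to retry in the first place.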
Hopefully you now have a working local copy with most (if not all) references changed to relative links. This means you can browse it locally and upload the html to any domain.
If not, the download is missing some files somewhere, so we can try to get the complete list of archived items and download those.
If you still have Archive Downloader open then click the “Reset URL cache” button. You can also close and restart the application but you must be sure to set the same save folder as above for the following steps to work.
Paste in the same URL as before, but this time into the second text box (www.domain.com).
Choose the same target date and save folder (if needed).
Untick “Add found external links”
Untick “Overwrite files”
Click “Get Missing Files”
Again, if the previous step took a long time, this one might take even longer; as before, more proxies and threads will speed it up.
Finally, if you are still not satisfied with the downloaded site try this (quicker) fix.
Same again, click “Reset” or restart the application and use the same save folder.
This time choose a different year (start with one year earlier than previous steps).
Paste the url into the first text box again.
Untick “Add found external links”
Tick “Overwrite files”
Click “Start at URL”
This will get your start page again and just download the elements it directly references.
You can also try the above step with any broken links (404 pages) you might find when testing the downloaded archived site.
If you get a lot of time out errors during a run-through, you may miss some important files. To correct this, start the same run-through again but untick “Overwrite files”.
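Broken links in the downloaded copy can be found without clicking through every page: scan each saved HTML file for href/src references and check that each relative target actually exists on disk. A rough sketch of such a checker (the regex and helper names are illustrative, and it only handles simple double-quoted attributes):

```python
import os
import re

# Capture the path part of href="..." / src="...", stopping at ?, # or ".
LINK = re.compile(r'(?:href|src)="([^"#?]+)')

def broken_links(root):
    """Return (html_file, target) pairs where a relative link points
    at a file that does not exist under the download folder."""
    missing = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                html = f.read()
            for target in LINK.findall(html):
                if "://" in target:  # external link: skip
                    continue
                if not os.path.exists(os.path.join(dirpath, target)):
                    missing.append((path, target))
    return missing
```

Each missing target it reports is a candidate to feed back through “Start at URL” with “Overwrite files” ticked, as described above.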
Current known issues/workarounds
If you have any language packs for .NET installed you *may* need to uninstall them before the error reporting functions correctly.
Old versions (1.0 to 1.2) should be uninstalled via your control panel. All future versions will be portable with no installer.
Sub-domains are treated as external sites. For example, when downloading domain.com the crawler will not follow links to cdn.domain.com. This is to stop it trying to download the whole internet! The current workaround is to download sub-domains separately and then recombine the downloads manually.
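Recombining a separately downloaded sub-domain mostly means rewriting the absolute links to it so they point at a local folder holding that second download. A sketch of the rewrite (the domain names and folder layout are examples only, not something the tool does for you):

```python
import re

def relink_subdomain(html, subdomain, local_folder):
    """Rewrite absolute links to a sub-domain (e.g. cdn.domain.com)
    so they point at a local folder holding its separate download."""
    pattern = re.compile(r"https?://" + re.escape(subdomain) + "/")
    return pattern.sub(local_folder.rstrip("/") + "/", html)

page = '<img src="http://cdn.domain.com/logo.png">'
print(relink_subdomain(page, "cdn.domain.com", "cdn/"))
# <img src="cdn/logo.png">
```

Run this over every HTML file in the main download, with the sub-domain’s files copied into the matching local folder, and the combined site should browse as one.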
Archive Downloader will need private proxies to download large sites in a reasonable time frame. Without them you will begin to be throttled and will start getting time out errors. You can use private proxies that are either IP-locked or username/password-locked by saving them in a text file in the formats shown below. Public proxies should not be used, as these very often insert HTML code of their own, which at best will crash Archive Downloader and at worst fill your downloaded site with adverts/malware.
Proxy list text file formats