Archive.org Downloader

Download HTML versions of archived domains from archive.org’s Wayback Machine

Downloads the complete website (including all available pages, images, stylesheets and JavaScript) and saves it to a folder on your PC.

Archive.org Downloader attempts to correct all internal linking so that the site is saved on your computer in a browsable form.*

*Due to dynamic page content, exact copies cannot be guaranteed

Download Here

Current Version 1.5
Portable version, no installer.
If updating, please note down your license email as you will need to enter it again.

Usage:

  • When you first run the program it should activate your 7-day trial period.  If it does not, please enter “7 DAY TRIAL” as the registration email.
  • To begin scraping a site from Archive.org, first click the “save as…” button to select the destination folder for saving (e.g. Documents).
  • Then enter the start page to begin crawling.  It’s worth noting that www.domain.com and domain.com can give different results.  Before downloading a site, decide which one was used when the site was live (you can use MajesticSEO or Ahrefs metrics to decide this).
  • The final setting needed is the target date.  While downloading, Archive Downloader will attempt to get the cached copy of each page that is nearest to this date (see the sketch after this list).
  • Now click “scrape domain” to start the scraper.
  • To speed up the process you can increase the number of threads, but be aware that archive.org does ban IP addresses, so for more than 10 threads you should load some proxies first.  You will see a substantial speed increase when using proxies due to the throttling applied by archive.org.
  • Once the scraper has finished you should have a local, hopefully browsable, copy of everything that is available!
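
How the tool performs this nearest-to-date lookup internally is not published, but archive.org exposes a public availability API that does exactly that.  A minimal Python sketch, with www.domain.com and the date as placeholders:

  import json
  import urllib.parse
  import urllib.request

  def closest_snapshot(url, timestamp):
      """Ask the Wayback Machine for the capture of `url` nearest to
      `timestamp` (YYYYMMDD or YYYYMMDDhhmmss)."""
      query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
      with urllib.request.urlopen(
              "https://archive.org/wayback/available?" + query) as resp:
          data = json.load(resp)
      # "closest" is absent when nothing was archived near the date.
      return data.get("archived_snapshots", {}).get("closest")

  # e.g. the capture of www.domain.com nearest to 1 June 2009:
  snap = closest_snapshot("www.domain.com", "20090601")
  if snap:
      print(snap["timestamp"], snap["url"])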

[Screenshot: Archive.org Downloader 1.5]

Download Here

Latest Updates and Features in 1.5

  • <base> tag bug fixed!
  • Added option to retry time-out errors caused by too many threads.

Latest Updates and Features in 1.4

  • You can now enter an exact date and the software will return the copy nearest to it.
  • Choose whether files are renamed to .html (best results) or left as-is (less compatible, but fixes a lot of JavaScript issues).
  • Strips all archive.org-inserted HTML and comments (see the sketch after this list).
  • Right-click the output window to clear or copy its contents.
  • Private and semi-private proxies fully supported (for formats, see the end of this page).
  • Multiple bug fixes!
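
Some background to the stripping feature above: the Wayback Machine injects its toolbar into every archived page between well-known HTML comments.  A rough Python sketch of that kind of clean-up (the exact markers and rules the tool uses are an assumption):

  import re

  # The Wayback Machine brackets its injected toolbar with these comments.
  WAYBACK_BLOCK = re.compile(
      r"<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->.*?"
      r"<!--\s*END WAYBACK TOOLBAR INSERT\s*-->",
      re.DOTALL)

  def strip_wayback_markup(html):
      """Remove the injected toolbar block from a downloaded page.
      (Per the feature list, the tool also strips the archive's other
      inserted HTML and comments.)"""
      return WAYBACK_BLOCK.sub("", html)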

The best method to get the most complete working download from archive.org

First find a working page to start from on archive.org.  Copy its URL into the first text box: use the original page URL, not the web.archive.org one (from https://web.archive.org/web/<timestamp>/http://www.domain.com/page.html you want http://www.domain.com/page.html).

  • Choose the target date you want a copy for.
  • Select the save folder.
  • Tick “Add found external links”
  • Untick “Overwrite files”
  • Click “Start at URL”

Archive Downloader will now begin crawling from the specified page and downloading any new items (pages/images/JavaScript etc.) it finds.

For large sites this might take a while.  Obviously the more proxies and the more threads, the faster it goes.  If you are running too many threads per proxy you might start to see “time out” errors; to rectify this, lower your thread count.  From v1.5 onwards, auto retry of time-out errors is enabled by default.

If you do see a lot of time-out and similar errors you may want to consider running the download again with the same settings (with “Overwrite files” unticked) to retry the timed-out items.  You can quickly do this by clicking the red “Reset URL Cache” button and then “Start at URL” again.  Tick “Dont show skipped files” to see only the downloaded items on a second run through.  (This is not needed if “Auto Retry Time Out items” is checked.)

Hopefully you now have a working local copy with most (if not all) references changed to relative links.  This means you can browse it locally and upload the HTML to any domain.
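
To illustrate what “changed to relative links” means in practice (this is an illustration of the idea, not the tool’s actual code): a reference captured as a full web.archive.org URL has to be reduced to a path relative to the save folder, roughly like so:

  from urllib.parse import urlparse

  def to_local_path(archived_url):
      """e.g. http://web.archive.org/web/20090601000000/http://domain.com/css/s.css
      becomes css/s.css, which resolves from the local save folder."""
      idx = archived_url.find("/web/")
      if idx == -1:
          return archived_url          # not a Wayback URL; leave it alone
      rest = archived_url[idx + len("/web/"):]
      # Drop the 14-digit timestamp (and any modifier such as "im_").
      original = rest.split("/", 1)[1]
      return urlparse(original).path.lstrip("/")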

If not, your site is missing some files somewhere, so the next step is to get the complete list of archived items and download those.

If you still have Archive Downloader open then click the “Reset URL Cache” button.  You can also close and restart the application, but you must be sure to set the same save folder as above for the following steps to work.
  • Paste in the same URL as before, but this time into the second text box (www.domain.com).
  • Choose the same target date and save folder (if needed).
  • Untick “Add found external links”
  • Untick “Overwrite files”
  • Click “Get Missing Files”

Again, if the previous step took a long time, this might take even longer; the same advice about proxies and threads applies.

Websites that still do not download correctly after both steps above usually rely heavily on JavaScript for displaying the page or for navigation.  The best way to test this is to host the files locally using a web server such as WampServer.
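
If you do not want to install a full WAMP stack just for a quick check, any static file server will do.  For example, with Python installed, the few lines below (the folder path is a placeholder) serve the save folder at http://localhost:8080; it is the same as running “python -m http.server 8080” from that folder:

  from http.server import HTTPServer, SimpleHTTPRequestHandler
  import os

  os.chdir(r"C:\path\to\save-folder")   # placeholder: your save folder
  HTTPServer(("localhost", 8080), SimpleHTTPRequestHandler).serve_forever()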

Finally, if you are still not satisfied with the downloaded site, try this (quicker) fix.

Same again: click “Reset URL Cache” or restart the application and use the same save folder.

This time choose a different year (start with one year earlier than in the previous steps).

  • Paste the URL into the first text box again.
  • Untick “Add found external links”
  • Tick “Overwrite files”
  • Click “Start at URL”

This will get your start page again and just download the elements it directly references.

You can also try the above step with any broken links (404 pages) you might find when testing the downloaded archived site.

Final word of warning:  This application will also download any JavaScript files it finds and attempt to fix the paths in them so they run from any folder.  You must ensure you have an up-to-date anti-virus installed before browsing downloaded sites.

If you get a lot of time-out errors during a run through then you may miss some important files.  To correct this, start the same run through again but untick “Overwrite files”.

Current known issues/workarounds

If you have any language packs for .NET installed, you *may* need to uninstall them for error reporting to function correctly.

Old versions (1.0 to 1.2) should be uninstalled via your control panel.  All future versions will be portable with no installer.

Sub-domains are treated as external sites.  For example, when downloading domain.com the crawler will not follow links to cdn.domain.com.  This is to stop it trying to download the whole internet!  The current workaround is to download sub-domains separately and then recombine the downloads manually.

Proxy Support

Archive Downloader will need private proxies to download large sites in a reasonable time frame.  Without them you will begin to be throttled and will start getting time-out errors.  You can use private proxies that are either IP-locked or username/password-locked by saving them in a text file in the formats shown below.  Public proxies should not be used: they very often insert HTML code of their own, which at best will crash Archive Downloader and at worst fill your downloaded site with adverts/malware.

Proxy list text file formats

proxy:port 
login:password@proxy:port
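
For example, a proxy list file mixing both formats (the addresses and credentials below are made up for illustration) would look like:

203.0.113.10:8080
203.0.113.11:8080
user1:pass1@203.0.113.12:3128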