Web archiving

Web archiving is the process of collecting the Web, or particular portions of it, and preserving the collection in an archive for future researchers, historians, and the public. Because the Web is so large, web archivists typically employ web crawlers to collect it automatically. The largest web archiving organization is the Internet Archive, which strives to maintain an archive of the entire Web. National libraries, national archives, and various consortia of organizations are also involved in archiving culturally important Web content.

Collecting the Web

Web archivists generally archive all types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing authenticity and provenance of the archived collection.
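As an illustration of the metadata described above, the following Python sketch pairs a captured payload with its access time, MIME type, and content length. The `CaptureRecord` class, the `make_record` function, and the added SHA-256 digest are hypothetical names and choices made for this example; they are not taken from any particular archiving tool.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CaptureRecord:
    """Hypothetical record pairing an archived payload with its metadata."""
    url: str
    access_time: str      # ISO 8601 timestamp of the capture, in UTC
    mime_type: str        # media type taken from the Content-Type header
    content_length: int   # size of the payload in bytes
    sha256: str           # assumption: a digest kept for later integrity checks


def make_record(url, headers, body, now=None):
    """Build a CaptureRecord from response headers and the raw payload bytes."""
    now = now or datetime.now(timezone.utc)
    # Keep only the media type, dropping parameters such as "; charset=utf-8".
    content_type = headers.get("Content-Type", "application/octet-stream")
    mime_type = content_type.split(";")[0].strip().lower()
    return CaptureRecord(
        url=url,
        access_time=now.isoformat(),
        mime_type=mime_type,
        content_length=len(body),
        sha256=hashlib.sha256(body).hexdigest(),
    )
```

Storing a digest next to the payload is one possible way to support later authenticity checks; real archive formats define their own record fields.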

Crawlers

Web archivists typically use web crawlers to automate the collection of web pages. A crawler generally views web pages in the same manner that a user with a browser sees the Web. Heritrix is a popular crawler used by many web archivists for making archive-quality crawls.
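The basic crawling loop can be sketched in a few lines of Python. This is a minimal illustration, not how Heritrix or any production crawler works: the `crawl` function, the `LinkExtractor` class, and the `fetch` callback are assumptions made for the example, and the sketch omits politeness delays, host restrictions, and non-HTML resources.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    queue = deque([seed])
    seen = {seed}
    archived = {}
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable or missing page: nothing to archive
        archived[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archived
```

Passing the fetch step in as a function keeps the sketch testable without network access; for example, `crawl("http://example.org/", site.get)` walks an in-memory dictionary `site` that maps URLs to HTML strings.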

On-demand

There are numerous services that individuals may use to archive web resources "on-demand":
* WebCite, a service specifically for scholarly authors, journal editors and publishers to permanently archive and retrieve cited Internet references (Eysenbach and Trudel, 2005).
* Archive-It, a subscription service that allows institutions to build, manage, and search their own web archive.
* hanzo:web, a personal web archiving service created by Hanzo Archives that can archive a single web resource, a cluster of web resources, or an entire website as a one-off collection, a scheduled or repeated collection, an RSS/Atom feed collection, or an on-demand collection via Hanzo's open API.
* Spurl.net, a free online bookmarking service and search engine that allows users to save important web resources.

Difficulties and limitations

Crawlers

Web archives that rely on web crawling as their primary means of collecting the Web are subject to the difficulties of web crawling:
* The robots exclusion protocol may request that crawlers not access portions of a website. Some web archivists may ignore the request and crawl those portions anyway.
* Large portions of a website may be hidden in the deep web. For example, the results page behind a web form lies in the deep web because a crawler cannot follow a link to the results page.
* Some web servers may return a different page to a web crawler than they would to a regular browser request. This is typically done to fool search engines into sending more traffic to a website.
* Crawler traps (e.g., calendars) may cause a crawler to download an infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl.
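Two of the points above, the robots exclusion protocol and the cap on dynamic pages, can be sketched with Python's standard `urllib.robotparser` module. The `CrawlPolicy` class, the `ExampleArchiver` user agent, and the rule that a URL with a query string counts as "dynamic" are all assumptions made for this example; actual crawlers use more elaborate heuristics.

```python
from collections import Counter
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class CrawlPolicy:
    """Hypothetical gatekeeper consulted before each URL is fetched."""

    def __init__(self, robots_lines, user_agent="ExampleArchiver",
                 max_dynamic_per_host=50, honor_robots=True):
        self.user_agent = user_agent
        self.max_dynamic_per_host = max_dynamic_per_host
        # Some archivists choose to ignore robots.txt; this flag models that.
        self.honor_robots = honor_robots
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # lines of an already fetched robots.txt
        self.dynamic_seen = Counter()    # dynamic URLs admitted so far, per host

    def allows(self, url):
        """Return True if the URL may be fetched under this policy."""
        if self.honor_robots and not self.robots.can_fetch(self.user_agent, url):
            return False
        parts = urlsplit(url)
        # Assumption: a query string marks a dynamically generated page.
        if parts.query:
            if self.dynamic_seen[parts.netloc] >= self.max_dynamic_per_host:
                return False  # budget spent, which guards against crawler traps
            self.dynamic_seen[parts.netloc] += 1
        return True
```

With a budget of three dynamic pages per host, a calendar that generates a new URL for every month would be cut off after the third month, while static pages on the same host remain eligible.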

The Web is so large that crawling a significant portion of it requires substantial technical resources. It also changes so quickly that portions of a website may change before a crawler has finished crawling it.
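One way to notice that a site changed while it was being crawled is to re-fetch the captured pages at the end of the crawl and compare content digests. The Python sketch below assumes this approach for illustration only; the `changed_during_crawl` function and the `refetch` callback are hypothetical names, not part of any archiving tool.

```python
import hashlib


def digest(body):
    """Return the SHA-256 hex digest of a payload given as bytes."""
    return hashlib.sha256(body).hexdigest()


def changed_during_crawl(first_pass, refetch):
    """List URLs whose content differs between the capture and a re-fetch.

    first_pass maps each URL to the bytes captured during the crawl;
    refetch(url) returns the current bytes, or None if the page is gone.
    """
    changed = []
    for url, body in first_pass.items():
        current = refetch(url)
        if current is None or digest(current) != digest(body):
            changed.append(url)
    return sorted(changed)
```

A non-empty result suggests the archived copy mixes pages from different states of the site, although pages that embed timestamps or rotating content will be flagged even when nothing meaningful has changed.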

General limitations

Web archivists must contend not only with the technical challenges of web archiving but also with intellectual property law. Peter Lyman (2002) states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web." Some publicly accessible web archives, such as WebCite and the Internet Archive, allow content owners to hide or remove archived content that they do not want the public to access. Other web archives are accessible only from certain locations or have regulated usage. WebCite's FAQ also cites a recent lawsuit against Google's caching mechanism, which Google won.

References

*
*

See also

* Archives
* Heritrix
* Internet Archive
* UK Web Archiving Consortium
* Web crawling
* WebCite

External links

* International Internet Preservation Consortium
* WebArchivist
* Web archiving bibliography
* Web archiving programmes:
** Digital Archive of Chinese Studies
** European Archive
** Internet Archive
** Kulturarw3
** Minerva
** netarchive.dk
** Pandora
** Paradigma
** UK Government Web Archive
** UK Web Archiving Consortium
** WARP



This is the "GNU Free Documentation License" reference article from the English Wikipedia. All text is available under the terms of the GNU Free Documentation License. See also our Disclaimer.