


Pre-populate the Cache Using Wget

Instead of setting up a mirrored website as described in the previous section, a better approach is to populate the proxy cache using an automated process. This method has been described by J. J. Eksteen and J. P. L. Cloete of the CSIR in Pretoria, South Africa, in a paper entitled Enhancing International World Wide Web Access in Mozambique Through the Use of Mirroring and Caching Proxies. In this paper they describe how the process works:

An automatic process retrieves the site's home page and a specified number of extra pages (by recursively following HTML links on the retrieved pages) through the use of a proxy. Instead of writing the retrieved pages onto the local disk, the mirror process discards the retrieved pages. This is done in order to conserve system resources as well as to avoid possible copyright conflicts. By using the proxy as intermediary, the retrieved pages are guaranteed to be in the cache of the proxy as if a client accessed that page. When a client accesses the retrieved page, it is served from the cache and not over the congested international link. This process can be run in off-peak times in order to maximize bandwidth utilization and not to compete with other access activities.

The following command, scheduled to run at night once per day or week, is all that is needed; repeat it for every site that needs pre-populating. A sketch of a wrapper script and crontab entry follows the option explanations below.

wget --proxy=on --cache=off --delete-after -m http://www.python.org

Explanation:

-m Mirrors the entire site. wget starts at www.python.org and follows all hyperlinks, so it downloads all subpages.
--proxy=on Ensures that wget makes use of the proxy server. This might not be needed in setups where a transparent proxy is employed.
--cache=off Ensures that fresh content is retrieved from the Internet, and not from the local proxy server.
--delete-after Deletes the mirrored copy from the local disk. The mirrored content remains in the proxy cache, provided there is sufficient disk space and the proxy server's caching parameters are set up correctly.
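
As a concrete illustration, the command can be wrapped in a small shell script that walks through a list of sites and is then scheduled with cron. The following sketch is not from the original text: the proxy address (localhost, port 3128, Squid's default), the script path and the site list file are assumptions made for the example.

#!/bin/sh
# prepopulate-cache.sh -- minimal sketch of a nightly pre-population run.
# The proxy address and site list file below are assumptions for this example.

# Point wget at the local Squid proxy (assumed to listen on localhost:3128).
http_proxy="http://127.0.0.1:3128/"
export http_proxy

# One URL per line, for example http://www.python.org
SITELIST=/etc/prepopulate-sites.txt

while read -r site; do
    # Same options as above: use the proxy, ask for fresh content rather than
    # the proxy's cached copy, mirror recursively, and discard the files afterwards.
    wget --proxy=on --cache=off --delete-after -m "$site"
done < "$SITELIST"

The script could then be started from the proxy server's crontab, for example at 02:00 every night:

0 2 * * * /usr/local/bin/prepopulate-cache.sh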

In addition, wget has many other options; for example, to supply a username and password for websites that require one. When using this tool, Squid should be configured with enough disk space to hold all the pre-populated sites and more (for normal Squid usage involving pages other than the pre-populated ones). Fortunately, disk space is becoming ever cheaper and disk sizes are far larger than ever before. However, this technique can only be used with a few selected sites: they must be small enough for the mirroring process to finish before the working day starts, and disk space usage should be monitored.
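
On the Squid side, the disk space available to the cache is governed mainly by the cache_dir directive, and the size of the largest object Squid will keep is set by maximum_object_size. The fragment below is only a sketch with example values (a 20 GB cache under /var/spool/squid and a 50 MB object limit); suitable numbers depend on the disk and the sites being pre-populated.

# squid.conf fragment (example values, not recommendations)
# cache_dir <storage type> <directory> <size in MB> <L1 dirs> <L2 dirs>
cache_dir ufs /var/spool/squid 20000 16 256

# Raise the largest object Squid will cache (the default is only a few MB),
# so that larger files fetched during pre-population are not discarded.
maximum_object_size 50 MB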



