Overview
The process of distribution to the end user answers two main needs:
- Getting updates and new packages;
- Getting the necessary files for a fresh installation.
In both use cases, the distribution process assumes that the software (whether new or updated) has been packaged in the distribution vendor's format of choice and that the user has to download files in order to install them. As noted in the packaging overview, alternatives to the traditional "download and install" method exist, such as the ZeroInstall system.
This section concentrates mostly on the second need: the user downloading the files needed to install a fresh OS (usually as CD ISO images). There are two reasons for this. First, an efficient solution to the second need can also solve the first one (getting updates and new packages). Second, the main problem is the download of huge amounts of data (the ISO images), whereas updates tend to be smaller and consume less bandwidth.
GNU/Linux systems are usually installed from CDs. At least one of the installation CDs is a boot CD that contains most of the "essential" installation files. Some distributions (for example Knoppix) run from a "LiveCD", which removes the need to install the system on the computer's hard disk.
Installation CDs can be obtained either by purchasing ready-made CDs from a vendor or by downloading the CD ISO images from one of the distribution vendor's mirror sites.
Mirror sites are servers which host some (or all) of the distribution files. Because of space constraints, a mirror does not necessarily (and often indeed does not) contain all of the available files. A mirror might choose to host only files for popular architectures (usually x86), might choose not to mirror source code files, or might host only the latest version of a distribution. In Debian, for example, mirroring the full set of distribution options requires approximately 100 GB of disk space.
Downloading ISO images consumes bandwidth, takes time and overloads the servers, so hosting a mirror that carries ISO images becomes quite expensive. Because of that, many mirrors do not carry ISO images at all and mirror only packages instead.
What follows is a case study of how installation files are distributed in Debian. Debian has given this problem much thought, and some tools exist to ease the load on the mirror servers. In spite of these tools, however, it remains a big problem for distributions today.
Case Study - Debian
Available files
The full Debian archive contains CD ISO images for the "Stable" and "Testing" distributions, and packages for all three distributions, including "Unstable". These three are the "canonical" Debian distributions; others may exist ("Experimental" and "Frozen"), but they are not always available.
The distributions are in the "dists" directory. All packages are stored in the "pool" directory, and files in the distribution directories point to them. CD ISO images are in the "debian-cd" directory. There are two branches of distributions and packages, one for "US" and one for "non-US", because of patent problems and other US restrictions. Both branches have "dists" and "pool" directories.
Mirroring structure
There are three levels of mirrors in Debian's hierarchy: primary, secondary and leaf. A mirror server gets its updates using either a Push or a Pull method. "Push" means that its parent mirror server notifies it whenever a new update is available (usually every 24 hours). "Pull" means that the mirror server itself checks periodically whether updates are available.
The primary mirror servers get their updates straight from the master archive site (which is not publicly accessible). They are usually powerful servers with good bandwidth, and they mirror the whole Debian archive. The primary mirrors usually get their updates with the "Push" method. The secondary mirrors get their updates from the primary mirror servers. The leaf mirrors get their updates from the secondary mirrors and do not provide files to lower-level mirrors (hence "leaf"). Since, as noted before, mirrors usually don't mirror the whole archive, a mirror that syncs from an incomplete archive will be at least as incomplete.
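The way incompleteness is inherited down the hierarchy can be sketched in a few lines of Python. This is only an illustrative model, not a description of Debian's actual mirroring scripts: the `Mirror` class, the `carries` field and the archive section names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Mirror:
    name: str
    carries: Set[str]                   # what the admin chooses to host
    parent: Optional["Mirror"] = None   # None: syncs straight from the master archive

    def available(self, master: Set[str]) -> Set[str]:
        """Files this mirror can actually serve: its own selection,
        limited to whatever its upstream mirror has."""
        upstream = master if self.parent is None else self.parent.available(master)
        return self.carries & upstream


# Hypothetical archive sections, for illustration only.
MASTER = {"i386", "powerpc", "source", "iso-images"}

primary = Mirror("primary", carries=set(MASTER))
secondary = Mirror("secondary", carries={"i386", "source"}, parent=primary)
leaf = Mirror("leaf", carries={"i386", "powerpc"}, parent=secondary)

for m in (primary, secondary, leaf):
    print(m.name, sorted(m.available(MASTER)))
```

In this model the leaf mirror asks for "powerpc" but cannot serve it, because its secondary parent never mirrored it.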
The full list of Debian mirrors is available here or here (more detailed).
Problems
The main problems with the mirroring and distribution scheme are as follows:
- Heavy load on mirror servers, due to bandwidth
Each ISO image is approximately 650 MB in size. The "Stable" distribution today consists of 7 images, while the "Testing" distribution consists of 15 (!) images. Downloading the complete "Stable" distribution, for example, transfers more than 4 GB (7 x 650 MB) to the user's computer. Servers which carry CD images get overloaded quickly, especially right after Debian announces a new release. To make things worse, many users download the 650 MB files with their browsers or with other tools that don't support resuming. Since 650 MB is a large file, chances are that the download will not complete successfully and the user will have to start again from the beginning. Since few servers can afford such a load, most mirrors choose not to carry the Debian CD images, which puts even more load on those that do.
- Bandwidth/time needed for users
As noted previously, users need a lot of bandwidth to download CD images from the mirror servers. Downloading the whole of "Stable" (or "Testing", which is roughly twice as large) takes a lot of time and can be discouraging. Users without high-speed Internet connections can't realistically download such a volume and are encouraged to purchase ready-made CDs. Another problem is that a user who downloads whole CD images also downloads a lot of data that they will never need or install. A user who needs just one file that resides, for example, on CD image no. 5 still has to download the whole 650 MB of that image.
- Incomplete archives, due to limited disk space and/or bandwidth
Space and/or bandwidth restrictions may lead to incomplete archives on secondary or leaf mirrors. Users who want to download from the mirror geographically closest to them might not have that option if the mirror does not carry the architecture or distribution they want. Again, the more complete mirrors get more load.
- Availability of mirror servers
Whether due to overloading or to other problems, many servers have access problems and either do not respond or return errors. Results of automatic checks made on the various mirror servers can be seen here and here (more detailed). These results show that many mirror servers are actually inaccessible or otherwise do not provide the expected services in a standard fashion. For mirrors with CD images the situation is even worse: it's evident that _most_ of the servers either do not have all the CD images or have some problem responding to the automatic check. Naturally, since so many servers are problematic, the good servers get overloaded.
- Time for an update to get to a leaf mirror
Usually, a mirror server is updated every 24 hours, either by probing its parent mirror for updates (Pull) or by being notified of them directly (Push). With three levels of mirrors, each of which may lag its parent by up to 24 hours, a leaf mirror may receive the latest versions as much as 3 full days after the master Debian archive has been updated. When the update is security-related or an urgent fix, this may simply be unacceptable. Users are encouraged to download updates from their nearest server, which is usually a leaf mirror. But since an update can take so long to arrive there, they may decide to download from a primary server instead and thus overload it.
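The figures quoted in the list above can be checked with some back-of-the-envelope arithmetic. The sketch below uses the image size, image counts, sync interval and number of mirror levels from the text; the link speeds (a 56 kbit/s modem and a 1 Mbit/s line) are assumptions chosen only for illustration, and the delay estimate assumes the worst case in which every level just misses its parent's update.

```python
IMAGE_MB = 650            # size of one CD ISO image (from the text)
STABLE_IMAGES = 7         # images in "Stable" (from the text)
TESTING_IMAGES = 15       # images in "Testing" (from the text)
SYNC_INTERVAL_HOURS = 24  # typical mirror update interval (from the text)
MIRROR_LEVELS = 3         # primary, secondary, leaf


def total_mb(images: int) -> int:
    """Total download size of a full image set, in MB."""
    return images * IMAGE_MB


def download_hours(size_mb: float, link_kbit_per_s: float) -> float:
    """Ideal transfer time, taking 1 MB as 10**6 bytes and ignoring overhead."""
    bits = size_mb * 1_000_000 * 8
    return bits / (link_kbit_per_s * 1000) / 3600


def worst_case_delay_hours(levels: int = MIRROR_LEVELS,
                           interval: int = SYNC_INTERVAL_HOURS) -> int:
    """Worst case: each level just misses its parent's update."""
    return levels * interval


stable_mb = total_mb(STABLE_IMAGES)
testing_mb = total_mb(TESTING_IMAGES)
print(f"Stable: {stable_mb} MB, Testing: {testing_mb} MB")

# Hypothetical link speeds, for illustration only.
for kbit in (56, 1000):
    print(f"Stable at {kbit} kbit/s: {download_hours(stable_mb, kbit):.1f} hours")

print(f"Worst-case delay to a leaf mirror: {worst_case_delay_hours()} hours")
```

Even under these idealized assumptions, a modem user would need about a week of uninterrupted transfer for "Stable" alone, which is why purchasing ready-made CDs is suggested for such users.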
The solutions available today do not really solve the core problem (a few good servers carrying the load of most Debian users), but they ease the symptoms somewhat.
CD ISO images can be downloaded using the peer-to-peer system BitTorrent. BitTorrent eases the load on the servers, since much of the data is received from other peers downloading the same files instead of only from the mirror servers. For the download to be fast, many users have to be downloading the same file and must keep sharing it after they have finished downloading. One problem with BitTorrent is that some company policies block peer-to-peer communications.
A single "network install" CD enables the user to install the entire operating system. An even smaller floppy disk version also exists. The CD contains just the minimal amount of software needed to start the installation and to fetch the remaining packages over the Internet.
Net install is a good option for single workstations with low-bandwidth Internet access, since only the needed packages are downloaded, as opposed to full-blown ISO images containing many packages that the user may not want to install. However, when more than one workstation is installed, the same packages are downloaded again for each machine; this becomes inefficient, and downloading the CD images is recommended instead.
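The trade-off between net install and full CD images can be made concrete with a small calculation. The sketch below is only an illustration: the image set size comes from the 7 x 650 MB "Stable" figure in the text, while the 500 MB downloaded per net-installed workstation is a made-up number, and the function names are hypothetical.

```python
def net_install_total_mb(workstations: int, per_install_mb: int) -> int:
    """Data transferred when every workstation fetches its own packages."""
    return workstations * per_install_mb


def break_even_workstations(image_set_mb: int, per_install_mb: int) -> int:
    """Smallest number of workstations for which downloading the image set
    once transfers strictly less data than net-installing each machine."""
    return image_set_mb // per_install_mb + 1


IMAGE_SET_MB = 7 * 650    # full "Stable" image set (from the text)
PER_INSTALL_MB = 500      # hypothetical size of one net install

n = break_even_workstations(IMAGE_SET_MB, PER_INSTALL_MB)
print(f"CD images pay off from {n} workstations on")
```

Under these assumed numbers a handful of machines is still cheaper to net-install; the picture changes if the installations are large or the workstations numerous.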
Jigdo (Jigsaw Download) is an application that creates CD images on the user's computer from packages downloaded from the "pool" directories. This means that the user doesn't have to download whole CD images as monolithic files. Instead, Jigdo knows (from a special .jigdo file) which packages reside on the CD image and downloads those packages from the various mirrors. The advantages are that Jigdo can download from all mirror servers (as opposed to only those servers that carry CD images) and that it can download different packages from different servers. With Jigdo, the load on the servers is distributed more evenly, which can also result in a faster download for the user. Jigdo can also be used to download images _for_ a mirror server (not only _from_ one), using the jigdo-mirror script. Users are encouraged to use Jigdo when downloading binary ISO images, instead of FTP, HTTP or other methods of direct download from the mirrors.
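The idea behind Jigdo can be illustrated with a simplified sketch. The code below is not Jigdo's actual file format or algorithm: it models mirrors as in-memory dictionaries, uses a hypothetical manifest of (package path, SHA-256 digest) pairs in place of the .jigdo file, and simply concatenates the verified packages, whereas a real ISO image also contains filesystem data around them.

```python
import hashlib
from typing import Dict, List, Tuple

Manifest = List[Tuple[str, str]]        # (package path, expected SHA-256 hex digest)
MirrorSet = Dict[str, Dict[str, bytes]]  # mirror name -> {package path: contents}


def assemble_image(manifest: Manifest, mirrors: MirrorSet) -> Tuple[bytes, Dict[str, str]]:
    """Build an image from individually fetched packages.

    Each package is requested from a different starting mirror (round-robin)
    to spread the load; other mirrors are tried if the first one lacks the
    file or serves a corrupt copy.
    """
    names = list(mirrors)
    image = bytearray()
    served_by: Dict[str, str] = {}
    for i, (path, digest) in enumerate(manifest):
        for k in range(len(names)):
            name = names[(i + k) % len(names)]
            data = mirrors[name].get(path)
            if data is not None and hashlib.sha256(data).hexdigest() == digest:
                image += data
                served_by[path] = name
                break
        else:
            raise LookupError(f"no mirror has a valid copy of {path}")
    return bytes(image), served_by


# Hypothetical packages and mirrors, for illustration only.
packages = {"pool/a.deb": b"aaa", "pool/b.deb": b"bbb", "pool/c.deb": b"ccc"}
manifest = [(p, hashlib.sha256(d).hexdigest()) for p, d in packages.items()]
mirrors = {
    "mirror-1": {"pool/a.deb": b"aaa", "pool/b.deb": b"bbb"},  # incomplete mirror
    "mirror-2": {"pool/b.deb": b"bbb", "pool/c.deb": b"ccc"},  # incomplete mirror
}

image, served_by = assemble_image(manifest, mirrors)
print(served_by)
```

Neither mirror in the example carries every package, yet the image can still be assembled, which mirrors the point made above about using servers that do not host CD images.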
Mandrakelinux
The Mandrakelinux case is quite similar to that of Debian:
- There is one master archive;
- Primary mirror servers mirror it;
- Secondary mirrors and leaf mirrors also exist;
- ISO images can be downloaded via BitTorrent;
- The same set of problems described in the Debian case study also applies here.
With Mandrakelinux it is not possible to use tools such as Jigdo to assemble CD images from the packages they contain. As for network install, it is possible to set up a dedicated server for it when installing more than one workstation. However, there is apparently no way to network install from a mirror server, so for installing just one workstation there is no shortcut and the user has to download the whole images.
Mirror server statistics for the official releases and for the Cooker release can be found here and here, respectively.
- Main.AssafSagi - 31 Jan 2005