State of the art on the distribution
2.1 Linux Distributions
This subsection briefly looks at the main Linux distributions and BSDs as these are the current alternatives to Mandriva Linux.
A Linux distribution or GNU/Linux distribution (or a distro) is a Unix-like operating system plus application software comprising the Linux kernel, the GNU operating system, assorted free software and sometimes proprietary software, all created by individuals, groups or organizations from around the world. Companies such as Red Hat, SuSE and Mandriva, as well as community projects such as Debian, assemble and test the software and provide it as a complete system, more or less ready to install and use. There are over 200 different Linux distributions in active development.
Figure "Basic Interactions in F/OSS Processes" sketches the basic interactions in today's F/OSS processes, from upstream development to installation on the end user's machine.
Basic Interactions in F/OSS Processes
We next provide a review of several major GNU/Linux distributions, followed by a discussion of the major problems encountered today regarding the distribution to end users.
2.1.1 Red Hat
Red Hat Linux split into two directions in 2003. One branch merged with Fedora, and is also known as the Red Hat community edition. The second became the commercial Red Hat Enterprise edition. The key legacy of Red Hat is its packaging technology - RPM - that is used today by several F/OSS projects.
Red Hat Network
Binaries for Red Hat Enterprise Linux are no longer provided via ftp across mirrors, but rather by a customised management architectural solution called
Red Hat Network (RHN). Customers use RHN to download distribution ISO's, errata (patches) and software packages. Clients that subscribe to RHN can automatically update their system in a customised way.
Two architectural models exist for RHN. The first is the
Hosted Model where the distribution is stored on the network. This model is recommended for individuals or small companies. The second model is the
Satellite Model and is used by bigger enterprises. It consists of placing an RHN on the customer's local network. A satellite server then serves the different client machines and connects to Red Hat in order to download updates. In this way, each client machine in the local network can use a different Linux configuration.
All communication between customers, managed systems and RHN is protected by SSL encryption for privacy and authentication. Every package (RPM) is gpg-signed and contains MD5 checksums for both the package and contained files to ensure data integrity before deploying on target systems.
An interesting aspect of the RHN is that a functional definition is presented in [23]. The network is accessible through an
Access API. The key abstraction is the
channel, which corresponds to a set of packages and every client machine that is connected to a specific channel can be updated when the content of the channel changes. Channels can be created and managed by the system administrator. One possibility is for him to associate access rights with a channel and thus control the local systems that read from it.
A channel can be used to implement a
staged environment. Along with the base channel that corresponds to the core system, other types of channels exist. A development channel is used by developers of the community to distribute their work. A Testing & QA channel is used to report on the packages under development and for bug reports. A production channel is used to develop beta versions. The architecture allows actions to be defined on channels by users. An example action could be to remove packages whenever a new version is available, or to rollback to a previous version of the system when a compilation error occurs. Another use case is a system administrator that downloads new patches and tests them on specific machines. If the test passes, he copies the updates in the production channel, where the users' machines are connected.
The Red Hat Network's Staged Architecture
The Red Hat Network has two useful lessons for us:
- The network suggests that distributing software to end-users is not independent of other F/OSS aspects. RHN captures a large slice of the open source process since it deals with installing, testing, QA and feedback.
- A functional architecture is defined that captures all major requirements of the distribution network. This can then be refined to specific architectures. This approach allows one to consider the requirements of the system independently of the underlying platform. Such an approach could be very promising for EDOS.
Fedora Project
The goal of the Fedora project is to work with the Linux community to build a complete, general purpose operating system exclusively from free software. A stable release is usually provided 2 or 3 times per year, and selected components from the distribution are chosen for incorporation into RedHat Enterprise Linux. Fedora is distributed through mirror servers listed at
http://fedora.redhat.com/download/mirrors.html (there were 222 mirrors in operation on the 26th of October 2004) and also via BitTorrent. The developer community is quite proactive, and bug reports are maintained via a Bugzilla site.
2.1.2 Debian
Debian is community project whose aim is to provide a free operating system based on the Linux kernel. The project organisation is funded by donations from industry. The development community reportedly is composed of thousands of developers.
The Debian Linux distribution network is based on mirror servers. All mirrors seem to be maintained by owners who need not be part of the Debian project.
Debian has 32 official mirrors (one in each major client country, listed at
http://www.debian.org/mirror/list) and about 340 non-official ones. The main difference is that an official mirror (with a name like
country.debian.org/debian) must be updated at least once a day and support
push mirroring. This is a technique that allows a server to inform and update its client mirrors as soon as it receives a new version. Mirrors are therefore hierarchically organized in two levels.
The time taken to effect a copy for pull-mirroring can vary, so each mirror contains a timestamp, accessible at
http://mirror.debian.org/status.html. Analysis of this log reveals that several mirror servers are not well maintained. A client (leaf) mirror simply compares its own time-stamp with its server mirror at pre-configured time intervals.
The size of a Debian release is about 8 GB for a supported architecture, and the whole thing is 100 GB. A distribution is composed of 8710 packages. A mirror contains a U.S. distribution version and a non-U.S. distribution version to avoid legal problems arising from U.S. patent law and encryption export restrictions.
The Debian distribution process, and the problems posed, is quite similar to that of Mandriva Linux, e.g. short release cycles and poorly maintained mirrors. The main difference seems to be the push mirroring used by primary mirrors that Debian employs.
Debian has three distributions: "Stable", "Testing" and "Unstable". Actually there are two more distributions - "Experimental", which contains volatile elements which - should they have bugs - may bring down the whole system (for example, a new file system), and "Frozen" which is a temporary distribution before "Testing" becomes "Stable". The "Experimental" distribution is not meant for personal use, but rather as a platform for trying out new ideas and testing them. The first 3 distributions are considered okay for home use (even "Unstable", though not recommended for beginners).
A new package usually gets into the "Unstable" distribution (though there are some exceptions, as noted here). This distribution contains packages which are supposed to be - on the whole - stable, according to their developers and sites like Freshmeat. However, those packages haven't been tested and integrated into the whole Debian distribution and so are considered for now to be "Unstable". It is important to note that because the packages in "Unstable" do have some degree of stability, there are some users who prefer to have the "Unstable" distribution installed on their machines - just to be among the first to get new and updated software.
An automatic process evaluates nightly the packages in the "Unstable" distribution. If certain criteria for a package are met (spent X days in "Unstable", has fewer critical bugs than its respective version on "Testing" and additional criteria) then the package is moved to "Testing". "Testing" is the distribution which is the release candidate.
Whenever the Debian release manager decides (which is not very often) a freeze is declared on the "Testing" distribution. At that point buggy packages are removed from the distribution and no new packages can be let in except for bug fixes. After an additional period of time the distribution goes into a "deep freeze" when no changes at all are allowed, except installation-related. When the distribution proves to be stable enough - it becomes the new "Stable" distribution and distributed as such. As implied before, the "Stable" distribution is not updated very often and so fits corporate users and servers, where keeping up with the "bleeding edge" is not a requirement.
Debian contributors make changes to packages' source code for them to fit with the whole Debian distribution, and the changes are kept alongside the original source. However, the Debian hierarchy has no "internal" and "external" distinction among contributors. Practically every one can become a maintainer of one package or more. The maintainer is actually the one responsible for uploading packages to the various distributions, while the developers send their source code and diff files to the maintainers.
Jigsaw Download
As stated in [12], Jigsaw Download, or in short jigdo, is intended to be the main way of distributing Debian CD images in the future. Jigdo enables the download of updated dvd/cd images in a faster way for two reasons. First, it downloads only the delta between an existing image and the updated image. Second, its basic download-able unit is a package, and thus the client connects also to mirror sites that don't host the entire image.
Jigdo's concept of operation is hereby explained. The image .iso file and the packages it is made of are used to build two files: A .template file which is the .iso file but has the packages replaced with their respectful MD5 checksums, and a .jigdo file that tells how to get each package according to its MD5 checksum. A client downloads these two files, then downloads the changed packages from the original .iso image, and finally uses the changed and unchanged packages to reconstruct the new .iso image. An experimental upgrade of Woody RC3 disc2 to RC4 had only 20% of the packages changed and thus needed to be downloaded.
The use of MD5 checksums offers great accuracy and reliability. The current implementation is neither multi-threaded nor is it designed for load balancing and thus performance could yet be improved. Combining jigdo with P2P ideas could be promising.
2.1.3 FreeBSD
Berkeley Software Distribution, or BSD for short, refers to a set of versions of the Unix operating system. The three principal free variants of BSD are FreeBSD, OpenBSD and NetBSD. This section describes the approach used by the FreeBSD release engineering team to make production quality releases of the FreeBSD Operating System [22] as well as the FreeBSD approach to making available and installing applications [6].
FreeBSD Development process
The development of FreeBSD is a very open process. FreeBSD is comprised of contributions from thousands of people around the world. Although the FreeBSD Project provides anonymous CVS allowing the community to review and contribute to the code, only a group of around 300 people are given write access to the CVS repository. These people are called
committers and are responsible for the bulk of FreeBSD development. An elected core-team of very senior developers is responsible for deciding the project's overall goals and directions.
In order to facilitate the rapid development of production quality releases, FreeBSD development has been split into two parallel tracks. The main development branch is the
HEAD of the CVS tree, known as "FreeBSD-CURRENT" or "-CURRENT". This branch is the "bleeding-edge" of FreeBSD development though which all new changes first enter the system. A more stable branch aimed at production environments, known as "FreeBSD-STABLE" or "-STABLE", is also maintained. Changes go from -CURRENT to -STABLE at a different pace, and with the general assumption that they have been thoroughly tested by the user community. This approach allows FreeBSD to provide a high security environment while continuing to improve the system and implementing new technologies and features. Both branches are located on a master CVS repository and are replicated via CVSup to mirrors all over the world.
Bug reports and feature requests are continuously submitted by users throughout the release cycle. Problem reports are entered into FreeBSD GNATS [10] database through email, the
send-pr application, or via a web interface.
FreeBSD Release process
The FreeBSD Release Process is based on a standardized release engineering procedure. This procedure emphasises the security and stability of FreeBSD releases and refuses to sacrifice these features for any self-imposed deadlines or target release dates.
New releases of FreeBSD are released from the -STABLE branch at approximately four month intervals. 45 days before the anticipated release date, the release engineer sends an email to the development mailing lists to remind developers that they only have 15 days to integrate new changes before the code freeze. This process is known as "MFC sweeps" ("Merge From CURRENT") and it describes the process of merging a tested change from the -CURRENT development branch to the -STABLE branch. Once the code enters the "Code freeze" state, it becomes much harder to justify new changes to the system unless a serious bug-fix or security issue is involved. Then, until final release is ready, at least one release candidate is released per week, the release engineering team being in constant communication with the security-officer team, documentation and port maintainers. When several candidates have been made available and all major issues have been resolved, a new branch is created for the release, the version number is bumped up and Release Tags are created. Only then is the new Release officially created.
For most conservative users, individual release branches were introduced with FreeBSD 4.3. These release branches are created shortly before a final release is made. After the release goes out, only the most critical security fixes and additions are merged onto the release branch.
FreeBSD Distribution process
FreeBSD is available from anonymous FTP sites and from CDROM.
The official FreeBSD public FTP sites are all mirrors of a master server that is open only to other FTP sites. When the release has been thoroughly tested and packaged for distribution, the master FTP site is updated. It may then take between several hours and two days before a majority of the Tier-1 FTP sites have the new software. Release engineers coordinate with the FreeBSD mirror site administrators before announcing the general availability of new software on the FTP sites. FreeBSD's handbook advises mirrors to load the release package set at least four days prior to release day. Thus the release is uploaded between 24 and 48 hours before the planned release time with "other" file permissions turned off. This allows mirror sites to prepare availability of new releases while avoiding that users start downloading it from mirror sites.
During the period between releases, nightly snapshots are built automatically by the FreeBSD Project build machines. The user community can keep their system up to date with -STABLE and -CURRENT development using CVSup and "make world" tools in order to download and apply latest patch sets to their system source code tree.
CVSup can mirror different kind of files like sources, binaries or symbolic links. It parses and understands the Revision Control System (RCS) files of a CVS repository, and continually keeps track of updates made on files. Performance is obtained through the use of a multi-threaded architecture on both client and server, which allows for more efficient use of both the upload and download channels. The authors claim that it is the fastest mirroring process available since it uses better the available bandwidth of the network. While in traditional systems the server sends a list of its files to clients, and then sends the files that need to be updated, a CVSup client creates a list of its files, sends the list to the server, and waits for the file updates.
FreeBSD Ports Collection and Packages
The FreeBSD
ports collection is the main system for installing new software versions on machines running FreeSBD. The FreeBSD web site maintains an up-to-date search-able list of all available
ported applications.
A FreeBSD port for an application is a collection of files designed to automate the process of installing an application from source code, i.e. downloading needed files, applying patches, installing dependencies, compiling the application then installing it. Amongst other advantages, unlike packages, ports allow users to compile applications with tweaked, non conservative, options specific to their environment. They also allow users to use application-specific compile time options and allow them to apply latest existing patches. Note that binary packages for most important ports are also available from FreeBSD servers, and that packages can be generated from ports tree.
As for system source code tree, FreeBSD port tree can be updated and kept up-to-date using CVSup. Once the port tree has been updated, installed ports can be updated using the
portupgrade tool. Ports security check is ensured by the
portaudit tool which checks FreeBSD database for known ports issues. Once installed
portaudit is automatically run at ports installation time and can be run on a regular basis to check already installed ports.
In FreeBSD, anyone may submit a new port, or volunteer to maintain an existing port if it is unmaintained, not needing any special commit privileges. The guidelines for creating and maintaining ports can be found in the
Porter's Handbook [7].
2.1.4 Mandriva Linux
Mandriva Linux is a Linux distribution created by Mandriva. The first release was based on Red Hat Linux (version 5.1) and KDE (version 1.0) in July 1998. It has since diverged from Red Hat and has included a number of original tools mostly to ease system configuration.
Mandriva's development version of the next MandrivaLinux release is called "Cooker". The purpose of Cooker is to improve the Mandriva Linux distribution by permitting a better interaction between the development team and the Mandriva Linux users, both for debugging and adding new features. It is an entire distribution unto itself, that is constantly in progress and sometimes cannot even be installed because it is broken itself due to the incompatibilities.
Cooker is by all means a distribution, albeit it might be unstable because it is in testing status. About every 6 months a new stable release is out. Before the release (about 3 months before) a beta version is already out for users to play around with and submit bugs. Later, as testing (and subsequent fixing) proceeds, the version becomes more stable and is declared a release candidate (RC).
The packages in a Mandriva Linux distribution are divided into two categories:
main and
contrib.
Main includes the packages which are essentially the "sponsored" release. These packages have been tested and verified before making it into the next release. As the Cooker version becomes more and more stable, "freezes" are declared and no more new contributions to the packages are permitted, except for fixing serious bugs.
The other category,
contrib, contains pieces of software which are not part of the core of the distribution, but they are still supposed to work along with the release. When "freezes" are declared, it is still possible to contribute and submit new and updated packages to contrib.
Whenever a contributor packages a new piece of software, she has to put it in the "incoming" folder of the Mandriva's FTP server. Also, she has to notify the Cooker mailing list and the editor, so he will know that it exists and that he has to decide what to do with it. Contributors are encouraged to package only the source code (source packages) and not the binaries, since the editor has to perform some validation tests on the code (to prevent trojan horses, non-licensed software, other legal problems and so on).
Source code is often changed to fit with Mandriva Linux distribution. Usually, the contributor who packaged the software also makes some changes. In other cases, it will be the editor's duty to tweak the code. Either way, the original sources are kept along with a diff file containing all the changes that were made.
Each package in the distribution has a
maintainer. The
maintainers are persons who are "trusted" by Mandriva. A new contributor is called "external contributor" and he can only upload packages in the way described previously. He can not be their maintainer. The maintainer is an "internal contributor". Those are people who were once "external contributors" but were considered trustworthy by the editor due to their activities up until now.
The Mandriva Linux distribution process can be described by the figure below:
The MandrivaLinux Distribution Process
As depicted also in the figure, in Mandriva Linux developing process we can identify two main cycles performed in the preparing process for a new release.
The first one, marked with A, is rather an internal cycle, specific to Mandriva's package maintainers. Each maintainer is a Mandriva Linux developer in charge of a particular set of packages, who searches for the last version of package, builds it on the machines in the cluster and inserts it into Cooker. At this point, when a new version of a package is uploaded into Cooker (e.g.: a new version of Perl library), some inconsistencies may appear between the new package and the dependency related packages (e.g.: applications using the Perl library). Therefore, the maintainer must check the packages affected by the last update and return them to the cluster. They are rebuilt here and reinserted afterwards in Cooker.
The number of people implied in this development cycle is restrained and limited only to the package maintainers. The regular developers and contributors are not allowed to add or to modify packages in Cooker.
On the other hand, the second cycle is much more large and involves a lot more people. It represents the way of taking benefit from the Mandriva Linux community's contribution. As we mentioned before, the role of Cooker is to provide the community with the last versions of packages included in the Mandriva Linux release currently under development. The distribution process is done in the classic fashion, using mirror sites organised in a multi-level hierarchy.
The first step of the distribution consists in replicating the whole Cooker release on a Mainserver using
rsync for synchronization. Mandriva disposes of a fixed set of primary mirror servers, called also "level one mirrors", which hold copies of the Mainserver. The primary mirrors get the updates using either push or pull method. Each primary mirror replicates the whole content of the Cooker release, meaning both source and binary packages, main and contributors packages, for all architectures.
The secondary mirrors get synchronized afterwards with the level one mirrors. In Mandriva Linux distribution the hierarchy of mirror servers goes up to 5 levels, but the autonomy of the secondary mirrors is rather strong. Therefore a strict control on the content of each secondary mirror or on the mirrors' network architecture can not be achieved. Each mirror decides for its own to which mirror to synchronize, on which time interval, and what content to replicate. Secondary mirrors use the pull method for synchronization.
Using one of the mirror servers, the user is able to receive the last version of the packages she is interested in. It is about a particular category of users, the ones that are willing to tryout and to test the latest features and improvements of the applications.
Users' feedback is done by Cooker's mailing list and Bugzilla reports.
2.1.5 Conary
Conary system was developed as a way of addressing the current limitations of
repositories, packaging and distribution management, and to make it possible for users to create their own, customized distributions in an easy manner. In the simplest sense, one can think of Conary as a package management system with a more consistent view of objects from the repository level down to individual files. The package management is also combined with a
version management scheme, providing provision for branches in package versioning.
Conary is also a
distributed software management system for Linux distributions. It replaces traditional package management solutions (such as RPM and dpkg) with one designed to enable loose collaboration across the Internet. Conary enables sets of distributed and loosely-connected repositories to define the components to be installed on a Linux system. Rather than having a full distribution come from a single vendor, Conary allows administrators and developers to branch a distribution, keeping only those pieces that meet their needs, while making it possible to easily incorporate components from other repositories across the Internet.
The first presentation of Conary was given at the
2004 Ottawa Linux Symposium by Eric Troan, project initiator and one of the founders of Specifix company. Specifix produces "rPath Linux", as a reference implementation of a Linux distribution based on Conary. It can be installed and used like any other distribution, or it can be used as the basis for the creation of a Linux distribution tailored specifically to match user needs.
Direct connections to EDOS
At a first glance, we can identify certain similarities between Conary addressed problems and EDOS project's main goals. We mention here the packaging management (including dependencies treatment) and the distributed architecture for the code distribution. The production of a customized Linux distribution is yet another aspect to be addressed by EDOS, and in particular by the measurement and evaluation methodology proposed in Wp5.
Our approach uses as a basis Mandriva Linux distribution, and recently Caixa Magica's Linux distribution, to observe and to improve the current processes implied by the production of a new release. Both distributions currently use the traditional RPM package management solution and the mirroring mechanism for distributing data (source code, binaries, documentation and meta-data). The new P2P architecture will use the distribution's publisher and trusted mirrors for package repositories. Furthermore, all the peers connected in the network will participate with their own resources in maintaining local repositories and will help in the distribution process, exchanging data.
A slight difference in EDOS perspective is that the consistence of the distribution must be maintained against the publisher's repository. Participants from the community are not supposed to branch a distribution, but instead, they will be able to query the system index and to obtain the needed pieces of the distribution from different other participants.
Major Concepts
Conary treats
files as "first class objects", which are managed by the framework as a whole. Files have a unique ID and a version history; they also have a set of attributes. One of those attributes is the file's location in the filesystem; moving a file is a simple matter of changing that attribute.
A
"trove" is a container holding one or more files and other troves. Files are contained by reference. A
"component" is a collection of files, by reference. Example components listed by Eric for the bzip2 package might be bzip2:runtime (binary files to run the program), bzip2:lib, bzip2:doc, and, of course, bzip2:source. Components can be aggregated together into packages. Both components and packages are considered to be "troves," for what it's worth.
Version strings are hung onto everything; Specifix has added some complexity to the versioning system, though. Each version string includes the repository name, a namespace (think of it as a distribution name), a branch name (for the creation of trees in the version space), the upstream package version, and a two-part local revision number. Needless to say, the version strings get long, but the system hides the full string most of the time. Creating versions in this way allows the system to easily determine which version of a package is the newest, which version of which distribution is built for, and so on.
Branching is done by adding a branch name to the version string. Branching allows the tracking of versions of packages which were shipped with a specific distribution, along with updates to those packages. There is also a special type of branch called a
"shadow" which tracks changes to the trunk it was branched from. Essentially, the shadow is automatically merged with each new version of the trunk it is following. This feature would be useful for somebody maintaining a derivative distribution; they want to keep up with what the source distribution is doing without losing track of their own changes. The only problem with shadows is that, like a number of other Conary features, they are not actually implemented yet.
"Flavors" are another Conary feature; they seem to be patterned after Gentoo's "USE flags." A flavor is a set of configuration options describing how all packages are to be built. This feature is used for multiple architecture support, or for building versions of distributions with different feature sets (e.g. creating a distribution without PAM support). Multiple flavors of a package can be installed on a system if they don't conflict with each other; this allows, for example, the installation of 32-bit libraries on x86-64 systems.
Then, there is the concept of
"changesets". A changeset is a collection of modifications to files (including attribute changes) and the troves which contain them. A changeset is, essentially, a patch to a package or a distribution. Changesets, which track only changes, can be much smaller than the packages they describe, and can thus be an efficient way of distributing updates. Changesets describe changes to configuration files in diff format, which often allows them to be merged automatically with local changes. A system administrator can also create a changeset describing his or her local changes to the system; that changeset can then be used for merging with updates, or replicating the system elsewhere. Local changesets can also be used for version control and the tracking of system changes.
"Tags" are Conary's answer to the package script problem (and, also, to the complex set of interactions represented by the RPM "trigger" mechanism). A tag is a file attribute describing the type of the file, be it "shared library," "info file," or any of a long list of alternatives. Most files can be tagged automatically by Conary. Tags have scripts associated with them; there is, for example, a script which handles the installation of an info file and updating the relevant directory. These scripts are distributed separately; there is only one copy of them on the system. The scripts are thus easily fixed when bugs turn up, and they can be customized by the local administrator if need be. Separating out the management scripts in this way should also make it easier to install packages from other distributions.
A
"fileset" is an arbitrary collection of files built from components in the repository. Filesets seem to be intended to help in the creation of small system images for embedded systems; they allow an easy picking and choosing of an exact set of desired files. "Groups" are, instead, the analog of the Debian "task" or Anaconda "component." They allow the management of several packages as a unit, but they come with their own local changesets so that local changes to the group are tracked properly.
Enhancements on EDOS Approach
As we briefly got an insight of the new concepts used by Conary, there are some aspects that could be seen as enhancements to EDOS goals and that could give possitive leads to complete our approach.
- The version management scheme used in Cooker distribution could be refactorized and possibly to support branching.
- The data unit used in EDOS distribution management is the package. Packages are grouped into Collections on a hierarchical base. Assuming this level of granularity for the "first class objects" may be evaluated as insufficient concerning the changes and the updates on particular files. Meanwhile, for keeping the coherence of the distribution, as well as for deploying an efficient index for the distribution process, the package level is more convenient.
- The unique ID used for indexing the packages is composed by the name of the package and its version number. The version history was not planed to be included in tracking the packages over the network, but this information resides in package meta-data. The first goal of the distribution system is to provide users with the latest version of the Linux distribution.
- The package location is stored in the P2P index and is automaticaly updated when a new peer holds the given package.
- An interesting point to add would be to consider the patches, "changesets" is Conary terminology, as distinctive objects in the system, and to use them in updating the distribution or, more complex, into the customization mechanism.
2.1.6 Other Linux Distributions
SuSE
SuSE is a major retail Linux distribution, produced in Germany and it's now currently owned by Novell.Inc.
SUSE Linux was originally based on Slackware Linux and it was founded in late 1992 as a UNIX consulting group, which among other things regularly released software packages that included SLS and Slackware, and printed UNIX/Linux manuals. They released the first CD version of SLS/Slackware in 1994, under the name S.u.S.E. Linux1.0. The name "S.u.S.E.", later shortened to just "SuSE", was originally an acronym for the German phrase "Software und System Entwicklung" ("Software and system development"). Unlike most other makers of Linux distributions who allow immediate download of their final versions, SUSE first releases the Personal and Professional versions in boxed sets which include extensive documentation, then waits a few months before it releases versions on its FTP servers.
Gentoo
Gentoo Linux is another popular Linux distribution. Even if its creator and former software architect, Daniel Robbins, imported the "Ports" system from the FreeBSD community, he constructed the distribution around a specific philosophy. First he wants Gentoo to remain free. Secondly, he wants that users maintain complete control over their machines. This last point is important since it differs from the way a distribution like Mandriva Linux works. Mandriva Linux furnishes software that is responsible for installing, uninstalling or updating packages. This works transparently and is comparable to Microsoft Windows systems. On the other hand, Portage (the "Ports" system of Gentoo) uses scripts to describe which, when and how packages are updated. The user configures his system exactly the way he wants. Even if a particular system evolves automatically over time (depending on how the user configured Portage), Gentoo also provides some "official" releases on CD-ROMs, through mirrors servers or via BitTorrent. Gentoo CVS servers can also be accessed over the Web.
Slackware
There is not a lot of documentation on the Slackware website. Their philosophy claims that they want to be the most "Unix-like" Linux distribution. Graphically we would represent Slackware as the intersection of Debian, Gentoo and LFS (Linux From Scratch). Slackware can be obtained through CDs, via BitTorrent, or via a mirror server.
Knoppix
Knoppix is a bootable CD-ROM containing a full Debian-based Linux distribution. No installation is required. The Knoppix CD automatically recognizes the hardware, launches a Linux kernel and then unzips and launches the different applications following user requests. An ISO image of this CD can be freely downloaded from the Knoppix website.
2.1.7 Problems Today
The main problems with the mirroring and distribution scheme are summarized here.
Much load on mirror servers, due to bandwidth
Each CD ISO image is approximately 650 MBs in size. In the Debian example, the "Stable" distribution today consists of 7 images, while the "Testing" distribution consists of 15 images. Downloading the complete "Stable" distribution, for example, will transfer to the user's computer about 4 GBs of data. Especially after a new release has been declared by Debian, servers which carry CD images get overloaded fast. To make things worse, many users download the 650 MB files using their browsers or other tools that don't support resuming. Since 650 MB is a big volume, chances are that the download will not complete successfully and the user will restart again from bit 0. Since not many servers can afford to be loaded so much, most mirrors opt to not archive the Debian CD images, thus making those who do - more loaded.
Bandwidth/time needed for users
As noted previously, users need a lot of bandwidth to download CD images from the mirror servers. Downloading 4 GBs for "Stable" (or twice that size for "Testing") takes a lot of time and can be discouraging. Users that don't have high-speed Internet connections can't really download such a volume. Such users are encouraged to purchase ready-made CDs. Another problem is that when downloading whole CD images, the user actually downloads a lot of data that he won't ever need or install. If the user needs but one file that resides, for example, on CD image no. 5, he will still need to download the whole 650 MBs of that image.
Incomplete archives, due to limited disk space and/or bandwidth
Space and/or bandwidth restrictions may lead to incomplete archival in secondary or leaf mirrors. Users who want to download from the mirror closest to them geographically, might not really have an option if that mirror does not carry the architecture or distribution they want. Again - the more complete mirrors get more load.
Availability of mirror servers
Whether due to overloading or due to other problems, many servers have access problems and either do not respond or have other errors. Results of automatic checks made to the various mirror servers can be seen at
http://mirror.debian.org/status.html and at
http://www.de.debian.org/dmc/today/. From these results, one can see that many mirror servers actually are inaccessible or otherwise may not provide the services needed in a standard fashion. Regarding mirrors with CD images, the situation is even worse and it's evident that most of the servers either do not have all the CD images or have some problem responding to the automatic check. Naturally, since so many servers are problematic, the good servers get overloaded.
Time for an update to get to a leaf mirror
Usually, a mirror server will be updated every 24 hours, either by probing its master mirror for updates (Pull) or by getting them directly (Push). This means that a leaf mirror may be updated with the latest versions 3 full days after the master Debian archive has been updated. When the update is security-related or an urgent fix, this may simply be unacceptable. Users are encouraged to download updates from their nearest server, usually a leaf one. But since an update can take a long time, they may decide to download from a primary server and so they can overload it.
2.2 F/OSS Projects
This section looks at other - non-Unix - F/OSS projects that have some lessons on code distribution.
2.2.1 Apache
Apache is a software foundation promoting the development of free and high quality software. Developers are volunteers who communicate only via mailing lists in order to keep a trace of the contents and to allow people to work in an asynchronous manner. This last point is essential since the developers are dispersed over the world and often work on the project during their spare time.
Politically, Apache does not employ a hierarchical structure to co-ordinate projects. They opt for a
meritocracy - the more you contribute, the more power you get. Anybody can take part in any of the Apache projects. A newbie typically starts by participating in a mailing list, contributing later by sending patches, and little by little, he becomes trusted by the other community members. He can then be granted direct access to the source code.
When decisions need to be taken, the community uses a basic voting system. The mailing list publishes the topic of the vote and a deadline, typically 72 hours. To vote, community members answer with "-1", "0", or "1" if they respectively disagree, have no opinion, or agree. Depending the case, a "-1" vote can be interpreted as a veto. In this case the vote is frozen until an agreement is found and all the members withdraw their negative vote.
The Apache Software Foundation (ASF) supervises the different projects through its Board of Directors. The board essentially deals with with political issues. All technical issues are delegated to each Project Management Committee. Despite this liberty, the different projects are organized in similar ways. Sources are stored in CVS servers that can be updated several times a day. Regression tests are provided with the sources. A developer that improves a code module then applies all available regression tests before asking its machine to automatically produce the patch (via CVS). This patch is then published, and the regression tests are updated.
2.2.2 Mozilla
Mozilla functions in a very similar way to the Apache Foundation. They also use the meritocracy as a political pillar and the same tools to coordinate development (CVS, Bugzilla...). They work on 6 different projects: Firefox, Thunderbird, Mozilla Suite, Bugzilla, Camino and Calendar Project. The Mozilla Foundation, created in July 2003, deals with organizational, legal, and financial issues for the Mozilla open-source software project. There are currently five members in the Mozilla Foundation Board of Directors. Mozilla.org is the central point that will maintain mailing lists, provide technical and architectural direction for the projects, collect changes and make periodically releases. New code is however essentially developed among the community members, of which there are currently several thousand. A patch or any modification from a community member is sent to the owner of the corresponding module (mozilla.org designs the different module owners), who includes it after testing.
One difference between Apache and Mozilla is that Mozilla does not use a voting system in order to take decisions. The Mozilla model is based on commercial software development processes. It is the module owner who decides what code gets included in his module and it is
mozilla.org which decides which modules get introduced into the repository. The aim is to avoid several parallel versions of the software. Mozilla calls this the
Benevolent Dictator system, because the Dictator (module owner or
mozilla.org) has always to make the best choices for the community if he wants to keep his place. Since it is an open-source project, if the module owner does not do his job well, the community members just have to design a new module. This is also true for
mozilla.org: if they do not meet the expectations of the module owners or the community members, another code assembler is designated.
An interesting site to mention here is
mozdev.org, which contains currently 200 applications. The projects hosted on here create applications and add-ons that are based on top of the source code provided by mozilla.org.
2.2.3 Open Office.org
OpenOffice has gained considerable success over the past few years as an alternative though compatible environment to MicroSoft Office. It runs on all major OS platforms, including Linux, MacOS and Windows. The project's APIs are open and use the XML standard for document representation.
OpenOffice is an off-shoot of StarOffice - a product bought by Sun Microsystems in 1999. The code base is written in C++, though APIs exist for other languages, including Java. The project is managed by a Community Council, one of whose goals is to oversee the status of the projects in progress. Projects can be classed as accepted, native language or incubator, and each has a designated lead assigned by the Council. The Council is supported by donations from the public. The software licenses used for OpenOffice distributions are LGPL and SISSL.
Software is distributed via a mirroring system. A two-tier mirroring set-up is employed with rsync being used to effect copies between them. A mirror is generally required to maintain two stable releases, and optionally a localised (to a country) release, a developer release and a contribution release (on which no QA has been effected yet).
In order to support the code distribution process, OpenOffice solicits different kinds of support from the community. The community can contribute documentation support - especially with respect to the different natural languages. Code contributions are made in response to issues posted by the project lead, and submissions are made via CVS.
The community is also involved in testing and quality assurance, and Issue Tracker - a follow-up to IssueZilla - is used to coordinate this. Users can contribute remarks, smoke tests - which are Web-based query forms - and can also run automated program unit tests (known as qadevOOo) that are written in Java.
2.2.4 Eclipse
Eclipse is a popular development environment used today that integrates several important development tools and has support for different languages. Its plug-in based architecture makes it extensible and it has now been deployed on a wide range of platforms. The environment is managed by the Eclipse Foundation, which is a non-profit consortium of industry leaders, including Borland, Hitachi and Sybase.
Eclipse is organized as a series of projects and sub-projects, and each has a designated lead who is responsible for overseeing the project: ensuring that development subscribes to open source principles such as meritocracy and transparency. Leads must adhere to a set of process guidelines that are formalized in a document known as a charter.
Eclipse projects are distributed using a mirroring architecture. The distribution size for all projects combined is around 65 Gigabytes and nightly builds can be as large as 1 Gigabyte. There are around 100 mirrors currently in operation, each is independently maintained and uses an rsync script to effect copies. Mirror sites are requested to make a copy at least once per day.
Developers use CVS to contribute code to builds to a project. Bugzilla is used for bug tracking and reporting, along with the standard newsgroups and mailing lists.
2.3 Data Dissemination
This section presents the state-of-the-art in data dissemination technologies, and reviews notable systems from each category.
2.3.1 Content Delivery Networks
A Content Delivery Network (CDN) is a network of computers grouped together to deliver content (web pages, media streaming, real-time video or audio) in a cooperative, optimized manner. The technology could also be used for the dissemination of code and binaries, and thus we shall review it representatives of it in the next sub sections.
As stated in [15], an emphasis is placed on performance, and high availability. There are two general approaches to building CDNs. The first is overlay model, in which the content is replicated to thousands of servers worldwide, and the network infrastructure is used transparently. The second is network model, in which code is deployed to the routers and switches so that requests forwarding could be decided according to a predefined policy.
Coral
Coral is a CDN with two basic aims. First, it seeks to prevent flash crowds on Web sites furnishing static content, whose bandwidth may be modest. Second, it seeks to minimise (RTT) latency for clients.
The CDN is composed of a number of volunteer machines running Coral software that incorporates a DNS server and a Web proxy. The DNS server converts URLs for a target Web server into URLs for a Web proxy, and tries to ensure that the proxy is close to the requesting client. Web proxies cache content from target Web sites. There is no centralized management of proxies; the network is self-organizing and a modified traceroot tool is used by nodes to maintain a performance map of the network.
Flash crowds are avoided by limiting the hits on each Web proxy cache. This is achieved using two techniques. The first involves a weaker form of DHT known as Distributed Sloppy Hash Tables (DSHT). The technique is based on the observation that DHTs perform poorly when there is high demand for a key. In DSHT, each node is annotated with information relating to recent hits on that node. During insertion, if the load on a node is high, then the content is stored on a node at a greater distance from the key. The second technique for combating flash crowds is that the set of volunteer sites are organised in a hierarchy of clusters. The nodes of a cluster are related in that recent measures confirm that the RTT between then is inferior to a certain value. Clusters towards the bottom of the hierarchy have greater RTT times, and consequently are more populated. The basic ideas is that, in a lookup or publish operation, the system restricts itself to a top-level hierarchy. If the DSHT operation fails, then the operation is repeated in a more densely populated, though slower, cluster.
Though the Coral system has interesting proposals for flash crowds, it does not treat some points that are important for the EDOS General Architecture, notably the links between units of content (dependencies) and the large size of the content base.
Akamai
Akamai addresses the issue of deploying high volumes of information, mainly off- web sites, without growing the infrastructure of the web site. [2] explains the general concept of operation. First, a tool called the Akamaizer is used to transform the original site's HTML documents to point to Akamai servers for media-rich objects. When clients download pages from the content provider they receive Akamai's resource locators (ARLs) instead of the original URLs. Finally, the media-rich (pix, audio, video) objects are then requested and received from the Optimal Akamai Server and not from the content provider.
The requests from the Akamai network itself are handled using a multi-level hierarchy of DNS servers. The request is directed to a Low Level DNS Server (LLDNS) which is "close" to the local DNS, and the LLDNS returns the IP address of the optimal content server to service the request. The DNS maps are computed fast enough to re-route over server failures and bottle necks. The TTL
for requests from the DNS servers is set high enough so that multiple requests from the same area use the same server on one hand, but also low enough to identify bottle necks and downed servers.
As for the subject of fault-tolerance: Machine failure is handled locally by the buddy system (backup machine).In case of a network or data center outage, traffic will not be directed to these servers after 1-2 minutes. However, the main pages (containing the ARLs) of the content provider are crucial and must be resilient. Monitoring is performed 24-7-365 by a Network Operations Center (NOC). After a problem is detected, automatic re-direction has already occurred and what remains is to fix the original problem itself. Clients can view statistics and current status via live and historic reporting.
2.3.2 Structured P2P Systems
By Structured P2P Networks, we refer to an overlay network witch is created by specific rules that maps content and location. For content we mean, for example, file names, but can me mapped any other resources, for location we mean for example node address. This networks emerge for the need of scalability that unstructured networks can't achieve. This networks offer a scalable solution to locate resources that have a known identifier (exact-match queries), but the keywords queries are still object of research in distributed environments. In the other hand, for efficiently routing messages, structured systems are hard to maintain when the rate of joining and leaving is high. Kadop, Chord, Pastry and others are examples of structured systems.
Chord
Chord [21] is a scalable peer to peer lookup service. Its main function is giving a key to map it to a node. The structured approach appeared due to the problems posed by already existing unstructured peer to peer systems, in which the placement of content was completely unrelated to the overlay topology. This unstructured solution meant that for content location a searching mechanism needed to be developed and usually it was not possible without implications regarding availability, scalability and persistence of the overlay network.
Structured systems try to address the scalability issues that unstructured systems were faced with. In a structured network the overlay topology is tightly controlled and files (sometimes pointers to files) are placed at specified locations. This means that the system provides a way of associating the content and location (usually a node address) in a form of a distributed route table so that queries can be efficiently routed to the node with the desired content. However the cost of providing a scalable solution for exact queries (this means a user should know the identifier of the requested data, it's not easy to search based on key terms) lies in the difficulty posed when maintaining the structure needed for efficiently routing messages when the leaving and joining rates for nodes are quite high.
Chord (and Pastry [18]) is a representative of structured peer to peer systems (Chord maintains also an associated finger table besides the distributed routing table).
In Chord data location is implemented by identifying data items (files) with keys and by storing the (key, data) pairs at the node that the key maps to. Nodes are also identified by keys. So both nodes and files have keys associated by means of a deterministic function. Node identifiers are ordered in an identifier circle modulo 2^m. A key k is assigned to the first node whose identifier is equal to or follows k in the identifier space.
Consistent hashing allows for load balancing as each node receives close to the same number of keys. Using this mode of assigning keys to nodes the only routing information needed is for each node to know its successor node on the circle. So when a query for a given key is made, it is passed around the circle via successor pointers until a node containing the key is encountered. This is actually the node the query maps to.
In order to maintain load balance, in Chord (beside the map function) when a new node joins the network, a part of the keys previously assigned to new node's successor will become assigned to the new node. When a node leaves the network all the keys assigned to it will be reassigned to its successor.
One of the main advantages of Chord it's the simplicity. Only one data element per node needs to be correct for Chord to guarantee correct routing. However this does not mean fast routing. Further research papers proved that performance degrades gracefully when routing information becomes obsolete due to nodes joining and leaving and availability remains high, but only if nodes fail independently. There are some problems as even though the overlay topology is not based on the underlying IP network it is still possible for a single failure in the IP network to cause multiple links failure in the Chord overlay.
As said, Chord only needs the successor of a node for correct routing. To increase the speed Chord maintains additional routing information, a "finger table" in which each entry x in this table points to the successor n + 2x.
So when a node n tries to perform a lookup for key k the finger table is read to identify the highest node n^next who's ID is between n and k. If Chord can find it the lookup is repeated from node n^next. Otherwise it is repeated from n's successor.
From EDOS point of view, DHTs like Chord are needed to support KadoP [1] because the functionality we need to provide regarding querying cannot be easily provided by the DHT (Chord only provides looking up data given a key). Chord appears less complex and needs less state information per node compared with other solutions but it does not consider network proximity when routing. And not least Chord has a high maintenance cost as each join/leave will induce state change on other nodes.
Pastry
Pastry [18] is a peer to peer content location and routing system based on a self-organizing overlay network of nodes connected over Internet. It is build to be completely decentralized, scalable, fault-resilient and to reliably route a message to the live node with the nodeID numerically closest to a key. It also automatically adapts to the arrival, departure and failure of nodes.
Pastry nodes are organized in a circular ID space, each node having a unique 128 bit nodeID. The nodeID is assigned randomly when a node joins the system by computing a cryptographic of the node public key or its IP address (not practical if nodes are behind NAT). Using this naming mechanism Pastry makes the assumption that nodeID's are generated so that the resulting set of nodeIDs is uniformly distributed in the circular nodeID space. Keys are also 128 bit values. Data associated with a key is stored in the node whose ID is numerically closest to the key.
For routing each Pastry node maintains several structures: a routing table, a neighborhood set and a leaf set.
Having N nodes in the network, a node's routing table is organized in log2^bN rows with 2^b
-1 entries in each row (the ID's are sequence of digits in base 2^b, usually b is 1 or 4). The n'th row of the routing table contains the nodeIDs and IP addresses of those nodes whose nodeID shares the present node's nodeID in the first n digits but differ in the n+1 digit. If there are more than 2^b
-1 qualified nodes, the closest 2^b
-1 nodes will be selected, according to proximity metric.
The neighborhood set contains the m closest nodes according to the proximity metric and it is not used for routing. It is used for maintaining locality properties.
The leaf set contains L numerically closest nodes:L/2 numerically smaller nodes and L/2 numerically larger nodes.
Given a message, the node first checks to see if the key falls into the range of nodeIDs covered by its leaf set. If so, the message is forwarded directly to the destination node, namely the node in the leaf set whose nodeID is closest to the key. If the key is not covered by the leaf set then the routing table is used and the message is forwarded to a node that shares a common prefix with the key by at least one more digit (so routing is based on prefix matching rather than numerical difference). It is possible that the appropriate entry in the routing table is empty or the associated node is not reachable in which case the message is forwarded to a node that shares a prefix with the key at least as long as the present node's nodeID. Such a node must be in the leaf set unless the message has already arrived at the node with the numerically closest nodeID.
Routing in Pastry can be random meaning the choice among multiple nodes can be made randomly. In the event of a malicious or failed node along the path, the query may be repeated several times by the client until a route is chosen that avoids the bad node.
When entering the ring, a new node n asks any existing node z to route its nodeID to numerically closest node x. Every node on the path from z to x will send complete routing table to n so the i'th row in node n routing table will be initialized with i'th row of i'th node on the path from z to x. In the end n will send a copy of its resulting state to each node in its routing table, neighborhood set and leaf set.
What makes Pastry different is the fact that Pastry tries to have each entry in the route table to refer to a node that is near the node trying to take advantage of proximity routing. When routing there are potentially several nodes in the routing table closer to the message's key in the ID space so Pastry tries to select among the set of possible next hops the one that is closest in the physical network or one that represents a good compromise between progress in the ID space and proximity.
Another interesting property is that a new node n will chose an initial contact node z closer. It can also be proved that the route table initialization also sustains the locality as every node periodically requests state from nodes in the neighborhood set in order to update its route table.
Also it is worth mentioning that if replicas of an object are stored on k nodes with adjacent nodeIDs, Pastry messages requesting the object have the tendency to first reach a node near the requesting node.
Though Pastry seems more adequate than Chord, the amount of traffic generated by adding a new node to the ring must be carefully reviewed as Edos has more than 4 millions potential clients.
KadoP
KadoP [1] is a peer to peer system, which manages semi-structural data. It can be used for the publication, indexation and query of XML data such as XML documents and Web service calls. It is based on two key peer to peer technologies: DHT and AXML.
The DHT represents an overlay network application which allows the storage and the retrieval of key-value pairs. Giving a key it routes the message to a peer charged with the storage of the key.
KadoP manages XML information such as XML documents, Web services and semantic information in the form of hierarchy of
concepts organized in hierarchy using the
isA and
partOf relationships. The XML data published may be annotated with semantic information. Fragments of XML documents (Ex: XML sub-trees, elements, words) can be defined as being
relatedTo a concept. The query engine can make use of the semantic information to evaluate a query, if the query specifies this.
This turns to be useful in the case different users annotate/and use differently the same piece of data, depending on the role they have in the system.
The KadoP query language is based on tree pattern queries, and it allows the retrieval of pieces of published XML documents, based on conditions referring to the structure and the content of the searched information. The nodes of a KadoP tree pattern query represent data items, and the edges represent containment relationships among the nodes. Each node may be annotated with:
- name conditions
- semantic conditions of the form "relatedTo C", where C denotes a concept
- textual conditions of the form "contains W", where W is a word. We distinguish a single return node of the query.
KadoP returns information at different granularities: instead of entire XML documents one can retrieve precise responses information such as XML elements (XML sub-trees). This has a positive impact on system's performance, as published XML documents tens to be big while the requested piece of information is a sub-tree.
As EDOS project needs to publish several gigabytes of data, it also needs to make available to users almost one hundred megabytes of metadata and especially a way to query it. One of the goals is also to transform the metadata provided by Mandriva, from an internal and proprietary format, to a more suited one for distributed querying.
Existent DHTs unfortunately cannot provide a simple way to publish and interrogate the available information on packages (metadata). As systems they have different goals, so they only offer a basic searching tool: based on a key they route the message to a peer. Also the storage offered by the DHT applications is not suitable for the indexation of data (text or XML). Usually DHT applications are designed to store large file for each key and have a primitive update management. KadoP proposes a new approach for sharing content in peer to peer networks and yet, as a system build on top of DHTs, benefits greatly from their simplicity while being capable of using efficient indexation techniques to publish, store, and especially query the available information.
The choice for using KadoP was also enforced by the fact that it would be more efficient to publish some part of metadata in the form of intensional documents as the available metadata on all packages is in the order of tens of megabytes of text. As a result, the requirements for our system, of it being capable of asking fine-granularity queries and to take into account the structure and the semantics, may be no longer sufficient, we also need a system prepared to exploit the intensional or extensional character of metadata.
2.3.3 File Sharing
File sharing applications are natural candidates for use in the EDOS distribution process.
This section looks at some file sharing systems. Our motivation is not because
they are F/OSS projects, but because they are - and can be - used as the
basis for a distribution architecture.
BitTorrent
BitTorrent allows users to download a file in a near peer-to-peer fashion. Users download different pieces of a file from different users. Thus, users download and upload simultaneously, and bandwidth is distributed between users. BitTorrent is used already by Mandriva Linux developers.
BitTorrent is a very popular tool in use today, and is responsible for the majority of P2P traffic. The main advantage is its simplicity and the heuristic it employs to achieve very fast download speeds. However, BitTorrent relies on a centralized server (called a "tracker") that keeps track of all downloading users. This tracker is a point of failure in the system. In EDOS, we want to use the efficient ideas in BitTorrent, but provide a decentralized solution instead of the tracker.
An interesting presentation of the resource consumption aspects of the system is presented in [4]. The system aims for Pareto efficiency (a system where resources are allocated in such a manner that no individual is better off or worse off), a higher level of resource utilization and robustness. The main problems that the system has to address are high churn rate, fairness, finding the best piece allocation strategy and ensuring steady up-rates. A specific problem is that users tend to kill their clients as soon as download completes (irrespective of on-going uploads). Peers use a tracker site to find each other and it stores a minimum of information. In general the algorithm used by the tracker is to generate a random list of peers since this is the most robust with respect to disconnection and segmentation, resulting from churn. A tracker also stores a hash of each piece so that its integrity on receipt can be verified. A seed (complete version of the file) must exist and be downloadable in totality from there. The piece that a peer chooses for download can follow a strict priority, rarest first (i.e., the piece is the least common among the set of peers), random order, etc. Choking is the explicit (temporary) refusal to upload and is required for good system performance (i.e., it can be used to prevent imbalance in rates between two users) ands is how Pareto efficiency is achieved.
Kazaa
Another system we investigated is file sharing via peer-to-peer networks. We choose Kazaa because it is a very well-known used system (even if not anymore the most popular) and because we found more documentation about this system than on others. It is clear that big differences exist between code distribution and file sharing, but we still find important to analyze more deeply P2P networks, and ideas can be reused in our project. We will not give a description here about how Kazaa works. Rather, we point out some particularities that can potentially be exploited in EDOS.
First we discovered that P2P is more and more used and that it consists today in the majority of Internet traffic. Then we learned that P2P downloads does not follow the traditional Zipf's law, used for Web traffic. The curve is much flatter, giving less importance to popular files than predicted by the Zipf's law. This difference is explained by the fact that the same Internet site is visited several times by the same user, while a file is usually downloaded only once by a particular user. In contrast to Web pages that evolve with the time, a shared file is always the same. And finally we learned also that Kazaa favors good peers. A peer that shares lots of files will obtain a better priority for its owns downloads. A good description of the system has been presented in [11].
2.3.4 Multicast
The use of P2P networks for multicasting, as opposed to using IP multicast, has been thoroughly studied and researched. The main point is using the bandwidth resources of the peers to ease the load on the source server (or servers). We have studied several multicast approaches and present two representative systems here.
Bullet'
Bullet' [13] is based on a multicast mesh (as opposed to a multicast tree). The updates are pushed from the source nodes to several peers - disjointedly (i.e. each peer gets a disjoint portion of the data, the ratio of which depends on its perceived bandwidth). The data is then disseminated from those peers to the other peers in the mesh using a pull mechanism. Peers are paired according to the "disjointedness" of their data sets and trade missing blocks. The source node and any other peer who completes the download stay in the mesh and help other peers by "seeding" (giving away blocks without getting any back).
Many trade-offs and considerations are configurable and are taken into account. Between them are peering strategy (which peers are paired), block request strategy (which missing blocks to request first), data encoding (should we encode using rateless codes, other codes, or any at all) and more.
Instead of a centralized tracker (as in BitTorrent), Bullet' uses RanSub [14], a protocol for delivering random subsets to all nodes. RanSub allows the nodes to know what parts of the file other nodes have. It is a distributed protocol that has no single point of failure.
The authors present good simulation results using the MACEDON framework.
Also in the article is a system called Shotgun, built on top of Bullet' with the purpose of being a fast synchronization tool (an extension to rsync). Using/improving this system might be an interesting approach for our own problem - today's mirrors indeed synchronize using rsync. The Shotgun system could prove to be a good "off-the-shelf" solution for synchronizing the mirrors and perhaps also the users themselves (treating them as n-th level mirrors).
Avalanche
The Avalanche system [9] uses a scheme of network coding to disseminate large files to multiple users using P2P methods.
The system groups several peers together in "neighborhoods" much like BitTorrent does - via a centralized tracker. Peers have a maximum number of neighbors they can connect to. Nodes can request additional (or replacement) neighbors from the tracker at any time.
The article also describes two possible incentive methods: in the first one neighbors who have uploaded blocks previously are being preferred when allocating outgoing bandwidth, and the second method is "tit-for-tat" like in BitTorrent. Later in the article it is shown that the "tit-for-tat" method damages the system's throughput and the dissemination is finished later. It is consistent with the Bullet article approach which states that there's a trade-off between a system being either fast or fair. No results are shown for the first method.
The main idea in the article, however, is not the dissemination system itself, but the novel use of network coding. Network coding is a coding scheme in which each peer can produce encoded blocks using what she has up to that point and with a high probability those blocks will be helpful to other peers.
The network coding employed here is as follows.
- The source node creates a block by multiplying each block with a random coefficient and computing the sum. Those coefficients are later sent along with the block.
- Each time the source node wants to send a block, it computes the new block with new random coefficients (and sends the coefficient vector along with the block).
- With high probability, when k such blocks are retrieved (k being the number of original blocks), one has k equations with k variables and can solve the equation set to compute the original blocks (the variables in this case).
- When a node which is not the source node wants to send a block to another node, it computes a whole new block by generating a new linear combination from the block he already has. If the sender node had received previously an equation which the receiver node didn't have, then the new block will also contain new information for the receiver node.
The advantage of network coding is that it's much easier for nodes to exchange information and they don't have to manage complex structures to know what blocks are missing. With high probability, k received blocks will suffice to rebuild the file. There is a need to negotiate a bit, though, since it is possible that nodes will not have new information for each other after all. The authors suggest that sending a linear combination of the coefficient vectors is a good heuristic.
The authors conduct experiments for unencoded blocks, source-encoded blocks and network coding. The network coding scheme proves to be faster than the source-encoding scheme by 20-30 percent and faster than the unencoded scheme by 2-3 times.
A disadvantage, however, of using network coding is the time spent solving the equations. When talking about large files, the entire file has to be scanned in order to produce an encoded block. Also, the complexity of solving the equation set is o(k^3) where k is the number of blocks. Clearly, a more efficient way is mandatory for network coding to be practical for large files such as in our scenario.
2.3.5 Publish/Subscribe Systems
A natural model for the code and binaries distribution problem is that of publish/subscribe systems. The distributions companies are the publishers, and the end-users are the subscribers. This model enable all users to get software of their interest on one hand, and decouples the communication between participants with no shared interests on the other. Pub/sub systems divide into two main categories, subject-based and content-based. While content-based systems allow for greater flexibility in expressing users' interests, their implementations are less efficient, scalable and robust than those of the subject-based systems.
Scribe
Scribe[20] is a large-scale event notification infrastructure for topic-based pub/sub applications, built on top of Pastry, an overlay DHT substrate. Scribe offers a simple API to its applications that includes the creation of topics, subscribing and unsubscribing to/from a topic and the ability to publish an event for a specific topic. A credentials mechanism is used throughout the API calls.
Scribe implementation:
- Each topic has a unique Id. The scribe node with the nodeId that is numerically closest to the topicId acts as a rendez-vous point for the associated topic. This node forms the root of the multicast tree created for the topic.
- The multicast tree for a each topic is constructed in a distributed fashion by reverse path forwarding. Pastry routing mechanism assures there is no circular path.
- The rendez-vous point controls access to the group, publish-wise (no spam etc.) - all multicast messages are sent to it (its IP is cached after discovered, and replaced in the case of failure or change).
Scribe employs several technics to ensure the reliability of the topology and the events dissemination. These include Pastry's TCP best effort model, liveliness checks by parent nodes, and replicas for handling failure of the rendez-vous point. The cost for recovery from failure is logarithmic in the number of the nodes. Scribe exposes an API for failure handling - in case added reliability is required.
A problematic "feature" of Scribe is that forwarders for a topic aren't neccesarily subscribed to it. However, as opposed to similar systems like Bayeux[26], the rendez-vous point in Scribe doesn't need to be contacted for every join/leave of a subscribed node, and peers store very little group membership information. This contributes much to the system's scalability.
Hyper
HYPER[24] is a content-based pub/sub system that strives to minimize both the matching and forwarding overhead within the pub/sub network and the delay experienced by the clients. Content-based pub/sub systems fall into two categories:
- Systems where clusters are formed and clients subscribe to all clusters that overlap (possibly partially) with their interest.
- Systems that achieve precise content delivery, at the price of higher state maintenance and processing cost. Content matching (event vs. filter) is a lengthy process that takes up to tens of milliseconds for a single node along the path.
The system model:
- Content space is a multi-dimensional space (each dimension is an attribute).
- An event is a point in the above space, and a subscription is a rectangle.
- The network is an overlay of pub/sub servers, organized a single-source tree.
- Subscription tables are maintained at each tree node.
The proposed pub/sub network identifies virtual groups according to common subscriptions. A virtual group is a unique, disjoint, sub-region of the content space shared by the same subset of subscribers. A leaf node can belong to multiple such groups.
- Content messages need to be matched only at the "entry" to the group.
- Each group induces a subtree embedded in the original tree and is used to bypass the intermediate nodes from the root to the leaves (after matching to the virtual group, we can disseminate the event straight to the leaf nodes).
- Content outside the definition scope of the above groups is handled by the default pub/sub network as "usual".
Identification of the virtual groups is done by finding intersections of the subscription rectangles (NP complete problem) using grid-based grouping as suggested by [17].
The paper suggests an optimization of the delivery tree of a virtual group by shortcutting to bypass intermediate servers having only a single child node. The shortcutting is done by early matching against the filters and is directed at sparse groups for which the saved matching operations are lower. This reduces delay for clients and minimizes load on "innocent" forwarding nodes, yet it is very much relevant to a very specific overlay network, i.e. a tree.
Homed
The motivation for HOMED[3] is to design an overlay network supporting the complexity of content-based publish/subscribe system on one hand, and have efficient and robust distributed behavior on the other hand. HOMED stands for Hypercube Overlay Mesh for Event Dissemination. The basic idea is to represent users according to a summary of their interests, and construct a mesh topology where nodes neighbor with nodes of similar interests. Content-based pub/sub systems usually rely on brokers nodes to disseminate events. These are points of failure, and assume more responsibilities than simple nodes (e.g. hold large routing tables). HOMED eliminates the need for those.
HOMED general design:
- Every node is given a unique Interest Digest (ID) according to the user's interests, which is the basis for the HOMED topology (Bloom-filters can be used here).
- ID cover tables are maintained at each peer. These determine the events under the responsbility of the peer.
- A published event is first routed to a node with an ID that matches the event ID (EID) and then multicast from that node with an event dissemination tree.
- The network "distance" could be used to construct an efficient dissemination tree.
An important property of the system is that nodes that don't match the interest of the event, aren't used in the routing (as opposed to Scribe for example). Also, complexity for the various actions is logarithmic in the number of nodes, plus an additive factor for the number of dimensions of the content space.
Reach[16] is another example of a content-based pub/sub system that has a lot in common with HOMED.
2.4 Distributed File Systems
When analyzing the distribution processes in EDOS, we need to look also at distributed storage systems. The rationale is that after the initial dissemination of the software packages, there will still be users who didn't get the files (e.g. because they weren't connected during the dissemination period). Those "latecomers" will need to be able to locate the files and download them after the swarm has ended. One possibility is to have them download the files from servers. However, keeping the P2P spirit in mind, we might think about using the resources of the peers to distributedly store the distribution archives. Following are three examples of systems that might possibly be used in such a solution.
2.4.1 Google File System (GFS)
GFS[8] is used for all of the data processing requirements of Google. As with mass storage systems, the requirements include performance, availability, reliability and scalability. It is also built from observations real usage. First, component failures are the norm and not the exception. Second, files are huge by traditional standards; multi Gigabytes are common and this influences block sizes. Third, files are modified nearly exclusively in append-only mode and this permits a relaxed consistency model. Fourth, the API is flexible to support further development; it supports record append and snapshot commands.
The architecture is composed of a master and several chunk servers. A chunk server stores file chunks (as local Linux files). The master maps file names and offsets to chunks, and stores all meta-data. Chunks are replicated on chunk servers.
Meta-data is optimized for recovery. For instance, chunk servers store information about chunks they have and the master queries these chunk servers when it boots. Only the operation log needs to be permanently stored; this keeps a log of changes to the meta data.
2.4.2 OceanStore
OceanStore[5] is a ubiquitous persistent storage system designed for having data accessible "anytime and anywhere". OceanStore employs dedicated, though untrusted, servers which may fail sporadically.
Data in OceanStore is stored in two layers: the primary replica is the more "trusted" version and is stored by a set of dedicated servers called the "inner ring" (the servers are chosen for each file). This replica is retrieved by getting erasure-coded blocks from the servers (who agree on the content with a byzantine agreement protocol) and rebuilding the file.
The secondary replica actually consists of cached copies of the file, published by peers who have retrieved and rebuilt the file previously.
OceanStore is built on top of the Tapestry[25] overlay network.
Some OceanStore properties:
- By design, files stored in OceanStore are "read only" and changes result in new versions. This can support both source packages and binary packages. New versions will be available but so will be the old ones.
- The more a file is read in OceanStore, the more available it becomes (replicated), while unread files can lose replicas. This way, the more popular a file is - the easier it will be to locate a copy.
- OceanStore is decentralized and uses Tapestry as the P2P overlay network. Other possibilities for the overlay network exist, such as Chord.
- OceanStore utilizes localization through Tapestry, bringing files closer to where many clients requested them.
- Byzantine agreement methods are employed and guarantee that all non-faulty replicas agree (as long as no more than about one third of the replicas are faulty).
Some points regarding the possible use of the OceanStore (or a similar) system in the EDOS project:
- OceanStore relies, for many operations on a file, on an "inner ring" - a set of servers responsible for the specific file in question.
- It's not certain that we can use the same model when no "servers" exist - only unreliable peers.
- We need to achieve the same operations, perhaps with a different architecture: getting a pointer to the latest version of a file, and storing persistently a file's "Primary replica" (the replica which is assured to be persistent).
2.4.3 PastFS
PAST [19] is a persistent storage system built on top of the Pastry overlay network [18], although it could also function on top of other ones (e.g. Tapestry[25], Chord[21] etc.) though naturally not all of its properties might hold.
PAST holds the following properties:
- The files held in PAST are read-only, like in OceanStore. Unlike OceanStore, however, no explicit versioning is supported. Consequently, operations related to versioning are also not explicitly supported (e.g. getting the "heartbeat" - a pointer to the latest version).
- When a node stores a file, it stores the whole file and not only chunks of it. This makes PAST more suitable to small file scenarios (as opposed to, say, storing ISO images). It might be possible to alter PAST to support breaking a file into chunks but it might be like building a whole new system.
- Each file is replicated k times according to the user's choice. The responsible node for a file is the node with a nodeID corresponding to to the file's hash. The replicas are placed at the k numerically closest nodes to the responsible node. Whenever a new or deleted node changes the contents of the "k closest nodes" set the contents are either copied to remedy the situation or a pointer is created from one node to another.
- PAST supports both "replica diversion" and "file diversion" - operations for load-balancing and making sure that capable nodes store the file's replicas.
- PAST also supports caching files, thus helping ease the load with popular files.
PAST's advantages are that it can be implemented on top of any fitting overlay network; it's simpler than the OceanStore persistent storage system; and (also opposed to OceanStore) it doesn't assume dedicated storage servers.
PAST's disadvantages are that it doesn't explicitly support versions and directories; it stores whole files - infeasible for large file scenarios (such as ours in EDOS); and the replica and file diversion mechanisms seem a bit naive when high churn is introduced. Therefore one might have to use dedicated servers after all.
2.5 Transactions
Transactions
We observed that a classical approach of a fully-fledged distributed database system does not fit the needs of the EDOS project. The various actors in EDOS require a certain degree of autonomy and the network consists of many transient nodes.
A distributed database, however, brings along a high degree of control and requires tight collaboration between actors, which is not desired in the present environment.
Instead of providing a separate distributed database architecture for WP4, we decided rather to integrate well-proven (distributed) database services/techniques into the P2P architecture and
to adopt an integrated architecture. We came to the conclusion that the proposed P2P architecture would benefit a lot if we integrate
transaction support for the file/package upload and download so that we can guarantee a certain degree of control and consistency for the file upload and download, i.e. relaxed ACID properties.
To start our discussion we provide a state of the art summary of the concept of transaction in database technology.
A transaction in database management systems is defined as a logical unit of database processing, i.e. it refers to one or more database access operations (i.e. insertion, deletion, modification or retrieval of data). One way of specifying transaction boundaries is by explicit
begin transaction and
end transaction statements. All operations between these two statements form the transaction. The database is in a stable and consistent state before the transaction starts, and its state is stable and consistent again when the transaction ends. If a transaction fails, it is aborted and a rollback is performed in order to rebuild the previous state of the database [27].
ACID
Transactions should have the following properties: Atomicity, Consistency, Isolation and Durability, also known as ACID properties.
- Atomicity: A transaction is an atomic unit of processing. It is either performed entirely or does not leave any effects at all.
- Consistency: A transaction is consistency preserving in that its full execution takes the database from one consistent state to another.
- Isolation: A transaction should appear as though it is being executed in isolation from other transactions. That is, the execution of a transaction should not interfere with any other transaction executing concurrently.
- Durability: The changes applied to the database by a committed transaction must persist in the database. These changes must not be lost whatever failure might occur.
However, there are some application areas (i.e. non-standard application areas such as workflow environments and CAD) where an ACID transaction is too strict. In a workflow environment for example, where several users work together, a transaction can take a very long time, e.g. if the interaction of different users is required and there are several waiting points within the process. Therefore, new concepts for non-standard transactions in different non-standard application areas have evolved, e.g. nested transactions or long transactions.
In the following, different non-standard transaction approaches are presented.
Long Transactions
Long transactions are transactions that take (much) longer time than normal transactions (from several minutes to several days, weeks or months). A long transaction takes too long to be allowed to hold locks that another transaction needs. Consequently, the ACID properties must partially be violated.
In [28] the
saga approach is introduced: A saga is a collection of actions, that together form a long-duration transaction. Formally, a saga consists of a the following components [36]:
- A collection of actions
- A graph whose nodes are either actions or special abort or complete nodes, and whose arcs link pairs of nodes. No arcs leave these two special nodes, which are called terminal nodes
- An indication of the node at which the action starts, called start node.
Compensating Transactions.
In a saga each action A has a compensating action, which is denote as A
-1.
If we execute A and later execute A
-1, then the resulting database state is the same as if neither A nor A
-1 had executed [36]. By the property of compensating transaction the effect of the saga is negated and the database state is the same as before the initial transaction A.
Nested Transactions
In contrast to flat transactions, nested transactions can be seen as a tree of transactions. The transaction at the root of the tree is called top-level transaction, the others are called sub transactions. Transactions at the leaf level are flat transactions, all the other transactions are nested [29]. Their basic structured is illustrated in Figure 1.
TA 1 is the top level transaction,
TA 11 is a sub transaction and
TA 12,
TA 111 and
TA 112 are (flat) leaf transactions .
Nested Transaction
Nested transactions can either be
open or
closed, which is defined by the lock keeping behavior of the top level transaction: In
closed nested transactions all locks acquired by sub transactions are kept until the top level transaction ends. In contrast to a closed transaction, locks acquired by sub transactions of an
open nested transaction are released at the end of the respective sub transaction. Consequently, open and closed nested transactions do not correspond to the ACID principle. For closed nested transactions, ACID properties are guaranteed for the top level transaction (TL-TA). For the sub transaction (sub-TA) consistency, atomicity and isolation can be guaranteed, whereas durability can not be guaranteed until the top level transaction is executed entirely and successfully. Otherwise, in case one of the sub transactions fails, all primarily committed sub transaction are rolled back. Their effect is consequently not durable until the top level transaction has committed successfully [29].
Closed nested transactions are extremely useful if modularization is an issue: they enable the decomposition of a complex operation into sub operations which are executed using sub transactions.
Distributed Transactions
Whereas the structure of a nested transaction is defined as a functional decomposition, the structure of a distributed transaction depends on the distribution of the data in the network.
A distributed transaction is typically a flat transaction that runs in a distributed environment. The transaction is executed at a number of different nodes (all distributed over the network) at which the transaction accesses the local database. The coordination of the transactions is more complicated then with a centralized approach since all the coordination mechanism must adapt to the distributed environment (e.g. distributed concurrency control etc.) [32].
Mobile Transactions
In contrast to traditional transactions mobile transactions are initiated by a mobile client. They have to deal with disconnection periods and cell handovers during the transaction execution.
Mobile transactions can be classified according to the capability of the client to connect to the server or other peers. Connected mode means that the client is not necessarily connected all the time, but a connection can be established, if required. With the disconnected mode the transaction is executed locally (on the mobile) on replicated data. Synchronization takes place as soon as the client connects next time [31].
Several different transaction models were proposed for both mobile transaction classes [34], including
Clustering,
HiCoMo,
Prewrite,
Semantics-based,
Pro-Motion,
Kangaroo Transaction and some more.
Transactions in P2P environments
In [35] a concept for grid transaction processing is presented. The authors concentrate on decentralized concurrency control for transactions in a grid environment in which no central control is required. Services (web services) are seen as transactions which are invoked by a node and are usually processed on (many) other nodes. These transactions consist of a number of service calls, where each service call can be processed on a different node.
Dependencies between transactions are managed by the transactions themselves. The protocol relies on a decentralized serialization graph, where each node and each transaction maintain a local serialization graph. The serialization graph of the peer reflects the dependencies of the transactions that invoked service calls on that peer whereas the serialization graph of the transaction includes the dependencies in which the transaction is involved.
In [30] the DSGT (Decentralized Serialization Graph) protocol is presented. DSGT ensures concurrency control and recovery for services (i.e. transactions) in P2P environments.
This work could build a basis for the transaction support planned for the EDOS distribution architecture. However, it doesn't address the consistency issue, which is very important within the EDOS environment.
In [33] an approach for agent-based transaction in a decentralized P2P network is presented. The focus is on a new form of cooperative information system based on a P2P model. The model
consists of a multi agent system with 4 components (wrapper, mediator, facilitator and planner) which are responsible for the management and control of transactions composed by data management
operations (read, write, delete) and their outcomes.
ACID properties are not addressed.
2.6 Security
Package and distribution security
The aim of this chapter is to describe the security state of the art in package delivery of opensource *nix distributions.
First chapter summarizes the main concepts of cryptography in order to understand the rest of the document.
Next chapter lists and discusses security aspects of package distribution in the major *nix distributions.
Following chapter explains the method used to mirror sites today.
Last chapter discusses the advantages of using P2P networks in order to distribute packages and distributions; in particular in paragraph 5.3.2 it will be proposed a model in order to integrate P2P with PGP (
Pretty Good Privacy).
Security Basics
All security mechanisms deployed today are based on either symmetric/secret key or asymmetric/public key cryptography, or sometimes a combination of the two. Here we will introduce the basic aspects of the secret key and public key techniques and compare their main characteristics.
Secret Key Techniques:
Secret key techniques are based on the fact that the sender and recipient share a secret, which is used for various cryptographic operations, such as encryption and decryption of messages and the creation and verification of message authentication data. This secret key must be exchanged in a separate out of bound procedure prior to the intended communication (using a PKI for example).
Public Key Techniques:
Public Key Techniques are based on the use of asymmetric key pairs. Usually each user is in possession of just one key pair. One of the pair is made publicly available, while the other is kept private. Because one is available there is no need for an out of band key exchange, however there is a need for an infrastructure to distribute the public key authentically. Because there is no need for pre-shared secrets prior to a communication, public key techniques are ideal for supporting security between previously unknown parties.
Asymmetric Key Pairs:
Unlike a front door key, which allows its holder to lock or unlock the door with equal facility, the public key used in cryptography is asymmetric. This means just the public key can encrypt a message with relative ease but decrypt it, if at all, with considerable difficulty.
Besides being one-way functions, cryptographic public keys are also trapdoor functions- the inverse can be computed easily if the private key is known.
PGP
PGP, or "Pretty Good Privacy", is a program that is intended to help make electronic mail more secure. It does this by using sophisticated techniques known as
public key encryption.
If you find yourself wondering what electronic mail and making information unreadable by spies has to do with RPM, you have a good point. However, although PGP claim to fame is the handling of e-mail in total privacy, it has some other tricks up its sleeve.
Encryption and digital signature
As we mentioned above, PGP uses public key encryption to do some of its magic. You might guess from the name that this type of encryption involves keys of some sort. But, as you might imagine, these are not keys that you can copy down at the local hardware store. They are
large numbers.
PGP uses two different types of keys: public and private. The public key, as its name suggests, can be shared with anyone. The private key, as its name suggests, should be kept secret. PGP creates keys in pairs - one private and one public. A key pair must remain a pair; if one is lost, the other by itself is useless. Why? Because the two keys have an interesting property that can be exploited in two ways:
- A message encrypted by a given public key can only be decrypted with the corresponding private key.
- A message encrypted by a given private key can only be decrypted with the corresponding public key. (Digital signature)
In the case of sending messages in total privacy, the key pairs are used in the first manner. It allows two people to exchange private messages without first exchanging any "secret codes". The only requirement is that each knows the other's public key.
However, for packages distribution, the
second method is the important one: in fact each user must be able to verify with the company public key that the file was "encrypted" with its relative private key.
The state of the art in package security
In the following sections it will be discussed the security criteria used in the major distributions today.
RPM
The RPM Package Manager [40] (RPM) is a powerful command line driven package management system capable of installing, uninstalling, verifying, querying, and updating computer software packages. RPM are used in distributions like Red Hat, Mandriva and Suse.
RPM security consists in a PGP/GPG signature inserted inside the
rc file of the RPM: before installing the package the RPM manager tries to verify the signature: if this operation fails (e.g. there is no public key to verify the file, or the package is corrupted) the installation process aborts.
However, signature process is not mandatory so there are a lot of RPM packages that are not signed.
These are some examples of command useful to manage the digital signature of RPM files:
DEB Package
Deb files denote Debian packages [39] and the main tool to manage them is
apt.
The current scheme for package signature checking using apt is:
- The Release file includes the md5sum of Packages.gz (which contains the md5sums of packages) and will be signed. The signature is one of a trusted source.
- This signed Release file is downloaded by 'apt-get update' and stored along with Packages.gz.
- When a package is going to be installed, it is first downloaded, and then the md5sum is generated.
- The signed Release file is checked (signature ok) and it extracts from it the md5sum for the Packages.gz file, the Packages.gz checksum is generated and (if ok) the md5sum of the downloaded package is extracted from it.
- If the md5sum from the downloaded package is the same as the one in the Packages.gz file the package will be installed otherwise the administrator will be alerted and the package will be left in cache (so the administrator can decide whether to install it or not). If the package is not in the Packages.gz and the administrator has configured the system to only install checked packages it will not be installed either.
By following the chain of MD5 sums apt is capable of verifying that a package originates from a specific release. This is less flexible than signing each package one by one, but can be combined with that scheme too.
Currently, this scheme is fully implemented in apt 0.6, which is available in the experimental distribution. These changes are based on the patch for apt, which provides this implementation.
Slackware Packages
The Slackware package management system is very simple: each package consists of a
tar gzipped file containing all the file to be installed plus some meta file containing package information and installation scripts. Package security is very similar to Debian system and consists in maintaining a file called
CHECKSUMS.md5 containing all the md5 hashes of every file in the distribution; in another file called
CHECKSUMS.md5.asc there is the PGP digital signature of the first file.
Currently there is no automatic verification process so the users have to manually check the signature using the gpg software (
gpg -verify).
FreeBSD Packages
FreeBSD distributes (like all the major Linux distributions) both precompiled package and source package: the former type is called "package" the latter "port" [38]
- The package contains pre-compiled copies of all the commands for the application, as well as any configuration files or documentation.
- A FreeBSD port for an application is a collection of files designed to automate the process of compiling an application from source code.
Natively FreeBSD doesn't support signature mechanism but only MD5 integrity checking: each port contains a file called distinfo containing the MD5 checksum; for example:
~~
MD5 (MailScanner-install-4.39.6-1.tar.gz) = efe4a1eec0b072febbb3d6037a65470d
SIZE (MailScanner-install-4.39.6-1.tar.gz) = 4854418
~~
The ports are also cheksummed and the hashes information are contained in a file called
+CONTENTS:
@comment MD5:57a0384ff515dc68439840a4ea9892f7
However FreeBSD provide a tool called
pkg_sign that is able to sign packages with GPG keys or X509 certificates but this method is not official.
portage (gentoo)
Portage is software packages for Gentoo Linux; they tried to emulate FreeBSD Ports system and they follow the same philosophy. Everything can be downloaded as source code and compiled right for the architecture of each system [37]. Now there is no security mechanism in portage distribution but Gentoo developers have decided to move to a new and more secure format.
As part of an overall effort to improve the security of Gentoo Linux, the Portage development team is starting to implement some new features in Portage which will allow for increased security in package management and distribution system. One of the first new features that users will notice is a digest of every file involved in the merge process, including
ebuilds, patches and source tar balls. In addition to offering increased security, these digests will help isolate and track down corrupt ebuilds or other files on our rsync and source mirrors.
The next step in the process will be signing these digest files with a GPG key to ensure non-repudiation. While there is still some discussion amongst the development team on the best way to achieve this, the current leading solution involves each developer signing ebuilds individually, and then one master Gentoo "uberkey" signing all of the developer keys to establish a Gentoo "web of trust". Developer keys will be made available through public
key servers, as well as on www.gentoo.org
The goal of what has come to be known as "Secure Portage" is to provide a robust package management system that offers end-to-end security in the emerge process. As yet, there is no confirmed timeline on when the entire system will become available, but the digesting portion is in testing now and the rest will soon follow.
Mirroring
Mirroring with rsync
The actual method used for syncing master distribution servers with mirrors around the world is done with various tools. The mechanism for a mirror is: connect to the master server and download new packages/files. The most used tool for mirroring is rsync.
Rsync [41] is a replacement for scp/rcp that has many more features. Rsync uses the "rsync algorithm" which provides a very fast method for bringing remote files into sync. It does this by sending just the differences in the files across the link, without