Paris, MandrakeSoft, 07 Oct 2004

EDOS Metrics and Code Distribution Discussion

The goal of our discussion at the October 7th meeting at MandrakeSoft was to start defining evaluation metrics for open source code distribution architectures. These metrics apply equally to the current code distribution set-up used by MandrakeSoft for their Mandrakelinux versions, and to any distribution architecture developed during the project.

It is clear that metrics can only be defined once the process - and associated problems - of code distribution is fully understood. This note contains a description of the current code distribution process of Mandrakelinux.

Introduction

The goal of code distribution is to transfer a copy of the Linux system software from the editor to end-user's machines, where the software can be installed and run.

There are a number of challenges in this process. In particular, one goal is to minimize the delay between the arrival of a new version and its availability to users. Another most important one is to ensure that the installed software is consistent, i.e., that the software on his machine all comes from the same version. A difficulty comes from the quantity and complexity of the software. The software is divided into modules, called packages, and new versions of each package regularly appear. The consistency requirement is violated whenever a software component in one package requires software that is not present, or not up to date.

As an example, imagine that an end-user installs version 1.2 of OpenOffice. The core program requires libraries that must also be installed. The installation becomes inconsistent if the libraries subsequently installed are not version 1.2. Consistency failure - the inability to locate the correct package versions - is currently the greatest complaint of Mandrakelinux users.

Two factors contribute to making consistency very difficult to achieve for open source code distribution: the size of the software system and the size of the end-user community. We consider each issue separately.

System Software

A complete version of the Linux system software is known as a distribution. To avoid confusion with the term code distribution, we will use the term system synonymously with distribution in the remainder of this note.

A system version is composed of packages. The package format used in Mandrakelinux is RPM (Red Hat Package Manager). RPM Package Manager (originally called "Red Hat Package Manager" is a package management system used for Linux distribution (originally from Red Hat, but now used by many other distributors as well). RPM is capable of installing, uninstalling, verifying, querying and updating packages. A package contains source files and/or object files, a label and package meta-data. The package label gives the software name, the version of the package and the system release to which the package belongs. The meta-data includes the package's creation date, an MD5-based signature used for verifying package integrity, and installation scripts (?). The package label is used for versioning control (?).

Each package includes the changes from the previous versions.

The package management system can evaluate meta-data to allow package searches, automatic upgrade to newer versions, checking that all dependencies of a package are fulfilled and/or fulfilling them automatically, etc.

Note that there is a distinction between system release and system version. A system version is produced after a set of changes have been integrated into the system by the Mandrakelinux editor, and then made available to the Linux community. These changes generally deal with bug fixes. A system release is a finalized version; it typically contains new software functionality compared to previous releases.

A package size can vary between several kiloBytes up to tens of MegaBytes. We can estimate an average of 0.5 MegaByte per package. A complete Mandrakelinux version or release is composed of between 8000 and 9000 object code packages (binaries) and 4000 source code packages. So this yields around 12 GigaBytes in total. For example, the last Mandrakelinux release (Community 10.1) contains a total of 9280 object code packages. The binaries and source code packages are more or less the same in total size.

The release cycle for Mandrakelinux is around 6 months, and the effort required is nearly 30 man-years. A key objective for the Mandrakelinux editor is to reduce this effort. While this is one of the overall goals of the EDOS project, the code distribution issue is particularly concerned with version release cycles, since this is where much effort is spent, and lost, between releases.

As said, new versions are produced mainly to account for bug fixes. A given version is estimated to have around 3000 bugs. These bugs are partly discovered during testing within Mandrakelinux, after integration of source code components found on open source code forums such as freshmeat and sourceforge. The remaining bug reports come directly from the Linux user community, who are actually quite active in this respect: of the 12000 registered Mandrakelinux users, 1000 of them have reported bugs during the last 6 months. The end-user involvement is important since many bugs are specific to the end-user's installation, e.g., program X fails when run with driver Y for device Z. There are too many environment parameters for MandrakeSoft to be able to test each of them.

Since it is important to get bug fixes to end-users as quickly as possible, new system versions are regularly produced. The system version, termed development distribution in Mandrakelinux parlance, changes every day. Further, daily version changes can involve up to 100 packages, and 100 MegaBytes of data. Given the regularity of versions, and the fact that a complete build of 8000 packages takes a few days, updates can only be delivered to the user community in incremental fashion. This is the issue of downloading.

Downloading to End-Users

The Mandrakelinux editor places each new version of the system on its main server, which we term the database in the remainder of this document. The editor takes no further part in the code distribution process after that.

The central database has several components:

A version of the system for each supported architecture. The processor architectures currently supported include Intel 32 and 64 bit processors, the Itanium, Sparc and Alpha. The Intel architectures are by far the most popular among the end-user community. A dependency graph for each architectural version. This captures dependency relations between packages so that a user who installs a package knows what other packages need to be installed in order to keep his system consistent. A version can be downloaded in binary form, or with the source code. Currently, 99% of Linux end-users only download binaries. As said, the publishing of a version (update) on the database is the limit of the Mandrakelinux editor's involvement in the code distribution process. The community is responsible for getting the version to end-users. The process follows a hierarchical, pull model. The reference version is copied to 1st level site, that are copied to 2nd level, etc. For availability reasons, end-users are discouraged from downloading directly from the central database and the reference version. These intermediaries are necessary to improve the performance of the entire system with parallelism. The consistence between the copies is then a main issue.

The intermediaries involved in the current Mandrakelinux code distribution process are a set of around 100 mirror servers. 5 of these servers are primary servers; these download and contain a copy of all data on the central database. A further set of secondary servers download versions from the primary servers. The set of mirror servers actually forms a tree structure, with each server obtaining a copy of a system version from the mirror that is its parent node in the tree. The Mandrakelinux code distribution tree has 5 levels. End-users generally download their versions from the leaf node mirror servers (?), and rarely from the primary mirrors. Only the primary mirrors are likely to have versions for all processor architectures.

It is important to keep in mind that the mirrors are independent systems. There are in no way formally responsible to the Mandrakelinux editor. There have no obligation to ensure that the code distribution tree is consistent. In fact, the choice of a site as a mirror may happen in a random way. For instance, a university may decide to store a version of Linux on its server for its students to download. The Mandrakelinux editor might then encourage end-users, or other mirrors, to download from this site.

The mirroring process, i.e., the download from one server to the next in the tree, is a variable process. The time taken to mirror a version from one server to another can take anywhere between 1 hour and one day. This can depend on the infrastructure available at the mirror site. The frequency of update is also variable; some mirrors choose to make an update once every day, even though updates of the central database occur on average every 20 minutes. This means that mirrors may have versions that are not up to date. Only the central database, which end-users do not download from, is guaranteed to have the latest version. Also, while a mirror is copying from a source, the source may decide to switch to a new version. This results in inconsistencies or may lead the mirror to abort the current copying and start a new one, a main cause of frustration.

Each mirror site uses its own ftp scripts to download versions, rather than use tools provided by the Mandrakelinux editor. This is in contrast to end-users who use the Linux urpmi program to test for and install updates. This program tool is configured to contact a specific mirror server. (In Debian, apt-get is used instead of urpmi.)

As a consequence of this organization, the copying process from the central database to all mirror sites is disorganized, error prone, and the source of all consistency errors.

We next provide a first very rough proposition of metrics for evaluating version distribution.

_Question:_ is this important to Mandrakelinux ? There is a problem of a lack od consistency in package names when packages from different distributions are used together. This makes automatic dependency handling difficult. In fact, this is a problem in coordination amongst major distributors who use RPM in packaging such as Red Hat and Mandrakelinux. However, when packages from a particular distribution are used only, then automatic dependency checking can work.

Towards Metrics

The process being evaluated is code distribution - the copy of versions from editor to end-user, and not particularly the mirror server performance. In the same way, it is not required that an architecture developed in the project employs the current mirror servers, or use them in the same manner.

Consistency: The user's urpmi program searches for updates at some server; the user configures the frequency of update. Ultimately, it is more important for a user to have a consistent version, than the latest version. The lack of consistency manifests itself by the failure to compile or run Linux programs. An obvious metric for code distribution is thus the number of consistency errors that arise in a fixed period. Other measures of consistency may of course be considered such as the number of missing or incoherent files; weight on file importance or edit distance between the available file and the correct one may be considered. Note that one can consider here also "weak" forms of consistencies; e.g., some mirrors accept to have various packages coming from different versions but insist in having consistency internally within a package. 100% coherence would be obtained if a distribution at the certain mirror would be perfectly equal to the corresponding copy of the database (i.e. to the mirror 0 at some earlier time.

Freshness: A further metric is the average time it takes for a version fix to arrive at an end-user from the time that it appears on the central database. Currently, this can take up to 48 hours. Slow code distribution along with consistency problems clutters the process of bug reporting by end-users. If this can be improved, then the number of bug reports can be increased, leading perhaps to better versions and fewer fixes. So, we need measures such as the average time between the current version on a mirror and the reference site. 100% freshness is achieved if the mirror has exactly the current copy of mirror 0.

Availability: One might want to measure the availability of data as well as the time it takes to obtain it when a coherent version is missing.

Degradation: It may also be interesting to measure the degradation introduced by each level of mirroring.

Mirroring costs: These first measures deal with the quality of the mirrors from a user viewpoint. We see them as the most fundamental ones. We next move to metrics on the cost of administering the mirrors:

One metric for code distribution is the space required on sites that store versions, or parts of the version. Space is typically a concern for many mirror servers.

2 Network traffic to these sites is a related measure. 3 Process resources. We typically want to penalize here solutions where mirror sites waste resources by having to abandon a copy because the site they copying from moved to a new version. 4 Administrator time: this is difficult to measure. On one hand, providing them with easy to use tools would simplify their work. But this is a case where they would loose some independence. So, it is unclear if this would be seen from them as positive.

A first attempt at attacking the problem

_Improve the actual system_. The idea is to use techniques from distributed databases to improve based on the same architecture, namely a hierarchy of mirrors. Some of the techniques that may prove useful:

More sophisticate diff techniques (computing differences between packages or files and exchanging the description of differences instead of new versions - typically based on "minimum edit script", i.e., the minimum script to go from one to another).

2 Combination of pull and push (as provided in AXML). 3 Use of more than one version on mirroring sites 4 More controle of the start time of versions.

_Considering a P2P architecture_. In P2P, users create local mirrors, i.e. locally coherent subsections of other mirrors. The system becomes more distributed. Some of the issues that would have to be tackled:

The P2P should manage versions which is non standard for current systems.

2 Granularity? Should this be at the file level or the package level. Where are the databases stored? 3 How is a new version "flooded" in the system? 4 How to deal with providers that are blocking P2P systems? 5 Impact of network geography and proximity of mirors. 6 Replication in the P2P. 7 How are the pieces located? Randomly as in standard P2P? Based on the peers desires?

Let's get started

The following actions have to be continued:

State of the art on code distribution

2 State of the art of similar performance measure 3 Discuss, criticize, complement and precise the measures above 4 Detail the two candidate architectures. All hierarchy; all P2P; one hierarchy with a single P2P; one hierarchy with many P2P, others. 5 Understand and analyze the current distribution by Mandrakelinux - see BitTorrent (some P2P software that is used)

BitTorrent

BitTorrent allows users to download a file from multiple different people. Instead of everyone nailing one server, users get the file from other users. BitTorrent software is currently used by Mandrakelinux users to download the latest versions of distributions. It works only at users' level, not also at the level of mirroring servers. The software is capable of swarming downloads across unreliable networks and it adresses two major problems of data distributions: scalability issues in serving large files and bandwidth costs rised by serving a large number of users. The solution proposed by BitTorrent is to exploit the unutilized upload capacity of users and to connect them into a P2P network architecture. This way, users' contribution grows at the same rate as their demand.

The system aims for Pareto efficiency, a higher level of resource utilization and robustness. The main problems that the system has to address are high churn rate, fairness, finding the best piece allocation strategy and ensuring steady up-rates. A specific problem is that users tend to kill their clients as soon as download completes (irrespective of on-going uploads). Peers use a tracker site to find each other and it stores a minimum of information. In general the algorithm used by the tracker is to generate a random list of peers since this is the most robust with respect to disconnection and segmentation, resulting from churn. A tracker also stores a hash of each piece so that its integrity on receipt can be verified. A seed (complete version of the file) must exist and be downloadable in totality from there. The piece that a peer chooses for download can follow a strict priority, rarest first (i.e., the piece is the least common among the set of peers), random order, etc. Choking is the explicit (temporary) refusal to upload and is required for good system performance (i.e., it can be used to prevent imbalance in rates between two users) ands is how Pareto efficiency is achieved.

{metadata}

Topics Wp2 Wp4 Wp5

{metadata}

Version 1.6 last modified by StephaneLauriere on 10/11/2005 at 17:14

Comments 0

No comments for this document

Attachments 0

No attachments for this document

Creator: BorisVrdoljak on 2005/05/12 21:52
Copyright EDOS Consortium
1.1.1