Current Analysis (Tools and results)
Goals
The goals of this analysis are manifold:
- Build a comprehensive model to assess the actual complexity of the relationships present in the most widespread current package-based distributions.
- Analyse this model in order to find non-trivial inconsistencies and problems that arise when maintaining and updating a package base for a given distribution.
- Gain a better understanding of which attributes are actually needed to specify the relationships among packages in a package-based distribution. This analysis will lead to a new WP2MetaData model for package description.
Current setting
We are concentrating on two of the major package-based distributions currently available in the Linux world, Debian and Mandriva, which use the two most widespread package formats: DEB and RPM. To this end we have built a tool-chain made up of the following components:
- A set of parsers that read RPM and Debian control files and produce a neutral XML format containing the relevant package relationships.
- A general library (EDOSLib) that provides an API for handling a graph-based model of a package repository, together with methods and algorithms for performing (at present) a high-level analysis.
- A utility that extracts from the graph-based model a set of Constraint Logic Programming (CLP) problems. These are used to perform an in-depth analysis that takes into account the full complexity of the relationships between package versions.
- A parametric CLP program that models the CLP problem with respect to the package dependency relationships.
- The Mozart/Oz programming system, which is used to perform the actual analysis.
- Some side utilities for visualizing and exploring the graph-based model and its complexity.
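To give a feel for the kind of problem that is handed to the CLP engine, here is a toy sketch in Python. It is not the Oz program used in the tool-chain: it replaces constraint solving with plain enumeration of package subsets, which is only feasible for a handful of packages. The repository contents, the package names, and the helpers `consistent` and `installable` are all invented for illustration, and the rule that two versions of the same package cannot be installed together is an assumption of the sketch.

```python
from itertools import combinations

# Hypothetical toy repository: (name, version) -> relationships.
# "depends" is a list of groups of alternatives; at least one member
# of every group must be installed alongside the package.
# "conflicts" lists packages that must not be installed alongside it.
REPO = {
    ("app", 1): {"depends": [[("lib", 1), ("lib", 2)]], "conflicts": []},
    ("lib", 1): {"depends": [], "conflicts": []},
    ("lib", 2): {"depends": [], "conflicts": [("app", 1)]},
    # ("lib", 3) does not exist in REPO, so this dependency can never be met.
    ("tool", 1): {"depends": [[("lib", 3)]], "conflicts": []},
}


def consistent(selection):
    """Return True if the selected set of packages can coexist."""
    names = [name for name, _version in selection]
    if len(names) != len(set(names)):
        return False  # assumption: one version per package name at most
    for pkg in selection:
        rel = REPO[pkg]
        for group in rel["depends"]:
            if not any(alt in selection for alt in group):
                return False  # no alternative of this dependency is present
        if any(other in selection for other in rel["conflicts"]):
            return False  # a declared conflict is present
    return True


def installable(target):
    """Brute-force search for a consistent set of packages containing target."""
    others = [pkg for pkg in REPO if pkg != target]
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            if consistent(frozenset(extra) | {target}):
                return True
    return False
```

With this toy data, `installable(("app", 1))` succeeds by choosing `("lib", 1)`, while `installable(("tool", 1))` fails because its only dependency is missing from the repository. The number of subsets grows exponentially with the repository size, which is one reason a constraint solver is a better fit than enumeration for a real package base.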
A taxonomy of package bases
We distinguish between three different types of package base:
- Pool: A pool contains a set of packages in which each package may have more than one version, each with different requirements. Packages in the pool may conflict with one another (e.g. two different versions of the same package type). The pool is used as the "package universe" from which a distribution can be extracted. Since many distributions may depend on the same pool (e.g. in Debian), maintaining certain properties of the pool is crucial in order not to break the distributions built on top of it.
- Distribution: A distribution is a set of packages in which there is exactly one version of each package type. There can still be conflicts among packages in the distribution (e.g. two different and incompatible web servers).
- Installation: An installation is a subset of the packages of a distribution that is currently installed on a running system. An installation must not be inconsistent, but it can become so when we try to modify it using the usual install/remove/upgrade operations.
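The three definitions above can be sketched as simple set checks. This is a minimal illustration and not part of EDOSLib: packages are modelled as hypothetical `(name, version)` pairs, conflicts as explicit pairs of packages, and dependencies are left out for brevity, so `is_installation` only checks the subset and conflict conditions. All names and data below are assumptions made for the example.

```python
from collections import Counter


def is_distribution(packages):
    """A distribution holds exactly one version of each package name."""
    counts = Counter(name for name, _version in packages)
    return all(count == 1 for count in counts.values())


def is_installation(installed, distribution, conflicts):
    """An installation is a conflict-free subset of a distribution.

    Dependencies are deliberately ignored in this sketch.
    """
    installed = set(installed)
    if not installed <= set(distribution):
        return False  # contains a package that is not in the distribution
    return not any(a in installed and b in installed for a, b in conflicts)


# Hypothetical data: a pool with two versions of one package.
pool = {("webserver-a", 1), ("webserver-a", 2), ("webserver-b", 1), ("libfoo", 3)}
# One version per name has been picked from the pool.
distribution = {("webserver-a", 2), ("webserver-b", 1), ("libfoo", 3)}
# The two web servers cannot be installed together.
conflicts = [(("webserver-a", 2), ("webserver-b", 1))]
```

Here `is_distribution(pool)` is false because `webserver-a` appears in two versions, while `is_distribution(distribution)` is true even though the distribution still contains two conflicting web servers. Only an installation has to avoid the conflict, for instance by selecting `webserver-a` and `libfoo` but not `webserver-b`.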
The EGraph Data format
The tool-chain uses an XML data format which is, at present, an extension of the GraphML file format. GraphML provides a suitable way of representing graph-based information and fits our requirements well. This format may evolve into a more general data representation format for the WP2MetaData. The complete description of the EGraph format is available.
Current analysis
We are processing the Debian pool snapshot taken on 7 June 2005. The complete package base has 19177 unique package names and 29006 actual packages. For each package we have extracted the subgraph rooted at that package in order to reduce the size and complexity of the problem, since many packages are in fact not involved in any relationship with a given package. For each package type, we are trying to answer the following question: "Given all the package versions and their relationships, does there exist a way to install this package?" The results of this analysis will eventually reveal packages which, no matter how they are put in a distribution, will never be installable.
Version numbers and normalization
The CLP engine cannot interpret the rich version specification formats used by DEB and RPM, so a mapping between package versions and integers is needed. For each available package type, we collect all the available versions, order them using a standard comparison algorithm for version numbers, and map each version to its position in the sorted order.
Statistics
A spreadsheet with all the statistics about the Debian pool is available at http://www.pps.jussieu.fr/~fabio/debian-pool-statistics.ods (OpenOffice 2.0beta SCalc).
Other references
The current API provided by EDOSLib can be read at http://www.pps.jussieu.fr/~fabio/EDOSLib/javadoc
The EGraph file for the Debian pool can be downloaded from http://www.pps.jussieu.fr/~fabio/Debian-Packages-Pool.egraph.gz
The collection of Mozart/Oz programs (each of which has more detailed statistics in an opening comment) can be downloaded from http://www.pps.jussieu.fr/~fabio/data.oz.tar.bz2 (WARNING: unpacking this archive will generate almost 20000 files, which use more than 300 MB of disk space!)
Screenshots
Some screenshots of the graph generated from the Debian pool.
Version 1.13 last modified by MarcLijour on 05/04/2006 at 11:17