On the Distribution of Open Source Software

WP4 Team

Foreword

This paper summarises on-going work in Work-Package 4 of the EDOS project. The goal of this work-package is to investigate scalable and secure solutions for improving the process of distributing open-source data (source code, binaries, documentation and meta-data) to open source users and developers.

A key challenge in the code distribution process is the ability to correctly transfer a large code base to a very large number of people. In the case of Mandrivalinux, for instance, this entails copying a code base of 20 gigabytes to a community of up to 4 million users (i.e., the number of installed versions of Mandrakelinux). Given that the community continues to grow, solutions to code distribution must be scalable.

Currently, F/OSS projects generally employ a set of mirror servers for code distribution: data is placed on a master project server, and from there is (asynchronously) downloaded to an initial set of mirror servers, and from these servers to others. This process is quite slow, taking 48 hours to copy from a master server to all mirror servers in the case of Mandrivalinux. This latency can lead to inconsistencies on the user and developer side when project modules are installed. This in turn can create awkward dependencies at the module level in subsequent releases.

The aim of this work-package is to design and evaluate two alternative distribution architectures that address the issues of latency and consistency.

Approach

All major Free and Open Source (F/OSS) projects rely on mirror servers to ensure the distribution of content to end-users. The mirror server approach is a true "community" solution since the servers are independently administered, and a successful distribution process depends on the active participation of server administrators. On the other hand, the efficiency of the mirror server solution degrades as the number of downloads increases, as the size of the code base increases, and as the frequency of releases increases. The mirror server approach is thus a hindrance to the progression of F/OSS, and alternative solutions need to be investigated.

The weakness of the mirror server solution is both technical and managerial. On the one hand the community is dependent on the reliable operation of a handful of servers; on the other hand, the community relies on the active and continued participation of a handful of administrators. The search for a decentralised solution led us to consider a peer-to-peer architecture.

A peer-to-peer (P2P) system is a community of computing users who share resources. The community exists because the more resources that members contribute to the community, the more resources they can gain access to. P2P is now a popular phenomenon and seems well adapted to the F/OSS community: software is a resource packaged in files that can be shared just as any other resource. In particular, a P2P system avoids the technical and managerial bottleneck of the mirror server solution. Our P2P solution will also consider transactional and security support.

Yet before experimenting with the peer-to-peer technical solution, it occurred to us that we needed to address a more fundamental issue related to the efficiency of data distribution: the nature of the distributed content. Maximal efficiency entails that each client gets all of, and only, the content he requires; there is no superfluous transfer of data. This has two major implications:

  • Content must be correctly and extensively classified. For instance, different F/OSS users are interested in locating different classes of content at any time, e.g., patches for some module, binaries for an application, test suites, documentation, bug reports, etc. The mirror server solution is inefficient in this respect since clients seeking different content needlessly compete for server cycles and network bandwidth.
  • Downloaded content is only valuable to the client if it contains pertinent and accurate meta-data. The EDOS project is already working on the problem that modules can contain inaccurate and out-of-date meta-data regarding dependency information (WP2). Other important meta-data include test results and bug reports (WP3) and licenses. The importance of meta-data is that it defines the search criteria for the client in the distribution architecture and enables the client to determine precisely which content modules to acquire from the architecture.

The question of meta-data has led us to work in parallel on the definition of content meta-data. Though originally started in the context of the distribution process, this work has been evolving into a cross work-package activity since it links issues such as testing and dependency management with distribution. This has led us to start defining a Project Management Interface whose role is to act as a dashboard for a project: it enables content classification and the definition of community roles, and generally allows the status of a project and its developments to be queried.

Our separation of concerns regarding meta-data and distribution explains the organisation of our work and of this document. Section 2 gives more detail on the Project Management Interface and the structure of metadata. Section 3 describes the current state of work on the P2P-based distribution architecture. Work on the alternative solution, distributed databases, is overviewed in Section 4. We comment on security issues in Section 5 and conclude in Section 6 with a description of our work for the coming 12 months.

A Project Management Interface

While each EDOS workpackage focuses on a specific topic, cross-issues exist and information needs to be shared in order to efficiently address each workpackage task. Content distribution cannot be addressed without also considering dependencies as well as content testing and bug reporting. All of these topics are inter-related, so each influences the way the others should be tackled.

The Project Management Interface (PMI) has thus become a cross-workpackage activity. On the one hand, the PMI provides guidelines to which the workpackages can refer in order to interface with other topics and to provide them with the information they need. On the other hand, it provides guarantees on the available information on which workpackages can rely in order to achieve their tasks.

Further, the role of the PMI is to provide a uniform way to handle the F/OSS content resources that will be distributed through the EDOS Distribution API, but also F/OSS community resources that are related to this content. Again, creating, testing and debugging of content resources are closely coupled to community resources. Knowing which community members have worked on a unit of content, and their degree of experience, is mandatory to improving the F/OSS Process.

The PMI aims at providing a kernel to streamline and improve the whole F/OSS process, taking into consideration the different aspects of F/OSS.

Attributes

The core of the PMI is built around the concept of attributes. Attributes are metadata that are used to qualify, classify and thus locate entities, be they actors or content units. Further, attributes are meant to associate content with community resources.

The PMI formalizes the semantics of attributes, the way they can be compared and combined and defines the following useful properties:

  • Attributes are substitutable. Substitutability models the fact that entities can be related in useful ways. For instance, a sub-type can be used in place of a type; a license can be used in place of compatible licenses, etc. Thus substitutability is more expressive than equality. An attribute matches another one, if the second one can be substituted with the first one.
  • Attributes can be used to build Boolean expressions. Attributes of the same type can be related using and, or and not operators.
  • Attributes can be grouped as directories. Directories are groups of attributes of different types. They can be used to express a template (for looking up a unit matching the set of provided attributes), a state (current configuration) or a signature of a F/OSS resource (set of attributes associated with a unit).
  • Attributes can be matched. In the matching process, attributes, directories and expressions are compared.
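
The properties listed above can be illustrated with a minimal Python sketch. It is not the PMI formalisation itself: the names Attribute, SUBSTITUTES, directory_matches and the expression helpers, as well as the sample attribute values, are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical substitution table: (type, value) -> values it may stand in for.
# The entries are sample data only.
SUBSTITUTES = {
    ("license", "license-A"): {"license-B"},
    ("type", "binary-i586"): {"binary"},
}


@dataclass(frozen=True)
class Attribute:
    type: str
    value: str

    def matches(self, other: "Attribute") -> bool:
        """True if self can be used in place of `other` (substitutability)."""
        if self.type != other.type:
            return False
        if self.value == other.value:  # equality is a special case
            return True
        return other.value in SUBSTITUTES.get((self.type, self.value), set())


def directory_matches(signature, template) -> bool:
    """A signature matches a template if every template attribute is
    matched by at least one attribute of the signature."""
    return all(any(a.matches(t) for a in signature) for t in template)


# Boolean expressions over attributes, evaluated against a signature.
def has(attr):
    return lambda signature: any(a.matches(attr) for a in signature)


def all_of(*exprs):
    return lambda signature: all(e(signature) for e in exprs)


def any_of(*exprs):
    return lambda signature: any(e(signature) for e in exprs)


def negate(expr):
    return lambda signature: not expr(signature)
```

In this sketch substitutability is directional: an attribute of value binary-i586 matches a request for binary, but not the other way round, which is what makes it more expressive than equality.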

Content resource handling

For now, the PMI concentrates on handling content resources. It formalizes a set of functions and related integrity rules enabling the creation of new units, their retrieval, and installation. The PMI defines the state of a content unit, by specifying mandatory attributes every content unit has to be able to provide. Here is an overview of such required attributes:

  • A localisation that indicates the name and/or location of a unit.
  • A set of functionality attributes describing functional (type) information.
  • An attribute corresponding to the license used by the unit.
  • A version tag.
  • Backward compatibility attributes.
  • A description attribute.
  • A set of attributes defining the types the content belongs to. In the case of a Linux distribution they can be source, binary, documentation, application or utility.
  • Requirement and conflict dependencies in terms of localisation attributes or functionality attributes.
  • A set of localisation attributes defining the members of the unit if the latter is a collection of other units.

The specification of these attribute sets and the definition of mandatory attributes at unit creation time allow reasoning on issues related to specific topics such as unit look-up, installation, versioning or patching. We describe some of these topics below.

Look-up

Created units are available to the community. They can be retrieved through a look-up process by specifying a directory of attributes the looked-up unit has to match. As a result the look-up function returns a set of matching units. The PMI formalizes this process.

There are multiple ways to look up units. The following examples give a short and non-exhaustive overview of the different types of look-up that can be achieved. Note that these examples can be combined at will.

  • Looking up a unit by providing its localisation attribute. This example reflects the way packages are currently retrieved, by providing their name or UID.
  • Looking up units based on a choice of licenses. This is done by specifying a Boolean expression of license attributes.
  • Looking up units by specifying the set of functionalities they have to provide or the set of functionalities they cannot provide. This can be done to express so-called virtual packages.
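
The three kinds of look-up above, and their combination, might be sketched as follows in Python. The Unit class, the criterion helpers and the repository contents are hypothetical; the PMI itself only specifies that a directory of attributes is matched and a set of units returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str                           # localisation attribute
    license: str                        # license attribute
    provides: frozenset = frozenset()   # functionality attributes


def look_up(units, *criteria):
    """Return the units that satisfy every criterion (criteria combine freely)."""
    return [u for u in units if all(c(u) for c in criteria)]


def by_name(name):
    return lambda u: u.name == name


def by_license(*allowed):
    # Boolean "or" over license attributes.
    return lambda u: u.license in allowed


def by_functionality(required=(), forbidden=()):
    # Units that provide all `required` and none of the `forbidden` attributes.
    return lambda u: (set(required) <= u.provides
                      and not (set(forbidden) & u.provides))


# Sample data; names and licenses are placeholders.
REPOSITORY = [
    Unit("mta-one", "license-X", frozenset({"mail-transfer-agent"})),
    Unit("mta-two", "license-Y", frozenset({"mail-transfer-agent"})),
    Unit("editor-one", "license-Z", frozenset({"editor"})),
]
```

A request such as look_up(REPOSITORY, by_functionality(required=["mail-transfer-agent"])) plays the role of a virtual package: it returns every unit offering the functionality, whatever its name.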

Versioning and Patching

The PMI clearly separates content patching from content versioning and highlights the main differences between these two operations. While patching focuses on the minor correction of bugs, without changing the overall behaviour of the unit (i.e. without changing the attributes defining the state of the unit), versioning modifies the attributes associated with the unit. Both patching and versioning keep a history of modifications. These two operations are semantically different. Thus, such a separation clarifies the different responsibilities and situations that may occur during the life cycle of a unit.
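
As a rough illustration of this separation, the following Python sketch models a unit whose patch operation leaves the state attributes untouched while its versioning operation may replace them. The class and method names are assumptions, not part of the PMI.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    version: str
    attributes: frozenset   # attributes defining the state of the unit
    content: bytes
    history: list = field(default_factory=list)

    def patch(self, new_content: bytes, note: str) -> None:
        """Minor correction: content changes, state attributes and version do not."""
        self.history.append(("patch", self.version, note))
        self.content = new_content

    def new_version(self, version: str, attributes: frozenset,
                    content: bytes, note: str) -> None:
        """Versioning: the attributes associated with the unit may change."""
        self.history.append(("version", self.version, note))
        self.version = version
        self.attributes = attributes
        self.content = content
```

Both operations append to the same history list, so the life cycle of a unit can be replayed while still distinguishing the two kinds of change.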

Dependencies

The way the PMI offers to describe unit dependencies follows the recommendations of the EDOS deliverable 2.1. The PMI specifies the integrity rules that have to be asserted at unit creation time as well as at installation time.

The attribute-based approach to describing functionality adds expressive power to dependency description. Attributes and attribute expressions can be used to express generic properties of units. The PMI then allows units to be matched to such sets of attributes. This is particularly useful to replace so-called virtual packages, as the latter are supposed to share a common set of properties. Further, as attributes define the properties of a unit, and as attributes are substitutable, dependencies built on attributes ensure that as long as a unit offers the same attributes an older version was offering, the newer version can substitute the older one, provided there is no dependency conflict with the local configuration.

As for patching and versioning, the PMI separates semantically different information to describe a unit's properties, and thus its dependencies. Indeed, different attribute sets are provided to express different information. The elements of these sets are then separated depending on their usage context. For instance, the functionality provided by a unit is clearly separated from the functionality the unit requires. An algebra for handling these attributes is provided and the substitutability relation for each set of attributes is defined. Thus the PMI acts as a safeguard ensuring a correct definition of dependencies.

Finally, the PMI is able to express dependencies using Boolean expressions of functionality attributes as well as Boolean expressions of localisation attributes. As localisation attributes can be package names, backward compatibility with existing tools is ensured.
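
The attribute-based view of dependencies can be sketched with plain set operations. This is a simplified Python illustration under the assumption that provided, required and conflicting functionality are each modelled as a set of attribute names; the function names are hypothetical and the actual PMI algebra is richer.

```python
def can_substitute(newer_provides, older_provides):
    """A newer version can replace an older one if it still offers
    every attribute the older version offered."""
    return set(older_provides) <= set(newer_provides)


def unmet_requirements(requires, installed):
    """Attributes required by a unit that no installed unit provides.

    `installed` maps unit names to the set of attributes each provides.
    """
    offered = set().union(*installed.values())
    return set(requires) - offered


def conflicts(conflict_attrs, installed):
    """Names of installed units providing an attribute the new unit conflicts with."""
    return sorted(name for name, provided in installed.items()
                  if set(conflict_attrs) & set(provided))
```

An installation check in this style would accept a unit only when unmet_requirements returns an empty set and conflicts returns an empty list for the local configuration.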

A Distribution API

The Distribution API focuses on the efficient distribution of content units over the network, in a P2P system architecture. The API description uses an object-oriented model, slightly different, but compatible (as shown below) with the general Project Management Interface (PMI) model, based on content units and attributes.

The Data Model

We distinguish three distinct types of content units:

  • Packages, representing the RPM files
  • Utilities, representing files used in the installation process
  • Collections, representing grouping structures that gather several objects (content units): packages, utilities or collections. Collections allow organizing the content units in hierarchical (tree) structures.

The main difference between Packages and Utilities in distribution comes from the versioning policy. While any change in a package's contents leads to a new package version, changes to Utilities (or Collections) do not necessarily change the version. For Collections, at some point in time, when the collection is considered "stable", it may be published as a new version. Utilities are not considered to have their own independent version number; rather, they inherit the version number of the collection they belong to.

The following example illustrates the data model elements in the context of the Mandriva Cooker distribution.

[Figure: cooker.png]

The figure represents the hierarchical organization of the Cooker distribution, where leaves represent packages or utilities and internal nodes represent collections. Square boxes are used for Collection objects and round boxes represent Packages or Utilities.

The data model for distribution can be summarized as follows:

  • Object: Package | Utility | Collection
  • ObjectID
    • Name
    • Version_Number
  • Package
    • ID: ObjectID
    • Metadata
    • Location
    • Value: stream
  • Utility
    • ID: ObjectID
    • Metadata
    • Location
    • Value: stream
  • Collection
    • ID: ObjectID
    • Metadata
    • Value: ObjectID[]

Objects represent any content unit distributed in the system (Package, Utility or Collection). An object is identified by an ObjectID, consisting of a name and a version number. The Name must include the path of the object in the distribution tree because, for instance, the same package name and version may occur in different collections (e.g. package perl version 5.8.7 may exist for different system architectures: i586, sparc, ...).

The ObjectID is a logical identifier for the Object: several replicas of the same object may exist in the P2P system. The Location attribute allows physically locating the object instance (replica) in the network (it may be seen as a URL).

Value refers to the content of the Object: the file for Packages and Utilities, and the composition list (as a set of ObjectIDs) for Collections. Actually, Collection composition definition allows any graph structure for collections. We restrict Collections to hierarchical structures (trees or DAGs) and for the moment we focus on tree collection structures, in which an Object is not shared between different collections.
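
One possible rendering of this data model in Python is given below, as a sketch only. The field types (bytes for the stream, str for the location), the path syntax used in the usage note and the helper no_shared_members are assumptions; the helper checks only the rule that an Object is not shared between collections and does not detect cycles.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class ObjectID:
    name: str      # includes the path of the object in the distribution tree
    version: str


@dataclass
class Package:
    id: ObjectID
    metadata: dict
    location: str
    value: bytes   # file content (a stream in the model)


@dataclass
class Utility:
    id: ObjectID
    metadata: dict
    location: str
    value: bytes


@dataclass
class Collection:
    id: ObjectID
    metadata: dict
    value: List[ObjectID]   # composition list


DistObject = Union[Package, Utility, Collection]


def no_shared_members(collections: Dict[ObjectID, Collection]) -> bool:
    """True if no ObjectID is a member of two collections
    (or listed twice in the same one)."""
    seen = set()
    for collection in collections.values():
        for member in collection.value:
            if member in seen:
                return False
            seen.add(member)
    return True
```

With names that carry the path, ObjectID("cooker/i586/perl", "5.8.7") and ObjectID("cooker/sparc/perl", "5.8.7") are distinct identifiers even though the package name and version coincide.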

Metadata represents all the Object properties necessary to characterize it in the process of distribution and retrieval. Metadata properties correspond to content unit attributes in the Project Management Interface (PMI), but contain only those attributes that do not change for a given ObjectID. For instance, Location and Collection composition are not included in Metadata. The (non-exhaustive) list of Metadata properties is:

  • Metadata
    • Type: application, documentation, source, binary, utility
    • Date: object's creation time
    • Signature: SHA-1 checksum
    • Size: Number_of_Bytes
    • Dependencies
    • License
    • ...
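
As an illustration of how such a Metadata record might be built and used to validate a downloaded replica, here is a short Python sketch. It assumes the Signature property is a SHA-1 checksum of the object's content; the function names and the dictionary layout are hypothetical.

```python
import hashlib
import time


def build_metadata(content: bytes, types, license_name, dependencies=()):
    """Assemble the constant properties of an object from its content."""
    return {
        "Type": sorted(types),
        "Date": time.time(),                           # object's creation time
        "Signature": hashlib.sha1(content).hexdigest(),
        "Size": len(content),                          # number of bytes
        "Dependencies": list(dependencies),
        "License": license_name,
    }


def verify(content: bytes, metadata: dict) -> bool:
    """Check a downloaded replica against the published size and signature."""
    return (len(content) == metadata["Size"]
            and hashlib.sha1(content).hexdigest() == metadata["Signature"])
```

Because these properties do not change for a given ObjectID, any peer holding the metadata can check a replica obtained from any other peer.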

Correspondence with the Project Management Interface (PMI)

Here follows a quick example of how the distribution API can be expressed using the Project Management Interface (PMI).

Consider the Mandriva Cooker distribution example above. In the context of the PMI, the above illustration becomes the following one.

[Figure: cooker-pmi.png]

In this figure, square boxes represent units as defined by the PMI. Dotted arrows list members of a collection unit. Each unit defines a set of types of units it contains. These types usually have to be defined in the context of the project.

In the case of a Linux distribution, we define (for instance) that available types are: source, binary (opposite to source), documentation, application (opposite to documentation) and utility. Each unit may be of any of these types; the only restriction being that a collection can only contain members of a type of the collection. The type attribute can then be used to define how the unit has to be handled.

All the units possess all the other attributes defined by the PMI. This covers dependencies, localization, functionalities, license, etc. Nevertheless these are not illustrated on the figure.

The localization attribute acts as a UID for the unit itself as well as a link to a set of physical locations where the content can be found. This link can be updated over time in order to provide up-to-date localizations and can be of any type (such as a URL to a torrent, an FTP site, etc.).

In the concrete case of a Linux distribution, unlike utilities and packages that have content, a collection is a set of meta-data with no content. This meta-data is represented by the set of attributes associated to the unit.

In the Distribution API model presented above, attributes are separated in several categories:

  • Identifiers, corresponding to ObjectID, include logical localization (name) and versioning attributes
  • Constant properties, corresponding to Metadata, include all the attributes that do not change for a given ObjectID
  • Variable properties, corresponding to Location and Value, that may be different in time or on different peers for the same ObjectID

Architectural components, roles and API levels

Actors and roles

The P2P distribution system is composed of 3 types of actors:

  • the Publisher represents the reference server of the distribution editor. It is the seeder of the data in the system. The insertion of data in the system is done either in Push or in Pull mode.
  • the Replicators (Mirrors), used as replication peers, get data from the Publisher or from other Replicators and offer this data for distribution.
  • the Simple Peers, seen as the users' machines participating in data sharing.

[Figure: DistributionAPI-Architecture2.jpg]

With each of these three categories of components, we associate specific roles in the distribution:

  • Publication, corresponding to the Publisher functions.
  • Replication, for providing replicas of published objects to distribution. Replicators, but also Simple Peers (depending on the implementation) may play this role.
  • Client, corresponding to functions such as searching, subscribing, downloading objects. Simple Peers, Replicators, but also the Publisher may play this role.

Logical and Physical Levels

The API makes the distinction between a Logical level and a Physical level. At the Logical level, we describe the main API methods, corresponding to the software distribution functionalities: publishing, replicating, downloading, subscribing, etc. Normally, a distribution application only needs this API level to use the system.

At the Physical level we consider lower-level methods, necessary to implement the logical level functions. Physical level methods provide finer-grained access to EDOS distribution functionalities, in order to implement strategies different from those provided by the logical level.

The distribution API below is defined at the Logical level, organized by roles and is composed of the basic methods necessary to realize the system's functionalities. Examples of some Physical level sub-functions are also provided.

Basic functionalities for each role

Publisher

  • publishPackage(PackageID, Metadata, Location)
Publishes a package by storing information about the package in the index. The package is added (or replaces an older version) in the implicit collection (collection name = prefix of the package name, collection version = last published version).

  • publishUtility(UtilityID, Metadata, Location)
Publishes a utility. Unlike packages, utilities may change their content while keeping the same ID. In this case, only the Publisher's instance is fresh, and the other replicas are eliminated from the index. Another difference with packages is that the implicit collection for utilities has the same version number as the utility.

  • publishCollection(CollectionID, Metadata, Value)
Publishes a collection and, recursively, all the collection's elements. The published collection is added to the implicit collection (similarly to packages). Unlike packages, the same collection may have different compositions over time. Only the Publisher is guaranteed to have the correct (latest) composition. A collection is published when a new version is available. Between two versions, the collection's content may be modified through the (un)publishing of packages, utilities or sub-collections.

  • unpublishPackage(PackageID)
Un-publishes the package by removing information about it in the index. Removes the package from its implicit collection parent.

  • unpublishUtility(UtilityID)
Similar to unpublishPackage, but for utilities.

  • unpublishCollection(CollectionID)
Un-publishes the collection and recursively all its elements. Removes the collection from its implicit collection parent.

Examples of physical level functions for Publisher

  • publishPackageLocation (PackageID, Location) : publishes the package location in the index
  • publishPackageMetadata (PackageID, Metadata) : publishes the package metadata in the index
  • publishPackagePush (PackageID, Peer[]) : if push is allowed, pushes package content (files) to other peers
  • insertIntoCollection (CollectionID, ObjectID[]) : inserts objects into collection
  • deleteInCollection (CollectionID, ObjectID[]) : removes objects from collection
  • ...
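
To make the Publisher methods more concrete, the following Python sketch keeps the index in memory. It is only an illustration: the Index class, the use of (name, version) tuples as IDs and the rule that the implicit collection is the part of the name before the last "/" are assumptions, since the index structure is not specified here.

```python
class Index:
    """Toy in-memory index: object ID -> metadata and replica locations."""

    def __init__(self):
        self.metadata = {}     # (name, version) -> metadata dict
        self.locations = {}    # (name, version) -> set of locations
        self.collections = {}  # collection name -> set of member IDs

    def publish_package(self, package_id, metadata, location):
        name, _version = package_id
        self.metadata[package_id] = metadata
        self.locations[package_id] = {location}
        # Implicit collection: assumed to be the path prefix of the name.
        parent = name.rsplit("/", 1)[0] if "/" in name else ""
        members = self.collections.setdefault(parent, set())
        # A newly published version replaces older versions of the same name.
        members -= {m for m in members if m[0] == name}
        members.add(package_id)

    def unpublish_package(self, package_id):
        # Remove the package from the index and from its parent collection.
        self.metadata.pop(package_id, None)
        self.locations.pop(package_id, None)
        for members in self.collections.values():
            members.discard(package_id)
```

The physical-level functions listed above would correspond to the individual steps inside publish_package: recording the location, recording the metadata, and inserting the ID into the collection.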

Replicator

  • (un)publishReplicatedPackage (PackageID, Metadata, Location)
(Un)publishes in the index the fact that a replica of the package exists at the given location. A replica of metadata is also produced.

  • (un)publishReplicatedUtility (UtilityID, Metadata, Location)
(Un)publishes in the index the fact that a replica of the utility exists at the given location. A replica of metadata is also produced.

  • (un)publishReplicatedCollection (CollectionID, Location)
Recursively applies (un)publishReplicatedPackage(Utility) for all the packages and utilities in the collection. The collection itself is not (un)published as an object.

Examples of physical level functions for Replicator

  • (un)publishReplicatedPackageLocation (PackageID, Location) : (un)publishes the replicated package location in the index
  • publishReplicatedPackagePush (PackageID, Peer[]) : if push is allowed, pushes package content (files) to other peers
  • ...
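
The recursive behaviour of (un)publishReplicatedCollection can be sketched in Python as follows. The module-level locations table, the composition mapping and the simplified signatures (the Metadata argument is omitted) are assumptions made to keep the example short.

```python
from collections import defaultdict

locations = defaultdict(set)   # package or utility ID -> replica locations


def publish_replicated_package(package_id, location):
    locations[package_id].add(location)


def unpublish_replicated_package(package_id, location):
    locations[package_id].discard(location)


def publish_replicated_collection(collection_id, location, composition, publish=True):
    """Recursively (un)publish replicas of every package or utility in a collection.

    `composition` maps collection IDs to their member IDs; any ID that is
    not a key of `composition` is treated as a leaf (package or utility).
    The collection itself is not recorded as an object.
    """
    action = publish_replicated_package if publish else unpublish_replicated_package
    for member in composition[collection_id]:
        if member in composition:
            publish_replicated_collection(member, location, composition, publish)
        else:
            action(member, location)
```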

Client

  • Package getPackage (PackageID)
Gets a copy of the requested package. It chooses the best location of the package on the network for downloading. Large packages may be cut into several slices, downloaded in parallel from different locations in a BitTorrent-like style.

  • Utility getUtility(UtilityID)
Gets a copy of the requested utility after choosing the best location of the utility on the network for downloading.

  • Collection getCollection(CollectionID)
Gets a copy of the collection by recursively downloading its components. It identifies the missing packages and utilities, downloads them in parallel from several sources and builds locally the requested collection.

Examples of physical level functions for Client

  • Location[] locatePackage (PackageID) : returns the set of locations where the package may be found in the system
  • PackageID[] getCollectionPackages (CollectionID) : gets the set of IDs for all the packages (at any depth) in the collection
  • PackageID[] computeMissingPackages (PackageID[]) : computes the set of missing packages on the local peer, among the packages in the list and those required by dependencies
  • LocationMap getBestLocations (PackageID[], UtilityID[]) : decides for each package and utility in the lists what is the best downloading location
  • ...
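
A minimal Python sketch of three of these physical-level functions is given below. The data structures (a composition mapping, a dependency mapping, a caller-supplied cost function for ranking locations) are assumptions; how the "best" location is actually chosen is not specified here.

```python
def get_collection_packages(collection_id, composition):
    """IDs of all packages, at any depth, below a collection."""
    found = []
    for member in composition[collection_id]:
        if member in composition:          # sub-collection
            found.extend(get_collection_packages(member, composition))
        else:
            found.append(member)
    return found


def compute_missing_packages(wanted, dependencies, installed):
    """Packages from `wanted`, plus their transitive dependencies,
    that are not yet on the local peer."""
    missing, seen, stack = [], set(), list(wanted)
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        if pkg not in installed:
            missing.append(pkg)
        stack.extend(dependencies.get(pkg, ()))
    return sorted(missing)


def get_best_locations(package_ids, locations, cost):
    """Pick, for each package with a known replica, the lowest-cost location."""
    return {p: min(locations[p], key=cost)
            for p in package_ids if locations.get(p)}
```

A getCollection call could then be composed from these steps: list the packages, compute the missing ones, pick a location for each, and download them in parallel.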

Advanced functionalities

Subscription

Subscription can be used in software distribution to provide event notification and possible automatic download of objects. In dealing with subscription, we consider the concept of channel, used in Red Hat Network (RHN).

Each channel is an abstraction that corresponds to a set of objects (packages, utilities or collections). The Publisher feeds channels by publishing or un-publishing objects in channels. Publishing/un-publishing in a channel adds/removes objects to/from the set of objects of that channel.

For each channel, permissions can be assigned to distinct users. Clients, depending on their permissions, may subscribe to channels. A subscription to a channel should indicate:

  • the objects of interest : new/all objects, a query filter
  • the notification moment : immediate, periodically
  • the action to do on notification : simple notification, automatic download

Clients may get the notified objects either manually, using the getXXX methods for the Client role, or automatically, using a multicast algorithm.

Subscription methods associated to the Publication role

  • Channel createChannel (ChannelName, ChannelDescription, ObjectID[], AccessRights)
Creates a new channel and initializes its content with the set of objects. Publishes to the index the channel name and description.

  • (un)publishPackageToChannel(PackageID, ChannelName, Date)
  • (un)publishUtilityToChannel(UtilityID, ChannelName, Date)
  • (un)publishCollectionToChannel(CollectionID, ChannelName, Date)
These methods publish the distribution object to the given channel at the given date. A notification is sent to the channel's subscribers and, possibly, a multicast push is activated.

Examples of physical level functions for Publisher-side subscription

  • SubscriptionID addSubscribtion (ChannelName, SubscriberInfo, Subscription) : function called by the Publisher when receiving a subscription request from some client
  • removeSubscribtion (ChannelName, SubscriberInfo, SubscriptionID) : function called by the Publisher when receiving a subscription cancellation request from some client
  • multicast (ObjectID[], Peer[]) : multicast distribution of objects to subscribers
  • ...

Subscription methods associated to the Client role

  • ChannelName[] getChannelList()
Returns the list of channel names available in the system.

  • ChannelDescription getChannelDescription(ChannelName)
Returns the description of the given channel.

  • SubscriptionID subscribeToChannel(ChannelName, SubscriberInfo, Subscription)
Subscribes to a channel.

  • unsubscribeToChannel(ChannelName, SubscriberInfo, SubscriptionID)
Cancels the given subscription to a channel.
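
The channel mechanism can be illustrated with a small in-process Python sketch. It covers only immediate, simple notification with an optional query filter; permissions, periodic notification, automatic download and multicast are left out, and the Channel class with its callback interface is an assumption.

```python
class Channel:
    """Toy channel: a set of objects plus the subscriptions to notify."""

    def __init__(self, name, description, object_ids=()):
        self.name = name
        self.description = description
        self.objects = set(object_ids)
        self.subscriptions = {}   # subscription ID -> (callback, filter)
        self._next_id = 1

    def subscribe(self, callback, query_filter=lambda object_id: True):
        sub_id = self._next_id
        self._next_id += 1
        self.subscriptions[sub_id] = (callback, query_filter)
        return sub_id

    def unsubscribe(self, sub_id):
        self.subscriptions.pop(sub_id, None)

    def publish(self, object_id):
        """Add an object to the channel and notify matching subscribers at once."""
        self.objects.add(object_id)
        for callback, query_filter in self.subscriptions.values():
            if query_filter(object_id):
                callback(self.name, object_id)

    def unpublish(self, object_id):
        self.objects.discard(object_id)
```

In a distributed setting the callback would be replaced by a message to the subscribing peer, which could then call the getXXX methods or receive the objects by multicast.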

Querying

The API should enable searching for data objects not only according to their IDs, but also by other criteria such as: functionality, license, status, size, etc. The Metadata will be queried in order to locate the wanted objects. A simple query language will be used to query metadata values.

  • PackageID[] queryPackages(Query)
Gets the list of package IDs matching the query.

  • UtilityID[] queryUtilities(Query)
Gets the list of utility IDs matching the query.

  • CollectionID[] queryCollections(Query)
Gets the list of collection IDs matching the query.

  • Version_Number getLastVersionNb(Name, Type)
Gets the last version number for the object of the given type (package, utility, collection).
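
Since the query language is not yet defined, the sketch below stands in for it with plain Python predicates over metadata dictionaries. The index layout, the function names and the assumption that version numbers are dotted integers are all illustrative; the Type argument of getLastVersionNb is omitted for brevity.

```python
def version_key(version):
    """Order dotted numeric versions such as "5.8.10" numerically."""
    return tuple(int(part) for part in version.split("."))


def query_packages(index, predicate):
    """IDs of the packages whose metadata satisfies the predicate.

    `index` maps (name, version) IDs to metadata dictionaries.
    """
    return sorted(pid for pid, meta in index.items() if predicate(meta))


def get_last_version_nb(index, name):
    """Highest version number published under the given name, or None."""
    versions = [version for (n, version) in index if n == name]
    return max(versions, key=version_key) if versions else None
```

Comparing versions numerically rather than as strings matters here: as strings, "5.8.7" would sort after "5.8.10".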

Database and Transactions

Security Issues

Conclusions and Perspectives

Bibliography

Version 1.19 last modified by DanVodislav on 04/11/2005 at 15:05

Creator: Bryce on 2005/10/25 07:53
Copyright EDOS Consortium