On the Distribution of Open Source Software
WP4 Team
Foreword
This paper summarises on-going work in Work-Package 4 of the EDOS project. The goal of this work-package is to investigate scalable and secure solutions for improving the process of distributing open-source data (source code, binaries, documentation and meta-data) to open source users and developers. A key challenge in the code distribution process is the ability to correctly transfer a large code base to a very large number of people. In the case of Mandrivalinux for instance, this entails copying a code base of 20 Gigabytes to a community of up to 4 million users (i.e., the number of installed versions of Mandrakelinux). Given that the community is continuing to grow, solutions to code distribution must be scalable. Currently, F/OSS projects generally employ a set of mirror servers for code distribution: data is placed on a master project server, and from there is (asynchronously) downloaded to an initial set of mirror servers, and from these servers to others. This process is quite slow, taking 48 hours to copy from a master server to all mirror servers in the case of Mandrivalinux. This latency can lead to inconsistencies on the user and developer side when project modules are installed. This in turn can create awkward dependencies at the module level in subsequent releases. The aim of this work-package is to design and evaluate two alternative distribution architectures that address the issues of latency and consistency.
Approach
All major Free and Open Source (F/OSS) projects rely on mirror servers to ensure the distribution of content to end-users. The mirror server approach is a true "community" solution since the servers are independently administered, and a successful distribution process depends on the active participation of server administrators. On the other hand, the efficiency of the mirror server solution degrades as the number of downloads, the size of the code base and the frequency of releases increase. The mirror server approach is thus a hindrance to the progression of F/OSS, and alternative solutions need to be investigated. The weakness of the mirror server solution is both technical and managerial: the community depends on the reliable operation of a handful of servers, and on the active and continued participation of a handful of administrators.
The search for a decentralised solution led us to consider a peer-to-peer architecture. A peer-to-peer (P2P) system is a community of computing users who share resources. The community exists because the more resources that members contribute to the community, the more resources they can gain access to. P2P is now a popular phenomenon and seems well adapted to the F/OSS community: software is a resource packaged in files that can be shared just as any other resource. In particular, a P2P system avoids the technical and managerial bottleneck of the mirror server solution. Our P2P solution will also consider transactional and security support.
Yet before experimenting with the peer-to-peer technical solution, it occurred to us that we needed to address a more fundamental issue related to the efficiency of data distribution: the nature of the distributed content. Maximal efficiency entails that each client gets all of, and only, the content it requires; there is no superfluous transfer of data.
This has two major implications:
- Content must be correctly and extensively classified. For instance, different F/OSS users are interested in locating different classes of content at any time, e.g., patches for some module, binaries for an application, test suites, documentation, bug reports, etc. The mirror server solution is inefficient in this respect since clients seeking different content needlessly compete for server cycles and network bandwidth.
- Downloaded content is only valuable to the client if it contains pertinent and accurate meta-data. The EDOS project is already working on the problem that modules can contain inaccurate and out-of-date meta-data regarding dependency information (WP2). Other important meta-data include test results and bug reports (WP3) and licenses. Meta-data is important because it defines the search criteria for the client in the distribution architecture and enables the client to determine precisely which content modules it needs to acquire from the architecture.
A Project Management Interface
While each EDOS workpackage focuses on a specific topic, cross-issues exist and information needs to be shared in order to efficiently address each workpackage task. Content distribution cannot be addressed without also considering dependencies as well as content testing and bug reporting. All of these topics are inter-related, so each influences the way the others should be tackled. The Project Management Interface (PMI) has thus become a cross-workpackage activity. On the one hand, the PMI provides guidelines to which the workpackages can refer in order to interface with other topics and to provide them with the information they need. On the other hand, it provides guarantees about the available information on which workpackages can rely in order to achieve their tasks. Further, the role of the PMI is to provide a uniform way to handle not only the F/OSS content resources that will be distributed through the EDOS Distribution API, but also the F/OSS community resources that are related to this content. Again, the creation, testing and debugging of content resources are closely coupled to community resources. Knowing which community members have worked on a unit of content, and their degree of experience, is essential to improving the F/OSS process. The PMI aims at providing a kernel to streamline and improve the whole F/OSS process, taking into consideration the different aspects of F/OSS.
Attributes
The core of the PMI is built around the concept of attributes. Attributes are metadata that are used to qualify, classify and thus locate entities, be they actors or content units. Further, attributes are meant to associate content with community resources. The PMI formalizes the semantics of attributes and the way they can be compared and combined, and defines the following useful properties:
- Attributes are substitutable. Substitutability models the fact that entities can be related in useful ways. For instance, a sub-type can be used in place of a type; a license can be used in place of compatible licenses, etc. Thus substitutability is more expressive than equality. An attribute matches another one if the first can be used in place of the second.
- Attributes can be used to build Boolean expressions. Attributes of the same type can be related using and, or and not operators.
- Attributes can be grouped as directories. Directories are groups of attributes of different types. They can be used to express a template (for looking up a unit matching the set of provided attributes), a state (current configuration) or a signature of a F/OSS resource (set of attributes associated to a unit).
- Attributes can be matched. In the matching process, attributes, directories and expressions are compared.
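The properties listed above can be illustrated with a minimal Python sketch. It is not the PMI formalisation itself: the `Attribute` class, the `CAN_REPLACE` table, the helper names and the sample values are all assumptions made for illustration, and the sketch does not enforce that a Boolean expression only combines attributes of the same type.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set


@dataclass(frozen=True)
class Attribute:
    kind: str   # attribute type, e.g. "functionality" or "license"
    value: str


Directory = FrozenSet[Attribute]      # a group of attributes of different types
Expr = Callable[[Directory], bool]    # a Boolean expression, evaluated on a directory

# Hypothetical substitutability table: each key may be used in place of the
# attributes listed as its value (here, a sub-type in place of its type).
CAN_REPLACE: Dict[Attribute, Set[Attribute]] = {
    Attribute("functionality", "text-editor"): {Attribute("functionality", "editor")},
}


def matches(offered: Attribute, wanted: Attribute) -> bool:
    """True if `offered` can be used in place of `wanted` (equality included)."""
    return offered == wanted or wanted in CAN_REPLACE.get(offered, set())


def has(wanted: Attribute) -> Expr:
    """Atomic expression: some attribute of the directory matches `wanted`."""
    return lambda directory: any(matches(a, wanted) for a in directory)


def all_of(*exprs: Expr) -> Expr:      # "and"
    return lambda directory: all(e(directory) for e in exprs)


def any_of(*exprs: Expr) -> Expr:      # "or"
    return lambda directory: any(e(directory) for e in exprs)


def not_(expr: Expr) -> Expr:          # "not"
    return lambda directory: not expr(directory)


def directory_matches(signature: Directory, template: Directory) -> bool:
    """A signature matches a template if every template attribute is matched."""
    return all(any(matches(a, t) for a in signature) for t in template)
```

In this sketch a unit whose signature contains the sub-type `text-editor` matches a template asking for `editor`, while the converse does not hold, which is the asymmetry that distinguishes substitutability from equality.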
Content resource handling
For now, the PMI concentrates on handling content resources. It formalizes a set of functions and related integrity rules enabling the creation of new units, their retrieval, and installation. The PMI defines the state of a content unit by specifying mandatory attributes every content unit has to be able to provide. Here is an overview of such required attributes:
- A localisation that indicates the name and/or location of a unit.
- A set of functionality attributes describing functional (type) information.
- An attribute corresponding to the license used by the unit
- A version tag.
- Backward compatibility attributes
- A description attribute
- A set of attributes defining the types the content belongs to. In the case of a Linux distribution they can be source, binary, documentation, application or utility
- Requirement and conflict dependencies in terms of localisation attributes or functionality attributes
- Set of localisation attributes defining the members of the unit if the latter is a collection of other units
Look-up
Created units are available to the community. They can be retrieved through a look-up process by specifying a directory of attributes the looked-up unit has to match. As a result the look-up function returns a set of matching units. The PMI formalizes this process. There are many possible ways to look up units. The following examples give a short and non-exhaustive overview of the different types of look-up that can be achieved. Note that these examples can be combined at will.
- Looking up a unit by providing its localisation attribute. This example reflects the way packages are currently retrieved, by providing their name or uid.
- Looking-up units based on a choice of licenses. Specifying a Boolean expression of license attributes does this.
- Looking-up units by specifying the set of functionalities they have to provide or the set of functionalities they must not provide. This can be done to express so-called virtual packages.
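The look-up styles listed above, and the way they combine, can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the `Unit` class, the criterion helpers and the sample unit names are hypothetical, and attributes are compared by plain equality rather than by the PMI substitutability relation.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Unit:
    localisation: str                        # name / uid of the unit
    license: str
    functionalities: FrozenSet[str] = frozenset()


Criterion = Callable[[Unit], bool]


def by_localisation(name: str) -> Criterion:
    return lambda unit: unit.localisation == name


def license_in(*licenses: str) -> Criterion:
    # A Boolean "or" over license attributes.
    return lambda unit: unit.license in licenses


def provides(*functionalities: str) -> Criterion:
    return lambda unit: set(functionalities) <= unit.functionalities


def lacks(*functionalities: str) -> Criterion:
    return lambda unit: not (set(functionalities) & unit.functionalities)


def look_up(units: Iterable[Unit], *criteria: Criterion) -> List[Unit]:
    """Return every unit satisfying all the given criteria."""
    return [u for u in units if all(c(u) for c in criteria)]


# Hypothetical repository content.
REPOSITORY = [
    Unit("editor-a", "GPL", frozenset({"editor", "spell-check"})),
    Unit("editor-b", "BSD", frozenset({"editor"})),
    Unit("mailer-a", "GPL", frozenset({"mail-transport"})),
]

# A "virtual package" style request: anything that provides "editor"
# without providing "spell-check", under either license.
candidates = look_up(REPOSITORY, provides("editor"), lacks("spell-check"),
                     license_in("GPL", "BSD"))
```

Passing several criteria to `look_up` corresponds to combining the example look-ups; a look-up by localisation alone reproduces the current name-based retrieval.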
Versioning and Patching
The PMI clearly separates content patching from content versioning and highlights the main differences between these two operations. While patching focuses on minor bug corrections, without changing the overall behaviour of the unit (i.e. without changing the attributes defining the state of the unit), versioning modifies the attributes associated to the unit. Both patching and versioning keep a history of modifications. These two operations are semantically different. Thus, such a separation clarifies the different responsibilities and situations that may occur during the life cycle of a unit.
Dependencies
The way the PMI describes unit dependencies follows the recommendations of the EDOS deliverable 2.1. The PMI specifies the integrity rules that have to be asserted at unit creation time as well as at installation time. The attribute-based approach to describing functionality adds expressive power to dependency description. Attributes and attribute expressions can be used to express generic properties of units. The PMI then allows units to be matched against such sets of attributes. This is particularly useful to replace so-called virtual packages, as the latter are supposed to share a common set of properties. Further, as attributes define the properties of a unit, and as attributes are substitutable, dependencies built on attributes ensure that, as long as a unit offers the same attributes an older version was offering, the newer version can substitute the older one, provided there is no dependency conflict with the local configuration. As for patching and versioning, the PMI separates semantically different information to describe a unit's properties, and thus its dependencies. Indeed, different attribute sets are provided to express different information. The elements of these sets are then separated depending on their usage context. For instance, the functionality provided by a unit is clearly separated from the functionality the unit requires. An algebra for handling these attributes is provided and the substitutability relation for each set of attributes is defined. Thus the PMI acts as a safeguard ensuring a correct definition of dependencies. Finally, the PMI is able to express dependencies using Boolean expressions of functionality attributes as well as Boolean expressions of localisation attributes. As localisation attributes can be package names, backward compatibility with existing tools is ensured.
A Distribution API
The Distribution API focuses on the efficient distribution of content units over the network, in a P2P system architecture. The API description uses an object-oriented model, slightly different, but compatible (as shown below) with the general Project Management Interface (PMI) model, based on content units and attributes.
The Data Model
We distinguish three types of content units:
- Packages, representing the RPM files
- Utilities, representing files used in the installation process
- Collections, representing grouping structures that gather several objects (content units): packages, utilities or collections. Collections allow organizing the content units in hierarchical (tree) structures.
The figure represents the hierarchical organization of the Cooker distribution, where leaves represent packages or utilities and internal nodes represent collections. Square boxes are used for Collection objects and round boxes represent Packages or Utilities.
The data model for distribution can be summarized as follows:
- Object: Package | Utility | Collection
- ObjectID
  - Name
  - Version_Number
- Package
  - ID: ObjectID
  - Metadata
  - Location
  - Value: stream
- Utility
  - ID: ObjectID
  - Metadata
  - Location
  - Value: stream
- Collection
  - ID: ObjectID
  - Metadata
  - Value: ObjectID[]
- Metadata
  - Type: application, documentation, source, binary, utility
  - Date: object's creation time
  - Signature: SHA1 checksum
  - Size: Number_of_Bytes
  - Dependencies
  - License
  - ...
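As a rough illustration, the data model summarised above could be transcribed into Python dataclasses as follows. The field types, the `make_package` helper and the use of a plain string for Location are assumptions of this sketch; the Signature is computed with SHA-1 on the assumption that this is the checksum meant by the model, and metadata fields elided in the model are not filled in.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class ObjectID:
    name: str
    version_number: str


@dataclass
class Metadata:
    type: str                 # application, documentation, source, binary, utility
    date: str                 # object's creation time
    signature: str            # checksum of the content (SHA-1 assumed here)
    size: int                 # number of bytes
    dependencies: List[str] = field(default_factory=list)
    license: str = ""


@dataclass
class Package:
    id: ObjectID
    metadata: Metadata
    location: str             # assumed to be a URL-like string
    value: bytes              # the content stream, held in memory in this sketch


@dataclass
class Utility:
    id: ObjectID
    metadata: Metadata
    location: str
    value: bytes


@dataclass
class Collection:
    id: ObjectID
    metadata: Metadata
    value: List[ObjectID]     # members: packages, utilities or other collections


ContentObject = Union[Package, Utility, Collection]


def make_package(name: str, version: str, content: bytes, location: str,
                 date: str, type_: str = "binary") -> Package:
    """Hypothetical helper deriving Signature and Size from the content."""
    meta = Metadata(type=type_, date=date,
                    signature=hashlib.sha1(content).hexdigest(),
                    size=len(content))
    return Package(ObjectID(name, version), meta, location, content)
```

Making `ObjectID` immutable lets it serve as a dictionary key, which is convenient for the index structures a distribution system would need.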
Correspondence with the Project Management Interface (PMI)
Here follows a quick example of how the distribution API can be expressed using the Project Management Interface (PMI). Consider the Mandriva Cooker distribution example above. In the context of the PMI, the above illustration becomes the following one.
In this figure, square boxes represent units as defined by the PMI. Dotted arrows list members of a collection unit. Each unit defines a set of types of units it contains. These types usually have to be defined in the context of the project.
In the case of a Linux distribution, we define (for instance) that available types are: source, binary (opposite to source), documentation, application (opposite to documentation) and utility. Each unit may be of any of these types; the only restriction is that a collection can only contain members whose type is one of the types of the collection. The type attribute can then be used to define how the unit has to be handled.
All the units possess all the other attributes defined by the PMI. This covers dependencies, localization, functionalities, license, etc. Nevertheless these are not illustrated on the figure.
The localization attribute acts as a UID for the unit itself as well as a link to a set of physical locations where the content can be found. This link can be updated over time in order to provide up-to-date localizations and can be of any type (such as a URL to a torrent, ftp site...).
In the concrete case of a Linux distribution, unlike utilities and packages that have content, a collection is a set of meta-data with no content. This meta-data is represented by the set of attributes associated to the unit.
In the Distribution API model presented above, attributes are separated into several categories:
- Identifiers, corresponding to ObjectID, include logical localization (name) and versioning attributes
- Constant properties, corresponding to Metadata, include all the attributes that do not change for a given ObjectID
- Variable properties, corresponding to Location and Value, that may be different in time or on different peers for the same ObjectID
Architectural components, roles and API levels
Actors and roles
The P2P distribution system is composed of 3 types of actors:
- the Publisher represents the reference server of the distribution editor. It is the seeder of the data in the system. The insertion of data in the system is done either in Push or in Pull mode.
- the Replicators (Mirrors), used as replication peers, get data from the Publisher or from other Replicators and offer this data for distribution.
- the Simple Peers, seen as the users' machines participating in data sharing.
With each of these three categories of components, we associate specific roles in the distribution:
- Publication, corresponding to the Publisher functions.
- Replication, for providing replicas of published objects to distribution. Replicators, but also Simple Peers (depending on the implementation) may play this role.
- Client, corresponding to functions such as searching, subscribing, downloading objects. Simple Peers, Replicators, but also the Publisher may play this role.
Logical and Physical Levels
The API makes the distinction between a Logical level and a Physical level. At the Logical level, we describe the main API methods, corresponding to the software distribution functionalities: publishing, replicating, downloading, subscribing, etc. Normally, a distribution application only needs this API level to use the system. At the Physical level we consider lower-level methods, necessary to implement the logical level functions. Physical level methods provide finer-grained access to EDOS distribution functionalities, in order to implement strategies different from those provided by the logical level. The distribution API below is defined at the Logical level, is organized by roles and is composed of the basic methods necessary to realize the system's functionalities. Examples of some Physical level sub-functions are also provided.
Basic functionalities for each role
Publisher
- publishPackage(PackageID, Metadata, Location)
- publishUtility(UtilityID, Metadata, Location)
- publishCollection(CollectionID, Metadata, Value)
- unpublishPackage(PackageID)
- unpublishUtility(UtilityID)
- unpublishCollection(CollectionID)
- publishPackageLocation (PackageID, Location) : publishes the package location in the index
- publishPackageMetadata (PackageID, Metadata) : publishes the package metadata in the index
- publishPackagePush (PackageID, Peer[]) : if push is allowed, pushes package content (files) to other peers
- insertIntoCollection (CollectionID, ObjectID[]) : inserts objects into collection
- deleteInCollection (CollectionID, ObjectID[]) : removes objects from collection
- ...
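To make the Publisher role more concrete, here is a minimal in-memory sketch of a few of the methods above, written in Python with snake_case names. The `PublisherIndex` class, its internal dictionaries and the use of plain strings for IDs and locations are assumptions of this sketch; the actual index would be distributed over the P2P system, and push distribution is not modelled.

```python
from typing import Dict, Iterable, List, Set


class PublisherIndex:
    """Toy index holding metadata, locations and collection membership."""

    def __init__(self) -> None:
        self.metadata: Dict[str, dict] = {}
        self.locations: Dict[str, Set[str]] = {}
        self.collections: Dict[str, List[str]] = {}

    # publishPackage(PackageID, Metadata, Location)
    def publish_package(self, package_id: str, metadata: dict, location: str) -> None:
        self.publish_package_metadata(package_id, metadata)
        self.publish_package_location(package_id, location)

    # publishPackageMetadata(PackageID, Metadata)
    def publish_package_metadata(self, package_id: str, metadata: dict) -> None:
        self.metadata[package_id] = dict(metadata)

    # publishPackageLocation(PackageID, Location)
    def publish_package_location(self, package_id: str, location: str) -> None:
        self.locations.setdefault(package_id, set()).add(location)

    # unpublishPackage(PackageID)
    def unpublish_package(self, package_id: str) -> None:
        self.metadata.pop(package_id, None)
        self.locations.pop(package_id, None)

    # publishCollection(CollectionID, Metadata, Value)
    def publish_collection(self, collection_id: str, metadata: dict,
                           value: Iterable[str]) -> None:
        self.metadata[collection_id] = dict(metadata)
        self.collections[collection_id] = list(value)

    # insertIntoCollection(CollectionID, ObjectID[])
    def insert_into_collection(self, collection_id: str,
                               object_ids: Iterable[str]) -> None:
        members = self.collections[collection_id]
        for object_id in object_ids:
            if object_id not in members:   # keep each member only once
                members.append(object_id)

    # deleteInCollection(CollectionID, ObjectID[])
    def delete_in_collection(self, collection_id: str,
                             object_ids: Iterable[str]) -> None:
        to_remove = set(object_ids)
        self.collections[collection_id] = [
            o for o in self.collections[collection_id] if o not in to_remove
        ]
```

In this sketch the Logical-level `publish_package` is simply composed from the two finer-grained metadata and location methods, which mirrors the split between the first group of methods and the sub-functions that follow them.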
Replicator
- (un)publishReplicatedPackage (PackageID, Metadata, Location)
- (un)publishReplicatedUtility (UtilityID, Metadata, Location)
- (un)publishReplicatedCollection (CollectionID, Location)
- (un)publishReplicatedPackageLocation (PackageID, Location) : (un)publishes the replicated package location in the index
- publishReplicatedPackagePush (PackageID, Peer[]) : if push is allowed, pushes package content (files) to other peers
- ...
Client
- Package getPackage (PackageID)
- Utility getUtility(UtilityID)
- Collection getCollection(CollectionID)
- Location[] locatePackage (PackageID) : returns the set of locations where the package may be found in the system
- PackageID[] getCollectionPackages (CollectionID) : gets the set of IDs for all the packages (at any depth) in the collection
- PackageID[] computeMissingPackages (PackageID[]) : computes the set of missing packages on the local peer, among the packages in the list and those required by dependencies
- LocationMap getBestLocations (PackageID[], UtilityID[]) : decides for each package and utility in the lists what is the best downloading location
- ...
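The three helper methods of the Client role can be sketched in Python as plain functions over dictionaries. Everything beyond the method names is an assumption of this sketch: collections and dependencies are given as dictionaries of string IDs, every leaf of a collection is treated as a package, and the "best" location is simply the one with the lowest value of a caller-supplied cost function.

```python
from typing import Callable, Dict, Iterable, List, Set


def get_collection_packages(collection_id: str,
                            collections: Dict[str, List[str]]) -> List[str]:
    """IDs of all leaf objects at any depth; IDs absent from `collections` are leaves."""
    found: List[str] = []
    seen: Set[str] = set()
    stack = [collection_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in collections:
            # Push members in reverse so they are visited in declared order.
            stack.extend(reversed(collections[current]))
        else:
            found.append(current)
    return found


def compute_missing_packages(wanted: Iterable[str],
                             requires: Dict[str, List[str]],
                             installed: Set[str]) -> List[str]:
    """Packages from `wanted` and their transitive requirements not yet installed."""
    missing: Set[str] = set()
    seen: Set[str] = set()
    stack = list(wanted)
    while stack:
        package_id = stack.pop()
        if package_id in seen:
            continue
        seen.add(package_id)
        if package_id not in installed:
            missing.add(package_id)
        stack.extend(requires.get(package_id, []))
    return sorted(missing)


def get_best_locations(object_ids: Iterable[str],
                       locations: Dict[str, List[str]],
                       cost: Callable[[str], float]) -> Dict[str, str]:
    """For each object with at least one known location, pick the cheapest one."""
    return {oid: min(locations[oid], key=cost)
            for oid in object_ids if locations.get(oid)}
```

A download session would typically chain the three calls: expand a collection into package IDs, reduce them to the missing ones, then choose a location for each.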
Advanced functionalities
Subscription
Subscription can be used in software distribution to provide event notification and possible automatic download of objects. In dealing with subscription, we consider the concept of channel, used in Red Hat Network (RHN). Each channel is an abstraction that corresponds to a set of objects (packages, utilities or collections). The Publisher feeds channels by publishing or un-publishing objects in channels. Publishing/un-publishing in a channel adds/removes objects to/from the set of objects of that channel. For each channel, permissions can be assigned to distinct users. Clients, depending on their permissions, may subscribe to channels. A subscription to a channel should indicate:
- the objects of interest : new/all objects, a query filter
- the notification moment : immediate or periodic
- the action to do on notification : simple notification, automatic download
Subscription methods associated to the Publication role
- Channel createChannel (ChannelName, ChannelDescription, ObjectID[], AccessRights)
- (un)publishPackageToChannel(PackageID, ChannelName, Date)
- (un)publishUtilityToChannel(UtilityID, ChannelName, Date)
- (un)publishCollectionToChannel(CollectionID, ChannelName, Date)
- SubscriptionID addSubscribtion (ChannelName, SubscriberInfo, Subscription) : function called by the Publisher when receiving a subscription request from some client
- removeSubscribtion (ChannelName, SubscriberInfo, SubscriptionID) : function called by the Publisher when receiving a subscription cancellation request from some client
- multicast (ObjectID[], Peer[]) : multicast distribution of objects to subscribers
- ...
Subscription methods associated to the Client role
- ChannelName[] getChannelList()
- ChannelDescription getChannelDescription(ChannelName)
- SubscriptionID subscribeToChannel(ChannelName, SubscriberInfo, Subscription)
- unsubscribeToChannel(ChannelName, SubscriberInfo, SubscriptionID)
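The channel mechanism can be illustrated with a small in-process Python sketch. The `Channel` class and its callback-based subscribers are assumptions made for illustration: only immediate, simple notification is modelled, while access rights, query filters, periodic notification, automatic download and network transport are left out.

```python
import itertools
from typing import Callable, Dict, List

# A subscriber is represented by a callback receiving (event, object_id).
Callback = Callable[[str, str], None]


class Channel:
    def __init__(self, name: str, description: str = "") -> None:
        self.name = name
        self.description = description
        self.objects: List[str] = []               # objects currently in the channel
        self.subscribers: Dict[int, Callback] = {}
        self._next_id = itertools.count(1)

    def add_subscription(self, callback: Callback) -> int:
        """Register a subscriber and return its subscription ID."""
        subscription_id = next(self._next_id)
        self.subscribers[subscription_id] = callback
        return subscription_id

    def remove_subscription(self, subscription_id: int) -> None:
        self.subscribers.pop(subscription_id, None)

    def publish(self, object_id: str) -> None:
        """Add an object to the channel and notify every subscriber."""
        if object_id not in self.objects:
            self.objects.append(object_id)
        self._notify("published", object_id)

    def unpublish(self, object_id: str) -> None:
        """Remove an object from the channel and notify every subscriber."""
        if object_id in self.objects:
            self.objects.remove(object_id)
            self._notify("unpublished", object_id)

    def _notify(self, event: str, object_id: str) -> None:
        # Iterate over a copy so a callback may unsubscribe while being notified.
        for callback in list(self.subscribers.values()):
            callback(event, object_id)
```

A client that wants automatic download would, in this model, pass a callback that fetches the object when it receives a `"published"` event.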
Querying
The API should enable searching for data objects not only according to their IDs, but also by other criteria such as: functionality, license, status, size, etc. The Metadata will be queried in order to locate the wanted objects. A simple query language will be used to query metadata values.
- PackageID[] queryPackages(Query)
- UtilityID[] queryUtilities(Query)
- CollectionID[] queryCollections(Query)
- Version_Number getLastVersionNb(Name, Type)
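Since the query language is not specified here, the following Python sketch only illustrates the idea under explicit assumptions: a query is a list of `(field, operator, value)` conditions that must all hold, the index maps `(name, version)` pairs to metadata dictionaries, and version numbers are dot-separated integers. None of these choices is prescribed by the API.

```python
import operator
from typing import Any, Dict, List, Optional, Tuple

ObjectKey = Tuple[str, str]          # (Name, Version_Number)
Condition = Tuple[str, str, Any]     # (metadata field, operator, value)

OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def query_packages(index: Dict[ObjectKey, dict],
                   query: List[Condition]) -> List[ObjectKey]:
    """Keys of the objects whose metadata satisfies every condition of the query."""
    result = []
    for key, metadata in index.items():
        if all(field in metadata and OPERATORS[op](metadata[field], value)
               for field, op, value in query):
            result.append(key)
    return sorted(result)


def version_key(version: str) -> Tuple[int, ...]:
    """Compare versions numerically, so that 1.10 is later than 1.2."""
    return tuple(int(part) for part in version.split("."))


def get_last_version_nb(index: Dict[ObjectKey, dict],
                        name: str, type_: str) -> Optional[str]:
    """Latest published version of `name` with the given Type, if any."""
    versions = [version for (n, version), metadata in index.items()
                if n == name and metadata.get("Type") == type_]
    return max(versions, key=version_key) if versions else None


# Hypothetical index content.
INDEX = {
    ("pkg-a", "1.2"): {"Type": "binary", "License": "GPL", "Size": 2048},
    ("pkg-a", "1.10"): {"Type": "binary", "License": "GPL", "Size": 4096},
    ("pkg-b", "0.9"): {"Type": "source", "License": "BSD", "Size": 512},
}

small_gpl = query_packages(INDEX, [("License", "==", "GPL"), ("Size", "<", 3000)])
```

The numeric `version_key` matters because a plain string comparison would rank `"1.2"` after `"1.10"`.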
Database and Transactions
Security Issues
Conclusions and Perspectives
Bibliography
Version 1.19 last modified by DanVodislav on 04/11/2005 at 15:05