Metadata management in EDOS

5.1 Overview

Metadata management is a key issue in the distribution process. We aim to build a global, distributed information system about the content to be disseminated in the network (software packages). This system is fed with content metadata. The ability to express complex queries over metadata and to process them efficiently in a distributed setting is a major contribution of this project.

In the broadest sense, metadata consists of properties of the content units. For example, metadata for a package includes its name, version, size, checksum, licenses, dependencies on other packages, etc. In the Mandriva Linux distribution, this information can be extracted from the RPM files. Moreover, in the current Mandriva configuration, metadata files (called hdlists) are already extracted by the publisher and distributed to mirrors. These files may be very large (they group the metadata for all packages of a large collection), and keeping them up to date across the network is a major problem.

Existing tools for searching packages by metadata values have limited functionality. There are typically many specialized tools, one per query type, and even those addressing a broader category of queries (e.g. urpmi) are difficult to use (complex syntax) and do not cover all metadata properties. Moreover, these tools need local copies of the hdlist files, which have to be downloaded and kept up to date; this is a serious limitation.

The metadata management solution we propose overcomes these drawbacks. We provide distributed management of metadata, where queries address the network rather than local structures that are hard to keep consistent when changes occur. We define a standard query language for metadata (based on a subset of XQuery) instead of separate tools for the various query types.

5.2 Metadata modeling

Several approaches to model metadata or parts of it exist in the EDOS project. For the Project Management Interface (PMI), which provides a general API for the whole F/OSS management process, metadata is simply a set of attributes attached to the content units. Also, WP2 focused on precise modeling of a single metadata attribute: dependencies between packages.

For the distribution process we adopted a general metadata model (dealing with all metadata properties) that is more precise than the PMI model but compatible with both the PMI and WP2 models. This model results from the need to index and query metadata in a distributed system and to keep it consistent as it changes.

The metadata model follows the data model for F/OSS distribution described in the Distribution API annex, in which content units are classified in packages, utilities and collections, and organized in hierarchies. We classify metadata properties in several categories:

  • identifiers, i.e. properties that uniquely identify a content unit. In our model, this concerns the name and the version number of the content unit.
  • static properties, i.e. properties that do not change in time for a published content unit. Examples of such properties in our model: size, category, checksum, license, dependencies, etc. For a given content unit identifier, these properties do not change.
  • changing properties, i.e. properties that may vary over time and are therefore the most difficult to manage. In our model, there are two such metadata properties:
    • location, i.e. the set of locations where replicas of the content unit exist in the network. Each replica is identified by its content unit identifier and its network location.
    • composition, which applies to collections and denotes the set of content units (packages, utilities or sub-collections) that are children of the given collection in the content hierarchy. The composition of a collection may vary over time, e.g. when a new package is added or an existing one is replaced by a new version.
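
The three categories above can be sketched as a small data structure. This is only an illustration of the classification, not the EDOS implementation: the class and field names (ContentUnitId, ContentUnitMetadata, add_replica, replace_child) are hypothetical, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class ContentUnitId:
    """Identifier properties: name and version identify a content unit."""
    name: str
    version: str


@dataclass
class ContentUnitMetadata:
    unit_id: ContentUnitId
    # Static properties: fixed once the content unit is published
    # (size, category, checksum, license, dependencies, ...).
    static: Dict[str, str] = field(default_factory=dict)
    # Changing property 1: network locations holding a replica.
    locations: Set[str] = field(default_factory=set)
    # Changing property 2 (collections only): children of the collection.
    composition: List[ContentUnitId] = field(default_factory=list)

    def add_replica(self, location: str) -> None:
        """Record a new replica location for this content unit."""
        self.locations.add(location)

    def replace_child(self, old: ContentUnitId, new: ContentUnitId) -> None:
        """Replace one child of a collection, e.g. by a newer version."""
        self.composition = [new if c == old else c for c in self.composition]
```

Only locations and composition are ever modified after publication in this sketch; identifiers and static properties are set once, which mirrors the distinction made above.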

5.2.1 Representation choices for implementation

We build our distributed metadata management system on top of KadoP (see the "State of the art" and "Distribution system architecture" chapters), a P2P information system based on a distributed hash table (DHT). When a new content unit is published in the system, its metadata is indexed in the DHT. An important advantage of KadoP is that it can manage XML data, and not only key-value pairs as usual DHTs do.

In this context, there are two main ways to represent metadata properties in the system:

  • as key-value pairs in the DHT, where the key is the pair (property name, property value) and the value is a list of content unit identifiers.
  • as XML documents, managed by KadoP.
The advantage of the first representation is that it provides simpler updates and better performance for simple look-up queries (or for conjunctions of such queries), because only the index is accessed. The advantage of the XML representation is twofold:
  • generality: XML is a standard data format; it allows grouping the metadata of a content unit in a single structure and easily representing complex properties (e.g. dependencies).
  • query language: a powerful (XQuery-like) query language for XML metadata documents is available in KadoP.
The choice made for the EDOS distribution system favours the XML representation, but also uses the key-value one when frequent updates and look-up queries on a property are necessary. It may be summarized as follows:
  • identifier and static properties are represented in XML documents
  • location is represented as a key-value pair in the DHT, because updates (for each new replica in the network) and look-up queries on location (for downloading) are very frequent in the system.
  • composition is included in the XML representation in a particular way, as described below.
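
To illustrate the key-value representation, here is a minimal in-memory sketch in which a plain Python dictionary stands in for the DHT. The key is the pair (property name, property value) and the value is a set of content unit identifiers, as described above. The class PropertyIndex and its methods are hypothetical names, not part of KadoP.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

UnitId = Tuple[str, str]  # (name, version)
Key = Tuple[str, str]     # (property name, property value)


class PropertyIndex:
    """Toy stand-in for a DHT: (property, value) -> content unit identifiers."""

    def __init__(self) -> None:
        self._table: DefaultDict[Key, Set[UnitId]] = defaultdict(set)

    def put(self, prop: str, value: str, unit: UnitId) -> None:
        """Index one property value of one content unit."""
        self._table[(prop, value)].add(unit)

    def get(self, prop: str, value: str) -> Set[UnitId]:
        """Simple look-up query: only the index is accessed."""
        return set(self._table.get((prop, value), set()))

    def conjunction(self, *criteria: Key) -> Set[UnitId]:
        """Conjunction of look-ups: intersect the identifier sets."""
        sets = [self.get(prop, value) for prop, value in criteria]
        return set.intersection(*sets) if sets else set()
```

A look-up such as get("license", "GPL") touches a single key, which is why this representation suits properties that are updated and queried very frequently.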

5.3 XML representation of metadata

Ideally, all the metadata properties of all the content units in a distribution should be grouped together in a single XML file. This would allow expressing general tree (XQuery) queries on any metadata property of any content unit in the distribution, with the best query processing performance.

The problem is that such a document is huge and that any update would be very time consuming, because this very large document must be re-indexed in the DHT. Updates concern changing metadata properties (location and composition). Location is not stored in the XML representation, but we want to keep composition there, in order to be able to query it.

The solution is to split the metadata into one separate XML file per content unit and to represent collection composition as "links" to the metadata files of other content units. We use the ability of KadoP to index and query ActiveXML documents by representing these links as ActiveXML service calls. When activated, each service call returns the metadata XML file of the corresponding child. An update is thus limited to a single content unit metadata file instead of affecting the metadata of the whole distribution.
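
The following sketch illustrates the idea of composition links that are resolved on demand. It does not use ActiveXML or KadoP: the SERVICE_CALL element with name and version attributes, the in-memory STORE, and the functions fetch_metadata and activate are all assumptions made for the example, and the sample content units are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical store: one small metadata document per content unit.
STORE = {
    ("main", "2006.0"): """<CONTENT_UNIT_METADATA>
  <ID><NAME>main</NAME><VERSION>2006.0</VERSION></ID>
  <PROPERTIES><TYPE>collection</TYPE></PROPERTIES>
  <COMPOSITION>
    <CHILD><SERVICE_CALL name="foo" version="1.0"/></CHILD>
  </COMPOSITION>
</CONTENT_UNIT_METADATA>""",
    ("foo", "1.0"): """<CONTENT_UNIT_METADATA>
  <ID><NAME>foo</NAME><VERSION>1.0</VERSION></ID>
  <PROPERTIES><TYPE>package</TYPE><SIZE>1024</SIZE></PROPERTIES>
</CONTENT_UNIT_METADATA>""",
}


def fetch_metadata(name: str, version: str) -> ET.Element:
    """Stand-in for a service call: return the child's metadata document."""
    return ET.fromstring(STORE[(name, version)])


def activate(root: ET.Element) -> ET.Element:
    """Replace every composition link by the metadata tree it points to."""
    for child in root.findall("./COMPOSITION/CHILD"):
        call = child.find("SERVICE_CALL")
        if call is None:
            continue
        resolved = activate(fetch_metadata(call.get("name"), call.get("version")))
        child.remove(call)
        child.append(resolved)
    return root
```

In this sketch, publishing a new version of foo only requires adding one entry to STORE and changing one link in the collection document; no other file is rewritten.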

The structure of a content unit metadata file could be summarized as follows:

  • a common root for the metadata properties of the content unit
  • a child element of the root for identifier properties and another for static metadata properties, containing the property values. The set of static properties is different for different content unit types (e.g., richer for packages)
  • for collections, a child element of the root, containing the composition service calls for all the collection's children.
This structure could be described by the following DTD. Most static properties are marked optional because they may not appear for every content unit type.

<!DOCTYPE CONTENT_UNIT_METADATA [
<!ELEMENT CONTENT_UNIT_METADATA (ID, PROPERTIES, COMPOSITION?)>
<!ELEMENT ID (NAME, VERSION)>
<!ELEMENT PROPERTIES (TYPE, DOMAIN*, SIZE?, CHECKSUM?, LICENSE?, ...)>
<!ELEMENT COMPOSITION (CHILD*)>
<!ELEMENT CHILD (SERVICE_CALL)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT VERSION (#PCDATA)>
<!ELEMENT TYPE (#PCDATA)> <!-- one of "package", "utility", "collection" -->
...
]>
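
As an illustration of the structure described above (a common root, an identifier child, a static-properties child and, for collections, a composition child), the sketch below builds such a document with Python's standard xml.etree.ElementTree module. The function build_metadata, its parameters, and the name and version attributes on SERVICE_CALL are hypothetical; the real encoding of service calls is defined by ActiveXML and is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_metadata(name, version, unit_type, domains=(), size=None,
                   checksum=None, license_name=None, children=()):
    """Build one content unit metadata document (illustrative only)."""
    root = ET.Element("CONTENT_UNIT_METADATA")

    # Identifier properties.
    ident = ET.SubElement(root, "ID")
    ET.SubElement(ident, "NAME").text = name
    ET.SubElement(ident, "VERSION").text = version

    # Static properties; optional ones are emitted only when provided.
    props = ET.SubElement(root, "PROPERTIES")
    ET.SubElement(props, "TYPE").text = unit_type
    for domain in domains:
        ET.SubElement(props, "DOMAIN").text = domain
    optional = (("SIZE", size), ("CHECKSUM", checksum), ("LICENSE", license_name))
    for tag, value in optional:
        if value is not None:
            ET.SubElement(props, tag).text = str(value)

    # Composition: one link per child, for collections only.
    if unit_type == "collection":
        comp = ET.SubElement(root, "COMPOSITION")
        for child_name, child_version in children:
            call = ET.SubElement(ET.SubElement(comp, "CHILD"), "SERVICE_CALL")
            call.set("name", child_name)
            call.set("version", child_version)
    return root
```

Calling ET.tostring(build_metadata("foo", "1.0", "package", size=1024), encoding="unicode") returns the serialized document for a hypothetical package foo.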

5.4 Future work concerning metadata management

Besides the implementation of the EDOS distribution system based on KadoP, we plan the following actions concerning metadata management.

Building a tool for extracting XML metadata files from RPM packages

Ceve is a tool already developed by WP2 for metadata extraction. It is an OCaml program that parses RPM and Debian packages and extracts metadata attributes into an intermediary format. Ceve supports several input options.

The tool was initially designed to extract the dependency attributes, but more complex extensions can easily be added. We plan to enhance Ceve, in collaboration with WP2, in order to extend its use in the distribution system. More precisely, we need to extract all the RPM metadata that is relevant for indexing by KadoP, according to the DTD structure described above.

We are also studying the possibility of using the RPM-Metadata format proposed by Duke University (http://linux.duke.edu/), or of adapting it to our needs. The goal of the Duke project was to define a unified XML format for both RPM and Debian packages and to break the metadata out into multiple files. Some code has already been developed to generate files in that format from packages, but activity on the project seems to have decreased recently.

Query processing optimization

  • Load balancing: storing all the metadata files only on the Publisher may create a bottleneck during query processing. We will introduce metadata replication (in parallel with content replication) and modify query processing in KadoP accordingly.
  • Location management: the list of replica locations for a content unit may be very large, while only a few of these locations are normally used in download optimization. Transferring and updating such large lists is time consuming. We want to replace the list of locations in the index with a web service that returns only part of the list, which should also make updates cheaper.
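
The location management idea can be sketched as a service that keeps the full replica list on its side and hands out only a bounded subset per request. This is a sketch under stated assumptions, not a planned design: the class LocationService, its methods, the limit parameter and the random selection policy are all hypothetical, and a real service would more likely pick locations according to download optimization criteria.

```python
import random
from typing import Dict, List, Optional, Tuple

UnitId = Tuple[str, str]  # (name, version)


class LocationService:
    """Toy location service: stores all replicas, returns only a few."""

    def __init__(self) -> None:
        self._replicas: Dict[UnitId, List[str]] = {}

    def register(self, unit: UnitId, location: str) -> None:
        """Record a new replica; the update stays local to the service."""
        locations = self._replicas.setdefault(unit, [])
        if location not in locations:
            locations.append(location)

    def sample(self, unit: UnitId, limit: int = 3,
               rng: Optional[random.Random] = None) -> List[str]:
        """Return at most `limit` locations instead of the full list."""
        locations = self._replicas.get(unit, [])
        if len(locations) <= limit:
            return list(locations)
        return (rng or random).sample(locations, limit)
```

With this arrangement, the index would only need to hold a reference to the service, so adding a replica would not require transferring or re-indexing a long list.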
Version 1.14 last modified by StephaneLauriere on 02/01/2006 at 14:59

Creator: DanVodislav on 2005/12/20 00:35
Copyright EDOS Consortium