Metadata management in EDOS
5.1 Overview
Metadata management is a key issue in the distribution process. We aim to build a global, distributed information system about the content to be disseminated in the network (software packages). This system is fed with content metadata. The ability to express complex queries over metadata and to process them efficiently in a distributed setting is a major contribution of this project.
In the broadest sense, metadata consists of properties of the content units. For example, metadata for a package includes the name, the version, the size, the checksum, the licenses, dependencies on other packages, etc. In the Mandriva Linux distribution, this information can be extracted from the RPM files. Moreover, in the current Mandriva configuration, metadata files (called hdlists) are already extracted by the publisher and distributed to mirrors. These files may be very large (each groups the metadata for the packages of a large collection), and keeping them up to date across the network is a major problem.
Existing tools for searching packages by metadata values have limited functionality. There are typically many specific tools, each for a different query type, and even those that address a larger category of queries (e.g. urpmi) are difficult to use (complex syntax) and do not cover all metadata properties. Moreover, these tools need local copies of the hdlist files, which have to be downloaded and kept up to date; this is a serious limitation.
The metadata management solution we propose overcomes these drawbacks. We provide distributed management of metadata, where queries address the network rather than local structures that are hard to keep consistent when changes occur. We define a standard query language for metadata (based on a subset of XQuery) instead of separate tools for the various query types.
5.2 Metadata modeling
Several approaches to modeling metadata, or parts of it, exist in the EDOS project. For the Project Management Interface (PMI), which provides a general API for the whole F/OSS management process, metadata is simply a set of attributes attached to the content units. WP2, in turn, focused on the precise modeling of a single metadata attribute: dependencies between packages. For the distribution process we adopted a general metadata model that deals with all metadata properties; it is more precise than the PMI model, but compatible with both the PMI and WP2 models. This model results from the need to index and query metadata in a distributed system and to keep it consistent when changes occur. The metadata model follows the data model for F/OSS distribution described in the Distribution API annex, in which content units are classified into packages, utilities and collections, and organized in hierarchies. We classify metadata properties into several categories:
- identifiers, i.e. properties that uniquely identify a content unit. In our model, these are the name and the version number of the content unit.
- static properties, i.e. properties that do not change over time once a content unit has been published. Examples in our model are size, category, checksum, license, dependencies, etc.
- changing properties, i.e. properties that may vary over time and are the most difficult to manage. In our model, there are two such metadata properties:
  - location, i.e. the set of locations where replicas of the content unit exist in the network. Each replica is identified by its content unit identifier and its network location.
  - composition, which applies to collections only: the set of child content units (packages, utilities or sub-collections) of the given collection in the content hierarchy. The composition of a collection may vary over time, e.g. when a new package is added or an existing one is replaced by a new version.
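The classification above can be illustrated with a small data model. This is only a sketch: the class and field names (ContentUnitId, StaticProperties, ContentUnit, unit_type, and so on) are assumptions made for illustration and are not taken from the EDOS code base.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass(frozen=True)
class ContentUnitId:
    """Identifier properties: name and version identify a content unit."""
    name: str
    version: str


@dataclass(frozen=True)
class StaticProperties:
    """Properties that are fixed once the content unit is published."""
    unit_type: str                      # "package", "utility" or "collection"
    size: int = 0
    checksum: str = ""
    license: str = ""
    dependencies: FrozenSet[str] = frozenset()


@dataclass
class ContentUnit:
    uid: ContentUnitId
    static: StaticProperties
    # Changing properties: these are the only parts that evolve over time.
    locations: Set[str] = field(default_factory=set)
    children: Set[ContentUnitId] = field(default_factory=set)

    def add_replica(self, location: str) -> None:
        """Record a new replica of this content unit in the network."""
        self.locations.add(location)

    def add_child(self, child: ContentUnitId) -> None:
        """Update the composition; only collections have children."""
        if self.static.unit_type != "collection":
            raise ValueError("only collections have a composition")
        self.children.add(child)
```

Making the identifier and static parts immutable (frozen) while leaving location and composition mutable mirrors the distinction drawn in the list: only the last two need an update path.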
5.2.1 Representation choices for implementation
We build our distributed metadata management system on top of KadoP (see the "State of the art" and "Distribution system architecture" chapters), a P2P information system based on a distributed hash table (DHT). When a new content unit is published in the system, its metadata is indexed in the DHT. An important advantage of KadoP is that it can manage XML data, and not only key-value pairs as DHTs usually do. In this context, there are two main ways to represent metadata properties in the system:
- as key-value pairs in the DHT, where the key is the pair (property name, property value) and the value is a list of content unit identifiers.
- as XML documents, managed by KadoP.
The XML representation has the following advantages:
- generality: XML is a standard data format; it allows grouping the metadata of a content unit in a single structure and makes it easy to represent complex properties (e.g. dependencies).
- query language: a powerful (XQuery-like) query language for XML metadata documents is available in KadoP.
The representation we adopted combines both options:
- identifier and static properties are represented in XML documents.
- location is represented as a key-value pair in the DHT, because updates (one for each new replica in the network) and lookup queries on location (for downloading) are very frequent in the system.
- composition is included in the XML representation in a particular way, as described below.
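To make the two key-value layouts concrete, here is a minimal in-memory sketch. It is not KadoP and not a real DHT (nothing is distributed over peers); the class ToyIndex and its method names are hypothetical. It shows the (property name, property value) key of the first option and the per-unit location entry that is kept in the DHT.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

UnitId = Tuple[str, str]        # (name, version)
PropertyKey = Tuple[str, str]   # (property name, property value)


class ToyIndex:
    """In-memory stand-in for a DHT; a real DHT spreads the keys over peers."""

    def __init__(self) -> None:
        self._by_property: DefaultDict[PropertyKey, Set[UnitId]] = defaultdict(set)
        self._locations: DefaultDict[UnitId, Set[str]] = defaultdict(set)

    def index_property(self, unit: UnitId, name: str, value: str) -> None:
        # Key = (property name, property value); value = content unit ids.
        self._by_property[(name, value)].add(unit)

    def lookup_property(self, name: str, value: str) -> Set[UnitId]:
        # .get avoids creating empty entries for keys that were never indexed.
        return set(self._by_property.get((name, value), ()))

    def add_replica(self, unit: UnitId, location: str) -> None:
        # Frequent operation: one small update per new replica.
        self._locations[unit].add(location)

    def locate(self, unit: UnitId) -> Set[str]:
        return set(self._locations.get(unit, ()))
```

A lookup such as `lookup_property("license", "GPL")` answers only exact-match questions on one property, which is why the richer properties are better served by the XML representation and its query language.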
5.3 XML representation of metadata
Ideally, all the metadata properties of all the content units in a distribution would be grouped together in a single XML file. This would allow general tree (XQuery) queries on any metadata property of any content unit in the distribution, with the best query processing performance. The problem is that such a document is huge and any update would be very time-consuming, because the whole document would have to be re-indexed in the DHT. Updates concern the changing metadata properties (location and composition). Location is not stored in the XML representation, but we want to keep composition there in order to be able to query it. The solution is to split the metadata into one separate XML file per content unit and to represent the composition of a collection as "links" to the metadata files of other content units. We use the ability of KadoP to index and query ActiveXML documents by representing these links as ActiveXML service calls. When activated, each service call returns the metadata XML file of the corresponding child. An update is thus limited to a single content unit metadata file instead of the metadata of the whole distribution. The structure of a content unit metadata file can be summarized as follows:
- a common root for the metadata properties of the content unit
- a child element of the root for identifier properties and another for static metadata properties, containing the property values. The set of static properties is different for different content unit types (e.g., richer for packages)
- for collections, a child element of the root, containing the composition service calls for all the collection's children.
The corresponding DTD is:
<!DOCTYPE CONTENT_UNIT_METADATA [
<!ELEMENT CONTENT_UNIT_METADATA (ID, PROPERTIES, COMPOSITION?)>
<!ELEMENT ID (NAME, VERSION)>
<!ELEMENT PROPERTIES (TYPE, DOMAIN*, SIZE?, CHECKSUM?, LICENSE?, ...)>
<!ELEMENT COMPOSITION (CHILD*)>
<!ELEMENT CHILD (SERVICE_CALL)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT VERSION (#PCDATA)>
<!ELEMENT TYPE (#PCDATA)> <!-- one of "package", "utility", "collection" -->
...
]>
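The following sketch shows how composition links might be resolved on demand. It assumes two metadata files shaped like the DTD above and uses a plain `name`/`version` attribute pair on SERVICE_CALL as a placeholder; the real ActiveXML service-call syntax is different and is not reproduced here. The function names `expand` and `package_names` and the `fetch` callback are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

UnitId = Tuple[str, str]  # (name, version)

PACKAGE_XML = """
<CONTENT_UNIT_METADATA>
  <ID><NAME>foo</NAME><VERSION>1.0</VERSION></ID>
  <PROPERTIES><TYPE>package</TYPE><SIZE>1024</SIZE></PROPERTIES>
</CONTENT_UNIT_METADATA>
"""

# SERVICE_CALL attributes are placeholders, not ActiveXML syntax.
COLLECTION_XML = """
<CONTENT_UNIT_METADATA>
  <ID><NAME>main</NAME><VERSION>2006</VERSION></ID>
  <PROPERTIES><TYPE>collection</TYPE></PROPERTIES>
  <COMPOSITION>
    <CHILD><SERVICE_CALL name="foo" version="1.0"/></CHILD>
  </COMPOSITION>
</CONTENT_UNIT_METADATA>
"""


def expand(root: ET.Element, fetch: Callable[[UnitId], ET.Element]) -> ET.Element:
    """Return a copy of `root` with every service call replaced by the
    metadata of the child it points to (recursively).

    Assumes the content hierarchy is a tree; a cycle would recurse forever.
    """
    expanded = copy.deepcopy(root)  # leave the stored per-unit file untouched
    for child in expanded.findall("./COMPOSITION/CHILD"):
        call = child.find("SERVICE_CALL")
        if call is None:
            continue
        target = fetch((call.get("name"), call.get("version")))
        child.remove(call)
        child.append(expand(target, fetch))
    return expanded


def package_names(root: ET.Element) -> List[str]:
    """Names of all packages found in an (expanded) metadata tree."""
    return [
        unit.findtext("ID/NAME")
        for unit in root.iter("CONTENT_UNIT_METADATA")
        if unit.findtext("PROPERTIES/TYPE") == "package"
    ]
```

With a dictionary such as `{("foo", "1.0"): ET.fromstring(PACKAGE_XML)}` used as the `fetch` source, expanding the collection yields one tree that can be queried as if the whole distribution were a single document, while each stored file stays small and independently updatable.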
5.4 Future work concerning metadata management
Besides the implementation of the EDOS distribution system based on KadoP, we plan the following actions concerning metadata management.
Building a tool for extracting XML metadata files from RPM packages
Ceve is a tool already developed by WP2 for metadata extraction. It is an OCaml program that parses RPM and Debian packages and extracts metadata attributes into an intermediary format. There are several possible input options for Ceve:
- Mandriva hdlist files
- .rpm files
- .deb files
- Linux.duke.edu RPM-Metadata format (http://linux.duke.edu/projects/metadata/)
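As an illustration of the last step of such a tool, the sketch below turns already-extracted attributes into a metadata XML file following the structure of section 5.3. The input is a plain dictionary that stands in for whatever intermediary format the extractor produces; it is not Ceve's actual output format, and the function name `to_metadata_xml` and the dictionary keys are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Any, Mapping


def to_metadata_xml(attrs: Mapping[str, Any]) -> str:
    """Serialize extracted attributes as a content unit metadata document.

    Required keys (hypothetical): "name", "version", "type".
    Optional keys: "domains" (a list), "size", "checksum", "license".
    """
    root = ET.Element("CONTENT_UNIT_METADATA")

    # Identifier properties.
    ident = ET.SubElement(root, "ID")
    ET.SubElement(ident, "NAME").text = str(attrs["name"])
    ET.SubElement(ident, "VERSION").text = str(attrs["version"])

    # Static properties, emitted in the order given by the DTD.
    props = ET.SubElement(root, "PROPERTIES")
    ET.SubElement(props, "TYPE").text = str(attrs["type"])
    for domain in attrs.get("domains", ()):
        ET.SubElement(props, "DOMAIN").text = str(domain)
    for key in ("size", "checksum", "license"):
        if key in attrs:
            ET.SubElement(props, key.upper()).text = str(attrs[key])

    return ET.tostring(root, encoding="unicode")
```

Location is deliberately absent from the output, since it is kept in the DHT rather than in the XML file.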
Query processing optimization
- Load balancing: storing all the metadata files only on the Publisher may create a bottleneck in query processing. We will introduce metadata replication (in parallel with content replication) and modify query processing in KadoP accordingly.
- Location management: the list of replica locations for a content unit may be very large, while only a few of these locations are normally used in download optimization. Transferring and updating such large lists is time-consuming. We want to replace the list of locations in the index with a web service that returns only part of the list, which should also make updates cheaper.
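The location management idea can be sketched as a service that stores the full replica list but hands out only a bounded subset. The class LocationService, its methods and the `limit` parameter are hypothetical, and random sampling is only a placeholder policy: the text does not say how the returned subset should be chosen.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

UnitId = Tuple[str, str]  # (name, version)


class LocationService:
    """Keeps every replica location but returns at most `limit` of them."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._replicas: Dict[UnitId, Set[str]] = {}
        self._rng = random.Random(seed)

    def add_replica(self, unit: UnitId, location: str) -> None:
        # An update touches only the service's own state, not a large list
        # stored in the index.
        self._replicas.setdefault(unit, set()).add(location)

    def get_locations(self, unit: UnitId, limit: int = 3) -> List[str]:
        known = sorted(self._replicas.get(unit, ()))
        if len(known) <= limit:
            return known
        # Placeholder policy: a random subset. A real service might prefer
        # nearby or lightly loaded mirrors instead.
        return self._rng.sample(known, limit)
```

The index would then store a reference to this service for each content unit rather than the list itself, so neither lookups nor updates transfer the full list.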
Version 1.14 last modified by StephaneLauriere on 02/01/2006 at 14:59