Filesystem-based historical database for metadata
Motivations
Until now, we have used MySQL as a backend for storing historical metadata. However, this has serious disadvantages:
- The SQL protocol is not type-safe and there are no well-known typeful bindings.
- The storage overhead is significant: about 50MB of data is stored as 200MB of tables.
- Interesting SQL queries involving dependencies can be very slow.
- Queries handling transitive dependencies are not possible.
- As a result of the above two points, SQL is only used as a storage engine.
- As the Debian version number ordering is not included in MySQL, it is difficult to insert a new version number without updating the whole versions table.
- Adding the data for a single day, across (stable + unstable + testing) x (main + contrib + non-free), takes a few minutes.
- Loading the data also takes many minutes.
- The same slowness also applies to PostgreSQL.
- Therefore, RDBMSs such as MySQL and PostgreSQL are not suitable as storage engines for performance reasons, although using one would have been preferable.
Data that needs to be stored
- For every package, we want, for every distribution (stable, unstable, testing), component (main, contrib, non-free) and architecture (i386, ia64...), its lifetime, that is, the set of dates during which that package is present.
- For every package, we also want the complete unparsed metadata. This is necessary in order to:
- Constitute permanent archives
- Guard against parsing and interpretation errors
- Create a backend for ara?
- Use existing tools (apt, etc.) that may require specific fields.
- We also want the same for source packages.
- Lifetimes are usually single intervals. However, this cannot be assumed to always be true, because:
- No rule in the Debian policy guide forbids reintroduction of a previous package.
- There can be holes in the data.
- Data can be loaded out-of-order.
- The lifetimes of a given package under different architectures tend to be close, but they are not always identical. Hence, for compaction purposes, there is little to exploit in that correlation.
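Because data can arrive out of order and with holes, a loader cannot simply append to a lifetime. The following is a minimal sketch of inserting one date into a lifetime, using the interval-list representation described under "Format of the lifetime files". The names `next_day`, `add_interval` and `add_day` are hypothetical, and merging intervals that touch (rather than keeping them separate) is an assumption of this sketch, not something the format requires.

```ocaml
let is_leap y = (y mod 4 = 0 && y mod 100 <> 0) || y mod 400 = 0

let days_in_month y m =
  match m with
  | 4 | 6 | 9 | 11 -> 30
  | 2 -> if is_leap y then 29 else 28
  | _ -> 31

(* Calendar successor of a valid YYYYMMDD date. *)
let next_day d =
  let y = d / 10000 and m = d / 100 mod 100 and day = d mod 100 in
  if day < days_in_month y m then d + 1
  else if m < 12 then (y * 10000) + ((m + 1) * 100) + 1
  else ((y + 1) * 10000) + 101

(* Insert the interval (a, b) into a sorted, disjoint interval list,
   merging it with any interval it overlaps or touches. *)
let rec add_interval (a, b) = function
  | [] -> [ (a, b) ]
  | (x, y) :: rest when next_day y < a -> (x, y) :: add_interval (a, b) rest
  | (x, y) :: rest when next_day b < x -> (a, b) :: (x, y) :: rest
  | (x, y) :: rest -> add_interval (min a x, max b y) rest

(* Record that the package was present on day d. *)
let add_day d lifetime = add_interval (d, d) lifetime
```

With this sketch, loading the days 20060103, 20060101 and 20060102 in that order yields the single interval (20060101, 20060103), while a hole in the data leaves two separate intervals.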
Extra data that could be stored
The results of the installability checks are expensive to compute. Therefore, we would like to store them alongside the rest of the data.
Structure
- debian/hashcode/unit/version/
- source
- meta
- inside:~archive_1~:~component_1~
- installability
- life
- ...
- ~architecture1~
- meta
- inside:~archive_1~:~component_1~
- installability
- life
- ...
- source
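As an illustration, here is a sketch of how file paths could be built from this layout. It rests on one possible reading of the list above: each `source` or architecture directory holds a `meta` file and one `inside:archive:component` directory per archive and component, which in turn holds `life` and `installability`. The document does not say how `hashcode` is derived, so it is passed in as an argument. All function and type names are hypothetical.

```ocaml
(* Either the source package or one binary architecture such as "i386". *)
type target = Source | Architecture of string

let target_dir = function Source -> "source" | Architecture a -> a

(* Join path components under the top-level "debian" directory. *)
let path parts = List.fold_left Filename.concat "debian" parts

let inside_dir ~archive ~component =
  Printf.sprintf "inside:%s:%s" archive component

(* Assumed location of the unparsed metadata file. *)
let meta_file ~hashcode ~unit_name ~version target =
  path [ hashcode; unit_name; version; target_dir target; "meta" ]

(* Assumed location of the lifetime file for one archive and component. *)
let life_file ~hashcode ~unit_name ~version ~archive ~component target =
  path
    [ hashcode; unit_name; version; target_dir target;
      inside_dir ~archive ~component; "life" ]
```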
Format of the lifetime files
The lifetime file describes the set of dates on which a source or binary package is present in a given part of the distribution. Dates are Gregorian calendar dates encoded as integers of the form YYYYMMDD. In other words, April 1st, 2006 is the integer 20060401.
Each lifetime is a list of intervals. An interval is a pair of integers (x,y), both valid dates, that represents the set of (valid) dates z between x and y inclusive, i.e. x <= z <= y. It is therefore not possible to represent empty intervals, which is a nice property. Let D(x,y) stand for the set of days between x and y inclusive.
A lifetime is therefore an (int * int) list [ x1,y1;x2,y2;...;xn,yn ] satisfying the following:
- for all i, xi and yi are valid dates
- for all i, xi <= yi (we have valid intervals)
- for all i, yi < x{i+1} (the intervals are given in increasing order and are disjoint)
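The three conditions above can be checked mechanically. The sketch below is one way to do so. The names `valid_date`, `well_formed` and `mem` are hypothetical, and "valid date" is taken here to mean an existing day of the Gregorian calendar with a year of at least 1, which is an assumption.

```ocaml
let is_leap y = (y mod 4 = 0 && y mod 100 <> 0) || y mod 400 = 0

(* Number of days in month m of year y, or 0 if m is not a month. *)
let days_in_month y m =
  match m with
  | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> 31
  | 4 | 6 | 9 | 11 -> 30
  | 2 -> if is_leap y then 29 else 28
  | _ -> 0

(* True when the integer d, read as YYYYMMDD, is an existing calendar day. *)
let valid_date d =
  let y = d / 10000 and m = d / 100 mod 100 and day = d mod 100 in
  y >= 1 && day >= 1 && day <= days_in_month y m

(* Check the three lifetime conditions: valid dates, xi <= yi, yi < x{i+1}. *)
let rec well_formed = function
  | [] -> true
  | [ (x, y) ] -> valid_date x && valid_date y && x <= y
  | (x, y) :: ((x', _) :: _ as rest) ->
      valid_date x && valid_date y && x <= y && y < x' && well_formed rest

(* True when date z belongs to the lifetime. *)
let mem z lifetime = List.exists (fun (x, y) -> x <= z && z <= y) lifetime
```

For example, `[ (20060401, 20060410); (20060415, 20060420) ]` passes the check, while `[ (20060410, 20060401) ]` is rejected because its interval is reversed.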
(* Print a lifetime on standard output, one "x,y" interval per line. *)
let dump = List.iter (fun (x,y) -> Printf.printf "%d,%d\n" x y);;
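Assuming a lifetime file contains exactly what `dump` prints, that is, one `x,y` pair per line, it could be read back roughly as follows. The names `interval_of_line`, `lifetime_of_string` and `string_of_lifetime` are hypothetical, and skipping blank lines is a choice made for this sketch.

```ocaml
(* Parse one "x,y" line into an interval; Scanf raises an exception
   if the line does not start with two comma-separated integers. *)
let interval_of_line line = Scanf.sscanf line "%d,%d" (fun x y -> (x, y))

(* Parse the full contents of a lifetime file, skipping blank lines. *)
let lifetime_of_string s =
  String.split_on_char '\n' s
  |> List.filter (fun l -> String.trim l <> "")
  |> List.map interval_of_line

(* Inverse of lifetime_of_string, in the same line format as dump. *)
let string_of_lifetime lifetime =
  String.concat ""
    (List.map (fun (x, y) -> Printf.sprintf "%d,%d\n" x y) lifetime)
```

This parser does not check the lifetime conditions, so a reader of untrusted files would want to validate the result separately.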
Format of the metadata files
The metadata files are in Debian format. No transformation of any kind should be applied to them; in particular, non-ASCII characters should not be subject to any kind of transcoding. We hope that all the files will be UTF-8, but there are many irregularities.
Format of the installability files
This format is to be defined.
Notes on filesystems
It is strongly recommended to use ReiserFS to store the database. Preliminary tests show that filling and extending the lifetimes of 15,000 packages takes from about 30 seconds (ReiserFS) to 1 minute (Ext3). Performance is much higher with ramfs. The current loader ("libmetadata") is able to load, on a single architecture, 37155 packages in about a minute from RAM and in 6 minutes from a ReiserFS partition (for which tar takes a minute and a half to create a six-megabyte archive).
Version 1.18 last modified by Berke on 13/04/2006 at 15:34