Considerations for a historical database

Some statistics

On 2006-05-15, ara has indexed 52,077 packages and 20,993 units. The marshalled, hash-consed metadata takes about 27 megabytes. This makes about 518 bytes per package.

The raw DBFS database covering the dates from 2006-01-01 to 2006-04-18 has 34,184,661 bytes of metadata (total size of the meta files) for 37,155 packages (i386 only). This makes about 920 bytes per package.

The SQL database covering the same dates has 58,858 packages, 25,239 sources and 28,098 units. A total of 73,901 versions are talked about. There are 413,191 requirements and 377,099 disjunctions. The lifetime entries total 4,721,284 columns.

The maximum number of versions per unit is 62, the average being 3.13.

There is an average of 7 requirements per package.

CodeTypeNumberMinAvgMax
1Provides8,98811.4334
2Depends53,42305.55106
3Pre-Depends78901.309
4Conflicts16,03402.06133
5Replaces13,70501.7326
6Suggests15,37301.9633
7Enhances58001.4128
8Recommends8,91401.62125

Debian knows of 13 architectures, 5 of which for non-discontinued processors. About 700 versions are produced per day.

Increase of the number of cumulated packages

Exponential model

On 2006-01-01, there were 16,985 units for the i386 architecture in Debian unstable main i386. On 2006-04-18 this number increased to 17,779. Hence in 108 days, we got an increase of 4.67%. Taking the 108-th root of that ratio, we see that the daily increase ratio is alpha=1.0004231, from which we deduce a yearly ratio of 1.17. This correlates roughly with the number of packages in previous Debian releases. We may therefore assume that the number of packages increases no more than 25% per year.

Extrapolation from source

In the past, policy changes leading to the splitting of package into lib*, *-doc, *-common and *-core like packages lead to an increase in the number of packages per given source. From litterature on the subject:

DateNumber of sources
1998-071,096
1999-031,551
2000-082,611
2005-068,633
2006-029,500
2010-0115,000 (estimated)

Linear model

The number of packages per architecture increases at a rate of about 100 packages per day per architecture or 36,500 packages per architecture per year. Hence in four years we should have about 730,000 packages total.

Total package flux

The package flux F is defined as the number of new packages that enter the distribution per unit of time. A package is defined as a triple (architecture, unit, version). From the above we estimate F as packages_per_day * architectures = 100 x 5 = 3650. Over 5 architectures this makes about 200,000 packages per year. Of course the package flux is not constant. However our metadata does not span a sufficient range for doing meaningful predictions on the package flux.

Assuming package flux is proportional to number of packages, we may expect the flux to increase similarly at a rate of 25% per year.

Number of sources per day : about 2.75 new source units per day. About 5.91 new units per day.

Estimated lifespan requirements

Ideally, the database should be able to handle the complete history of the Debian metadata. However, day-to-day metadata older than 2005-03-01 seems to have been lost. Assuming the system works at least one more year (until the end of the project) makes the absolute minimum lifespan two years. Of course, continuing for three years after the project completes is reasonable. Hence we need to be able to handle 5 years of data.

Estimated size of database

Hence, 4 years from now, the number of packages in unstable can be predicted as 45,000, and the yearly flux as 8,500 packages per day, assuming the number of architectures remains stable. The cumulated number of packages in the database can therefore be estimated as roughly 7.4 million by march 2009.

Adding a safety factor of two, we want our database engine to be able to process 15 million packages by 2009. Following Moore's law, if we allow for a memory consumption of 0.5 GB today, we may allow a consumption of 3.25 GB by 2009.

Hence, the system should be engineered so as to be able to process 15 million packages in 3.25 GB of memory. In other words, the database should use no more than 200 bytes of RAM per package (architecture,unit,version). Similarly, if we allow today 5 gigabytes for disk storage, it should not use more than 32 GB in four years, i.e., no more than 2 kilobytes per package today. We can therefore conclude that :

  • Storage of the raw metadata on disk, at about 520 bytes per package plus filesystem overhead, is perfectly adequate.
  • Using 32-bit pointers should be more or less okay.
  • 200 bytes per package seems scarce enough to bother with hash-consing. In addition, multiple indexes, and in particular functional ones, may have a high overhead and empirical measurements from history, anla and debcheck will be welcome.
Assuming we use an unshared textual representation that takes about 520 bytes per package, this would give about 780 megabytes. However ara currently indexes only one architecture. It has been experimentally verified that the added storage requirement of supplementary architectures is very small, due to the large number of fields shared between the metadata of a given package accross different architectures.

Empirical memory usage

Anla uses about 300 megabytes of memory for 60,000 packages which translates to about 5kB per package. This is too much and needs to be cut down. The history-cache takes about 46 MB on disk (about 750 bytes per package). Installability data takes only 1 MB. Very roughly, there seems to be an expansion factor of 1:6.5 between marshalled and in-memory data.

File lists

We would like to add to each package the list of files it contains. Etch Contents for i386 has about 1.7 million files. Hence there are about 100 files per package. On average, the complete path for each file takes 77 characters. The total Contents (which does not include files created by installation scripts) takes 135 MB. This makes 675 MB for 5 architectures. Hence a shared, path-based tree representation is vital for representing individual paths. Adding MD5 sums (16 bytes) for these will take another 136 MB total for all 5 architectures. This is feasible for disk storage. However, should we choose to do RPM-style collision checking, 16 or 32 bytes of MD5 should be sufficient (disk lookups can be done for resolving ambiguities).

Assuming each file is given a single integer identifier, this would amount to 800 bytes per package on average for a list of files, which is out of the question. Using an array would only reduce this to 400 bytes. However path-names are usually architecture-independent. Hence we may hope to divide these figures by the number of architectures, which brings us to 160 or 80 bytes which is more reasonable.

Other metadata fields

The Description field

By large the longest field in all packages, it takes a few hundred bytes on average. Fortunately this field does not change much between versions or between packages.

Dependency fields

These fields change quite frequently, however their individual components may change less frequently. Hash-consing requirements would have been useful if the overhead of the index structure was small. However as a requirement only takes a few words, it seems that this is not really necessary. One may, however, try to use an array instead of a linked list to save some space.

Other fields

A compact, run-length encoded bitvector could be used to represent fields like the Tag: field.

Requirements for functional data structures

To allow maximal sharing in a "package laboratory" tool that can have multiple views, or a query server daemon with multiple contexts, the use of functional data structures can advocated. However:

  • The Set and Map structures have high overhead,
  • The user-supplied modifications are likely to be small.

Speed requirements

The response to queries should be instantaneous when ``cold-started'' from the command-line interface. The only way to achieve this is to have a daemon running. However, the daemon should not take too long to start. One minute seems to be an acceptable limit. Hence 3.5 GB of memory representation should be constructed in 60 seconds, that makes a data construction rate of 60 MB/s in 2009 and of 9 MB/s today, which seems quite reasonable.

Conclusions

We have estimated that the acceptable memory consumption per package, all reasonable architectures included, is about 1kB per package.

Memory usage of current higher-level tools such as anla or history is five times higher than this limit.

Further investigations are needed but we assume that this is due to the heavy use of indexing structures. It should be noted that the current backend uses hashtables as its primary data structure, and that Ocaml hashtables have about two to three words overhead per entry, whereas Set and Map structures have four to five words overhead.

Recommendations

  • Use a compact, list of intervals representation for lifetimes.
  • Use a path tree for representing file names
  • Share file lists between architectures
  • Share strings, especially description fields
  • Think about using a hybrid approach where functional patches are applied to read-only non-functional core of large data.
Version 1.17 last modified by RobertoDiCosmo on 17/05/2006 at 23:06

Comments 0

No comments for this document

Attachments 0

No attachments for this document

Creator: Berke on 2006/05/15 14:32
Copyright EDOS Consortium
1.1.1