Peer (Subscription) Clustering

Motivation

  • A new version, or a major update, is made up of a large number of packages
  • The users are diverse in which packages they need/want to download
    • Have different sets of packages installed on their local machines
    • Are interested in different update types (all, security only, bugs and security etc.)
  • On one hand - wouldn't like to distribute each package in its own "environment" (e.g. multicast tree).
    • A user that wants to download an entire version will have to join ~10K multicast trees!
    • Too much overhead of building the trees
  • On the other hand - wouldn't like to force a user to participate in the distribution of packages he has no interest in
    • Wasting bandwidth & storage for a cause he doesn't believe in
    • Latency for getting wanted packages increases

Requirements

  • Formal definition
    • Problem Notation
      • P packages
      • Every subscription (Si) is a bit mask vector of length P, denoting for every possible package whether the user is interested in it or not
      • N nodes (subscribers), each with a bit mask of length P
      • Output: K sets of (disjoint?!) node or package clusters
        • Peer clusters: packages have to be transmitted more than once, in different clusters
          • (-) more load on the source, needs to feed multiple clusters with same packages
          • (+) user participates in just one dissemination "environment"
          • (+) this is how the data clustering algorithms of riabov work
        • Package clusters: user has to join more than one cluster
          • (-) more load on user's machine
          • (+) source feeds a packages only once
        • ==> Maybe I need to put limits on the source machine "load" (number of clusters it feeds, some with the same packages) and the user "load" (number of package clusters it participates in)
        • ==> Must be a middle ground, when no cluster is neccesarily disjoint...

Directions

  • "Intuitive" mechanism that is based on the logic relation between packages
    • Compilation/Runtime dependencies
    • Similarity of packages (e.g. Web browsers)
    • This "knowledge" has to be converted to an operative clustering scheme
  • General algorithms for Data Clustering
    • No prior knowledge of the problem's domain
    • Give a solution that is "close" to optimal
    • Create clusters by merging the "closest" groups in order to minimize some general or local distance function

State-of-the-art

  • Channelization Problem
    • (+) Exact model and notation of the problem
    • (-) Offline algorithms (can be tackled using sampling)
    • (-) Experiments on relatively small sets (scalability)
  • Data Clustering algorithms
  • Facility Location
    • (+) Online algorithms, with constant competitive ratio for random order of input
    • The "translation" from/to the channelization problem is not complete
      • (-) A client is served only by one facility (intuitively - this would cause clusters to hold many unnecessary packages)
      • (+) Package clusters ("facilities") can/will share packages
  • Bloom filters
Version 1.3 last modified by Yotam on 22/05/2005 at 19:55

Comments 0

No comments for this document

Attachments 0

No attachments for this document

Creator: Yotam on 2005/05/22 19:48
Copyright EDOS Consortium
1.1.1