Peer (Subscription) Clustering
Motivation
- A new version, or a major update, is made up of a large number of packages
- The users are diverse in which packages they need/want to download
- Have different sets of packages installed on their local machines
- Are interested in different update types (all, security only, bugs and security etc.)
- On one hand - wouldn't like to distribute each package in its own "environment" (e.g. multicast tree).
- A user that wants to download an entire version will have to join ~10K multicast trees!
- Too much overhead of building the trees
- On the other hand - wouldn't like to force a user to participate in the distribution of packages he has no interest in
- Wasting bandwidth & storage for a cause he doesn't believe in
- Latency for getting wanted packages increases
Requirements
- Formal definition
- Problem Notation
- P packages
- Every subscription (Si) is a bit mask vector of length P, denoting for every possible package whether the user is interested in it or not
- N nodes (subscribers), each with a bit mask of length P
- Output: K sets of (disjoint?!) node or package clusters
- Peer clusters: packages have to be transmitted more than once, in different clusters
- (-) more load on the source, needs to feed multiple clusters with same packages
- (+) user participates in just one dissemination "environment"
- (+) this is how the data clustering algorithms of riabov work
- Package clusters: user has to join more than one cluster
- (-) more load on user's machine
- (+) source feeds a packages only once
- ==> Maybe I need to put limits on the source machine "load" (number of clusters it feeds, some with the same packages) and the user "load" (number of package clusters it participates in)
- ==> Must be a middle ground, when no cluster is neccesarily disjoint...
- Peer clusters: packages have to be transmitted more than once, in different clusters
- Problem Notation
Directions
- "Intuitive" mechanism that is based on the logic relation between packages
- Compilation/Runtime dependencies
- Similarity of packages (e.g. Web browsers)
- This "knowledge" has to be converted to an operative clustering scheme
- General algorithms for Data Clustering
- No prior knowledge of the problem's domain
- Give a solution that is "close" to optimal
- Create clusters by merging the "closest" groups in order to minimize some general or local distance function
State-of-the-art
- Channelization Problem
- (+) Exact model and notation of the problem
- (-) Offline algorithms (can be tackled using sampling)
- (-) Experiments on relatively small sets (scalability)
- Data Clustering algorithms
- Dissemination Systems Riabov
- Expected performance
- ToDo: Understand the planes issue in the Zero-loss algorithm
- Facility Location
- (+) Online algorithms, with constant competitive ratio for random order of input
- The "translation" from/to the channelization problem is not complete
- (-) A client is served only by one facility (intuitively - this would cause clusters to hold many unnecessary packages)
- (+) Package clusters ("facilities") can/will share packages
- Bloom filters
Version 1.3 last modified by Yotam on 22/05/2005 at 19:55
Document data
Attachments:
No attachments for this document
Comments: 0