Search engines, spam filtration, and p2p protocols - all need to rate the value of information. Search engines need it to provide good results; spam filtration needs it to exclude noise; and p2p networks need it for security and efficiency.
What is "value?" I'll use two dimensions:
- Quality (how useful the information is). The more permissive participation is, the greater the need for quality-ranking. Term-frequency search alone fails for the Web, because the Web doesn't prefilter for quality.
- Trust (how safe the information is to use). The more function-critical the information is, the greater the need for trust-ranking. If a DHT's host-lookup can be manipulated to flood an unsuspecting computer (a DDoS) then the DHT is unsound.
Trust- and quality-ranking are a key feature of open networks with shared data-structures. Open networks maximise potential value, because they let agents contribute and extract data independently. But, this means the nodes must also filter and moderate independently. Without a means to do so, the network will be overwhelmed, and won't provide any value at all.
Email is an open network: anybody can create a server and account. And the inbox is a shared structure: users share append-rights over each other's inboxes. But, without proper spam-filtering, email is nearly useless. Filtering must be both effective (very little spam) and accurate (very few false positives) or email loses its utility.
There are two approaches to trust & quality I will contrast: Policy-based and Reputation-based. From A survey of trust in computer science and the Semantic Web:
Policy-based trust. Using policies to establish trust, focused on managing and exchanging credentials and enforcing access policies. Work in policy-based trust generally assumes that trust is established simply by obtaining a sufficient amount of credentials pertaining to a specific party, and applying the policies to grant that party certain access rights. The recursive problem of trusting the credentials is frequently solved by using a trusted third party to serve as an authority for issuing and verifying credentials.
Policies are manually managed and/or pushed off to 3rd-party authorities, as in the case of DNS registrars and TLS authorities. Centralized authorities do not remove the need to rate information; instead they give that task to the authority, and in the process they remove agency from the user. Centralized authorities are also single points of failure.
Policies are easier to implement, but they scale poorly for p2p systems. CAs, for example, could never scale to provide credentials for every Web user.
What's the other option?
Reputation-based trust. Using reputation to establish trust, where past interactions or performance for an entity are combined to assess its future behavior. Research in reputation-based trust uses the history of an entity’s actions/behavior to compute trust, and may use referral-based trust (information from others) in the absence of (or in addition to) first-hand knowledge. In the latter case, work is being done to compute trust over social networks (a graph where vertices are people and edges denote a social relationship between people), or across paths of trust (where two parties may not have direct trust information about each other, and must rely on a third party). Recommendations are trust decisions made by other users, and combining these decisions to synthesize a new one, often personalized, is another commonly addressed problem.
Reputation systems are used in search engines, spam filtering (email, forums), and markets (ebay, uber). They are automated, or they distribute the task of ranking across all agents, and so they tend to scale much better than policies do. But, they must be designed carefully in order to produce good results. Bad assumptions or a lack of actionable information can make reputation systems unusable or attackable.
In SSB, we have decided never to adopt centralized authorities in the protocol. This decision was motivated by
- distrust in the quality of global authorities (they are SPoFs)
- desire to maximize autonomy of individual users
- general curiosity
This constraint pushes us to find novel solutions for shared datastructures. Though we can use policies, they must be defined by the user; there should not be authorities hardcoded into the software. In order to avoid manual policy-management, we'll need to employ reputation systems to automate the network's operation.
Can we balance policy and reputation systems to create open & shared datastructures, which scale better and give users more agency?
SSB is a trust network. (A trust network is just a social network that publishes trust relationships.) It currently handles only two relationships:
- is `a` following `b`? If so, `a` downloads messages by `b`.
- is `a` blocking `b`? If so, `a` doesn't want `b` to receive messages by `a`. (A rough sketch of this data model follows.)
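To make this concrete, here's a rough TypeScript sketch of how those two relationships might be represented and reduced to a lookup table. The field names loosely mirror SSB's contact messages, but treat the exact shape as an assumption, not a spec.

```typescript
// Illustrative only: field names approximate SSB's "contact" message type.
interface ContactMsg {
  author: string        // feed id publishing the relationship (a)
  contact: string       // feed id the relationship points at (b)
  following?: boolean   // a follows b
  blocking?: boolean    // a blocks b
}

// Fold an append-only log of contact messages into the latest state
// per (author, contact) pair; later messages override earlier ones.
function relationshipState(log: ContactMsg[]): Map<string, ContactMsg> {
  const state = new Map<string, ContactMsg>()
  for (const msg of log) {
    state.set(`${msg.author}:${msg.contact}`, msg)
  }
  return state
}
```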
Let's define "shared datastructures" as structures which are populated by other agents, and then used in automated decisions by the software. Under that definition, there are presently 5 shared datastructures in SSB. None of them employ trust- or quality-ranking in a sophisticated way, yet.
- Following. Following uses a transitive "expansion" of the graph: if you follow `b`, you also download everybody that `b` follows. This is called friend-of-a-friend downloading. (See the sketch after this list.)
- Blocking. In replication, SSB uses published "blocking" relationships to decide whether to transfer a feed's messages.
- Petnames. (Not yet in core.) Petnames can be self-assigned, or assigned to others. The policy for using the petnames is very restrictive:
- The petname must be self-assigned by a followed user, or
- The petname must be assigned by the local user.
- Pub announcements. Pub address announcements are shared and used freely.
- Blob downloads. Blobs linked by followed users are automatically downloaded.
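Here's a minimal sketch of the friend-of-a-friend download set and the blocking check, assuming the published relationships have already been reduced to adjacency sets. The function names (`downloadSet`, `mayTransfer`) are illustrative, not actual SSB APIs.

```typescript
type Graph = Map<string, Set<string>>   // feed id -> set of feed ids

// Feeds to download for `me`: everyone I follow, plus everyone they follow (FoaF).
function downloadSet(me: string, follows: Graph): Set<string> {
  const wanted = new Set<string>()
  for (const friend of follows.get(me) ?? []) {
    wanted.add(friend)
    for (const foaf of follows.get(friend) ?? []) wanted.add(foaf)
  }
  wanted.delete(me)
  return wanted
}

// Replication check: don't hand `author`'s messages to `peer` if author blocks peer.
function mayTransfer(author: string, peer: string, blocks: Graph): boolean {
  return !(blocks.get(author)?.has(peer) ?? false)
}
```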
In all but blocking, I believe the policies are overly restrictive (petnames), overly optimistic (pub announcements), or both (following, blob downloads). There are also other shared data structures we can adopt. Using reputation in combination with policy, we can potentially improve on
- user petnames (who is @bob?)
- pub announcements (should I sync with this address?)
- optimistic downloading decisions (should I download this feed or blob without asking the user?)
- wanted-blob lookup results (which peers will give me blob X?)
- blob metadata (what's this blob's name, filetype, description, etc?)
- software safety ratings (is this package safe to run?)
- search-result recommendations (how should search results be ordered?)
- product and service ratings (is Dan's Diner any good?)
- wiki/knowledge-base assertions (is "Obama's from Kenya" true?)
An ideal system will maximize value while minimizing user effort and unwanted results. "As easy as Googling" is a sentiment we should be able to apply consistently. The quality of shared datastructures becomes more important when
- You expand the knowledgebase to include strangers. This is a necessary step to improving the quality of results, since it increases the knowledge you can access, but it also forces you to answer more difficult questions, as you have to rate the value of strangers' inputs.
- You want to make stronger assumptions about value. For instance, suppose you wanted a `whois` command that gave an immediate single result for a search: `whois bob` -> a single pubkey. This is the quality we expect out of DNS, and it seems conceivable we could get the same from a well-constructed policy+reputation network. (A rough sketch follows this list.)
- You want automated processes to expend resources. Downloading feeds and blobs requires bandwidth and diskspace, but it can improve the user's dataset. Can optimistic following and downloading be automated, in a more sophisticated way than FoaF? (Automating virality of data?)
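As a rough illustration of the `whois` idea above: a strict policy (my own assignments win) backed by reputation-weighted votes over other users' petname claims. The `PetnameClaim` shape and the precomputed `trust` map are hypothetical, not anything that exists in SSB today.

```typescript
interface PetnameClaim {
  author: string     // who made the assignment
  target: string     // the pubkey being named
  name: string       // the petname, e.g. "bob"
}

function whois(
  name: string,
  claims: PetnameClaim[],
  me: string,
  trust: Map<string, number>   // precomputed reputation scores in [0, 1]
): string | null {
  const matching = claims.filter(c => c.name === name)

  // Policy: a petname I assigned myself is authoritative.
  const mine = matching.find(c => c.author === me)
  if (mine) return mine.target

  // Reputation: sum the trust of the claimants behind each candidate pubkey.
  const votes = new Map<string, number>()
  for (const c of matching) {
    const w = trust.get(c.author) ?? 0
    votes.set(c.target, (votes.get(c.target) ?? 0) + w)
  }
  let best: string | null = null
  let bestScore = 0
  for (const [target, score] of votes) {
    if (score > bestScore) { best = target; bestScore = score }
  }
  return best
}
```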
PageRank is still one of the top search algorithms in use, so it's important to understand where it fits in the current landscape of research.
Propagation Models for Trust and Distrust in Social Networks does a good job explaining PageRank's qualities in section 2. From there, and from other papers (a bare-bones implementation sketch follows the list below):
- PageRank is a probabilistic analysis of a random surfer clicking links through the Web. An alternative form of analysis is to measure paths from the local node to other nodes, which is the sort of thing that PGP does in its WoT.
- PageRank operates with global information, and seeks to be objective (everyone's opinion) instead of subjective (google's opinion). Google crawls the Web to get global knowledge of the graph. That's not something you'd try to do in a p2p system; instead, you try to apply subjective analysis. However, since an SSB node's dataset is prefiltered (by following) into a "local neighborhood," you may be able to apply global-knowledge algorithms like PageRank and get interesting results.
- PageRank infers that links are recommendations from one page to another. That's usually a correct assumption, but not always. Semantic Web research tries to use richer signals (explicit recommendations of pages, ratings of trust between agents, ratings of belief in facts, etc) and then construct better analysis.
- In practice, Google has incorporated richer signals than weblinks since the original PageRank.
- PageRank relies on weblinks being transitive recommendations. The algorithm does not apply to recommendations that aren't transitive. Agent trust signals are not always transitive, and distrust signals are never transitive.
- Because PageRank creates global ratings for pages, it's actually estimating the global reputation of a page. Reputation is different from trust, because trust is subjective, from a particular node's perspective.
- Many algorithms, including Eigentrust and Advogato, rely on a seed set of trusted peers. The seed-set makes it possible to detect malicious rings that work together to elevate their rating. The original PageRank paper doesn't have any such seed-set, but in practice I imagine Google has an equivalent.
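To ground those notes, here's a bare-bones rendition of the power iteration at the heart of PageRank, with the original paper's damping factor d = 0.85. Restricting the teleport term to a trusted seed set would give the "personalized" variant that seed-set approaches build on. This sketch assumes every node appears as a key of the link map (possibly with an empty out-link list).

```typescript
// Bare-bones PageRank power iteration.
function pageRank(links: Map<string, string[]>, d = 0.85, iters = 50): Map<string, number> {
  const nodes = [...links.keys()]
  const n = nodes.length
  let rank = new Map(nodes.map(p => [p, 1 / n] as [string, number]))

  for (let i = 0; i < iters; i++) {
    // Everyone starts with the teleport share; link mass is added on top.
    const next = new Map(nodes.map(p => [p, (1 - d) / n] as [string, number]))
    for (const page of nodes) {
      const out = links.get(page)!
      const targets = out.length ? out : nodes        // dangling node: spread everywhere
      const share = (d * rank.get(page)!) / targets.length
      for (const t of targets) {
        next.set(t, (next.get(t) ?? (1 - d) / n) + share)
      }
    }
    rank = next
  }
  return rank
}
```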
Trust Management for the Semantic Web and The EigenTrust Algorithm for Reputation Management in P2P Networks are two good papers on probabilistic analysis. Both of these try to generalize PageRank. Eigentrust is cited a lot by other papers.
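A simplified, centralized rendition of the EigenTrust iteration (the actual algorithm distributes this computation across peers): normalize each peer's local trust values into a matrix C, then iterate t <- (1-a)·Cᵀt + a·p, where p is the pre-trusted seed distribution. The constant `a` and the fixed iteration count here are assumptions, not values from the paper.

```typescript
// Simplified, non-distributed EigenTrust-style aggregation.
function eigenTrust(
  localTrust: number[][],   // localTrust[i][j]: i's opinion of j, >= 0
  pretrusted: number[],     // seed distribution p, sums to 1
  a = 0.15,                 // assumed mixing weight toward the seed set
  iters = 50
): number[] {
  const n = localTrust.length
  // Normalize rows into C: c_ij = s_ij / sum_j s_ij (fall back to p if i trusts nobody).
  const C = localTrust.map(row => {
    const sum = row.reduce((acc, v) => acc + v, 0)
    return sum > 0 ? row.map(v => v / sum) : [...pretrusted]
  })

  let t = [...pretrusted]
  for (let k = 0; k < iters; k++) {
    const next: number[] = new Array(n).fill(0)
    for (let i = 0; i < n; i++) {
      for (let j = 0; j < n; j++) {
        next[j] += C[i][j] * t[i]        // t' = Cᵀ t
      }
    }
    t = next.map((v, j) => (1 - a) * v + a * pretrusted[j])
  }
  return t   // global trust value per peer
}
```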
Propagation Models for Trust and Distrust in Social Networks is a good resource for path-analysis, as it explains Advogato and a novel algorithm called Appleseed.
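For contrast with the probabilistic sketches above, here's the simplest possible path-algebra toy: multiply trust weights along a path and keep the strongest path found within a depth limit. This is emphatically not Advogato's network-flow computation or Appleseed's spreading activation, just an illustration of the shape of path-based analysis.

```typescript
type TrustGraph = Map<string, Map<string, number>>   // a -> (b -> trust in [0, 1])

// Best multiplicative trust over any path from source to target, up to maxDepth hops.
function pathTrust(graph: TrustGraph, source: string, target: string, maxDepth = 4): number {
  let best = 0
  const walk = (node: string, acc: number, depth: number, seen: Set<string>) => {
    if (node === target) { best = Math.max(best, acc); return }
    if (depth === maxDepth) return
    for (const [next, weight] of graph.get(node) ?? []) {
      if (!seen.has(next)) walk(next, acc * weight, depth + 1, new Set(seen).add(next))
    }
  }
  walk(source, 1, 0, new Set([source]))
  return best
}
```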
Before deciding on analysis algorithms, you need to decide the data model.
TODO Notes on the data model
Modeling and Evaluating Trust Network Inference: importance of domain and types of trust
TODO Notes on computation, and how SSB's local data model relates
It's not clear to me whether path-algebra, such as that used in Advogato or Appleseed, is better or worse than probabilistic algorithms, such as those in PageRank and EigenTrust. It is asserted, by Advogato's authors and by these guys, that subjective analysis (as given by a Web of Trust) is an improvement, particularly for isolating malicious actors. But probabilistic models can incorporate WoT subjectivity.
TODO notes on the anatomy of a search engine
TF-IDF for relevance, then ranking
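As a reminder of what the relevance half looks like, here's one common TF-IDF variant (raw term frequency times a smoothed log inverse document frequency); the trust/reputation ranking step would then reorder or weight these scores.

```typescript
// One common TF-IDF variant; assumes docs are non-empty arrays of tokens.
function tfIdf(term: string, doc: string[], corpus: string[][]): number {
  const tf = doc.filter(w => w === term).length / doc.length
  const docsWithTerm = corpus.filter(d => d.includes(term)).length
  const idf = Math.log(corpus.length / (1 + docsWithTerm))   // +1 avoids division by zero
  return tf * idf
}
```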
TODO notes on robustness against malicious users AND on robustness against compromised trusted users
note: A blocks B means that A does not want B to see A's data, and doesn't want to see B's data either.