@pfrazee
Last active April 1, 2020 17:10

combining policy and reputation systems

Search engines, spam filtration, and p2p protocols - all need to rate the value of information. Search engines need it to provide good results; spam filtration needs it to exclude noise; and p2p networks need it for security and efficiency.

What is "value?" I'll use two dimensions:

  • Quality (how useful the information is). The more permissive participation is, the greater the need for quality-ranking. Term-frequency search alone fails for the Web, because the Web doesn't prefilter for quality.
  • Trust (how safe the information is to use). The more function-critical the information is, the greater the need for trust-ranking. If a DHT's host-lookup can be manipulated to flood an unsuspecting computer (a DDoS) then the DHT is unsound.

Trust- and quality-ranking are a key feature of open networks with shared data-structures. Open networks maximise potential value, because they let agents contribute and extract data independently. But, this means the nodes must also filter and moderate independently. Without a means to do so, the network will be overwhelmed, and won't provide any value at all.

Email is an open network: anybody can create a server and account. And, the inbox is a shared structure: users share append-rights over each other's inboxes. But, without proper spam-filtering, email is nearly useless. Filtering must be both effective (very little spam) and accurate (very few false positives) or email loses its utility.

There are two approaches to trust & quality I will contrast: Policy-based and Reputation-based. From A survey of trust in computer science and the Semantic Web:

Policy-based trust. Using policies to establish trust, focused on managing and exchanging credentials and enforcing access policies. Work in policy-based trust generally assumes that trust is established simply by obtaining a sufficient amount of credentials pertaining to a specific party, and applying the policies to grant that party certain access rights. The recursive problem of trusting the credentials is frequently solved by using a trusted third party to serve as an authority for issuing and verifying credentials.

Policies are manually managed and/or pushed off to 3rd-party authorities, as in the case of DNS registrars and TLS authorities. Centralized authorities do not remove the need to rate information; instead they give that task to the authority, and in the process they remove agency from the user. Centralized authorities are also single points of failure.

Policies are easier to implement, but scale poorly for p2p systems. CAs, for example, could never scale to provide credentials for every Web user.

What's the other option?

Reputation-based trust. Using reputation to establish trust, where past interactions or performance for an entity are combined to assess its future behavior. Research in reputation-based trust uses the history of an entity’s actions/behavior to compute trust, and may use referral-based trust (information from others) in the absence of (or in addition to) first-hand knowledge. In the latter case, work is being done to compute trust over social networks (a graph where vertices are people and edges denote a social relationship between people), or across paths of trust (where two parties may not have direct trust information about each other, and must rely on a third party). Recommendations are trust decisions made by other users, and combining these decisions to synthesize a new one, often personalized, is another commonly addressed problem.

Reputation systems are used in search engines, spam filtering (email, forums), and markets (ebay, uber). They are automated, or they distribute the task of ranking across all agents, and so they tend to scale much better than policies do. But, they must be designed carefully in order to produce good results. Bad assumptions or a lack of actionable information can make reputation systems unusable or attackable.

In SSB, we have decided never to adopt centralized authorities in the protocol. This decision was motivated by:

  1. distrust in the quality of global authorities (they are SPoFs)
  2. desire to maximize autonomy of individual users
  3. general curiosity

This constraint pushes us to find novel solutions for shared datastructures. Though we can use policies, they must be defined by the user; there should not be authorities hardcoded into the software. In order to avoid manual policy-management, we'll need to employ reputation systems to automate the network's operation.

Can we balance policy and reputation systems to create open & shared datastructures, which scale better and give users more agency?


SSB is a trust network. (A trust network is just a social network that publishes trust relationships.) It currently handles only two relationships (a minimal sketch follows the list):

  • is a following b? If so, a downloads messages by b.
  • is a blocking b? If so, a doesn't want b to receive messages by a.
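Here is that sketch, in TypeScript, of how these two relationships might be stored and reduced into a replication decision. The message shape and field names are assumptions for illustration, not the exact SSB schema.

```ts
// Hypothetical shape for a published follow/block relationship.
type FeedId = string;

interface ContactMsg {
  author: FeedId;       // who published the relationship
  contact: FeedId;      // who it is about
  following?: boolean;  // "a is following b"
  blocking?: boolean;   // "a is blocking b"
}

// Reduce a log of contact messages into the latest state per (author, contact) edge.
function buildTrustGraph(log: ContactMsg[]): Map<string, ContactMsg> {
  const edges = new Map<string, ContactMsg>();
  for (const msg of log) {
    edges.set(`${msg.author}->${msg.contact}`, msg); // later messages win
  }
  return edges;
}

// Replication policy: download b's messages if a follows b and is not blocking b.
function shouldReplicate(edges: Map<string, ContactMsg>, a: FeedId, b: FeedId): boolean {
  const edge = edges.get(`${a}->${b}`);
  return !!edge && edge.following === true && edge.blocking !== true;
}
```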

Let's define "shared datastructures" as structures which are populated by other agents, and then used in automated decisions by the software. Under that definition, there are presently 5 shared datastructures in SSB. None of them employ trust- or quality-ranking in a sophisticated way, yet.

  • Following. Following uses a transitive "expansion" of the graph: if you follow b, you also download everybody that b follows. This is called friend-of-a-friend downloading.
  • Blocking. In replication, SSB uses published "blocking" relationships to decide whether to transfer a feed's messages.
  • Petnames. (Not yet in core.) Petnames can be self-assigned, or assigned to others. The policy for using the petnames is very restrictive (a sketch follows this list):
    • The petname must be self-assigned by a followed user, or
    • The petname must be assigned by the local user.
  • Pub announcements. Pub address announcements are shared and used freely.
  • Blob downloads. Blobs linked by followed users are automatically downloaded.
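To make the restrictiveness of the petname policy concrete, here is a sketch of the acceptance check described above. The message shape and the follows() helper are hypothetical.

```ts
type FeedId = string;

// Hypothetical shape of a petname assignment; the schema is assumed for illustration.
interface NameAssignment {
  author: FeedId;  // who assigned the name
  target: FeedId;  // who the name is for
  name: string;
}

// The restrictive policy described above: accept a petname only if it is
// self-assigned by a followed user, or assigned by the local user.
function acceptPetname(
  assignment: NameAssignment,
  localUser: FeedId,
  follows: (a: FeedId, b: FeedId) => boolean // hypothetical lookup into the follow graph
): boolean {
  const selfAssignedByFollowed =
    assignment.author === assignment.target && follows(localUser, assignment.author);
  const assignedByLocalUser = assignment.author === localUser;
  return selfAssignedByFollowed || assignedByLocalUser;
}
```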

In all but blocking, I believe the policies are overly restrictive (petnames), overly optimistic (pub announcements), or both (following, blob downloads). There are also other shared data structures we can adopt. Using reputation in combination with policy, we can potentially improve on

  • user petnames (who is @bob?)
  • pub announcements (should I sync with this address?)
  • optimistic downloading decisions (should I download this feed or blob without asking the user?)
  • wanted-blob lookup results (which peers will give me blob X?)
  • blob metadata (what's this blob's name, filetype, description, etc?)
  • software safety ratings (is this package safe to run?)
  • search-result recommendations (how should search results be ordered?)
  • product and service ratings (is Dan's Diner any good?)
  • wiki/knowledge-base assertions (is "Obama's from Kenya" true?)

An ideal system will maximize value while minimizing user effort and unwanted results. "As easy as Googling" is a sentiment we should be able to apply consistently. The quality of shared datastructures becomes more important when

  • You expand the knowledgebase to include strangers. This is a necessary step to improving the quality of results, since it increases the knowledge you can access, but it also forces you to answer more difficult questions, as you have to rate the value of strangers' inputs.
  • You want to make stronger assumptions about value. For instance, suppose you wanted a whois command that gave an immediate single result for a search. whois bob -> a single pubkey. This is the quality we expect out of DNS, and it seems conceivable we could get the same from a well-constructed policy+reputation network. (A toy version is sketched after this list.)
  • You want automated processes to expend resources. Downloading feeds and blobs requires bandwidth and diskspace, but it can improve the user's dataset. Can optimistic following and downloading be automated, in a more sophisticated way than FoaF? (Automating virality of data?)
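Picking up the whois example: a toy sketch of how strangers' petname assignments, weighted by trust in their authors, could be collapsed into a single answer, or none if confidence is too low. The types, weighting, and threshold here are assumptions, not a worked-out design.

```ts
type FeedId = string;

// One assignment of the queried name to a pubkey, weighted by the local user's
// (possibly transitive) trust in whoever made the assignment.
interface WeightedAssignment {
  claimedKey: FeedId;
  trustInAuthor: number; // 0..1, however it was computed
}

// Toy whois: sum trust per candidate key and return a single answer only if the
// winner carries a clear majority of the total trust; otherwise ask the user.
function whois(assignments: WeightedAssignment[], threshold = 0.5): FeedId | null {
  const scores = new Map<FeedId, number>();
  let total = 0;
  for (const a of assignments) {
    scores.set(a.claimedKey, (scores.get(a.claimedKey) ?? 0) + a.trustInAuthor);
    total += a.trustInAuthor;
  }
  let best: FeedId | null = null;
  let bestScore = 0;
  for (const [key, score] of scores) {
    if (score > bestScore) {
      best = key;
      bestScore = score;
    }
  }
  return total > 0 && bestScore / total >= threshold ? best : null;
}
```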

PageRank is still one of the top search algorithms in use, so it's important to understand where it fits in the current landscape of research.

Propagation Models for Trust and Distrust in Social Networks does a good job explaining PageRank's qualities, in section 2. From there, and from other papers:

  • PageRank is a probabilistic analysis of a random surfer clicking links through the Web. An alternative form of analysis is to measure paths from the local node to other nodes, which is the sort of thing that PGP does in its WoT.
  • PageRank operates with global information, and seeks to be objective (everyone's opinion) instead of subjective (Google's opinion). Google crawls the Web to get global knowledge of the graph. That's not something you'd try to do in a p2p system; instead, you try to apply subjective analysis. However, since an SSB node's dataset is prefiltered (by following) into a "local neighborhood," you may be able to apply global-knowledge algorithms like PageRank and get interesting results.
  • PageRank infers that links are recommendations from one page to another. That's usually a correct assumption, but not always. Semantic Web research tries to use richer signals (explicit recommendations of pages, ratings of trust between agents, ratings of belief in facts, etc) and then construct better analysis.
    • In practice, Google has incorporated richer signals than weblinks since the original PageRank.
  • PageRank relies on weblinks being transitive recommendations. The algorithm does not apply for recommendations that aren't transitive. Agent trust signals are not always transitive, and distrust signals are never transitive.
  • Because PageRank creates global ratings for pages, it's actually estimating the global reputation of a page. Reputation is different from trust, because trust is subjective, from a particular node's perspective.
  • Many algorithms, including Eigentrust and Advogato, rely on a seed set of trusted peers. The seed-set makes it possible to detect malicious rings that work together to elevate their rating. The original PageRank paper doesn't have any such seed-set, but in practice I imagine Google has an equivalent.

Trust Management for the Semantic Web and The EigenTrust Algorithm for Reputation Management in P2P Networks are two good papers on probabilistic analysis. Both of these try to generalize PageRank. Eigentrust is cited a lot by other papers.
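A compressed sketch of the power-iteration idea those papers share with PageRank: propagate a trust vector through a row-normalized trust matrix, always mixing back in a seed distribution (the pre-trusted peers, or PageRank's random jump). The normalization and damping details here are simplified assumptions, not either paper's exact algorithm.

```ts
// trust[i][j] >= 0: how much peer i trusts peer j.
// seed: a probability distribution over pre-trusted peers (sums to 1).
function powerIterate(
  trust: number[][],
  seed: number[],
  alpha = 0.85,      // weight on propagated trust vs. the seed (PageRank's damping factor)
  iterations = 50
): number[] {
  const n = trust.length;
  // Row-normalize so each peer's outgoing trust sums to 1; empty rows fall back to the seed.
  const norm = trust.map(row => {
    const sum = row.reduce((a, b) => a + b, 0);
    return sum > 0 ? row.map(v => v / sum) : seed.slice();
  });
  let t = seed.slice();
  for (let k = 0; k < iterations; k++) {
    const next = new Array<number>(n).fill(0);
    for (let i = 0; i < n; i++) {
      for (let j = 0; j < n; j++) {
        next[j] += t[i] * norm[i][j];
      }
    }
    t = next.map((v, j) => alpha * v + (1 - alpha) * seed[j]);
  }
  return t; // reputation scores; personalized if the seed is centered on the local node
}
```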

Propagation Models for Trust and Distrust in Social Networks is a good resource for path-analysis, as it explains Advogato and a novel algorithm called Appleseed.


Before deciding on analysis algorithms, you need to decide the data model.

TODO Notes on the data model

Modeling and Evaluating Trust Network Inference: importance of domain and types of trust


TODO Notes on computation, and how SSB's local data model relates

It's not clear to me whether path algebras, such as those in Advogato or Appleseed, are better or worse than probabilistic algorithms, such as those in PageRank and Eigentrust. It is asserted, by Advogato's authors, and by these guys, that subjective analysis (given in a Web of Trust) is an improvement, particularly for isolating malicious actors. But, probabilistic models can incorporate WoT subjectivity.
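To make the contrast concrete, here is a toy version of the path-based flavor: trust along a path is the product of its edge weights, and the inferred trust in a target is the best path found within a small depth bound. This is only an illustration of path analysis in general, not Advogato or Appleseed themselves.

```ts
type FeedId = string;

// edges.get(a)?.get(b) is a's direct trust in b, a rating in [0, 1].
function pathTrust(
  edges: Map<FeedId, Map<FeedId, number>>,
  source: FeedId,
  target: FeedId,
  maxDepth = 3
): number {
  let best = 0;
  const walk = (node: FeedId, acc: number, depth: number, seen: Set<FeedId>) => {
    if (node === target) {
      best = Math.max(best, acc);
      return;
    }
    if (depth === maxDepth) return;
    for (const [next, weight] of edges.get(node) ?? []) {
      if (!seen.has(next)) {
        walk(next, acc * weight, depth + 1, new Set(seen).add(next));
      }
    }
  };
  walk(source, 1, 0, new Set([source]));
  return best;
}
```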


TODO notes on the anatomy of a search engine

TF-IDF for relevance, then ranking
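For reference, a minimal TF-IDF scorer, i.e. the relevance step that would run before any trust- or quality-based re-ranking. Documents are assumed to already be tokenized.

```ts
// Score one tokenized document against a query, over a tokenized corpus.
function tfidfScore(queryTerms: string[], doc: string[], corpus: string[][]): number {
  const tf = (term: string, d: string[]) =>
    d.filter(w => w === term).length / (d.length || 1);
  const idf = (term: string) =>
    Math.log(corpus.length / (1 + corpus.filter(d => d.includes(term)).length));
  return queryTerms.reduce((score, term) => score + tf(term, doc) * idf(term), 0);
}
```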


TODO notes on robustness against malicious users AND on robustness against compromised trusted users

@dominictarr

note: A blocks B means that A does not want B to see A's data, and doesn't want to see B's either.

@NHQ

NHQ commented Oct 1, 2015

There seems like a world of possibility here, and to commit to any set of algorithms is a scary thought, because presumptions.

I propose a simpler, quick start method, which can be optimized at the speed and scale of open source collaboration.

Do it manually first, then digitally later.

It's not hard to single-action trash a message or block a user; it's probably a good skill to have.

Groups of people can choose to join "junkmail" metadata unions, and train their filters what the garbage is.

The unions trust each other, and so those decisions propagate quickly, and bad messages disappear without notice.

"Many hands make a light load."

Maybe there is a protocol to appeal blockage, in the case of accidental swipe-lefts; and undo.

People can set their own level of filtering to a floating point, or [x,y,...,n], because each person is connected to a unique set of "junkmail" metadata unions, or social networks; this could have multiple interfaces.

Then the fun starts.

People can use their own data, share their own data, and write advanced filtering algorithms with all that data.

Popular algorithms become like black & white lists which others can install, or use to bootstrap new members and networks.

In more socialized, less controlled networks, spam block may become more about detecting behaviours than merely blocking sources, especially when you can no longer control how fast people can create identities and forge bots; this may seem obvious.

But a lot of open source creativity may be necessary to make it work; we simply don't know.

Now, let's imagine the worst: total spamarchy in the system.

But, we own the means of control, so we are not powerless.

In defence, groups develop intentional subnetworks where certain social behaviours are blocked by convention in that community.

They may use moderation, invitations, and/or enforce particular algorithmic filtrations, for blocking content or behavior; this is happening already in private bittorrent communities.

From this perspective, search seems like an entirely different problem...

@dominictarr

question: what methodologies do the various trust metric papers use to evaluate their reputation systems?

At this point reputation systems intersect with game theory - the study of the incentive structures around cooperation (which in our context means creating or promoting quality content) and defection (spam, abuse, fraud etc). Game theory is interesting here, because it gives a very concrete evaluation criterion: what is the profitability of defection? (Usually, crime pays if there isn't too much competition - but how well it pays with what amount of competition determines the "crime rate".) But mostly the games studied in GT are much less sophisticated than the reputation systems described in the RS papers.

@pfrazee

pfrazee commented Oct 6, 2015

The survey paper I link to in the intro cites some game-theory studies. I'd look there.

@liamzebedee

Woah dude this is awesome, cheers!
