MISRA C vs Lisp deathmatch

It's interesting how MISRA C and Lisp represent different kinds of "ideal" for programming. Lisp sits at the extreme of expressiveness/flexibility/"power" (in the sense of giving the programmer control over the program), while MISRA C sits at the extreme of portability/reliability/safety, achieved by severely limiting the abstractions programmers are allowed to use (no recursion! no C99! no comma operator!).

I find both appealing, although in different ways and of course for very different applications.

Crucible

God be between you and harm, in all the empty places you walk.

- blessing of the 18th Egyptian dynasty

Augur's black swans

This post continues exploring Augur's decentralized lie detection/event consensus mechanism, and builds on a number of ideas from previous posts.

In my last post, I discussed how a common statistical technique called Principal Component Analysis (PCA) can function as a lie detector. While there's a fair bit of math involved, the core insight is simply that lies are more variable than truths. If you take a set of questions -- each of which has an "obvious" (objectively and easily determinable) answer -- and collect several people's answers to them, then by cross-referencing those answers you can figure out who is contributing most to the overall variability. PCA provides a score that tells you precisely this information.

Or does it? One problem with PCA is that it is a second-order statistical technique. This means that it accounts for pairwise coupling of signals -- "how dispersed were each user's answers to question A? how much did user X's answers vary with user Y's answers?" -- but it is blind to three-way, four-way, etc. interactions. Now, in many cases, "cheating" or collusion between users will affect the second as well as the higher moments; for these cases, PCA should work fine. However, this is not always the case: certain kinds of feedback loops can alter the fourth-order statistics, while leaving the second-order statistics untouched. PCA cannot see these. Collusion structured in the right way is invisible to PCA.

Four-way interactions are of particular interest, because the fourth standardized moment (the kurtosis) quantifies the amount of positive feedback present in the system. "Excess kurtosis" indicates that extreme occurrences are significantly more common than you might expect based on "intuition" -- or, more precisely, more common than you would expect if the system's underlying dynamics were driven by a random walk. (Nassim Taleb's The Black Swan is an excellent non-technical introduction to this concept.)

So, can we use a fourth-order "version" of PCA -- perhaps in combination with ordinary, second-order PCA -- to arrive at a more comprehensive measure of nonconformity? In fact, there is a (literal) higher-order version of PCA, called Independent Component Analysis, or ICA. The difficulty with ICA is that, like PCA, it relies on diagonalization to find the "best" multi-linear decomposition of a joint moment tensor. For PCA, diagonalizing the covariance matrix (i.e., the second-order joint moment tensor) gives you a set of uncorrelated axes -- these are the covariance matrix's eigenvectors, since the off-diagonal elements of the eigenvalue matrix are 0. ICA is a more powerful technique: it finds a set of axes which are mutually statistically independent (rather than merely uncorrelated). Unfortunately, diagonalizing higher-order tensors is much messier than it is for matrices, and often these calculations fail to converge. (The most widely-used implementation of ICA, FastICA, uses fixed-point iteration rather than tensor decomposition, but nevertheless has the same convergence problems.) This makes ICA impractical for use in a decentralized system, since the report matrices it must operate on will be essentially random.

Let's take a step back. Since we're really interested in the users' scores -- that is, projections onto the independent axes, rather than the axes themselves -- we should be able to avoid ICA and the troublesome issues of tensor diagonalization altogether. What if we just summed across three of the four dimensions of the fourth joint moment tensor (the cokurtosis tensor)? The remaining dimension would then contain a direct measure of each user's contribution to the overall cokurtosis. This simpler method avoids the issues with convergence and numerical stability. However, it still has a complexity problem: it turns out the cokurtosis tensor takes a prohibitively long time to calculate.

Happily, there's a much better way of arriving at the result we want (sums over the cokurtosis tensor's dimensions). First, a few definitions:

\mathbf{X}: centered, standardized N_e \times N_r matrix of Augur event reports (one row per event, one column per reporter; each reporter's column is centered and standardized)
\mathbf{K}: cokurtosis tensor
N_e: number of events
N_r: number of reporters
\mathbf{x}_i: the i^\mathrm{th} row of the centered data matrix \mathbf{X}, i.e., every reporter's (centered, standardized) report on event i.

 \mathbf{K} is the cokurtosis tensor, which has dimensions  N_r \times N_r \times N_r \times N_r ; visually, this would be a tesseract (a 4-D cube).  \mathbf{K} is calculated as:

 \mathbf{K} = \frac{1}{N_e} \sum_{i=1}^{N_e} \left( \mathbf{x}_i \otimes \mathbf{x}_i \otimes \mathbf{x}_i \otimes \mathbf{x}_i \right)

where \otimes is the tensor product. However, we're not actually interested in \mathbf{K} itself; our goal is instead to collapse \mathbf{K} to \mathbf{c}, the nonconformity vector:

 c_i = \sum_{j=1}^{N_r}\sum_{k=1}^{N_r}\sum_{\ell=1}^{N_r} K_{ijk\ell}

The nonconformity vector is the heart of Augur's decentralized lie detection mechanism. The elements of \mathbf{c} quantify each reporter's contribution to the overall cokurtosis. High values indicate a large contribution -- and therefore a high likelihood of cheating. I've recently spent some time doing quantitative comparisons between several different lie detection algorithms, and my conclusion so far is that by far the strongest lie detector is simply how much each user contributes to the cokurtosis tensor.

(My next blog post will go into detail on how different mechanisms are compared with one another. Here is the very short version: treat the lie detector algorithms as binary classifiers, run a bunch of Monte Carlo simulations, then see which algorithm has the best Matthews correlation coefficient. Spoiler: the cokurtosis method kicks butt.)
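
(For reference, the Matthews correlation coefficient itself is easy to compute. The sketch below is a generic implementation of the metric -- it is not taken from the Augur/pyconsensus codebase, and the function and argument names are just illustrative.)

```python
import math

def matthews_corrcoef(predicted_liar, actual_liar):
    """Matthews correlation coefficient for binary labels (True = flagged as liar)."""
    pairs = list(zip(predicted_liar, actual_liar))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```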

\mathbf{K} gives Augur's lie detector a great deal of resolving power, but there's a catch: calculating the nonconformity vector by building the cokurtosis tensor, then summing over its dimensions, is prohibitively expensive. Naive construction of  \mathbf{K} requires \mathcal{O}(N_r^4) operations.
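
To make the brute-force approach concrete, here's a minimal NumPy sketch (mine, not Augur's production code) that builds \mathbf{K} exactly as defined above and then sums over three of its four indices. It assumes the report matrix has one row per event and one column per reporter, as in the definitions above:

```python
import numpy as np

def naive_nonconformity(X):
    """Brute-force collapse of the cokurtosis tensor.

    X: centered, standardized reports, shape (num_events, num_reporters).
    Returns c: one nonconformity score per reporter.
    """
    num_events, num_reporters = X.shape
    K = np.zeros((num_reporters,) * 4)
    for x in X:                                    # x = one event's reports
        K += np.einsum('i,j,k,l->ijkl', x, x, x, x)
    K /= num_events
    return K.sum(axis=(1, 2, 3))                   # sum over three of the four indices
```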

However, it should be possible to bypass this expensive operation, as we don't do anything with the elements of \mathbf{K} aside from sum over them. (And, the final dimension, which is not summed over, is just \mathbf{c}.) After spending some time writing out these sums, something occurred to me that was in retrospect pretty obvious: \mathbf{K} is symmetric, so the collapsed indices should contribute the same way to every element of \mathbf{c}.

After tinkering a bit with the algebra, it became clear what this "same way" is: because \mathbf{K} is just a (scaled) sum of the outer products \mathbf{x}_i \otimes \mathbf{x}_i \otimes \mathbf{x}_i \otimes \mathbf{x}_i, the three collapsed sums factor out, leaving the sum across each row of the data matrix, cubed:

 r_i = \left( \sum_{j=1}^{N_r} X_{ij} \right)^3, \qquad i = 1, \ldots, N_e

(The same trick collapses a joint moment tensor of any order, not just the fourth: the row sums are simply raised to a power one less than the tensor's order.)

So, how do we actually calculate the vector \mathbf{c}? Notice that, if we break apart the brute-force calculation of \mathbf{c}, the i^\mathrm{th} element of \mathbf{c} is just the dot product of this vector of cubed row sums, \mathbf{r}, with column i of the data matrix \mathbf{X} (up to the constant factor 1/N_e, which doesn't affect the relative scores). Since this dot product is carried out for every column, \mathbf{c} can be calculated using a simple matrix product:

 \mathbf{c} = \mathbf{r}^T \mathbf{X}
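
In NumPy terms, the whole calculation collapses to a couple of lines. This is a sketch under the same assumptions as before (events as rows, reporters as columns); it keeps the 1/N_e factor so that it matches the brute-force collapse exactly, and the `order` parameter reflects the generalization mentioned above:

```python
import numpy as np

def fast_nonconformity(X, order=4):
    """Collapse an order-`order` joint moment tensor without ever building it.

    X: centered, standardized reports, shape (num_events, num_reporters).
    order=4 gives the cokurtosis-based nonconformity vector c.
    """
    r = X.sum(axis=1) ** (order - 1)   # row (per-event) sums, raised to order - 1
    return r @ X / X.shape[0]          # c = r^T X / N_e

# Quick sanity check against the brute-force version:
#   X = np.random.randn(20, 5)
#   X = (X - X.mean(axis=0)) / X.std(axis=0)
#   np.allclose(fast_nonconformity(X), naive_nonconformity(X))  # -> True
```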

The time complexity of this calculation should be linear with respect to both N_e and N_r. To verify this numerically, I ran a set of Monte Carlo simulations, where the numbers of events and reporters were gradually cranked up from 10 to 100. Each data point shown here is the average time elapsed over 250 simulations, and the error bars represent \pm the standard deviation of the mean:

Cokurtosis timing over 250 simulations

As expected, the cost of calculating \mathbf{c} this way grows only linearly with the number of events and with the number of reporters -- \mathcal{O}(N_e N_r) work in total, versus \mathcal{O}(N_r^4) for building the tensor -- a huge improvement from where we started, and fast enough for use in decentralized systems!

Descent

Beware that, when fighting monsters, you yourself do not become a monster...
For when you gaze long into the abyss, the abyss gazes also into you.

- Friedrich Nietzsche

A decentralized lie detector

Barack Obama won the 2012 U.S. Presidential election.

We know this is true, because the election's already been run, and the winner is known. You know it; I know it. But, the blockchain doesn't know it. Blockchains don't know anything about the external world, except what we tell them. So, how do we tell the blockchain who won the Presidential election?

Since everyone knows the answer, one really easy way to get the election results into the blockchain is to just ask people about it: "Who won the election?" People punch in the winner's name, and now the blockchain is knowledgeable and happy.

But, what if people don't all tell the blockchain the same thing?

Well, if they're honest, they should! After a Presidential election, it's really, really easy to find out who the winner was: you probably already know it, and if you don't, you just type "US President election winner" into Google, and, half a second later, you've got the answer. Furthermore, if I ask you and 100 other people, I shouldn't get 100 or even 10 different answers; if you're all honest, you should all say the same thing.

The trickier question is, what if people aren't honest? Suppose you're in a classroom with 50 honest people and 50 lazy jerks. If you ask the group, "What is the temperature outside right now?", the 50 honest people will look at the thermometer you've conveniently hung outside -- which reads 65 degrees Fahrenheit -- and all report that the temperature is 65 degrees Fahrenheit. The 50 lazy jerks are too lazy to check the thermometer, so they just make up a number. Notice that it's easy to figure out what the correct answer is in this case: the lazy jerks very likely did not all pick the same number, so the most common answer is most likely to be correct. This is our guiding light for lie detection:

Lies are more variable than truths.

Taking this statement at face value, we could design a system that sifts out truths from lies by just accepting the most common answer as correct. In fact, this is the basic idea behind Schellingcoin, which takes the median answer to be correct. The difficulty with the Schellingcoin mechanism is that (as its author, Vitalik, notes) it is susceptible to collusion attacks. The simplest attack: if a group of dishonest users controls 51% of the coins, it controls 100% of the outcomes. For example, suppose 51 out of 100 users had a million dollars riding on Mitt Romney -- one hell of an incentive for them to make the consensus algorithm declare Romney the winner.

What if the dishonest group controls 51% of the reports, but has to vote on many events at the same time? Before, each user's answer was just a number. Now each user's input is not just a single number, but a series of numbers -- a signal -- with a value for every event. There might be a 51% group of disgruntled Romney supporters willing to lie about the election, but what if there were 19 other events they had to simultaneously report on? The chances are much lower that the members of this dishonest group are willing and able to perfectly coordinate their lies across so many different events!

In other words, "lies are more variable than truths" is even more true if we consider the users' reports across many events. Put another way, by increasing the number of times each voter's signal is sampled, we greatly increase the resolving power of the Schellingcoin idea. This idea is the foundation of Augur's event resolution mechanism (as well as that of Truthcoin, Augur's theoretical inspiration).

In Augur, truth equals consensus. As such, the Augur oracle is intended to be used for events which are easily and objectively determinable after the event has occurred. (For example, the winner of an election.) However, there are many cases where consensus might not reflect the truth, such as events where

  • ...there is ongoing controversy about what happened (e.g., "Malaysia Airlines Flight 370 was brought down by terrorists")
  • ...the outcome is subjective (e.g., "was Carter a good President?")
  • ...the outcome is unreasonably difficult to determine (e.g., "what is President Obama's checking account balance?")

These questions are not good candidates for Augur! In fact, all questions include a "this was a bad question" answer, in case a user is asked to report on an ill-defined or unanswerable event. (Of course, Augur's oracle can dutifully report the consensus in these cases -- and the consensus very often will be "this was a bad question" -- but it is up to you, the user, to judge whether that consensus accurately reflects the truth.)

Note: The following sections are a bit technical. If you're not familiar with eigenvalues and eigenvectors, it's worth reading a primer on them -- and on how they're used in Principal Component Analysis -- before continuing.

Searching for lies in the matrix

Here is a table of reports, in which each user's answers are a row, and each column represents an event:

Example report matrix: 50 users, 25 events

There are three questions consensus needs to address:

  1. How do you determine the amount that each user contributed to the overall variability in this table?
  2. How do you determine what the correct event outcome actually was?
  3. How do you incentivize people to report honestly?

First, let's nail down exactly what we mean by "variability". In statistics, variance quantifies how widely dispersed a set of values is; i.e., if most users agree on an event's outcome, that event has a low variance. Covariance is how much two sets of values (i.e., two events) vary together. So, roughly speaking, "variability" is variance + covariance -- this isn't perfect, but let's roll with it for now.

Reputation -- a tradeable token -- is needed to report on events. Each user's report is weighted by the number of Reputation tokens the user has. Reports are made on events (after they expire) at fixed, pre-defined intervals. (In this simple example, each user starts with a single Reputation token.)

Suppose there are E total events being reported on by N users, and let \mathbf{r} be a column vector of Reputations, normalized so that its elements sum to 1. We first take the table \mathbf{B} of reports and construct its weighted covariance matrix, where the weights are Reputations. The centered table (the table with each event's weighted mean value subtracted out) \mathbf{X} is:

 \mathbf{X} = \mathbf{B} - \mathbf{1} \mathbf{r}^T \mathbf{B}

where \mathbf{1} is a column vector of ones and T denotes the transpose operation (swapping columns and rows). The ordinary (unweighted) covariance matrix is given by multiplying the centered data matrix and its transpose; similarly, the weighted covariance matrix is:

 \mathbf{\Sigma} = \frac{\mathbf{X}^T \mathbf{R} \mathbf{X}}{1-\sum_i^{N}r_i^2}

where \mathbf{R} has the elements of \mathbf{r} on its diagonal and zeros elsewhere, \mathbf{R}_{ij} = \delta_{ij} r_i.

Each row (and column) in \mathbf{\Sigma} corresponds to the variability across users, within a single event. Since there are E events, \mathbf{\Sigma} is an E \times E matrix. Next, we diagonalize \mathbf{\Sigma},

\mathbf{\Sigma} = \mathbf{S} \mathbf{\Lambda} \mathbf{S}^T

revealing \mathbf{\Lambda}, an E \times E matrix with \mathbf{\Sigma}'s eigenvalues on its diagonal, and zeros elsewhere:

\mathbf{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_E \end{bmatrix}

The eigenvalues are sorted in descending order, \lambda_i \geq \lambda_{i+1}.

The eigenvector \mathbf{s}_1 associated with the largest eigenvalue \lambda_1 is the principal component, which is a unit vector oriented in the direction of as much of the report matrix's variability as can be captured in a single dimension. This direction is a new coordinate axis. By projecting the (centered) data onto this axis, we obtain a non-conformity score for each user:

 \mathbf{c} = \mathbf{X} \mathbf{s}_1

The score c_i then tells us user i's contribution to the direction of maximum variance. Using these scores for event consensus is called Sztorc consensus (named after its ruthlessly strategic and paranoid inventor, Paul Sztorc). These scores are used to reward users who reported with the consensus (low scores) and punish those who voted against it (high scores).
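
Here's a minimal NumPy sketch of the pipeline above -- reputation-weighted centering, weighted covariance, first eigenvector, projection. It's an illustration under the conventions used in this post (reports with one row per user, one column per event, Reputation normalized to sum to 1), not a drop-in replacement for pyconsensus:

```python
import numpy as np

def sztorc_scores(B, rep):
    """Nonconformity scores from the first principal component.

    B:   report matrix, shape (num_users, num_events) -- one row per user.
    rep: Reputation vector, normalized so that rep.sum() == 1.
    """
    X = B - rep @ B                                             # subtract each event's weighted mean
    Sigma = X.T @ (rep[:, None] * X) / (1 - np.sum(rep ** 2))   # weighted covariance (E x E)
    eigvals, eigvecs = np.linalg.eigh(Sigma)                    # eigenvalues in ascending order
    s1 = eigvecs[:, -1]                                         # principal component
    return X @ s1                                               # c = X s_1
```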

Scores vs reputation change

Shown here is the user's non-conformity score (on the x-axis) versus their gain (or loss) in Reputation (on the y-axis). Blue indicates users that told the truth; red indicates dishonest users. The size of the circle indicates the number of users that reported identical results. As expected, by far the largest single circle is that of the honest users, and (in this simple example) those users were rewarded accordingly, while the dishonest users all lost Reputation.

The second eigenvector, and beyond!

For large and/or complicated ballots, intuitively, you might expect to need more than a single direction to really capture what's going on in the ballot. If there are in fact multiple directions required -- multiple coordinate axes -- then we need to extract more than just the first eigenvector from the covariance matrix.

Empirically (in simulations, anyway), this seems to be the case. Often the first eigenvector accounts for 40% or less of the total ballot variance, as shown in this breakdown for the simple example shown above:

Scree plot

In this plot, the bars represent the percentage of variance explained by each component. The blue line is the cumulative percentage explained. This plot also suggests a simple way of incorporating more directions into our results: just keep adding eigenvectors until we reach some pre-defined threshold!

Setting a fixed variance threshold (\alpha) allows us to extract the number of components (n) needed to explain at least \alpha \times 100 \% of the report matrix's total variance. The non-conformity vector \mathbf{c} is calculated by summing together the projections onto the first n components, each of which is weighted by its eigenvalue \lambda:

 \mathbf{c} = \mathbf{X} \sum_{i=1}^n \lambda_i \mathbf{s}_i

This is the fixed-variance threshold method, first described in the Augur whitepaper. Although this method has improved resolving power (see below), it has the downside of being more computationally demanding than Sztorc consensus, as it requires the computation of the first n eigenvectors.
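
A sketch of the fixed-variance threshold method, under the same assumptions as the previous snippet (again, an illustration rather than the pyconsensus implementation):

```python
import numpy as np

def fixed_variance_scores(B, rep, alpha=0.9):
    """Eigenvalue-weighted projections onto the components explaining
    at least alpha of the report matrix's total variance."""
    X = B - rep @ B
    Sigma = X.T @ (rep[:, None] * X) / (1 - np.sum(rep ** 2))
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    idx = np.argsort(eigvals)[::-1]                    # sort into descending order
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    explained = np.cumsum(eigvals) / eigvals.sum()
    n = int(np.searchsorted(explained, alpha)) + 1     # components needed to reach alpha
    return X @ (eigvecs[:, :n] * eigvals[:n]).sum(axis=1)   # c = X sum_i lambda_i s_i
```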

However, we're really not interested in the eigenvector basis per se: in fact, we only care about the scores, which tell you how much each user contributed to the directions of greatest variance.

The eigenvectors -- the new coordinate axes -- are obtained by diagonalizing the per-event covariance matrix, which is an E \times E matrix, where rows/columns account for each event's covariance across all users. But, if all we care about is each user's relative contribution to the overall variability, this information is available more directly by constructing the per-user covariance matrix:

 \mathbf{\Sigma}' = \frac{\mathbf{X}' \left(\mathbf{R} \mathbf{X}'\right)^T}{1-\sum_i^{N}r_i^2}

where the row-centered report matrix is \mathbf{X}' = \mathbf{B} - \frac{1}{E} \mathbf{B} \mathbf{1} \mathbf{1}^T .

\mathbf{\Sigma}' is an N \times N matrix, where each row/column represents a user's covariance, taken across all events. Summing over one index of this matrix, each user's contribution to the total variance and covariance is the ratio of that user's sum to the grand total:

 c_k' = \frac{\sum_{i=1}^N \mathbf{\Sigma}_{ik}'}{\sum_{i=1}^N \sum_{j=1}^N \mathbf{\Sigma}_{ij}'}
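
And a sketch of the per-user covariance method (same caveats as above; note that no diagonalization is needed here):

```python
import numpy as np

def covariance_ratio_scores(B, rep):
    """Each user's share of the total variability, from the per-user
    weighted covariance matrix (no eigenvectors required)."""
    Xp = B - B.mean(axis=1, keepdims=True)                        # row-center: X' = B - (1/E) B 1 1^T
    Sigma = Xp @ (rep[:, None] * Xp).T / (1 - np.sum(rep ** 2))   # X' (R X')^T / (1 - sum r_i^2)
    contrib = Sigma.sum(axis=0)                                   # sum over the first index, as above
    return contrib / contrib.sum()
```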

Test driving the algorithms

A valuable rule-of-thumb I picked up in graduate school: when in doubt, simulate!

I'm testing and comparing the different event resolution algorithms using Monte Carlo simulations. If you're not familiar with Monte Carlo, it basically means: create a bunch of randomized datasets (with known outcomes), feed them through the consensus machinery, and see what comes out the other side.

First, create a vector of known answers (to True, False, Indeterminate questions) with random entries. There are two kinds of reporters:

1. Users can be honest, in which case they just copy the correct answers down.

2. Or dishonest/lazy/ignorant, in which case they just pick answers at random on their ballot.

Dishonest users also have a small -- in this example, 30% -- chance to copy one another. (This is an adjustable parameter, \gamma = 0.3.) Imagine that the dishonest/lazy/ignorant user calls up their buddy and says, "Hey Bob, can you just send a screenshot of your ballot over, I don't feel like filling this stupid thing out, I could be training for the upcoming hot dog eating contest instead." Dishonest users also have a \gamma^2 = 0.09 (9%) chance to copy in triples, and a \gamma^3 = 0.027 (2.7%) chance to copy in quadruples.
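
For illustration, here's a toy version of how such a report matrix might be generated. The real simulations live in Simulator.jl (in Julia); this Python sketch simplifies the copying logic, and the names and defaults are made up for the example:

```python
import numpy as np

def simulate_reports(num_reporters=50, num_events=25, pct_liars=0.4,
                     gamma=0.3, seed=0):
    """Toy report-matrix generator: honest users copy the known answers,
    liars answer at random and sometimes copy each other's ballots.

    Answers are coded as 1 (True), -1 (False), 0 (Indeterminate)."""
    rng = np.random.default_rng(seed)
    answers = rng.choice([1.0, -1.0, 0.0], size=num_events)   # known correct outcomes
    reports = np.tile(answers, (num_reporters, 1))            # everyone starts honest
    liars = rng.choice(num_reporters, size=int(pct_liars * num_reporters),
                       replace=False)
    for i in liars:                                           # liars fill in random ballots
        reports[i] = rng.choice([1.0, -1.0, 0.0], size=num_events)
    for i in liars:                                           # collusion: copy a buddy's ballot
        others = liars[liars != i]
        if len(others) and rng.random() < gamma:
            reports[i] = reports[rng.choice(others)]
    return reports, answers, liars
```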

First, here are the results of a set of 500 simulations, for the three event resolution algorithms discussed so far in this post:

1. Sztorc, which uses only the users' scores on the largest eigenvector.

2. Fixed-variance is a weighted sum of eigenvectors up to a fixed threshold of 90% variance.

3. Covariance uses ratios from the per-user covariance matrix, and does not require matrix diagonalization.

The first metric I used to compare the algorithms is the percentage of "beats". A "beat" is a dishonest user that received a larger reward than the honest users. In this plot, the y-axis is % of dishonest players that were beats. The x-axis is the % of dishonest players in the simulation (this is the same for all plots). For this metric, lower scores are better -- ideally, you'd want no beats at all!

Monte Carlo simulation results: % beats

Next is the "liars' bonus", i.e. the amount of bonus Reputation dishonest users received, minus the amount received by honest players. As before, lower = better:

Monte Carlo simulation results: liars' bonus

Finally -- and in some ways the most important -- is the percentage of outcomes that the algorithm got right. (Since these are just tests, I define the "correct" answer beforehand, so I know whether the consensus lands on the correct result or not!) Here, higher percentages are better:

Monte Carlo simulation results: % correct

As expected, the fixed-variance method (yellow) significantly outperforms the Sztorc method (blue) across-the-board, due to the extra information contained in the additional eigenvectors. More surprising is that it also outperforms the covariance ratio method. This is possibly due to the eigenvalue weightings, which are not available without diagonalizing the covariance matrix.

Cokurtosis: collusion's fat tail

One of the shortcomings of a covariance-based technique is that it is based solely on second-order statistics (i.e., variance/covariance), and ignores higher-order statistics, such as skewness, kurtosis, etc. I think this could be a problem, since "fat tails" are everywhere in social systems, because herd mentality = positive feedback = fat tails.

In particular, the kind of cheating we're most concerned about is collusion, which is essentially just users copying each other's ballots. My intuition is that this copying is very likely to manifest as a sort of "rich get richer" behavior: the more users have copied a particular ballot's answers, the easier it is for still more users to find -- and therefore copy -- the ballot. In other words, exactly the kind of directly proportional feedback that gives rise to power-law behavior in loads of different systems.

If this is right, then this sort of cheating/collusion should create excess among-user kurtosis/cokurtosis -- i.e., colluding users should contribute extra to the fourth joint central moment ("cokurtosis") tensor. This means you can sum across the tensor's fibers to score the users, in the same way that you can score them directly from the per-user covariance matrix.

To investigate this, I put together a small Julia library for building joint central moment tensors, JointMoments.jl, and conducted the same tests as above (for the second-order-only methods):

Comparison of simulations: cokurtosis has the greatest resolving power

So far, it seems like this cokurtosis tensor-based method has better resolving power than any of the other methods! But, this is just the result of a small preliminary simulation -- lots more testing to do before this method is declared the winner for real.

One thing that will be important is throwing more (and different) "liar" patterns at each algorithm and seeing how they hold up. Right now, the "dishonesty" in the simulations is fairly tame, and really is intended to capture apathy/laziness more than actual focused attack attempts.

My prediction? Given the winners of the above comparisons, I suspect the best results will be from combining the cokurtosis and fixed-variance threshold methods.

All the simulations and algorithms discussed here are 100% open-source, so if you have suggestions, criticism, or just want to jump in and start tinkering, please do!

  • pyconsensus: standalone Python implementation of Augur's oracle
  • Simulator.jl: numerical simulations, statistics and plotting tools

Reposted at Augur Blogger.

Edited on 3/4/2015 to include a short discussion of what Reputation is, as well as the kinds of events that Augur's oracle is meant to address.