Secret ballot and protest voting on Wikipedia

Ballots are typically secret, unless the voter represents a constituency. So, US congressional elections are secret, but congresspeople’s voting records are open. One reason for secrecy is to avoid untruthful voting: telling everyone about your choice makes that choice dependent on others’ opinions, and the election results deviate from the optimal ones.

Secrecy sets the voter free to vote how she thinks is best for her. For example, social ratings are biased in favor of the top ratings because the voter prefers to keep her friends happy by not downvoting their content. That doesn’t necessarily distort the ranking of content, which is what matters, but it does change average ratings.

The alternative view is the troll theory: secret voting undermines responsibility because no one will find out what the voter did in the voting booth. Let’s check.

This is Wikipedia’s article feedback voting:

I used a sample of 12 million votes from a one-year dump (thanks to Matthias Mullie from the Wikimedia Foundation for pointing me to this file). Here, zero means no vote at all, and anonymous voters tend to leave more gaps behind.

Registered users serve as a control group for anonymous (secret) voting. They are more responsible, because registration is time consuming and indicates deeper involvement in the community. Votes cast:

Compared to YouTube, the scale is more informative in general: voters use a larger fraction of mid-range ratings. Anonymous visitors set the lowest rating more frequently, but the difference is small.
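The comparison itself is a simple split of rating shares by voter type. Here is a minimal pandas sketch, assuming the dump sits in a CSV with hypothetical columns is_anonymous and trustworthy (the real schema differs):

```python
import pandas as pd

# Hypothetical file and column names; the actual dump schema differs.
votes = pd.read_csv("article_feedback_sample.csv")

# Zero means "no vote", so keep only cast votes, as in the charts above.
cast = votes[votes["trustworthy"] > 0]

# Share of each rating (1-5) within the anonymous and registered groups.
shares = (
    cast.groupby("is_anonymous")["trustworthy"]
    .value_counts(normalize=True)
    .unstack()
)
print(shares.round(3))
```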

Descriptive stats

In addition, summary statistics for the sample:

Variable      Obs      Mean      Std. Dev.  Min  Max
trustworthy   2982561  2.956455  2.015107   0    5
objective     2982550  2.762515  2.073398   0    5
complete      2982551  2.591597  1.940069   0    5
well_written  2982552  3.177147  1.895086   0    5

And correlation across rating dimensions:

              trustworthy  objective  complete  well_written
trustworthy   1
objective     0.6416       1
complete      0.5856       0.7031     1
well_written  0.3977       0.5515     0.5309    1
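For reference, a pandas sketch that would reproduce both tables above; the column names follow the output, the file name is hypothetical:

```python
import pandas as pd

votes = pd.read_csv("article_feedback_sample.csv")  # hypothetical file name
dims = ["trustworthy", "objective", "complete", "well_written"]

# Descriptive statistics (count, mean, std, min, max) per rating dimension.
print(votes[dims].describe().T[["count", "mean", "std", "min", "max"]])

# Pairwise Pearson correlations across the dimensions.
print(votes[dims].corr().round(4))
```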

The average rating set on a given day:
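A sketch of the daily mean behind this chart, assuming the dump carries a timestamp column (hypothetical name) and using the trustworthy dimension as an example:

```python
import pandas as pd

votes = pd.read_csv("article_feedback_sample.csv", parse_dates=["timestamp"])

# Average of the cast votes (zeros excluded) per calendar day.
daily = (
    votes[votes["trustworthy"] > 0]
    .set_index("timestamp")["trustworthy"]
    .resample("D")
    .mean()
)
print(daily.head())
```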

The century of cinema

A brief elaboration on the rating systems discussed before. Since mass voting creates a skewed distribution and uses only a fraction of the scale, web services should typically avoid simple aggregation of votes.
For example, here’s IMDb’s famous Top 250 distributed by release decade:

The top is biased in favor of recent movies, not least because the voters themselves represent a young audience.

Compare it with the film distribution from the lesser-known aggregator TSPDT:

The best films here are distributed roughly normally around the 1970s.
Which rating is better? Ideally, a rating system should minimize the difference between the rating you see before watching a movie and the rating you set after watching it. It works better when ratings can be conditioned on your preferences. (Recommender systems have the same goal, and that’s what the Netflix competition was about.)
TSPDT is based on critics’ opinions, while IMDb accepts anyone’s vote. Clearly, critics spend more time comparing and evaluating cinema. But their tastes deviate from those of the public, and their ratings may be only second-best predictors of your own rating.
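One way to make the “minimize the difference between the rating seen before and set after” criterion concrete is the mean absolute error between the displayed score and the viewers’ own post-watch ratings. The numbers below are purely illustrative:

```python
import numpy as np

shown_rating = 8.3                       # the score displayed before watching
post_watch = np.array([9, 8, 4, 10, 7])  # what those viewers rated afterwards

# Mean absolute error: the smaller it is, the better the displayed rating
# predicted each viewer's own judgment.
mae = np.abs(post_watch - shown_rating).mean()
print(f"MAE of the displayed rating: {mae:.2f}")
```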

In politics, the role of authoritative recommenders typically belongs to journalists. And as Gentzkow et al. note in “Competition and Ideological Diversity,” people prefer like-minded newspapers. So both journalists and critics have access only to like-minded subsets of the population. And being informed does little to help you guide others’ choices of politicians and movies, unless you recommend something your readers already agree with.
But information remains an important dimension of any vote. Banerjee and Duflo, in “Under the Thumb of History?”, survey papers showing that voters choose different candidates when they have more information. In that sense, critics may improve everyone else’s choices.
The problem may lie not in information and preferences themselves, but in poor matching between experts and the segments of the public with similar preferences. Web services could then focus on improving this matching, since they have access to both groups. Their recommender systems may recommend products, but it also makes sense to recommend specific experts. Ratings in that case shouldn’t be taken literally: they’re only a means of matching experts with subgroups of the public on issues where they agree with each other.

Pitfalls of rating systems

A few years ago, YouTube changed its rating mechanism from a five-star scale to an upvote–downvote system. And it makes sense once you look at the typical distribution of ratings:

In most cases, users set either 1 or 5. That’s not very informationally efficient, but the fact is that users were reluctant to rate videos across the entire scale.

This J-shaped distribution creates problems because the mean here makes no sense. When a website reports an average rating of 3.0, it means one of two things: either one person rated the video at 1 and another at 5, or both rated it at 3, which is almost never the case.
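The arithmetic is trivial, but it is worth seeing that both vote patterns report the same number:

```python
polarized = [1, 5]   # one viewer hated it, one loved it
unanimous = [3, 3]   # both found it mediocre (rarely the case)

print(sum(polarized) / len(polarized))  # 3.0
print(sum(unanimous) / len(unanimous))  # 3.0
```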

In an economy guided by ratings, the difference between these two interpretations is large and unpleasant. Since no one rates stuff around the mean, a decision based on this mean is uninformed. In the end, you watch something that you’d later rate at 1 or 5, not 3. It’s as if you ordered a steak and the waiter brought you sushi.

The worst thing about this risk is that it’s implicit. Users look at ratings to reduce the risk of making a wrong choice, but instead they gamble between 1 and 5. Fortunately, the ratings aren’t entirely random. They’re conditioned on things we observe, like gender, age, and interests. The means then may start working. Just check whether those 1s and 5s were set by distinct demographic groups.
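A toy illustration of that check, with made-up data and column names: the overall mean says 3.0, while the subgroup means tell the real story.

```python
import pandas as pd

# Made-up ratings: one demographic group loves the video, another hates it.
ratings = pd.DataFrame({
    "age_group": ["18-24", "18-24", "18-24", "45-60", "45-60", "45-60"],
    "rating":    [5, 5, 4, 1, 2, 1],
})

print(ratings["rating"].mean())                       # 3.0, the misleading overall mean
print(ratings.groupby("age_group")["rating"].mean())  # ~4.7 vs. ~1.3
```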

Of course, now it takes hundreds of 1s and 5s, because the degrees of freedom go down with each factor we add to the equation. How do we get more ratings?

The solution is exactly what YouTube did: replace the five-star scale with a binary choice. Users don’t like investing time in thinking about the proper rating, so a thumbs up or down reduces decision fatigue.

More ratings allow computing means for subgroups of users. These subratings become more relevant for those who search for content by its rating. Though YouTube hasn’t introduced customized ratings yet, that’s an option for many web services relying on user feedback.

While Uber and Fiverr can improve their rating systems by reducing them to binary choices, a scale is still a good choice for, say, IMDb. When you’ve spent two hours watching a movie, you put more thought into the rating than after YouTube’s typical three-minute video. And then multiple peaks emerge for controversial movies:

You have the mean and median close to each other in a sort of Poisson distribution. The other two peaks sit at the extremes of 1 and 10. So you need more than two grades on the scale.

Conventional hits have the YouTube pattern though:

Which again looks like a Poisson distribution with a disproportionate number of 1s.

In the end, a good rating system has to balance the number of votes it can collect against the size of the scale.