Vice:Motherboard carried a story two days ago titled “Reddit Is Working on an Entirely New Front Page Algorithm”. It covers recent dissatisfaction among Redditors with perceived changes in how articles reach the front page. It’s a fine title, but I think mine is at least as accurate.
The conclusion of the Vice article seems plausible: that the algorithm hasn’t changed, but growth of the site and the usage patterns that came with it caused an imbalance in the factors used for scoring. But the real story lies in what Reddit had to say about the scoring problems.
“Users have been complaining about the front page being stale, and they might be right,” Steve Huffman, Reddit’s CEO, told me in a phone interview. “I’ve noticed it too. We didn’t change anything, but it feels slower.”
Asked about it a month ago, the CTO (that’s Chief Technical Officer) admitted he didn’t know. He relayed the question to a developer, who responded with gaslighting:
…whatever you’re perceiving is almost certainly imaginary in terms of change to the site. Software wise, absolutely nothing has changed.
This response isn’t hugely surprising. After all, a serious bug in the “hotness” algorithm’s downvote scoring existed for over 6 years before being patched, including 2 years after Reddit was notified of the bug.
This latest scoring problem has been growing for a long time, even if public uproar only started recently. Besides a conceptual analysis of the scoring algorithm, there are other ways this could have been detected. If Reddit tracked some basic metrics about front-page stories, such as “average/max story age” and “time to reach front page”, they could have identified the changing nature of the site long before it became obvious to the public. But instead it grew to a point where it was identifiable even with a “black box” view of the site, and where the first official acknowledgement came from the CEO eyeballing the site in his browser.
I’m 90 percent sure it’s as simple as that. The other 10 percent is, maybe there’s something else going on.
No one at Reddit fully understands how their front page algorithm works. Reddit is the top and only source of news for thousands (millions?) of people, and a drastic shift occurred in the selection of which news is made visible, and no one at Reddit had any control over this process.
None of this is meant to particularly knock the Reddit developers’ competence1. The only unusual thing about their situation is that most of Reddit’s code is public2, allowing outsiders to analyze the situation and make more educated guesses about what’s going on.
This exact same scenario is playing out at countless other tech companies, only shrouded in secrecy and misdirection. For most of them, we will never even know that something is wrong, since there is more to be learned from what we don't see than from what we do. Most of them will never answer for their mistakes, nor correct them.
Google decides what we find when we search. Facebook decides which of our friends we speak to. LinkedIn directs the course of our professional lives. Netflix tells us how much we will like a movie. Hulu decides which commercials we see. Twitter, if they ever manage to stop patching leaks long enough to build anything new3, will definitely start controlling their stream algorithmically. And I bet none of these companies has a single employee with a complete and 100% accurate understanding of how their respective algorithms work in practice.
These are only the tech “giants”. A second wave of algorithmic takeover is underway among a thousand smaller companies that interact with every other facet of our lives. Algorithms decide which coupons print for you at Target. Algorithms decide how long you brush your teeth. Algorithms decide which streets you travel on. Algorithms decide which words to suggest in your phone’s autocorrect. Algorithms may soon decide which tech startups get funded to build more algorithms. And as the Internet of Things expands, algorithms are going to push their way into our shopping lists and water bottles and a hundred other tiny facets of our lives4, each one coded up by a different company working mostly in isolation and without outside review.
These companies don’t all attract the top tech talent. They don’t all have enormous technical budgets. They don’t all have company culture conducive to good engineering practices. So when these companies hand over a piece of their business to an algorithm, odds are good they’re making uneducated guesses and testing them poorly. If Reddit cannot write an algorithm that functions properly in the first place, nor maintain enough insight into its behavior to know when things go wrong, what chance do all the others have?
You’ve probably already heard concerns about the “filter bubble” that shapes what we see online based on our past preferences. But our problems extend beyond these self-directed blinders. A world is looming where our lives are not only controlled by algorithms, but where even their creators lack basic visibility into how these algorithms work or if they are behaving as expected.
I’m no King Ludd urging you to smash the looms. Algorithms can do great things. They can produce incredible effectiveness, beauty, or serendipity. But it’s time to recognize them not as magic, but as tools that do good or harm dependent on our skill in applying them.
One good step is the movement to recognize that algorithms can be discriminatory. It is important we establish that an algorithm is discriminatory if its results are discriminatory. This pushes responsibility back on to the administrator of the algorithm, and recognizes that they are not cleared of fault simply because they never thought to check.
Our legal system is also working through the question of legal responsibility for algorithmic market manipulation. The need to show "intent" in price-fixing cases is ripe for exploitation: a dishonest capitalist can deploy a seemingly-innocuous algorithm that just happens to converge on a certain behavior over time, which just happens to produce unfairly advantageous market conditions, all without the capitalist taking any overt action. Hopefully we can land on a stronger definition that doesn't let the owner off the hook just because the algorithm did the dirty work for them.
In the end, maybe the simplest way to think about our algorithms is like our pets. Yes, they respond to their environment and make decisions independently. But the owner is ultimately responsible for their behavior. When you adopt a dog you accept responsibility for properly feeding, housing, and training it, and you implicitly accept that failure to perform these duties would leave you legally liable. Let’s do the same for algorithms. You want one, you better care for it properly, or I’ll call Algorithm Control on you and get that thing carted off to a shelter.
Further reading: The “digital star chamber” by Frank Pasquale.
Though frankly, maybe they could start being nicer to outsiders who come to them with bug reports. Their track record at this point doesn’t really justify rudeness.↩
This is majorly to their credit, despite my other gripes.↩
If you want a top example of an important company with a staggering level of technical incompetence, look no further than Twitter. Twitter, who inadvertently coined the phrase “fail whale”. Twitter, who doesn’t understand hyperlinks. Twitter, who is still convinced that I am a small business instead of a person. When they build an algorithm to control your entire Twitter experience, how confident do you feel that they’ll get it right?↩
I’ve chosen examples that you might consider mundane because I think the mundane is plenty important. But if you need to feel more alarmed, companies are also eager to start using algorithms with serious consequences like deciding who gets a bank loan.↩
So you have a Rails app, and you wrote some tests because you’re a responsible developer. You want these tests to run quickly, so you want to use database transactions to handle database cleanup. Yes, really, you do.
Using transactions instead of deletion or truncation is the biggest and easiest performance win you will find in your test suite. I’ve seen this change shave off 5 seconds per test on a database with a very ordinary number of tables. Savings of 5 seconds per test, with a modest test suite, can easily equal savings of 30 minutes per test run. Now multiply that by every developer on your team, every time they need to run the tests and have nothing to do while sitting there waiting for the results. Yes, you really want to be using transactions.
There’s just one problem: you probably wrote some integration tests, because as previously mentioned, you’re a responsible developer. You probably used Capybara. And since this is 2015 and JavaScript is unavoidable, you’re probably using a JavaScript-enabled driver like Selenium or Webkit. So you mix these ingredients together and all hell breaks loose.
There are two problems that are teaming up right now to make your life miserable:
Database transactions aren’t shared between threads. Unfortunately, Capybara needs to use a separate thread to fire up the server that it’s going to run against. So you’ve got all your seed data inserted into the database, all wrapped up in a transaction in thread #1. But your server is running in thread #2, without access to that transaction, so all it sees is an empty database. So you run your test suite and everything fails because all the data is “missing”.
Now, it’s possible to fix this problem, which we’ll cover below. But when you do, you will run your tests again and they will fail even harder than before.
Turns out there’s a reason that threads don’t usually share database connections. Trouble arises most often in this scenario: your integration test performs an action, sees the results it was expecting, and passes. At this point, control of the test thread returns to your test cleanup, which will proceed to roll back the transaction.
However, the browser thread may not have finished yet. It's not at all uncommon to have an AJAX request or two still pending even after your test has passed. Maybe it's a follow-up action to what your test performed, like refreshing resource attributes. Or maybe it's something unrelated, like credentials verification. Whatever the case, the result is that this AJAX request hits your test server, which tries to access the database through the same connection that the other thread is trying to use to clean up.
Databases don’t like having multiple concurrent attempts to use a connection. When this happens, MySQL will give vague and varying errors, the most frequent of which look like this:
Mysql2::Error: This connection is in use by: #<Thread:0x0000000bb400b8>
PostgreSQL handles the situation even worse, and will silently hang in a way that may force you to kill your test suite and possibly restart the database daemon.
And to make it all even worse, these errors are not only hard to understand but also inconsistent to reproduce. You might get them once every 5 test runs. You might get them only when the full suite runs, but not with smaller subsets of tests. You might see them happen in only one environment (say, on the CI server but not on your development machine). It’s enough to make you consider a new career. North Dakota is hiring right now, FYI.
We’ve identified the problems, which in this case is well over half the battle. Now let’s fix them.
First we’ll need to share the database connection so everyone gets the same transaction. There’s a fairly standard way to do this, documented by Capybara, that involves dropping a monkey-patch into your test setup. Sure, it’s a little kludgy, but it’s only in your test suite so it’s not gonna hurt production.
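Capybara's documented patch overrides ActiveRecord's connection lookup to return a single shared connection. Here's a self-contained sketch of that idea with stand-in classes (FakePool and SharedPool are my own names for illustration, not real APIs):

```ruby
# No Rails here: FakePool and SharedPool are stand-ins that show the shape
# of the monkey-patch. By default every thread checks out its own
# connection; the patch makes connection lookup return one shared
# connection for all threads.
class FakePool
  def initialize
    @connections = {}
  end

  # Default behavior: a separate connection (and thus a separate
  # transaction) per thread.
  def connection
    @connections[Thread.current.object_id] ||= Object.new
  end
end

class SharedPool < FakePool
  attr_accessor :shared_connection

  # Patched behavior: hand every thread the same connection.
  def connection
    shared_connection || super
  end
end

pool = FakePool.new
a = Thread.new { pool.connection }.value
b = Thread.new { pool.connection }.value
a.equal?(b)  # => false: thread #2 can't see thread #1's transaction

shared = SharedPool.new
shared.shared_connection = shared.connection
a = Thread.new { shared.connection }.value
b = Thread.new { shared.connection }.value
a.equal?(b)  # => true: the server thread sees the test's transaction
```

The real patch does the same thing to ActiveRecord::Base: stash one connection in a class-level accessor, and return it from every connection lookup.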
Now the harder part: we need the tests to wait for the browser to finish any outstanding requests before they pass and continue on to the next step. We could throw a sleep 5 at the end of our test2, but remember, the whole point of this was to make the tests faster! Instead, the best solution is to ask the browser about pending requests, and wait until it reports that it's finished. There are a few different versions of this general idea floating around, but it gets tricky: there's no standard way to check pending requests, so you have to use whatever your JavaScript framework provides. And then you probably need a safeguard for pages where that framework isn't present. And may you be lucky enough to never test an app that uses several different frameworks in various combinations. And so on…
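Stripped of any particular framework, the general shape of that wait looks something like this (wait_for_pending and the fake probe are my own names; in a jQuery app the probe would be something like -> { page.evaluate_script('jQuery.active') }):

```ruby
require 'timeout'

# Poll a "pending request count" probe until it reports zero, or give up
# after a timeout. The probe is whatever your JS framework exposes.
def wait_for_pending(probe, timeout: 2, interval: 0.01)
  Timeout.timeout(timeout) do
    sleep(interval) until probe.call.zero?
  end
end

# Stand-in probe: a counter that pretends requests are draining away.
pending = 3
probe = -> { pending -= 1 if pending > 0; pending }
wait_for_pending(probe)
pending  # => 0
```

The timeout matters: if a request never finishes, you want a loud failure, not a test suite that hangs forever.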
I thought this should be easier, for everyone. So I made transactional_capybara to do just that. It bundles up the shared connection hack plus AJAX waiting logic for the more common JavaScript frameworks. And for a typical test suite, using it is as easy as adding something like this to your test helper:
require 'transactional_capybara/rspec'
Yeah. That good.
I free you from the dread of Capybara heisenbugs! I raise you from the agony of exacting parallelism management! I cast out the foul demons of database deadlock! Go forth and use transactions with Capybara, now and forever!3
Mushroom cloud icon designed by Gokce Ozan from the Noun Project.↩
For the record, I fully endorse using this method temporarily. It's a great way to confirm that the errors are, in fact, caused by pending requests. Throw a sleep at the end of your problem tests, and if the trouble goes away you know it's a race condition and you can carry on with this solution. If a sleep doesn't solve it, then you might have some other problem on your hands and this might not be the blog for you.↩
Just, whatever you do, don’t have your integration tests do some model-level database queries at the end to assert shit. That’s not how integration tests work, okay? And it’s going to break everything again and I won’t help you.
Okay fine I’ll help you. Just do five lashes of penitence, then call TransactionalCapybara::AjaxHelpers.wait_for_ajax(page) in between your browser commands and your weird database shit, and you should be fine.↩
Somehow, and I know this is hard to believe, my exposé of SiteKey’s failings six years ago wasn’t the final stake through its heart like I had hoped. SiteKey is still very much alive and just as useless as ever.
I hate it, and I hate putting up with their useless security theater1. But there is one bright spot: most SiteKey installations can be coaxed into giving you some gloriously weird pictures as your “secret image”. And if your institution makes use of the even-less-justifiable feature of setting a caption for your secret image, you can pair your bizarre image with an equally fitting caption. Your secret image is still useless as far as protecting your account, but at least it will make you smile.
Certainly I am taking a grave risk2 in sharing this private information with you. But the world needs to know. I present to you the various SiteKey images and captions I have used over the years:
The trick to getting this sort of result for yourself is simple. Most SiteKey systems offer you the ability to page through sets of images as you’re picking. But with a sort of adorable dedication to detail in their ultimately useless system, these are not discrete pages from a limited set of images. Each “page” is actually a randomized set pulled from some vast store of stock photos. So you can click the next page button as many times as you like4, giving you access to a virtually-unlimited set of images. With this much choice, it’s only a matter of time before some strange results appear.
For the caption, though, you’re on your own.
Anyone else think that the TSA’s expansion of Pre-Check is really just a way to phase out all their failed policies without ever admitting that any of it was a mistake?↩
I estimate that the security impact of making my secret images public reduces SiteKey’s effectiveness by roughly 50%, from an original value of 0% protection to a new value of 0% protection.↩
For some inexplicable reason, ING Direct implemented a very strict profanity filter on the secret caption. I could never figure that one out. It’s a secret phrase, right? Are they worried about my past self offending my future self?↩
I did encounter one SiteKey system that limited the number of available pages. However, this was merely a front-end constraint, and with a little bit of JavaScript tinkering I was able to generate new requests that the back-end happily fulfilled with more images.↩
fortune is a venerable Unix utility with a simple purpose. You call it and it returns a random quotation picked from a set of files. You can set this up to run when a terminal is opened, and this way you get a nice quote-of-the-day thing. fortune ships with virtually all Linux distributions. There’s just one problem: the default quote databases that ship with fortune, well, they stink.
Unix fortune is an intriguing window into the earlier days of the Internet, when neckbeards were still the dominant social group and BBS was still the dominant mode of communication. It is, unfortunately, also a reminder that casual sexism on the Internet didn’t start with Twitter. The quotes in the fortune databases are largely unchanged since the days when the Internet was a bunch of pasty white guys snickering over dirty jokes in chat rooms. It’s not just that much of the content is sexist (or racist) enough to be offensive; it’s that it’s not even funny-offensive. It’s the kind of content that derives all its value from being transgressive, and stripped of that, becomes banal.
Of course, you can avoid the databases of raunchy humor in fortune. Then what you’re left with is an endless stream of barely-amusing Larry Wall quotes, and a tiresome flood of atheist dogma that truly rivals /r/atheism in obnoxiousness1.
What I’m saying is fortune needs some new content.
Where to get some? Sure, I could trawl Wikiquote or one of the many other quote-on-demand services, but I find that people who go looking for quotes to compile almost always have terrible taste2. Instead, what if I were to find some content written by one of the most iconoclastic philosophers in history? One who produced several collections of aphorisms, a format perfectly suited to fortune? One with a glorious mustache?
Yes, I very much want to be greeted by a Friedrich Nietzsche aphorism every day when I open my terminal. That is what I want.
Luckily, fortune can be pointed at new database files from which to select a quote. The format of these files is quite simple: they are plain text, usually normalized to 80 columns wide, and each entry is followed by a % character alone on a line. Each of these files must be accompanied by a second file with a .dat extension; this second file is a binary blob created with the strfile utility, used to help with random access.
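As a sketch of that format (the file name and the two aphorisms are my own example, written to a temp directory just to be tidy):

```ruby
require 'tmpdir'

# Write a minimal fortune database: plain-text entries, each followed by
# a "%" alone on a line. Running `strfile` on the result would produce
# the companion .dat index that fortune needs for random access.
quotes = [
  "He who has a why to live can bear almost any how.",
  "Without music, life would be a mistake.",
]
path = File.join(Dir.tmpdir, "nietzsche")
File.write(path, quotes.map { |q| "#{q}\n%\n" }.join)
```

After running strfile against the file, fortune can select from it like any other database.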
I found a number of Nietzsche’s works available electronically and in the public domain3, which was perfect for my purposes. I had hoped to build a workflow to automatically parse these files into a suitable format, but the variability of the formatting meant it was far easier to tweak each book manually with regular expressions and Vim.
I now have glorious Nietzsche fortunes on my command line, but what good would all this work be if I didn’t share it with the world? Thus I have created a GitHub repository with everything you need to get your own Nietzsche fortunes.
You can now point fortune at the data in this project:
fortune -s -n 600 this_project/fortune
And of course, it wouldn’t be quite the same if you didn’t have your fortune delivered to you by a talking ASCII cow (or in my case, a dinosaur):
cowsay -W 70 -f stegosaurus "$(fortune -s -n 600 this_project/fortune)"
Enjoy. I know I will.
Internet atheists are an impressive group. Who else could take a subculture formed in opposition to orthodoxy, and turn it into a community rife with leader-worship and an irrepressible need to force their personal opinions on everyone else? N.B. I am more-or-less an atheist, and yet I absolutely cannot stand these people.↩
Yes yes, I realize that I am now one of those people, and I fully appreciate the irony. Thanks for checking.↩
Most of the books came from the wonderful and important Project Gutenberg, with one additional work from Nietzsche’s Features.↩
Lately I’ve been building some Rake tasks that do destructive things. I need them, and I need them to be destructive. But I also need to be very sure they don’t run at the wrong time.
This is part of my sysadmin philosophy: automate everything that’s possible to automate, and double- or triple-guard everything else against my own presumed eventual fallibility.
# lib/tasks/my_task.rake
desc 'Delete stuff that I might still want (careful!)'
task :reset_data => :environment do
# Delete all the things
end
Yikes. One misplaced rake reset_data and I’m in trouble.
It’s not enough to allow destruction to proceed just because the command was entered on the command line. Shell history can be your own worst enemy. Tab completion can stab you in the back. Distraction and urgency can gang up on you. And it’s certainly not enough to leave a note in the task descriptions advising caution. Who reads those descriptions anyways?
No, these tasks need a confirmation; an “are you sure?”. They need to stop me and demand my attention and say “Hey, look at what you just entered and make absolutely certain that’s what you meant.” And I need an easy way to do this for several different Rake tasks—so easy that I can’t possibly get lazy and leave it off, and so foolproof that I can’t screw it up.
Let’s get started.
class Nope < RuntimeError; end
This is my favorite line. Heck, this is my favorite line of Ruby I’ve written in months.
# Rakefile
task :destructive do
puts "This task is destructive! Are you sure you want to continue? [y/N]"
input = STDIN.gets.chomp
raise Nope unless input.downcase == "y"
end
This helper task prompts for input1, and insists on seeing a “y” before it continues. Why the raise? Well, Rake tasks aren’t actually methods, so they freak out if you try to return. And normally a next will exit the task, but in this case we want to exit not only this task, but also the one that called it. next would pass execution right to the next task, which is exactly what we do not want.
Raising an error is fighting dirty, but it’s an effective way to end Rake’s execution.
# lib/tasks/my_task.rake
desc 'Delete stuff that I might still want (careful!)'
task :reset => [:destructive, :environment] do
# Delete all the things
end
Back to our original task, with one minor modification. The prompting is all taken care of by adding :destructive to the dependencies. This ensures that our helper will be run first, and will bail out of everything if it doesn’t get a confirmation. The task itself can get right down to the business of whatever destruction it’s responsible for.
Fair warning: this technique slightly perverts the semantics of the Rake system. Some people might not like it on ideological grounds, which is fine. There are other ways to accomplish the same functionality. But in my book, nothing beats this for simplicity.
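For instance, one of those other ways is a plain helper method called at the top of the task body instead of a prerequisite task (confirm! and its io parameter are my own invention, not code from this post; the parameter exists so the prompt can be exercised without a live terminal):

```ruby
require 'stringio'

class Nope < RuntimeError; end

# Same guard, as a method: call confirm! at the top of any destructive
# task body. Real usage reads STDIN; tests can pass in a StringIO.
def confirm!(io: STDIN)
  puts "This task is destructive! Are you sure you want to continue? [y/N]"
  raise Nope unless io.gets.to_s.chomp.downcase == "y"
end

confirm!(io: StringIO.new("y\n"))    # a "y" lets execution continue

begin
  confirm!(io: StringIO.new("\n"))   # bare enter means the default: No
rescue Nope
  # execution stops here, before any destruction happens
end
```

The tradeoff: you have to remember to call it inside each task, whereas the prerequisite version only asks you to remember it in the dependency list.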
Confused by the [y/N]? This is a Unix convention when asking for input on the command line. It shows which characters are valid options to select, so if we had more options it might look something like [y/N/s/a]. The capitalized character indicates the default option: if you hit enter without entering input, this is the option that will be selected. The actual case should not matter, thus we use downcase before comparing, to accept either “y” or “Y”.↩
If you are working on a Ruby application, trying to run something on the command line, and getting an error message you don’t understand, you might be having a problem with your Ruby environment. The first thing you should do is run the command prepended with bundle exec. So if you were originally trying to run:
rspec --seed=123 spec/awesome_spec.rb
Just run this:
bundle exec rspec --seed=123 spec/awesome_spec.rb
Did that work? Great. You now have a workaround, and you know that your problem is a Ruby environment problem.
Didn’t work? If you are using rbenv1, there are two things to try. First, run rbenv rehash and then try your command again:
rbenv rehash
bundle exec rspec --seed=123 spec/awesome_spec.rb
If it still doesn’t work, try prepending rbenv exec to your command, still keeping the bundle exec:
rbenv exec bundle exec rspec --seed=123 spec/awesome_spec.rb
Yes, this is silly, but it’s also a great way to troubleshoot. If you’ve tried both of these steps and you’re still getting the same error, your problem is probably something other than a Ruby environment problem. If one of these steps fixed the problem, you know it was a Ruby environment problem and you can take steps to fix it permanently.
Nobody wants to prepend all this stuff every time you run a command. Nobody. It boggles my mind that anyone is even willing to live that way. Dijkstra chiseled the first NAND gate out of the bleeding flesh of his own bicep so that you and I might live in a better world, and you would let his sacrifice go to waste? Let’s fix this once and for all.
Was your problem solved by using bundle exec? There are two parts to this.
Fair warning: I’m going to show you a method that is slightly insecure, in exchange for being the most convenient and causing the fewest headaches. If you’re concerned about this, you should consider one of the alternatives2.
First, open your ~/.bashrc or ~/.bash_profile, whichever you prefer3. Add this line, probably somewhere near the bottom:
export PATH=./bin:$PATH
Now, cd to your Ruby project and run this command:
bundle install --binstubs
This will create a bin directory with some files in it. Commit this directory to your version control system.
Now open a new shell and try running your command again without the extra stuff on the front. It should work.
Did rbenv rehash solve your problem? You need to run this command just once any time you install a new gem it hasn’t seen before that has an executable part (such as rake, rspec, pry, etc.). You won’t find yourself needing to do this very often; just remember it the next time you hit the same problem.
Did you have to add rbenv exec before your problems went away? Add the following line to your ~/.bashrc or ~/.bash_profile, but make sure to add it before the PATH line we added for Bundler:
eval "$(rbenv init -)"
Now open a new shell and try running your command again without the extra stuff on the front. It should work.
If you’re the impatient type, you are no longer reading this because you left as soon as it started working. Enjoy!
If you’re the curious/suspicious type, you might want to know what exactly these magical incantations mean and why running them solved your problem. Read on!
All of the problems we’ve covered boil down to the issue of paths. Say you run something on your command-line, like this:
foobar --baz
Your shell doesn’t actually know what you want when you ask for foobar. But it has a list of paths that it will search until it finds a matching executable file to run. So it will try a list like the following until it finds something that exists:
/usr/local/sbin/foobar
/usr/local/bin/foobar
/usr/bin/foobar
/sbin/foobar
/usr/sbin/foobar
Your shell stores this list of paths in the PATH variable. You can see what your PATH currently looks like by running this:
echo $PATH
That will print out a bunch of directory paths separated by : characters. This is your PATH, and it’s the only way your shell knows how to run anything you ask for.
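That search is simple enough to sketch in a few lines of Ruby (which is my name for it, mirroring the Unix utility of the same name):

```ruby
# Walk the PATH entries in order and return the first executable match --
# essentially the search the shell performs when you type `foobar`.
def which(cmd, path = ENV["PATH"])
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

which("ls")  # e.g. "/bin/ls" -- whichever directory lists it first wins
```

The "first match wins" rule is the whole game here: everything that follows is about putting the right directories early in that list.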
Now, most of the executables you run from the command line, like echo or ls, are installed in standard locations that your shell already knows about. And when you use the default version of Ruby installed on your system, the executables from Ruby and its installed gems, such as ruby and rspec, are also available in a common path.
One of the most important things that Bundler does is let us install specific versions of gems for a given application: even if we have several applications on one machine that use different versions of the same gem, each application will get the correct version. This is a wonderful thing, but it introduces a new problem: when you ask the command line to run rspec, which version of the RSpec gem should it run? Bundler can look at the Gemfile and determine which version is desired, but your shell doesn’t know anything about RubyGems; all it has is its paths.
So Bundler does the best available thing: it puts a simple executable “stub” inside the project’s hierarchy, in a place that can be included in the shell’s paths. This is what happens when you run bundle install --binstubs: Bundler looks at your Gemfile and builds a stub file for each executable provided by those gems. These stubs are named to mimic the executables, like bin/rspec and bin/rake. Each stub is in charge of activating Bundler, determining which version of the gem should be run, and then passing along all the arguments from the command line to the real executable from the gem.
With rbenv, we have the same problem at a greater scale. All of its versions of Ruby are kept in a dedicated directory like ~/.rbenv/versions. This is great because it doesn’t cause any conflicts with system packages, but it does leave your shell in another quandary: the shell doesn’t know which version of Ruby you want to be using at the moment, so it can’t even begin to search for the correct Ruby executables.
The solution for rbenv is similar to Bundler’s: when you run rbenv rehash, it creates some “shim” files in ~/.rbenv/shims, again named to mimic the executables. These find the correct version of Ruby, then pass the entire command through to be handled by that Ruby. Because it’s also in charge of Ruby as a whole, rbenv maintains shim files for built-in executables like ruby and gem as well.
Now that we have our stubs and shims, the only thing that remains is to point our shell toward them. This is where we come back to the $PATH variable. Besides storing all the default paths, this variable can be modified by the user to add custom paths to search for executables. That’s exactly what we do: we add ./bin and ~/.rbenv/shims to the $PATH (importantly, we add these near the front of the paths list). This way, when we run something like rspec in our shell, the shell starts looking through the paths in order, and the first thing it finds is ./bin/rspec: our Bundler stub. So it runs that, and Bundler and rbenv take care of the rest4.
Congratulations! If you’ve made it this far, not only is your Ruby environment working flawlessly, but you also understand what it is that makes it work. You’ve done good work today.
If you are using rvm, there is probably an equivalent step you should take to ensure that the command is running with the right set of paths, but I don’t use rvm much so I don’t know what it is. If someone tells me, I’ll update this article.↩
Adding a relative path like ./bin to your $PATH is insecure because it could potentially allow someone to trick you into running something you don’t want. If you cd to a directory and run ls, for example, someone could have planted a malicious bin/ls executable in that directory, and you would run that instead of the regular system ls you expected, and bad things would happen. I don’t find this particular bogeyman all that frightening for my everyday (non-mission-critical) computing needs, but you can decide for yourself. If this concerns you, you’ll have to decide which alternative solution you’d prefer:
If you only have a small number of Ruby projects, you could add only the full paths instead of using a relative path, like this:
PATH=/home/me/code/myrubyproj/bin:$PATH
If you are using RubyGems version 2.4.2 or higher, you can leave out the PATH modification and instead put this line into your .bashrc:
RUBYGEMS_GEMDEPS=-
This is the solution of the future, and provides the same convenience without the security flaw. Unfortunately, it’s very new and still has some serious bugs that will cause errors on Gemfiles that use the more advanced features of Bundler.
If you don’t mind the inconvenience, you can remember to run all the commands with bin on the front, i.e. bin/rspec. The trick here is to get good at recognizing the weird errors that result every time you forget to add bin, so that you can immediately re-run the command with the needed prefix.
If you don’t have a preference already, use .bashrc. It’s marginally more correct.↩
One more subtlety: when you run a gem executable like rspec, it actually needs help from both Bundler and rbenv – rbenv to find the correct Ruby version and Bundler to find the correct gem version. How do they cooperate? Well, by relying properly on Unix conventions, each package is able to do the right thing without explicitly knowing about the existence of the other. The only thing they need from us is to have the directories added to $PATH in the right order. When the shell searches for rspec, the first thing it finds is ./bin/rspec, the Bundler stub, so it executes that. That file contains a sh-bang line instructing the shell that it is to be executed using the ruby program. So the shell searches for ruby, and this time it finds ~/.rbenv/shims/ruby. It invokes the ruby shim, which selects the correct version of Ruby and uses that to run the rest of the Bundler stub, which finds and runs the correct version of the gem executable. Alternate explanation: Unix Magic!↩
News broke today about a widespread security flaw in OAuth and OpenID. The written material is a bit short on actual explanations or actionable steps, which is unfortunate when the flaw claims to affect virtually all OAuth providers and must be patched in the OAuth client applications.
In the spirit of my previous post on OAuth security issues, I want to give you a practical guide to the issue. We’re going to learn, in order: how to know if you’re vulnerable, how to fix it, and (for the curious) what the flaw actually is.
Quite likely1. The flaw is not tied to a particular OAuth implementation or usage pattern2. The simplest way to know if you are affected is to follow the steps to fix it in the next section. If there’s nothing there for you to do, you weren’t vulnerable.
Thankfully, closing the security hole is simple. Follow these steps for each of the OAuth providers (Facebook, Google, GitHub, etc) that you use:
Log in to the provider’s OAuth management page. This is where you first registered your application with the provider, and where they show you the Client ID and Client Secret.
Find the place to enter your application’s redirect URL. Google calls this the “Authorized redirect URI”. GitHub calls it the “Authorization callback URL”. In Facebook, you will find this under “Settings -> Advanced -> Valid OAuth redirect URIs”3.
Enter your full callback URL(s) in this field. This means you should be providing the entire path, such as https://mysite.com/oauth/callback. Do not use wildcards, and do not use only the domain.
That’s it! You’re safe4.
This vulnerability has two parts: trick the OAuth provider into redirecting to an unusual place, then as a result trick the OAuth client into leaking credentials.
Here’s how it would go down:
CoypuApp is the trendiest web community for fans of obscure rodent species5. CoypuApp lets users log in with Facebook. Alice, the user, trusts CoypuApp and uses her Facebook account as her identity.
When CoypuApp registered as a Facebook developer, they set it to trust their entire domain, www.coypu.ar.
CoypuApp offers a common convenience feature - when a user lands on a page and then logs in, they will be returned to the page they were viewing. To accomplish this, CoypuApp’s home page takes an optional extra parameter, like this: http://www.coypu.ar/welcome?return_to=http%3A%2F%2Fwww.coypu.ar%2Fvideos%2F2243. When return_to is present, it will redirect to the specified page, thus returning the user to the coypu video they were watching.
A malicious user posts on the CoypuApp forums6 purporting to link to an awesome coypu blog. In fact the URL is a handcrafted attack URL:
https://www.facebook.com/dialog/oauth?client_id=5564&redirect_uri=http%3A%2F%2Fwww.coypu.ar%2Fwelcome%3Freturn_to%3Dhttp%253A%252F%252Fcoypu-haters.nz%252Fattack
Let’s break this link down: it initiates the OAuth login process with Facebook. One of the OAuth params is redirect_uri, which names the endpoint that Facebook will redirect to with a token. Normally, this would be http://www.coypu.ar/oauth, and CoypuApp would pull the token out of the hash and use it to authenticate the user7.
But in this link, something is wrong. redirect_uri is set to another value: http://www.coypu.ar/welcome?return_to=http%3A%2F%2Fcoypu-haters.nz%2Fattack. Facebook looks at this link and sees that it is at the www.coypu.ar domain. Facebook thus assumes it is safe. But remember that convenience feature we talked about? The attacker has snuck the return_to param into this URL, and it is pointing to a malicious site, coypu-haters.nz!
Alice sees the link to “awesome coypu blog” and clicks it. In the worst case, she still has a valid session at Facebook and it remembers her authorization of CoypuApp, so she is immediately passed through the OAuth process8.
Alice’s browser is redirected with the OAuth token in the hash: http://www.coypu.ar/welcome?return_to=http%3A%2F%2Fcoypu-haters.nz%2Fattack#access_token=d34db33f.
The convenience feature sees that return_to is present, so it redirects to the URL it is given: http://coypu-haters.nz/attack#access_token=d34db33f.
Oh crap. The owner of coypu-haters.nz now has Alice’s OAuth token and has access to anything that Alice authorized CoypuApp to do. They immediately start posting vile anti-coypu screeds to Alice’s Facebook wall, badly tarnishing her reputation.
It’s a clever attack. It combines the unavoidably stateless nature of OAuth with the unrelated (but common) occurrence of open redirection endpoints in OAuth client domains to manipulate the provider’s trust and the client’s lack of caution. The root cause here is the poor choice of trusting all routes in a client’s domain. OAuth providers must start demanding exact redirect paths when client apps are registered, and this problem will be effectively eliminated.
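To make the root cause concrete, here is a minimal sketch contrasting the domain-level trust that providers default to with the exact-match policy prescribed above (the REGISTERED value is CoypuApp's hypothetical callback from the scenario):

```python
from urllib.parse import urlsplit

# Hypothetical registered callback for CoypuApp.
REGISTERED = "http://www.coypu.ar/oauth"

def domain_only_check(redirect_uri):
    """The flawed policy: trust any URL on the registered domain."""
    return urlsplit(redirect_uri).netloc == urlsplit(REGISTERED).netloc

def exact_path_check(redirect_uri):
    """The fix: scheme, host, and path must all match exactly."""
    got, want = urlsplit(redirect_uri), urlsplit(REGISTERED)
    return (got.scheme, got.netloc, got.path) == (want.scheme, want.netloc, want.path)

# The handcrafted attack URL's redirect_uri from the scenario above:
attack = "http://www.coypu.ar/welcome?return_to=http%3A%2F%2Fcoypu-haters.nz%2Fattack"
```

The domain-only check waves the attack URL through; the exact-path check rejects it while still accepting the real callback (query strings and all).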
I could say that you are only vulnerable if your domain hosts an “open redirection” endpoint - an endpoint that takes another URL as a parameter and indiscriminately redirects to it. However, it is my belief that an open redirection endpoint is a natural waste product of any sufficiently complicated web application, and so it is unwise both to assume that you will be aware of any that exist, and to believe that you will not introduce one at a later date. It is far easier to fix this vulnerability as I prescribe than it is to audit your entire infrastructure for open redirections.↩
Clients that use the “authorization code” flow of OAuth and authenticate the token request with a secret known only to the client server are not directly vulnerable. However, this does you no good unless the provider is locked down to allow only the authorization code flow for your application. The attacker can craft a malicious URL with response_type=token, and most providers will happily honor that regardless of whether you ever implemented the implicit grant flow. At this point you are every bit as vulnerable as anyone else.↩
Facebook’s handling of this is shoddy and irresponsible. By hiding this option in the “Advanced” settings and defaulting to trusting the full domain, they’ve ensured that the vast majority of Facebook OAuth clients will be vulnerable to this. Your data is deeply unsafe in Facebook’s hands, but then again, you should have already known that.↩
Hopefully it goes without saying that your OAuth callback endpoint doesn’t itself perform open redirection while keeping the hash params intact. If you’ve made that poor of a decision, maybe you should let someone else handle the OAuth code from now on.↩
Ian Lunderskov insists that I credit him for his role in the development of this imaginary web application. And also that I belatedly credit him for the development of an imaginary capybara-based cryptocurrency.↩
The malicious user could post this link anywhere, but the easiest way to target coypu enthusiasts is through their own forums.↩
Let’s hope CoypuApp is following my previous advice on authenticating their tokens.↩
The best case is that Alice has not yet signed in to CoypuApp, and so when she clicks the link she is prompted to “Allow CoypuApp to see my Facebook information”. This might tip Alice off that something is wrong. But ordinary people find the Internet very confusing and more often than not will simply follow the instructions. So this isn’t much of a “best” case.↩
Many web apps that use OAuth suffer from a fairly serious security flaw. Generic OAuth client libraries cannot completely patch this hole on their own, so you, the end-developer, are responsible for taking precautions. A small oversight when implementing the OAuth flow can open you up to someone impersonating your users and stealing their stuff. If you are using OAuth, you should definitely know about this and make sure you aren’t exposed.
More complete explanations of the flaw have been written, and it is mentioned in the OAuth spec. This, instead, is a simple and practical guide to the problem. I’m going to explain how to know if you’re vulnerable, how to fix it, and (for the curious) why this is a danger at all.
You are at risk if both of the following apply to you:
You are using an OAuth client to perform the “implicit grant” flow of OAuth. This is the flow that returns a token directly in the hash (instead of returning a code that’s traded for a token). If your OAuth client exists anywhere on the front end, such as a JavaScript framework or a mobile app, you are probably using implicit grant. If your OAuth client is server-side, check anyhow. Could be someone decided to use the implicit flow improperly because it looked easier. If your response_type is set to token, you’re using implicit grant.
You are using OAuth for authentication, not just authorization. What does that mean, exactly? It means that when someone logs in with OAuth, you take this as proof of their identity and give them access to some private information or abilities beyond the information pulled from the OAuth provider. If you’re storing any information of your own, this probably applies to you1. If you’re describing your OAuth client as “Sign in with [service]”, this probably applies to you. If you’re unsure, err on the side of caution; you can’t do any harm by closing this hole preemptively.
You may be exempt if both of the following are true:
To close the security hole, you need to add one extra step. After you receive the OAuth callback with the token, you need to verify the token. Don’t save it or do anything else with the token until it’s verified.
Here’s how you verify the token:
Make a server-side call to the provider’s token verification endpoint, passing along the token you just received.
Find the application ID in the response - the ID of the application this token was issued to. Facebook calls it app_id. Google calls it audience. Doorkeeper calls it application.uid. It should be the same client_id you used earlier when you kicked off the whole OAuth flow.
Compare this ID to your own client ID. Do they match? Then you’re all set! Do they not match? Throw away the token and fail ungracefully3.
That’s it! Verify the tokens and the security hole is closed. You’ve done your good deed for the day and you may carry on with your business. If you want to know why you just performed this extra step, read on.
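The comparison step can be sketched in a few lines. The provider field names (app_id, audience) come from the article; MY_CLIENT_ID and the shape of the parsed response are illustrative, and fetching the token info is provider-specific and omitted here:

```python
# Hypothetical client ID - the same value used to kick off the OAuth flow.
MY_CLIENT_ID = "5564"

def token_is_mine(token_info, my_client_id=MY_CLIENT_ID):
    """Given the provider's token-info response (already parsed into a dict),
    compare the embedded application ID to our own client ID.
    Facebook names the field app_id; Google names it audience."""
    app_id = token_info.get("app_id") or token_info.get("audience")
    return app_id == my_client_id
```

A token issued to us passes; a token issued to any other application fails, even though it may be a perfectly valid token for that other application.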
The thing about receiving an OAuth callback is that you don’t know where that token has been.
We assume that receiving a working token for a user means we’re talking to that user. In fact, it’s possible that someone else is impersonating the user. Here’s how it would go down:
This mistake is very easy to commit. Our intuition dupes us into thinking that a call to an OAuth callback will always come directly from the provider’s authentication, but the stateless nature of the web means that it may come from anywhere.
You can see now why verifying the token is important. That would break up the attack in step #5, when CapyApp would see that the application id associated with the token doesn’t match CapyApp’s own client id, and would discard the token instead of giving access to Alice’s personal effects.
The alternative, when this wouldn’t apply to you, is when you are storing absolutely no user information of your own. Rather, you are only providing delegated access to the resources that the user has authorized via the token they’ve given you. Iff this is the case, you are not exposing anything the token holder couldn’t get without your help and there is no security hole to close.↩
Or hard-persisted, as the case may be. The point is that it’s stored somewhere safe on your end as the counterpoint to information coming through the OAuth callback.↩
Ungraceful failure is often the most appropriate type of failure. More on this someday later.↩
The hottest new thing in our increasingly fad-obsessed web development world is icon fonts. The most popular one of these is FontAwesome, which found the magic combination of providing a nice icon set and piggybacking on the ultimate web design fad, Bootstrap. Yes, FontAwesome is a nicely-designed icon set, and I have nothing against it personally, it’s just that FontAwesome and every other icon font are built on a stupid fucking idea.
We’re two weeks into a new project and our designers are adding extra markup to just one of the elements in a navigation list so they can apply a CSS rotate transform, because they need an ellipsis icon which FontAwesome doesn’t fucking provide, so they’re turning one of the “list” icons on its side instead so it looks like an ellipsis if you squint. Now, this is a problem one has with any stock icon set - you inevitably run into situations where the set, no matter how large, is missing that one crucial icon that you need right now. So you face the choice of shoehorning in some icon that doesn’t really fit and will definitely confuse your users, or creating your own addition and trying to match the style of the others. However, when you use an icon font your problems have just doubled because nobody, anywhere, incorporates editing font faces as an ordinary part of their workflow. How many of your designers know how to use FontForge? Zero? Yeah, I thought so.
Why is everyone so infatuated with FontAwesome? It’s because we started designing responsive layouts and realized that managing lots of different sizes of the same image kinda sucks, and then the Retina display came out and everyone shit their pants over how bad everything looked on it. So people started thinking gee, wouldn’t it be nice if we could use vector graphics? Infinite resizing without a loss of fidelity, no finicky image sizes to deal with, and often much smaller file sizes to boot. This is a valid observation, which is why we have a format expressly for vector graphics that works great. It’s called SVG, and it’s implemented in all modern browsers, and it’s actually intended for vector graphics instead of some insane abuse of font faces, for chrissake. SVG is an actual vector graphics format. Icon fonts are an unplanned loophole that provides behavior approximating vector graphics. Let me emphasize that these two are not the same.
Why is it that the idea of icon fonts sounds so familiar? Can you place it? It’s because we tried them 20 years ago when it was called WingDings and it was just as shitty an idea then as it is now. It was a shitty idea because fonts are not a language and it doesn’t make any sense to have the letter “Z” represented as a star and crescent. Leaving aside the issue of Unicode’s own symbol sets (which at least lay some claim to functioning as a universal language), having your pictures come from little letters that you’ve made a special font to represent just doesn’t make any sense.
You can see shades of how little sense this makes reflected in the markup FontAwesome prescribes:
<i class="icon-camera-retro"></i>
What’s that <i> element doing? Is that like ‘i’ for ‘icon’? No, it’s fucking not. It’s ‘i’ for ‘italic text style’, just like it has always been. By not only abusing an HTML element like this, but by choosing one of the most obsolete, non-semantic elements to abuse, it almost seems that someone realized the absurdity of what they were doing when they designed this. That would make sense. Icon fonts are the kind of hack you come up with when your project has been hitting roadblocks all week, and on Sunday evening you drink a bunch of Coke and bang out an idea so ridiculous you’re a little surprised when it actually works. On Monday you show your coworkers and everyone has a good laugh and a little head-shake of disbelief, and they call you “crazy bastard” endearingly, and you throw a comment in the code stating that “this should probably change” and hope you have time to come back to it before release day. That is the kind of advancement that icon fonts are. That would be fine; we’ve all committed a few outrageous pieces of code in our careers. What’s not okay is promoting this ugly hack as a real solution, building on top of it, and trying to spread it far and wide across the web. That means all of you; tweeting your tweets and blogging your blogs about “how great this FontAwesome thing is”. Icon fonts are an ugly half-solution, and treating them as anything more than that is a sure route to woe, for us and for the rest of the web. Every bit of energy we expend propping up a bad solution like icon fonts is energy we could be putting into using SVG and degrading gracefully in the few relevant browsers that don’t support it.
Mark my words: in a few years we’ll look back at icon fonts as another stupid detour on a route littered with mistakes like <marquee> and table layouts. There is no future for this technology except regret and hurried backpedaling. So think on this: when someone in the future reminisces about the days of icon fonts and asks “can you believe we ever thought those were a good idea?”, you don’t want to be the one who averts your eyes and, full of shame, mumbles, “No… I really can’t.”
Poop icon designed by Ricardo Moreira from the Noun Project.↩
Yesterday, I published a modest post detailing a flaw in the Reddit ranking system, the consequences, and the response to it. A few hours later, it was #1 on Reddit and #2 on Hacker News. I was not expecting this, to say the least.
The Reddit developers spoke up again in the comments to explain themselves a little further. Notably, for the first time that I’m aware of, there is some indication that they do consider this a bug and might be planning to fix it at some point.
A number of helpful people also sent me links to previous discussions on this subject, some of which I had not dug up on my own. This thread contains a fairly thorough discussion of the implementation, and a proposal from a developer to change the algorithm – surprisingly, in a different way than what I believe should be done.
I want to make clear while we’re on the subject that I was never looking to be hard on the Reddit developers. I do believe that I am right and they are wrong, and I do believe that they have done a poor job of communicating the actual reasons behind this decision, whatever those reasons are. But I don’t think that the Reddit devs are jerks or idiots or anything else. Managing a community is tough work, and all the tougher when your code is open for people to pick through and criticize. I do want this issue fixed, but I wasn’t really looking to start a crusade for it1.
My discussion of vote gaming struck a nerve. If there’s one lesson to take away from this article’s popularity, it’s that vote manipulation is something Redditors are thinking about and are worried about - not just the programmer types, but everyone. Many people have pet theories about what kind of widespread vote manipulation is taking place2. All sorts of comments poured in about this, everything from the reasonable (“I wish there were more safeguards against trolls”) to the full-blown conspiracy theories (“Wake up sheeple”). Several moderators of subreddits chimed in to say that they have struggled with the vote banishment in their subreddits.
Another kind person, rubicks, sent me a graph he created based on my article that provides an abstract representation of the function curves generated3 by the existing calculation (purple) and my proposed solution (green). It’s a powerfully intuitive way to make the argument – the existing version spikes in a strange and discontinuous way.
A number of people also disagreed with my stance, proposing alternate explanations for the behavior I described. I am not entirely convinced by any of the theories I have seen so far, but some of them are interesting and I did enjoy reading them. Click through for my responses and further discussion.
In response to my article, /r/Chicago is launching a month-long experiment disabling downvotes in their subreddit. I’ll be very interested to see how it turns out for them.
Most importantly of all, /r/BirdPics held a Puffin Day in my honor.
You guys… you really shouldn’t have. What an honor. I don’t have the words to tell you how much this means to me. No, no I’m fine. I just got something in my eye. It’s fine. Just an eyelash. Runny nose. It’s fine. Excuse me for a moment.
A number of people helpfully pointed out that the fluctuating vote numbers I was seeing were due to vote fuzzing, an anti-spam feature. I have corrected that footnote4.
I linked to the wrong piece of code when discussing the bug, leading some people to believe it has been fixed in production. This was my fork of the Reddit code in which I fixed it, not the official Reddit repo. I’ve now changed that link to point to a pre-fix commit so as not to be confusing.
Someone else corrected my statement that Reddit has “tons of cash flowing in” by pointing out that they’re still not profitable. I haven’t amended that because that’s just mean.
Anthony Wing Kosner did a solid writeup on the Forbes blog, and advances his own theory on the reasons behind the ranking behavior.
A few other writeups have happened in the business press. I won’t vouch for their quality, but here they are.
Did you know that Randall Munroe has taken an interest in Reddit’s ranking algorithms? Well, now you do.
Another user contends that “Controversial” is the real worst sort implementation. Some good discussion ensues.
And then there’s this story. I don’t have a lot to say about that.
You should realize that I never, ever imagined that a technical post written for a technical audience on my quiet little blog would net this much attention. The absolute ceiling on my expectations was having it do well on /r/programming.↩
Due to the uncertainty introduced by vote fuzzing, I assume these theories are mostly speculation rather than hard observation. However, some of the theories are, like mine, quite plausible.↩
Don’t look at the actual numbers on the graph, as they won’t meet up. It’s the shape of the curves that can help visualize how scores will relate.↩
These corrections also provide an answer as to whether anyone reads the footnotes.↩
Reddit has a bug in their code. This bug is currently present in their production platform, and has been for years. It affects one of the most important algorithms in the entire site, the “Hot” ranking algorithm for link popularity. It has real, demonstrable negative effects. It has been reported to Reddit’s technical team several times and never fixed.
Reddit needs to determine which articles are “hot” right now. Newer material is better than older material. Material with many positive votes is better than material with few votes, and both are better than material with mostly negative votes. This is pretty straightforward to calculate. One determines numeric values representing these two measures, and multiplies by some constants to determine exactly how much priority each measure gets1.
The devil is in the details, or in this case, the implementation.
seconds = date - 1134028003
The time-dependent variable, named seconds, is based on a UNIX timestamp. It’s a bright way to do it: time is forever counting up, so every new submission receives a slightly higher score from the time variable than every submission that came before it.
s = score(ups, downs)
order = log10(max(abs(s), 1))
if s > 0:
    sign = 1
elif s < 0:
    sign = -1
else:
    sign = 0
The vote-dependent half of the equation has two parts. The sign variable simply designates if the total vote sentiment is positive or negative. If the material received more positive votes than negative votes, sign is 1; if more negative votes, sign is -1. The other variable, order, is the log₁₀ of the absolute value of the vote score2.
The actual problem stems, as so many problems do, from the transposition of two characters.
return round(order + sign * seconds / 45000, 7)
Here we have our final score calculation. seconds is a large positive number. order will always be positive – it uses the absolute value, so a submission scored -389 will have the same value for order as a submission scored +389. We need to use sign to adjust order so that net-negative submissions are penalized accordingly. But this code multiplies sign and seconds, not sign and order.
On net-positive submissions, this has no effect. sign is 1, so order and seconds are added together and everything is good.
What happens on a net-negative submission? sign is -1, so the very large seconds value becomes negative. Then a positive order is added to that. This has several surprising results!
Imagine two submissions, submitted 5 seconds apart. Each receives two downvotes. seconds is larger for the newer submission, but because of a negative sign, the newer submission is actually rated lower than the older submission.
Imagine two more submissions, submitted at exactly the same time. One receives 10 downvotes, the other 5 downvotes. seconds is the same for both, sign is -1 for both, but order is higher for the -10 submission. So it actually ranks higher than the -5 submission, even though people hate it twice as much.
Now imagine one submission made a year ago, and another submission made just now. The year-old submission received 2 upvotes, and today’s submission received two downvotes. This is a small difference – perhaps today’s submission got off to a bad start and will rebound shortly with several upvotes. But under this implementation3, today’s submission now has a negative hotness score and will rate lower than the submission from last year.
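These behaviors are easy to reproduce. The sketch below implements the calculation as published, plus a hot_fixed variant with the two transposed variables swapped back (this is my proposed fix, not code Reddit runs):

```python
from math import log10

EPOCH = 1134028003  # the constant subtracted from the UNIX timestamp

def _parts(s, date):
    """Break a submission down into the algorithm's three ingredients."""
    seconds = date - EPOCH
    order = log10(max(abs(s), 1))
    sign = 1 if s > 0 else -1 if s < 0 else 0
    return order, sign, seconds

def hot_buggy(s, date):
    """The shipped code: sign multiplies seconds."""
    order, sign, seconds = _parts(s, date)
    return round(order + sign * seconds / 45000, 7)

def hot_fixed(s, date):
    """The swap: sign multiplies order, as the prose says it should."""
    order, sign, seconds = _parts(s, date)
    return round(sign * order + seconds / 45000, 7)

now = EPOCH + 1_000_000  # an arbitrary moment in time
```

Under hot_buggy, a -2 submission posted 5 seconds later ranks lower than its older twin, and a -10 submission outranks a -5 one posted at the same instant; hot_fixed restores the sane ordering in both cases.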
This is not a hypothetical problem. Curious to see if the code in Reddit’s public repository was what they had running in production, I found a recent post in a fairly inactive subreddit and downvoted it, bringing its total vote score negative. Sure enough, that post not only dropped off the first page (a first page which contained month-old submissions), but it was effectively banished from the “Hot” ranking entirely. I felt bad and removed my downvote, but that post never really recovered4.
Indeed, by manipulating the query string, you can find a strange purgatory where damned submissions slowly rot, alone in the darkness5. Here is a collection of unfortunate articles from the iPhone subreddit:
These posts are sad, alone, and afraid. And notably, they are sorted oldest first, just as I predicted.
This banishment flaw opens a door for more intentional gaming of the system as well. Imagine a hypothetical subreddit, /r/BirdPics, devoted to pictures of birds6. An attacker despises puffins, and wants to keep all pictures of puffins off the front page. This attacker can downvote every picture of puffins, but will be outgunned by the other users who like and upvote puffin pics. On average, 350 people are watching the front page of this subreddit at any one time, so that’s a lot of upvotes to contend with.
Instead, our attacker will watch the new submissions very carefully, and the moment a puffin pic is submitted, immediately downvote it. If the attacker gets to the picture first, it will go negative and be utterly exiled, never again touching the front page. The only thing the attacker needs to worry about are the people watching the “New” ranking, which ignores votes. Our hypothetical subreddit only averages 10 people on the New page, so our attacker can defeat them simply by maintaining 10 sock puppet accounts, instead of the ~300 that would be needed to defeat the front page users. Just like that, our attacker has scrubbed the subreddit of all puffin pics, and the world is a poorer place for it.
I wasn’t the first person to notice this error. Jonathan Rochkind covered it in his well-written post on the subject. He was told by a Reddit developer that he was “just incorrect” and that the algorithm as it exists is “not wrong”.
I submitted a pull request fixing the bug, and was informed by a different Reddit developer that “it’s that way by design”. I do not understand, nor have I received a satisfactory explanation of, in what sense this nonsensical behavior would be “by design”. But it is clear that Reddit is not interested in fixing this, and this behavior will probably persist for many more years.
Programmers tend to nurture a definition of justice that revolves around rule conformance. It’s why many of us find worldly realms like relationships or politics so intractable, and why many of us were drawn to computer science in the first place. In computation, everything is strictly deterministic. If something happens that doesn’t make sense, it can only be because our understanding of the system is incomplete7. To be Right, capital-R Right, is a system that is fully understood and executes precisely as expected.
When we hold this type of worldview, intentional propagation of a bug seems unjust. The other developer who pressed this issue and I seem to have a more complete understanding of the algorithm than the Reddit employees who responded to us. We’re certainly correct about the surprising and counterintuitive behavior of the unpatched algorithm. We are Right and Reddit is Wrong. And Reddit has a wildly popular site, a tremendous userbase, and tons of cash flowing in. All built on a foundation with an obviously Wrong component.
What’s the moral here? Maybe it’s that an insufficiently tested system becomes an insufficiently understood system, and eventually a system that is defended with rationales like “it just works, stop asking questions”. Or not. Maybe the moral is that the perfect is the enemy of the good, that worse is better8, that splitting hairs can distract us from the haircut9. Maybe it’s that a good technical implementation is a distant second to a good product, and that hard data should always yield to a positive experience.
Maybe there is no moral. Reddit screwed up. It could have hurt them, but it didn’t, and probably won’t. They are wrong but they are not Wrong because there is no such thing as capital-W Wrong. Moral codes are ideas that we construct, and there is no god of determinism that will one day smite Reddit for their crime of being bad at math. The world is a flawed place, has always been a flawed place, will always be a flawed place.
This is a simple yet powerful idea. You could create some wildly different sites that all relied on the same algorithm but with different constants. Want a site that surfaces very old content? Weight the time variable very low. Want Twitter? Weight the vote variable 0.↩
The logarithmic scale accounts for vast differences in popularity throughout Reddit - the difference between 1 and 11 votes is much more important than the difference between 10,001 and 10,011 votes.↩
This particular behavior is dependent on seconds being large enough to overpower order. In Reddit’s implementation, it is.↩
Throughout testing, I had trouble determining exactly what was happening to scores due to fluctuating vote totals. I now know that this was likely vote fuzzing, an anti-spam feature.↩
I cannot provide a persistent link to this purgatory because the indexes seem to disappear after a day, but it’s easy enough to find. First, find a recent negatively-scored submission and take note of its ID, which can be found in the URL. From the URL http://www.reddit.com/r/birdpics/comments/1s33tt/fear_the_shrike/ we get the ID 1s33tt. Now insert it into the following URL, substituting as necessary: http://www.reddit.com/r/SUBREDDIT/?count=9999&after=t3_ID. Our URL would become http://www.reddit.com/r/birdpics/?count=9999&after=t3_1s33tt - note that the ID is prepended by t3_. And yes, you may change the count to whatever you wish; that number is totally made up.↩
This is also, I suspect, why the heisenbug is perhaps the most feared and hated event in all of Computer Science. See also: releasing Zalgo.↩
The “worse is better” meme originates in Richard Gabriel’s seminal article on the rise of C and fall of LISP. This article and the later follow-ups are some of the best writing the computer science world has ever seen.↩
Can you guess which one of these analogies I just made up on the spot?↩
Disclaimer: Despite the title, I will discuss both RSS and Atom formats. A more accurate title would be “Content Syndication the least wrong way”, but that is just not as snappy.
Generating a static site? Building your own blog engine? Or otherwise need to configure your own content syndication feed? You may have noticed that RSS is a complete shitshow. RSS grew in the messy organic way that much of the web grew, and was standardized too little and too late, like much of the web. People can’t even agree what the acronym RSS stands for. Most activity on the standard itself stopped years ago, and as a result so did discussion of it. Trying to serve your own feed comes with a number of small pitfalls, and searching for advice on the subject yields a slew of contradicting, badly outdated articles. The top hit searching for “rss content type” is 8 years old. That’s 218 in Internet years.
Let me guide you through this mess. Together we will seek simple solutions that work in today’s world. Of course, if you’re not interested in the details, you can skip to the end for my recommended best practices.
A feed is nothing more than a simple XML file that lists entries. Whenever a new entry is added, the file is updated. Generating this file is outside the scope of this post. Hopefully you are using a tool that will do it for you. If not, you may need to look at some examples or dig into the spec.
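If you do want a reference point, a minimal Atom 1.0 feed looks roughly like this. All titles, URLs, names, and dates here are placeholders; real feeds will carry more metadata, but this is the essential shape:

```xml
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Blog</title>
  <link href="http://example.com/"/>
  <id>http://example.com/</id>
  <updated>2013-12-10T00:00:00Z</updated>
  <author><name>Author Name</name></author>
  <entry>
    <title>First post</title>
    <link href="http://example.com/first-post"/>
    <id>http://example.com/first-post</id>
    <updated>2013-12-10T00:00:00Z</updated>
    <summary>A placeholder entry.</summary>
  </entry>
</feed>
```

Each new post becomes another entry element, and the feed-level updated timestamp moves forward.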
The important thing is that once you have your feed, you use the official validator to ensure that it is a valid feed format. You should only be generating Atom 1.0 or RSS 2.0. There is no reason to use any older versions of the specs.
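Between runs of the official validator, you can at least catch malformed XML locally. Here’s a quick sketch in Ruby using the standard-library REXML parser; the filename feed.xml is just an example:

```ruby
require "rexml/document"

# Well-formedness is necessary but not sufficient: the official
# validator also checks the Atom/RSS specs themselves, which a
# plain XML parse cannot do.
begin
  REXML::Document.new(File.read("feed.xml"))
  puts "well-formed XML"
rescue Errno::ENOENT, REXML::ParseException => e
  warn "problem with feed: #{e.message}"
end
```
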
Once upon a time, a syndication format arose, and it was called RSS. Then people got annoyed because RSS had some shortcomings, and created a better-thought-out standard called Atom. You will need to make a choice about which format you serve. The good news is it doesn’t matter much. Both are perfectly sufficient for the needs of a simple blog or other standard content stream, and as we will see later, both are widely used and widely supported. No modern feed reader will handle one of these formats but not the other.
If your tool only generates one format or the other, your decision is made for you.
Otherwise, choose one. I recommend Atom. It was built with a spec from the beginning, so the right implementation is also the implementation that works. RSS has existed longer, so in theory some tools could lack Atom support, but in practice this isn’t likely. There are echoes of descriptivism vs. prescriptivism here, and as usual I come down cautiously on the prescriptivist side. Atom is conceptually better, so barring any hurdles to its acceptance, I say we use Atom.
You must serve your feed with a content type that identifies it as a feed. It’s fairly important to get this right, or at least something approximating right. This ensures that browsers and feed readers recognize it as a feed and behave appropriately.
For an Atom feed, your content type should be application/atom+xml. This is the most correct, and will work well with everything. Using text/xml is technically acceptable but too vague to be a great idea.
For an RSS feed, your content type should be text/xml.
Given what I just said about Atom feeds, it seems like application/rss+xml would be a better idea. However, this is not a registered MIME type, and while it will probably work, you still should not try it. RSS suffered greatly at the hands of those who loved it, and this content-type mess is one of the greatest legacies of that. Other content types used for RSS in the past have included application/xml, text/html, and text/rss+xml, which are all wrong and should be avoided.
You can easily use curl to check that your feed is being served with the correct content type:
curl -I technotes.iangreenleaf.com/feed.xml
Look through the headers in the response for this:
Content-Type: application/atom+xml
If you are serving your blog from Amazon S3, it will guess (and guess wrong) about the content type it should serve. Give it a hint when uploading the file to prevent this behavior. For example, I use the s3cmd tool and pass it an extra option1 like so:
s3cmd put --mime-type=application/atom+xml \
_site/feed.xml s3://technotes.iangreenleaf.com
Functionally, it doesn’t matter at all what you name the files you generate. If they are served up with the correct content type, the file name is irrelevant (though some platforms, like S3, use the filename to guess at the content type if it isn’t set explicitly). I suggest you go with something like feed.xml, or rss.xml/atom.xml. This is not incorrect, and should result in roughly-correct behavior from text editors, etc.
A nifty feature that you definitely want to provide is RSS discovery. Applications, when viewing a page on your site, can automatically find feed links and make them available in some special way. For example, I have an RSS button on my toolbar. If a page I’m on offers a feed, the icon will light up, and clicking on it will open the feed in my preferred feed reader.
No need to search around the page for the link to the feed, my browser has found it for me!
The key to enabling RSS discovery is adding a <link> element inside your <head>. Here is an Atom feed:
<link rel="alternate" type="application/atom+xml"
href="/feed.xml" title="Atom Feed" />
And an RSS feed:
<link rel="alternate" type="application/rss+xml"
title="RSS Feed" href="/feed.xml" />
The rel attribute is very important. It must contain alternate and only alternate, or some clients will stumble.
The type attribute is also important. It must be either application/atom+xml or application/rss+xml. “But wait,” you say, “why are we using application/rss+xml when you just told me that’s not a valid content type?” You’re so cute with your questions. But seriously, don’t use it in the content type header, do use it here, stop asking questions and no one will get hurt.
The title attribute must exist, but may contain whatever you like. Just name it something descriptive.
It’s also possible to offer multiple feeds from one page by using more than one <link>:
<link rel="alternate" type="application/atom+xml"
href="/feed.xml" title="Ian's Blog Feed" />
<link rel="alternate" type="application/atom+xml"
href="./comments.xml" title='Comments on "This Post"' />
If you do this, users will be shown a selection screen to pick the feed they want.
The title attribute is what’s shown here, so make sure you’ve picked good names!
You can even provide links to both an Atom feed and an RSS feed for the same content, and let users choose which format to use. However, I recommend against doing this - more on that later.
Dealing with a poorly-specified area of the web with a spotty history comes with its share of uncertainty. Several of the lessons in this post were learned the hard way after launching this blog and receiving bug reports. Questions of format choices and content types boil down to which combination is going to work correctly for almost everyone, almost all the time. With most of the literature on this subject badly out of date, it’s hard to determine which compatibility issues still occur and which are effectively moot.
I decided the best way to find out which practices were safe would be to check very popular feeds and see what they did. If something doesn’t work for a significant portion of the web population, I assume these people will have heard about it and taken corrective action. To that end, I conducted an unscientific survey of feeds published by popular blogging platforms and notable citizens of the web. I checked only the feeds made available by autodiscovery.
Source | Format | Content Type | Autodiscovery |
---|---|---|---|
Blogger (Atom) | Atom 1.0 | application/atom+xml; charset=UTF-8 | <link rel="alternate" type="application/atom+xml" title="Official Blog - Atom" href="http://googleblog.blogspot.com/feeds/posts/default" /> |
Blogger (RSS) | RSS 2.0 | application/rss+xml; charset=UTF-8 | <link rel="alternate" type="application/rss+xml" title="Official Blog - RSS" href="http://googleblog.blogspot.com/feeds/posts/default?alt=rss" /> |
Tumblr | RSS 2.0 | text/xml; charset=utf-8 | <link rel="alternate" type="application/rss+xml" title="RSS" href="http://staff.tumblr.com/rss"/> |
Wordpress.com | RSS 2.0 | text/xml; charset=UTF-8 | <link rel="alternate" type="application/rss+xml" title="WordPress.com News" href="http://en.blog.wordpress.com/feed/" /> |
Feedburner Status | Atom 1.0 | text/xml; charset=UTF-8 | |
Jekyll | RSS 2.0 | text/xml | <link rel="alternate" type="application/rss+xml" title="Jekyll • Simple, blog-aware, static sites - Feed" href="/feed.xml" /> |
Octopress | Atom 1.0 | text/xml | <link href="/atom.xml" rel="alternate" title="Octopress" type="application/atom+xml"> |
Jeffrey Zeldman | RSS 2.0 | text/html | <link rel="alternate" type="application/rss+xml" title="Jeffrey Zeldman Presents The Daily Report RSS Feed. Designing with web standards." href="/rss/" /> |
Eric Meyer | RSS 2.0 | text/html | <link rel="alternate" type="application/rss+xml" title="Thoughts From Eric" href="/eric/thoughts/rss2/full" /> 2 |
Daring Fireball / John Gruber3 | Atom 1.0 | application/atom+xml | <link rel="alternate" type="application/atom+xml" href="/index.xml" /> |
A List Apart | RSS 2.0 | text/xml; charset=UTF-8 | <link rel="alternate" type="application/rss+xml" title="A List Apart: The Full Feed" href="/site/rss" /> |
24 Ways | RSS 2.0 | text/xml; charset=UTF-8 | <link rel="alternate" type="application/rss+xml" title="rss" href="http://feeds.feedburner.com/24ways" /> |
The W3 Consortium4 | RSS 2.0 | text/html | <link rel="alternate" type="application/atom+xml" title="W3C News" href="/News/atom.xml" /> 5 |
RSS 2.0 is the clear favorite. Still, several important feeds use Atom 1.0, notably the Feedburner status feed and Daring Fireball. This leads me to conclude that either format is perfectly acceptable in today’s web.
There’s little consensus here. For Atom feeds, text/xml makes an appearance, but several feeds use the most correct application/atom+xml.
In the RSS feeds, most stick with the safe bet of text/xml. I was surprised to discover that text/html is served by web evangelists Jeffrey Zeldman and Eric Meyer, and even more so that it is served by the official W3C news feed. Unless I am mistaken, this content type is just flat-out wrong and should not be used. I wonder if there is a rationale behind their decisions.
Every feed surveyed supports autodiscovery. Notably, none of the feeds except Blogger offered a choice of both RSS and Atom in the autodiscovery tags. This is good UI: presenting users with a choice between two functionally interchangeable formats is unhelpful at best and badly confusing at worst. Given the widespread compatibility of both RSS and Atom, you should pick one and serve that by default.
This admonishment is directed at me as well. Until performing this survey, I had been offering both formats through autodiscovery (in the name of user choice). Upon realizing that I was in a serious minority, I reevaluated and realized that I had made a poor decision. From now on I will be pointing to only one format. I will probably continue to serve the other feed format, but will not be advertising it.
Does your head hurt? Let’s distill these discoveries down to a small set of best practices. Here’s a mildly opinionated guide to serving a successful content feed:

- Serve a single feed in the Atom format. Name the file feed.xml, and serve it with the content type application/atom+xml.
- Put this in the <head> element of your site, adjusting the href and title:

<link rel="alternate" type="application/atom+xml"
href="/feed.xml" title="My Blog Feed" />

- Try to forget all that you have witnessed here today.
If you don’t have the very latest release of s3cmd, the story gets even more complicated. In older versions, the guess_mime_type option, if enabled, will actually override the one you specify (ugh). You’ll want to turn that option off in your s3cmd config. Here’s my bash hack to temporarily do so while uploading the file:
cat ~/.s3cfg | sed 's/\(guess_mime_type.*\)True/\1False/' > .tmpconfig
s3cmd -c .tmpconfig put --mime-type=application/atom+xml _site/feed.xml s3://technotes.iangreenleaf.com
rm .tmpconfig
Eric Meyer’s final feed URL is hidden behind two redirects. http://meyerweb.com/eric/thoughts/rss2/full -> http://meyerweb.com/eric/thoughts/feed/full/ -> http://meyerweb.com/index.php?feed=rss2&scope=full. o.O↩
Daring Fireball actually respects the Accept header, and returns 406 Not Acceptable when given Accept: application/rss+xml. I’m impressed.↩
During the drafting of this article, W3C pushed a new site design and altered their news feed. The old feed was found at http://www.w3.org/News/atom.xml, and returned an Atom feed with Content-Type: application/xml; qs=0.9. An Atom feed is still available, but is not visible to autodiscovery.↩
The W3 site still lists an Atom feed for autodiscovery, but this URL redirects to the new RSS feed (even though a new Atom feed is available at a different URL). This clearly seems like a mistake. I contacted the site maintainers and they are planning to fix it.↩
If you’ve worked on a serious Rails project, chances are you’ve been told at some point to check in your schema. However, the reasons why we do this are often glossed over by long-time Rails developers who know the history of the feature, leaving newcomers frustrated by a habit that seems confusing or redundant.
To understand the Rails database management plan, you’ll need to keep in mind the needs of two different kinds of people who will be consuming your database changes: those with an existing environment and those setting up a new environment.
The existing environment might be your local development environment, or a colleague’s. It might be your production machine. It might be someone to whom you are distributing code. All these have something in common: they are already using your application with an existing database. These environments need a way to upgrade the database to the newest version. This is what migrations do wonderfully well.
The new environment might be someone setting up your application for the first time. It might be the developer you just hired getting her machine up and running. It might be you, a month from now, when you’ve massively screwed up your existing database and want to drop it and start fresh. Be certain of one thing: sooner or later, a new environment will come along, even if one doesn’t exist right this moment.
The new environment doesn’t need anything upgraded; it needs to create a new database that mirrors your latest. It may also need to seed the database with some defaults or dummy data. Migrations do neither of these things well.
In ages past, we did use migrations for new environments. The idea was that you would run all the migrations, in order, and by the end of the chain you would have the latest database structure. While this works in theory, it has fallen out of favor with the Rails community because in practice, it sucks.
It’s slow. As your application grows, you probably first create a few small tables, then add columns to them bit by bit as needed. Then they get too large and you move some columns to a new associated table. Then your priorities change and you delete some columns. Then you add some more columns. Then you have to change the type of one of your columns from varchar to text.
Running all the migrations means that you have to replay all of these migrations, one at a time, as they originally happened. This is pretty wasteful when you think about the simple CREATE TABLE commands that would achieve the same result. This isn’t merely a theoretical problem either – loading from schema is measured in seconds, while running the full migration chain for even a modestly mature application will almost certainly be measured in minutes.
It’s fragile. So, so fragile. Running the full chain of migrations means that every migration must continue to function, forever. The classic pitfall that almost every Rails project encountered was using classes in migrations. We would pull in a class to make use of the ActiveRecord queries, or use it to manipulate data that needed to be changed. This was all well and good at the time, but months later we would rename the method, or delete the class, and suddenly the migration would be broken. Our code had changed, but the migration had not.
Worse, this is just the most common failure mode for migrations; it’s certainly not the only one. Did you use Time.now to set a default timestamp? That was a bad plan. Using a Rails 2 method but you’ve since upgraded to Rails 3? Oops. Updating existing data with an SQL query that fails if the table is empty? Tsk.
It’s possible to protect against some of these failures by defining all classes within the migration rather than autoloading them from the rest of your Rails app, and copy-pasting any needed methods. However, this is a nuisance to follow, difficult to enforce, and easy to forget. Running a large migration suite in sequence is a messy, slow, error-prone process, and making it less so means spending time on ceremony that doesn’t actually improve anything – the worst kind of maintenance. Migrations have no test coverage and are not loaded by the server process. Migration upkeep does not happen because migrations are invisible until the moment someone wants to run them, which happens to be the worst time to discover that they are broken.
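The class-rot failure above can be sketched in plain Ruby, no Rails required. Widget and backfill! are made-up names standing in for an application model and a data-manipulation method:

```ruby
# "Widget" stands in for an application model.
class Widget
  def self.backfill!
    "backfilled"
  end
end

# A migration written today happily reaches into app code...
old_migration = -> { Widget.backfill! }
puts old_migration.call  # => "backfilled"

# ...then months later the model is renamed in a refactor...
Object.send(:remove_const, :Widget)
class Gadget
  def self.backfill!
    "backfilled"
  end
end

# ...and replaying the old migration chain now explodes,
# even though the app itself works fine.
begin
  old_migration.call
rescue NameError => e
  puts "broken migration: #{e.class}"  # => broken migration: NameError
end
```

The migration is frozen in time while the code around it keeps moving; the constant lookup only fails at the moment someone replays the chain.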
So while migrations are the right tool for existing environments, they’re not a good option for new environments. Thankfully, the Rails community has settled on a better solution: check in your schema.rb. When a new developer wants their own database, they simply load the structure specified in your schema by running rake db:setup (or rake db:schema:load). No messy chain of migrations, just a direct creation of the newest database tables. This method is faster, cleaner, and avoids all the ugly failures that can crop up in an extensive migration chain.
What seems like two different solutions is actually two parts of the same solution. The schema file represents the current state of the database. The migrations explain how to reach that state from somewhere else. I like to check in a new migration and the changes it causes to schema.rb in the same commit.
Common objections:
But I use migrations to add data to the tables!
Don’t. This is one of the most pernicious and difficult-to-fix causes of the fragility issues in migrations. The proper place for this kind of data is in db/seeds.rb. Seed data will be loaded when someone runs rake db:setup (or rake db:seed), so it will show up for someone setting up their db from scratch. If it’s imperative that this seed data is pushed out to existing environments as well, go ahead and add it to your migration, but realize that this is in addition to, not instead of, a clean, complete set of seeds.
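As a sketch of what belongs in db/seeds.rb: make the seeds idempotent, so running them repeatedly is harmless. Role here is a hypothetical model, and the stub class exists only so the snippet runs outside a Rails app – in a real project, ActiveRecord provides the model and you keep only the bottom half:

```ruby
# Stand-in for an ActiveRecord model, just for this sketch.
class Role
  @records = []
  def self.find_or_create_by(attrs)
    @records << attrs unless @records.include?(attrs)
    attrs
  end
  def self.count
    @records.size
  end
end

# db/seeds.rb: idempotent seeds, safe to run repeatedly
# via rake db:seed.
%w[admin editor viewer].each do |name|
  Role.find_or_create_by(name: name)
end
puts Role.count  # => 3
```

Because find_or_create_by only inserts missing rows, both a fresh environment and an existing one can run the seeds without creating duplicates.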
But it’s a machine-generated file!
This is true, but for once this is a generated file you want to check in. The format of schema.rb is very clean and human-readable, so changes to it are easy to understand and limited to a small number of lines. Merge conflicts are rare unless you are simultaneously modifying the same database table as someone else, which is a dodgy proposition anyhow. And Rails automatically updates schema.rb every time you migrate, so it’s hard to forget to commit your changes.
But we’re having trouble getting all the developers’ schemas to agree!
Some ugly problems can crop up when you start checking in schema.rb on a project where it was previously not checked in. Different developers might have slightly different existing databases. One might have an older default date in a timestamp column; another might have a different VARCHAR length; another might be missing a :null => false on a column. You’ll discover all these inconsistencies when a developer runs the latest migrations and ends up with uncommitted changes to his schema.rb. If he commits them, the first developer will encounter uncommitted changes the next time she runs a migration. Their competing schemas are battling for dominance through your source control system.
This situation sucks, no doubt about it. However, don’t let it dissuade you from making the switch! This is a convincing demonstration of how many things can go wrong with migration chains – your databases have been drifting apart and no one even realized it! You’re going to need to do some grunt work on this one. Track down the offending developers one by one and insist that they come into compliance. Often this means backing up their data with mysqldump, running rake db:drop && rake db:setup, then re-importing the data. If that’s not enough you might have to handcraft some ALTER TABLE statements to fix the worst of the problems. Bite the bullet and make it happen. Remember, the alternative is leaving your developers’ databases in a known broken state.
I like to start all of my personal “chains” – blogs, social network accounts, and so on – with a small, unimportant placeholder post. I’m not sure why, exactly. I could claim it’s to see if everything is configured correctly, but that would only be a partial truth. I do it because it just seems… right.
Rather than delve further into my psyche, let me share a neat trick to follow this same practice in Git. Every Git repository has a first commit. As a matter of religion I have always commented these commits with simply Initial commit. In the past I would commit some placeholder for the project – a Rails skeleton, or a Readme, or whatever code I had so far.
The first commit in Git is tricky, though. It’s not quite as malleable as the other commits if you later try to modify the history. Better to make a true placeholder: an empty commit. It’s actually quite easy to do this.
git init
git commit --allow-empty -m "Initial commit"
Boom. Enjoy the new technique, and enjoy the irony that in showing you this, my first post has grown to something not entirely inconsequential.