Image prompt: "a demon hiding behind a statistical distribution".
A tale of two referrers
One of the fundamental questions of our era is this: how much can I judge you by the statistical evidence I’ve gathered about people like you - people with whom you share some trait?
A couple weeks ago, I posted a long blog post to Hacker News, which served as a sort of soft-launch for Otherbranch. While I would have posted the blog regardless - and in fact wrote most of it before I had committed to starting a company at all - I posted it on HN in particular and at that particular moment because I hoped it would raise our profile among a pool of engineers I expected to be good.
Prior to that post, we’d been sourcing candidates elsewhere. And that sourcing - the one part of hiring I personally hadn’t done before, and therefore, unsurprisingly, a weak point - was not going very well. Only 4 of 22 interviewees passed our interview, and three of those four were only just over the line.
On our coding task (practice variant here, which is slightly easier than the real problem), 3 completed zero steps, 10 completed one, 8 completed two, 1 completed three, and none completed more than three of the five total steps. We only ended up giving one of these candidates a strong recommendation.
The very first person we interviewed from the HN pool finished all five with time to spare, excellent code quality, and proper commenting. Among the people from that blog post we've interviewed so far, the pass rate is just shy of 50%, and fewer than 20% have been what I'd characterize as actually bad.
Otherbranch is new enough that we don’t have the kind of data we’d need to be statistically certain of that. (Data so far gets us a p-value in the low double digits on a homogeneity test.) But it won’t be long before we have that data, and I would be stunned if our statistics did not show that the average person from HN is a lot better than the average person from, say, Indeed. Triplebyte's statistics certainly did show that.
In short: candidate source is a statistically-verifiable proxy for skill, or at least for interview performance.
That is (I think, and expect to soon be able to prove) statistical fact. But is that a fact we should use? Should we, in effect, score candidates better on our interview if they come from a source we think is good on average?
I think this question is tougher than it looks.
A two-bit interview
One common complaint about interview processes - and about ours in particular - is that they don’t relate to the actual work of the job. And that’s true, at least to some extent. Whether or not you can write Minesweeper in the command line is a pretty far cry from whether you can set up a good database schema. Your ability to describe the time-complexity of finding an element in a balanced binary tree is a pretty far cry from whether you can figure out why prod just went down at 2 AM.
My normal defense to this claim is that interview processes necessarily cannot be fully representative of real engineering work, because real engineering work involves far too much context and far too long a timescale to be a realistic interview problem. Take-homes are arguably better measures of engineering skill, but they also effectively ask for many times as much time in unpaid labor as a regular technical screen would. (Some companies pay for that time, but that has a cost of its own, and that cost trades off against - among other things - credentialism. There are no easy answers here.)
Instead, interviews in general, and our interview in particular, are designed to be a proxy for performance. In our case, it’s intended to be a proxy for performance in company onsites. Company onsites, in turn, are intended to be proxies for on-the-job performance. The goal is not to literally represent the job, but to find assessable qualities that relate to it. Every interview with any quantitative backing works this way.
But, as much as I will ultimately defend this approach anyway, there are…problems.
An interview offers limited statistical information. A pass-fail interview necessarily offers at most one bit (in the Shannon-entropy sense) of information; our interview has five overall score levels of unequal size for about two bits. And since interviews are of course not perfect measures of candidate skill, the actual amount they communicate about candidate skill is somewhat less than the amount of information from the interview result itself (it can't be more). I think, with reasonable confidence, that our interview is better than most, but two bits is not an extraordinary amount of information.
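(If the "bits" framing feels abstract, here's a minimal Python sketch of the idea. The five-level score split below is a made-up illustration, not our real score distribution.)

```python
# A rough sketch of the "bits" claim above. The five-level score split is an
# illustrative guess, not Otherbranch's real score distribution.
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete outcome distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A pass/fail result tops out at exactly one bit, and only when passes and
# fails are equally likely:
print(entropy_bits([0.5, 0.5]))   # 1.00
print(entropy_bits([0.8, 0.2]))   # ~0.72

# Five unequal score levels top out below log2(5) ~= 2.32 bits; a plausible
# skew lands right around two bits:
print(entropy_bits([0.10, 0.25, 0.30, 0.25, 0.10]))   # ~2.19
```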
And in cases where your information gathering is limited, your prior beliefs (in the Bayesian-statistics sense) play a very substantial role in determining your ultimate conclusions. Priors cease to be the more-or-less-arbitrary ansatz they’re often treated as and start having an effect on, if not wholly dominating, the output.
To make the point, let’s imagine a toy model with the following parameters:
You're doing a pass/fail interview, where 80% of good candidates and 30% of bad candidates pass.
30% of HN candidates are good.
5% of Indeed candidates are good.
You run 1000 candidates from each source through the interview, resulting in the following groupings:
There are 300 good HN candidates. Of these, 240 pass, 60 fail.
There are 700 bad HN candidates. Of these, 210 pass, 490 fail.
There are 50 good Indeed candidates. Of these, 40 pass, 10 fail.
There are 950 bad Indeed candidates. Of these, 285 pass, 665 fail.
None of this is surprising so far. The surprise comes when you look at your beliefs about an HN candidate who failed and an Indeed candidate who passed.
P(good | HN, fail) = (# good HN fails) / (# HN fails) = 60 / (60 + 490) ≈ 11%.
P(good | Indeed, pass) = (# good Indeed passes) / (# Indeed passes) = 40 / (40 + 285) ≈ 12%.
You end up with the belief that HN candidates who fail are about as good as Indeed candidates who pass.
(I'm aware that this is Bayes' rule with extra steps. But framing it in terms of counts instead of probabilities makes the nature of the problem more concrete for most readers.)
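If you'd rather poke at the numbers yourself, here's the same count-based arithmetic as a small Python sketch. The counts and pass rates are the toy model's, not real data:

```python
# The toy model above, worked from counts rather than probabilities.
counts = {
    ("HN", "good"): 300, ("HN", "bad"): 700,
    ("Indeed", "good"): 50, ("Indeed", "bad"): 950,
}
pass_rate = {"good": 0.8, "bad": 0.3}

def p_good(source, result):
    """P(good | source, result), via expected counts in each cell."""
    def cell(skill):
        n = counts[(source, skill)]
        return n * pass_rate[skill] if result == "pass" else n * (1 - pass_rate[skill])
    return cell("good") / (cell("good") + cell("bad"))

print(p_good("HN", "fail"))      # ~0.109 -- about 11%
print(p_good("Indeed", "pass"))  # ~0.123 -- about 12%
```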
This toy interview, as it happens, provides less than half a bit of information about candidate skill. (The result itself is about half a bit.) Since we think our real interview has more statistical power than this, we think it can meaningfully differentiate between an Indeed pass and an HN fail. It has to, for our ability to identify highly-skilled candidates with non-traditional backgrounds to work at all.
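One way to cash out "amount of information about candidate skill" is the mutual information between skill and the pass/fail result. The sketch below pools both sources from the toy model - an interpretive choice on my part - and comes out at around a tenth of a bit, comfortably under half a bit:

```python
# Mutual information I(skill; result) for the pooled toy-model population
# (2000 candidates, both sources). Counts come from the groupings above.
import math

joint = {
    ("good", "pass"): 240 + 40,   # 280
    ("good", "fail"): 60 + 10,    # 70
    ("bad", "pass"): 210 + 285,   # 495
    ("bad", "fail"): 490 + 665,   # 1155
}
total = sum(joint.values())
p = {k: v / total for k, v in joint.items()}

p_skill = {s: p[(s, "pass")] + p[(s, "fail")] for s in ("good", "bad")}
p_result = {r: p[("good", r)] + p[("bad", r)] for r in ("pass", "fail")}

mi = sum(
    p[(s, r)] * math.log2(p[(s, r)] / (p_skill[s] * p_result[r]))
    for s in ("good", "bad")
    for r in ("pass", "fail")
)
print(mi)  # ~0.11 bits of information about skill per interview result
```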
But what our interview does not have the power to do is make the prior irrelevant. It still affects the posterior distribution quite a bit, even if you assume we get all two ideal bits out of our interview (and I promise you we do not). It could easily push a borderline HN fail above a borderline Indeed pass. If ideal statistical truth is all we are after, we should punish you for not coming from HN.
This isn't bias in the traditional sense, or even in the DEI sense, of the word. It's just statistical trends…that happen to leave people who, for whatever reason, don't like hanging out on HN out in the cold.
Your honor, she has priors
The problem is what priors we can, or should, use.
I think most readers would agree that, even if the statistics supported us doing so and even setting aside legality, it would be wrong to set a prior based on a candidate’s race, gender, or other traits that are wholly out of their control. There’s a lot I’d like to say on that subject - a lot of nasty beliefs in this domain lurk below the surface of tech, unchallenged because they're smart enough to stay quiet - but for now it suffices to avoid the flame-war and say “no, we’re not doing that”.
But things get complicated as the priors get murkier and more related to things someone does have control over.
Should we use a “came from Hacker News” vs “came from Indeed” prior? It’s not prima facie discriminatory, notwithstanding HN’s typical demographic breakdown. Since HN is so tightly entangled with the startup world, HN as a source is even arguably a construct-relevant (i.e., it's "measuring the thing we're trying to measure") trait. But I'm pretty uncomfortable with the idea at a glance, and I think you should be too.
Or, relatedly, even if I chose not to use HN-vs-Indeed to determine our recommendations to clients, would it be legitimate for me to target HN as a channel in the implicit belief that HN is a valuable candidate source? I'm going to uncomfortably say yes later on, but for now, does the question seem entirely trivial?
How about a resume-typos prior? Friend-of-the-blog Aline Lerner of interviewing.io has written about this in the past, and anecdotally my experience matches her findings: that the quality of the writing on a resume is a reasonable signal of a candidate’s quality. GPT distortion of this particular metric in 2024 notwithstanding, could we feed resume typos in as a model input? Is that how you'd like to be evaluated as a developer - by your ability to not mix up English homophones?
What about location? My gut says the average candidate in the Bay Area does better on our interview than the average candidate in, say, Columbus. If my gut is right, should that be a model input? Notwithstanding morality, it would be illegal to, for example, discriminate against people born in another country. But no such protection applies to whether someone lives in Kansas or Seattle, even if the role is remote and otherwise totally unrelated to location. (Obligatory I-am-not-a-lawyer-this-is-not-legal-advice disclaimer.) We could do that, and in practice, many companies implicitly do do that. But should we do that?
It would surprise me if any of candidate source, resume typos, or location ended up having no statistical power. That power exists - or at least, I think it probably does - and it bears directly on Otherbranch's organizational goals. I’m confident that if we cannot detect the signal now, we will be able to in the near future, and that using it is well within our power.
If you think I'm wrong about that, fine. But that’s orthogonal to the question of what you would do if they were meaningful inputs. Assume for now that the world is the most frustrating, irritating, morally-ambiguous version of itself - what do you do then?
Can we use these inputs? Should we? By what standard might we judge?
The world we want to make
I do want to give the best provisional answer I can to these questions. But the point of this blog post is not to give my answers. It’s to emphasize that the question is very, very hard when considered as an ethical question and not simply a business-self-interest one.
For me, the decision comes down to the kind of world it builds - the externalities and Goodhart’s law-style incentive gradients these decisions create. And that consideration leads me to say “no” to most of these inputs.
One of the major arguments for privacy is that people should be free to not subject every detail of their lives to scrutiny. And that argument, in turn, rests on the implicit observation that any trait subject to scrutiny quickly becomes the site of a competitive race to the bottom.
In the most extreme case, it wouldn’t really shock me if whether or not you, I don’t know, enjoy Vietnamese food ended up correlating in some weird way with engineering skill. Maybe it correlates with openness to experience or something, I don’t know. (Hey, there’s another thing we don’t use as a model input and probably never will.)
I use this specific example because, in fact, consumption of Vietnamese food is rather directly psychologically linked to my current role as a CEO. The openness to one is very directly related to the openness to the other, and I literally do not think I would be a CEO today without the factors that also led to my love of a good bowl of pho. I would bet money that there is a statistical relationship, at least among white-as-sour-cream Americans like myself.
But a world in which you choose whether or not you get a dinner you like out of a fear that some hypothetical employer will use it as a model input is a horrifying world. That world might be based on facts in some sense, but I doubt anyone reading this post really wants to see it come to fruition. It’s a stupider version of Gattaca where Ethan Hawke is banned from the space program for going to the Olive Garden.
Moving on: if candidate source became a model input, what would we be telling candidates? Well, one, we’d be telling them to lie about where they come from, and that’s a great way to make your analytics useless in one fell swoop.
But even if we could somehow compel honesty, or if we didn’t publish what we were doing, we’d be creating a world in which what site you choose to scroll through on your phone at 1 AM when you ought to be sleeping starts being a thing you have to worry about. (Actually, given that this sort of data is already seeing some use from financial agencies, perhaps it already is.) And again, no one wants this world - even the people who are playing some part in creating it.
Geographic model inputs might not create much in the way of immediate incentive structures (most people don't move just to slightly improve their chances on an application), but what they do create is a trap. And worse, they’re self-fulfilling: the more an ambitious person is compelled to move to a high-average-skill hub, the more those hubs become filled with skilled people, and the more their places of origin become brain-drained.
Of course, this already happens, and it’s a big part of why tech hubs are tech hubs in the first place. But I don’t think anyone’s arguing that this is necessarily a good thing.
The rise of remote work has shown us just how much the people who have historically been forced to concentrate in certain areas actually wanted to be elsewhere. My legal residence remains in California, but I’m writing these words in a little town high in the mountains where I can go see the Milky Way at night. And this is, in isolation, a good thing. Someone might argue that the benefits of in-office work outweigh that cost, and they might even be right, but there’s a difference between “a cost you're willing to pay” and “preferable in principle”.
The only one of these things I do feel comfortable with is targeting good channels. The only obvious incentive that creates is to read a site that provides you with relevant content, and I’m OK with that. That doesn't seem like it distorts life too much. Most of the incentive warping there is on me, and I founded a company - my incentives being wildly warped is what I signed up for.
What if we just didn't do the bad thing?
To put things in somewhat more utilitarian terms: the marginal benefit we gain by slightly improving our statistical judgement is outweighed by the collective chilling effect these inputs have on human beings being able to live fulfilling lives.
In a broader sense, each statistical fact humanity uncovers creates another opportunity to build a world we like less. In this era of vast discovery, we are uncovering a lot. If you haven’t thought about what you’re going to do about that, well, I’d suggest you start. And it’s not just in recruiting. It’s in whether you can buy a house, whether you’re arrested for a crime, whether someone invests in your company, whether you get shown to the love of your life on insert-dating-site-here.
The better ML models get, the more of life becomes dominated by these statistical tendencies. A lot of the criticism leveled against AI-driven decision-making comes from the implicit claim that the models are wrong. But it's so, so much worse if they're right. No one's incentivized to use an incorrect output, at least if they themselves are not trying to push a falsehood. But everyone is incentivized to use a correct one - no matter its effects on the world at large.
I said earlier that a fundamental question of our era is: how much can I judge you by the statistical evidence I’ve gathered from people like you?
Show me someone who says “zero”, and I’ll show you a hypocrite. Everyone does this, in one way or another. And show me someone who says “as much as you want”, and I’ll show you someone who doesn’t want their next job to depend on their taste for Vietnamese food. It’s not as simple as that. We can neither ignore all fact on principle, nor shrug and allow statistical observations to consume all the long tails we'd like to go live in sometimes.
As for Otherbranch - well, we’re still going to use interviews as a statistical proxy. We’ll do the best we can to keep them relevant to real jobs and to keep them from creating horrible incentives. But we’re not going to perfectly succeed at either. We're going to try to make them more fair than typical companies, to offer more opportunity and adjust less on construct-irrelevant factors than most companies do. But we won't fully succeed at that, either.
We already know that our interview suits some people better than others. It’s great for people who are curious and spend a lot of time learning about things that don’t directly come up in their job. It sucks for people who struggle under time pressure or with someone watching them. Our interview is already unfair, even without complex societal signaling involved. The best moral justification I have for this is that I don’t have a better idea, and that I’d rather we at least be doing it with intentionality and transparency and data and with the compassion to say that just because we think you suck doesn’t make you worth any less.
But doesn’t everyone say that? "Sure, we compromise, but look at the other guy". It’s a coward's excuse.
One of my main personal motivations for starting a company is that I want a seat at the table for this kind of decision. Whatever I might be, whatever moral failings I might have, I am at least trying to do right by everyone. But this blog post is an example of the odd nature of being in such a position. Sometimes you make a call, and then write a few thousand words about how you might be just a little bit evil, how you might be contributing just a little bit to the creeping dystopian hellscape that, in the year 2024, always seems to be just over the horizon.
Will Otherbranch’s ability to identify great engineering talent be hurt by deciding not to use some of the information at our disposal? Yeah, probably. Conditional on these measures having nonzero statistical power, it will hurt our judgements by the very definition of statistical power. We lose something by not contributing to the creation of the aforementioned dystopian hellscape. Maybe we’ll recoup it by writing moralizing blog posts, but that’s not a confident bet on my part.
But who the hell cares? Sometimes you just don't do the bad thing because it's the bad thing, because in addition to being a businessperson you're also a human who has to live in the world you play a part in building. You can just not do the thing you're incentivized to do, as long as you're willing to pay for it. And I'd rather fail to build something good than give up and build something bad.