Haskell programmers are used to making programming super annoying in order to achieve greater correctness; Python programmers wouldn't be using Python in the first place if they were willing to make major sacrifices for correctness.

(There, have I made everyone hate me enough?)
Two design decisions I made early on that shaped the entire development of Hypothesis:

1. Shrinking / test-case reduction (the process that makes the examples Hypothesis gives you readable) is not a user-configurable part of the API.
2. Failing examples are saved in a database.
Both of these were very much motivated by user experience considerations: shrinking is a pain in the ass for users to do themselves, and non-repeatable tests make the debugging experience during development awful.
A design decision that I didn't make as early on as I should have was the move to explicit generators. In QuickCheck, partly because Haskell, partly because of its verification-like nature, data is specified by types: your test takes string arguments, so QuickCheck gives you strings.
In early Hypothesis (this predates the typing module) there was a type-like syntax that you used to specify the generator you wanted. This was weird and bad and hard to extend. My excuse is that I'd been writing a lot of Ruby when I wrote that API so I thought DSLs were good.
In Hypothesis 1.5 we moved to the data generators just being things that you constructed explicitly based on a library of primitive generators and ways of combining them.

This incidentally is part of why they are called strategies: Originally that name was purely internal. 😭
A design question that emerged around this time was how to square flexible ways of constructing these strategies with no user-configurable shrinking. We have a good set of strategy combinators (composite, build, flatmap, map, filter) for constructing data. How do we shrink them?
The reason this is difficult is that classically shrinking works by taking a value and giving you a bunch of smaller variants, but we're letting people define data generators purely in terms of how you construct the data. There's no way to run that process backwards to shrink!
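For a sense of what that looks like from the user's side, here's a small sketch using the current public strategies API (the particular strategies and names are made up for illustration):

```python
from hypothesis import strategies as st

# Strategies are built purely out of combinators; nothing here says how to
# shrink the resulting values, only how to construct them.
even_numbers = st.integers(min_value=0).map(lambda n: n * 2)
short_names = st.text(min_size=1, max_size=10).filter(str.isalpha)
runs = st.integers(min_value=0, max_value=5).flatmap(
    lambda n: st.lists(st.integers(), min_size=n, max_size=n)
)

@st.composite
def user_records(draw):
    # composite mixes draws with arbitrary Python logic, which makes any
    # attempt to "run the construction backwards" even less plausible.
    return {
        "name": draw(short_names),
        "age": draw(even_numbers),
        "recent_scores": draw(runs),
    }
```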
More on this, and many other subjects, later. For now I'm going to take a break from thread writing.
Not properly resuming yet, but let me tease you on the next part with FUN HYPOTHESIS FACTS:

1. Hypothesis was my learning Python project.
2. Prior to 1.5.0 about 50% of Hypothesis development was done while quite drunk.
3. Hypothesis 0.2.0 to 1.5.0 were written for tax reasons.
Also we now have a glorious new mascot, but for a long time the unofficial mascot of Hypothesis was this ridiculous cat: https://twitter.com/sinister_katze/status/1157349116801929218

Why? He's fuzzy, eats bugs, and was present for a lot of the major work on its early development.
Aforementioned new mascot.
All of these facts were true. https://twitter.com/DRMacIver/status/1194566395620708353

When I originally wrote Hypothesis back in 2013, I was just moving from a Ruby job to a Python one. I wanted some practice at Python beforehand, so I needed a project.
I'd been talking about property-based testing with some people at a conference recently, and none of the options for Python looked very good (mostly due to a lack of shrinking), so I thought I'd write my own, slightly less bad one as an experiment.
I did, and it wasn't very good, but it basically worked OK and taught me enough Python to have achieved my goal. I then stopped working on it for about two years. During that time, a couple people started using it because it was still better than the alternatives.
I felt vaguely bad about this, because I wasn't really maintaining it, but also I mostly only started getting bug reports about it once I joined Google, which meant I wasn't able or willing to do any work on the project. It didn't help that I was severely depressed at the time.
Google was very bad for me, but that's a separate topic. The most relevant fact is that it meant I was living in Switzerland, in Zurich specifically, which will become relevant shortly.
Then I quit Google, which was great, but I'd lasted so little time there that if I moved back to the UK then and there I would have counted as resident in the UK for that tax year and ended up paying both Swiss and UK taxes (and some US taxes!), which I didn't really want.
(I've no problem with paying taxes, but once you're paying tax twice over to multiple countries it gets a bit much.)

So this meant I had a couple months to kill in Zurich, and conveniently there was this side project of mine that needed some attention...
I'd originally just intended to fix a few bugs and implement some interesting ideas, but I found there were all sorts of interesting problems in it and it rather got away from me in ways that have pretty much led to the last five years of my life (this was the beginning of 2015).
BTW I was drunk a lot of the time mostly because the flat I was living in had lots of tasty booze and this was at the height of my interest in cocktail making, so most evenings were spent with a cocktail or two (or three), and evenings often started early.
I'm sure this contributed to lots of good creative ideas in early Hypothesis development. I do not entirely recommend it as a strategy.

(It was probably also a leftover coping mechanism from the depression. I don't recommend that either. It never reached dangerous levels though)
Anyway I moved back to the UK in May, and began a period of trying to make money off Hypothesis (which I mostly failed at because I'm bad at business) and then eventually bootstrapping a PhD off it. That part of the history is less interesting, so I'll skip over it.
Oh, one fun fact from early Hypothesis history: the given decorator that is now the fundamental feature of the API was an afterthought. The original API was falsify(specification, test), which returned a value x for which test(x) was false.
given was written later when I went "Oh huh decorators seem neat I wonder how they work. Oh cool I can use that for test framework integration". It was a one-hour hack project. Many bad design decisions from that hack project still linger in core.py.
E.g. this was before I realised I couldn't call them generators in Python, so there are still some bits in core.py that use the terminology "generator" for arguments to given.
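Since given started life as a wrapper over that older interface, it helps to see the two side by side. The falsify half of this is a from-memory reconstruction (the type-like specification syntax it used is long gone); the given half is the API as it still exists:

```python
from hypothesis import given, strategies as st

# Old style (illustrative only, won't run against any released version):
# falsify(specification, test) searched for and returned a value x for
# which test(x) was False.
#
#     falsify(int, lambda x: x >= 0)   # might hand back e.g. -1
#
# Modern style: given wraps an ordinary test function, so any test runner
# (pytest, unittest, ...) can execute it, which is what made test framework
# integration a one-hour hack.
@given(st.integers())
def test_no_negative_integers(x):
    assert x >= 0  # Hypothesis finds, shrinks and reports a counterexample
```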
Anyway, this brings us up to around May 2015. At this point Hypothesis looks pretty modern: It's got the strategies module, it's got the given decorator, the public API looks a lot like what exists now but with a few weird legacy bits we've since dropped.
The internals however were both completely different, and a complete horror show. I'll talk more about that when I resume this thread... later.
But for now, a brief recap: if you occasionally wonder why some of the Hypothesis API design is a bit... strange, remember the above:

1. It started as a toy
2. I didn't know how to write Python then
3. The major work happened during a post-depression creative surge
4. While drunk
Given all of this I feel like it's turned out rather well.
(Hypothesis these days is, in contrast, a very serious, well-tested and well-engineered project developed by multiple people with a good culture of code review. I also very rarely drunk-commit to it these days.)
TBH even in the drunk committing days it was very well tested - one of the first things I did when resuming it was get code coverage up to 100% and insist on keeping it there. Between that and relatively good self-checking on my part Hypothesis has always been pretty robust.
I'm mostly busy for the next few hours but don't worry this thread will resume later with plenty more FUN HYPOTHESIS FACTS
In the interim:

1. "Hypothesis tests" means something totally different. The correct nomenclature is "tests using Hypothesis".
2. "example" is used for five unrelated things in Hypothesis of which three are public API.
3. Hypothesis once took down all of South Africa's internet.
In order to explain 3 I have to tell you about the curse. Hypothesis is cursed, you see.

Hypothesis finds bugs obviously. That's what PBT is for. Unfortunately, that is not enough to sate its dark hunger. Bugs are drawn to Hypothesis like bugs to a flame. Things break around it.
This is The Curse of Hypothesis: by using it you will find so many ways software is broken. Things only tangentially related to your tests will break. Things you never tested with it will break. Everything will break.
One of the earliest talks about Hypothesis was at PyCon South Africa. The speaker gets up on stage... and the internet goes down. Not the conference internet, but the entire pipe into South Africa. Also, the talk recordings later turn out to be entirely zeros. The curse has struck.
(this story may have been slightly dramatised in the retelling but is essentially true)
Right. Enough historical FUN HYPOTHESIS FACTS. Now for some science:

The big clever idea in Hypothesis is that it has a universal representation for all test cases. The easiest way to think of this is that it records the series of non-deterministic choices made during generation.
A pseudo-random number generator (PRNG) is basically a bitstream - you can think of it as an infinite sequence of coin tosses where bit n is 1 if the nth coin toss is heads and 0 if it's tails.
Any sequence of coin tosses occurs with non-zero probability, so any generator has to be able to deal with any sequence of bits regardless of where they come from - they might be uniform, but they don't have to be.
This can of course affect the probability of the generator terminating - there's no guarantee that a generator that terminates with probability 1 will still terminate if you give it an infinite sequence of heads - but we can solve that by just stopping generators at some point.
So this means that when we generate random test cases we can automatically replay them by just recording the sequence of bits and replaying that through the generator. This is how the Hypothesis database works.
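To make that concrete, here's a toy model of the record-and-replay idea. This is not the real Hypothesis internals, just a sketch of the principle: the generator only ever asks for bits, so a recorded sequence of bits is enough to reproduce the value exactly:

```python
import random

class Choices:
    """Toy choice source (not the real internals): every nondeterministic
    choice a generator makes is a single recorded bit."""

    def __init__(self, bits=None):
        self.bits = list(bits) if bits is not None else None
        self.record = []

    def draw_bit(self):
        if self.bits is None:
            bit = random.getrandbits(1)  # fresh random generation
        else:
            # Replay mode: reuse the recorded bits, padding with zeroes.
            i = len(self.record)
            bit = self.bits[i] if i < len(self.bits) else 0
        self.record.append(bit)
        return bit

def gen_byte(choices):
    """A 'generator' defined purely in terms of the bits it draws."""
    value = 0
    for _ in range(8):
        value = (value << 1) | choices.draw_bit()
    return value

fresh = Choices()
original = gen_byte(fresh)                   # random test case
replayed = gen_byte(Choices(fresh.record))   # exact replay from the record
assert original == replayed
```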

As described, this is no better than storing the seed, but...
...the nice thing about this over storing the seed is that we can manipulate the sequence of bits in whatever way we like. Even if our original sequence comes from a normal boring PRNG, we can try tweaking it and see what that does to the behaviour of the generator.
This is how shrinking/test-case reduction in Hypothesis works: Rather than manipulating the generated values, we manipulate the underlying sequence of bits that lead to them being generated. The number of choices made is a pretty good proxy for test case size so this works well.
Specifically, Hypothesis shrinking is *shortlex optimisation*. We try to reduce:

1. the number of choices made, and then
2. the sequence of bits lexicographically, among sequences of a given length.

This gives us better normalization (https://agroce.github.io/issta17.pdf) and lets us e.g. shrink integers.
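As a rough illustration of what shortlex shrinking over the choice sequence might look like (a deliberately naive sketch: `still_fails` stands in for "replay these bits through the generator and the test, and report whether the test still fails", and the real shrinker has far more passes and is far smarter about scheduling them):

```python
def shortlex_key(bits):
    # Shortlex order: fewer choices first, then lexicographically smaller.
    return (len(bits), tuple(bits))

def shrink(bits, still_fails):
    """Greedily find a shortlex-smaller choice sequence that still fails."""
    current = list(bits)
    improved = True
    while improved:
        improved = False
        # Candidate moves: delete a choice (shorter), or zero out a 1 bit
        # (same length, lexicographically smaller). Both are shortlex-smaller.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        candidates += [
            current[:i] + [0] + current[i + 1:]
            for i, bit in enumerate(current)
            if bit == 1
        ]
        for candidate in candidates:
            assert shortlex_key(candidate) < shortlex_key(current)
            if still_fails(candidate):
                current, improved = candidate, True
                break
    return current

# e.g. with "the test fails whenever at least two drawn bits are 1" as the
# stand-in property, any failing sequence shrinks to the minimal [1, 1]:
assert shrink([1, 0, 1, 1, 0, 1, 1, 0], lambda b: sum(b) >= 2) == [1, 1]
```

Because every accepted change is strictly smaller in shortlex order, this always terminates.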
This approach gives Hypothesis its two strongest claims to be fundamentally different from QuickCheck:

* It can do all sorts of things to your data without anything except a generator for each data type
* It guarantees your test inputs could have been randomly generated.
The latter means that we don't have what's called the test-case validity problem: you generate a failing test case that exposes a real bug, but shrinking turns it into something nonsensical that your test didn't know to reject, because the generator could never have produced it.
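A minimal illustration of the trap being avoided (the strategy and the deliberately failing test here are made up for the example):

```python
from hypothesis import given, strategies as st

# A generator whose outputs always satisfy an invariant: lo <= hi.
intervals = st.tuples(st.integers(), st.integers()).map(sorted).map(tuple)

@given(intervals)
def test_intervals_are_narrow(interval):
    lo, hi = interval
    assert hi - lo < 100  # deliberately fails for wide intervals

# A classic value-level shrinker that tweaks the two components
# independently could easily hand this test a pair with lo > hi - a value
# `intervals` could never have produced, and one the test has no reason to
# know how to reject. Because Hypothesis shrinks the recorded choices and
# re-runs the generator, every shrunk input still satisfies lo <= hi.
```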