Based on the results of this poll, it seems like it might be interesting for me to do a thread about Hypothesis: Some of you might learn what it is as a result, some of you might find it interesting because that's the content you're here for. So, Hypothesis, let's talk about it. https://twitter.com/DRMacIver/status/1194293686558973953
Hypothesis is a Python library that lets you do a thing called property-based testing. Property-based testing is when you specify that something should be true for a wide range of examples, and the computer tries to give you counterexamples to that claim.
These properties can be very mathematical (this value should always be in this range, this transformation should have this effect, etc.) or they can look like normal tests (e.g. users should never be able to see things they don't have permission for, the app shouldn't crash)
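To make that concrete, here's a minimal sketch of what a test using Hypothesis looks like (the property here is deliberately trivial):

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # The property: sorting an already-sorted list changes nothing.
    # Hypothesis generates lots of lists trying to falsify this claim.
    assert sorted(sorted(xs)) == sorted(xs)
```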
Technically Hypothesis is not just a Python library: there are also prototype versions for Java and Ruby, but those don't really work. The Ruby one was recently discontinued because the thing we were using to write it broke and we don't have the support bandwidth for it.
(the Java one has never been officially supported, it just exists to show that it could exist if we wanted to, and is based on a much earlier version of the concept)
Hypothesis in particular and property-based testing in general originally come from a Haskell library called QuickCheck. https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quick.pdf

QuickCheck's style is "lightweight verification": it's designed to look like theorem proving about your code, even though it's really testing.
The reason it's called "hypothesis" BTW is:

1. I didn't want it to be called SomethingCheck because I was bored of that naming convention.
2. It falsifies hypotheses about your code.
3. I didn't think very hard when naming it.
At this point I no longer describe Hypothesis as a QuickCheck port, though it started out life that way (it was really closer to a ScalaCheck port, as that's the one I'd used the most, but ScalaCheck is pretty close to QuickCheck - most property-based testing libraries are).
The big differences between Hypothesis and QuickCheck come in two forms:

1. One big idea that enables a lot of cool stuff (more on this later).
2. An awful lot of work on user experience, trying to make this style of testing easy to use and accessible to everyone.
The user experience work was necessary to get Python programmers willing to use this kind of thing at all. QuickCheck is a nice idea, but it's actually very annoying to use in all manner of ways, which I and others kept running into when using it.
Haskell programmers are used to making programming super annoying in order to achieve greater correctness, Python programmers wouldn't be using Python in the first place if they were willing to make major sacrifices for correctness.

(There, have I made everyone hate me enough?)
Two design decisions I made early on that shaped the entire development of Hypothesis:

1. Shrinking / test-case reduction (the process that makes the examples Hypothesis gives you readable) is not a user configurable part of the API.
2. Failing examples are saved in a database.
Both of these were very much motivated by these user experience considerations: Shrinking is a pain in the ass for users to do themselves, and non-repeatable tests make the debugging experience during development awful.
A design decision that I didn't make as early on as I should have was the move to explicit generators. In QuickCheck, partly because Haskell, partly because of its verification-like nature, data is specified by types: your test takes string arguments, so QuickCheck gives you strings.
In early Hypothesis (this predates the typing module) there was a type-like syntax that you used to specify the generator you wanted. This was weird and bad and hard to extend. My excuse is that I'd been writing a lot of Ruby when I wrote that API so I thought DSLs were good.
In Hypothesis 1.5 we moved to the data generators just being things that you constructed explicitly based on a library of primitive generators and ways of combining them.

This incidentally is part of why they are called strategies: Originally that name was purely internal. 😭
A design question that emerged around this time was how to square flexible ways of constructing these strategies with no user configurable shrinking. We have a good set of strategy combinators (composite, build, flatmap, map, filter) for constructing data. How do we shrink them?
The reason this is difficult is that classically shrinking works by taking a value and giving you a bunch of smaller variants, but we're letting people define data generators purely in terms of how you construct the data. There's no way to run that process backwards to shrink!
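For a sense of what those combinators look like in today's API (names as they currently exist; notice that none of these definitions says anything about how to shrink the result):

```python
from hypothesis import strategies as st

# map: transform generated values
evens = st.integers().map(lambda n: n * 2)

# filter: reject values that don't satisfy a predicate
non_empty = st.lists(st.integers()).filter(lambda xs: len(xs) > 0)

# builds: construct objects by drawing each argument from a strategy
points = st.builds(complex, st.floats(), st.floats())
```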
More on this, and many other subjects, later. For now I'm going to take a break from thread writing.
Not properly resuming yet, but let me tease you on the next part with FUN HYPOTHESIS FACTS:

1. Hypothesis was my learning Python project.
2. Prior to 1.5.0 about 50% of Hypothesis development was done while quite drunk.
3. Hypothesis 0.2.0 to 1.5.0 were written for tax reasons.
Also we now have a glorious new mascot, but for a long time the unofficial mascot of Hypothesis was this ridiculous cat: https://twitter.com/sinister_katze/status/1157349116801929218

Why? He's fuzzy, eats bugs, and was present for a lot of the major work on its early development.
Aforementioned new mascot.
All of these facts were true. https://twitter.com/DRMacIver/status/1194566395620708353

When I originally wrote Hypothesis back in 2013, I was just moving from a Ruby job to a Python one. I wanted some practice at Python beforehand, so I needed a project.
I'd been talking about property-based testing with some people at a conference recently, and none of the options for Python looked very good (mostly due to a lack of shrinking), so I thought I'd write my own one that was slightly less bad as an experiment.
I did, and it wasn't very good, but it basically worked OK and taught me enough Python to have achieved my goal. I then stopped working on it for about two years. During that time, a couple people started using it because it was still better than the alternatives.
I felt vaguely bad about this, because I wasn't really maintaining it, but I mostly only started getting bug reports about it once I joined Google, which meant I wasn't able or willing to do any work on the project. It also meant that I was severely depressed at the time.
Google was very bad for me, but that's a separate topic. The most relevant fact is that it meant I was living in Switzerland, in Zurich specifically, which will become relevant shortly.
Then I quit Google, which was great, but I'd lasted so little time there that if I moved back to the UK then and there I would have counted as resident in the UK for that tax year and ended up paying both Swiss and UK taxes (and some US taxes!), which I didn't really want.
(I've no problem with paying taxes, but once you're paying tax twice, to multiple countries, it gets a bit much).

So this meant I had a couple months to kill in Zurich, and conveniently there was this side project of mine that needed some attention...
I'd originally just intended to fix a few bugs and implement some interesting ideas, but I found there were all sorts of interesting problems in it and it rather got away from me, in ways that have pretty much led to the last five years of my life (this was the beginning of 2015).
BTW I was drunk a lot of the time mostly because the flat I was living in had lots of tasty booze and this was at the height of my interest in cocktail making, so most evenings were spent with a cocktail or two (or three), and evenings often started early.
I'm sure this contributed to lots of good creative ideas in early Hypothesis development. I do not entirely recommend it as a strategy.

(It was probably also a leftover coping mechanism from the depression. I don't recommend that either. It never reached dangerous levels though)
Anyway I moved back to the UK in May, and began a period of trying to make money off Hypothesis (which I mostly failed at because I'm bad at business) and then eventually bootstrapping a PhD off it. That part of the history is less interesting, so I'll skip over it.
Oh, one fun fact from early Hypothesis history: The given decorator that is now the fundamental feature of the API was an afterthought. The original API was falsify(specification, test), which returned a value x for which test(x) was false
given was written later when I went "Oh huh, decorators seem neat, I wonder how they work. Oh cool, I can use that for test framework integration". It was a one-hour hack project. Many bad design decisions from that hack project still linger in core.py.
e.g. this was before I realised I couldn't call them generators in Python, so there are still some bits in core.py that use the terminology "generator" for arguments to given.
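Side by side, the two shapes (the falsify form is reconstructed from the description above; it's long gone from the library):

```python
from hypothesis import given, strategies as st

# The modern shape: given turns a plain function into a test.
@given(st.text())
def test_stripping_is_idempotent(s):
    assert s.strip().strip() == s.strip()

# The original shape, roughly (not real API any more):
#   falsify(specification, test)  # returned some x with test(x) falsy
```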
Anyway, this brings us up to around May 2015. At this point Hypothesis looks pretty modern: It's got the strategies module, it's got the given decorator, the public API looks a lot like what exists now but with a few weird legacy bits we've since dropped.
The internals however were both completely different, and a complete horror show. I'll talk more about that when I resume this thread... later.
But for now, a brief recap: If you occasionally wonder why some of the Hypothesis API design is a bit... strange, remember the above:

1. It started as a toy
2. I didn't know how to write Python then
3. The major work happened during a post-depression creative surge
4. While drunk
Given all of this I feel like it's turned out rather well.
(Hypothesis these days is in contrast a very serious well tested and well engineered project developed by multiple people with a good culture of code review. I also very rarely drunk commit to it these days)
TBH even in the drunk committing days it was very well tested - one of the first things I did when resuming it was get code coverage up to 100% and insist on keeping it there. Between that and relatively good self-checking on my part Hypothesis has always been pretty robust.
I'm mostly busy for the next few hours but don't worry this thread will resume later with plenty more FUN HYPOTHESIS FACTS
In the interim:

1. "Hypothesis tests" means something totally different. The correct nomenclature is "tests using Hypothesis".
2. "example" is used for five unrelated things in Hypothesis of which three are public API.
3. Hypothesis once took down all of South Africa's internet.
In order to explain 3 I have to tell you about the curse. Hypothesis is cursed, you see.

Hypothesis finds bugs obviously. That's what PBT is for. Unfortunately, that is not enough to sate its dark hunger. Bugs are drawn to Hypothesis like bugs to a flame. Things break around it.
This is The Curse of Hypothesis: by using it you will find so many ways software is broken. Things only tangentially related to your tests will break. Things you never tested with it will break. Everything will break.
One of the earliest talks about Hypothesis was at PyCon South Africa. The speaker gets up on stage... and the internet goes down. Not the conference internet, but the entire pipe into South Africa. Also, the talk recordings are later replaced with entirely zeros. The curse has struck.
(this story may have been slightly dramatised in the retelling but is essentially true)
Right. Enough historical FUN HYPOTHESIS FACTS. Now for some science:

The big clever idea in Hypothesis is that it has a universal representation for all test cases. The easiest way to think of this is that it records the series of non-deterministic choices made during generation
A pseudo-random number generator (PRNG) is basically a bitstream - you can think of it as an infinite sequence of coin tosses where bit n is 1 if the nth coin toss is heads and 0 if it's tails.
Any sequence of coin tosses occurs with non-zero probability, so any generator has to be able to deal with any sequence of bits regardless of where they come from - they might be uniform, but they don't have to be.
This can of course affect the probability of the generator terminating - there's no guarantee that a generator that terminates with probability 1 will still terminate if you give it an infinite sequence of heads - but we can solve that by just stopping generators at some point.
So this means that when we generate random test cases we can automatically replay them by just recording the sequence of bits and replaying that through the generator. This is how the Hypothesis database works.

As described, this is no better than storing the seed, but...
...the nice thing about this over storing the seed is that we can manipulate the sequence of bits in whatever way we like. Even if our original sequence comes from a normal boring PRNG, we can try tweaking it and see what tweaking it does to the behaviour of the generator.
This is how shrinking/test-case reduction in Hypothesis works: Rather than manipulating the generated values, we manipulate the underlying sequence of bits that led to them being generated. The number of choices made is a pretty good proxy for test case size, so this works well.
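A toy illustration of the whole idea (this is made-up code, nothing like the real internals): any generator that reads from a bit sequence can be replayed, and edited, just by replaying or editing the recording.

```python
import random

def gen_list(bits):
    """Toy generator: a leading 1 bit means 'one more element',
    then 4 bits encode that element. Any bit sequence is valid."""
    bits = iter(bits)
    out = []
    while next(bits, 0):  # running out of bits just stops generation
        value = 0
        for _ in range(4):
            value = value * 2 + next(bits, 0)
        out.append(value)
    return out

recording = [random.getrandbits(1) for _ in range(40)]
print(gen_list(recording))       # random generation
print(gen_list(recording))       # exact replay, for free
print(gen_list(recording[:10]))  # a "shrunk" recording: still a valid list
```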
Specifically, Hypothesis shrinking is *shortlex optimisation*. We try to reduce:

1. the number of choices made
2. the sequence of bits lexicographically, among sequences of a given size.

This gives us better normalization ( https://agroce.github.io/issta17.pdf ) and lets us e.g. shrink integers.
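i.e. the shrinker is hunting for the minimal recording under this ordering (a tiny sketch):

```python
def shortlex_key(bits):
    # Shortlex: compare by length first, then lexicographically.
    return (len(bits), bits)

assert shortlex_key([1, 1]) < shortlex_key([0, 0, 0])  # fewer choices wins
assert shortlex_key([0, 1]) < shortlex_key([1, 0])     # then smaller bits win
```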
This approach gives Hypothesis its two strongest claims to be fundamentally different from QuickCheck:

* It can do all sorts of things to your data without anything except a generator for each data type
* It guarantees your test inputs could have been randomly generated.
The latter means that we don't have what's called the test-case validity problem: when you generate a failing test case that exposes a real bug, but your shrinking turns it into something nonsensical that your test didn't know to reject, because it could never have been generated.
The former means we can add things like targeted property based testing http://proper.softlab.ntua.gr/papers/issta2017.pdf and all users have to do is specify a target() function and they'll get all the functionality for free rather than having to write a mutator for their data.
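That API exists today: the test just calls target() with a score and the engine does the steering, e.g.:

```python
from hypothesis import given, target, strategies as st

@given(st.floats(0, 1000), st.floats(0, 1000))
def test_stays_in_budget(x, y):
    # Report a score; Hypothesis steers generation towards inputs
    # that maximise it, looking for extreme cases.
    target(x + y, label="total")
    assert x + y <= 2000
```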
Now, currently our shrinker is world class and our mutator is fairly basic, but the nice thing about having these be part of the core engine operating on a universal representation is that they'll improve over time without the user having to do anything.
From our point of view this is also *great* for backwards compatibility reasons. If we exposed these as user visible mutation and shrinking APIs, we'd have to support those APIs indefinitely even if we came up with fundamentally better approaches.
Currently the mutator is based on a relatively simple hill climbing algorithm, but we could switch to simulated annealing, genetic algorithms, whatever, and this is completely user transparent. If they defined the mutation operators we couldn't do that.
This was actually one of the original motivations for keeping shrinking part of the internal API: I was still experimenting with what the right API design for it was and didn't want to commit to a bad one. I assumed it would eventually become public. https://twitter.com/DRMacIver/status/1194663929915547648
More FUN HYPOTHESIS FACTS:

The original motivation for this API design, which was introduced in the Hypothesis 3.0 release in 2016, was twofold:

1. Port as much of the core as possible to C
2. Support running Hypothesis under AFL

Neither of these things has happened yet.
Maybe more on this later.
Oh, one big advantage of this approach that I forgot to mention, because I keep forgetting what a big deal it is for other PBT libraries in impure languages: it works flawlessly with mutable data. Because our test case representation is separate from the data, tests can mutate it however they wish.
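e.g. a test like this is completely fine in Hypothesis (a sketch; the mutation is deliberate), because shrinking replays the recorded choices to rebuild the list rather than looking at the mutated object:

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_survives_mutation(xs):
    xs.append(0)  # mutate the generated value as much as we like
    assert 0 in xs
```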
In a library where shrinking proceeds based on values, say you generate a mutable list, and your test inserts something into it. The shrinker then tries to delete elements from the list and you get into an infinite loop where the shrinker keeps deleting and the test keeps adding.
This was actually a major motivation for the previous design of Hypothesis. Each strategy generated its intermediate representation, called a template, which was then reified into a value. Prior to that we generated values directly and called deepcopy to avoid mutation.
The reason I moved from copying to the template system was that the cryptography library wanted to use Hypothesis and some of the rules they wanted to generate were incompatible with deepcopy.
A common theme, especially in early Hypothesis, is me having to solve really mundane problems like this and running up against the limitations of what we knew how to do and having to reinvent the model to support them.
Except I wasn't actually at the limits of what we knew how to do. A lot of this would have been sidestepped if I'd known about Reid Draper's simple-check (now test.check): http://reiddraper.com/writing-simple-check/

I'm glad I didn't. My solution was initially much worse, but had more room to grow.
The rose tree model for shrinking is unfortunately kind of a dead end, because this problem is insurmountable: https://github.com/clojure/test.check/blob/master/doc/growth-and-shrinking.md#unnecessary-bind

In Hypothesis's model because we've got a concrete representation we've got a lot more room to mess around with the generated test case.
Things like that absolutely do cause the Hypothesis shrinker a bit of grief - they're not *easy* to shrink - but there exists a valid transformation of the underlying representation that the shrinker could perform if it knew how, so shrinking is possible.
But I've spent a lot of time tuning the heuristics of the shrinker and these days it works pretty reliably on most weird combinations of user data. Getting to treat shrinking as a fully concrete optimisation problem is a huge win.
Earlier versions of Hypothesis however, hoo boy, those did not handle bind (flatmap in Hypothesis terms) well at all. There was a big question of what the template should be for that. If you have my_strat.flatmap(f) you want the template to be a tuple (x, y), but...
...you need y to be a valid template for the strategy that you get when you run the flatmap, which requires you to reify the first template and invoke f to get the strategy.

What's the problem?

Well, Python functions aren't pure. f might do arbitrary bad things.
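Concretely (in the current API): with my_strat.flatmap(f), the shape of the second strategy only exists once you've called f on a generated value, and in Python f can do absolutely anything.

```python
from hypothesis import strategies as st

# The second strategy's shape depends on the first value: you can't
# know what a "template for f(x)" even looks like without running f.
square_matrices = st.integers(1, 5).flatmap(
    lambda n: st.lists(
        st.lists(st.integers(), min_size=n, max_size=n),
        min_size=n, max_size=n,
    )
)
```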
This isn't hypothetical either. Hypothesis has had (admittedly a bit ropy) support for generating Django models from very early on. Generating things in Hypothesis can write to the database! Also it could mutate in memory data, etc.
This design constraint has been a recurring headache in Hypothesis. User provided functions must be treated as hazardous and can never be invoked outside the context of a test. This enables a lot of nice stuff for users, but it does rather tie our hands.
Anyway, long story short, the awful terrible solution was this: Hypothesis strategies at the time also had a serialised representation for their templates. They had to be able to validate this so as not to produce bad data when rerunning changed tests.
So a template for my_strat.flatmap(f) was a pair of a template for my_strat and a serialised representation that might or might not be convertible to a template, but you would only find out when the test ran.

It was actually worse than that for reasons I won't get into.
This was dreadful and slow and buggy and caused all sorts of problems, but for the most part it worked, and it now meant something important: Hypothesis had a full end-to-end implementation of shrinking and serialisation that worked with anything you could randomly generate.
When writing the modern implementation of Hypothesis this gave me something very powerful: I knew that what I was trying to do was possible, because I'd already done it. The existing implementation was terrible, but I could focus on fixing that without worrying about whether it could work.
Remarkably little changed in the public API when I switched over from the old implementation to the new one. I had to remove a couple of really niche APIs that were no longer supportable (mostly things that made it look more QuickCheck like), but it was mostly fine.
Hypothesis has very much been a success story for writing a nice clean combinator based API that elides all of the underlying complexity of the implementation - I've been able to make some really invasive changes without users noticing anything except maybe some bugs.
One more FUN HYPOTHESIS FACT:

Hypothesis is more monadic than QuickCheck in two senses.

1. Hypothesis strategies actually form a monad (they're parser combinators!), QuickCheck's Gen only "morally" does.
2. Everything in Hypothesis is in the monad, QC's shrinking isn't.
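In code terms (a sketch): st.just plays the role of return and flatmap is bind, so e.g. the left identity law holds in terms of the distribution of generated values.

```python
from hypothesis import strategies as st

def f(n):
    return st.lists(st.booleans(), min_size=n, max_size=n)

# return a >>= f  ==  f a : both generate length-3 lists of booleans.
lhs = st.just(3).flatmap(f)
rhs = f(3)
```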
Despite this "monad(ic)" is mentioned exactly once in the Hypothesis docs. https://hypothesis.readthedocs.io/en/latest/data.html#chaining-strategies-together
This thread is much longer than I intended, but people still seem entertained. As a breather, here's a photo of the unofficial Hypothesis mascot's tail being attacked by his brother Dexter (this would probably have been a bit before I left Google, so not quite the Hypothesis era).
Additional amusing interlude, the post-Google creative surge when I wrote Hypothesis was also when I wrote Stargate Physics 101: https://archiveofourown.org/works/3673335 

For a while it was touch and go which would be my greater contribution to software correctness, but I think Hypothesis has won
For additional slightly ridiculous software correctness writing from that time, have you read "The Purpose of Hypothesis"? It's... a bit grandiose. https://hypothesis.readthedocs.io/en/latest/manifesto.html
There's a bundle more things I could say about Hypothesis, but also if you have any burning questions about it I'm happy to answer them in this thread.
https://twitter.com/janiczek/status/1194724196661760000

There isn't really one! There's a large pile of potential features, but because it's almost entirely a volunteer time project and there's several orders of magnitude more potential work than we could hope to get through, we just work on what we feel like.
We're currently working on improving the targeted property based testing implementation, and are talking about improving how we report bugs (especially when there are multiple failures), and have some plans to work on a fuzzer mode.
We've also talked about adding some test-case generalisation features (showing what features of the failing test can vary and what cause it to start passing), but I don't know if we're likely to work on that any time soon.
Other people are also working on other things. The biggest chunk of work that I don't have much input in is the work on improving our support for the scientific Python ecosystem.
FUN HYPOTHESIS STATISTIC: There have been 568 releases of Hypothesis for Python to date. Frankly I'm astonished it's that low.

Alex described our release process here: https://hypothesis.works/articles/continuous-releases/

It's not a perfect system, but it's still pretty great.
In the early days of Hypothesis the release process was very simple: I manually updated the version number and changelog, and ran "python setup.py sdist upload" on my laptop. This led to people being astonished at the rapid turnaround.
There are a lot of IRC conversations to the tune of "Is this a bug?" "Huh. Oh yeah. Fixed it. Install the latest version of Hypothesis and it'll work."

(Yes, Hypothesis has bugs, astonishing but true)
This began to bog down as I started to get pull requests and contributions to the project. Turns out I hate writing changelogs retroactively. Also, turns out that editing changelogs manually in PRs is a great way to create merge conflicts.
After a couple iterations of release automation, we ended up with the current system. Now every pull request that is merged to master that makes source changes will result in a release - the build enforces this, and automates the package building and upload.
Part of why we can do this is our high confidence in the build - Hypothesis is really well tested - and part is that we do goodish code review - though I'm not sure of the split. If those weren't true, it's not like releasing slower would help our confidence.