I've been on leave from work for the last week and a half, so I haven't sat down to talk about the paper I submitted! You can read the pre-print here: https://arxiv.org/abs/2009.08470. This is a piece of work I've been doing for two years, to try and teach computers to classify galaxies.
The trick in this paper is that we're using machine learning to do the task without any prior knowledge about the objects in question. Like trying to learn the answers for a test without any notes from your lectures!
We're using images from @sdssurveys of galaxies that were previously classified by @galaxyzoo so that we can compare back to the human results, but we aren't using those labels in the training.

After some pre-processing, the images look like this:
The basic concept behind the model we made (AstroVaDEr) is something called an "autoencoder". It's a neural network that takes some input, squashes it down to a really small size, and then tries to recover the original image.

Like trying to learn JPEG compression from scratch.
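To make the squash-and-recover idea concrete, here's a minimal sketch of a toy autoencoder in plain Python. It is not AstroVaDEr (which is convolutional and works on images): it's a made-up linear model that squeezes 2 numbers into 1 and tries to get them back. Every name and number in it is an assumption for illustration.

```python
# Toy linear autoencoder: 2 inputs -> 1 latent number -> 2 outputs.
# Illustrative only; names and numbers here are assumptions, not from the paper.

def reconstruct(x, w_enc, w_dec):
    """Encode a 2-vector to one latent value, then decode it back."""
    z = w_enc[0] * x[0] + w_enc[1] * x[1]      # the "squash" step
    return z, [w_dec[0] * z, w_dec[1] * z]     # the "recover" step


def mean_squared_error(data, w_enc, w_dec):
    """Average squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, w_enc, w_dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(data, steps=500, lr=0.05):
    """Plain gradient descent on the reconstruction error."""
    w_enc, w_dec = [0.1, 0.1], [0.1, 0.1]
    n = len(data)
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z, x_hat = reconstruct(x, w_enc, w_dec)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            # Gradient of the squared error with respect to the latent value.
            dz = 2 * (err[0] * w_dec[0] + err[1] * w_dec[1])
            for i in range(2):
                g_dec[i] += 2 * err[i] * z / n
                g_enc[i] += dz * x[i] / n
        w_enc = [w_enc[i] - lr * g_enc[i] for i in range(2)]
        w_dec = [w_dec[i] - lr * g_dec[i] for i in range(2)]
    return w_enc, w_dec


# Points that all lie on one line, so one latent number can describe them.
DATA = [[t, 2 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

Calling `train(DATA)` and comparing `mean_squared_error` before and after shows the reconstruction error dropping as the model learns its own compression scheme.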
What we'd like to do is take all of our images, compress them into this low-dimensional space, and then run a clustering algorithm that works out what groups of galaxies there are.

This turns out to be harder than it sounds, and to solve it we need BAYESIAN STATISTICS.
Please don't run away, Bayes can't hurt you.
An issue arises in the low-dimensional space where we compress our images. Without any outside influence, that space can get REAL janky, and there's no way to know if your clustering technique can work properly.

Unless you put some constraints on it.
What kind of constraint? A statistical one.

Using some very complicated maths, we can teach the network to not only compress the images into a very small space, but also to follow the shape of a clustering model.
I won't go into too much detail here because really it's only useful for the maths nerds out there, but basically we are trying to embed 150,000 images into a model of 12 hyper-ellipsoids in 20 dimensions with Gaussian properties.

MATH STUFF.
This is what the model itself looks like, schematically. It takes a 128x128 pixel image, and uses a convolutional neural network to compress it down into the latent space. Each image is encoded as a mean and a variance, which we statistically sample from and feed into a decoder.
The reason we encode into a mean and a variance is that neural networks can't learn directly from random variables, because reasons? Basically the maths that makes the network learn breaks, so we have to trick it into thinking the variables aren't actually random.
What we get out of the other end is a "reconstructed image", basically the network's best guess at recovering the original inputs.
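For the curious, the usual version of this trick in variational autoencoders is to write the sample as "mean + standard deviation x noise", so the randomness sits in a noise term that the network doesn't control. Here's a minimal sketch in plain Python. The function name, the example numbers, and the choice to store a log-variance instead of a variance are all assumptions for illustration, not taken from the paper's code.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + std * noise, one latent dimension at a time.

    The randomness lives in `noise`, which does not depend on the network,
    so gradients can still flow back through `mean` and `log_var`.
    """
    z = []
    for mu, lv in zip(mean, log_var):
        std = math.exp(0.5 * lv)       # log-variance -> standard deviation
        noise = rng.gauss(0.0, 1.0)    # standard normal draw
        z.append(mu + std * noise)
    return z


rng = random.Random(0)
# Hypothetical encoder outputs for one image in a 3-dimensional latent space.
z = sample_latent(mean=[0.5, -1.0, 2.0], log_var=[-2.0, -2.0, -2.0], rng=rng)
```

The sampled `z` is what would be passed to the decoder. With a very small variance the sample collapses onto the mean, which is a handy sanity check.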

Those best guesses look a little like this, where the 1st/4th rows are the original, the 2nd/5th are the guesses, and the 3rd/6th the residuals.
One of the difficulties with these types of networks is that they have this bad habit of blurring the images. So with AstroVaDEr we get these pretty good guesses at the general shape of the galaxies, but the internal structures get smoothed out.
The other thing AstroVaDEr does is produce a classification model of the inputs. In the paper we put together a model with 12 classes, and assign each galaxy a probability for each class and a label for the best match.
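As a rough sketch of what "a probability for each class and a label for the best match" means for a Gaussian mixture, here's a plain Python example. The mixture below is made up (3 clusters in 2 dimensions), and it assumes axis-aligned ellipsoids, i.e. one variance per dimension. None of these numbers or names come from the paper.

```python
import math


def log_gaussian(z, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )


def class_probabilities(z, weights, means, variances):
    """Probability that latent point z belongs to each mixture component."""
    logs = [
        math.log(w) + log_gaussian(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)                       # subtract the max for numerical stability
    unnorm = [math.exp(value - top) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Made-up mixture: 3 components in 2 dimensions (the paper uses 12 in 20).
WEIGHTS = [0.5, 0.3, 0.2]
MEANS = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
VARIANCES = [[1.0, 1.0], [1.0, 0.5], [0.5, 1.0]]

probs = class_probabilities([3.8, 0.1], WEIGHTS, MEANS, VARIANCES)
label = probs.index(max(probs))   # index of the best-match class
```

Here `probs` is the per-class probability list for one encoded galaxy, and `label` is the index of the most likely cluster.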

The results were... unexpected...
A quick refresher on galaxy morphology. Generally speaking, we classify galaxies based on whether they are Elliptical, Spiral or Barred Spiral, with some intermediary classes and exceptions.

I had hoped the network would do the same thing.
What we got instead was a few classes that look like ellipticals, spirals, big ellipticals AND spirals. Two classes are full of edge-on galaxies, except one is all the vertically oriented galaxies, and the other is all horizontal.

And then a bunch of weirdos.
The other seven classes of image are: "any galaxy with source on upper edge", "any galaxy with source on left edge", "any galaxy with source in bottom right", "any galaxy with source in top right", "any galaxy with source in top left", "any galaxy with source in bottom left"...
and finally: all the other galaxies that don't fit the other classes, because they're weird or corrupted.
So, the clustering got a little weird, and of course you can read more about it in the paper if you want. But I wanna finish up this thread with the last cool thing that this model does.

It makes new pictures.
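The thread doesn't spell out how here, but a common way for a model like this to make new pictures is to pick one of the clusters, draw a random point from inside it, and hand that point to the decoder. The sketch below shows only the sampling step in plain Python. The mixture values are placeholders, the function name is hypothetical, and this is an assumption about the general technique, not a description of the paper's code.

```python
import math
import random


def sample_from_mixture(weights, means, variances, rng):
    """Pick a cluster by its weight, then draw a latent point inside it."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k])]
    return k, z


rng = random.Random(42)
# Made-up 2-cluster, 2-dimensional mixture; the values are placeholders.
k, z = sample_from_mixture(
    weights=[0.7, 0.3],
    means=[[0.0, 0.0], [5.0, 5.0]],
    variances=[[1.0, 1.0], [0.25, 0.25]],
    rng=rng,
)
# A trained decoder (not shown) would turn z into an image.
```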
You can follow @DrAshleyNova.