On the Possibility of Service Oriented Astronomy

There’s an idea in the technical world called Service Oriented Architecture. It’s pretty widely recognized as a good way of building big, complex systems reliably and, crucially, with reusable components. Famously, Amazon’s Jeff Bezos realised the value of this approach, and as a consequence something like 100 separate services are used to render a typical product page on Amazon.com.

Wikipedia defines SOA as:

A service-oriented architecture (SOA) is an architectural pattern in computer software design in which application components provide services to other components via a communications protocol, typically over a network. The principles of service-orientation are independent of any vendor, product or technology.

Over the past few months I’ve been noodling on the idea of Service Oriented Astronomy. My working definition is something like this:

Service oriented astronomy (SOA) is an approach whereby researchers develop novel methods for common (or specialist) data analysis tasks that are shared as hosted services with their peers.

In this post I’m going to try to convince you that SOA (of the Astronomy kind) might be a good idea.

» Why LSST is exciting (and a future headache)

In 2020, following more than two decades of planning and construction, the Large Synoptic Survey Telescope (LSST) will see first light. If all goes to plan, shortly afterwards it will enter survey mode and begin producing vast quantities of data¹.

Most of the headlines about LSST are focussed on the data products: a continual ‘Level 1’ alert stream of things (transients) that have changed since that area of the sky was last observed, and an annual ‘Level 2’ data release of calibrated images. With at least three of the defining ‘Vs’ (Volume, Velocity and Variety), LSST is generally considered a ‘Big Data’ project². For me though, the most exciting part of LSST isn’t that it’s big data; it’s the combination of the following factors:

It’s live
Doing things live is hard. Especially if you want to be responsive (and timely) to potentially interesting events.

It’s noisy
A large number of the events being produced are likely to be of little interest to any single individual/research group.

It’s big enough that most of us are woefully unprepared to deal with the data volume
LSST isn’t Facebook/Google ‘big’ but it’s certainly big enough to present difficulties, given that the vast majority of us have never received any formal training in software development. Building fast, scalable and reliable tools for processing LSST datasets is not something we’re likely to find easy.

It’s open
Open is a pre-requisite for large-scale innovation. If everyone³ has access to LSST data there’s an opportunity for the creation of secondary data products and a marketplace for their consumption.

» Science that requires services

Although each Level 1 alert itself is going to be pretty small (some positional information, basic photometry and a small thumbnail image), there’s going to be lots of them (Volume), coming at a very quick rate⁴ (Velocity) and given that they could be low flying rocks, variable stars, supernovae or signals of currently unknown astrophysics, they’re certainly going to be Varied.

So while ingesting Level 1 data might be OK, doing something intelligent, i.e. making decisions on the fly, is going to be much much harder. To add a further complication, many of the research efforts interested in consuming the alert stream are time-sensitive.

If, for example, we’re interested in capturing high-resolution spectra of distant Type Ia supernovae then we have the additional compounding factors of being very time-sensitive and highly intolerant of false positives (because followup is expensive).

Supernova science is just one of many use cases where a large volume of alerts needs to be turned into a much smaller stream of high-quality candidates for (timely) followup observation⁵.
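
To make that concrete, here is a minimal sketch of turning a big stream into a small one. Everything in it is an assumption for illustration: the `Alert` fields, the `select_candidates` helper and especially `toy_score`, which is a placeholder and nothing like a real supernova classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Alert:
    """Hypothetical, heavily simplified stand-in for a Level 1 alert."""
    alert_id: int
    ra_deg: float
    dec_deg: float
    magnitude: float


def select_candidates(
    alerts: Iterable[Alert],
    score: Callable[[Alert], float],
    threshold: float = 0.9,
) -> Iterator[Alert]:
    """Lazily yield only the alerts whose score meets the threshold.

    A high threshold accepts more missed events in exchange for fewer
    false positives, which matters when followup is expensive.
    """
    for alert in alerts:
        if score(alert) >= threshold:
            yield alert


def toy_score(alert: Alert) -> float:
    """Placeholder scorer (NOT a real classifier): favours fainter sources."""
    return min(max((alert.magnitude - 15.0) / 10.0, 0.0), 1.0)


stream = [
    Alert(1, 10.0, -30.0, 16.0),
    Alert(2, 10.1, -30.2, 24.5),
    Alert(3, 10.2, -30.4, 20.0),
]
candidates = list(select_candidates(stream, toy_score, threshold=0.9))
print([a.alert_id for a in candidates])  # prints [2]
```

Because `select_candidates` is a generator, it never needs the whole night's alerts in memory. The hard part, of course, is everything hidden inside `score`.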

» Modern research: a combination of many steps

Continuing to use Type Ia supernovae discovery as a good use case, what does a typical research and discovery process look like⁶?

Producing the alert stream (source of candidates - in our case LSST)
Consuming the alert stream and doing a first level analysis/classification
Scheduling followup observation
Long-term followup/photometry
Data-reduction/modelling/analysis
Sharing your work (writing papers/generating figures/publishing data/methods)

Individually, none of these steps present an insurmountable problem: it’s the combination of a noisy (and high-volume) alert stream that requires a sophisticated (fast & accurate) first-level classification followed by an expensive followup observation that means we’ve got problems.

» Composable analysis workflows or “what can open source and SOA teach us?”

Open source is modular: your software is the value-add on top of a rich ecosystem of reusable components. In many ways, Service Oriented Architecture is an implementation of the same ideas: let’s build the thing we need once and then have everyone use the reference (best?) version. Of course, it’s possible that people might choose to re-implement a component, but in open source that’s usually out of preference rather than necessity.

Across astronomy, there are a small number of teams combining a deep theoretical grounding in novel methods with high-quality software implementations that solve real astrophysical problems. Hell, some of them even have Bay Area startups.

That means that there are people who are getting really, really good at solving some of the very hard problems that LSST is going to present us with. Why wouldn’t you want to incorporate their knowledge into your analysis?

What if there was a way to easily plug the premier Type Ia detection algorithm into your research and subscribe to that alert stream rather than the raw LSST Level 1 alerts?

This is what I think Service Oriented Astronomy could be.
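
As a rough sketch of what "plugging in" might look like, the snippet below chains workflow stages together so that the classification step belongs to someone else and only the last step is mine. All names here (`compose`, `community_snia_service`, `my_followup_scheduler`, the `p_snia` field) are made up for illustration, and the "service" is a local stand-in rather than a real network call.

```python
from typing import Callable, Iterable, Iterator

# An alert is just a dict here; a stage maps one stream of alerts to another.
Alert = dict
Stage = Callable[[Iterable[Alert]], Iterator[Alert]]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single workflow."""
    def workflow(alerts: Iterable[Alert]) -> Iterator[Alert]:
        stream: Iterable[Alert] = alerts
        for stage in stages:
            stream = stage(stream)
        return iter(stream)
    return workflow


def community_snia_service(alerts: Iterable[Alert]) -> Iterator[Alert]:
    """Stand-in for someone else's hosted Type Ia classifier.

    A real version would call a remote service; this one reads a
    precomputed, made-up 'p_snia' field so the example runs offline.
    """
    for alert in alerts:
        if alert["p_snia"] >= 0.9:
            yield alert


def my_followup_scheduler(alerts: Iterable[Alert]) -> Iterator[Alert]:
    """My own value-add: tag the surviving candidates for followup."""
    for alert in alerts:
        yield {**alert, "followup": "spectroscopy"}


workflow = compose(community_snia_service, my_followup_scheduler)
raw = [{"id": 1, "p_snia": 0.2}, {"id": 2, "p_snia": 0.97}]
result = list(workflow(raw))
print(result)
```

The point of the design is that swapping in a better classifier means changing one argument to `compose`, not rewriting the analysis.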

» Recognising good candidates for SOA

Not everything needs to be a service. Here are some ways to identify potentially good candidates:

Where implementing a (good) solution is hard (either because of the methods or technologies).
Where your peers will easily recognise the value of your service.
Where it’s straightforward to define the inputs and outputs. For example, calculating precision radial velocity measurements for exoplanet detection.
Where the underlying implementation is undesirable (for example a large chunk of old IDL code that people might not want to run - I’m looking at you Solar Physics community).
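
To illustrate what "straightforward inputs and outputs" could mean, here is a toy request/response contract. It is only a sketch: the field names and the `handle` function are hypothetical, and the calculation is the textbook non-relativistic Doppler shift for a single spectral line, which is nowhere near a real precision radial velocity pipeline.

```python
import json
from dataclasses import asdict, dataclass

SPEED_OF_LIGHT_KM_S = 299_792.458


@dataclass(frozen=True)
class RVRequest:
    """Hypothetical service input: one spectral line, rest and observed."""
    rest_wavelength_nm: float
    observed_wavelength_nm: float


@dataclass(frozen=True)
class RVResponse:
    """Hypothetical service output."""
    radial_velocity_km_s: float


def handle(request_json: str) -> str:
    """Parse a JSON request, compute the velocity, return a JSON response."""
    # Unknown or missing fields raise TypeError: a crude schema check.
    request = RVRequest(**json.loads(request_json))
    if request.rest_wavelength_nm <= 0:
        raise ValueError("rest_wavelength_nm must be positive")
    shift = request.observed_wavelength_nm - request.rest_wavelength_nm
    # Non-relativistic Doppler formula: v = c * (delta lambda / lambda_rest)
    rv = SPEED_OF_LIGHT_KM_S * shift / request.rest_wavelength_nm
    return json.dumps(asdict(RVResponse(radial_velocity_km_s=rv)))


reply = handle('{"rest_wavelength_nm": 656.28, "observed_wavelength_nm": 656.30}')
print(reply)
```

If the whole contract fits in two small dataclasses, consumers don't need to know (or care) what the implementation behind it looks like.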

» Things we probably need to fix to make this a reality

It would also be disingenuous of me not to talk about some significant barriers to making this happen. Namely, even if someone built an incredible piece of software and ran it as a service that a large fraction of the community used, they’d not be guaranteed a good career in academia because of our community’s paper-citation obsession. So, some things to address:

Services typically require support. Who would provide this?
If a whole bunch of people become reliant on SOA then we’re going to have to be really sure the implementation is good. Code/data or it didn’t happen.
We’re going to need to work out how to credit people for running these services⁷.

» Service Oriented Science

Of course, none of this discussion is unique to astronomy. LSST just feels like a good opportunity (through necessity) to drive significant change in the way we do our research. The same could be said for many other areas of science where there is a large source of open data. Perhaps we should just be calling this Service Oriented Science (SOS).

Service Oriented Architecture (and its close relative Microservices) has dramatically changed the way that companies deliver software and data services. How long before we might say the same for Service Oriented Science?

1. Exactly how much still remains to be seen.
2. Some folks disagree with this statement. Some people also seem to have more Vs for you.
3. Those researchers not in the US or a partner country presumably just need to find themselves a suitably geographically-located collaborator.
4. Nobody seems to know exactly how many, but there are likely to be upwards of tens, and perhaps even hundreds, of millions per night.
5. Of course we could just build a giant army of robotic telescopes to observe a larger fraction of the candidates, but this seems like a losing battle.
6. Full disclosure: I know literally nothing about finding supernovae which is somewhat ironic given this
7. I’d love your thoughts on some related efforts I’ve been working on here