Publishing Is Hard! A Behind-the-Scenes Look at Getting the Astro2020 Science White Papers into the Bulletin of the AAS

12 June 2019 - Peter K. G. Williams, AAS/CfA Innovation Scientist and Director of the AAS WorldWide Telescope

At the beginning of 2019, a wide cross-section of the astronomical community wrote a bevy of science white papers for the National Academies’ Astro2020 Decadal Survey on Astronomy and Astrophysics. To help support the community in this important undertaking, the AAS took on the job of organizing these white papers and formally publishing them in the open-access Bulletin of the AAS. Not so difficult — just copy all the PDFs to our webserver, right? Hardly. As a new hire working with the AAS Publishing team, I was surprised at the amount of effort that went into this seemingly innocuous process. If you’ve ever been curious about just what it is that publishers do all day, read on to get a taste!

The AAS’s involvement in the process began with delivery of the submitted white papers from the National Academies’ Astro2020 team. But, as so often happens when efforts cross organizational boundaries, even that seemingly simple act required care and attention. The spreadsheet provided to us by the Academies had 590 rows, but the official survey data dump had 583 — and those 583 rows were numbered from 7 to 635. What was up? Our NAS colleagues knew exactly what the differences were between the two listings — and answered all our questions — but whenever you’re handing information from one person to the next there’s friction. And now that I mention it, “handing information from one person to the next” is a pretty essential component of being a publisher, isn’t it?
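
For the curious, the cross-check itself is the kind of thing astronomers do all the time. Here is a minimal Python sketch of the idea; the file and column names are made up for illustration, and the real reconciliation involved more email than code:

```python
# A minimal sketch (hypothetical file and column names) of checking two
# listings of the same submissions against each other: load the ID column
# from each and report which IDs appear in only one of them.
import csv

def load_ids(path, id_column):
    """Read a CSV file and return the set of values in its ID column."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[id_column] for row in csv.DictReader(f)}

academies_ids = load_ids("academies_listing.csv", "submission_id")
dump_ids = load_ids("survey_data_dump.csv", "submission_id")

print("Only in the Academies listing:", sorted(academies_ids - dump_ids))
print("Only in the survey data dump: ", sorted(dump_ids - academies_ids))
```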

Astronomers will be familiar with these kinds of friction points — they almost inevitably crop up when you’re combining data from multiple sources or diving into a new kind of data. Indeed, you can probably guess some of the other things that we were involved in tidying up: duplicate submissions, late submissions, malformed files. When you’re the publisher, you’re the one whose job it is to sort these things out — and everyone else knows it.

Getting everyone’s documents in order was an important part of the job. Just as important but much more difficult was getting all the document metadata in order as well. Records don’t appear in the SAO/NASA Astrophysics Data System (ADS) by magic! The AAS was also responsible for gathering all the necessary meta-information about each submission and submitting it to ADS in the correct format. Once again, a lot of this process is the kind of data-wrangling with which many of us are familiar, but after spending some — maybe too much — time with the raw data, it seems to me that the act of editing is a lot more philosophically rich than I realized. Consider:

Mistakes. If someone wrote the title “M31: The Andromeda Galaxxy,” any spell-checker could tell you that there was a typo. But if someone wrote the title “M331: The Andromeda Galaxy,” the fundamental mistake is the same, but only a person with knowledge of astronomy is equipped to identify the problem. And going back to the first case, what if the title were “GALAXXY: A Code to Compute Numbers About Galaxies”? I find it fascinating that even a superficially simple task like spell-checking can ultimately require subject-matter expertise.

Mistakes? Identifying some errors is easy; identifying others can require domain knowledge. But who decides what an “error” even is? As a typography nerd, I wince whenever I see three hyphens (---) standing in for an em-dash (—), but I can tell that most of my colleagues just don’t care about such things. Meanwhile, I’m quite cavalier about some rules (I’m happy to use contractions in professional documents, obviously) that others take seriously. While it’s true that 95% of people will probably agree about 95% of edits, any kind of “fixing” that occurs — which it must — fundamentally involves imposing value judgments onto the work of others. That’s a big responsibility!

Markup. Imagine that a paper title contains the sequence of characters “$^2$”. As a person who uses TeX (maybe a little bit too much) I know what the author is trying to express. But the ADS abstract page doesn’t run TeX. To properly send titles to ADS, we need to translate all of the TeXisms that people will slip in — \alpha, \pm, $_{\rm eff}$. So not only do you need a great deal of astronomy subject-matter knowledge to check paper titles effectively, you also need to be familiar with a complex software system from the 1970s, and the current data standards used by ADS. Even more fun, people will use TeX constructs that simply can’t be expressed in ADS’s schema. In a very real sense, adapting markup is an act of translation, with all the subtleties and demands that the term implies.
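
To make that a bit more concrete, here is a toy Python sketch of the translation step. The replacement table, and the assumption that the target format accepts simple `<SUB>`-style tags, are mine for illustration only; the real mapping, and ADS’s actual ingest format, are considerably more involved.

```python
# A toy sketch of translating a few common TeXisms in titles into plain
# Unicode (or simple tags), and flagging anything left over for a human.
# The mapping and target format here are illustrative, not the real thing.
REPLACEMENTS = {
    "$^2$": "²",
    r"$\alpha$": "α",
    r"$\pm$": "±",
    r"$_{\rm eff}$": "<SUB>eff</SUB>",  # assumes the target accepts simple tags
}

def translate_title(title):
    for tex, plain in REPLACEMENTS.items():
        title = title.replace(tex, plain)
    if "$" in title or "\\" in title:
        # Whatever survives the lookup table needs human judgment.
        print(f"Needs manual review: {title!r}")
    return title

print(translate_title(r"Measuring T$_{\rm eff}$ to $\pm$100 K with H$\alpha$"))
print(translate_title(r"Constraints from $\Lambda$CDM"))
```

The leftover-dollar-sign check is the important bit: the constructs that a lookup table can’t handle are exactly the ones that need a person.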

With 573 science white papers to handle, there were a lot of exciting metadata mistakes to discover, and there was a lot of markup to translate. But by far the most demanding part of the work was one that we intentionally took upon ourselves. One of the value judgments of the Astro2020 process is that it is important to have a low barrier to entry for community contributions. This value motivated the decision to have simple metadata requirements for the white paper submission process. In particular, the white paper author lists were an optional form field, in which a certain format was suggested but submitters could paste whatever free text they had on hand. To properly index these white papers and credit people for their contributions, we knew that we needed to take this freeform input and convert it to structured information: forenames, surnames, and affiliations, all complete, correct, and in the right order. I had initially hoped to automate this process, but after a few exploratory attempts, it became clear that it was going to require the human touch, for all the same kinds of reasons I raised above.
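
To give a flavor of what “preprocess what you can, punt the rest to a human” looks like, here is a minimal sketch. It assumes a hypothetical “Forename Surname (Affiliation)” pattern; the format that was actually suggested, and the program I actually wrote, were different, and plenty of submissions followed neither.

```python
# A minimal sketch of the preprocessing idea: split a freeform author string
# on commas/semicolons, try to match a hypothetical "Forename Surname
# (Affiliation)" pattern, and set aside anything that doesn't match cleanly.
import re

AUTHOR_RE = re.compile(r"^\s*(?P<fore>.+?)\s+(?P<sur>\S+)\s*\((?P<affil>[^)]*)\)\s*$")

def parse_author_list(raw):
    structured, needs_review = [], []
    for chunk in re.split(r"[;,]", raw):
        if not chunk.strip():
            continue
        m = AUTHOR_RE.match(chunk)
        if m:
            structured.append(
                {"forename": m["fore"], "surname": m["sur"], "affiliation": m["affil"]}
            )
        else:
            needs_review.append(chunk.strip())
    return structured, needs_review

ok, review = parse_author_list(
    "Jane Q. Doe (Example University), R. Roe (Another Institute); and the Example Team"
)
print(ok)      # two structured author records
print(review)  # ['and the Example Team'] goes to a human
```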

You know what kind of documents tend to have long author lists? Astro2020 white papers. Specifically, 57,336 words of freeform authorship information!

In the end, I wrote a program to preprocess the raw data as effectively as I could, then, in a LeBron-in-the-playoffs-caliber performance, AAS Director of Publishing Julie Steffen did the bulk of the work: cross-checking the preprocessed data with the actual submitted white paper text, fixing the numerous failures, collating names and affiliations, and validating the results. Finally, I merged Julie’s work back into my database. The net result: structured data for 10,770 authors. I can guarantee you that we didn’t catch every mistake (which can be corrected — email baas@aas.org) but, frankly, I think we did a pretty darn good job. I should also mention that we were aided enormously here by ADS, which we knew we could rely on for battle-tested infrastructure to parse names, normalize affiliations, and so on.
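
Conceptually, that final merge-back step looked something like the following sketch (field names are hypothetical, and the real validation was more thorough):

```python
# A rough sketch of the merge-back step: human-corrected author lists, keyed
# by white-paper ID, replace the machine-preprocessed ones, with a little
# bookkeeping so the totals can be sanity-checked afterwards.
def merge_corrections(preprocessed, corrected):
    """Both arguments map a paper ID to a list of author records."""
    merged = dict(preprocessed)
    for paper_id, authors in corrected.items():
        if paper_id not in merged:
            raise KeyError(f"Correction for unknown paper {paper_id!r}")
        merged[paper_id] = authors
    print(f"Updated {len(corrected)} of {len(merged)} author lists")
    return merged
```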

Of course, with the Astro2020 “APC” white papers due recently, we here at AAS Publishing know that we’re going to have to go through this whole process again soon. Which is exactly what we’ve signed up for — we worry about translating markup and fixing file formats so that you can focus on the science. But we can’t do what we do without the support of the community, so the next time you easily pull up an Astro2020 white paper on ADS, please take a moment to think about everyone — scientists, funders, the National Academies, ADS, and, yes, AAS Publishing — who made that possible!