Using make and pandoc for reproducible papers

I tend to use pandoc more and more to write my papers. Markdown’s syntax is lightweight, doesn’t get in the way of my writing, and makes it easy to export the text to almost any format you can name afterward. The only issue I ran into is that I also like pgfplots a lot for graphics, because the figures update themselves at each compilation, whenever the data are regenerated. That’s quite important if you want to tweak some simulation parameters, for instance. Using pgfplots from within pandoc can be done, but the heavy syntax seems rather clumsy.

So I started looking at what gnuplot can do, and as it turns out, gnuplot can do a lot. Look at the demo gallery or the excellent gnuplotting if you want to see more. So anyway, I started looking for a way to make the whole process of generating data, turning them into figures, and then keeping the paper updated as seamless as it should be (because awesome technology should make our lives awesomely simple). Then I discovered makefiles, and I think there’s no going back from that. The extremely impressive Mike Bostock (if you are into data visualization, check his stuff out) gives good arguments about why you should use them. I spent a few minutes setting up an example, and I’m deeply in love with the concept.

Here it goes. Assume that you have a file for your paper, called paper.md. You need to get it into paper.pdf. This paper.md file needs a fig1.png figure to compile. fig1.png is produced by fig1.plt, a gnuplot script, which reads data.txt, itself generated by script.py.

A pre-make workflow would look like this:

python script.py > data.txt
gnuplot fig1.plt
pandoc paper.md -o paper.pdf

Incidentally, I used this type of file, usually called some variation of compile.sh, all the time. But as I discovered, you can make the whole process painless and time-saving, since you compile only the parts that need it. How, you ask? Using a makefile. For this simple example, the Makefile looks like this:

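# Default goal: plain "make" or "make all" builds the paper.
# Each command line below must start with a tab, not spaces.
all: paper.pdf

paper.pdf: paper.md fig1.png script.py
	pandoc paper.md -o paper.pdf

fig1.png: data.txt fig1.plt
	gnuplot fig1.plt

data.txt: script.py
	python script.py > data.txt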

All Makefile rules look the same: the first line names a target, followed by a colon and its dependencies. The lines below it, each indented with a tab, are the commands that build the target. Whenever a dependency is itself a target, make first checks that target’s own dependencies, and rebuilds it if any of them is newer than it. OK, that’s not clear at all, so take an example. If you change script.py, then the next time you build fig1.png, make sees that data.txt is older than script.py and regenerates it, then redraws fig1.png because data.txt is now newer than the figure. But see it the other way around. If you just modified the text of your paper.md file, then there is no reason to rebuild any of the figures or datasets, and make won’t. See? That’s the greatness of make.
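To make the rebuild rule concrete, here is a minimal Python sketch of the timestamp test described above. It is only an illustration, not how make is implemented: needs_rebuild and touch are hypothetical helpers, the timestamps are made up, and real make also walks down into the dependencies’ own rules before comparing times.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Mimic make's test: rebuild if the target is missing or older than a dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def touch(path, mtime):
    """Create the file if needed and set its modification time (hypothetical helper)."""
    with open(path, "a"):
        pass
    os.utime(path, (mtime, mtime))


def demo():
    """Replay the script.py / data.txt example with made-up timestamps."""
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "script.py")
        data = os.path.join(workdir, "data.txt")
        touch(script, 1000)
        missing = needs_rebuild(data, [script])  # data.txt does not exist yet
        touch(data, 2000)
        fresh = needs_rebuild(data, [script])    # data.txt is newer than script.py
        touch(script, 3000)
        stale = needs_rebuild(data, [script])    # script.py was edited after data.txt
        return missing, fresh, stale


if __name__ == "__main__":
    print(demo())
```

The three values correspond to a missing data.txt, an up-to-date data.txt, and a data.txt made stale by editing script.py.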

The all rule here just has paper.pdf as a dependency. So if I type make all, then it will go to the line starting with paper.pdf, and work its way from there (plain make does the same, since all is the first target in the file). If I just need to see the figure to show it to a colleague, then I have no need for the pdf, and a simple make fig1.png will do the job.
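As a rough illustration of how the goal you ask for limits what make looks at, the sketch below copies the rules of the Makefile above into a Python dictionary and lists the targets reachable from a goal, dependencies first. RULES and targets_for are hypothetical names, the sketch assumes the rules contain no cycles, and it ignores timestamps entirely, so it shows which targets are considered, not which ones are actually rebuilt.

```python
# Rules copied from the Makefile above: target -> list of dependencies.
RULES = {
    "all": ["paper.pdf"],
    "paper.pdf": ["paper.md", "fig1.png", "script.py"],
    "fig1.png": ["data.txt", "fig1.plt"],
    "data.txt": ["script.py"],
}


def targets_for(goal, rules=RULES):
    """Return the targets considered when building goal, dependencies first."""
    order = []

    def visit(name):
        if name not in rules or name in order:
            return  # a plain source file, or a target already scheduled
        for dep in rules[name]:
            visit(dep)
        order.append(name)

    visit(goal)
    return order


if __name__ == "__main__":
    print(targets_for("fig1.png"))  # the figure only
    print(targets_for("all"))       # the whole paper
```

Asking for fig1.png never touches the paper.pdf rule, which is why make fig1.png is enough when all you want is the figure.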

TL;DR The command line is so great. make is awesome. Being a nerd saves energy.
