Welcome!

I am a physical oceanographer interested in how ocean water is mixed and transformed. I am currently working as a Research Scientist at the Bedford Institute of Oceanography in Halifax, Nova Scotia.

Recent Posts:

Bootstrapping uncertainties with the boot package

People often ask me what I like about R compared to other popular numerical analysis software commonly used in the oceanographic sciences (coughMatlabcough). Usually the first thing I say is the package system (including the strict rules for package design and documentation), and how easy it is to take advantage of work that others have contributed in a consistent and reproducible way. The second is usually how well-integrated statistics and statistical methods are throughout the language. R is fundamentally a data analysis language (by design), something that I’m often reminded of when I’m doing statistical analysis or model fitting.

Fitting models, with uncertainties!

Recently I found myself needing to estimate both the slope and x-intercept for a linear regression. Turns out it’s pretty easy for the slope, since it’s one of the directly estimated parameters from the regression (using the lm() function), but it wasn’t as clear how to get the uncertainty for the x-intercept. In steps the boot package, which provides a nice interface for doing bootstrap estimation in R. I won’t get into the fundamentals of what bootstrapping involves here (the Wikipedia article on bootstrapping is a great start).

Ok, first, a toy example (which isn’t all that different from my real research problem). We make some data following a linear relationship (with noise):

library(boot)
x <- 0:20
set.seed(123) # for reproducibility
y <- x + rnorm(x)
plot(x, y)

plot of chunk unnamed-chunk-1

We can easily fit a linear model to this using the lm() function, and display the results with the summary() function:

m <- lm(y ~ x)
plot(x, y)
abline(m)

plot of chunk unnamed-chunk-2

summary(m)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8437 -0.5803 -0.1321  0.5912  1.8507 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.37967    0.41791   0.909    0.375    
## x            0.97044    0.03575  27.147   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.992 on 19 degrees of freedom
## Multiple R-squared:  0.9749,	Adjusted R-squared:  0.9735 
## F-statistic:   737 on 1 and 19 DF,  p-value: < 2.2e-16

I love lm(). It’s so easy to use, and there’s so much information attached to the result that it’s hard not to feel like you’re a real statistician (even if you’re an imposter like me). Check out the neat little table, showing the estimates of the y-intercept and slope, along with their standard errors, t values, and p values.
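(As a quick aside: if all you need are confidence intervals on those directly-estimated coefficients, base R’s confint() will compute them straight from the fit, no bootstrapping required:)

confint(m)               # 95% confidence intervals for (Intercept) and x
confint(m, level=0.99)   # or any other level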

So how to get the error on the x-intercept? Well, one way might be to propagate the slope and y-intercept uncertainties through a rearrangement of the equation, but for anything even a little complicated this would be a pain. Let’s do it instead with the boot package.
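Before abandoning that idea entirely, here is roughly what the propagation would look like in this simple case. It’s only a sketch, using the first-order (“delta method”) approximation with the coefficient covariance matrix from vcov(), and is included purely for comparison with the bootstrap result below:

## delta-method propagation for xint = -a/b (sketch only)
a <- coef(m)[[1]]                 # y-intercept
b <- coef(m)[[2]]                 # slope
V <- vcov(m)                      # covariance matrix of the coefficients
g <- c(-1/b, a/b^2)               # gradient of -a/b with respect to (a, b)
xint_se <- sqrt(as.numeric(t(g) %*% V %*% g))
xint_se

Fine for a toy problem, but the bootstrap approach generalizes to any statistic without any calculus.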

We need to create a function that takes the data as its first argument, with the second argument being an index vector that the boot() function uses to resample the data. Let’s demonstrate first by writing a function to calculate the slope, and see how the bootstrapped statistics compare with what comes straight from lm():

slope <- function(d, ind) {
    m <- lm(y ~ x, data=d[ind,])
    coef(m)[[2]]
}
slope_bs <- boot(data.frame(x, y), slope, 999)
slope_bs
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data.frame(x, y), statistic = slope, R = 999)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.9704362 -0.0009770872   0.0391621

The bootstrap estimate decided (using 999 resampled replicates) that the value of the slope should be 0.9704362, while that straight from the linear regression gave 0.9704362 (i.e. exactly the same!). Interestingly the standard error is slightly higher than from lm(). My guess is that it would get closer to the real value with more replicates.
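That guess is easy enough to check, since the number of replicates is just the third argument to boot() (the resampling is random, so your numbers will differ slightly from run to run):

## same statistic, ten times the replicates
slope_bs2 <- boot(data.frame(x, y), slope, 9999)
slope_bs2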

Ok, now to do it for the x-intercept, we just supply a new function:

xint <- function(d, ind) {
    m <- lm(y ~ x, data=d[ind,])
    -coef(m)[[1]]/coef(m)[[2]] # xint = -a/b
}
xint_bs <- boot(data.frame(x, y), xint, 999)
xint_bs
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data.frame(x, y), statistic = xint, R = 999)
## 
## 
## Bootstrap Statistics :
##       original      bias    std. error
## t1* -0.3912359 -0.02204408    0.412457

So the bootstrap estimate of the x-intercept is -0.3912359 ± 0.412457 (value ± bootstrapped standard error).
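If a proper confidence interval is wanted rather than just a standard error, the boot.ci() function works directly on the returned boot object, for example:

## confidence intervals for the x-intercept from the bootstrap replicates
## (the percentile and bias-corrected/accelerated flavours)
boot.ci(xint_bs, type=c("perc", "bca"))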

Predicting tides in R

This entry is a re-post of a blog written by Marcus Beck. It was such a great summary of the tidal analysis capabilities built in to the oce package that I thought it would make a nice addition to the (growing) library of posts here. The original post can be found here, but I’ve reproduced the Rmarkdown in its entirety with Marcus’ permission (with a few minor format tweaks).

Overview

Skip to TL/DR….

Water movement in estuaries is affected by many processes acting across space and time. Tidal exchange with the ocean is an important hydrodynamic process that can define several characteristics of an estuary. Physical flushing rates and water circulation are often controlled by tidal advection, whereas chemical and biological components are affected by the flux of dissolved or particulate components with changes in the tide. As such, describing patterns of tidal variation is a common objective of coastal researchers and environmental managers.

Tidal predictions are nothing new. A clever analog approach has been around since the late 1800s. The tide-predicting machine represents the tide as the summation of waves with different periods and amplitudes. Think of a continuous line plot where the repeating pattern is linked to a rotating circle. Representing the line in two dimensions from the rotating circle creates a sine wave with an amplitude equal to the radius of the circle. A more complex plot can be created by adding the output of two or more rotating disks, where each disk varies in radius and rate of rotation. The tide-predicting machine is nothing more than a set of rotating disks linked to a single graph as the sum of the rotations from all disks. Here’s a fantastic digital representation of the tide-predicting machine:

tidemachine

Tides are caused primarily by the gravitational pull of the sun and moon on the earth’s surface. The elliptical orbits of both the moon around the earth and the earth around the sun produce periodic but unequal forces that influence water movement. These forces combined with local surface topography and large-scale circulation patterns from uneven heating of the earth’s surface lead to the variation of tidal patterns across the globe. Although complex, these periodic patterns can be characterized as the summation of sine waves, where one wave represents the effect of a single physical process (e.g., diurnal pull of the moon). Describing these forces was the objective of the earlier tide-predicting machines. Fortunately for us, modern software (i.e., R) provides us with a simpler and less expensive approach based on harmonic regression.

Approach

We’ll create and sum our own sine waves to demonstrate complexity from addition. All sine waves follow the general form for y as a function of x:

y = α + β sin(2πx/f + φ)

where the amplitude of the wave is β and the period is f (so the frequency is 1/f). The parameters α and φ represent scalar shifts in the curve up/down and left/right, respectively. We can easily create a function in R to simulate sine waves with different characteristics. This function takes the parameters from the above equation as arguments and returns a sine wave (y) equal in length to the input time series (the time_in argument). The α and β are interpreted in units of wave height (e.g., meters), and f and φ are in hours.

# function for creating a sine wave
waves <- function(time_in, alpha = 0, beta = 1, freq = 24, phi = 0){

  # time steps per hour
  time_step <- 60 / unique(diff(time_in))

  # set phi as difference in hours from start of time_in
  phi <- min(time_in) + phi * 3600
  phi <- as.numeric(difftime(phi, min(time_in)))
  phi <- phi / time_step

  # get input values to sin func
  in_vals <- seq(0, length(time_in), length = length(time_in))
  in_vals <- in_vals / time_step
  in_vals <- 2 * pi * in_vals * 1 / freq

  # wave
  y <- alpha + beta * sin(in_vals + phi)

  return(y)

}

The default arguments will return a sine wave with an amplitude of one meter and frequency of one wave per 24 hours. Two additional time series are created that vary these two parameters.

# input time series for two weeks, 15 minute time step
x <- as.POSIXct(c('2017-04-01', '2017-04-15'))
x <- seq(x[1], x[2], by = 60 * 15)

# get three sine waves
# a: default
# b: amplitude 0.5, 48 hour period
# c: amplitude 2, 12 hour period
a <- waves(x)
b <- waves(x, beta = 0.5, f = 48)
c <- waves(x, beta = 2, f = 12)

We can combine all three waves in the same data object, take the summation, and plot to see how it looks.

# for data munging and plotting
library(tidyverse)

# get sum of all y values, combine to single object
yall <- rowSums(cbind(a, b, c))
dat <- data.frame(x, a, b, c, yall) %>% 
  gather('var', 'val', -x)

# plot
ggplot(dat, aes(x = x, y = val)) + 
  geom_line() + 
  facet_wrap(~var, ncol = 1) + 
  theme_bw()

plot of chunk unnamed-chunk-3

The important piece of information we get from the plot is that adding simple sine waves can create complex patterns. As a general rule, about 83% of the variation in tides is created by seven different harmonic components that, when combined, lead to the complex patterns we observe from monitoring data. These components are described as being of lunar or solar origin, with periods recurring either once or twice daily. For example, the so-called ‘M2’ component is typically the dominant tidal wave caused by the moon, twice daily. The periods of the tidal components are constant across locations, but the relative strengths (amplitudes) vary considerably.
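The original post lists these main constituents in a table; as a rough stand-in, here are the seven used below along with their standard (approximate) periods:

# the seven constituents used below, with approximate periods in hours
data.frame(constituent = c('M2', 'S2', 'N2', 'K2', 'K1', 'O1', 'P1'),
           origin      = c('lunar', 'solar', 'lunar', 'lunisolar',
                           'lunisolar', 'lunar', 'solar'),
           period_hr   = c(12.42, 12.00, 12.66, 11.97, 23.93, 25.82, 24.07))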

The oce package in R has a nifty function for predicting up to 69 different tidal constituents. You’ll typically only care about the main components above, but it’s useful to appreciate the variety of components included in a tidal signal. We’ll apply the tidem function from oce to predict the tidal components on a subset of SWMP data, using a year of observations (2013) from the Apalachicola Bay Dry Bar station.

library(SWMPr)
library(oce)

# clean, one hour time step, subset, fill gaps
dat <- qaqc(apadbwq) %>% 
  setstep(timestep = 60) %>% 
  subset(subset = c('2013-01-01 0:0', '2013-12-31 0:0'), select = 'depth') %>% 
  na.approx(maxgap = 1e6)

The tidem function from oce requires a ‘sealevel’ object as input. Plotting the sealevel object using the plot method from oce shows three panels: the first is the complete time series, the second is the first month in the record, and the third is a spectral decomposition of the tidal components in cycles per hour (cph, the inverse of period).

datsl <- as.sealevel(elevation = dat$depth, time = dat$datetimestamp)
plot(datsl)

plot of chunk unnamed-chunk-5

We can create a model to estimate the components from the table above using tidem. Here, we estimate each component separately to extract predictions for each, which we then sum to estimate the complete time series.

# tidal components to estimate
constituents <- c('M2', 'S2', 'N2', 'K2', 'K1', 'O1', 'P1')

# loop through tidal components, predict each with tidem
preds <- sapply(constituents, function(x){
  
    mod <- tidem(t = datsl, constituent = x)
    pred <- predict(mod)
    pred - mean(pred)
    
  }) 

# combine prediction, sum, add time data
predall <- rowSums(preds) + mean(datsl[['elevation']])
preds <- data.frame(time = datsl[['time']], preds, Estimated = predall) 

head(preds)
##                  time           M2           S2          N2
## 1 2013-01-01 00:00:00 -0.111578526 -0.020833606 0.000215982
## 2 2013-01-01 01:00:00 -0.118544835 -0.008940681 0.006428260
## 3 2013-01-01 02:00:00 -0.095806627  0.005348532 0.011088593
## 4 2013-01-01 03:00:00 -0.049059634  0.018205248 0.013072149
## 5 2013-01-01 04:00:00  0.009986414  0.026184523 0.011900172
## 6 2013-01-01 05:00:00  0.066540974  0.027148314 0.007855534
##              K2            K1         O1            P1 Estimated
## 1 -0.0048417234  0.0911501572 0.01312209  0.0381700294  1.463683
## 2 -0.0093752262  0.0646689921 0.03909021  0.0340807303  1.465686
## 3 -0.0113830570  0.0337560517 0.06274939  0.0276811946  1.491713
## 4 -0.0103243372  0.0005294868 0.08270543  0.0194051690  1.532812
## 5 -0.0064842694 -0.0327340223 0.09778235  0.0098135843  1.574727
## 6 -0.0008973087 -0.0637552642 0.10709170 -0.0004434629  1.601819

Plotting a month of the estimated data shows the results. Note the variation in amplitude between the components. The M2, K1, and O1 components are the largest at this location. Also note the clear spring/neap variation in range every two weeks for the combined time series. This complex fortnightly variation is caused simply by adding the separate sine waves.

# prep for plot
toplo <- preds %>% 
  gather('component', 'estimate', -time) %>% 
  mutate(component = factor(component, levels = c('Estimated', constituents)))

# plot one month
ggplot(toplo, aes(x = time, y = estimate, group = component)) + 
  geom_line() + 
  scale_x_datetime(limits = as.POSIXct(c('2013-07-01', '2013-07-31'))) + 
  facet_wrap(~component, ncol = 1, scales = 'free_y') + 
  theme_bw() 

plot of chunk unnamed-chunk-7

All tidal components can of course be estimated together. By default, the tidem function estimates all 69 tidal components. Looking at our components of interest shows the same estimated amplitudes in the plot above.

# estimate all components together
mod <- tidem(t = datsl)

# get components of interest
amps <- data.frame(mod@data[c('name', 'amplitude')]) %>% 
  filter(name %in% constituents) %>% 
  arrange(amplitude)
amps
##   name  amplitude
## 1   K2 0.01091190
## 2   N2 0.01342395
## 3   S2 0.02904518
## 4   P1 0.04100388
## 5   O1 0.11142455
## 6   M2 0.12005114
## 7   K1 0.12865764

And of course comparing the model predictions with the observed data is always a good idea.

# add predictions to observed data
dat$Estimated <- predict(mod)

# plot one month
ggplot(dat, aes(x = datetimestamp, y = depth)) + 
  geom_point() + 
  geom_line(aes(y = Estimated), colour = 'blue') + 
  scale_x_datetime(limits = as.POSIXct(c('2013-07-01', '2013-07-31'))) + 
  scale_y_continuous(limits = c(0.9, 2)) +
  theme_bw() 

plot of chunk unnamed-chunk-9

The fit is not perfect, but this could be for several reasons, none of which are directly related to the method: instrument drift, fouling, water movement from non-tidal sources, etc. The real value of the model is that we can use it to fill missing observations in tidal time series or to predict future observations. We also get reasonable estimates of the main tidal components, i.e., which physical forces are really driving the tide and how large their contributions are. For example, our data from Apalachicola Bay showed that the tide is driven primarily by the M2, K1, and O1 components, each with a relative amplitude of about 0.1 meter. This is consistent with general patterns of micro-tidal systems in the Gulf of Mexico. Comparing tidal components in other geographic locations would produce very different results, both in the estimated amplitudes and the dominant components.
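As a sketch of that last point: predict() for tidem objects accepts a newdata argument of times, so filling a gap or forecasting ahead should amount to something like the following (the dates here are arbitrary, and I haven’t run this against the SWMP record):

# predict the tide at new (here, future) times
newtimes <- seq(as.POSIXct('2014-01-01'), as.POSIXct('2014-02-01'), by=3600)
future <- predict(mod, newdata=newtimes)
plot(newtimes, future, type='l')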

TL/DR

Here’s how to estimate the tide from an observed time series. The data are from SWMPr and the tidem model is from oce.

library(SWMPr)
library(oce)

# clean input data, one hour time step, subset, fill gaps
dat <- qaqc(apadbwq) %>% 
  setstep(timestep = 60) %>% 
  subset(., subset = c('2013-01-01 0:0', '2013-12-31 0:0'), select = 'depth') %>% 
  na.approx(maxgap = 1e6)

# get model
datsl <- as.sealevel(elevation = dat$depth, time = dat$datetimestamp)
mod <- tidem(t = datsl)

# add predictions to observed data
dat$Estimated <- predict(mod)

# plot
ggplot(dat, aes(x = datetimestamp, y = Estimated)) + 
  geom_line() +
  theme_bw() 

plot of chunk unnamed-chunk-10

Adding NOAA bottom profile to section plots

I use the section-class plotting method in the oce package a lot. It’s one of the examples I really like showing to new oceanographic users of R and oce, because it demonstrates the power of making quick plots from potentially very complicated data sets. A canonical example uses the built-in data(section) dataset:

library(oce)
data(section)
plot(section, which='temperature')

plot of chunk example

Note the grey bottom profile that is automatically overlaid on the plot – the values come from the waterDepth metadata item in each of the stations in the section object. The values can be extracted to a vector with our trusty friend lapply:

depth <- unlist(lapply(section[['station']], function(x) x[['waterDepth']]))
distance <- unique(section[['distance']])
plot(distance, depth, type='l')

plot of chunk depth

However, many CTD datasets don’t automatically include the water depth at the station, and even if they do the large spacing between stations may make the bottom look clunky.

Using the marmap package to add a high res bottom profile

To add a nicer looking profile to the bottom, we can take advantage of the awesome marmap package, which can download bathymetric data from NOAA.

To add a nice looking bottom profile to our section plot, we can use the getNOAA.bathy() and get.depth() functions. Note the resolution=1 argument, which downloads the highest resolution data available from NOAA (1 minute resolution), and the keep=TRUE argument, which saves a local copy of the data to prevent re-downloading every time the script is re-run (note that at 1 minute resolution the csv file obtained below is 29 MB):

library(marmap)
lon <- section[["longitude", "byStation"]]
lat <- section[["latitude", "byStation"]]
lon1 <- min(lon) - 0.5
lon2 <- max(lon) + 0.5
lat1 <- min(lat) - 0.5
lat2 <- max(lat) + 0.5

## get the bathy matrix -- 29 MB file
b <- getNOAA.bathy(lon1, lon2, lat1, lat2, resolution=1, keep=TRUE)
## File already exists ; loading 'marmap_coord_-74.1727;35.703;-8.0263;38.7373_res_1.csv'
plot(section, which="temperature", showBottom=FALSE)
loni <- seq(min(lon), max(lon), 0.1)
lati <- approx(lon, lat, loni, rule=2)$y
dist <- rev(geodDist(loni, lati, alongPath=TRUE))
bottom <- get.depth(b, x=loni, y=lati, locator=FALSE)
polygon(c(dist, min(dist), max(dist)), c(-bottom$depth, 10000, 10000), col='grey')

plot of chunk marmap-example

Nice!

Downloading and plotting MODIS Chlorophyll-a data

I was recently asked for help with a project that involves correlating occurrences of marine animals found at the surface with satellite measurements of surface chlorophyll. Being a physical oceanographer, I’m not too familiar with the details of such data sets (though I did previously help someone read in MODIS netCDF with the oce package), but I saw it as a nice opportunity not only to learn a bit about a different data set, but also to gain some new data processing and plotting skills.

MODIS Chla data

The MODIS chlorophyll data are provided by NASA through the OceanColor WEB site, which provides various manual ways of downloading binary files (e.g. HDF and netCDF). For the present application, which potentially required roughly 400 images, this wasn’t a very appealing option.

A quick google search turned up two very relevant (and fantastic looking!) packages, the spnc package and the obpgcrawler package (both authored by Ben Tupper from the Bigelow Laboratory). spnc provides some simplified methods for dealing with “spatial” datasets and netCDF files, and the obpgcrawler provides an interface for programmatically downloading various datasets from the NASA Ocean Biology Processing Group (including MODIS!).

Installing spnc and obpgcrawler

As the packages are not on CRAN (yet?), they have to be installed using the devtools package (and of course all its dependencies, etc). The obpgcrawler package also depends on another non-CRAN package, called threddscrawler, which must be installed first. Note that due to an issue with OpenDAP support for the ncdf4 package on Windows, the below only works on Linux or OS X.

To install the packages, do:

library(devtools)
# if you don't have threddscrawler installed
install_github("BigelowLab/threddscrawler")
install_github("BigelowLab/obpgcrawler")
install_github("BigelowLab/spnc")

There are some great examples provided on the GitHub pages, which I built on to accomplish what I needed. The example below is pulled straight from the obpgcrawler page; it downloads a subset of the most recent MODIS data and plots it as a “raster” image (more on that later).

library(obpgcrawler)
library(spnc)
library(raster)
## Loading required package: sp
query <- obpg_query(top = 'https://oceandata.sci.gsfc.nasa.gov/opendap/catalog.xml',
   platform = 'MODISA', 
   product = 'L3SMI',
   what = 'most_recent',
   greplargs = list(pattern='8D_CHL_chlor_a_4km', fixed = TRUE))
q <- query[[1]]
chl <- SPNC(q$url)
bb <- c(xmin = -77, xmax = -63, ymin = 35, ymax = 46)
r <- chl$get_raster(what = 'chlor_a', bb = bb)
spplot(log10(r), main=paste('MODIS Chla for', format(chl$TIME, '%Y-%m-%d')))

plot of chunk example

The animal data

The animal data consists of a data frame containing: a date of observation, a longitude, and a latitude. To mimic the data set, I’ll just create a single random point and time in the North Atlantic:

library(latticeExtra) # for the `layer()` function
## Loading required package: lattice
## Loading required package: RColorBrewer
date <- as.Date('2017-02-25')
lat <- 43.783179
lon <- -62.860410
query <- obpg_query(top = 'http://oceandata.sci.gsfc.nasa.gov/opendap/catalog.xml',
                    platform = 'MODISA', 
                    product = 'L3SMI',
                    what = 'within',
                    greplargs = list(pattern='8D_CHL_chlor_a_4km', fixed = TRUE),
                    date_filter=c(date-3, date+4) ## find nearest image within one week
                    )
q <- query[[1]]
bb <- c(lon-10, lon+10, lat-10, lat+10) # define a 10x10 degree box around the point
chl <- SPNC(q$url, bb=bb)
r <- chl$get_raster(what='chlor_a')
p <- spplot(log10(r), main=paste0('Image=', format(chl$TIME)),
            scales=list(draw=TRUE), auto.key=list(title="log10[chl]"))
p <- p + layer(panel.points(lon, lat, pch=19, col=1, cex=2))
print(p)

plot of chunk animal-point

Now, we can extract a chlorophyll value from the location, using the extract() function:

## Note the `buffer` argument, which takes a radius in meters
surface_chl <- mean(extract(r, cbind(lon, lat), buffer=50000)[[1]], na.rm=TRUE)
print(paste0('The surface chlorophyll at ', lon, ', ', lat, ' is: ',
             format(surface_chl, digits=3), ' mg/m^3'))
## [1] "The surface chlorophyll at -62.86041, 43.783179 is: 1.09 mg/m^3"

Things to figure out (sp plots, rasters, projections, etc)

The world of “spatial” objects (e.g. through the sp package), and things that derive from them, is a new one for me. For example, in the oce package, we have developed methods for plotting matrices (e.g. imagep()) and geographical data (e.g. mapPlot(), mapImage(), etc) that differ from the GIS-like approach contained in the world of spatial analyses in R. I have long desired to learn more about this “other” world, and so have taken this opportunity with MODIS data to do so.

Projected rasters and lon/lat labels

The neat thing about raster objects is that they contain the coordinate projection information. For example, the coordinate system for the MODIS data that we downloaded can be seen with:

projection(r)
## [1] "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"

For those used to doing projected maps in oce, this string should be familiar as a proj4 string, which specifies that the coordinate system is simply “longlat” (i.e. not projected). To change the projection of the raster, we can use the projectRaster() function to update it to a new reference, e.g. a stereographic projection centred on 20 degrees W:

rp <- projectRaster(r, crs='+proj=sterea +lon_0=-20')

Now, if we use spplot() again, we get a raster that is plotted in the projected coordinate system:

spplot(log10(rp))

plot of chunk plot-projected

There are some things I haven’t figured out yet, particularly how to plot nice graticules (e.g. grid lines of constant latitude and longitude), since if the above is plotted as before with the scales= argument, the axis labels show the projected coordinates:

spplot(log10(rp), scales=list(draw=TRUE))

plot of chunk scales

It looks like the graticules package will be helpful for this, but it still doesn’t appear to be trivial. See also here for some other good-looking examples.

Extract a matrix from raster to use imagep()

One solution (at least for making maps) would be to extract the matrix data from the raster along with the longitude and latitude vectors. This would then allow for plotting in a projection using mapImage() from the oce package, as I’m used to. Let’s try and pull stuff out of the raster object r:

lon <- unique(coordinates(r)[,1])
lat <- unique(coordinates(r)[,2])
chl_mat <- as.matrix(r)
dim(chl_mat)
## [1] 231361      1

Note that using as.matrix() on a raster should extract the matrix values of the object; however, in this case it extracts a single column containing all the values. So we need to reshape it:

chl_mat <- array(as.matrix(r)[,1], dim=c(length(lon), length(lat)))
imagep(lon, lat, log10(chl_mat), col=oceColorsChlorophyll)

plot of chunk reshape

Ok! Now we’re getting somewhere! Let’s try a projected version:

library(ocedata)
## Loading required package: testthat
data(coastlineWorldFine)
mapPlot(coastlineWorldFine, projection=projection(rp),
        longitudelim=range(lon), latitudelim=range(lat))
mapImage(lon, lat, log10(chl_mat), col=oceColorsChlorophyll)

plot of chunk mapPlot

Awesome!

A Makefile for knitr documents

One of the best things I’ve found about using R for all my scientific work is its powerful and easy-to-use facilities for generating dynamic reports, particularly using the knitr package. The seamless integration of text, code, and the resulting figures (or tables) is a major step toward fully-reproducible research, and I’ve even found that it’s a great way of doing “exploratory” work that allows me to keep my own notes and code contained in the same document.

Being a fan of a “Makefile” approach to working with R scripts, as well as an Emacs/ESS addict, I find the easiest way to automatically run/compile my knitr LaTeX documents is with a Makefile. Below is a template I adapted from here:

all: pdf

MAINFILE  := **PUT MAIN FILENAME HERE**
RNWFILES  := 
RFILES    := 
TEXFILES  := 
CACHEDIR  := cache
FIGUREDIR := figures
LATEXMK_FLAGS := 
##### Explicit Dependencies #####
################################################################################
RNWTEX = $(RNWFILES:.Rnw=.tex)
ROUTFILES = $(RFILES:.R=.Rout)
RDAFILES= $(RFILES:.R=.rda)
MAINTEX = $(MAINFILE:=.tex)
MAINPDF = $(MAINFILE:=.pdf)
ALLTEX = $(MAINTEX) $(RNWTEX) $(TEXFILES)

# Dependencies
$(RNWTEX): $(RDAFILES)
$(MAINTEX): $(RNWTEX) $(TEXFILES)
$(MAINPDF): $(MAINTEX) $(ALLTEX) 

.PHONY: pdf tex clean 

pdf: $(MAINPDF)

tex: $(RDAFILES) $(ALLTEX) 

%.tex:%.Rnw
	Rscript \
	  -e "library(knitr)" \
	  -e "knitr::opts_chunk[['set']](fig.path='$(FIGUREDIR)/$*-')" \
	  -e "knitr::opts_chunk[['set']](cache.path='$(CACHEDIR)/$*-')" \
	  -e "knitr::knit('$<','$@')"

%.R:%.Rnw
	Rscript -e "Sweave('$^', driver=Rtangle())"

%.Rout:%.R
	R CMD BATCH "$^" "$@"

%.pdf: %.tex 
	latexmk -pdf $<

clean:
	-latexmk -c -quiet $(MAINFILE).tex
	-rm -f $(MAINTEX) $(RNWTEX)
	-rm -rf $(FIGUREDIR)
	-rm *tikzDictionary
	-rm $(MAINPDF)