Welcome!

I am a physical oceanographer interested in how ocean water is mixed and transformed. I am currently working as Research Scientist at the Bedford Institute of Oceanography in Halifax, Nova Scotia.

Recent Posts:

Using the scan() function in R to read weirdly formatted data

I am writing some code to parse a weird data format, and using scan() to suck in everything first. Basically, it’s csv-style lines, but some lines have a different number of fields and are for different things — imaging CTD data interspersed with system messages, where the line is identified by the very first field. Something like:

GPS,20150727T120000,-10.1,12.2
MESSAGE, 20150727T120005,Begin descent
CTD,20150727120100,1,25,35
CTD,20150727120200,10,20,34
CTD,20150727120400,100,10,33
MESSAGE,20150727T121000,Begin ascent
CTD,20150727121500,100,10,33
CTD,20150727121600,90,12,33.5
etc ...

Anyway, when I was just reading in the CTD fields, everything was fine, but when I started trying to parse the MESSAGE fields, I found that scan() was doing something unexpected with the spaces in the message field, and producing a char vector like:

"GPS,20150727T120000,-10.1,12.2"
"MESSAGE, 20150727T120005,Begin" 
"descent"
"CTD,20150727120100,1,25,35"
...

Basically, scan() was treating the space between “Begin” and “descent” as a delimiter (as well as the carriage returns).

Anyway, after much attempting to interpret the man page, and trying different things, I discovered that

scan(con, character(), sep='\n')

would suck in the entire line as a character vector, which is what I wanted.

Using the R apply() family with oce objects

Introduction

In the oce package, the various different data formats are stored in consistently structured objects. In this post, I’ll explore a way to access elements of multiple oce objects using the R lapply(), from the apply family of functions.

Example with a ctd object

The objects always contain three fields (or “slots”): metadata, data, and processingLog. The layout of the object can be visualized using the str() command, like:

library(oce)
## Loading required package: methods
## Loading required package: gsw
data(ctd)
str(ctd)

which produces something like:

Formal class 'ctd' [package "oce"] with 3 slots
  ..@ metadata     :List of 26
  .. ..$ header                  : chr [1:42] "* Sea-Bird SBE 25 Data File:"
  .. ..$ type                    : chr "SBE"
  .. ..$ conductivityUnit        : chr "ratio"
  .. ..$ temperatureUnit         : chr "IPTS-68"
  .. ..$ systemUploadTime        : POSIXct[1:1], format: "2003-10-15 11:38:38"
  .. ..$ station                 : chr "Stn 2"
  .. ..$ date                    : POSIXct[1:1], format: "2003-10-15 11:38:38"
  .. ..$ startTime               : POSIXct[1:1], format: "2003-10-15 11:38:38"
  .. ..$ latitude                : num 44.7
  .. ..$ longitude               : num -63.6
  ..@ data         :List of 9
  .. ..$ scan         : int [1:181] 130 131 132 133 134 135 136 137 138 139 ...
  .. ..$ time         : num [1:181] 129 130 131 132 133 134 135 136 137 138 ...
  .. ..$ pressure     : num [1:181] 1.48 1.67 2.05 2.24 2.62 ...
  .. ..$ depth        : num [1:181] 1.47 1.66 2.04 2.23 2.6 ...
  .. ..$ temperature  : num [1:181] 14.2 14.2 14.2 14.2 14.2 ...
  .. ..$ salinity     : num [1:181] 29.9 29.9 29.9 29.9 29.9 ...
  .. ..$ temperature68: num [1:181] 14.2 14.2 14.2 14.2 14.2 ...
  ..@ processingLog:List of 2
  .. ..$ time : POSIXct[1:5], format: "2015-08-18 19:22:36" "2015-08-18 19:22:36" ...
  .. ..$ value: chr [1:5] "create 'ctd' object" "ctdAddColumn(x = res, column = swSigmaTheta(res@data$salinity,     res@data$temperature, res@data$pressure), name = "sigmaThet"| __truncated__ "read.ctd.sbe(file = file, processingLog = processingLog)" "converted temperature from IPTS-69 to ITS-90" ...

(where I’ve trimmed a few lines out just to make it shorter).

For a single object, there are several ways to access the information contained in the object. The first (and generally recommended) way is to use the [[ accessor — for example if you wanted the temperature values from a ctd object you would do

T <- ctd[['temperature']]

Another way is to access the element directly, by using the slot and list syntax, like:

T <- ctd@data$temperature

The disadvantage to the latter is that it requires knowledge of exactly where the desired field is in the object structure, and is brittle to downstream changes in the oce source.

Working with multiple objects

Frequently, especially with CTD data, it is common to have to work with a number of individual ctd objects — usually representing different casts. One way of organizing such objects, particularly if they share a common instrument, or ship, or experiment etc, is to collect them into a list.

For example, we could loop through a directory of individual cast files (or extract multiple casts from one file using ctdFindProfiles()), and append each one to a list like:

files <- dir(pattern='*.cnv')
casts <- list()
for (ifile in 1:length(files)) {
    casts[[ifile]] <- read.oce(files[ifile])
}

If we summarize the new casts list, we can see that it’s filled with ctd objects:

str(casts, 1) # the "1" means just go one level deep

Extracting fields from multiple objects at once

Say we want to extract all the temperature measurements from each object in our new list? How could we do it?

The brute force approach would be to loop through the list elements, and append the temperature field to a vector, maybe something like:

T_all <- NULL
for (i in 1:length(casts)) {
    T_all <- c(T_all, casts[[i]][['temperature']])
}

But in R, there’s a more elegant way — lapply()!

T_all <- unlist(lapply(casts, function(x) x[['temperature']]))

Forking and syncing branches with git and github

When forking a branch on github, it was not entirely clear to me how to sync branches other than master (e.g. to make a pull request). The following eventually seemed to work:

Set upstream remotes

First, you need to make sure that your fork is set up to track the original repo as upstream (from here):

List the current remotes:

$ git remote -v
# origin  https://github.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
# origin  https://github.com/YOUR_USERNAME/YOUR_FORK.git (push)

Specify a new remote upstream repository that will be synced with the fork.

$ git remote add upstream https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git

Verify the new upstream repository you’ve specified for your fork.

$ git remote -v
# origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
# origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (push)
# upstream  https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git (fetch)
# upstream  https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git (push)

Syncing a fork

Now you’re ready to sync changes! See here for more details on syncing a “main” branch:

Fetch the branches:

$ git fetch upstream
# remote: Counting objects: 75, done.
# remote: Compressing objects: 100% (53/53), done.
# remote: Total 62 (delta 27), reused 44 (delta 9)
# Unpacking objects: 100% (62/62), done.
# From https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY
#  * [new branch]      master     -> upstream/master

Check out your fork’s local master branch.

$ git checkout master
# Switched to branch 'master'

Merge the changes from upstream/master into your local master branch. This brings your fork’s master branch into sync with the upstream repository, without losing your local changes.

$ git merge upstream/master
# Updating a422352..5fdff0f
# Fast-forward
#  README                    |    9 -------
#  README.md                 |    7 ++++++
#  2 files changed, 7 insertions(+), 9 deletions(-)
#  delete mode 100644 README
#  create mode 100644 README.md

Syncing an upstream branch

To sync upstream changes from a different branch, do the following (from here):

git fetch upstream                            #make sure you have all the upstream changes
git checkout --no-track upstream/newbranch    #grab the new branch but don't track it
git branch --set-upstream-to=origin/newbranch #set the upstream repository to your origin
git push                                      #push your new branch up to origin

Turning off Auctex fontification so that columns can align

I love Emacs. I use it for everything, and particularly love it for doing tables in LaTeX because I can easily align everything so that it looks sensible, and rectangle mode makes it easy to move columns around if desired.

That being said, Auctex defaults do some fontification to math-mode super- and subscripts, which cause the horizontal alignments of characters to be off (essentially it is no longer a fixed-width font). To turn this off, do:

M-x customize-variable font-latex-fontify-script

Beauty!

Switching from Matlab to R: Part 1

Introduction

I was thinking recently about how best to help someone transitioning from Matlab(TM) to R, and did my best to recall what sorts of things I struggled with when I made the switch. Though I resisted for quite a while, when I finally committed to making the change I recall that it mostly happened in a matter of weeks. It helped that my thesis supervisor exclusively used R, and we were working on code for a paper together at the time, but in the end I found that the switch was easier than I had anticipated.

Tips

  1. Don’t be afraid of the assign <- operator. It means exactly the same thing as you would use = in matlab, as in

    a <- 1:10 # in matlab a=1:10;
    

    except that it makes more sense logically. The only place you should use = is in logical comparisons like a == b (as in matlab), or for specifying argument values in a function (see number 5).

  2. Vectors are truly 1 dimensional. This is different from matlab in the way that you could not add together an Nx1 and a 1xN vector. In R it would be just two vectors of length N. The transpose in R is by doing t(), and the transpose of a vector (or class “numeric”) is the same as the original.

  3. Array indices use square brackets, like

    a[1:5] <- 2 # assign the value 2 to the first 5 indices of a
    

    This is one of the things that drove me crazy about matlab, that it used () for indices as well as function arguments. It makes mixed array indexing and function calls very confusing to look at and interpret.

  4. By default arithmetic operations are done element-wise. If you have two MxN matrices (say A and B), and you do C <- A*B, every element in C is the product of the corresponding elements in A and B. No need to do the .* stuff as in matlab. To get matrix multiplication, you use the %*% operator.

  5. Function arguments are named, so the order isn’t super important. If you don’t name them, then you have to give them in the order they appear (do ?function to see the help page). For example if a function took arguments like:

    foo <- function(a, b, c, type, bar) { #function code here
    }
    

    You could call it with something like:

    junk <- foo(1, 2, bar='whatever')
    

    where a and b are given the values of 1 and 2, and c and type are left unspecified. This would be equivalent:

    junk <- foo(a=1, b=2, bar='whatever')
    

    You could also do:

    junk <- foo(bar='whatever', a=1, b=2)
    
  6. No semicolons needed (except where you’d like to have more than one operation per line, like a <- 1; b <- 2

  7. In R, the equivalent to a matlab structure is called a “list”. Instead of separating the levels with a ., it is generally done with a $. So the structure of a list could be something like:

    a <- junk$stuff$whatever
    

    Use the str() command to look at the structure of a list object.

  8. Most functions that return more than just a single value will return in a list. Unlike matlab there isn’t a simple way returning separate values to separate variables, like [a, b] = foo('bar'). For example, using the histogram function:

    a <- rnorm(1000)
    h <- hist(a)
    

    plot of chunk unnamed-chunk-8

    str(h)
    
    ## List of 6
    ##  $ breaks  : num [1:15] -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 ...
    ##  $ counts  : int [1:14] 1 4 19 40 97 162 193 177 146 96 ...
    ##  $ density : num [1:14] 0.002 0.008 0.038 0.08 0.194 0.324 0.386 0.354 0.292 0.192 ...
    ##  $ mids    : num [1:14] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 ...
    ##  $ xname   : chr "a"
    ##  $ equidist: logi TRUE
    ##  - attr(*, "class")= chr "histogram"
    

    If I wanted to extract something from that I could use

    b <- h$breaks
    

    If you really only want one thing out of the list, you could do something like

    b <- hist(a, plot=FALSE)$breaks
    
  9. You can use .’s in variable and function names, but I don’t recommend you do. Often a function with a . in it means that it applies a “generic” operation to a specific class. For example, the plot() function is a straightforward way of plotting data, much like in matlab. However, there exist lots of variants of plot for different classes, which are usually specified as plot.class(). E.g. for the histogram object I created above, if I want to plot it, I can just do

    h2 <- hist(a, plot=FALSE, breaks=100)
    plot(h2, main='A plot with more breaks')
    

    plot of chunk unnamed-chunk-11 and it will plot it as a histogram, using the generic function plot.histogram(), as well as accept the arguments appropriate to that generic function.

Thoughts on topics for future editions of matlab2R

  • plotting, including:

    • points, lines, styles, etc
    • “image”-style plots, contours, filled contours, colormaps, etc
  • POSIX times vs Matlab datenum

  • … suggestions in comments?