Memory profiling in R using lineprof
… in which we show an example of memory profiling in R using the lineprof package and end up discussing S4 constructors and the Biobase package.
update: include link to Advanced R programming.
Introduction
Memory usage is an important aspect of using and developing for R. For a while, we have had the possibility of doing memory profiling in R, but it has been hard to use. This was recently revolutionized by Hadley Wickham’s lineprof package.
What follows is my first attempt at using this package for memory profiling on one of my own R packages. This is an edited transcript of that attempt, and it shows how I managed to improve the memory usage in less than 2 hours, including reading the documentation of lineprof. In my opinion, this is very good value for money.
Advanced R programming: Memory by the lineprof
author has
more discussion of the package. This includes some nice GUI using shiny
. I use the package
without, since I am often on a server. Also, being a grumpy man I don’t see the point of the GUI.
lineprof installation
At the time of writing, I don’t think lineprof is on CRAN. You can get it from github using devtools.
install.packages("devtools")
library(devtools)
devtools::install_github("lineprof")
Profiling preprocessRaw from minfi
As an example we will profile the function preprocessRaw
from my package minfi
on Bioconductor.
I started with this function because it is a fundamental workhorse in this package; there exists
multiple preprocessXX
functions which are more advanced version, with greater computational
overload. I had reports (from Jean-Phillipe Fortin) that the function seemed “slow” at times.
We start by using minfi
1.8.9 from Bioconductor 2.13, which is the current stable release (see
sessionInfo below).
First, we load the package (and some example data) look at it
library(lineprof)
library(minfiData)
data(RGsetEx)
preprocessRaw
function (rgSet) {
.isRG(rgSet)
locusNames <- getManifestInfo(rgSet, "locusNames")
M <- matrix(NA_real_, ncol = ncol(rgSet), nrow = length(locusNames),
dimnames = list(locusNames, sampleNames(rgSet)))
U <- M
TypeII.Name <- getProbeInfo(rgSet, type = "II")$Name
M[TypeII.Name, ] <- getGreen(rgSet)[getProbeInfo(rgSet, type = "II")$AddressA, ]
U[TypeII.Name, ] <- getRed(rgSet)[getProbeInfo(rgSet, type = "II")$AddressA, ]
TypeI.Red <- getProbeInfo(rgSet, type = "I-Red")
TypeI.Green <- getProbeInfo(rgSet, type = "I-Green")
M[TypeI.Red$Name, ] <- getRed(rgSet)[TypeI.Red$AddressB, ]
M[TypeI.Green$Name, ] <- getGreen(rgSet)[TypeI.Green$AddressB, ]
U[TypeI.Red$Name, ] <- getRed(rgSet)[TypeI.Red$AddressA, ]
U[TypeI.Green$Name, ] <- getGreen(rgSet)[TypeI.Green$AddressA, ]
out <- new("MethylSet", Meth = M, Unmeth = U, phenoData = phenoData(rgSet),
annotation = annotation(rgSet))
out@preprocessMethod <- c(rg.norm = "Raw (no normalization or bg correction)",
minfi = as.character(packageVersion("minfi")),
manifest = as.character(packageVersion("IlluminaHumanMethylation450kmanifest")))
out
}
Let us run the function through lineprof
, using input data from the minfiData
package.
prof <- lineprof(out <- preprocessRaw(RGsetEx))
prof
Reducing depth to 2 (from 49)
time alloc release dups ref
1 0.762 57.126 9.783 6 c("preprocessRaw", ".isRG")
2 0.763 53.099 17.311 2083 c("preprocessRaw", "getManifestInfo")
3 0.001 0.461 0.000 0 "preprocessRaw"
4 0.127 3.734 9.879 603 c("preprocessRaw", "matrix")
5 0.001 0.648 0.000 172 c("preprocessRaw", "lazyLoadDBfetch")
6 0.265 16.299 7.429 371 "preprocessRaw"
7 0.372 52.962 52.391 15272 c("preprocessRaw", "getProbeInfo")
8 0.159 11.143 13.202 350 "preprocessRaw"
9 1.123 314.745 240.995 16679 c("preprocessRaw", "new")
10 0.001 0.000 0.000 203 c("preprocessRaw", "packageVersion")
src
1 preprocessRaw/.isRG
2 preprocessRaw/getManifestInfo
3 preprocessRaw
4 preprocessRaw/matrix
5 preprocessRaw/lazyLoadDBfetch
6 preprocessRaw
7 preprocessRaw/getProbeInfo
8 preprocessRaw
9 preprocessRaw/new
10 preprocessRaw/packageVersion
The output should be essentially self-explained. Most lines in preprocessRaw
has a time
,
alloc
and release
associated with it. This tells us how much time the line takes, how much
memory is allocated and how much is released. For this function, anything that takes time also has a
lot of memory allocated. We are particular interested in lines where a lot of memory is allocated
and subsequently released. The candidates are easily seen to be lines 1,2,4,6,7,8,9.
In this package I have been following a tried and tested design approach, where information related
to the array design (which does not change between experiments) is stored in an external package, in
this case called IlluminaHumanMethylation450kmanifest
. It turns out that a lot of the initial
allocation is simply a conseqeunce of loading this package automatically. Also, the RGsetEx
dataset is lazy-loaded, so part of the time is spent loading this dataset into memory.
In other words, some amount of this memory allocation happens the first time we run it. We can verify this, by simply running the function one more time, where the manifest package and the dataset has already been accessed.
prof <- lineprof(out <- preprocessRaw(RGsetEx))
prof
Reducing depth to 2 (from 38)
time alloc release dups ref
1 0.002 0.463 0.000 368 c("preprocessRaw", "getManifestInfo")
2 0.001 0.569 0.000 0 "preprocessRaw"
3 0.004 4.047 0.000 62 c("preprocessRaw", "matrix")
4 0.359 17.443 0.000 540 "preprocessRaw"
5 0.291 48.166 49.693 9232 c("preprocessRaw", "getProbeInfo")
6 0.266 12.536 20.570 350 "preprocessRaw"
7 0.927 314.958 216.111 6587 c("preprocessRaw", "new")
8 0.001 0.123 0.000 29 c("preprocessRaw", "packageVersion")
9 0.002 0.463 0.000 155 "preprocessRaw"
src
1 preprocessRaw/getManifestInfo
2 preprocessRaw
3 preprocessRaw/matrix
4 preprocessRaw
5 preprocessRaw/getProbeInfo
6 preprocessRaw
7 preprocessRaw/new
8 preprocessRaw/packageVersion
9 preprocessRaw
Great, that reduced the memory consumption substantially for several of the calls.
Now it is time to discuss the input and output data. The input object RGsetEx
is a class defined
in minfi
, called RGChannelSet
. It represents X samples profiled on the Illumina 450k
microarray. We are interested in running this code on large sample sizes (by large I mean X in the
1,000 to 10,000 range), but we are profiling it on a small instance only containing X=6 samples. We
are particular interested in any inefficiencies which scales with the number of samples.
The memory intensive data stored in the class can be thought of as two matrices Red
and Green
which have 622,399 rows and X columns. In addition, there is a number of additional slots which are
small for practical purposes. The number of rows is fixed for this array. The preprocessRaw
function takes these two matrices and returns an object with the data intensive part stored as two
matrices Meth
and Unmeth
with 485,512 rows and X columns (these two matrices are shortened as
M
and U
in the code in preprocessRaw
above).
The classes are build on work done in the Bioconductor package Biobase
, specifically the eSet
class. I have more to say on this below, but for now, it might be good to know that the matrices
mentioned above are stored inside an environment inside the class, to get pass-by-reference (any
reader who is inspired to explore this ought to look at Biobase
and see the extensive amount of
code necessary to get this right – these days a user would probably use a reference class). This
make simple memory reading a bit difficult, as evidenced by this
print(object.size(RGsetEx), units = "auto")
38 Mb
print(object.size(getRed(RGsetEx)), units = "auto")
52.2 Mb
Note that the 38MB in the output above does not include the Red
and Green
matrices. The actual
memory footprint of the object is actually 38+2x52 = 142MB.
In the remaining lines, there are issues with a call to getProbeInfo
and to new
. I know that
getProbeInfo
does not depend on X, whereas new
does. This is also the function call which looks
to be the main culprit. It appears that a number of unnecessary data duplication is taking
place. Understanding how I fix this, requires some discussion of Biobase
which I do in the next
section. For the reader uninterested in this discussion, I addressed this by writing a specific
constructor for my class, which I call MethylSet
and it is defined as
MethylSet <- function(Meth, Unmeth, phenoData, annotation = NULL) {
if(is.null(annotation))
annotation <- c(array = "IlluminaHumanMethylation450k",
annotation = .default.450k.annotation)
stopifnot(rownames(Meth) == rownames(Unmeth))
stopifnot(colnames(Meth) == colnames(Unmeth))
stopifnot(colnames(Meth) == rownames(phenoData))
tmp <- matrix(nrow = 0, ncol = ncol(Meth))
out <- new("MethylSet", Meth = tmp, Unmeth = tmp,
phenoData = phenoData, annotation = annotation)
assayDataElement(out, "Meth") <- Meth
assayDataElement(out, "Unmeth") <- Unmeth
featureData(out) <- AnnotatedDataFrame(data = data.frame(row.names = row.names(Meth)),
dimLabels = c("featureNames", "featureColumns"))
out
}
I fix preprocessRaw
by replacing
out <- new("MethylSet", Meth = M, Unmeth = U, phenoData = phenoData(rgSet),
annotation = annotation(rgSet))
with
out <- MethylSet(Meth = M, Unmeth = U, phenoData = phenoData(rgSet),
annotation = annotation(rgSet))
Because I am writing this blog post after I did the actual profiling, I have now implemented this
fix in minfi
1.9.9 in Bioconductor 2.14, currently Bioconductor devel. I hence switch
R versions etc. (sessionInfo below), and get the following timings
prof <- lineprof(out <- preprocessRaw(RGsetEx))
prof
Reducing depth to 2 (from 41)
time alloc release dups ref
1 0.002 0.464 0.000 340 c("preprocessRaw", "getManifestInfo")
2 0.036 3.474 0.000 58 c("preprocessRaw", "matrix")
3 0.171 8.606 3.223 330 "preprocessRaw"
4 0.001 1.264 0.000 166 c("preprocessRaw", "$")
5 0.150 7.705 5.868 5 "preprocessRaw"
6 0.415 43.322 41.536 8809 c("preprocessRaw", "getProbeInfo")
7 0.038 2.063 0.000 68 "preprocessRaw"
8 0.001 0.243 0.000 0 c("preprocessRaw", "$")
9 0.091 6.614 8.380 194 "preprocessRaw"
10 0.001 0.773 0.000 0 c("preprocessRaw", "getGreen")
11 0.026 1.200 0.000 43 "preprocessRaw"
12 0.001 0.161 0.000 0 c("preprocessRaw", "$")
13 0.018 0.599 8.155 36 "preprocessRaw"
14 0.031 15.590 0.000 7736 c("preprocessRaw", "MethylSet")
15 0.001 0.512 0.000 115 c("preprocessRaw", "packageVersion")
16 0.001 0.000 0.000 60 "preprocessRaw"
src
1 preprocessRaw/getManifestInfo
2 preprocessRaw/matrix
3 preprocessRaw
4 preprocessRaw/$
5 preprocessRaw
6 preprocessRaw/getProbeInfo
7 preprocessRaw
8 preprocessRaw/$
9 preprocessRaw
10 preprocessRaw/getGreen
11 preprocessRaw
12 preprocessRaw/$
13 preprocessRaw
14 preprocessRaw/MethylSet
15 preprocessRaw/packageVersion
16 preprocessRaw
A dramatic improvement in speed and memory usage, which will be really important for use cases with X=1,000 instead of the very small example profiled here, with X=6.
It is hard to overstate how easy it was to use lineprof
to do this. I did all of this in about 2
hours, but that included reading the lineprof
documentation.
Fixing object construction
The class MethylSet
is a simple extensive of eSet
from Biobase
. This package (and this
particular class) has been hugely influential, it is probably hard to overstate the importance of
this contribution. Biobase
was written many years ago, and is one of the early examples of using
S4 classes to store massive datasets. For this reason, there is a number of design decisions that we
now would do differentily (and which are done differently in more recent Bioconductor packages), but
due to the importance of Biobase
we have been reluctant to change the eSet
class.
One such design decision is the insistence on having the end-user (in this case me) using new()
to
instantiate new objects. These days, this is all done using a class specific constructor, which is
how I fixed the problem.
The eSet
class has the following important components. First, a number of matrices, all with
dimensions nFeatures x nSamples. This is the Red
/Green
and Meth
/Unmeth
matrices mentioned
above. These matrices has to have the same row and column names. We also have a data.frame
called pData
which has dimensions nSamples x nCovariates. The rownames of pData
has to be equal
to the column names of the data matrices. Finally, we have a data.frame
called featureData
with
dimensions nFeatures x nFeatureCovariates. Again, there are constraints on featureData
and the
data matrices including rownames. In many practical examples, users (developers) does not include
any data in the fData
slot. And when I say data.frame
, it is more complicated. Both fData
and pData
are components of a class called AnnotatedDataFrame
which is really a data.frame
with additional annotation (doh, could have guesses this). The class is
getClass("eSet")
Virtual Class "eSet" [package "Biobase"]
Slots:
Name: assayData phenoData featureData
Class: AssayData AnnotatedDataFrame AnnotatedDataFrame
Name: experimentData annotation protocolData
Class: MIAxE character AnnotatedDataFrame
Name: .__classVersion__
Class: Versions
Extends:
Class "VersionedBiobase", directly
Class "Versioned", by class "VersionedBiobase", distance 2
Known Subclasses: "ExpressionSet", "NChannelSet", "MultiSet", "SnpSet"
The instantiation method for eSet
(the new
method) takes its arguments and does all kinds fo
checking and then passes information to sub constructors that does more checking. If you’re really
interested, I invite you to read the instatiation method(s) for the classes involved – it is not
trivial stuff. This provides a certain amount of robustness, that I do not need. For
example, in the call
out <- new("MethylSet", Meth = M, Unmeth = U, phenoData = phenoData(rgSet),
annotation = annotation(rgSet))
I know that the column names of Meth
and Unmeth
match up with the row names of pData
. This is
insured by design. But the general instatiator does this for me: I could have provided Meth
and
Unmeth
without column names and then the instantiation method would have used the row names from
pData
as column names for these two matrices. It is this kind of robustness which makes the
instantiation do a lot of unnecessary copying.
I fix this by constructing my return object in pieces (I am now dissecting MethylSet
defined
above). First, I use new
on empty data matrices.
tmp <- matrix(nrow = 0, ncol = ncol(Meth))
out <- new("MethylSet", Meth = tmp, Unmeth = tmp,
phenoData = phenoData, annotation = annotation)
This involves very small objects, and actually fills in the column names of tmp
as described
above. I then manually fix the Meth
and Unmeth
channels as
assayDataElement(out, "Meth") <- Meth
assayDataElement(out, "Unmeth") <- Unmeth
At this point, the return object out
is invalid. This is because the rows of Meth
and Unmeth
does not correspond to the rows of the featureData
slot that was created indirectly above with 0
rows. I need to fix this manually, by creating a data.frame
with many rows, but 0 columns, like this
featureData(out) <- AnnotatedDataFrame(data = data.frame(row.names = row.names(Meth)),
dimLabels = c("featureNames", "featureColumns"))
This last fix took me a while to figure out; everything else was straightforward. But I also have a
decent insight into the internals of Biobase
having used the package extensively as a foundation
for several analysis packages I have written.
Since it is always a good idea with error checking, I add some simple tests and settle on
MethylSet <- function(Meth, Unmeth, phenoData, annotation = NULL) {
if(is.null(annotation))
annotation <- c(array = "IlluminaHumanMethylation450k",
annotation = .default.450k.annotation)
stopifnot(rownames(Meth) == rownames(Unmeth))
stopifnot(colnames(Meth) == colnames(Unmeth))
stopifnot(colnames(Meth) == rownames(phenoData))
tmp <- matrix(nrow = 0, ncol = ncol(Meth))
out <- new("MethylSet", Meth = tmp, Unmeth = tmp,
phenoData = phenoData, annotation = annotation)
assayDataElement(out, "Meth") <- Meth
assayDataElement(out, "Unmeth") <- Unmeth
featureData(out) <- AnnotatedDataFrame(data = data.frame(row.names = row.names(Meth)),
dimLabels = c("featureNames", "featureColumns"))
out
}
sessionInfo
For minfi
1.8.9, with the first version of preprocessRaw
sessionInfo()
R version 3.0.2 Patched (2014-01-27 r64899)
Platform: x86_64-apple-darwin13.0.0 (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] minfiData_0.4.2
[2] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
[3] IlluminaHumanMethylation450kmanifest_0.4.0
[4] minfi_1.8.9
[5] bumphunter_1.2.0
[6] locfit_1.5-9.1
[7] iterators_1.0.6
[8] foreach_1.4.1
[9] Biostrings_2.30.1
[10] GenomicRanges_1.14.4
[11] XVector_0.2.0
[12] IRanges_1.20.6
[13] reshape_0.8.4
[14] plyr_1.8
[15] lattice_0.20-24
[16] Biobase_2.22.0
[17] BiocGenerics_0.8.0
[18] lineprof_0.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.24.0 DBI_0.2-7 MASS_7.3-29
[4] R.methodsS3_1.6.1 RColorBrewer_1.0-5 RSQLite_0.11.4
[7] XML_3.98-1.1 annotate_1.40.0 base64_1.1
[10] beanplot_1.1 codetools_0.2-8 digest_0.6.4
[13] doRNG_1.5.5 genefilter_1.44.0 grid_3.0.2
[16] illuminaio_0.4.0 itertools_0.1-1 limma_3.18.9
[19] matrixStats_0.8.14 mclust_4.2 multtest_2.18.0
[22] nlme_3.1-113 nor1mix_1.1-4 pkgmaker_0.17.4
[25] preprocessCore_1.24.0 registry_0.2 rngtools_1.2.3
[28] siggenes_1.36.0 splines_3.0.2 stats4_3.0.2
[31] stringr_0.6.2 survival_2.37-7 tools_3.0.2
[34] xtable_1.7-1
For minfi
1.9.9, with the fixed version of preprocessRaw
sessionInfo()
R Under development (unstable) (2014-01-30 r64899)
Platform: x86_64-apple-darwin13.0.0 (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] minfiData_0.4.3
[2] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
[3] IlluminaHumanMethylation450kmanifest_0.4.0
[4] minfi_1.9.9
[5] bumphunter_1.3.7
[6] locfit_1.5-9.1
[7] iterators_1.0.6
[8] foreach_1.4.1
[9] Biostrings_2.31.12
[10] XVector_0.3.6
[11] GenomicRanges_1.15.24
[12] IRanges_1.21.23
[13] lattice_0.20-24
[14] Biobase_2.23.3
[15] BiocGenerics_0.9.3
[16] lineprof_0.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.25.9 DBI_0.2-7 MASS_7.3-29
[4] R.methodsS3_1.6.1 RColorBrewer_1.0-5 RSQLite_0.11.4
[7] XML_3.98-1.1 annotate_1.41.1 base64_1.1
[10] beanplot_1.1 codetools_0.2-8 digest_0.6.4
[13] doRNG_1.5.5 genefilter_1.45.1 grid_3.1.0
[16] illuminaio_0.5.5 limma_3.19.15 matrixStats_0.8.14
[19] mclust_4.2 multtest_2.19.1 nlme_3.1-113
[22] nor1mix_1.1-4 pkgmaker_0.17.4 preprocessCore_1.25.4
[25] registry_0.2 reshape_0.8.4 rngtools_1.2.3
[28] siggenes_1.37.1 splines_3.1.0 stats4_3.1.0
[31] stringr_0.6.2 survival_2.37-7 tools_3.1.0
[34] xtable_1.7-1