Reading regularly named files

This vignette contains several examples that show how to use capture_first_glob to read data from a set of regularly named files.

Example 0: iris data, one file per species

We begin with a simple example: iris data have 150 rows, as shown below.

library(data.table)
dir.create(iris.dir <- tempfile())
icsv <- function(sp)file.path(iris.dir, paste0(sp, ".csv"))
(iris.dt <- data.table(iris))
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>             <num>       <num>        <num>       <num>    <fctr>
#>   1:          5.1         3.5          1.4         0.2    setosa
#>   2:          4.9         3.0          1.4         0.2    setosa
#>   3:          4.7         3.2          1.3         0.2    setosa
#>   4:          4.6         3.1          1.5         0.2    setosa
#>   5:          5.0         3.6          1.4         0.2    setosa
#>  ---                                                            
#> 146:          6.7         3.0          5.2         2.3 virginica
#> 147:          6.3         2.5          5.0         1.9 virginica
#> 148:          6.5         3.0          5.2         2.0 virginica
#> 149:          6.2         3.4          5.4         2.3 virginica
#> 150:          5.9         3.0          5.1         1.8 virginica

In the code below, we save one CSV file for each of the three Species.

iris.dt[, fwrite(.SD, icsv(Species)), by=Species]
#> Empty data.table (0 rows and 1 cols): Species
dir(iris.dir)
#> [1] "setosa.csv"     "versicolor.csv" "virginica.csv"

The output above shows that there are three CSV files, one for each Species in the iris data. Below we read the first two rows of one file,

data.table::fread(file.path(iris.dir,"setosa.csv"), nrows=2)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <num>       <num>        <num>       <num>
#> 1:          5.1         3.5          1.4         0.2
#> 2:          4.9         3.0          1.4         0.2

The output above shows that the CSV data file itself does not contain a Species column (the Species is instead encoded in the file name). Below we construct a glob, which is a string for matching files,

(iglob <- file.path(iris.dir,"*.csv"))
#> [1] "/tmp/RtmpaHnllH/file2d4a3c1e3aef2c/*.csv"
Sys.glob(iglob)
#> [1] "/tmp/RtmpaHnllH/file2d4a3c1e3aef2c/setosa.csv"    
#> [2] "/tmp/RtmpaHnllH/file2d4a3c1e3aef2c/versicolor.csv"
#> [3] "/tmp/RtmpaHnllH/file2d4a3c1e3aef2c/virginica.csv"

The output above indicates that iglob matches the three data files. Below we read those files into R, using the following syntax:

The first argument iglob is a string/glob which indicates the files to read;
the other arguments form a regular expression pattern:
- the named argument Species matches that part of the file name, and is captured to the resulting column of the same name,
- the un-named argument "[.]csv" indicates that the suffix must be matched (but since the argument is not named, it is neither captured nor saved as a column in the output).

nc::capture_first_glob(iglob, Species="[^/]+", "[.]csv")
#>        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>         <char>        <num>       <num>        <num>       <num>
#>   1:    setosa          5.1         3.5          1.4         0.2
#>   2:    setosa          4.9         3.0          1.4         0.2
#>   3:    setosa          4.7         3.2          1.3         0.2
#>   4:    setosa          4.6         3.1          1.5         0.2
#>   5:    setosa          5.0         3.6          1.4         0.2
#>  ---                                                            
#> 146: virginica          6.7         3.0          5.2         2.3
#> 147: virginica          6.3         2.5          5.0         1.9
#> 148: virginica          6.5         3.0          5.2         2.0
#> 149: virginica          6.2         3.4          5.4         2.3
#> 150: virginica          5.9         3.0          5.1         1.8

The output above indicates that we have successfully read the iris data back into R, including the Species column which was not present in the CSV data files.
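
To make the steps in this example more concrete, here is a rough base R sketch of the same idea: expand a glob, extract the Species from each file name, read each file, and combine the results. It is only an illustration under assumptions (demo.dir, species.from.name and manual.df are hypothetical names, and read.csv stands in for fread); it is not a description of how nc is implemented.

```r
# Hypothetical stand-alone sketch: base R only, no nc or data.table needed.
demo.dir <- tempfile()
dir.create(demo.dir)
# Write one CSV file per species, without the Species column.
for(sp in levels(iris$Species)){
  one.species <- iris[iris$Species == sp, names(iris) != "Species"]
  write.csv(
    one.species, file.path(demo.dir, paste0(sp, ".csv")), row.names=FALSE)
}
# Expand the glob to get the matching file names.
csv.files <- Sys.glob(file.path(demo.dir, "*.csv"))
# Recover the species from the part of the file name before the suffix.
species.from.name <- sub("[.]csv$", "", basename(csv.files))
# Read each file, and add the recovered species as the first column.
manual.list <- lapply(seq_along(csv.files), function(i){
  data.frame(
    Species=species.from.name[i],
    read.csv(csv.files[i]),
    stringsAsFactors=FALSE)
})
manual.df <- do.call(rbind, manual.list)
str(manual.df)
```

The sketch needs one line per step, whereas the glob and the pattern are given together in a single call to nc::capture_first_glob.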

Example 1: four files, two capture groups, custom read function

Consider the example below, which is slightly more complex. The code below defines a glob for matching several data files.

db <- system.file("extdata/chip-seq-chunk-db", package="nc", mustWork=TRUE)
(glob <- paste0(db, "/*/*/counts/*gz"))
#> [1] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/*/*/counts/*gz"
(matched.files <- Sys.glob(glob))
#> [1] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune/9/counts/McGill0101.bedGraph.gz"
#> [2] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_TDH_other/1/counts/McGill0019.bedGraph.gz"
#> [3] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_TDH_immune/9/counts/McGill0024.bedGraph.gz"
#> [4] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune/2/counts/McGill0024.bedGraph.gz"

The output above indicates there are four data files that are matched by the glob. Below we print the first five lines of the first file,

readLines(matched.files[1], n=5)
#> [1] "track type=bedGraph db=hg19 visibility=full graphType=points name=101K36monocyte description=\"McGill0101 H3K36me3 aligned read counts\""
#> [2] "chr10\t111456281\t111456338\t2"                                                                                                          
#> [3] "chr10\t111456338\t111456381\t1"                                                                                                          
#> [4] "chr10\t111456381\t111459312\t0"                                                                                                          
#> [5] "chr10\t111459312\t111459316\t5"

We can see from the output above that this data file has a header of meta-data (not column names) on the first line, whereas the other lines contain tab-delimited data. We can read it with fread, as long as we provide a couple of non-default arguments, as in the code below:

read.bedGraph <- function(f)data.table::fread(
  f, skip=1, col.names = c("chrom","start", "end", "count"))
read.bedGraph(matched.files[1])
#>        chrom     start       end count
#>       <char>     <int>     <int> <int>
#>    1:  chr10 111456281 111456338     2
#>    2:  chr10 111456338 111456381     1
#>    3:  chr10 111456381 111459312     0
#>    4:  chr10 111459312 111459316     5
#>    5:  chr10 111459316 111459409    10
#>   ---                                 
#> 7130:  chr10 111721272 111721347     4
#> 7131:  chr10 111721347 111721354     2
#> 7132:  chr10 111721354 111722459     0
#> 7133:  chr10 111722459 111722461     2
#> 7134:  chr10 111722461 111722555     4

The output above indicates the data has been correctly read into R as a table with four columns. To do that for each of the files, we use this custom READ function in the code below,

data.chunk.pattern <- list(
  data="H.*?",
  "/",
  chunk="[0-9]+", as.integer)
(data.chunk.dt <- nc::capture_first_glob(glob, data.chunk.pattern, READ=read.bedGraph))
#>                                                                            data chunk  chrom
#>                                                                          <char> <int> <char>
#>     1: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9  chr10
#>     2: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9  chr10
#>     3: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9  chr10
#>     4: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9  chr10
#>     5: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9  chr10
#>    ---                                                                                      
#> 20297:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2  chr22
#> 20298:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2  chr22
#> 20299:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2  chr22
#> 20300:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2  chr22
#> 20301:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2  chr22
#>            start       end count
#>            <int>     <int> <int>
#>     1: 111456281 111456338     2
#>     2: 111456338 111456381     1
#>     3: 111456381 111459312     0
#>     4: 111459312 111459316     5
#>     5: 111459316 111459409    10
#>    ---                          
#> 20297:  20689768  20689770     0
#> 20298:  20689770  20689870     1
#> 20299:  20689870  20689995     0
#> 20300:  20689995  20690080     1
#> 20301:  20690080  20691400     0

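The output above indicates the data files have been read into R as a table, with two additional columns (data and chunk), which correspond to the capture group names used in the regular expression pattern above. Note that the data column begins with a fragment of the temporary directory name (HVyz/Rinst...), because the pattern "H.*?" starts matching at the first H in the file path.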
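
A regular expression takes the leftmost possible match, even when its quantifier is non-greedy, which is why the data column can pick up an earlier H in the path. Below is a minimal base R sketch of that behavior, using a hypothetical path (demo.path) and a hypothetical helper (first.match). The stricter pattern "H[^/]*" is an assumption offered for comparison, not something taken from the nc documentation.

```r
# Hypothetical path, shaped like the matched files above,
# with a capital H inside the temporary directory name.
demo.path <- "/tmp/RtmpDHxyz/db/H3K36me3_AM_immune/9/counts/McGill0101.bedGraph.gz"
# Hypothetical helper: return the first match of pattern in demo.path.
first.match <- function(pattern){
  regmatches(demo.path, regexpr(pattern, demo.path, perl=TRUE))
}
# Non-greedy: starts at the first H, then extends until "/digits" is found.
loose <- first.match("H.*?/[0-9]+")
# Excluding slashes forces the match to stay within one directory name.
strict <- first.match("H[^/]*/[0-9]+")
print(loose)
print(strict)
```

With the stricter pattern, the match cannot cross a directory boundary, so only the directory immediately before the chunk number is matched.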

Why not base R?

We can also read these files using a for loop and base R functions (dirname, basename, rbind), but it takes a bit more code, as shown below.

base.df.list <- list()
for(file.csv in matched.files){
  file.df <- read.bedGraph(file.csv)
  counts.path <- dirname(file.csv)
  chunk.path <- dirname(counts.path)
  data.path <- dirname(chunk.path)
  base.df.list[[file.csv]] <- data.frame(
    data=basename(data.path),
    chunk=basename(chunk.path),
    file.df)
}
base.df <- do.call(rbind, base.df.list)
rownames(base.df) <- NULL
head(base.df)
#>                 data chunk chrom     start       end count
#> 1 H3K36me3_AM_immune     9 chr10 111456281 111456338     2
#> 2 H3K36me3_AM_immune     9 chr10 111456338 111456381     1
#> 3 H3K36me3_AM_immune     9 chr10 111456381 111459312     0
#> 4 H3K36me3_AM_immune     9 chr10 111459312 111459316     5
#> 5 H3K36me3_AM_immune     9 chr10 111459316 111459409    10
#> 6 H3K36me3_AM_immune     9 chr10 111459409 111459411     8
str(base.df)
#> 'data.frame':    20301 obs. of  6 variables:
#>  $ data : chr  "H3K36me3_AM_immune" "H3K36me3_AM_immune" "H3K36me3_AM_immune" "H3K36me3_AM_immune" ...
#>  $ chunk: chr  "9" "9" "9" "9" ...
#>  $ chrom: chr  "chr10" "chr10" "chr10" "chr10" ...
#>  $ start: int  111456281 111456338 111456381 111459312 111459316 111459409 111459411 111459415 111463412 111463512 ...
#>  $ end  : int  111456338 111456381 111459312 111459316 111459409 111459411 111459415 111463412 111463512 111466726 ...
#>  $ count: int  2 1 0 5 10 8 5 0 2 0 ...

The output above shows that we have read a data frame into R, with the same number of rows (20301) as the data table returned by nc::capture_first_glob. There are two differences: here the data column contains only the directory name, and chunk is a character column rather than an integer. nc::capture_first_glob should be preferred for simplicity when the files are regularly named. In contrast, this section shows how arbitrary R code can be used, so this approach should be preferred when the data in the file path cannot be captured using regular expressions.
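
A middle ground is to stay in base R but extract the path components with a regular expression, and to convert chunk to integer explicitly. The sketch below assumes a hypothetical path (demo.path) and hypothetical object names (parts, meta); it only illustrates regexec and regmatches, and is not part of nc.

```r
# Hypothetical path with the same directory layout as the matched files.
demo.path <- "/db/H3K36me3_AM_immune/9/counts/McGill0101.bedGraph.gz"
# regexec returns the whole match followed by each parenthesized group.
parts <- regmatches(
  demo.path,
  regexec("/([^/]+)/([0-9]+)/counts/", demo.path))[[1]]
meta <- data.frame(
  data=parts[2],
  chunk=as.integer(parts[3]), # explicit conversion, character otherwise
  stringsAsFactors=FALSE)
str(meta)
```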

Example 3: Hive partition file names

In the code below, we use arrow::write_dataset to write the same data to a set of CSV files, in directories with Hive-style names of the form variable=value,

arrow.available <- requireNamespace("arrow") && arrow::arrow_with_dataset()
#> Loading required namespace: arrow
if(arrow.available){
  path <- tempfile()
  arrow::write_dataset(
    dataset=data.chunk.dt,
    path=path,
    format="csv",
    partitioning=c("data","chunk"),
    max_rows_per_file=1000)
  hive.glob <- file.path(path, "*", "*", "*.csv")
  (hive.files <- Sys.glob(hive.glob))
}
#>  [1] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-0.csv" 
#>  [2] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-1.csv" 
#>  [3] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-2.csv" 
#>  [4] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-3.csv" 
#>  [5] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-4.csv" 
#>  [6] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-5.csv" 
#>  [7] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-6.csv" 
#>  [8] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune/chunk=9/part-7.csv" 
#>  [9] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-0.csv" 
#> [10] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-1.csv" 
#> [11] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-10.csv"
#> [12] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-11.csv"
#> [13] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-12.csv"
#> [14] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-2.csv" 
#> [15] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-3.csv" 
#> [16] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-4.csv" 
#> [17] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-5.csv" 
#> [18] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-6.csv" 
#> [19] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-7.csv" 
#> [20] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-8.csv" 
#> [21] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other/chunk=1/part-9.csv" 
#> [22] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_TDH_immune/chunk=9/part-0.csv" 
#> [23] "/tmp/RtmpaHnllH/file2d4a3c57bdf6fb/data=HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune/chunk=2/part-0.csv"

In the output above, we can see that there are regularly named files with three variables encoded in the file path (data, chunk, part). The slashes in the data values have been percent-encoded as %2F in the directory names. The code below reads one of the files back into R:

if(arrow.available){
  data.table::fread(hive.files[1])
}
#>        chrom     start       end count
#>       <char>     <int>     <int> <int>
#>    1:  chr10 111456281 111456338     2
#>    2:  chr10 111456338 111456381     1
#>    3:  chr10 111456381 111459312     0
#>    4:  chr10 111459312 111459316     5
#>    5:  chr10 111459316 111459409    10
#>   ---                                 
#>  996:  chr10 111619010 111619035     1
#>  997:  chr10 111619035 111619092     2
#>  998:  chr10 111619092 111619101     3
#>  999:  chr10 111619101 111619128     2
#> 1000:  chr10 111619128 111619129     3

The output above indicates that the file only has four columns (and is missing the variables which are encoded in the file path). In the code below, we read all those files back into R:

if(arrow.available){
  hive.pattern <- list(
    nc::field("data","=",".*?"),
    "/",
    nc::field("chunk","=",".*?", as.integer),
    "/",
    nc::field("part","-","[0-9]+", as.integer))
  print(hive.dt <- nc::capture_first_glob(hive.glob, hive.pattern))
  hive.dt[, .(rows=.N), keyby=.(data,chunk,part)]
}
#>                                                                                      data chunk
#>                                                                                    <char> <int>
#>     1: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9
#>     2: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9
#>     3: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9
#>     4: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9
#>     5: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9
#>    ---                                                                                         
#> 20297:  HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune     2
#> 20298:  HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune     2
#> 20299:  HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune     2
#> 20300:  HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune     2
#> 20301:  HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune     2
#>         part  chrom     start       end count
#>        <int> <char>     <int>     <int> <int>
#>     1:     0  chr10 111456281 111456338     2
#>     2:     0  chr10 111456338 111456381     1
#>     3:     0  chr10 111456381 111459312     0
#>     4:     0  chr10 111459312 111459316     5
#>     5:     0  chr10 111459316 111459409    10
#>    ---                                       
#> 20297:     0  chr22  20689768  20689770     0
#> 20298:     0  chr22  20689770  20689870     1
#> 20299:     0  chr22  20689870  20689995     0
#> 20300:     0  chr22  20689995  20690080     1
#> 20301:     0  chr22  20690080  20691400     0
#> Key: <data, chunk, part>
#>                                                                                   data chunk  part
#>                                                                                 <char> <int> <int>
#>  1: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     0
#>  2: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     1
#>  3: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     2
#>  4: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     3
#>  5: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     4
#>  6: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     5
#>  7: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     6
#>  8: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_AM_immune     9     7
#>  9: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     0
#> 10: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     1
#> 11: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     2
#> 12: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     3
#> 13: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     4
#> 14: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     5
#> 15: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     6
#> 16: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     7
#> 17: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     8
#> 18: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1     9
#> 19: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1    10
#> 20: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1    11
#> 21: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K36me3_TDH_other     1    12
#> 22: HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_TDH_immune     9     0
#> 23:  HVyz%2FRinst2d4985211eda79%2Fnc%2Fextdata%2Fchip-seq-chunk-db%2FH3K4me3_XJ_immune     2     0
#>                                                                                   data chunk  part
#>      rows
#>     <int>
#>  1:  1000
#>  2:  1000
#>  3:  1000
#>  4:  1000
#>  5:  1000
#>  6:  1000
#>  7:  1000
#>  8:   134
#>  9:  1000
#> 10:  1000
#> 11:  1000
#> 12:  1000
#> 13:  1000
#> 14:  1000
#> 15:  1000
#> 16:  1000
#> 17:  1000
#> 18:  1000
#> 19:  1000
#> 20:  1000
#> 21:   109
#> 22:   886
#> 23:   172
#>      rows

The output above indicates that we have successfully read the data back into R.
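
In the output above, the data values still contain %2F, which is the percent-encoding of a slash in the Hive directory names. The base R sketch below, which uses a hypothetical path (hive.path) and hypothetical object names (dirs, key.value, partition), shows one way that variable=value directory names could be split and decoded with utils::URLdecode. It is an illustration only, and not a feature of nc or arrow.

```r
# Hypothetical Hive-style relative path, with an encoded slash in data.
hive.path <- "data=a%2Fb%2FH3K4me3_XJ_immune/chunk=2/part-0.csv"
# Each directory level has the form variable=value.
dirs <- strsplit(dirname(hive.path), "/", fixed=TRUE)[[1]]
key.value <- do.call(rbind, strsplit(dirs, "=", fixed=TRUE))
partition <- stats::setNames(as.list(key.value[, 2]), key.value[, 1])
# Decode %2F back to a slash, and convert chunk to integer.
partition$data <- utils::URLdecode(partition$data)
partition$chunk <- as.integer(partition$chunk)
str(partition)
```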

Example 4: pattern with two more capture groups

In the code below, we read the same data files, with a more complex pattern that has two additional capture groups (name and id).

(count.dt <- nc::capture_first_glob(
  glob,
  data.chunk.pattern,
  "/counts/", 
  name=list("McGill", id="[0-9]+", as.integer),
  READ=read.bedGraph))
#>                                                                            data chunk       name
#>                                                                          <char> <int>     <char>
#>     1: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9 McGill0101
#>     2: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9 McGill0101
#>     3: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9 McGill0101
#>     4: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9 McGill0101
#>     5: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9 McGill0101
#>    ---                                                                                          
#> 20297:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2 McGill0024
#> 20298:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2 McGill0024
#> 20299:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2 McGill0024
#> 20300:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2 McGill0024
#> 20301:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2 McGill0024
#>           id  chrom     start       end count
#>        <int> <char>     <int>     <int> <int>
#>     1:   101  chr10 111456281 111456338     2
#>     2:   101  chr10 111456338 111456381     1
#>     3:   101  chr10 111456381 111459312     0
#>     4:   101  chr10 111459312 111459316     5
#>     5:   101  chr10 111459316 111459409    10
#>    ---                                       
#> 20297:    24  chr22  20689768  20689770     0
#> 20298:    24  chr22  20689770  20689870     1
#> 20299:    24  chr22  20689870  20689995     0
#> 20300:    24  chr22  20689995  20690080     1
#> 20301:    24  chr22  20690080  20691400     0
count.dt[, .(count=.N), by=.(data, chunk, name, id, chrom)]
#>                                                                        data chunk       name    id
#>                                                                      <char> <int>     <char> <int>
#> 1: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune     9 McGill0101   101
#> 2: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K36me3_TDH_other     1 McGill0019    19
#> 3: HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_TDH_immune     9 McGill0024    24
#> 4:  HVyz/Rinst2d4985211eda79/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune     2 McGill0024    24
#>     chrom count
#>    <char> <int>
#> 1:  chr10  7134
#> 2:  chr21 12109
#> 3:   chr1   886
#> 4:  chr22   172

The output above indicates that we have successfully read the data into R, with two additional columns (name and id). These data can be visualized using the code below,

if(require(ggplot2)){
  ggplot()+
    facet_wrap(~data+chunk+name+chrom, labeller=label_both, scales="free")+
    geom_step(aes(
      start/1e3, count),
      data=count.dt)
}

The plot above includes panel/facet titles which come from the variables that were encoded in the file paths.
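
The pattern name=list("McGill", id="[0-9]+", as.integer) used in this example nests one capture group (id) inside another (name). As a rough analogy, and not a description of how nc builds its regular expression, the base R sketch below uses nested parentheses on a hypothetical file name (file.name), with hypothetical object names (nested, name, id).

```r
# Hypothetical file name, like those in the counts directories.
file.name <- "McGill0101.bedGraph.gz"
# Outer group captures the whole sample name, inner group only the digits.
nested <- regmatches(
  file.name,
  regexec("(McGill([0-9]+))", file.name))[[1]]
name <- nested[2]
id <- as.integer(nested[3]) # conversion drops the leading zero
print(name)
print(id)
```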

Example 5: parsing non-CSV data

The following example demonstrates how non-CSV data may be parsed, using a custom READ function. Consider the vignette data files,

vignettes <- system.file("extdata/vignettes", package="nc", mustWork=TRUE)
(vglob <- paste0(vignettes, "/*.Rmd"))
#> [1] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/*.Rmd"
(vfiles <- Sys.glob(vglob))
#> [1] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v0-overview.Rmd"     
#> [2] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v1-capture-first.Rmd"
#> [3] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v2-capture-all.Rmd"  
#> [4] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v3-capture-melt.Rmd" 
#> [5] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v4-comparisons.Rmd"  
#> [6] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v5-helpers.Rmd"      
#> [7] "/tmp/RtmppDHVyz/Rinst2d4985211eda79/nc/extdata/vignettes/v6-engines.Rmd"

The output above includes the glob and the files it matches. Below we define a function for parsing one of those files,

non.greedy.lines <- list(
  list(".*\n"), "*?")
optional.name <- list(
  list(" ", chunk_name="[^,}]+"), "?")
chunk.pattern <- list(
  before=non.greedy.lines,
  "```\\{r",
  optional.name,
  parameters=".*",
  "\\}\n",
  code=non.greedy.lines,
  "```")
READ.vignette <- function(f)nc::capture_all_str(f, chunk.pattern)
str(READ.vignette(vfiles[1]))

#> Classes 'data.table' and 'data.frame':   7 obs. of  4 variables:
#>  $ before    : chr  "<!--\n%\\VignetteEngine{knitr::knitr}\n%\\VignetteIndexEntry{vignette 0: Overview}\n-->\n\n# Overview of nc functionality\n\n" "\n\nHere is an index of topics which are explained in the different\nvignettes, along with an overview of funct"| __truncated__ "\n\nA variant is doing the same thing, but with input\nsubjects coming from a data table/frame with character columns.\n\n" "\n\n## Capture all matches in a single subject\n  \n[Capture all](v2-capture-all.html) is for the situation whe"| __truncated__ ...
#>  $ chunk_name: chr  "setup" "" "" "" ...
#>  $ parameters: chr  ", include = FALSE" "" "" "" ...
#>  $ code      : chr  "knitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#>\"\n)\n" "subject.vec <- c(\n  \"chr10:213054000-213,055,000\",\n  \"chrM:111000\",\n  \"chr1:110-111 chr2:220-222\")\nnc"| __truncated__ "subject.dt <- data.table::data.table(\n  JobID = c(\"13937810_25\", \"14022192_1\"),\n  Elapsed = c(\"07:04:42\"| __truncated__ "nc::capture_all_str(\n  subject.vec, chrom=\"chr.*?\", \":\", chromStart=\"[0-9,]+\", as.integer)\n" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

The output above shows a data table with 7 rows, one for each code chunk defined in the vignette data file. We read all of the vignette files using the code below.

chunk.dt <- nc::capture_first_glob(
  vglob,
  "/v",
  vignette_number="[0-9]", as.integer,
  "-",
  vignette_name=".*?",
  "[.]Rmd",
  READ=READ.vignette
)[
, chunk_number := seq_along(chunk_name), by=vignette_number
]
chunk.dt[, .(
  vignette_number, vignette_name, chunk_number, chunk_name, 
  lines=nchar(code))]
#>      vignette_number vignette_name chunk_number chunk_name lines
#>                <int>        <char>        <int>     <char> <int>
#>   1:               0      overview            1      setup    61
#>   2:               0      overview            2              192
#>   3:               0      overview            3              314
#>   4:               0      overview            4               91
#>   5:               0      overview            5              242
#>  ---                                                            
#> 104:               6       engines            1      setup    61
#> 105:               6       engines            2              115
#> 106:               6       engines            3               67
#> 107:               6       engines            4              122
#> 108:               6       engines            5              320

The output above is a data table with one row for each chunk in each data file. Some columns (vignette_number and vignette_name) come from the file path, and others come from the data file contents, including chunk number, name, and the size of the code (the lines column is computed with nchar, so it counts characters rather than lines). The files also contain code which has been parsed and can be extracted via the code below, for example:

cat(chunk.dt$code[2])
#> subject.vec <- c(
#>   "chr10:213054000-213,055,000",
#>   "chrM:111000",
#>   "chr1:110-111 chr2:220-222")
#> nc::capture_first_vec(
#>   subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
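
For readers who want to see the chunk pattern as a single regular expression, the base R sketch below applies a comparable pattern to a small hypothetical R Markdown text (rmd.text). All object names here (fence, chunk.regex, whole, groups, chunks) are assumptions made for illustration, and the regular expression is a hand-written approximation of chunk.pattern (without the before group), not the one generated by nc. The sketch also computes both a character count and a line count for each chunk.

```r
# Three backticks, built with strrep so that this example is easy to embed.
fence <- strrep("`", 3)
# Hypothetical miniature R Markdown document with two code chunks.
rmd.text <- paste(c(
  "# Title",
  paste0(fence, "{r setup, include = FALSE}"),
  "x <- 1",
  fence,
  "Some prose.",
  paste0(fence, "{r}"),
  "y <- 2",
  "z <- 3",
  fence,
  ""), collapse="\n")
chunk.regex <- paste0(
  fence, "\\{r",
  "(?: ([^,}]+))?", # optional chunk name
  "(.*)\\}\n",      # remaining chunk parameters
  "((?:.*\n)*?)",   # non-greedy code lines
  fence)
# Find every chunk, then extract the groups from each one.
whole <- regmatches(
  rmd.text, gregexpr(chunk.regex, rmd.text, perl=TRUE))[[1]]
groups <- regmatches(whole, regexec(chunk.regex, whole, perl=TRUE))
code <- sapply(groups, "[", 4)
chunks <- data.frame(
  chunk_name=sapply(groups, "[", 2),
  parameters=sapply(groups, "[", 3),
  code=code,
  chars=nchar(code),                              # number of characters
  lines=lengths(strsplit(code, "\n", fixed=TRUE)), # number of lines
  stringsAsFactors=FALSE)
str(chunks)
```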