Guide to EGRETci 2.0 enhancements

Robert M. Hirsch and Laura De Cicco

2018-06-18

1 Introduction

This vignette documents a set of enhancements of the EGRETci software package. EGRET is “Exploration and Graphics for RivEr Trend” which has been developed my Robert M. Hirsch and Laura DeCicco of the U.S. Geological Survey. The EGRETci package contains functions that can be used to evaluate the uncertainty associated with results generated by the EGRET code. This document, on the EGRETci 2.0 enhancements assumes that the reader already has a good understanding of WRTDS (Weighted Regressions on Time Discharge and Season), and the EGRET 2.0 package and the EGRET 3.0 enhancements (“Guide to EGRET 3.0 enhancements”) as well as the documentation “Introduction to EGRET Confidence Intervals”"). The two releases EGRET 3.0 and EGRETci 2.0 are tightly linked in terms of naming conventions and the sharing of various objects between them. Brief reference is made to these EGRETci functions in the documentation of the EGRET 3.0 enhancements.

The releases of EGRET 3.0 add some new flexibility to the WRTDS method. The flexibility that the new version provides are of two kinds. One is an ability to partition the sample data into two periods, one before and one after, a change that happened in the watershed that we believe to have had an important, sudden, but lasting impact on water quality. The other is the ability to relax the assumption of stationarity of streamflow in the flow normalization process. As shorthand we refer to the first of these enhancements as the “wall”, and the second as “Generalized Flow Normalization.” These are discussed in detail in “Guide to EGRET 3.0 enhancements” and are not repeated here.

This document discusses the motivations and general concepts behind these enhancements, and then move to instructions for implementation. It is assumed that the reader already has a good working knowledge of WRTDS, EGRET and EGRETci. Hirsch, Moyer and Archfield, 2010, Hirsch and DeCicco, 2015, Hirsch, Archfield and DeCicco, 2015 and the Introduction to the EGRET package, and Introduction to EGRET Confidence Intervals.

EGRETci 2.0 is designed so that it will produce exactly the same outputs as EGRETci 1.0 version did using the same set of commands that would have been used with one notable exception. Previous versions of EGRETci did not require a “startSeed” argument when running the confidence intervals in parallel. The reason for the change is to assure that we can now reproduce all bootstrapping calculations. It is important to adjust any parallel code to include that argument as follows (specifically the “startSeed = n” argument):

nCores <- detectCores() - coreOut
cl <- makeCluster(nCores)
registerDoParallel(cl)
repAnnual <- foreach(n = 1:nBoot,.packages=c('EGRETci')) %dopar% {
   annualResults <- bootAnnual(eList, 
                               blockLength,
                               startSeed = n)  
}
stopCluster(cl)    

2 Problem set up

There are three distinct types of problem set-ups that are possible in the new formulation and each of them has its own distinct workflow and outputs. They are known as Pairs, Series, and Groups. What do these terms mean here? Pairs is for the comparison of any two years in the study period, Series is to describe the entire study period to produce graphs of the change as a function of time, Groups is for the comparison of any two groups of contiguous years. The appropriate use of each of these three set-ups is described in detail in the “Guide to EGRET 3.0 enhancements.”

3 runPairsBoot: The bootstrap uncertainty analysis for runPairs results

The function that does the uncertainty analysis for determining the change between any pair of years is called runPairsBoot. It is very similar to the wBT function that runs the WRTDS Bootstrap Test in EGRETci 1.0. It differs from wBT in that it runs a specific number of bootstrap replicates, unlike the wBT approach that will stop running replicates based on the status of the test statistics along the way.

The call to runPairsBoot is very simple, because it gets virtually all of the information about how the analysis is to be set up from information in the attributes of pairResults. The only arguments that are required are the eList and pairResults and three more arguments specifically related to the bootstrap process. These are described here:

nBoot This is the number of bootstrap replicates to run. It has a default value of 100. A value of 100 should give rough approximation of the correct likelihood of the trend direction. For example, if the bootstrap results were upwards trends in 98 of 100 replicates, then we could say that we have about 95% confidence that the true likelihood that the trend is upwards is between 0.945 and 0.998 (rather strong evidence of an upwards trend). For purposes of a published study going to nBoot = 500 may be appropriate. For example if we had 490 upwards trends out of 500 replicates we could say with 95% confidence that the likelihood that the true trend is upwards is between 0.966 and 0.991. To see if it is even worthwhile to do a large number of replicates one could use a very small number of replicates, say nBoot = 10 and if there is a fairly even mix of increases and decreases we can dispense with the high number of replicates and simply conclude that it is about as likely up as it is down. In a few cases some of the bootstrap replicates have results that do not converge and they generate an error message but no results for that replicate. To deal with that, the function is designed to run more replicates than nBoot but to quit when it reaches nBoot

startSeed This is the seed for the random number generator, which is used to create the random bootstrap replicates of the data. It has a default value but any integer will work. Knowing the startSeed value is vital to reproducing the results. If a different value of startSeed is used in a later run it will produce slightly different results, but with nBoot of 200 or more these differences should be trivially small.

blockLength This is the block length for the block bootstrap resampling (see Hirsch, Archfield and DeCicco, 2015 for an explanation). This argument has a default of 200 so it shouldn’t need to be specified.

Here we will run a trivially small example of runPairsBoot based on the problem set up in the second example of runPairs in the EGRET 3.0 enhancements User Guide.

library(EGRET)
library(EGRETci)
eList <- Choptank_eList
pairResults2 <- runPairs(eList, year1 = 1985, year2 = 2010, 
                         windowSide = 7, flowBreak = TRUE, 
                         Q1EndDate = "1995-05-31", wall = TRUE,
                         sample1EndDate = "1995-05-31", 
                         QStartDate = "1979-10-01", 
                         QEndDate = "2010-09-30", 
                         paStart = 4, paLong = 5)
bootPairOut2 <- runPairsBoot(eList, 
                             pairResults2, 
                             nBoot = 10) 
Show in New Window

 iBoot, xConc and xFlux 1 0.4048814 0.03866322
 iBoot, xConc and xFlux 2 0.4186308 0.04813132
 iBoot, xConc and xFlux 3 0.4114679 0.04757094
 iBoot, xConc and xFlux 4 0.4314246 0.04415677
 iBoot, xConc and xFlux 5 0.3456514 0.03323089
 iBoot, xConc and xFlux 6 0.4639338 0.04792485
 iBoot, xConc and xFlux 7 0.3480087 0.03212271
 iBoot, xConc and xFlux 8 0.3694464 0.03815445
 iBoot, xConc and xFlux 9 0.3502027 0.04509857
 iBoot, xConc and xFlux 10 0.4180412 0.04559656
   Choptank River 
   Inorganic nitrogen (nitrate and nitrite)
   Season Consisting of Apr May Jun Jul Aug 


 Change estimates are for  2010  minus  1985
 Sample data set was partitioned with a wall at  1995-05-31 



Should we reject Ho that Flow Normalized Concentration Trend = 0 ? Reject Ho
 best estimate of change in concentration is   0.412 mg/L
  Lower and Upper 90% CIs    0.346    0.464
 also 95% CIs    0.346    0.464 
 and 50% CIs    0.350    0.422
 approximate two-sided p-value for Conc      0.18
* Note p-value should be considered to be < stated value
 Likelihood that Flow Normalized Concentration is trending up = 0.955  is trending down = 0.0455

Should we reject Ho that Flow Normalized Flux Trend = 0 ? Reject Ho
 best estimate of change in flux is   0.0456 10^6 kg/year
  Lower and Upper 90% CIs   0.0321   0.0481
 also 95% CIs   0.0321   0.0481 
 and 50% CIs   0.0369   0.0477
 approximate two-sided p-value for Flux      0.18
* Note p-value should be considered to be < stated value
 Likelihood that Flow Normalized Flux is trending up = 0.955  is trending down = 0.0455

 Upward trend in concentration is highly likely 
 Upward trend in flux is highly likely
 Downward trend in concentration is highly unlikely 
 Downward trend in flux is highly unlikely

The output is pretty self-explanatory and the latter part of it is identical to the summary information provided by wBT.

The object returned by runPairsBoot is virtually the same as the eBoot object returned by wBT. It contains the data frame called bootOut which contains all the results of the test that are shown in the console. One new feature here is bootOut$nBootGood. Because there is a small chance that the bootstrap results will contain a small number of iterations that fail to complete it may end up with a number of valid replicates that is less than nBoot. The number of usable replicates is nBootGood. The other parts of the returned object are the same as those returned from wBT.

One of the outputs we can use to describe the results of the test is the histogram of percentage changes. These can be done with the plotHistogramTrend function just as they have been in the EGRETci 1.0.3 software. The only difference in the call to the function is that we don’t use caseSetUp in runPairs so that argument is set to caseSetUp = NA.

To call this function for a runPairsBoot result would look like this:

plotHistogramTrend(eList, eBoot, caseSetUp = NA, 
                   flux = TRUE, xMin = NA, xMax = NA, 
                   xStep = NA, printTitle = TRUE, 
                   cex.main = 1.1, cex.axis = 1.1, 
                   cex.lab = 1.1, col.fill = "grey", ...)

The only two arguments required are eList, and eBoot which represents the output from runPairsBoot.

Here are the calls to plotHistogramTrend first for flux, and then for concentration. [We have substituted in a set of runPairsBoot outputs for which nBoot = 100 rather than using the one shown above.]

bootPairOut2 <- runPairsBoot(eList, pairResults2, nBoot = 100)

Here are the calls to plotHistogramTrend first for flux, and then for concentration.

plotHistogramTrend(eList,bootPairOut2, caseSetUp = NA)

plotHistogramTrend(eList,bootPairOut2, caseSetUp = NA, flux = FALSE)

So, what the results clearly show us is that flux has clearly increased and probably by a magnitude of a few hundred percent and concentration has clearly increased probably by an amount between about 100 % to 200 %.

4 The bootstrap uncertainty analysis for runSeries

The purpose of Series analysis is to create a time series of flow-normalized concentrations and flow-normalized flux values. The function that does this is runSeries in the EGRET package.

In the case of runSeries we can do uncertainty analysis which provides us with confidence intervals around the flow normalized time series that we estimated from running the runSeries function. In this case we actually use the same functions that were used in EGRETci 1.0.3, but they have been modified here to accommodate all of the flexible features of runSeries (but they run exactly as they did before if they were just working off of results from modelEstimation).

This is how it is called:

CIAnnualResults <- ciCalculations(eList, 
                                  startSeed = 494817, 
                                  verbose = TRUE, ...)

The eList that is passed to the function should be a version of eList that has already been run through runSeries. So, for the example run shown above the eList argument would be eListOut. That version contains all the estimated values as well as all of the arguments that were used in the computation (related to the wall and flowBreak etc.). There is no need to specify startSeed, the default will be fine (unless one wants to explore how the results might change in another run of the function). To avoid seeing all the progress indicators on the screen one can set verbose = FALSE, but it may be desirable just to see that the job is continuing to run by setting verbose = TRUE. It can take a considerable amount of computer time.

The ... in the call specifically relate to some other bootstrap-related parameters that the user might wish to set. They can be left out of the call, and the program with interactively request the information, or they can be set directly in the call. We will set up to run the job specifying them without the interactive part (as would be done in a batch implementation).

We will run the bootstrap analysis here. We will use the output of the last run of runSeries example in the EGRET enhancement vignette.

eList <- Choptank_eList
eListOut <- runSeries(eList, windowSide = 7, verbose = FALSE)
CIAnnualResults <- ciCalculations(eListOut, 
                                  verbose = FALSE, 
                                  nBoot = 100,
                                  blockLength = 200, 
                                  widthCI = 90)
CIAnnualResults <- ciCalculations(eListOut, 
                                  verbose = FALSE, 
                                  nBoot = 100,
                                  blockLength = 200, 
                                  widthCI = 90)

nBoot is the number of bootstrap replicates to be run. A result that is suitable for publication should probably have nBoot at least 100 and preferably 200. blockLength is typically set to 200. widthCI is the width of the confidence interval. If widthCI = 90 that means that the confidence intervals that will be drawn are the 5th and 95th percentiles.

We can visualize the results by using the plotConcHistBoot and plotFluxHistBoot functions (which were a part of the EGRETci 1.0.3 code). Running them with results from runSeries requires that we use the eList that is produced by the runSeries function. So, following the results we saw from runSeries we would obtain the confidence interval plots as follows. (Note that there are several other arguments to these functions that are the same as those used in plotConcHist and plotFluxHist).

plotConcHistBoot(eListOut, CIAnnualResults)

plotFluxHistBoot(eListOut, CIAnnualResults)

Looking at the output, particularly for plotFluxHistBoot, suggests that the windowSide argument is probably too short. The plot is simply too jagged. A better choice would probably have been to set windowSide = 11. We have included this example here to show the problem of having windowSide be too small. The general conclusions about the overall increase over the more than 30 year record would remain the same and the width of the confidence intervals wouldn’t change much if a better windowSide argument had been used.

5 runGroupsBoot: The bootstrap uncertainty analysis for runGroups results

The function that does the uncertainty analysis for determining the change between two groups of years is called runGroupsBoot. The process is virtually identical to what is used for runPairsBoot, so much of the detail will be left out of this discussion. The call to runGroupsBoot is very simple, because it gets virtually all of the information about how the analysis is to be set up from information in the attributes of groupResults. The only arguments that are required are the eList and groupResults and three more arguments specifically related to the bootstrap process:

nBoot This is the number of bootstrap replicates to run. It has a default value of 100. In a few cases some of the bootstrap replicates have results that do not converge and they generate an error message but no results for that replicate. To deal with that the function is designed to run more replicates than nBoot but to quit when it reaches nBoot

startSeed This is the seed for the random number generator, which is used to create the random bootstrap resamples of the data. It has a default value but any integer will work.

blockLength This is the block length for the block bootstrap resampling. This argument has a default of 200 so it shouldn’t need to be specified.

Here we will run runGroupsBoot based on the problem set-up in the example of runGroups in the EGRET 3.0 Enhancements User Guide.

eList <- Choptank_eList
groupResults <- runGroups(eList, 
                          group1firstYear = 1995, group1lastYear = 2004, 
                          group2firstYear = 2005, group2lastYear = 2010,
                          windowSide = 7, wall = TRUE, 
                          sample1EndDate = "2004-10-30", 
                          paStart = 4, paLong = 2, verbose = FALSE)
bootGroupsOut <- runGroupsBoot(eList, groupResults, nBoot = 100)


 iBoot, xConc and xFlux 1 0.1112342 0.00210587
 iBoot, xConc and xFlux 2 0.2313755 0.01785959
 iBoot, xConc and xFlux 3 0.1940411 0.01241572
 iBoot, xConc and xFlux 4 0.1362666 0.01586453
 ...
 iBoot, xConc and xFlux 98 0.2323236 0.02383054
 iBoot, xConc and xFlux 99 0.1258737 0.01037859
 iBoot, xConc and xFlux 100 0.2255112 0.01616338

   Choptank River 
   Inorganic nitrogen (nitrate and nitrite)
   Season Consisting of Apr May 

 Change estimates for
 average of 2005  through 2010  minus average of 1995  through 2004 

 Sample data set was partitioned with a wall at  2004-10-30 



Should we reject Ho that Flow Normalized Concentration Trend = 0 ? Reject Ho
 best estimate of change in concentration is   0.153 mg/L
  Lower and Upper 90% CIs  0.03306  0.25923
 also 95% CIs  0.00971  0.29603 
 and 50% CIs  0.09818  0.19353
 approximate two-sided p-value for Conc     0.033
 Likelihood that Flow Normalized Concentration is trending up = 0.985  is trending down = 0.0149

Should we reject Ho that Flow Normalized Flux Trend = 0 ? Do Not Reject Ho
 best estimate of change in flux is  0.00931 10^6 kg/year
  Lower and Upper 90% CIs -0.014505  0.026523
 also 95% CIs -0.017796  0.033343 
 and 50% CIs  0.000622  0.016089
 approximate two-sided p-value for Flux      0.47
 Likelihood that Flow Normalized Flux is trending up = 0.767  is trending down = 0.233

 Upward trend in concentration is highly likely 
 Upward trend in flux is likely
 Downward trend in concentration is highly unlikely 
 Downward trend in flux is unlikely

The output is pretty self-explanatory and virtually the same as that which is generated by runPairsBoot.

The object returned by runGroupsBoot is virtually the same as the eBoot object returned by wBT. The object returned, called bootGroupsOut contains all the results of the test that are shown in the console.

We cam visualize the results of the test as a the histogram of percentage changes. These can be done with the plotHistogramTrend function.

To call this function for a runGroupsBoot result would look like this:

plotHistogramTrend(eList, eBoot, flux = TRUE, 
                         xMin = NA, xMax = NA, xStep = NA, 
                         printTitle = TRUE, cex.main = 1.1, 
                         cex.axis = 1.1, cex.lab = 1.1, 
                         col.fill = "grey")

The only two arguments required are eList, and eBoot which represents the output from runGroupsBoot.

Here are the calls to plotHistogramTrend first for flux, and then for concentration. The xMin, xMax, and xStep arguments shown below were selected based on first viewing the results using the default values and then searching for a set of values for the three arguments that provided a good representation of the distribution.

plotHistogramTrend(eList, bootGroupsOut, 
                   xMin = -30, xMax = 40, xStep = 10)

plotHistogramTrend(eList, bootGroupsOut, 
                   flux=FALSE, xMin = -10,
                   xMax = 50, xStep = 5)

What the results clearly show us is that both flux and concentration indicate that it is somewhat more likely to have been an increase rather than a decrease between the two groups of years. Our best estimate is an increase is a little less than about + 10%. For both flux and concentration the trend is fairly likely to be in the range from -20% to about +30%.