NCEA Data Integration Exploration Tool

Overview

This application illustrates the implications of analysing two different data sets in isolation and then together in an integrated approach. The two data sets can have different properties, such as the volume and variability of the data. The ultimate goal is assumed to be the detection of a trend over time. The tool allows the user to compare the estimated trend, and the variability around it, obtained from each individual data source and from an integrated approach.

The tool is based on simulated data in which the trend of interest is known and present. The user can simulate two different data sets according to different design options and input parameters. The monitoring design options specify the number of sites, the intensity of sampling and the length of time over which monitoring is assumed to have taken place. This flexibility allows the user to specify many different potential designs reflecting real data sets. Alternatively, the user can choose from a list of presets, each of which automatically defines the monitoring design options. The input parameters define the distribution of the data of interest. Here, the user specifies the type of data of interest, the mean values and the different sources of variability. Existing preset options, based on real data, are available. The two hypothetical data sets may reflect, for example, a national, structured survey and a citizen science scheme, or a case where supplementary, localized monitoring has occurred. The schematic below shows the different data sets, how they are simulated and the different models fitted.
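As a rough illustration of how design options and input parameters combine into a simulated data set, here is a minimal Python sketch. It is not the tool's own code: the names Design, Inputs, PRESETS and simulate are hypothetical, the numeric values for the mean and the two standard deviations are invented for illustration, and normally distributed data with a random site effect are assumed (the tool lets the user choose the type of data).

```python
import random
from dataclasses import dataclass


@dataclass
class Design:
    """Monitoring design options (hypothetical structure)."""
    n_sites: int
    samples_per_year: int
    n_years: int


@dataclass
class Inputs:
    """Parameters of the data distribution (illustrative values only)."""
    mean: float
    trend: float        # change in the mean per year
    site_sd: float      # between-site variability
    residual_sd: float  # variability of individual observations


# Hypothetical presets that fix the design options in one step.
PRESETS = {
    "structured_survey": Design(n_sites=100, samples_per_year=1, n_years=10),
    "supplementary_scheme": Design(n_sites=50, samples_per_year=1, n_years=5),
}


def simulate(design, inputs, seed=None):
    """Return a list of (site, year, value) records with a known linear trend."""
    rng = random.Random(seed)
    records = []
    for site in range(design.n_sites):
        site_effect = rng.gauss(0.0, inputs.site_sd)
        for year in range(design.n_years):
            for _ in range(design.samples_per_year):
                noise = rng.gauss(0.0, inputs.residual_sd)
                value = inputs.mean + site_effect + inputs.trend * year + noise
                records.append((site, year, value))
    return records


inputs = Inputs(mean=0.1, trend=0.005, site_sd=0.02, residual_sd=0.03)
source_1 = simulate(PRESETS["structured_survey"], inputs, seed=1)
source_2 = simulate(PRESETS["supplementary_scheme"], inputs, seed=2)
print(len(source_1), len(source_2))
```

Each data set contains one record per site, year and sample, so the two designs above differ in volume by a factor of four while sharing the same underlying trend.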

The tool includes an option to extract the input parameters from an existing data set. The user can simply upload an existing data set, and the input parameters that specify the data distribution will be estimated and returned.

Simulated data and model results

A plot of the data simulated according to the specified design and input parameters.

Comparison of fitted models and inference

The table below shows the true trend simulated and the estimated trend from 3 different models. The first model is based on Data Source 1 only; the second model is based on Data Source 2 only; and the third model is based on an integrated assessment using both Data Source 1 and Data Source 2.

The table below shows the variance of the trend estimates from the 3 different models.

The plots below show the estimated trend (black line) and the variance around that trend over time (grey region), together with the known, true trend (red line).

Trend estimated from Data Source 1

Trend estimated from Data Source 2

Trend estimated from integrated model using both Data Source 1 and Data Source 2

Run through demonstration

In this brief example, we hypothesize that two data sources collect data on phosphorus concentrations in the freshwater environment: one data source with 100 sites monitored every year for 10 years, and a secondary source with 50 sites monitored every year for 5 years. The assumed true trend is 0.005, representing a change in phosphorus concentration of 0.005 each year. It is this trend that we attempt to recover from the simulated data using the fitted models.

First, we parameterize the first data source and specify 10 years' worth of annual data from 100 sites. We also use the inbuilt presets to assume that the variability of the phosphorus data is the same as the observed variability in the river surveillance network. This is specified as shown below:

We then assume that the data from the second data source are available for only 5 years, with annual data from 50 sites. As with Data Source 1, we assume that the phosphorus data can be parameterized according to the river surveillance network, and that the data undergo the same magnitude of change of 0.005 per year. This is all parameterized as shown below.

The resulting data is shown in the boxplot and one can see how the observations differ between the two data sets. The different volume of data can clearly be seen.

Finally, we run the models by pressing the 'Fit models' button. This produces the plots shown below. Here we can see that the Data Source 1 data provide a good fit to the truth. The Data Source 2 data seem to estimate the overall trend reasonably well, but with much greater variance. The integrated model performs better than either single-source model, with the correct trend estimated without any bias and with little uncertainty.
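The comparison in this demonstration can be imitated outside the tool with a short Python sketch. Several assumptions apply: the models fitted by the tool are not described here, so ordinary least squares on year is used as a stand-in, and the "integrated" fit is simply a regression on the pooled records. The mean and standard deviations are invented values, the five years of Data Source 2 are assumed to coincide with the first five years of Data Source 1, and the function names are hypothetical.

```python
import random


def simulate(n_sites, n_years, trend, mean, site_sd, residual_sd, rng):
    """Return (year, value) pairs: one observation per site per year."""
    data = []
    for _ in range(n_sites):
        site_effect = rng.gauss(0.0, site_sd)
        for year in range(n_years):
            noise = rng.gauss(0.0, residual_sd)
            data.append((year, mean + site_effect + trend * year + noise))
    return data


def fit_trend(data):
    """Least-squares slope of value on year, with its standard error."""
    n = len(data)
    x_bar = sum(x for x, _ in data) / n
    y_bar = sum(y for _, y in data) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in data)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in data)
    std_error = (rss / (n - 2) / sxx) ** 0.5
    return slope, std_error


TRUE_TREND = 0.005
rng = random.Random(42)
# Illustrative (not real) mean and variability values.
source_1 = simulate(100, 10, TRUE_TREND, 0.1, 0.02, 0.03, rng)
source_2 = simulate(50, 5, TRUE_TREND, 0.1, 0.02, 0.03, rng)

results = {
    "Data Source 1": fit_trend(source_1),
    "Data Source 2": fit_trend(source_2),
    "Integrated": fit_trend(source_1 + source_2),  # pooled records
}
for name, (slope, std_error) in results.items():
    print(f"{name}: trend = {slope:.4f}, standard error = {std_error:.4f}")
```

Under these assumptions the pooled fit should typically show the smallest standard error and Data Source 2 the largest, mirroring the pattern described above. A model that accounts for site-level variation explicitly would be a closer match to a real analysis.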

Extract parameters from data frame

This part of the tool allows a user to submit a dataset and extract the relevant parameters required to simulate data on the 'specify values' tab.
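To give a sense of what such an extraction might involve, the Python sketch below estimates a mean, a between-site standard deviation and a residual standard deviation from (site, value) records using a simple method-of-moments calculation. This is an assumption about the general approach, not the tool's actual procedure. The function name extract_parameters is hypothetical, a balanced design is assumed (the same number of observations at every site, at least two sites and two observations per site), and any trend in the data is ignored.

```python
from collections import defaultdict
from statistics import mean, variance


def extract_parameters(records):
    """Estimate mean, between-site SD and residual SD from (site, value) pairs."""
    by_site = defaultdict(list)
    for site, value in records:
        by_site[site].append(value)

    n_per_site = len(next(iter(by_site.values())))
    site_means = [mean(values) for values in by_site.values()]

    # Average within-site variance estimates the residual variance.
    within_var = mean(variance(values) for values in by_site.values())

    # The variance of site means equals the between-site variance plus
    # within_var / n_per_site, so subtract that term (floored at zero).
    between_var = max(variance(site_means) - within_var / n_per_site, 0.0)

    return {
        "mean": mean(site_means),
        "site_sd": between_var ** 0.5,
        "residual_sd": within_var ** 0.5,
    }


# Tiny made-up data set: two sites with two observations each.
example = [("A", 1.0), ("A", 3.0), ("B", 5.0), ("B", 7.0)]
params = extract_parameters(example)
print(params)
```

The returned values play the role of the input parameters that would then be entered on the 'specify values' tab.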

A video demonstration of how to use this tool