This package provides utilities for working with the UK Biobank NMR metabolomics data.
The functions in this package fall into three groups: data extraction, removal of technical variation, and computation of biomarker ratios.
All functions are designed to be applied directly to the UK Biobank phenotype data on the UK Biobank Research Analysis Platform after the NMR metabolomics fields have been extracted using the Table Exporter tool.
This package also works with datasets predating the Research Analysis Platform, which have been extracted using the ukbconv tool or processed with the ukbtools R package.
This package also provides a data.frame of biomarker information, loaded as nmr_info, and a data.frame of sample processing information, loaded as sample_qc_info. See help("nmr_info") and help("sample_qc_info") for details on column contents.
If using this package to remove additional technical variation or compute additional biomarker ratios, please cite:
Ritchie S. C. et al., Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants, Sci Data 10 64 (2023). doi: 10.1038/s41597-023-01949-y.
Note that several updates have been made to the package and algorithm based on subsequent releases of NMR metabolomic biomarker data, which have expanded to cover all ~500,000 UK Biobank participants with blood samples. These updates are described in more detail in the Algorithms for removing technical variation section below. The impact of technical variation and its removal in the full UK Biobank data is also shown in the Technical variation in the full UK Biobank NMR data section below.
Citation is appreciated, but not expected, if simply using the data extraction functions for convenience to extract the NMR biomarker data and associated information as-is into analysis-ready data.frames.
Three data extraction functions are supplied by this package for extracting the UK Biobank NMR data and associated processing information and quality control tags into an analysis-ready format from the CSV or TSV files of field data saved by the Table Exporter tool on the UK Biobank Research Analysis Platform.
Exported field data saved by the Table Exporter has column names following a naming scheme with the format “p
The extract_biomarkers() function extracts from this raw field data a data.frame that contains one column per NMR biomarker, labelled with a short, descriptive (and analysis-friendly) column name. Each row of the extracted data.frame corresponds to a single observation for a UK Biobank participant at either the baseline assessment (2006-2010) or the first repeat assessment (2012-2013): rows are uniquely identifiable by the combination of their "eid" and "visit_index" columns. The "eid" column contains the project-specific identifier for each participant, and the "visit_index" column contains either a 0 or a 1 depending on whether the biomarker was quantified from blood samples taken at the baseline assessment (visit_index == 0) or at the first repeat assessment (visit_index == 1). Mappings between biomarker column names and UK Biobank field identifiers, along with detailed descriptions of each biomarker, are provided in the nmr_info data.frame that is bundled with this package.
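As a rough illustration of the row structure described above, the sketch below uses a small hand-made data.frame in place of real extract_biomarkers() output. The values and the biomarker_a column name are hypothetical and chosen only for the example.

```r
# Toy stand-in for the output of extract_biomarkers(); "biomarker_a" is a
# hypothetical column name, not one of the package's biomarker names.
nmr <- data.frame(
  eid = c(1001, 1001, 1002),
  visit_index = c(0, 1, 0),
  biomarker_a = c(1.2, 1.4, 0.9)
)

# Each (eid, visit_index) pair should appear exactly once
stopifnot(!any(duplicated(nmr[, c("eid", "visit_index")])))

# Keep only measurements from the baseline assessment
baseline <- nmr[nmr$visit_index == 0, ]
```

The same subsetting applies to the data.frames of QC flags, since they share the "eid" and "visit_index" columns.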
The extract_biomarker_qc_flags() function similarly returns a data.frame with one column for each biomarker, with observations containing the quality control flags for the measurement of the respective biomarker for the UK Biobank participant and timepoint indicated in the "eid" and "visit_index" columns. Observations with no quality control flags contain NA. Where there are multiple quality control flags, the individual flags are separated by "; ".
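Because multiple flags share a single cell, it can be convenient to split them before tabulating. The sketch below shows one way to do this in base R on a toy character vector. The flag text is hypothetical and is not actual UK Biobank QC wording.

```r
# Toy stand-in for one biomarker column returned by extract_biomarker_qc_flags().
# The flag text below is hypothetical.
flags <- c(NA, "Flag A", "Flag A; Flag B")

# strsplit() returns NA for NA input, so unflagged observations stay distinguishable
flag_list <- strsplit(flags, "; ", fixed = TRUE)

# Number of flags per observation (0 where no flag was recorded)
n_flags <- vapply(
  flag_list,
  function(x) if (all(is.na(x))) 0L else length(x),
  integer(1)
)
```

Here fixed = TRUE treats "; " as a literal separator rather than a regular expression.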
The extract_sample_qc_flags() function similarly returns a data.frame with one column for each of the NMR sample processing flags and quality control flags, with one row per sample for the respective UK Biobank participant ("eid") and timepoint ("visit_index"). Mappings between sample processing column names and UK Biobank field identifiers, along with detailed descriptions of each sample processing flag, are provided in the sample_qc_info data.frame that is bundled with this package.
An example workflow for extracting these data and saving them for later use:
library(ukbnmr)
library(data.table) # for fast reading and writing of csv files using fread() and fwrite()
# Load exported field data saved by the Table Exporter tool on the RAP
exported <- fread("path/to/exported_ukbiobank_phenotype_data.csv")
nmr <- extract_biomarkers(exported)
biomarker_qc_flags <- extract_biomarker_qc_flags(exported)
sample_qc_flags <- extract_sample_qc_flags(exported)
fwrite(nmr, file="path/to/nmr_biomarker_data.csv")
fwrite(biomarker_qc_flags, file="path/to/nmr_biomarker_qc_flags.csv")
fwrite(sample_qc_flags, file="path/to/nmr_sample_qc_flags.csv")
Remember to use the dx upload tool provided by the UK Biobank Research Analysis Platform to save these files to your persistent project storage for later use.
You can try this out using the test dataset bundled with the ukbnmr package:
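A minimal sketch of such a trial run, assuming the ukbnmr package is installed. It uses the bundled test_data object and the three extraction functions described above. The dim() calls are only a quick way to inspect the results and are not part of the package workflow.

```r
library(ukbnmr)

# Bundled test dataset; see help("test_data") for details
exported <- ukbnmr::test_data

nmr <- extract_biomarkers(exported)
biomarker_qc_flags <- extract_biomarker_qc_flags(exported)
sample_qc_flags <- extract_sample_qc_flags(exported)

# Quick look at the shape of each extracted data.frame
dim(nmr)
dim(biomarker_qc_flags)
dim(sample_qc_flags)
```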
The remove_technical_variation() function removes additional technical variation present in the UK Biobank NMR data (see section below for details), returning a list containing the corrected NMR biomarker data, biomarker QC flags, and sample processing information in analysis-ready data.frames.
Note that no prefiltering of samples or columns should be performed before running this function: the algorithms used for removing technical variation expect all the data to be present.
This function takes 40 minutes to run and requires at least 32 GB of RAM, so you will want to save the output rather than incorporate this function into your analysis scripts.
An example workflow for using this function and saving the output for loading into future R sessions or other programs:
library(ukbnmr)
library(data.table) # for fast reading and writing of csv files using fread() and fwrite()
# Load exported field data saved by the Table Exporter tool on the RAP
exported <- fread("path/to/exported_ukbiobank_phenotype_data.csv")
processed <- remove_technical_variation(exported)
fwrite(processed$biomarkers, file="path/to/nmr_biomarker_data.csv")
fwrite(processed$biomarker_qc_flags, file="path/to/nmr_biomarker_qc_flags.csv")
fwrite(processed$sample_processing, file="path/to/nmr_sample_qc_flags.csv")
fwrite(processed$log_offset, file="path/to/nmr_biomarker_log_offset.csv")
fwrite(processed$outlier_plate_detection, file="path/to/outlier_plate_info.csv")
Remember to use the dx upload tool provided by the UK Biobank Research Analysis Platform to save these files to your persistent project storage for later use.
You can try this out using the test dataset bundled with the ukbnmr package:
library(ukbnmr)
exported <- ukbnmr::test_data # see help("test_data") for more details
processed <- remove_technical_variation(exported)
#> Checking for revelant UKB fields...
#> Extracting and pre-processing data...
#> Checking for required sample processing fields needed for QC procedure...
#> Processing sample processing fields for QC procedure...
#> Determining log offsets for biomarker concentrations...
#> Adjusting for time between sample prep and sample measurement...
#> Adjusting for within plate structure across 96-well plate rows A-H...
#> Adjusting for within plate structure across 96-well plate columns 1-12...
#> Adjusting for drift over time within spectrometer...
#> Rescaling adjusted biomarkers to absolute concentrations...
#> Identifying outlier plates and setting their concentrations to NA...
#> Adding outlier plates to measurement QC tags...
#> Recalculating derived biomarkers...
#> Collating measurement QC tags for derived biomarkers...
#> Returning result...
Three versions of the QC algorithm have been developed:
Version 1 of the algorithm is as described in Ritchie et al. 2023, which was developed based on the technical variation observed in the NMR metabolomics data in the first ~120,000 participants that were measured. In brief, this multi-step procedure applies the following steps in sequence:
Version 2 of the algorithm modifies this algorithm:
Steps 4 and 5 above are performed within each processing batch
Step 6 above is modified to:
The first modification was made because applying version 1 of the algorithm to the combined data from the first and second tranches of measurements introduced stratification by well position, which became apparent when examining the corrected concentrations in each data release separately.
The second modification was made to ensure consistent bin sizes across data releases when correcting for drift over time. Otherwise, spectrometers used in multiple data releases would have different bin sizes when adjusting different releases. A bin split is also hard-coded on spectrometer 5 between plates 0490000006726 and 0490000006714, which corresponds to a large change in concentrations, akin to a spectrometer recalibration event, that is most strongly observed for alanine concentrations.
Version 3 of the algorithm makes two further minor changes:
Imputation of missing sample preparation times has been improved. Previously, any samples missing a time of measurement (N=3 in the phase 2 public release) had it set to 00:00. In version 3, it is instead set to the median time of measurement for that spectrometer on that day, which falls between 12:00 and 13:00.
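The idea of a per-spectrometer, per-day median can be sketched in base R as below. This is an illustration on made-up data, not the package's internal code. The column names, dates, and times are all hypothetical, and time of day is represented as decimal hours for simplicity.

```r
# Toy data: hypothetical column names and values, for illustration only
samples <- data.frame(
  spectrometer = c(1, 1, 1, 2, 2),
  date = as.Date(rep("2020-01-06", 5)),
  hours = c(11.5, NA, 13.5, 9.0, 10.0)  # time of day in decimal hours
)

# One grouping key per spectrometer and day
key <- paste(samples$spectrometer, samples$date)

# Median time within each group, ignoring missing values
group_median <- ave(samples$hours, key,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill in missing times with the group median; observed times are unchanged
samples$hours_imputed <- ifelse(is.na(samples$hours), group_median, samples$hours)
```

In this toy example the missing value on spectrometer 1 is filled with 12.5, the median of 11.5 and 13.5, rather than 0.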
Underlying code for adjusting drift over time has been modified to accommodate the phase 3 public release, which includes one spectrometer with ~2,500 samples. Version 2 of the algorithm would split this into two bins, whereas version 3 keeps this as a single bin to better match the bin sizes of the rest of the spectrometers.
The phase 3 release (January 2025) of the UK Biobank data covers all ~500,000 UK Biobank participants, including the ~122,000 measured as part of the phase 1 release (June 2021) and the ~170,000 measured as part of the phase 2 release (July 2023).
The figures below summarise the impact of the possible sources of variation on this updated dataset, and the impact of applying version 3 of our algorithm for removing technical variation, similar to what was shown in Figure 2 and Figure 7 of Ritchie et al. 2023 for the phase 1 release data:
Extended diagnostic plots showing the impact of technical variation and its removal on all biomarkers are available to download on FigShare at 10.6084/m9.figshare.27730101.
The July 2023 release of the UK Biobank NMR data covered ~275,000 UK Biobank participants, including ~122,000 measured as part of the phase 1 release (June 2021).
The figures below summarise the impact of the possible sources of variation on this updated dataset, and the impact of applying version 2 of our algorithm for removing technical variation, similar to what was shown in Figure 2 of Ritchie et al. 2023 for the phase 1 release data:
Extended diagnostic plots showing the impact of technical variation and its removal on all biomarkers are available to download on FigShare at 10.6084/m9.figshare.23686407.
Our exploration of this updated data release (advance access under UK Biobank application 30418) revealed that several changes were needed to our existing algorithm for removing technical variation, which was developed on the phase 1 data.
First, we observed that correcting for systematic differences in well position (steps 4 and 5 of the algorithm) over all 275,000 participants introduced systematic differences between the phase 1 and phase 2 data release samples: