# Data format

library(anovir)

## Introduction

This vignette describes how survival data should be formatted for use with the functions in this package.

A general data format works for most functions, but two functions require data in a specific format: nll_two_inf_subpops_obs and nll_recovery.

## General format required

The negative log-likelihood (nll) functions in this package require the survival data being analysed to be in a data frame.

The default assumption is each row contains data for an individual host.

Data can also be grouped, so that a row records the frequency of individuals from a particular treatment or population experiencing the same event in the same sampling interval. In this case, the frequency data must be in a column named 'fq'. This column is detected automatically and the nll calculations are adjusted accordingly; frequencies of zero ('0') are allowed.
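As an illustration of the grouped format, the sketch below collapses individual-level rows into frequencies using base R. The data values are made up for the example; only the column name 'fq' and the default column names come from the text.

```r
# Hypothetical individual-level data: one row per host
individual <- data.frame(
  censor = c(0, 0, 0, 1, 0, 0, 1, 1),
  time = c(3, 3, 5, 5, 3, 4, 5, 5),
  infection_treatment = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Each host counts once; summing within groups gives the frequency
individual$fq <- 1

# Group hosts sharing the same event type, time and treatment
grouped <- aggregate(
  fq ~ censor + time + infection_treatment,
  data = individual,
  FUN = sum
)

grouped
```

Note that `aggregate()` only returns combinations that occur in the data, so it will not create zero-frequency rows by itself.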

By default, most nll functions assume a data frame will contain three columns named as follows,

• censor,
• time,
• infection_treatment

containing the following information;

• censor
  • describes whether event was death or right-censoring
  • needs a numerical value of;
    • '0' for death,
    • '1' for right-censoring.
• time
  • describes the time when the event occurred
  • needs to be a numerical value > 0.
• infection_treatment
  • identifies whether data are from an infected or uninfected treatment
  • needs to be a numerical value of;
    • '0' for an uninfected treatment,
    • '1' for an infected treatment.

These columns can be renamed when specifying parameters for the nll function to be sent for estimation by maximum likelihood. Columns with the default names above do not need to be specified, but the contents of their rows must be coded as above, i.e., data from an infected treatment must be coded as '1' and not 'infected', '+ve', etc.
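A minimal sketch of preparing such a data frame is given below, assuming the raw data were recorded with text labels. The data frame `raw`, its column names and its labels are hypothetical; only the target column names and the 0/1 coding are taken from the description above.

```r
# Hypothetical raw data using text labels rather than 0/1 codes
raw <- data.frame(
  status = c("died", "censored", "died", "died"),
  day = c(4, 10, 6, 7),
  treatment = c("uninfected", "uninfected", "infected", "infected"),
  stringsAsFactors = FALSE
)

# Recode to the numeric values described above
formatted <- data.frame(
  censor = ifelse(raw$status == "censored", 1, 0),
  time = raw$day,
  infection_treatment = ifelse(raw$treatment == "infected", 1, 0)
)

# Basic checks on the coding; stopifnot() raises an error if any fails
stopifnot(
  all(formatted$censor %in% c(0, 1)),
  all(formatted$infection_treatment %in% c(0, 1)),
  is.numeric(formatted$time),
  all(formatted$time > 0)
)
```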

All nll functions assume individuals in an uninfected treatment are uninfected.

Not all functions assume all individuals in an infected treatment are infected.

## Specific formats

Some nll functions have specific data formatting requirements.

### nll_two_inf_subpops_obs

This function applies to cases where two distinct subpopulations of hosts have been identified ('observed') within an infected population or treatment. In addition to the columns above, this function requires the data frame to be analysed to have a column identifying the two infected subpopulations;

• infsubpop
  • identifies which subpopulation the infected hosts belong to;
    • '1' for subpopulation '1'
    • '2' for subpopulation '2'
  • values of '1' or '2' are arbitrary and only used for identifying each subpopulation

The column can be renamed when specifying the nll function, but it must contain values of '1' or '2' for the two subpopulations.
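The sketch below shows one way such a column might be created for infected hosts, assuming the two subpopulations were recorded with text labels. The column `observed_group` and its labels are hypothetical, and the example covers infected hosts only.

```r
# Hypothetical infected hosts, with an observed label for each subpopulation
infected <- data.frame(
  censor = c(0, 0, 1, 0),
  time = c(5, 8, 12, 9),
  infection_treatment = c(1, 1, 1, 1),
  observed_group = c("high", "low", "low", "high"),
  stringsAsFactors = FALSE
)

# Map the two labels to the arbitrary codes 1 and 2
infected$infsubpop <- ifelse(infected$observed_group == "high", 1, 2)

# Check that only the two permitted codes are present
stopifnot(all(infected$infsubpop %in% c(1, 2)))
```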

### nll_recovery

The data frame required by this function has a specific structure. In this case, whether an event was death or right-censoring is not coded in the rows of a data frame, but in columns.

The data frame needs six columns with the following column names and these columns need to be filled with binary [0/1] data as follows;

• control.d
  • '1' for control individuals dying during the experiment,
  • '0' otherwise
• control.c
  • '1' for control individuals censored during or at the end of the experiment
  • '0' otherwise
• infected.d
  • '1' for infected individuals dying while still infected during the experiment
  • '0' otherwise
• infected.c
  • '1' for infected individuals censored during or at the end of the experiment
  • '0' otherwise
• recovered.d
  • '1' for recovered individuals dying during the experiment
  • '0' otherwise
• recovered.c
  • '1' for recovered individuals censored during or at the end of the experiment
  • '0' otherwise

Each of these six columns needs an individual row for every sampling interval between the first and last sampling interval, i.e., from time t = 1 to time t = tmax, where tmax is the last sampling interval.

For example, if survival data were sampled each day from days 1 to 20 of an experiment, the data frame will need to have 6 x tmax = 6 x 20 = 120 rows.

NB it is assumed sampling intervals are equally spaced throughout the experiment.

The data frame also needs the following columns, with the following names and contents,

• censor
  • '1' for censored data
  • '0' otherwise
• t
  • data for the time of event; needs to be numeric with t > 0
• fq
  • data for the frequency of events occurring at time t; values of zero (0) are allowed
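The row structure described above can be sketched in base R as follows, assuming tmax = 20. This builds only the columns described in the text, with all frequencies set to zero as placeholders; the helper column `population` is hypothetical and is dropped at the end.

```r
tmax <- 20  # last sampling interval

# The six populations, in the order listed above
populations <- c("control.d", "control.c", "infected.d",
                 "infected.c", "recovered.d", "recovered.c")

# One row per population per sampling interval: 6 x tmax rows,
# with t running from 1 to tmax in ascending order within each population
skeleton <- data.frame(
  population = rep(populations, each = tmax),
  t = rep(seq_len(tmax), times = length(populations)),
  stringsAsFactors = FALSE
)

# One binary indicator column per population
for (p in populations) {
  skeleton[[p]] <- as.numeric(skeleton$population == p)
}

# censor is 1 for the censored ('.c') populations and 0 otherwise
skeleton$censor <- as.numeric(grepl("\\.c$", skeleton$population))

# Placeholder frequencies, to be replaced by the observed counts
skeleton$fq <- 0

# Keep only the described columns, dropping the helper column
skeleton <- skeleton[, c(populations, "censor", "t", "fq")]

nrow(skeleton)
```

Starting from a complete skeleton and then filling in the observed counts makes it harder to omit a sampling interval in which no events occurred.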

For example, the first few lines of the data frame recovery_data are given below;

head(recovery_data, 3)
#>   control.d control.c infected.d infected.c recovered.d recovered.c censor d t
#> 1         1         0          0          0           0           0      0 1 1
#> 2         1         0          0          0           0           0      0 1 2
#> 3         1         0          0          0           0           0      0 1 3
#>   fq
#> 1  1
#> 2  4
#> 3 11

These rows are for the population control.d, that is, control individuals dying during the experiment (control.d = 1). They show that these individuals were not censored (censor = 0) and that, for times 1, 2 and 3, the frequency of individuals dying was 1, 4 and 11, respectively.

The last few lines of the same data frame are,

tail(recovery_data, 3)
#>     control.d control.c infected.d infected.c recovered.d recovered.c censor d
#> 118         0         0          0          0           0           1      1 0
#> 119         0         0          0          0           0           1      1 0
#> 120         0         0          0          0           0           1      1 0
#>      t fq
#> 118 18  0
#> 119 19  0
#> 120 20 41

These rows are for the population of hosts that recovered and were right-censored (recovered.c = 1, censor = 1). For times 18, 19 and 20, the frequency of individuals censored in this population was 0, 0 and 41, respectively. NB all rows between t = 1 and t = tmax need to be included, in ascending order, even if the frequency of individuals involved is zero.
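The requirement that every population has a row for each time from 1 to tmax, in ascending order, can be checked before fitting. The helper `check_population_rows()` below is a hypothetical sketch and not part of the package; the toy data use tmax = 3 and only two of the six populations to keep the example short.

```r
# Hypothetical helper: TRUE if the rows for one population
# cover t = 1, 2, ..., tmax exactly once and in ascending order
check_population_rows <- function(data, population, tmax) {
  rows <- data[data[[population]] == 1, ]
  identical(as.numeric(rows$t), as.numeric(seq_len(tmax)))
}

# Made-up example with tmax = 3
toy <- data.frame(
  control.d = c(1, 1, 1, 0, 0, 0),
  control.c = c(0, 0, 0, 1, 1, 1),
  t = c(1, 2, 3, 1, 2, 3),
  fq = c(1, 4, 0, 0, 0, 7)
)

check_population_rows(toy, "control.d", tmax = 3)  # TRUE

# Dropping a zero-frequency row breaks the requirement
incomplete <- toy[-3, ]
check_population_rows(incomplete, "control.d", tmax = 3)  # FALSE
```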