Prepare birthweight data for modelling


This script performs the final processing step, which puts the data in a format suitable for our models. We select the variables to include in the models and add a new level to factors to represent their missing values.

Load packages, read data and source custom scripts

Paths are defined relative to the git repository location.

rm(list = ls())
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
path_proj <- day2day::git_path()
path_data <- file.path(path_proj, "data")
path_processed <- file.path(path_data, "processed")

bwdata <- fst::read_fst(file.path(path_processed, "bwdata_31_exposure.fst")) %>%
    mutate(res_muni = factor(res_muni),
           premature = duration_weeks < 37,
           prop_tap_toilet = prop_tap_toilet / 100
    )

Center municipality variables

bwdata <- bwdata %>%
    mutate(
        prop_tap_toilet_cd = prop_tap_toilet - mean(prop_tap_toilet),
        remoteness_cd = remoteness - mean(remoteness)
    )
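
The distinction between centering and full standardization can be illustrated with a minimal sketch. The values below are hypothetical and not taken from the data; only the `remoteness` and `remoteness_cd` names come from the script, and `remoteness_sd` is introduced purely for comparison. Note that `mean()` returns `NA` when the input contains missing values, so this approach assumes the municipality variables are complete.

```r
# Toy municipality-level covariate (hypothetical values, not from the data)
remoteness <- c(2, 4, 9)

# Centering, as done in the script: subtract the mean, keep the original scale
remoteness_cd <- remoteness - mean(remoteness)

# Full standardization (not done in the script) would also divide by the
# standard deviation
remoteness_sd <- (remoteness - mean(remoteness)) / sd(remoteness)

mean(remoteness_cd)  # centered variable has mean zero
sd(remoteness_cd)    # same spread as the original variable
sd(remoteness_sd)    # standardized variable has unit standard deviation
```

Centering changes only the interpretation of the model intercept, while the coefficients of the centered variables keep their original units.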

Relabel gestational weeks

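We relabel the categories of gestational weeks so that the two full-term categories (37 - 41 and > 42 weeks) are merged into a single level, > 37, while the categories for shorter pregnancies remain separate. The order of the levels is also reversed.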

levs <- rev(levels(bwdata$gestation_weeks))
labs <- gsub("37 - 41|> 42", "> 37", levs)
bwdata <- mutate(bwdata, gestation_weeks = factor(gestation_weeks, levs, labs))
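
The merge relies on `factor()` accepting duplicated labels, which collapses the corresponding levels into one (this requires R 3.4.0 or later). A minimal sketch, assuming hypothetical level names; the real ones are those of `bwdata$gestation_weeks`, and `relabelled` is a name used only here:

```r
# Hypothetical level names; the real ones live in bwdata$gestation_weeks
gestation_weeks <- factor(
    c("< 22", "37 - 41", "> 42", "32 - 36", "37 - 41"),
    levels = c("< 22", "32 - 36", "37 - 41", "> 42")
)

# Reverse the level order, then give both full-term levels the same label
levs <- rev(levels(gestation_weeks))
labs <- gsub("37 - 41|> 42", "> 37", levs)

# Duplicated labels are merged into a single level
relabelled <- factor(gestation_weeks, levs, labs)

levels(relabelled)
table(relabelled)
```

In this toy example the three full-term observations end up in the same `> 37` level, which becomes the first level of the factor.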

Select variables of interest

We select the variables that will be used for modelling. These include:

  • birthweight
  • low birthweight (\(< 2500\) g)
  • prematurity (\(< 37\) weeks)
varnames <- c("born_weight", "lbw", "premature",
              "sex", "born_race", "birth_place", "gestation_weeks",
              "marital_status", "study_years", "consult_num", "age",
              "wk_ini", "rivwk_conception",
              "res_muni", "longitud", "latitud",
              "remoteness", "remoteness_cd", "rur_prop", "prop_tap_toilet",
              "prop_tap_toilet_cd")
bwdata_model <- bwdata %>%
    dplyr::select(dplyr::one_of(varnames), matches("^(pos|neg)_"))
n_old <- nrow(bwdata_model)
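
To see which columns the pattern `^(pos|neg)_` picks up, the following base R sketch applies the same regular expression to a vector of names. All column names here are hypothetical except for the `pos_` and `neg_` prefixes; the sketch also assumes the default behaviour of `matches()`, which ignores case.

```r
# Hypothetical column names; only the "pos_"/"neg_" prefixes come from the script
cols <- c("born_weight", "pos_example_a", "neg_example_b",
          "exposure_pos_", "POS_upper")

# matches() ignores case by default, so mirror that with ignore.case = TRUE
keep <- grepl("^(pos|neg)_", cols, ignore.case = TRUE)

# Only names that start with the prefix are kept
cols[keep]
```

The anchor `^` means that a name such as `exposure_pos_`, where the prefix appears later in the string, is not selected.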

Add missing level to factors with missing values

A missing level is added when:

  • the variable is a factor and
  • the number of missing values is greater than 100. This threshold exists because the variable sex, which is used to model the scale parameter, has only one missing value. Adding a missing level with a single observation made the model non-identifiable, and the MCMC samples were not updated at each iteration.
bwdata_model <- mutate_if(bwdata_model, ~ is.factor(.) & sum(is.na(.)) > 100, addNA)

After this operation, the rows (profiles) that still have missing values, either in a factor without an NA level or in any other variable, are removed in the next step.

Remove missing values

Because NA was added as a level to the factors with more than 100 missing values, na.omit() does not treat that level as a missing value, so rows with the NA level in those factors are kept.

bwdata_model <- na.omit(bwdata_model)
n_new <- nrow(bwdata_model)
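
The interaction between adding the NA level and removing missing values can be checked on a toy data frame. This is a minimal base R sketch: the data, the `toy`, `needs_level` and `cleaned` names, and the threshold of 1 are hypothetical (the script uses 100 and `mutate_if`).

```r
# Toy data; the threshold of 1 missing value is hypothetical (the script uses 100)
toy <- data.frame(
    sex  = factor(c("F", "M", NA, "F", "M")),  # 1 missing value
    race = factor(c("a", NA, NA, "b", "a")),   # 2 missing values
    age  = c(21, 34, 28, NA, 40)               # numeric, 1 missing value
)
threshold <- 1

# Add an NA level only to factors with more than `threshold` missing values
needs_level <- vapply(
    toy,
    function(x) is.factor(x) && sum(is.na(x)) > threshold,
    logical(1)
)
toy[needs_level] <- lapply(toy[needs_level], addNA)

levels(toy$race)       # NA is now one of the levels
any(is.na(toy$race))   # the NA level is no longer flagged as missing

# Rows with a remaining NA (in sex or age) are dropped; rows where race
# holds the NA level are kept
cleaned <- na.omit(toy)
cleaned
```

Here `sex` stays below the threshold, so its single missing value causes that row to be dropped, which mirrors how the script handles sex.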

Let’s check the number and percentage of removed rows.

metadata <- c("Initial number of rows" = n_old,
              "Final number of rows" = n_new,
              "Deleted number of rows" = n_old - n_new)
metadata_perc <- metadata / n_old * 100
cbind(metadata, metadata_perc)
#>                        metadata metadata_perc
#> Initial number of rows   296773    100.000000
#> Final number of rows     291479     98.216145
#> Deleted number of rows     5294      1.783855

Save birthweight data for modelling

fst::write_fst(bwdata_model, path = file.path(path_processed, "bwdata_41_model.fst"))

Time to execute the task

Only useful when executed with Rscript.

proc.time()
#>    user  system elapsed 
#>   4.277   0.231   4.524