Prepare birthweight data for modelling
This script applies the final processing step to the data so that it is in a suitable format for our models. We select the variables to be included in the models and add missing values of factors as a new level.
Load packages, read data and source custom scripts
Paths are defined relative to the git repository location.
rm(list = ls())
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
path_proj <- day2day::git_path()
path_data <- file.path(path_proj, "data")
path_processed <- file.path(path_data, "processed")
bwdata <- fst::read_fst(file.path(path_processed, "bwdata_31_exposure.fst")) %>%
  mutate(res_muni = factor(res_muni),
         premature = duration_weeks < 37,
         prop_tap_toilet = prop_tap_toilet / 100
  )
Centred municipality variables
bwdata <- bwdata %>%
  mutate(
    prop_tap_toilet_cd = prop_tap_toilet - mean(prop_tap_toilet),
    remoteness_cd = remoteness - mean(remoteness)
  )
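As a quick check of what the _cd variables hold, the sketch below centres a toy vector in the same way. The values are made up for illustration and do not come from the real data. It also shows that mean() returns NA when the input contains missing values, which may be worth verifying on the real columns.

```r
# Toy remoteness values (hypothetical, not taken from the real data)
remoteness <- c(2, 4, 9)

# Centre by subtracting the mean, as done for the *_cd variables
remoteness_cd <- remoteness - mean(remoteness)
remoteness_cd

# The centred values average to zero
mean(remoteness_cd)

# With a missing value, mean() returns NA unless na.rm = TRUE is set
mean(c(2, 4, NA))
mean(c(2, 4, NA), na.rm = TRUE)
```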
Relabel gestational weeks
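We relabel the categories of gestational weeks so that the two categories at or beyond full term (37 - 41 and > 42) are merged into a single category (> 37). This separates newborns with full gestational age from those with shorter periods of pregnancy. The order of the levels is also reversed.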
levs <- rev(levels(bwdata$gestation_weeks))
labs <- gsub("37 - 41|> 42", "> 37", levs)
bwdata <- mutate(bwdata, gestation_weeks = factor(gestation_weeks, levs, labs))
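The relabelling relies on factor() merging levels that receive the same label (supported in R 3.4.0 and later). A minimal sketch on a toy factor follows. Only the levels "37 - 41" and "> 42" are implied by the pattern above; the other level names are assumptions made for illustration.

```r
# Toy factor; "< 22" and "32 - 36" are hypothetical level names
gestation_weeks <- factor(
  c("< 22", "37 - 41", "> 42", "32 - 36", "37 - 41"),
  levels = c("< 22", "32 - 36", "37 - 41", "> 42")
)

# Reverse the level order, then give two levels the same label
levs <- rev(levels(gestation_weeks))
labs <- gsub("37 - 41|> 42", "> 37", levs)

# Duplicated labels make factor() collapse the corresponding levels
relabelled <- factor(gestation_weeks, levs, labs)
levels(relabelled)
table(relabelled)
```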
Select variables of interest
We select the variables that will be used for modelling. The main variables of interest are:
- birthweight
- low birthweight (\(< 2500\) g)
- prematurity (\(< 37\) weeks)
varnames <- c("born_weight", "lbw", "premature",
              "sex", "born_race", "birth_place", "gestation_weeks",
              "marital_status", "study_years", "consult_num", "age",
              "wk_ini", "rivwk_conception",
              "res_muni", "longitud", "latitud",
              "remoteness", "remoteness_cd", "rur_prop", "prop_tap_toilet",
              "prop_tap_toilet_cd")
bwdata_model <- bwdata %>%
  dplyr::select(dplyr::one_of(varnames), matches("^(pos|neg)_"))
n_old <- nrow(bwdata_model)
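To illustrate how the name-based and pattern-based selections combine, here is a small sketch on a toy data frame. It uses any_of(), the helper that dplyr (1.0.0 and later) recommends in place of the superseded one_of(); unlike one_of(), it silently ignores names that are absent. The toy column names after the pos_ and neg_ prefixes are hypothetical.

```r
library(dplyr)

# Toy data frame; pos_ext_1, neg_ext_1 and unused are hypothetical columns
toy <- data.frame(
  born_weight = c(3200, 2400),
  sex = factor(c("F", "M")),
  unused = 1:2,
  pos_ext_1 = c(0.1, 0.2),
  neg_ext_1 = c(0.0, 0.3)
)

# "not_present" is deliberately missing from the toy data
toy_varnames <- c("born_weight", "sex", "not_present")

# Keep the named variables that exist, plus every column starting
# with pos_ or neg_
toy_model <- toy %>%
  select(any_of(toy_varnames), matches("^(pos|neg)_"))
names(toy_model)
```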
Add missing level to factors with missing values
A missing level is added when:
- the variable is a factor and
- the number of missing values is greater than 100.
The threshold is needed because the variable sex, which is used to model the scale parameter, has only one missing value. Adding an NA level with a single observation made the model non-identifiable, and the MCMC samples were not updated at each iteration.
bwdata_model <- mutate_if(bwdata_model, ~ is.factor(.) & sum(is.na(.)) > 100, addNA)
After this operation, the profiles (rows) with missing values in any of the factors without an NA category will be removed in the next step.
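The rule can be sketched in base R on toy data as follows. The threshold is lowered from 100 to 1 so that the effect is visible with a handful of rows; the data and the threshold variable are assumptions for illustration only.

```r
# Toy data: sex has 1 missing value, marital_status has 3
toy <- data.frame(
  sex = factor(c("F", "M", NA, "F", "M")),
  marital_status = factor(c("single", NA, NA, "married", NA))
)
threshold <- 1  # hypothetical; the script uses 100

# Flag factors whose number of missing values exceeds the threshold
needs_na_level <- vapply(
  toy,
  function(x) is.factor(x) && sum(is.na(x)) > threshold,
  logical(1)
)

# Add NA as an explicit level only to the flagged factors
toy[needs_na_level] <- lapply(toy[needs_na_level], addNA)

levels(toy$sex)             # unchanged, no NA level
levels(toy$marital_status)  # now includes NA as a level
```

With this rule, sex keeps its single missing value as a true NA, so that row is dropped later instead of creating a one-observation category.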
Remove missing values
Because NA was added as a level to the factors with more than 100 missing values, it is no longer treated as a missing value in those factors. Only rows with missing values in the remaining variables are removed.
bwdata_model <- na.omit(bwdata_model)
n_new <- nrow(bwdata_model)
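The behaviour described above can be verified on a small example: once NA is a level, is.na() no longer flags those entries, so na.omit() keeps the rows. The toy vectors below are made up for illustration.

```r
# Factor with one true missing value
x <- factor(c("a", NA, "b"))

# Same factor with NA turned into an explicit level
x_na <- addNA(x)

is.na(x)     # the second element is flagged as missing
is.na(x_na)  # nothing is flagged: NA is now a regular level

# na.omit() drops the row only when NA is a true missing value
toy <- data.frame(with_level = x_na, without_level = x)
nrow(na.omit(toy["with_level"]))
nrow(na.omit(toy["without_level"]))
```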
Let’s check the number and percentage of removed rows.
metadata <- c("Initial number of rows" = n_old,
"Final number of rows" = n_new,
"Deleted number of rows" = n_old - n_new)
metadata_perc <- metadata / n_old * 100
cbind(metadata, metadata_perc)
#>                        metadata metadata_perc
#> Initial number of rows   296773    100.000000
#> Final number of rows     291479     98.216145
#> Deleted number of rows     5294      1.783855
Save birthweight data for modelling
fst::write_fst(bwdata_model, path = file.path(path_processed, "bwdata_41_model.fst"))
Time to execute the task
Only useful when executed with Rscript.
proc.time()
#>    user  system elapsed
#>   4.277   0.231   4.524