13 Study definition
13.1 Reproducible dummy data
To ensure the dummy data system generates exactly the same data every time you run it, set a random number generator seed at the top of the study_definition.py file:

    import numpy as np

    # Change this number to one for which your scripts
    # successfully run on the dummy data
    np.random.seed(123456)
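As a quick illustration of what the seed does, the sketch below reseeds NumPy's global generator and draws a few values twice. It is not part of a study definition: `draw_dummy_values` is a hypothetical helper, and the plain integer draw only stands in for the dummy data system, on the assumption that it draws from NumPy's global generator.

```python
import numpy as np


def draw_dummy_values(seed, n=5):
    """Seed NumPy's global generator, then draw n stand-in dummy values."""
    np.random.seed(seed)
    return np.random.randint(0, 100, size=n)


# Reseeding with the same number reproduces the same draw.
first = draw_dummy_values(123456)
second = draw_dummy_values(123456)

print((first == second).all())
```

If the script runs other code that consumes random numbers before the dummy data is generated, the draws will shift, which is why the seed belongs at the very top of the file.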
13.2 File formats
Use .feather files for outputs from the cohortextractor, so specify an action in your project.yaml as follows:

    generate_study_population:
      run: cohortextractor:latest generate_cohort --study-definition study_definition --output-format feather
      needs:
        - design
      outputs:
        highly_sensitive:
          cohort: output/input.feather

Use the arrow package to read .feather files into R:

    arrow::read_feather(file = file.path("output", "input.feather"))

- The col_select argument can be used to read in just the columns you need
Start each project with a preprocessing action that formats .feather files and outputs (gzipped) .rds files, which can be saved with readr::write_rds():

    readr::write_rds(object, file.path("output", "mydata.rds"), compress = "gz")