13  Study definition

13.1 Reproducible dummy data

  • To ensure the dummy data system generates exactly the same data every time you run it, set a random number generator seed at the top of the study_definition.py file

    import numpy as np
    # Change this number to one for which your scripts
    # successfully run on the dummy data
    np.random.seed(123456)  # example value only
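
As a quick illustration of why the seed matters, here is a minimal sketch. It assumes the dummy data is drawn from NumPy's global random number generator. The function draw_dummy_values and both seed values are hypothetical and are not part of cohortextractor.

```python
import numpy as np


def draw_dummy_values(seed, n=5):
    """Hypothetical stand-in for a dummy data generator using NumPy's global RNG."""
    np.random.seed(seed)  # reset the global generator before drawing
    return np.random.normal(loc=50, scale=10, size=n)


first = draw_dummy_values(seed=123456)
second = draw_dummy_values(seed=123456)  # same seed as `first`
other = draw_dummy_values(seed=654321)   # different seed

# Same seed: the two arrays are identical element by element.
print(np.array_equal(first, second))
# Different seed: the draws differ.
print(np.array_equal(first, other))
```

Without the call to np.random.seed(), each run would start from a different generator state, so scripts that happen to work on one set of dummy data could fail on the next.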

13.2 File formats

  • Use .feather files for outputs from cohortextractor by specifying an action in your project.yaml as follows:

      run: cohortextractor:latest generate_cohort --study-definition study_definition --output-format feather
      needs:
      - design
      outputs:
        highly_sensitive:
          cohort: output/input.feather
  • Use the arrow package to read .feather files into R

    arrow::read_feather(file = file.path("output", "input.feather"))
    • The col_select argument can be used to read in just the columns you need
  • Start each project with a preprocessing action that formats the .feather files and outputs gzipped .rds files, which can be saved with readr::write_rds()

    # mydata is a placeholder name for your preprocessed data frame
    readr::write_rds(mydata,
                     file.path("output", "mydata.rds"),
                     compress = "gz")