13  Study definition

13.1 Reproducible dummy data

  • To make the dummy data system generate exactly the same data on every run, set a random number generator seed at the top of the study_definition.py file:

    import numpy as np
    # Change this number to one for which your scripts 
    # successfully run on the dummy data
    np.random.seed(123456)
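
As a minimal sketch of why the seed matters, the snippet below reseeds NumPy's global generator and draws the same values twice. The helper name draw_dummy_values is hypothetical and used only for illustration; it shows how np.random.seed behaves in general and is not a description of how the cohortextractor generates dummy data internally.

```python
import numpy as np


def draw_dummy_values(seed, n=5):
    """Seed NumPy's global generator, then draw n random integers in [0, 100)."""
    np.random.seed(seed)
    return np.random.randint(0, 100, size=n)


# Two draws made after setting the same seed are identical.
first = draw_dummy_values(123456)
second = draw_dummy_values(123456)

print(np.array_equal(first, second))
```

Because the seed resets the generator's state, anything that changes the number or order of random draws after the seed is set can still change the values produced, so it is worth rerunning your scripts on the dummy data after editing the study definition.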

13.2 File formats

  • Use .feather files for outputs from the cohortextractor by specifying an action in your project.yaml as follows:

    generate_study_population:
      run: cohortextractor:latest generate_cohort --study-definition study_definition --output-format feather
      needs:
        - design
      outputs:
        highly_sensitive:
          cohort: output/input.feather

  • Use the arrow package to read .feather files into R:

    arrow::read_feather(file = file.path("output", "input.feather"))

    • Use the col_select argument to read in only the columns you need

  • Start each project with a preprocessing action that formats the .feather files and saves the results as gzipped .rds files using readr::write_rds():

    readr::write_rds(object, 
                     file.path("output", "mydata.rds"), 
                     compress = "gz")