Apart from the documentation of your project discussed previously, there are a few other practices you should follow during your analysis.
Automate every single step
This is an extremely important point in making your research reproducible. Never use the point-and-click capabilities of a software package to perform a step in your analysis. Instead, use a script or notebook in your preferred programming language to run everything, from data preparation through visualization to modeling.
Note: Should you really need to deviate from this practice, carefully document every click you make and provide screenshots or a video of the process.
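As a rough illustration of what "automating every step" can look like, the sketch below chains a preparation step and a modeling step behind a single entry point. The function names (prepare_data, fit_model, run_pipeline) and the toy data are hypothetical stand-ins, not part of the project discussed in this guide.

```python
def prepare_data(raw_rows):
    """Drop rows with missing values (a stand-in for real data cleaning)."""
    return [row for row in raw_rows if None not in row]


def fit_model(rows):
    """Stand-in for a real model: the mean of the second column."""
    values = [row[1] for row in rows]
    return sum(values) / len(values)


def run_pipeline(raw_rows):
    """Run every step in order so that no manual action is needed."""
    cleaned = prepare_data(raw_rows)
    return fit_model(cleaned)


if __name__ == "__main__":
    # Toy input invented for illustration; the middle row has a missing value.
    demo_rows = [(1, 2.0), (2, None), (3, 4.0)]
    print(run_pipeline(demo_rows))
```

Because the whole analysis runs from one call, anyone can repeat it by executing the script, without having to reconstruct a sequence of clicks.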
Relative paths
Another thing to remember is to always use relative instead of absolute paths throughout your code base. This is another reason why it was important to put the project into its own directory. For example, /Users/sbinder/repository_name/data/raw/ becomes ./data/raw. Relative paths allow others to rerun the code from the project directory without first adjusting any paths.
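In Python, one way to keep paths relative is to build them with pathlib from a single constant. This is a minimal sketch: the helper raw_data_file and the file name crimes.csv are hypothetical, and the code assumes it is run from the project root.

```python
from pathlib import Path

# Relative to the project root; no user-specific prefix such as /Users/... appears.
RAW_DATA_DIR = Path("data") / "raw"


def raw_data_file(filename):
    """Return the relative path to a file inside the raw data directory."""
    return RAW_DATA_DIR / filename


# "crimes.csv" is a made-up file name used only for illustration.
path = raw_data_file("crimes.csv")
print(path.as_posix())
print(path.is_absolute())
```

Defining the directory once also means that a change in the project layout requires editing only a single line.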
Refactor code
If you follow a notebook-based workflow, your notebooks might at some point get cluttered with a lot of code, which you might even end up copying from one notebook to the next. To increase the reusability of your code and clean up your notebooks, you can put code into functions and move those functions into a separate script. This allows you to import the same functions in multiple notebooks. If, for example, you have some code that loads data from a SQL database and you want to use it in multiple notebooks, this is a perfect use case for a function in a separate script. You can see this example implemented in the following Jupyter notebook:
Show notebook
At the beginning of the notebook, the line
from src.prepare_data.crime_database import load_relevant_crimes
imports the function load_relevant_crimes that is defined in the script src/prepare_data/crime_database.py.
Show function in crime_database.py
def load_relevant_crimes(min_date,
                         max_date=None,
                         sqldb_path='data/processed/crimes.db',
                         chunksize=None,
                         disk_engine=None,
                         **kwargs):
    """Loads relevant violent and property crimes

    Wrapper around load_crimes which only loads violent and property crimes.
    Furthermore this function provides an easier interface to load
    relevant columns. If add_violent_col is True an additional column is
    added to the returned dataset with a dummy which is 1 for a violent crime
    and 0 for a property crime.

    Parameters
    ----------
    min_date : str, format = "YYYY-MM-DD"
        Min date which should be loaded
    max_date : str, format = "YYYY-MM-DD", optional (default=None)
        Max date which should be loaded
    sqldb_path : str, optional (default='data/processed/crimes.db')
        Path to SQL database, defaults to relative path
        to crimes database from project root. Useful to change
        for example when using Jupyter notebooks in other directories.
    chunksize : int, optional (default=None)
        If specified, return an iterator where chunksize
        is the number of rows to include in each chunk.
    kwargs : keyword arguments, optional
        Further keyword arguments directly passed on to pd.read_sql_query call

    Returns
    -------
    pd.DataFrame
        Dataframe containing loaded crimes
    """
    # Code
    ...
    return ...
Note: For more information on how to refactor your Jupyter notebooks, see Part 5 of Jake Vanderplas’ Reproducible Data Analysis series.
For R users, the process is exactly the same, and you can import functions defined in an .R script by calling (sticking with the above example):
source("src/prepare_data/crime_database.R")
at the beginning of your notebook, be it Jupyter or R.
Use random seeds
Another random but important note: If your project involves the use of a pseudorandom number generator (or an algorithm which uses one), you should explicitly set the random seed such that a rerun of the experiment gives exactly the same results.
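The effect of a fixed seed can be sketched with Python's built-in random module. The seed value 42 and the helper draw_sample are arbitrary choices made for illustration. Other libraries that maintain their own generators (NumPy, for example) have to be seeded separately through their own interfaces.

```python
import random

SEED = 42  # arbitrary value; any fixed integer works


def draw_sample(seed, n=5):
    """Draw n pseudorandom floats from a generator initialised with seed."""
    rng = random.Random(seed)  # independent generator, leaves global state untouched
    return [rng.random() for _ in range(n)]


first_run = draw_sample(SEED)
second_run = draw_sample(SEED)

# Both runs used the same seed, so they produce identical numbers.
print(first_run == second_run)
```

Storing the seed in a named constant at the top of the script or notebook makes it easy to find and to report alongside the results.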
So assume you’ve finished your first research project that adheres to the basic tools and principles of reproducible research outlined in this guide, and you’re now ready to share your work with the world through GitHub. In order to do this, we need to spend 60 seconds talking about the legal aspect of licensing your work.