Jupyter Notebooks in STEM

Work in progress... more updates soon. This post will also soon appear on iBiology Courses news.

Jupyter notebooks are everywhere in STEM education. What started as a simple enhanced Python shell in 2001 has become a common scientific and educational tool for encapsulating logic and data in an interactive, page-like format. If you’re a student and you’re in the sciences, your life will include some amount of time spent opening and running Jupyter notebooks. Sooner or later, probably sooner.

As science has become more about software and data — and reproducibility of what comes out of that combination — Jupyter notebooks have emerged as a convenient mechanism to package code and data into an immediate, interactive environment. All from an ordinary web browser. For an educator, the notebook provides a single place to weave explanatory text, wonky equations, and the actual code and helper libraries that bring the concepts to life. For the student, the notebook hides the ugliness and frustration it took to bring all the supporting libraries together, and provides a narrative for steps defined in code. It's all there and ready for iterative experimentation.

For evidence of the trend, have a look at a random sample of submissions posted to the Journal of Open Source Education. Chances are you’ll land on courseware that includes Jupyter notebooks as part of the educational material. Even if it’s not advertised as a key feature in the course description, the giveaway is that suspicious cluster of files hanging out in the course materials, the ones with the “.ipynb” extension.

An example of a recent paper published on JOSE. This one is for geoscience students learning about the Argo program, a system of ocean sensors. Jupyter notebooks are integral to the learning materials.

Although Jupyter notebooks are usually written in Python (the .ipynb file extension dates to the format's origins as the 'IPython notebook'), that's no longer a given, as over 100 language kernels are now available. R and Julia are also commonly used; indeed, the name 'Jupyter' is a nod to Julia, Python and R.

So Jupyter notebooks are here to stay, and it would seem any science-focused Learning Management System (LMS) worth its salt needs to somehow allow a course author to add notebooks to course content, and then build, spin up and manage a notebook server for a student when it’s time. Someone has to pay the complexity gods, and they’ve just rung the doorbell. Hosting Jupyter notebooks for many users is not exactly easy.

A Brief History of Many Notebooks for Many Users

A single student running a Jupyter notebook is good.
A bunch of them on their laptops around a table is great.
But those who distribute many notebooks to many users, they’re the indispensable ones.

As Jupyter notebooks gained popularity in education and research in the early 2010s, it became clear someone needed to create a system for distributing notebooks to groups, classrooms, or entire institutions without requiring each user to set up their own environment. If one student using a Jupyter notebook is great, running notebook servers for lots of them at once would be even better.

However, that means building, launching and closing notebook servers for each student gracefully and dynamically. Like any Python program, a Jupyter notebook may expect certain libraries to be available, say matplotlib for graphs. Some of those libraries may have peculiar build sequences or system requirements: the library ‘eppy’, for example, requires a C++ simulation program called EnergyPlus to be available on the system path. Data files may be involved too: a notebook may expect a SQLite file or some other data format to be present and available.
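
To make that concrete, here's the kind of opening cell a course notebook might contain (a hypothetical example; the file, table, and column names are made up):

# A typical notebook's first cell quietly assumes a lot about its environment.
import sqlite3

import matplotlib.pyplot as plt  # assumes matplotlib is installed in the kernel's environment
import pandas as pd              # assumes pandas (and numpy underneath it)

# ...and assumes someone has already placed this data file next to the notebook.
conn = sqlite3.connect("ocean_temps.db")
df = pd.read_sql("SELECT depth, temperature FROM readings", conn)
df.plot(x="depth", y="temperature")
plt.show()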

Essentially, to run an arbitrary notebook we need to set up a complete compute environment appropriate for that notebook’s dependencies, then situate the data files and pre-load those dependencies, all before the user shows up full of enthusiasm and the most delicate strands of patience. And we need to do that for different notebooks, for different users, at varying scales. There’s that doorbell again.

Once again, open source software came to the rescue. Well…kind of.

In 2015 the Jupyter developers created JupyterHub, a system for managing individual notebook servers. This was great until it wasn’t: managing resources for all those Jupyter notebook servers itself became problematic, so in 2017 “Zero to JupyterHub”, or ZTJH, arrived. ZTJH is essentially a Helm chart plus documentation for running JupyterHub on Kubernetes. This meant inviting Kubernetes into the mix. A questionable party guest indeed.

The complexity continued to stack in 2018, when computer scientist Matthew Rocklin described Pangeo, a new architecture for large-scale, multi-dimensional distributed computing on Google Cloud, with Jupyter as the user interface touch point.

To me this looks like a mini version of the mother-of-all-demos…but for geoscience geeks. The proposed architecture involves weaving JupyterHub together with other tools for handling large datasets (Dask, XArray) to provide everything a researcher (or student) needs in one place.
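
If you haven't met Dask and XArray, the gist is lazy, chunked computation over datasets too big for memory. A minimal sketch (the file name and variable are hypothetical):

import xarray as xr

# The 'chunks' argument tells xarray to back the arrays with dask,
# so nothing is loaded until you actually compute something.
ds = xr.open_dataset("ocean_temps.nc", chunks={"time": 100})
monthly_mean = ds["temperature"].groupby("time.month").mean()
result = monthly_mean.compute()  # dask does the work, possibly across a cluster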

At this point the complexity exploded. The problems encountered by those running Pangeo mirror those faced by anyone running plain ZTJH: costs, maintenance, storage, scale, and can-you-explain-again-what-Kubernetes-is-doing? (See Rocklin's post on Pangeo 2.0 for more details.)

Add to this the fact that if one wants to integrate JupyterHub into common e-Learning platforms, the conventional way to do so is through a technology called LTI. LTI is what lets some random external tool automatically log a student in and show the tool embedded in a nice iframe directly in the course content. LTI is supported in ZTJH...but just barely. (See LTIAuthenticator if you're curious.) LTIv1.3, by the way, also defines ways to send grades and such from a tool (like JupyterHub) back into the LMS...but I don't think anyone is doing this with JupyterHub yet. If you are, please do tell.

When the complexity of an open-source software project you want to use gets to this level, often the best thing to do is make it someone else’s problem: find somebody hosting it as a SaaS. That’s what happened with Open edX in the world of e-Learning, and in a similar way some organizations have sprung up to provide hosted JupyterHub services, such as 2i2c. Other groups like Nebari have developed alternative approaches to supporting Jupyter and other data science tools. There's also Google Colab, which is targeted at individual users.

Even if you get as far as a running ZTJH cluster, integrating it into an e-Learning system is not a fully documented use case. People are doing it, but there isn’t a clear-cut path.

Notebooks in KinesinLMS

So back to KinesinLMS, the humble LMS for the average dev. In the list of features we want to add, a big one is allowing course authors to provide Jupyter notebooks to students while authoring the course, all without leaving Composer or worrying about other systems.

In adding this feature, we want to implement it in a way that keeps to our mantra that KinesinLMS is simple enough that a single dev can grok, extend and run without too much trouble (K8S isn't a strong candidate for inclusion). For the educator, meanwhile, it should be as simple as that caveman game where you can't use articles or multisyllabic words: "attach...file...easy...student...see....good."

I’ve already built out a simple LTIv1.3 connection in KinesinLMS, so the natural approach would seem to be to set up JupyterHub on our own infrastructure (e.g. DigitalOcean) and connect units in our courses to notebooks via LTI.


However, there are a few issues:

  • Running ZTJH via Kubernetes on Digital Ocean isn’t a trivial exercise. One can get it up and running in a relatively straightforward way, but understanding what’s going on, how to enhance it, and how to debug it is another matter.
  • The LTIAuthenticator plugin that gives ZTJH basic LTIv1.3 connectivity works, but it’s only partially documented and gives no guidance on how one manages and distributes notebooks to students and courses. You could rely on GitHub and nbgitpuller to store and point to various notebooks, but that requires course authors to get into the world of GitHub...and kind of sidelines the whole way LTI is supposed to work.

So building an integrated system between ZTJH + LTIAuthenticator and KinesinLMS was a bit of a stretch. Luckily, there was an easier way: modal.com.

JupyterHub-lite with Modal.com

Modal.com is a cloud platform that lets developers deploy Python code and machine learning models without managing infrastructure. It aims to eliminate the complexity of cloud deployment by automatically handling containerization, scaling, and GPU provisioning: you run code in the cloud by adding a few decorators to your Python functions. That makes it particularly useful for machine learning deployments, batch jobs, and API hosting.
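
If you haven't seen Modal before, the core idea looks roughly like this (a minimal sketch; square() is just a stand-in function):

import modal

app = modal.App("hello-modal")

@app.function()
def square(x: int) -> int:
    # This body runs in a container in Modal's cloud, not on your machine.
    return x * x

@app.local_entrypoint()
def main():
    # .remote() ships the call to the cloud and returns the result.
    print(square.remote(7))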

Lucky for me, one of their examples is a bare-bones JupyterHub facsimile. Using that, I was able to build a simple system that launches a JupyterLab server, pre-loaded with a notebook, and directs the student to the temporary URL. There’s no persistence management yet, and I haven’t yet figured out how to customize the container when a notebook needs a system library (like EnergyPlus), but it’s all working at a basic level, which is fun.

On modal.com, I’ve defined two functions: spawn_jupyter_lab_server, which manages spawning the JupyterLab server and returns the server's URL, and run_jupyter_lab, which actually runs the server. Note that run_jupyter_lab knows where on AWS S3 KinesinLMS stores notebooks and notebook resources.

This is all very preliminary, a proof-of-concept really. But I'm including it below in case you can use it in your own experiments.


import logging
import os
import shutil
import subprocess
from pathlib import Path
from secrets import token_urlsafe
from typing import Dict, List, Optional

from resources import validate_resource

import modal
from config import config_jupyterlab

logger = logging.getLogger(__name__)

# Configuration
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

JUPYTER_PORT = 8888
# By convention, we're storing JupyterLab .ipynb files as the
# 'file_resource' File property of a KinesinLMS 'Resource' model.
# These files will be stored in the '/media/block_resources' directory
# of the 'kinesinlms' bucket on S3. So we'll read them straight from
# there as we've stored the S3 credentials in the 'aws-s3-bucket-secrets' variable.
AWS_MOUNT_PATH = Path("/kinesinlms")
AWS_BLOCK_RESOURCES_PATH = AWS_MOUNT_PATH / "media" / "block_resources"
RESOURCES_PATH = Path("/resources_data")

# Define a secret to store the AWS S3 credentials
# The app is going to read in the notebook as well as any
# accompanying resources from the S3 bucket.
s3_secret = modal.Secret.from_name(
    "aws-s3-bucket-secrets",
    required_keys=[
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
    ],
)

# Apps and Volumes
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Define a persistent volume for KinesinLMS resources
# e.g. SQLite databases, CSV files, etc.
resources_volume = modal.Volume.from_name(
    "resources-data-volume",
    create_if_missing=True,
)

# This is the modal app that's going to run the Jupyter Lab server.
app = modal.App("my_jupyter_hub")
app.image = (
    modal.Image.debian_slim()
    .apt_install(
        [
            "curl",
            "npm",
            "nodejs",
            "build-essential",
        ]
    )
    .pip_install(
        "jupyterlab",
        "matplotlib",
        "pandas",
        "numpy",
        "sqlalchemy",
        "scipy",
    )
)


# JUPYTER LAB SERVER
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


@app.function(
    timeout=1_500,
    volumes={
        AWS_MOUNT_PATH: modal.CloudBucketMount(
            bucket_name="kinesinlms",
            read_only=True,
            secret=s3_secret,
        ),
        RESOURCES_PATH: resources_volume,
    },
)
def run_jupyter_lab(
    q,
    notebook_filename=None,
    resources: Optional[List[Dict]] = None,
    extra_pip_packages: Optional[List[str]] = None,
    access_token=None,
):
    """
    Start a Jupyter Lab server and pass its URL back on the queue.

    Args:
        q (modal.Queue):            A queue to pass the URL back to the caller.
        notebook_filename (str):    The name of the notebook file to open
                                    in Jupyter Lab. This file should be present
                                    in the AWS_BLOCK_RESOURCES_PATH directory on S3.
        resources:                  A list of resources that accompany the notebook.
                                    Resources are stored in the same 'block_resources'
                                    directory as the notebook file.
                                    Each dictionary has two keys:
                                    - 'filename': the name of the file, may include subdirs
                                    - 'type': type of the resource (e.g. 'CSV', 'SQLITE', ...)
        extra_pip_packages:         Extra pip packages to install before starting the server.
        access_token (str):         A secret token to authenticate the user
                                    (generated if not provided).

    Returns:
        Nothing. The URL is passed back to the caller via the `q` queue.
    """
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    # Avoid mutable default arguments; normalize here instead.
    resources = resources or []
    extra_pip_packages = extra_pip_packages or []

    if not access_token:
        access_token = token_urlsafe(16)

    if extra_pip_packages:
        print(f"run_jupyter(): Installing packages: {extra_pip_packages}")
        try:
            subprocess.check_call(
                [
                    "pip",
                    "install",
                    "--user",
                    "--quiet",
                    "--no-cache-dir",
                    *extra_pip_packages,
                ]
            )
            logging.info("Package installation completed successfully")
        except subprocess.CalledProcessError as e:
            logging.error(f"Failed to install packages: {e}")
            raise
        print("  - done installing packages.")

    print(f"Starting Jupyter Lab. notebook_filename: {notebook_filename}")

    # Create config and workspace directories if they don't exist
    os.makedirs("/root/.jupyter", exist_ok=True)
    workspace_dir = Path("/root/workspace")
    os.makedirs(workspace_dir, exist_ok=True)

    local_notebook_path = None
    if notebook_filename:
        # HANDLE INCOMING NOTEBOOK
        # Copy notebook file if provided so that it's available in the workspace
        # immediately upon starting Jupyter Lab.
        try:
            s3_path = AWS_BLOCK_RESOURCES_PATH / notebook_filename
            local_notebook_path = workspace_dir / notebook_filename
            print(
                f"run_jupyter(): Copying notebook from {s3_path} to {local_notebook_path}"
            )
            shutil.copy2(s3_path, local_notebook_path)
            print(
                f"run_jupyter()  - successfully copied notebook from {s3_path} to "
                f"{local_notebook_path}"
            )

        except Exception as e:
            logger.error(f"Error copying notebook file: {e}")
            raise

    if resources:
        # HANDLE INCOMING KINESINLMS RESOURCES
        # Store them in the attached volume so they're available to the notebook
        print("run_jupyter(): Processing resources...")
        os.makedirs(RESOURCES_PATH, exist_ok=True)
        for resource in resources:
            print(
                f"run_jupyter():  - resource type: {resource.get('type')} "
                f"filename: {resource.get('filename')}"
            )
            try:
                filename = resource.get("filename")
                resource_type = resource.get("type")

                if not filename or not resource_type:
                    logger.warning(
                        f"run_jupyter(): Skipping resource with missing "
                        f"filename or type: {resource}"
                    )
                    continue

                try:
                    validate_resource(
                        filename=filename,
                        resource_type=resource_type,
                        aws_resource_path=AWS_BLOCK_RESOURCES_PATH,
                    )
                except Exception as e:
                    logger.warning(f"Skipping invalid resource: {filename} : {e}")
                    continue

                s3_path = AWS_BLOCK_RESOURCES_PATH / filename

                # For SQLite and CSV files, copy to the workspace to match
                # the notebook's working directory.
                if resource_type in ["SQLITE", "CSV"]:
                    # Extract just the filename without path
                    base_filename = Path(filename).name
                    dest_path = workspace_dir / base_filename
                    print(
                        f"  - copying {resource_type} file "
                        f"from {s3_path} "
                        f"to {dest_path}"
                    )
                    shutil.copy2(s3_path, dest_path)
                    os.chmod(dest_path, 0o640)  # Ensure file permissions allow rw
                    print(
                        f"run_jupyter():   - successfully copied "
                        f"{resource_type} file: {base_filename}"
                    )
                else:
                    print(f"run_jupyter(): Unsupported resource type: {resource_type}")

            except Exception as e:
                logger.error(f"run_jupyter(): Error copying resource {filename}: {e}")
                raise
    else:
        print("run_jupyter(): No resources!")

    for resource in resources:
        resource_type = resource.get("type")
        if resource_type == "SQLITE":
            # Even though we already saved the sqlite file to the workspace,
            # we still need to store the actual database file that sqlite
            # creates when it opens the database. We'll store it in the
            # resources directory so that it's available to the notebook.
            os.environ["SQLITE_DATABASE_PATH"] = str(RESOURCES_PATH / "database.db")

    config_jupyterlab(notebook_filename=notebook_filename)

    # Disable announcements extension
    try:
        subprocess.run(
            [
                "jupyter",
                "labextension",
                "disable",
                "@jupyterlab/apputils-extension:announcements",
            ],
            check=True,
            capture_output=True,
        )
        print("Successfully disabled JupyterLab announcements extension")
    except subprocess.CalledProcessError as e:
        logger.error(f"Failed to disable announcements extension: {e}")
        # Continue anyway as this is not critical

    print("Starting Jupyter Lab as tunnel...")
    with modal.forward(JUPYTER_PORT) as tunnel:
        url = tunnel.url + "/?token=" + access_token

        q.put(url, block=False)
        print(f"run_jupyter(): Starting Jupyter at {url}")

        subprocess.run(
            [
                "jupyter",
                "lab",
                "--no-browser",
                "--allow-root",
                "--ip=0.0.0.0",
                f"--port={JUPYTER_PORT}",
                "--LabApp.allow_origin='*'",
                "--LabApp.allow_remote_access=1",
            ],
            env={**os.environ, "JUPYTER_TOKEN": access_token, "SHELL": "/bin/bash"},
        )


# SPAWN JUPYTER LAB (FAKE JUPYTER HUB)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@app.function()
def spawn_jupyter_lab_server(
    notebook_filename=None,
    extra_pip_packages: Optional[List[str]] = None,
    resources: Optional[List[Dict]] = None,
    access_token=None,
):
    """
    Spawn a Jupyter Lab server and return the URL.

    Args:
        notebook_filename: str      The filename of an .ipynb file stored
                                    in the 'block_resources' directory on S3.
        extra_pip_packages: list    A list of extra pip packages to install.
        resources: list             A list of KinesinLMS 'resources' that
                                    accompany the notebook (e.g. a SQLite file).
        access_token: str           A secret token to authenticate the user.

    Raises:
        Exception

    Returns:
        The URL of the Jupyter Lab server.
    """

    # Avoid mutable default arguments; normalize here instead.
    extra_pip_packages = extra_pip_packages or []
    resources = resources or []

    # TODO: do some real validation on the secret or bearer token
    is_valid = True

    if is_valid:
        with modal.Queue.ephemeral() as q:
            print(
                f"spawn_jupyter(): spawning Jupyter Lab with "
                f"notebook {notebook_filename} "
                f"resources : {resources}."
            )
            if extra_pip_packages:
                print(f"  -  Extra pip packages: {extra_pip_packages}")

            print("spawn_jupyter(): spawning Jupyter Lab...")
            run_jupyter_lab.spawn(
                q,
                notebook_filename=notebook_filename,
                extra_pip_packages=extra_pip_packages,
                resources=resources,
                access_token=access_token,
            )
            print("spawn_jupyter(): Getting Jupyter Lab url...")
            url = q.get()
            print(f"spawn_jupyter(): Jupyter is ready at {url}")
            return url
    else:
        logger.error("spawn_jupyter_lab_server(): access token failed validation.")
        raise Exception("Not authenticated")

When a student views a unit on KinesinLMS that has a related notebook, they see a “Launch Jupyter Notebook” button. That button calls a Django view to launch the notebook and return the server URL. The view uses a simple service class that asks modal.com to run our pre-built function, which spawns the server and returns the URL.

def get_launch_url(
    self,
    *args,
    notebook_filename: str = None,
    resources: List[Dict] = None,
    **kwargs,
):
    """
    Ask modal.com to spawn a JupyterLab server for the given
    notebook and return the launch URL.
    """

    logger.info("Launching jupyter notebook in Modal.com...")

    resources = resources or []
    extra_pip_packages = []  # e.g. ["pynbody"]

    spawn_jupyter_remote_function = modal.Function.lookup(
        "my_jupyter_hub",
        "spawn_jupyter_lab_server",
    )

    # Throttle: don't spin up more servers than we're willing to pay for.
    function_stats: modal.functions.FunctionStats = (
        spawn_jupyter_remote_function.get_current_stats()
    )
    if function_stats.num_total_runners >= self.throttle:
        raise TooManyNotebooksError(
            "Too many notebooks are currently running. Please try again later."
        )

    launch_url = spawn_jupyter_remote_function.remote(
        notebook_filename=notebook_filename,
        extra_pip_packages=extra_pip_packages,
        resources=resources,
    )
    logger.info(f"Jupyter notebook launched successfully. URL: {launch_url}")
    return launch_url
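
For completeness, here's roughly what the Django view behind the button might look like. This is a hypothetical sketch: NotebookService stands in for the service class that owns get_launch_url above, and the kwarg names are placeholders, not actual KinesinLMS code.

from django.contrib.auth.mixins import LoginRequiredMixin
from django.http import JsonResponse
from django.views import View


class LaunchJupyterNotebookView(LoginRequiredMixin, View):
    def post(self, request, *args, **kwargs):
        service = NotebookService()  # hypothetical wrapper around get_launch_url()
        try:
            launch_url = service.get_launch_url(
                notebook_filename=kwargs.get("notebook_filename"),
                resources=[],
            )
        except TooManyNotebooksError as e:
            return JsonResponse({"error": str(e)}, status=429)
        return JsonResponse({"launch_url": launch_url})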

And with that, a basic system for showing notebooks to students! All very preliminary, and lots still to define. But it's a simple way to let a course author attach notebooks to course content and just have them appear for students, without having to worry about setting up ZTJH.