Files
jupyterhub/docs/source/explanation/concepts.md
Richard Darst f3cc79e453 what-is-jupyterhub: Fix links
- Apparently recommonmark does intelligently uses links like
  sphinx+rst, and you shouldn't use `.html` on the links.
2024-06-10 18:25:53 -07:00

466 lines
21 KiB
Markdown

# What is Jupyter and JupyterHub?
JupyterHub is not what you think it is. Most things you think are
part of JupyterHub are actually handled by some other component, for
example the spawner or notebook server itself, and it's not always
obvious how the parts relate. The knowledge contained here hasn't
been assembled in one place before, and is essential to understand
when setting up a sufficiently complex Jupyter(Hub) setup.
This document was originally written to assist in debugging: very
often, the actual problem is not where one thinks it is and thus
people can't easily debug. In order to tell this story, we start at
JupyterHub and go all the way down to the fundamental components of
Jupyter.
In this document, we occasionally leave things out or bend the truth
where it helps in explanation, and give our explanations in terms of
Python even though Jupyter itself is language-neutral. The "(&)"
symbol highlights important points where there is more.
This guide is long, but after reading it you will be know of all major
components in the Jupyter ecosystem and everything else you read
should make sense.
## Just what is Jupyter?
Before we get too far, let's remember what our end goal is. A
**Jupyter Notebook** is really nothing more than a Python(&) process
which is getting commands from a web browser and displaying the output
via that browser. What the process actually sees can roughly like
getting commands on standard input(&) and writing to standard
output(&). There is nothing intrinsically special about this process
- it can do anything a normal Python process can do, and nothing more.
The **Jupyter kernel** handles capturing output and converting things
such as graphics to a form usable by the browser.
Everything we explain below is building up to this, going through many
different layers which give you many ways of customizing how this
process runs. But this process is not *too* special.
## JupyterHub
**JupyterHub** is the central piece that provides multi-user
login. Despite this, the end user only briefly interacts with
JupyterHub and most of the actual Jupyter session does not relate to
the hub at all: the hub mainly handles authentication and spawning the
single-user server. In short, anything which is related to *starting*
the user's workspace/environment is about JupyterHub, anything about
*running* usually isn't.
If you have problems connecting the authentication, spawning, and the
proxy (explained below), the issues is usually with JupyterHub. To
debug, JupyterHub has extensive logs which get printed to its console
and can be used to discover most problems.
The main pieces of JupyterHub are:
### Authenticator
JupyterHub itself doesn't actually manage your users. It has a
database of users, but it is usually connected with some other system
that manages the usernames and passwords. When someone tries to log
in to JupyteHub, it just asks the
**authenticator**([basics](authenticators-users-basics),
[reference](../reference/authenticators)) if the
username/password is valid(&). The authenticator can also return user
groups and admin status of users, so that JupyterHub can do some
higher-level management. The authenticator returns a username(&),
which is passed on to the spawner, which has to use it to start that
user's environment.
The following authenticators are included with JupyterHub:
- **PAMAuthenticator** uses the standard Unix/Linux operating system
functions to check users. Roughly, if someone already has access to
the machine (they can log in by ssh), they will be able to log in to
JupyterHub without any other setup. Thus, JupyterHub fills the role
of a ssh server, but providing a web-browser based way to access the
machine.
But those are fairly limited, and thus there are [plenty of others to
choose
from](https://github.com/jupyterhub/jupyterhub/wiki/Authenticators).
You can connect to almost any other existing service to manage your
users. You either use all users from this other service (e.g. your
company), or whitelist only the allowed users (e.g. your group's
Github usernames). Some other popular authenticators include:
- **OAuthenticator** uses the standard OAuth protocol to verify users.
For example, you can easily use Github to authenticate your users -
people have a "click to login with Github" button. This is often
done with a whitelist to only allow certain users.
- **NativeAuthenticator** actually stores and validates its own
usernames and passwords, unlike most other authenticators. Thus,
you can manage all your users within JupyterHub only.
- There are authenticators for LTI (learning management systems),
Shibboleth, Kerberos - and so on.
The authenticator is configured with the
`c.JupyterHub.authenticator_class` configuration option in the
`jupyterhub_config.py` file.
The authenticator runs internally to the Hub process but communicates
with outside services.
If you have trouble logging in, this is usually a problem of the
authenticator. The authenticator logs are part of the the JupyterHub
logs, but there may also be relevant information in whatever external
services you are using.
### Spawner
The **spawner** ([basics](spawners-basics),
[reference](../reference/spawners)) is the real core of
JupyterHub: when someone wants a notebook server, it allocates
resources and starts the server. The notebook server could run on the
same server as JupyterHub, on another server, on some cloud service,
or even more. They can limit resources (CPU, memory) or isolate users
from each other - if the spawner supports it. They can also do no
limiting and allow any user to access any other user's files if they
are not configured properly.
Some basic spawners included in JupyterHub is:
- **LocalProcessSpawner** is build into JupyterHub and basically tries
to switch user to the given username (`su` (&)) and start the
notebook server. It requires that the hub be run as root (because
only root has permission to start processes as other user IDs).
LocalProcessSpawner is no different than a user logging in with
something like `ssh` and running something. PAMAuthenticator and
LocalProcessSpawner is the most basic way of using JupyterHub (and
what it does out of the box) and makes the hub not too dissimilar to
an advanced ssh server.
There are many more advanced spawners:
- **SudoSpawner** is like LocalProcessSpawner but lets you run
JupyterHub without root. `sudo` has to be configured to allow the
hub's user to run processes under other user IDs.
- **SystemdSpawner** uses Systemd to start other processes. It can
isolate users from each other and provide resource limiting.
- **DockerSpawner** runs stuff in Docker, a containerization system.
This lets you fully isolate users, limit CPU, memory, and provide
other container images to fully customize the environment.
- **KubeSpawner** runs on the Kubernetes, a cloud orchestration
system. The spawner can easily limit users and provide cloud
scaling - but the spawner doesn't actually do that, Kubernetes
does. The spawner just tells Kubernetes what to do. If you want to
get KubeSpawner to do something, first you would figure out how to
do it in Kubernetes, then figure out how to tell KubeSpawner to tell
Kubernetes that. Actually... this is true for most spawners.
- **BatchSpawner** runs on computer clusters with batch job scheduling
systems (e.g Slurm, HTCondor, PBS, etc). The user processes are run
as batch jobs, having access to all the data and software that the
users normally will.
In short, spawners are the interface to the rest of the operating
system, and to configure them right you need to know a bit about how
the corresponding operating system service works.
The spawner is responsible for the environment of the single-user
notebook servers (described in the next section). In the end, it just
makes a choice about how to start these processes: for example, the
Docker spawner starts a normal Docker container and runs the right
command inside of it. Thus, the spawner is responsible for setting
what kind of software and data is available to the user.
The spawner runs internally to the Hub process but communicates with
outside services. It is configured by `c.JupyterHub.spawner_class` in
`jupyterhub_config.py`.
If a user tries to launch a notebook server and it doesn't work, the
error is usually with the spawner or the notebook server (as described
in the next section). Each spawner outputs some logs to the main
JupyterHub logs, but may also have logs in other places depending on
what services it interacts with (for example, the Docker spawner
somehow puts logs in the Docker system services, Kubernetes through
the `kubectl` API).
### Proxy
Previously, we said that the hub is between the user's web browser and
the user's notebook servers. It actually isn't directly between,
because the JupyterHub **proxy** relays connections between the users
and their single-user notebook servers. What this basically means is
that the hub itself can shut down, and if the proxy can continue to
allow users to communicate with their notebook servers. (This just
further emphasizes that the hub is responsible for starting, not
running, the notebooks). By default, the hub starts the proxy
automatically (so that you don't realize there is a separate proxy)
and stops the proxy when the hub stops (so that connections get
interrupted). But when you [configure the proxy to run
separately](../reference/separate-proxy),
users connection will stay working even without the hub.
The default proxy is **ConfigurableHttpProxy** which is simple but
effective. A more advanced option is the **Traefik Proxy**, which
gives you redundancy and high-availability.
When users "connect to JupyterHub", they *always* first connect to the
proxy and the proxy relays the connection to the hub. Thus, the proxy
is responsible for SSL and accepting connections from the rest of the
internet. The user uses the hub to authenticate and start the server,
and then the hub connect back to the proxy to adjust the proxy routes
for the user's server (e.g. the web path `/user/someone` redirects to
the server of someone at a certain internal address). The proxy has
to be able to internally connect to both the hub and all the
single-user servers.
The proxy always runs as a separate process to JupyterHub (even though
JupyterHub can start it for you). JupyterHub has one set of
configuration options for the proxy addresses (`bind_url`) and one for
the hub (`bind_url`). If `bind_url` is given, it is just passed to
the automatic proxy to tell it what to do.
If you have problems after users are redirected to their single-user
notebook servers, or making the first connection to the hub, it is
usually caused by the proxy. The ConfigurableHttpProxy's logs are
mixed with JupyterHub's logs if it's started through the hub (the
default case), otherwise from whatever system runs the proxy (if you
do configure it, you'll know).
### Services
JupyterHub has the concept of **services**
([basics](services-basics),
[reference](../reference/services)), which are other web services
started by the hub, but otherwise are not necessarily related to the
hub itself. They are often used to do things related to Jupyter
(things that user interacts with, usually not the hub), but could
always be run some other way. Running from the hub provides an easy
way to get Hub API tokens and authenticate users against the hub. It
can also automatically add a proxy route to forward web requests to
that service.
A common example of a service is the [cull idle
servers](https://jupyterhub.readthedocs.io/en/stable/getting-started/services-basics.html#real-world-example-to-cull-idle-servers)
script. When started by the hub, it automatically gets admin API
tokens. It uses the API to list all running servers, compare against
activity timeouts, and shut down servers exceeding the limits. Even
though this is an intrinsic part of JupyterHub, it is only loosely
coupled and running as a service provides convenience of
authentication - it could be just as well run some other way, with a
manually provided API token.
Another example of an often-requested question of *sharing files using
hubshare* as an example. Hubshare would work as an external service
which user notebooks talk to and use Hub authentication, but otherwise
it isn't directly a matter of the hub. You could equally well share
files by other extensions to the single-user notebook servers or
configuring the spawners to access shared storage spaces. In order to
use something such as hubshare, the difficulty is not modifying
JupyterHub: it is modifying the notebook servers to speak to some
service, and making that service.
The configuration option `c.JupyterHub.services` is used to start
services from the hub.
When a service is started from JupyterHub automatically, its logs are
included in the JupyterHub logs.
## Single-user notebook server
The **single-user notebook server** is the same thing you get by
running `jupyter notebook` or `jupyter lab` from the command line -
the actual Jupyter user interface for a single person.
The role of the spawner is to start this server - basically, running
the command `jupyter notebook`. Actually it doesn't run that, it runs
`jupyterhub-singleuser` which first communicates with the hub to say
"I'm alive" before running a completely normal Jupyter server. The
single-user server can be JupyterLab or classic notebooks. By this
point, the hub is almost completely out of the picture (the web
traffic is going through proxy unchanged). Also by this time, the
spawner has already decided the environment which this single-user
server will have and the single-user server has to deal with that.
The spawner starts the server using `jupyterhub-singleuser` with some
environment variables like `JUPYTERHUB_API_TOKEN` and
`JUPYTERHUB_BASE_URL` which tell the single-user server how to connect
back to the hub in order to say that it's ready.
The single-user server options are **JupyterLab** and **classic
Jupyter Notebook**. Really, there isn't that much difference between
them, they run through the same backend server process and the web
frontend is an option when it is starting. The spawner can choose the
command line when it starts the single-user server. Extensions are a
property of the single-user server (in two parts: there can be a part
that runs in the Python server process, and parts that run in
javascript in lab or notebook).
If one wants to install software for users, it is not a matter of
"installing it for JupyerHub" - it's a matter of installing it for the
single-user server, which might be the same environment as the hub,
but not necessarily. (Actually, see below - it's a matter of the
kernels!)
After the single-user notebook server is started, any errors are only
an issue of the single-user notebook server. Sometimes, it seems like
the spawner is failing, but really the spawner is working but the
single-user notebook server dies right away (in this case, you need to
find the problem with the single-user server and adjust the spawner to
start it correctly or fix the environment). This can happen, for
example, if the spawner doesn't set an environment variable or doesn't
provide storage.
The single-user server's logs are handled by the spawner, so if you
notice problems at this phase you need to check your spawner for
instructions for accessing the single-user logs. For example, the
LocalProcessSpawner logs are just outputted to the same JupyterHub
output logs (TODO is this correct?), the SystemdSpawner logs are
written to the Systemd journal, Docker and Kubernetes logs are written
to Docker and Kubernetes respectively, and batchspawner output goes to
the normal output places of batch jobs and is an explicit
configuration option of the spawner.
### Notebook
**(Jupyter) Notebook** is the classic interface, where each notebook
opens in a separate tab. It is traditionally started by `jupyter
notebook`.
Does anything need to be said here?
### Lab
**JupyterLab** is the new interface, where multiple notebooks are
openable in the same tab in an IDE-like environment. It is
traditionally started with `jupyter lab`. Both Notebook and Lab use
the same `.ipynb` file format.
JupyterLab is run thorugh the same server file, but at a path `/lab`
instead of `/tree`. Thus, they can be active at the same time in the
backend and you can switch between them at runtime by changing your
URL path.
Extensions need to be re-written for JupyterLab (if moving from
classic notebooks). But, the server-side of the extensions can be
shared by both.
## Kernel
Normally, our tour of the Jupyter ecosystem would stop here. But,
since if you've read this far you probably need to know every last
bit, let's go further and talk about the kernels. The commands you
run in the notebook session are not executed in the same process as
the notebook itself, but in a separate **Jupyter kernel**. There are [many
kernels
available](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
As a basic approximation, a **Jupyter kernel** is a process which
accepts commands (cells that are run) and returns the output to
Jupyter to display. One example is the **IPython Jupyter kernel**,
which runs Python. There is nothing special about it, it can be
considered a *normal Python process. The kernel process can be
approximated in UNIX terms as a process that takes commands on stdin
and returns stuff on stdout(&). Obviously, it's more because it has
to be able to disentangle all the possible outputs, such as figures,
and present it to the user in a web browser.
Kernel communication is via the the ZeroMQ protocol on the local
computer. Kernels are separate processes from the main single-user
notebook server (and thus obviously, different from the JupyterHub
process and everything else). By default (and unless you do something
special), kernels share the same environment as the notebook server
(data, resource limits, permissions, user id, etc.). But they *can*
run in a separate Python environment from the single-user server
(search `--prefix` in the [ipykernel installation
instructions](https://ipython.readthedocs.io/en/stable/install/kernel_install.html))
There are also more fancy techniques such as the [Jupyter Kernel
Gateway](https://jupyter-kernel-gateway.readthedocs.io/) and [Enterprise
Gateway](https://jupyter-enterprise-gateway.readthedocs.io/), which
allow you to run the kernels on a different machine and possibly with
a different environment.
A kernel doesn't just execute it's language - cell magics such as `%`,
`%%`, and `!` are a property of the kernel - in particular, these are
IPython kernel commands and don't necessarily work in any other
kernel unless they specifically support them.
What does this mean? There is yet *another* layer of configurability.
Each kernel can run a different programming language, with different
software, and so on. By default, they would run in the same
environment as the single-user notebook server, and the most common
other way they are configured is by
running in different Python virtual environments or conda
environments. They can be started and killed independently (there is
normally one per notebook you have open). The kernels is what uses
most of your memory and CPU when running Jupyter - the rest of the web
interface has a small footprint.
You can list your installed kernels with `jupyter kernelspec list`.
If you look at one of `kernel.json` files in those directories, you
will see exactly what command is run. These are normally
automatically made by the kernels, but can be edited as needed. [The
spec](https://jupyter-client.readthedocs.io/en/stable/kernels.html)
tells you even more.
The normally has to be reachable by the single-user notebook server
but the gateways mentioned above can get around that limitation.
If you get problems with "Kernel died" or some other error in a single
notebook but the single-user notebook server stays working, it is
usually a problem with the kernel. It could be that you are trying to
use more resources than you are allowed and the symptom is the kernel
getting killed. It could be that it crashes for some other reason.
In these cases, you need to find the kernel logs and investigate.
The debug logs for the kernel are normally mixed in with the
single-user notebook server logs.
## JupyterHub distributions
There are several "distributions" which automatically install all of
the things above and configure them for a certain purpose. They are
good ways to get started, but if you have custom needs, eventually it
may become hard to adapt them to your requirements.
* [**Zero to JupyterHub with
Kubernetes**](https://zero-to-jupyterhub.readthedocs.io/) installs
an entire scaleable system using Kubernetes. Uses KubeSpawner,
....Authenticator, ....
* [**The Littlest JupyterHub**](https://tljh.jupyter.org/) installs JupyterHub on a single system
using SystemdSpawner and NativeAuthenticator (which manages users
itself).
* [**JupyterHub the hard
way**](https://jupyterhub.readthedocs.io/en/stable/installation-guide-hard.html)
takes you through everything yourself. It is a natural companion to
this guide, since you get to experience every little bit.
## I want to...
TODO: answers to common cross-layer questions.
## What's next?
Now you know everything. Well, you know how everything relates, but
there are still plenty of details, implementations, and exceptions.
When setting up JupyterHub, the first step is to consider the above
layers, decide the right option for each of them, then begin putting
everything together.