what-is-jupyterhub: Full revision

2025-10-16 14:33:00 +00:00 · 2020-04-14 01:52:15 +03:00
parent fd2919b36f
commit cc0bc531d3
1 changed files with 196 additions and 134 deletions
--- a/docs/source/explanation/concepts.md
+++ b/docs/source/explanation/concepts.md
@@ -1,16 +1,22 @@
 # What is Jupyter and JupyterHub?
 JupyterHub is not what you think it is.  Most things you think are
-part of JupyterHub are actually handled by some other component, and
+part of JupyterHub are actually handled by some other component, for
-it's not always obvious how the parts relate.  This document was
+example the spawner or notebook server itself, and it's not always
-originally written to assist in debugging: very often, the actual
+obvious how the parts relate.  The knowledge contained here hasn't
-problem is not where one thinks it is and thus people can't easily
+been assembled in one place before, and is essential to understand
-debug.  In order to tell this story, we start at JupyterHub and go all
+when setting up a sufficiently complex Jupyter(Hub) setup.
 the way down to the fundamental components of Jupyter.
-We occasionally leave things out or bend the truth where it helps in
+This document was originally written to assist in debugging: very
-explanation, and give our explanations in terms of Python even though
+often, the actual problem is not where one thinks it is and thus
-many other languages can be used instead.
+people can't easily debug.  In order to tell this story, we start at
 JupyterHub and go all the way down to the fundamental components of
 Jupyter.
 In this document, we occasionally leave things out or bend the truth
 where it helps in explanation, and give our explanations in terms of
 Python even though Jupyter itself is language-neutral.  The "(&)"
 symbol highlights important points where there is more.
 This guide is long, but after reading it you will know of all major
 components in the Jupyter ecosystem and everything else you read
@@ -20,15 +26,15 @@ should make sense.
 ## Just what is Jupyter?
-Before we get too far, let's remember what our end goal is.  A Jupyter
+Before we get too far, let's remember what our end goal is.  A
-Notebook is really nothing more than a Python process (or some
+**Jupyter Notebook** is really nothing more than a Python(&) process
-language) which is getting commands from a web browser and displaying
+which is getting commands from a web browser and displaying the output
-the output via a browser.  What the process actually sees can roughly
+via that browser.  What the process actually sees is roughly like
-be considered getting data on standard input and writing to standard
+getting commands on standard input(&) and writing to standard
-output (*).  There is nothing intrinsically special about this process
+output(&).  There is nothing intrinsically special about this process
 - it can do anything a normal Python process can do, and nothing more.
-The kernel handles capturing output and converting things like
+The **Jupyter kernel** handles capturing output and converting things
-graphics to a form usable by the browser.
+such as graphics to a form usable by the browser.
 Everything we explain below is building up to this, going through many
 different layers which give you many ways of customizing how this
@@ -39,36 +45,42 @@ process runs.  But this process is not *too* special.
 ## JupyterHub
 **JupyterHub** is the central piece that provides multi-user
-login. Despite this, the end user only briefly interacts with it and
+login. Despite this, the end user only briefly interacts with
-most of the actual Jupyter session does not relate to the hub at all.
+JupyterHub and most of the actual Jupyter session does not relate to
-In short, anything which is related to *starting* the user's workspace
+the hub at all: the hub mainly handles authentication and spawning the
-is about JupyterHub, anything about *running* usually isn't.
+single-user server.  In short, anything which is related to *starting*
 the user's workspace/environment is about JupyterHub, anything about
 *running* usually isn't.
 If you have problems with authentication, spawning, or the proxy
 (explained below), the issue is usually with JupyterHub.  To
 debug, JupyterHub has extensive logs which get printed to its console
 and can be used to discover most problems.
-JupyterHub consists of the main pieces below:
+The main pieces of JupyterHub are:
-### Authenticators
+### Authenticator
-JupyterHub itself doesn't actually (necessarily) manage your users.
+JupyterHub itself doesn't actually manage your users(&).  It has a
-It has a database of users, but it is usually connected with some
+database of users, but it is usually connected with some other system
-other system that manages the usernames and passwords.  When someone
+that manages the usernames and passwords.  When someone tries to log
-tries to log in to JupyteHub, it just asks the **authenticator** if
+in to JupyterHub, it just asks the
-the username/password is valid.  The authenticator can also return
+**authenticator**([basics](authenticators-users-basics.html),
-user groups and admin status of users, so that JupyterHub can roughly
+[reference](../reference/authenticators.html)) if the
-manage users to services.
+username/password is valid(&).  The authenticator can also return user
 groups and admin status of users, so that JupyterHub can do some
 higher-level management.  The authenticator returns a username(&),
 which is passed on to the spawner, which has to use it to start that
 user's environment.
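
The handoff described above (the authenticator validates credentials and returns a username, which the hub passes to the spawner) can be sketched as a toy model. This is not JupyterHub's actual API: `KNOWN_USERS`, `authenticate`, `spawn`, and `login` are all hypothetical names used only for illustration.

```python
from typing import Optional

# Hypothetical stand-in for an external user database; a real authenticator
# would ask PAM, an OAuth provider, or some other outside service instead.
KNOWN_USERS = {"alice": "correct horse", "bob": "battery staple"}


def authenticate(username: str, password: str) -> Optional[str]:
    """Return the username if the credentials are valid, else None."""
    if KNOWN_USERS.get(username) == password:
        return username
    return None


def spawn(username: str) -> str:
    """Pretend to start a single-user server and describe the result."""
    return f"single-user server started for {username}"


def login(username: str, password: str) -> str:
    """Toy version of the hub: authenticate first, then spawn."""
    name = authenticate(username, password)
    if name is None:
        return "login refused"
    # Only the username is handed on to the spawner.
    return spawn(name)
```

The point of the sketch is the narrow interface: the spawner never sees the password, only the name the authenticator returned.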
 The following authenticators are included with JupyterHub:
 - **PAMAuthenticator** uses the standard Unix/Linux operating system
  functions to check users.  Roughly, if someone already has access to
-  the machine (they can log in by ssh or otherwise), they will be able
+  the machine (they can log in by ssh), they will be able to log in to
-  to log in to JupyterHub automatically.  Thus, JupyterHub fills the
+  JupyterHub without any other setup.  Thus, JupyterHub fills the role
-  role of a ssh server, but providing a web-browser based way to
+  of an ssh server, while providing a web-browser based way to access the
-  access the machine.
+  machine.
 But those are fairly limited, and thus there are [plenty of others to
@@ -77,17 +89,16 @@ from](https://github.com/jupyterhub/jupyterhub/wiki/Authenticators).
 You can connect to almost any other existing service to manage your
 users.  You either use all users from this other service (e.g. your
 company), or whitelist only the allowed users (e.g. your group's
-Github users).  Some other popular authenticators include:
+GitHub usernames).  Some other popular authenticators include:
 - **OAuthenticator** uses the standard OAuth protocol to verify users.
  For example, you can easily use GitHub to authenticate your users -
  people have a "click to login with GitHub" button.  This is often
  done with a whitelist to only allow certain users.
- **NativeAuthenticator** actually stores its own usernames and
+- **NativeAuthenticator** actually stores and validates its own
-  passwords, unlike most other authenticators.  Thus, you can manage
+  usernames and passwords, unlike most other authenticators.  Thus,
-  all your users within JupyterHUb only.  (include one more example
+  you can manage all your users within JupyterHub only.
  here)
 - There are authenticators for LTI (learning management systems),
  Shibboleth, Kerberos - and so on.
@@ -100,15 +111,17 @@ The authenticator runs internally to the Hub process but communicates
 with outside services.
 If you have trouble logging in, this is usually a problem of the
-authenticator.  The authenticator debug information goes to the
+authenticator.  The authenticator logs are part of the JupyterHub
-JupyterHub logs, but there may also be hints in whatever external
+logs, but there may also be relevant information in whatever external
 services you are using.
-### Spawners
+### Spawner
-The **spawner** is the real core of JupyterHub: when someone wants a
+The **spawner** ([basics](spawners-basics.html),
-notebook server, it finds resources and starts the server.  It could
+[reference](../reference/spawners.html)) is the real core of
-run on the current server, on another server, on some cloud service,
+JupyterHub: when someone wants a notebook server, it allocates
 resources and starts the server.  The notebook server could run on the
 same server as JupyterHub, on another server, on some cloud service,
 or elsewhere.  Spawners can limit resources (CPU, memory) or isolate users
 from each other - if the spawner supports it.  They can also do no
 limiting and allow any user to access any other user's files if they
@@ -116,35 +129,41 @@ are not configured properly.
 The basic spawner included in JupyterHub is:
-**LocalProcessSpawner** is build in to JupyterHub and basically starts
+- **LocalProcessSpawner** is built into JupyterHub and basically tries
-tries to switch user to the given username and start Jupyter.  It
+  to switch user to the given username (`su` (&)) and start the
-requires that the hub be run as root (because only root has permission
+  notebook server.  It requires that the hub be run as root (because
-to start processes as other user IDs).  LocalProcessSpawner is no
+  only root has permission to start processes as other user IDs).
-different than a user logging in with something like `ssh` and running
+  LocalProcessSpawner is no different than a user logging in with
-jobs.  PAMAuthenticator and LocalProcessSpawner is the most basic way
+  something like `ssh` and running something.  PAMAuthenticator and
-of using JupyterHub (and what it does out of the box) and makes the
+  LocalProcessSpawner is the most basic way of using JupyterHub (and
-hub not too dissimilar to an advanced ssh server.
+  what it does out of the box) and makes the hub not too dissimilar to
  an advanced ssh server.
-There are many more advanced fancy spawners:
+There are many more advanced spawners:
 - **SudoSpawner** is like LocalProcessSpawner but lets you run
-  JupyterHub without root.  sudo has to be configured to allow the
+  JupyterHub without root.  `sudo` has to be configured to allow the
  hub's user to run processes under other user IDs.
 - **SystemdSpawner** uses Systemd to start other processes.  It can
-  isolate users from each other and provide some limits.
+  isolate users from each other and provide resource limiting.
 - **DockerSpawner** runs stuff in Docker, a containerization system.
  This lets you fully isolate users, limit CPU, memory, and provide
-  other operating system images to fully customize the environment.
+  other container images to fully customize the environment.
 - **KubeSpawner** runs on Kubernetes, a cloud orchestration
  system.  The spawner can easily limit users and provide cloud
-  scaling - but the spawner doesn't actually do that, Kubernetes does.
+  scaling - but the spawner doesn't actually do that, Kubernetes
  does.  The spawner just tells Kubernetes what to do.  If you want to
  get KubeSpawner to do something, first you would figure out how to
  do it in Kubernetes, then figure out how to tell KubeSpawner to tell
  Kubernetes that.  Actually... this is true for most spawners.
- **BatchSpawner** runs on computer clusters with batch queuing
+- **BatchSpawner** runs on computer clusters with batch job scheduling
-  systems.  The user processes are run as batch jobs, having access to
+  systems (e.g. Slurm, HTCondor, PBS, etc.).  The user processes are run
-  all the data and software that the users normally will.
+  as batch jobs, having access to all the data and software that the
  users normally have.
 In short, spawners are the interface to the rest of the operating
 system, and to configure them right you need to know a bit about how
@@ -166,24 +185,25 @@ error is usually with the spawner or the notebook server (as described
 in the next section).  Each spawner outputs some logs to the main
 JupyterHub logs, but may also have logs in other places depending on
 what services it interacts with (for example, the Docker spawner
-somehow puts logs in the Docker system services).
+puts logs in the Docker system services, while KubeSpawner's are available via
 the `kubectl` command-line tool).
 ### Proxy
-Previously, we said that the hub is between the user and the user's
+Previously, we said that the hub is between the user's web browser and
-notebook servers.  It actually isn't directly between, because the
+the user's notebook servers.  It actually isn't directly between,
-JupyterHub **proxy** relays connections between the users and their
+because the JupyterHub **proxy** relays connections between the users
-single-user notebook servers.  What this basically means is that the
+and their single-user notebook servers.  What this basically means is
-hub itself can shut down, and if the proxy can continue to allow users
+that the hub itself can shut down, and the proxy can continue to
-to communicate with their notebook servers.  (This just further
+allow users to communicate with their notebook servers.  (This just
-emphasizes that the hub is responsible for starting, not running, the
+further emphasizes that the hub is responsible for starting, not
-notebooks).  By default, the hub starts the proxy automatically (so
+running, the notebooks).  By default, the hub starts the proxy
-that you don't realize there is a separate proxy) and stops the proxy
+automatically (so that you don't realize there is a separate proxy)
-when the hub stops (so that connections get interrupted).  But when
+and stops the proxy when the hub stops (so that connections get
-you [configure the proxy to run
+interrupted).  But when you [configure the proxy to run
-separately](https://jupyterhub.readthedocs.io/en/stable/reference/separate-proxy.html),
+separately](../reference/separate-proxy.html),
-your users connections will stay working even without the hub.
+users' connections will stay working even without the hub.
 The default proxy is **ConfigurableHTTPProxy** which is simple but
 effective.  A more advanced option is the **Traefik Proxy**, which
@@ -192,11 +212,11 @@ gives you redundancy and high-availability.
 When users "connect to JupyterHub", they *always* first connect to the
 proxy and the proxy relays the connection to the hub.  Thus, the proxy
 is responsible for SSL and accepting connections from the rest of the
-internet.
+internet.  The user uses the hub to authenticate and start the server,
-
+and then the hub connects back to the proxy to adjust the proxy routes
-The hub has to connect to the proxy to adjust the routes (The web path
+for the user's server (e.g. the web path `/user/someone` is routed to
-`/user/someone` goes to the server of someone at a certain address).
+the server of someone at a certain internal address).  The proxy has
-The proxy has to be able to connect to both the hub and all the
+to be able to internally connect to both the hub and all the
 single-user servers.
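
The routing idea can be sketched as a small lookup table. This is a toy model, not the proxy's real implementation or API: the `resolve` function and both addresses are made up for illustration, and the sketch assumes the table always contains a default `/` route pointing at the hub.

```python
from typing import Dict


def resolve(routes: Dict[str, str], path: str) -> str:
    """Return the target of the longest route prefix that matches `path`.

    Assumes `routes` always contains the default route "/".
    """
    best = "/"
    for prefix in routes:
        # Match whole path segments only, so /user/someone does not
        # accidentally capture /user/someoneelse.
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return routes[best]


# Default route: everything goes to the hub (illustrative address).
routes = {"/": "http://127.0.0.1:8081"}

# After spawning, the hub adds a route for the user's server
# (illustrative internal address).
routes["/user/someone"] = "http://10.0.0.5:8888"
```

With this table, `resolve(routes, "/hub/login")` falls through to the hub, while `resolve(routes, "/user/someone/lab")` goes to the single-user server without involving the hub.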
 The proxy always runs as a separate process to JupyterHub (even though
@@ -210,26 +230,43 @@ notebook servers, or making the first connection to the hub, it is
 usually caused by the proxy.  The ConfigurableHTTPProxy's logs are
 mixed with JupyterHub's logs if it's started through the hub (the
 default case), otherwise from whatever system runs the proxy (if you
-do it, you'll know).
+do configure it, you'll know).
 ### Services
-JupyterHub has the concept of **services**, which are other web
+JupyterHub has the concept of **services**
-services started by the hub, but otherwise are not really related to
+([basics](services-basics.html),
-the hub itself.  They are often used to do things related to Jupyter
+[reference](../reference/services.html)), which are other web services
 started by the hub, but otherwise are not necessarily related to the
 hub itself.  They are often used to do things related to Jupyter
 (things that the user interacts with, usually not the hub), but could
 always be run some other way.  Running from the hub provides an easy
-way to get Hub API tokens and authenticate users against the hub.
+way to get Hub API tokens and authenticate users against the hub.  It
 can also automatically add a proxy route to forward web requests to
 that service.
-The configuration option `c.JupyterHub.services` (??) is used to start
+A common example of a service is the [cull idle
-services from the hub.
+servers](https://jupyterhub.readthedocs.io/en/stable/getting-started/services-basics.html#real-world-example-to-cull-idle-servers)
 script.  When started by the hub, it automatically gets admin API
 tokens.  It uses the API to list all running servers, compare against
 activity timeouts, and shut down servers exceeding the limits.  Even
 though this is an intrinsic part of JupyterHub, it is only loosely
 coupled and running as a service provides convenience of
 authentication - it could be just as well run some other way, with a
 manually provided API token.
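
The decision logic of such a culling service can be sketched in a few lines. This is a simplified illustration and not the actual script: the real service obtains the server list and activity times from the Hub REST API, whereas here they are passed in as a plain dictionary, and the name `servers_to_cull` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def servers_to_cull(
    last_activity: Dict[str, datetime],
    now: datetime,
    timeout: timedelta,
) -> List[str]:
    """Return the names of servers idle for longer than `timeout`."""
    return sorted(
        name for name, seen in last_activity.items() if now - seen > timeout
    )


# Example data (made up): two users, one of them idle for three hours.
now = datetime(2020, 4, 14, 12, 0, 0)
activity = {
    "alice": now - timedelta(minutes=5),
    "bob": now - timedelta(hours=3),
}
idle = servers_to_cull(activity, now, timeout=timedelta(hours=1))
```

Everything else the service does (listing servers, shutting them down) is API calls made with the token it was given, which is why it can just as well run outside the hub.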
-Let's use the often-requested question of *sharing files using
+Next, take the often-requested feature of *sharing files using
 hubshare* as an example.  Hubshare would work as an external service
 which user notebooks talk to and use Hub authentication, but otherwise
 it isn't directly a matter of the hub.  You could equally well share
 files by other extensions to the single-user notebook servers or
-configuring the spawners to access shared storage spaces.
+configuring the spawners to access shared storage spaces.  In order to
 use something such as hubshare, the difficulty is not modifying
 JupyterHub: it is modifying the notebook servers to speak to some
 service, and making that service.
 The configuration option `c.JupyterHub.services` is used to start
 services from the hub.
 When a service is started from JupyterHub automatically, its logs are
 included in the JupyterHub logs.
@@ -243,15 +280,14 @@ running `jupyter notebook` or `jupyter lab` from the command line -
 the actual Jupyter user interface for a single person.
 The role of the spawner is to start this server - basically, running
-the command `jupyter notebook`.
+the command `jupyter notebook`.  Actually it doesn't run that, it runs
-Actually it doesn't run that, it runs `jupyterhub-singleuser` which
+`jupyterhub-singleuser` which first communicates with the hub to say
-first communicates with the hub to say "I'm alive" before running a
+"I'm alive" before running a completely normal Jupyter server.  The
-completely normal Jupyter server.  The single-user server can be
+single-user server can be JupyterLab or classic notebooks.  By this
-JupyterLab or classic notebooks.  By this point, the hub is almost
+point, the hub is almost completely out of the picture (the web
-completely out of the picture (the web traffic is going through proxy
+traffic is going through proxy unchanged).  Also by this time, the
-unchanged).  By this time, the spawner has already decided the
+spawner has already decided the environment which this single-user
-environment which this single-user server will have and the
+server will have and the single-user server has to deal with that.
 single-user server has to deal with that.
 The spawner starts the server using `jupyterhub-singleuser` with some
 environment variables like `JUPYTERHUB_API_TOKEN` and
@@ -264,16 +300,23 @@ them, they run through the same backend server process and the web
 frontend is an option when it is starting.  The spawner can choose the
 command line when it starts the single-user server.  Extensions are a
 property of the single-user server (in two parts: there can be a part
-that runs in server process, and parts that run in javascript in lab
+that runs in the Python server process, and parts that run in
-or notebook).
+JavaScript in lab or notebook).
 If one wants to install software for users, it is not a matter of
 "installing it for JupyterHub" - it's a matter of installing it for the
 single-user server, which might be the same environment as the hub,
 but not necessarily.  (Actually, see below - it's a matter of the
 kernels!)
 After the single-user notebook server is started, any errors are only
 an issue of the single-user notebook server.  Sometimes, it seems like
 the spawner is failing, but really the spawner is working but the
 single-user notebook server dies right away (in this case, you need to
 find the problem with the single-user server and adjust the spawner to
-start it correctly).  This can happen, for example, if the spawner
+start it correctly or fix the environment).  This can happen, for
-doesn't set an environment variable or doesn't provide storage.
+example, if the spawner doesn't set an environment variable or doesn't
 provide storage.
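
The failure mode described above (the spawner works, but the server dies right away because its environment is incomplete) can be reproduced with a stand-in process. This is only a sketch: a small child Python process plays the role of `jupyterhub-singleuser`, and `run_server_stub` is a hypothetical helper. Only the variable name `JUPYTERHUB_API_TOKEN` comes from JupyterHub itself.

```python
import os
import subprocess
import sys

# Stand-in for the single-user server: exits with an error if the
# variable it depends on is missing from its environment.
STUB = (
    "import os, sys; "
    "sys.exit(0 if 'JUPYTERHUB_API_TOKEN' in os.environ else 1)"
)


def run_server_stub(env: dict) -> int:
    """Start the stub the way a spawner would and return its exit code."""
    return subprocess.run([sys.executable, "-c", STUB], env=env).returncode


# Environment as a misconfigured spawner might leave it: no token.
incomplete = {k: v for k, v in os.environ.items() if k != "JUPYTERHUB_API_TOKEN"}

# Environment as a working spawner would set it (placeholder value).
complete = dict(incomplete, JUPYTERHUB_API_TOKEN="not-a-real-token")

failed = run_server_stub(incomplete)
succeeded = run_server_stub(complete)
```

From the spawner's point of view, both calls started a process successfully. Only the exit code (and the child's own logs) show that the first one could never have worked.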
 The single-user server's logs are handled by the spawner, so if you
 notice problems at this phase you need to check your spawner for
@@ -289,21 +332,26 @@ configuration option of the spawner.
 ### Notebook
 **(Jupyter) Notebook** is the classic interface, where each notebook
-opens in a separate tab.
+opens in a separate tab.  It is traditionally started by `jupyter
 notebook`.
 Does anything need to be said here?
 ### Lab
 **JupyterLab** is the new interface, where multiple notebooks are
-openable in the same tab in an IDE-like environment.  JupyterLab is
+openable in the same tab in an IDE-like environment.  It is
-run thorugh the same server file, but at a path `/lab` instead of
+traditionally started with `jupyter lab`.  Both Notebook and Lab use
-`/tree`.
+the same `.ipynb` file format.
-Both Notebook and Lab use the same `.ipynb` file format.
+JupyterLab is run through the same server file, but at a path `/lab`
 instead of `/tree`.  Thus, they can be active at the same time in the
 backend and you can switch between them at runtime by changing your
 URL path.
-Does anything need to be said here?
+Extensions need to be re-written for JupyterLab (if moving from
- how extensions work in lab compared to notebook
+classic notebooks).  But, the server-side of the extensions can be
 shared by both.
@@ -313,30 +361,40 @@ Normally, our tour of the Jupyter ecosystem would stop here.  But,
 since if you've read this far you probably need to know every last
 bit, let's go further and talk about the kernels.  The commands you
 run in the notebook session are not executed in the same process as
-the notebook itself, but in a separate **kernel**.  There are [many
+the notebook itself, but in a separate **Jupyter kernel**.  There are [many
 kernels
 available](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels).
 As a basic approximation, a **Jupyter kernel** is a process which
 accepts commands (cells that are run) and returns the output to
 Jupyter to display.  One example is the **IPython Jupyter kernel**,
-which runs Python and adds the IPython magic functions (`%`, `%%`,
+which runs Python.  There is nothing special about it, it can be
-`!`, etc. commands).  There is nothing special about it, it can be
+considered a *normal Python process*.  The kernel process can be
-considered a *normal Python process*.  Like we said above, the kernel
+approximated in UNIX terms as a process that takes commands on stdin
-process can be approximated as a process that takes commands on stdin
+and returns stuff on stdout(&).  Obviously, it's more than that, because it has
-and returns stuff on stdout.  Actually, a kernel is more fancy,
+to be able to disentangle all the possible outputs, such as figures,
-because it can communicate over the network and add in magic commands.
+and present them to the user in a web browser.
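
The "commands in, output out" approximation can be made concrete with a toy. This is not how a real Jupyter kernel is implemented (real kernels speak a messaging protocol and handle rich output such as figures). The `run_cell` function is a hypothetical name, and the sketch only shows the idea of executing code in a persistent namespace and capturing what it prints.

```python
import contextlib
import io


def run_cell(source: str, namespace: dict) -> str:
    """Execute `source` in `namespace` and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)
    return buffer.getvalue()


# State persists between "cells" because the namespace is shared,
# just as variables persist between cells in a notebook.
ns = {}
run_cell("x = 21", ns)
out = run_cell("print(x * 2)", ns)
```

Here `out` holds the text the second cell printed, which is what the frontend would then display under that cell.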
 Kernel communication is via the ZeroMQ protocol on the local
 computer.  Kernels are separate processes from the main single-user
 notebook server (and thus obviously, different from the JupyterHub
 process and everything else).  By default (and unless you do something
 special), kernels share the same environment as the notebook server
-(data, resource limits, permissions, user id, etc.).  But there are
+(data, resource limits, permissions, user id, etc.).  But they *can*
-things like the Jupyter Kernel Gateway / Enterprise Gateway, which
+run in a separate Python environment from the single-user server
 (search `--prefix` in the [ipykernel installation
 instructions](https://ipython.readthedocs.io/en/stable/install/kernel_install.html)).
 There are also more fancy techniques such as the [Jupyter Kernel
 Gateway](https://jupyter-kernel-gateway.readthedocs.io/) and [Enterprise
 Gateway](https://jupyter-enterprise-gateway.readthedocs.io/), which
 allow you to run the kernels on a different machine and possibly with
 a different environment.
 A kernel doesn't just execute its language - magics such as `%`,
 `%%`, and `!` are a property of the kernel - in particular, these are
 IPython kernel commands and don't necessarily work in any other
 kernel unless they specifically support them.
 What does this mean?  There is yet *another* layer of configurability.
 Each kernel can run a different programming language, with different
 software, and so on.  By default, they would run in the same
@@ -345,8 +403,8 @@ other way they are configured is by
 running in different Python virtual environments or conda
 environments.  They can be started and killed independently (there is
 normally one per notebook you have open).  The kernels are what use
-most of your memory and CPU if you have large amounts of data open or
+most of your memory and CPU when running Jupyter - the rest of the web
-are using a lot of compute power.
+interface has a small footprint.
 You can list your installed kernels with `jupyter kernelspec list`.
 If you look at one of the `kernel.json` files in those directories, you
@@ -355,43 +413,47 @@ automatically made by the kernels, but can be edited as needed.  [The
 spec](https://jupyter-client.readthedocs.io/en/stable/kernels.html)
 tells you even more.
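
As an illustration of what a kernelspec contains, the snippet below parses a `kernel.json` of the shape typically written for the IPython kernel. The exact contents on your system may differ (for example, `argv[0]` is often an absolute path to a specific Python), and the connection file path used here is a made-up placeholder.

```python
import json

# A kernel.json of the typical shape for the IPython kernel; treat the
# concrete values as an example, not as what your installation contains.
KERNEL_JSON = """
{
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "Python 3",
  "language": "python"
}
"""

spec = json.loads(KERNEL_JSON)

# The notebook server fills in {connection_file} before launching the
# kernel; the path below is a placeholder for illustration.
command = [
    arg.replace("{connection_file}", "/tmp/kernel-example.json")
    for arg in spec["argv"]
]
```

Changing the first element of `argv` to the Python of another virtual or conda environment is what makes a kernel run in a different environment from the single-user server.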
-The kernel has to be reachable by the single-user notebook server.
+The kernel normally has to be reachable by the single-user notebook server,
 but the gateways mentioned above can get around that limitation.
 If you get problems with "Kernel died" or some other error in a single
 notebook but the single-user notebook server stays working, it is
 usually a problem with the kernel.  It could be that you are trying to
 use more resources than you are allowed and the symptom is the kernel
 getting killed.  It could be that it crashes for some other reason.
 In these cases, you need to find the kernel logs and investigate.
 The debug logs for the kernel are normally mixed in with the
 single-user notebook server logs.
-### JupyterHub distributions
+## JupyterHub distributions
 There are several "distributions" which automatically install all of
 the things above and configure them for a certain purpose.  They are
-good ways to get started, but if you are doing very custom things
+good ways to get started, but if you have custom needs, eventually it
-eventually it may become hard to adapt them to your needs.
+may become hard to adapt them to your requirements.
-* **Zero to JupyterHub with Kubernetes** installs an entire scaleable
+* [**Zero to JupyterHub with
-  system using Kubernetes.  Uses KubeSpawner, ....Authenticator, ....
+  Kubernetes**](https://zero-to-jupyterhub.readthedocs.io/) installs
  an entire scalable system using Kubernetes.  Uses KubeSpawner,
  ....Authenticator, ....
-* **The Littlest JupyterHub** installs JupyterHub on a single system
+* [**The Littlest JupyterHub**](https://tljh.jupyter.org/) installs JupyterHub on a single system
  using SystemdSpawner and NativeAuthenticator (which manages users
  itself).
-* **JupyterHub the hard way** takes you through everything yourself.
+* [**JupyterHub the hard
-  It is a natural companion to this guide, since you get to experience
+  way**](https://jupyterhub.readthedocs.io/en/stable/installation-guide-hard.html)
-  every little bit.
+  takes you through everything yourself.  It is a natural companion to
  this guide, since you get to experience every little bit.
 ## I want to...
-**Share files between users**.  Spawner to share data, or
+TODO: answers to common cross-layer questions.
 JupyterNotebook/Lab user interface + some service for distributing
 files.
 ## What's next?
@@ -399,5 +461,5 @@ files.
 Now you know everything.  Well, you know how everything relates, but
 there are still plenty of details, implementations, and exceptions.
 When setting up JupyterHub, the first step is to consider the above
-layers and see what options are suitable for you.  Then, put
+layers, decide the right option for each of them, then begin putting
 everything together.