`docs/source/explanation/admin/capacity-planning.md` (new file)

# Capacity planning

General capacity planning advice for JupyterHub is hard to give,
because it depends almost entirely on what your users are doing,
and what JupyterHub users do varies _wildly_ in terms of resource consumption.

**There is no single answer to "I have X users, what resources do I need?" or "How many users can I support with this machine?"**

Here are three _typical_ Jupyter use patterns that require vastly different resources:

- **Learning**: negligible resources because computation is mostly idle,
  e.g. students learning programming for the first time
- **Production code**: very intense, sustained load, e.g. training machine learning models
- **Bursting**: _mostly_ idle, but needs a lot of resources for short periods of time
  (interactive research often looks like this)

But just because there's no single answer doesn't mean we can't help.
So we have gathered here some useful information to help you make your decisions
about what resources you need based on how your users work,
including the relative invariants in terms of resources that JupyterHub itself needs.

## JupyterHub infrastructure

JupyterHub consists of a few components that are always running.
These take up very few resources,
especially relative to the resources consumed by users once you have more than a few.

As an example, an instance of mybinder.org (running JupyterHub 1.5.0),
typically running ~100-150 users, has:

| Component | CPU (mean/peak) | Memory (mean/peak) |
| --------- | --------------- | ------------------ |
| Hub       | 4% / 13%        | 230 MB / 260 MB    |
| Proxy     | 6% / 13%        | 47 MB / 65 MB      |

So it would be pretty generous to allocate ~25% of one CPU core
and ~500MB of RAM to the overall JupyterHub infrastructure.

The rest is going to be up to your users.
Per-user overhead from JupyterHub is typically negligible
up to at least a few hundred concurrent active users.

```{figure} /images/mybinder-hub-components-cpu-memory.png
JupyterHub component resource usage for mybinder.org.
```

## Factors to consider

### Static vs elastic resources

A big factor in planning resources is:
**how much does it cost to change your mind?**
If you are using a single shared machine with local storage,
migrating to a new one because it turns out your users don't fit might be very costly.
You will have to get a new machine, set it up, and maybe even migrate user data.

On the other hand, if you are using ephemeral resources,
such as node pools in Kubernetes,
changing resource types costs close to nothing
because nodes can automatically be added or removed as needed.

Take that cost into account when you are picking how much memory or CPU to allocate to users.

Static resources (like [the-littlest-jupyterhub][]) provide **stable, predictable costs**,
while elastic resources (like [zero-to-jupyterhub][]) tend to provide **lower overall costs**
(especially when deployed with monitoring that allows cost optimization over time)
at the price of being **less predictable**.

[the-littlest-jupyterhub]: https://the-littlest-jupyterhub.readthedocs.io
[zero-to-jupyterhub]: https://z2jh.jupyter.org
(limits-requests)=

### Limit vs Request for resources

Many scheduling tools like Kubernetes have two separate ways of allocating resources to users.
A **Request** or **Reservation** describes how much of a resource is _set aside_ for each user.
Often, this doesn't have any practical effect other than deciding when a given machine is considered 'full'.
If you are using expandable resources like an autoscaling Kubernetes cluster,
a new node must be launched and added to the pool if you 'request' more resources than fit on currently running nodes (a cluster **scale-up event**).
If you are running on a single VM, this describes how many users you can run at the same time, full stop.

A **Limit**, on the other hand, enforces a cap on how much of a resource any given user can consume.
For more information on what happens when users try to exceed their limits, see [](oversubscription).

In the strictest, safest case, you can have these two numbers be the same.
That means that each user is _limited_ to fit within the resources allocated to them.
This avoids **[oversubscription](oversubscription)** of resources (allowing use of more than you have available),
at the expense (in a literal, this-costs-money sense) of reserving lots of usually-idle capacity.

However, you will often find that a small fraction of users consume more resources than the rest.
In this case, you may give users limits that _go beyond the amount of resources requested_.
This is called **oversubscribing** the resources available to users.

Having a gap between the request and the limit means you can fit a number of _typical_ users on a node (based on the request),
but still limit how much a runaway user can gobble up for themselves.

(oversubscription)=

### Oversubscribed CPU is okay, running out of memory is bad

An important consideration when assigning resources to users is: **What happens when users need more than I've given them?**

A good summary to keep in mind:

> When tasks don't get enough CPU, things are slow.
> When they don't get enough memory, things are broken.

This means it's **very important that users have enough memory**,
but much less important that they always have exclusive access to all the CPU they can use.

This relates to [Limits and Requests](limits-requests),
because these are the consequences of your limits and/or requests not matching what users actually try to use.

A table of mismatched resource allocation situations and their consequences:

| issue | consequence |
| ----- | ----------- |
| Requests too high | Unnecessarily high cost and/or low capacity. |
| CPU limit too low | Poor performance experienced by users. |
| CPU oversubscribed (too-low request + too-high limit) | Poor performance across the system; may crash, if severe. |
| Memory limit too low | Servers killed by the Out-of-Memory (OOM) killer; lost work for users. |
| Memory oversubscribed (too-low request + too-high limit) | System memory exhaustion: all kinds of hangs, crashes, and weird errors. Very bad. |

Note that the 'oversubscribed' problem case is where the request is lower than _typical_ usage,
meaning that the total reserved resources aren't enough for the total _actual_ consumption.
This doesn't mean that _all_ your users exceed the request,
just that the _limit_ leaves enough room for the _average_ user to exceed the request.

All of these considerations are important _per node_.
Larger nodes mean more users per node, and therefore more users to average over.
It also means more chances for multiple outliers on the same node.

### Example case for oversubscribing memory

Take, for example, this system and sampling of user behavior:

- System memory = 8G
- memory request = 1G, limit = 3G
- typical 'heavy' user: 2G
- typical 'light' user: 0.5G

This will assign 8 users to those 8G of RAM (remember: only requests are used for deciding when a machine is 'full').
As long as those 8 users' total _actual_ usage is under 8G, everything is fine.
But the _limit_ allows a total of 24G to be used,
which would be a mess if everyone used their full limit.
But _not_ everyone uses the full limit, which is the point!

This pattern is fine if 1/8 of your users are 'heavy', because _typical_ usage will be ~0.7G,
and your total usage will be ~5.5G (`1 × 2 + 7 × 0.5 = 5.5`).

But if _50%_ of your users are 'heavy' you have a problem, because that means your users will be trying to use 10G (`4 × 2 + 4 × 0.5 = 10`),
which you don't have.
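
A quick sanity check of that arithmetic in plain Python (the numbers are the ones from this example; the function itself is just for illustration):

```python
# Expected memory use on one 8G node where the 1G request packs 8 users in,
# as a function of the heavy/light user mix from the example above.
def node_memory_usage(n_users=8, heavy_fraction=1 / 8, heavy_gb=2.0, light_gb=0.5):
    n_heavy = n_users * heavy_fraction
    return n_heavy * heavy_gb + (n_users - n_heavy) * light_gb

print(node_memory_usage(heavy_fraction=1 / 8))  # 5.5  -> fits in 8G
print(node_memory_usage(heavy_fraction=0.5))    # 10.0 -> exceeds 8G; trouble
```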

You can make guesses at these numbers, but the only _real_ way to get them is to measure (see [](measuring)).

### CPU:memory ratio

Most of the time, you'll find that only one resource is the limiting factor for your users.
Most often it's memory, but for certain tasks, it could be CPU (or even GPUs).

Many cloud deployments have just one or a few fixed ratios of CPU to memory
(e.g. 'general purpose', 'high memory', and 'high cpu').
After selecting the limit for your more important resource, setting the secondary allocation
according to this ratio results in a balanced overall allocation.

For instance, some of Google Cloud's ratios are:

| node type   | GB RAM / CPU core |
| ----------- | ----------------- |
| n2-highmem  | 8                 |
| n2-standard | 4                 |
| n2-highcpu  | 1                 |

(idleness)=

### Idleness

Jupyter being an interactive tool means people tend to spend a lot more time reading and thinking than actually running resource-intensive code.
This significantly affects how much _CPU_ a typical active user needs,
but often does not significantly affect the _memory_.

Ways to think about this:

- More idle users means unused CPU.
  This generally means setting your CPU _limit_ higher than your CPU _request_.
- What do your users do when they _are_ running code?
  Is it typically single-threaded local computation in a notebook?
  If so, there's little reason to set a limit higher than 1 CPU core.
- Do typical computations take a long time, or just a few seconds?
  Longer typical computations mean it's more likely for users to be trying to use the CPU at the same moment,
  suggesting a higher _request_.
- Even with idle users, parallel computation adds up quickly: one user fully loading 4 cores and 3 using almost nothing still averages to more than a full CPU core per user.
- Long-running, intense computations suggest higher requests.

Again, using mybinder.org as an example: we run around 100 users on 8-core nodes,
and still see fairly _low_ overall CPU usage on each user node.
The limit here is actually Kubernetes' pods-per-node limit, not memory _or_ CPU.
This is likely an extreme case, as many Binder users come from clicking links on webpages
without any actual intention of running code.

```{figure} /images/mybinder-load5.png
mybinder.org node CPU usage is low with 50-150 users sharing just 8 cores
```

### Concurrent users and culling idle servers

Related to [](idleness), all of these resource consumptions and limits are calculated based on **concurrently active users**,
not total users.
You might have 10,000 users of your JupyterHub deployment, but only 100 of them running at any given time.
That 100 is the main number you need to use for your capacity planning.
JupyterHub costs scale very little based on the number of _total_ users,
up to a point.

There are two important definitions of **active user**:

- Are they _actually_ there (i.e. a human interacting with Jupyter, or running code that might be left running unattended)
- Is their server running (this is where resource reservations and limits are actually applied)

Connecting those two definitions (how long are servers running if their humans aren't using them?) is an important area of deployment configuration, usually implemented via the [JupyterHub idle culler service][idle-culler].

[idle-culler]: https://github.com/jupyterhub/jupyterhub-idle-culler

There are a lot of considerations when it comes to culling idle servers, and the right choices depend on your situation:

- How much does it save me to shut down user servers? (e.g. keeping an elastic cluster small, or keeping a fixed-size deployment available to active users)
- How much does it cost my users to have their servers shut down? (e.g. lost work if shut down prematurely)
- How easy do I want it to be for users to keep their servers running? (e.g. Do they want to run unattended simulations overnight? Do you want them to?)

Like many other things in this guide, there are many correct answers leading to different configuration choices.
For more detail on culling configuration and considerations, consult the [JupyterHub idle culler documentation][idle-culler].
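
As a minimal sketch (adapted from the idle culler's README; the one-hour timeout is an arbitrary choice), the culler is typically registered as a Hub-managed service in `jupyterhub_config.py`:

```python
import sys

# jupyterhub_config.py -- run jupyterhub-idle-culler as a Hub-managed service,
# culling servers that have been idle for more than an hour (3600s is arbitrary)
c.JupyterHub.load_roles = [
    {
        "name": "jupyterhub-idle-culler-role",
        "scopes": [
            "list:users",
            "read:users:activity",
            "read:servers",
            "delete:servers",
        ],
        "services": ["jupyterhub-idle-culler-service"],
    }
]
c.JupyterHub.services = [
    {
        "name": "jupyterhub-idle-culler-service",
        "command": [sys.executable, "-m", "jupyterhub_idle_culler", "--timeout=3600"],
    }
]
```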

## More tips

### Start strict and generous, then measure

A good tip, in general, is to give your users as many resources as you can afford that you think they _might_ use.
Then, use resource usage metrics (e.g. from Prometheus) to analyze what your users _actually_ need,
and tune accordingly.
Remember: **Limits affect your user experience and stability. Requests mostly affect your costs.**

For example, a sensible starting point (lacking any other information) might be:

```yaml
request:
  cpu: 0.5
  mem: 2G
limit:
  cpu: 1
  mem: 2G
```

(more memory if significant computations are likely: machine learning models, data analysis, etc.)
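
If your spawner supports resource limits and guarantees (KubeSpawner and SystemdSpawner do, for example; whether and how they are enforced depends on the spawner), this starting point might be expressed in `jupyterhub_config.py` roughly as:

```python
# jupyterhub_config.py -- a sketch of the starting point above,
# using the base Spawner traits for limits and guarantees
c.Spawner.cpu_guarantee = 0.5   # the 'request': CPU reserved per user
c.Spawner.cpu_limit = 1         # cap on CPU a single user can consume
c.Spawner.mem_guarantee = "2G"  # memory set aside per user
c.Spawner.mem_limit = "2G"      # strict memory cap (same as the guarantee)
```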

Some actions to consider:

- If you see out-of-memory killer events, increase the limit (or talk to your users!)
- If you see typical memory usage well below your limit, reduce the request (but not the limit)
- If _nobody_ uses that much memory, reduce your limit
- If CPU is your limiting scheduling factor and your CPUs are mostly idle,
  reduce the CPU request (maybe even to 0!)
- If CPU usage continues to be low, increase the limit to 2 or 4 to allow bursts of parallel execution

(measuring)=

### Measuring user resource consumption

It is _highly_ recommended to deploy monitoring services such as [Prometheus][]
and [Grafana][] to get a view of your users' resource usage.
This is the only way to truly know what your users need.

JupyterHub has some experimental [grafana dashboards][] you can use as a starting point
to keep an eye on your resource usage.
Here are some sample charts (again from mybinder.org),
showing >90% of users using less than 10% of a CPU and 200 MB,
while a few outliers sit near the limit of 1 CPU and 2GB of RAM.
This is the kind of information you can use to tune your requests and limits.

![Snapshot from JupyterHub's Grafana dashboards on mybinder.org](/images/mybinder-user-resources.png)

[prometheus]: https://prometheus.io
[grafana]: https://grafana.com
[grafana dashboards]: https://github.com/jupyterhub/grafana-dashboards

### Measuring costs

Measuring costs may be as important as measuring your users' activity.
If you are using a cloud provider, you can often set cost thresholds and quotas so that it notifies you if your costs get too high,
e.g. "Have AWS send me an email if I hit X spending trajectory on week 3 of the month."
You can then use this information to tune your resources based on what you can afford.
You can combine this information with user resource consumption to figure out if you have a problem,
e.g. "my users really do need X resources, but I can only afford to give them 80% of X."
This information may prove useful when asking your budget-approving folks for more funds.

### Additional resources

There are lots of other resources for cost and capacity planning that may be specific to JupyterHub and/or your cloud provider.

Here are some useful links to other resources:

- [Zero to JupyterHub](https://z2jh.jupyter.org) documentation on:
  - [projecting costs](https://z2jh.jupyter.org/en/latest/administrator/cost.html)
  - [configuring user resources](https://z2jh.jupyter.org/en/latest/jupyterhub/customizing/user-resources.html)
- Cloud platform cost calculators:
  - [Google Cloud](https://cloud.google.com/products/calculator/)
  - [Amazon AWS](https://calculator.aws)
  - [Microsoft Azure](https://azure.microsoft.com/en-us/pricing/calculator/)

`docs/source/explanation/admin/database.md` (new file)

(hub-database)=

# The Hub's Database

JupyterHub uses a database to store information about users, services, and other data needed for operating the Hub.
This is the **state** of the Hub.

## Why does JupyterHub have a database?

JupyterHub is a **stateful** application (more on that 'state' later).
Updating JupyterHub's configuration or upgrading the version of JupyterHub requires restarting the JupyterHub process to apply the changes.
We want to minimize the disruption caused by restarting the Hub process, so it can be a mundane, frequent, routine activity.
Storing state information outside the process for later retrieval is necessary for this, and is one of the main things databases are for.

A lot of the operations in JupyterHub are also **relationships**, which is exactly what SQL databases are great at.
For example:

- Given an API token, what user is making the request?
- Which users don't have running servers?
- Which servers belong to user X?
- Which users have not been active in the last 24 hours?

Finally, a database allows us to have more information stored without needing it all loaded in memory,
e.g. supporting a large number (several thousands) of inactive users.

## What's in the database?

The short answer of what's in the JupyterHub database is "everything."
JupyterHub's **state** lives in the database.
That is, everything JupyterHub needs to be aware of to function that _doesn't_ come from the configuration files, such as:

- users, roles, role assignments
- state and URLs of running servers
- hashed API tokens
- short-lived state related to the OAuth flow
- timestamps for when users, tokens, and servers were last used

### What's _not_ in the database

Not _quite_ all of JupyterHub's state is in the database.
This mostly involves transient state, such as the 'pending' transitions of Spawners (starting, stopping, etc.).
Anything not in the database must be reconstructed on Hub restart, and the only sources of information to do that are the database and JupyterHub configuration file(s).

## How does JupyterHub use the database?

JupyterHub makes some _unusual_ choices in how it connects to the database.
These choices represent trade-offs favoring single-process simplicity and performance at the expense of horizontal scalability (multiple Hub instances).

We often say that the Hub 'owns' the database.
This ownership means that we assume the Hub is the only process that will talk to the database.
This assumption enables us to make several caching optimizations that dramatically improve JupyterHub's performance (e.g. data written recently to the database can be read from memory instead of fetched again from the database), optimizations that would not work if multiple processes could be interacting with the database at the same time.

Database operations are also synchronous, so while JupyterHub is waiting on a database operation, it cannot respond to other requests.
This allows us to avoid complex locking mechanisms, because transaction races can only occur during an `await`; we only need to make sure we've completed any given transaction before the next `await` in a given request.

:::{note}
We are slowly working to remove these assumptions and move to a more traditional session-per-request pattern.
This will enable multiple Hub instances, and thus horizontal scaling of JupyterHub, but will significantly reduce the number of active users a single Hub instance can serve.
:::

### Database performance in a typical request

Most authenticated requests to JupyterHub involve a few database transactions:

1. look up the authenticated user (e.g. look up the token by hash, then resolve its owner and permissions)
2. record activity
3. perform any relevant changes involved in processing the request (e.g. create the records for a running server when starting one)

This means that the database is involved in almost every request, but only in quite small, simple queries, e.g.:

- look up one token by hash
- look up one user by name
- list tokens or servers for one user (typically 1-10)
- etc.

### The database as a limiting factor

As a result of the above transactions in most requests, database performance is the _leading_ factor in JupyterHub's baseline requests-per-second performance, but that cost does not scale significantly with the number of users, active or otherwise.
However, the database is _rarely_ a limiting factor in JupyterHub performance in a practical sense, because the main thing JupyterHub does is start, stop, and monitor whole servers, which take far more time than any small database transaction, no matter how many records you have or how slow your database is (within reason).
Additionally, there is usually _very_ little load on the database itself.

By far the most taxing activity on the database is the 'list all users' endpoint, primarily used by the [idle-culling service](https://github.com/jupyterhub/jupyterhub-idle-culler).
Database-based optimizations have been added to make even these operations feasible for large numbers of users:

1. State filtering on [GET /users](jupyterhub-rest-API) with `?state=active`,
   which limits the query to only the relevant subset of users rather than all of them (added in JupyterHub 1.3).
2. [Pagination](api-pagination) of all list endpoints, allowing the request of a large number of resources to be more fairly balanced with other Hub activities across multiple requests (added in 2.0).
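
For example, a service with the right permissions can ask for only the active users instead of all of them (a sketch using `requests`; the Hub URL and token are placeholders):

```python
import requests

# query the Hub's REST API for users with active servers only
# (placeholder URL and token; the token needs permission to read users)
r = requests.get(
    "http://127.0.0.1:8081/hub/api/users",
    params={"state": "active"},
    headers={"Authorization": "token YOUR_API_TOKEN"},
)
r.raise_for_status()
print([user["name"] for user in r.json()])
```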

:::{note}
It's important to note when discussing performance and limiting factors that all of this only applies to requests to `/hub/...`.
The Hub and its database are not involved in most requests to single-user servers (`/user/...`), which is by design, and largely motivated by the fact that the Hub itself doesn't _need_ to be fast, because its operations are infrequent and large.
:::

## Database backends

JupyterHub supports a variety of database backends via [SQLAlchemy][].
The default is SQLite, which works great for many cases, but you should be able to use many of the backends supported by SQLAlchemy.
Usually, this will mean PostgreSQL or MySQL, both of which are well tested with JupyterHub.

[sqlalchemy]: https://www.sqlalchemy.org

### Default backend: SQLite

The default database backend for JupyterHub is [SQLite](https://sqlite.org).
We have chosen SQLite as JupyterHub's default because it's simple (the 'database' is a single file) and ubiquitous (it is in the Python standard library).
It works very well for testing, small deployments, and workshops.

For production systems, SQLite has some disadvantages when used with JupyterHub:

- `upgrade-db` may not always work, and you may need to start with a fresh database
- `downgrade-db` **will not** work if you want to roll back to an earlier
  version, so back up the `jupyterhub.sqlite` file before upgrading

The SQLite documentation provides a helpful page about [when to use SQLite and
where traditional RDBMS may be a better choice](https://sqlite.org/whentouse.html).

### Picking your database backend (PostgreSQL, MySQL)

When running a long-term deployment or a production system, we recommend using a full-fledged relational database, such as [PostgreSQL](https://www.postgresql.org) or [MySQL](https://www.mysql.com), that supports the SQL `ALTER TABLE` statement.
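
The backend is selected with the `db_url` configuration; any SQLAlchemy connection URL should work. A sketch for PostgreSQL (host and credentials are placeholders):

```python
# jupyterhub_config.py -- point the Hub at an external PostgreSQL database
# (placeholder host/credentials)
c.JupyterHub.db_url = "postgresql://jupyterhub:PASSWORD@db.example.com:5432/jupyterhub"
```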

## Notes and Tips

### SQLite

The SQLite database should not be used on NFS. SQLite uses reader/writer locks
to control access to the database. This locking mechanism might not work
correctly if the database file is kept on an NFS filesystem, because
`fcntl()` file locking is broken on many NFS implementations. Therefore, you
should avoid putting SQLite database files on NFS, since NFS does not cope well
with multiple processes that might try to access the file at the same time.

### PostgreSQL

We recommend using PostgreSQL for production if you are unsure whether to use
MySQL or PostgreSQL or if you do not have a strong preference. There is
additional configuration required for MySQL that is not needed for PostgreSQL.

### MySQL / MariaDB

- You should use the `pymysql` SQLAlchemy provider (the other one, MySQLdb,
  isn't available for Python 3).
- You also need to set `pool_recycle` to some value (typically 60-300 seconds),
  depending on your MySQL setup. This is necessary because MySQL kills
  connections server-side if they've been idle for a while, and the connection
  from the Hub will be idle for longer than most connections. Without
  `pool_recycle`, this behavior leads to frustrating 'the connection has gone
  away' errors from SQLAlchemy.
- If you use `utf8mb4` collation with MySQL earlier than 5.7.7 or MariaDB
  earlier than 10.2.1, you may get a `1709, Index column size too large` error.
  To fix this, you need to set `innodb_large_prefix` to enabled and
  `innodb_file_format` to `Barracuda` to allow for the index sizes JupyterHub
  uses. `row_format` will be set to `DYNAMIC` as long as those options are set
  correctly. Later versions of MariaDB and MySQL set these values by
  default, as well as having a default `DYNAMIC` `row_format`, and pose no trouble
  to users.
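
A sketch of the first two points together (placeholder host and credentials; pick a `pool_recycle` value that matches your server's idle timeout):

```python
# jupyterhub_config.py -- MySQL/MariaDB via the pymysql driver
c.JupyterHub.db_url = "mysql+pymysql://jupyterhub:PASSWORD@db.example.com:3306/jupyterhub"
# extra keyword arguments are passed through to SQLAlchemy's engine
c.JupyterHub.db_kwargs = {"pool_recycle": 300}  # seconds
```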

`docs/source/explanation/admin/oauth.md` (new file)

# JupyterHub and OAuth

JupyterHub uses [OAuth 2](https://oauth.net/2/) as an internal mechanism for authenticating users.
As such, JupyterHub itself always functions as an OAuth **provider**.
You can find out more about what that means [below](oauth-terms).

Additionally, JupyterHub is _often_ deployed with [OAuthenticator](https://oauthenticator.readthedocs.io),
where an external identity provider, such as GitHub or KeyCloak, is used to authenticate users.
When this is the case, there are _two_ nested OAuth flows:
an _internal_ OAuth flow where JupyterHub is the **provider**,
and an _external_ OAuth flow, where JupyterHub is the **client**.

This means that when you are using JupyterHub, there is always _at least one_ and often two layers of OAuth involved in a user logging in and accessing their server.

The following points are noteworthy:

- Single-user servers _never_ need to communicate with or be aware of the upstream provider configured in your Authenticator.
  As far as the servers are concerned, only JupyterHub is an OAuth provider,
  and how users authenticate with the Hub itself is irrelevant.
- When interacting with a single-user server,
  there are almost always two tokens:
  first, a token issued to the server itself to communicate with the Hub API,
  and second, a per-user token in the browser to represent the completed login process and authorized permissions.
  More on this [later](two-tokens).

(oauth-terms)=

## Key OAuth terms

Here are some key definitions to keep in mind when we are talking about OAuth.
You can also read more in detail [here](https://www.oauth.com/oauth2-servers/definitions/).

- **provider**: The entity responsible for managing identity and authorization;
  always a web server.
  JupyterHub is _always_ an OAuth provider for JupyterHub's components.
  When OAuthenticator is used, an external service, such as GitHub or KeyCloak, is also an OAuth provider.
- **client**: An entity that requests OAuth **tokens** on a user's behalf;
  generally a web server of some kind.
  OAuth **clients** are services that _delegate_ authentication and/or authorization
  to an OAuth **provider**.
  JupyterHub _services_ or single-user _servers_ are OAuth **clients** of the JupyterHub **provider**.
  When OAuthenticator is used, JupyterHub is itself _also_ an OAuth **client** for the external OAuth **provider**, e.g. GitHub.
- **browser**: A user's web browser, which makes requests and stores things like cookies.
- **token**: The secret value used to represent a user's authorization. This is the final product of the OAuth process.
- **code**: A short-lived temporary secret that the **client** exchanges
  for a **token** at the conclusion of OAuth,
  in what's generally called the "OAuth callback handler."

## One OAuth flow

OAuth **flow** is what we call the sequence of HTTP requests involved in authenticating a user and issuing a token, ultimately used for authorizing access to a service or single-user server.

A single OAuth flow typically goes like this:

### OAuth request and redirect

1. A **browser** makes an HTTP request to an OAuth **client**.
2. There are no credentials, so the client _redirects_ the browser to an "authorize" page on the OAuth **provider** with some extra information:
   - the OAuth **client ID** of the client itself
   - the **redirect URI** to be redirected back to after completion
   - the **scopes** requested, which the user should be presented with to confirm.
     This is the "X would like to be able to Y on your behalf. Allow this?" page you see on all the "Login with ..." pages around the Internet.
3. During this authorize step,
   the browser must be _authenticated_ with the provider.
   This authentication is often already stored in a cookie,
   but if not, the provider's webapp must begin its _own_ authentication process before serving the authorization page.
   This _may_ even begin another OAuth flow!
4. After the user tells the provider that they want to proceed with the authorization,
   the provider records this authorization in a short-lived record called an **OAuth code**.
5. Finally, the OAuth provider redirects the browser _back_ to the OAuth client's "redirect URI"
   (or "OAuth callback URI"),
   with the OAuth code in a URL parameter.

That marks the end of the requests made between the **browser** and the **provider**.
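
Concretely, the authorize redirect in step 2 carries its information as URL query parameters. A schematic (the parameter names follow the OAuth 2 spec, but the host and values here are made up for illustration):

```python
# build a schematic OAuth 2 authorize URL; not JupyterHub's exact URL
from urllib.parse import urlencode

authorize_url = "https://provider.example.com/oauth2/authorize?" + urlencode({
    "client_id": "my-client-id",
    "redirect_uri": "https://client.example.com/oauth_callback",
    "response_type": "code",
    "state": "opaque-anti-forgery-value",
})
print(authorize_url)
```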

### State after redirect

At this point:

- The browser is authenticated with the _provider_.
- The user's authorized permissions are recorded in an _OAuth code_.
- The _provider_ knows that the permissions requested by the OAuth client have been granted, but the client doesn't know this yet.
- All the requests so far have been made directly by the browser.
  No requests have originated from the client or provider.

### OAuth Client Handles Callback Request

At this stage, we get to finish the OAuth process.
Let's dig into what the OAuth client does when it handles
the OAuth callback request.

- The OAuth client receives the _code_ and makes an API request to the _provider_ to exchange the code for a real _token_.
  This is the first direct request between the OAuth _client_ and the _provider_.
- Once the token is retrieved, the client _usually_
  makes a second API request to the _provider_
  to retrieve information about the owner of the token (the user).
  This is the step where behavior diverges for different OAuth providers.
  Up to this point, all OAuth providers are the same, following the OAuth specification.
  However, OAuth does not define a standard for exchanging tokens for information about their owner or permissions ([OpenID Connect](https://openid.net/connect/) does that),
  so this step may be different for each OAuth provider.
- Finally, the OAuth client stores its own record that the user is authorized in a cookie.
  This could be the token itself, or any other appropriate representation of successful authentication.
- Now that credentials have been established,
  the browser can be redirected to the _original_ URL where it started,
  to try the request again.
  If the client wasn't able to keep track of the original URL all this time
  (not always easy!),
  you might end up back at a default landing page instead of where you started the login process. This is frustrating!
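
A bare-bones sketch of the first two steps in generic OAuth 2 terms (endpoints, client credentials, and the user-info URL are placeholders, and as noted above, the user-info step varies by provider):

```python
import requests

code = "code-from-callback-url"  # extracted from the ?code= URL parameter

# 1. exchange the short-lived code for a real token (placeholder endpoint/secrets)
token_resp = requests.post(
    "https://provider.example.com/oauth2/token",
    data={
        "grant_type": "authorization_code",
        "code": code,
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
        "redirect_uri": "https://client.example.com/oauth_callback",
    },
)
access_token = token_resp.json()["access_token"]

# 2. ask the provider who the token belongs to (this part is provider-specific)
user = requests.get(
    "https://provider.example.com/api/user",
    headers={"Authorization": f"Bearer {access_token}"},
).json()
```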

😮💨 _phew_.

So that's _one_ OAuth process.

## Full sequence of OAuth in JupyterHub

Let's go through the above OAuth process in JupyterHub,
with specific examples of each HTTP request and what information it contains.
For bonus points, we are using the double-OAuth example of JupyterHub configured with GitHubOAuthenticator.

To disambiguate, we will call the OAuth process where JupyterHub is the **provider** "internal OAuth,"
and the one with JupyterHub as a **client** "external OAuth."

Our starting point:

- a user's single-user server is running; let's call the user `danez`
- JupyterHub is running with GitHub as an OAuth provider (this means two full instances of OAuth)
- Danez has a fresh browser session with no cookies yet

First request:

- browser->single-user server running JupyterLab or Jupyter Classic
- `GET /user/danez/notebooks/mynotebook.ipynb`
- no credentials, so the single-user server (as an OAuth **client**) starts the internal OAuth process with JupyterHub (the **provider**)
- response: 302 redirect -> `/hub/api/oauth2/authorize`
  with:
  - client-id=`jupyterhub-user-danez`
  - redirect-uri=`/user/danez/oauth_callback` (we'll come back to this later!)

Second request, following the redirect:

- browser->JupyterHub
- `GET /hub/api/oauth2/authorize`
- no credentials, so JupyterHub starts the external OAuth process _with GitHub_
- response: 302 redirect -> `https://github.com/login/oauth/authorize`
  with:
  - client-id=`jupyterhub-client-uuid`
  - redirect-uri=`/hub/oauth_callback` (we'll come back to this later!)

_pause_ This is where JupyterHub configuration comes into play.
Recall that, in this case, JupyterHub is using:

```python
c.JupyterHub.authenticator_class = 'github'
```

That means authenticating a request to the Hub itself starts
a _second_, external OAuth process with GitHub as a provider.
This external OAuth process is optional, though.
If you were using the default username+password PAMAuthenticator,
this redirect would have been to `/hub/login` instead, to present the user
with a login form.

Third request, following the redirect:

- browser->GitHub
- `GET https://github.com/login/oauth/authorize`

Here, GitHub prompts for login and asks for confirmation of authorization
(more redirects if you aren't logged in to GitHub yet, but ultimately back to this `/authorize` URL).

After successful authorization
(either by looking up a pre-existing authorization,
or recording it via form submission),
GitHub issues an **OAuth code** and redirects to `/hub/oauth_callback?code=github-code`.

Next request:

- browser->JupyterHub
- `GET /hub/oauth_callback?code=github-code`

Inside the callback handler, JupyterHub makes two API requests:

The first:

- JupyterHub->GitHub
- `POST https://github.com/login/oauth/access_token`
- request made with the OAuth **code** from the URL parameter
- response includes an access **token**

The second:

- JupyterHub->GitHub
- `GET https://api.github.com/user`
- request made with the access **token** in the `Authorization` header
- response is the user model, including username, email, etc.

Now the external OAuth callback request completes with:

- set cookie on the `/hub/` path, recording JupyterHub authentication so we don't need to do external OAuth with GitHub again for a while
- redirect -> `/hub/api/oauth2/authorize`

🎉 At this point, we have completed our first OAuth flow! 🎉

Now, we get our first repeated request:

- browser->JupyterHub
- `GET /hub/api/oauth2/authorize`
- this time with credentials,
  so JupyterHub either:
  1. serves the internal authorization confirmation page, or
  2. automatically accepts the authorization (a shortcut taken when a user is visiting their own server)
- redirect -> `/user/danez/oauth_callback?code=jupyterhub-code`

Here, we start the same OAuth callback process as before, but at Danez's single-user server for the _internal_ OAuth.

- browser->single-user server
- `GET /user/danez/oauth_callback`

Inside the internal OAuth callback handler,
Danez's server makes two API requests to JupyterHub:

The first:

- single-user server->JupyterHub
- `POST /hub/api/oauth2/token`
- request made with the OAuth code from the URL parameter
- response includes an API token

The second:

- single-user server->JupyterHub
- `GET /hub/api/user`
- request made with the token in the `Authorization` header
- response is the user model, including username, groups, etc.

Finally, completing `GET /user/danez/oauth_callback`:

- the response sets a cookie, storing the encrypted access token
- and _finally_ redirects back to the original `/user/danez/notebooks/mynotebook.ipynb`

Final request:

- browser -> single-user server
- `GET /user/danez/notebooks/mynotebook.ipynb`
- encrypted JupyterHub token in cookie

To authenticate this request, the token stored in the encrypted cookie is passed to the Hub for verification:

- single-user server -> Hub
- `GET /hub/api/user`
- browser's token in the `Authorization` header
- response: user model with name, groups, etc.

If the user model matches who should be allowed (e.g. Danez),
then the request is allowed.
See [Scopes in JupyterHub](jupyterhub-scopes) for how JupyterHub uses scopes to determine authorized access to servers and services.

_the end_

## Token caches and expiry

Because tokens represent information from an external source,
they can become 'stale,'
or the information they represent may no longer be accurate.
For example: a user's GitHub account may no longer be authorized to use JupyterHub;
that change should ultimately propagate to revoking access and forcing a fresh login.

To handle this, OAuth tokens and the various places they are stored can _expire_,
which should have the same effect as having no credentials,
and trigger the authorization process again.

In JupyterHub's internal OAuth, we have these layers of information that can go stale:

- The OAuth client has a **cache** of Hub responses for tokens,
  so it doesn't need to make API requests to the Hub for every request it receives.
  This cache has an expiry of five minutes by default,
  and is governed by the configuration `HubAuth.cache_max_age` in the single-user server.
- The internal OAuth token is stored in a cookie, which has its own expiry (default: 14 days),
  governed by `JupyterHub.cookie_max_age_days`.
- The internal OAuth token itself can also expire,
  which is by default the same as the cookie expiry,
  since it makes sense for the token itself and the place it is stored to expire at the same time.
  This is governed by `JupyterHub.cookie_max_age_days` first,
  or can be overridden by `JupyterHub.oauth_token_expires_in`.

That's all for _internal_ auth storage,
but the information from the _external_ authentication provider
(which could be PAM or GitHub OAuth, etc.) can also expire.
Authenticator configuration governs when JupyterHub needs to ask again,
triggering the external login process anew before letting a user proceed.

- The `jupyterhub-hub-login` cookie stores the fact that a browser is authenticated with the Hub.
  It expires according to the `JupyterHub.cookie_max_age_days` configuration,
  with a default of 14 days.
  The `jupyterhub-hub-login` cookie is encrypted with the `JupyterHub.cookie_secret`
  configuration.
- {meth}`.Authenticator.refresh_user` is a method to refresh a user's auth info.
  By default, it does nothing, but it can return an updated user model if a user's information has changed,
  or force a full login process again if needed.
- {attr}`.Authenticator.auth_refresh_age` configuration governs how often
  `refresh_user()` will be called to check if a user must log in again (default: 300 seconds).
- {attr}`.Authenticator.refresh_pre_spawn` configuration governs whether
  `refresh_user()` should be called prior to spawning a server,
  to force fresh auth info when a server is launched (default: False).
  This can be useful when Authenticators pass access tokens to spawner environments, to ensure they aren't getting a stale token that's about to expire.
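
Those knobs side by side in `jupyterhub_config.py` (the first and third values shown are the defaults mentioned above; the others are opt-in):

```python
# jupyterhub_config.py -- expiry and refresh configuration discussed above
c.JupyterHub.cookie_max_age_days = 14         # cookie (and default token) lifetime
# c.JupyterHub.oauth_token_expires_in = 3600  # optionally decouple token expiry (seconds)
c.Authenticator.auth_refresh_age = 300        # how often refresh_user() may run (seconds)
c.Authenticator.refresh_pre_spawn = True      # force fresh auth info before each spawn
```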

**So what happens when these things expire or get stale?**

- If the HubAuth **token response cache** expires,
  then when a request is made with a token,
  the Hub is asked for the latest information about the token.
  This usually has no visible effect, since it is just refreshing a cache.
  If it turns out that the token itself has expired or been revoked,
  the request will be denied.
- If the token has expired, but is still in the cookie:
  when the token response cache expires,
  the next time the server asks the Hub about the token,
  no user will be identified and the internal OAuth process begins again.
- If the token _cookie_ expires, the next browser request will be made with no credentials,
  and the internal OAuth process will begin again.
  This will usually take the form of a transparent redirect that browsers won't notice.
  However, if this occurs on an API request in a long-lived page visit
  such as a JupyterLab session, the API request may fail and require
  a page refresh to get renewed credentials.
- If the _JupyterHub_ cookie expires, the next time the browser makes a request to the Hub,
  the Hub's authorization process must begin again (e.g. login with GitHub).
  Hub cookie expiry on its own **does not** mean that a user can no longer access their single-user server!
- If credentials from the upstream provider (e.g. GitHub) become stale or outdated,
  these will not be refreshed until/unless `refresh_user` is called,
  _and_ `refresh_user()` on the given Authenticator is implemented to perform such a check.
  At this point, few Authenticators implement `refresh_user` to support this feature.
  If your Authenticator does not or cannot implement `refresh_user`,
  the only way to force a check is to reset the `JupyterHub.cookie_secret` encryption key,
  which invalidates the `jupyterhub-hub-login` cookie for all users.
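
For Authenticator authors, a sketch of what such an implementation looks like (the signature matches {meth}`.Authenticator.refresh_user`; `_check_upstream` is a hypothetical helper for your provider):

```python
from jupyterhub.auth import Authenticator

class MyAuthenticator(Authenticator):
    async def refresh_user(self, user, handler=None):
        # hypothetical helper that asks the upstream provider about this user
        info = await self._check_upstream(user.name)
        if info is None:
            return False  # stale: force the full login process again
        if info.get("changed"):
            # return an updated user model, e.g. with new auth_state
            return {"name": user.name, "auth_state": info["auth_state"]}
        return True  # auth info is still valid as-is
```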

### Logging out

Logging out of JupyterHub means clearing and revoking many of these credentials:

- The `jupyterhub-hub-login` cookie is revoked, meaning the next request to the Hub itself will require a new login.
- The token stored in the `jupyterhub-user-username` cookie for the single-user server
  will be revoked, based on its association with `jupyterhub-session-id`, but the _cookie itself cannot be cleared at this point_.
- The shared `jupyterhub-session-id` is cleared, which ensures that the HubAuth **token response cache** will not be used,
  and the next request with the expired token will ask the Hub, which will inform the single-user server that the token has expired.

## Extra bits

(two-tokens)=

### A tale of two tokens

**TODO**: discuss the API token issued to the server at startup (`$JUPYTERHUB_API_TOKEN`)
and the OAuth-issued token in the cookie,
and some details of how JupyterLab currently deals with that.
They are different, and JupyterLab should be making requests using the token from the cookie,
not the token from the server,
but that is not currently the case.

### Redirect loops

In general, an authenticated web endpoint has this behavior,
based on the authentication/authorization state of the browser:

- If authorized, allow the request to happen.
- If authenticated (I know who you are) but not authorized (you are not allowed), fail with a 403 permission-denied error.
- If not authenticated, start a redirect process to establish authorization,
  which should end in a redirect back to the original URL to try again.
  **This is why problems in authentication result in redirect loops!**
  If the second request fails to detect the authentication that should have been established during the redirect,
  it will start the authentication redirect process over again,
  and keep redirecting in a loop until the browser balks.

`docs/source/explanation/admin/websecurity.md` (new file)

(web-security)=

# Security Overview

The **Security Overview** section helps you learn about:

- the design of JupyterHub with respect to web security
- the semi-trusted user
- the available mitigations to protect untrusted users from each other
- the value of periodic security audits

This overview also helps you obtain a deeper understanding of how JupyterHub
works.

## Semi-trusted and untrusted users

JupyterHub is designed to be a _simple multi-user server for modestly sized
groups_ of **semi-trusted** users. While the design reflects serving
semi-trusted users, JupyterHub can also be suitable for serving **untrusted** users.

As a result, using JupyterHub with **untrusted** users means more work for the
administrator, since much care is required to secure a Hub, with extra caution on
protecting users from each other.

One aspect of JupyterHub's _design simplicity_ for **semi-trusted** users is that
the Hub and single-user servers are placed in a _single domain_, behind a
[_proxy_][configurable-http-proxy]. If the Hub is serving untrusted
users, many of the web's cross-site protections are not applied between
single-user servers and the Hub, or between single-user servers and each
other, since browsers see the whole thing (proxy, Hub, and single-user
servers) as a single website (i.e. a single domain).

## Protect users from each other

To protect users from each other, a user must **never** be able to write arbitrary
HTML and serve it to another user on the Hub's domain. JupyterHub's
authentication setup prevents this because only the owner of a given single-user notebook server is
allowed to view user-authored pages served by that
server.

To protect all users from each other, JupyterHub administrators must
ensure that:

- A user **does not have permission** to modify their single-user notebook server,
  including:
  - the installation of new packages in the Python environment that runs
    their single-user server;
  - the creation of new files in any `PATH` directory that precedes the
    directory containing `jupyterhub-singleuser` (if the `PATH` is used
    to resolve the single-user executable instead of using an absolute path);
  - the modification of environment variables (e.g. `PATH`, `PYTHONPATH`) for
    their single-user server;
  - the modification of the configuration of the notebook server
    (the `~/.jupyter` or `JUPYTER_CONFIG_DIR` directory).

If any additional services are run on the same domain as the Hub, the services
**must never** display user-authored HTML that is neither _sanitized_ nor _sandboxed_
(e.g. IFramed) to any user that lacks authentication as the author of the file.

## Mitigate security issues

Several approaches to mitigating security issues with configuration
options provided by JupyterHub include:

### Enable subdomains

JupyterHub provides the ability to run single-user servers on their own
subdomains. This means the cross-origin protections between servers have the
desired effect, and user servers and the Hub are protected from each other. A
user's single-user server will be at `username.jupyter.mydomain.com`. This also
requires all user subdomains to point to the same address, which is most easily
accomplished with wildcard DNS. Since this spreads the service across multiple
domains, you will need wildcard SSL as well. Unfortunately, for many
institutional domains, wildcard DNS and SSL are not available. **If you do plan
to serve untrusted users, enabling subdomains is highly encouraged**, as it
resolves the cross-site issues.
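
Enabling subdomains is a single configuration option, assuming the wildcard DNS and SSL above are already in place (the hostname is a placeholder):

```python
# jupyterhub_config.py -- run user servers on username.jupyter.mydomain.com
c.JupyterHub.subdomain_host = "https://jupyter.mydomain.com"
```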

### Disable user config

If subdomains are unavailable or undesirable, JupyterHub provides a
configuration option, `Spawner.disable_user_config`, which can be set to prevent
user-owned configuration files from being loaded. After implementing this
option, `PATH`s and package installation are the other things that the
admin must enforce.
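
In `jupyterhub_config.py`, that is:

```python
# jupyterhub_config.py -- don't load user-owned notebook configuration
# (e.g. ~/.jupyter) into single-user servers
c.Spawner.disable_user_config = True
```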

### Prevent spawners from evaluating shell configuration files

For most Spawners, `PATH` is not something users can influence, but it's important that
the Spawner should _not_ evaluate shell configuration files prior to launching the server.

### Isolate packages using virtualenv

Package isolation is most easily handled by running the single-user server in
a virtualenv with system-site-packages disabled. The user should not have
permission to install packages into this environment.

It is important to note that this control over the environment only affects the
single-user server, and not the environment(s) in which the user's kernel(s)
may run. Installing additional packages in the kernel environment does not
pose additional risk to the web application's security.

### Encrypt internal connections with SSL/TLS

By default, all communications on the server, between the proxy, Hub, and
single-user servers, are performed unencrypted. Setting the `internal_ssl` flag in
`jupyterhub_config.py` secures the aforementioned routes. Turning this
feature on does require that the enabled `Spawner` can use the certificates
generated by the `Hub` (the default `LocalProcessSpawner` can, for instance).

It is also important to note that this encryption **does not** yet cover the
`zmq tcp` sockets between the Notebook client and kernel. While users cannot
submit arbitrary commands to another user's kernel, they can bind to these
sockets and listen. When serving untrusted users, this eavesdropping can be
mitigated by setting `KernelManager.transport` to `ipc`. This applies standard
Unix permissions to the communication sockets, thereby restricting
communication to the socket owner. The `internal_ssl` option will eventually
extend to securing the `tcp` sockets as well.
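
The two settings from this section side by side (note that `KernelManager.transport` is Jupyter kernel configuration, applied in the single-user server's environment rather than in `jupyterhub_config.py`):

```python
# jupyterhub_config.py -- encrypt proxy<->Hub<->server traffic
c.JupyterHub.internal_ssl = True

# jupyter server / notebook config in the single-user environment --
# use Unix (ipc) sockets for kernel communication instead of tcp
c.KernelManager.transport = "ipc"
```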

## Security audits

We recommend that you do periodic reviews of your deployment's security. It's
good practice to keep [JupyterHub](https://readthedocs.org/projects/jupyterhub/), [configurable-http-proxy][], and [nodejs
versions](https://github.com/nodejs/Release) up to date.

A handy website for testing your deployment is
[Qualys' SSL analyzer tool](https://www.ssllabs.com/ssltest/analyze.html).

[configurable-http-proxy]: https://github.com/jupyterhub/configurable-http-proxy

## Vulnerability reporting

If you believe you have found a security vulnerability in JupyterHub, or any
Jupyter project, please report it to
[security@ipython.org](mailto:security@ipython.org). If you prefer to encrypt
your security reports, you can use [this PGP public
key](https://jupyter.org/assets/ipython_security.asc).