Merge branch 'master' into asalikhov/ubuntu_focal

@@ -8,13 +8,13 @@ This page describes the options supported by the startup script as well as how t

You can pass [Jupyter command line options](https://jupyter.readthedocs.io/en/latest/projects/jupyter-command.html) to the `start-notebook.sh` script when launching the container. For example, to secure the Notebook server with a custom password hashed using `IPython.lib.passwd()` instead of the default token, you can run the following:

```bash
docker run -d -p 8888:8888 jupyter/base-notebook start-notebook.sh --NotebookApp.password='sha1:74ba40f8a388:c913541b7ee99d15d5ed31d4226bf7838f83a50e'
```
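
If you need to generate such a hash yourself, the `IPython.lib.passwd()` helper mentioned above can produce one. A minimal sketch (run it inside any of the images, which ship IPython; the password here is just a placeholder):

```python
from IPython.lib import passwd

# Returns a salted hash suitable for --NotebookApp.password; the exact
# output differs on every run because of the random salt.
print(passwd('my-notebook-password'))  # e.g. 'sha1:<salt>:<hash>'
```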

For example, to set the base URL of the notebook server, you can run the following:

```bash
docker run -d -p 8888:8888 jupyter/base-notebook start-notebook.sh --NotebookApp.base_url=/some/path
```

@@ -54,7 +54,7 @@ script for execution details.

You may mount SSL key and certificate files into a container and configure Jupyter Notebook to use them to accept HTTPS connections. For example, to mount a host folder containing a `notebook.key` and `notebook.crt` and use them, you might run the following:

```bash
docker run -d -p 8888:8888 \
    -v /some/host/folder:/etc/ssl/notebook \
    jupyter/base-notebook start-notebook.sh \
    --NotebookApp.keyfile=/etc/ssl/notebook/notebook.key \
    --NotebookApp.certfile=/etc/ssl/notebook/notebook.crt
```

@@ -64,7 +64,7 @@ docker run -d -p 8888:8888 \

Alternatively, you may mount a single PEM file containing both the key and certificate. For example:

```bash
docker run -d -p 8888:8888 \
    -v /some/host/folder/notebook.pem:/etc/ssl/notebook.pem \
    jupyter/base-notebook start-notebook.sh \
    --NotebookApp.certfile=/etc/ssl/notebook.pem
```

@@ -85,13 +85,13 @@ For additional information about using SSL, see the following:

The `start-notebook.sh` script actually inherits most of its option handling capability from a more generic `start.sh` script. The `start.sh` script supports all of the features described above, but allows you to specify an arbitrary command to execute. For example, to run the text-based `ipython` console in a container, do the following:

```bash
docker run -it --rm jupyter/base-notebook start.sh ipython
```

Or, to run JupyterLab instead of the classic notebook, run the following:

```bash
docker run -it --rm -p 8888:8888 jupyter/base-notebook start.sh jupyter lab
```

@@ -107,7 +107,7 @@ The default Python 3.x [Conda environment](http://conda.pydata.org/docs/using/en

The `jovyan` user has full read/write access to the `/opt/conda` directory. You can use either `conda` or `pip` to install new packages without any additional permissions.

```bash
# install a package into the default (python 3.x) environment
pip install some-package
conda install some-package
```

@@ -17,7 +17,7 @@ orchestrator config.

For example:

```bash
docker run -it -e GRANT_SUDO=yes --user root jupyter/minimal-notebook
```

@@ -75,7 +75,7 @@ Python 2.x was removed from all images on August 10th, 2017, starting in tag `cc

add a Python 2.x environment by defining your own Dockerfile inheriting from one of the images like
so:

```dockerfile
# Choose your desired base image
FROM jupyter/scipy-notebook:latest
```

@@ -103,7 +103,7 @@ Ref:

The default version of Python that ships with conda/ubuntu may not be the version you want.
To add a conda environment with a different version and make it accessible to Jupyter, the instructions are very similar to Python 2.x but are slightly simpler (no need to switch to `root`):

```dockerfile
# Choose your desired base image
FROM jupyter/minimal-notebook:latest
```

@@ -168,12 +168,12 @@ ENTRYPOINT ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root"]

And build the image as:

```bash
docker build -t jupyter/scipy-dasklabextension:latest .
```

Once built, run using the command:

```bash
docker run -it --rm -p 8888:8888 -p 8787:8787 jupyter/scipy-dasklabextension:latest
```

@@ -194,7 +194,7 @@ Ref:

[RISE](https://github.com/damianavila/RISE) is an extension that lets you create live slideshows of your
notebooks, with no conversion, using Reveal.js:

```bash
# Add Live slideshows with RISE
RUN conda install -c damianavila82 rise
```

@@ -207,7 +207,7 @@ Credit: [Paolo D.](https://github.com/pdonorio) based on

You need to install conda's gcc for Python xgboost to work properly. Otherwise, you'll get an
exception about libgomp.so.1 missing GOMP_4.0.

```bash
%%bash
conda install -y gcc
pip install xgboost
```
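
As a quick, purely illustrative smoke test (not part of the original recipe), you can confirm that the freshly installed `xgboost` links correctly by training a tiny booster on random data:

```python
import numpy as np
import xgboost as xgb

# Train a tiny booster on random data; this step fails if libgomp/GOMP_4.0
# is still missing from the environment.
X = np.random.rand(20, 3)
y = np.random.randint(2, size=20)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=2)
print(booster.predict(dtrain)[:3])
```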

@@ -312,8 +312,8 @@ Credit: [Justin Tyberg](https://github.com/jtyberg), [quanghoc](https://github.c

To use a specific version of JupyterHub, the version of `jupyterhub` in your image should match the
version in the Hub itself.

```dockerfile
FROM jupyter/base-notebook:5ded1de07260
RUN pip install jupyterhub==0.8.0b1
```

@@ -375,7 +375,7 @@ Ref:

### Using Local Spark JARs

```python
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
import pyspark
```
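
For context, `PYSPARK_SUBMIT_ARGS` must be set before the first `pyspark` import so that the JAR lands on the driver and executor classpaths. The sketch below shows one way such a Kafka assembly is typically used; the broker address and topic are placeholders, and it assumes the Spark 1.6-era `pyspark.streaming.kafka` API that matches the JAR above:

```python
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
)

import pyspark
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # JVM side comes from the assembly JAR

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, batchDuration=1)

# Placeholder broker and topic; point these at your own Kafka endpoint.
stream = KafkaUtils.createDirectStream(
    ssc, ['my-topic'], {'metadata.broker.list': 'kafka-broker:9092'})
```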

@@ -404,7 +404,7 @@ Ref:

### Use jupyter/all-spark-notebooks with an existing Spark/YARN cluster

```dockerfile
FROM jupyter/all-spark-notebook

# Set env vars for pydoop
```
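
Once a container built from such a Dockerfile can reach the cluster, a quick connectivity check is to start a YARN-backed session from a notebook. This is only a sketch, assuming the Hadoop/YARN client configuration (e.g. `HADOOP_CONF_DIR`) is visible inside the container:

```python
from pyspark.sql import SparkSession

# Assumes HADOOP_CONF_DIR / YARN_CONF_DIR inside the container point at your cluster.
spark = (SparkSession.builder
         .master('yarn')
         .appName('yarn-smoke-test')
         .getOrCreate())

# Sum of the first 100 whole numbers, computed on the cluster
print(spark.sparkContext.parallelize(range(100 + 1)).sum())  # 5050
```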

@@ -480,13 +480,13 @@ convenient to launch the server without a password or token. In this case, you s

For jupyterlab:

```bash
docker run jupyter/base-notebook:6d2a05346196 start.sh jupyter lab --LabApp.token=''
```

For jupyter classic:

```bash
docker run jupyter/base-notebook:6d2a05346196 start.sh jupyter notebook --NotebookApp.token=''
```

@@ -494,7 +494,7 @@ docker run jupyter/base-notebook:6d2a05346196 start.sh jupyter notebook --Notebo

NB: this works for classic notebooks only

```dockerfile
# Update with your base image of choice
FROM jupyter/minimal-notebook:latest
```

@@ -513,7 +513,7 @@ Ref:

Using `auto-sklearn` requires `swig`, which the other notebook images lack, so it cannot be experimented with there. Also, there is no Conda package for `auto-sklearn`.

```dockerfile
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM jupyter/scipy-notebook:latest
```
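
Once an image with `swig` and `auto-sklearn` is built, a short, purely illustrative way to exercise it from a notebook (the dataset and time budget below are arbitrary choices):

```python
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Tiny run with an arbitrary one-minute budget, just to confirm the install works.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60, per_run_time_limit=30)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
```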

@@ -5,7 +5,8 @@ This page provides details about features specific to one or more images.

## Apache Spark

**Specific Docker Image Options**

* `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open [SparkUI (Spark Monitoring and Instrumentation UI)](http://spark.apache.org/docs/latest/monitoring.html) at default port `4040`. This option maps the `4040` port inside the docker container to the `4040` port on the host machine. Note that every new Spark context that is created is put onto an incrementing port (i.e. 4040, 4041, 4042, etc.), and it might be necessary to open multiple ports. For example: `docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook`.

**Usage Examples**

@@ -13,30 +14,66 @@ The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support t

### Using Spark Local Mode

Spark **local mode** is useful for experimentation on small data when you do not have a Spark cluster available.

#### In Python

In a Python notebook.

```python
from pyspark.sql import SparkSession

# Spark session & context
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
rdd.sum()
# 5050
```

#### In R

In an R notebook with [SparkR][sparkr].

```R
library(SparkR)

# Spark session & context
sc <- sparkR.session("local")

# Sum of the first 100 whole numbers
sdf <- createDataFrame(list(1:100))
dapplyCollect(sdf,
              function(x)
              { x <- sum(x)}
             )
# 5050
```

In an R notebook with [sparklyr][sparklyr].

```R
library(sparklyr)

# Spark configuration
conf <- spark_config()
# Set the catalog implementation in-memory
conf$spark.sql.catalogImplementation <- "in-memory"

# Spark session & context
sc <- spark_connect(master = "local", config = conf)

# Sum of the first 100 whole numbers
sdf_len(sc, 100, repartition = 1) %>%
    spark_apply(function(e) sum(e))
# 5050
```

#### In Scala

##### In a Spylon Kernel

Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.

@@ -44,27 +81,30 @@ options in a `%%init_spark` magic cell.

```python
%%init_spark
# Configure Spark to use a local master
launcher.master = "local"
```

```scala
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

##### In an Apache Toree Kernel

Apache Toree instantiates a local `SparkContext` for you in variable `sc` when the kernel starts.

```scala
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

### Connecting to a Spark Cluster in Standalone Mode

Connection to a Spark cluster in **[Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)** requires the following set of steps:

0. Verify that the docker image (check the Dockerfile) and the Spark cluster which is being
   deployed run the same version of Spark.

@@ -72,98 +112,107 @@ Connection to Spark Cluster on Standalone Mode requires the following set of ste

2. Run the Docker container with `--net=host` in a location that is network addressable by all of
   your Spark workers. (This is a [Spark networking
   requirement](http://spark.apache.org/docs/latest/cluster-overview.html#components).)
    * NOTE: When using `--net=host`, you must also use the flags `--pid=host -e
      TINI_SUBREAPER=true`. See https://github.com/jupyter/docker-stacks/issues/64 for details.

**Note**: In the following examples we are using the Spark master URL `spark://master:7077`, which should be replaced by the URL of your Spark master.

#### In Python

The **same Python version** needs to be used on the notebook (where the driver is located) and on the Spark workers.
The Python version used on the driver and worker side can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`; see [Spark Configuration][spark-conf] for more information.

```python
from pyspark.sql import SparkSession

# Spark session & context
spark = SparkSession.builder.master('spark://master:7077').getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
rdd.sum()
# 5050
```
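
If the notebook image and the Spark workers do not default to the same interpreter, the environment variables mentioned above can be set before building the session. A minimal sketch, assuming the interpreter path below (it is only an example) exists on both the driver and the workers:

```python
import os

# Use the same interpreter on the driver and the workers; replace the path
# with the interpreter actually installed on your workers.
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python3'

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('spark://master:7077').getOrCreate()
```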

#### In R

In an R notebook with [SparkR][sparkr].

```R
library(SparkR)

# Spark session & context
sc <- sparkR.session("spark://master:7077")

# Sum of the first 100 whole numbers
sdf <- createDataFrame(list(1:100))
dapplyCollect(sdf,
              function(x)
              { x <- sum(x)}
             )
# 5050
```

In an R notebook with [sparklyr][sparklyr].

```R
library(sparklyr)

# Spark configuration
conf <- spark_config()
# Set the catalog implementation in-memory
conf$spark.sql.catalogImplementation <- "in-memory"

# Spark session & context
sc <- spark_connect(master = "spark://master:7077", config = conf)

# Sum of the first 100 whole numbers
sdf_len(sc, 100, repartition = 1) %>%
    spark_apply(function(e) sum(e))
# 5050
```

#### In Scala

##### In a Spylon Kernel

Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.

```python
%%init_spark
# Point Spark to the Spark master
launcher.master = "spark://master:7077"
```

```scala
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

##### In an Apache Toree Scala Notebook

The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.

For instance, to pass information about a standalone Spark master, you could start the container like so:

```bash
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://master:7077' \
    jupyter/all-spark-notebook
```

Note that this is the same information expressed in a notebook in the Python case above. Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:

```scala
// should print the value of --master in the kernel spec
println(sc.master)

// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

## Tensorflow

@@ -199,3 +248,7 @@ init = tf.global_variables_initializer()

```python
sess.run(init)
sess.run(hello)
```

[sparkr]: https://spark.apache.org/docs/latest/sparkr.html
[sparklyr]: https://spark.rstudio.com/
[spark-conf]: https://spark.apache.org/docs/latest/configuration.html