Merge branch 'master' into asalikhov/ubuntu_focal

Peter Parente
2020-05-29 09:12:34 -04:00
committed by GitHub
18 changed files with 1267 additions and 650 deletions


@@ -8,13 +8,13 @@ This page describes the options supported by the startup script as well as how t
You can pass [Jupyter command line options](https://jupyter.readthedocs.io/en/latest/projects/jupyter-command.html) to the `start-notebook.sh` script when launching the container. For example, to secure the Notebook server with a custom password hashed using `IPython.lib.passwd()` instead of the default token, you can run the following:
```
```bash
docker run -d -p 8888:8888 jupyter/base-notebook start-notebook.sh --NotebookApp.password='sha1:74ba40f8a388:c913541b7ee99d15d5ed31d4226bf7838f83a50e'
```
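If you need to generate such a hash, a minimal sketch (assumes IPython is installed on the machine where you run it; newer notebook releases expose the same helper as `notebook.auth.passwd()`):
```python
# Sketch: generate a password hash to pass to --NotebookApp.password.
# The passphrase below is illustrative.
from IPython.lib import passwd

print(passwd("my-secret-passphrase"))  # prints something like 'sha1:<salt>:<hash>'
```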
For example, to set the base URL of the notebook server, you can run the following:
```
```bash
docker run -d -p 8888:8888 jupyter/base-notebook start-notebook.sh --NotebookApp.base_url=/some/path
```
@@ -54,7 +54,7 @@ script for execution details.
You may mount SSL key and certificate files into a container and configure Jupyter Notebook to use them to accept HTTPS connections. For example, to mount a host folder containing a `notebook.key` and `notebook.crt` and use them, you might run the following:
```
```bash
docker run -d -p 8888:8888 \
-v /some/host/folder:/etc/ssl/notebook \
jupyter/base-notebook start-notebook.sh \
@@ -64,7 +64,7 @@ docker run -d -p 8888:8888 \
Alternatively, you may mount a single PEM file containing both the key and certificate. For example:
```
```bash
docker run -d -p 8888:8888 \
-v /some/host/folder/notebook.pem:/etc/ssl/notebook.pem \
jupyter/base-notebook start-notebook.sh \
@@ -85,13 +85,13 @@ For additional information about using SSL, see the following:
The `start-notebook.sh` script actually inherits most of its option handling capability from a more generic `start.sh` script. The `start.sh` script supports all of the features described above, but allows you to specify an arbitrary command to execute. For example, to run the text-based `ipython` console in a container, do the following:
```
```bash
docker run -it --rm jupyter/base-notebook start.sh ipython
```
Or, to run JupyterLab instead of the classic notebook, run the following:
```
```bash
docker run -it --rm -p 8888:8888 jupyter/base-notebook start.sh jupyter lab
```
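These images also accept a `JUPYTER_ENABLE_LAB` environment variable as another way to get JupyterLab from the default `start-notebook.sh` entrypoint; a hedged sketch (assumes the tag you run supports this variable):
```bash
# Sketch: ask start-notebook.sh to launch JupyterLab via an environment variable
docker run -it --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/base-notebook
```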
@@ -107,7 +107,7 @@ The default Python 3.x [Conda environment](http://conda.pydata.org/docs/using/en
The `jovyan` user has full read/write access to the `/opt/conda` directory. You can use either `conda` or `pip` to install new packages without any additional permissions.
```
```bash
# install a package into the default (python 3.x) environment
pip install some-package
conda install some-package


@@ -17,7 +17,7 @@ orchestrator config.
For example:
```
```bash
docker run -it -e GRANT_SUDO=yes --user root jupyter/minimal-notebook
```
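Once inside such a container, the `jovyan` user can install OS packages with passwordless `sudo`; a small sketch (the package is illustrative):
```bash
# Run inside a container started with -e GRANT_SUDO=yes --user root;
# the package below is just an example.
sudo apt-get update
sudo apt-get install -y less
```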
@@ -75,7 +75,7 @@ Python 2.x was removed from all images on August 10th, 2017, starting in tag `cc
add a Python 2.x environment by defining your own Dockerfile inheriting from one of the images like
so:
```
```dockerfile
# Choose your desired base image
FROM jupyter/scipy-notebook:latest
@@ -103,7 +103,7 @@ Ref:
The default version of Python that ships with conda/Ubuntu may not be the version you want.
To add a conda environment with a different version and make it accessible to Jupyter, the instructions are very similar to Python 2.x but are slightly simpler (no need to switch to `root`):
```
```dockerfile
# Choose your desired base image
FROM jupyter/minimal-notebook:latest
@@ -168,12 +168,12 @@ ENTRYPOINT ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root"]
```
And build the image as:
```
```bash
docker build -t jupyter/scipy-dasklabextension:latest .
```
Once built, run using the command:
```
```bash
docker run -it --rm -p 8888:8888 -p 8787:8787 jupyter/scipy-dasklabextension:latest
```
@@ -194,7 +194,7 @@ Ref:
[RISE](https://github.com/damianavila/RISE) is a notebook extension that lets you create live Reveal.js slideshows of your notebooks, with no conversion required:
```
```bash
# Add Live slideshows with RISE
RUN conda install -c damianavila82 rise
```
@@ -207,7 +207,7 @@ Credit: [Paolo D.](https://github.com/pdonorio) based on
You need to install conda's `gcc` for the Python `xgboost` package to work properly. Otherwise, you'll get an
exception about `libgomp.so.1` missing `GOMP_4.0`.
```
```bash
%%bash
conda install -y gcc
pip install xgboost
@@ -312,8 +312,8 @@ Credit: [Justin Tyberg](https://github.com/jtyberg), [quanghoc](https://github.c
To use a specific version of JupyterHub, the version of `jupyterhub` in your image should match the
version in the Hub itself.
```
FROM jupyter/base-notebook:5ded1de07260
```dockerfile
FROM jupyter/base-notebook:5ded1de07260
RUN pip install jupyterhub==0.8.0b1
```
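A hedged sketch of building and tagging the resulting single-user image (the image name is illustrative):
```bash
# Build the version-pinned single-user image; the tag is illustrative
docker build -t my-singleuser-notebook:0.8.0b1 .
```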
@@ -375,7 +375,7 @@ Ref:
### Using Local Spark JARs
```
```python
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
import pyspark
@@ -404,7 +404,7 @@ Ref:
### Use jupyter/all-spark-notebooks with an existing Spark/YARN cluster
```
```dockerfile
FROM jupyter/all-spark-notebook
# Set env vars for pydoop
@@ -480,13 +480,13 @@ convenient to launch the server without a password or token. In this case, you s
For jupyterlab:
```
```bash
docker run jupyter/base-notebook:6d2a05346196 start.sh jupyter lab --LabApp.token=''
```
For jupyter classic:
```
```bash
docker run jupyter/base-notebook:6d2a05346196 start.sh jupyter notebook --NotebookApp.token=''
```
@@ -494,7 +494,7 @@ docker run jupyter/base-notebook:6d2a05346196 start.sh jupyter notebook --Notebo
NB: this works for classic notebooks only
```
```dockerfile
# Update with your base image of choice
FROM jupyter/minimal-notebook:latest
@@ -513,7 +513,7 @@ Ref:
Using `auto-sklearn` requires `swig`, which the other notebook images lack, so it can't be experimented with there. Also, there is no Conda package for `auto-sklearn`.
```
```dockerfile
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM jupyter/scipy-notebook:latest


@@ -5,7 +5,8 @@ This page provides details about features specific to one or more images.
## Apache Spark
**Specific Docker Image Options**
* `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open [SparkUI (Spark Monitoring and Instrumentation UI)](http://spark.apache.org/docs/latest/monitoring.html) at default port `4040`; this option maps port `4040` inside the container to port `4040` on the host machine. Note that every new Spark context is put onto an incrementing port (i.e. 4040, 4041, 4042, etc.), so it might be necessary to open multiple ports, for example: `docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook`. A small snippet for checking which port a given context bound is shown below.
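To check which UI port a particular context actually bound, a minimal sketch from a Python notebook (assumes an existing `SparkContext` in the variable `sc`):
```python
# Sketch: print the Spark UI URL of an existing SparkContext `sc`
# (the port increments to 4041, 4042, ... when 4040 is already taken)
print(sc.uiWebUrl)  # e.g. http://<container-hostname>:4040
```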
**Usage Examples**
@@ -13,30 +14,66 @@ The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support t
### Using Spark Local Mode
Spark local mode is useful for experimentation on small data when you do not have a Spark cluster available.
Spark **local mode** is useful for experimentation on small data when you do not have a Spark cluster available.
#### In a Python Notebook
#### In Python
In a Python notebook.
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# do something to prove it works
spark.sql('SELECT "Test" as c1').show()
# Spark session & context
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext
# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
rdd.sum()
# 5050
```
#### In a R Notebook
#### In R
```r
In an R notebook with [SparkR][sparkr].
```R
library(SparkR)
as <- sparkR.session("local[*]")
# Spark session & context
sc <- sparkR.session("local")
# do something to prove it works
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
# Sum of the first 100 whole numbers
sdf <- createDataFrame(list(1:100))
dapplyCollect(sdf,
function(x)
{ x <- sum(x)}
)
# 5050
```
#### In a Spylon Kernel Scala Notebook
In an R notebook with [sparklyr][sparklyr].
```R
library(sparklyr)
# Spark configuration
conf <- spark_config()
# Set the catalog implementation in-memory
conf$spark.sql.catalogImplementation <- "in-memory"
# Spark session & context
sc <- spark_connect(master = "local", config = conf)
# Sum of the first 100 whole numbers
sdf_len(sc, 100, repartition = 1) %>%
spark_apply(function(e) sum(e))
# 5050
```
#### In Scala
##### In a Spylon Kernel
Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.
@@ -44,27 +81,30 @@ options in a `%%init_spark` magic cell.
```python
%%init_spark
# Configure Spark to use a local master
launcher.master = "local[*]"
launcher.master = "local"
```
```scala
// Now run Scala code that uses the initialized SparkContext in sc
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```
#### In an Apache Toree Scala Notebook
##### In an Apache Toree Kernel
Apache Toree instantiates a local `SparkContext` for you in variable `sc` when the kernel starts.
```scala
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```
### Connecting to a Spark Cluster in Standalone Mode
Connection to Spark Cluster on Standalone Mode requires the following set of steps:
Connecting to a Spark cluster in **[Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)** requires the following set of steps:
0. Verify that the Docker image (check the Dockerfile) and the Spark cluster being deployed run the same version of Spark.
@@ -72,98 +112,107 @@ Connection to Spark Cluster on Standalone Mode requires the following set of ste
2. Run the Docker container with `--net=host` in a location that is network addressable by all of
your Spark workers. (This is a [Spark networking
requirement](http://spark.apache.org/docs/latest/cluster-overview.html#components).)
* NOTE: When using `--net=host`, you must also use the flags `--pid=host -e TINI_SUBREAPER=true`. See https://github.com/jupyter/docker-stacks/issues/64 for details. A combined example is sketched just below.
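A hedged sketch combining these networking flags (the image choice is illustrative):
```bash
# Sketch: run on the host network so Spark workers can reach the driver;
# --pid=host and TINI_SUBREAPER are required alongside --net=host.
docker run -d --net=host --pid=host -e TINI_SUBREAPER=true jupyter/pyspark-notebook
```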
#### In a Python Notebook
**Note**: In the following examples we use the Spark master URL `spark://master:7077`, which should be replaced by the URL of your Spark master.
#### In Python
The **same Python version** needs to be used in the notebook (where the driver is located) and on the Spark workers.
The Python version used on the driver and worker side can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`; see [Spark Configuration][spark-conf] for more information.
```python
import os
# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
from pyspark.sql import SparkSession
import pyspark
conf = pyspark.SparkConf()
# Spark session & context
spark = SparkSession.builder.master('spark://master:7077').getOrCreate()
sc = spark.sparkContext
# Point to spark master
conf.setMaster("spark://10.10.10.10:7070")
# point to spark binary package in HDFS or on local filesystem on all slave
# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz)
conf.set("spark.executor.uri", "hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz")
# set other options as desired
conf.set("spark.executor.memory", "8g")
conf.set("spark.core.connection.ack.wait.timeout", "1200")
# create the context
sc = pyspark.SparkContext(conf=conf)
# do something to prove it works
rdd = sc.parallelize(range(100000000))
rdd.sumApprox(3)
# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
rdd.sum()
# 5050
```
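If the interpreter paths differ between the notebook container and the workers, you can pin them explicitly before building the session; a minimal sketch (the interpreter path is illustrative and must also exist on the workers):
```python
import os
from pyspark.sql import SparkSession

# Illustrative: make the driver and the workers use the same Python interpreter.
# These variables must be set before the SparkSession is created.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'

spark = SparkSession.builder.master('spark://master:7077').getOrCreate()
```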
#### In a R Notebook
#### In R
```r
In an R notebook with [SparkR][sparkr].
```R
library(SparkR)
# Point to spark master
# Point to spark binary package in HDFS or on local filesystem on all worker
# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz) in sparkEnvir
# Set other options in sparkEnvir
sc <- sparkR.session("spark://10.10.10.10:7070", sparkEnvir=list(
spark.executor.uri="hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz",
spark.executor.memory="8g"
)
)
# Spark session & context
sc <- sparkR.session("spark://master:7077")
# do something to prove it works
data(iris)
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
# Sum of the first 100 whole numbers
sdf <- createDataFrame(list(1:100))
dapplyCollect(sdf,
function(x)
{ x <- sum(x)}
)
# 5050
```
#### In a Spylon Kernel Scala Notebook
In an R notebook with [sparklyr][sparklyr].
```R
library(sparklyr)
# Spark session & context
# Spark configuration
conf <- spark_config()
# Set the catalog implementation in-memory
conf$spark.sql.catalogImplementation <- "in-memory"
sc <- spark_connect(master = "spark://master:7077", config = conf)
# Sum of the first 100 whole numbers
sdf_len(sc, 100, repartition = 1) %>%
spark_apply(function(e) sum(e))
# 5050
```
#### In Scala
##### In a Spylon Kernel
Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.
```python
%%init_spark
# Point to spark master
launcher.master = "spark://10.10.10.10:7070"
launcher.conf.spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz
# Configure Spark to use a local master
launcher.master = "spark://master:7077"
```
```scala
// Now run Scala code that uses the initialized SparkContext in sc
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```
#### In an Apache Toree Scala Notebook
##### In an Apache Toree Kernel
The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.
For instance, to pass information about a standalone Spark master, Spark binary location in HDFS,
and an executor options, you could start the container like so:
For instance, to pass information about a standalone Spark master, you could start the container like so:
```
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://10.10.10.10:7070 \
--spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz \
--spark.executor.memory=8g' jupyter/all-spark-notebook
```bash
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://master:7077' \
jupyter/all-spark-notebook
```
Note that this is the same information expressed in a notebook in the Python case above. Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:
```scala
// should print the value of --master in the kernel spec
println(sc.master)
// do something to prove it works
val rdd = sc.parallelize(0 to 99999999)
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```
## Tensorflow
@@ -199,3 +248,7 @@ init = tf.global_variables_initializer()
sess.run(init)
sess.run(hello)
```
[sparkr]: https://spark.apache.org/docs/latest/sparkr.html
[sparklyr]: https://spark.rstudio.com/
[spark-conf]: https://spark.apache.org/docs/latest/configuration.html