Mirror of https://github.com/jupyter/docker-stacks.git, synced 2025-10-08 10:34:06 +00:00
Resolves #1131: Allow alternative Spark version
Allow building the `pyspark-notebook` image with an alternative Spark version.

- Define arguments for the Spark installation
- Add a note in "Image Specifics" explaining how to build an image with an alternative Spark version
- Remove Toree documentation from "Image Specifics" since its support was dropped in #1115
docs/using/specifics.md

@@ -2,21 +2,81 @@

This page provides details about features specific to one or more images.

## Apache Spark
## Apache Spark™

**Specific Docker Image Options**
### Specific Docker Image Options
* `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open [SparkUI (Spark Monitoring and Instrumentation UI)](http://spark.apache.org/docs/latest/monitoring.html) at default port `4040`; this option maps port `4040` inside the Docker container to port `4040` on the host machine. Note that every new Spark context is put onto an incrementing port (e.g. `4040`, `4041`, `4042`, etc.), so it might be necessary to open multiple ports. For example: `docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook`.

**Usage Examples**

### Build an Image with a Different Version of Spark

You can build a `pyspark-notebook` image (and also the downstream `all-spark-notebook` image) with a different version of Spark by overriding the default value of the following arguments at build time.
* The Spark distribution is defined by the combination of Spark and Hadoop versions and is verified by the package checksum; see [Download Apache Spark](https://spark.apache.org/downloads.html) for more information.
  * `spark_version`: The Spark version to install (`3.0.0`).
  * `hadoop_version`: The Hadoop version (`3.2`).
  * `spark_checksum`: The package checksum (`BFE4540...`).
* Spark is shipped with a version of Py4J that has to be referenced in the `PYTHONPATH`.
  * `py4j_version`: The Py4J version (`0.10.9`), see the tip below.
* Spark can run with different OpenJDK versions.
  * `openjdk_version`: The version of the OpenJDK (JRE headless) distribution (`11`), see [Ubuntu packages](https://packages.ubuntu.com/search?keywords=openjdk).
For example, here is how to build a `pyspark-notebook` image with Spark `2.4.6`, Hadoop `2.7`, and OpenJDK `8`.

```bash
# From the root of the project
# Build the image with different arguments
docker build --rm --force-rm \
    -t jupyter/pyspark-notebook:spark-2.4.6 ./pyspark-notebook \
    --build-arg spark_version=2.4.6 \
    --build-arg hadoop_version=2.7 \
    --build-arg spark_checksum=3A9F401EDA9B5749CDAFD246B1D14219229C26387017791C345A23A65782FB8B25A302BF4AC1ED7C16A1FE83108E94E55DAD9639A51C751D81C8C0534A4A9641 \
    --build-arg openjdk_version=8 \
    --build-arg py4j_version=0.10.7

# Check the newly built image
docker images jupyter/pyspark-notebook:spark-2.4.6

# REPOSITORY                 TAG           IMAGE ID       CREATED         SIZE
# jupyter/pyspark-notebook   spark-2.4.6   7ad7b5a9dbcd   4 minutes ago   3.44GB

# Check the Spark version
docker run -it --rm jupyter/pyspark-notebook:spark-2.4.6 pyspark --version

# Welcome to
#       ____              __
#      / __/__  ___ _____/ /__
#     _\ \/ _ \/ _ `/ __/ '_/
#    /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
#       /_/
#
# Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265
```
**Tip**: to get the version of Py4J shipped with Spark:

* Build a first image without changing `py4j_version` (this will not prevent the image from building, it will just prevent Python from finding the `pyspark` module),
* get the version (`ls /usr/local/spark/python/lib/`),
* set the version, e.g. `--build-arg py4j_version=0.10.7`.

*Note: At the time of writing there is an issue preventing the use of Spark `2.4.6` with Python `3.8`, see [this answer on SO](https://stackoverflow.com/a/62173969/4413446) for more information.*
```bash
docker run -it --rm jupyter/pyspark-notebook:spark-2.4.6 ls /usr/local/spark/python/lib/
# py4j-0.10.7-src.zip PY4J_LICENSE.txt pyspark.zip
# You can now set the build-arg
# --build-arg py4j_version=
```

### Usage Examples

The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support the use of [Apache Spark](https://spark.apache.org/) in Python, R, and Scala notebooks. The following sections provide some examples of how to get started using them.
### Using Spark Local Mode
#### Using Spark Local Mode

Spark **local mode** is useful for experimentation on small data when you do not have a Spark cluster available.

#### In Python
##### In Python

In a Python notebook.
@@ -33,7 +93,7 @@ rdd.sum()
# 5050
```
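For context, a minimal local-mode session in a Python notebook could look like the sketch below; it assumes only the standard `pyspark.sql.SparkSession` builder API, and the application name is illustrative.

```python
from pyspark.sql import SparkSession

# Create a Spark session that runs in local mode inside the container
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("local-mode-example") \
    .getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
print(rdd.sum())
# 5050
```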
#### In R
##### In R

In an R notebook with [SparkR][sparkr].

@@ -71,9 +131,7 @@ sdf_len(sc, 100, repartition = 1) %>%
# 5050
```
#### In Scala

##### In a Spylon Kernel
##### In Scala

Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.

@@ -91,18 +149,7 @@ rdd.sum()
// 5050
```
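A minimal `%%init_spark` cell for local mode could look like the following sketch; the `launcher` object and its attributes are assumed to be the ones exposed by the spylon-kernel, and the executor setting is purely illustrative.

```python
%%init_spark
# Run Spark in local mode; `launcher` is provided by spylon-kernel
launcher.master = "local[*]"
# Illustrative extra configuration passed through to Spark
launcher.conf.spark.executor.cores = 1
```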
##### In an Apache Toree Kernel

Apache Toree instantiates a local `SparkContext` for you in variable `sc` when the kernel starts.

```scala
// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```
### Connecting to a Spark Cluster in Standalone Mode
#### Connecting to a Spark Cluster in Standalone Mode

Connecting to a Spark cluster in **[Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)** requires the following set of steps:

@@ -117,7 +164,7 @@ Connection to Spark Cluster on **[Standalone Mode](https://spark.apache.org/docs

**Note**: In the following examples we use the Spark master URL `spark://master:7077`, which should be replaced by the URL of your Spark master.
#### In Python
##### In Python

The **same Python version** needs to be used on the notebook (where the driver is located) and on the Spark workers.
The Python version used on the driver and worker side can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`; see [Spark Configuration][spark-conf] for more information.
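For example, one way to pin both interpreters is to pass these variables when starting the container. This is only a sketch: the interpreter path `/opt/conda/bin/python` is assumed to be the one inside the image, and the same path must also be valid on your Spark workers.

```bash
# Pin the Python interpreter used by the driver and requested for the workers
# (paths are illustrative and must match the workers' setup)
docker run -it --rm -p 8888:8888 \
    -e PYSPARK_PYTHON=/opt/conda/bin/python \
    -e PYSPARK_DRIVER_PYTHON=/opt/conda/bin/python \
    jupyter/pyspark-notebook
```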
@@ -135,7 +182,7 @@ rdd.sum()
# 5050
```

#### In R
##### In R

In an R notebook with [SparkR][sparkr].

@@ -172,9 +219,7 @@ sdf_len(sc, 100, repartition = 1) %>%
# 5050
```
#### In Scala

##### In a Spylon Kernel
##### In Scala

Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.

@@ -192,29 +237,6 @@ rdd.sum()
// 5050
```
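For a standalone cluster, the `%%init_spark` cell only needs to point the launcher at the master URL. The sketch below reuses the placeholder `spark://master:7077` from the note above and again assumes the `launcher` object provided by the spylon-kernel.

```python
%%init_spark
# Connect the Spylon kernel's Spark session to the standalone master
# (replace spark://master:7077 with the URL of your Spark master)
launcher.master = "spark://master:7077"
```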
##### In an Apache Toree Scala Notebook

The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.

For instance, to pass information about a standalone Spark master, you could start the container like so:

```bash
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://master:7077' \
    jupyter/all-spark-notebook
```

Note that this is the same information expressed in a notebook in the Python case above. Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:

```scala
// should print the value of --master in the kernel spec
println(sc.master)

// Sum of the first 100 whole numbers
val rdd = sc.parallelize(0 to 100)
rdd.sum()
// 5050
```

## Tensorflow

The `jupyter/tensorflow-notebook` image supports the use of
pyspark-notebook/Dockerfile

@@ -11,20 +11,30 @@ SHELL ["/bin/bash", "-o", "pipefail", "-c"]

USER root
# Spark dependencies
ENV APACHE_SPARK_VERSION=3.0.0 \
    HADOOP_VERSION=3.2
# Default values can be overridden at build time
# (ARGS are in lower case to distinguish them from ENV)
ARG spark_version="3.0.0"
ARG hadoop_version="3.2"
ARG spark_checksum="BFE45406C67CC4AE00411AD18CC438F51E7D4B6F14EB61E7BF6B5450897C2E8D3AB020152657C0239F253735C263512FFABF538AC5B9FFFA38B8295736A9C387"
ARG py4j_version="0.10.9"
ARG openjdk_version="11"

ENV APACHE_SPARK_VERSION="${spark_version}" \
    HADOOP_VERSION="${hadoop_version}"
RUN apt-get -y update && \
    apt-get install --no-install-recommends -y openjdk-11-jre-headless ca-certificates-java && \
    apt-get install --no-install-recommends -y \
    "openjdk-${openjdk_version}-jre-headless" \
    ca-certificates-java && \
    rm -rf /var/lib/apt/lists/*
# Using the preferred mirror to download Spark
# Spark installation
WORKDIR /tmp

# Using the preferred mirror to download Spark
# hadolint ignore=SC2046
RUN wget -q $(wget -qO- https://www.apache.org/dyn/closer.lua/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz\?as_json | \
    python -c "import sys, json; content=json.load(sys.stdin); print(content['preferred']+content['path_info'])") && \
    echo "BFE45406C67CC4AE00411AD18CC438F51E7D4B6F14EB61E7BF6B5450897C2E8D3AB020152657C0239F253735C263512FFABF538AC5B9FFFA38B8295736A9C387 *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
    echo "${spark_checksum} *spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - && \
    tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" -C /usr/local --owner root --group root --no-same-owner && \
    rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
@@ -33,7 +43,7 @@ RUN ln -s "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" spark

# Configure Spark
ENV SPARK_HOME=/usr/local/spark
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip \
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-${py4j_version}-src.zip" \
    SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
    PATH=$PATH:$SPARK_HOME/bin