mirror of https://github.com/jupyter/docker-stacks.git (synced 2025-10-10 19:42:58 +00:00)
Improve spark installation
The Spark installation is improved by sourcing `spark-config.sh` from the `before-notebook.d` hook directory run by `start.sh`. This automatically adds the correct Py4J dependency version to the `PYTHONPATH`, so the variable no longer needs to be set at build time. The documentation describing the installation of a custom Spark version is updated to remove this step and to install the latest `2.x` Spark version. `test_pyspark` is also fixed (it previously always passed, even when the import did nothing).
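For context, `start.sh` sources every `*.sh` script it finds in `/usr/local/bin/before-notebook.d/` before executing the container command, so linking Spark's own `sbin/spark-config.sh` into that directory puts the Py4J archive bundled with the installed Spark distribution on `PYTHONPATH` at container start. A minimal sketch of the effect, not the actual script shipped with Spark:

```bash
# Sketch only: approximates what sourcing spark-config.sh achieves at start-up.
export SPARK_HOME="${SPARK_HOME:-/usr/local/spark}"

# Use whichever Py4J version this Spark distribution bundles,
# instead of hard-coding it at image build time.
PY4J_ZIP="$(ls "${SPARK_HOME}"/python/lib/py4j-*-src.zip 2>/dev/null | head -n 1)"
export PYTHONPATH="${SPARK_HOME}/python:${PY4J_ZIP}:${PYTHONPATH}"
```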
@@ -16,8 +16,6 @@ You can build a `pyspark-notebook` image (and also the downstream `all-spark-not
 * `spark_version`: The Spark version to install (`3.0.0`).
 * `hadoop_version`: The Hadoop version (`3.2`).
 * `spark_checksum`: The package checksum (`BFE4540...`).
-* Spark is shipped with a version of Py4J that has to be referenced in the `PYTHONPATH`.
-* `py4j_version`: The Py4J version (`0.10.9`), see the tip below.
 * Spark can run with different OpenJDK versions.
 * `openjdk_version`: The version of (JRE headless) the OpenJDK distribution (`11`), see [Ubuntu packages](https://packages.ubuntu.com/search?keywords=openjdk).

@@ -27,47 +25,25 @@ For example here is how to build a `pyspark-notebook` image with Spark `2.4.6`,
 # From the root of the project
 # Build the image with different arguments
 docker build --rm --force-rm \
-    -t jupyter/pyspark-notebook:spark-2.4.6 ./pyspark-notebook \
-    --build-arg spark_version=2.4.6 \
+    -t jupyter/pyspark-notebook:spark-2.4.7 ./pyspark-notebook \
+    --build-arg spark_version=2.4.7 \
     --build-arg hadoop_version=2.7 \
-    --build-arg spark_checksum=3A9F401EDA9B5749CDAFD246B1D14219229C26387017791C345A23A65782FB8B25A302BF4AC1ED7C16A1FE83108E94E55DAD9639A51C751D81C8C0534A4A9641 \
-    --build-arg openjdk_version=8 \
-    --build-arg py4j_version=0.10.7
+    --build-arg spark_checksum=0F5455672045F6110B030CE343C049855B7BA86C0ECB5E39A075FF9D093C7F648DA55DED12E72FFE65D84C32DCD5418A6D764F2D6295A3F894A4286CC80EF478 \
+    --build-arg openjdk_version=8

 # Check the newly built image
-docker images jupyter/pyspark-notebook:spark-2.4.6
-
-# REPOSITORY                 TAG            IMAGE ID       CREATED          SIZE
-# jupyter/pyspark-notebook   spark-2.4.6    7ad7b5a9dbcd   4 minutes ago    3.44GB
-
-# Check the Spark version
-docker run -it --rm jupyter/pyspark-notebook:spark-2.4.6 pyspark --version
+docker run -it --rm jupyter/pyspark-notebook:spark-2.4.7 pyspark --version

 # Welcome to
 #       ____              __
 #      / __/__  ___ _____/ /__
 #     _\ \/ _ \/ _ `/ __/ '_/
-#    /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
+#    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
 #       /_/
 #
-# Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_265
+# Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_275
 ```

-**Tip**: to get the version of Py4J shipped with Spark:
-
-* Build a first image without changing `py4j_version` (it will not prevent the image to build it will just prevent Python to find the `pyspark` module),
-* get the version (`ls /usr/local/spark/python/lib/`),
-* set the version `--build-arg py4j_version=0.10.7`.
-
-```bash
-docker run -it --rm jupyter/pyspark-notebook:spark-2.4.6 ls /usr/local/spark/python/lib/
-# py4j-0.10.7-src.zip PY4J_LICENSE.txt pyspark.zip
-# You can now set the build-arg
-# --build-arg py4j_version=
-```
-
-*Note: At the time of writing there is an issue preventing to use Spark `2.4.6` with Python `3.8`, see [this answer on SO](https://stackoverflow.com/a/62173969/4413446) for more information.*
-
 ### Usage Examples

 The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support the use of [Apache Spark](https://spark.apache.org/) in Python, R, and Scala notebooks. The following sections provide some examples of how to get started using them.
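Since `py4j_version` no longer needs to be passed, a quick way to check that PySpark is importable in the freshly built image is to run Python through `start.sh`, so the `before-notebook.d` hook sets `PYTHONPATH` first. The tag matches the build example above; the expected output is an assumption, not captured from a real run:

```bash
# start.sh runs the before-notebook.d hooks (including spark-config.sh)
# before executing the given command.
docker run -it --rm jupyter/pyspark-notebook:spark-2.4.7 \
    start.sh python -c 'import pyspark; print(pyspark.__version__)'
# Expected output: 2.4.7
```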
@@ -16,7 +16,6 @@ USER root
 ARG spark_version="3.0.1"
 ARG hadoop_version="3.2"
 ARG spark_checksum="E8B47C5B658E0FBC1E57EEA06262649D8418AE2B2765E44DA53AAF50094877D17297CC5F0B9B35DF2CEEF830F19AA31D7E56EAD950BBE7F8830D6874F88CFC3C"
-ARG py4j_version="0.10.9"
 ARG openjdk_version="11"

 ENV APACHE_SPARK_VERSION="${spark_version}" \
@@ -39,14 +38,17 @@ RUN wget -q $(wget -qO- https://www.apache.org/dyn/closer.lua/spark/spark-${APAC
     rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"

 WORKDIR /usr/local
-RUN ln -s "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" spark

 # Configure Spark
 ENV SPARK_HOME=/usr/local/spark
-ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-${py4j_version}-src.zip" \
-    SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
+ENV SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
     PATH=$PATH:$SPARK_HOME/bin

+RUN ln -s "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" spark && \
+    # Add a link in the before_notebook hook in order to source automatically PYTHONPATH
+    mkdir -p /usr/local/bin/before-notebook.d && \
+    ln -s "${SPARK_HOME}/sbin/spark-config.sh" /usr/local/bin/before-notebook.d/spark-config.sh
+
 USER $NB_UID

 # Install pyarrow
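The `RUN` block above only creates the symlink; the sourcing itself happens at runtime, when `start.sh` walks the hook directory and sources shell scripts before handing control to the requested command. Roughly, along the lines of the following paraphrase (simplified, not copied from the actual `start.sh`):

```bash
# Paraphrase of the before-notebook hook loop in start.sh (simplified).
if [[ -d /usr/local/bin/before-notebook.d ]]; then
    for f in /usr/local/bin/before-notebook.d/*; do
        case "${f}" in
            *.sh) source "${f}" ;;            # spark-config.sh is picked up here
            *)    [[ -x "${f}" ]] && "${f}" ;;
        esac
    done
fi
```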
@@ -12,19 +12,19 @@ def test_spark_shell(container):
         tty=True,
         command=['start.sh', 'bash', '-c', 'spark-shell <<< "1+1"']
     )
-    c.wait(timeout=30)
+    c.wait(timeout=60)
     logs = c.logs(stdout=True).decode('utf-8')
     LOGGER.debug(logs)
-    assert 'res0: Int = 2' in logs
+    assert 'res0: Int = 2' in logs, "spark-shell does not work"


 def test_pyspark(container):
     """PySpark should be in the Python path"""
     c = container.run(
         tty=True,
-        command=['start.sh', 'python', '-c', '"import pyspark"']
+        command=['start.sh', 'python', '-c', 'import pyspark']
     )
     rv = c.wait(timeout=30)
-    assert rv == 0 or rv["StatusCode"] == 0
+    assert rv == 0 or rv["StatusCode"] == 0, "pyspark not in PYTHONPATH"
     logs = c.logs(stdout=True).decode('utf-8')
     LOGGER.debug(logs)
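The `test_pyspark` fix deserves a note: the old command passed `'"import pyspark"'` to `python -c`, so Python received a quoted string literal, evaluated it as a no-op expression, and exited 0 whether or not `pyspark` was importable. Dropping the inner quotes makes the import actually run, which is easy to reproduce locally (assuming `pyspark` is not installed in the plain interpreter):

```bash
# The old argument is a Python string literal, i.e. a no-op expression:
python -c '"import pyspark"'; echo $?   # prints 0 even without pyspark installed
# The fixed argument really imports the module:
python -c 'import pyspark'; echo $?     # non-zero unless pyspark is importable
```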