mirror of https://github.com/jupyter/docker-stacks.git (synced 2025-10-15 22:12:57 +00:00)
Automatically install latest spark version (#2075)
* Automatically install latest pyspark version
* Better text
* Do not use shutil to keep behaviour
* Make setup_script cwd independent
* Use _get_program_version to calculate spark version
* Update setup_spark.py reqs
* Update setup_spark.py
* Add info about HADOOP_VERSION
* Add customization back
* Better text
* Specify build args when they are actually needed
* Better text
* Better code
* Better code
* Better text
* Get rid of warning
* Improve code
* Remove information about checksum
* Better text
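A sketch of the idea behind the headline change (illustration only, not the commit's code, which lives in `setup_spark.py` and uses a `_get_program_version` helper): when no Spark version is pinned, install the newest `pyspark` release and derive the Spark version from the installed package.

```bash
# Illustration only, assuming pip and python are on PATH:
# install the latest pyspark release, then read its version to decide
# which Spark distribution to set up.
pip install --quiet pyspark   # no version pin, so PyPI's newest release
spark_version=$(python -c 'import pyspark; print(pyspark.__version__)')
echo "Resolved Spark version: ${spark_version}"
```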
@@ -42,18 +42,20 @@ ipython profile create
 You can build a `pyspark-notebook` image with a different `Spark` version by overriding the default value of the following arguments at build time.
 `all-spark-notebook` is inherited from `pyspark-notebook`, so you have to first build `pyspark-notebook` and then `all-spark-notebook` to get the same version in `all-spark-notebook`.

-- Spark distribution is defined by the combination of Spark, Hadoop, and Scala versions and verified by the package checksum,
+- Spark distribution is defined by the combination of Spark, Hadoop, and Scala versions,
   see [Download Apache Spark](https://spark.apache.org/downloads.html) and the [archive repo](https://archive.apache.org/dist/spark/) for more information.
-  - `spark_version`: The Spark version to install (`3.3.0`).
-  - `hadoop_version`: The Hadoop version (`3.2`).
-  - `scala_version`: The Scala version (`2.13`, optional).
-  - `spark_checksum`: The package checksum (`BFE4540...`).
-- `openjdk_version`: The version of the OpenJDK (JRE headless) distribution (`17`).
+  - `spark_version` (optional): The Spark version to install, for example `3.5.0`.
+    If not specified (this is the default), the latest stable Spark version will be installed.
+  - `hadoop_version`: The Hadoop version (`3` by default).
+    Note that _Spark < 3.3_ requires specifying a `major.minor` Hadoop version (i.e. `3.2`).
+  - `scala_version` (optional): The Scala version, for example `2.13` (not specified by default).
+    Starting with _Spark >= 3.2_, the distribution file might contain the Scala version.
+  - `spark_download_url`: The URL to use for Spark downloads.
+    You may need to use the <https://archive.apache.org/dist/spark/> URL if you want to download old Spark versions.
+- `openjdk_version`: The version of the OpenJDK (JRE headless) distribution (`17` by default).
   - This version needs to match the version supported by the Spark distribution used above.
   - See [Spark Overview](https://spark.apache.org/docs/latest/#downloading) and [Ubuntu packages](https://packages.ubuntu.com/search?keywords=openjdk).
-- Starting with _Spark >= 3.2_, the distribution file might contain the Scala version.

 For example, here is how to build a `pyspark-notebook` image with Spark `3.2.0`, Hadoop `3.2`, and OpenJDK `11`.
@@ -65,14 +67,14 @@ This recipe is not tested and might be broken.
 # From the root of the project
 # Build the image with different arguments
 docker build --rm --force-rm \
-    -t jupyter/pyspark-notebook:spark-3.2.0 ./images/pyspark-notebook \
+    -t my-pyspark-notebook ./images/pyspark-notebook \
+    --build-arg openjdk_version=11 \
     --build-arg spark_version=3.2.0 \
     --build-arg hadoop_version=3.2 \
-    --build-arg spark_checksum=707DDE035926A50B75E53FCA72CADA519F3239B14A96546911CB4916A58DCF69A1D2BFDD2C7DD5899324DBD82B6EEAB9797A7B4ABF86736FFCA4C26D0E0BF0EE \
-    --build-arg openjdk_version=11
+    --build-arg spark_download_url="https://archive.apache.org/dist/spark/"

 # Check the newly built image
-docker run -it --rm quay.io/jupyter/pyspark-notebook:spark-3.2.0 pyspark --version
+docker run -it --rm my-pyspark-notebook pyspark --version

 # Welcome to
 #       ____              __
@@ -81,7 +83,12 @@ docker run -it --rm quay.io/jupyter/pyspark-notebook:spark-3.2.0 pyspark --versi
 #    /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
 #       /_/
 #
-# Using Scala version 2.13.5, OpenJDK 64-Bit Server VM, 11.0.15
+# Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.21
+# Branch HEAD
+# Compiled by user ubuntu on 2021-10-06T12:46:30Z
+# Revision 5d45a415f3a29898d92380380cfd82bfc7f579ea
+# Url https://github.com/apache/spark
+# Type --help for more information.
 ```

 ### Usage Examples
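For reference, with the change above a plain build with no `--build-arg` overrides installs the latest stable Spark (with Hadoop `3` and OpenJDK `17` by default), so a pinned recipe like the one in this diff is only needed for older versions. A minimal sketch, reusing the `my-pyspark-notebook` tag from the example:

```bash
# Build with the new defaults: the latest stable Spark is resolved at build time
docker build --rm --force-rm -t my-pyspark-notebook ./images/pyspark-notebook

# Check which Spark version ended up in the image
docker run -it --rm my-pyspark-notebook pyspark --version
```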