From 327f78db39c3fd403a6dc4ccdc5bd5ec4f21517a Mon Sep 17 00:00:00 2001
From: romainx
Date: Wed, 2 Jun 2021 21:19:15 +0200
Subject: [PATCH] Define Spark Dependencies

---
 docs/using/specifics.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/docs/using/specifics.md b/docs/using/specifics.md
index 8f61b235..3e2f6f71 100644
--- a/docs/using/specifics.md
+++ b/docs/using/specifics.md
@@ -212,6 +212,39 @@ rdd.sum() // 5050
 ```
 
+### Define Spark Dependencies
+
+Spark dependencies can be declared through the `spark.jars.packages` property
+(see [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html#runtime-environment) for more information).
+
+They can be defined as a comma-separated list of Maven coordinates at the creation of the Spark session.
+
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+    SparkSession.builder.appName("elasticsearch")
+    .config(
+        "spark.jars.packages",
+        "org.elasticsearch:elasticsearch-spark-30_2.12:7.13.0",
+    )
+    .getOrCreate()
+)
+```
+
+Dependencies can also be defined in the `spark-defaults.conf` file.
+However, editing this file requires `root`, so this approach should only be used to build custom images.
+
+```dockerfile
+USER root
+RUN echo "spark.jars.packages org.elasticsearch:elasticsearch-spark-30_2.12:7.13.0" >> $SPARK_HOME/conf/spark-defaults.conf
+USER $NB_UID
+```
+
+JARs are downloaded dynamically at the creation of the Spark session and stored by default in `$HOME/.ivy2/jars` (this location can be changed by setting `spark.jars.ivy`).
+
+_Note: This example is given for [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html)._
+
 ## Tensorflow
 
 The `jupyter/tensorflow-notebook` image supports the use of