From 3277f48c23f9d31ff5b6e1dc6cca5cdb5ff0ef33 Mon Sep 17 00:00:00 2001
From: Romain
Date: Thu, 28 May 2020 12:26:09 +0200
Subject: [PATCH] Follow-on PR #911: Spark documentation rework

Some changes to the Spark documentation for local and standalone use cases,
with the following drivers:

* Simplify some of the examples (removing options, etc.)
* Use the same code as much as possible in each example to be consistent
  (only kept R different from the others)
* Add sparklyr as an option for R
* Add some notes about prerequisites (same version of Python and R installed
  on the workers)
---
 docs/using/specifics.md | 180 ++++++++++++++++++++++++----------------
 1 file changed, 108 insertions(+), 72 deletions(-)

diff --git a/docs/using/specifics.md b/docs/using/specifics.md
index 14a15b27..36d82b6b 100644
--- a/docs/using/specifics.md
+++ b/docs/using/specifics.md
@@ -5,6 +5,7 @@ This page provides details about features specific to one or more images.
 ## Apache Spark
 
 **Specific Docker Image Options**
+
 * `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open [SparkUI (Spark Monitoring and Instrumentation UI)](http://spark.apache.org/docs/latest/monitoring.html) at default port `4040`. This option maps port `4040` inside the container to port `4040` on the host machine. Note that every new Spark context is put on an incrementing port (i.e. 4040, 4041, 4042, etc.), so it might be necessary to open multiple ports. For example: `docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook`
 
 **Usage Examples**
 
@@ -13,30 +14,59 @@ The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support t
 ### Using Spark Local Mode
 
-Spark local mode is useful for experimentation on small data when you do not have a Spark cluster available.
+Spark **local mode** is useful for experimentation on small data when you do not have a Spark cluster available.
 
-#### In a Python Notebook
+#### In Python
+
+In a Python notebook:
 
 ```python
 from pyspark.sql import SparkSession
-spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
-# do something to prove it works
-spark.sql('SELECT "Test" as c1').show()
+
+# Spark session & context
+spark = SparkSession.builder.master('local').getOrCreate()
+sc = spark.sparkContext
+
+# Do something to prove it works
+rdd = sc.parallelize(range(100))
+rdd.sum()
 ```
 
-#### In a R Notebook
+#### In R
 
-```r
+In an R notebook with [SparkR][sparkr]:
+
+```R
 library(SparkR)
 
-as <- sparkR.session("local[*]")
+# Spark session & context
+sc <- sparkR.session("local")
 
-# do something to prove it works
+# Do something to prove it works
+data(iris)
 df <- as.DataFrame(iris)
 head(filter(df, df$Petal_Width > 0.2))
 ```
 
-#### In a Spylon Kernel Scala Notebook
+In an R notebook with [sparklyr][sparklyr]:
+
+```R
+library(sparklyr)
+library(dplyr)
+
+# Spark session & context
+sc <- spark_connect(master = "local")
+
+# Do something to prove it works
+iris_tbl <- copy_to(sc, iris)
+iris_tbl %>%
+  filter(Petal_Width > 0.2) %>%
+  head()
+```
+
+#### In Scala
+
+##### In a Spylon Kernel
 
 Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
 options in a `%%init_spark` magic cell.
@@ -44,27 +74,28 @@
 ```python
 %%init_spark
 # Configure Spark to use a local master
-launcher.master = "local[*]"
+launcher.master = "local"
 ```
 
 ```scala
-// Now run Scala code that uses the initialized SparkContext in sc
-val rdd = sc.parallelize(0 to 999)
-rdd.takeSample(false, 5)
+// Do something to prove it works
+val rdd = sc.parallelize(0 to 100)
+rdd.sum()
 ```
 
-#### In an Apache Toree Scala Notebook
+##### In an Apache Toree Kernel
 
 Apache Toree instantiates a local `SparkContext` for you in variable `sc` when the kernel starts.
 
 ```scala
-val rdd = sc.parallelize(0 to 999)
-rdd.takeSample(false, 5)
+// Do something to prove it works
+val rdd = sc.parallelize(0 to 100)
+rdd.sum()
 ```
 
 ### Connecting to a Spark Cluster in Standalone Mode
 
-Connection to Spark Cluster on Standalone Mode requires the following set of steps:
+Connecting to a Spark cluster in **[Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)** requires the following steps:
 
 0. Verify that the docker image (check the Dockerfile) and the Spark Cluster which is being
    deployed, run the same version of Spark.
@@ -75,94 +106,95 @@ Connection to Spark Cluster on Standalone Mode requires the following set of ste
   * NOTE: When using `--net=host`, you must also use the flags `--pid=host -e
     TINI_SUBREAPER=true`. See https://github.com/jupyter/docker-stacks/issues/64 for details.
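+
+For instance, the requirements above could be combined into a run command along these lines (a sketch only: detached mode and the `jupyter/all-spark-notebook` image are examples, adjust them to your setup):
+
+```bash
+# Host networking so the Spark workers can reach the driver,
+# plus the companion flags required by --net=host (see the note above)
+docker run -d --net=host --pid=host -e TINI_SUBREAPER=true \
+    jupyter/all-spark-notebook
+```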
 
-#### In a Python Notebook
+**Note**: In the following examples, the Spark master URL `spark://master:7077` is a placeholder that should be replaced by the URL of your Spark master.
+
+#### In Python
+
+The **same Python version** needs to be used in the notebook (where the driver is located) and on the Spark workers.
+The Python version used by the driver and the workers can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`; see [Spark Configuration][spark-conf] for more information.
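+
+For instance, the variables could be set when starting the container (a sketch: the `/usr/bin/python3` path is an assumption and must match the interpreter installed on the workers):
+
+```bash
+# Pin the Python interpreter used on the driver and worker sides
+docker run -d -p 8888:8888 \
+    -e PYSPARK_PYTHON=/usr/bin/python3 \
+    -e PYSPARK_DRIVER_PYTHON=/usr/bin/python3 \
+    jupyter/pyspark-notebook
+```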
 
 ```python
-import os
-# make sure pyspark tells workers to use python3 not 2 if both are installed
-os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
+from pyspark.sql import SparkSession
 
-import pyspark
-conf = pyspark.SparkConf()
+# Spark session & context
+spark = SparkSession.builder.master('spark://master:7077').getOrCreate()
+sc = spark.sparkContext
 
-# Point to spark master
-conf.setMaster("spark://10.10.10.10:7070")
-# point to spark binary package in HDFS or on local filesystem on all slave
-# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz)
-conf.set("spark.executor.uri", "hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz")
-# set other options as desired
-conf.set("spark.executor.memory", "8g")
-conf.set("spark.core.connection.ack.wait.timeout", "1200")
-
-# create the context
-sc = pyspark.SparkContext(conf=conf)
-
-# do something to prove it works
-rdd = sc.parallelize(range(100000000))
-rdd.sumApprox(3)
+# Do something to prove it works
+rdd = sc.parallelize(range(100))
+rdd.sum()
 ```
 
-#### In a R Notebook
+#### In R
 
-```r
+In an R notebook with [SparkR][sparkr]:
+
+```R
 library(SparkR)
 
-# Point to spark master
-# Point to spark binary package in HDFS or on local filesystem on all worker
-# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz) in sparkEnvir
-# Set other options in sparkEnvir
-sc <- sparkR.session("spark://10.10.10.10:7070", sparkEnvir=list(
-  spark.executor.uri="hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz",
-  spark.executor.memory="8g"
-  )
-)
+# Spark session & context
+sc <- sparkR.session("spark://master:7077")
 
-# do something to prove it works
+# Do something to prove it works
 data(iris)
 df <- as.DataFrame(iris)
 head(filter(df, df$Petal_Width > 0.2))
 ```
 
-#### In a Spylon Kernel Scala Notebook
+In an R notebook with [sparklyr][sparklyr]:
+
+```R
+library(sparklyr)
+library(dplyr)
+
+# Spark session & context
+sc <- spark_connect(master = "spark://master:7077")
+
+# Do something to prove it works
+iris_tbl <- copy_to(sc, iris)
+iris_tbl %>%
+  filter(Petal_Width > 0.2) %>%
+  head()
+```
+
+#### In Scala
+
+##### In a Spylon Kernel
+
+Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
+options in a `%%init_spark` magic cell.
 
 ```python
 %%init_spark
-# Point to spark master
-launcher.master = "spark://10.10.10.10:7070"
-launcher.conf.spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz
+# Configure Spark to connect to the standalone master
+launcher.master = "spark://master:7077"
 ```
 
 ```scala
-// Now run Scala code that uses the initialized SparkContext in sc
-val rdd = sc.parallelize(0 to 999)
-rdd.takeSample(false, 5)
+// Do something to prove it works
+val rdd = sc.parallelize(0 to 100)
+rdd.sum()
 ```
 
-#### In an Apache Toree Scala Notebook
+##### In an Apache Toree Kernel
 
-The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration
-information from its command line arguments and environment variables. You can pass information
-about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.
+The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.
 
-For instance, to pass information about a standalone Spark master, Spark binary location in HDFS,
-and an executor options, you could start the container like so:
+For instance, to pass information about a standalone Spark master, you could start the container like so:
 
-```
-docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://10.10.10.10:7070 \
-  --spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz \
-  --spark.executor.memory=8g' jupyter/all-spark-notebook
+```bash
+docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://master:7077' \
+    jupyter/all-spark-notebook
 ```
 
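+Since `SPARK_OPTS` holds spark-submit style options, other settings can likely be appended the same way; a sketch, where the memory value is purely illustrative:
+
+```bash
+docker run -d -p 8888:8888 \
+    -e SPARK_OPTS='--master=spark://master:7077 --conf spark.executor.memory=8g' \
+    jupyter/all-spark-notebook
+```
+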
-Note that this is the same information expressed in a notebook in the Python case above. Once the
-kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like
-so:
+Note that this is the same information expressed in a notebook in the Python case above.
+Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:
 
 ```scala
 // should print the value of --master in the kernel spec
 println(sc.master)
 
 // do something to prove it works
-val rdd = sc.parallelize(0 to 99999999)
+val rdd = sc.parallelize(0 to 100)
 rdd.sum()
 ```
@@ -199,3 +231,7 @@ init = tf.global_variables_initializer()
 sess.run(init)
 sess.run(hello)
 ```
+
+[sparkr]: https://spark.apache.org/docs/latest/sparkr.html
+[sparklyr]: https://spark.rstudio.com/
+[spark-conf]: https://spark.apache.org/docs/latest/configuration.html
\ No newline at end of file