Follow-on PR #911: Spark documentation rework

Rework the Spark documentation for the local and standalone use cases,
driven by the following goals:

* Simplify some of the examples (removing options, etc.)
* Use the same code as much as possible in each example for consistency (only the R examples differ from the others)
* Add sparklyr as an option for R
* Add some notes about prerequisites (same version of Python, R installed on workers)
Romain
2020-05-28 12:26:09 +02:00
parent 2c0af4ab51
commit 3277f48c23

@@ -5,6 +5,7 @@ This page provides details about features specific to one or more images.
## Apache Spark
**Specific Docker Image Options**
* `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open the [SparkUI (Spark Monitoring and Instrumentation UI)](http://spark.apache.org/docs/latest/monitoring.html) on the default port `4040`; this option maps port `4040` inside the Docker container to port `4040` on the host machine. Note that every new Spark context is assigned an incrementing port (i.e. 4040, 4041, 4042, etc.), so it might be necessary to publish multiple ports. For example: `docker run -d -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook`
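
As a rough illustration of the port note above, the following Python sketch assembles a `docker run` command that publishes one Spark UI port per expected Spark context. The helper name `spark_ui_port_flags` and its parameters are hypothetical and not part of any image or library; only the `-p` flag, the image name and the port numbers come from the text.

```python
def spark_ui_port_flags(n_contexts, first_port=4040):
    """Return `-p host:container` flags for the first `n_contexts` Spark UI ports.

    Hypothetical helper: it assumes each new Spark context takes the next
    port after `first_port`, as described in the note above.
    """
    flags = []
    for offset in range(n_contexts):
        port = first_port + offset
        flags += ["-p", f"{port}:{port}"]
    return flags


# Notebook port plus two Spark UI ports (4040 and 4041)
command = ["docker", "run", "-d", "-p", "8888:8888"]
command += spark_ui_port_flags(2)
command.append("jupyter/pyspark-notebook")
print(" ".join(command))
```

With `n_contexts=2` this prints the same command as the example in the bullet above.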
**Usage Examples**
@@ -13,30 +14,59 @@ The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support t
### Using Spark Local Mode
Spark local mode is useful for experimentation on small data when you do not have a Spark cluster available.
Spark **local mode** is useful for experimentation on small data when you do not have a Spark cluster available.
#### In a Python Notebook
#### In Python
In a Python notebook.
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# do something to prove it works
spark.sql('SELECT "Test" as c1').show()
# Spark session & context
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext
# Do something to prove it works
rdd = sc.parallelize(range(100))
rdd.sum()
```
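
The local master string also determines how many worker threads Spark uses: `local` runs with a single thread, `local[K]` with K threads, and `local[*]` with as many threads as there are logical cores. The sketch below makes that difference explicit. `local_thread_count` is a hypothetical helper for illustration, not part of PySpark, and it approximates the core count with `os.cpu_count()`, which may differ from what Spark sees inside a container.

```python
import os
import re


def local_thread_count(master):
    """Return the number of worker threads implied by a local master string.

    Hypothetical helper for illustration only; it handles `local`,
    `local[K]` and `local[*]` and rejects anything else.
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a supported local master string: {master!r}")
    value = match.group(1)
    if value == "*":
        # os.cpu_count() can return None; fall back to a single thread
        return os.cpu_count() or 1
    return int(value)


print(local_thread_count("local"))     # 1
print(local_thread_count("local[4]"))  # 4
```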
#### In a R Notebook
#### In R
```r
In an R notebook with [SparkR][sparkr].
```R
library(SparkR)
as <- sparkR.session("local[*]")
# Spark session & context
sc <- sparkR.session("local")
# do something to prove it works
# Do something to prove it works
data(iris)
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
```
#### In a Spylon Kernel Scala Notebook
In an R notebook with [sparklyr][sparklyr].
```R
library(sparklyr)
library(dplyr)
# Spark session & context
sc <- spark_connect(master = "local")
# Do something to prove it works
iris_tbl <- copy_to(sc, iris)
iris_tbl %>%
filter(Petal_Width > 0.2) %>%
head()
```
#### In Scala
##### In a Spylon Kernel
Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.
@@ -44,27 +74,28 @@ options in a `%%init_spark` magic cell.
```python
%%init_spark
# Configure Spark to use a local master
launcher.master = "local[*]"
launcher.master = "local"
```
```scala
// Now run Scala code that uses the initialized SparkContext in sc
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
// Do something to prove it works
val rdd = sc.parallelize(0 to 100)
rdd.sum()
```
#### In an Apache Toree Scala Notebook
##### In an Apache Toree Kernel
Apache Toree instantiates a local `SparkContext` for you in variable `sc` when the kernel starts.
```scala
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
// do something to prove it works
val rdd = sc.parallelize(0 to 100)
rdd.sum()
```
### Connecting to a Spark Cluster in Standalone Mode
Connection to Spark Cluster on Standalone Mode requires the following set of steps:
Connecting to a Spark cluster in **[Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)** requires the following steps:
0. Verify that the Docker image (check the Dockerfile) and the Spark cluster being
deployed run the same version of Spark.
@@ -75,94 +106,95 @@ Connection to Spark Cluster on Standalone Mode requires the following set of ste
* NOTE: When using `--net=host`, you must also use the flags `--pid=host -e
TINI_SUBREAPER=true`. See https://github.com/jupyter/docker-stacks/issues/64 for details.
#### In a Python Notebook
**Note**: The following examples use the Spark master URL `spark://master:7077` as a placeholder; replace it with the URL of your own Spark master.
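
Before pasting a master URL into the examples, it can help to check that it has the expected `spark://host:port` shape. The following is a minimal sketch using only the Python standard library; `parse_spark_master` is a hypothetical helper, not a PySpark function, and it only checks the URL syntax, not whether the master is reachable.

```python
from urllib.parse import urlparse


def parse_spark_master(url):
    """Split a standalone master URL into (host, port).

    Hypothetical helper, not part of PySpark. It validates the syntax
    only and does not try to contact the master.
    """
    parsed = urlparse(url)
    if parsed.scheme != "spark" or not parsed.hostname:
        raise ValueError(f"expected spark://host:port, got {url!r}")
    if parsed.port is None:
        raise ValueError(f"missing port in {url!r}")
    return parsed.hostname, parsed.port


host, port = parse_spark_master("spark://master:7077")
print(host, port)
```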
#### In Python
The **same Python version** needs to be used on the notebook (where the driver is located) and on the Spark workers.
The Python version used on the driver and worker sides can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`; see [Spark Configuration][spark-conf] for more information.
```python
import os
# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
from pyspark.sql import SparkSession
import pyspark
conf = pyspark.SparkConf()
# Spark session & context
spark = SparkSession.builder.master('spark://master:7077').getOrCreate()
sc = spark.sparkContext
# Point to spark master
conf.setMaster("spark://10.10.10.10:7070")
# point to spark binary package in HDFS or on local filesystem on all slave
# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz)
conf.set("spark.executor.uri", "hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz")
# set other options as desired
conf.set("spark.executor.memory", "8g")
conf.set("spark.core.connection.ack.wait.timeout", "1200")
# create the context
sc = pyspark.SparkContext(conf=conf)
# do something to prove it works
rdd = sc.parallelize(range(100000000))
rdd.sumApprox(3)
# Do something to prove it works
rdd = sc.parallelize(range(100))
rdd.sum()
```
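
As a sketch of the Python version prerequisite, the snippet below sets `PYSPARK_PYTHON` before any Spark session is created and compares two version tuples at the major.minor level. Both function names are hypothetical helpers, not PySpark APIs, and the interpreter path `/usr/bin/python3` is an assumption that must match a path that actually exists on your workers.

```python
import os
import sys


def pin_worker_python(worker_python):
    """Tell PySpark which interpreter the workers should use.

    Hypothetical helper; call it before the Spark session is created.
    `worker_python` is assumed to be a valid path on every worker.
    """
    os.environ["PYSPARK_PYTHON"] = worker_python
    return os.environ["PYSPARK_PYTHON"]


def same_minor_version(driver_version, worker_version):
    """Return True when both version tuples share the same (major, minor)."""
    return tuple(driver_version[:2]) == tuple(worker_version[:2])


pin_worker_python("/usr/bin/python3")

# Compare the driver's interpreter with a worker version you looked up
# yourself; (3, 7, 6) is only an illustrative value.
print(same_minor_version(sys.version_info, (3, 7, 6)))
```

Setting the variable from the notebook only affects sessions created afterwards, which is why the helper is called before `SparkSession.builder` would be.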
#### In a R Notebook
#### In R
```r
In an R notebook with [SparkR][sparkr].
```R
library(SparkR)
# Point to spark master
# Point to spark binary package in HDFS or on local filesystem on all worker
# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz) in sparkEnvir
# Set other options in sparkEnvir
sc <- sparkR.session("spark://10.10.10.10:7070", sparkEnvir=list(
spark.executor.uri="hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz",
spark.executor.memory="8g"
)
)
# Spark session & context
sc <- sparkR.session("spark://master:7077")
# do something to prove it works
# Do something to prove it works
data(iris)
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
```
#### In a Spylon Kernel Scala Notebook
In an R notebook with [sparklyr][sparklyr].
```R
library(sparklyr)
library(dplyr)
# Spark session & context
sc <- spark_connect(master = "spark://master:7077")
# Do something to prove it works
iris_tbl <- copy_to(sc, iris)
iris_tbl %>%
filter(Petal_Width > 0.2) %>%
head()
```
#### In Scala
##### In a Spylon Kernel
Spylon kernel instantiates a `SparkContext` for you in variable `sc` after you configure Spark
options in a `%%init_spark` magic cell.
```python
%%init_spark
# Point to spark master
launcher.master = "spark://10.10.10.10:7070"
launcher.conf.spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz
# Point to the Spark master
launcher.master = "spark://master:7077"
```
```scala
// Now run Scala code that uses the initialized SparkContext in sc
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
// Do something to prove it works
val rdd = sc.parallelize(0 to 100)
rdd.sum()
```
#### In an Apache Toree Scala Notebook
##### In an Apache Toree Scala Notebook
The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration
information from its command line arguments and environment variables. You can pass information
about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.
The Apache Toree kernel automatically creates a `SparkContext` when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your cluster via the `SPARK_OPTS` environment variable when you spawn a container.
For instance, to pass information about a standalone Spark master, Spark binary location in HDFS,
and an executor options, you could start the container like so:
For instance, to pass information about a standalone Spark master, you could start the container like so:
```
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://10.10.10.10:7070 \
--spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.4.3-bin-hadoop2.7.tgz \
--spark.executor.memory=8g' jupyter/all-spark-notebook
```bash
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=spark://master:7077' \
jupyter/all-spark-notebook
```
Note that this is the same information expressed in a notebook in the Python case above. Once the
kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like
so:
Note that this is the same information expressed in a notebook in the Python case above. Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:
```scala
// should print the value of --master in the kernel spec
println(sc.master)
// do something to prove it works
val rdd = sc.parallelize(0 to 99999999)
val rdd = sc.parallelize(0 to 100)
rdd.sum()
```
@@ -199,3 +231,7 @@ init = tf.global_variables_initializer()
sess.run(init)
sess.run(hello)
```
[sparkr]: https://spark.apache.org/docs/latest/sparkr.html
[sparklyr]: https://spark.rstudio.com/
[spark-conf]: https://spark.apache.org/docs/latest/configuration.html