
Image Specifics
This page provides details about features specific to one or more images.
Apache Spark
The jupyter/pyspark-notebook and jupyter/all-spark-notebook images support the use of Apache Spark in Python, R, and Scala notebooks. The following sections provide some examples of how to get started using them.
Using Spark Local Mode
Spark local mode is useful for experimentation on small data when you do not have a Spark cluster available.
In a Python Notebook
import pyspark
sc = pyspark.SparkContext('local[*]')
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
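The example above uses the RDD API through SparkContext. As a minimal sketch, assuming the Spark 2.x release shipped in these images, the same local-mode check can also be run through the newer SparkSession entry point:
from pyspark.sql import SparkSession
# Build (or reuse) a session against a local master; its sparkContext
# attribute is the same kind of SparkContext created above.
spark = SparkSession.builder.master("local[*]").appName("local-check").getOrCreate()
# do something to prove it works
df = spark.range(1000)
print(df.count())
# release local resources when finished
spark.stop()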
In an R Notebook
library(SparkR)
as <- sparkR.session("local[*]")
# do something to prove it works
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
In a Spylon Kernel Scala Notebook
The Spylon kernel instantiates a SparkContext for you in the variable sc after you configure Spark options in a %%init_spark magic cell.
%%init_spark
# Configure Spark to use a local master
launcher.master = "local[*]"

// Now run Scala code that uses the initialized SparkContext in sc
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
In an Apache Toree Scala Notebook
Apache Toree instantiates a local SparkContext for you in the variable sc when the kernel starts.
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
Connecting to a Spark Cluster on Mesos
This configuration allows your compute cluster to scale with your data.
- Deploy Spark on Mesos.
- Configure each slave with the --no-switch_user flag or create the $NB_USER account on every slave node.
- Run the Docker container with --net=host in a location that is network addressable by all of your Spark workers. (This is a Spark networking requirement.)
  - NOTE: When using --net=host, you must also use the flags --pid=host -e TINI_SUBREAPER=true. See https://github.com/jupyter/docker-stacks/issues/64 for details. A sketch of one way to pass these flags appears after this list.
- Follow the language-specific instructions below.
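Before following the per-language steps, here is a minimal, hypothetical sketch (not taken from the official instructions) of launching the container with the host-networking flags from the note above, expressed with the Docker SDK for Python instead of the docker CLI:
import docker
# Connect to the local Docker daemon (assumes the docker Python SDK is installed).
client = docker.from_env()
# Equivalent of: docker run -d --net=host --pid=host -e TINI_SUBREAPER=true jupyter/all-spark-notebook
container = client.containers.run(
    "jupyter/all-spark-notebook",
    detach=True,
    network_mode="host",              # --net=host: Spark workers must reach the driver directly
    pid_mode="host",                  # --pid=host: required alongside --net=host
    environment={"TINI_SUBREAPER": "true"},
)
print(container.short_id)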
In a Python Notebook
import os
# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
import pyspark
conf = pyspark.SparkConf()
# point to mesos master or zookeeper entry (e.g., zk://10.10.10.10:2181/mesos)
conf.setMaster("mesos://10.10.10.10:5050")
# point to spark binary package in HDFS or on local filesystem on all slave
# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz)
conf.set("spark.executor.uri", "hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz")
# set other options as desired
conf.set("spark.executor.memory", "8g")
conf.set("spark.core.connection.ack.wait.timeout", "1200")
# create the context
sc = pyspark.SparkContext(conf=conf)
# do something to prove it works
rdd = sc.parallelize(range(100000000))
rdd.sumApprox(3)
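Once the context is created, a quick sanity check (a minimal sketch using only standard PySpark calls) is to confirm the master URL and executor settings that were applied, and to stop the context when finished so Mesos can reclaim the executors:
# confirm which master the context is connected to
print(sc.master)
# confirm that the executor settings above were picked up
print(sc.getConf().get("spark.executor.memory"))
# release cluster resources when finished
sc.stop()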
In an R Notebook
library(SparkR)
# Point to mesos master or zookeeper entry (e.g., zk://10.10.10.10:2181/mesos)
# Point to spark binary package in HDFS or on local filesystem on all slave
# nodes (e.g., file:///opt/spark/spark-2.2.0-bin-hadoop2.7.tgz) in sparkEnvir
# Set other options in sparkEnvir
sc <- sparkR.session("mesos://10.10.10.10:5050", sparkEnvir=list(
  spark.executor.uri="hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz",
  spark.executor.memory="8g"
))
# do something to prove it works
data(iris)
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))
In a Spylon Kernel Scala Notebook
%%init_spark
# Configure the location of the mesos master and spark distribution on HDFS
launcher.master = "mesos://10.10.10.10:5050"
launcher.conf.spark.executor.uri = "hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz"

// Now run Scala code that uses the initialized SparkContext in sc
val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)
In an Apache Toree Scala Notebook
The Apache Toree kernel automatically creates a SparkContext
when it starts based on configuration information from its command line arguments and environment variables. You can pass information about your Mesos cluster via the SPARK_OPTS
environment variable when you spawn a container.
For instance, to pass information about a Mesos master, the Spark binary location in HDFS, and executor options, you could start the container like so:
docker run -d -p 8888:8888 -e SPARK_OPTS='--master=mesos://10.10.10.10:5050 \
  --conf spark.executor.uri=hdfs://10.10.10.10/spark/spark-2.2.0-bin-hadoop2.7.tgz \
  --conf spark.executor.memory=8g' jupyter/all-spark-notebook
Note that this is the same information expressed in a notebook in the Python case above. Once the kernel spec has your cluster information, you can test your cluster in an Apache Toree notebook like so:
// should print the value of --master in the kernel spec
println(sc.master)
// do something to prove it works
val rdd = sc.parallelize(0 to 99999999)
rdd.sum()
Connecting to a Spark Cluster in Standalone Mode
Connecting to a Spark cluster in Standalone Mode requires the following steps:
- Verify that the Docker image (check the Dockerfile) and the Spark cluster being deployed run the same version of Spark.
- Deploy Spark in Standalone Mode.
- Run the Docker container with --net=host in a location that is network addressable by all of your Spark workers. (This is a Spark networking requirement.)
  - NOTE: When using --net=host, you must also use the flags --pid=host -e TINI_SUBREAPER=true. See https://github.com/jupyter/docker-stacks/issues/64 for details.
- The language-specific instructions are almost the same as for Mesos above; the only difference is that the master URL now looks something like spark://10.10.10.10:7077 (see the sketch after this list).
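As a minimal sketch of that difference, using a hypothetical standalone master at spark://10.10.10.10:7077, the Python case from the Mesos section becomes:
import pyspark
conf = pyspark.SparkConf()
# point to the standalone master instead of a Mesos master
conf.setMaster("spark://10.10.10.10:7077")
# set other options as desired
conf.set("spark.executor.memory", "8g")
# create the context
sc = pyspark.SparkContext(conf=conf)
# do something to prove it works
rdd = sc.parallelize(range(1000))
print(rdd.sum())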
TensorFlow
The jupyter/tensorflow-notebook image supports the use of TensorFlow in single-machine or distributed mode.
Single Machine Mode
import tensorflow as tf
hello = tf.Variable('Hello World!')
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
sess.run(hello)
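Beyond the hello-world check, the same session style runs any small graph. A minimal sketch, still assuming the TensorFlow 1.x API used above:
import tensorflow as tf
# build a tiny computation graph and evaluate it in a session
a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])
product = tf.matmul(a, b)
sess = tf.Session()
print(sess.run(product))  # [[11.]]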
Distributed Mode
import tensorflow as tf
hello = tf.Variable('Hello Distributed World!')
server = tf.train.Server.create_local_server()
sess = tf.Session(server.target)
init = tf.global_variables_initializer()
sess.run(init)
sess.run(hello)
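The example above starts a single in-process server. As a hedged sketch of what a multi-task setup looks like (still the TensorFlow 1.x API, with hypothetical localhost addresses; in practice each task runs in its own container or host), the cluster can be described explicitly with tf.train.ClusterSpec and operations pinned to specific tasks:
import tensorflow as tf
# Hypothetical two-task cluster on one machine, for illustration only.
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server0 = tf.train.Server(cluster, job_name="local", task_index=0)
server1 = tf.train.Server(cluster, job_name="local", task_index=1)
# pin the variable to the second task
with tf.device("/job:local/task:1"):
    hello = tf.Variable('Hello from task 1!')
# drive the graph through the first task's server
sess = tf.Session(server0.target)
sess.run(tf.global_variables_initializer())
print(sess.run(hello))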