Merge pull request #3 from parente/pyspark

Notebook stack for Python w/ Spark on Mesos
Min RK
2015-08-12 21:50:48 -07:00
3 changed files with 166 additions and 3 deletions

Makefile

@@ -2,11 +2,14 @@
OWNER:=jupyter
STACK?=
ARGS?=
DARGS?=
help:
@echo
@echo 'build STACK=<dirname> - build using Dockerfile in named directory'
@echo ' dev STACK=<dirname> - run container using stack name'
@echo ' build STACK=<dirname> - build using Dockerfile in named directory'
@echo ' dev STACK=<dirname> - run container using stack name'
@echo 'server STACK=<dirname> - run stack container in background'
@echo
build:
@@ -14,4 +17,7 @@ build:
docker build --rm --force-rm -t $(OWNER)/$(STACK) .
dev:
docker run -it --rm -p 8888:8888 $(OWNER)/$(STACK)
@docker run -it --rm -p 8888:8888 $(DARGS) $(OWNER)/$(STACK) $(ARGS)
server:
@docker run -d -p 8888:8888 $(DARGS) $(OWNER)/$(STACK) $(ARGS)

pyspark-notebook/Dockerfile

@@ -0,0 +1,57 @@
# Copyright (c) IPython Development Team.
# (c) Copyright IBM Corp. 2015
FROM jupyter/minimal-notebook
MAINTAINER Jupyter Project <jupyter@googlegroups.com>
USER root
# Spark dependencies
ENV APACHE_SPARK_VERSION 1.4.1
RUN apt-get -y update && \
apt-get install -y --no-install-recommends openjdk-7-jre-headless && \
apt-get clean
RUN wget -qO - http://d3kbcqa49mib13.cloudfront.net/spark-${APACHE_SPARK_VERSION}-bin-hadoop2.6.tgz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s spark-${APACHE_SPARK_VERSION}-bin-hadoop2.6 spark
# Mesos dependencies
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF && \
DISTRO=debian && \
CODENAME=wheezy && \
echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" > /etc/apt/sources.list.d/mesosphere.list && \
apt-get -y update && \
apt-get --no-install-recommends -y --force-yes install mesos=0.22.1-1.0.debian78 && \
apt-get clean
USER jovyan
# Spark and Mesos pointers
ENV SPARK_HOME /usr/local/spark
ENV PYTHONPATH $SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip
ENV MESOS_NATIVE_LIBRARY /usr/local/lib/libmesos.so
# Install Python 3 packages
RUN conda install --yes \
'pandas=0.16*' \
'matplotlib=1.4*' \
'scipy=0.15*' \
'seaborn=0.6*' \
'scikit-learn=0.16*' \
&& conda clean -yt
# Install Python 2 packages and kernel spec
RUN conda create -p $CONDA_DIR/envs/python2 python=2.7 \
'ipython=3.2*' \
'pandas=0.16*' \
'matplotlib=1.4*' \
'scipy=0.15*' \
'seaborn=0.6*' \
'scikit-learn=0.16*' \
pyzmq \
&& conda clean -yt
RUN $CONDA_DIR/envs/python2/bin/python \
$CONDA_DIR/envs/python2/bin/ipython \
kernelspec install-self --user
# Switch back to root so that supervisord runs under that user
USER root
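
As a quick sanity check (a sketch, not part of the image itself), the pointers set by the `ENV` instructions above should be visible from any notebook running in the built container:

```python
import os
import pyspark  # importable because $SPARK_HOME/python and the py4j zip are on PYTHONPATH

# These values come from the ENV instructions in the Dockerfile above
print(os.environ['SPARK_HOME'])            # /usr/local/spark
print(os.environ['MESOS_NATIVE_LIBRARY'])  # /usr/local/lib/libmesos.so
print(pyspark.__file__)                    # resolves under /usr/local/spark/python
```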

pyspark-notebook/README.md

@@ -0,0 +1,100 @@
# Jupyter Notebook Python, Spark, Mesos Stack
## What it Gives You
* Jupyter Notebook server v3.2.x
* Conda Python 3.4.x and Python 2.7.x environments
* pyspark, pandas, matplotlib, scipy, seaborn, scikit-learn pre-installed
* Spark 1.4.1 for use in local mode or to connect to a cluster of Spark workers
* Mesos client 0.22 binary that can communicate with a Mesos master
* Options for HTTPS, password auth, and passwordless `sudo`
## Basic Use
The following command starts a container with the Notebook server listening for HTTP connections on port 8888 without authentication configured.
```
docker run -d -p 8888:8888 jupyter/pyspark-notebook
```
## Using Spark Local Mode
This configuration is nice for using Spark on small, local data.
0. Run the container as shown above.
1. Open a Python 2 or 3 notebook.
2. Create a `SparkContext` configured for local mode.
For example, the first few cells in a Python 3 notebook might read:
```python
import pyspark
sc = pyspark.SparkContext('local[*]')
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
```
In a Python 2 notebook, prefix the above with the following code to ensure the local workers use Python 2 as well.
```python
import os
os.environ['PYSPARK_PYTHON'] = 'python2'
# include pyspark cells from above here ...
```
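Put together, the first cell of a Python 2 notebook might read as follows (a sketch that simply combines the two snippets above):

```python
import os
# make sure the local Spark workers run Python 2 to match the kernel
os.environ['PYSPARK_PYTHON'] = 'python2'

import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
```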
## Connecting to a Spark Cluster on Mesos
This configuration allows your compute cluster to scale with your data.
0. [Deploy Spark on Mesos](http://spark.apache.org/docs/latest/running-on-mesos.html).
1. Ensure Python 2.x and/or 3.x and any Python libraries you wish to use in your Spark lambda functions are installed on your Spark workers.
2. Run the Docker container with `--net=host` in a location that is network addressable by all of your Spark workers. (This is a [Spark networking requirement](http://spark.apache.org/docs/latest/cluster-overview.html#components).)
3. Open a Python 2 or 3 notebook.
4. Create a `SparkConf` instance in a new notebook pointing to your Mesos master node (or Zookeeper instance) and Spark binary package location.
5. Create a `SparkContext` using this configuration.
For example, the first few cells in a Python 3 notebook might read:
```python
import os
# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
import pyspark
conf = pyspark.SparkConf()
# point to mesos master or zookeeper entry (e.g., zk://10.10.10.10:2181/mesos)
conf.setMaster("mesos://10.10.10.10:5050")
# point to spark binary package in HDFS or on local filesystem on all slave
# nodes (e.g., file:///opt/spark/spark-1.4.1-bin-hadoop2.6.tgz)
conf.set("spark.executor.uri", "hdfs://10.122.193.209/spark/spark-1.4.1-bin-hadoop2.6.tgz")
# set other options as desired
conf.set("spark.executor.memory", "8g")
conf.set("spark.core.connection.ack.wait.timeout", "1200")
# create the context
sc = pyspark.SparkContext(conf=conf)
# do something to prove it works
rdd = sc.parallelize(range(100000000))
rdd.sumApprox(3)
```
To use Python 2 in the notebook and on the workers, change the `PYSPARK_PYTHON` environment variable to point to the location of the Python 2.x interpreter binary. If you leave this environment variable unset, it defaults to `python`.
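For example, in a Python 2 notebook you might set the following before building the `SparkConf` shown above (the interpreter path here is only an assumption; use whatever path is valid on your workers):

```python
import os
# assumed location of the Python 2 interpreter on the Spark workers
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python2.7'
```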
Of course, all of this can be hidden in an [IPython kernel startup script](http://ipython.org/ipython-doc/stable/development/config.html?highlight=startup#startup-files), but "explicit is better than implicit." :)
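As a sketch of that approach, a startup file could create the context once per kernel; the file name, master URL, and executor URI below are placeholders, not values shipped with this stack:

```python
# ~/.ipython/profile_default/startup/00-pyspark-setup.py (hypothetical file)
# Runs automatically when a kernel starts and leaves a ready-to-use `sc` behind.
import os
import pyspark

# pick the interpreter the workers should use (placeholder path)
os.environ.setdefault('PYSPARK_PYTHON', '/usr/bin/python3')

conf = pyspark.SparkConf()
conf.setMaster('mesos://10.10.10.10:5050')  # Mesos master or zk:// URL (placeholder)
conf.set('spark.executor.uri',
         'hdfs://10.122.193.209/spark/spark-1.4.1-bin-hadoop2.6.tgz')  # placeholder

sc = pyspark.SparkContext(conf=conf)
```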
## Options
You may customize the execution of the Docker container and the Notebook server it contains with the following optional arguments.
* `-e PASSWORD="YOURPASS"` - Configures Jupyter Notebook to require the given password. Should be combined with `USE_HTTPS` on untrusted networks.
* `-e USE_HTTPS=yes` - Configures Jupyter Notebook to accept encrypted HTTPS connections. If a `pem` file containing an SSL certificate and key is not found at `/home/jovyan/.ipython/profile_default/security/notebook.pem`, the container will generate a self-signed certificate for you.
* `-e GRANT_SUDO=yes` - Gives the `jovyan` user passwordless `sudo` capability. Useful for installing OS packages. **You should only enable `sudo` if you trust the user or if the container is running on an isolated host.**
* `-v /some/host/folder/for/work:/home/jovyan/work` - Mounts a host folder as the default working directory so that your work is preserved even when the container is destroyed and recreated (e.g., during an upgrade).
* `-v /some/host/folder/for/server.pem:/home/jovyan/.ipython/profile_default/security/notebook.pem` - Mounts an SSL certificate and key for `USE_HTTPS`. Useful if you have a real certificate for the domain under which you are running the Notebook server.
* `-e INTERFACE=10.10.10.10` - Configures Jupyter Notebook to listen on the given interface. Defaults to `'*'` (all interfaces), which is appropriate when running with Docker's default bridged networking. When using Docker's `--net=host`, you may wish to use this option to specify a particular network interface.
* `-e PORT=8888` - Configures Jupyter Notebook to listen on the given port. Defaults to 8888, the port exposed in the image's Dockerfile. When using Docker's `--net=host`, you may wish to use this option to specify a particular port.