diff --git a/docs/contributing/recipes.md b/docs/contributing/recipes.md
index 2d26a77b..32998583 100644
--- a/docs/contributing/recipes.md
+++ b/docs/contributing/recipes.md
@@ -1 +1 @@
-# Documented Recipes
\ No newline at end of file
+# New Recipes
diff --git a/docs/index.rst b/docs/index.rst
index 2ce90d0f..f0241eb2 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,6 +31,7 @@ Table of Contents
    using/running
    using/common
    using/specifics
+   using/recipes
 
 .. toctree::
    :maxdepth: 2
diff --git a/docs/using/recipes.md b/docs/using/recipes.md
new file mode 100644
index 00000000..27df32d5
--- /dev/null
+++ b/docs/using/recipes.md
@@ -0,0 +1,256 @@
+# Contributed Recipes
+
+Users sometimes share interesting ways of using the Jupyter Docker Stacks. We encourage users to [contribute these recipes](../contributing/recipes.html) to the documentation in case they prove useful to other members of the community. The sections below capture this knowledge.
+
+## Add RISE
+
+@pdonorio said:
+
+> There is a great repo called [RISE](https://github.com/damianavila/RISE) which allows you, via an extension, to create live slideshows of your notebooks with no conversion, using the JavaScript library Reveal.js.
+
+> I like it a lot, and find myself often adding this feature on top of your official images.
+
+```dockerfile
+# Add live slideshows with RISE
+RUN conda install -c damianavila82 rise
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/43](https://github.com/jupyter/docker-stacks/issues/43), updated 2018-04-22 to use `conda`
+
+## Running behind an nginx proxy
+
+Sometimes it is useful to run the Jupyter instance behind an nginx proxy, for instance:
+
+- you would prefer to access the notebook at a server URL with a path (`https://example.com/jupyter`) rather than a port (`https://example.com:8888`)
+- you may have many different services in addition to Jupyter running on the same server, and want nginx to help improve server performance and manage the connections
+
+Here is a [quick example NGINX configuration](https://gist.github.com/cboettig/8643341bd3c93b62b5c2) to get started. You'll need a server, a `.crt` and `.key` file for your server, and `docker` & `docker-compose` installed. Then just download the files at that gist and run `docker-compose up -d` to test it out. Customize the `nginx.conf` file to set the desired paths and add other services.
+
+## Using spark-packages.org
+
+If you'd like to use packages from spark-packages.org, see [https://gist.github.com/parente/c95fdaba5a9a066efaab](https://gist.github.com/parente/c95fdaba5a9a066efaab) for an example of how to specify the package identifier in the environment before creating a SparkContext.
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/43](https://github.com/jupyter/docker-stacks/issues/43)
+
+## Let's Encrypt a Notebook server
+
+See the README for the simple automation here: [https://github.com/jupyter/docker-stacks/tree/master/examples/make-deploy](https://github.com/jupyter/docker-stacks/tree/master/examples/make-deploy). It includes steps for requesting and renewing a Let's Encrypt certificate.
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/78](https://github.com/jupyter/docker-stacks/issues/78)
+
+## Add Incubating Dashboard, Declarative Widget, and Content Management Extensions
+
+Create a new Dockerfile like the one shown in this gist: [https://gist.github.com/parente/0d735d93cb81a582d635](https://gist.github.com/parente/0d735d93cb81a582d635). Switch the base stack image to whichever one you please (e.g., `FROM jupyter/datascience-notebook`, `FROM jupyter/pyspark-notebook`).
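+
+If you prefer not to start from the gist, here is a minimal sketch of the same idea. Only the dashboards extension below is taken verbatim from the Spark/YARN recipe later on this page; the declarative widgets and content management lines are assumptions that follow the same `pip install` + `quick-setup` pattern, so confirm the exact package names and commands against the gist before using them.
+
+```dockerfile
+# Minimal sketch: swap in whichever base stack image you prefer.
+FROM jupyter/datascience-notebook
+
+# Dashboards extension (same commands used in the Spark/YARN recipe below).
+RUN pip install jupyter_dashboards && \
+    jupyter dashboards quick-setup --sys-prefix
+
+# The other incubating extensions are assumed to follow the same pattern;
+# verify the package names against the gist before uncommenting.
+# RUN pip install jupyter_declarativewidgets jupyter_cms && \
+#     jupyter declarativewidgets quick-setup --sys-prefix && \
+#     jupyter cms quick-setup --sys-prefix
+```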
+
+## Using `pip install` in a Child Docker image
+
+Create a new Dockerfile like the one shown below.
+
+```dockerfile
+# Start from a core stack version
+FROM jupyter/datascience-notebook:9f9e5ca8fe5a
+# Install in the default python3 environment
+RUN pip install 'ggplot==0.6.8'
+```
+
+Then build a new image.
+
+```bash
+docker build --rm -t jupyter/my-datascience-notebook .
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/commit/79169618d571506304934a7b29039085e77db78c#commitcomment-15960081](https://github.com/jupyter/docker-stacks/commit/79169618d571506304934a7b29039085e77db78c#commitcomment-15960081)
+
+## Use with JupyterHub's dockerspawner
+
+@jtyberg contributed [https://github.com/jupyter/docker-stacks/pull/185](https://github.com/jupyter/docker-stacks/pull/185).
+
+Originally, @quanghoc asked:
+
+> How does this [docker-stacks] work with dockerspawner?
+
+@minrk replied:
+
+> ... in most cases for use with DockerSpawner, given any image that already has a notebook stack set up, you would only need to add:
+
+> 1. install the jupyterhub-singleuser script (for the right Python)
+> 2. change the command to launch the single-user server
+
+> Swapping out the `FROM` line in the `jupyterhub/singleuser` Dockerfile should be enough for most cases.
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/124](https://github.com/jupyter/docker-stacks/issues/124)
+
+## Use xgboost
+
+You need to install conda's `gcc` for the Python `xgboost` package to work properly. Otherwise, you'll get an exception about `libgomp.so.1` missing `GOMP_4.0`. Run the installation from a notebook cell (or a terminal in the container):
+
+```
+%%bash
+conda install -y gcc
+pip install xgboost
+```
+
+Then import the package in a separate Python cell:
+
+```python
+import xgboost
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/177](https://github.com/jupyter/docker-stacks/issues/177)
+
+## Using PySpark with AWS S3
+
+```python
+import os
+os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
+
+import pyspark
+sc = pyspark.SparkContext("local[*]")
+
+from pyspark.sql import SQLContext
+sqlContext = SQLContext(sc)
+
+hadoopConf = sc._jsc.hadoopConfiguration()
+myAccessKey = input()
+mySecretKey = input()
+hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
+hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
+hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
+
+df = sqlContext.read.parquet("s3://myBucket/myKey")
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/127](https://github.com/jupyter/docker-stacks/issues/127)
+
+## Using Local Spark JARs
+
+```python
+import os
+os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
+import pyspark
+from pyspark.streaming.kafka import KafkaUtils
+from pyspark.streaming import StreamingContext
+sc = pyspark.SparkContext()
+ssc = StreamingContext(sc, 1)
+broker = ""  # set to your Kafka broker address, e.g. "host:port"
+directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
+directKafkaStream.pprint()
+ssc.start()
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/154](https://github.com/jupyter/docker-stacks/issues/154)
+
+## Host volume mounts and notebook errors
+
+If you are mounting a host directory as `/home/jovyan/work` in your container and you receive permission errors or connection errors when you create a notebook, be sure that the `jovyan` user (UID=1000 by default) has read/write access to the directory on the host. Alternatively, specify the UID of the `jovyan` user on container startup using the `-e NB_UID` option described in the [Common Features, Docker Options section](../using/common.html#Docker-Options).
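+
+For example, here is a sketch of a run command combining both fixes. The host path and UID are illustrative; per the Common Features documentation linked above, `-e NB_UID` only takes effect when the container starts as root.
+
+```bash
+# Mount a host directory and align the container user's UID with the UID that
+# owns that directory on the host (1001 is just an example).
+docker run -it --rm \
+    -p 8888:8888 \
+    --user root \
+    -e NB_UID=1001 \
+    -v /some/host/dir:/home/jovyan/work \
+    jupyter/datascience-notebook
+```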
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/199](https://github.com/jupyter/docker-stacks/issues/199)
+
+## Run JupyterLab
+
+JupyterLab is preinstalled as a notebook extension starting in tag [c33a7dc0eece](https://github.com/jupyter/docker-stacks/wiki/Docker-build-history).
+
+You can try JupyterLab using a command like `docker run -it --rm -p 8888:8888 jupyter/datascience-notebook start.sh jupyter lab`.
+
+## Use `jupyter/all-spark-notebook` with an existing Spark/YARN cluster
+
+Courtesy of @britishbadger:
+
+```dockerfile
+FROM jupyter/all-spark-notebook
+
+# Set env vars for pydoop
+ENV HADOOP_HOME /usr/local/hadoop-2.7.3
+ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
+ENV HADOOP_CONF_HOME /usr/local/hadoop-2.7.3/etc/hadoop
+ENV HADOOP_CONF_DIR /usr/local/hadoop-2.7.3/etc/hadoop
+
+USER root
+# Add the proper open-jdk-8, not just the JRE; needed for pydoop
+RUN echo 'deb http://cdn-fastly.deb.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/jessie-backports.list && \
+    apt-get -y update && \
+    apt-get install --no-install-recommends -t jessie-backports -y openjdk-8-jdk && \
+    rm /etc/apt/sources.list.d/jessie-backports.list && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/* && \
+# Add hadoop binaries
+    wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz && \
+    tar -xvf hadoop-2.7.3.tar.gz -C /usr/local && \
+    chown -R $NB_USER:users /usr/local/hadoop-2.7.3 && \
+    rm -f hadoop-2.7.3.tar.gz && \
+# Install OS dependencies required for pydoop and pyhive
+    apt-get update && \
+    apt-get install --no-install-recommends -y build-essential python-dev libsasl2-dev && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/* && \
+# Remove the example hadoop configs and replace them with those for our cluster.
+# Alternatively this directory could be mounted as a volume.
+    rm -f /usr/local/hadoop-2.7.3/etc/hadoop/*
+
+# Download this from ambari / cloudera manager and copy here
+COPY example-hadoop-conf/ /usr/local/hadoop-2.7.3/etc/hadoop/
+
+# Spark-Submit doesn't work unless I set the following
+RUN echo "spark.driver.extraJavaOptions -Dhdp.version=2.5.3.0-37" >> /usr/local/spark/conf/spark-defaults.conf && \
+    echo "spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.3.0-37" >> /usr/local/spark/conf/spark-defaults.conf && \
+    echo "spark.master=yarn" >> /usr/local/spark/conf/spark-defaults.conf && \
+    echo "spark.hadoop.yarn.timeline-service.enabled=false" >> /usr/local/spark/conf/spark-defaults.conf && \
+    chown -R $NB_USER:users /usr/local/spark/conf/spark-defaults.conf && \
+# Create an alternative HADOOP_CONF_HOME so we can mount as a volume and repoint
+# using the ENV var if needed
+    mkdir -p /etc/hadoop/conf/ && \
+    chown $NB_USER:users /etc/hadoop/conf/
+
+USER $NB_USER
+
+# Install useful jupyter extensions and python libraries like:
+# - Dashboards
+# - PyDoop
+# - PyHive
+RUN pip install jupyter_dashboards faker && \
+    jupyter dashboards quick-setup --sys-prefix && \
+    pip2 install pyhive pydoop thrift sasl thrift_sasl faker
+
+USER root
+# Ensure we overwrite the kernel config so that toree connects to the cluster
+RUN jupyter toree install --sys-prefix --spark_opts="--master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --executor-cores 1 --driver-java-options -Dhdp.version=2.5.3.0-37 --conf spark.hadoop.yarn.timeline-service.enabled=false"
+USER $NB_USER
+```
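+
+A hypothetical build-and-run sequence for this image follows; the image tag, config path, and the `example-hadoop-conf/` directory next to the Dockerfile are all assumptions, not part of the original recipe.
+
+```bash
+# The COPY step above expects example-hadoop-conf/ beside the Dockerfile.
+docker build --rm -t my-yarn-notebook .
+
+# Alternatively, mount the cluster configuration at runtime into the empty
+# /etc/hadoop/conf created above and repoint the Hadoop env vars at it.
+docker run -it --rm -p 8888:8888 \
+    -v /path/to/cluster-hadoop-conf:/etc/hadoop/conf \
+    -e HADOOP_CONF_DIR=/etc/hadoop/conf \
+    -e HADOOP_CONF_HOME=/etc/hadoop/conf \
+    my-yarn-notebook
+```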
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/369](https://github.com/jupyter/docker-stacks/issues/369)
+
+## Use containers with a specific version of JupyterHub
+
+If the version of `jupyterhub` in your image does not match the version of the Hub it connects to, the single-user server may fail to start properly. The fix is to make sure that the same version of `jupyterhub` is installed in your image as in the Hub itself.
+
+In general, this is enough:
+
+```dockerfile
+FROM jupyter/base-notebook:5ded1de07260
+RUN pip install jupyterhub==0.8.0b1
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/423#issuecomment-322767742](https://github.com/jupyter/docker-stacks/issues/423#issuecomment-322767742)
+
+## Add a Python 2.x environment
+
+Python 2.x was removed from all images on August 10th, 2017, starting in tag `cc9feab481f7`. You can add a Python 2.x environment by defining your own Dockerfile inheriting from one of the images like so:
+
+```dockerfile
+# Choose your desired base image
+FROM jupyter/scipy-notebook:latest
+
+# Create a Python 2.x environment using conda, including at least the ipython kernel
+# and the kernda utility. Add any additional packages you want available for use
+# in a Python 2 notebook to the first line here (e.g., pandas, matplotlib, etc.)
+RUN conda create --quiet --yes -p $CONDA_DIR/envs/python2 python=2.7 ipython ipykernel kernda && \
+    conda clean -tipsy
+
+USER root
+
+# Create a global kernelspec in the image and modify it so that it properly activates
+# the python2 conda environment.
+RUN $CONDA_DIR/envs/python2/bin/python -m ipykernel install && \
+    $CONDA_DIR/envs/python2/bin/kernda -o -y /usr/local/share/jupyter/kernels/python2/kernel.json
+
+USER $NB_USER
+```
+
+Ref: [https://github.com/jupyter/docker-stacks/issues/440](https://github.com/jupyter/docker-stacks/issues/440)
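+
+As a quick sanity check (the image tag below is just an example), build the image and list the registered kernelspecs; both `python2` and `python3` should appear.
+
+```bash
+# Build from the Dockerfile above, then list the kernels baked into the image.
+docker build --rm -t my-python2-notebook .
+docker run -it --rm my-python2-notebook jupyter kernelspec list
+```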