Merge pull request #112 from parente/fix-python2-path

Set PYSPARK_PYTHON path in python2 kernelspec
Committed by Kyle Kelley on 2016-02-03 08:58:18 -06:00
4 changed files with 32 additions and 43 deletions

View File

@@ -6,6 +6,9 @@ MAINTAINER Jupyter Project <jupyter@googlegroups.com>
USER root
+# Util to help with kernel spec later
+RUN apt-get -y update && apt-get -y install jq
# Spark dependencies
ENV APACHE_SPARK_VERSION 1.5.1
RUN apt-get -y update && \
@@ -90,12 +93,13 @@ RUN conda install --yes \
RUN mkdir -p /opt/conda/share/jupyter/kernels/scala
COPY kernel.json /opt/conda/share/jupyter/kernels/scala/
-USER root
-# Install Python 2 kernel spec globally to avoid permission problems when NB_UID
-# switching at runtime.
-RUN $CONDA_DIR/envs/python2/bin/python \
-    $CONDA_DIR/envs/python2/bin/ipython \
-    kernelspec install-self
-USER jovyan
+# Install Python 2 kernel spec into the Python 3 conda environment which
+# runs the notebook server
+RUN bash -c '. activate python2 && \
+    python -m ipykernel.kernelspec --prefix=$CONDA_DIR && \
+    . deactivate'
+# Set PYSPARK_PYTHON in the python2 spec
+RUN jq --arg v "$CONDA_DIR/envs/python2/bin/python" \
+    '.["env"]["PYSPARK_PYTHON"]=$v' \
+    $CONDA_DIR/share/jupyter/kernels/python2/kernel.json > /tmp/kernel.json && \
+    mv /tmp/kernel.json $CONDA_DIR/share/jupyter/kernels/python2/kernel.json
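
The jq step above injects `PYSPARK_PYTHON` into the `env` block of the installed python2 kernelspec. As an illustrative sanity check (not part of the commit), the rewritten file can be read back from inside the image; this sketch assumes the stock `CONDA_DIR` of `/opt/conda` used by these images:

```python
import json

# Path of the python2 kernelspec installed by the RUN steps above
# (assumes CONDA_DIR=/opt/conda, the default in these images).
spec_path = "/opt/conda/share/jupyter/kernels/python2/kernel.json"

with open(spec_path) as f:
    spec = json.load(f)

# After the jq edit, the env block should point PYSPARK_PYTHON at the
# python2 environment's interpreter, e.g.
# {"PYSPARK_PYTHON": "/opt/conda/envs/python2/bin/python"}
print(spec.get("env"))
```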

View File

@@ -32,7 +32,7 @@ This configuration is nice for using Spark on small, local data.
1. Open a Python 2 or 3 notebook.
2. Create a `SparkContext` configured for local mode.
-For example, the first few cells in a Python 3 notebook might read:
+For example, the first few cells in a notebook might read:
```python
import pyspark
@@ -43,15 +43,6 @@ rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
```
-In a Python 2 notebook, prefix the above with the following code to ensure the local workers use Python 2 as well.
-```python
-import os
-os.environ['PYSPARK_PYTHON'] = 'python2'
-# include pyspark cells from above here ...
-```
### In a R Notebook
0. Run the container as shown above.
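
The two hunks above only show the edges of the README's local-mode example; the middle of the cell falls outside the diff context. A minimal sketch of the complete cell, assuming a `local[*]` master as the surrounding local-mode instructions suggest, would be:

```python
import pyspark

# Create a SparkContext configured for local mode, using all available cores.
sc = pyspark.SparkContext('local[*]')

# Do something to prove it works.
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
```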

View File

@@ -6,6 +6,9 @@ MAINTAINER Jupyter Project <jupyter@googlegroups.com>
USER root
+# Util to help with kernel spec later
+RUN apt-get -y update && apt-get -y install jq
# Spark dependencies
ENV APACHE_SPARK_VERSION 1.5.1
RUN apt-get -y update && \
@@ -52,13 +55,13 @@ RUN conda create -p $CONDA_DIR/envs/python2 python=2.7 \
pyzmq \
&& conda clean -yt
-USER root
-# Install Python 2 kernel spec globally to avoid permission problems when NB_UID
-# switching at runtime.
-RUN $CONDA_DIR/envs/python2/bin/python \
-    $CONDA_DIR/envs/python2/bin/ipython \
-    kernelspec install-self
-USER jovyan
+# Install Python 2 kernel spec into the Python 3 conda environment which
+# runs the notebook server
+RUN bash -c '. activate python2 && \
+    python -m ipykernel.kernelspec --prefix=$CONDA_DIR && \
+    . deactivate'
+# Set PYSPARK_PYTHON in the python2 spec
+RUN jq --arg v "$CONDA_DIR/envs/python2/bin/python" \
+    '.["env"]["PYSPARK_PYTHON"]=$v' \
+    $CONDA_DIR/share/jupyter/kernels/python2/kernel.json > /tmp/kernel.json && \
+    mv /tmp/kernel.json $CONDA_DIR/share/jupyter/kernels/python2/kernel.json

View File

@@ -27,7 +27,7 @@ This configuration is nice for using Spark on small, local data.
2. Open a Python 2 or 3 notebook.
3. Create a `SparkContext` configured for local mode.
-For example, the first few cells in a Python 3 notebook might read:
+For example, the first few cells in the notebook might read:
```python
import pyspark
@@ -38,15 +38,6 @@ rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
```
-In a Python 2 notebook, prefix the above with the following code to ensure the local workers use Python 2 as well.
-```python
-import os
-os.environ['PYSPARK_PYTHON'] = 'python2'
-# include pyspark cells from above here ...
-```
## Connecting to a Spark Cluster on Mesos
This configuration allows your compute cluster to scale with your data.