data-transform

Responsible for transforming raw data from the Kafka pipeline datasets into the figure data lake.
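
At a high level, a job here reads a raw dataset that the Kafka pipeline has landed, reshapes it, and writes the result out to the data lake. The sketch below only illustrates that shape; the dataset names, paths, and columns are hypothetical and not taken from this repo.

# illustrative only -- names, paths, and columns are made up
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-transform-sketch").getOrCreate()

# Read a raw dataset landed by the Kafka pipeline (hypothetical location/format).
raw = spark.read.json("s3://example-raw-bucket/kafka-pipeline/events/")

# Example transform: derive a date column and keep a few fields.
cleaned = (
    raw.withColumn("event_date", F.to_date(F.col("event_ts")))
       .select("event_id", "event_date", "payload")
)

# Write to the data lake as date-partitioned parquet (hypothetical location).
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-data-lake/events_cleaned/"
)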

Some notes

There are a few things you need to set up if you're going to run Spark locally.

You need Java 8 for Spark

Note: Spark 2.3 needs Java 8; on a newer Java you'll hit errors like the one at the bottom of this page.

Either follow the jenv installation instructions, or do this:

brew tap caskroom/versions
brew install brew-cask-completion
brew install jenv
brew cask install java8
# register the new JDK with jenv, then confirm it shows up
jenv add $(/usr/libexec/java_home -v 1.8)
jenv versions

Add this to your ~/.bash_profile:

if which jenv > /dev/null; then eval "$(jenv init -)"; fi

Install and configure virtualenv

pip install -U virtualenv virtualenvwrapper

Add the virtualenv configuration to your ~/.bash_profile:

export PATH="/usr/local/bin:/usr/local/sbin:$PATH"
export PATH="/usr/local/opt/python@2/libexec/bin:$PATH"

# virtualenv
if [ `id -u` != '0' ]; then
  export VIRTUALENV_USE_DISTRIBUTE=1        # <-- Always use pip/distribute
  export WORKON_HOME=$HOME/.virtualenvs       # <-- Where all virtualenvs will be stored
  source /usr/local/bin/virtualenvwrapper.sh
  export PIP_VIRTUALENV_BASE=$WORKON_HOME
  export PIP_RESPECT_VIRTUALENV=true
fi

Set up a virtualenv for this project

source /usr/local/bin/virtualenvwrapper.sh
mkvirtualenv figuredata
pip install -Ur requirements.txt

Set up local Spark

Install Scala and Spark (note: I haven't done this with brew, but it apparently works):

brew install scala
brew install apache-spark

Note where Spark is installed; you'll need it later. Then, in order for pytest.ini to work, you'll need to symlink /usr/local/spark to your local Spark install. Alternatively, add the correct SPARK_HOME to your pytest.ini and add pytest.ini to your .gitignore. For example, I have this symlink:

sudo ln -s /Users/kolinohi/work/libdir/spark-2.3.1-bin-hadoop2.7 /usr/local/spark
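
If you want a quick sanity check that SPARK_HOME (or the /usr/local/spark symlink) actually points at a usable Spark install, something like the following works; the script and the specific files it checks are just an illustration, not part of this repo.

# check_spark_home.py -- hypothetical helper, not part of this repo
import os

spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")

# A usable Spark install should contain spark-submit and the PySpark sources.
for relpath in ("bin/spark-submit", "python/pyspark"):
    path = os.path.join(spark_home, relpath)
    print("{}: {}".format(path, "ok" if os.path.exists(path) else "MISSING"))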

Add these vars to your ~/.bash_profile (replace the SPARK_HOME and SCALA_HOME values with your actual install locations):

export SPARK_VERS=2.3.1
export SCALA_HOME=/usr/local/Cellar/scala/2.12.6
# Point SPARK_HOME at whichever install you actually use -- a downloaded tarball...
export SPARK_HOME=~/work/libdir/spark-2.3.1-bin-hadoop2.7
# ...or the brew install (keep only one SPARK_HOME line)
# export SPARK_HOME=/usr/local/Cellar/apache-spark/$SPARK_VERS/libexec
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin
export PATH=/usr/local/sbin:$PATH

alias pyspark='jenv local oracle64-1.8.0.172 && pyspark'
source ~/.bash_profile
jenv local oracle64-1.8.0.172
pyspark --master local[*]

Make sure it works

This should return 2 without an error about not finding Hive jars:

spark.createDataFrame([[1, 3], [2, 4]], "a: int, b: int").count()

If you get an error that looks like this, you're likely not using Java 8 (see above):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kolinohi/work/libdir/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 693, in createDataFrame
    jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File "/Users/kolinohi/work/libdir/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/Users/kolinohi/work/libdir/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'
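
A quick way to confirm which Java the Spark JVM actually picked up, from inside the pyspark shell, is to go through the py4j gateway that pyspark exposes as spark.sparkContext._jvm; this is a diagnostic sketch, not something this repo ships.

# Run inside the pyspark shell, where `spark` is already defined.
# For Spark 2.3 the version string should start with "1.8".
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))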