Feb 25, 2016

Spark, Python and Windows

Yetxtit may not be common, but if you want a quick start for a Windows guy on the hotest Big Data platform around, you will find this tutorial relevant for you:

Get the Needed Prerequistes Software
git (for building Spark)
Install and verify git was added to your path

Python 3.5.1
Or even better, try the Anaconda version

Define Java
Get Java 8 (Oracle JDK):

Set JAVA_HOME environment variable to c:\Program Files\Java\jdk1.8.0_73\ (don't use double quotes if you are used to it)

Prepare SBT
Get SBT 0.13.11 (for building spark compilation)

Spark build is huge, therefore we need to increase the memory limit of the sbt from 256MB to something larger (2048MB for example, but saw some cases where 6GB were needed). This can be performed by modify the sbt runner file (C:\Program Files (x86)\sbt\bin\sbt.bat):
"%_JAVACMD%" %_JAVA_OPTS% %SBT_OPTS% -Xmx2048m -Xms2048m -cp "%SBT_HOME%sbt-launch.jar" xsbt.boot.Boot %*

Or better, change these settings at C:\Program Files (x86)\sbt\conf\sbtconfig.txt

Buid Spark
Download Spark 1.6 (No need to intall now, see details below)
  1. Extract the source using 7zip (or anything else that can use tgz files) for example to C:\Spark
  2. Build spark:
    1. cd C:\Spark
    2. "c:\Program Files (x86)\sbt\bin\sbt.bat" package
    3. "c:\Program Files (x86)\sbt\bin\sbt.bat" assembly
Get Winutils
Create Hadoop folder (for example c:\hadoop)
Define HADOOP_HOME environment variable (c:\hadoop)
Download to c:\hadoop\bin the winutils.exe file

Run It Like a Pro
Create a test file c:\spark\verify.txt and enter to it the following text: this is a trial
Run pyspark:
> c:\spark\bin\pyspark
In pyspark enter the following code:
text_file = sc.textFile("file://c:/spark/verify.txt")
The file should be printed

Keep Performing,
Moshe Kaplan


Intense Debate Comments

Ratings and Recommendations