Apache Spark – Scala

A colleague shared a book he bought with me – “Machine Learning with Spark“. I read part of the first chapter and feel pretty good about it. I think it is definitely a hard book for readers without much programming experience; the author probably assumes you already know at least one of the three languages (Java, Python, Scala) just to follow along.

I have been doing some Spark programming in Python, and today I read a few examples written in Scala. The syntax is extremely simple and similar to Python. I have also heard from some people that Scala code runs much faster than PySpark in most cases. Here are a few things that were new to me:

  1. use “val” whenever you create a new variable
  2. => defines an anonymous function
  3. map{ case (a,b,c) => (b,c) }, where case pattern-matches on the tuple and its body can contain a block of code
  4. .reduceByKey(_ + _) combines the values for each key with the anonymous function _ + _
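Since the syntax maps so closely to Python, the last two idioms can be sketched in plain Python without Spark at all – the data below is made up, and groupby + reduce here only imitate what reduceByKey does across a cluster:

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

pairs = [("a", 1), ("b", 2), ("a", 3)]

# Scala's  map { case (k, v) => (k, v * 2) }  becomes tuple unpacking:
doubled = [(k, v * 2) for k, v in pairs]

# Scala's  .reduceByKey(_ + _)  sketched locally: group by key, then
# fold each group's values with an anonymous function (lambda plays "=>"):
by_key = groupby(sorted(doubled, key=itemgetter(0)), key=itemgetter(0))
summed = {k: reduce(lambda a, b: a + b, (v for _, v in group))
          for k, group in by_key}

print(summed)  # {'a': 8, 'b': 4}
```

Note the sort before groupby – unlike Spark's shuffle, itertools.groupby only groups adjacent elements.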

PySpark – read in an avro file

I found a fantastic example in Spark’s examples called avro_inputformat.py, where you can read in an avro file by running this command:

./bin/spark-submit \
--driver-class-path ./examples/target/scala-2.10/spark-examples-1.3.0-hadoop1.0.4.jar \
./examples/src/main/python/avro_inputformat.py \
./examples/src/main/resources/users.avro

As you can see, you add the spark-examples-hadoop jar file to the driver-class-path so that all the necessary Java classes can be located correctly. To see why, take a look at the code in avro_inputformat.py:

avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)

However, the downside of this approach is very clear: it is not pure Python. You have to find the jar file, and you have to use spark-submit to include the driver-class-path. I asked a question here and hope I can find a solution later on.

Linux – Export

I think if you are a frequent Linux or Mac user, you have probably seen or used the export command a lot. My impression of export used to be very simple: it creates a variable to hold a certain value. For example, in the big data world, applications constantly need the HADOOP_HOME variable defined and pointing to the Hadoop directory; another very frequently used one is JAVA_HOME. After reading an article from linuxcareer, I have a much better understanding of what it actually does.

That article briefly describes the concept of child vs. parent processes. They mention that any process can be a parent and a child process at the same time, with the exception of the “init” process, which is always marked with PID 1. init therefore turns out to be the parent of all processes on your Linux system.

The ultimate definition of export:

“In general, the export command marks an environment variable to be exported with any newly forked child processes and thus it allows a child process to inherit all marked variables.”

A few take-aways from the article:

(1) $$ will get you the current bash process ID.

(2) export -n will remove the variable from the export list.

(3) export -f will export a function.
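The inheritance behavior described in that definition can be demonstrated from Python as well – os.environ entries behave like exported shell variables in that forked children inherit them. A small sketch (the HADOOP_HOME path here is hypothetical):

```python
import os
import subprocess
import sys

# Setting os.environ is the analogue of "export VAR=value":
# the variable is marked for inheritance by any child process we fork.
os.environ["HADOOP_HOME"] = "/opt/hadoop"  # hypothetical path

# Fork a child and ask it what it sees in its own environment.
child_view = subprocess.check_output(
    [sys.executable, "-c",
     "import os; print(os.environ.get('HADOOP_HOME', 'unset'))"]
).decode().strip()

print(child_view)   # the child inherited the exported variable
print(os.getpid())  # the shell analogue of $$
```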

 

FreeCAD – best open source model development environment

CAD is short for computer-aided design. While I was in school studying engineering, professors always recommended some sort of CAD software for developing 3D models. Recently, I heard that 3D printers can import 3D models and print them into a product. I am a big fan of DOTA2, and all the characters there have corresponding 3D models. For example, here you can download the model for the hero “Juggernaut”; however, there are three different types of models that you can download.

FBX: FBX (Filmbox) is a proprietary file format (.fbx) developed by Kaydara and owned by Autodesk since 2006. It is used to provide interoperability between digital content creation applications. FBX is also part of Autodesk Gameware, a series of video game middleware.

MAYA: a very popular software package for 3D model development. Very expensive too ($1,300+).

SMD: a format developed by Valve, the company that built DOTA2.

While I was doing some research on open source software to play with 3D design, I came across FreeCAD.

freecad_ide

However, FreeCAD is not omnipotent, and I had to convert the FBX format into STL format to be able to load it into FreeCAD. I am using an online converter, greentoken, but feel free to leave a comment if you have a better solution or happen to know how to load FBX, MA, or SMD directly into FreeCAD.

Here is how it looks in the end:

freecad_jug

Running Spark Locally on MacOS

I want to install Spark locally so that I can easily try out the SparkR package without worrying about breaking the team cluster.

There are a few things that you need to know and prepare beforehand:

  1. Spark was originally written in Scala, which needs the JVM, so be ready to set up a Java development environment; you also need the Scala interpreter if you want to build everything from source code.
  2. I installed Spark from source using Maven. I am not a frequent Java developer, so my Maven version was out of date (3.0.3), while Spark requires at least 3.0.4. I followed this tutorial, downloaded the latest Maven, and changed the symlink at /usr/bin/mvn to point to it.
  3. Spark is not a small application, which means you may need to raise your JVM memory limit. It gave me a “memory error” a few times until I closed most of my apps and increased the Java heap size.
  4. There is a fantastic series of videos, one of which really walks through the installation in detail.
  5. IT IS GOING TO TAKE A LONG TIME.

Overall, the setup was pretty easy, probably because I had already installed Scala and built some other projects from source before, so the environment was almost ready. The build command `mvn -DskipTests clean package` took quite a while to download all the dependencies, and the whole process took about two hours to finish after I changed the Java heap size to 1GB.

For later reference, here is the list of projects printed to the screen in build order while compiling. It was quite interesting to see that Spark is already well connected with the whole Hadoop ecosystem.

$ mvn -DskipTests clean package
..

Found 0 errors
Found 0 warnings
Found 0 infos
Finished in 1 ms
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ……………………… SUCCESS [ 4.261 s]
[INFO] Spark Project Networking ……………………… SUCCESS [ 10.123 s]
[INFO] Spark Project Shuffle Streaming Service ………… SUCCESS [ 5.928 s]
[INFO] Spark Project Core …………………………… SUCCESS [03:23 min]
[INFO] Spark Project Bagel ………………………….. SUCCESS [ 20.073 s]
[INFO] Spark Project GraphX …………………………. SUCCESS [05:43 min]
[INFO] Spark Project Streaming ………………………. SUCCESS [08:31 min]
[INFO] Spark Project Catalyst ……………………….. SUCCESS [11:37 min]
[INFO] Spark Project SQL ……………………………. SUCCESS [15:13 min]
[INFO] Spark Project ML Library ……………………… SUCCESS [20:38 min]
[INFO] Spark Project Tools ………………………….. SUCCESS [ 39.707 s]
[INFO] Spark Project Hive …………………………… SUCCESS [14:10 min]
[INFO] Spark Project REPL …………………………… SUCCESS [03:11 min]
[INFO] Spark Project Assembly ……………………….. SUCCESS [06:22 min]
[INFO] Spark Project External Twitter ………………… SUCCESS [ 36.120 s]
[INFO] Spark Project External Flume Sink ……………… SUCCESS [ 45.845 s]
[INFO] Spark Project External Flume ………………….. SUCCESS [01:17 min]
[INFO] Spark Project External MQTT …………………… SUCCESS [ 43.300 s]
[INFO] Spark Project External ZeroMQ …………………. SUCCESS [ 39.041 s]
[INFO] Spark Project External Kafka ………………….. SUCCESS [01:59 min]
[INFO] Spark Project Examples ……………………….. SUCCESS [11:35 min]
[INFO] Spark Project External Kafka Assembly ………….. SUCCESS [01:21 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:49 h
[INFO] Finished at: 2015-04-17T01:14:42-05:00
[INFO] Final Memory: 77M/829M
[INFO] ------------------------------------------------------------------------

After two hours, it took me 1.5 minutes to count a 700MB file of 75 million rows.

(it took almost a minute to count using ls and 13 seconds using the Python standard library… :( )
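For reference, the plain-Python counting approach mentioned above can be sketched like this – the tiny temp file here is a made-up stand-in for the real 700MB dataset:

```python
import os
import tempfile

# Write a small stand-in file (the real dataset was ~700MB / 75M rows).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("some,row,data\n" * 1000)
    path = f.name

# Count lines lazily so the whole file never sits in memory at once.
with open(path) as fh:
    n_rows = sum(1 for _ in fh)

os.remove(path)
print(n_rows)  # 1000
```

The generator expression streams the file line by line, which is what keeps the standard-library approach competitive on files far bigger than RAM.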

TODO: performance tuning.

localspark

Annual Percentage Rate: finally solved the equation, thanks to Wikipedia

Thanks to Wikipedia, I learned there is a recursive relation between consecutive loan balances: P(n) = P(n-1) + P(n-1) * r – c = AmountOwed + Interest – Payment. This is totally different from my assumption that we split the principal equally month by month, calculate the interest on that, and distribute it evenly to get a fixed monthly payment. Once we are sure about the equation, it is just a matter of solving the series problem. I am not kidding: this is a problem that could easily be solved by a high school student if you received your education in China. Here is a hand calculation with some math tricks, a bit different from the one in Wikipedia, but it works. 🙂

apr_handcalculation
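The recursion and its closed-form solution can also be checked numerically. A minimal sketch – the loan figures below are made up for illustration:

```python
def monthly_payment(principal, r, n):
    # Closed-form fixed payment c obtained by summing the geometric
    # series behind the recursion: c = P0 * r / (1 - (1 + r) ** -n)
    return principal * r / (1 - (1 + r) ** -n)

def balance_after(principal, r, c, n):
    # Apply the recursion P(k) = P(k-1) + P(k-1) * r - c for n periods.
    p = principal
    for _ in range(n):
        p = p + p * r - c
    return p

# Hypothetical loan: $10,000 at 1% per month over 12 months.
c = monthly_payment(10_000, 0.01, 12)
remaining = balance_after(10_000, 0.01, c, 12)
print(round(c, 2), round(remaining, 10))
```

If the closed form is right, iterating the recursion with that fixed payment drives the balance to zero (up to floating-point error) after the last period.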