PySpark – read in an Avro file

I found a fantastic example among Spark's bundled examples that reads in an Avro file. You run it with this command:

./bin/spark-submit \
--driver-class-path ./examples/target/scala-2.10/spark-examples-1.3.0-hadoop1.0.4.jar \
./examples/src/main/python/ \

As you can see, the command adds the spark-examples jar to the driver class path, so all the necessary Java classes can be located. To see how those classes are used, take a look at the code in the example:

avro_rdd = sc.newAPIHadoopFile(
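For reference, here is a minimal sketch of what a complete call could look like. The class names and the key converter are recalled from Spark's bundled Avro example and should be treated as assumptions to verify against your Spark version. `read_avro` and its parameters are hypothetical names introduced only for illustration.

```python
# Sketch only: these class names are assumptions recalled from Spark's
# bundled Avro example; the converter is expected to live in the
# spark-examples jar, which is why that jar must be on the class path.
AVRO_INPUT_FORMAT = "org.apache.avro.mapreduce.AvroKeyInputFormat"
AVRO_KEY_CLASS = "org.apache.avro.mapred.AvroKey"
NULL_WRITABLE = "org.apache.hadoop.io.NullWritable"
KEY_CONVERTER = (
    "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter"
)


def read_avro(sc, path, reader_schema_json=None):
    """Return an RDD of Avro records (as Python dicts) read from `path`.

    `sc` is an existing SparkContext. `reader_schema_json` is an optional
    Avro schema string passed through the Hadoop configuration.
    """
    conf = None
    if reader_schema_json is not None:
        # Assumed configuration key for a user-supplied reader schema.
        conf = {"avro.schema.input.key": reader_schema_json}

    pairs = sc.newAPIHadoopFile(
        path,
        AVRO_INPUT_FORMAT,
        AVRO_KEY_CLASS,
        NULL_WRITABLE,
        keyConverter=KEY_CONVERTER,
        conf=conf,
    )
    # Each element is a (record, None) pair; keep only the record.
    return pairs.map(lambda kv: kv[0])
```

With a running SparkContext `sc`, something like `read_avro(sc, "users.avro").collect()` would then return the records, provided the jar holding the converter class is on the driver class path.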

However, the downside of this approach is clear: it is not pure Python. You have to find the jar file, and you have to launch through spark-submit so that the driver class path includes it. I asked a question here and hope to find a solution later on.
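Until there is a cleaner answer, the jar hunt can at least be automated. The sketch below assumes the jar sits where the command above expects it (under `examples/target/scala-*/` in the Spark directory). `build_submit_command` is a hypothetical helper, not part of Spark. It does not remove the dependency on the jar or on spark-submit; it only assembles the command.

```python
import glob
import os


def build_submit_command(spark_home, script, script_args=()):
    """Build a spark-submit argument list with the examples jar on the
    driver class path. The jar location pattern is an assumption based
    on a source build of Spark."""
    pattern = os.path.join(
        spark_home, "examples", "target", "scala-*", "spark-examples-*.jar"
    )
    jars = sorted(glob.glob(pattern))
    if not jars:
        raise FileNotFoundError("no spark-examples jar matched " + pattern)

    return [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--driver-class-path",
        jars[-1],  # lexicographically last match if several exist
        script,
    ] + list(script_args)
```

The returned list can be handed to `subprocess.call` to launch the job from a plain Python script.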
