I found a fantastic example among Spark’s bundled examples, avro_inputformat.py, which reads in an Avro file when you launch it with a command that includes this line:
--driver-class-path ./examples/target/scala-2.10/spark-examples-1.3.0-hadoop1.0.4.jar \
As you can see, the spark-examples jar file is added to the driver class path so that all the necessary Java classes can be located. To see where those classes are used, take a look at the code in avro_inputformat.py:
avro_rdd = sc.newAPIHadoopFile(
However, the downside of this approach is clear: it is not pure Python. You have to find the jar file, and you have to launch the job through spark-submit so that the jar ends up on the driver class path. I asked a question here and hope I can find the solution later on.
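For reference, here is a minimal sketch of what a read through sc.newAPIHadoopFile can look like. The helper name read_avro is hypothetical, and the four class names are written as I recall them from the Spark 1.3 example, so treat them as assumptions and check them against avro_inputformat.py in your own checkout. The converter class lives in the spark-examples jar, which is why that jar has to be on the driver class path.

```python
# Class names are assumptions recalled from Spark's avro_inputformat.py example;
# verify them against the copy that ships with your Spark version.
AVRO_INPUT_FORMAT = "org.apache.avro.mapreduce.AvroKeyInputFormat"
AVRO_KEY_CLASS = "org.apache.avro.mapred.AvroKey"
NULL_WRITABLE_CLASS = "org.apache.hadoop.io.NullWritable"
# This converter is the class that comes from the spark-examples jar.
AVRO_KEY_CONVERTER = (
    "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter"
)


def read_avro(sc, path):
    """Hypothetical helper: return an RDD of Avro records read from `path`.

    `sc` is an existing SparkContext started with the examples jar
    on the driver class path.
    """
    avro_rdd = sc.newAPIHadoopFile(
        path,
        AVRO_INPUT_FORMAT,
        AVRO_KEY_CLASS,
        NULL_WRITABLE_CLASS,
        keyConverter=AVRO_KEY_CONVERTER,
    )
    # Each element is a (record, None) pair, so keep only the record.
    return avro_rdd.map(lambda pair: pair[0])
```

Inside a script launched through spark-submit, calling read_avro(sc, path).collect() should give back the records as plain Python dicts, assuming the converter behaves as it does in the bundled example.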
One thought on “Pyspark – read in avro file”
How is the file fantastic? Sir, you never say.