PySpark – read in an Avro file

I found a fantastic example in Spark's examples, avro_inputformat.py, which reads in an Avro file when launched with this command:

./bin/spark-submit \
--driver-class-path ./examples/target/scala-2.10/spark-examples-1.3.0-hadoop1.0.4.jar \
./examples/src/main/python/avro_inputformat.py \
./examples/src/main/resources/users.avro

As you can see, the spark-examples jar is added to the driver class path so that all the necessary Java classes can be located. To see why those classes are needed, take a look at the code in avro_inputformat.py:

avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)
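Because the input format pairs each AvroKey with a NullWritable, the resulting RDD holds (record, null) pairs, and the example script keeps only the first element of each pair (avro_rdd.map(lambda x: x[0]).collect()). A minimal pure-Python illustration of that extraction step, where the list of pairs below is just a hypothetical stand-in for what the collected RDD would contain:

```python
# Stand-in for avro_rdd.collect(): the converter yields (record, None) pairs,
# where each record is a dict deserialized from the Avro file.
# (These two sample records are made up for illustration.)
pairs = [
    ({"name": "Alyssa", "favorite_color": None}, None),
    ({"name": "Ben", "favorite_color": "red"}, None),
]

# Equivalent of avro_rdd.map(lambda x: x[0]).collect() in the example script
records = [k for k, _ in pairs]
for r in records:
    print(r)
```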

However, the downside of this approach is very clear: it is not pure Python. You have to find the jar file, and you have to use spark-submit to set the driver class path. I asked a question here and hope I can find a better solution later on.
