Hive – Variable Substitution

I am planning to write a query that pull relevant rows using the `in` keyword where the list is pretty long, like 20 different elements, what is even worse, I need to run this query against many different tables.

The query will look like this:

create table mytable as 
select * from (
select * from table1 where mycolumn in (value1, value2, value3, value4 ..)
union all
select * from table2 where mycolumn in (value1, value2, value3, value4 ..)
union all
select * from tablen where mycolumn in (value1, value2, value3, value4 ..)
union all
... ) unionresult

Clearly, a professional developer will start thinking about how to optimize the query to remove the highly repetitive syntax. Clearly, there is a feature in Hive called “variable substitution” that will help me.

set mylist = (value1, value2, value3, value4 ..)
create table mytable as 
select * from (
select * from table1 where mycolumn in ${hiveconf:mylist}
union all
select * from table2 where mycolumn in ${hiveconf:mylist}
union all
select * from tablen where mycolumn in ${hiveconf:mylist}
union all
... ) unionresult

Clearly, this approach will help whenever you think you need a variable. 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s