hadoop - complex Hive Query


Hi, I have the following table:

    id   | time
    ==============
    5    | 200101
    3    | 200102
    2    | 200103
    12   | 200101
    16   | 200103
    18   | 200106

Now I want to know how often each month of the year appears. I can't simply use GROUP BY, because it only counts the months that actually appear in the table. I want a count of 0 when a month does not appear. The output should look like this:

    time   | count
    ================
    200101 | 2
    200102 | 1
    200103 | 2
    200104 | 0
    200105 | 0
    200106 | 1

Sorry for the bad table format; I hope it is still clear what I mean. I would appreciate any help.

You can provide a year-month table containing the year and month information. I wrote a script to generate such a CSV file:

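    #!/bin/bash
    # year_month.sh

    start_year=1970
    end_year=2015

    # Emit one YYYYMM value per line for every month in the year range.
    for year in $( seq ${start_year} ${end_year} ); do
        for month in $( seq 1 12 ); do
            # awk zero-pads the month to two digits (1 -> 01)
            echo ${year}$( echo ${month} | awk '{printf("%02d\n", $1)}');
        done;
    done > year_month.csv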

Save it as year_month.sh and run it. You will get a file year_month.csv containing every year and month from 1970 to 2015. You can change start_year and end_year to specify the year range.
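As a variation on the script above, here is a minimal sketch that does the zero-padding with printf instead of awk and takes the year range as arguments. The function name gen_year_month and the output file year_month_2001.csv are made up for illustration; they are not part of the original script.

```shell
#!/bin/bash
# Sketch: print YYYYMM values for every month between two years (inclusive).
# gen_year_month is a hypothetical helper name, not part of the original script.
gen_year_month() {
    local start_year=$1 end_year=$2 year month
    for year in $(seq "${start_year}" "${end_year}"); do
        for month in $(seq 1 12); do
            # %02d zero-pads the month, so 1 becomes 01
            printf '%d%02d\n' "${year}" "${month}"
        done
    done
}

# Example: only the year used in the question (file name is an assumption)
gen_year_month 2001 2001 > year_month_2001.csv
```

Restricting the range to the years that actually occur in your data keeps the result of the later join small, since every month in the file becomes one output row.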

Then upload the year_month.csv file to HDFS. For example:

    hadoop fs -mkdir /user/joe/year_month
    hadoop fs -put year_month.csv /user/joe/year_month/

After that, you can load year_month.csv into Hive. For example:

    create external table if not exists year_month (time int)
    location '/user/joe/year_month';

Finally, you can join the new table with your table to get the final result. For example, assuming your table is called id_time:

    from (select year_month.time as time, id_time.id as id
          from year_month
          left outer join id_time
          on year_month.time = id_time.time) temp
    select time, count(id) as count
    group by time;

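Note: you may need to make small modifications (such as the path or the column type) to the above statements. The left outer join keeps every month from year_month, and count(id) ignores the NULL ids of months without a match, which is what produces the 0 rows.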
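If you want to sanity-check the expected result without a cluster, the same "keep every month, count the matches" logic can be sketched locally with awk. This is only an illustration of the join-and-count idea, not a replacement for the Hive query; the file names id_time.csv and year_month_sample.csv and the function zero_filled_counts are assumptions made up for this example.

```shell
#!/bin/bash
# Hypothetical local check; file and function names are made up for illustration.

# Sample rows from the question, as "id,time".
cat > id_time.csv <<'EOF'
5,200101
3,200102
2,200103
12,200101
16,200103
18,200106
EOF

# Calendar of months to report on (here only 200101..200106).
seq 200101 200106 > year_month_sample.csv

# First file (data): count rows per month.
# Second file (calendar): print every month, defaulting the count to 0.
zero_filled_counts() {
    awk -F, 'NR == FNR { n[$2]++; next } { print $1 "," (n[$1] + 0) }' "$1" "$2"
}

zero_filled_counts id_time.csv year_month_sample.csv
```

With the sample rows, 200104 and 200105 come out with a count of 0, and 200103 is counted twice because it appears in two rows (ids 2 and 16).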

