Spark UDF with DataFrame


I am using Spark 1.3. I have a dataset with dates in a column (the ordering_date column) in yyyy/MM/dd format. I want to do some calculations with the dates, and therefore want to use JodaTime for the conversions/formatting. Here is the UDF I have:

val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.formatted(str))

Here is the code where the UDF is called. However, I get an error saying "not applicable". Do I need to register the UDF, or am I missing something here?

val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
)

I don't believe you can pass a DateTimeFormatter as an argument to a UDF. You can only pass in a Column. One solution would be to do:

import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat

val return_date = udf((str: String, format: String) => {
  // parse the date string with the supplied pattern; toString yields an ISO-formatted string
  DateTimeFormat.forPattern(format).parseDateTime(str).toString
})

And then:

import org.apache.spark.sql.functions.lit

val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
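For context, here is a minimal end-to-end sketch of that flow. The toy dataset, app name, and local master are my own illustration, not part of your setup; it assumes the Spark 1.3 SQLContext API:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{udf, lit}
import org.joda.time.format.DateTimeFormat

val sc = new SparkContext(new SparkConf().setAppName("dates").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// toy stand-in for the real users DataFrame
val users = sc.parallelize(Seq(Tuple1("2015/01/31"), Tuple1("2015/02/01"))).toDF("ordering_date")

val return_date = udf((str: String, format: String) =>
  DateTimeFormat.forPattern(format).parseDateTime(str).toString)

users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), lit("yyyy/MM/dd"))
).show()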

Honestly, though -- both yours and the original algorithm have the same problem: they both call forPattern on the yyyy/MM/dd pattern for every record. It would be better to create a singleton object wrapped around a Map[String, DateTimeFormatter], maybe something like this (thoroughly untested, but you get the idea):

import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

object DateFormatters {
  var formatters = Map[String, DateTimeFormatter]()

  def getFormatter(format: String): DateTimeFormatter = {
    // compile the pattern only the first time it is seen
    if (formatters.get(format).isEmpty) {
      formatters = formatters + (format -> DateTimeFormat.forPattern(format))
    }
    formatters.get(format).get
  }
}

Then you would change the UDF to:

val return_date = udf((str: String, format: String) => {
  DateFormatters.getFormatter(format).parseDateTime(str).toString
})

That way, DateTimeFormat.forPattern(...) is only called once per format per executor, instead of once per record.
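One caveat worth flagging: tasks on an executor can run in multiple threads, and a plain var holding an immutable Map is not thread-safe. If that matters for your job, a concurrent map makes the cache safe. Here is a minimal sketch of that variant using scala.collection.concurrent.TrieMap (my own adaptation, not part of the original suggestion):

import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}
import scala.collection.concurrent.TrieMap

object DateFormatters {
  // thread-safe cache: each pattern is compiled at most a handful of times per JVM
  private val formatters = TrieMap[String, DateTimeFormatter]()

  def getFormatter(format: String): DateTimeFormatter =
    formatters.getOrElseUpdate(format, DateTimeFormat.forPattern(format))
}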

One thing to note about the singleton object solution: you can't define the object in spark-shell -- you have to pack it into a jar file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.
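For example, assuming you've packaged the object into a jar (the jar name here is hypothetical):

spark-shell --jars date-formatters.jar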

