Spark UDF with DataFrame
I am using Spark 1.3. I have a dataset with dates in a column (the ordering_date column) in yyyy/MM/dd format. I want to do calculations with the dates, and therefore want to use Joda-Time for the conversions/formatting. Here is the UDF I have:
    val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.formatted(str))
Here is the code where the UDF is being called. However, I get an error saying "Not applicable". Do I need to register the UDF, or am I missing something here?
    val user_with_dates_formatted = users.withColumn(
      "formatted_date",
      return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
    )
I don't believe you can pass a DateTimeFormatter as an argument to a UDF. You can only pass in a Column. One solution would be to do:
    // parse the string and convert to java.sql.Timestamp so Spark SQL can handle the result
    val return_date = udf((str: String, format: String) =>
      new java.sql.Timestamp(DateTimeFormat.forPattern(format).parseDateTime(str).getMillis))
And then:
    val user_with_dates_formatted = users.withColumn(
      "formatted_date",
      return_date(users("ordering_date"), lit("yyyy/MM/dd"))
    )
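For reference, both snippets assume imports along these lines (a minimal sketch; users stands in for whatever DataFrame you already have):

    import org.apache.spark.sql.functions.{udf, lit}
    import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}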
Honestly, though -- both this and your original algorithm have the same problem: they parse the yyyy/MM/dd pattern with forPattern for every record. It would be better to create a singleton object wrapped around a Map[String, DateTimeFormatter], maybe something like this (thoroughly untested, but you get the idea):
    object DateFormatters {
      var formatters = Map[String, DateTimeFormatter]()

      def getFormatter(format: String): DateTimeFormatter = {
        if (formatters.get(format).isEmpty) {
          formatters = formatters + (format -> DateTimeFormat.forPattern(format))
        }
        formatters.get(format).get
      }
    }
Then change the UDF to:
    val return_date = udf((str: String, format: String) =>
      new java.sql.Timestamp(DateFormatters.getFormatter(format).parseDateTime(str).getMillis))
That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
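As an aside, Joda's DateTimeFormatter instances are thread-safe, so caching them is fine; if you are worried about concurrent tasks on the same executor racing on that var, one equally untested alternative is a concurrent map:

    import scala.collection.concurrent.TrieMap
    import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

    object DateFormatters {
      // getOrElseUpdate on a TrieMap avoids the read-then-write race on the plain var
      private val formatters = TrieMap[String, DateTimeFormatter]()

      def getFormatter(format: String): DateTimeFormatter =
        formatters.getOrElseUpdate(format, DateTimeFormat.forPattern(format))
    }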
One thing to note about the singleton-object solution: you can't define the object in spark-shell -- you have to pack it into a JAR file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.
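For example, assuming you have packaged the object into a (hypothetical) formatters.jar:

    spark-shell --jars formatters.jar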