Spark UDF with a DataFrame
I am using Spark 1.3. I have a dataset with dates in a column (an ordering_date column) in yyyy/MM/dd format. I want to do calculations with the dates and therefore want to use Joda-Time for the conversions/formatting. Here is the UDF I have:
    import org.apache.spark.sql.functions.udf
    import org.joda.time.format.DateTimeFormatter

    val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.parseDateTime(str))

And here is the code where the UDF is being called. However, I get an error saying "Not applicable". Do I need to register the UDF, or am I missing something here?
    val user_with_dates_formatted = users.withColumn(
      "formatted_date",
      return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
    )
I don't believe you can pass a DateTimeFormatter as an argument to a UDF. You can only pass in a Column. One solution would be to do:
    import org.apache.spark.sql.functions.{udf, lit}
    import org.joda.time.format.DateTimeFormat

    val return_date = udf((str: String, format: String) =>
      // return a java.sql.Timestamp so Spark can infer the column type
      new java.sql.Timestamp(DateTimeFormat.forPattern(format).parseDateTime(str).getMillis)
    )

and then:
    val user_with_dates_formatted = users.withColumn(
      "formatted_date",
      return_date(users("ordering_date"), lit("yyyy/MM/dd"))
    )

Honestly, though -- both this and the original algorithm have the same problem: they re-parse the yyyy/MM/dd pattern with forPattern for every record. It would be better to create a singleton object wrapped around a Map[String, DateTimeFormatter], maybe something like this (thoroughly untested, just the idea):
    import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

    object DateFormatters {
      var formatters = Map[String, DateTimeFormatter]()

      def getFormatter(format: String): DateTimeFormatter = {
        if (formatters.get(format).isEmpty) {
          formatters = formatters + (format -> DateTimeFormat.forPattern(format))
        }
        formatters.get(format).get
      }
    }

Then change the UDF to:
    val return_date = udf((str: String, format: String) =>
      new java.sql.Timestamp(DateFormatters.getFormatter(format).parseDateTime(str).getMillis)
    )

That way, DateTimeFormat.forPattern(...) only gets called once per format per executor.
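One caveat worth flagging (my addition, not part of the answer above): executors run tasks on multiple threads, and updating a plain var Map from concurrent tasks is not thread-safe. A minimal sketch of the same cache built on java.util.concurrent.ConcurrentHashMap instead:

    import java.util.concurrent.ConcurrentHashMap
    import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

    object DateFormatters {
      // DateTimeFormatter is immutable and thread-safe, so cached instances
      // can be shared across tasks; the ConcurrentHashMap protects the cache itself
      private val formatters = new ConcurrentHashMap[String, DateTimeFormatter]()

      def getFormatter(format: String): DateTimeFormatter = {
        val cached = formatters.get(format)
        if (cached != null) cached
        else {
          // under a race forPattern may run twice; putIfAbsent keeps the first result
          formatters.putIfAbsent(format, DateTimeFormat.forPattern(format))
          formatters.get(format)
        }
      }
    }

The worst case with the var Map version is a redundant forPattern call or a lost cache update, so it mostly works in practice; this variant just makes the behavior well-defined.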
One thing to note about the singleton object solution: you can't define the object in the spark-shell -- you have to pack it into a JAR file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.
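If packing a JAR is inconvenient, a shell-friendly alternative (my sketch, not from the answer above) is to skip the singleton entirely and build the formatter once per partition with mapPartitions. Note that this produces an RDD of parsed dates rather than a new DataFrame column, so it is only a partial substitute:

    import org.joda.time.format.DateTimeFormat

    val parsed = users.select("ordering_date").rdd.mapPartitions { rows =>
      val dtf = DateTimeFormat.forPattern("yyyy/MM/dd")  // built once per partition
      rows.map(row => dtf.parseDateTime(row.getString(0)))
    }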