Creating a table to query in Spark using Python
I'm trying to load a file directly from S3 and run SparkSQL on it. Eventually, I plan on bringing in multiple files for multiple tables (a 1:1 mapping between files and tables).
I've been following a tutorial describing each step. I'm a bit stuck on declaring the proper schema, and on which variable, if any, is referred to in the FROM clause of the SQL statement.
Here's the code:
    sqlContext.sql("CREATE TABLE IF NOT EXISTS region (region_id INT, name STRING, comment STRING)")
    region = sc.textFile("s3n://thisisnotabucketname/region.tbl")
    raw_data = sc.textFile("s3n://thisisnotabucketname/region.tbl")
    csv_data = raw_data.map(lambda l: l.split("|"))
    row_data = csv_data.map(lambda p: Row(
        region_id=int(p[0]),
        name=p[1],
        comment=p[2]
    ))
    interactions_df = sqlContext.createDataFrame(row_data)
    interactions_df.registerTempTable("interactions")
    tcp_interactions = sqlContext.sql("""
        SELECT region_id, name, comment FROM region WHERE region_id > 1
    """)
    tcp_interactions = sqlContext.sql("""
        SELECT * FROM region
    """)
    tcp_interactions.show()
And here's the sample data. There is no header row:
    0|africa|lar deposits. blithely final packages cajole. regular waters final requests. regular accounts according |
    1|america|hs use ironic, requests. s|
    2|asia|ges. thinly pinto beans ca|
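As a quick sanity check on the parsing step (a standalone sketch, using one of the sample lines above), `split("|")` produces the positional fields the `Row` constructor indexes into. Note that the trailing `|` on each line yields an extra empty field at the end:

```python
# Assumed sample line from the data above; plain Python, no Spark required.
line = "2|asia|ges. thinly pinto beans ca|"
fields = line.split("|")
print(fields)  # ['2', 'asia', 'ges. thinly pinto beans ca', '']
print(int(fields[0]))  # 2 -- matches int(p[0]) in the map() above
```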
tcp_interactions.show() is returning nothing but the header of region_id|name|comment|. What am I doing incorrectly? In the SQL statement, is region pointing to the region variable declared near the top of the code, or is it pointing at something else?
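To restate the question concretely: does a SQL engine resolve the name in a FROM clause against host-language variables, or against names that were explicitly registered as tables? The following sketch illustrates the general principle using sqlite3 rather than Spark (purely an analogy, not Spark's API): the queryable name is the one the table was registered under, never the Python variable that holds the data.

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")

# `rows` is a Python variable name; it is invisible to the SQL engine.
rows = [(0, "africa"), (1, "america"), (2, "asia")]

# The name registered with the engine is "interactions".
conn.execute("CREATE TABLE interactions (region_id INT, name TEXT)")
conn.executemany("INSERT INTO interactions VALUES (?, ?)", rows)

# Querying by the registered name works:
result = conn.execute(
    "SELECT name FROM interactions WHERE region_id > 1"
).fetchall()
print(result)  # [('asia',)]

# Querying "SELECT * FROM rows" would fail with "no such table: rows",
# because the Python variable name was never registered as a table.
```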