python - How to filter a Spark DataFrame by a boolean column
I created a DataFrame that has the following schema:
In [43]: yelp_df.printSchema()
root
 |-- business_id: string (nullable = true)
 |-- cool: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- stars: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- type: string (nullable = true)
 |-- useful: integer (nullable = true)
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- full_address: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- neighborhoods: string (nullable = true)
 |-- open: boolean (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- state: string (nullable = true)
Now I want to select the records whose "open" column is "true". As shown below, lots of them are "open".
business_id  cool  date  funny  id  stars  text  type  useful  user_id  name  full_address  latitude  longitude  neighborhoods  open  review_count  state
9ykzy9papeippouje...  2  2011-01-26  0  fwkvx83p0-ka4js3d...  4  wife took me h...  business  5  rltl8zkdx5vh5nax9...  morning glory cafe  6106 s 32nd st ph...  33.3907928467  -112.012504578  []  true  116  az
zrjwvlyzejq1vaihd...  0  2011-07-27  0  ijz33sjrzxqu-0x6u...  4  have no idea wh...  business  0  0a2kyel0d3yb1v6ai...  spinato's pizzeria  4848 e chandler b...  33.305606842  -111.978759766  []  true  102  az
6orac4uyjcsjl1x0w...  0  2012-06-14  0  ieslbzqucldszsqm0...  4  love gyro pla...  business  1  0ht2ktfliobpvh6cd...  haji-baba  1513 e apache bl...  33.4143447876  -111.913032532  []  true  265  az
_1qqzuf4zzoyfcvxc...  1  2010-05-27  0  g-wvgaisbqqamhlnn...  4  rosie, dakota, an...  business  2  uzetl9t0ncrogoyff...  chaparral dog park  5401 n hayden rd ...  33.5229454041  -111.90788269  []  true  88  az
6ozycu1rpktng2-1b...  0  2012-01-05  0  1ujfq2r5qfjg_6exm...  4  general manager s...  business  0  vymm4ktsc8zfqbg-j...  discount tire  1357 s power road...  33.3910255432  -111.68447876  []  true  5  az
However, the following command, run in PySpark, returns nothing:
yelp_df.filter(yelp_df["open"] == "true").collect()
What is the right way to do it?
You're comparing data types incorrectly. open is listed as a boolean value, not a string, so doing yelp_df["open"] == "true" is incorrect: "true" is a string.
Instead, you want to do
yelp_df.filter(yelp_df["open"] == True).collect()
This correctly compares the values of open against the boolean primitive True, rather than the non-boolean string "true".