Getting Started with PySpark, Part 8: DataFrame Deletion and SQL Operations

tech · 2022-08-16

1. Deletion Operations

1.1 Dropping columns
df.drop('age').collect()    # drop a column by name
df.drop(df.age).collect()   # drop a column by Column reference
1.2 Dropping rows with null values
df = df.na.drop()                          # drop rows that contain null in any column
df = df.dropna(subset=['col1', 'col2'])    # drop rows where either col1 or col2 is null
1.3 Filling NA values
df.fillna(-1)
1.4 Deduplication
# return a DataFrame with duplicate rows removed
df.distinct()
# deduplicate on specific columns
df.select('Age', 'Gender').dropDuplicates().show()
Aside: converting between pandas and PySpark DataFrames
# Spark DataFrame to pandas DataFrame
pandas_df = spark_df.toPandas()
# pandas DataFrame to Spark DataFrame (note: createDataFrame lives on the
# SparkSession, not the SparkContext)
spark_df = spark.createDataFrame(pandas_df)
Converting between a DataFrame and an RDD
# DataFrame to RDD
rdd_df = df.rdd
# RDD to DataFrame
df = rdd_df.toDF()

2. SQL Operations

2.1 Registering a DataFrame as a SQL temp view
df.createOrReplaceTempView("df")
2.2 SQL queries
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
ss = SparkSession.builder.appName("APP_NAME").config(conf=conf).getOrCreate()
df = ss.sql("SELECT name, age FROM TBL1 WHERE age >= 20 AND age <= 25")

Finally, here is a snippet I found online: it removes from one table the rows that also appear in another table, keeping the rest.

from pyspark.sql import functions

def LeftDeleteRight(test_left, test_right, left_col='user_pin', right_col='user_pin'):
    print('right data process ...')
    columns_right = test_right.columns
    # copy the join key into a helper column and flag every right-hand row with 0
    test_right = test_right.withColumn('user_pin_right', test_right[right_col])
    test_right = test_right.withColumn('notDelete', functions.lit(0))
    # drop all of the right table's original columns
    for col in columns_right:
        test_right = test_right.drop(col)
    # left join: matched rows keep notDelete = 0, unmatched rows get null
    print('rbind left and right data ...')
    test_left = test_left.join(test_right,
                               test_left[left_col] == test_right['user_pin_right'],
                               'left')
    test_left = test_left.fillna(1)               # unmatched rows now have notDelete = 1
    test_left = test_left.where('notDelete = 1')  # keep only rows absent from the right table
    # drop the helper columns
    for col in ['user_pin_right', 'notDelete']:
        test_left = test_left.drop(col)
    return test_left

%time test_left = LeftDeleteRight(test_b, test_a, left_col='user_pin', right_col='user_pin')