Pitfalls encountered with PySpark


Problem 1: All masters are unresponsive! Giving up.

Problem 1 — how the job was submitted: spark-submit connectedComponentAnalysis.py --master yarn --deploy-mode cluster --executor-memory 3g --num-executors 10

Problem 1 — Spark configuration in the code:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setAppName("cca")
sc = SparkContext(conf=conf)  # conf must be passed as a keyword argument

Solution:

Set the master to yarn in the code as well:

conf = SparkConf()
conf.setAppName("cca").setMaster("yarn")
sc = SparkContext(conf=conf)  # conf must be passed as a keyword argument

Reason: not entirely clear. A likely explanation is that spark-submit treats everything after the application file as arguments to the application itself, so the options placed after connectedComponentAnalysis.py (including --master yarn) never reach Spark, and the job falls back to the default master URL, which it then cannot reach.
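If that is indeed the cause, an alternative to hard-coding the master in the code is to keep the options in front of the application file when submitting. A sketch with the same flags as above:

spark-submit --master yarn --deploy-mode cluster --executor-memory 3g --num-executors 10 connectedComponentAnalysis.py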

Problem 2: Error from python worker: /bin/python. No module named pyspark.

Problem 2 — code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAppName("cca").setMaster("yarn")
sc = SparkContext(conf=conf)  # conf must be passed as a keyword argument
spark = SparkSession(sc)

Solution: add the spark.yarn.dist.files setting to the SparkConf so that pyspark.zip and the py4j zip are shipped to each executor's working directory, and set PYTHONPATH for the executors so that the worker Python can import them:

conf = SparkConf()
conf.setAppName("cca")\
    .setMaster("yarn")\
    .set("spark.yarn.dist.files",
         "/home/xindun/BDML/spark-2.3.1/python/lib/pyspark.zip,/home/xindun/BDML/spark-2.3.1/python/lib/py4j-0.10.7-src.zip")\
    .setExecutorEnv("PYTHONPATH", "pyspark.zip:py4j-0.10.7-src.zip")
sc = SparkContext(conf=conf)  # conf must be passed as a keyword argument
spark = SparkSession(sc)

Problem 3: Exception: Python in worker has different version 3.6 than in driver 2.7.

Solutions:

(1) Option 1: as the error message suggests, install the same Python version for the driver and the workers, and configure PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON accordingly.

(2) Option 2: add the following to the code:

import os

os.environ["PYSPARK_PYTHON"] = "~/anaconda3/bin/python3.6"         # an absolute path is preferable
os.environ["PYSPARK_DRIVER_PYTHON"] = "~/anaconda3/bin/python3.6"  # an absolute path is preferable
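These assignments generally need to run before the SparkContext is created so that the setting can propagate to the workers. A minimal sketch of the placement, reusing the "cca" application from above (the interpreter path is an assumption; substitute the absolute path that actually exists on both driver and workers):

import os
from pyspark import SparkConf, SparkContext

# Assumed interpreter path; replace with the real Python 3.6 installed on driver and workers.
os.environ["PYSPARK_PYTHON"] = "/home/xindun/anaconda3/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/home/xindun/anaconda3/bin/python3.6"

# Create the SparkContext only after the environment variables are in place.
conf = SparkConf()
conf.setAppName("cca").setMaster("yarn")
sc = SparkContext(conf=conf)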
