Problem 1 — how the code was submitted:

spark-submit connectedComponentAnalysis.py --master yarn --deploy-mode cluster --executor-memory 3g --num-executors 10
Problem 1 — environment configuration in the code:
conf = SparkConf()
conf.setAppName("cca")
sc = SparkContext(conf=conf)  # conf must be passed by keyword

Solution: configure yarn directly in the code:

conf = SparkConf()
conf.setAppName("cca").setMaster("yarn")
sc = SparkContext(conf=conf)  # conf must be passed by keyword

Reason: not entirely clear at the time; see the note below.
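A likely explanation, assuming the command in Problem 1 was typed exactly as shown: spark-submit's documented usage is spark-submit [options] <app file> [app arguments], so flags placed after the application file are handed to connectedComponentAnalysis.py as ordinary application arguments and never reach spark-submit itself. The --master yarn flag was therefore silently dropped, and hard-coding setMaster("yarn") compensated for it. Reordering the original command should also work:

spark-submit --master yarn --deploy-mode cluster --executor-memory 3g --num-executors 10 connectedComponentAnalysis.py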
Problem 2 — code:
conf = SparkConf()
conf.setAppName("cca").setMaster("yarn")
sc = SparkContext(conf=conf)  # conf must be passed by keyword
spark = SparkSession(sc)

Solution: add a spark.yarn.dist.files entry (note the plural property name; the singular spark.yarn.dist.file is not a Spark setting) and an executor PYTHONPATH setting on the SparkConf:

conf = SparkConf()
conf.setAppName("cca") \
    .setMaster("yarn") \
    .set("spark.yarn.dist.files",
         "/home/xindun/BDML/spark-2.3.1/python/lib/pyspark.zip,/home/xindun/BDML/spark-2.3.1/python/lib/py4j-0.10.7-src.zip") \
    .setExecutorEnv("PYTHONPATH", "pyspark.zip:py4j-0.10.7-src.zip")
sc = SparkContext(conf=conf)  # conf must be passed by keyword
spark = SparkSession(sc)

Solution:
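Why this works: spark.yarn.dist.files uploads the listed files into each YARN container's working directory, which is why the executor PYTHONPATH can refer to the bare names pyspark.zip:py4j-0.10.7-src.zip. As an alternative to the code change, the same zips can usually be shipped with --py-files, which distributes them and adds them to the Python path on both driver and executors automatically. A sketch, assuming the same paths as above:

spark-submit --master yarn --deploy-mode cluster \
    --py-files /home/xindun/BDML/spark-2.3.1/python/lib/pyspark.zip,/home/xindun/BDML/spark-2.3.1/python/lib/py4j-0.10.7-src.zip \
    connectedComponentAnalysis.py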
(1) Method 1: following the hint in the error message, install the same Python environment on the driver and on every worker, and configure PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON.
(2) Method 2: set the two variables in the code (a consolidated sketch follows below):

import os

os.environ["PYSPARK_PYTHON"] = "~/anaconda3/bin/python3.6"         # an absolute path is preferable
os.environ["PYSPARK_DRIVER_PYTHON"] = "~/anaconda3/bin/python3.6"  # an absolute path is preferable
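Putting the pieces together, a minimal end-to-end version of the setup might look like the sketch below. The interpreter path is an assumption (the ~ above expanded to /home/xindun, since os.environ does not expand ~); the zip locations are the ones used on this machine.

import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Same interpreter for driver and executors (absolute path; ~ is not expanded).
os.environ["PYSPARK_PYTHON"] = "/home/xindun/anaconda3/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/home/xindun/anaconda3/bin/python3.6"

SPARK_LIB = "/home/xindun/BDML/spark-2.3.1/python/lib"

conf = (
    SparkConf()
    .setAppName("cca")
    .setMaster("yarn")
    # Ship the PySpark zips into every YARN container's working directory ...
    .set("spark.yarn.dist.files",
         "{0}/pyspark.zip,{0}/py4j-0.10.7-src.zip".format(SPARK_LIB))
    # ... so executors can put them on their Python path by bare file name.
    .setExecutorEnv("PYTHONPATH", "pyspark.zip:py4j-0.10.7-src.zip")
)

sc = SparkContext(conf=conf)  # conf must be passed by keyword
spark = SparkSession(sc)

# Quick smoke test that executors actually run Python tasks.
print(sc.parallelize(range(100)).sum())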