Documented an error encountered on CDS server.
Traceback (most recent call last):
File "Costly_21nov.py", line 22, in <module>
.config("spark.driver.maxResultSize", "4g") \
File "/home/serverteam_1/anaconda3/lib/python3.6/site-packages/pyspark/sql/session.py", line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/home/serverteam_1/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 334, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/home/serverteam_1/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/home/serverteam_1/anaconda3/lib/python3.6/site-packages/pyspark/context.py", line 283, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/home/serverteam_1/anaconda3/lib/python3.6/site-packages/pyspark/java_gateway.py", line 95, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
serverteam_1@master:~/data$
Anyway, we outlined what needs to be done/summarized in our final writeup: Performance and Costly Functions on Server
Basically, up to this point, Linnea and I have investigated
- The theory of
- Basics of partitioning in Spark
- Caching/Persisting
- Lazy Evaluation
- In practice, issues with
- OOM
- Overhang
- Heartbeat
- Py4JJavaError
- RDD actions
- Java issues on server
- (Efficient) join operations
- OOM