Linnea and I explored how DataNodes work and how the HDFS Balancer redistributes blocks across them.
We summarized our progress this week on Google Drive and presented it to the entire server team.
We have also started a Google Doc, shared with the entire team, where we post diagnostic reports. The idea is to assemble a collection of common issues discovered and solved by CDS Data Engineering team members.
Examples of such reports include:
Hadoop Start & Setup
Category: Hadoop Start
User: T. K. Daisy Leung
Date: 10/22/17
NameNode missing after every Hadoop restart
Problem Description: The NameNode disappears every time Hadoop is restarted, forcing me to reformat HDFS each time.
Current Configuration: Java 8 + Hadoop 2.8.1 + Mac OS X El Capitan
Logs:
Screenshot of Error Message:
Screenshot of Spark UI:
Solution (include links and resources used):
By default the NameNode and DataNode directories live under hadoop.tmp.dir in /tmp, which the OS clears between reboots, so the NameNode metadata keeps vanishing. Fix this by pointing them at persistent directories and creating those directories.
- Edit $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///Users/admin/Downloads/hadoop-2.8.1/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///Users/admin/Downloads/hadoop-2.8.1/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
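Not part of the original report, just a sketch: the same directory creation can be scripted by reading the paths back out of hdfs-site.xml, mirroring the manual mkdir steps below. The config path here is an assumption matching the setup above.

import os
import xml.etree.ElementTree as ET

def create_hdfs_dirs(conf_path):
    # Walk the <property> entries and create the NameNode/DataNode dirs.
    tree = ET.parse(conf_path)
    for prop in tree.getroot().findall("property"):
        if prop.findtext("name") in ("dfs.name.dir", "dfs.data.dir"):
            # Values are file:// URIs; strip the scheme to get the local path.
            path = prop.findtext("value").replace("file://", "", 1)
            if not os.path.isdir(path):
                os.makedirs(path)
                print("created", path)

# Assumed location of the config edited above.
create_hdfs_dirs(os.path.expanduser(
    "~/Downloads/hadoop-2.8.1/etc/hadoop/hdfs-site.xml"))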
admin@dhcp-ccc-5418:~/Downloads/hadoop-2.8.1$ sbin/stop-all.sh
Create the respective directories, then reformat the NameNode and restart:
admin@dhcp-ccc-5418:~/Downloads/hadoop-2.8.1$ mkdir -p hdfs/namenode
admin@dhcp-ccc-5418:~/Downloads/hadoop-2.8.1$ mkdir -p hdfs/datanode
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ bin/hdfs namenode -format
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ sbin/start-all.sh
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ jps
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ sbin/stop-all.sh
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ sbin/start-all.sh
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ jps
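As a convenience (not in the report), the jps check can be scripted so a missing NameNode is obvious after each restart. A minimal sketch:

import subprocess

def running_hdfs_daemons():
    # jps prints "<pid> <ClassName>" per JVM; collect the class names.
    output = subprocess.check_output(["jps"]).decode()
    names = {parts[1] for parts in (line.split() for line in output.splitlines())
             if len(parts) == 2}
    return {"NameNode", "DataNode", "SecondaryNameNode"} & names

print("Running HDFS daemons:", running_hdfs_daemons())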
Time Spent on Issue (in hours): 0.5
Hadoop connection failed
Problem Description: hadoop fs commands fail to connect to the NameNode (connection refused on localhost:9000).
Current Configuration: Java 8 + Hadoop 2.8.1 + Mac OS X El Capitan
Logs:
admin@MacBook-Air:~/Downloads/hadoop-2.8.1$ bin/hadoop fs -ls /yelp
17/10/22 17:16:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/22 17:16:55 WARN ipc.Client: Failed to connect to server: localhost/127.0.0.1:9000: try once and fail.
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:681)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:777)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:409)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1542)
at org.apache.hadoop.ipc.Client.call(Client.java:1373)
at org.apache.hadoop.ipc.Client.call(Client.java:1337)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:787)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:398)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:335)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1700)
at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1436)
at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1433)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1433)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:269)
at org.apache.hadoop.fs.Globber.glob(Globber.java:148)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1685)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:378)
ls: Call From dhcp-ccc-5418.eduroam.cornell.edu/10.128.21.42 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Error Message:
ls: Call From dhcp-ccc-5418.eduroam.cornell.edu/10.128.21.42 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Screenshot of Spark UI:
Solution (include links and resources used):
run "sbin/start-all.sh" before running "bin/hadoop fs"
Time Spent on Issue (in hours): 0.05
Pyspark import error
Category: Hadoop Start
User: T. K. Daisy Leung
Date: 10/22/17
Errors suddenly raised by sc = SparkContext()
Problem Description: An error is raised when setting sc = SparkContext(), even though the notebook was working before.
Current Configuration: Java 8 + Hadoop 2.8.1 + Mac OS X El Capitan + Spark 2.2.0 (installed with pip)
Logs:
Screenshot of Error Message:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-2-bd8ed31b87ac> in <module>()
2 from pyspark.sql.session import SparkSession
3 try:
----> 4 sc = SparkContext()
5 except ValueError:
6 pass
/Users/admin/anaconda/lib/python2.7/site-packages/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
113 """
114 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/Users/admin/anaconda/lib/python2.7/site-packages/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway, conf)
281 with SparkContext._lock:
282 if not SparkContext._gateway:
--> 283 SparkContext._gateway = gateway or launch_gateway(conf)
284 SparkContext._jvm = SparkContext._gateway.jvm
285
/Users/admin/anaconda/lib/python2.7/site-packages/pyspark/java_gateway.pyc in launch_gateway(conf)
93 callback_socket.close()
94 if gateway_port is None:
---> 95 raise Exception("Java gateway process exited before sending the driver its port number")
96
97 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
Screenshot of Spark UI:
Solution (include links and resources used):
Perhaps my dotfiles changed in the meantime, though I don't remember changing them. A likely cause: I added exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read()) to my .ipython/profile_pyspark/ipython_notebook_config.py, and shell.py already creates a SparkContext in the background. Since only one SparkContext can run at a time, wrapping the call in try/except works around the issue for now:
from pyspark import SparkContext

try:
    sc = SparkContext()
except Exception:
    # The gateway error above is a plain Exception, so catch broadly;
    # a working SparkContext already exists in the background.
    pass
# spark = SparkSession(sc)
In [1]: sc
Out[1]: SparkContext (Spark UI: version v2.2.0, master yarn, appName PySparkShell)
Time Spent on Issue (in hours): 0.05
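A cleaner alternative, sketched below rather than what the report actually used: PySpark's SparkContext.getOrCreate() returns the already-running context instead of raising, so the try/except is unnecessary.

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()   # reuses the existing context if one is running
spark = SparkSession(sc)          # the commented-out step from the workaround above
print(sc.version, sc.master, sc.appName)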