In this article I will explain how HDFS stores blocks. As per the HDFS architecture, every file is split into blocks whose size is defined by dfs.blocksize, and each block is replicated dfs.replication times to achieve fault tolerance and data locality. We can use Hadoop's fsck command to check a file's block status. Let's go through the steps.
In this article I'll list the HDFS blocks of the file /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar.
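Before diving in, a quick illustration of the split (hypothetical numbers, assuming the default 128 MB block size): a 300 MB file is stored as ceil(300 / 128) = 3 blocks of 128 MB, 128 MB and 44 MB, and each of those blocks is then replicated dfs.replication times across the datanodes.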
Step 1: Check that the file exists
Let's confirm that the file exists at the given path.
[root@sandbox ~]# hadoop fs -ls /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
-rw-r--r-- 1 oozie hdfs 190868107 2017-05-05 12:59 /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
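If you only need a yes/no answer, for example in a script, the fs shell's -test option returns exit code 0 when the path exists. A minimal sketch, using the same path:
[root@sandbox ~]# hadoop fs -test -e /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar && echo "file exists"
Since the listing above shows the file, this should print "file exists".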
Step 2: Check the block size configured in the cluster
In HDFS, the dfs.blocksize property configures the block size of files. Let's grep for the property in the HDFS configuration file. In the sandbox environment, the file is located at /etc/hadoop/conf/hdfs-site.xml.
[root@sandbox ~]# grep -C2 'dfs.blocksize' /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
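Instead of grepping the XML, the getconf utility reads the effective value from the client configuration directly. A minimal sketch:
[root@sandbox ~]# hdfs getconf -confKey dfs.blocksize
Given the hdfs-site.xml above, this should print 134217728, i.e. 128 MB (128 * 1024 * 1024 bytes). Keep in mind that dfs.blocksize is only the default applied to newly written files; a file can be created with a different block size.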
Step 3: Check the size of the file
Let's check whether the file is larger than one block.
[root@sandbox ~]# hadoop fs -du -s -h /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
182.0 M /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
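With the numbers from Steps 2 and 3 we can already predict the block layout: 190868107 bytes ≈ 182.0 MB, and 190868107 / 134217728 ≈ 1.42, so HDFS needs ceil(1.42) = 2 blocks: one full block of 134217728 bytes and a last block of 190868107 - 134217728 = 56650379 bytes. The fsck output in Step 5 should confirm this.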
Step 4: Check the replication factor configured for the cluster
In HDFS, the dfs.replication property configures the number of replicas of each block. Let's grep for the property in the HDFS configuration file. In the sandbox environment, the file is located at /etc/hadoop/conf/hdfs-site.xml.
[root@sandbox ~]# grep -C2 'dfs.replication<' /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
[root@sandbox ~]#
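The block size and replication factor can also be read per file rather than from the cluster configuration. In the -ls output of Step 1, the 1 in the second column is already this file's replication factor, and the stat command prints both values in one go. A minimal sketch, using the same path:
[root@sandbox ~]# hdfs dfs -stat "blocksize=%o replication=%r" /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
On this sandbox it should report blocksize=134217728 and replication=1.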
Step 5: Use fsck to check blocks and datanodes
Now we can use the fsck command to check the status of the file, including the number of blocks and their locations.
[root@sandbox ~]# hdfs fsck /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar -files -blocks -locations
Connecting to namenode via http://sandbox.hortonworks.com:50070/fsck?ugi=root&files=1&blocks=1&locations=1&path=%2Fuser%2Foozie%2Fshare%2Flib%2Flib_20170505125744%2Fspark%2Fspark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
FSCK started by root (auth:SIMPLE) from /127.0.0.1 for path /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar at Mon Jan 29 05:32:44 UTC 2018
/user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar 190868107 bytes, 2 block(s): OK
0. BP-1875268269-127.0.0.1-1493988757398:blk_1073742550_1726 len=134217728 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]
1. BP-1875268269-127.0.0.1-1493988757398:blk_1073742551_1727 len=56650379 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]
Status: HEALTHY
Total size: 190868107 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 2 (avg. block size 95434053 B)
Minimally replicated blocks: 2 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Jan 29 05:32:44 UTC 2018 in 3 milliseconds
The filesystem under path '/user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar' is HEALTHY
[root@sandbox ~]#
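As a sanity check, the two block lengths reported by fsck add up to the file size from Step 1: 134217728 + 56650379 = 190868107 bytes. And because dfs.replication is 1 and the sandbox has a single datanode, each block shows repl=1 and the same location 127.0.0.1:50010; on a multi-node cluster with a replication factor of 3, fsck would list three datanode locations per block.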