Monday 29 January 2018

How to list the blocks of a file in HDFS

This article shows how HDFS stores a file as blocks. In the HDFS architecture, every file is split into blocks whose size is set by dfs.blocksize, and each block is replicated dfs.replication times for fault tolerance and data locality. Hadoop's fsck command reports the blocks of a file and where they are stored. Let's go through the steps.

The example file used throughout this article is /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar.

Step 1: Check that the file exists

Let's confirm that the file exists at the given path.

[root@sandbox ~]# hadoop fs -ls /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
-rw-r--r--   1 oozie hdfs  190868107 2017-05-05 12:59 /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar

Step 2: Check the block size configured for the cluster

In HDFS, the dfs.blocksize property sets the block size, in bytes, used for new files. Let's grep for it in the HDFS configuration file. In the sandbox environment the file is located at /etc/hadoop/conf/hdfs-site.xml.

[root@sandbox ~]# grep -C2 'dfs.blocksize' /etc/hadoop/conf/hdfs-site.xml 

    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
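
The value is given in bytes, so it helps to convert it to a more readable unit. The following is a minimal sketch that pulls the number out of a copy of the snippet above and converts it to MiB; the variable names are illustrative, and the snippet is embedded as text so the example runs without a cluster. On a live cluster, `hdfs getconf -confKey dfs.blocksize` should report the same effective value.

```shell
#!/usr/bin/env bash
# Copy of the <property> element printed by grep above (embedded for illustration).
property_snippet='    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>'

# Extract the digits between <value> and </value>.
block_size=$(printf '%s\n' "$property_snippet" \
  | sed -n 's:.*<value>\([0-9]*\)</value>.*:\1:p')

# Convert bytes to MiB (1 MiB = 1024 * 1024 bytes).
block_size_mib=$(( block_size / 1024 / 1024 ))

echo "dfs.blocksize = ${block_size} bytes (${block_size_mib} MiB)"
```

This prints `dfs.blocksize = 134217728 bytes (128 MiB)`, so the sandbox uses 128 MiB blocks.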

Step 3: Check the size of the file

Let's check whether the file is larger than a single block.

[root@sandbox ~]# hadoop fs -du -s -h /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
182.0 M  /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
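
At 182.0 M the file is larger than one 134217728-byte block, so it should occupy more than one block. As a rough sketch of the arithmetic (assuming a non-empty file and the byte counts shown in Steps 1 and 2; the variable names are made up for this example), the expected number of blocks and the size of the last block can be computed like this:

```shell
#!/usr/bin/env bash
file_size=190868107    # bytes, from the hadoop fs -ls output in Step 1
block_size=134217728   # bytes, dfs.blocksize from Step 2

# Every block except the last one is full; the last holds the remainder.
full_blocks=$(( file_size / block_size ))
remainder=$(( file_size % block_size ))

# Round up: a non-zero remainder needs one extra, partially filled block.
block_count=$(( remainder > 0 ? full_blocks + 1 : full_blocks ))
last_block_size=$(( remainder > 0 ? remainder : block_size ))

echo "expected blocks: ${block_count}, last block: ${last_block_size} bytes"
```

This prints `expected blocks: 2, last block: 56650379 bytes`. The last block only takes up as much space as the data it holds, not a full 128 MiB.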

Step 4: Check the replication factor configured for the cluster

In HDFS, the dfs.replication property sets the default number of replicas kept for each block. Let's grep for it in the same configuration file, /etc/hadoop/conf/hdfs-site.xml. The second column of the hadoop fs -ls output in Step 1 also shows the replication factor of the file itself, which is 1.

[root@sandbox ~]# grep -C2 'dfs.replication<' /etc/hadoop/conf/hdfs-site.xml 
    
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
[root@sandbox ~]# 
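
Replication multiplies the raw disk space a file consumes across the cluster. The sketch below compares the sandbox setting of 1 with a replication factor of 3, which is the stock HDFS default and is used here only as a hypothetical comparison; the variable names are illustrative.

```shell
#!/usr/bin/env bash
file_size=190868107       # bytes, from Step 1
sandbox_replication=1     # dfs.replication found in the sandbox above
default_replication=3     # assumption: stock HDFS default, for comparison only

# Raw bytes stored across all datanodes = file size * number of replicas.
raw_sandbox=$(( file_size * sandbox_replication ))
raw_default=$(( file_size * default_replication ))

echo "replication ${sandbox_replication}: ${raw_sandbox} bytes on disk"
echo "replication ${default_replication}: ${raw_default} bytes on disk"

# On a live cluster, the per-file value can be read with:
#   hadoop fs -stat %r <path>
```

With a single replica the file costs exactly its own size in raw storage, but losing that one datanode means losing the data.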

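Step 5: Use fsck to check blocks and datanodes

Now we can use the fsck command to check the status of the file. The -files option prints the file being checked, -blocks prints its block report, and -locations prints the datanodes that hold each block.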

[root@sandbox ~]# hdfs fsck /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar -files -blocks -locations
Connecting to namenode via http://sandbox.hortonworks.com:50070/fsck?ugi=root&files=1&blocks=1&locations=1&path=%2Fuser%2Foozie%2Fshare%2Flib%2Flib_20170505125744%2Fspark%2Fspark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
FSCK started by root (auth:SIMPLE) from /127.0.0.1 for path /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar at Mon Jan 29 05:32:44 UTC 2018
/user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar 190868107 bytes, 2 block(s):  OK
0. BP-1875268269-127.0.0.1-1493988757398:blk_1073742550_1726 len=134217728 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]
1. BP-1875268269-127.0.0.1-1493988757398:blk_1073742551_1727 len=56650379 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]

Status: HEALTHY
 Total size:    190868107 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  2 (avg. block size 95434053 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    1
 Average block replication: 1.0
 Corrupt blocks:        0
 Missing replicas:      0 (0.0 %)
 Number of data-nodes:      1
 Number of racks:       1
FSCK ended at Mon Jan 29 05:32:44 UTC 2018 in 3 milliseconds


The filesystem under path '/user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar' is HEALTHY
[root@sandbox ~]#
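
The two numbered lines in the report are the blocks of the file: one full block of 134217728 bytes and a second block of 56650379 bytes, each with a single replica on datanode 127.0.0.1:50010. For scripting, those lines can be reduced to block ID, length and replica count. The following is a small sketch using awk on a copy of the two block lines above; it assumes the line layout shown in this output, which may differ between Hadoop versions.

```shell
#!/usr/bin/env bash
# The two block lines from the fsck report above, embedded so the example
# runs without a cluster. On a live system, pipe the fsck output in instead.
fsck_lines='0. BP-1875268269-127.0.0.1-1493988757398:blk_1073742550_1726 len=134217728 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]
1. BP-1875268269-127.0.0.1-1493988757398:blk_1073742551_1727 len=56650379 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]'

summary=$(printf '%s\n' "$fsck_lines" | awk '
  # Block lines start with an index followed by the block pool ID.
  /^[0-9]+\. BP-/ {
    split($2, id, ":")        # id[1] = block pool, id[2] = blk_<id>_<genstamp>
    sub(/^len=/, "", $3)      # keep only the byte count
    sub(/^repl=/, "", $4)     # keep only the replica count
    printf "%s len=%s repl=%s\n", id[2], $3, $4
    total += $3
  }
  END { printf "total=%d\n", total }
')

printf '%s\n' "$summary"
```

The block lengths add up to 190868107 bytes, the same size that hadoop fs -ls reported in Step 1.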