Monday 29 January 2018

How to list the blocks of a file in HDFS

In this article I will explain how HDFS stores blocks. As per the HDFS architecture, every file is split into blocks of the size defined by dfs.blocksize, and each block is replicated dfs.replication times to achieve fault tolerance and data locality. We can use Hadoop's fsck command to check the block status of a file. Let's go through the steps.

Throughout this article I'll list the HDFS blocks of the file /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar.

Step 1: Check that the file exists

Let's confirm that the file exists at the given path.

[root@sandbox ~]# hadoop fs -ls /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
-rw-r--r--   1 oozie hdfs  190868107 2017-05-05 12:59 /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
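
As a side note, if you only need the size, block size and replication factor of a single file, hadoop fs -stat can print them in one line. A quick alternative I find handy (using the format specifiers documented for the stat subcommand):

# %b = file size in bytes, %o = block size, %r = replication factor, %n = file name
hadoop fs -stat "%b %o %r %n" /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar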

Step 2: Check the block size configured in the cluster

In HDFS, the dfs.blocksize property configures the block size of files. Let's grep for it in the HDFS configuration file; in the sandbox environment the file is located at /etc/hadoop/conf/hdfs-site.xml.

[root@sandbox ~]# grep -C2 'dfs.blocksize' /etc/hadoop/conf/hdfs-site.xml 

    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
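
Instead of grepping the XML by hand, the effective value can also be read with hdfs getconf; a convenient alternative (134217728 bytes is 128 MB):

# prints the configured value of dfs.blocksize (134217728 bytes = 128 MB on this sandbox)
hdfs getconf -confKey dfs.blocksize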

Step 3: Check the size of the file

Let's check whether the file is larger than one block (128 MB).

[root@sandbox ~]# hadoop fs -du -s -h /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
182.0 M  /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
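
Since 182 MB is larger than the 128 MB block size, we can already predict how many blocks fsck should report, namely ceil(file size / block size). A quick sanity check with shell arithmetic:

# ceiling division: (size + blocksize - 1) / blocksize; prints 2, so fsck should report two blocks
echo $(( (190868107 + 134217728 - 1) / 134217728 ))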

Step 4: Check the replication factor configured for the cluster

In HDFS, the dfs.replication property configures the number of replicas kept for each block. Let's grep for it in the HDFS configuration file; in the sandbox environment this is again /etc/hadoop/conf/hdfs-site.xml.

[root@sandbox ~]# grep -C2 'dfs.replication<' /etc/hadoop/conf/hdfs-site.xml 
    
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
[root@sandbox ~]# 
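
Keep in mind that dfs.replication is only the default applied to newly created files; the effective per-file value is the second column of the hadoop fs -ls output in Step 1 (1 here). If you ever need to change it for an existing file, something like hadoop fs -setrep would do it (shown only as an illustration, not run on this sandbox):

# -w waits until the requested replication factor has been reached
hadoop fs -setrep -w 2 /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar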

Step 5: Use fsck to check blocks and datanode locations

Now we can use the fsck command to check the status of the file, including the number of blocks and where each replica is located.

[root@sandbox ~]# hdfs fsck /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar -files -blocks -locations
Connecting to namenode via http://sandbox.hortonworks.com:50070/fsck?ugi=root&files=1&blocks=1&locations=1&path=%2Fuser%2Foozie%2Fshare%2Flib%2Flib_20170505125744%2Fspark%2Fspark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar
FSCK started by root (auth:SIMPLE) from /127.0.0.1 for path /user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar at Mon Jan 29 05:32:44 UTC 2018
/user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar 190868107 bytes, 2 block(s):  OK
0. BP-1875268269-127.0.0.1-1493988757398:blk_1073742550_1726 len=134217728 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]
1. BP-1875268269-127.0.0.1-1493988757398:blk_1073742551_1727 len=56650379 repl=1 [DatanodeInfoWithStorage[127.0.0.1:50010,DS-3fd6f5d7-12ac-4a3c-8890-77034935b5e6,DISK]]

Status: HEALTHY
 Total size:    190868107 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  2 (avg. block size 95434053 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    1
 Average block replication: 1.0
 Corrupt blocks:        0
 Missing replicas:      0 (0.0 %)
 Number of data-nodes:      1
 Number of racks:       1
FSCK ended at Mon Jan 29 05:32:44 UTC 2018 in 3 milliseconds


The filesystem under path '/user/oozie/share/lib/lib_20170505125744/spark/spark-assembly-1.6.3.2.6.0.3-8-hadoop2.7.3.2.6.0.3-8.jar' is HEALTHY
[root@sandbox ~]# 
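
fsck can also work in the opposite direction: given a block id (for example one spotted in datanode logs), Hadoop 2.7 and later can map it back to the owning file and its replicas with the -blockId option. A rough sketch using the first block id reported above:

# maps a block id back to its file, replicas and their datanodes (Hadoop 2.7+)
hdfs fsck / -blockId blk_1073742550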

How to edit the hosts file on macOS

On macOS, the hosts file appears to exist in two places, i.e. /etc/hosts and /private/etc/hosts. But if you do a detailed listing of the /etc path, you will notice that it is a symlink pointing to /private/etc.

chetnachaudhari@chetnas-MacBook-Pro:~$ ls -lsa /etc
8 lrwxr-xr-x@ 1 root  wheel  11 Jan 12  2017 /etc -> private/etc

To add a new hosts entry on your machine, edit the /private/etc/hosts file (editing /etc/hosts works just as well, since it is the same file through the symlink). The following is a sample of how this file looks:

chetnachaudhari@chetnas-MacBook-Pro:~$ cat /private/etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1   localhost
255.255.255.255 broadcasthost
::1             localhost
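
After adding an entry (say, a line like 127.0.0.1 sandbox.hortonworks.com for local testing), the change usually takes effect immediately; if an old lookup still seems cached, flushing the DNS cache helps. On recent macOS versions this is typically done with:

# flush the directory services cache and restart the mDNS responder
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder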

lsblk - List block device information

lsblk is a Linux utility that lists block device information. In this blog post, I'll cover some useful lsblk commands.

To see the list of devices:

[root@sandbox ~]# lsblk
NAME                          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 48.8G  0 disk
|-sda1                          8:1    0  500M  0 part /boot
`-sda2                          8:2    0 48.3G  0 part
  |-vg_sandbox-lv_root (dm-0) 253:0    0 43.5G  0 lvm  /
  `-vg_sandbox-lv_swap (dm-1) 253:1    0  4.9G  0 lvm  [SWAP]

By default lsblk prints the information as a tree; if you want to see it as a flat list instead, you can use the -l option.

[root@sandbox ~]# lsblk -l
NAME                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                         8:0    0 48.8G  0 disk
sda1                        8:1    0  500M  0 part /boot
sda2                        8:2    0 48.3G  0 part
vg_sandbox-lv_root (dm-0) 253:0    0 43.5G  0 lvm  /
vg_sandbox-lv_swap (dm-1) 253:1    0  4.9G  0 lvm  [SWAP]

Here,

  • NAME is the name of the device
  • MAJ:MIN is the major:minor device number
  • RM indicates whether the device is removable
  • SIZE is the size of the device in human-readable format
  • RO indicates whether the device is read-only
  • TYPE is the device type
  • MOUNTPOINT is the location where the device is mounted.
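
If you only care about a few of these columns, you can ask lsblk to print exactly the ones you want with the -o option, for example:

# print only the columns you are interested in
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT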

To see device size in bytes

[root@sandbox ~]# lsblk -b
NAME                          MAJ:MIN RM        SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 52428800000  0 disk
|-sda1                          8:1    0   524288000  0 part /boot
`-sda2                          8:2    0 51903463424  0 part
  |-vg_sandbox-lv_root (dm-0) 253:0    0 46657437696  0 lvm  /
  `-vg_sandbox-lv_swap (dm-1) 253:1    0  5242880000  0 lvm  [SWAP]

To see filesystem information

[root@sandbox ~]# lsblk -fl
NAME                      FSTYPE      LABEL UUID                                   MOUNTPOINT
sda
sda1                      ext4              8ed32b8c-b23a-423b-b96f-29eaa1303ae1   /boot
sda2                      LVM2_member       6CXjrD-6st6-olYP-BQAK-psA0-dS3T-8KeIRU
vg_sandbox-lv_root (dm-0) ext4              d6e7730a-608a-4e67-8814-131e23411619   /
vg_sandbox-lv_swap (dm-1) swap              dc07cc2c-1b35-4b06-a52b-c0d162669afe   [SWAP]

Here,

  • FSTYPE is the filesystem type
  • LABEL is the filesystem label
  • UUID is the filesystem UUID
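
The UUID column is what you would normally reference in /etc/fstab, so the mount keeps working even if the device name changes. For example, the /boot filesystem above could be mounted with an entry along these lines (mount options are illustrative):

# /etc/fstab: mount /boot by UUID instead of device name
UUID=8ed32b8c-b23a-423b-b96f-29eaa1303ae1  /boot  ext4  defaults  1 2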

To see device permissions

[root@sandbox ~]# lsblk -m
NAME                           SIZE OWNER GROUP MODE
sda                           48.8G root  disk  brw-rw----
|-sda1                         500M root  disk  brw-rw----
`-sda2                        48.3G root  disk  brw-rw----
  |-vg_sandbox-lv_root (dm-0) 43.5G root  disk  brw-rw----
  `-vg_sandbox-lv_swap (dm-1)  4.9G root  disk  brw-rw----

Here,

  • OWNER is the user that owns the device node
  • GROUP is the group that owns the device node
  • MODE is the permission mode of the device node

To see device topology information

[root@sandbox ~]# lsblk -tl
NAME                      ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA
sda                               0    512      0     512     512    1 cfq       128  128
sda1                              0    512      0     512     512    1 cfq       128  128
sda2                              0    512      0     512     512    1 cfq       128  128
vg_sandbox-lv_root (dm-0)         0    512      0     512     512    1           128  128
vg_sandbox-lv_swap (dm-1)         0    512      0     512     512    1           128  128

Here,

  • ALIGNMENT is the alignment offset of the device
  • MIN-IO is the minimum I/O size
  • OPT-IO is the optimal I/O size
  • PHY-SEC is the physical sector size
  • LOG-SEC is the logical sector size
  • ROTA indicates whether it is a rotational device
  • SCHED is the name of the I/O scheduler
  • RQ-SIZE is the size of the request queue
  • RA is the read-ahead size of the device.

Sunday 28 January 2018

HDFS Metadata: Namenode

In this article I will explain how the namenode maintains its metadata in the directory configured by dfs.namenode.name.dir in the hdfs-site.xml file. In my case dfs.namenode.name.dir is set to /hadoop/hdfs/namenode, so let's start by listing that directory.

ls -1 /hadoop/hdfs/namenode
current
in_use.lock

There are two entries, namely:

in_use.lock :

This is a lock file held by the namenode process. It is used to prevent the directory from being modified by more than one namenode process at a time.

current: This is a directory. Let's list its contents.

ls -1 /hadoop/hdfs/namenode/current
VERSION
edits_0000000000000138313-0000000000000138314
edits_0000000000000138315-0000000000000138316
edits_0000000000000138317-0000000000000138318
edits_0000000000000138319-0000000000000138320
edits_0000000000000138321-0000000000000138322
edits_0000000000000138323-0000000000000138324
edits_inprogress_0000000000000138325
fsimage_0000000000000137650
fsimage_0000000000000137650.md5
fsimage_0000000000000138010
fsimage_0000000000000138010.md5
seen_txid

There are a lot of files here; let's explore them one by one.

VERSION:

This is a storage information file with the following content:

#Wed Dec 02 13:16:31 IST 2015
namespaceID=2109784471
clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1469059006-127.0.0.1-1449042391563
layoutVersion=-63

You can refer to org.apache.hadoop.hdfs.server.common.StorageInfo.java and org.apache.hadoop.hdfs.server.common.Storage.java for more information.

namespaceID:

A unique namespace identifier assigned to the file system when HDFS is formatted. It is stored on all nodes of the cluster and is essential for joining the cluster: a datanode with a different namespaceID is not allowed to join.

clusterID:

It identifies the cluster and has to be unique during the lifetime of the cluster. This is important for federated deployments. Introduced in HDFS-1365.

cTime:

The creation time of the file system; this field is updated during HDFS upgrades.

storageType:

storageType can be NAME_NODE or JOURNAL_NODE (any org.apache.hadoop.hdfs.server.common.NodeType value except DATA_NODE).

blockpoolID:

A unique identifier of the storage block pool. This is important for federated deployments. Introduced in HDFS-1365.

layoutVersion:

The layout version of the on-disk metadata. Whenever new metadata-related features are added to HDFS, this version is changed.

edits_0000000000000abcdef-0000000000000uvwxyz (edits_startTransactionID-endTransactionID):

This file contains all edit log transactions from startTransactionID to endTransactionID. It is a log of every file system change, such as a file creation, deletion or modification.

edits_inprogress_0000000000000138325 (edits_inprogress_startTransactionID):

The current edit log file. It contains edits starting from startTransactionID, and all new transactions are appended to it.

fsimage_0000000000000abcdef (fsimage_endTransactionID):

This file contains the complete state of the file system at a point in time; in this case, up to transaction endTransactionID.

fsimage_0000000000000abcdef.md5 (fsimage_endTransactionID.md5):

This is the MD5 checksum of the fsimage_endTransactionID file, used to detect corruption of the image on disk.

seen_txid:

The transactionID of the last checkpoint or edit log roll. This file is updated whenever the fsimage is merged with the edits files or a new edits file is created, and it is used at startup to verify that no edits are missing.
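
If you want to look inside these files yourself, Hadoop ships offline viewers for both of them: hdfs oiv for fsimage files and hdfs oev for edit logs. A rough sketch using names from the listing above (the output paths are arbitrary):

# dump the latest fsimage to XML
hdfs oiv -p XML -i /hadoop/hdfs/namenode/current/fsimage_0000000000000138010 -o /tmp/fsimage.xml
# dump a finished edits segment to XML (XML is the default output of oev)
hdfs oev -i /hadoop/hdfs/namenode/current/edits_0000000000000138313-0000000000000138314 -o /tmp/edits.xml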

Linux command for Base64 encode and decode

Linux (and macOS) ships a base64 command to encode and decode data using the Base64 representation. Here is an example: to encode the string Chetna Chaudhari, you can use the following command:

echo "Chetna Chaudhari" | base64
Q2hldG5hIENoYXVkaGFyaQo=

On the macOS base64 used here (note the Chetna.local host in the output below), the -d flag turns on debug mode and prints extra details; be aware that on GNU/Linux coreutils base64, -d instead means --decode:

echo "Chetna Chaudhari" | base64 -d
May 16 10:56:35 Chetna.local base64[26454] <Info>: Read 17 bytes.
May 16 10:56:35 Chetna.local base64[26454] <Info>: Wrote 24 bytes.
Q2hldG5hIENoYXVkaGFyaQo=

To decode the encoded text,

echo Q2hldG5hIENoYXVkaGFyaQo= | base64 --decode
Chetna Chaudhari

You can combine debug output with decoding to see more details of the decode step:

echo Q2hldG5hIENoYXVkaGFyaQo= | base64 -d --decode
May 16 10:56:37 Chetna.local base64[26431] <Info>: Read 25 bytes.
May 16 10:56:37 Chetna.local base64[26431] <Info>: Decoded to 17 bytes.
Chetna Chaudhari
May 16 10:56:37 Chetna.local base64[26431] <Info>: Wrote 17 bytes.
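
One detail worth noticing in the output above: echo appends a newline, and that newline byte gets encoded too (17 bytes were read for a 16-character string). To encode only the string itself, suppress the trailing newline, for example:

# printf does not append a newline, so only the 16 bytes of the string are encoded
printf 'Chetna Chaudhari' | base64

The encoded output then ends in aQ== instead of aQo=, since no newline byte was included.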

How to split a string on the first occurrence of a character in Hive

In this article we will see how to split a string in Hive on the first occurrence of a character. Let's say you have strings like apl_finance_reporting or scp_apl_finance that follow an org_namespace pattern, and you want to split each one into the org (the part before the first '_') and the namespace (the part after it).

hive> create table testSplit(namespace string);
hive> insert into table testSplit values ("scp_apl_finance");
hive> insert into table testSplit values ("apl_finance_reporting");
hive> select namespace from testSplit;
OK
scp_apl_finance
apl_finance_reporting
Time taken: 0.118 seconds, Fetched: 2 row(s)
hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 0)  from testSplit;
OK
scp_apl_finance
apl_finance_reporting
Time taken: 0.064 seconds, Fetched: 2 row(s)

To get the list of all orgs, we can run the following query (group 1 of the regex):

hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 1)  from testSplit;
OK
scp
apl
Time taken: 0.056 seconds, Fetched: 2 row(s)

And to get the list of all namespaces, use group 2:

hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 2)  from testSplit;
OK
apl_finance
finance_reporting
Time taken: 0.066 seconds, Fetched: 2 row(s)
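
Both parts can, of course, be pulled out in a single query; a small sketch against the same table (group 1 is the org, group 2 is everything after the first underscore):

hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 1) as org, regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 2) as rest from testSplit;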

HDFS Metadata: Datanode

In this article I will explain how the datanode maintains its metadata in the directory configured by dfs.datanode.data.dir in the hdfs-site.xml file. In my case dfs.datanode.data.dir is set to /hadoop/hdfs/datanode, so let's start by listing that directory.

ls -1 /hadoop/hdfs/datanode
current
in_use.lock

There are two entries, namely:

in_use.lock :

This is a lock file held by the datanode process. It is used to prevent the directory from being modified by more than one datanode process at a time.

current: This is a directory. Let's do a tree listing on it.

tree current/
current/
|-- BP-1469059006-127.0.0.1-1449042391563
|   |-- current
|   |   |-- VERSION
|   |   |-- finalized
|   |   |   `-- subdir0
|   |   |       `-- subdir0
|   |   |           |-- blk_1073741825
|   |   |           `-- blk_1073741825_1001.meta
|   |   |-- rbw
|   |-- dncp_block_verification.log.curr
|   |-- dncp_block_verification.log.prev
|   `-- tmp
`-- VERSION

There are a lot of files and directories here; let's explore them one by one.

VERSION:

This is a storage information file with the following content:

#Wed Dec 02 13:16:39 IST 2015
storageID=DS-c25c62e1-a512-451e-87b2-e9175afca9f4
clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97
cTime=0
datanodeUuid=ad7ecbe4-b4a2-4b52-8146-5240ec849119
storageType=DATA_NODE
layoutVersion=-56

You can refer to org.apache.hadoop.hdfs.server.common.StorageInfo.java and org.apache.hadoop.hdfs.server.common.Storage.java for more information.

storageID:

It is unique to the datanode and the same across all storage directories on the datanode. The namenode uses this id to uniquely identify the datanode.

clusterID:

It identifies the cluster and has to be unique during the lifetime of the cluster. This is important for federated deployments. Introduced in HDFS-1365.

cTime:

The creation time of the file system; this field is updated during HDFS upgrades.

datanodeUuid:

A unique identifier of the datanode, introduced in HDFS-5233.

storageType:

It'll be DATA_NODE.

layoutVersion:

The layout version of the on-disk metadata. Whenever new metadata-related features are added to HDFS, this version is changed.

BP-randomInteger-NameNodeIpAddress-creationTime:

This is the unique block pool id: BP stands for Block Pool, followed by a random integer, the IP address of the namenode and the block pool creation time. A block pool is the set of blocks that belongs to a single namespace.

finalized:

This directory contains blocks whose writes have been completed (finalized). Each blk_* file holds raw HDFS block data, and the accompanying .meta file holds its checksums.

rbw:

This directory contains blocks that are still being written by an HDFS client. Here rbw stands for replica being written.

dncp_block_verification.log.*:

This file tracks the last time each block was verified by comparing its contents against its checksum. The file is rolled periodically, so dncp_block_verification.log.curr is the current file and dncp_block_verification.log.prev is the previous, rolled-over file. Background block verification runs in ascending order of last verification time.
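
A practical use of knowing this layout: given a block id (for example one printed by hdfs fsck -files -blocks), you can locate the replica file and its .meta checksum file on the datanode disk with a plain find. A sketch using the block from the tree above:

# locate the block replica and its .meta checksum file on this datanode
find /hadoop/hdfs/datanode -name 'blk_1073741825*'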

Saturday 27 January 2018

HDFS Metadata Configuration Properties

In this article, we will look at the configuration properties that control the behaviour of the HDFS metadata directories. All of these properties belong in the hdfs-site.xml file; the descriptions and default values below are taken from hdfs-default.xml.

dfs.datanode.data.dir

Description:

Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permission allows.

Default Value:

file://${hadoop.tmp.dir}/dfs/data

dfs.namenode.checkpoint.check.period

Description:

The SecondaryNameNode and CheckpointNode will poll the NameNode every dfs.namenode.checkpoint.check.period seconds to query the number of uncheckpointed transactions.

Default Value:

60

dfs.namenode.checkpoint.period

Description:

The number of seconds between two periodic checkpoints.

Default Value:

3600

dfs.namenode.checkpoint.txns

Description:

The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every dfs.namenode.checkpoint.txns transactions, regardless of whether dfs.namenode.checkpoint.period has expired.

Default Value:

1000000

dfs.namenode.edit.log.autoroll.check.interval.ms

Description:

How often an active namenode will check if it needs to roll its edit log, in milliseconds.

Default Value:

300000

dfs.namenode.edit.log.autoroll.multiplier.threshold

Description:

Determines when an active namenode will roll its own edit log. The actual threshold (in number of edits) is determined by multiplying this value by dfs.namenode.checkpoint.txns. This prevents extremely large edit files from accumulating on the active namenode, which can cause timeouts during namenode startup and pose an administrative hassle. This behavior is intended as a failsafe for when the standby or secondary namenode fail to roll the edit log by the normal checkpoint threshold.

Default Value:

2.0

dfs.namenode.edits.dir

Description:

Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.namenode.name.dir

Default Value:

${dfs.namenode.name.dir}

dfs.namenode.name.dir

Description:

Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.

Default Value:

file://${hadoop.tmp.dir}/dfs/name

dfs.namenode.num.checkpoints.retained

Description:

The number of image checkpoint files (fsimage_*) that will be retained by the NameNode and Secondary NameNode in their storage directories. All edit logs (stored on edits_* files) necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.

Default Value:

2

dfs.namenode.num.extra.edits.retained

Description:

The number of extra transactions which should be retained beyond what is minimally necessary for a NN restart. It does not translate directly to file's age, or the number of files kept, but to the number of transactions (here "edits" means transactions). One edit file may contain several transactions (edits). During checkpoint, NameNode will identify the total number of edits to retain as extra by checking the latest checkpoint transaction value, subtracted by the value of this property. Then, it scans edits files to identify the older ones that don't include the computed range of retained transactions that are to be kept around, and purges them subsequently. The retainment can be useful for audit purposes or for an HA setup where a remote Standby Node may have been offline for some time and need to have a longer backlog of retained edits in order to start again. Typically each edit is on the order of a few hundred bytes, so the default of 1 million edits should be on the order of hundreds of MBs or low GBs.
NOTE: Fewer extra edits may be retained than value specified for this setting if doing so would mean that more segments would be retained than the number configured by dfs.namenode.max.extra.edits.segments.retained.

Default Value:

1000000

dfs.namenode.checkpoint.dir

Description:

Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.

Default Value:

file://${hadoop.tmp.dir}/dfs/namesecondary

dfs.namenode.checkpoint.edits.dir

Description:

Determines where on the local filesystem the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories then the edits is replicated in all of the directories for redundancy. Default value is same as dfs.namenode.checkpoint.dir

Default Value:

${dfs.namenode.checkpoint.dir}

dfs.namenode.checkpoint.max-retries

Description:

The SecondaryNameNode retries failed check pointing. If the failure occurs while loading fsimage or replaying edits, the number of retries is limited by this variable.

Default Value:

3
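
Overriding any of these values uses the same property blocks we have seen in hdfs-site.xml. As an illustration, to checkpoint every 30 minutes instead of the default hour you might add something like:

    <property>
      <name>dfs.namenode.checkpoint.period</name>
      <!-- checkpoint every 30 minutes (default is 3600 seconds) -->
      <value>1800</value>
    </property>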

How to enable DebugFS on Linux System

Debugfs (the debug filesystem) is a RAM-based filesystem that exposes kernel debugging information, making kernel-space data available in user space.

How to enable debugfs:

To enable it temporarily (the information will be available only until the next reboot of the system):

mount -t debugfs none /sys/kernel/debug

To make the change permanent, add the following line to the /etc/fstab file:

debugfs    /sys/kernel/debug      debugfs  defaults  0 0

Once you enable debugfs, you can see multiple directories inside /sys/kernel/debug:

[root@sandbox ~]# ls /sys/kernel/debug
bdi    boot_params  dynamic_debug  gpio  kprobes  sched_features  usb  xen
block  dma_buf      extfrag        hid   mce      tracing         x86

These files hold information about various kernel subsystems, which helps in debugging.
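
To confirm that debugfs is actually mounted, you can simply check the mount table, for example:

# if debugfs is mounted you should see a line similar to:
# none on /sys/kernel/debug type debugfs (rw)
mount | grep debugfs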