Sunday, 28 January 2018

HDFS Metadata: Namenode

In this article I will explain how the namenode maintains its metadata in the directory configured by dfs.namenode.name.dir in the hdfs-site.xml file. In my case dfs.namenode.name.dir is set to /hadoop/hdfs/namenode. So let's start by listing this directory.

ls -1 /hadoop/hdfs/namenode
current
in_use.lock

There are two entries:

in_use.lock:

This is a lock file held by the namenode process. It prevents concurrent modification of the directory by multiple namenode processes.

current: This is a directory. Let's list its contents:

ls -1 /hadoop/hdfs/namenode/current
VERSION
edits_0000000000000138313-0000000000000138314
edits_0000000000000138315-0000000000000138316
edits_0000000000000138317-0000000000000138318
edits_0000000000000138319-0000000000000138320
edits_0000000000000138321-0000000000000138322
edits_0000000000000138323-0000000000000138324
edits_inprogress_0000000000000138325
fsimage_0000000000000137650
fsimage_0000000000000137650.md5
fsimage_0000000000000138010
fsimage_0000000000000138010.md5
seen_txid

There are a lot of files; let's explore them one by one.

VERSION:

This is a storage information file with the following content:

#Wed Dec 02 13:16:31 IST 2015
namespaceID=2109784471
clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1469059006-127.0.0.1-1449042391563
layoutVersion=-63

You can refer to org.apache.hadoop.hdfs.server.common.StorageInfo.java and org.apache.hadoop.hdfs.server.common.Storage.java for more information.

namespaceID:

A unique namespace identifier assigned to the file system when HDFS is formatted. It is stored on all nodes of the cluster, and a datanode with a different namespaceID is not allowed to join the cluster.

clusterID:

It identifies the cluster and has to be unique during the lifetime of the cluster. This is important for federated deployments. Introduced in HDFS-1365.

cTime:

Creation time of the file system; this field is updated during HDFS upgrades.

storageType:

storageType can be either NAME_NODE or JOURNAL_NODE (any value of org.apache.hadoop.hdfs.server.common.NodeType except DATA_NODE).

blockpoolID:

Unique identifier of the storage block pool. This is important for federated deployments. Introduced in HDFS-1365.

layoutVersion:

Layout version of the storage data. Whenever new metadata-related features are added to HDFS, this version is changed.

edits_0000000000000abcdef-0000000000000uvwxyz (edits_startTransactionID-endTransactionID):

This file contains all edit log transactions between startTransactionID and endTransactionID. It is a log of every file system change, such as a file creation, deletion, or modification.
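
Edit log files are binary, but you can inspect them with the offline edits viewer (hdfs oev), which converts a segment to XML by default. A quick sketch using one of the finalized segments from the listing above:

hdfs oev -i /hadoop/hdfs/namenode/current/edits_0000000000000138313-0000000000000138314 -o /tmp/edits.xml
less /tmp/edits.xml

Each RECORD element in the XML output describes one transaction: its opcode (such as OP_ADD or OP_DELETE), transaction ID and operation data.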

edits_inprogress_0000000000000138325 (edits_inprogress_startTransactionID):

This is the current edit log file. It contains transactions starting from startTransactionID, and all new transactions are appended to it.
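
You can force the namenode to finalize the in-progress segment and open a new one (an edit log roll) with dfsadmin; a quick sketch, assuming HDFS superuser privileges:

hdfs dfsadmin -rollEdits

After the roll, the edits_inprogress_* file is finalized into an edits_startTransactionID-endTransactionID file and a new in-progress file is started.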

fsimage_0000000000000abcdef (fsimage_endTransactionID):

This file contains a complete image of the file system namespace at a point in time, in this case up to endTransactionID.
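
Like the edit logs, fsimage files are binary; the offline image viewer (hdfs oiv) can dump one into a readable form. A sketch using the newer of the two images from the listing:

hdfs oiv -p XML -i /hadoop/hdfs/namenode/current/fsimage_0000000000000138010 -o /tmp/fsimage.xml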

fsimage_0000000000000abcdef.md5 (fsimage_endTransactionID.md5):

This is the MD5 checksum of the corresponding fsimage_endTransactionID file, used to detect on-disk corruption.
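
The checksum file is written in an md5sum-compatible format, so, assuming the md5sum utility is installed, you can verify an image by hand:

cd /hadoop/hdfs/namenode/current
md5sum -c fsimage_0000000000000138010.md5
fsimage_0000000000000138010: OK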

seen_txid:

The last transaction ID of the last checkpoint or edit log roll. This file is updated whenever the fsimage is merged with the edits files or a new edits file is started, and it is used at startup to verify that no edits are missing.
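
It is a plain text file holding a single transaction ID. Given the listing above, where the in-progress segment starts at transaction 138325, I would expect:

cat /hadoop/hdfs/namenode/current/seen_txid
138325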

HDFS Metadata: Datanode

In this article I will explain how the datanode maintains its metadata in the directory configured by dfs.datanode.data.dir in the hdfs-site.xml file. In my case dfs.datanode.data.dir is set to /hadoop/hdfs/datanode. So let's start by listing this directory.

ls -1 /hadoop/hdfs/datanode
current
in_use.lock

There are two entries:

in_use.lock:

This is a lock file held by the datanode process. It prevents concurrent modification of the directory by multiple datanode processes.

current: This is a directory. Let's do a tree listing on it:

tree current/
current/
|-- BP-1469059006-127.0.0.1-1449042391563
|   |-- current
|   |   |-- VERSION
|   |   |-- finalized
|   |   |   `-- subdir0
|   |   |       `-- subdir0
|   |   |           |-- blk_1073741825
|   |   |           `-- blk_1073741825_1001.meta
|   |   |-- rbw
|   |-- dncp_block_verification.log.curr
|   |-- dncp_block_verification.log.prev
|   `-- tmp
`-- VERSION

There are a lot of files and directories; let's explore them one by one.

VERSION:

This is a storage information file with the following content:

#Wed Dec 02 13:16:39 IST 2015
storageID=DS-c25c62e1-a512-451e-87b2-e9175afca9f4
clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97
cTime=0
datanodeUuid=ad7ecbe4-b4a2-4b52-8146-5240ec849119
storageType=DATA_NODE
layoutVersion=-56

You can refer to org.apache.hadoop.hdfs.server.common.StorageInfo.java and org.apache.hadoop.hdfs.server.common.Storage.java for more information.

storageID:

A unique identifier of a storage directory. In early HDFS releases a single storageID identified the whole datanode; since heterogeneous storage support (HDFS-2832), each storage directory carries its own storageID, and the namenode uses it to track block replicas per storage.

clusterID:

It identifies the cluster and has to be unique during the lifetime of the cluster. This is important for federated deployments. Introduced in HDFS-1365.

cTime:

Creation time of the file system; this field is updated during HDFS upgrades.

datanodeUuid:

Unique identifier of a datanode, introduced in HDFS-5233.

storageType:

It'll be DATA_NODE.

layoutVersion:

Layout version of the storage data. Whenever new metadata-related features are added to HDFS, this version is changed.

BP-randomInteger-NameNodeIpAddress-creationTime:

This is the unique block pool ID: BP stands for Block Pool, followed by a random integer, the IP address of the namenode, and the block pool creation time in milliseconds since the epoch. A block pool is the set of blocks that belongs to a single namespace.
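
For example, the creationTime component of BP-1469059006-127.0.0.1-1449042391563 is 1449042391563 milliseconds; assuming GNU date, you can decode it:

date -d @1449042391
Wed Dec  2 13:16:31 IST 2015

which lines up with the timestamp in the namenode's VERSION file shown earlier.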

finalized:

This directory contains blocks that have been completely written and finalized. Each blk_ file holds raw HDFS block data, and the matching .meta file holds its checksums (the trailing number in the meta file name, 1001 above, is the block's generation stamp).
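
To map an HDFS file to the blk_ files that back it, you can ask fsck; a sketch (replace /path/to/file with a real HDFS path):

hdfs fsck /path/to/file -files -blocks -locations

The block IDs printed by fsck correspond to the blk_ file names under the finalized directory.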

rbw:

This directory contains block replicas that are still being written by an HDFS client; rbw stands for replica being written.

dncp_block_verification.log.*:

This file tracks the last time each block was verified by comparing its contents against its checksum. The file is rolled periodically, so dncp_block_verification.log.curr is the current file and dncp_block_verification.log.prev is the older, rolled file. Background block verification proceeds in ascending order of last verification time.
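
How often each block gets re-verified is governed by dfs.datanode.scan.period.hours in hdfs-site.xml; a sketch that sets it explicitly to the usual default of three weeks (504 hours):

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>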

Saturday, 27 January 2018

HDFS Metadata Configuration Properties

In this article, we will look at the configuration properties that control the behaviour of the HDFS metadata directories. All of these properties belong in the hdfs-site.xml file; the descriptions and default values below are taken from hdfs-default.xml.

dfs.datanode.data.dir

Description:

Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permissions allow.

Default Value:

file://${hadoop.tmp.dir}/dfs/data
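
For example, a sketch of tagging two data directories with storage types (both paths are hypothetical):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]file:///hadoop/hdfs/datanode,[SSD]file:///ssd/hdfs/datanode</value>
</property>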

dfs.namenode.checkpoint.check.period

Description:

The SecondaryNameNode and CheckpointNode will poll the NameNode every dfs.namenode.checkpoint.check.period seconds to query the number of uncheckpointed transactions.

Default Value:

60

dfs.namenode.checkpoint.period

Description:

The number of seconds between two periodic checkpoints.

Default Value:

3600

dfs.namenode.checkpoint.txns

Description:

The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every dfs.namenode.checkpoint.txns transactions, regardless of whether dfs.namenode.checkpoint.period has expired.

Default Value:

1000000
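
Besides these automatic triggers, a checkpoint can be forced by hand; a sketch using dfsadmin (requires HDFS superuser privileges, and -saveNamespace requires safe mode):

hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave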

dfs.namenode.edit.log.autoroll.check.interval.ms

Description:

How often an active namenode will check if it needs to roll its edit log, in milliseconds.

Default Value:

300000

dfs.namenode.edit.log.autoroll.multiplier.threshold

Description:

Determines when an active namenode will roll its own edit log. The actual threshold (in number of edits) is determined by multiplying this value by dfs.namenode.checkpoint.txns. This prevents extremely large edit files from accumulating on the active namenode, which can cause timeouts during namenode startup and pose an administrative hassle. This behavior is intended as a failsafe for when the standby or secondary namenode fail to roll the edit log by the normal checkpoint threshold.

Default Value:

2.0

dfs.namenode.edits.dir

Description:

Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. The default value is the same as dfs.namenode.name.dir.

Default Value:

${dfs.namenode.name.dir}

dfs.namenode.name.dir

Description:

Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.

Default Value:

file://${hadoop.tmp.dir}/dfs/name
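
A sketch of configuring two redundant name directories (the second path, an NFS mount, is hypothetical):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///hadoop/hdfs/namenode,file:///mnt/nfs/hdfs/namenode</value>
</property>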

dfs.namenode.num.checkpoints.retained

Description:

The number of image checkpoint files (fsimage_*) that will be retained by the NameNode and Secondary NameNode in their storage directories. All edit logs (stored on edits_* files) necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.

Default Value:

2

dfs.namenode.num.extra.edits.retained

Description:

The number of extra transactions which should be retained beyond what is minimally necessary for a NN restart. It does not translate directly to file's age, or the number of files kept, but to the number of transactions (here "edits" means transactions). One edit file may contain several transactions (edits). During checkpoint, NameNode will identify the total number of edits to retain as extra by checking the latest checkpoint transaction value, subtracted by the value of this property. Then, it scans edits files to identify the older ones that don't include the computed range of retained transactions that are to be kept around, and purges them subsequently. The retainment can be useful for audit purposes or for an HA setup where a remote Standby Node may have been offline for some time and need to have a longer backlog of retained edits in order to start again. Typically each edit is on the order of a few hundred bytes, so the default of 1 million edits should be on the order of hundreds of MBs or low GBs.
NOTE: Fewer extra edits may be retained than value specified for this setting if doing so would mean that more segments would be retained than the number configured by dfs.namenode.max.extra.edits.segments.retained.

Default Value:

1000000

dfs.namenode.checkpoint.dir

Description:

Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.

Default Value:

file://${hadoop.tmp.dir}/dfs/namesecondary

dfs.namenode.checkpoint.edits.dir

Description:

Determines where on the local filesystem the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories then the edits are replicated in all of the directories for redundancy. The default value is the same as dfs.namenode.checkpoint.dir.

Default Value:

${dfs.namenode.checkpoint.dir}

dfs.namenode.checkpoint.max-retries

Description:

The SecondaryNameNode retries failed checkpointing. If the failure occurs while loading the fsimage or replaying the edits, the number of retries is limited by this variable.

Default Value:

3