Sunday, 28 February 2016

HDFS : Components Overview

In this article I explain the main components of the Hadoop Distributed File System (HDFS) and their responsibilities.

1) Namenode:

  • A single server that stores the file system metadata (the namespace and inode details).
  • The namespace is the hierarchy of files and directories, represented using inodes.
  • An inode holds information such as permissions, modification and access times, name, disk space quota, name quota, blocks etc.
  • It maintains the mapping of file blocks to the list of datanodes where the actual data is located.
  • It keeps the entire namespace in memory.
  • The namenode persists the namespace metadata in two files, namely
    • fsimage (the checkpoint of the namespace)
    • edits (the journal of changes made since that checkpoint).

2) Datanode:

  • Stores a series of named blocks, where each block replica is represented by two files:
    • the first file contains the data itself
    • the second file contains the block's metadata, such as checksums of the data and the generation stamp.
  • Allows clients to read, write and delete block data.
  • On the namenode's instruction, it also copies blocks from one datanode to another.
  • On startup, a datanode sends a full block report to the namenode; afterwards it periodically sends incremental block reports.
  • It periodically (every 3 seconds by default) sends a heartbeat to the namenode.
  • The namenode marks a datanode as dead if it does not receive a heartbeat from it for 10 minutes (configurable).
  • Along with the heartbeat, a datanode sends information about its
    • total storage capacity
    • storage capacity in use
    • number of data transfers currently in progress.
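
The heartbeat bookkeeping described above can be pictured with a small Python sketch. This is only a toy model, not the actual HDFS code: the names Heartbeat, HeartbeatMonitor, receive and dead_nodes are made up for illustration, and the two constants simply restate the 3 second and 10 minute figures from the list.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 3        # heartbeat period mentioned above
DEAD_NODE_TIMEOUT_S = 10 * 60   # dead-node timeout mentioned above


@dataclass
class Heartbeat:
    """Hypothetical payload: the three values a datanode reports with a heartbeat."""
    node_id: str
    capacity_bytes: int
    used_bytes: int
    active_transfers: int


class HeartbeatMonitor:
    """Toy stand-in for the namenode's bookkeeping (illustrative only)."""

    def __init__(self, timeout_s: float = DEAD_NODE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node_id -> time of the last heartbeat, in seconds
        self.stats = {}      # node_id -> most recent Heartbeat payload

    def receive(self, heartbeat: Heartbeat, now: float) -> None:
        """Record a heartbeat that arrived at time `now`."""
        self.last_seen[heartbeat.node_id] = now
        self.stats[heartbeat.node_id] = heartbeat

    def dead_nodes(self, now: float) -> list:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Time is passed in explicitly as a number of seconds, so the behaviour is easy to follow by hand: a node that last reported at second 0 is listed by dead_nodes once `now` goes past 600.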

3) Checkpoint Node:

  • A node that periodically combines the existing checkpoint and the journal to create a new checkpoint and an empty journal.
  • It downloads the current checkpoint and journal files from the namenode, merges the two locally and finally returns the new checkpoint to the namenode.
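
The merge step can be sketched in Python as replaying the journal on top of the old checkpoint. This is a simplified illustration under my own assumptions: the namespace is modelled as a plain set of paths, the journal as a list of ("create" or "delete", path) pairs, and the function names apply_journal and create_checkpoint are hypothetical, not HDFS APIs.

```python
def apply_journal(checkpoint: frozenset, journal: list) -> frozenset:
    """Replay journal entries on top of a checkpoint and return the new namespace."""
    namespace = set(checkpoint)  # work on a copy; the old checkpoint stays intact
    for operation, path in journal:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown journal operation: {operation}")
    return frozenset(namespace)


def create_checkpoint(checkpoint: frozenset, journal: list) -> tuple:
    """Merge checkpoint and journal; return the new checkpoint and an empty journal."""
    return apply_journal(checkpoint, journal), []
```

For example, merging the checkpoint {"/data", "/data/a.txt"} with a journal that creates "/data/b.txt" and deletes "/data/a.txt" gives the new checkpoint {"/data", "/data/b.txt"} and an empty journal.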

4) Backup Node:

  • It can be considered a read-only namenode.
  • It maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
  • It is always ready to accept the journal stream of namespace transactions from the active NameNode. It saves them in the journal on its own storage directories, and then applies these transactions to its own namespace image in memory.
  • The NameNode treats the BackupNode as journal storage, in the same way as it treats the journal files in its own storage directories.
  • If the NameNode fails for any reason, the BackupNode's image in memory and the checkpoint on disk are a record of the latest namespace state.

5) HDFS Client:

  • The client through which applications perform file system operations such as
    • read, write and delete files
    • create and delete directories

I will cover the details of each operation in upcoming posts.

Friday, 19 February 2016

How to add auxiliary Jars in Hive

We often need to add auxiliary (third-party) jars to the Hive classpath in order to use them. The auxiliary jars I use most often are serde, dim lookup and 4mc.

There are different ways to achieve this.

1) Hive Server Config (hive-site.xml):

Edit your hive-site.xml config and add the following property to it.

<property>
    <name>hive.aux.jars.path</name>
    <value>comma separated list of jar paths</value>
</property>

Example:

<property>
    <name>hive.aux.jars.path</name>
    <value>/usr/share/dimlookup.jar,/usr/share/serde.jar</value>
</property>

You will need to restart the Hive server for this property to take effect.

2) Hive CLI --auxpath option:

You can pass a comma-separated list of auxiliary jar paths when launching the Hive shell.

Example:

hive --auxpath /usr/share/dimlookup.jar,/usr/share/serde.jar

3) Hive CLI add jar command:

From within the Hive shell, you can add a jar using

add jar jar_path;

Example:

add jar /usr/share/serde.jar;
add jar /usr/share/dimlookup.jar;

4) HIVE_AUX_JARS_PATH environment variable:

export HIVE_AUX_JARS_PATH=/usr/share/serde.jar

5) .hiverc:

You can put all your add jar statements in the .hiverc file in your home directory or in the Hive config directory, so that they take effect every time the Hive CLI is launched.

Tuesday, 26 January 2016

How to setup cron job for last day of month

You can schedule any command or script to run at a given time, or repeatedly, with the help of the Linux utility 'cron'. To add a cron job, run the command 'crontab -e'; this opens a file with your existing crontab entries, if any.

The format of a cron line is like below:

MIN HOUR DOM MON DOW CMD
Here,
MIN - Minute field, which can have values between 0 - 59
HOUR - Hour field, which can have values between 0 - 23
DOM - Day of the Month field, values 1 - 31
MON - Month field, values 1 - 12
DOW - Day of the Week field, values 0 - 6 (0 is Sunday)
CMD - Command field, where you put the command you want to schedule.
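
As a quick sanity check of these ranges, here is a minimal Python sketch that flags time fields falling outside them. It is deliberately simplified and is not how cron itself parses entries: it only understands '*', plain numbers, 'a-b' ranges and comma-separated lists, and the names FIELD_RANGES, field_ok and invalid_fields are my own.

```python
# (field name, lowest allowed value, highest allowed value), in crontab order
FIELD_RANGES = [
    ("MIN", 0, 59),
    ("HOUR", 0, 23),
    ("DOM", 1, 31),
    ("MON", 1, 12),
    ("DOW", 0, 6),
]


def field_ok(value: str, low: int, high: int) -> bool:
    """Accept '*', a number, a range 'a-b', or a comma-separated mix of those."""
    if value == "*":
        return True
    for part in value.split(","):
        bounds = part.split("-")
        if len(bounds) > 2 or not all(b.isdigit() for b in bounds):
            return False
        numbers = [int(b) for b in bounds]
        if numbers[0] > numbers[-1]:
            return False
        if numbers[0] < low or numbers[-1] > high:
            return False
    return True


def invalid_fields(cron_line: str) -> list:
    """Return the names of the time fields whose values are out of range."""
    fields = cron_line.split(None, 5)[:5]
    if len(fields) < 5:
        raise ValueError("expected at least five time fields")
    return [
        name
        for (name, low, high), value in zip(FIELD_RANGES, fields)
        if not field_ok(value, low, high)
    ]
```

With this sketch, an entry whose DOM field is 28-32 is reported as having an invalid DOM field, because 32 is outside 1 - 31.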

For example, to run '/home/chetna/gather_stats.sh' on 27 January at 2:30 pm, you can add a cron entry like the one below:

30 14 27 01 * /home/chetna/gather_stats.sh

So, how do you write a cron entry that runs on the last day of the month? The problem is that there is no single number to put in the DOM field, as a month can have 28, 29, 30 or 31 days.
Consider this entry, for example:

59 23 28-31 * * /home/chetna/gather_stats.sh

This will run the script on the 28th, 29th, 30th and 31st of every month at 11:59 pm. But if a month has 31 days, say January, we don't want to run the script on the other 3 days; it should run only on 31 January.
Whatever the length of the month, the day after its last day is always the 1st. So we can use 'date' to check whether tomorrow is the 1st and, if so, run the script.
This is how I achieved the task:

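59 23 28-31 * * [ "$(date +\%d -d tomorrow)" = "01" ] && /home/chetna/gather_stats.sh

Here,

date +%d -d tomorrow
gives tomorrow's day of the month as a two-character string (the -d option requires GNU date), so we can check whether it matches "01". Inside a crontab entry the % character has a special meaning, which is why it is written as \% there.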
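
The same "is tomorrow the 1st?" check can also be written in Python, which avoids depending on a particular version of 'date'. This is just a minimal sketch of the idea; the function name is_last_day_of_month is my own.

```python
from datetime import date, timedelta


def is_last_day_of_month(day: date) -> bool:
    """A day is the last day of its month exactly when the next day is the 1st."""
    return (day + timedelta(days=1)).day == 1
```

For instance, is_last_day_of_month(date(2016, 2, 29)) is True because 2016 is a leap year, while is_last_day_of_month(date(2016, 2, 28)) is False.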