Tuesday 8 December 2015

How to enable Log Aggregation in Yarn

Log aggregation is a feature of YARN that centralizes the management of logs from all NodeManager nodes. Once a container (task) finishes, its logs are aggregated and uploaded to HDFS. If you are getting a message similar to “Log Aggregation not enabled”, you can follow the steps below to enable it. Add the following configuration to the yarn-site.xml on all YARN hosts and restart the NodeManagers.
<property>
    <description>Whether to enable log aggregation.</description>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
</property>

<property>
    <description>How long to keep aggregation logs before deleting them.</description>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>259200</value>
</property>

<property>
    <description>How long to wait between aggregated log retention checks. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. </description>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
</property>
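
Once log aggregation is enabled, the logs of finished applications are stored in HDFS under the directory configured by yarn.nodemanager.remote-app-log-dir (/tmp/logs above), and can be fetched with the yarn logs command. The application id below is only a placeholder; use the id printed when your job was submitted:

yarn logs -applicationId application_1449500000000_0001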

Thursday 8 October 2015

Compressing Hive Data

To reduce the amount of disk space Hive queries use, you should enable Hive compression codecs. There are two places where you can enable compression in Hive: during intermediate processing, and while writing the output of a Hive query to an HDFS location. There are several compression codecs you can use with Hive, for example bzip2, 4mc, snappy, lzo, lz4 and gzip. Each one has its own advantages and drawbacks. The codecs and their classes are:
  • gzip org.apache.hadoop.io.compress.GzipCodec
  • bzip2 org.apache.hadoop.io.compress.BZip2Codec
  • lzo com.hadoop.compression.lzo.LzopCodec
  • lz4 org.apache.hadoop.io.compress.Lz4Codec
  • Snappy org.apache.hadoop.io.compress.SnappyCodec
  • 4mc com.hadoop.compression.fourmc.FourMcCodec
By default, the DEFLATE codec is configured in most Hadoop installations.

How to enable Intermediate Compression:

The contents of the intermediate files between jobs can be compressed with the following property in the hive-site.xml file.
<property>
   <name>hive.exec.compress.intermediate</name>
   <value>true</value>
</property>
The compression codec can be specified in mapred-site.xml, in hive-site.xml, or per Hive session.
<property>
   <name>mapred.map.output.compression.codec</name>
   <value>com.hadoop.compression.fourmc.FourMCHighCodec</value>
</property>
This compression only saves disk space for intermediate files, so it helps mainly when a query spawns multiple MapReduce stages.
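
If you prefer not to edit the XML files, the same settings can be applied per session from the Hive CLI. A minimal sketch, reusing the 4mc codec from the example above (any installed codec class will do):
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.map.output.compression.codec=com.hadoop.compression.fourmc.FourMCHighCodec;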

How to enable Hive Output Compression:

When the hive.exec.compress.output property is set to true, Hive uses the codec configured by the mapred.output.compression.codec property to compress the data it stores in HDFS.
<property>
   <name>hive.exec.compress.output</name>
   <value>true</value>
</property>
4mc produces splittable output, so the compressed files can still be consumed efficiently as input by subsequent MapReduce jobs.
<property>
   <name>mapred.output.compression.codec</name>
   <value>com.hadoop.compression.fourmc.FourMCHighCodec</value>
</property>
Users can always enable or disable this per query in the Hive session. If it is enabled, anything reading the output from HDFS needs an extra decompression step. These properties can be set in hive-site.xml or in the Hive session via the Hive command line interface.
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=com.hadoop.compression.fourmc.FourMCHighCodec;
For comparison, here is a listing of the files before and after enabling 4mc compression.
chetna.chaudhari@fdphadoop-cc-gw-0001:~$hadoop fs -ls /user/hive/warehouse/raw
Found 2 items
-rw-r--r-- 3 chetna.chaudhari hdfs 267628173 2015-09-29 20:48 /user/hive/warehouse/raw/000000_0
-rw-r--r-- 3 chetna.chaudhari hdfs 38765577 2015-09-29 20:48 /user/hive/warehouse/raw/000001_0
After creating a table called fourmc with output compression enabled, the files in HDFS have the .4mc extension. Hive queries decode the compressed data transparently, so this is invisible to the user running queries.
chetna.chaudhari@fdphadoop-cc-gw-0001:~$hadoop fs -ls /user/hive/warehouse/fourmc
Found 2 items
-rw-r--r-- 3 chetna.chaudhari hdfs 26178373 2015-09-29 21:05 /user/hive/warehouse/fourmc/000000_0.4mc
-rw-r--r-- 3 chetna.chaudhari hdfs 3563411 2015-09-29 21:05 /user/hive/warehouse/fourmc/000001_0.4mc
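For reference, a table like fourmc above could be produced with a simple CREATE TABLE ... AS SELECT while output compression is enabled. This is only a sketch and assumes the raw table already exists:
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=com.hadoop.compression.fourmc.FourMCHighCodec;
hive> CREATE TABLE fourmc AS SELECT * FROM raw;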
The conclusion: enabling 4mc compression with Hive significantly reduces disk consumption (roughly 10x in the listing above) and can also reduce the overall processing time of your queries.

Friday 17 July 2015

Error during apt-get update - Can't exec "insserv"

Today, while doing an apt-get update on a box, I faced the following issue:
Setting up initscripts (2.88dsf-34) ...
Can't exec "insserv": No such file or directory at /usr/sbin/update-rc.d line 406.
update-rc.d: error: insserv rejected the script header
dpkg: error processing initscripts (--configure):
subprocess installed post-installation script returned error exit status 255
This was the first time I had seen this issue, so I was curious: what is insserv? The insserv command controls the start and stop order of the services on a Linux system. How did I fix the problem? After digging into the details, I found that the insserv symlink was broken.
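If you hit the same error, a quick sanity check is to see whether the link exists and where it points (the output will vary from system to system):
ls -l /sbin/insserv
ls -l /usr/lib/insserv/insserv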
sudo ln -s /usr/lib/insserv/insserv /sbin/insserv 
The above command fixed the issue.

Sunday 24 May 2015

Do mkdir and cd using a single command?

Most of the time, when you create a new directory, you cd into it to do some work.
chetna.chaudhari@Chetna:~$ mkdir -p /a/b/c
chetna.chaudhari@Chetna:~$ cd /a/b/c
chetna.chaudhari@Chetna:~$ pwd
/a/b/c

How about combining these two commands into a single one? Yes, it's doable. You need to add the following snippet to your .bash_profile file and relaunch your shell.
function mkdircd () { mkdir -p "$@" && eval cd "\"\$$#\""; }
Here "$@" passes all the arguments to mkdir -p, and eval cd "\"\$$#\"" changes into the last argument ($# is the number of arguments, so \$$# expands to the last positional parameter). Now do mkdir and cd using a single command:
chetna.chaudhari@Chetna:~$ mkdircd /x/y/z
chetna.chaudhari@Chetna:~$ pwd
/x/y/z

Friday 17 April 2015

How to add a commit-msg hook for JIRA tracking in git commits.

Having clean and useful commit messages always makes debugging easier. There are many different patterns people follow to maintain a neat git log history. Here is the one I like most: I always link a JIRA issue id with the commit, so that at any point in time I can check the description of the task for which the code change was made.
    To make it compulsory, so that other committers also follow the same commit message convention, you can add a commit-msg hook to your git repository.
Steps to add the hook:
1) cd into your git repository folder.
2) Create the file .git/hooks/commit-msg with the following content.
#!/bin/sh
# Reject the commit if the message (passed to the hook as $1) does not contain a JIRA issue id.
test "" != "$(grep 'JIRA-' "$1")" || {
       echo >&2 "ERROR: JIRA issue number missing in commit message."
       exit 1;
}
3) Since this is a commit-msg hook, you'll need to make it executable.
chmod a+x .git/hooks/commit-msg
Here, replace JIRA with your own project key.

   Just to validate that it is working, try to make a commit without the pattern (JIRA-). You should get the error "ERROR: JIRA issue number missing in commit message."
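For example (JIRA-123 below is just a hypothetical issue id, substitute your own project key):
git commit -m "quick typo fix"                    # rejected by the hook
git commit -m "JIRA-123: fix NPE in the parser"   # accepted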
Enjoy git committing :) :) 

Thursday 9 April 2015

How to enable date timestamp in bash history.

Many times while debugging I wonder: when did I execute this command? Here is a way to show the date and time when listing your bash history.

Without date and timestamp, your history looks like this:
chetna.chaudhari@Chetna:~$ history
ps aux
jps
ls
clear
history

To enable date and timestamp,

chetna.chaudhari@Chetna:~$ export HISTTIMEFORMAT='%F %T '
Here, %F prints the date in yyyy-mm-dd format (equivalent to %Y-%m-%d) and %T prints the time as hours:minutes:seconds (equivalent to %H:%M:%S). Now your history should look like this:
chetna.chaudhari@Chetna:~$ history
1  2015-04-08 19:49:35 ps aux
2  2015-04-08 19:49:35 jps
3  2015-04-08 19:49:35 ls
4  2015-04-08 19:49:35 clear
5  2015-04-08 19:49:36 ps aux | grep sshd
6  2015-04-08 19:49:37 history

To make it permanent, add the export command to your .bash_profile file.
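For example, assuming you use .bash_profile (on some systems the right file may be .bashrc instead):
echo "export HISTTIMEFORMAT='%F %T '" >> ~/.bash_profile
source ~/.bash_profile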

Tuesday 7 April 2015

HDFS - Quota Management

HDFS Quotas:

You can set two types of quotas in HDFS:

a. Space Quota: a hard limit on the number of bytes used by the files under a given directory.

b. Name Quota: a hard limit on the number of file and directory names under a given directory.

Notes:

  • Quotas for space and names are independent of each other
  • File and directory creation fails if creation would cause the quota to be exceeded.
  • Block allocations fail if the quota would not allow a full block to be written.
  • Each replica counts against the space quota. For example, if a user writes a 3GB file with a replication factor of 3, then 9GB is consumed from the quota.
  • The largest quota is Long.MAX_VALUE.

HDFS Quota Operations:

a. Set a Name quota:

ACL: Only admin can perform this operation.

Command: A Hadoop admin can use the following command to set a name quota.

hadoop dfsadmin -setQuota number_of_files path

e.g.

hadoop dfsadmin -setQuota 100 /grid/landing

Explanation: This sets the name quota to 100, which means at most 100 file and directory names (counting /grid/landing itself) can exist in the tree rooted at /grid/landing.

b. Clear a Name quota:

ACL: Only admin can perform this operation.

Command: A Hadoop admin can use the following command to clear a name quota.

hadoop dfsadmin -clearQuota path

e.g.

hadoop dfsadmin -clearQuota /grid/landing 

c. Set Space quota:

ACL: Only admin can perform this operation.

Command: To set a space quota, a Hadoop admin can use the following command:

hadoop dfsadmin -setSpaceQuota size path

e.g.

hadoop dfsadmin -setSpaceQuota 15G /grid/landing

Explanation: This means the user can write up to 5GB of data (5GB * 3 replicas = 15GB) under /grid/landing, assuming a replication factor of 3. Note that the quota is checked per block, not per byte: when a block is allocated, HDFS assumes the entire block will be filled. For example, if the path /projects/ingestion has a space quota of 50 MB and someone writes a 10 MB file under it, the write fails with a quota violation, because HDFS charges a full block per replica, i.e. 384 MB (128 MB * 3), instead of the actual 30 MB (10 MB * 3).

d. Clear Space quota:

ACL: Only admin can perform this operation.

Command: To clear a space quota, an admin can use the following command:

hadoop dfsadmin -clearSpaceQuota path

e.g.

hadoop dfsadmin -clearSpaceQuota /grid/landing

e. Get quota allocation of Path:

ACL: Anyone can check the quota allocation of a path.

Command: Use the following command to check the quota allocation of a path:

hadoop fs -count -q path

e.g.

hadoop fs -count -q /grid

Explanation: The above command gives output in the following format:

hadoop fs -count -q /grid
9223372036854775807 9223372036854775333 none  inf  141  333 655855032 /grid

where,

column1 --> name quota: a total of 9223372036854775807 names (files and directories) can be created
column2 --> remaining name quota: 9223372036854775333 more names can still be added
column3 --> space quota ("none" means no space quota is set)
column4 --> remaining space quota ("inf" since there is no space quota)
column5 --> number of directories
column6 --> number of files
column7 --> content size (total size of the data in bytes)
column8 --> path
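
Putting it all together, a typical sequence for a directory (reusing the example path from above; the values are only illustrative) could be:

hadoop dfsadmin -setQuota 100 /grid/landing
hadoop dfsadmin -setSpaceQuota 15G /grid/landing
hadoop fs -count -q /grid/landing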