Showing posts with label Hive. Show all posts
Showing posts with label Hive. Show all posts

Sunday, 28 January 2018

How to split a string on first occurrence of character in Hive.

In this article we will see how to split a string in hive on first occurrence of a character. Lets say, you have strings like aplfinancereporting or org_namespace . Where you want to split by org (i.e string before first occurrence of '_') or namespace (string after '_').

hive> create table testSplit(namespace string);
hive> insert into table testSplit values ("scp_apl_finance");
hive> insert into table testSplit values ("apl_finance_reporting");
hive> select namespace from testSplit;
OK
scp_apl_finance
apl_finance_reporting
Time taken: 0.118 seconds, Fetched: 2 row(s)
hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 0)  from testSplit;
OK
scp_apl_finance
apl_finance_reporting
Time taken: 0.064 seconds, Fetched: 2 row(s)

To get list of all orgs we can execute following query:

hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 1)  from testSplit;
OK
scp
apl
Time taken: 0.056 seconds, Fetched: 2 row(s)

And to get list of all namespaces, use following one:

hive> select regexp_extract(namespace, '^(.*?)(?:_)(.*)$', 2)  from testSplit;
OK
apl_finance
finance_reporting
Time taken: 0.066 seconds, Fetched: 2 row(s)

Friday, 19 February 2016

How to add auxiliary Jars in Hive

Many times we need to add auxiliary (3rd party) jars in hive class path to make use of them. Some of the auxiliary jars which I use most of the times like serde , dim lookup or 4mc.

There are different ways to achieve this.

1) Hive Server Config (hive-site.xml):

Modify your hive-site.xml config and add following property to it.

<property>
    <name>hive.aux.jars.path</name>
    <value>comma separated list of jar paths</value>
</property>

Example:

<property>
    <name>hive.aux.jars.path</name>
    <value>/usr/share/dimlookup.jar,/usr/share/serde.jar</value>
 </property>

You will need to restart hive server, so that these properties take effect.

2) Hive-Cli –auxpath option:

You can mention the comma separated list of auxiliary jars path while launching hive shell.

Example.

hive --auxpath  /usr/share/dimlookup.jar,/usr/share/serde.jar

3) Hive Cli add jar command:

You can add jar using

add jar jar_path;

Example:

add jar /usr/share/serde.jar;
add jar /usr/share/dimlookup.jar;

4) Add in HIVEAUXJARS_PATH environment variable:

export HIVE_AUX_JARS_PATH=/usr/share/serde.jar

5) .hiverc:

You can add all your add jars statements to .hiverc file in your home / hive config directory. So that they take effect on hive-cli launch.

Thursday, 8 October 2015

Compressing Hive Data

To reduce the amount of disk space hive query uses, you should enable hive compression codecs. There are two places where you can enable compression in hive one is during intermediate processing  and other is while writing the output of hive query to hdfs location. There are different compression codecs which you can use with hive for e.g. bzip2, 4mc, snappy, lzo, lz4 and gzip.Each one has their own drawbacks and advantages. Following are the codecs
  • gzip org.apache.hadoop.io.compress.GzipCodec
  • bzip2 org.apache.hadoop.io.compress.BZip2Codec
  • lzo com.hadoop.compression.lzo.LzopCodec
  • lz4 org.apache.hadoop.io.compress.Lz4Codec
  • Snappy org.apache.hadoop.io.compress.SnappyCodec
  • 4mc com.hadoop.compression.fourmc.FourMcCodec
By default DEFLATE codec is set in most of hadoop configurations.

How to enable Intermediate Compression:

The contents of the intermediate files between jobs can be compressed with the following property in the hive-site.xml file.
<property>
   <name>hive.exec.compress.intermediate</name>
   <value>true</value>
</property>
The compression codec can be specified in either the mapred-site.xml, hive-site.xml, or for the hive session.
<property>
   <name>mapred.map.output.compression.codec</name>
   <value>com.hadoop.compression.fourmc.FourMCHighCodec</value>
</property>
This compression will only save disk space for intermediate files in case of multiple map reduce operations.

How to enable Hive Output Compression:

When the hive.exec.compress.output property is set to true, Hive will use the codec configured by the mapred.map.output.compression.codec property to compress the storage in HDFS.
<property>
   <name>hive.exec.compress.output</name>
   <value>true</value>
</property>
The 4mc compression can be compressed into separatable blocks so it can still be used as input efficiently for subsequent map reduce jobs.
<property>
   <name>mapred.output.compression.codec</name>
   <value>com.hadoop.compression.fourmc.FourMCHighCodec</value>
</property>
Users can always enable or disable this in the Hive session for each queries.  If it is enabled there will be an extra step to extract data from HDFS. These properties can be set in the hive.site.xml or in the Hive session via the Hive command line interface.
hive>set hive.exec.compress.output = true;
hive>set mapred.output.compression.codec= com.hadoop.compression.fourmc.FourMCHighCodec;
Here are more details around listing files before and after enabling 4mc compression.
chetna.chaudhari@fdphadoop-cc-gw-0001:~$hadoop fs -ls /user/hive/warehouse/raw
Found 2 items
-rw-r--r-- 3 chetna.chaudhari hdfs 267628173 2015-09-29 20:48 /user/hive/warehouse/raw/000000_0
-rw-r--r-- 3 chetna.chaudhari hdfs 38765577 2015-09-29 20:48 /user/hive/warehouse/raw/000001_0
After creating a table called fourmc, there will be files in hdfs that have the .4mc extension.  Hive queries will be able to decode the compressed data and this will be transparent to the user that is running queries.
chetna.chaudhari@fdphadoop-cc-gw-0001:~$hadoop fs -ls /user/hive/warehouse/fourmc
Found 2 items
-rw-r--r-- 3 chetna.chaudhari hdfs 26178373 2015-09-29 21:05 /user/hive/warehouse/fourmc/000000_0.4mc
-rw-r--r-- 3 chetna.chaudhari hdfs 3563411 2015-09-29 21:05 /user/hive/warehouse/fourmc/000001_0.4mc
The conclusion is that, if you enable fourmc compression with hive, you can reduce the overall processing time of your query along with less disk consumption.