Playing with functions in hive

Apache Hive has built in functions which can be listed with


to play with concat we will run the following script.

-- Use testdb
use testdb;
-- describe concat function
-- describe table mytable2
DESC mytable2;
-- Perform select query uniting fname and lname
SELECT CONCAT(fname,' ',lname) FROM mytable2;

We can execute with beeline or hive. We will use beeline.

user@computer:$ beeline -u jdbc:hive2://localhost:10000 -f Documents/concat.hql --verbose=false --showWarnings=false
scan complete in 8ms
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 0.13.1-cdh5.2.0)
Driver: Hive JDBC (version 0.13.1-cdh5.2.0)
0: jdbc:hive2://localhost:10000> use testdb;
No rows affected (0.104 seconds)
0: jdbc:hive2://localhost:10000> -- Describe concat function
0: jdbc:hive2://localhost:10000> DESC FUNCTION concat;
| tab_name |
| concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN or concat(bin1, bin2, ... binN) - returns the concatenation of bytes in binary data bin1, bin2, ... binN |
1 row selected (0.162 seconds)
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000> DESC mytable2;
| col_name | data_type | comment |
| id | int | |
| lname | string | |
| fname | string | |
3 rows selected (0.133 seconds)
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000> SELECT CONCAT(fname,' ',lname) FROM mytable2;
| _c0 |
| John Doe |
| William Lancaster |
| Burp Gentoo |
3 rows selected (18.848 seconds)
0: jdbc:hive2://localhost:10000>
Closing: 0: jdbc:hive2://localhost:10000

We can also play with functions from inside hive cli as shown below with the sqrt function.

hive> DESC function sqrt;
sqrt(x) - returns the square root of x
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive> SELECT SQRT(64);
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201710131004_0373, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201710131004_0373
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201710131004_0373
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-12-06 12:33:16,450 Stage-1 map = 0%, reduce = 0%
2017-12-06 12:33:23,476 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.58 sec
2017-12-06 12:33:28,497 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.58 sec
MapReduce Total cumulative CPU time: 1 seconds 580 msec
Ended Job = job_201710131004_0373
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.58 sec HDFS Read: 273 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 580 msec
Time taken: 18.894 seconds, Fetched: 1 row(s)

More info about Hive functions here.

Run httpd with docker

So below is the script:


echo "Running httpd with docker."

docker run   --rm -v "$PWD":/usr/local/apache2/htdocs  httpd

We use the following options:
-v, –volume list Bind mount a volume
–rm Automatically remove the container when it exits

Quite simple, right? Execution below.

user@computer:$ bash ~/docker/
Running httpd with docker.
AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using Set the 'ServerName' directive globally to suppress this message
AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using Set the 'ServerName' directive globally to suppress this message
[Thu Aug 17 22:22:55.249981 2017] [mpm_event:notice] [pid 1:tid 140029488904064] AH00489: Apache/2.4.27 (Unix) configured -- resuming normal operations
[Thu Aug 17 22:22:55.250079 2017] [core:notice] [pid 1:tid 140029488904064] AH00094: Command line: 'httpd -D FOREGROUND'

And we proceed to test.

user@computer:$ curl
First page
Testing docker httpd

We are getting below index.html because we are mapping /tmp/httpd (current $PWD) to /usr/local/apache2/htdocs. In /tmp/httpd we created an example index.html as shown above.
More info here

Run python script with docker

So I started playing with docker and asked myself whether it would be possible to run a python script with docker. Well, answer is yes. Example of the script below.


import sys
print "Running script!!"
print sys.version_info

Execution below:

user@computer:$ docker run -it --rm --name pythonscript -v "$PWD":/usr/src/myapp -w /usr/src/myapp python:2 python
Running script!!
sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)

Docker documentation

Listing namenodes and datanodes in Hadoop

Ever wondered how to list Hadoop namenodes? Quite easy as seen below.

user@computer:$ hdfs getconf -namenodes

Now if you want to list the datanodes we do that with dfsadmin.

user@computer:$ hdfs dfsadmin -printTopology
Rack: /default ( ( ( ( ( ( ( ( ( (

Above command should be executed as a user with hdfs superuser permissions.

Apache Flume to write web server logs to Hadoop

In this post we will use flume to dump Apache webserver logs into HDFS. We already have a web server running and flume installed, but we need to configure a target and a source.

We use the following file as target.

## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

collector.sources = AvroIn  
collector.sources.AvroIn.type = avro  
collector.sources.AvroIn.bind =  
collector.sources.AvroIn.port = 4545  
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2


collector.channels.mc1.type = memory  
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory  
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
collector.sinks.LocalOut.type = file_roll = /var/log/flume-ng  
collector.sinks.LocalOut.sink.rollInterval = 0 = mc1

## Write to HDFS
collector.sinks.HadoopOut.type = hdfs = mc2  
collector.sinks.HadoopOut.hdfs.path = /user/training/flume/events/%{log_type}/%y%m%d  
collector.sinks.HadoopOut.hdfs.fileType = DataStream  
collector.sinks.HadoopOut.hdfs.writeFormat = Text  
collector.sinks.HadoopOut.hdfs.rollSize = 0  
collector.sinks.HadoopOut.hdfs.rollCount = 10000  
collector.sinks.HadoopOut.hdfs.rollInterval = 600

Continue reading