Tag Archives: hadoop

Create a new table with Apache Hive

We are going to create a new table with Apache Hive from a previous one, populate it and then perform a UNION ALL of both tables. Below is the script that will create the new table.

-- Below script creates a new table
USE testdb;
-- show current tables
SHOW tables;
-- describe mytable2, table we will use to create mytable4
DESCRIBE mytable2;
-- create new table copying format from mytable2
CREATE TABLE mytable4 LIKE mytable2 ;

SHOW tables;
-- describe newly created table
DESCRIBE mytable4;
-- select content from newly created table
SELECT * FROM mytable4;

We proceed executing via hive in a linux shell.

hive    -f  create-new-table.hql

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Time taken: 0.898 seconds
Time taken: 0.206 seconds, Fetched: 3 row(s)
id                  	int                 	                    
lname               	string              	                    
fname               	string              	                    
Time taken: 0.263 seconds, Fetched: 3 row(s)
Time taken: 0.272 seconds
Time taken: 0.043 seconds, Fetched: 4 row(s)
id                  	int                 	                    
lname               	string              	                    
fname               	string              	                    
Time taken: 0.166 seconds, Fetched: 3 row(s)
Time taken: 0.666 seconds

Continue reading

Listing namenodes and datanodes in Hadoop

Ever wondered how to list Hadoop namenodes? Quite easy as seen below.

hdfs getconf -namenodes
hadoop02.mydomain.com hadoop01.mydomain.com

Now if you want to list the datanodes we do that with dfsadmin.

hdfs dfsadmin  -printTopology
Rack: /default (hadoop15.mydomain.com) (hadoop16.mydomain.com) (hadoop17.mydomain.com) (hadoop18.mydomain.com) (hadoop19.mydomain.com) (hadoop20.mydomain.com) (hadoop21.mydomain.com) (hadoop22.mydomain.com) (hadoop23.mydomain.com) (hadoop24.mydomain.com)

Above command should be executed as a user with hdfs superuser permissions.

Apache Flume to write web server logs to Hadoop

In this post we will use flume to dump Apache webserver logs into HDFS. We already have a web server running and flume installed, but we need to configure a target and a source.

We use the following file as target.

## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

collector.sources = AvroIn  
collector.sources.AvroIn.type = avro  
collector.sources.AvroIn.bind =  
collector.sources.AvroIn.port = 4545  
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2


collector.channels.mc1.type = memory  
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory  
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
collector.sinks.LocalOut.type = file_roll  
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng  
collector.sinks.LocalOut.sink.rollInterval = 0  
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
collector.sinks.HadoopOut.type = hdfs  
collector.sinks.HadoopOut.channel = mc2  
collector.sinks.HadoopOut.hdfs.path = /user/training/flume/events/%{log_type}/%y%m%d  
collector.sinks.HadoopOut.hdfs.fileType = DataStream  
collector.sinks.HadoopOut.hdfs.writeFormat = Text  
collector.sinks.HadoopOut.hdfs.rollSize = 0  
collector.sinks.HadoopOut.hdfs.rollCount = 10000  
collector.sinks.HadoopOut.hdfs.rollInterval = 600

Continue reading