Listing namenodes and datanodes in Hadoop

Ever wondered how to list Hadoop namenodes? Quite easy as seen below.

hdfs getconf -namenodes
hadoop02.mydomain.com hadoop01.mydomain.com

Now if you want to list the datanodes we do that with dfsadmin.

hdfs dfsadmin  -printTopology
Rack: /default
   192.168.0.15:50010 (hadoop15.mydomain.com)
   192.168.0.16:50010 (hadoop16.mydomain.com)
   192.168.0.17:50010 (hadoop17.mydomain.com)
   192.168.0.18:50010 (hadoop18.mydomain.com)
   192.168.0.19:50010 (hadoop19.mydomain.com)
   192.168.0.20:50010 (hadoop20.mydomain.com)
   192.168.0.21:50010 (hadoop21.mydomain.com)
   192.168.0.22:50010 (hadoop22.mydomain.com)
   192.168.0.23:50010 (hadoop23.mydomain.com)
   192.168.0.24:50010 (hadoop24.mydomain.com)

Above command should be executed as a user with hdfs superuser permissions.

Apache Flume to write web server logs to Hadoop

Leave a reply

In this post we will use flume to dump Apache webserver logs into HDFS. We already have a web server running and flume installed, but we need to configure a target and a source.

We use the following file as target.

## TARGET AGENT ##  
## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn  
collector.sources.AvroIn.type = avro  
collector.sources.AvroIn.bind = 0.0.0.0  
collector.sources.AvroIn.port = 4545  
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2

#http://flume.apache.org/FlumeUserGuide.html#memory-channel

collector.channels.mc1.type = memory  
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory  
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll  
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng  
collector.sinks.LocalOut.sink.rollInterval = 0  
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs  
collector.sinks.HadoopOut.channel = mc2  
collector.sinks.HadoopOut.hdfs.path = /user/training/flume/events/%{log_type}/%y%m%d  
collector.sinks.HadoopOut.hdfs.fileType = DataStream  
collector.sinks.HadoopOut.hdfs.writeFormat = Text  
collector.sinks.HadoopOut.hdfs.rollSize = 0  
collector.sinks.HadoopOut.hdfs.rollCount = 10000  
collector.sinks.HadoopOut.hdfs.rollInterval = 600

Continue reading →

Export data from HDFS to MySQL

Leave a reply

First create the DB and table where you want to populate.

echo "create database staff2; use staff2; CREATE TABLE editorial (id INT(100) unsigned not null AUTO_INCREMENT, name VARCHAR(20), email VARCHAR(20), primary key (id));" | mysql -u root -p

Once done, we have the data we want to copy in HDFS.

hdfs dfs -cat /home/training/staff/editorial/part-m-*
1,Peter,peter@example.com
2,Jack,jack@example.com

Now dump into MySQL using sqoop.

sqoop export --connect jdbc:mysql://localhost/staff2 --username root -P --table editorial --export-dir /home/training/staff/editorial
17/02/27 12:51:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
17/02/27 12:51:58 INFO manager.SqlManager: Using default fetchSize of 1000
17/02/27 12:51:58 INFO tool.CodeGenTool: Beginning code generation
17/02/27 12:51:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `editorial` AS t LIMIT 1
17/02/27 12:51:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `editorial` AS t LIMIT 1
17/02/27 12:51:59 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/e560499b42a9738bbc5ef127712adc7b/editorial.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/02/27 12:52:03 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/e560499b42a9738bbc5ef127712adc7b/editorial.jar
17/02/27 12:52:03 INFO mapreduce.ExportJobBase: Beginning export of editorial
17/02/27 12:52:06 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/02/27 12:52:08 INFO input.FileInputFormat: Total input paths to process : 2
17/02/27 12:52:08 INFO input.FileInputFormat: Total input paths to process : 2
17/02/27 12:52:09 INFO mapred.JobClient: Running job: job_201702221239_0006
17/02/27 12:52:10 INFO mapred.JobClient:  map 0% reduce 0%
17/02/27 12:52:31 INFO mapred.JobClient:  map 50% reduce 0%
17/02/27 12:52:45 INFO mapred.JobClient:  map 100% reduce 0%
17/02/27 12:52:49 INFO mapred.JobClient: Job complete: job_201702221239_0006
17/02/27 12:52:49 INFO mapred.JobClient: Counters: 24
17/02/27 12:52:49 INFO mapred.JobClient:   File System Counters
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of bytes read=0
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of bytes written=1176756
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of read operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of large read operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of write operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of bytes read=759
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of bytes written=0
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of read operations=19
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of large read operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of write operations=0
17/02/27 12:52:49 INFO mapred.JobClient:   Job Counters 
17/02/27 12:52:49 INFO mapred.JobClient:     Launched map tasks=4
17/02/27 12:52:49 INFO mapred.JobClient:     Data-local map tasks=4
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=64216
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/02/27 12:52:49 INFO mapred.JobClient:   Map-Reduce Framework
17/02/27 12:52:49 INFO mapred.JobClient:     Map input records=2
17/02/27 12:52:49 INFO mapred.JobClient:     Map output records=2
17/02/27 12:52:49 INFO mapred.JobClient:     Input split bytes=661
17/02/27 12:52:49 INFO mapred.JobClient:     Spilled Records=0
17/02/27 12:52:49 INFO mapred.JobClient:     CPU time spent (ms)=3390
17/02/27 12:52:49 INFO mapred.JobClient:     Physical memory (bytes) snapshot=422584320
17/02/27 12:52:49 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2940895232
17/02/27 12:52:49 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
17/02/27 12:52:49 INFO mapreduce.ExportJobBase: Transferred 759 bytes in 42.9426 seconds (17.6748 bytes/sec)
17/02/27 12:52:49 INFO mapreduce.ExportJobBase: Exported 2 records.

Now we can see the content in MySQL DB named staff2.

echo "use staff2; SELECT * FROM editorial;" | mysql -u root -p 
Enter password: 
id	name	email
1	Peter	peter@example.com
2	Jack	jack@example.com

Using sqoop to import a DB table into HDFS

Leave a reply

In the world of Big Data to import data from a DB into HDFS you need Apache Sqoop.

sqoop import --connect jdbc:mysql://localhost/mysql  --username training  -P --warehouse-dir /home/training/db --table user
17/02/23 10:38:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
17/02/23 10:38:24 INFO manager.SqlManager: Using default fetchSize of 1000
17/02/23 10:38:24 INFO tool.CodeGenTool: Beginning code generation
17/02/23 10:38:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1
17/02/23 10:38:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1
17/02/23 10:38:24 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/7f3a9709c50f58c2c6bb24de91922c6b/user.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/02/23 10:38:29 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/7f3a9709c50f58c2c6bb24de91922c6b/user.jar
17/02/23 10:38:29 WARN manager.MySQLManager: It looks like you are importing from mysql.
17/02/23 10:38:29 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
17/02/23 10:38:29 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
17/02/23 10:38:29 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
17/02/23 10:38:29 WARN manager.CatalogQueryManager: The table user contains a multi-column primary key. Sqoop will default to the column Host only for this job.
17/02/23 10:38:29 WARN manager.CatalogQueryManager: The table user contains a multi-column primary key. Sqoop will default to the column Host only for this job.
17/02/23 10:38:29 INFO mapreduce.ImportJobBase: Beginning import of user
17/02/23 10:38:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/02/23 10:38:33 INFO db.DBInputFormat: Using read commited transaction isolation
17/02/23 10:38:33 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`Host`), MAX(`Host`) FROM `user`
17/02/23 10:38:33 WARN db.TextSplitter: Generating splits for a textual index column.
17/02/23 10:38:33 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
17/02/23 10:38:33 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
17/02/23 10:38:33 INFO mapred.JobClient: Running job: job_201702221239_0003
17/02/23 10:38:34 INFO mapred.JobClient:  map 0% reduce 0%
17/02/23 10:38:55 INFO mapred.JobClient:  map 17% reduce 0%
17/02/23 10:38:56 INFO mapred.JobClient:  map 33% reduce 0%
17/02/23 10:39:08 INFO mapred.JobClient:  map 50% reduce 0%
17/02/23 10:39:09 INFO mapred.JobClient:  map 67% reduce 0%
17/02/23 10:39:21 INFO mapred.JobClient:  map 83% reduce 0%
17/02/23 10:39:22 INFO mapred.JobClient:  map 100% reduce 0%
17/02/23 10:39:26 INFO mapred.JobClient: Job complete: job_201702221239_0003
17/02/23 10:39:26 INFO mapred.JobClient: Counters: 23
17/02/23 10:39:26 INFO mapred.JobClient:   File System Counters
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of bytes read=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of bytes written=1778658
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of write operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of bytes read=791
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of bytes written=818
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of read operations=6
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of large read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of write operations=6
17/02/23 10:39:26 INFO mapred.JobClient:   Job Counters 
17/02/23 10:39:26 INFO mapred.JobClient:     Launched map tasks=6
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=89702
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:   Map-Reduce Framework
17/02/23 10:39:26 INFO mapred.JobClient:     Map input records=8
17/02/23 10:39:26 INFO mapred.JobClient:     Map output records=8
17/02/23 10:39:26 INFO mapred.JobClient:     Input split bytes=791
17/02/23 10:39:26 INFO mapred.JobClient:     Spilled Records=0
17/02/23 10:39:26 INFO mapred.JobClient:     CPU time spent (ms)=5490
17/02/23 10:39:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=666267648
17/02/23 10:39:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4423995392
17/02/23 10:39:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=191102976
17/02/23 10:39:26 INFO mapreduce.ImportJobBase: Transferred 818 bytes in 56.7255 seconds (14.4203 bytes/sec)
17/02/23 10:39:26 INFO mapreduce.ImportJobBase: Retrieved 8 records.

Example above dumps table user from mysql DB into hadoop.
First connect to DB using –connect
–username would by the authentication username, -P to ask for password at prompt. –warehouse-dir HDFS parent for table destination and –table to select the table to import.

Below dumped content is shown.

hdfs dfs -cat   /home/training/db/user/part-m-0000*
127.0.0.1,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost.localdomain,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,,,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost.localdomain,,,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost,training,*27CF0BD18BDADD517165824F8C1FFF667B47D04B,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,hiveuser,*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost,hue,*15221DE9A04689C4D312DEAC3B87DDF542AF439E,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0

Satanic Route v2.0 back to Navarre via Col de Larrau

Leave a reply

Back to Navarre via Col de Larrau. Tough one, four kms above 10%. But the views were worth every pedal stroke.

Cloudy, humid day to start

Up we go

Things get steep

Continue reading →

XaviGNU

Me and my world

Listing namenodes and datanodes in Hadoop

Apache Flume to write web server logs to Hadoop

Export data from HDFS to MySQL

Using sqoop to import a DB table into HDFS

Satanic Route v2.0 back to Navarre via Col de Larrau