
Using Sqoop to import a DB table into HDFS

In the world of Big Data, Apache Sqoop is the tool you need to import data from a relational database into HDFS.

sqoop import --connect jdbc:mysql://localhost/mysql  --username training  -P --warehouse-dir /home/training/db --table user
17/02/23 10:38:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
17/02/23 10:38:24 INFO manager.SqlManager: Using default fetchSize of 1000
17/02/23 10:38:24 INFO tool.CodeGenTool: Beginning code generation
17/02/23 10:38:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1
17/02/23 10:38:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1
17/02/23 10:38:24 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/7f3a9709c50f58c2c6bb24de91922c6b/user.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/02/23 10:38:29 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/7f3a9709c50f58c2c6bb24de91922c6b/user.jar
17/02/23 10:38:29 WARN manager.MySQLManager: It looks like you are importing from mysql.
17/02/23 10:38:29 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
17/02/23 10:38:29 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
17/02/23 10:38:29 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
17/02/23 10:38:29 WARN manager.CatalogQueryManager: The table user contains a multi-column primary key. Sqoop will default to the column Host only for this job.
17/02/23 10:38:29 WARN manager.CatalogQueryManager: The table user contains a multi-column primary key. Sqoop will default to the column Host only for this job.
17/02/23 10:38:29 INFO mapreduce.ImportJobBase: Beginning import of user
17/02/23 10:38:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/02/23 10:38:33 INFO db.DBInputFormat: Using read commited transaction isolation
17/02/23 10:38:33 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`Host`), MAX(`Host`) FROM `user`
17/02/23 10:38:33 WARN db.TextSplitter: Generating splits for a textual index column.
17/02/23 10:38:33 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
17/02/23 10:38:33 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
17/02/23 10:38:33 INFO mapred.JobClient: Running job: job_201702221239_0003
17/02/23 10:38:34 INFO mapred.JobClient:  map 0% reduce 0%
17/02/23 10:38:55 INFO mapred.JobClient:  map 17% reduce 0%
17/02/23 10:38:56 INFO mapred.JobClient:  map 33% reduce 0%
17/02/23 10:39:08 INFO mapred.JobClient:  map 50% reduce 0%
17/02/23 10:39:09 INFO mapred.JobClient:  map 67% reduce 0%
17/02/23 10:39:21 INFO mapred.JobClient:  map 83% reduce 0%
17/02/23 10:39:22 INFO mapred.JobClient:  map 100% reduce 0%
17/02/23 10:39:26 INFO mapred.JobClient: Job complete: job_201702221239_0003
17/02/23 10:39:26 INFO mapred.JobClient: Counters: 23
17/02/23 10:39:26 INFO mapred.JobClient:   File System Counters
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of bytes read=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of bytes written=1778658
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of write operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of bytes read=791
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of bytes written=818
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of read operations=6
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of large read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of write operations=6
17/02/23 10:39:26 INFO mapred.JobClient:   Job Counters 
17/02/23 10:39:26 INFO mapred.JobClient:     Launched map tasks=6
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=89702
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:   Map-Reduce Framework
17/02/23 10:39:26 INFO mapred.JobClient:     Map input records=8
17/02/23 10:39:26 INFO mapred.JobClient:     Map output records=8
17/02/23 10:39:26 INFO mapred.JobClient:     Input split bytes=791
17/02/23 10:39:26 INFO mapred.JobClient:     Spilled Records=0
17/02/23 10:39:26 INFO mapred.JobClient:     CPU time spent (ms)=5490
17/02/23 10:39:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=666267648
17/02/23 10:39:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4423995392
17/02/23 10:39:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=191102976
17/02/23 10:39:26 INFO mapreduce.ImportJobBase: Transferred 818 bytes in 56.7255 seconds (14.4203 bytes/sec)
17/02/23 10:39:26 INFO mapreduce.ImportJobBase: Retrieved 8 records.

The example above dumps the user table from the mysql database into Hadoop.
First, connect to the database using --connect.
--username is the authentication username, and -P asks for the password at a prompt. --warehouse-dir sets the HDFS parent directory for the table destination, and --table selects the table to import.
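
Note the TextSplitter warnings in the log above: the user table's key column Host is textual, so Sqoop warns that splitting on it may produce a partial import or duplicate records. Two hedged alternatives (both standard Sqoop options, shown here as a sketch):

# Use a single mapper so no split column is needed
sqoop import --connect jdbc:mysql://localhost/mysql --username training -P --warehouse-dir /home/training/db --table user -m 1

# Or, for a table that does have an integral column (here a hypothetical column named id)
sqoop import --connect jdbc:mysql://localhost/mysql --username training -P --warehouse-dir /home/training/db --table sometable --split-by id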

The dumped content is shown below.

hdfs dfs -cat   /home/training/db/user/part-m-0000*
127.0.0.1,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost.localdomain,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,,,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost.localdomain,,,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost,training,*27CF0BD18BDADD517165824F8C1FFF667B47D04B,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,hiveuser,*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost,hue,*15221DE9A04689C4D312DEAC3B87DDF542AF439E,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0

Get NYT for free

I got tired of paying for the NY Times and the Wall Street Journal. It's not worth it when you can read the articles for free by Googling them.

This can be done by using Firefox as your browser with the RefControl add-on installed. Once installed, configure it to send https://google.com as the referrer.

RefControl config

This basically tells NYT or WSJ that you are coming from Google instead of directly from your home or wherever you are located.
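
You can sanity-check the trick from the command line; a sketch using curl (the article URL is a placeholder):

curl -H "Referer: https://google.com/" https://www.nytimes.com/some-article.html

The server sees the request as if it had been referred by Google.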

Playing around with Nagios

I have been playing around with Nagios lately. It is one of the most widely used monitoring tools. I'm not going to write another howto, but rather clarify some things I found missing in the configuration manuals I used.

First you need the PHP module on the Apache server for Nagios to work; otherwise you will find yourself downloading a .phtml file to some directory on your box. This is because of the following files in the Nagios directory.

xavi@server:~$ sudo find /nagios/ -type f -name "*.php"
/nagios/share/side.php
/nagios/share/includes/utils.inc.php
/nagios/share/main.php
/nagios/share/config.inc.php
/nagios/share/index.php
xavi@server:~$ 
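
Installing the PHP module is distribution-specific; a minimal sketch, assuming a Debian-based box (package and module names vary by release):

sudo aptitude install libapache2-mod-php5   # libapache2-mod-php on newer releases
sudo a2enmod php5                           # usually enabled automatically by the package
sudo service apache2 restart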

Another issue I found was with the NRPE plugin. This plugin allows us to run commands on remote hosts. The issue is that the default compilation options don't allow us to pass arguments. To pass arguments from the server to the monitored host, NRPE must be compiled with the --enable-command-args flag when building from source. Another thing that must be done is setting dont_blame_nrpe to 1 in the nrpe.cfg file.

dont_blame_nrpe=1
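
For reference, a minimal build sketch from an unpacked NRPE source tree (a standard autoconf build, so the usual steps apply):

./configure --enable-command-args
make all
sudo make install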

These changes allow us to run checks with arguments remotely. Example:

server:/nagios/libexec# ./check_nrpe -H 212.34.95.23 -c check_procs -a 100 120
PROCS OK: 96 processes
server:/nagios/libexec#
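
For this to work, the check_procs command in nrpe.cfg on the monitored host must use the $ARG1$/$ARG2$ placeholders so the values passed with -a are substituted in. A sketch, assuming the standard plugin path:

command[check_procs]=/usr/local/nagios/libexec/check_procs -w $ARG1$ -c $ARG2$

With that definition, -a 100 120 maps to -w 100 -c 120.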

Questions and suggestions are always welcome. 🙂

Apache MySQL authentication

Let's explain how to configure Apache MySQL authentication. We will assume the Apache web server is already up and running on the server.
First we need to install the auth_mysql module.

$ sudo aptitude search libapache2-mod-auth-mysql
p   libapache2-mod-auth-mysql       - Apache 2 module for MySQL authentication  
$ sudo aptitude install libapache2-mod-auth-mysql
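
After installing, the module may still need to be enabled and Apache restarted; a sketch, assuming Debian's a2enmod naming for this module:

sudo a2enmod auth_mysql    # module name is an assumption; check /etc/apache2/mods-available
sudo service apache2 restart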
