Apache Flume to write web server logs to Hadoop

In this post we will use flume to dump Apache webserver logs into HDFS. We already have a web server running and flume installed, but we need to configure a target and a source.

We use the following file as target.

## TARGET AGENT ##  
## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn  
collector.sources.AvroIn.type = avro  
collector.sources.AvroIn.bind = 0.0.0.0  
collector.sources.AvroIn.port = 4545  
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2

#http://flume.apache.org/FlumeUserGuide.html#memory-channel

collector.channels.mc1.type = memory  
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory  
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll  
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng  
collector.sinks.LocalOut.sink.rollInterval = 0  
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs  
collector.sinks.HadoopOut.channel = mc2  
collector.sinks.HadoopOut.hdfs.path = /user/training/flume/events/%{log_type}/%y%m%d  
collector.sinks.HadoopOut.hdfs.fileType = DataStream  
collector.sinks.HadoopOut.hdfs.writeFormat = Text  
collector.sinks.HadoopOut.hdfs.rollSize = 0  
collector.sinks.HadoopOut.hdfs.rollCount = 10000  
collector.sinks.HadoopOut.hdfs.rollInterval = 600

and below as source

## SOURCE AGENT ##  
## Local instalation: /home/ec2-user/apache-flume
## configuration file location:  /home/ec2-user/apache-flume/conf
## bin file location: /home/ec2-user/apache-flume/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server  
source_agent.sources.apache_server.type = exec  
source_agent.sources.apache_server.command = tail -f /var/log/httpd/access_log  
source_agent.sources.apache_server.batchSize = 1  
source_agent.sources.apache_server.channels = memoryChannel  
source_agent.sources.apache_server.interceptors = itime ihost itype

# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp

# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host  
source_agent.sources.apache_server.interceptors.ihost.useIP = false  
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static  
source_agent.sources.apache_server.interceptors.itype.key = log_type  
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel  
source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink  
source_agent.sinks.avro_sink.type = avro  
source_agent.sinks.avro_sink.channel = memoryChannel  
source_agent.sinks.avro_sink.hostname = 192.168.46.169
source_agent.sinks.avro_sink.port = 4545

We start the target

flume-ng agent -c conf -f flume-trg-agent.conf -n collector 

And now we start the source

flume-ng agent -c conf -f flume-src-agent.conf -n source_agent

And after some time we can see Apache logs being written to HDFS

hdfs dfs -cat  /user/training/flume/events/apache_access_combined/170327/FlumeData.1490635725898
192.168.46.169 - - [27/Mar/2017:10:28:42 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:28:57 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:07 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:08 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:10 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET / HTTP/1.1" 200 401 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET /favicon.ico HTTP/1.1" 404 289 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET /favicon.ico HTTP/1.1" 404 289 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:51 -0700] "GET /first.html HTTP/1.1" 200 291 "http://192.168.46.169/" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:52 -0700] "GET /index.html HTTP/1.1" 200 401 "http://192.168.46.169/first.html" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:53 -0700] "GET /second.html HTTP/1.1" 200 293 "http://192.168.46.169/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:32:01 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"

Leave a Reply