In this post we will use Flume to ship Apache web server logs into HDFS. The web server is already running and Flume is already installed; what remains is to configure a target agent and a source agent.
The following file configures the target (collector) agent:
## TARGET AGENT ##
## configuration file location: /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector
#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2
## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2
#http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100
collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100
## Sinks ##
collector.sinks = LocalOut HadoopOut
## Write copy to Local Filesystem
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1
## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/training/flume/events/%{log_type}/%y%m%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600
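The HDFS sink path above contains two Flume escape sequences: %{log_type} expands to the value of the event's log_type header (which the source agent sets via a static interceptor), and %y%m%d expands to the event's timestamp header. A rough sketch of how that expansion plays out, using the shell date command to stand in for Flume (the directory layout is taken from the config above; the expansion itself is done by Flume, not by a script like this):

```shell
# Mimic Flume's escape expansion for the HadoopOut path.
# %{log_type} -> value of the event's log_type header
# %y%m%d      -> two-digit year, month, and day from the event timestamp
log_type=apache_access_combined   # set by the source agent's static interceptor
day=$(date +%y%m%d)               # e.g. 170327 for 27 March 2017
event_dir="/user/training/flume/events/${log_type}/${day}"
echo "$event_dir"
```

Because %y%m%d needs a timestamp header on each event, the source agent must attach one (its timestamp interceptor does exactly that).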
And the following file configures the source agent:
## SOURCE AGENT ##
## Local installation: /home/ec2-user/apache-flume
## configuration file location: /home/ec2-user/apache-flume/conf
## bin file location: /home/ec2-user/apache-flume/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /var/log/httpd/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype

# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp

# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 192.168.46.169
source_agent.sinks.avro_sink.port = 4545
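The exec source does nothing more than run the configured command and turn each line it prints into one Flume event (batchSize = 1, so one line per event). A minimal local dry run of that idea, using a stand-in log file and tail -n instead of the live /var/log/httpd/access_log and tail -f so the sketch terminates:

```shell
# Stand-in access_log so the sketch runs anywhere (the real source tails
# /var/log/httpd/access_log with tail -f instead)
log=$(mktemp)
printf '%s\n' '192.168.46.169 - - [27/Mar/2017:10:28:42 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7"' > "$log"

# Each line the command prints becomes the body of one event
event_body=$(tail -n 1 "$log")
echo "$event_body"

# The interceptors then attach headers to each event, roughly:
#   timestamp = <epoch millis>          (timestamp interceptor)
#   host      = <agent hostname>        (host interceptor, useIP = false)
#   log_type  = apache_access_combined  (static interceptor)
rm -f "$log"
```

The log_type header is what lets the collector route all of these events into the apache_access_combined directory on HDFS.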
We start the target agent:
flume-ng agent -c conf -f flume-trg-agent.conf -n collector
And then the source agent:
flume-ng agent -c conf -f flume-src-agent.conf -n source_agent
After a short while we can see the Apache log entries being written to HDFS:
hdfs dfs -cat /user/training/flume/events/apache_access_combined/170327/FlumeData.1490635725898
192.168.46.169 - - [27/Mar/2017:10:28:42 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:28:57 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:07 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:08 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:10 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET / HTTP/1.1" 200 401 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET /favicon.ico HTTP/1.1" 404 289 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET /favicon.ico HTTP/1.1" 404 289 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:51 -0700] "GET /first.html HTTP/1.1" 200 291 "http://192.168.46.169/" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:52 -0700] "GET /index.html HTTP/1.1" 200 401 "http://192.168.46.169/first.html" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:53 -0700] "GET /second.html HTTP/1.1" 200 293 "http://192.168.46.169/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:32:01 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
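Because fileType = DataStream and writeFormat = Text, each record in HDFS is a raw Apache combined-format line, so ordinary text tooling works on it directly. A small sketch pulling the request path and status code out of one of the lines above with awk (the field positions assume the combined log format shown, with the user agent shortened here for readability):

```shell
# One combined-format entry from the HDFS output above (user agent shortened)
line='192.168.46.169 - - [27/Mar/2017:10:30:51 -0700] "GET /first.html HTTP/1.1" 200 291 "http://192.168.46.169/" "Mozilla/5.0"'

# Splitting on whitespace: field 7 is the request path, field 9 the status code
path=$(echo "$line" | awk '{print $7}')
status=$(echo "$line" | awk '{print $9}')
echo "$path $status"
```

The same one-liner style works against the files in HDFS, e.g. piping hdfs dfs -cat output through awk to count status codes.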