{"id":1385,"date":"2017-03-27T17:40:14","date_gmt":"2017-03-27T21:40:14","guid":{"rendered":"http:\/\/www.xavignu.com\/?p=1385"},"modified":"2017-03-27T17:40:14","modified_gmt":"2017-03-27T21:40:14","slug":"apache-flume-to-write-web-server-logs-to-hadoop","status":"publish","type":"post","link":"https:\/\/www.xavignu.com\/?p=1385","title":{"rendered":"Apache Flume to write web server logs to Hadoop"},"content":{"rendered":"<p>In this post we will use <a hre=\"https:\/\/flume.apache.org\/index.html\" target=\"_blank\">flume<\/a> to dump <a href=\"https:\/\/httpd.apache.org\/\" target=\"_blank\">Apache<\/a> webserver logs into <a href=\"https:\/\/hadoop.apache.org\/docs\/stable\/hadoop-project-dist\/hadoop-hdfs\/HdfsUserGuide.html\" target=\"_blank\">HDFS<\/a>. We already have a web server running and flume installed, but we need to configure a target and a source.<\/p>\n<p>We use the following file as target.<\/p>\n<p>[xml]<br \/>\n## TARGET AGENT ##<br \/>\n## configuration file location:  \/etc\/flume-ng\/conf<br \/>\n## START Agent: flume-ng agent -c conf -f \/etc\/flume-ng\/conf\/flume-trg-agent.conf -n collector<\/p>\n<p>#http:\/\/flume.apache.org\/FlumeUserGuide.html#avro-source<br \/>\ncollector.sources = AvroIn<br \/>\ncollector.sources.AvroIn.type = avro<br \/>\ncollector.sources.AvroIn.bind = 0.0.0.0<br \/>\ncollector.sources.AvroIn.port = 4545<br \/>\ncollector.sources.AvroIn.channels = mc1 mc2<\/p>\n<p>## Channels ##<br \/>\n## Source writes to 2 channels, one for each sink<br \/>\ncollector.channels = mc1 mc2<\/p>\n<p>#http:\/\/flume.apache.org\/FlumeUserGuide.html#memory-channel<\/p>\n<p>collector.channels.mc1.type = memory<br \/>\ncollector.channels.mc1.capacity = 100<\/p>\n<p>collector.channels.mc2.type = memory<br \/>\ncollector.channels.mc2.capacity = 100<\/p>\n<p>## Sinks ##<br \/>\ncollector.sinks = LocalOut HadoopOut<\/p>\n<p>## Write copy to Local Filesystem<br \/>\n#http:\/\/flume.apache.org\/FlumeUserGuide.html#file-roll-sink<br 
\/>\ncollector.sinks.LocalOut.type = file_roll<br \/>\ncollector.sinks.LocalOut.sink.directory = \/var\/log\/flume-ng<br \/>\ncollector.sinks.LocalOut.sink.rollInterval = 0<br \/>\ncollector.sinks.LocalOut.channel = mc1<\/p>\n<p>## Write to HDFS<br \/>\n#http:\/\/flume.apache.org\/FlumeUserGuide.html#hdfs-sink<br \/>\ncollector.sinks.HadoopOut.type = hdfs<br \/>\ncollector.sinks.HadoopOut.channel = mc2<br \/>\ncollector.sinks.HadoopOut.hdfs.path = \/user\/training\/flume\/events\/%{log_type}\/%y%m%d<br \/>\ncollector.sinks.HadoopOut.hdfs.fileType = DataStream<br \/>\ncollector.sinks.HadoopOut.hdfs.writeFormat = Text<br \/>\ncollector.sinks.HadoopOut.hdfs.rollSize = 0<br \/>\ncollector.sinks.HadoopOut.hdfs.rollCount = 10000<br \/>\ncollector.sinks.HadoopOut.hdfs.rollInterval = 600<\/p>\n<p>[\/xml]<br \/>\n<!--more--><\/p>\n<p>and below as source<br \/>\n[xml]<br \/>\n## SOURCE AGENT ##<br \/>\n## Local instalation: \/home\/ec2-user\/apache-flume<br \/>\n## configuration file location:  \/home\/ec2-user\/apache-flume\/conf<br \/>\n## bin file location: \/home\/ec2-user\/apache-flume\/bin<br \/>\n## START Agent: bin\/flume-ng agent -c conf -f conf\/flume-src-agent.conf -n source_agent<\/p>\n<p># http:\/\/flume.apache.org\/FlumeUserGuide.html#exec-source<br \/>\nsource_agent.sources = apache_server<br \/>\nsource_agent.sources.apache_server.type = exec<br \/>\nsource_agent.sources.apache_server.command = tail -f \/var\/log\/httpd\/access_log<br \/>\nsource_agent.sources.apache_server.batchSize = 1<br \/>\nsource_agent.sources.apache_server.channels = memoryChannel<br \/>\nsource_agent.sources.apache_server.interceptors = itime ihost itype<\/p>\n<p># http:\/\/flume.apache.org\/FlumeUserGuide.html#timestamp-interceptor<br \/>\nsource_agent.sources.apache_server.interceptors.itime.type = timestamp<\/p>\n<p># http:\/\/flume.apache.org\/FlumeUserGuide.html#host-interceptor<br \/>\nsource_agent.sources.apache_server.interceptors.ihost.type = host<br 
\/>\nsource_agent.sources.apache_server.interceptors.ihost.useIP = false<br \/>\nsource_agent.sources.apache_server.interceptors.ihost.hostHeader = host<\/p>\n<p># http:\/\/flume.apache.org\/FlumeUserGuide.html#static-interceptor<br \/>\nsource_agent.sources.apache_server.interceptors.itype.type = static<br \/>\nsource_agent.sources.apache_server.interceptors.itype.key = log_type<br \/>\nsource_agent.sources.apache_server.interceptors.itype.value = apache_access_combined<\/p>\n<p># http:\/\/flume.apache.org\/FlumeUserGuide.html#memory-channel<br \/>\nsource_agent.channels = memoryChannel<br \/>\nsource_agent.channels.memoryChannel.type = memory<br \/>\nsource_agent.channels.memoryChannel.capacity = 100<\/p>\n<p>## Send to Flume Collector on Hadoop Node<br \/>\n# http:\/\/flume.apache.org\/FlumeUserGuide.html#avro-sink<br \/>\nsource_agent.sinks = avro_sink<br \/>\nsource_agent.sinks.avro_sink.type = avro<br \/>\nsource_agent.sinks.avro_sink.channel = memoryChannel<br \/>\nsource_agent.sinks.avro_sink.hostname = 192.168.46.169<br \/>\nsource_agent.sinks.avro_sink.port = 4545<\/p>\n<p>[\/xml]<br \/>\nWe start the target<\/p>\n<pre id=\"terminal\">flume-ng agent -c conf -f flume-trg-agent.conf -n collector \r\n<\/pre>\n<p>And now we start the source<\/p>\n<pre id=\"terminal\">flume-ng agent -c conf -f flume-src-agent.conf -n source_agent\r\n<\/pre>\n<p>And after some time we can see Apache logs being written to HDFS<\/p>\n<pre id=\"terminal\">hdfs dfs -cat  \/user\/training\/flume\/events\/apache_access_combined\/170327\/FlumeData.1490635725898\r\n192.168.46.169 - - [27\/Mar\/2017:10:28:42 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"curl\/7.19.7 (x86_64-redhat-linux-gnu) libcurl\/7.19.7 NSS\/3.14.0.0 zlib\/1.2.3 libidn\/1.18 libssh2\/1.4.2\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:28:57 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"curl\/7.19.7 (x86_64-redhat-linux-gnu) libcurl\/7.19.7 NSS\/3.14.0.0 zlib\/1.2.3 libidn\/1.18 libssh2\/1.4.2\"\r\n192.168.46.169 - - 
[27\/Mar\/2017:10:29:07 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"curl\/7.19.7 (x86_64-redhat-linux-gnu) libcurl\/7.19.7 NSS\/3.14.0.0 zlib\/1.2.3 libidn\/1.18 libssh2\/1.4.2\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:29:08 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"curl\/7.19.7 (x86_64-redhat-linux-gnu) libcurl\/7.19.7 NSS\/3.14.0.0 zlib\/1.2.3 libidn\/1.18 libssh2\/1.4.2\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:29:10 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"curl\/7.19.7 (x86_64-redhat-linux-gnu) libcurl\/7.19.7 NSS\/3.14.0.0 zlib\/1.2.3 libidn\/1.18 libssh2\/1.4.2\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:30:48 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"Mozilla\/5.0 (X11; Linux x86_64; rv:17.0) Gecko\/20131029 Firefox\/17.0\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:30:48 -0700] \"GET \/favicon.ico HTTP\/1.1\" 404 289 \"-\" \"Mozilla\/5.0 (X11; Linux x86_64; rv:17.0) Gecko\/20131029 Firefox\/17.0\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:30:48 -0700] \"GET \/favicon.ico HTTP\/1.1\" 404 289 \"-\" \"Mozilla\/5.0 (X11; Linux x86_64; rv:17.0) Gecko\/20131029 Firefox\/17.0\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:30:51 -0700] \"GET \/first.html HTTP\/1.1\" 200 291 \"http:\/\/192.168.46.169\/\" \"Mozilla\/5.0 (X11; Linux x86_64; rv:17.0) Gecko\/20131029 Firefox\/17.0\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:30:52 -0700] \"GET \/index.html HTTP\/1.1\" 200 401 \"http:\/\/192.168.46.169\/first.html\" \"Mozilla\/5.0 (X11; Linux x86_64; rv:17.0) Gecko\/20131029 Firefox\/17.0\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:30:53 -0700] \"GET \/second.html HTTP\/1.1\" 200 293 \"http:\/\/192.168.46.169\/index.html\" \"Mozilla\/5.0 (X11; Linux x86_64; rv:17.0) Gecko\/20131029 Firefox\/17.0\"\r\n192.168.46.169 - - [27\/Mar\/2017:10:32:01 -0700] \"GET \/ HTTP\/1.1\" 200 401 \"-\" \"curl\/7.19.7 (x86_64-redhat-linux-gnu) libcurl\/7.19.7 NSS\/3.14.0.0 zlib\/1.2.3 libidn\/1.18 libssh2\/1.4.2\"\r\n\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In this post we will use 
Flume to dump Apache web server logs into HDFS. We already have a web server running and Flume installed, but we need to configure a target and a source. We use the following file as target. [xml] ## TARGET AGENT ## ## configuration file location: \/etc\/flume-ng\/conf ## START Agent: flume-ng [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[74],"tags":[20,80,81,6,23,70],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_shortlink":"https:\/\/wp.me\/pTQgt-ml","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.xavignu.com\/index.php?rest_route=\/wp\/v2\/posts\/1385"}],"collection":[{"href":"https:\/\/www.xavignu.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.xavignu.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.xavignu.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.xavignu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1385"}],"version-history":[{"count":6,"href":"https:\/\/www.xavignu.com\/index.php?rest_route=\/wp\/v2\/posts\/1385\/revisions"}],"predecessor-version":[{"id":1392,"href":"https:\/\/www.xavignu.com\/index.php?rest_route=\/wp\/v2\/posts\/1385\/revisions\/1392"}],"wp:attachment":[{"href":"https:\/\/www.xavignu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1385"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.xavignu.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1385"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.xavignu.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}