Using Flume in Ambari with Kerberos


What is Flume?

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

An Event is the unit of data in Flume; events carrying payloads flow from a source to a channel to a sink. All of these run inside a Flume agent, which is a single JVM process.


Three key components of Flume:

  1. Source

The purpose of a source is to receive data from an external client and store it in a configured channel.

  2. Channel

The channel acts as a bridge: it receives data from a source and buffers it until it is consumed by a sink.

  3. Sink

A sink consumes data from a channel and delivers it to a destination, which may be another agent or a storage system.
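Put together, a minimal agent definition wires these three components by name. The sketch below is illustrative only: the agent name a1, the netcat source, and the port are hypothetical, but the wiring pattern is the same one the full configurations later in this post follow.

```properties
# Hypothetical minimal agent "a1": one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source listening locally (illustration only)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44445

# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory

# logger sink prints each event to the agent's log
a1.sinks.k1.type = logger

# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```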

Attention:
Since HDP 3.0, Flume is no longer shipped with Ambari; NiFi is used instead.

Kerberos authentication

Before using Flume, we need to configure it in Ambari for Kerberos. Open the Ambari UI and click the “Flume” service to edit the configuration.


More importantly, when you use Kafka components with Kerberos, you need to append the security settings to the end of the “flume-env template” property. Every Kafka client needs a JAAS file to authenticate against the KDC and obtain a TGT.

export JAVA_OPTS="$JAVA_OPTS -Djava.security.auth.login.config=/home/flume/kafka-jaas.conf"
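The referenced /home/flume/kafka-jaas.conf would contain a KafkaClient login section along the lines below. The keytab path and principal are placeholders for illustration; substitute the ones generated for your cluster.

```
// Placeholder keytab path and principal -- replace with your cluster's values
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/etc/security/keytabs/flume.service.keytab"
  principal="flume/bigdata.com@EXAMPLE.COM";
};
```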

Configuration

Here is a simple example that transfers log data from one PC to another; the configuration is pasted below.
Pairing an Avro sink on the sending agent with an Avro source on the receiving agent is the standard way to move data between two agents.

First PC configuration

source: exec-source
channel: memory
sink: avro-sink

exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel

exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /Users/JohnnyLiu/Documents/local_flume/data.log
exec-memory-avro.sources.exec-source.shell = /bin/sh -c

exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = bigdata.com
exec-memory-avro.sinks.avro-sink.port = 44444

exec-memory-avro.channels.memory-channel.type = memory

exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel
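The exec source above simply tails a file, so every line appended to data.log becomes an event. A quick way to generate test traffic is to append lines by hand; the /tmp/flume-demo path below is illustrative, not the path from the config above.

```shell
# Append test lines to a log file that an exec source could tail.
# /tmp/flume-demo is an illustrative path, not the one in the config above.
LOG_DIR=/tmp/flume-demo
LOG_FILE="$LOG_DIR/data.log"

mkdir -p "$LOG_DIR"
echo "hello flume" >> "$LOG_FILE"
echo "second event" >> "$LOG_FILE"

# tail -F (as the exec source runs it) would emit each appended line as an event
tail -n 2 "$LOG_FILE"
```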

Second PC configuration

source: avro-source
channel: memory
sink: logger

avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel

avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = bigdata.com
avro-memory-logger.sources.avro-source.port = 44444

avro-memory-logger.sinks.logger-sink.type = logger

avro-memory-logger.channels.memory-channel.type = memory

avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel
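With both files in place, start the receiving agent first so its Avro source is already listening on port 44444 before the sender connects. The config file names and conf directory below are assumptions; point them at wherever you saved the files.

```shell
# On the second PC: start the receiving agent first
flume-ng agent \
  --name avro-memory-logger \
  --conf $FLUME_HOME/conf \
  --conf-file avro-memory-logger.conf \
  -Dflume.root.logger=INFO,console

# On the first PC: then start the sending agent
flume-ng agent \
  --name exec-memory-avro \
  --conf $FLUME_HOME/conf \
  --conf-file exec-memory-avro.conf \
  -Dflume.root.logger=INFO,console
```

Once both agents are up, lines appended to the tailed log file on the first PC should appear in the logger sink's console output on the second PC.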