Collecting logs into Azure DocumentDB using fluent-plugin-documentdb

In this article, I’d like to introduce a solution for collecting logs and storing them in Azure DocumentDB using fluentd and its plugin, fluent-plugin-documentdb.

Azure DocumentDB is a managed NoSQL database service provided by Microsoft Azure. It is schemaless, natively supports JSON, and is designed to be easy to use, fast, highly reliable, and quick to deploy. Fluentd is an open source data collector that lets you unify data collection and consumption for better use and understanding of data. fluent-plugin-documentdb is a fluentd output plugin that stores event collections in Azure DocumentDB.

This article shows how to

Collect Apache httpd logs across web servers
Ship the collected logs into the aggregator Fluentd in near real-time
Store the collected logs into DocumentDB
Use the log data stored in DocumentDB for advanced scenarios, such as archiving the data to Azure Storage, running big data analysis, and visualizing the data with Power BI

[Figure: fluentd-azure-documentdb]

Pre-requisites

A basic understanding of fluentd - if you’re not familiar with fluentd, the fluentd quickstart guide is a good starting point
Azure subscription - you need an Azure subscription that grants you access to Microsoft Azure services and under which you can create a DocumentDB account. If you don’t have one yet, create one before you continue

Setup: Azure DocumentDB

To use Azure DocumentDB, you must create a DocumentDB database account using the Azure portal, Azure Resource Manager templates, or the Azure command-line interface (CLI). In addition, you need a database and a collection to which fluent-plugin-documentdb writes the event stream. Here are the instructions:

Create a DocumentDB database account using the Azure portal, or Azure Resource Manager templates and Azure CLI
How to create a database for DocumentDB
Create a DocumentDB collection

Setup: Fluentd Aggregator

First of all, install Fluentd. The following shows how to install Fluentd using the RubyGems package manager. If you are not using RubyGems for the installation, please refer to the installation guide, which covers many other ways to install Fluentd on many platforms.

# install fluentd (--no-document replaces the older --no-ri --no-rdoc flags)
$ sudo gem install fluentd --no-document

# create fluent.conf
$ fluentd --setup

Also install fluent-plugin-documentdb so that the aggregator can store the collected log data in Azure DocumentDB.

$ sudo gem install fluent-plugin-documentdb

Next, configure fluent.conf, the fluentd configuration file, as follows. See the fluent-plugin-documentdb documentation for the available configuration options.

# Receive events from 24224/tcp
# This is used by log forwarding and the fluent-cat command
<source>
    @type forward
    port 24224
</source>

# Store Data in DocumentDB
<match apache.access>
    @type documentdb
    docdb_endpoint https://yoichikademo.documents.azure.com:443/
    # Dummy key shown for illustration; replace it with your own account key
    docdb_account_key Tl1+ikQtnExxxUisJ+BXwbbaC8NtUqYVE9kUDXCNust5aYBduhui29Xtxz3DLP88PayjtgtnARc1PW+2wlA6jCJw==
    docdb_database LogDB
    docdb_collection Collection1
    auto_create_database true
    auto_create_collection true
    time_format %Y%m%d-%H:%M:%S
    localtime true
    add_time_field true
    time_field_name time
    add_tag_field true
    tag_field_name tag
</match>
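
To make the effect of the add_time_field and add_tag_field settings more concrete, here is a minimal Python sketch of the kind of document that might end up in the collection. It is an illustration only, not the plugin's own code: the decorate_record helper and the sample record are hypothetical, and the plugin may add further fields that are not shown here.

```python
from datetime import datetime

# Same value as time_format in the fluent.conf above
TIME_FORMAT = "%Y%m%d-%H:%M:%S"


def decorate_record(record, tag, event_time,
                    time_field_name="time", tag_field_name="tag"):
    """Return a copy of record with time and tag fields added (hypothetical helper)."""
    doc = dict(record)  # copy so the original record is left untouched
    doc[time_field_name] = event_time.strftime(TIME_FORMAT)
    doc[tag_field_name] = tag
    return doc


# Hypothetical parsed access-log record
sample = {"host": "192.0.2.10", "method": "GET",
          "path": "/foo/bar/test.html", "code": "200"}
doc = decorate_record(sample, "apache.access", datetime(2016, 3, 1, 12, 34, 56))
print(doc)
```

The field names come from time_field_name and tag_field_name, so changing those settings changes the keys you later query on in DocumentDB.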

Regarding the port number of the aggregator host above, the default is 24224. Note that both TCP packets (event stream) and UDP packets (heartbeat messages) are sent to this port, which means you need to open both TCP and UDP for this port if you have access controls between the forwarders and the aggregator. Please see the forward Output plugin article to learn more about the forward plugin.
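
If you want a quick way to confirm that the TCP side of this port is reachable from a forwarder, a small Python check along these lines may help. This is only a sketch: the tcp_port_open helper is hypothetical, it does not verify the UDP heartbeat path, and the demo runs against a throwaway local listener so that it works anywhere.

```python
import socket


def tcp_port_open(host, port=24224, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# Demo against a temporary local listener; in practice, pass the
# aggregator's hostname or IP and keep the default port 24224.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = tcp_port_open("127.0.0.1", demo_port)
listener.close()
print(reachable)
```

A False result from a forwarder usually points at a firewall rule or at the aggregator not running yet.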

Finally, run fluentd, specifying the fluent.conf that you configured above.

$ fluentd -c ./fluent.conf -vv &

Setup: Fluentd Forwarders

First, install Fluentd with the following commands. Again, if you are not using RubyGems for the installation, please refer to the installation document.

# install fluentd (--no-document replaces the older --no-ri --no-rdoc flags)
$ sudo gem install fluentd --no-document

# create fluent.conf
$ fluentd --setup

Then, give Fluentd read access to the servers’ log files.

$ sudo chmod og+rx /var/log/apache2
$ sudo chmod og+r /var/log/apache2/*

Next, configure fluent.conf, the fluentd configuration file, as follows to tail the Apache access logs and forward the events to the aggregator.

# Apache Access Logs
<source>
    @type tail
    path /var/log/apache2/access.log   # monitoring file
    pos_file /tmp/fluentd_pos_file     # position file
    format apache                      # format
    tag apache.access                  # tag
</source>

# Forward data to the aggregator
<match apache.access>
    @type forward
    buffer_type memory
    buffer_chunk_limit 8m
    buffer_queue_limit 64
    flush_interval 1s
    <server>
        host  <Aggregator's hostname or IP>
        port 24224
    </server>
    <secondary>
        @type file
        path /var/log/fluentd/forward-failed
    </secondary>
</match>
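
To give a feel for what the tail source with format apache produces before events are forwarded, the following Python sketch parses one access log line into a record. The regular expression is a simplified approximation written for this article, not Fluentd's actual parser, and the sample line and the parse_access_line helper are hypothetical. The real parser typically also uses the timestamp as the event time.

```python
import re

# Simplified pattern for the Apache combined log format (approximation only)
APACHE_LINE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_access_line(line):
    """Turn one access log line into a dict, or None if it does not match."""
    match = APACHE_LINE.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical log line, similar to what an ApacheBench request might leave
line = ('192.0.2.10 - - [01/Mar/2016:12:34:56 +0900] '
        '"GET /foo/bar/test.html HTTP/1.0" 404 208 "-" "ApacheBench/2.3"')
record = parse_access_line(line)
print(record)
```

Each matching line becomes one event tagged apache.access, which is the tag that both match sections in this article select on.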

Finally, start Fluentd with the configuration above to begin collecting logs.

$ fluentd -c ./fluent.conf -vv &

Test

Let’s check whether logs are forwarded from the Apache nodes to the aggregator and ultimately stored in DocumentDB. First, create log events by sending test requests to the web servers (here using ApacheBench as an example).

$ ab -n 5 -c 2 http://<targetserver>/foo/bar/test.html

If the logs are collected successfully, you can view the stored logs using DocumentDB’s Query Explorer. Go to Azure Portal > Display your DocumentDB dashboard > Query Explorer.

[Figure: DocumentDB-QueryExplorer]

More log collections from external sources

Fluentd’s input plugins extend Fluentd to retrieve and pull event logs from external sources. An input plugin typically creates a thread, a socket, and a listening socket. It can also periodically pull data from data sources. Examples include listening for syslog events, tailing Apache/nginx logs, and pulling data from well-known RDBMS/NoSQL stores on-premises or in the public cloud. See http://www.fluentd.org/plugins

Setup for Advanced Scenarios

Using secure-forward instead of the simple forward plugin to get SSL communication between the forwarders and the aggregator: https://github.com/tagomoris/fluent-plugin-secure-forward
Configuring DocumentDB’s Hadoop connector that allows DocumentDB to act as both a source and sink for Hive, Pig and MapReduce jobs: https://azure.microsoft.com/en-us/documentation/articles/documentdb-run-hadoop-with-hdinsight/
Setting up Azure Data Factory to move data from DocumentDB to Azure Blob Storage for archiving: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-documentdb-connector/
Configuring DocumentDB’s Power BI connector that allows DocumentDB to act as a source for Power BI to create insightful visualizations for DocumentDB data: https://azure.microsoft.com/en-us/blog/unleashing-insights-from-data-in-documentdb-with-power-bi/

Happy log collections with Azure DocumentDB and fluentd!!

END
