Sunday, December 28, 2014

How to Automate Large-Scale Log Parsing Using Logstash?

Logstash is a tool for managing your events and logs. You can use it to collect logs, parse them, and store them for later use, or parse them in near real time as they stream in.

In current application development and automation practice, customers benefit from acting on their data within a short time span and from visualizing the current status of their applications from log data. Both help them avoid the cost of vulnerable activity.

Logstash comes bundled with support for many open source log sources, such as Apache, collectd, Ganglia, and log4j. Beyond these, Logstash can also process all types of system logs, web server logs, error logs, application logs, and just about anything else you can throw at it.

Logstash is developed in a combination of Java and Ruby. It is open source, and you can visit the project at the URL below.

Github Url : -

Logstash functionality mainly falls into four categories:

1) Input

2) Codec

3) Filter

4) Output

Logstash Prerequisites:-

The only prerequisite required by Logstash is a Java runtime. It is recommended to run a recent version of Java in order to ensure the greatest success in running Logstash.

Download the latest version from


Untar the Logstash file:

tar -zxvf  logstash-1.4.2.tar.gz

cd logstash-1.4.2/

Now run Logstash with the command:

bin/logstash -e 'input { stdin { } } output { stdout {} }' 

Now type something at the command prompt, and you will see Logstash echo it back as output.

Now we type:-

Hello OpenSourceCircle

The Output Screen shows:-

2014-12-02T15:12:23.490+0000 Hello OpenSourceCircle.

Here we ran Logstash with stdin as the input and stdout as the output, so whatever text you type on the command line is shown back below it in a structured format.
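
The same pipeline can also be kept in a configuration file instead of being passed with -e. The following is a minimal sketch, assuming a hypothetical file name stdin-test.conf. The rubydebug codec, which is also used in the Apache example later in this post, is added here only so that each event is printed with all of its fields.

```conf
# stdin-test.conf (hypothetical file name)
input {
  # Read events line by line from the terminal
  stdin { }
}

output {
  # Print every event with all of its fields
  stdout { codec => rubydebug }
}
```

It would be run with bin/logstash -f stdin-test.conf, in the same way as the one-line command above.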

Life of an Event:

Inputs, outputs, codecs and filters are the core concepts of Logstash.


Inputs are the main mechanism for passing log data to Logstash.

  • file - reads from a file on the file system, much like the output of the tail command in Linux.
  • syslog - listens for syslog messages and parses them.

Other popular input mechanisms are collectd, s3, redis, rabbitmq, etc.
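
As an illustration of the two inputs listed above, an input section might look like the sketch below. The log path and the port number are assumptions made for this example and are not taken from any real setup.

```conf
input {
  # Follow a log file as lines are appended to it (hypothetical path)
  file {
    path => "/var/log/myapp/app.log"
  }

  # Listen for syslog messages; 5514 is an arbitrary unprivileged port
  # chosen for this sketch
  syslog {
    port => 5514
  }
}
```

Several inputs can sit side by side in the same input section, and events from all of them flow into the same filters and outputs.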


Codecs are stream filters that operate as part of an input or an output. They let you separate the transport of your messages from the serialization process. Popular codecs include json and multiline.
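
A codec is set inside the input or output it belongs to. The sketch below assumes a hypothetical application log in which continuation lines, such as stack trace lines, start with whitespace. The multiline codec joins those lines to the previous event, and the json codec serializes each event on the way out.

```conf
input {
  file {
    path => "/var/log/myapp/app.log"   # hypothetical path
    codec => multiline {
      # A line starting with whitespace belongs to the previous line
      pattern => "^\s"
      what => "previous"
    }
  }
}

output {
  # Write each event to the terminal as JSON
  stdout { codec => json }
}
```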


Filters are an intermediate processing stage. They act on the input data, take action when certain conditions are met, and filter out unimportant data.

  • grok - parses arbitrary text and structures the data.
  • drop - drops an event completely, for example debug events.
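
The two filters above can be combined with a conditional. The sketch below assumes a hypothetical log format of the form "timestamp LEVEL message", with the level written in uppercase. The field names loglevel and msg are names chosen for this example.

```conf
filter {
  # Split each line into timestamp, loglevel and msg fields
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:loglevel} %{GREEDYDATA:msg}" }
  }

  # Discard debug events so they never reach the outputs
  if [loglevel] == "DEBUG" {
    drop { }
  }
}
```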

Outputs are the last stage in the Logstash lifecycle. Once the output stage is completed, the event is considered to have finished its execution.

  • csv
  • elasticsearch
  • json
  • http
  • cloudwatch

Now let us look at a real-time example of Logstash using the access logs of an Apache server.

We can configure Logstash on localhost or on a web server, depending on your needs, and use a conditional to process the events.

Create a file with any name you like. I named mine opensourcecircle-apache.conf.

Paste the contents below into the conf file.

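input {
  file {
    path => "/var/log/apache2/access_log"
    # Read the file from the start instead of only tailing new lines
    start_position => "beginning"
  }
}

filter {
  # Only tag and parse events that come from an access log
  if [path] =~ "access" {
    mutate { replace => { "type" => "apache_access" } }
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  # Use the timestamp from the log line as the event time
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  stdout { codec => rubydebug }
}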

Once you have pasted the contents and saved the file, run Logstash with bin/logstash -f opensourcecircle-apache.conf.

You should be able to see the output data on the command line.

We can also save the log data to Elasticsearch by adding this configuration to the output section:

elasticsearch { host => localhost  } 
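
With that line added, the complete output section of opensourcecircle-apache.conf might look like the sketch below. It assumes an Elasticsearch instance is running on the same machine and keeps the stdout output so the events remain visible on the command line.

```conf
output {
  # Store every event in the local Elasticsearch instance
  elasticsearch { host => localhost }

  # Keep printing events to the terminal as well
  stdout { codec => rubydebug }
}
```

Every event passes through both outputs, so what is printed on the terminal is also what gets stored.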

Logstash now opens the configured Apache access log file and starts processing the events it encounters. You can adjust the path to point to your own Apache access log.

Points to Note:-

1) The output events are stashed with the "type" field set to "apache_access" (by the mutate line in the filter section, which replaces type with "apache_access").

2) The grok filter matches the standard Apache log format and splits the data into separate fields.

3) Logstash does not reprocess events it has already encountered in the access log file. It saves its position in each file and only processes new lines as they are added.
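
The saved position from point 3 is kept by the file input in a small "sincedb" file. While testing a configuration it can be handy to read the same log from the start on every run. One common way to do that is sketched below, assuming a Linux system where /dev/null is available. Treat it as a testing aid only, not as something to use in production.

```conf
input {
  file {
    path => "/var/log/apache2/access_log"
    start_position => "beginning"
    # Testing only: discard the saved position so the file is
    # read from the start each time Logstash is launched
    sincedb_path => "/dev/null"
  }
}
```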