Monday, November 17, 2014

What is Mean by Sharding?

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards and can be spread across multiple servers. The word shard means a small part of a whole.

The concept of Database Sharding has been gaining popularity over the past several years, due to the enormous growth in transaction volume and size of business application databases. This is particularly true for many successful online service providers, Software as a Service (SaaS) companies, and social networking Web sites.


Database Size Grow Year By Year
The governing concept behind sharding is based on the idea that as the size of a database and the number of transactions per unit of time made on the database increase linearly, the response time for querying the database increases exponentially. 

Additionally, the costs of creating and maintaining a very large database in one place can increase exponentially because the database will require high-end computers. In contrast, data shards can be distributed across a number of much less expensive commodity servers. Data shards have comparatively little restriction as far as hardware and software requirements are concerned. 
Database Sharding Challenges:-
  • Reliability
  • Distributed queries
  • Avoidance of cross-shard joins
  • Auto-increment key management
  • Support for multiple Shard Schemes
The basic concept of Database Sharding is very straightforward take a large database and divide into a number of smaller databases across servers. The concept is illustrated in the following diagram:




The reasons for the performance and scalability challenges are inherent to the fundamental design of the database management systems themselves.  Databases rely heavily on the primary three components of any computer:  CPU, memory and disk. 


Each of these elements on a single server can only scale to a given point-after that, you need to take additional measures to improve performance.  While it is common knowledge that disk I/O is the primary bottleneck, as database management systems have improved they also continue to take greater advantage of CPU and memory. 

Therefore, as business applications gain sophistication and continue to grow in demand, architects, developers and database administrators have been presented with a constant challenge of maintaining database performance for mission-critical systems.  This landscape drives the need for database sharding.

Historically, sharding a database required manually coding data distribution policies directly into your applications. Application developers would write code that stipulates directly where specific data should be placed and found.  In essence developers were creating work-around code to solve a database scalability problem so their applications could handle more users, more transactions and more data. 


In some cases, database sharding can be done fairly simply. One common example is splitting a customer database geographically. Customers located on the East Coast can be placed on one server, while customers on the West Coast can be placed on a second  server. Assuming there are no customers with multiple locations, the split is easy to maintain and build rules around.


Fig 2. Separate Database table based on each location

Using an example can help explain MySQL sharding more clearly, so let’s take the following table:


This is a small table containing a list of customers. Any modern database can handle such a table. But happens if instead the table has to store seven million rows instead of just seven rows?

Theoretically, this should not be a problem.  But usually there are lots of operations on such a large table – for example we may have many read and write operations on this table every second.In practice, a very large customer table can become a database bottleneck. Why? Because it doesn’t fit in the database server cache anymore, because of database isolation management, and for other reasons that cause the database to crawl under load.


How does sharding solve MySQL Scalability?

If we take the customers table, and split it into four different databases, each database will contain 1.80 million rows. That’s still a lot, but less than 8 million rows. This will result in improved database performance. In fact the following diagram shows how such a table can be split:

MySQL Data Distribution Database


Data distribution Database

Every database will get some of the rows. In old-fashioned do-it-yourself sharding, it was the developer’s responsibility to create an efficient, application-specific data distribution policy that efficiently stipulated exactly where each row should be stored and found for each table. Nowadays, that work is simplified and automated using Mysql Clustering.


Saturday, November 15, 2014

What is ElasticSearch?

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a Restful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

It provides scalable search, has near real-time search, and supports multitenancy. ElasticSearch is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas. Each node hosts one or more shards, and acts as a coordinator to delegate operations to the correct shards. Rebalancing and routing are done automatically.

It uses Lucene and tries to make all features of it available through the JSON and Java API. It supports facetting, highlights, suggesters and percolating,  Percolating which can be useful for notifying if new documents match for registered queries.

Another feature is called "gateway" and handles the long term persistence of the index, for example, an index can be recovered from the gateway in a case of a server crash. Elasticsearch supports real-time GET requests, which makes it suitable as a NoSQL solution, but it lacks distributed transactions.

ElasticSearch can scale out to hundreds of servers and petabytes of data. 

Figure 1


Elasticsearch is much more than just Lucene and much more than “just” full text search. It is also:
A distributed real-time document store where every field is indexed and searchable. A distributed search engine with real-time analytics. Capable of scaling to hundreds of servers and petabytes(Figure -1) of structured and unstructured data .

ElasticSearch Requirements:-

Requirement for installing Elasticsearch is a recent version of Java. Preferably, you should install the latest version of the official Java from www.java.com.

You can download the latest version of Elasticsearch from elasticsearch.org/download.

curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip
unzip elasticsearch-$VERSION.zip
cd  elasticsearch-$VERSION

Elasticsearch is now ready to run. You can start it up in the foreground with:
./bin/elasticsearch

Add -d if you want to run it in the background as a daemon.
Test it out by opening another terminal window and running:
curl 'http://localhost:9200/?pretty'

You should see a response like this:
{
   "status": 200,
   "name": "Shrunken Bones",
   "version": {
      "number": "1.4.0",
      "lucene_version": "4.10"
   },
   "tagline": "You Know, for Search"
}

This means that your Elasticsearch cluster is up and running, and we can start experimenting with it.
You should change the default cluster.name to something appropriate to you, like your own name, to stop your nodes from trying to join another cluster on the same network with the same name!
You can do this by editing the elasticsearch.yml file in the config/ directory, then restarting Elasticsearch. When Elasticsearch is running in the foreground, you can stop it by pressing Ctrl-C, otherwise you can shut it down with the api

curl -XPOST 'http://localhost:9200/_shutdown'