Tuesday, August 15, 2017

Couchbase Primary vs Secondary Indexes

Couchbase supports both key-value and JSON-based data models. The key-value store is schemaless: each object is mapped to a given key, just like a HashMap or Dictionary, so Couchbase behaves much like a distributed HashMap. The value can be any supported data type (JSON, CSV, or BLOB). You perform every operation using the key, also called the Document ID. Couchbase looks up the value corresponding to a given ID using the primary index; we don't create this index ourselves, as the Document ID is what drives the lookup. 

The other data model, JSON-based, stores data as structured documents, so we can query them through N1QL, Couchbase's expressive SQL-like language. This is a much more flexible model: we can look up documents through the keys contained inside the JSON itself. To optimise such lookups and searches, we can create indexes on those values. These indexes are called secondary indexes, or more precisely Global Secondary Indexes (GSI).
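Since the post compares Couchbase to a distributed HashMap, here is a minimal Java sketch (plain in-memory maps, not the Couchbase SDK; all names are mine) contrasting a primary-index lookup by Document ID with a secondary-index lookup on a field inside the value:

```java
import java.util.*;

// Toy illustration only: real Couchbase distributes these maps across nodes.
public class IndexSketch {
    // "Primary index": Document ID -> document. Maintained automatically.
    static Map<String, Map<String, String>> bucket = new HashMap<>();
    // "Secondary index": field value -> Document IDs. We build this explicitly,
    // just as CREATE INDEX builds a Global Secondary Index in Couchbase.
    static Map<String, List<String>> cityIndex = new HashMap<>();

    static void insert(String id, Map<String, String> doc) {
        bucket.put(id, doc);
        cityIndex.computeIfAbsent(doc.get("city"), k -> new ArrayList<>()).add(id);
    }

    // Lookup by key: a direct get, no index creation needed.
    static Map<String, String> getById(String id) {
        return bucket.get(id);
    }

    // Lookup by a field inside the document: needs the secondary index
    // to avoid scanning every document.
    static List<String> findByCity(String city) {
        return cityIndex.getOrDefault(city, Collections.emptyList());
    }

    public static void main(String[] args) {
        insert("user::1", Map.of("name", "Asha", "city", "Pune"));
        insert("user::2", Map.of("name", "Ravi", "city", "Pune"));
        System.out.println(getById("user::1").get("name"));  // Asha
        System.out.println(findByCity("Pune"));              // [user::1, user::2]
    }
}
```

The sketch shows why a secondary index must be created and maintained explicitly: without `cityIndex`, answering "find documents where city = Pune" would require a full scan of the bucket.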

--- happy learning !

Friday, August 11, 2017

RAM sizing Data Node of Couchbase Cluster

This post talks about finding out how much RAM your Couchbase cluster needs to hold your data. It gives guidelines as well as helps you estimate your RAM requirement. 

RAM Calculator 

RAM is one of the most crucial areas to size correctly. Cached documents allow reads to be served at low latency and high throughput. Please note that this estimate does not incorporate the RAM requirements of the host/VM OS and other applications running alongside Couchbase.

Enter the fields below to estimate RAM:

Sample Document        (key)    (Value) 
This is required because both the document content length and the ID length impact RAM. Be mindful of the size aspect when deciding your key-generation strategy. 

# Replicas                                        
Couchbase supports up to 3 replicas, so enter 1, 2, or 3.

% Of Data you want to be in RAM  %
For best throughput you should have all your documents in RAM, i.e. 100%. That way every request is served from RAM and there is no disk IO. In this field, enter only the number, e.g. 80 or 100. 

# Documents                                   
Number of documents in the cluster. If your application is starting from scratch, begin with a number based on the expected load, then re-evaluate regularly and adjust your RAM quota if required. For example, you could start with 10,000 or 1,000,000 documents. 

Type of Storage                                SSD        HDD
If the storage is SSD, the headroom overhead is 25%; otherwise it's 30%. SSD brings better disk throughput and latency, and particularly helps performance when not all data fits in RAM. 

Couchbase Version                        < 2.1       2.1 or higher  
The metadata size per document is 56 bytes for version 2.1 and higher, but 64 bytes for lower versions. 

High Water Mark                             %
If you want to use the default value, enter 85. 
If the amount of RAM used by documents reaches the high water mark (upper threshold), both active and replica documents are ejected until memory usage reaches the low water mark (lower threshold). 
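The fields above map directly onto Couchbase's documented sizing formula. Here is a minimal Java sketch of that calculation (the method and parameter names are mine; the 56/64-byte metadata sizes and 25%/30% headroom values come from the post):

```java
// Sketch of the Couchbase RAM-sizing formula described above.
public class RamSizer {
    /**
     * @param documents      number of documents in the cluster
     * @param idBytes        average Document ID size in bytes
     * @param valueBytes     average document value size in bytes
     * @param replicas       1, 2, or 3
     * @param metadataBytes  56 for Couchbase 2.1+, 64 for older versions
     * @param residentRatio  fraction of data to keep in RAM, e.g. 1.0 for 100%
     * @param headroom       0.25 for SSD, 0.30 for HDD
     * @param highWaterMark  e.g. 0.85 for the default of 85%
     * @return estimated cluster RAM quota in bytes
     */
    static double estimateRamBytes(long documents, int idBytes, int valueBytes,
                                   int replicas, int metadataBytes,
                                   double residentRatio, double headroom,
                                   double highWaterMark) {
        int copies = 1 + replicas;  // one active copy plus the replicas
        double totalMetadata = (double) documents * (metadataBytes + idBytes) * copies;
        double totalDataset  = (double) documents * valueBytes * copies;
        double workingSet    = totalDataset * residentRatio;
        // Metadata always stays in RAM; add headroom, divide by high water mark.
        return (totalMetadata + workingSet) * (1 + headroom) / highWaterMark;
    }

    public static void main(String[] args) {
        // 1M documents, 10-byte IDs, 1 KB values, 1 replica, SSD, 100% resident
        double bytes = estimateRamBytes(1_000_000, 10, 1024, 1, 56, 1.0, 0.25, 0.85);
        System.out.printf("Cluster RAM quota: %.2f GB%n", bytes / (1024.0 * 1024 * 1024));
    }
}
```

With the sample inputs above, the estimate comes to roughly 3 GB for the cluster, which you would then divide across your data nodes.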


Based on the RAM requirement for the cluster, you can plan how many nodes are required. Another important aspect in deciding the number of data nodes is how you expect your system to behave if 1, 2, or more nodes go down at the same time. In this link, I have discussed the replication factor and how it affects your system's performance. So, take your call wisely!

The value is calculated as explained in the Couchbase documentation, here.
The reference for calculating document size is here.

--- happy sizing :)

What's so special about Java 8 stream API

Java 8 adds functional-programming features, and one of the major API additions is the Stream API.

A mechanical analogy is a car-manufacturing line, where a stream of cars is queued between processing stations. Each station takes a car, performs some modification or operation, and then passes it to the next station for further processing.

The main benefit of the Stream API is that in Java 8 you can program at a higher level of abstraction: you transform a stream of one type into a stream of another type, rather than processing one item at a time with a for loop or iterator. With this, Java 8 can run a pipeline of stream operations on several CPU cores, each working on a different part of the input. You get parallelism almost for free, instead of doing the hard work with threads and locks.

Streams focus on partitioning the data rather than coordinating access to it. 


A collection is mostly about storing and accessing data, whereas a stream is mostly about describing computations on data. 
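The car-assembly analogy translates directly into code: each intermediate operation is a station, and switching to a parallel stream hands the partitioning work to the library. A small sketch (the example data and method names are mine):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamDemo {
    // A pipeline of "stations": filter -> map -> collect, declared at a
    // higher level of abstraction than an explicit for loop over each item.
    static List<String> shortUpper(List<String> names) {
        return names.stream()
                .filter(s -> s.length() <= 5)   // station 1: keep short names
                .map(String::toUpperCase)       // station 2: transform each one
                .collect(Collectors.toList());  // station 3: gather the results
    }

    // Parallelism "almost for free": the same declarative pipeline,
    // split across CPU cores by adding a single call.
    static long parallelSum(int n) {
        return IntStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(shortUpper(List.of("sedan", "suv", "hatchback", "coupe")));
        // [SEDAN, SUV, COUPE]
        System.out.println(parallelSum(1_000));  // 500500
    }
}
```

Note that nothing in `shortUpper` says *how* to iterate; the stream decides, which is exactly what makes the parallel variant a one-word change.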

Wednesday, August 9, 2017

Replication Factor in Couchbase

One of the core requirements for distributed DBs is to be as highly available as possible. Literally, this means that even if a node or nodes go down, the DB should keep functioning (on its own or with minimum intervention). This is possible only if there are backup copies of the data. 

The replication factor controls the number of replicas, or backup copies, of an item/document stored in the DB. The general rule is to have one replica for each node that can fail in the cluster.

Let's check how one famous distributed NoSQL DB, Couchbase, handles the replication factor. 


The default replication factor in Couchbase is 1 (if replication is enabled). The drop-down field on the bucket settings page has only 3 values: 1, 2, and 3. Practically, it doesn't make sense to have a replication factor of more than 3, no matter how large your cluster is.

So even if you have only one node and enable replicas, that node will hold two copies of the same data (one active and one replica). Once you add more nodes to the cluster, the active and replica copies get redistributed automatically. 


Number of nodes <= 5 - RF = 1
5 < number of nodes <= 10 - RF = 2
Number of nodes > 10 - RF = 3

The node counts above refer only to data nodes if you are using Multi-Dimensional Scaling (MDS). If you are not using MDS, the rule above should still hold good. 

In the event of a failure, we can fail over (manually or automatically) to the replicas. 
  • In a 5-node cluster with 1 replica, if one node goes down the cluster can fail it over. But what if another node goes down before the failed one is back up? You are out of luck; you will have to add another node to the cluster. 
  • After a node goes down and is failed over, replace that node ASAP and perform a rebalance. Rebalance recreates the replica copies if enough nodes are available. 
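The rule of thumb above can be written down directly; here is a minimal Java sketch (the function name and the exact boundary handling are my choices):

```java
public class ReplicaRule {
    // Rule of thumb from the post: small clusters need fewer replicas,
    // and Couchbase caps the replica count at 3.
    static int recommendedReplicas(int dataNodes) {
        if (dataNodes <= 5)  return 1;
        if (dataNodes <= 10) return 2;
        return 3;
    }

    public static void main(String[] args) {
        System.out.println(recommendedReplicas(4));   // 1
        System.out.println(recommendedReplicas(8));   // 2
        System.out.println(recommendedReplicas(20));  // 3
    }
}
```

Remember the earlier caveat: a higher replica count multiplies the RAM and disk needed, so raise it only alongside the cluster sizing.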


Saturday, July 29, 2017

Understanding AWS' IAM Service

In the AWS world, everything is a service; in fact, more technically, a web service. Even for security there is one: IAM (Identity and Access Management).

IAM background 

Let's assume you manage a team (in a big company, or you run a startup) and you decide to embrace your favourite cloud platform, AWS. To start with, you create an account on AWS. This is the root account, or root user (as it's called in the Linux world). You want your team members to get access to the AWS console and its different services.

Do you want to give them the same access as you have? Definitely NOT!

Giving full administrative access to all users would compromise the security of your systems and critical data. Root access might also affect your monthly bills: what if a user starts a bunch of powerful EC2 instances when you only wanted to use the S3 service? That's where the concepts of users, groups, roles, and policies come into the picture in AWS, and all of this is achieved using the IAM service. 

Amazon follows a shared security model: it is responsible for securing the platform, but as a customer you need to secure your data and access to each service. 

The root account gets created when you first set up your AWS account. It has complete admin access. 

What is IAM ?

IAM is the authentication and authorisation service of AWS. IAM allows you to control who can access AWS resources, how they can access them, and in what ways. As an administrator, it gives you centralised control of your AWS account and lets you manage users and their level of access to the AWS console. 

IAM, being a core service, has global scope (it is not specific to a region). This means your user accounts and roles are available all across the world. 

IAM Page on AWS Console

Sign in to the AWS console through your root account. At the top left of the UI, click Services, then from the list of services click IAM (under Security, Identity & Compliance). This takes you to the IAM page of the console. 

At the very top it gives a sign-in link which has a numeric account number in the URL. I customised the URL for easy readability by replacing the account number with geekrai (this blog's name). The screenshot below shows my IAM page.

IAM Components

The image above shows the IAM page in the AWS console. It shows 0 users (the root user is not counted as a user), 0 groups, and 0 roles. Let's explore all the IAM components in detail:

Users - End users of the services. 
Click on Create individual IAM users to configure users for this account. Through this you can add as many users as you want. By default, new users have no permissions when they are created. There are two types of access for new users: 
  1. Programmatic Access: AWS enables an access key ID and secret access key for accessing AWS programmatically (AWS API, CLI, SDK, etc.). 
  2. AWS Management Console Access: This allows your users to sign-in to AWS console. Users need a password to sign-in. 
You can choose either or both of the above access types for a user. 

Groups - A collection of users under one set of permissions.
Once a user is created, it should (ideally) be part of a group such as developer or administrator. I created a group named developer and added a user named siddheshwar to it. This lets you attach policies to the group rather than to each user individually. 

Policies (Policy Documents) - A document that defines one or more permissions and gets attached to a group or user.  
A policy document is written as key-value pairs in JSON format. The AWS console provides a list of managed policies; you just need to select the one that fits your case. 


Provides full access to Amazon EC2 via the AWS Management Console.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "ec2:*",
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "elasticloadbalancing:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "autoscaling:*",
            "Resource": "*"
        }
    ]
}


Provides full access to AWS services and resources
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

Administrator access is effectively the same as root access. Please note that these policies can also be attached directly to a user; it doesn't always have to be through a group.
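The Effect/Action/Resource triple in each statement can be modelled as a tiny evaluator to see how wildcard actions work. A toy Java sketch (all names are mine; real IAM evaluation, with explicit denies, conditions, and ARN matching, is far richer than this):

```java
import java.util.List;

// Toy model of IAM policy statements: Effect + Action pattern + Resource.
public class PolicySketch {
    static class Statement {
        final String effect, action, resource;
        Statement(String effect, String action, String resource) {
            this.effect = effect;
            this.action = action;
            this.resource = resource;
        }
    }

    // Does an action like "ec2:StartInstances" match a pattern like "ec2:*"?
    static boolean matches(String pattern, String action) {
        return pattern.equals("*")
            || pattern.equals(action)
            || (pattern.endsWith("*")
                && action.startsWith(pattern.substring(0, pattern.length() - 1)));
    }

    // Allowed if any statement with Effect "Allow" matches the action.
    static boolean allowed(List<Statement> statements, String action) {
        return statements.stream().anyMatch(s ->
            s.effect.equals("Allow") && matches(s.action, action));
    }

    public static void main(String[] args) {
        List<Statement> ec2Full = List.of(
            new Statement("Allow", "ec2:*", "*"),
            new Statement("Allow", "cloudwatch:*", "*"));
        System.out.println(allowed(ec2Full, "ec2:StartInstances")); // true
        System.out.println(allowed(ec2Full, "s3:GetObject"));       // false
    }
}
```

This makes the difference between the two policies above concrete: `ec2:*` only matches EC2 actions, while the administrator policy's `"Action": "*"` matches everything.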

Roles - Roles define responsibilities that get assigned to AWS resources. 
An IAM role is similar to a user, but instead of being associated with a person, it can be assumed by an application or service as well. Remember, in AWS everything is a service. 

How roles can help - 
  • Enable identity federation: allow users to log in to the AWS console through Gmail, Amazon, OpenID, etc. 
  • Enable access between your AWS account and a 3rd-party AWS account. 
  • Allow an EC2 instance to call AWS services on your behalf.  

Below is the screenshot of a role page.

More details about Roles, here

IAM Best Practices

Multi-Factor Authentication (MFA) for the Root Account: The root account is the ID and password you used to sign up for AWS. The root account gives you unlimited access to AWS, which is why its security is so important, and AWS recommends setting up MFA. Once MFA is set, you will have to provide an MFA code as well while signing in.

Reference- https://aws.amazon.com/iam/details/mfa/

Set a Password Policy: It's good practice to set password policies, such as which characters are mandatory in a password, the expiry time, and the rotation policy.

Set a Billing Alarm: You can set a threshold on your monthly bill; if that level is crossed, AWS sends you an e-mail. This feature is not directly part of IAM; Amazon's CloudWatch service helps monitor billing. 

happy learning !