Friday, January 6, 2017

Pure Functions

This post discusses one of the important aspects of functional programming, pure functions, using a Java example.

Pure functions are side-effect-free, stateless functions. The output of a pure function depends purely (or completely) on its input. Side-effect free basically means that a pure function must not modify its arguments or any other state variables.

Now, the obvious question is why this is so important. We will see later how it's one of the key ingredients of functional and Reactive programming. Let's take a simple Java class to understand pure functions.

package functional;

import java.util.Random;

public class PureFunction {
 private int state = 420;

 // pure function
 public int addTwo(int x) {
  return x + 2;
 }

 // Impure function
 public int addState(int x) {
  return x + state;
 }

 // Impure function
 public int randomizeArgument(int x) {
  Random r = new Random();
  return x + r.nextInt(1000);
 }
}
The method addTwo depends purely on its input/argument. No matter how many times you call it, it behaves in a predictable manner (it just adds 2 to the argument and returns the result). The responses of the other two methods (addState and randomizeArgument) are not consistent even when they are called with the same argument multiple times. The response of addState depends on an instance variable of the class; if that variable gets mutated somewhere else, the method's response changes. Similarly, the response of randomizeArgument varies with the random value.

Let's cover some of the important points:

  • If a pure function references any instance variable of the class, then that instance variable should be declared final.
  • Pure functions help in highly concurrent and scalable environments by making behavior predictable.
  • You get parallelism free of cost if your method doesn't access shared mutable data to perform its job.
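To see the difference concretely, here is a small self-contained sketch mirroring the class above, with a hypothetical mutateState method added just to demonstrate the state dependence:

```java
public class PureFunctionDemo {
    private int state = 420;

    // pure: output depends only on the argument
    int addTwo(int x) { return x + 2; }

    // impure: output also depends on mutable instance state
    int addState(int x) { return x + state; }

    // hypothetical mutator, added to show how impurity surfaces
    void mutateState() { state = 1000; }

    public static void main(String[] args) {
        PureFunctionDemo d = new PureFunctionDemo();
        System.out.println(d.addTwo(5));    // always 7, every call
        System.out.println(d.addState(5));  // 425 for now...
        d.mutateState();
        System.out.println(d.addState(5));  // ...but 1005 once state changes
    }
}
```

Calling addTwo is predictable forever; addState silently changes behavior as soon as something else touches the shared state.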

--happy learning !!!

Sunday, December 18, 2016

Git branching strategy

This post explores different Git branching options for managing a code base optimally. Primarily, there are two options which most teams use - Git flow and GitHub flow. Let's cover them in detail.

Git Flow

Git Flow works on the idea of releases; you have many branches, each specializing in a specific stage of the application's life cycle. The branches are named master, develop, feature, release and hotfix.

You create a develop branch from the master branch, which is used by all developers for feature development and bug fixes for a specific release. If a developer needs to add a new feature, they create a separate branch from develop so that both branches can evolve independently of each other. When the feature is fully implemented, that branch gets merged back to develop. The idea is to have a stable and working develop branch, so a feature gets merged back only when it's complete. Once all the features related to a release are implemented, develop is branched into a release branch, where formal testing of the release commences. The release branch should be used only for bug fixes, not development. And finally, when it's ready for deployment, the release branch is merged into master and tagged, so that master can track multiple feature releases.
This flow is not ideal for projects that release quite often. It helps where releases happen every month or every other month.

GitHub Flow

GitHub has popularized its own flow, which is quite effective for open source projects and other projects that don't have long release cycles (with this flow, you could release every day). It's based on the idea of features.

It's a lightweight version of Git flow which removes all the unwarranted overhead of maintaining so many branches (it has just feature branches apart from master). If you want to work on a feature or bug, you start a separate branch from master. Once the feature/bug is implemented, it gets merged back to the master branch. Feature branches are given quite descriptive names like oauth-support or bug#123. Developers can raise a pull request to collaborate with co-workers or to get a code review on the feature branch. Once the code is reviewed and sign-off is received, the feature branch is merged back to master. You can have as many feature branches as you wish, and once a feature branch is merged it can also be deleted, as the master branch will have all the commit history (unless you want to keep it).

GitHub flow assumes that every time you merge changes to master; you are ready for production deployment.
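The flow can be sketched as a command sequence. This is a hedged sketch, not official GitHub documentation: a throwaway local repo stands in for a real project, the branch name oauth-support comes from the examples above, and the push/pull-request step is shown as a comment since it needs a real remote.

```shell
# throwaway demo repo so the flow can run end-to-end
cd "$(mktemp -d)"
git init -q
git checkout -qb master
git config "Demo User"
git config ""
git commit -q --allow-empty -m "initial commit"

# start a descriptively named feature branch off master
git checkout -b oauth-support
echo "oauth code" > oauth.txt
git add oauth.txt
git commit -m "add OAuth support"

# in a real project you would now push and open a pull request:
#   git push -u origin oauth-support

# after review and sign-off, merge back into master and clean up
git checkout master
git merge --no-ff -m "merge oauth-support" oauth-support
git branch -d oauth-support
```

The --no-ff merge keeps a merge commit on master, so the feature's history stays visible even after the branch is deleted.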

There is a third branching strategy as well: GitLab Flow.


--happy branching !!!

Tuesday, December 6, 2016

Show Directory Tree structure in Unix

GUI-based operating systems let you see the complete directory structure graphically. But on console-based Linux/Unix systems this is not possible by default. At most you can go inside each directory and run ls, or if you are a scripting geek you can write a script to do it for you.
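If you do want to script it, a rough approximation is possible with find and sed - a quick sketch, not a replacement for tree's drawing:

```shell
# list every directory under the current one, indented by depth:
# each leading "component/" of the path is turned into two spaces
find . -type d | sort | sed -e 's|[^/]*/|  |g'
```

For a directory a/ containing b/, this prints ".", "  a" and "    b" - the same shape tree draws, minus the connector lines.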

Recently, I came across the directory listing program tree, which makes this job possible in Unix/Linux from the terminal.


CentOS/RHEL: $sudo yum install tree -y

Ubuntu: $sudo apt-get install tree

Using Tree

Run tree with the directory name. That's it!

$tree cookbooks/
`-- learn_chef_httpd
    |-- Berksfile
    |-- chefignore
    |-- metadata.rb
    |-- recipes
    |   `-- default.rb
    |-- spec
    |   |-- spec_helper.rb
    |   `-- unit
    |       `-- recipes
    |           `-- default_spec.rb
    `-- test
        `-- recipes
            `-- default_test.rb

7 directories, 8 files

Saturday, November 19, 2016

How Couchbase identifies Node for Data Access

This post talks about how Couchbase identifies where exactly a document is stored, to facilitate quick reads and updates.

Couchbase uses the term Bucket, equivalent to the term Database in the relational world, to logically partition all documents across the cluster (or data nodes). Being a distributed DB, it tries to evenly distribute or partition (or shard) data into virtual buckets known as vBuckets. Each vBucket owns a subset of the keys or document ids (and of course the corresponding data as well). Documents get mapped to a vBucket by applying a hash function on the key. Once the vBucket is identified, a separate lookup table tells which node hosts that vBucket. The structure that maps virtual buckets to nodes is known as the vBucket map. (Note: the Cluster Map contains the mapping of which service belongs to which node at any given point in time.)

Steps Involved (as shown in diagram):

  1. Hash(key) to get the vBucket identifier (vb-k) which hosts/owns the key.
  2. Looking up the vBucket map tells us vb-k is owned by node (or server) n-t.
  3. The request is sent directly to the primary server node, n-t, to fetch the document.

Both the hashing function and the number of vBuckets are configurable, so the mapping will change if either changes. By default, Couchbase automatically divides each bucket into 1024 active vBuckets and 1024 replica vBuckets (per replica). When there is only one node, all vBuckets reside on that node.
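The key-to-node lookup can be sketched in a few lines of Java. This is an illustration of the idea, not the client library's actual code: CRC32 stands in for the hash function, and the four-node vBucket map (vBuckets striped across nodes) is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class VBucketLookup {
    static final int NUM_VBUCKETS = 1024;

    // step 1: hash the key to a vBucket id in [0, 1024)
    static int vbucketId(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_VBUCKETS);
    }

    // step 2: look up the owning node in the vBucket map
    // (hypothetical map: vBucket ids striped across the node list)
    static String nodeFor(String key, String[] nodes) {
        return nodes[vbucketId(key) % nodes.length];
    }

    public static void main(String[] args) {
        String[] nodes = {"node A", "node B", "node C", "node D"};
        String key = "user::1001"; // hypothetical document key
        System.out.println("vBucket " + vbucketId(key) + " -> " + nodeFor(key, nodes));
    }
}
```

Because the mapping is pure arithmetic over the key, any client holding the current vBucket map can route a request to the right node without asking the cluster first.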

What if a new node gets added to the cluster?
When the number of nodes scales up or down, the information stored in the vBuckets is redistributed among the available nodes, and the corresponding vBucket map is updated. This entire process is known as rebalancing. Rebalancing doesn't happen automatically; as an administrator/developer you need to trigger it either from the UI or through the CLI.

What if the primary node is down?
All read/update requests by default go to the primary node. So, if a primary node fails for some reason, Couchbase takes that node out of the cluster (if configured to do so) and promotes the replica to become the primary. So you can fail over to the replica node manually or automatically. You can then address the issue with the failed node, fix it, and add it back to the cluster by performing a rebalance. The table below shows a sample mapping.
| vbucket id | active  | replica |
|     0      | node A  | node B  |
|     1      | node B  | node C  |
|     2      | node C  | node D  |
|     3      | node D  | node A  |


happy learning !!!

Wednesday, November 16, 2016

Why Multi-Dimensional Scaling in Couchbase

Couchbase has supported horizontal scaling in a monolithic fashion since its inception: you keep adding more nodes to the cluster to scale and improve performance (all nodes being exactly the same). This single-dimension scaling works to a great extent, as all services - Query, Index and Data - scale at the same rate. But these services are all unique and have specific resource requirements.

Let's profile these services and their specific hardware requirements in detail, to drive home the point of why MDS is required.
This feature was added in Couchbase 4.0.

The Query Service primarily executes Couchbase's native queries in N1QL (similar to SQL, pronounced "nickel" - it leverages the flexibility of JSON and the power of SQL). The query engine parses the query, generates an execution plan and then executes the query in collaboration with the index service and data service. The faster queries are executed, the better the performance.

Faster query processing requires more CPU or a faster processor (and less memory and disk). More cores help in processing queries in parallel.

Reference - Query Data with N1QL

The Index Service performs indexing with Global Secondary Indexes (GSI - similar to the B+tree commonly used in relational DBs). An index is a data structure which provides a quick and efficient means to access data. The index service creates and maintains secondary indexes and also performs index scans for N1QL queries. GSIs/indexes are global across the cluster and are defined using the CREATE INDEX statement in N1QL.

The index service is disk intensive, so optimized storage/SSDs will help boost performance. It needs only a basic processor and less RAM/memory. As an administrator, you can configure GSI with either the standard GSI storage, which uses ForestDB underneath, for indexes that cannot fit in memory, or pick the memory-optimized GSI for faster in-memory indexing and queries.

The Data Service is central to Couchbase, as data is the reason for any DB. It stores all data and handles all fetch and update requests. The data service is also responsible for creating and managing MapReduce views. Active documents that exist only on disk take much longer to access, which creates a bottleneck for both reading and writing data, so Couchbase tries to keep as much data as possible in memory.
Data here refers to (document) keys, metadata and the working set, i.e. the actual documents. Couchbase relies on extensive caching to achieve high throughput and low read/write latency. In a perfect world, all data would sit in memory.

Data Service = Managed Cache (based on Memcached) + Storage Engine + View Query Engine

Memory and the speed of the storage device affect performance (IO operations are queued by the server, so faster storage helps drain the queue faster).


So, each type of service has its own resource constraints. Couchbase introduced multi-dimensional scaling in version 4.0 so that these services can be independently optimized and assigned the kind of hardware which helps them excel. One size fits all is not going to work (especially when you are looking for higher throughput, i.e. sub-millisecond response times). For example, storing data and executing queries on the same node will cause CPU contention. Similarly, storing data and indexes on the same node will cause disk IO contention.

Through MDS, we can separate, isolate and scale these three services independently of each other, which improves resource utilization as well as performance.


happy learning !!!

Friday, November 11, 2016

IP Address in Private Range

Ah, again; I forgot the range of private IP addresses. So, no more cursing my memory. Now, instead of googling, I will search my own blog :D

Below are the permissible private IP ranges (per RFC 1918):
  • - (
  • - (
  • - (
These IP addresses are used for computers inside a network which only need to access internal resources. Routers inside the private network can route traffic between these private addresses. However, if these network entities want to access resources in the outside world (like the internet), they have to have a public address for the response to reach them. This is where NATing is used.
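Membership in the RFC 1918 private ranges can be checked from the first two octets alone. A quick sketch (class and method names are my own, for illustration):

```java
public class PrivateIp {
    // true if the dotted-quad IPv4 address falls in an RFC 1918 private range
    static boolean isPrivate(String ip) {
        String[] o = ip.split("\\.");
        int a = Integer.parseInt(o[0]);
        int b = Integer.parseInt(o[1]);
        if (a == 10) return true;                        //
        if (a == 172 && b >= 16 && b <= 31) return true; //
        return a == 192 && b == 168;                     //
    }

    public static void main(String[] args) {
        System.out.println(isPrivate(""));   // true (private)
        System.out.println(isPrivate(""));     // false (public)
    }
}
```

Note that only the 172 range needs the second octet checked against a span (16-31); the other two ranges are decided by one or two exact octets.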

Representing IP address in CIDR format

Wednesday, October 12, 2016

Containerize your application to achieve Scale

This post talks in general about containers: their evolution and contribution to scaling systems.

Once upon a time, applications used to run on servers configured on bare metal sitting in companies' own data centers. Provisioning used to take anywhere from a few days to a few weeks. Then came Virtual Machines, which use hardware virtualization to provide isolation. They take minutes to create, as they require significant resources. Then, finally, came a brand new guy in the race which takes 300 ms to a couple of seconds to bootstrap a new instance; yes, I am talking about containers. They don't use hardware virtualization; they interface directly with the host's Linux kernel.

Managing VMs at scale is not easy. In fact, I find it difficult to manage even a couple of VMs :D So just imagine how difficult it would be for companies like Google and Amazon which operate at internet scale.

Two features which have been part of the Linux kernel since 2007 are cgroups and namespaces. Engineers at Google started exploring process isolation using these kernel features (to manage and scale their millions of computing units). This eventually resulted in what we know today as containers. Containers are inherently lightweight, and that makes them super flexible and fast. If a container even thinks of misbehaving, it can easily be replaced by a brand new container, because the cost of doing so is not high at all. This also means they need to run in a managed and well-guarded environment. Their small footprint helps in using them for a specific purpose, and they can easily be scheduled and re-arranged/load-balanced.

So one thing is quite clear: containers are not a brand new product or technology. They use existing features of the OS.

With containers, the old problem of making every component of a system resilient and bulletproof no longer holds. This seems contradictory - we want to make systems more resilient, but containers themselves are very fragile, which means any component deployed in them automatically becomes non-reliable.
Instead, we can design our system with the assumption that containers are fragile: if any instance fails, just mark it bad and replace it with a new instance. With containers, the really hard problems are not isolation but orchestration and scheduling.

Read more details on Containers vs VMs

Containers are also described as a jail which guards the inmates to make sure they behave themselves. Currently, one of the most popular containers is Docker. And at the same time, there are tools available to manage or orchestrate them (one of the most popular ones is Kubernetes, from Google).

Happy learning!!!

Wednesday, July 27, 2016

Vert.x Event Loop

The event loop is not new and is also not specific to Vert.x (it's also used in Node.js). I will talk about it here in a top-down fashion.

Vert.x is a library (yes, it's just a jar) which allows you to develop Reactive applications on the JVM. Reactive systems are defined in the Reactive Manifesto. Vert.x supports this manifesto, enabling applications written in Vert.x to be Responsive, Elastic, Resilient and Message-driven.

The last point (Message-driven) of the Reactive Manifesto defines the essence of Vert.x - it's event/message-driven and non-blocking/asynchronous. This means different components interact with each other asynchronously by passing messages.

// Synchronous and blocking API call - the calling thread
// waits here until sendRequest() returns a response
Response response = client.sendRequest(request);

Traditional applications make blocking API calls, so the calling thread waits until the response is available. This means that until the response arrives, the thread is sitting idle and doing nothing, which is not good from a resource utilization point of view. Now, how about making that thread more specialized, with the only job of posting requests, i.e. never waiting for responses to arrive? The thread will go on doing that one thing till the sky falls. This way the thread will never sit idle (unless there are no requests). Putting it in more generic terms, the thread will be passing on messages or events.

An event loop is basically a thread (or a small group of threads; Vert.x matches it to the number of CPU cores) whose job is to keep passing messages to their specific handlers. The thread picks an event from a queue and hands it over to the right handler. The event loop maintains ordering (as it picks events from a queue internally), and it's also synchronous, as there is only one or a limited number of threads. Vert.x allows you to configure the number of event loop threads (one per CPU core by default). A given handler is always called by the same event loop, so there is no need for synchronization.
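The core idea can be sketched in plain Java without any framework - a minimal, framework-free illustration, not Vert.x's actual implementation: a single thread drains a queue of events and runs each handler in order.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MiniEventLoop {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread loop;

    public MiniEventLoop() {
        loop = new Thread(() -> {
            try {
                while (true) {
                    Runnable event = queue.take(); // pick the next event in order
                    event.run();                   // hand it to its handler
                }
            } catch (InterruptedException e) {
                // interrupted: shut the loop down
            }
        });
        loop.start();
    }

    // handlers never block this call; they just get queued
    public void post(Runnable handler) { queue.add(handler); }

    public void stop() { loop.interrupt(); }

    public static void main(String[] args) {
        MiniEventLoop loop = new MiniEventLoop();
        loop.post(() -> System.out.println("handler 1"));
        loop.post(() -> System.out.println("handler 2"));
        loop.post(loop::stop); // last event shuts the loop down
    }
}
```

Because one thread runs every handler sequentially, handlers need no locks - and, equally, one slow handler stalls everything behind it, which is exactly why blocking the event loop is forbidden.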

Event loops are limited and special, so blocking them would be a disaster. The event loop calls methods asynchronously and in a non-blocking manner; once the response arrives, the same event loop calls the callback.


Friday, July 22, 2016

Pushing a new project to Github

This post talks about pushing a new project to GitHub. Make sure that the project is not already created on GitHub.

I have illustrated this using a sample application named websockets. I created a project with just one file ( and then pushed the project to GitHub. You can run the below commands in a terminal or Git Bash.

$ mkdir websockets
$ cd websockets
$ echo "# websockets" >>

$ git init
Initialized empty Git repository in /Users/siddheshwar/Documents/gitRepo/websockets/.git/

$ git add

$ git commit -m "first commit"
[master (root-commit) 24fac01] first commit
1 file changed, 1 insertion(+)
create mode 100644

$ git remote add origin <repository-url>

$ git push -u origin master
Username for '':
Password for '': 
Counting objects: 3, done.
Writing objects: 100% (3/3), 233 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
 * [new branch]      master -> master

That's it, you are done !!!

Wednesday, July 20, 2016

Gradle - Create Project Structure Automatically

The Gradle init plugin can be used to bootstrap the creation of a new Java, Groovy or Scala project. This plugin needs to be applied to a project before it can be used. So if you want to create the default directory structure of a Java project, this plugin can be handy (especially if you don't have a Gradle plugin in your IDE).

$gradle init --type java-library

The init plugin supports multiple types (it's 'java-library' in the above command). Below is the command sequence and the directory listing after successful execution.

$ mkdir hello-gradle
$ cd hello-gradle/
$ gradle init --type java-library


Total time: 8.766 secs

$ ls -ltr
total 20
drwxrwxr-x. 3 vagrant vagrant   20 Jul 20 06:00 gradle
-rwxrwxr-x. 1 vagrant vagrant 5080 Jul 20 06:00 gradlew
-rw-rw-r--. 1 vagrant vagrant 2404 Jul 20 06:00 gradlew.bat
-rw-rw-r--. 1 vagrant vagrant  643 Jul 20 06:00 settings.gradle
-rw-rw-r--. 1 vagrant vagrant 1212 Jul 20 06:00 build.gradle
drwxrwxr-x. 4 vagrant vagrant   28 Jul 20 06:00 src

The above command also installs the other Gradle dependencies needed to run the build (i.e. gradlew, gradlew.bat). If you don't know the appropriate type for your project, specify any value and Gradle will list the valid types.

$ gradle init --type something
Execution failed for task ':init'.
> The requested build setup type 'something' is not supported. Supported types: 'basic', 'groovy-library', 'java-library', 'pom', 'scala-library'.

So, if you just type any random text as the type, Gradle tells you the allowed types.

If you just use $gradle init, then Gradle tries its best to automatically detect the type. If it fails to identify the type, it applies the basic type.

Importing Gradle Project to Eclipse

Note that, above command created gradle specific files along with default java directories (like src) but it didn't create eclipse specific files. This means, if you try to import above created project in eclipse it will not work. To achieve that, do below:
  1. Add the Eclipse plugin in the Gradle build file (i.e. build.gradle). Put the below line after the Java plugin.
          apply plugin: 'eclipse'
  2. Run $gradle eclipse

This creates the files .classpath, .project and .settings.

Now, you are good to import above project in eclipse.

Alternatively, you can clone from my GitHub repository.

Happy coding !!!