Tuesday, December 6, 2016

Show Directory Tree structure in Unix

Graphical operating systems let you browse the complete directory structure visually. In console-based Linux/Unix systems this is not possible by default. At most you can go into each directory and run ls, or, if you are a scripting geek, write a script that does the same for you.

Recently I came across a directory listing program, tree, which makes this possible in Unix/Linux right from the terminal.


CentOS/RHEL: $ sudo yum install tree -y

Using Tree

Use tree and the directory name. That's it!

$tree cookbooks/
`-- learn_chef_httpd
    |-- Berksfile
    |-- chefignore
    |-- metadata.rb
    |-- README.md
    |-- recipes
    |   `-- default.rb
    |-- spec
    |   |-- spec_helper.rb
    |   `-- unit
    |       `-- recipes
    |           `-- default_spec.rb
    `-- test
        `-- recipes
            `-- default_test.rb

7 directories, 8 files
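
On a machine where tree cannot be installed, the do-it-yourself route mentioned at the start of this post could look roughly like the sketch below. It is a simplified stand-in, not a re-implementation of tree: the class name SimpleTree is made up for illustration, every entry gets the same "|--" prefix, and there is no summary line.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper: a bare-bones stand-in for the tree command.
class SimpleTree {
    // Appends one line per entry under dir, indenting children one level deeper.
    static void render(File dir, String indent, StringBuilder out) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or not readable
        }
        Arrays.sort(entries); // sort by path so the output order is stable
        for (File entry : entries) {
            out.append(indent).append("|-- ").append(entry.getName()).append('\n');
            if (entry.isDirectory()) {
                render(entry, indent + "|   ", out);
            }
        }
    }

    public static void main(String[] args) {
        // Directory to list; defaults to the current directory.
        File root = new File(args.length > 0 ? args[0] : ".");
        StringBuilder out = new StringBuilder(root.getPath()).append('\n');
        render(root, "", out);
        System.out.print(out);
    }
}
```

Run it with the directory as the only argument, for example java SimpleTree cookbooks/.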

Saturday, November 19, 2016

How Couchbase identifies Node for Data Access

This post explains how Couchbase works out exactly which node stores a document, so that reads and updates are quick.

Couchbase uses the term Bucket, roughly equivalent to a Database in the relational world, to logically group documents across the cluster (or data nodes). Being a distributed DB, it tries to distribute (partition, or shard) the data evenly into virtual buckets known as vBuckets. Each vBucket owns a subset of the keys or document IDs (and, of course, the corresponding data). A document is mapped to a vBucket by applying a hash function to its key. Once the vBucket is identified, a separate lookup table tells which node hosts that vBucket. The table that maps vBuckets to nodes is known as the vBucket map. (Note: the Cluster Map records which service belongs to which node at any given point in time.)

Steps Involved (as shown in diagram):

  1. Hash(key) gives the identifier of the vBucket (vb-k) that owns the key.
  2. Looking up the vBucket map tells that vb-k is owned by node (server) n-t.
  3. The request is sent directly to the primary server node, n-t, to fetch the document.

Both the hashing function and the number of vBuckets are configurable, so the mapping changes if either one changes. By default, Couchbase automatically divides each bucket into 1024 active vBuckets and 1024 replica vBuckets (per replica). When there is only one node, all vBuckets reside on that node.
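
The lookup steps can be sketched in a few lines of Java. Treat this as an illustration of the idea only: the class VBucketLookup, the node names, the sample key and the "CRC32 of the key modulo the number of vBuckets" formula are all assumptions made for the example, not a copy of what Couchbase does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: class, node names and hash-to-vBucket formula are assumptions.
class VBucketLookup {
    // Step 1: hash the key and reduce it to a vBucket id in [0, numVBuckets).
    static int vBucketFor(String key, int numVBuckets) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numVBuckets);
    }

    // Step 2: the vBucket map is modelled as an array indexed by vBucket id.
    static String nodeFor(String key, String[] vBucketMap) {
        return vBucketMap[vBucketFor(key, vBucketMap.length)];
    }

    public static void main(String[] args) {
        String[] nodes = {"node A", "node B", "node C", "node D"};
        String[] vBucketMap = new String[1024];
        for (int vb = 0; vb < vBucketMap.length; vb++) {
            vBucketMap[vb] = nodes[vb % nodes.length]; // spread vBuckets evenly over nodes
        }
        String key = "user::1001";
        // Step 3 would send the request straight to the node printed here.
        System.out.println(key + " -> vBucket " + vBucketFor(key, 1024)
                + " on " + nodeFor(key, vBucketMap));
    }
}
```

Because the client computes the owner locally from the key and the map, no extra network hop is needed to find out where a document lives.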

What if a new node gets added to the cluster?
When the number of nodes scales up or down, the data stored in the vBuckets is redistributed among the available nodes and the vBucket map is updated accordingly. This entire process is known as rebalancing. Rebalancing doesn't happen automatically; as an administrator/developer you need to trigger it, either from the UI or through the CLI.

What if the primary node is down?
By default, all read/update requests go to the primary node. If a primary node fails for some reason, Couchbase removes that node from the cluster (if configured to do so) and promotes the replica to become the primary. Failover to the replica can be triggered manually or automatically. Once you have fixed the problem with the failed node, you can add it back to the cluster by rebalancing. The table below shows a sample mapping.
| vbucket id | active  | replica |
|     0      | node A  | node B  |
|     1      | node B  | node C  |
|     2      | node C  | node D  |
|     3      | node D  | node A  |
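
To make the failover idea concrete, here is a small sketch that loads the sample table and "fails" one node. The class FailoverSketch and its Owners type are invented for this example and are not Couchbase data structures; the sketch also assumes that a promoted vBucket simply has no replica until the next rebalance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of the sample table; not Couchbase's actual data structures.
class FailoverSketch {
    // One row of the table: which node holds the active and the replica copy.
    static final class Owners {
        String active;
        String replica;

        Owners(String active, String replica) {
            this.active = active;
            this.replica = replica;
        }
    }

    // For every vBucket whose active copy was on the failed node, promote the replica.
    static void failOver(Map<Integer, Owners> vBucketMap, String failedNode) {
        for (Owners owners : vBucketMap.values()) {
            if (failedNode.equals(owners.active)) {
                owners.active = owners.replica;
                owners.replica = null; // assumption: no replica until rebalancing
            } else if (failedNode.equals(owners.replica)) {
                owners.replica = null; // the replica copy went away with the node
            }
        }
    }

    // The sample mapping from the table above.
    static Map<Integer, Owners> sampleMap() {
        Map<Integer, Owners> map = new LinkedHashMap<>();
        map.put(0, new Owners("node A", "node B"));
        map.put(1, new Owners("node B", "node C"));
        map.put(2, new Owners("node C", "node D"));
        map.put(3, new Owners("node D", "node A"));
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, Owners> map = sampleMap();
        failOver(map, "node B");
        map.forEach((id, o) ->
                System.out.println(id + " active=" + o.active + " replica=" + o.replica));
    }
}
```

With node B failed, vBucket 1 is now served by node C, while vBucket 0 stays on node A but has lost its replica.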


happy learning !!!

Wednesday, November 16, 2016

Why Multi-Dimensional Scaling in Couchbase

Couchbase has supported horizontal scaling in a monolithic fashion since its inception: you keep adding nodes to the cluster to scale and improve performance, all nodes being exactly the same. This single-dimension scaling works to a great extent, as all services (Query, Index and Data) scale at the same rate. But each service is unique and has its own specific resource requirements.

Let's profile these services and their specific hardware requirements in detail to drive home the point of why Multi-Dimensional Scaling (MDS) is required.
MDS was added in Couchbase 4.0.

Query Service primarily executes Couchbase's native queries, written in N1QL (similar to SQL and pronounced "nickel"; it combines the flexibility of JSON with the power of SQL). The query engine parses the query, generates an execution plan and then executes the query in collaboration with the index service and the data service. The faster queries are executed, the better the performance.

Faster query processing requires more CPU or a faster processor (and comparatively little memory and disk). More cores help in processing queries in parallel.

Reference: Query Data with N1QL

Index Service performs indexing with Global Secondary Indexes (GSI, similar to the B+tree indexes commonly used in relational DBs). An index is a data structure which provides a quick and efficient means to access data. The index service creates and maintains secondary indexes and also performs index scans for N1QL queries. GSIs are global across the cluster and are defined using the CREATE INDEX statement in N1QL.

The index service is disk intensive, so optimized storage / SSD helps boost performance. It needs only a basic processor and less RAM. As an administrator, you can configure GSI with either the standard GSI storage, which uses ForestDB underneath, for indexes that cannot fit in memory, or the memory-optimized GSI for faster in-memory indexing and queries.

Data Service is central to Couchbase, as data is the reason for any DB. It stores all data and handles all fetch and update requests on that data. The data service is also responsible for creating and managing MapReduce views. Active documents that exist only on disk take much longer to access, which creates a bottleneck for both reading and writing data, so Couchbase tries to keep as much data as possible in memory.
Data here refers to (document) keys, metadata and the working set, i.e. the actual documents. Couchbase relies on extensive caching to achieve high throughput and low read/write latency. In a perfect world, all data would sit in memory.

Data Service : Managed Cache (based on Memcached) + Storage Engine + View Query Engine

Both memory and the speed of the storage device affect performance (IO operations are queued by the server, so faster storage helps drain the queue faster).


So, each type of service has its own resource constraints. Couchbase introduced multi-dimensional scaling in version 4.0 so that these services can be independently optimized and assigned the kind of hardware that helps them excel. One size fits all is not going to work, especially when you are looking for high throughput and sub-millisecond response times. For example, storing data and executing queries on the same node causes CPU contention. Similarly, storing data and indexes on the same node causes disk IO contention.


Through MDS, we can separate, isolate and scale these three services independent of each other which will improve resource utilization as well as performance.



happy learning !!!

Friday, November 11, 2016

IP Address in Private Range

Ah, again; I forgot the ranges of private IP addresses. So, no more cursing my memory. Now...instead of googling I will search on my blog :D

Below are the permissible private IP ranges:
  • 10.0.0.0 - 10.255.255.255
  • 172.16.0.0 - 172.31.255.255
  • 192.168.0.0 - 192.168.255.255
These IP addresses are used by computers inside a network that need to access internal resources. Routers inside the private network can route traffic between these private addresses. However, to access a resource in the outside world (like the internet), these network entities need a public address so that responses can reach them. This is where NAT (network address translation) is used.
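
And for the days when even the blog is too far away, a quick check can be coded up. This is a minimal sketch: the class name PrivateIp is made up, it only handles dotted IPv4 strings, it tests the three standard private blocks (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16) by hand, and it does not verify that each octet is within 0-255.

```java
// Hypothetical helper: checks the three private IPv4 blocks by hand.
class PrivateIp {
    static boolean isPrivate(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dottedQuad);
        }
        int first = Integer.parseInt(parts[0]);
        int second = Integer.parseInt(parts[1]);
        return first == 10                                        // 10.0.0.0/8
                || (first == 172 && second >= 16 && second <= 31) // 172.16.0.0/12
                || (first == 192 && second == 168);               // 192.168.0.0/16
    }

    public static void main(String[] args) {
        for (String ip : new String[] {"10.1.2.3", "172.20.0.7", "192.168.1.10", "8.8.8.8"}) {
            System.out.println(ip + " private? " + isPrivate(ip));
        }
    }
}
```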

Representing IP address in CIDR format

Wednesday, October 12, 2016

Containerize your application to achieve Scale

This post talks in general about Containers; their evolution and contribution in scaling systems.

Once upon a time, applications ran on servers configured on bare metal sitting in companies' own data centers. Provisioning used to take anywhere from a few days to a few weeks. Then came Virtual Machines, which use hardware virtualization to provide isolation. They take minutes to create as they require significant resources. Then, finally, comes a brand new guy in the race, which takes from 300 ms to a couple of seconds to bootstrap a new instance; yes, I am talking about containers. They don't use hardware virtualization. They interface directly with the host's Linux kernel.

Managing VMs at scale is not easy. In fact, I find it difficult to manage even a couple of VMs :D So just imagine how difficult it would be for companies like Google and Amazon, which operate at internet scale.

Two features that have been part of the Linux kernel since 2007 are cgroups and namespaces. Engineers at Google started exploring process isolation using these kernel features (to manage and scale their millions of computing units). This eventually resulted in what we know today as containers. Containers are inherently lightweight, and that makes them super flexible and fast. If a container even thinks of misbehaving, it can easily be replaced by a brand new container, because the cost of doing so is low. This also means containers need to run in a managed and well-guarded environment. Their small footprint helps in using them for a specific purpose, and they can easily be scheduled and rearranged/load balanced.

So one thing is quite clear: containers are not a brand new product or technology. They use existing features of the OS.

With containers, the usual goal of making every component of a system resilient and bulletproof no longer holds. This seems contradictory: we want to make systems more resilient, but containers themselves are very fragile, so any component deployed in them automatically becomes unreliable.
The way out is to design our system on the assumption that containers are fragile. If any instance fails, just mark it bad and replace it with a new instance. With containers, the really hard problems are not isolation but orchestration and scheduling.

Read more in detail on Containers vs VMs

Containers are also described as a jail which guards the inmates to make sure that they behave themselves. Currently, one of the most popular container technologies is Docker. There are also tools available to manage or orchestrate containers (one of the most popular is Kubernetes, from Google).

Happy learning!!!

Wednesday, July 27, 2016

Vert.x Event Loop

The Event Loop is not new and not specific to Vert.x (Node.js uses one too). I will talk about it here in a top-down fashion.

Vert.x is a library (yes, it's just a jar) which lets you develop reactive applications on the JVM. A reactive system is defined in the Reactive Manifesto. Vert.x supports this manifesto, so it enables applications written in Vert.x to be Responsive, Elastic, Resilient and Message Driven.

The last point (Message Driven) of reactive manifesto defines the essence of Vert.x - It's event/message driven and non-blocking/asynchronous. This means, different components interact with each other asynchronously by passing messages. 

//Synchronous and blocking API call
Object obj = new MyObject();

Traditional applications make blocking API calls, so the calling thread waits until the response is available. Until then, the thread sits idle and does nothing, which is not good from a resource utilization point of view. Now, how about making that thread more specialized, so that its only job is to post requests, i.e. it does not wait for the response to arrive? The thread will go on doing only that one thing till the sky falls. This way the thread never sits idle (unless there are no requests). Putting it in more generic terms, the thread passes on messages or events.

An Event Loop is basically a thread (or a group of threads; Vert.x sizes the group based on the number of CPU cores) whose job is to keep passing messages on to their specific handlers. The thread picks an event from a queue and hands it over to the right handler. The event loop maintains order (as it picks the events from a queue), and within a single event loop thread events are handled one at a time. Vert.x lets you configure the number of event loop threads (by default it is derived from the number of CPU cores). A handler is always called by the same event loop, so there is no need for synchronization.

Event loop threads are few and therefore special, so blocking them is a disaster. The event loop calls methods asynchronously and in a non-blocking manner. Once the response arrives, the same event loop calls the callback.
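
To see the mechanics without pulling in any library, here is a toy event loop in plain Java. It is deliberately not the Vert.x API: ToyEventLoop, on, post and stop are names made up for this sketch, events are just (address, message) string pairs, and a special stop address is an assumed way to end the loop.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Toy event loop in plain Java; this is not the Vert.x API.
class ToyEventLoop {
    private static final String STOP = "__stop__"; // sentinel address that ends the loop

    private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();
    private final Map<String, Consumer<String>> handlers = new HashMap<>();

    // Register handlers before the loop starts running.
    void on(String address, Consumer<String> handler) {
        handlers.put(address, handler);
    }

    // Callers only enqueue an event; they never wait for it to be handled.
    void post(String address, String message) {
        queue.add(new String[] {address, message});
    }

    void stop() {
        post(STOP, "");
    }

    // The loop itself: take events in arrival order and hand each to its handler.
    void run() {
        try {
            while (true) {
                String[] event = queue.take();
                if (STOP.equals(event[0])) {
                    return;
                }
                Consumer<String> handler = handlers.get(event[0]);
                if (handler != null) {
                    handler.accept(event[1]); // a slow handler here stalls every later event
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and exit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ToyEventLoop loop = new ToyEventLoop();
        loop.on("greet", name -> System.out.println("hello " + name));
        Thread eventLoopThread = new Thread(loop::run, "event-loop");
        eventLoopThread.start();
        loop.post("greet", "Vert.x"); // returns immediately
        loop.stop();
        eventLoopThread.join();
    }
}
```

Since a single thread drains the queue, handlers never run concurrently with each other, which is why they need no locks, and also why one blocking handler holds up everything queued behind it.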


Friday, July 22, 2016

Pushing a new project to Github

This post talks about pushing a new project to GitHub. Make sure that the project has not already been pushed to GitHub; the commands below assume that an empty repository for it exists there.

I have illustrated this using a sample application named websockets. I create a project with just one file (README.md) and then push the project to GitHub. You can run the commands below in a terminal or Git Bash.

$ mkdir websockets
$ cd websockets
$ echo "# websockets" >> README.md

$ git init
Initialized empty Git repository in /Users/siddheshwar/Documents/gitRepo/websockets/.git/

$ git add README.md

$ git commit -m "first commit"
[master (root-commit) 24fac01] first commit
1 file changed, 1 insertion(+)
create mode 100644 README.md

$ git remote add origin https://github.com/raiskumar/websockets.git

$ git push -u origin master
Username for 'https://github.com': rai.skumar@gmail.com
Password for 'https://rai.skumar@gmail.com@github.com': 
Counting objects: 3, done.
Writing objects: 100% (3/3), 233 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/raiskumar/websockets.git
 * [new branch]      master -> master

That's it, you are done !!!

Wednesday, July 20, 2016

Gradle - Create Project Structure Automatically

The Gradle Build Init plugin can be used to bootstrap the process of creating a new Java, Groovy or Scala project. The plugin is applied automatically, so it can be run in an empty directory without any build file (as done below). So if we want to create the default directory structure of a Java project, this plugin can be handy (especially if you don't have a Gradle plugin in your IDE).

$ gradle init --type java-library

The init plugin supports multiple types ('java-library' in the above command). Below are the command sequence and the directory contents that get created after successful execution.

$ mkdir hello-gradle
$ cd hello-gradle/
$ gradle init --type java-library


Total time: 8.766 secs

$ ls -ltr
total 20
drwxrwxr-x. 3 vagrant vagrant   20 Jul 20 06:00 gradle
-rwxrwxr-x. 1 vagrant vagrant 5080 Jul 20 06:00 gradlew
-rw-rw-r--. 1 vagrant vagrant 2404 Jul 20 06:00 gradlew.bat
-rw-rw-r--. 1 vagrant vagrant  643 Jul 20 06:00 settings.gradle
-rw-rw-r--. 1 vagrant vagrant 1212 Jul 20 06:00 build.gradle
drwxrwxr-x. 4 vagrant vagrant   28 Jul 20 06:00 src

So the above command also generates the Gradle wrapper files needed to run the build (i.e. gradlew, gradlew.bat). If you don't know the appropriate type for your project, specify any value and Gradle will list the valid types.

$ gradle init --type something
Execution failed for task ':init'.
> The requested build setup type 'something' is not supported. Supported types: 'basic', 'groovy-library', 'java-library', 'pom', 'scala-library'.

So, if you pass any random text as the type, Gradle tells you the allowed types.

If you just run $ gradle init, Gradle tries its best to detect the type automatically. If it fails to identify the type, it applies the basic type.

Importing Gradle Project to Eclipse

Note that the above command created Gradle-specific files along with the default Java directories (like src), but it didn't create Eclipse-specific files. This means that if you try to import the above project into Eclipse, it will not work. To fix that, do the following:
  1. Add the eclipse plugin to the Gradle build file (i.e. build.gradle). Put the line below after the java plugin.
          apply plugin: 'eclipse'
  2. Run $ gradle eclipse

This creates the .classpath and .project files and the .settings directory.

Now you are good to import the above project into Eclipse.

Alternatively, you can clone from my github repository

Happy coding !!!

Thursday, July 7, 2016

Microservices Explained

I have been reading about microservices for a while (I must admit I delayed it, thinking it's just old wine in a new bottle), and the deeper I dive, the more exciting I find it. I am a big fan of the Single Responsibility Principle (SRP), as it helps in putting boundaries on classes (and even on methods). SRP helps in making code simpler and cleaner (of course, other design principles are equally important, but my love for SRP is boundless!). I always used to wonder: why can't we apply SRP at the service layer? And finally, the Microservices God has heard me!

For a service to be called micro, it should be really small, and that is possible if and only if the service does only one thing (i.e. follows SRP). And it should do that one thing really well. This in turn makes the service easy to implement, change, build, distribute and deploy. It also helps in creating highly decentralized and scalable systems. I looked around the web for a definition of microservices; the one I found covering all aspects is from Martin Fowler (link).

In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.

Let's cover some of the aspects of microservices:
  1. Each service (or small set of services) should have its own data store, so having hundreds of data stores is normal. You can choose between a relational DB, a NoSQL DB, or even an in-memory DB depending on your need. If you have 2 DBs and a dozen services around them, you are not micro yet.
  2. They are organized functionally or around business functions. This helps in keeping boundaries separate and clear.
  3. Loosely coupled and highly cohesive (put another way, they do one thing really well).
  4. They are usually RESTful (to maintain simplicity). They receive a request, apply logic and produce a response. But they also support other interaction styles such as RPC, messages, events and streams.
  5. They are usually asynchronous and use simple message queues. The real intelligence of the system lies at either end of the queue (i.e. with the services).
  6. The complete path from build to production deployment should be automated (i.e. it should have CI/CD).
  7. In the microservice world, your single monolithic service could be replaced by hundreds of microservices. Design each microservice with the worst in mind; design for failure. A service being down in production is normal, and your system should handle it gracefully.
  8. Microservices put a lot of stress on real-time monitoring metrics, such as average response time, and on alerts such as a warning when a service is not responding.
  9. In an ideal scenario, an event bus should replace your operational database. Kafka is one such fast bus, and it's fast because it's dumb (i.e. it just delivers events).
  10. Microservices make your system/application more distributed, which in turn adds more complexity.

Found this video interesting as it talks about the challenges in implementing microservices, link.


Sunday, February 14, 2016

EJB Good Practices

EJBs abstract your middleware or business logic layer. They are transactional in nature, so when you hit your persistence layer (mostly through JPA), a transaction is already in place for your database session. As a result, either all DB operations complete or none of them do, i.e. an EJB operation is atomic. Let's cover some good practices:

Don't create EJB methods for CRUD operations

Imagine creating operations in your EJB for creating, fetching, updating or deleting your entity. That is not going to serve the purpose; quite clearly, CRUD operations are not your business logic!

In fact, CRUD operations will be part of your more sophisticated business operations. Say you want to transfer amount x from bank account A to another account B. There should be just a single method which reads the appropriate records from the DB, modifies them and performs the update.

Also, creating CRUD operations gives the impression that an EJB is created for each entity. We should create an EJB for a group of related problems, like an account manager, a policy manager etc. You can abstract your CRUD operations in the Data Access Layer though!
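
Here is a rough sketch of the "one business method instead of CRUD methods" idea, using the account transfer example. It is written in plain Java so that it runs without a container: AccountService and its in-memory map are made up for illustration. In a real application the class would be a session bean (for example annotated with @Stateless), the map would be replaced by a JPA-based data access layer, and the container would provide the transaction; none of that is shown here.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch; in a real EJB the container would wrap transfer() in a
// transaction. This version has NO rollback and only an in-memory store.
class AccountService {
    // Stand-in for the data access layer; a real one would use JPA.
    private final Map<String, BigDecimal> balances = new HashMap<>();

    void open(String accountId, BigDecimal openingBalance) {
        balances.put(accountId, openingBalance);
    }

    BigDecimal balanceOf(String accountId) {
        return balances.get(accountId);
    }

    // One business operation: read both accounts, validate, then update both.
    void transfer(String fromId, String toId, BigDecimal amount) {
        BigDecimal from = balances.get(fromId);
        BigDecimal to = balances.get(toId);
        if (from == null || to == null) {
            throw new IllegalArgumentException("unknown account");
        }
        if (amount.signum() <= 0 || from.compareTo(amount) < 0) {
            throw new IllegalArgumentException("invalid amount or insufficient funds");
        }
        // All checks passed, so both balances are changed together.
        balances.put(fromId, from.subtract(amount));
        balances.put(toId, to.add(amount));
    }

    public static void main(String[] args) {
        AccountService service = new AccountService();
        service.open("A", new BigDecimal("100"));
        service.open("B", new BigDecimal("50"));
        service.transfer("A", "B", new BigDecimal("30"));
        System.out.println("A=" + service.balanceOf("A") + " B=" + service.balanceOf("B"));
    }
}
```

The presentation layer calls only transfer(); the reads and updates stay hidden behind the business method, which is exactly where the transaction boundary belongs.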

Minimise Injecting EJBs in Presentation layer

Imagine yourself working in the presentation layer (Servlets, web services, ...) and having to deal with multiple EJBs to delegate calls to. You are going to struggle to find the appropriate EJB and then the method for delegating each incoming call, especially when someone else is taking care of the business layer!

This defeats the separation of concerns principle, which is important for managing a complex distributed system. So what's the solution? Bundle related EJBs in a single (super) EJB and inject this super EJB.

But make sure that, in doing so, you are not putting unrelated EJBs together just for the sake of minimizing the number of EJBs. Each EJB (including the super EJB) should adhere to the single responsibility principle.

Reusing EJB methods

Suppose you have quite complex use cases which result in a big EJB class definition. The obvious question is: how do you achieve reusability with EJB methods?

Just like with a normal Java class, you can create helper EJBs with reusable methods. Multiple EJBs can use the services provided by such a helper EJB. And to make these helper utilities clearer, you can put them in a utility or helper package inside your main EJB package.


would love to hear your suggestions/feedback about this post.