How To Install an Elasticsearch Cluster on Ubuntu 18.04

Elasticsearch is a platform for distributed search and data analysis in real-time. It offers a multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents with a simple installation.

By Elasticsearch, you can execute and combine various types of searches giving you the like Kibana, Logstash, X-Pack, etc., Elasticsearch can collect and monitor Big Data at a large scale. This Elasticsearch cluster includes three data nodes and with this, we could avoid a split-brain and have a quorum of master-eligible nodes.

In this free and ultimate tutorial, we will be going to learn how to install and configure a 3 Node Elasticsearch Cluster on Ubuntu 18.04 and with this you can go through some API examples on creating indexes, ingesting documents, searches, etc.

What is ElasticSearch?
Basic Concepts of Elasticsearch
Note on Master Elections
Prerequisites
Installing Elasticsearch on Ubuntu
Configuring Elasticsearch
Start Elasticsearch
Using Elasticsearch Restful API
View Cluster Health
View the Number of Nodes in your Cluster
Create your first Elasticsearch Index
Replicating a Yellow Cluster Status
Ingest Data into Elasticsearch
Search Queries
Delete your Index

What is ElasticSearch?

ElasticSearch is a highly scalable open-source analytics engine, RESTful search engine built on top of Apache Lucene and issued under an Apache license. It is the most famous search engine and is generally used for full-text search, log analytics, security intelligence, business analytics, analyze big volumes of data faster and in near real-time, etc. Also, Elasticsearch is Java-based and can search and index document files in different formats.

Before we get to the main topic, let’s cover some basics about Elasticsearch from below:

Basic Concepts of Elasticsearch

An Elasticsearch Cluster is made up of a number of nodes;
Each Node includes Indexes, where an Index is a Collection of Documents;
Master nodes are subjective for Cluster related tasks, creating/deleting indexes, tracking of nodes, allocate shards to nodes;
Data nodes are liable for hosting the actual shards that have the indexed data also handles data related operations like CRUD, search, and aggregations;
Indexes are split into Multiple Shards;
Shards exist of Primary Shards and Replica Shards;
A Replica Shard is a Copy of a Primary Shard that is used for HA/Redundancy;
Shards get placed on random nodes throughout the cluster;
A Replica Shard will NEVER be on the same node as the Primary Shard’s associated shard-id.

Note on Master Elections

The least number of master eligible nodes that want to join a newly elected master in order for an election is configured via the setting discovery.zen.minimum_master_nodes. This configuration is very powerful, as it makes each master-eligible node informed of the minimum number of master-eligible nodes that must be visible in order to form a cluster.

Without this setting or incorrect configuration, this might lead to a split-brain, where let’s say something went wrong and upon nodes rejoining the cluster, it may form 2 different clusters, which we want to avoid at all costs.

From consulting elasticsearch documentation, to avoid a split brain, this setting should be set to a quorum of master-eligible nodes via the following formula:

(master_eligible_nodes / 2) + 1
# in our case:
(3/2) + 1 = 2

It is advised to evade having only two master eligible nodes since a quorum of two is two. To read more on elasticsearch cluster master election process, take a look at their documentation

Prerequisites

We have to set the internal IP addresses of our nodes to either our hosts’ file or DNS server. To keep it easy & straightforward, I will add them to my host file. This needs to apply to both nodes:

$ sudo su - 
$ cat > /etc/hosts << EOF
127.0.0.1 localhost
172.31.0.77 es-node-1
172.31.0.45 es-node-2
172.31.0.48 es-node-3
EOF

Now that our host entries are set, we can start with the fun stuff.

Installing Elasticsearch on Ubuntu

The following instructions and directions should be implemented to both nodes.

Get the Elasticsearch repositories and update your system so that your servers are aware of the newly added Elasticsearch repository:

$ apt update && apt upgrade -y
$ apt install software-properties-common python-software-properties apt-transport-https -y
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
$ apt update

Elasticsearch relies on Java, so install the java development kit:

$ apt install default-jdk -y

Verify that java is installed:

$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment (build 11.0.3+7-Ubuntu-1ubuntu218.04.1)
OpenJDK 64-Bit Server VM (build 11.0.3+7-Ubuntu-1ubuntu218.04.1, mixed mode, sharing)

Install Elasticsearch:

$ apt install elasticsearch -y

Once Elasticsearch is installed, repeat these steps on the second node. Once that is done, move on to the configuration section.

Configuring Elasticsearch

For nodes to join the same cluster, they should all share the same cluster name.

We also need to specify the discovery hosts as the masters so that the nodes can be discoverable. Since we are installing a 3 node cluster, all nodes will contribute to a master and data node role.

Feel free to inspect the Elasticsearch cluster configuration, but I will be overwriting the default configuration with the config that I need.

Make sure to apply the configuration on both nodes:

$ cat > /etc/elasticsearch/elasticsearch.yml << EOF
cluster.name: es-cluster
node.name: \${HOSTNAME}
node.master: true
node.data: true
path.logs: /var/log/elasticsearch
path.data: /usr/share/elasticsearch/data
bootstrap.memory_lock: true
network.host: 0.0.0.0
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2"]
EOF

Important settings for your elasticsearch cluster is described on their docs:

Disable swapping
Increase file descriptors
Ensure sufficient virtual memory
Ensure sufficient threads
JVM DNS cache settings
Temporary directory not mounted with noexec

Increase the file descriptors on the nodes, as instructed by the documentation:

$ cat > /etc/default/elasticsearch << EOF
ES_STARTUP_SLEEP_TIME=5
MAX_OPEN_FILES=65536
MAX_LOCKED_MEMORY=unlimited
EOF

Ensure that pages are not swapped out to disk by requesting the JVM to lock the heap in memory by setting LimitMEMLOCK=infinity.

Set the maximum file descriptor number for this process: LimitNOFILE and increase the number of threads using LimitNPROC:

$ vim /usr/lib/systemd/system/elasticsearch.service
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65535
LimitNPROC=4096
...

Increase the limit on the number of open files descriptors to user elasticsearch of 65536 or higher

$ cat > /etc/security/limits.conf << EOF
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
EOF

Increase the value of the map counts as elasticsearch uses maps directory to store its indices:

$ sysctl -w vm.max_map_count=262144

For a permanent setting, update vm.max_map_count in /etc/sysctl.conf and run:

$ sysctl -p /etc/sysctl.conf

Change the permissions of the elasticsearch data path, so that the elasticsearch user and group has permissions to read and write from the configured path:

$ chown -R elasticsearch:elasticsearch /usr/share/elasticsearch

Make sure that you have applied these steps to all the nodes before continuing.

Start Elasticsearch

Enable Elasticsearch on boot time and start the Elasticsearch service:

$ systemctl enable elasticsearch
$ systemctl start elasticsearch

Verify that Elasticsearch is running:

$ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::9200                 :::*                    LISTEN      278/java
tcp6       0      0 :::9300                 :::*                    LISTEN      278/java

Using Elasticsearch Restful API

In this section we will get comfortable with using Elasticsearch API, by covering the following examples:

Cluster Overview;
How to view Node, Index, and Shard information;
How to Ingest Data into Elasticsearch;
Who to Search data in Elasticsearch;
How to delete your Index

View Cluster Health

From any node, use an HTTP client such as curl to investigate the current health of the cluster by looking at the cluster API:

$ curl -XGET http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

As you can see the cluster status is Green, which means everything works as expected.

In Elasticsearch you get Green, Yellow and Red statuses. Yellow would essentially mean that one or more replica shards are in an unassigned state. Red status means that some or all primary shards are unassigned which is really bad.

From this output, we can also see the number of data nodes, primary shards, unassigned shards, etc.

This is a good place to get an overall view of your Elasticsearch cluster’s health.

View the Number of Nodes in your Cluster

By looking at that /_cat/nodes API we can get information about our nodes that is part of our cluster:

$ curl -XGET http://localhost:9200/_cat/nodes?v
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.31.0.77              10          95   0    0.00    0.00     0.00 mdi       -      es-node-1
172.31.0.45              11          94   0    0.00    0.00     0.00 mdi       -      es-node-2
172.31.0.48              25          96   0    0.07    0.02     0.00 mdi       *      es-node-3

As you can see, we can see information about our nodes such as the JVM Heap, CPU, Load averages, node role of each node, and which node is master.

As we are not running dedicated masters, we can see that node es-node-3 got elected as master.

Create your first Elasticsearch Index

Note that when you create an index, the default primary shards are set to 5 and the default replica shard count is set to 1. You can change the replica shard count after an index has been created, but not the primary shard count as that you will need to set on index creation.

Let’s create an Elasticsearch index named myfirstindex:

$ curl -XPUT http://localhost:9200/myfirstindex
{"acknowledged":true,"shards_acknowledged":true,"index":"myfirstindex"}

Now that your index has been created, let’s have a look at the /_cat/indices API to get information about our indices:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   1          0            0      1.2kb           650b

From the output you will find that we have 5 primary shards and 1 replica shard, with 0 documents in our index and that our cluster is in a green state, meaning that our primary and replica shards have been assigned to the nodes in our cluster.

Note that a replica shard will NEVER reside on the same node as the primary shard for HA and Redundancy.

Let’s go a bit deeper and have a look at the shards, to see how our shards are distributed through our cluster, using the /_cat/shards API:

$ curl -XGET http://localhost:9200/_cat/shards?v
index        shard prirep state   docs store ip             node
myfirstindex 1     r      STARTED    0  230b 172.31.0.77    es-node-2
myfirstindex 1     p      STARTED    0  230b 172.31.0.48    es-node-3
myfirstindex 4     p      STARTED    0  230b 172.31.0.48    es-node-3
myfirstindex 4     r      STARTED    0  230b 172.31.0.77    es-node-1
myfirstindex 2     r      STARTED    0  230b 172.31.0.45    es-node-2
myfirstindex 2     p      STARTED    0  230b 172.31.0.77    es-node-1
myfirstindex 3     p      STARTED    0  230b 172.31.0.45    es-node-2
myfirstindex 3     r      STARTED    0  230b 172.31.0.48    es-node-3
myfirstindex 0     p      STARTED    0  230b 172.31.0.45    es-node-2
myfirstindex 0     r      STARTED    0  230b 172.31.0.77    es-node-1

As you can see each replica shard of its primary is spread on different nodes.

Replicating a Yellow Cluster Status

For a yellow cluster status, we know that it’s when one or more replica shards are in an unassigned state.

So let’s replicate that behavior by scaling our replica count to 3, which would mean that 5 replica shards will be in an unassigned state:

$ curl -XPUT -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_settings -d \
'{"number_of_replicas": 3}'

Now we have scaled the replica count to 3, but since we only have 3 nodes, we will have a yellow state cluster:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   3          0            0      3.3kb           1.1kb

The cluster health status should show the number of unassigned shards, and while they are unassigned we can verify that by looking at the shards API again:

$ curl -XGET http://localhost:9200/_cat/shards?v
index        shard prirep state      docs store ip             node
myfirstindex 1     r      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 1     p      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 1     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 1     r      UNASSIGNED
myfirstindex 4     r      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 4     p      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 4     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 4     r      UNASSIGNED
myfirstindex 2     r      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 2     r      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 2     p      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 2     r      UNASSIGNED
myfirstindex 3     p      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 3     r      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 3     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 3     r      UNASSIGNED
myfirstindex 0     p      STARTED       0  230b 172.31.0.45    es-node-2
myfirstindex 0     r      STARTED       0  230b 172.31.0.48    es-node-3
myfirstindex 0     r      STARTED       0  230b 172.31.0.77    es-node-1
myfirstindex 0     r      UNASSIGNED

At this point in time, we could either add another node to the cluster or scale the replication factor back to 1 to get the cluster health to green again.

I will scale it back down to a replication factor of 1:

$ curl -XPUT http://localhost:9200/myfirstindex/_settings -d '{"number_of_replicas": 1}'

Ingest Data into Elasticsearch

We will ingest 3 documents into our index, this will be a simple document consisting of a name, country and gender, for example:

{ 
  "name": "james", 
  "country": "south africa", 
  "gender": "male"
}

First, we will ingest the document using a PUT HTTP method, when using a PUT method, we need to specify the document ID.

PUT methods will be used to create or update a document. For creating:

$ curl -XPUT -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_doc/1 -d '
{"name":"james", "country":"south africa", "gender": "male"}'

Now you will find we have one index in our cluster:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   1          1            0     11.3kb          5.6kb

Since we know that the document ID is “1”, we can do a GET on the document ID to read the document from the index:

$ curl -XGET http://localhost:9200/myfirstindex/people/1?pretty
{
  "_index" : "myfirstindex",
  "_type" : "people",
  "_id" : "1",
  "found" : false
}

If we ingest documents with a POST request, Elasticsearch generates the document ID for us automatically. Let’s create 2 documents:

$ curl -XPOST -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_doc/ -d '
{"name": "kevin", "country": "new zealand", "gender": "male"}'

$ curl -XPOST -H 'Content-Type: application/json' \
http://localhost:9200/myfirstindex/_doc/ -d '
{"name": "sarah", "country": "ireland", "gender": "female"}'

When we have a look again at our index, we can see that we now have 3 documents in our index:

$ curl -XGET http://localhost:9200/_cat/indices?v
health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myfirstindex xSX9nOQJQ2qNIq4A6_0bTw   5   1          3            0       29kb         14.5kb

Search Queries

Now that we have 3 documents in our elasticsearch index, let’s explore the search APIs to get data from our index. First, let’s search for the keyword “sarah” as a source query parameter:

$ curl -XGET 'http://localhost:9200/myfirstindex/_search?q=sarah&pretty'
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "cvU96GsBP0-G8XdN24s4",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "sarah",
          "country" : "ireland",
          "gender" : "female"
        }
      }
    ]
  }
}

We can also narrow our search query down to a specific field, for example, show me all the documents with the name kevin:

$ curl -XGET 'http://localhost:9200/myfirstindex/_search?q=name:kevin&pretty'
{
...
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "gPU96GsBP0-G8XdNHoru",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "kevin",
          "country" : "new zealand",
          "gender" : "male"
        }
      }
    ]
  }
}

With Elasticsearch we can also search with our query in the request body, a similar query as above would look like this:

$ curl -XPOST -H 'Content-Type: application/json' \
'http://localhost:9200/myfirstindex/_search?pretty' -d '
{
  "query": {
    "match": {
      "name": "kevin"
    }
  }
}'

{
...
        "_index" : "myfirstindex",
        "_source" : {
          "name" : "kevin",
          "country" : "new zealand",
          "gender" : "male"
        }
...
}

We can use wildcard queries:

$ curl -XPOST -H 'Content-Type: application/json' \
'http://172.31.0.77:9200/myfirstindex/_search?pretty' -d '
{
  "query": {
    "wildcard": {
      "country": "*land"
    }
  }
}'

{
...
    "hits" : [
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "cvU96GsBP0-G8XdN24s4",
        "_score" : 1.0,
        "_source" : {
          "name" : "sarah",
          "country" : "ireland",
          "gender" : "female"
        }
      },
      {
        "_index" : "myfirstindex",
        "_type" : "_doc",
        "_id" : "gPU96GsBP0-G8XdNHoru",
        "_score" : 1.0,
        "_source" : {
          "name" : "kevin",
          "country" : "new zealand",
          "gender" : "male"
        }
      }
    ]
...
}

Have a look at their documentation for more information on the Search API

Delete your Index

To wrap this up, we will go ahead and delete our index:

$ curl -XDELETE http://localhost:9200/myfirstindex

Going Further

If this got you curious, then definitely have a look at this Elasticsearch Cheatsheet that I’ve put together and if you want to generate lots of data to ingest to your elasticsearch cluster, have a look at this python script.

Our other links related to ELK:

How To Install Logstash on Ubuntu and Debian
Monitoring Linux logs using Kibana and rsyslog