How to reduce AWS costs

We’ve decided to write a more in-depth article on cost-cutting measures beyond those described in a previous article. Please keep in mind this isn’t The Way To Do It™: in a perfect world everyone would have cost tags in place on every resource created, preferably via your-favourite-Infrastructure-as-Code strategy, which would make everyone’s life easier.

AWS service cost reductions

In this example, we’ve divided the analysis into seven areas, mirroring the AWS services that usually sit highest on the bill (in no specific order):

  • RDS
  • EC2
  • S3
  • Elastic Load Balancer
  • Elasticache
  • Route 53
  • ACM

In the Reduce AWS costs – example report we can see some of the details with the technical explanation of the cost reduction activities.

VPC Endpoint for Amazon S3

A VPC Endpoint enables you to reach selected AWS services privately from your VPC. It has several advantages: it allows finer-grained access control to your resources and avoids routing traffic through the Internet, which you would otherwise pay for.

In a typical Web Application, Amazon S3 is used to store static assets, such as images and CSS files, to improve your site’s performance and modularity. It also lets you keep those assets on a highly durable and available object store (99.999999999% durability).

Types of VPC Endpoints

AWS provides two types of VPC Endpoints:

  • Interface Endpoints – create an ENI (elastic network interface) in your VPC with a private IP. The service integrates with the internal DNS resolution of your VPC, which allows you to reach it from your subnets;
  • Gateway Endpoints – add a gateway that you reference in your route tables; they are currently available for Amazon S3 and DynamoDB, and you may add a Gateway Endpoint to each VPC that needs one (a quick CLI check for the supported endpoint types follows below).
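If you want to confirm which endpoint types a given service supports (and the exact service name to use later in Terraform), the AWS CLI can tell you. A minimal sketch, assuming the CLI is configured and using eu-west-1 as an example region:

# Show the service name and the endpoint types (Gateway and/or Interface) for S3
aws ec2 describe-vpc-endpoint-services \
  --service-names com.amazonaws.eu-west-1.s3 \
  --query 'ServiceDetails[].{Name:ServiceName,Types:ServiceType[].ServiceType}'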

Test the new Setup

Before going live, you should add a new subnet to your current setup/VPC, to make sure the scenario is tested before reaching Production. (We are using Terraform version 0.11 in these samples.)

We will start by creating a new subnet prv-subnet-1, on an existing VPC (vpc_id):

resource "aws_subnet" "prv-subnet-1" {
  vpc_id                  = "${var.vpc_id}"
  cidr_block              = "172.31.60.0/24"
  availability_zone       = "eu-west-1a"
  map_public_ip_on_launch = false
  
  tags = {
    Name        = "prv-subnet-1"
    Terraform   = "true"
  }
}
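As a quick sanity check after terraform apply (not part of the original walkthrough), you can confirm the subnet was created with public IP mapping disabled:

# Look the subnet up by its Name tag and print the relevant attributes
aws ec2 describe-subnets \
  --filters "Name=tag:Name,Values=prv-subnet-1" \
  --query 'Subnets[].{Id:SubnetId,Cidr:CidrBlock,AZ:AvailabilityZone,MapPublicIp:MapPublicIpOnLaunch}'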

Now, let’s add a new Route Table, using an existing NAT GW (natgw_id):

# create the route table
resource "aws_route_table" "test" {
  vpc_id = "${var.vpc_id}"

  # add default gw
  route {
    cidr_block      = "0.0.0.0/0"
    nat_gateway_id  = "${var.natgw_id}"
  }

  tags = {
    Name = "prv-eu-west-1a-rtb"
  }
}

# associate with subnet
resource "aws_route_table_association" "assoc" {
  subnet_id      = "${aws_subnet.prv-subnet-1.id}"
  route_table_id = "${aws_route_table.test.id}"
}

You may now add a Gateway VPC Endpoint (vpce) to your new Route Table, which is now associated with prv-subnet-1.

resource "aws_vpc_endpoint" "vpce-s3" {
  vpc_id              = "${var.vpc_id}"
  vpc_endpoint_type   = "Gateway"
  service_name        = "com.amazonaws.eu-west-1.s3"
  route_table_ids     = ["${aws_route_table.test.id}"]
}

After a few minutes you will see a new route entry whose destination is a prefix list (pl-123456ab) and whose target is the Gateway VPC Endpoint (vpce).
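If you prefer the CLI over the console, a rough way to inspect the endpoint and the routes it manages is shown below; VPC_ID and ROUTE_TABLE_ID are placeholders for your own identifiers:

# Gateway endpoints in the VPC and the route tables they are attached to
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query 'VpcEndpoints[?VpcEndpointType==`Gateway`].{Id:VpcEndpointId,Service:ServiceName,RouteTables:RouteTableIds}'

# The endpoint route shows up with the prefix list (pl-...) as its destination
aws ec2 describe-route-tables \
  --route-table-ids "$ROUTE_TABLE_ID" \
  --query 'RouteTables[].Routes[]'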

At this point, you should check that your access to S3 still works.
The first issue you might encounter is that bucket policies based on aws:SourceIp conditions with public IPs will no longer work. This is because you are now accessing S3 objects directly from your VPC, rather than through a public IP address.
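A simple smoke test, run from an instance inside prv-subnet-1 (the bucket and object names below are placeholders):

# Listing and downloading should keep working through the Gateway Endpoint
aws s3 ls s3://examplebucket/
aws s3 cp s3://examplebucket/some-test-object /tmp/some-test-object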

Let’s take a look at a sample bucket policy based on https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html, which uses an aws:SourceIp condition:

{
  "Version": "2012-10-17",
  "Id": "S3PolicyId1",
  "Statement": [
    {
      "Sid": "IPAllow",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::examplebucket/*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "54.240.143.188/32"}
      }
    }
  ]
}

The obvious change here would be to set aws:SourceIp to the CIDR block of your VPC. However, that will not work! The AWS documentation states:

You cannot use an IAM policy or bucket policy to allow access from a VPC IPv4 CIDR range (the private IPv4 address range). VPC CIDR blocks can be overlapping or identical, which may lead to unexpected results. Therefore, you cannot use the aws:SourceIp condition in your IAM policies for requests to Amazon S3 through a VPC endpoint.

So, you’re left with two options to allow/restrict access:

  • Restrict your policy to a specific VPC (aws:sourceVpc) or to a specific Gateway Endpoint (aws:sourceVpce);
  • On the VPC side, only associate the Gateway VPC Endpoint with the route tables of the subnets that need access.

Heads up: Even after raising the limits, you cannot have more than 255 gateway endpoints per VPC. (https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-endpoints).


Here are the changes you should make to the previous example:

{
  "Version": "2012-10-17",
  "Id": "S3PolicyId1",
  "Statement": [
    {
      "Sid": "IPAllow",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::examplebucket/*",
      "Condition": {
        "StringEquals": {
          "aws:sourceVpc": [
            "vpc-aabbccddeeff"
          ]
        }
      }
    }
  ]
}
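To apply and verify the updated policy, one option is the s3api CLI (a sketch; examplebucket is the placeholder name used in the policy above):

# Save the policy above as policy.json, then push it and read it back
aws s3api put-bucket-policy --bucket examplebucket --policy file://policy.json
aws s3api get-bucket-policy --bucket examplebucket --query Policy --output text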


Going into Production

Before going into Production, you should review all your bucket policies to make sure they follow the approach explained earlier. To keep the configuration clean and reusable, we’ve created a Terraform module:

# main.tf

resource "aws_vpc_endpoint" "s3" {
  vpc_id              = "${data.aws_vpc.selected.id}"
  vpc_endpoint_type   = "Gateway"
  service_name        = "com.amazonaws.${var.aws_region}.s3"
  route_table_ids     = ["${distinct(data.aws_route_table.selected.*.route_table_id)}"]
}
# data.tf

data "aws_subnet" "selected" {
  count      = "${length(var.subnets)}"
  cidr_block = "${var.subnets[count.index]}"
}

data "aws_route_table" "selected" {
  count     = "${length(var.subnets)}"
  subnet_id = "${data.aws_subnet.selected.*.id[count.index]}"
}

data "aws_vpc" "selected" {
  tags = "${map("Name",var.vpc_name)}"
}
# variables.tf

variable "aws_region" {
  description = "Region to attach S3 VPC Endpoint"
}

variable "tags" {
  description = "Tags to the resources"
  type = "map"
  default = {}
}

variable "vpc_name" {
  description = "VPC Name where to attach the S3 VPC Endpoint"
}

variable "subnets" {
  description = "List of Subnets to add the VPC Endpoint"
  type = "list"
}

Here is a sample usage of the module:

# s3-vpc-endpoint.tf

module "vpc-attach" {
  source        = "modules/terraform-aws-s3-vpc-gateway"
  aws_region    = "${var.aws_region}"
  vpc_name      = "${var.vpc_name}"
  subnets       = ["172.31.50.0/24", "172.31.51.0/24", "172.31.52.0/24"]
  tags = {
    Terraform   = "true"
    Environment = "production"
  }
}
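A typical rollout, limited to just this module while you test it (the module name vpc-attach matches the example above; -target keeps the rest of the state untouched):

# Preview and apply only the S3 VPC Endpoint module
terraform init
terraform plan -target=module.vpc-attach
terraform apply -target=module.vpc-attach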

Select a maintenance window to apply this module: the new endpoint will switch the network routes and, consequently, open TCP connections will be dropped.

From now on, connections from your VPC to S3 buckets in the same region will no longer go through the Internet, which reduces cost. And that is, apparently, important.

Reduce costs on AWS by not spending

This looks like something out of Captain Obvious’s journal, but it is in fact one of the ways we’ve been helping customers cut costs on AWS: stopping them from spending.

Usually we have access to an invoice which looks somewhat like the following:

AWS Service Charges:

  • CloudFront – $2400
  • CloudTrail – $901
  • CloudWatch – $124
  • Data Transfer – $4901
  • DynamoDB – $0
  • Elastic Compute Cloud – $28432
  • Simple Storage Service – $5326
  • Kinesis – $1143
There’s that big Elastic Compute Cloud line you can drill down on. However, to do it efficiently (and possibly allocate the costs internally), you need to be able to identify each of the billing components. That’s where tagging comes to the rescue: deploy your infrastructure with the corresponding cost tags (Prod/Dev; Marketing/Finance/etc.) on each resource and you’ll see the results at the end of the following month. To make things really easy, invest some time in deploying your resources with Terraform so the tags are applied from the start, which ensures you’re measuring costs right from day one. Use whichever tool you like to collect and analyse costs (Cost Explorer is a good choice).
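Once the cost allocation tags are activated in the Billing console, Cost Explorer can break the bill down per tag. A minimal sketch using the CLI; the dates and the Environment tag key are just examples:

# Monthly unblended cost grouped by the Environment cost tag
aws ce get-cost-and-usage \
  --time-period Start=2019-01-01,End=2019-02-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=TAG,Key=Environment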

Finally, a very frequent mistake is responsible for unusually high Data Transfer charges: go through all existing VPCs and make sure you have Gateway Endpoints for S3 and DynamoDB; otherwise you’ll be needlessly paying for traffic to AWS services.
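A quick audit for that mistake might look like the sketch below (assuming the AWS CLI is configured for the region in question): list every VPC and the Gateway Endpoints attached to it; an empty result means S3/DynamoDB traffic is leaving through the NAT gateway or the Internet.

# For every VPC in the region, print the Gateway endpoint services attached to it
for vpc in $(aws ec2 describe-vpcs --query 'Vpcs[].VpcId' --output text); do
  echo "== $vpc"
  aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=$vpc" \
    --query 'VpcEndpoints[?VpcEndpointType==`Gateway`].ServiceName' --output text
done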

For more in-depth cost reduction measures, call us.

Installing Kubernetes 1.13 on CentOS 7

In this post I will describe a Kubernetes 1.13 test install on our servers. Its main purpose is to give the team more in-depth knowledge of Kubernetes and its building blocks, as we are currently implementing OpenShift Origin and Amazon EKS. The goal is to implement something along the lines of:

(Diagram: Kubernetes cluster)

Requirements

Install 3 CentOS 7.6 servers (they can be virtual machines) with the following requirements (be aware that for a production cluster the requirements should be increased):

  • at least 2 vCPUs
  • 4 GB RAM for the master
  • 10 GB RAM for each of the worker nodes
  • 30 GB root disk (in a later post I will address some of the “hyper-converged” solutions – storage & compute – and in that scenario more than one disk is advised)

Next, set the network configuration on those Linux servers to match the above diagram (make sure that the hostname is set correctly).

1 – Set up name-based communication

All servers need to be able to resolve the names of the other nodes. That can be achieved by adding them to the DNS server zone or by adding the entries to /etc/hosts on all servers.

# vi /etc/hosts (on all 3 nodes)
    10.11.12.1 kube-master.install.etux master
    10.11.12.2 kube-node1.install.etux node1
    10.11.12.3 kube-node2.install.etux node2

2 – Disable SELinux and swap

Yeah, yeah… every time someone disables SELinux a kitten dies. Nevertheless, this is for demos and testing. Make sure the following commands are executed on all 3 nodes.

setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
swapoff -a
sed -i '/swap/s/^/#/g' /etc/fstab
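A quick check that the changes took effect (optional, on each node):

getenforce               # should print "Permissive" now, "Disabled" after a reboot
free -h | grep -i swap   # the swap line should show 0B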

3 – Enable br_netfilter

Depending on the network overlay you will be using, this step may or may not be needed. When choosing Flannel, run the following on all nodes:

modprobe br_netfilter
cat >> /etc/sysctl.conf <<EOF
net.bridge.bridge-nf-call-ip6tables=1
net.bridge.bridge-nf-call-iptables=1
EOF
sysctl -p
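To confirm the module is loaded and the bridge sysctls are active:

lsmod | grep br_netfilter
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables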

4 – Install docker-ce

Docker CE has a few interesting features that the Docker version shipped with CentOS 7 doesn’t have, one of them being multi-stage builds. For that reason, I chose docker-ce, installed by running the following on all nodes:

yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce device-mapper-persistent-data lvm2
systemctl enable --now docker
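A quick sanity check of the Docker installation; kubeadm will later warn if the Docker cgroup driver does not match the kubelet’s, so it is worth noting it here:

docker version
docker info 2>/dev/null | grep -i 'cgroup driver'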

5 – Install Kubernetes

The Kubernetes packages I used come from the project’s own repository. Please run the following commands on all nodes:

cat > /etc/yum.repos.d/kubernetes.repo <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg
       https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF

yum install -y kubelet kubeadm kubectl
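The upstream kubeadm guide also enables the kubelet service at this point; it will stay in a restart loop until kubeadm init/join provides its configuration, which is expected:

systemctl enable --now kubelet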

6 – Reboot all instances

Reboot all 3 instances to make sure they all start from the same state.

7 – Initialize Kubernetes

The first step is to initialize the master node with all the services required to bootstrap the cluster. This step should be done on the master node only. Make sure the pod network CIDR is a /16 (each node will “own” a /24) and that it doesn’t conflict with any other network you own:

# on master node only
kubeadm init --apiserver-advertise-address=10.11.12.1 --pod-network-cidr=10.11.0.0/16
(Output of running kubeadm init)

Before running kubeadm join on the other nodes, I first need to create the network overlay.
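One step from the kubeadm init output that is easy to miss: configure kubectl on the master so you can talk to the new cluster (this mirrors what kubeadm prints; the master will report NotReady until the overlay is installed):

mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
kubectl get nodes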

8 – Network overlay – Flannel

Wait a few seconds and proceed to the network overlay. There are several network plugins available to install (Flannel, Calico and Weave Net, among others).

Why did I choose Flannel? Well… because I think it was the easiest to install. I’m still assessing its features and comparing them to the other options to see which one is better suited to our customers.

# only on master
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml                                                                                                           

9 – Join other nodes to the Kubernetes cluster

Next, go to node1 and node2 and run the kubeadm join command that was displayed in the output of kubeadm init:

# On node1
kubeadm join 10.11.12.1:6443 --token 9yd8mu.w4invwht9c8gappw --discovery-token-ca-cert-hash sha256:dd0e7d4ee7e60577f923bb3abf7658ea16db018684aa283fb0ebae2ec14154d9

# On node2
kubeadm join 10.11.12.1:6443 --token 9yd8mu.w4invwht9c8gappw --discovery-token-ca-cert-hash sha256:dd0e7d4ee7e60577f923bb3abf7658ea16db018684aa283fb0ebae2ec14154d9

In the end…

In the end you should have a running cluster with some pods running:
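A minimal way to confirm that (both worker nodes should be Ready, and the kube-system and flannel pods Running):

kubectl get nodes -o wide
kubectl get pods --all-namespaces -o wide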

Next steps

The next steps for this demo cluster are to install a local registry, the web console and a load balancer (MetalLB), but that will be for another blog post.

CI/CD in OpenShift with Gitlab and Terraform

We’re always searching for new ways of implementing CI/CD at Eurotux, and in this post I’ll describe one of them, leveraging 3 components that we already use with our customers:

  • Gitlab
  • Terraform
  • OpenShift

Application

The application we wanted to deploy is Fess Enterprise Search Server (“Fess is an Elasticsearch-based search server”), and we use it to crawl our internal wiki server and give our teams a Google-like search engine for that wiki. Fess supports all sorts of targets (file servers, web sites, databases) and several authentication methods, such as BASIC/DIGEST/NTLM/FORM (keep this in mind for the next few minutes).

We use OpenShift, which is a container-based orchestration platform, so the first thing to do is to create a container image for the application. Fortunately, Fess already provides a base container image, which I’ll use as the base for this project and only improve on. The first step is to create a Dockerfile:

FROM docker.io/codelibs/fess:latest
RUN perl -i -p -e "s/crawler.document.cache.enabled=true/crawler.document.cache.enabled=false/" /etc/fess/fess_config.properties
ADD logo-head.png /usr/share/fess/app/images/logo.png
ADD osdd.xml /usr/share/fess/app/WEB-INF/orig/open-search/osdd.xml
ADD logo-head.png /usr/share/fess/app/images/logo-head.png
RUN apt-get update && apt-get -y install libjson-perl
COPY entrypoint.sh /usr/share/fess/run.sh

ADD insert.sh /insert.sh

# We are using oc 3.9 because the later ones require libcrypto.so.10 (see https://github.com/openshift/origin/issues/21061)
RUN wget https://mirror.openshift.com/pub/openshift-v3/clients/3.9.63/linux/oc.tar.gz && tar zxf oc.tar.gz -C /usr/bin && rm oc.tar.gz

ADD fess_config.properties /etc/fess/fess_config.properties

I did some customization, such as changing the logo to our company’s, changing the entrypoint, and adding the oc (OpenShift client) command. As you can easily understand, our internal wiki is password-protected. It uses form-based username/password authentication (now you see why it’s great that Fess supports form-based authentication), so I only need to provide the Fess server with the username and password to access the wiki.

The entrypoint is changed so that, when the container starts, it fetches the username and password from OpenShift secrets (that’s why the container installs the oc command), updates the Fess server configuration and starts indexing the wiki. As this is a stateless service, I don’t need to worry about saving state or using Persistent Volumes: if the container dies or gets redeployed, the search engine simply re-indexes our wiki. This keeps the project simpler and cleaner. Here is a snippet of the insert.sh script:

if [ -z "$WIKIUSER" ]; then
    export WIKIUSER="`oc get secret wikiuser --template='{{.data.username}}' | base64 -d`"
fi
if [ -z "$WIKIPASS" ]; then
    export WIKIPASS="`oc get secret wikiuser --template='{{.data.password}}' | base64 -d`"
fi

curl -XPOST "http://localhost:9200/.fess_config.web_authentication/web_authentication" -H 'Content-Type: application/json' -d "
{
           \"webConfigId\" : \"$CONFIGID\",
           \"updatedTime\" : 1509224726193,
           \"hostname\" : \"wiki.eurotux.com\",
           \"password\" : \"$WIKIPASS\",
           \"updatedBy\" : \"admin\",
           \"createdBy\" : \"admin\",
           \"createdTime\" : 1509224726193,
           \"protocolScheme\" : \"FORM\",
           \"username\" : \"$WIKIUSER\",
           \"parameters\" : \"encoding=UTF-8\\nlogin_method=POST\\nlogin_url=https://wiki.eurotux.com/Special:UserLogin\\nlogin_parameters=username=\${username}&password=\${password}&auth_id=1&deki_buttons%5Baction%5D%5Blogin%5D=login\"
}"

Terraform

We use Terraform to bootstrap the infrastructure required for the deployment of this application. It is responsible for the following:

  • OpenShift Project (Namespace)
  • Secrets (wiki username and password)
  • Granting the container’s default service account permission to access the secret (so that the container can fetch that info)
  • Granting the GitLab runner service account edit rights on this namespace’s objects (so that the deployment pipeline can deploy to this namespace)
  • Adding the anyuid SCC to the deployer service account. The Fess container runs several services (which is actually an anti-pattern in the container world) and requires running as root inside the container (later on it switches to another UID)

Unfortunately, the Terraform kubernetes provider is somewhat lacking in features compared to others (like the aws or azure providers). Because of that, I use a mix of native resources, like kubernetes_namespace, and null_resource as a wrapper around the oc command:

# Create namespace
resource "kubernetes_namespace" "search" {
  metadata {
    annotations {
      name = "search-engine"
    }

    labels {
      owner = "npf"
    }

    name = "${var.namespace}"
  }

  lifecycle {
    # because we are using openshift, we have to ignore the annotations as openshift does add some annotations
    ignore_changes = ["metadata.0.annotations"]
  }
}
# This container requires root, so we need to allow anyuserid
resource "null_resource" "add-scc-anyuid" {
  provisioner "local-exec" {
    command = "oc -n ${kubernetes_namespace.search.id} adm policy add-scc-to-user anyuid -z deployer"
  }

  provisioner "local-exec" {
    command = "oc -n ${kubernetes_namespace.search.id} adm policy remove-scc-from-user anyuid -z deployer"
    when    = "destroy"
  }
}

As you can see, I use local-exec to spawn the oc command wherever the kubernetes Terraform provider lacks support for those features. Here is the result of a terraform apply:

Gitlab

At Eurotux we use an internal GitLab server to house all our projects, so we make extensive use of its CI/CD capabilities. To implement the CI/CD I’ve created a .gitlab-ci.yml file describing the pipeline:

image: $CI_REGISTRY/docker/base-builder

stages:
  - review
  - staging
  - production
  - cleanup

variables:
  OPENSHIFT_SERVER: https://oshift.install.etux:8443
  OPENSHIFT_DOMAIN: oshift.install.etux

.deploy: &deploy
  tags:
    - kubernetes
  before_script:
    - ci-bootstrap
  script:
    - "oc -n $CI_PROJECT_NAME get services $APP 2> /dev/null || oc -n $CI_PROJECT_NAME new-app fess --name=$APP --strategy=docker"
    - "oc -n $CI_PROJECT_NAME start-build $APP --from-dir=fess --follow || sleep 3s && oc -n $CI_PROJECT_NAME start-build $APP --from-dir=fess --follow"
    - "oc -n $CI_PROJECT_NAME get routes $APP 2> /dev/null || oc -n $CI_PROJECT_NAME create route edge --hostname=$APP_HOST --insecure-policy=Redirect --service=$APP"
......
......
staging:
  <<: *deploy
  stage: staging
  tags:
    - kubernetes
  variables:
    APP: staging
    APP_HOST: $CI_PROJECT_NAME-staging.$OPENSHIFT_DOMAIN
  environment:
    name: staging
    url: http://$CI_PROJECT_NAME-staging.$OPENSHIFT_DOMAIN
  only:
    - master

production:
  <<: *deploy
  stage: production
  tags:
    - kubernetes
  variables:
    APP: production
    APP_HOST: $CI_PROJECT_NAME.$OPENSHIFT_DOMAIN
  when: manual
  environment:
    name: production
    url: http://$CI_PROJECT_NAME.$OPENSHIFT_DOMAIN
  only:
    - master

The pipeline creates a review application when working on a git branch other than master, so that I can review and fix things. When a merge (or a commit, for that matter) lands on master, it deploys automatically to staging, and then I can press play to deploy to production. Here is an example of the pipeline:

Here is a snippet of the pipeline running:

After that, I can browse to https://search.oshift.install.etux/ and I’m presented with the search engine web page:
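As an extra check (not part of the pipeline itself), the deployed objects can be inspected with the oc client; search-engine is an assumed project name here, adjust it to your GitLab project name:

# Pods, services and routes created by the pipeline, plus recent builds
oc -n search-engine get pods,svc,routes
oc -n search-engine get builds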

OpenShift

As you’ve figured out by now, all of this is running in our testing OpenShift cluster. We are using OpenShift 3.11, which features monitoring with Prometheus and Grafana (later on, I will detail some other interesting features, such as the integration with Keycloak). OpenShift automatically provides some Grafana dashboards so that you can see the usage patterns:

One of the interesting things these dashboards show is the lifecycle of the application (new containers starting and old ones stopping).