We have a web application that has been running on AWS for several years. As application load balancers and the AWS WAF service was not available, we utilised and external classic ELB point to a pool of EC2 instances running mod_security as our WAF solution. Mod_security was using the OWASP Mod_security core rule set.
Now that Application Load Balancers and AWS WAFs are available, we would like to remove the CPU bottleneck which stems from using EC2 instances with mod security as the current WAF.
Step 1 – Base-lining performance with EC2 WAF solution.
The baseline was completed using https://app.loadimpact.com where we ran 1000 concurrent users, with immediate rampup. On our test with 2 x m5.large EC2 instances as the WAF, the WAFs became CPU pinned within 2mins 30 seconds.
This test was repeated with the EC2 WAFs removed from the chain and we averaged 61ms across the loadimpact test with 1000 users. So – now we need to implement the AWS WAF solution so that can be compared.
Step 2 – Create an ‘equivalent’ rule-set and start using AWS WAF service.
So – getting started with the AWS WAF documentation we read, ‘define your conditions, combine your conditions into rules, and combine the rules into a web ACL.
Conditions: Request strings, source IPs, Country/Geo location of request IP, Length of specified parts of the requests, SQL code (SQL injection), header values (i.e.: User-Agent). Conditions can be multiple values and regex.
Rules: Combinations of conditions along with an ACTION (allow/block/count). There are Regular rules whereby conditions can be and/or chained. Rate-based rules where by the addition of a rate-based condition can be added.
Web ACLs: Whereby the action for rules are defined. Multiple rules can have the same action, thus be grouped in the same ACL. The WAF uses Web ACLs to assess requests against rules in the order which the rules are added to the ACL, whichever/if any rules is matched first defines which action is taken.
Starting simple: To get started I will implement a rate limiting rule which limits 5 requests per minute to our login page from a specified IP along with the basic OWASP rules from terraform code upload by traveloka . Below is our main.tf with the aws_waf_owasp_top_10_rules created for this test.
main.tf which references our newly created aws_waf_owasp_top_10_rules module
ab -v 3 -n 2000 -c 100 https://am.ameu.sonet.com.au/login > ab_2000_100_waf_test.log
This command logs request headers (-v 3 for verbosity of output), makes 2000 requests (-n 2000) and conducts those request 100 concurrently (-c 100). I can then see failed requests by tailing the output:
All looks good for the rate limiting based blocking, though it appears that blocking does not occur are exactly 2000 requests in the 5 minute period. It also appears that there is a significant (5-10min) delay on metrics coming through to the WAF stats in the AWS console.
After success on the rate limiting rule, the OWASP Top 10 mitigation rules need to be tested. I will use Owasp Zap to generate some malicious traffic and see when happen!
So it works – which is good, but I am not really confident about the effectiveness of the OWASP rules (as implemented on the AWS WAF). For now, they will do… but some tuning will probably be desirable as all of the requests OWASP ZAP made contained (clearly) malicious content but only 7% (53 / 755) of the requests were blocked by the WAF service. It will be interesting to see if there are false positives (valid requests that are blocked) when I conduct step 4, performance testing.
Step 4 – Conduct performance test using AWS WAF service, and
Conducting a load test with https://app.loadimpact.com demonstrated that the AWS WAF service is highly unlikely to become a bottleneck (though this may differ for other applications and implementations).
Step 5 – Migrate PROD to the AWS WAF service.
Our environment is fully ‘terraformed’, implementing the AWS WAF service as part of our terraform code was working within an hour or so (which is good time for me!).
With almost all of our clients now preferring AWS and Azure for hosting VMs / Docker containers we have to manage a lot of AMIs / VM images. Ensuring that these AMIs are correctly configured, hardened and patched is a critical requirement. To do this time and cost effectively, we use packer and ansible. There are solutions such as Amazon’s ECS that extend the boundary of the opaque cloud all the way to the containers, which has a number of benefits but does not currently meet a number of non-functional requirements for most of our clients. If those non-functional requirements we gone, or met by something like AWS ECS, it would be hard to argue against simply using terraform and ecs – removing our responsibility for managing the docker host VMs.
Anyway, we are making some updates to our IaaS code base which includes a number of new requirements and code changes to our packer and ansible code. To make these changes correctly and quickly I need a build/test cycle that is as short as possible (shorted than spinning up a new EC2 instance). Fortunately, one of the benefits of packer is the ‘cloud agnosticism’… so theoretically I should be able to test 99% of my new packer and ansible code on my windows 10 laptop using packer’s Hyper-V Builder.
I am running Windows 10 Pro on a Dell XPS 15 9560. VirtualBox is the most common go-to option for local vm testing but thats not an option if you are already running Hyper-V (which I am). So to get things started we need to:
Have a git solution for windows – I am using Microsoft’s VS Code (which is really a great opensource tool from M$)
Install Hyper-V Linux Integration Services on the Centos 7 base VM (this is required for Packer to be able to determine new VMs’ IP addresses) – if you are stuck with packer failing to connect with SSH to the VM and you are using a Hyper-V switch this will most likely be the issue
Add a Hyper-V builder to our packer.json (as below)
A lot of people need to do offsite backups for AWS RDS – which can be done trivially within AWS. If you need offsite backups to protect you against things like AWS account breach or AWS specific issues – offsite backups must include diversification of suppliers.
I am going to use Amazon’s Data Migration service to replicate AWS RDS data to a VM running in Azure and set up snapshots/backups of the Azure hosts.
The steps I used to do this are:
Set up an Azure Windows 2016 VM
Create an IPSec tunnel between the Azure Windows 2016 VM and my AWS Native VPN
Install matching version of Oracle on the Windows 2016 VM
Configure Data Migration service
Create a data migration and continuous replication task
Snapshots/Backups and Monitoring
Debug and Gotchyas
1,2 – Set up Azure Windows 2016 VM and IPSec tunnel
Create Network on Azure and place a VM in the network with 2 interfaces. One interface must have an public IP, call this one ‘external’ and the other inteface will be called ‘internal’ – Once you have the public IP address of your Windows 2016 VM, create a ‘Customer Gateway’ in your AWS VPC pointing to that IP. You will also need a ‘Virual Private Gateway’ configured for that VPC. Then create a ‘Site-to-Site VPN connection’ in your VPC (it won’t connect for now but create it anyway). Configure your Azure Win 2016 VM to make an IPSec tunnel by following these instructions (The instructions are for 2012 R2 but the only tiny difference is some menu items): https://docs.aws.amazon.com/vpc/latest/adminguide/customer-gateway-windows-2012.html#cgw-win2012-download-config. Once this is completed both your AWS site-to-site connection and your Azure VM are trying to connect to each other. Ensure that the Azure VM has its security groups configured to allow your AWS site-to-site vpn to get to the Azure VM (I am not sure which ports and protocols specifically, I just white-listed all traffic from the two AWS tunnel end points. Once this is done it took around 5 mins for the tunnel to come up (I was checking the status via the AWS Console), I also found that it requires traffic to be flowing over the link, so I was running a ping -t <aws_internal_ip> from my Azure VM. Also note that you will need to add routes to your applicable AWS route tables and update AWS security groups for the Azure subnet as required.
3 – Install matching version of Oracle on the Windows 2016 VM
4,5 – Configure Data Migration service and migration/replication
Log into your AWS console and go to ‘Data Migration Service’ / ‘DMS’ and hit get started. You will need to set up a replication VM (well atleast pick a size, security group, type etc). Note that the security group that you add the replication host to must have access to both your RDS and your Azure DBs – I could not pick which subnet the host went into so I had to add routes for a couple more subnets that expected. Next you will need to add your source and target databases. When you add in the details and hit test the wizard will confirm connectivity to both databases. I ran into issue on both of these points because of not adding the correct security groups, the windows firewall on the Azure VM and my VPN link dropping due to no traffic (I am still investigating a fix better than ping -t for this). Next you will be creating a migration/replication task, if you are going to be doing ongoing replication you need to run the following on your Oracle RDS db:
You can filter by schema, which should provide you with a drop down box to select which schema/s. Ensure that you enable logging on the migration/replication task (if you get errors, which I did the first couple of attempts, you won’t be fixing anything without the logs.
6 – Snapshots and Monitoring
For my requirements, daily snapshots/backups of the Azure VM will provide sufficient coverage. The Backup vault must be upgraded to v2 if you are using a Standrd SSD disk on the Azure VM, see: https://docs.microsoft.com/en-us/azure/backup/backup-upgrade-to-vm-backup-stack-v2#upgrade . To enable email notifications for Azure backups, go to the azure portal, select the applicable vault, click on ‘view alerts’ -> ‘Configure notifications’ -> enter an email address and check ‘critical’ (or what type of email notifications you want. Other recommended monitoring checks include: ping for VPN connectivity, status check of DMS task (using aws cli), SQL query on destination database confirming latest timestamp of a table that should have regular updates.
7 – Debug and Gotchyas
Azure security group allowing AWS vpn tunnel endpoint to Azure VM
Windows firewall rule on VM allowing Oracle traffic (default port 1521) from AWS RDS private subnet
Route tables on AWS subnets to route traffic to your Azure subnet via the Virtual Private Network
Security groups on AWS to allow traffic from Azure subnet
Stability of the AWS <–> Azure VM site-to-site tunnel requires constant traffic
The DMS replication host seems to go into an arbitrary subnet of your VPC (there probably some default setting I didn’t see) but check this and ensure it has routes for the Azure site-to-site
Ensure the RDS Oracle database has the archive log retention and supplemental logs settings as per steps 4,5.
It has been a long while since I looked at RDS – with Azure, Office 365 and Server 2016 there seems to be a lot of new (or better) options. To get across some of the options I have decided to do a review of Microsoft’s documentation with the aim of deciding on a solution for a client. The specific scenario I am looking at is a client with low spec workstations, using Office 365 Business Premium (including OneDrive), Windows 10 and have a single Windows 2016 Virtual Private Server.
Some desired features:
Users should be able to use their workstations or the remote desktop server interchangeably
Everything done on workstations should be replicated to the RDS server and visa-versa
Contention on editing documents should be dealt with reasonably
The credential for signing into workstations, email and remote services should be the same (ideally with a 2FA option for RDS)
The Office 365 users were created several months before the RDS server was deployed
The Azure AD connect service which synchronizes users in an Active Directory deployment with Office 365 user (Azure AD) is a one way street, assuming the ‘on-prem’ active directory object exist already and only need to be create in Azure AD (Office 365) – see the work around for this here
Office 365 licensing for ‘shared’ computers means that Office 365 Business Premium users can’t use a VPS – so entrerprise plans of business plus must be used.
How to configure Remote Desktop Services? After getting Active Directory installed and configure to sync with Azure AD I now need choose and implement the RDS configuration.
Create an AD service and link it to office 365 with Azure AD Connect
As Microsoft says:
You still must have an internet-facing server to utilize RD Web Access and RD Gateway for external users
You still must have an Active Directory and–for highly available environments–a SQL database to house user and Remote Desktop properties
You still must have communication access between the RD infrastructure roles (RD Connection Broker, RD Gateway, RD Licensing, and RD Web Access) and the end RDSH or RDVH hosts to be able to connect end-users to their desktops or applications.
After setting all of this up I am very happy with the results. The single source of truth for user must be the ‘on-prem’ AD. Syncing an on-prem AD service to Office 365 is almost seemless with some miner tweak required that are fairly easy to find with some googling.
Importing users from Office 365 to an on Prem-AD can be required in cases where an organisation who has been using Office 365 wants to start using a Remote Desktop Service or alike. To reduce the number of passwords and provide single sign on (or at least same sign on) the Windows Server my have Azure AD connect installed and be syncing with the businesses Office 365 account. The problem is that out of the box Azure AD connect is a one way street. It only creates object on the Azure side – it does not import Office 365 users into the server’s Active Directory.
To get users from Office 365 created in a new Windows Active Directory Service:
These endpoints are defined in Keystone, to see them and edit them there I can ssh to the keystone server and run some mysql queries. Before I do this I need to make sure that the swift and keystone endpoints are configure to use TLS/SSL.
Again the TLS/SSL termination point is apache… so some modification to /etc/httpd/conf.d/wsgi-keystone.conf is all that is required:
*Note: It can take some time (5 mins or longer if core-os is updating) for the kubernetes cluster to become available. To see status, vagrant ssh c1 (or w1/w2/e1) and run journalctl -f (following service logs).
First interesting point… with simple deployment above, I have already gone awry. Though I have 2 nginx containers (presumably for redundancy and load balancing), they have both been deployed on the same worker node (host). Lets not get bogged down now — will keep working through examples which probably cover how to ensure redundancy across hosts.
We are currently operating a service oriented architecture that is ‘dockerized’ with both host and containers running CentOS 7 when deployed straight on top of ec2 instances. We also have a deployment pipline with beanstalk + homegrown scripts. I imagine our position/maturity is similar to a lot of SMEs, we have aspirations of being on top of new technologies/practices but are some where in between old school and new school:
IT and Dev separate
Devops (Ops and Devs have the same goals and responsibilities)
Almost total automation with self-service
Image management (with immutable deployments)
IT staff have a high baseline work
IT staff have low baseline work, more room for initiatives
This is not about which end of this incomplete spectrum is better… we have decided that for our place in the world, moving further the left is desirable. I know there are a lot of experienced IT operators that take this view:
Why CoreOS for Docker Hosts?
CoreOS: A lightweight Linux operating system designed for clustered deployments providing automation, security, and scalability for your most critical applications – https://coreos.com/why/
Our application and supporting services run in docker, there should not be any dependencies on the host operating system (apart from the docker engine and storage mounts).
Some questions I ask myself now:
Why do I need to monitor for and stage deployments of updates?
Why am I managing packages on a host OS that could be immutable (like CoreOS is, kind of)?
Why am I managing what should be homogeneous machines with puppet?
Why am I nursing host machines back to health when things go wrong (instead of blowing them away and redeploying)?
Why do I need to monitor SE Linux events?
I want a Docker Host OS that is/has:
Smaller, Stricter, Homogeneous and Disposable
Built in hosts and service clustering
As little management as possible post deployment
CoreOS looks good for removing the first set of questions and sufficing the wants.
The rise of single page apps (ie AngularJS) present some interesting problems for Ops. Specifically, the increased dependence on browser executed code means that real user experience monitoring is a must.
The solution/s must have the following requirements:
Negligible performance impact
Real user performance monitoring
Effective single page app (AnglularJS support)
Real time alerting
Nice to have:
Easy to deploy and maintain integration
Easy integration with tools we use for notifications (icinga2, Slack)
Google Analytics with custom event push does not have any real time alerting which disqualifies it as an Ops solution.