We have a web application that has been running on AWS for several years. As Application Load Balancers and the AWS WAF service were not available at the time, we used an external classic ELB pointing to a pool of EC2 instances running mod_security as our WAF solution, with the OWASP ModSecurity Core Rule Set.
Now that Application Load Balancers and AWS WAF are available, we would like to remove the CPU bottleneck that stems from using EC2 instances running mod_security as the current WAF.
Step 1 – Base-lining performance with EC2 WAF solution.
The baseline was completed using https://app.loadimpact.com, where we ran 1000 concurrent users with immediate ramp-up. In our test with 2 x m5.large EC2 instances as the WAF, the WAFs became CPU-pinned within 2 minutes 30 seconds.
This test was repeated with the EC2 WAFs removed from the chain, and we averaged 61ms across the Load Impact test with 1000 users. So – now we need to implement the AWS WAF solution so the two can be compared.
Step 2 – Create an ‘equivalent’ rule-set and start using AWS WAF service.
So – getting started with the AWS WAF documentation, we read: ‘define your conditions, combine your conditions into rules, and combine the rules into a web ACL’.
Conditions: request strings, source IPs, country/geolocation of the request IP, length of specified parts of the request, SQL code (SQL injection), header values (e.g. User-Agent). Conditions can contain multiple values and regexes.
Rules: combinations of conditions along with an action (allow/block/count). There are regular rules, where conditions can be and/or chained, and rate-based rules, where a rate-based condition is added on top.
Web ACLs: where the action for each rule is defined. Multiple rules can have the same action and thus be grouped in the same ACL. The WAF uses web ACLs to assess requests against rules in the order in which the rules were added to the ACL; whichever rule (if any) matches first determines which action is taken.
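To make those three layers concrete, here is a rough sketch using the classic waf-regional CLI (the names and the limit are placeholders, not our real config; our actual implementation below is Terraform):

# every classic WAF mutation needs a fresh change token
TOKEN=$(aws waf-regional get-change-token --query ChangeToken --output text)

# a rate-based rule: counts requests per source IP over a rolling 5-minute window
aws waf-regional create-rate-based-rule \
  --name login-rate-limit --metric-name loginRateLimit \
  --rate-key IP --rate-limit 2000 --change-token "$TOKEN"

# a web ACL with a default action; rules are then attached with update-web-acl
TOKEN=$(aws waf-regional get-change-token --query ChangeToken --output text)
aws waf-regional create-web-acl \
  --name example-acl --metric-name exampleAcl \
  --default-action Type=ALLOW --change-token "$TOKEN"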
Starting simple: to get started I will implement a rate-based rule which limits requests to our login page from any single IP to 2000 per 5-minute period (the AWS WAF minimum), along with the basic OWASP rules from the Terraform code published by Traveloka. Below is our main.tf with the aws_waf_owasp_top_10_rules module created for this test.
main.tf which references our newly created aws_waf_owasp_top_10_rules module
ab -v 3 -n 2000 -c 100 https://am.ameu.sonet.com.au/login > ab_2000_100_waf_test.log
This command logs response codes (-v 3 sets the output verbosity; full headers need -v 4), makes 2000 requests (-n 2000) and issues those requests 100 at a time (-c 100). I can then see failed requests by tailing the output:
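At -v 3, ab logs a response-code line per request, so counting WAF blocks (assuming the WAF returns 403s) is a quick grep – something like:

# count blocked (403) responses, then summarise all response codes seen
grep -c 'Response code = 403' ab_2000_100_waf_test.log
grep 'Response code' ab_2000_100_waf_test.log | sort | uniq -c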
All looks good for the rate-based blocking, though it appears that blocking does not kick in at exactly 2000 requests in the 5-minute period. It also appears that there is a significant (5-10 minute) delay before metrics come through to the WAF stats in the AWS console.
After success with the rate-limiting rule, the OWASP Top 10 mitigation rules need to be tested. I will use OWASP ZAP to generate some malicious traffic and see what happens!
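As a scripted alternative to driving ZAP by hand, its baseline scan docker image can generate similar traffic (a sketch; the image and script are ZAP's, the target URL is ours):

# spider the target and flag issues, writing a report to stdout
docker run -t owasp/zap2docker-stable zap-baseline.py -t https://am.ameu.sonet.com.au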
So it works – which is good, but I am not really confident about the effectiveness of the OWASP rules (as implemented on the AWS WAF). For now they will do, but some tuning will probably be desirable: all of the requests OWASP ZAP made contained (clearly) malicious content, yet only 7% (53/755) of the requests were blocked by the WAF service. It will be interesting to see whether there are false positives (valid requests that are blocked) when I conduct step 4, performance testing.
Step 4 – Conduct a performance test using the AWS WAF service.
Conducting a load test with https://app.loadimpact.com demonstrated that the AWS WAF service is highly unlikely to become a bottleneck (though this may differ for other applications and implementations).
Step 5 – Migrate PROD to the AWS WAF service.
Our environment is fully ‘terraformed’, so implementing the AWS WAF service as part of our Terraform code was working within an hour or so (which is good time for me!).
A lot of people need offsite backups for AWS RDS – which can be done trivially within AWS. If you need offsite backups to protect against things like an AWS account breach or AWS-specific issues, the backups must include diversification of suppliers.
I am going to use AWS Database Migration Service (DMS) to replicate AWS RDS data to a VM running in Azure, and set up snapshots/backups of the Azure hosts.
The steps I used to do this are:
Set up an Azure Windows 2016 VM
Create an IPSec tunnel between the Azure Windows 2016 VM and my AWS Native VPN
Install matching version of Oracle on the Windows 2016 VM
Configure Data Migration service
Create a data migration and continuous replication task
Snapshots/Backups and Monitoring
Debug and Gotchas
1,2 – Set up Azure Windows 2016 VM and IPSec tunnel
Create a network on Azure and place a VM in the network with 2 interfaces. One interface must have a public IP; call this one ‘external’ and the other interface ‘internal’. Once you have the public IP address of your Windows 2016 VM, create a ‘Customer Gateway’ in your AWS VPC pointing to that IP. You will also need a ‘Virtual Private Gateway’ configured for that VPC. Then create a ‘Site-to-Site VPN connection’ in your VPC (it won’t connect for now, but create it anyway). Configure your Azure Win 2016 VM to make an IPSec tunnel by following these instructions (the instructions are for 2012 R2, but the only tiny difference is some menu items): https://docs.aws.amazon.com/vpc/latest/adminguide/customer-gateway-windows-2012.html#cgw-win2012-download-config. Once this is completed, both your AWS site-to-site connection and your Azure VM are trying to connect to each other. Ensure that the Azure VM has its security groups configured to allow your AWS site-to-site VPN to reach the Azure VM (I am not sure which ports and protocols specifically; I just white-listed all traffic from the two AWS tunnel endpoints). Once this was done it took around 5 minutes for the tunnel to come up (I was checking the status via the AWS console). I also found that it requires traffic to be flowing over the link, so I was running ping -t <aws_internal_ip> from my Azure VM. Also note that you will need to add routes to your applicable AWS route tables and update AWS security groups for the Azure subnet as required.
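To watch the tunnel status without refreshing the console, the same VGW telemetry is available via the AWS CLI (the connection ID is a placeholder):

# show the status of both AWS tunnel endpoints for the site-to-site connection
aws ec2 describe-vpn-connections --vpn-connection-ids vpn-0123456789abcdef0 \
  --query 'VpnConnections[].VgwTelemetry[].{Endpoint:OutsideIpAddress,Status:Status,Detail:StatusMessage}' \
  --output table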
3 – Install matching version of Oracle on the Windows 2016 VM
4,5 – Configure Data Migration service and migration/replication
Log into your AWS console, go to ‘Database Migration Service’ / ‘DMS’ and hit get started. You will need to set up a replication instance (well, at least pick a size, security group, type etc). Note that the security group you add the replication host to must have access to both your RDS and your Azure DBs – I could not pick which subnet the host went into, so I had to add routes for a couple more subnets than expected. Next you will need to add your source and target databases; when you add in the details and hit test, the wizard will confirm connectivity to both. I ran into issues on both of these points because of not adding the correct security groups, the Windows firewall on the Azure VM, and my VPN link dropping due to no traffic (I am still investigating a better fix than ping -t for this). Next you will create a migration/replication task; if you are going to be doing ongoing replication you need to run the following on your Oracle RDS db:
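Per the AWS DMS documentation for Oracle RDS sources, the required settings are along these lines (run as the RDS master user, e.g. via sqlplus; the connect string and retention hours are placeholders – pick a retention that suits your redo volume):

sqlplus admin@yourdb <<'SQL'
-- keep archived redo logs long enough for DMS to read them
exec rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours', 24);
-- enable supplemental logging so the redo logs carry enough detail for CDC
exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD');
exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD', 'PRIMARY KEY');
SQL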
You can filter by schema, which should provide you with a drop-down box to select which schema(s). Ensure that you enable logging on the migration/replication task – if you get errors (which I did the first couple of attempts), you won’t be fixing anything without the logs.
6 – Snapshots and Monitoring
For my requirements, daily snapshots/backups of the Azure VM provide sufficient coverage. The Backup vault must be upgraded to v2 if you are using a Standard SSD disk on the Azure VM, see: https://docs.microsoft.com/en-us/azure/backup/backup-upgrade-to-vm-backup-stack-v2#upgrade. To enable email notifications for Azure backups, go to the Azure portal, select the applicable vault, click ‘view alerts’ -> ‘Configure notifications’ -> enter an email address and check ‘critical’ (or whichever types of email notification you want). Other recommended monitoring checks include: ping for VPN connectivity, a status check of the DMS task (using the AWS CLI), and an SQL query on the destination database confirming the latest timestamp of a table that should receive regular updates.
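For the DMS status check, a rough sketch of the sort of cron-able probe I mean (the task identifier and alert address are placeholders):

# alert if the replication task is not actively replicating
status=$(aws dms describe-replication-tasks \
  --query 'ReplicationTasks[?ReplicationTaskIdentifier==`rds-to-azure`].Status' --output text)
[ "$status" = "running" ] || echo "DMS task not running: ${status:-not found}" | mail -s 'DMS alert' ops@example.com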
7 – Debug and Gotchas
Azure security group allowing AWS vpn tunnel endpoint to Azure VM
Windows firewall rule on VM allowing Oracle traffic (default port 1521) from AWS RDS private subnet
Route tables on AWS subnets to route traffic to your Azure subnet via the Virtual Private Gateway
Security groups on AWS to allow traffic from Azure subnet
Stability of the AWS <–> Azure VM site-to-site tunnel requires constant traffic
The DMS replication host seems to go into an arbitrary subnet of your VPC (there’s probably a default setting I didn’t see) – check this and ensure it has routes for the Azure site-to-site
Ensure the RDS Oracle database has the archive log retention and supplemental logging settings as per steps 4-5.
With the go-live of https://letsencrypt.org/ it’s time to transition from the pricey and manual standard SSL cert issuing model to a fully automated process using the ACME protocol. Most orgs have numerous usages of CA-purchased certs; this post will cover hosts running Apache/nginx and AWS ELBs, with all of these usages to be replaced with automated provisioning and renewal of Let’s Encrypt signed certs.
Provisioning and auto-renewing Apache and nginx TLS/SSL certs
For externally accessible sites where Apache/nginx handles TLS/SSL termination, moving to Let’s Encrypt is quick and simple:
1 – Install the letsencrypt client software (there are RHEL and CentOS rpms, so that’s as simple as adding the package to puppet policies or):
yum install letsencrypt
2 – Provision the keys and certificates for each of the required virtual hosts. If a virtual host has aliases, specify multiple names with the -d arg.
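For example, using the webroot plugin so the running web server keeps answering the ACME challenge (the paths and domains are placeholders):

# provision one cert covering the vhost and its alias
letsencrypt certonly --webroot -w /var/www/example -d example.com -d www.example.com

# renewal is idempotent, so a daily cron entry is enough, e.g.:
# 17 3 * * * root letsencrypt renew && systemctl reload httpd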
Before setting out, it is important to cover some basic snort concepts.
This deployment will be in Network Intrusion Detection System (NIDS) mode, which performs detection and analysis on traffic. See the snort manual for the other modes and a nice, concise introduction: http://manual.snort.org/node3.html.
Again drawing from the snort manual, a basic understanding of snort alert identifiers:
116 – Generator ID, which tells us what component of snort generated the alert
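For example, the bracketed triplet at the start of an alert reads [generator_id:signature_id:revision]:

[116:56:1]
  |  |  └ revision of the rule/alert
  |  └ Signature ID within that generator
  └ Generator ID 116 = the snort decoder/preprocessor layer (rather than a ruleset rule, which uses GID 1)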
Eliminating false positives
After running PulledPork with the default snort.conf there will likely be a lot of false positives, most of which will come from the preprocessor rules. There are a few options for eliminating them; to retain maintainability of the rulesets and the ability to keep using PulledPork, do not edit rule files directly. I use the following steps:
Create an alternate startup configuration for snort and barnyard2: snort without -D (daemon), and a barnyard2 config that only writes to stdout, not the database. Now we can stop and start snort and barnyard2 quickly to test rule changes.
Open up the relevant documentation, especially for preprocessor tuning – see the ‘doc’ directory in the snort source.
Have some scripts/traffic replays ready with traffic/attacks you need to be alerting on
Iterate: read the docs, make changes to snort.conf (for preprocessor config), add exceptions/suppressions to snort’s threshold.conf or to PulledPork’s disablesid, dropsid, enablesid and modifysid confs, and re-run the IDS to check for false positives (example suppressions below, after this list).
If there are multiple operating systems in your environment, for best results define ipvars to isolate the different OSs. This ensures you can eliminate false positives whilst maintaining a tight alerting policy.
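For example (the SID and host IP are hypothetical) – a threshold.conf suppression silences a rule for one noisy source, while a PulledPork disablesid.conf entry turns the rule off everywhere and survives rule updates:

# threshold.conf: suppress alerts from rule 1:2019401 when the source is the monitoring host
suppress gen_id 1, sig_id 2019401, track by_src, ip 10.1.1.54

# pulledpork disablesid.conf: disable the rule entirely
1:2019401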
From doc: HttpInspect is a generic HTTP decoder for user applications. Given a data buffer, HttpInspect will decode the buffer, find HTTP fields, and normalize the fields. HttpInspect works on both client requests and server responses.
Global config –
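The stock snort.conf global line is of this form (the unicode map file ships with snort):

preprocessor http_inspect: global iis_unicode_map unicode.map 1252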
Writing custom rules using snort’s lightweight rule description language enables snort to be used for tasks beyond intrusion detection. This example will look at writing a rule to detect Internet Explorer 6 user agents connecting to port 443 (a sketch of the rule follows the option summary below).
Rule Headers -> [Rule Actions, Protocols, IP Addresses and Ports, Direction Operator, Activate/Dynamic Rules]
general – informational only — msg:, reference:, gid:, sid:, rev:, classtype:, priority:, metadata:
payload – look for data inside the packet —
content: sets rules that search for specific content in the packet payload and trigger a response based on that data (Boyer-Moore pattern match). If there is a match anywhere within the packet’s payload, the remainder of the rule option tests are performed (case sensitive). Content can contain mixed text and binary data; binary data is represented as hexadecimal with pipe separators – (content:”|5c 00|P|00|I|00|P|00|E|00 5c|”;). Multiple content rules can be specified in one rule to reduce false positives. content has a number of modifiers: [nocase, rawbytes, depth, offset, distance, within, http_client_body, http_cookie, http_raw_cookie, http_header, http_raw_header, http_method, http_uri, http_raw_uri, http_stat_code, http_stat_msg, fast_pattern].
non-payload – look for non-payload data
post-detection – rule specific triggers that are enacted after a rule has been matched
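Putting that together, a sketch of the IE6 rule described above (local SIDs conventionally start at 1000000; note that a plaintext content match on port 443 only fires if the traffic is not actually encrypted, e.g. when inspecting before/after TLS termination):

alert tcp any any -> any 443 (msg:"POLICY IE6 user agent to port 443"; \
    flow:to_server,established; content:"User-Agent|3a|"; nocase; \
    content:"MSIE 6"; nocase; distance:0; sid:1000001; rev:1;)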
So here’s a simple script that will pull the cert chain from a [domain] [port] and let you know if it is invalid – note there will likely be some bugs from characters being encoded / carriage returns missing:
# chain_collector.sh [domain] [port]
# output to stdout
# assumes you have a directory with desired trust anchors at ~/trustanchors
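A minimal version of that script, assuming the ~/trustanchors directory of PEM trust anchors mentioned above:

#!/usr/bin/env bash
# chain_collector.sh [domain] [port]
domain="$1"; port="${2:-443}"
chain="$(mktemp)"; leaf="$(mktemp)"; anchors="$(mktemp)"

# pull every PEM certificate the server presents
openssl s_client -connect "${domain}:${port}" -servername "${domain}" -showcerts </dev/null 2>/dev/null \
  | sed -n '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > "${chain}"
cat "${chain}"

# verify the first (leaf) certificate against the local trust anchors,
# treating the rest of the presented chain as untrusted intermediates
awk '/-BEGIN CERTIFICATE-/{n++} n==1' "${chain}" > "${leaf}"
cat ~/trustanchors/*.pem > "${anchors}"
openssl verify -CAfile "${anchors}" -untrusted "${chain}" "${leaf}" \
  && echo "chain valid" || echo "chain INVALID"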
server random: 1b:97:2e:f3:58:70:d1:70:d1:de:d9:b6:c3:30:94:e0:10:1a:48:1c:cc:d7:4d:a4:b5:f3:f8:78 = 1988109383203082608
Interestingly, the negotiation between youtube.com and a Chromium browser resulted in Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS) as the chosen cipher suite.
Note that there is no step mentioned here for the client to verify the certificate. In the past most browsers would query a certificate revocation list (CRL), though browsers such as Chrome now either ignore CRLs or use certificate pinning.
Chrome will instead rely on its automatic update mechanism to maintain a list of certificates that have been revoked for security reasons. Langley called on certificate authorities to provide a list of revoked certificates that Google bots can automatically fetch. The time frame for the Chrome changes to go into effect are “on the order of months,” a Google spokesman said. – source: http://arstechnica.com/business/2012/02/google-strips-chrome-of-ssl-revocation-checking/
This issue is caused by having iptables rule/s that track connection state. If the number of connections being tracked exceeds the default nf_conntrack table size, then any additional connections will be dropped. It is most likely to occur on machines used for NAT and on scanning/discovery tools (such as Nessus and Nmap).
Symptoms: Once the connection table is full any additional connection attempts will be blackholed.
This issue can be detected using:
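On RHEL the dropped-packet messages land in the kernel ring buffer, so something like:

dmesg | grep conntrack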
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
Current conntrack settings can be displayed using:
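For example, via sysctl:

# list all conntrack tunables; nf_conntrack_max is the table size
sysctl -a 2>/dev/null | grep conntrack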
To check the current number of connections being tracked by conntrack:
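For example:

cat /proc/sys/net/netfilter/nf_conntrack_count
# or, with conntrack-tools installed: conntrack -C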
Options for fixing the issue are:
Stop using stateful connection rules in iptables (probably not an option in most cases)
Increase the size of the connection tracking table (also requires increasing the conntrack hash table)
Decreasing timeout values, reducing how long connection attempts are stored (this is particularly relevant for Nessus scanning machines that can be configured to attempt many simultaneous port scans across an IP range)
Making the changes persistent – RHEL 6 examples:
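For example (the values are illustrative – size them for your workload):

# /etc/sysctl.conf – raise the table size and shorten the established-connection timeout
net.netfilter.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# /etc/modprobe.d/nf_conntrack.conf – the hash table size is a module parameter, not a sysctl
options nf_conntrack hashsize=65536

# apply the sysctl changes without a reboot
sysctl -p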
<ejbca_home>/bin/ejbca.sh ca importca <caname> existingCA1.p12
Step 3 – Verify import
<ejbca_home>/bin/ejbca.sh ra adduser
### IMPORTANT ###
Distinguished name order of openssl may be the opposite of the ejbca default configuration – http://www.csita.unige.it/software/free/ejbca/ … If so, this ordering must be changed in the ejbca configuration prior to deploying (it can’t be set on a per-CA basis)
I have not been able to replicate this issue in testing.
Import existing TinyCA CA
Basic Admin and User operations
Create an end entity profile for server/client entities
Step 2 – Sign CSR using the End Entity which is associated with a CA
Importing existing certificates
EJBCA can create end entities and import their existing certificates one-by-one or in bulk (http://www.ejbca.org/docs/adminguide.html#Importing Certificates). Bulk inserts import all certificates under a single user, which may not be desirable. Below is a script to import all certs in a directory one by one, each under a new end entity which takes the name of the certificate CN.
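A sketch of that script, assuming PEM certs in ./certs (the password and CA name are placeholders, and the exact ‘ca importcert’ argument order varies between EJBCA versions – check the command’s usage output first):

#!/usr/bin/env bash
# import every PEM cert in ./certs as its own end entity, named after the cert's CN
EJBCA_HOME=/opt/ejbca
CA_NAME=MyImportedCA

for cert in ./certs/*.pem; do
    # extract the CN from the subject DN to use as the end entity name
    cn=$(openssl x509 -noout -subject -in "$cert" | sed -n 's|.*CN=\([^,/]*\).*|\1|p')
    "$EJBCA_HOME"/bin/ejbca.sh ca importcert "$cn" changeit "$CA_NAME" ACTIVE "$cert"
done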