Any startup needs a plan for monitoring and managing system availability, latency, performance, efficiency, incident response, and capacity planning across its services. A good plan keeps services up with little or no customer impact. The tools we choose should also be easy to set up and cheap to maintain. Below is a recommended infrastructure monitoring and site reliability setup, along with the tools for each part.
We need uptime monitoring to test the availability of our websites, applications, and servers, and we need to be notified when one of the services goes down. Availability monitoring helps us meet our SLAs and keep customers happy.
Pingdom.com: Pingdom will test our websites and APIs on a regular basis and update us via Slack or email on downtime. The setup is straightforward and no coding is required.
Pingdom.com: Transaction checks cover application flows, such as signup and login, that need to be set up and monitored on a regular basis. These checks will be set up in Pingdom as automated scripts and executed regularly. We will be alerted if any of the transactions fails.
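To make the idea of a transaction check concrete, here is a minimal Python sketch of a flow runner. It is not how Pingdom implements its checks. The `run_transaction` helper and the step names are hypothetical, and each step is assumed to be a callable that returns True on success.

```python
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action returns True when the step succeeds.
Step = Tuple[str, Callable[[], bool]]


def run_transaction(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            # A step that raises counts as a failure.
            ok = False
        if not ok:
            return name
    return None


# Hypothetical flow: in practice each lambda would drive a real signup or login request.
signup_flow: List[Step] = [
    ("open_signup_page", lambda: True),
    ("submit_signup_form", lambda: True),
    ("login_with_new_account", lambda: False),  # simulated failure
]

failed_step = run_transaction(signup_flow)
if failed_step is not None:
    print(f"ALERT: transaction failed at step '{failed_step}'")
```

Stopping at the first failed step mirrors how a user would experience the flow: if login breaks, nothing after it matters, and the alert can name the exact step.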
AWS SNS: AWS has native monitoring and notification services: CloudWatch alarms watch services like EC2, RDS, and Redshift, and SNS delivers the resulting notifications. We can configure alarms on CPU spikes, downtime, low disk space, etc. These alerts can then be routed to Slack or email to alert the team.
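As an illustration of a CPU alert, the sketch below builds the arguments for a CloudWatch alarm that notifies an SNS topic. The helper name `cpu_alarm_params`, the 80% threshold, the evaluation windows, the instance ID, and the topic ARN are all assumptions chosen for the example, not values from our environment.

```python
from typing import Any, Dict


def cpu_alarm_params(instance_id: str, topic_arn: str, threshold: float = 80.0) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                # evaluate in 5-minute windows (assumed)
        "EvaluationPeriods": 2,       # two consecutive breaches before alarming (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic whose subscribers receive the alert
    }


params = cpu_alarm_params(
    "i-0123456789abcdef0",                            # placeholder instance ID
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder topic ARN
)

# With boto3 installed and AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain function makes it easy to review thresholds in code review and to reuse the same shape for RDS or Redshift metrics by changing the namespace, metric name, and dimensions.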
sentry.io: Sentry is an error reporting tool. We can wire up services built with Node.js, React, and .NET to send their errors to Sentry through a simple SDK. These errors can then be routed to the team via email or Slack for further action.
Application Performance Monitoring
New Relic: New Relic provides a unified dashboard for SQL databases like RDS and Redshift and NoSQL databases like MongoDB.
Capacity planning will be an ongoing manual task, where an engineer reviews the load and optimizes the infrastructure on a regular basis. AWS has Auto Scaling, which we can configure based on the expected traffic. We can also configure auto-scaling for databases like Aurora and MongoDB by reviewing the traffic patterns along with the min/max load.
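One rough way to turn the reviewed min/max load into Auto Scaling limits is sketched below. The `scaling_bounds` helper, the 25% headroom, and the per-instance throughput figure are assumptions for illustration; real numbers should come from load tests and the traffic review.

```python
import math
from typing import Tuple


def scaling_bounds(min_rps: float, max_rps: float, rps_per_instance: float,
                   headroom: float = 0.25) -> Tuple[int, int]:
    """Return (min_instances, max_instances) for an auto-scaling group.

    headroom is the extra fraction of capacity kept in reserve above observed load.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    factor = 1.0 + headroom
    # Always keep at least one instance running, even at very low traffic.
    min_instances = max(1, math.ceil(min_rps * factor / rps_per_instance))
    max_instances = max(min_instances, math.ceil(max_rps * factor / rps_per_instance))
    return min_instances, max_instances


# Hypothetical traffic review: 40 requests/s at the quietest hour, 900 at peak,
# and roughly 100 requests/s handled comfortably by one instance.
low, high = scaling_bounds(min_rps=40, max_rps=900, rps_per_instance=100)
print(low, high)
```

With these example inputs the result is a minimum of 1 instance and a maximum of 12. Re-running the calculation after each traffic review keeps the limits in line with actual usage.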
We can start with two levels of escalation, P0 and P1. P0 is a critical outage, and P1 is a partial service impact without customer impact.
P0 – P0 alerts will be sent via SMS/phone, Slack, and email.
P1 – P1 alerts/notifications will be sent via Slack and email.
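The two-level policy above can be written down as a small lookup table, sketched here in Python. The channel identifiers and the `channels_for` helper are hypothetical names; the mapping itself follows the P0 and P1 rules as stated.

```python
from typing import Dict, List

# Channels per severity, taken from the P0/P1 policy described above.
ESCALATION_POLICY: Dict[str, List[str]] = {
    "P0": ["sms_phone", "slack", "email"],  # critical outage
    "P1": ["slack", "email"],               # partial impact, no customer impact
}


def channels_for(severity: str) -> List[str]:
    """Return the notification channels for a severity level."""
    if severity not in ESCALATION_POLICY:
        raise ValueError(f"unknown severity: {severity!r}")
    # Return a copy so callers cannot mutate the shared policy.
    return list(ESCALATION_POLICY[severity])


for level in ("P0", "P1"):
    print(level, "->", ", ".join(channels_for(level)))
```

Keeping the policy in one table means adding a further level later is a one-line change rather than a hunt through alerting code.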