New Relic is an excellent platform for monitoring applications and infrastructure. It's our preferred tool, and we recommend it to all our clients. With New Relic you can monitor all your applications and services, including throughput, errors, and performance. New Relic also has excellent monitoring capabilities for AWS infrastructure components such as EC2 and RDS.
To start, any application or service that needs New Relic application monitoring must first have the SDK integrated.
After integration, you need to redeploy the services to start monitoring the application. You will then be able to see activity in New Relic, such as web transaction response time and throughput.
New Relic works by sending events from the application or service to New Relic's servers, where those events are transformed into metrics that you see in charts and dashboards.
One problem with integrating a high-traffic, high-throughput application or service into New Relic is the spike in data usage. The New Relic One plan offers the first 100 GB of storage for free, but after that you pay $0.25 for every GB inserted. If you aren’t careful, you will end up paying a lot for applications with a ton of traffic.
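To get a feel for the numbers, here is a rough estimate of the data charge using the figures quoted above. It is only a sketch: the function and constant names are made up for illustration, and it assumes the 100 GB free allowance applies per billing month, which the pricing quoted above does not spell out.

```javascript
// Figures taken from the pricing quoted above.
// Assumption: the free allowance resets every billing month.
const FREE_GB_PER_MONTH = 100;
const PRICE_PER_EXTRA_GB_USD = 0.25;

// Hypothetical helper: estimated monthly data charge in USD.
function estimateMonthlyCostUsd(ingestedGb) {
  // Only the data above the free allowance is billed.
  const billableGb = Math.max(0, ingestedGb - FREE_GB_PER_MONTH);
  return billableGb * PRICE_PER_EXTRA_GB_USD;
}

console.log(estimateMonthlyCostUsd(80));   // 0 (inside the free allowance)
console.log(estimateMonthlyCostUsd(1100)); // 250 (1000 billable GB x $0.25)
```

Under these assumptions, a service that sends about 1.1 TB a month would cost roughly $250 in data charges alone.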
One of the suggested approaches to reducing data usage is to lower the sampling size, that is, the number of sampled events stored in New Relic. To understand more about sampling, please visit https://docs.newrelic.com/docs/data-apis/understand-data/event-data/new-relic-event-limits-sampling/
Here is the change to the sampling size for a Node.js application. The default value of max_samples_stored is 2000, and we are reducing it to 100 in the New Relic configuration file (newrelic.js).
One risk of reducing the sampling size is that, when you are debugging the application, there is a high probability that an errored event was never reported to New Relic. That makes debugging harder, because details such as transaction traces and SQL may be missing.
However, this change won't affect metrics, which are measurements aggregated over time. Examples: average response time over a one-minute time range, throughput over time, CPU utilization over time.
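The difference between stored samples and aggregated metrics can be pictured with a small sketch. This is a conceptual illustration only, not New Relic's agent code: the function name is hypothetical, and it simply keeps the first N events, whereas a real agent chooses which events to keep in its own way.

```javascript
// Conceptual model of one minute of traffic (illustrative only).
function recordMinute(durationsMs, maxSamplesStored) {
  let count = 0;
  let totalMs = 0;
  const storedEvents = [];

  for (const durationMs of durationsMs) {
    // Aggregated metrics: every request contributes to the totals.
    count += 1;
    totalMs += durationMs;

    // Event samples: stored only until the cap is reached.
    if (storedEvents.length < maxSamplesStored) {
      storedEvents.push({ durationMs });
    }
  }

  return {
    throughput: count,
    averageMs: count === 0 ? 0 : totalMs / count,
    storedEventCount: storedEvents.length,
  };
}

// 500 simulated requests with durations cycling through 100..140 ms.
const durations = Array.from({ length: 500 }, (_, i) => 100 + (i % 5) * 10);

const withDefault = recordMinute(durations, 2000);
const withReduced = recordMinute(durations, 100);

console.log(withDefault); // throughput 500, averageMs 120, storedEventCount 500
console.log(withReduced); // throughput 500, averageMs 120, storedEventCount 100
```

In this model, throughput and average response time are identical for both settings. Only the number of individual events available for debugging shrinks.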