How and Why I Monitor My Java Web App
All Developers Should Proactively Monitor Their Applications in Production with Time-Based Graphing (RRD, etc.) and Log Analysis
It has always surprised me that so many developers confidently send their applications off to production and then fail to monitor them afterward. "It is not my problem now" and "the IT guys will let me know if there is an issue" seem to be the common attitudes. Then, when there is a fault, these programmers are left performing emergency tweaks or panicked post-mortem analysis. The interpersonal stress caused by such events can quickly lead to a blame game, which is not healthy for the application, the developers, or the internal and external customers.
With some basic monitoring, faults, hard scaling limits and similar problems can be prevented or at least dealt with proactively. At my current company, xtendx, we host our servers with Aspectra, which provides system monitoring on a cornucopia of parameters via an RRD-based system called tacMON from terreActive. For time-based monitoring, tacMON generates graphs of these parameters over four time ranges: daily, weekly, monthly and yearly. For example, here are two interesting graphs:



CPU % Idle Over 24 Hours

Tomcat Heap Size in Megabytes Over 7 Days

With graphs like these I can constantly monitor the health of my application to prevent faults, tune it to scale even higher, or start the slow and expensive process of expanding the server or network infrastructure. At the office, we have set up a two-screen monitoring system with 16 graphs that we watch throughout the day. These screens are hooked up to two old desktop PCs running Ubuntu 9.04 and Firefox, with the browsers in full-screen mode. Once every few minutes an HTML META tag refreshes the images. Our own low-budget NOC (Network Operations Center), if you will.
(The paper mache rooster is 'Chuckie', our mascot--orange and black are our company colors. I can get you one too for SFr. 16 if you are interested.)

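A wall-monitor page of this kind can be very small. The following is only a sketch of the idea, not the page we actually run: the class name, the image URLs and the five-minute interval are all made up for illustration, and the URLs are written into the page without any escaping.

```java
import java.util.List;

/** Builds a bare-bones, auto-refreshing page of graph images (illustrative sketch). */
class DashboardPage {

    /**
     * @param graphUrls      URLs of the graph images to show (hypothetical values)
     * @param refreshSeconds how often the browser should reload the page
     */
    static String build(List<String> graphUrls, int refreshSeconds) {
        if (refreshSeconds <= 0) {
            throw new IllegalArgumentException("refreshSeconds must be positive");
        }
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html>\n<head>\n");
        // The META refresh tag makes the browser reload the page, and with it the images.
        html.append("<meta http-equiv=\"refresh\" content=\"")
            .append(refreshSeconds)
            .append("\">\n");
        html.append("<title>Monitoring</title>\n</head>\n<body>\n");
        for (String url : graphUrls) {
            // URLs are inserted as-is; a real page should escape them.
            html.append("<img src=\"").append(url).append("\" alt=\"graph\">\n");
        }
        html.append("</body>\n</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Placeholder URLs: the real graph locations depend on the monitoring system.
        List<String> graphs = List.of("graphs/cpu-idle-day.png", "graphs/heap-week.png");
        System.out.print(build(graphs, 300));
    }
}
```

Saving the output as a static HTML file and opening it in a full-screen browser is enough; no server-side logic is needed as long as the graph images are regenerated at the same URLs.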
Over time we have settled on this list of core metrics that are indicative of failure (server response time) or resource exhaustion (network bandwidth, CPU):
- Server-level
  - Bandwidth
  - CPU load
  - Disk space
  - I/O operations
- Application-level
  - Java Heap Size
  - Server Response Time
  - Number of Threads
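Two of the application-level metrics above, heap size and thread count, can be read from inside the JVM with the standard java.lang.management API. This is a minimal sketch, assuming you want one log-friendly line per sample; it is not how tacMON collects its data, and the class name and output format are invented for illustration. It reports on the JVM it runs in, so to observe Tomcat it would have to run inside the web application (or the same values would have to be read remotely over JMX).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;
import java.util.Locale;

/** One point-in-time reading of heap usage and thread count for the current JVM. */
class JvmSnapshot {
    private static final long MB = 1024L * 1024L;

    final long heapUsedMb;
    final long heapCommittedMb;
    final int liveThreads;

    private JvmSnapshot(long heapUsedMb, long heapCommittedMb, int liveThreads) {
        this.heapUsedMb = heapUsedMb;
        this.heapCommittedMb = heapCommittedMb;
        this.liveThreads = liveThreads;
    }

    /** Reads the current values from the platform MXBeans. */
    static JvmSnapshot take() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return new JvmSnapshot(
                heap.getUsed() / MB,
                heap.getCommitted() / MB,
                threads.getThreadCount()); // live threads, daemon and non-daemon
    }

    /** Formats the sample as key=value pairs that are easy to grep or graph. */
    String toLogLine() {
        return String.format(Locale.ROOT,
                "heapUsedMb=%d heapCommittedMb=%d threads=%d",
                heapUsedMb, heapCommittedMb, liveThreads);
    }

    public static void main(String[] args) {
        System.out.println(JvmSnapshot.take().toLogLine());
    }
}
```

Calling JvmSnapshot.take() from a scheduled task every minute or so and logging the line would give a crude time series even without an RRD-based system.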
Beyond these core metrics, there are dozens of other parameters that we can investigate and monitor. All of these values are monitored 24/7 by the engineering team at Aspectra too.
- System
  - Fan speeds
  - Temperature (CPU, Power Supply, etc.)
  - Up Time
  - CPU interrupts per second and context switches per second
  - Memory Free
  - Swap Space % Used
- Application
  - Various MySQL parameters
  - Other Java Memory Sizes (Eden, etc.)
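The "other Java memory sizes" in the list above correspond to the JVM's memory pools, which the standard java.lang.management API exposes one by one. The sketch below is one possible way to read them, assuming you only want the bytes currently in use per pool. The class name is invented, and the pool names it prints (Eden and so on) depend on the JVM version and garbage collector, so none are hard-coded.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

/** Lists the current JVM's memory pools and how many bytes each one is using. */
class MemoryPools {

    /** Returns pool name mapped to used bytes, in the order the JVM reports them. */
    static Map<String, Long> usedBytesByPool() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null) {
                result.put(pool.getName(), usage.getUsed());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> entry : usedBytesByPool().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue() + " bytes");
        }
    }
}
```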
Beyond this, I also take a regular tour through the logs that Java, Tomcat and my application produce. It only takes a few minutes to run grep ERROR logs/mylogfile.txt and look for recent stack traces and other errors in catalina.out or localhost-yyyy-mm-dd.log. If you are not monitoring your application and platform continuously and looking at your log files on a regular basis then you are asleep at the wheel! A crash is inevitable.
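If you would rather have the log check in Java than in a shell, the same filter as the grep command is only a few lines. This is a rough equivalent, not a replacement for reading the logs: the class and method names are invented, the file is assumed to be UTF-8 (a badly encoded line makes the read fail), and the default path simply reuses the example file name from above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Prints every line of a log file that contains a given marker, like a plain grep. */
class LogErrorScan {

    /** Returns the lines of logFile that contain marker, in file order. */
    static List<String> linesContaining(Path logFile, String marker) throws IOException {
        // Files.lines streams the file lazily; try-with-resources closes it afterwards.
        try (Stream<String> lines = Files.lines(logFile, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(marker))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Default path is the example file name from the text; pass another as argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "logs/mylogfile.txt");
        for (String hit : linesContaining(log, "ERROR")) {
            System.out.println(hit);
        }
    }
}
```

Like grep, this only finds lines with the marker itself; the stack trace lines that follow an error do not contain it, so the hits are a starting point for opening the file at the right place.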