Baron Funds is a New York investment firm with $20 billion under management. While its IT infrastructure is not especially large – the firm operates 17 physical servers, some of which are virtualised using VMware – with that amount of money at stake, uptime and performance are paramount.
But like many organisations, Baron Funds found that virtualising its systems made it rather more complicated to meaningfully monitor the performance of the underlying hardware.
Henry Mayorga, the firm’s director of network technology, explains it like this: “If there’s a problem with the performance of a system, the usual approach is to look at the CPU or the memory of the server. But with a virtualised server, it could be any number of virtual machines that are affecting each other, and you can’t see what’s causing the problem.”
And while there are many systems monitoring tools that can provide detailed performance data on individual components, “the problem is making sense of all that data,” Mayorga says.
“When a problem arises, I could conceivably go through all my servers and look at all the forensic data, if I had time,” he explains. “But performance problems are transitory. Things happen in the spark of a moment – maybe somebody’s uploading a large file or running a long report – so the problem sticks around for an hour and then it’s gone.”
This is especially true in virtualised environments, he adds, “where people bring up servers and bring them down again all the time. It’s a very dynamic environment.”
Mayorga needed a tool that could make sense of the interdependencies between virtual systems and physical hardware, and do so quickly enough for performance issues to be resolved as they arise.
A chance personal connection led him to Netuitive, a company whose systems performance management technology is built on what it describes as a ‘self-learning analytics engine’. This analyses the output of systems monitoring tools and seeks correlations between the performance of the various components of the infrastructure. It ‘learns’ what the normal correlations are – when a particular application is in use, this server will be busy – and can therefore spot unexpected behaviour.
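Netuitive does not publish the details of its analytics engine, but the general technique of learning a baseline relationship between metrics and flagging departures from it can be sketched in a few lines. The metric names, the linear model and the three-sigma threshold below are illustrative assumptions, not a description of the product.

```python
# Illustrative sketch only: learn a "normal" relationship between two metrics
# from history, then flag readings that fall outside it. All names and
# thresholds here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Pretend history: when the application is busy (CPU high), storage latency
# normally rises with it in a roughly linear way.
cpu_history = rng.uniform(10, 80, 500)                       # % utilisation
latency_history = 0.2 * cpu_history + rng.normal(0, 2, 500)  # milliseconds

# "Learn" the normal relationship: fit a simple linear baseline and record
# how much scatter around it counts as normal.
slope, intercept = np.polyfit(cpu_history, latency_history, 1)
residuals = latency_history - (slope * cpu_history + intercept)
tolerance = 3 * residuals.std()

def is_anomalous(cpu_now: float, latency_now: float) -> bool:
    """Flag a reading whose latency departs from what the learned
    CPU/latency relationship predicts."""
    expected = slope * cpu_now + intercept
    return abs(latency_now - expected) > tolerance

# A quiet CPU paired with very high latency is the kind of mismatch that
# points away from the server itself and towards, say, the storage behind it.
print(is_anomalous(cpu_now=75, latency_now=17))  # in line with history -> False
print(is_anomalous(cpu_now=20, latency_now=40))  # latency out of step -> True
```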
Trial by fire
While Mayorga was trialling a Netuitive product that plugs into VMware’s vCenter management console, the firm encountered an unexpected deterioration in the response time of its email system and various databases, all of which ran on virtual machines.
Of all the various tools Mayorga and his team use to diagnose performance issues, only Netuitive correctly identified that the bottleneck was between a particular virtual server and the storage device it was connected to.
Mayorga called in an engineer from the company that provides its tiered storage system, which moves frequently accessed data on to expensive, high-performance disks, and less frequently used data on to cheaper media. It turned out that the email server and the databases had been moved onto a slow storage device, despite being the most commonly used systems.
The reason for this, it later transpired, was that an employee working over the weekend had been repeatedly searching for a file across the whole infrastructure, and the search engine had accessed every file each time. Because every file now registered as frequently accessed, the tiering system could no longer tell which data was genuinely in demand, and the email and database systems were gradually pushed off the expensive storage media and onto the lower-performance devices, hence the slow response times.
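The article does not describe the storage vendor’s actual placement policy, but a toy frequency-based model illustrates how a scan that touches every file can crowd genuinely hot data off a small fast tier.

```python
# Toy model of frequency-based tiering (purely illustrative, not the vendor's
# real policy). The fast tier holds the N most-frequently-accessed items;
# everything else lives on slower, cheaper disks.
from collections import Counter

FAST_TIER_SLOTS = 2
access_counts = Counter()

def record_access(item: str) -> None:
    access_counts[item] += 1

def fast_tier() -> set:
    return {item for item, _ in access_counts.most_common(FAST_TIER_SLOTS)}

# Normal week: the email server and databases dominate the access pattern.
for _ in range(100):
    record_access("email")
    record_access("database")
for _ in range(5):
    record_access("archive")

print(fast_tier())  # {'email', 'database'}

# Weekend: a search that touches every file, over and over, inflates the
# counts of otherwise cold data and crowds the hot systems off the fast tier.
for _ in range(200):
    for item in ("archive", "old_reports", "scratch"):
        record_access(item)

print(fast_tier())  # the fast tier no longer holds the email or database data
```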
That Netuitive had correctly identified the problem was enough to convince Mayorga to buy the product. Since deploying the tool, he has found that it allows him to quantify the impact of any change he makes to the infrastructure. “We upgraded the storage for our email server a few months ago, and we could see that we had a 25% improvement in performance as a result.”
Mayorga has extended the Netuitive system so that it monitors more of the infrastructure. Such is the power of the central engine, he says, that he can in theory analyse any form of data with it. “If you are able to take the data from the actual devices and feed it into Netuitive, the system figures out how it all correlates.”
However, he reports that this has taken a lot of effort on his part. “Getting the other data in there is tough,” he says. The Netuitive system itself could be easier to integrate with, he says, and procuring data from some other performance monitoring systems is also more difficult than it need be.
He observes, therefore, that it may be the relatively small size of the infrastructure in question that makes this a plausible exercise. “I have a very manageable situation, so I can take all my data and make sense of it in Netuitive,” Mayorga explains. “I couldn’t tell you how well this would scale for an enterprise.
“What I can tell you, though, is that within my world Netuitive solved two problems,” he adds. “One is correlating all the performance data that I have, and the other is quickly figuring out where something went wrong without having to spend hours and hours working out what is the proper performance for each system.”