Of late I have been thinking about some of the services we can offer from eXzaTech around data and information management solutions, and I suddenly remembered two interesting situations I handled quite some time ago. I thought I should share them with you; hence this article.
The first had to do with complaints from a software development company about the performance of Microsoft SQL Server 2005 Enterprise Edition on a server with an Intel Itanium processor. The developers were complaining that SQL Server was not performing well on the Itanium processor, and they were benchmarking it against an Intel Xeon based server without acknowledging that they were not comparing apples to apples. In their labs they were using Xeon based servers with low-volume test data, whereas the production server was handling the very high-volume, live production data of an international shipping company. Although the Itanium processor eventually lost out to the Xeon for various reasons, I never believed Itanium processors were such poor performers compared to 64-bit Xeon processors. This was a typical case of the software vendor blaming the hardware and the hardware vendor blaming the software for poor performance.
The architecture was very complex, but those details are not relevant here.
The concerns were mainly related to very high CPU utilization, which sometimes touched 100%, at which point the server would hang. There was an unusual number of deadlocks, lock timeouts and lock waits resulting in lost transactions, along with alarmingly high resource utilization.
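For readers who want to see what that looks like at the server level, here is a minimal, hedged sketch of how such lock and latch pressure is usually confirmed on SQL Server 2005; the DMV and trace flag are standard features, though the actual investigation was of course more involved.

    -- Wait statistics accumulated since the last restart give a quick first
    -- indication of lock and latch pressure (SQL Server 2005 DMV).
    SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE 'LCK%' OR wait_type LIKE 'PAGELATCH%'
    ORDER BY wait_time_ms DESC;

    -- Write full deadlock details to the SQL Server error log (server-wide).
    DBCC TRACEON (1222, -1);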
The customer was using a performance monitoring tool (an NMS/EMS), and those readings were being debated. There were also questions about the server sizing and the reasons behind choosing the Intel Itanium 2 processor.
We were asked to take a look and share our views. We first looked at the server sizing: we ran the hardware vendor's sizer tool and arrived at the suggested server size for the given workload. The results were astonishing. The server in use was sufficient to handle the workloads of the next five years at their own estimates! This was seconded by the hardware, operating system and database vendors.
This finding made us take a closer look at the SQL Server logs and analyze the performance counters, the actual transaction SQL statements and batch jobs, and the types of workload, and what we found was really surprising. A single SQL Server 2005 instance was handling the entire business workload, business logic, MIS reporting, Business Intelligence and Analytics simultaneously. Missing indexes were forcing query execution plans to be recompiled almost every time, consuming CPU and memory. This, coupled with server-side cursors holding on to CPU and memory, left the SQL Server query optimizer running out of internal resources to compile execution plans, and errors were thrown. BI and Analytics workloads with complex queries consisting of hundreds of operations such as nested loop joins, hash joins, sorts, conditional processing, FTP, URL and text processing were executing during peak business hours. Extremely high I/O from server-side cursors holding large data sets and working against the tables, coupled with missing indexes forcing repeated expensive scans, was creating heavy latch contention. The result was deadlocks and severe pressure on CPU and memory, which sometimes left the server hung.
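As an aside, SQL Server 2005 itself records the indexes the optimizer wished it had. A simplified sketch of the kind of query one can use to surface them (the output is indicative only and still has to be validated against the actual execution plans) looks like this:

    -- Missing-index suggestions recorded by the optimizer since the last restart.
    SELECT d.statement AS table_name,
           d.equality_columns,
           d.inequality_columns,
           d.included_columns,
           s.user_seeks,
           s.avg_user_impact
    FROM sys.dm_db_missing_index_details AS d
    JOIN sys.dm_db_missing_index_groups AS g
         ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS s
         ON s.group_handle = g.index_group_handle
    ORDER BY s.user_seeks * s.avg_user_impact DESC;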
On top of this, in order to reduce the deadlocks, the lowest isolation level (READ UNCOMMITTED) had been used for many transactional queries. This, coupled with repeated changes to the table structures, had created missing and inconsistent data. All of this put severe pressure on the business itself, which had to fall back on its good old manual processes to keep running.
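To make the isolation-level point concrete: READ UNCOMMITTED trades correctness for fewer blocks, whereas SQL Server 2005 already offered a row-versioning alternative. The statements below are purely illustrative (the database name is made up), not the customer's actual code.

    -- What was being done: dirty reads, so queries can see uncommitted
    -- (possibly rolled-back) data.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    -- A safer alternative in SQL Server 2005: readers get a committed snapshot
    -- via row versioning instead of taking shared locks.
    ALTER DATABASE SalesDB SET READ_COMMITTED_SNAPSHOT ON;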
Our role was only to give our views and recommendations, and we did just that. Our recommendations were to:
a) move the analysis and reporting workload out of the live production instance into a separate database, running the ETL activities during off-peak hours (a simple sketch follows this list);
b) optimize the queries and create the missing indexes;
c) avoid changing the database structure frequently; and
d) use different integration methodologies to integrate the other applications.
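To illustrate recommendation a), the offload can be as simple as a scheduled job that copies the previous day's transactions into a separate reporting database during off-peak hours. The object names below are hypothetical and only sketch the idea:

    -- Nightly refresh of a reporting table in a separate database (run from a
    -- SQL Agent job during off-peak hours; all names are illustrative).
    INSERT INTO ReportingDB.dbo.ShipmentFact (ShipmentId, ShipDate, Route, Weight)
    SELECT s.ShipmentId, s.ShipDate, s.Route, s.Weight
    FROM ProductionDB.dbo.Shipments AS s
    WHERE s.ShipDate >= DATEADD(dd, DATEDIFF(dd, 0, GETDATE()) - 1, 0)
      AND s.ShipDate <  DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0);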
Once the customer implemented the recommendations, the database performed normally without much pressure on CPU and memory, and they were no longer losing transactions or data.
The second instance involved a customer who wanted to keep IBM DB2 Express Edition running for some time until they migrated fully to Oracle ERP. The problem was that when nearly 700 users connected to the database, the server would hang and they were forced to restart it.
We were called in to help them optimize performance and prevent the server from needing restarts as the load increased. The easiest suggestion we could have given was to upgrade their license to either DB2 Workgroup or DB2 Enterprise Edition, but the customer was not interested in upgrading, since DB2 was being used only until the migration to the new ERP.
During our engagement we observed that the 32-bit edition of DB2 Express Version 8.1 was being used. The server was a virtual machine on 64-bit hardware running 32-bit Windows Server 2003 Standard. Given the addressable-memory limitation of a 32-bit operating system and the CPU and memory limits of DB2 Express Edition, we concluded that memory pressure was driving the CPU pressure as well, leaving the server hung and needing a restart. This was causing chaos across the organization, as the applications running on DB2 handled critical business operations.
Simple suggestions came first: removing unwanted data freed almost 40% of the disk, which improved performance a little. Upgrading the operating system to 64-bit Windows Server 2008 R2 and moving from DB2 Express Edition Version 8.1 to 64-bit DB2 Express Edition Version 9.7 did the trick. Without increasing the CPU or memory allocations, DB2 started handling nearly 7,000 concurrent connections. There was not a single instance of a hung server or transaction loss. This was tested over a month before being declared live.
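For completeness, once the instance was on DB2 9.7 the word size and the connection load could be watched with the SYSIBMADM administrative views. The snippet below is only a hedged illustration of that kind of check, not the exact monitoring that was run.

    -- Confirm the instance is now 64-bit and check the service level.
    SELECT INST_NAME, INST_PTR_SIZE, SERVICE_LEVEL
    FROM SYSIBMADM.ENV_INST_INFO;

    -- Count the applications currently connected to the instance.
    SELECT COUNT(*) AS CONNECTED_APPS
    FROM SYSIBMADM.APPLICATIONS;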
These two instances taught us that there is no single formula for correcting poor performance. The problems can lie anywhere: in the hardware, the operating system or the application design, literally anywhere across the infrastructure that delivers the applications.
What do you say?