
Oracle and the underlying operating system are extremely complex. Management of high load and high concurrency expose the software to many race conditions and other problems. When this situation is combined with the constant drive to implement new functionality into the products, it is inevitable that pitfalls-bugs-will arise from time to time.
The purpose of this chapter is to provide advice on how to avoid bugs, how to identify and fix them when they occur, and how to communicate with the respective vendors.
The best way to deal with bugs is to avoid them. This simple fact is often overlooked, and a few precautions can go a long way.
Some people say "never go with dot-0 releases." Other people go one step further and try to associate patterns with the minor release numbers. This is clearly bogus, but the dot-0 philosophy is basically sound to varying degrees.
With large software engineering projects, code has to be frozen from change in order to produce the release. In order to do this, various techniques are used by the code management teams, ranging from conceptual freeze dates to complete shutdown of the code management system. Whatever the technique, a freeze is achieved.
When a hard freeze is set like this, it is almost inevitable that some of the changes will be checked in only partially complete, or rushed through with inadequate testing. It should not be this way, but human nature demands it. The net result of this is that bugs creep into the product.
The more experienced the development team, the less likely that nasty bugs will make it through. The reality for massive projects such as an RDBMS or an operating system is that even a small turnover of people results in a large number of fresh faces for each release.
One way around these problems is to avoid new feature code, because this code is the most likely to contain the bugs. Unfortunately, it is difficult to tell which pieces of the code have been changed the most, because some "old features" may well have been overhauled for this release in order to improve scalability, for example. If one wants to avoid the new features of a dot-0 release altogether, then one should look instead at the most recent release of the prior version.
Shortly after any release, be it a dot-0 or otherwise, problems will start to be reported to support. Therefore, it is always worth waiting at least a few weeks after a release before using it. Before upgrading, check with support and get the list of known bugs in that release. If the bugs are in areas of the product that you use heavily, this is probably not a good release to move to for the present.
The ultimate treatment of a bug involves both a diagnosis and a resolution of the problem. The ultimate resolution may involve a software patch, but an interim solution can sometimes be used to work around the problem. This is something that Oracle often tries to offer as a band-aid solution to the real problem.
Within Oracle development exists a team dedicated to repairing any problems that arise in the product. This team is known as the Defect Diagnosis and Resolution (DDR) group and in many ways offers more capability than a core developer of the product can provide. These people have a much broader knowledge of the software than a typical kernel developer has and are very experienced in sniffing out the root cause of a problem. When a problem is determined to be in a certain portion of the software, the DDR individual will either fix the problem personally or involve the kernel developer who developed that portion of code.
The DDR gorup is the kernel group interface to support, which itself is divided into several levels of escalation. The final line of support is known as Bug Diagnostics and Escalation (BDE), which is the equivalent of DDR on the support side. Once again, BDE personnel are typically versed in the source code of the product and have enormous experience in diagnosing problems. Much of the information below is derived from discussions with BDE. In front of BDE are the usual levels of support, depending on the support tier adopted in your contract.
Finding bugs is often not as straightforward as it may appear. Sometimes a genuine user error can appear to be a bug, or a bug can appear to be a user or code problem. Finding a genuine bug can often require a good deal of research and testing on the part of the database administrator before it is accepted as such by support.
· Operating system version (possibly including patch information, but support will ask if this is required)
The first scenario-the ORA-0600 error-is an Oracle internal error. This means that Oracle has encountered an unexpected condition within the kernel and that the developer has included a check for this in the code. An ORA-0600 error can be considered a "process panic" and normally arises as the result of a race condition. These errors are reported to the session that encounters the error and within the alert file, and so they are quite easy to spot.
Despite a popular misconception, not all internal errors are desperate cases. Unless the internal error is encountered in one of the background processes, it will not cause an instance failure. However, there is little diagnosis that can be done by the administrator for this type of problem, because it is by nature a wildcard. The good news is that an ORA-0600 is definitely a bug rather than a problem attributed to user error, and the following procedure should be followed when an internal error occurs.
1. Call it in to support. There is a good chance that there is already a fix for the problem. Support will need the arguments (the values in square brackets on the error line) in order to locate the source of the problem.
2. While support personnel are looking up the bug, get a good trace dump of the error, ready to send off.
When sending off tracefiles to Oracle, whether for an internal error or otherwise, check that the tracefiles are not all of the same size. If they are, it is likely that they have all been truncated as a result of exceeding max_dump_file_size and will not be of much use to support. If a process is still writing out a tracefile, you can get to it quickly with the svrmgrl oradebug facility and unlimit the trace in order to ensure a full dump. Otherwise, you will need to set the global limit higher in order to produce good tracefiles.
The second type of problem you may encounter is an ORA-7445 error. This type of error results in a core dump of the process that encountered the error and normally a tracefile that contains a stack trace of the problem. These errors are essentially the same as ORA-0600 errors but do not have traps encoded by the developer. The net result is that this error is somewhat less expected than an ORA-0600. In the case of an ORA-7445, the procedure should be the same as for an ORA-0600, but it might be less likely that you will get an immediate fix. However, ORA-7445 errors are still bugs and will be accepted by support without question.
Functionality-related problems are frequently subject to debate and can be harder to log as bugs than other problems. Before calling support, you should be sure that you have read the relevant documentation thoroughly and really understand what is going on. Sometimes, a perceived functionality bug is little more than an incorrect assumption on the part of a user who has not read the manual. If you are sure that it is a bug, it is normally fairly straightforward to create a test case for the problem and to supply this test case at the time of opening the call. This will allow support to work on the problem without repeatedly coming back for further information.
Performance-related problems are more difficult. There can be so many factors that affect performance that you need to have a good case in hand to clearly demonstrate your problem to support. There are several specific types of problems that fall into the area of performance:
For SQL-related problems, the minimum information that support will need is the output of tkprof, preferably with the explain option enabled. If this is not sufficient to demonstrate the bug, further information will be required. If the query has gone bad because of an event such as an upgrade, a "before" tracefile will be most useful.
Scalability problems are very difficult to demonstrate and often require extensive investigation before making a determination. As scalability has been covered in some detail in this book, it should be clear that poor scalability can arise for many reasons. Proving it to be an Oracle problem is often difficult. However, if you are observing a problem such as severe latch contention and cannot determine why this should be the case, it is worth talking to support.
General performance problems always require a utlbstat/utlestat1 this report to be sent to support. Even if you have checked this yourself, support personnel will not be happy to look any further unless you send the request in first. They will also need to know how you have determined that performance has deteriorated. Does this problem seem to be related to workload, to an upgrade, to a new machine, etc. The more evidence you can gather that an Oracle problem is causing the performance problem, the faster you will receive answers back from Oracle.
If you experience a total freeze of the instance, do not shut down the instance until you have gathered some diagnostic information for support. Specifically, before looking at a total hang, support will need a systemstate dump:
Try a level 1 dump (SYSTEMSTATE 1) to begin with, and then reconnect (to get a fresh tracefile) and try a level 2 dump. Send support anything that you get from this, which will appear as a standard tracefile in the user_dump_dest location.
Thankfully, memory leaks within the Oracle Server are rare. However, if you suspect a memory leak in the Oracle processes, support will be looking for the following information, at a minimum:
This information may be only the beginning of a lengthy debugging session, but it will at least get support on the right track.
If the problem involves Oracle Parallel Server, different information will probably be required. In the case of an OPS hang, the following information will be required:
The important theme to note here is that the same files are required from every node that has an OPS instance on it, because it only takes one node to misbehave, and the entire synchronization of Parallel Server could be disabled. Therefore, missing one node could mean that the required information is not present, even if that node still appears to be running normally.
In the case of a crash, the requirement is similar to that for a single-instance system, except that the tracefiles are required from each node. All tracefiles mentioned in the alert file on every node are required.
If you are still running OPS with Oracle7, the lock manager software is the responsibility of the hardware vendor. Things can get very difficult to diagnose in this instance, because the problem could lie anywhere in the Oracle or clusters code. When calling problems of this nature in to support, be sure to have the DLM version and the clusters version available.
OPS is particularly prone to hardware problems. Owing to its reliance on shared disk devices and the cluster interconnect, any errors in this area can cause OPS to have a problem. Therefore, if you experience a problem on an OPS system (or on a single-instance system, for that matter), make sure that you check the syslog.log for any hardware problems prior to calling Oracle support.
In an ideal world, you would not encounter any bugs. Reality, unfortunately, is very different, and bugs are a fact of life. The general trend has been toward fewer bugs, but as products get more and more complex, there is plenty of scope for bugs to be present.
![]() Scale Abilities Ltd http://www.scaleabilities.co.uk Voice: +44 1285 644533 info@scaleabilities.co.uk |
|