Document created: 4 September 03
Air University Review, September-October 1975

OER Inflation, Quotas, and Rating-the-Rater

Lieutenant Colonel Walter T. Brown, Jr.

The implementation of the new Air Force Officer Effectiveness Report (OER) system has stimulated renewed interest in the problem of personnel evaluation systems. This article discusses a little-publicized but possible method for enhancing personnel evaluations. This capability would also provide two new options for eliminating OER inflation as well as additional management tools for senior managers. The new methodology resulted from a recent research project at the Air Force Academy. Ten years of Air Force OER data (2.2 million OER’s written between 1960 and 1969) were used in demonstrating its feasibility. The capability can be extended to officer and enlisted personnel of all services, to the Civil Service, and to industry.

A quantity called a “Tag” has been devised that quite accurately describes an officer’s OER history. It has been computed for almost 200,000 past and present Air Force officers.* A new term, “rating-the-rater,” is also coming into use. It refers to a computer capability for tracking each rater and adjusting for his inflationary tendencies. Each officer’s Tag incorporates, among other things, a correction for the inflationary tendencies of each of his raters. Later the discussion will show how the rating-the-rater concept can eliminate OER inflation without resorting to the quotas of the new OER system. These terms, “Tag” and “rating-the-rater,” ** will now be described along with some of their possible uses.

* The Tag is a numerical measure of each officer’s past performance. It is not in any way connected with an officer’s promotion file or any other file on an officer. The 200,000 Tags that have been produced for the 1960-1969 data base are used only for research purposes.

**The term “rating-the-rater” was first mentioned to the author by Lieutenant General Marshall S. Carter, USA (Retired), in a private conversation. He firmly believed that “OER inflation will never be licked until the rater is rated.”

What is the Tag?

Officers must be evaluated. However, anyone evaluating OER’s knows that they contain many possibilities for error. Several of the most significant errors will be identified in this article, along with an explanation of how they can be corrected. As the reader may have guessed, these errors are primarily a result of the OER inflation in past years. It will soon be apparent that the errors in old OER’s are so serious that they can no longer be ignored. But to eliminate them requires such a tremendous number of computations and such a vast amount of comparative data that a computer must be employed. Once the computer has stripped out these errors, a quantity (or number) remains that presents a very good relative picture of an officer’s OER history. This quantity is called a “Tag.”

Let us begin with a brief overview of some of the considerations that should be weighed in evaluating an OER file. The evaluator, of course, should recognize that some raters are tougher than others. Ideally, all easy and tough raters should be identified, along with the degree to which they are easy or tough. One should also consider the year an OER was written, since there has been an upward trend in ratings over the past two decades. The grade of the ratee is also important because typically the more senior officer receives higher ratings. (See Figure 1.) The period of supervision on each OER deserves special attention, since some ratees are frequently and intentionally shifted from one rater to another to build up a large number of “max” ratings on the top of their file. We must be careful not to overemphasize (or weight) the older evaluations at the bottom of the file; attitudes and performance frequently change with time. However, inflation has pushed more recent OER’s up against the ceiling, and consequently newer OER’s may provide less differentiation and therefore deserve less weight than one might expect. Finally, an ideal measure of an officer’s OER history should consider the ratee’s command. It is well known throughout the Air Force that at certain times some commands established an unwritten policy of inflating their ratings. Therefore, an adjustment should be made for the inflationary biases of a command at the time a rating was given. These frequently overlooked difficulties are directly addressed and corrected by the Tag concept. Let’s now look more specifically at how this is done.

Figure 1. Inflation trends by grade and year

How is the Tag constructed?

As one can see, an accurate assessment of a rating requires a broad perspective. Consider now what is required to interpret properly an overall evaluation of 8 of a maximum possible of 9 points. Figure 1 shows that such a rating in 1967 would have been above average for lieutenants, captains, and majors. However, it would have been slightly below average for lieutenant colonels and colonels. Knowing only this, we would have to conclude that any captain, for example, who received this rating in 1967 was above average. But to determine just how far above average would require some measure of the extent to which the ratings for captains varied in that year. It is unreasonable to expect the evaluator of an officer’s promotion file to do the required arithmetic, but with the aid of a computer it is easy.
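The arithmetic described above can be sketched as follows. This is only an illustration: it assumes the comparison is a standard score (rating minus the grade-year mean, divided by the grade-year standard deviation), which the article does not state explicitly. The function name and the sample ratings are invented for the example and are not drawn from the 1960-1969 data base.

```python
from statistics import mean, pstdev


def normalize_rating(rating, peer_ratings):
    """Express a rating in standard deviations above or below the mean of
    all ratings given that year to officers of the same grade.
    (Assumed form of the normalization; the article gives no formula.)"""
    mu = mean(peer_ratings)
    sigma = pstdev(peer_ratings)
    if sigma == 0:
        return 0.0  # no spread among peers, so no differentiation is possible
    return (rating - mu) / sigma


# Invented overall evaluations (9-point scale) for two grades in one year.
captains = [6, 7, 7, 8, 8, 9]
colonels = [8, 8, 9, 9, 9, 9]

z_captain = normalize_rating(8, captains)   # an 8 is above the captains' mean
z_colonel = normalize_rating(8, colonels)   # the same 8 is below the colonels' mean
print(z_captain, z_colonel)
```

The same raw rating of 8 thus yields a positive score for a captain and a negative score for a colonel, which is the point made by Figure 1.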

Consider another problem facing an evaluator as he flips through stacks of OER’s. In recent years, many officers have received maximum ratings on both their overall evaluation, 9, and their promotion potential, 4. Distinguishing between these 9s and 4s is difficult. The most obvious way to break the ties is to score all quantitative information on both sides of the form: overall evaluation, promotion potential, and the ten or so rating factors on the front of the form. To do this properly requires not only time and patience but also consistency. This is again a task in which the computer can be a valuable aid.

Another problem an evaluator must resolve is the significance (or weight) that should be assigned to each OER. Certainly a report covering a 360-day period of supervision should be given greater weight than one covering only 90 days. With a little thought one can see that each OER should be weighted in proportion to the length of the period of supervision.

The next question about weighting concerns the weights that old OER’s should have relative to the most recent OER. The author’s study of the problem has shown that, other things being equal, the older a rating is, the less weight it should receive. In other words, the weight might be represented by a ski-slope type of curve similar to the one shown in Figure 2. The difficulty arises in deciding which one of the thousands of curves of that general shape should be used. Here is how the “best” curve was selected.

Figure 2. OER obsolescence curve

An officer was picked at random, and the front and back of all the OER’s he received were scored (with appropriate adjustments for inflation trends shown in Figure 1). His most recent OER score was covered, and no one was allowed to look at it. Then a weighted average of all his other OER’s was calculated, based on the first of thousands of possible weighting curves. This process was repeated for each of the other weighting curves. The question was then asked, How much did each weighted average deviate from the most recent score, the one that was covered up? The magnitude of these misses was calculated, each miss being associated with the particular curve involved. The whole process was repeated by selecting another officer and scoring all of his OER’s. Again, the amount by which each curve’s weighted average missed “predicting” the most recent OER was calculated. These results were added to the previous misses. After looking at tens of thousands of officers, one curve was found that had fewer total miss points than all others. This curve is the one shown in Figure 2. As the dashed lines indicate, an OER loses about half its predicting value every five years.
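The selection procedure just described might be sketched as below. Two assumptions are made that the article does not confirm: the family of candidate curves is taken to be exponential decay indexed by a half-life in years (suggested by the "half its predicting value every five years" reading of Figure 2), and a miss is taken to be the absolute difference between the weighted average and the covered score. The officer records are invented solely to exercise the code.

```python
def weighted_prediction(older_oers, half_life):
    """Weighted average of older OER scores.
    older_oers: list of (age_in_years, score), age measured back from the
    covered (most recent) OER. Weight halves every `half_life` years."""
    weights = [0.5 ** (age / half_life) for age, _ in older_oers]
    total = sum(weights)
    return sum(w * score for w, (_, score) in zip(weights, older_oers)) / total


def total_miss(officers, half_life):
    """Sum over officers of |predicted - covered| for one candidate curve."""
    return sum(
        abs(weighted_prediction(older, half_life) - covered)
        for older, covered in officers
    )


def best_half_life(officers, candidates):
    """Return the candidate curve with the fewest total miss points."""
    return min(candidates, key=lambda h: total_miss(officers, h))


# Invented records: (list of (age, score) for older OERs, covered latest score).
officers = [
    ([(1, 1.0), (6, 0.0)], 0.67),
    ([(2, 0.4), (7, -0.6)], 0.05),
]
chosen = best_half_life(officers, [1, 2, 5, 10, 20])
print(chosen)
```

With tens of thousands of real officers the candidate list would be far finer, but the bookkeeping (accumulate misses per curve, keep the minimum) is the same.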

A final problem confronting the evaluator of an OER file is to adjust for two types of inflationary bias—rater bias and command bias. To begin, we need to rate-the-rater. Is the rater, relatively speaking, easy or hard, and to what degree? We cannot simply assume that a rater who gave high ratings is inflationary. Perhaps he had better ratees. This possibility can only be resolved by examining the OER file of each officer he rated, to see how others rated him, and this will prompt a similar question: Were each of these other raters hard or easy? Obviously, the problem of rating-the-rater expands rapidly. It turns out that all raters and ratees in the Air Force must be examined simultaneously to assess the inflationary or deflationary tendencies of each rater.
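The simultaneous nature of the problem can be illustrated with a small sketch. It assumes a simple additive model (each normalized score equals the ratee's quality plus the rater's bias) and solves it by repeated averaging; this is a stand-in chosen for brevity, not the program actually run at the Academy, whose details the article does not give. All names and scores are invented.

```python
def rate_the_raters(ratings, sweeps=200):
    """Estimate ratee quality and rater bias together.
    ratings: list of (ratee, rater, normalized_score).
    Assumed model: score = quality[ratee] + bias[rater], with the average
    rater defined to have zero bias."""
    quality = {ratee: 0.0 for ratee, _, _ in ratings}
    bias = {rater: 0.0 for _, rater, _ in ratings}
    for _ in range(sweeps):
        # A rater's bias is what is left after removing his ratees' quality.
        for rater in bias:
            left = [s - quality[e] for e, r, s in ratings if r == rater]
            bias[rater] = sum(left) / len(left)
        # Centre the biases so that "easy" and "hard" are relative terms.
        centre = sum(bias.values()) / len(bias)
        for rater in bias:
            bias[rater] -= centre
        # A ratee's quality is what is left after removing his raters' bias.
        for ratee in quality:
            left = [s - bias[r] for e, r, s in ratings if e == ratee]
            quality[ratee] = sum(left) / len(left)
    return quality, bias


# Invented data: the easy rater also happens to have the better ratees.
ratings = [
    ("A", "easy", 1.5),
    ("B", "easy", 0.5),
    ("B", "hard", -0.5),
    ("C", "hard", -1.5),
]
quality, bias = rate_the_raters(ratings)
print(quality, bias)
```

In this example the easy rater's raw average is 1.0, yet only half of that is inflation; the rest reflects better ratees. Officer B, rated by both raters, is the link that lets the two effects be separated, which is why every rater and ratee must enter the calculation together.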

A similar technique must also be used to determine the inflationary biases of each command for each year. For example, SAC-1964 and TAC-1968 (and all other command-year combinations) would be analogous to raters. By comparing how these “raters” evaluated the same officers, one can compensate for each command’s bias in past years.*

*Although the Tags currently do not reflect the command-year inflationary bias, future calculations could easily incorporate this adjustment.

To summarize, all [X]s on the front and back of every Air Force OER are scored. The score for each OER is then compared (normalized) with all other ratings given in that same year to officers of the same grade. Next, the inflationary tendencies of each rater and each command-year are determined, and adjustments are made in each OER’s normalized score. These adjusted scores are finally weighted in proportion to their period of supervision and the measure of each OER’s currency as shown in Figure 2. The weighted average of each officer’s adjusted OER’s is his Tag.
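The final weighting step can be sketched as follows, under stated assumptions: each OER's weight is taken to be its days of supervision multiplied by a currency factor that halves every five years (one reading of Figure 2), and the adjusted scores are supplied ready-made. The field names and figures are hypothetical.

```python
def compute_tag(oers, half_life_years=5.0):
    """Weighted average of an officer's adjusted OER scores.
    oers: list of dicts with keys
      'adjusted_score'   - normalized score after rater/command adjustment
      'supervision_days' - period of supervision covered by the report
      'age_years'        - years between the report and the Tag date
    Assumption: weight = supervision_days * 0.5 ** (age / half_life)."""
    weights = [
        o["supervision_days"] * 0.5 ** (o["age_years"] / half_life_years)
        for o in oers
    ]
    total = sum(weights)
    if total == 0:
        raise ValueError("no weighted OERs on file")
    return sum(w * o["adjusted_score"] for w, o in zip(weights, oers)) / total


# Hypothetical file: one current report and one five-year-old report.
oer_file = [
    {"adjusted_score": 1.0, "supervision_days": 360, "age_years": 0},
    {"adjusted_score": -1.0, "supervision_days": 360, "age_years": 5},
]
tag = compute_tag(oer_file)
print(tag)
```

Because the older report carries half the weight of the current one, the Tag here leans toward the recent score rather than settling at the simple average of zero.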

The job of rating-the-rater and rating-the-command is quite large and exceeds the capability of most first- and second-generation computers.** Fortunately, the necessary OER data have been kept on computer tape by the Human Resources Laboratory since 1960. With these data and the Air Force Academy’s third-generation Burroughs 6700 computer, the job (including the other corrections needed to produce the book of 200,000 Tags) required only several hours of computer time.***

**For the benefit of the reader who recalls his first course in high school algebra, this task is roughly equivalent to solving 300,000 simultaneous linear equations. An industrious high school student with a well-sharpened pencil would require over a billion billion years to solve such a problem. Probably he would make a mistake before the first day was over.

***I wish to acknowledge the invaluable assistance of Lieutenant Colonel Douglas Johnson during the final programming phase of the project. The initial program for solving this problem would have required 10,000 hours of computer time. Reprogramming efforts eventually reduced the time to 20 hours, after which Colonel Johnson was able to reduce it further to 2-3 hours.

How accurate is the Tag?

The reader has probably thought of ways in which an incorrect Tag might be produced. The most obvious of these is rater inconsistency. It is possible for a rater to change from an inflationary to a deflationary tendency, or vice versa. If that should happen, the computer would classify the rater as being somewhere between the two extremes. Unless the fluctuations from one extreme to the other are considerable, the error caused by such a change should be small. Unfortunately, this sort of inconsistency produces errors in most rating systems.

A second area of concern, and one in which individuals will probably differ, is the scoring weights to be assigned to the various blocks on the OER form. Admittedly, changing the weights will produce a different Tag. However, a sensitivity study I conducted showed that reasonable but differing weighting schemes had only a minor effect on groupings of Tags. Since individuals disagree on the desired weights for the different blocks on the OER form, a decision must ultimately be made at the Air Force policy-making level. In short, the different blocks on the OER form can be weighted to reflect the significance that the Air Force wishes them to reflect.

A frequent question concerning the Tags is, “How well do they predict the results of promotion boards?” This question, however, reflects two possible misconceptions that require comment:

(1) The Tag is not intended to be a promotion-predicting device. Nor is it proposed that the Tag be given to promotion boards. The Tag, which contains adjustment for certain systematic errors, is only a relative measure of an officer’s OER history. Its uses will be discussed shortly.

(2) Promotions are not based strictly upon OER’s. A promotion board must consider many other factors, such as the needs of the service, breadth of military experience, educational background, and skills. In addition, subjective judgments will always be required in making promotion decisions. Nevertheless, past promotion board results are one means of validating the Tag concept. Keeping this in mind, let us look at the relationships between the Tags (past performance) and promotability.

The Tags that have been calculated were current as of 1 July 1969 and therefore were used (after the fact) to “predict” the subsequent year’s O-4 through O-7 promotion board results. It was found that by using only the Tag (and ignoring job categories, education, combat experience, etc.) over 85 percent of all promotions and passovers would have been correctly “predicted.” While such a high “batting average” does not necessarily confirm the accuracy of the Tags, it does support the reasonableness of the Tag concept.

potential use of the Tags

Initially the Tags of Air Force officers were calculated so that the Air Force Military Personnel Center could determine how abilities were distributed among the different commands. The results were displayed by command for all grades in each of ten years. Significant differences between commands were noted. The first possible use therefore is as a feedback device for the assignment people. It would provide a comparison of the current assignment policy with the actual distribution of officer abilities throughout the Air Force. In this way subsequent assignments could bring the distribution more nearly in line with policy.

The second potential use, related to the first, is to give senior Air Force commanders (including the Air Force Chief of Staff) improved visibility of how abilities are distributed within their respective organizations. Some commanders would undoubtedly find that their headquarters had absorbed too much of the organization’s talent. Other commanders would find certain subordinate units relatively low in officer quality—low enough that excessive passovers and loss of morale and effectiveness could be expected unless personnel changes occurred. With this tool, some important personnel problems would be anticipated and circumvented.

Cost-effectiveness studies involving personnel offer another fruitful area for using Tags. As many realize, often the most difficult part of a cost-effectiveness study is determining a meaningful measure of effectiveness. This is especially difficult when the study concerns military personnel. The Tag is in many cases an appropriate measure: a measure of officer performance.

For example, a recent study examined the job performance of officers from various commissioning sources (ROTC, Air Force Academy, and Officer Training School) in each of three skill areas (pilot, navigator, and nonrated). The effectiveness data, together with the “procurement” costs for an officer of a given skill from each commissioning source, will be of value to certain decision-makers. Other studies could address the cost effectiveness of various professional military education programs and military-sponsored civilian education (both in the humanities and the sciences). The Tags can also be of value in establishing appropriate criteria for selections to various training and education programs.

A fourth possible use of the Tags is related to the new OER system. Under the new system no reviewer may exceed the quota of 22 percent top ratings, 28 percent middle ratings, and 50 percent bottom ratings, regardless of the quality of the officers being rated. In view of the unequal distribution of abilities as indicated by the distribution of Tags among the various commands, this quota system produces inequities. Furthermore, most officers tend to believe that their organization has above-average officers, and consequently they feel that the new system discriminates against them. This problem could be overcome by using the Tags to tailor quotas to fit the group of officers currently being rated under each reviewer. An organization having officers with above-average performance records would then be given a better-than-average quota.

As an example, consider a rating cycle in which a reviewer of ten majors on the Air Staff is given the standard quota of two top, three middle, and five bottom ratings. Had a tailored quota been given in this hypothetical situation, it might have allowed five top, three middle, and two bottom ratings. Such a tailored quota system would make the rating system more equitable. Notice that the tailored quota does not dictate who should receive the top OER’s. It would offer each person in the group being rated a chance to receive a top rating; the better his group’s past record of performance, the more top ratings. Stated another way, a ratee would have the same chance of receiving any specific rating, regardless of whether he is competing with high- or low-quality officers.
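One way such a tailored quota might be computed is sketched below. The article does not specify the rule, so the method here is an assumption: service-wide Tag cutoffs are fixed at the points that separate the top 22 percent and the next 28 percent of all officers in the grade, and the group's quota is the count of its members' Tags falling in each band. The cutoff values and the ten Tags are invented.

```python
def tailored_quota(group_tags, top_cutoff, middle_cutoff):
    """Count how many Tags in a reviewer's group fall in each service-wide
    band. Returns (top, middle, bottom) allowances for the group.
    The counts set only how many of each rating may be given; they do not
    say which officer receives which rating."""
    top = sum(1 for t in group_tags if t >= top_cutoff)
    middle = sum(1 for t in group_tags if middle_cutoff <= t < top_cutoff)
    bottom = len(group_tags) - top - middle
    return top, middle, bottom


# Hypothetical service-wide cutoffs for majors (top 22 percent, next 28 percent).
TOP_CUTOFF = 0.77
MIDDLE_CUTOFF = 0.0

# Invented Tags for ten majors in an above-average organization.
air_staff_majors = [1.4, 1.1, 0.9, 0.85, 0.8, 0.5, 0.3, 0.1, -0.2, -0.6]
quota = tailored_quota(air_staff_majors, TOP_CUTOFF, MIDDLE_CUTOFF)
print(quota)
```

For this invented group the allowance works out to five top, three middle, and two bottom ratings, in place of the standard two, three, and five. Only the aggregate of the Tags is used; the reviewer remains free to decide who earns each rating.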

For over a decade the academic grading system at the Air Force Academy has been based on this principle; courses having above-average students (based on college board tests or prior academic records) are allowed to give a higher proportion of top grades. As a result, problems associated with inflated academic grades, which are frequently found at other colleges and universities, have been practically eliminated at the Academy.

It is worth noting that none of these uses involve individual Tags, only the aggregating of many Tags. Thus slight random errors that might exist in individual Tags tend to average out when aggregated. There is justifiable concern about basing any significant personnel decision on a single Tag. Therefore, access to the Tag should be restricted. It may be necessary to modify the computer program so that only aggregates are printed, never individual Tags.

One of the more controversial questions concerning uses of individual Tags is whether to show an officer his own Tag. Strong arguments exist on both sides of this issue. Those supporting such an action ask, Since every officer has access to his OER file, why not give him the best possible picture of his OER history? Another supporting argument is that each officer must make major career decisions, and he deserves the best possible information for making those decisions. Besides, the recently amended Freedom of Information Act and the Privacy Act of 1974 would probably authorize officers to see their Tags.

Those who oppose allowing an officer to see his Tag argue that a rater or reviewer might be influenced if the Tag of someone he was rating became known to him. Others feel that if individual Tags existed, they might “leak” to people such as assignment officers. The assignment officer, as a result of time pressure or a tendency to be overly influenced by a number, might base an assignment primarily on the Tag instead of the “whole man.” Certainly indiscriminate use of and excessive confidence in individual Tags can be dangerous.*

* Another controversial area involves giving individual Tags to selection boards. Although this might smack of Brave New World to some, it would not be nearly as mechanistic as the Weighted Airman Promotion System (WAPS), which has been generally well received by Air Force enlisted personnel. The reader may want to consider the many pros and cons involved.

another role for rating-the-rater

Some readers may have observed that, by rating-the-rater, inflation could be brought under control without requiring a quota system. Each rater would have a personal stake in not inflating his ratings, since his inflationary bias would always be known and an adjustment of the rating would be made accordingly. Consider the rater who always gave maximum ratings to a typical cross section of officers. His ratings, if adjusted, would all become average ratings. Since the rater would thereby forfeit his opportunity to advance the better officers and retard the worst, there would be some motivation for raters to distinguish between them. If a rater did so, it would not matter which end of the rating scale he used, high or low.* The fear that most officers have of working for a hard rater would also be overcome, since such a rater’s deflationary tendency could be identified and corrected automatically.

* Unusual rating patterns could be easily detected by the computer, and, where appropriate, corrective counseling of the rater could be initiated.

In 1968 the Army adopted a new OER form containing an innovation designed to stop inflation (see Figure 3). Raters and indorsers were to compare the ratee with all other “comparable” officers they were currently rating. In the example shown in Figure 3, the rater considered the officer the fifth best of the eight officers performing similar functions under him.

Figure 3. Army OER form, 1968

Unfortunately, this system eventually broke down because (1) raters frequently (or conveniently) concluded that no other officers were comparable and (2) machinery did not exist for policing the system. Eventually instructions were given to disregard this portion of the form. Rating-the-rater would overcome these problems. The computer, in effect, would fill out this portion of the form, and lapses of memory (or cheating) could not occur.

Furthermore, if a rater had better-than-average officers, the rating-the-rater system would automatically take that into consideration. The Army system did not.

quota vs. rating-the-rater

The two methods for controlling inflation, rating-the-rater and quota systems (either a standard quota as in the new OER system or a tailored quota), can be compared in several ways. Both pose their own unique set of problems, yet both avoid the far greater problems associated with inflation.

Quotas. The primary requirement of quota systems is that many officers must be rated simultaneously. This “pooling” of a large number of ratees is required to insure (1) that at least one officer in each pool has an opportunity to receive a top rating and (2) that the distribution of abilities being rated will have a greater chance, statistically speaking, of matching the distribution of ratings allowed by the quota. Therefore, to achieve the necessary pool size requires (1) that reviewers (as opposed to raters or additional raters) give the rating that is controlled* by the quota and (2) that the controlled rating for all officers in a given grade be given at the same time each year. In other words, there must be rating cycles. Let’s now look at the consequences of these two corollary requirements.

*Quite obviously the controlled rating will carry greater significance in the eyes of promotion boards simply because it must fit the quota and cannot be inflated.

The costs of having reviewers give the controlled ratings are high. Reviewers, who are typically colonels and generals, must be involved in the time-consuming process of making these hard decisions. One estimate is that about ten reviewer man-years will be spent each year in policing this system. Even others must become involved at the reviewer and command level. Since reviewers often have little firsthand knowledge of the ratees, advisory boards are frequently established to recommend specific ratings. In other cases meetings are held among intermediate supervisors to determine who will receive the high and low ratings. Some commands, through internal restructuring of jobs, have even established administrative positions for the purpose of maintaining statistics for the new OER system.

Some ratees feel that several inequities exist, not the least of which is that their organization deserves a higher quota. Reviewers are sometimes geographically separated from some of their ratees while their other ratees work directly for them in the headquarters. In other instances, the reviewer, who may not know the ratee by sight, must frequently overrule the rater’s opinion in order to meet his quota. Consequently, ratees become concerned, and the rater sees his supervisory position being somewhat weakened.

Rating cycles, the second corollary requirement of quota systems, produce high peak workloads and other administrative problems. Officers who have worked up to eight months under their rater can avoid a controlled rating for this time period if their rating official is changed four months or more before their annual rating. Other officers who must depart their organization within four months of their annual rating often feel that their interests suffer, since they are not present while the hard rating decisions are being made. These perceived inequities are the price one must pay to achieve the large “pool” of ratees that quota systems require.

Rating-the-Rater. Now let’s look more closely at how rating-the-rater can work as a control system and the problems associated with it. There are several means for achieving this control. The most obvious is that inflators, including commands, raters, and intermediate organizations, can be identified with a high degree of confidence. (The reader will recall from the discussions involving the Tag that the calculation of a rater’s inflationary tendency compensates for the average quality of the officers under the rater.) This awareness would provide the top leadership in the Air Force with information which, if acted upon in any of several ways, could assist in licking inflation. Another means of control would be to place in each promotion file a summary sheet of the rating histories of each officer’s raters since the control system began. There could be an accurate and up-to-date entry, similar to that shown in Figure 3, for each OER in the file.* Additionally, there could be an index describing the aggregate quality of the officers under each rater or even a score for each OER that contained an adjustment for OER bias. In this way differences in quality throughout the Air Force could be recognized.

*Of course, the ratee should also receive a copy of this summary sheet.

Rating-the-rater as a control system has its problems, too. Obviously the job could not be done without a computer. But computers, in general, are not trusted by the officer force. Many months of testing would be required in order to assure promotion boards that the rating histories of raters and the aggregate quality of those whom they have rated were accurate. There would also have to be a significant effort to educate the officer force as to what mechanisms were at work to insure that they were being treated fairly. Finally, several technicians would be needed to maintain the computer software and distribute the computer outputs.

Both quotas and rating-the-rater will control inflation. The latter technique would require a thorough explanation to the officer force and use of some computer-generated data in a highly sensitive and personal area—promotions. On the other hand, with rating-the-rater as the control to prevent inflation, the rater would give the controlled rating, and he would do so whenever a rating was appropriate. Quotas and rating cycles would be unnecessary. Inequities associated with standard quotas would disappear.

A new capability has been developed. Raters and commands can be tracked to determine their inflationary biases. Past OER’s can be adjusted for these biases and the inflationary creep of each officer grade. What remains after the adjusted OER’s are properly weighted is a Tag describing an officer’s OER history. Two hundred thousand such Tags have been produced for research purposes. Similar results could be developed for enlisted and civilian personnel.

Tags can be used in many studies where some quantitative measure of officer performance is required. Tags can also be aggregated to show senior commanders how their officer quality is distributed. People, like money, are resources and must be wisely allocated by the commander to best achieve his mission.

By aggregating Tags at the reviewer level (tailoring quotas), we would create a rational and equitable basis for OER quotas.

On the other hand, rating-the-rater can provide the control to eliminate inflation. A quota system would not then be needed. Raters (as opposed to reviewers) would once again be allowed to determine ratings; expensive overhead costs associated with quota systems would be eliminated; and the inequities and administrative problems associated with rating cycles for each grade would be removed. Officers would not feel that the system works against them merely because they are in an outstanding organization.

Several years ago it could truly be said that the job of rating-the-rater was prohibitively large. Recent advances in computer technology now make it possible.

And so, to the other great computer conquests let us proceed to add that long-time vexation of the military: the Officer Effectiveness Report.

United States Air Force Academy

Editor’s notes:

Readers are invited to raise questions or comment on this article, either directly to Colonel Brown at the Department of Mathematical Sciences, United States Air Force Academy, Colorado, 80840 (AUTOVON: 259-4470), or to the Military Personnel Center, Randolph Air Force Base, Texas, 78148.


Contributor

Lieutenant Colonel Walter T. Brown, Jr. (USMA; Ph.D., Massachusetts Institute of Technology) is a Tenure Associate Professor of Mathematics, U. S. Air Force Academy. Previous assignments have been as Director of the Army’s Benet Research, Development, and Engineering Laboratories; with the 25th Infantry Division, Vietnam; White Sands Missile Range, New Mexico; 82d Airborne Division; and Ranger School. Colonel Brown is a graduate of Army Command and General Staff College and a distinguished graduate of Air War College.

Disclaimer

The conclusions and opinions expressed in this document are those of the author cultivated in the freedom of expression, academic environment of Air University. They do not reflect the official position of the U.S. Government, Department of Defense, the United States Air Force or the Air University.
