Origins of the TPC and the first 10 years
by Kim Shanley, Transaction Processing Performance Council
February, 1998

Preface
In my view, the TPC's history can be best
understood by focusing on two of its major organizational
activities: 1) creating good benchmarks; 2) creating a good
process for reviewing and monitoring those benchmarks. Good
benchmarks are like good laws. They lay the foundation for
civilized (fair) competition. But if we have good benchmarks,
why do we need all the overhead of a process for reviewing and
monitoring the benchmark results? Similarly, you might ask: if we
have good laws, why do we need police, lawyers, and judges? The
answer to both questions is the same. Laws and benchmarks are
not, in and of themselves, enough. And by this I don't mean to imply
that it's simply human nature to break or bend the rules. The
TPC has found that no matter how clear-cut the rules appear to
be when the benchmark specifications are written, there are
always gray areas, and yes, loopholes left in the benchmark law.
There must be a way of addressing and resolving these gray areas
and loopholes in a fair manner. And yes, even "good
laws," said Aristotle, "if they are not obeyed, do not
constitute good government." Therefore, there must be a
means for stopping those who would break or bend the rules.
While this book is primarily a technical overview of the
industry's benchmarks, the TPC's history is about both benchmark
law and benchmark order.
The State of Nature
In writing this early history of the TPC, I've drawn heavily
upon the account by Omri Serlin published in the second edition
of this handbook. It was through Omri's initiative and
leadership that the TPC was founded.
In the early 1980's, the industry began a race that has
accelerated over time: automation of daily end-user business
transactions. The first application to receive widespread
focus was automated teller machine (ATM) transactions, but we've seen
this automation trend ripple through almost every area of
business, from grocery stores to gas stations. As opposed to the
batch-computing model that dominated the industry in the 1960's
and 1970's, this new online model of computing had relatively
unsophisticated clerks and consumers directly conducting simple
update transactions against an on-line database system. Thus,
the on-line transaction processing industry was born, an
industry that now represents billions of dollars in annual
sales.
Given the stakes--even at this point in the race--over who could
claim the best OLTP system, the competition among computer
vendors was intense. But, how to prove who was the best? The
answer, of course, was a test--or a benchmark. Beginning in the
mid-1980's, computer system and database vendors began to make
performance claims based upon the TP1 benchmark, a benchmark
originally developed within IBM that then found its way into the
public domain. This benchmark purported to measure the
performance of a system handling ATM transactions in a batch
mode without the network or user interaction (think-time)
components of the system workload (similar in design to what
later turned out to be TPC-B). The TP1 benchmark had two major
flaws. First, by ignoring the network and user interaction
components of an OLTP workload, the system under test (SUT)
could generate inflated performance numbers. Secondly, the
benchmark was poorly defined and there was no supervision or
control of the benchmark process. As a result, the TP1 marketing
claims, not surprisingly, had little credibility with the press,
market researchers (among them Omri Serlin), or users. The
situation also deeply frustrated vendors, who felt that their
competitors' marketing claims, based upon flawed benchmark
implementations, were ruining every vendor's
credibility.
Early Attempts at Civilized Competition
In the April 1, 1985 issue of Datamation, Jim Gray, in
collaboration with 24 others from academia and industry,
published (anonymously) an article titled "A Measure of
Transaction Processing Power." This article outlined a test
for on-line transaction processing that came to be known as
"DebitCredit." Unlike the TP1 benchmark, Gray's
DebitCredit benchmark specified a true system-level benchmark
where the network and user interaction components of the
workload were included. In addition, it outlined several other
key features of the benchmarking process that were later
incorporated into the TPC process:
- Total system cost published with the performance
  rating. Total system cost included all hardware
  and software used to successfully run the benchmark,
  including five years of maintenance costs. Until this
  concept became law in the TPC process, vendors often
  quoted only part of the overall system cost that
  generated a given performance rating.
- Test specified in terms of high-level functional
  requirements rather than any given hardware or software
  platform or code-level requirements. This allowed any
  company to run the benchmark if it could meet the
  functional requirements.
- Benchmark workload scale-up rules, under which the
  number of users and the size of the database tables
  increased in proportion to the system's transaction
  rate. The scaling prevented the workload from being
  overwhelmed by the rapidly increasing power of OLTP
  systems.
- The overall transaction rate would be constrained by a
  response time requirement. In DebitCredit, 95 percent
  of all transactions had to be completed in less than 1
  second (a minimal sketch of this check follows the
  list).
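To make the response-time constraint concrete, here is a minimal sketch, in Python, of the check a test harness might perform; the sample data and function name are illustrative and not taken from the DebitCredit specification:

```python
# Sketch: verify DebitCredit's response-time constraint that 95 percent
# of all transactions complete in less than 1 second. The function name
# and structure are illustrative, not from the specification.

def meets_response_time_constraint(latencies_sec, percentile=0.95, limit_sec=1.0):
    """Return True if at least `percentile` of transactions finish under `limit_sec`."""
    if not latencies_sec:
        return False
    within_limit = sum(1 for t in latencies_sec if t < limit_sec)
    return within_limit / len(latencies_sec) >= percentile

# Example: 96 of 100 transactions under 1 second -> constraint met.
sample = [0.4] * 96 + [1.8] * 4
print(meets_response_time_constraint(sample))  # True
```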
The TPC Lays Down the Law
While Gray's DebitCredit ideas were widely praised by industry
opinion makers, the DebitCredit benchmark had the same success
in curbing bad benchmarking as Prohibition did in stopping
excessive drinking. In fact, according to industry analysts like
Omri Serlin, the situation only got worse. Without a standards
body to supervise the testing and publishing, vendors began to
publish extraordinary marketing claims on both TP1 and
DebitCredit. They often deleted key requirements in DebitCredit
to improve their performance results.
From 1985 through 1988, vendors used TP1 and DebitCredit--or
their own interpretation of these benchmarks--to muddy the
already murky performance waters. Omri Serlin had had enough. He
spearheaded a campaign to see if this mess could be straightened
out. By August 10, 1988, Serlin had convinced eight
companies to form the Transaction Processing Performance Council
(TPC).
TPC-A
Using the model and the consensus that had
already developed around the DebitCredit benchmark, the TPC
published its first benchmark, TPC Benchmark A (TPC-A), within
one year (November 1989). TPC-A differed from DebitCredit in the
following respects:
- The requirement that 95 percent of all transactions must
  complete in less than 1 second was relaxed to a
  requirement that 90 percent of transactions complete in
  less than 2 seconds.
- The number of emulated terminals interacting with the
  SUT was reduced to a requirement of 10 terminals per
  tps, and the cost of the terminals was included in the
  system price (a short sketch of this scaling arithmetic
  appears below).
- TPC-A could be run in a local or wide-area network
  configuration (DebitCredit had specified only WANs).
- The production-oriented requirements of the benchmark
  were strengthened to prevent the reporting of peak,
  unsustainable performance ratings. Specifically, the
  ACID requirements (atomicity, consistency, isolation,
  and durability) were bolstered and specific tests were
  added to verify them.
Finally, TPC-A specified that all benchmark
testing data should be publicly disclosed in a Full Disclosure
Report.
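The terminal-scaling rule tied a vendor's priced configuration directly to its throughput claim. The following is a minimal sketch of that arithmetic in Python; the 10-terminals-per-tps constant comes from the list above, while the function name and structure are my own illustration, not anything specified by TPC-A:

```python
# Sketch: under TPC-A, each tps of claimed throughput requires 10 emulated
# terminals, all of which must be included in the priced configuration.
# The constant is from the rules summarized above; the rest is illustrative.

TERMINALS_PER_TPS = 10

def required_terminals(target_tps: int) -> int:
    """Emulated terminals that must be configured (and priced) for a rating."""
    return TERMINALS_PER_TPS * target_tps

for tps in (10, 100, 1_000):
    print(f"{tps:>5} tpsA -> {required_terminals(tps):>6} emulated terminals")
```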
The first TPC-A results were announced in July 1990. Four years
later, at the peak of its popularity, 33 companies were
publishing on TPC benchmarks and 115 different systems had
published TPC-A results. In total, about 300 TPC-A benchmark
results were published.
The first TPC-A result was 33 tpsA at a cost of $25,500 per
tpsA. The highest TPC-A result ever recorded was 3,692 tpsA at
a cost of $4,873 per tpsA. In summary, the highest tpsA rating
increased by a whopping factor of 111, and price/performance
improved by a factor of five (the arithmetic is spelled out in
the sketch following this paragraph). Does this increase in the
top tpsA ratings correspond to an identical increase in the
real-world performance of OLTP systems during this period? Even
keeping in mind that this is a comparison of peak, not average,
benchmark ratings, the answer is no. The increase in tpsA
ratings is just too great.
increase can be attributed to four major reasons: 1) the first
benchmark test is usually run for bragging rights and is grossly
unoptimized compared to later results; 2) real performance
increases of hardware and software products; 3) vendors
improving their products to eliminate performance bugs exposed
by the benchmark; and 4) vendors playing the benchmarking game
effectively--learning from each other on how best to run the
benchmark. So yes, there is a gamesmanship aspect to the TPC
benchmark competition, but it should not obscure the fact that
TPC benchmarks have provided an objective measure of a truly
vast increase in computing power of hardware and software during
this period. Indeed, the benchmarks have accelerated some of
these software improvements. Nor should the marketing
gamesmanship invalidate TPC-A's legacy achievement: for the
first time, the industry had an objective and standard means of
comparing the performance of a vast number of systems.
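The improvement factors above are simple ratios of the best and first published numbers. A small Python sketch makes the arithmetic explicit; the figures are the ones quoted in this paragraph:

```python
# Sketch: improvement factors for TPC-A, computed from the figures quoted
# above (first result: 33 tpsA at $25,500/tpsA; best: 3,692 tpsA at $4,873/tpsA).

first_tps, best_tps = 33, 3_692
first_price, best_price = 25_500, 4_873  # dollars per tpsA

print(f"throughput factor:        {best_tps / first_tps:.1f}x")      # ~111.9x
print(f"price/performance factor: {first_price / best_price:.1f}x")  # ~5.2x
```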
TPC-B
While TPC-A leveraged the industry consensus
built up over DebitCredit, the TPC was much more ambivalent
about publishing a benchmark around the TP1 model. The
ambivalence about what was to become the TPC's second
benchmark, TPC-B, lasted throughout the life of the benchmark.
Everyone within the TPC organization believed that one of
TPC-A's principal strengths was that it was an end-to-end system
benchmark that exercised all aspects of an OLTP system.
Furthermore, this OLTP system -- users working at terminals,
conducting simple transactions over a LAN connected to a
database server -- was a model of computing that everyone could
intuitively understand.
As described earlier, TP1 (and later TPC-B) was the batch
version of DebitCredit, without the network and user interaction
(terminals) figured into the workload. A strong block of
companies within the TPC, including hardware companies who sold
"servers" (as opposed to end-to-end system solutions)
and database software companies, felt that the TPC-B model was
more representative of the customer environments they sold into.
The anti-TPC-B crowd, on the other hand, argued that the
partial-system model that TPC-B represented would reduce the
stress on key system resources and would therefore produce
artificially high transaction rates. In addition, because
network and terminal pricing would be eliminated, the total
system cost would be artificially low, thereby artificially
boosting TPC-B's price/performance ratings. Finally, the
anti-TPC-B spokespeople argued that since TPC-A and TPC-B were
so much alike, using an identical "tps" throughput
rating, users and the press would be confused. Whether they won
the argument can be debated, but what cannot be debated is that
TPC-B proponents eventually won the day: in August 1990, TPC-B
was published as the second TPC benchmark standard. TPC-B used
the same TPC-A
transaction type (banking transaction) but it cut out the
network and user interaction component of the TPC-A workload.
What was left was a batch transaction processing
benchmark.
The first TPC-B results were published in mid-1991 and by June
1994, at the peak of its popularity, TPC-B results had been
published on 73 systems. In total, about 130 TPC-B tests were
published -- roughly 2.5 times fewer than the number of TPC-A
results. The first TPC-B result was 102.94 tpsB with a cost of
$4,167 per tpsB. The highest TPC-B rating was a 2,025 tpsB
result, and the best price/performance number was $254 per tpsB.
In summary, the top TPC-B rating increased by a factor of 19,
and price/performance improved by a factor of
16.
It's more difficult to give the legacy of TPC-B the
unconditional stamp of success. It never received the user or
market analyst acceptance that TPC-A did, but many within the
Council and the industry perceived a real value in this
"database server" benchmark model. This belief fueled
a later failed TPC efforts around TPC-S (more on this later),
and continues to influence the development of TPC benchmarks
today. In January 1998, the TPC announced the formation of a Web
Commerce benchmark (TPC-W) which will measure OLTP and browsing
performance only of the web server, excluding the network and
human interaction components of the overall system. So, TPC-B
proponents may not have won the public debate on the merits of
TPC-A versus TPC-B, but they may take some measure of
satisfaction that the back-end model of benchmarking lives
on.
Political Reform Begins Immediately
The TPC was a major improvement over the state of nature that
existed previously. However, as the Council was to learn, most
of the work of building a successful benchmarking organization
and process was ahead of them. In this sense, the early TPC was
not unlike the early American colonies, which idealistically
believed that they could eliminate the endless political and
legal conflict of the Old World by passing laws abolishing
lawyers. While such a law still appeals to a significant
minority, in our more judicious moments, most would agree that
it's not possible.
As soon as vendors began to publish TPC results, complaints from
rival vendors began to surface. Every TPC result had to be
accompanied by a Full Disclosure Report (FDR). But, what
happened when people reviewed the FDR and didn't like what they
read? How could protest be registered and how would it be
adjudicated? Even if a member of the public or a vendor
representative were, so to speak, to make a citizen's arrest of a
benchmark violator, there was no police or court system to turn
the perpetrator over to for further investigation or, if need be,
prosecution. It became apparent to the Council that without an
active process for reviewing and challenging benchmark
compliance, there was no way that the TPC could guarantee the
level playing field it had promised the
industry.
Throughout 1990 and 1991, the TPC embarked on a political
journey to fix this hole in its process. The Technical Advisory
Board (TAB), which was originally constituted as just an
advisory board, became the arm of the TPC where the public or
companies could challenge published TPC benchmarks. The TAB
process, which remains in place today, established a fair,
deliberative mechanism for reviewing benchmark compliance
challenges. Once the TAB has thoroughly researched and reviewed
a challenge, the TAB makes a recommendation to the full Council.
The full Council then hears the TAB's report, discusses and
debates the challenge, and then votes on the challenge. If the
Council finds the result non-compliant in a significant or major
way, the result is immediately removed as an official TPC
result.
Benchmarking Versus Benchmarketing
By the spring of 1991, the TPC was clearly a success. Dozens of
companies were publishing multiple TPC-A and TPC-B results. Not
surprisingly, these companies wanted to capitalize on the TPC's
cachet and leverage the investment they had made in TPC
benchmarking. Several companies launched aggressive advertising
and public relations campaigns based around their TPC results.
In many ways, this was exactly why the TPC was created: to
provide objective measures of performance. What was wrong,
therefore, with companies wanting to brag about their good
results? What was wrong was that there was often a large gap
between the objective benchmark results and their benchmark
marketing claims--this gap, over the years, has been dubbed
"benchmarketing."
So the TPC was faced with an ironic situation. It had poured an
enormous amount of time and energy into creating good benchmarks
and even a good benchmark review process. However, the TPC had
no means to control how those results were used once they were
approved. The resulting problems generated intense debates
within the TPC.
Out of these Council debates emerged the TPC's Fair Use
policies, adopted in June, 1991:
- When TPC results are used in publicity, the use is
expected to adhere to basic standards of fidelity,
candor, and due diligence, the qualities that together
add up to, and define, Fair Use of TPC Results.
- Fidelity: Adherence to facts; accuracy
- Candor: Above-boardness; needful completeness
- Due Diligence: Care for integrity of TPC results
Have the TPC's Fair Use policies worked? By and
large, they have been effective in stopping blatant misuse or
misappropriation of the TPC's trademark and good name. In other
words, very few companies claim TPC results when, in fact, they
don't have them. In general, TPC member companies have done a
fair job in policing themselves to stop or correct fair use
violations that have occurred. At times, the TPC has acted
strongly, issuing cease and retraction orders, or levying fines
for major violations.
It must be said, however, that there remains today among the
press, market researchers, and users, a sense that the TPC
hasn't gone far enough in stamping out benchmarketing. This
issue has two sides. On the one hand, companies spend hundreds
of thousands of dollars, even millions of dollars, running TPC
benchmarks to demonstrate objective performance results. It's
quite legitimate, therefore, for these companies to market the
results of these tests and compare them with the results of
their competitors. On the other hand, no company has a right to
misrepresent or mislead the public, regardless of how legitimate
the benchmark tests may be. So where does the "war" on
benchmarketing stand today? Much like the war on crime, the war
on benchmarketing persists, and the TPC continues to wage an
active campaign to eliminate it.
Codifying the Spirit of the Law
With the creation of a good review and fair use process, and
with dozens of companies publishing regularly on the TPC-A and
TPC-B benchmarks, the TPC may be forgiven for lapsing into a
self-satisfied belief that the road ahead was smooth. That sense
of well-being was torpedoed in April, 1993 when the Standish
Group, a Massachusetts-based consulting firm, charged that
Oracle had added a special option (discrete transactions) to its
database software, with the sole purpose of inflating Oracle's
TPC-A results. The Standish Group claimed that Oracle had
"violated the spirit of the TPC" because the discrete
transaction option was something a typical customer wouldn't use
and was, therefore, a benchmark special. Oracle vehemently
rejected the accusation, stating, with some justification, that
they had followed the letter of the law in the benchmark
specifications. Oracle argued that since benchmark specials,
much less the spirit of the TPC, were not addressed in the TPC
benchmark specifications, it was unfair to accuse them of
violating anything.
The benchmarking process, which sprang from the discredited TP1
and DebitCredit days, has always been treated with a fair degree
of skepticism by the press. So the Standish Group's charges
against Oracle and the TPC attracted broad press coverage.
Headlines like the one in the May 17, 1993 issue of Network
World were not uncommon: "Report Finds Oracle TPC results
to be misleading; says option discredits TPC-A as
benchmark."
Whether Oracle's discrete transaction option was truly a
benchmark special was never formally discussed or decided by the
TPC. The historical relevance of this incident was that it
spurred the TPC into instituting several major changes to its
benchmark review process.
New Anti-Benchmark Special Prohibition
TPC benchmark rules had always required
companies to run the benchmark tests using commercially available
software. However, after the Standish Group charges, the Council
realized that it had no real protection from companies that
purposely designed a benchmark special component into their
commercially available software. In other words, this special
component could be buried in some obscure corner of the overall
product code and only be used when the vendor wanted to run a
TPC test. If the TPC was formed to create fair, relevant
measures of performance, then yes, the benchmark special was a
violation of the TPC's spirit and thus had to be prohibited.
In September, 1993, the Council drew a line in the sand by
passing Clause 0.2, a sweeping prohibition against benchmark
specials that has become part of the bedrock of the TPC
process to ensure fair, relevant benchmarks:
- Specifically prohibited are benchmark systems, products,
technologies or pricing...whose primary purpose is
performance optimization of TPC benchmark results
without any corresponding applicability to real-world
applications and environments. In other words, all
"benchmark special" implementations that
improve benchmark results but not real-world performance
or pricing, are prohibited.
Clause 0.2 in TPC-A and TPC-B went into effect
in June, 1994. Oracle decided not to test its discrete
transaction option against the new anti-benchmark special rules
in the specifications and withdrew all of its results by
October, 1994. Let it also be noted that Oracle remains a TPC
member and strong supporter of the organization.
New TPC Auditing Process
As a result of the 1993 controversies, the TPC
realized that the millions of dollars being invested in the
running of TPC benchmarks would be completely wasted if the
credibility of the results were challenged. The TPC's process of
FDR review was fine, but it was only invoked after a
result was published and publicized. Yes, the TPC could yank a
result from the official results list after it was found to be
non-compliant, and even fine a company for violating the
specifications, but the damage to the company's competitors and
the TPC's credibility would already have been done. In summary, it
wasn't enough to catch the bad horse after it had left the barn.
The goal was to stop the bad horse from ever getting out of the
barn.
The result of these discussions, passed in September and
December, 1993, was the creation of a group of TPC-certified
auditors who would review and approve every TPC benchmark test
and result before it was even submitted to the TPC as an
official benchmark or publicized. While TPC benchmarks are still
reviewed and challenged on a regular basis, the TPC auditing
system has been very effective in preventing most of the bad
horses from ever leaving the barn.
New and Better Benchmarks
From the outset, I have said that the TPC's
history is both about benchmark law and benchmark order. From
the last sections of this chapter, the reader might have
received the false impression that the TPC is exclusively a
political organization endlessly embroiled in public
controversies and institutional reform. The "benchmark
order" activity of the TPC is certainly important, but the
TPC's day-to-day focus is to build better
benchmarks.
TPC-A was a major accomplishment in bringing order out of chaos,
but TPC-A was primarily a codification of the simplistic TP1 and
DebitCredit workloads created in the mid-1980's. However, what
was very clear even as the TPC members approved TPC-A in late
1989 was that better, more robust and realistic workloads would
be required for the 1990's.
Two benchmark activities were launched in 1990: the development
of TPC-C, the next-generation OLTP benchmark, and of TPC-D, a
decision support benchmark.
Both TPC-C, which was approved as a new benchmark in July, 1992,
and TPC-D, which was approved in April, 1994, are covered in
other chapters of this book, so I'll add only a few comments on
them here.
The first TPC-C result, published in September, 1992, was a 54
tpmC result with a cost of $188,562 per tpmC. As of this date
(January 1998), more than five years later, the top result is
52,871 tpmC with a cost of $135 per tpmC. We have witnessed the
same tremendous improvement in the top TPC-C numbers as we did
for TPC-A, and for the same reasons: 1) real-world performance
and cost improvements and 2) increased knowledge about how to
run the benchmark. (Again, keep in mind that by looking only at
peak numbers, and not averages, we're seeing an exaggerated
inflationary effect.) Currently, there are 143 official TPC-C
results, a higher total than TPC-A had at the peak of its
popularity.
The first TPC-D result was a 100 GB result in December, 1995,
with a throughput rating of 84 QthD and a price/performance
rating of $52,170 per QphD. Today, the top 100 GB throughput
result is 1,205 QthD at $1,877 per QphD. Currently, there are
28 official TPC-D results. Why so few? TPC-D is only 2.5 years
old, compared to TPC-C's venerable six years, and TPC-D is more
expensive and complex to run.
Both TPC-C and TPC-D have gained widespread acceptance as the
industry's premier benchmarks in their respective fields (OLTP
and Decision Support). But the increase in the power of
computing systems is relentless, and benchmark workloads must
continually be enhanced to keep them relevant to real-world
performance. Currently, a new major revision of TPC-C is being
planned for release in early 1999. A new major revision of TPC-D
is being planned for mid-1998 and another one in
1999.