Intellyx Cortex Newsletter by Eric Newcomer
I recently hosted a podcast for BMC Software on the subject of mainframes in financial services, and talked about my experiences with mainframe systems at Credit Suisse and Citigroup.
It got me thinking about my early days in the database and transaction processing industry, and why mainframes are still so good at high volume, reliable transaction processing.
When I first heard about ACID transactions, I was amazed to hear that they let you just pick up from where you left off after a system failure, automatically recovering the database to the state it was in before the failure.
I ended up spending the next twenty years or so working on applications, standards, and products in the database and transaction processing area, and eventually had the great honor of co-authoring a book on the subject with Phil Bernstein (with a big assist from Jim Gray).
Then: At the Start
In 1982, near the start of my career, I designed and built a complete order entry/inventory management application for Salomon Ski’s warehouses on the HP 3000 minicomputer. ACID transactions were not available.
When the system went down – as it did fairly often – we would call the customer service team upstairs and tell them to hold off entering orders until we could restore the database.
We used HP’s IMAGE/QUERY database utility to crawl through the records until we found the partial updates and manually deleted them, noting the order numbers so that customer service could re-enter them. Once this was done, we’d call and let them know they could start entering orders again.
ACID transactions were exactly what we needed, although I didn’t know it at the time. And even if I did know about them, they were not available on minicomputers, only on mainframes.
A few years later I joined the database engineering team at Digital Equipment Corp (DEC). They had by that time implemented ACID transactions for their DBMS product (a network database similar to IDMS) and were developing a new SQL-based relational database, which also would support ACID transactions.
The initial challenge was to implement transactions locally, meaning for a single database. But in the evolving computer landscape at the time, applications were starting to migrate from mainframes to minicomputers – and often in parts. It was fairly common to see a mainframe application move to multiple minicomputers.
Coordinating transactions across multiple databases, running on different machines connected via a network, started to emerge as a major requirement in the distributed computing world.
Vendors cooperated on standards such as The Open Group’s XA and TxRPC and OMG’s OTS (adopted in Java EE as JTS), and finally WS-Transactions (which I helped write). And Microsoft created MTS for coordinating remote transactions across Windows machines. But the complexity and performance overhead of distributed transactions often inhibited their adoption and use.
Why are Transactions so Hard?
Single operations on data are atomic, which means when you write something to a database, you write all the data at once to the disk and it either succeeds or fails. But if your business transaction requires multiple writes to a database, as is often the case, the first one might succeed and the second one might fail.
Business transactions reliably create records of operations in the real world. For example, a retail transaction records the sale of an item and the payment for the item. The retail database needs to record both operations to maintain consistency of the inventory count and sales accounting.
Consider the example of an ATM withdrawal. You get cash and your account is debited for the amount of the withdrawal. What happens if the ATM fails just after the withdrawal is recorded but the cash is not dispensed? This thankfully doesn’t happen often, but it does happen, and transaction processing systems have to be able to recover from that (usually involving a call to customer service to make a manual adjustment, which is expensive for the bank and frustrating for the consumer). This is an example where a transaction directly involves a real world action, not just a record of one.
Take the simple case of a debit/credit operation such as a funds transfer from one account to another, as illustrated below.
Begin Transaction Transfer
    Write Debit Amount to Account A
    Write Credit Amount to Account B
    <error>
End Transaction
To maintain accurate balances, both writes to the database have to succeed. Business transactions typically involve multiple steps, and this isn’t simple to handle on a computer, because a computer performs one write at a time, in sequence. When an error occurs writing the credit amount to the database, the previous write for the debit amount has to be reversed to maintain consistency.
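With an ACID transaction, that reversal happens automatically: the database rolls back the partial debit when the credit fails. Here is a minimal sketch in Python using the standard sqlite3 module; the account names, balances, and simulated failure are illustrative, not any real banking code:

```python
import sqlite3

# In-memory database with two illustrative accounts
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically: both writes commit or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            # Simulate a failure between the two writes
            if dst == "BAD":
                raise RuntimeError("credit failed")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
    except Exception:
        pass  # the debit has already been rolled back by the database

transfer(conn, "A", "BAD", 30)  # fails: the partial debit is undone, balances unchanged
transfer(conn, "A", "B", 30)    # succeeds: both writes commit together
```

The application never has to crawl through records looking for partial updates, as we did on the HP 3000; the rollback is the database’s job.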
And of course, last but not least by any means, the transaction processing system has to scale up to handle thousands of database operations a second without causing a failure or missing an update.
Why Mainframes are Good at Transaction Processing
As I summarized in this video snippet from the recent podcast, transactions are sensitive to latency. If your transaction processing application and all the resources it requires are on the same machine, it’s easier to engineer a robust, reliable solution to the long list of transaction processing requirements, especially recovery from failure, because the potential failure scenarios are constrained to that machine.
Distributed systems and cloud native infrastructure may be more cost efficient and agile, but it’s more difficult to engineer those platforms to achieve the same levels of reliability and performance for large transaction processing applications such as airline reservation systems, core banking systems, and credit card processing systems, to name a few.
So What’s up Now?
A big part of what we discussed during the podcast was how the IT landscape has evolved, and with it the role of the mainframe in that landscape.
It’s really interesting to see the evolution of computing continue from the back office computer room out onto the desk, into the cloud, and onto mobile devices.
For example, several of the projects I worked on while at Citi involved developing APIs for banking as a service, embedded finance, or OpenBanking applications. We developed other APIs to power the Web and mobile applications.
It didn’t make any difference to the external industry applications or to the web and mobile applications whether the APIs were hosted by mainframes, distributed applications, or cloud native applications. In fact we often used a combination of all three.
One of the most interesting projects was working with Google to connect Gpay to Citi’s consumer banking application. The idea was to allow a Gpay user to open a bank account and tie it to Gpay for making payments. The project was eventually canceled, but we got as far as connecting Gpay to the Citi mainframe responsible for executing consumer bank transactions.
Gpay was running in the Google Cloud, the Citi APIs were hosted on distributed systems, the transaction processing program ran on a mainframe, and users were on mobile phones (yes, we supported both Android and iOS).
This strikes me as a pattern we’re often going to see in the API-driven world of mobile, web, cloud, and gen AI — different computing platforms playing different roles in combination to meet multiple requirements in a larger IT ecosystem.
Speaking of APIs
As Gartner reported at last year’s API Days Conference, REST is the leading protocol for APIs. Citi’s APIs in the Gpay project were all REST based, for example.
But there is something about RESTful APIs that’s very pertinent to transaction processing.
Roy Fielding’s dissertation is frequently cited as the authoritative definition of REST. He also co-authored the HTTP specification, and HTTP is basically the reference implementation of REST.
In section 5.1.3, page 78, Fielding explains the reasoning behind the constraint that communication across the web must be stateless (clients are web browsers here, but today can also be mobile apps):
“We next add a constraint to the client-server interaction: communication must be stateless in nature, as in the client-stateless-server (CSS) style of Section 3.4.3 (Figure 5-3), such that each request from client to server must contain all of the information necessary to understand the request, and cannot take advantage of any stored context on the server. Session state is therefore kept entirely on the client.”
The significance for distributed transaction processing is that traditional distributed transaction protocols, such as the OTS/JTS standard, use a shared persistent state mechanism to propagate the transaction context. Stateless communication eliminates this mechanism, requiring new solutions.
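One common stateless pattern is to make each request self-contained and idempotent: the client supplies everything the server needs, including a request ID it can safely retry, so no server-side session has to survive between calls. A hypothetical sketch follows; the field names, request-ID scheme, and in-memory idempotency table are illustrative assumptions, not any particular bank’s API:

```python
# Each request carries its complete context; an idempotency table replaces
# per-session state, so any server replica can handle any retry.
processed = {}  # request_id -> result (a shared durable store in practice)

def handle_transfer(request):
    """Stateless handler: everything needed is in the request itself."""
    req_id = request["request_id"]
    if req_id in processed:      # safe retry: return the earlier result
        return processed[req_id]
    result = {
        "status": "completed",
        "from": request["from_account"],
        "to": request["to_account"],
        "amount": request["amount"],
    }
    processed[req_id] = result
    return result

first = handle_transfer({"request_id": "r-1", "from_account": "A",
                         "to_account": "B", "amount": 30})
retry = handle_transfer({"request_id": "r-1", "from_account": "A",
                         "to_account": "B", "amount": 30})
assert retry is first  # the retry is recognized, not executed a second time
```

Note how this inverts the shared-context model: instead of the server remembering the transaction, the client carries the context with every request, exactly as Fielding’s constraint requires.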
So you can kind of trace the evolution of transaction processing solutions from their origin on a single machine, to LAN-based distributed systems, where resources are still tightly controlled enough to support stored communication context, and finally to WAN-based protocols such as REST that assume disconnected (i.e. stateless) operation to ensure scalability and the ability to change dynamically.
Even in the case of LAN-based distributed systems, where the shared context solution works, the overhead of the two-phase commit transaction protocol is significant, which means in practice it is only used when really necessary.
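The overhead comes from the protocol’s extra round trips and forced logging: the coordinator must ask every participant to prepare (and durably record its vote) before any of them is allowed to commit. A simplified sketch of the coordinator logic is below; real implementations add persistent logs, timeouts, and crash recovery, which this toy version omits:

```python
class Participant:
    """A resource manager that can tentatively prepare, then commit or abort."""
    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare
        self.state = "active"

    def prepare(self):
        # Phase 1 vote; a real participant force-writes this vote to its log
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.will_prepare

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit only if every vote was yes."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:  # any single 'no' vote aborts the whole transaction
        if p.state == "prepared":
            p.abort()
    return "aborted"

a, b = Participant("db-a"), Participant("db-b")
assert two_phase_commit([a, b]) == "committed"

c, d = Participant("db-c"), Participant("db-d", will_prepare=False)
assert two_phase_commit([c, d]) == "aborted"  # the prepared participant rolls back
```

Every participant blocks between prepare and the coordinator’s decision, which is exactly the latency and availability cost that makes the protocol a last resort in practice.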
But now we are really getting into the world of the Internet and the web and global applications. These older, more traditional “scale up” transaction processing systems are designed for internal corporate data centers, not for world wide web and mobile applications.
And that’s basically why we use APIs to communicate with the systems that execute the transactions — we do not execute transactions on phones and web browsers. I mean technically there are solutions out there that could be made to work, but once again the overhead involved is prohibitive.
Which is a big reason mainframe systems remain an integral part of modern IT infrastructures.
A Trend to Watch — Distributed SQL
Up to now I haven’t mentioned anything about the challenge of cloud native transaction processing, but the challenge is implicit in the industry adoption of RESTful APIs.
A short summary: Google invented cloud native infrastructure about twenty-five years ago (see photo of the original server) to support its global web-based search engine business.
This type of infrastructure is also called “commodity servers” or characterized as “scale out” architecture because you can easily add a virtually unlimited number of computers to it.
At the time it was the most cost efficient IT infrastructure available, and Google’s competitors quickly adopted it. In 2006 Amazon decided to rent out spare capacity in their commodity data centers using APIs for compute and data (EC2 and S3), and the cloud as we know it was born. (Whether it’s still the most cost effective infrastructure is a good question.)
A key characteristic of cloud native infrastructure is that it’s a reliable system composed of unreliable (i.e. consumer grade) components. I would also characterize it as bringing the web into the data center — initial communication across the systems used HTTP. Whenever one component fails, another immediately takes its place. Stateless communication is a requirement for this to work.
And cloud native infrastructure had to scale to support global web based applications. Therefore REST was a good fit, but all of this meant there was no shared context to support existing transaction coordination mechanisms.
Initially cloud providers rejected the widely adopted relational databases (based on SQL) in favor of more scalable NoSQL databases.
The idea was to trade off consistency for latency (i.e. to ensure a fast response to browser requests), which led to the popularity of “eventual consistency” mechanisms. This was in large part because achieving ACID level guarantees for multiple database writes while also supporting the global distribution model in the cloud environment was considered too difficult.
However, fairly recently the problem has been cracked, and we’re seeing databases offering ACID level consistency guarantees for globally distributed data stores, including Spanner from Google, CockroachDB from Cockroach Labs, YugabyteDB from Yugabyte, and most recently Aurora DSQL from AWS.
How they do it is a story for another column 😉
The Intellyx Take
Transaction processing requirements have not changed much over the years. Business transactions typically involve multiple operations on data, and businesses depend on transaction processing software to record them accurately, both for the sake of their customers and for their own accounting needs.
ACID transactions were originally developed for mainframe computers that ran the first automated commercial applications.
Distributed computing environments enabled lower cost options for commercial applications, using smaller, less expensive units. Initially, they did not support ACID transactions, and lacking this feature limited their ability to attract commercial application business.
Cloud native infrastructure, with its stateless computing model and web scale customer base, introduced additional challenges for ACID transactions. Initially, it seemed as if they couldn’t be solved without introducing prohibitive performance penalties, but eventually Distributed SQL technologies emerged to meet the ACID test.
Mainframes, however, remain a core element of modern IT infrastructure, because they still offer the best implementation approach. ACID transactions thrive in a closely controlled environment, and struggle when potential failure modes increase.
Copyright © Intellyx BV. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. BMC Software is an Intellyx customer. No AI was used to write this article. Photo credit: energepic.com from Pexels.


