Mainframe discovery using lexical analysis
"If you plan to maintain, improve, migrate, transform, or otherwise modify software systems, knowledge of those systems is essential." (Ulrich, 2010, p.20)
Modernising a large legacy information system can be an enormous challenge. These systems are usually characterised by large scale, sprawl from organic growth, extensive customisation, and complex entanglement. If that was not enough of a challenge, they are frequently beset by patchy documentation, loss of knowledgeable people, and technical obsolescence. Organisations cannot abandon the technology as these systems are embedded into the fabric of the organisation's processes. So they need to unpick and understand decades of development, at the very time when their appetite to invest in analysis is low, and when skills are scarce.
Academic and practitioner literature strongly recommends that modernisation begins with discovery, although there is little consensus as to how that may be done. Soft analysis might appear quick and even expert-driven, but can fail to deliver enough detail on long-forgotten applications. Hard analysis may not be feasible either. Code may be the only authoritative source of information, but multiple languages will challenge reverse engineering, and it will be hard to assemble a large-enough team with the necessary knowledge to manually model the code.
This paper describes a novel analytical technique that was applied to an active z/OS mainframe at a large retailer in the United Kingdom. The three-month project delivered accurate, complete, physical views of functionality at the component level, and did this very quickly and cheaply compared to manual analysis. The technique is a derivative of lexical analysis, using SQL to search ten million lines of code in seven languages against a lexicon of a quarter of a million components. This generated a list of functional dependencies, which were then used to model functionality, dependencies, integration and data. Outputs supported design, planning and estimating decisions by business, project management and architecture stakeholders. There were some limitations, so the technique is best applied to discovery in the early stages of decommissioning.
Thanks are due to the project sponsors for the opportunity to undertake this work, to Andy Cope for sponsorship and encouragement, and to Shane Mara and David Croasdale for their invaluable technical contributions.
Modernisation of legacy systems
Computer systems become obsolete. Their hardware ages, languages go out of fashion, and the architecture evolves. Early design compromises accumulate to the point that it becomes cheaper to rewrite systems in new technology than to repair the old. The term ‘legacy’ is applied to systems like these, including equipment such as 'mainframes' that were once the backbone of many large organisations. Bennett (1995) defines ‘legacy’ as “large-scale software systems that can not be maintained any longer but are essential for the company.”
Legacy systems are challenging to modernise. There are technical complications, including obsolete technologies, ancient languages, scarce expertise, lack of documentation, extensive customisation, layers of middleware, data replication, entanglement with desktop and spreadsheet computing, and violation of standards and quality criteria (Boussaidi et al., 2012; do Nascimento et al., 2013; Ganti and Brayman, 1995; Summerville, 2011; Ulrich, 2010). These systems may have outlasted their envisaged useful lifetimes, but they are often essential to daily business processes. People also present challenges; non-technical inhibitors include high cost of removal, perceptions of business criticality, fear of data loss, return on investment concerns, and resistance to change (Fanelli et al., 2016). Our experience would add to that list expectations of quick turnaround, disagreements about budget allocation, unknown technical dependencies, and a reluctance to ‘slum it’ (use old languages and technology).
Selection of a modernisation strategy is given much prominence in literature, in some part due to the importance, expense and disruption of modernisation. There is vaguely defined terminology, such as decommissioning, transitioning, migration, winding-down, and sunsetting. Approaches include big-bang, COTS, greenfield, iterative, partial migration, refactoring, scraping, service migration, wrapping, and the more colourfully named Chicken Little, Cold Turkey and Butterfly (Khanya et al., 2018; Salvatierra et al., 2013). Even on an application-by-application basis there are choices to scrap, continue to operate, re-engineer, or replace the application and its data (Summerville, 2011, p.252). The alluring “just-do-it” has an appeal, but incremental approaches that involve replacing the system one application at a time are fraught with budget uncertainty and what Fanelli et al. (2016) describe as “significant architectural complexities and integration challenges.” Nevertheless, Martens et al. (2018) felt that “the slowest and least risky option is long-time or maintenance-driven migration.”
There is a substantial body of literature spanning several decades that describes how to modernise legacy systems, such as Ganti and Brayman (1995), Ransom et al. (1998), Seacord et al. (2003), Summerville (2011), Fanelli et al. (2016) and Khanya et al. (2018). Some describe abstract and broad frameworks like the System Migration Life Cycle (Althani and Khaddaj, 2017b) that concentrate on guidance and assurance. Others provide extensive explanations, like Ganti and Brayman (1995), Newcomb and Ulrich (2010), Summerville (2011) and Wagner (2014). Modernisation literature has, however, been criticised, with Mehrizi et al. (2019) observing that “we know little about how organisations discontinue their legacy IS”, and Althani and Khaddaj (2017b) of the opinion that “existing migration strategies are either too specific or highly descriptive”. A literature search confirmed that texts frequently provide just a brief explanation of the technical elements of modernisation. Newcomb and Ulrich (2010, pp.38-43), for example, refer in broad terms to a “modernisation workbench” and its associated “modernisation tools”, and Ganti and Brayman cover analysis in just four pages (pp.227-231). Detailed technical methods may be found, but only by applying prior knowledge to widen the literature review’s scope to include topics like parsing, data definition, reverse engineering, and feature separation.
Strategic choices should be underpinned by good information. One frequent question that dominates many early conversations is “what is on the mainframe?”, followed by “this is not a silly question”. Usually stable after twenty-plus years of operation and maintenance, the applications and data underpin many business processes and are woven into the fabric of the organisation. However, their functionality is largely invisible unless it is presented on a “green screen”, and mainframe architecture and complexity can be outside of many decision-makers' frames of understanding.
Literature commonly mentions discovery as the starting point for modernisation. Boussaidi et al. (2012) recommend that “understanding the legacy system is mandatory ... a software architecture reconstruction process is required to reconstruct and document the architecture of existing systems before initiating any modernization actions.” Haugen and Pelot (2018) talk about defining the data, applications and data flows first. The Object Management Group (2019) advocates Architecture Driven Modernisation, an approach that seeks to address information technology that has become “interwoven into complex and often convoluted software architectures”. Wagner (2014) recommends reverse engineering the “as-is” system into models, then forward-engineering a replacement. Althani and Khaddaj's System Migration Life Cycle (2017b) begins with a pre-migration stage to understand, analyse and investigate, followed by migration (redesign, develop, deploy) and post-migration (evaluation and release).
Numerous practitioners also recommend discovery as a first step. PWC (2016) advises that “an application inventory allows for comparison, prioritisation and definition of offload strategies.” Oracle suggests to “put your data first or your migration will come last” (2014), and “the most effective way of delivering a data migration program is to fully understand the data sources before starting to specify migration code” (2011). Others begin by modelling the application landscape with a “portfolio validation” exercise (KPMG, 2015), or conduct a preliminary analysis of the application portfolio as part of their WARP methodology (CapGemini, 2013).
There are numerous benefits to a discovery exercise. Ulrich (2010, pp.20-21) identifies streamlining of implementation tasks, reducing delivery times, and informing the organisation about the big picture, scope, objectives, capacity, authority, cross-functional insights and impact. Wu et al. (2005) saw discovery as an effective mechanism for decomposing large systems into manageable components, and do Nascimento et al. (2013) regarded analysis as beneficial to mapping business processes.
Conducting a discovery exercise
Discovery is likely to be expensive and challenging, so a good start would be to clarify activities and stakeholders that discovery is going to support, agree on the benefits, choose a discovery strategy, then apply the necessary methods.
There are plenty of choices for discovery strategy. Alkazemi et al. (2013) prepared a broad framework for assessing systems that covers the four contexts of support, business, architecture and technology. Fanelli et al. (2016) and Khanya et al. (2018) describe ‘database first’ and ‘database last’ approaches. There are advocates for starting with the business processes, including Grace et al. (2008), Fanelli et al. (2016) and do Nascimento et al. (2013). Ganti and Brayman (1995, p. 227) begin their ‘transition process’ with defining the business processes that the legacy system supports (p.223).
None of these strategies promises to be easy. Discovery takes time and skills, in which the organisation is unlikely to want to invest. There is a perception that deep analysis can take 'forever', particularly where large-scale legacy systems contain hundreds of applications and tens of thousands of data tables. Kruger (2017) noted there are considerable difficulties involved in feature separation, and Sandkuhl and Seigerroth (2019) were of the opinion that modelling business processes and methods can be a "knowledge-intensive and expensive exercise". For these reasons, Ulrich (2010, p.21) recommends starting with an enterprise assessment that is broad and shallow to provide sufficient data and process information to support decision-making and strategy formulation.
High-level analysis based on subject matter experts' knowledge may appear to be affordable and reliable. However, human memory is fallible and can deliver incomplete, inaccurate and misleading information. Those experts may have developed the system or been power users, but they may not have touched the system for years, or may be unfamiliar with, or unaware of, back-office processing, integration, and data structures. Technical teams will lose critical knowledge and struggle to unravel the data and code dependencies. Even those who are literate in the languages, technology and business may be hindered by software erosion, defined by Perez-Castillo et al. (2011) as degradation of legacy code quality due to uncontrolled maintenance.
High-level analysis may not be able to rely on documentation due to issues of quality and completeness (do Nascimento et al., 2013). There is considerable risk in organisations where “legacy systems have been given derisory resources to be properly maintained” (Althani and Khaddaj, 2017a). There is a tendency for development to use documentation whilst maintenance prefers source code as a reference, and the problem is widespread: “there is no end to the stories of legacy software systems lacking documentation or having low-quality documentation” (Garousi et al., 2015).
Code may therefore be the only (or most authoritative) source of detailed information. However, it is also the most voluminous, skill-dependent, and inaccessible source of information. There may also be situations where code does not describe all the artefacts, such as component names held in variables or tables, or data tables that have been orphaned by loss of code.
Interrogating code for the purpose of discovery has been considered for several decades. A number of approaches have been proposed, and typically follow the process of decomposition, assessment and modelling. Analytical methods may be categorised according to three foci, as described in the following paragraphs:
Feature mapping methods may be found in Eisenbarth et al. (2003), Guan et al. (2018), Klammer and Pichler (2014), Kuipers and Moonen (2000), Sepulniece et al. (2015), Vemuri (2008) and Wagner (1995). Wu et al. (2005) describe a technique called module dependency analysis and see it as suitable for procedural languages where "the whole system can be divided into many program pieces (such as source files) that are inter-dependent." These dependencies between 'business objects' are explored further in Martens et al. (2018).
Process analysis concentrates on extracting the business logic or program logic. Business rule extraction is described in Wu et al. (2005), Fanelli et al. (2016) and do Nascimento et al. (2013). Methods to document program logic can closely approximate reverse engineering, and are described in terms like structural analysis (McMillan et al., 2009), parsing, grammar analysis, visualisation and tree-walking (Saeidi et al., 2013).
Data analysis is described in Ganti and Brayman (p.79) and Wu et al. (2005). Martens et al. (2018) see data migration as often being the most complex aspect of legacy migration, and make explicit mention of relationships between primary and secondary data. Methods include reverse-engineering and performance monitoring such as described in Jin et al. (2007). Modelling a legacy system’s data is not usually a trivial undertaking. There may be tens of thousands or even hundreds of thousands of data files and tables, with old copies and backups. The data structures might not be accessible to data modelling software, such as structures that are not self-describing, and include images, flat files, and VSAMs. The data might not be neatly organised, with data files scattered through production and test environments. Data at this scale will require excessive effort to model at physical, logical, and even conceptual levels.
Any analysis should also pay attention to storage and presentation. A large amount of meta-data and information will be collected and generated, and storage should facilitate secondary analysis, reuse and reporting. Presentation should be consistent and digestible by many different stakeholders, without obscuring important details and significant risks. Experience suggests that graphical techniques, such as models and visualisation, are particularly suited to helping audiences to perceive, interpret and comprehend the information they need (Kirk, 2016).
Case study background
This paper examines the analysis of a mainframe that had been the mainstay of information technology at a large organisation in the United Kingdom. Once servicing the needs of over one hundred thousand employees, the legacy system consisted of an IBM Z Series running over 280 known business applications. These applications had been written over a period of twenty years in various flavours of RPG, CPG, COBOL, JCL, REXX and Assembler, and comprised 9.2 million lines of code in 16,000 programs, supported by 600,000 lines of JCL in several thousand batch files, all reading data from 76,000 DB2 relational database tables and 87,000 VSAMs and other data files. The mainframe was operational and maintained, and business-critical changes were still being developed.
Evolution of technology, architectural decisions and business decisions had emphasised the need to decommission the mainframe for over a decade. The system was becoming increasingly unfeasible for several reasons:
A decommissioning programme was established with the objective of turning off the mainframe over a three-year period. Early discussions revealed a commonly held perception that the mainframe could simply be turned off, and this had an impact on some strategic choices. No-one was sure how much functionality could simply be purged, and in reality a substantial proportion of mainframe functionality was supporting daily operations. A considerable amount of data also had to be retained for business and legal reasons.
Historical efforts to document the mainframe had been limited. A list of applications had been compiled by subject matter experts in a workshop, identifying over a hundred applications. This list formed the basis of a soft information gathering exercise, requesting information from each of the technology teams about applications in their domain. Responses were limited for reasons that included lack of documentation, lack of knowledge, prioritisation of new work, resource availability, and disagreement about responsibility. It quickly became clear that much of the technical knowledge had dissipated, and developers who had previously played prominent roles in the mainframe's development struggled to remember the functionality. This was entirely reasonable, as in some cases they had last worked on the code a decade earlier. Soft modelling by a team of solutions architects met with a similar lack of success.
The author proposed a project to answer definitively the key question of what was on the mainframe, ideally generating information that would help determine which people to consult, the cost to decommission, the risks and benefits, and schedule estimates. On this basis senior management funded a three-month discovery exercise.
Analysis activities
The analysis adopted a loose action research methodology as this is "a form of enquiry that enables practitioners everywhere to investigate and evaluate their work" (McNiff and Whitehead, 2006, p.7). On commencing, it was not known what analytical methods would work, so a flexible methodology was needed.
The first two weeks focused on establishing the project, meeting the mainframe team, and identifying and collating available information. Work began on extracting inventories of the various system components; sources included menus, catalogues, programs, jobs, batches, files, and partial bulk meta-data from the home-grown library manager. This information was cleaned, sorted, augmented where necessary, and recorded in spreadsheets. An HTML 'dashboard' was created in the third week to publish this body of information (Figure 1).
Figure 1. Discovery dashboard
At this point it was clear to stakeholders that the mainframe was much larger than expected, and that there was insufficient documentation or subject matter expertise. Noting the poor outcomes of previous ‘soft’ approaches, it was decided to find a more robust technical approach. There were constraints. Manual analysis was immediately rejected due to scale and available resources. Automated approaches were problematic: the author did not have time to create AI or rule-based expert systems to parse seven languages, there was insufficient meta-data, the quality and presence of code comments were highly variable, there was no budget to buy tools, and even with money there was no time to locate, purchase, test and deploy any commercial tools that might be available.
Another approach occurred to the author, based on his experience with content analysis and data analysis. Lexical analysis is a primary function of a compiler, converting text (code) into meaningful tokens. Similar applications of this approach were found in the literature, though to a much more limited extent (Klammer and Pichler, 2014; Martens et al., 2018; Wu et al., 2005). Conceptually it appeared possible to model the entire mainframe at the component level by using text searches to find the names of other components within each component's code. These matches would represent functional dependencies, which could then be represented as a node-edge model as illustrated in Figure 2.
Figure 2. Component model with functional dependencies
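The idea of finding component names in code and recording them as edges can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's implementation: the lexicon, the source lines and the component names other than AAS10A01 and FROZ are hypothetical, and it tokenises each line and looks tokens up in the lexicon rather than running the brute-force SQL search used on the project.

```python
import re

# Hypothetical lexicon of known component names and their types.
LEXICON = {
    "AAS10A01": "parameter file",
    "PAYCALC": "program",
    "FROZ": "DB2 table",
}

# Hypothetical source lines, keyed by the component that contains them.
SOURCE = {
    "PAYJOB": [
        "//STEP1 EXEC PGM=PAYCALC",
        "//PARMS DD DSN=PROD.AAS10A01,DISP=SHR",
    ],
    "PAYCALC": ["    SELECT * FROM FROZ"],
}

# A token is a run of characters that may appear in a component name.
TOKEN = re.compile(r"[A-Z0-9#@$]+")


def find_dependencies(source, lexicon):
    """Return (from_component, to_component) edges found by token lookup."""
    edges = set()
    for component, lines in source.items():
        for line in lines:
            for token in TOKEN.findall(line.upper()):
                # A token that names another known component is a dependency.
                if token in lexicon and token != component:
                    edges.add((component, token))
    return edges


EDGES = find_dependencies(SOURCE, LEXICON)
for source_name, target_name in sorted(EDGES):
    print(f"{source_name} -> {target_name} ({LEXICON[target_name]})")
```

Each edge is one functional dependency, so the set of edges together with the lexicon is already a node-edge model. Matching whole tokens avoids some substring false positives, but it would miss names embedded in longer identifiers, which a literal text search would catch.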
The author obtained source code from the mainframe, reviewed the code to determine general feasibility, and developed and successfully tested a proof of concept. A mainframe expert then joined the team in the fourth week to provide language knowledge and support in extracting code and data. An inventory was taken of all known components, and the entire code base of the mainframe was extracted, including all batch and online programs, jobs, libraries, and schedules. This code was imported into a relational database, cleaned, and meta-data associated with each line of code. Routines were written in SQL to then perform a lexical analysis of every line of code for any literal reference to any of the 260,000 objects, identifying functional dependencies at the component level. For example, comparing the following line of code against every component on the mainframe found text matching the name of a parameter file AAS10A01:
Such a brute force approach would make 2.73 x 10^12 comparisons, so a tool was written to manage batching and multi-threading. The analysis results were rough, generating three million matches that manual inspection revealed to contain a significant number of false positives. This was partly due to a historical lack of naming conventions (a data table called FROZ), use of natural words for components (a VSAM called SORT), and duplicated component names (there were 10 instances of the DB2 table FROZ). A language-specific truth table was compiled, and SQL was again used to purge matches that violated syntax or logic. Some categories of false match were extensive but could be easily identified and eliminated, such as any match found in a code comment. Other matches required careful attention to syntax, punctuation, context and language. Several hundred SQL statements were eventually created.
The final stage of analysis involved manual cleansing at the application level. A tool was written to facilitate this, presenting the code and associated matches for each component (Figure 3). Components were then associated with an application based on name, menu options, and identified functional dependencies.
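The figure of 2.73 x 10^12 is consistent with roughly 10.5 million lines of code each being compared against 260,000 component names. The match-then-purge pattern can be sketched with SQLite from Python, under stated assumptions: the table and column names, the sample lines and the program names are hypothetical, and the two purge rules shown (a JCL comment begins with `//*`, a fixed-format COBOL comment has `*` in column 7) merely stand in for the several hundred language-specific statements used on the project. Batching and multi-threading are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE code_line (component TEXT, language TEXT, line_no INTEGER, content TEXT);
CREATE TABLE lexicon (name TEXT, kind TEXT);
CREATE TABLE raw_match (component TEXT, line_no INTEGER, target TEXT);
""")

# Hypothetical sample data: four lines of code and three component names.
conn.executemany("INSERT INTO code_line VALUES (?, ?, ?, ?)", [
    ("PAYJOB", "JCL", 1, "//STEP1 EXEC PGM=PAYCALC"),
    ("PAYJOB", "JCL", 2, "//* OLD STEP USED SORT AND FROZ"),
    ("PAYCALC", "COBOL", 1, "000100* READS FROZ FOR HISTORY"),
    ("PAYCALC", "COBOL", 2, "000200     EXEC SQL SELECT * FROM FROZ END-EXEC."),
])
conn.executemany("INSERT INTO lexicon VALUES (?, ?)", [
    ("PAYCALC", "program"), ("FROZ", "DB2 table"), ("SORT", "VSAM"),
])

# Brute force: every line is compared with every name in the lexicon.
COMPARISONS = conn.execute(
    "SELECT (SELECT COUNT(*) FROM code_line) * (SELECT COUNT(*) FROM lexicon)"
).fetchone()[0]

# Record a match wherever a component name appears literally in a line.
conn.execute("""
INSERT INTO raw_match
SELECT c.component, c.line_no, l.name
FROM code_line AS c CROSS JOIN lexicon AS l
WHERE instr(c.content, l.name) > 0 AND l.name <> c.component
""")
RAW_MATCHES = conn.execute("SELECT COUNT(*) FROM raw_match").fetchone()[0]

# Purge rule: discard any match that sits on a comment line.
conn.execute("""
DELETE FROM raw_match
WHERE EXISTS (
    SELECT 1 FROM code_line AS c
    WHERE c.component = raw_match.component
      AND c.line_no = raw_match.line_no
      AND ((c.language = 'JCL' AND substr(c.content, 1, 3) = '//*')
        OR (c.language = 'COBOL' AND substr(c.content, 7, 1) = '*'))
)
""")
REMAINING = conn.execute(
    "SELECT component, line_no, target FROM raw_match ORDER BY component, line_no"
).fetchall()
print(COMPARISONS, RAW_MATCHES, REMAINING)
```

In this toy data the comment rule removes the matches on SORT and on FROZ that appear only in comments, leaving the two genuine dependencies. Keeping the raw matches in their own table means each purge rule can be added, tested and re-run independently.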
Figure 3. Analysis viewer and validation tool
The technical analysis produced two core tables: a list of all components in the mainframe, and a list of functional dependencies between these components. These tables were useful and reusable, but were not an effective means of communicating with every audience. A graphic representation was needed to demonstrate the outputs, help validate the analysis, support the needs of several stakeholders, and even capture the imagination. Various visualisation tools were assessed, but budget and time again limited the choices. The author chose Gephi on the basis of convenience and output quality, and the node-edge diagrams it could produce were similar to the module dependency analysis described in Wu et al. (2005). Five small and medium-sized applications were modelled to demonstrate the efficacy of the technique, two of which are illustrated in Figures 4 and 5.
Figure 4. Small application functional dependency diagram
Figure 5. Functional dependencies for a medium-sized application
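The two core tables map naturally onto the node and edge lists that graph tools consume. The sketch below shows one possible export to CSV, assuming Gephi's spreadsheet import conventions of `Id` and `Label` columns for nodes and `Source`, `Target` and `Type` columns for edges. The component names and the `Kind` column are hypothetical, and the paper does not state how data was actually loaded into Gephi.

```python
import csv
import io

# Hypothetical extracts of the two core tables.
COMPONENTS = [("PAYJOB", "job"), ("PAYCALC", "program"), ("FROZ", "DB2 table")]
DEPENDENCIES = [("PAYJOB", "PAYCALC"), ("PAYCALC", "FROZ")]


def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Nodes: one row per component, with its type carried as an extra attribute.
NODES_CSV = to_csv(
    ["Id", "Label", "Kind"],
    [(name, name, kind) for name, kind in COMPONENTS],
)

# Edges: one directed row per functional dependency.
EDGES_CSV = to_csv(
    ["Source", "Target", "Type"],
    [(source, target, "Directed") for source, target in DEPENDENCIES],
)

print(NODES_CSV)
print(EDGES_CSV)
```

Carrying the component type as a node attribute allows nodes to be coloured or filtered by type in the visualisation, for example to separate programs from data tables.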
The technique was subsequently used to analyse the data structures and data flows of the mainframe in a 12-month project, aiming to model the data to a 99% confidence level.
Narman et al. (2011) offer some approaches to validating the effectiveness of legacy systems analysis, including the ability to service the needs of different stakeholders, re-use of the data produced, and impact of the analysis on operations. Sepulniece et al. (2015, p.34) briefly touched on the subject of evaluating legacy systems analysis, preferring to identify a problem and then compare performance before and after analysis. Measures they felt could be useful included bugs, development time, authoritativeness, stability, and execution time. They also noted there were few studies of large systems of over one million lines of code, and that modelling was not suitable in these situations because it “cannot be managed by a single business expert”.
Project objectives: This analysis met its aim of reliably identifying components and the functional dependencies between them. That information also helped clarify the scope of work, the functionality that was to be replaced, and prioritisation. Comments in code also helped identify subject matter experts who might still be at the company.
Efficiency: Compared with other techniques, this was authoritative, accurate and low-cost. No automated viable alternatives were found, and compared to previous manual and largely 'soft' analysis, this technique was found to be:
By way of illustration, once the matching was complete, a component inventory and functional dependencies for a small application comprising c.50 components could be validated in a couple of hours.
Feasibility: The technique did not require extensive knowledge to build a semi-automated tool, saving on extensive R&D time. The brute-force algorithms were slow to execute, but simple and quick to develop (and with hindsight could have been developed in memory rather than on disk). Using this technique, and with a little practice and support from the mainframe expert, a programmer with no knowledge of the specific languages or applications can accurately validate 98% of the automatically generated functional dependencies.
Different stakeholders could quickly comprehend the graphical visualisation. The models gave a good picture of scale and scope, informed stakeholders about key features, functionality and integration, cleared up assumptions, reduced reliance on opinions and patchy knowledge, and gave confidence. Indirectly they could also be used to identify other stakeholders, downstream users and data consumers.
Solutions to problems: The analysis delivered several solutions during the course of the project:
Unanticipated benefits: format and availability of analysis outputs were useful elsewhere:
Lessons and limitations were also identified:
Finally, this paper purposefully references 'old' research alongside more recent thinking. There was a significant volume of mainframe decommissioning in the 1990s when many large legacy systems were being replaced in favour of visual desktop environments and more distributed architectures. This knowledge was quite useful on this project, and the fundamental principles espoused by academics and practitioners 30 years ago appear just as relevant today. It is worth bearing in mind that current technologies such as the “cloud” will themselves need to be replaced one day. These systems should be developed today with a view to their eventual decommissioning (including the production of documentation, meta-data, and models), and old literature consulted when planning such large scale initiatives.
Analysing a mainframe is a complex and expensive activity. There are many approaches and many techniques available to a large organisation, but options will be constrained by the challenges of such large-scale endeavours. This paper reports on the analysis of a large legacy system. A robust approach was required, but time and resources were limited. A novel technique based on lexical analysis was devised, and this successfully identified the features and functional dependencies of a quarter of a million components.
This technique proved to be fast, accurate, scalable and very cost-effective. It provided detailed and useful information, assisted in solving problems, and was shown to be capable of supporting large scale legacy systems decommissioning. The analysis only produces models at the component level, so it should be considered as being most useful in the early stages of decommissioning.
