Paxata
Darshika Srivastava
Associate Project Manager @ HuQuo | MBA,Amity Business School
Paxata Review Summary
Paxata is a surprisingly easy-to-use data preparation platform that enables business users, analysts and others to access and prepare data for their own diverse needs. The increasing speed, diversity, complexity and volume of data mean that traditional approaches to data preparation are no longer adequate, and it is imperative that business users gain direct access to their data to meet pressing business needs.
The exploitation of data is a major preoccupation in most businesses. It allows new efficiencies to be realized, and more productive relationships with customers. The proliferation of new data sources only serves to amplify these possibilities, but these same data introduce new issues that need to be addressed. Complexity is the most important of these, and we need new technologies to deal with it. Our data no longer come in neatly organized rows and columns. Social data, web click stream data, operational logs, sensor data, text and various other types are stored in formats that are often difficult to process. Added to this, the velocity and volume of data require data preparation technologies that are compatible with these new demands. Since data preparation typically accounts for anywhere between fifty and eighty per cent of most analytical tasks, it is essential that new approaches are adopted if this fraction is not to increase further and delay the response to business opportunities.
Paxata is a pioneer in the development of technologies which give business users direct, self-service access to their data, while maintaining the well-governed environment necessary for regulatory and data integrity needs. Users are presented with a highly visual interface, in a familiar spreadsheet-type format, where the various properties of the data are color coded for pattern highlighting and data lineage. Paxata also automates many tedious and time-consuming activities using machine learning and algorithmic technologies. Users are able to ingest, profile, join, transform and clean massive volumes of highly varied data orders of magnitude faster than traditional approaches allow. And since the complexity, speed, diversity and volume of data are only going to increase, Paxata will prove to be an indispensable solution in the transformation of data into actionable insights.
Use Cases
Paxata has proven to be particularly popular in financial services, consumer goods, retail and public sector verticals, although it is used across a diverse set of industries. This illustrates very well how the platform satisfies both the need for rapid data access and preparation at scale and the need to meet regulatory requirements. One of the world’s largest banks uses Paxata for comprehensive capital analysis and review (a.k.a. ‘stress tests’), allowing it to detect data quality issues across thousands of data sets so that regulatory requirements can be met. A large consumer packaged goods company uses Paxata for end-to-end supply chain integration across manufacturing, distribution and retail stores to manage inventory levels and avoid stockouts. Paxata has also been used for tasks such as system migration, and is routinely used in many businesses to speedily prepare data for analytical tasks. In health care, Paxata has been used to create consolidated views of patient health records from several disjointed systems. Some of the larger users of Paxata might have dozens of data preparation projects active at any one time, such is the return these organizations get from clean, consolidated data. More generally, where there are high volumes of complex data, Paxata provides a platform for these data to be exploited as needed.
The User Interface
Ease of use combined with considerable functional sophistication is the hallmark of Paxata. This is a rare combination of qualities, since ease of use is usually sacrificed for comprehensiveness. The user interface exploits all aspects of visual productivity, including color coding, real-time feedback on transformations, graphical displays and many drag-and-drop type tools, without requiring the user to write code, scripts, or set up models or schemas before they start their preparation work. The primary interface is accessed through the web browser, but users can also access the platform through a REST API if desired.
All data preparation functionality for data integration, data quality, enrichment, governance and collaboration is accessible through a single workspace. This includes the Library, where data are catalogued for access; profiling, where the statistical characteristics of data are explored; transformation, where data can be manipulated; and IntelliFusion, which automatically suggests joins across data sets and data transformations. And no matter how data is transformed, Paxata always creates an audit trail of the steps executed, which can be undone if desired.
The Data Library
The Paxata Data Library is the interface to the world outside Paxata, allowing data to be ingested and exported. A large number of connectors are available, and Paxata understands contemporary formats such as JSON and Avro, as well as Hadoop, conventional relational databases, flat files and data originating from online business applications (e.g. Salesforce).
The result of data ingestion is a catalogue of data sources, addressing the fairly common phenomenon of organizations not knowing what data sources they actually possess. The library also acts as a repository for AnswerSets – the clean, curated data sets which are the results of the data preparation process. These can be viewed in a spreadsheet-type format. The Data Library promotes enterprise-wide collaboration and sharing of data sets, reducing spreadsheet chaos and the risk of data leakage, and accelerating time to analytics and decision-making.
Profiling
The statistical characteristics of data are automatically displayed as visual distribution histograms. These allow users to quickly see the overall shape of the data and whether there are outliers. The various charts are all linked, so changes in one feature are reflected in all connected charts.
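To make the idea of profiling concrete, here is a minimal sketch in Python of the kind of summary a profiling view conveys: an equal-width histogram plus a list of outliers. It is an illustration only, not Paxata's implementation. The function name `profile_column`, the five-bin default and the 1.5 x IQR outlier rule are all assumptions chosen for the example.

```python
from statistics import quantiles

def profile_column(values, bins=5):
    """Hypothetical helper: equal-width histogram counts and IQR-based outliers."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin `bins`, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # Quartiles from the standard library; values beyond 1.5 * IQR are flagged.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"min": lo, "max": hi, "histogram": counts, "outliers": outliers}

# Illustrative data: seven similar readings and one suspicious value.
profile = profile_column([10, 12, 11, 13, 12, 11, 10, 95])
print(profile["histogram"])  # [7, 0, 0, 0, 1]
print(profile["outliers"])   # [95]
```

A histogram with one nearly empty tail, as here, is exactly the shape that prompts a user to look more closely at a handful of records.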
Transformation
Paxata uses its machine learning engine to suggest joins between diverse structured and unstructured data sources, and transformations of features that may be relevant. For example, it will highlight several different spellings of a town or city, and with a single click a user can consolidate all relevant records under a single name. Annoying and potentially time-consuming problems such as leading and trailing spaces can also be dealt with automatically. Many data quality and transformation steps can be performed using this color-coded visual interface, but for very complex problems users can resort to writing regular expressions. Since increasing amounts of data are held in the hierarchical formats made possible by XML and JSON, Paxata will suggest flattened views and allow shaping on the fly.
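The two cleaning tasks mentioned above, trimming stray spaces and consolidating variant spellings, can be sketched in a few lines of Python. This is a rough stand-in that uses plain string similarity from the standard library, not Paxata's machine learning engine. The function names `normalize` and `consolidate` and the 0.8 similarity threshold are assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

def normalize(value):
    """Trim leading/trailing spaces and collapse internal runs of whitespace."""
    return " ".join(value.split())

def consolidate(values, threshold=0.8):
    """Hypothetical helper: map each value to the most frequent similar spelling."""
    cleaned = [normalize(v) for v in values]
    canonical = []  # spellings chosen to represent each cluster
    mapping = {}    # cleaned spelling -> canonical spelling
    # Visit spellings from most to least frequent so common forms become canonical.
    for spelling, _ in Counter(cleaned).most_common():
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, spelling.lower(), c.lower()).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(spelling)
            match = spelling
        mapping[spelling] = match
    return [mapping[v] for v in cleaned]

cities = ["  San Francisco", "San Francisco ", "San Fransisco",
          "san francisco", "Boston", "Bostn "]
print(consolidate(cities))
# ['San Francisco', 'San Francisco', 'San Francisco', 'San Francisco', 'Boston', 'Boston']
```

A fixed threshold like this will merge genuinely different names that happen to look alike, which is one reason an interactive tool shows the suggested clusters and leaves the final click to the user.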
If a series of steps needs to be repeated on a regular basis, Paxata supports the creation and scheduling of jobs. These can be initiated manually or according to a pre-defined schedule. Users are notified by email when a job has executed.
Governance
Since Paxata records all data manipulations, and creates versions at each step, it is possible to use the step editor to undo and redo work. Implicit in this process is the creation of an audit trail, which can be replayed as needed. This has obvious appeal in organizations where regulators may require audit trails of data manipulation, and it ensures that data can be effectively governed. Best of all, tracking is done without any work from the end user. Users benefit from seeing all the steps they took, but they don’t have to disrupt their workflow in order to document their processes for data lineage or governance purposes. This transparent governance capability is also valuable to data administrators and IT teams who want to understand how data is being prepared and used. With Paxata, they have that full context, along with user annotations and the ability to automate the project to run on future data sets.
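The relationship between versioned steps, undo and redo, and a replayable audit trail can be illustrated with a small Python sketch. It shows the general pattern under simplifying assumptions (row-by-row transformations held in memory) and is not a description of how Paxata stores its steps. The class `StepLog` and its method names are hypothetical.

```python
class StepLog:
    """Hypothetical step recorder: one data version per step, with undo/redo."""

    def __init__(self, data):
        self._versions = [list(data)]  # version 0 is the untouched input
        self._steps = []               # (description, function) per applied step
        self._redo = []                # steps that were undone and can be restored

    def apply(self, description, fn):
        """Apply fn to every row, keep the new version, and record the step."""
        self._versions.append([fn(row) for row in self._versions[-1]])
        self._steps.append((description, fn))
        self._redo.clear()  # a new step invalidates anything previously undone

    def undo(self):
        if self._steps:
            self._redo.append((self._steps.pop(), self._versions.pop()))

    def redo(self):
        if self._redo:
            step, version = self._redo.pop()
            self._steps.append(step)
            self._versions.append(version)

    @property
    def current(self):
        return self._versions[-1]

    def audit_trail(self):
        """Descriptions of the steps currently in effect, in order."""
        return [description for description, _ in self._steps]

    def replay(self, new_data):
        """Run the recorded steps against a different data set."""
        rows = list(new_data)
        for _, fn in self._steps:
            rows = [fn(row) for row in rows]
        return rows

log = StepLog([" a ", "B "])
log.apply("trim whitespace", str.strip)
log.apply("upper-case", str.upper)
log.undo()
print(log.current, log.audit_trail())  # ['a', 'B'] ['trim whitespace']
log.redo()
print(log.replay([" c"]))              # ['C']
```

Because the trail is recorded as a side effect of `apply`, the user does nothing extra to document their work, and `replay` is what makes it possible to rerun the same project on future data.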
Under the Hood
The architecture of Paxata is sophisticated, employing a Spark in-memory, parallel, pipelined, distributed processing platform, and a columnar database that supports rapid transformation and reporting on the data being processed. The unique intelligence built into Paxata originates from the patent-pending work of Dave Brewster (Co-Founder and CTO), who is an expert in semantic typing and matching. The associated algorithms run in the Spark environment and the resulting intelligence is sent to the columnar database, which allows data to be processed in a familiar spreadsheet format.
The architecture is such that there is no single point of failure, and the excellent performance of Paxata comes from very intelligent use of cached data. Even on very large (billions of rows) data sets, Paxata will display just enough data for meaning to be conveyed, avoiding the loading of large amounts of data into the user interface. This is a quick approach to data processing, showing what is needed when it is needed, but no more.
Positioning
The data preparation and management task has come sharply into focus over the last few years, as data complexity, diversity, speed and volumes have increased. Paxata is one of a small set of platforms that facilitate data preparation work, but it does distinguish itself in a number of ways. The most compelling of these is the ease of use and self-service at scale. In less than an hour a user will be able to navigate data and transform it in ways that would take weeks or months of technical effort using other technologies. Much effort has gone into the business analyst-friendly, self-service user interface, exploiting every visual format that can convey meaning. The whole experience is ‘guided’ with suggested transformations and the ever-present opportunity to reverse steps at the click of a mouse. Certainly, Paxata provides a user-friendly experience that is unique for this type of platform. While many suppliers are keen to claim business user functionality, very few actually deliver. Paxata does deliver, and lends significant advantage to the businesses that use it.