Re-framing Open Data

Open Data activities have had their ebbs and flows for well over 20 years, it seems. Every new decade provides a signpost or a marker to look towards the future and potentially chart a renewed path forward. Let's be clear: as a principle, data must move toward enfranchisement. (Yes, a three-dollar word to break from the potential baggage associated with open.) So as we look to make more data available for use, we'll look briefly at the past, touch upon technologies that have dramatically transformed the landscape, examine a few principles and explore how we might move forward.

Looking back

While some have pointed back to the late 1990s for the earliest Open Data efforts, activity really kicked into high gear around 2010, with Vancouver, Calgary, Edmonton, the federal government, Montreal, Ontario, Toronto and others all creating principles and supporting portals to share data. Despite over 10 years of activity, one could argue that Open Data efforts have not grown beyond ideation into innovation or commercialization. After 10 years, the Government of Canada Open Data portal (https://open.canada.ca/en) has curated a collection of more than 80,000 open data and information assets while listing only 93 apps: 69 developed by government and 29 by the public. A quick survey reveals broken links and many apps stuck before version 2. There are several factors that could be to blame:

1.      Hackfests don't build version 2 – Bringing teams together for special events or after hours is great for brainstorming interesting apps, but enthusiasm can wane when people get back to their normal day to day. This often leaves projects incomplete or abandoned.

2.      Dearth of skills – The Open Data movement is renowned for its multidisciplinary community, bringing together civil society, political leaders, technology enthusiasts and change makers. Matchmaking between ideas people and tech people to deliver end-user solutions can be challenging over the long term.

3.      Lack of business models – Invention and ideation are, perhaps, the easiest parts of any Open Data project. The challenge is often determining the minimum viable product and business models which form the cornerstone of innovation. While Open Data efforts share this challenge with startup accelerators and other innovation catalysts, it is a significant drag on Open Data efforts.

4.      Minimal reuse – Some applications have risen to prominence in Open Data efforts, surpassing the functionality found in existing commercial applications. Unfortunately, when organizations have looked to adopt these community-driven applications, they have been unable to do so because of concerns around the supportability and longevity of community-developed applications.

5.      Low value datasets – Perhaps it's the elephant in the room, but maybe the datasets are just not that interesting. Certainly the argument can be made that the "SURVEY OF LOT41 AND LOT42 WITHIN THEORETICAL NE1/4 SECTION24 TP61 R13 W4M" dataset from 2017 (https://open.canada.ca/data/en/dataset/ed7586ab-7040-4081-a928-4cafd1a56c51) might not drive great interest from a generalized developer community. As events associated with the response to COVID-19 unfold, constituents are looking for frequent, accurate updates on the efforts to respond.

Technology advances

The ten years of technology development following the advent of Open Data have been almost as transformative as the ten years prior. The decade from 2000 to 2010 really cemented the use and development of the Internet. Hyperscale cloud services emerged in 2009, with machine learning, the Internet of Things, chat-based collaboration, low-code application platforms and others now changing the way we deal with data. Controls and safeguards for data have evolved as well. New tools and techniques help remove some of the policy friction previously experienced in the traditional Open Data world.

1.      Differential privacy – Wikipedia describes differential privacy as "a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records whose information is in the database." Fundamentally, it provides a mechanism for sharing data while respecting privacy (a minimal sketch follows this list).

2.      Deidentification techniques (e.g. perturbation) – De-identification of datasets, such as the work of Dr. Khaled El Emam as Canada Research Chair for Health Privacy (https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf), provides perturbation techniques to deidentify data to support sharing and reuse (see the second sketch after this list).

3.      Machine Learning, Deep Learning, Artificial Intelligence – ML, DL and AI have dramatically increased the focus on the importance of data sharing. They have fundamentally altered the landscape of open data since each requires data to be most effective. These tools have also changed the economic dynamics of data. While data has value, AI models and ML Algorithms which harness data also have value. This value proposition has a significant impact on the open data movement.

4.      Derived synthetic data – Machine learning and deep learning support the creation of a new category of derived synthetic data (https://www.oreilly.com/library/view/practical-synthetic-data/9781492072737/). The term derived synthetic data is used to differentiate it from the virtual worlds used to train automated vehicles; those virtual communities may only exist in the minds of their creators. Derived synthetic data takes real-world data and derives wholly synthetic data from that real data set. For example, by using machine learning, a completely fictitious health information database can be generated from an original containing personally identifying information. Because they share the characteristics of the original dataset while being completely computer generated, these derived synthetic datasets provide considerable value when exploring sharing (a toy generator appears after this list).

5.      Homomorphic encryption – Homomorphic encryption allows computations and decision making to be performed directly on encrypted information. By keeping sensitive data obfuscated while still permitting analysis of it, homomorphic encryption provides new mechanisms for sharing data (an illustrative sketch follows this list).

6.      Data watermarking – Data sharing requires a degree of risk acceptance because it may be difficult to maintain the chain of custody from data originator to consumer. New techniques for data marking provide the ability to mark data at the "cell level" to help with accountability as well as data origination (a toy example appears after this list). Data watermarking can also assist with enforcement of permissions associated with datasets. For example, CCTV footage might be watermarked for a privacy retention policy.

7.      Big Data, Little Data – With the deluge of data from sensors and other resources, Big Data has dominated the discussion of data sharing. "The more data the better" has been a frequent call. While lots of data is great for some applications, it can be equally useful to determine the minimum data needed to arrive at a meaningful solution. There is considerable research into how to use little data to its best impact.
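
To make the differential privacy description in item 1 concrete, here is a minimal Python sketch of the classic Laplace mechanism applied to a count query. The dataset, epsilon value and query are purely hypothetical, and a production system would use a vetted library rather than hand-rolled noise.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a differentially private count of records matching a predicate.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical example: a differentially private count of people over 65
# in a fictitious dataset, with a privacy budget of epsilon = 0.5.
people = [{"age": a} for a in (23, 67, 71, 45, 80, 52, 69)]
print(dp_count(people, lambda r: r["age"] > 65, epsilon=0.5))
```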
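
Item 2's perturbation idea can be sketched just as simply: generalize the quasi-identifiers and add small random noise to sensitive numeric values. The record fields and parameters below are invented for illustration; real de-identification work follows a formal re-identification risk assessment such as the IPC guidance cited above.

```python
import random

def deidentify(record: dict) -> dict:
    """Return a copy of the record with quasi-identifiers generalized
    and a sensitive numeric value perturbed with small random noise."""
    out = dict(record)
    # Generalize: exact age becomes a 10-year band.
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    # Generalize: keep only the first three characters of the postal code.
    out["postal_code"] = record["postal_code"][:3]
    # Perturb: add uniform noise of +/- 5% to the sensitive value.
    out["income"] = round(record["income"] * random.uniform(0.95, 1.05), -2)
    # Drop direct identifiers entirely.
    out.pop("name", None)
    return out

patient = {"name": "A. Example", "age": 47, "postal_code": "K1A0B1", "income": 82450}
print(deidentify(patient))
```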
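
For item 4, the toy generator below fits simple per-column statistics from a hypothetical source table and samples entirely fictitious rows from them. It deliberately ignores cross-column correlations, which is precisely what the machine learning approaches referenced above are designed to capture; it is meant only to show the shape of the idea.

```python
import random
from collections import Counter

def fit_column(values):
    """Fit a very simple per-column model: mean/stddev for numbers,
    value frequencies for categories."""
    if all(isinstance(v, (int, float)) for v in values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return ("numeric", mean, var ** 0.5)
    return ("categorical", Counter(values))

def sample_column(model):
    if model[0] == "numeric":
        _, mean, std = model
        return round(random.gauss(mean, std), 1)
    _, counts = model
    values, weights = zip(*counts.items())
    return random.choices(values, weights=weights)[0]

def synthesize(rows, n):
    """Generate n wholly synthetic rows that mimic per-column statistics."""
    columns = {k: fit_column([r[k] for r in rows]) for k in rows[0]}
    return [{k: sample_column(m) for k, m in columns.items()} for _ in range(n)]

# Hypothetical source data containing personal information.
real = [
    {"age": 34, "sex": "F", "systolic_bp": 118},
    {"age": 61, "sex": "M", "systolic_bp": 141},
    {"age": 47, "sex": "F", "systolic_bp": 129},
]
print(synthesize(real, 5))  # fictitious records that only mimic the original
```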
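
Item 5 can be illustrated with an additively homomorphic scheme. The sketch assumes the open-source python-paillier (phe) package: a data custodian encrypts salary figures, an untrusted analyst computes the encrypted total and average without seeing any individual value, and only the key holder can decrypt the results.

```python
# pip install phe  (python-paillier, an additively homomorphic scheme)
from phe import paillier

# The data custodian generates a key pair and encrypts sensitive values.
public_key, private_key = paillier.generate_paillier_keypair()
salaries = [64000, 71500, 58250]
encrypted = [public_key.encrypt(s) for s in salaries]

# An untrusted analyst can sum the encrypted values and scale them
# without ever seeing the underlying numbers.
encrypted_total = sum(encrypted[1:], encrypted[0])
encrypted_average = encrypted_total * (1 / len(salaries))

# Only the custodian, holding the private key, can decrypt the results.
print(private_key.decrypt(encrypted_total))    # 193750
print(private_key.decrypt(encrypted_average))  # ~64583.33
```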
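
Finally, for item 6, the toy below marks data at the cell level in the spirit of classic relational watermarking schemes: a keyed hash deterministically selects a fraction of rows and nudges the least significant digit of one column, embedding a recipient-specific fingerprint that can later be detected in a leaked copy. The field names, secret and parameters are all illustrative.

```python
import hashlib
import hmac

def _keyed_hash(secret: str, *parts) -> int:
    msg = "|".join(str(p) for p in parts).encode()
    return int.from_bytes(hmac.new(secret.encode(), msg, hashlib.sha256).digest()[:8], "big")

def watermark(rows, key_field, mark_field, secret, fraction=8):
    """Return a copy of rows with roughly 1/fraction of mark_field cells nudged
    in their least significant digit, selected by a keyed hash so the same
    (secret, recipient) pair always marks the same cells."""
    marked = []
    for row in rows:
        row = dict(row)
        h = _keyed_hash(secret, row[key_field])
        if h % fraction == 0:                      # select ~1/fraction of the rows
            bit = (h >> 3) & 1                     # pseudorandom 0/1 derived from the hash
            row[mark_field] = (row[mark_field] // 10) * 10 + bit  # encode it in the last digit
        marked.append(row)
    return marked

def detect(rows, key_field, mark_field, secret, fraction=8):
    """Count how many of the cells we would have marked still carry the mark."""
    hits = total = 0
    for row in rows:
        h = _keyed_hash(secret, row[key_field])
        if h % fraction == 0:
            total += 1
            hits += int((row[mark_field] % 10) == ((h >> 3) & 1))
    return hits, total

# Hypothetical traffic counts shared with a specific recipient.
traffic = [{"sensor_id": i, "vehicle_count": 1000 + i * 7} for i in range(40)]
shared_copy = watermark(traffic, "sensor_id", "vehicle_count", secret="recipient-acme-2020")
print(detect(shared_copy, "sensor_id", "vehicle_count", secret="recipient-acme-2020"))
```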

Faulty Principles

The original philosophy for Open Data, that data should be unleashed or unlocked, remains, but over the years it has collected additional baggage from an extended global community. Some of this baggage comes from the Open Source community. Some of this baggage comes from government sunlight efforts. While all are well-meaning, it is important to distinguish Open Data from other efforts. Reinforcing the previous assertion, data should have a predisposition to be available. That said, there are a few misperceptions to be corrected and principles to be asserted.

1.      Data is not Oil – The conference circuit is filled with presentations describing data as the new oil. While the metaphor may hold for a small set of examples, it is dangerous to embrace this concept. Oil is a finite resource and once it is depleted, more cannot be made. Data is quite different, since there is always a new data stream that can be tapped or created. Stored oil maintains its value; stored data gradually loses value, since it may no longer be relevant or may be locked in formats that don't provide value.

2.      Open Data is only for governments – The Open Data movement had its start in government with the sense that opening data would provide greater transparency of government activities by leveraging assets already paid for by constituents. In our ever more interconnected world, there are an increasing number of interactions between the public sector and the private sector. Consider the construction of a high-rise office tower. Companies like PCL Construction are leveraging data to create digital twins of their physical sites (https://jsi.pcl.com/) and are constructing smart buildings. Individual smart buildings could connect to a broader smart city infrastructure so that municipal resources (energy, water, waste, traffic) could be more dynamically managed.

3.      Data needs to be highly accurate – With an increasing number of conversations in organizations looking to leverage machine learning for their business, there can be a belief that data needs to be error free and highly precise. Machine learning can work well with all types of data as long as the characteristics of the data are properly published. Microsoft Research's work on Datasheets for Datasets (https://www.microsoft.com/en-us/research/publication/datasheets-for-datasets/) helps qualify the utility of data for a particular usage (an abbreviated example appears after this list). By describing the characteristics of the data, ML developers can provide the appropriate guidance for the individuals accountable for the ultimate decision making.

4.      All data should be free – The Open Government Data movement has focused on sharing government data freely with the premise that constituents have already paid for the data and therefore already own it. As we look to broaden the conversation on data sharing, we need to engage the extended community to find opportunities to contribute to the broader collective good. Organizations should be able to monetize the data they've paid to have collected and curated. The value of the data may differ due to a few characteristics:

a.      Temporal value of data – Fresh data may have greater value than stale data. Data from 2017 may only be valuable in very narrow situations, and as a result, will not be overly useful for today’s applications.

b.      Higher resolution is more valuable than lower resolution – With an increasing variety of tools to capture the physical world virtually, there is increased value in highly accurate, high-resolution data. Take, for example, mapping data. For some applications, one-meter accuracy works well (say, traffic applications) but for others, like the construction of tidal energy generation facilities, one-foot accuracy may be needed.

c.      Streaming data – Streaming data has different value than static data.

5.      Shared data interests – The perception remains that industry segments will be happy to find opportunities to leverage Open Data for their shared interests. In working with Canada's Innovation Superclusters (Digital Tech, Proteins, Manufacturing, AI-enabled supply chain and Oceans), it has been deceptively difficult to get sector participants to work together for the common good. The Oceans supercluster is the furthest ahead in its philosophy and thinking when it comes to data sharing across the sector in areas such as environmental characterization, but finding the business model to bring together energy exploration, transportation, aquaculture and environmental sustainability on a common platform is a challenge.

6.      Data is static – Many Open Data datasets are static and do not take the next step to streaming and active datasets. Closed-circuit television, traffic data, waterflow data and others change our perspective on how data is provided and consumed. Streamed data, like CCTV footage, raises questions around linking of data, retirement of data due to privacy considerations, consent and more. In today's data-centric world, more emphasis must be placed on alternative data sources and cadences.

7.      All data sharing has privacy implications – Given the significant impact privacy professionals have had on local smart city activities, a perception has arisen that all smart city efforts have a strong privacy element. While the media portrays smart city efforts as "big bang" efforts, the reality is that most begin with humble initiatives. Smart escalators, intelligent elevators and knowledgeable streetlights that make up connected communities are the actual start of smart cities. These efforts don't require personal information to be helpful to people.

8.      Build it and they will come – Open Data policies emphasize publication of data with a sense that if data is published, people will flock to consume it. In practice, many datasets remain unseen or untouched, seemingly lost in the noise.

9.      Consistent use is not defined consistently - Consistent use of data has been a fundamental principle in privacy regulation. While it is a strong control to prevent sharing across business lines, it can constrain uses of data that are in the best interest of the data subject. For example, early projects using ML for billing accuracy in healthcare have revealed the ability to forewarn of, or diagnose, impending chronic disease. This reuse of data could be seen as running afoul of consistent use.

10.  "Winner Take All" – Organizations have curated datasets for ages, even prior to the Library at Alexandria. In the data environment, some have suggested that we currently find ourselves in a "Winner-Take-All" scenario, where the giants with the data are impossible to displace in the marketplace. In response, some have proposed mandatory data publication rules, obliging publication after, say, 10 years. Even if privacy rules permitted, it is here that the cracks start to appear in the "Winner-Take-All" principle. Imagine having data from CompuServe. While interesting from a historical perspective, data from the era of CompuServe would not reflect the habits and interests of today's internet user. The notion of a single, master repository of data is often a mental construct in the "Winner-Take-All" scenario. In reality, data is distributed across a variety of systems and organizational units, with governance to manage between them. Policy makers have, at times, not recognized the separations between business units, business models and client communities. Innovation and disruption may also occur through the creative curation of data from across a variety of data sources. This is already occurring in machine learning scenarios, where the results of an initial analysis yield new business outcomes in an unrelated area. For example, after applying ML to increase billing accuracy, researchers determined that the datasets could be used to predict chronic illness in patients.

11.  Data theft is the only security consideration – While there is considerable focus on the privacy considerations associated with data sharing, more emphasis on potential misuses is required. In the age of ML, there is an increased need to consider the potential adversarial aspects of data sharing. Business and cybersecurity threats such as model theft, model poisoning, decision manipulation and more must be considered.
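
As a concrete illustration of the Datasheets for Datasets idea raised in point 3 above, here is a hypothetical, heavily abbreviated datasheet expressed as a simple Python structure. The section names loosely follow the Microsoft Research paper; the dataset and every value in it are invented.

```python
# A hypothetical, abbreviated machine-readable datasheet for an open dataset.
datasheet = {
    "dataset": "city-traffic-counts-2019 (fictitious)",
    "motivation": "Published to support transit planning and congestion research.",
    "composition": {
        "instances": "Hourly vehicle counts per intersection",
        "rows": 410_000,
        "contains_personal_information": False,
    },
    "collection_process": "Inductive loop sensors; roughly 4% of hours missing due to outages.",
    "known_limitations": [
        "Sensor drift in winter months inflates counts by an estimated 2-5%.",
        "Construction detours in Q3 shift traffic between adjacent intersections.",
    ],
    "recommended_uses": ["aggregate trend analysis", "model training with noise-aware methods"],
    "discouraged_uses": ["per-vehicle or per-person inference"],
    "maintenance": {"update_cadence": "monthly", "contact": "opendata@example.city"},
}

# Print the datasheet so a developer can judge fitness for their use case.
for key, value in datasheet.items():
    print(f"{key}: {value}")
```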

Governance

In its broadest sense, governance refers to the way rules, norms and actions are structured, sustained, regulated and held accountable (https://en.wikipedia.org/wiki/Governance). Frequently, when organizations look to make data available they are confronted with questions about the rules and actions associated with the data. Many organizations do not have robust internal data governance processes and as a result find broad data sharing challenging. Even the relatively straightforward activity of moving from internal IT to cloud computing environments can be deceptively difficult where organizations don't have the policies, rules and processes for their data. In considering the evolution of governance for Open Data, it's important that the economics of data, the development of communities and the market dynamics are explored.

1.      Economics of data – It's important to reconsider the economics of data: data generation, data competition, data sharing, data consumption and more. The International Monetary Fund published a policy whitepaper, The Economics and Implications of Data: An Integrated Perspective (https://www.imf.org/en/Publications/Departmental-Papers-Policy-Papers/Issues/2019/09/20/The-Economics-and-Implications-of-Data-An-Integrated-Perspective-48596), which seeks to assess the implications of data for macroeconomic growth, equity and stability. While the IMF looks at the macroeconomic scale, there are other projects which seek to work at the level of individuals. Microsoft's Data Dignity project seeks to empower individuals to participate in a marketplace where they can monetize their data, content and expertise.

2.      The rise of data sharing communities (data trusts, data collectives, data commons, etc.) - Communities of mutual interest have been mainstays since the earliest days of data sharing. Rules and guard rails have been fundamental to the well-functioning of these communities. For example, the Financial Consumer Agency of Canada (https://www.canada.ca/en/financial-consumer-agency.html) established rules for the sharing of credit information across banking, retail and finance communities. In an effort to provide governance across data usage, a variety of data governance structures have arisen: Data Trusts (https://www.cigionline.org/articles/what-data-trust), Data Collectives (https://yycdatacollective.ucalgary.ca/), Data Commons (https://nlmdirector.nlm.nih.gov/2018/04/24/what-makes-a-data-commons-work/), Data Communities (https://data.world/community/open-community/), Data Collaboratives (https://datacollaboratives.org/) and more. In many cases these groups look across a variety of disparate use cases and datasets. These broad use cases across a wide variety of datasets have created situations where meaningful dialogue and collaboration collapse under generalizations about the potential data, its uses and the participants in the transactions.

3.      Government authorities – As one would expect, there are a variety of government regulators that have mandates for data, both internal and external to government. In addition to the sector specific data governance structures (Health, Education, Finance, etc.) there are authorities that have broad data mandates across communities. Information Commissioners support accountability and transparency of government “institutions in order to promote an open and democratic society and to enable public debate on the conduct of those institutions.” Privacy commissioners across Canada act independently of government to uphold and protect privacy rights of individuals in their jurisdiction. Several privacy commissioners are looking to an increased role in the data environment, moving beyond privacy to artificial intelligence ethics.

4.      International actions – In Microsoft's The Future Computed (https://aka.ms/futurecomputed), it is proposed that "the companies and countries that will fare best in the AI era will be those that embrace these changes rapidly and effectively." Nations are racing against each other to establish robust data environments to help drive innovation in AI. While the state ownership of health data in China is perhaps the most frequently cited example (https://english.www.gov.cn/policies/latest_releases/2016/06/24/content_281475379018156.htm), with Russia a close second (https://jsis.washington.edu/news/russian-data-localization-enriching-security-economy/), data localization requirements abound worldwide (https://itif.org/publications/2017/05/01/cross-border-data-flows-where-are-barriers-and-what-do-they-cost).

Going forward

With all this as a background, what are some of the building blocks for success for data enfranchisement in the coming decade? These building blocks include broad principles as described in national objectives or charters as well as specific activities to drive action.

1.      Skills – Skills development will always top the list of actions to embrace changes in the marketplace. Canada's Digital Charter calls out skills development (https://www.ic.gc.ca/eic/site/062.nsf/eng/h_00109.html#s1) across the many stakeholders in the ecosystem. Reorienting the perceptions of Open Data to drive both greater data sharing and consumption of the data will require significant skills development. Consistent with other ideation occurring across Canada, many projects remain at the Proof of Concept (PoC) level without a clearly defined Minimum Viable Product (MVP) or business model. Lean and other design skills could also assist in data sharing efforts.

2.      Measurement - Some people probably cringed when, earlier in this whitepaper, I noted the number of assets shared by the federal government and the number of applications built to use them. Certainly, raw numbers of available datasets and applications do not provide a sense of the impact of the work. Consideration should be given to metrics such as primary and secondary use, impact on individuals and business, commercial value, societal impact, and service metrics (speed to service, reduction in access to information requests, reduction in call center interactions, etc.).

3.      Apps and APIs – Simply publishing data doesn't unlock the value of the data for constituents. My favourite example of Open Data originates at the University Health Network, which has launched the Patient Health Portal (https://www.uhn.ca/corporate/News/Pages/myUHN_patient_portal_patient_and_doctor_tell_how_made_difference.aspx). The client quote seals its importance: "It has empowered me, and by extension my family, to become partners in my care." My sense is that this embodies the spirit of Open Data: providing data for access by those who require it. As Open Data moves to the next decade, emphasis should be placed on the constituent's use of the information rather than simple publication of datasets. Applications and APIs to drive greater access to meaningful constituent scenarios should be considered over publication of static datasets (a minimal API sketch appears after this list).

4.      Put the constituent at the center - Consistent with the UHN example, Open Data efforts should put the end user at the center. Simply publishing data requires effort by constituents to bring the data to life. What if individuals could get a 360-degree view of their entitlements? What if businesses could appreciate the programs that relate to their business? What if a platform could be built across a community to support themes of mutual interest?

5.      Define communities of interest - The broad sharing of data opens opportunity for its use in novel applications not originally envisaged by the data custodians. However, application developers may find it difficult to arrive at an end solution from the data alone. Consistent with putting the constituent at the center, communities of interest built around enabling new service scenarios could help establish governance, schema, and business models for data sharing. These service scenarios would assist in driving additional data sharing, leading to the novel application of data to new solutions.

6.      Ruthless scoping of initiatives – It is important to consider potential end states for broad data sharing initiatives so that potential unintended consequences can be identified and addressed. Care must be taken to ensure that any envisioning effort appropriately assigns potential risks to the specific scenarios. For example, privacy has become a polarizing theme in smart city conversations despite much of the data in smart city scenarios not being personally identifying. To facilitate the sharing of data in the immediate term, initiatives should be scoped narrowly to allow for specificity on the data to be shared, its characteristics and its associated rules. Ruthless scoping also assists in reducing misinterpretations between stakeholders.

7.      Incremental innovation – Many data sharing efforts begin with elaborate concepts of the future potential state. Given the complexity of the communities of interest (e.g. agriculture, logistics, oceans, smart cities, etc.), the development of master-planned approaches to sharing data across all component uses and business cases is fraught. While there are select master-planned efforts in smart cities and other domains, in practice the digital transformation is occurring almost organically, with individual systems gradually connecting to larger systems. This incremental innovation provides opportunity for thoughtful and deliberate assessments of the impact of connecting disparate systems, from technology, policy and business perspectives. It also reduces the potential risks associated with unintended release and use of data.

8.      Start with nonthreatening services – The almost constant parade of privacy and security stories in the media has created an environment where informed analysis and debate about data has been superseded by sound bites and posturing. Practical applications can quickly become bogged down in philosophical debates where data sharing could be considered for more sensitive scenarios (health, safety, etc.). Open data sharing efforts should look to existing communities where information sharing has been established (e.g. the North American Electric Reliability Corporation (NERC) https://www.nerc.com/pa/Stand/Pages/default.aspx, Free and Secure Trade (FAST) https://cbsa-asfc.gc.ca/prog/fast-expres/menu-eng.html, etc.). These existing governance structures and sharing arrangements could help establish the foundation for broader data sharing within and outside communities of interest.
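
To ground the Apps and APIs point (item 3 above), here is a minimal, hypothetical sketch of exposing a published dataset through a queryable API rather than a static download, using the Flask web framework. The dataset, fields and endpoint paths are invented for illustration only.

```python
# pip install flask
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical curated dataset; in practice this would be backed by a database.
FACILITIES = [
    {"id": 1, "name": "Riverside Pool", "type": "recreation", "open": True},
    {"id": 2, "name": "Main Street Clinic", "type": "health", "open": False},
]

@app.route("/api/facilities")
def list_facilities():
    """Return facilities, optionally filtered by type, so constituents and
    app developers can query live data instead of downloading a static dump."""
    facility_type = request.args.get("type")
    rows = [f for f in FACILITIES if facility_type in (None, f["type"])]
    return jsonify(rows)

@app.route("/api/facilities/<int:facility_id>")
def get_facility(facility_id):
    # Look up a single record by its identifier.
    match = next((f for f in FACILITIES if f["id"] == facility_id), None)
    return (jsonify(match), 200) if match else (jsonify({"error": "not found"}), 404)

if __name__ == "__main__":
    app.run(port=8080)
```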

Conclusion

As we look to a new decade of Open Data, it is important that we pivot our efforts to seize the opportunities that readily available datasets provide. Through the use of modern technological tools, data can be safeguarded in new ways that help remove policy obstacles to sharing while protecting privacy and security. By refocusing data sharing efforts on constituent-focused outcomes with discrete stages, progress can be made in addressing compliance considerations while minimizing the churn which might result from over-generalization and ambiguity of solutions.
