Whose Linefeed Is It, Anyway? Data sources matter...

[Header image: in a hefty dose of irony, this image was generated by an AI tool.]


Me: Hey AI chatbot, discuss ethical use of public and private data.

AI ChatBot: Don't ask, don't tell.

Me: [ PTSD ]...

The past few weeks have seen an absolute firestorm of discussion about artificial intelligence/machine learning/buzzword-of-the-week and the ethics of using data from various sources. This is going to get very messy very quickly, as current laws regarding data privacy, copyrights and the impact of AI upon traditionally-human jobs did not have anything like modern AI capability in mind. I'm not going to go into detail here about the problem, as I don't fully understand it (hint: nobody does) and it's such a tangled mess that any opinion I have is guaranteed to be uninformed.

That said, most of the debate is about the use of publicly-available data that can be scraped from websites. But what about private data?

What about your customers' data? Can you use that? Should you use that?

One of the biggest problems in developing and testing products that are based upon complex analytics, or for that matter any data-driven application or service, is where you get the data to develop/test against. Using real customer data for development/test brings up all kinds of problems.

Problems? What Problems?

Glad you asked.

  1. Allowing engineers access to private data, even that of your own customers, is a significant security risk. It greatly increases the chance of a leak or breach, because engineers tend to be more cavalier about data handling and more likely to be processing it with half-broken software behind half-broken safeguards. Believe me, I know... for 25 years as a QA engineer I did exactly that, and for the last 8 years as a DevOps/SecOps lead I've watched engineers do exactly that (and pleaded with them not to do what I myself did for the previously-mentioned 25 years... hypocrisy much?). They are literally processing that data with buggy, unreleased code... that's what your engineers want the data for.
  2. Development/testing is likely not an allowed usage of your customers' data per the contract you have with them. At my company I am part of the team that answers customers' pre-sales due diligence questionnaires (fun! not...), and more often than not there is a question directly asking whether engineers can see customer-provided data. Customers don't like it. And then they won't like you.
  3. There's these things called laws, especially in Europe and some local jurisdictions like California, that have a lot to say about customer data handling. If your customer exercises their legal right to have their data "forgotten", how are you going to excise it from your existing test data sets and AI training pools? How are you going to, under oath and penalty of perjury, attest that it's all gone? It could very well come to that, given the current legal and social environment. No obligatory end-of-paragraph joke here, as nobody's gonna be laughing when the gavel falls.
  4. Inclusion in AI training models has a significant risk of exposure of the private data that it was trained upon. Customer "A" is not going to be happy if they find that mention of information that was gleaned from their private data is inadvertently exposed to Rival Customer "B" through your product via your shared AI model. At that point, you'd better be practiced at saying "Yes, your Honor", and realize that the Fifth Amendment isn't generally applicable in civil cases.
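To make point 3 concrete, here's a minimal sketch of what a "forget this customer" purge of a test data set might look like. The schema, field names, and `forget_customer` helper are all hypothetical illustrations, not anything from a real product; a real erasure would also have to reach backups, caches, and any model trained on the data.

```python
# Hypothetical sketch: purging every record for one customer from a test
# data set. The "customer_id" field and record shape are assumptions.

test_records = [
    {"customer_id": "A", "payload": "order #1"},
    {"customer_id": "B", "payload": "order #2"},
    {"customer_id": "A", "payload": "order #3"},
]

def forget_customer(records, customer_id):
    """Return a copy of the data set with every record for customer_id removed."""
    return [r for r in records if r["customer_id"] != customer_id]

purged = forget_customer(test_records, "A")
print(len(purged))                                    # 1
print(any(r["customer_id"] == "A" for r in purged))   # False
```

The filtering itself is trivial; the hard part (and the legal exposure) is proving that no copy survives anywhere else, which is exactly why training pools built on real data are so dangerous.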

So why use real customer private data?

Why Do We Even Do This?

It's all about real data being seen as the only trustworthy source for testing and development. The whole point of training AI on real data is that what's in real data isn't well-understood until you look exhaustively at it. This is a shortcoming in understanding your own product and how it's used, and AI training is a shortcut to gaining that understanding. This is all very rational, until you consider the risks above. Every one of those risks is an existential threat to a company if the risk becomes reality, and with the way things are going legally/socially, could even impact the individuals involved. That means "piercing the corporate veil" and "This is your cellmate. Meet Bubba."

Customer data pools, in addition to revealing use cases and data contents your development team never dreamed of, also tend to be very large as they are being organically fed by however many tens of thousands of users or millions of transactions your customer processes every day. Everyone who has ever been involved in the engineering of a data-driven application knows how incredibly difficult and expensive it can be to stress-test and performance-test an application at scale in the lab. However, there's all these huge, fully-applicable data sets sitting right there, already paid for by your customers, just begging to be used...

I've coined a pithy little saying that I'll hold to right here. "Every time an engineer or Ops person says 'Dammit!' or 'Why the hell hasn't someone come up with a way to...' there is an opportunity for a whole new product." Synthetic AI training sets and data pool generators which are capable of replicating real-world customer data, at scale, but without risk of real data breaches or running afoul of contracts and laws, are going to be crucial to the development/testing of next-generation AI and analytics and data-pump applications.
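To illustrate the idea (and only the idea), here's a toy sketch of a synthetic-record generator. Everything here is a made-up example schema, not anyone's actual product or API: real synthetic-data tools model the statistics of actual customer data, while this just shows the shape of the approach, with the same schema as a hypothetical production feed but no real customer values, generated at whatever scale the test needs.

```python
import random

random.seed(42)  # reproducible test fixtures

REGIONS = ["us-east", "us-west", "eu-central"]  # assumed example values

def synthetic_transaction(txn_id):
    """Build one fabricated transaction record; no real customer data involved."""
    return {
        "txn_id": txn_id,
        "customer_id": f"CUST-{random.randint(1000, 9999)}",  # fabricated ID
        "amount_cents": random.randint(100, 500_000),          # plausible range
        "region": random.choice(REGIONS),
    }

# Generate a large stress-test pool without touching production data.
pool = [synthetic_transaction(i) for i in range(100_000)]
print(len(pool))  # 100000
```

The point is that scale is cheap once the generator exists; the genuinely hard problem, and the product opportunity, is making the synthetic distribution faithful enough to the real one to be trustworthy for testing and training.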

I am not an expert, but I know who is.

I don't know how this is going to be done, as I am absolutely not an expert in machine learning, analytics or artificial intelligence (I'm not even an expert in my own non-artificial brain...). But there are people out there who are.

This Is Not My Sponsor

I'm going to give a shout-out to gretel.ai as one such group, not because I'm using them now, nor are they paying me to, but because in a past life I worked with their founders and I know they are, in fact, experts in this field. I am quite sure there are others in this space as well, or who are looking to break into it. They're just the ones who come immediately to mind.

Actually, come to think of it... I can think of something my current company needs that can be addressed by a quality synthetic data generator. I'll be back in a bit, I have a phone call to make. ("Phone" is a thing where you use audio to talk to a person far away, it's like texting with a pure audio interface...)

Hashtags are your friends. Here's a lot of friends. You are loved. #artificialintelligence #gretelai #customerdata #gdpr #ccpa #privacycompliance Gretel.ai Ali Golshan #itsfridayafternoonsogohomealready
