Standardizing phone handset telemetry using AI

This was originally published on Medium, and presented in an NVIDIA webinar featuring Jared Ritter from Charter, Eric Harper from NVIDIA, and Aaron Williams from OmniSci. See here for the Git repo covering this demo.

There is no bigger data than telecommunications and Multiple System Operator (MSO) data. There is no data which affects the operations of these industries more than network telemetry. And there is no data that is less standard.

Telecommunications companies are dealing with one of the most complex data problems you could imagine:

  • They are often an amalgam of companies, each of which had their own data operations
  • They use different manufacturers of network machines, each with their own data format
  • They use many models and firmware versions for each manufacturer
  • They often have different geographies, each with their own operations

Getting a whole-of-network view is complex, because it is one of the largest data roll-up exercises imaginable.

This article will show how neural networks can reduce the time this takes significantly: we have seen reductions in Time to Data of up to 99.998% for business-as-usual data pipelines.

One of Datalogue’s largest customers, a top US Multiple System Operator, has employed these techniques to improve time to data and increase the number of errors which are automatically responded to—leading to savings on the order of hundreds of thousands of dollars a month in reduced operational costs.

Data perspectives: views of the network and views of the business

Network telemetry is one view into the customer experience of the network.

To get a more holistic view, this should be supplemented by other data, including data from the customer’s handset. This provides a representation of the user’s actual experience of the network.

With this perspective, a network provider can see not only what is going on with the network, but which issues are affecting a customer’s experience.

Correlating this data further with service call data, truck roll data and account data can help a network provider understand which issues:

  • are serious enough to get a customer to call customer service (which reduces customer satisfaction)
  • are serious enough to require a truck-roll (which costs the network provider significant money, estimated at $200 per truck roll)
  • are serious enough to cause customer churn (a worst-case, with significant lost revenue potential on the order of $1000+ per year per instance).

Each of these perspectives adds enormous value and depth of analysis, but each also compounds the complexity of the data operations problem.

A new industrial data workflow

Say a network provider wants to access handset telemetry, for the reasons stated above.

Ordinarily…

a company would ingest a data store, mapping the data to a schema that makes sense to them.

The company would then transform the data as necessary to feed it into the output format required.

Then, a second source appears. Another mapping process. Another pipeline.

A third source. Same again. (If you are getting bored, imagine how the engineers feel.)

Then the structure of the first source changes without warning. The company would have to catch that, delete the misprocessed data, remap and repipe, and start again.

A painful cycle, and one that doesn’t benefit from economies of scale. A user might get marginally faster at building a data pipeline, but it is still about one unit of labor for each new source.

… or each change in a destination. Each iteration in the output requires more of the above.

That’s a lot of upfront work. A lot of maintenance work. And a lot of thankless work.

And critically, this data grooming does not support good business process. The people who know, own and produce the data, the people who live with the data every day, are not the people comprehending the data here. Rather, by leaving it to the data engineering, analysis or operations teams, the domain-specific knowledge of the data producers is largely left by the wayside. It is lost knowledge.

That means that errors are only caught once the data product loop is complete: once the data org massages the data according to their understanding, analyzes it according to their understanding, and provides insights based on their understanding. Only once that insight is delivered, and is counterintuitive enough to alarm the experts, would errors be caught.

A new approach …

would be worthwhile only if it addressed the flaws in the existing processes:

  • it should scale with the number of sources being ingested
  • it should scale with the complexity of the problem
  • it should be resilient to change in the source structure and content
  • it should leverage the domain-specific expertise of the data producers

This is where a neural network based solution earns its keep.

A neural network based workflow would:

  1. use data ontologies to capture the domain knowledge of the producers
  2. train a model to understand those classes, in the full distribution of how they may appear
  3. create data pipelines that leverage the neural network based classification to be agnostic to source structure and be resilient to change in the schema.
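
To make the third step concrete, here is a minimal, hypothetical sketch in plain Python. This is not the Datalogue SDK; the classifier, schema and column names are invented for illustration. The point is that the pipeline is keyed on predicted ontology classes rather than on source column names:

# Hypothetical sketch: a structure transformation keyed on ontology classes,
# not on source column names. `classify_column` stands in for the trained
# neural network classifier.
from typing import Callable, Dict, List, Optional

# The output schema, expressed as ontology classes (illustrative subset)
OUTPUT_SCHEMA = ["Packet Loss", "Jitter", "Signal Strength", "Link Speed"]

def standardize(rows: List[Dict[str, str]],
                classify_column: Callable[[List[str], str], str]
                ) -> List[Dict[str, Optional[str]]]:
    """Map arbitrary source columns onto the output schema using the
    classifier's predictions, so renamed or reordered columns still
    land in the right output field."""
    columns = list(rows[0].keys())
    # 1. Classification transformation: predict an ontology class per column
    predicted = {col: classify_column([row[col] for row in rows], col)
                 for col in columns}
    # 2. Structure transformation: emit output keyed on the predicted class
    class_to_col = {cls: col for col, cls in predicted.items()}
    return [{cls: row.get(class_to_col[cls]) if cls in class_to_col else None
             for cls in OUTPUT_SCHEMA}
            for row in rows]

Because nothing here depends on the source headers, a new source or a renamed column only changes the classifier's input, not the pipeline definition.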

I’ll outline the above by walking through this example.

By example: handset telemetry

We were given three files to work from:

  1. a ten thousand row training data set of US handset telemetry data (“the training data”) (this will be discussed further below)
  2. a 340 million row full US handset telemetry dataset (“the US telemetry data”)
  3. a 700+ million row European handset telemetry dataset (“the EU telemetry data”)

The training data was well known and understood by the data producers, and structured according to their own conventions. Crucially, this data represented their work in solving a subset of the data problem. Designing the ontology and training the neural network is like uploading expertise.

Designing the ontology is like uploading expertise

The US and EU telemetry datasets were unseen—we were to standardize all three files to the same format.

To do so, we created a notebook that utilized the Datalogue SDK to:

  1. create a data ontology,
  2. attach training data to that ontology,
  3. train and deploy a model, and
  4. create a resilient pipeline that standardized data on the back of classifications made by the deployed model.

Creating an Ontology

An ontology represents a taxonomy of classes of data that fall within a subject matter area, often coinciding with the fields in the output schema of the proposed data pipeline.

It is not the one true taxonomy of all types of data in the entire group of sources, but rather, the columns that are useful in solving the problem at hand.

The ontology can always be added to or changed (e.g. expanded to cover an evolving problem).

The ontology is also where our domain experts embed their domain knowledge.

In this example, the producers of the data helped us create an ontology that explained the data that we were seeing: which data is sensitive and which can be freely shared, which was handset data and which was network data, which telemetry field belonged to what.

This was captured in the Datalogue SDK (see the first code snippet in the appendix for the full ontology):

wireless_ontology = Ontology(
    "Declassified Wireless Carrier Data",
    "This is for the purpose of cleaning and delivering safe data as a product.",
    [
        OntologyNode(
            "Sensitive Data",
            "This is data NOT to be distributed to 3rd parties.",
            [
                OntologyNode(
                    "Subscriber",
                    None,
                    [
                        OntologyNode("Full Name"),
                        OntologyNode("First Name"),
                        OntologyNode("Family Name"),
...

This pushed the ontology to the Datalogue platform.


This allows the data operators to understand and contextualize the data. Both business purpose (here, obfuscation) and context (here, field descriptions and nesting according to the subject matter) are embedded directly into the data ontology.

Attaching training data

The next step is adding training data to the ontology, both of which are inputs into the model training.

This further embedded the domain experts’ knowledge of the data to the process. Their knowledge of the training data set allowed them to perfectly map that data to the ontology, adding thousands of training data points to most nodes.

For the other nodes, supplementary datasets were used.

These were attached with the SDK (see the second code snippet in the appendix for the full training data mapping):

us_telemetry_training_col_dict = {
    "Packet Loss": "main_QOS_PacketLoss_LostPercentage",
    "Jitter": "main_QOS_Jitter_Average",
    "Signal Strength": "main_QOS_SignalStrength",
    "Link Speed": "main_QOS_LinkSpeed",
...

And then pushed to the Datalogue platform:

[Screenshot: training data attached to the ontology in the Datalogue platform]

For the model in this example, the majority of classes had sufficient data (~5,000 datapoints each), but those data came from a single source.

This tends to produce a model that overfits to the training datastore (the training data only captures the part of each class’s distribution represented in that store).

The effects of this showed when the model was used on other datastores: it performed well on the US datastores, but less well on ambiguous classes (e.g. the various ratios represented in this dataset) in the European dataset, where the overlap in context was less apparent.

To head this off, when you train a model as above, Datalogue’s “model doctor” applies this heuristic: it warns you and recommends adding training data from additional sources for those classes.
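
The heuristic itself is easy to illustrate. Below is a hypothetical sketch, not the actual model doctor (the function and data shape are invented for illustration), that flags classes whose training data comes from too few distinct sources:

# Hypothetical sketch of a "single source" training-data check.
from collections import defaultdict
from typing import Dict, List, Tuple

def single_source_classes(training_examples: List[Tuple[str, str, str]],
                          min_sources: int = 2) -> Dict[str, int]:
    """training_examples: (class_label, value, source_id) triples.
    Returns the classes whose examples come from fewer than `min_sources`
    distinct datastores, i.e. classes at risk of overfitting to one source."""
    sources_per_class = defaultdict(set)
    for class_label, _value, source_id in training_examples:
        sources_per_class[class_label].add(source_id)
    return {cls: len(srcs) for cls, srcs in sources_per_class.items()
            if len(srcs) < min_sources}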

Training and deploying a model

Once training data is attached to each leaf node in the ontology, the user can train a neural network model to classify these classes of data.

At a glance, this model training option is an “on rails” training system. It trains a model that takes each string as a series of character embeddings (a matrix representing the string) and uses a very deep convolutional neural network to learn the character distributions of these classes of data.

This model also heeds the context of the datastore—where the data themselves are ambiguous, other elements are considered, such as neighboring data points and column headers.
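
For intuition only, the sketch below approximates that style of model in PyTorch: characters are embedded, convolved over, and pooled into a class prediction. It is an illustrative stand-in, not the network Datalogue actually trains, and it omits the contextual features (neighboring data points, column headers) mentioned above.

# Illustrative character-level CNN classifier (PyTorch). Each string becomes
# a matrix of character embeddings; convolutions learn character
# distributions; a linear head predicts the ontology class.
import torch
import torch.nn as nn

class CharCNNClassifier(nn.Module):
    def __init__(self, n_classes: int, vocab_size: int = 128,
                 embed_dim: int = 16, max_len: int = 64):
        super().__init__()
        self.max_len = max_len
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),          # pool over the character axis
        )
        self.head = nn.Linear(64, n_classes)

    def encode(self, strings):
        # ASCII codepoints, truncated/padded to max_len (index 0 is padding)
        ids = torch.zeros(len(strings), self.max_len, dtype=torch.long)
        for i, s in enumerate(strings):
            codes = [min(ord(c), 127) for c in s[: self.max_len]]
            if codes:
                ids[i, : len(codes)] = torch.tensor(codes)
        return ids

    def forward(self, strings):
        x = self.embed(self.encode(strings))  # (batch, max_len, embed_dim)
        x = self.convs(x.transpose(1, 2))     # (batch, 64, 1)
        return self.head(x.squeeze(-1))       # (batch, n_classes)

# Example: score two telemetry-like strings against a four-class ontology
model = CharCNNClassifier(n_classes=4)
logits = model(["-97 dBm", "0.02"])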

This “on rails” option will be sufficient for most classification problems, and allows a non-technical user to quickly create performant models.

# Use Datalogue Semantic Engine to train a classification model on the Telemetry Ontology
dtl.training.run(
  ontology_id=wireless_ontology.ontology_id  # will change to wireless_ontology.id in next version
)

Where there is a bit more time and effort available, and where more experimentation and better results may be required, an ML engineer can use “science mode” to experiment with hyperparameter tuning, and generally have more control over the training process.

Model Metrics

Once the model has been trained, the user is able to see the performance of the model on the validation and test sets, having access to:

  • model wide statistics (F1 score, precision, recall, etc.)
  • confusion matrices
  • class specific statistics
  • training statistics (loss curve, etc.)

[Screenshot: model metrics in the Datalogue platform]

As the metrics show, this model disambiguates the telemetry classes effectively, with very little work.

Model deployment

With a single click, a model is deployed and ready to be used in pipelines.

Data Pipelines

Once the user has a deployed model, they can use it to create pipelines that work for both the European and US datastores and are resilient to changes in the incoming schemata of these sources.

This pipeline has some novel concepts:

  1. the pipeline starts with a classification transformation—the neural network identifying the classes of data (either on a datapoint or column level) with reference to the ontology
  2. the later transformations (such as the structure transformation) rely on that classification. If the schema changes, or if a new source file is used, the pipeline need not be changed: the pipelines don’t rely on the structure or column headers for transformation, but rather on the classes determined by the neural network.

The marginal cost of adding a new source, or remediating a changed source schema, is now only the cost of verifying the model’s results: no new manual mapping or pipelining required. A minimal sketch of the idea follows.
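
As a hypothetical end-to-end illustration of that resilience (the EU column names and the rule-based stand-in classifier are invented for this example; the US column names are taken from the training mapping above): two sources with completely different headers flow through the same two transformations and come out in the same shape.

# Hypothetical demo: one pipeline, two schemata. `classify` stands in for
# the deployed neural network model.
def classify(values, header):
    value = values[0]
    if "dBm" in value:
        return "Signal Strength"
    if value.endswith("%"):
        return "Packet Loss"
    return "Jitter"

us_rows = [{"main_QOS_SignalStrength": "-97 dBm",
            "main_QOS_PacketLoss_LostPercentage": "0.4%"}]
eu_rows = [{"pkt_loss_pct": "1.1%", "rssi_dbm": "-88 dBm"}]  # renamed, reordered

for rows in (us_rows, eu_rows):
    cols = list(rows[0].keys())
    # classification transformation
    labels = {col: classify([r[col] for r in rows], col) for col in cols}
    # structure transformation keyed on the predicted classes
    print([{labels[col]: r[col] for col in cols} for r in rows])
# Both sources print rows keyed on "Signal Strength" and "Packet Loss",
# with no per-source mapping or pipeline change.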

The resulting data

From completely differently named and structured sources, we now have a standardized output:

Before:

[Screenshots: the differently named and structured US and EU source datasets]

After:

[Screenshot: the standardized output dataset]


Now you have a clean dataset to use for analytics. One datastore to be used for:

  • visualization
  • automation
  • advanced analytics

And fast.

For one telemetry customer, neural network based data pipelines were measured as being 99.998% faster than traditional methods, because of the reduction of the manual work required to process this data.

Neural network based data pipelines were measured as being 99.998% faster than traditional methods
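
To put that figure in perspective (the baseline here is illustrative, not a measured number): if standing up and maintaining a pipeline for a new source traditionally takes about a week of elapsed time, roughly 600,000 seconds, then a 99.998% reduction leaves about 12 seconds, essentially the time to run the classification and spot-check the output.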

That faster time to data means:

  • less time spent on reporting and modelling
  • faster time to getting a single view of your network
  • faster time to issue detection
  • faster time to issue resolution

And for the aforementioned MSO provider, hundreds of thousands of dollars in savings per month.

This scales

The above was a simple example used to highlight the model creation, deployment and pipelining.

It used pipelines from just three sources.

In deployment, for the above MSO provider, more than 100k pipelines are created each month, and that number is growing exponentially.

100,000 pipelines created per month

Appendices:

Resources

Code snippets

  1. Ontology generation
  2. Training data mapping
  3. Pipeline definition
  4. Pipeline JSON object

Screenshots

