Reverse Engineering App Network Flows: Lazy way to doc generation with Python

Introduction

#scapy #python #chatgpt #devnet #automation #network

Writing documentation is boring! And depending on the stage of the project you are documenting, it can be a very complex task.

But it's crucial: good infrastructure documentation can save the day during an incident, can help build proper monitoring and technical SLAs, and enables us to build the firewall and security postures that fit best. So we can't just skip it.

But more often than not I work on migrating or redesigning infrastructures that have little to no technical documentation, so... we have to reverse engineer to find out why and how things work.

I got to this stage last week with one of my clients, who develops a home-made SaaS application offering. Their main focus over the last 10 years was the apps and the customer experience, and little time was spent documenting the boring infrastructure.

And since I'm helping them redesign from the ground up and migrate application payloads, well, we need to know what's going on.

TL;DR

PCAP in, sequence diagram out. Wireshark has sequence diagrams, but can you really save them and say they are part of the documentation? :)

GitHub repo: https://github.com/craigarms/generateflow

Sprint #1

Finding the right tools to kickstart the documentation

The first step was to decide how to get a picture of what was going on. The applications are complex multi-tier apps with databases and interconnections all over the shop, so I started by concentrating on one process: data synchronization between clients and servers.

This process involves:

  • an API Gateway
  • a SQL database
  • a Redis database
  • and an ElasticSearch index

I set out to build diagrams, which I've learnt are called sequence diagrams:

Sequence Diagram


I spent all of 5 minutes in MS Visio setting up the building blocks for this, and got extremely bored...

I also realized that I didn't have nearly enough information about the apps to be able to build this. So I had a choice: pick the lead dev's brain for a week and document every step, or find something simpler :)

I'm a network guy, so when I want to know what's going on, my go-to tool is tcpdump (or a variation depending on the platform).

The final stage of this sprint was talking to ChatGPT and Bing Chat to see if anything simple existed to transform packet captures into sequence flow diagrams.

The short answer is no; the long answer was UML software and the like, which would require upskilling before getting things done.

So with the help of the AI bots and StackOverflow, I settled on Graphviz and Scapy in a custom Python program.
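If you want to follow along, the libraries (plus TTP, which comes in later for payload parsing) install from PyPI; note that the graphviz Python package also needs the Graphviz binaries installed on the system:

pip install scapy graphviz ttp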

Sprint Conclusion

I'm going to make/hack together a Python script that'll read the traffic capture file and generate nice sequence flow diagrams.

Shouldn't be too hard, should it?

Sprint Backlog

The above diagram's DOT 'code' was taken from StackOverflow and modified using Dreampuf's GraphvizOnline.

DOT is a graph description language: you describe your graph in a structured text format and the Graphviz library generates the graphical output. We'll deep dive in the next sprint.
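To make the relationship concrete, here is a minimal sketch (node names are illustrative) of how the Python graphviz library builds a graph and exposes the DOT text it generates:

from graphviz import Digraph

g = Digraph()
g.node('a', label='Alice')   # a node
g.node('b', label='Bob')     # another node
g.edge('a', 'b')             # an edge connecting them

# g.source holds the generated DOT text, roughly:
# digraph {
#     a [label=Alice]
#     b [label=Bob]
#     a -> b
# }
print(g.source)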

The next step is to generate this programmatically in Python.

Then parse the capture file for the information and feed it through. (I'm still feeling confident about the triviality of this at this point.)

Sprint #2

Building the diagram with Python

I know the code produced by ChatGPT is far from perfect; I've spent many nights arguing with it over broken code it has produced. But it's a great bootstrap for a quick project like this one.

So here is my prompt :

Can you convert the following Dot file to generate it using python graphviz library [...]

The DOT file I provided was the one used to create the sequence diagram above.

ChatGPT replied with "Certainly" and output the code. (Uploaded to Gist to shorten the length of code in the article)

Graphviz uses two main notions: nodes and edges.

To generate the desired graph, the author of the StackOverflow answer suggested a trick: invisible nodes that represent the vertical length of the bars.

Edges are the connectors, which will be joined on the invisible nodes.

ChatGPT's version of the diagram

While ChatGPT gave me good bootstrap code to start from, the result it generated didn't quite hit the mark. (Also, 50% of the lines generated were not used...)

Let's look at how this all comes together:

from graphviz import Digraph

graph = Digraph(format='png')
graph.attr(dpi='300')
graph.attr(rankdir='LR')
graph.attr(size='8,5')
graph.node_attr.update(color='lightblue2', style='filled')
graph.attr('node', shape='box', fontname='Arial',
           style='filled, rounded', fillcolor='#e2e2f0')

This section initializes the graph object. I'm already noticing one error: the rankdir attribute is set to LR (left to right), whereas a sequence diagram's lifelines should run top to bottom.
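Top to bottom is Graphviz's default ranking direction, so the fix is to drop that line, or set it explicitly:

graph.attr(rankdir='TB')  # top to bottom, the Graphviz default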

graph.node('b_start', label='Bob')        

Next we have the top labeled nodes for Bob, Alice and Charlie

graph.node('b_0', label='', shape='point', height='0')        

Then the "invisible" nodes which will be used to build the arrow connections with the edges.

graph.node('b_end', label='Bob')        

Followed by the bottom labeled node for each entity.

These are all the nodes we need to build the topology. Next we need to organize these nodes to get straight lines between the top labeled node and the bottom labeled one; here is ChatGPT's suggestion:

graph.edge('b_start', 'b_0')
graph.edge('b_0', 'b_1')
graph.edge('b_1', 'b_2')
graph.edge('b_2', 'b_3')
graph.edge('b_3', 'b_end', style='dashed', arrowhead='none')        

Makes sense: connect each node to the next one in the sequence. Notice how the last edge has extra parameters; looking at the produced image, the last line is dashed and has no arrowhead. So I guess the AI simply forgot these on the other lines.

Finally the arrow connections (edges):

graph.edge('b_0', 'a_0', weight='0', arrowhead='vee',
           fontname='Arial', label='<<B>1 </B>Authentication Request>')
graph.edge('a_1', 'c_1', weight='0', arrowhead='vee',
           fontname='Arial', label='<<B>1 </B>Authentication Validation>')
graph.edge('c_2', 'a_2', weight='0', arrowhead='vee',
           fontname='Arial', label='<<B>1 </B>Authentication Valid>')
graph.edge('a_3', 'b_3', weight='0', arrowhead='vee',
           fontname='Arial', label='<<B>2 </B>Authentication Response>')

So we have to connect the nodes of the same rank to get a line between them. OK, easy.

Let's correct the code with the things noticed above:

16,19c15,18
< graph.edge('b_start', 'b_0')
< graph.edge('b_0', 'b_1')
< graph.edge('b_1', 'b_2')
< graph.edge('b_2', 'b_3')
---
> graph.edge('b_start', 'b_0', style='dashed', arrowhead='none')
> graph.edge('b_0', 'b_1', style='dashed', arrowhead='none')
> graph.edge('b_1', 'b_2', style='dashed', arrowhead='none')
> graph.edge('b_2', 'b_3', style='dashed', arrowhead='none')
28,31c27,30
< graph.edge('a_start', 'a_0')
< graph.edge('a_0', 'a_1')
< graph.edge('a_1', 'a_2')
< graph.edge('a_2', 'a_3')
---
> graph.edge('a_start', 'a_0', style='dashed', arrowhead='none')
> graph.edge('a_0', 'a_1', style='dashed', arrowhead='none')
> graph.edge('a_1', 'a_2', style='dashed', arrowhead='none')
> graph.edge('a_2', 'a_3', style='dashed', arrowhead='none')
39,41c38,40
< graph.edge('c_start', 'c_0')
< graph.edge('c_0', 'c_1')
< graph.edge('c_1', 'c_2')
---
> graph.edge('c_start', 'c_0', style='dashed', arrowhead='none')
> graph.edge('c_0', 'c_1', style='dashed', arrowhead='none')
> graph.edge('c_1', 'c_2', style='dashed', arrowhead='none')        

And we get this diagram:

Looking Better :)

The code to generate this is here

After some playing around in the GraphvizOnline tool and some Googling, I realized that the connecting edges needed to live in subgraphs, so we could set the rank="same" attribute that forces both endpoints onto the same level, basically making straight lines.

c = Digraph()
c.attr(rank="same")
c.edge('b_0', 'a_0', weight='0', arrowhead='vee',
       fontname='Arial', label='<<B>1 </B>Authentication Request>')
graph.subgraph(c)

c = Digraph()
c.attr(rank="same")
c.edge('a_1', 'c_1', weight='0', arrowhead='vee',
       fontname='Arial', label='<<B>2 </B>Authentication Validation>')
graph.subgraph(c)

c = Digraph()
c.attr(rank="same")
c.edge('c_2', 'a_2', weight='0', arrowhead='vee',
       fontname='Arial', label='<<B>3 </B>Authentication Valid>')
graph.subgraph(c)

c = Digraph()
c.attr(rank="same")
c.edge('a_3', 'b_3', weight='0', arrowhead='vee',
       fontname='Arial', label='<<B>4 </B>Authentication Response>')
graph.subgraph(c)

And there we have it (here for the code)

Python Generated, looking like the manual effort

Sprint Conclusion

Now we can generate sequence diagrams directly in Python. We are still missing a few loops and tricks, but all the building blocks are present to build these programmatically.

Sprint Backlog

At this stage we need some sample data, a few loops, and good old Scapy to transform packet flows into something we can feed into Graphviz.

So long ChatGPT and thanks for all the scripts (fish ;))

Sprint #3

Scapy Time !

Scapy is an awesome Python library for manipulating packets. Personally, I use it for data extraction, summarization and other things I don't do in Wireshark. But you can do anything with Scapy :)

The first thing we need is some sample data so:

tcpdump -i eth0 icmp -w icmp.pcap        

and in another pane/window

ping -c2 8.8.8.8        

We should get 4 packets (since the count is 2, and each ping is an echo-request plus an echo-reply) and 2 endpoints (or nodes for the diagram exercise).
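A quick sanity check with Scapy before writing any parsing logic (assuming the icmp.pcap produced above):

from scapy.all import rdpcap

packets = rdpcap('icmp.pcap')
print(packets)       # e.g. <icmp.pcap: TCP:0 UDP:0 ICMP:4 Other:0>
print(len(packets))  # 4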

Perfect

So now in my new Python script I initialize all the needed stuff

from pprint import pprint
from scapy.layers.http import *
from scapy.layers.inet import TCP, IP, ICMP
from scapy.all import *

file = 'icmp.pcap'
flows = []

packets = rdpcap(file)        

Now we are going to parse each packet and check if it has an IP header, and if so if it has an ICMP header.

The flows list will contain a summary of each communication: source IP, destination IP, and a label for the arrow connector.

# Map the ICMP type/code integers to readable labels for the diagram
icmp_type_codes = {
    0: {0: "echo-reply"},
    8: {0: "echo-request"},
}

for packet in packets:
    label = ""
    if packet.haslayer(IP):
        if packet.haslayer(ICMP):
            label = icmp_type_codes[packet['ICMP'].type][packet['ICMP'].code]
            flows.append([packet[IP].src, packet[IP].dst, label])

Since Scapy reports the ICMP types and codes as integers, we map them to something pretty for the diagram.
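Only echo request/reply are mapped so far; if you want the diagram to cover more ICMP traffic, the mapping extends naturally (a partial sketch, values per RFC 792):

icmp_type_codes = {
    0: {0: "echo-reply"},
    3: {0: "net-unreachable",     # destination unreachable, keyed by code
        1: "host-unreachable",
        3: "port-unreachable"},
    8: {0: "echo-request"},
    11: {0: "ttl-exceeded"},      # time exceeded in transit
}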

If we pretty-print the flows list we get:

[['172.30.119.144', '8.8.8.8', 'echo-request'],
 ['8.8.8.8', '172.30.119.144', 'echo-reply'],
 ['172.30.119.144', '8.8.8.8', 'echo-request'],
 ['8.8.8.8', '172.30.119.144', 'echo-reply']]        

Cool, so from this list we can deduce that we need 2 labeled endpoints (each with a top and a bottom node), a chain of invisible nodes for each, and 4 labeled subgraph edges connecting them.

Let's now extract the unique peers of this communication:

flow_host_count = []

for flow in flows:
    if flow[0] not in flow_host_count:
        flow_host_count.append(flow[0])
    if flow[1] not in flow_host_count:
        flow_host_count.append(flow[1])

pprint(flow_host_count)        

Which gives us the flow_host_count list:

['172.30.119.144', '8.8.8.8']        

Perfect, nearly there. Now we just need to build the logic to generate the graph:

  • For each host we'll need to create the top and bottom labeled nodes
  • For the total number of packets we need to create the invisible nodes
  • And connect all that together in order

from graphviz import Digraph

graph = Digraph(format='png')
graph.node_attr.update(color='lightblue2', style='filled')
graph.attr('node', shape='box', fontname="Arial",
           style='filled', fillcolor='#e2e2f0', rank="same")

for host in flow_host_count:
    graph.node(f'{host}_start', label=host)
    graph.edge(f'{host}_start', f'{host}_0',
               style='dashed', arrowhead='none')
    for i in range(len(flows)):
        graph.node(f'{host}_{i}', label='', shape='point', height='0')
        graph.edge(f'{host}_{i}', f'{host}_{i+1}',
                   style='dashed', arrowhead='none')
    graph.node(f'{host}_end', label=host)
    graph.node(f'{host}_{i+1}', label='', shape='point', height='0')
    graph.edge(f'{host}_{i+1}', f'{host}_end',
               style='dashed', arrowhead='none')

It seems the order in which you create the nodes doesn't impact the final diagram, which is good.

Finally, we need to create a subgraph for each packet, referencing the invisible nodes and appending the label.

current_connexion = 0
for flow in flows:
    connection = Digraph()
    connection.attr(rank='same')
    connection.edge(f'{flow[0]}_{current_connexion}',
                    f'{flow[1]}_{current_connexion}',
                    weight='0', arrowhead='vee',
                    fontname='Arial', label=flow[2])
    graph.subgraph(connection)
    current_connexion += 1

graph.render('output_icmp', format='png', cleanup=False)

Which renders:

I dropped the rounded boxes somewhere along the line, it seems

Sprint Conclusion

Yipeee :) Looking good: pcap to PNG in a human-readable fashion. The "code" looks ugly (no functions, prints in all directions), but who cares, it works :)

Sprint Backlog

So the proof of concept works; now it needs to be extended to actually be useful.

I want at least to be able to parse HTTP, Redis, SQL and ElasticSearch traffic.

In the next step, I'll be using a GNS3 lab with all the components and a ChatGPT MockMultiTier application to generate the flows, capture them and play around with them.

Sprint #4

Network Stuff

I built a simple network in GNS3, just for the pleasure of having the visuals and not having to fiddle with the iptables rules Docker sets up natively.

Lab representing the target flows

Obviously my client's network is more involved, but this will enable me to capture the same type of flows and generate the diagrams in a controlled environment.

The Docker instance represented as a computer (on the left) runs a ChatGPT-generated application, with no manual modification; it represents the API Gateway that will talk to the databases.

The code is here if you want it, but it's by no means an example of nice coding :)

If we query the app through Postman, it will talk to Redis, then to MySQL and finally add some log data to ElasticSearch before replying to the request.

Capture time :)
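To grab all the tiers in one go, a capture filter along these lines does the trick (interface and ports depend on your lab; these are the defaults for HTTP, Redis, MySQL and Elasticsearch):

tcpdump -i eth0 'tcp port 80 or tcp port 6379 or tcp port 3306 or tcp port 9200' -w sync.pcap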

Dumping a GET Request

Now that we have a pcap of the HTTP GET on the /bootstrap endpoint we can start parsing it in the script.

Currently we are only parsing ICMP traffic, so we need to add a few statements, starting with HTTP.

Like Wireshark, Scapy has built-in dissectors for packet data, and picks them based on the ports in use. As we'll see later, not all protocols have matching dissectors in Scapy.

if packet.haslayer(TCP):
    # Manage Plain HTTP Traffic
    if packet.haslayer(HTTP):
        if packet.haslayer(HTTPRequest):
            method = packet.sprintf('%HTTPRequest.Method%')[1:-1]
            path = packet.sprintf('%HTTPRequest.Path%')[1:-1]
            label = f"{method} {path}"
        elif packet.haslayer(HTTPResponse):
            status = packet.sprintf('%HTTPResponse.Status_Code%')[1:-1]
            reason = packet.sprintf('%HTTPResponse.Reason_Phrase%')[1:-1]
            label = f"{status} {reason}"
        else:
            label = ""

If the packet has a TCP layer and an HTTP layer, we check whether it's a request or a response and extract the data that could be relevant.

Which generates something promising:

Parsing the HTTP

Next, let's add the REdis Serialization Protocol (RESP). There's no dissector for it in Scapy, but the communication is plain text, so I expect we can use text-based regular expressions on it.

This great Medium post helped me understand the constructs of the message payload in the RESP packets.

In short: (taken from the article)

  • In RESP, the first byte of data determines its type. Subsequent bytes constitute the type’s contents.
  • The \r\n (CRLF) is the protocol's terminator, which always separates its parts.

The first byte of data can be:

  • + for a simple string, followed by the string
  • - for a simple error, followed by the error string
  • * for an array, followed by the array length; each element is then a bulk string, introduced by $ and its length, followed by the string itself (see the sketch below)
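To make the layout concrete, here is a minimal, illustrative decoder for just these three cases (a sketch for understanding, not the parser we end up using):

def decode_resp(payload: bytes):
    # Split on the CRLF terminator; the first byte selects the type
    parts = payload.decode('utf-8', errors='ignore').split('\r\n')
    first = parts[0]
    if first.startswith('+'):   # simple string
        return first[1:]
    if first.startswith('-'):   # simple error
        return first[1:]
    if first.startswith('*'):   # array of bulk strings
        # skip the $<length> markers, keep the data elements
        return [p for p in parts[1:] if p and not p.startswith('$')]
    return None

print(decode_resp(b'*1\r\n$4\r\nPING\r\n'))  # ['PING']
print(decode_resp(b'+PONG\r\n'))             # 'PONG'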

So first we check if the packet being analyzed is from or to a Redis port:

redis_ports = [6379, 6380, 6381, 6382]
if packet['TCP'].dport in redis_ports \
   or packet['TCP'].sport in redis_ports:        

Yes, I've been using different/multiple ports, hence the mapping.

Then we check if the packet has a payload; it could just be a TCP handshake or session teardown.

if packet.haslayer('Raw'):        

Then we decode the raw payload as UTF-8 and check if it's a RESP simple string or simple error:

payload = packet['Raw'].load.decode('utf-8', errors='ignore')
# Manage Simple Strings and Simple Errors
if payload[0] == "+" or payload[0] == "-":
    label = payload[1:]        

Let's take a look at the result:

Parsing RESP

Hmm, OK, we get the PONG but not the PING, because simple strings seem to be used only in replies, not queries.

This is what the query looks like in Wireshark

RESP Ping

So the Ping query is defined as *1\r\n$4\r\nPING\r\n which translates to:

  • An array of length 1: *1
  • A bulk string of length 4: $4
  • The 4-character string: PING

So let's parse this using regex, but let's do it the lazy way with TTP (Template Text Parser), which is a regex-based template engine.

To build the template we need the raw text, so let's drop a breakpoint in the code and print the packet contents:

RESP Ping Packet

Plug that data into the Online TTP builder https://textfsm.nornir.tech/ and build the template to match the data:

TTP Template building

If a line starts with one character followed by a digit, it's a type-and-length declaration; the next line could be another type-length declaration, or data.

That should do:

redis_template = """
{{ type | re('.') }}{{ length | DIGIT }}
{{ data | _line_ }}
"""        

Now to import the library and parse the resulting list of lists of dictionaries... (Why so many?)

from ttp import ttp

parser = ttp(data=payload, template=redis_template)
parser.parse(one=True)
results = parser.result(format='raw')[0][0]
if len(results) >= 1:
    if isinstance(results, list):
        # Array message: the first match holds the type, the following ones the data
        if 'type' in results[0]:
            if results[0]['type'] == "*":
                label = results[1]['data']
                if label == "GET":
                    label = f"{label} {results[2]['data']}"
    elif isinstance(results, dict):
        if 'type' in results:
            if results['type'] == "$":    # lone bulk string: the actual payload
                label = "DATA"
            elif results['type'] == ":":  # RESP integer reply
                label = results['length']

Drum roll for the result:

We have Ping !

Next, ElasticSearch. I won't go into all the code as it's available on GitHub and it's the same logic. ElasticSearch traffic is HTTP, but Scapy doesn't dissect it because of the non-standard ports; I'll probably find a way to send it to the dissector anyway, later :)
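One likely route (untested here, and assuming Elasticsearch's default port 9200) is Scapy's bind_layers, which tells the dissector that traffic on a given port belongs to a given layer:

from scapy.all import bind_layers
from scapy.layers.http import HTTP
from scapy.layers.inet import TCP

# Teach Scapy to dissect port 9200 traffic as HTTP, in both directions
bind_layers(TCP, HTTP, dport=9200)
bind_layers(TCP, HTTP, sport=9200)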

Here is the template used for ElasticSearch:

http_request_template = """
{{ Method }} {{ Path }} {{ Http_Version | re('HTTP/\d.\d') }}
User-Agent: {{User_Agent | ORPHRASE }}
Accept: {{ Accept | ORPHRASE }}
Content-Type: {{ Content_Type | ORPHRASE }}
Content-Length: {{ Content_Length }}
Host: {{ Host }}

{{ data | _line_ }}
"""         

MySQL has been a pain in the neck; for quick and dirty results I had to factor in multi-line queries, lines starting with whitespace, and one-word queries...

Here are my templates, tried in the order shown, with each one falling through to the next when the previous fails (a sketch of that chain follows the templates):

mysql_query_template = """
{{ CMD }} {{ cols | re('[\w+,\s*]*') }}FROM {{ db }} {{ condition | ORPHRASE }}
"""

mysql_first_word_template = """
{{ CMD | WORD }}{{ data | re('.*') }}
"""

mysql_multiline_template = """
{{ re('\s+') }}{{ CMD | WORD }}{{ data | _line_ }}
"""        

At the end of a nice evening of coding we get the following results:

Get query on /bootstrap

This one, in my opinion (for something I generated with code), is beautiful:

POST to /sync

And the last two, which were coded into the mock app. Get Health:

Get /health

And Delete:

DELETE /sync

Sprint Conclusion

It works. It's a bit tedious coding each new "dissector", so I'll be on the lookout for better ways to handle this as we generate documentation in live environments.

Conclusion

This project is far from over. In the coming days I'll be separating the graph code from the pcap code. The graph code is going to be super useful for me if I can give it YAML, JSON, CSV, TSV or whatnot and let it throw out a "nice" diagram.
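The idea would be something like this (flows.json is hypothetical; the triples match the [src, dst, label] format used throughout):

import json

# Hypothetical input: [["10.0.0.1", "10.0.0.2", "GET /sync"], ...]
with open('flows.json') as f:
    flows = json.load(f)

# From here, the node/edge generation loops from Sprint #3 apply unchanged.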

The Pcap part of the code deserves to get merged into my pcap_summary project for more insight on captures.

I plan on building more solid dissectors enabling me to try and guess the application.

Yes, I did use this on live networks :) And the results enabled people to see, maybe for the first time, exactly what was going on in the background.

Feel free to comment, fork, open issues and/or pull requests.


Adnan G.

Automation That Works – Using Zapier, Airtable, and Make (Integromat), I help businesses save 20+ hours per week by eliminating manual tasks.

11 months

I hope this message finds you well. I am reaching out to seek your expertise and assistance with a specific task related to my software application. I have a Python PYD file integrated into my trading software, which operates within the commodity futures market in China, specifically trading the instrument FU. This PYD file utilizes market data (including price and volume) to generate various technical indicators such as moving averages (MA) and employs algorithms to generate trade signals within my software. While I have some knowledge of the functions within the PYD file, I am keen to gain a deeper understanding of its underlying logic and operations. Specifically, I am interested in reverse engineering the PYD file to comprehend its comprehensive functionality. Could you please assist me in decompiling and analyzing this PYD file? Your expertise in this area would be immensely valuable as I seek to enhance my understanding of this critical component of my trading system.

Sébastien PRADERES

<freelance> program manager

1 year

Hi Craig Armstrong & Fabien Berarde. Very impressive and innovative... As usual, you've done wonders! Looking forward to seeing you in Bordeaux.

Fabien Berarde

Network and Security Engineer

1 year

Awesome article that covers lots of topics, as usual. Even though most of the time I use the "Statistics > Flow Graph" feature of Wireshark to draw sequence diagrams, it is true that sometimes it lacks useful labels. For Redis it works like a charm (at least for basic commands), but I have seen protocols where Wireshark failed to extract useful information (because of the lack of a proper dissector). I have no doubt it would be much quicker to extract them on a Scapy-generated sequence diagram than by writing a dissector for Wireshark.

Craig Armstrong

Freelance Expert Network Architect

1 year

Big thanks to Fabien Berarde for the technical proof-read and his insights in the article !
