Re-Thinking Crawlers
Hello World, and welcome to the very first edition of The Scholar Diaries. First, let me give you a gist of what this is all about - my name is Prince Bhardwaj, and I am a researcher focusing on anything and everything about computer science. Most of my days are filled with chasing answers, running experiments, debugging code, and occasionally banging my head against the wall (the joys of research!). This newsletter is my "dear diary" way of sharing the interesting bits I come across during my work. Each week, I'll post highlights from my notes: the ideas, discoveries, research papers, breakthroughs, and even the occasional head-scratcher that kept me up at night. You'll find my take on where things are headed, insights from the papers I'm reading, and my thoughts on the open questions in computer science. Whether you're a fellow researcher, a tech enthusiast, or just curious about what goes on in CS (and no, it's not just about AI), I hope you'll find value and maybe even a little inspiration here.
(Oh, and if you're curious about more details about me, my work, or what makes me tick, you can check them out HERE.)
This week, I was manually experimenting with 1 million domains to answer questions and uncover insights from the data. Over time, I have become a bit of a data monger, deeply focused on finding answers no matter how much effort it takes. But this challenge? It was on another level. I have always relied on web crawlers to extract the data I needed. Crawlers are, after all, the foundation of automating data extraction from the web. They have helped me answer countless questions and supported so much of the work I have done. Crawlers aren't just tools; they're the engines that power research, machine learning, and large-scale automation.
Until this week, my approach to crawling the web had always worked. I would write scripts, fine-tune my methods, and eventually get what I was looking for. But this time, the challenge was different.
The Challenge: A Million Domains, All Unique
The problem wasn’t just the scale - it was the sheer diversity. Every domain I worked with seemed to be built differently, with no shared patterns or structures. Some were old-school static HTML, while others were fully dynamic and JavaScript-heavy, relying on client-side rendering. Some needed user actions to trigger data loads, while others had hidden API calls buried deep in their network activity.
At first, I did what I always do: I built and tweaked my crawlers, adjusting them for what I thought would work. But it quickly became clear that my usual techniques weren’t going to cut it.
There was no one-size-fits-all solution. Every domain was like a CAPTCHA, and I had to solve each one individually.
An Opportunity in Disguise
Rather than give up, I saw this as an opportunity to re-engineer crawling techniques. Crawlers, for all their power, aren’t perfect. They rely on predictable structures, and when faced with messy, dynamic, or inconsistent web designs, they start to struggle.
I started by digging deep into the problem.
The Two-Way Approach: Static vs. Dynamic Crawling
To address the diversity of domains, I designed a two-way approach: static crawling for the old-school, plain-HTML sites, and dynamic crawling for the JavaScript-heavy ones that render their content client-side.
Here's how it played out:
Static Crawling
The static approach was relatively straightforward: fetch the raw HTML and parse out what I needed. The results here were solid - everything worked as expected, and the validation process ensured I could trust the data from the dynamic crawls.
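For anyone newer to this, here is a minimal sketch of what a static crawl boils down to. This isn't my exact pipeline - I'm assuming Node 18+ (for the global fetch) and the cheerio package for parsing, and the extraction logic is just a placeholder.

// Minimal static-crawl sketch (illustrative only, not my production setup).
// Assumes Node 18+ for the global fetch and the cheerio package for HTML parsing.
import * as cheerio from "cheerio";

async function crawlStatic(url: string): Promise<string[]> {
  const res = await fetch(url);        // plain HTTP GET - no JavaScript execution
  const html = await res.text();
  const $ = cheerio.load(html);        // parse the static markup

  // Placeholder extraction: collect every absolute link on the page.
  return $("a[href]")
    .map((_, el) => $(el).attr("href") ?? "")
    .get()
    .filter((href) => href.startsWith("http"));
}

crawlStatic("https://example.com").then((links) => console.log(links.length, "links"));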
Dynamic Crawling
This was where things got tricky. Many of the domains required two steps: first, simulating user interactions to trigger the API calls that actually load the data, and second, capturing and filtering the network requests those interactions fired off.
I started by manually testing a batch of 1000 domains to figure out the behaviour. After some experimentation, I managed to automate step 1 (triggering API calls) by simulating user interactions with Puppeteer and Chrome DevTools.
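To make step 1 a little more concrete, here is a stripped-down sketch of that kind of Puppeteer automation. The URL and selector are placeholders (not real domains from my dataset), and the XHR/fetch filter is only a first pass - as you'll see, it's nowhere near enough.

// Step 1 sketch: simulate a user action and watch the network traffic it triggers.
// The target URL and selector are placeholders, not real domains from my dataset.
import puppeteer from "puppeteer";

async function triggerAndObserve(url: string, selector: string): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Log every XHR/fetch response the page produces, so we can see
  // how much noise a single interaction generates.
  page.on("response", (res) => {
    const type = res.request().resourceType();
    if (type === "xhr" || type === "fetch") {
      console.log(res.status(), res.url());
    }
  });

  await page.goto(url, { waitUntil: "networkidle2" });
  await page.click(selector);                       // the simulated user action
  await new Promise((r) => setTimeout(r, 3000));    // let the triggered calls finish

  await browser.close();
}

triggerAndObserve("https://example.com", "#load-more");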
But step 2 was a nightmare.
The Hardest Part: Filtering Network Requests
For each user action (like a click), the browser would trigger 10+ network requests, most of which were unrelated to what I needed. Each domain used different names, parameters, and structures for their API requests, which meant I couldn’t generalize the process.
I spent hours reading documentation, scouring forums, and testing tools, but nothing seemed to work. I even considered building an AI agent to handle the filtering process, but that would have been too time-consuming and resource-heavy.
In the end, I had no choice but to manually filter and extract the necessary parameters for 1,500 domains. It was tedious, exhausting, and felt like taking a step backward.
Burp Suite comes to the rescue
While struggling with step 2, I remembered my pentesting days. Back then, I used Burp Suite to monitor and analyze network traffic for security testing. Could Burp help here?
I hooked up a Burp proxy to capture the network requests, and it worked beautifully. Burp allowed me to record every request an interaction generated, filter out the noise, and pull out the parameters I actually needed.
Of course, I had to write some custom scripts on top of Burp for this specific use case, but it paid off. My pentesting skills saved the day!
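If you want to wire up something similar, the basic trick is simply to point the automated browser at Burp's proxy listener (127.0.0.1:8080 by default) so Burp sees everything the crawler does. A minimal sketch, assuming Puppeteer and a default Burp setup:

// Route the automated browser through Burp's default proxy listener so Burp
// records every request the crawler triggers. Sketch only - in a real setup
// you'd install and trust Burp's CA certificate instead of ignoring TLS errors.
import puppeteer from "puppeteer";

async function crawlThroughBurp(url: string): Promise<void> {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      "--proxy-server=127.0.0.1:8080",   // send all traffic through Burp
      "--ignore-certificate-errors",     // quick workaround for Burp's TLS interception
    ],
  });

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  // ...simulate interactions as before; Burp now captures the requests,
  // and its filtering tools do the heavy lifting of isolating the right ones.
  await browser.close();
}

crawlThroughBurp("https://example.com");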
Can homomorphic encryption fix the internet?
I thought I understood one core principle of encryption: You take clear text, encrypt it, and now that encrypted gibberish can’t be worked with until it is decrypted again. THIS IS WRONG.
There is a concept called homomorphic encryption, which turns this assumption on its head: you can actually work with the encrypted data and do mathematical calculations on it, and those calculations are reflected in the clear text once you decrypt.
Let me try to explain this in the most basic way:
Our clear numbers are 4 and 9. Now let's assume our "encryption algorithm" consists of multiplying these numbers by another secret number, 5.
So if we encrypt our clear data, we end up with two encrypted values: 20 (=4*5) and 45 (=9*5). This encrypted data has homomorphic properties, so we can now hand these numbers to a third party (a server) and say, "Add these two for me." We get back the result 65 (=20+45).
Now we "decrypt" this again by dividing by our secret 5: 65/5 = 13, which is exactly the same as if we sum up the two clear numbers: 4+9=13 …yaay.
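Here is that toy example written out as code, just to make the flow explicit. To be clear, this is only the illustrative multiply-by-a-secret scheme from above - not a real homomorphic encryption library or a secure construction.

// Toy "multiply by a secret" scheme from the example above.
// Purely illustrative - NOT real homomorphic encryption (see the GCF caveat below).
const SECRET = 5;

const encrypt = (clear: number): number => clear * SECRET;
const decrypt = (cipher: number): number => cipher / SECRET;

// Client side: encrypt the clear values 4 and 9.
const a = encrypt(4);   // 20
const b = encrypt(9);   // 45

// Server side: add the ciphertexts without ever knowing SECRET.
const encryptedSum = a + b;   // 65

// Client side: decrypting the result gives the sum of the clear values.
console.log(decrypt(encryptedSum) === 4 + 9);   // true (both are 13)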
So we now have a way to offload calculations to a server, which has no chance of ever knowing the original secret.
Be aware that in the real world, this is a lot more complex in order to be secure. Our example would be easily hackable by finding the Greatest Common Factor of our numbers (the GCF of 20 and 45 is 5). But the principle is kind of the same: do some encryption magic, and then you can do calculations on the encrypted data that are reflected in the original data.
My Struggle with Julia’s Type System
What's the Problem? - Julia uses multiple dispatch, meaning the method that gets called depends on the types of all the arguments. This makes Julia super flexible. But when you combine that with type unions (e.g. Union{Int, String}), things can get complicated.
Unions let you write code that works for multiple types, which sounds great. But under the hood, Julia struggles to optimize these, especially when working with collections like arrays.
An Example
data = Union{Int, String}[1, "hello", 3]
This creates an array where each element can either be an Int or a String. Pretty cool, right?
But when Julia compiles code that works over this array, it can't rely on a single concrete memory layout or specialize its operations, because the element types are mixed. Instead of handling everything efficiently, it falls back to a more generic (and slower) approach.
Why This Matters - If you're writing performance-critical code (which I am, since I work on high-performance computing), these type issues can hurt you. Julia is fast when it knows exactly what it's dealing with; Unions mess with that certainty.
My Takeaway
Julia is powerful, but its type system has trade-offs. Understanding these quirks can save you a lot of debugging time and frustration.
There is something about Theoretical Computer Science
which attracts me to its core. This field isn't just for academics - its concepts shape the technology we use every day.
TCS begins with curiosity. Why do some problems seem impossible to solve? Can we find faster, smarter ways to process information? Questions like these drive researchers to explore topics like algorithms, complexity, and cryptography.
For example, sorting algorithms may seem basic, but they reveal powerful ideas about how efficiently we can process data. Similarly, studying complexity classes (like P vs. NP) isn't just theoretical - it has real-world implications for everything from cybersecurity to AI.
What makes TCS unique is its reliance on rigorous proofs and logic. It’s not about trial and error; it’s about consistently applying precise methods to verify ideas. This discipline ensures that results are reliable and universally applicable.
Even if you’re not a theorist, TCS impacts you. Encryption protocols protect your data, AI algorithms power your apps, and optimization techniques make systems faster - all thanks to the principles of theoretical CS.
Those were some interesting bits from this week. Well done if you have made it this far - thank you for sticking with me! I hope you found something valuable, whether it was a new insight, a fresh perspective, or just a relatable struggle. If not, that's okay too - sometimes it's about enjoying the process and exploring ideas together. Feel free to share your thoughts, feedback, or even your own experiences - I'd love to hear from you.
I know there's no need to say this, but as a formality: if you find this newsletter interesting, you can subscribe and also connect with me...
PS: I also attended a nice talk given by Gareth Tyson on Bluesky, with some amazing insights.