Re-Thinking Crawlers
Hello World, and welcome to the very first edition of The Scholar Diaries. First, let me give you a gist of what this is all about - my name is Prince Bhardwaj, and I am a researcher focusing on anything and everything about computer science. Most of my days are filled with chasing answers, running experiments, debugging code, and occasionally banging my head against the wall (the joys of research!). This newsletter is my "dear diary" way of sharing the interesting bits I come across during my work. Each week, I'll post highlights from my notes: the ideas, discoveries, research papers, breakthroughs, and even the occasional head-scratcher that kept me up at night. You'll find my take on where things are headed, insights from the papers I'm reading, and my thoughts on the open questions in computer science. Whether you're a fellow researcher, a tech enthusiast, or just curious about what goes on in CS (and no, it's not just about AI), I hope you'll find value and maybe even a little inspiration here.
(Oh, and if you're curious about more details about me, my work, or what makes me tick, you can check them out HERE.)
This week, I was manually experimenting with 1 million domains to answer questions and uncover insights from the data. Over time, I have become a bit of a data monger, deeply focused on finding answers no matter how much effort it takes. But this challenge? It was on another level. I have always relied on web crawlers to extract the data I needed. Crawlers are, after all, the foundation of automating data extraction from the web. They have helped me answer countless questions and supported so much of the work I have done. Crawlers aren't just tools; they're the engines that power research, machine learning, and large-scale automation.
Until this week, my approach to crawling the web had always worked. I would write scripts, fine-tune my methods, and eventually get what I was looking for. But this time, the challenge was different.
The Challenge: A Million Domains, All Unique
The problem wasn’t just the scale - it was the sheer diversity. Every domain I worked with seemed to be built differently, with no shared patterns or structures. Some were old-school static HTML, while others were fully dynamic and JavaScript-heavy, relying on client-side rendering. Some needed user actions to trigger data loads, while others had hidden API calls buried deep in their network activity.
At first, I did what I always do: I built and tweaked my crawlers, adjusting them for what I thought would work. But it quickly became clear that my usual techniques weren’t going to cut it.
There was no one-size-fits-all solution. Every domain was like a CAPTCHA, and I had to solve each one individually.
An Opportunity in Disguise
Rather than give up, I saw this as an opportunity to re-engineer crawling techniques. Crawlers, for all their power, aren’t perfect. They rely on predictable structures, and when faced with messy, dynamic, or inconsistent web designs, they start to struggle.
I started by digging deep into the problem.
The Two-Way Approach: Static vs. Dynamic Crawling
To address the diversity of domains, I designed a two-way approach: static crawling for the old-school, plain-HTML sites, and dynamic crawling for the JavaScript-heavy ones that render their content client-side.
Here's how it played out:
Static Crawling
The static approach was relatively straightforward: fetch the raw HTML and parse out what I needed. The results here were solid - everything worked as expected, and the validation process ensured I could trust the data from the dynamic crawls.
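For anyone newer to this, here is a minimal sketch of what a static crawl boils down to. This isn't my exact pipeline - I'm assuming Node 18+ (for the global fetch) and the cheerio package for parsing, and the extraction logic is just a placeholder.

// Minimal static-crawl sketch (illustrative only, not my production setup).
// Assumes Node 18+ for the global fetch and the cheerio package for HTML parsing.
import * as cheerio from "cheerio";

async function crawlStatic(url: string): Promise<string[]> {
  const res = await fetch(url);        // plain HTTP GET - no JavaScript execution
  const html = await res.text();
  const $ = cheerio.load(html);        // parse the static markup

  // Placeholder extraction: collect every absolute link on the page.
  return $("a[href]")
    .map((_, el) => $(el).attr("href") ?? "")
    .get()
    .filter((href) => href.startsWith("http"));
}

crawlStatic("https://example.com").then((links) => console.log(links.length, "links"));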
Dynamic Crawling
This was where things got tricky. Many of the domains required two steps: first, simulating user interactions to trigger the API calls that actually load the data, and second, capturing and filtering the network requests those interactions fired off.
I started by manually testing a batch of 1000 domains to figure out the behaviour. After some experimentation, I managed to automate step 1 (triggering API calls) by simulating user interactions with Puppeteer and Chrome DevTools.
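To make step 1 a little more concrete, here is a stripped-down sketch of that kind of Puppeteer automation. The URL and selector are placeholders (not real domains from my dataset), and the XHR/fetch filter is only a first pass - as you'll see, it's nowhere near enough.

// Step 1 sketch: simulate a user action and watch the network traffic it triggers.
// The target URL and selector are placeholders, not real domains from my dataset.
import puppeteer from "puppeteer";

async function triggerAndObserve(url: string, selector: string): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Log every XHR/fetch response the page produces, so we can see
  // how much noise a single interaction generates.
  page.on("response", (res) => {
    const type = res.request().resourceType();
    if (type === "xhr" || type === "fetch") {
      console.log(res.status(), res.url());
    }
  });

  await page.goto(url, { waitUntil: "networkidle2" });
  await page.click(selector);                       // the simulated user action
  await new Promise((r) => setTimeout(r, 3000));    // let the triggered calls finish

  await browser.close();
}

triggerAndObserve("https://example.com", "#load-more");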
But step 2 was a nightmare.
The Hardest Part: Filtering Network Requests
For each user action (like a click), the browser would trigger 10+ network requests, most of which were unrelated to what I needed. Each domain used different names, parameters, and structures for their API requests, which meant I couldn’t generalize the process.
I spent hours reading documentation, scouring forums, and testing tools, but nothing seemed to work. I even considered building an AI agent to handle the filtering process, but that would have been too time-consuming and resource-heavy.
In the end, I had no choice but to manually filter and extract the necessary parameters for 1,500 domains. It was tedious, exhausting, and felt like taking a step backward.
Burp Suite comes to the rescue
While struggling with step 2, I remembered my pentesting days. Back then, I used Burp Suite to monitor and analyze network traffic for security testing. Could Burp help here?
I hooked up a Burp proxy to capture the network requests, and it worked beautifully. Burp allowed me to record every request an interaction generated, filter out the noise, and pull out the parameters I actually needed.
Of course, I had to write some custom scripts on top of Burp for this specific use case, but it paid off. My pentesting skills saved the day!
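If you want to wire up something similar, the basic trick is simply to point the automated browser at Burp's proxy listener (127.0.0.1:8080 by default) so Burp sees everything the crawler does. A minimal sketch, assuming Puppeteer and a default Burp setup:

// Route the automated browser through Burp's default proxy listener so Burp
// records every request the crawler triggers. Sketch only - in a real setup
// you'd install and trust Burp's CA certificate instead of ignoring TLS errors.
import puppeteer from "puppeteer";

async function crawlThroughBurp(url: string): Promise<void> {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      "--proxy-server=127.0.0.1:8080",   // send all traffic through Burp
      "--ignore-certificate-errors",     // quick workaround for Burp's TLS interception
    ],
  });

  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  // ...simulate interactions as before; Burp now captures the requests,
  // and its filtering tools do the heavy lifting of isolating the right ones.
  await browser.close();
}

crawlThroughBurp("https://example.com");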
Can homomorphic encryption fix the internet?
I thought I understood one core principle of encryption: You take clear text, encrypt it, and now that encrypted gibberish can’t be worked with until it is decrypted again. THIS IS WRONG.
There is a concept called homomorphic encryption, which turns this assumption on its head: you can actually work with the encrypted data and do mathematical calculations on it, and those calculations are reflected in the clear text once you decrypt.
Let me try to explain this in the most basic way:
Our clear numbers are 4 and 9. Now let's assume our "encryption algorithm" consists of multiplying these numbers by another secret number, 5.
So if we encrypt our clear data, we end up with two encrypted values: 20 (=4*5) and 45 (=9*5). This encrypted data has homomorphic properties, so we can now hand these numbers to a third party (a server) and say, "Add these two for me." We get back the result 65 (=20+45).
Now we "decrypt" this again by dividing by our secret 5: 65/5 = 13, which is exactly the same as if we sum up the two clear numbers: 4+9=13 …yaay.
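Here is that toy example written out as code, just to make the flow explicit. To be clear, this is only the illustrative multiply-by-a-secret scheme from above - not a real homomorphic encryption library or a secure construction.

// Toy "multiply by a secret" scheme from the example above.
// Purely illustrative - NOT real homomorphic encryption (see the GCF caveat below).
const SECRET = 5;

const encrypt = (clear: number): number => clear * SECRET;
const decrypt = (cipher: number): number => cipher / SECRET;

// Client side: encrypt the clear values 4 and 9.
const a = encrypt(4);   // 20
const b = encrypt(9);   // 45

// Server side: add the ciphertexts without ever knowing SECRET.
const encryptedSum = a + b;   // 65

// Client side: decrypting the result gives the sum of the clear values.
console.log(decrypt(encryptedSum) === 4 + 9);   // true (both are 13)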
So we now have a way to offload calculations to a server, which has no chance of ever knowing the original secret.
Be aware that in the real world, this is a lot more complex in order to be secure. Our example would be easily hackable by finding the Greatest Common Factor of our numbers (the GCF of 20 and 45 is 5). But the principle is kind of the same: do some encryption magic, and then you can do calculations on the encrypted data that are reflected in the original data.
My Struggle with Julia’s Type System
What's the Problem? - Julia uses multiple dispatch, meaning the method that gets called depends on the types of all the arguments. This makes Julia super flexible. But when you combine that with type unions (e.g. Union{Int, String}), things can get complicated.
Unions let you write code that works for multiple types, which sounds great. But under the hood, Julia struggles to optimize these, especially when working with collections like arrays.
An Example
data = Union{Int, String}[1, "hello", 3]
This creates an array where each element can either be an Int or a String. Pretty cool, right?
But when Julia compiles code that works over this array, it can't rely on a single concrete memory layout or specialize its operations, because the element types are mixed. Instead of handling everything efficiently, it falls back to a more generic (and slower) approach.
Why This Matters - If you're writing performance-critical code (which I am, since I work on high-performance computing), these type issues can hurt you. Julia is fast when it knows exactly what it's dealing with; Unions mess with that certainty.
My Takeaway
Julia is powerful, but its type system has trade-offs. Understanding these quirks can save you a lot of debugging time and frustration.
There is something about Theoretical Computer Science
which attracts me to its core. This field isn't just for academics - its concepts shape the technology we use every day.
TCS begins with curiosity. Why do some problems seem impossible to solve? Can we find faster, smarter ways to process information? Questions like these drive researchers to explore topics like algorithms, complexity, and cryptography.
For example, sorting algorithms may seem basic, but they reveal powerful ideas about how efficiently we can process data. Similarly, studying complexity classes (like P vs. NP) isn't just theoretical - it has real-world implications for everything from cybersecurity to AI.
What makes TCS unique is its reliance on rigorous proofs and logic. It’s not about trial and error; it’s about consistently applying precise methods to verify ideas. This discipline ensures that results are reliable and universally applicable.
Even if you’re not a theorist, TCS impacts you. Encryption protocols protect your data, AI algorithms power your apps, and optimization techniques make systems faster - all thanks to the principles of theoretical CS.
Those were some interesting bits from this week. Well done if you have made it this far - thank you for sticking with me! I hope you found something valuable, whether it was a new insight, a fresh perspective, or just a relatable struggle. If not, that's okay too - sometimes it's about enjoying the process and exploring ideas together. Feel free to share your thoughts, feedback, or even your own experiences - I'd love to hear from you.
I know there's no need to say this, but as a formality: if you find this newsletter interesting, you can subscribe and also connect with me...
PS: I also attended a nice talk given by Gareth Tyson on Bluesky, with some amazing insights.