Experiments in summarization using LLMs
For many months I have been trying to stay on top of AI news, and it is damn hard! So much is happening at such a fast pace that before I can make sense of what happened last month, 10x more things happen in the current week. As a VC, I need to be on top of my game - informed about startup funding, product launches, new research papers, SOTA models, and new paradigm shifts (SFT, RFT, reasoning, etc. - the whole alphabet soup). And, more interestingly, the novel uses of AI that people are talking about, or the more intellectual debates: AGI, job losses, LLMs for India, AI sovereignty, and so on.
At last count I have subscribed to 60+ newsletters, and I follow some 100+ blogs, Twitter handles, YouTube channels, etc. It has become information overload - just reading takes up so much time that finding time to process the knowledge is hard. (Reading even one paper off arXiv takes at least an hour, if not more.)
So I needed a way out, and I felt the most efficient way was to code my way out of it. (I have been programming since 5th grade - BASIC games on a PC-AT in the 90s.) So I came up with the idea of scraping all these 100+ websites, 60+ newsletters, etc. Before OpenAI's Operator (CUA) and Anthropic's Computer Use came out in the last quarter, I was coding Playwright scripts with GPT-3.5 - and I can tell you it was hard. So I had to simplify, and I hit upon the idea of just subscribing to a lot of RSS feeds (wow, a two-decade-old technology coming in handy). But there was another problem: some of the interesting new stuff is not available on RSS, and most publications force you to come to the main website by putting only a small snippet in their RSS feeds.
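The RSS route is far simpler than browser automation - a minimal sketch with only the standard library, assuming the feed follows the common RSS 2.0 layout (real feeds vary, and Atom uses different element names):

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list[dict]:
    """Extract title/link/snippet from an RSS 2.0 feed string."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # many publications only ship a teaser here, not the full article
            "snippet": item.findtext("description", default="").strip(),
        })
    return items

sample = """<rss version="2.0"><channel>
  <item><title>New SOTA model</title>
        <link>https://example.com/post</link>
        <description>Only a teaser snippet...</description></item>
</channel></rss>"""

for entry in parse_rss(sample):
    print(entry["title"], "->", entry["link"])
```

The snippet-only problem shows up exactly in that `description` field: when it holds a teaser, you still have to fetch the `link` to get the full text.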
So finally, three weeks ago, inspiration struck!
I realized that a lot of the interesting newsletters (Substack, Medium, etc.) are coming into my inbox anyway. Why not just parse them and collect insights from them? I can tell you now that a version producing human-acceptable output is working. But I have stepped into the minefield of 100K-token context windows not being enough, LLM evals, prompt engineering, APIs vs Ollama, GPT vs reasoning models - and who knew markdown handling is also not easy?
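The inbox-parsing step itself is mostly solved by the standard library. A sketch, assuming messages are already fetched (e.g. via IMAP) as raw strings; it walks a MIME message and prefers the text/plain part:

```python
from email import message_from_string
from email.message import Message

def extract_body(raw: str) -> str:
    """Pull the text/plain body out of a raw RFC 822 email message."""
    msg: Message = message_from_string(raw)
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True)
                return payload.decode(
                    part.get_content_charset() or "utf-8", errors="replace")
        return ""  # HTML-only newsletter: needs an HTML-to-text pass instead
    return msg.get_payload()

raw = (
    "From: ai-news@example.com\n"
    "Subject: Weekly AI digest\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "OpenAI shipped something again.\n"
)
print(extract_body(raw))
```

Many newsletters are HTML-only, so the fallback branch is where the real work (and the markdown pain mentioned above) begins.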
I plan to open-source the project. But the most interesting architecture debate I ended up with is the cost vs quality vs coding-speed trade-off between three approaches: a single summarization pass over everything, one generic "golden" prompt, or custom prompts per newsletter.
I can tell you that the first approach is the current MVP - it works OK for all the newsletters I receive in 24 hours, which currently amount to around 70K tokens (~55,000 words across 12 newsletters on average). But the approach goes for a toss if you want to do a weekly review - context length issues. Even within a daily review, some of the interesting things get missed - I have seen that using o1 vs 4o is a dramatic improvement. (But of course, how much are you willing to spend for a daily newsletter of just-OK quality - 10 cents or 50 cents?)
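The weekly-review overflow can at least be detected and worked around by batching before calling the model. A rough sketch: the 4-characters-per-token heuristic is an approximation (a tokenizer like tiktoken gives exact counts), and the greedy packing keeps each newsletter whole:

```python
def estimate_tokens(text: str) -> int:
    # rough heuristic: ~4 characters per token for English prose
    return max(1, len(text) // 4)

def chunk_by_budget(docs: list[str], budget: int = 15_000) -> list[list[str]]:
    """Greedily pack whole newsletters into batches under a token budget."""
    batches, current, used = [], [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches

newsletters = ["word " * 5000, "word " * 5000, "word " * 2000]
print([len(b) for b in chunk_by_budget(newsletters, budget=10_000)])  # → [1, 2]
```

Each batch gets summarized separately, and the per-batch summaries are then summarized once more - a standard map-reduce pattern, at the cost of one extra model pass.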
I have tried approach two as well, but it is not working well at all. The challenge is that newsletter contents are varied, and the golden prompt is unable to capture that heterogeneity - so the output is a bland summary. The newsletters contain funding rounds, research papers, product launches, news, etc. I also tried an OpenAI-suggested recipe from their cookbook, but it wasn't good at all - actually quite bad. (I was told the recipe was meant for the turbo-models era!)
So I'm currently working on the tedious third approach of creating custom prompts for each newsletter - and man, that is hard. It involves multiple iterations with GPT (again, o1 is better than 4o) and a bunch of wrangling with Cursor (I want to reduce tokens by doing a bit of parsing of the HTML newsletters).
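The token-trimming pre-parse can start as simply as stripping tags and dropping script/style blocks with the stdlib `HTMLParser` - a sketch only; real newsletters also need boilerplate, footers, and tracking links removed:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = "<html><style>p{color:red}</style><body><p>Funding round: $10M</p></body></html>"
print(html_to_text(html))  # → Funding round: $10M
```

Since HTML markup often outweighs the visible text several times over in marketing emails, this pass alone can cut the token bill substantially before any prompt runs.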
If you have any suggestions feel free to comment.
Solving for Neurological Disorders | 3x Founder | 2x Exit | Love to Discuss about AI, Products, MedTech & Startups
2w - I faced the same problem. Tried doing what you did, but then you miss out on information that falls under the sponsored, tools, or recruitment sections of the newsletter. The solution is to subscribe to the top 5 newsletters only - I know it's daunting, but there is no other choice. It's an 80-20 game where you extract 80% of the info by subscribing to just 20% of the things. For papers there is a section on Hugging Face called papers of the day where you get the top 10 to 20 papers. Subscribe to 4 to 5 online creators. That's it. At the end of the day, you are a human.
Enterprise Solutions Architect and Product Manager | MBA @ ESADE | Ex-Healthtech Startup CTO | Ex-President, ESADE Data Analytics & AI Club | Cloud & Digital Transformation Expertise
3w - Project looks impressive! I can see you've built a robust system with custom prompts and database integration to track individual summaries. Have you experimented with different chunking strategies or multi-stage summarization approaches to further improve quality?
Technologist | Ex - Agriculturalist, Hunter Gatherer, Herbivores Primate
3w - Interesting, and a little overkill, but I can see how much time this would save by sending daily highlights and a weekly review. I have a small setup, but only for YT, and I use 3.5-turbo - it's the cheapest one available and quite good for summarisation tasks, especially if you nail your prompt. For context length I simply process in batches of 15K tokens and keep appending the output until the transcription is fully processed. It's at least better than watching a 3-hour-long video? Would love to hear your take on how automated processing of text and videos online is hurting the creator economy (assuming one no longer opens the webpage or watches the video, for which ads are the main source of revenue).
Building GoMarble | AI-Assisted Human-Led Paid Marketing
3w - For learning I use NotebookLM a lot, and I have recently started exploring this project: https://www.open-notebook.ai/ I would suggest forking this project and building integrations to add custom sources (RSS, etc.). You could also build email integrations to send you a summary. If your objective is to organise your learning sources, it comes built in with the architecture(s) to do so. Some sources would need approach 2, some would need approach 3 (from your diagram).
Working on something similar. A couple of thoughts: Summarization isn't a hard task for an LLM. Gemini has a 2M-token window, so it may not have as hard a time with context-window amnesia. Another way to do this in Claude: 1) prompt it to read a newsletter, strip unnecessary details, and create a distilled JSON/text artifact; 2) that artifact becomes the input to a separate prompt to read and "summarise". You're basically splitting the "read" and "summarise" tasks - fewer context-window hassles, and prompt (1) can be tailored to the newsletter format. Chain them in Make / n8n to automate. Or go hardcore: do (1) from above, but build a vector DB and actually run RAG over the information?
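The read/summarise split described in this comment could be sketched as follows; the LLM call is stubbed out, and both the artifact schema and `build_digest_prompt` are illustrative, not a fixed API:

```python
import json

# Stage 1 would produce one compact artifact per newsletter via a
# format-specific "distill" prompt (LLM call omitted here).
# Illustrative artifact schema:
#   {"source": ..., "date": ..., "items": [{"kind": ..., "headline": ..., "detail": ...}]}

def build_digest_prompt(artifacts: list[dict]) -> str:
    """Stage 2: feed the distilled artifacts, not raw HTML, to the summariser."""
    payload = json.dumps(artifacts, indent=2)
    return (
        "You are compiling a daily AI digest for a VC.\n"
        "Group the items below by kind (funding, papers, launches, news)\n"
        "and write one crisp bullet for each:\n\n" + payload
    )

artifacts = [
    {"source": "Example Letter", "date": "2025-01-10",
     "items": [{"kind": "funding", "headline": "Acme AI raises $20M",
                "detail": "Hypothetical Series A example"}]},
]
prompt = build_digest_prompt(artifacts)
print(len(prompt))
```

Because stage 2 only ever sees the distilled JSON, the per-newsletter heterogeneity is handled in stage 1, and the final prompt stays small and uniform.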