How I Sped Up Complex UI Data Loading in NodeJS by 20X

Building a Faster User Promotion UI on ManyStories.com for Increased Engagement — API, Redis Cache, MongoDB

The problem: fetching and showing a list of recent writers is too slow, at 1–3 seconds per request.
The solution: intelligently cache the list of recent writers once produced.
Impact: loading speed cut by 95%, from 2,000 milliseconds to 100 milliseconds.
How? Read the story.
Want similar results? Lincoln specializes in Full-Stack JavaScript: NodeJS servers, ReactJS frontend, and MongoDB database. Contact [email protected] for a free consultation on speeding up your mobile and web apps.
This will be a long story, but every engineer who’s trying to level up their web services skills should read it. Multiple options and tradeoffs are explored.

ManyStories has thousands of approved writers sharing their stories. The top writer has over 570 followers. To be in the top 100 writers by follower count, you must have over 90 followers. But reaching these figures isn’t easy, especially if nobody sees you. At ManyStories, we want all of our writers to be seen when they share their stories.

The Problem

That’s why it’s important that we present a carousel of the most recent sharers within the feed of stories on our homepage. As readers scroll through the feed of stories that our algorithms recommend for them, they might be interested in discovering new writers whom they aren’t yet following or who don’t appear in their feed most days.

This works very well. Readers can discover new writers and their many stories, and writers get more followers. Everyone wins. But… what if the carousel doesn’t show fast enough as the user scrolls?

Database Setup

Before we can speak about the (RESTful) API response, we have to discuss the database setup.

Fetching the list of users who have recently shared a story seems trivial, but it turns out to be quite involved, especially in a relational database setup. This can cause the response time of the API to take between 1 and 3 seconds for this endpoint. That’s not going to cut it.

Of course, the size of the data matters, and its weight is relative to the resources allocated to the database server. We’re keeping database costs low for now.

We use MongoDB as our (NoSQL) database, but I believe in the benefits of relational database systems, so all of our data isn’t simply tossed into a JSON object.

Each user has a User record in the database. Each story has a Story record; a story is essentially just a link to a published piece of content on the internet, wherever the writer chose to publish it. To connect a user to a story, we have a StoryUserRelation record.

When a user shares a story on ManyStories, they are considered a writer; the relation’s record gets an isSharer: true flag. If a user is simply someone who has interacted with that story as a (potential) reader, there’s still a record for that relation for keeping track of state like whether the user liked or bookmarked the story, but the isSharer flag is set to false.

For relations where isSharer is true, there’s also a date field called isSharerUpdatedAt so that we can keep track of (you guessed it) when the user shared the story.

Each of the tables has relevant indices for frequently queried fields (columns, in SQL speak).
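
For illustration, here’s a minimal sketch of what such a relation model might look like with Mongoose. The isSharer and isSharerUpdatedAt fields come straight from the description above; the other field names and index shapes are my assumptions, not ManyStories’ actual code.

```javascript
// Hypothetical sketch of the StoryUserRelation model (Mongoose).
const mongoose = require('mongoose');

const storyUserRelationSchema = new mongoose.Schema({
  userId: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
  storyId: { type: mongoose.Schema.Types.ObjectId, ref: 'Story', required: true },
  isSharer: { type: Boolean, default: false },
  isSharerUpdatedAt: { type: Date }, // when the user shared the story
  liked: { type: Boolean, default: false },      // illustrative reader-state fields
  bookmarked: { type: Boolean, default: false },
});

// Indices on frequently queried fields, per the setup described above.
storyUserRelationSchema.index({ userId: 1, storyId: 1 }, { unique: true });
storyUserRelationSchema.index({ isSharer: 1, isSharerUpdatedAt: -1 });

module.exports = mongoose.model('StoryUserRelation', storyUserRelationSchema);
```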

So, we have our database setup. It’s pretty straightforward. So where’s the bottleneck in the API response? The slowness stems from having to join the various tables. More on this soon.

User Experience of the Feed

When a user visits ManyStories.com, they are presented with the following feed:

  1. Featured stories grid: a long-tail selection of top stories, old and new
  2. Members’ Plot stories grid: stories from paying members of the platform
  3. Recent stories list: stories shared by all writers
  4. Carousel of recent sharers: takes 1–3 seconds to load
  5. Recommended stories: based on reading history
  6. More recent & recommended stories

Most visitors won’t make it to position 4 of the feed because they are likely only passive users of the platform. They may be discovering it for the first time, or they may have simply landed there because a writer or reader they respect referred them. They might not even make it to position 3 of the feed. In these cases, we don’t want to load data from our API if the user won’t ever scroll to see it presented in the UI.

That’s why our feed only loads data just-in-time. As the user nears the bottom of the feed, we fetch more data to fill the feed. In other words, if they are scrolling towards the end of position 2, we load position 3. If they make it to the middle of position 3, we load position 4, and so on.
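
On the frontend, just-in-time loading like this is commonly implemented with an IntersectionObserver watching a sentinel element near the end of the currently rendered feed. Here’s a minimal sketch, where loadNextFeedPosition() stands in for a hypothetical fetcher (not ManyStories’ actual code):

```javascript
// Minimal sketch of just-in-time feed loading with an IntersectionObserver.
const sentinel = document.querySelector('#feed-bottom-sentinel');

const observer = new IntersectionObserver(async ([entry]) => {
  if (entry.isIntersecting) {
    // Fetch the next feed position while the user is still reading this one.
    await loadNextFeedPosition();
  }
}, { rootMargin: '600px' }); // begin loading ~600px before the sentinel appears

observer.observe(sentinel);
```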

This works well if the user is taking their time to engage with each position of the feed as they scroll. The delay in their scrolling gives our API server time to process each request. Ideally, each position of the feed takes no more than a couple hundred milliseconds to process, but that’s not always the case, especially on slow networks.

Additionally, what if the user is a fast scroller? They just want to get through the feed to see what it’s all about. Well, that can pose a pretty big problem for our user experience (UX) due to the latency of our API.

Slow API Response

Loading position 4 of the feed, which is the carousel of writers who have recently shared a story on ManyStories, used to be very slow. It took an average of 2 seconds to load, even on a fast network. That’s awful.

As the new year approaches, I’ve taken some time to explore performance improvements to ManyStories in order to deliver the wonderful stories shared by our writers to readers around the world.

Let’s look at the problem again. From the Chrome Dev Tools’ Network tab, we can see how much time our server takes to respond to the request for the recent sharers:

Clearly, this is slow: 2.61 seconds to load a list of ~10 items. Most users will scroll right past this. Before we cut that down to just 0.100 seconds, let’s discuss the first question you should ask as an engineer: why is this API endpoint so slow?

First, what are we requesting here? A list of users, their number of shared stories, and whether each user is followed by the current user. Additionally, we’re requesting that the server only return users who are considered writers, sorted by their most recent shared story date. This isn’t a trivial request. It’s going to take time, even for a well-resourced server and database system. I wrote the old algorithm for this endpoint many years ago, when I was a more junior software engineer and ManyStories had significantly fewer users. Here’s how I did it (a rough sketch in code follows the list):

  1. Get the distinct list of records from the StoryUserRelation table, grouped by userId, sorted by isSharerUpdatedAt descending, and limited to the latest 200.
  2. Using the id fields from the list of StoryUserRelation from step 1, paginate based on an offset and a limit provided by the API request. The limit is usually 10. The result should be a list of minimal records that include the userId field, which should be distinct across the set.
  3. For each userId, load the User record.
  4. For each user record, 1) count the number of stories they’ve shared, and 2) get their UserUserRelation record with respect to the currently logged in API user so that we can know whether they are already followed. We also loaded other data for each user, but that’s not relevant here.
  5. Return the list of shaped and sanitized user profiles as the API response.
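
In code, the old approach looked roughly like this. This is a reconstruction under assumed Mongoose-style models, not the original source; notice the per-user queries in steps 3 and 4, which multiply round trips to the database.

```javascript
// Rough sketch of the old, slow approach (assumed Mongoose-style models).
async function getRecentSharersOld({ offset = 0, limit = 10, currentUserId }) {
  // Steps 1–2: the latest 200 sharer relations, deduplicated by userId, then paginated.
  const relations = await StoryUserRelation.find({ isSharer: true })
    .sort({ isSharerUpdatedAt: -1 })
    .limit(200)
    .select('userId')
    .lean();
  const userIds = [...new Set(relations.map((r) => String(r.userId)))]
    .slice(offset, offset + limit);

  // Steps 3–4: a round of queries per user, the part that made this so slow.
  return Promise.all(userIds.map(async (userId) => {
    const user = await User.findById(userId).lean();
    const numSharedStories = await StoryUserRelation.countDocuments({ userId, isSharer: true });
    const follow = await UserUserRelation.findOne({
      followerId: currentUserId, followedId: userId, // assumed field names
    }).lean();
    return { ...user, numSharedStories, isFollowed: Boolean(follow) };
  }));
}
```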

This worked, but it took between 1000 and 3000 milliseconds. That may not seem like a lot of time, but it’s too long in internet land. Our goal should be for each of our endpoints to take at most a few hundred milliseconds, if not less than 100ms.

To get around this slowness in the short-term, I decided to prefetch the list of recent writers as soon as the homepage loads. As we discussed earlier, this isn’t great because it puts undue stress on the server for resources most users won’t see. In other words, we’re spending up to 3 seconds of our server’s time and resources to load data that might not be used. Let’s not do that.

Finding the Solution

Our goal is to take the response time of this endpoint from an average of 2000 milliseconds (2 seconds) down to 100ms.

In order to accomplish this, we need to understand the context of the problem we’re trying to solve. Once we understand the context, we need to decide which tradeoffs we’re willing to make, the risk of each tradeoff, and how we’ll mitigate each risk. As an engineer, this is most of your job. The bulk of the rest is naming things.

The Context of the Problem

We want to show a list of the most recent story sharers.

Well, when does someone become a sharer? When they share a story by visiting ManyStories.com/share. We know that, on a good day, some tens of our users share a story or two. There are times when a single user might decide to share a large chunk of their archive on ManyStories; this is usually after we approve a user from the waitlist, which is still active.

Ok, so we know when a user becomes a sharer. Do they ever stop being a sharer? Most of the time, no. The only way they can stop being a sharer is if they unshare all of their stories or delete their account. This is an infrequent event.

A user can share multiple stories, right? Yep. So we have to remember to account for their most recent shared story while ignoring the earlier ones.

It’s perfectly fine for all visitors to see the same list of recent writers.

Is our database well resourced, given our monetary resources? Yep. The database is running smoothly.

Do we have the necessary indices set up on the tables? Yep. There’s not much more we can do here that will be worth the squeeze.

Ok, that’s enough context for now. Here’s a preview of the end result that we’re working towards. See the 92.46ms time to the left, next to the green bar labeled “Waiting for server response”. That’s how long it will take our server to respond to requests for the list of recent sharers. That’s what success looks like: far better than the previous time of 2.61 seconds.

The Options for Solving the Problem

We have many options, but we have to consider the time commitment and resources at our disposal. Both of those are low, so we have to optimize.

Our first option is to denormalize the database. Database normalization reduces redundancy and increases data integrity, but it comes at the cost of added latency for complex queries. We could reduce our normalization and latency in this specific query by simply creating another table that stores a relational record of each user’s ID and the last time they shared a story. We could even go as far as storing the presentational data we’ll need for the user in our UI, such as their name, profile image, and the number of stories they’ve shared.
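
As a sketch, a record in that hypothetical table might look like this (every field name here is illustrative):

```javascript
// Illustrative shape of a denormalized RecentSharer record (option one).
// Every field duplicates data that already lives on User or StoryUserRelation.
const recentSharerRecord = {
  userId: '5f2b...',                              // reference to the User
  lastSharedAt: new Date('2024-12-30T10:15:00Z'),
  name: 'Jane Writer',              // duplicated from User; must be kept in sync
  profileImageUrl: 'https://example.com/jane.png',
  numSharedStories: 42,             // duplicated count; must update on share/unshare
};
```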

This could work, but it has two potentially costly problems. The first is that we’d be duplicating data, which introduces fragmentation and thus potential inconsistency. If a user’s name or image changes, we’ll need to remember to keep this new table in sync by updating the corresponding record’s fields. The same goes for each time a user shares or unshares a story. If we don’t keep this in sync, we could display stale information about a user in our UI, which isn’t great.

Secondly, the redundant data costs money to store in our database. If we have many writers, our increased costs will be nontrivial. Surely, this isn’t the only time we might want to denormalize our database to solve a similar problem. Multiply these complications across all the possible problems of this kind, and it quickly becomes unwieldy.

Our second option is to let the browser cache our API response. Before we throw this straight in the trash, let me tell you why. When you let browsers cache your API responses, you relinquish control over your system. If your data changes, it’s very difficult to invalidate a browser cache since you have no control over the browser, an external client. As a result, your users may receive stale data from the browser’s cache the next time they visit your page. Cache invalidation is already tough enough. Let’s not make it harder by limiting ourselves.
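
For context, letting the browser cache the response would be as simple as sending a Cache-Control header. Here’s a sketch with Express (the route path and fetcher are made up):

```javascript
// Sketch: letting the browser cache this response for 5 minutes (Express).
// Once the response leaves our server, we can't reliably invalidate it,
// which is exactly the loss of control described above.
app.get('/api/recent-sharers', async (req, res) => {
  const recentSharers = await getRecentSharers(); // hypothetical fetcher
  res.set('Cache-Control', 'public, max-age=300');
  res.json(recentSharers);
});
```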

Our third option is to cache the response at the data layer. This is a good option, but I already had it implemented. It’s still pretty slow; it’s what brought our average time down to 2000ms instead of the upper bound of 3000ms. This is because of the normalized database setup we discussed in our first option. Even if we cache each table’s query, we still have to join the 3+ tables at the service layer. This is a slow process.

Our fourth option is to use a database aggregation pipeline with lookups to join the tables in the database system. This is faster because the processing happens nearer to the data storage devices, and the database’s query system is optimized for such workflows. This would reduce the time it takes to join the tables and minimize the data that is transferred from the database into our API server, which reduces load on our server and reduces latency. This would probably cut our average response time down by at least 50%, but that’s not enough when we’re starting at 2 seconds.

Our fifth option is to cache the processed data at the service layer. If we use a sub-millisecond cache like DragonflyDB, a drop-in alternative to Redis, we could cache the result of all those expensive database joins so that we don’t have to rerun that time-consuming process each time a visitor comes. That would lead to a nearly instant response time, setting aside other latency from API overhead, such as session authorization.

There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton

The main potential downside here is the difficulty of cache invalidation. Of course, we’ll have to pay for storing this data in the cache, too, but the size of this data is minimal. Additionally, as with any cache approach, similar to our first option, we have to deal with the potential of presenting stale data to users.

The Chosen Solution

As you may have already guessed, the solution consists of two parts: the 4th and 5th options from our list. That’s because there are actually two problems to be solved in order to maximize the resources of our primary database:

  1. The first problem is that how we read from the database is suboptimal.
  2. The second problem is that we’re reading from the database too often for this nontrivial feature.

Solution option #4 (database aggregation pipeline) solves for the first problem. Solution option #5 (service-layer cache) solves for the second problem. I’ve included an image of the solution below.

Let’s step through it.

Most of the work is encapsulated in the pageAggregation() utility function I built because I got tired of fussing with pagination and complex MongoDB aggregation pipeline intricacies. The details aren’t too important, but I will cover the high-level of what it’s doing below.

Solving Problem #1

If you recall from the section that outlines the goal of the API request and the algorithm by which we process it, we’re taking five steps to produce the list of recent story sharers. Furthermore, some of those steps have multiple parts. We’re also doing a good amount of data processing in memory on our own server, away from the database engine, which isn’t ideal.

We can improve upon this by instead using a database aggregation pipeline with lookups to join the tables in the database system. MongoDB makes this easy with its aggregation pipeline stages. I was already using an aggregation at some point in this process, but it can all actually be done fully in the database pipeline, sparing our local memory and CPU.

The following is the aggregation pipeline I refactored to, which is what’s being handled by the aforementioned pageAggregation() utility function (a code sketch follows the list):

  1. $match: Query our StoryUserRelation table for all records where {isSharer: true}.
  2. $sort: sort the relations in descending order by isSharerUpdatedAt. This is an unfortunate necessity in order to accurately acquire the latest isSharerUpdatedAt value for each user in the next step. But this field is indexed in the database, so it’ll be quick.
  3. $group: group the relations by userId. Also, while we’re grouping them, attach to each group the latest isSharerUpdatedAt field and the count of the group’s records as a field called numSharedStories. This will return a list consisting of objects that include only the three fields mentioned in this step. Essentially, we’re down to objects that represent the unique users who have shared a story on ManyStories.
  4. $sort: sort the relation groups in descending order by isSharerUpdatedAt.
  5. $limit: take the first 15 records in the sorted list of grouped relations. This will represent the most recent 15 users who have shared a story.
  6. $lookup: from the 15 grouped relations, get all the matching users by userId from the User table of the database.
  7. Replace the 15 grouped relations that represented the users with the actual User objects, but keep and attach the numSharedStories field and the isSharerUpdatedAt field as lastSharedAt from each group onto its corresponding user.
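
Put together, a sketch of that pipeline might look like the following. The real pageAggregation() utility wraps this with pagination handling, and the collection name and output field names here are assumptions:

```javascript
// Sketch of the aggregation pipeline described above (MongoDB via Mongoose).
const recentSharers = await StoryUserRelation.aggregate([
  // 1. Only sharer relations.
  { $match: { isSharer: true } },
  // 2. Newest shares first, so $first below picks each user's latest share date.
  { $sort: { isSharerUpdatedAt: -1 } },
  // 3. One document per user, with their latest share date and share count.
  { $group: {
      _id: '$userId',
      isSharerUpdatedAt: { $first: '$isSharerUpdatedAt' },
      numSharedStories: { $sum: 1 },
  } },
  // 4. Order the unique users by most recent share.
  { $sort: { isSharerUpdatedAt: -1 } },
  // 5. Keep the 15 most recent sharers.
  { $limit: 15 },
  // 6. Join each group to its User record.
  { $lookup: { from: 'users', localField: '_id', foreignField: '_id', as: 'user' } },
  { $unwind: '$user' },
  // 7. Promote the User document, keeping the two computed fields.
  { $replaceRoot: { newRoot: { $mergeObjects: ['$user', {
      numSharedStories: '$numSharedStories',
      lastSharedAt: '$isSharerUpdatedAt',
  }] } } },
]);
```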

This looks longer, but it actually condenses the previous approach quite a bit. Most importantly, it’s more efficient, takes advantage of database indexes, and it’s fully accomplished within the database engine, not our API server, saving memory and CPU.

But the bulk of the time savings remains to be achieved.

Solving Problem #2

Now that we have the list of users who most recently shared stories on ManyStories, we can send that as the response to the API request. Our job is done! Well, not so quick… or slow. We need to plan for the future. That’s a lot of work we just did to aggregate that data. We taxed our database quite a bit. What if we receive another visitor who wants to see the list? How about thousands of visitors each day? We wouldn’t want to do all that work again and again, right? That’s right. We can do better and be faster at the same time.

Cache to the rescue!

After we do all that work to aggregate the data, we should store the result in a cache for the next visitor to use instead of unnecessarily taxing our database again. That’s what we’re doing at the end of the solution code:
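
The original code isn’t reproduced here, but the idea is a one-line write to the cache once the aggregation finishes. A sketch using an ioredis-style client (the key name and serialization are assumptions; the 6-hour TTL is discussed below):

```javascript
// Sketch: store the aggregated list for the next visitor (Redis-compatible cache).
const SIX_HOURS_IN_SECONDS = 6 * 60 * 60;

await cache.set(
  'recentSharers',                 // assumed cache key
  JSON.stringify(recentSharers),   // the result of the aggregation pipeline
  'EX',
  SIX_HOURS_IN_SECONDS,            // time to live, covered in the next section
);
```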

And at the start, when another request comes in, we check whether we can use the cache. If we can, we attempt to fetch the previously aggregated result from the cache and return it. If it’s not there, we aggregate the result from the database again. Here’s how that looks:
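
Again as a sketch rather than the actual source, the cache-first read path might look like this. getRecentSharers() is the function name used later in this story; the skipCache parameter is an assumed shape:

```javascript
// Sketch of the cache-first read path.
async function getRecentSharers({ skipCache = false } = {}) {
  if (!skipCache) {
    const cached = await cache.get('recentSharers');
    if (cached) return JSON.parse(cached); // cache hit: no database work at all
  }
  // Cache miss (or forced refresh): run the aggregation pipeline from above...
  const recentSharers = await aggregateRecentSharersFromDatabase();
  // ...then store the fresh result for subsequent visitors.
  await cache.set('recentSharers', JSON.stringify(recentSharers), 'EX', 6 * 60 * 60);
  return recentSharers;
}
```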

This is great. Because we use a Redis-compatible cache, which uses RAM to store our data, it’s super fast to add to and read our result from the cache. We’ve officially cut our loading time of the list of recent story sharers from 2,000 milliseconds to just 100 milliseconds on average.

That’s wonderful!

But we’ve created another problem. With great cache comes great responsibility to invalidate that cache.

When we put something in a cache, we need to be cognizant of the risks and costs.

The first risk is that we could present stale data to our users or to the rest of our system. Fortunately, the recent sharers list isn’t something that has a strict timeline or a natural expectation of freshness from users, beyond not being stale forever. So long as our list is updated in the UI at a reasonable pace, not too long after a writer shares a story, we can get away with some level of delay.

With that in mind, we set the time to live (TTL) on this cache entry to a relatively high value: 6 hours. That means that, so long as we get at least one visitor within six hours and the cache store doesn’t overflow, our list should always be available for new visitors to quickly retrieve from the cache.

Additionally, we don’t care to remove the record from the cache at any point because having a list readily available is better than not, even if it’s a little stale.

So if we don’t ever remove the record from cache, won’t it always be stale? Not quite. First, let’s revisit this diagram of our system:

Since our cache record is virtually eternal, we need to update it somehow. Otherwise, we’ll always get a cache hit and never reproduce the list from the source database. That’s not good because we don’t want to show our users the same list of writers forever.

To get around this, we need to reproduce the list from the database whenever a writer shares a story. We could do this during the API request that handles sharing the story for the writer, but that has a big downside: increased server load.

Given that the database aggregation for the list of recent story sharers is heavy and slow, running it on our main web API server would meaningfully degrade its performance.

A better approach is to emit a PubSub queue event for our background worker servers to handle. Those are resourced to handle heavy workloads of this sort. When the background worker takes the task off the queue (I use AWS SQS with retries on failure), it calls the same function from earlier (getRecentSharers()). The only thing we need to do is tell it to skip the cache check. That will force it to produce a new list from the source database and replace the old list with a fresh list of writers for our readers to discover.
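
Here’s a sketch of both halves, using AWS SDK v3 for SQS (the queue URL, message shape, and function names are assumptions):

```javascript
// Sketch: the share endpoint enqueues a refresh job instead of rebuilding inline.
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');
const sqs = new SQSClient({});

async function onStoryShared() {
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.RECENT_SHARERS_QUEUE_URL,
    MessageBody: JSON.stringify({ task: 'refreshRecentSharers' }),
  }));
}

// On the background worker: rebuild the list, bypassing the cache check,
// which re-aggregates from the database and overwrites the cached list.
async function handleRefreshRecentSharers() {
  await getRecentSharers({ skipCache: true });
}
```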

To further improve performance, we could even have the aggregation pipeline read against the secondary database server in our replica set, reducing load on our primary database instance.
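
With Mongoose, that’s a one-line read preference on the aggregation (a sketch; pipeline is the array from earlier):

```javascript
// Sketch: run the aggregation against a secondary replica to spare the primary.
const recentSharers = await StoryUserRelation.aggregate(pipeline)
  .read('secondaryPreferred'); // falls back to the primary if no secondary is available
```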

Just like that, we’ve closed the circle.

And because the data is small, it doesn’t cost much to store in our cache store.

Conclusion

Gosh, that was long! I’m glad you made it this far. Now you know how to make your API response times faster, leverage database aggregations for producing complex data, cache data, retrieve cached data, and update a cache. All while reducing load on your servers.

Consult Me

If you have a startup that is suffering from slowness in retrieving data or expensive database costs, feel free to reach out to me for consultation on drastically improving your mobile or web app’s performance with lower costs. Simply email [email protected]


Lincoln W Daniel

Fullstack Software Engineer | Past Future Tech Lead, ex-Medium, ex-GoPuff Engineer | Author JavaForHumans.com
