Building a City-Scale Digital Twin as 'Two Friends, a Drone, and a Laptop'

In 2019 I watched a SIGGRAPH presentation by Ryan Mayeda, a Product Manager at Epic Games, showcasing Unreal Engine's ability to combine camera position and perspective tracking with real-time game engine rendering to composite a synchronized, perspective-correct background onto a large LED display, all in camera. The short showcase piece, featuring now-household ICVFX names like Matt Workman, Philip Galler, and AJ Sciutto, pitched a concept that could, in many instances, eliminate the need for location production, set builds, or expansive set extension effects. This would be done with an in-camera composite of a 3D environment that parallaxes in real time.

Unbeknownst to most of the world at that point, Industrial Light & Magic had been helping to develop and implement this "In-Camera Visual Effects" method as a system called StageCraft for its forthcoming Disney+ show, "The Mandalorian." Upon its release, the doors broke open and a flood of Virtual-Production-this and XR/VR/AR-Stage-that started popping up everywhere. There was, and still is, a massive amount of hype and confusion surrounding this arena of filmmaking, but that's an article for a different day.

This article is about an epiphany and a resulting experiment.

Now, why on earth would I be talking about In-Camera Visual Effects in an article with "Digital Twin" in the headline? It has to do with that "eliminating the need for location production" portion I just mentioned. My years shooting films, shows, and commercials, and subsequently teaching others to shoot films, shows, and commercials, have taught me that location shooting is always a gigantic pain for two key reasons that go hand in hand:

  • Lack of Control of the Environment
  • Limited Amount of Time

We love shooting on sound stages because we can control just about everything within the set: lighting rigs, camera placement, art direction. All of the production resources we need are typically nearby and logistically accessible. But locations are real and therefore give us an element of authenticity. So, like with so many other elements, we compromise. Location shoots also carry a perpetual sense of rushing. Time is always against the crew, especially if you're locking down a well-known location or targeting a Golden Hour shot series. And we all know the time-suck of a company move.

The ICVFX productions we've watched thus far also tend to operate in completely fictitious environments, e.g., The Mandalorian and Star Trek: Discovery. But why aren't they being used to replace actual locations? This comes down to another balancing act that Virtual Art Department (VAD) teams perform. Creating fictional worlds feels easy when you compare it to recreating what audiences already know to be real. Imagine having to recreate your hometown as a 3D environment level so that every conceivable camera angle has the same authentic look as if you were shooting on location. Tough, isn't it? Procedural environments have been used as a bridge while we figure out this quandary.

So how can ICVFX be used to run a "traditional" location production, but virtually, on a Simulcam or LED Volume?

This is what I was hoping to solve.

--> How do you capture reality?

The answer relies on another aspect of "metaverse" content creation: reality capture. No, I'm not talking about Capturing Reality's amazing software... yet. The term comes from the remote-sensing world and refers to recreating the physical world in digital form. It uses one or both of two technologies -- LiDAR (light detection and ranging) and photogrammetry -- to build a Digital Twin of a real-life object, texture, or, in our intended case, location.

Digital Twinning has been around for a number of years, but mostly in high-end surveying, mapping, architecture/engineering, and industrial applications. What I wanted to do was bring this digital recreation capability into the filmmaking realm, giving filmmakers who want to shoot specific locations more control and more time.

My partner, Jeremy Brown, and I spent some time looking into how to execute this. We knew up front that small, bespoke captures of Street X, Building Y, or Parking Lot Z wouldn't be sufficient. Smaller productions typically can't afford a scanning team as an add-on to their budget or schedule; they'd sooner go with an easier HDRI capture or pre-purchased driving plates. Larger productions can afford a scanning team, but the ability to capture "unavailable" filming locations may still interest them. For this to work for a broad range of productions, we would need to determine whether a pre-captured area at scale could play in our favor.

We opted to test a solid mix of location types in a 3.5 square kilometer urban/historic/suburban zone just northeast of downtown Tampa, in a neighborhood known as Ybor City, Florida. [side note: there's absolutely AMAZING Cuban food here] This assuredly wouldn't be a small capture, but its size meant we would also capture a host of different "biomes," for lack of a better term.

The first major hurdle was flight planning. This would ultimately tell us whether or not we could actually pull off this crazy idea. We scoped out our line-of-sight (LOS) options and checked FAA regulations as well as state and local laws. We chose a flight elevation that would achieve a reasonable ground sample distance (GSD) and many, many launch points. We decided to err on the side of caution for some of these decisions because we were leaping into a massive area capture and didn't want to run afoul of Part 107 rules and get our commercial drone license suspended. This was not like a commercial real estate or utility scan; this would be over people's homes and functional roadways. We neither wanted to create a safety concern nor the perception of one while we were in the air. Once everything was planned, checked, and FAA-approved, we scheduled our days and went for it.
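If you're curious how flight elevation translates into GSD, the standard nadir formula is simple enough to script. Below is a minimal Python sketch; the sensor width, focal length, and image width are what I understand the Phantom 4 RTK's published specs to be, so treat them as assumptions and swap in your own camera's values.

```python
# Approximate nadir ground sample distance (GSD) for a mapping flight.
# Camera values are assumed Phantom 4 RTK specs (1" sensor, 8.8 mm lens,
# 5472 px image width); substitute your own platform's numbers.

SENSOR_WIDTH_MM = 13.2
FOCAL_LENGTH_MM = 8.8
IMAGE_WIDTH_PX = 5472

def gsd_cm_per_px(altitude_m: float) -> float:
    """Ground sample distance in cm/pixel for a nadir shot at a given altitude."""
    return (SENSOR_WIDTH_MM * altitude_m * 100.0) / (FOCAL_LENGTH_MM * IMAGE_WIDTH_PX)

if __name__ == "__main__":
    for alt in (60, 90, 120):  # meters AGL; 120 m is roughly the Part 107 400 ft ceiling
        print(f"{alt:>4} m AGL -> ~{gsd_cm_per_px(alt):.2f} cm/px")
```

Lower flight elevations buy a finer GSD at the cost of more flight lines, more images, and more processing time, which is exactly the trade-off we were weighing.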

We captured the area with a DJI Phantom 4 RTK (RTK enabled), paired with a ground station for more accurate GPS geotagging. Depending on the location, the RTK signal on the P4RTK can achieve positional accuracy to within ~2 cm, which would help with the image alignment process later.

Now, due to Florida's weather and a few other scheduling conflicts (projects where we actually get paid), we were not able to fly on consecutive days. In total, we were on location for 12 days spread across a three-week span. This is atypical for standard scanning projects; had this been a contracted project, it would have taken roughly a third of the time.

--> Fix it in Post... Processing

Once all 12,118 images were organized, cleaned, and ingested into Reality Capture, the processing fun began. We chose Reality Capture for a few reasons: 1) it's the fastest photogrammetry software we could find, 2) it's built for 3D asset creation for DCCs and game engine applications, and 3) its interoperability with Unreal Engine is second to none -- it also helps that it's now owned by Epic Games.
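As part of that organizing and cleaning pass, it's worth confirming that every frame actually carries a geotag before anything hits the photogrammetry software. Here is a rough sketch of that kind of pre-ingest check; it assumes the third-party exifread package and a folder of .JPG files, neither of which is specific to our pipeline, and the folder path is whatever you pass on the command line.

```python
# Pre-ingest sanity check: count images and flag any missing GPS EXIF tags.
# Uses the third-party "exifread" package (pip install exifread) -- an
# assumption; any EXIF library that exposes the GPS tags will do.
import sys
from pathlib import Path

import exifread

def missing_geotags(folder: str) -> list[Path]:
    """Return the JPEGs in `folder` that carry no GPS latitude tag."""
    bad = []
    images = sorted(Path(folder).rglob("*.JPG"))
    for img in images:
        with open(img, "rb") as f:
            tags = exifread.process_file(f, details=False)
        if "GPS GPSLatitude" not in tags:
            bad.append(img)
    print(f"{len(images)} images scanned, {len(bad)} missing geotags")
    return bad

if __name__ == "__main__":
    for path in missing_geotags(sys.argv[1]):
        print("no geotag:", path)
```

Catching a missing or corrupted geotag here is far cheaper than discovering it as a misaligned camera hours into an alignment run.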

We had a number of processing hiccups attempting to run a dataset of this scale. The main culprits were a combination of our limited experience with mega-scale area captures and the "workstation" we were running on. The former was partly why we were doing this proof of concept in the first place; the latter was about gauging exactly how accessible this type of project really is. If we could work out the process bugs and do the R&D on a limited set of tools, the restrictions on what we wanted to accomplish would melt away.

We used a 2021-era ASUS ROG laptop with a 3.3 GHz Ryzen 9 and an RTX 3070. This was not the decked-out multi-CPU, multi-GPU, 128+ GB RAM workstation that costs tens of thousands of dollars. We needed something practical, off the shelf, and, compared to the usual workstation, cheap.

After loading the full 12,000+ images into Reality Capture, we attempted to align them directly. [insert belly laugh here] This was rife with issues. Alignment takes a considerable amount of time with more than a thousand images, and we had twelve times that. This is where the GPS geotagging really helps out: by roughly placing each image on the map, the geotags let the alignment process focus on working out the fine arrangement. Once alignment is done, you can view a sparse point cloud of the captured area. While the alignment process would complete, we discovered a number of misalignments and errant cloud points floating well off the altitude band where the ground and buildings clearly sat.
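If you export that sparse cloud (for example, as a plain XYZ text file), you can quantify those stray points rather than eyeballing them. The snippet below is a sketch of one way to do it, assuming a whitespace-delimited XYZ export; it flags points whose altitude is a robust outlier relative to the dominant ground-and-rooftop band. The file name and threshold are illustrative.

```python
# Flag stray sparse-cloud points that sit far off the dominant altitude band.
# Assumes the cloud was exported as whitespace-delimited XYZ text (x y z per
# line); the 3.5 cutoff on the robust z-score is a judgment call.
import numpy as np

def flag_altitude_outliers(xyz_path: str, threshold: float = 3.5) -> np.ndarray:
    """Return a boolean mask of points whose altitude is a robust outlier."""
    pts = np.loadtxt(xyz_path, usecols=(0, 1, 2))
    z = pts[:, 2]
    median = np.median(z)
    mad = np.median(np.abs(z - median)) or 1e-9   # median absolute deviation
    robust_z = 0.6745 * (z - median) / mad
    return np.abs(robust_z) > threshold

if __name__ == "__main__":
    mask = flag_altitude_outliers("sparse_cloud.xyz")
    print(f"{mask.sum()} of {mask.size} points flagged as altitude outliers")
```

A spike in that flagged count is a good early warning that a component has aligned off-plane, well before you burn a day meshing it.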

Once we modified our data processing methods to incorporate the component workflow and control points that Reality Capture has published tutorials on, those errors largely resolved themselves. While this process does add steps compared to many simpler asset scans, it gave us a better feel for what we had actually captured and let us analyze what we would do differently next time.

Once we completed the multi-component alignment, we moved on to building the mesh. Since we were still running the base configuration on our work-top / lap-station, this also took an inordinate amount of time. We attempted to segment the workload, but the main concern was whether the individual segments would realign once taken out of Reality Capture and into Unreal Engine. We knew it could be done, but would it require yet another set of compromises to be deemed successful?

Early attempts were not promising, but because of how long the smaller segments took to process, it was hard to tell with any assurance what our fixes were actually changing. It turns out we were kneecapping ourselves on multiple fronts.

--> UPGRADES!

Bypassing the component workflow for smaller alignment runs: FAIL. Attempting to do all of this on a baseline 16 GB of RAM: HUGE FAIL. Running the RCTempCache off an external hard drive over USB 3: M-M-M-MONSTER FAIL. The inherent technical limitations should have been red flags, but we were too geeked out and excited about pulling this off with a minimalist setup. Sometimes you just need more.

Ultimately, we opted to upgrade the laptop to get more performance out of it. To date, we have maxed out the RAM the laptop allows, added a 2 TB Samsung 970 EVO Plus NVMe drive to hold all of the working data, and even opened up the overclock settings the BIOS allows. The performance results were actually quite surprising.

The first test we ran on the new system configuration was a 0.83 km2 extraction of the Historic District of Ybor City, with the boundaries set by latitude and longitude measurements (a quick way to sanity-check the area of such a box is sketched after the results below). We built a Normal Detail (2x downsample) mesh and a High Detail (no downsample) mesh of this extraction, followed by 8K texture mapping of the entire area. These were the results:

Ybor District Normal Detail:

  • 305.9M Tris; 5.7 Hours
  • 49 8K Textures; 4.2 Hours
  • 9 Hours 54 Minutes Total

Ybor District High Detail:

  • 1.23B Tris; 24.1 Hours
  • 66 8K Textures; 7.5 Hours
  • 31 Hours 35 Minutes Total
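As promised, a quick aside on those latitude/longitude boundaries: at city scale, converting a lat/long box into square kilometers takes only a few lines with an equirectangular approximation, which is handy for sanity-checking extraction sizes like the 0.83 km2 figure above. The coordinates below are illustrative, not our actual boundary values.

```python
# Approximate ground area (km^2) of a latitude/longitude bounding box using an
# equirectangular approximation -- plenty accurate at city scale.
import math

EARTH_RADIUS_KM = 6371.0

def bbox_area_km2(lat_min: float, lat_max: float, lon_min: float, lon_max: float) -> float:
    """Area of a small lat/long box, treating the patch as locally flat."""
    mean_lat = math.radians((lat_min + lat_max) / 2.0)
    height_km = math.radians(lat_max - lat_min) * EARTH_RADIUS_KM
    width_km = math.radians(lon_max - lon_min) * EARTH_RADIUS_KM * math.cos(mean_lat)
    return height_km * width_km

if __name__ == "__main__":
    # Hypothetical box roughly the size of the Historic District extraction.
    print(f"{bbox_area_km2(27.9580, 27.9660, -82.4480, -82.4385):.2f} km^2")  # ~0.83 km^2
```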

These came out roughly 600% faster to mesh at Normal Detail and 620% faster to render the 8K textures than the same runs prior to the upgrades. Now let's take a look at the full 3.5 km2 area:

Ybor Full Area Normal Detail:

  • 1.30B Tris; 12.5 Hours
  • 209 8K Textures; 13.6 Hours
  • 26 Hours 6 Minutes Total

This was also a marked improvement: prior to the upgrades, the same run took 9 days and 18 hours to complete. A mere 9x speedup. It meant we could finally test a full-area mesh at High Detail -- something we didn't think would be reasonably possible before. It absolutely blew us away.
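For anyone running their own before-and-after comparisons, the math is just a ratio of durations. A small helper like this keeps the day/hour/minute conversions straight; the values plugged in are the full-area Normal Detail timings quoted above.

```python
# Convert before/after processing durations into a speedup ratio.
# The durations are the full-area Normal Detail numbers from this article:
# roughly 9 days 18 hours before the upgrades vs. 26 hours 6 minutes after.

def hours(days: int = 0, hrs: int = 0, minutes: int = 0) -> float:
    """Total duration expressed in hours."""
    return days * 24 + hrs + minutes / 60.0

def speedup(before_h: float, after_h: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return before_h / after_h

if __name__ == "__main__":
    before = hours(days=9, hrs=18)       # 234.0 hours
    after = hours(hrs=26, minutes=6)     # 26.1 hours
    print(f"{speedup(before, after):.1f}x faster")  # prints 9.0x faster
```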

What we have as a result is a 3.5 km2 high-detail model of Ybor City, FL. The model has a total of 5.3 billion triangles and 276 8K textures.

Below are some "Cut by Box" extractions showing the detail of the Reality Capture mesh and textures. I mention the Cut by Box tool because RC caps the number of triangles rendered on screen at 40,000,000 to keep from crashing. It's not so much Reality Capture's limit as a hardware limit that goes beyond upgrades, but RC gives you the warning anyway.

[Images: Cut by Box extractions of the mesh and textures]

--> Conclusions

For something of this scale, completed entirely from drone-captured images processed photogrammetrically, the detail is remarkable. The fidelity limitations of this capture mostly come from deliberate choices: a higher flight elevation, no terrestrial capture, and no LiDAR. We also could have spent more time cleaning the dataset to remove things like the ghosting of vehicles that moved during the flights. These choices were made for budgetary reasons; this was, after all, an experiment.

Over the coming weeks, I'll be documenting the adventures of this city-scale asset as a Nanite mesh in Unreal Engine 5 to test it out as a PreViz or blocking environment. This is the first project in a sequence of many that we are working on.

My hope is that anyone reading will return to follow along with progress and offer up any feedback on things we can do to improve this process moving forward. Mega-scale environment scanning will only get better the more that we capture, model, and use for virtual productions. We hope to provide as much information as we can as we go.

Thanks for reading!

Rob Sloan is currently a Course Director at Full Sail University in the Film Production, MFA program where he teaches the art and aesthetic of storytelling using digital cinema and workflow tools. He has been teaching at Full Sail University for over 10 years. Rob initiated the development of Full Sail's LED Volume, "VP1" and most recently has been developing a second Virtual Production studio specifically for the Film Production, MFA program.

Rob is available for consulting, speaking engagements, and contract work relating to virtual production projects, studios, environment scanning, world-building, or the media and entertainment industry's adoption of Virtual Production methods.

To get a more real-time stream of Rob's thoughts, experiments, and technical musings, please follow him on Twitter: @RobMakesMeta

Reference: ICVFX with UE4 Presentation by Ryan Mayeda of Epic Games

#digitaltwin #ICVFX #UnrealEngine #UE5 #virtualproduction #remotesensing #geospatial #drones #photogrammetry #virtualworlds #metaverse
