Unstack’d: Digging into Data Science at REA
On Thursday September 14, the REA Consumer Data Science team opened their doors to give the public a peek at the inner workings of how they problem solve, build, and deploy, all while striving to constantly improve ways of working and team culture.
I began working on this about two months before the event. I wanted to provide my team with an opportunity to share their story and inspire others to strive for better- better data science and machine learning outcomes, but also better working environments and better team performance. I wanted it to be a genuine and real look at how our team works, and for it to be a relatable picture of wins and growth areas. Thankfully, REA has Unstack’d! Unstack’d is a series of events where people at REA Group share their experiences with technology, the real estate industry, and working at REA. It was the perfect pairing!
People began filtering in to the venue at 511 Church St in Cremorne at about 5pm, and were greeted by the team, as well as a number of drink options to get the event started. There was also an opportunity for networking before we started moving into the main room for the first talk just before 6pm.
Mark Aitkin, our Data Science Lead, kicked things off by welcoming everyone, cracking a joke or two as he is known to do, running through our agenda, and introducing Alex Cummaudo and Pasan Karunaratne to talk about how our team finds gold in unstructured data.
Striking Gold in Unstructured Data
This talk was put together with input from most of the team, but with the bulk of the content and structuring done by Alex Cummaudo (Machine Learning Engineer at REA Group), Pasan Karunaratne (Machine Learning Engineering Lead at REA Group), and myself (Delivery Lead at REA Group). Many thanks to Alex and Pasan for providing me with images for this post. The talk was presented by Alex Cummaudo and Pasan Karunaratne on the evening, but I will aim to share the content of the presentation here in a format optimised for reading.
In early 2022, the Consumer Data Science team had a bright idea- a vision for a data-rich future in which consumer (end users of any REA Group website) and customer (real estate agents) experiences are augmented and improved. To expand on this vision, we must first understand the difference between structured and unstructured data.
Structured v Unstructured data
Structured data is any data that fits neatly into a predefined format or schema. This makes it easy to search, analyse, and process. A good indication that you’re handling structured data is that you’ll usually be able to store structured data in a spreadsheet. Examples of structured data about you include your name, your date of birth, your email address, whether you’re married or not, etc- I think you get the idea. Examples of structured data for us here at REA include the data an agent provides us with when they list the property- things like price, bedrooms, bathrooms, car spaces, etc. It also includes data about our users’ actions on the site, such as how many listings they’ve viewed, how many times they view any given listing, or how many enquiries they have made. All this data is great! What this data is missing, however, is the mountain of information we can obtain from other places, and which can enrich the user experience significantly. Enter unstructured data.
Unstructured data is data that, as its name suggests, has no set format or structure. This data is also significantly harder to draw insights from given its relative lack of uniformity. Unstructured data about you might include text messages or emails, images and videos taken of or by you, and social media posts, among many others. At REA, the unstructured data we handle consists primarily of images (property images and floorplans) and text (listing descriptions).
‘But you mentioned gold in the title’, I hear you say. ‘What gold exists in property images, floorplans, and listing descriptions?’ Let’s find out- I think you’ll find there’s more there than you think.
Take a look at the below image- every single one of these property data points (referred to as attributes moving forward) is one that we might not already have in our system, and one that has value.
Now, think about floorplans and listing descriptions. By the same token (but in different formats) these contain attributes we want to have the ability to surface in various ways but might not be able to without processing this unstructured data.
The Vision
Ok. So we have this amazing untapped resource, and we have so much of it. Back to our idea and the vision we developed. Now that you’re armed with the knowledge of what unstructured data is, do you have any ideas on how you might benefit from some of this data being extracted and processed? We did! Here are just a few ideas we had on how we could use this information to help enrich experiences:
- We could surface insights to real estate agents so that they were better able to differentiate their property from others on the market
- We could feed this data back into the PropTrack automated valuation model (AVM) to increase its accuracy
- We could provide a better on-site experience to users by highlighting properties with attributes we understand they are interested in
- We could create entirely new ways to search for property on realestate.com.au
- We could suggest more relevant properties to users based on their likes and dislikes
- We could dynamically create floorplans for properties that do not have one
- Many, many others, some known and some yet to be discovered!
We called this set of ideas Project GLAD- GLAD stands for Great Listing Attribute Detection. It rolls off the tongue easily and is a good way for us to build the vision’s brand.
To summarise, our vision is this: ‘Automate attribute detection in unstructured REA data and use the attributes detected to enrich user experiences.’
Delivering Value
As a data science team, we can see the value of all this ‘gold’. This vision really excited (and still excites) us! We know that we can use this technology in all these places to drive better user experiences. We’re great at thinking big and presenting a vision for the future. At the same time, we work within a business that has many, many competing priorities- I’m sure you can relate. Just like any large business, there are a number of organisational constraints that REA has to manage, including capacity, alignment, and strategic value, among others. When tough decisions on what we should work on need to be made, delivering items of highest value to the wider business within our existing capacity comes first.
Although our vision is great, and although there is business interest in the entire vision in the long term, the first use case that the business was really interested in centred around providing insights to agents. There was appetite for us to extract the attributes that agents care most about so that they could get insights on what makes their property unique- this way they would be well placed to speak to why potential buyers should consider their property over the many others they might be inspecting. This did mean that we had to put our other ideas on hold in the meantime. That didn’t mean we gave up on those ideas, or that our vision was lost. All it meant was that we pivoted to meet the business where it had a use case for this technology, with the knowledge that we would still aim to extend this technology to other use cases in the future. Keep the above list in mind though, as we will come back to some of these items toward the end of this section of the blog post.
Ok, so now we have a vision, and a use case that fits the vision. So how did we build it? How do we extract attributes from listing descriptions, floorplans, and images?
The Build
Text Extraction
For text extraction, we are using the simplest solution that provided the accuracy levels we needed- no need to overthink a solution when we can deliver the same value much more quickly and efficiently! After considering a number of different methods for processing this data, we landed on text matching, or for those with a technical bent, regex. What does this look like? Essentially, we check the listing description text against a pre-defined list of keywords looking for a match! Each match is recorded and forms part of our assessment of potential attributes to add to our GLAD dataset.
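To make the keyword matching idea concrete, here is a minimal sketch in Python. The attribute names, the keyword variants, and the helper names are all made up for illustration- this is not the actual GLAD keyword list or code, just one way the approach described above could look.

```python
import re

# Hypothetical attribute -> keyword variants (illustrative only, not the real GLAD list).
ATTRIBUTE_KEYWORDS = {
    "swimming pool": ["swimming pool", "pool"],
    "fireplace": ["fireplace", "open fire"],
    "solar panels": ["solar panels", "solar power"],
}


def compile_patterns(keywords):
    """Build one case-insensitive, whole-word regex per attribute."""
    patterns = {}
    for attribute, variants in keywords.items():
        alternation = "|".join(re.escape(variant) for variant in variants)
        patterns[attribute] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


PATTERNS = compile_patterns(ATTRIBUTE_KEYWORDS)


def extract_attributes(description):
    """Return the sorted list of attributes whose keywords appear in the text."""
    return sorted(
        attribute
        for attribute, pattern in PATTERNS.items()
        if pattern.search(description)
    )


description = "Sun-drenched family home with a sparkling POOL and cosy open fire."
found = extract_attributes(description)
print(found)
```

The word boundaries (`\b`) are a design choice in this sketch so that a word like 'carpool' does not count as a pool.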
Floorplans Extraction
Given floorplans are usually image files that contain diagrams and text, the obvious way for us to extract attributes from this type of unstructured data is to use Optical Character Recognition (OCR).
OCR may be something you’re already familiar with, even if you’re not technical. If you’ve ever used your camera to scan an Apple gift card code, or maybe to enable Google Translate to read and translate text in a foreign language, you’ve experienced OCR in action. Essentially, OCR is a technology that enables a computer to identify characters (and therefore words and numbers) inside an image.
There are many existing OCR solutions on the market, and so it made sense for us to grab something off the shelf to help us with this part of the project. After some exploratory work, we narrowed the options to one vendor-provided solution and one open source solution. Once we had these two, we put them through their paces to compare cost, accuracy, and speed. Our testing showed us that the accuracy and speed of the vendor-provided solution were streets ahead of the open source solution, and the cost of this solution was manageable, and so we had a clear winner.
So, drumroll…. We’re using Amazon Rekognition to power our OCR processing of floorplans. Rekognition pulls out every single character from every floorplan, however, and so our job doesn’t end there. Some of the recognised text might not be relevant to us or the property (e.g. it could be the agency name, the address, or any number of other things), and we need some way to find the attributes we’re looking for. We are using SQL to tie text elements extracted from floorplans back to attributes we have prioritised as valuable, and every match we find again forms part of our assessment of potential attributes to add to our GLAD dataset.
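As a rough illustration of tying OCR text back to attributes with SQL, the sketch below uses Python’s built-in sqlite3 module with an in-memory database. The table names, columns, labels, and sample rows are all assumptions made up for this example (the real schema and matching rules are not described in the talk), and the OCR output is hard-coded rather than fetched from Rekognition.

```python
import sqlite3

# Hypothetical OCR output: (floorplan_id, detected_text). Hard-coded for illustration.
ocr_rows = [
    ("fp-1", "ALFRESCO"),
    ("fp-1", "Smith & Co Real Estate"),  # agency name: not an attribute
    ("fp-1", "Walk-in Robe"),
    ("fp-2", "STUDY"),
    ("fp-2", "12 Example St"),  # address: not an attribute
]

# Hypothetical mapping from floorplan labels to prioritised attributes.
label_rows = [
    ("alfresco", "alfresco area"),
    ("walk-in robe", "walk-in wardrobe"),
    ("wir", "walk-in wardrobe"),
    ("study", "study"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ocr_text (floorplan_id TEXT, detected_text TEXT)")
conn.execute("CREATE TABLE attribute_labels (label TEXT, attribute TEXT)")
conn.executemany("INSERT INTO ocr_text VALUES (?, ?)", ocr_rows)
conn.executemany("INSERT INTO attribute_labels VALUES (?, ?)", label_rows)

# Keep only the detected text that matches a known label; everything else is dropped.
matches = conn.execute(
    """
    SELECT DISTINCT o.floorplan_id, a.attribute
    FROM ocr_text AS o
    JOIN attribute_labels AS a
      ON LOWER(TRIM(o.detected_text)) = a.label
    ORDER BY o.floorplan_id, a.attribute
    """
).fetchall()
conn.close()

print(matches)
```

The inner join does the filtering here: text such as the agency name or address has no matching label, so it simply never appears in the result.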
Image Extraction
For extracting attributes from images, we need to use a somewhat more sophisticated solution, given that it is more technically difficult for a computer to ‘see’ attributes in images, which vary wildly from one to the other (if you and I were to both write about swimming pools, we would likely use at least some of the same words, but if we each took a photo of a swimming pool, the two photos would likely be very different from each other). Essentially, image processing is a whole different ball game to listing description and floorplan processing.
To be able to spot potential attributes in images, we use a multi-modal, zero-shot model (read, a quite sophisticated and complex machine learning model) called CLIP (it’s built by OpenAI, the same people who brought you ChatGPT!). CLIP is able to perform zero-shot detection- that is, we can detect if an image has a particular attribute without having to go through an arduous in-house model training process. This is a massive time saver and is exactly what we needed!
CLIP uses embeddings to create a ‘map’ of images and text. Essentially, it creates a list of numbers (think about how every physical location has coordinates) for every image and every piece of text that is important to us. Text and images that represent a similar thing end up with similar embeddings, so they sit close together on the map. For example, images that contain a swimming pool and the words swimming pool will have similar embeddings, and so when we search for the text swimming pool, CLIP is able to return images that are a close match for containing a swimming pool. We also use the JINA library to run CLIP at scale, so that we can process the hundreds of thousands of images that we need to on a regular basis.
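Here is a toy sketch of what ‘similar embeddings’ means in practice. The three-number vectors below are invented for illustration (real CLIP embeddings have hundreds of dimensions and come from the model itself), and the file names and the 0.8 threshold are assumptions, not values from our pipeline. Cosine similarity is a common way to compare embeddings, and it is the measure used in this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embedding standing in for the text "swimming pool".
text_embedding = [0.9, 0.1, 0.2]

# Made-up image embeddings keyed by hypothetical file names.
image_embeddings = {
    "backyard_pool.jpg": [0.8, 0.2, 0.1],
    "kitchen.jpg": [0.1, 0.9, 0.3],
    "garage.jpg": [0.2, 0.3, 0.9],
}

scores = {
    name: cosine_similarity(text_embedding, embedding)
    for name, embedding in image_embeddings.items()
}

# Rank images from most to least similar to the text.
ranked = sorted(scores, key=scores.get, reverse=True)

# Hypothetical cut-off: only images above it count as having the attribute.
SIMILARITY_THRESHOLD = 0.8
detected = [name for name in ranked if scores[name] >= SIMILARITY_THRESHOLD]

print(ranked)
print(detected)
```

In this toy example only the pool photo clears the threshold, which mirrors the idea of searching for the text swimming pool and getting back the images closest to it on the map.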
One final puzzle piece
In our team, we are working to ensure we reuse existing work wherever possible- nothing like increasing speed to delivery or augmenting a feature set by using something you already have! In building GLAD we realised we could do just that- we already had a model we had previously built that classified rooms by their quality. This model rates kitchens, bathrooms, and exterior by their quality on a scale of 1 to 5 (there’s a lot to unpack there but that’s for another post!). Given we had this information, why not integrate it into the GLAD solution? And so we did!
Putting it all together
Once we have the attributes extracted from listing descriptions, floorplans, and images, we ensemble the data, which includes checking the multiple data sources against each other to ensure we provide a high level of accuracy. We then add the room quality model’s output to our dataset to give that little bit more value to end users.
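To give a feel for what checking sources against each other could look like, here is a small sketch. The rule used (keep an attribute only when at least two sources agree) is purely an assumption for illustration- the talk did not go into the actual ensembling logic- and the source names and sample detections are made up.

```python
from collections import defaultdict

# Hypothetical per-source detections for a single listing.
detections = {
    "text": {"swimming pool", "fireplace"},
    "floorplan": {"swimming pool", "study"},
    "image": {"swimming pool", "fireplace", "solar panels"},
}

# Assumed rule: an attribute is confirmed when this many sources agree.
MIN_SOURCES = 2


def ensemble(detections, min_sources=MIN_SOURCES):
    """Return {attribute: [sources]} for attributes seen by enough sources."""
    sources_by_attribute = defaultdict(set)
    for source, attributes in detections.items():
        for attribute in attributes:
            sources_by_attribute[attribute].add(source)
    return {
        attribute: sorted(sources)
        for attribute, sources in sources_by_attribute.items()
        if len(sources) >= min_sources
    }


confirmed = ensemble(detections)
print(confirmed)
```

Under this assumed rule, study and solar panels are dropped because only one source detected each of them, while swimming pool and fireplace are kept.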
In the current use case of providing real estate agents with insights about the properties they are selling, we then roll the data up and make it accessible to one of our platform development teams to bring these insights to real estate agents.
Where to next?
What’s next for GLAD? Remember the list we talked about earlier, covering our vision for this work? Like we said, that’s still our vision! We have now gotten the green light from the business to prioritise evolving the foundation that GLAD provides to build a world-class system that can in future cater for hundreds of use cases across all of REA Group. We’re just getting started!
end talk
Between the talk and a panel discussion, we had a great networking and eating break. It was great to see around 90 people connecting, sharing their thoughts on the talk, and enjoying a bite to eat.
After the break we held a panel discussion around building high performing data science teams facilitated by Megan Evans, and featuring Rachel L., Evangeline Clough Good, and me!
Building High Performing Data Science Teams (a panel discussion summary)
Unfortunately I don’t have a direct transcript of the discussion (I neglected to record it), but here are some of the key themes that were covered:
Data Science teams:
- The key differences between a data scientist and a machine learning engineer (MLE) in practice can be quite large or quite small- it all depends on the company/data science team you are working with. In theory, data scientists are focused more on applying maths and statistical methods, while machine learning engineers might be more focused on creating machine learning pipelines so that machine learning models can work in a production environment.
- A good data scientist or machine learning engineer has a focus on problem solving- that is, they are focused on understanding problem spaces, and then finding the best fit solution for that problem, whether complex or deceptively simple. A good data scientist or machine learning engineer should be able to translate technical language into something easily understood by stakeholders, and should also have the wisdom to know how deep to go from a technical perspective to begin with!
- If you’re keen to get into data science, the fundamental hard skills you should be considering spending time on are Python and SQL. These two are key data science tools in many or even most organisations. Data visualisation is another fundamental skill that would be worth developing if you want to head down this path. These days it is so easy to learn- Youtube, Udemy, LinkedIn Learning, and similar services are great resources from which to learn at little or often no cost. Use them to your advantage! Aside from this, don’t think that you have to wait to be an expert in these things before you can apply for jobs in the data science space if that’s your goal! Get out there, network, apply for roles, and learn on the job.
- A good data science team exhibits a few key attributes: 1. They are focused on delivering business value, regardless of the form that value might take. 2. They keep an eye on new technologies, but they are measured in how they apply them. 3. They have a focus on reuse- model reuse is an amazing opportunity that is often overlooked. 4. They work well together within the team. 5. They are focused on and make time for learning and growth. 6. They make noise about and celebrate their work outside their team. 7. They value the opinion of each of their members and are safe enough so that anyone can challenge thinking and suggest alternative ways of doing things.
- Data science teams differ from traditional dev teams in that there is a lot more ambiguity and fluidity around how agreed upon outcomes are reached. Models and insights can often be developed in multiple ways, and it’s up to the data science team to determine which way provides the best outcome. This leads to a lot more research and spikes, work being drawn up only a short time in advance, and priorities changing regularly as research guides our direction.
- Continuous improvement is central to our team’s ways of working. We have made lots of changes over time to improve our efficiency and tackle pain points, but the work is never done- it’s important both to maintain discipline in implementing change and to look out for new opportunities to better how we work.
Audience Questions
- What do you think about Generative AI? Our opinions on this technology are fairly mixed- some feel there should be more regulation around it, others are comfortable knowing legislation and ethics frameworks will catch up to the technology in time, and others are fairly neutral on it. What we do agree on is that the diversity of thought in our team is a strength and something that we strongly encourage. Diversity of thought helps us to achieve better outcomes!
- How do you measure that you are high performing, and not just mediocre? Some of the things we think about when assessing our performance include: 1. Are we delivering on everything we said we would? 2. Are we delivering everything we said we would on time? 3. Is our work driving user experiences? 4. Are we meeting our own OKRs and supporting wider business OKRs? 5. Are we happy as a team while doing all the above?
end panel discussion summary
After the panel discussion Mark wrapped things up, and we spent the 20 or so minutes after ending the formal activities of the evening networking with those that were keen to chat.
Overall, it was a successful event! It’s great to share knowledge, and I think we were able to do that in a way that was approachable and valuable for everyone that attended.
If you read this far, thanks for reading! And if you have any questions, feel free to reach out.