Enhancing Resource Access Security with Continuous Access Evaluation
Created using generative AI and some human love based on the transcript of my video on the same topic https://youtu.be/m3309aUKET8.
I want to dive into continuous access evaluation, a feature that's getting a lot more attention when we think about handling access tokens, which are core to how we access any resource that uses modern authentication.
Token 101
Historically, if we think back to tokens, we have some resource that we want to talk to. So, I can think about it like this: I'm here as some client, and I want to leverage some resource. This could be Exchange Online or Teams (we'll go into some details about that later). This resource trusts Entra ID for authentication, so we have our organization's Entra tenant.
So, we have Entra, I have authenticated (proving I am who I say I am) and have a refresh token. Now what I want is to get an access token for this particular resource.
What happens here is I send my refresh token to Entra with a request. It's like, "Hey, I would like an access token for this resource." So, I send my refresh token with the request. Entra, at this point, would look at the conditional access policies that relate to this particular resource, and if I satisfy those requirements, it would go and create an access token for this resource and send me that access token. Now, fantastic, I have the access token. I would now go and send that access token to the resource I've been authorized to use.
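As a rough sketch of that request, here is what a standard OAuth 2.0 refresh-token grant against the Entra v2.0 token endpoint could look like. It only builds the URL and form body and sends nothing. The tenant, client ID, refresh token, and scope values are placeholders I made up for illustration, and in practice a library such as MSAL would do this for you.

```python
from urllib.parse import urlencode


def build_refresh_request(tenant_id, client_id, refresh_token, scope):
    """Return (url, form_body) for an OAuth 2.0 refresh-token grant."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "refresh_token",   # exchange the refresh token...
        "client_id": client_id,
        "refresh_token": refresh_token,
        "scope": scope,                  # ...for an access token for this resource
    })
    return url, body


# Placeholder values, not real identifiers.
url, body = build_refresh_request(
    tenant_id="contoso.onmicrosoft.com",
    client_id="00000000-0000-0000-0000-000000000000",
    refresh_token="<refresh-token>",
    scope="https://graph.microsoft.com/.default",
)
print(url)
print(body)
```

Entra evaluates the conditional access policies for the requested resource when it handles this request, which is why the policy check happens at token issuance and not on every call to the resource.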
The challenge with this is that the resource trusts this access token because it trusts Entra ID and the token has been signed accordingly, but there's no direct communication between the resource and Entra ID to validate it. So, if something happens, I have no way of revoking that access token. There is no revocation list that Entra puts out that the resource can go and check. Because there is no online status service like we have with certificates, the access token has a short lifetime, by default normally one hour. The idea here is that if something does happen, the exposure is limited to that hour. If the token got stolen, the attacker has access for an hour. If the person leaves, they have access for at most one hour, because every hour the client has to go back and request a new access token with the refresh token (and it gets a new refresh token back as well). The access token itself is very short-lived.
Now, work is being done around token binding, which helps protect against scenarios where tokens are stolen and someone tries to reuse them. I did a whole video on that, available at https://youtu.be/toytJf1rmV4, but token binding is not everywhere today. So, what if one hour is too long? What if I want to be able to revoke that access token, something that historically has not been possible?
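To make the "no way to revoke" point concrete, here is a minimal sketch of what purely local validation can see in a token: the claims inside it, including the expiry. The token below is a fake, unsigned one I build just for the example. A real resource would also verify the signature, which I skip here.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_payload(jwt: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    payload_b64 = jwt.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def seconds_remaining(jwt: str, now: int) -> int:
    """How long the token stays valid based only on its own exp claim."""
    return decode_payload(jwt)["exp"] - now


# Fake token for illustration: issued at a fixed time, valid for one hour.
issued = 1_700_000_000
claims = {"aud": "https://graph.microsoft.com", "iat": issued, "exp": issued + 3600}
fake_jwt = ".".join([b64url(b'{"alg":"none"}'), b64url(json.dumps(claims).encode()), ""])

# Ten minutes after issuance, nothing in the token can say "I was revoked".
print(seconds_remaining(fake_jwt, now=issued + 600))  # 3000
```

Without some extra signal from Entra, the only thing that ends this token's life is the exp claim, which is why the default lifetime is kept short.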
Continuous Access Evaluation Core Functionality
Now, obviously, here the answer is going to be continuous access evaluation. What's happening with continuous access evaluation is we introduced the idea of some endpoint that is available now through Entra. What happens here is when there are certain types of events, those events get written and are accessible from this endpoint.
Now, when we're talking about events, it's something that would be a critical event about the user. The user has been deleted, the user has been disabled, the password for the user has been changed or reset, MFA has been enabled for the user, I explicitly revoke the tokens for the user, or maybe high risk has been detected via Microsoft Entra ID Protection. The documentation at https://learn.microsoft.com/en-us/entra/identity/conditional-access/concept-continuous-access-evaluation#scenarios goes through exactly what those things are. So, any of these are considered critical events, and in this scenario, these would now be made available via that endpoint.
Now, additionally, only certain services support this. The critical events relate to the user and trigger the revocation, but the resource has to take part as well, and today that means Microsoft first-party services only: Exchange Online, SharePoint Online (which also means OneDrive), Teams, and Microsoft Graph. Expect this list to grow over time; this is where things are today.
The other important thing here is the client itself. This has to be continuous access evaluation capable, and that will make a lot more sense in a second. Now, anyone can write a client that is CAE capable. Obviously, the Microsoft clients are (the Office apps on Windows, iOS, and Android), but, for example, the iOS native mail app is CAE capable as well. This CAE capability is critical because the client has to be able to respond to certain challenges that are going to happen when a revocation actually happens.
Now, if I'm curious whether continuous access evaluation is in effect for a particular combination of client and resource, I can go and look at the sign-in logs for a particular identity, which show whether the sign-in was CAE enabled or not.
So, those are the things we need in place. When I have continuous access evaluation, there's nothing I have to do for these Microsoft first-party services. Providing my client is CAE capable, it's now going to be able to respond quicker than waiting for the token to expire after an hour if one of these critical events actually happens.
Now, specifically, when one of those critical events happens, it generates a revocation event. The resource is aware of this because it has a connection to that endpoint to look for revocation events. If we read the documentation, it gives a fifteen-minute revocation SLA: within fifteen minutes of the revocation event happening, the resource should no longer accept that access token. In my experience, I've never seen it take that long; normally it's a couple of minutes.
Once the service sees that revocation event, what it's now going to say is, "Hey, this access token is no longer any good," and it will generate a response to the client. It doesn't just refuse it because that would be very unhelpful. What it's going to do is send a 401 response that includes a claims challenge. Imagine, for example, my account was disabled. The response is going to have some challenge to say, "Hey, you need to go and send me a new token issued after this time to show you've been enabled again." It's sending a challenge that the client now has to satisfy to prove the client can still go and talk to Entra.
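Here is a toy model of that check from the resource's point of view. How the resource really receives and stores revocation events is not something I've described, so the RevocationStore class and token_accepted function below are hypothetical names purely for illustration: any token issued at or before the user's latest critical event is treated as no longer good, even though it has not expired.

```python
from dataclasses import dataclass, field


@dataclass
class RevocationStore:
    """Hypothetical local view of revocation events: user id -> event time (epoch seconds)."""
    revoked_at: dict = field(default_factory=dict)

    def record(self, user_id: str, when: int) -> None:
        # Keep only the most recent critical event per user.
        self.revoked_at[user_id] = max(when, self.revoked_at.get(user_id, 0))


def token_accepted(store: RevocationStore, user_id: str, iat: int, exp: int, now: int) -> bool:
    """Accept a token only if it is unexpired and was issued after the last revocation."""
    if now >= exp:
        return False
    revoked = store.revoked_at.get(user_id)
    return revoked is None or iat > revoked


store = RevocationStore()
store.record("alice", when=1_000)  # e.g. Alice's account was disabled at t=1000

print(token_accepted(store, "alice", iat=900, exp=90_000, now=1_100))    # False
print(token_accepted(store, "alice", iat=1_001, exp=90_000, now=1_100))  # True
print(token_accepted(store, "bob", iat=900, exp=90_000, now=1_100))      # True
```

The second call is the interesting one: a token issued after the event is fine, which is what lets a re-enabled user get back in with a fresh token.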
This is why it has to be CAE capable because it has to be able to handle the 401. It has to know what to do. So, if it wasn't CAE capable, it would get this, it would just try and resend the old token again and fail miserably. Because it's CAE capable, it gets this with the challenge in it, and what it will now do is it will send a request again to Entra, saying, "Hey, I need an access token." So, it would send the refresh token, saying, "Hey, I need an access token," but it would say, "Here's the challenge I was given," and Entra will now create a new access token that includes the response to the challenge, give it to the client, and the client could now send it and would have access to the service again.
So, that's really the big deal about the client having to be CAE capable as well: in the headers of this 401 response there'll be a WWW-Authenticate header, and in there is the challenge. The client has to know to say, "Oh, I need to take that challenge and send it to the token service in Entra."
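To show what "handling the 401" might look like on the client side, here is a minimal sketch that pulls the claims challenge out of a WWW-Authenticate header. The header shape (error="insufficient_claims" plus a base64-encoded claims value) follows Microsoft's published guidance for CAE-enabled APIs as I understand it; the sample header is built by the code itself, and extract_claims_challenge is my own hypothetical helper. MSAL-based clients would normally hand the claims value straight back to the library.

```python
import base64
import json
import re


def extract_claims_challenge(www_authenticate: str):
    """Return the decoded claims challenge from a WWW-Authenticate header, or None."""
    # Collect key="value" pairs from the header.
    params = dict(re.findall(r'(\w+)="([^"]*)"', www_authenticate))
    if params.get("error") != "insufficient_claims" or "claims" not in params:
        return None
    encoded = params["claims"]
    padded = encoded + "=" * (-len(encoded) % 4)  # tolerate missing padding
    return json.loads(base64.b64decode(padded))


# Build a sample header: "send me a token that is not valid before this time".
challenge = {"access_token": {"nbf": {"essential": True, "value": "1604106651"}}}
encoded = base64.b64encode(json.dumps(challenge).encode()).decode()
header = (
    'Bearer realm="", '
    'authorization_uri="https://login.microsoftonline.com/common/oauth2/authorize", '
    f'error="insufficient_claims", claims="{encoded}"'
)

print(extract_claims_challenge(header))
print(extract_claims_challenge('Bearer error="invalid_token"'))  # None
```

A client that is not CAE capable has no equivalent of this step, so all it can do is retry with the same rejected token.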
One of the upsides of this is, remember I said, well, the tokens were one hour; they're short-lived because there was no way to revoke them. Well, because now they can be revoked, the tokens become long-lived. Every time the client requests a token, it indicates, "Hey, I want a token, please, and by the way, I am CAE capable." If it's for a CAE-capable service, it knows it can use CAE. Instead of this token being one hour, the token will actually be between twenty-four and twenty-eight hours. It varies slightly, but the token is now long-lived. This is OK because remember, we can revoke it.
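As a sketch of that "by the way, I am CAE capable" signal: in Microsoft's documentation the client declares it through a claims parameter on the token request that asks for the xms_cc claim with the value cp1. Those names come from that documentation, not from anything above, and build_claims_parameter is a hypothetical helper of mine. It also shows one way a claims challenge received from a resource could be merged into the same parameter.

```python
import json


def build_claims_parameter(challenge=None) -> str:
    """Build the claims request parameter: CAE capability plus an optional challenge."""
    # Declare that this client can handle claims challenges (CAE capable).
    claims = {"access_token": {"xms_cc": {"values": ["cp1"]}}}
    if challenge:
        # Merge whatever the resource asked for into the same request.
        for section, requested in challenge.items():
            claims.setdefault(section, {}).update(requested)
    return json.dumps(claims)


# First token request: just the capability declaration.
print(build_claims_parameter())

# After a 401: capability plus the challenge the resource handed back.
sample_challenge = {"access_token": {"nbf": {"essential": True, "value": "1604106651"}}}
print(build_claims_parameter(sample_challenge))
```

With a library such as MSAL you would normally not build this by hand; the library adds the capability for you when you tell it the client is CAE capable.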
Imagine there was some transitory problem, maybe talking to Entra ID or something on the Entra side. Well, my token will now be valid for a much longer period, so I'm more resistant to those transitory-type potential issues you might encounter. That's one of the really nice things. However, if my client is not CAE capable, this will not be used. I would just get a regular one-hour token to avoid the client getting into a position where it’s stuck.
Location Awareness
There is another aspect as well. Imagine I get a token, and when I get the access token, it's checking my conditional access policies. Let's say I have conditional access policy number one that applies to whatever I'm doing right here as part of that particular resource. I'm using a named IP location, which can be IPv4 or IPv6.
When this happens, for the access token it gives you, it also adds in the conditional access policy IDs where there is a location. So, this access token it gave me for the resource also includes the conditional access policy IDs (CAPIDs). Now, the resource sees the access token with this list of conditional access policy IDs and communicates with this endpoint. It's like, "Hey, I've got this list of conditional access policy IDs. What are the IP ranges for that?" The endpoint will send back the list of CIDR ranges that represent whatever that IP location policy is.
Now, the resource can do an additional check, provided the client is CAE capable, the resource supports CAE, and there has been a conditional access policy that applies to this resource that is IP-based, not geo-based or anything else. It's only for the IP-based policies. Now, the IDs of those conditional access policies get added to the token. The resource provider will check at a data plane level, ensuring the communication is happening within these lists of IP addresses.
If I got the access token and then took my laptop somewhere else, and I didn't meet this requirement, I wouldn't have to wait an hour. The resource would say, "No, you're not in this IP location anymore," and issue a 401 with a challenge. Likewise, if someone stole my access token and I'm using those IP-based location policies, they would be coming from a different IP location, and the token wouldn't work for them. So it adds another layer of protection. Instead of just being these revocation events, the resource is now also checking, "Hey, where are you actually talking to me from?" because it knows, based on the token I was given, that I am supposed to be coming from these CIDR ranges.
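The location check itself boils down to "is the caller's IP inside any of the CIDR ranges behind the policy IDs in the token?" Here is a minimal sketch using Python's standard ipaddress module. The ranges are documentation example ranges I picked, and ip_in_allowed_ranges is a hypothetical helper, not how any resource provider actually implements it.

```python
import ipaddress


def ip_in_allowed_ranges(client_ip: str, cidr_ranges) -> bool:
    """True if client_ip falls inside at least one of the CIDR ranges (IPv4 or IPv6)."""
    address = ipaddress.ip_address(client_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr)
        # An IPv4 address can never match an IPv6 range, and vice versa.
        if address.version == network.version and address in network:
            return True
    return False


# Example ranges standing in for a named IP location.
named_location = ["203.0.113.0/24", "2001:db8::/32"]

print(ip_in_allowed_ranges("203.0.113.25", named_location))  # True
print(ip_in_allowed_ranges("198.51.100.7", named_location))  # False
print(ip_in_allowed_ranges("2001:db8::1", named_location))   # True
```

A False result is the situation where the resource would send the 401 with a claims challenge instead of serving the request.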
There's a caveat here. One of the things we have, if I look at my session controls in a conditional access policy, is the option to customize continuous access evaluation. I can say, "Strictly enforce location policies."
Ordinarily, if you have a stricter option around security, that is a no-brainer. Like, yes, I want stricter policies; I'm going to turn that on. Then what happens is, in some companies, no one can access any of the services anymore. Why?
Imagine you have your corporate network or some IP set of ranges. Remember, again, we have Entra, and I have my client. Imagine now I have some firewalls with certain public IPs, and I've configured things for various types of traffic to go via this. For my authentication traffic, all of the authentication traffic should flow this way. I've done a conditional access policy that's looking for these IPs that are allowed to get the access tokens. Fantastic, I'm all good to go.
Now, again, I'm trying to talk to my resource. For Office, it's kind of chatty; there's a lot of traffic. So, what I did is, rather than make all of that traffic go through maybe my very high-quality TLS termination inspecting all the traffic here, I have a different service. For just general internet traffic, I have a different firewall with a different egress point that has a different set of public IPs, and maybe it varies a lot more. Well, my data plane for this service is going that way. Or, remember, it could absolutely be my client isn't even on my network. Maybe my client is at home, but they have some kind of VPN, so the authentication traffic still flows this way. But then regular internet traffic doesn't at all.
Remember, because I've enabled an IP-based location, my access token carries the conditional access policy IDs. The resource went and checked the CIDR ranges for them, and it's expecting to see these IPs on all of the data plane communications. Either way, the resource provider is seeing an IP that does not match what the conditional access policy says it should be. But remember, the conditional access policy requirement is met when the client asks for the token, because the token request is coming via the authentication path.
You basically have a split-path scenario here, and Entra is smart enough to understand this is actually a very common configuration. A lot of companies have the authentication and critical flows going through one path, and regular internet traffic, which for them includes Office, going a very different path. When they turn on continuous access evaluation with location policies, the check at the resource is going to fail, and the resource is going to send the 401 with a claims challenge. When the CAE-capable client brings that challenge back to Entra, Entra sees, "Oh, the client is being challenged for the IP location, but it's coming to me from the right IP," and realizes, "The client must have a split-path configuration, and it's not going to be able to get to the service."
So what it does, when it detects that, is fall back to standard enforcement. It turns off the IP location check: the access token it sends no longer has the conditional access policy IDs for location. Now, the client is still CAE capable, but it will now get a one-hour token, and it's not location-based. That happens for us automatically. If you do have this split-path scenario, don't worry; Entra is smart enough to realize this must be what's happening.
Entra will give you a token without location when this split path is detected, unless I turn on strict enforcement. Strict enforcement is saying, "No, the data path should always come from the same IP ranges as the authentication path. If it's different, do not turn off the location check," in which case I can no longer get to the resource.
So strict enforcement should only be turned on if I'm absolutely confident I do not have this split-path scenario for the authentication traffic versus the data plane traffic for Office, Graph, or anything else. If you do have this split path, clients won't be able to get to the resource because they'll keep getting this challenge. They'll keep sending it to Entra, and Entra will keep answering, "You meet the policy; here's a new token." That token will have the conditional access policy IDs in it again. The resource will go and ask for the CIDR ranges, see that the client is not coming from an IP range it expects, and return another 401.
That strict enforcement sounds fantastic, and if I know I do not have this scenario, if I know the traffic's going via all of the IPs I have in my IP location policy, great, you can turn on that strict enforcement. But if you don't, and you turn it on, you're going to block the access, and then your users will be very sad, and then their sadness will be amplified to you!
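To pull the split-path behavior together, here is a toy decision model of the outcomes I've described. It is only my simplified reading of the behavior, not Entra's actual logic; the function names, the outcome strings, and the example IP ranges are all hypothetical.

```python
import ipaddress


def in_ranges(ip: str, cidrs) -> bool:
    """True if ip is inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


def access_outcome(auth_ip: str, data_ip: str, allowed_cidrs, strict: bool) -> str:
    """Model what happens for a CAE-capable client under an IP-based location policy."""
    if not in_ranges(auth_ip, allowed_cidrs):
        # The token request itself fails the conditional access policy.
        return "no token: location requirement not met at sign-in"
    if in_ranges(data_ip, allowed_cidrs):
        # Same path for authentication and data: the resource enforces location.
        return "access: long-lived token, location enforced by the resource"
    if strict:
        # Split path plus strict enforcement: the challenge loop never resolves.
        return "blocked: resource keeps returning 401 with a claims challenge"
    # Split path with the default behavior: fall back to standard enforcement.
    return "access: standard enforcement, one-hour token without location"


corporate = ["203.0.113.0/24"]  # egress IPs used for authentication traffic

print(access_outcome("203.0.113.10", "203.0.113.20", corporate, strict=False))
print(access_outcome("203.0.113.10", "198.51.100.7", corporate, strict=False))
print(access_outcome("203.0.113.10", "198.51.100.7", corporate, strict=True))
```

The last call is the one to worry about: the same network layout that works with the default behavior locks users out as soon as strict enforcement is turned on.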
Summary
That was it. Really, continuous access evaluation, first and foremost, is a security feature. It's helping reduce the risk of token theft. It helps reduce the risk of replay. It helps you respond faster to those critical user events, because now I can do the revocation. Obviously, I still want to use token binding where I can to further reduce the chance of token theft. I would expect CAE to grow over time to cover more and more services, but it's a really great feature. Just be careful not to turn on strict enforcement simply because you think it looks better. Make sure you're convinced you don't have a split path.
Till next time, take care!