"How do we connect, Databricks API?"

A trouble-shooting guide for people who, in frustration, now find themselves trying something like this:

curl -v -H "Authorization: Bearer XxxxXXXXxxxxx" \
  -X POST "https://<somehost.some_databricks_cloud>/api/2.x/….." \
  -d '{ …… }'

"Why a trouble-shooting guide? Doesn't the Databricks HTTP(S) API work?"

Oh, the API works. When you succeed in making YourDeploymentTool using it, it'll work. And then you'll never think about it again, quite likely.

But it's very, very easy to get the HTTP requests wrong while you're developing and testing it.

Generic HTTP reasons for your suffering

  • Authentication tokens are huge and human-unreadable. So you won't be able to see if you've made a cut-and-paste mistake, if you have whitespace or punctuation character issues, if it's meant to be reformatted, if you're accidentally using an old one, etc.
  • Access tokens are typically short-lived, so auth rejections can begin at arbitrary times while you're still experimenting.
  • By default not all of the important HTTP status indicators will be visible. (curl, for one, won't show you the status code or response headers unless you ask; see the sketch after this list.)
  • An empty HTTP response body may be valid, or it may indicate failure. It depends on each separate endpoint.
  • When using an insufficiently-privileged identity you'll get errors, or the silent treatment, for URLs that would work fine with a superuser account. The "real error or not?" guessing games begin.
  • Slip-ups with GET, PATCH, DELETE, PUT, POST, etc. The different HTTP request methods aren't hard to understand, but it's easy to overlook that you typed (or copied) "POST" where it should be "PUT", etc.
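
Making the status code impossible to miss removes a lot of that guesswork. A minimal sketch with curl, where the token, hostname and path are placeholders: -i prints the status line and response headers, and -w '%{http_code}' prints just the numeric code.

$ # Show the response status line and headers along with the body
$ curl -s -i -H "Authorization: Bearer <token>" "https://<hostname>/api/<some_endpoint>"
$
$ # Or capture just the numeric status code
$ HTTP_CODE=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer <token>" "https://<hostname>/api/<some_endpoint>")
$ echo "${HTTP_CODE}"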

Databricks-specific reasons for API request mistakes

  • The API is split in two - Databricks Account API vs Workspace API - and the auth is not the same. (Partially the same, but not entirely.)
  • Multiple API server hostnames are used and it's easy to mix them up in development as you switch between the one for Account, the one for Workspace X, or Workspace Y, etc.
  • Authorization blockers:
    - Cloud provider privileges/roles (for IAM, Network, Storage, etc.) are required for some work, beyond Databricks privileges/roles. Especially at the time of doing first steps in Databricks Account administration and new Workspace deployment.
    - The early steps in the precursor work for making a new workspace are typically in the Account API, which needs the highest Databricks privilege (Account admin).
    - Databricks platform-issued user tokens are easy to grab from the GUI, but you can't get them in a secure and programmatic way without using some other type of authentication first. Usually OAuth.
    - Personal Access Tokens are only for Workspace API access, not the Account API.
    - Being an Account Admin doesn't imply access to all workspaces. It does mean the admin user (or service principal) can add themselves, though.

Identities and credentials you can use

For the sake of learning: Forget about Databricks Workspace Personal Access Tokens (PAT). The Databricks documentation relentlessly points towards them as the first choice, but for automated platform deployment they can't be used for first steps because they aren't accepted by the Account API. Furthermore, other authentication is always needed before you can issue the first PAT.

Databricks follows the common practice of having API servers that require HTTP requests to carry an "Authorization: Bearer …." or "Authorization: Basic …." header. The value that comes after the "Bearer" or "Basic" auth scheme name will be a Base64 string either way, with a length somewhere between a few dozen and several thousand characters.

If the HTTP auth scheme is "Basic" the user is sending their username and password in that Base64 string. It will be passed to whatever 'login(username, password)' internal function is used within Databricks' cloud servers, the same one the login page would use.

Using "Basic" is nifty for hands-on API exploration, but using a username and password like that risks leaking them. They might get typed visibly in shell history, left in logs, saved in a config file left in a public place, etc., etc. So the normal practice is to use "Bearer" instead.

With "Bearer" the Base64 token will be passed onto one of many different authentication functions. Which depends on what identity services your Databricks account is integrated with. The code of Databricks isn't open but from prior development experience with other federated authentication systems I can infer it would go like this:

  • Not using SSO: The Base64 string is an access token that will be passed to the OAuth (etc.) functions of whatever library Databricks embedded in their cloud servers for that purpose.
  • Active Directory SSO: Externally validated with an Active Directory server.
  • External OIDC provider SSO: The Base64 value is a JWT. It will be deserialized as JSON, the JWT signature confirmed (using keys previously loaded from the OIDC provider), the expiry time field checked, then the user identified by the fields in the JSON object. (You can peek inside such a token yourself; see the sketch after this list.)
  • Google Cloud Identity SSO: Well, I've never programmed against this one, but once again the value will be handed off to whatever that uses.
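
As an aside, when the bearer token is a JWT you can decode its middle segment yourself and read the claims (expiry, subject, etc.) without any key. A minimal sketch with standard shell tools; SOME_JWT is a hypothetical variable holding the token, and the fix-ups are needed because JWT segments use base64url encoding and may lack padding:

$ SOME_JWT="eyJhbGciOi....."                   # hypothetical token variable
$ PAYLOAD=$(echo -n "${SOME_JWT}" | cut -d. -f2 | tr '_-' '/+')
$ # Re-add the '=' padding that base64 -d expects
$ while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
$ echo "${PAYLOAD}" | base64 -d                # prints the JSON claims, including the 'exp' expiry time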

Using a "Basic" authorization header

As per RFC-7617 the credentials string is the Base64 encoding of the concatenation of the username, a colon ":", and the password. (Use standard Base64 encoding, not the URL-safe one that replaces "+" with "-" etc.)

This is the same username (= email address) and password you use to log onto the Databricks GUI websites.

For a command line method you can try the following to create it:

$ # Don't miss the -n (no newline) flag on echo. It will be the wrong
$ #   string if the newline is included.
$ echo -n "nobody.noone@nowhere.com:S3cr3tPswd!" | base64
bm9ib2R5Lm5vb25lQG5vd2hlcmUuY29tOlMzY3IzdFBzd2Qh

A slightly fuller example, demonstrating saving it as an env var and using it in a curl -H (--header) argument:

$ DB_BASIC_AUTH_CREDS=$(echo -n "nobody.noone@nowhere.com:S3cr3tPswd!" | base64)
$ curl -v -H "Authorization: Basic ${DB_BASIC_AUTH_CREDS}" -X GET "https://…./api/……."

Using a Bearer token authorization header

Once you have an access token it's just cut-and-paste into the "Authorization: Bearer …" header, basically.

$ DATABRICKS_XXX_TOKEN=$(<some command to get the token>)
$
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" -X GET "https://…../api/……"        

So where do you get these access tokens? First you will have to authenticate to the relevant identity service your Databricks environment is using, with a password or service principal secret, then request it to issue one. If you are already logged in with the identity service enabled as SSO for your Databricks account you might be able to get a new token without repeated password etc. entry, but the point remains: any trusted connection with an identity service that will provide HTTP access tokens is, ultimately, established by password or secret key confirmation at some prior time, whether that was one millisecond or a whole day earlier. (In theory it could also be another type of secret such as a hardware key, but I'm not aware of any Databricks practitioners doing this.)

This raises the question: if we have to log in with our precious, secret credentials at some point anyhow, why not use the "Basic" auth scheme all the time and keep things simple? The main answer is that access tokens with a short lifespan quickly become useless to attackers who might obtain them. An attacker who grabs your "Basic" authentication information, on the other hand, has a username and password that will presumably remain valid for a while.

The options you have for getting the access token depend on which SSO (if any) is used, and whether the identity is a user or a service principal. Then there may be more than one choice of tool/method. The documentation is also split between cloud platform-specific sites (AWS, Azure, GCP). There are too many permutations to list them all, but I believe the list below covers most Databricks accounts out there in the userverse.

Note: The "No SSO" cases were tested on AWS. Presumably the login and access token generation methods referenced above would work for Databricks accounts hosted in GCP and Azure if there was some way to disable SSO for them.

Troubleshooting

The simplest endpoints to test

Account level
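
One candidate, assuming an AWS-hosted account (the accounts.cloud.databricks.com hostname) and a token belonging to an Account admin, is simply listing the account's workspaces:

$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://accounts.cloud.databricks.com/api/2.0/accounts/${DATABRICKS_ACCOUNT_ID}/workspaces"
$ # Expect a JSON list of workspace objects, or an empty-bodied 401/403 if the token or role is wrong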

Workspace level
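
At the workspace level, the SCIM "Me" endpoint is a handy smoke test because it just echoes back the identity behind the token (the workspace hostname here is a placeholder):

$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<workspace-hostname>/api/2.0/preview/scim/v2/Me"
$ # Returns a JSON description of the user or service principal the token belongs to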

No HTTP response body? Might be a valid, empty HTTP 200 OK.

The HTTP response bodies from Databricks requests can be A) entirely empty, B) just a '{}' or '[]', which is very easy to miss in a screen already full of punctuation marks, or C) a nice, full JSON object or array.

So check the HTTP response code at the same time.

  • 200 - Good (even if the body response is empty)
  • 400 - Bad request: Can appear when Databricks object IDs in the URL, such as in /api/2.0/accounts/{account_id}/metastores/{metastore_id}, don't meet the expected format, e.g. a UUID without its hyphens. Arguably the Databricks API should return a 404 instead.
  • 401 - Authentication failed: Sometimes appears when a token is incorrect or expired, which is the expected behaviour. (But sometimes an incorrect or expired token generates a 403 instead (Jan 2024).) Will also appear when a valid Personal Access Token was used for a request to the Account API (they only work for the Workspace API).
  • 403 - Forbidden: This (should) mean it accepted the proof of ID but won't let the authenticated principal do that thing. However, sometimes Databricks returns this when bad or expired tokens are used (Jan 2024).
  • 404 - (Endpoint) Not found. Appears when:
    - A totally wrong path is used.
    - The path is right but the host is wrong, e.g. requesting a Workspace API endpoint but the hostname is the one for the Account API.
    - There's a typo. This includes having a forward slash twice, e.g. 'https://<hostname>//api/….'.
    - The path template is correct but the object for the IDs or names in the URL doesn't exist. In this case the error title in the JSON response body will be "RESOURCE_NOT_FOUND" rather than endpoint-not-found, though.
    - The HTTP request method was wrong, e.g. it was a "GET" when it should be "POST", etc. (Should be a 405. Sometimes it is.)
  • 405 - Method not allowed: Path was otherwise valid, but it isn't for the HTTP request method used. Eg. it was a "GET" when it should be "POST", etc. (It is supposed to be a 405 in this situation, but to be honest I've mostly observed the Databricks API returning 404s instead (Jan 2024).)

Silent treatment for Account API even when authenticated

You will get only empty responses, along with either a 401 or 403 response code, if you use the Account API with a token generated for a user or service principal who hasn't been granted the "Account admin" role. Databricks Personal Access Tokens won't work either, even if they're for the same username as an account admin.

Some errors come in well-formed JSON. Some don't.

If your first error message came in a nice JSON object, well, cool. Most failed requests seem to produce them. But don't expect JSON error description objects every time. Eg. at least some authentication rejections will return HTML.

GET, PATCH, DELETE, PUT, POST, the whole KABOODLE

As you flip between Databricks endpoints, cutting and pasting your previous command to make the next one, it's easy to miss that the last one was a POST request method but this one is PATCH, etc. A HTTP 405 should be returned, but Databricks gives a 404 instead. (Jan 2024)

Query parameters go here, … or here, or there or there

If you have used one way of adding a query parameter, you can't assume it applies everywhere. Depending on the Databricks REST endpoint being used it could be:

  • One or more Databricks object IDs/names as URL path arguments. Eg. the account and metastore ids in '/api/2.0/accounts/{account_id}/metastores/{metastore_id}'
  • In a JSON object in the HTTP request body.
  • As URL search parameters, eg. /api/2.0/jobs/get?job_id=<some_number_id>.
  • In a custom filter expression, eg. "displayName eq Akira Kurogane". As a GET request this would be /api/2.0/preview/scim/v2/Users?filter=displayName%20eq%20Akira%20Kurogane (see the curl sketch below for a way to avoid hand-encoding this).
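
Hand-encoding those filter expressions is fiddly, so you can let curl do it. A minimal sketch; the workspace hostname is a placeholder, and note that the strict SCIM 2.0 filter grammar puts double quotes around string values:

$ # -G sends --data-urlencode values as URL query parameters on a GET request
$ curl -s -G -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    --data-urlencode 'filter=displayName eq "Akira Kurogane"' \
    "https://<workspace-hostname>/api/2.0/preview/scim/v2/Users"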

Usually it is object IDs, not names, in URL args

Sometimes the user-created name is the argument used in the URL for a specific resource. Eg. /api/2.1/unity-catalog/catalogs/{name}, or /api/2.1/unity-catalog/storage-credentials/{name}.

But even when a resource has a unique, URL-safe name, most of the endpoints require numeric, hexadecimal or UUID IDs. Eg. a UUID is required for this one: /api/2.1/unity-catalog/metastores/{metastore_id}.
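
If you don't have such an ID to hand, the usual trick is to call the corresponding list endpoint first and pluck the ID from the response. A minimal sketch, assuming the workspace-level Unity Catalog list-metastores endpoint wraps its results in a 'metastores' array and that jq is installed:

$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<workspace-hostname>/api/2.1/unity-catalog/metastores" \
    | jq '.metastores[] | {name, metastore_id}'     # pick out the UUID for the follow-up request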

Working with both Account and Workspace APIs? A hostname oops = more HTTP 3xx and 4xx response codes

As you start automating a new workspace you begin with Account API requests and then move to the Workspace API after the workspace has been created.

So it's natural during development that you will try the Account API, then the Workspace API, then back again. But as you're cutting and pasting remember the Account API has one hostname and your workspace has another. If you accidentally start using one instead of the other then you'll get authentication errors, authorization errors, or resource-not-found (= 'wrong URL path, stupid') errors. A HTTP 401 response is the instant result if using a workspace PAT for the Account API.
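
One low-tech defence is to keep the two hostnames in distinctly-named shell variables so a paste mistake stands out. A sketch; the Account API hostname shown is the AWS one and the workspace hostname is a made-up placeholder:

$ DB_ACCOUNT_HOST="https://accounts.cloud.databricks.com"            # Account API (AWS account console)
$ DB_WORKSPACE_HOST="https://dbc-a1b2c3d4-e5f6.cloud.databricks.com" # one of these per workspace
$
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "${DB_ACCOUNT_HOST}/api/2.0/accounts/${DATABRICKS_ACCOUNT_ID}/workspaces"
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "${DB_WORKSPACE_HOST}/api/2.0/clusters/list"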
