"How do we connect, Databricks API?"
A trouble-shooting guide for people who are, right now and in frustration, typing something like this:
curl -v -H "Authorization: Bearer XxxxXXXXxxxxx" \
-X POST "https://<somehost.some_databricks_cloud>/api/2.x/….." \
-d '{ …… }'
"Why a trouble-shooting guide? Doesn't the Databricks HTTP(S) API work?"
Oh, the API works. Once you've succeeded in building YourDeploymentTool with it, it'll work. And then, quite likely, you'll never think about it again.
But it's very, very easy to get the HTTP requests wrong while you're developing and testing it.
Generic HTTP reasons for your suffering
Databricks-specific reasons for API request mistakes
Identities and credentials you can use
For the sake of learning: forget about Databricks Workspace Personal Access Tokens (PATs). The Databricks documentation relentlessly points towards them as the first choice, but for automated platform deployment they can't be used for the first steps because they aren't accepted by the Account API. Furthermore, some other authentication is always needed before you can issue the first PAT anyway.
Databricks follows the common practice of having API servers that require HTTP requests to carry an "Authorization: Bearer …" or "Authorization: Basic …" header. The value that comes after the "Bearer" or "Basic" auth scheme name will, either way, happen to be a Base64-looking string somewhere between a few dozen and several thousand characters long.
If the HTTP auth scheme is "Basic", the user is sending their username and password in that Base64 string. It will be passed to whatever internal 'login(username, password)' function the Databricks cloud servers use, the same one the login page would use.
Using "Basic" is nifty for hands-on API exploration, but using a username and password like that risks leaking them. They might get typed visibly in shell history, left in logs, saved in a config file left in a public place, etc., etc. So the normal practice is to use "Bearer" instead.
With "Bearer" the Base64 token will be passed onto one of many different authentication functions. Which depends on what identity services your Databricks account is integrated with. The code of Databricks isn't open but from prior development experience with other federated authentication systems I can infer it would go like this:
Using a "Basic" authorization header
As per RFC-7617 the credentials string is the Base64 encoding of the concatenation of the username, a colon ":", and the password. (Use standard Base64 encoding, not the URL-safe one that replaces "+" with "-" etc.)
This is the same username (= email address) and password you use to log into the Databricks GUI websites.
For a command-line method, you can create the encoded string like this:
$ # Don't miss the -n (no newline) flag on echo. It will be the wrong
$ # string if the newline is included.
$ echo -n "[email protected]:S3cr3tPswd!" | base64
bm9ib2R5Lm5vb25lQG5vd2hlcmUuY29tOlMzY3IzdFBzd2Qh
A slightly fuller example demonstrating saving it as an env var and using it in a curl -H (--header) argument:
$ DB_BASIC_AUTH_CREDS=$(echo -n "[email protected]:S3cr3tPswd!" | base64)
$ curl -v -H "Authorization: Basic ${DB_BASIC_AUTH_CREDS}" -X GET "https://…./api/……."
Using a Bearer token authorization header
Once you have an access token, it's basically just cut-and-paste into the "Authorization: Bearer …" header.
$ DATABRICKS_XXX_TOKEN=$(<some command to get the token>)
$
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" -X GET "https://…../api/……"
So where do you get these access tokens? First you will have to authenticate to the relevant identity service your Databricks environment is using, with a password or service principal secret, and then ask it to issue one. If you are already logged in to the identity service enabled as SSO for your Databricks account you might be able to get a new token without entering the password again, but the point remains: any trusted connection with an identity service that will provide HTTP access tokens is, ultimately, established by password or secret-key confirmation at some prior time, whether that was one millisecond or a whole day earlier. (In theory it could also be another type of secret such as a hardware key, but I'm not aware of any Databricks practitioners doing this.)
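To make that concrete, here is a sketch of one way to do it for a service principal on an AWS-hosted account, using Databricks' OAuth client-credentials flow, and saving the result into the same DATABRICKS_XXX_TOKEN variable used above. The account ID, client ID and secret are placeholders, jq is only used to pull out the field, and you should double-check the exact token endpoint path for your cloud in the current documentation:
$ # Assumption: an AWS-hosted account and a service principal that has an
$ # OAuth secret. The account-level token endpoint exchanges the client
$ # ID/secret pair for a short-lived access token.
$ DATABRICKS_ACCOUNT_ID="<your account UUID>"
$ CLIENT_ID="<service principal application/client ID>"
$ CLIENT_SECRET="<service principal OAuth secret>"
$
$ DATABRICKS_XXX_TOKEN=$(curl -s -u "${CLIENT_ID}:${CLIENT_SECRET}" \
    -d "grant_type=client_credentials&scope=all-apis" \
    "https://accounts.cloud.databricks.com/oidc/accounts/${DATABRICKS_ACCOUNT_ID}/v1/token" \
    | jq -r '.access_token')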
This raises the question: if we have to log in with our precious, secret credentials at some point anyhow, why not use the "Basic" auth scheme all the time and keep things simple? The main answer is that access tokens with a short lifespan quickly become useless to any attacker who obtains them. An attacker who grabs your "Basic" authentication string, on the other hand, has a username and password that will presumably stay valid for a good while.
The options you have for getting an access token depend on which SSO (if any) is used, and whether the identity is a user or a service principal. Then there may be more than one choice of tool or method. The documentation is also split between cloud platform-specific sites (AWS, Azure, GCP). There are too many permutations to list them all, but I believe the list below covers most Databricks accounts out there in the userverse.
Note: The "No SSO" cases were tested on AWS. Presumably the login and access token generation methods referenced above would work for Databricks accounts hosted in GCP and Azure if there was some way to disable SSO for them.
Troubleshooting
The simplest endpoints to test
Account level
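For instance (AWS Account API hostname shown; the account ID and token variable are the placeholders from the earlier examples), listing the account's workspaces is a harmless read-only request that confirms the token and the account-admin role in one go:
$ # Account API smoke test: list workspaces (read-only, needs Account admin).
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://accounts.cloud.databricks.com/api/2.0/accounts/${DATABRICKS_ACCOUNT_ID}/workspaces"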
Workspace level
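On the workspace side, one low-risk smoke test I'd suggest is the SCIM "Me" endpoint, which simply reports which user or service principal the token belongs to (the workspace hostname below is a placeholder):
$ # Workspace API smoke test: "who am I" for the presented token.
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<your-workspace-host>/api/2.0/preview/scim/v2/Me"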
No HTTP response body? Might be a valid, empty HTTP 200 OK.
The HTTP response bodies from Databricks requests can be A) entirely empty, B) just a '{}' or '[]', which is very easy to miss on a screen already full of punctuation marks, or C) a nice, full JSON object or array.
So check the HTTP response code at the same time.
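One way to make that a habit is to have curl print the status code under the (possibly empty) body on every call, for example:
$ # -w appends text after the body; a bare "HTTP status: 200" with nothing
$ # above it means "success, empty body" rather than "something broke".
$ curl -s -w "\nHTTP status: %{http_code}\n" \
    -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://…../api/……"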
Silent treatment for Account API even when authenticated
You will get only empty responses, along with either a 401 or 403 response code, if you use the Account API with a token generated for a user or service principal who hasn't been granted the "Account admin" role. Databricks Personal Access Tokens won't work either, even if they're for the same username as an account admin.
Some errors come in well-formed JSON. Some don't.
If your first error message came back in a nice JSON object, well, cool. Most failed requests seem to produce them. But don't expect a JSON error-description object every time. E.g. at least some authentication rejections will return HTML.
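If a script is post-processing the responses, don't pipe them blindly into a JSON parser. A defensive sketch, using curl's content_type write-out variable and jq (both assumed to be available):
$ # Save the body to a temp file, capture the content type separately, and
$ # only pretty-print with jq when the server actually sent JSON.
$ RESPONSE_BODY=$(mktemp)
$ CONTENT_TYPE=$(curl -s -o "${RESPONSE_BODY}" -w "%{content_type}" \
    -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://…../api/……")
$ if [[ "${CONTENT_TYPE}" == application/json* ]]; then jq . "${RESPONSE_BODY}"; else cat "${RESPONSE_BODY}"; fi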
GET, PATCH, DELETE, PUT, POST, the whole KABOODLE
As you flip between Databricks endpoints, cutting and pasting your previous command to make the next one, it's easy to miss that the last endpoint wanted a POST request method while this one wants PATCH, etc. An HTTP 405 (Method Not Allowed) should be returned, but Databricks gives 404 instead. (As of Jan 2024.)
Query parameters go here, … or here, or there or there
If you have used one way of passing a parameter, you can't assume it applies everywhere. Depending on the Databricks REST endpoint being used it could go in the URL path itself, in a ?key=value query string appended to the URL, or in a field of the JSON request body. The examples below show one of each.
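A rough illustration with three Unity Catalog calls, each passing the interesting value in a different place (the workspace hostname, the catalog name "main" and the new catalog name are placeholders; the paths are from the 2.1 Unity Catalog API, but verify against the current reference):
$ # 1) In the URL path: fetch one catalog by its name.
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<your-workspace-host>/api/2.1/unity-catalog/catalogs/main"
$
$ # 2) As a ?key=value query string: list the schemas belonging to a catalog.
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<your-workspace-host>/api/2.1/unity-catalog/schemas?catalog_name=main"
$
$ # 3) In the JSON request body: create a new catalog.
$ curl -s -X POST -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"name": "my_new_catalog"}' \
    "https://<your-workspace-host>/api/2.1/unity-catalog/catalogs"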
Usually it is object IDs, not names, in URL args
Sometimes the user-created name is the argument used in the URL for a specific resource. E.g. /api/2.1/unity-catalog/catalogs/{name}, or /api/2.1/unity-catalog/storage-credentials/{name}.
But even when a resource has a unique, URL-safe name, most of the endpoints require numeric, hexadecimal or UUID IDs. E.g. a UUID is required for this one: /api/2.1/unity-catalog/metastores/{metastore_id}.
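So the usual two-step dance is: list the resources, fish out the ID belonging to the human-readable name you know, then use that ID in the per-resource URL. A sketch with jq (the metastore name "my-metastore" is a placeholder, and it assumes your identity is allowed to list metastores at all):
$ # Step 1: find the UUID of the metastore named "my-metastore".
$ METASTORE_ID=$(curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<your-workspace-host>/api/2.1/unity-catalog/metastores" \
    | jq -r '.metastores[] | select(.name == "my-metastore") | .metastore_id')
$
$ # Step 2: use the UUID, not the name, in the per-resource URL.
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "https://<your-workspace-host>/api/2.1/unity-catalog/metastores/${METASTORE_ID}"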
Working with both Account and Workspace APIs? A hostname oops = more HTTP 3xx and 4xx response codes
As you start automating a new workspace, you begin with Account API requests and then move to the Workspace API after the workspace has been created.
So it's natural during development that you will try the Account API, then the Workspace API, then back again. But as you're cutting and pasting, remember the Account API has one hostname and your workspace has another. If you accidentally use one where you meant the other you'll get authentication errors, authorization errors, or resource-not-found (= 'wrong URL path, stupid') errors. An HTTP 401 response is the instant result if you use a workspace PAT against the Account API.
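One low-tech defence is to keep both hostnames in clearly named variables and never paste a literal host into a command. A sketch (the AWS Account API host is shown; your workspace host will be different, and the variable names are just suggestions):
$ # Account API host (the same for AWS-hosted accounts):
$ DATABRICKS_ACCOUNT_HOST="https://accounts.cloud.databricks.com"
$ # Workspace API host (one per workspace):
$ DATABRICKS_WORKSPACE_HOST="https://<your-workspace-host>"
$
$ # Account-level calls go to the account host...
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "${DATABRICKS_ACCOUNT_HOST}/api/2.0/accounts/${DATABRICKS_ACCOUNT_ID}/workspaces"
$ # ...workspace-level calls go to the workspace host.
$ curl -s -H "Authorization: Bearer ${DATABRICKS_XXX_TOKEN}" \
    "${DATABRICKS_WORKSPACE_HOST}/api/2.0/preview/scim/v2/Me"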