CF Push: ERR Downloading Failed

What you easily see are usually consequences or symptoms, not the root cause; the root cause always takes some digging.


"The Dev environment is down!!!" We heard louder screams from developers when the busy dev env was down than when the production environment was down. We dropped whatever we were doing and wanted to stop those screams as fast as we can. We pushed an app and saw the following logs:

“测试环境挂了!” 每当测试环境出问题,我们的工程师吼的比产品线挂掉还要厉害。这时候我们会放下手中正在干的活,争取在最短的时间让工程师们安静下来。我们上传一个应用,看到下面的错误日志。

[cell/0] Creating container for app xx, container successfully created
[cell/0] ERR Downloading Failed
[cell/0] OUT cell-xxxxx stopping instance, destroying the container
[api/0] OUT process crashed with type: "web"
[api/0] OUT app instance exited
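(If you want to replay these logs yourself after a failed push, the cf CLI can do that; the app name below is just a placeholder.)

$ cf logs my-app --recent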

It told us that "Downloading Failed", but it did not tell us what had failed to download. With some knowledge of how an app is pushed, staged, and run, we could easily guess that it was the droplet (the staged, packaged result of the build, ready to run in a container) whose download had failed: if everything had worked as expected, the next step would have been the cell downloading the droplet and then running the app in the container it had just created. However, we still did not know the root cause of "Downloading Failed".
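As a quick sanity check, you can also ask the Cloud Controller whether staging actually produced a droplet for the app. The app name below is a placeholder, and this assumes the v3 API is available on your foundation:

$ guid=$(cf app my-app --guid)
$ cf curl /v3/apps/$guid/droplets/current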

That is where the fun comes from; that is our chance to feel smart again by figuring it out! :)



We ran "bosh ssh" to the cell node and looked at the logs, bad tls showed up in the log entries. With this bad tls information, we knew that the certificates had some issues. Unfortunately, the logs will never tell you exactly which certificates are the problematic ones.

运行 "bosh ssh" 命令登录到Cell 节点,查看上面更详细的日志,赫然发现 bad tls。有了这个关键信息,我们知道可能是证书惹的祸。可惜,日志不会告诉你证书到底怎么惹的祸。

In our case, all of the certificates were stored in Vault, and we used the safe CLI tool to manage them. safe has a command, "safe x509 validate [path to the cert]", which we can use to inspect and validate certificates. With a simple script, we looped through all of the certificates used in the misbehaving CF environment and ran "safe x509 validate" against each one.
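The script was essentially just a loop like the sketch below. The path list is abbreviated to two of the entries from this environment (prepend whatever Vault prefix your secrets live under), and it assumes "safe x509 validate" reports anything wrong with each cert, including expiry:

#!/bin/bash
# Sketch: validate every certificate path used by this CF environment.
cert_paths="
api/cf_networking/policy_server_internal/server_cert
cell/rep/tls/cert
"
for path in $cert_paths; do
  echo "==> $path"
  safe x509 validate "$path"
done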

The output told us that the following certificates were expired (root cause!!!):

api/cf_networking/policy_server_internal/server_cert
syslogger/scalablesyslog/adapter/tls/cert
syslogger/scalablesyslog/adapter_rlp/tls/cert
loggregator_trafficcontroller/reverse_log_proxy/loggregator/tls/reverse_log_proxy/cert
bbs/silk-controller/cf_networking/silk_controller/server_cert
bbs/silk-controller/cf_networking/silk_daemon/server_cert
bbs/locket/tls/cert
diego/scheduler/scalablesyslog/scheduler/tls/api/cert
diego/scheduler/scalablesyslog/scheduler/tls/client/cert
cell/rep/diego/rep/server_cert
cell/rep/tls/cert
cell/vxlan-policy-agent/cf_networking/vxlan_policy_agent/client_cert
cell/silk-daemon/cf_networking/silk_daemon/client_cert

If you are not using safe (though I highly recommend it), you can also use the openssl command, or other similar tools, to view the validity dates of your certificates:

$ openssl x509 -noout -dates -in cert_file
notBefore=Jul 13 22:25:49 2018 GMT
notAfter=Jul 12 22:25:49 2019 GMT
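openssl can also answer "is this certificate about to expire?" directly: the -checkend flag takes a number of seconds and returns a non-zero exit code if the certificate will have expired by then. A small loop like the following sketch (file names are illustrative) flags anything expiring within 30 days:

for f in certs/*.pem; do
  openssl x509 -checkend $((30*24*3600)) -noout -in "$f" \
    || echo "expiring soon (or already expired): $f"
done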

We then ran "safe x509 renew" against all of the expired certificates. After double-checking that every one of them had been successfully renewed, we redeployed CF in order to roll out the updated certificates.
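The renewal itself was just another loop over the expired paths, followed by a redeploy. Here $expired_cert_paths stands for the list reported above, and the deployment and manifest names are illustrative:

$ for path in $expired_cert_paths; do safe x509 renew "$path"; done
$ for path in $expired_cert_paths; do safe x509 validate "$path"; done
$ bosh -d cf deploy cf-manifest.yml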

The redeployment went well for the most part, except when it came to the cell instances: it hung on the first one forever. We tried "bosh redeploy" with the "--skip-drain" flag; unfortunately, that did not solve the issue either. We then noticed that the certificates on the bbs node had been successfully updated to the new ones, while the certificates on the cell nodes were still the old ones. Hmm... so the bbs and the cells could no longer talk to each other.
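For completeness, the skip-drain attempt was simply the same deploy with the flag added (names illustrative, as before):

$ bosh -d cf deploy cf-manifest.yml --skip-drain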

Everyone needs a little help sometimes, and so does BOSH.


Without digging further into what exactly was making the cell update hang forever, we decided to give BOSH a little help. We ran "bosh ssh" to the cell that was hanging, replaced all of the expired certificates in its config files manually, and then ran "monit restart all" on the cell. That was enough to nudge the "bosh redeploy" into moving forward happily. We got a happily running dev CF back, and the world finally quieted down.
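Roughly, the manual nudge looked like the sketch below. The exact certificate file locations vary by job, but BOSH job configuration conventionally lives under /var/vcap/jobs/&lt;job&gt;/config, so that is where we looked (instance names are illustrative):

$ bosh -d cf ssh cell/0
cell/0$ sudo grep -rl "BEGIN CERTIFICATE" /var/vcap/jobs/*/config
# paste the renewed certs from Vault over the expired ones in those files, then:
cell/0$ sudo /var/vcap/bosh/bin/monit restart all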

The story should never end here, because a good engineer will always try to fix a problem before it becomes a real issue.


Our awesome coworker Tom Mitchell wrote Doomsday.

Doomsday is a server (and also a CLI) which can be configured to track certificates from different storage backends (Vault, Credhub, Pivotal Ops Manager, or actual websites) and provide a tidy view, including a web dashboard, of which certificates will expire soon.

Deploy Doomsday, rotate your certs before they expire, and live a happier life!
