CF Push: ERR Downloading Failed

What you easily see are usually consequences or symptoms, not the root cause; the root cause always takes some digging.


"The Dev environment is down!!!" We heard louder screams from developers when the busy dev env was down than when the production environment was down. We dropped whatever we were doing and wanted to stop those screams as fast as we can. We pushed an app and saw the following logs:

“测试环境挂了!” 每当测试环境出问题,我们的工程师吼的比产品线挂掉还要厉害。这时候我们会放下手中正在干的活,争取在最短的时间让工程师们安静下来。我们上传一个应用,看到下面的错误日志。

[cell/0] Creating container for app xx, container successfully created
[cell/0] ERR Downloading Failed
[cell/0] OUT cell-xxxxx stopping instance, destroying the container
[api/0] OUT process crashed with type: "web"
[api/0] OUT app instance exited
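(If you want to replay these logs yourself after a failed push, the cf CLI can do that; the app name below is just a placeholder.)

$ cf logs my-app --recent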

It told us that "Downloading Failed", but it did not tell us what had failed to download. With some knowledge of how an app is pushed, staged, and run, we could easily guess that it was the droplet (the staged, packaged result of the build, ready to run in a container) whose download had failed: if everything had worked as expected, the next step would have been the cell downloading the droplet and then running the app in the container it had just created. However, we still did not know the root cause of "Downloading Failed".
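As a quick sanity check, you can also ask the Cloud Controller whether staging actually produced a droplet for the app. The app name below is a placeholder, and this assumes the v3 API is available on your foundation:

$ guid=$(cf app my-app --guid)
$ cf curl /v3/apps/$guid/droplets/current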

That is where the fun comes from; that is our chance to feel smart again by figuring it out! :)



We ran "bosh ssh" to the cell node and looked at the logs, bad tls showed up in the log entries. With this bad tls information, we knew that the certificates had some issues. Unfortunately, the logs will never tell you exactly which certificates are the problematic ones.

运行 "bosh ssh" 命令登录到Cell 节点,查看上面更详细的日志,赫然发现 bad tls。有了这个关键信息,我们知道可能是证书惹的祸。可惜,日志不会告诉你证书到底怎么惹的祸。

In our case, all of the certificates were stored in Vault, and we used the safe CLI tool to manage them. safe has a command, "safe x509 validate [path to the cert]", which we can use to inspect and validate certificates. With a simple script, we looped through all of the certificates used in the misbehaving CF environment and ran "safe x509 validate" against each one.
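The script was essentially just a loop like the sketch below. The path list is abbreviated to two of the entries from this environment (prepend whatever Vault prefix your secrets live under), and it assumes "safe x509 validate" reports anything wrong with each cert, including expiry:

#!/bin/bash
# Sketch: validate every certificate path used by this CF environment.
cert_paths="
api/cf_networking/policy_server_internal/server_cert
cell/rep/tls/cert
"
for path in $cert_paths; do
  echo "==> $path"
  safe x509 validate "$path"
done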

The output told us that the following certificates were expired (root cause!!!):

api/cf_networking/policy_server_internal/server_cert
syslogger/scalablesyslog/adapter/tls/cert
syslogger/scalablesyslog/adapter_rlp/tls/cert
loggregator_trafficcontroller/reverse_log_proxy/loggregator/tls/reverse_log_proxy/cert
bbs/silk-controller/cf_networking/silk_controller/server_cert
bbs/silk-controller/cf_networking/silk_daemon/server_cert
bbs/locket/tls/cert
diego/scheduler/scalablesyslog/scheduler/tls/api/cert
diego/scheduler/scalablesyslog/scheduler/tls/client/cert
cell/rep/diego/rep/server_cert
cell/rep/tls/cert
cell/vxlan-policy-agent/cf_networking/vxlan_policy_agent/client_cert
cell/silk-daemon/cf_networking/silk_daemon/client_cert

If you are not using safe (though I highly recommend it), you can also use the openssl command, or other similar tools, to view the validity dates of your certificates:

$ openssl x509 -noout -dates -in cert_file
notBefore=Jul 13 22:25:49 2018 GMT
notAfter=Jul 12 22:25:49 2019 GMT
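openssl can also answer "is this certificate about to expire?" directly: the -checkend flag takes a number of seconds and returns a non-zero exit code if the certificate will have expired by then. A small loop like the following sketch (file names are illustrative) flags anything expiring within 30 days:

for f in certs/*.pem; do
  openssl x509 -checkend $((30*24*3600)) -noout -in "$f" \
    || echo "expiring soon (or already expired): $f"
done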

We then ran "safe x509 renew" against all of the expired certificates. After double-checking that every one of them had been successfully renewed, we redeployed CF in order to roll out the updated certificates.
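The renewal itself was just another loop over the expired paths, followed by a redeploy. Here $expired_cert_paths stands for the list reported above, and the deployment and manifest names are illustrative:

$ for path in $expired_cert_paths; do safe x509 renew "$path"; done
$ for path in $expired_cert_paths; do safe x509 validate "$path"; done
$ bosh -d cf deploy cf-manifest.yml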

The redeployment went well for the most part, except when it came to the cell instances: it hung on the first one forever. We tried "bosh redeploy" with the "--skip-drain" flag; unfortunately, that did not solve the issue either. We then noticed that the certificates on the bbs node had been successfully updated to the new ones, while the certificates on the cell nodes were still the old ones. Hmm... so the bbs and the cells could no longer talk to each other.
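For completeness, the skip-drain attempt was simply the same deploy with the flag added (names illustrative, as before):

$ bosh -d cf deploy cf-manifest.yml --skip-drain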

Everyone needs a little help sometimes, and so does BOSH.


Without digging further into what exactly was making the cell update hang forever, we decided to give BOSH a little help. We ran "bosh ssh" to the cell that was hanging, replaced all of the expired certificates in its config files manually, and then ran "monit restart all" on the cell. That was enough to nudge the "bosh redeploy" into moving forward happily. We got a happily running dev CF back, and the world finally quieted down.
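Roughly, the manual nudge looked like the sketch below. The exact certificate file locations vary by job, but BOSH job configuration conventionally lives under /var/vcap/jobs/&lt;job&gt;/config, so that is where we looked (instance names are illustrative):

$ bosh -d cf ssh cell/0
cell/0$ sudo grep -rl "BEGIN CERTIFICATE" /var/vcap/jobs/*/config
# paste the renewed certs from Vault over the expired ones in those files, then:
cell/0$ sudo /var/vcap/bosh/bin/monit restart all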

The story should never end here, because a good engineer will always try to fix a problem before it becomes a real issue.


Our awesome coworker Tom Mitchell wrote Doomsday.

Doomsday is a server (and also a CLI) which can be configured to track certificates from different storage backends (Vault, Credhub, Pivotal Ops Manager, or actual websites) and provide a tidy view, including a web dashboard, of which certificates will expire soon.

Deploy Doomsday, rotate your certs before they expire, and live a happier life!
