Why did Facebook's server rooms lock people out?
Avishai (Avi) Moscovich PE, P.Eng.
Marketing & Biz Dev | Brand & Community Builder | Strategist
Did the Facebook outage last for several hours due to the poor design of ... doors?
Some thoughts from @heckalyptus (Twitter) on one of the most important and least sexy systems in the world: the access control system.
According to many reports during and after the latest Facebook outage, employees were locked out of Facebook offices in several cities, and IT crews had to saw through doors to get into server rooms. It is impossible to know whether this actually happened, but the fear it created about doors being hacked and remotely controlled is real.
Allegedly, the reason such a thing happened lies in Facebook's access control system. An access control system consists of the components that manage the electric locks on the organization's doors and decide when and for whom to open them. The system also keeps a detailed record of each door that was opened and who opened it, for later investigation if necessary.
To do this, the system has a database that contains two important tables: an identity table (users) and a door table. The system administrator defines which doors each user is allowed to pass through. When the user tries to pass through a door by identifying with a proximity tag, fingerprint or face, the system checks whether that user is allowed through and opens the lock accordingly.
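As a rough illustration of the lookup described above, here is a minimal Python sketch. The class name, field names and the in-memory layout are hypothetical and chosen only for this example; a real product would use its own schema and storage. The sketch also logs denied attempts, which is an assumption beyond what is described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessControlDB:
    """Toy stand-in for the users table, the doors table and the audit log."""
    # user_id -> set of door_ids the administrator has granted (hypothetical layout)
    permissions: dict = field(default_factory=dict)
    # all door_ids known to the system
    doors: set = field(default_factory=set)
    # one (timestamp, user_id, door_id, allowed) tuple per entry attempt
    audit_log: list = field(default_factory=list)

    def grant(self, user_id, door_id):
        """Administrator action: allow user_id to pass through door_id."""
        self.doors.add(door_id)
        self.permissions.setdefault(user_id, set()).add(door_id)

    def request_entry(self, user_id, door_id):
        """Check the tables, record the attempt, and say whether to open."""
        allowed = door_id in self.permissions.get(user_id, set())
        # Assumption: denied attempts are logged too, not only openings.
        self.audit_log.append(
            (datetime.now(timezone.utc), user_id, door_id, allowed)
        )
        return allowed


db = AccessControlDB()
db.grant("alice", "server-room")
print(db.request_entry("alice", "server-room"))  # granted user
print(db.request_entry("bob", "server-room"))    # unknown user
```

Unknown users and unknown doors are simply denied, which mirrors the "block by default" behaviour discussed in this article.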
At first glance, it makes sense that a failure of Facebook's servers would lead to doors not opening: the door's controller cannot communicate with the main server where the tables are stored, so it does not know whether the user is allowed through, and it blocks entry by default.
In reality, however, access control manufacturers are well aware that network outages or server downtime are something that happens from time to time, so no manufacturer relies solely on the tables on the server. Each controller has an internal memory in which the tables relevant to it are stored so that it can open the door to an authorized user even in the event of a disconnect from the server.
When the communication to the server resumes, the controller will update its tables to the latest version that is on the server and will send the documentation of all the openings of the door that occurred during the disconnection period.
Thus, a disconnect between the server and the controller will prevent new users from being added and will stop door-opening events from being reported in real time, but it will not interfere with opening the doors for existing users.
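The offline behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Server` and `DoorController` classes and their method names are hypothetical, and real controllers are embedded devices with vendor-specific protocols.

```python
class Server:
    """Hypothetical central server holding the master tables and the log."""

    def __init__(self):
        self.permissions = {}  # door_id -> set of user_ids
        self.events = []       # (door_id, user_id) openings reported so far

    def users_for_door(self, door_id):
        return self.permissions.get(door_id, set())

    def record_events(self, events):
        self.events.extend(events)


class DoorController:
    """Hypothetical controller that keeps a local copy of its table."""

    def __init__(self, door_id):
        self.door_id = door_id
        self.allowed_users = set()  # local copy, usable while offline
        self.pending_events = []    # openings not yet sent to the server

    def request_entry(self, user_id):
        """Decide from local memory only; no server round trip needed."""
        allowed = user_id in self.allowed_users
        if allowed:
            self.pending_events.append((self.door_id, user_id))
        return allowed

    def sync(self, server):
        """Run when the link is back: pull the table, push buffered events."""
        self.allowed_users = set(server.users_for_door(self.door_id))
        server.record_events(self.pending_events)
        self.pending_events = []


server = Server()
server.permissions["lab"] = {"alice"}
controller = DoorController("lab")
controller.sync(server)

# Simulated outage: the server gains a new user, but nobody calls sync().
server.permissions["lab"].add("bob")
print(controller.request_entry("alice"))  # existing user, decided locally
print(controller.request_entry("bob"))    # new user, unknown until next sync
```

Once `controller.sync(server)` runs again, the new user is picked up and the buffered openings reach the server's log.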
If so, then why did the doors at Facebook not open during the outage? Even if the entire network went down, the doors were supposed to open according to the tables stored on the controllers. Something else happened here.
We may never really know what happened and why, but here are 3 theories for possible scenarios:
Scenario No. 1: Missing Identity
A company the size of Facebook has tens of thousands of active employees. At this size, a central identity management system is a necessity: it consolidates into one database all the components in the company that require identification, such as computer logins, access to systems and information, payroll, attendance, printers, and also access control.
Presumably, there is an integration between the central identity management system and the users table in the access control system. Thus, a new employee is added to the table immediately after being hired, and an employee who has left is removed from it, completely automatically.
In this scenario, the integration was poorly implemented. During the malfunction of Facebook's servers, the access control system received a table with 0 users from the central identity system, and instead of stopping and saying "hey, something is wrong here", it went ahead and deleted *all* users from its own table.
Afterwards, the system also sent the empty table to the doors, which immediately stopped opening because their table of authorized users no longer included anyone, and the workers were left outside.
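The missing "hey, something is wrong here" check could look roughly like the Python sketch below. The function name, the 20% threshold and the choice to raise an error are assumptions made for illustration; nothing here reflects how Facebook's or any vendor's integration actually works.

```python
def apply_identity_sync(current_users, incoming_users, max_removal_fraction=0.2):
    """Return the new user set, or refuse a sync that removes too many users.

    max_removal_fraction is a hypothetical safety threshold: if a single
    sync would delete more than this share of the existing users, the
    update is rejected and the current table is kept for a human to review.
    """
    current = set(current_users)
    incoming = set(incoming_users)
    removed = current - incoming
    if current and len(removed) / len(current) > max_removal_fraction:
        raise ValueError(
            f"sync would remove {len(removed)} of {len(current)} users; "
            "refusing to apply it"
        )
    return incoming


users = {f"user{i}" for i in range(10)}

# Normal day: one employee leaves and one joins, so the sync is applied.
users = apply_identity_sync(users, (users - {"user0"}) | {"user10"})

# Outage: the identity system returns an empty table, so the sync is refused.
try:
    users = apply_identity_sync(users, set())
except ValueError as err:
    print("sync rejected:", err)

print(len(users))  # the table still holds the last known good users
```

The design choice here is to fail safe toward the last known good table, so a broken upstream feed cannot silently empty the doors' user lists.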
Scenario No. 2: Panicked security
The reports were mostly rumours of employees being “locked out” of an office or server room. Facebook itself did not comment on the reports, and it is quite possible that the doors were locked by none other than the company's own internal security department.
Access control systems aimed at companies of this size often include an "exceptional scenario" mechanism that allows system administrators to run a series of predefined operations at the push of a button. Thus, the security manager can lock all the doors in case of a kidnapping or shooting incident, or open them all in case of an earthquake.
These scenarios are usually defined in advance and often not properly tested. Activating an incorrectly defined scenario, or activating the wrong scenario, may have led to the doors being locked.
In addition, it cannot be ruled out that the doors were locked intentionally, for unknown reasons. There may have been a fear of insider collaboration within the company, or another reason that has not been revealed.
Scenario No. 3: Objective: The server room
The server room of an internet company is the most important and sensitive resource it has. If the global outage was caused by an external or internal actor who obtained permissions high enough to disrupt the BGP routing of a company like Facebook, that actor probably has a very high level of hacking expertise.
Such an attacker would value the ability to block access to the server rooms. If the attacker also gained control of the access control server, they could delete the user tables and then lock the security department out of that server. Restoring the server could certainly take several hours, during which the doors would remain locked.
These are the scenarios that seem reasonable. Do you have other ideas?
Expertise & Credit: @heckalyptus (Twitter)
From Hebrew: https://twitter.com/heckalyptus/status/1446460968788168735?s=20
#Facebook #outage #accesscontrol