Automated software crash recovery in hyperconverged systems using centralized knowledge database
Abstract:
A system and method for detecting and fixing crashes in a cluster environment. The method includes detecting a crash; generating a call trace of the crash; generating a crash ID based on the call trace; and checking whether the crash ID matches a known crash ID in a knowledge base. When the crash ID matches, an automatic recovery procedure is applied, including any of (a) restarting a service that caused the crash; (b) removing and replacing a software package that caused the crash; (c) updating software that caused the crash; and (d) rebooting the machine where the crash occurred. When the crash ID does not match, the method includes (a) collecting logs on the machine where the crash occurred; (b) collecting logs from any virtual environments on that machine; and (c) sending the generated crash ID and the logs to the knowledge base.
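The match-or-collect flow described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not say how the crash ID is derived from the call trace, so hashing normalized frames with SHA-256 is an assumption here, and the names `crash_id_from_trace`, `handle_crash`, `RecoveryAction`, and the dictionary-backed knowledge base are all hypothetical.

```python
import hashlib
import re
from enum import Enum


class RecoveryAction(Enum):
    """The four recovery procedures listed in the abstract."""
    RESTART_SERVICE = "restart_service"
    REPLACE_PACKAGE = "replace_package"
    UPDATE_SOFTWARE = "update_software"
    REBOOT_MACHINE = "reboot_machine"


def crash_id_from_trace(call_trace: str) -> str:
    """Derive a crash ID from a call trace (assumed scheme: SHA-256 of normalized frames)."""
    frames = []
    for line in call_trace.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: hex addresses and offsets vary between runs, so drop them
        # to make the same crash produce the same ID.
        line = re.sub(r"0x[0-9a-fA-F]+", "", line)
        frames.append(" ".join(line.split()))
    digest = hashlib.sha256("\n".join(frames).encode("utf-8")).hexdigest()
    return digest[:16]


def handle_crash(call_trace, knowledge_base, collect_logs):
    """Look up the crash ID; return a recovery action if known, otherwise collected logs.

    knowledge_base: hypothetical mapping of crash ID -> RecoveryAction.
    collect_logs: hypothetical callable returning logs from the machine
    and any virtual environments on it.
    """
    crash_id = crash_id_from_trace(call_trace)
    action = knowledge_base.get(crash_id)
    if action is not None:
        return {"crash_id": crash_id, "known": True, "action": action}
    # Unknown crash: gather logs so they can be sent to the knowledge base.
    return {"crash_id": crash_id, "known": False, "logs": collect_logs()}
```

In this sketch a real deployment would replace the dictionary with a query to the centralized knowledge base and would dispatch the returned action to the affected machine; both steps are left out because the abstract does not specify them.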