Using a Watchdog in a Multi-Task (RTOS) Environment

2021-7-5 13:24

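I came across an article on SEGGER's blog about using a watchdog in a real-time operating system. Starting from a failed space project, it analyzes what a watchdog does and how to use it. I found it very instructive, so I have translated it and recommend it to my fellow engineers. For ease of reading I have added notes on some aerospace terms; readers can also look up more detailed explanations themselves. My own ability is limited, so please point out anything I got wrong. You can also read the original: https://blog.segger.com/using-a-watchdog-in-a-multi-task-rtos-environment/. The author is Til Stork; the full text follows: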

Written by Til Stork on August 9th, 2017.
Clementine, a NASA satellite to test sensors and spacecraft components under extended exposure to the space environment, was launched on 25 January 1994. For the lack of a few lines of watchdog code, her mission was lost on 7 May 1994.

Clementine had performed lunar mapping for approximately two consecutive months when she left lunar orbit and headed for her next target, the near-Earth asteroid Geographos. Soon, however, a malfunction occurred in one of Clementine’s on-board computers, effectively cutting NASA off from operating the spacecraft and causing one of its thrusters to fire uncontrolled.
NASA spent 20 minutes trying to bring the system back to life, but to no avail. A hardware reset command finally brought Clementine back online, but it was too late: she had already used up all of her fuel, and the mission’s continuation had to be canceled.
Subsequently, the development team responsible for Clementine’s software wished they had used the hardware’s watchdog timer, when it became evident that the software timeouts they had implemented had been insufficient.

How could a watchdog have helped?

A watchdog is a piece of hardware that is either integrated directly into a microcontroller or attached to it externally. Its main purpose is to perform error handling (usually a hardware reset) when it can safely assume that the system has hung or is otherwise executing improperly.
A watchdog’s main component is a counter that is initially configured with a certain value and subsequently counts down to zero. The software must frequently reset this counter to its initial value to ensure that it never reaches zero. Otherwise, a malfunction is assumed and, usually, the CPU will be reset. This makes the watchdog a last resort, an option taken only when everything else has failed, as could have been the case with Clementine.
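
The countdown behaviour described above can be modelled in a few lines of C. This is only a software simulation for illustration, not code from the article: the type `SimWatchdog` and the functions `sim_wd_init`, `sim_wd_feed` and `sim_wd_tick` are hypothetical names, and a real watchdog performs the countdown in hardware and resets the CPU instead of setting a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated watchdog: a down-counter that must be reloaded before it hits zero.
   All names here are hypothetical and exist only for this sketch. */
typedef struct {
    uint32_t reload;   /* initial value restored on every feed */
    uint32_t counter;  /* current count, decremented once per timer tick */
    bool     expired;  /* latched once the counter reaches zero */
} SimWatchdog;

void sim_wd_init(SimWatchdog *wd, uint32_t reload) {
    wd->reload = reload;
    wd->counter = reload;
    wd->expired = false;
}

/* "Feeding" the watchdog: restore the counter to its initial value.
   Once expired, feeding no longer helps (real hardware would already have reset). */
void sim_wd_feed(SimWatchdog *wd) {
    if (!wd->expired) {
        wd->counter = wd->reload;
    }
}

/* Advance the simulation by one timer tick; returns true once expired. */
bool sim_wd_tick(SimWatchdog *wd) {
    if (!wd->expired) {
        if (wd->counter > 0) {
            wd->counter--;
        }
        if (wd->counter == 0) {
            wd->expired = true;
        }
    }
    return wd->expired;
}
```

With a reload value of 3, the simulated watchdog survives as long as `sim_wd_feed` is called at least once every two ticks, and latches `expired` on the third tick without a feed.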

How to feed the watchdog

Properly using a watchdog timer, however, is not as simple as restarting the counter (a process often referred to as “feeding” or “kicking” the watchdog). With a watchdog timer running in their system, developers must carefully choose the watchdog’s timeout period so the watchdog can intervene before a malfunctioning system can perform any irreversible, harmful actions.

In simple applications, specifically without the use of an RTOS, developers would usually feed the watchdog from the main loop. This approach merely requires configuration of an appropriate initial counter value, which can be as simple as choosing any value that exceeds the worst-case execution time of the entire main loop by at least one timer cycle. This often is a fairly robust approach: While some systems will require immediate recovery, others merely need to ensure they are not hung indefinitely – and this will definitely get the job done.
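
As a rough sketch of the main-loop approach, the helper below picks a counter value that exceeds the worst-case loop time by one timer cycle, and the loop feeds the watchdog once per complete pass. The names `wd_reload_cycles` and `main_loop` are hypothetical, the callbacks stand in for real application work and a real hardware feed, and the loop is bounded by `iterations` only so the sketch can be exercised on a host; an embedded main loop would run forever.

```c
#include <stdint.h>

/* Smallest reload value (in watchdog timer cycles) that exceeds the worst-case
   main-loop time by at least one cycle. Returns 0 if wd_cycle_us is 0. */
uint32_t wd_reload_cycles(uint32_t worst_case_loop_us, uint32_t wd_cycle_us) {
    if (wd_cycle_us == 0u) {
        return 0u; /* invalid configuration */
    }
    /* Round the worst-case time up to whole cycles (64-bit to avoid overflow). */
    uint64_t cycles = ((uint64_t)worst_case_loop_us + wd_cycle_us - 1u) / wd_cycle_us;
    return (uint32_t)cycles + 1u; /* one extra cycle of margin */
}

typedef void (*VoidFn)(void);

/* Main loop without an RTOS: the watchdog is fed only after a full pass
   through the application work has completed. */
void main_loop(VoidFn do_work, VoidFn feed_watchdog, uint32_t iterations) {
    for (uint32_t i = 0; i < iterations; i++) {
        do_work();        /* all application work for one pass */
        feed_watchdog();  /* reached only if do_work() returned */
    }
}
```

For a worst-case loop time of 9500 µs and a 1000 µs watchdog cycle, `wd_reload_cycles` yields 11 cycles: 10 to cover the loop, plus one cycle of margin.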

In a multitask (RTOS) environment

In more complex systems, however, specifically multi-tasking systems, various threads could potentially hang on various occasions and for various reasons. For some threads it is fine not to run for long periods, such as a thread waiting for potential network communication. A clean method to feed the watchdog periodically, while still ensuring that each distinct process is in good health, became a major challenge for developers of these systems, who for example need to focus on:

Whether the OS is executing properly
Whether high-priority tasks are exhausting the CPU, preventing low-priority tasks from running at all
Whether a deadlock has occurred that inhibits the execution of one or several tasks
Whether a task routine is executing properly and entirely
Developers also need to ensure that any modification made to their source code, whether it be a dedicated watchdog task or specific modifications to the monitored tasks, is small and optimized for efficiency in order to keep intrusiveness to a minimum.

Utilize the watchdog support of your RTOS

For this reason, state-of-the-art RTOSes like SEGGER’s embOS offer comprehensive watchdog solutions to their customers in order to simplify watchdog handling and thereby reduce the time spent on development.

The general principles applied with these solutions may vary between different RTOSes. At SEGGER, however, versatility and ease of use are deemed of capital importance, while still keeping the required footprint to a minimum in both memory usage and execution time. To the embedded experts it was therefore evident that a comprehensive set of API functions was required that allows for both

the individual registration of tasks, timers, and even ISRs with the underlying embOS watchdog module, as well as
the possibility to test the intended watchdog conditions flexibly from any desired context.
The final implementation now consists of a mere five API functions, yet is powerful enough to suffice for any intended purpose.
Using these API functions, a task would simply register itself with the embOS watchdog module and would simultaneously configure its timeout period individually. The task could then signal its proper execution periodically by calling one simple embOS API function. Whether all monitored tasks have signaled their proper execution within their specified timeout periods is subsequently checked by another single embOS API call, which may be performed from within a dedicated watchdog task, from within OS_Idle(), or even from within the periodic OS timer interrupt service routine or any other ISR.
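
The register / signal / check pattern described above can be sketched generically as follows. This is not the embOS API: `Monitor`, `mon_register`, `mon_checkin` and `mon_all_alive` are hypothetical names chosen for this illustration, time is passed in as a plain tick count, and a real implementation would also need to protect the table against concurrent access from tasks and ISRs.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_MONITORED 8 /* arbitrary table size for this sketch */

typedef struct {
    uint32_t timeout;      /* allowed ticks between two check-ins */
    uint32_t last_checkin; /* tick count of the most recent check-in */
    bool     in_use;
} Monitor;

static Monitor monitors[MAX_MONITORED];

/* Register a task, timer or ISR with its own timeout.
   Returns a handle, or -1 if the table is full. */
int mon_register(uint32_t timeout, uint32_t now) {
    for (int i = 0; i < MAX_MONITORED; i++) {
        if (!monitors[i].in_use) {
            monitors[i].timeout = timeout;
            monitors[i].last_checkin = now;
            monitors[i].in_use = true;
            return i;
        }
    }
    return -1;
}

/* Called periodically by the monitored code to signal proper execution. */
void mon_checkin(int handle, uint32_t now) {
    if (handle >= 0 && handle < MAX_MONITORED && monitors[handle].in_use) {
        monitors[handle].last_checkin = now;
    }
}

/* True only if every registered entry checked in within its own timeout.
   Unsigned subtraction keeps the comparison valid across tick wrap-around. */
bool mon_all_alive(uint32_t now) {
    for (int i = 0; i < MAX_MONITORED; i++) {
        if (monitors[i].in_use &&
            (uint32_t)(now - monitors[i].last_checkin) > monitors[i].timeout) {
            return false;
        }
    }
    return true;
}
```

The checking context (a dedicated task, the idle hook, or a timer ISR) would call `mon_all_alive` and feed the hardware watchdog only when it returns true, so a single stalled task is enough to let the hardware counter run out.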

Users would merely need to provide and register two functions: the first performs the hardware-dependent feeding of the watchdog, while the other specifies further actions in case the watchdog counter reaches zero. For example, this allows a log file containing further information on the system status to be stored in non-volatile memory before performing a hardware reset or taking any other action.
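
A minimal sketch of the two user-supplied functions might look like the following. Again, this is a generic illustration rather than the embOS API: `wd_config`, `wd_service`, `example_feed` and `example_on_failure` are hypothetical names, and the example callbacks only count calls where real code would write a hardware register or store diagnostics in non-volatile memory.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*WdCallback)(void);

static WdCallback feed_hw_cb;    /* hardware-dependent feed, supplied by the user */
static WdCallback on_failure_cb; /* action to take when a monitored timeout expires */

/* Register the two user-supplied functions. */
void wd_config(WdCallback feed_hw, WdCallback on_failure) {
    feed_hw_cb = feed_hw;
    on_failure_cb = on_failure;
}

/* Called from the checking context with the result of the health check. */
void wd_service(bool all_tasks_healthy) {
    if (all_tasks_healthy) {
        if (feed_hw_cb != NULL) {
            feed_hw_cb();
        }
    } else if (on_failure_cb != NULL) {
        on_failure_cb(); /* e.g. save diagnostics before the reset happens */
    }
}

/* Example callbacks: stand-ins for real hardware and storage access. */
unsigned feed_count;
unsigned failure_count;

void example_feed(void) {
    feed_count++; /* real code would write the watchdog reload register */
}

void example_on_failure(void) {
    failure_count++; /* real code would log system state to non-volatile memory */
}
```

Keeping the hardware access behind two callbacks means the monitoring logic stays portable, while only these two small functions change from one microcontroller to the next.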

Conclusion

When starting to design and develop an application with a watchdog, make sure you decide early on how you intend to use it – and consider the available tools that will aid you in achieving it more swiftly. At least, you wouldn’t want to get stranded in space, would you?

2021-7-5 13:25

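Clementine was a satellite launched by NASA on 25 January 1994 to test sensors and spacecraft components in the space environment. For lack of a few lines of watchdog code, her mission was lost on 7 May 1994.
[Note] Clementine was a US spacecraft, officially named the Deep Space Program Science Experiment, launched jointly by NASA and the Ballistic Missile Defense Organization.
      After two consecutive months of lunar mapping, Clementine left lunar orbit and headed for her next target, the near-Earth asteroid Geographos. Soon, however, one of Clementine's on-board computers malfunctioned, effectively cutting NASA off from operating the spacecraft and causing one of its thrusters to fire uncontrolled.
[Note] Geographos: asteroid 1620, an Apollo-type asteroid that comes within a little over 4 million km of Earth at its closest. Its shape is a very regular elongated bar, with a length-to-width ratio of 4 to 5.
      NASA spent 20 minutes trying to bring the system back to life, but to no avail. A hardware reset command finally brought Clementine back online, but it was too late: she had already used up all of her fuel, and the continuation of the mission had to be canceled.
      When it became evident that the software timeouts they had implemented were insufficient, the development team responsible for Clementine's software wished they had used the hardware's watchdog timer.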
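What does a watchdog do?
      A watchdog is a piece of hardware that is either integrated directly into a microcontroller or attached to it externally. Its main purpose is to perform error handling (usually a hardware reset) when it can safely be assumed that the system has hung or is otherwise executing improperly.
      A watchdog's main component is a counter that is initially configured with a certain value and then counts down to zero. The software must frequently reset this counter to its initial value to ensure that it never reaches zero. Otherwise a malfunction is assumed and, usually, the CPU is reset. This makes the watchdog a last resort, an option taken only when everything else has failed, as could have been the case with Clementine.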
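How to feed the watchdog
      Properly using a watchdog timer, however, is not as simple as restarting the counter (a process often called "feeding" or "kicking" the watchdog). With a watchdog timer running in their system, developers must carefully choose the watchdog's timeout period so that the watchdog can intervene before a malfunctioning system can perform any irreversible, harmful actions.
      In simple applications, specifically those without an RTOS, developers usually feed the watchdog from the main loop. This approach merely requires configuring an appropriate initial counter value, which can be as simple as choosing any value that exceeds the worst-case execution time of the entire main loop by at least one timer cycle. This is often a fairly robust approach: while some systems require immediate recovery, others merely need to ensure that they do not hang indefinitely, and this approach achieves that purpose well.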
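Feeding the watchdog in a multi-task (RTOS) environment
      In more complex systems, however, specifically multi-tasking systems, various threads may hang for various reasons. For some threads it is fine not to run for a long time, such as a thread waiting for potential network communication. Finding a clean way to feed the watchdog periodically while still ensuring that each distinct process is in good health became a major challenge for developers of these systems, who for example need to focus on:

Whether the operating system is executing properly
Whether high-priority tasks are exhausting the CPU, preventing low-priority tasks from running at all
Whether a deadlock has occurred that prevents one or more tasks from executing
Whether a task routine is executing properly and entirely

      Developers also need to ensure that any modification made to their source code (whether a dedicated watchdog task or specific modifications to the monitored tasks) is small and optimized for efficiency, in order to keep intrusiveness to a minimum.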
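Use the watchdog support of your RTOS
      For this reason, state-of-the-art RTOSes such as SEGGER's embOS offer their customers comprehensive watchdog solutions in order to simplify watchdog handling and thereby reduce the time spent on development.
      The general principles applied in these solutions may vary between different RTOSes. At SEGGER, however, versatility and ease of use are considered paramount, while the required footprint is kept to a minimum in both memory usage and execution time. To the embedded experts it was therefore evident that a comprehensive set of API functions was required that allows for both:

The individual registration of tasks, timers, and even ISRs with the underlying embOS watchdog module.
The possibility to test the intended watchdog conditions flexibly from any desired context.

      The final implementation now consists of only five API functions, yet is powerful enough to serve any intended purpose.
      Using these API functions, a task simply registers itself with the embOS watchdog module and at the same time configures its own timeout period. The task can then periodically signal its proper execution by calling one simple embOS API function. Whether all monitored tasks have signaled their proper execution within their specified timeout periods is subsequently checked by another single embOS API call, which may be performed from within a dedicated watchdog task, from within OS_Idle(), or even from within the periodic OS timer interrupt service routine or any other ISR.
      Users only need to provide and register two functions: the first performs the hardware-dependent feeding of the watchdog, while the other specifies further actions in case the watchdog counter reaches zero. For example, this allows a log file containing further information on the system status to be stored in non-volatile memory before performing a hardware reset or taking any other action.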
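Conclusion
      When starting to design and develop an application with a watchdog, make sure you decide early on how you intend to use it, and consider the available tools that can help you achieve it more swiftly. After all, you would not want to get stranded in space, would you?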

2021-7-5 13:38

Learned something, thanks.

2021-7-6 13:29

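On watchdog design in multi-task environments: many years ago (2006, I think) an old thread on 21IC discussed feeding the watchdog from within a timer interrupt routine.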
