The need of an EDF scheduler in Linux has been already highlighted in the Documentation/scheduler/sched-rt-group.txt file, which says
"The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring full deadline scheduling to the linux kernel."
This implementation is part of a FP7 European project called ACTORS and financially supported by the European commission (see
http://www.actors-project.eu ).
The involved partners (which include Ericsson Research, Evidence Srl, AKAtech) strongly believe that a general-purpose operating system like Linux should provide a standard real-time scheduling policy (like the one implemented in this patch) still allowing to schedule non-real-time tasks in the usual way.
Why do we need a further scheduling class ? The existing scheduling classes (i.e., sched_fair and sched_rt), perform very well in their own domain of application. However,
1. they cannot provide the guarantees a time-sensitive application may require.
Using sched_fair, no concept of timing constraint (e.g., deadline) can be associated to tasks. In fact, although it is possible to assign a share of the processor time to a task (or a group of tasks), there is no way to specify that a task must execute for 20msec within 100msec, unlike using a real-time scheduling algorithm, such as EDF.
As a proof of concept, we implemented a very simple test to run two tasks that need to execute for 20msec every 50msec. We scheduled them using CFS with the same CPU share, and we observed that even if there is enough spare CPU time (the system is idle), the tasks occasionally experience some deadline miss. The test is available here.
Using sched_rt, instead, we can give that kind of guarantee only using a global period, with the constraint that "a subgroup must have a smaller or equal period to its parent".
2. the latency experienced by a task (i.e., the time between two consecutive executions of a task) is not deterministic and cannot be bound, since it highly depends on the number of tasks running in the system at that time.
Using EDF, instead, this time is deterministic, bounded and known at any instant of time.
These issues are particularly critical when running time-sensitive or control applications. Without a real-time scheduler, in fact, it is not possible to make any feasibility study of the system under development, and developers cannot be sure that the timing requirements will be met under any circumstance. And we feel that this prevents the usage of Linux in industrial contexts like the one addressed by our project.
A key feature of our scheduling class is that "temporal isolation" is ensured. This means that the temporal behavior of each task (i.e., its ability to meet its deadlines) is not affected by the behavior of any other task in the system. In other words, even if a task misbehaves, it is not able to exploit larger execution times than the amount it has been allocated.
Each task is characterized by a "budget" sched_runtime and a "period" sched_period, equal to its deadline.
At any instant of time, the system schedules the (ready) task having earliest deadline. During execution, task's budget is decreased by an amount equal to the time executed. When task's budget reaches zero (i.e., the task executed for a time equal to its budget), it is stopped until the time instant of its next deadline. At that time, the task is made runnable again, its budget is refilled and a new deadline is computed.
This means that the task is guaranteed a share of processor time equal to sched_runtime/sched_period every sched_period.
Of course, the sum of sched_runtime/sched_period of all tasks cannot be higher than the total throughput available on the system (i.e., 100% on uniprocessor systems), otherwise the system is overloaded and task deadlines cannot be guaranteed.
When a task tries to execute more than its budget, it is slowed down, by stopping it until the time instant of its next deadline. When, at that time, it is made runnable again, its budget is refilled and a new deadline is computed.
Download and usage
A public git repository is available at the following address:
git clone git://gitorious.org/sched_deadline/linux-deadline.git
Information about download and usage is available
here .
Technical details
Some notes about the implementation:
* It implements the well-known Earliest Deadline First (EDF) algorithm
* It is aligned with the current mainstream kernel
* It relies on standard Linux mechanisms (e.g., control groups) to natively support multicore platforms and to provide hierarchical scheduling through a standard API (see below)
* It does not make any restrictive assumption about the characteristics of the tasks: it can handle periodic, sporadic or aperiodic tasks (see below)
* We decided to put the new scheduling class with highest priority among all existing scheduling classes (i.e., above sched_rt and sched_fair).
* It works natively on multicore platform. We have one runqueue for each CPU, mplemented with a red-black tree, for efficient manipulation.
* We migrate the tasks among the runqueues trying to always have, on a system with m CPUs, the m earliest deadline ready tasks running on the CPUs.
* Tasks are not migrated if a specific CPU affinity has been set. Affinity can be set using existing Linux mechanisms (i.e., sched_setaffinity() for tasks and cpuset.cpus for task groups). Thus, it is sufficient that each EDF task in the system has its affinity set, to achieve a partitioned scheduling with very few overhead.
* The API for tasks uses an additional system call, called sched_setscheduler_ex(...), to assign and modify the scheduling parameters described above (i.e., sched_runtime and sched_period). The existing system call sched_setscheduler() has not been extended, because of the binary compatibility issues that modifying its struct sched_param parameter would have raised for existing applications. So we implemented an additional syscall.
For the sake of consistency, also sched_setaparam_ex() and sched_getparam_ex() have been implemented.
* When the support for cgroups is enabled in the kernel .config (i.e., CONFIG_DEADLINE_GROUP_SCHED symbol) two entries regulate the maximum bandwidth available for the root control group. Supposing the cgroups filesystem mounted at /cgroups, these entries are /cgroup/cpu.deadline_runtime_us and /cgroup/cpu.deadline_period_us, respectively. By default, these two entries are set to 0. Therefore, if control groups support is enabled, you have to set these values, otherwise there will be not available bandwidth for SCHED_DEADLINE tasks.
* It is integrated with the cgroups filesystem, so it's possible to assign a reservation to a group of tasks and make hierarchical scheduling. The cgroups interface provides two entries, cpu.deadline_runtime_us and cpu.deadline_period_us. These files are created once the cgroup filesystem is mounted, to get and set the group bandwidth.
* In case a SCHED_DEADLINE task forks, its budget (i.e, runtime) is not changed. The child is scheduled SCHED_DEADLINE with no budget. It's parent's duty to call sched_setscheduler_ex(...) on the child to give it some budget.
Tests about performance of our scheduling class can be found
here .
During the last meeting of the project at Brussels, the reviewers from the European commission shared with the project partners the hope that this code may eventually become part of the mainstream kernel.
We wait for your feedback and contribution.