Linux Process Scheduler

This post derives words from InformIT – The Linux Process Scheduler by Robert Love. The original is based on the knowledge of kernel 2.6. It is expected to be outdated as at the time of writing Linux kernel has moved to version 3.17.

Introduction

A scheduler is the component of the kernel that selects which process to run next.
It divides the finite resource of processor time between the runnable processes on a system.
It is responsible for best utilizing the system and giving the impression that multiple processes are simultaneously executing.
It tries to determine the most worthwhile process to run and retain fairness to other processes.
The algorithm inside the new scheduler in Linux kernel 2.5 was designed to accomplish these goals:
  • O(1) – Can complete in constant-time
  • Perfect SMP scalability – Each processor has its own locking and individual run queue
  • Improving SMP affinity – Group tasks to a specific CPU and run them there. Only migrate to another CPU to resolve imbalance queue sizes.
  • Good interactive performance – Even under considerable system load, a system should react and schedule interactive tasks immediately.
  • Fairness – Not starving processes or over feeding timeslice to processes
  • Optimize for the common case of only 1-2 runnable processes, yet scale well to multi-processor with many processes

Scheduling Flavors

Multitasking operating systems come in two flavors:
  1. Cooperative scheduling – Processes decide their own time to give up and hand to another process
  2. Preemptive scheduling – A scheduler decides suspension and resumption of processes

Cooperative Scheduling

A process does not stop running until it voluntarily decides to do so. This act of voluntarily suspending itself is called yielding.
This scheduling scheme brings bad consequences:
  1. A scheduler cannot perform global decisions because it does not have control over processes’ execution time.
  2. A process can monopolize a processor leading to machine hang.

Preemptive Scheduling

A scheduler decides when processes to stop or to resume. The act of involuntarily suspending a process is called preemption.
Time before a process is preempted is pre-determined. This time duration is called time slice.
Time slice concept enables a scheduler to make global scheduling decisions.
Linux uses this scheduling scheme from the beginning. During Linux 2.5, it received a scheduler overhaul. The scheduler is called the O(1) scheduler because of its algorithmic behavior.

Policy

Policy is the behavior of the scheduler that determines what runs when. A scheduler’s policy often determines the overall feel of a system and is responsible for optimally utilizing processor time.
It tries to satisfy 2 conflicting goals:
  1. Fast process response time (low latency) – Interactive processes
  2. High process throughput – Heavy computation

I/O-bound Processes and Processor-bound Processes

I/O bound processes spend most time waiting on I/O request. They are often runnable but only for short periods, and they will eventually block waiting on more I/O.
Processor-bound processes spend most time executing code. They tend to run until they are preempted.
Linux favors I/O bound processes to provide good interactive response.

Process Priority

This ranks processes based on their worth and need for processor’s time.
Higher priority processes run before lower priority processes. Same priority processes are scheduled round-robin.
Linux employs dynamic priority ranking.
As Linux favors interactive response, processor-bound processes which consume full time slices are lowered priorities.
Linux implements 2 separate priority ranges:
  1. Nice value – Being nicer means giving away its time to others. It ranges from -20(bad) to 19(nice). The default is 0.
  2. Real-time priority – Real-time processes are at higher priority than normal processes. Linux implements according with POSIX.

Timeslice

Timeslice is a numeric value representing how long can a task can run until it is preempted.
Too long timeslice makes system non responsive.
Too short timeslice makes the cost of switching processes significant.
A process does not have to use its timeslice in one go. It can split into smaller timeslices to be used in later queue.
When a process uses up its timeslice, it is considered expired. It is no longer eligible to run until all other processes use up their timeslice.
When all processes use up their timeslice, the scheduler recalculates timeslices for all processes.

Scheduling Data Structure

Runqueues

The struct runqueue is the basic data structure in the scheduler. It is defined in kernel/sched.c.
runqueue is a list of runnable processes given on a processor i.e. one runqueue per processor.
A process exist only in one runqueue.
Before a runqueue can be manipulated, it must be locked first.
Runqueues should be locked in a same order to avoid deadlock.
struct runqueue {
spinlock_t       lock;                /* spin lock which protects this runqueue */
unsigned long    nr_running;          /* number of runnable tasks */
unsigned long    nr_switches;         /* number of contextswitches */
unsigned long    expired_timestamp;   /* time of last array swap */
unsigned long    nr_uninterruptible;  /* number of tasks in uinterruptible sleep */
struct task_struct *curr;             /* this processor's currently running task */
struct task_struct *idle;             /* this processor's idle task */
struct mm_struct   *prev_mm;          /* mm_struct of last running task */
struct prio_array  *active;           /* pointer to the active priority array */
struct prio_array  *expired;          /* pointer to the expired priority array */
struct prio_array  arrays[2];         /* the actual priority arrays */
int    prev_cpu_load[NR_CPUS];        /* load on each processor */
struct task_struct *migration_thread; /* the migration thread on this processor */
struct list_head    migration_queue;  /* the migration queue for this processor */
atomic_t         nr_iowait;           /* number of tasks waiting on I/O */
}

Priority Arrays

2 priority arrays: activeexpired
It is the data structure that promotes O(1) complexity.
Each priority array contains one queue of runnable processors per priority level.
These queues contain lists of runnable processes at each priority level.
Priority arrays contain priority bitmap to efficiently discover the highest priority runnable task in the system.
The default number of priorities is 140.
Bitmap has to provide 1 bit per 1 priority. Hence, the size of bitmap is at least 140.
Using unsigned long (32 bits), bitmap requires an array of 5 elements unsigned long. 32 x 5 > 140
Finding the highest priority runnable queue by scanning for the first 1-bit in bitmap. sched_find_first_bit() does this.

struct prio_array {
int              nr_active;           /* number of tasks */
unsigned long    bitmap[BITMAP_SIZE]; /* priority bitmap, default:BITMAP_SIZE=5  */
struct list_head queue[MAX_PRIO];     /* priority queues, default:MAX_PRIO=140 */
};

Operation

Recalculating Timeslices

Recalculation occurs when all tasks have exhausted their timeslice.
Each task is recalculated its priority and timeslice.
Priority and other attributes are used to calculate timeslice size.
This process can be potentially long as O(n), complexity increases as there are more tasks.

for (each task on the system) {
recalculate priority
recalculate timeslice
}
When a task uses up its time slice:
  • The scheduler recalculates its new timeslice
  • The scheduler moves the task to expired priority array.
When all task use up their time slice:
  • The scheduler switches active and expired pointers via schedule().
schedule() picks the next task and switches to it.
It is called by kernel code that wants to sleep.
It is invoked when a task is preempted.
context_switch() does the switching from one task to another.

Calculating Priority and Timeslice

static_prio in task_struct stores task’s nice value.
effective_prio() returns dynamic priority of a task based on its interactivity by adding -5~5 to current nice value.
The scheduler heuristically determines whether a task is I/O bound or processor bound type.
The most indicative metric is how long a task sleeps.
It tracks how long a task spends in sleeping, ranging from 0 to MAX_SLEEP_AVG, in sleep_avg.
task_timeslice() returns a new timeslice for a given task.
Sufficiently interactive tasks can have comebacks even they have expired their timeslice. This is to avoid being stuck in the expired priority array, thus they are no longer interactive to users.
scheduler_tick() puts them back to the active priority array. It is invoked by a timer interrupt.
TASK_INTERACTIVE() – computes whether a task is interactive enough
EXPIRED _STARVING() – checks whether runqueue has not switched its arrays for long time or not.

Sleeping and Waking Up

Sleeping tasks are removed from their runqueue and are put in a waitqueue, then schedule() is called to pick up the next task to run.
States for sleeping tasks are TASK_INTERRUPTIBLE (also response to signals) and TASK_UNINTERRUPTIBLE.

The Load Balancer

Symmetrical MultiProcessing
Each processormaintains its own list of processes and operates the scheduler only on those tasks.
Load balancer compares processors’ runqueue and balance them.
It is implemented in kernel/sched.c as load_balance().

Preemption and Context Switching

context_switch() is defined in kernel/sched.c.
it is called by schedule().
It does 2 jobs:
  1. switch_mm() – switch virtual memory mapping from the previous task to the next task
  2. switch_to() – switch the processor state from the previous task to the next task (stacks and registers)
need_resched is a flag saying that after waking up tasks, they should be rescheduled by schedule().

User Preemption

It occurs when kernel is about to return to user-space:
  • from a system call
  • from an interrupt handler
If kernel is returning to user-space, it is safe to resume current task and is safe to pick a new task.
If need_resched flag is set, scheduler is invoked.

Kernel Preemption

Linux kernel is a fully preemptive kernel from 2.6 kernel onwards.
It is possible to preempt a task at any point, as long as the kernel is in a state which is safe to reschedule.
SMP -> Use locks as an indicator whether the current code is reentrant and capable of being preempted.
Kernel preemption implements preempt_count in task_struct to count how many locks are being held.
It occurs when:
  • returning to kernel-space from an interrupt handler
  • kernel code becomes preemptible again
  • a task in kernel explicitly calls schedule()
  • a task in kernel blocks which result in schedule() calling

Real-Time

Linux provides soft real-time behavior.
Linux scheduling policy ensures real-time tasks are running whenever they are runnable, but the kernel does not promise to always be able to fulfill them.
Linux provides 2 real-time scheduling policies
  1. SCHED_FIFO – FIFO without timeslices. Tasks under this policy are always scheduled over any SCHED_OTHER. They run until they block or yield.
  2. SCHED_RR – FIFO with timeslices i.e. real-time round-robin scheduling algorithm.
  3. SCHED_OTHER – this is non-real-time scheduling policy

The scheduler does not calculate dynamic priority for real-time tasks.

It uses shared priority ranks with non-real-time tasks.
The default number of real-time priority is 100.
Therefore, from 0 to 99 are reserved for real-time priorities.
Priorities from 100 to 139 are not real-time tasks.

Appendix

Task States

All information for a process is contained in data struct task_struct.
There are 5 task states in kernel 2.4.37.
  1. TASK_RUNNING – The process is able to run. The scheduler decides to give it a run.
  2. TASK_INTERRUPTIBLE – The process is waiting for some event. It can be woken up by a signal.
  3. TASK_UNINTERRUPTIBLE – The process is waiting for some event. It will not be considered by the scheduler. It cannot be killed.
  4. TASK_ZOMBIE – The task has exited, but there are leftover data structures which are note yet freed.
  5. TASK_STOPPED – The task is stopped by a signal SIGTSTP (Ctrl+Z)
Later Linux kernels have more task states also basic state combinations.

Scheduler-Related System Calls

nice() – set a task’s nice value
sched_setscheduler() – set a task scheduling policy
sched_getscheduler() – get a task scheduling policy
sched_setparam() – set a task’s scheduling parameters
sched_getparam() – get a task’s scheduling parameters
sched_get_priority_max() – returns the maximum priority value that can be used with the scheduling algorithm identified by policy.
sched_get_priority_min() – returns the minimum priority value that can be used with the scheduling algorithm identified by policy.
sched_rr_get_interval() – get a task’s timeslice value
sched_setaffinity() – set a task’s processor affinity
sched_yield() – move a task to the expired priority array.

Useful Macros

cpu_rq(processor) – returns a pointer to the runqueue associated with the given processor
this_rq() – returns the runqueue of the current processor
task_rq(task) – returns a pointer to the runqueue on which the given task is queued
task_rq_lock(task, flags) – locks a runqueue the task runs on and returns the address of the runqueue
task_rq_unlcok(rq, flags) – unlocks the runqueue
this_rq_lock() – locks the current runqueue
rq_unlock(struct runqueue *rq) – unlocks the given runqueue
double_rq_lock(rq1, rq2) – locks 2 runqueues. locks rq1 first then rq2
double_rq_unlock(rq1, rq2) – unlock 2 runqueues. unlocks rq2 then rq1
sched_find_first_bit() – find first 1-bit in an array
TASK_INTERACTIVE(task) – computes whether a task is interactive enough
EXPIRED_STARVING(rq) – checks whether runqueue has not switched its arrays for long time or not.

References

  1. Inform IT – The Linux Process Scheduler
  2. What are Possible Task States?
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s