cpuset by example

cpuset is an important concept in Linux: it provides a mechanism to assign a set of CPUs and memory nodes to a set of tasks.

Details can be found in the kernel documentation.

Here I am just trying to show some examples of how we can read the info manually and manipulate it to change a task's CPU affinity.

So again I am using my 64-bit Ubuntu VMware system, which has 4 CPUs available:

$lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 30
Stepping:              5
CPU MHz:               2925.979
BogoMIPS:              5851.95
Virtualization:        VT-x
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3

First of all, cpuset is hierarchical: like a regular file system, it starts from a root directory.

Each process has a cpuset file in procfs which shows where in the hierarchy the process is attached.

For example, checking our shell process shows we are attached to the root, which is the default:

$cat /proc/$$/cmdline
/bin/ksh93
$cat /proc/$$/cpuset
/

To explore the cpuset file system, we first need to mount it:

$ls /sys/fs/cgroup/
systemd
$sudo mkdir /sys/fs/cgroup/cpuset
$sudo mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
$ls /sys/fs/cgroup/cpuset/
cgroup.clone_children  cpuset.cpu_exclusive  cpuset.mem_hardwall     cpuset.memory_pressure_enabled  cpuset.mems                      notify_on_release
cgroup.procs           cpuset.cpus           cpuset.memory_migrate   cpuset.memory_spread_page       cpuset.sched_load_balance        release_agent
cgroup.sane_behavior   cpuset.mem_exclusive  cpuset.memory_pressure  cpuset.memory_spread_slab       cpuset.sched_relax_domain_level  tasks

Now we see files appearing in /sys/fs/cgroup/cpuset/. Here we mainly focus on the set of CPUs allowed:

$cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-3

As can be seen, by default the set of allowed CPUs is all CPUs online in the system.

cpuset also controls memory policy, by specifying the NUMA node(s) from which memory must be allocated:

$cat /sys/fs/cgroup/cpuset/cpuset.mems
0

So in this case memory must be allocated from NUMA node 0, which is also the only node in the system:

$ls -d /sys/devices/system/node/node*
/sys/devices/system/node/node0

We can also see which processes and tasks are attached to this set by reading the files cgroup.procs and tasks:

$head -5 /sys/fs/cgroup/cpuset/cgroup.procs
1
2
3
5
7
$head -5 /sys/fs/cgroup/cpuset/tasks
1
2
3
5
7
$diff -u /sys/fs/cgroup/cpuset/cgroup.procs /sys/fs/cgroup/cpuset/tasks|head -10
--- /sys/fs/cgroup/cpuset/cgroup.procs  2015-09-27 15:13:28.858820058 -0400
+++ /sys/fs/cgroup/cpuset/tasks 2015-09-27 15:13:28.858820058 -0400
@@ -252,6 +252,9 @@
 684
 789
 821
+827
+828
+829
 859

The main difference is that tasks lists every thread's TID, while cgroup.procs only lists each process's PID (its main thread's TID). Here 827, 828 and 829 are the extra threads of rsyslogd (PID 821):

$ps -L -p 821 -o tid,pid,cpuid,ppid,args
  TID   PID CPUID  PPID COMMAND
  821   821     0     1 rsyslogd
  827   821     0     1 rsyslogd
  828   821     2     1 rsyslogd
  829   821     0     1 rsyslogd
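The extra entries in tasks come from threads: every thread of a process has its own TID entry under /proc/&lt;pid&gt;/task. A quick way to see this is with the current shell, which is single-threaded, so the listing holds exactly one entry:

```shell
# list the thread ids of the current shell; a single-threaded
# process shows exactly one entry, equal to its own PID
ls /proc/$$/task
```

For the multi-threaded rsyslogd above, the same listing would show 821, 827, 828 and 829.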

Now that we have seen how cpuset works and how it links to processes, we can try manipulating it by creating a new cpuset further down the hierarchy.

We will name the new cpuset set1:

$sudo mkdir /sys/fs/cgroup/cpuset/set1
$ls /sys/fs/cgroup/cpuset/set1/
cgroup.clone_children  cpuset.cpus           cpuset.memory_migrate      cpuset.memory_spread_slab  cpuset.sched_relax_domain_level
cgroup.procs           cpuset.mem_exclusive  cpuset.memory_pressure     cpuset.mems                notify_on_release
cpuset.cpu_exclusive   cpuset.mem_hardwall   cpuset.memory_spread_page  cpuset.sched_load_balance  tasks
$cat /sys/fs/cgroup/cpuset/set1/cpuset.cpus

$cat /sys/fs/cgroup/cpuset/set1/cpuset.mems

$cat /sys/fs/cgroup/cpuset/set1/cgroup.procs

As we can see, the mkdir command instructs sysfs to automatically create all the needed files. The control files shown above start out empty, and no processes are attached to the new cpuset yet.

If we simply try to attach a task now, we run into an error:

$sudo ksh -c "echo 3671 > /sys/fs/cgroup/cpuset/set1/tasks"
ksh: echo: write to 1 failed [No space left on device]

At a minimum we need to set up cpus and mems. Here we configure the set so that CPUs will only be allocated from cpu0 and cpu1:

$sudo ksh -c "echo 0-1 > /sys/fs/cgroup/cpuset/set1/cpuset.cpus"
$sudo ksh -c "echo 0-1 > /sys/fs/cgroup/cpuset/set1/cpuset.mems"
$cat /sys/fs/cgroup/cpuset/set1/cpuset.cpus
0-1
$cat /sys/fs/cgroup/cpuset/set1/cpuset.mems
0

Note that although we wrote 0-1 into cpuset.mems, it reads back as 0: node 0 is the only node present on this system, so the request is trimmed down to the nodes that actually exist.

Now we can successfully attach another shell process (PID 3671) to the new set:

$sudo ksh -c "echo 3671 > /sys/fs/cgroup/cpuset/set1/tasks"

And we can verify that the connection is established correctly:

$cat /sys/fs/cgroup/cpuset/set1/cgroup.procs
3671
$cat /sys/fs/cgroup/cpuset/set1/tasks
3671

It is also reflected in process 3671's procfs entries:

$cat /proc/3671/cgroup
2:cpuset:/set1
1:name=systemd:/user/1000.user/1.session
$cat /proc/3671/cpuset
/set1

Since we limited the set to cpu0 and cpu1, we expect the CPU affinity mask to be updated as well:

$cat /proc/3671/status|grep Cpus_allowed
Cpus_allowed:   00000000,00000003
Cpus_allowed_list:      0-1
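Each comma-separated word in Cpus_allowed is a 32-bit chunk of a bitmask with one bit per CPU, so 00000003 has bits 0 and 1 set, matching the 0-1 list form. As a quick sanity check, a small loop can decode the mask by hand (the 0x3 value is taken from the output above):

```shell
# decode an affinity bitmask into a comma-separated cpu list:
# bit N set in the mask means cpu N is allowed
mask=0x3
cpus=""
for cpu in 0 1 2 3; do
  if [ $(( (mask >> cpu) & 1 )) -eq 1 ]; then
    cpus="${cpus:+$cpus,}$cpu"
  fi
done
echo "$cpus"   # prints "0,1"
```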

cpuset is a stricter limit on the CPUs allowed, so if we try to assign the task to any CPU outside the 0-1 range, the system rejects it:

$taskset -p 4 3671
pid 3671's current affinity mask: 3
taskset: failed to set pid 3671's affinity: Invalid argument

And any wider range is AND'ed with the cpuset's 0-1, so setting a mask of 7 gives us back 3:

$taskset -p 7 3671
pid 3671's current affinity mask: 3
pid 3671's new affinity mask: 3
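This masking is plain bitwise arithmetic, which we can reproduce directly in the shell:

```shell
# requested mask 0x7 (cpus 0-2) ANDed with the cpuset mask 0x3 (cpus 0-1):
# cpu2 is silently dropped from the request
requested=0x7
allowed=0x3
printf 'effective mask: %x\n' $(( requested & allowed ))   # prints "effective mask: 3"
```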

Now we update set1 to allow only cpu2. To verify, we can run ps repeatedly and see that process 3671 now always runs on cpu2 (the CPUID column):

$sudo ksh -c "echo 2 > /sys/fs/cgroup/cpuset/set1/cpuset.cpus"
$cat /sys/fs/cgroup/cpuset/set1/cpuset.cpus
2
$ps -p 3671 -o pid,cpuid,args,user
  PID CPUID COMMAND                     USER
 3671     2 /bin/ksh93                  codywu
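
To round things off, and to cover how to undo everything, the whole session can be sketched as one script: create a cpuset, configure it, attach the current shell, detach it back to the root cpuset, and remove the set. This is only a sketch under the assumptions used throughout (a cgroup v1 cpuset mount at /sys/fs/cgroup/cpuset and root privileges); the guard makes it a no-op elsewhere:

```shell
# end-to-end sketch: create a cpuset, configure it, attach the current
# shell, detach it again and remove the set; requires root plus the
# cgroup v1 cpuset mount used above, otherwise it skips itself
CS=/sys/fs/cgroup/cpuset
if [ -f "$CS/cpuset.cpus" ] && [ "$(id -u)" -eq 0 ]; then
  mkdir "$CS/set1"
  echo 0-1 > "$CS/set1/cpuset.cpus"  # cpus and mems must both be set
  echo 0   > "$CS/set1/cpuset.mems"  # before any task can attach
  echo $$  > "$CS/set1/tasks"        # attach this shell to set1
  echo $$  > "$CS/tasks"             # detach: move it back to the root cpuset
  rmdir "$CS/set1"                   # only succeeds once set1 holds no tasks
  result="set1 created, used and removed"
else
  result="skipped: need root and a cgroup v1 cpuset mount"
fi
echo "$result"
```

The detach step is just an attach to the root cpuset; rmdir fails with a "busy" error while any task is still attached, which is why the detach must come first.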

6 Responses to cpuset by example

  1. Vatiminxuyu says:

    Thanks for your sharing!
    Can I ask 2 questions?
    1. How to detach 3671 process from set1?
    2. How to remove set1 from cpuset system?
    Thanks a lot~!


    • codywu2010 says:

      Hi Vatiminxuyu,

      to detach 3671 from set1 simply append it to default cpuset like below:
      sudo ksh -c “echo 3671 > /sys/fs/cgroup/cpuset/tasks”

      you can compare the output of “cat /proc/3671/cpuset” before and after the above command to see the difference.

      to remove set1, first detach all the tasks as described above, i.e. migrating them to default cpuset.
      After that, simply execute “sudo rmdir /sys/fs/cgroup/cpuset/set1” and set1 dir should be gone.

      if any task still lingers in the old cpuset you would see an error such as “device busy”, so you need to make sure all tasks are detached before the final removal.

  3. codywu2010 says:

    you are welcome and have fun!

  4. Daniel says:

    Hi CodyWu,

    Thanks for putting this together and I did learn a ton from you 🙂

    …I do have a question. I run a 32 core server/ 4 socket DellR820 … I have isolated all processors
    except 0 and 4 for application use. I am planning to use core 22 for an application for testing purpose.

    cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-3.10.0-327.3.1.el7.x86_64 root=/dev/mapper/system-root ro crashkernel=auto selinux=0 rd.lvm.lv=system/swap rd.lvm.lv=swapvg/swaplv biosdevname=0 net.ifnames=0 rd.lvm.lv=system/root net.ifnames=0 biosdevname=0 isolcpus=1-3,5-31 nosoftlockup mce=ignore_ce audit=0

    Then, based upon your cgroups setup, I have created set1 just to bind sfptpd’s pid (i.e. 8474) to processor 22 via echo cmds:

    # pwd
    /sys/fs/cgroup/cpuset/set1
    # cat cpuset.cpus
    22
    # cat tasks
    8474
    # cat cgroup.procs
    8474
    # pgrep sfptp
    8474

    # cat /proc/8474/status | grep -i cpu
    Cpus_allowed: 00000000,00000000,00400000
    Cpus_allowed_list: 22

    # taskset -p 8474
    pid 8474’s current affinity mask: 400000
    # taskset -cp 8474
    pid 8474’s current affinity list: 22

    # ps -L -p 8474 -o tid,pid,cpuid,ppid,args,user
    TID PID CPUID PPID COMMAND USER
    8474 8474 0 1 /usr/sbin/sfptpd -f /etc/sf root
    8475 8474 4 1 /usr/sbin/sfptpd -f /etc/sf root
    8495 8474 4 1 /usr/sbin/sfptpd -f /etc/sf root
    8496 8474 4 1 /usr/sbin/sfptpd -f /etc/sf root
    8497 8474 4 1 /usr/sbin/sfptpd -f /etc/sf root
    [root@wlpra99a0009 set1]#

    But my question is: why is my app (sfptpd) here still tied to CPUID 0 or 4 (these are the only 2 after isolating all the others)? I was hoping to see CPUID 22 instead. What am I missing?

    Appreciate your help,
    D
