Tracing

Tracing for performance

Traces of the input/output (I/O) activity of computer systems are a critical component of research efforts to improve I/O subsystems. Traces record what applications or entire systems are doing, which allows researchers to replay them in order to measure the performance of proposed or modified systems. Examining traces can also reveal access patterns that can be used to build smarter storage systems. Traces bring repeatability to I/O research, and they can be used even when the environment in which they were generated cannot be duplicated in the lab, for example because of privacy concerns, cost, or the use of proprietary application software.
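As a rough illustration of trace replay, the sketch below replays a small block-level read trace against a model of an LRU cache and reports the hit ratio for a given cache size. The trace contents, the record layout (operation, block number), and the replay function are assumptions made for this example; real traces, such as those in public repositories, carry far more detail.

```python
from collections import OrderedDict

# Hypothetical trace: one (operation, block number) record per request.
TRACE = [("R", 1), ("R", 2), ("R", 1), ("R", 3),
         ("R", 1), ("R", 2), ("R", 4), ("R", 1)]


def replay(trace, cache_blocks):
    """Replay the read requests of a trace against an LRU cache model.

    Returns the fraction of reads served from the cache.
    """
    cache = OrderedDict()  # keys are block numbers, ordered oldest to newest
    hits = 0
    reads = 0
    for op, block in trace:
        if op != "R":
            continue  # this toy model only looks at reads
        reads += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used
    return hits / reads if reads else 0.0


for size in (2, 3):
    print("cache of %d blocks: hit ratio %.2f" % (size, replay(TRACE, size)))
```

Because the same trace is replayed for every configuration, differences in the result come only from the modelled system, which is the repeatability that traces provide.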

Recognizing the importance of I/O traces, the Storage Networking Industry Association (SNIA) established the Input/Output Traces, Tools, and Analysis (IOTTA) repository. The primary goal of this repository is to store storage-related I/O trace files, associated tools, and other information that assists storage research.

The I/O requests that we trace traverse a complex stack of software and hardware components: database systems, file systems, operating-system caches and buffers, interconnects and their device drivers, array controllers, SSD controllers, etc. Application performance is influenced by all of these components and by the interaction between them. However, almost all tracing is done at one particular point in the I/O stack, typically either the file system interface or the block-device interface. We care about end-to-end performance, but we measure at one spot.

SystemTap

SystemTap is an open source tool created by a consortium that includes Red Hat, IBM, Intel, Oracle, and Hitachi. It allows kernel developers and regular users to investigate the behavior of kernel internals in an easy and scalable way.

A SystemTap user writes scripts in a C-like language. Each script describes a list of event-handler pairs. An event can be almost any place in the kernel where execution can be intercepted, or one of a set of predefined events such as a timer expiring or the start and end of the script's execution. The handler is a sequence of statements in the same C-like language that can modify data, aggregate statistics, or print data.

A SystemTap script is run with the stap utility, which performs the following steps:

Translate the .stp script into C source code that can be compiled as a Linux kernel module.
Compile the C source code into a kernel module (.ko file).
Load the module into the kernel.
Run the module, which invokes the handlers associated with the events. Any output produced by a handler is collected by stap and printed to standard output.
When Ctrl-C is pressed or one of the handlers calls exit(), stap unloads the kernel module and the program terminates.
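The steps above can be inspected one at a time, because stap accepts a -p option that stops after a given pass. The man page numbers the passes 1 to 5 (parse, elaborate, translate, compile, run), so the translate and compile steps listed above correspond to passes 3 and 4. The following sketch only builds the command line; the helper name stap_command and the script name dropwatch.stp are assumptions made for this example.

```python
# Pass names and numbers as documented for stap's -p option.
STAP_PASSES = {"parse": 1, "elaborate": 2, "translate": 3, "compile": 4, "run": 5}


def stap_command(script, stop_after="run", verbose=False):
    """Build the argument list for running stap on a script.

    stop_after names the last pass to perform; "run" means a normal,
    complete invocation with no -p option.
    """
    argv = ["stap"]
    if verbose:
        argv.append("-v")
    if stop_after != "run":
        argv += ["-p", str(STAP_PASSES[stop_after])]
    argv.append(script)
    return argv


# Stop after generating the C source, then after building the module.
print(" ".join(stap_command("dropwatch.stp", stop_after="translate")))
print(" ".join(stap_command("dropwatch.stp", stop_after="compile", verbose=True)))
```

The resulting list can be passed to subprocess.run on a machine where SystemTap and the matching kernel development packages are installed; running stap usually requires root or membership in the appropriate stap groups.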

1 global locations
2
3 probe begin {
4   printf("Monitoring for dropped packets\n")
5 }
6 probe end {
7   printf("Stopped monitoring of dropped packets\n");
8 }
9
10 probe kernel.trace("kfree_skb") {
11   locations[$location] <<< 1        # one sample per freed packet buffer
12 }
13
14 probe timer.sec(5) {
15   printf("\n");
16   foreach (l in locations-) {       # trailing "-" sorts in descending order
17     printf("%d packets dropped at %s\n",
18            @count(locations[l]), symname(l));
19   }
20   delete locations                  # reset the statistics for the next interval
21 }

Line 1: Defines a global variable. Variable types in SystemTap are inferred at first use. Variables can be scalars (numbers, strings), associative arrays (dictionaries), or statistics. Statistics are a special SystemTap structure: data is added to them with the <<< operator and later extracted with statistics functions such as @count, @sum, @min, @max, and @avg.
Lines 3-5: The begin probe is a special event that fires once when the script starts running. The handler for the event is specified between the curly braces. In our case it just prints a string to standard output.
Lines 6-8: The end probe is a special event that fires just before the script terminates.
Lines 10-12: This time the probe names a place in the kernel as the probe point at which the handler will be invoked. kfree_skb is the name of a tracepoint defined inside the kernel. There are many such tracepoints, placed by kernel developers at strategic locations for gathering information. Alternatively, kernel.function can be specified instead of kernel.trace to invoke the handler when a function is entered. The handler in our example adds the value 1 to the statistics entry of the locations array indexed by the location parameter. The special $location form gives access to a kernel variable that is in scope at the probe point.
Lines 14-21: This time the probe handler is executed every 5 seconds using a built-in timer. Each time, we iterate over all locations in the array, in descending order because of the trailing minus sign in the foreach, and print the count of aggregated values, which is the number of times a packet was dropped at that location. Finally we delete the locations array, which empties it and prepares it for the next interval. This way we discard the aggregated statistics every 5 seconds and start over.
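To make the data flow of the script easier to follow, here is a small Python model of the same logic. It is only an analogy, not SystemTap code: the Stat class stands in for a statistics aggregate, on_kfree_skb plays the role of the tracepoint handler, and on_timer plays the role of the timer handler. The location names in the usage lines are placeholder strings; the real script receives addresses and converts them with symname.

```python
from collections import defaultdict


class Stat:
    """Toy stand-in for a SystemTap statistics aggregate."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def add(self, value):
        """Model of `stat <<< value`."""
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


# Model of `global locations`: an associative array of aggregates.
locations = defaultdict(Stat)


def on_kfree_skb(location):
    """Model of the kfree_skb handler: locations[$location] <<< 1."""
    locations[location].add(1)


def on_timer():
    """Model of the timer handler: report in descending order, then reset."""
    ordered = sorted(locations.items(), key=lambda kv: kv[1].count, reverse=True)
    lines = ["%d packets dropped at %s" % (stat.count, loc) for loc, stat in ordered]
    locations.clear()  # model of `delete locations`
    return lines


# Placeholder events standing in for five seconds of activity.
for loc in ["ip_rcv", "tcp_v4_rcv", "tcp_v4_rcv"]:
    on_kfree_skb(loc)
print("\n".join(on_timer()))
```

Since every sample added is the value 1, the count and the sum of each aggregate are equal, which is why the script can report either as the number of dropped packets.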

The kernel technology behind SystemTap is called kprobes. SystemTap can be considered a high-level layer over kprobes that spares the user the burden of working with low-level kernel interfaces. A kprobe is a low-level kernel API that allows a handler to be hooked to almost any instruction in the kernel. To install the hook, the first byte of the instruction at the probed address is replaced by a breakpoint instruction appropriate for the specific CPU architecture. When this breakpoint is hit, the kprobes logic takes over and executes the specified handler. After that, the original instruction that was overwritten is executed and execution continues with the instruction that follows it.
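The breakpoint mechanism can be illustrated with a toy model in which memory is a list of instruction names and a probe replaces one entry with a breakpoint marker. This is a simplified sketch for explanation only: the ToyKernel class, its method names, and the instruction strings are invented for this example and do not correspond to the real kprobes API.

```python
BREAKPOINT = "BRK"  # stand-in for the architecture's breakpoint instruction


class ToyKernel:
    """Toy model of breakpoint-based probing over a list of instructions."""

    def __init__(self, instructions):
        self.memory = list(instructions)  # one instruction per address
        self.saved = {}                   # address -> displaced instruction
        self.handlers = {}                # address -> probe handler
        self.executed = []                # log of instructions actually run

    def register_probe(self, address, handler):
        # Save the original instruction and overwrite it with a breakpoint.
        self.saved[address] = self.memory[address]
        self.memory[address] = BREAKPOINT
        self.handlers[address] = handler

    def unregister_probe(self, address):
        # Restore the original instruction.
        self.memory[address] = self.saved.pop(address)
        del self.handlers[address]

    def run(self):
        pc = 0
        while pc < len(self.memory):
            instr = self.memory[pc]
            if instr == BREAKPOINT:
                self.handlers[pc](pc)   # the probe logic runs the handler
                instr = self.saved[pc]  # then the displaced instruction runs
            self.executed.append(instr)
            pc += 1                     # continue with the next instruction


kernel = ToyKernel(["push", "mov", "call", "ret"])
kernel.register_probe(2, lambda addr: print("probe hit at address", addr))
kernel.run()
```

The probed code still executes every original instruction in order; the only visible difference is that the handler runs just before the instruction at the probed address.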

Another type of kprobe is the return probe (kretprobe). It is used to invoke a handler after some kernel function completes. When the probed function is called, the return probe saves the function's return address and replaces it with the address of a trampoline. When the function returns, control passes to the trampoline, which executes the handler and then resumes execution at the original return address.
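The return-address swap can be sketched in the same toy style. Here the stack is a Python list holding a single return address, and the function call_with_return_probe, the TRAMPOLINE marker, and the address string are all hypothetical names used only for this illustration, not part of any kernel interface.

```python
TRAMPOLINE = "trampoline"  # stand-in for the address of the trampoline code


def call_with_return_probe(func, args, return_address, handler):
    """Toy model of a return probe around a single function call.

    Returns the function's result and the address at which execution resumes.
    """
    stack = [return_address]  # the caller pushed its return address

    # On function entry the probe saves the real return address
    # and replaces it with the trampoline.
    saved = stack[-1]
    stack[-1] = TRAMPOLINE

    result = func(*args)

    # The function "returns" to whatever address is on the stack.
    target = stack.pop()
    if target == TRAMPOLINE:
        handler(func.__name__, result)  # the trampoline runs the handler
        target = saved                  # then jumps to the real return address
    return result, target


def add(a, b):
    return a + b


result, resumed_at = call_with_return_probe(
    add, (2, 3), "caller+1",
    lambda name, value: print("%s returned %r" % (name, value)),
)
```

In this model the caller cannot tell that a probe was present: it receives the same result and resumes at the same address as it would without the probe.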