cs736-f2015: September 2015

Thursday, September 24, 2015

What's the maximum number in your tests?

Hi,

When displaying the min, 1st quantile, median, and 3rd quantile, and max over our samples, we found that the min, 1st quantile, median, and 3rd quantile are in the reasonable range, while the max number can be huge. Do you have the similar phenomenon?

I paste my results of the clock precision with different loop numbers here.

Are the Max columns here reasonable? Thanks.

Tuesday, September 22, 2015

Project Spec and Means vs Minimums

Under the Timing Issues header, the project description states:

"To get good timing results, you will need to repeat an operation (such as the context switch) many times and divide the total time by the number of repetitions."

I was asked about this and I wanted to post here so it could be discussed. I have not been dividing by the total number of times as the direction suggests. Reporting the minimum has been mentioned many times as a more appropriate choice for our results.

System Zeroing out memory

In this paper project, we were asked to look at the time it takes the system to zero out memory. Naturally that leads to looking at the implementation of malloc which can be found here. However, it sort of leads to a dead end as we find that the system, if it doesn't use mmap (if you allocate very large files it might) it will grow the heap using this function.

So, now I'm lost because I wanted to find the code that zeroed out the memory, not the code that grew the heap. Does anyone have any hints as to where this would be located?

Edit:
As a follow up, I found the kernel's page fault handler and that's probably what calls the zeroing function. The function arch_flush_lazy_mmu_mode(); probably is the thing which does the zeroing- if you look at the macro expansion in its definition, there's assembly that uses words like 'clobber'.

[Note: code.woboq.org is slightly different versions of kernel+glibc code than the lab machines]

Saturday, September 19, 2015

RDTSCP vs. CPUID+RDTSC

Hello all,

Do you have any thoughts on RDTSCP vs. CPUID+RDTSC for timing? I don't believe we are allowed to use RDTSCP on the lab machines as it's a privileged instruction. Long story short, the "P" version of the instruction serializes the CPU pipeline before getting the cycle counter, to more fine-tune the results of your measurement. However, CPUID also flushes the pipeline, so would simply putting it before an RDTSC yield similar results?

Curious if people are doing this or simply using RDTSC (as our numbers are varying a bit from other people's).

I've also authored a StackOverflow question in the past (some of my past research has involved these types of benchmarks, but at the kernel level) which you may have run into: http://stackoverflow.com/questions/27693145/rdtscp-versus-rdtsc-cpuid and got a relatively similar answer to my thoughts.

Friday, September 18, 2015

malloc vs mmap

While reading the Mach paper, I remembered a question I had while taking my undergrad OS. We built a memory allocator that did the same operations as malloc: allocating and freeing space on the heap. To implement it, we asked for space from the OS using mmap.

I'm confused about the operation of mmap. I can't tell how it relates to files because some of the stackoverflow posts I have been reading seem to suggest that you can use mmap to read files (I know you have to go through the filesystem and use an fread call, I think). Maybe setting up a scenario would help: I have a file (~/Downloads/hclinton.txt) that I want to read in a program. When would I want to use mmap to allocate memory for this file, and when would I want to use malloc?

Wednesday, September 16, 2015

Question about Link Update and Message Passing while and after Process Migration

There is one thing that the paper does not explain thoroughly: according to the paper,

After a process migrates, the first message sent to it by a remote process causes an update sent back to the sender informing it of the mirgrated process's new location.

But how does the system implement it?

For simplicity, say originally, process P1 is in machine A, process P2 is in machine B. P2 has a link pointing to P1. My thoughts are: when P1 migrates,

1. P1 sends a message to SwitchBoard to update its link and leaves a clue in Machine A's link table which says it has migrated, then when P2 send message m to P1's old link, it finds out the clue and consults the SwitchBoard where to be forward. Of course, forward address can also be field in link table (which eliminates the need of consulting SwitchBoard).

But will it be a memory overhead since most processes does not migrate? And how should the message m be forward, by machine A or first sent back to machine B and resent by machine B? And if P1 migrates several times, the message m must be transfer multiple times in order to get to P1, and that will be intolerable if P1 is accessed heavily.

2. P1 simply sends a message to SwitchBoard and SwitchBoard broadcasts this change. But there are two problems, SwitchBoard does not know the full list of processes which have links to P1; might cause a big flow of messages and lose the "copy-on-write" like property.

So, how it's actually done? The paper seems to focus on the general communication method and does not talk much about the details.

Monday, September 14, 2015

Questions regarding part 4 of the first paper

When we ran the context switch overhead tests for part 4 of the first paper assignment, there were a few time spikes scattered throughout the trials. I think they were mostly near the beginning of the trials, although sometimes the spikes did appear in other places. I would just like to know if anyone else has seen these spikes and if so, what they think could be the cause. Right now, it looks like these spikes are due to preemption or something similar that is cutting in on processor time, but there are other troubling possibilities.

One possible issue lies with the manner in which I implemented the repeated trials. Each testing program uses a function containing a for loop, and inside this loop is a call to a different function which performs a single trial. Inside the trial performing function, all variables, including the beginning and end time structs, are re-declared. At first I thought it would be wise to perform each trial in a fresh context using the implementation described above. However, I am now concerned as to whether this could be effecting the time measurements. In particular, I am worried about the possibility that some of the methods for performing a context switch might perform differently the first time, in which case these measurements, which treat every trial like the first trial, would be no good.

How close, in general, should the trial loop be to the actual trial implementation? Is it okay to remove it to a different function as I have done, or would it be smarter to perform the tests with a loop closely surrounding the trial implementation, all in the same context? Would it be safe to assume that each context used for a context switch trial is relatively the same, or would it be safer to keep producing new contexts to obtain a more varied population sample?

Wednesday, September 9, 2015

Dealing with variance in measurement

Here's a recent Q&A about the first assignment:

On 9/9/2015 1:26 PM, A Eager Student wrote:

First, as partners on this paper, do we submit the same paper, co-written?

Just one paper with both names on it.

Second, if you have several numbers, say 100 measurements of duration,
what is a good statistical measure of how close these numbers are to each
other? We thought to use standard deviation but thought that that might be
better suited to looking at how much a single value differed from the mean,
whereas we want a total measure of variance.

If you think that the underlying population distribution is normal, then you can calculate a standard deviation.

However, the question is whether that is the right way to summarize the information. If you are looking for the underlying cost of an operation, with as much of the noise removed at possible, then a minimum might be better.

And your first measurement might be much higher, do to caching effects.

Often the best way to display this information is to graph with a box-whisker plot.

--bart

Thursday, September 3, 2015

Entering and exiting the kernel as an example of a protected subsystem

Since the kernel is the prototypical protected subsystem, and was the natural starting place for design of Multics rings, I wanted to make sure that everyone understood the mechanics of how that worked.

In my CS537 (undergrad OS course) lecture notes, I have a nice section on how processes enter and exit the kernel in a modern operating system. This is material that typically does not appear in a text book. So check out this section of my notes:
http://pages.cs.wisc.edu/~bart/537/lecturenotes/enterexit.html

This should help as background to our class discussion on Friday.