[time-nuts] Time stamping with a PICPET

Mon Oct 28 02:08:08 UTC 2013

On 26 Oct, 2013, at 23:53 , Hal Murray <hmurray at megapathdsl.net> wrote:
> dennis.c.ferguson at gmail.com said:
>> That's perfect if it works like it seems it should.  The problem with modern
>> CPUs is finding an instruction sequence that does the read-write-read in
>> that order, allowing each to complete before doing the next.  The write is
>> the biggest problem.  Writes are often buffered, and even when you can find
>> an instruction which stalls until it clears ...
> 
> I'm far from a wizard in this area, but I used to work with people who were.
> 
> The rules for things like PCI cover that case.  If you do something like 
> write to a register to clear an interrupt request, you have to follow it by a 
> read to that register or one close to it.  As you hippity-hop through bridges 
> and such, the read gets trapped behind the write and doesn't happen until the 
> write finishes.
> 
>> When using the CPU cycle counter as a system clock source it is common to
>> find that the two reads in a read-write-read sequence are only a cycle or
>> two different even when you know the write is crossing an interconnect with
>> 10's of nanoseconds of latency (not that 10's of nanoseconds is bad...). 
> 
> That's reasonable if the read-write-read were to cycle-counter, 
> someplace-else, and cycle-counter.  The write has been started.  It's in the 
> pipeline, but you haven't told the memory system that you need it to finish.
> 
> Try read-write-read-read where the outer reads are to the cycle counter and 
> the inner write-read both go to the same IO device.

Note that you've turned a read-write-read into a read-read-read with an additional
write wart.  As I mentioned, you can often find instructions to do a read-read-read
correctly, so this will likely work too.

Putting some numbers to this might help get a handle on the cost, though.  One
reason for doing the before- and after- reads is to get a measurement of the
ambiguity of the sample (which also provides a basis for filtering damaged samples).
Cycle counter reads cost hardly anything, but a single register read on a 166 MHz,
64-bit PCI-X bus takes about 74 ns to complete.  (PCI-X was the last, highest
performance PCI bus that was a real bus; PCIe is a packet protocol running on a
network of point-to-point links.)  I'll guess the write adds about 40 ns to that.
Since the write increases the ambiguity from +/- 37 ns (i.e. read-read-read only,
half of 74 ns) to +/- 57 ns (half of 74 + 40 ns), finding a way to do it with a
read-read-read alone provides a useful improvement.  For PCIe the write is probably
cheaper, but the read is likely to be even more expensive due to the packetization
and (de)serialization logic that "bus" requires.
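
To make the arithmetic explicit, here is a minimal sketch.  It assumes the
ambiguity is simply half the time between the before- and after- reads of the
cycle counter; the function name ambiguity_ns is made up for illustration, and
the 74 ns and 40 ns figures are the ones quoted above (the 40 ns being a guess).

```python
def ambiguity_ns(bracket_ns):
    """Half-width of the window between the before- and after- reads.

    The device was sampled somewhere inside the window, so the best guess is
    the midpoint and the uncertainty is plus or minus half the window.
    """
    return bracket_ns / 2.0


READ_NS = 74.0    # PCI-X register read time quoted above
WRITE_NS = 40.0   # guessed extra cost of the write (an assumption)

read_only = ambiguity_ns(READ_NS)               # read-read-read
with_write = ambiguity_ns(READ_NS + WRITE_NS)   # read-write-read-read

print(f"read-read-read:       +/- {read_only:.0f} ns")
print(f"read-write-read-read: +/- {with_write:.0f} ns")
```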

It is the case, however, that if you do a naive implementation of the read-read-read,
or the read-write-read-read, you may end up finding the first and last read of
the cycle counter are still only a few cycles apart.  The reason is that while
most modern CPUs will execute the instructions more-or-less in order (most can
now issue 2 instructions per cycle, so the order may not be exact), the CPU has
no reason to actually wait the 74 ns it takes that middle read to complete.  It
will go barrelling along executing additional instructions until it finds
something that actually uses the result it hasn't got yet.  The second cycle
counter read doesn't depend on the device read before it, so there's no reason
to wait.
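
One way to catch this after the fact is to look at the width of each bracket.
The sketch below is only an illustration of that idea: the Sample type, the
classify function and the cycle thresholds are all hypothetical, and the
thresholds assume a CPU clock of roughly 3 GHz, where 74 ns is about 222 cycles.
A bracket much narrower than the bus latency suggests the counter reads did not
wait for the device read; a much wider one suggests the sample was damaged.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    before: int   # cycle counter read before the device read
    device: int   # value returned by the device register
    after: int    # cycle counter read after the device read


def classify(sample, min_cycles, max_cycles):
    """Judge a sample by the width of its before/after bracket."""
    width = sample.after - sample.before
    if width < min_cycles:
        return "unserialized"   # second counter read did not wait for the device
    if width > max_cycles:
        return "damaged"        # interrupted or otherwise delayed
    return "ok"


def midpoint_and_ambiguity(sample):
    """Best-guess counter value at the device read, and the +/- ambiguity."""
    width = sample.after - sample.before
    return sample.before + width / 2.0, width / 2.0


# Hypothetical thresholds, in cycles, for a roughly 3 GHz CPU.
MIN_CYCLES, MAX_CYCLES = 200, 400

samples = [
    Sample(1000, 5, 1003),   # only 3 cycles apart
    Sample(2000, 6, 2230),   # about 77 ns at 3 GHz
    Sample(3000, 7, 9000),   # far too wide
]
for s in samples:
    print(classify(s, MIN_CYCLES, MAX_CYCLES), midpoint_and_ambiguity(s))
```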

To get these operations serialized correctly you need to find the magic
instructions that force that to happen.  On a recent x86 you might be able
to use the rdtscp instruction, which waits for earlier instructions to
finish before it reads the counter; on older ones you might need to separate
the operations with cpuid instructions.  On other CPUs it could be barrier
instructions or something else entirely.  A posted write may only be
serializable by adding an extraneous and expensive read to the same device
after it, as you suggest.

Dennis Ferguson