James Morle's Blog
Testing Transport Latency
Posted on 8:06 am February 24, 2012 by James Morle
(Updated 04/04/12 to fix bug with large raw disk partitions)
I need your help!
One of the things that I have struggled to get data for in my storage research is the latency of the interface between server and storage device. It's easy to get numbers for the actual storage device, but the latency numbers for the interface are typically not published. That's a shame, because the latency at this piece of the architecture is going to become increasingly important as we move wholesale to SSD. I don't mean flash specifically here; I mean semiconductor storage devices, of which flash is one example. But this article isn't about that, except by implication. It's just about the piece of wet string that connects storage devices to servers (the transport).
I decided that the best way to get the transport latency figures was to measure them through some kind of test, and that's the subject of this article. I have written a small piece of C code (Linux only, currently) that I believe produces a reasonable estimate of the transport latency, and I would like to start using it to gather data from anyone that would like to share it.
The test is very simple: it opens a file with the O_DIRECT flag, so that reads bypass the operating system's page cache, and reads the same single 4KB block 10,000 times. A 'file' can be a file that you happen to have in the filesystem, or a disk device of any kind (multipath, raw disk slice, etc). The important thing is that the file exists on a storage device (disk, SSD, DRAM, whatever) that is connected via an identifiable interface and topology. For example, if I run the test on my laptop I run it against a file in a filesystem which is stored on a SATA2 SSD. The SSD part of that sentence doesn't matter because I am just testing, in this case, the SATA2 interface. On a Fibre Channel system, my test file might be located on a LUN in a storage array, connected by some kind of Fibre Channel SAN topology. In this case, I would just be testing the SAN transport.
The test is not perfect: there is additional latency added by layers that must be present to complete the test, namely the entire Linux SCSI and device driver stack, and the very outer edges of the storage device's microcode. These parts all add latency, but not enough to materially change the results.
To build the test, simply copy and paste the following code into an editor window, and save it as 'latest.c'. Now compile the code as follows:
gcc -O -o latest latest.c
You should now have an executable binary in the current directory. Now just execute the test as follows:
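./latest <filename>
Here <filename> is the file or device to test. The results are written to 'latest.out' in the current directory. When sharing the results, please include answers to the following questions: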
- What is the physical interface type between server and storage? This should be one of SATA, SAS, Fibre Channel, Infiniband, Ethernet, or any other that you may have.
- What generation of interface is it? This might be "16Gbps Fibre Channel", "1Gbps Ethernet", "SATA2", or some other combination.
- What is the topology? This might be "Direct attached", "via single switch", "via two switches", "multipath across two switches", or any number of combinations. Please be really descriptive here!
- What are the switches (if any)?
- What is the storage device? This might be "direct attached HDD", "direct attached SSD", the name of a storage array, or some other value.
- Are there any other pertinent points? For example, if you have 16Gbps Fibre Channel HBAs in the servers but only 4Gbps on the storage array, it would be nice to know that.
In return, I will:
- Review all the files that come in
- Let you know if there's something weird happening on your system (I'll need to ask you some more questions, probably)
- Collate the results
- Publish a summary
/* latest.c
 *
 * Simplistic test for measuring approximate latency of storage transport
 *
 * Copyright 2012 Scale Abilities Ltd
 *
 */
#define _GNU_SOURCE /* needed for O_DIRECT */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSIZE 4096
#define SAMPLES 10000

int main(int argc, char *argv[]) {
    char *mem, *buf;
    struct timeval t0, t1;
    FILE *logfd;
    int infd, i;
    int res[SAMPLES]; /* per-read latency in microseconds */

    if (argc != 2) {
        fprintf(stderr, "Must supply filename\n");
        exit(EXIT_FAILURE);
    }
    /* O_DIRECT: reads are not served from the page cache */
    if ((infd = open(argv[1], O_RDONLY | O_DIRECT)) < 0) {
        perror("Cannot open test file");
        exit(EXIT_FAILURE);
    }
    /* The test file must hold at least one full block */
    if ((size_t)lseek(infd, 0, SEEK_END) < (size_t)BLOCKSIZE) {
        fprintf(stderr, "Test file must be %d bytes or larger\n", BLOCKSIZE);
        exit(EXIT_FAILURE);
    }
    if ((logfd = fopen("latest.out", "w")) == NULL) {
        perror("Cannot create output log file");
        exit(EXIT_FAILURE);
    }
    /* Align buffer for O_DIRECT: over-allocate, then round down to a
     * 512-byte boundary that still lies inside the allocation */
    if ((mem = malloc(2 * BLOCKSIZE)) == NULL) {
        perror("Cannot allocate buffer");
        exit(EXIT_FAILURE);
    }
    buf = (char *)(((uintptr_t)mem + BLOCKSIZE) & ~(uintptr_t)0x1FF);
    memset(buf, 0, BLOCKSIZE);

    for (i = 0; i < SAMPLES; i++) {
        gettimeofday(&t0, 0);
        if (pread(infd, buf, BLOCKSIZE, 0) != BLOCKSIZE) {
            perror("Read error");
            exit(EXIT_FAILURE);
        }
        gettimeofday(&t1, 0);
        /* Elapsed microseconds; stored as int rather than short so
         * that reads slower than about 32ms do not overflow */
        res[i] = (int)((t1.tv_sec - t0.tv_sec) * 1000000
                       + (t1.tv_usec - t0.tv_usec));
    }
    close(infd);

    /* One line per sample: index, latency in microseconds */
    for (i = 0; i < SAMPLES; i++)
        fprintf(logfd, "%d, %d\n", i, res[i]);
    fclose(logfd);
    free(mem);
    exit(0);
}
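The buffer alignment in the listing is done by hand: over-allocate, then round the pointer. The same step could probably be written with posix_memalign() instead. This is only a sketch under assumptions: alloc_aligned_block is a hypothetical helper name that is not part of latest.c, and the 4096-byte alignment used in the usage note is my choice (the listing aligns to 512 bytes).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of latest.c.
 * Returns a buffer of 'size' bytes whose address is a multiple of
 * 'alignment', or NULL on failure. 'alignment' must be a power of two
 * and a multiple of sizeof(void *). Release the buffer with free(). */
void *alloc_aligned_block(size_t alignment, size_t size) {
    void *p = NULL;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
}
```

In the test program this would stand in for the malloc() and pointer arithmetic, for example buf = alloc_aligned_block(4096, BLOCKSIZE), with the returned pointer passed directly to free() at the end.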


>What is the physical interface type between server and storage? This should be one of SATA, SAS, Fibre Channel, Infiniband, Ethernet, or any other that you may have.
Hi James,
Unless you stipulate external, attached storage you are going to get a latency reading of the controller cache on the PCI controller card. I've quoted the above to draw attention to the fact that I know of no way to attach SATA or SAS as DAS without a cache-enabled controller. Can one even buy a controller that has no cache?
Also, I'd stipulate that this test be run on an otherwise *completely* idle system as there is plenty of opportunity for a time slice between the two gettimeofday() calls.
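One way to spot samples that were hit by a time slice might be to check the process's involuntary context switch count around each timed read. This is a rough sketch, assuming Linux's getrusage() fills in ru_nivcsw; the struct and the sample_begin/sample_end names are hypothetical and not part of latest.c. It will not catch everything (time spent in interrupt handlers is not counted, for instance).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* State captured just before a timed operation (hypothetical type). */
struct sample {
    struct timeval t0; /* wall-clock start time */
    long nivcsw0;      /* involuntary context switches so far */
};

/* Involuntary context switches of this process, or -1 on error. */
long involuntary_switches(void) {
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_nivcsw;
}

/* Call immediately before the operation to be timed. */
void sample_begin(struct sample *s) {
    assert(s != NULL);
    s->nivcsw0 = involuntary_switches(); /* kept outside the timed window */
    gettimeofday(&s->t0, NULL);
}

/* Call immediately after the operation. Returns elapsed microseconds and
 * sets *preempted to 1 if an involuntary switch was seen (or the counter
 * could not be read), otherwise 0. */
long sample_end(const struct sample *s, int *preempted) {
    struct timeval t1;
    long now;

    assert(s != NULL && preempted != NULL);
    gettimeofday(&t1, NULL);
    now = involuntary_switches();
    *preempted = (s->nivcsw0 < 0 || now != s->nivcsw0);
    return (long)(t1.tv_sec - s->t0.tv_sec) * 1000000L
           + (long)(t1.tv_usec - s->t0.tv_usec);
}
```

The pread() call would sit between sample_begin() and sample_end(), and any sample with the preempted flag set could be dropped or marked in the output. The getrusage() calls are deliberately placed outside the two gettimeofday() calls so that they do not inflate the measurement.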
Kevin,
Both are good points.
I might be interested in even getting some figures for controller cache hits, though. This is probably a best case scenario for hopping out onto the PCI bus to an external memory device, so it would be good to have as a baseline. Desktop and laptop machines almost certainly don't have controller cache (unless I'm mistaken, I guess they could have a touch of write-through cache), and it would be good to capture the interface latency to SAS and SATA devices from those platforms.
Excellent point on the idle system, though the 10,000 samples should let through some samples that don't exhibit time smear, even on a system with some kind of load.
Cheers!
James
I'd be inclined to recommend not hammering the I/O in such a tight loop... perhaps a poll(,,N) where N is random between 10 and 100 inserted into the loop. I think there would be more opportunity to get some samples of the routine stalls some controllers suffer.
I'd also initialize the array elems so there are no ZFOD (zero-fill-on-demand) hits in the loop... or... keep it simple
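For anyone who wants to try both suggestions, they could look something like this. It is a sketch only: pretouch and random_pause_ms are hypothetical helper names, and I am assuming the N in poll(,,N) is meant in milliseconds, since that is the unit poll() takes for its timeout.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <string.h>

/* Write to every byte of a buffer up front, so that first-touch
 * (zero-fill-on-demand) page faults happen before the timed loop
 * rather than inside it. */
void pretouch(void *p, size_t len) {
    assert(p != NULL);
    memset(p, 0, len);
}

/* Sleep for a random whole number of milliseconds in [lo_ms, hi_ms]
 * and return the value chosen. poll() with no file descriptors acts
 * as a plain timer. */
int random_pause_ms(int lo_ms, int hi_ms) {
    int n;

    assert(lo_ms >= 0 && hi_ms >= lo_ms);
    n = lo_ms + rand() % (hi_ms - lo_ms + 1);
    poll(NULL, 0, n);
    return n;
}
```

In the test program, pretouch(res, sizeof res) would go before the loop and random_pause_ms(10, 100) at the end of each iteration, outside the timed window. With 10,000 samples and an average pause of about 55ms, the run would take several minutes instead of a few seconds.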
Kevin,
Thanks for the suggestions. However, I think the current simple approach is working well. There are indeed outliers, but these can be statistically discarded - I'm looking for the base latency overhead, without all the exceptions, at this stage.
Cheers!
James
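Discarding the outliers statistically can be as simple as sorting the samples and looking at the minimum and the low percentiles rather than the mean. Here is a minimal sketch, assuming the "index, microseconds" line format that latest.c writes to latest.out. The function names are hypothetical, and the percentile rule (round to the nearest index) is just one reasonable choice among several.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort() comparator for ascending ints. */
int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Percentile p (0..100) of an ascending-sorted array, taken at the
 * index nearest to p% of the way through the array. */
int percentile(const int *sorted, size_t n, double p) {
    size_t idx;

    assert(sorted != NULL && n > 0 && p >= 0.0 && p <= 100.0);
    idx = (size_t)(p / 100.0 * (double)(n - 1) + 0.5);
    return sorted[idx];
}

/* Read "index, microseconds" lines into out[]; returns how many
 * samples were read (0 if the file cannot be opened). */
size_t load_samples(const char *path, int *out, size_t max) {
    FILE *f = fopen(path, "r");
    size_t n = 0;
    int idx, us;

    if (f == NULL)
        return 0;
    while (n < max && fscanf(f, "%d , %d", &idx, &us) == 2)
        out[n++] = us;
    fclose(f);
    return n;
}
```

A caller would load latest.out, sort the array with qsort(samples, n, sizeof samples[0], cmp_int), and then report percentile(samples, n, 0) as the minimum and percentile(samples, n, 50) as the median. The low end of the distribution is the part least affected by scheduling noise and controller stalls, so it is probably the closest thing to a base latency figure.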