Posted on 8:06 am February 24, 2012 by James Morle
(Updated 04/04/12 to fix bug with large raw disk partitions)
I need your help!
One of the things that I have struggled to get data for in my storage research is the latency of the interface between server and storage device. It’s easy to get numbers for the actual storage device, but the latency numbers for the interface are typically not published. That’s a shame, because the latency at this piece of the architecture is going to become increasingly important as we move wholesale to SSD. I don’t mean flash specifically here, I mean semiconductor storage devices, of which flash is a component. But this article isn’t about that, except by implication, it’s just about the piece of wet string that connects storage devices to servers (the transport).
I decided that the best way to get the transport latency figures was to measure them through some kind of test, and that’s the subject of this article. I have written a small piece of C code (Linux only, currently) that I believe produces a reasonable estimate of the transport latency, and I would like to start using it to gather data from anyone that would like to share it.
The test is very simple. It simply opens a file with the O_DIRECT flag and reads the same single 4KB block 10,000 times. A ‘file’ can be a file that you happen to have in the filesystem, or a disk device of any kind (multipath, raw disk slice, etc). The important thing is that the file exists on a storage device (disk, ssd, DRAM, whatever) that is connected via an identifiable interface and topology. For example, if I run the test on my laptop I run it against a file in a filesystem which is stored on a SATA2 SSD. The SSD part of that sentence doesn’t matter because I am just testing, in this case, the SATA2 interface. On a Fibre Channel system, my test file might be located on a LUN in a storage array, connected by some kind of Fibre Channel SAN topology. In this case, I would just be testing the SAN transport.
The test is not perfect: there is additional latency added by layers that must be present to complete the test, namely the entire Linux SCSI and device driver stack, and the very outer edges of the storage device’s microcode. These parts all add latency, but not enough to materially change the results.
To build the test, simply copy and paste the following code into an editor window, and save it as ‘latest.c’. Now compile the code as follows:
gcc -O -o latest latest.c
You should now have an executable binary in the current directory. Now just execute the test as follows:
./latest <testfile>
…where <testfile> is the name (and path) of the file that exists on the storage device you wish to test against.
This should take somewhere between half a second and ten seconds, depending upon the storage transport being tested, and produce a file named “latest.out”.
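If you would like a quick look at your own numbers before sending them, the following is a minimal sketch of a summariser, not part of latest.c. It assumes the output format that latest.c writes (one "index, microseconds" pair per line) and that there are at most 10,000 samples. The function names (load_samples, percentile, summarize) are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator: ascending ints */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Read "index, microseconds" lines from path into out[]; returns count read */
static size_t load_samples(const char *path, int *out, size_t max) {
    FILE *f = fopen(path, "r");
    size_t n = 0;
    int idx, usec;
    if (f == NULL)
        return 0;
    while (n < max && fscanf(f, "%d, %d", &idx, &usec) == 2)
        out[n++] = usec;
    fclose(f);
    return n;
}

/* Nearest-rank percentile (pct in 1..100) of an ascending-sorted array */
static int percentile(const int *sorted, size_t n, unsigned pct) {
    size_t rank;
    assert(n > 0 && pct <= 100);
    rank = (pct * n + 99) / 100;      /* ceil(pct * n / 100) */
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return sorted[rank - 1];
}

/* Print min, median, 99th percentile and max; returns 0 on success */
static int summarize(const char *path) {
    static int samples[10000];        /* assumed upper bound on sample count */
    size_t n = load_samples(path, samples, 10000);
    if (n == 0) {
        fprintf(stderr, "No samples read from %s\n", path);
        return -1;
    }
    qsort(samples, n, sizeof samples[0], cmp_int);
    printf("samples=%zu min=%d median=%d p99=%d max=%d (microseconds)\n",
           n, samples[0], percentile(samples, n, 50),
           percentile(samples, n, 99), samples[n - 1]);
    return 0;
}
```

Calling summarize("latest.out") from a small main() prints one summary line. The median is probably the most useful single figure here, since the occasional slow read can skew a simple average.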
I would then be extremely grateful if you could email me the output file (about 80-100KB) along with details of the storage transport that exists between the server and the test file. I really need you to be detailed here:
- What is the physical interface type between server and storage? This should be one of SATA, SAS, Fibre Channel, Infiniband, Ethernet, or any other that you may have.
- What generation of interface is it? This might be “16Gbps Fibre Channel”, “1Gbps Ethernet”, “SATA2”, or some other combination.
- What is the topology? This might be “Direct attached”, “via single switch”, “via two switches”, “multipath across two switches”, or any number of combinations. Please be really descriptive here!
- What are the switches (if any)?
- What is the storage device? This might be “direct attached HDD”, “direct attached SSD”, the name of a storage array, or some other value.
- Are there any other pertinent points? For example, if you have 16Gbps Fibre Channel HBAs in the servers but only 4Gbps on the storage array, it would be nice to know that.
With that, I’ll leave you with the code. In return for your help, I will do the following:
- Review all the files that come in
- Let you know if there’s something weird happening on your system (I’ll need to ask you some more questions, probably)
- Collate the results
- Publish a summary
Thanks!
Important Disclaimer: I’ve done everything I can to ensure that this code is entirely safe to run, and that it does not damage anything on your system. You should also read the code to satisfy yourself that this is the case before running it, as neither I nor Scale Abilities Ltd will accept any liability for damages that may arise. Use this code at your own risk.
/* latest.c
 *
 * Simplistic test for measuring approximate latency of storage transport
 *
 * Copyright 2012 Scale Abilities Ltd
 *
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
#define BLOCKSIZE 4096
#define SAMPLES 10000
    char *mem, *buf;
    struct timeval tv;
    long long start, elapsed;  /* microseconds; wide enough even where time_t is 32-bit */
    FILE *logfd;
    int infd, i;
    int res[SAMPLES];          /* int rather than short, so reads slower than ~32ms do not wrap */

    if (argc != 2) {
        fprintf(stderr, "Must supply filename\n");
        exit(-1);
    }
    if ((infd = open(argv[1], O_RDONLY|O_DIRECT)) < 0) {
        perror("Cannot open test file");
        exit(-1);
    }
    if ((size_t)lseek(infd, 0, SEEK_END) < (size_t)BLOCKSIZE) {
        fprintf(stderr, "Test file must be %d bytes or larger\n", BLOCKSIZE);
        exit(-1);
    }
    if ((logfd = fopen("latest.out", "w")) == NULL) {
        perror("Cannot create output log file");
        exit(-1);
    }
    /* Align buffer to BLOCKSIZE for O_DIRECT (this also satisfies 512-byte alignment) */
    if ((mem = malloc(2*BLOCKSIZE)) == NULL) {
        perror("Cannot allocate read buffer");
        exit(-1);
    }
    buf = (char *)(((uintptr_t)mem + BLOCKSIZE) & ~(uintptr_t)(BLOCKSIZE - 1));
    memset(buf, 0, BLOCKSIZE);

    /* Time SAMPLES reads of the same block at offset 0 */
    for (i = 0; i < SAMPLES; i++) {
        gettimeofday(&tv, 0);
        start = (long long)tv.tv_sec*1000000 + tv.tv_usec;
        if (pread(infd, buf, BLOCKSIZE, 0) != BLOCKSIZE) {
            perror("Read error");
            exit(-1);
        }
        gettimeofday(&tv, 0);
        elapsed = ((long long)tv.tv_sec*1000000 + tv.tv_usec) - start;
        res[i] = (int)elapsed;
    }
    close(infd);

    /* Write one "sample number, microseconds" line per read */
    for (i = 0; i < SAMPLES; i++)
        fprintf(logfd, "%d, %d\n", i, res[i]);
    fclose(logfd);
    exit(0);
}
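A note for anyone adapting the test: latest.c aligns its read buffer with pointer arithmetic and takes its timestamps from gettimeofday(), which can jump if the system clock is adjusted mid-run. The following is a sketch of one possible alternative, not the code used to gather the results. It uses the standard posix_memalign() and clock_gettime(CLOCK_MONOTONIC) calls. The helper names alloc_aligned and monotonic_usec are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: allocate size bytes aligned to alignment, which must
 * be a power of two and a multiple of sizeof(void *). Returns NULL on failure. */
static void *alloc_aligned(size_t alignment, size_t size) {
    void *p = NULL;
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
}

/* Hypothetical helper: current time in microseconds from a clock that is
 * not affected by changes to the wall-clock time. */
static int64_t monotonic_usec(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  /* silence unused-variable warning when NDEBUG is defined */
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}
```

In latest.c this would amount to replacing the malloc() and alignment lines with buf = alloc_aligned(BLOCKSIZE, BLOCKSIZE), and taking the start and end times from monotonic_usec(). On older glibc versions you may need to add -lrt to the gcc command line for clock_gettime().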
Posted on 7:52 am February 1, 2012 by James Morle
We’ve just booked the first European venue for the Understanding Storage Masterclass. I will be presenting the Masterclass on April 24/25 2012 at Prospero House in London. Tickets are available HERE.
I’m pretty excited to host this training session in my home country, and I hope to see you there!
Posted on 7:53 am November 30, 2011 by James Morle
This year at the UKOUG conference the OakTable Network will be trying something a little different. In addition to the usual 45-60 minute presentations during the conference, and the special OakTable Sunday event immediately prior to the conference, we will also be trialling a new concept – the OAK Talk. Anybody that has watched a TED Talk, or was fortunate enough to attend one, will be immediately familiar with this concept – very short, concise and entertaining presentations.
The OAK Talks will be presented every day during the conference in a rapid fire fashion. Each lunchtime the ‘Unconference’ area of the exhibit hall will be occupied by the OakTable team to deliver FIVE presentations within the space of an hour. The presentations will be different each day and will feature the following presenters:
Monday
- Tuomas Pystynen
- Niall Litchfield
- Doug Burns
- Marco Gralike
- Jonathan Lewis
Tuesday
- Graham Wood
- Niall Litchfield
- Mogens Noergaard
- James Morle
- Martin Widlake
Wednesday
- Alex Gorbachev
- Christian Antognini
- David Kurtz
- John Beresniewicz
- Dan Norris
We will be tweeting further news using the hashtag #OakTalks.
Hope to see some of you there. It promises to be an interesting spin on technical presentation style.
Posted on 7:57 am November 14, 2011 by James Morle
This is a post about the importance of appropriately simplistic architectures. I frequently get involved with the creation of full-stack architectures, and in particular the architecture of the database platform. There are some golden rules when designing such systems, but one of the most important ones is to keep the design as simple as possible. This isn’t a performance enhancement, this is an availability enhancement. Complexity, after all, is the enemy of availability.