

                        Quick Sort 32 v2.13
                       Compiled 6 July 1997
         32bit protected mode computer speed test program

                 (C) Copyright 1997 by Sami Farin
                        All Rights Reserved
                      mailto:sfarin@ratol.fi
                   http://www.ratol.fi/~sfarin/





    Index
      1.   What's this?
      2.   Requirements
      3.   Usage
      4.   About Qsort
      5.   Command line arguments
      6.   Technical (?) stuff
      7.   Random data asm src
      8.   Errorlevels
      9.   Final words
     10.   More words





[1.] - What's this?

   Quick Sort 32 v2.13 (shortly Qsort) is a computer speed test program.
   It runs in 32bit protected mode (using DOS32 v3.3 by Adam Seychell).
   Qsort is programmed in assembly and optimized for Pentium class
   machines.



[2.] - Requirements

   Depending on the test size, Qsort needs as much (real/virtual) memory
   as it takes to store the test data. You can test with sizes of
   4 bytes (with 'F') - 4 gigabytes. The random numbers are the same
   every time you run the program (surprise!).

   Qsort also requires 386+. If FPU is available, it will be used to
   calculate the speed, since overflows don't occur so easily as with 64bit
   integers. When FPU is used, max. displayed speed is 42949672.95 kB/s
   (20000x more than Pentium Pro 180) and without FPU about 42000 kB/s.
   Minimum for both is about 0.00 kB/s.
   If the test runs for more than 2^32 microseconds, the kB/s results will
   not be correct if you don't have a FPU or don't use 'T'. If you have a
   FPU, elapsed time will be displayed incorrectly if the test runs for
   longer than 2^32 microseconds (~36 minutes), but the kB/s rate is
   calculated correctly, though. If using TSC, max. elapsed time is 2^64
   CPU core clock cycles (2500-10000 years) and kB/s rate will be displayed
   correctly.

   Qsort (actually DOS32) supports DPMI v0.9, VCPI v1.0, XMS v2.0
   and v3.0, Int 15h and RAW mode (clear boot). Win95 is also ok, but
   please note that GUIs etc take >5 MB of memory. It means that Win95
   starts swapping like crazy when qsorting with >11MB on a 16 MB machine.
   And THEN it IS slow!
   Try closing every program and removing the wall paper if your disk
   crunches a lot during the test. You can use Qsort to test Win95
   swapping speed and hd speed :)
   Best result is achieved with clear boot (F5 at bootup for MSDOS 6+).
   Qsort uses timing interrupt of it's own, so TSRs won't slow it down,
   because they don't get called at 18.2 Hz rate as before. So you should
   get the same results even when using different DOS and/or BIOS (on the
   same computer, of course).

   DOS32 extender takes about 8.5 kB and the Qsort itself about 6 kB
   unpacked (thank the gods I'm not a VB-programmer).
   Some debuggers (SoftICE for DOS, TD, GameTools...) don't work with
   Qsort (actually DOS32). You have to unload them before using Qsort.
   Sorry. No fix.



[3.] - Usage

   Qsort can be run in many ways, the easiest being "qsort32 [ENTER]".
   If you don't have a keyboard, try 2xCLICK.
   If you don't have enough memory, Qsort quits whining.
   If you've got lots of mem, the test will begin. First it parses the
   command line (see [5.]). Then Smartdrive and Win95 buffers are flushed
   and the test data will be generated (you will see 'i'). If you have used
   command line option 'D', the random numbers will be saved to file
   "Unsorted.$$$" before flushing. You will get a bit lower results if you
   use 'D'. 1st and 2nd level caches are not flushed before sorting.
   'w' is displayed if you used option 'W'.

   Since v2.0, Qsort captures timer interrupt to achieve more stabile and
   precise timings. On my computer, DOS used about 100 microseconds in every
   timer interrupt (ints happen about 18.2 times per second). And that 100
   us is not the same every time. So I decided to set up a timer interrupt
   of my own. DOS time is not updated while Qsort is sorting the data.

   Well, then comes the actual test; timer is started and the data sorted.
   After the sort timer is stopped. Timing accuracy is about 1/(4.77M/4)s
   ~839ns or 4-17ns with TSC. Please note that the elapsed time is rounded.
   It is stored more accurately on the FPU (e.g. 345987ns "inside" the FPU,
   but it's displayed as 346us). But what the heck.

   Then the sorted data is checked (to make sure it actually IS sorted)
   and 'c' will be displayed. It the data isn't ok, you will see '!'
   (but you shouldn't see it, EVER). If you are overclocking and see '!',
   your computer isn't working correctly...


   Here some other results you might expect:
   *=256 kB L2 PB-cache
  **=512 kB L2 PB-cache

  Machine              Speed (kB/s) with 12000 / 6000 / 1000 kB
**AMD K6 208.333 MHz   3053.1 / 3384.1 / 4008.4  Mem timings tweaked a bit
**AMD K6 208.333 MHz   2941.1 / 3286.9 / 3967.4  83.333*2.5 MHz
**AMD K6 200 MHz       2822.4 / 3132.2 / 3767.4  66.666*3 MHz
**AMD K6 200 MHz       ------ / 2925.5 / ------  66.666*3 MHz
  Pentium II 266 MHz   ------ / 3169.5 / ------  512 kB cache @ 133 MHz
  Pentium Pro 180 MHz  ------ / 2253.1 / 2534.1  Internal 256 kB L2 cache
 *Pentium 166 MHz      2274.7 / 2542.0 / 3037.7  66.666*2.5
**Cyrix 6x86-P166+     1985.6 / 2209.6 / 2599.1  ASUS P/I P55T2P4, 64 MB EDO
 *Pentium 120 MHz      1738.6 / 1927.7 / 2276.7  60*2 MHz
 *Pentium 120 MHz      ------ / 1905.8 / 2259.5  60*2 MHz
  Cyrix 6x86-P166+     1580.0 / 1760.0   2088.1  133 MHz CPU, no L2 cache
 *Cyrix 6x86-P120+     1459.4 / 1627.5 / 1904.5  100 MHz CPU, 16 MB FPM
 *Pentium 90 MHz       ------ / 1322.1 / 1563.1
  486DX 33 MHz         ------ /  275.0 /  323.2  8+64 kB cache
  386SX 20 MHz         ------ / ------ /   45.1  only 4MB mem :)

   Pentium Pro/II are so "slow" because they can't execute compare and
   conditional jump in parallel. And I couldn't work out a "fix" for that.
   PPro's 2nd level cache is 2-4 times faster than any kind of PB-cache, so
   PPro would beat other Pentiums it if could execute those damn instuctions
   in parallel!


   If you can test this program with different memory types (FPMRAM, EDORAM,
   SDRAM) on the SAME motherboard AND CPU I'd be pleased to know how memory
   type affects the results. Also 256 kB vs 512 kB might be interesting.
   And bus speed (33,50,60,66,75,83,100 MHz).

   Test results for Cyrix 6x86MX and Pentium MMX wanted!

   Please note that this program is optimized for Pentium class machines
   (including Cyrix 6x86). It means that 486-optimized version of Qsort would
   run slower on Pentiums than Pentium-optimized code would and so on.

   If you used command line option 'D', the sorted data is written to file
   "Sorted.$$$". It looks very very interesting, check it out if you don't
   have anything else to do! Be sure to have enough free space on current
   working directory/disk.

   After the test has finished, you may do whatever comes in your mind.
   For example, run it again to see if some nasty interrupts interrupted
   Qsort so lot that it affected the result. It's always a good idea to
   run the program at least twice.



[4.] - About Qsort

   I compared Qsort to a bubble sort routine and found out that quick sort
   algorithm was about 4500 times faster with 1500 integers. And bubble sort
   slows down exponentially.
   If you want the source code for Qsort (well commented and so on, of
   course), please ask me for more info.

   The executable can be used to any purpose. BUT it (and every file in
   this package) may NOT be modified in any way. You may NOT reverse-
   engineer, disassemble or do something as silly to the executable.
   You may not ask money for Qsort32, since it's FREEWARE!

   TASM 5, Qedit, SoftICE for Win95 and some other progs were used in
   development of Qsort. Also lots of optimizing and some thinking happened.

   If you have an ANSI C-language quick sorting routine that is faster than
   Qsort, it would be nice to hear about such a beast (and the compiler).
   It must sort completely random 32bit unsigned integers (as in
   Unsorted.$$$, see [7.]).



[5.] - Command line arguments

   *** No need for '/' nor '-'. The options are case insensitive. ***

   Q: Quiet, output only the speed in kB/s

   R: Reverse, sort reverse order numbers instead of random ones
      9,8,7,6,5,... instead of 7,1,4,6,2,...

   F: File, get numbers from file 'TESTFILE.Q32'
      Qsort sorts them as 32bit integers (you should know that by now...).

   ,: Use ',' as the delimiter instead of '.'
      Outputs '1234,56' instead of '1234.56'.
      Excel might not accept both of them.

   L: Add line feed when using option 'Q'
      Outputs empty line after sort speed when option 'Q' used.

   x: Integer, 1-[big number], test size is in kilobytes (1024 bytes)
      '23000' tells Qsort that you want to test with 23000*1024 bytes.

   M: Megabyte, test size is in megabytes (1048576 bytes)
      Option "23m" means 23*1048576 bytes.

   V: On-The-Screen(TM) using 1024x768x256 linear framebuffer mode
      If you want to see HOW quick sorting algorithm works, you might want
      to check this out! You must have VBE v2.0+ to use this.
      UniVBE or S3VBE20 might help if you don't have VBE v2.0+ already
      on board. Note again that the pixels are sorted as 32bit integers.
      You will get poor results since reading from video memory is REALLY
      slow. This option not intended to measure video performance.

   P: When using 'V' and 'F', get palette from the file (HSI RAW file format)
      Make the file with makeraw1 (256c B&W) or makeraw2 (256c colour).
      If 'P' not specified, also the possible palette+header will be sorted.
      Note: if using 'P' and 'V' without 'F', default palette will be used.
      Image Alchemy is available at ftp://ftp.handmadesw.com.
      You can also use any other 256c B&W format, but in that case, don't
      specify 'P'. You also have to strip the header from a non-HSI RAW file.

   Z: No delays when using option 'V'
      If you just can't wait.

   T: Use RDTSC instruction, found on Intel Pentium+, AMD K5+ and most new
      CPUs. Timing accuracy of TSC (Time Stamp Counter) is very good, 10ns
      on a 100 MHz Pentium. And that clock is quite stabile too, +-250
      picoseconds (max). You have to specify your CPU's speed in MHz
      before you can use option 'T', because if the MHz-rate is calculated
      every time you run Qsort, it will be (a bit) different every time.
      Also, when multitasking, calculated MHz would be quite inaccurate.
      The speed must be specified in environment variable 'QSORMHZ' or by
      specifying it after 'T' ('T:180', 'T=180' or with decimals 'T:166.666',
      'T=233.333'). Max number of decimals is three, the rest are ignored.
      If CR4.TSD=1 and CPL!=0, this option won't work (GPF). Non-braindead
      operating systems let that bit be 0, so 'T' works OK.
      Cyrix 6x86 doesn't have TSC! Oh, it's so sad.

   C: Calculates amount of Memory IO (external memory) of total execution
      time. It means the count of clock cycles during which the processor's
      external memory bus was in use divided by elapsed clock cycles * 100%.
      The larger the amount of Memory IO of the total execution time is, the
      slower your memory subsystem (2nd level cache and main memory) is.
      Use 'C' on Pentium and Pentium MMX CPUs.

      This options also measures how many million machine languge
      instructions were executed in a second (MIPS) during the test.
      Absolute maximum value for a normal Pentium is MHz-rate*2, but there
      are wait states during memory accesses and every instruction can't be
      executed in parallel, some instructions take more than one clock cycle
      and such nice little things.
      "Average number of instructions to sort one integer" is the same for
      every computer (so it is there just for fun). If the value is not stable
      with the same command line options, it's due to hardware interrupts
      (timer, keyb, printer, modem...).

      This options doesn't need option 'T', but it won't be of any harm.

      *** This option needs CPL 0, so it won't work under Windows etc. ***
      ***       (RDMSR and WRMSR are privileged instructions...)       ***

   X: Same as 'C', but for Pentium Pro and Pentium II.
      Use this if 'C' doesn't (seem to) work in CPL 0.
      I'm not quite sure if the Memory IO value is correct, since
      Pentium Pro has different counters than Pentium.
      Qsort uses event 62h to calculate the Memory IO, but it doesn't include
      2nd level cache accesses (62h="External Bus Cycles - DRDY Asserted").
      As you can see, MIPS value is quite low on a PPro compared to MHz-rate
      and Pentiums, since PPro can't execute compare+conditional jump in
      parallel ;(

      As you know, MSRs needed for options 'C' and 'X' exist only in
      Intel-powered CPUs (Pentium and up). (At this time, I mean!)

   D: Debug, dump unsorted numbers to 'UNSORTED.$$$' and sorted to 'SORTED.$$$'
      ...so you can see that Qsort actually sorts them.

   W: Wait about 1s before starting the sort.

   ?: No idea.



[6.] - Technical (?) stuff

   There's an option to specify the test size so that even with
   80886 2 GHz + 2 GB mem the result is accurate and meaningful.

   Don't press "Any Key" during the test, since one keypress might take
   as much as 1ms to process. Interrupt occurs also when releasing
   the key. Keyboard buffer is not flushed before the test starts.
   If you want to be sure that keyboard interrupts do not occur while
   sorting, use option 'W' and DO NOT dance on the keyboard!

   If you specify a size which is smaller than 2nd level cache size,
   speed of main memory won't affect the result at all.
   See included .XLS files for more info! Them are made with Excel 7.
   I think it uses Excel 5 format. If you've got old Excel or no Excel
   at all, SORRY.
   I didn't release the charts in Excel 4 format, because it was two times
   larger than in Excel 7 format and the colours got screwed and windows
   misplaced. You can replace the values with values made by makekbs{1,2}.btm
   (4DOS needed!). Just open resu*.kbs in Excel and copy/paste them to
   rndtest.xls and/or revtest.xls. Then change some settings, reboot the
   machine, rerun .btm files ... and play with Excel to see how the change
   affected sorting speed with different memory block sizes.

   You can also make a chart of your own with the .BTMs and Excel or
   some other program made for that purpose.

   Qsort does no heavy calculations, so you have to use (FOR EXAMPLE)
   Safbench to test CPU, FPU and memory speeds thoroughly.

   Following things affect the result:
     1)  CPU type                     (386,486,Pentium,AMD K6,Cyrix 6x86MX...)
     2)  CPU speed                    (16,33,66,100,133,180,200,233 MHz...)
     3)  Bus speed                    (33,50,60,66,75,83,100 MHz...)
     4)  Main memory type and speed   (FPMRAM,SDRAM,BEDORAM,20,55,70ns...)
     5)  1st level cache size, type and speed   (8,16,32 kB...(*))
     6)  2nd level cache size, type and speed   (64,128,256,512,internal,SRAM)
     7)  Chipset settings in BIOS               (Wait states, RAM timings...)
     8)  Motherboard design and used chipset
     9)  Enabled features of the processor      (Write gathering, wback,...)

   (*) Only size of the DATA cache affects the result if you have 1st
   level cache in the first place. Pentium has 8 kB for data, Pentium
   MMX 16 kB and AMD K6 has 32 kB.

   For example, many 'mips-programs' don't use 2nd level cache nor main
   memory in the stupid little 8086/8087-compatible ten-instuction loop.
   So all that matters to such programs is CPU type and speed.

   Did you know: if there are two Pentium 100's, the only thing what
   makes the speed differences in CPU-related tasks is memory speeds
   and cache sizes. 1st and 2nd level cache and main memory read AND write
   speeds are in relationship with the result given by a test program.
   Every test program uses memory blocks of different sizes, so Pentium A
   might be faster with a program which uses under 256 kB of mem and Pentium B
   might be faster with huge programs or programs which access lots of data.
   Also the type of memory access (random, sequential) makes differences.
   Try 'Safbench M' to see how fast memory you REALLY have. For example,
   CacheChk uses slow 'rep lodsd' instuction to read data, which reads one
   32bit word per cycle _AT MAX_. Pentiums are capable of reading two of 'em
   in one cycle.

   If you are overclocking and you occasionally see '!', it means that your
   computer doesn't work correctly and you should stop overclocking RIGHT
   NOW since your data is in danger. You might want to add some wait states
   for memory (in BIOS setup) and try again if it helped. If you keep getting
   '!' characters, you should put the jumpers into their original places
   (unoverclock). If that doesn't help, something is REALLY wrong.



[7.] - Random data asm src

   Here's the source code to produce the random integers used in the test.

         Cld
         Mov     EAX, 2156620195
         Mov     EBP, 1582790145
         Mov     ESI, NumberOf32bitInts
         Mov     EDI, [MemBuffer]       ; EDI points to the start of the buffer

RandLoop:
         Ror     EAX, 1
         Xor     EAX, 05AC5C6B3h
         Mul     EBP
         Rol     EBP, 1
         Add     EBP, 0CA51AB67h
         Xor     EAX, EBP
         Ror     EDX, 3
         Add     EDX, EAX
         Add     EBP, EDX
         Rol     EBP, 7
         Mul     EAX
         Xor     EBP, EAX
         Xor     EDX, 01A95E145h
         Rol     EDX, 1
         Xor     EBP, EDX
         Xor     EAX, EBP
         StosD
         Dec     ESI
         Jnz     RandLoop
         Cli
         Jz      $                      ; Or something just as genius



[8.] - Errorlevels

     0 OK

     1 Compare failed, integers not in order
       If this happens, your memory is defective, you have too "fast"
       settings for memory in BIOS or you are overclocking and your
       computer doesn't work correcly because of it. If you have
       overclocked your computer and comparison fails, you should NOT
       overclock your computer at all or add some wait states etc.

   255 Not enough memory

   253 TESTFILE.Q32 not found, put it in current dir if you want to use it

   252 Specified conflicting command line options, RTFM

   251 Help screen displayed

   250 Could not get mode info for mode 1024x768x256

   249 Linear framebuffer mode 1024x768x256 not supported
       Try installing UniVBE or S3VBE20 or something

   248 Failed to map framebuffer (tell me if you see this)

   247 TESTFILE.Q32 too small (<4 bytes)

   246 Too big test size specified (>4GB)

   245 CPUID not supported
       Qsort doesn't try to enable that for Cyrix 6x86, since it
       doesn't have TSC.

   244 TSC not supported
       So you can't use option 'T'

   243 No MHz-rate specified
       It must be in env. variable QSORMHZ or use option 'T:xxx'.

   242 Specified MHz was not between 1-1000000000.

   241 Pentium compatible MSRs not available
       So you can't use options 'C', 'X' and 'T'

   240 CPL isn't 0 (needed for option 'C' and 'X'), try clear boot
       Under Win etc Current Privilege Level is 3



[9.] - Final words

   If you have some bugs or improvements to report feel free to tell me.
   My email-address is sfarin@ratol.fi and the "nice" homepage is located
   at http://www.ratol.fi/~sfarin/ ... The most recent Qsort32 is there.

   WANTED: working exception handlers and register dumpers for Tran's
   Pmode (v3.07)...

   REMEMBER: if everything else fails, TRY CLEAR BOOT!



[10.] - More words

   DOS32 is (C) Adam Seychell, Win95 is (C) Microsoft, TASM is (C) Borland,
   Qedit is (C) Semware, 4DOS is (C) JP Software, Image Alchemy is (C)
   Handmade Software, Inc., Pentium is (C) Intel etc...

   You use Qsort and Safbench at your own risk. If you lose 4 MB document
   because Win boots your computer or something it's your and only your
   problem. Qsort worked for me, I hope the same for you.
