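STREAM Memory Bandwidth Testing on Linux: Parameters and Examples Explained, with Source Code (Summary)
Contents
- 1. Introduction
- 2. Usage overview
- 2.1 What the tests measure
- 2.2 Compile options
- 2.3 Parameter examples
- 3. Source code download and usage
- 4. Related links
- Summary of FIO disk performance test parameters and examples
1. Introduction
This article explains each compile option through worked examples so that readers can get started quickly. STREAM is a benchmark suite written in two efficient high-level languages, Fortran and C. Because both languages are efficient at numerical computation, the STREAM kernels can drive the memory system close to its full capability.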
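2. Usage overview
STREAM reports the maximum sustainable memory bandwidth, not the theoretical peak that hardware vendors usually quote. Its main characteristics are:
1. It measures memory bandwidth with four array operations: array copy (Copy), array scaling (Scale), vector sum of two arrays (Add), and combined scale-and-sum (Triad).
2. Array elements are double precision (8 bytes).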
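2.1 What the tests measure
Test | Description |
---|---|
Copy | Copy: reads a value from one memory location and writes it to another. Two memory accesses. |
Scale | Multiply: reads a value from memory, multiplies it by a constant, and stores the result in another memory location. Two memory accesses. |
Add | Add: reads two values from two memory locations, adds them, and writes the result to a third memory location. Three memory accesses. |
Triad | Combines the previous three: reads a value from memory, multiplies it by a constant, reads a second value from another memory location, adds it to the product, and writes the result to memory. Three memory accesses. |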
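As a rough illustration of the table above, the four kernels can be sketched as plain serial C loops. The function names here (`stream_copy`, `stream_scale`, `stream_add`, `stream_triad`, `bytes_moved`) are hypothetical and chosen only for this sketch; stream.c itself runs the loops inline in `main()` on its static arrays, with OpenMP pragmas.

```c
#include <stddef.h>

/* Illustrative serial versions of the four STREAM kernels.
   The function names are assumptions made for this sketch only. */

/* Copy: one read, one write per element. */
void stream_copy(double *c, const double *a, size_t n) {
    for (size_t j = 0; j < n; j++) c[j] = a[j];
}

/* Scale: one read, one multiply, one write per element. */
void stream_scale(double *b, const double *c, double scalar, size_t n) {
    for (size_t j = 0; j < n; j++) b[j] = scalar * c[j];
}

/* Add: two reads, one add, one write per element. */
void stream_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t j = 0; j < n; j++) c[j] = a[j] + b[j];
}

/* Triad: two reads, one multiply-add, one write per element. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n) {
    for (size_t j = 0; j < n; j++) a[j] = b[j] + scalar * c[j];
}

/* Bytes counted for a kernel: arrays touched (reads plus writes)
   times the element size times the element count. */
size_t bytes_moved(int arrays_touched, size_t n) {
    return (size_t)arrays_touched * sizeof(double) * n;
}
```

With 10,000,000 elements, `bytes_moved(2, 10000000)` gives 160,000,000 bytes for Copy and Scale, and `bytes_moved(3, 10000000)` gives 240,000,000 bytes for Add and Triad.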
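A commonly cited ordering of the results is Add > Triad > Copy > Scale, although the actual ordering depends on the platform (in the sample run in section 2.3, Copy is the highest and Scale the lowest). One Add operation accesses memory three times (two reads and one write), and so does Triad; Copy and Scale access memory twice. The more memory accesses there are per operation, the better memory latency can be hidden and the higher the measured bandwidth tends to be.
In a single-core STREAM test, the result depends not only on the memory controller but also on the core's ROB (reorder buffer) and load/store units, so it is not a pure memory bandwidth measurement.
A multi-core STREAM test issues a large number of memory requests from many cores at once, which saturates the memory system more fully and therefore gets closer to the memory bandwidth limit.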
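2.2 Compile options
STREAM is built with different compile options to produce different test configurations. The options are:
Option | Description |
---|---|
-O3 | Compiler optimization level |
-mcmodel=medium | Needed on x86-64 GCC when the static arrays exceed 2 GB, which the default small code model cannot address. AArch64 GCC does not accept medium; use -mcmodel=large there. |
-fopenmp | Enables OpenMP for multiprocessor systems; when enabled, the thread count defaults to the number of CPU threads |
-DSTREAM_ARRAY_SIZE=200000000 | Sets the size of the a[], b[] and c[] arrays used in the computation; some STREAM versions use the form -DN=2000000 instead |
-DNTIMES=30 | Number of times each kernel is run; the best result is reported |
-DOFFSET=4096 | Array offset; can usually be left undefined |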
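Notes:
1. Set the number of threads at run time:
export OMP_NUM_THREADS=8   # 8 is the user-chosen number of threads (processors) to use
2. Choosing STREAM_ARRAY_SIZE
This parameter has the largest effect on the results and deserves the most attention. It sets the size of the a[], b[] and c[] arrays, whose elements are double precision (8 bytes). Keep the following in mind when choosing it:
(1) Allow for the memory footprint. A rough rule is STREAM_ARRAY_SIZE × 8 (bytes per double) × 3 (arrays) <= 0.6 × M, where M is the memory available to the user.
(2) Each array must be much larger than the CPU's last-level cache (usually the L3 cache), at least 4 times its size. Otherwise the test measures cache throughput rather than memory throughput.
(3) The test must run long enough for memory bandwidth to reach its sustained maximum; otherwise the real peak is not observed. If any of the four tests finishes in fewer than 20 clock ticks (20 microseconds with a 1-microsecond timer), increase STREAM_ARRAY_SIZE.
(4) Version 5.9 uses -DN with a default of 2000000. Version 5.10 renames the parameter to -DSTREAM_ARRAY_SIZE with a default of 10000000; it ignores -DN, prints a warning, and falls back to the default size.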
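Rules (1) and (2) can be turned into a quick calculation. The sketch below is only an illustration: the helper names `min_elements_for_cache` and `max_elements_for_memory` are hypothetical and not part of STREAM, and the 0.6 factor is the rough rule of thumb from rule (1).

```c
/* Hypothetical helpers for picking STREAM_ARRAY_SIZE; not part of stream.c. */

/* Rule (2): smallest element count so that ONE array is at least
   4 times the total last-level cache size (rounded up). */
unsigned long long min_elements_for_cache(unsigned long long cache_bytes,
                                          unsigned long long elem_size) {
    unsigned long long need = 4ULL * cache_bytes;
    return (need + elem_size - 1) / elem_size;
}

/* Rule (1): largest element count so that the THREE arrays together
   stay within 60% of the available memory (assumed rule of thumb). */
unsigned long long max_elements_for_memory(unsigned long long avail_bytes,
                                           unsigned long long elem_size) {
    return (avail_bytes / 10 * 6) / (3ULL * elem_size);
}
```

For an 8 MB L3 cache (taking 1 MB as 10^6 bytes, as the STREAM source comments do), `min_elements_for_cache(8000000, 8)` gives 4,000,000 elements, and for two sockets with 20 MB each, `min_elements_for_cache(40000000, 8)` gives 20,000,000; both agree with the examples in the stream.c instructions. A value between the two bounds returned by these helpers satisfies both rules.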
3. Errors when setting -mcmodel=medium
Some GCC targets, such as the AArch64 toolchain, do not accept '-mcmodel=medium'. On those targets use "-mcmodel=large", "-mcmodel=small" or "-mcmodel=tiny" instead.
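Whether a larger code model is worth trying can be estimated from the size of the three static arrays, which stream.c declares with STREAM_ARRAY_SIZE+OFFSET elements each. The sketch below assumes the 2 GB limit quoted in the option table; the exact limit depends on the target and toolchain, and both function names are hypothetical.

```c
#include <stdbool.h>

/* Total bytes of the three static arrays a[], b[], c[], each declared
   in stream.c with (STREAM_ARRAY_SIZE + OFFSET) elements. */
unsigned long long static_array_bytes(unsigned long long elements,
                                      unsigned long long offset,
                                      unsigned long long elem_size) {
    return 3ULL * (elements + offset) * elem_size;
}

/* True when the arrays exceed 2 GiB. ASSUMPTION: 2 GiB is used as the
   small-code-model limit; the real threshold varies by target. */
bool exceeds_small_model(unsigned long long elements,
                         unsigned long long offset,
                         unsigned long long elem_size) {
    return static_array_bytes(elements, offset, elem_size) > (2ULL << 30);
}
```

With the default of 10,000,000 doubles the arrays take 240,000,000 bytes and the check is false; with 200,000,000 doubles they take 4,800,000,000 bytes and the check is true.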
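2.3 Parameter examples
Compile and run natively:
[root@localhost /]# gcc -O3 -mtune=native -march=native -fopenmp -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=100 stream.c -o stream
[root@localhost /]# ./stream
With 200,000,000 elements the three arrays total about 4.8 GB, so on x86-64 GCC this build may additionally need -mcmodel=medium.
Cross-compile, then run on a different target machine:
[root@localhost /]# aarch64-linux-gnu-gcc -O3 --static -fno-PIC -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=30 stream.c -o stream
[root@localhost /]# ./stream
Sample output follows. Note that this run passed -DN=200000000, which version 5.10 ignores, so the array size fell back to the default of 10000000 elements:
[root@localhost /]# gcc -O3 -mtune=native -march=native -fopenmp -DN=200000000 -DNTIMES=100 stream.c -o stream
[root@localhost /]# ./stream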
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
***** WARNING: ******
It appears that you set the preprocessor variable N when compiling this code.
This version of the code uses the preprocesor variable STREAM_ARRAY_SIZE to control the array size
Reverting to default value of STREAM_ARRAY_SIZE=10000000
***** WARNING: ******
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 6
Number of Threads counted = 6
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 11274 microseconds.
(= 11274 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 13704.6 0.011720 0.011675 0.011816
Scale: 10937.1 0.014686 0.014629 0.015018
Add: 12362.1 0.019471 0.019414 0.019872
Triad: 12369.2 0.019462 0.019403 0.020019
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
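The "Best Rate MB/s" column can be checked by hand. stream.c computes it as 1.0E-06 * bytes / mintime, where bytes is the number of arrays touched times the element size times the array size. The helper below is a small sketch of that formula; the name `best_rate_mb_s` is hypothetical.

```c
/* Best rate in MB/s (1 MB = 10^6 bytes), following the expression
   1.0E-06 * bytes[j] / mintime[j] used in stream.c.
   The function name is an assumption made for this sketch. */
double best_rate_mb_s(int arrays_touched, unsigned long long elements,
                      unsigned long long elem_size, double min_time_s) {
    double bytes = (double)arrays_touched * (double)elem_size
                 * (double)elements;
    return 1.0e-6 * bytes / min_time_s;
}
```

For the Copy line above (2 arrays, 10,000,000 elements of 8 bytes, Min time 0.011675 s), `best_rate_mb_s(2, 10000000, 8, 0.011675)` comes to roughly 13704.5 MB/s, which agrees with the reported 13704.6 to within the rounding of the printed Min time.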
3. Source code download and usage
Copy the code below into a new text file named stream.c, then compile and run it with the options described above.
/*-----------------------------------------------------------------------*/
/* Program: STREAM */
/* Revision: $Id: stream.c,v 5.10 2013/01/17 16:01:06 mccalpin Exp mccalpin $ */
/* Original code developed by John D. McCalpin */
/* Programmers: John D. McCalpin */
/* Joe R. Zagar */
/* */
/* This program measures memory transfer rates in MB/s for simple */
/* computational kernels coded in C. */
/*-----------------------------------------------------------------------*/
/* Copyright 1991-2013: John D. McCalpin */
/*-----------------------------------------------------------------------*/
/* License: */
/* 1. You are free to use this program and/or to redistribute */
/* this program. */
/* 2. You are free to modify this program for your own use, */
/* including commercial use, subject to the publication */
/* restrictions in item 3. */
/* 3. You are free to publish results obtained from running this */
/* program, or from works that you derive from this program, */
/* with the following limitations: */
/* 3a. In order to be referred to as "STREAM benchmark results", */
/* published results must be in conformance to the STREAM */
/* Run Rules, (briefly reviewed below) published at */
/* http://www.cs.virginia.edu/stream/ref.html */
/* and incorporated herein by reference. */
/* As the copyright holder, John McCalpin retains the */
/* right to determine conformity with the Run Rules. */
/* 3b. Results based on modified source code or on runs not in */
/* accordance with the STREAM Run Rules must be clearly */
/* labelled whenever they are published. Examples of */
/* proper labelling include: */
/* "tuned STREAM benchmark results" */
/* "based on a variant of the STREAM benchmark code" */
/* Other comparable, clear, and reasonable labelling is */
/* acceptable. */
/* 3c. Submission of results to the STREAM benchmark web site */
/* is encouraged, but not required. */
/* 4. Use of this program or creation of derived works based on this */
/* program constitutes acceptance of these licensing restrictions. */
/* 5. Absolutely no warranty is expressed or implied. */
/*-----------------------------------------------------------------------*/
# include <stdio.h>
# include <unistd.h>
# include <math.h>
# include <float.h>
# include <limits.h>
# include <sys/time.h>

/*-----------------------------------------------------------------------
 * INSTRUCTIONS:
 *
 *  1) STREAM requires different amounts of memory to run on different
 *           systems, depending on both the system cache size(s) and the
 *           granularity of the system timer.
 *     You should adjust the value of 'STREAM_ARRAY_SIZE' (below)
 *           to meet *both* of the following criteria:
 *       (a) Each array must be at least 4 times the size of the
 *           available cache memory. I don't worry about the difference
 *           between 10^6 and 2^20, so in practice the minimum array size
 *           is about 3.8 times the cache size.
 *           Example 1: One Xeon E3 with 8 MB L3 cache
 *               STREAM_ARRAY_SIZE should be >= 4 million, giving
 *               an array size of 30.5 MB and a total memory requirement
 *               of 91.5 MB.
 *           Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
 *               STREAM_ARRAY_SIZE should be >= 20 million, giving
 *               an array size of 153 MB and a total memory requirement
 *               of 458 MB.
 *       (b) The size should be large enough so that the 'timing calibration'
 *           output by the program is at least 20 clock-ticks.
 *           Example: most versions of Windows have a 10 millisecond timer
 *               granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds.
 *               If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
 *               This means the each array must be at least 1 GB, or 128M elements.
 *
 *      Version 5.10 increases the default array size from 2 million
 *          elements to 10 million elements in response to the increasing
 *          size of L3 caches. The new default size is large enough for caches
 *          up to 20 MB.
 *      Version 5.10 changes the loop index variables from "register int"
 *          to "ssize_t", which allows array indices >2^32 (4 billion)
 *          on properly configured 64-bit systems. Additional compiler options
 *          (such as "-mcmodel=medium") may be required for large memory runs.
 *
 *      Array size can be set at compile time without modifying the source
 *          code for the (many) compilers that support preprocessor definitions
 *          on the compile line. E.g.,
 *                gcc -O -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M
 *          will override the default size of 10M with a new size of 100M elements
 *          per array.
 */
#ifndef STREAM_ARRAY_SIZE
# define STREAM_ARRAY_SIZE 10000000
#endif

/*  2) STREAM runs each kernel "NTIMES" times and reports the *best* result
 *         for any iteration after the first, therefore the minimum value
 *         for NTIMES is 2.
 *      There are no rules on maximum allowable values for NTIMES, but
 *         values larger than the default are unlikely to noticeably
 *         increase the reported performance.
 *      NTIMES can also be set on the compile line without changing the source
 *         code using, for example, "-DNTIMES=7".
 */
#ifdef NTIMES
#if NTIMES<=1
# define NTIMES 10
#endif
#endif
#ifndef NTIMES
# define NTIMES 10
#endif

/*  Users are allowed to modify the "OFFSET" variable, which *may* change the
 *         relative alignment of the arrays (though compilers may change the
 *         effective offset by making the arrays non-contiguous on some systems).
 *      Use of non-zero values for OFFSET can be especially helpful if the
 *         STREAM_ARRAY_SIZE is set to a value close to a large power of 2.
 *      OFFSET can also be set on the compile line without changing the source
 *         code using, for example, "-DOFFSET=56".
 */
#ifndef OFFSET
# define OFFSET 0
#endif

/*
 *  3) Compile the code with optimization. Many compilers generate
 *       unreasonably bad code before the optimizer tightens things up.
 *     If the results are unreasonably good, on the other hand, the
 *       optimizer might be too smart for me!
 *
 *     For a simple single-core version, try compiling with:
 *            cc -O stream.c -o stream
 *     This is known to work on many, many systems....
 *
 *     To use multiple cores, you need to tell the compiler to obey the OpenMP
 *       directives in the code. This varies by compiler, but a common example is
 *            gcc -O -fopenmp stream.c -o stream_omp
 *       The environment variable OMP_NUM_THREADS allows runtime control of the
 *         number of threads/cores used when the resulting "stream_omp" program
 *         is executed.
 *
 *     To run with single-precision variables and arithmetic, simply add
 *         -DSTREAM_TYPE=float
 *     to the compile line.
 *     Note that this changes the minimum array sizes required --- see (1) above.
 *
 *     The preprocessor directive "TUNED" does not do much -- it simply causes the
 *       code to call separate functions to execute each kernel. Trivial versions
 *       of these functions are provided, but they are *not* tuned -- they just
 *       provide predefined interfaces to be replaced with tuned code.
 *
 *
 *  4) Optional: Mail the results to mccalpin@cs.virginia.edu
 *     Be sure to include info that will help me understand:
 *       a) the computer hardware configuration (e.g., processor model, memory type)
 *       b) the compiler name/version and compilation flags
 *       c) any run-time information (such as OMP_NUM_THREADS)
 *       d) all of the output from the test case.
 *
 * Thanks!
 *
 *-----------------------------------------------------------------------*/

# define HLINE "-------------------------------------------------------------\n"

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

#ifndef STREAM_TYPE
#define STREAM_TYPE double
#endif

static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET],
                   b[STREAM_ARRAY_SIZE+OFFSET],
                   c[STREAM_ARRAY_SIZE+OFFSET];

static double avgtime[4] = {0}, maxtime[4] = {0},
              mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};

static char *label[4] = {"Copy: ", "Scale: ",
    "Add: ", "Triad: "};

static double bytes[4] = {
    2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
    2 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE,
    3 * sizeof(STREAM_TYPE) * STREAM_ARRAY_SIZE
    };

extern double mysecond();
extern void checkSTREAMresults();
#ifdef TUNED
extern void tuned_STREAM_Copy();
extern void tuned_STREAM_Scale(STREAM_TYPE scalar);
extern void tuned_STREAM_Add();
extern void tuned_STREAM_Triad(STREAM_TYPE scalar);
#endif
#ifdef _OPENMP
extern int omp_get_num_threads();
#endif
int
main()
{
    int quantum, checktick();
    int BytesPerWord;
    int k;
    ssize_t j;
    STREAM_TYPE scalar;
    double t, times[4][NTIMES];

    /* --- SETUP --- determine precision and check timing --- */

    printf(HLINE);
    printf("STREAM version $Revision: 5.10 $\n");
    printf(HLINE);
    BytesPerWord = sizeof(STREAM_TYPE);
    printf("This system uses %d bytes per array element.\n",
        BytesPerWord);

    printf(HLINE);
#ifdef N
    printf("***** WARNING: ******\n");
    printf(" It appears that you set the preprocessor variable N when compiling this code.\n");
    printf(" This version of the code uses the preprocesor variable STREAM_ARRAY_SIZE to control the array size\n");
    printf(" Reverting to default value of STREAM_ARRAY_SIZE=%llu\n",(unsigned long long) STREAM_ARRAY_SIZE);
    printf("***** WARNING: ******\n");
#endif

    printf("Array size = %llu (elements), Offset = %d (elements)\n" , (unsigned long long) STREAM_ARRAY_SIZE, OFFSET);
    printf("Memory per array = %.1f MiB (= %.1f GiB).\n",
        BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0),
        BytesPerWord * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.0/1024.0));
    printf("Total memory required = %.1f MiB (= %.1f GiB).\n",
        (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024.),
        (3.0 * BytesPerWord) * ( (double) STREAM_ARRAY_SIZE / 1024.0/1024./1024.));
    printf("Each kernel will be executed %d times.\n", NTIMES);
    printf(" The *best* time for each kernel (excluding the first iteration)\n");
    printf(" will be used to compute the reported bandwidth.\n");

#ifdef _OPENMP
    printf(HLINE);
#pragma omp parallel
    {
#pragma omp master
        {
            k = omp_get_num_threads();
            printf ("Number of Threads requested = %i\n",k);
        }
    }
#endif

#ifdef _OPENMP
    k = 0;
#pragma omp parallel
#pragma omp atomic
    k++;
    printf ("Number of Threads counted = %i\n",k);
#endif

    /* Get initial value for system clock. */
#pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
    }

    printf(HLINE);

    if ( (quantum = checktick()) >= 1)
        printf("Your clock granularity/precision appears to be "
            "%d microseconds.\n", quantum);
    else {
        printf("Your clock granularity appears to be "
            "less than one microsecond.\n");
        quantum = 1;
    }

    t = mysecond();
#pragma omp parallel for
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = 2.0E0 * a[j];
    t = 1.0E6 * (mysecond() - t);

    printf("Each test below will take on the order"
        " of %d microseconds.\n", (int) t );
    printf(" (= %d clock ticks)\n", (int) (t/quantum) );
    printf("Increase the size of the arrays if this shows that\n");
    printf("you are not getting at least 20 clock ticks per test.\n");

    printf(HLINE);

    printf("WARNING -- The above is only a rough guideline.\n");
    printf("For best results, please be sure you know the\n");
    printf("precision of your system timer.\n");
    printf(HLINE);

    /* --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
    {
        times[0][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Copy();
#else
#pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            c[j] = a[j];
#endif
        times[0][k] = mysecond() - times[0][k];

        times[1][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Scale(scalar);
#else
#pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            b[j] = scalar*c[j];
#endif
        times[1][k] = mysecond() - times[1][k];

        times[2][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Add();
#else
#pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            c[j] = a[j]+b[j];
#endif
        times[2][k] = mysecond() - times[2][k];

        times[3][k] = mysecond();
#ifdef TUNED
        tuned_STREAM_Triad(scalar);
#else
#pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            a[j] = b[j]+scalar*c[j];
#endif
        times[3][k] = mysecond() - times[3][k];
    }

    /* --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
    {
        for (j=0; j<4; j++)
        {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
        }
    }

    printf("Function Best Rate MB/s Avg time Min time Max time\n");
    for (j=0; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%12.1f %11.6f %11.6f %11.6f\n", label[j],
            1.0E-06 * bytes[j]/mintime[j],
            avgtime[j],
            mintime[j],
            maxtime[j]);
    }
    printf(HLINE);

    /* --- Check Results --- */
    checkSTREAMresults();
    printf(HLINE);

    return 0;
}

# define M 20

int
checktick()
{
    int i, minDelta, Delta;
    double t1, t2, timesfound[M];

    /* Collect a sequence of M unique time values from the system. */

    for (i = 0; i < M; i++) {
        t1 = mysecond();
        while( ((t2=mysecond()) - t1) < 1.0E-6 )
            ;
        timesfound[i] = t1 = t2;
    }

    /*
     * Determine the minimum difference between these M values.
     * This result will be our estimate (in microseconds) for the
     * clock granularity.
     */

    minDelta = 1000000;
    for (i = 1; i < M; i++) {
        Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
        minDelta = MIN(minDelta, MAX(Delta,0));
    }

    return(minDelta);
}

/* A gettimeofday routine to give access to the wallclock timer on most UNIX-like systems. */

#include <sys/time.h>

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

#ifndef abs
#define abs(a) ((a) >= 0 ? (a) : -(a))
#endif
void checkSTREAMresults ()
{
    STREAM_TYPE aj,bj,cj,scalar;
    STREAM_TYPE aSumErr,bSumErr,cSumErr;
    STREAM_TYPE aAvgErr,bAvgErr,cAvgErr;
    double epsilon;
    ssize_t j;
    int k,ierr,err;

    /* reproduce initialization */
    aj = 1.0;
    bj = 2.0;
    cj = 0.0;
    /* a[] is modified during timing check */
    aj = 2.0E0 * aj;
    /* now execute timing loop */
    scalar = 3.0;
    for (k=0; k<NTIMES; k++)
    {
        cj = aj;
        bj = scalar*cj;
        cj = aj+bj;
        aj = bj+scalar*cj;
    }

    /* accumulate deltas between observed and expected results */
    aSumErr = 0.0;
    bSumErr = 0.0;
    cSumErr = 0.0;
    for (j=0; j<STREAM_ARRAY_SIZE; j++) {
        aSumErr += abs(a[j] - aj);
        bSumErr += abs(b[j] - bj);
        cSumErr += abs(c[j] - cj);
        // if (j == 417) printf("Index 417: c[j]: %f, cj: %f\n",c[j],cj); // MCCALPIN
    }
    aAvgErr = aSumErr / (STREAM_TYPE) STREAM_ARRAY_SIZE;
    bAvgErr = bSumErr / (STREAM_TYPE) STREAM_ARRAY_SIZE;
    cAvgErr = cSumErr / (STREAM_TYPE) STREAM_ARRAY_SIZE;

    if (sizeof(STREAM_TYPE) == 4) {
        epsilon = 1.e-6;
    }
    else if (sizeof(STREAM_TYPE) == 8) {
        epsilon = 1.e-13;
    }
    else {
        printf("WEIRD: sizeof(STREAM_TYPE) = %lu\n",sizeof(STREAM_TYPE));
        epsilon = 1.e-6;
    }

    err = 0;
    if (abs(aAvgErr/aj) > epsilon) {
        err++;
        printf ("Failed Validation on array a[], AvgRelAbsErr > epsilon (%e)\n",epsilon);
        printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",aj,aAvgErr,abs(aAvgErr)/aj);
        ierr = 0;
        for (j=0; j<STREAM_ARRAY_SIZE; j++) {
            if (abs(a[j]/aj-1.0) > epsilon) {
                ierr++;
#ifdef VERBOSE
                if (ierr < 10) {
                    printf(" array a: index: %ld, expected: %e, observed: %e, relative error: %e\n",
                        j,aj,a[j],abs((aj-a[j])/aAvgErr));
                }
#endif
            }
        }
        printf(" For array a[], %d errors were found.\n",ierr);
    }
    if (abs(bAvgErr/bj) > epsilon) {
        err++;
        printf ("Failed Validation on array b[], AvgRelAbsErr > epsilon (%e)\n",epsilon);
        printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",bj,bAvgErr,abs(bAvgErr)/bj);
        printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon);
        ierr = 0;
        for (j=0; j<STREAM_ARRAY_SIZE; j++) {
            if (abs(b[j]/bj-1.0) > epsilon) {
                ierr++;
#ifdef VERBOSE
                if (ierr < 10) {
                    printf(" array b: index: %ld, expected: %e, observed: %e, relative error: %e\n",
                        j,bj,b[j],abs((bj-b[j])/bAvgErr));
                }
#endif
            }
        }
        printf(" For array b[], %d errors were found.\n",ierr);
    }
    if (abs(cAvgErr/cj) > epsilon) {
        err++;
        printf ("Failed Validation on array c[], AvgRelAbsErr > epsilon (%e)\n",epsilon);
        printf (" Expected Value: %e, AvgAbsErr: %e, AvgRelAbsErr: %e\n",cj,cAvgErr,abs(cAvgErr)/cj);
        printf (" AvgRelAbsErr > Epsilon (%e)\n",epsilon);
        ierr = 0;
        for (j=0; j<STREAM_ARRAY_SIZE; j++) {
            if (abs(c[j]/cj-1.0) > epsilon) {
                ierr++;
#ifdef VERBOSE
                if (ierr < 10) {
                    printf(" array c: index: %ld, expected: %e, observed: %e, relative error: %e\n",
                        j,cj,c[j],abs((cj-c[j])/cAvgErr));
                }
#endif
            }
        }
        printf(" For array c[], %d errors were found.\n",ierr);
    }
    if (err == 0) {
        printf ("Solution Validates: avg error less than %e on all three arrays\n",epsilon);
    }
#ifdef VERBOSE
    printf ("Results Validation Verbose Results: \n");
    printf (" Expected a(1), b(1), c(1): %f %f %f \n",aj,bj,cj);
    printf (" Observed a(1), b(1), c(1): %f %f %f \n",a[1],b[1],c[1]);
    printf (" Rel Errors on a, b, c: %e %e %e \n",abs(aAvgErr/aj),abs(bAvgErr/bj),abs(cAvgErr/cj));
#endif
}

#ifdef TUNED
/* stubs for "tuned" versions of the kernels */
void tuned_STREAM_Copy()
{
    ssize_t j;
#pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j];
}

void tuned_STREAM_Scale(STREAM_TYPE scalar)
{
    ssize_t j;
#pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        b[j] = scalar*c[j];
}

void tuned_STREAM_Add()
{
    ssize_t j;
#pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j]+b[j];
}

void tuned_STREAM_Triad(STREAM_TYPE scalar)
{
    ssize_t j;
#pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        a[j] = b[j]+scalar*c[j];
}
/* end of stubs for the "tuned" versions of the kernels */
#endif
4. Related links
Summary of FIO disk performance test parameters and examples