Source: http://blog.csdn.net/typename/article/details/5598145


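It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1200 or so. Let's see -- at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.

In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is being offered by several ISPs, who expect it to become increasingly popular with large business customers.

And the thin client model of computing appears to be coming back in style -- this time with the server out on the Internet, serving thousands of clients.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of network clients. The discussion centers around Unix-like operating systems, as that's my personal area of interest, but Windows is also covered a bit.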

The C10K problem
Related Sites
Book to Read First
I/O frameworks
I/O Strategies
1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
  - The traditional select()
  - The traditional poll()
  - /dev/poll (Solaris 2.7+)
  - kqueue (FreeBSD, NetBSD)
2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
  - epoll (Linux 2.6+)
  - Polyakov's kevent (Linux 2.6+)
  - Drepper's New Network Interface (proposal for Linux 2.6+)
  - Realtime Signals (Linux 2.4+)
  - Signal-per-fd
  - kqueue (FreeBSD, NetBSD)
3. Serve many clients with each thread, and use asynchronous I/O and completion notification
4. Serve one client with each server thread
  - LinuxThreads (Linux 2.0+)
  - NGPT (Linux 2.4+)
  - NPTL (Linux 2.6, Red Hat 9)
  - FreeBSD threading support
  - NetBSD threading support
  - Solaris threading support
  - Java threading support in JDK 1.3.x and earlier
  - Note: 1:1 threading vs. M:N threading
5. Build the server code into the kernel
Comments
Limits on open filehandles
Limits on threads
Java issues [Updated 27 May 2001]
Other tips
- Zero-Copy
- The sendfile() system call can implement zero-copy networking.
- Avoid small frames by using writev (or TCP_CORK)
- Some programs can benefit from using non-Posix threads.
- Caching your own data can sometimes be a win.
Other limits
Kernel Issues
Measuring Server Performance
Examples
- Interesting select()-based servers
- Interesting /dev/poll-based servers
- Interesting kqueue()-based servers
- Interesting realtime signal-based servers
- Interesting thread-based servers
- Interesting in-kernel servers
Other interesting links

Book to Read First

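If you haven't read it already, go out and get a copy of Unix Network Programming, Volume 1 by W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the "thundering herd" problem. And while you're at it, go read Jeff Darcy's notes on high-performance server design.
(Another book which might be more helpful for those who are *using* rather than *writing* a web server is Building Scalable Web Sites by Cal Henderson.)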

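I/O frameworks

Prepackaged libraries are available that abstract some of the techniques presented below, insulating your code from the operating system and making it more portable.

ACE, a heavyweight C++ I/O framework, contains object-oriented implementations of some of these I/O strategies and many other useful things. In particular, his Reactor is an OO way of doing nonblocking I/O, and Proactor is an OO way of doing asynchronous I/O.
ASIO is a C++ I/O framework which is becoming part of the Boost library. It's like ACE updated for the STL era.
libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon will support poll and epoll (translator's note: already supported at the time of this translation). It's level-triggered only, I think, which has both good and bad sides. Niels has a nice graph of time to handle one event as a function of the number of connections. It shows kqueue and sys_epoll as clear winners.
My own attempts at lightweight frameworks (sadly, not kept up to date):
- Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API using whatever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It's useful for benchmarks that compare the performance of the various APIs. This document links to Poller subclasses below to illustrate how each of the readiness APIs can be used.
- rn is a lightweight C I/O framework that was my second try after Poller. It is easy to use in commercial applications and in non-C++ applications, and it has been used in several commercial products.
Matt Welsh wrote a paper in April 2000 about how to balance the use of worker threads and event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O framework.
Cory Nelson's Scale! library - an async socket, file, and pipe I/O library for Windows.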

I/O Strategies

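Designers of networking software have many options. Here are a few:

Whether and how to issue multiple I/O calls from a single thread
- Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency.
- Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and readiness notification (e.g. poll() or /dev/poll) to know when it's OK to start the next I/O on that channel. Generally only usable with network I/O, not disk I/O.
- Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g. signals or completion ports) to know when the I/O finishes. Good for both network and disk I/O.
How to control the code servicing each client
- one process for each client (classic Unix approach, used since 1980 or so)
- one OS-level thread handles many clients; each client is controlled by:
  - a user-level thread (e.g. GNU state threads, classic Java with green threads)
  - a state machine (a bit esoteric, but popular in some circles; my favorite)
  - a continuation (a bit esoteric, but popular in some circles)
- one OS-level thread for each client (e.g. classic Java with native threads)
- one OS-level thread for each active client (e.g. Tomcat with apache front end; NT completion ports; thread pools)
Whether to use standard O/S services, or put some code into the kernel (e.g. in a custom driver, kernel module, or VxD)

The following five combinations seem to be popular:

Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
Serve one client with each server thread, and use blocking I/O
Build the server code into the kernel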

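1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification

... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, whether or not you've done anything with that file descriptor since the last time the kernel told you about it. (The name "level triggered" comes from computer hardware design; it's the opposite of "edge triggered". Jonathan Lemon introduced the terms in his paper on kqueue().)

Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That's why it's important to use nonblocking mode when using readiness notification.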
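The note above can be sketched in C roughly as follows. This is only an illustration: the helper names `set_nonblocking` and `try_read` and the `-2` return code are assumptions made for this sketch, not part of any library. It shows a descriptor being switched to nonblocking mode, and a read that treats `EAGAIN`/`EWOULDBLOCK` as "the hint was stale" rather than as a failure.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Try to read after a readiness notification.
 * Returns the number of bytes read (> 0), 0 on end of file,
 * -1 on a real error, and -2 (a code invented for this sketch)
 * when the descriptor turned out not to be ready after all.
 */
long try_read(int fd, char *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return (long)n;
}
```

A caller that gets `-2` simply goes back to its select()/poll() loop and waits for the next notification; with a blocking descriptor the same situation would stall every other client served by that thread.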

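An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. The same goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or processes that do the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's Flash web server uses this trick; they gave a talk at Usenix '99 on it. It looks like mincore() is available in FreeBSD and Solaris, but it is not part of the Single Unix Specification. It's available in Linux as of kernel 2.3.51, thanks to Chuck Lever.

But in November 2003 on the freebsd-hackers list, Vivek Pai et al reported very good results using system-wide profiling of their Flash web server to attack bottlenecks. One bottleneck they found was mincore (guess that wasn't such a good idea after all). Another was the fact that sendfile blocks on disk access; they improved performance by introducing a modified sendfile() that returns something like EWOULDBLOCK when the disk page it's fetching is not yet in core. The end result of their optimizations is a SpecWeb99 score of about 800 on a 1GHZ/1GB FreeBSD box, which is better than anything on file at spec.org.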

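There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:

The traditional select()
Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled into the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)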

See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.
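The traditional poll()
There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow at about a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.

Some OS's (e.g. Solaris 8) speed up poll() by use of techniques like poll hinting, which was implemented and benchmarked by Niels Provos in 1999.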

See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.
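/dev/poll
This is the recommended poll replacement for Solaris.

The idea behind /dev/poll is to take advantage of the fact that poll() is often called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle.

It appeared quietly in Solaris 7 (see patchid 106541), but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().

Various implementations of /dev/poll were tried on Linux, but none of them perform as well as epoll. /dev/poll use on Linux is not recommended.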

See Poller_devpoll (cc, h benchmarks ) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution - the example is for Linux /dev/poll, might not work right on Solaris.)
kqueue()
This is the recommended poll replacement for FreeBSD (and, soon, NetBSD).

kqueue() can specify either edge triggering or level triggering; see below.

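2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification

Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later, when that descriptor transitions from not ready to ready, the kernel notifies you somehow. It then assumes you know the file descriptor is ready, and will not send any more readiness notifications of that type for that file descriptor until you do something that causes the file descriptor to no longer be ready (e.g. until you receive the EWOULDBLOCK error on a send, recv, or accept call, or a send or recv transfers less than the requested number of bytes).

When you use readiness change notification, you must be prepared for spurious events, since one common implementation is to signal readiness whenever any packets are received, regardless of whether the file descriptor was already ready.

This is the opposite of "level-triggered" readiness notification. It's a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking clients with OpenSSL easier, so it's worth trying.

[Banga, Mogul, Druschel '99] described this kind of scheme in detail.

There are several APIs which let the application retrieve "file descriptor became ready" notifications:

kqueue() This is the recommended edge-triggered poll replacement for FreeBSD (and, soon, NetBSD).

FreeBSD 4.3 and later, and NetBSD as of Oct 2002, support kqueue()/kevent(); it supports both edge triggering and level triggering. (See also Jonathan Lemon's page and his BSDCon 2000 paper on kqueue().)

Like /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to allocate one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not just for socket readiness, but also for plain file readiness, signals, and even for I/O completion.

Note: as of October 2000, the threading library on FreeBSD does not interact well with kqueue(); when kqueue() blocks, the entire process blocks, not just the calling thread.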

See Poller_kqueue (cc, h, benchmarks) for an example of how to use kqueue() interchangeably with many other readiness notification schemes.

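Examples and libraries using kqueue():
- PyKQueue -- a Python binding for kqueue()
- Ronald F. Guilmette's example echo server; see also his 28 Sept 2000 FreeBSD post.
epoll
This is the recommended edge-triggered poll replacement for the 2.6 Linux kernel.

On 11 July 2001, Davide Libenzi proposed an alternative to realtime signals, which he calls /dev/epoll. It is just like the realtime signal readiness notification, but it coalesces redundant events, and has a more efficient scheme for bulk event retrieval.

Epoll was merged into the 2.5 Linux kernel development tree after its interface was changed from a special file in /dev to a system call, sys_epoll. A patch for the older version of epoll is available for the 2.4 kernel.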
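The edge-triggered discipline described in this section can be sketched with epoll as it exists on current Linux systems. This is a hedged illustration rather than the interface of the original patch: it uses `epoll_ctl()` with the `EPOLLET` flag, and the helper names `watch_edge_triggered` and `drain` are made up for this example. The important part is that, after a notification, the descriptor is read until `EAGAIN`, because no further notification arrives while data is still pending.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Make fd nonblocking and register it for edge-triggered read events. */
int watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = {0};
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Consume everything currently available on fd.
 * Returns the number of bytes consumed, or -1 on a real error.
 * Stopping before EAGAIN would leave data behind with no new event.
 */
long drain(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                /* peer closed the connection */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                /* fully drained */
        return -1;
    }
}
```

A typical loop calls `epoll_wait()` on a descriptor obtained from `epoll_create1(0)` and then calls `drain()` (or a real request parser) on each descriptor reported.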

There was a lengthy debate about unifying epoll, aio, and other event sources on the linux-kernel mailing list around Halloween 2002. It may yet happen, but Davide is concentrating on firming up epoll in general first.
Polyakov's kevent (Linux 2.6+)
News flash: on 9 Feb 2006, and again on 9 July 2006, Evgeniy Polyakov posted patches which seem to unify epoll and aio; his goal is to support network AIO. See:
- the LWN article about kevent
- his July announcement
- his kevent page
- his naio page
- some recent discussion
Drepper's New Network Interface (proposal for Linux 2.6+)
At OLS 2006, Ulrich Drepper proposed a new high-speed asynchronous networking API. See:
- his paper, "The Need for Asynchronous, Zero-Copy Network I/O"
- his slides
- LWN article from July 22
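Realtime Signals
This is the recommended edge-triggered poll replacement for the 2.4 Linux kernel.

The 2.4 linux kernel can deliver socket readiness events via a particular realtime signal. Here's how to turn this behavior on: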
```
/* Fragment: needs _GNU_SOURCE (for F_SETSIG), <fcntl.h>, <signal.h>, <unistd.h>.
   sigset is a sigset_t, signum the chosen realtime signal, flags an int. */

/* Mask off SIGIO and the signal you want to use. */
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigaddset(&sigset, SIGIO);
sigprocmask(SIG_BLOCK, &sigset, NULL);

/* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
fcntl(fd, F_SETOWN, (int) getpid());
fcntl(fd, F_SETSIG, signum);
flags = fcntl(fd, F_GETFL);
flags |= O_NONBLOCK|O_ASYNC;
fcntl(fd, F_SETFL, flags);
```
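This sends that signal when a normal I/O function like read() or write() completes. To use this, write a normal poll() outer loop, and inside it, after you've handled all the fd's noticed by poll(), you loop calling sigwaitinfo(). If sigwaitinfo() or sigtimedwait() returns your realtime signal, siginfo.si_fd and siginfo.si_band give almost the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you handle the I/O, and continue calling sigwaitinfo().
If sigwaitinfo() returns a traditional SIGIO, the signal queue overflowed, so you flush the signal queue by temporarily changing the signal handler to SIG_DFL, and break back to the outer poll() loop.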

See Poller_sigio (cc, h) for an example of how to use rtsignals interchangeably with many other readiness notification schemes.

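See Zach Brown's phhttpd for example code that uses this feature directly. (Or don't; phhttpd is a bit hard to figure out...)

[Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with one call. Interestingly, the chief benefit of sigtimedwait4() for them seemed to be that it allowed the app to gauge system overload (so it could behave appropriately). (Note that poll() provides the same measure of system overload.)
Signal-per-fd
Chandra and Mosberger proposed a modification to the realtime signal approach called "signal-per-fd", which reduces or eliminates realtime signal queue overflow by coalescing redundant events. It doesn't outperform epoll, though. Their paper (www.hpl.hp.com/techreports/2000/HPL-2000-174.html) compares performance of this scheme with select() and /dev/poll.

Vitaly Luban announced a patch implementing this scheme on 18 May 2001; his patch lives at www.luban.org/GPL/gpl.html. (Note: as of Sept 2001, there may still be stability problems with this patch under heavy load. dkftpbench at about 4500 users may be able to trigger a problem.)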

See Poller_sigfd (cc, h) for an example of how to use signal-per-fd interchangeably with many other readiness notification schemes.

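NPTL uses the three kernel features introduced by NGPT: getpid() returns PID, CLONE_THREAD and futexes; NPTL also uses (and relies on) a much wider set of new kernel features, developed as part of this project.

A couple of items NGPT introduced into the kernel got modified, cleaned up and extended, such as thread group handling (CLONE_THREAD). [the CLONE_THREAD changes which impacted NGPT's compatibility got synced with the NGPT folks, to make sure NGPT does not break in any unacceptable way.]

The kernel features developed for and used by NPTL are described in the design whitepaper, http://people.redhat.com/drepper/nptl-design.pdf ...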

A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (release TID futex upon VM-release), the sys_exit_group() system-call, sys_execve() enhancements and support for detached threads.

There was also work put into extending the PID space - eg. procfs crashed due to 64K PID assumptions, max_pid, and pid allocation scalability work. Plus a number of performance-only improvements were done as well.

In essence the new features are a no-compromises approach to 1:1 threading - the kernel now helps in everything where it can improve threading, and we precisely do the minimally necessary set of context switches and kernel calls for every basic threading primitive.

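One big difference between NPTL and NGPT is that NPTL is a 1:1 threading model, whereas NGPT is an M:N threading model (see below). In spite of this, Ulrich's initial benchmarks seem to show that NPTL is indeed much faster than NGPT. (The NGPT team is looking forward to seeing Ulrich's benchmark code to verify the result.)

FreeBSD threading support

FreeBSD supports both LinuxThreads and a userspace threading library. Also, an M:N implementation called KSE was introduced in FreeBSD 5.0. For an overview, see www.unobvious.com/bsd/freebsd-threads.html.

On 25 Mar 2003, Jeff Roberson posted on freebsd-arch:

... Thanks to the foundation provided by Julian, David Xu, Mini, Dan Eischen, and everyone else who has participated with KSE and libpthread development, Mini and I have developed a 1:1 threading implementation. This code works in parallel with KSE and does not break it in any way. It actually helps bring M:N threading closer by testing out shared bits. ...

And in July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in FreeBSD 7.x:

I know this has been discussed in the past, but I figured with 7.x trundling forward, it was time to think about it again. In benchmarks for many common applications and scenarios, libthr demonstrates significantly better performance than libpthread. libthr is also implemented across a larger number of our platforms, and is already libpthread on several. The first recommendation we make to MySQL and other heavy thread users is "Switch to libthr", which is suggestive, also! ... So the strawman proposal is: make libthr the default threading library on 7.x.

NetBSD threading support

According to a note from Noriyuki Soda:

Kernel supported M:N thread library based on the Scheduler Activations model is merged into NetBSD-current on Jan 18 2003.

For details, see An Implementation of Scheduler Activations on the NetBSD Operating System by Nathan J. Williams, Wasabi Systems, Inc., presented at FREENIX '02.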

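Why Ingo Molnar prefers 1:1 over M:N
Sun is moving to 1:1 threads
NGPT is an M:N threading library for Linux.
Although Ulrich Drepper planned to use M:N threads in the new glibc threading library, he has since switched to the 1:1 threading model.
MacOSX appears to use 1:1 threading.
FreeBSD and NetBSD appear to still believe in M:N threading, although FreeBSD 7.0 might switch to 1:1 threading (see above), so perhaps M:N threading's believers have finally been proven wrong.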

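5. Build the server code into the kernel

Novell and Microsoft are both said to have done this at various times, at least one NFS implementation does this, khttpd does this for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a fast and flexible kernel-space HTTP server by Ingo Molnar for Linux. Ingo's September 1, 2000 announcement says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join a mailing list for more info.
The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be that instead of moving web servers into the kernel, the kernel should have the smallest possible hooks added to improve web server performance. That way, other kinds of servers can benefit. See e.g. Zach Brown's remarks about userland vs. kernel http servers. It appears that the 2.4 linux kernel provides sufficient power to user programs, as the X15 server runs about as fast as TUX, but doesn't use any kernel modifications.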

Comments

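Richard Gooch has written a paper discussing I/O options.

In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers. Their data is worth a look.

In 2003, Tim Brecht posted source code for userver, a small web server put together from several servers written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(), poll(), epoll(), or sigio.

Back in March 1999, Dean Gaudet posted:

I keep getting asked "why don't you guys use a select/event based model? It's clearly the fastest." ...

His reasons boiled down to "it's really hard, and the payoff isn't clear". Within a few months, though, it became clear that people were willing to work on it.

Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even though he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, Ingo Molnar's rebuttal of 30 April 1999, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.

See the pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, win32's "overlapped I/O" turned out to be too low level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events if other threads that had picked up completion events from this port are sleeping (perhaps doing blocking I/O).

See also OS/400's support for I/O completion ports.

There was an interesting discussion on linux-kernel in September 1999 titled "15,000 Simultaneous Connections" (continuing into the second week of the thread). Highlights:

Ed Hall posted a few notes on his experiences; he's achieved >1000 connects/second on a UP P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU), each managing a large number of clients using an event-based model.
Mike Jagdis posted an analysis of poll/select overhead, and said "The current select/poll implementation can be improved significantly, especially in the blocking case, but the overhead will still increase with the number of descriptors because select/poll does not, and cannot, remember what descriptors are interesting. This would be easy to fix with a new API. Suggestions are welcome..."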
Mike posted about his work on improving select() and poll().
Mike posted a bit about a possible API to replace poll()/select(): "How about a 'device like' API where you write 'pollfd like' structs, the 'device' listens for events and delivers 'pollfd like' structs representing them when you read it? ... "
Rogier Wolff suggested using "the API that the digital guys suggested",http://www.cs.rice.edu/~gaurav/papers/usenix99.ps
Joerg Pommnitz pointed out that any new API along these lines should be able to wait for not just file descriptor events, but also signals and maybe SYSV-IPC. Our synchronization primitives should certainly be able to do what Win32's WaitForMultipleObjects can, at least.
Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and sigwaitinfo() was a superset of the API proposed in http://www.cs.rice.edu/~gaurav/papers/usenix99.ps. He also mentions that you keep the signal blocked at all times if you're interested in performance; instead of the signal being delivered asynchronously, the process grabs the next one from the queue with sigwaitinfo().
Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and concluded they're pretty similar.
Alan Cox noted that an older rev of SCT's SIGIO patch is included in 2.3.18ac.
Jordan Mendelson posted some example code showing how to use F_SETSIG.
Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted: "With a signal dequeuing mechanism, your application is going to get signals destined for various library components if libraries are using the same mechanism," but the library can set up its own signal handler, so this shouldn't affect the program (much).
Doug Royer noted that he'd gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server. Others chimed in with estimates of how much RAM that would require on Linux, and what bottlenecks would be hit.

Interesting reading!

Limits on open filehandles

Any Unix: the limits set by ulimit or setrlimit.
Solaris: see the Solaris FAQ, question 3.46 (or thereabouts; they renumber the questions periodically).
FreeBSD:
Edit /boot/loader.conf, add the line
```
set kern.maxfiles=XXXX
```
where XXXX is the desired system limit on file descriptors, and reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved far more than 10000 connections on FreeBSD 4.3, and says

"FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially, via sysctl.... You have to do it in the /boot/loader.conf file.
The reason for this is that the zalloci() calls for initializing the sockets and tcpcb structures zones occurs very early in system startup, in order that the zone be both type stable and that it be swappable.
You will also need to set the number of mbufs much higher, since you will (on an unmodified kernel) chew up one mbuf per connection for tcptempl structures, which are used to implement keepalive."

Another reader says

"As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have to worry about one mbuf being chewed up per connection."

See also:
- the FreeBSD handbook
- SYSCTL TUNING, LOADER TUNABLES, and KERNEL CONFIG TUNING in 'man tuning'
- The Effects of Tuning a FreeBSD 4.3 Box for High Performance, Daemon News, Aug 2001
- postfix.org tuning notes, covering FreeBSD 4.2 and 4.4
- the Measurement Factory's notes, circa FreeBSD 4.3
OpenBSD: A reader says

"In OpenBSD, an additional tweak is required to increase the number of open filehandles available per process: the openfiles-cur parameter in /etc/login.conf needs to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf but it has no effect. This matters because as shipped, the login.conf limits are a quite low 64 for nonprivileged processes, 128 for privileged."
Linux: See Bodo Bauer's /proc documentation. On 2.4 kernels:
```
echo 32768 > /proc/sys/fs/file-max 
```
increases the system limit on open files, and
```
ulimit -n 32768
```
increases the current process' limit.

On 2.2.x kernels,
```
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max
```
increases the system limit on open files, and
```
ulimit -n 32768
```
increases the current process' limit.

I verified that a process on Red Hat 6.0 (2.2.5 or so plus patches) can open at least 31000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.
Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using initscript and pam_limit.
In older 2.2 kernels, though, the number of open files per process is still limited to 1024, even with the above changes.
See also Oskar's 1998 post, which talks about the per-process and system-wide limits on file descriptors in the 2.0.36 kernel.
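For the per-process limit, the shell's `ulimit -n` has a programmatic counterpart in setrlimit(). The following is a minimal sketch, assuming a hypothetical helper name `raise_fd_limit`; it raises the soft RLIMIT_NOFILE limit toward a requested value without exceeding the hard limit, and never lowers it.

```c
#include <sys/resource.h>

/*
 * Raise this process's soft limit on open file descriptors toward
 * `wanted`, capped at the hard limit. Never lowers the limit.
 * Returns 0 on success and -1 on error.
 */
int raise_fd_limit(unsigned long wanted)
{
    struct rlimit rl;
    rlim_t target = (rlim_t)wanted;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && target > rl.rlim_max)
        target = rl.rlim_max;          /* an unprivileged process cannot pass this */
    if (target > rl.rlim_cur) {
        rl.rlim_cur = target;
        if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
            return -1;
    }
    return 0;
}
```

A server would typically call something like `raise_fd_limit(32768)` early in startup; raising the hard limit itself still requires privileges and the system-wide settings shown above.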

Limits on threads

On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory. If you're using pthreads, you can set this at runtime by calling pthread_attr_setstacksize() on an attribute object prepared with pthread_attr_init().

Solaris: it supports as many threads as will fit in memory, I hear.
Linux 2.6 kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above 32000 or so threads. (You'll need to use very small stack threads to get anywhere near that number of threads, though, unless you're on a 64 bit processor.) See the NPTL mailing list, e.g. the thread with subject "Cannot create more than 32K threads?", for more info.
Linux 2.4: /proc/sys/kernel/threads-max is the max number of threads; it defaults to 2047 on my Red Hat 8 system. You can increase this as usual by echoing new values into that file, e.g. "echo 4000 > /proc/sys/kernel/threads-max"
Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel. I don't know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit. It appears to be integrated into 2.3.20.
See also Volano's detailed instructions for raising file, thread, and FD_SET limits in the 2.2 kernel. Wow. This document steps you through a lot of stuff that would be hard to figure out yourself, but is somewhat dated.
Java: See Volano's detailed benchmark info, plus their info on how to tune various systems to handle lots of threads.

Java issues

Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do nonblocking reads, but no way to do nonblocking writes.

In May 2001, JDK 1.4 introduced the package java.nio to provide full support for nonblocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!

HP's java also includes a Thread Polling API.

In 2000, Matt Welsh implemented nonblocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10000) connections. His class library is called java-nbio; it's part of the Sandstorm project. Benchmarks showing performance with 10000 connections are available.

Before NIO, there were several proposals for improving Java's networking APIs:

Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory management changes to allow the use of asynchronous I/O with Java.
Interfacing Java to the Virtual Interface Architecture, by C-C. Chang and T. von Eicken, proposes memory management changes to allow the use of asynchronous I/O with Java.
JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who says Sun doesn't listen?).

Other tips

Zero-Copy
Normally, data gets copied many times on its way from here to there. Any scheme that eliminates these copies to the bare physical minimum is called "zero-copy".
- Thomas Ogrisegg's zero-copy send patch for mmaped files under Linux 2.4.17-2.4.20. Claims it's faster than sendfile().
- IO-Lite is a proposal for a set of I/O primitives that gets rid of the need for many copies.
- Alan Cox noted that zero-copy is sometimes not worth the trouble back in 1999. (He did like sendfile(), though.)
- Ingo implemented a form of zero-copy TCP in the 2.4 kernel for TUX 1.0 in July 2000, and says he'll make it available to userspace soon.
- Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea seems to be that if you call write() or read() on a socket, the pointer is page-aligned, and the amount of data transferred is at least a page, *and* you don't immediately reuse the buffer, memory management tricks will be used to avoid copies. But see followups to this message on linux-kernel for people's misgivings about the speed of those memory management tricks.
  According to a note from Noriyuki Soda:
  
  Sending side zero-copy is supported since NetBSD-1.6 release by specifying "SOSEND_LOAN" kernel option. This option is now default on NetBSD-current (you can disable this feature by specifying "SOSEND_NO_LOAN" in the kernel option on NetBSD_current). With this feature, zero-copy is automatically enabled, if data more than 4096 bytes are specified as data to be sent.
- The sendfile() system call can implement zero-copy networking.
  The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file. This lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using nonblocking I/O. (In Linux, it's poorly documented at the moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this. See also Exploring The sendfile System Call by Jeff Tranter in Linux Gazette issue 91.) Rumor has it, ftp.cdrom.com benefitted noticeably from sendfile().
  
  A zero-copy implementation of sendfile() is on its way for the 2.4 kernel. See LWN Jan 25 2001.
  
  One developer using sendfile() with Freebsd reports that using POLLWRBAND instead of POLLOUT makes a big difference.
  
  Solaris 8 (as of the July 2001 update) has a new system call 'sendfilev'. A copy of the man page is available. The Solaris 8 7/01 release notes also mention it. I suspect that this will be most useful when sending to a socket in blocking mode; it'd be a bit of a pain to use with a nonblocking socket.
Avoid small frames by using writev (or TCP_CORK)
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some reason. Unsetting the option flushes the buffer. Better to use writev(), though...

See LWN Jan 25 2001 for a summary of some very interesting discussions on linux-kernel about TCP_CORK and a possible alternative MSG_MORE.
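The writev() suggestion can be sketched as follows. The helper name `write_header_and_body` is an assumption made for this example; the idea is simply that a response header and body held in separate buffers go to the kernel in one call, instead of two small write() calls that may leave as two small frames.

```c
#include <string.h>
#include <sys/uio.h>

/*
 * Hand a header and a body to the kernel with a single writev() call.
 * Returns the number of bytes written (which may be short on a
 * nonblocking descriptor) or -1 on error.
 */
long write_header_and_body(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;   /* writev() does not modify the data */
    iov[0].iov_len = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);
    return (long)writev(fd, iov, 2);
}
```

A production server would also handle a short return value by resuming from the first unwritten byte, which this sketch leaves out for brevity.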
Behave sensibly on overload.
[Provos, Lever, and Tweedie 2000] notes that dropping incoming connections when the server is overloaded improved the shape of the performance curve, and reduced the overall error rate. They used a smoothed version of "number of clients with I/O ready" as a measure of overload. This technique should be easily applicable to servers written with select, poll, or any system call that returns a count of readiness events per call (e.g. /dev/poll or sigtimedwait4()).
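One way to picture a "smoothed" overload measure is an exponential moving average over the ready counts returned by successive select() or poll() calls. This is an assumption for illustration only: the text above does not say which smoothing the authors used, and the structure `load_meter`, its fields, and the example numbers are all hypothetical.

```c
/* Hypothetical overload detector based on an exponential moving average. */
struct load_meter {
    double smoothed;    /* running average of ready-client counts */
    double alpha;       /* weight of the newest sample, 0 < alpha <= 1 */
    double threshold;   /* smoothed value above which load is shed */
};

/*
 * Fold in the ready count from the latest readiness call.
 * Returns 1 if new connections should be refused for now, else 0.
 */
int load_meter_update(struct load_meter *m, int ready_count)
{
    m->smoothed = m->alpha * ready_count + (1.0 - m->alpha) * m->smoothed;
    return m->smoothed > m->threshold;
}
```

With `alpha = 0.5` and `threshold = 10`, a single burst of 40 ready clients after a quiet period pushes the average over the threshold, and it takes a couple of quiet iterations before connections are accepted again, which avoids flapping on every spike.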
Some programs can benefit from using non-Posix threads.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of native threads rather than pthreads.
Caching your own data can sometimes be a win.
"Re: fix for hybrid server problems" by Vivek Sadananda Pai (vivek@cs.rice.edu) on new-httpd, May 9th, states:

"I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that'll appear at the next Usenix conference. If you've got postscript, the paper is available at http://www.cs.rice.edu/~vivek/flash99/"

Other limits

Old system libraries might use 16 bit variables to hold file handles, which causes trouble above 32767 handles. glibc2.1 should be ok.
Many systems use 16 bit variables to hold process or thread id's. It would be interesting to port the Volano scalability benchmark to C, and see what the upper limit on number of threads is for the various operating systems.
Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit of 2000 threads.
Look at the performance comparison graph at the bottom of http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or http daemon has a hard limit on the number of connections open, you would expect exactly this behavior. There may be other causes.

Kernel Issues

For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel Traffic, the Linux-Kernel mailing list, and my Mindcraft Redux page.

In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.

Mohit Aron (aron@cs.rice.edu) writes that rate-based clocking in TCP can improve HTTP response time over 'slow' connections by 80%.

Measuring Server Performance

Two tests in particular are simple, interesting, and hard:

raw connections per second (how many 512 byte files per second can you serve?)
total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)

Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.

I also have a few old notes about comparing thttpd to Apache that may be of interest to beginners.

Chuck Lever keeps reminding us about Banga and Druschel's paper on web server benchmarking. It's worth a read.

IBM has an excellent paper titled Java server benchmarks [Baylor et al, 2000]. It's worth a read.

Examples

Interesting select()-based servers

thttpd Very simple. Uses a single process. It has good performance, but doesn't scale with the number of CPU's. Can also use kqueue.
mathopd. Similar to thttpd.
fhttpd
boa
Roxen
Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite - web server using IO-Lite.
Flash: An efficient and portable Web server -- uses select(), mmap(), mincore()
The Flash web server as of 2003 -- uses select(), modified sendfile(), async open()
xitami - uses select() to implement its own thread abstraction for portability to systems without threads.
Medusa - a server-writing toolkit in Python that tries to deliver very high performance.
userver - a small http server that can use select, poll, epoll, or sigio

Interesting /dev/poll-based servers

N. Provos, C. Lever, "Scalable Network I/O in Linux," May, 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June, 2000).] Describes a version of thttpd modified to support /dev/poll. Performance is compared with phhttpd.

Interesting kqueue()-based servers

thttpd (as of version 2.21?)
Adrian Chadd says "I'm doing a lot of work to make squid actually LIKE a kqueue IO system"; it's an official Squid subproject; see http://squid.sourceforge.net/projects.html#commloops. (This is apparently newer than Benno's patch.)

Interesting realtime signal-based servers

Chromium's X15. This uses the 2.4 kernel's SIGIO feature together with sendfile() and TCP_CORK, and reportedly achieves higher speed than even TUX. The source is available under a community source (not open source) license. See the original announcement by Fabio Riccardi.
Zach Brown's phhttpd - "a quick web server that was written to showcase the sigio/siginfo event model. consider this code highly experimental and yourself highly mental if you try and use it in a production environment." Uses the siginfo features of 2.3.21 or later, and includes the needed patches for earlier kernels. Rumored to be even faster than khttpd. See his post of 31 May 1999 for some notes.

Interesting thread-based servers

Hoser FTPD. See their benchmark page.
Peter Eriksson's phttpd and pftpd
The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
Sun's Java Web Server (which has been reported to handle 500 simultaneous clients)

Interesting in-kernel servers

khttpd
"TUX" (Threaded linUX webserver) by Ingo Molnar et al. For 2.4 kernel.

Changelog

$Log: c10k.html,v $
Revision 1.212 2006/09/02 14:52:13 dank
added asio
Revision 1.211 2006/07/27 10:28:58 dank
Link to Cal Henderson's book.
Revision 1.210 2006/07/27 10:18:58 dank
Listify polyakov links, add Drepper's new proposal, note that FreeBSD 7 might move to 1:1
Revision 1.209 2006/07/13 15:07:03 dank
link to Scale! library, updated Polyakov links
Revision 1.208 2006/07/13 14:50:29 dank
Link to Polyakov's patches
Revision 1.207 2003/11/03 08:09:39 dank
Link to Linus's message deprecating the idea of aio_open
Revision 1.206 2003/11/03 07:44:34 dank
link to userver
Revision 1.205 2003/11/03 06:55:26 dank
Link to Vivek Pei's new Flash paper, mention great specweb99 score

Copyright 1999-2006 Dan Kegel
dank@kegel.com
Last updated: 2 Sept 2006
[Return to www.kegel.com]
