Analysis of High-Load, High-Concurrency Website Architecture

Date: 2021-04-22

Because I am building a high-performance forum program for a large user base, I am interested in high-performance, high-concurrency server architecture, so I collected a good deal of material on the subject from the web to share. I hope to exchange ideas with everyone.
msn: defender_ios@hotmail.com
----------------------------------------
Startup Websites and Open-Source Software
Lessons from Optimizing Servers for Large, High-Load Websites
Building an Efficient Web Server with Lighttpd+Squid+Apache
Where Should a High-Traffic Website Start?
Building High-Load Sites with Load Balancing
Architecture Design Issues for Large Websites
Thoughts on High-Concurrency Clusters on Open-Source Platforms
A First Look at Large, High-Load Website Architecture and Applications
On the System Architecture of Large, High-Concurrency, High-Load Websites
mixi Technical Architecture
  mixi.jp: A Scalable SNS Site Built with Open-Source Software
  Key points:
  1. MySQL partitioning, running on InnoDB
  2. Dynamic cache servers
  Facebook.com (US), Yeejee.com (China), and mixi.jp (Japan) all use the open-source distributed cache server Memcache
  3. Image caching and ...
Handling Heavy Traffic with memcached+squid+apache deflate
FeedBurner: A Scalable Web Application Based on MySQL and Java
Scaling YouTube's Architecture
A Look at Technorati's Back-End Database Architecture
The Evolution of MySpace's Architecture
eBay's Data Volume
eBay's Application Server Scale
eBay's Distributed, Scalable Database Architecture
Large-Site Performance Optimization as Seen from LiveJournal's Back-End Evolution
  I. LiveJournal's history
  II. Overview of LiveJournal's current architecture
  III. Learning from LiveJournal's growth
  1. One server
  2. Two servers
  3. Four servers
  4. Five servers
  5. More servers
  6. Where we are now:
  7. Where we are now
  8. Where we are now
  9. Caching
  10. Web load balancing
  11. MogileFS
Craigslist's Database Architecture
Second Life Data Tidbits
The Idea Goldmine in eBay's Architecture
A Billion Requests a Day: eBay Architecture (Part 1)
Seven Caching Weapons for Speeding Up Web Applications and Access
Designing a Cacheable CMS
Key Points for Developing Large, High-Load Web Applications [nightsailer]
Notes on Memcached and Lucene
Designing High-Performance, Scalable Websites with Open-Source Software
A High-Load Architecture: Lighttpd+PHP(FastCGI)+Memcached+Squid
Thinking About the System Architecture of High-Concurrency, High-Load Websites
"Portal-Scale Systems I Built at SOHU Over the Years (C/C++)"
Architecture Analysis of China's Top Portal Sites, Part 1
Architecture Analysis of China's Top Portal Sites, Part 2
Solutions for Servers Carrying Large Numbers of Users
YouTube Scalability Talk
High Performance Web Sites by Nate Koechley
  One dozen rules for faster pages
  Why talk about performance?
  Case Studies
  Conclusion
Rules for High Performance Web Sites
For a High-Concurrency Application with Tens of Millions of DB Rows, How Should the System Be Designed?
High-Performance Server Design
Advantages and Applications: CDN Mirror Acceleration Revisited
Besides Program Optimization and zend + eacc (memcached), How Else Can Server Load Capacity Be Raised?
How to Plan a Large, Highly Concurrent Java Server Program
How to Architect a "Just so so" Website?
The Cheapest High-Load Website Architecture
A Complete Guide to Load-Balancing Techniques
Processing and Analyzing Massive Data
A Very Instructive SQL Optimization (Statistics SQL over Large Data in an Electronic Branch-Office System)
How to Optimize Fuzzy Queries over Large Data (Architecture, Database Settings, SQL...)
Help Wanted: Methods for Processing Massive Data
  re: Help Wanted: Methods for Processing Massive Data
Strategies for Querying Massive Databases
Handling Massive Data with SQL Server 2005
Table Partitioning: Design Ideas and Implementation
Thoroughly Optimizing MySQL on a Heavily Loaded Linux System (1)
Design and Programming Techniques for Large Databases (I recently built a traffic statistics system whose logs are very large and all stored in the database. I designed the tables the conventional way and queries are already very slow. How should the database be designed in this situation? Split one table into several? What are the techniques for querying and inserting then? Thanks, everyone!)
Discussing a Solution: Database Issues in a Project [closed]
Consider Your Performance Strategy When Designing Web Software
Choosing Servers for Large Java Web Systems
High-Concurrency, High-Traffic Website Architecture
  1.1 The development of the Internet
  1.2 New trends in building Internet sites
  1.3 About Sina Podcast
  2.1 Site mirroring
  2.2 CDN (content delivery network)
  2.3 Distributed design at the application layer
  2.4 Network-layer architecture summary
  3.1 Introduction to layer-4 switching
  3.2 Hardware implementation
  3.3 Software implementation
High Performance and Scalability in Website Architecture
Collected Resources: Architecture Design and Optimization Strategies for High-Concurrency, High-Performance, Highly Scalable Web 2.0 Sites
A Brief Analysis of CommunityServer Performance Problems
  Half-hearted multi-site support
  Centralized storage of content data
  Over-reliance on caching
  CCS makes things worse
  How to fix it?
Digg PHP's Scalability and Performance
YouTube Architecture
  Information Sources
  Platform
  What's Inside?
  The Stats
  Recipe for handling rapid growth
  Web Servers
  Video Serving
  Serving Video Key Points
  Serving Thumbnails
  Databases
  Data Center Strategy
  Lessons Learned
  Library
Friendster Architecture
  Information Sources
  Platform
  What's Inside?
  Lessons Learned
Feedblendr Architecture - Using EC2 to Scale
  The Platform
  The Stats
  The Architecture
  Lesson Learned
  Related Articles
  Comments
  Re: Feedblendr Architecture - Using EC2 to Scale
PlentyOfFish Architecture
  Information Sources
  The Platform
  The Stats
  What's Inside
  Lessons Learned
Wikimedia architecture
  Information Sources
  Platform
  The Stats
  The Architecture
  Lessons Learned
Scaling Early Stage Startups
  Information Sources
  The Platform
  The Architecture
  Lessons Learned
Database parallelism choices greatly impact scalability
Introduction to Distributed System Design
  Table of Contents
  Audience and Pre-Requisites
  The Basics
  So How Is It Done?
  Remote Procedure Calls
  Some Distributed Design Principles
  Exercises
  References
Flickr Architecture
  Information Sources
  Platform
  The Stats
  The Architecture
  Lessons Learned
  Comments
  How to store images?
  RE: How to store images?
Amazon Architecture
  Information Sources
  Platform
  The Stats
  The Architecture
  Lessons Learned
  Comments
  Jeff.. Bazos?
  Werner Vogels, the CTO of
  Re: Amazon Architecture
  It's WSDL
  Re: It's WSDL
  Re: Amazon Architecture
Scaling Twitter: Making Twitter 10000 Percent Faster
  Information Sources
  The Platform
  The Stats
  The Architecture
  Lessons Learned
  Related Articles
  Comments
  Re: Scaling Twitter: Making Twitter 10000 Percent Faster
  They could have been 20% better?
  Re: Scaling Twitter: Making Twitter 10000 Percent Faster
Google Architecture
  Information Sources
  Platform
  What's Inside?
  The Stats
  The Stack
  Reliable Storage Mechanism with GFS (Google File System)
  Do Something With the Data Using MapReduce
  Storing Structured Data in BigTable
  Hardware
  Misc
  Future Directions for Google
  Lessons Learned

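Whatever the case, first find where the bottleneck is: is CPU load too high (frequently at 100%), is memory insufficient (heavy use of virtual memory), or can disk I/O not keep up (the disk activity light flashing constantly)? All of these can be solved or improved by upgrading hardware (a higher-grade CPU, faster and larger memory, a hardware RAID array with more high-speed SCSI disks), but that requires considerable investment.
On the software side, more memory and better I/O performance already raise database efficiency substantially; configuring the query cache and further optimizing the schema and the queries can take database performance another big step forward.
If spending on server hardware is difficult, generate static pages wherever possible.
Author: BBSADM
Title: The current web system architecture
Time: Fri Apr 6 20:15:56 2007
Hits: 100

The biggest benefit is faster static files.
Later I plan to make post content static as well, to get the lowest possible load.

Also, using nginx as the front end makes load balancing easy; the test machine can be used to load-balance static files.
Startup Websites and Open-Source Software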
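An earlier article mentioned open-source software, but mainly from a system operations angle, analyzing system-level open-source software (such as bind and memcached). Here we discuss open-source software used to build a startup website's applications (such as phpBB and phparticle), running on Linux, MySQL, Apache, PHP, Java, and so on.
Startup-stage websites often adopt a fairly simple system architecture or directly use mature open-source software. The advantage is speed: almost no development is needed. Buy hosting space and a domain, download the software, set it up, and in half a day a brand-new website is open, which saves a great deal of time and development cost early on.
Building on open-source software also has limitations, and these are what we focus on here: the goal is to avoid them as far as possible when selecting software and during later maintenance.
First, open-source software generally exists only in fairly mature fields. For an innovative project it is hard to find a suitable package, and there is no good solution; if you insist on open source, you usually find the closest match and modify it. That said, there are many open-source projects now, and all kinds can be found on sf.net. When selecting, prefer a program with a simple architecture. Simpler is not always better, but it must be simple and clear at a glance; avoid overly advanced features, because Internet applications do not need complicated frameworks. There are two reasons. One: a complex framework exists to achieve better extensibility and cleaner layering, but the scope of the Internet application we are building is usually much narrower than what the open-source software was designed for, so it ends up over-designed. Two: the deep inheritance hierarchies that come from pursuing perfect layering increase the maintenance workload of the whole system. Three layers are enough: a data (entity) layer, a business-logic layer, and a presentation layer. Overly complex design lowers development efficiency, raises maintenance cost, and makes causes harder to find when performance problems or incidents occur.
The other problem is that later maintenance and further development of open-source software can be difficult. This is not absolute; it depends on whether the software's architecture is clear, sound, and extensible. Small changes, such as adding a user attribute or an article attribute, are usually no problem, but some requirements are not so easy. For example, at a certain stage the site may extend its product line: it originally offered a forum plus a CMS, and now wants to add a shop. The user system then becomes a problem, and it can no longer be solved by patching the forum or the CMS. We have to consider it at a higher level: whether to build a site-wide user authentication system with single sign-on, so that users can move between products seamlessly while staying logged in. Because most of the initial user data probably lives in the forum, separating it out is troublesome: extracting user data without disrupting the forum's existing system is a real headache. After a period of operation, unless the design was especially good and maintenance careful, the forum will contain all sorts of messy calls to user information, made directly against the database, so the amount of code to change when moving user data away is not negligible. An alternative is to copy the user data out, treat the new user database as primary, and keep the forum's user data in sync through a synchronous or asynchronous mechanism. The best solution is to choose, at selection time, software whose data layer is well encapsulated and whose SQL is not scattered everywhere, then preserve that style during maintenance by keeping every database operation in the data or entity layer. Then, however the data is extended, code changes are easy and barely affect the upper layers.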
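Startup sites usually give little thought to access speed: buy space or a hosted server, deploy the application, and it is basically running; the problem gets real attention only when a severe speed bottleneck appears. In fact speed is an issue from the very beginning and evolves as the site grows. The most basic requirement of a website is reasonably fast access; without speed, even the best content or service cannot get through. So access speed must be considered when the site is founded, whether you use open-source software or write your own, and the data layer should use SQL correctly and efficiently. SQL syntax is rich; taking different application-layer implementations into account, the same result can often be achieved in several ways, but usually only one is the most efficient, and the efficient SQL is typically the simplest. Early on this is not obvious, but once traffic grows it can become the main performance bottleneck, and the jumble of SQL will drive people mad. If it was neglected early, it can still be fixed later, perhaps not thoroughly, but with a very real performance gain. Reading MySQL's slow query log is the simplest method: log queries that take longer than 1 second, analyze them, add the missing indexes, and simplify the SQL that can be simplified. You can also use SHOW PROCESSLIST to see the threads currently running or blocked on the database server and so pin down the offending SQL statements. Tuning the database configuration file can also improve performance considerably; there are many articles on that online, so I will not go into it here.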
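The "add the missing index" step can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (where the analogous tool is EXPLAIN); the table, column, and index names are hypothetical and only serve the illustration.

```python
import sqlite3

# In-memory database as a stand-in for the forum's MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO posts (author_id, title) VALUES (?, ?)",
    [(i % 50, "post %d" % i) for i in range(1000)],
)

QUERY = "SELECT * FROM posts WHERE author_id = ?"


def plan(connection, sql, params):
    """Return the query planner's description of how sql would run."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(conn, QUERY, (7,))   # full table scan: no usable index yet
conn.execute("CREATE INDEX idx_posts_author ON posts (author_id)")
after = plan(conn, QUERY, (7,))    # the planner now searches via the index
```

Comparing `before` and `after` shows the planner switching from scanning the whole table to searching through the new index, which is the kind of change a slow-query-log review is meant to surface.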
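Once all this is done, if the database hits performance problems again, it is time to consider multiple servers, because one server can no longer cope. I covered this in earlier articles and will not expand on it here.
Other ways to solve speed problems cannot be implemented inside the application alone. They require designing the system from a higher level, considering server and network architecture and how the various system-level software cooperates; again, I will not expand on it here.
A well-designed and well-implemented application + middleware + a well-distributed database + good system configuration + a good server/network structure can support a fairly large website. Together with the previous articles, this roughly completes the path from small site to large site. The process is full of hardship and fun, and it can be gradual: taking the initiative, thinking ahead, and reducing firefighting makes it easier.

Lessons from Optimizing Servers for Large, High-Load Websites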
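Because of my work I have optimized several large websites (book libraries, securities), generally ranked between 1000 and 4000 worldwide. They run different software with different configurations, but share common traits: they use FreeBSD, have high-end hardware under high load with very high PV, and each needs two or more dedicated hosts. During optimization and monitoring the early results were unsatisfactory; only after reading a great deal of material did I sort out my thinking. I am writing it down in the hope that it helps.
Web server: dual Xeon 2.4 GHz or better, 2 GB RAM or more, one or more SCSI disks, FreeBSD 5.x or later.
The database server is similar to the web server.
The book-library software is jieqi; the forum is Discuz!.
apache 2.x + php 4.x + mysql 4.0.x + zend + dedicated 100M fiber bandwidth

1. Always recompile the kernel, optimizing according to how well you understand the kernel and the server's actual configuration. Remember to enable SMP; the ULE scheduler can also be used.
2. Tune system values, usually by adding them to /etc/sysctl.conf: raise the kernel's limit on concurrently open files, among other settings.
3. For Apache 2 the prefork MPM is enough; I tried the worker MPM and it was really disappointing. Edit httpd.conf to raise the concurrency limits and disable unneeded modules. Apache consumes a lot of memory, so travel light; keep-alive connections can be used in moderation. Turn off logging.
4. When compiling PHP, add only the options you actually use and firmly leave out the rest, so that system resources are not wasted.
5. Use a low Zend optimization level; 15 is enough, and level 1023 only adds to the server's load.
6. MySQL should use persistent connections as little as possible; limit them to 2-3 seconds.
7. Compile everything by hand rather than installing from ports, because ports bring in many modules you do not need. Remember this.
8. For busy sites like these with many users online, the whole idea is to give as much of the system's resources as possible to the web and MySQL services while making each process of those services use as few resources as possible. Once the system starts swapping heavily, the server's state deteriorates sharply.
9. Observe, track, and tune over the long term, and fix problems as soon as they appear.

That is all I can think of; additions are welcome.

梦飞
http://onlinecq.com
2006/4/25
P.S. A few optimizations of my own:
1. Pass GCC flags that optimize for the specific CPU when compiling Apache, PHP, and MySQL;
2. If the site has many small files, consider the reiserfs file system for better read/write performance;
3. If .htaccess is not needed, set AllowOverride to None.
When the Apache server is busy, adding memory solves many problems.
On purely interactive sites MySQL performance becomes a bottleneck; its parameters need long-term tracking and adjustment.
Building an Efficient Web Server with Lighttpd+Squid+Apache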
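Posted by davies on 2006-9-9 01:06 | Category: Tech :: Web ::

Architecture
Apache is usually the first-choice web server in the open-source world: powerful and reliable, it has become a brand and fits the great majority of use cases. But that power sometimes makes it heavy, its configuration files are daunting, and its efficiency under high concurrency is not great. The lightweight web server Lighttpd is a rising star whose static-file throughput is far higher than Apache's, reportedly 2-3 times. Lighttpd's performance and ease of use are persuasive enough: wherever it is up to the job, use it. Lighttpd also supports PHP well and can support other languages, such as Python, through FastCGI.
Still, Lighttpd is a lightweight server; it cannot match Apache in features and cannot handle some applications. For example, Lighttpd does not yet support caching, while most sites today generate dynamic content with programs. Without caching, even a very efficient program struggles under heavy traffic, and having a program do the same work over and over is pointless. First, the web program itself should cache data that is used repeatedly. Even that is not enough: just starting the web handler is costly, so caching the final generated pages is essential. This is Squid's strength. Originally a proxy, it caches efficiently and can act as a reverse-proxy accelerator for a site. Put Squid in front of Apache or Lighttpd to cache the dynamic content the web server generates; the web application only needs to set appropriate page expiry times.
Even a site whose content is mostly dynamic still has static elements such as images, JS, and CSS. For these, putting Squid in front of Apache or Lighttpd actually lowers performance, since handling HTTP requests is what web servers do best, and caching in Squid what already exists in the file system wastes memory and disk space. So consider placing Lighttpd in front of Squid, forming a Lighttpd+Squid+Apache chain. Lighttpd sits at the front and serves static content itself, forwarding dynamic requests through its proxy module to Squid; if Squid holds the content and it has not expired, Squid returns it straight to Lighttpd. New requests and requests for expired pages are handled by the web program in Apache. After two levels of filtering by Lighttpd and Squid, Apache handles far fewer requests, which reduces pressure on the web application. This structure also makes it easy to spread different kinds of processing across multiple machines, with Lighttpd as the single gatekeeper in front.
In this architecture each tier can be optimized separately: Lighttpd can use asynchronous I/O, Squid can cache in memory, Apache can use a suitable MPM, and each tier can be load-balanced across multiple machines, so it scales well.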
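The request flow through the three tiers can be sketched in a few lines of Python. This is only an illustration of the dispatch logic described above, not how Lighttpd or Squid are implemented; the extension list, the `TtlCache` class, and the `read_static` / `run_backend` callables are all hypothetical names.

```python
import time

STATIC_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif")  # assumed list


class TtlCache:
    """Tiny stand-in for the Squid tier: keeps responses until they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # path -> (expires_at, body)

    def get(self, path):
        entry = self.entries.get(path)
        if entry and entry[0] > self.clock():
            return entry[1]
        return None  # missing or expired

    def put(self, path, body):
        self.entries[path] = (self.clock() + self.ttl, body)


def handle(path, cache, read_static, run_backend):
    """Front tier: serve static files directly, dynamic pages via the cache."""
    if path.endswith(STATIC_EXTENSIONS):
        return read_static(path)      # the front web server's job
    body = cache.get(path)            # the caching proxy's job
    if body is None:
        body = run_backend(path)      # the application server, only on a miss
        cache.put(path, body)
    return body
```

With a 300-second TTL, repeated requests for the same dynamic path reach `run_backend` once per five minutes, which is the effect the article attributes to the two levels of filtering.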
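A Worked Example
The following uses several sites under the daviesliu.net and rainbud.net domains to show how the scheme is set up. Under daviesliu.net there are several blog sites built with mod_python, several PHP sites, and a small mod_python program; later there may be a few more PHP and Django sites. The server is very weak, a Celeron 500 with 384 MB of PC100 memory, so web server efficiency matters. All these sites are virtual hosts on the same port of the same machine.
Lighttpd listens on port 80, Squid on port 3128, and Apache on port 81.
Lighttpd configuration
Domains use the directory layout /var/www/domain/subdomain; document-root is configured with the evhost module as follows:
evhost.path-pattern = var.basedir + "/%0/%3/"
FtpSearch contains Perl scripts and needs CGI support. It does in-site FTP search, so caching has little value, and it is handled directly by lighttpd's mod_cgi:
$HTTP["url"] =~ "^/cgi-bin/" { # only allow cgi's in this directory
    dir-listing.activate = "disable" # disable directory listings
    cgi.assign = ( ".pl" => "/usr/bin/perl", ".cgi" => "/usr/bin/perl" )
}
The bbs uses phpBB and has little traffic; it could run under lighttpd (fastcgi) or apache (mod_php). For now it uses lighttpd, with all .php requests handled by fastcgi:
fastcgi.server = ( ".php" => ( ( "host" => "127.0.0.1", "port"=> 1026, "bin-path" => "/usr/bin/php-cgi" ) ) )
blog.daviesliu.net and blog.rainbud.net run blogxp, written with mod_python. All static content has a file extension and dynamic content has none. blogxp generates XML with Python and hands it to mod_xslt to convert to HTML, so it can only run under Apache. These sites use the typical Lighttpd+Squid+Apache arrangement:
$HTTP["host"] =~ "^blog" {
    $HTTP["url"] !~ "\." {
        proxy.server = ( "" => ( "localhost" => ( "host"=> "127.0.0.1", "port"=> 3128 ) ) ) # port 3128 is Squid
    }
}
share has static pages as well as requests handled by mod_python, under /cgi/:
$HTTP["host"] =~ "^share" {
    proxy.server = (
        "/cgi" => ( "localhost" => ( "host"=> "127.0.0.1", "port"=> 3128 ) )
    )
}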
Squid configuration
Allow local access only:
http_port 3128
http_access allow localhost
http_access deny all
Enable the reverse proxy:
httpd_accel_host 127.0.0.1
httpd_accel_port 81 # Apache's port
httpd_accel_single_host on
httpd_accel_with_proxy on # enable caching
httpd_accel_uses_host_header on # enable virtual host support
This reverse proxy serves every domain on this host.
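Apache configuration
Configure /etc/conf.d/apache2 so that it loads the mod_python, mod_xslt, and mod_php modules:
APACHE2_OPTS="-D PYTHON -D XSLT -D PHP5"
Root directory for all sites:
# allow .htaccess overrides
AllowOverride All
Order allow,deny
Allow from all
Name-based virtual host (directives inside the <VirtualHost> block):
ServerName blog.daviesliu.net
DocumentRoot /var/www/daviesliu.net/blog
This is clearly less convenient than lighttpd's evhost configuration.
.htaccess settings under blog.daviesliu.net (convenient for development, no Apache restart needed):
SetHandler mod_python
PythonHandler blogxp.publisher
PythonDebug On
PythonAutoReload On

# static files are handled directly by Apache
SetHandler None

# prevent .xsl files from being transformed
AddType text/xsl .xsl
AddOutputFilterByType mod_xslt text/xml
XSLTCache off
XSLTProcess on
Header set Pragma "cache"
Header set Cache-Control "cache"
In blogxp.publisher, the returned content type and expiry time must also be set:
import time
from email.utils import formatdate
req.content_type = "text/xml"
req.headers_out['Expires'] = formatdate(time.time() + 60 * 5, usegmt=True)  # 5 minutes ahead, in the GMT form HTTP expects
With this configuration, every site can be reached normally through ports 80, 3128, and 81. Port 80 is used for external access to reduce load; port 81 can be used for debugging during development, free from caching.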
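Performance test
With limited time and energy, I ran only an informal comparison with ab2 (each case run several times and averaged). The metric is requests per second.
Test command, here 10 concurrent requests for scripts/prototype.js against lighttpd:
ab2 -n 1000 -c 10 http://blog.daviesliu.net:80/scripts/prototype.js
Static content: prototype.js (27 kB)
Concurrency  Lighttpd(:80)  Squid(:3128)  Apache(:81)
1            380            210           240
10           410            215           240
100          380            160           230

On static content Lighttpd is clearly strong, while Squid, with no memory cache configured, performs somewhat worse than the two web servers.

Dynamic page: /rss (31 kB)
Concurrency  Lighttpd(:80)  Squid(:3128)  Apache(:81)
1            103            210           6.17
10           110            200           6.04
100          100            100           6.24

On dynamic content Squid's effect is very clear, while requests through Lighttpd are limited by Squid's efficiency and are a good deal slower still. Only with several Squid instances sharing the load would Lighttpd's benefit show.
On a single machine with very little static content, Lighttpd can be dropped and Squid placed at the very front.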
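14 Comments
1. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
This combination works, but parts of the article are off.
light (lighttpd) can add cache support itself, and judged on caching performance alone it is even a bit better than squid (3000+ per second on average, real production data).
The squid part is not quite right: once static handling is tuned to a hit ratio above 99.99%, it helps a great deal,
and it benefits the overall structure too.
We actually ran light+squid+apache in production during a transition period, when the back end had no compression support.
In practice every piece can be patched to your needs. There is no best, only better suited, and manageability matters a lot.
Posted by windtear on Wed Sep 13 13:38:15 2006
2. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
With lighttpd + php, heavy traffic often kills php, and then you get 500s,

whether in local or remote mode.

In the end I had no choice but to switch to zeus, which is rock solid; commercial is commercial.
Posted by soff on Wed Sep 13 13:39:01 2006
3. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
His results look odd, and as a result his conclusion is wrong.

Squid does not speed up dynamic pages at all. The gain in his test comes from his client requesting the same page in parallel; Squid returns the same page for the concurrent requests. I also guess he did not configure an expiry time for static content on his web server, so Squid tries to refetch the file with an If-Modified-Since header on every request. That is why Squid performs poorly in the static test.
Posted by kxn on Wed Sep 13 13:41:24 2006
4. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
I do not quite agree. To Squid, dynamic and static pages are the same as long as the HTTP headers are set properly.
Are you saying there is no caching effect even when Expires is set?
If dynamic pages cannot be cached, where would the speed-up come from?
Posted by davies on Wed Sep 13 13:42:00 2006
5. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
Sorry, my English is poor and misled you; this morning I could not type Chinese on the office machine.
A dynamic page gets no speed-up unless the HTTP expiry header is set correctly. Conversely, static pages also need an expiry header.

By setting the expire time I mean setting it minutes or hours ahead, so that the page is served entirely from Squid's cache during that time.

There are several possible reasons your test shows a gain for dynamic pages. One is that your test requests the same page concurrently: for concurrent requests for the same page, if the response carries no no-cache header, Squid sends that one result back to all of them, which amounts to a very short-lived cache. The test result looks much better, but in practice the same page is rarely requested at the same instant, so there is basically no improvement. Another is that your dynamic page program supports If-Modified-Since: if it finds nothing has changed since that time, it returns Not Modified directly, which is also much faster.

So in production Squid is mostly used to cache static pages. Dynamic pages can be cached, but the page program has to cooperate a lot to get good results.

newsmth's www peaks at 600 qps; the squid side is still comfortable, and the bottleneck is the back end.
Posted by kxn on Wed Sep 13 13:43:55 2006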
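The If-Modified-Since behaviour kxn describes can be sketched as a small Python function. It is a simplified illustration, assuming the page program knows when its content last changed; the function name and return shape are made up for the example and are not part of blogxp or Squid.

```python
from email.utils import formatdate, parsedate_to_datetime


def conditional_response(last_modified_ts, if_modified_since=None):
    """Return (status, headers) for a page last changed at last_modified_ts."""
    headers = {"Last-Modified": formatdate(last_modified_ts, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        # HTTP dates have one-second resolution, so compare whole seconds.
        if since is not None and int(last_modified_ts) <= int(since):
            return 304, headers  # Not Modified: no body needs to be generated
    return 200, headers
```

A 304 answer lets the cache keep serving its stored copy, so the expensive page generation is skipped even when no Expires header was set.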
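6. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
Thanks for the detailed answer!

As my article says, every response gets an Expires header set to 5 minutes after the current time, so each page is valid for 5 minutes. Squid seems to decide whether to refresh its cache based on this time, with no need for the server to support If-Modified-Since.
The 5 minutes was chosen from how often the pages usually update.

For a very high-traffic web application such as newsmth's www, if php pages expire after 1-2 seconds, all requests within that window are answered from cache. Even if data changes during the window, users are not really affected; a 1-2 second lag barely hurts the experience, and in return the server responds faster. This can be especially effective for the blog section, which is heavily visited but seldom updated.

Of course, implementing If-Modified-Since would be more effective still, but it is too much work.
Posted by davies on Wed Sep 13 13:45:27 2006
7. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
It seems I did not read your article carefully; I really did not notice that it mentions the Expires header.
Static pages can have an Expires header too; a web server module does that.
Then basically everything is cached by squid.

Without an Expires header, squid revalidates every request with If-Modified-Since.

The php pages on smth www expire after 5 or 10 minutes; I forget which.
Posted by kxn on Wed Sep 13 13:46:46 2006
8. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
Overall this feels like overkill. Without really huge traffic, squid alone is enough.

If you do use lighttpd, there is basically no need for apache,
except for very unusual applications; lighttpd can support nearly everything.

Stacking this many layers on one machine brings no performance benefit.
Posted by scaner on Wed Sep 13 13:48:44 2006
9. Re:Lighttpd+Squid+Apache搭建高效率Web服务器
lighttpd's caching is actually quite powerful; see its CML documentation, which handles caching of dynamic content well. And on a single server, adding squid has little point, unless you have a great deal to cache; squid's Bloom filter is still very effective.
Posted by Wei Litao [email] on Sat Sep 16 13:19:38 2006
10. Re: Lighttpd+Squid+Apache搭建高效率Web服务器
lighttpd has bugs; its memory leaks are fairly serious. I now use nginx and am testing it on lilybbs. Turning dynamic content into static content is really the ultimate way out. I would love to get rid of those hit counters.
Current lilybbs architecture:
------ nginx ---------
| | |
Squid fastcgi proxy
| (migrating gradually) |
static files Njuwebbsd
(migrating gradually to fastcgi)
Posted by bianbian [email] [www] on Fri Apr 6 20:49:08 2007
11. Re: Lighttpd+Squid+Apache搭建高效率Web服务器
ft. Whitespace formatting is not preserved. For the architecture see:
http://bbs.nju.cn/blogcon?userid=BBSADM&file=1175861756
Posted by bianbian [www] on Fri Apr 6 20:51:11 2007
12. Re: Lighttpd+Squid+Apache搭建高效率Web服务器
Also, I think three tiers on one machine is unnecessary; in your case apache can be dropped entirely. My regret with nginx is that everything else is strong but its memcache support is not mature, so I have to add Squid.
Posted by bianbian [www] on Fri Apr 6 20:57:11 2007
13. Re: Lighttpd+Squid+Apache搭建高效率Web服务器
The scheme in my article is useful only in special situations, heh.
It was mainly for playing around.
Hit counts can be computed offline from logs, or that data can be kept separately and updated with ajax, heh.
Posted by davies on Sat Apr 7 01:31:18 2007
14. Re: Lighttpd+Squid+Apache搭建高效率Web服务器
This is the first I have heard of nginx. It seems to be on the same level as lighttpd, with little difference between them. In a contest of raw concurrency it probably cannot beat yaws; I will run a simple test some day.
Posted by davies on Sat Apr 7 01:39:14
Where Should a High-Traffic Website Start?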
________________________________________
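Author: 游戏人间  Time: 2007-6-15 04:23 PM  Title: Where Should a High-Traffic Website Start?

Before asking, let me share my own understanding. If you have some too, please share; I badly need experience in this area.

What follows is aimed at ordinary websites, not special ones such as download sites or chat rooms.

1. Reduce pressure on the database

  Cache query results / use in-memory tables

2. Reduce pressure on Apache: fewer HTTP requests

  Combine all background images into one and position them with CSS / do not use AJAX for instant validation (ignoring user experience; stretching out the user's time relieves the server)

3. Reduce I/O pressure

  Partial page caching
Author: 蟋蟀  Time: 2007-6-15 05:29 PM

Let me add a little. It is only theory and may not be right;
I have never built a high-traffic site.
I. Horizontal
1. Hardware comes first. Moderate spending on hardware pays off far more than piles of software optimization.
2. Next come CPU, memory, and disk. Keep frequently accessed data in memory when you can, or in distributed shared memory when you can,
and only then fall back to disk.

II. Vertical
1. Consider the web's HTTP request and response.
The web needs a server, so there is how to optimize it, how to speed it up through configuration, and caching whatever can be cached; there is a lot here.
2. For dynamic scripts, consider the database: how to optimize it, how to design sensible tables, and so on; plenty of detail here too.
3. In PHP scripts, require as few files as possible. PHP compiles the script on every request, and each require means loading another file. At the script level it depends on the programmer's skill.

There is much more.
Author: fcicq  Time: 2007-6-19 09:30 PM

Haven't been around for a while; it is rare to find a thread worth replying to.

1. Reduce pressure on the database
  Cache query results / use in-memory tables

Split the database up if you can, to shrink each database.
Eliminate queries that take over 0.5 s. Very important!
Give indexes more memory.

2. Reduce pressure on Apache: fewer HTTP requests
  Combine all background images into one and position them with CSS / do not use AJAX for instant validation (ignoring user experience; stretching out the user's time relieves the server)
Background images? Not necessary.
Do not serve static content with Apache!

3. Reduce I/O pressure
  Partial page caching
Author: ￥  Time: 2007-6-20 11:59 AM

lighttpd and apache can work together: lighttpd serves static files such as images, js, and css, apache handles php, and rewrite rules forward requests to lighttpd.
Some studies even show that lighttpd running php in fastcgi mode is faster than apache.
lighttpd outperforms apache, but it is somewhat less stable.
________________________________________
Author: sigmazel  Time: 2007-6-20 12:49 PM

Web side:
1. Resource files referenced by scripts (css, js, images) can be spread across several servers and compressed as much as possible.
2. Use ajax where appropriate.
3. Keep the amount of php code down; if convenient, write it as COM or .so extensions.
4. Caching.
________________________________________
Author: php5  Time: 2007-6-20 06:55 PM

If hardware cost matters, broadly start from the following:

1. Make pages static as far as possible
2. Configure the server so dynamic requests go to apache and static ones to Lighttpd
3. Use the best OS, such as FreeBSD
4. Focus on MySQL performance, starting from compilation and configuration
5. Most basic of all, keep program performance and SQL queries under control
6. Use caching and reverse proxies
7. Optimize the pages themselves to save bandwidth
Author: 奶瓶  Time: 2007-6-21 10:52 AM

Serving static files with apache is very costly, and lighttpd, nginx and the like are not that much cheaper; special static servers that support a "file to network card" mode may be more economical. The number of files php includes can be controlled quite precisely; tmpfs-style methods are worth trying; do not put blind faith in memcached; a moderate amount of local cache works well.
Author: fengchen9127  Time: 2007-6-21 11:13 AM

Optimize database access.
  Making the front end fully static is best, since no database access is needed, but for frequently updated sites static pages often cannot provide some features.
  Caching is another solution: store dynamic data in cache files that dynamic pages read directly instead of querying the database. WordPress and Z-Blog both use this heavily. I also wrote a Z-Blog counter plugin on the same principle.
  If database access truly cannot be avoided, try optimizing the SQL. Avoid statements like SELECT * FROM; return only what each query needs, and avoid bursts of many queries in a short time.
Block external hotlinking.
  Hotlinked images or files from other sites often bring heavy load, so strictly limit it. Fortunately hotlinking can currently be controlled simply through the Referer header: Apache can block it through configuration, and IIS has third-party ISAPI filters for the same purpose. A forged Referer can still get through, but deliberate forgery is not yet common, so it can be ignored for now or handled by non-technical means such as watermarking images.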
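The Referer-based check described above amounts to comparing the host in the Referer header with the site's own hosts. Below is a minimal Python sketch; the host names are placeholders, and treating a missing Referer as allowed is an assumption of this example rather than something the post specifies.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}  # hypothetical site hosts


def allow_image_request(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True if an image request should be served."""
    if not referer:
        # Direct visits and privacy tools send no Referer; allowing them
        # is a policy choice made for this sketch.
        return True
    host = urlparse(referer).hostname
    return host in allowed_hosts
```

As the post notes, the header can be forged, so this only filters casual hotlinking.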
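Control large file downloads.
  Large downloads consume a lot of bandwidth, and on non-SCSI disks heavy downloading also eats CPU, reducing the site's responsiveness. So avoid offering downloads over 2 MB; if you must, put large files on a separate server.
Use different hosts to divert the main traffic.
  Put files on different hosts and offer different mirrors for download. For example, if RSS files use a lot of bandwidth, use a service such as FeedBurner or FeedSky to serve the RSS output from other hosts; the traffic then falls mostly on FeedBurner's hosts and RSS stops consuming many resources.
Use traffic analysis software.
  Installing traffic analysis software on the site shows immediately where bandwidth is going and which pages need more optimization; solving traffic problems requires accurate statistics.
Building High-Load Sites with Load Balancing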
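Reposted from: IT.COM.CN | 2005-11-04 | Author: | Views: 57
  The rapid growth of the Internet means multimedia network servers, and web servers in particular, face rapidly growing numbers of visitors and must be able to serve large numbers of concurrent requests. Yahoo, for example, receives millions of requests a day, so for servers providing heavily loaded web services, CPU and I/O capacity quickly become bottlenecks.

  Simply upgrading hardware does not really solve the problem, because the performance of a single server is always limited. Generally a PC server can handle about 1000 concurrent accesses and a higher-end dedicated server 3000-5000, which still cannot meet the needs of a heavily loaded site. Network requests are also bursty: when a major event happens, traffic rises sharply and creates a bottleneck; the online release of the Clinton impeachment report is an obvious example. Multiple servers must provide the service, sharing the requests among them, to handle large numbers of concurrent requests.

  When several servers share the load, the simplest approach is to use different servers for different things. Splitting by content, one server can serve news pages and another game pages; splitting by function, one can serve static pages and others serve CGI and other resource-hungry dynamic pages. But because traffic is bursty, it is hard to tell which pages cause too much load, and splitting pages too finely wastes a lot. The pages causing overload also keep changing; frequently moving pages between servers to follow the load would make management and maintenance very hard. So this kind of split can only be a coarse adjustment. For heavily loaded sites the fundamental solution is load balancing.

  In load balancing the servers are symmetric: each has equal status and can serve requests alone without help from the others. Some load-sharing technique then distributes incoming requests evenly to one of the servers, and the server that receives a request answers the client independently. Since building web servers with identical content is not complicated (synchronized updates or shared storage will do), load balancing becomes the key technique for building a heavily loaded web site.

  Load balancing based on specific server software

  Many network protocols support redirection. HTTP, for example, has the Location header: a browser that receives it automatically goes to the other URL it names. Because sending a Location header costs the web server far less than serving the request, a load-balancing server can be designed around it. Whenever a web server considers its load too high, it stops returning the requested page and sends a Location header instead, letting the browser fetch the page from another server in the cluster.

  The server itself must support this, and implementation is hard in practice. How can a server guarantee that the server it redirects to is relatively idle and will not send another Location header? Neither the Location header nor the browser helps with this, so the browser can easily end up in an endless loop. The method is therefore rare in practice, and little cluster software uses it. In some cases CGI (including FastCGI or mod_perl to improve performance) can emulate it to share load while the web server stays simple and efficient; avoiding Location loops then becomes the job of the user's CGI program.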
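One way a CGI-level program might avoid the Location loop is to carry a hop counter in the redirect URL and refuse to redirect a request that has already been bounced. The Python sketch below illustrates that idea only; the `lbhop` parameter, the load numbers, and the peer host names are all hypothetical and do not come from the article.

```python
import zlib
from urllib.parse import urlencode

MAX_HOPS = 1  # assumed limit: a request is redirected at most once


def dispatch(path, hops, local_load, load_limit, peers):
    """Decide whether to serve locally (200) or redirect to a peer (302).

    hops is the number of redirects this request has already followed,
    read from a hypothetical "lbhop" query parameter.
    """
    if local_load < load_limit or hops >= MAX_HOPS or not peers:
        return 200, None  # serve here; never bounce a request twice
    # Pick a peer deterministically from the path (crc32 is stable across runs).
    target = peers[zlib.crc32(path.encode("utf-8")) % len(peers)]
    location = "http://%s%s?%s" % (target, path, urlencode({"lbhop": hops + 1}))
    return 302, location
```

Because a redirected request arrives with `lbhop=1`, the second server serves it even if it is busy, which bounds the number of redirects at the cost of occasionally overloading that server.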

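  DNS-based load balancing

  Because load balancing inside the server software requires modifying that software, it is often not worth the cost. Load balancing is best done outside the server software, so that the strengths of existing server software can be kept. The earliest load balancing was done through DNS: a DNS server can map one name to several different addresses, and a client resolving the name ends up using one of them. Different clients thus get different addresses for the same name and visit web servers at different addresses, which balances the load.

  For example, to answer HTTP requests for www.exampleorg.org.cn with three web servers, the zone data on that domain's DNS server can include records like the following (several A records for the same name; a name cannot carry more than one CNAME):

  www1 IN A 192.168.1.1
  www2 IN A 192.168.1.2
  www3 IN A 192.168.1.3
  www IN A 192.168.1.1
  www IN A 192.168.1.2
  www IN A 192.168.1.3

  Outside clients then receive the addresses for www in varying order, so their subsequent HTTP requests go to different addresses.

  DNS load balancing is simple and easy, and the servers can sit anywhere on the Internet; it is in use at sites including Yahoo. It also has drawbacks. First, to keep DNS data fresh the refresh time usually has to be short, but too short a value creates a lot of extra network traffic, and changes to DNS data still do not take effect immediately. Second, DNS load balancing knows nothing about differences between servers: it cannot send more requests to faster servers or learn a server's current state, and requests may even happen to pile up on one server.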
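The rotation behind DNS round robin can be simulated in a few lines of Python. This is a toy model of a name server that rotates its answer list on every query, assuming each client uses the first address returned; real resolvers and caches complicate the picture, as the drawbacks above point out.

```python
from collections import Counter, deque


def make_resolver(addresses):
    """Return a resolve() function that rotates its answers on every call."""
    pool = deque(addresses)

    def resolve():
        answer = list(pool)
        pool.rotate(-1)  # the next query sees the list shifted by one
        return answer

    return resolve


resolve = make_resolver(["192.168.1.1", "192.168.1.2", "192.168.1.3"])
# Nine clients each take the first address from their answer.
first_choices = Counter(resolve()[0] for _ in range(9))
```

In this idealized model each of the three servers is chosen three times; nothing in the mechanism reacts to a server being slow or down.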

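  Reverse-proxy load balancing

  A proxy server can forward requests to internal web servers, and this accelerator mode clearly speeds up static pages. The same technique can have the proxy forward requests evenly to one of several internal web servers, achieving load balancing. It differs from ordinary proxying: a standard proxy lets clients reach many external web servers, whereas here many external clients use the proxy to reach internal web servers, hence the name reverse proxy.

  Implementing reverse proxying is not especially complicated, but load balancing demands very high efficiency, which makes it less simple. For every proxied request the proxy must open two connections, one outward and one inward, so with a very large number of connection requests the proxy's own load becomes very heavy, and in the end the reverse proxy becomes the bottleneck. For example, when Apache's mod_proxy module is used for load balancing, the number of concurrent connections is limited by Apache's own limit. In general it suits sites where the number of connections is not especially large but each one consumes a lot of processing, such as search.

  The benefit of a reverse proxy is that load balancing can be combined with the proxy's cache for better performance, and it adds security, because outside clients cannot reach the real servers directly. It also allows good balancing policies that spread load very evenly across the internal servers, without load accidentally concentrating on one.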
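Because a reverse proxy sees every connection it forwards, it can apply a policy such as "least connections", which DNS cannot. The Python sketch below shows only the bookkeeping for that one policy; the class name and backend addresses are illustrative, and the article does not say which policy any particular proxy uses.

```python
class LeastConnectionsBalancer:
    """Pick the backend with the fewest requests currently in flight."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        # min() returns the first backend among ties, in insertion order.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the proxied request has finished.
        self.active[backend] -= 1


balancer = LeastConnectionsBalancer(["10.0.0.1:81", "10.0.0.2:81"])
first = balancer.acquire()    # both idle, so the first backend is chosen
second = balancer.acquire()   # the first is busy, so the second is chosen
balancer.release(second)      # the second finishes its request
third = balancer.acquire()    # the second is the least loaded again
```

A slow request keeps its backend's counter high, so new requests drift toward the other servers automatically.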

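  NAT-based load balancing

  Network address translation (NAT) converts between internal and external addresses so that machines with internal addresses can reach the outside network; when an outside machine accesses an external address owned by the NAT gateway, the gateway forwards it to a mapped internal address. If the gateway maps each connection evenly to a different internal server address, each outside machine then talks to the server at the address it was mapped to, and the load is shared.

  Address translation can be done in software or in hardware. Hardware operation is usually called switching, and when the switch must keep TCP connection state, this operation at the OSI transport layer is called layer-4 switching. Load-balancing NAT is an important function of layer-4 switches. Built on custom hardware chips, they perform very well: many switches claim 400MB-800MB of layer-4 switching capacity, though some reports indicate that at such speeds most switches no longer do layer-4 switching and support only layer 3 or even layer 2.

  For most sites, however, load balancing today addresses the web servers' processing bottleneck rather than network throughput. Many sites have no more than 10MB of total Internet bandwidth, and only a few have faster links, so expensive load-balancer hardware is generally unnecessary.

  Software NAT load balancing is far more practical. Besides vendor solutions, a more effective route is free software, including the NAT mode of the Linux Virtual Server Project, or this article's author's modified version of natd on FreeBSD. In general, with software address translation the central load balancer is limited by bandwidth: on 100MB Fast Ethernet it can reach up to 80MB, though in practice only 40MB-60MB may be usable.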

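  Extended load-balancing techniques

  With NAT load sharing, every network connection must pass through the central load balancer. If the load is so large that the back end is no longer a few or a dozen servers but hundreds or more, even a high-performance hardware switch will hit a bottleneck. The question then becomes how to distribute that many servers across multiple Internet locations to spread the network load. Combining DNS and NAT can do this, but a better way is a semi-centralized approach.

  In the semi-centralized approach, when a client request reaches the load balancer, the central balancer encapsulates it and sends it to a server, and the server's response no longer returns through the central balancer but goes directly to the client. The central balancer only receives and forwards requests, so its network load is much smaller.

  The figure above, from the Linux Virtual Server Project, shows the request/response flow of their IP-tunneling implementation of this kind of load sharing. Each back-end server needs special address translation so that the browser client accepts its response as the correct one.

  Hardware implementations of this approach are likewise very expensive, but offer vendor-specific features such as SSL support.

  Because this approach is complex, it is hard to implement and its starting point is high; at present websites do not need this much capacity.

  Comparing the approaches above, DNS is the easiest and most common and meets ordinary needs. For more management and control, choose reverse proxy or NAT; the choice between them depends mainly on whether caching matters and what the maximum concurrency is. If the CGI programs that weigh heavily on load were developed by the site itself, the program can also use Location to support load balancing. Semi-centralized load sharing is not yet needed, at least in China's current situation.

Architecture Design Issues for Large Websites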

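An article on CSDN ( http://blog.csdn.net/fww80/archive/2006/04/28/695293.aspx ) discusses the system architecture of large, high-concurrency, high-load websites. The author makes several suggestions:
1. Static HTML, which a CMS can produce automatically;
2. Separate image servers (similarly, video sites should split out their video files);
3. Database clustering and database/table hashing, which Oracle, MySQL and other DBMSs support well;
4. Caching, for example Squid in front of Apache, or the caching modules of the programming language;
5. Site mirroring;
6. Load balancing.
The author calls load balancing "the ultimate solution large websites adopt for high load and large numbers of concurrent requests" and proposes that "a typical load-balancing strategy is to build a squid cluster on top of software or hardware layer-4 switching." In practice one can build an application server cluster and a web server cluster; the application server cluster can use Apache+Tomcat clustering, WebLogic clustering and so on, and the web server cluster can use reverse proxying, NAT, or multiple DNS records.

For performance, static resources should not sit with the application server, and the database server should also be as independent as possible. In a typical MVC design, who performs the data logic has a crucial impact on performance. Take Java EE: object-oriented design stresses abstraction, reuse, maintainability, keeping lower-layer changes from spreading upward, and portability, so over-abstraction is common, for example adding a DAO layer on top of Hibernate. At the same time, the DBMS's own strong features for efficient data processing (stored procedures, triggers) get neglected. Admittedly, if a customer demands migrating from Oracle to MySQL, the fewer DBMS-specific features, the easier the move. But in practice such migration requests are very rare, so architecture design need not sacrifice much performance and stability for this potential need. I also do not recommend a distributed database management structure: its overhead is too high and data maintenance is a headache. Use centralized data management wherever possible.

In business systems the algorithms themselves are not complex, so the quality of the code is not fatal for performance. What matters more is the software architecture itself. In traditional object models such as CORBA, J2EE, and DCOM we see the experts' theoretical fondness for distributed objects, but practice has shown that the harm of distributing objects far outweighs the benefit. This is an important reason lightweight frameworks are popular now. If something simple works, do not use something complex: must a task that Python or RoR can do be done in Java? I think not. Users do not care what advanced technology we use, only whether the product meets their needs. And tools like Python and RoR are already strong enough for most web applications; with caching systems and other supporting techniques they are fully capable of high-load, high-concurrency sites.

Static HTML suits pages that change relatively rarely, such as news, community announcements, or product category information like Taobao's. If the data changes frequently, it is of little use.

Site mirroring is a traditional technique. A more advanced form is the CDN (Content Delivery Network), which comes from streaming media; the CDN concept extends from streaming data to static resources such as images and video files. In e-commerce, though, it is rarely used.

Thoughts on High-Concurrency Clusters on Open-Source Platforms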
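The high-concurrency applications I have met so far need high performance mainly in two areas:
1. Network
2. Database
The approach is actually the same for both:
1. Push a single machine close to its performance limit through tuning
2. When one machine cannot cope (data transfer bottleneck: disk reads/writes or network packets sent and received per unit time; CPU bottleneck), share the load across several machines, which is what load balancing means
Single-machine network handling
1. Change the low-level packet I/O model (from select to epoll / kevent)
2. Change the application model
2.1 How application-layer packets are constructed
2.2 How the application protocol is implemented
2.3 How packets are buffered
2.4 Single-threaded to multi-threaded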
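The move from select to epoll / kevent mentioned in the list above can be tried from Python through the standard selectors module, whose DefaultSelector chooses the most efficient mechanism available on the platform. The sketch below uses a local socket pair in place of real network clients, which is an assumption made so that it runs on its own.

```python
import selectors
import socket

# DefaultSelector chooses epoll on Linux and kqueue on BSD/macOS when available.
sel = selectors.DefaultSelector()

# A connected local socket pair stands in for a client and a server socket.
client, server = socket.socketpair()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

client.sendall(b"ping")

received = []
# select() returns only the sockets that are ready, so one thread can
# watch many connections without polling each of them in turn.
for key, _events in sel.select(timeout=1):
    received.append(key.fileobj.recv(1024))

sel.unregister(server)
sel.close()
client.close()
server.close()
```

The same loop structure scales to thousands of registered sockets, which is where epoll and kevent outperform a plain select call.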
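Approaches to network load balancing
1. Proxy mode: the proxy server only receives and sends packets; it passes what it receives to the application servers behind it (which may in turn have database servers behind them) and returns the results to the requester
2. Virtual proxy IP: if the proxy's load is still too high, add more proxy servers to forward packets. They can share one virtual IP or each have their own
3. p2p: some broadcast data can go peer-to-peer to relieve the server's network load
Single-machine database (MySQL) handling
1. Optimize the database structure (split tables, split records, to keep each table's row count within a set range)
2. Optimize SQL statements
3. master + slave mode
Database cluster handling
1. master + slave mode (handles concurrent queries effectively)
2. MySQL Cluster mode (handles concurrent data changes effectively)
References:
http://dev.mysql.com/doc/refman/5.0/en/ndbcluster.html
A First Look at Large, High-Load Website Architecture and Applications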
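Time: 30-45 minutes
Opening question: sites such as 163, sina, and sohu have many applications written in PHP. How do they build applications that keep thousands or even tens of thousands of users online at once?
• Choose a faster web server
  o The performance limit of a single Apache web server
  o Faster web servers: TUX, lighttpd, thttpd ...
  o Separate dynamic and static files; mix servers
• Application optimization; using and sharing caches
  o Common caching techniques
    - Generating static files
    - Object persistence: serialize & unserialize
  o Need for Speed: cache in the fastest place
    - /dev/shm on Linux
    - tmpfs/ramdisk
    - PHP's built-in shared memory functions / IPC
    - memcached
    - MySQL HEAP tables
  o Sharing a cache across hosts
    - NFS, memcached, MySQL: pros and cons compared
• MySQL optimization
  o Configure my.cnf with larger cache sizes
  o Use phpMyAdmin to find configuration bottlenecks and squeeze every drop out of the machine
  o Clustering (hot replication, MySQL Cluster)
• Clustering to improve availability
  o The simplest cluster: multiple A records, DNS round robin; availability problems
  o Mature cluster solutions that ensure high availability and scalability
    - In hardware, e.g. routers, F5 Networks
    - In software or in the operating system
    - Kernel-based: LVS, which balances load by rewriting TCP/IP packets and provides scalability, plus the ldirectord daemon for availability
    - Layer 7: HAProxy, dispatching by URL
  o Data sharing
    - NFS, Samba, NAS, SAN
  o Case studies
• North-south interconnection: speed between China Telecom and China Netcom
  o Dual-line servers
  o CDN
    - Smart DNS that routes users by IP to the nearest server, dnspod ...
    - Squid reverse proxy (pros, cons)
  o Case studies
http://blog.yening.cn/2007/03/25/226.html#more-226
On the System Architecture of Large, High-Concurrency, High-Load Websites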
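By Michael
When reposting please keep the source: 俊麟 Michael's blog ( http://www.toplee.com/blog/?p=71)
Trackback Url : http://www.toplee.com/blog/wp-trackback.php?p=71
  I built a dial-up access platform at CERNET, then worked on search engine front-end development at Yahoo&3721, and handled the architecture upgrade of the large community 猫扑大杂烩 (Mop Dazahui) at MOP. I have also worked on and developed modules for many large and medium websites, so I have some accumulated experience with handling high load and concurrency on large sites to discuss with everyone.

  A small site such as a personal site can be built with plain static HTML pages plus a few images for looks, all in one directory; such a site asks little of system architecture or performance. As Internet services have grown richer, web technology has been subdivided into many specialties. Large sites in particular draw on a very wide range, with high demands across hardware, software, programming languages, databases, web servers, firewalls, and more; this is far beyond the simple static HTML site.
  Large sites, such as portals, facing many users and highly concurrent requests, rely on basic solutions in these areas: high-performance servers, high-performance databases, efficient programming languages, and high-performance web containers. But these alone cannot fundamentally solve the high load and high concurrency that large sites face.
  The approaches above also imply greater spending, hit bottlenecks, and do not scale well. Below I share some experience from the angle of low cost, high performance, and high scalability.
1. Static HTML
  As everyone knows, the most efficient, least costly page is a purely static HTML page, so we use static pages wherever possible; the simplest method is also the most effective. But for sites with lots of frequently updated content we cannot produce every page by hand, hence the familiar content management system (CMS). The news channels of the portals we visit, and even their other channels, are managed through a CMS, which at its simplest turns entered content into static pages automatically and can also offer channel management, permission management, automatic crawling, and so on. For a large site, an efficient, manageable CMS is indispensable.
  Beyond portals and publishing sites, for highly interactive community sites, making as much static as possible is also necessary for performance. Rendering posts and articles to static pages in real time and re-rendering them on update is a widely used strategy; Mop's Dazahui uses it, as does the NetEase community. Many blogs are now static too. The blog software I use, WordPress, is not, so www.toplee.com would certainly not withstand heavy load.
  Static HTML is also a tool in some caching strategies. For parts of a system that query the database often but change little, consider static HTML. For instance, a forum's common settings are managed in the admin panel and stored in the database in today's mainstream forums; the front end reads them constantly but they rarely change. Rendering them to static form whenever the admin updates them avoids a large number of database requests.
  A compromise is also possible: keep the front end dynamic, and under some policy periodically render to static and periodically decide which version to serve. This allows a lot of flexibility. My billiards site 故人居 ( www.8zone.cn) does this: I set intervals for static rendering to cache the dynamic content, shifting most of the load to static pages. It suits small and medium sites. The site's address is http://www.8zone.cn; by the way, friends who like billiards are welcome to support this free site :)
2. Separate image servers
  For a web server, whether Apache, IIS, or another container, images consume the most resources, so it is worth separating images from pages. Basically all large sites do this, with dedicated image servers, often many of them. It lowers the load on the servers handling page requests and ensures the system does not go down because of images.
  Application servers and image servers can be tuned differently; for example, configure Apache with as few ContentTypes and as few LoadModule lines as possible, for lower resource use and higher efficiency.
  My billiards site 8zone.cn also separates image serving, so far only architecturally, not physically, since I have no money for more servers :). You can see that image links on the site use URLs like img.9tmd.com or img1.9tmd.com.
  Also, for serving static pages, images, js and the like, consider lighttpd instead of Apache; it is lighter and more efficient.
3. Database clustering and database/table hashing
  Large sites have complex applications that must use databases, and under heavy traffic the database bottleneck shows quickly; one database server soon cannot keep up, so we need database clustering or database/table hashing.
  For clustering, many databases have their own solutions. Oracle, Sybase and others have good ones, and the Master/Slave replication in the commonly used MySQL is similar. Whatever DB you use, follow its solution.
  Because such clustering is constrained in architecture, cost, and scalability by the DB in use, we need to improve the architecture from the application side, and database/table hashing is the common and most effective solution. In the application we separate databases by business, application, or functional module, with different modules using different databases or tables, then hash a given page or function into smaller databases under some policy; the user table, for example, is hashed by user ID. This raises performance cheaply and scales well. sohu's forum uses this architecture: users, settings, posts and so on are split into separate databases, then posts and users are hashed across databases and tables by board and ID. In the end a simple change in a configuration file lets the system add a low-cost database server at any time to boost capacity.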
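Hashing a user table by user ID can be sketched as a pure function from ID to database and table. The Python example below assumes a simple modulo scheme with hypothetical database names and table counts; the article does not describe sohu's actual mapping.

```python
SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical database names
TABLES_PER_SHARD = 16                   # assumed number of tables per database


def locate_user(user_id):
    """Map a numeric user ID to the (database, table) that stores it."""
    database = SHARDS[user_id % len(SHARDS)]
    table = "user_%02d" % ((user_id // len(SHARDS)) % TABLES_PER_SHARD)
    return database, table
```

For example, `locate_user(10)` selects `db2` and table `user_02`. With plain modulo, changing the number of databases remaps most users, so a setup that adds servers through configuration, as described above, would likely need a lookup table or a migration step on top of this.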
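4. Caching
  Everyone in technology has met caching; it is used in many places, and it matters a great deal in site architecture and development. Here I cover the two most basic kinds; advanced and distributed caching come later.
  Architecture-level caching: people familiar with Apache know that it provides its own caching through mod_proxy, and an external Squid can also be used; both effectively improve Apache's responsiveness.
  Application-level caching: Memcached, available on Linux, is a common choice. Many web languages have memcache client interfaces (php, perl, c, and java all do), so data and objects can be cached in real time or by cron with very flexible policies. Some large communities use this architecture.
  In addition, most web languages have their own caching modules and methods. PHP has Pear's Cache module and the eAccelerator accelerator and cache module, plus the well-known APC and XCache (developed by Chinese programmers, worth supporting!) cache modules. Java has even more; I am not familiar with .NET, but it surely has some too.
5. Mirroring
  Mirroring is a common way for large sites to improve performance and data safety. It addresses speed differences between network providers and regions; the gap between ChinaNet and EduNet, for example, led many sites to set up mirrors inside the education network, with data updated on a schedule or in real time. I will not go deep into mirroring details here; there are many professional off-the-shelf architectures and products, and also cheap software approaches such as rsync on Linux.
6. Load balancing
  Load balancing is the ultimate solution large websites adopt for high load and large numbers of concurrent requests.
  Load balancing has developed over many years, with many professional vendors and products to choose from. I have worked with some solutions, two of which can serve as references. I will not say more about entry-level DNS round robin or the more professional CDN architecture.
6.1 Hardware layer-4 switching
  Layer-4 switching uses the header information of layer-3 and layer-4 packets to identify traffic flows by application range and assigns the flows of a whole range to suitable application servers. A layer-4 switch acts like a virtual IP pointing at physical servers. The traffic it carries follows many protocols: HTTP, FTP, NFS, Telnet, and others, and on top of the physical servers these need sophisticated load-balancing algorithms. In the IP world the service type is determined by the destination TCP or UDP port; in layer-4 switching the application range is determined jointly by the source and destination IP addresses and the TCP and UDP ports.
  Well-known hardware layer-4 products include Alteon and F5. They are expensive but worth it, offering excellent performance and flexible management. Yahoo China at the time ran close to 2000 servers behind just three or four Alteons.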
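How a layer-4 device might map a flow to a server from the source and destination addresses and ports can be shown with a hash over that 4-tuple. This Python sketch is only an illustration of the idea; real switches use their own algorithms and connection tables, and the addresses below are made up.

```python
import zlib


def pick_server(src_ip, src_port, dst_ip, dst_port, servers):
    """Choose a backend from the flow's 4-tuple.

    Every packet of one TCP connection shares the same 4-tuple, so the
    whole connection consistently lands on the same server.
    """
    flow = "%s:%d>%s:%d" % (src_ip, src_port, dst_ip, dst_port)
    return servers[zlib.crc32(flow.encode("ascii")) % len(servers)]


servers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
chosen = pick_server("203.0.113.7", 40123, "198.51.100.1", 80, servers)
```

Hashing on the source IP alone instead of the full 4-tuple would keep all of one client's connections on the same server, at the cost of a less even spread.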
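6.2 Software layer-4 switching
  Once the principle of hardware layer-4 switches is understood, software layer-4 switching based on the OSI model follows naturally. It works on the same principle with somewhat lower performance, but handles a fair amount of load comfortably; some say the software approach is actually more flexible, with capacity depending on how well you know the configuration.
  For software layer-4 switching we can use LVS (Linux Virtual Server), which is common on Linux. It offers heartbeat-based real-time failover, improving robustness, and flexible virtual IP (VIP) configuration and management that can serve several applications at once, which is essential for distributed systems.
  A typical load-balancing strategy is to build a squid cluster on top of software or hardware layer-4 switching. Many large sites, including search engines, use this approach; it is low-cost, high-performance, and highly scalable, and nodes are easy to add or remove at any time. I plan to write this architecture up in detail when I have time.
Summary:
  For large sites, every method above may be in use at once. My introduction here is fairly shallow; many details of implementation take time to learn and absorb. Sometimes one small squid or apache parameter has a big effect on system performance. I hope this starts a discussion.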
58 Responses to “说说大型高并发高负载网站的系统架构”
1
pi1ot says:
April 29th, 2006 at 1:00 pm
Quote
各模块间或者进程间的通信普遍异步化队列化也相当重要，可以兼顾轻载重载时的响应性能和系统压力,数据库压力可以通过file cache分解到文件系统，文件系统io压力再通过mem cache分解，效果很不错.
2
Exception says:
April 30th, 2006 at 4:40 pm
Quote
写得好！现在，网上像这样的文章不多，看完受益匪浅
3
guest says:
May 1st, 2006 at 8:13 am
Quote
完全胡说八道!
“大家知道，对于Web服务器来说，不管是Apache、IIS还是其他容器，图片是最消耗资源的”,你以为是在内存中动态生成图片啊.无论是什么文件,在容器输出时只是读文件,输出给response而已,和是什么文件有什么关系.
关键是静态文件和动态页面之间应该采用不同策略,如静态文件应该尽量缓存,因为无论你请求多少次输出内容都是相同的,如果用户页面中有二十个就没有必要请求二十次,而应该使用缓存.而动态页面每次请求输出都不相同(否则就应该是静态的),所以不应该缓存.
所以即使在同一服务器上也可以对静态和动态资源做不同优化,专门的图片服务器那是为了资源管理的方便,和你说的性能没有关系.
4
Michael says:
May 2nd, 2006 at 1:15 am
Quote
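The poster above has probably never run into dynamic caching. When we handled Inktomi search results, everything we cached was dynamic content; for identical keywords and query conditions such a cache is very important. For dynamic content, setting sensible headers in code makes it easy to manage the caching policy, such as the expiry time.
When we say images hurt performance, the usual reason is that on most pages the images account for more traffic than the HTML. At the same bandwidth images take longer to transfer, and since a large share of the transfer cost goes into setting up connections, this lengthens the HTTP connection between client and server, which certainly lowers Apache's concurrency. The exception is when everything you return is static: then you can set KeepAlive to Off in httpd.conf to cut connection handling time, but with many images the number of connections rises and that costs performance as well.
Also, the theory here is aimed mainly at large clusters. In that setting, separating images effectively improves the architecture and in turn the performance. Why do we talk about architecture at all? It may be for security, for resource allocation, or for better-organized development and management, but the ultimate goal is always performance.
The HTTP specification in RFC 1945 describes MIME types and Content-Length, which makes the effect of images on performance easy to understand.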
楼上的朋友完全是小人作为，希望别用guest跟我忽悠，男人还害怕别人知道你叫啥？再说了，就算说错了也不至于用胡说八道来找茬！大家重在交流和学习，我也不是什么高人，顶多算个普通程序员而已。
5
Ken Kwei says:
June 3rd, 2006 at 3:42 pm
Quote
Michael 您好，这篇文章我看几次了，有一个问题，您的文章中提到了如下一段：
“对于交互性要求很高的社区类型网站来说，尽可能的静态化也是提高性能的必要手段，将社区内的帖子、文章进行实时的静态化，有更新的时候再重新静态化也是大量使用的策略，像Mop的大杂烩就是使用了这样的策略，网易社区等也是如此。”
对于大型的站点来说，他的数据库和 Web Server 一般都是分布式的，在多个区域都有部署，当某个地区的用户访问时会对应到一个节点上，如果是对社区内的帖子实时静态化，有更新时再重新静态化，那么在节点之间如何立刻同步呢？数据库端如何实现呢？如果用户看不到的话会以为发帖失败？造成重复发了，那么如何将用户锁定在一个节点上呢，这些怎么解决？谢谢。
6
Michael says:
June 3rd, 2006 at 3:57 pm
Quote
对于将一个用户锁定在某个节点上是通过四层交换来实现的，一般情况下是这样，如果应用比较小的可以通过程序代码来实现。大型的应用一般通过类似LVS和硬件四层交换来管理用户连接，可以制定策略来使用户的连接在生命期内保持在某个节点上。
静态化和同步的策略比较多，一般采用的方法是集中或者分布存储，但是静态化却是通过集中存储来实现的，然后使用前端的proxy群来实现缓存和分担压力。
7
javaliker says:
June 10th, 2006 at 6:38 pm
Quote
希望有空跟你学习请教网站负载问题。
8
barrycmster says:
June 19th, 2006 at 4:14 pm
Quote
Great website! Bookmarked! I am impressed at your work!
9
heiyeluren says:
June 21st, 2006 at 10:39 am
Quote
一般对于一个中型网站来说，交互操作非常多，日PV百万左右，如何做合理的负载？
10
Michael says:
June 23rd, 2006 at 3:15 pm
Quote
heiyeluren on June 21, 2006 at 10:39 am said:
一般对于一个中型网站来说，交互操作非常多，日PV百万左右，如何做合理的负载？
交互如果非常多，可以考虑使用集群加Memory Cache的方式，把不断变化而且需要同步的数据放入Memory Cache里面进行读取，具体的方案还得需要结合具体的情况来分析。
11
donald says:
June 27th, 2006 at 5:39 pm
Quote
请问，如果一个网站处于技术发展期，那么这些优化手段应该先实施哪些后实施哪些呢？
或者说从成本（技术、人力和财力成本）方面，哪些先实施能够取得最大效果呢？
12
Michael says:
June 27th, 2006 at 9:16 pm
Quote
donald on June 27, 2006 at 5:39 pm said:
请问，如果一个网站处于技术发展期，那么这些优化手段应该先实施哪些后实施哪些呢？
或者说从成本（技术、人力和财力成本）方面，哪些先实施能够取得最大效果呢？
先从服务器性能优化、代码性能优化方面入手，包括webserver、dbserver的优化配置、html静态化等容易入手的开始，这些环节争取先榨取到最大化的利用率，然后再考虑从架构上增加投入，比如集群、负载均衡等方面，这些都需要在有一定的发展积累之后再做考虑比较恰当。
13
donald says:
June 30th, 2006 at 4:39 pm
Quote
恩，多谢Michael的耐心讲解
14
Ade says:
July 6th, 2006 at 11:58 am
Quote
写得好,为人也不错.
15
ssbornik says:
July 17th, 2006 at 2:39 pm
Quote
Very good site. Thanks for author!
16
echonow says:
September 1st, 2006 at 2:28 pm
Quote
赞一个先，是一篇很不错的文章，不过要真正掌握里面的东西恐怕还是需要时间和实践！
先问一下关于图片服务器的问题了！
我的台球网站故人居9tmd.com也使用了图片服务器架构上的分离，目前是仅仅是架构上分离，物理上没有分离，由于没有钱买更多的服务器:)，大家可以看到故人居上的图片连接都是类似img.9tmd.com或者img1.9tmd.com的URL。
这个，楼主这个img.9tmd.com是虚拟主机吧，也就是说是一个apache提供的服务吧，这样的话对于性能的提高也很有意义吗？还是只是铺垫，为了方便以后的物理分离呢？
17
Michael says:
September 1st, 2006 at 3:05 pm
Quote
echonow on September 1, 2006 at 2:28 pm said:
赞一个先，是一篇很不错的文章，不过要真正掌握里面的东西恐怕还是需要时间和实践！
先问一下关于图片服务器的问题了！
我的台球网站故人居9tmd.com也使用了图片服务器架构上的分离，目前是仅仅是架构上分离，物理上没有分离，由于没有钱买更多的服务器:)，大家可以看到故人居上的图片连接都是类似img.9tmd.com或者img1.9tmd.com的URL。
这个，楼主这个img.9tmd.com是虚拟主机吧，也就是说是一个apache提供的服务吧，这样的话对于性能的提高也很有意义吗？还是只是铺垫，为了方便以后的物理分离呢？
这位朋友说得很对，因为目前只有一台服务器，所以从物理上无法实现真正的分离，暂时使用虚拟主机来实现，是为了程序设计和网站架构上的灵活，如果有了一台新的服务器，我只需要把图片镜像过去或者同步过去，然后把img.9tmd.com的dns解析到新的服务器上就自然实现了分离，如果现在不从架构和程序上实现，今后这样的分离就会比较痛苦:)
18
echonow says:
September 7th, 2006 at 4:59 pm
Quote
谢谢lz的回复，现在主要实现问题是如何能在素材上传时直接传到图片服务器上呢，总不至于每次先传到web，然后再同步到图片服务器吧
19
Michael says:
September 7th, 2006 at 11:25 pm
Quote
echonow on September 7, 2006 at 4:59 pm said:
谢谢lz的回复，现在主要实现问题是如何能在素材上传时直接传到图片服务器上呢，总不至于每次先传到web，然后再同步到图片服务器吧
通过samba或者nfs实现是比较简单的方法。然后使用squid缓存来降低访问的负载，提高磁盘性能和延长磁盘使用寿命。
20
echonow says:
September 8th, 2006 at 9:42 am
Quote
多谢楼主的耐心指导，我先研究下，用共享区来存储确实是个不错的想法!
21
Michael says:
September 8th, 2006 at 11:16 am
Quote
echonow on September 8, 2006 at 9:42 am said:
多谢楼主的耐心指导，我先研究下，用共享区来存储确实是个不错的想法!
不客气，欢迎常交流！
22
fanstone says:
September 11th, 2006 at 2:26 pm
Quote
Michael，谢谢你的好文章。仔细看了，包括回复，受益匪浅。
Michael on June 27, 2006 at 9:16 pm said:
donald on June 27, 2006 at 5:39 pm said:
请问，如果一个网站处于技术发展期，那么这些优化手段应该先实施哪些后实施哪些呢？
或者说从成本（技术、人力和财力成本）方面，哪些先实施能够取得最大效果呢？
先从服务器性能优化、代码性能优化方面入手，包括webserver、dbserver的优化配置、html静态化等容易入手的开始，这些环节争取先榨取到最大化的利用率，然后再考虑从架构上增加投入，比如集群、负载均衡等方面，这些都需要在有一定的发展积累之后再做考虑比较恰当。
尤其这个部分很是有用，因为我也正在建一个电子商务类的网站，由于是前期阶段，费用的问题毕竟有所影响，所以暂且只用了一台服务器囊括过了整个网站。除去前面所说的图片服务器分离，还有什么办法能在网站建设初期尽可能的为后期的发展做好优化（性能优化，系统合理构架，前面说的websever、dbserver优化，后期譬如硬件等扩展尽可能不要过于烦琐等等）？也就是所谓的未雨绸缪了，尽可能在先期考虑到后期如果发展壮大的需求，预先做好系统规划，并且在前期资金不足的情况下尽量做到网站以最优异的性能在运行。关于达到这两个要求，您可以给我一些稍稍详细一点的建议和技术参考吗？谢谢！
看了你的文章，知道你主要关注*nix系统架构的，我的是.net和win2003的，不过我觉得这个影响也不大。主要关注点放在外围的网站优化上。
谢谢!希望能得到您的一些好建议。
23
Michael says:
September 11th, 2006 at 2:55 pm
Quote
回复fanstone：
关于如何在网站的前期尽可能低成本的投入，做到性能最大化利用，同时做好后期系统架构的规划，这个问题可以说已经放大到超出技术范畴，不过和技术相关的部分还是有不少需要考虑的。
一个网站的规划关键的就是对阶段性目标的规划，比如预测几个月后达到什么用户级别、存储级别、并发请求数，然后再过几个月又将什么情况，这些预测必须根据具体业务和市场情况来进行预估和不断调整的，有了这些预测数据作为参考，就能进行技术架构的规划，否则技术上是无法合理进行架构设计的。
在网站发展规划基础上，考虑今后要提供什么样的应用？有些什么样的域名关系？各个应用之间的业务逻辑和关联是什么？面对什么地域分布的用户提供服务？等等。。。
上面这些问题有助于规划网站服务器和设备投入，同时从技术上可以及早预测到未来将会是一个什么架构，在满足这个架构下的每个节点将需要满足什么条件，就是初期架构的要求。
总的来说，不结合具体业务的技术规划是没有意义的，所以首先是业务规划，也就是产品设计，然后才是技术规划。
24
fanstone says:
September 11th, 2006 at 8:52 pm
Quote
谢谢解答，我再多看看！
25
Roc says:
March 22nd, 2007 at 11:48 pm
Quote
很好的文章，楼主说的方法非常适用，目前我们公司的网站也是按照楼主所说的方法进行设计的，效果比较好，利于以后的扩展，另外我再补充一点，其实楼主也说了，网站的域名也需要提前考虑和规划，比如网站的图片内容比较多，可以按应用图片的类型可以根据不同的业务需求采用不同的域名img1~imgN等，便于日后的扩展和移至，希望楼主能够多发一些这样的好文章。
26
zhang says:
April 3rd, 2007 at 9:08 am
Quote
图片服务器与主数据分离的问题。
图片是存储在硬盘里好还是存储在数据库里好？
请您分硬盘和数据库两种情况解释下面的疑问。
当存放图片的服务器容量不能满足要求时如何办？
当存放图片的服务器负载不能满足要求时如何办？
谢谢。
27
Michael says:
April 3rd, 2007 at 2:29 pm
Quote
zhang on April 3, 2007 at 9:08 am said:
图片服务器与主数据分离的问题。
图片是存储在硬盘里好还是存储在数据库里好？
请您分硬盘和数据库两种情况解释下面的疑问。
当存放图片的服务器容量不能满足要求时如何办？
当存放图片的服务器负载不能满足要求时如何办？
谢谢。
肯定是存储在硬盘里面，出现存储在数据库里面的说法实际上是出自一些虚拟主机或者租用空间的个人网站和企业网站，因为网站数据量小，也为了备份方便，从大型商业网站来说，没有图片存储在数据库里面的大型应用。数据库容量和效率都会是极大的瓶颈。
你提到的后面两个问题。容量和负载基本上是同时要考虑的问题，容量方面，大部分的解决方案都是使用海量存储，比如专业的盘阵，入门级的磁盘柜或者高级的光纤盘阵、局域网盘阵等，这些都是主要的解决方案。记得我原来说过，如果是考虑低成本，一定要自己使用便宜单台服务器来存储，那就需要从程序逻辑上去控制，比如你可以多台同样的服务器来存储，分别提供NFS的分区给前端应用使用，在前端应用的程序逻辑中自己去控制存储在哪一台服务器的NFS分区上，比如根据Userid或者图片id、或者别的逻辑去进行散列，这个和我们规划大型数据库存储散列分表或者分库存储的逻辑类似。
基本上图片负载高的解决办法有两种，前端squid缓存和镜像，通过对存储设备（服务器或者盘阵）使用镜像，可以分布到多台服务器上对外提供图片服务，然后再配合squid缓存实现负载的降低和提高用户访问速度。
希望能回答了您的问题。
28
Michael says:
April 3rd, 2007 at 2:41 pm
Quote
Roc on March 22, 2007 at 11:48 pm said:
很好的文章，楼主说的方法非常适用，目前我们公司的网站也是按照楼主所说的方法进行设计的，效果比较好，利于以后的扩展，另外我再补充一点，其实楼主也说了，网站的域名也需要提前考虑和规划，比如网站的图片内容比较多，可以按应用图片的类型可以根据不同的业务需求采用不同的域名img1~imgN等，便于日后的扩展和移至，希望楼主能够多发一些这样的好文章。
欢迎常来交流，还希望能得到你的指点。大家相互学习。
29
zhang says:
April 4th, 2007 at 11:39 pm
Quote
非常感谢您的回复，
希望将来有合作的机会。
再次感谢。
30
Charles says:
April 9th, 2007 at 2:50 pm
Quote
刚才一位朋友把你的 BLOG 发给我看，问我是否认识你，所以我就仔细看了一下你的 BLOG，发现这篇文章。
很不错的一篇文章，基本上一个大型网站需要做的事情都已经提及了。我自己也曾任职于三大门户之一，管理超过 100 台的 SQUID 服务器等，希望可以也分享一下我的经验和看法。
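1. Separate image servers
I have always strongly supported this. In particular, when programs and images sit under the same Apache server, every image request may tie up an httpd process, and if httpd includes the PHP module it uses far more memory than needed, for no benefit at all.
A dedicated image server not only avoids this, it also lets you set different expiry times for images of different kinds, so that a user who sees the same image on different pages does not fetch it from the server again (the server here being a cache). That is faster and saves bandwidth. Cache lifetimes can also be tuned independently.
On the image servers I used to manage, images were separated from applications and pages, and different kinds of images also got different domain names to spread the load they cause. For example, photo.img.domain.com served the photography service and normally used 5 cache machines, but after the May 1 holiday it might need to grow to 10 on its own. The extra 5 could be borrowed temporarily from other, less loaded image servers.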
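2. Database clusters
An Oracle RAC cluster costs roughly 400,000 (40W) to deploy, which an ordinary company does not need. Web application logic is relatively simple, and the value of large databases such as Oracle lies in data mining rather than plain storage, so MySQL or PostgreSQL is the more practical choice.
Simple MySQL replication already works well: read from the slaves and go to the master only for writes. In practice MySQL replication performs very well and adds little update latency. With balance ( http://www.inlab.de/balance.html) listening on port 3306 locally (127.0.0.1) and mapped to several slave databases, reads can be load balanced.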
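The "read from slaves, write to master" rule can also be applied inside the application. The following is a minimal sketch of such a router; the `ReadWriteRouter` class and the server names are hypothetical, and it classifies statements only by their first keyword, which is a simplification.

```python
import itertools


class ReadWriteRouter:
    """Send read statements to slaves in round-robin order, the rest to the master."""

    READ_VERBS = ("SELECT", "SHOW")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()  # first keyword of the statement
        if verb in self.READ_VERBS:
            return next(self._slaves)
        return self.master
```

One caveat worth keeping in mind: a read issued right after a write may hit a slave that has not yet replayed the change, so reads that must see fresh data are often sent to the master as well.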
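3. Images on disk or in the database?
I have thought about this carefully too. On ext3, a directory reaches its limit at about 30,000 subdirectories (the hard limit is just under 32,000), while XFS has no such limit. To store images in bulk you must split them across many small directories, otherwise you hit that ext3 limit, and too many files and directories also hurt disk performance. That is before counting wasted space.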
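One common way to split images into many small directories is to derive the directory names from a hash of the image ID. The sketch below is an assumption about how this could be done, not the commenter's actual layout; the root path and the `image_path` helper are hypothetical. Two levels of two hex characters give 256 entries per level, well below the ext3 limit.

```python
import hashlib
from pathlib import PurePosixPath


def image_path(image_id, root="/data/images"):
    """Build a two-level hashed path such as /data/images/ab/cd/<id>.jpg."""
    digest = hashlib.md5(str(image_id).encode()).hexdigest()
    # First two hex pairs of the digest become the directory levels.
    return PurePosixPath(root) / digest[:2] / digest[2:4] / f"{image_id}.jpg"
```

Because the path is computed from the ID alone, no lookup table is needed to find a file later.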
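More importantly, backing up a huge number of small files takes enormous resources and a very long time. Given these problems, storing images in the database may be an alternative.
You can try storing images in the database, returning the actual image from a PHP program, and putting a Squid server in front to avoid performance problems. Image backup can then rely on MySQL replication, which solves that problem fairly well.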
4、页面的静态化我就不说了，我自己做的 wordpress 就完全实现了静态化，同时能很好地兼顾动态数据的生成。
5、缓存
我自己之前也提出过使用 memcached，但实际使用中不是非常特别的理想。当然，各个应用环境不一致会有不一致的使用结果，这个并不重要。只要自己觉得好用就用。
6、软件四层交换
LVS 的性能非常好，我有朋友的网站使用了 LVS 来做负责均衡的调度器，数据量非常大都可以轻松支撑。当然是使用了 DR 的方式。
其实我自己还想过可以用 LVS 来做 CDN 的调度。例如北京的 BGP 机房接受用户的请求，然后通过 LVS 的 TUN 方式，将请求调度到电信或网通机房的实际物理服务器上，直接向用户返回数据。
这种是 WAN 的调度，F5 这些硬件设备也应用这样的技术。不过使用 LVS 来实现费用就大大降低。
以上都只属个人观点，能力有限，希望对大家有帮助。：）
31
Michael says:
April 9th, 2007 at 8:17 pm
Quote
很少见到有朋友能在我得blog上留下这么多有价值的东西，代表别的可能看到这篇文章的朋友一起感谢你。
balance （ http://www.inlab.de/balance.html）这个东西准备看一下。
32
Michael says:
April 16th, 2007 at 1:29 am
Quote
如果要说3Par的光纤存储局域网技术细节，我无法给您太多解释，我对他们的产品没有接触也没有了解，不过从SAN的概念上是可以知道大概框架的，它也是一种基于光纤通道的存储局域网，可以支持远距离传输和较高的系统扩展性，传统的SAN使用专门的FC光通道SCSI磁盘阵列，从你提供的内容来看，3Par这个东西建立在低成本的SATA或FATA磁盘阵列基础上，这一方面能降低成本，同时估计3Par在技术上有创新和改进，从而提供了廉价的高性能存储应用。
这个东西细节只有他们自己知道，你就知道这是个商业的SAN （存储局域网，说白了也是盘阵，只是通过光纤通道独立于系统外的）。
33
zhang says:
April 16th, 2007 at 2:10 am
Quote
myspace和美国的许多银行都更换为了3Par
请您在百忙之中核实一下，是否确实像说的那么好。
下面是摘抄：
Priceline.com是一家以销售空座机票为主的网络公司，客户数量多达680万。该公司近期正在内部实施一项大规模的SAN系统整合计划，一口气购进了5套3PARdata的服务器系统，用以替代现有的上百台Sun存储阵列。如果该方案部署成功的话，将有望为Priceline.com节省大量的存储管理时间、资本开销及系统维护费用。
　　Priceline.com之前一直在使用的SAN系统是由50台光纤磁盘阵列、50台SCSI磁盘阵列和15台存储服务器构成的。此次，该公司一举购入了5台3Par S400 InServ Storage Servers存储服务器，用以替代原来的服务器系统，使得设备整体能耗、占用空间及散热一举降低了60%。整个系统部署下来，总存储容量将逼近30TB。
　　Priceline的首席信息官Ron Rose拒绝透露该公司之前所使用的SAN系统设备的供应商名称，不过，消息灵通人士表示，PriceLine原来的存储环境是由不同型号的Sun系统混合搭建而成的。
　　“我们并不愿意随便更换系统供应商，不过，3Par的存储系统所具备的高投资回报率，实在令人难以抗拒，”Rose介绍说，“我们给了原来的设备供应商以足够的适应时间，希望它们的存储设备也能够提供与3Par一样的效能，最后，我们失望了。如果换成3Par的存储系统的话，短期内就可以立刻见到成效。”
　　Rose接着补充说，“原先使用的那套SAN系统，并没有太多让我们不满意的地方，除了欠缺一点灵活性之外：系统的配置方案堪称不错，但并不是最优化的。使用了大量价格偏贵的光纤磁盘，许多SAN端口都是闲置的。”
自从更换成3Par的磁盘阵列之后，该公司存储系统的端口数量从90个骤减为24个。“我们购买的软件许可证都是按端口数量来收费的。每增加一个端口，需要额外支付500-1,500美元，单单这一项，就为我们节省了一笔相当可观的开支，”Rose解释说。而且，一旦启用3Par的精简自动配置软件，系统资源利用率将有望提升30%，至少在近一段时间内该公司不必考虑添置新的磁盘系统。
　　精简自动配置技术最大的功效就在于它能够按照应用程序的实际需求来分配存储资源，有效地降低了空间闲置率。如果当前运行的应用程序需要额外的存储资源的话，该软件将在不干扰应用程序正常运行的前提下，基于“按需”和“公用”的原则来自动发放资源空间，避免了人力干预，至少为存储管理员减轻了一半以上的工作量。
　　3Par的磁盘阵列是由低成本的SATA和FATA（即：低成本光纤信道接口）磁盘驱动器构成的，而并非昂贵的高效能FC磁盘，大大降低了系统整体成本。
　　3Par推出的SAN解决方案，实际上是遵循了“允许多个分布式介质服务器共享通过光纤信道SAN 连接的公共的集中化存储设备”的设计理念。“这样一来，就不必给所有的存储设备都外挂一个代理服务程序了，”Rose介绍说。出于容灾容错和负载均衡的考虑，Priceline搭建了两个生产站点，每一个站点都部署了一套3Par SAN系统。此外，Priceline还购买了两台3Par Inservs服务器，安置在主数据中心内，专门用于存放镜像文件。第5台服务器设置在Priceline的企业资料处理中心内，用于存放数据仓库；第6台服务器设置在实验室内，专门用于进行实际网站压力测试。
MySpace目前采用了一种新型SAN设备——来自加利福尼亚州弗里蒙特的3PARdata。在3PAR的系统里，仍能在逻辑上按容量划分数据存储，但它不再被绑定到特定磁盘或磁盘簇，而是散布于大量磁盘。这就使均分数据访问负荷成为可能。当数据库需要写入一组数据时，任何空闲磁盘都可以马上完成这项工作，而不再像以前那样阻塞在可能已经过载的磁盘阵列处。而且，因为多个磁盘都有数据副本，读取数据时，也不会使SAN的任何组件过载。
3PAR宣布，VoIP服务供应商Cbeyond Communications已向它订购了两套InServ存储服务器，一套应用于该公司的可操作支持系统，一套应用于测试和开发系统环境。3PAR的总部设在亚特兰大，该公司的产品多销往美国各州首府和大城市，比如说亚特兰大、达拉斯、丹佛、休斯顿、芝加哥，等等。截至目前为止，3PAR售出的服务器数量已超过了15,000台，许多客户都是来自于各行各业的龙头企业，它们之所以挑选3PAR的产品，主要是看中了它所具备的高性能、可扩展性、操作简单、无比伦比的性价比等优点，另外，3PAR推出的服务器系列具有高度的集成性能，适合应用于高速度的T1互联网接入、本地和长途语音服务、虚拟主机（Web hosting）、电子邮件、电话会议和虚拟个人网络（VPN）等服务领域。
亿万用户网站MySpace的成功秘密
◎ 文 / David F. Carr 译 / 罗小平
高速增长的访问量给社区网络的技术体系带来了巨大挑战。MySpace的开发者多年来不断重构站点软件、数据库和存储系统，以期与自身的成长同步——目前，该站点月访问量已达400亿。绝大多数网站需要应对的流量都不及MySpace的一小部分，但那些指望迈入庞大在线市场的人，可以从MySpace的成长过程学到知识。
MySpace开发人员已经多次重构站点软件、数据库和存储系统，以满足爆炸性的成长需要，但此工作永不会停息。“就像粉刷金门大桥，工作完成之时，就是重新来过之日。”（译者注：意指工人从桥头开始粉刷，当到达桥尾时，桥头涂料已经剥落，必须重新开始）MySpace技术副总裁Jim Benedetto说。
既然如此，MySpace的技术还有何可学之处？因为MySpace事实上已经解决了很多系统扩展性问题，才能走到今天。
Benedetto说他的项目组有很多教训必须总结，他们仍在学习，路漫漫而修远。他们当前需要改进的工作包括实现更灵活的数据缓存系统，以及为避免再次出现类似7月瘫痪事件的地理上分布式架构。
背景知识
当然，这么多的用户不停发布消息、撰写评论或者更新个人资料，甚至一些人整天都泡在Space上，必然给MySpace的技术工作带来前所未有的挑战。而传统新闻站点的绝大多数内容都是由编辑团队整理后主动提供给用户消费，它们的内容数据库通常可以优化为只读模式，因为用户评论等引起的增加和更新操作很少。而MySpace是由用户提供内容，数据库很大比例的操作都是插入和更新，而非读取。
浏览MySpace上的任何个人资料时，系统都必须先查询数据库，然后动态创建页面。当然，通过数据缓存，可以减轻数据库的压力，但这种方案必须解决原始数据被用户频繁更新带来的同步问题。
MySpace的站点架构已经历了5个版本——每次都是用户数达到一个里程碑后，必须做大量的调整和优化。Benedetto说，“但我们始终跟不上形势的发展速度。我们重构重构再重构，一步步挪到今天”。
在每个里程碑，站点负担都会超过底层系统部分组件的最大载荷，特别是数据库和存储系统。接着，功能出现问题，用户失声尖叫。最后，技术团队必须为此修订系统策略。
虽然自2005年早期，站点账户数超过7百万后，系统架构到目前为止保持了相对稳定，但MySpace仍然在为SQL Server支持的同时连接数等方面继续攻坚，Benedetto说，“我们已经尽可能把事情做到最好”。
里程碑一：50万账户
按Benedetto 的说法，MySpace最初的系统很小，只有两台Web服务器和一个数据库服务器。那时使用的是Dell双CPU、4G内存的系统。
单个数据库就意味着所有数据都存储在一个地方，再由两台Web服务器分担处理用户请求的工作量。但就像MySpace后来的几次底层系统修订时的情况一样，三服务器架构很快不堪重负。此后一个时期内，MySpace基本是通过添置更多Web服务器来对付用户暴增问题的。
但到在2004年早期，MySpace用户数增长到50万后，数据库服务器也已开始汗流浃背。
但和Web服务器不同，增加数据库可没那么简单。如果一个站点由多个数据库支持，设计者必须考虑的是，如何在保证数据一致性的前提下，让多个数据库分担压力。
在第二代架构中，MySpace运行在3个SQL Server数据库服务器上——一个为主，所有的新数据都向它提交，然后由它复制到其他两个；另两个全力向用户供给数据，用以在博客和个人资料栏显示。这种方式在一段时间内效果很好——只要增加数据库服务器，加大硬盘，就可以应对用户数和访问量的增加。
里程碑二：1-2百万账户
MySpace注册数到达1百万至2百万区间后，数据库服务器开始受制于I/O容量——即它们存取数据的速度。而当时才是2004年中，距离上次数据库系统调整不过数月。用户的提交请求被阻塞，就像千人乐迷要挤进只能容纳几百人的夜总会，站点开始遭遇“主要矛盾”，Benedetto说，这意味着MySpace永远都会轻度落后于用户需求。
“有人花5分钟都无法完成留言，因此用户总是抱怨说网站已经完蛋了。”他补充道。
这一次的数据库架构按照垂直分割模式设计，不同的数据库服务于站点的不同功能，如登录、用户资料和博客。于是，站点的扩展性问题看似又可以告一段落了，可以歇一阵子。
垂直分割策略利于多个数据库分担访问压力，当用户要求增加新功能时，MySpace将投入新的数据库予以支持它。账户到达2百万后，MySpace还从存储设备与数据库服务器直接交互的方式切换到SAN（Storage Area Network，存储区域网络）——用高带宽、专门设计的网络将大量磁盘存储设备连接在一起，而数据库连接到SAN。这项措施极大提升了系统性能、正常运行时间和可靠性，Benedetto说。
里程碑三：3百万账户
当用户继续增加到3百万后，垂直分割策略也开始难以为继。尽管站点的各个应用被设计得高度独立，但有些信息必须共享。在这个架构里，每个数据库必须有各自的用户表副本——MySpace授权用户的电子花名册。这就意味着一个用户注册时，该条账户记录必须在9个不同数据库上分别创建。但在个别情况下，如果其中某台数据库服务器临时不可到达，对应事务就会失败，从而造成账户非完全创建，最终导致此用户的该项服务无效。
另外一个问题是，个别应用如博客增长太快，那么专门为它服务的数据库就有巨大压力。
2004年中，MySpace面临Web开发者称之为“向上扩展”对“向外扩展”（译者注：Scale Up和Scale Out，也称硬件扩展和软件扩展）的抉择——要么扩展到更大更强、也更昂贵的服务器上，要么部署大量相对便宜的服务器来分担数据库压力。一般来说，大型站点倾向于向外扩展，因为这将让它们得以保留通过增加服务器以提升系统能力的后路。
但成功地向外扩展架构必须解决复杂的分布式计算问题，大型站点如Google、Yahoo和Amazon.com，都必须自行研发大量相关技术。以Google为例，它构建了自己的分布式文件系统。
另外，向外扩展策略还需要大量重写原来软件，以保证系统能在分布式服务器上运行。“搞不好，开发人员的所有工作都将白费”，Benedetto说。
因此，MySpace首先将重点放在了向上扩展上，花费了大约1个半月时间研究升级到32CPU服务器以管理更大数据库的问题。Benedetto说，“那时候，这个方案看似可能解决一切问题。”如稳定性，更棒的是对现有软件几乎没有改动要求。
糟糕的是，高端服务器极其昂贵，是购置同样处理能力和内存速度的多台服务器总和的很多倍。而且，站点架构师预测，从长期来看，即便是巨型数据库，最后也会不堪重负，Benedetto说，“换句话讲，只要增长趋势存在，我们最后无论如何都要走上向外扩展的道路。”
因此，MySpace最终将目光移到分布式计算架构——它在物理上分布的众多服务器，整体必须逻辑上等同于单台机器。拿数据库来说，就不能再像过去那样将应用拆分，再以不同数据库分别支持，而必须将整个站点看作一个应用。现在，数据库模型里只有一个用户表，支持博客、个人资料和其他核心功能的数据都存储在相同数据库。
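Since all core data was now logically organized into one database, MySpace had to find a new way to share the load; a single database server on commodity hardware clearly could not cope. This time, instead of splitting databases by feature and application, MySpace split its users into groups of one million and stored all of each group's data in a separate SQL Server instance. Currently each MySpace database server actually runs two SQL Server instances, so each server serves about 2 million users. Benedetto points out that the same model can later be partitioned at a finer granularity to balance load better.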
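The "one million users per SQL Server instance, two instances per server" layout amounts to range partitioning on the account number. A minimal sketch follows; the function name and the assumption that account IDs are sequential integers starting at 1 are illustrative, not details confirmed by the article.

```python
USERS_PER_INSTANCE = 1_000_000   # from the article: one million accounts per instance
INSTANCES_PER_SERVER = 2         # from the article: two instances per server


def locate_account(user_id):
    """Return (server index, instance index) for a sequential account ID.

    Assumes IDs start at 1 and are assigned in order of registration.
    """
    instance = (user_id - 1) // USERS_PER_INSTANCE
    server = instance // INSTANCES_PER_SERVER
    return server, instance
```

Unlike modulo hashing, range partitioning never moves existing accounts when a new instance is added; the cost, as the article notes later, is that the newest range can fill up or run hot very quickly.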
当然，还是有一个特殊数据库保存了所有账户的名称和密码。用户登录后，保存了他们其他数据的数据库再接管服务。特殊数据库的用户表虽然庞大，但它只负责用户登录，功能单一，所以负荷还是比较容易控制的。
里程碑四：9百万到1千7百万账户
2005年早期，账户达到9百万后，MySpace开始用Microsoft的C#编写ASP.NET程序。C#是C语言的最新派生语言，吸收了C++和Java的优点，依托于Microsoft .NET框架（Microsoft为软件组件化和分布式计算而设计的模型架构）。ASP.NET则由编写Web站点脚本的ASP技术演化而来，是Microsoft目前主推的Web站点编程环境。
可以说是立竿见影，MySpace马上就发现ASP.NET程序运行更有效率，与ColdFusion相比，完成同样任务需消耗的处理器能力更小。据技术总监Whitcomb说，新代码需要150台服务器完成的工作，如果用ColdFusion则需要246台。Benedetto还指出，性能上升的另一个原因可能是在变换软件平台，并用新语言重写代码的过程中，程序员复审并优化了一些功能流程。
最终，MySpace开始大规模迁移到ASP.NET。即便剩余的少部分ColdFusion代码，也从Cold-Fusion服务器搬到了ASP.NET，因为他们得到了BlueDragon.NET（乔治亚州阿尔法利塔New Atlanta Communications公司的产品，它能将ColdFusion代码自动重新编译到Microsoft平台）的帮助。
账户达到1千万时，MySpace再次遭遇存储瓶颈问题。SAN的引入解决了早期一些性能问题，但站点目前的要求已经开始周期性超越SAN的I/O容量——即它从磁盘存储系统读写数据的极限速度。
原因之一是每数据库1百万账户的分割策略，通常情况下的确可以将压力均分到各台服务器，但现实并非一成不变。比如第七台账户数据库上线后，仅仅7天就被塞满了，主要原因是佛罗里达一个乐队的歌迷疯狂注册。
某个数据库可能因为任何原因，在任何时候遭遇主要负荷，这时，SAN中绑定到该数据库的磁盘存储设备簇就可能过载。“SAN让磁盘I/O能力大幅提升了，但将它们绑定到特定数据库的做法是错误的。”Benedetto说。
最初，MySpace通过定期重新分配SAN中数据，以让其更为均衡的方法基本解决了这个问题，但这是一个人工过程，“大概需要两个人全职工作。”Benedetto说。
长期解决方案是迁移到虚拟存储体系上，这样，整个SAN被当作一个巨型存储池，不再要求每个磁盘为特定应用服务。MySpace目前采用了一种新型SAN设备——来自加利福尼亚州弗里蒙特的3PARdata。
在3PAR的系统里，仍能在逻辑上按容量划分数据存储，但它不再被绑定到特定磁盘或磁盘簇，而是散布于大量磁盘。这就使均分数据访问负荷成为可能。当数据库需要写入一组数据时，任何空闲磁盘都可以马上完成这项工作，而不再像以前那样阻塞在可能已经过载的磁盘阵列处。而且，因为多个磁盘都有数据副本，读取数据时，也不会使SAN的任何组件过载。
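When accounts reached 17 million in spring 2005, MySpace adopted another strategy to relieve the storage system: a data caching tier between the web servers and the database servers, whose sole job is to keep in-memory copies of frequently requested data objects so that web applications can be served without touching the database. In other words, 100 users requesting the same data used to mean 100 database queries; now it takes 1, and the rest come from the cache. If the page changes, the cached data must be evicted from memory and fetched from the database again, but until then the database load is much lower and the whole site performs better.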
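The "100 requests, 1 database query" effect, and the need to evict an entry when the underlying data changes, can be shown with a small sketch. The `ProfileStore` class is hypothetical and uses plain dictionaries in place of a real database and cache tier.

```python
class ProfileStore:
    """Read-through cache in front of a database, with eviction on update."""

    def __init__(self, db):
        self._db = db          # stand-in for the database (a dict here)
        self._cache = {}
        self.db_reads = 0      # counts how often the database is actually queried

    def get(self, user_id):
        if user_id not in self._cache:
            self.db_reads += 1
            self._cache[user_id] = self._db[user_id]
        return self._cache[user_id]

    def update(self, user_id, profile):
        self._db[user_id] = profile
        self._cache.pop(user_id, None)   # evict so the next read sees fresh data
```

With this structure the database is queried once per profile until that profile changes, no matter how many visitors request it.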
缓存区还为那些不需要记入数据库的数据提供了驿站，比如为跟踪用户会话而创建的临时文件——Benedetto坦言他需要在这方面补课，“我是数据库存储狂热分子，因此我总是想着将万事万物都存到数据库。”但将像会话跟踪这类的数据也存到数据库，站点将陷入泥沼。
增加缓存服务器是“一开始就应该做的事情，但我们成长太快，以致于没有时间坐下来好好研究这件事情。”Benedetto补充道。
里程碑五：2千6百万账户
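In mid-2005, at 26 million accounts, MySpace switched to SQL Server 2005 while it was still in beta. Why the rush? The common view is that the 2005 version supports 64-bit processors. But Benedetto says, “That was not the main reason, although it matters too; mainly it was our hunger for memory.” A 64-bit database can manage more memory.
More memory means higher performance and more capacity. A server running 32-bit SQL Server could use at most 4 GB of memory at a time. Moving to 64-bit was like widening the water pipe. After upgrading to SQL Server 2005 and 64-bit Windows Server 2003, MySpace gave each server 32 GB of memory, and in 2006 raised the standard configuration to 64 GB.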
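The 4 GB ceiling follows from pointer width: a 32-bit address can name 2^32 distinct bytes. The short calculation below works this out; it covers only the theoretical address space, since real operating systems and hardware impose lower practical limits.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_gib(pointer_bits):
    """Theoretical address space, in GiB, for a given pointer width."""
    return 2 ** pointer_bits // GIB
```

For 32 bits this gives 4, matching the limit quoted above; for 64 bits the theoretical figure is so large that installed memory, not addressing, becomes the constraint.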
意外错误
如果没有对系统架构的历次修改与升级，MySpace根本不可能走到今天。但是，为什么系统还经常吃撑着了？很多用户抱怨的“意外错误”是怎么引起的呢？
原因之一是MySpace对Microsoft的Web技术的应用已经进入连Microsoft自己也才刚刚开始探索的领域。比如11月，超出SQL Server最大同时连接数，MySpace系统崩溃。Benedetto说，这类可能引发系统崩溃的情况大概三天才会出现一次，但仍然过于频繁了，以致惹人恼怒。一旦数据库罢工，“无论这种情况什么时候发生，未缓存的数据都不能从SQL Server获得，那么你就必然看到一个‘意外错误’提示。”他解释说。
去年夏天，MySpace的Windows 2003多次自动停止服务。后来发现是操作系统一个内置功能惹的祸——预防分布式拒绝服务攻击（黑客使用很多客户机向服务器发起大量连接请求，以致服务器瘫痪）。MySpace和其他很多顶级大站点一样，肯定会经常遭受攻击，但它应该从网络级而不是依靠Windows本身的功能来解决问题——否则，大量MySpace合法用户连接时也会引起服务器反击。
“我们花了大约一个月时间寻找Windows 2003服务器自动停止的原因。”Benedetto说。最后，通过Microsoft的帮助，他们才知道该怎么通知服务器：“别开枪，是友军。”
紧接着是在去年7月某个周日晚上，MySpace总部所在地洛杉矶停电，造成整个系统停运12小时。大型Web站点通常要在地理上分布配置多个数据中心以预防单点故障。本来，MySpace还有其他两个数据中心以应对突发事件，但Web服务器都依赖于部署在洛杉矶的SAN。没有洛杉矶的SAN，Web服务器除了恳求你耐心等待，不能提供任何服务。
Benedetto说，主数据中心的可靠性通过下列措施保证：可接入两张不同电网，另有后备电源和一台储备有30天燃料的发电机。但在这次事故中，不仅两张电网失效，而且在切换到备份电源的过程中，操作员烧掉了主动力线路。
2007年中，MySpace在另两个后备站点上也建设了SAN。这对分担负荷大有帮助——正常情况下，每个SAN都能负担三分之一的数据访问量。而在紧急情况下，任何一个站点都可以独立支撑整个服务，Benedetto说。
MySpace仍然在为提高稳定性奋斗，虽然很多用户表示了足够信任且能原谅偶现的错误页面。
“作为开发人员，我憎恶Bug，它太气人了。”Dan Tanner这个31岁的德克萨斯软件工程师说，他通过MySpace重新联系到了高中和大学同学。“不过，MySpace对我们的用处很大，因此我们可以原谅偶发的故障和错误。” Tanner说，如果站点某天出现故障甚至崩溃，恢复以后他还是会继续使用。
这就是为什么Drew在论坛里咆哮时，大部分用户都告诉他应该保持平静，如果等几分钟，问题就会解决的原因。Drew无法平静，他写道，“我已经两次给MySpace发邮件，而它说一小时前还是正常的，现在出了点问题……完全是一堆废话。”另一个用户回复说，“毕竟它是免费的。”Benedetto坦承100%的可靠性不是他的目标。“它不是银行，而是一个免费的服务。”他说。
换句话说，MySpace的偶发故障可能造成某人最后更新的个人资料丢失，但并不意味着网站弄丢了用户的钱财。“关键是要认识到，与保证站点性能相比，丢失少许数据的故障是可接受的。”Benedetto说。所以，MySpace甘冒丢失2分钟到2小时内任意点数据的危险，在SQL Server配置里延长了“checkpoint”操作——它将待更新数据永久记录到磁盘——的间隔时间，因为这样做可以加快数据库的运行。
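Likewise, Benedetto says, developers often go through design, coding, testing, and release within a few hours. This risks introducing bugs, but it gets new features out faster. And because realistic large-scale load testing is not feasible, their tests are usually run on the live site, against a subset of active users who do not know they are seeing new features or improvements.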
“我们犯过大量错误，”Benedetto说，“但到头来，我认为我们做对的还是比做错的多。”
34
zhang says:
April 16th, 2007 at 2:15 am
Quote
了解联合数据库服务器
为达到最大型网站所需的高性能级别，多层系统一般在多个服务器之间平衡每一层的处理负荷。SQL Server 2005 通过对 SQL Server 数据库中的数据进行水平分区，在一组服务器之间分摊数据库处理负荷。这些服务器独立管理，但协作处理应用程序的数据库请求；这样一组协作服务器称为“联合体”。
只有在应用程序将每个 SQL 语句发送到包含该语句所需的大部分数据的成员服务器时，联合数据库层才能达到非常高的性能级别。这称为使用语句所需的数据来配置 SQL 语句。使用所需的数据来配置 SQL 语句不是联合服务器所特有的要求。群集系统也有此要求。
虽然服务器联合体与单个数据库服务器对应用程序来说是一样的，但在实现数据库服务层的方式上存在内部差异，
35
Michael says:
April 16th, 2007 at 3:18 am
Quote
关于MySpace是否使用了3Par的SAN，并且起到多大的关键作用，我也无法考证，也许可以通过在MySpace工作的朋友可以了解到，但是从各种数据和一些案例来看，3Par的确可以改善成本过高和存储I/O性能问题，但是实际应用中，除非电信、银行或者真的类似MySpace这样规模的站点，的确很少遇到存储连SAN都无法满足的情况，另外，对于数据库方面，据我知道，凡电信、金融和互联网上电子商务关键数据应用，基本上Oracle才是最终的解决方案。包括我曾经工作的Yahoo，他们全球超过70%以上应用使用MySQL，但是和钱相关的或者丢失数据会承担责任的应用，都是使用Oracle。在UDB方面，我相信Yahoo的用户数一定超过MySpace的几千万。
事实上，国内最值得研究的案例应该是腾讯，腾讯目前的数据量一定是惊人的，在和周小旻的一次短暂对话中知道腾讯的架构专家自己实现了大部分的技术，细节我无法得知。
36
Michael says:
April 16th, 2007 at 3:23 am
Quote
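I still do not recommend storing images in the database, but it can be done if you insist. Sending the image to the client from a CGI program, as you mention, is supported by virtually every web language, as long as it can emit standard HTTP headers. In PHP, for example, header("Content-Type: image/jpeg"); declares the MIME type of the current response as an image, and further header() calls can send other HTTP headers such as Content-Length (needed to support Range requests for resumable downloads); see the PHP manual for details. Perl, ASP, and JSP offer similar mechanisms. This is not a language question but an HTTP protocol question.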
37
zhang says:
April 16th, 2007 at 8:52 am
Quote
早晨，其实已经是上午，起床后，
看到您凌晨3：23的回复，非常感动。
一定注意身体。
好像您还没有太太，
太太和孩子也像正规程序一样，会良好地调节您的身体。
千万不要使用野程序调节身体，会中毒。
开个玩笑。
38
zhang says:
April 16th, 2007 at 8:59 am
Quote
看到您凌晨3：23的回复，
非常感动！
一定注意身体，
好像您还没有太太，
太太和孩子就像正规程序一样，能够良好地调节您的身体，
千万不要使用野程序调节身体，会中毒。
开个玩笑。
39
Michael says:
April 16th, 2007 at 11:04 am
Quote
zhang on April 16, 2007 at 8:59 am said:
看到您凌晨3：23的回复，
非常感动！
一定注意身体，
好像您还没有太太，
太太和孩子就像正规程序一样，能够良好地调节您的身体，
千万不要使用野程序调节身体，会中毒。
开个玩笑。
哈哈，最近我是有点疯狂，不过从大学开始，似乎就习惯了晚睡，我基本多年都保持2点左右睡觉，8点左右起床，昨晚有点夸张，因为看一个文档和写一些东西一口气就到3点多了，临睡前看到您的留言，顺便就回复了。
40
myld says:
April 18th, 2007 at 1:38 pm
Quote
感谢楼主写了这么好的文章，谢谢！！！
41
楓之谷外掛 says:
April 27th, 2007 at 11:04 pm
Quote
看ㄋ你的文章，很有感覺的說．我自己也做網站，希望可以多多交流一下，大家保持聯繫．
http://www.gameon9.com/
http://www.gameon9.com.tw/
42
南半球 says:
May 9th, 2007 at 8:22 pm
Quote
关于两位老大讨论的：图片保存于磁盘还是数据库
个人觉得数据库存文件的话，查询速度可能快点，但数据量很大的时候要加上索引，这样添加记录的速度就慢了
mysql对大数据量的处理能力还不是很强，上千万条记录时，性能也会下降
数据库另外一个瓶颈问题就是连接
用数据库，就要调用后台程序（JSP/JAVA,PHP等）连接数据库，而数据库的连接连接、传输数据都要耗费系统资源。数据库的连接数也是个瓶颈问题。曾经写过一个很烂的程序，每秒访问3到5次的数据库，结果一天下来要连接20多万次数据库，把对方的mysql数据库搞瘫痪了。
43
zhang says:
May 19th, 2007 at 12:07 am
Quote
抽空儿回这里浏览了一下，
有收获，
“写真照”换了，显得更帅了。
ok
44
Michael says:
May 19th, 2007 at 12:17 am
Quote
zhang on May 19, 2007 at 12:07 am said:
抽空儿回这里浏览了一下，
有收获，
“写真照”换了，显得更帅了。
ok
哈哈，让您见笑了
45
David says:
May 30th, 2007 at 3:27 pm
Quote
很好，虽然我不是做web的，但看了还是收益良多。
46
pig345 says:
June 13th, 2007 at 10:23 am
Quote
感谢Michael
47
疯子日记 says:
June 13th, 2007 at 10:12 pm
Quote
回复:说说大型高并发高负载网站的系统架构 …
好文章!学习中………….
48
terry says:
June 15th, 2007 at 4:29 pm
Quote
推荐nginx
49
7u5 says:
June 16th, 2007 at 11:54 pm
Quote
拜读
50
Michael says:
June 16th, 2007 at 11:59 pm
Quote
terry on June 15, 2007 at 4:29 pm said:
推荐nginx
欢迎分享Nginx方面的经验:)
51
说说大型高并发高负载网站的系统架构 - 红色的河 says:
June 21st, 2007 at 11:40 pm
Quote
[…] 来源： http://www.toplee.com/blog/archives/71.html 时间：11:40 下午 | 分类：技术文摘标签：系统架构, 大型网站, 性能优化 […]
52
laoyao2k says:
June 23rd, 2007 at 11:35 am
Quote
看到大家都推荐图片分离，我也知道这样很好，但页面里的图片的绝对网址是开发的时候就写进去的，还是最终执行的时候替换的呢？
如果是开发原始网页就写进去的，那本地调试的时候又是怎么显示的呢？
如果是最终执行的时候替换的话，是用的什么方法呢？
53
Michael says:
June 23rd, 2007 at 8:21 pm
Quote
都可以，写到配置文件里面就可以，或者用全局变量定义，方法很多也都能实现，哪怕写死了在开发的时候把本地调试也都配好图片server，在hosts文件里面指定一下ip到本地就可以了。
假设用最终执行时候的替换，就配置你的apache或者别的server上的mod_rewrite模块来实现，具体的参照相关文档。
54
laoyao2k says:
June 25th, 2007 at 6:43 pm
Quote
先谢谢博主的回复，一直在找一种方便的方法将图片分离。
看来是最终替换法比较灵活，但我知道mod_rewrite是用来将用户提交的网址转换成服务器上的真实网址。
看了博主的回复好像它还有把网页执行的结果进行替换后再返回给浏览器的功能，是这样吗？
55
Michael says:
June 25th, 2007 at 11:00 pm
Quote
不是，只转换用户请求，对url进行rewrite，进行重定向到新的url上，规则很灵活，建议仔细看看lighttpd或者apache的mod_rewrite文档，当然IIS也有类似的模块。
56
laoyao2k says:
June 25th, 2007 at 11:56 pm
Quote
我知道了，如果要让客户浏览的网页代码里的图片地址是绝对地址，只能在开发时就写死了(对于静态页面)或用变量替换(对于动态页面更灵活些)，是这样吗？
我以为有更简单的方法呢，谢博主指点了。
57
马蜂不蛰 says:
July 24th, 2007 at 1:25 pm
Quote
请教楼主：
我正在搞一个医学教育视频资源在线预览的网站，只提供几分钟的视频预览，用swf格式，会员收看预览后线下可购买DVD光碟。
系统架构打算使用三台服务器：网页服务器、数据库服务器、视频服务器。
网页使用全部静态，数据库用SQL Server 2000，CMS是用ASP开发的。
会员数按十万级设计，不使用库表散列技术，请楼主给个建议，看看我的方案可行不？
58
Michael says:
July 24th, 2007 at 11:56 pm
Quote
这个数量级的应用好好配置优化一下服务器和代码，三台服务器完全没有问题，关键不是看整体会员数有多少，而是看同时在线的并发数有多少，并发不多就完全没有问题了，并发多的话，三台的这种架构还是有些问题的。
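 mixi technical architecture
mixi.jp: a scalable SNS site built on open-source software
Key points:
1. MySQL partitioning, running on InnoDB
2. Dynamic cache servers:
Facebook.com in the US, Yeejee.com in China, and mixi.jp in Japan all use the open-source distributed cache server Memcached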
3，图片缓存和加

于敦德 2006-6-27
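Mixi is currently the third-ranked site in Japan and 42nd worldwide. It mainly offers SNS services (diaries, groups, in-site messages, comments, photo albums, and so on) and is Japan's largest SNS site. Development began in December 2003, done single-handedly over four months by its current CTO, Batara Kesuma, and the site went live in February 2004. Two months later it had 10,000 registered users and 600,000 page views a day. Over the following year users grew to 210,000, and in the second year to 2 million. By April of this year it had 3.7 million registered users and was still adding 15,000 a day. Of these users 70% are active (logged in at least once in the past three days), and the average user spends nearly three and a half hours online per week.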

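Now for the architecture. Mixi builds on open-source software: Linux 2.6, Apache 2.0, MySQL, Perl 5.8, memcached, Squid, and so on. It already has more than 100 MySQL database servers and is adding more than 10 a month. Mixi opens a database connection for every query rather than using persistent connections. Most databases run on InnoDB. Mixi solves scaling mainly by partitioning the database.
First comes vertical partitioning: tables are assigned to different databases according to their content. Then comes horizontal partitioning: data belonging to different users is spread across databases by user ID. This is the usual approach, and it works. The key is the implementation in the application: the operations should be encapsulated in the data layer so the business layer is affected as little as possible. Leaving the logic layer completely unchanged is not possible, of course, and this is where earlier design decisions are tested. With a good design, passing a table name and a user ID when creating the connection more or less solves the problem; if SQL is scattered all over the code, or the data layer is poorly encapsulated, it will be hard work.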
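The idea of "pass a table name and a user ID when creating the connection" can be sketched as a single data-layer function that applies the vertical split first and the horizontal split second. The table-to-database mapping, the shard count, and the `database_for` name are all hypothetical; mixi's real configuration is not given in the source.

```python
# Hypothetical vertical split: which database group holds which table.
VERTICAL = {"diary": "diary_db", "message": "message_db"}
SHARDS_PER_GROUP = 4  # assumed number of horizontal shards in each group


def database_for(table, user_id):
    """Pick a database name: vertical split by table, then horizontal by user ID."""
    group = VERTICAL[table]
    return f"{group}_{user_id % SHARDS_PER_GROUP}"
```

Keeping this decision in one function is what lets the business layer stay unaware of how the data is partitioned.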
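That alone does not solve the problem at its root, especially for an SNS site like mixi, where a page often pulls in large amounts of user, friend, image, and article data, with many cross-table and cross-database operations. This is where memcached comes in: unchanging data is cached in large memory, and the cache entry is expired when the data is modified. The application tier can then handle most requests, and only a small fraction reaches the database. Mixi reports an average page load time of about 0.02 seconds (varying with page size), which shows the approach works. Mixi runs cache servers on 32 machines, each cache server with 2 GB of memory, installed alongside the app servers. Cache servers use little CPU, and with their help the app servers do not need much memory, so the two coexist well and use resources more efficiently.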

http://dbplus.blog.51cto.com/194965/33632
 memcached+squid+apache deflate解决网站大访问量问题
不许联想的RSS之前停了两天，据说是因为服务器负荷不了技术人员建议给关了，不输出RSS能减轻多少负载呢？所以月光博客不干了，出来给支了几招，但对于个人博客可能管用，对于流量更大的专业网站显然需要进一步的优化。途牛最近的访问量增长得比较快，所以很多页面load比较慢。之前我们就一直使用memcached进行了缓存以减轻数据库的压力，近期又对sql查询进行了优化，数据库的性能得到了明显的改善。途牛有很大一部分资源是图片，针对这个我们使用squid进行了缓存，这部分还包括js、css等一些静态文件。由于我们又有社区，用户的反馈比较多，所以页面并没有使用缓存，而是使用Apache的deflate模块进行压缩。技术实现都比较简单但非常实用，通过这几步优化，途牛在响应速度上有了不小的提高。

 FeedBurner:基于MySQL和JAVA的可扩展Web应用
于敦德 2006-6-27
FeedBurner（以下简称FB，呵呵）我想应该是大家耳熟能详的一个名字，在国内我们有一个同样的服务商，叫做FeedSky。在2004年7月份，FB的流量是300kbps，托管是5600个源，到2005年4月份，流量已经增长到5Mbps，托管了47700个源；到2005年9月份流量增长到20M，托管了109200个源，而到2006年4月份，流量已经到了115Mbps，270000个源，每天点击量一亿次。
FB的服务使用Java实现，使用了Mysql数据库。我们下面来看一下FB在发展的过程中碰到的问题，以及解决的方案。
在2004年8月份，FB的硬件设备包括3台Web服务器，3台应用服务器和两台数据库服务器，使用DNS轮循分布服务负载，将前端请求分布到三台Web服务器上。说实话，如果不考虑稳定性，给5600个源提供服务应该用不了这么多服务器。现在的问题是即使用了这么多服务器他们还是无法避免单点问题，单点问题将至少影响到1/3的用户。FB采用了监控的办法来解决，当监控到有问题出现时及时重启来避免更多用户受到影响。FB采用了Cacti( http://www.cacti.net)和Nagios( http://www.nagios.org)来做监控。
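FB's second problem was access statistics and management. Every click on FB-published content in an RSS reader has to be counted in real time, which is a huge amount of work. Heavy writes make the system much slower, and with MyISAM tables they also lock up the tables. FB uses asynchronous writes, buffering write operations through an executor pool, and computes real-time statistics only for the current day; earlier data is stored as aggregated results, which avoids recomputing it each time statistics are viewed. So the first visit to the statistics each day may be slow, presumably because FB is processing the previous day's data, while later visits analyze only the current day's much smaller data set and are much faster. That is what FB's presentation says, though my own FB account does not seem to show real-time statistics for today; maybe I have not looked carefully enough.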
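Buffering writes asynchronously, as described above, can be sketched with a queue and a background worker that flushes events in batches. FeedBurner's service is written in Java and its actual implementation is not shown in the source; the `WriteBuffer` class below is a hypothetical illustration, and it flushes only when a batch fills or the buffer is closed (a production version would also flush on a timer).

```python
import queue
import threading


class WriteBuffer:
    """Collect events in memory and hand them to flush() in batches."""

    def __init__(self, flush, batch_size=100):
        self._queue = queue.Queue()
        self._flush = flush            # callable that writes one batch, e.g. a bulk INSERT
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, event):
        """Called by request handlers; returns immediately."""
        self._queue.put(event)

    def close(self):
        """Flush whatever is left and stop the worker."""
        self._queue.put(None)          # sentinel meaning "no more events"
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            event = self._queue.get()
            if event is not None:
                batch.append(event)
            if batch and (event is None or len(batch) >= self._batch_size):
                self._flush(batch)
                batch = []
            if event is None:
                return
```

Request handlers only pay the cost of a queue insert, and the database sees a few large writes instead of many small ones.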
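Then a third problem appeared. Because most operations hit the master database, reads and writes on the database server began to conflict; as mentioned, MyISAM locks the table on writes, so reads and writes contend. With few reads and writes at the start this was not obvious, but it could no longer be ignored. The solution was to balance read and write load and to extend HibernateDaoSupport to distinguish read-only from read-write operations, so that the two can be handled differently.
The fourth problem: the database was overloaded across the board. The database was being used as a cache and was shared by all application servers, so it got slower and slower, and its size had reached the MyISAM limit of 4 GB; the FB folks themselves admit they had been a bit lazy. The solution was to cache in memory instead of in the database. They use memcached, which we recommended earlier, together with Ehcache ( http://ehcache.sourceforge.net/), a Java-based distributed caching tool.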
第五个问题：流行rss源带来大量重复请求，导致系统待处理请求的堆积。同时我们注意到在RSS源小图标有时候会显示有多少用户订阅了这一RSS源，这同样需要服务器去处理，而目前所有的订阅数都在同一时间进行计算，导致对系统资源的大量占用。解决方案，把计算时间错开，同时在晚间处理堆积下来的请求，但这仍然不够。
问题六：状态统计写入数据库又一次出问题了。越来越多的辅助数据（包括广告统计，文章点击统计，订阅统计）需要写入数据库，导致太多的写操作。解决方案：每天晚上处理完堆积下来的请求后对子表进行截断操作：
– FLUSH TABLES; TRUNCATE TABLE ad_stats0;
这样的操作对Master数据库是成功的，但对Slave会失败，正确的截断子表方法是：
– ALTER TABLE ad_stats TYPE=MERGE UNION=(ad_stats1,ad_stats2);
– TRUNCATE TABLE ad_stats0;
– ALTER TABLE ad_stats TYPE=MERGE UNION=(ad_stats0,ad_stats1,ad_stats2);
解决方案的另外一部分就是我们最常用的水平分割数据库。把最常用的表分出去，单独做集群，例如广告啊，订阅计算啊，
第七个问题，问题还真多，主数据库服务器的单点问题。虽然采用了Master-Slave模式，但主数据库Master和Slave都只有一台，当Master出问题的时候需要太长的时间进行Myisam的修复，而Slave又无法很快的切换成为Master。FB试了好多办法，最终的解决方案好像也不是非常完美。从他们的实验过程来看，并没有试验Master-Master的结构，我想Live Journal的Master-Master方案对他们来说应该有用，当然要实现Master-Master需要改应用，还有有些麻烦的。
第八个问题，停电!芝加哥地区的供电状况看来不是很好，不过不管好不好，做好备份是最重要的，大家各显神通吧。
这个Presentation好像比较偏重数据库，当然了，谁让这是在Mysql Con上的发言，不过总给人一种不过瘾的感觉。另外一个感觉，FB的NO们一直在救火，没有做系统的分析和设计。
最后FB的运维总监Joe Kottke给了四点建议：
1、监控网站数据库负载。
2、 “explain”所有的SQL语句。
3、缓存所有能缓存的东西。
4、归档好代码。
最后，FB用到的软件都不是最新的，够用就好，包括：Tomcat5.0，Mysql 4.1，Hibernate 2.1，Spring，DBCP。
 YouTube 的架构扩展
flyincat 发布于：2007-07-25 09:23
作者：Fenng | English Version 【可以转载, 转载时务必以超链接形式标明文章原始出处和作者信息及版权声明】
网址： http://www.dbanotes.net/opensource/youtube_web_arch.html
在西雅图扩展性的技术研讨会上，YouTube 的 Cuong Do 做了关于 YouTube Scalability 的报告。视频内容在 Google Video 上有(地址)，可惜国内用户看不到。
Kyle Cordes 对这个视频中的内容做了介绍。里面有不少技术性的内容。值得分享一下。(Kyle Cordes 的介绍是本文的主要来源)
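Briefly, on YouTube's traffic: "One day of YouTube traffic is equivalent to sending 75 billion emails." In mid-2006 there were reports of more than 100 million page views a day. And now? Even more extreme: "1 billion downloads and 65,000 uploads per day." Whether or not these figures are accurate, the volume is extraordinary. Among Chinese internet applications, judging by data volume alone, probably only 51.com is at this scale, but technically it cannot compare with YouTube.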

Web 服务器

YouTube 出于开发速度的考虑，大部分代码都是 Python 开发的。Web 服务器有部分是 Apache，用 FastCGI 模式。对于视频内容则用 Lighttpd 。据我所知，MySpace 也有部分服务器用 Lighttpd ，但量不大。YouTube 是 Lighttpd 最成功的案例。(国内用 Lighttpd 站点不多，豆瓣用的比较舒服。by Fenng)

视频

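Video thumbnails put great pressure on the servers. Each video has four thumbnails on average, and each web page shows many, so the disk I/O requests per second are very high. YouTube dedicated a separate server group to this load and tuned the cache and OS for it. Thumbnail requests also degraded Lighttpd performance; hacking Lighttpd to add more worker threads largely solved that. The latest solution is Google's BigTable, which performs better in speed, fault tolerance, and caching. An acquisition put to good use.
For redundancy each video file is stored on a mini cluster, that is, a group of servers with identical content. The most popular videos go to a CDN, so YouTube's own servers only handle the random requests that slip through. YouTube uses simple, cheap, commodity hardware, much in Google's style. Maintenance also relies on common tools such as rsync and SSH; they are simply more practiced with them.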

数据库

YouTube 用 MySQL 存储元数据--用户信息、视频信息什么的。数据库服务器曾经一度遇到 SWAP 颠簸的问题，解决办法是删掉了 SWAP 分区! 管用。
最初的 DB 只有 10 块硬盘，RAID 10 ，后来追加了一组 RAID 1。够省的。这一波 Web 2.0 公司很少有用 Oracle 的(我知道的只有 Bebo,参见这里). 在扩展性方面，路线也是和其他站点类似，复制，分散 IO。最终的解决之道是"分区",这个不是数据库层面的表分区，而是业务层面的分区(在用户名字或者 ID 上做文章,应用程序控制查找机制)
YouTube 也用 Memcached.
很想了解一下国内 Web 2.0 网站的数据信息,有谁可以提供一点 ?
--EOF--

 了解一下 Technorati 的后台数据库架构
作者：Fenng | English Version 【可以转载, 转载时务必以超链接形式标明文章原始出处和作者信息及版权声明】
网址： http://www.dbanotes.net/web/technorati_db_arch.html
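Dorion Carroll of Technorati (currently blocked, so you may not be able to reach it) described Technorati's back-end database architecture at the 2006 MySQL Users Conference.
Overview
Technorati currently handles about 10 TB of core data spread over about 20 machines. Replication adds another 100 TB across 200 machines. Data grows by 1 TB a day. Using SOA, physical and logical access are decoupled, which seems to have removed the database bottleneck. Notably, the scaling was done throughout with commodity hardware and open-source software; Web 2.0 sites are not ones to burn money, after all. By data volume this is a fairly large Web 2.0 application.
Tags are Technorati's most important data element, and their explosive growth has been a serious challenge.
In January 2005 there were only two database servers, one master and one slave. By January 2006 there was one master and one slave, plus 6 MyISAM slaves for queries and 3 MyISAM servers for asynchronous computation.
Some core techniques:
1) Partition by entity (tags/posttags)
Assess the data access patterns and the balance of reads and writes, then partition along different dimensions. (Technorati's data is not updated much; otherwise this would be a database disaster.)
2) Use InnoDB and MyISAM where each fits
InnoDB is for applications that need data integrity and high write performance. MyISAM suits OLAP workloads. Each is used for what it does best.
3) MySQL replication
Data is replicated from the master database to the secondary databases, which spreads queries and asynchronous computation; replication also provides redundancy.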

后记
拜读了一个藏袍的两篇大做(mixi.jp：使用开源软件搭建的可扩展SNS网站 / FeedBurner:基于MySQL和JAVA的可扩展Web应用) 心痒难当, 顺藤摸瓜, 发现也有文档提及 Technorati , 赶紧照样学习一下. 几篇文档读罢, MySQL 的可扩展性让我刮目相看.
或许,应该把注意力留一点给 MySQL 了 .
--End.
 Myspace架构历程

亿万用户网站MySpace的成功秘密

◎ 文 / David F. Carr 译 / 罗小平

高速增长的访问量给社区网络的技术体系带来了巨大挑战。MySpace的开发者多年来不断重构站点软件、数据库和存储系统，以期与自身的成长同步——目前，该站点月访问量已达400亿。绝大多数网站需要应对的流量都不及MySpace的一小部分，但那些指望迈入庞大在线市场的人，可以从MySpace的成长过程学到知识。

用户的烦恼
Drew，是个来自达拉斯的17岁小伙子，在他的MySpace个人资料页上，可以看到他的袒胸照，看样子是自己够着手拍的。他的好友栏全是漂亮姑娘和靓车的链接，另外还说自己参加了学校田径队，爱好吉他，开一辆蓝色福特野马。
不过在用户反映问题的论坛里，似乎他的火气很大。“赶紧弄好这该死的收件箱！”他大写了所有单词。使用MySpace的用户个人消息系统可以收发信息，但当他要查看一条消息时，页面却出现提示：“非常抱歉……消息错误”。

Drew的抱怨说明1.4亿用户非常重视在线交流系统，这对MySpace来说是个好消息。但也恰是这点让MySpace成了全世界最繁忙的站点之一。
11月，MySpace的美国国内互联网用户访问流量首次超过Yahoo。comScore Media Metrix公司提供的资料显示，MySpace当月访问量为387亿，而Yahoo是380.5亿。
显然，MySpace的成长太快了——从2003年11月正式上线到现在不过三年。这使它很早就要面对只有极少数公司才会遇到的高可扩展性问题的严峻挑战。
事实上，MySpace的Web服务器和数据库经常性超负荷，其用户频繁遭遇“意外错误”和“站点离线维护”等告示。包括Drew在内的MySpace用户经常无法收发消息、更新个人资料或处理其他日常事务，他们不得不在论坛抱怨不停。

尤其是最近，MySpace可能经常性超负荷。因为Keynote Systems公司性能监测服务机构负责人Shawn White说，“难以想象，在有些时候，我们发现20%的错误日志都来自MySpace，有时候甚至达到30%以至40%……而Yahoo、Salesforce.com和其他提供商用服务的站点，从来不会出现这样的数字。”他告诉我们，其他大型站点的日错误率一般就1%多点。

顺便提及，MySpace在2006年7月24号晚上开始了长达12小时的瘫痪，期间只有一个可访问页面——该页面解释说位于洛杉矶的主数据中心发生故障。为了让大家耐心等待服务恢复，该页面提供了用Flash开发的派克人（Pac-Man）游戏。Web站点跟踪服务研究公司总经理Bill Tancer说，尤其有趣的是，MySpace瘫痪期间，访问量不降反升，“这说明了人们对MySpace的痴迷——所有人都拥在它的门口等着放行”。
现Nielsen Norman Group 咨询公司负责人、原Sun Microsystems公司工程师，因在Web站点方面的评论而闻名的Jakob Nielsen说，MySpace的系统构建方法显然与Yahoo、eBay以及Google都不相同。和很多观察家一样，他相信MySpace对其成长速度始料未及。“虽然我不认为他们必须在计算机科学领域全面创新，但他们面对的的确是一个巨大的科学难题。”他说。

MySpace开发人员已经多次重构站点软件、数据库和存储系统，以满足爆炸性的成长需要，但此工作永不会停息。“就像粉刷金门大桥，工作完成之时，就是重新来过之日。”（译者注：意指工人从桥头开始粉刷，当到达桥尾时，桥头涂料已经剥落，必须重新开始）MySpace技术副总裁Jim Benedetto说。
既然如此，MySpace的技术还有何可学之处？因为MySpace事实上已经解决了很多系统扩展性问题，才能走到今天。

Benedetto说他的项目组有很多教训必须总结，他们仍在学习，路漫漫而修远。他们当前需要改进的工作包括实现更灵活的数据缓存系统，以及为避免再次出现类似7月瘫痪事件的地理上分布式架构。

背景知识
MySpace目前的努力方向是解决扩展性问题，但其领导人最初关注的是系统性能。
3年多前，一家叫做Intermix Media（早先叫eUniverse。这家公司从事各类电子邮件营销和网上商务）的公司推出了MySpace。而其创建人是Chris DeWolfe和Tom Anderson，他们原来也有一家叫做ResponseBase的电子邮件营销公司，后于2002年出售给Intermix。据Brad Greenspan（Intermix前CEO）运作的一个网站披露，ResponseBase团队为此获得2百万美金外加分红。Intermix是一家颇具侵略性的互联网商务公司——部分做法可能有点过头。2005年，纽约总检察长Eliot Spitzer——现在是纽约州长——起诉Intermix使用恶意广告软件推广业务，Intermix最后以790万美元的代价达成和解。

2003年，美国国会通过《反垃圾邮件法》（CAN-SPAM Act），意在控制滥发邮件的营销行为。Intermix领导人DeWolfe和Anderson意识到新法案将严重打击公司的电子邮件营销业务，“因此必须寻找新的方向。”受聘于Intermix负责重写公司邮件营销软件的Duc Chau说。
当时有个叫Friendster的交友网站，Anderson和DeWolfe很早就是它的会员。于是他们决定创建自己的网上社区。他们去除了Friendster在用户自我表述方面的诸多限制，并重点突出音乐（尤其是重金属乐），希望以此吸引用户。Chau使用Perl开发了最初的MySpace版本，运行于Apache Web服务器，后台使用MySQL数据库。但它没有通过终审，因为Intermix的多数开发人员对ColdFusion（一个Web应用程序环境，最初由Allaire开发，现为Adobe所有）更为熟悉。因此，最后发布的产品采用ColdFusion开发，运行在Windows上，并使用MS SQL Server作为数据库服务器。
Chau就在那时离开了公司，将开发工作交给其他人，包括Aber Whitcomb（Intermix的技术专家，现在是MySpace技术总监）和Benedetto（MySpace现技术副总裁，大概于MySpace上线一个月后加入）。

MySpace上线的2003年，恰恰是Friendster在满足日益增长的用户需求问题上遭遇麻烦的时期。在财富杂志最近的一次采访中，Friendster总裁Kent Lindstrom承认他们的服务出现问题选错了时候。那时，Friendster传输一个页面需要20到30秒，而MySpace只需2到3秒。

结果，Friendster用户开始转投MySpace，他们认为后者更为可靠。
今天，MySpace无疑已是社区网站之王。社区网站是指那些帮助用户彼此保持联系、通过介绍或搜索、基于共同爱好或教育经历交友的Web站点。在这个领域比较有名的还有最初面向大学生的Facebook、侧重职业交流的LinkedIn，当然还少不了Friendster。MySpace宣称自己是“下一代门户”，强调内容的丰富多彩（如音乐、趣事和视频等）。其运作方式颇似一个虚拟的夜总会——为未成年人在边上安排一个果汁吧，而显著位置则是以性为目的的约会，和寻找刺激派对气氛的年轻人的搜索服务。

用户注册时，需要提供个人基本信息，主要包括籍贯、性取向和婚姻状况。虽然MySpace屡遭批评，指其为网上性犯罪提供了温床，但对于未成年人，有些功能还是不予提供的。
MySpace的个人资料页上表述自己的方式很多，如文本式“关于本人”栏、选择加载入MySpace音乐播放器的歌曲，以及视频、交友要求等。它还允许用户使用CSS（一种Web标准格式语言，用户以此可设置页面元素的字体、颜色和页面背景图像）自由设计个人页面，这也提升了人气。不过结果是五花八门——很多用户的页面布局粗野、颜色迷乱，进去后找不到东南西北，不忍卒读；而有些人则使用了专业设计的模版（可阅读《Too Much of a Good Thing?》第49页），页面效果很好。
在网站上线8个月后，开始有大量用户邀请朋友注册MySpace，因此用户量大增。“这就是网络的力量，这种趋势一直没有停止。”Chau说。

拥有Fox电视网络和20th Century Fox影业公司的媒体帝国——新闻集团，看到了他们在互联网用户中的机会，于是在2005年斥资5.8亿美元收购了MySpace。新闻集团董事局主席Rupert Murdoch最近向一个投资团透露，他认为MySpace目前是世界主要Web门户之一，如果现在出售MySpace，那么可获60亿美元——这比2005年收购价格的10倍还多！新闻集团还惊人地宣称，MySpace在2006年7月结束的财政年度里总收入约2亿美元，而且预期在2007年度，Fox Interactive公司总收入将达到5亿美元，其中4亿来自MySpace。
然而MySpace还在继续成长。12月份，它的注册账户达到1.4亿，而2005年11月时不过4千万。当然，这个数字并不等于真实的用户个体数，因为有些人可能有多个帐号，而且个人资料也表明有些是乐队，或者是虚构的名字，如波拉特（译者注：喜剧电影《Borat》主角），还有像Burger King（译者注：美国最大的汉堡连锁集团）这样的品牌名。

当然，这么多的用户不停发布消息、撰写评论或者更新个人资料，甚至一些人整天都泡在Space上，必然给MySpace的技术工作带来前所未有的挑战。而传统新闻站点的绝大多数内容都是由编辑团队整理后主动提供给用户消费，它们的内容数据库通常可以优化为只读模式，因为用户评论等引起的增加和更新操作很少。而MySpace是由用户提供内容，数据库很大比例的操作都是插入和更新，而非读取。
浏览MySpace上的任何个人资料时，系统都必须先查询数据库，然后动态创建页面。当然，通过数据缓存，可以减轻数据库的压力，但这种方案必须解决原始数据被用户频繁更新带来的同步问题。

MySpace的站点架构已经历了5个版本——每次都是用户数达到一个里程碑后，必须做大量的调整和优化。Benedetto说，“但我们始终跟不上形势的发展速度。我们重构重构再重构，一步步挪到今天”。
尽管MySpace拒绝了正式采访，但Benedetto在参加11月于拉斯维加斯召开的SQL Server Connections会议时还是回答了Baseline的问题。本文的不少技术信息还来源于另一次重要会议——Benedetto和他的老板——技术总监Whitcomb今年3月出席的Microsoft MIX Web开发者大会。
据他们讲，MySpace很多大的架构变动都发生在2004和2005年早期——用户数在当时从几十万迅速攀升到了几百万。

在每个里程碑，站点负担都会超过底层系统部分组件的最大载荷，特别是数据库和存储系统。接着，功能出现问题，用户失声尖叫。最后，技术团队必须为此修订系统策略。
虽然自2005年早期，站点账户数超过7百万后，系统架构到目前为止保持了相对稳定，但MySpace仍然在为SQL Server支持的同时连接数等方面继续攻坚，Benedetto说，“我们已经尽可能把事情做到最好”。

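Milestone 1: 500,000 accounts
According to Benedetto, MySpace's first system was small: two Web servers and one database server, all Dell machines with dual CPUs and 4 GB of memory.
A single database meant that all data lived in one place, with the two Web servers sharing the work of handling user requests. As with every later revision of the underlying system, this three-server architecture was soon overwhelmed. For a while, MySpace coped with the surge in users mainly by adding more Web servers.
By early 2004, however, when MySpace reached 500,000 accounts, the database server was struggling as well.
Unlike Web servers, databases cannot simply be multiplied. Once a site is backed by several databases, its designers have to work out how to spread the load across them while keeping the data consistent.
In the second-generation architecture, MySpace ran on three SQL Server database servers. One acted as the master: all new data was written to it and then replicated to the other two. Those two concentrated on serving reads for blogs and profile pages. This worked well for a while; handling more users and traffic was a matter of adding database servers and bigger disks.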
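The one-master, two-reader layout can be sketched as a small router. This is only an illustration under stated assumptions: the class name, the server names, the round-robin policy, and the rule that any statement starting with SELECT is a read are all hypothetical and are not taken from MySpace's code.

```python
import itertools


class ReplicatedDatabase:
    """Send writes to the master and spread reads over the replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        # Round-robin over replicas; an assumed policy, chosen for simplicity.
        self._next_replica = itertools.cycle(self.replicas)

    def server_for(self, sql):
        # Assumption: anything that is not a plain SELECT changes data.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._next_replica) if is_read else self.master


db = ReplicatedDatabase("master", ["replica-1", "replica-2"])
print(db.server_for("INSERT INTO blog VALUES (1, 'hello')"))
print(db.server_for("SELECT * FROM blog"))
print(db.server_for("SELECT * FROM profile"))
```

The sketch ignores replication delay: a read issued right after a write may not yet see the new row on a replica.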

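Milestone 2: 1 to 2 million accounts
Once registrations reached the 1 to 2 million range, the database servers became constrained by I/O capacity, that is, by how fast they could read and write data. This was only mid-2004, a few months after the previous database overhaul. User submissions backed up, like a thousand fans trying to squeeze into a nightclub that holds a few hundred. The site had hit its "principal contradiction", Benedetto said, meaning MySpace would always lag slightly behind what its users demanded.
"Some people could not finish posting a comment in five minutes, so users kept complaining that the site was broken," he added.
This time the database architecture was redesigned around vertical partitioning: separate databases served separate functions of the site, such as login, user profiles, and blogs. The scalability problem once again seemed to be settled for a while.
Vertical partitioning lets several databases share the access load, and whenever users asked for a new feature, MySpace brought a new database online to support it. At 2 million accounts, MySpace also moved from storage devices attached directly to the database servers to a SAN (Storage Area Network), in which a high-bandwidth, purpose-built network links a large pool of disk storage and the databases connect to the SAN. The change greatly improved performance, uptime, and reliability, Benedetto said.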

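Milestone 3: 3 million accounts
As the user base grew to 3 million, vertical partitioning also became hard to sustain. Although the site's applications were designed to be highly independent, some information had to be shared. In this architecture every database needed its own copy of the user table, the electronic roster of MySpace's authorized users. When a user signed up, the account record therefore had to be created in nine different databases. Occasionally one of those database servers was temporarily unreachable; the corresponding transaction then failed, the account was only partly created, and that user could not use the affected service.
A second problem was that individual applications such as blogs grew so fast that the database dedicated to them came under enormous pressure.
In mid-2004 MySpace faced the choice Web developers call "scale up" versus "scale out" (translator's note: also described as hardware scaling and software scaling): either move to bigger, more powerful, and more expensive servers, or deploy many relatively cheap servers to share the database load. Large sites generally prefer to scale out, because it keeps open the option of adding servers to raise capacity.
Scaling out successfully, however, means solving hard distributed-computing problems, and large sites such as Google, Yahoo, and Amazon.com have all had to develop much of the technology themselves. Google, for example, built its own distributed file system.
Scaling out also requires rewriting a great deal of existing software so that it can run across distributed servers. "If you get it wrong, all of the developers' work is wasted," Benedetto said.
So MySpace first concentrated on scaling up, spending about a month and a half studying an upgrade to a 32-CPU server that could manage a larger database. "At the time, this looked like it might solve everything," Benedetto said, stability included, and best of all it required almost no changes to the existing software.
The trouble was that high-end servers are extremely expensive, costing many times as much as several servers with the same combined processing power and memory speed. The site's architects also predicted that even a giant database would eventually be overwhelmed. "In other words, as long as the growth continued, we would have to scale out in the end anyway," Benedetto said.
MySpace therefore turned to a distributed computing architecture, in which many physically separate servers must behave logically like a single machine. For the database, this meant it could no longer split the applications apart and back each with its own database; the whole site had to be treated as one application. The data model now has a single user table, and the data behind blogs, profiles, and other core functions is stored in the same database.
With all core data organized logically into one database, MySpace needed a new way to spread the load, since a single database server on commodity hardware clearly could not cope. Instead of splitting databases by function and application, MySpace began dividing its users into groups of one million and storing all the data for each group in a separate SQL Server instance. Today each MySpace database server runs two SQL Server instances, so each server serves about 2 million users. Benedetto pointed out that the same scheme could later be applied at a finer granularity to balance the load better.
There is still one special database that holds the names and passwords of all accounts. After a user logs in, the database holding the rest of that user's data takes over. The user table in this special database is huge, but it handles only logins, so its load remains fairly easy to control.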
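The per-million partitioning can be illustrated with a few lines of arithmetic. This is a sketch only: it assumes user IDs are handed out sequentially starting at 1, which the article does not state, and the function name is hypothetical.

```python
USERS_PER_INSTANCE = 1_000_000  # from the article: one SQL Server instance per million users
INSTANCES_PER_SERVER = 2        # from the article: two instances per database server


def locate_user(user_id):
    """Return (server_index, instance_index) for a user id.

    Assumes sequential ids starting at 1; indexes are zero-based.
    """
    if user_id < 1:
        raise ValueError("user ids are assumed to start at 1")
    instance = (user_id - 1) // USERS_PER_INSTANCE
    return instance // INSTANCES_PER_SERVER, instance % INSTANCES_PER_SERVER


print(locate_user(1))          # first user, first instance of the first server
print(locate_user(1_000_001))  # second million: second instance, same server
print(locate_user(2_000_001))  # third million: first instance of the next server
```

A range scheme like this keeps the lookup trivial, but as the seventh-database incident later in the article shows, it does not guarantee that every range carries the same load.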

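Milestone 4: 9 million to 17 million accounts
In early 2005, at 9 million accounts, MySpace began writing ASP.NET programs in Microsoft's C#. C# is the latest language derived from C; it borrows strengths from C++ and Java and is built on the Microsoft .NET Framework, Microsoft's model architecture for software components and distributed computing. ASP.NET evolved from ASP, the earlier technology for scripting Web sites, and is the Web programming environment Microsoft now promotes.
The effect was immediate. MySpace found that ASP.NET programs ran more efficiently and needed less processor power than ColdFusion for the same tasks. According to chief technology officer Whitcomb, work that takes 150 servers with the new code would take 246 servers with ColdFusion. Benedetto added that another reason for the gain may be that, while changing platforms and rewriting code in a new language, the programmers reviewed and streamlined some of the functional flows.

Eventually MySpace began migrating to ASP.NET on a large scale. Even the small amount of remaining ColdFusion code was moved from ColdFusion servers to ASP.NET with the help of BlueDragon.NET, a product from New Atlanta Communications of Alpharetta, Georgia, that automatically recompiles ColdFusion code for the Microsoft platform.
At 10 million accounts MySpace hit a storage bottleneck again. The SAN had solved some early performance problems, but the site's demands were now periodically exceeding the SAN's I/O capacity, the maximum speed at which it can read from and write to disk storage.
One cause was the partitioning scheme of one million accounts per database. It usually spreads the load evenly across servers, but not always. The seventh account database, for example, filled up only seven days after going online, mainly because fans of a Florida band signed up in droves.
Any database can come under heavy load at any time and for any reason, and when it does, the cluster of disk storage devices in the SAN bound to that database can be overloaded. "The SAN gave us much more disk I/O capacity, but binding it to specific databases was a mistake," Benedetto said.
At first MySpace largely solved the problem by periodically redistributing the data in the SAN to balance it, but that was a manual process, "about a full-time job for two people," Benedetto said.
The long-term solution was to move to virtualized storage, in which the whole SAN is treated as one giant storage pool and no disk is dedicated to a particular application. MySpace now uses a newer kind of SAN equipment from 3PARdata of Fremont, California.
In the 3PAR system, storage can still be divided logically by capacity, but a volume is no longer bound to a specific disk or disk cluster; it is spread across a large number of disks. That makes it possible to even out the data-access load. When a database needs to write a chunk of data, any idle disk can take the write at once instead of the request blocking on a disk array that may already be overloaded. And because copies of the data exist on several disks, reads do not overload any single SAN component either.
When accounts reached 17 million in the spring of 2005, MySpace adopted another measure to relieve the storage system: a data caching tier between the Web servers and the database servers. Its only job is to keep in-memory copies of frequently requested data objects, so that the Web application can be served without touching the database. In other words, if 100 users request the same profile, what used to take 100 database queries now takes one, and the rest are answered from the cache. When a page changes, the cached data must be evicted from memory and fetched from the database again, but until then the database load is greatly reduced and the whole site performs better.
The cache is also a staging post for data that never needs to be recorded in the database, such as the temporary files created to track user sessions. Benedetto admits he had something to learn here: "I'm a database and storage fanatic, so I always want to put everything in the database." But putting data such as session tracking into the database would bog the site down.
Adding cache servers is "something we should have done from the start, but we were growing so fast that we never had time to sit down and work it out," Benedetto added.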
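The "100 requests, one query" effect of a caching tier can be shown with a minimal cache-aside sketch. The class and its in-process dictionary are assumptions made for illustration; the article does not describe how MySpace's cache tier was implemented.

```python
class CachedProfiles:
    """Cache-aside lookup: consult memory first, fall back to the database."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical database accessor
        self._cache = {}
        self.db_queries = 0

    def get(self, user_id):
        if user_id not in self._cache:
            # Cache miss: query the database once and remember the result.
            self.db_queries += 1
            self._cache[user_id] = self._load_from_db(user_id)
        return self._cache[user_id]

    def invalidate(self, user_id):
        # Called when the profile changes, so the next read reloads it.
        self._cache.pop(user_id, None)


profiles = CachedProfiles(lambda uid: {"id": uid, "name": f"user{uid}"})
for _ in range(100):
    profiles.get(42)
print(profiles.db_queries)  # 1

profiles.invalidate(42)
profiles.get(42)
print(profiles.db_queries)  # 2
```

The invalidate call is the part that needs care in practice: every code path that updates a profile has to evict the cached copy, otherwise readers keep seeing stale data.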

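Milestone 5: 26 million accounts
In mid-2005, at 26 million accounts, MySpace switched to SQL Server 2005 while it was still in beta. Why the hurry? The common assumption is that it was because the 2005 release supports 64-bit processors. "That was not the main reason, although it matters too," Benedetto said. "The main reason was our hunger for memory." A 64-bit database can manage more memory.
More memory means higher performance and greater capacity. A server running the 32-bit version of SQL Server could use at most 4 GB of memory at a time. Moving to 64-bit was like widening the water pipe. After upgrading to SQL Server 2005 and 64-bit Windows Server 2003, MySpace fitted each server with 32 GB of memory, and in 2006 raised the standard configuration to 64 GB.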

意外错误
如果没有对系统架构的历次修改与升级，MySpace根本不可能走到今天。但是，为什么系统还经常吃撑着了？很多用户抱怨的“意外错误”是怎么引起的呢？
原因之一是MySpace对Microsoft的Web技术的应用已经进入连Microsoft自己也才刚刚开始探索的领域。比如11月，超出SQL Server最大同时连接数，MySpace系统崩溃。Benedetto说，这类可能引发系统崩溃的情况大概三天才会出现一次，但仍然过于频繁了，以致惹人恼怒。一旦数据库罢工，“无论这种情况什么时候发生，未缓存的数据都不能从SQL Server获得，那么你就必然看到一个‘意外错误’提示。”他解释说。
去年夏天，MySpace的Windows 2003多次自动停止服务。后来发现是操作系统一个内置功能惹的祸——预防分布式拒绝服务攻击（黑客使用很多客户机向服务器发起大量连接请求，以致服务器瘫痪）。MySpace和其他很多顶级大站点一样，肯定会经常遭受攻击，但它应该从网络级而不是依靠Windows本身的功能来解决问题——否则，大量MySpace合法用户连接时也会引起服务器反击。
“我们花了大约一个月时间寻找Windows 2003服务器自动停止的原因。”Benedetto说。最后，通过Microsoft的帮助，他们才知道该怎么通知服务器：“别开枪，是友军。”
紧接着是在去年7月某个周日晚上，MySpace总部所在地洛杉矶停电，造成整个系统停运12小时。大型Web站点通常要在地理上分布配置多个数据中心以预防单点故障。本来，MySpace还有其他两个数据中心以应对突发事件，但Web服务器都依赖于部署在洛杉矶的SAN。没有洛杉矶的SAN，Web服务器除了恳求你耐心等待，不能提供任何服务。
Benedetto说，主数据中心的可靠性通过下列措施保证：可接入两张不同电网，另有后备电源和一台储备有30天燃料的发电机。但在这次事故中，不仅两张电网失效，而且在切换到备份电源的过程中，操作员烧掉了主动力线路。
2007年中，MySpace在另两个后备站点上也建设了SAN。这对分担负荷大有帮助——正常情况下，每个SAN都能负担三分之一的数据访问量。而在紧急情况下，任何一个站点都可以独立支撑整个服务，Benedetto说。
MySpace仍然在为提高稳定性奋斗，虽然很多用户表示了足够信任且能原谅偶现的错误页面。
“作为开发人员，我憎恶Bug，它太气人了。”Dan Tanner这个31岁的德克萨斯软件工程师说，他通过MySpace重新联系到了高中和大学同学。“不过，MySpace对我们的用处很大，因此我们可以原谅偶发的故障和错误。” Tanner说，如果站点某天出现故障甚至崩溃，恢复以后他还是会继续使用。
这就是为什么Drew在论坛里咆哮时，大部分用户都告诉他应该保持平静，如果等几分钟，问题就会解决的原因。Drew无法平静，他写道，“我已经两次给MySpace发邮件，而它说一小时前还是正常的，现在出了点问题……完全是一堆废话。”另一个用户回复说，“毕竟它是免费的。”Benedetto坦承100%的可靠性不是他的目标。“它不是银行，而是一个免费的服务。”他说。
换句话说，MySpace的偶发故障可能造成某人最后更新的个人资料丢失，但并不意味着网站弄丢了用户的钱财。“关键是要认识到，与保证站点性能相比，丢失少许数据的故障是可接受的。”Benedetto说。所以，MySpace甘冒丢失2分钟到2小时内任意点数据的危险，在SQL Server配置里延长了“checkpoint”操作——它将待更新数据永久记录到磁盘——的间隔时间，因为这样做可以加快数据库的运行。
Benedetto说，同样，开发人员还经常在几个小时内就完成构思、编码、测试和发布全过程。这有引入Bug的风险，但这样做可以更快实现新功能。而且，因为进行大规模的真实负载测试不具可行性，他们的测试通常直接在线上站点进行，仅以部分活跃用户为对象，且这些用户并不知道自己在试用新功能和改进。
“我们犯过大量错误，”Benedetto说，“但到头来，我认为我们做对的还是比做错的多。”


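 eBay's data volume
Author: Fenng (may be reproduced; reproductions must credit the original source, the author, and this copyright notice with a hyperlink)
URL: http://www.dbanotes.net/database/ebay_storage.html
Just how much data does eBay, the leader in e-commerce, handle? Many readers are probably curious. In the report Web 2.0: How High-Volume eBay Manages Its Storage (a lead found via +1 GB/1 min), eBay's storage chief Paul Strong gave some figures. They are only a glimpse, but they make a useful reference.
Site throughput
• More than 1 billion page views a day on average;
• About 1,700 US dollars of goods traded every second;
• A car sold every minute;
• A car accessory or part sold every second;
• A piece of diamond jewelry sold every two minutes;
• 600 million items and more than 200 million registered users; more than 1.3 million people regard doing business on eBay as part of their lives.
Under this load, reliability reached 99.94%, which means a little over 5 hours of unavailability a year. Industry reports suggest that availability of the core business is higher than this.
The data storage engineering group controls 2 PB of usable space at eBay (1 petabyte = 1,000 terabytes). To get a sense of the scale, compare it with Google's storage. About 10 TB is allocated every week, which works out to roughly 1 GB of storage consumed per minute.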
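The two derived figures above, yearly downtime and storage consumed per minute, can be checked with a short calculation. The inputs are the numbers quoted in the text; the only assumptions are a 365-day year and decimal units (1 TB = 1,000 GB).

```python
HOURS_PER_YEAR = 365 * 24          # assumes a non-leap year
MINUTES_PER_WEEK = 7 * 24 * 60

availability = 0.9994              # 99.94% as quoted
downtime_hours = (1 - availability) * HOURS_PER_YEAR

tb_per_week = 10                   # storage allocated each week, as quoted
gb_per_minute = tb_per_week * 1000 / MINUTES_PER_WEEK

print(f"downtime per year: {downtime_hours:.2f} hours")
print(f"storage growth: {gb_per_minute:.2f} GB per minute")
```

Both results agree with the text: a little over 5 hours of downtime a year, and just under 1 GB of new storage per minute.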
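Computing capacity
eBay uses a conventional grid computing system. Some figures for it:
• 170 Win2000/Win2003 servers;
• 170 Linux (RHES3) servers;
• Three Solaris servers: building and deploying eBay.com for QA; compiling and optimizing Java, C++, and other Web elements;
• Time to build the whole site: formerly 10 hours, now 30 minutes;
• 2 million builds in the past two and a half years, a startling number.
Storage hardware
Every supplier has to pass strict testing before it can be selected. The vendors and products are:
• Switches: Brocade
• Network management software: IBM Tivoli
• NAS: NetApp (5% of total data; 2 PB × 0.05 is about 100 TB)
• Array storage: HDS (95%; no small investment, since HDS is not cheap; EMC lost out at eBay)
• Load balancing and failover: Resonate
• Search: Thunderstone indexing system
• Database software: Oracle. Most databases have 4 copies. The database servers are Sun E10000s. As far as I know, eBay also bought a global license for Quest SharePlex for data replication.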
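Application servers

What are the characteristics of the application servers?
• A single two-tier architecture (I have some doubts about this; it looks like an application server they wrote themselves)
• 3.3 million lines of C++ in an ISAPI DLL (the binary is 150 MB)
• Hundreds of engineers working on it
• The number of methods per class is close to the compiler's limit
Interestingly, the eWeek article still contained the passage above yesterday (it was struck through in the original post); today I found it had been changed to:
Architecture
• Highly distributed
• The auction site is Java-based; the search infrastructure is written in C++
• Hundreds of engineers working on it, all in the same code environment
Perhaps the interviewee saw the eWeek report and asked the reporter to correct it. The original "two-tier" claim had puzzled me too.
Other information
• Centralized storage of application logs;
• Global billing: real-time integration with a third-party application (presumably eBay's own PayPal?)
• Business event stream: a unified, efficient, reliable message queue, plus a cookie-cutter pattern to optimize user experience (this seems to be a technique widely used by large e-commerce sites).
Postscript
These are just scattered notes. As a DBA, perhaps one day I will get to face data at this scale myself. When that day comes, I will look back at this piece of electronic junk.
Update: for more detail see Web 2.0: How High-Volume eBay Manages Its Storage. Possibly because of caching, several people have seen different versions of the original article.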

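 eBay's application server scale
Author: Fenng (may be reproduced; reproductions must credit the original source, the author, and this copyright notice with a hyperlink)
URL: http://www.dbanotes.net/web/ebay_application_server.html
In "eBay's data volume" I passed on some second-hand information about the server architecture of the Internet giant eBay, but a few key figures were missing.
A white paper on Oracle's site, The eBay Global Platform and Oracle 10g JDBC, supplies some of them.
In 2004 eBay's application servers ran IBM WebSphere on WinNT, on dual-CPU Intel Pentium servers. There were 2,400 of them. As noted in "eBay's data volume", eBay handles logs centrally; 2 TB of log data was generated every day, and it can only be more now. The application servers are divided into groups and reach 135 database nodes through a unified DAL (database access layer) logic layer.
The white paper was published two years ago, and the server fleet has surely grown a lot since.
[Figure: eBay SOA architecture V3 diagram; the image and its source link are not preserved here.]
An earlier post of mine, "What operating systems and Web servers do these big sites use?", drew doubts from readers that eBay's servers ran WinNT. This now indirectly confirms that the Web servers are indeed Windows.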

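 eBay's scale-out database architecture
Author: Fenng (may be reproduced; reproductions must credit the original source, the author, and this copyright notice with a hyperlink)
URL: http://www.dbanotes.net/database/ebay_database_scale_out.html
In earlier posts I (an aside: writing "Fenng" here instead of "I" would feel a bit narcissistic and is not a tone I like, but far too many people repost without attribution, damn it) covered "eBay's application server scale" and "eBay's data volume". The latter mentioned that "eBay bought a global license for Quest SharePlex for data replication", without elaborating.
For a site as large as eBay, the bottleneck most often arises at the database servers. Some data (transaction records, for instance, which are hard to partition horizontally) inevitably attracts heavy read traffic, and whatever storage is used, its I/O capacity is finite. So how to spread the I/O load effectively is a question worth answering.
Some persistent Internet archaeology has gradually turned up a few clues that shed light on this. Combining objective facts with some guesswork on my part, here is a brief description.
[Figure: eBay database replication and load-balancing diagram; the image is not preserved here.]
Quest's SharePlex replicates data to other database nodes in near real time; F5 checks database status through a dedicated module and balances the load. I/O is successfully distributed, reads and writes are separated, and availability improves greatly. F5 is a truly innovative company: the technology in this case is nothing profound, but the approach is clever and the solution hangs together well.
F5 developed a dedicated health-check module for Oracle 9i. By invoking F5's proprietary Extended Application Verification (EAV) process, F5 can determine at any time whether the Oracle 9i database is able to serve at the application layer, rather than health-checking at the ICMP/TCP layer as other load balancers do.
The diagram comes from a promotional article titled "F5 helps eBay balance load across database servers". It is a very good promotional piece; I doubt anything with this much substance would appear abroad.
Of course, this architecture is not cheap. SharePlex licenses are expensive, and every node needs its own database license and hardware. But there are many advantages: lower maintenance cost, SOA-style access at the database layer, and high availability.
Some vendors in China like to push storage-level solutions, using replication at the storage layer to handle data distribution and disaster recovery. That approach seems too traditional and is somewhat dated for Internet companies.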
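 Large-scale site performance tuning, as seen in the growth of LiveJournal's backend
于敦德 2006-3-16
I. LiveJournal's history
LiveJournal began in 1999 as a campus project. A few people built it as a hobby, to provide:
• Blogs and forums
• Social networking: finding friends
• Aggregation: collecting friends' posts in one place
LiveJournal uses a great deal of open source software and is itself open source.
After launch, LiveJournal grew very quickly:
• April 2004: 2.8 million registered users.
• April 2005: 6.8 million registered users.
• August 2005: 7.9 million registered users.
• More than a thousand page requests served per second.
• A large number of MySQL servers.
• A large number of general-purpose components.
II. Overview of LiveJournal's current architecture

III. Learning from LiveJournal's growth
LiveJournal grew from 1 server to 100, with countless painful episodes along the way, but it also worked out how to solve the problems. Studying LiveJournal lets us avoid the mistakes LJ made and design a system well from the start, sparing ourselves pain later.
Let us follow LJ's growth step by step.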
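1. One server
LJ first ran on a single donated server, much like the battered machines Google started out on, which deserves our respect. During this stage the LJ team learned Unix administration at remarkable speed. The server did have performance problems, but small fixes were enough to get by. It was also at this stage that LJ moved from CGI to FastCGI.
Eventually the real problem arrived: the site kept getting slower until optimization could no longer help, and more servers were needed. LJ then introduced paid accounts, presumably to raise money for new servers and get out of the bind.
Needless to say, LJ had a massive single point of failure at the time: everything lived inside that one metal box.

2. Two servers
With the money from paid accounts LJ bought two servers: Kenny, a 6U Dell serving the Web, and Cartman, a 6U Dell serving the database.

LJ now had bigger disks and more computing resources. The network was still very simple: each machine had two network cards, and Cartman served MySQL to Kenny over the internal network.

The load problem was solved for the moment, but new problems appeared:
• One single point of failure had become two.
• There was no cold or hot backup.
• The site was getting slow again; it was simply growing too fast.
• CPU on the Web server had hit its limit, so more Web servers were needed.
3. Four servers
Two more machines were bought, Kyle and Stan, both 1U this time and both serving the Web. LJ now had three Web servers and one database server, so load had to be balanced across the three Web servers.

LJ made Kenny the external gateway and used mod_backhand for load balancing.
Then problems appeared again:
• Single points of failure. The database and the Web server acting as gateway were both single points; if either failed, the whole service went down. The gateway could in principle fail over quickly using heartbeat synchronization, although LJ did not set that up at the time, but that would still leave the database as a single point.
• The site slowed down again, this time because of I/O and the database. The question was how to add more databases to the application.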
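4. Five servers
Another database server was bought. The two database servers used replication (the master-slave mode MySQL supports). All writes go to the master (through the binlog, writes on the master are quickly replicated to the slave), while reads are served by both databases, which is a form of load balancing.

Points to watch when setting up replication:
• Choose the database for each read with an algorithm that picks the one with the lighter current load.
• Only reads may be performed on the slave.
• Be prepared for replication lag; handled badly, it can break replication. Only writes need this check; reads have no synchronization problem.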
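The three points above can be combined into a small server-selection routine. This is a sketch under assumptions: the dictionary keys, the load metric, and the lag threshold are all hypothetical, and LJ's actual selection logic is not described in the text.

```python
def pick_read_server(master, slaves, max_lag_seconds=5):
    """Pick the least-loaded server that is fresh enough to read from.

    master and each slave are dicts with hypothetical keys 'name' and
    'load' (lower is better); slaves also carry 'lag_seconds'.
    """
    # The master is always current; a slave qualifies only if its lag is small.
    candidates = [master] + [s for s in slaves if s["lag_seconds"] <= max_lag_seconds]
    return min(candidates, key=lambda s: s["load"])["name"]


def pick_write_server(master, slaves):
    # Writes never go to a slave.
    return master["name"]


master = {"name": "db-master", "load": 0.9}
slaves = [
    {"name": "db-slave-1", "load": 0.2, "lag_seconds": 1},
    {"name": "db-slave-2", "load": 0.1, "lag_seconds": 60},  # too far behind
]
print(pick_read_server(master, slaves))   # db-slave-1
print(pick_write_server(master, slaves))  # db-master
```

Skipping a lagging slave, rather than reading stale rows from it, is one simple way to be "prepared for replication lag".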
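5. More servers
With money coming in, more servers were naturally bought. The site was fast for a short while after deployment and then slowed down again. This time there were more Web servers and more database servers, with contention for I/O and CPU. BIG-IP was adopted as the load-balancing solution.

6. Where we are now:

There are now more or less enough servers, but performance is still a problem, and the cause lies in the architecture.
The database architecture is the biggest problem. Every added database joins the application as a slave. The only benefit is that reads are spread over more machines; the cost is that every write is fanned out and must be executed on every machine. The more servers there are, the greater the waste, and as writes increase, less and less capacity is left for serving reads.

[Figures not preserved; captions: "From one server to two" and "End result"]

We now realize that there is no need to keep a copy of all this data on so many servers. The servers already use RAID and the databases are backed up, so this many copies is pure waste, redundancy taken to an extreme. Why not distribute the data instead?
With the problem identified, the next step is the fix: store different users' data on different servers, so that data is distributed and each machine serves a relatively fixed set of users. This gives a parallel architecture that scales well.
To group users, each user is assigned a cluster ID that records which group of database servers holds that user's data. Each database cluster consists of one master and a few slaves, with 2 to 3 slaves per cluster. This allocates resources sensibly: reads are spread out, while excessive redundancy and the cost of replication are avoided.

One central server (or group of servers) controls the user grouping. It stores the cluster assignment of every user; every operation on a user first asks this machine for the user's cluster ID and then fetches the data from the corresponding database cluster.
This user-based architecture is already very close to LJ's current one.
Points to watch in the implementation:
• Do not use auto-increment IDs inside a database cluster, so that users can later be moved between clusters to even out I/O, disk space, and load.
• Keep userid and postid on the global server, where auto-increment may be used; the values in the database clusters must follow those on the global server. Use the transactional InnoDB engine on the global server.
• Take great care when moving a user between clusters; the user must not be able to write during the move.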
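The role of the global server can be sketched as a small directory that issues user IDs and remembers which cluster each user lives in. Everything here is illustrative: the class, the in-memory dictionary, and the modulo rule for the initial cluster assignment are assumptions, not LJ's implementation.

```python
import itertools


class GlobalDirectory:
    """Stand-in for the global server: issues user ids and maps users to clusters."""

    def __init__(self, cluster_count):
        self._cluster_count = cluster_count
        self._next_user_id = itertools.count(1)  # global auto-increment
        self._cluster_of = {}                    # userid -> cluster id

    def register(self):
        user_id = next(self._next_user_id)
        # Assumed placement rule; any policy works because the mapping is stored.
        self._cluster_of[user_id] = user_id % self._cluster_count
        return user_id

    def cluster_for(self, user_id):
        return self._cluster_of[user_id]

    def move(self, user_id, new_cluster):
        # The user's writes must be blocked while the data itself is copied.
        self._cluster_of[user_id] = new_cluster


directory = GlobalDirectory(cluster_count=3)
alice = directory.register()
print(alice, directory.cluster_for(alice))
directory.move(alice, 2)
print(directory.cluster_for(alice))
```

Because the mapping is stored explicitly instead of being computed from the ID, a user can be moved to another cluster without changing the user's ID, which is why IDs must come from the global server rather than from per-cluster auto-increment columns.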
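7. Where we are now

Problems:
• One global master: if it goes down, all registration and all writes go down.
• One master per database cluster: if it goes down, writes for that cluster's users go down.
• If a slave in a cluster goes down, the other servers are overloaded.
To remove the single point of failure in master-slave mode, LJ adopted master-master. Master-master is not something MySQL provides directly; it is set up by hand, with two machines each acting as both master and slave and replicating to each other.
Points to watch with master-master:
• When a master recovers from a failure, resynchronization is best done automatically by the servers.
• ID allocation: because both machines accept writes, some IDs can collide.
Solutions:
• Allocate odd and even IDs: one machine writes odd IDs, the other even.
• Allocate IDs through the global server (what LJ does).
Master-master can also be used another way. The two machines still replicate to each other, but only one serves traffic (reads and writes); the roles are swapped every night, or switched when a problem occurs.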
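The odd/even allocation option can be illustrated with two interleaved counters. This is a toy sketch of the idea only; the function name is hypothetical. In MySQL itself, the auto_increment_increment and auto_increment_offset server variables produce the same kind of interleaving.

```python
import itertools


def id_allocator(offset, step=2):
    """Yield offset, offset + step, offset + 2 * step, ...

    With step=2, offset=1 gives odd ids and offset=2 gives even ids,
    so two masters never hand out the same id.
    """
    return itertools.count(offset, step)


master_a = id_allocator(1)  # writes odd ids
master_b = id_allocator(2)  # writes even ids

ids_a = [next(master_a) for _ in range(3)]
ids_b = [next(master_b) for _ in range(3)]
print(ids_a)  # [1, 3, 5]
print(ids_b)  # [2, 4, 6]
```

The scheme needs no coordination between the masters, at the cost of IDs that no longer reflect insertion order across the pair.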
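8. Where we are now

A brief aside: MyISAM vs InnoDB.
InnoDB:
• Supports transactions
• Needs more configuration, but it is worth it: data is stored more safely and it is faster.
MyISAM:
• Logging (LJ uses it for Web access logs)
• Read-only static data; fast enough.
• Poor concurrency: reads and writes cannot proceed at the same time (appending data can)
• After an abnormal MySQL shutdown or a crash, indexes can be corrupted and must be repaired with myisamchk; under heavy traffic this happens very often.
9. Caching
Last year I wrote an article introducing memcached, a caching tool developed by the LJ team that stores data as key-value pairs in distributed memory. LJ's cache:
• 12 dedicated servers (not donated)
• 28 instances
• 30 GB total capacity
• 90-93% hit rate (anyone who has used Squid may know that Squid's combined memory and disk hit rate is around 70-80%)
How should a caching strategy be set?
Cache everything? That is impossible. Only the places that are, or may become, bottlenecks need to be cached, to improve the system's efficiency as much as possible. Analyzing the MySQL logs shows what to cache.
Drawbacks of caching?
Nothing is perfect, and caching has drawbacks too:
• More development work: cache handling needs special code.
• Harder administration: more people are needed to maintain the system.
• And of course large amounts of memory cost money.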
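10. Web load balancing
BIG-IP is used at the packet level, but BIG-IP knows nothing of our internal processing and cannot judge which server should handle a request. Reverse proxies did not do the job well either: they were either already fast enough or could not deliver what we wanted.
So LJ developed Perlbal. Features:
• A fast, small, manageable HTTP Web server/proxy
• Can forward requests internally
• Written in Perl
• Single-threaded, asynchronous, event-based, using epoll and kqueue
• Console management and remote management over HTTP, with dynamic configuration loading
• Several modes: Web server, reverse proxy, plugin
• Plugin support: GIF/PNG conversion?
11. MogileFS
LJ uses the open source MogileFS as its distributed file storage system. MogileFS is very simple to use. Its main design ideas are:
• Files belong to classes (the class is the smallest unit of replication)
• File locations are tracked
• Files are stored on different hosts
• A MySQL cluster stores the placement information centrally
• Large-capacity, cheap disks
That is all for now. More documents are available at http://www.danga.com/words/. The people at Danga.com and LiveJournal.com have presented this material at two MySQL conferences, two OSCONs, and many other events, sharing their experience generously, which is something to learn from. In the Web 2.0 era rapid development gets more and more attention, but good design is still the foundation of every application. I hope that Web 2.0 sites on their way into the Top 500 will not find their growth held back by their architecture.
Reference: http://www.danga.com/words/2005_oscon/oscon-2005.pdf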
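 Craigslist's database architecture
Author: Fenng (may be reproduced; reproductions must credit the original source, the author, and this copyright notice with a hyperlink)
URL: http://www.dbanotes.net/database/craigslist_database_arch.html
(A news flash: I am bidding RMB 85 for this copy of Don't Make Me Think. Comments do not count, and surely nobody will bid maliciously? You have to ping it over; that failed once, so I am trying again.)
Craigslist is without doubt one of the Internet's legends. According to an earlier report:
More than 10 million people use the site every month, monthly page views exceed 3 billion (nearly 1 billion new posts on Craigslist every month??), and the number of pages is growing nearly a hundredfold a year. Yet Craigslist still has only 18 employees (perhaps a few more by now).
Tim O'Reilly interviewed Eric Scheide of Craigslist, and the article Database War Stories #5: craigslist tells us something about Craigslist's database architecture and data volumes.
The database software is MySQL. To get the most out of MySQL, the databases all run on 64-bit Linux servers with 14 local disks (72 × 14 = 1 TB?) and 16 GB of memory.
Different services use database clusters set up in different ways.
Forums
1 master, 1 slave. The slave is mostly used for backups. MyISAM tables. Indexes reach 17 GB. The largest table has nearly 42 million rows.
Classifieds
1 master, 12 slaves. The slaves each have their own purpose. Current data including indexes is 114 GB, and the largest table has 56 million rows (its data is archived periodically). MyISAM is used. How large is the classifieds volume? "Nearly 1 billion new posts on Craigslist every month" seems exaggerated. Eric Scheide says there were more than 330,000 new entries yesterday; at that rate (330,000 × 30), new posts come to about 10 million a month.
Archive database
1 master, 1 slave. Holds all posts older than 3 months. Similar in structure to the classifieds database but larger: 238 GB of data, and the largest table has 96 million rows. Merge tables are used heavily for easier management.
Search database
4 clusters on 16 servers. Active posts are partitioned by area/category and indexed with MyISAM full-text indexes, each holding only a subset of the data. The indexing scheme is holding up for now but probably will not in a few years.
Authdb
1 master, 1 slave, very small.
Craigslist currently ranks 30 on Alexa. The figures above reflect the situation at the time of the interview (April 28, 2006); Craigslist's data is, after all, still growing at 200% a year.
Judged by its software and hardware, Craigslist's data solution is low-cost. A good MySQL DBA is a key factor for a Web 2.0 project.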
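 Second Life data notes
Author: Fenng (may be reproduced; reproductions must credit the original source, the author, and this copyright notice with a hyperlink)
URL: http://www.dbanotes.net/review/second_life.html
The Matrix seems to have arrived early. Since 2006 I have seen report after report on Second Life (SL). My laptop cannot run the SL client, so I have never experienced this virtual world first-hand. Today I spent some time reading a few related documents.
Philip Rosedale, former CTO of RealNetworks, created Second Life through Linden Lab. The project began alpha testing in 2002, when it was called LindenWorld.
On February 24, 2007 it claimed to have reached 4 million users (called "Residents" in the game). On February 1, 2007, concurrent users reached 30,000. Concurrent users are growing 20% a month. That 20% now looks conservative; with media attention, growth will change noticeably. The system is designed for 100,000 concurrent users. It is a complex system, but Linden Lab is fully confident in SL's scalability.
More than 2,000 Intel/AMD servers (probably over 3,000 by now) in San Francisco and Dallas currently support the virtual world, most of them 64-bit AMD machines. The operating system is Debian Linux and the database is MySQL. Tim O'Reilly's article Web 2.0 and Databases Part 1: Second Life gives some information on how SL's databases are built. In Second Life each geographic region runs on a single instance of the server software, called a "simulator" or "sim" for short, and each sim handles 16 acres of virtual land. When a user moves between adjacent sims, they are really moving from one processor (or server) to another. According to the interview, information about the sim a user is currently in, and the user's account information, is stored in a central database.

The SL client software is distributed for download through Amazon's S3 service.
A thought: MySQL really is one of the biggest winners of this Web 2.0 wave.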
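 The gold mine of ideas in eBay's architecture
English source: http://www.manageability.org/blog/stuff/about-ebays-architecture
Translated by 杨争

A good way to understand how something is done is to look at how it is done in practice. The software industry has long been troubled by ideas that exist only in theory, while vendors keep selling those ideas to everyone as best practices.
Few software developers have seen large-scale scalable architectures at first hand. Fortunately, material on the subject is sometimes published. I have read good material on the design of Google's hardware infrastructure and on Yahoo's page-rendering patent. Now another Internet giant, eBay, has given us some material on its architecture (translator's note: this refers to the article "One billion page views a day: a J2EE system built on Core J2EE Patterns").
The article contains a lot of information. I will comment only on the parts that are distinctive and that interest me.
What impressed me was the eBay site's 99.92% availability and its 380M pages. Beyond that, the nearly 30,000 lines of code changed every week tell us plainly how extensible eBay's Java code is.
How does eBay achieve this with J2EE? The elements of its scalability are:

Judicious use of server-side state
No server affinity
Functional server pools
Horizontal and vertical database partitioning
eBay's approach to scaling data access linearly is especially interesting. They mention a "custom O-R mapping" that supports local and global caches, lazy loading, fetch sets (deep and shallow), and retrieving and submitting subsets of updates. They also use only bean-managed transactions, rely on database auto-commit, and use the O-R mapping to route to different data sources.
Several things are surprising. First, no Entity Beans at all, only their own O-R mapping tool (Hibernate, anyone?). Second, application servers partitioned by use case. Third, databases also partitioned by use case. Last, the stateless nature of the system and the conspicuous absence of clustering.
Here is the quote on server state:
Basically we do not really use server-side state. We could, but so far we have found no reason to. ... If state is needed, we put it in the database and fetch it from there when required. We do not need clustering, and so we do no work for clustering.
In short, do not trouble yourself with building stateful servers, and beyond that, forget clustering; you do not need it. Now for functional partitioning:

We have a group or pool of machines running the application for a specific use case. Search, for example, has its own server farm, and we can tune each differently, because browsing items, an essentially read-only use case, behaves differently from selling an item, a read-write use case. For the past four or five years we have used horizontal database partitioning to get the availability and linear scalability we need.
In short, do not put your application and database on one giant machine; just use server pools, with each pool dedicated to one use case. Does that sound like Google's strategy?
Here is some description of horizontal partitioning:
Content-based routing makes horizontal linear scaling possible. Imagine that eBay one day has 60 million items: we do not have to store them all on one giant Sun server. ... Perhaps we can spread these databases over many Sun servers, but then how do we get at the data we need? eBay came up with content-based routing, which uses rules to find the data I need among 20 physical servers. Even cooler, failover strategies are defined here too.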
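Content-based routing with a failover host can be sketched in a few lines. The routing rule used here (item ID modulo the number of hosts), the class, and the host names are assumptions for illustration; the article does not say which rules eBay actually uses.

```python
class ContentRouter:
    """Route an item to a database host by its id, with a standby per host."""

    def __init__(self, hosts, failover):
        self.hosts = list(hosts)        # primary hosts, one per partition
        self.failover = dict(failover)  # primary host -> standby host
        self.down = set()

    def mark_down(self, host):
        self.down.add(host)

    def host_for(self, item_id):
        # Assumed rule: partition by item id modulo the number of hosts.
        primary = self.hosts[item_id % len(self.hosts)]
        if primary in self.down:
            return self.failover[primary]
        return primary


router = ContentRouter(
    hosts=["db0", "db1"],
    failover={"db0": "db0-standby", "db1": "db1-standby"},
)
print(router.host_for(5))  # db1
router.mark_down("db1")
print(router.host_for(5))  # db1-standby
print(router.host_for(4))  # db0
```

Only requests for the failed partition are redirected; the other partitions keep serving from their primaries, which is what lets the application continue working when one database goes down.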
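Finally, one sentence describes a future, more loosely coupled architecture:

Using a messaging system to couple different use cases is something we are looking into.
Does it seem odd that the article set out to introduce J2EE design patterns? The key ideas behind linear scaling have almost nothing to do with patterns. Yes, eBay organizes its code with design patterns, but overemphasizing patterns loses sight of the whole. The key ideas in eBay's architecture are stateless design, a flexible and highly optimized O-R mapping layer, and servers partitioned by use case. Design patterns are good, but they cannot be expected to make an application scale linearly.
In short, the eBay and Google examples show that an architecture of server pools organized by use case has proved to scale more linearly and be more available than a few large computers. Vendors, of course, dread hearing this conclusion. The biggest headache in deploying so many servers, though, is managing them. :-)

My summary:
eBay uses design patterns to layer its architecture. The layers (presentation, business logic, data access) are loosely coupled with clear responsibilities, and layering improves the extensibility of the code and development efficiency.
eBay uses stateless design, a flexible and highly optimized O-R mapping layer, and servers partitioned by use case to keep applications loosely coupled and improve linear scalability.
Why require linear scalability? So that when traffic rises we can increase the site's capacity simply by adding servers, without changing any of the system's code.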

 一天十亿次的访问－eBay架构（一）
版权声明：如有转载请求，请注明出处： http://blog.csdn.net/yzhz
本文来自于2003JavaOne（ http://java.sun.com/javaone/）上的一篇文章。我把它翻译成中文，有些不重要的部分我已略去。虽然是2003年的文章，但其中的J2EE设计方案还是值得我们去学习的，而且这个架构本身就是面向未来的。
eBay作为全球最大的网络交易市场赢得了市场的尊重，作为技术人员我们对其后台架构如何能够支撑起这个庞然大物都会感兴趣。每天十亿次访问量，6900万注册会员，1600万商品这些天文般的数字意味着它每天承受着巨大的并发访问量，而且eBay上大量页面都不是静态页面。
这篇介绍eBay架构的文章一定能对我们的项目设计和开发起到很好的指导作用。
eBay的架构是eBay的工程师和Sun的工程师共同设计完成的。
下面文章中斜体字是我的注释或者感想，其他的都是原文翻译。

作者：Deepak Alur、Arnold Goldberg、Raj Krishnamurthy
翻译：杨争

一天十亿次的访问
采用Core J2EE Pattern架构的J2EE 系统

详细了解Core J2EE Pattern可以查看此链接 http://java.sun.com/blueprints/corej2eepatterns/Patterns/

目标：
通过本文，学习如何采用Core J2EE Patterns架构具有高度扩展性多层的J2EE应用。

作者：
Deepak Alur
- Senior Software Architect, SunPS program
- Co-author of Core J2EE Patterns
- Sun-eBay V3 Architecture—Team leader

Arnold Goldberg
- Lead Architect—eBay.com Platform
- Led V3 architecture, design and implementation

Raj Krishnamurthy
- Software Architect, SunPS program
- Sun-eBay V3 Architecture team—Key member

议程：
入门和Core J2EE Patterns
eBay.com三层架构的目标
关键架构和技术决策
eBay.com如何应用Core J2EE Patterns
结论

一、入门和Core J2EE Patterns
1、目标：
- eBay.com网站的架构
- 架构中模式的地位
- 使用 Core J2EE Patterns的好处

2、eBay介绍
（1）使命
1、全球交易平台
2、拍卖、定价、B2C、B2B

（2）统计数据
- 6900万注册会员
- 28000个分类，1600万商品
- 2002年营业额：148亿7千万美元
-全球社区
-每天十亿次访问量
- 1200多个URL

3、eBay旧的二层架构及其存在的问题
（1）ebay旧的二层架构
-集成在一起的两层架构（架构中各组件之间的耦合度高）
- 330万行C++ ISAPI DLL
-面向功能的设计
- Not for systemic qualities

（2）二层架构存在的问题
-阻碍商业创新（可扩展性不够）
-随着访问量增大，系统线性扩展性面临着挑战（无法通过仅仅增加硬件投入，扩充系统的支撑量）
-高额的维护成本
-不便于“重构”（代码很难通过重构来改善）
- Architects in constant Fire-Fighting Mode
4、2000年底开始三层架构改造

系统向分层、松散耦合、模块化、基于标准的架构过渡

一天十亿次的访问－eBay架构（二）
版权声明：如有转载请求，请注明出处： http://blog.csdn.net/yzhz

5、eBay架构的改造是基于下面这本书介绍的模式
core J2EE Pattern 最佳实践和设计策略第二版，sun官方网站也提供core J2EE Pattern，见
http://java.sun.com/blueprints/corej2eepatterns/Patterns/

该书介绍了21 种J2EE设计模式，我们可以把他们归类到三层中。
（1）、表示层的设计模式：
- Intercepting Filter (X)
- Front Controller (X)
- Application Controller (X)
- Context Object (X)
- View Helper
- Composite View
- Service To Worker (X)
- Dispatcher View
带(X)表示这些设计模式在eBay.com的架构中采用了。

（2）、商业逻辑层的设计模式:
- Business Delegate
- Service Locator (X)
- Session Facade
- Application Service (X)
- Business Object (X)
- Composite Entity
- Transfer Object (X)
- Transfer Object Assembler (X)
- Value List Handler (X)
带(X)表示这些设计模式在eBay.com的架构中采用了。

（3）、集成层（也称为数据访问层）设计模式:
- Data Access Object (X)
- Service Activator
- Domain Store (X)
- Web Service Broker (X)
带(X)表示这些设计模式在eBay.com的架构中采用了。

二、ebay三层架构的目标
1、目标
高可用性、高可靠性、可线性扩展，建立实现系统的无缝增长。
高开发效率，支持新功能的快速交付。
可适应未来的架构，应变将来商业的更新需求。
ebay的系统可用性2002年已到了99.92%.(令人叹服)，每季度网站新增十五个重大功能，
每个星期将近3万行代码在修改，3个星期内可以提供一个国际化版本。

2、为了可适应未来的架构，ebay采用了下面的做法
采用J2EE模式
Only adopt Technology when required
Create new Technology as needed
大量的性能测试
大量的容量计划
大量关键点的调优
Highly redundant operational infrastructure and the technology to leverage it

3、为了实现可线性扩展，ebay采用了下面的做法：
（1）合理地使用server state
（2） No server affinity
（3） Functional server pools。
（4） Horizontal and vertical database partitioning。

ebay架构采用了服务器分块化的概念，每台服务器上的应用与它的use case有关，即server pool中的一部分服务器专门用于登陆，一部分服务器专门用于显示商品信息。毕竟不同use case访问数据库的方式不同，比如“显示商品信息”use case只是只读操作。而且由于是只读操作，数据库的压力会比较低，我可以只采用几台服务器来承担这部分操作，而更多的服务器用于读写操作多的use case，这样合理地使用服务器资源。
由于不同的应用放在不同的服务器上，这里就涉及到用户状态的复制问题。这就是第一条ebay要求合理地使用server state的原因，就我所知，ebay的用户状态只有很少保存在session中，ebay把用户的状态放到了数据库和cookie中。

4、为了使得数据访问可线性扩展
（1）建模我们的数据访问层
（2）支持定义良好的数据访问模式（well-defined data access patterns）
Relationships and traversals
本地cache和全局cache
（3）定制的O-R mapping—域存储模式
（4） Code generation of persistent objects
（5）支持lazy loading
（6）支持fetch sets (shallow/deep fetches)
（7）支持retrieval and submit (Read/Write sets)

5、为了使数据存储可线性扩展，eBay采用了下列做法
（1）商业逻辑层的事务控制
只采用Bean管理的事务
Judicious use of XA
数据库的自动提交
（2）基于内容的路由
运行期间采用 O-R Mapping ，找到正确的数据源
支持数据库的水平线性扩展
Failover hosts can be defined
（3）数据源管理
动态的
Overt and heuristic control of availability
如果数据库宕机，应用可以为其他请求服务。

6、应变未来采用的技术
（1）消息系统
子系统之间、数据库之间松散耦合
J2EE的Message Driven Beans
（2）SOAP
对于外部开发者和合作伙伴，通过可用的工具和最佳实践来平衡我们的平台
采用SOAP 来标准化不同eBay应用之间进程内部的通信
采用SOAP满足我们的QoS需求

四、将J2EE的设计模式应用到eBay中
介绍了三个Use cases例子，“查看账号”，“查看商品”，“eBayAPI”，介绍了这三个use case 如何采用J2EE的设计模式实现其设计。（略去）
一天十亿次的访问－eBay架构（三）
版权声明：如有转载请求，请注明出处： http://blog.csdn.net/yzhz
五、结论
1、表示层架构

2、商业逻辑层架构

3、eBay整体架构

4、总结
（1）eBay.com的架构采用了J2EE核心模式
-使你不用重新发明轮子，提高系统重用性
-经过实践证明的解决方案和策略
-J2EE核心模式可以成为Developer和Architect 的词汇
-更快的开发效率
（2）在你开发项目中学习和采用这些设计模式
（3）参与到模式的社区中。

5、看了这么多，如果你能记得些什么的话，希望是下面这段话：
模式在开发和设计中是非常有用的。模式能帮助你达到设计的重用、加快开发进度、降低维护成本，提高系统和代码的可理解性。
我的体会：
1、ebay架构的主体是采用J2EE的核心设计模式设计的，我们在实际项目中可根据我们应用的需求采用适合我们应用的设计模式。毕竟我们看到eBay的架构也不是用了J2EE核心设计模式中提到的所有模式，而是根据项目的实际情况采用了部分适合其本身的模式。
2、需要澄清的是：这些设计模式是J2EE的设计模式，而不是EJB的设计模式。如果你的架构没有采用EJB，你仍然可以使用这些设计模式。
3、本文中除了介绍如何采用J2EE核心模式架构eBay网站，还介绍了eBay架构为了支持线性扩展而采用的一些做法，我觉得这些做法很有特点，不仅可以大大提高系统的线性扩展性，而且也能大大提高网站的性能。这些我会有另外一篇文章介绍给大家。

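 Seven caching weapons to speed up Web applications and site access
The seven weapons of caching in Web applications:
1 Database caching
Databases generally support caching of query results and have sophisticated mechanisms to keep the cache valid. For databases such as MySQL and Oracle, well-configured caching brings a substantial performance gain.

2 Caching in the data connection driver
Examples are ADODB for PHP, J2EE connection drivers, and even ORMs such as Hibernate if one counts them as connectors. Cache validation at this level is weaker. The great advantage of caching here is that the way we fetch data stays the same. For example, a call such as
$db->CacheGetAll("select * from table"); needs no change, and caching is applied transparently. This is mainly used for data that changes little, such as data dictionaries that are rarely updated.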
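The idea behind a driver-level call like CacheGetAll, same query text in, cached rows out until a time limit passes, can be sketched as follows. This is not ADODB's implementation; the class, the in-memory dictionary, and the injectable clock are assumptions made so the example is self-contained.

```python
import time


class QueryCache:
    """Cache query results by SQL text for a fixed number of seconds."""

    def __init__(self, run_query, ttl_seconds, clock=time.monotonic):
        self._run_query = run_query  # hypothetical function that hits the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # sql text -> (time stored, rows)

    def cache_get_all(self, sql):
        now = self._clock()
        entry = self._entries.get(sql)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no database access
        rows = self._run_query(sql)
        self._entries[sql] = (now, rows)
        return rows


executed = []
cache = QueryCache(lambda sql: executed.append(sql) or [("row",)], ttl_seconds=60)
cache.cache_get_all("select * from dict")
cache.cache_get_all("select * from dict")
print(len(executed))  # 1
```

The caller's code is the same whether the rows come from the database or from the cache, which is the transparency the text describes. The price is that updates made within the time limit are not visible, so this suits slowly changing data only.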

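3 System-level caching
A cache library inside the application can cache whatever data needs it. For example, if generating a tree menu is very expensive, the generated tree can be cached. The drawback is that when part of the tree is updated, you must update the cached copy by hand. Cache libraries offer different storage methods: some put content on disk, some in memory. If you mount memory as a disk (a RAM disk) and cache there, the speed also improves considerably.
4 Page-level caching
This is used most in content management systems and means generating static pages. Cache control is at its most complex here, and there is generally no cure-all, only case-by-case analysis. You usually need a mechanism to delete generated static pages that are stale or rarely visited, to keep lookups of static pages fast.
5 Precompiled pages and loading as FastCGI
For PHP, a compilation engine such as Zend can be used; JSP is precompiled by nature. FastCGI works by loading scripts in advance so they need not be read on every execution, the same idea as precompiling a JSP into a servlet and then loading it.
6 Front-end caching
Squid can be used as a front-end cache for the Web server.
7 Clustering
Cluster the database, cluster the Web servers, and cluster the Squid front ends.
For beginners: if your program hangs, first check the code for errors and for memory leaks. If there are none, the problem usually lies in the database connections.
With the caching methods above combined, building a high-load Web application becomes much easier.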
 Designing a cacheable CMS
2007-06-03 13:41:16 Author: chedong  Source: www.chedong.com  Tags: cms cache design
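For a site with daily traffic in the millions, speed quickly becomes a bottleneck. Besides optimizing the content publishing application itself, converting the output of dynamic pages that do not need real-time updates into static pages brings a marked speed-up: a dynamic page is often 2 to 10 times slower than a static one, and if static content can be cached in memory, access can be 2 to 3 orders of magnitude faster than the original dynamic page.
• Dynamic caching versus static caching
• Site planning based on reverse-proxy acceleration
• Reverse-proxy acceleration with Apache mod_proxy
• Reverse-proxy acceleration with Squid
• Cache-oriented page design
• Cache compatibility in applications: HTTP_HOST/SERVER_NAME and REMOTE_ADDR/REMOTE_HOST need to be replaced with HTTP_X_FORWARDED_HOST/HTTP_X_FORWARDED_SERVER
If the pages output by the back-end content management system follow a cacheable design, performance can be left to the front-end cache servers, which greatly reduces the complexity of the CMS itself.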
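Static caching versus dynamic caching
Static pages can be cached in two ways; the main difference is whether the CMS itself is responsible for updating the cache of related content.
1. Static caching: the static page is generated as soon as new content is published. For example, on March 22, 2003, once an administrator enters an article through the back-end interface, the static page http://www.chedong.com/tech/2003/03/22/001.html is generated immediately and the links on related index pages are updated at the same time.
2. Dynamic caching: no static page is generated in advance when content is published. When the content is requested and the front-end cache server cannot find it in its cache, it asks the back-end content management server, which generates the static page. The first visit may be a little slow, but later visits are served straight from the cache.

Sites abroad such as ZDNet, which use the Vignette content management system, have page names like 0,22342566,300458.html. Here 0,22342566,300458 is really several parameters separated by commas:
When the page is not found on first access, the server in effect runs a query with doc_type=0&doc_id=22342566&doc_template=300458,
and the query result becomes the cached static page 0,22342566,300458.html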
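The mapping from a comma-separated page name to query parameters can be written out directly. The parameter names are the ones used in the example above; the function itself is a hypothetical illustration and not Vignette's code.

```python
from urllib.parse import urlencode

FIELDS = ("doc_type", "doc_id", "doc_template")


def page_name_to_query(page_name):
    """Turn '0,22342566,300458.html' into the equivalent query string."""
    stem, _, extension = page_name.rpartition(".")
    if extension != "html":
        raise ValueError("expected a .html page name")
    values = stem.split(",")
    if len(values) != len(FIELDS):
        raise ValueError("expected three comma-separated values")
    return urlencode(dict(zip(FIELDS, values)))


print(page_name_to_query("0,22342566,300458.html"))
# doc_type=0&doc_id=22342566&doc_template=300458
```

Encoding the parameters in the file name makes every dynamic page look like a static file with a stable URL, which is what lets a front-end cache store it without knowing anything about the application.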
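Drawbacks of static caching:
• Complex update triggers: both approaches work well while the content management system is simple. For a site with complicated relationships, though, the logical references between pages become a very hard problem. The classic example is a news item that must appear on the news home page and in 3 related news topics at once. In static caching mode, every new article requires the system to regenerate, through triggers, several related static pages besides the article's own page, and this trigger logic often becomes one of the most complex parts of the content management system.
• Bulk updates of old content: pages already generated through static caching are hard to modify, so when users visit old pages a new template simply does not take effect.
In dynamic caching mode, each dynamic page only needs to take care of itself, and related pages update automatically, which greatly reduces the need to design update triggers for related pages.
I used a similar approach in small applications in the past: after the first access the application saves the database query result to a local file, and on the next request it checks the local cache directory for a cache file first, reducing access to the back-end database. This can carry a fairly heavy load, but a system in which content management and cache management are fused like this is hard to separate, and data integrity is hard to maintain: when content is updated, the application has to delete the corresponding cache files. With many cache files, such a design often also needs to spread the cache directory out, because once a directory holds more than 3,000 file entries even rm * fails.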
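One common way to spread cache files over many directories is to derive subdirectory names from a hash of the cache key. The sketch below assumes MD5 and two directory levels; the function name, the cache root, and the key format are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(cache_root, key, levels=2):
    """Map a cache key to a file path spread over hashed subdirectories.

    With two levels of two hex characters each there are 256 * 256
    leaf directories, so no single directory has to hold every file.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return PurePosixPath(cache_root, *parts, digest + ".html")


print(cache_path("/var/cache/pages", "tech/2003/03/22/001"))
```

Because the path is computed from the key alone, the application can find or delete the cache file for a piece of content without scanning any directory.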
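At this point the system needs a further division of labor, splitting the complex content management system into two relatively simple systems: content input and caching.
• Back end: the content management system, which concentrates on publishing content well, for example complex workflow management and complex template rules
• Front end: page cache management, which can be implemented with a caching system
 ______________________   ___________________
 |Squid Software cache|   |F5 Hardware cache|
 ----------------------   -------------------
            \                   /
             \ ________________ /
              |ASP |JSP |PHP |
            Content Manage System
              ----------------
After this split, there is a wide choice on both sides, content management and cache management: software (for example Squid on front-end port 80 caching a content publishing system on back-end port 8080), caching hardware, or even handing it over to a specialist provider such as Akamai.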
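Cache-oriented site planning
A plan for HTTP acceleration of several sites with Squid:
Originally a site might be laid out like this:
200.200.200.207 www.chedong.com
200.200.200.208 news.chedong.com
200.200.200.209 bbs.chedong.com
200.200.200.205 images.chedong.com
In the cache-oriented design, all sites point through external DNS to the same IPs, 200.200.200.200/201, which are 2 cache servers (2 for redundancy).
                            _____________________   ________
www.chedong.com  request \  |      cache box      | |        | / 192.168.0.4 www.chedong.com
news.chedong.com request -  | 200.200.200.200/201 |-|firewall| - 192.168.0.4 news.chedong.com
bbs.chedong.com  request /  |     /etc/hosts      | |  box   | \ 192.168.0.3 bbs.chedong.com
                            ---------------------   --------
How it works:
When an external request arrives, the cache resolves where to forward it according to its configuration file, so requests can be forwarded to the internal addresses we specify.
For handling multiple virtual hosts, mod_proxy is somewhat simpler than Squid: it can forward different services to different ports on several back-end IPs.
Squid can only do this by disabling DNS resolution and forwarding by the requested domain name according to the local /etc/hosts file, and the back-end servers must all use the same port.
Reverse-proxy acceleration brings not only better performance but also extra security and configuration flexibility:
• More flexible configuration: DNS resolution for the back-end servers is controlled on the internal servers, so moving services between servers means changing only internal DNS rather than making many changes to external DNS.
• Better data security: all back-end servers can easily be protected behind the firewall.
• Simpler back-end application design: for efficiency it used to be common to set up a dedicated image server, images.chedong.com, separate from a heavily loaded application server such as bbs.chedong.com. In reverse-proxy mode every front-end request goes through the cache servers and is in effect a static page, so the application no longer needs to separate images from the application itself. This greatly simplifies the back-end publishing system, and with data and application stored together the file system is easier to maintain and manage.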
Reverse-proxy cache acceleration with Apache mod_proxy
Apache includes the mod_proxy module, which can act as a proxy server and provide reverse acceleration for back-end servers.
When building Apache 1.3.x, configure with:
--enable-shared=max --enable-module=most
Note: in Apache 2.x, mod_proxy has been split into mod_proxy and mod_cache, and mod_cache has separate file-based and memory-based implementations.
Create /var/www/proxy and make it writable by the user Apache runs as.
mod_proxy configuration example: reverse proxy plus cache
Set up the front-end www.example.com as a reverse proxy for the service on port 8080 of the back-end www.backend.com.
Edit httpd.conf:
ServerName www.example.com
ServerAdmin admin@example.com
# reverse proxy setting
ProxyPass / http://www.backend.com:8080/
ProxyPassReverse / http://www.backend.com:8080/
# cache dir root
CacheRoot "/var/www/proxy"
# max cache storage
CacheSize 50000000
# hour: every 4 hour
CacheGcInterval 4
# max page expire time: hour
CacheMaxExpire 240
# Expire time = (now - last_modified) * CacheLastModifiedFactor
CacheLastModifiedFactor 0.1
# default expire tag: hour
CacheDefaultExpire 1
# force complete after percent of content retrieved: 60-90%
CacheForceCompletion 80
CustomLog /usr/local/apache/logs/dev_access_log combined
Reverse-proxy acceleration with Squid
Squid is a more specialized proxy server, and its performance and efficiency are much higher than Apache's mod_proxy.
If you need the patch for combined-format logs:
http://www.squid-cache.org/mail-archive/squid-dev/200301/0164.html
Building Squid:
./configure --enable-useragent-log --enable-referer-log --enable-default-err-language=Simplify_Chinese \
--enable-err-languages="Simplify_Chinese English" --disable-internal-dns
make
#make install
#cd /usr/local/squid
mkdir cache
chown squid.squid *
vi /usr/local/squid/etc/squid.conf
In /etc/hosts, add the internal DNS entries, for example:
192.168.0.4 www.chedong.com
192.168.0.4 news.chedong.com
192.168.0.3 bbs.chedong.com
---------------------cut here----------------------------------
# visible name
visible_hostname cache.example.com
# cache config: space use 1G and memory use 256M
cache_dir ufs /usr/local/squid/cache 1024 16 256
cache_mem 256 MB
cache_effective_user squid
cache_effective_group squid

http_port 80
httpd_accel_host virtual
httpd_accel_single_host off
httpd_accel_port 80
httpd_accel_uses_host_header on
httpd_accel_with_proxy on
# accelerate my domain only
acl acceleratedHostA dstdomain .example1.com
acl acceleratedHostB dstdomain .example2.com
acl acceleratedHostC dstdomain .example3.com
# accelerate http protocol on port 80
acl acceleratedProtocol protocol HTTP
acl acceleratedPort port 80
# access arc
acl all src 0.0.0.0/0.0.0.0
# Allow requests when they are to the accelerated machine AND to the
# right port with right protocol
http_access allow acceleratedProtocol acceleratedPort acceleratedHostA
http_access allow acceleratedProtocol acceleratedPort acceleratedHostB
http_access allow acceleratedProtocol acceleratedPort acceleratedHostC
# logging
emulate_httpd_log on
cache_store_log none
# manager
acl manager proto cache_object
http_access allow manager all
cachemgr_passwd pass all

----------------------cut here---------------------------------
Create the cache directories:
/usr/local/squid/sbin/squid -z
Start Squid:
/usr/local/squid/sbin/squid
Stop Squid:
/usr/local/squid/sbin/squid -k shutdown
Apply a new configuration:
/usr/local/squid/sbin/squid -k reconfigure
Truncate/rotate the logs every day at midnight via crontab:
0 0 * * * (/usr/local/squid/sbin/squid -k rotate)
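Designing cacheable dynamic pages
What kind of page can a cache server cache well? A page whose HTTP response headers carry "Last-Modified" and "Expires" declarations, for example:
Last-Modified: Wed, 14 May 2003 13:06:17 GMT
Expires: Mon, 16 Jun 2003 13:06:17 GMT
During that period the front-end cache server keeps the generated page locally, on disk or in memory, until the page expires.
A cacheable page therefore:
• Must include a Last-Modified: header
Purely static pages usually carry Last-Modified information already; dynamic pages have to add it explicitly through a function call, for example in PHP:
// always modified now
header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
• Must set the page's expiry time with an Expires or Cache-Control: max-age header:
For static pages, Apache's mod_expires sets the cache lifetime by MIME type: for example, one month by default for images, two days for HTML pages, and so on.
ExpiresActive on
ExpiresByType image/gif "access plus 1 month"
ExpiresByType text/css "now plus 2 day"
ExpiresDefault "now plus 1 day"

For dynamic pages, write the values straight into the HTTP response headers: the news index page index.php might expire after 20 minutes, while an individual news article might expire after one day. For example, in PHP, to expire one month later:
// Expires one month later
header("Expires: " .gmdate ("D, d M Y H:i:s", time() + 3600 * 24 * 30). " GMT");
• If the server uses HTTP-based authentication, the response must carry a Cache-Control: public header, which allows the front-end cache to store the page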
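The header requirements listed above can be sketched in Python. This is only an illustration, not part of the original article: the function name `cache_headers` and the 20-minute lifetime are assumptions, and the date formatting relies on the standard library's `email.utils.formatdate`.

```python
import time
from email.utils import formatdate


def cache_headers(max_age_seconds, now=None):
    """Build the response headers a cache-friendly dynamic page should send.

    `now` is a Unix timestamp; it is a parameter only so the result is
    reproducible in examples and tests.
    """
    now = time.time() if now is None else now
    return {
        # HTTP dates are always expressed in GMT
        "Last-Modified": formatdate(now, usegmt=True),
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
        "Cache-Control": "public, max-age=%d" % max_age_seconds,
    }


# A news index page that may be cached for 20 minutes
headers = cache_headers(20 * 60, now=1052917577)
for name, value in headers.items():
    print("%s: %s" % (name, value))
```

Sending both Expires and Cache-Control: max-age is redundant for HTTP/1.1 caches, but it keeps older HTTP/1.0 caches working as well.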
Making ASP applications cacheable
First add the following shared functions to a common include file (for example include.asp):
<%
' Set Expires Header in minutes
Function SetExpiresHeader(ByVal minutes)
' set Page Last-Modified Header:
' Converts local date (19991022 11:08:38) to http form (Fri, 22 Oct 1999 03:08:38 GMT)
Response.AddHeader "Last-Modified", DateToHTTPDate(Now())

' The Page Expires in Minutes
Response.Expires = minutes

' Set cache control to external applications
Response.CacheControl = "public"
End Function
' Converts local date (19991022 11:08:38) to http form (Fri, 22 Oct 1999 03:08:38 GMT)
Function DateToHTTPDate(ByVal OleDATE)
' Offset of server local time from GMT: 8 hours, i.e. the server clock is assumed to run on UTC+8
Const GMTdiff = #08:00:00#
OleDATE = OleDATE - GMTdiff
DateToHTTPDate = engWeekDayName(OleDATE) & _
", " & Right("0" & Day(OleDATE),2) & " " & engMonthName(OleDATE) & _
" " & Year(OleDATE) & " " & Right("0" & Hour(OleDATE),2) & _
":" & Right("0" & Minute(OleDATE),2) & ":" & Right("0" & Second(OleDATE),2) & " GMT"
End Function
Function engWeekDayName(dt)
Dim Out
Select Case WeekDay(dt,1)
Case 1:Out="Sun"
Case 2:Out="Mon"
Case 3:Out="Tue"
Case 4:Out="Wed"
Case 5:Out="Thu"
Case 6:Out="Fri"
Case 7:Out="Sat"
End Select
engWeekDayName = Out
End Function
Function engMonthName(dt)
Dim Out
Select Case Month(dt)
Case 1:Out="Jan"
Case 2:Out="Feb"
Case 3:Out="Mar"
Case 4:Out="Apr"
Case 5:Out="May"
Case 6:Out="Jun"
Case 7:Out="Jul"
Case 8:Out="Aug"
Case 9:Out="Sep"
Case 10:Out="Oct"
Case 11:Out="Nov"
Case 12:Out="Dec"
End Select
engMonthName = Out
End Function
%>
Then, at the very top of each page (for example index.asp and news.asp), add the following code to set the HTTP headers:

<%
' The page will expire after 20 minutes
SetExpiresHeader(20)
%>
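Designing applications to be cache-compatible

Once requests pass through a proxy, there is an extra layer between client and server, so the server can no longer read the client's IP directly, and server-side applications cannot rely on the address the forwarded request arrived at. The forwarded request's HTTP headers, however, carry additional HTTP_X_FORWARDED_???? fields that track the original client IP address and the server address the client originally requested:
The two examples below illustrate the design principles for cache-compatible applications:
' For an ASP application that needs the server's host name: do not use HTTP_HOST/SERVER_NAME directly; check whether HTTP_X_FORWARDED_HOST is present
function getHostName ()
dim hostName as String = ""
hostName = Request.ServerVariables("HTTP_HOST")
if not isDBNull(Request.ServerVariables("HTTP_X_FORWARDED_HOST")) then
if len(trim(Request.ServerVariables("HTTP_X_FORWARDED_HOST"))) > 0 then
hostName = Request.ServerVariables("HTTP_X_FORWARDED_HOST")
end if
end if
return hostName
end function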

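// For a PHP application that needs to record the client IP: do not use REMOTE_ADDR directly; prefer HTTP_X_FORWARDED_FOR when it is present
function getUserIP (){
$user_ip = $_SERVER["REMOTE_ADDR"];
// the header only exists when the request came through a proxy
if (!empty($_SERVER["HTTP_X_FORWARDED_FOR"])) {
$user_ip = $_SERVER["HTTP_X_FORWARDED_FOR"];
}
return $user_ip;
}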

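Note: if the request passed through several intermediate proxies, HTTP_X_FORWARDED_FOR may contain multiple comma-separated addresses,
for example: 200.28.7.155,200.10.225.77 unknown,219.101.137.3
So in many older database designs (BBS software, for instance) the field that records the client address is only 20 bytes wide, which is too small.
Error messages like the following are common:
Microsoft JET Database Engine error '80040e57'
The field is too small to accept the amount of data you attempted to add. Try inserting or pasting less data.
/inc/char.asp, line 236
The fix is to size the user-IP field at 50 bytes or more when designing the client-address record; requests that pass through more than three layers of proxies are very rare anyway.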
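As an illustration of handling the multi-address header described above, here is a minimal Python sketch. The function names are hypothetical; the sample value is the one quoted above. It splits on both commas and whitespace and drops tokens such as "unknown" that are not valid IP addresses.

```python
import ipaddress


def parse_forwarded_for(header_value):
    """Return the valid IP addresses found in an X-Forwarded-For value, in order."""
    addresses = []
    for token in header_value.replace(",", " ").split():
        try:
            addresses.append(str(ipaddress.ip_address(token)))
        except ValueError:
            continue  # skip placeholders such as "unknown"
    return addresses


def client_ip(remote_addr, forwarded_for=None):
    """Pick the first forwarded address, falling back to the socket address."""
    chain = parse_forwarded_for(forwarded_for or "")
    return chain[0] if chain else remote_addr


sample = "200.28.7.155,200.10.225.77 unknown,219.101.137.3"
print(parse_forwarded_for(sample))
print(client_ip("192.168.0.4", sample))
```

The header is supplied by the client side and can be forged, so a value obtained this way is suitable for logging but should not be trusted for access control.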
How can you check the cacheability of your site's pages? The following tool can help:
http://www.ircache.net/cgi-bin/cacheability.py
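Appendix: Squid performance test

phpMan.php is a PHP-based man page server. Every man page requires calling the back-end man command plus a number of page-formatting tools, so the system load is fairly high. It exposes cache-friendly URLs. Below are performance figures for the same page:
Test environment: Redhat 8 on Cyrix 266 / 192M Mem
Test program: Apache's ab (Apache Benchmark)
Test conditions: 50 requests, 5 concurrent connections
Test items: directly through Apache 1.3 (port 80) vs. Squid 2.5 (port 8000, accelerating port 80)

Test 1: dynamic output on port 80 without a cache: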
ab -n 100 -c 10 http://www.chedong.com:81/phpMan.php/man/kill/1
This is ApacheBench, Version 1.3d <$Revision: 1.2 $> apache-1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2001 The Apache Group, http://www.apache.org/

Benchmarking localhost (be patient)…..done
Server Software:        Apache/1.3.23
Server Hostname:        localhost
Server Port:            80

Document Path:          /phpMan.php/man/kill/1
Document Length:        4655 bytes

Concurrency Level:      5
Time taken for tests:   63.164 seconds
Complete requests:      50
Failed requests:        0
Broken pipe errors:     0
Total transferred:      245900 bytes
HTML transferred:       232750 bytes
Requests per second:    0.79 [#/sec] (mean)
Time per request:       6316.40 [ms] (mean)
Time per request:       1263.28 [ms] (mean, across all concurrent requests)
Transfer rate:          3.89 [Kbytes/sec] received

Connnection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    29  106.1      0   553
Processing:  2942  6016 1845.4   6227 10796
Waiting:     2941  5999 1850.7   6226 10795
Total:       2942  6045 1825.9   6227 10796

Percentage of the requests served within a certain time (ms)
50% 6227
66% 7069
75% 7190
80% 7474
90% 8195
95% 8898
98% 9721
99% 10796
100% 10796 (last request)

Test 2: output served from the Squid cache
/home/apache/bin/ab -n50 -c5 "http://localhost:8000/phpMan.php/man/kill/1"
This is ApacheBench, Version 1.3d <$Revision: 1.2 $> apache-1.3
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2001 The Apache Group, http://www.apache.org/

Benchmarking localhost (be patient)…..done
Server Software:        Apache/1.3.23
Server Hostname:        localhost
Server Port:            8000

Document Path:          /phpMan.php/man/kill/1
Document Length:        4655 bytes

Concurrency Level:      5
Time taken for tests:   4.265 seconds
Complete requests:      50
Failed requests:        0
Broken pipe errors:     0
Total transferred:      248043 bytes
HTML transferred:       232750 bytes
Requests per second:    11.72 [#/sec] (mean)
Time per request:       426.50 [ms] (mean)
Time per request:       85.30 [ms] (mean, across all concurrent requests)
Transfer rate:          58.16 [Kbytes/sec] received

Connnection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     1    9.5      0    68
Processing:     7    83  537.4      7  3808
Waiting:        5    81  529.1      6  3748
Total:          7    84  547.0      7  3876

Percentage of the requests served within a certain time (ms)
50% 7
66% 7
75% 7
80% 7
90% 7
95% 7
98% 8
99% 3876
100% 3876 (last request)

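Result: No Cache / Cache = 6045 / 84 ≈ 72
Conclusion: for pages that can be served from the cache, the mean response time improves roughly 70-fold, close to two orders of magnitude, because Squid keeps cached pages in memory (so there is almost no disk I/O).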
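The comparison can be recomputed from the two ab reports above. The short Python sketch below uses only figures that appear in those reports (mean total time, median total time, requests per second); the variable names are illustrative.

```python
# Figures copied from the two ApacheBench reports above
no_cache = {"mean_ms": 6045, "median_ms": 6227, "requests_per_second": 0.79}
cached = {"mean_ms": 84, "median_ms": 7, "requests_per_second": 11.72}

mean_ratio = no_cache["mean_ms"] / cached["mean_ms"]
median_ratio = no_cache["median_ms"] / cached["median_ms"]
throughput_ratio = cached["requests_per_second"] / no_cache["requests_per_second"]

print("mean total time:   %.1fx faster" % mean_ratio)
print("median total time: %.1fx faster" % median_ratio)
print("throughput:        %.1fx higher" % throughput_ratio)
```

The three ratios differ widely because a single slow request (3876 ms) dominates the cached run: it inflates the mean and the total elapsed time, while the median shows that a typical cached request completes in about 7 ms.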

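Summary:

• High-traffic sites should publish dynamic pages as static, cacheable pages wherever possible; even for a dynamic application such as a search engine, caching is extremely important.
• Use HTTP headers in dynamic pages to define the cache refresh policy.
• Use a cache server to gain extra configuration flexibility and security.
• Logs matter: Squid does not support the combined log format by default, so for sites that need referer logs this patch is important: http://www.squid-cache.org/mail-archive/squid-dev/200301/0164.html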

References:
HTTP proxy caching
http://vancouver-webpages.com/proxy.html

Cache-friendly page design
http://linux.oreillynet.com/pub/a/linux/2002/02/28/cachefriendly.html
Using ASP.NET output caching to store dynamic pages - Developer - ZDNet China
http://www.zdnet.com.cn/developer/tech/story/0,2000081602,39110239-2,00.htm
Related RFC documents:

• RFC 2616:
o section 13 (Caching)
o section 14.9 (Cache-Control header)
o section 14.21 (Expires header)
o section 14.32 (Pragma: no-cache) is important if you are interacting with HTTP/1.0 caches
o section 14.29 (Last-Modified) is the most common validation method
o section 3.11 (Entity Tags) covers the extra validation method

Cacheability check
http://www.web-caching.com/cacheability.html
Cache design considerations
http://vancouver-webpages.com/CacheNow/detail.html

Several documents on zope.org about acceleration with Apache mod_proxy and mod_gzip
http://www.zope.org/Members/anser/apache_zserver/
http://www.zope.org/Members/softsign/ZServer_and_Apache_mod_gzip
http://www.zope.org/Members/rbeer/caching
 Key points for developing large, high-load web applications [nightsailer]
2007/05/17 14:12 772 huzhangyou2002 信仰的服务器设计
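Author: nightsailer
Source: http://www.phpchina.com/bbs/thread-15484-1-1.html

I have read several people's so-called methods for large projects, and I feel none of them gets to the point, which bothers me a little.
So here is my own view. I think it is hard to say what makes a project "large": even a very simple application becomes a challenge under high load and rapid growth. So let me call these the problems to consider under high load, high concurrency or rapid growth. Many of them have little to do with programming and a great deal to do with the architecture of the whole system.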

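Database

Yes, the database comes first: it is the first SPOF most applications face. For Web 2.0 applications in particular, database response time is the first thing to solve.
MySQL is generally the most common choice. You may start with a single MySQL host, and once the data grows beyond about one million rows, MySQL performance drops sharply. The usual optimization is master-slave (M-S) replication, with queries and write operations handled on different servers. What I recommend is M-M-Slaves: two MySQL masters and several slaves. Note that although there are two masters, only one is active at any time, and you can switch between them when needed. The point of having two masters is to keep the master from becoming the system's SPOF again.
The slaves can be load-balanced further, for example with LVS, so that SELECTs are spread sensibly across the slaves.

This architecture holds up to a certain load, but as users keep growing and your user table passes ten million rows, the master becomes the SPOF. You cannot keep adding slaves, because the replication overhead climbs steeply. What then? My approach is partitioning, done at the business level. The simplest example is user data: split it across different database clusters according to some rule, such as the id.
A global database serves the metadata lookups. The drawback is one extra query per lookup: to find user nightsailer, you first ask the global database cluster for nightsailer's cluster id, and then fetch nightsailer's actual data from that cluster.
Each cluster can run M-M or M-M-Slaves.
This structure scales: as load grows, you simply add new MySQL clusters.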
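The two-step lookup described above (global metadata first, then the owning cluster) can be sketched as follows. This is a toy model under stated assumptions: plain dictionaries stand in for the global database and the MySQL clusters, and the class name, method names and the "least-loaded cluster" placement rule are hypothetical.

```python
class ShardDirectory:
    """Toy model of a global metadata database in front of several clusters."""

    def __init__(self, cluster_ids):
        # cluster id -> {username: row}; each dict stands in for one MySQL cluster
        self._clusters = {cid: {} for cid in cluster_ids}
        # global metadata: username -> cluster id
        self._directory = {}

    def add_cluster(self, cluster_id):
        """Scale out by registering a new, empty cluster."""
        self._clusters[cluster_id] = {}

    def add_user(self, username, row):
        # Initial placement: the least-loaded cluster (an arbitrary rule for this sketch)
        cluster_id = min(self._clusters, key=lambda cid: len(self._clusters[cid]))
        self._clusters[cluster_id][username] = row
        self._directory[username] = cluster_id
        return cluster_id

    def get_user(self, username):
        cluster_id = self._directory[username]       # query 1: global metadata
        return self._clusters[cluster_id][username]  # query 2: the user's cluster


directory = ShardDirectory([1, 2])
first = directory.add_user("nightsailer", {"email": "n@example.com"})
second = directory.add_user("wuexp", {"email": "w@example.com"})
directory.add_cluster(3)
third = directory.add_user("9tmd", {"email": "t@example.com"})
print(first, second, third, directory.get_user("nightsailer"))
```

Because placement is recorded in the directory rather than derived from the id, new clusters start receiving users immediately without touching existing data.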

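Points to note:
1. Disable all auto_increment columns.
2. IDs must be allocated centrally by a common algorithm.
3. You need a good way to monitor the load and service status of the MySQL hosts. If you run more than 30 MySQL databases, you will know what I mean.
4. Do not use persistent connections (no pconnect). Use a third-party database connection pool such as sqlrelay instead, or simply write your own, because the MySQL connection pooling in PHP4 often misbehaves.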
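Point 2 above does not say which allocation algorithm is used. One common option, shown here purely as an assumption, is a central allocator that hands out disjoint blocks of ids so that application servers rarely need to contact it. The class and parameter names are hypothetical.

```python
import threading


class IdAllocator:
    """Central allocator that hands out non-overlapping blocks of ids."""

    def __init__(self, start=1, block_size=1000):
        self._next = start
        self._block_size = block_size
        self._lock = threading.Lock()

    def reserve_block(self):
        """Reserve the next block; the caller assigns ids from it locally."""
        with self._lock:
            first = self._next
            self._next += self._block_size
        return range(first, first + self._block_size)


allocator = IdAllocator(start=1, block_size=3)
block_a = list(allocator.reserve_block())  # e.g. handed to web server A
block_b = list(allocator.reserve_block())  # e.g. handed to web server B
print(block_a, block_b)
```

Ids produced this way stay unique across all clusters, so rows can later be merged or moved between clusters without key collisions.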

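Caching

Caching is another big topic. I generally use memcached for the cache cluster; around 10 machines is usually enough (a 10 GB memory pool). One caveat: never let it use swap. It is best to turn Linux swap off.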

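Load balancing / acceleration

When I mentioned caching above, some people probably thought first of page staticization, so-called static HTML. I consider that common knowledge rather than a key point. What static pages bring with them is load balancing and acceleration for the static service. I think Lighttpd+Squid is the best approach.
LVS <------->lighttpd====>squid(s) ====lighttpd

That is what I usually use. Note that I do not use Apache; unless there is a specific need I do not deploy it, because I normally run php-fastcgi with lighttpd, which performs much better than apache+mod_php.

Squid can solve problems such as file synchronization, but you must monitor the cache hit rate closely and push it above 90% as far as possible.
There is much more to say about squid and lighttpd, which I will not go into here.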
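Storage
Storage is another big problem. One kind is small-file storage, such as images. The other is large-file storage, such as search-engine indexes, where a single file typically exceeds 2 GB. The simplest way to store small files is to distribute them with lighttpd. Alternatively, just use Red Hat's GFS: the advantage is that it is transparent to applications, the drawback is the relatively high cost, by which I mean having to buy a disk array. In my project the storage volume is 2-10 TB, and I used distributed storage. That requires solving file replication and redundancy,
so that each file has its own level of redundancy; Google's GFS paper is a useful reference here.
For large files, look at Nutch's approach, which has since been split out as the Hadoop subproject. (You can google it.)

Other:
Beyond that, things like passport also need consideration, but they are comparatively simple.

Time to eat, so I will stop here. These are just rough ideas to get the discussion started.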

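[Replies]

9tmd:
You covered several key parts. A few more must be considered, such as squid clusters and LVS or VIP (layer-4 switching). Logical sharding of the database does not need an id lookup on the master; the mapping can be cached periodically or handled in program logic.
Sharing my own experience: http://www.toplee.com/blog/archives/337.html (discussion welcome)
nightsailer:
Well said.
Let me explain why I do the lookup in the main table: the main reason is replication and maintenance. Suppose program logic says user nightsailer belongs in cluster s1, but for some reason I need to move nightsailer's data from s1 to s5, or at some point I need to merge the data of several clusters. With the lookup, all I do during maintenance is change nightsailer's cluster id in the main database from 1 to 5; the maintenance work proceeds independently, with no change to application logic. The program's id-allocation logic could perhaps anticipate such cases, but then that logic spreads into every application and the resulting code coupling is high. With a lookup table, by contrast, the allocation happens once at the start, and the other applications need not know anything about these algorithms and rules.
Of course, the extra query I mentioned does not mean that every lookup hits the main database; a caching strategy is a must.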
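The argument above (keep the user-to-cluster mapping in a lookup table, cache it, and change only the mapping when a user is moved) can be illustrated with a small sketch. Everything here is hypothetical: a dictionary stands in for the main database, and a second dictionary stands in for a cache such as memcached.

```python
class CachedDirectory:
    """Cache in front of an authoritative username -> cluster id mapping."""

    def __init__(self, directory):
        self._directory = directory  # stand-in for the main (global) database
        self._cache = {}             # stand-in for an external cache
        self.directory_queries = 0   # how often the main database was hit

    def cluster_for(self, username):
        if username not in self._cache:
            self.directory_queries += 1
            self._cache[username] = self._directory[username]
        return self._cache[username]

    def move_user(self, username, new_cluster_id):
        """Maintenance step: repoint the user, then drop the stale cache entry."""
        self._directory[username] = new_cluster_id
        self._cache.pop(username, None)


lookup = CachedDirectory({"nightsailer": 1})
before = [lookup.cluster_for("nightsailer") for _ in range(3)]
lookup.move_user("nightsailer", 5)   # data was migrated from s1 to s5
after = lookup.cluster_for("nightsailer")
print(before, after, lookup.directory_queries)
```

Application code only ever calls `cluster_for`, so a migration changes one row in the directory and no application logic.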

As for why auto_increment should be disabled, I think it is clear by now: when data has to be merged and split, auto_increment simply cannot be used.
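nightsailer:
A few more loose remarks. PHP can be optimized in many ways; the main measures are:
1. Use FCGI, together with lighttpd or Zeus.
I personally prefer Zeus: simple and reliable. But it costs money.
lighty is good too; its configuration file is simple and clean. The latest 1.5 is not stable yet, but combined with Linux AIO the performance gain is very noticeable. Even the current stable release gets very high performance from epoll on 2.6 kernels. Compared with Zeus, lighty's weakness is its limited SMP support, so you can load-balance across several servers, or simply start separate processes that listen on different ports.
2. Dedicated PHP FCGI servers.
Plenty of benefits: the server runs nothing but PHP's FCGI service, and you can add caches such as XCache, which I personally like.
And anything else; to borrow a famous line, install everything that can be installed.
Most importantly, you maintain a single PHP environment that apache, zeus and lighttpd can all share, provided they all use PHP in FCGI mode, and XCache can then be used to the full!
3. apache+mod_fastcgi
Apache is not useless; sometimes it is needed. For example, I built a WebDAV server in PHP that had problems on other servers and could only run on Apache.
In that case, install mod_fastcgi in Apache and use its external server feature to point at the PHP FastCGI servers set up in item 2.
4. Optimized compilation
ICC, Intel's compiler, is my first choice. Recompile php, mysql and lighty with icc, and everything else you can, and you will see a worthwhile gain, especially on Intel CPUs.
5. 64-bit PHP4 needs patching
It seems hardly anyone has compiled php4 on Linux x86_64. I googled it: never mind China, even abroad very few people use it.
So here is a reminder: the official PHP downloads (including the latest php-4.4.4) all fail to compile. The problem lies in autoconf: you have to edit config.m4 by hand, typically for key extensions such as mysql, gd and ldap, as well as the phpize script, and add /usr/lib64 to the relevant search paths in config.m4.
I suspect few people cling to php4 as stubbornly as I do, though. php5 has no such problem.
I am also considering a migration to php5.2; writing code in it is so much more convenient, and I have been holding back.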
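nightsailer:
QUOTE:
Originally posted by wuexp on 2007-1-3 17:01
Splitting tables makes data operations (update, delete, query) very complicated, and sorting is even more troublesome.
I once considered hashing the user id and inserting into the corresponding sub-table.

I see what you mean.

But we may not be talking about quite the same thing.
The splitting I mean is done according to the business situation.

1. It can be vertical:
split by business entity. For example, user a's blog posts, tags and comments all live in database u for a, or even in a complete set of data structures (in which case it is really database splitting).

2. It can be horizontal:
one table's data is spread across different databases.
Take a message table: you might split it into daily_message and history_message.
daily_message holds hot objects, week_message holds warm ones, and posts more than two months old
probably count as cold objects. These objects are placed on different database clusters according to how often they are accessed.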
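A minimal sketch of routing by age, in the spirit of the hot/warm/cold split above. The table names come from the post; the exact thresholds are assumptions (the post only says that posts older than two months are cold), as is the mapping of cold data to history_message.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=1)    # assumed: "daily" means the last 24 hours
COLD_AFTER = timedelta(days=60)   # from the post: older than two months is cold


def table_for(posted_at, now):
    """Choose the table (and hence the database cluster) by message age."""
    age = now - posted_at
    if age < HOT_WINDOW:
        return "daily_message"    # hot
    if age < COLD_AFTER:
        return "week_message"     # warm
    return "history_message"      # cold (assumed mapping)


now = datetime(2007, 1, 3, 17, 0, 0)
print(table_for(datetime(2007, 1, 3, 9, 0, 0), now))
print(table_for(datetime(2006, 12, 20, 9, 0, 0), now))
print(table_for(datetime(2006, 9, 1, 9, 0, 0), now))
```

A periodic job would also have to move rows between tables as they age; that part is not shown.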

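3. A combination of the two

Either way, updates and deletes are not complicated; they are no different from an unpartitioned table.

As for queries and sorting, surely they cannot be done with SELECT and ORDER BY alone?
You should build things like summary tables, index tables and reference tables instead.
Also analyze the business concretely to cut down junk data. Sometimes only the first 10,000 records are needed, and then sorting the data of all tables is unnecessary. In many traditional businesses such as retail, the transaction table is huge, but reports are not generated in real time; anyone who has run batch reports will recognize this.

You can also look at how other sites do it, such as technorati and flickr.

The so-called trouble is something to account for when you design the system's structure, and even more so when you design the database,
so as long as the project's framework is designed thoroughly from the start, most of it is transparent to developers.
The precondition is that you really do design it well, instead of letting programmers design while they write code; that would be a nightmare.

I am writing all this not only for programmers; it may be even more useful to designers.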

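9tmd:
Controlling table splitting in program logic only requires maintaining one database-access configuration file. It is completely transparent to developers, who need not care where the data lives and simply call a common interface. I have met this pattern often in systems I built, most of all in site-wide passport and community post handling.

I used this architecture at Yahoo and later at mop, and overall it has proved trustworthy; a single table always carries the risk of hitting extreme data volumes.

9tmd:
People keep asking about auto_increment. MySQL's documentation addresses this specifically for master/slave replication: MySQL replication works by replaying the master's SQL log, so with one-way Master->Slave replication auto_increment is fine, while two-way M/M replication runs into problems, as a moment's thought makes clear. Official documentation:
http://dev.mysql.com/tech-resour … ql-replication.html
http://dev.mysql.com/doc/refman/ … auto-increment.html

Also, when using MySQL replication, avoid MySQL's own functions such as NOW() in the SQL your code writes; compute the time in the PHP program and put it into the SQL statement. Otherwise the values may differ after replication. The same reasoning leads to other similar problems that are worth thinking about.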
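The pattern recommended above, computing the timestamp in the application and passing it as a parameter instead of calling a server-side time function, looks like this in outline. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and function names are hypothetical. Whether NOW() itself actually diverges depends on the MySQL version and replication mode; the sketch only shows the pattern.

```python
import sqlite3
from datetime import datetime, timezone


def insert_post(conn, body, now=None):
    """Insert a row whose timestamp is fixed by the application, not the server."""
    moment = now or datetime.now(timezone.utc)
    created_at = moment.strftime("%Y-%m-%d %H:%M:%S")
    # The literal value travels with the statement, so every replica that
    # replays it stores exactly the same timestamp.
    conn.execute(
        "INSERT INTO post (body, created_at) VALUES (?, ?)",
        (body, created_at),
    )
    return created_at


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (body TEXT, created_at TEXT)")
stamp = insert_post(conn, "hello", now=datetime(2007, 1, 4, 12, 0, 0, tzinfo=timezone.utc))
row = conn.execute("SELECT body, created_at FROM post").fetchone()
print(stamp, row)
```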

Reference articles:
http://blog.csdn.net/heiyeshuwu/archive/2007/01/04/1473941.aspx
http://www.phpchina.com/bbs/thread-15484-1-1.html
http://www.toplee.com/blog/archives/337.html
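 Notes on Memcached and Lucene
By Michael
　　A project I finished recently made heavy use of Memcached, and the whole architecture gained a great deal of performance, far more than a little. Under heavy traffic the MySQL database still carries only a light load. The results are good, and I recommend that anyone who can should move their project onto Memcached. The current version of the well-known Vbb forum software already supports Memcached for caching forum data. When I was at MOP we also used it heavily.
　　I cannot claim much Memcached expertise. The way to squeeze out maximum performance is a permanent cache, with your program logic controlling and maintaining the cached data in MC. That is how my project does it; the program logic does get more complex, but for a commercial project the cost is well worth it.
　　The one thing to watch with Memcached is that its operations on a key are not atomic, so under high concurrency, writes to the same key may overwrite each other, and you have to handle this in program logic. I have not studied this in depth; JH read the source code and gave me this conclusion, and given JH's skill and character I rate it over 80% credible :)
　　Most people are familiar with Lucene, and the technology needs little explanation; documentation is all over the web. Recently I have been looking for the best way to integrate Lucene with PHP and use it in serious commercial applications. The options I know of are PECL's Clucene module and Zend Framework's Zend_Search_Lucene module, and neither has felt very good in use so far. Another is PHP's Java extension support (there are two kinds: the php_java extension and the php_java bridge approach), which also feels rather odd. The last approach I know of is to invoke the java command through a system call to run Lucene. I have not tried it and do not know what performance it can reach.
I am leaving a marker here and will add more when I have further results.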
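The overwrite risk mentioned above arises when two clients each read a value, modify it and write it back. A common remedy is compare-and-swap: the write only succeeds if the value has not changed since it was read, which is the idea behind the `gets`/`cas` commands in the memcached protocol. The sketch below uses a small in-process stand-in class (hypothetical, not a memcached client) so that it runs without a server.

```python
import threading


class TinyCache:
    """In-process stand-in for a cache offering gets/cas semantics."""

    def __init__(self):
        self._data = {}  # key -> (value, version)
        self._lock = threading.Lock()

    def gets(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        with self._lock:
            return self._data.get(key, (None, 0))

    def cas(self, key, value, version):
        """Store value only if nobody wrote the key since `version` was read."""
        with self._lock:
            if self._data.get(key, (None, 0))[1] != version:
                return False
            self._data[key] = (value, version + 1)
            return True


def increment(cache, key):
    # Retry the read-modify-write until no other writer got in between
    while True:
        value, version = cache.gets(key)
        if cache.cas(key, (value or 0) + 1, version):
            return


def worker(cache, key, times):
    for _ in range(times):
        increment(cache, key)


cache = TinyCache()
threads = [threading.Thread(target=worker, args=(cache, "hits", 1000)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
total = cache.gets("hits")[0]
print(total)
```

With a plain get followed by set, some of the 4000 increments could be lost; with the retry loop none are.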
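 Designing high-performance, scalable websites with open-source software
2006-6-17 于敦德
Last time we used LiveJournal as an example to analyze in detail how a small site is optimized step by step as it grows into a large one, resolving the performance problems that growing load brings, and how to avoid or solve those problems at the root when designing the site's architecture.
Today we look at methods commonly used in site design to handle large-scale traffic and high load. We will mainly cover:
1. Front-end load
2. Business logic layer
3. Data layer
In the LJ article we noted that grouping servers is the way to solve load problems and scale without limit. This is usually done with an LDAP-like scheme, as used in mail servers, personal sites and blog applications; on Windows there is the similar Active Directory. Some applications (blogs or personal pages, for example) need the user routed to the right server group as early as second-level domain resolution, before the request has reached the application, so the problem has to be solved in DNS. One tool for this is bind dlz, a BIND plug-in that replaces BIND's text zone files. It supports several data stores, including LDAP and BDB, and handles this case fairly well.
Another DNS-related issue is the widespread north-south interconnection problem. BIND9's built-in views can return different answers depending on the source IP, resolving southern users to southern servers and northern users to northern servers. Two problems come up here: first, obtaining the list of northern and southern IP ranges; second, keeping communication between the northern and southern servers smooth. There is a crude solution to the first: extract all visitor IPs from the logs, write a script that pings them back from both the northern and the southern servers, and analyze the results to get a roughly accurate list. The best way, of course, is to get the list directly from the carriers. The second problem has more solutions. The best is to rent a dual-line data center: one machine, two IPs, connected to both networks. A lesser option is to find separate data centers in the north and the south and, through extensive testing, pick two with good connectivity between them; this usually costs less but works worse and is harder to maintain.
DNS load balancing is also widely used: several parallel A records spread requests randomly across multiple front-end servers. It is typically used where static pages dominate; many of the big portals' content front ends work this way.
Once the user has been routed to the right server group, the application takes over the request and processes it along the defined business logic. Requests fall into two classes: static files (images, JS scripts, CSS and so on) and dynamic requests.
Static requests are usually cached with squid. Depending on the scale of the application the cache can be single-level or multi-level. The cache hit rate typically reaches about 70%, which raises server capacity quite effectively. Apache's deflate module can compress transferred data to improve speed, and the cache modules in version 2.0 and later provide built-in disk and memory caching, so a reverse proxy is not strictly required.
Dynamic requests are currently handled in two ways. One is static generation: regenerate the static page whenever the page changes. Many CMS and BBS systems now use this approach, and combined with a cache it gives fast access. It suits applications with relatively few writes.
The other is dynamic caching: every request still goes through the application, but the application relies more on memory than on the database. Database access is usually very slow and memory access is fast, at least an order of magnitude apart. memcached makes this approach possible, and a well-built memcache layer can even exceed a 90% cache hit rate. Ten years ago I was still using 2 MB of memory, and a magazine of the time humorously described a conversation between a father and son:
Son: Dad, I want 1 GB of memory.
Dad: No, son, not even for your birthday.
Today the cost of large memory is entirely affordable. Google uses large numbers of PCs in clusters for data processing, and I have always felt that large-memory PCs can solve front-end and even middle-tier load problems at very low cost. PC hard disks have short lives and are slow, and the CPUs are somewhat slower too, so using such machines as web front ends is cheap and makes full use of the large memory. When one fails it is simply replaced, with no data migration involved.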
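The dynamic-caching approach described above is often implemented as "cache-aside": look in memory first and fall back to the database only on a miss. The following sketch is illustrative only; a dictionary stands in for memcached, and the `load_from_db` callback, class name and counters are assumptions.

```python
class CacheAside:
    """Read-through helper: memory first, database only on a miss."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # slow path, e.g. a SQL query
        self._cache = {}           # stand-in for memcached
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self._cache.pop(key, None)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


db_calls = []

def load_article(article_id):
    db_calls.append(article_id)  # record how often the "database" is hit
    return {"id": article_id, "title": "Article %d" % article_id}


articles = CacheAside(load_article)
for _ in range(10):
    articles.get(42)
print(articles.hit_rate(), len(db_calls))
```

Ten reads of the same article cost one database query, which is where a hit rate of 90% or more comes from for read-heavy pages.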
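Next comes the design of the application. It should be designed as far as possible around a scalable database design, where databases can be added dynamically, and it should support in-memory caching; this is the lowest-cost route. Another way to design the application is to use middleware such as ICE. The advantage is that the front-end application stays relatively simple: the data layer is transparent to the front end and is provided by ICE, the distributed database design is implemented in the back end and exposed to the front end through ICE. This approach places lower demands on the design of each part and layers the business more cleanly, but because it introduces middleware and more layers, it costs more to implement.
For the database, use clustering on the one hand and grouping on the other. Apply database optimization principles in the details wherever possible, and design the database schema and data layer to avoid creating temporary tables and producing deadlocks. Database optimization principles are easy to find online; a little googling will answer most questions. Choose the database according to your own habits, Oracle, MySQL and so on. Oracle does not solve every problem, and MySQL does not mean a small application; whatever fits is best.
Everything so far concerns software-based performance design. Well-matched hardware can also cut time, development and maintenance costs effectively, but we will not go into that here.
Designing a site's architecture is a holistic piece of engineering. The design has to weigh performance, scalability, hardware cost, time cost and more, and choosing a suitable plan given the business positioning, funds, time and staff is difficult. But with enough thought and practice you will eventually build up a set of design ideas that suits you, guides your design work and lays a good foundation for the site's growth.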

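 A high-load architecture: Lighttpd+PHP(FastCGI)+Memcached+Squid
By Michael
　　For a new project I have started moving from Apache to Lighttpd, along with heavy use of Memcached. I am taking the opportunity to rework toplee.com's server environment as well, and to write up a document for the record!
　　For more experience with large architectures, see my earlier blog post: http://www.toplee.com/blog/archives/71.html
12.31 Completed as of today:
　　　　1. Installed and configured lighttpd, with extensive tuning;
　　　　2. Read almost all of the documentation at http://trac.lighttpd.net/trac/wiki;
　　　　3. Configured FastCGI support between lighttpd and php;
　　　　4. Added XCache support to php;
　　　　5. Set up some domains to be served by lighttpd;
　　　　6. Made Apache forward some domains and all php requests to lighttpd through mod_rewrite and mod_proxy;
　　　　7. Set up the Memcached environment and changed some database operations to be cached in MC;

　　Results:
　　　　1. System load dropped considerably and response time improved;
　　　　2. MC works extremely well, and the pressure on the database is greatly reduced.
　　TODO:
　　　　(The items below will wait until I buy a second server; so far I have only done them on a friend's project)
　　　　-. Configure MySQL Master/Slave and separate database writes from reads
　　　　-. Add a squid cluster for cache acceleration
　　　　-. Other (for example DNS load balancing plus LVS layer-4 switching…)
To be continued…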
 Thoughts on system architecture for high-concurrency, high-load websites

2007-04-13　10:38:10

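Below is what I was thinking in mid-October; after discussions with 小黑, my views have since changed somewhat.

The site architecture of 百家@*店 is already overloaded. The server often reaches 100% utilization, mainly because the database consumes a large share of CPU. A system like this cannot keep up with the site's growth.
So I considered some ways to spread our site's traffic.
1. From other people's articles, most say that today's bottleneck is the database, so I put it first. Our site is designed for 100 million users. (Taobao currently has nearly 20 million users; Tencent has nearly 1 billion users, of which more than 100 million are active.) We want some headroom in the design, so we position it between Taobao's and Tencent's user counts.
The user-name list gets its own database, so that it can be moved onto a separate server at any time. At login the database has to find a record among 100 million; even with the primary index this takes nearly a second (in SQLSERVER, a LIKE search over 200,000 rows takes about 2 seconds; here the lookup is an exact match combined with the primary index). So the user table must be split. Split by the 26 letters, with one table for user names that start with Chinese characters and one for names that start with a digit. That gives 28 tables, about 3.6 million rows per table on average (100 million / 28), with the largest table probably around 10 million rows. With the user name as primary index, SQLSERVER should then be able to cope.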
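The 28-way split described above can be expressed as a small routing function. This is a sketch under assumptions: the table names are invented, and any first character that is neither an ASCII letter nor an ASCII digit is sent to the table meant for Chinese user names.

```python
def user_table_for(username):
    """Map a user name to one of 28 tables by its first character."""
    if not username:
        raise ValueError("username must not be empty")
    first = username[0]
    if first.isascii() and first.isalpha():
        return "user_" + first.lower()   # 26 tables: user_a ... user_z
    if first.isascii() and first.isdigit():
        return "user_digit"              # names starting with a digit
    return "user_cn"                     # Chinese (and any other) initials


for name in ["nightsailer", "Wuexp", "9tmd", "小黑"]:
    print(name, "->", user_table_for(name))
```

Letter frequencies are uneven, which matches the article's expectation that the largest table ends up several times bigger than the average.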
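Besides the separate database for the user table, allow for about 1 million product records and about 5 million posts for a large city (the Hangzhou forum currently has over 300,000 topics and 6 million replies). So product data goes into its own database and forum posts into another. Product data is split by city, and forum posts are also split by city, with one table for topics and one for replies.
As for the number of shops (Taobao currently has about 600,000 active shops): our site does not release shops, so more tables are needed to store them. The number cannot be determined yet; assume about 10 million shops. To be safe, give them a separate database as well, which can share a server with other small databases later.
Shop categories and friend links for each shop should be planned at no less than 10 times the number of shops, so each also gets its own database, split by user name.
Global site settings, the city list, moderator information and other data that can be rendered as static content share one database. Per-city product categories need their own database, split by city.
The rest follows the same pattern. At this point the database bottleneck is solved in theory.
2. Generate static HTML pages wherever possible: the home page, city home pages, all product pages, and the first page of each thread (the first 10 posts). If a topic runs beyond one page, users click "more" to read on. This saves server resources. Generating pages consumes a lot of CPU, so this function runs on a dedicated server with a buffer-queue database on it. When a user submits a form, the data is saved to the queue database to await processing, and the server processes the entries in order of submission time (the primary index). Generated HTML product pages go into one folder (servers can be added at any time); uploaded images are processed and stored in another folder (images consume the most IIS resources, so a dedicated image server must be added later). The processed content is then written to the main database and the handled records are deleted from the queue database. These two parts are the biggest CPU consumers and should be ready to move off the main web server at any time.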
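The queue-then-generate flow in point 2 can be sketched as below. It is a simplified, single-process model with hypothetical names: a heap ordered by submission time stands in for the buffer-queue database, and a dictionary of paths stands in for the folder of generated HTML files.

```python
import heapq
import html


class PublishQueue:
    """Buffer submissions, then render them to static pages in submission order."""

    def __init__(self):
        self._heap = []       # (submitted_at, item_id, title, body)
        self.published = {}   # path -> HTML, stand-in for files on disk

    def submit(self, submitted_at, item_id, title, body):
        heapq.heappush(self._heap, (submitted_at, item_id, title, body))

    def process_all(self):
        """Render every queued item and remove it from the queue."""
        order = []
        while self._heap:
            _, item_id, title, body = heapq.heappop(self._heap)
            page = "<html><head><title>%s</title></head><body>%s</body></html>" % (
                html.escape(title),
                html.escape(body),
            )
            self.published["/goods/%d.html" % item_id] = page
            order.append(item_id)
        return order


queue = PublishQueue()
queue.submit(20, 2, "Second item", "submitted later")
queue.submit(10, 1, "First <b>item</b>", "submitted earlier")
processed = queue.process_all()
print(processed, sorted(queue.published))
```

Because rendering happens away from the request path, a burst of submissions only lengthens the queue instead of slowing page views.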
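3. Use caching. Some pages, such as the home page, can follow Baidu's home page and be cached for 24 hours.
4. Separate database reads from writes. Use two identical databases, one only for writes and one for reads, and copy newly added data from the write database to the read database at intervals. This reduces the load on the database. I hear many sites use this method.
5. Set up mirror servers for special networks such as the north and the education network;
6. Use load balancing, with a solution provided by a specialist vendor.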

 "我在SOHU这几年做的一些门户级别的程序系统(C/C++开发)"
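Bserv:
For single-item and collection data under high load with high read/write rates. The core is BerkeleyDB and the shell is a UDP thread pool. The interface reads and writes single items or collections. A single item is Key->Value data. A collection is indexed data, List->Keys->Values, for example all members of a class or all replies to a thread. DBDS performs well: more than 800 reads per second and more than 300 writes per second (Xeon 2G*2, 72G SCSI, 2G RAM). With its Java interface it is currently used in every ChinaRen project (ChinaRen campus, alumni directory, community and so on). It is the core data service of all of ChinaRen and runs on about 50 servers. Characteristics: high speed, high request volume. It provides low-cost storage for all kinds of data and solves the problem that databases cannot deliver very high read/write rates. A portal-grade high-speed data service.
OnlineServer:
The core of the ChinaRen/SOHU short-note system consists of three small server systems: online2 (online-system business logic), userv (user profile system) and cserv (LRU cache). All three are built as UDP + thread pool, single process with multiple threads. They come with a Java interface and apache_mod JSON and XML interfaces. online2 contains most of the business logic, including going online, the friends system and the note system. userv handles setting users' attributes and information. cserv is a large LRU cache that reduces disk I/O; it can hold all kinds of data blocks, including user information, friends and messages. It currently runs on 4 servers (DL380, Xeon 3G*2, SCSI 146G RAID, 2G RAM), with users distributed across the 4 servers, which interact with each other. The server count can grow from 1 to 2, to 4, to 8. The underlying storage is file-based (no database), on reiserfs.
Companion system: mod_online, in two versions, for apache and for lighttpd, which shows the candle-man icon on pages. The request volume is huge, and the lighttpd version of mod_online is the one in use. It sits on sohu's squid front-end machines, runs on port 8080 on about 8 machines, and each machine handles about 500-800 requests per second. The candle-man icon shows whether a user is online wherever an ID appears on a ChinaRen page. This online system now serves as the kernel prototype for SOHU IM. A WEBIM system is planned for contact among all users of the SOHU matrix.
apache_mod series:
There are many services based on apache2, used where request volume is high and display must be fast.
1. mod_gen_verifyimg2:
Displays verification codes using GD2 and freetype. It returns a gif stream directly from apache, with random fonts, angles, colors and so on. It is used on every ChinaRen page that needs a verification code, and the request volume is large.
2. mod_ip2loc
IP-to-physical-location conversion on the apache side, fast and efficient. It loads a data file into an in-memory tree for fast lookup and returns the physical location of the client IP. It is used by products that need automatic IP-based location, and for statistics. In ChinaRen campus, for example, every client request gets a physical location that the application logic can use.
3. mod_pvserver2
Records and displays click counts for ChinaRen community posts. It derives the post ID from the URL and sends a UDP packet to the bserv system for counting, and returns the result to the client in a cookie. The html page uses javascript to show the click count on the post. This solves efficient recording and reading of click counts and displaying them without a dynamic page program.
4. mod_online
Displays the candle-man icon on ChinaRen pages. It talks to onlineserver to get the user's online status and other status information. The request volume is large, about 500-800 requests per front-end machine.
5. Other mods
There are also modules for authentication, access statistics, special url filtering and redirection, page key generation, and several more. Characteristics: high speed, dense and extremely high request volume. They take load off the application servers at the front end, efficiently.
cserv:
A high-speed LRU cache system. The core is UDP + thread pool + LRU structure (hash + PQueue). It stores all kinds of data blocks in Key->Value form and serves them to applications on an LRU basis, reducing slow operations such as file I/O and disk I/O. It currently caches user profiles for the ChinaRen online system. Characteristics: high-speed reads and writes, low cost.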
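The LRU idea behind cserv (a hash table for lookup plus an ordering that records recency) can be shown in a few lines. The real cserv is a C/C++ UDP server; this Python sketch is only an illustration of the eviction policy, using `collections.OrderedDict` as the combined hash and queue.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity Key->Value cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read makes the entry "recent" again
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._items)


profiles = LRUCache(capacity=2)
profiles.put("alice", {"online": True})
profiles.put("bob", {"online": False})
profiles.get("alice")                   # alice is now the most recent
profiles.put("carol", {"online": True})  # evicts bob
print(profiles.get("bob"), profiles.get("alice"), len(profiles))
```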
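ddap:
A prototype server program built as UDP + thread pool, single process with multiple threads; most of the programs start from this structure. Performance is 8000-10000 requests per second.
eserv:
Access statistics system, for storing, reading and writing the number of visits by users and the time they were last online. It is used for the access records of every class in the ChinaRen alumni directory. Storage is file-based, with writes also going to a backup bserv for backup and lookup. Current performance per machine is 50 records written per second and 100 reads per second. It meets the alumni directory's need to record a huge number of user logins. Characteristics: no database, pure file storage, high-speed reads and writes, low cost.
logserver:
Logs events of all kinds. The core is ddap, UDP + thread pool. It records logs by module. All ChinaRen user services and system logs are recorded in logserver for statistics and queries. Write performance is good, 100 per second on a single machine. Characteristics: fast and efficient, low cost, massive volume.
SessionServ:
Session system. The core is ddap, UDP + thread pool. It stores temporary data in memory, with get/put/del/inc and similar operations. It is widely used for storing small data with a fixed time window, such as expiry, data validity checks and application synchronization. Because everything happens in memory it is fast; access rates should exceed 1000 per second. It is widely used across ChinaRen community, campus, alumni directory and other services. Characteristics: fast and efficient, low cost, widely applicable.
Other servers:
MO_dispatcher: forwards data from the SMS uplink interface, over TCP. It can dispatch high volumes quickly to the various application services according to the service number. It currently forwards SOHU SMS to the ChinaRen SMS services.
sync: synchronizes static front ends; it consists of a client and a server program. The client fetches the list of files to synchronize from the server over a TCP connection and updates local files quickly over TCP. The program is designed for many clients and a single server: for example, one server generates static files and synchronizes them to several client front ends. Characteristics: portal-grade synchronization of static content between servers; efficient, fast, high volume. It is currently used for the static posts of the ChinaRen community.
To sum up:
A portal's core services demand high efficiency, high-density access and massive data, preferably at low cost. Do not use a database, do not use java, do not use mswin. Use C, use memory, use files, use linux, and you are on the right track.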
 Architecture analysis of China's top portal sites, part 1

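First, a disclaimer: everything below is my own conjecture based on a few tools. I do not guarantee that it matches the architecture the big portals actually use, though I believe it is close ^_^.
I want to split the article into two parts. The first analyzes the basic architecture of the home pages and channels of China's two top portals. The second records the documentation of my own experiments. I hope every SA keeps an architecture like this in mind.
Sina and Sohu are known to everyone in China. Each gets more than ten million hits a day. With traffic like that, making limited resources give users the fastest possible experience is the top priority. Internet companies have left the cash-burning stage and entered healthy growth, and every sum spent has to produce some return. On the other hand, the engineers have to rack their brains so that users are not constantly unable to reach the site or stuck with very slow access. Otherwise, even with the best editors and the best sales staff, the ad slots would be hard to sell, and closure would await them. None of that has happened, of course, because their engineers use the existing resources fully and push them to the limit. In essence they use squid as the web cache server, with apache behind squid providing the real web service. This architecture only works if most of the home page is static, which requires programmers to cooperate by converting pages to static pages before they are returned to the client. That is the basic architecture. Now let me explain how I guessed it, and the specifics:
Tool one: nslookup
In practice: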
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses: 61.172.201.230, 61.172.201.231, 61.172.201.232, 61.172.201.233
61.172.201.221, 61.172.201.222, 61.172.201.223, 61.172.201.224, 61.172.201.225
61.172.201.226, 61.172.201.227, 61.172.201.228, 61.172.201.229
Aliases: www.sina.com.cn, jupiter.sina.com.cn
Here you can see that Sina uses that many IPs for its home page, and at first you might think Sina must be very rich. Not so. Read on:
nslookup news.sina.com.cn
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses: 61.172.201.228, 61.172.201.229, 61.172.201.230, 61.172.201.231
61.172.201.232, 61.172.201.233, 61.172.201.221, 61.172.201.222, 61.172.201.223
61.172.201.224, 61.172.201.225, 61.172.201.226, 61.172.201.227
Aliases: news.sina.com.cn, jupiter.sina.com.cn
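Careful readers will notice that the news channel has the same number of IPs as the home page, and exactly the same IPs. In other words, in Sina's DNS these IPs all belong to the name taurus.sina.com.cn; they are the A records of that name, while news, sports, jczs.news and the rest are CNAME records. DNS is used for automatic round-robin. Still not convinced? Here is one more, the sports channel: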
nslookup sports.sina.com.cn
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses: 61.172.201.222, 61.172.201.223, 61.172.201.224, 61.172.201.225
61.172.201.226, 61.172.201.227, 61.172.201.228, 61.172.201.229, 61.172.201.230
61.172.201.231, 61.172.201.232, 61.172.201.233, 61.172.201.221
Aliases: sports.sina.com.cn, jupiter.sina.com.cn
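The round-robin claim can be checked mechanically: if DNS rotates one record set, each answer should be a rotation of the others. The Python sketch below uses the three address lists printed above, written as last octets for brevity; the helper names are illustrative.

```python
def is_rotation(first, second):
    """True if `second` is `first` shifted cyclically by some offset."""
    if len(first) != len(second):
        return False
    if not first:
        return True
    doubled = first + first
    return any(doubled[i:i + len(second)] == second for i in range(len(first)))


def hosts(prefix, last_octets):
    return [prefix + str(octet) for octet in last_octets]


PREFIX = "61.172.201."
www = hosts(PREFIX, [230, 231, 232, 233, 221, 222, 223, 224, 225, 226, 227, 228, 229])
news = hosts(PREFIX, [228, 229, 230, 231, 232, 233, 221, 222, 223, 224, 225, 226, 227])
sports = hosts(PREFIX, [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 221])

print(is_rotation(www, news), is_rotation(www, sports), len(set(www)))
```

All three answers contain the same 13 addresses in the same cyclic order, which is what a round-robin record set behind shared CNAMEs would produce.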
You can try the others yourself. Now let's look at sohu:
nslookup www.sohu.com
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: pagegrp1.sohu.com
Addresses: 61.135.132.172, 61.135.132.173, 61.135.132.176, 61.135.133.109
61.135.145.47, 61.135.150.65, 61.135.150.67, 61.135.150.69, 61.135.150.74
61.135.150.75, 61.135.150.145, 61.135.131.73, 61.135.131.91, 61.135.131.180
61.135.131.182, 61.135.131.183, 61.135.132.65, 61.135.132.80
Aliases: www.sohu.com
－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－
nslookup news.sohu.com
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: pagegrp1.sohu.com
Addresses: 61.135.150.145, 61.135.131.73, 61.135.131.91, 61.135.131.180
61.135.131.182, 61.135.131.183, 61.135.132.65, 61.135.132.80, 61.135.132.172
61.135.132.173, 61.135.132.176, 61.135.133.109, 61.135.145.47, 61.135.150.65
61.135.150.67, 61.135.150.69, 61.135.150.74, 61.135.150.75
Aliases: news.sohu.com
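The situation is the same as sina's, except that on the surface sohu has more IPs than sina. Does that mean sohu uses more servers for its channels than sina? Not necessarily: one server can be bound to several IPs, so the number of IPs does not tell you how many servers are in use.
From these experiments it is fairly clear that sina and sohu use the same technique for channels and similar sections: squid listens on port 80 of these IPs, and the real web server listens on another port. Users notice no difference, but compared with connecting the web server directly to clients, this clearly saves bandwidth and servers, and access feels faster to users as well.
That is all for now. I need to sleep, and there is a lot of work tomorrow. If anything is unclear, leave me a message!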
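 Architecture analysis of China's top portal sites, part 2
The day before yesterday I covered the most basic way of inferring the setup; today let's go a little deeper :)
1. Can a few domain names sharing the same IPs really prove that they use squid?
Of course not; everything so far was conjecture. What follows actually confirms my guess. First, an nslookup of sina's sports channel.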
nslookup sports.sina.com.cn
Server: ns1.china.com
Address: 61.151.243.136
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses:61.172.201.231, 61.172.201.232, 61.172.201.233, 61.172.201.9
61.172.201.10, 61.172.201.11, 61.172.201.12, 61.172.201.13, 61.172.201.14
61.172.201.15, 61.172.201.16, 61.172.201.17, 61.172.201.227, 61.172.201.228
61.172.201.229, 61.172.201.230
Aliases: sports.sina.com.cn, jupiter.sina.com.cn
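Then try visiting any one of these IPs directly. The result should look like the figure below:

This shows that sina has set up many IPs in DNS pointing to the name sqsh-19.sina.com.cn, and that the other channels of the same kind are just aliases of sqsh-19.sina.com.cn, defined with CNAME. That is how the DNS should be set up. On the server side, squid 2.5.STABLE5 (the latest stable release is STABLE6) listens on port 80. All of this is derived from the available information and should be basically correct. What follows is my personal guesswork:
Its real web server also listens on port 80, because the squid configuration file contains the line:
httpd_accel_port 80
If you set it to another port number (88, say), the error message in the figure above would become
While trying to retrieve the URL: http://61.172.201.19:88
Tool 2: the nmap scanner, which can check which ports a server has open.
I will now scan one of sina's IPs, 61.172.201.19, with nmap for analysis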
bash-2.05$ nmap 61.172.201.19
Starting nmap 3.50 ( http://www.insecure.org/nmap/ ) at 2004-07-30 13:31 GMT
Interesting ports on 61.172.201.19:
(The 1657 ports scanned but not shown below are in state: filtered)
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
Nmap run completed -- 1 IP address (1 host up) scanned in 73.191 seconds
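You can see that only two ports are open to the outside. Port 80 is the one opened by squid, as we just verified. Port 22 is for remote ssh connections, a very secure method that the SA mainly uses to operate the server remotely.
Tool 3: lynx, or any other tool or small program that can read HTTP headers. An example is the easiest way to understand it :)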
HTTP/1.0 200 OK
Date: Fri, 30 Jul 2004 05:49:47 GMT
Server: Apache/2.0.49 (Unix)
Last-Modified: Fri, 30 Jul 2004 05:48:16 GMT
Accept-Ranges: bytes
Vary: Accept-Encoding
Cache-Control: max-age=60
Expires: Fri, 30 Jul 2004 05:50:47 GMT
Content-Length: 180747
Content-Type: text/html
Age: 37
X-Cache: HIT from sqsh-230.sina.com.cn
Connection: close
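Above are sina's HTTP response headers. There is a lot of valuable information in them :) For example, the apache behind it is version 2.0.49, and an expiry time of 1 minute is set (Cache-Control: max-age=60, with Expires 60 seconds after Date). There is also the last-modified time. All of these have to be built in when compiling apache, and Last-Modified in particular needs a small change to the source code. At least that is how I did it.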
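The freshness information can be read straight out of the header dump above. The sketch below parses a few of those headers with Python's standard library and derives the lifetime from Expires minus Date, the declared max-age, and how many seconds the cached copy has left given its Age. The variable names are illustrative.

```python
from email.utils import parsedate_to_datetime

RAW_HEADERS = """\
Date: Fri, 30 Jul 2004 05:49:47 GMT
Last-Modified: Fri, 30 Jul 2004 05:48:16 GMT
Cache-Control: max-age=60
Expires: Fri, 30 Jul 2004 05:50:47 GMT
Age: 37
X-Cache: HIT from sqsh-230.sina.com.cn
"""

# Split each "Name: value" line once, at the first ": "
headers = dict(line.split(": ", 1) for line in RAW_HEADERS.strip().splitlines())

date = parsedate_to_datetime(headers["Date"])
expires = parsedate_to_datetime(headers["Expires"])
lifetime_seconds = int((expires - date).total_seconds())

max_age = int(headers["Cache-Control"].split("=", 1)[1])
age = int(headers["Age"])
remaining_seconds = max_age - age

print(lifetime_seconds, max_age, remaining_seconds)
print(headers["X-Cache"].startswith("HIT"))
```

Expires and max-age agree on a 60-second lifetime, and the Age and X-Cache headers show that this response was served from the Squid cache 37 seconds after it was fetched from Apache.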
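To sum up
sina's architecture should have squid in front. With today's 2U servers and 2 GB of memory, each server can usually run at least 4 instances of squid 2.5 STABLE5, so its 16 IPs take 4 servers. The layer behind is apache 2.0.49, probably on 2 machines. These 2 machines may use only private IPs, specified in the hosts file on the squid servers in front. I will write up the details of the implementation from my experiment notes next time :) Apache's htdocs is probably served over nfs from one or two disk arrays. Apache should mount the nfs server read-only, and there is a separate server dedicated to editing, which the editorial staff use to update articles. That server should have write access to the nfs server.
----That is the complete scheme sina uses. Much of it is guesswork, of course; I have had no contact with sina's engineers (I do not know any of them), otherwise I would not have written this up. sohu and 163 should have similar architectures.
A final note: this is only the architecture of channels made up of static pages. sina has many other servers, for downloads, online updates and so on, that are not part of it.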
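 A server setup for carrying a large number of users
http://blog.chinaunix.net/u/243/showart_299315.html
I. Preface
II. Compiling and installing
III. Installing MySQL and memcache
IV. Installing Apache, PHP, eAccelerator and php-memcache
V. Installing Squid
VI. Afterword

I. Preface and preparation
LAMP is currently the first choice for web development, and how to build an efficient, reliable and stable web server has long been a popular topic. This article is one attempt at it.
The architecture we use is shown below:

 --------       ---------------       ---------------------       ------------       -----------------
| Client | ===> | Load balancer | ===> | Reverse proxy/cache | ===> | Web server | ===> | Database server |
 --------       ---------------       ---------------------       ------------       -----------------
                     Nginx                    Squid                Apache, PHP              MySQL
                                                               eAccelerator/memcache
Preparation:
Server: Intel(R) Xeon(TM) CPU 3.00GHz * 2, 2GB mem, SCSI disk
Operating system: CentOs4.4, kernel 2.6.9-22.ELsmp, gcc 3.4.4
Software:
Apache 2.2.3 (supports MPM mode)
PHP 5.2.0 (chosen because the 5.2.0 engine is comparatively more efficient)
eAccelerator 0.9.5 (accelerates the PHP engine and can also encrypt PHP source)
memcache 1.2.0 (for caching frequently used data at high speed)
libevent 1.2a (required by memcache's working mechanism)
MySQL 5.0.27 (binary release, which avoids compiling)
Nginx 0.5.4 (used as the load balancer)
squid-2.6.STABLE6 (reverse proxy that also provides dedicated caching)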
 YouTube Scalability Talk
Cuong Do of YouTube / Google recently gave a Google Tech Talk on scalability.
I found it interesting in light of my own comments on YouTube’s 45 TB a while back.
Here are my notes from his talk, a mix of what he said and my commentary:
In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. (Wow! In most organizations, it takes nearly 4 months to pick out, order, install, and set up a few servers.)
YouTube uses Apache for FastCGI serving. (I wonder if things would have been easier for them had they chosen nginx, which is apparently wonderful for FastCGI and less problematic than Lighttpd)
YouTube is coded mostly in Python. Why? “Development speed critical”.
They use psyco, Python -> C compiler, and also C extensions, for performance critical work.
They use Lighttpd for serving the video itself, for a big improvement over Apache.
Each video is hosted by a “mini cluster”, which is a set of machines with the same content. This is a simple way to provide headroom (slack), so that a machine can be taken down for maintenance (or can fail) without affecting users. It also provides a form of backup.
The most popular videos are on a CDN (Content Distribution Network) - they use external CDNs as well as Google’s CDN. Requests to their own machines are therefore tail-heavy (in the “Long Tail” sense), because the head goes to the CDN instead.
Because of the tail-heavy load, random disk seeks are especially important (perhaps more important than caching?).
YouTube uses simple, cheap, commodity Hardware. The more expensive the hardware, the more expensive everything else gets (support, etc.). Maintenance is mostly done with rsync, SSH, other simple, common tools.
The fun is not over: Cuong showed a recent email titled “3 days of video storage left”. There is constant work to keep up with the growth.
Thumbnails turn out to be surprisingly hard to serve efficiently. Because there are, on average, 4 thumbnails per video and many thumbnails per page, the overall number of thumbnails served per second is enormous. They use a separate group of machines to serve thumbnails, with extensive caching and OS tuning specific to this load.
YouTube was bit by a “too many files in one dir” limit: at one point they could accept no more uploads (!!) because of this. The first fix was the usual one: split the files across many directories, and switch to another file system better suited for many small files.
Cuong joked about “The Windows approach of scaling: restart everything”
Lighttpd turned out to be poor for serving the thumbnails, because its main loop is a bottleneck to load files from disk; they addressed this by modifying Lighttpd to add worker threads to read from disk. This was good but still not good enough, with one thumbnail per file, because the enormous number of files was terribly slow to work with (imagine tarring up many million files).
Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition.
YouTube uses MySQL to store metadata. Early on they hit a Linux kernel issue which prioritized the page cache higher than app data, it swapped out the app data, totally overwhelming the system. They recovered from this by removing the swap partition (while live!). This worked.
YouTube uses Memcached.
To scale out the database, they first used MySQL replication. Like everyone else that goes down this path, they eventually reached a point where replicating the writes to all the DBs used up all the capacity of the slaves. They also hit an issue with threading and replication, which they worked around with a very clever “cache primer thread” working a second or so ahead of the replication thread, prefetching the data it would need.
As the replicate-one-DB approach faltered, they resorted to various desperate measures, such as splitting the video watching in to a separate set of replicas, intentionally allowing the non-video-serving parts of YouTube to perform badly so as to focus on serving videos.
Their initial MySQL DB server configuration had 10 disks in a RAID10. This does not work very well, because the DB/OS can’t take advantage of the multiple disks in parallel. They moved to a set of RAID1s, appended together. In my experience, this is better, but still not great. An approach that usually works even better is to intentionally split different data on to different RAIDs: for example, a RAID for the OS / application, a RAID for the DB logs, one or more RAIDs for the DB tables (use “tablespaces” to get your #1 busiest table on separate spindles from your #2 busiest table), one or more RAIDs for indexes, etc. Big-iron Oracle installations sometimes take this approach to extremes; the same thing can be done with free DBs on free OSs also.
In spite of all these efforts, they reached a point where replication of one large DB was no longer able to keep up. Like everyone else, they figured out that the solution is database partitioning in to “shards”. This spreads reads and writes in to many different databases (on different servers) that are not all running each other’s writes. The result is a large performance boost, better cache locality, etc. YouTube reduced their total DB hardware by 30% in the process.
It is important to divide users across shards by a controllable lookup mechanism, not only by a hash of the username/ID/whatever, so that you can rebalance shards incrementally.
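A minimal sketch of such a controllable lookup, assuming an in-memory dictionary stands in for the real lookup store. The class and method names (`ShardDirectory`, `shard_for`, `move_user`) are hypothetical and not taken from the talk. A hash is used only for the first placement; after that the table is authoritative, so adding a shard does not move anyone and a single user can be moved on their own.

```python
import zlib


class ShardDirectory:
    """Maps user IDs to shards through an explicit, editable lookup table."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        self.assignments = {}  # user_id -> shard name (the lookup table)

    def shard_for(self, user_id):
        shard = self.assignments.get(user_id)
        if shard is None:
            # First sight: place the user by hash, then remember the choice.
            h = zlib.crc32(str(user_id).encode("utf-8"))
            shard = self.shard_names[h % len(self.shard_names)]
            self.assignments[user_id] = shard
        return shard

    def add_shard(self, name):
        # Existing assignments are untouched; only new users may land here.
        self.shard_names.append(name)

    def move_user(self, user_id, target_shard):
        # Incremental rebalancing: move one user at a time.
        if target_shard not in self.shard_names:
            raise ValueError("unknown shard: %s" % target_shard)
        self.assignments[user_id] = target_shard
```

With a pure hash, adding a shard would change the placement of most users at once; here the data copy for each moved user can be done first and the table updated afterwards.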
An interesting DMCA issue: YouTube complies with takedown requests; but sometimes the videos are cached way out on the “edge” of the network (their caches, and other people’s caches), so it's hard to get a video to disappear globally right away. This sometimes angers content owners.
Early on, YouTube leased their hardware.
 High Performance Web Sites by Nate Koechley
One dozen rules for faster pages
1. Share results of our research in t Yahoo is firmly committed to openness
Why talk about performance?
In the last 2 years, we do a lot more with web pages.
Steve Souders - High Performance
Two Performance Flavors: Response Time and System Efficiency
The importance of front end performance!!! 95% is front end.
Back-end vs. Front-end
Until now we haveon
1 Perception - How fast does it feel to the users?
  Perceived response time
  It's in the eye of the beholder
2 80% of consequences
  Yahoo Interface Blog yuiblog.com
3 Cache
  Sadly the cache doesn't work as well as it should
  40-60% of users still have an empty cache
  Therefore optimize for no-cache and with cache
4 Cookies
  Set scope correctly
  Keep sizes low, 80ms delay with cookies
  Total cookie size - Amazon 60 bytes - good example.
  1. Eliminate unnecessary cookies
  2. Keep cookie sizes low
  3.
5 Parallel Downloads
One Dozen Rules
Rule 1 - Make fewer HTTP requests
  css sprites alistapart.com/articles/sprites
  Combine Scripts, Combined Stylesheets
Rule 2 - Use a CDN
  amazon.com - Akamai
  Distribute your static content before distributing content
Rule 3 - Add an Expires Header
  Not just for images: images, stylesheets and scripts
Rule 4 - Gzip Components
  You can affect users' download times
  90% of browsers support compression
  Gzip compresses more than deflate
  Gzip: not just for HTML, gzip scripts
  Free YUI Hosting includes Aggregated files w
Rule 5 - Put CSS at the top
  stylesheets use < link > not @import!!!!!
  Slower, but perceived loading time is faster
Rule 6 - Move scripts to the bottom of the page
  scripts block rendering
  what about defer? - no good
Rule 7 - Avoid CSS Expressions
Rule 8 - Make JS and CSS External
  Inline: bigger HTML but no http request
  External: cachable but extra http
  Except for a user's "home page"
  Post-Onload Download
  Dynamic Inlining
Rule 9 - Reduce DNS Lookups
  Best practice: Max 2-4 hosts
  Use keep-alive
Rule 10 - Minify Javascript
  Take out white space
  Two popular choices - Dojo is a better compressor but JSMin is less error prone.
  minify is safer than obfuscation
Rule 11 - Avoid redirects
  Redirects are worst form of blocking
  Redirects - Amazon have none!
Rule 12 - Turn off ETags
Case Studies
Yahoo
  1 Moved JS to onload
  2 removed redirects
  50% faster
What about performance and Web 2.0 apps?
  Client-side CPU is more of an issue
  User expectations are higher
  start off on the right foot - care!
Live Analysis
  IBM Page Detailer - windows only
  Fasterfox - measures load time of pages
  LiveHTTPHeaders firefox extension
  Firebug - Recommended!
  YSlow to be released soon.
Conclusion
Focus on the front end
Harvest the low hanging fruit
Reduce http requests
 Rules for High Performance Web Sites
These rules are the key to speeding up your web pages. They've been tested on some of the most popular sites on the Internet and have successfully reduced the response times of those pages by 25-50%.
The key insight behind these best practices is the realization that only 10-20% of the total end-user response time is spent getting the HTML document. You need to focus on the other 80-90% if you want to make your pages noticeably faster. These rules are the best practices for optimizing the way servers and browsers handle that 80-90% of the user experience.
• Rule 1 - Make Fewer HTTP Requests
• Rule 2 - Use a Content Delivery Network
• Rule 3 - Add an Expires Header
• Rule 4 - Gzip Components
• Rule 5 - Put CSS at the Top
• Rule 6 - Move Scripts to the Bottom
• Rule 7 - Avoid CSS Expressions
• Rule 8 - Make JavaScript and CSS External
• Rule 9 - Reduce DNS Lookups
• Rule 10 - Minify JavaScript
• Rule 11 - Avoid Redirects
• Rule 12 - Remove Duplicate Scripts
• Rule 13 - Turn Off ETags
• Rule 14 - Make AJAX Cacheable and Small
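As an illustration of Rules 3 and 4 only, here is a small sketch of how a server-side helper might attach a far-future Expires header and gzip a static component. The function name, the 30-day lifetime and the list of compressible types are assumptions made for this example; they do not come from the rules themselves.

```python
import gzip
import time
from email.utils import formatdate

# Assumed set of text-like types worth compressing (illustrative only).
COMPRESSIBLE = {"text/html", "text/css", "application/javascript"}


def build_static_response(body, content_type, accepts_gzip, max_age_days=30):
    """Return (headers, body) for a static component."""
    max_age = max_age_days * 86400
    headers = {
        "Content-Type": content_type,
        # Rule 3: let browsers and proxies reuse the component without asking.
        "Expires": formatdate(time.time() + max_age, usegmt=True),
        "Cache-Control": "max-age=%d" % max_age,
    }
    # Rule 4: compress text components when the client advertises gzip support.
    if accepts_gzip and content_type in COMPRESSIBLE:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body
```

In a real deployment this is normally a web server or CDN setting rather than application code; the sketch only shows which headers are involved.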
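 How should a system be designed for a highly concurrent application with tens of millions of database rows?
Background:
This is a blog-type application with fairly strong real-time interaction. Statistics, counters, and the related queries on each page all hit the database frequently. The data volume is expected to reach tens of millions of rows, with possibly tens of thousands of active users online at once. The system is currently built on spring + hibernate + jstl + mysql and shows no strain with 2,000 users online and a few hundred thousand records. I have no experience with, and little confidence about, tens of millions of records and tens of thousands of active users.
Here are some of my design ideas and questions; advice is welcome: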

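1. Strengthen caching
For a web2-type site, a squid reverse proxy is probably not a good fit, and because this setup needs a cluster, doing too much caching inside the JVM may cause other problems. The more suitable approach is probably static publishing: publish the data as XML files, then assemble each module (div) for display with XML + XSLT. (Publishing straight to HTML files with JSTL does not seem convenient and I have not tried it; I would welcome an introduction from anyone with experience.) The main goal is to stop the load at Apache. Alternatively, use memcached to cache objects such as article content and user profiles.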
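2. Splitting the database
There are two kinds of splitting. The first is splitting tables: frequently accessed data goes in one table and rarely accessed data in another.
For a blog, for example, the article table can be split into basic article information that rarely changes (title, author, body, ...) and article statistics that change often (read count, comment count, ...), in the hope that updates to the statistics become faster. (The drawback is fairly obvious in practice: reading an article needs one more query against the statistics table. I have no concrete data on whether this actually improves performance; figures from anyone with experience are welcome.)
When there are too many records, say tens of millions, that kind of split clearly no longer solves the problem and archiving is needed. Archiving roughly means creating an identical table and moving old content (say, older than three months) into it, so that the active table stays small. (MySQL has an ARCHIVE engine of its own, but from what I have read it seems of little use for handling large data volumes, since it does not even support indexes; advice from anyone who has used it is welcome.) The biggest problem with archiving is how to access the archived data: it becomes awkward when users want to see older content. (MySQL MERGE queries?) Does anyone have a good practice here? I have not thought of a good approach yet.
The other kind of splitting is physical: install a few dozen MySQL servers and spread the data across them by some rule. This also helps backup and recovery and system stability (if one database goes down, only part of the functionality or some users are affected). For a blog application, the ideal scheme is to split by user: for example, store the data of user IDs 1 to 100,000 on mysql1, the next 100,000 on mysql2, and so on, handling data growth by adding servers linearly. That looks close to perfect, except that it makes statistics and rankings troublesome.
With this second kind of splitting, database connections change. If the data reaches tens of millions of rows, a dozen or so MySQL servers are probably needed; at that point the ordinary connection pool has to go, replaced either by fetching a connection for each query or by a specially adapted connection pool.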
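The split by user-ID range described above can be sketched as a small router. This is only an illustration under stated assumptions: `RangeShardRouter` is a hypothetical name, every shard is assumed to hold the same number of user IDs, and the shard names stand in for real connection settings.

```python
import bisect


class RangeShardRouter:
    """Routes a user ID to the server that owns its ID range."""

    def __init__(self, users_per_shard, shard_names):
        self.shard_names = list(shard_names)
        # Inclusive upper bound of each shard's range: 100000, 200000, ...
        self.upper_bounds = [
            users_per_shard * (i + 1) for i in range(len(self.shard_names))
        ]

    def shard_for(self, user_id):
        if user_id < 1:
            raise ValueError("user IDs start at 1")
        idx = bisect.bisect_left(self.upper_bounds, user_id)
        if idx == len(self.shard_names):
            # The ID is beyond the last range: a new server must be added first.
            raise LookupError("no shard provisioned for user %d" % user_id)
        return self.shard_names[idx]
```

One way to address the connection-pool concern is to keep one ordinary pool per shard name and pick the pool with `shard_for`, rather than opening a connection for every query.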
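3. Use iBATIS
Drop Hibernate in favor of iBATIS, since iBATIS makes SQL tuning easy and any problem is more convenient to optimize (I have not used iBATIS yet; this is only an impression). On the other hand, if physical sharding works, strict SQL tuning seems to matter less. This is one more direction for optimization.
To summarize my design: static publishing of articles, user profiles, categories, tags, links, friends and the like (read and displayed with XML + XSLT) + physical sharding + iBATIS SQL tuning + short-lived JVM caching of a very few values such as total users and users online. Nothing else is cached (including turning off Hibernate's second-level cache, if Hibernate is used).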
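The blogs have little to do with one another, so sharding plus table splitting should work well.
There is no need to split by frequently versus rarely used data; simply divide the blogs into groups.
Also, since the business logic is fairly simple and you need to handle tens of millions of records and tens of thousands of active users, I think it is better to use JDBC + MySQL and build an application server yourself from scratch.

MySQL 5.1 already supports table partitioning. I tested it on a table with 1 million rows (using HASH partitioning), and query speed was very good.
You can use Velocity templates and publish directly to HTML.
Also, je2 has many static pages; I do not know how they are generated automatically, but the result looks good.
http://www.javaeye.com/static.html

I used to read pages with httpclient and write them to local files. In small quantities this worked well (both the speed and the resulting static pages). The project had to read them in a timed loop, and later, as the number of tasks grew, quartz could not schedule them fast enough, which caused out-of-memory errors.
I have not used Java. The tables definitely have to be split; how exactly depends on your application's requirements. After the split, the external read interface can be encapsulated so that locating the data is handled internally. Keep the cache in memory as much as possible rather than in files, otherwise I/O will suffer once data volume and traffic grow. Keep the system's major modules as independent as possible and let them communicate asynchronously through message queues.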
Large-Concurrency System Design

2007-2-15

Yang Siyong (杨思勇)

Email: yangsy.cq (at) gmail (dot) com

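Architecture design

Optimize the server configuration.
Load balancing.
Use a thread pool in the web container.
Use a connection pool for database connections.
Page precompilation.
Cache design.
Highly optimized SQL (SELECT and UPDATE), indexes, paging, and so on.
Data stream compression.

Optimize the server configuration

Add memory.
Raise the concurrency limit.
Upgrade the operating system version.
Partition the disks sensibly.

Load balancing

DNS load balancing.
Advantages: cheap, simple, and easy to set up; nodes can be located anywhere.
Disadvantages: slow to update; requests sent to a node that has gone down get no response.
Switch-based load balancing.
Advantages: reacts promptly to node failures; fast.
Disadvantages: places requirements on the switch; nodes must be attached to the switch.

Use a thread pool in the web container

In the process model, request handling is very slow, but it is fairly stable: when one process dies, the others are unaffected.
With a thread pool, responses are very fast and data can be shared between threads. The drawbacks are that one thread may affect others and that deadlocks are possible.

Use a connection pool for database connections

This improves response time; a connection pool is especially worthwhile when there are many SQL statements.
Pay attention to:
Releasing connections.
Transaction handling on connections.

Page precompilation

Compiled code runs much faster than script code (by the author's estimate, several orders of magnitude).
JSP pages are compiled on first run, which improves the response time from the second request onward.
All dynamic files that need compiling can be batch-compiled after deployment.

Cache design

Caching greatly reduces the pressure caused by data volume.
Whole-page caching.
Advantages: the whole page responds quickly.
Disadvantages: updates are not immediate; one block cannot be refreshed on its own.
Per-component caching.
Advantages: fast; what gets cached can be controlled in fine detail, saving memory.
Good caching solutions
Page level: squid, the OSCache taglib, and so on.
Component level: Memcached, OSCache, Ehcache, and so on.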

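Highly optimized SQL, indexes, and paging

Pass values as ? placeholders so that SQL is reused, for example: insert into table(f1,f2) values (?,?). The database caches such statements and does not parse them again.
Make sure joins use an index; it is best to join through the other table's primary key.
Use stored procedures.
Store times in a time type; databases optimize for date-type columns.
Frequently queried columns must be indexed, and an index in use should ideally exclude 80% of the rows in the table.
If it cannot, build a composite index, which must likewise exclude more than 80% of the rows.
Optimize and rebuild indexes regularly.
Sort query results by the primary key where possible.
Do not put more than 5 indexes on one table.
Avoid LIKE queries on large columns where possible.
A SQL statement that takes more than 100 ms almost always has a problem: the design, an unoptimized statement, or an index that is not being used correctly.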
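The placeholder advice can be tried with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in so that the sketch runs anywhere; the `article` table and its columns are invented for the example. The same statement text is reused for every row, and the values never need escaping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, created DATE)"
)

# One statement text with ? placeholders, reused for every row.
insert_sql = "INSERT INTO article (title, created) VALUES (?, ?)"
rows = [("first post", "2007-02-15"), ("it's quoted", "2007-02-16")]
conn.executemany(insert_sql, rows)

# A frequently queried column gets an index.
conn.execute("CREATE INDEX idx_article_created ON article (created)")

found = conn.execute(
    "SELECT title FROM article WHERE created = ?", ("2007-02-16",)
).fetchall()
conn.commit()
```

The second title contains a quote character and is stored unchanged, which is the other benefit of placeholders besides statement reuse.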

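Data stream compression

Data stream compression is mainly used for transfers between the web server and the browser.
Practically all current browsers support gzip and deflate compression.
Watch the compression ratio.
Do not compress files that are already compressed, such as jpg, rar, and zip; doing so lowers performance.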
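A quick way to see why already-compressed files should be skipped is to compare compression ratios. In this sketch, repetitive markup stands in for an HTML page and pseudo-random bytes stand in for a jpg or zip payload; both inputs are made up for the illustration.

```python
import gzip
import random


def gzip_ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(gzip.compress(data)) / len(data)


html_like = b"<li class='post'>hello world</li>\n" * 500

# Pseudo-random bytes behave like data that has already been compressed.
rng = random.Random(0)
already_compressed_like = bytes(rng.getrandbits(8) for _ in range(len(html_like)))

text_ratio = gzip_ratio(html_like)
random_ratio = gzip_ratio(already_compressed_like)
```

The markup shrinks to a small fraction of its size, while the random payload does not shrink at all and still costs CPU time on every request.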

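 High-Performance Server Design
http://blog.chinaunix.net/u/5251/showart_236329.html
Continuing from the previous post, we naturally arrive at the topic of high-performance server design.

I looked in turn at the source code of haproxy, l7sw, and lighttpd. Without exception they treat I/O multiplexing as the best-performing server architecture, and that is probably right. Processes were introduced partly to preserve a task's execution context and so simplify application design. If the program's logic is not very complex, using a whole process control block to hold that context is overkill, and once process scheduling and other overhead are added, the efficiency gained in program design may well be cancelled out by inefficiency at run time. There is a price, though: the programming becomes more challenging.

Once the architecture is chosen, we have to consider finer details, such as which operating system to use and which of its APIs. Our predecessors have done a great deal here, and we can simply borrow their work; repeating it would be a waste of time. High-Performance Server Architecture analyzes the root causes of server inefficiency: data copies, context switches (between user and kernel), memory allocation (management), and lock contention. The C10K Problem lists and analyzes the system call interfaces that UNIX, Linux, and even some versions of Windows designed to improve server performance; what makes this document especially valuable is that it is kept up to date. Benchmarking BSD and Linux uses measured data and charts to lay out the performance of the relevant system calls on BSD and Linux, and the results are exciting: the relevant system calls in Linux 2.6 have O(1) time complexity.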

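A brief summary:
1. Use a Linux 2.6.x kernel, not only for its performance but also because being open source (which is not to say that other UNIXes or BSD derivatives are not open source) makes programming easier; the service could even be moved into kernel space.
2. For multiplexing, use epoll in level-triggered mode; edge-triggered mode can be used when necessary, but take care that data does not stall.
3. To avoid data copies, use the sendfile system call to send small files or small parts of files, and take care to avoid sendfile blocking on disk I/O.
4. If the service involves a lot of disk I/O, use the asynchronous I/O mechanism provided by the Linux kernel, whose user-space library is libaio. Note that this is not the asynchronous I/O implementation currently shipped with glibc.
5. When several buffers need to be transferred at once, use writev/readv to reduce the context-switch overhead of system calls. When writing to a network socket descriptor, this also helps to some extent to avoid small frames on the network; for the same purpose, the TCP_CORK option can be enabled selectively.
6. Implement your own memory management, for example caching data and reusing common data structures.
7. Use multiple threads instead of multiple processes; the thread library is of course NPTL.
8. Avoid unnecessary synchronization between processes/threads, and keep critical sections short.
In ESR's eyes, these fiddly details might all be premature optimization, and he might again advise us to wait for hardware upgrades. The reminder has its point: we should put even more effort into algorithm design and reduce time complexity as circumstances allow. Why not mention space complexity? Memory is still relatively cheap, but do not forget that the bottleneck of today's computers is mostly memory access.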
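To make point 2 concrete, here is a minimal sketch of a level-triggered multiplexing loop. It uses Python's standard `selectors` module, which picks epoll on Linux and reports a socket as readable for as long as unread data remains; it is a stand-in for the C epoll code the article has in mind, and the uppercase "echo" behaviour is invented for the demonstration.

```python
import selectors
import socket


def serve_once(sel, timeout=1.0):
    """Run one pass of the event loop and return the number of ready sockets."""
    events = sel.select(timeout)
    for key, _mask in events:
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            conn.sendall(data.upper())  # trivial "service": echo in upper case
        else:
            # An empty read means the peer closed the connection.
            sel.unregister(conn)
            conn.close()
    return len(events)


sel = selectors.DefaultSelector()
server_side, client_side = socket.socketpair()
server_side.setblocking(False)
sel.register(server_side, selectors.EVENT_READ)

client_side.sendall(b"ping")
handled = serve_once(sel)
reply = client_side.recv(4096)

sel.unregister(server_side)
server_side.close()
client_side.close()
sel.close()
```

A single thread can watch many sockets this way; with edge-triggered notification the loop would instead have to keep reading until the socket is drained, which is the "data stalls" risk mentioned above.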

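One reminder: SMP systems and multi-core CPUs are now common. With a single-process (single-thread) multiplexing model, only one CPU serves that process (thread) at any moment, so the CPUs' computing power is not fully used. At least as many processes (threads) as there are CPUs (cores) are needed to share the load. There is a workaround that needs no source changes: run two instances of the server program on the machine, on different service ports of course, and put a load balancer in front to spread traffic and connections evenly across the two ports; DNAT is a simple way to do the balancing. At that point we are in effect treating a multi-CPU or multi-core system as a cluster of several systems.

Raising the capacity of a single server alone does not seem to be enough to improve performance, and higher-spec servers cost more. So people often use server clusters, distributing the computation as far as possible to relatively cheap machines that each work independently, to raise overall performance. Experience has shown that this architecture is not only practical but also improves availability and fault tolerance. For network servers, LVS, the IP-layer load balancing solution in the Linux kernel designed by Zhang Wensong (章文嵩), is well known; there are also haproxy, which works at the application layer, and l7sw, which is just getting started.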

 优势与应用：再谈CDN镜像加速技术
Source: 中国IDC圈  Date: 2007-1-22  Author: anonymous

　　CDN，全称是Content Delivery Network，中文可译为“内容快递网”。它是一个建立并覆盖在互联网（Internet）之上的一层特殊网络，专门用于通过互联网高效传递丰富的多媒体内容。CDN 出现和存在的意义在于它使互联网更有效地为人们服务，特别是那些对互联网内容有更高要求（比如由简单的文字和图片等静态内容到声像俱全的多媒体动态内容）的人们。

“CDN技术”简介

　　CDN的全称是Content Delivery Network，即内容分发网络。其目的是通过在现有的Internet中增加一层新的网络架构，将网站的内容发布到最接近用户的网络”边缘”，使用户可以就近取得所需的内容，解决Internet网络拥挤的状况，提高用户访问网站的响应速度。从技术上全面解决由于网络带宽小、用户访问量大、网点分布不均等原因所造成的用户访问网站响应速度慢的问题。

　　目前，国内访问量较高的大型网站如新浪、网易等，均使用CDN网络加速技术，虽然网站的访问巨大，但无论在什么地方访问都会感觉速度很快。而一般的网站如果服务器在网通，电信用户访问很慢，如果服务器在电信，网通用户访问又很慢。

“CDN技术”的优势

　　1、本地Cache加速提高了企业站点（尤其含有大量图片和静态页面站点）的访问速度，并大大提高以上性质站点的稳定性

　　2、镜像服务消除了不同运营商之间互联的瓶颈造成的影响，实现了跨运营商的网络加速，保证不同网络中的用户都能得到良好的访问质量。

　　3、远程加速远程访问用户根据DNS负载均衡技术智能自动选择Cache服务器，选择最快的Cache服务器，加快远程访问的速度

　　4、带宽优化自动生成服务器的远程Mirror（镜像）cache服务器，远程用户访问时从cache服务器上读取数据，减少远程访问的带宽、分担网络流量、减轻原站点WEB服务器负载等功能。

　　5、集群抗攻击广泛分布的CDN节点加上节点之间的智能冗于机制，可以有效地预防黑客入侵以及降低各种D.D.o.S攻击对网站的影响，同时保证较好的服务质量。

网站用“CDN技术”武装的流程

　　第一步：修改DNS解析

　　前面已经说到，CDN其实是夹在网页浏览者和被访问的服务器中间的一层镜像或者说缓存，浏览者访问时点击的还是服务器原来的URL地址，但是他看到的内容其实是离他的IP地址所在地最近的一台镜像服务器上的页面缓存内容，也就是说用户在使用原来的URL访问服务器时并没有实际访问到服务器上的内容，所以要实现这个效果，就得在这个服务器的域名解析上进行一些调整。

　　实际上，这个服务器的域名解析过程已经转变为为访问者选择离他最近的镜像服务器，因此域名的解析服务器的IP要改成CDN运营商架设的智能解析服务器的IP，例如你在新网注册一个域名，默认用的就是新网的DNS服务器为你进行解析，而假设你选择了网宿的CDN服务，就得修改域名管理的设置，改成使用网宿的CDN解析服务器来进行解析。

　　这样，当一个浏览者访问你的网站时，他访问的URL地址就会被网宿的CDN解析服务器解析到网宿科技各地镜像服务器中离这个浏览者最近的一台上面。

　　第二步：调整网页架构

　　CDN既然是一种缓存技术，那么它的实时性肯定是无法实现的，镜像服务器上的缓存一般都是隔一定的时间更新一次，因此在更新期间内，用户看到的内容是不会变的；所以使用CDN加速的服务器应该以静态页面和实时更新频率较低的内容为主，像论坛、天气预报这种内容更新频繁的站点使用CDN反而适得其反。

　　CDN最适合的领域是资讯提供站点或者其他以静态页面为主的内容展示性质站点。

　　第三步：镜像服务器自动高新缓存

　　镜像服务器上面安装有一个可以进行自动远程备份的软件，当然它只备份静态页面和图片这些，每隔一定的时间，各个镜像服务器就会到网站的源服务器上去获取最新的内容。

　　那么有些网友就觉得，如果源服务器已经更新了但是缓存服务器还没更新，那该怎么办？这个问题其实并不存在，如果用户访问的是缓存服务器上也没有的页面，那么镜像服务器会先从源服务器上拿到这个页面的缓存然后再发送给访问者，如果用户访问的是动态页面，那么这个访问请求就会被提交到源服务器。

“CDN技术”的应用和效果

　　CDN对于门户性质资讯站点的加速效果还是非常明显的，以新浪为例：

　　新浪采用了ChinaCache做的CDN系统，ChinaCache在全国分布了四十多个点，同时采用基于动态DNS分配的全球服务器负载均衡技术。

　　从新浪的站点结构可以看出：

　　 > www.sina.com.cn

　　Server: UnKnown

　　Address: 192.168.1.254

　　Non-authoritative answer:

　　Name: libra.sina.com.cn

　　Addresses: 61.135.152.71, 61.135.152.72, 61.135.152.73, 61.135.152.74, 61.135.152.75, 61.135.152.76, 61.135.153.181, 61.135.153.182, 61.135.153.183, 61.135.153.184, 61.135.152.65, 61.135.152.66, 61.135.152.67, 61.135.152.68, 61.135.152.69, 61.135.152.70

　　Aliases: www.sina.com.cn, jupiter.sina.com.cn

　　在北京地区ChinaCache将 www.sina.com.cn的网址解析到libra.sina.com.cn，然后libra.sina.com.cn做了DNS负载均衡，将libra.sina.com.cn解析到61.135.152.71等16个ip上，这16个ip分布在北京的多台前台缓存服务器上，使用squid做前台缓存。如果是在其它地区访问 www.sina.com.cn可能解析到本地相应的服务器，例如pavo.sina.com.cn，然后pavo又对应了很多做了squid的ip。这样就实现了在不同地区访问自动转到最近的服务器访问，达到加快访问速度的效果。

　　我们再看一个新浪其它频道是指到哪里的：

　　> news.sina.com.cn

　　Server: UnKnown

　　Address: 192.168.1.254

　　Non-authoritative answer:

　　Name: libra.sina.com.cn

　　Addresses: 61.135.152.65, 61.135.152.66, 61.135.152.67, 61.135.152.68 61.135.152.69, 61.135.152.70, 61.135.152.71, 61.135.152.72, 61.135.152.73 61.135.153.178, 61.135.153.179, 61.135.153.180, 61.135.153.181, 61.135.153.182 61.135.153.183, 61.135.153.184

　　Aliases: news.sina.com.cn, jupiter.sina.com.cn

　　可以看出，各个频道的前台缓存集群与 www.sina.com.cn的前台缓存集群是相同的。

　　新浪使用CDN后效果也非常明显：

　　这是在笔者在广州 ping 新浪域名，被解析到华南这边的镜像服务器，反映速度快，稳定无丢包：

　　假如没使用CDN，还是访问新浪在北京架设的服务器，不仅反应速度慢了好几倍，甚至还出现超时：

“CDN技术” 与 “镜像站点” 的区别

　　CDN有别于镜像，因为它比镜像更智能，或者可以做这样一个比喻：CDN＝更智能的镜像+缓存+流量导流。因而，CDN可以明显提高Internet网络中信息流动的效率。从技术上全面解决由于网络带宽小、用户访问量大、网点分布不均等问题，提高用户访问网站的响应速度。
 除了程序设计优化，zend+ eacc(memcached)外，有什么办法能提高服务器的负载能力呢?
发表时间: 2007-7-03 17:35 作者: cdexs 来源: PHPChina 开源社区门户
看到豆瓣网单台AMD服务器,能支撑5w注册用户，我想他同时在线用户不会低于5K,那么，在(php)系统设计、系统加速方面，DB方面做怎样的优化才能充分利用系统资源，最大限度的提高系统负载能力呢。

是不是还有其他的办法呢?
• Snake.Zero (2007-7-03 17:37:31)
squid不可少
• cdexs (2007-7-03 18:42:16)
楼上，我想在多台server时，squid比较用的上，当我只有1 ~2台服务器呢?
• 虾球桑 (2007-7-04 03:06:24)
豆瓣的程序应该是他们自己写的吧，如果你的也是自己写的，把脚本好好优化可以提速不少，优良的代码结构能比操蛋的结构高出几成的效率
• cdexs (2007-7-04 10:04:32)
楼上，在程序设计时，针对性能和效率有那些技巧和注意点呢??
• Snake.Zero (2007-7-04 10:35:38)
统计不是个小问题，所以建议独立一台服务器专门做统计，系统优化方面我不是很足的经验，如果是多台服务器的话
可以3：2：1或者3：2：2的方式，3台squid，2台PORTAL，2台DB读写分离
另外你也可以选择分离文件服务器，把那些静态的东西尽量不要让APACHE来完成
• cdexs (2007-7-04 11:26:38)
楼上的对，我准备将站内的图片从逻辑中分离出来(现在物理上还是一台server)。
• pigpluspower (2007-7-04 14:55:49)
最实际的做法（单，多机通用）：
尽量油画你的代码！

若是劣质的代码，即使你有Zend Platform由能怎么样？

除了一些技术性的优化以外，以下这一点小细节可以帮到不少的忙：
1、免除多余的空格，如“if”后面那个括号里面的空格
2、在开发过后记得要去除所有的注释，以提高效率
3、避免“双重循环结构”，比如两个不同条件判断，却要运行同一项处理，if循环中尽量使用“&&”连接（一般没有人会去写“双重循环结构”，除非代码过于复杂）
4、若可能，干脆把“换行”去掉，“;”后面直接接代码
• cdexs (2007-7-04 14:58:59)
楼上的我没法说你的。。。。。
• Snake.Zero (2007-7-04 16:24:38)
是啊，很寒
去注释是通过php.exe来完成的

另：楼主的代码编译过没？
• 太阳雨 (2007-7-04 17:28:05)
豆瓣用的是: lighttpd/1.4.15

• cdexs (2007-7-04 20:08:07)
用eaccelerator,准备用 zend 编译过，再加上 eaccelerator。但看到有人说 eacc对 zend编过的代码不起任何作用，而且 zend optimizer 本身也要占用资源，所以一直在权衡……大家给个建议哈

看到 lighhttpd + fastcgi 比 apache+mod_php 快，准备选用前种web服务。

看了下 xajax的调用方法, 想着，对于ajax的支持调用开销，xajax是不是比jquery要小?比如有个情况,需要调用当前php页面中的用户是否存在的检测。用jquery可能需要: current.php?u=xxx。这个时候，页面接受参数，按照顺序执行到处理"u" 参数(页面中的chkusr())，再输出。但如果用xajax, 直接在client里调用后台页面方法 chkusr(arg)
,此时显然比jquery调用少了从页面开始第一条语句顺序执行而造成多余指令执行的资源浪费。偶没有仔细研究过xajax的实现细节。大家讨论讨论

• Snake.Zero (2007-7-04 21:25:52)
lighhttpd 确实不错
• ok7758521ok (2007-7-07 12:29:54)
网站的拓扑结构
首先用户层
一般第一层采用cache拦截技术，（一般可以阻挡50%以上的流量）
第二层应用层 --程序设计层
第三层数据缓存层层--memcache
第四层数据库（db）层
 如何规划您的大型JAVA多并发服务器程序

Author: Chen Linmao (陈林茂)  Published: 2003-4-5  Source: reposted

JAVA 自从问世以来，越来越多的大型服务器程序都采用它进行开发，主要是看中
它的稳定性及安全性，但对于一个新手来说，您又如何开发您的JAVA 应用服务器，
同时又如何规划您的JAVA服务器程序，并且很好的控制您的应用服务器开发的进度，
最后，您又如何发布您的JAVA 应用服务器呢？（由于很多前辈已有不错的著作，我
只能在这里画画瓢，不足指出，请多来信指正，晚辈将虚心接受！本人的联系方式：
linmaochen@sohu.com）

废话少说，下面转入正题：
本文将分以下几个部分来阐述我的方法：
1、怎样分析服务器的需求？

2、怎样规划服务器的架构？
3、怎样规划服务器的目录及命名规范、开发代号？
4、原型的开发（-）：怎样设计服务器的代码骨架？
5、原型的开发（二）：怎样测试您的代码骨架？
6、详细的编码？
7、如何发布您的JAVA 服务器产品？

一、如何分析服务器的需求？
我的观点是：

1。服务器就像一台轧汁机，进去的是一根根的甘蔗，出来的是一杯杯的甘蔗汁；
也就是说，在开发服务器之前，先要明白，服务器的请求是什么？原始数据是什么？
接下来要弄明白，希望得到的结果是什么？结果数据应该怎样来表述？
其实要考虑的很多，无法一一列出（略）。

二、如何规划服务器的架构？
首先问大家一个小小的问题：在上海的大都市里，公路上的公交客车大致可以分为以下两类：
空调客车，票价一般为两块，上车不需要排队，能否坐上座位，就要看个人的综合能力；
无人售票车，票价一般1 块和一块五毛，上车前需要规规矩矩排队，当然，座位是每个人都有的。
那么，我的问题是，哪类车的秩序好呢？而且上下车的速度快呢？答案是肯定的：无人售票车。

所以，我一般设计服务器的架构主要为：
首先需要有一个请求队列，负责接收客户端的请求，同时它也应有一个请求处理机制，说到实际

上，应有一个处理的接口；
其次应该有一个输出队列，负责收集已处理好的请求，并准备好对应的回答；当然，它也有一个
回答机制，即如何将结果信息发送给客户端；

大家都知道，服务器程序没有日志是不行的，那么，服务器同时需要有一个日志队列，负责整个服
务器的日志信息收集和处理；

最后说一点，上公交车是需要有钞票的，所以，服务器同样需要有一个验证机制。
…(要说的东西实在太多，只好略)

三、怎样规划服务器的目录及命名规范、开发代号
对于一般的大型服务器程序，应该有下面几个目录：
bin : 主要存放服务器的可执行二进制文件；

common: 存放JAVA程序执行需要的支持类库；
conf : 存放服务器程序的配置文件信息；
logs : 存放服务器的日志信息；
temp : 存放服务器运行当中产生的一些临时文件信息；
cache : 存放服务器运行当中产生的一些缓冲文件；
src : 当然是存放服务器的JAVA源程序啦。
……（其他的设定，根据具体需求。）

四、原型的开发（-）：怎样设计服务器的代码骨架？

1。首先服务器程序需要有一个启动类，我们不妨以服务器的名字命名：(ServerName).class
2。服务器需要有一个掌控全局的线程，姑且以：(MainThread.class)命名；
3。注意不论是短连接和长连接，每一个客户端需要有一个线程给看着，以 ClientThread.class 命名
4。请求队列同样需要以线程的方式来表现： (InputQuene.Class),对应的线程处理类以InputProcessThread.class
命名；
5。输出队列也需要一个线程：（OutputQuene.Class）,对应的处理机制以OutputProcessThread.class 命名；
6。日志队列也是需要一个线程的，我们以 logQuene.class,logQueneThread.Class 来命名；
7。缓冲区的清理同样需要定时工作的，我们以CacheThread.Class 来命名；
8. 如果您的参数信息是以XML的方式来表达的话，那么我也建议用一个单独的类来管理这些参数信息：
Config.Class
9. 当然，如果您想做得更细一点的话，不妨将客户端客服务器端的通讯部分也以接口的形式做出来：

CommInterface.Class
……(太多，只能有空再说！)

五、原型的开发（二）：怎样测试您的代码骨架？
下面为原型的骨架代码，希望大家多多提点意见！谢啦！
/* 服务器描述 : 服务器主控线程
1。读取组态文件信息
2。建立需求输入队列
3。建立需求处理输出队列
4。建立需求处理线程
5。建立输出预处理线程，进行需求处理结果的预处理
6. 建立缓冲区管理线程，开始对缓冲取进行管理
7。建立服务连接套捷字，接受客户的连接请求，并建立客户连接处理线程
*/
import java.io.*;
import java.net.*;
import java.util.*;

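public class mainThread extends Thread {

private ServerSocket serverSocket=null;
/* Port the server listens on. It must be loaded from the configuration file;
   the loading code is omitted in the original. */
private int serverPort;
/* Loop flag; set to false to stop accepting connections. */
private volatile boolean listening=true;

public mainThread(String ConfUrl) {
try{
/* Create the listening server socket. */
this.serverSocket =new ServerSocket(serverPort);
}catch(IOException e){
System.out.println(e.getMessage());
/* Without a socket there is nothing to accept. */
listening=false;
}
}
/* Thread body: accept clients and hand each one to its own ClientThread. */
public void run(){
while(listening){
try{
Socket clientSocket =this.serverSocket.accept();

ClientThread _clientThread=
new ClientThread([ParamList]);
_clientThread.start();

}catch(IOException e){
System.out.println(e.getMessage());
}
}
/* Exit the system. */
System.exit(0);
}
}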

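/*
Client connection thread:
1. Complete the client's connection request and verify the user's password.
2. Accept the user's requests and push them onto the request queue.
3. Look up the matching result in the output queue and send it to the client.
4. Handle exceptions raised while processing and send log messages to the log server.
*/
import java.io.*;
import java.net.*;
public class ClientThread extends Thread {
public ClientThread([ParamList]){
}

public void run(){
}
}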

/*
请求队列：
1. 将客户的需求压入队列
2。将客户的需求弹出队列
*/

import java.util.*;
public class InputQuene {
private Vector InputTeam;
public InputQuene() {
/*初始化队列容量*/
InputTeam=new Vector(100);
}
/*需求进队函数*/
public synchronized void enQuene(Request request){
InputTeam.add(request);
}
/*将一个请求出队*/
public synchronized void deQuene(int index){
this.InputTeam.remove(index);
}
}

/*
请求队列处理线程
1。按先进先出的算法从需求队列中依次取出每一个请求，并进行处理
2。更新请求的处理状态
3。清理已经处理过的请求
*/

import java.io.*;
import java.util.*;

public class InputProcessThread extends Thread{
private InputQuene _InQuene;
public InputProcessThread(){
}

public void run(){
}
}

/*
结果输出队列：
1。完成输出结果的进队
2。完成输出结果的出队
*/
import java.util.*;
import java.io.*;
public class OutputQuene {
//结果输出队列容器
private Vector outputTeam;
public OutputQuene() {
//初始化结果输出队列
outputTeam=new Vector(100);
}
//进队函数
public synchronized void enQuene(Result result){
outputTeam.add(result);
}

/*出队函数*/
public synchronized void deQuene(int index){
outputTeam.remove(index);
}
}

/*
结果处理线程：
1。完成输出结果的确认
2。完成输出结果文件流的生成
3。完成文件流的压缩处理
*/
import java.io.*;
public class OutputProcessThread extends Thread{

private OutputQuene _outputQuene;
public OutputProcessThread([ParamList]) {
//todo
}
/*线程的执行绪*/
public void run(){
while(doing){
try{
/*处理输出队列*/
ProcessQuene();
}catch(Exception e){
e.printStackTrace();
}
}
}
}

/*
日志信息处理线程：
功能说明：
1。完成服务器日志信息的保存
2。根据设定的规则进行日志信息的清理
期望的目标：
目前日志信息的保存在一个文件当中，以后要自动控制文件的大小。
*/

import java.io.*;
import java.util.*;
public class LogThread extends Thread{
private LogQuene logquene;

public LogThread([ParamList]){
//todo
}
/*处理日志信息*/
public void run(){
while(doing){
this.processLog();
try{
this.sleep(100);
}catch(Exception e){
}
}
}
}

/* 功能描述：
管理缓冲区中的文件信息，将文件所有的大小控制在系统设定的范围之内
*/
import java.io.*;
import java.lang.*;
import java.util.*;
import java.text.*;
import java.math.*;

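public class CacheThread extends Thread{

private String CachePath;
/* Directory object for the cache path (used below but not declared in the original). */
private File CacheDir;

/* Constructor. Parameter: Url, the path of the cache directory. */
public CacheThread(String Url) {
this.CachePath =Url;
/* Create the File object used to scan the directory. */
try{
this.CacheDir =new File(this.CachePath);
}catch(Exception e){
e.printStackTrace();
}
}
// Thread body
public void run(){
// Periodically clean up the files in the cache directory
}
……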

 如何架构一个“Just so so”的网站？
2007年08月11日星期六下午 04:13
作者：老王
所谓“Just so so”，翻译成中文大致是“马马虎虎，还算凑合”的意思。所以，如果你想搞一个新浪，搜狐之类的门户的话，估计这篇文章对你没有太大用处，但是就像80/20原则所叙述的一样，大多数站点其实都是“Just so so”的规模而已。
那么如何架构一个“Just so so”的网站呢？IMO（在我看来：In My Opinions），可以粗略的分为硬架构和软架构，这个分类是我一拍脑袋杜撰出来的，所以有考证癖的网友们也不用去搜索引擎查找相关资料了。简单解释一下：所谓硬架构主要是说网站的运行方式和环境等。所谓软架构主要是说在代码层次上如何实现功能等。下面就分别看看How to do。
一：硬架构
1：机房的选择：
在选择机房的时候，根据网站用户的地域分布，可以选择网通或电信机房，但更多时候，可能双线机房才是合适的。越大的城市，机房价格越贵，从成本的角度看可以在一些中小城市托管服务器，比如说北京的公司可以考虑把服务器托管在天津，廊坊等地，不是特别远，但是价格会便宜很多。
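2: Bandwidth:
When the boss pays us to architect a site, we are usually given targets, such as the site being able to handle 1 million page views (PV) a day. We then need to estimate roughly how much bandwidth is required. The calculation involves two figures (peak traffic and page size), so let us make the necessary assumptions first:
First: assume peak traffic is 5 times the average.
Second: assume the average page size per visit is about 100 KB.
If 1 million PV were spread evenly over a day, that would be about 12 visits per second. At roughly 100 KB per visit, those 12 visits total about 1,200 KB. Bytes are measured in Byte while bandwidth is measured in bit, and 1 Byte = 8 bit, so 1,200 KB is roughly 9,600 Kbit, or about 9 Mbps. In practice the site must stay accessible at peak traffic, so under the peak assumption the real bandwidth requirement is about 45 Mbps.
Of course, this conclusion rests on the two assumptions above; if your situation differs from them, the result will differ too.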
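The estimate is easy to redo with other assumptions when it is written as a small function. This is just the arithmetic above in code; the function name is made up, and 1 Mbit is taken as 1,000 Kbit.

```python
def required_bandwidth_mbps(daily_page_views, page_kbytes, peak_factor):
    """Peak bandwidth in Mbps for evenly spread traffic times a peak factor."""
    requests_per_second = daily_page_views / 86400  # seconds in a day
    avg_kbits_per_second = requests_per_second * page_kbytes * 8  # Byte -> bit
    return avg_kbits_per_second * peak_factor / 1000


estimate = required_bandwidth_mbps(1_000_000, 100, 5)
```

Without the intermediate rounding (12 visits per second, 9 Mbps) the result comes out slightly above 46 Mbps, in line with the roughly 45 Mbps quoted in the text.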
3：服务器的划分：
先看我们都需要哪些服务器：图片服务器，页面服务器，数据库服务器，应用服务器，日志服务器等等。
对于访问量大点的网站而言，分离单独的图片服务器和页面服务器相当必要，我们可以用lighttpd来跑图片服务器，用apache来跑页面服务器，当然也可以选择别的，甚至，我们可以扩展成很多台图片服务器和很多台页面服务器，并设置相关域名，如img.domain.com和 www.domain.com，页面里的图片路径都使用绝对路径，如http://img.domain.com/abc.gif" />，然后设置DNS轮循，达到最初级的负载均衡。当然，服务器多了就不可避免的涉及一个同步的问题，这个可以使用rsync软件来搞定。
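The database server is the most important of all, because nine times out of ten a site's bottleneck is the database. Small and medium sites today mostly use MySQL, but its clustering feature does not yet seem to have reached a stable stage, so I will not evaluate it here. Generally, with MySQL we should set up a master-slave structure (one master, several slaves), with InnoDB tables on the master and MyISAM tables on the slaves to play to the strengths of each. This structure separates reads from writes and reduces read load, and we can even dedicate one slave to backups. Otherwise, with only one master and a large data set, mysqldump is basically out of the question, and copying the data files directly requires stopping the database service first or the backup will be corrupt. For many sites, even one second of database downtime is unacceptable. With a slave, you can stop replication (slave stop), take the backup, and start replication again (slave start); the slave then automatically catches up with the master, and nothing is affected. The master-slave structure has a fatal flaw, though: it only reduces read load and cannot reduce write load. To scale further, perhaps only one move is left: splitting the database horizontally or vertically. Horizontal splitting, as the term is used here, means storing different tables on different database servers, for example the user table on server A and the article table on server B. This split has a cost, the most basic being that you can no longer do operations such as LEFT JOIN across them. Vertical splitting, as used here, generally means choosing the server by user ID (user_id) or similar. For example, with 5 database servers, rows where "user_id % 5 + 1" equals 1 go to server 1, rows where it equals 2 go to server 2, and so on. There are many possible rules for this, to be chosen as the situation requires. Like horizontal splitting, it has a cost, the most basic being that aggregate operations such as COUNT and SUM become much more troublesome. (Note that much of the literature uses the two terms the other way round, calling the split by table "vertical" and the split by user "horizontal".) In summary, the database solution is usually a hybrid that draws on the strengths of each approach, and sometimes third-party software such as memcached is needed to cope with higher traffic.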
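The "user_id % 5 + 1" rule and the extra work it creates for aggregates can be sketched as follows. Plain lists stand in for the five database servers, and the helper names are invented for the example.

```python
def shard_number(user_id, shard_count=5):
    """Server number (1-based) that stores this user's rows."""
    return user_id % shard_count + 1


def count_all(shards):
    # A COUNT can no longer be answered by one server: the application has to
    # ask every shard and add up the partial results itself.
    return sum(len(rows) for rows in shards.values())


# Five in-memory lists stand in for the five database servers.
shards = {n: [] for n in range(1, 6)}
for user_id in range(1, 13):
    shards[shard_number(user_id)].append(user_id)

total_users = count_all(shards)
```

Single-user reads and writes touch exactly one server, while every site-wide COUNT or SUM becomes a fan-out over all of them, which is the cost the paragraph describes.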
如果有专门的应用服务器来跑PHP脚本是最合适不过的了，那样我们的页面服务器只保存静态页面就可以了，可以给应用服务器设置一些诸如app.domain.com之类的域名来和页面服务器加以区别。对于应用服务器，我还是更倾向于使用prefork模式的apache，配上必要的xcache之类的PHP缓存软件，加载模块要越少越好，除了mod_rewrite等必要的模块，不必要的东西统统舍弃，尽量减少httpd进程的内存消耗，而那些图片服务器，页面服务器等静态内容就可以使用lighttpd或者tux来搞，充分发挥各种服务器的特点。
如果条件允许，独立的日志服务器也是必要的，一般小网站的做法都是把页面服务器和日志服务器合二为一了，在凌晨访问量不大的时候cron运行前一天的日志计算，不过如果你使用awstats之类的日志分析软件，对于百万级访问量而言，即使按天归档，也会消耗很多时间和服务器资源去计算，所以分离单独的日志服务器还是有好处的，这样不会影响正式服务器的工作状态。
二：软架构
1：框架的选择：
现在的PHP框架有很多选择，比如：CakePHP，Symfony，Zend Framework等等，至于应该使用哪一个并没有唯一的答案，要根据Team里团队成员对各个框架的了解程度而定。很多时候，即使没有使用框架，一样能写出好的程序来，比如Flickr据说就是用Pear+Smarty这样的类库写出来的，所以，是否用框架，用什么框架，一般不是最重要的，重要的是我们的编程思想里要有框架的意识。
2：逻辑的分层：
网站规模到了一定的程度之后，代码里各种逻辑纠缠在一起，会给维护和扩展带来巨大的障碍，这时我们的解决方式其实很简单，那就是重构，将逻辑进行分层。通常，自上而下可以分为表现层，应用层，领域层，持久层。
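The presentation layer is not just templates. Its scope is wider: all logic related to presentation belongs in it. That some text should be shown in red, or that a paragraph should start with a two-character indent, are presentation concerns. A frequent mistake is to put presentation logic in other layers. A very common example: when listing article titles, we set a maximum length, and a title that exceeds it is cut off and followed by "..". This is textbook presentation logic, yet many programmers fetch and truncate the data in non-presentation code and then assign it to the template. The most direct drawback is that the same data may need its first 10 characters on one page and its first 15 on another, and once the length is hard-coded in the program, portability is lost. The right approach is to write a view helper to handle this kind of logic; Smarty's truncate is such a helper (although its implementation is not suitable for Chinese).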
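A view helper of this kind can be very small. The sketch below is written in Python rather than as a Smarty plugin, and the name `truncate` and its parameters are chosen for the illustration. Because Python strings are sequences of Unicode characters, it counts characters rather than bytes and so does not cut a Chinese character in half.

```python
def truncate(text, max_chars, ellipsis=".."):
    """Cut text to at most max_chars characters, marking the cut with an ellipsis."""
    if max_chars < 0:
        raise ValueError("max_chars must not be negative")
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + ellipsis


title = "高负载高并发网站架构分析"
list_page_title = truncate(title, 5)     # one page shows 5 characters
sidebar_title = truncate(title, 10)      # another page shows 10
```

The data layer hands over the full title, and each template decides how much of it to show.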
所谓应用层，它的主要作用是定义用户可以做什么，并把操作结果反馈给表现层。至于如何做，通常不是它的职责范围（而是领域层的职责范围），它会通过委派把如何做的工作交给领域层去处理。在使用MVC架构的网站中，我们可以看到类似下面这样的URL：domain.com/articles/view/123，其内部编码实现，一般就是一个Articles控制器类，里面有一个view方法，这就是一个典型的应用层操作，因为它定义了用户可以做一个查看的动作。在MVC架构中，有一个准则是这么说的：Rich Model Is Good。言外之意，就是Controller要保持“瘦”一些比较好，进而说明应用层要尽量简单，不要包括涉及领域内容的逻辑。
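The domain layer is, most directly, the layer that contains domain logic. It is the soul of a piece of software. First, what is domain logic? Put simply, logic that carries a clear domain concept is domain logic. Take withdrawing money at an ATM. The process is roughly: insert the UnionPay card, enter the PIN, enter the amount, confirm, take the cash, and then the ATM prints a receipt. Within this process, debiting the account is domain logic, because a withdrawal is a clear domain concept in banking. Printing the receipt is not domain logic but only application logic, because the receipt is not a clear domain concept in banking, just a technical means; the bank could equally send a reminder SMS instead. This is not absolute: if in a real case a receipt must be printed after every withdrawal, so that the receipt is tightly bound to the withdrawal, then printing it can be regarded as part of the domain logic. Everything depends on the specifics of the problem. In Eric Evans's classic Domain-Driven Design, the domain layer is divided into five basic elements: entities, value objects, services, factories, and repositories; see the book for details. The most common domain-layer mistake is leaking logic that belongs there into other layers. Suppose a CMS defines a hot article as one viewed more than 1,000 times and commented on more than 100 times per day. For a CMS, "hot article" is clearly an important domain concept, so how do we implement it? You might write something like "SELECT … FROM … WHERE views > 1000 AND comments > 100". That is indeed the simplest implementation, but the important domain rule "more than 1,000 views and more than 100 comments per day" is now hidden in a SQL statement. SQL clearly does not belong to the domain layer, so our domain logic has leaked.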
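One possible way to keep the rule in the domain layer, shown here only as a sketch, is to give the article entity a method that states the rule by name. The class and field names are hypothetical, and the article does not prescribe this particular fix.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    daily_views: int
    daily_comments: int

    # The thresholds live next to the concept they define.
    HOT_VIEWS = 1000
    HOT_COMMENTS = 100

    def is_hot(self):
        """Domain rule: more than 1000 views and more than 100 comments per day."""
        return (
            self.daily_views > self.HOT_VIEWS
            and self.daily_comments > self.HOT_COMMENTS
        )


articles = [
    Article("quiet post", 120, 3),
    Article("popular post", 5400, 230),
    Article("viewed but silent", 9000, 12),
]
hot_titles = [a.title for a in articles if a.is_hot()]
```

The persistence layer may still translate the rule into a WHERE clause for speed, but the definition of "hot" then has one named home that the SQL is derived from.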
所谓持久层，就是指把我们的领域模型保存到数据库中。因为我们的程序代码是面向对象风格的，而数据库一般是关系型的数据库，所以我们需要把领域模型碾平，才能保存到数据库中，但是在PHP里，直到目前还没有非常好的ORM出现，所以这方面的解决方案不是特别多，参考Martin的企业应用架构模式一书，大致可以使用的方法有行数据入口（Row Data Gateway）或者表数据入口（Table Data Gateway），或者把领域层和持久层合二为一变成活动记录（Active Record）的方式。
 最便宜的高负载网站架构
关键字: 企业应用
1， LVS做前端四层均衡负载
基于IP虚拟分发的规则,不同于apache,squid这些7层基于http协议的反向代理软件, LVS在性能上往往能得到更好的保证！

2，squid 做前端反向代理加缓存
squid 是业内公认的优秀代理服务器，其缓存能力更让许多高负载网站青睐！（比如新浪，网易等）
使用他, 配合ESI做WEB动态内容及图片缓存，最合适不过了

3，apache 用来处理php或静态html，图片
apache是业内主流http服务器，稳定性与性能都能得到良好保证!

4，JBOSS 用来处理含复杂的业务逻辑的请求
JBOSS是red hat旗下的优秀中间件产品，在java开源领域小有名气，并且完全支持j2ee规范的，功能非常强大
使用他，既能保证业务流程的规范性，又可以节省开支（免费的）

5，mysql数据库
使用mysql数据库，达到百万级别的数据存储，及快速响应，应该是没问题的

6，memcache作为分布式缓存
缓存应用数据，或通过squid解析esi后，作为数据载体

LVS

squid + jboss squid + jboss squid + apache ….

mysql + memcache
最后更新：2007-02-04 20:36

永久链接
http://galaxystar.javaeye.com/blog/52178


whisper 2007-02-04 20:46
apache的静态负载能力似乎是靠吃内存换来的
与其jboss，还不如perl来得方便
clark 2007-02-05 00:10
可以用 lighttpd 替换 apache
如果只用 servlet 容器，可以用 resin 替换 jboss
后端配 mysql 群集
galaxystar 2007-02-05 09:20
为了系统能做到线性可扩展及业务需求的稳定性！
一般考虑用比较成熟的技术！
jboss本身支持异步消息，分布事务，AOP,最近5.0的POJOs可拔插组件模式比JMX更容易维护！放弃resin，用jboss也是有道理的！
而lighthttp处于起步阶段，处理HTTP静态请求或许是好一点，但是扩展性，功能都不是很理想，没有多年社区支持的apache那么强大，N多的module撑着，用前者太不划算了吧！
magice 2007-02-05 14:27
jboss的EJB模块基本用不到！
galaxystar 2007-02-05 20:10
是的，业务接口，完全可以用spring来代替！
通信也可以抛弃RMI，用轻量级的hessian!特别是组播，JBOSS的JGroup是TCP群发软件中，比较优秀的！
clark 2007-02-16 11:24
resin 的 servlet 性能比 jboss 的 tomcat 5 要好些。
lighttpd 比 apache 的性能好许多，现在的功能基本满足使用了。
没有特殊需要，可以不用 apache.
 负载均衡技术全攻略
Internet的规模每一百天就会增长一倍，客户希望获得7天24小时的不间断可用性及较快的系统反应时间，而不愿屡次看到某个站点“Server Too Busy”及频繁的系统故障。
　　网络的各个核心部分随着业务量的提高、访问量和数据流量的快速增长，其处理能力和计算强度也相应增大，使得单一设备根本无法承担。在此情况下，如果扔掉现有设备去做大量的硬件升级，这样将造成现有资源的浪费，而且如果再面临下一次业务量的提升，这又将导致再一次硬件升级的高额成本投入，甚至性能再卓越的设备也不能满足当前业务量的需求。于是，负载均衡机制应运而生。
　　负载均衡（Load Balance）建立在现有网络结构之上，它提供了一种廉价有效透明的方法扩展网络设备和服务器的带宽、增加吞吐量、加强网络数据处理能力、提高网络的灵活性和可用性。
　　负载均衡有两方面的含义：首先，大量的并发访问或数据流量分担到多台节点设备上分别处理，减少用户等待响应的时间；其次，单个重负载的运算分担到多台节点设备上做并行处理，每个节点设备处理结束后，将结果汇总，返回给用户，系统处理能力得到大幅度提高。
　　本文所要介绍的负载均衡技术主要是指在均衡服务器群中所有服务器和应用程序之间流量负载的应用，目前负载均衡技术大多数是用于提高诸如在Web服务器、FTP服务器和其它关键任务服务器上的Internet服务器程序的可用性和可伸缩性。
负载均衡技术分类
　　目前有许多不同的负载均衡技术用以满足不同的应用需求，下面从负载均衡所采用的设备对象、应用的网络层次（指OSI参考模型）及应用的地理结构等来分类。
软/硬件负载均衡
软件负载均衡解决方案是指在一台或多台服务器相应的操作系统上安装一个或多个附加软件来实现负载均衡，如DNS Load Balance，CheckPoint Firewall-1 ConnectControl等，它的优点是基于特定环境，配置简单，使用灵活，成本低廉，可以满足一般的负载均衡需求。
　　软件解决方案缺点也较多，因为每台服务器上安装额外的软件运行会消耗系统不定量的资源，越是功能强大的模块，消耗得越多，所以当连接请求特别大的时候，软件本身会成为服务器工作成败的一个关键；软件可扩展性并不是很好，受到操作系统的限制；由于操作系统本身的Bug，往往会引起安全问题。
　　硬件负载均衡解决方案是直接在服务器和外部网络间安装负载均衡设备，这种设备我们通常称之为负载均衡器，由于专门的设备完成专门的任务，独立于操作系统，整体性能得到大量提高，加上多样化的负载均衡策略，智能化的流量管理，可达到最佳的负载均衡需求。
　　负载均衡器有多种多样的形式，除了作为独立意义上的负载均衡器外，有些负载均衡器集成在交换设备中，置于服务器与Internet链接之间，有些则以两块网络适配器将这一功能集成到PC中，一块连接到Internet上，一块连接到后端服务器群的内部网络上。
　　一般而言，硬件负载均衡在功能、性能上优于软件方式，不过成本昂贵。
本地/全局负载均衡
负载均衡从其应用的地理结构上分为本地负载均衡(Local Load Balance)和全局负载均衡(Global Load Balance，也叫地域负载均衡)，本地负载均衡是指对本地的服务器群做负载均衡，全局负载均衡是指对分别放置在不同的地理位置、有不同网络结构的服务器群间作负载均衡。
　　本地负载均衡能有效地解决数据流量过大、网络负荷过重的问题，并且不需花费昂贵开支购置性能卓越的服务器，充分利用现有设备，避免服务器单点故障造成数据流量的损失。其有灵活多样的均衡策略把数据流量合理地分配给服务器群内的服务器共同负担。即使是再给现有服务器扩充升级，也只是简单地增加一个新的服务器到服务群中，而不需改变现有网络结构、停止现有的服务。
　　全局负载均衡主要用于在一个多区域拥有自己服务器的站点，为了使全球用户只以一个IP地址或域名就能访问到离自己最近的服务器，从而获得最快的访问速度，也可用于子公司分散站点分布广的大公司通过Intranet（企业内部互联网）来达到资源统一合理分配的目的。
　　全局负载均衡有以下的特点：
实现地理位置无关性，能够远距离为用户提供完全的透明服务。
除了能避免服务器、数据中心等的单点失效，也能避免由于ISP专线故障引起的单点失效。
解决网络拥塞问题，提高服务器响应速度，服务就近提供，达到更好的访问质量。
网络层次上的负载均衡
针对网络上负载过重的不同瓶颈所在，从网络的不同层次入手，我们可以采用相应的负载均衡技术来解决现有问题。
　　随着带宽增加，数据流量不断增大，网络核心部分的数据接口将面临瓶颈问题，原有的单一线路将很难满足需求，而且线路的升级又过于昂贵甚至难以实现，这时就可以考虑采用链路聚合（Trunking）技术。
　　链路聚合技术（第二层负载均衡）将多条物理链路当作一条单一的聚合逻辑链路使用，网络数据流量由聚合逻辑链路中所有物理链路共同承担，由此在逻辑上增大了链路的容量，使其能满足带宽增加的需求。
　　现代负载均衡技术通常操作于网络的第四层或第七层。第四层负载均衡将一个Internet上合法注册的IP地址映射为多个内部服务器的IP地址，对每次 TCP连接请求动态使用其中一个内部IP地址，达到负载均衡的目的。在第四层交换机中，此种均衡技术得到广泛的应用，一个目标地址是服务器群VIP（虚拟 IP，Virtual IP address）连接请求的数据包流经交换机，交换机根据源端和目的IP地址、TCP或UDP端口号和一定的负载均衡策略，在服务器IP和VIP间进行映射，选取服务器群中最好的服务器来处理连接请求。
　　第七层负载均衡控制应用层服务的内容，提供了一种对访问流量的高层控制方式，适合对HTTP服务器群的应用。第七层负载均衡技术通过检查流经的HTTP报头，根据报头内的信息来执行负载均衡任务。
　　第七层负载均衡优点表现在如下几个方面：
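By inspecting HTTP headers, it can detect HTTP 400- and 500-series error responses and so transparently redirect the connection request to another server, avoiding application-layer failures.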
可根据流经的数据类型（如判断数据包是图像文件、压缩文件或多媒体文件格式等），把数据流量引向相应内容的服务器来处理，增加系统性能。
能根据连接请求的类型，如是普通文本、图象等静态文档请求，还是asp、cgi等的动态文档请求，把相应的请求引向相应的服务器来处理，提高系统的性能及安全性。
第七层负载均衡受到其所支持的协议限制（一般只有HTTP），这样就限制了它应用的广泛性，并且检查HTTP报头会占用大量的系统资源，势必会影响到系统的性能，在大量连接请求的情况下，负载均衡设备自身容易成为网络整体性能的瓶颈。
负载均衡策略
　　在实际应用中，我们可能不想仅仅是把客户端的服务请求平均地分配给内部服务器，而不管服务器是否宕机。而是想使Pentium III服务器比Pentium II能接受更多的服务请求，一台处理服务请求较少的服务器能分配到更多的服务请求，出现故障的服务器将不再接受服务请求直至故障恢复等等。
　　选择合适的负载均衡策略，使多个设备能很好的共同完成任务，消除或避免现有网络负载分布不均、数据流量拥挤反应时间长的瓶颈。在各负载均衡方式中，针对不同的应用需求，在OSI参考模型的第二、三、四、七层的负载均衡都有相应的负载均衡策略。
　　负载均衡策略的优劣及其实现的难易程度有两个关键因素：一、负载均衡算法，二、对网络系统状况的检测方式和能力。
　　考虑到服务请求的不同类型、服务器的不同处理能力以及随机选择造成的负载分配不均匀等问题，为了更加合理的把负载分配给内部的多个服务器，就需要应用相应的能够正确反映各个服务器处理能力及网络状态的负载均衡算法：
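Round Robin: each request from the network is assigned in turn to the internal servers, from 1 to N and then starting over. This algorithm suits a server group in which all servers have the same hardware and software configuration and the average request load is fairly even.

Weighted Round Robin: each server is given a weight according to its processing capacity and receives a share of requests in proportion to that weight. For example, if server A has weight 1, B has weight 3, and C has weight 6, then A, B, and C receive 10%, 30%, and 60% of the requests respectively. This algorithm ensures that high-performance servers are used more and keeps low-performance servers from being overloaded.

Random: requests from the network are assigned at random to the internal servers.

Weighted Random: similar to weighted round robin, except that the choice among servers is random.

Response Time: the load balancer sends a probe (for example a ping) to each internal server and assigns the client's request to the server that answers the probe fastest. This algorithm reflects the servers' current state fairly well, but the fastest response time here is only that between the load balancer and the server, not between the client and the server.

Least Connection: the time each client request spends on a server can vary widely. As running time grows, simple round robin or random balancing can leave the servers with very different numbers of active connections, so the load is not truly balanced. The least-connection algorithm keeps a record for every server of the number of connections it is currently handling, and assigns each new connection request to the server with the fewest. This matches real conditions more closely and balances load better. It suits long-running requests such as FTP.

Processing Capacity: requests are assigned to the server with the lightest processing load, computed from the server's CPU model, number of CPUs, memory size, current connection count, and so on. Because it accounts for both the servers' capacity and current network conditions, this algorithm is comparatively more precise, and it is especially suited to layer-7 (application-layer) load balancing.

DNS Response (Flash DNS): on the Internet, whether for HTTP, FTP, or other services, clients generally find the server's IP address through name resolution. Under this algorithm, load balancers in different geographic locations all receive the same client's name resolution request and, at the same time, each resolves the name to the IP address of its own corresponding server (the server in the same location as that load balancer) and returns it to the client. The client continues with the first resolved IP address it receives and ignores the other responses. This strategy suits global load balancing and is meaningless for local load balancing.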
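Two of the algorithms above, weighted round robin and least connection, are short enough to sketch. The code below is a simplified illustration, not how any particular load balancer implements them: the weighted scheduler simply repeats each server name by its weight, and all names are hypothetical.

```python
import itertools


def weighted_round_robin(weights):
    """Endless iterator in which each server appears `weight` times per cycle."""
    cycle = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(cycle)


class LeastConnections:
    """Tracks active connections and picks the least loaded server."""

    def __init__(self, servers):
        self.active = {name: 0 for name in servers}

    def acquire(self):
        # Ties go to the server listed first.
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name):
        self.active[name] -= 1


rr = weighted_round_robin({"A": 1, "B": 3, "C": 6})
picks = [next(rr) for _ in range(100)]
share = {name: picks.count(name) for name in "ABC"}
```

With weights 1, 3 and 6 the shares over 100 requests match the 10%, 30% and 60% in the text. This naive version sends each server its requests in a burst; schedulers used in practice usually interleave the picks more smoothly.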
尽管有多种的负载均衡算法可以较好的把数据流量分配给服务器去负载，但如果负载均衡策略没有对网络系统状况的检测方式和能力，一旦在某台服务器或某段负载均衡设备与服务器网络间出现故障的情况下，负载均衡设备依然把一部分数据流量引向那台服务器，这势必造成大量的服务请求被丢失，达不到不间断可用性的要求。所以良好的负载均衡策略应有对网络故障、服务器系统故障、应用服务故障的检测方式和能力：
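Ping check: checks the server and network by ping. It is simple and fast, but it can only tell roughly whether the network and the server's operating system are working; it cannot check the application services on the server.

TCP Open check: every service listens on some TCP port, so checking whether a given TCP port on the server (such as 23 for Telnet or 80 for HTTP) is open indicates whether the service is working.

HTTP URL check: for example, send the HTTP server a request for the file main.html; if an error is returned, the server is considered to have failed.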
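The TCP Open check amounts to attempting a connection with a timeout. A minimal sketch using only the standard library follows; the function name and the one-second timeout are assumptions for the example.

```python
import socket


def tcp_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the service as down.
        return False
```

A balancer would run such a probe periodically for each back-end server and stop routing to any server whose check fails. A successful connect only shows that something is listening, which is why the HTTP URL check goes one step further.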
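Beyond the two factors above, in some applications all requests from the same client need to go to the same server. For example, when a server stores the client's registration or shopping request data in a local database, it becomes essential that the client's subsequent requests are handled by the same server. There are two ways to solve this. The first is to assign requests from the same client to the same server by IP address, with the mapping between client IP address and server kept on the load balancer. The second is to put a unique identifier in the client browser's cookie and use it to assign the requests to the same server, which suits clients that reach the Internet through a proxy server.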
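The first approach, a table on the balancer that maps client IP addresses to servers, can be sketched like this. The class name is hypothetical, new clients are placed by simple round robin, and a real device would also expire old entries.

```python
class StickyBalancer:
    """Sends every request from one client IP to the same back-end server."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.table = {}   # client IP -> server; this mapping lives on the balancer
        self._next = 0

    def route(self, client_ip):
        server = self.table.get(client_ip)
        if server is None:
            # New client: pick the next server in turn and remember the choice.
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            self.table[client_ip] = server
        return server


balancer = StickyBalancer(["web1", "web2"])
first = balancer.route("10.0.0.7")
second = balancer.route("10.0.0.8")
again = balancer.route("10.0.0.7")
```

Many users behind one proxy share a single IP address and would all land on the same server, which is why the cookie-based variant suits proxied clients better.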
　　还有一种路径外返回模式（Out of Path Return），当客户端连接请求发送给负载均衡设备的时候，中心负载均衡设备将请求引向某个服务器，服务器的回应请求不再返回给中心负载均衡设备，即绕过流量分配器，直接返回给客户端，因此中心负载均衡设备只负责接受并转发请求，其网络负担就减少了很多，并且给客户端提供了更快的响应时间。此种模式一般用于HTTP服务器群，在各服务器上要安装一块虚拟网络适配器，并将其IP地址设为服务器群的VIP，这样才能在服务器直接回应客户端请求时顺利的达成三次握手。

负载均衡实施要素

负载均衡方案应是在网站建设初期就应考虑的问题，不过有时随着访问流量的爆炸性增长，超出决策者的意料，这也就成为不得不面对的问题。当我们在引入某种负载均衡方案乃至具体实施时，像其他的许多方案一样，首先是确定当前及将来的应用需求，然后在代价与收效之间做出权衡。

针对当前及将来的应用需求，分析网络瓶颈的不同所在，我们就需要确立是采用哪一类的负载均衡技术，采用什么样的均衡策略，在可用性、兼容性、安全性等等方面要满足多大的需求，如此等等。

不管负载均衡方案是采用花费较少的软件方式，还是购买代价高昂在性能功能上更强的第四层交换机、负载均衡器等硬件方式来实现，亦或其他种类不同的均衡技术，下面这几项都是我们在引入均衡方案时可能要考虑的问题：

性能：性能是我们在引入均衡方案时需要重点考虑的问题，但也是一个最难把握的问题。衡量性能时可将每秒钟通过网络的数据包数目做为一个参数，另一个参数是均衡方案中服务器群所能处理的最大并发连接数目，但是，假设一个均衡系统能处理百万计的并发连接数，可是却只能以每秒2个包的速率转发，这显然是没有任何作用的。性能的优劣与负载均衡设备的处理能力、采用的均衡策略息息相关，并且有两点需要注意：一、均衡方案对服务器群整体的性能，这是响应客户端连接请求速度的关键；二、负载均衡设备自身的性能，避免有大量连接请求时自身性能不足而成为服务瓶颈。有时我们也可以考虑采用混合型负载均衡策略来提升服务器群的总体性能，如DNS负载均衡与NAT负载均衡相结合。另外，针对有大量静态文档请求的站点，也可以考虑采用高速缓存技术，相对来说更节省费用，更能提高响应性能；对有大量ssl/xml内容传输的站点，更应考虑采用ssl/xml加速技术。

可扩展性：IT技术日新月异，一年以前最新的产品，现在或许已是网络中性能最低的产品；业务量的急速上升，一年前的网络，现在需要新一轮的扩展。合适的均衡解决方案应能满足这些需求，能均衡不同操作系统和硬件平台之间的负载，能均衡HTTP、邮件、新闻、代理、数据库、防火墙和 Cache等不同服务器的负载，并且能以对客户端完全透明的方式动态增加或删除某些资源。

灵活性：均衡解决方案应能灵活地提供不同的应用需求，满足应用需求的不断变化。在不同的服务器群有不同的应用需求时，应有多样的均衡策略提供更广泛的选择。

可靠性：在对服务质量要求较高的站点，负载均衡解决方案应能为服务器群提供完全的容错性和高可用性。但在负载均衡设备自身出现故障时，应该有良好的冗余解决方案，提高可靠性。使用冗余时，处于同一个冗余单元的多个负载均衡设备必须具有有效的方式以便互相进行监控，保护系统尽可能地避免遭受到重大故障的损失。

易管理性：不管是通过软件还是硬件方式的均衡解决方案，我们都希望它有灵活、直观和安全的管理方式，这样便于安装、配置、维护和监控，提高工作效率，避免差错。在硬件负载均衡设备上，目前主要有三种管理方式可供选择：一、命令行接口（CLI：Command Line Interface），可通过超级终端连接负载均衡设备串行接口来管理，也能telnet远程登录管理，在初始化配置时，往往要用到前者；二、图形用户接口（GUI：Graphical User Interfaces），有基于普通web页的管理，也有通过Java Applet 进行安全管理，一般都需要管理端安装有某个版本的浏览器；三、SNMP（Simple Network Management Protocol，简单网络管理协议）支持，通过第三方网络管理软件对符合SNMP标准的设备进行管理。
负载均衡配置实例

DNS负载均衡
DNS负载均衡技术是在DNS服务器中为同一个主机名配置多个IP地址，在应答DNS查询时，DNS服务器对每个查询将以DNS文件中主机记录的IP地址按顺序返回不同的解析结果，将客户端的访问引导到不同的机器上去，使得不同的客户端访问不同的服务器，从而达到负载均衡的目的。

DNS负载均衡的优点是经济简单易行，并且服务器可以位于internet上任意的位置。但它也存在不少缺点：

为了使本DNS服务器和其他DNS服务器及时交互，保证DNS数据及时更新，使地址能随机分配，一般都要将DNS的刷新时间设置的较小，但太小将会使DNS流量大增造成额外的网络问题。

一旦某个服务器出现故障，即使及时修改了DNS设置，还是要等待足够的时间（刷新时间）才能发挥作用，在此期间，保存了故障服务器地址的客户计算机将不能正常访问服务器。

DNS负载均衡采用的是简单的轮循负载算法，不能区分服务器的差异，不能反映服务器的当前运行状态，不能做到为性能较好的服务器多分配请求，甚至会出现客户请求集中在某一台服务器上的情况。

要给每台服务器分配一个internet上的IP地址，这势必会占用过多的IP地址。
判断一个站点是否采用了DNS负载均衡的最简单方式就是连续的ping这个域名，如果多次解析返回的IP地址不相同的话，那么这个站点就很可能采用的就是较为普遍的DNS负载均衡。但也不一定，因为如果采用的是DNS响应均衡，多次解析返回的IP地址也可能会不相同。不妨试试Ping一下 www.yesky.com， www.sohu.com， www.yahoo.com

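Suppose three servers answer requests for www.test.com. On a Unix system running a BIND 8.x DNS server this is fairly simple: add records like the following to the zone's data file. Note that several CNAME records for one name violate the DNS standard; BIND 8 accepts them only with its multiple-cnames option, and BIND 9 rejects them, so on current servers use several A records for the same name, as in the Windows example further down.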

www1 IN A 192.1.1.1
www2 IN A 192.1.1.2
www3 IN A 192.1.1.3
www IN CNAME www1
www IN CNAME www2
www IN CNAME www3

在NT下的实现也很简单，下面详细介绍在win2000 server下实现DNS负载均衡的过程，NT4.0类似：

打开“管理工具”下的“DNS”，进入DNS服务配置控制台。

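Open the “Properties” of the DNS server in question and, on the “Advanced” tab under “Server options”, select the “Enable round robin” check box. This is equivalent to adding a DWORD (double-word) value named RoundRobin with data 1 under the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters.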

打开正向搜索区域的相应区域（如test.com），新建主机添加主机 (A) 资源记录，记录如下：

www IN A 192.1.1.1
www IN A 192.1.1.2
www IN A 192.1.1.3

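The difference visible here is that under NT one host name maps to several IP address records, whereas under Unix several different host names are added first, each with its own IP address, and those hosts are then given one shared alias (CNAME).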

在此需要注意的是，NT下本地子网优先级会取代多宿主名称的循环复用，所以在测试时，如果做测试用的客户机IP地址与主机资源记录的IP在同一有类掩码范围内，就需要清除在“高级”选项卡“服务器选项”中的“启用netmask排序”。
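To make the round-robin behaviour concrete, here is a small simulation of how a server might rotate the A records of one name between queries. It is a sketch, not how BIND or the Windows DNS service is implemented; the class name `RoundRobinRecordSet` is hypothetical, and the addresses are the ones from the example above.

```python
from collections import deque


class RoundRobinRecordSet:
    """Rotate the A records of one name so each query sees a different first address."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def answer(self):
        response = list(self._records)
        self._records.rotate(-1)   # the next query starts with the next address
        return response


www = RoundRobinRecordSet(["192.1.1.1", "192.1.1.2", "192.1.1.3"])
# Most clients simply use the first address in the answer.
first_choices = [www.answer()[0] for _ in range(6)]
print(first_choices)
# → ['192.1.1.1', '192.1.1.2', '192.1.1.3', '192.1.1.1', '192.1.1.2', '192.1.1.3']
```

The simulation also shows the weakness listed earlier: the rotation knows nothing about server load or health, and caching resolvers between the client and the server can flatten the rotation further.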
NAT负载均衡
NAT（Network Address Translation 网络地址转换）简单地说就是将一个IP地址转换为另一个IP地址，一般用于未经注册的内部地址与合法的、已获注册的Internet IP地址间进行转换。适用于解决Internet IP地址紧张、不想让网络外部知道内部网络结构等的场合下。每次NAT转换势必会增加NAT设备的开销，但这种额外的开销对于大多数网络来说都是微不足道的，除非在高带宽有大量NAT请求的网络上。

NAT负载均衡将一个外部IP地址映射为多个内部IP地址，对每次连接请求动态地转换为一个内部服务器的地址，将外部连接请求引到转换得到地址的那个服务器上，从而达到负载均衡的目的。

NAT负载均衡是一种比较完善的负载均衡技术，起着NAT负载均衡功能的设备一般处于内部服务器到外部网间的网关位置，如路由器、防火墙、四层交换机、专用负载均衡器等，均衡算法也较灵活，如随机选择、最少连接数及响应时间等来分配负载。

NAT负载均衡可以通过软硬件方式来实现。通过软件方式来实现NAT负载均衡的设备往往受到带宽及系统本身处理能力的限制，由于NAT比较接近网络的低层，因此就可以将它集成在硬件设备中，通常这样的硬件设备是第四层交换机和专用负载均衡器，第四层交换机的一项重要功能就是NAT负载均衡。

下面以实例介绍一下Cisco路由器NAT负载均衡的配置：

现有一台有一个串行接口和一个Ethernet接口的路由器，Ethernet口连接到内部网络，内部网络上有三台web服务器，但都只是低端配置，为了处理好来自Internet上大量的web连接请求，因此需要在此路由器上做NAT负载均衡配置，把发送到web服务器合法Internet IP地址的报文转换成这三台服务器的内部本地地址。其具体配置过程如下：

做好路由器的基本配置，并定义各个接口在做NAT时是内部还是外部接口。

然后定义一个标准访问列表（standard access list），用来标识要转换的合法IP地址。

再定义NAT地址池来标识内部web服务器的本地地址，注意要用到关键字rotary，表明我们要使用轮循（Round Robin）的方式从NAT地址池中取出相应IP地址来转换合法IP报文。

最后，把目标地址为访问表中IP的报文转换成地址池中定义的IP地址。
相应配置文件如下：

interface Ethernet0/0
 ip address 192.168.1.4 255.255.255.248
 ip nat inside
!
interface Serial0/0
 ip address 200.200.1.1 255.255.255.248
 ip nat outside
!
! standard access list 1 identifies the public address to be translated
access-list 1 permit 200.200.1.2
!
! rotary pool: the inside servers are handed out round-robin
ip nat pool websrv 192.168.1.1 192.168.1.3 netmask 255.255.255.248 type rotary
ip nat inside destination list 1 pool websrv

反向代理负载均衡
普通代理方式是代理内部网络用户访问internet上服务器的连接请求，客户端必须指定代理服务器,并将本来要直接发送到internet上服务器的连接请求发送给代理服务器处理。

反向代理（Reverse Proxy）方式是指以代理服务器来接受internet上的连接请求，然后将请求转发给内部网络上的服务器，并将从服务器上得到的结果返回给internet上请求连接的客户端，此时代理服务器对外就表现为一个服务器。

反向代理负载均衡技术是把将来自internet上的连接请求以反向代理的方式动态地转发给内部网络上的多台服务器进行处理，从而达到负载均衡的目的。

反向代理负载均衡能以软件方式来实现，如apache mod_proxy、netscape proxy等，也可以在高速缓存器、负载均衡器等硬件设备上实现。反向代理负载均衡可以将优化的负载均衡策略和代理服务器的高速缓存技术结合在一起，提升静态网页的访问速度，提供有益的性能；由于网络外部用户不能直接访问真实的服务器，具备额外的安全性（同理，NAT负载均衡技术也有此优点）。

其缺点主要表现在以下两个方面：

反向代理是处于OSI参考模型第七层应用的，所以就必须为每一种应用服务专门开发一个反向代理服务器，这样就限制了反向代理负载均衡技术的应用范围，现在一般都用于对web服务器的负载均衡。

针对每一次代理，代理服务器就必须打开两个连接，一个对外，一个对内，因此在并发连接请求数量非常大的时候，代理服务器的负载也就非常大了，在最后代理服务器本身会成为服务的瓶颈。
一般来讲，可以用它来对连接数量不是特别大，但每次连接都需要消耗大量处理资源的站点进行负载均衡，如search。

下面以在apache mod_proxy下做的反向代理负载均衡为配置实例：在站点 www.test.com，我们按提供的内容进行分类，不同的服务器用于提供不同的内容服务，将对 http://www.test.com/news的访问转到IP地址为192.168.1.1的内部服务器上处理，对http: // www.test.com/it的访问转到服务器192.168.1.2上，对 http://www.test.com/life的访问转到服务器 192.168.1.3上，对 http://www.test.com/love的访问转到合作站点 http://www.love.com上，从而减轻本apache服务器的负担，达到负载均衡的目的。

首先要确定域名 www.test.com在DNS上的记录对应apache服务器接口上具有internet合法注册的IP地址，这样才能使internet上对 www.test.com的所有连接请求发送给本台apache服务器。

在本台服务器的apache配置文件httpd.conf中添加如下设置：

# forward each URL prefix to its back-end server
ProxyPass /news http://192.168.1.1
ProxyPass /it http://192.168.1.2
ProxyPass /life http://192.168.1.3
ProxyPass /love http://www.love.com

注意，此项设置最好添加在httpd.conf文件“Section 2”以后的位置，服务器192.168.1.1-3也应是具有相应功能的www服务器，在重启服务时，最好用apachectl configtest命令检查一下配置是否有误.
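The routing decision these directives express can be sketched as a prefix table. This is a simplified model under assumptions, not a description of how mod_proxy matches paths internally: the `ROUTES` table and `pick_backend` function are hypothetical names, and matching on whole path segments is a choice made for this sketch.

```python
# Prefix -> backend, taken from the example configuration above.
ROUTES = [
    ("/news", "http://192.168.1.1"),
    ("/it", "http://192.168.1.2"),
    ("/life", "http://192.168.1.3"),
    ("/love", "http://www.love.com"),
]


def pick_backend(path):
    """Return the backend URL for a request path, or None to serve it locally."""
    for prefix, backend in ROUTES:
        # match the prefix itself or anything below it as a path segment
        if path == prefix or path.startswith(prefix + "/"):
            return backend + path[len(prefix):]
    return None


print(pick_backend("/news/today.html"))   # → http://192.168.1.1/today.html
print(pick_backend("/it"))                # → http://192.168.1.2
print(pick_backend("/index.html"))        # → None
```

Because the split is by content category rather than by load, one busy category can still overload its single back-end server; the proxy only spreads traffic as evenly as the categories happen to be used.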

混合型负载均衡
在有些大型网络，由于多个服务器群内硬件设备、各自的规模、提供的服务等的差异，我们可以考虑给每个服务器群采用最合适的负载均衡方式，然后又在这多个服务器群间再一次负载均衡或群集起来以一个整体向外界提供服务（即把这多个服务器群当做一个新的服务器群），从而达到最佳的性能。我们将这种方式称之为混合型负载均衡。此种方式有时也用于单台均衡设备的性能不能满足大量连接请求的情况下。

下图展示了一个应用示例，三个服务器群针对各自的特点，分别采用了不同的负载均衡方式。当客户端发出域名解析请求时，DNS服务器依次把它解析成三个服务器群的VIP，如此把客户端的连接请求分别引向三个服务器群，从而达到了再一次负载均衡的目的。

在图中大家可能注意到，负载均衡设备在网络拓朴上，可以处于外部网和内部网络间网关的位置，也可以和内部服务器群处于并行的位置，甚至可以处于内部网络或internet上的任意位置，特别是在采用群集负载均衡时，根本就没有单独的负载均衡设备。

服务器群内各服务器只有提供相同内容的服务才有负载均衡的意义，特别是在DNS负载均衡时。要不然，这样会造成大量连接请求的丢失或由于多次返回内容的不同给客户造成混乱。

所以，如图的这个示例在实际中可能没有多大的意义，因为如此大的服务内容相同但各服务器群存在大量差异的网站并不多见。但做为一个示例，相信还是很有参考意义的.

Source: http://galaxystar.javaeye.com/blog/50542

Comments (3)

hermitte 2007-02-04 15:09
好文章，收藏下。
原来搞负载平衡，都是用DNS，对用NAT的很感兴趣，仔细读读
hermitte 2007-02-04 15:14
看完了。。收回评论。。
没什么新的东西。
对于CGI程序本身的跨域的状态保持，没有好的办法吗？
galaxystar 2007-02-04 20:05
跨域，看你的使用情况！
简单的网站，同一域名跨二级域名好办！cookie里就能支持!
跨一级域名，只能用SSO了！比较流行的是yale大学的CAS,不过性能巨差，我公司里有一套自己搞的！

 海量数据处理分析

北京迈思奇科技有限公司戴子良

笔者在实际工作中，有幸接触到海量的数据处理问题，对其进行处理是一项艰巨而复杂的任务。原因有以下几个方面：
一、数据量过大，数据中什么情况都可能存在。如果说有10条数据，那么大不了每条去逐一检查，人为处理，如果有上百条数据，也可以考虑，如果数据上到千万级别，甚至过亿，那不是手工能解决的了，必须通过工具或者程序进行处理，尤其海量的数据中，什么情况都可能存在，例如，数据中某处格式出了问题，尤其在程序处理时，前面还能正常处理，突然到了某个地方问题出现了，程序终止了。
二、软硬件要求高，系统资源占用率高。对海量的数据进行处理，除了好的方法，最重要的就是合理使用工具，合理分配系统资源。一般情况，如果处理的数据过TB级，小型机是要考虑的，普通的机子如果有好的方法可以考虑，不过也必须加大CPU和内存，就象面对着千军万马，光有勇气没有一兵一卒是很难取胜的。
三、要求很高的处理方法和技巧。这也是本文的写作目的所在，好的处理方法是一位工程师长期工作经验的积累，也是个人的经验的总结。没有通用的处理方法，但有通用的原理和规则。
那么处理海量数据有哪些经验和技巧呢，我把我所知道的罗列一下，以供大家参考：
一、选用优秀的数据库工具
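There are many database vendors today, and processing massive data places high demands on the database used. Oracle or DB2 is the usual choice, and Microsoft's recently released SQL Server 2005 also performs well. In the BI field, the database, data warehouse, multidimensional database, and data mining tools must also be chosen carefully; a good ETL tool and a good OLAP tool are both essential, for example Informatica and Essbase. In one real data analysis project the author processed 60 million log records per day: SQL Server 2000 needed 6 hours, while SQL Server 2005 needed only 3.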
二、编写优良的程序代码
处理数据离不开优秀的程序代码，尤其在进行复杂数据处理时，必须使用程序。好的程序代码对数据的处理至关重要，这不仅仅是数据处理准确度的问题，更是数据处理效率的问题。良好的程序代码应该包含好的算法，包含好的处理流程，包含好的效率，包含好的异常处理机制等。
三、对海量数据进行分区操作
对海量数据进行分区操作十分必要，例如针对按年份存取的数据，我们可以按年进行分区，不同的数据库有不同的分区方式，不过处理机制大体相同。例如SQL Server的数据库分区是将不同的数据存于不同的文件组下，而不同的文件组存于不同的磁盘分区下，这样将数据分散开，减小磁盘I/O，减小了系统负荷，而且还可以将日志，索引等放于不同的分区下。
四、建立广泛的索引
对海量的数据处理，对大表建立索引是必行的，建立索引要考虑到具体情况，例如针对大表的分组、排序等字段，都要建立相应索引，一般还可以建立复合索引，对经常插入的表则建立索引时要小心，笔者在处理数据时，曾经在一个ETL流程中，当插入表时，首先删除索引，然后插入完毕，建立索引，并实施聚合操作，聚合完成后，再次插入前还是删除索引，所以索引要用到好的时机，索引的填充因子和聚集、非聚集索引都要考虑。
五、建立缓存机制
当数据量增加时，一般的处理工具都要考虑到缓存问题。缓存大小设置的好差也关系到数据处理的成败，例如，笔者在处理2亿条数据聚合操作时，缓存设置为100000条/Buffer，这对于这个级别的数据量是可行的。
六、加大虚拟内存
如果系统资源有限，内存提示不足，则可以靠增加虚拟内存来解决。笔者在实际项目中曾经遇到针对18亿条的数据进行处理，内存为1GB，1个P4 2.4G的CPU，对这么大的数据量进行聚合操作是有问题的，提示内存不足，那么采用了加大虚拟内存的方法来解决，在6块磁盘分区上分别建立了6个4096M的磁盘分区，用于虚拟内存，这样虚拟的内存则增加为 4096*6 + 1024 = 25600 M，解决了数据处理中的内存不足问题。
七、分批处理
海量数据处理难因为数据量大，那么解决海量数据处理难的问题其中一个技巧是减少数据量。可以对海量数据分批处理，然后处理后的数据再进行合并操作，这样逐个击破，有利于小数据量的处理，不至于面对大数据量带来的问题，不过这种方法也要因时因势进行，如果不允许拆分数据，还需要另想办法。不过一般的数据按天、按月、按年等存储的，都可以采用先分后合的方法，对数据进行分开处理。
八、使用临时表和中间表
数据量增加时，处理中要考虑提前汇总。这样做的目的是化整为零，大表变小表，分块处理完成后，再利用一定的规则进行合并，处理过程中的临时表的使用和中间结果的保存都非常重要，如果对于超海量的数据，大表处理不了，只能拆分为多个小表。如果处理过程中需要多步汇总操作，可按汇总步骤一步步来，不要一条语句完成，一口气吃掉一个胖子。
九、优化查询SQL语句
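When querying massive data, the performance of the SQL statement has a very large effect on query efficiency. Writing efficient SQL scripts and stored procedures is the database professional's responsibility and a measure of their skill. When writing SQL it is essential to reduce joins, to use cursors rarely or not at all, and to design efficient table structures. The author once tried a cursor over 100 million rows; after 3 hours it had produced no result, and at that point the work has to be moved into a program.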
十、使用文本格式进行处理
对一般的数据处理可以使用数据库，如果对复杂的数据处理，必须借助程序，那么在程序操作数据库和程序操作文本之间选择，是一定要选择程序操作文本的，原因为：程序操作文本速度快；对文本进行处理不容易出错；文本的存储不受限制等。例如一般的海量的网络日志都是文本格式或者csv格式（文本格式），对它进行处理牵扯到数据清洗，是要利用程序进行处理的，而不建议导入数据库再做清洗。
十一、定制强大的清洗规则和出错处理机制
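Massive data contains inconsistencies, and a flaw somewhere is very likely. For example, the time field of otherwise similar records may hold non-standard times, caused by application errors, system errors, and so on. When processing such data, strong cleaning rules and an error-handling mechanism must therefore be defined.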
十二、建立视图或者物化视图
视图中的数据来源于基表，对海量数据的处理，可以将数据按一定的规则分散到各个基表中，查询或处理过程中可以基于视图进行，这样分散了磁盘I/O，正如10根绳子吊着一根柱子和一根吊着一根柱子的区别。
十三、避免使用32位机子（极端情况）
目前的计算机很多都是32位的，那么编写的程序对内存的需要便受限制，而很多的海量数据处理是必须大量消耗内存的，这便要求更好性能的机子，其中对位数的限制也十分重要。
十四、考虑操作系统问题
海量数据处理过程中，除了对数据库，处理程序等要求比较高以外，对操作系统的要求也放到了重要的位置，一般是必须使用服务器的，而且对系统的安全性和稳定性等要求也比较高。尤其对操作系统自身的缓存机制，临时空间的处理等问题都需要综合考虑。
十五、使用数据仓库和多维数据库存储
数据量加大是一定要考虑OLAP的，传统的报表可能5、6个小时出来结果，而基于Cube的查询可能只需要几分钟，因此处理海量数据的利器是OLAP多维分析，即建立数据仓库，建立多维数据集，基于多维数据集进行报表展现和数据挖掘等。
十六、使用采样数据，进行数据挖掘
基于海量数据的数据挖掘正在逐步兴起，面对着超海量的数据，一般的挖掘软件或算法往往采用数据抽样的方式进行处理，这样的误差不会很高，大大提高了处理效率和处理的成功率。一般采样时要注意数据的完整性和，防止过大的偏差。笔者曾经对1亿2千万行的表数据进行采样，抽取出400万行，经测试软件测试处理的误差为千分之五，客户可以接受。
还有一些方法，需要在不同的情况和场合下运用，例如使用代理键等操作，这样的好处是加快了聚合时间，因为对数值型的聚合比对字符型的聚合快得多。类似的情况需要针对不同的需求进行处理。
海量数据是发展趋势，对数据分析和挖掘也越来越重要，从海量数据中提取有用信息重要而紧迫，这便要求处理要准确，精度要高，而且处理时间要短，得到有价值信息要快，所以，对海量数据的研究很有前途，也很值得进行广泛深入的研究。
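Two of the techniques above, batch processing (item 7) and cleaning rules with error handling (item 11), combine naturally. The sketch below is only an illustration: the `day,amount` line format, the function names, and the tiny batch size are assumptions made for the example, not details from the author's projects.

```python
from collections import Counter
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def aggregate(lines, batch_size=3):
    """Sum amounts per day batch by batch, setting malformed lines aside."""
    totals, rejected = Counter(), []
    for chunk in batches(lines, batch_size):
        partial = Counter()
        for line in chunk:
            try:
                day, amount = line.split(",")
                partial[day] += int(amount)
            except ValueError:
                # wrong field count or non-numeric amount: keep for inspection
                rejected.append(line)
        totals.update(partial)   # merge this batch's partial result
    return dict(totals), rejected


log = ["2007-03-01,5", "2007-03-01,7", "bad line",
       "2007-03-02,1", "2007-03-02,x", "2007-03-01,2"]
totals, rejected = aggregate(log)
print(totals)     # → {'2007-03-01': 14, '2007-03-02': 1}
print(rejected)   # → ['bad line', '2007-03-02,x']
```

Because bad rows are collected instead of raising, one malformed record deep in the input does not abort the run, which is the failure mode described at the start of the article.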
 一个很有意义的SQL的优化过程（一个电子化支局中的大数据量的统计SQL）
Posted on 2007-04-28 10:47 by 七郎归来

select count(distinct v_yjhm)
from (select v_yjhm
from zjjk_t_yssj_o_his a
where n_yjzl > 0
and d_sjrq between to_date('20070301', 'yyyymmdd') and
to_date('20070401', 'yyyymmdd')
and v_yjzldm like '40%'
and not exists(select 'a' from INST_TRIG_ZJJK_T_YSSJ_O b where a.v_yjtm=b.yjbh)
--and v_yjtm not in (select yjbh from INST_TRIG_ZJJK_T_YSSJ_O)
union
select v_yjhm
from zjjk_t_yssj_u_his a
where n_yjzl > 0
and d_sjrq between to_date('20070301', 'yyyymmdd') and
to_date('20070401', 'yyyymmdd')
and v_yjzldm like '40%'
and not exists(select 'a' from INST_TRIG_ZJJK_T_YSSJ_U b where a.v_yjtm=b.yjbh))
--and v_yjtm not in (select yjbh from INST_TRIG_ZJJK_T_YSSJ_U))

说明：1、zjjk_t_yssj_o_his 、zjjk_t_yssj_u_his 的d_sjrq 上都有一个索引了
2、zjjk_t_yssj_o_his 、zjjk_t_yssj_u_his 的v_yjtm 都为 not null 字段
3、INST_TRIG_ZJJK_T_YSSJ_O、INST_TRIG_ZJJK_T_YSSJ_U 的 yjbh 为PK

优化建议：
1、什么是DISTINCT ? 就是分组排序后取唯一值，底层行为分组排序
2、什么是 UNION 、 UNION ALL ？ UNION ：对多个结果集取DISTINCT ，生成一个不含重复记录的结果集，返回给前端，UNION ALL ：不对结果集进行去重复操作底层行为：分组排序
3、什么是 COUNT(*) ？累加
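4. What index is needed? D_SJRQ + V_YJZLDM. Reason: suppose the table holds 10 million rows for the whole province, the queried month holds 2 million, and express mail makes up 50% of the 10 million. Filtering with BETWEEN on the date (d_sjrq) plus the mail type (v_yjzldm) then narrows the result to about 1 million rows, which is the best available access path.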
5、两表都要进行一次 NOT EXISTS 运算，如何做最优？ NOT EXISTS 是不好做的运算，但是我们可以合并两次的NOT EXISTS 运算。让这费资源的活只干一次。

综合以上，我们可以如下优化这个SQL：
1、内部的UNION 也是去重复，外部的DISTINCT 也是去重复，可左右去掉一个，建议内部的改为 UNION ALL ，这里稍请注意一下，如果V_YJHM 有NULL的情况，可能会引起COUNT值不对实际数的情况。
2、建一个 D_SJRQ+V_YJZLDM 的复合索引
3、将两个子查询先 UNION ALL 联结，另两个用来做 NOT EXISTS 的表也 UNION ALL 联结
4、在3的基础上再做 NOT EXISTS
5. Replace NOT EXISTS with NOT IN and add the HASH_AJ hint so that the exclusion runs as a hash anti-join
6、最后为外层的COUNT(DISTINCT … 获得结果数

SQL书写如下：
select count(distinct v_yjhm)
from (select v_yjtm, v_yjhm
from zjjk_t_yssj_o_his a
where n_yjzl > 0
and d_sjrq between to_date('20070301', 'yyyymmdd') and
to_date('20070401', 'yyyymmdd')
and v_yjzldm like '40%'
union all
select v_yjtm, v_yjhm
from zjjk_t_yssj_u_his a
where n_yjzl > 0
and d_sjrq between to_date('20070301', 'yyyymmdd') and
to_date('20070401', 'yyyymmdd')
and v_yjzldm like '40%'
) a
where a.v_yjtm not IN
(select /*+ HASH_AJ */
yjbh
from (select yjbh
from INST_TRIG_ZJJK_T_YSSJ_O
union all
select yjbh from INST_TRIG_ZJJK_T_YSSJ_U))

经过上述改造，原来这个SQL的执行时间如果为2分钟的话，现在应该20秒足够！
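The rewrite can be checked on toy data. The sketch below uses Python's built-in sqlite3 with shortened, hypothetical table names (`his_o`, `his_u`, `trig_o`, `trig_u`) and leaves out the date and type filters and the Oracle-specific HASH_AJ hint, so it shows only that the two query shapes return the same count, not any performance difference. One caveat that the source does not state: the rewritten form filters every row against both trigger tables, so it matches the original only if a v_yjtm from one history table never appears in the other trigger table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE his_o (v_yjtm TEXT NOT NULL, v_yjhm TEXT);
CREATE TABLE his_u (v_yjtm TEXT NOT NULL, v_yjhm TEXT);
CREATE TABLE trig_o (yjbh TEXT PRIMARY KEY);
CREATE TABLE trig_u (yjbh TEXT PRIMARY KEY);
INSERT INTO his_o VALUES ('o1','A'), ('o2','B'), ('o3','B');
INSERT INTO his_u VALUES ('u1','B'), ('u2','C'), ('u3','D');
INSERT INTO trig_o VALUES ('o1');
INSERT INTO trig_u VALUES ('u3');
""")

# Shape of the original: two NOT EXISTS filters, UNION, then COUNT(DISTINCT).
original = """
SELECT count(DISTINCT v_yjhm) FROM (
    SELECT v_yjhm FROM his_o a
     WHERE NOT EXISTS (SELECT 1 FROM trig_o b WHERE a.v_yjtm = b.yjbh)
    UNION
    SELECT v_yjhm FROM his_u a
     WHERE NOT EXISTS (SELECT 1 FROM trig_u b WHERE a.v_yjtm = b.yjbh)
)"""

# Shape of the rewrite: UNION ALL both sides, one NOT IN, one de-duplication.
rewritten = """
SELECT count(DISTINCT v_yjhm) FROM (
    SELECT v_yjtm, v_yjhm FROM his_o
    UNION ALL
    SELECT v_yjtm, v_yjhm FROM his_u
) a
WHERE a.v_yjtm NOT IN (
    SELECT yjbh FROM trig_o
    UNION ALL
    SELECT yjbh FROM trig_u
)"""

count_original = conn.execute(original).fetchone()[0]
count_rewritten = conn.execute(rewritten).fetchone()[0]
print(count_original, count_rewritten)   # → 2 2
```

NOT IN is safe here only because v_yjtm is NOT NULL and yjbh is a primary key, as the notes above say; with a NULL on the subquery side, NOT IN would return no rows at all.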

 如何优化大数据量模糊查询（架构，数据库设置，SQL..）
请各位大虾对如下需求提供点意见：
1。实时查询某当日或指定时间段的所有交易记录。
2。实时查询一批记录，查询条件不确定，条件几乎包含所有字段，可自由组合）
3。查询返回数据量可非常大，百万纪录级。

目前系统采用三层结构，中间层是cics,按目前使用的查询方式，系统资源占用大，速度慢，对实时交易会造成影响。
并且速度明显慢于原C/S结构，如C/S结构用2秒，现在可能要10秒。想征询一下是否有好的解决方案，能使三层结构的批量查询快于C/S结构的查询。
由于客户的环境是中间件和DB各一台服务器，所以无法作负载均衡。
由于客户在外地，他们提供的信息有限，我无法做出更多的判断。不过本周我将赴外地，作测试，步骤和biti的类似。
基本思路是首先确认2层和3层是否做完全相同的查询，然后比较执行时间，判断瓶颈，以决定对中间层还是对db和sql进行优化。
针对你的3个条件做一下回答：
1。可以考虑使用以时间做条件的partition
2。总有一两个条件选择性高，使用频率又高的，考虑加索引。
3。既然是3层结构，那就不应该把那么高的数据料一次返回给Client,可以考虑把处理过程放在中间层，或者使用分页技术，根据需要分段返回。
 求助:海量数据处理方法
大家好,我们现在有一个技术问题,不知道能不能帮忙解决?
1.我们网站的信息系统,每天新增100W条用户数据,不知道如果解决才能查询更新更快,更合理.
2.有一个条数据,同时有1W个用户查看(并发用户),我们的统计是每次+1,现在数据库更新时有问题了,排队更新,速度太慢.
注:我们用Asp.Net (C#) ,Sql Server2005平台
# re: 求助:海量数据处理方法回复更多评论
1,记住要安需所取,就是用户一次看多少就显示多少,也就是从数据库中取出这些数量的数据,善用你的索引
2,写存储过程,缓存….
3,静态化页面.
5,修改你的逻辑
建立数据中心，对数据进行按某个条件建分区索引

 海量数据库查询方略
老朋友Bob遇到难题：
“有这样一个系统，每个月系统自动生成一张数据表，表名按业务代码和年月来命名，每张表的数据一个月平均在８k万这样的数据量，但是查询的时候希望能够查到最近三个月的数据，也就是要从三个数据量非常庞大的表中来把查询的数据汇聚到一起，有什么比较好的办法有比较高的效率？”
这当然是个海量数据库，他还具体的举例子：
“我现在测试的结果是查询慢，数据量大的时候，在查询分析器做查询都比较慢，
比如用户输入一个主叫号码，他希望获取最近３个月的数据信息，
但是后台要根据该主叫号码到最近３个月的表中去查数据再汇聚到一起，返回给客户端。”
海量数据对服务器的CPU，IO吞吐都是严峻的考验，我的解决之道：
1.从设计初试就考虑拆分数据库，让数据库变“小”，比如，把将用户按地域划分，或者按VIP等级划分。Bob说按地域划分很困难，因为不知道用户的地域；VIP级别也不知道。
其实，主叫号码就带有信息，比如按区号，按手机号码段，甚至就按主叫号码的前两个数字来拆分。数据库小了自然就快了。
2.上面拆分的方法是把数据库变“小”，更强有力的手段是采用分布式计算，举例如下：
1).用三台服务器安装三套相同的数据库系统，数据完全一样；
2).用三个线程同时向三个服务器发起请求，每个服务器各查一个月，然后将数据汇总起来，这样速度提高了3倍。
3.索引优化，Bob自己也谈到，可以根据查询创建有效的复合索引，不过索引复杂了，插入数据会变慢，要仔细权衡。我觉得还有个办法：
主叫号码通常是字符串型的，建议改为长整型，这样索引后检索会十分快，因为整型的比较要远远快过字符串的比较。
希望我这个纸上谈兵对你有所帮助。
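The "one thread per month" idea from point 2 can be sketched with a thread pool. Everything concrete here is a stand-in: the `cdr_2007xx` table names, the in-memory dictionaries that replace the three monthly tables, and the `query_month` function that replaces a real SELECT. The threefold speedup mentioned above is an upper bound that assumes the three queries do not compete for the same disk or CPU.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the three monthly tables: caller number -> call records.
MONTHLY_TABLES = {
    "cdr_200701": {"13800000000": ["01-03 10:00", "01-19 08:30"]},
    "cdr_200702": {"13800000000": ["02-11 21:15"]},
    "cdr_200703": {},
}


def query_month(table, caller):
    """Placeholder for one SELECT against one monthly table (or one server)."""
    return [(table, rec) for rec in MONTHLY_TABLES[table].get(caller, [])]


def query_last_three_months(caller):
    tables = sorted(MONTHLY_TABLES)
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        # map() returns results in table order even if threads finish out of order
        parts = pool.map(query_month, tables, [caller] * len(tables))
        return [row for part in parts for row in part]


rows = query_last_three_months("13800000000")
print(len(rows))   # → 3
```

The merge step is trivial here because each month's result is already ordered; a real implementation would also have to decide what to return when one of the three queries fails or times out.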
 SQL Server 2005对海量数据处理
超大型数据库的大小常常达到数百GB,有时甚至要用TB来计算。而单表的数据量往往会达到上亿的记录,并且记录数会随着时间而增长。这不但影响着数据库的运行效率,也增大数据库的维护难度。除了表的数据量外,对表不同的访问模式也可能会影响性能和可用性。这些问题都可以通过对大表进行合理分区得到很大的改善。当表和索引变得非常大时,分区可以将数据分为更小、更容易管理的部分来提高系统的运行效率。如果系统有多个CPU或是多个磁盘子系统,可以通过并行操作获得更好的性能。所以对大表进行分区是处理海量数据的一种十分高效的方法。本文通过一个具体实例,介绍如何创建和修改分区表,以及如何查看分区表。
　　1 SQL Server 2005

　　SQL Server 2005是微软在推出SQL Server 2000后时隔五年推出的一个数据库平台,它的数据库引擎为关系型数据和结构化数据提供了更安全可靠的存储功能,使用户可以构建和管理用于业务的高可用和高性能的数据应用程序。此外SQL Server 2005结合了分析、报表、集成和通知功能。这使企业可以构建和部署经济有效的BI解决方案,帮助团队通过记分卡、Dashboard、Web Services和移动设备将数据应用推向业务的各个领域。无论是开发人员、数据库管理员、信息工作者还是决策者,SQL Server 2005都可以提供出创新的解决方案,并可从数据中获得更多的益处。

　　它所带来的新特性,如T-SQL的增强、数据分区、服务代理和与.Net Framework的集成等,在易管理性、可用性、可伸缩性和安全性等方面都有很大的增强。

　　2 表分区的具体实现方法

　　表分区分为水平分区和垂直分区。水平分区将表分为多个表。每个表包含的列数相同,但是行更少。例如,可以将一个包含十亿行的表水平分区成 12 个表,每个小表表示特定年份内一个月的数据。任何需要特定月份数据的查询只需引用相应月份的表。而垂直分区则是将原始表分成多个只包含较少列的表。水平分区是最常用分区方式,本文以水平分区来介绍具体实现方法。

　　水平分区常用的方法是根据时期和使用对数据进行水平分区。例如本文例子,一个短信发送记录表包含最近一年的数据,但是只定期访问本季度的数据。在这种情况下,可考虑将数据分成四个区,每个区只包含一个季度的数据。

　　2.1 创建文件组

　　建立分区表先要创建文件组,而创建多个文件组主要是为了获得好的 I/O 平衡。一般情况下,文件组数最好与分区数相同,并且这些文件组通常位于不同的磁盘上。每个文件组可以由一个或多个文件构成,而每个分区必须映射到一个文件组。一个文件组可以由多个分区使用。为了更好地管理数据(例如,为了获得更精确的备份控制),对分区表应进行设计,以便只有相关数据或逻辑分组的数据位于同一个文件组中。使用 ALTER DATABASE,添加逻辑文件组名:

　　ALTER DATABASE [DeanDB] ADD FILEGROUP [FG1]

　　DeanDB为数据库名称,FG1文件组名。创建文件组后,再使用 ALTER DATABASE 将文件添加到该文件组中:

　　ALTER DATABASE [DeanDB] ADD FILE ( NAME = N'FG1', FILENAME = N'C:\DeanData\FG1.ndf' , SIZE = 3072KB , FILEGROWTH = 1024KB ) TO FILEGROUP [FG1]

　　类似的建立四个文件和文件组,并把每一个存储数据的文件放在不同的磁盘驱动器里。

　　2.2 创建分区函数

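　　Before creating a partitioned table you must decide how it is partitioned; the partitioning criterion is defined by a partition function. A partition function is created as either RANGE LEFT or RANGE RIGHT, which states on which side of each boundary value the boundary itself falls. With four partitions, for example, you define three boundary values and specify whether each value is the upper bound of the partition to its left (LEFT) or the lower bound of the partition to its right (RIGHT)[1]. The code is as follows: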

　　CREATE PARTITION FUNCTION [SendSMSPF](datetime)　AS RANGE RIGHT FOR VALUES ('20070401', '20070701', '20071001')
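　　The difference between RANGE RIGHT and RANGE LEFT is easiest to see on a boundary value. The following sketch mimics the mapping with a binary search; it is a model of the semantics only, and `partition_number` is a hypothetical helper, not a SQL Server API.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# The three boundary values from the SendSMSPF example.
BOUNDARIES = [datetime(2007, 4, 1), datetime(2007, 7, 1), datetime(2007, 10, 1)]


def partition_number(value, boundaries=BOUNDARIES, range_right=True):
    """1-based partition number, mirroring RANGE RIGHT / RANGE LEFT semantics."""
    if range_right:
        # a boundary value belongs to the partition on its right
        return bisect_right(boundaries, value) + 1
    # RANGE LEFT: a boundary value belongs to the partition on its left
    return bisect_left(boundaries, value) + 1


print(partition_number(datetime(2007, 3, 31)))                    # → 1
print(partition_number(datetime(2007, 4, 1)))                     # → 2
print(partition_number(datetime(2007, 4, 1), range_right=False))  # → 1
print(partition_number(datetime(2007, 12, 25)))                   # → 4
```

　　With RANGE RIGHT and these boundaries, each quarter starting on 1 April, 1 July, and 1 October begins a new partition, and everything before 1 April falls into partition 1.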

　　2.3 创建分区方案

　　创建分区函数后,必须将其与分区方案相关联,以便将分区指向至特定的文件组。就是定义实际存放数据的媒体与各数据块的对应关系。多个数据表可以共用相同的数据分区函数,一般不共用相同的数据分区方案。可以通过不同的分区方案,使用相同的分区函数,使不同的数据表有相同的分区条件,但存放在不同的媒介上。创建分区方案的代码如下:

　　CREATE PARTITION SCHEME [SendSMSPS] AS PARTITION [SendSMSPF] TO ([FG1], [FG2], [FG3], [FG4])

　　2.4 创建分区表

　　建立好分区函数和分区方案后,就可以创建分区表了。分区表是通过定义分区键值和分区方案相联系的。插入记录时,SQL SERVER会根据分区键值的不同,通过分区函数的定义将数据放到相应的分区。从而把分区函数、分区方案和分区表三者有机的结合起来。创建分区表的代码如下:

　　CREATE TABLE SendSMSLog
　　([ID] [int] IDENTITY(1,1) NOT NULL,
　　[IDNum] [nvarchar](50) NULL,
　　[SendContent] [text] NULL,
　　[SendDate] [datetime] NOT NULL
　　) ON SendSMSPS(SendDate)

　　2.5 查看分区表信息

　　系统运行一段时间或者把以前的数据导入分区表后,我们需要查看数据的具体存储情况,即每个分区存取的记录数,那些记录存取在那个分区等。我们可以通过$partition.SendSMSPF来查看,代码如下:

　　SELECT $partition.SendSMSPF(o.SendDate)

　　AS [Partition Number]

　　, min(o.SendDate) AS [Min SendDate]

　　, max(o.SendDate) AS [Max SendDate]

　　, count(*) AS [Rows In Partition]

　　FROM dbo.SendSMSLog AS o

　　GROUP BY $partition.SendSMSPF(o.SendDate)

　　ORDER BY [Partition Number]

　　在查询分析器里执行以上脚本,结果如图1所示:

　　图1　分区表信息

　　2.6 维护分区

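　　Partition maintenance mainly involves adding, removing, merging, and switching partitions. This is done with the SPLIT and MERGE options of ALTER PARTITION FUNCTION and the SWITCH option of ALTER TABLE. SPLIT adds one partition, MERGE merges partitions and so reduces their number, and SWITCH logically moves a partition from one table to another.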

　　3 性能对比

　　我们对2650万数据,存储空间占用约4G的单表进行性能对比,测试环境为IBM365,CPU 至强2.7G*2、内存 16G、硬盘 136G*2,系统平台为Windows 2003 SP1+SQL Server 2005 SP1。测试结果如表1:

　　表1:分区和未分区性能对比表(单位:毫秒)

　　Test item    Partitioned    Unpartitioned
　　1            16546          61466
　　2            13             33
　　3            20140          61546
　　4            17140          61000

　　说明:

　　1:根据时间检索某一天记录所耗时间

　　2:单条记录插入所耗时间

　　3:根据时间删除某一天记录所耗时间

　　4:统计每月的记录数所需时间

　　从表1可以看出,对分区表进行操作比未分区的表要快,这是因为对分区表的操作采用了CPU和I/O的并行操作,检索数据的数据量也变小了,定位数据所耗时间变短。

　　4 结束语

　　对海量数据的处理一直是一个令人头痛的问题。分离的技术是所有设计者们首先考虑的问题,不管是分离应用程序功能还是分离数据访问,如果加以了合理规划,都能十分有效的解决大数据表的运行效率低和维护成本高等问题。SQL Server 2005新增的表分区功能,可以对数据进行合理分区,当用户在访问部分数据时,SQL Server最佳化引擎可以根据数据的实体存放,找出最佳的执行方案,而不至于大海捞针。
 分表处理设计思想和实现

作者：heiyeluren (黑夜路人)
博客： http://blog.csdn.net/heiyeshuwu
时间：2007-01-19 01:44:20

一、概述

分表是个目前算是比较炒的比较流行的概念，特别是在大负载的情况下，分表是一个良好分散数据库压力的好方法。

首先要了解为什么要分表，分表的好处是什么。我们先来大概了解以下一个数据库执行SQL的过程：
接收到SQL --> 放入SQL执行队列 --> 使用分析器分解SQL --> 按照分析结果进行数据的提取或者修改 --> 返回处理结果

当然，这个流程图不一定正确，这只是我自己主观意识上这么我认为。那么这个处理过程当中，最容易出现问题的是什么？就是说，如果前一个SQL没有执行完毕的话，后面的SQL是不会执行的，因为为了保证数据的完整性，必须对数据表文件进行锁定，包括共享锁和独享锁两种锁定。共享锁是在锁定的期间，其它线程也可以访问这个数据文件，但是不允许修改操作，相应的，独享锁就是整个文件就是归一个线程所有，其它线程无法访问这个数据文件。一般MySQL中最快的存储引擎MyISAM，它是基于表锁定的，就是说如果一锁定的话，那么整个数据文件外部都无法访问，必须等前一个操作完成后，才能接收下一个操作，那么在这个前一个操作没有执行完成，后一个操作等待在队列里无法执行的情况叫做阻塞，一般我们通俗意义上叫做“锁表”。

锁表直接导致的后果是什么？就是大量的SQL无法立即执行，必须等队列前面的SQL全部执行完毕才能继续执行。这个无法执行的SQL就会导致没有结果，或者延迟严重，影响用户体验。

特别是对于一些使用比较频繁的表，比如SNS系统中的用户信息表、论坛系统中的帖子表等等，都是访问量大很大的表，为了保证数据的快速提取返回给用户，必须使用一些处理方式来解决这个问题，这个就是我今天要聊到的分表技术。

分表技术顾名思义，就是把若干个存储相同类型数据的表分成几个表分表存储，在提取数据的时候，不同的用户访问不同的表，互不冲突，减少锁表的几率。比如，目前保存用户分表有两个表，一个是user_1表，还有一个是 user_2 表，两个表保存了不同的用户信息，user_1 保存了前10万的用户信息，user_2保存了后10万名用户的信息，现在如果同时查询用户 heiyeluren1 和 heiyeluren2 这个两个用户，那么就是分表从不同的表提取出来，减少锁表的可能。

我下面要讲述的两种分表方法我自己都没有实验过，不保证准确能用，只是提供一个设计思路。下面关于分表的例子我假设是在一个贴吧系统的基础上来进行处理和构建的。（如果没有用过贴吧的用户赶紧Google一下）

二、基于基础表的分表处理

这个基于基础表的分表处理方式大致的思想就是：一个主要表，保存了所有的基本信息，如果某个项目需要找到它所存储的表，那么必须从这个基础表中查找出对应的表名等项目，好直接访问这个表。如果觉得这个基础表速度不够快，可以完全把整个基础表保存在缓存或者内存中，方便有效的查询。

我们基于贴吧的情况，构建假设如下的3张表：

1. 贴吧版块表: 保存贴吧中版块的信息
2. 贴吧主题表：保存贴吧中版块中的主题信息，用于浏览
3. 贴吧回复表：保存主题的原始内容和回复内容

“贴吧版块表”包含如下字段：
版块ID board_id int(10)
版块名称 board_name char(50)
子表ID table_id smallint(5)
产生时间 created datetime

“贴吧主题表”包含如下字段：
主题ID topic_id int(10)
主题名称 topic_name char(255)
版块ID board_id int(10)
创建时间 created datetime

“贴吧回复表”的字段如下：
回复ID reply_id int(10)
回复内容 reply_text text
主题ID topic_id int(10)
版块ID board_id int(10)
创建时间 created datetime

那么上面保存了我们整个贴吧中的表结构信息，三个表对应的关系是：

版块 --> 多个主题
主题 --> 多个回复

那么就是说，表文件大小的关系是：
版块表文件 < 主题表文件 < 回复表文件

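So it is fairly clear that the topic table and the reply table are the ones to shard, in order to improve the speed and performance of lookups, queries, and updates.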

看了上面的表结构，会明显发现，在“版块表”中保存了一个"table_id"字段，这个字段就是用于保存一个版块对应的主题和回复都是分表保存在什么表里的。

比如我们有一个叫做“PHP”的贴吧，board_id是1，子表ID也是1，那么这条记录就是：

board_id | board_name | table_id | created
1 | PHP | 1 | 2007-01-19 00:30:12

相应的，如果我需要提取“PHP”吧里的所有主题，那么就必须按照表里保存的table_id来组合一个存储了主题的表名称，比如我们主题表的前缀是“topic_”，那么组合出来“PHP”吧对应的主题表应该是：“topic_1”，那么我们执行：

SELECT * FROM topic_1 WHERE board_id = 1 ORDER BY topic_id DESC LIMIT 10

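This returns the list of topics on this board for browsing. To see the replies under a particular topic, we again use the “table_id” stored in the board table. If the reply table prefix is “reply_”, the query for the replies to topic ID 1 in the “PHP” board becomes: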

SELECT * FROM reply_1 WHERE topic_id = 1 ORDER BY reply_id DESC LIMIT 10

这里，我们能够清晰的看到，其实我们这里使用了基础表，基础表就是我们的版块表。那么相应的，肯定会说：基础表的数据量大了以后如何保证它的速度和效率？

当然，我们就必须使得这个基础表保持最好的速度和性能，比如，可以采用MySQL的内存表来存储，或者保存在内存当中，比如Memcache之类的内存缓存等等，可以按照实际情况来进行调整。

一般基于基础表的分表机制在SNS、交友、论坛等Web2.0网站中是个比较不错的解决方案，在这些网站中，完全可以单独使用一个表来来保存基本标识和目标表之间的关系。使用表保存对应关系的好处是以后扩展非常方便，只需要增加一个表记录。

【优势】增加删除节点非常方便，为后期升级维护带来很大便利
【劣势】需要增加表或者对某一个表进行操作，还是无法离开数据库，会产生瓶颈
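The lookup through the base table can be sketched as follows. The dictionary `BOARD_TABLE_IDS` stands in for the board table (or a cached copy of it), and its contents, like the function names, are hypothetical; the SQL text matches the queries shown above except that values are passed as placeholders.

```python
# Stand-in for the board table: board_id -> table_id (hypothetical contents).
BOARD_TABLE_IDS = {1: 1, 2: 1, 3: 2}


def shard_tables(board_id):
    """Return the (topic table, reply table) names for one board."""
    table_id = BOARD_TABLE_IDS[board_id]   # in production: a cached base-table lookup
    return f"topic_{table_id}", f"reply_{table_id}"


def topic_list_query(board_id, limit=10):
    topic_table, _ = shard_tables(board_id)
    # Table names cannot be bound as parameters, so they come only from the
    # trusted mapping above; the values still go through placeholders.
    sql = (f"SELECT * FROM {topic_table} "
           "WHERE board_id = ? ORDER BY topic_id DESC LIMIT ?")
    return sql, (board_id, limit)


print(shard_tables(3))        # → ('topic_2', 'reply_2')
print(topic_list_query(1))
```

Moving a board to a new shard then only requires copying its rows and changing one entry in the mapping, which is the extensibility advantage claimed above.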

三、基于Hash算法的分表处理

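A hash table, as we know, uses a hash function to compute a value from a key and then uses that value to locate the stored item; ideally different keys give different values, though in practice collisions occur and have to be handled.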

我们在分表里的hash算法跟这个思想类似：通过一个原始目标的ID或者名称通过一定的hash算法计算出数据存储表的表名，然后访问相应的表。

继续拿上面的贴吧来说，每个贴吧有版块名称和版块ID，那么这两项值是固定的，并且是惟一的，那么我们就可以考虑通过对这两项值中的一项进行一些运算得出一个目标表的名称。

现在假如我们针对我们这个贴吧系统，假设系统最大允许1亿条数据，考虑每个表保存100万条记录，那么整个系统就不超过100个表就能够容纳。按照这个标准，我们假设在贴吧的版块ID上进行hash，获得一个key值，这个值就是我们的表名，然后访问相应的表。

我们构造一个简单的hash算法：

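function get_hash($id){
    // hex-encode the ID's characters, e.g. "1" becomes "31"
    $str = bin2hex($id);
    // keep the first four hex digits as the table suffix
    $hash = substr($str, 0, 4);
    // right-pad short results with "0" so the suffix always has four characters
    if (strlen($hash)<4){
        $hash = str_pad($hash, 4, "0");
    }
    return $hash;
}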

算法大致就是传入一个版块ID值，然后函数返回一个4位的字符串，如果字符串长度不够，使用0进行补全。

比如：get_hash(1)，输出的结果是“3100”，输入：get_hash(23819)，得到的结果是：3233，那么我们经过简单的跟表前缀组合，就能够访问这个表了。那么我们需要访问ID为1的内容时候哦，组合的表将是：topic_3100、reply_3100，那么就可以直接对目标表进行访问了。

当然，使用hash算法后，有部分数据是可能在同一个表的，这一点跟hash表不同，hash表是尽量解决冲突，我们这里不需要，当然同样需要预测和分析表数据可能保存的表名。

如果需要存储的数据更多，同样的，可以对版块的名字进行hash操作，比如也是上面的二进制转换成十六进制，因为汉字比数字和字母要多很多，那么重复几率更小，但是可能组合成的表就更多了，相应就必须考虑一些其它的问题。

In the end, the hash approach depends on choosing a good hash algorithm, one that spreads the data over enough tables so that queries run faster.

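[Advantage] The hash algorithm yields the target table name directly, which is very efficient
[Disadvantage] Poor extensibility: once a hash algorithm is chosen and sized for a given data volume, the system can only run within that volume and cannot grow beyond it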
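To see how the example hash distributes boards, here is a Python port of `get_hash` next to a plain modulo scheme. The port follows the PHP function above; `get_shard` and its shard count of 100 are assumptions added for comparison, not something the author proposes.

```python
def get_hash(board_id):
    """Port of the PHP get_hash(): first four hex digits of the ID's ASCII bytes."""
    hex_digits = str(board_id).encode("ascii").hex()
    return hex_digits[:4].ljust(4, "0")


def get_shard(board_id, shard_count=100):
    """Hypothetical alternative: modulo sharding, which cycles through all shards."""
    return f"{board_id % shard_count:02d}"


print(get_hash(1), get_hash(23819))    # → 3100 3233
print(get_hash(23), get_hash(230))     # → 3233 3233
tables_used = {get_hash(i) for i in range(1, 100001)}
print(len(tables_used))                # → 99
print(get_shard(23819))                # → 19
```

The second line shows the catch: the suffix depends only on the first two decimal digits of the ID, so 23, 230, and 23819 all share one table, and no more than 99 tables can ever be used. A modulo on the numeric ID spreads sequential IDs evenly, at the cost of the same resizing problem noted in the disadvantage above.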

四、其它问题

1. 搜索问题
现在我们已经进行分表了，那么就无法直接对表进行搜索，因为你无法对可能系统中已经存在的几十或者几百个表进行检索，所以搜索必须借助第三方的组件来进行，比如Lucene作为站内搜索引擎是个不错的选择。

2. 表文件问题
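As we know, MySQL's MyISAM engine creates three files per table, *.frm, *.MYD, and *.MYI, which hold the table structure, the table data, and the table indexes respectively. On Linux it is best to keep fewer than 1000 files in one directory, otherwise lookups become slower. Since each table produces three files, sharding into more than 300 tables makes access very slow, so at that point a further split is needed, for example across several databases.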

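With a base table, we can add one more field that records which database a table lives in. With hashing, we take a fixed range of digits from the hash value as the database name. Either way, this problem is solved neatly.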

五、总结

在大负载应用当中，数据库一直是个很重要的瓶颈，必须要突破，本文讲解了两种分表的方式，希望对很多人能够有启发的作用。当然，本文代码和设想没有经过任何代码测试，所以无法保证设计的完全准确实用，具体还是需要读者在使用过程当中认真分析实施。
 Linux系统高负载 MySQL数据库彻底优化(1)
Author: skid; source: 赛迪网
更新时间：2007-06-25 13:43
关键词：优化 Linux MySQL
阅读提示：本文作者讲述了在高负载的Linux系统下，MySQL数据库如何实现优化，供大家参考！
同时在线访问量继续增大对于1G内存的服务器明显感觉到吃力，严重时，甚至每天都会死机或者时不时的服务器卡一下。这个问题曾经困扰了我半个多月，MySQL使用是很具伸缩性的算法，因此你通常能用很少的内存运行或给MySQL更多的备存以得到更好的性能。
安装好mysql后，配制文件应该在/usr/local/mysql/share/mysql目录中，配制文件有几个，有my-huge.cnf my-medium.cnf my-large.cnf my-small.cnf，不同流量的网站和不同配制的服务器环境，当然需要有不同的配制文件了。
一般的情况下，my-medium.cnf这个配制文件就能满足我们的大多需要；一般我们会把配置文件拷贝到/etc/my.cnf ，只需要修改这个配置文件就可以了，使用mysqladmin variables extended-status -uroot -p可以看到目前的参数，有3个配置参数是最重要的，即key_buffer_size,query_cache_size,table_cache。
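key_buffer_size applies only to MyISAM tables. It sets the size of the index buffer and so determines how fast indexes are processed, especially index reads. It is commonly set to 16M, which is far from enough for any somewhat larger site. The status values Key_read_requests and Key_reads show whether key_buffer_size is reasonable: the ratio Key_reads / Key_read_requests should be as low as possible, at least 1:100, and 1:1000 is better (these values can be obtained with SHOW STATUS LIKE 'key_read%'). If phpMyAdmin is installed, the same values appear on its server status page; the author recommends phpMyAdmin for managing MySQL, and the status values below were all taken from it for this worked example: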
这个服务器已经运行了20天
key_buffer_size – 128M

key_read_requests – 650759289

key_reads - 79112
比例接近1:8000 健康状况非常好
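The ratio quoted above is a single division. The helper name below is hypothetical; the two counters are the ones listed for this server.

```python
def requests_per_disk_read(key_read_requests, key_reads):
    """Index read requests served for each physical index read from disk."""
    return key_read_requests / key_reads


ratio = requests_per_disk_read(650759289, 79112)
print(round(ratio))   # → 8226, i.e. roughly 1 disk read per 8000 requests
```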
另外一个估计key_buffer_size的办法　
把你网站数据库的每个表的索引所占空间大小加起来看看以此服务器为例：比较大的几个表索引加起来大概125M 这个数字会随着表变大而变大。
从4.0.1开始，MySQL提供了查询缓冲机制。使用查询缓冲，MySQL将SELECT语句和查询结果存放在缓冲区中，今后对于同样的SELECT语句（区分大小写），将直接从缓冲区中读取结果。根据MySQL用户手册，使用查询缓冲最多可以达到238%的效率。
The following status values show whether query_cache_size is set sensibly:
Qcache_inserts
Qcache_hits
Qcache_lowmem_prunes
Qcache_free_blocks
Qcache_total_blocks
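If Qcache_lowmem_prunes is very large, the cache frequently runs short of memory. If Qcache_hits is also very large, the query cache is used heavily, and the cache size should be increased. If Qcache_hits is small, the queries repeat rarely; in that case the query cache actually hurts efficiency and you can consider turning it off. In addition, adding SQL_NO_CACHE to a SELECT statement explicitly bypasses the query cache.
If Qcache_free_blocks is very large, the cache is heavily fragmented. query_cache_type specifies whether the query cache is used.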
我设置：
QUOTE：
query_cache_size = 32M

query_cache_type= 1
得到如下状态值：
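Qcache_queries_in_cache  12737     (number of queries currently cached)
Qcache_inserts           20649006
Qcache_hits              79060095  (the repeat-query rate is quite high)
Qcache_lowmem_prunes     617913    (the cache ran short of memory this many times)
Qcache_not_cached        189896
Qcache_free_memory       18573912  (cache memory still free)
Qcache_free_blocks       5328      (this looks rather high; considerable fragmentation)
Qcache_total_blocks      30953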

如果内存允许32M应该要往上加点
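The readings above can be turned into a few ratios that are easier to compare over time. The formulas are a rough convention assumed for this sketch (for instance, the hit rate counts hits against hits plus inserts plus uncacheable queries) rather than anything defined by MySQL itself, and the function name is hypothetical.

```python
def query_cache_report(status):
    hits = status["Qcache_hits"]
    inserts = status["Qcache_inserts"]
    not_cached = status["Qcache_not_cached"]
    return {
        # share of SELECTs that were answered from the cache
        "hit_rate": hits / (hits + inserts + not_cached),
        # low-memory evictions per query stored in the cache
        "prunes_per_insert": status["Qcache_lowmem_prunes"] / inserts,
        # a high share of free blocks suggests fragmentation
        "free_block_share": status["Qcache_free_blocks"] / status["Qcache_total_blocks"],
    }


# Values from the author's server.
report = query_cache_report({
    "Qcache_hits": 79060095,
    "Qcache_inserts": 20649006,
    "Qcache_not_cached": 189896,
    "Qcache_lowmem_prunes": 617913,
    "Qcache_free_blocks": 5328,
    "Qcache_total_blocks": 30953,
})
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

On these numbers roughly four out of five SELECTs are served from the cache, about 3% of cached queries are evicted for lack of memory, and about 17% of blocks are free, which is consistent with the author's reading that the cache is useful but somewhat small and fragmented.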
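table_cache sets the size of the table cache. Whenever MySQL accesses a table and the cache still has room, the table is opened and placed in it, which makes later access to the table faster. Checking the status values Open_tables and Opened_tables at peak time shows whether table_cache needs to grow: if Open_tables equals table_cache and Opened_tables keeps rising, increase table_cache (these values can be obtained with SHOW STATUS LIKE 'Open%tables'). Do not set table_cache blindly to a very large value; if it is too high, the server may run out of file descriptors, causing unstable performance or failed connections.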
对于有1G内存的机器，推荐值是128－256。
笔者设置
QUOTE：
table_cache = 256
得到以下状态：
Open tables 256

Opened tables 9046
虽然open_tables已经等于table_cache，但是相对于服务器运行时间来说，已经运行了20天，opened_tables的值也非常低。因此，增加table_cache的值应该用处不大。如果运行了6个小时就出现上述值那就要考虑增大table_cache。
如果你不需要记录2进制log 就把这个功能关掉，注意关掉以后就不能恢复出问题前的数据了，需要您手动备份，二进制日志包含所有更新数据的语句，其目的是在恢复数据库时，用它来把数据尽可能恢复到最后的状态。另外，如果做同步复制( Replication )的话，也需要使用二进制日志传送修改情况。
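log_bin names the log file; if no file name is given, MySQL generates a default one. MySQL automatically appends a numeric index to the file name and creates a new binary log file each time the server starts. In addition, log-bin-index names the index file, binlog-do-db names a database to log, and binlog-ignore-db names a database not to log. Note that binlog-do-db and binlog-ignore-db each take one database at a time, so several databases need several entries. According to the author, MySQL also converts all database names to lowercase, so the names given must be entirely lowercase or they will have no effect.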
关掉这个功能只需要在他前面加上#号
QUOTE：
#log-bin
数据库的瓶颈大多在查询速度和读写锁定上，除了优化数据库本身和sql语句外，还可以考虑，把一个表拆分成多个或者关系数据库和文本（纯文本／XML／文本数据库等）库配合使用。

我在做文学程序，目前做法是数据库里面只保存文章结构，文章内容用一个个文件保存。

这样好处是数据库小了，查询快，不过全文搜索就不好办了。

我现在手上做的一个项目，这样处理。

形成分站结构，每个分站都对应一个相同结构的数据库，其中放XXOO张表，包括文章主题表。每个分站就等同于一个大类了。再加一个库（类似xxoo_blog这样的名字），目前一张表，保存发表人、时间、分站名、文章ID的对应记录，以做为集中区。

DB封装类中这样定义基类：
/*****************************************************************************
CLASS table base
*****************************************************************************/
class table {
var $table_name;
…

function table() {}
//切换DB
function change_db($db_name) {
global $db;
$db->Database=$db_name;
mysql_select_db($db_name,$db->Link_ID);
}

…

}

相应表类定义：
/*****************************************************************************
CLASS info
*****************************************************************************/
class info extends table{
function info() {
global $site_name;
$this->change_db($site_name);

$this->table_name="info";
$this->order_text="i_date desc";
$this->limit_text="";
$this->id_name="i_id";
}
}

这样调用时只要注意好$site_name这个分站名的参数即可。
 大型数据库的设计与编程技巧

本人最近开发一个访问统计系统，日志非常的大，都保存在数据库里面。

我现在按照常规的设计方法对表进行设计，已经出现了查询非常缓慢地情形。

大家对于这种情况如何来设计数据库呢？把一个表分成多个表么？那么查询和插入数据库又有什么技巧呢？

谢谢，村里面的兄弟们！

按照时间把库分开，建立正确的索引，避免关联及子查询，like的使用
给你两种方案:
数据表大了的话,肯定要分表的.

一种是按数据类型分表.
比如说用户1的日志,用户2的日志,查询的时候又基本上是按照这种方式查的话比较好.还省去了查询条件

另外一个方案就是按时间分:
其实来这也是一种类型,只是稍微有些差别.
如果每天的日志量很大的话可以当天写入临时表比如log_tmp
然后每天定时跑crontab,将log_tmp改名为2007-03-05
然后重新建立一个log_tmp新表

另外也可以按多少条记录,分表,也可以考虑按hash算法分布表.

我个人认为可以按以下的思路优化一下：

1、大表，是指列数多还是行数多？

2、分表有按列分和按行分，

3、频繁操作是插入还是查询？占资源多的是那个，一般如果行数多会造成查询缓慢，

4、统计查询的时间实时要求高不高，比如是否一定要精确到某时某刻，还是某段时间（既可缓10分钟），如果可缓，可以10分钟做一次统计快照，既10分钟(或5分钟)做一次统计快照，把几千万数据统计为几万条数据的快照统计表，这样会明显提高效率。
还有，数据库系统主机的优化也很重要，比如，服务器分流，存储空间的设置技巧 … 这些对大型数据库都很重要

 方案探讨,关于工程中数据库的问题. [已结贴]

Winters_lee, posted 2007-06-01 09:17:26 (original poster)
用VB开发的工程,使用SQL Server数据库.

用SQL Server记录历史数据,有几个历史数据的表单,可是这样长年累月的添加记录,弄到这些表单庞大无比,以致对表单进行操作时消耗很长的时间.这个问题就显得很重要的了.

情况具体是这样,有3类表单,1类是1分钟记录一次的表单,1类是20分钟记录一次的表单,这个数据是1分钟表单的平均值记录表单.还有1类是1小时记录一次的表单,这个数据就是20分钟表单的平均值记录.

1分钟记录一次的表单使用1个月就是1*60*24*30=43200条记录,这就有满大了,所以我限制了这个表单的记录条数,在添加一条记录的时候就删除最老的一条,保证这个表单只有1个月的分量.

但是20分钟和1小时的表单我就不能这么做了,因为客户需要保留所有的记录.

曾试过用备份和恢复的方法,但是比较麻烦.请各位大侠看看有何良策,小弟不甚感激!

问题点数：100 回复次数：19

ZOU_SEAFARER, posted 2007-06-01 10:10:08 (reply 1, score: 10)
但是20分钟和1小时的表单我就不能这么做了,因为客户需要保留所有的记录.
这个要求有点无理了,即便是最先进的系统,最大的磁盘空间也有满的那一天!!

cangwu_lee, posted 2007-06-01 10:12:56 (reply 2, score: 5)
分表保存

Winters_lee, posted 2007-06-01 10:22:10 (reply 3, score: 0)
楼上的,分表保存的话,问题来了:
需要在程序里面切入这个分开的表单名称,不然它不会知道该放到哪个表单里面去.然后还有如何定义不同的表单呢?这个在查询的时候同样也是比较麻烦的,查找时,必须判断要查找的数据是存放在哪个表单里面的.

vbman2003, posted 2007-06-01 10:38:23 (reply 4, score: 10)
从你的描述看，你这点数据量并不算大啊。就说你的分钟记录表，一个月才43200，一年才50多万条数据，用10年也就500万数据，也不能称为庞大无比，如果这点数据，对于常规的操作，要消耗很长的时间，是不是硬件或者程序有问题？

Winters_lee, posted 2007-06-01 11:00:35 (reply 5, score: 0)
50多万条记录进行查询，求平均值等操作，你觉得会快的起来吗？
我在那机器上不开任何其他的程序，就用SQL Server的查询分析器，使用这条命令：
select top 1 * from XXX order by InTime desc
大概用了我８秒的时间．而对数据记录比较少的表单来说，速度就比较快了，可以满足处理的要求．

Winters_lee, posted 2007-06-01 11:04:30 (reply 6, score: 0)
尤其是更加精确的查询，耗时更多，象：
select * from XXX where InTime between '2003-01-01' and '2003-05-30'

vbman2003, posted 2007-06-01 11:23:50 (reply 7, score: 0)
50多万条记录进行查询，求平均值等操作，你觉得会快的起来吗？
我在那机器上不开任何其他的程序，就用SQL Server的查询分析器，使用这条命令：
select top 1 * from XXX order by InTime desc
大概用了我８秒的时间．而对数据记录比较少的表单来说，速度就比较快了，可以满足处理的要求．
-----------------
作为一台服务器，50万数据的各种查询，返回数据都是毫秒级的
你的机器让人郁闷，这点数据，要在程序上或者数据库的设计上花费这样的精力，我真是无语

vbman2003, posted 2007-06-01 11:28:47 (reply 8, score: 0)
50多万条记录进行查询，求平均值等操作，你觉得会快的起来吗？
--------------------------
我有个access，其中的一张表有70万条数据，是从SQL数据库上备份下来的历史数据，我用VB连接查询一些统计信息，快的在1秒以内，最慢也不会超过2秒

Winters_lee, posted 2007-06-01 13:32:31 (reply 9, score: 0)
数据库的操作对硬件的要求很高么？感觉对内存到是很有要求．

lsftest, posted 2007-06-01 14:03:31 (reply 10, score: 10)
使用sql server的工作调度，写个存储过程，在每天适当的时候（如凌晨3、4点）建一新表，把昨天的数据都移到那个新表去，要统计的时候就用union all把需要的数据并在一起统计。。。然后定时做数据备份，清空旧的数据备份表。。
以前我公司那套系统，平均每秒差不多都会有几十条新记录插入。。。。如果一直放着不管它，早崩溃了。。。。

vbman2003, posted 2007-06-01 14:21:27 (reply 11, score: 0)
数据库的操作对硬件的要求很高么？感觉对内存到是很有要求．
------------------------------------------------
专业服务器从主板、CPU、内存、硬盘等等都与普通PC不一样的。

ZOU_SEAFARER, posted 2007-06-01 14:28:07 (reply 12, score: 0)
lsftest() 的方法我觉得不错,同时你还需要增加一个表,记录什么时候分表了,分表的名称等信息,到查询的时候就把这些表通过你增加的哪个新表联系起来!!

hupeng213, posted 2007-06-01 14:41:46 (reply 13, score: 5)
50多万条记录进行查询，求平均值等操作，你觉得会快的起来吗？
我在那机器上不开任何其他的程序，就用SQL Server的查询分析器，使用这条命令：
select top 1 * from XXX order by InTime desc
大概用了我８秒的时间．而对数据记录比较少的表单来说，速度就比较快了，可以满足处理的要求．

-------------------------------
这样子的现象只能证明一件事情，你的数据库结构定义不合理，适当的对某些字段建立索引，可以有效地提高速度。

lsftest, posted 2007-06-01 20:36:20 (reply 14, score: 10)
同时你还需要增加一个表,记录什么时候分表了,分表的名称等信息,到查询的时候就把这些表通过你增加的哪个新表联系起来!!
====================================
不需要，只要你自己心里有数就行了。。。
例如，今天是2007.06.01，在2007.06.02的凌晨3：00，工作调度就会执行预先写好的存储过程，大致要完成的工作是：
1.建新表,表名testtable20070601，表结构跟源表（originaltable）一样。
2.从源表中复制2007.06.02前的数据到testtable20070601：
insert into testtable20070601 select * from originaltable where datefield < '2007-06-02 00:00:00.000'
3.删除源表的旧数据：
delete from originaltable where datefield < '2007-06-02 00:00:00.000'
4.完成。

到了2007.06.03的凌晨3：00，又重复执行上述操作，只是那时建的新表表名是testtable20070602。由于新表的表名都有规律，要统计时只要找到需要的那些日期的表把它们union all再统计就行了。。。只是根据需要构建一个sql查询语句，简单的字符串操作而已。。。
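Because the archive tables follow a fixed naming rule, the UNION ALL statement can be generated from a date range. The sketch below assumes the `testtable` + `YYYYMMDD` convention from this post; the function names are hypothetical, and it assumes every daily table in the range exists.

```python
from datetime import date, timedelta


def daily_tables(start, end, prefix="testtable"):
    """Names of the per-day archive tables covering start..end inclusive."""
    days = (end - start).days
    return [f"{prefix}{start + timedelta(d):%Y%m%d}" for d in range(days + 1)]


def union_all_query(start, end, columns="*"):
    """Build one SELECT per daily table and join them with UNION ALL."""
    selects = [f"SELECT {columns} FROM {t}" for t in daily_tables(start, end)]
    return "\nUNION ALL\n".join(selects)


print(daily_tables(date(2007, 5, 30), date(2007, 6, 1)))
# → ['testtable20070530', 'testtable20070531', 'testtable20070601']
print(union_all_query(date(2007, 5, 31), date(2007, 6, 1)))
```

Since the table names are derived from dates rather than taken from user input, building the statement by string concatenation is acceptable here; any filter values should still be passed as query parameters.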

有需要的话，另做一个存储过程，也放在工作调度里。。作用是定时备份，例如，每个月的1号凌晨4：00（为了不与上面3：00那个冲突），使用sql server的dts功能把testtable20070601、testtable20070602、testtable20070603等表导出为xls文件。存放于特定目录。然后再设一个时间判断，每导出一个月的数据，就把相隔几个月前的数据表删除。例如今天是2007.6.1，凌晨3：00的时候就把testtable20070501、testtable20070502……testtable20070531表导出到目录存放，然后把2007.04.01前的备份表testtable20070301、testtable20070302从库里删除（drop table？？？？）。。。这时不把testtable20070401～testtable20070531也一并删除掉，是预防在导出时出现问题而又删掉源数据就麻烦了。。。有一两个月的时间让你检查导出后的xls文件是否有问题，总足够了吧。。。。另外一种做法是在工作调度中直接做备份，把上面说的那些数据实时备份出来，记得好像是一个mdb文件和一个ldf文件。
这种方法可以让你保留全部数据记录而又不会降低服务器效率及可以大量节省存储空间（以前我把备份出来的数据文件用winrar最大压缩，压缩比约为10：1）。。

jiataizi, posted 2007-06-02 04:47:48 (reply 15, score: 20)
尤其是更加精确的查询，耗时更多，象：
select * from XXX where InTime between '2003-01-01' and '2003-05-30'

------------------------------------------------

觉得还是楼主的数据库设计有问题，象上面的这条语句，即使数据有100万的，如果数据库设计比较好的话，处理也应该是毫秒级的，建议楼主先试一下把InTime这个字段设置为聚集索引看看,建议楼主看下这篇文章：
http://blog.csdn.net/great_domino/archive/2005/02/01/275839.aspx

theforever, posted 2007-06-02 13:06:20 (reply 16, score: 10)
但是20分钟和1小时的表单我就不能这么做了,因为客户需要保留所有的记录.
－－－－－
既然用户只需要保留20分钟和1小时数据的所有记录，则1分钟的记录还用得着保存一个月吗？只用20条就可以了。不会是一次性对一个月的1分钟数据进行统计才生成20分钟和1小时的数据吧，那太低效了。

20分钟和1小时的表单，用户要求保留，那就这样了。
但这样会产生两个头疼结果：
1.硬盘容易满。
解决方法就是备份，没啥说的。啥麻烦不麻烦，用户要求不改变的情况下，除了这个还能有什么办法？何况这也不很麻烦。实际上任何象样的数据库应用软件都必须做好备份和恢复工作的，这是基本。
2.数据操作效率低。
既然量是没法压缩的，就得讲究量的组织形式了。好的组织形式自然可以不受量的影响而致效率问题。合理地分表是少不了的。定期的备份清除也可能需要考虑，那就看具体情况了。

SupermanKing, posted 2007-06-02 21:59:56 (reply 17, score: 10)
搂主可能没看过ASP海量数据查询的东西吧，那么点数据要花那么久时间，肯定是代码问题。
还有说到数据量和查询速度，SQL Server 2005 快很多，不防看看，我是深有体会，但是几十万条的数据
只要不超过1G,Access的数度都不慢，何况SQL Server

haen_zhou, posted 2007-06-04 02:57:30 (reply 18, score: 10)
建议你采用分表记录联合查询…
建立分表的时候, 得采用有规律的表名称~~~
我记得以前我做过很多这样的生产数据记录, 都采用的分表…
其数据量比你的大得多, 1分钟每个反应罐(有20个反应罐)要保存3条数据… 查询的时候都没有出现那样的问题…

 web软件设计时考虑你的性能解决方案
关键字: 性能,WEB
前段时间搜罗了一些大型web应用程序开发的性能提升方案文章，但是一直不够系统。若现在让我设计一个支持大访问量的系统，仍然难于下手(以前没做过啊)

于是我把这些文章梳理了一下加入了自己的理解，记录了关键准则：

* 关键准则:
1. 选择什么编程语言不是问题
2. 选择的框架才可能影响系统的扩展和性能
3. 我倾向于以数据库为中心设计数据结构。
4. 分从两个方面提升性能：
1) . 软件设计方面
* 网页静态化
* 独立的图片服务器
* 可能采用中间缓存层服务器，最可能采用第三方成熟的软件
* 数据库分表(水平分割是最终方案)
2). 系统、网络、硬件结构
* 集群：数据库集群，WEB集群
* 采用：SAN
* 提升网络接入带宽
……..
其实，我最担心程序的设计架构问题成为制约将来系统扩展和性能提升时的因素。所以，这里也写出一个软件设计方面性能考虑的Step By Step实施方案供自己参考(而硬件扩展则可以根据并发用户数的升高随时调整)：
* Step By Step
假设采用Java语言作为主要开发语言，将Tapestry + Spring + Hibernate + Mysql作为基本架构.
阶段I:
1. 以数据库为中心设计数据结构。最开始可以选择Hibernate作为Persistency，如果需要切换（包括编程语言切换），这种设计思路会最大地减少移植障碍。
2. 基本的性能考虑：是否使用OpenSessionView. 数据库设计一定的冗余度等。
阶段II:
1. 网页静态化。
2. 独立的图片服务器。
阶段III:
1. 中间缓存层组件的使用
阶段IV:
1. 数据库分表：在软件设计上，我认为这几乎是提升性能的最后一个方法。

我认为每个阶段软件设计方面的修正，都将导致部分先期代码的更改，如果我们预先考虑到网站的可能的设计方案更改，那么在软件代码实现的时候就会考虑到将来的修改，使将来的修改尽可能地少。
那么为什么我们不一开始就让系统构架适应巨大并发量的访问呢？对于像我这样没有大型网站开发经验的人，或者还不确定系统的访问量会达到多大的前提下，又想尽快让网站上线，而且又不至于担心将来的扩展问题，那么我的做法未尝不是一个折衷呢？

草稿2007-09-26
最后更新：2007-09-26 13:50

永久链接
http://koda.javaeye.com/blog/127276

Comments (6)

wl95421 2007-09-26 14:29
还是先想清楚你要做的网站对session的相关性有多大
能不能尽量将模块进行无关性分离
这样才是比较好的解决方案
如购物网站和论坛网站，对session的要求肯定不一致
架构也肯定不一样
bluepoint 2007-09-26 14:48
大体上这么玩可以,不过具体业务具体对待,这没有什么标准.
timerri 2007-09-26 15:31
影响性能的因素有哪些？其实只有下面几个方面：
1.持久性数据查找速度
2.持久性数据读写速度
3.逻辑复杂度
4.物理内存不够导致的虚拟存储频繁交换.
对应的解决方法：
1.建立最合适的索引，建立缓存
2.建立缓存，升级硬件
3.精简，优化逻辑
4.减少内存使用。
可以看出来，其实最需要做的，就是如何搞好缓存…….
为什么计算机界没有一个新职位，叫缓存工程师的？？
ahuaxuan 2007-09-26 18:08
说实话，我觉得楼主想了这么多还是没有抓住要领，任何一个软件，它的架构一定是在它的需求确定之后（指总体的业务需求，网站要达到的一个指标，包括业务特性），没有需
求就定架构是一种危险行为。楼主没有把自己网站性质，预期性能先确定就来谈用什么技术了，让人觉得有点空洞。
如果硬要给个方针，那么,应用集群+数据库集群就可以了
说到细节方面，第一个是OpenSessionView的问题，不会用hibernate的人老是说OpenSessionView有问题，OpenSessionView没有问题，说OpenSessionView有问题的人基本上都是用
hibernate用得有问题得人。
第二个是缓存，缓存可以加到很多层面，二级缓存和页面缓存得适用场景是不一样的，如果楼主在作架构的时候提到性能问题立刻就是中间层使用缓存，那么基本上可以说明楼主
对缓存的各种适用场景还不是非常了解，因为这些都是和业务相关（问题又回到了架构的确定需要在需求的确定之后）
第三在没有确定需求之前就一口咬定数据访问层是性能的瓶颈所在是站不住脚的。
那么在楼主现有的描述上，我也发表一下自己的看法：
1，因为不是非常确定以后的访问量，那么为了便于扩展，应用在开发之初应该可以考虑使应用非常容易作集群部署（是农场，还是状态复制，如果是农场如何保证状态，是cookie
，还是memcached，还是用blob放到db）
2，在集群的环境下，使用如何使用缓存，哪些页面需要使用页面缓存page cache，哪些业务对象需要使用hibernate的二级缓存等
我觉得楼主还是把网站得业务特性描述一下，这样才能更好得决定架构的设计。
Lucas Lee 2007-09-26 19:20
我认为目前还没有这种简单的框架能优雅的支持巨大访问量的。
为了高性能，总是有很多权衡的东西，需要额外的处理，想想EJB的机制吧，它就是为了高访问量设计的，但是不论访问量的大小一律都用它，则明显的使开发成本上升。
一般都会有这种多方面的权衡，ROR在开发速度上的优势，是在损失了不少性能的前提下得到的，尽管它可能在中小访问量之下区别不算明显，但性能绝不会是它的优势。
koda 2007-09-26 19:37
ror为什么会损失性能？能给出详细点的理由吗？
james2308大哥，你那数据库分表的功能实现了吗
我发现你的汽车网数据很庞大，汽车的参数很全，汽车的种类也多，如果不分表的话运行会很慢很慢，请教你这个功能实现了没有，能不能说下思路或者直接分享呢[em23]
如果你的汽车的参数建在一个表里，那这个表即使分表了也会很慢，把参数建在多个表里，那分表也增加了难度，请教你如何实现的

这个是早就实现了的，应当是去年吧，就在我发出那个帖子的时间，实际上就已经实现了，
不好意思，一般情况下，只要我发出的一点观点和思想，我一般是首先要实现她，哪怕是简单的测试，只要通过，我才发布出来供大家参考
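Splitting a database table takes two things. Take 风讯's own news as the example, that is, the News table. First, assume we can create any number of tables News1, News2, and so on. When the row count of the main News table reaches a threshold (one you set yourself), move its data into one of these auxiliary tables. In theory, then, as long as my disk is large enough, my site can run for 1000 years without page generation or runtime speed suffering.
Second, you need a management table to keep track of the auxiliary tables you add. Its main job is routing: for example, when you fetch the news list of a column, that column's news may be spread over several tables, so the program has to be adjusted to work out which tables to read.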
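The routing role of the management table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it keeps the registry in memory, assumes news IDs grow monotonically, and assumes every archive step moves all rows up to some ID out of News. The class and method names (`NewsTableRegistry`, `archive`, `table_for`, `tables_for`) are hypothetical and not taken from the original system.

```python
from bisect import bisect_left


class NewsTableRegistry:
    """In-memory stand-in for the management table (hypothetical design)."""

    def __init__(self):
        self._upper_ids = []  # highest news ID held by each archive table, ascending
        self._names = []      # archive table names, parallel to _upper_ids

    def archive(self, upper_id):
        """Record that rows with ID <= upper_id were moved out of News."""
        if self._upper_ids and upper_id <= self._upper_ids[-1]:
            raise ValueError("archive boundaries must increase")
        name = f"News{len(self._names) + 1}"
        self._upper_ids.append(upper_id)
        self._names.append(name)
        return name

    def table_for(self, news_id):
        """Return the table that currently holds the given news ID."""
        position = bisect_left(self._upper_ids, news_id)
        if position < len(self._names):
            return self._names[position]
        return "News"  # not archived yet, still in the main table

    def tables_for(self, news_ids):
        """Tables a listing must query, in first-seen order."""
        tables = []
        for news_id in news_ids:
            table = self.table_for(news_id)
            if table not in tables:
                tables.append(table)
        return tables
```

A column listing would call `tables_for` with the IDs it needs and issue one query per returned table, which is the extra program logic the post refers to.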
 大型Java Web系统服务器选型问题探讨
Author: anonymous. Source: the web.
更新时间：2007-09-20 15:22
关键词：Java Web服务器系统选型
阅读提示：如何能提高现有的基于Java的Web应用的服务能力呢？由于架构模式和部署调优一直是Java社区的热门话题，这个问题引发了很多热心网友的讨论，其中一些意见对其它大型Web项目也有很好的指导意义。
一位网友在JavaEye询问了一个大型Web系统的架构和部署选型问题，希望能提高现有的基于Java的Web应用的服务能力。由于架构模式和部署调优一直是Java社区的热门话题，这个问题引发了很多热心网友的讨论，其中一些意见对其它大型Web项目也有很好的指导意义。在讨论之初jackson1225这样描述了当前的应用的架构和部署方案：
目前系统架构如下:
web层采用struts+tomcat实现，整个系统采用20多台web服务器，其负载均衡采用硬件F5来实现；
中间层采用无状态会话Bean+DAO+helper类来实现，共3台weblogic服务器，部署有多个EJB，其负载均衡也采用F5来实现；
数据库层的操作是自己写的通用类实现的，两台ORACLE数据库服务器，分别存放用户信息和业务数据；一台SQL SERVER数据库，是第三方的业务数据信息；
web层调用EJB远程接口来访问中间件层。web层首先通过一个XML配置文件中配置的EJB接口信息来调用相应的EJB远程接口；
该系统中一次操作涉及到两个ORACLE库以及一个SQL SERVER库的访问和操作，即有三个数据库连接，在一个事务中完成。
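This kind of architecture is in fact used by many companies, because Struts and Tomcat are respectively the most popular Java Web MVC framework and Servlet container, and F5 load balancers are a common way to scale out horizontally (for example with sticky sessions). Because this system has transactions that span data sources, using the WebLogic Server EJB container together with database drivers that support two-phase commit guarantees transactional integrity across data sources (though container-managed distributed transactions are neither the only solution nor necessarily the best one).
But with the popularity of Rod Johnson's influential book 《J2EE Development without EJB》 and of the Spring framework it describes, the ideas of lightweight frameworks and lightweight containers have taken firm hold. Most participants therefore questioned the scenario jackson1225 described, arguing that the system misused technology and was simply a waste of money. Most felt that SLSBs (stateless session beans) had no place in this scenario, and that reaching local resources through a remote SLSB interface carries a large performance cost; this is one of the major EJB 2.x anti-patterns that Rod Johnson criticizes in without EJB.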
由于JavaEE是一个以模式见长的解决方案，模式和架构在JavaEE中占有很重要的地位，所以很多业内专家也都警惕“反模式（Anti-patterns）”的出现。对于上面所述的方案是否是反模式，jackson1225马上站出来申辩：
我们项目就是把EJB作为一个Facade，只是提供给WEB层调用的远程接口，而且只用了无状态会话Bean，所以性能上还可以的。
这个解释很快得到了一些网友的认可，但是大家很快意识到架构的好坏决定于是否能够满足用户的需求，davexin（可能是jackson1225的同事）描述了这个系统的用户和并发情况：
现在有用户4000万，马上要和另一个公司的会员系统合并，加起来一共有9000万用户。数据量单表中有一亿条以上的数据。这是基本的情况，其实我觉得现在的架构还是可以的，现在支持的并发大概5000并发用户左右，接下来会进行系统改造，目标支持1万个并发用户。
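Once the concurrency figures were published, some participants questioned them, saying that the number of concurrent users the Servlet container supported was too low and wondering whether it was poorly configured. davexin then added the server configuration for the project:
The front-end Tomcats all run on blades with 2 GB of memory and CPUs of about 2.0 GHz. Each machine supports only 250-400 concurrent users; beyond that the response time becomes very long, over 20 seconds, which makes it pointless, and that is how we reached this conclusion.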
一位ID是cauherk的网友提出了比较中肯的意见，他没有从Web容器单纯的并发支持能力上提出改进方案，而是提出了对于类似的应用的一些通用的改进提示，这里摘要一下：
数据库压力问题
可以按照业务、区域等等特性对数据库进行配置，可以考虑分库、使用rac、分区、分表等等策略，确保数据库能正常的进行交易。
事务问题
要在两个数据库中操作，那么必须考虑到分布式事务。你应该仔细的设计你的系统，来避免使用分布式事务，以避免分布式事务带来更多的数据库压力和其它问题。推荐你采用延迟提交的策略(并不保证数据的完整)，来避免分布式事务的问题，毕竟commit失败的几率很低。
web的优化
将静态、图片独立使用不同的服务器，对于常态的静态文件，采用E-TAG或者客户端缓存， google很多就是这样干的。对于热点的功能，考虑使用完全装载到内存，保证绝对的响应速度，对于需要频繁访问的热点数据，采用集中缓存(多个可以采用负载均衡)，减轻数据库的压力。
对于几乎除二进制文件，都应该在L4上配置基于硬件的压缩方案，减少网络的流量。提高用户使用的感知。
网络问题
可以考虑采用镜像、多路网络接入、基于DNS的负载均衡。如果有足够的投资，可以采用CDN(内容分发网)，减轻你的服务器压力。
cauherk的这个分析比较到位，其中ETags的方案是最近的一个热点，InfoQ的“使用ETags减少Web应用带宽和负载”里面对这种方案有很详细的介绍。一般以数据库为中心的Web应用的性能瓶颈都在数据库上，所以cauherk把数据库和事务问题放到了前两位来讨论。但是davexin解释在所讨论的这个项目中数据库并非瓶颈：
我们的压力不在数据库层，在web层和F5。当高峰的时候，F5也被点死了，就是每秒点击超过30万，web动态部分根本承受不了。根据我们程序记录，20台web最多承受5000个并发，如果再多，tomcat就不响应了。就像死了一样。
这个回复让接下来的讨论都集中于Web容器的性能优化，但是JavaEye站长robbin发表了自己的意见，将话题引回了这个项目的架构本身：
performance tuning最重要的就是定位瓶颈在哪里，以及瓶颈是怎么产生的。
我的推测是瓶颈还是出在EJB远程方法调用上！
tomcat上面的java应用要通过EJB远程方法调用，来访问weblogic上面的无状态SessionBean，这样的远程方法调用一般都在100ms~500ms级别，或者更多。而如果没有远程方法调用，即使大量采用spring的动态反射，一次完整的web请求处理在本地JVM内部的完成时间一般也不过20ms而已。一次web请求需要过长的执行时间，就会导致servlet线程被占用更多的时间，从而无法及时响应更多的后续请求。
如果这个推测是成立的话，那么我的建议就是既然你没有用到分布式事务，那么就干脆去掉EJB。weblogic也可以全部撤掉，业务层使用spring取代EJB，不要搞分布式架构，在每个tomcat实例上面部署一个完整的分层结构。
另外在高并发情况下，apache处理静态资源也很耗内存和CPU，可以考虑用轻量级web server如lighttpd/litespeed/nginx取代之。
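robbin's argument that long requests tie up servlet threads can be checked with simple arithmetic (Little's law: concurrency = throughput × time per request). The sketch below is only an illustration; the figure of 400 worker threads is an assumed value for the example, not a number from the thread, and the latencies are the 500 ms and 20 ms bounds robbin quotes.

```python
def max_throughput(worker_threads, seconds_per_request):
    """Upper bound on requests per second when each request holds one thread."""
    return worker_threads / seconds_per_request


def threads_needed(requests_per_second, seconds_per_request):
    """Threads that are busy on average at a given request rate (Little's law)."""
    return requests_per_second * seconds_per_request


# Assumed pool of 400 servlet threads (illustrative only).
with_remote_ejb = max_throughput(400, 0.5)    # 500 ms per request
in_process_only = max_throughput(400, 0.02)   # 20 ms per request
```

Under these assumptions the same thread pool serves about 25 times more requests per second once the remote call is removed, which is the effect robbin describes.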
robbin的推断得到了网友们的支持，davexin也认同robbin的看法，但是他解释说公司认为放弃SLSB存在风险，所以公司倾向于通过将Tomcat替换为Weblogic Server 10来提升系统的用户支撑能力。robbin则马上批评了这种做法：
坦白说我还从来没有听说过大规模互联网应用使用EJB的先例。为什么大规模互联网应用不能用EJB，其实就是因为EJB性能太差，用了EJB几乎必然出现性能障碍。
web容器的性能说到底无非就是Servlet线程调度能力而已，Tomcat不像WebLogic那样附加n多管理功能，跑得快很正常。对比测试一下WebLogic的数据库连接池和C3P0连接池的性能也会发现类似的结论，C3P0可要比WebLogic的连接池快好几倍了。这不是说WebLogic性能不好，只不过weblogic要实现更多的功能，所以在单一的速度方面就会牺牲很多东西。
以我的经验来判断，使用tomcat5.5以上的版本，配置apr支持，进行必要的tuning，使用BEA JRockit JVM的话，在你们目前的刀片上面，支撑500个并发完全是可以做到的。结合你们目前20个刀片的硬件，那么达到1万并发是没问题的。当然这样做的前提是必须扔掉EJB，并置web层和业务层在同一个JVM内部。
接下来robbin还针对davexin对话题中的应用分别在tomcat和weblogic上的测试数据进行了分析：
引用：
2。1台weblogic10 Express（相当于1台tomcat，用于发布jsp应用）加1台weblogic10（发布ejb应用），能支持1000个并发用户……
……
4。1台tomcat4.1加1台weblogic8，只能支持350个并发用户，tomcat就连结超时，说明此种结构瓶颈在tomcat。
这说明瓶颈还不在EJB远程调用上，但是问题已经逐渐清楚了。为什么weblogic充当web容器发起远程EJB调用的时候可以支撑1000个并发，但是tomcat只能到350个？只有两个可能的原因：
你的tomcat没有配置好，严重影响了性能表现
tomcat和weblogic之间的接口出了问题
接着springside项目发起者江南白衣也提出了一个总体的优化指导：
1.基础配置优化
tomcat 6？ tomcat参数调优?
JRockit JVM? JVM参数调优？
Apache+Squid 处理静态内容？
2.业务层优化
部分功能本地化，而不调remote session bean?
异步提交操作,JMS？
cache热点数据？
　　3.展示层优化
动态页面发布为静态页面？
Cache部分动态页面内容？
davexin在调整了Tomcat配置后应验了robbin对tomcat配置问题的质疑，davexin这样描述经过配置优化以后的测试结果：
经过测试，并发人数是可以达到像robbin所说的一样，能够在600人左右，如果压到并发700人，就有15%左右的失败，虽然在调整上面参数之后，并发人数上去了，但是在同样的时间内所完成的事务数量下降了10%左右，并且响应时间延迟了1秒左右，但从整体上来说，牺牲一点事务吞吐量和响应时间，并发人数能够提高500，觉得还是值得的。
至此这个话题有了一个比较好的结果。这个话题并非完全针对一个具体的项目才有意义，更重要的是在分析和讨论问题的过程中网友们解决问题的思路，尤其是cauherk、robbin、江南白衣等几位网友提出的意见可以让广大Java Web项目开发者了解到中、大型项目所需要考虑的架构和部署所需要考虑的关键问题，也消除了很多人对轻量Servlet容器与EJB容器性能的一些误解。
富有挑战性的问题,建立超大数据库的问题.

• haiwangstar
发表于：2007-03-28 09:32:11 楼主
现在要设计一个记录数非常大的数据库表,表的结构非常简单,但记录数会非常大,可能会有几十亿条,但表的字段只包括几个数字列,再有一个列用来保存图片,我现在准备采用BFILE类型来把图片路径保存在表中,另外表会根据某个列数据进行分区.

在设计这样的大表还应该注意哪些问题..欢迎大家赐教.

凡提出较好意见的,可以另行开贴再加分.

问题点数：100 回复次数：39

• playmud
发表于：2007-03-28 09:44:151楼得分:0
建议分表操作,都放入一个表内会严重影响速度.可以按照类型分,可以按照时间分.
总之坚决避免大表的出现.

• playmud
发表于：2007-03-28 09:46:162楼得分:0
如果你不听劝告,任性而为,那就把需要查询的项或者组合做索引,合理的给这个表分配物理空间.

• haiwangstar
发表于：2007-03-28 09:47:253楼得分:0
谢谢楼上的朋友,表分区是肯定的,我上面也写了.

• haiwangstar
发表于：2007-03-28 10:01:554楼得分:0
另外关于索引, 表会有3个数字列,一个是级别,共15级,另一个是X,再有一个是Y.这两列的数据范围都是从0到1000左右,查询的时候每次都会用到这三个列,大抵应该是这样select image from table1 WHERE level = ? and x = ? and y = ?
这个时候如何建索引会比较好, 将级别建为簇索引,X,Y建为复合索引? 还是三个列统一建为复合索引好..
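Since every query filters on all three columns with equality, one composite index over (level, x, y) is the natural first thing to try. The sketch below uses SQLite through Python's `sqlite3` module purely as a stand-in; the thread is about Oracle, whose planner and index types differ, and the table name `tiles`, the column `image_path` and the index name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiles (level INTEGER, x INTEGER, y INTEGER, image_path TEXT)"
)
# One composite index that covers all three predicates of the query.
conn.execute("CREATE INDEX idx_tiles_level_x_y ON tiles (level, x, y)")
conn.executemany(
    "INSERT INTO tiles VALUES (?, ?, ?, ?)",
    [
        (level, x, y, f"/tiles/{level}/{x}/{y}.png")
        for level in range(1, 4)
        for x in range(20)
        for y in range(20)
    ],
)

query = "SELECT image_path FROM tiles WHERE level = ? AND x = ? AND y = ?"

# Ask the planner how it would run the query; the last column is the detail text.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (2, 5, 7)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

row = conn.execute(query, (2, 5, 7)).fetchone()
```

Inspecting `plan_text` shows whether the planner searches through the composite index instead of scanning the table; on the real system the equivalent check would be done with the database's own explain facility.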

• haiwangstar
发表于：2007-03-28 10:47:105楼得分:0
另外还有IO均衡的问题,另一个设计者在看了上面的方案后,认为如果这样做的话,网络IO会集在中一台机器上,形成瓶颈. 在群集的情况下也会这样的吗?
他是想得到的图片的路径后,直接去那台服务器去读图.
另外,对于这样的超大数据库,磁盘是一个整体的阵列为各个服务器所共享,还是每一个服务器有自己的磁盘阵列呢? 这个问题可能有点弱..我过去也没做过这么大的数据库.还望大家能多多赐教.

• junqiang
发表于：2007-03-29 08:53:016楼得分:0
没这方面的经验，只是理论：
rac集群一般是共享磁盘组，一般来说磁盘组的io性能很好，带宽高（高级的是光纤连接）。
rac集群的网络io不会集中在一台服务器上，会自动负载平衡。

• skystar99047
发表于：2007-03-29 09:42:587楼得分:0
分区数目可以考虑增大。
每个区的表空间可以考虑放在不同的物理空间上。
分区索引是必须的。如果增删改频率较低，查询较多，可以考虑位图索引。
如果图片保存在表中，需要考虑将该字段的存储放在另一单独的物理空间上。

• i_love_pc
发表于：2007-03-29 11:11:228楼得分:0
几十亿条
========
的确有点多

• renjun24
发表于：2007-03-29 11:19:009楼得分:0
up

• huylghost
发表于：2007-03-29 11:30:2710楼得分:0
几十亿条, level ，x，y，图片，
google earth 是不是就是这么做的?

• haiwangstar
发表于：2007-03-29 12:03:3211楼得分:0
楼上的朋友,没错.就是做EARTH MAP

• lin_style
发表于：2007-03-29 12:08:3212楼得分:0
这样子的话。
扩充什么就不要考虑了。
设计个最适合查询的。

• whalefish2001
发表于：2007-03-29 12:40:2113楼得分:0
索引是必要的，不过，索引会占用很大空间。

• rainv
发表于：2007-03-29 13:29:3514楼得分:0
mark!
没接触过这种项目.^-^

• e_board
发表于：2007-03-29 13:34:2815楼得分:0
MySQL中有分区表的概念;MySQL会自动处理这些,不知道Oracle有没有类似的

• yanxinhao972
发表于：2007-03-29 13:37:4816楼得分:0
ORACLE 10g中可以创建分区表

• arust
发表于：2007-03-29 14:35:1817楼得分:0
这种数据库用PostgreSQL比较好

• thinkinnight
发表于：2007-03-29 15:43:1418楼得分:0
不错，学习

• conanfans
发表于：2007-03-29 16:28:5619楼得分:0
ORACLE在大数据量上不如DB2

• smallsophia
发表于：2007-03-29 16:42:0120楼得分:0
我只能说学习，继续关注!

• yxsalj
发表于：2007-03-29 17:25:1821楼得分:0
几十亿也不是很多,分区,加上合适的索引,问题也不大

• murphyding
发表于：2007-03-29 17:28:0522楼得分:0
初次登陆，向大家问好，哈哈

• prcgolf
发表于：2007-03-29 18:02:2423楼得分:0
up

• winesmoke
发表于：2007-03-29 18:23:2024楼得分:0
那位高手还是整个方案出来噻！
关注！

• MONOLINUX
发表于：2007-03-29 21:54:2125楼得分:0
该回复于2007-10-09 14:24:38被管理员删除

• dbpointer
发表于：2007-03-29 22:30:4826楼得分:0
楼上的牛啊，不过搜索面好像还不如百度

• uniume
发表于：2007-03-29 22:40:3227楼得分:0
该回复于2007-10-14 16:54:02被管理员或版主删除

• kkk_visual
发表于：2007-03-29 23:14:3728楼得分:0
帮顶一下。

• flyycyu
发表于：2007-03-30 00:20:4629楼得分:0
目前正做完这么一个系统,和你的类似,数据大概是是每个表9亿左右,但是有6,7个表都是这么大数据量,一个表的字段大概有50,60个,主要都是float.
不知道你的数据是怎么进入的,我们系统对数据的装入也是有要求的.
第一,个就是分表操作,在以前用infomix之类的系统的开发都采用,我在做我们系统的时候第一个方案设计出来的就是分表,当然带来的问题是维护问题,系统中上万个表,基本上图形控制台是打不开,所以最好的方式是分表+分区,oracle专家也是这么建议的!
第二,簇索引我没有怎么用过,但是如果你经常3个组合查,就建复合索引,或者在类别上建建BITMAP索引,对x,y建复合索引,另外,我觉得呢,如果按我们现在开发系统的经验来说,你应该把类别做为分表,这样在类别上就不存在建立索引问题,而对x,y,按某种方式在表内进行分区

• flyycyu
发表于：2007-03-30 00:27:0130楼得分:0
另外,如果可能的话,用BFILE,还不如系统放在文件系统上,而数据库只做连接,当然这个看你,用jdbc插入当然会比直接拷贝文件系统慢.不过可能管理上带来方便性.另外io均衡问题你不用考虑,在建立数据库时候,这个表,或者是数据文件直接写文件系统的话,你把表空间或者文件系统挂在裸设备上,而不要用本地文件系统,系统会自动给你处理均衡问题的!

• haiwangstar
发表于：2007-03-30 09:19:4031楼得分:0
flyycyu(fly) 这位朋友,非常感谢您的意见!!

我过去从来没有搞过这么大的数据库系统,所以我上面所讲述的一切都只是纸上谈兵.但我的思路,设想几乎同你都是一致的.关于是否用BFILE的问题,这个也是一个不大关乎全局的小问题.

我只所以特意问会不会有IO不均衡问题,是因为我的同事认为我们这样的方案是不行的,IO会不均衡(他认为还有很多问题).而我认为是绝对不会出现这种情况的,因为ORACLE在设计中他不可能不考虑这样显而易见的问题,我也查了资料,在ORACLE集群中,网络负荷也是存在ORACLE负荷均衡的.

看来这样的方案才是经实践检验过的可行的.

• haiwangstar
发表于：2007-03-30 09:24:2932楼得分:0
另外还有一个问题,就是flyycyu(fly) 这位朋友你们是使用SAN存储设备的吗,光纤网络? 这套系统是不是非常昂贵? 即使不是,我想也一定是共享统一存储器吧.

• haiwangstar
发表于：2007-03-30 09:28:1333楼得分:0
我们系统对数据的装入也是有要求的.

朋友这句话是怎么讲.能说一下吗

• asker100
发表于：2007-03-30 09:54:4034楼得分:0
这种情况就不要用数据库了，直接的文件存储+优化的索引，要看穿数据库这种东西

• gameboy999
发表于：2007-03-30 10:41:1835楼得分:0
同意楼上，地图数据自己的特点，不一定需要通用的数据库

• wuluhua2003
发表于：2007-03-30 15:53:5636楼得分:0
学习了

• flyycyu
发表于：2007-03-31 09:55:4937楼得分:0
对，san,光纤网络，客户有钱，而且事情又重要。
装入数据的要求是我们数据是批量装入的，一年集中在1，2个时间点，平时就是查询，而装入数据时候，系统上会有几十个并发解析10万-30万之间的数据包。
至于BFILE问题，我只是建议，也可能是自己学艺不精，因为我的数据文件一般在10m以上，所以在上亿后，存库老是出些莫名其妙的问题，所以最后干脆就存外部文件，至少在过程上，你少了一道由web服务器把数据包发包到数据库服务器的过程。
还有io均衡这些问题，我觉得一是自己调很难，加大难度和时间，也未必出来后最优，还不如利用硬件设备，还有像10g里面的ASM。我们的数据量像上面所说，在ibm 570,8cpu,32g内存下，并发解析数据包到40，50都没问题，时间在50秒以内都能完成10万数据装入，而查询这些就更不用说了，压力测试上800都没问题，当然这是非集群环境，当然你的具体业务还是由你分析，这些只是建议

• flyycyu
发表于：2007-03-31 10:10:0438楼得分:0
再说下io均衡问题，我只是说下实际部署中碰到的问题，因为这个只有具体问题具体分析，一个是磁盘的io,因为在设计TABLESPACE时候，你已经有意识的根据业务划分到不同的磁盘块上面了，我当时在测试上碰到的问题就是对回滚段的资源占用也很大，后来回滚的表空间分到5个磁盘上去了，性能马上就好了上来。
另外一个是网络io,我看你提的是网络io,而不是磁盘io，不清楚你的这个网络io指得是客户请求到web服务器，还是web服务器到数据库服务器！因为我们业务的关系，oracle没有架集群，当然也是还用不到那功能，所以没有参考意见，至于web请求这块，我觉得解决这个方案应该很多吧？比如我们，最后是按照业务模块来划分的多台web服务器.当然我们系统特定和你的可能有一定程度类似，就是大部分时间主要是查询。

• huylghost
发表于：2007-04-05 10:13:2339楼得分:0
进来学习一下

• hrui99
发表于：2007-04-06 11:07:4040楼得分:0
skystar99047(天星
分区数目可以考虑增大。
每个区的表空间可以考虑放在不同的物理空间上。
分区索引是必须的。如果增删改频率较低，查询较多，可以考虑位图索引。
如果图片保存在表中，需要考虑将该字段的存储放在另一单独的物理空间上。

同意上面描述。补充SELECT 描述考虑加入HINTS 描述
--并行处理
如：select /*+ parallel(tab,处理器个数） */

高并发高流量网站架构

Architecture of Website with
High Page view and High concurrency

院系：信息科学学院
专业：计算机科学与技术
学号：03281077
姓名：唐福林
指导教师：朱小明

北京师范大学
2007年3月
Beijing Normal University Bachelor's Thesis (Design): Declaration of Originality

本人郑重声明：所呈交的学士学位论文（设计），是本人在导师的指导下，独立进行研究工作所取得的成果。除文中已经注明引用的内容外，本论文不含任何其他个人或集体已经发表或撰写过的作品成果。对本文的研究做出重要贡献的个人和集体，均已在文中以明确方式标明。本人完全意识到本声明的法律结果由本人承担。
本人签名：　　　　　　　　年月日

北京师范大学学士学位论文（设计）使用授权的说明

本人完全了解北京师范大学有关收集、保留和使用学士学位论文（设计）的规定，即：本科生在校攻读学位期间论文（设计）工作的知识产权单位属北京师范大学。学校有权保留并向国家有关部门或机构送交论文的复印件和电子版，允许学位论文（设计）被查阅和借阅；学校可以公布学位论文的全部或部分内容，可以采用影印、缩印或扫描等复制手段保存、汇编学位论文。保密的学位论文在解密后遵守此规定。
本论文（是、否）保密论文。
保密论文在年解密后适用本授权书。
本人签名：年月日
导师签名：年月日
摘要
Web2.0的兴起，掀起了互联网新一轮的网络创业大潮。以用户为导向的新网站建设概念，细分了网站功能和用户群，不仅成功的造就了一大批新生的网站，也极大的方便了上网的人们。但Web2.0以用户为导向的理念，使得新生的网站有了新的特点——高并发，高流量，数据量大，逻辑复杂等，对网站建设也提出了新的要求。
本文围绕高并发高流量的网站架构设计问题，主要研究讨论了以下内容：
首先在整个网络的高度讨论了使用镜像网站，CDN内容分发网络等技术对负载均衡带来的便利及各自的优缺点比较。然后在局域网层次对第四层交换技术，包括硬件解决方案F5和软件解决方案LVS，进行了简单的讨论。接下来在单服务器层次，本文着重讨论了单台服务器的Socket优化，硬盘级缓存技术，内存级缓存技术，CPU与IO平衡技术（即以运算为主的程序与以数据读写为主的程序搭配部署），读写分离技术等。在应用层，本文介绍了一些大型网站常用的技术，以及选择使用该技术的理由。最后，在架构的高度讨论了网站扩容，容错等问题。
本文以理论与实践相结合的形式，结合作者实际工作中得到的经验，具有较广泛的适用性。
Keywords: high concurrency, high traffic, website architecture, website scaling, fault tolerance
Abstract

The rise of Web 2.0 has set off a new wave of Internet start-ups. The user-oriented approach to building websites, which subdivides site functions and user groups, has not only produced a large number of successful new sites but also made life much easier for Internet users. The same user-oriented idea gives these new sites new characteristics, namely high concurrency, high traffic, large data volumes and complex logic, and it places new demands on how websites are built. This thesis addresses the architecture of high-concurrency, high-traffic websites and discusses the following. First, at the level of the whole Internet, it compares the benefits that mirror sites and CDNs (content delivery networks) bring to load balancing, and their respective strengths and weaknesses. Then, at the LAN level, it briefly discusses layer-4 switching, including the hardware solution F5 and the software solution LVS. At the single-server level, it concentrates on socket optimization, disk-level caching, memory-level caching, balancing CPU and I/O (deploying compute-bound programs alongside I/O-bound ones), and separating reads from writes. At the application layer, it introduces technologies commonly used by large websites and the reasons for choosing them. Finally, at the architectural level, it discusses scaling and fault tolerance. The thesis combines theory with practice and draws on the author's working experience, so it should be widely applicable.

KEY WORDS: high page view, high concurrency, website architecture, scaling, fault tolerance

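Contents
1 Introduction
1.1 The Development of the Internet
1.2 New Trends in Website Construction
1.3 About Sina Podcast
2 Network Layer Architecture
2.1 Mirror Sites
2.2 CDN (Content Delivery Network)
2.3 Distributed Design at the Application Layer
2.4 Summary of the Network Layer Architecture
3 Switching Layer Architecture
3.1 Introduction to Layer-4 Switching
3.2 Hardware Implementations
3.3 Software Implementations
4 Server Optimization
4.1 Overall Server Performance
4.2 Socket Optimization
4.3 Disk-Level Caching
4.4 Memory-Level Caching
4.5 Balancing CPU and I/O
4.6 Separating Reads and Writes
5 Application Layer Optimization
5.1 Choosing the Web Server Software
5.2 Choosing the Database
5.3 Choosing the Server-Side Script Interpreter
5.4 Configurability
5.5 Encapsulation and the Middle Layer
6 Scaling and Fault Tolerance
6.1 Scaling
6.2 Fault Tolerance
7 Summary and Outlook
7.1 Summary
7.2 Outlook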

 高并发高流量网站架构

1 引言
1.1 互联网的发展
最近十年间，互联网已经从一个单纯的用于科研的，用来传递静态文档的美国内部网络，发展成了一个应用于各行各业的，传送着海量多媒体及动态信息的全球网络。从规模上看，互联网在主机数、带宽、上网人数等方面几乎一直保持着指数增长的趋势，2006年7月，互联网上共有主机439，286，364台，WWW 站点数量达到 96，854，877个［1］。全球上网人口在2004 年达到 7 亿 2900万［2］，中国的上网人数在 2006 年 12 月达到了约 1亿3700 万［3］。另一方面，互联网所传递的内容也发生了巨大的变化，早期互联网以静态、文本的公共信息为主要内容，而目前的互联网则传递着大量的动态、多媒体及人性化的信息，人们不仅可以通过互联网阅读到动态生成的信息，而且可以通过它使用电子商务、即时通信、网上游戏等交互性很强的服务。因此，可以说互联网已经不再仅仅是一个信息共享网络，而已经成为了一个无所不在的交互式服务的平台。
1.2 互联网网站建设的新趋势
互联网不断扩大的规模，日益增长的用户群，以及web2.0［4］的兴起，对互联网网站建设提出了新的要求:
• 高性能和高可扩展性。2000 年 5 月，访问量排名世界第一（统计数据来源［5］）的Yahoo ［6］声称其日页浏览数达到 6 亿 2500 万，即每秒约 30，000 次HTTP 请求(按每个页面浏览平均产生 4 次请求计算) 。这样大规模的访问量对服务的性能提出了非常高的要求。更为重要的是，互联网受众的广泛性，使得成功的互联网服务的访问量增长潜力和速度非常大，因此服务系统必须具有非常好的可扩展性，以应付将来可能的服务增长。
• 支持高度并发的访问。高度并发的访问对服务的存储与并发能力提出了很高的要求，当前主流的超标量和超流水线处理器能处理的并发请求数是有限的，因为随着并发数的上升，进程调度的开销会很快上升。互联网广域网的本质决定了其访问的延迟时间较长，因此一个请求完成时间也较长，按从请求产生到页面下载完成 3 秒计算， Yahoo 在 2000 年 5 月时平均有 90，000 个并发请求。而且对于较复杂的服务，服务器往往要维护用户会话的信息，例如一个互联网网站如果每天有 100 万次用户会话，每次 20分钟的话，那平均同时就会有约 14000 个并发会话。
• 高可用性。互联网服务的全球性决定了其每天 24 小时都会有用户访问，因此任何服务的停止都会对用户造成影响。而对于电子商务等应用，暂时的服务中止则意味着客户的永久失去及大量的经济损失，例如ebay.com［7］1999 年 6 月的一次 22小时的网站不可访问，对此网站的 380万用户的忠诚度造成巨大影响，使得 Ebay 公司不得不支付了近500万美元用于补偿客户的损失，而该公司的市值同期下降了 40 亿美元［8］。因此，关键互联网应用的可用性要求非常高。
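The figures quoted in the list above follow from simple arithmetic, reproduced here as a small Python sketch. The function names are chosen for this example only; the inputs are the numbers given in the text (625 million page views per day, 4 requests per page, 3 seconds per request, 1 million sessions of 20 minutes each).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(page_views_per_day, requests_per_page):
    """Average HTTP request rate implied by a daily page-view count."""
    return page_views_per_day * requests_per_page / SECONDS_PER_DAY


def concurrent_requests(rate_per_second, seconds_per_request):
    """Requests in flight on average: arrival rate times time per request."""
    return rate_per_second * seconds_per_request


def concurrent_sessions(sessions_per_day, minutes_per_session):
    """Sessions open on average over a day."""
    return sessions_per_day * minutes_per_session * 60 / SECONDS_PER_DAY


# About 28,900 requests per second, which the text rounds to 30,000.
yahoo_rate = requests_per_second(625_000_000, 4)
# With the rounded rate and 3 seconds per request: 90,000 requests in flight.
yahoo_concurrency = concurrent_requests(30_000, 3)
# 1 million sessions of 20 minutes: about 13,900, rounded to 14,000 in the text.
average_sessions = concurrent_sessions(1_000_000, 20)
```

These are daily averages; peak-hour load will be higher, so the real capacity target needs headroom on top of these numbers.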
1.3 新浪播客的简介
以YouTube［9］为代表的微视频分享网站近来方兴未艾，仅2006年一年，国内就出现近百家仿YouTube的微视频分享网站［10］，试图复制YouTube的成功模式。此类网站可以说是Web2.0概念下的代表网站，具有Web2.0网站所有典型特征：高并发，高流量，数据量大，逻辑复杂，用户分散等等。新浪［11］作为国内最大的门户网站，在2005年成功运作新浪博客的基础上，于2006年底推出了新浪播客服务。新浪播客作为国内门户网站中第一个微视频分享服务的网站，依靠新浪网站及新浪博客的巨大人气资源，在推出后不到半年的时间内，取得了巨大的成功：同类网站中上传视频数量第一、流量增长最快、用户数最多［12］，所有这些成绩的取得的背后，是巨大的硬件投入，良好的架构支撑和灵活的应用层软件设计。
本文是作者在新浪爱问搜索部门实习及参与新浪播客开发的经验和教训的回顾，是作者对一般高并发高流量网站架构的总结和抽象。
2 网络层架构
2.1 镜像网站技术
镜像网站是指将一个完全相同的站点放到几个服务器上，分别有自己的URL，这些服务器上的网站互相称为镜像网站［13］。镜像网站和主站并没有太大差别，或者可以视为主站的拷贝。镜像网站的好处是：如果不能对主站作正常访问（如服务器故障，网络故障或者网速太慢等），仍能通过镜像服务器获得服务。不便之处是：更新网站内容的时候，需要同时更新多个服务器；需要用户记忆超过一个网址，或需要用户选择访问多个镜像网站中的一个，而用户选择的，不一定是最优的。在用户选择的过程中，缺乏必要的可控性。
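In the early days of the Internet, websites had little content, most of it static and rarely updated. But because servers had little computing power, bandwidth was narrow and connections were slow, popular sites still came under heavy load. Mirroring was an effective solution under those conditions and was widely adopted. As the Internet developed, more and more sites generated content dynamically with server-side scripts; keeping mirrors in sync became harder and harder and the demand for control grew, so mirroring, unable to meet the needs of such sites, gradually faded from view. Some large software download sites, however, still fit the conditions for mirroring: what they serve is static and seldom updated, while the demands on bandwidth and speed are high. Examples abroad are SourceForge ( http://www.SourceForge.net, a well-known open-source hosting site) and Fedora ( http://fedoraproject.org, the Linux distribution sponsored by RedHat); examples in China are 华军软件园 ( http://www.onlinedown.net) and 天空软件站 ( http://www.skycn.com). These sites still use the technique (Figure 1).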

图1 上图：天空软件站首页的镜像选择页面
下图：SourceForge下载时的镜像选择页面
在网站建设的过程中，可以根据实际情况，将静态内容作一些镜像，以加快访问速度，提升用户体验。
2.2 CDN内容分发网络
CDN的全称是Content Delivery Network，即内容分发网络。其目的是通过在现有的互联网中增加一层新的网络架构，将网站的内容发布到最接近用户的网络“边缘”，使用户可以就近取得所需的内容，分散服务器的压力，解决互联网拥挤的状况，提高用户访问网站的响应速度。从而解决由于网络带宽小、用户访问量大、网点分布不均等原因所造成的用户访问网站响应速度慢的问题［14］。
CDN与镜像网站技术的不同之处在于网站代替用户去选择最优的内容服务器，增强了可控制性。CDN其实是夹在网页浏览者和被访问的服务器中间的一层镜像或者说缓存，浏览者访问时点击的还是服务器原来的URL地址，但是看到的内容其实是对浏览者来说最优的一台镜像服务器上的页面缓存内容。这是通过调整服务器的域名解析来实现的。使用CDN技术的域名解析服务器需要维护一个镜像服务器列表和一份来访IP到镜像服务器的对应表。当一个用户的请求到来的时候，根据用户的IP，查询对应表，得到最优的镜像服务器的IP地址，返回给用户。这里的最优，需要综合考虑服务器的处理能力，带宽，离访问者的距离远近等因素。当某个地方的镜像网站流量过大，带宽消耗过快，或者出现服务器，网络等故障的时候，可以很方便的设置将用户的访问转到另外一个地方（图2）。这样就增强了可控制性。
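The lookup described above, a mirror list plus a table mapping visitor IPs to mirrors, can be sketched as follows. This is a minimal illustration rather than how any real CDN or DNS server is implemented: the class `MirrorResolver`, its methods, and all addresses (taken from documentation and private ranges) are assumptions, and "best" is reduced to the most specific matching network.

```python
import ipaddress


class MirrorResolver:
    """Maps a visitor's IP address to a mirror server IP (hypothetical sketch)."""

    def __init__(self, default_mirror):
        self.default_mirror = default_mirror
        self.routes = []       # (client network, mirror IP) pairs
        self.disabled = set()  # mirrors temporarily taken out of service

    def add_route(self, cidr, mirror_ip):
        self.routes.append((ipaddress.ip_network(cidr), mirror_ip))

    def disable(self, mirror_ip):
        """Divert traffic away from an overloaded or failed mirror."""
        self.disabled.add(mirror_ip)

    def resolve(self, client_ip):
        address = ipaddress.ip_address(client_ip)
        candidates = [
            (network.prefixlen, mirror)
            for network, mirror in self.routes
            if address in network and mirror not in self.disabled
        ]
        if not candidates:
            return self.default_mirror
        # Prefer the most specific (longest-prefix) matching network.
        return max(candidates)[1]


resolver = MirrorResolver(default_mirror="203.0.113.10")
resolver.add_route("10.1.0.0/16", "192.0.2.10")
resolver.add_route("10.0.0.0/8", "198.51.100.10")
```

A production resolver would also weigh server load and bandwidth, as the text notes, and its answers would still be subject to DNS caching downstream.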

图2 CDN原理示意图
CDN网络加速技术也有它的局限性。首先，因为内容更新的时候，需要同步更新多台镜像服务器，所以它也只适用于内容更新不太频繁，或者对实时性要求不是很高的网站；其次，DNS解析有缓存，当某一个镜像网站的访问需要转移时，主DNS服务器更改了IP解析结果，但各地的DNS服务器缓存更新会滞后一段时间，这段时间内用户的访问仍然会指向该服务器，可控制性依然有不足。
目前，国内访问量较高的大型网站如新浪、网易等的资讯频道，均使用CDN网络加速技术（图3），虽然网站的访问量巨大，但无论在什么地方访问，速度都会很快。但论坛，邮箱等更新频繁，实时性要求高的频道，则不适合使用这种技术。

图3 新浪网使用ChinaCache CDN服务。
ChinaCache的服务节点全球超过130个，
其中中国节点超过80个，
覆盖全国主要6大网络的主要省份［15］。
2.3 应用层分布式设计
新浪播客为了获得CDN网络加速的优点，又必须避免CDN的不足，在应用层软件设计上，采取了一个替代的办法。新浪播客提供了一个供播放器查询视频文件地址的接口。当用户打开视频播放页面的时候，播放器首先连接查询接口，通过接口获得视频文件所在的最优的镜像服务器地址，然后再到该服务器去下载视频文件。这样，用一次额外的查询获得了全部的控制性，而这次查询的通讯流量非常小，几乎可以忽略不计。CDN中由域名解析获得的灵活性也保留了下来：由接口程序维护镜像网站列表及来访IP到镜像网站的对应表即可。镜像网站中不需要镜像所有的内容，而是只镜像更新速度较慢的视频文件。这是完全可以承受的。
2.4 网络层架构小结
从整个互联网络的高度来看网站架构，努力的方向是明确的：让用户就近取得内容，但又要在速度和可控制性之间作一个平衡。对于更新比较频繁内容，由于难以保持镜像网站之间的同步，则需要使用其他的辅助技术。
3 交换层架构
3.1 第四层交换简介
按照OSI［16］七层模型，第四层是传输层。传输层负责端到端通信，在IP协议栈中是TCP和UDP所在的协议层。TCP和UDP数据包中包含端口号（port number），它们可以唯一区分每个数据包所属的协议和应用程序。接收端计算机的操作系统根据端口号确定所收到的IP包类型，并把它交给合适的高层程序。IP地址和端口号的组合通常称作“插口（Socket）”。
第四层交换的一个简单定义是：它是一种传输功能，它决定传输不仅仅依据MAC地址(第二层网桥)或源/目标IP地址(第三层路由)，而且依据IP地址与TCP/UDP (第四层) 应用端口号的组合（Socket）［17］。第四层交换功能就像是虚拟IP，指向实际的服务器。它传输的数据支持多种协议，有HTTP、FTP、NFS、Telnet等。
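Take HTTP as an example. A layer-4 switch sets up one virtual IP (Virtual IP, VIP) for each server group, and each group serves one or several domain names. The domain name server (DNS) stores the group's VIP rather than the real address of any single server.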
当用户请求页面时，一个带有目标服务器组的VIP连接请求发送给第四层交换机。第四层交换机使用某种选择策略，在组中选取最优的服务器，将数据包中的目标VIP地址用实际服务器的IP地址取代，并将连接请求传给该服务器。第四层交换一般都实现了会话保持功能，即同一会话的所有的包由第四层交换机进行映射后，在用户和同一服务器间进行传输［18］。
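The forwarding behaviour just described, VIP rewriting plus session persistence, can be modelled in a few lines. This is a toy model under stated assumptions, not how F5 or LVS are implemented: the class name `Layer4Switch` is hypothetical, round-robin stands in for "some selection policy", and a session is identified only by client IP and port.

```python
class Layer4Switch:
    """Toy model of VIP-to-real-server mapping with session persistence."""

    def __init__(self, vip, real_servers):
        self.vip = vip
        self.real_servers = list(real_servers)
        self.sessions = {}  # (client IP, client port) -> chosen real server
        self._next = 0      # round-robin cursor

    def forward(self, client_ip, client_port, dst_ip, dst_port):
        """Return the (real server IP, port) a packet should be rewritten to."""
        if dst_ip != self.vip:
            raise ValueError("packet is not addressed to this VIP")
        key = (client_ip, client_port)
        if key not in self.sessions:
            # New session: pick a server, then remember the choice.
            server = self.real_servers[self._next % len(self.real_servers)]
            self._next += 1
            self.sessions[key] = server
        return (self.sessions[key], dst_port)


switch = Layer4Switch("192.0.2.1", ["10.0.0.11", "10.0.0.12"])
first = switch.forward("198.51.100.7", 40000, "192.0.2.1", 80)
second = switch.forward("198.51.100.8", 40001, "192.0.2.1", 80)
repeat = switch.forward("198.51.100.7", 40000, "192.0.2.1", 80)
```

Because the session table is consulted before the selection policy, later packets of the same session always reach the server that received the first one.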
第四层交换按实现分类，分为硬件实现和软件实现。
3.2 硬件实现
第四层交换的硬件实现一般都由专业的硬件厂商作为商业解决方案提供。常见的有Alteon［19］，F5［20］等。这些产品非常昂贵，但是能够提供非常优秀的性能和很灵活的管理能力。Yahoo中国当初接近2000台服务器使用了三四台Alteon就搞定了［21］。鉴于条件关系，这里不展开讨论。
3.3 软件实现
第四层交换也可以通过软件实现，不过性能比专业硬件稍差，但是满足一定量的压力还是可以达到的，而且软件实现配置起来更灵活。软件四层交换常用的有Linux上的LVS（Linux Virtual Server），它提供了基于心跳（heart beat）的实时灾难应对解决方案，提高了系统的鲁棒性，同时提供了灵活的VIP配置和管理功能，可以同时满足多种应用需求［22］。
4 服务器优化
4.1 服务器整体性能考虑
对于价值昂贵的服务器来说，怎样配置才能发挥它的最大功效，又不至于影响正常的服务，这是在设计网站架构的时候必须要考虑的。常见的影响服务器的处理速度的因素有：网络连接，硬盘读写，内存空间，CPU速度。如果服务器的某一个部件满负荷运转仍然低于需要，而其他部件仍有能力剩余，我们将之称为性能瓶颈。服务器想要发挥最大的功效，关键的是消除瓶颈，让所有的部件都被充分的利用起来。
4.2 Socket优化
以标准的 GNU/Linux 为例。GNU/Linux 发行版试图对各种部署情况都进行优化，这意味着对具体服务器的执行环境来说，标准的发行版可能并不是最优化的［23］。GNU/Linux 提供了很多可调节的内核参数，可以使用这些参数为服务器进行动态配置，包括影响 Socket 性能的一些重要的选项。这些选项包含在 /proc 虚拟文件系统中。这个文件系统中的每个文件都表示一个或多个参数，它们可以通过 cat 工具进行读取，或使用 echo 命令进行修改。这里仅列出一些影响TCP/IP 栈性能的可调节内核参数［24］：
• /proc/sys/net/ipv4/tcp_window_scaling “1”（1表示启用该选项，0表示关闭，下同）启用 RFC［25］ 1323［26］定义的 window scaling；要支持超过 64KB 的窗口，必须启用该值。
• /proc/sys/net/ipv4/tcp_sack “1”启用有选择的应答（Selective Acknowledgment），通过有选择地应答乱序接收到的报文来提高性能（这样可以让发送者只发送丢失的报文段）；对于广域网通信来说，这个选项应该启用，但是这也会增加对 CPU 的占用。
• /proc/sys/net/ipv4/tcp_timestamps “1” 以一种比重发超时更精确的方法（参阅 RFC 1323）来启用对 RTT 的计算；为了实现更好的性能应该启用这个选项。
• /proc/sys/net/ipv4/tcp_mem “24576 32768 49152” 确定 TCP 栈应该如何反映内存使用；每个值的单位都是内存页（通常是 4KB）。第一个值是内存使用的下限。第二个值是内存压力模式开始对缓冲区使用应用压力的上限。第三个值是内存上限。超过这个上限时可以将报文丢弃，从而减少对内存的使用。
• /proc/sys/net/ipv4/tcp_wmem “4096 16384 131072” 为自动调优定义每个 socket 使用的内存。第一个值是为 socket 的发送缓冲区分配的最少字节数。第二个值是默认值（该值会被 wmem_default 覆盖），缓冲区在系统负载不重的情况下可以增长到这个值。第三个值是发送缓冲区空间的最大字节数（该值会被 wmem_max 覆盖）。
• /proc/sys/net/ipv4/tcp_westwood “1” 启用发送者端的拥塞控制算法，它可以维护对吞吐量的评估，并试图对带宽的整体利用情况进行优化；对于 WAN 通信来说应该启用这个选项。
与其他调优努力一样，最好的方法实际上就是不断进行实验。具体应用程序的行为、处理器的速度以及可用内存的多少都会影响到这些参数对性能作用的效果。在某些情况中，一些认为有益的操作可能恰恰是有害的（反之亦然）。因此，需要逐一试验各个选项，然后检查每个选项的结果，最后得出最适合具体机器的一套参数。
如果重启了 GNU/Linux 系统，设置的内核参数都会恢复成默认值。为了将所设置的值作为这些参数的默认值，可以使用 /etc/rc.local 文件，在系统每次启动时自动将这些参数配置成所需要的值。
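As a sketch of the workflow above, the following Python helper turns a set of chosen values into the `echo` lines one would append to /etc/rc.local, and reads a parameter back from /proc for checking. The values shown are the ones listed in the text, not recommendations; the helper names `rc_local_lines` and `read_setting` are assumptions for this example, and writing to /proc requires root privileges on a Linux host.

```python
# Values copied from the list above; tune them by experiment for each machine.
TCP_SETTINGS = {
    "/proc/sys/net/ipv4/tcp_window_scaling": "1",
    "/proc/sys/net/ipv4/tcp_sack": "1",
    "/proc/sys/net/ipv4/tcp_timestamps": "1",
    "/proc/sys/net/ipv4/tcp_wmem": "4096 16384 131072",
}


def rc_local_lines(settings):
    """Render shell lines that re-apply the settings at every boot."""
    return [f'echo "{value}" > {path}' for path, value in settings.items()]


def read_setting(path):
    """Read the current value of one kernel parameter (Linux only)."""
    with open(path) as handle:
        return handle.read().split()


if __name__ == "__main__":
    print("\n".join(rc_local_lines(TCP_SETTINGS)))
```

Generating the lines from one dictionary keeps the tested values and the boot-time configuration from drifting apart.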
在检测每个选项的更改带来的效果的时候，GNU/Linux上有一些非常强大的工具可以使用：
• ping 这是用于检查主机的可用性的最常用的工具，也可以用于计算网络带宽延时。
• traceroute 打印连接到特定网络主机所经过的一系列路由器和网关的路径（路由），从而确定每个 hop 之间的延时。
• netstat 确定有关网络子系统、协议和连接的各种统计信息。
• tcpdump 显示一个或多个连接的协议级的报文跟踪信息，其中包括时间信息，可以使用这些信息来研究不同协议的报文时间。
• Ethereal presents tcpdump (packet trace) information in an easy-to-use graphical interface and supports packet filtering.
• iperf 测量 TCP 和 UDP 的网络性能；测量最大带宽，并汇报延时和数据报的丢失情况。
4.3 硬盘级缓存
硬盘级别的缓存是指将需要动态生成的内容暂时缓存在硬盘上，在一个可接受的延迟时间范围内，同样的请求不再动态生成，以达到节约系统资源，提高网站承受能力的目的。Linux环境下硬盘级缓存一般使用Squid［27］。
Squid是一个高性能的代理缓存服务器。和一般的代理缓存软件不同，Squid用一个单独的、非模块化的、I/O驱动的进程来处理所有的客户端请求。它接受来自客户端对目标对象的请求并适当地处理这些请求。比如说，用户通过浏览器想下载（即浏览）一个web页面，浏览器请求Squid为它取得这个页面。Squid随之连接到页面所在的原始服务器并向服务器发出取得该页面的请求。取得页面后，Squid再将页面返回给用户端浏览器，并且同时在Squid本地缓存目录里保存一份副本。当下一次有用户需要同一页面时，Squid可以简单地从缓存中读取它的副本，直接返回给用户，而不用再次请求原始服务器。当前的Squid可以处理HTTP， FTP， GOPHER， SSL和WAIS等协议。
Squid默认通过检测HTTP协议头的Expires和 Cache-Control字段来决定缓存的时间。在实际应用中，可以显式的在服务器端脚本中输出HTTP头，也可以通过配置apache的mod_expires模块，让apache自动的给每一个网页加上过期时间。对于静态内容，如图片，视频文件，供下载的软件等，还可以针对文件类型（扩展名），用 Squid 的 refresh_pattern 来指定缓存时间。
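Emitting the headers explicitly from a server-side script, as mentioned above, might look like the following sketch. The function name `cache_headers` and the five-minute lifetime are assumptions for the example; the header names themselves are standard HTTP.

```python
from email.utils import formatdate


def cache_headers(max_age_seconds, now):
    """Build the two HTTP headers a cache such as Squid inspects.

    `now` is a Unix timestamp; pass time.time() in real use.
    """
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Expires must be an HTTP date in GMT.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }


headers = cache_headers(300, now=0)
```

A dynamic page that can tolerate being five minutes stale would send these headers with every response, letting the cache answer repeat requests without touching the application.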
Squid 运行的时候，默认会在硬盘上建两层hash目录，用来存储缓存的Object。它还会在内存中建立一个Hash Table，用来记录硬盘中Object分布的情况。如果Squid配置成为一个Squid集群中的一个的话，它还会建立一个 Digest Table(摘要表)，用来存储其它 Squid 上的Object摘要。当用户端想要的资料本地硬盘上没有时，可以很快的知道应该去集群中的哪一台机器获得。在硬盘空间快要达到配置限额的时候，可以配置使用某种策略（默认使用LRU：Least Recently Used-最近最少用）删除一些Object，从而腾出空间［28］［29］。
集群中的Squid Server 之间可以有两种关系：第一种关系是：Child 和 Parent。当 Child Squid Server 没有资料时，会直接向 Parent Squid Server 要资料，然后一直等，直到 Parent 给它资料为止。第二种关系是：Sibling 和 Sibling。当 Squid Server 没有资料时，会先向 Sibling 的 Squid Server 要资料，如果 Sibling 没资料，就跳过它向 Parent 要或直接上原始网站去拿。
默认配置的Squid，没有经过任何优化的时候，一般可以达到 50% 的命中率［30］（图4）。如果需要，还可以通过参数优化，拆分业务，优化文件系统等办法，使得Squid达到 90% 以上的缓存命中率。 Squid处理TCP连接消耗的服务器资源比真正的HTTP服务器要小的多，当Squid分担了大部分连接，网站的承压能力就大大增强了。
Figure 4: Squid hit rate at one website as measured with MRTG. The blue line shows Squid traffic; the green area shows Apache traffic.
4.4 内存级缓存
内存级别的缓存是指将需要动态生成的内容暂时缓存在内存里，在一个可接受的延迟时间范围内，同样的请求不再动态生成，而是直接从内存中读取。Linux环境下内存级缓存Memcached［31］是一个不错的选择。
Memcached是danga.com（运营Live Journal［32］的技术团队）开发的一套非常优秀的分布式内存对象缓存系统，用于在动态系统中减少数据库负载，提升性能。和 Squid 的前端缓存加速不同，它是通过基于内存的对象缓存来减少数据库查询的方式改善网站的性能，而其中最吸引人的一个特性就是支持分布式部署；也就是说可以在一群机器上建立一堆 Memcached 服务，每个服务可以根据具体服务器的硬件配置使用不同大小的内存块，这样，理论上可以建立一个无限大的基于内存的缓存系统。
Memcached 是以守护程序方式运行于一个或多个服务器中，随时接受客户端的连接操作，客户端可以由各种语言编写，目前已知的客户端 API 包括 Perl/PHP/Python/Ruby/Java/C#/C 等等[附录1]。客户端首先与 Memcached 服务建立连接，然后存取对象。每个被存取的对象都有一个唯一的标识符 key，存取操作均通过这个 key 进行，保存的时候还可以设置有效期。保存在 Memcached 中的对象实际上是放置在内存中的，而不是在硬盘上。Memcached 进程运行之后，会预申请一块较大的内存空间，自己进行管理，用完之后再申请一块，而不是每次需要的时候去向操作系统申请。Memcached将对象保存在一个巨大的Hash表中，它还使用NewHash算法来管理Hash表，从而获得进一步的性能提升。所以当分配给Memcached的内存足够大的时候，Memcached的时间消耗基本上只是网络Socket连接了［33］。
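The access pattern described above (look up by key, fall back to the database on a miss, store with an expiry time, spread keys over several servers) is sketched below. This is not a real Memcached client: `FakeCacheNode` is an in-process stand-in, and `CachePool`, `cached_query` and the CRC32-modulo key distribution are assumptions chosen to keep the example self-contained.

```python
import time
import zlib


class FakeCacheNode:
    """In-process stand-in for one Memcached server (not a real client)."""

    def __init__(self):
        self._items = {}  # key -> (value, absolute expiry time)

    def set(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._items[key] = (value, now + ttl_seconds)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._items.get(key)
        if item is None or item[1] <= now:
            self._items.pop(key, None)  # drop expired entries lazily
            return None
        return item[0]


class CachePool:
    """Spreads keys over several nodes by hashing the key."""

    def __init__(self, nodes):
        self.nodes = nodes

    def node_for(self, key):
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]


def cached_query(pool, key, ttl_seconds, load_from_db, now=None):
    """Cache-aside read: try the cache first, query the database on a miss."""
    node = pool.node_for(key)
    value = node.get(key, now)
    if value is None:
        value = load_from_db()
        node.set(key, value, ttl_seconds, now)
    return value
```

Within the expiry window the database is queried only once per key. Note that simple modulo hashing remaps most keys whenever a node is added or removed.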
Memcached也有它的不足。首先它的数据是保存在内存当中的，一旦服务进程重启（进程意外被关掉，机器重启等），数据会全部丢失。其次Memcached以root权限运行，而且Memcached本身没有任何权限管理和认证功能，安全性不足。第一条是Memcached作为内存缓存服务使用无法避免的，当然，如果内存中的数据需要保存，可以采取更改Memcached的源代码，增加定期写入硬盘的功能。对于第二条，我们可以将Memcached服务绑定在内网IP上，通过Linux防火墙进行防护。
4.5 CPU与IO均衡
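Among all the features a website provides, some consume large amounts of server-side I/O, such as downloads and video playback, while others consume large amounts of server CPU, such as video transcoding and log statistics. In a server cluster, when we find machines whose CPU and I/O utilization differ widely, for example high CPU load but low I/O load, we can swap some of the CPU-heavy processes on that server for I/O-heavy ones to even things out. Balancing CPU and I/O consumption on every machine not only uses server resources more fully but also lets the site ride out temporary overload: when an unexpected event sends traffic soaring, performance degrades gracefully (graceful performance degradation)［34］ instead of the site collapsing at once.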
4.6 读写分离
如果网站的硬盘读写性能是整个网站性能提升的一个瓶颈的话，可以考虑将硬盘的读，写功能分开，分别进行优化。在专门用来写的硬盘上，我们可以在Linux下使用软件RAID-0（磁盘冗余阵列0级）［35］。RAID-0在获得硬盘IO提升的同时，也会增加整个文件系统的故障率——它等于RAID中所有驱动器的故障率之和。如果需要保持或提高硬盘的容错能力，就需要实现软件RAID-1，4或5，它们能在某一个（甚至几个）磁盘驱动器故障之后仍然保持整个文件系统的正常运行［36］，但文件读写效率不如RAID-0。而专门用来读的硬盘，则不用如此麻烦，可以使用普通的服务器硬盘，以降低开销。
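The statement that a RAID-0 array's failure rate is the sum of its drives' failure rates is a close approximation for small per-drive probabilities. Assuming drives fail independently with probability p over some period, the array fails if any drive fails, so the exact value is 1 - (1 - p)^n. The sketch below compares the two; the 3% per-drive figure is purely illustrative.

```python
def raid0_failure_probability(per_disk_failure_probability, disk_count):
    """Probability that at least one of the striped disks fails.

    Assumes independent failures; RAID-0 loses all data if any disk fails.
    """
    return 1 - (1 - per_disk_failure_probability) ** disk_count


# Illustrative numbers only: 4 disks, each with a 3% chance of failing.
exact = raid0_failure_probability(0.03, 4)  # about 0.1147
approximate = 4 * 0.03                      # the "sum of the rates" shortcut
```

The sum slightly overstates the risk, and the gap widens as p or the number of disks grows.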
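General-purpose file systems balance read and write efficiency across files of every size and format, so they are not optimal for reading or writing any particular kind of file. Where necessary, the read or write efficiency for specific files can be maximized by choosing a suitable file system and tuning its parameters. For example, if the file system must store a large number of small files, ReiserFS［37］ can replace ext3, the default file system on Linux. ReiserFS is built on balanced trees; particularly on huge file systems holding very many files, its lookups are faster than those of ext3, which uses a localized binary search. Directories in ReiserFS are allocated fully dynamically, so the problem common on ext3 of being unable to reclaim the disk space occupied by huge directories does not arise. In ReiserFS small files (< 4K) can be stored directly in the tree, which makes reading and writing them faster; nodes in the tree are byte-aligned, and several small files can share one disk block, saving a great deal of space. ext3 uses a fixed-size block allocation strategy, so a file smaller than 4K still occupies 4K, which wastes considerable space［38］. However, ReiserFS is not well supported by many Linux kernels, including 2.4.3, 2.4.9 and even the relatively recent 2.4.16; a site that wants to use it must install kernel 2.4.18, which works better with it. Administrators are generally reluctant to run a very new kernel, because the software running on it has not yet been tested extensively in practice and may contain small undiscovered bugs, and on a server even a small bug is unacceptable. ReiserFS is still a fairly young, fast-evolving file system. Compared with ext3 it has one major drawback: each time the ReiserFS file system is upgraded, the whole disk partition has to be reformatted. The trade-offs must therefore be weighed before choosing it［39］.
5 Application Layer Optimization
5.1 Choosing the Web Server Software
According to statistics［40］, more than 50% of the web hosts on the Internet currently run the Apache［41］ server. Apache is the open-source world's first-choice web server because it is powerful, reliable, and suitable for the vast majority of applications. But that power sometimes makes it heavy: its configuration files are dauntingly complex, and it is not very efficient under high concurrency. The lightweight web server Lighttpd［42］ is a rising star: built on single-process multiplexing, it serves static files far faster than Apache. Lighttpd also supports PHP well and can support other languages, such as Python, through FastCGI. Lighttpd is a lightweight server that cannot match Apache's features and cannot handle some complex applications. Yet even a site whose content is mostly generated dynamically still has static elements such as images, JS scripts and CSS. One can therefore place Lighttpd in front of Squid to form a Lighttpd->Squid->Apache processing chain. Lighttpd sits at the front and handles requests for static content itself, forwarding requests for dynamic content to Squid through its proxy module; if Squid holds the requested content and it has not expired, Squid returns it directly to Lighttpd. New requests and requests for expired pages are passed to the scripts running in Apache. After the two-stage filtering by Lighttpd and Squid, Apache handles far fewer requests, which relieves the pressure on the web application. This architecture also makes it easy to spread the different kinds of processing over several machines, with Lighttpd in front dispatching uniformly.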
在这种架构下，每一级都是可以进行单独优化的，比如Lighttpd可以采用异步IO方式，Squid可以启用内存来缓存，Apache可以启用MPM（Multi -Processing Modules，多道处理模块）等，并且每一级都可以使用多台机器来均衡负载，伸缩性好。
著名视频分享网站YouTube就是选择使用Lighttpd作为网站的前台服务器程序。
5.2 数据库选择
MySQL［43］是一个快速的、多线程、多用户和健壮的SQL数据库服务器，支持关键任务、重负载系统的使用，是最受欢迎的开源数据库管理系统，是Linux下网站开发的首选。它由MySQL AB开发、发布和提供支持。
MySQL数据库能为网站提供：
• 高性能。MySQL支持海量，快速的数据库存储和读取。还可以通过使用64位处理器来获取额外的一些性能，因为MySQL在内部里很多时候都使用64位的整数处理。
• 易用性。MySQL的核心是一个小而快速的数据库。它的快速连接，快速存取和安全可靠的特性使MySQL非常适合在互联网站上使用。
• 开放性。MySQL提供多种后台存储引擎的选择，如MyISAM， Heap， InnoDB，Berkeley Db等。缺省格式为MyISAM。 MyISAM 存储引擎与磁盘兼容的非常好［44］。
• 支持企业级应用。MySQL有一个用于记录数据改变的二进制日志。因为它是二进制的，这一日志能够快速地将数据的更改从一台机器复制（replication）到另一台机器上。即使服务器崩溃，这一二进制日志也能够保持完整。这一特性通常被用来搭建数据库集群，以支持更大的流量访问要求［30］（图5）。
Figure 5: A MySQL master-slave cluster

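MySQL also has shortcomings of its own. It ships without a graphical interface, and older releases lacked stored procedures, triggers, subqueries and views; subqueries arrived in MySQL 4.1, and stored procedures, triggers and views in MySQL 5.0. Referential integrity is enforced only by the InnoDB storage engine, not by the default MyISAM engine. This is the power of open source: you can always expect something better.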
国外的Yahoo!，国内的新浪，搜狐等很多大型商业网站都使用MySQL 作为后台数据库。对于一般的网站系统，无论从成本还是性能上考虑，MySQL应该是最佳的选择。
5.3 服务器端脚本解析器的选择
目前最常见的服务器端脚本有三种：ASP(Active Server Pages)，JSP(Java Server Pages)，PHP (Hypertext Preprocessor)［45］［46］。
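ASP, in full Active Server Pages, and its successor ASP.NET are Microsoft's server-side web development environments, used to build and run dynamic, interactive, high-performance web applications. Classic ASP pages are written in a scripting language, usually VBScript, while ASP.NET pages are usually written in C# or VB.NET. Because it runs only on Windows, we do not discuss it here.
PHP is a cross-platform, server-side embedded scripting language. It borrows heavily from the syntax of C, Java and Perl and adds features of its own, letting web developers write dynamically generated pages quickly. It supports the vast majority of current databases. PHP is also open source, released under the PHP License; its binary installers and complete source code can be downloaded freely from the official PHP site ( http://www.php.net). Paired with MySQL on Linux, PHP is the best choice.
JSP is the new generation of site development language from Sun, and the third application of the Java language after Java applications and Java applets. With the support of Servlets and JavaBeans, JSP can build powerful site programs. As part of the Java technology family and a component of the Java 2 Enterprise Edition architecture, JSP has all the advantages that Java brings, including excellent portability across platforms, highly reusable component design, robustness and security, and it can support highly complex web-based applications.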
除了这三种常见的脚本之外，在Linux下我们其实还有很多其他的选择：Python（Google使用），Perl等，如果作为CGI调用，那么可选择范围就更广了。使用这些不太常见的脚本语言的好处是，它们对于某些特殊的应用有别的脚本所不具有的优势；不好的地方是，这些脚本语言在国内使用的人比较少，当碰到技术上的问题的时候，能找到的资料也较少。
5.4 可配置性
在大型网站开发过程中，不管使用什么技术，网站的可配置性是必须的。在网站的后期运营过程中，肯定会有很多的需求变更。如果每一次的需求变更都会导致修改源代码，那么，这个网站的开发可以说是失败的。
首先，也是最重要的一点，功能和展示必须分开。PHP和JSP都支持模板技术，如PHP的Smarty，Phplib，JSP的JSTL（JSP Standard Tag Library）等。核心功能使用脚本语言编写，前台展示使用带特殊标签的HTML，不仅加快了开发速度，而且方便以后的维护和升级［47］。
其次，对于前台模板，一般还需要将页面的头，尾单独提取出来，页面的主体部分也按模块或者功能拆分。对CSS，JS等辅助性的代码，也建议以单独的文件形式存放。这样不仅方便管理，修改，而且还可以在用户访问的时候进行缓存，减少网络流量，减轻服务器压力。
再次，对于核心功能脚本，必须将与服务器相关的配置内容，如数据库连接配置，脚本头文件路径等，与代码分离开。尤其当网站使用集群技术，CDN加速等技术的时候，每一台服务器上的配置可能都会不一样。如果不使用配置文件，则需要同时维护几份不同的代码，很容易出错。
最后，应该尽量做到修改配置文件后能实时生效，避免修改配置文件之后需要重启服务程序的情况。
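Separating server-specific settings from code, as described above, can be sketched with Python's standard `configparser` module. The section names, keys, addresses and the `database_dsn` helper are hypothetical; in a real deployment each server would carry its own copy of the file while the code stays identical.

```python
import configparser

# Hypothetical per-server configuration; only this text differs between machines.
SAMPLE_CONFIG = """
[database]
host = 10.0.0.21
port = 3306
name = podcast

[paths]
include_dir = /srv/app/include
"""


def load_config(text):
    """Parse INI-style configuration text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def database_dsn(config):
    """Build a connection string from the [database] section."""
    db = config["database"]
    return f"mysql://{db['host']}:{db.getint('port')}/{db['name']}"


config = load_config(SAMPLE_CONFIG)
dsn = database_dsn(config)
```

To pick up changes without a restart, an application could reload the file when its modification time changes; that part is left out of this sketch.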
5.5 封装和中间层思想
在功能块层次，如果使用JSP，基于纯面向对象语言Java的面向对象思想，类似数据库连接，会话管理等基本功能都已经封装成类了。如果使用PHP，则需要在脚本代码中显式的封装，将每一个功能块封装成一个函数，一个文件或者一个类。
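At a higher level, a site can be divided into a presentation layer, a logic layer and a persistence layer, each encapsulated so that a change in one layer's architecture does not affect the others. For example, during one upgrade 新浪播客 changed its persistence-layer database from a centralized to a distributed architecture; because the database connections and all operations were encapsulated [Appendix 2], no upper-layer code had to change and the transition went smoothly. The recently popular MVC architecture splits a site into Model (logic), View (interface) and Controller (flow), and there are many good frameworks to choose from, such as Struts and Spring for JSP, and php.MVC and Studs for PHP. Using an existing framework can get far more done with far less effort.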
6 扩容、容错处理
6.1 扩容
一个大型网站，在设计架构的时候，必须考虑到以后可能的容量扩充。新浪播客在设计时充分地考虑了这一点。对于视频分享类网站来说，视频存储空间消耗是巨大的。新浪播客在主存储服务器上，采用配置文件形式指定每一个存储盘柜上存储的视频文件的ID范围。当前台服务器需要读取一个视频的时候，首先通过询问主存储服务器上的接口获得该视频所在的盘柜及目录地址，然后再去该盘柜读取实际的视频文件。这样如果需要增加存储用的盘柜，只需要修改配置文件即可，前台程序丝毫不受影响。
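The ID-range lookup described above can be sketched as follows. The ranges, paths and the function name `cabinet_for` are invented for illustration; the source only says that a configuration file assigns each storage cabinet a range of video IDs.

```python
from bisect import bisect_right

# Hypothetical configuration: first video ID stored on each cabinet, ascending.
STORAGE_RANGES = [
    (1, "/storage/cabinet01"),
    (500_001, "/storage/cabinet02"),
    (1_200_001, "/storage/cabinet03"),
]


def cabinet_for(video_id, ranges=STORAGE_RANGES):
    """Return the cabinet whose ID range contains video_id."""
    starts = [start for start, _ in ranges]
    position = bisect_right(starts, video_id) - 1
    if position < 0:
        raise KeyError(f"no cabinet configured for video {video_id}")
    return ranges[position][1]
```

Adding capacity then means appending one line to the configuration with the first ID of the new cabinet; callers of `cabinet_for` do not change.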
新浪播客采用MySQL数据库集群，在逻辑层封装了所有的数据库连接及操作。当数据库存储架构发生改变的时候，如增加一台主库，将某些数据表独立成库，增加读取数据用的从库等，都只需要修改封装了的数据库操作类，上层代码不用修改。
新浪播客的前台页面服务器使用F5公司的硬件第四层交换机，网通，电信分别导向不同的虚拟IP，每一个虚拟IP后面又有多个服务器提供服务。当访问流量增大的时候，可以很方便往虚拟IP后面增加服务器，分担压力。
6.2 容错
对于商业性网站来说，可用性是非常重要的。7*24的访问要求网站具有很强的容错能力。错误包括网络错误，服务器错误以及应用程序错误。
2006年12月27日台湾东部外海发生里氏7.6级地震，造成途径台湾海峡的多条海底电缆中断，导致许多国外网站，像MSN， NBA， Yahoo！（英文主站）等国内无法访问，但也有例外，以Google为代表的在国内建设有分布式数据节点的很多网站却仍然可以访问。虽然说地震造成断网是不可抗原因，但如果在这种情况下网站仍然可以访问，无疑能给网站用户留下深刻的印象。这件事情给大型商业网站留下的教训是：网站需要在用户主要分布区域保持数据存在，以防止可能的网络故障。
对于服务器错误，一般采取冗余设计的方法来避免。对于存储服务器（主要是负责写入的服务器），可以使用RAID（冗余磁盘阵列）；对于数据库（主要是负责写入的主库），可以采用双主库设计［30］；对于提供服务的前台，则可以使用第四层交换的集群，由多台服务器同时提供服务，不仅分担了流量压力，同时还可以互相作为备份。
在应用层程序中，也要考虑“用户友好”的出错设计。典型例子如HTTP 404 出错页面，程序内部错误处理，错误返回提示等，尽可能的做到人性化。
7 总结及展望
7.1 总结

对于一个高并发高流量的网站来说，任何一个环节的瓶颈都会造成网站性能的下降，影响用户体验，进而造成巨大的经济损失。在全互联网层面，应该使用分布式设计，缩短网站与用户的网络距离，减少主干网上的流量，以及防止在网络意外情况下网站无法访问的问题。在局域网层面，应该使用服务器集群，一方面可以支撑更大的访问量，另一方面也作为冗余备份，防止服务器故障导致的网站无法访问。在单服务器层面，应该配置操作系统，文件系统及应用层软件，均衡各种资源的消耗，消除系统性能瓶颈，充分发挥服务器的潜能。在应用层，可以通过各种缓存来提升程序的效率，减少服务器资源消耗（图6）。另外，还需要合理设计应用层程序，为以后的需求变更，扩容做好准备。

图6 典型高并发高流量网站的架构

在每一个层次，都需要考虑容错的问题，严格消除单点故障，做到无论应用层程序错误，服务器软件错误，服务器硬件错误，还是网络错误，都不影响网站服务。
7.2展望
当前Linux环境下有著名的LAMP（Linux＋Apache＋MySQL＋PHP/PERL/PYTHON）网站建设方案，但只是针对一般的中小网站而言。对于高并发高流量的大型商业网站，还没有一个完整的，性价比高的解决方案。除去服务器，硬盘，带宽等硬件投资外，还需要花费大量的预算和时间精力在软件解决方案上。
随着互联网的持续发展，Web2.0的兴起，在可以预见的未来里，互联网的用户持续增多，提供用户参与的网站不断增加，用户参与的内容日益增长，越来越多的网站的并发量，访问量会达到一个新的高度，这就会促使越来越多的个人，公司以及研究机构来关注高并发高流量的网站架构问题。就像Web1.0成就了无数中小网站，成就了LAMP一样，Web2.0注定也会成就一个新的，高效的，成本较低的解决方案。这个方案应该包括透明的第三方CDN网络加速服务，价格低廉的第四层甚至更高层网络交换设备，优化了网络性能的操作系统，优化了读写性能，分布式，高可靠的文件系统，揉合了内存，硬盘等各个级别缓存的HTTP服务器，更为高效的服务器端脚本解析器，以及封装了大部分细节的应用层设计框架。
技术的进步永无止境。我们期待互联网更为美好的明天。

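References
［1］Robert Hobbes' Zakon, Hobbes' Internet Timeline v8.2, available at http://www.zakon.org/robert/internet/timeline/
［2］GlobalReach Inc., Global Internet Statistics (by language), available at http://www.glreach.com/globstats/index.php3
［3］China Internet Network Information Center, 19th Statistical Report on Internet Development in China, available at http://www.cnnic.net.cn/index/0E/index.htm
［4］Web2.0, definition available at http://www.wikilib.com/wiki/Web2.0
［5］Alexa Internet, Inc., http://www.alexa.com/
［6］Yahoo! Inc., http://www.yahoo.com/
［7］eBay Inc., the well-known online auction site, http://www.ebay.com/
［8］Chet Dembeck, Yahoo! Cashes In On eBay's Outage, available at http://www.ecommercetimes.com/perl/story/545.html
［9］YouTube, Inc., http://www.youtube.com/
［10］Data source: 互联网周刊 (China Internet Weekly), 2007, No. 3
［11］SINA Corporation (新浪网技术（中国）有限公司), http://www.sina.com.cn/
［12］Data source: Sina Podcast relaunch announcement, available at http://games.sina.com.cn/x/n/2007-04-16/1427194553.shtml
［13］Deng Hongyan, Ye Juanli, A preliminary study of online references, Journal of Wuhan University (Humanities and Social Sciences Edition), 2000
［14］Peng Xiangkai, CDN networks and their applications, Microcomputer Information, 2005, No. 02
［15］Data source: ChinaCache, http://www.chinacache.com/
［16］Open System Interconnect, the open systems interconnection model, a reference model for open network interconnection proposed by the International Organization for Standardization (ISO) in 1984; see http://www.iso.org/
［17］Ling Zhongquan, Ding Zhenguo, Load balancing based on layer-4 switching, China Data Communications, 2003
［18］Chen Mingrui, Qiu Zhao, Huang Xi, Huang Jun, Applying intelligent load balancing to high-load websites, Journal of Guangxi Normal University (Natural Science Edition), 2006, No. 04
［19］Alteon Inc., http://www.alteon.com/
［20］F5 Networks, Inc., http://www.f5.com.cn/
［21］Data source: http://www.toplee.com/blog/archives/71.html
［22］Fu Ming, Cheng Xiaoheng, Wang Wei, A Linux-based solution for load-balanced server access, Computer Systems & Applications, 2001, No. 09
［23］Ming-Wei Wu, Ying-Dar Lin, Open source software development: an overview, Computer, 2001, ieeexplore.ieee.org
［24］Wang Haihua, Yang Bin, Design and implementation characteristics of the Linux TCP/IP stack, Journal of Yunnan Nationalities University (Natural Sciences Edition), 2007, No. 01
［25］Requests for Comments (RFC), the publication vehicle for technical specifications and policy documents produced by the IETF (Internet Engineering Task Force), the IAB (Internet Architecture Board), or the IRTF (Internet Research Task Force), http://www.ietf.org/rfc.html
［26］RFC 1323, http://www.ietf.org/rfc/rfc1323.txt?number=1323
［27］Squid web proxy cache team, http://www.squid-cache.org/
［28］Ma Junchang, Gu Zhimin, Analysis of the storage system of the Squid proxy cache, Computer Applications, 2003, No. 10
［29］Han Xiangchun, Guo Tingting, Lin Xingyu, Feng Baojie, Research on proxy caching in cluster cache systems, Computer Engineering and Design, 2006, No. 20
［30］Brad Fitzpatrick, LiveJournal's Backend: A history of scaling, OSCON 2005, http://www.danga.com/words/
［31］Danga Interactive, http://www.danga.com/memcached/
［32］LiveJournal, the well-known blog service provider (BSP), http://www.livejournal.com/
［33］Brad Fitzpatrick, Distributed caching with memcached, Linux Journal, Volume 2004, Issue 124, Page 5, August 2004
［34］Zhou Feng, Research on scalable cluster object storage and disk log caching for Internet services, master's thesis, Tsinghua University, 2002
［35］Chen Yun, Yang Genke, Wu Zhiming, Implementation algorithms for RAID levels in RAID systems, Microcomputer Applications, 2003, No. 06
［36］Chen Pingzhong, A comparison of hardware and software RAID, Modern Computer (Professional Edition), 2005, No. 01
［37］NAMESYS, http://www.namesys.com/
［38］D. Robbins, Advanced filesystem implementor's guide: Journalling and ReiserFS, IBM developerWorks, June 2001
［39］Liu Zhangyi, An analysis of Linux file systems, Journal of Guizhou University of Technology (Natural Science Edition), 2002, No. 04
［40］Data source: http://news.netcraft.com/archives/2007/04/02/april_2007_web_server_survey.html
［41］The Apache Software Foundation, http://httpd.apache.org/
［42］Lighttpd, http://www.lighttpd.net/
［43］MySQL AB, http://www.mysql.com/
［44］Gu Zhihua, Hu Chaojian, MySQL storage engines and database performance, Computer Era, 2006, No. 10
［45］The PHP Group, http://www.php.net/
［46］Fan Yunzhi, A comparative analysis of the dynamic web technologies ASP, PHP and JSP, Computer Knowledge and Technology (Academic Exchange), 2005, No. 10
［47］Wang Yaoxi, Wang Liqing, Xu Yongyue, Using template technology to separate and parallelize B/S development, Application Research of Computers, 2004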
Appendix
[Appendix 1]
1. PHP client wrapper for Memcache
class memcache_class
{
    /**
     * Build the URL of the HTTP gateway in front of memcache.
     * The original code passed an undefined $query to posttohost();
     * deriving it from the configured host and port is an assumption,
     * and the gateway path ("/") is a placeholder.
     */
    private static function gateway_url()
    {
        return "http://" . MEMCACHE_SERVER_IP . ":" . MEMCACHE_SERVER_PORT . "/";
    }

    /**
     * Write to memcache with an HTTP POST.
     * $data may be a PHP array.
     * $exp is the expiry time in seconds.
     */
    public static function p_memcache_write($key, $data, $exp = 3600)
    {
        // Serialize $data so that arrays can be stored
        $data = serialize($data);

        // Optionally compress $data
        //$data = gzcompress($data);

        $submit = array(
            'type' => MEMCACHE_SERVER_TYPE,
            'cmd'  => 'set',
            'key'  => $key,
            'data' => $data,
            'exp'  => $exp
        );
        return self::posttohost(self::gateway_url(), $submit);
    }

    /**
     * Read from memcache with an HTTP POST.
     * Returns false when nothing usable comes back.
     */
    public static function p_memcache_read($key)
    {
        $submit = array(
            'type' => MEMCACHE_SERVER_TYPE,
            'cmd'  => 'get',
            'key'  => $key
        );
        $res = self::posttohost(self::gateway_url(), $submit);

        // Optionally decompress $res
        //$res = gzuncompress($res);
        // Unserialize $res so that arrays come back as arrays
        return unserialize($res);
    }

    /**
     * Perform the POST and return the response body ("" on failure).
     */
    public static function posttohost($url, $data)
    {
        $url   = parse_url($url);
        $path  = isset($url['path']) ? $url['path'] : '/';
        $query = isset($url['query']) ? '?' . $url['query'] : '';
        $port  = isset($url['port']) ? $url['port'] : 80;

        // URL-encode the form fields (each() no longer exists in PHP 8)
        $pairs = array();
        foreach ($data as $k => $v)
        {
            $pairs[] = rawurlencode($k) . "=" . rawurlencode($v);
        }
        $encoded = implode("&", $pairs);

        // Try to connect up to three times, one-second timeout each
        $fp = false;
        for ($i = 0; $i < 3; $i++)
        {
            $fp = @fsockopen($url['host'], $port, $errno, $errstr, 1);
            if ($fp)
                break;
        }
        if (!$fp)
        {
            return "";
        }
        stream_set_timeout($fp, 2);
        // HTTP header lines end in CRLF
        fputs($fp, "POST " . $path . $query . " HTTP/1.0\r\n");
        fputs($fp, "Host: " . $url['host'] . "\r\n");
        fputs($fp, "Content-Type: application/x-www-form-urlencoded\r\n");
        fputs($fp, "Content-Length: " . strlen($encoded) . "\r\n");
        fputs($fp, "Connection: close\r\n\r\n");
        fputs($fp, $encoded);

        // Only a 200 response carries a usable body (eregi() was removed in PHP 7)
        $line = fgets($fp, 1024);
        if (!preg_match('/^HTTP\/1\.. 200/i', (string) $line))
        {
            fclose($fp);
            return "";
        }
        $results = "";
        $inheader = 1;
        while (!feof($fp))
        {
            $line = fgets($fp, 1024);
            if ($inheader && ($line == "\n" || $line == "\r\n"))
            {
                // A blank line ends the response headers
                $inheader = 0;
            }
            elseif (!$inheader)
            {
                $results .= $line;
            }
        }
        fclose($fp);
        return $results;
    }
}

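2. Usage example
// getmicrotime(), memcached_log(), get_query() and http_read() are helpers
// defined elsewhere in the application; the original does not show them.
$out = "";
$memhit = 0;
if (MEMCACHE_FLAG === true)
{
    // Try the cache first
    $memcache_key = md5(trim($key));
    $time_before = getmicrotime();
    $mdata = memcache_class::p_memcache_read($memcache_key);
    $time_after = getmicrotime();
    $memcache_read_time = $time_after - $time_before;
    if (is_string($mdata) && strlen($mdata) >= MIN_RESULT) {
        $out = $mdata;
        $memhit = 1;
        memcached_log("CACHE_HIT");
    }
    else {
        memcached_log("CACHE_NOT_HIT");
    }
}
if (!(strlen($out) >= MIN_RESULT))
{
    // Cache miss: fetch from the backend.
    // $errstr is filled in by http_read(), which must declare that parameter
    // by reference; call-time "&$errstr" is a fatal error since PHP 5.4.
    $query = get_query();
    $time_before = getmicrotime();
    $out = http_read($MySQLHost, $MySQLPort, $query, $errstr, 10);
    $time_after = getmicrotime();
}

$len = strlen($out);
// The original tested an undefined constant MEMCACHE here; MEMCACHE_FLAG is assumed.
if (MEMCACHE_FLAG === true && $memhit <= 0)
{
    // Store the fresh result for later requests
    $memcache_key = md5(trim($key));
    $time_before = getmicrotime();
    memcache_class::p_memcache_write($memcache_key, $out, MEMCACHE_TIME);
    $time_after = getmicrotime();
    $memcache_write_time = $time_after - $time_before;
    memcached_log("CACHE_WRITE");
}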

[Appendix 2]
MySQL wrapper class (method bodies omitted in the original)
<?php
class mysqlRpc
{
var $_hostWrite = '';
var $_userWrite = '';
var $_passWrite = '';
var $_hostRead = '';
var $_userRead = '';
var $_passRead = '';
var $_dataBase = '';
var $db_write_handle = null;
var $db_read_handle = null;
var $db_last_handle = null;
var $_cacheData = array();
var $mmtime = 60;
function mysqlRpc($database, $w_servername, $w_username, $w_password, $r_servername='', $r_username='', $r_password='') {}

function connect_write_db() {}
function connect_read_db() {}
function query_write($sql, $return = false) {}
function query_read($sql, $return = false) {}

function query_first($sql, $return = false) {}
function insert_id(){}
function affected_rows(){}

function escape_string($string){}
function fetch_array($queryresult, $type = MYSQL_ASSOC){}

}
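As Asia's largest transactional e-commerce site, Taobao posts startling numbers. In the first half of 2007 its total transaction volume passed RMB 15.7 billion, close to the RMB 16.9 billion for the whole of 2006 and equivalent to 122 Carrefour or 150 Wal-Mart hypermarkets. Compared with the same period a year earlier, transaction volume grew by nearly 200%. Taobao now has more than 45 million registered users and over 90 million listed items, while eBay, the world's largest C2C site, lists about 110 million.

Mapped onto the network, these real-world numbers become a rapidly swelling mass of data. According to Lu Peng (路鹏), the first step in Taobao's data management architecture is to classify the information, for example by item category and attribute; Taobao applies a database partitioning strategy and places the classes in different databases.

Taobao also divides data by importance into core and non-core data. Core data includes user and transaction information; non-core data includes item descriptions, images, item reviews, and forum content. The two kinds are stored differently and kept in different databases, with different migration and load-balancing strategies. Taobao runs a check every day and moves data that has been inactive for 3 to 6 months from online storage to near-line storage.

Taobao serves more than 200 million page views a day. How does it keep access fast under that concurrency? This again comes down to how the data is classified and stored. "From an IT architecture point of view, Taobao's data is divided into static and dynamic, and nearly 70% of daily page views are static data," Lu Peng says. Taobao treats what is visible on a page, such as images and item descriptions, as static data and serves it through a CDN (Content Delivery Network). Besides the two data centers in Hangzhou that hold the static data, Taobao has set up CDN distribution points in cities including Shanghai, Tianjin, Hangzhou, and Ningbo, so that content is published to the cache servers closest to users, who fetch it from a nearby cache and see faster responses. Dynamic data is what users produce when they register, trade, or leave reviews; Taobao delivers it quickly by optimizing the back-end databases and the application layer.

For data that appears every day and then gradually goes quiet, Taobao uses a different storage scheme at each stage of the life cycle. "It falls into three classes: online, near-line, and offline," Lu Peng says. Online means the data of items currently on sale; the online data alone exceeds 10 TB and is kept on FC (Fibre Channel) storage, which costs more but speeds up access. Near-line covers delisted items, transaction reviews, and similar data that users may still look up; Taobao keeps it on SATA (Serial ATA) storage, which supports normal access at lower cost than Fibre Channel. Offline covers completed transactions and related information, which Taobao treats as historical data and stores on media such as tape libraries and optical discs.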

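 High performance and scalability in website architecture
2007-10-09, 4:33 PM
High performance and scalability are very important issues in website architecture. Web 2.0 sites in particular have to cope with highly concurrent access and must take both fully into account.
What are high performance and scalability? Put simply, high performance means fast access. The server's response time for a page request usually has to stay under 10 s, or the user experience becomes very poor. A while ago we made some performance improvements to 育儿网: we used icegrid to fix a pile-up of ice requests, added a cache (memcached), and improved parts of the database design, and site traffic grew noticeably as a result. Caching and code optimization are common ways to improve performance.
Scalability is the cost that has to be paid to keep performance high as traffic grows. One server has limited capacity, so as traffic increases further we have to add servers, and that is when a good architecture matters most. The web server, database, and middle tier all need to scale: the data tier can use MySQL master-slave replication, and the front end can use DNS round robin, squid, and similar tools for load balancing. At 育儿网 we used icegrid to make the ice tier scalable; adding an ice server now only requires adding the machine and changing the registry configuration, with no change to client code.
In short: get the highest performance at the lowest cost!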
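How should the home page of a large site like Sina be architected? [Closed]

• marine_chen
Posted: 2007-06-05 23:51:50 (original poster)
For large sites such as Sina, Sohu, and Taobao, what is a sensible architecture for the home page?

As far as I know, news-media sites like Sina and Sohu have modest real-time requirements and can be served from generated static pages. Taobao needs fresher data. Does it also generate static pages, or does it rely on some kind of cache?

Discussion welcome.

Points offered: 100. Replies: 32.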

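• Rachael1001
Posted: 2007-06-06 00:35:41, reply #1, score 0
I would guess they mix in a bit of every architecture.

• zhj92lxs
Posted: 2007-06-06 01:30:56, reply #2, score 0
Use a bit of everything.

• jsczxy2
Posted: 2007-06-06 01:48:59, reply #3, score 0
I'm not at that level yet; waiting here for the experts.

• marine_chen
Posted: 2007-06-06 08:51:07, reply #4, score 0
We have used a cache before, and it worked reasonably well.

• cucuchen
Posted: 2007-06-06 10:06:43, reply #5, score 0
OP, let me tell you sincerely: for a site as big as Sina, getting the home page right is not the whole story. Tens of millions of people visit at once, so there are usually many databases working together; put plainly, database clustering and concurrency control. Another point: the static pages on those sites are not really static. They are an illusion created by mapping static-looking URLs onto dynamic pages, which can be done with an open-source URL mapper such as urlrewrite. Real-time freshness on such sites is also relative, because database replication takes time. Technically you can use hibernate and ehcache, but if you want the site to run better you can back it with EJB and large servers such as websphere or weblogic, together with a large database such as oracle.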

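• marine_chen
Posted: 2007-06-06 11:08:58, reply #6, score 0
OP, let me tell you sincerely: for a site as big as Sina, getting the home page right is not the whole story. Tens of millions of people visit at once, so there are usually many databases working together; put plainly, database clustering and concurrency control. Another point: the static pages on those sites are not really static. They are an illusion created by mapping static-looking URLs onto dynamic pages, which can be done with an open-source URL mapper such as urlrewrite. Real-time freshness on such sites is also relative, because database replication takes time. Technically you can use hibernate and ehcache, but if you want the site to run better you can back it with EJB and large servers such as websphere or weblogic, together with a large database such as oracle.
-----------------------------------------------------------------------

Thanks for the reply.
Database clustering, concurrency control, weblogic, oracle and so on are all on the hardware side. I have used them all, with reasonable results.
What I want to know is whether there is anything to learn at the level of concrete technique. I have used hibernate's ehcache too; it has shortcomings of its own, and in my experience hibernate is not well suited to large systems. In that respect ibatis is the better choice.
As for caching, swarmcache and memcache are all just variations on the idea of a cache, and each has its pros and cons. Is there some approach that improves performance from a purely technical angle?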

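• cucuchen
Posted: 2007-06-06 11:33:22, reply #7, score 20
OP, I did design work at made-in-china.com. From my experience, I think making a site efficient is just a programmer's job. Performance tuning has to advance on the database and the program together, and caching has to be tackled from both sides as well. First: database caching and database optimization, which the DBA does (and there is a great deal of untapped potential here; we overlook it only because we are all programmers). Second: program optimization, which takes real care. One important point is to standardize your SQL: use IN less and OR more, and use PreparedStatement. Also avoid redundant work, for example nested loops when looking up data. Then pick good open-source frameworks for support. I personally think the middle and back tiers matter most, and spring + ibatis is a good choice, because ibatis works directly with SQL and has a caching mechanism. I hardly need to list spring's benefits; IoC lets you avoid new-ing objects, which also saves overhead. By my analysis the great bulk of the overhead comes from new and from connecting to the database, so avoid both as far as you can. You can also use a memory profiling tool to build a demo that shows whether hibernate or ibatis is faster. On the front end use whatever you like; struts and webwork both work, and if you are feeling ambitious you can try tapestry.

• cucuchen
Posted: 2007-06-06 11:35:10, reply #8, score 0
Correction: "I think making a site efficient is just a programmer's job" should read "I think making a site efficient is not just a programmer's job".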

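• didibaba
Posted: 2007-06-06 11:43:21, reply #9, score 0
I agree with marine_chen (覆雨翻云): an .htm or .html suffix does not prove that the program generated a static page. It may be done with URL rewriting, simply to improve the site's coverage in search engines.

In fact a database can cope with very high traffic as well. The disk seek time for a static file is not necessarily shorter than a database lookup, provided real effort goes into indexing the data.

My own feeling is that on a portal it is mostly the current day's items and the popular items that get the clicks. Caching them amounts to 1-2 GB of data at most, which is trivial even for a personal computer, never mind a server.
Take NetEase News as an example: http://news.163.com/07/0606/09/3GA0D10N00011229.html
Laid out for readability: http://domain/year/month-day/category/newsID.html
We can cache what was published today, what is popular, and what gets heavy traffic in a hashtable (key: year-month-day-category-ID, value: the news object) held statically in memory, which is definitely faster than seeking a static page on disk.

This greatly increases what a single machine can handle. If one machine is not enough, that becomes a routing problem for an HTTP server cluster.
Generating static pages is actually a rather clumsy approach:
1. It makes the program more complex
2. It makes the content harder to manage
3. It is not the fastest option either
4. It wears out the disks, haha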

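• marine_chen
Posted: 2007-06-06 11:45:50, reply #10, score 0
Everything you say makes sense and has given me a lot to think about. Thank you very much.
I am designing an e-commerce site at the moment and have not yet decided how to implement the home page, second-level pages, and so on. I started this thread hoping for ideas on exactly that.
Database and program optimization are a must. For frameworks I also chose webwork + ibatis + spring. I know a little about tapestry but have not used it in practice. Does tapestry really perform better?

• didibaba
Posted: 2007-06-06 11:47:25, reply #11, score 0
Older content gets little traffic, so ordinary handling is enough for it.

• cucuchen
Posted: 2007-06-06 11:53:02, reply #12, score 0
OP, your choice matches mine exactly: I also use webwork + ibatis + spring. tapestry really is excellent but hard to learn. Its advantages are that code and page design are separated and that it is event-driven, which is great. Given the learning curve, ww is a fine choice too!

• theforever
Posted: 2007-06-06 12:36:56, reply #13, score 0
Talk is cheap; practice is the only test of truth.
Write a data generator, try it for real on a big disk, and let the results speak.

• marine_chen
Posted: 2007-06-06 12:39:12, reply #14, score 0
didibaba has a point too.
For my part I would rather not generate static pages: they need extra servers for storage, which adds complexity and cost.

• marine_chen
Posted: 2007-06-06 12:42:21, reply #15, score 0
How is the home page usually initialized?
Surely not by pulling the data on every request? And how are the small modules tied together?

• marine_chen
Posted: 2007-06-06 12:42:46, reply #16, score 0
theforever (碧海情天), do you mean using static pages?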

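• tiandiqing
Posted: 2007-06-06 12:54:46, reply #17, score 0
I have worked with Sina's back-end publishing system.
It is perl, php, and mysql. A few people wrote the underlying layer, and the channel developers build on top of it.

The system is very flexible, and everything is generated as static pages.

• marine_chen
Posted: 2007-06-06 12:57:22, reply #18, score 0
cucuchen (绝情酷哥), with this framework stack, what is a good way to initialize the home page?

• marine_chen
Posted: 2007-06-06 12:57:49, reply #19, score 0
I have worked with Sina's back-end publishing system.
It is perl, php, and mysql. A few people wrote the underlying layer, and the channel developers build on top of it.

The system is very flexible, and everything is generated as static pages.
---------------------------------------------

Static pages should be somewhat faster than dynamic ones.

• tiandiqing
Posted: 2007-06-06 12:59:16, reply #20, score 0

Every time a news item is approved for publication, a static page is generated and pushed to the front-end web servers, all of which are load balanced. There are also scheduled jobs that regenerate pages automatically every 5-15 minutes. Building a large site is far less simple than people imagine; you need on the order of a hundred servers.

• tiandiqing
Posted: 2007-06-06 13:00:31, reply #21, score 0
If anyone wants to build a large portal system, get in touch with me; I can provide a technical solution, heh.

• marine_chen
Posted: 2007-06-06 13:16:33, reply #22, score 0
Every time a news item is approved for publication, a static page is generated and pushed to the front-end web servers, all of which are load balanced. There are also scheduled jobs that regenerate pages automatically every 5-15 minutes. Building a large site is far less simple than people imagine; you need on the order of a hundred servers.

-----------------------------------

Load balancing and scheduled jobs are essential for any large site, and reverse proxies are fairly common. What other clustering techniques do large sites in production use?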

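• deng1234
Posted: 2007-06-06 13:53:03, reply #23, score 0
Let me describe ours.
1. Our news is added through the back end. A newly added item does not show on the front end; it appears only after it is published. Publishing generates a static page for the item, and if another item replaces it, the old one has to be taken down first. That way the published news is always the latest.
2. On the technical side we definitely do not use hibernate, just plain jsp, and all writes go through stored procedures. Our tests showed that stored procedures really are much faster.
3. The database is oracle, and we run 2 weblogic servers.

• marine_chen
Posted: 2007-06-06 14:21:56, reply #24, score 0
Let me describe ours.
1. Our news is added through the back end. A newly added item does not show on the front end; it appears only after it is published. Publishing generates a static page for the item, and if another item replaces it, the old one has to be taken down first. That way the published news is always the latest.
2. On the technical side we definitely do not use hibernate, just plain jsp, and all writes go through stored procedures. Our tests showed that stored procedures really are much faster.
3. The database is oracle, and we run 2 weblogic servers.
------------------------------------------------------------

We also use oracle stored procedures + 2 weblogic servers, and our update mechanism is almost the same. It seems to be a fairly common approach.

• isline
Posted: 2007-06-06 14:28:02, reply #25, score 0
The servers that generate static pages and the www servers are two separate groups; pages reach the www servers only after they are generated.
Some of the databases are not relational, which suits derived information better.
There are many www servers, mail servers, and routers; load balancing is the main answer to access bottlenecks.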

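• didibaba
Posted: 2007-06-06 14:33:29, reply #26, score 20
How is the home page usually initialized?
Surely not by pulling the data on every request? And how are the small modules tied together?
------------------------------------------------------
The cache I am talking about is a data cache: the current day's and the popular data are put into a hash held in memory. The small page modules still pull their headlines for display in the usual way; because a page cache exists, you need not worry about hitting the database every time.
1. Say the user clicks http://news.163.com/07/0606/09/3GA0D10N00011229.html
2. Because the URL is rewritten, the request may actually be handled by http://news.163.com/shownews.jsp?id=3GA0D10N00011229
3. shownews.jsp only has to fetch the stored news object from the in-memory hashtable
4. If it is not in memory, read the news item from the database

Surely not by pulling the data on every request?
Of course not. The data can be cached at the moment the news is published. The cache will not grow without bound either: expired data is evicted at a fixed time (for example, in the early hours of the morning).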
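
Editorial sketch: the lookup flow in this reply (static-looking URL, key of the form year-monthday-category-ID, in-memory table first, database on a miss, off-peak eviction) might look roughly like this in Python. The names `cache_key`, `NewsCache`, and `fake_db` are hypothetical, and the dictionary stands in for the hashtable the poster mentions; this is not the code of any site discussed in the thread.

```python
import re
import time

# /year/monthday/category/newsID.html, as in the NetEase example above
NEWS_URL = re.compile(r"^/(\d{2})/(\d{4})/(\d{2})/(\w+)\.html$")


def cache_key(path):
    """Turn '/07/0606/09/3GA0D10N00011229.html' into a cache key."""
    match = NEWS_URL.match(path)
    if match is None:
        raise ValueError(f"not a news URL: {path}")
    return "-".join(match.groups())


class NewsCache:
    def __init__(self, load_from_db, max_age_seconds=86400):
        self._load = load_from_db        # called only on a cache miss
        self._max_age = max_age_seconds
        self._items = {}                 # key -> (stored_at, article)

    def get(self, path, now=None):
        now = time.time() if now is None else now
        key = cache_key(path)
        hit = self._items.get(key)
        if hit is not None:
            return hit[1]                # step 3: served from memory
        article = self._load(key)        # step 4: fall back to the database
        self._items[key] = (now, article)
        return article

    def purge_expired(self, now=None):
        """Drop entries older than max_age; meant to run off-peak."""
        now = time.time() if now is None else now
        stale = [k for k, (t, _) in self._items.items() if now - t > self._max_age]
        for key in stale:
            del self._items[key]
        return len(stale)


db_reads = []


def fake_db(key):
    # Stand-in for the real database query.
    db_reads.append(key)
    return {"id": key, "title": "example"}


cache = NewsCache(fake_db)
cache.get("/07/0606/09/3GA0D10N00011229.html", now=0)
cache.get("/07/0606/09/3GA0D10N00011229.html", now=10)
print(len(db_reads))  # the second request did not touch the database
```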

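• net205
Posted: 2007-06-06 15:03:36, reply #27, score 0
Watching...

• 45Ter
Posted: 2007-06-06 15:09:57, reply #28, score 0
Wow, just passing by. I have never touched some of these technologies; learning from all of you!

• cucuchen
Posted: 2007-06-06 15:26:47, reply #29, score 0
didibaba (落花有意兮流水无情，郁闷！！！)
I strongly agree with his view.
For the caching mechanism you can use ehcache, the one hibernate uses; it feels decent.

• marine_chen
Posted: 2007-06-06 16:04:50, reply #30, score 0
The cache I am talking about is a data cache: the current day's and the popular data are put into a hash held in memory.
-----------------------------------------------------------------------
struts and webwork do not seem to have anything like a servlet's init for loading data at startup, do they?

For page caching I have used oscache; for data caching, ehcache and swarmcache; ibatis also has its own cache. I built one system with these caches plus load balancing.

After this discussion, my own summary is that a home page like this comes down to periodically regenerated static pages, page caching, data caching, server clusters, and the like. Are there any fresher ideas?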

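• forevermihoutao
Posted: 2007-06-06 16:07:39, reply #31, score 0
up

• forevermihoutao
Posted: 2007-06-06 16:36:00, reply #32, score 60
Program development is one side; system architecture (hardware + network + software) is the other.

China's network is split between China Telecom in the south and China Netcom in the north, so visitors' IPs have to be told apart and sent into the right network.

Next comes clustering, which covers application server clusters and web server clusters. Application server clusters can be apache + tomcat clusters, weblogic clusters, and so on. Web server clusters can use a reverse proxy, NAT, or multiple DNS records; squid works as well. There are many options, and you choose according to your situation.

On the software architecture side, a site first of all needs plenty of web servers to hold static resources such as images, video, and static pages. Never put static resources on the application servers.

Page data access needs even more careful design. Some queries need not go to the database: where freshness requirements are low, lucene can do the job, and even where freshness matters lucene can still be used. lucene + compass is an excellent combination.

What lucene cannot cover can be cached. memcached works for distributed caching; with money to spend, use ten or so machines as cache, and more than 10 GB of capacity should hold whatever you need. Without money, put the effort into page caching and data caching: make heavy use of OSCACHE and EHCACHE. SWARMCACHE works too, although its synchronization is said to be not very good.

Then a very important point is the database. A large site should use oracle and do as much data manipulation as possible in stored procedures, which definitely improves performance. Have a DBA tune the database as well; a tuned database and an untuned one are worlds apart. You can also scale out to distributed databases, an area that will see more and more research.

News sites can store static pages and refresh them on a schedule to lighten the server load. Each small module on the home page can be cached with oscache, so the data is not pulled on every request.

Finally there is the coding itself. A good programmer writes code that is concise and performs well, while a junior programmer may make many basic mistakes; that too is one of the things that affects site performance.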

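 Resource roundup: architecture design and optimization strategies for high-concurrency, high-performance, highly scalable Web 2.0 sites

I recently set aside time to study architecture design and optimization strategies for high-concurrency, high-performance, highly scalable Web 2.0 sites and found a lot of good material, which I am again sharing here. Anyone hoping to get better at performance testing and performance tuning of large web applications should read it carefully. ^_^

» On the system architecture of large, high-concurrency, high-load websites (俊麟 Michael`s blog)
bind dlz - a request distribution tool for distributed systems: 一个藏袍
bind dlz - a request distribution tool for distributed systems. bind dlz stands for bind dynamic loadable zones. It is a component built on bind and, as the name suggests, adds support for loading zones dynamically. bind has a long history and is currently the first choice for setting up a DNS server. For an ordinary site a standard bind handles all DNS resolution perfectly well, but with a very large number of domain names it does run into problems: 1. All resolution data is stored in text files, so an editing mistake easily leads to resolution errors. 2. bind keeps all resolution data in memory at run time; with a huge data set memory may run short, and the time needed to reload the data also becomes a concern. Because loading is slow, adjusting domain names dynamically is essentially out of the question. dlz is a component developed for bind to solve this: resolution data is kept in a database, which avoids the reload when domain data changes, so changes take effect immediately. dlz supports many storage back ends, including the file system, Berkeley DB, PostgreSQL, MySQL, ODBC, and LDAP. A performance comparison is available here. Because bind dlz allows dynamic domain changes while still providing high-performance DNS resolution, it can serve as the front end of a distributed system that hands out second- or third-level domains, resolving each domain to the server group that hosts it and so making the architecture scalable.…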
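Craigslist's database architecture - DBA notes
CSDN video: CSDN SD Club and Qian Hongwu (钱宏武) discuss how to design a high-concurrency architecture
CSDN video channel
Web application optimization tips from Flickr's developers - DBA notes
Scaling YouTube's architecture - DBA notes
A look at Technorati's back-end database architecture - DBA notes
Performance optimization for large sites, as seen in the growth of LiveJournal's back end: 一个藏袍
Performance optimization for large sites, as seen in the growth of LiveJournal's back end. Yu Dunde (于敦德), 2006-3-16. 1. LiveJournal's history. LiveJournal started in 1999 as a campus project. A few people built it as a hobby to provide blogging and forums, social networking (finding friends), and aggregation (pulling friends' posts together). LiveJournal uses a great deal of open-source software and is itself open source. After launch it grew very quickly: 2.8 million registered users in April 2004, 6.8 million in April 2005, 7.9 million in August 2005. It reached thousands of page requests processed per second, using a large number of MySQL servers and a large number of general-purpose components. 2. Overview of LiveJournal's current architecture. 3. Learning from LiveJournal's growth. LiveJournal grew from 1 server to 100 and went through countless pains along the way, but it also worked out how to solve the problems it met. Studying LiveJournal lets us avoid the mistakes LJ made and design the system well from the start, sparing ourselves the later pain. Below we follow LJ's growth step by step.…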
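In-memory caching with memcached: 一个藏袍
In-memory caching with memcached (old post republished, 2005.8.9). Web pages are usually cached in a few ways, such as dynamic caching and static caching. ASP.NET can already cache fragments of a page, but caching with memcached is more flexible than ASP.NET fragment caching: it can cache any object, whether or not it is rendered on a page. memcached's greatest strength is that it can be deployed in a distributed fashion, which is indispensable for large-scale applications. LiveJournal.com uses memcached for front-end caching with good results, and sites such as wikipedia and sourceforge have adopted it or are about to. memcached can play a huge role in large-scale web applications.…
Designing high-performance, scalable websites with open-source software: 一个藏袍
Designing high-performance, scalable websites with open-source software. 2006-6-17, Yu Dunde (于敦德).

Last time we took LiveJournal as an example and analyzed in detail how a small site tunes performance as it grows step by step into a large one, solving the problems that rising load brings and avoiding or resolving them at the root when the architecture is designed. Today we look at methods commonly used in site design to handle large-scale access and high load. We cover three areas: 1. front-end load; 2. the business logic layer; 3. the data layer.

In the LJ article we said that grouping servers is the way to solve the load problem and scale without limit. This is usually done with something like LDAP, which is used in mail servers and in personal-site and blog applications; on Windows, Active Directory is the comparable solution. Some applications (blogs or personal pages, for example) need the user to be directed to the right server group as early as second-level domain resolution, before the request has reached the application, so the problem has to be solved in DNS. One tool for this is bind dlz, a bind plug-in that replaces bind's text zone files. It supports several storage back ends, including LDAP and BDB, and handles the problem fairly well.

Another DNS-related issue is the widespread problem of interconnection between the northern and southern networks in China. bind9's built-in view feature can return different answers depending on the source IP, so southern users resolve to southern servers and northern users to northern servers. Two problems come up here: obtaining the list of northern and southern IP ranges, and keeping communication between the northern and southern servers smooth. There is a crude solution to the first: pull all visitor IPs from the logs, write a script that pings them from both the southern and the northern servers, and analyze the results to get a roughly accurate list. The best solution, of course, is to get the list directly from the carriers (update: see this article). The second problem has more solutions. The best is to rent space in a dual-line data center: one machine, two IPs, connected to both networks. A lesser option is to find a data center on each side and, through extensive testing, pick two with good connectivity between them; this usually costs less but works less well and is harder to maintain.

DNS load balancing is also widely used: several parallel A records spread visits randomly over multiple front-end servers. It is typically used where static pages dominate, and much of the content front end of the major portals works this way.

Once the user has been directed to the right server group, the application takes over the request and processes it according to the defined business logic. Requests fall into two classes: static files (images, js scripts, css, and so on) and dynamic requests. Static requests are usually cached with squid. Depending on the scale of the application the cache can be single-level or multi-level; the cache hit rate generally reaches about 70%, which raises server capacity quite effectively. Apache's deflate module can compress transferred data to improve speed, and from version 2.0 on its cache module provides built-in disk and memory caching, so a reverse proxy is not strictly required.

Dynamic requests are currently handled in one of two ways. One is static generation: the page is regenerated as a static file whenever it changes. Many CMS and BBS systems do this, and together with a cache it gives fast access. It generally suits applications with few writes. The other is dynamic caching: every request still passes through the application, but the application relies more on memory than on the database. Database access is usually extremely slow and memory access is fast, at least an order of magnitude apart. memcached makes this possible, and a well-run memcache can even achieve a cache hit rate above 90%.

Ten years ago I was still using 2 MB of memory. A magazine at the time carried a witty exchange between father and son. Son: Dad, I want 1 GB of memory. Father: No, son, not even for your birthday. Today the cost of large memory is entirely affordable. Google uses large numbers of PCs in clusters for data processing, and I have long felt that large-memory PCs can solve front-end and even middle-tier load at very low cost. PC disks have short lives and are slow, and the CPUs are somewhat slower too, so using PCs as web front ends is cheap and plays to the strength of large memory; when one breaks it is simply replaced, with no data to migrate.

Next comes application design. As far as possible the application should support a scalable database design in which databases can be added dynamically, and should support in-memory caching; this is the cheapest approach. Another approach is middleware such as ICE. Its advantage is that the front-end application can be fairly simple: the data layer is transparent to it and provided by ICE, the distributed database design lives in the back end, and ICE wraps it for the front end to use. This design puts lighter demands on each part and layers the business more cleanly, but because it introduces middleware and more layers, it also costs more to implement.

In database design, use clustering on the one hand and grouping on the other. At the detail level apply the principles of database optimization as far as possible; the schema and the data-layer code should avoid creating temporary tables and causing deadlocks. Database optimization principles are easy to find online, and a bit of googling will solve most problems. The choice of database can follow your own habits: Oracle, MySQL, and so on. Oracle does not solve every problem, and MySQL does not mean a small application; whatever fits is best.

Everything above is software-based performance design. A good mix of hardware can also cut time costs and development and maintenance costs, but we will not go into that here. Website architecture is an integrated piece of engineering. The design has to weigh performance, scalability, hardware cost, time cost, and more. Working out a suitable plan from the business positioning and the available money, time, and people is hard, but with thought and practice you eventually build a design philosophy of your own that guides the design work and lays a solid foundation for the site's growth.…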
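To make the two DNS ideas in this excerpt concrete (answers that depend on the client's source network, as with bind9 views, and rotation over several A records), here is a small Python sketch. It only imitates the decision a DNS server would make. Every network range and server address in it is a placeholder taken from the reserved documentation ranges, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical CIDR lists; real ones would come from the carriers or from measurement.
REGION_NETWORKS = {
    "south": [ipaddress.ip_network("203.0.113.0/24")],
    "north": [ipaddress.ip_network("198.51.100.0/24")],
}
# Hypothetical front-end servers per region.
REGION_SERVERS = {
    "south": ["192.0.2.10", "192.0.2.11"],
    "north": ["192.0.2.20", "192.0.2.21"],
}
DEFAULT_REGION = "south"  # used when the client matches no list


def region_for(client_ip):
    """Classify a client by source address, like a view's match clause."""
    address = ipaddress.ip_address(client_ip)
    for region, networks in REGION_NETWORKS.items():
        if any(address in network for network in networks):
            return region
    return DEFAULT_REGION


def resolve(client_ip, request_count):
    """Pick a server in the client's region, rotating over the candidates."""
    servers = REGION_SERVERS[region_for(client_ip)]
    return servers[request_count % len(servers)]


for n in range(3):
    print(resolve("198.51.100.7", n))
```

A real deployment would express the same mapping in the DNS server's configuration; the sketch is only meant to show why an accurate IP list matters, since any client missing from the lists silently falls into the default region.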
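Startup websites and open-source software: 一个藏袍
Startup websites and open-source software. An earlier article touched on open-source software, but mainly from the operations angle, looking at system-level software such as bind and memcached. Here we discuss the open-source software used to build a startup site's application (phpbb or phparticle, for example), running on Linux, MySQL, Apache, PHP, Java, and so on.

A site in its startup phase usually adopts a fairly simple architecture or uses mature open-source software directly. The benefit is speed: almost no development is needed. Buy hosting and a domain, download the software, set it up, and in half a day a brand-new site is open. Early on this saves a great deal of time and development cost. Building on open-source software also has limitations, though, and those are what we want to study, so that they can be avoided as far as possible when choosing the software and in the maintenance that follows.

One limitation is that open-source software usually exists only in mature areas. For an innovative project it is hard to find a suitable package, and there is no good answer; if you insist on open source you generally take the closest match and modify it. There are in fact plenty of open-source projects now, and sf.net has all kinds. When choosing, prefer a package whose architecture is fairly simple. Simpler is not always better, but it must be simple and clear at a glance; avoid very advanced features, because Internet applications do not need complicated frameworks. There are two reasons. First, framework complexity exists to give better extensibility and cleaner layering, yet the Internet application we are building usually has a much narrower scope than the open-source package was designed for, so the design can feel overdone. Second, the overly complex inheritance hierarchies that come from chasing perfect layering add to the maintenance burden of the whole system. An application needs only three layers: the data (entity) layer, the business logic layer, and the presentation layer. An overly complex design tends to lower development efficiency, raise maintenance cost, and make causes hard to find when performance problems or emergencies arise.

The other limitation is that later maintenance and further development of open-source software can be a problem. This is not absolute; it depends on whether the package's architecture is clean, sensible, and extensible. Small changes, such as adding a user attribute or an article attribute, generally cause no trouble, but some requirements are not so easy. For example, at some stage the site may want to extend its product line: it used to offer a forum plus a cms and now wants to add a shop. The user system then becomes a problem, and solving it is no longer a matter of patching the forum or the cms. We have to think at a higher level: do we need a site-wide user authentication system with single sign-on, so that users move between products seamlessly and stay logged in? Since most of the site's initial user data probably lives in the forum, separating it out is troublesome, and doing so without disturbing the forum's existing operation is a real headache. After a period of operation, unless the design is exceptionally good and well maintained, the forum will be full of assorted messy calls on user information, made directly against the database, so the amount of code to change if the user data is moved cannot be ignored. An alternative is to copy the user data out, make the new user database the master, and keep the forum's copy in step through a synchronous or asynchronous mechanism. The best solution is to choose, at selection time, a package with a well-encapsulated data layer whose sql is not scattered everywhere, and then to preserve that style during maintenance by putting every database operation in the data layer or entity layer. However the data is then extended, the code is easy to change and the upper layers are largely unaffected.

Startup sites usually give little thought to access speed. They buy hosting or colocate a server, set up the application, and start running; the problem gets real attention only when a severe speed bottleneck is reached. In fact the speed problem exists from the very beginning and evolves as the site grows. The most basic requirement of a site is reasonably fast access; without speed, even the best content or service never gets through. So access speed has to be considered from the start, whether you use open-source software or write your own, and the data layer should use SQL correctly and efficiently. SQL syntax is fairly complex, and once the different possible implementations in the application layer are taken into account there may be several ways to reach the same result, only one of which is the most efficient; usually the efficient SQL is the simplest one. The problem may not be obvious early on, but when traffic grows it can become the main performance bottleneck, and the assorted disorderly SQL will drive people mad. If it was neglected early there are still remedies later; they may not fix everything, but they can still improve performance very effectively. Reading MySQL's slow query log is the simplest method: record queries that run for more than 1 second, analyze them, add the indexes that are missing, and simplify the SQL that should be simpler. You can also use SHOW PROCESSLIST to see deadlocked processes on the database server and so pin down the SQL statements causing trouble. Some tuning of the database configuration file also improves performance considerably; there are plenty of articles on this online, so we will not go into it here.

If the database still has performance problems after all this, it is time to consider multiple servers, because one server can no longer cope; my earlier articles covered this, so again we will not go into it here. Other ways of solving speed problems cannot be implemented inside the application alone. They require designing the system from a higher vantage point, taking in server and network architecture and the interplay of various system-level software; again, not covered here. A well-designed and well-implemented application + middleware + a well-designed distributed database + good system configuration + a good server and network structure can support a fairly large site. Together with the previous articles, this more or less completes the picture of how a small site grows into a large one. The process is full of hardship and fun, and it can be gradual; taking the initiative, thinking ahead, and reducing firefighting make it somewhat easier.…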
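Design for a large interactive web2.0 site - jim_yeejee的专栏 - CSDNBlog
Implementation plan for a large interactive SNS site; implementation plan for a large interactive web2.0 site
A first look at techniques for building large Web2.0 sites - guxianga - CSDNBlog
The cache tier also serves as a staging post for data that does not need to be written to the database, such as temporary files created to track user sessions. Benedetto admits this is an area where he still has some catching up to do,
Architecture of high-concurrency, high-traffic websites
First, at the level of the whole network, it discusses how cdn, mirroring, and regional DNS resolution help with load balancing and compares their strengths and weaknesses. Next, at the LAN level, it examines and compares layer-4 switching, including the hardware solution F5 and the software solution LVS. Then, for a single server, it focuses on socket tuning, disk-level caching, memory-level caching, balancing cpu and io (deploying compute-bound programs alongside read/write-bound ones), read/write splitting, and so on. At the application layer it introduces techniques commonly used in industry and the reasons for choosing them. It selects representative web server software, database software, table storage
High-performance website optimization - 人月 - CSDNBlog
Website load balancing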
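Using memcache in a blog system
http://hi.baidu.com/gowtd/blog/item/4fa90f2353f5444e935807fa.html
I have been working with memcache for less than 4-5 days, and most of that time went into finding a simple, effective way to fit the memcache interface into our blog application. The post module already used memcache and performed well in real tests (a very high cache hit rate, which has a lot to do with the small total volume), so the photo album module also began to support memcache in its dao. But the original wrapper around the memcache interface was not abstract enough: using memcache increased the amount of code in the post module's dao by 50%, and the code looked very messy. Applied to the album module, whose logic is more complex, the dao layer grew larger still, and because memcache was not yet well understood the album dao was riddled with bugs. In the first album version with memcache support, performance dropped by half with memcache turned off, and with memcache turned on there was no gain, if anything a slight drop. The code was also messier, with poor readability and maintainability. There was nothing for it but to rewrite the album dao myself. After many rounds of discussion with my colleague taotao on Thursday and Friday, the new dao was finally finished. The code is much shorter and the memcache interface is somewhat abstracted (although it is still a fairly ugly interface); nearly every dao method now fits in 10-15 lines. Readability is much better, and tedious management such as secondary indexes is hidden away. Next week it needs to go onto the server for testing, and of course I still have to push taotao to make the interface prettier.
I had today off and read the memcache client and server code. It is very simple stuff, but I still found a few things in it worth thinking about.
1. memcache's support for java objects. The memcache client has some optimizations for java objects, mainly for primitive types. The memcache server knows nothing at all about the types of the data it stores; it is not even an object store. All it sees is blocks of memory holding data. On the client side, putting objects in strictly according to java's OO thinking is somewhat inefficient. To put a boolean value into the cache, for example, a java client would normally create a Boolean object, serialize it, and write it to the server. That wastes server memory: a Boolean object needs some 40 bytes, and the server has to allocate 64 bytes to store it, when a boolean value really needs only 1 byte. memcache does this well. It compresses primitive types, so the same boolean value is stored in 2 bytes and written to the server: the first byte holds the value type and the second the actual value. This is an approach well worth recommending.
For other objects, memcache uses generic object serialization and rebuilds the object when it is fetched. The advantage is simplicity: the programmer does not have to think about rebuilding the object and relies on java's own facilities. I believe, however, that this simple method is unnecessary in our application and that other approaches can improve efficiency and performance. For example, suppose a Photo object is to be put into the cache and its serialized size is 139 bytes (very common; in practice it is larger). Under memcache's tiered memory allocation it will actually be given 256 bytes of memory. If the object is 260 bytes, the system has to allocate 512 bytes, nearly half of which goes unused. And if you look at the serialized object, much of it is information used only to rebuild the object in java. If we followed memcache's treatment of primitives and had the Photo object implement a specific interface that can initialize a photo object from a string, we would save the type information stored on the server and reduce memory use. This does require every object we put in the cache to implement such an interface, which limits memcache's generality. In our blog application, though, few kinds of object go into the cache, so that generality can be sacrificed.
2. Is it appropriate to store blog objects and Photo objects together in the blog system? As I see it, a Blog object is a giant next to a Photo object. A blog object holding a long essay takes a large block of memory, and blog objects vary in size, some large and some small, whereas Photo objects are fairly uniform. Under memcache's slab allocation, once no more memory can be allocated, objects are evicted by LRU from the slab that corresponds to the size of the object being stored. Imagine a cache server holding many small Photo objects and some large Blog objects, which at first was busy mostly storing Photo objects. Clearly, once the system reaches a steady state, storing a new Blog object is more likely to evict an object from its slab, because most of the memory has been carved up by countless small Photo objects and Blog objects get only a small share. The system will not take memory back from Photo objects to give to Blog objects, because the two largely live in different slabs. In other words, on the cache server the average lifetime of a blog object is shorter than that of a Photo; blog objects are evicted more readily, so their miss rate is higher than that of photo objects. My idea is to store blog objects and photo objects separately, so that each cache server holds objects of roughly the same size, which is probably somewhat better. This can be done by taking object size into account when choosing the cache server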
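The memory figures quoted above (40 bytes stored in 64, 139 in 256, 260 in 512) are consistent with chunk sizes that double from one slab class to the next. The Python sketch below reproduces that arithmetic under exactly that assumption; the function names are hypothetical, and a real memcached server derives its slab classes from a configurable growth factor, so actual numbers will differ.

```python
def chunk_size(item_bytes, smallest=64):
    """Return the chunk an item lands in when slab classes double in size."""
    size = smallest
    while size < item_bytes:
        size *= 2
    return size


def wasted_fraction(item_bytes):
    """Share of the allocated chunk that the item does not use."""
    chunk = chunk_size(item_bytes)
    return (chunk - item_bytes) / chunk


for n in (40, 139, 260):
    print(n, chunk_size(n), round(wasted_fraction(n), 3))
```

For the 260-byte object the unused share comes to 252/512, which is the "nearly half" the author points to and the reason a more compact custom encoding can pay off.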
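 A brief analysis of CommunityServer performance problems
Foreword
Many people who use CommunityServer (CS below) may find that as the data grows it gets slower and slower and uses more and more resources; on an underpowered server it is a nightmare. I have only just come out of that nightmare myself, so here is a short account of what causes CS's various performance problems. (I am no database specialist, so if anything here is wrong, please point it out. Many thanks!)
I forgot to introduce myself: I am 宝玉 (Baoyu). I used to work on the localization of Asp.net Forums and CommunityServer, and the unofficial community of my alma mater, Northwestern Polytechnical University ( http://www.openlab.net.cn), runs on CS. Someone will accuse me of advertising; actually it is anti-piracy, a trick I learned from 郭安定, haha!
Analysis of the performance problems
Multi-site support of marginal value
One thing that hurts performance is the multi-site feature. It may well be a nice idea: one database, and each domain name gets a completely independent site. But is it really useful to the great majority of users? Leaving aside how useful it is, it certainly affects performance. At system start-up all site settings have to be loaded first, which is one reason the first visit to CS is so slow. Then most queries have to carry the SettingId column, and the index on that column is not well designed. For queries over large data sets, without sensible indexes, one extra condition can have a huge impact on performance.
Centralized storage of content data
Most systems store large volumes of content separately wherever they can (the Fetion system's user store, for instance, is split across databases ^_^), and databases have dedicated partitioning schemes, all to improve performance and retrieval efficiency. Because of its architecture, CS keeps everything related to content management (forums, blogs, photo albums, guestbooks, and so on) in cs_Groups (groups), cs_Section (sections), cs_Threads (thread index), and cs_Posts (content). This architecture makes the code very convenient to write, but it is an out-and-out performance killer and the most fundamental reason CS is slow. For example, suppose my forum has 1,000,000 rows, my blogs 50,000, and my albums 100,000. To fetch the latest blog posts I have to search those 1,150,000 rows for matches, with conditions such as SettingsId and ApplicationType added to tell which site and which kind of data a row belongs to. Once the data grows this is bound to become a nightmare: query response time climbs from a few seconds to tens of seconds to Sql timeouts.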
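The effect described above, one shared content table filtered by site and application type, can be tried out on a small scale. The Python sketch below uses SQLite purely as a stand-in; the table `posts`, its columns, and the index name are hypothetical and are not the CS schema. It compares the query plan for "latest blog posts" before and after adding a composite index that covers the filter columns and the sort column.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE posts (
           post_id INTEGER PRIMARY KEY,
           settings_id INTEGER NOT NULL,       -- which site
           application_type INTEGER NOT NULL,  -- hypothetical: 0 forum, 1 blog, 2 gallery
           post_date INTEGER NOT NULL,
           subject TEXT
       )"""
)
rows = [(1, i % 3, i, f"post {i}") for i in range(3000)]
db.executemany(
    "INSERT INTO posts (settings_id, application_type, post_date, subject) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

LATEST_BLOG_POSTS = (
    "SELECT post_id FROM posts WHERE settings_id = ? AND application_type = ? "
    "ORDER BY post_date DESC LIMIT 10"
)


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql, params)]


before = plan(LATEST_BLOG_POSTS, (1, 1))
db.execute(
    "CREATE INDEX idx_posts_site_app_date "
    "ON posts (settings_id, application_type, post_date)"
)
after = plan(LATEST_BLOG_POSTS, (1, 1))
print(before)
print(after)
```

Without the index the plan has to walk the whole table and sort; with it, the engine can seek straight to the rows for one site and one application type in date order. On SQL Server the same idea would be checked with its own execution plans rather than with this SQLite output.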
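Over-reliance on caching
A cache is a fine thing. It greatly reduces database access and is indispensable for performance in asp.net programs. But when you are designing a system and happily using caches everywhere, have you ever asked what happens when the cache expires, or when it grows too large? CS users will know the feeling: CS is very slow just after it starts, or suddenly becomes very slow in use. That is because a lot of data has not yet been loaded into the cache: site settings, user profiles, the Groups collection, the Sections collection, and so on. On an underpowered server, loading all of this adds up to a long wait, and if a query for the latest forum posts or unanswered posts happens to arrive at the same moment, the nightmare begins. Then it comes down to luck, namely whether you are the first visitor after the application pool restarted o(∩_∩)o. CS's caching strategy is not fine-grained enough. It generally caches one whole collection at a time (the latest forum posts collection, for example), so the cache has to be refreshed often, the cached data tends to be large, and memory use climbs quickly. Fast-growing memory in turn makes the application pool restart frequently. So what should be CS's strength in caching turns into a weakness and keeps server resource usage high.
CCS makes matters worse
As I said, I worked on localizing CS and added quite a few localization features. Because my database knowledge was lacking at the time, some of that work made performance worse. The featured-post function is one example: the ValuedLevel column, which marks whether a post is featured (and at what level), has no index, so lookups are slow when the data is large. I no longer work on CCS and cannot fix these performance problems, so all I can do is apologize.
Afterword
How can it be fixed?
The simplest way is to wait for upgrades. I believe later versions of CS will get steadily stronger and these problems will be solved step by step. If you cannot wait, you have to do it yourself: use Sql Profiler to monitor Sql execution, find the queries that hurt performance, and optimize them one by one.
I said earlier that my CS performance nightmare is over, and some readers will ask how. Let me leave a teaser. In 2005 I began thinking about how to build a high-performance system similar to CS. I started the design in early 2006 and then did the actual development in my spare time. It has now reached a first milestone, with a qualitative leap in performance. I will gradually share what I learned about its design and performance tuning in blog posts.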
 Digg PHP's Scalability and Performance
Monday April 10, 2006 9:28AM
by Brian Fioca in Technical
Several weeks ago there was a notable bit of controversy over some comments made by James Gosling, father of the Java programming language. He has since addressed the flame war that erupted, but the whole ordeal got me thinking seriously about PHP and its scalability and performance abilities compared to Java. I knew that several hugely popular Web 2.0 applications were written in scripting languages like PHP, so I contacted Owen Byrne - Senior Software Engineer at digg.com to learn how he addressed any problems they encountered during their meteoric growth. This article addresses the all-too-common false assumptions about the cost of scalability and performance in PHP applications.
At the time Gosling’s comments were made, I was working on tuning and optimizing the source code and server configuration for the launch of Jobby, a Web 2.0 resume tracking application written using the WASP PHP framework. I really hadn’t done any substantial research on how to best optimize PHP applications at the time. My background is heavy in the architecture and development of highly scalable applications in Java, but I realized there were enough substantial differences between Java and PHP to cause me concern. In my experience, it was certainly faster to develop web applications in languages like PHP; but I was curious as to how much of that time savings might be lost to performance tuning and scaling costs. What I found was both encouraging and surprising.
What are Performance and Scalability?
Before I go on, I want to make sure the ideas of performance and scalability are understood. Performance is measured by the output behavior of the application. In other words, performance is whether or not the app is fast. A good performing web application is expected to render a page in around or under 1 second (depending on the complexity of the page, of course). Scalability is the ability of the application to maintain good performance under heavy load with the addition of resources. For example, as the popularity of a web application grows, it can be called scalable if you can maintain good performance metrics by simply making small hardware additions. With that in mind, I wondered how PHP would perform under heavy load, and whether it would scale well compared with Java.
Hardware Cost
My first concern was raw horsepower. Executing scripting language code is more hardware intensive because the code isn’t compiled. The hardware we had available for the launch of Jobby was a single hosted Linux server with a 2GHz processor and 1GB of RAM. On this single modest server I was going to have to run both Apache 2 and MySQL. Previous applications I had worked on in Java had been deployed on 10-20 application servers with at least 2 dedicated, massively parallel, ultra expensive database servers. Of course, these applications handled traffic in the millions of hits per month.
To get a better idea of what was in store for a heavily loaded PHP application, I set up an interview with Owen Byrne, cofounder and Senior Software Engineer at digg.com. From talking with Owen I learned digg.com gets on the order of 200 million page views per month, and they’re able to handle it with only 3 web servers and 8 small database servers (I’ll discuss the reason for so many database servers in the next section). Even better news was that they were able to handle their first year’s worth of growth on a single hosted server like the one I was using. My hardware worries were relieved. The hardware requirements to run high-traffic PHP applications didn’t seem to be more costly than for Java.
Database Cost
Next I was worried about database costs. The enterprise Java applications I had worked on were powered by expensive database software like Oracle, Informix, and DB2. I had decided early on to use MySQL for my database, which is of course free. I wondered whether the simplicity of MySQL would be a liability when it came to trying to squeeze the last bit of performance out of the database. MySQL has had a reputation for being slow in the past, but most of that seems to have come from sub-optimal configuration and the overuse of MyISAM tables. Owen confirmed that the use of InnoDB for tables for read/write data makes a massive performance difference.
There are some scalability issues with MySQL, one being the need for large amounts of slave databases. However, these issues are decidedly not PHP related, and are being addressed in future versions of MySQL. It could be argued that even with the large amount of slave databases that are needed, the hardware required to support them is less expensive than the 8+ CPU boxes that typically power large Oracle or DB2 databases. The database requirements to run massive PHP applications still weren’t more costly than for Java.
PHP Coding Cost
Lastly, and most importantly, I was worried about scalability and performance costs directly attributed to the PHP language itself. During my conversation with Owen I asked him if there were any performance or scalability problems he encountered that were related to having chosen to write the application in PHP. A bit to my surprise, he responded by saying, “none of the scaling challenges we faced had anything to do with PHP,” and that “the biggest issues faced were database related.” He even added, “in fact, we found that the lightweight nature of PHP allowed us to easily move processing tasks from the database to PHP in order to deal with that problem.” Owen mentioned they use the APC PHP accelerator platform as well as MCache to lighten their database load. Still, I was skeptical. I had written Jobby entirely in PHP 5 using a framework which uses a highly object oriented MVC architecture to provide application development scalability. How would this hold up to large amounts of traffic?
My worries were largely related to the PHP engine having to effectively parse and interpret every included class on each page load. I discovered this was just my misunderstanding of the best way to configure a PHP server. After doing some research, I found that by using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator, this was no longer a problem. Any class or script loading overhead was only encountered on the first page load. Subsequent page loads performed comparably to a typical Java application. Making these configuration changes was trivial and generated massive performance gains. With regard to scalability and performance, PHP itself, even PHP 5 with heavy OO, was not more costly than Java.
Conclusion
Jobby was launched successfully on its single modest server and, thanks to links from Ajaxian and TechCrunch, went on to happily survive hundreds of thousands of hits in a single week. Assuming I applied all of my new found PHP tuning knowledge correctly, the application should be able to handle much more load on its current hardware.
Digg is in the process of preparing to scale to 10 times current load. I asked Owen Byrne if that meant an increase in headcount and he said that wasn’t necessary. The only real change they identified was a switch to a different database platform. There doesn’t seem to be any additional manpower cost to PHP scalability either.
It turns out that it really is fast and cheap to develop applications in PHP. Scaling and performance challenges are almost always related to the data layer, and are common across all language platforms. Even as a self-proclaimed PHP evangelist, I was very startled to find out that all of the theories I was subscribing to were true. There is simply no truth to the idea that Java is better than scripting languages at writing scalable web applications. I won’t go as far as to say that PHP is better than Java, because it is never that simple. However, it just isn’t true to say that PHP doesn’t scale, and with the rise of Web 2.0, sites like Digg, Flickr, and even Jobby are proving that large scale applications can be rapidly built and maintained on the cheap, by one or two developers.
 YouTube Architecture

Tue, 07/17/2007 - 20:20 — Todd Hoff
YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?
Information Sources
• Google Video
Platform
• Apache
• Python
• Linux (SuSe)
• MySQL
• psyco, a dynamic python->C compiler
• lighttpd for video instead of Apache
What's Inside?
The Stats
• Supports the delivery of over 100 million videos per day.
• Founded 2/2005
• 3/2006 30 million video views/day
• 7/2006 100 million video views/day
• 2 sysadmins, 2 scalability software architects
• 2 feature developers, 2 network engineers, 1 DBA
Recipe for handling rapid growth

while (true)
{
    identify_and_fix_bottlenecks();
    drink();
    sleep();
    notice_new_bottleneck();
}
This loop runs many times a day.
Web Servers
• NetScaler is used for load balancing and caching static content.
• Run Apache with mod_fastcgi.
• Requests are routed for handling by a Python application server.
• The application server talks to various databases and other information sources to get all the data and formats the HTML page.
• Can usually scale web tier by adding more machines.
• The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.
• Python allows rapid flexible development and deployment. This is critical given the competition they face.
• Usually less than 100 ms page service times.
• Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
• For high CPU intensive activities like encryption, they use C extensions.
• Some pre-generated cached HTML for expensive to render blocks.
• Row level caching in the database.
• Fully formed Python objects are cached.
• Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.
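The last point above, pushing precalculated values into each application server's local memory, can be sketched roughly as follows. This is only an illustration under assumptions: the `LocalCache` and `PrecalcAgent` classes, the `top_videos` calculation, and the in-process "push" are all hypothetical stand-ins, and the source does not describe how YouTube's agent actually worked.

```python
import threading


class LocalCache:
    """In-process cache: reads never leave the application server."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def update(self, values):
        # Called by the agent when it pushes freshly precalculated data.
        with self._lock:
            self._values.update(values)

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)


class PrecalcAgent:
    """Hypothetical agent: watches for changes, precalculates, and sends."""

    def __init__(self, servers, compute):
        self._servers = servers
        self._compute = compute
        self._last = None

    def on_change(self, source_data):
        result = self._compute(source_data)
        if result != self._last:  # only push when the derived value changed
            for cache in self._servers:
                cache.update(result)
            self._last = result


def top_videos(view_counts, n=2):
    # Example of an expensive-to-compute value (assumed for illustration).
    ranked = sorted(view_counts, key=view_counts.get, reverse=True)
    return {"top_videos": ranked[:n]}


# Three application servers, each with its own local cache.
app_servers = [LocalCache() for _ in range(3)]
agent = PrecalcAgent(app_servers, top_videos)
agent.on_change({"a": 10, "b": 50, "c": 30})
print(app_servers[0].get("top_videos"))
```

In a real deployment the push would go over the network to separate machines; the point of the sketch is only that request handling reads from local memory and never triggers the calculation itself.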
Video Serving
• Costs include bandwidth, hardware, and power consumption.
• Each video hosted by a mini-cluster. Each video is served by more than one machine.
• Using a cluster means:
- More disks serving content which means more speed.
- Headroom. If a machine goes down others can take over.
- There are online backups.
• Servers use the lighttpd web server for video:
- Apache had too much overhead.
- Uses epoll to wait on multiple fds.
- Switched from single process to multiple process configuration to handle more connections.
• Most popular content is moved to a CDN (content delivery network):
- CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network.
- CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
• Less popular content (1-20 views per day) uses YouTube servers in various colo sites.
- There's a long tail effect. A video may have only a few plays, but lots of videos are being played. Random disk blocks are being accessed.
- Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior.
- Tune RAID controller and pay attention to other lower level issues to help.
- Tune memory on each machine so there's not too much and not too little.
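The long tail point above can be made concrete with a small simulation. This is a rough sketch, not a model of YouTube's traffic: the catalogue sizes, request counts, and cache size are invented numbers, and a simple LRU cache stands in for whatever caching layer is actually used.

```python
import random
from collections import OrderedDict


def hit_rate(requests, cache_size):
    """Replay requests through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)


rng = random.Random(42)  # fixed seed so the run is repeatable
CACHE_SIZE = 1_000

# "Head" traffic: many requests spread over a few hundred popular videos.
head = [rng.randrange(500) for _ in range(20_000)]
# "Tail" traffic: the same number of requests spread over a million videos.
tail = [rng.randrange(1_000_000) for _ in range(20_000)]

head_rate = hit_rate(head, CACHE_SIZE)
tail_rate = hit_rate(tail, CACHE_SIZE)
print(f"head hit rate: {head_rate:.3f}, tail hit rate: {tail_rate:.3f}")
```

With these assumed numbers the popular set fits entirely in the cache while the tail almost never repeats a video before it is evicted, which is why extra cache memory buys little for tail content.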
Serving Video Key Points
• Keep it simple and cheap.
• Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
• Use commodity hardware. The more expensive the hardware gets, the more expensive everything else gets too (support contracts, for example). You are also less likely to find help on the net.
• Use simple common tools. They use mostly tools built into Linux and layer on top of those.
• Handle random seeks well (SATA, tweaks).
Serving Thumbnails
• Surprisingly difficult to do efficiently.
• There are about 4 thumbnails for each video, so there are a lot more thumbnails than videos.
• Thumbnails are hosted on just a few machines.
• Saw problems associated with serving a lot of small objects:
- Lots of disk seeks and problems with inode caches and page caches at OS level.
- Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
- A high number of requests/sec as web pages can display 60 thumbnails on page.
- Under such high loads Apache performed badly.
- Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
- Tried using lighttpd, but in single-threaded mode it stalled. Ran into problems with multiprocess mode because each process would keep a separate cache.
- With so many images, setting up a new machine took over 24 hours.
- After a reboot, it took 6-10 hours for the cache to warm up enough to avoid going to disk.
• To solve all their problems they started using Google's BigTable, a distributed data store:
- Avoids small file problem because it clumps files together.
- Fast, fault tolerant. Assumes it's working on an unreliable network.
- Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites.
- For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.
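Before the move to BigTable, the notes above mention working around the per-directory file limit by moving to a more hierarchical structure. One common way to do that is to derive subdirectories from a hash of the file's key. The sketch below is an assumption about how such a layout might look; the `/thumbs` root, the two-level fan-out, the use of MD5, and the file naming are all hypothetical and not taken from the source.

```python
import hashlib
from pathlib import PurePosixPath


def thumbnail_path(video_id: str, index: int, root: str = "/thumbs") -> PurePosixPath:
    """Map a thumbnail to root/xx/yy/<video_id>_<index>.jpg.

    The two directory levels come from the first four hex digits of an MD5
    digest, giving 256 * 256 = 65,536 leaf directories, so no single
    directory has to hold millions of files.
    """
    digest = hashlib.md5(video_id.encode("utf-8")).hexdigest()
    return PurePosixPath(root, digest[:2], digest[2:4], f"{video_id}_{index}.jpg")


# All thumbnails of one video land in the same leaf directory.
for i in range(4):
    print(thumbnail_path("abc123", i))
```

The hash is used only to spread files evenly; the same video ID always maps to the same directory, so no lookup table is needed.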
Databases
• The Early Years
- Use MySQL to store meta data like users, tags, and descriptions.
- Served data off a monolithic RAID 10 Volume with 10 disks.
- Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered.
- They went through a common evolution: a single server, then a single master with multiple read slaves, then a partitioned database, and finally a sharding approach.
- Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master.
- Updates cause cache misses which go to disk, where slow I/O causes slow replication.
- Using a replicating architecture you need to spend a lot of money for incremental bits of write performance.
- One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video, so that function should get the most resources. The social networking features of YouTube are less important, so they can be routed to a less capable cluster.
• The later years:
- Went to database partitioning.
- Split into shards with users assigned to different shards.
- Spreads writes and reads.
- Much better cache locality which means less IO.
- Resulted in a 30% hardware reduction.
- Reduced replica lag to 0.
- Can now scale database almost arbitrarily.
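The sharding step above, with users assigned to different shards, can be sketched as a routing function in front of several independent stores. This is a minimal illustration only: the source does not say how YouTube mapped users to shards, so the modulo rule, the shard count, and the in-memory dictionaries standing in for databases are all assumptions.

```python
NUM_SHARDS = 4  # hypothetical shard count


def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Pick the shard that owns all of a user's rows (assumed modulo rule)."""
    return user_id % num_shards


class ShardedStore:
    """Each shard is an independent store; a dict stands in for a database."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, user_id):
        return self.shards[shard_for_user(user_id, len(self.shards))]

    def save_video(self, user_id, video_id, meta):
        # Writes for one user always hit the same shard, spreading write load.
        self._shard(user_id).setdefault(user_id, {})[video_id] = meta

    def videos_for(self, user_id):
        # Reads touch exactly one shard, which keeps its cache "hot" for
        # its own subset of users.
        return self._shard(user_id).get(user_id, {})


store = ShardedStore()
store.save_video(7, "v1", {"title": "cats"})
store.save_video(8, "v2", {"title": "dogs"})
print(shard_for_user(7), shard_for_user(8), store.videos_for(7))
```

A plain modulo rule makes adding shards later painful because most users would move; a lookup table from user to shard is a common alternative when rebalancing matters.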
Data Center Strategy
• Used managed hosting providers at first. Living off credit cards, so it was the only way.
• Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
• So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
• Use 5 or 6 data centers plus the CDN.
• Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
• Video is bandwidth dependent, not really latency dependent. It can come from any colo.
• For images latency matters, especially when you have 60 images on a page.
• Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.
Lessons Learned
• Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
• Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
• Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
• Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
• Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance.
• Constant iteration on bottlenecks:
- Software: DB, caching
- OS: disk I/O
- Hardware: memory, RAID
• You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.
Jesse • April 10th
Justin Silverton at Jaslabs has a supposed list of 10 tips for optimizing MySQL queries. I couldn't read this and let it stand because this list is really, really bad. Some guy named Mike noted this, too. So in this entry I'll do two things: first, I'll explain why his list is bad; second, I'll present my own list which, hopefully, is much better. Onward, intrepid readers!
Why That List Sucks
1. He's swinging for the top of the trees
The rule in any situation where you want to optimize some code is that you first profile it and then find the bottlenecks. Mr. Silverton, however, aims right for the tippy top of the trees. I'd say 60% of database optimization is properly understanding SQL and the basics of databases. You need to understand joins vs. subselects, column indices, how to normalize data, etc. The next 35% is understanding the performance characteristics of your database of choice. COUNT(*) in MySQL, for example, can either be almost free or painfully slow depending on which storage engine you're using. Other things to consider: under what conditions does your database invalidate caches, when does it sort on disk rather than in memory, when does it need to create temporary tables, etc. The final 5%, where few ever need venture, is where Mr. Silverton spends most of his time. Never once in my life have I used SQL_SMALL_RESULT.
2. Good problems, bad solutions
There are cases when Mr. Silverton does note a good problem. MySQL will indeed use a dynamic row format if a table contains variable-length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table. The following schema represents this idea:
CREATE TABLE posts (
    id int UNSIGNED NOT NULL AUTO_INCREMENT,
    author_id int UNSIGNED NOT NULL,
    created timestamp NOT NULL,
    PRIMARY KEY(id)
);

CREATE TABLE posts_data (
    post_id int UNSIGNED NOT NULL,
    body text,
    PRIMARY KEY(post_id)
);
3. That's just…yeah
Some of his suggestions are just mind-boggling, e.g., "remove unnecessary parentheses." It really doesn't matter whether you write SELECT * FROM posts WHERE (author_id = 5 AND published = 1) or SELECT * FROM posts WHERE author_id = 5 AND published = 1. Any decent DBMS is going to optimize these away. This level of detail is akin to wondering, when writing a C program, whether the post-increment or pre-increment operator is faster. Really, if that's where you're spending your energy, it's a surprise you've written any code at all.
My list
Let's see if I fare any better. I'm going to start from the most general.
4. Benchmark, benchmark, benchmark!
You're going to need numbers if you want to make a good decision. What queries are the worst? Where are the bottlenecks? Under what circumstances am I generating bad queries? Benchmarking will let you simulate high-stress situations and, with the aid of profiling tools, expose the cracks in your database configuration. Tools of the trade include supersmack, ab, and SysBench. These tools either hit your database directly (e.g., supersmack) or simulate web traffic (e.g., ab).
5. Profile, profile, profile!
So, you're able to generate high-stress situations, but now you need to find the cracks. This is what profiling is for. Profiling enables you to find the bottlenecks in your configuration, whether they be in memory, CPU, network, disk I/O, or, what is more likely, some combination of all of them.
The very first thing you should do is turn on the MySQL slow query log and install mtop. This will give you access to information about the absolute worst offenders. Have a ten-second query ruining your web application? These guys will show you the query right off.
After you've identified the slow queries you should learn about the MySQL internal tools, like EXPLAIN, SHOW STATUS, and SHOW PROCESSLIST. These will tell you what resources are being spent where, and what side effects your queries are having, e.g., whether your heinous triple-join subselect query is sorting in memory or on disk. Of course, you should also be using your usual array of command-line profiling tools like top, procinfo, vmstat, etc. to get more general system performance information.
6. Tighten Up Your Schema
Before you even start writing queries you have to design a schema. Remember that the memory requirements for a table are going to be around #entries * size of a row. Unless you expect every person on the planet to register roughly 2.8 billion times on your website, you do not in fact need to make your user_id column a BIGINT. Likewise, if a text field will always be a fixed length (e.g., a US zipcode, which always has a canonical representation of the form "XXXXX-XXXX") then a VARCHAR declaration just adds a superfluous byte for every row.
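The "#entries * size of a row" rule of thumb is easy to put numbers on. The sketch below uses MySQL's documented column sizes (4 bytes for INT, 8 bytes for BIGINT, and a 1-byte length prefix for short VARCHAR values); the row count of ten million and the single-byte character set are assumptions chosen purely for illustration.

```python
ROWS = 10_000_000          # hypothetical table size

INT_BYTES = 4              # MySQL INT
BIGINT_BYTES = 8           # MySQL BIGINT
ZIP_CHARS = 10             # "XXXXX-XXXX", assuming a single-byte character set
VARCHAR_LEN_PREFIX = 1     # length byte stored with each short VARCHAR value


def column_bytes(rows: int, bytes_per_value: int) -> int:
    """Approximate storage for one column: #entries * size of a value."""
    return rows * bytes_per_value


# Cost of declaring user_id as BIGINT instead of INT.
id_saving = column_bytes(ROWS, BIGINT_BYTES) - column_bytes(ROWS, INT_BYTES)

# Cost of declaring a fixed-length zipcode as VARCHAR(10) instead of CHAR(10).
zip_saving = (column_bytes(ROWS, ZIP_CHARS + VARCHAR_LEN_PREFIX)
              - column_bytes(ROWS, ZIP_CHARS))

print(f"INT vs BIGINT:   {id_saving / 1e6:.0f} MB saved")
print(f"CHAR vs VARCHAR: {zip_saving / 1e6:.0f} MB saved")
```

These figures cover the column data only; any index on the column repeats the cost, which is why key columns are the first place to look.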
Some people poo-poo database normalization, saying it produces unnecessarily complex schemas. However, proper normalization results in a minimization of redundant data. Fundamentally that means a smaller overall footprint at the cost of performance, the usual performance/memory tradeoff found everywhere in computer science. The best approach, IMO, is to normalize first and denormalize where performance demands it. Your schema will be more logical and you won't be optimizing prematurely.
7. Partition Your Tables
Often you have a table in which only a few columns are accessed frequently. On a blog, for example, one might display entry titles in many places (e.g., a list of recent posts) but only ever display teasers or the full post bodies once on a given page. Vertical partitioning helps:
CREATE TABLE posts (
    id int UNSIGNED NOT NULL AUTO_INCREMENT,
    author_id int UNSIGNED NOT NULL,
    title varchar(128),
    created timestamp NOT NULL,
    PRIMARY KEY(id)
);

CREATE TABLE posts_data (
    post_id int UNSIGNED NOT NULL,
    teaser text,
    body text,
    PRIMARY KEY(post_id)
);
The above represents a situation where one is optimizing for reading. Frequently accessed data is kept in one table while infrequently accessed data is kept in another. Since the data is now partitioned, the infrequently accessed data takes up less memory. You can also optimize for writing: frequently changed data can be kept in one table, while infrequently changed data can be kept in another. This allows more efficient caching since MySQL no longer needs to expire the cache for data which probably hasn't changed.
8. Don't Overuse Artificial Primary Keys
Artificial primary keys are nice because they can make the schema less volatile. If we stored geography information in the US based on zip code, say, and the zip code system suddenly changed we'd be in a bit of trouble. On the other hand, many times there are perfectly fine natural keys. One example would be a join table for many-to-many relationships. What not to do:
CREATE TABLE posts_tags (
    relation_id int UNSIGNED NOT NULL AUTO_INCREMENT,
    post_id int UNSIGNED NOT NULL,
    tag_id int UNSIGNED NOT NULL,
    PRIMARY KEY(relation_id),
    UNIQUE INDEX(post_id, tag_id)
);
Not only is the artificial key entirely redundant given the column constraints, but the number of post-tag relations is now limited by the maximum value of an integer. Instead one should do:
CREATE TABLE posts_tags (
    post_id int UNSIGNED NOT NULL,
    tag_id int UNSIGNED NOT NULL,
    PRIMARY KEY(post_id, tag_id)
);
9. Learn Your Indices
Often your choice of indices will make or break your database. For those who haven't progressed this far in their database studies, an index is a sort of hash. If we issue the query SELECT * FROM users WHERE last_name = 'Goldstein' and last_name has no index then your DBMS must scan every row of the table and compare it to the string 'Goldstein.' An index is usually a B-tree (though there are other options) which speeds up this comparison considerably.
You should probably create indices for any field on which you are selecting, grouping, ordering, or joining. Obviously each index requires space proportional to the number of rows in your table, so too many indices wind up taking more memory. You also incur a performance hit on write operations, since every write now requires that the corresponding index be updated. There is a balance point which you can uncover by profiling your code. This varies from system to system and implementation to implementation.
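The difference between a full scan and an index lookup is easy to see by asking the database for its query plan. The sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in, purely so that it runs without a server; SQLite's EXPLAIN QUERY PLAN is not MySQL's EXPLAIN, and the exact wording of its output varies between SQLite versions. The table contents and the index name `idx_users_last_name` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany(
    "INSERT INTO users (last_name) VALUES (?)",
    [(f"name{i}",) for i in range(1000)] + [("Goldstein",)],
)

QUERY = "SELECT * FROM users WHERE last_name = 'Goldstein'"


def plan(connection, sql):
    """Return SQLite's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


before = plan(conn, QUERY)   # no index: every row is scanned
conn.execute("CREATE INDEX idx_users_last_name ON users (last_name)")
after = plan(conn, QUERY)    # index available: direct lookup

print("before:", before)
print("after: ", after)
```

On MySQL the equivalent check is to prefix the query with EXPLAIN and look at which key, if any, the optimizer reports using.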
10. SQL is Not C
C is the canonical procedural programming language and the greatest pitfall for a programmer looking to show off his database-fu is that he fails to realize that SQL is not procedural (nor is it functional or object-oriented, for that matter). Rather than thinking in terms of data and operations on data one must think of sets of data and relationships among those sets. This usually crops up with the improper use of a subquery:
SELECT a.id,
    (SELECT MAX(created)
     FROM posts
     WHERE author_id = a.id)
    AS latest_post
FROM authors a
Since this subquery is correlated, i.e., references a table in the outer query, one should convert the subquery to a join.
SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
INNER JOIN posts p
    ON (a.id = p.author_id)
GROUP BY a.id
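One detail is worth checking when making this conversion: the correlated subquery returns a row for every author (with NULL for authors who have no posts), while the INNER JOIN version drops those authors. The sketch below runs both queries, plus a LEFT JOIN variant, against a tiny made-up dataset. It uses SQLite via Python's standard `sqlite3` module as a stand-in so it runs without a server; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, created TEXT);
INSERT INTO authors (id) VALUES (1), (2), (3);   -- author 3 has no posts
INSERT INTO posts (author_id, created) VALUES
    (1, '2007-01-01'), (1, '2007-03-01'), (2, '2007-02-01');
""")

SUBQUERY = """
SELECT a.id,
       (SELECT MAX(created) FROM posts WHERE author_id = a.id) AS latest_post
FROM authors a
ORDER BY a.id
"""

JOIN_TEMPLATE = """
SELECT a.id, MAX(p.created) AS latest_post
FROM authors a
{kind} JOIN posts p ON (a.id = p.author_id)
GROUP BY a.id
ORDER BY a.id
"""

subquery_rows = conn.execute(SUBQUERY).fetchall()
inner_rows = conn.execute(JOIN_TEMPLATE.format(kind="INNER")).fetchall()
left_rows = conn.execute(JOIN_TEMPLATE.format(kind="LEFT")).fetchall()

print("subquery:  ", subquery_rows)
print("inner join:", inner_rows)
print("left join: ", left_rows)
```

If authors without posts must still appear in the result, the LEFT JOIN form is the one that matches the original subquery.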
11. Understand your engines
MySQL has two primary storage engines: MyISAM and InnoDB. Each has its own performance characteristics and considerations. In the broadest sense MyISAM is good for read-heavy data and InnoDB is good for write-heavy data, though there are cases where the opposite is true. The biggest gotcha is how the two differ with respect to the COUNT function.
MyISAM keeps an internal cache of table meta-data like the number of rows. This means that, generally, COUNT(*) incurs no additional cost for a well-structured query. InnoDB, however, has no such cache. For a concrete example, let's say we're trying to paginate a query. If you have a query SELECT * FROM users LIMIT 5,10, let's say, running SELECT COUNT(*) FROM users LIMIT 5,10 is essentially free with MyISAM but takes the same amount of time as the first query with InnoDB. MySQL has a SQL_CALC_FOUND_ROWS option which tells InnoDB to calculate the number of rows as it runs the query, which can then be retrieved by executing SELECT FOUND_ROWS(). This is very MySQL-specific, but can be necessary in certain situations, particularly if you use InnoDB for its other features (e.g., row-level locking, transactions, etc.).
12. MySQL specific shortcuts
MySQL provides many extensions to SQL which help performance in many common use scenarios. Among these are INSERT … SELECT, INSERT … ON DUPLICATE KEY UPDATE, and REPLACE.
I rarely hesitate to use the above since they are so convenient and provide real performance benefits in many situations. MySQL has other keywords which are more dangerous, however, and should be used sparingly. These include INSERT DELAYED, which tells MySQL that it is not important to insert the data immediately (say, e.g., in a logging situation). The problem with this is that under high load situations the insert might be delayed indefinitely, causing the insert queue to balloon. You can also give MySQL index hints about which indices to use. MySQL gets it right most of the time and when it doesn't it is usually because of a bad schema or poorly written query.
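Two of the shortcuts named above, INSERT … SELECT and REPLACE, can be tried without a MySQL server because SQLite accepts the same statements. The sketch below uses Python's standard `sqlite3` module as a stand-in; the `page_hits` and `staging` tables and their contents are invented for the example. INSERT … ON DUPLICATE KEY UPDATE is MySQL-specific syntax and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_hits (
    url  TEXT PRIMARY KEY,
    hits INTEGER NOT NULL DEFAULT 0,
    note TEXT
);
CREATE TABLE staging (url TEXT, hits INTEGER);
INSERT INTO staging VALUES ('/a', 3), ('/b', 5);
""")

# INSERT ... SELECT: copy rows in one statement instead of
# reading them into the application and inserting one by one.
conn.execute("INSERT INTO page_hits (url, hits) SELECT url, hits FROM staging")
conn.execute("UPDATE page_hits SET note = 'front page' WHERE url = '/a'")

# REPLACE: if the key already exists, the old row is deleted and a new
# one inserted, so columns not listed here fall back to their defaults.
conn.execute("REPLACE INTO page_hits (url, hits) VALUES ('/a', 10)")

rows = conn.execute(
    "SELECT url, hits, note FROM page_hits ORDER BY url"
).fetchall()
print(rows)
```

The lost `note` value is the thing to watch for: REPLACE is convenient, but it is a delete followed by an insert rather than an in-place update.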
13. And one for the road…
Last, but not least, read Peter Zaitsev's MySQL Performance Blog if you're into the nitty-gritty of MySQL performance. He covers many of the finer aspects of database administration and performance.
Library
This is a collection of slides, presentations, and videos on topics related to designing high-throughput, scalable, highly available websites that I’ve been collecting for a while.
8/28/2007 Blog Inside Myspace

8/28/2007 FAQ Good Memcached FAQ

8/28/2007 Blog Measuring scalability

8/28/2007 Blog Distributed caching with memcached

8/28/2007 Blog Mailinator stats (all on a single server)

8/28/2007 Blog Architecture of Mailinator

8/25/2007 Slides 74 Building Scalable web architectures

8/25/2007 Slides 42 Typepad Architecture change: Change Your Car’s Tires at 100 mph

8/25/2007 Slides Skype protocol (also talks about p2p connections which is critical for its scalability)

8/20/2007 slides Slashdot’s History of scaling Mysql

8/19/2007 Slides 20 Big Bad Postgres SQL

8/19/2007 Slides 90 Scalable internet architectures

8/19/2007 Slides 59 Production troubleshooting (not related to scalability… but shit happens everywhere)

8/19/2007 Slides 31 Clustered Logging with mod_log_spread

8/19/2007 Slides 86 Understanding and Building HA/LB clusters

8/12/2007 Blog Multi-Master Mysql Replication

8/12/2007 Blog Large-Scale Methodologies for the World Wide Web

8/12/2007 Blog Scaling gracefully

8/12/2007 Blog Implementing Tag cloud - The nasty way

8/12/2007 Blog Normalized Data is for sissies

8/12/2007 Slides APC at facebook

8/6/2007 Video Plenty Of fish interview with its CEO

8/6/2007 Slides PHP scalability myth

8/6/2007 Slides 79 High performance PHP

8/6/2007 Blog Digg: PHP’s scalability and Performance

8/2/2007 Blog Getting Started with Drupal

8/2/2007 Blog 4 Problems with Drupal

8/2/2007 Video 55m Seattle Conference on Scalability: MapReduce Used on Large Data Sets

8/2/2007 Video 60m Seattle Conference on Scalability: Scaling Google for Every User

8/2/2007 Video 53m Seattle Conference on Scalability: VeriSign’s Global DNS Infrastructure

8/2/2007 Video 53m Seattle Conference on Scalability: YouTube Scalability

8/2/2007 Video 59m Seattle Conference on Scalability: Abstractions for Handling Large Datasets

8/2/2007 Video 55m Seattle Conference on Scalability: Building a Scalable Resource Management

8/2/2007 Video 44m Seattle Conference on Scalability: SCTPs Reliability and Fault Tolerance

8/2/2007 Video 27m Seattle Conference on Scalability: Lessons In Building Scalable Systems

8/2/2007 Video 41m Seattle Conference on Scalability: Scalable Test Selection Using Source Code

8/2/2007 Video 53m Seattle Conference on Scalability: Lustre File System

8/2/2007 Slides 16 Technology at Digg.com

8/2/2007 Blog Extreme Makeover: Database or MySQL@YouTube

8/2/2007 Slides 60 “Real Time

8/2/2007 Blog Mysql at Google

8/2/2007 Slides 56 Scaling Twitter

8/2/2007 Slides 53 How we build Vox

8/2/2007 Slides 97 High Performance websites

8/2/2007 Slides 101 Beyond the file system design

8/2/2007 Slides 145 Scalable web architectures

8/2/2007 Blog “Build Scalable Web 2.0 Sites with Ubuntu

8/2/2007 Slides 34 Scalability set Amazon’s servers on fire not yours

8/2/2007 Slides 41 Hardware layouts for LAMP installations

8/2/2007 Video 91m Mysql scaling and high availability architectures

8/2/2007 Audio 137 Lessons from Building world’s largest social music platform

8/2/2007 PDF 137 Lessons from Building world’s largest social music platform

8/2/2007 Slides 137 Lessons from Building world’s largest social music platform

8/2/2007 PDF 80 Livejournal’s backend: history of scaling

8/2/2007 Slides 80 Livejournal’s backend: history of scaling

8/2/2007 Slides 26 Scalable Web Architectures (w/ Ruby and Amazon S3)

8/2/2007 Blog Yahoo! bookmarks uses symfony

8/2/2007 Slides Getting Rich with PHP 5

8/2/2007 Audio Getting Rich with PHP 5

8/2/2007 Blog Scaling Fast and Cheap - How We Built Flickr

8/2/2007 News Open source helps Flickr share photos

8/2/2007 Slides 41 Flickr and PHP

8/2/2007 Slides 30 Wikipedia: Cheap and explosive scaling with LAMP

8/2/2007 Blog YouTube Scalability Talk

8/2/2007 High Order Bit: Architecture for Humanity

8/2/2007 PDF Mysql and Web2.0 companies

8/3/2007 36 Building Highly Scalable Web Applications

8/3/2007 Introduction to hadoop

8/3/2007 webpage The Hadoop Distributed File System: Architecture and Design

8/3/2007 Interpreting the Data: Parallel Analysis with Sawzall

8/3/2007 PDF ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval

8/3/2007 PDF SEDA: An Architecture for well conditioned scalable internet services

8/3/2007 PDF A scalable architecture for Global web service hosting service

8/3/2007 Meet Hadoop

8/3/2007 Blog Yahoo’s Hadoop Support

8/3/2007 Blog Running Hadoop MapReduce on Amazon EC2 and Amazon S3

8/3/2007 53m LH*RSP2P : A Scalable Distributed Data Structure for P2P Environment

8/3/2007 90m Scaling the Internet routing table with Locator/ID Separation Protocol (LISP)

8/3/2007 Hadoop Map/Reduce

8/3/2007 Slides Hadoop distributed file system

8/3/2007 Video 45m Brad Fitzpatrick - Behind the Scenes at LiveJournal: Scaling Storytime

8/3/2007 Slides Inside LiveJournal’s Backend (April 2004)

8/3/2007 Slides 25 How to scale

8/3/2007 36m Testing Oracle 10g RAC Scalability

8/3/2007 Slides 107 PHP & Performance

8/3/2007 Blog 1217
8/3/2007 45m SQL Performance Optimization

8/3/2007 80m Building_a_Scalable_Software_Security_Practice

8/3/2007 59m Building Large Systems at Google

8/3/2007 Scalable computing with Hadoop

8/3/2007 Slides 37 The Ebay architecture

8/3/2007 PDF Bigtable: A Distributed Storage System for Structured Data

8/3/2007 PDF Fault-Tolerant and scalable TCP splice and web server architecture

8/3/2007 Video BigTable: A Distributed Structured Storage System

8/3/2007 PDF MapReduce: Simplified Data Processing on Large Clusters

8/3/2007 PDF Google Cluster architecture

8/3/2007 PDF Google File System

8/3/2007 Doc Implementing a Scalable Architecture

8/3/2007 News How linux saved Millions for Amazon

8/3/2007 Yahoo experience with hadoop

8/3/2007 Slides Scalable web application using Mysql and Java

8/3/2007 Slides Friendster: scaling for 1 Billion Queries per day

8/3/2007 Blog Lightweight web servers

8/3/2007 PDF Mysql Scale out by application partitioning

8/3/2007 PDF Replication under scalable hashing: A family of algorithms for Scalable decentralized data distribution

8/3/2007 Product Clustered storage revolution

8/3/2007 Blog Early Amazon Series

8/3/2007 Web Wikimedia Server info

8/3/2007 Slides 32 Wikimedia Architecture

8/3/2007 Slides 21 MySpace presentation

8/3/2007 PDF A scalable and fault-tolerant architecture for distributed web resource discovery

8/4/2007 PDF The Chubby Lock Service for Loosely-Coupled Distributed Systems

8/5/2007 Slides 47 Real world Mysql tuning

8/5/2007 Slides 100 Real world Mysql performance tuning

8/5/2007 Slides 63 Learning MogileFS: Building scalable storage system

8/5/2007 Slides Reverse Proxy and Webserver

8/5/2007 PDF Case for Shared Nothing

8/5/2007 Slides 27 A scalable stateless proxy for DBI

8/5/2007 Slides 91 Real world scalability web builder 2006

8/5/2007 Slides 52 Real world web scalability

Friendster Architecture

Thu, 07/12/2007 - 05:18 — Todd Hoff
Friendster is one of the largest social network sites on the web. It emphasizes genuine friendships and the discovery of new people through friends.
Site: http://www.friendster.com/
Information Sources
• Friendster - Scaling for 1 Billion Queries per day
Platform
• MySQL
• Perl
• PHP
• Linux
• Apache
What's Inside?
• Dual x86-64 AMD Opterons with 8 GB of RAM
• Faster disk (SAN)
• Optimized indexes
• Traditional 3-tier architecture with hardware load balancer in front of the databases
• Clusters based on types: ad, app, photo, monitoring, DNS, gallery search DB, profile DB, user info DB, IM status cache, message DB, testimonial DB, friend DB, graph servers, gallery search, object cache.
Lessons Learned
• No persistent database connections.
• Removed all sorts.
• Optimized indexes
• Don’t go after the biggest problems first
• Optimize without downtime
• Split load
• Moved sorting query types into the application and added LIMITS.
• Reduced ranges
• Range on primary key
• Benchmark -> Make Change -> Benchmark -> Make Change (Cycle of Improvement)
• Stabilize: always have a plan to rollback
• Work with a team
• Assess: Define the issues
• A key design goal for the new system was to move away from maintaining session state toward a stateless architecture that would clean up after each request
• Rather than buy big, centralized boxes, [our philosophy] was about buying a lot of thin, cheap boxes. If one fails, you roll over to another box.
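Several of the lessons above (removing sorts from queries, adding LIMITs, and using a range on the primary key) fit together into one pattern: ask the database for a small, bounded slice of rows and do the ordering in the application. The sketch below illustrates that idea only; the `messages` table, its columns, and the ID range are hypothetical, and SQLite via Python's standard `sqlite3` module stands in for MySQL so that the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO messages (id, user_id, score) VALUES (?, ?, ?)",
    [(i, i % 5, (i * 37) % 101) for i in range(1, 1001)],
)


def top_scores(connection, first_id, last_id, limit=50, top_n=3):
    """Fetch a bounded primary-key range with no ORDER BY, then sort in the app."""
    rows = connection.execute(
        # Range on the primary key plus a LIMIT keeps the database work small
        # and predictable; no server-side sort is requested.
        "SELECT id, score FROM messages WHERE id BETWEEN ? AND ? LIMIT ?",
        (first_id, last_id, limit),
    ).fetchall()
    # The sort now costs application CPU, which is cheap to scale by adding
    # web servers, instead of database CPU.
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]


result = top_scores(conn, 1, 50)
print(result)
```

The trade is deliberate: the application does slightly more work per request so that the shared database does less.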
 Feedblendr Architecture - Using EC2 to Scale

Wed, 10/31/2007 - 05:15 — Todd Hoff
A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville and like most dreamers he was a little short on coin. So he took refuge in the home of a cheap hosting provider and Beau realized his dream, creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his one feed, safe within the cradle of affordable CPU cycles.
Site: http://feedblendr.com/
The Platform
• EC2 (Fedora Core 6 Lite distro)
• S3
• Apache
• PHP
• MySQL
• DynDNS (for round robin DNS)
The Stats
• Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr.
• FEEDblendr uses 2 EC2 instances. The same Amazon Machine Image (AMI) is used for both instances.
• Over 10,000 blends have been created, containing over 45,000 source feeds.
• Approx 30 blends created per day. Processors on the 2 instances are actually pegged pretty high (load averages at ~ 10 - 20 most of the time).
The Architecture
• Round robin DNS is used to load balance between instances.
- The DNS is updated by hand, because each instance is validated to work correctly before the DNS is updated.
-Instances seem to be more stable now than they were in the past, but you must still assume they can be lost at any time and no data will be persisted between reboots.
• The database is still hosted on an external service because EC2 does not have a decent persistent storage system.
• The AMI is kept as minimal as possible. It is a clean instance with some auto-deployment code to load the application off of S3. This means you don't have to create new instances for every software release.
• The deployment process is:
- Software is developed on a laptop and stored in subversion.
- A makefile is used to get a revision, fix permissions etc, package and push to S3.
- When the AMI launches it runs a script to grab the software package from S3.
- The package is unpacked and a specific script inside is executed to continue the installation process.
- Configuration files for Apache, PHP, etc are updated.
- Server-specific permissions, symlinks etc are fixed up.
- Apache is restarted and email is sent with the IP of that machine. Then the DNS is updated by hand with the new IP address.
• Feeds are intelligently cached independently on each instance. This is to reduce the costly polling for feeds as much as possible. S3 was tried as a common feed cache for both instances, but it was too slow. Perhaps feeds could be written to each instance so they would be cached on each machine?
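A per-instance feed cache of the kind described in the last bullet might look like the sketch below. The write-up does not say how FEEDblendr decides that a cached feed is stale, so the time-to-live policy, the class name, and the injected `fetch` and `clock` callables are all assumptions made for illustration.

```python
import time

class FeedCache:
    """In-memory cache local to one instance; entries expire after a time-to-live."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that actually polls the feed URL
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the demo needs no real waiting
        self._entries = {}           # url -> (expires_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: no polling
        body = self._fetch(url)      # missing or expired: poll the source feed
        self._entries[url] = (now + self._ttl, body)
        return body

# Demo with a fake fetcher and a fake clock.
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"<rss>{url}</rss>"

fake_now = [0.0]
cache = FeedCache(fake_fetch, ttl_seconds=300, clock=lambda: fake_now[0])
cache.get("http://example.com/a.xml")   # fetched
cache.get("http://example.com/a.xml")   # served from the cache
fake_now[0] = 301.0
cache.get("http://example.com/a.xml")   # expired, fetched again
```

Because each instance keeps its own cache, two instances will poll the same feed separately, which is the trade-off the article accepts after finding S3 too slow as a shared cache.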
Lessons Learned
• A low budget startup can effectively bootstrap using EC2 and S3.
• For the budget conscious the free ZoneEdit service might work just as well as the $50/year DynDNS service (which works fine).
• Round robin load balancing is slow and unreliable. Even with a short TTL for the DNS some systems hold on to the IP addresses for a long time, so new machines are not load balanced to.
• Many problems exist with RSS implementations that keep feeds from being effectively blended. A lot of CPU is spent reading and blending feeds unnecessarily because there's no reliable cross implementation way to tell when a feed has really changed or not.
• It's really a big mindset change to consider that your instances can go away at any time. You have to change your architecture and design to live with this fact. But once you internalize this model, most problems can be solved.
• EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be.
• Use the AMI's ability to be passed a parameter to select which configuration to load from S3. This allows you to test different configurations without moving/deleting the current active one.
• Create an automated test system to validate an instance as it boots. Then automatically update the DNS if the tests pass. This makes it easy to create new instances and takes the slow human out of the loop.
• Always load software from S3. The last thing you want happening is your instance loading, and for some reason not being able to contact your SVN server, and thus failing to load properly. Putting it in S3 virtually eliminates the chances of this occurring, because it's on the same network.
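The suggestion to validate an instance as it boots and only then update the DNS could be sketched like this. The check names, the IP addresses, and the `update_dns` callable are hypothetical; a real version would call whatever API the DNS provider offers, which is not described in the article.

```python
def validate_and_register(ip, checks, update_dns):
    """Run every check against a freshly booted instance; add it to DNS only if all pass.

    Returns the list of failed check names (empty when the instance was registered).
    """
    failures = [name for name, check in checks.items() if not check(ip)]
    if failures:
        return failures              # leave DNS untouched so no traffic reaches a bad instance
    update_dns(ip)
    return []

# Demo with stand-in checks; real checks would fetch a page or blend a test feed.
registered = []
checks = {
    "http_ok": lambda ip: True,
    "blend_ok": lambda ip: ip != "10.0.0.9",   # pretend this one instance is broken
}
ok_failures = validate_and_register("10.0.0.5", checks, registered.append)
bad_failures = validate_and_register("10.0.0.9", checks, registered.append)
```

Passing the checks and the DNS updater in as callables keeps the gate itself trivial to test without a live instance.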
Related Articles
• What is a 'River of News' style aggregator?
• Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services
Comments
Wed, 10/31/2007 - 15:04 — Greg Linden (not verified)
Re: Feedblendr Architecture - Using EC2 to Scale
I might be missing something, but I don't see how this is an interesting example of "using EC2 to scale".
There appears to be no difference between using EC2 in the way Beau is using it and setting up two leased servers from a normal provider. In fact, getting leased servers might be better, since the cost might be lower (an EC2 instance costs $72/month + bandwidth) and the database would be on the same network.
Beau does not appear to be doing anything that takes advantage of EC2, such as dynamically creating and discarding instances based on demand.
Am I missing something here? Is this an interesting use of using EC2 to scale?
Wed, 10/31/2007 - 16:35 — Todd Hoff

Re: Feedblendr Architecture - Using EC2 to Scale
> I might be missing something, but I don't see how this is an interesting example of "using EC2 to scale".
I admit to being a bit polymorphously perverse with respect to finding things interesting, but from Beau's position, which many people are, the drama is thrilling. The story starts with a conflict: how to implement this idea? The first option is the traditional cheap host option. And for a long time that would have been the end of the story. Dedicated servers with higher end CPUs, RAM, and persistent storage are still not cheap. So if you aren't making money that would have been where the story ended. Scaling by adding more and more dedicated servers would be impossible. Hopefully the new grid model will allow a lot of people to keep writing their stories. His learning curve of creating the system is what was most interesting. Figuring out how to set things up, load balance, load the software, test it, regular nuts and bolts development stuff. And that puts him in the position of being able to get more CPU immediately when and if the time comes. He'll be able to add that feature in quickly because he's already done the ground work. But for now it's running fine. The spanner in the plan was the database and that points out the fatal flaw of EC2, which is the database. The plan would look a bit more successful if that part had worked out better, but it didn't, which is also interesting.
Wed, 10/31/2007 - 18:41 — Beau Lebens (not verified)
Re: Feedblendr Architecture - Using EC2 to Scale
@Todd, thanks for the write-up, and a couple quick corrections/clarifications:
- "Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr." - Just to be clear, the learning curve was mostly in dealing with EC2 and how it works, not so much FeedBlendr, which at its core is relatively simple.
- "no data will be persisted between reboots" this is not exactly true. Rebooting will persist data, but a true "crash" or termination of your instance will discard everything.
- "The database is still hosted on an external service because EC2 does not have a decent persistent storage system" - more the case here is that I didn't want to have to deal with (or pay for) setting something up to cater to them not having persistent storage. It is being done by other people, and can be done, it just seemed like overkill for what I was doing.
- "EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be" - to be clear, EC2 has no inherent load balancing, so it's up to you (the developer/admin) to provide it yourself somehow. There are a number of different ways of doing it, but I chose dynamic DNS because it was something I was familiar with.
@Greg in response to your question - I suppose the point here is that even though FeedBlendr isn't currently a poster-child for scaling, that's also kind of the point. As Todd says, this is about the learning curve and trials and tribulations of getting to a point where it can scale. There is nothing stopping me (other than budget!) from launching an additional 5 instances right now and adding them into DNS, and then I've suddenly scaled. From there I can kill some instances off and scale back. This is all about getting to the point where I even have that option, and how it was done on EC2 in particular.
Cheers,
Beau
 PlentyOfFish Architecture

Tue, 10/30/2007 - 04:48 — Todd Hoff
Update: by Facebook standards Read/WriteWeb says POF is worth a cool one billion dollars. It helps to talk like Dr. Evil when saying it out loud.
PlentyOfFish is a hugely popular on-line dating system slammed by over 45 million visitors a month and 30+ million hits a day (500 - 600 pages per second). But that's not the most interesting part of the story. All this is handled by one person, using a handful of servers, working a few hours a day, while making $6 million a year from Google ads. Jealous? I know I am. How are all these love connections made using so few resources?
Site: http://www.plentyoffish.com/
Information Sources
• Channel9 Interview with Markus Frind
• Blog of Markus Frind
• Plentyoffish: 1-Man Company May Be Worth $1Billion
The Platform
• Microsoft Windows
• ASP.NET
• IIS
• Akamai CDN
• Foundry ServerIron Load Balancer
The Stats
• PlentyOfFish (POF) gets 1.2 billion page views/month, and 500,000 average unique logins per day. The peak season is January, when it will grow 30 percent.
• POF has one single employee: the founder and CEO Markus Frind.
• Makes up to $10 million a year on Google ads working only two hours a day.
• 30+ Million Hits a Day (500 - 600 pages per second).
• 1.1 billion page views and 45 million visitors a month.
• Has 5-10 times the click through rate of Facebook.
• A top 30 site in the US based on Compete's Attention metric, top 10 in Canada and top 30 in the UK.
• 2 load balanced web servers with 2 quad-core Intel Xeon X5355 CPUs (2.66 GHz), 8 GB of RAM (about 800 MB in use), 2 hard drives, running Windows Server 2003 x64.
• 3 DB servers. No data on their configuration.
• Approaching 64,000 simultaneous connections and 2 million page views per hour.
• Internet connection is a 1Gbps line of which 200Mbps is used.
• 1 TB/day serving 171 million images through Akamai.
• 6TB storage array to handle millions of full sized images being uploaded every month to the site.
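A quick back-of-the-envelope check helps relate the traffic figures above to each other. The numbers are taken from the list; the 30-day month and the reading that "500 - 600 pages per second" describes a peak hour rather than the daily average are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# 30+ million hits a day, averaged over the whole day.
hits_per_day = 30_000_000
avg_hits_per_second = hits_per_day / SECONDS_PER_DAY

# 2 million page views per hour, the rate the site is "approaching".
peak_views_per_hour = 2_000_000
peak_views_per_second = peak_views_per_hour / 3600

# 1.2 billion page views a month, assuming a 30-day month.
monthly_views = 1_200_000_000
avg_views_per_day = monthly_views / 30

# 200 Mbps used on a 1 Gbps line.
link_utilisation = 200 / 1000

print(f"average hits/s:   {avg_hits_per_second:.0f}")
print(f"peak views/s:     {peak_views_per_second:.0f}")
print(f"views per day:    {avg_views_per_day:,.0f}")
print(f"link utilisation: {link_utilisation:.0%}")
```

On these assumptions the daily average is well below 500 per second, while the 2 million views per hour figure lands inside the quoted 500 - 600 range, so the per-second number most likely describes busy hours.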
What's Inside
• Revenue model has been to use Google ads. Match.com, in comparison, generates $300 million a year, primarily from subscriptions. POF's revenue model is about to change so it can capture more revenue from all those users. The plan is to hire more employees, hire sales people, and sell ads directly instead of relying solely on AdSense.
• With 30 million page views a day you can make good money on advertising, even at a 5 - 10 cent CPM.
• Akamai is used to serve 100 million plus image requests a day. If you have 8 images and each takes 100 msecs you are talking nearly a second of load time just for the images. So distributing the images makes sense.
• 10’s of millions of image requests are served directly from their servers, but the majority of these images are less than 2KB and are mostly cached in RAM.
• Everything is dynamic. Nothing is static.
• All outbound Data is Gzipped at a cost of only 30% CPU usage. This implies a lot of processing power on those servers, but it really cuts bandwidth usage.
• No caching functionality in ASP.NET is used. It is not used because as soon as the data is put in the cache it's already expired.
• No built in components from ASP are used. Everything is written from scratch. Nothing is more complex than a simple if then and for loops. Keep it simple.
• Load balancing
- IIS arbitrarily limits the total connections to 64,000 so a load balancer was added to handle the large number of simultaneous connections. Adding a second IP address and then using a round robin DNS was considered, but the load balancer was considered more redundant and allowed easier swap in of more web servers. And using ServerIron allowed advanced functionality like bot blocking and load balancing based on passed on cookies, session data, and IP data.
- The Windows Network Load Balancing (NLB) feature was not used because it doesn't do sticky sessions. A way around this would be to store session state in a database or in a shared file system.
- 8-12 NLB servers can be put in a farm and there can be an unlimited number of farms. A DNS round-robin scheme can be used between farms. Such an architecture has been used to enable 70 front end web servers to support over 300,000 concurrent users.
- NLB has an affinity option so a user always maps to a certain server, thus no external storage is used for session state and if the server fails the user loses their state and must relogin. If this state includes a shopping cart or other important data, this solution may be poor, but for a dating site it seems reasonable.
- It was thought that the cost of storing and fetching session data in software was too expensive. Hardware load balancing is simpler. Just map users to specific servers and if a server fails have the user log in again.
- The cost of a ServerIron was cheaper and simpler than using NLB. Many major sites use them for TCP connection pooling, automated bot detection, etc. ServerIron can do a lot more than load balancing and these features are attractive for the cost.
• Has a big problem picking an ad server. Ad server firms want several hundred thousand a year plus they want multi-year contracts.
• In the process of getting rid of ASP.NET repeaters and instead using string appends or Response.Write. If you are doing over a million page views a day just write out the code to spit it out to the screen.
• Most of the build out costs went towards a SAN. Redundancy at any cost.
• Growth was through word of mouth. Went nuts in Canada, spread to UK, Australia, and then to the US.
• Database
- One database is the main database.
- Two databases are for search. Load balanced between search servers based on the type of search performed.
- Monitors performance using task manager. When spikes show up he investigates. Problems were usually blocking in the database. It's always database issues. Rarely any problems in .net. Because POF doesn't use the .net library it's relatively easy to track down performance problems. When you are using many layers of frameworks finding out where problems are hiding is frustrating and hard.
- If you call the database 20 times per page view you are screwed no matter what you do.
- Separate database reads from writes. If you don't have a lot of RAM and you do reads and writes you get paging involved which can hang your system for seconds.
- Try and make a read only database if you can.
- Denormalize data. If you have to fetch stuff from 20 different tables try and make one table that is just used for reading.
- One day it will work, but when your database doubles in size it won't work anymore.
- If you only do one thing in a system it will do it really really well. Just do writes and that's good. Just do reads and that's good. Mix them up and it messes things up. You run into locking and blocking issues.
- If you are maxing the CPU you've either done something wrong or it's really really optimized. If you can fit the database in RAM do it.
• The development process is: come up with an idea. Throw it up within 24 hours. It kind of half works. See what user response is by looking at what they actually do on the site. Do messages per user increase? Do session times increase? If people don't like it then take it down.
• System failures are rare and short lived. Biggest issues are DNS issues where some ISP says POF doesn't exist anymore. But because the site is free, people accept a little down time. People often don't notice sites down because they think it's their problem.
• Going from one million to 12 million users was a big jump. He could scale to 60 million users with two web servers.
• Will often look at competitors for ideas for new features.
• Will consider something like S3 when it becomes geographically load balanced.
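The session-affinity trade-off described under load balancing (pin each user to one server, keep session state only on that server, and make the user log in again if it fails) can be sketched as below. This is a toy model, not how a ServerIron works internally; the class, the least-loaded assignment rule, and the server names are hypothetical.

```python
class StickyBalancer:
    """Toy model of affinity load balancing with server-local session state."""

    def __init__(self, servers):
        self.sessions = {name: set() for name in servers}   # server -> logged-in users
        self.affinity = {}                                    # user -> pinned server

    def route(self, user):
        server = self.affinity.get(user)
        if server is None:
            # New (or orphaned) user: pin to the server holding the fewest sessions.
            server = min(self.sessions, key=lambda name: len(self.sessions[name]))
            self.affinity[user] = server
        return server

    def login(self, user):
        server = self.route(user)
        self.sessions[server].add(user)
        return server

    def is_logged_in(self, user):
        return user in self.sessions[self.route(user)]

    def fail(self, server):
        # Session state lived only on the failed server, so it is simply lost.
        del self.sessions[server]
        self.affinity = {u: s for u, s in self.affinity.items() if s != server}

lb = StickyBalancer(["web1", "web2"])
alice_server = lb.login("alice")
bob_server = lb.login("bob")
lb.fail("web1")
```

After the failure, users pinned to the surviving server keep their sessions, while users from the failed one are routed elsewhere and must log in again, which the article argues is acceptable for a dating site.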
Lessons Learned
• You don't need millions in funding, a sprawling infrastructure, and a building full of employees to create a world class website that handles a torrent of users while making good money. All you need is an idea that appeals to a lot of people, a site that takes off by word of mouth, and the experience and vision to build a site without falling into the typical traps of the trade. That's all you need :-)
• Necessity is the mother of all change.
• When you grow quickly, but not too quickly, you have a chance to grow, modify, and adapt.
• RAM solves all problems. After that it's just growing using bigger machines.
• When starting out keep everything as simple as possible. Nearly everyone gives this same advice and Markus makes a noticeable point of saying everything he does is just obvious common sense. But clearly what is simple isn't merely common sense. Creating simple things is the result of years of practical experience.
• Keep database access fast and you have no issues.
• A big reason POF can get away with so few people and so little equipment is they use a CDN for serving large heavily used content. Using a CDN may be the secret sauce in a lot of large websites. Markus thinks there isn't a single site in the top 100 that doesn’t use a CDN. Without a CDN he thinks load time in Australia would go to 3 or 4 seconds because of all the images.
• Advertising on Facebook yielded poor results. With 2000 clicks only 1 signed up. With a CTR of 0.04% Facebook gets 0.4 clicks per 1000 ad impressions. At a 5 cent CPM that is 12.5 cents a click; at a 50 cent CPM, $1.25 a click; at a $1.00 CPM, $2.50 a click; at a $15.00 CPM, $37.50 a click.
• It's easy to sell a few million page views at high CPM’s. It's a LOT harder to sell billions of page views at high CPM’s, as shown by Myspace and Facebook.
• The ad-supported model limits your revenues. You have to go to a paid model to grow larger. To generate 100 million a year as a free site is virtually impossible as you need too big a market.
• Growing page views via Facebook for a dating site won't work. Having a visitor on your site is much more profitable. Most of Facebook's page views are outside the US and you have to split 5 cent CPM’s with Facebook.
• Co-reg (co-registration) is a potentially large source of income. This is where you offer in your site's sign up to send the user more information about mortgages or some other product.
• You can't always listen to user responses. Some users will always love new features and others will hate them. Only a fraction will complain. Instead, look at what features people are actually using by watching your site.
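The Facebook advertising arithmetic in the lessons above follows from two inputs: the price per thousand impressions (CPM) and the click-through rate (CTR). A small sketch, using the CTR and the one-sign-up-per-2000-clicks figure from the text; the function names are made up for illustration.

```python
def cost_per_click(cpm_dollars, ctr):
    """Dollars paid per click when buying impressions at a given CPM."""
    clicks_per_thousand = 1000 * ctr          # 0.04% CTR -> 0.4 clicks per 1000 impressions
    return cpm_dollars / clicks_per_thousand

def cost_per_signup(cpm_dollars, ctr, clicks_per_signup):
    """Dollars paid per sign-up given how many clicks it takes to get one."""
    return cost_per_click(cpm_dollars, ctr) * clicks_per_signup

CTR = 0.0004              # 0.04%
CLICKS_PER_SIGNUP = 2000  # "With 2000 clicks only 1 signed up"

for cpm in (0.05, 0.50, 1.00, 15.00):
    click = cost_per_click(cpm, CTR)
    signup = cost_per_signup(cpm, CTR, CLICKS_PER_SIGNUP)
    print(f"${cpm:.2f} CPM: ${click:.3f} per click, ${signup:,.0f} per sign-up")
```

The per-sign-up column is not in the original list; it is simply the per-click cost multiplied by the 2000 clicks quoted, and it shows why the results were judged poor.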
 Wikimedia architecture

Wed, 08/22/2007 - 23:56 — Todd Hoff
Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet.
Site: http://wikimedia.org/
Information Sources
• Wikimedia architecture
• http://meta.wikimedia.org/wiki/Wikimedia_servers
• scale-out vs scale-up in the from Oracle to MySQL blog.
Platform
• Apache
• Linux
• MySQL
• PHP
• Squid
• LVS
• Lucene for Search
• Memcached for Distributed Object Cache
• Lighttpd Image Server
The Stats
• 8 million articles spread over hundreds of language projects (English, Dutch, …)
• 10th busiest site in the world (source: Alexa)
• Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
• 30 000 HTTP requests/s during peak-time
• 3 Gbit/s of data traffic
• 3 data centers: Tampa, Amsterdam, Seoul
• 350 servers, ranging between 1x P4 to 2x Xeon Quad-Core, 0.5 - 16 GB of memory
• managed by ~ 6 people
• 3 clusters on 3 different continents
The Architecture
• Geographic Load Balancing, based on source IP of client resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters.
• HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
• 55 Squid servers currently, plus 20 waiting for setup.
• 1,000 HTTP requests/s per server, up to 2,500 under stress
• ~ 100 - 250 Mbit/s per server
• ~ 14 000 - 32 000 open connections per server
• Up to 40 GB of disk caches per Squid server
• Up to 4 disks per server (1U rack servers)
• 8 GB of memory, half of that used by Squid
• Hit rates: 85% for Text, 98% for Media, since the use of CARP.
• PowerDNS provides geographical distribution.
• In their primary and regional data centers they run text and media clusters built on LVS, CARP Squid, Cache Squid. In the primary datacenter they have the media storage.
• To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches.
• One centrally managed & synchronized software installation for hundreds of wikis.
• MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
• Hardware shared with External Storage and Memcached tasks
• Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
• Actual revision text is stored as blobs in External storage
• Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
• Separate database per wiki (not separate server!)
• One master, many replicated slaves
• Read operations are load balanced over the slaves, write operations go to the master
• The master is used for some read operations in case the slaves are not yet up to date (lagged)
• External Storage
- Article text is stored on separate data storage clusters, simple append-only blob storage. Saves space on expensive and busy core databases for largely unused data
- Allows use of spare resources on application servers (2x 250-500 GB per server)
- Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability
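The database routing rules in the list above (writes go to the master, reads are spread over the slaves, and the master answers reads when the slaves are lagged) can be sketched as a small selection function. MediaWiki's real load balancer is more elaborate; the function name, the lag threshold, and the server names here are assumptions.

```python
import random

def choose_server(is_write, master, slave_lag, max_lag_seconds, rng=random):
    """Pick a database server for one query.

    slave_lag maps each slave name to its current replication lag in seconds.
    """
    if is_write:
        return master                                   # all writes go to the master
    fresh = [name for name, lag in slave_lag.items() if lag <= max_lag_seconds]
    if not fresh:
        return master                                   # every slave is lagged: read from the master
    return rng.choice(fresh)                            # spread reads over up-to-date slaves

# Demo with hypothetical server names and lag readings.
write_target = choose_server(True, "db-master", {"db2": 0.5, "db3": 40.0}, max_lag_seconds=5)
read_target = choose_server(False, "db-master", {"db2": 0.5, "db3": 40.0}, max_lag_seconds=5)
fallback_target = choose_server(False, "db-master", {"db2": 30.0, "db3": 40.0}, max_lag_seconds=5)
```

Random choice is the simplest way to balance reads; weighting by server capacity would be a natural refinement but is not described in the source.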
Lessons Learned
• Focus on architecture, not so much on operations or nontechnical stuff.
• Sometimes caching costs more than recalculating or looking up at the data source…profiling!
• Avoid expensive algorithms, database queries, etc.
• Cache every result that is expensive and has temporal locality of reference.
• Focus on the hot spots in the code (profiling!).
• Scale by separating:
- Read and write operations (master/slave)
- Expensive operations from cheap and more frequent operations (query groups)
- Big, popular wikis from smaller wikis
• Separating improves caching: it raises temporal and spatial locality of reference and reduces the data set size per server
• Text is compressed and only revisions between articles are stored.
• Simple seeming library calls like using stat to check for a file's existence can take too long when loaded.
• Disk seek I/O limited, the more disk spindles, the better!
• Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
• It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
• Anyone who wants to design their database architecture so that it'll allow them to inexpensively grow from one box rank nothing to the top ten or hundred sites on the net should start out by designing it to handle slightly out of date data from replication slaves, know how to load balance to slaves for all read queries and if at all possible to design it so that chunks of data (batches of users, accounts, whatever) can go on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!
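The last lesson, designing from day one so that chunks of data can live on different servers, is often implemented with a small directory that maps batches of users to servers. The sketch below is one hypothetical way to do it and is not taken from MediaWiki, which partitions by language and project; the batch size and server names are invented.

```python
class ShardDirectory:
    """Maps fixed-size batches of user IDs to database servers."""

    def __init__(self, batch_size, default_server):
        self.batch_size = batch_size
        self.default_server = default_server
        self.batch_to_server = {}        # only batches that have been moved are listed

    def batch_of(self, user_id):
        return user_id // self.batch_size

    def server_for(self, user_id):
        return self.batch_to_server.get(self.batch_of(user_id), self.default_server)

    def move_batch(self, batch, server):
        # In a real system the batch's rows would be copied before this switch.
        self.batch_to_server[batch] = server

# Start with everything on one box, then move one batch as load grows.
directory = ShardDirectory(batch_size=1000, default_server="db1")
before = directory.server_for(1500)
directory.move_batch(1, "db2")
after = directory.server_for(1500)
```

Because the application always asks the directory where a user lives, the same code runs unchanged whether there is one server or many, which is the point of proving the architecture while the site is still small.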
 Scaling Early Stage Startups

Mon, 10/29/2007 - 04:26 — Todd Hoff
Mark Maunder of No VC Required--who advocates not taking VC money lest you be turned into a frog instead of the prince (or princess) you were dreaming of--has an excellent slide deck on how to scale an early stage startup. His blog also has some good SEO tips and a very spooky widget showing the geographical location of his readers. Perfect for Halloween! What are Mark's otherworldly scaling strategies for startups?
Site: http://novcrequired.com/
Information Sources
• Slides from Seattle Tech Startup Talk.
• Scaling Early Stage Startups blog post by Mark Maunder.
The Platform
• Linux
• An ISAM type data store.
• Perl
• Httperf is used for benchmarking.
• Websitepulse.com is used for perf monitoring.
The Architecture
• Performance matters because being slow could cost you 20% of your revenue. The UIE guys disagree saying this ain't necessarily so. They explain their reasoning in Usability Tools Podcast: The Truth About Page Download Time. The idea is: "There was still another surprising finding from our study: a strong correlation between perceived download time and whether users successfully completed their tasks on a site. There was, however, no correlation between actual download time and task success, causing us to discard our original hypothesis. It seems that, when people accomplish what they set out to do on a site, they perceive that site to be fast." So it might be a better use of time to improve the front-end rather than the back-end.
• MySQL was dumped because of performance problems: MySQL didn't handle a high number of writes and deletes on large tables, writes blow away the query cache, large numbers of small tables (over 10,000) are not well supported, it uses a lot of memory to cache indexes, and it maxed out at 200 concurrent read/write queries per second with over 1 million records.
• For data storage they evolved to a fixed length ISAM like record scheme that allows seeking directly to the data. It still uses file level locking and it's benchmarked at 20,000+ concurrent reads/writes/deletes. Considering moving to Berkeley DB, which performs very well and is used by many large websites, especially when you primarily need key-value type lookups. I think it might be interesting to store JSON if a lot of this data ends up being displayed on the web page.
• Moved to httpd.prefork for Perl. That with no keepalive on the application servers uses less RAM and works well.
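The fixed-length record idea in the storage bullet works because the byte offset of record N is simply N times the record size, so no index lookup is needed. The original system was written in Perl and its record layout is not given; the layout below (a 4-byte ID, a 16-byte name, an 8-byte score) is a hypothetical one chosen for illustration, and an in-memory buffer stands in for the data file.

```python
import io
import struct

# Little-endian, no padding: unsigned int + 16-byte string + double = 28 bytes.
RECORD = struct.Struct("<I16sd")

def write_record(f, index, user_id, name, score):
    f.seek(index * RECORD.size)                     # jump straight to the slot
    f.write(RECORD.pack(user_id, name.encode("utf-8"), score))

def read_record(f, index):
    f.seek(index * RECORD.size)
    user_id, raw_name, score = RECORD.unpack(f.read(RECORD.size))
    # Names shorter than 16 bytes are padded with NUL bytes by struct.
    return user_id, raw_name.rstrip(b"\x00").decode("utf-8"), score

data_file = io.BytesIO()                            # stand-in for a file opened in binary mode
write_record(data_file, 0, 1, "alice", 9.0)
write_record(data_file, 1, 2, "bob", 7.5)
write_record(data_file, 2, 3, "carol", 8.25)
second = read_record(data_file, 1)
```

Deleting or updating a record is an in-place overwrite of one slot, which is part of why this scheme copes well with the heavy write and delete load that caused trouble in MySQL.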
Lessons Learned
• Configure your DB and web server correctly. MySQL and Apache's memory usage can easily spiral out of control, which leads to grindingly slow performance as swapping increases. Here are a few resources for helping with configuration issues.
• Serve only the users you care about. Block content thieves that crawl your site using a lot of valuable resources for nothing. Monitor the number of content pages they fetch per minute. If a threshold is exceeded, do a reverse lookup on their IP address and configure your firewall to block them.
• Cache as much DB data and static content as possible. Perl's Cache::FileCache was used to cache DB data and rendered HTML on disk.
• Use two different host names in URLs to enable browser clients to load images in parallel.
• Make content as static as possible. Create a separate Image and CSS server to serve the static content. Use keepalives on static content as static content uses little memory per thread/process.
• Leave plenty of spare memory. Spare memory allows Linux to use more memory for file system caching, which increased performance about 20 percent.
• Turn Keepalive off on your dynamic content. Increasing http requests can exhaust the thread and memory resources needed to serve them.
• You may not need a complex RDBMS for accessing data. Consider a lighter weight database such as Berkeley DB.
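The crawler-blocking lesson above depends on counting the pages each IP address fetches per minute. A minimal sliding-window counter is sketched here; the class name, the threshold, and the example addresses are hypothetical, and the reverse lookup and firewall steps are left to the caller.

```python
from collections import defaultdict, deque

class CrawlerDetector:
    """Flags IP addresses that fetch more than a threshold of pages per minute."""

    def __init__(self, max_pages_per_minute):
        self.max_pages = max_pages_per_minute
        self.hits = defaultdict(deque)      # ip -> timestamps of fetches in the last 60 s

    def record(self, ip, now):
        """Record one page fetch at time `now` (seconds); return True if over the threshold."""
        window = self.hits[ip]
        window.append(now)
        while window and now - window[0] >= 60:
            window.popleft()                # drop fetches older than one minute
        return len(window) > self.max_pages

detector = CrawlerDetector(max_pages_per_minute=3)
fast_flags = [detector.record("203.0.113.9", t) for t in range(5)]          # 5 fetches in 5 seconds
slow_flags = [detector.record("198.51.100.7", t) for t in (0, 100, 200)]    # 1 fetch every 100 seconds
```

When `record` returns True the caller would do the reverse lookup, so that legitimate search engine crawlers can be spared before anything is added to the firewall.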
 Database parallelism choices greatly impact scalability
By Sam Madden on October 30, 2007 9:15 AM
Large databases require the use of parallel computing resources to get good performance. There are several fundamentally different parallel architectures in use today; in this post, Dave DeWitt, Mike Stonebraker, and I review three approaches and reflect on the pros and cons of each. Though these tradeoffs were articulated in the research community twenty years ago, we wanted to revisit these issues to bring readers up to speed before publishing upcoming posts that will discuss recent developments in parallel database design.

Shared-memory systems don't scale well as the shared bus becomes the bottleneck

In a shared-memory approach, as implemented on many symmetric multi-processor machines, all of the CPUs share a single memory and a single collection of disks. This approach is relatively easy to program. Complex distributed locking and commit protocols are not needed because the lock manager and buffer pool are both stored in the memory system where they can be easily accessed by all the processors.

Unfortunately, shared-memory systems have fundamental scalability limitations, as all I/O and memory requests have to be transferred over the same bus that all of the processors share. This causes the bandwidth of the bus to rapidly become a bottleneck. In addition, shared-memory multiprocessors require complex, customized hardware to keep their L2 data caches consistent. Hence, it is unusual to see shared-memory machines of larger than 8 or 16 processors unless they are custom-built from non-commodity parts (and if they are custom-built, they are very expensive). As a result, shared-memory systems don't scale well.

Shared-disk systems don't scale well either

Shared-disk systems suffer from similar scalability limitations. In a shared-disk architecture, there are a number of independent processor nodes, each with its own memory. These nodes all access a single collection of disks, typically in the form of a storage area network (SAN) system or a network-attached storage (NAS) system. This architecture originated with the Digital Equipment Corporation VAXcluster in the early 1980s, and has been widely used by Sun Microsystems and Hewlett-Packard.

Shared-disk architectures have a number of drawbacks that severely limit scalability. First, the interconnection network that connects each of the CPUs to the shared-disk subsystem can become an I/O bottleneck. Second, since there is no pool of memory that is shared by all the processors, there is no obvious place for the lock table or buffer pool to reside. To set locks, one must either centralize the lock manager on one processor or resort to a complex distributed locking protocol. This protocol must use messages to implement in software the same sort of cache-consistency protocol implemented by shared-memory multiprocessors in hardware. Either of these approaches to locking is likely to become a bottleneck as the system is scaled.

To make shared-disk technology work better, vendors typically implement a "shared-cache" design. Shared cache works much like shared disk, except that, when a node in a parallel cluster needs to access a disk page, it first checks to see if the page is in its local buffer pool ("cache"). If not, it checks to see if the page is in the cache of any other node in the cluster. If neither of those efforts works, it reads the page from disk.

Such a cache appears to work fairly well on OLTP but performs less well for data warehousing workloads. The problem with the shared-cache design is that cache hits are unlikely to happen because warehouse queries are typically answered using sequential scans of the fact table (or via materialized views). Unless the whole fact table fits in the aggregate memory of the cluster, sequential scans do not typically benefit from large amounts of cache. Thus, the entire burden of answering such queries is placed on the disk subsystem. As a result, a shared cache just creates overhead and limits scalability.
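The shared-cache lookup order described above (local buffer pool, then the caches of other nodes, then disk) can be sketched in a few lines. This is only an illustration: the Node class, its fields, and the dictionary standing in for the shared disk are assumptions made for the example, not any vendor's implementation.

```python
class Node:
    """One cluster node with a local buffer pool (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}    # page_id -> page contents
        self.peers = []    # other Node objects in the cluster

    def read_page(self, page_id, disk):
        """Return (page, source), where source says where the page was found."""
        # 1. Check the local buffer pool.
        if page_id in self.cache:
            return self.cache[page_id], "local"
        # 2. Check the cache of every other node in the cluster.
        for peer in self.peers:
            if page_id in peer.cache:
                page = peer.cache[page_id]
                self.cache[page_id] = page
                return page, "peer"
        # 3. Fall back to the shared disk (modeled here as a dict).
        page = disk[page_id]
        self.cache[page_id] = page
        return page, "disk"
```

A sequential scan over a fact table larger than the combined caches would take the third branch for almost every page, which is why the extra lookups add overhead without helping.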

In addition, the same scalability problems that exist in the shared-memory model also occur in the shared-disk architecture. The bus between the disks and the processors will likely become a bottleneck, and resource contention for certain disk blocks, particularly as the number of CPUs increases, can be a problem. To reduce bus contention, customers frequently configure their large clusters with many Fibre Channel controllers (disk buses), but this complicates system design because administrators must then partition data across the disks attached to the different controllers.

Shared-nothing scales the best

In a shared-nothing approach, by contrast, each processor has its own set of disks. Data is "horizontally partitioned" across nodes. Each node has a subset of the rows from each table in the database. Each node is then responsible for processing only the rows on its own disks. Such architectures are especially well suited to the star schema queries present in data warehouse workloads, as only a very limited amount of communication bandwidth is required to join one or more (typically small) dimension tables with the (typically much larger) fact table.

In addition, every node maintains its own lock table and buffer pool, eliminating the need for complicated locking and software or hardware consistency mechanisms. Because shared nothing does not typically have nearly as severe bus or resource contention as shared-memory or shared-disk machines, shared nothing can be made to scale to hundreds or even thousands of machines. Because of this, it is generally regarded as the best-scaling architecture.
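As a rough sketch of the shared-nothing idea, the code below horizontally partitions fact rows across nodes by hashing a key, lets each node aggregate only its own rows, and merges the small partial results at the end. The field names (sale_id, store_id, amount), the CRC32 hash, and the choice to give every node a copy of the small dimension table are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick the node that owns a row, using a stable hash of the partition key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def partition(rows, key_field, num_nodes):
    """Horizontally partition rows: each node receives a disjoint subset."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_field], num_nodes)].append(row)
    return nodes


def local_totals(node_rows, dimension):
    """Work done on one node: join its fact rows with the small dimension
    table and aggregate, touching only rows stored on that node."""
    totals = defaultdict(int)
    for row in node_rows:
        region = dimension[row["store_id"]]
        totals[region] += row["amount"]
    return totals


def merge(partials):
    """Coordinator step: combine the per-node partial aggregates."""
    totals = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            totals[region] += amount
    return dict(totals)
```

Only the per-node totals cross the network, which is the sense in which star-schema queries need very little communication bandwidth.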

The shared-nothing approach complements other enhancements

As a closing point, we note that this shared nothing approach is completely compatible with other advanced database techniques we've discussed on this blog, such as compression and vertical partitioning. Systems that combine all of these techniques are likely to offer the best performance and scalability when compared to more traditional architectures.
 Introduction to Distributed System Design
Table of Contents
Audience and Pre-Requisites
The Basics
So How Is It Done?
Remote Procedure Calls
Distributed Design Principles
Exercises
References
________________________________________
Audience and Pre-Requisites
This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.
The Basics
What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is a "cascading" definition of a distributed system:
A program
is the code you write.
A process
is what you get when you run it.
A message
is used to communicate between processes.
A packet
is a fragment of a message that might travel on a wire.
A protocol
is a formal description of message formats and the rules that two processes must follow in order to exchange those messages.
A network
is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
A component
can be a process or any piece of hardware required to run a process, support communications between processes, store data, etc.
A distributed system
is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single or small set of related tasks.
Why build a distributed system? There are lots of advantages including the ability to connect remote users with remote resources in an open and scalable way. When we say open, we mean each component is continually open to interaction with other components. When we say scalable, we mean the system can easily be altered to accommodate changes in the number of users, resources and computing entities.
Thus, a distributed system can be much larger and more powerful given the combined capabilities of the distributed components, than combinations of stand-alone systems. But it's not easy - for a distributed system to be useful, it must be reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running components.
To be truly reliable, a distributed system must have the following characteristics:
• Fault-Tolerant: It can recover from component failures without performing incorrect actions.
• Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
• Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
• Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
• Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
• Predictable Performance: The ability to provide desired responsiveness in a timely manner.
• Secure: The system authenticates access to data and services. [1]
These are high standards, which are challenging to achieve. Probably the most difficult challenge is that a distributed system must be able to continue operating correctly even when components fail. This issue is discussed in the following excerpt of an interview with Ken Arnold. Ken is a research scientist at Sun and is one of the original architects of Jini, and was a member of the architectural team that designed CORBA.
________________________________________
Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in 10^13, how often would it happen?" Common sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time."
When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern. What does designing for failure mean? One classic problem is partial failure. If I send a message to you and then a network failure occurs, there are two possible outcomes. One is that the message got to you, and then the network broke, and I just didn't get the response. The other is the message never got to you because the network broke before it arrived.
So if I never receive a response, how do I know which of those two results happened? I cannot determine that without eventually finding you. The network has to be repaired or you have to come up, because maybe what happened was not a network failure but you died. How does this change how I design things? For one thing, it puts a multiplier on the value of simplicity. The more things I can do with you, the more things I have to think about recovering from. [2]
________________________________________
Handling failures is an important theme in distributed systems design. Failures fall into two obvious categories: hardware and software. Hardware failures were a dominant concern until the late 80's, but since then internal hardware reliability has improved enormously. Decreased heat production and power consumption of smaller circuits, reduction of off-chip connections and wiring, and high-quality manufacturing techniques have all played a positive role in improving hardware reliability. Today, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures.
Software failures are a significant issue in distributed systems. Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%). Residual bugs in mature systems can be classified into two main categories [5].
• Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode. The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).
• Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions. [6]
Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.
Let's get a little more specific about the types of failures that can occur in a distributed system:
• Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
• Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
• Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
• Network failures: A network link breaks.
• Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
• Timing failures: A temporal property of the system is violated. For example, the clocks on different computers that are used to coordinate processes drift out of synchronization, or a message is delayed longer than a threshold period.
• Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc. [1]
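Halting failures in the list above can only be detected by timeout. A minimal sketch of a heartbeat monitor follows; the class name, the method names, and the convention of passing the current time in explicitly are assumptions chosen to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Suspects a component once no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}    # component name -> time of last heartbeat

    def heartbeat(self, component, now):
        """Record an "I'm alive" message received at time `now` (seconds)."""
        self.last_seen[component] = now

    def suspected(self, now):
        """Return the components whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A component listed by suspected() is not necessarily dead. A slow link, an omission failure, or a network partition produces exactly the same silence, so the monitor can only report suspicion.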
Our goal is to design a distributed system with the characteristics listed above (fault-tolerant, highly available, recoverable, etc.), which means we must design for failure. To design for failure, we must be careful to not make any assumptions about the reliability of the components of a system.
Everyone, when they first build a distributed system, makes the following eight assumptions. These are so well-known in this field that they are commonly referred to as the "8 Fallacies".
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous. [3]
Latency: the time between initiating a request for data and the beginning of the actual data transfer.
Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry.
Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or meshed.
Homogeneous network: A network running a single network protocol.
So How Is It Done?
Building a reliable system that runs over an unreliable communications network seems like an impossible goal. We are forced to deal with uncertainty. A process knows its own state, and it knows what state other processes were in recently. But the processes have no way of knowing each other's current state. They lack the equivalent of shared memory. They also lack accurate ways to detect failure, or to distinguish a local software/hardware failure from a communication failure.
Distributed systems design is obviously a challenging endeavor. How do we do it when we are not allowed to assume anything, and there are so many complexities? We start by limiting the scope. We will focus on a particular type of distributed systems design, one that uses a client-server model with mostly standard protocols. It turns out that these standard protocols provide considerable help with the low-level details of reliable network communications, which makes our job easier. Let's start by reviewing client-server technology and the protocols.

In client-server applications, the server provides some service, such as processing database queries or sending out current stock prices. The client uses the service provided by the server, either displaying database query results to the user or making stock purchase recommendations to an investor. The communication that occurs between the client and the server must be reliable. That is, no data can be dropped and it must arrive on the client side in the same order in which the server sent it.
There are many types of servers we encounter in a distributed system. For example, file servers manage disk storage units on which file systems reside. Database servers house databases and make them available to clients. Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service.
In distributed systems, there can be many servers of a particular type, e.g., multiple file servers or multiple network name servers. The term service is used to denote a set of servers of a particular type. We say that a binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service. There are many binding policies that define how a particular server is chosen. For example, the policy could be based on locality (a Unix NIS client starts by looking first for a server on its own machine); or it could be based on load balance (a CICS client is bound in such a way that uniform responsiveness for all clients is attempted).
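The two binding policies mentioned above can be sketched as simple selection functions. The server records and their fields (host, load) are hypothetical and are not taken from NIS or CICS.

```python
def bind_by_locality(servers, client_host):
    """Prefer a server running on the client's own machine.
    Returns None when no local server exists."""
    for server in servers:
        if server["host"] == client_host:
            return server
    return None


def bind_by_load(servers):
    """Pick the least-loaded server so that responsiveness stays
    roughly uniform across clients."""
    return min(servers, key=lambda server: server["load"])
```

A client might try bind_by_locality first and fall back to bind_by_load when it returns None.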
A distributed service may employ data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed. Caching is a related concept and very common in distributed systems. We say a process has cached data if it maintains a copy of the data locally, for quick access if it is needed again. A cache hit is when a request is satisfied from cached data, rather than from the primary service. For example, browsers use document caching to speed up access to frequently used documents.
Caching is similar to replication, but cached data can become stale. Thus, there may need to be a policy for validating a cached data item before using it. If a cache is actively refreshed by the primary service, caching is identical to replication. [1]
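One possible validation policy is to keep a version number with each cached item and compare it against the primary before using the item. The sketch below assumes a primary service that can report a version cheaply; both classes and all of their methods are invented for the illustration.

```python
class PrimaryService:
    """Authoritative store that bumps a version number on every write."""

    def __init__(self):
        self.data = {}    # key -> (version, value)

    def put(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def version(self, key):
        return self.data[key][0]

    def get(self, key):
        return self.data[key]


class ValidatingCache:
    """Local cache that validates an entry against the primary before use."""

    def __init__(self, primary):
        self.primary = primary
        self.entries = {}    # key -> (version, value)
        self.hits = 0

    def get(self, key):
        cached = self.entries.get(key)
        if cached is not None and cached[0] == self.primary.version(key):
            self.hits += 1                    # cache hit: entry is still fresh
            return cached[1]
        self.entries[key] = self.primary.get(key)    # miss or stale: refetch
        return self.entries[key][1]
```

Validation still costs a round trip to the primary, so it pays off mainly when values are much larger than version numbers.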
As mentioned earlier, the communication between client and server needs to be reliable. You have probably heard of TCP/IP before. The Internet Protocol (IP) suite is the set of communication protocols that allow for communication on the Internet and most commercial networks. The Transmission Control Protocol (TCP) is one of the core protocols of this suite. Using TCP, clients and servers can create connections to one another, over which they can exchange data in packets. The protocol guarantees reliable and in-order delivery of data from sender to receiver.
The IP suite can be viewed as a set of layers, each layer having the property that it only uses the functions of the layer below, and only exports functionality to the layer above. A system that implements protocol behavior consisting of layers is known as a protocol stack. Protocol stacks can be implemented either in hardware or software, or a mixture of both. Typically, only the lower layers are implemented in hardware, with the higher layers being implemented in software.
________________________________________
Resource : The history of TCP/IP mirrors the evolution of the Internet. Here is a brief overview of this history.
________________________________________
There are four layers in the IP suite:
1. Application Layer : The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of applications are HTTP, FTP or Telnet.
2. Transport Layer : The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP). TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-packet handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet, typically up to three times. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.
UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second might send some back to the first and the first might send some more, but there's never a solid connection. Second, UDP doesn't provide any sort of guarantee that the receiver will receive the packets that are sent, or that they will arrive in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.
Although TCP is more reliable than UDP, the protocol is still at risk of failing in many ways. TCP uses acknowledgements and retransmission to detect and repair loss. But it cannot overcome longer communication outages that disconnect the sender and receiver for long enough to defeat the retransmission strategy. The normal maximum disconnection time is between 30 and 90 seconds. TCP could signal a failure and give up when both end-points are fine. This is just one example of how TCP can fail, even though it does provide some mitigating strategies.
3. Network Layer : As originally defined, the Network layer solves the problem of getting packets across a single network. With the advent of the concept of internetworking, additional functionality was added to this layer, namely getting data from a source network to a destination network. This generally involves routing the packet across a network of networks, e.g. the Internet. IP performs the basic task of getting packets of data from source to destination.
4. Link Layer : The link layer deals with the physical transmission of data, and usually involves placing frame headers and trailers on packets for travelling over the physical network and dealing with physical components along the way.
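The transport-layer behavior described above, where TCP drops duplicate packets and rearranges packets that arrive out of sequence, can be illustrated with a small receiver-side sketch. It is a simplification made for the example: it numbers whole segments 0, 1, 2, ... rather than tracking byte offsets as real TCP does, and the function name is invented.

```python
def reassemble(segments):
    """Rebuild an in-order byte stream from segments in arrival order.

    `segments` is an iterable of (sequence_number, payload_bytes).
    Returns (data, next_expected), where data is the contiguous prefix
    received so far and next_expected is the first missing sequence number.
    """
    received = {}
    for seq, payload in segments:
        if seq not in received:    # drop duplicates
            received[seq] = payload

    data = b""
    expected = 0
    while expected in received:    # deliver in order, stopping at the first gap
        data += received[expected]
        expected += 1
    return data, expected
```

Anything after a gap stays buffered until the missing segment arrives, which is the point at which a sender would retransmit.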
________________________________________
Resource : For more information on the IP Suite, refer to the Wikipedia article.
________________________________________
Remote Procedure Calls
Many distributed systems were built using TCP/IP as the foundation for the communication between components. Over time, an efficient method for clients to interact with servers evolved called RPC, which means remote procedure call. It is a powerful technique based on extending the notion of local procedure calling, so that the called procedure may not exist in the same address space as the calling procedure. The two processes may be on the same system, or they may be on different systems with a network connecting them.
An RPC is similar to a function call. Like a function call, when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned. The client makes a procedure call that sends a request to the server. The client process waits until either a reply is received, or it times out. When the request arrives at the server, it calls a dispatch routine that performs the requested service, and sends the reply to the client. After the RPC call is completed, the client process continues.
Threads are common in RPC-based distributed systems. Each incoming request to a server typically spawns a new thread. A thread in the client typically issues an RPC and then blocks (waits). When the reply is received, the client thread resumes execution.
A programmer writing RPC-based code does three things:
1. Specifies the protocol for client-server communication
2. Develops the client program
3. Develops the server program
The communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked.
The client and server programs must communicate via the procedures and data types specified in the protocol. The server side registers the procedures that may be called by the client and receives and returns data required for processing. The client side calls the remote procedure, passes any required data and receives the returned data.
Thus, an RPC application uses classes generated by the stub generator to execute an RPC and wait for it to finish. The programmer needs to supply classes on the server side that provide the logic for handling an RPC request.
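To make the division of labor concrete, here is a hand-written sketch of a client stub and a server dispatch routine. It is an assumption-laden toy: real stubs are generated by a protocol compiler, the JSON wire format is chosen only for readability, and the transport is a direct function call standing in for a network connection with timeouts.

```python
import json


class Server:
    """Registers callable procedures and dispatches incoming requests."""

    def __init__(self):
        self.procedures = {}

    def register(self, name, func):
        self.procedures[name] = func

    def dispatch(self, request_bytes):
        """Decode a request, run the procedure, and encode the reply."""
        request = json.loads(request_bytes.decode("utf-8"))
        func = self.procedures.get(request["method"])
        if func is None:
            reply = {"error": "unknown procedure: " + request["method"]}
        else:
            reply = {"result": func(*request["params"])}
        return json.dumps(reply).encode("utf-8")


class ClientStub:
    """Makes a remote call look like a local one to the caller."""

    def __init__(self, transport):
        self.transport = transport    # callable: request bytes -> reply bytes

    def call(self, method, *params):
        request = json.dumps({"method": method, "params": list(params)})
        reply_bytes = self.transport(request.encode("utf-8"))    # caller blocks here
        reply = json.loads(reply_bytes.decode("utf-8"))
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["result"]
```

For example, after server.register("add", lambda a, b: a + b), a stub built as ClientStub(server.dispatch) lets the client write stub.call("add", 2, 3) as if the procedure were local.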
RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, network problem, or a problem on a client computer.
Some RPC applications view these types of errors as unrecoverable. Fault-tolerant systems, however, have alternate sources for critical services and fail-over from a primary server to a backup server.
A challenging error-handling case occurs when a client needs to know the outcome of a request in order to take the next step, after failure of a server. This can sometimes result in incorrect actions and results. For example, suppose a client process requests a ticket-selling server to check for a seat in the orchestra section of Carnegie Hall. If it's available, the server records the request and the sale. But the request fails by timing out. Was the seat available and the sale recorded? Even if there is a backup server to which the request can be re-issued, there is a risk that the client will be sold two tickets, which is an expensive mistake in Carnegie Hall [1].
Here are some common error conditions that need to be handled:
• Network data loss resulting in retransmit: Often, a system tries to achieve 'at most once' transmission. In the worst case, if duplicate transmissions occur, we try to minimize any damage done by the data being received multiple times.
• Server process crashes during RPC operation: If a server process crashes before it completes its task, the system usually recovers correctly because the client will initiate a retry request once the server has recovered. If the server crashes after completing the task but before the RPC reply is sent, duplicate requests sometimes result due to client retries.
• Client process crashes before receiving response: Client is restarted. Server discards response data.
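One common way to limit the damage from duplicate requests, such as the double ticket sale described earlier, is to have the client attach a unique request ID and have the server remember the outcome of each ID. The sketch below is illustrative only: the class, the method, and the in-memory table are assumptions, and a real server would have to keep that table somewhere that survives a crash and is visible to any backup server.

```python
class TicketServer:
    """Sells each seat at most once, even when a request is retried."""

    def __init__(self, seats):
        self.available = set(seats)
        self.completed = {}    # request_id -> outcome already returned

    def buy(self, request_id, seat):
        # A retry of a request we already handled: replay the recorded
        # outcome instead of executing the sale a second time.
        if request_id in self.completed:
            return self.completed[request_id]
        if seat in self.available:
            self.available.remove(seat)
            outcome = "sold"
        else:
            outcome = "unavailable"
        self.completed[request_id] = outcome
        return outcome
```

If the reply to the first buy() is lost and the client retries with the same request ID, it learns the seat was sold to it rather than buying a second ticket.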

Some Distributed Design Principles
Given what we have covered so far, we can define some fundamental design principles which every distributed system designer and software engineer should know. Some of these may seem obvious, but it will be helpful as we proceed to have a good starting list.
________________________________________
• As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state. A classic error scenario is for a process to send data to a process running on a second machine. The process on the first machine receives some data back and processes it, and then sends the results back to the second machine assuming it is ready to receive. Any number of things could have failed in the interim and the sending process must anticipate these possible failures.
• Explicitly define failure scenarios and identify how likely each one might occur. Make sure your code is thoroughly covered for the most likely ones.
• Both clients and servers must be able to deal with unresponsive senders/receivers.
• Think carefully about how much data you send over the network. Minimize traffic as much as possible.
• Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise.
• Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. If you must be sure, do checksums or validity checks on data to verify that the data has not changed.
• Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components. But cached data can become stale, so there may need to be a policy for validating a cached data item before using it.
If a process stores information that can't be reconstructed, then problems arise. One possible question is, "Are you now a single point of failure?" I have to talk to you now - I can't talk to anyone else. So what happens if you go down? To deal with this issue, you could be replicated. Replication strategies are also useful in mitigating the risks of maintaining state. But there are challenges here too: What if I talk to one replica and modify some data, then I talk to another? Is that modification guaranteed to have already arrived at the other? What happens if the network gets partitioned and the replicas can't talk to each other? Can anybody proceed?
There is a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms.
• Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more. Talk to your colleagues about these alternatives and your results, and decide on the best solution.
• Acks are expensive and tend to be avoided in distributed systems wherever possible.
• Retransmission is costly. It's important to experiment so you can tune the delay that prompts a retransmission to be optimal.
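The checksum principle from the list above can be sketched with the standard hashlib module. The message layout (a dict with payload and sha256 fields) and the function names are assumptions made for the example.

```python
import hashlib


def wrap(payload):
    """Sender side: attach a SHA-256 digest of the payload bytes."""
    return {"payload": payload, "sha256": hashlib.sha256(payload).hexdigest()}


def unwrap(message):
    """Receiver side: recompute the digest and refuse altered data."""
    actual = hashlib.sha256(message["payload"]).hexdigest()
    if actual != message["sha256"]:
        raise ValueError("checksum mismatch: data changed in transit")
    return message["payload"]
```

A digest like this detects accidental corruption. It does not by itself protect against a malicious party who can rewrite both the payload and the digest.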
Exercises
1. Have you ever encountered a Heisenbug? How did you isolate and fix it?
2. For the different failure types listed above, consider what makes each one difficult for a programmer trying to guard against it. What kinds of processing can be added to a program to deal with these failures?
3. Explain why each of the 8 fallacies is actually a fallacy.
4. Contrast TCP and UDP. Under what circumstances would you choose one over the other?
5. What's the difference between caching and data replication?
6. What are stubs in an RPC implementation?
7. What are some of the error conditions we need to guard against in a distributed environment that we do not need to worry about in a local programming environment?
8. Why are pointers (references) not usually passed as parameters to a Remote Procedure Call?
9. Here is an interesting problem called partial connectivity that can occur in a distributed environment. Let's say A and B are systems that need to talk to each other. C is a master that also talks to A and B individually. The communications between A and B fail. C can tell that A and B are both healthy. C tells A to send something to B and waits for this to occur. C has no way of knowing that A cannot talk to B, and thus waits and waits and waits. What diagnostics can you add in your code to deal with this situation?
10. What is the leader-election algorithm? How can it be used in a distributed system?
11. This is the Byzantine Generals problem: Two generals are on hills on either side of a valley. They each have an army of 1000 soldiers. In the woods in the valley is an enemy army of 1500 men. If each general attacks alone, his army will lose. If they attack together, they will win. They wish to send messengers through the valley to coordinate when to attack. However, the messengers may get lost or caught in the woods (or brainwashed into delivering different messages). How can they devise a scheme by which they either attack together with high probability, or not at all?
References
[1] Birman, Kenneth. Reliable Distributed Systems: Technologies, Web Services and Applications. New York: Springer-Verlag, 2005.
[2] Interview with Ken Arnold
[3] The Eight Fallacies
[4] Wikipedia article on IP Suite
[5] Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 1993.
[6] Bohrbugs and Heisenbugs
 Flickr Architecture

Wed, 08/29/2007 - 10:04 — Todd Hoff
Flickr is both my favorite bird and the web's leading photo sharing site. Flickr faces an amazing challenge: they must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?
Site: http://www.flickr.com/
Information Sources
• Flickr and PHP (an early document)
• Capacity Planning for LAMP
• Federation at Flickr: A tour of the Flickr Architecture.
• Building Scalable Web Sites by Cal Henderson from Flickr.
• Database War Stories #3: Flickr by Tim O'Reilly
• Cal Henderson's Talks. A lot of useful PowerPoint presentations.
Platform
• PHP
• MySQL
• Shards
• Memcached for a caching layer.
• Squid in reverse-proxy for html and images.
• Linux (RedHat)
• Smarty for templating
• Perl
• PEAR for XML and Email parsing
• ImageMagick, for image processing
• Java, for the node service
• Apache
• SystemImager for deployment
• Ganglia for distributed system monitoring
• Subcon stores essential system configuration files in a subversion repository for easy deployment to machines in a cluster.
• Cvsup for distributing and updating collections of files across a network.
The Stats
• More than 4 billion queries per day.
• ~35M photos in squid cache (total)
• ~2M photos in squid’s RAM
• ~470M photos, 4 or 5 sizes of each
• 38k req/sec to memcached (12M objects)
• 2 PB raw storage (consumed about 1.5 TB on Sunday)
• Over 400,000 photos being added every day
The Architecture
• A pretty picture of Flickr's architecture can be found on this slide. A simple depiction is:
-- Pair of ServerIrons
---- Squid Caches
------ NetApps
---- PHP App Servers
------ Storage Manager
------ Master-master shards
------ Dual Tree Central Database
------ Memcached Cluster
------ Big Search Engine
- The Dual Tree structure is a custom set of changes to MySQL that allows scaling by incrementally adding masters without a ring architecture. This allows cheaper scaling because you need less hardware compared to master-master setups, which always require double the hardware.
- The central database includes data like the 'users' table, which includes primary user keys (a few different IDs) and a pointer to the shard on which a user's data can be found.
• Use dedicated servers for static content.
• Talks about how to support Unicode.
• Use a share nothing architecture.
• Everything (except photos) is stored in the database.
• Statelessness means they can bounce people around servers and it's easier to make their APIs.
• Scaled at first by replication, but that only helps with reads.
• Create a search farm by replicating the portion of the database they want to search.
• Use horizontal scaling so they just need to add more machines.
• Handle pictures emailed from users by parsing each email as it's delivered, in PHP. Each email is parsed for any photos.
• Earlier they suffered from Master-Slave lag. Too much load and they had a single point of failure.
• They needed the ability to make live maintenance, repair data, and so forth, without taking the site down.
• Lots of excellent material on capacity planning. Take a look in the Information Sources for more details.
• Went to a federated approach so they can scale far into the future:
- Shards: My data gets stored on my shard, but the record of an action I perform on your content (for example, commenting on someone else's blog) is stored on your shard.
- Global Ring: It's like DNS: you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.
- PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)
• Shards:
- Slice of the main database
- Active Master-Master Ring Replication: this has a few drawbacks in MySQL 4.1, such as honoring commits in Master-Master. Auto-increment IDs are automated to keep it Active-Active.
- Shard assignments are from a random number for new accounts
- Migration is done from time to time, so you can remove certain power users. Needs to be balanced if you have a lot of photos… 192,000 photos, 700,000 tags, will take about 3-4 minutes. Migration is done manually.
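The lookup path described in these notes (a central record pointing at a user's shard, random shard assignment for new accounts, and occasional manual migration) might be sketched roughly as follows. This is an illustration in Python rather than Flickr's PHP; the class `ShardDirectory`, its method names, and the shard numbers are hypothetical, and a plain dict stands in for the central database and its cache.

```python
import random


class ShardDirectory:
    """Toy stand-in for the central 'users' table: user_id -> shard_id."""

    def __init__(self, shard_ids, rng=None):
        self.shard_ids = list(shard_ids)
        self.user_to_shard = {}          # hypothetical central lookup table
        self.rng = rng or random.Random()

    def assign_new_user(self, user_id):
        """New accounts get a shard picked at random."""
        if user_id in self.user_to_shard:
            raise ValueError(f"user {user_id} already has a shard")
        shard_id = self.rng.choice(self.shard_ids)
        self.user_to_shard[user_id] = shard_id
        return shard_id

    def shard_for(self, user_id):
        """On every page view, work out where this user's data lives now."""
        return self.user_to_shard[user_id]

    def migrate(self, user_id, new_shard_id):
        """Manual rebalancing: repoint the user once their rows are copied."""
        if new_shard_id not in self.shard_ids:
            raise ValueError(f"unknown shard {new_shard_id}")
        self.user_to_shard[user_id] = new_shard_id


directory = ShardDirectory(shard_ids=[5, 13, 21], rng=random.Random(42))
first_shard = directory.assign_new_user("alice")
directory.migrate("alice", 13)
print(first_shard, directory.shard_for("alice"))
```

Because every request goes through `shard_for`, moving a heavy user only requires copying their rows and updating one pointer; the copy step itself is not shown here.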
• Clicking a Favorite:
- Pulls the photo owner's account from cache, to get the shard location (say on shard-5)
- Pulls my Information from cache, to get my shard location (say on shard-13)
- Starts a “distributed transaction” - to answer the question: Who favorited the photo? What are my favorites?
• Can ask the question from any shard and recover the data. It's absolutely redundant.
• To get rid of replication lag…
- every page load, the user is assigned to a bucket
- if a host is down, go to the next host in the list; if all hosts are down, display an error page. They don't use persistent connections; they build connections and tear them down. Every page load thus tests the connection.
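The "try the next host in the list" behaviour can be sketched as below. This is a minimal illustration, not Flickr's code: `connect_with_failover`, `AllHostsDown`, `fake_connect`, and the host names are hypothetical, and the connector is passed in as a plain callable so the example runs without a database.

```python
class AllHostsDown(Exception):
    """Raised when no host in the list accepts a connection."""


def connect_with_failover(hosts, connect):
    """Try each host in order; return (host, connection) for the first success.

    `connect` is any callable that opens a fresh connection or raises OSError.
    Nothing is reused between calls, so every page load re-tests the hosts.
    """
    errors = {}
    for host in hosts:
        try:
            return host, connect(host)
        except OSError as exc:
            errors[host] = exc           # remember why, then try the next host
    raise AllHostsDown(f"no reachable host among {list(hosts)}: {errors}")


def fake_connect(host):
    """Hypothetical connector for the demo: 'db-a' pretends to be down."""
    if host == "db-a":
        raise ConnectionRefusedError(f"{host} refused the connection")
    return f"connection to {host}"


host, conn = connect_with_failover(["db-a", "db-b"], fake_connect)
print(host, conn)
```

A caller would catch `AllHostsDown` and render the error page. Opening a new connection per page load costs a little latency but means a dead host is noticed immediately rather than through a stale pooled connection.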
• Every user's reads and writes are kept in one shard. The notion of replication lag is gone.
• Each server in shard is 50% loaded. Shut down 1/2 the servers in each shard. So 1 server in the shard can take the full load if a server of that shard is down or in maintenance mode. To upgrade you just have to shut down half the shard, upgrade that half, and then repeat the process.
• During periods when traffic spikes, they break the 50% rule though. They do something like 6,000-7,000 queries per second. Now, it's designed for at most 4,000 queries per second to keep it at 50% load.
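As a quick worked check of those figures: if 4,000 queries per second corresponds to 50% load, the implied full capacity is 8,000 queries per second, so spikes of 6,000-7,000 sit at roughly 75-88% load. The snippet below just does that arithmetic; it assumes load grows linearly with query rate, which is a simplification, and the function name `utilization` is made up for the example.

```python
def utilization(observed_qps, design_qps, design_load=0.5):
    """Fraction of full capacity in use at observed_qps.

    If design_qps corresponds to design_load (4,000 qps at 50%), the implied
    full capacity is design_qps / design_load.
    """
    full_capacity = design_qps / design_load
    return observed_qps / full_capacity


for qps in (4000, 6000, 7000):
    print(qps, f"{utilization(qps, design_qps=4000):.1%}")
```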
• Average queries per page, are 27-35 SQL statements. Favorites counts are real time. API access to the database is all real time. Achieved the real time requirements without any disadvantages.
• Over 36,000 queries per second - running within capacity threshold. Burst of traffic, double 36K/qps.
• Each shard holds 400K+ users' data.
- A lot of data is stored twice. For example, a comment is part of the relation between the commenter and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out-of-sync data: open transaction 1, write commands, open transaction 2, write commands, commit the 1st transaction if all is well, commit the 2nd transaction if the 1st committed. There is still a chance of failure if a box goes down during the 1st commit.
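The two-transaction sequence above can be sketched as follows. This is an illustration under stated assumptions, not Flickr's implementation: two in-memory SQLite databases stand in for two MySQL shards, and the table layout and the function names `make_shard` and `write_comment_to_both` are hypothetical.

```python
import sqlite3


def make_shard():
    """An in-memory SQLite database standing in for one MySQL shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (author TEXT, photo_owner TEXT, body TEXT)"
    )
    conn.commit()
    return conn


def write_comment_to_both(author_shard, owner_shard, author, owner, body):
    """Store the same comment on the commenter's shard and the owner's shard."""
    row = (author, owner, body)
    sql = "INSERT INTO comments (author, photo_owner, body) VALUES (?, ?, ?)"
    try:
        # sqlite3 opens a transaction implicitly before each INSERT.
        author_shard.execute(sql, row)   # transaction 1: write
        owner_shard.execute(sql, row)    # transaction 2: write
    except sqlite3.Error:
        author_shard.rollback()
        owner_shard.rollback()
        raise
    author_shard.commit()                # commit the 1st if all is well
    # A crash at this point leaves the shards out of sync. That is the
    # failure window the notes mention; this sketch does not close it.
    owner_shard.commit()                 # commit the 2nd once the 1st committed


shard_5, shard_13 = make_shard(), make_shard()
write_comment_to_both(shard_13, shard_5, "me", "you", "Nice photo!")
print(shard_5.execute("SELECT COUNT(*) FROM comments").fetchone()[0])
```

Delaying both commits until both writes have succeeded catches the common failures (bad SQL, an unreachable shard) cheaply; it is not a full two-phase commit, which is why the remaining failure window exists.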
• Search:
- Two search back-ends: the shards (35k qps on a few shards) and Yahoo!’s (proprietary) web search
- Owner’s single tag search or a batch tag change (say, via Organizr) goes to the Shards due to real-time requirements, everything else goes to Yahoo!’s engine (probably about 90% behind the real-time goodness)
- Think of it such that you’ve got Lucene-like search
• Hardware:
- EM64T w/RHEL4, 16GB RAM
- 6-disk 15K RPM RAID-10.
- Data size is at 12 TB of user metadata (these are not photos, this is just innodb ibdata files - the photos are a lot larger).
- 2U boxes. Each shard has ~120GB of data.
• Backup procedure:
- ibbackup on a cron job, that runs across various shards at different times. Hotbackup to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover something has gone wrong a few days later: something like 1, 2, 10 and 30 day backups.
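One way to read the staggered schedule is: for each target age (1, 2, 10 and 30 days), keep the newest backup that is at least that old. The sketch below implements that reading; it is an assumption about the policy rather than a description of Flickr's tooling, and `backups_to_keep` is a hypothetical helper.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, ages_in_days=(1, 2, 10, 30)):
    """For each target age, keep the newest backup that is at least that old."""
    keep = set()
    for age in ages_in_days:
        cutoff = today - timedelta(days=age)
        old_enough = [d for d in backup_dates if d <= cutoff]
        if old_enough:
            keep.add(max(old_enough))
    return sorted(keep)


today = date(2007, 8, 29)
nightly = [today - timedelta(days=n) for n in range(1, 41)]  # 40 nightly backups
kept = backups_to_keep(nightly, today)
for day in kept:
    print(day)
```

Everything not returned is a candidate for deletion, ideally spread out over time, since the notes above warn that deleting several huge files at once can hurt a replicated filestore.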
• Photos are stored on the filer. Upon upload, it processes the photos and gives you different sizes, then it's complete. Metadata and pointers to the filers are stored in the database.
• Aggregating the data: Very fast, because it's a process per shard. Stick it into a table, or recover data from another copy on other users' shards.
• max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. Thread cache is set to 45, because you don’t have more than 45 users having simultaneous activity.
• Tags:
- Tags do not fit well with traditional normalized RDBMS schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters which save the results into MySQL because some relationships are so complicated to calculate it would absorb all the database CPU cycles.
• Future Direction:
- Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc) all at the same time. Everything is active nothing will ever be idle.
Lessons Learned
• Think of your application as more than just a web application. You'll have REST APIs, SOAP APIs, RSS feeds, Atom feeds, etc.
• Go stateless. Statelessness makes for a simpler more robust system that can handle upgrades without flinching.
• Re-architecting your database sucks.
• Capacity plan. Bring capacity planning into the product discussion EARLY. Get buy-in from the $$$ people (and engineering management) that it’s something to watch.
• Start slow. Don’t buy too much equipment just because you’re scared/happy that your site will explode.
• Measure reality. Capacity planning math should be based on real things, not abstract ones.
• Build in logging and metrics. Usage stats are just as important as server stats. Build in custom metrics to tie real-world usage to server-based stats.
• Cache. Caching and RAM is the answer to everything.
• Abstract. Create clear levels of abstraction between database work, business logic, page logic, page mark-up and the presentation layer. This supports quick turn around iterative development.
• Layer. Layering allows developers to create page level logic which designers can use to build the user experience. Designers can ask for page logic as needed. It's a negotiation between the two parties.
• Release frequently. Even every 30 minutes.
• Forget about small efficiencies, about 97% of the time. Premature optimization is the root of all evil.
• Test in production. Build into the architecture mechanisms (config flags, load balancing, etc.) with which you can deploy new hardware easily into (and out of) production.
• Forget benchmarks. Benchmarks are fine for getting a general idea of capabilities, but not for planning. Artificial tests give artificial results, and the time is better used with testing for real.
• Find ceilings.
- What is the maximum something that every server can do?
- How close are you to that maximum, and how is it trending?
- MySQL (disk IO?)
- SQUID (disk IO? or CPU?)
- memcached (CPU? or network?)
• Be sensitive to the usage patterns for your type of application.
- Do you have event related growth? For example: disaster, news event.
- Flickr gets 20-40% more uploads on the first work day of the year than at any peak in the previous year.
- 40-50% more uploads on Sundays than on the rest of the week, on average.
• Be sensitive to the demands of exponential growth. More users means more content, more content means more connections, more connections mean more usage.
• Plan for peaks. Be able to handle peak loads up and down the stack.
Comments
Wed, 08/08/2007 - 13:23 — Sam (not verified)
How to store images?
Is there an easier solution managing images in combination of database and files? It seems storing your images in database might really slow down the site.
Wed, 08/08/2007 - 16:23 — Douglas F Shearer (not verified)
RE: How to store images?
Flickr only store a reference to an image in their databases, the actual file is stored on a separate storage server elsewhere on the network.
A typical URL for a Flickr image looks like this:
http://farm1.static.flickr.com/104/301293250_dc284905d0_m.jpg
If we split this up we get:
farm1 - Obviously the farm at which the image is stored. I have yet to see a value other than one.
.static.flickr.com - Fairly self explanatory.
/104 - The server ID number.
/301293250 - The image ID.
_dc284905d0 - The image 'secret'. I assume this is to prevent images being copied without first getting the information from the API.
_m - The size of the image. In this case the 'm' denotes medium, but this can be small, thumb etc. For the standard image size there is no size of this form in the URL.
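The breakdown in this comment can be turned into a small parser. The pattern below is inferred only from the single example URL above, so the character classes (digits for the IDs, lowercase letters and digits for the secret, one letter for the size) are assumptions, as is the helper name `parse_photo_url`.

```python
import re

# Pattern inferred from one example URL; the character classes are assumptions.
PHOTO_URL = re.compile(
    r"^http://farm(?P<farm>\d+)\.static\.flickr\.com"
    r"/(?P<server>\d+)"
    r"/(?P<photo_id>\d+)_(?P<secret>[0-9a-z]+)"
    r"(?:_(?P<size>[a-z]))?"        # absent for the standard image size
    r"\.(?P<ext>[a-z]+)$"
)


def parse_photo_url(url):
    """Split a static photo URL into its parts, or return None if it doesn't match."""
    match = PHOTO_URL.match(url)
    return match.groupdict() if match else None


parts = parse_photo_url(
    "http://farm1.static.flickr.com/104/301293250_dc284905d0_m.jpg"
)
print(parts)
```

Because the file location is derivable from the farm, server ID, photo ID, and secret, the database only needs to hold those small fields while the image bytes stay on the storage servers.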
 Amazon Architecture

Tue, 09/18/2007 - 19:44 — Todd Hoff
This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included.
Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles.
Site: http://amazon.com
Information Sources

• Early Amazon by Greg Linden
• How Linux saved Amazon millions
• Interview Werner Vogels - Amazon's CTO
• Asynchronous Architectures - a nice summary of Werner Vogels' talk by Chris Loosley
• Learning from the Amazon technology platform - A Conversation with Werner Vogels
• Werner Vogels' Weblog - building scalable and robust distributed systems
Platform
• Linux
• Oracle
• C++
• Perl
• Mason
• Java
• Jboss
• Servlets
The Stats
• More than 55 million active customer accounts.
• More than 1 million active retail partners worldwide.
• Between 100-150 services are accessed to build a page.
The Architecture
• What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added. Increasing performance in general means serving more units of work, but it can also be to handle larger units of work, such as when datasets grow.
• The big architectural change that Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications.
• Started as one application talking to a back end. Written in C++.
• It grew. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. In 2001 it became clear that the front-end application couldn't scale anymore. The databases were split into small parts, and around each part they created a services interface that was the only way to access the data.
• The databases became a shared resource that made it hard to scale-out the overall business. The front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.
• Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently.
• Grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server. So are the applications that serve the Web-services interface, the customer service application, and the seller interface.
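An application server that aggregates many services to build one page can be sketched as a parallel fan-out. This is a minimal illustration and not Amazon's mechanism: the function `build_page`, the three demo services, and the choice to render a missing fragment as `None` when a service fails are all assumptions of the example.

```python
from concurrent.futures import ThreadPoolExecutor


def build_page(services, request, max_workers=16):
    """Call every service in parallel and collect the results by service name.

    A service that raises contributes None, so the page can still render
    without that fragment (an assumption of this sketch).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(fn, request) for name, fn in services.items()
        }
        page = {}
        for name, future in futures.items():
            try:
                page[name] = future.result()
            except Exception:
                page[name] = None
        return page


def recommendations(request):
    return ["book-1", "book-2"]


def price(request):
    return 9.99


def reviews(request):
    raise RuntimeError("reviews service unavailable")


services = {"recommendations": recommendations, "price": price, "reviews": reviews}
page = build_page(services, {"product_id": "B000"})
print(page)
```

With 100-150 services behind a page, total latency in this model is bounded by the slowest call rather than the sum of all calls; a production version would also need per-call timeouts, which are left out here.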
• Many third party technologies are hard to scale to Amazon size. Especially communication infrastructure technologies. They work well up to a certain scale and then fail. So they are forced to build their own.
• Not stuck with one particular approach. Some places they use jboss/java, but they use only servlets, not the rest of the J2EE stack.
• C++ is used to process requests. Perl/Mason is used to build content.
• Amazon doesn't like middleware because it tends to be a framework and not a tool. If you use a middleware package you get lock-in around the software patterns they have chosen. You'll only be able to use their software. So if you want to use different packages you won't be able to. You're stuck. One event loop for messaging, data persistence, AJAX, etc. Too complex. If middleware were available in smaller components, more as a tool than a framework, they would be more interested.
• The SOAP web stack seems to want to solve all the same distributed systems problems all over again.
• Offer both SOAP and REST web services. 30% use SOAP. These tend to be Java and .NET users and use WSDL files to generate remote object interfaces. 70% use REST. These tend to be PHP or Perl users.
• In either SOAP or REST developers can get an object interface to Amazon. Developers just want to get job done. They don't care what goes over the wire.
• Amazon wanted to build an open community around their services. Web services were chosen because they're simple. But that's only on the perimeter. Internally it's a service oriented architecture. You can only access the data via the interface. It's described in WSDL, but they use their own encapsulation and transport mechanisms.
• Teams are Small and are Organized Around Services
- Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams.
- If you have a new business idea or problem you want to solve, you form a team. Limit the team to 8-10 people because communication is hard. They are called two pizza teams: the number of people you can feed off two pizzas.
- Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit.
- As an example, they created a team to find phrases within a book that are unique to the text. This team built a separate service interface for that feature and they had authority to do what they needed.
- Extensive A/B testing is used to integrate a new service. They see what the impact is and take extensive measurements.
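A common building block for A/B testing is a stable assignment of each user to a variant, so the same customer always sees the same version while measurements are taken. The sketch below shows one hash-based way to do that; it is a generic technique rather than a description of Amazon's system, and `assign_variant` and the experiment name are hypothetical.

```python
import hashlib
from collections import Counter


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant for one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Rough check that the split is even across many users.
counts = Counter(
    assign_variant(f"user-{n}", "new-phrase-service") for n in range(10_000)
)
print(counts)
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who lands in "A" for one test is not systematically in "A" for every other test.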
• Deployment
- They create special infrastructure for managing dependencies and doing a deployment.
- Goal is to have all the right services deployed on a box. All application code, monitoring, licensing, etc. should be on a box.
- Everyone has a home grown system to solve these problems.
- Output of deployment process is a virtual machine. You can use EC2 to run them.
• Work From the Customer Backwards to Verify a New Service is Worth Doing
- Work from the customer backward. Focus on the value you want to deliver for the customer.
- Force developers to focus on value delivered to the customer instead of building technology first and then figuring how to use it.
- Start with a press release of what features the user will see and work backwards to check that you are building something valuable.
- End up with a design that is as minimal as possible. Simplicity is the key if you really want to build large distributed systems.
• State Management is the Core Problem for Large Scale Systems
- Internally they can deliver infinite storage.
- Not all that many operations are stateful. Checkout steps are stateful.
- Most recent clicked web page service has recommendations based on session IDs.
- They keep track of everything anyway so it's not a matter of keeping state. There's little separate state that needs to be kept for a session. The services will already be keeping the information so you just use the services.
• Eric Brewer's CAP Theorem or the Three properties of Systems
- Three properties of a system: consistency, availability, tolerance to network partitions.
- You can have at most two of these three properties for any shared-data system.
- Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone.
- Consistency: if you write a value and then read it, you get the same value back. In a partitioned system there are windows where that's not true.
- Availability: may not always be able to write or read. The system will say you can't write because it wants to keep the system consistent.
- To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency.
- Choose a specific approach based on the needs of the service.
- For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later.
- When a customer submits an order you favor consistency because several services--credit card processing, shipping and handling, reporting--are simultaneously accessing the data.
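To make the shopping cart case concrete: if both sides of a partition keep accepting "add to cart" requests, the replicas diverge and have to be reconciled afterwards. The sketch below shows one simple reconciliation rule (keep every item at its highest quantity). The rule and the function `merge_carts` are assumptions chosen for illustration; the source only says errors are "sorted out later" and does not say how.

```python
def merge_carts(*replicas):
    """Reconcile divergent cart replicas: keep every item at its highest quantity."""
    merged = {}
    for cart in replicas:
        for item, qty in cart.items():
            merged[item] = max(qty, merged.get(item, 0))
    return merged


replica_1 = {"book": 1, "dvd": 2}         # writes accepted on one side of a partition
replica_2 = {"book": 1, "headphones": 1}  # writes accepted on the other side
merged = merge_carts(replica_1, replica_2)
print(merged)
```

This rule never loses an added item, which matches the stated preference for honoring add requests, but it has a visible cost: an item removed on one replica can reappear after the merge. That is the kind of trade-off accepted when availability is chosen over consistency.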
Lessons Learned
• You must change your mentality to build really scalable systems. Approach chaos in a probabilistic sense that things will work well. In traditional systems we present a perfect world where nothing goes down and then we build complex algorithms (agreement technologies) on this perfect world. Instead, take it for granted that stuff fails; that's reality, embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create self-healing, self-organizing, lights-out operations.
• Create a shared nothing infrastructure. Infrastructure can become a shared resource for development and deployment with the same downsides as shared resources in your logic and data tiers. It can cause locking and blocking and dead lock. A service oriented architecture allows the creation of a parallel and isolated development process that scales feature development to match your growth.
• Open up your system with APIs and you'll create an ecosystem around your application.
• The only way to manage a large distributed system is to keep things as simple as possible. Keep things simple by making sure there are no hidden requirements and hidden dependencies in the design. Cut technology to the minimum you need to solve the problem you have. It doesn't help the company to create artificial and unneeded layers of complexity.
• Organizing around services gives agility. You can do things in parallel because the output is a service. This allows fast time to market. Create an infrastructure that allows services to be built very fast.
• There are bound to be problems with anything that produces hype before real implementation.
• Use SLAs internally to manage services.
• Anyone can very quickly add web services to their product. Just implement one part of your product as a service and start using it.
• Build your own infrastructure for performance, reliability, and cost control reasons. By building it yourself you never have to say you went down because it was company X's fault. Your software may not be more reliable than others, but you can fix, debug, and deploy much quicker than when working with a 3rd party.
• Use measurement and objective debate to separate the good from the bad. I've been to several presentations by ex-Amazoners and this is the aspect of Amazon that strikes me as uniquely different and interesting from other companies. Their deep seated ethic is to expose real customers to a choice and see which one works best and to make decisions based on those tests.
Avinash Kaushik calls this getting rid of the influence of the HiPPOs, the highest paid people in the room. This is done with techniques like A/B testing and Web Analytics. If you have a question about what you should do, code it up, let people use it, and see which alternative gives you the results you want.
• Create a frugal culture. Amazon used doors for desks, for example.
• Know what you need. Amazon had a bad experience with an early recommender system that didn't work out: "This wasn't what Amazon needed. Book recommendations at Amazon needed to work from sparse data, just a few ratings or purchases. It needed to be fast. The system needed to scale to massive numbers of customers and a huge catalog. And it needed to enhance discovery, surfacing books from deep in the catalog that readers wouldn't find on their own."
• People's side projects, the ones they follow because they are interested, are often the ones where you get the most value and innovation. Never underestimate the power of wandering where you are most interested.
• Involve everyone in making dog food. Go out into the warehouse and pack books during the Christmas rush. That's teamwork.
• Create a staging site where you can run thorough tests before releasing into the wild.
• A robust, clustered, replicated, distributed file system is perfect for read-only data used by the web servers.
• Have a way to rollback if an update doesn't work. Write the tools if necessary.
• Switch to a deep services-based architecture ( http://webservices.sys-con.com/read/262024.htm).
• Look for three things in interviews: enthusiasm, creativity, competence. The single biggest predictor of success at Amazon.com was enthusiasm.
• Hire a Bob. Someone who knows their stuff, has incredible debugging skills and system knowledge, and most importantly, has the stones to tackle the worst high pressure problems imaginable by just leaping in.
• Innovation can only come from the bottom. Those closest to the problem are in the best position to solve it. Any organization that depends on innovation must embrace chaos. Loyalty and obedience are not your tools.
• Creativity must flow from everywhere.
• Everyone must be able to experiment, learn, and iterate. Position, obedience, and tradition should hold no power. For innovation to flourish, measurement must rule.
• Embrace innovation. In front of the whole company, Jeff Bezos would give an old Nike shoe as "Just do it" award to those who innovated.
• Don't pay for performance. Give good perks and high pay, but keep it flat. Recognize exceptional work in other ways. Merit pay sounds good but is almost impossible to do fairly in large organizations. Use non-monetary awards, like an old shoe. It's a way of saying thank you, somebody cared.
• Get big fast. The big guys like Barnes and Noble are on your tail. Amazon wasn't even the first, second, or even third book store on the web, but their vision and drive won out in the end.
• In the data center, only 30 percent of staff time is spent on infrastructure issues related to value creation, with the remaining 70 percent devoted to dealing with the "heavy lifting" of hardware procurement, software management, load balancing, maintenance, scalability challenges and so on.
• Prohibit direct database access by clients. This means you can make your service scale and be more reliable without involving your clients. This is much like Google's ability to independently distribute improvements in their stack to the benefit of all applications.
• Create a single unified service-access mechanism. This allows for the easy aggregation of services, decentralized request routing, distributed request tracking, and other advanced infrastructure techniques.
• Making Amazon.com available through a Web services interface to any developer in the world free of charge has also been a major success because it has driven so much innovation that they couldn't have thought of or built on their own.
• Developers themselves know best which tools make them most productive and which tools are right for the job.
• Don't impose too many constraints on engineers. Provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, allow teams to function as independently as possible.
• Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. Have many support tools that are of a self-help nature. Support an environment around the service development that never gets in the way of the development itself.
• You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
• Developers should spend some time with customer service every two years. There they'll actually listen to customer service calls, answer customer service e-mails, and really understand the impact of the kinds of things they do as technologists.
• Use a "voice of the customer," which is a realistic story from a customer about some specific part of your site's experience. This helps managers and engineers connect with the fact that we build these technologies for real people. Customer service statistics are an early indicator if you are doing something wrong, or what the real pain points are for your customers.
• Infrastructure for Amazon, like for Google, is a huge competitive advantage. They can build very complex applications out of primitive services that are by themselves relatively simple. They can scale their operation independently, maintain unparalleled system availability, and introduce new services quickly without the need for massive reconfiguration.
Comments
Thu, 08/09/2007 - 16:22 — herval (not verified)
Jeff.. Bazos?
Wed, 08/29/2007 - 22:07 — Joachim Rohde (not verified)
Werner Vogels, the CTO of amazon, spoke a tiny bit about technical details on SE-Radio. You can find the podcast under http://www.se-radio.net/index.php?post_id=157593
Interesting episode.
Fri, 08/31/2007 - 05:27 — Todd Hoff

Re: Amazon Architecture
That is a good interview. Thanks. I'll be adding the new information soon.
Fri, 09/07/2007 - 08:39 — Arturo Fernandez (not verified)
Re: Amazon Architecture
Amazon uses Perl and Mason.
See: http://www.masonhq.com/?MasonPoweredSites
Tue, 09/11/2007 - 07:52 — Alexei A. Korolev (not verified)
Re: Amazon Architecture
as i see they reduce c++ part and move to j2ee?
Tue, 09/18/2007 - 02:30 — Anonymous (not verified)
It's WSDL
I am not sure how you can get that one wrong, unless you are a manager, but even then some engineer would school you 'til Sunday.
Tue, 09/18/2007 - 04:37 — Todd Hoff

Re: It's WSDL
Thanks for catching that. I listen to these things a few times and sometimes I just write what I hear instead of what I mean.
Tue, 09/18/2007 - 07:28 — Werner (not verified)
Re: Amazon Architecture
I actually gave a crisper definition of scalability at: A Word on Scalability
Personally I like the interview in ACM Queue best for a high level view
--Werner
 Scaling Twitter: Making Twitter 10000 Percent Faster

Mon, 10/08/2007 - 21:01 — Todd Hoff
Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:

For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.
If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?
Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz
Site: http://twitter.com
Information Sources
• Scaling Twitter Video by Blaine Cook.
• Scaling Twitter Slides
• Good News blog post by Rick Denatale
• Scaling Twitter blog post by Patrick Joyce.
• Twitter API Traffic is 10x Twitter’s Site.
• A Small Talk on Getting Big. Scaling a Rails App & all that Jazz - really cute dog pics
The Platform
• Ruby on Rails
• Erlang
• MySQL
• Mongrel - hybrid Ruby/C HTTP server designed to be small, fast, and secure
• Munin
• Nagios
• Google Analytics
• AWStats - real-time logfile analyzer to get advanced statistics
• Memcached
The Stats
• Over 350,000 users. The actual numbers are as always, very super super top secret.
• 600 requests per second.
• Average 200-300 connections per second. Spiking to 800 connections per second.
• MySQL handled 2,400 requests per second.
• 180 Rails instances. Uses Mongrel as the "web" server.
• 1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
• 30+ processes for handling odd jobs.
• 8 Sun X4100s.
• Process a request in 200 milliseconds in Rails.
• Average time spent in the database is 50-100 milliseconds.
• Over 16 GB of memcached.
The Architecture
• Ran into very public scaling problems. The little bird of failure popped up a lot for a while.
• Originally they had no monitoring, no graphs, no statistics, which makes it hard to pinpoint and solve problems. Added Munin and Nagios. There were difficulties using tools on Solaris. Had Google analytics but the pages weren't loading so it wasn't that helpful :-)
• Use caching with memcached a lot.
- For example, if getting a count is slow, you can memoize the count into memcache in a millisecond.
- Getting your friends' status is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in cache instead. It never touches the database. This gives a predictable response time frame (upper bound 20 msecs).
- ActiveRecord objects are huge so that's why they aren't cached. So they want to store critical attributes in a hash and lazy load the other attributes on access.
- 90% of requests are API requests. So don't do any page/fragment caching on the front-end. The pages are so time sensitive it doesn't do any good. But they cache API requests.
• Messaging
- Use messaging a lot. Producers produce messages, which are queued, and then are distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc).
- Send message to invalidate friend's cache in the background instead of doing all individually, synchronously.
- Started with DRb, which stands for distributed Ruby. A library that allows you to send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and a single point of failure.
- Moved to Rinda, which is a shared queue that uses a tuplespace model, along the lines of Linda. But the queues are not persistent and the messages are lost on failure.
- Tried Erlang. Problem: How do you get a broken server running at Sunday Monday with 20,000 users waiting? The developer didn't know. Not a lot of documentation. So it violates the use what you know rule.
- Moved to Starling, a distributed queue written in Ruby.
- Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well.
• SMS is handled using an API supplied by third-party gateways. It's very expensive.
• Deployment
- They do a review and push out new mongrel servers. No graceful way yet.
- An internal server error is given to the user if their mongrel server is replaced.
- All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels and a rolling approach would cause all the queues in the remaining mongrels to fill up.
• Abuse
- A lot of down time because people crawl the site and add everyone as friends. 9000 friends in 24 hours. It would take down the site.
- Build tools to detect these problems so you can pinpoint when and where they are happening.
- Be ruthless. Delete them as users.
• Partitioning
- Plan to partition in the future. Currently they don't. These changes have been enough so far.
- The partition scheme will be based on time, not users, because most requests are very temporally local.
- Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. May write to a read-only slave, which is really bad.
• Twitter's API Traffic is 10x Twitter’s Site
- Their API is the most important thing Twitter has done.
- Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create.
• Monit is used to kill processes if they get too big.
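The caching and messaging bullets above can be illustrated with a minimal sketch. It is not Twitter's code: a plain dict stands in for memcached, `queue.Queue` stands in for a message queue such as Starling, and the names `follower_count`, `follow`, and `drain_invalidations` are hypothetical. It shows a slow count being memoized in the cache, and a write queuing a cache invalidation for a background consumer instead of doing the work synchronously.

```python
import queue

# Stand-ins (assumptions): a dict plays the role of memcached and
# queue.Queue plays the role of a message queue such as Starling.
cache = {}
invalidations = queue.Queue()

# Hypothetical "database": follower sets keyed by user name.
followers_db = {"alice": {"bob", "carol"}}
stats = {"db_reads": 0}


def slow_follower_count(user):
    """Pretend this is an expensive COUNT(*) query."""
    stats["db_reads"] += 1
    return len(followers_db.get(user, set()))


def follower_count(user):
    """Cache-aside read: touch the database only on a cache miss."""
    key = f"follower_count:{user}"
    if key not in cache:
        cache[key] = slow_follower_count(user)
    return cache[key]


def follow(follower, user):
    """Write path: update the database, then queue the invalidation."""
    followers_db.setdefault(user, set()).add(follower)
    invalidations.put(f"follower_count:{user}")


def drain_invalidations():
    """Consumer: a real system would run this in a background worker."""
    while True:
        try:
            key = invalidations.get_nowait()
        except queue.Empty:
            break
        cache.pop(key, None)
```

Until `drain_invalidations()` runs, readers may still see the old count. That staleness is the price of moving invalidation off the request path.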
Lessons Learned
• Talk to the community. Don't hide and try to solve all problems yourself. Many brilliant people are willing to help if you ask.
• Treat your scaling plan like a business plan. Assemble a board of advisers to help you.
• Build it yourself. Twitter spent a lot of time trying other people's solutions that almost seemed to work, but not quite. It's better to build some things yourself so you at least have some control and you can build in the features you need.
• Build in user limits. People will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed.
• Don't make the database the central bottleneck of doom. Not everything needs to require a gigantic join. Cache data. Think of other creative ways to get the same result. A good example is talked about in Twitter, Rails, Hammers, and 11,000 Nails per Second.
• Make your application easily partitionable from the start. Then you always have a way to scale your system.
• Realize your site is slow. Immediately add reporting to track problems.
• Optimize the database.
- Index everything. Rails won't do this for you.
- Use explain to see how your queries are running. Indexes may not be being used as you expect.
- Denormalize a lot. Single handedly saved them. For example, they store all of a user ID's friend IDs together, which prevented a lot of costly joins.
- Avoid complex joins.
- Avoid scanning large sets of data.
• Cache the hell out of everything. Individual active records are not cached, yet. The queries are fast enough for now.
• Test everything.
- You want to know when you deploy an application that it will render correctly.
- They have a full test suite now. So when the caching broke they were able to find the problem before going live.
• Long running processes should be abstracted to daemons.
• Use exception notifier and exception logger to get immediate notification of problems so you can address them right away.
• Don't do stupid things.
- Scale changes what can be stupid.
- Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it worked great.
• Most performance comes not from the language, but from application design.
• Turn your website into an open service by creating an API. Their API is a huge reason for Twitter's success. It allows users to create an ever-expanding ecosystem around Twitter that is difficult to compete with. You can never do all the work your users can do and you probably won't be as creative. So open your application up and make it easy for others to integrate your application with theirs.
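The "index everything" and "use explain" advice above can be tried out with a small sketch. Twitter ran MySQL; this example uses Python's built-in `sqlite3` module purely as a stand-in, so the statement is SQLite's `EXPLAIN QUERY PLAN` rather than MySQL's `EXPLAIN`, and the `statuses` table and `idx_statuses_user` index are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statuses (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO statuses (user_id, body) VALUES (?, ?)",
    [(i % 100, f"status {i}") for i in range(1000)],
)

QUERY = "SELECT body FROM statuses WHERE user_id = ?"


def query_plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index the planner has to scan the whole table.
plan_before = query_plan(QUERY, (42,))

# After adding an index on the filtered column, the plan names the index.
conn.execute("CREATE INDEX idx_statuses_user ON statuses (user_id)")
plan_after = query_plan(QUERY, (42,))
```

Comparing `plan_before` with `plan_after` is the whole point: the plan tells you whether the index you created is actually being used.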
Related Articles
• For a discussion of partitioning take a look at Amazon Architecture, An Unorthodox Approach to Database Design : The Coming of the Shard, Flickr Architecture
• The Mailinator Architecture has good strategies for abuse protection.
• GoogleTalk Architecture addresses some interesting issues when scaling social networking sites.
Comments
Thu, 09/13/2007 - 22:51 — Royans (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
Todd, thanks for the excellent research u did on twitter. Its amazing that the entire Twitter infrastructure is running with just one rw database. Would be interesting to find out the usage stats on that single box…
Fri, 09/14/2007 - 00:15 — Bob Warfield (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
Loved your article, it echoes a lot of themes I've been talking about for awhile on my blog, so I wrote about the Twitter case based on your article here:
http://smoothspan.wordpress.com/2007/09/14/twitter-scaling-story-mirrors…
Sat, 09/15/2007 - 07:15 — Shanti Braford (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
I wonder what the RoR haters will make up now to say that ruby doesn't scale.
They loved jumping on the ruby hate bandwagon when twitter was going through it's difficulties. Little bo beep has been quite silent since.
Caching was the answer? Shock. Gasp. Awe. Just like PHP?!? Crazy!
Sat, 09/15/2007 - 11:23 — Dave Hoover (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
I think you're referring to Starfish, not Starling.
Great article!
Thu, 09/20/2007 - 08:36 — choonkeat (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
No, its not Starfish. In the video of his presentation, he mentions "so I wrote Starling…"
Thu, 09/20/2007 - 16:02 — miles (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
great article (and site) Todd. thanks for pulling all this information together. It's a great resource
ps. @Dave: Blaine referred to his 'starling' messaging framework at the SJ Ruby Conference earlier in the year.
Sat, 09/22/2007 - 14:01 — Marcus (not verified)
They could have been 20% better?
So, let's be clear, the biased source in defense mode says themselves they could have been 20% faster just by selecting a different language (note that it doesn't exactly say what the performance hit of the Rails framework itself is, so let's just go with 20% improvement by changing languages and ignore potential problems in (1) their coding decisions and (2) their chosen framework)…. Wow, sign me up for an easy 20% improvement!
Yeah, yeah, I know, I'll hear the usual tripe about how amazing fast Ruby is to develop with. Visual Basic is pretty easy too, as is PHP, but I don't use those either.
Mon, 09/24/2007 - 08:02 — Mikael (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
Sounds like Ruby on Rails _was_ to blame as the 10000 percent improvement was reached by essentially removing the "on rails" part of the equation by extensive caching. This seems to be the real weakness of RoR; Ruby in itself seems OK performance-wise, slower than PHP for example but not catastrophically so. PHP is slower than Java but scales nicely anyway. The database abstraction in "on rails" is a real performance killer though and all the high traffic sites that use RoR successfully (twitter, penny arcade, …) seems to have taken steps to avoid using the database abstraction on live page views by extensive caching.
Of course, caching is a necessary tool for scaling regardless of the platform but with a less inefficient abstraction layer than the one in RoR it is possible to grow more before you have to recode stuff for caching.
Fri, 10/12/2007 - 18:07 — Dustin Puryear (not verified)
Re: Scaling Twitter: Making Twitter 10000 Percent Faster
Excellent article.
I agree with one of the other commenters that it's surprising they have this running from a single MySQL server. Wow. The fact that twitter tends to be very write-heavy, and MySQL isn't exactly perfect for multimaster replication architectures probably has a lot to do with that. I wonder what they are planning to do for future growth? Obviously this will not continue to work as-is..
--
Dustin Puryear
Author, "Best Practices for Managing Linux and UNIX Servers"
http://www.puryear-it.com/pubs/linux-unix-best-practices
 Google Architecture

Mon, 07/23/2007 - 04:26 — Todd Hoff
Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing higher scaling infrastructure to support their products. How do they do that?
Information Sources
• Video: Building Large Systems at Google
• Google Lab: The Google File System
• Google Lab: MapReduce: Simplified Data Processing on Large Clusters
• Google Lab: BigTable.
• Video: BigTable: A Distributed Structured Storage System.
• Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
• How Google Works by David Carr in Baseline Magazine.
• Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
• Dare Obasonjo's Notes on the scalability conference.
Platform
• Linux
• A large diversity of languages: Python, Java, C++
What's Inside?
The Stats
• Estimated 450,000 low-cost commodity servers in 2006
• In 2005 Google indexed 8 billion web pages. By now, who knows?
• Currently there are over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
• Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
• BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.
The Stack
Google visualizes their infrastructure as a three layer stack:
• Products: search, advertising, email, maps, video, chat, blogger
• Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
• Computing Platforms: a bunch of machines in a bunch of different data centers
• Make sure it is easy for folks in the company to deploy at a low cost.
• Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.
Reliable Storage Mechanism with GFS (Google File System)
• Reliable scalable storage is a core need of any application. GFS is their core storage platform.
• Google File System - large distributed log structured file system in which they throw in a lot of data.
• Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required:
- high reliability across data centers
- scalability to thousands of network nodes
- huge read/write bandwidth requirements
- support for large blocks of data which are gigabytes in size.
- efficient distribution of operations across nodes to reduce bottlenecks
• System has master and chunk servers.
- Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the data they need on disk.
- Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
• A new application coming on line can use an existing GFS cluster or make its own. It would be interesting to understand the provisioning process they use across their data centers.
• Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.
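The master/chunk server split described above can be sketched as a toy metadata service. This is an illustration only, not GFS: the `Master` class, the round-robin placement policy, and the server names are assumptions. Only the 64MB chunk size and the three replicas per chunk come from the text.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as described above
REPLICAS = 3                   # each chunk lives on three chunk servers


class Master:
    """Toy master: holds only metadata, never file contents."""

    def __init__(self, chunk_servers):
        if len(chunk_servers) < REPLICAS:
            raise ValueError("need at least as many chunk servers as replicas")
        self.chunk_servers = list(chunk_servers)
        self.files = {}      # filename -> list of chunk ids
        self.locations = {}  # chunk id -> list of chunk server names
        self._next = 0       # round-robin cursor (hypothetical placement policy)

    def create(self, filename, size_bytes):
        """Record the chunks of a new file and where each replica lives."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        chunk_ids = []
        for index in range(n_chunks):
            chunk_id = f"{filename}#{index}"
            servers = [
                self.chunk_servers[(self._next + r) % len(self.chunk_servers)]
                for r in range(REPLICAS)
            ]
            self._next += 1
            self.locations[chunk_id] = servers
            chunk_ids.append(chunk_id)
        self.files[filename] = chunk_ids

    def locate(self, filename, offset):
        """Metadata lookup: which chunk and which servers hold this offset?"""
        chunk_id = self.files[filename][offset // CHUNK_SIZE]
        return chunk_id, self.locations[chunk_id]
```

A client would call `locate()` once and then read the bytes directly from one of the returned chunk servers, which keeps bulk data traffic off the master.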
Do Something With the Data Using MapReduce
• Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across 1000 machines. Databases don't scale, or don't scale cost effectively, to those levels. That's where MapReduce comes in.
• MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
• Why use MapReduce?
- Nice way to partition tasks across lots of machines.
- Handle machine failure.
- Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc.
- Computation can automatically move closer to the IO source.
• The MapReduce system has three different types of servers.
- The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
- The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them.
• For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
- The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
- In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count.
- Shuffling aggregates key types.
- The reduction sums up all the key value pairs and produces the final answer.
• The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data with a whole bunch of records and aggregates keys. A second map-reduce comes along, takes that result and does something else. And so on.
• Programs can be very small. As little as 20 to 50 lines of code.
• One problem is stragglers. A straggler is a computation that is going slower than others which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple of the same computations and when one is done kill all the rest.
• Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
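The word count example above (Map -> Shuffle -> Reduce) can be written as a single-process sketch. It shows only the programming model. The real system distributes each phase across many machines, reads from and writes to GFS, and handles failures, none of which is modeled here. The function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: merge the values for one key into a final count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over an iterable of documents."""
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because `map_phase` looks at one document at a time and `reduce_phase` looks at one key at a time, both can be run in parallel without the user code changing.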
Storing Structured Data in BigTable
• BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
• BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
• It provides a lookup mechanism to access structured data by key. GFS stores opaque data, and many applications need data with structure.
• Commercial databases simply don't scale to this level and they don't work across 1000s of machines.
• By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
• Machines can be added and deleted while the system is running and the whole system just works.
• Each data item is stored in a cell which is addressed by a row key, a column key, and a timestamp.
• Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
• BigTable has three different types of servers:
- The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
- The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers.
- The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
• A locality group can be used to physically store related bits of data together for better locality of reference.
• Tablets are cached in RAM as much as possible.
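The cell addressing described above (row key, column key, timestamp) can be sketched as an in-memory map. `ToyTable` is a hypothetical name, and the read rule used here (return the newest version at or before the requested timestamp) is an assumption made for illustration. Tablets, SSTables, and distribution are not modeled.

```python
import bisect


class ToyTable:
    """Sparse map: (row key, column key, timestamp) -> value."""

    def __init__(self):
        self.cells = {}  # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        """Store one version of a cell, keeping versions sorted by timestamp."""
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.cells.get((row, column), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None
```

There are no joins and no query language here, only lookups by key, which matches the description of BigTable as a lookup mechanism rather than a relational database.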
Hardware
• When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
• Use ultra cheap commodity hardware and build software on top to handle their death.
• A 1,000-fold computer power increase can be had for a 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
• Linux, in-house rack design, PC class mother boards, low end storage.
• Price per wattage on performance basis isn't getting better. Have huge power and cooling issues.
• Use a mix of colocation and their own data centers.
Misc
• Push changes out quickly rather than wait for QA.
• Libraries are the predominant way of building programs.
• Some applications are provided as services, like crawling.
• An infrastructure handles versioning of applications so they can be released without a fear of breaking things.
Future Directions for Google
• Support geo-distributed clusters.
• Create a single global namespace for all data. Currently data is segregated by cluster.
• More and better automated migration of data and computation.
• Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).
Lessons Learned
• Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.
• Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.
• Take a look at Hadoop (product) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.
• An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.
• Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.
• Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.
• Create a Darwinian infrastructure. Perform time consuming operations in parallel and take the winner.
• Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.
• Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.
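The "Darwinian infrastructure" lesson, which is also the fix for MapReduce stragglers described earlier, can be sketched with Python's standard `concurrent.futures` module. This is a single-machine illustration under stated assumptions: `run_with_backups` and `sum_squares` are hypothetical names, and the `delays` argument simulates fast and slow machines. Python threads cannot be killed, so the losing copies are abandoned rather than terminated.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_backups(task, delays):
    """Run one copy of `task` per entry in `delays`; return the first result.

    `delays` simulates machine speed: a large delay models a straggler.
    """
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(task, delay) for delay in delays]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        future.cancel()  # only stops copies that have not started yet
    pool.shutdown(wait=False)  # do not block on the stragglers
    return next(iter(done)).result()


def sum_squares(delay):
    """The same computation on every copy; only the speed differs."""
    time.sleep(delay)  # stand-in for slow IO or a busy CPU
    return sum(i * i for i in range(10))
```

The trade-off is extra work for lower latency: every copy consumes resources, but the caller waits only for the fastest one.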
 Digg Architecture

Tue, 08/07/2007 - 01:28 — Todd Hoff
Traffic generated by Digg's over 1.2 million famously info-hungry users can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle all this load?
Site: http://digg.com
Information Sources
• How Digg.com uses the LAMP stack to scale upward
• Digg PHP's Scalability and Performance
Platform
• MySQL
• Linux
• PHP
• Lucene
• APC PHP Accelerator
• MCache
The Stats
• Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine.
• Over 1.2 million users.
• Over 200 million page views per month
• 100 servers hosted in multiple data centers.
- 20 database servers
- 30 Web servers
- A few search servers running Lucene.
- The rest are used for redundancy.
• 30GB of data.
• None of the scaling challenges they faced had anything to do with PHP. The biggest issues faced were database related.
• The lightweight nature of PHP allowed them to move processing tasks from the database to PHP in order to improve scaling. eBay does this in a radical way. They moved nearly all work out of the database and into applications, including joins, an operation we normally think of as the job of the database.
What's Inside
• Load balancer in the front that sends queries to PHP servers.
• Uses a MySQL master-slave setup.
- Transaction-heavy servers use the InnoDB storage engine.
- OLAP-heavy servers use the MyISAM storage engine.
- They did not notice a performance degradation moving from MySQL 4.1 to version 5.
• Memcached is used for caching.
• Sharding is used to break the database into several smaller ones.
• Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
• They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity hole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup.
• To lighten their database load they used the APC PHP accelerator and MCache.
• You can configure PHP not to parse and compile on each load using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiled so any subsequent page loads are very fast.
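The master-slave setup and the read-heavy workload described above suggest routing writes to the master and spreading reads over the slaves. The sketch below is an assumption about how such routing could look, not Digg's implementation. `ReplicatedPool`, the verb list, and the server names are hypothetical, and it ignores replication lag, transactions, and statements that do not start with a plain SQL verb.

```python
import itertools


class ReplicatedPool:
    """Route writes to the master and spread reads over the slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin

    def route(self, sql):
        """Return the server that should execute this SQL statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._slaves)
```

With 98% of accesses being reads, almost all traffic lands on the slaves, so adding slaves adds capacity. Writes still funnel through one master.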
Lessons Learned
• Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, transactional tables on the master can use MyISAM for read-only slaves.
• At some point in their growth curve they were unable to grow by adding RAM so had to grow through architecture.
• People often complain Digg is slow. This is perhaps due to their large javascript libraries rather than their backend architecture.
• One way they scale is by being careful of which application they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something Flickr emphasizes in their scaling process.
• You have to wonder if by limiting new features to match their infrastructure might Digg lose ground to other faster moving social bookmarking services? Perhaps if the infrastructure was more easily scaled they could add features faster which would help them compete better? On the other hand, just adding features because you can doesn't make a lot of sense either.
• The data layer is where most scaling and performance problems are to be found and these are not language specific. You'll hit them using Java, PHP, Ruby, or insert your favorite language here.
An Unorthodox Approach to Database Design : The Coming of the Shard

Tue, 07/31/2007 - 18:13 — Todd Hoff
Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In an extraordinary two-article series, Dathan Pattishall explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less than a few seconds and can scale linearly at a low cost.
What is sharding and how has it come to be the answer to large website scaling problems?
Information Sources
• Unorthodox approach to database design Part 1: History
• Unorthodox approach to database design Part 2: Friendster
What is sharding?
While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards.
The advantages are:
• High availability. If one box goes down the others still operate.
• Faster queries. Smaller amounts of data in each user group mean faster querying.
• More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is a major bottleneck for many websites.
• You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.
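The idea of storing groups of 500K users per shard can be sketched as a lookup function. The text does not say how users are assigned to shards, so the range-based assignment below is an assumption (a lookup table would be another option), and the host names in `SHARD_HOSTS` are hypothetical.

```python
USERS_PER_SHARD = 500_000  # group size mentioned above

# Hypothetical shard hosts; a real deployment would load these from config.
SHARD_HOSTS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for_user(user_id):
    """Map a positive integer user id to the host that stores that user."""
    if user_id < 1:
        raise ValueError("user ids start at 1")
    shard_index = (user_id - 1) // USERS_PER_SHARD
    if shard_index >= len(SHARD_HOSTS):
        raise LookupError(f"no shard provisioned for user {user_id}")
    return SHARD_HOSTS[shard_index]
```

Every query for a given user goes to exactly one host, so reads and writes for different user groups proceed in parallel. Scaling out means appending another host to the list.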
 How is sharding different than traditional architectures?
Sharding is different than traditional database architecture in several important ways:
• Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.
This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write.
• Data are parallelized across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.
• Data are kept small. The larger a set of data a server handles the harder it is to cache intelligently because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it. By isolating data into smaller shards the data you are accessing is more likely to stay in cache.
Smaller sets of data are also easier to backup, restore, and manage.
• Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and makes the data more parallelized so more work can be done on the data. You can also set up a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.
• It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.
Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.
Sharding cleanly and elegantly solves the problems with replication.
Some Problems With Sharding
Sharding isn't perfect. It does have a few problems.
• Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard.
On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.
Rebalancing has to be built in from the start. Google's shards automatically rebalance. For this to work data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidateable so the underlying data can be moved while you are using it.
• Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and then build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent.
• How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answers to these questions.
• Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won't have a shard view and you won't find any automated backup and restore programs for your shard. With sharding you are on your own.
• Implementing shards is not well supported. Sharding is currently mostly a roll-your-own approach. LiveJournal makes their tool chain available. Hibernate has a library under development. MySQL has added support for partitioning. But in general it's still something you must implement yourself.
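The "joining data from multiple shards" problem in the list above boils down to sending one request per data source and merging the answers in the application. A minimal sketch of that pattern follows; the in-memory `SHARDS` dictionary and the `fetch` function are hypothetical stand-ins for real network calls to separate database servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for three independent shards.
SHARDS = {
    "profiles": {42: {"name": "alice"}},
    "comments": {42: ["nice photo", "hello"]},
    "friends": {42: [7, 9]},
}


def fetch(shard_name: str, user_id: int):
    """Stand-in for one network request to one shard."""
    return shard_name, SHARDS[shard_name].get(user_id)


def build_page(user_id: int) -> dict:
    """Issue one request per shard in parallel, then merge the responses."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        futures = [pool.submit(fetch, name, user_id) for name in SHARDS]
        return dict(future.result() for future in futures)
```

Because the requests run concurrently, the page takes roughly as long as the slowest shard rather than the sum of all of them.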
Comments
Tue, 07/31/2007 - 22:12 — Tim (not verified)
Great post !
This is probably the most interesting post I've read in a long long time. Thanks for sharing the advantages and drawbacks of sharding…. and thanks for putting together all these resources/info about scaling… it's really really interesting.
Wed, 08/01/2007 - 20:22 — Vinit (not verified)
Thanks for this info.
Thanks for this info. Helped me understand about what the heck to do with all this user data coming my way!!!
Wed, 08/01/2007 - 23:17 — Ryan T Mulligan (not verified)
Intranet?
I dislike how your link of livejournal does not actually go to a livejournal website, or information about their toolchain.
Thu, 08/02/2007 - 01:48 — Todd Hoff

re: intranet
I am not sure what you mean about live journal. It goes to a page on this site which references two danga.com sites. Oh I see, memcached goes to a category link which doesn't include memcached. The hover text does include the link, but I'll add it in. Good catch. Thanks.
Thu, 08/02/2007 - 15:10 — tim wee (not verified)
Question about a statement in the post
"Sharding cleanly and elegantly solves the problems with replication."
Is this true? You do need to replicate still right? You need duplication and a copy of the data that is not too stale in case one of your shards go down? So you still need to replicate correct?
Thu, 08/02/2007 - 15:21 — Todd Hoff

sharding and replication
> Is this true? You do need to replicate still right?
You won't have the problems with replication overhead and lag because you are writing to an appropriately sized shard rather than a single master that must replicate to all its slaves, assuming you have a lot of slaves.
You will still replicate within the shard, but that will be more of a fixed reasonable cost because the number of slaves will be small.
Google replicates content 3 times, so even in that case it's more of a fixed overhead than chaining a lot of slaves together.
That's my understanding at least.
Sat, 08/04/2007 - 17:23 — Kevin Burton (not verified)
We might OSS our sharding framework
We've been building out a sharding framework for use with Spinn3r. It's about to be deployed into production this week.
We're very happy with the way things have moved forward and will probably be OSSing it.
We aggregate a LOT of data on behalf of our customers so have huge data storage requirements.
Kevin
Sat, 08/04/2007 - 21:53 — Anonymous (not verified)
Lookup table
Do you still need a master lookup table? How do you know which shard has the data you need to access?
Sat, 08/04/2007 - 23:01 — Todd Hoff

re: Lookup table
I think a lookup table is the most flexible option. It allows for flexible shard assignment algorithms and you can change the mapping when you need to. Here are a few other ideas. I am sure there are more.
Flickr talks about a "global ring" that is like DNS that maps a key to a shard ID. The map is kept in memcached for a 1/2 hour or so. I bought Cal Henderson's book and it should arrive soon. If he has more details I'll do a quick write up.
Users could be assigned to shards initially on a round robin basis or through some other whizzbang algorithm.
You could embed shard IDs into your assigned row IDs so the mapping is obvious from the ID.
You could hash on a key to a shard bucket.
You could partition based on key values. So users with names starting with A-C go to shard1, that sort of thing. MySQL has a number of different partition rules as examples.
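The options listed in this reply can be sketched side by side. Everything below is illustrative: the shard count, the bit width used to embed a shard ID, and the letter ranges are assumptions chosen for the example, not values used by any of the sites mentioned.

```python
import hashlib
import itertools

NUM_SHARDS = 4  # assumed shard count

# 1. Lookup table: explicit mapping that can be changed when a user moves.
lookup_table = {}
_round_robin = itertools.cycle(range(NUM_SHARDS))


def assign_round_robin(user_id: int) -> int:
    """Assign a new user to the next shard in turn and remember the choice."""
    if user_id not in lookup_table:
        lookup_table[user_id] = next(_round_robin)
    return lookup_table[user_id]


def move_user(user_id: int, new_shard: int) -> None:
    """Rebalancing only needs the table entry to change."""
    lookup_table[user_id] = new_shard


# 2. Shard ID embedded in the row ID (low bits hold the shard).
SHARD_BITS = 8  # hypothetical width, allows up to 256 shards


def make_row_id(shard: int, local_id: int) -> int:
    return (local_id << SHARD_BITS) | shard


def shard_from_row_id(row_id: int) -> int:
    return row_id & ((1 << SHARD_BITS) - 1)


# 3. Hash on a key. A stable digest is used because Python's built-in
#    hash() for strings differs between processes.
def shard_by_hash(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


# 4. Partition on key values, for example the first letter of the name.
NAME_RANGES = [("A", "C", 1), ("D", "M", 2), ("N", "Z", 3)]  # hypothetical split


def shard_by_name(name: str) -> int:
    first = name[:1].upper()
    for low, high, shard in NAME_RANGES:
        if low <= first <= high:
            return shard
    return 0  # catch-all shard for names starting with anything else
```

Only the lookup table lets a single user be relocated without touching anyone else; the hash and range schemes move many keys at once whenever their parameters change.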
Mon, 08/06/2007 - 02:13 — Frank (not verified)
Thank you
It's helpful to know how the big players handle their scaling issues.
Thanks for sharing!
Thu, 08/09/2007 - 12:43 — Anonymous (not verified)
Sharding
Attacking sharding from the application layer is simply wrong. This functionality should be left to the DBMS. Access from the app layer would be transparent and it would be up to the DB admin to configure the data servers correctly for sharding to automatically and transparently scale out across them.
If you are implementing sharding from the app layer you are getting yourself in a very tight corner and one day will find out how stuck you are there. This is the cornerstone of improper delegation of the functionalities in a multi-tier system.
Mon, 08/13/2007 - 14:44 — Diogin (not verified)
>How do you partition your
>How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answer to these questions.
I have exactly the question to ask..
I've referred to the architectures of LiveJournal and Mixi, both of which introduce shards.
However, I saw a "Global Cluster" which stores meta information for other clusters. By doing this we get an extreme heavy cluster, it must handle all the cluster_id <-> user_id metas and loses the advantage of sharding… Is that right?
The other way, partition by algorithms on keys, is difficult in transition.
So, could you give me some advice? Thank you very much for sharing experiences :)
Mon, 08/13/2007 - 15:36 — Todd Hoff

> By doing this we get an
> By doing this we get an extreme heavy cluster, it must handle
> all the cluster_id <-> user_id metas
I think the idea is that because these mappings are so small they can all be cached in RAM and thus their resolution can be very very fast. Do you think that would be too slow?
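To make the caching idea concrete, here is a minimal sketch of a shard directory that keeps user-to-shard mappings in RAM and only falls back to the global cluster on a miss. The half-hour lifetime echoes the figure mentioned earlier in the thread for Flickr's memcached entries; the class name, the plain dictionary standing in for memcached, and the `load_mapping` callback are all assumptions made for the example.

```python
import time

TTL_SECONDS = 30 * 60  # "a 1/2 hour or so"


class ShardDirectory:
    """Caches user_id -> shard_id lookups in RAM with an expiry time."""

    def __init__(self, load_mapping, ttl=TTL_SECONDS, clock=time.monotonic):
        self._load = load_mapping  # slow path: query the global cluster
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # user_id -> (shard_id, expires_at)

    def shard_for(self, user_id):
        now = self._clock()
        entry = self._cache.get(user_id)
        if entry is not None and entry[1] > now:
            return entry[0]  # served from RAM, global cluster not touched
        shard_id = self._load(user_id)
        self._cache[user_id] = (shard_id, now + self._ttl)
        return shard_id

    def invalidate(self, user_id):
        """Drop a cached entry, for example after moving a user to a new shard."""
        self._cache.pop(user_id, None)
```

The global cluster then only sees one lookup per user per expiry period, plus whatever invalidations rebalancing causes.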
Mon, 08/13/2007 - 17:11 — Diogin (not verified)
Yeah, you reminded me! I
Yeah, you reminded me! I have a little doubts on this before, when I think the table would be terribly huge as all the table records on other clusters are all gathered in this table, and all queries should first refer to this cluster.
Maybe I can use memcached clusters to cache them. Thank you :)
Tue, 08/14/2007 - 17:33 — Arnon Rotem-Gal-Oz (not verified)
Partitioning is the new standard
If you look at the architectures of all the major internet-scale sites (such as eBay, Amazon, Flickr, etc.) you'd see they've come to the same conclusion and the same patterns.
I've also published an article on InfoQ discussing this topic yesterday (I only found this site now or I would have included it in the article)
Arnon
Wed, 08/22/2007 - 21:24 — Todd Hoff

More Good Info on Partitioning
Jeremy Cole and Eric Bergen have an excellent section on database partitioning starting on about page 14 of MySQL Scaling and High Availability Architectures. They talk about different partitioning models, difficulties with partitioning, HiveDB, Hibernate Shards, and server architectures.
I'll definitely do a write up on this document a little later, but if interested dive in now.
Fri, 09/14/2007 - 11:20 — Norman Harebottle (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of th
I agree with the above post questioning the placement of partitioning logic in the application layer. Why not write the application layer against a logical model (NOT a storage model!) and then just engineer the existing data storage abstraction mechanism (DBMS engine) such that it will handle the partitioning functionality in a parallel manner? I would be very interested to see a study done comparing the architectures of this sharding concept against a federated database design such as what is described on this site http://www.sql-server-performance.com/tips/federated_databases_p1.aspx
Fri, 09/14/2007 - 15:22 — Todd Hoff

Re: is the logical model the correct place for partitioning?
> I would be very interested to see a study done comparing the
> architectures of this sharding concept against a federated database design
Most of us don't have access to a real affordable federated database (parallel queries), so it's somewhat a moot point :-) And even these haven't been able to scale at the highest levels anyway.
The advantage of partitioning in the application layer is that you are not bottlenecked on the traffic cop server that must analyze SQL and redistribute work to the proper federations and then integrate the results. You go directly to where the work needs to be done.
I understand the architectural drive to push this responsibility to a logical layer, but it's hard to ignore the advantages of using client side resources to go directly to the right shard based on local context. What is the cost? A call to library code that must be kept in sync with the partitioning scheme. Is that really all that bad?
Thu, 09/20/2007 - 14:10 — Anonymous (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of th
Sorry, this isn't new, except maybe to younger programmers. I've been using this approach for many years. Federated databases with horizontally partitioned data are old news. It's just not taught as a standard technique for scaling, mostly because scaling isn't taught as a core subject. (Why is that, do you suppose? Too hard to cover? Practical examples not practical in an academic setting?)
The reason this is getting attention now is a perfect storm of cheap hardware, nearly free network connectivity, free database software, and beaucoup users (testers?) over the 'net.
Sat, 09/22/2007 - 13:53 — Marcus (not verified)
Useful, interesting, but not new
Great content and great site… except for the worshipful adoration of these young teams who seem to think they've each discovered America.
I could go on at length about Flickr's performance troubles and extremely slow changes after their early success, Twitter's meltdowns and indefensible defense of the slowest language in wide use on the net, and old-school examples of horizontal partitioning (AKA "sharding" heh), but I'll spare you. This cotton candy sugary sweet lovin' of Web 2.0 darlings really is a bit tiresome.
Big kudos though on deep coverage of the subject matter between the cultic chants. :-D
Sat, 09/22/2007 - 18:39 — Todd Hoff

Re: Useful, interesting, but not new
> Big kudos though on deep coverage of the subject matter between the cultic chants. :-D
I am actually more into root music. But thanks, I think :-)
Fri, 09/28/2007 - 06:11 — Sean Bannister (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of th
Good article, very interesting to read.
Sat, 09/29/2007 - 06:46 — Ed (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of th
Fanball.com has been using this technique for its football commissioner product for years. As someone else commented, it used to be called horizontal partitioning back then. Does it cost less if it's called sharding? :-)
Mon, 10/08/2007 - 03:02 — Anonymous (not verified)
Re: An Unorthodox Approach to Database Design
Why is this even news, we did something similar in my old job. Split up different clients among different server stacks. Move along nothing to see…
Sun, 10/14/2007 - 11:29 — Anonymous (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of th
Could it be the fact that people are pulling this off with mysql and berkeleydb that is making horizontal partitioning interesting? When you compare two solutions, one using an open source database and one using a closed source database, is one solution more inherently scalable? Well, all things being equal performance-wise, it's nice to not have to do a purchase order for the closed source software, so I would say that is why this is getting all the 'hype'. Old school oracle/mssqlserver patronizing DBAs are getting schooled by non-dbas who are setting up the *highest* data throughput architectures and not using sql server or oracle. That is why this is getting high visibility.
Believe it or not, many people still say mysql/berkeleydb is a toy outside of some of the major tech hubs. Stories like this are what make people, especially dbas, listen. The only recourse is 'I have done that before with xxx database'. Well you should be the one that suggests doing it with the open source 'toy' database then, if you are so good.
In my experience there are many old-school DBAs that are in denial that this kind of architecture is capable of out performing their *multi-million dollar oracle software purchase decisions* and they don't want to admit it.
Wed, 10/24/2007 - 13:12 — Harel Malka (not verified)
Re: An Unorthodox Approach to Database Design : The Coming of th
What I'm most interested in relating to Shards is people's thoughts and experience in migrating TO a shard approach from a single database, and moving (large amounts of) data around from shard to shard. In particular - strategies to maintain referential integrity as we're moving data by a user.
As well, should you need to query data joining user A and user B which both reside on different shards - what approaches people see as fit?
Harel
 LiveJournal Architecture

Mon, 07/09/2007 - 16:57 — Todd Hoff
A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious which causes a lot of writes and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder.
Site: http://www.livejournal.com/
Information Sources
• LiveJournal - Behind The Scenes Scaling Storytime
• Google Video
• Tokyo Video
• 2005 version
Platform
• Linux
• MySQL
• Perl
• Memcached
• MogileFS
• Apache
What's Inside?
• Scaling from 1, 2, and 4 hosts to cluster of servers.
• Avoid single points of failure.
• Using MySQL replication only takes you so far.
• Becoming IO bound kills scaling.
• Spread out writes and reads for more parallelism.
• You can't keep adding read slaves and scale.
• Shard storage approach, using DRBD, for maximal throughput. Allocate shards based on roles.
• Caching to improve performance with memcached. Two-level hashing to distributed RAM.
• Perlbal for web load balancing.
• MogileFS, a distributed file system, for parallelism.
• TheSchwartz and Gearman for distributed job queuing to do more work in parallel.
• Solving persistent connection problems.
Lessons Learned
• Don't be afraid to write your own software to solve your own problems. LiveJournal has provided incredible value to the community through their efforts.
• Sites can evolve from small 1, 2 machine setups to larger systems as they learn about their users and what their system really needs to do.
• Parallelization is key to scaling. Remove choke points by caching, load balancing, sharding, clustering file systems, and making use of more disk spindles.
• Replication has a cost. You can't just keep adding more and more read slaves and expect to scale.
• Low level issues like which OS event notification mechanism to use, file system and disk interactions, threading and event models, and connection types, matter at scale.
• Large sites eventually turn to a distributed queuing and scheduling mechanism to distribute large work loads across a grid.
 GoogleTalk Architecture

Mon, 07/23/2007 - 22:47 — Todd Hoff
Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small low latency messages and integrating with many other systems. How do they do it?
Site: http://www.google.com/talk
Information Sources
• GoogleTalk Architecture
Platform
• Linux
• Java
• Google Stack
• Shard
What's Inside?
The Stats
• Support presence and messages for millions of users.
• Handles billions of packets per day in under 100ms.
• IM is different than many other applications because the requests are small packets.
• Routing and application logic are applied per packet for sender and receiver.
• Messages must be delivered in-order.
• Architecture extends to new clients and Google services.
Lessons Learned
• Measure the right thing.
- People ask how many IMs you deliver or how many active users you have. That turns out not to be the right engineering question.
- The hard part of IM is showing correct presence to all connected users, because growth is non-linear: ConnectedUsers * BuddyListSize * OnlineStateChanges
- Linear user growth can mean very non-linear server growth, which requires serving many billions of presence packets per day.
- Have a large number of friends and presence explodes. The number of IMs is not that big of a deal.
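A small worked example of the presence formula above shows why the growth is non-linear. The input numbers are purely hypothetical and are not Google Talk statistics.

```python
def presence_packets_per_day(connected_users: int,
                             buddy_list_size: int,
                             state_changes_per_day: int) -> int:
    """ConnectedUsers * BuddyListSize * OnlineStateChanges."""
    return connected_users * buddy_list_size * state_changes_per_day


# Hypothetical baseline: 1M connected users, 20 buddies each,
# 10 online/offline transitions per user per day.
baseline = presence_packets_per_day(1_000_000, 20, 10)

# Users double, and as the network fills in, buddy lists double too.
grown = presence_packets_per_day(2_000_000, 40, 10)
```

In this made-up scenario the user count doubles while the presence traffic quadruples, from 200 million to 800 million packets per day.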
• Real Life Load Tests
- Lab tests are good, but don't tell you enough.
- Did a backend launch before the real product launch.
- Simulate presence requests and going on-line and off-line for weeks and months, even if real data is not returned. It works out many of the kinks in network, failover, etc.
• Dynamic Resharding
- Divide user data or load across shards.
- Google Talk backend servers handle traffic for a subset of users.
- Make it easy to change the number of shards with zero downtime.
- Don't shard across data centers. Try and keep users local.
- Servers can be brought down and backups take over. Then you can bring up new servers; data is migrated automatically, and clients auto-detect and go to the new servers.
• Add Abstractions to Hide System Complexity
- Different systems should have little knowledge of each other, especially when separate groups are working together.
- Gmail and Orkut don't know about sharding, load-balancing, fail-over, data center architecture, or the number of servers. These can change at any time without cascading changes throughout the system.
- Abstract these complexities into a set of gateways that are discovered at runtime.
- RPC infrastructure should handle rerouting.
• Understand Semantics of Lower Level Libraries
- Everything is abstracted, but you must still have enough knowledge of how they work to architect your system.
- Does your RPC create TCP connections to all or some of your servers? Very different implications.
- Does the library perform health checking? This has architectural implications, as you can have separate systems failing independently.
- Which kernel operation should you use? IM requires a lot of connections but few have any activity. Use epoll vs poll/select.
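The "many connections, few active" point can be sketched with Python's standard `selectors` module, whose `DefaultSelector` picks epoll on Linux when it is available. This is only an illustration of the readiness-notification pattern; the helper names are made up for the example and nothing here reflects how Google Talk is actually built.

```python
import selectors
import socket

# DefaultSelector picks the most efficient mechanism the platform offers
# (epoll on Linux, kqueue on BSD/macOS, otherwise poll or select).
selector = selectors.DefaultSelector()


def add_connection(sock: socket.socket) -> None:
    """Register a connection once; it stays registered while idle."""
    sock.setblocking(False)
    selector.register(sock, selectors.EVENT_READ)


def read_active(timeout: float = 1.0) -> list:
    """Return (socket, data) pairs for connections that have data waiting."""
    results = []
    for key, _mask in selector.select(timeout):
        results.append((key.fileobj, key.fileobj.recv(4096)))
    return results
```

Each connection is registered once and `read_active` returns only the ready ones, so the application loop never iterates over idle sockets.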
• Protect Against Operational Problems
- Smooth out all spikes in server activity graphs.
- What happens when servers restart with an empty cache?
- What happens if traffic shifts to a new data center?
- Limit cascading problems. Back off from busy servers. Don't accept work when sick.
- Isolate in emergencies. Don't infect others with your problems.
- Have intelligent retry logic policies abstracted away. Don't sit in hard 1msec retry loops, for example.
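One common way to avoid hard retry loops and to back off from busy servers is exponential backoff with random jitter. The sketch below is a generic version of that technique, not Google Talk's retry policy; the function name, the default delays, and the choice of `ConnectionError` as the retryable failure are assumptions for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.05,
                       max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on ConnectionError wait longer before each retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # The delay cap doubles with every failed attempt, up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: sleep a random fraction of the cap so that many
            # clients do not all retry at the same instant.
            sleep(rng() * cap)
```

Keeping the policy in one helper like this is one way to have retry logic "abstracted away" instead of scattered through the callers.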
• Any Scalable System is a Distributed System
- Add fault tolerance to every component of the system. Everything fails.
- Add ability to profile live servers without impacting server. Allows continual improvement.
- Collect metrics from server for monitoring. Log everything about your system so you see patterns in cause and effects.
- Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.
• Software Development Strategies
- Make sure binaries are both backward and forward compatible so you can have old clients work with new code.
- Build an experimentation framework to try new features.
- Give engineers access to production machines. This gives end-to-end ownership. It is very different from many companies that have completely separate ops teams in their data centers, where developers often can't touch production machines.
