Web Performance (email from James French)
  • Marcin,

    This is what I know. As I know you are short on time lately, I will
    leave a summary at the bottom. The body of this letter is largely
    technical; you may find it interesting, however, even if not useful.

    The key to good web performance is distributability and locality.
    Moving large amounts of traffic means we need to be able to
    "scale-out" (i.e. multiple servers) rather than just being able to
    scale-up (i.e. improving one single monolithic server) at every level
    from front-end servers to back-end servers to the database.

    Front-end servers are generally performance-optimized, feature-poor
    HTTP servers such as NginX, Squid, or Traffic Server. It is ideal for
    these servers, particularly for truly static (never-changing) content,
    to be located as close to the end user in the network topology as is
    economically and technically feasible. The business model of a CDN is
    focused on making a high level of locality tenable through economies
    of scale. MaxCDN (maxcdn.com) is one of the lowest barrier-to-entry
    CDNs that I am aware of, offering 1 TB of transfer for around $40.
    Once we are moving more traffic, we can solicit quotes from larger,
    better-established CDN providers.

    Back-end servers are almost always the Apache HTTP Server, which will
    also function as a front-end server in low-tech setups. The back-end
    server is responsible for holding and serving files to the front-end
    server, in addition to running application code such as PHP. There are
    many tricks to improve site performance at this level, and most of
    them revolve around copying as much content as possible from the hard
    disk (which is slow) into RAM (which is fast). APC is a module that does
    this for PHP, for example. The most important thing in my opinion for
    distributability at the back-end level is eliminating the need for
    session-dependent code.
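
    To illustrate (with made-up names; build_sidebar() is just a stand-in
    for any expensive disk or database work), APC's user cache looks
    roughly like this:

      <?php
      // Rough sketch of APC's user cache: keep an expensive result in
      // shared memory (RAM) so repeat requests skip the disk entirely.
      // (APC also caches compiled PHP opcodes without any code changes.)
      $key  = 'homepage_sidebar';              // made-up cache key
      $html = apc_fetch($key, $success);
      if (!$success) {
          $html = build_sidebar();             // made-up expensive function
          apc_store($key, $html, 300);         // keep it in RAM for 5 minutes
      }
      echo $html;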

    Database servers in the open-source world will generally be either
    MySQL or PostgreSQL (in the relational model) or CouchDB (in the
    document-oriented model). As most of the apps we are currently running
    require a relational database, CouchDB should only be considered for
    in-house application development. Most people are familiar with the
    near-ubiquitous MySQL. For latency reasons, it is almost a necessity
    for the database servers to be as close to the back-end servers as
    possible, and this is probably why 3rd party database services like
    Amazon SimpleDB haven't quite taken off yet.

    Let's talk about caching. In HTTP, every document returned may provide
    "Last-Modified" and "Expires" values (headers). As its name would
    imply, "Last-Modified" refers to the time a given document was last
    modified. "Expires" is a guarantee that the document won't change
    until at least the date given in the "Expires" header has passed. It
    is relatively common to give an "Expires" time in the near future,
    even when we cannot know exactly when a document will next be modified
    (because of community editing, for example). The back-end server will
    ultimately decide these values.
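
    As a rough sketch of what the back-end does (one hour is an arbitrary
    example lifetime, and page.html a stand-in for whatever we serve):

      <?php
      // The back-end decides the caching headers: when the document last
      // changed, and how long others may treat their stored copy as fresh.
      $lastModified = filemtime('page.html');   // stand-in content source
      header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
      header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');
      readfile('page.html');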

    The front-end server, and every intermediate server and end user
    outside of our purview, will use these values in two ways. If the
    expiration date has not passed yet, any intermediary may simply serve
    a previously received and stored version of the document without
    needing to request it from the back-end. (Huge gains in efficiency!)
    Even if the expiration time has passed, however, the end user, or its
    intermediaries, may make an "If-Modified-Since" request, receive back
    a response of "Not Modified", and reuse the previously received, but
    expired, version (a smaller, but still significant, gain in
    efficiency). This methodology describes caching at every level from
    the back-end to the end-user.
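
    In PHP, answering such a revalidation request looks roughly like this
    (again with page.html as a placeholder for the real document):

      <?php
      // If the client's stored copy is still current, answer
      // "304 Not Modified" with no body; otherwise send the document again.
      $lastModified = filemtime('page.html');   // placeholder document
      $since = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
          ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])
          : false;
      if ($since !== false && $since >= $lastModified) {
          header('HTTP/1.1 304 Not Modified');  // client reuses its stored copy
          exit;
      }
      header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
      readfile('page.html');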

    Between the back-end server and the database, the bread and butter of
    caching is memcache. The caching model here is different, and rather
    than being based on expiration and re-validation, it is based on
    invalidation. As the back-end server is the only thing that writes to
    the database, the locally cached entry can be overwritten with the new
    value before being handed to the much larger and more reliable
    database engine. All locally cached entries therefore may be assumed
    to be current and valid.
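
    A rough sketch of that pattern, using the PHP Memcached extension
    (load_article() and save_article() are placeholders for the real
    database queries):

      <?php
      // Because the back-end is the only writer, we overwrite the cached
      // copy on every write and treat every cache hit as current and valid.
      $cache = new Memcached();
      $cache->addServer('127.0.0.1', 11211);    // placeholder address

      function get_article($cache, $id) {
          $row = $cache->get("article:$id");
          if ($row === false) {                 // cache miss: ask the database
              $row = load_article($id);         // placeholder SELECT
              $cache->set("article:$id", $row);
          }
          return $row;
      }

      function update_article($cache, $id, $row) {
          $cache->set("article:$id", $row);     // overwrite the cached entry...
          save_article($id, $row);              // ...then hand it to the database
      }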

    Summary
    =======
    +Write code and develop infrastructure that avoids unnecessary
    round-trip communications, particularly between distant links.
    +Ensure that the end-user and intermediaries outside of our control
    can utilize the caching information returned by our applications.
    +Use superior 3rd party resources where it makes economic sense,
    particularly at the front-end layer. (CDN)

    Before we can discuss further where our strategy will lie, I need
    information about the applications we are currently using, the servers
    that currently run them, and the various software we will need or will
    be useful in the future. I have a lot of experience writing
    "internal-use" web applications for my employer, and I would be glad
    to assist in developing any sort of information management/reporting
    that you need personally for the engineering side of OSE, as that is
    my particular expertise as a PHP developer. I also believe project
    management software would be useful for our needs, if you aren't
    using one already.
     
  • 16 Comments
  • Marcin wrote:

    Dear James,

    Thanks for this valuable analysis. Eerik (from Solar Fire), can you please comment on this and let us know how your proposal integrates with James'. I'd like to see us come together on an action plan, which we can also get peer-reviewed via a blog post, and then we could implement a solution.

    I would also like to hear your comments on traffic numbers, and what solution regime addresses these numbers correctly. And from Elifarley - what is our current level of traffic and what do we realistically expect through a viral marketing campaign that we plan for Kickstarter.

    Thanks,
    Marcin
     
  • Our wiki has had ~88 pageviews / minute after being featured on Reddit's front page, and the server at DreamHost seemed to behave nicely under that load (we were using Coral CDN for images, JavaScript and CSS files then, as we are now).

    Let's suppose each one of the other sites can give us about 100 pageviews / minute; that could mean 8 * 100 = 800 pageviews / minute = 800 * 60 = 48k pageviews per hour, for a few hours.

    After those spikes, we are at about 22k pageviews / day now.

    Out of thin air, I can predict we'll have a spike of about 400k pageviews / day for a few days. After that, we'll have 155k pageviews per day, increasing every week.

    Maybe Dreamhost can sustain bigger spikes, but I'm not sure. We can enable full CDN mode, which means that our server wouldn't see the spikes, but some people may have slow page loading times, depending on Coral CDN's servers. (To see your page loading time using Coral now, visit http://openfarmtech.org.nyud.net/ )

    Everyone in this email now has user rights to see our Google Analytics page - see https://www.google.com/analytics/


    P.S.: Should we move this discussion to the discussion forum for web infrastructure?
     
  • Hi James,

    You've made some very good points.

    Please take a look at the Web server configuration page (under the IT Infrastructure category) to see what we are using.

    Thanks,
    Elifarley
     
  • Eerik wrote:

    Dear all,

    What James proposes is correct, but I think is overkill at this stage and more of a second phase problem to solve.

    I think the first priority is getting a single server running smoothly and securely, with adequate resources and an adapted website. Once this is accomplished, how to maximize delivery can then be worked out through proxy servers such as Squid or services such as MaxCDN over a longer term. There are 5 main reasons for this view:

    1. The content, functioning, and interactivity of the "base / backend" website are reasonable to resolve before working on maximizing load times for end users.
    2. Good Apache root servers can deal with millions of simultaneous connections, so there would be no immediate danger.
    3. Caching services by definition only work for static content, but a fundamental goal of OSE is to be interactive (i.e. must connect to the backend); the easiest and fastest way to have the fastest interactive experience is by scaling a single server (a decentralized backend is a 3rd phase problem).
    4. The architecture of the Internet is fairly quick these days (fiber optics / satellites), so physical proximity to the server is not as important as it used to be. The main lag is due to routers. What is more important is being as close to the internet hubs as possible so as to traverse a minimum of routers; i.e. a dedicated server, in a fast datacentre connected directly to a main Internet gateway.
    5. ISPs already implement a caching policy, so if they have a copy in cache, they do the work of asking your server whether the page has been modified, and for simultaneous connections they simply serve the same copy; they have a huge motivation to provide the fastest experience to their end users and decrease their own bandwidth, and they do this fairly efficiently.

    For instance, my server at Hetzner includes 5 terabytes of traffic, and this can be easily exceeded, which is why their policy is: “There are no charges for overage. We will permanently restrict the connection speed to 10 MBit/s if more than 5000 GB/month are used. 100 MBit/s speed can be optionally restored by committing to pay 6,90 € (incl. VAT) per additional TB used”.

    Since OSE primarily delivers text (not streaming video or game services) it's highly unlikely 5 TB will be reached in any short time frame, but if so 6.90 Euro per TB would be a trivial expense.

    As for keeping everything in RAM (an important point), Linux automatically does this at the kernel level (only removing old information from RAM if there is a shortage), so for server speed the general criterion is that the RAM is bigger than the content of the site, which is easy to achieve since RAM is cheap nowadays.

    So in my opinion decentralizing service of static pages is a second phase task and not a first phase task.

    The first phase task I think is designing the website to be what it needs to be, clearly communicating the content and attracting contributors.

    I think solarfire.org and OSE have very similar needs as open source development projects, so I feel OSE can take advantage of this experience and that I can strongly advocate for the approach I arrived at working on Solar Fire.

    The first step I would argue is moving away from a wiki.

    Problem
    Though a wiki is great to get off the ground fast, I think it's inappropriate for the needs of OSE. The reason is that a wiki is designed so that multiple people work on the same subjects, which seems reasonable at first for a development project. But in my experience running Solar Fire (I originally designed the first interactive version of the website to be similar to a wiki), the problem I encountered is that this didn't valorize the individual contributors, and there was no reason to try to create consensus data in the first place: every contribution/idea can be valuable, so there's no reason to harmonize different approaches to the same subjects to one degree or another, as is intrinsic to an encyclopedia. (Likewise, for an open source software project all the code has to be harmonized, but not so for open hardware, where each builder can put their own personality into it.)

    Also, needing to harmonize data on monolithic pages creates a huge burden on the administration and creates the situation where the contributors feel they are all contributing on equal grounds and have the same exposure opportunity as the main site personalities. For instance, when the conversion to solarfire 2.0 is complete, if I don't post I won't be featured on the main site. In my experience with the wiki, what happened was that, even though I created access for contributors, they simply emailed me their info, and Eva and I were left trying to integrate it somewhere. The result was a huge sprawling archive with no regular contributors.

    Solution
    So the new design I've recently put in place organizes all the data in terms of who contributed it. Basically, it is a large blogging system. On top of this blogging system each article can have one or multiple keywords attached, which can be used to create a vertical navigation by subject, which I'm working on now. So, this way each person has their own folder where they can post what they want, and on the main page various lists of the latest contributions by subject will be able to roll automatically.

    Since each user can have administration rights to their own folder, they can manage it completely autonomously, so the administration overhead is significantly reduced to just oversight; likewise, since they manage their own content autonomously, they feel much more part of the site and also that the site serves their own needs of communicating about their activities. When I have time I'll be making the personal pages more customizable by the contributor, so they control what's on the sides as well as part of the banner at the top; exactly the same service they would have if they made their own blog, but now with the added value to them of having their content appear on the information hub concerning the subject.

    Content Management System
    Though all open source CMSs can, in an absolute sense, accomplish the same things, as their code is open and so can be modified into anything, in practice it's clearly best to start with the CMS that's closest to the need.

    As it's a crucial decision for a webmaster which CMS to specialize in, I did serious tests of all the major open source CMSs (Drupal, Wordpress, Typo3, Joomla, Spip) and I settled on Spip. Though less known in the English-speaking world, Spip is the main professional CMS in Europe.

    I've recently been working on reworking my technical defense of Spip for prospective clients, so I've attached it here. I'll be publishing it over the next couple of days, so you'll be able to link to it for peer review.

    But in essence, the idea of Spip is to provide a basis where all common web problems are easily solved but design can be easily controlled to the minutest detail.

    The system is so well organized that it also makes collaboration easier among coders accustomed to different systems. There's not really any new system to learn, so coding primitives can be integrated easily.

    Custom higher-level coding can then be dedicated to making applications unique to the website, rather than solving problems that have already been solved by someone else (which is a huge body of knowledge).

    For instance, custom applications for OSE could be interactive design platforms, where the user can scale and tweak the design to their need on-line (I see this as crucial to proliferation in the developing world).

    --
    Eerik Wissenz

    Free Access to Solar Energy
    http://www.solarfire.org/
    SiSustainable - Web Development
    http://www.sisustainable.com/
     
  • Eerik wrote:

    Correction

    "Also, needing to harmonize data on monolithic pages creates a huge burden on the administration and creates the situation where the contributors do not feel they are all contributing on equal grounds and have the same exposure opportunity as the main site personalities."

    Also, to this I would add that when people feel they are expressing their own personality through their content contributions, rather than the information being absorbed into larger packages, they are much more enthusiastic to make higher quality content.
     
  • Jeb Bateman wrote:

    Excellent comments all. I've forwarded this thread into a Teambox task
    I created the other day based on Marcin's last email about web
    traffic, and invited Eerik and James to join in there. Hopefully we
    can move swiftly through discussion to action. Elifarley has made
    outstanding progress on the web config over the past year, and there
    is plenty more we can do on the infrastructure side in a process of
    constant improvement. Many thanks all around!
     
  • Hi guys,

    Is it best to keep this thread on Teambox (restricted to our group), or should we post it to our public discussion forum, where it can be indexed by Google and everyone can read it (and maybe add comments)?

    I'm inclined to think it would be better to make it public. Any drawbacks on that?

    Thanks,
    Elifarley
     
  • James French wrote:

    The upside is the possibility of solutions we haven't considered, or
    an offer of assistance. The only potential downside is the
    signal-to-noise ratio of the conversation, but I wouldn't expect it to be
    a problem.
     
  • Jeb Bateman wrote:

    Personally, I'd prefer to keep public info mostly in the wiki, where
    plans can be developed and improved over time, and discussion can be
    edited in the Talk pages as well. I haven't had a chance to check out
    the public forums much myself yet, but generally shy away from public
    back and forth. Some people prefer forums, but I've seen them become
    overrun with noise and shut down on other sites, including the most
    popular site I used to host years ago. Just my 2 cents though, and
    I'll support whatever direction you guys want to go...
     
  • On the noise problem: it can be mitigated by making certain forum categories invitation-only, in the sense that anyone can read them but only people with the appropriate roles can post to them (or they could even be configured so that only certain roles could read them).
     
  • James French wrote:

    With respect to the issue of building out...

    Given elifarley's numbers, I agree that what we have now (with the
    exception of CoralCDN being utter shit, seriously!) is probably
    sufficient for the traffic we're currently receiving. The issue is
    that when the traffic ramps up, it tends to ramp up hard and fast.
    What was once 100 uniques per day can tomorrow be 1,000 uniques per
    day, and in a week can be 5,000 uniques per day. I do agree there's
    not much warrant for scaling out right now to multiple servers, given
    that we want to run as much of it as we can right now on a shoestring
    budget. My point of concern is that when it's, in the blink of an eye,
    5,000/day, how quickly can we scale-out? Setting up a memcache bucket
    right now with a single server is a cheap optimization that will be a
    life-saver when we're unexpectedly hit with the slashdot-effect.
    Setting up a reverse proxy running on the same server is another cheap
    optimization that will likely be another life-saver, and both of these
    things require nothing more than man-hours right now.

    Once we have this configuration in place (the capability to scale out
    without necessarily scaling out right now), then when the time comes,
    handling more traffic will be as simple as buying a second server to
    put on the rack. I am worried that if we started getting substantial
    amounts of new traffic right now with our current set-up, we would be
    caught with our metaphorical pants down.
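
    To illustrate that point (the addresses below are placeholders), the
    application only ever sees a list of memcache servers, so the second
    box is one added line of configuration:

      <?php
      // Keys are spread across whichever servers are listed; the calling
      // code (get/set) never changes when the pool grows.
      $cache = new Memcached();
      $cache->addServers(array(
          array('10.0.0.1', 11211),
          // array('10.0.0.2', 11211),  // the future second server
      ));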
     
  • James, is Coral CDN bad because it's slow and is blocked by some countries and some companies, or are there other issues we should be concerned about too?
     
  • CoralCDN is particularly slow in my area (Stockton, CA), with the connection timing out most of the time, etc. It literally makes the site unusable at certain times.
     
  • jebjeb wrote (April 2011):

    Perhaps we can disable CoralCDN for the time being, as it seems to have similar effects for me at times in Reno, NV. Dreamhost appears to be handling our current spikes better than expected, but we should work toward scalable infrastructure as soon as possible given recent trends...
     
  • I'd like to discuss part of this thread at this post about using subpages to achieve some benefits highlighted by Eerik.
     
  • Although it is not the thrust of this post, I would like to argue strongly in favor of the wiki. I have researched it and its application to a degree and think it is a very powerful tool, but it does take some time to settle. Unlike most setups, which tend to lose value over time because posters become bored, busy, or outdated, the wiki gains value over time through small investments of people's time.
     
