Web Performance (email from James French)
  • Marcin,

    This is what I know. As I know you are short on time lately, I will
    leave a summary at the bottom. The body of this letter is largely
    technical; you may find it interesting, however, even if not useful.

    The key to good web performance is distributability and locality.
    Moving large amounts of traffic means we need to be able to
    "scale-out" (i.e. multiple servers) rather than just being able to
    scale-up (i.e. improving one single monolithic server) at every level
    from front-end servers to back-end servers to the database.

    Front-end servers are generally performance-optimized, feature-poor
    HTTP servers such as NginX, Squid, or Traffic Server. It is ideal for
    these servers, particularly for truly static (never-changing) content,
    to be located as close to the end user in the network topology as is
    economically and technically feasible. The business model of a CDN is
    focused on making a high level of locality tenable through economies
    of scale. MaxCDN (maxcdn.com) is one of the lowest barrier-to-entry
    CDNs that I am aware of, offering 1 TB of transfer for around $40.
    Once we are moving more traffic, we can solicit quotes from larger,
    better-established CDN providers.

    Back-end servers are almost always the Apache HTTP Server, which will
    also function as a front-end server in low-tech setups. The back-end
    server is responsible for holding and serving files to the front-end
    server, in addition to running application code such as PHP. There are
    many tricks to improve site performance at this level, and most of
    them revolve around copying as much content as possible from the hard
    disk (which is slow) into RAM (which is fast). APC is a module that does
    this for PHP, for example. The most important thing in my opinion for
    distributability at the back-end level is eliminating the need for
    session-dependent code.
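
    To illustrate (with made-up names; build_sidebar() is just a stand-in
    for any expensive disk or database work), APC's user cache looks
    roughly like this:

      <?php
      // Rough sketch of APC's user cache: keep an expensive result in
      // shared memory (RAM) so repeat requests skip the disk entirely.
      // (APC also caches compiled PHP opcodes without any code changes.)
      $key  = 'homepage_sidebar';              // made-up cache key
      $html = apc_fetch($key, $success);
      if (!$success) {
          $html = build_sidebar();             // made-up expensive function
          apc_store($key, $html, 300);         // keep it in RAM for 5 minutes
      }
      echo $html;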

    Database servers in the open-source world will generally be either
    MySQL or PostgreSQL (in the relational model) or CouchDB (in the
    document-oriented model). As most of the apps we are currently running
    require a relational database, CouchDB should only be considered for
    in-house application development. Most people are familiar with the
    near-ubiquitous MySQL. For latency reasons, it is almost a necessity
    for the database servers to be as close to the back-end servers as
    possible, and this is probably why 3rd party database services like
    Amazon SimpleDB haven't quite taken off yet.

    Let's talk about caching. In HTTP, every document returned may provide
    "Last-Modified" and "Expires" values (headers). As its name would
    imply, "Last-Modified" refers to the time a given document was last
    modified. "Expires" is a guarantee that the document won't change
    until at least the date given in the "Expires" header has passed. It
    is relatively common to give an "Expires" time in the near future,
    even when we cannot know exactly when a document will next be modified
    (because of community editing, for example). The back-end server will
    ultimately decide these values.
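
    As a rough sketch of what the back-end does (one hour is an arbitrary
    example lifetime, and page.html a stand-in for whatever we serve):

      <?php
      // The back-end decides the caching headers: when the document last
      // changed, and how long others may treat their stored copy as fresh.
      $lastModified = filemtime('page.html');   // stand-in content source
      header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
      header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');
      readfile('page.html');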

    The front-end server, and every intermediate server and end user
    outside of our purview, will use these values in two ways. If the
    expiration date has not passed yet, any intermediary may simply serve
    a previously received and stored version of the document without
    needing to request it from the back-end. (Huge gains in efficiency!)
    Even if the expiration time has passed, however, the end user, or its
    intermediaries, may make an "If-Modified-Since" request, receive back
    a response of "Not Modified", and reuse the previously received, but
    expired, version (a smaller, but still significant, gain in
    efficiency). This methodology describes caching at every level from
    the back-end to the end-user.
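
    In PHP, answering such a revalidation request looks roughly like this
    (again with page.html as a placeholder for the real document):

      <?php
      // If the client's stored copy is still current, answer
      // "304 Not Modified" with no body; otherwise send the document again.
      $lastModified = filemtime('page.html');   // placeholder document
      $since = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
          ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])
          : false;
      if ($since !== false && $since >= $lastModified) {
          header('HTTP/1.1 304 Not Modified');  // client reuses its stored copy
          exit;
      }
      header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
      readfile('page.html');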

    Between the back-end server and the database, the bread and butter of
    caching is memcache. The caching model here is different, and rather
    than being based on expiration and re-validation, it is based on
    invalidation. As the back-end server is the only thing that writes to
    the database, the locally cached entry can be overwritten with the new
    value before being handed to the much larger and more reliable
    database engine. All locally cached entries therefore may be assumed
    to be current and valid.
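
    A rough sketch of that pattern, using the PHP Memcached extension
    (load_article() and save_article() are placeholders for the real
    database queries):

      <?php
      // Because the back-end is the only writer, we overwrite the cached
      // copy on every write and treat every cache hit as current and valid.
      $cache = new Memcached();
      $cache->addServer('127.0.0.1', 11211);    // placeholder address

      function get_article($cache, $id) {
          $row = $cache->get("article:$id");
          if ($row === false) {                 // cache miss: ask the database
              $row = load_article($id);         // placeholder SELECT
              $cache->set("article:$id", $row);
          }
          return $row;
      }

      function update_article($cache, $id, $row) {
          $cache->set("article:$id", $row);     // overwrite the cached entry...
          save_article($id, $row);              // ...then hand it to the database
      }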

    Summary
    =======
    +Write code and develop infrastructure that avoids unnecessary
    round-trip communications, particularly between distant links.
    +Ensure that the end-user and intermediaries outside of our control
    can utilize the caching information returned by our applications.
    +Use superior 3rd party resources where it makes economic sense,
    particularly at the front-end layer. (CDN)

    Before we can discuss further where our strategy will lie, I need
    information about the applications we are currently using, the servers
    that currently run them, and the various software we will need or will
    be useful in the future. I have a lot of experience writing
    "internal-use" web applications for my employer, and I would be glad
    to assist in developing any sort of information management/reporting
    that you need personally for the engineering side of OSE, as that is
    my particular expertise as a PHP developer. I also believe project
    management software would be useful for our needs, if you aren't
    using one already.
     
  • 16 Comments
  • Marcin wrote:

    Dear James,

    Thanks for this valuable analysis. Eerik (from Solar Fire), can you please comment on this and let us know how your proposal integrates with James'. I'd like to see us come together on an action plan, which we can also get peer-reviewed via a blog post, and then we could implement a solution.

    I would also like to hear your comments on traffic numbers, and what solution regime addresses these numbers correctly. And from Elifarley - what is our current level of traffic and what do we realistically expect through a viral marketing campaign that we plan for Kickstarter.

    Thanks,
    Marcin
     
  • Our wiki has had ~88 pageviews / minute after being featured on Reddit's front page, and the server at DreamHost seemed to behave nicely under that load (we were using Coral CDN for images, JavaScript and CSS files then, as we are now).

    Let's suppose each one of the other sites can give us about 100 pageviews / minute; that could mean 8 * 100 = 800 pageviews / minute = 800 * 60 = 48k pageviews per hour, for a few hours.

    After those spikes, we are at about 22k pageviews / day now.

    Out of thin air, I can predict we'll have a spike of about 400k pageviews / day for a few days. After that, we'll have 155k pageviews per day, increasing every week.

    Maybe Dreamhost can sustain bigger spikes, but I'm not sure. We can enable full CDN mode, which means that our server wouldn't see the spikes, but some people may have slow page loading times, depending on Coral CDN's servers. (To see your page loading time using Coral now, visit http://openfarmtech.org.nyud.net/ )

    Everyone in this email now has user rights to see our Google Analytics page - see https://www.google.com/analytics/


    P.S.: Should we move this discussion to the discussion forum for web infrastructure?
     
  • Hi James,

    You've made some very good points.

    Please take a look at the Web server configuration page (under the IT Infrastructure category) to see what we are using.

    Thanks,
    Elifarley
     
  • Eerik wrote:

    Dear all,

    What James proposes is correct, but I think is overkill at this stage and more of a second phase problem to solve.

    I think the first priority is getting a single server running smoothly and securely, with adequate resources and an adapted website. Once this is accomplished, how to maximize delivery can then be worked out through proxy servers such as Squid or services such as MaxCDN over a longer term. There are 5 main reasons for this view:

    1. The content, functioning, and interactivity of the "base / backend" website are reasonable to resolve before working on maximizing load times for end users.
    2. Good Apache root servers can deal with millions of simultaneous connections, so there would be no immediate danger.
    3. Caching services by definition only work for static content, but a fundamental goal of OSE is to be interactive (i.e. must connect to the backend); the easiest and fastest way to have the fastest interactive experience is by scaling a single server (a decentralized backend is a 3rd phase problem).
    4. The architecture of the Internet is fairly quick these days (fiber optics / satellites), so physical proximity to the server is not as important as it used to be. The main lag is due to routers. What is more important is being as close to the internet hubs as possible so as to traverse a minimum of routers; i.e. a dedicated server, in a fast datacentre connected directly to a main Internet gateway.
    5. ISPs already implement a caching policy, so if they have a copy in cache, they do the work of asking your server whether the page has been modified, and for simultaneous connections they simply serve the same copy; they have a huge motivation to provide the fastest experience to their end users and decrease their own bandwidth, and they do this fairly efficiently.

    For instance, my server at Hetzner includes 5 terabytes of traffic, and this can be easily exceeded, which is why their policy is: “There are no charges for overage. We will permanently restrict the connection speed to 10 MBit/s if more than 5000 GB/month are used. 100 MBit/s speed can be optionally restored by committing to pay 6,90 € (incl. VAT) per additional TB used”.

    Since OSE primarily delivers text (not streaming video or game services) it's highly unlikely 5 TB will be reached in any short time frame, but if so 6.90 Euro per TB would be a trivial expense.

    As for keeping everything in RAM (an important point), Linux automatically does this at the kernel level (only removing old information from RAM if there is a shortage), so for server speed the general criterion is that the RAM is bigger than the content of the site, which is easy to achieve since RAM is cheap nowadays.

    So in my opinion decentralizing service of static pages is a second phase task and not a first phase task.

    The first phase task I think is designing the website to be what it needs to be, clearly communicating the content and attracting contributors.

    I think solarfire.org and OSE have very similar needs as open source development projects, so I feel OSE can take advantage of this experience and that I can strongly advocate for the approach I arrived at working on Solar Fire.

    The first step I would argue is moving away from a wiki.

    Problem
    Though a wiki is great to get off the ground fast, I think it's inappropriate for the needs of OSE. The reason is that a wiki is designed so that multiple people work on the same subjects, which seems reasonable at first for a development project. But in my experience running Solar Fire (I originally designed the first interactive version of the website to be similar to a wiki), the problem I encountered is that this didn't valorize the individual contributors, and there was no reason to try to create consensus data in the first place: every contribution/idea can be valuable, so there's no reason to harmonize different approaches to the same subjects to one degree or another, as is intrinsic to an encyclopedia. (Likewise, for an open source software project all the code has to be harmonized, but not so for open hardware, where each builder can put their own personality into it.)

    Also, needing to harmonize data on monolithic pages creates a huge burden on the administration and creates the situation where the contributors feel they are all contributing on equal grounds and have the same exposure opportunity as the main site personalities. For instance, when the conversion to solarfire 2.0 is complete, if I don't post I won't be featured on the main site. In my experience with the wiki, what happened was that, even though I created access for contributors, they simply emailed me their info, and Eva and I were left trying to integrate it somewhere. The result was a huge sprawling archive with no regular contributors.

    Solution
    So the new design I've recently put in place organizes all the data in terms of who contributed it. Basically, it is a large blogging system. On top of this blogging system each article can have one or multiple keywords attached, which can be used to create a vertical navigation by subject, which I'm working on now. So, this way each person has their own folder where they can post what they want, and on the main page various lists of the latest contributions by subject will be able to roll automatically.

    Since each user can have administration rights to their own folder, they can manage it completely autonomously, so the administration overhead is significantly reduced to just oversight; likewise, since they manage their own content autonomously, they feel much more part of the site and also that the site serves their own needs of communicating about their activities. When I have time I'll be making the personal pages more customizable by the contributor, so they control what's on the sides as well as part of the banner at the top; exactly the same service they would have if they made their own blog, but now with the added value to them of having their content appear on the information hub concerning the subject.

    Content Management System
    Though all open source CMSs can, in an absolute sense, accomplish the same things, as their code is open and so can be modified into anything, in practice it's clearly best to start with the CMS that's closest to the need.

    As it's a crucial decision for a webmaster which CMS to specialize in, I did serious tests of all the major open source CMSs (Drupal, Wordpress, Typo3, Joomla, Spip) and I settled on Spip. Though less known in the English-speaking world, Spip is the main professional CMS in Europe.

    I've recently been working on reworking my technical defense of Spip for prospective clients, so I've attached it here. I'll be publishing it over the next couple of days, so you'll be able to link to it for peer review.

    But in essence, the idea of Spip is to provide a basis where all common web problems are easily solved but design can be easily controlled to the minutest detail.

    The system is so well organized that it also makes collaboration easier among coders accustomed to different systems. There's not really any new system to learn, so coding primitives can be integrated easily.

    Custom higher-level coding can then be dedicated to making applications unique to the website, rather than solving problems that have already been solved by someone else (which is a huge body of knowledge).

    For instance, custom applications for OSE could be interactive design platforms, where the user can scale and tweak the design to their need on-line (I see this as crucial to proliferation in the developing world).

    --
    Eerik Wissenz

    Free Access to Solar Energy
    http://www.solarfire.org/
    SiSustainable - Web Development
    http://www.sisustainable.com/
     
  • Eerik wrote:

    Correction

    "Also, needing to harmonize data on monolithic pages creates a huge burden on the administration and creates the situation where the contributors do not feel they are all contributing on equal grounds and have the same exposure opportunity as the main site personalities."

    Also, to this I would add that when people feel they are expressing their own personality through their content contributions, rather than the information being absorbed into larger packages, they are much more enthusiastic to make higher quality content.
     
  • Jeb Bateman wrote:

    Excellent comments all. I've forwarded this thread into a Teambox task
    I created the other day based on Marcin's last email about web
    traffic, and invited Eerik and James to join in there. Hopefully we
    can move swiftly through discussion to action. Elifarley has made
    outstanding progress on the web config over the past year, and there
    is plenty more we can do on the infrastructure side in a process of
    constant improvement. Many thanks all around!
     
  • Hi guys,

    Is it best to keep this thread on Teambox (restricted to our group), or should we post it to our public discussion forum, where it can be indexed by Google and everyone can read it (and maybe add comments)?

    I'm inclined to think it would be better to make it public. Any drawbacks on that?

    Thanks,
    Elifarley
     
  • James French wrote:

    The upside is the possibility of solutions we haven't considered, or
    an offer of assistance. The only potential downside is the
    signal-to-noise ratio of the conversation, but I wouldn't expect it to be
    a problem.
     
  • Jeb Bateman wrote:

    Personally, I'd prefer to keep public info mostly in the wiki, where
    plans can be developed and improved over time, and discussion can be
    edited in the Talk pages as well. I haven't had a chance to check out
    the public forums much myself yet, but generally shy away from public
    back and forth. Some people prefer forums, but I've seen them become
    overrun with noise and shut down on other sites, including the most
    popular site I used to host years ago. Just my 2 cents though, and
    I'll support whatever direction you guys want to go...
     
  • On the noise problem: it can be mitigated by making certain forum categories invitation-only, in the sense that anyone can read them but only people with the appropriate roles can post to them (or they could even be configured so that only certain roles could read them).
     
  • James French wrote:

    With respect to the issue of building out...

    Given elifarley's numbers, I agree that what we have now (with the
    exception of CoralCDN being utter shit, seriously!) is probably
    sufficient for the traffic we're currently receiving. The issue is
    that when the traffic ramps up, it tends to ramp up hard and fast.
    What was once 100 uniques per day can tomorrow be 1,000 uniques per
    day, and in a week can be 5,000 uniques per day. I do agree there's
    not much warrant for scaling out right now to multiple servers, given
    that we want to run as much of it as we can right now on a shoestring
    budget. My point of concern is that when it's, in the blink of an eye,
    5,000/day, how quickly can we scale-out? Setting up a memcache bucket
    right now with a single server is a cheap optimization that will be a
    life-saver when we're unexpectedly hit with the slashdot-effect.
    Setting up a reverse proxy running on the same server is another cheap
    optimization that will likely be another life-saver, and both of these
    things require nothing more than man-hours right now.

    Once we have this configuration in place (the capability to scale out
    without necessarily scaling out right now), then when the time comes,
    handling more traffic will be as simple as buying a second server to
    put on the rack. I am worried that if we started getting substantial
    amounts of new traffic right now with our current set-up, we would be
    caught with our metaphorical pants down.
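
    To illustrate that point (the addresses below are placeholders), the
    application only ever sees a list of memcache servers, so the second
    box is one added line of configuration:

      <?php
      // Keys are spread across whichever servers are listed; the calling
      // code (get/set) never changes when the pool grows.
      $cache = new Memcached();
      $cache->addServers(array(
          array('10.0.0.1', 11211),
          // array('10.0.0.2', 11211),  // the future second server
      ));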
     
  • James, is Coral CDN bad because it's slow and is blocked by some countries and some companies, or are there other issues we should be concerned about too?
     
  • CoralCDN is particularly slow in my area (Stockton, CA), with the connection timing out most of the time, etc. It literally makes the site unusable at certain times.
     
  • jebjeb wrote (April 2011):

    Perhaps we can disable CoralCDN for the time being, as it seems to have similar effects for me at times in Reno, NV. Dreamhost appears to be handling our current spikes better than expected, but we should work toward scalable infrastructure as soon as possible given recent trends...
     
  • I'd like to discuss part of this thread at this post about using subpages to achieve some benefits highlighted by Eerik.
     
  • Although it is not the thrust of this post, I would like to argue strongly in favor of the wiki. I have researched it and its application to a degree and think it is a very powerful tool, but it does take some time to settle. Unlike most setups, which tend to lose value over time because posters become bored, busy, or outdated, the wiki gains value over time through small investments of people's time.
     
