First cosmic velocity: stolplit.ru

"Stolplit" company was founded more than 15 years ago and holds leading position on the furniture market today. You can buy "Stolplit" furniture not only on the web, but also in brand shops in dozens of russian regions and near abroad, there are more than two thousand of them.
Intaro performs technical support of the website since midyear of 2012.
TASK
High-load optimization of the back end and the application software.
WHAT WE DID
    HIGH-LOAD optimization
    How bad is the situation?
    We need to perform an audit to answer this question in numbers. At the first stage of the audit, the critical indicators for us are the page generation time on the server and the rendering time in the browser. A total page display time of under one second is a good benchmark; for a simple audit, compare the measured display time against that one-second threshold.
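    For example, the server response time and the total transfer time of a page can be checked quickly with curl (a rough sketch; the URL is the project's homepage and the output format string is illustrative):

        curl -o /dev/null -s \
             -w "time to first byte: %{time_starttransfer}s, total: %{time_total}s\n" \
             https://www.stolplit.ru/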

    Below is a Yandex.Metrica chart showing that the resource has performance problems. Full page load time can reach 9 seconds. There is a good chance that the user will not wait for the page to load and will simply leave.

    A lot of small boxes are better than one big one!
    Ilya Bezrukavnikov
    DEVOPS DEPARTMENT MANAGER, INTARO
    Another important indicator is the performance headroom, which you can determine by load testing the live project. Keep in mind that the goal of load testing is to find the maximum performance values, i.e. the values at which service starts to be denied. For that very reason, load testing must be performed with special care, so as not to cause a real outage. The usual load-testing metric is RPS (requests per second) — the number of requests per second your server can handle. By comparing the measured value with the current average RPS, you get the available performance headroom of your project.

    Below is an example of load testing with Yandex.Tank. The test was run against the live project during peak hours. You can see that RPS reached values close to 30, but in fact denial of service began at 17 RPS. Against an average website load of 100 RPS, this margin is critically low and the server is a potential victim of a DDoS attack (an attack whose aim is denial of service).
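    The exact syntax depends on the Yandex.Tank version; a minimal sketch of a load profile in the classic load.ini style might look like this (the host, schedule and URI are illustrative):

        [phantom]
        address=www.stolplit.ru:80
        ; ramp the load linearly from 1 to 30 requests per second over 10 minutes
        rps_schedule=line(1, 30, 10m)
        ; a real test would normally use a prepared ammo file instead of a single URI
        uris=/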

    Whose fault is it?
    Usually the reason is the absence of a well-thought-out server architecture or errors in its design, or a lack of coordination between developers and system administrators. As a rule, different teams of developers, and even different system administrators, may work on the same project. Another possible reason is that the project was built for low load and was not ready for an abrupt jump in traffic.
    What is to be done?
    In any case, it is never too late to put things in order. I would also note that driving a project into chaos, just like bringing it back into order, is the combined result of the work of developers, system administrators, content managers and so on. That is exactly why you need to find the project's bottlenecks, divide the areas of responsibility and solve the problems comprehensively.
    Servers are the things that make us who we are!
    There is a belief that if you buy a sufficiently powerful server, you can stop worrying and forget about upgrades for the next couple of years. That can be true for projects with moderate traffic that do not expect their user base to grow in the near future. At first sight it really is simpler: one server to administer and to pay for. But for high-traffic projects with possible traffic spikes, it is fundamentally wrong.

    For such projects, downtime during maintenance and the time needed to recover from an emergency are critically important; the expression "time is money" applies here in full. That is exactly why we need to build an architecture that provides the required level of fault tolerance and scalability.

    Three pillars of a correct configuration
      1. Scalability
      2. Fault tolerance
      3. Security
      Let's have a closer look at each point.
      Scalability
      Scalability means the ability to add compute capacity to the project quickly and simply. That is exactly why server roles need to be separated. You need to understand clearly that many "small" servers are better than one "big" one. All servers of the same role should have an identical configuration; this gives us simple horizontal scaling. If one "small" server fails, we get a proportional drop in performance. If the one "big" server fails, we most likely get a denial of service.
      For a typical website built on PHP and a MySQL DBMS, the separation of roles is as follows: request balancer, application server, database server.
      Balancer — a server with Nginx on board:
      • balances requests across the application servers (a minimal configuration sketch follows this list);
      • serves and caches static content (images, styles, JavaScript, etc.);
      • performs redirects;
      • can run heavy scheduled cron tasks.
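      A minimal sketch of such a balancer configuration in Nginx (host names, ports and paths are placeholders, not the project's real values):

          upstream app_backend {
              server app1.example.internal:8080 max_fails=3 fail_timeout=30s;
              server app2.example.internal:8080 max_fails=3 fail_timeout=30s;
          }

          server {
              listen 80;
              server_name www.stolplit.ru;

              # static content is served and cached by the balancer itself
              location ~* \.(jpg|jpeg|png|gif|css|js)$ {
                  root /var/www/shared/static;
                  expires 7d;
              }

              # everything else is proxied to the application servers
              location / {
                  proxy_pass http://app_backend;
                  proxy_set_header Host $host;
                  proxy_set_header X-Real-IP $remote_addr;
              }
          }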
      Application server — the server that runs the project's business logic, in our case PHP code. Unless there is a direct reason not to, it is better to use php-fpm rather than Apache on servers in this role. After all, we already have a good web server: Nginx does the job very well. The main reason system administrators are reluctant to drop Apache as the backend on live projects is the presence of .htaccess files and, as a consequence, the many redirect rules inside them. I highly recommend taking the time to rewrite all redirects for Nginx; by its nature it handles them much faster.
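      For illustration, a typical .htaccess redirect and its native Nginx equivalent (the paths are made up):

          # Apache .htaccess:
          #   RewriteRule ^catalog/old-section/(.*)$ /catalog/new-section/$1 [R=301,L]
          # the same rule expressed directly in the Nginx virtual host:
          location ~ ^/catalog/old-section/(.*)$ {
              return 301 /catalog/new-section/$1;
          }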

      Naturally, the server running the business logic must cache heavily. Everything that can be cached must be cached. PHP should have OPcache or an equivalent enabled, and using memcached in the project is always welcome.
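      A sketch of typical OPcache settings in php.ini (the values are starting points to be tuned per project, not prescriptions):

          ; php.ini
          opcache.enable=1
          opcache.memory_consumption=256
          opcache.max_accelerated_files=20000
          opcache.validate_timestamps=1
          opcache.revalidate_freq=60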

      Project files can be synchronized between servers in any suitable way, for example with lsyncd.
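      A minimal lsyncd sketch, assuming rsync-over-SSH access between the application servers (paths and the host name are placeholders):

          -- /etc/lsyncd/lsyncd.conf.lua
          sync {
              default.rsyncssh,
              source    = "/var/www/project",
              host      = "app2.example.internal",
              targetdir = "/var/www/project",
              delay     = 5,  -- batch changes for a few seconds before syncing
          }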

      Running ahead a little, I will say that the application servers must also implement load distribution for the databases.


      Database server — a server with a MySQL DBMS on board. We recommend moving away from plain MySQL and considering its forks, such as MariaDB and Percona Server. To make the database fault tolerant, we suggest looking at clustering. Plain MySQL has a built-in master-slave replication mechanism, and you can indeed build master-slave replication with it, but we would not recommend it. Instead, we suggest considering a solution from Percona called Percona XtraDB Cluster. It implements true master-master replication and a full-fledged cluster, and it is very simple to manage, scale and repair after a failure. It ships with the XtraDB storage engine, which is fully compatible with InnoDB, and it makes it possible to take backups that restore quickly, much faster than recovery from an SQL dump. In addition, the MariaDB team has developed an excellent utility, MaxScale, for spreading the load among servers.
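      A sketch of the wsrep part of my.cnf for a two-node Percona XtraDB Cluster (addresses, names and the provider path are placeholders and depend on the distribution):

          [mysqld]
          wsrep_provider=/usr/lib/libgalera_smm.so
          wsrep_cluster_name=shop_cluster
          wsrep_cluster_address=gcomm://10.0.0.11,10.0.0.12
          wsrep_node_address=10.0.0.11
          wsrep_sst_method=xtrabackup-v2
          # Galera requires row-based replication and InnoDB/XtraDB tables
          binlog_format=ROW
          default_storage_engine=InnoDB
          innodb_autoinc_lock_mode=2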


      Fault tolerance
      The main fault-tolerance rule is to allow no single point of failure, i.e. no non-redundant components. In practice this means having at least two servers in each role. As for the balancer, it can be turned into a redundant cluster, or you can use round-robin DNS across several balancers.
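      Round-robin DNS boils down to publishing several A records for the same name, for example (zone fragment with made-up addresses):

          www.example.com.  300  IN  A  203.0.113.10
          www.example.com.  300  IN  A  203.0.113.11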

      In fact, by implementing the horizontally scalable scheme described above, we already achieve a certain level of redundancy. The redundancy level is raised by further horizontal scaling, adding servers to whichever role needs them.

      Security
      You can never have too much security! I want to touch on the basic points:

      1. Eliminate SSH/SFTP password access. No matter how good your password is, a determined botnet will guess it sooner or later. In my opinion, the most suitable authentication method is RSA keys: simple and convenient (a configuration sketch follows this list).

      2. A WAF (Web Application Firewall) is an irreplaceable tool for protecting your service from XSS and SQL injection. Properly configured, it will save you a lot of time and mental energy.

      3. Each server must have a firewall configured with a default "deny" policy; only the necessary ports should be open to the outside (see the sketch below).
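      As an illustration of points 1 and 3, minimal sketches of a key-only SSH configuration and a default-deny firewall, assuming OpenSSH and iptables (ports and rules are examples and should be adapted per server role):

          # /etc/ssh/sshd_config — keys only, no passwords
          PasswordAuthentication no
          PubkeyAuthentication yes
          PermitRootLogin prohibit-password

          # iptables — default "deny", only the required ports open to the outside
          # (apply from a console or a saved ruleset so you do not cut off your own SSH session)
          iptables -P INPUT DROP
          iptables -A INPUT -i lo -j ACCEPT
          iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
          iptables -A INPUT -p tcp --dport 22 -j ACCEPT    # SSH, ideally restricted by source address
          iptables -A INPUT -p tcp --dport 80 -j ACCEPT    # HTTP
          iptables -A INPUT -p tcp --dport 443 -j ACCEPT   # HTTPS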

      There are many ways to defend against a DDoS attack, starting at the hosting level: some routers can intelligently process traffic and filter out the "garbage", and providers usually offer the same as a paid service. Tuning the Linux kernel helps to withstand other types of network attacks.
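      A few kernel parameters commonly tuned for this purpose, as a sketch (values are illustrative and should be verified under real load):

          # /etc/sysctl.conf
          net.ipv4.tcp_syncookies = 1
          net.ipv4.tcp_max_syn_backlog = 4096
          net.ipv4.conf.all.rp_filter = 1
          net.ipv4.icmp_echo_ignore_broadcasts = 1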
      Results
      Using a real project as an example, you can see how the approach described above brings order even to a very neglected system.
      The object of optimization

      The object is a large online furniture store built on the Bitrix CMS.

      The generation time of some pages on the site reached 200 seconds; obviously, under such conditions no user will ever wait for the page to open. Some pages with popular products generated up to 20,000 queries! The servers running the business logic had different versions of the PHP interpreter. Master-slave replication was configured, but in fact there was no balancing and a single master handled everything, which means the money paid for renting a powerful server brought no benefit. The database servers were located in a datacenter far away from all the other project servers, which added 10 ms of latency to every connection. Memcached, installed on the database master, was used only for session storage, and all other cache was kept in files.
      Initial server configuration

      The server fleet consisted of five rather powerful machines. The server carrying Nginx and acting as the master for business logic was a Dell PowerEdge R730 with an Intel Xeon E5-2600 v3 2.10 GHz octa-core CPU and 128 GB of DDR4 ECC RAM. This server processed 90% of all requests and, judging by its load average, should have been bursting into flames from overheating. The MySQL master and slave databases were placed in virtual containers on the first Dell PowerEdge R730 physical server. As mentioned above, the slave simply replicated the master and carried no load, so the resources allocated to its container were simply wasted. As slave servers running business logic, two machines with Intel Core i7-4770 quad-core Haswell CPUs and 32 GB of DDR3 RAM were used, and I would say they were underutilized. Another fairly powerful server was used as plain backup storage. All servers used 6 Gbit/s 7200 rpm SATA drives.
      As a result, most servers were not using even 10% of their available compute resources.
      What was done

      It was decided to rent a new server fleet, deploy the new architecture there and, after testing, "move" to the new servers and decommission the old ones. Choosing an optimal server for each role cut rental costs: the new servers turned out to be cheaper than the old ones. After analyzing the project's data volumes, it was decided to use SSD drives; they are faster than their SATA counterparts and speed up file operations significantly, which is especially important on database servers. A separate storage server in a nearby datacenter was allocated for backups. Each server was a physical machine; virtualization was not used.

      I will briefly go over each role:

      Balancer
      • the latest stable Nginx version with the required plugins was installed;
      • the configuration files were rigorously structured;
      • "junk" and dead virtual hosts were removed;
      • load balancing across the application servers was implemented: if one application server fails, all of its "clients" are moved to the other one, and if both application servers fail, part of the traffic is directed to the balancer's own application server and the system keeps working (a configuration sketch follows this list);
      • a WAF (Web Application Firewall) was installed to protect the servers from XSS and SQL injection;
      • SSL support was configured;
      • all rollouts and data exchanges are handled by the balancer, since its compute resources are barely used during normal operation;
      • the redirect rules from the .htaccess files were fully migrated, rebuilt and optimized;
      • the balancer resizes images, caches them and adds watermarks on the fly.
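      A sketch of the failover part of that configuration (host names and ports are placeholders): the two application servers take the traffic, and the balancer's own backend only steps in when both of them are unavailable.

          upstream app_backend {
              server app1.example.internal:8080 max_fails=3 fail_timeout=30s;
              server app2.example.internal:8080 max_fails=3 fail_timeout=30s;
              server 127.0.0.1:8080 backup;   # the balancer's own application server
          }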
      Application server
        • php5-fpm was chosen as the FastCGI process manager for PHP;
        • a new (compatible) PHP version is used;
        • the OPcache bytecode precompiler/cache is used;
        • the production pool configuration was fine-tuned (a sketch follows this list);
        • each application server runs its own memcached, sphinxsearch and maxscale.
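        A sketch of a tuned production pool in php-fpm (the numbers are starting points that depend on available RAM and the memory footprint of a worker):

            ; /etc/php5/fpm/pool.d/www.conf
            [www]
            listen = 127.0.0.1:9000
            pm = dynamic
            pm.max_children = 50
            pm.start_servers = 10
            pm.min_spare_servers = 5
            pm.max_spare_servers = 15
            pm.max_requests = 500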
        Database server
          • the database servers were moved to the same datacenter as the application servers, which eliminated the extra communication overhead;
          • the server configuration was fine-tuned, which freed up a lot of performance headroom.
          All key server parameters are monitored with Zabbix, which notifies the system administrator about emergencies. PHP performance metrics are tracked with Pinboard.

          A huge amount of work went into optimizing the project code. The pages that generated the heaviest load were reworked, and some of the system components were rewritten.

          • master-master replication was set up between the database servers. Since two nodes are used instead of three, the quorum is disabled. To avoid split-brain, maxscale is used on the application servers: master-slave logic is implemented with its help, so the slave is a hot standby for the master and effectively does not even know it is a slave.
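          A sketch of the MaxScale side of this scheme, a read/write-splitting service over the two database nodes (addresses, names and credentials are placeholders, parameter names vary slightly between MaxScale releases, and a production setup would also define a monitor):

              [db1]
              type=server
              address=10.0.0.11
              port=3306
              protocol=MySQLBackend

              [db2]
              type=server
              address=10.0.0.12
              port=3306
              protocol=MySQLBackend

              [Splitter-Service]
              type=service
              router=readwritesplit
              servers=db1,db2
              user=maxscale
              passwd=maxscale_password

              [Splitter-Listener]
              type=listener
              service=Splitter-Service
              protocol=MySQLClient
              port=4006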