Crawled to a standstill..

Discussion in 'Help' started by Mark, Nov 13, 2016.

  1. Mark

    Mark New Member

    Looks like I'm the first to need some hand holding - how embarrassing.

    I have a dedicated server (web server) that is currently hosting around 20 fairly new websites (quite large with lots of internal links). These are currently being crawled by search engine spiders quite heavily, but everything server-wise is showing within acceptable parameters (using WHM's 'server status' as a gauge).

    The sites, though, are now loading extremely slowly due to the spidering. I'm happy to upgrade the server if need be, but how best to monitor the server to see exactly where the bottleneck is, so I can make sure any upgrading is carried out in the right areas?

    Thanks very much for any help or advice.
     
  2. Mun

    Mun Administrator

    Thankfully there are a lot of options for this! My first suggestion would be to check what is actually crawling you, and on which sites. You can do this by checking the Apache logs, usually stored in /var/log/apache2/. Inside you should find access.log and error.log. access.log records each request made to the web server, and its user-agent field should hint at which bots are hitting you.
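    As a quick sketch of that digging, the snippet below tallies the busiest user agents. It assumes the combined log format; the Debian-style path is also an assumption -- cPanel/WHM boxes often keep per-site logs under /usr/local/apache/domlogs/ instead.

```shell
#!/bin/sh
# count_agents: tally the 10 most frequent User-Agent strings in an
# Apache access log. In the combined log format, splitting each line
# on double quotes puts the User-Agent in field 6.
count_agents() {
    awk -F'"' '{print $6}' "$1" | sort | uniq -c | sort -rn | head -10
}

# The log path is an assumption; only run if it exists.
if [ -f /var/log/apache2/access.log ]; then
    count_agents /var/log/apache2/access.log
fi
```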

    For example, this line shows the AhrefsBot hitting one of my websites (this one):
    Code:
    www.qwdsa.com:443 127.0.0.1 - - [14/Nov/2016:07:43:39 -0800] "GET /converse/find-new/371326/profile-posts HTTP/1.0" 303 649 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)"
    
    So after you do some digging, there are a few options for fixing it! First up is robots.txt: http://www.robotstxt.org/robotstxt.html. robots.txt is a file search engines fetch to see what they are allowed to crawl and how they should behave on your website. It will take a little while to take effect, but it is one very big option. Here is one that I use to delay the crawling of my websites: https://enjen.net/robots.txt.
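    As a minimal sketch, a crawl-slowing robots.txt could look like this. Note that Crawl-delay is honoured by Bing, Yandex and some others, but Googlebot ignores it (Google's crawl rate is set in Search Console instead); the AhrefsBot block is just an example.

```
# Ask well-behaved crawlers to wait 10 seconds between requests.
User-agent: *
Crawl-delay: 10

# Example: shut out one specific bot entirely.
User-agent: AhrefsBot
Disallow: /
```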

    Another option is something like Cloudflare. Cloudflare will cache many of the assets being pulled from your websites. Here is a lovely article where they talk about bots: https://blog.cloudflare.com/cloudflare-uses-intelligent-caching-to-avoid/. Not only that, but they will slow the rate at which bots hit your websites, thanks to their relationships with Google and Bing.

    Another option: block them. This is a bad idea unless they are abusive. However, I have done it a few times with Bingbot, due to it going insane checking a few of my websites. You can do this with your .htaccess file. Here are some examples of using .htaccess: http://www.clockwatchers.com/htaccess_block.html.
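    For instance, blocking by User-Agent in .htaccess can look like the sketch below (Apache mod_rewrite syntax; "BadBot" is a placeholder, not a real bot name):

```
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Return 403 Forbidden to any client whose User-Agent contains
    # "BadBot" (case-insensitive).
    RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
    RewriteRule .* - [F,L]
</IfModule>
```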

    Finally, if you are using WordPress for your websites, I highly suggest you look into a caching plugin. It will dramatically speed up your load times!

    Anyway, I hope that all helps... and if you find out more, update your post. I can add more information as you pin down the specifics of your issue.
     
  3. Mark

    Mark New Member

    Hi Mun,

    Thanks for the detailed reply.

    I should perhaps have gone into more detail, but it was late and I wanted to ask the question before I went to bed.

    My question wasn't so much about how to stop the traffic, but more to do with how to upgrade (or better configure?) my server to be able to handle the current traffic, and more, without it causing any slowdown to visitors.

    I already run ZBBlock (http://www.spambotsecurity.com/zbblock.php) on my server with the beta updates, as well as some unofficial (but very good) signatures, which seem to be stopping most of the junk that normally hits my sites. The spiders crawling my sites are all legitimate Google spiders (66.249.*.*), so I don't want to deter them at all; it's just that while this spidering is going on, my sites are almost unusable.

    How best to upgrade/configure my server to take the current spiders, and more? How can I find out exactly what the bottleneck is? As I mentioned, WHM server status is showing all fine (all green ticks)...

    Thanks for any help.
     
  4. Mun

    Mun Administrator

    Hmm, well first it comes down to what type of content you are hosting. Is it mostly static files (images, videos, text files) or dynamic content (PHP, WordPress, etc.)? If it is mostly static, switching to a different web server like Nginx may bring big performance gains.

    What web server software are you currently using? I think WHM can run several. And what kind of files are being hit?

    What software is your website hosted on? Wordpress by chance?
     
  5. Mark

    Mark New Member

    The sites are all hosted on a WordPress multisite install. They are all dynamic (in fact, as dynamic as sites can be), with all content for every page generated in real time via a remote API. I understand there will always be some delay with the API call, but the sites work fine with relatively small amounts of traffic; it's only under large amounts of traffic (or spiders, in my current case) that the problem arises.

    Is there a program/script/way I can monitor the server to see exactly what the bottleneck is (CPU, memory, DB calls etc.)? I'm hoping if I can identify the bottleneck I can then look at remedying it.

    Again, thank you so much for your help.
     
  6. Mun

    Mun Administrator

    I'd first look into a caching plugin for your WordPress install! It will help greatly, and I mean greatly: https://premium.wpmudev.org/blog/top-wordpress-caching-plugins/

    You can also look at the load on your server using top. You will need to SSH in and run the 'top' command. Here is a good explanation of the load figures it outputs: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages.

    Cloudflare may also still be of help as it will cache your pages for you, and lower the load on your server.
     
  7. Mark

    Mark New Member

    Ah, the further info you asked about:

    *** System Info ***

    Uptime: 17:01:01 up 3 days, 11:26, 0 users, load average: 6.72, 6.16, 6.41

    OS: CentOS release 6.8 (Final)

    Kernel: Linux ns***.ip-***.***.***.net 3.14.32-xxxx-grs-ipv6-64 #5 SMP Wed Sep 9 17:24:34 CEST 2015 x86_64 x86_64 x86_64 GNU/Linux

    Processors: 8 CPU(s)

    RAM: 31.36 GB

    Memory Usage:
                       total    used    free  shared  buffers  cached
    Mem:               32110   28464    3645     263      461   16407
    -/+ buffers/cache: 11595   20514
    Swap:                510       3     507

    Running Apache
     
  8. Mark

    Mark New Member

    As most of the content is generated dynamically via a remote API, I have to be very careful with caching, as it's important that the information shown is current across all sites. It's kinda more about what actually causes the slowdown and whether I can work around it. The problem is that I don't know enough to say exactly what causes it - where the bottleneck is.
     
  9. Mun

    Mun Administrator

    How accurate does the information need to be? Within 5 minutes? An hour? Caching is a big help. Cloudflare may still be useful too, as you can instruct it to cache only static content.
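    If a few minutes of staleness is acceptable, even a short TTL takes real load off. Here is a sketch using Apache's mod_expires (the directives are real; the times are examples to adjust):

```
<IfModule mod_expires.c>
    ExpiresActive On
    # Static assets rarely change: cache for a week.
    ExpiresByType image/png "access plus 1 week"
    ExpiresByType text/css  "access plus 1 week"
    # Dynamic HTML: allow only a few minutes of staleness.
    ExpiresByType text/html "access plus 5 minutes"
</IfModule>
```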

    Here is an article on high IO wait, which may be the cause if the disk is your slow point: http://bencane.com/2012/08/06/troubleshooting-high-io-wait-in-linux/.
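    For a quick look without installing iostat, this sketch estimates the current iowait share from two /proc/stat samples taken one second apart. On the aggregate "cpu" line, field 6 is cumulative iowait time and fields 2-9 sum to total CPU time (both in jiffies):

```shell
#!/bin/sh
# sample prints "total_jiffies iowait_jiffies" from /proc/stat.
sample() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8+$9, $6}' /proc/stat; }

set -- $(sample); t1=$1; w1=$2
sleep 1
set -- $(sample); t2=$1; w2=$2

# pct prints 100*dw/dt, guarding against a zero interval.
pct() { awk -v dt="$1" -v dw="$2" 'BEGIN { printf "%.1f\n", (dt > 0 ? 100 * dw / dt : 0) }'; }
echo "iowait: $(pct "$((t2 - t1))" "$((w2 - w1))")%"
```

    A persistently high percentage here points at the disk; near zero (as in the top output later in this thread) rules it out.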

    https://tools.pingdom.com/ can be used as well to determine what parts of the site are slow to load, and which pages they are hitting. It can also give you suggestions on how to improve.
     
  10. Mark

    Mark New Member

    Perfect, thank you. I'll do some reading up and let you know my findings and hopefully my solution! Thanks again for taking the time...

    Regards,

    M.
     
  11. Mark

    Mark New Member

    top - 18:18:22 up 3 days, 12:43, 1 user, load average: 7.20, 7.06, 6.88
    Tasks: 215 total, 7 running, 208 sleeping, 0 stopped, 0 zombie
    Cpu(s): 80.6%us, 8.7%sy, 0.0%ni, 10.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 32880772k total, 28983440k used, 3897332k free, 472972k buffers
    Swap: 523260k total, 6932k used, 516328k free, 16415644k cached
     
  12. Mun

    Mun Administrator

    Cpu(s): 80.6%us, 8.7%sy, 0.0%ni, 10.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st <--- this is a good indication that the server is working hard (80.6% of CPU time in user-space code), but the 0.0%wa shows that IO (hard drive) is not the limiting factor.
     
  13. Mark

    Mark New Member

    That was my thinking after reading through the links you posted above. Could it be the CPU itself is the limiting factor (currently Intel i7 4790K) and a higher spec one (faster, more cores/threads) would help greatly?
     
  14. Mun

    Mun Administrator

  15. Mark

    Mark New Member

    Thanks, I'll look into this.
     
