The Cloud
---
The cloud (aws, gcloud, azure... whatever) is a piece of shit.
10x the price for 1/10th the performance (so 100x the price per unit
of work, haha)

Not only is the hardware there pathetic and all the IPC horrible, but
all the managed services you use from them perform horribly too
(from es to rabbitmq to even attached block storage [which is
probably why they perform this way]).

What I do for https://baxx.dev/stat (and for other projects):
* buy 2-3 machines from hetzner [or somewhere else]
  for 200E per month you get a 24 core (48 ht) box with 128g of ram
  and 2tb of ssd (usually in mirror, so 1tb usable) that can easily
  do ~100k randread and ~50k randwrite iops, with 1gbps network and
  unlimited traffic

  just for reference, this will cost ~5k on the cloud, and (even with
  similar specs on paper) it will perform at 1/10th of the bare metal
  box

* learn some basic sysadmin skills, it is easier now than ever
* systemd + docker can go a long way (a minimal unit sketch follows
  this list)
* try not to use many dependencies, don't decouple without good reason
* avoid queues if you can
  this sounds counterintuitive; by queues I don't mean just kafka, I
  mean all kinds of receive (usually unbounded) queues. for example,
  nginx's listen(2) backlog queue has a limit N (unbounded in some
  cases), then there is the accept(2) queue on whatever nginx is
  proxying to, then the hop from that thing to your database, and
  the database's own queue depth, and so on.

  interacting queues have extremely annoying emergent chaotic
  properties, so every time you can avoid one, do it
  (I did some investigation in the we-got-it-all-wrong posts
   https://punkjazz.org/~jack/we-got-it-all-wrong.txt
   https://punkjazz.org/~jack/we-got-it-all-wrong-2.txt
   where I changed from push to pull to understand the dynamics
   better)

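  a sketch of making those queues visible and bounded on linux (the
  numbers here are made up, tune them for your own workload):

    ss -lnt                    # for listeners, Recv-Q/Send-Q show the
                               # accept queue usage and its limit
    sysctl net.core.somaxconn  # kernel cap on any listen(2) backlog
    sysctl -w net.core.somaxconn=1024
    # and in nginx.conf, bound the backlog explicitly instead of
    # relying on the default:
    #   listen 80 backlog=1024;
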
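to make the systemd + docker point from above concrete, here is a
minimal unit sketch (unit name, image and port are all made up):

  # /etc/systemd/system/app.service
  [Unit]
  Description=app container
  After=docker.service
  Requires=docker.service

  [Service]
  Restart=always
  # remove any stale container from a previous run, ignore failure
  ExecStartPre=-/usr/bin/docker rm -f app
  ExecStart=/usr/bin/docker run --rm --name app -p 8080:8080 example/app:latest
  ExecStop=/usr/bin/docker stop app

  [Install]
  WantedBy=multi-user.target

enable it with: systemctl enable --now app
systemd restarts it when it dies and journalctl -u app gives you the
logs; no extra orchestration needed.
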
you will probably need:
* postgres/mysql
  set up master->slave replication so you have a 'hot' standby.
  on these machines 1 postgres master can handle your traffic (unless
  you just do bad design) until you reach mid size [100-200 employees]

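  a rough sketch of the postgres side of this (role name, password
  and paths are made up, check the replication docs before trusting
  it):

    # on the master: create a replication role, and allow it in
    # pg_hba.conf (host replication replicator <standby-ip>/32 md5)
    sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret'"

    # on the standby: clone the master; -R writes the recovery
    # config so the clone comes up as a hot standby
    sudo -u postgres pg_basebackup -h <master-ip> -U replicator \
        -D /var/lib/postgresql/data -R -X stream -P
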
* zookeeper
  pretty much you start it and let it run, unless you abuse it

* es, kafka, nginx, redis, some backend (node, go, whatever) etc
  use cgroups or docker to make sure one dependency won't bring the
  whole box into thrashing; keep in mind that modern thrashing is
  pretty much unstoppable once it starts.

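  a sketch of capping a dependency either way (the limits and paths
  are made up):

    # docker flavor: hard memory and cpu caps per container
    docker run -d --name redis --memory=2g --cpus=2 redis:5

    # plain systemd/cgroups flavor, no docker involved
    systemd-run --unit=kafka -p MemoryMax=8G -p CPUQuota=400% \
        /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
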
* some external dns, set up your zone records with a 5 min ttl so
  when one of the machines dies you just manually switch until you
  have a new one set up (which could take a day)
  the machines don't die every day.. so dns round robin is enough
  and should bring you to .99+ availability

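  the zone records can be as dumb as two A records on the same name
  (name and addresses here are made up):

    app.example.com.  300  IN  A  203.0.113.10
    app.example.com.  300  IN  A  203.0.113.20
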
* keep in mind you have 1 machine worth of capacity, the other one
  is pretty much there for live/live backup, which means at all times
  you must be able to handle all the traffic with 1 machine

* make the machines ping each other
  https://github.com/jackdoe/baxx/blob/master/README.txt#L76
  (example of how I do it for baxx so I get notified when any process
  or cronjob on any box is not running as expected)

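  the baxx readme above has the real thing; the generic flavor of the
  idea is a dead man's switch, e.g. a cron entry like this on every
  box (the monitor url is made up):

    * * * * * curl -fsS --max-time 10 https://monitor.example.com/ping/$(hostname) > /dev/null

  and the monitor alerts when a box goes quiet for too long.
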
* secure your boxes, following How-To-Secure-A-Linux-Server will
  give you a *very* good head start:
  https://github.com/imthenachoman/How-To-Secure-A-Linux-Server

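  a tiny subset of what that guide covers, as a taste (the service
  name differs per distro, e.g. ssh on debian):

    # /etc/ssh/sshd_config: key-only auth, no root login
    #   PasswordAuthentication no
    #   PermitRootLogin no
    sshd -t && systemctl reload sshd  # validate config, then reload
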
Once you are on your own:
* keep running a live/live setup
  Backups do not work very well in chaotic systems; there are a
  gazillion reasons why a backup will fail, and the only way you can
  be sure you can recover if a machine dies is to know for a fact
  that the other machine is serving traffic.

  Here I want to distinguish between backups of data (saving an old
  copy of a database in case someone truncates the wrong table by
  accident, which sadly happens way more than we want to admit), and
  having a way to recover from a situation where a machine is dead.

  As stated, the only way to ensure quick recovery is if you actually
  know that the fallback machine was working with the same live
  traffic as the dead machine.

* avoid buying managed services
  Not being able to strace/gdb/iostat or use jmx to hook into the
  service that is causing you issues has caused me so much pain. I
  regret it every time I helplessly look at a slow operation that
  intuitively I know should be fast and can't explain why it is
  performing like shit. You can't even log in to see if the disk
  is faulty.

  All those graphs and logs that the managed services usually give you
  are useless in a crisis or hardware degradation scenario, as it is
  often impossible to isolate the symptom from the cause once the
  thrashing starts.

* don't use CDNs
  This is harder than it sounds of course, especially if you managed
  to get to a 2 MB javascript bundle and 50-megapixel images..

  CDNs increase your complexity; they creep into your deployments and
  the way you think.. invalidation of objects, naming conventions,
  etc.. inline as much as you can and be free.

  EDIT(08/08/2019):
  Many people commented that they don't agree with this point; the
  theme of the whole post is about reducing complexity and cost *if
  you can*. I realize sometimes this is not possible, and when you
  have to use CDNs then you must use them, but the reality is that in
  many cases you don't have to.

* do it once
  Because you will end up running like 20 things, it is important to
  not have to worry about them. This whole enterprise boils down to
  you running things that are just good software, e.g. redis: you run
  it once and that's it. (LTS is way more marketing than it seems, so
  don't trust it blindly)

* avoid big data while you can
  Most companies can go very far by appending their analytics events
  to a log file or a table.
  Having 30-40 million events in a text file is on the order of
  10-20GB; on a good ssd with a good cpu you can slice and dice it
  with incredible speed.

  cat | rg | jq | sort | uniq -c | sort > report.$(date).txt is amazing
  just imagine the alternative:
  oozie, hadoop, spark, job reports, transformers, dependencies
  brrrrrr.. amazing how we ended up here just so we can count some
  numbers

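  a concrete (made up) instance of the pipeline above, counting event
  types for one day out of a json-lines log:

    cat events.log \
      | rg '"ts":"2019-08-08' \
      | jq -r .type \
      | sort | uniq -c | sort -rn \
      > report.$(date +%F).txt
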
* remove layers
  e.g. don't run elasticsearch if you only need lucene, don't run rails
  if you can do it with sinatra, don't introduce caching layers unless
  absolutely needed, don't use haproxy if you can get by with dns round
  robin, don't run cassandra if you just need an LSM tree and can
  simply embed rocksdb, don't run kubernetes if you can do it with
  systemd..

Don't go to the cloud.
It will force you to use super crappy, slow, or limited things such
as s3, and it will over-complicate your infrastructure to an
incredible degree. It is truly a piece of shit and will just force
you to design systems in a horrible way.
