<!DOCTYPE html>
<html lang="en">
<head><title>Homelab 2: Monitoring Boogaloo &ndash; howdoicomputer</title>
<meta name="description" content="A dumping ground for ideas related to making, tomfoolery, and tomfoolery related to making">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta charset="UTF-8"/>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.1.2/css/all.min.css" integrity="sha512-1sCRPdkRXhBV2PBLUdRb4tMg1w2YPf37qatUFeS7zlBy7jJI8Lf4VHwWfZZfpXtYSLy85pkm9GaYVYMfw5BC1A==" crossorigin="anonymous" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/academicons/1.9.1/css/academicons.min.css" integrity="sha512-b1ASx0WHgVFL5ZQhTgiPWX+68KjS38Jk87jg7pe+qC7q9YkEtFq0z7xCglv7qGIs/68d3mAp+StfC8WKC5SSAg==" crossorigin="anonymous" />
<link rel="stylesheet" href="https://howdoicomputer.lol/css/palettes/tokyo-night-dark.css">
<link rel="stylesheet" href="https://howdoicomputer.lol/css/risotto.css">
<link rel="stylesheet" href="https://howdoicomputer.lol/css/custom.css">
</head>
<body>
<div class="page">
<header class="page__header"><nav class="page__nav main-nav">
<ul>
<h1 class="page__logo"><a href="https://howdoicomputer.lol/" class="page__logo-inner">howdoicomputer</a></h1>
<li class="main-nav__item"><a class="nav-main-item" href="https://howdoicomputer.lol/about" title="">About</a></li>
<li class="main-nav__item"><a class="nav-main-item active" href="https://howdoicomputer.lol/posts/" title="Posts">Posts</a></li>
</ul>
</nav>
</header>
<section class="page__body">
<header class="content__header">
<h1>Homelab 2: Monitoring Boogaloo</h1>
</header>
<div class="content__body">
<p>Every good production environment needs robust monitoring in place before it can truly be called production, and I feel like my homelab shouldn&rsquo;t be an exception.</p>
<p>With that in mind, I created a monitoring stack that I&rsquo;ve used professionally and I&rsquo;m pretty happy with it.</p>
<p><img src="/node_exp_dashboard.png" alt="dash">
Grafana dashboard for host resource metrics</p>
<h2 id="stackem">stack &rsquo;em</h2>
<p>The different components of the system are thus:</p>
<ul>
<li>Prometheus</li>
<li>Grafana</li>
<li>Prometheus&rsquo;s node_exporter</li>
<li>Consul</li>
</ul>
<p>Here is what I want to collect metrics for, from most critical to least:</p>
<ol>
<li>The base host resources. This includes available memory, CPU, disk space, network traffic, etc. Also includes the ZFS pool.</li>
<li>Nomad itself.</li>
<li>The services that are orchestrated by Nomad.</li>
</ol>
<h2 id="tldr-prometheus">tldr prometheus</h2>
<p>I won&rsquo;t go too deeply into how Prometheus works; I wouldn&rsquo;t be able to do better than the <a href="https://prometheus.io/docs/introduction/overview/">official documentation</a>. That being said, it&rsquo;s worthwhile to do a quick summary:</p>
<ul>
<li>Applications are responsible for presenting their own metrics over HTTP via a dedicated endpoint</li>
<li>Prometheus operates on a pull based model. As in, Prometheus will reach out and HTTP GET that endpoint for each application in order to &ldquo;scrape&rdquo; those metrics and store them</li>
<li>Prometheus has a massive amount of support for different types of service discovery for automatically registering scrape targets</li>
<li>Prometheus is a time series database (TSDB) - it is built to store linear, time oriented data that has dimensionality expressed through key-value pairs</li>
</ul>
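<p>For a concrete picture, the plain text an exporter serves at its metrics endpoint looks something like this (a hypothetical sample in the Prometheus exposition format):</p>
<pre tabindex="0"><code># HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.127254528e+09
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu=&#34;0&#34;,mode=&#34;idle&#34;} 152344.12
</code></pre><p>Each line is a sample: a metric name, optional key-value labels for dimensionality, and a value. Prometheus attaches the scrape timestamp when it stores them.</p>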
<h2 id="base-system-monitoring">base system monitoring</h2>
<p>For monitoring the base system, the Prometheus project supports collecting metrics from Linux-based hosts through its <a href="https://github.com/prometheus/node_exporter">node_exporter</a> project. It&rsquo;s a Golang binary that exposes an absolute treasure trove of data&hellip; including ZFS stats!</p>
<p>The exporter project recommends running it outside of a container, as it needs deep access to the host system and the isolation that containers provide runs counter to that level of access.</p>
<p>While I could run the <code>node_exporter</code> via systemd, I instead opted to use the <code>raw_exec</code> driver for Nomad. There is a Nix package for installing the exporter so the job definition just relies on executing the binary itself. Doing it this way means I get visibility into the exporter process and logs via the Nomad dashboard.</p>
<pre tabindex="0"><code>task &#34;node-exporter&#34; {
  driver = &#34;raw_exec&#34;

  config {
    command = &#34;/run/current-system/sw/bin/node_exporter&#34;
  }
}
</code></pre><p>There are some security implications to enabling the <code>raw_exec</code> driver for Nomad, as it runs every process as the same user that the Nomad client is running as - which is usually root. However, it was the path of least resistance, and it&rsquo;s a TODO to later switch to the <code>exec</code> driver - which uses cgroups and chroot to isolate the spawned process. That isolation, again, runs counter to the node exporter being able to collect host data, and would require me to make sure that the task allocation has access to the right resources in order to facilitate complete data collection.</p>
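<p>For reference, the <code>exec</code> version of the task would be a one-word change to the driver - the real work is in granting the isolated process visibility into the host. A hedged sketch (the exact mounts and flags would need experimentation):</p>
<pre tabindex="0"><code>task &#34;node-exporter&#34; {
  driver = &#34;exec&#34;

  config {
    command = &#34;/run/current-system/sw/bin/node_exporter&#34;
    # node_exporter can point at relocated host filesystems via flags
    # like --path.procfs and --path.sysfs, but those paths would have
    # to be made visible inside the task&#39;s chroot first
  }
}
</code></pre>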
<h2 id="service-discovery-brings-all-the-metrics-to-the-yard">service discovery brings all the metrics to the yard</h2>
<p>Prometheus has support for a plethora of service discovery methods that can be used to help it locate endpoints for scraping. The integrations range from using simple files to integrating with service meshes. For my homelab, I opted to use the <a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config">consul</a> service discovery integration because, well, that&rsquo;s what I&rsquo;m running.</p>
<p>Whenever a service is deployed to Nomad, it registers an IP address and port with Consul. Prometheus then talks to Consul, uses that address to construct a metrics endpoint, and starts scraping it at a specified interval.</p>
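<p>For context, that registration happens through a <code>service</code> block in the Nomad job definition - something like this sketch (names hypothetical):</p>
<pre tabindex="0"><code>service {
  name     = &#34;gitea&#34;
  port     = &#34;http&#34;
  provider = &#34;consul&#34;
}
</code></pre>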
<p>The configuration for Prometheus looks like this:</p>
<pre tabindex="0"><code>scrape_configs:
  - job_name: &#39;service_metrics&#39;
    consul_sd_configs:
      - server: &#39;{{ env &#34;NOMAD_IP_prometheus&#34; }}:8500&#39;
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
</code></pre><p>Since everything is on the same box, Consul has no authentication configured, so the only necessary config is the server endpoint. The <code>$NOMAD_IP_prometheus</code> variable is the IP address of the Prometheus task - which is also the IP address of the homelab server itself, because I make liberal use of the <code>bridge</code> networking type. As a quick note, bridge here means that all tasks share a networking namespace but ingress happens over a port bound to localhost for the Nomad client.</p>
<p>The relabel config exists to take the Consul service source label - in this case, the Consul service label is the name of the service that is running on Nomad - and create a target label of <code>service</code> so that every metric is properly tagged. This means that if I deploy a <code>gitea</code> service to Nomad then it gets a <code>service: gitea</code> label in Prometheus.</p>
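<p>So, after relabeling, a stored series might look something like this (a hypothetical sample):</p>
<pre tabindex="0"><code>process_resident_memory_bytes{instance=&#34;10.0.0.2:3000&#34;,job=&#34;service_metrics&#34;,service=&#34;gitea&#34;} 1.73408256e+08
</code></pre>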
<p>Additionally, I want to monitor Nomad itself and, well, it so happens that Nomad registers itself with Consul, so it gets scraped as well. The only difference is that the <code>metrics_path</code> has to be manually changed to <code>/v1/metrics</code>.</p>
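<p>Roughly, that means a second scrape job - and a sketch, assuming Nomad&rsquo;s <code>telemetry</code> stanza is configured to publish Prometheus-formatted metrics, could look like this on the Nomad side:</p>
<pre tabindex="0"><code>telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
</code></pre><p>And like this on the Prometheus side:</p>
<pre tabindex="0"><code>- job_name: &#39;nomad&#39;
  metrics_path: &#39;/v1/metrics&#39;
  params:
    format: [&#39;prometheus&#39;]
  consul_sd_configs:
    - server: &#39;{{ env &#34;NOMAD_IP_prometheus&#34; }}:8500&#39;
</code></pre>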
<h2 id="logs-are-uh-not-done">logs are, uh, not done</h2>
<p>Logging is incredibly important, and it&rsquo;s on my TODO list to create a better log collection setup - but for now I just use Nomad&rsquo;s log UI.</p>
<p><img src="/nomad_gitea_logs.png" alt="gitea_logs"></p>
<h2 id="future-todo">future todo</h2>
<p>Something that I&rsquo;m excited to play with is using eBPF to build fine-grained network monitoring for the box so I can better observe traffic flowing between tasks. There is a lot of <a href="https://www.redhat.com/en/blog/monitoring-ebpf-based-metrics">documentation</a> for this - I just need to clear my calendar for it.</p>
<p>Stay tuned for part 3.</p>
<hr>
</div>
<footer class="content__footer"></footer>
</section>
<section class="page__aside">
<div class="aside__about">
<div class="aside__about">
<img class="about__logo" src="https://howdoicomputer.lol/favicon.ico" alt="Logo">
<h1 class="about__title">howdoicomputer&#39;s blog</h1>
<p class="about__description">A dumping ground for ideas related to making, tomfoolery, and tomfoolery related to making</p>
</div>
<ul class="aside__social-links">
</ul>
</div>
<hr>
<div class="aside__content">
<p>
2023-09-01
</p>
</div>
</section>
<footer class="page__footer"><p>
</p>
<br /><br />
<p class="copyright"></p>
<p class="advertisement">Powered by <a href="https://gohugo.io/">hugo</a> and <a href="https://github.com/joeroe/risotto">risotto</a>.</p>
</footer>
</div>
</body>
</html>