<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Infrastructure on vnykmshr</title><link>https://blog.vnykmshr.com/writing/tags/infrastructure/</link><description>Recent content in Infrastructure on vnykmshr</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 28 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.vnykmshr.com/writing/tags/infrastructure/index.xml" rel="self" type="application/rss+xml"/><item><title>The personal agent trap</title><link>https://blog.vnykmshr.com/writing/personal-agent-trap/</link><pubDate>Sat, 28 Feb 2026 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/personal-agent-trap/</guid><description>&lt;p&gt;Spent a week going through the personal agent ecosystem &amp;ndash; OpenClaw, ZeroClaw, PicoClaw, the whole *Claw family. Channel testing, security audit, the whole thing.&lt;/p&gt;
&lt;p&gt;If you want a personal assistant that messages you reminders, triages your inbox, schedules things, posts updates &amp;ndash; these frameworks are actually good at that. OpenClaw connects to 50+ channels out of the box, the setup is real, it works. For that, a $7 VPS and an afternoon gets you something useful.&lt;/p&gt;</description></item><item><title>Evidence</title><link>https://blog.vnykmshr.com/writing/evidence/</link><pubDate>Fri, 10 Sep 2021 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/evidence/</guid><description>&lt;p&gt;A product manager pulls up a dashboard mid-meeting and the debate ends.&lt;/p&gt;
&lt;p&gt;We had been talking for twenty minutes about whether a new feature should be prioritized. Opinions on both sides. The PM clicks, runs a query, flips the panel to a view they saved last month. The graph shows the answer. We move on.&lt;/p&gt;
&lt;p&gt;This is not an unusual meeting. By 2021, it is every meeting.&lt;/p&gt;
&lt;h2 id="what-engineers-always-had"&gt;What engineers always had&lt;/h2&gt;
&lt;p&gt;Component-level observability has been in place for years. SLOs per service. Latency histograms. Request traces that let you follow a single call across twelve systems. Error rate charts with thresholds. Per-service dashboards bookmarked by the team that owns each service.&lt;/p&gt;</description></item><item><title>PostgreSQL HA</title><link>https://blog.vnykmshr.com/writing/postgres-ha/</link><pubDate>Mon, 15 Mar 2021 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/postgres-ha/</guid><description>&lt;p&gt;PostgreSQL&amp;rsquo;s streaming replication is straightforward to set up. The documentation is clear, the configuration is well-understood, and base backups with &lt;code&gt;pg_basebackup&lt;/code&gt; work reliably.&lt;/p&gt;
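&lt;p&gt;As a rough sketch, seeding a standby from the primary is a single command. The hostname, role, and data directory here are illustrative, not from the original article:&lt;/p&gt;

```shell
# Run as the postgres user on the new replica.
# primary.db, the replicator role, and the data directory are example names.
pg_basebackup \
  -h primary.db -U replicator \
  -D /var/lib/postgresql/12/main \
  -P -R --wal-method=stream
```

&lt;p&gt;&lt;code&gt;-R&lt;/code&gt; writes the recovery configuration so the node comes up as a standby, and &lt;code&gt;--wal-method=stream&lt;/code&gt; streams WAL during the copy so the backup is consistent on its own.&lt;/p&gt;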
&lt;p&gt;The operational problems are the hard part. They show up when the primary goes down and the automated failover does the wrong thing. Or when you promote a replica that has silently fallen two hours behind. Or when you discover that backups you&amp;rsquo;ve been taking for months don&amp;rsquo;t actually restore.&lt;/p&gt;</description></item><item><title>Prescaling for a known spike</title><link>https://blog.vnykmshr.com/writing/prescaling-for-a-known-spike/</link><pubDate>Fri, 15 Mar 2019 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/prescaling-for-a-known-spike/</guid><description>&lt;p&gt;Our biggest sale event of the year is on the calendar. The date is fixed, the hour is fixed, and when it starts, traffic hits a multiple of normal within minutes. The engineering challenge isn&amp;rsquo;t handling surprise. It&amp;rsquo;s handling certainty at a scale we&amp;rsquo;ve never seen before.&lt;/p&gt;
&lt;p&gt;We prepare for months. Six months out, teams start thinking about what their services need. Backend teams work with SRE and infra to define prescale configurations and autoscale rules. Terraform handles the provisioning. Every service team shares their estimates with infra, and the configurations get codified.&lt;/p&gt;</description></item><item><title>Consul in practice</title><link>https://blog.vnykmshr.com/writing/consul-in-practice/</link><pubDate>Mon, 10 Sep 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/consul-in-practice/</guid><description>&lt;p&gt;The microservice count is growing fast. The monolith is mostly gone and what replaced it is dozens of services across datacenters. We don&amp;rsquo;t have a uniform naming convention. Finding a service means knowing which team owns it, which cloud it&amp;rsquo;s on, and what they called it. That&amp;rsquo;s not scalable.&lt;/p&gt;
&lt;p&gt;Consul fixed the naming problem first.&lt;/p&gt;
&lt;h2 id="service-discovery"&gt;Service discovery&lt;/h2&gt;
&lt;p&gt;Every service registers with Consul. The DNS interface gives us a consistent way to find anything:&lt;/p&gt;</description></item><item><title>The week pgbouncer stopped being news</title><link>https://blog.vnykmshr.com/writing/pgbouncer-stopped-being-news/</link><pubDate>Thu, 12 Jul 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/pgbouncer-stopped-being-news/</guid><description>&lt;p&gt;The connection count climbs faster than our instance classes can keep up. Ops is hot. Every few weeks the same thread resurfaces: we need a pool in front of Postgres before the next scale event.&lt;/p&gt;
&lt;p&gt;We move on pgbouncer.&lt;/p&gt;
&lt;h2 id="the-choice"&gt;The choice&lt;/h2&gt;
&lt;p&gt;Two modes on the table. Session pooling hands a connection to a client and gives it back when the client disconnects. Transaction pooling hands one out per transaction. Transaction is tighter &amp;ndash; the pool stretches further, the math gets better &amp;ndash; but the client loses everything a session holds. Server-side prepared statements. Advisory locks. Temp tables. &lt;code&gt;SET&lt;/code&gt; commands that expect to persist.&lt;/p&gt;</description></item><item><title>The GraphQL buffer</title><link>https://blog.vnykmshr.com/writing/the-graphql-buffer/</link><pubDate>Fri, 20 Apr 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/the-graphql-buffer/</guid><description>&lt;p&gt;The GraphQL gateway started as a practical problem. We had mobile apps, web clients, and a growing number of backend services. Every client talked to every backend directly. When a new backend came up or an old one changed its API, every client needed updating. The gateway was supposed to fix that &amp;ndash; one schema, one endpoint, clients talk to GraphQL, GraphQL talks to backends.&lt;/p&gt;
&lt;p&gt;We built it in Go, starting from a fork of &lt;code&gt;graphql-go&lt;/code&gt;. The fork grew over time &amp;ndash; custom resolvers, caching layers, request batching, things we needed that the upstream didn&amp;rsquo;t have. We&amp;rsquo;d sync the fork every few months, but our changes kept growing. Five of us on the team, and most of the early days went into getting other teams to migrate their APIs onto the gateway. We built the base, got teams to add and own their own modules, then moved into a gatekeeping role &amp;ndash; reviewing what went in, making sure the schema stayed coherent.&lt;/p&gt;</description></item><item><title>Hazard lights</title><link>https://blog.vnykmshr.com/writing/hazard-lights/</link><pubDate>Sat, 10 Jun 2017 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/hazard-lights/</guid><description>&lt;p&gt;There are about fifteen of us in the enclosure. Backend engineers, SRE, devops, infra &amp;ndash; handpicked from across the floor. The rest of the team, about a hundred people, sit outside. They call us the fishes in the aquarium.&lt;/p&gt;
&lt;p&gt;The aquarium has hazard lights. Physical ones &amp;ndash; wired to fire on any 5xx in the system. When something breaks in production, the room goes red.&lt;/p&gt;
&lt;p&gt;It sounds like a gimmick. It isn&amp;rsquo;t.&lt;/p&gt;</description></item><item><title>Nginx load balancing decisions</title><link>https://blog.vnykmshr.com/writing/nginx-load-balancing/</link><pubDate>Thu, 18 May 2017 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/nginx-load-balancing/</guid><description>&lt;p&gt;Nginx as a reverse proxy and load balancer is well-documented. The configuration syntax is not the hard part. The decisions are.&lt;/p&gt;
&lt;h2 id="algorithm-selection"&gt;Algorithm selection&lt;/h2&gt;
&lt;p&gt;Three algorithms cover most workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Round-robin&lt;/strong&gt; (the default). Requests cycle through backends sequentially. Weights let you bias toward higher-capacity servers. Simple, predictable, works well when request processing times are uniform.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-nginx" data-lang="nginx"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="n"&gt;api-01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="s"&gt;weight=3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="n"&gt;api-02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="s"&gt;weight=2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="n"&gt;api-03&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="s"&gt;weight=1&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;keepalive&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;backup&lt;/code&gt; directive keeps a server in reserve &amp;ndash; it only receives traffic when all non-backup servers are down. Useful for a smaller instance that can keep the service alive during a partial outage but shouldn&amp;rsquo;t take production load normally.&lt;/p&gt;</description></item><item><title>Node.js on a Raspberry Pi</title><link>https://blog.vnykmshr.com/writing/nodejs-on-raspberry-pi/</link><pubDate>Sun, 05 Jan 2014 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/nodejs-on-raspberry-pi/</guid><description>&lt;p&gt;When I first heard about the Raspberry Pi, I had to get one. A $35 computer that runs real applications. In India in 2013, getting one was the hard part.&lt;/p&gt;
&lt;p&gt;Element14 showed &amp;ldquo;6 qty available.&amp;rdquo; I ordered. The status changed to &amp;ldquo;8-9 weeks lead time.&amp;rdquo; Forty-five days later, the Pi arrived. I plugged it in &amp;ndash; nothing. A blinking red light, no display. I tried reloading Raspbian, different cables, different SD cards. Nothing worked. I packed it away and forgot about it for the better part of a year.&lt;/p&gt;</description></item><item><title>Running Node.js in production</title><link>https://blog.vnykmshr.com/writing/nodejs-in-production/</link><pubDate>Wed, 29 May 2013 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/nodejs-in-production/</guid><description>&lt;p&gt;We&amp;rsquo;ve been running Node.js in production since the 0.4 days. The language is easy to get started with. Keeping it running under real traffic is a different problem.&lt;/p&gt;
&lt;h2 id="process-management"&gt;Process management&lt;/h2&gt;
&lt;p&gt;The application needs to start at boot, restart on crash, and respond to system signals. Upstart handles this on Ubuntu without additional dependencies:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;description &lt;span class="s2"&gt;&amp;#34;myserver&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env &lt;span class="nv"&gt;APP_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/www/myserver/releases/current
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env &lt;span class="nv"&gt;NODE_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env &lt;span class="nv"&gt;RUN_AS_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;www-data
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;start on &lt;span class="o"&gt;(&lt;/span&gt;net-device-up and local-filesystems and runlevel &lt;span class="o"&gt;[&lt;/span&gt;2345&lt;span class="o"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;stop on runlevel &lt;span class="o"&gt;[&lt;/span&gt;016&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;respawn
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;respawn limit &lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pre-start script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;test&lt;/span&gt; -x /usr/local/bin/node &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; stop&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit&lt;/span&gt; 0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;test&lt;/span&gt; -e &lt;span class="nv"&gt;$APP_HOME&lt;/span&gt;/logs &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; stop&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit&lt;/span&gt; 0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;end script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; chdir &lt;span class="nv"&gt;$APP_HOME&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;exec&lt;/span&gt; /usr/local/bin/node bin/cluster app.js &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -u &lt;span class="nv"&gt;$RUN_AS_USER&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -l logs/myserver.out &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -e logs/myserver.err &amp;gt;&amp;gt; &lt;span class="nv"&gt;$APP_HOME&lt;/span&gt;/logs/upstart
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;end script
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;respawn limit 5 60&lt;/code&gt; prevents a crash loop &amp;ndash; if the process dies 5 times within 60 seconds, Upstart stops trying. The &lt;code&gt;pre-start&lt;/code&gt; script verifies that Node and the log directory exist before attempting to start.&lt;/p&gt;</description></item><item><title>MySQL on XFS</title><link>https://blog.vnykmshr.com/writing/mysql-xfs/</link><pubDate>Thu, 11 Apr 2013 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/mysql-xfs/</guid><description>&lt;p&gt;XFS handles database workloads better than ext4 &amp;ndash; better concurrent I/O, more efficient metadata operations for table-heavy schemas, and delayed allocation that improves write throughput. The obvious approach is to change MySQL&amp;rsquo;s &lt;code&gt;datadir&lt;/code&gt; in the config. The less obvious approach is bind mounts, which keep every path where the system expects it.&lt;/p&gt;
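&lt;p&gt;The bind-mount idea, sketched end to end. The device and staging path are illustrative names, not from the original setup:&lt;/p&gt;

```shell
# /dev/sdb1 and /data/mysql are example names for the XFS volume
# and the staging directory that will back the datadir.
sudo mount /dev/sdb1 /data
sudo mkdir -p /data/mysql

# Stop MySQL, copy the existing datadir onto XFS, then bind it back
# over the original path so MySQL's config never changes.
sudo service mysql stop
sudo cp -a /var/lib/mysql/. /data/mysql/
sudo mount --bind /data/mysql /var/lib/mysql
sudo service mysql start
```

&lt;p&gt;A matching &lt;code&gt;/etc/fstab&lt;/code&gt; entry (&lt;code&gt;/data/mysql /var/lib/mysql none bind 0 0&lt;/code&gt;) makes the bind mount survive reboots. MySQL, its packaging, and every tool that assumes &lt;code&gt;/var/lib/mysql&lt;/code&gt; keep working unmodified.&lt;/p&gt;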
&lt;h2 id="setup"&gt;Setup&lt;/h2&gt;
&lt;p&gt;Install XFS utilities alongside MySQL:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sudo apt-get install -y xfsprogs mysql-server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Create the filesystem on the dedicated volume:&lt;/p&gt;</description></item></channel></rss>