<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Infrastructure on vnykmshr</title><link>https://blog.vnykmshr.com/writing/tags/infrastructure/</link><description>Recent content in Infrastructure on vnykmshr</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 28 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.vnykmshr.com/writing/tags/infrastructure/index.xml" rel="self" type="application/rss+xml"/><item><title>The personal agent trap</title><link>https://blog.vnykmshr.com/writing/personal-agent-trap/</link><pubDate>Sat, 28 Feb 2026 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/personal-agent-trap/</guid><description>&lt;p&gt;Spent a week going through the personal agent ecosystem &amp;ndash; OpenClaw, ZeroClaw, PicoClaw, the whole *Claw family. Channel testing, security audit, the whole thing.&lt;/p&gt;
&lt;p&gt;If you want a personal assistant that messages you reminders, triages your inbox, schedules things, posts updates &amp;ndash; these frameworks are actually good at that. OpenClaw connects to 50+ channels out of the box, the setup is real, it works. For that, a $7 VPS and an afternoon gets you something useful.&lt;/p&gt;</description></item><item><title>Evidence</title><link>https://blog.vnykmshr.com/writing/evidence/</link><pubDate>Fri, 10 Sep 2021 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/evidence/</guid><description>&lt;p&gt;A product manager pulls up a dashboard mid-meeting and the debate ends.&lt;/p&gt;
&lt;p&gt;We had been talking for twenty minutes about whether a new feature should be prioritized. Opinions on both sides. The PM clicks, runs a query, flips the panel to a view they saved last month. The graph shows the answer. We move on.&lt;/p&gt;
&lt;p&gt;This is not an unusual meeting. By 2021, it is every meeting.&lt;/p&gt;
&lt;h2 id="what-engineers-always-had"&gt;What engineers always had&lt;/h2&gt;
&lt;p&gt;Component-level observability has been in place for years. SLOs per service. Latency histograms. Request traces that let you follow a single call across twelve systems. Error rate charts with thresholds. Per-service dashboards bookmarked by the team that owns each service.&lt;/p&gt;</description></item><item><title>PostgreSQL HA</title><link>https://blog.vnykmshr.com/writing/postgres-ha/</link><pubDate>Mon, 15 Mar 2021 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/postgres-ha/</guid><description>&lt;p&gt;PostgreSQL&amp;rsquo;s streaming replication is straightforward to set up. The documentation is clear, the configuration is well-understood, and base backups with &lt;code&gt;pg_basebackup&lt;/code&gt; work reliably.&lt;/p&gt;
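&lt;p&gt;As a rough sketch, seeding a standby from the primary is a single command. The hostname, role, and data directory here are illustrative, not from the original article:&lt;/p&gt;

```shell
# Run as the postgres user on the new replica.
# primary.db, the replicator role, and the data directory are example names.
pg_basebackup \
  -h primary.db -U replicator \
  -D /var/lib/postgresql/12/main \
  -P -R --wal-method=stream
```

&lt;p&gt;&lt;code&gt;-R&lt;/code&gt; writes the recovery configuration so the node comes up as a standby, and &lt;code&gt;--wal-method=stream&lt;/code&gt; streams WAL during the copy so the backup is consistent on its own.&lt;/p&gt;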
&lt;p&gt;The operational problems are the hard part. They show up when the primary goes down and the automated failover does the wrong thing. Or when you promote a replica that has silently fallen two hours behind. Or when you discover that backups you&amp;rsquo;ve been taking for months don&amp;rsquo;t actually restore.&lt;/p&gt;</description></item><item><title>Prescaling for a known spike</title><link>https://blog.vnykmshr.com/writing/prescaling-for-a-known-spike/</link><pubDate>Fri, 15 Mar 2019 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/prescaling-for-a-known-spike/</guid><description>&lt;p&gt;Our biggest sale event of the year is on the calendar. The date is fixed, the hour is fixed, and when it starts, traffic hits a multiple of normal within minutes. The engineering challenge isn&amp;rsquo;t handling surprise. It&amp;rsquo;s handling certainty at a scale we&amp;rsquo;ve never seen before.&lt;/p&gt;
&lt;p&gt;We prepare for months. Six months out, teams start thinking about what their services need. Backend teams work with SRE and infra to define prescale configurations and autoscale rules. Terraform handles the provisioning. Every service team shares their estimates with infra, and the configurations get codified.&lt;/p&gt;</description></item><item><title>Consul in practice</title><link>https://blog.vnykmshr.com/writing/consul-in-practice/</link><pubDate>Mon, 10 Sep 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/consul-in-practice/</guid><description>&lt;p&gt;The microservice count is growing fast. The monolith is mostly gone and what replaced it is dozens of services across datacenters. We don&amp;rsquo;t have a uniform naming convention. Finding a service means knowing which team owns it, which cloud it&amp;rsquo;s on, and what they called it. That&amp;rsquo;s not scalable.&lt;/p&gt;
&lt;p&gt;Consul fixed the naming problem first.&lt;/p&gt;
&lt;h2 id="service-discovery"&gt;Service discovery&lt;/h2&gt;
&lt;p&gt;Every service registers with Consul. The DNS interface gives us a consistent way to find anything:&lt;/p&gt;</description></item><item><title>The week pgbouncer stopped being news</title><link>https://blog.vnykmshr.com/writing/pgbouncer-stopped-being-news/</link><pubDate>Thu, 12 Jul 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/pgbouncer-stopped-being-news/</guid><description>&lt;p&gt;The connection count climbs faster than our instance classes can keep up. Ops is hot. Every few weeks the same thread resurfaces: we need a pool in front of Postgres before the next scale event.&lt;/p&gt;
&lt;p&gt;We move on pgbouncer.&lt;/p&gt;
&lt;h2 id="the-choice"&gt;The choice&lt;/h2&gt;
&lt;p&gt;Two modes on the table. Session pooling hands a connection to a client and gives it back when the client disconnects. Transaction pooling hands one out per transaction. Transaction is tighter &amp;ndash; the pool stretches further, the math gets better &amp;ndash; but the client loses everything a session holds. Server-side prepared statements. Advisory locks. Temp tables. &lt;code&gt;SET&lt;/code&gt; commands that expect to persist.&lt;/p&gt;</description></item><item><title>The GraphQL buffer</title><link>https://blog.vnykmshr.com/writing/the-graphql-buffer/</link><pubDate>Fri, 20 Apr 2018 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/the-graphql-buffer/</guid><description>&lt;p&gt;The GraphQL gateway started as a practical problem. We had mobile apps, web clients, and a growing number of backend services. Every client talked to every backend directly. When a new backend came up or an old one changed its API, every client needed updating. The gateway was supposed to fix that &amp;ndash; one schema, one endpoint, clients talk to GraphQL, GraphQL talks to backends.&lt;/p&gt;
&lt;p&gt;We built it in Go, starting from a fork of &lt;code&gt;graphql-go&lt;/code&gt;. The fork grew over time &amp;ndash; custom resolvers, caching layers, request batching, things we needed that the upstream didn&amp;rsquo;t have. We&amp;rsquo;d sync the fork every few months, but our changes kept growing. Five of us on the team, and most of the early days went into getting other teams to migrate their APIs onto the gateway. We built the base, got teams to add and own their own modules, then moved into a gatekeeping role &amp;ndash; reviewing what went in, making sure the schema stayed coherent.&lt;/p&gt;</description></item><item><title>Hazard lights</title><link>https://blog.vnykmshr.com/writing/hazard-lights/</link><pubDate>Sat, 10 Jun 2017 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/hazard-lights/</guid><description>&lt;p&gt;There are about fifteen of us in the enclosure. Backend engineers, SRE, devops, infra &amp;ndash; handpicked from across the floor. The rest of the team, about a hundred people, sit outside. They call us the fishes in the aquarium.&lt;/p&gt;
&lt;p&gt;The aquarium has hazard lights. Physical ones &amp;ndash; wired to fire on any 5xx in the system. When something breaks in production, the room goes red.&lt;/p&gt;
&lt;p&gt;It sounds like a gimmick. It isn&amp;rsquo;t.&lt;/p&gt;</description></item><item><title>Nginx load balancing decisions</title><link>https://blog.vnykmshr.com/writing/nginx-load-balancing/</link><pubDate>Thu, 18 May 2017 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/nginx-load-balancing/</guid><description>&lt;p&gt;Nginx as a reverse proxy and load balancer is well-documented. The configuration syntax is not the hard part. The decisions are.&lt;/p&gt;
&lt;h2 id="algorithm-selection"&gt;Algorithm selection&lt;/h2&gt;
&lt;p&gt;Three algorithms cover most workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Round-robin&lt;/strong&gt; (the default). Requests cycle through backends sequentially. Weights let you bias toward higher-capacity servers. Simple, predictable, works well when request processing times are uniform.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-nginx" data-lang="nginx"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="n"&gt;api-01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="s"&gt;weight=3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="n"&gt;api-02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="s"&gt;weight=2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="n"&gt;api-03&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt; &lt;span class="s"&gt;weight=1&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;keepalive&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;backup&lt;/code&gt; directive keeps a server in reserve &amp;ndash; it only receives traffic when all non-backup servers are down. Useful for a smaller instance that can keep the service alive during a partial outage but shouldn&amp;rsquo;t take production load normally.&lt;/p&gt;</description></item><item><title>Node.js on a Raspberry Pi</title><link>https://blog.vnykmshr.com/writing/nodejs-on-raspberry-pi/</link><pubDate>Sun, 05 Jan 2014 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/nodejs-on-raspberry-pi/</guid><description>&lt;p&gt;When I first heard about the Raspberry Pi, I had to get one. A $35 computer that runs real applications. In India in 2013, getting one was the hard part.&lt;/p&gt;
&lt;p&gt;Element14 showed &amp;ldquo;6 qty available.&amp;rdquo; I ordered. The status changed to &amp;ldquo;8-9 weeks lead time.&amp;rdquo; Forty-five days later, the Pi arrived. I plugged it in &amp;ndash; nothing. A blinking red light, no display. I tried reloading Raspbian, different cables, different SD cards. Nothing worked. I packed it away and forgot about it for the better part of a year.&lt;/p&gt;</description></item><item><title>Running Node.js in production</title><link>https://blog.vnykmshr.com/writing/nodejs-in-production/</link><pubDate>Wed, 29 May 2013 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/nodejs-in-production/</guid><description>&lt;p&gt;We&amp;rsquo;ve been running Node.js in production since the 0.4 days. The language is easy to get started with. Keeping it running under real traffic is a different problem.&lt;/p&gt;
&lt;h2 id="process-management"&gt;Process management&lt;/h2&gt;
&lt;p&gt;The application needs to start at boot, restart on crash, and respond to system signals. Upstart handles this on Ubuntu without additional dependencies:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;description &lt;span class="s2"&gt;&amp;#34;myserver&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env &lt;span class="nv"&gt;APP_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/www/myserver/releases/current
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env &lt;span class="nv"&gt;NODE_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;env &lt;span class="nv"&gt;RUN_AS_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;www-data
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;start on &lt;span class="o"&gt;(&lt;/span&gt;net-device-up and local-filesystems and runlevel &lt;span class="o"&gt;[&lt;/span&gt;2345&lt;span class="o"&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;stop on runlevel &lt;span class="o"&gt;[&lt;/span&gt;016&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;respawn
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;respawn limit &lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pre-start script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;test&lt;/span&gt; -x /usr/local/bin/node &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; stop&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit&lt;/span&gt; 0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;test&lt;/span&gt; -e &lt;span class="nv"&gt;$APP_HOME&lt;/span&gt;/logs &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; stop&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit&lt;/span&gt; 0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;end script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;script
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; chdir &lt;span class="nv"&gt;$APP_HOME&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;exec&lt;/span&gt; /usr/local/bin/node bin/cluster app.js &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -u &lt;span class="nv"&gt;$RUN_AS_USER&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -l logs/myserver.out &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -e logs/myserver.err &amp;gt;&amp;gt; &lt;span class="nv"&gt;$APP_HOME&lt;/span&gt;/logs/upstart
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;end script
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;respawn limit 5 60&lt;/code&gt; prevents a crash loop &amp;ndash; if the process dies 5 times within 60 seconds, Upstart stops trying. The &lt;code&gt;pre-start&lt;/code&gt; script verifies that Node and the log directory exist before attempting to start.&lt;/p&gt;</description></item><item><title>MySQL on XFS</title><link>https://blog.vnykmshr.com/writing/mysql-xfs/</link><pubDate>Thu, 11 Apr 2013 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/mysql-xfs/</guid><description>&lt;p&gt;XFS handles database workloads better than ext4 &amp;ndash; better concurrent I/O, more efficient metadata operations for table-heavy schemas, and delayed allocation that improves write throughput. The obvious approach is to change MySQL&amp;rsquo;s &lt;code&gt;datadir&lt;/code&gt; in the config. The less obvious approach is bind mounts, which keep every path where the system expects it.&lt;/p&gt;
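&lt;p&gt;The bind-mount idea, sketched end to end. The device and staging path are illustrative names, not from the original setup:&lt;/p&gt;

```shell
# /dev/sdb1 and /data/mysql are example names for the XFS volume
# and the staging directory that will back the datadir.
sudo mount /dev/sdb1 /data
sudo mkdir -p /data/mysql

# Stop MySQL, copy the existing datadir onto XFS, then bind it back
# over the original path so MySQL's config never changes.
sudo service mysql stop
sudo cp -a /var/lib/mysql/. /data/mysql/
sudo mount --bind /data/mysql /var/lib/mysql
sudo service mysql start
```

&lt;p&gt;A matching &lt;code&gt;/etc/fstab&lt;/code&gt; entry (&lt;code&gt;/data/mysql /var/lib/mysql none bind 0 0&lt;/code&gt;) makes the bind mount survive reboots. MySQL, its packaging, and every tool that assumes &lt;code&gt;/var/lib/mysql&lt;/code&gt; keep working unmodified.&lt;/p&gt;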
&lt;h2 id="setup"&gt;Setup&lt;/h2&gt;
&lt;p&gt;Install XFS utilities alongside MySQL:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sudo apt-get install -y xfsprogs mysql-server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Create the filesystem on the dedicated volume:&lt;/p&gt;</description></item></channel></rss>