<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Text-Processing on vnykmshr</title><link>https://blog.vnykmshr.com/writing/tags/text-processing/</link><description>Recent content in Text-Processing on vnykmshr</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 05 Jul 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.vnykmshr.com/writing/tags/text-processing/index.xml" rel="self" type="application/rss+xml"/><item><title>Fixing OCR addresses</title><link>https://blog.vnykmshr.com/writing/fixing-ocr-addresses/</link><pubDate>Sat, 05 Jul 2025 00:00:00 +0000</pubDate><guid>https://blog.vnykmshr.com/writing/fixing-ocr-addresses/</guid><description>&lt;p&gt;OCR on government documents works well until you look at the address fields.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;State: &amp;#34;DKI JAKRTA&amp;#34;
City: &amp;#34;JAKRTA PUSAT&amp;#34;
District: &amp;#34;MENTNG&amp;#34;
Village: &amp;#34;MENTENG&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Dropped vowels, character substitutions, truncated names. Indonesian place names get mangled in predictable ways &amp;ndash; &amp;lsquo;A&amp;rsquo; becomes &amp;lsquo;R&amp;rsquo;, characters vanish mid-word. The OCR engine reads the image fine. It just can&amp;rsquo;t spell.&lt;/p&gt;
&lt;p&gt;The problem: take these broken strings and map them back to real administrative divisions. Province, city, district, village &amp;ndash; the levels are nested, so each one has to resolve correctly and stay consistent with its parent.&lt;/p&gt;</description></item></channel></rss>