How to spell check a PDF file?

Online: https://t.co/8STYFz9f6i

I would recommend using the command-line for this. Here are the steps courtesy of ChatGPT. These steps are for Windows:

Step 1: Install Poppler for Windows

In Powershell, run:

 winget install --id oschwartz10612.Poppler -e

Log. This will give you the pdftotext program. Verify the version:

PS C:\Users\siddj> pdftotext -v
pdftotext version 4.00
Copyright 1996-2017 Glyph & Cog, LLC

Step 2: Install codespell – the spell checker

Again in Powershell:

py -m pip install codespell

Log. This installs the codespell spell checker, but if you try to run it you will get an error because it is not added to the PATH by default.

 WARNING: The script codespell.exe is installed in 'C:\Users\siddj\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

So add that directory to your PATH. Instructions for how to do this can be found elsewhere.

Step 3: Run it!

pdftotext -layout resume2.pdf - | codespell -
41: · Developed a content management sysem for hosting an annual Data Science Showcase at J&J.
        sysem ==> system

Well worth it!


Own the S&P 500, without the dead weight

Passive indexing is the “gold standard” of modern investing. The advice is almost always the same: Just buy the S&P 500. Don’t think. Don’t look. Just hold.

But there is a flaw in that logic that most investors realize too late. The S&P 500 is market-cap weighted, meaning the larger a company’s valuation becomes, the more of it you own—regardless of whether that valuation is supported by reality.

Most people blindly buy the S&P 500 and end up owning a lot of overpriced, low-margin “dead weight.”

So I built a tool that lets buy-and-hold investors do the opposite: it uses a fundamental framework to filter the index down to a curated shortlist of ~50 stocks based on quality, cash flow, and value metrics.

The “Signal Over Noise” Approach

I didn’t build this for day traders or people chasing the latest “meme” stock. I built it for people who want the structural security of the index but want to optimize for business quality.

Here is how the tool differs from a traditional screener:

  • No Noise: There are no RSI indicators, no “golden crosses,” and no technical fluff. If a metric doesn’t tell you about the health of the underlying business, it isn’t here.
  • The “Anti-Index” Approach: Instead of owning all 500 companies—including those with deteriorating balance sheets and negative margins—you focus on the top 10% that actually meet strict fundamental criteria.
  • Monthly Rebalance: This tool is built for people who check their portfolio once a month, not once a minute. It’s designed to keep you focused on the long-term horizon.

Cutting the Dead Weight

When you buy the full index, you are effectively saying, “I want to own every business in America, even the ones losing money.”

By applying a fundamental filter, we aren’t “timing the market.” We are simply raising the bar for what earns a spot in our portfolio. If a company can’t generate free cash flow or maintain a healthy return on invested capital, it shouldn’t be on your shortlist.

The goal isn’t to be active. The goal is to be intentional.

Where can you get it? Here


GitHub vs. Bitbucket

These days you can just ask ChatGPT or your favorite tool to compare the pros and cons of GitHub vs. Bitbucket, so I’ll skip that, but today I realized Bitbucket has one killer feature that is missing in GitHub: Projects.

Projects allow you to group multiple repositories into one container or folder. There is no such thing in GitHub (GitHub’s Projects are planning boards for issues and PRs, not groupings of repositories). For most people this might be a quality-of-life feature rather than a killer feature, but if you are anything like Poirot, you know what I am talking about.

WDYT?


PHP Fatal Error: Unable to start pcre module in Unknown on line 0

$ php8.1 -v 
PHP Fatal error: Unable to start pcre module in Unknown on line 0

Cause

The cause of the error is that PHP has a hard dependency on pcre. PHP ships with pcre bundled in, and normally things work out of the box without you having to do anything. But in my case – since I like doing things the hard way – I had earlier installed pcre manually from source when I installed SWIG. The binary built by that procedure gets installed under /usr/local/lib, and PHP does not like it: you get the above error, PHP Fatal error: Unable to start pcre module in Unknown on line 0, if the dynamic linker picks up your own pcre instead of the one PHP was built against. This is what ABI – application binary interface – (in)compatibility refers to.

              API incompatibility        ABI incompatibility
Caught at     Compile time               Runtime
Visible in    Source code                Compiled binaries
Example       Function doesn't exist     Function exists but calling convention differs

Fix

The fix is to change the order of precedence so that the system pcre2 that PHP was built against, installed under /lib/x86_64-linux-gnu/libpcre2-8.so.0, takes precedence over the homegrown /usr/local/lib/libpcre2-8.so.0. We can do that by adding a config file under /etc/ld.so.conf.d. Do that by running:

echo "/lib/x86_64-linux-gnu" | sudo tee /etc/ld.so.conf.d/force-system-pcre2.conf

Then run sudo ldconfig to rebuild the linker cache. After this I got:

$  ldconfig -p | grep libpcre
        libpcre2-8.so.0 (libc6,x86-64) => /lib/x86_64-linux-gnu/libpcre2-8.so.0
        libpcre2-8.so.0 (libc6,x86-64) => /usr/local/lib/libpcre2-8.so.0
        libpcre2-posix.so.3 (libc6,x86-64) => /usr/local/lib/libpcre2-posix.so.3
        libpcreposix.so.3 (libc6,x86-64) => /lib/x86_64-linux-gnu/libpcreposix.so.3
        libpcre.so.3 (libc6,x86-64) => /lib/x86_64-linux-gnu/libpcre.so.3

and I am able to run PHP:

$ php8.1 -v
PHP 8.1.2-1ubuntu2.23 (cli) (built: Jan  7 2026 08:37:41) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.1.2, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.2-1ubuntu2.23, Copyright (c), by Zend Technologies

Useful Windows Shortcuts

The Essentials
  Win + D               Show/Hide Desktop
  Win + E               Open File Explorer
  Win + L               Lock PC
  Win + V               Clipboard History (see multiple copies)
  Alt + Tab             Switch between open apps

Window Control
  Win + Arrow Keys      Snap windows to sides/corners
  Win + Shift + S       Screenshot a specific area
  Win + Tab             Open Task View (Virtual Desktops)
  Win + Ctrl + D        Create a new Virtual Desktop
  Win + Number          Open the app at that Taskbar position

System Tools
  Ctrl + Shift + Esc    Open Task Manager directly
  Win + I               Open Settings
  Win + S               Search Windows
  Win + X               Quick Link Menu (power-user menu)
  Win + . (Period)      Emoji & GIF panel

Quick Actions
  Alt + F4              Close active window
  Ctrl + Shift + T      Reopen last closed browser tab
  Ctrl + Scroll         Zoom in or out
  Win + Pause           View system/device specifications
  Win + L               Instant Lock (safety first!)

SvelteKit – SPA vs. SSG

This one has tripped me up a few times, so I thought I’d make a note about it. When reading or learning SvelteKit you might have come across two terms – SPA and SSG. They seem related but they are not the same thing. SPA and SSG are independent concepts and are not mutually exclusive. It is possible to have all 4 combinations below:

SPA    SSG    What it means
 0      0
 0      1
 1      0
 1      1

What SPA means – SPA is equivalent to putting:

export const ssr = false;

in src/routes/+layout.js. A Single Page Application, as its name suggests, exists as a single HTML file with references to JS chunks. The sub-pages (paths, if you will) like /foo, /bar etc. are virtual in the sense that the JS application intercepts them and routing happens client-side – there is no call to the server when you navigate to /foo, /bar etc. An SPA is bad for SEO, but for internal applications accessed only by a company’s employees, SEO does not matter. For completeness and accuracy we should add that API (XHR) calls still go to the server (backend). SPA does not mean you cannot make any backend calls. The key point is that when you click on a link within the application, it is handled completely client-side. If you type the link into the browser address bar, that’s different – the browser will download the application again (from its cache) and the app will start (boot) all over again.

Now let’s talk about SSG. SSG is equivalent to using adapter-static in svelte.config.js. Adapters influence what happens when you run npm run build; they have no effect when you run npm run dev. A static adapter can be used when there is no server-side logic in the SvelteKit app, i.e., it’s a pure browser application. There is either no backend or (more likely) the backend is developed and run as a separate app (Express.js for example). The static adapter generates HTML+JS when you run npm run build. That HTML+JS has to be served somehow, and that is outside the scope of the static adapter; we assume you will host it on an NGINX server, for example, or any other way you wish.
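
For concreteness, here is roughly what svelte.config.js looks like for the SSG case – a minimal sketch; adapter-static's defaults already write the output to a build/ directory. You would typically also put export const prerender = true; in src/routes/+layout.js so that every route gets prerendered:

// svelte.config.js – minimal SSG setup (sketch)
import adapter from '@sveltejs/adapter-static';

/** @type {import('@sveltejs/kit').Config} */
const config = {
    kit: {
        // With no fallback configured, every route must be prerenderable.
        adapter: adapter()
    }
};

export default config;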

Now, as an exercise, challenge yourself and fill in the table above.

Nowadays AI is really good and you can get all the answers using it, eliminating the need for blog posts. See below for its answer to the question: what happens if I have server-side logic in a SvelteKit app and try to use adapter-static? Does it still work?

No, it generally won’t work as expected if you have real server-side logic (things that need to run dynamically on the server at request time).

What adapter-static does

adapter-static turns your SvelteKit app into a pure static site (or SPA with a fallback). During npm run build, it:

  • Prerenders all pages into static HTML files.
  • Outputs only HTML, CSS, JS, and any prerendered JSON/endpoint files.
  • Deploys to any static host (Netlify, Vercel static, GitHub Pages, Cloudflare Pages, Surge, etc.).

There is no server runtime at all after the build.

What happens to server-side logic

  • +server.js / +server.ts endpoints: stripped out by default; cannot run dynamically. Works? No (unless export const prerender = true;).
  • +page.server.js / +layout.server.js: server load functions are stripped; data won’t load at runtime. Works? No (unless prerendered).
  • Universal load in +page.js: runs at build time during prerendering (if possible); the client-side version may run on navigation. Works? Partially (static data only).
  • Dynamic server features (DB queries, auth, secrets, POST/PUT, etc.): impossible – no server to execute them. Works? No.
  • Forms / actions: won’t work unless everything is prerendered or you move the logic to an external backend. Works? Usually no.

By default, the build will fail with an error like:

@sveltejs/adapter-static: all routes must be fully prerenderable (unless using the 'fallback' option)

This is a safety check to prevent you from deploying a broken app where some routes are missing or non-functional.

When it can work (limited cases)

You can make some server-side code compatible if you prerender it at build time:

  • Add export const prerender = true; to the route (or to the root +layout.js to apply broadly).
  • For +server.js endpoints, this generates static files (e.g. /api/data.json becomes a static JSON file).
  • For server load functions, data is baked into the HTML at build time.

This works well for static sites (blogs, docs, marketing pages) or pages with data that rarely changes.

If the data must be dynamic (user-specific, real-time, depends on request time), it cannot work with adapter-static.

Recommended alternatives

  • Keep server-side logic → Use a different adapter:
    • @sveltejs/adapter-node (self-hosted Node server)
    • Platform-specific ones (Vercel, Netlify, Cloudflare, etc.) for serverless/edge functions.
  • Move the backend elsewhere → Keep adapter-static for the frontend and call an external API (REST, GraphQL, Supabase, your own Node/Express/FastAPI backend, etc.).
  • For SPA mode (client-side only routing) → You can use adapter-static with fallback: 'index.html' and export const ssr = false; in your root layout, but you still cannot have server-only logic.
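
To make that last bullet concrete, a minimal SPA-mode setup might look like the following sketch (both files shown together, contents illustrative):

// svelte.config.js – adapter-static in SPA mode (sketch)
import adapter from '@sveltejs/adapter-static';

export default {
    kit: {
        adapter: adapter({ fallback: 'index.html' })   // one fallback page; routing happens client-side
    }
};

// src/routes/+layout.js
export const ssr = false;   // render everything in the browser; still no server-only logic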

Quick test / debugging tip

npm run preview often hides these issues because it still runs a dev-like server. Always test a real static build:

npm run build
npx serve ./build   # or http-server, python -m http.server, etc.

In short: adapter-static is great for fully static content, but it’s incompatible with dynamic server-side logic unless you prerender everything at build time or move that logic to a separate backend. If your app needs real server capabilities, switch to a non-static adapter.


How BatchMeterUsage Actually Works (and the Gotchas)

If you’re building a SaaS product on AWS Marketplace with usage-based billing, you’ll be calling BatchMeterUsage. The API looks simple. The documentation is thin. And there are at least five things that will bite you in production if you don’t know about them upfront.

I’ve shipped 4 SaaS products on AWS Marketplace. Here’s what I learned about metering the hard way.

What BatchMeterUsage Does

BatchMeterUsage is the AWS Marketplace API that reports how much of your product a customer used. AWS uses these reports to bill the customer. You call it periodically (typically hourly), and each call contains a batch of usage records – one per customer per metering dimension.

A usage record looks like this:

{
    "CustomerIdentifier": "cust-abc-123",
    "Dimension": "api_calls",
    "Quantity": 1500,
    "Timestamp": "2025-03-15T13:00:00.000Z"
}
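
For reference, a minimal call with the AWS SDK for JavaScript v3 might look like the sketch below. PRODUCT_CODE and the record values are placeholders; the client set up here is reused in the later snippets:

const { MarketplaceMeteringClient, BatchMeterUsageCommand } = require('@aws-sdk/client-marketplace-metering');

// The Metering API only lives in specific regions; us-east-1 always works (see Gotcha 6).
const meteringClient = new MarketplaceMeteringClient({ region: 'us-east-1' });
const PRODUCT_CODE = 'your-product-code';   // placeholder – comes from your Marketplace listing

async function reportOneRecord() {
    const command = new BatchMeterUsageCommand({
        ProductCode: PRODUCT_CODE,
        UsageRecords: [{
            CustomerIdentifier: 'cust-abc-123',
            Dimension: 'api_calls',
            Quantity: 1500,
            Timestamp: new Date('2025-03-15T13:00:00.000Z')
        }]
    });
    const response = await meteringClient.send(command);
    return response;   // contains Results and possibly UnprocessedRecords (see Gotcha 5)
}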

Simple, right? Here are the gotchas.

Gotcha 1: You Must Send Zero-Usage Records

This is the one that surprises everyone. If a customer is subscribed but didn’t use your product in the last hour, you still need to send a record with Quantity: 0.

Why? AWS Marketplace uses the absence of metering records as a signal that something is wrong. If you stop sending records for a customer, AWS may flag the subscription for review or pause it. The zero-usage heartbeat tells AWS “this customer is still active, they just didn’t use anything this hour.”

This means your metering job needs to know about all subscribed customers, not just the ones with usage:

// `db`, `aggregateUsageByCustomer`, the hour window (see Gotcha 2), and `dimensions`
// (the ExternallyMetered dimensions, see Gotcha 3) are assumed to exist elsewhere.
const subscribedCustomers = db.customers.getSubscribedCustomers();
const usageMap = aggregateUsageByCustomer(hourStart, hourEnd);

const usageRecords = [];
const customersWithRecords = new Set();

// Build records for customers with actual usage
for (const customer of subscribedCustomers) {
    if (usageMap.has(customer.tenantId)) {
        customersWithRecords.add(customer.custId);
        // ... build usage records and push them onto usageRecords
    }
}

// Zero-fill for idle customers
for (const customer of subscribedCustomers) {
    if (!customersWithRecords.has(customer.custId)) {
        for (const dimension of dimensions) {
            usageRecords.push({
                Timestamp: hourStart,
                CustomerIdentifier: customer.custId,
                Dimension: dimension,
                Quantity: 0
            });
        }
    }
}

But wait – which dimensions do you send zero records for? You can’t just guess. You need to know the exact set of ExternallyMetered dimensions defined for your product. More on that in Gotcha 3.

Gotcha 2: The Timestamp Is the Hour, Not “Now”

The Timestamp field in each usage record is not when you’re making the API call. It’s the start of the billing hour you’re reporting for.

If your metering job runs at 14:35 UTC, you’re reporting usage for the 13:00-14:00 UTC window. The timestamp must be 2025-03-15T13:00:00.000Z – the start of the previous hour, truncated to the hour boundary.

function getPreviousHourStart() {
    const now = new Date();
    const previousHour = new Date(now);
    previousHour.setUTCHours(previousHour.getUTCHours() - 1);
    previousHour.setUTCMinutes(0);
    previousHour.setUTCSeconds(0);
    previousHour.setUTCMilliseconds(0);
    return previousHour;
}

Get this wrong and you’ll either:

  • Double-bill a customer (reporting the current hour’s usage when the previous hour’s usage was already reported)
  • Get rejected by AWS (timestamps must fall within the last 6 hours)

Your usage aggregation query needs to match this window exactly. Use a half-open interval – greater-than-or-equal to the hour start, strictly less than the hour end:

SELECT acctId, dimension, COALESCE(SUM(usage), 0) as total
FROM usage
WHERE datetime(timestamp) >= datetime(?)    -- 13:00:00
  AND datetime(timestamp) < datetime(?)     -- 14:00:00
GROUP BY acctId, dimension

The < (not <=) on the end boundary is critical. A usage event timestamped at exactly 14:00:00.000 belongs to the next hour’s window, not this one.
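
For completeness, the matching (exclusive) end bound is just one hour after the start – a trivial sketch to pair with getPreviousHourStart above:

function getHourEnd(hourStart) {
    // Exclusive upper bound for the half-open interval [hourStart, hourEnd)
    return new Date(hourStart.getTime() + 60 * 60 * 1000);
}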

Gotcha 3: You Need to Know Your Dimensions at Runtime

When you define your AWS Marketplace product, you configure metering dimensions (e.g., api_calls, storage_gb, users). Some are ExternallyMetered (you report them via BatchMeterUsage), others might be contract-based.

Your metering job needs to know which dimensions are externally metered so it can:

  1. Send zero-usage records for the right dimensions
  2. Not accidentally skip a dimension

You could hardcode them. But then you’d need a code change every time you add a dimension to your product listing. A better approach is to fetch them from the AWS Marketplace Catalog API at startup:

const { MarketplaceCatalogClient, DescribeEntityCommand } = require('@aws-sdk/client-marketplace-catalog');

async function getExternallyMeteredDimensions(productEntityId) {
    const client = new MarketplaceCatalogClient({ region: 'us-east-1' });
    const resp = await client.send(
        new DescribeEntityCommand({
            Catalog: "AWSMarketplace",
            EntityId: productEntityId,
        })
    );

    const details = JSON.parse(resp.Details ?? "{}");
    const dimensions = details.Dimensions || [];

    return dimensions
        .filter(dim => dim.Types && dim.Types.includes('ExternallyMetered'))
        .map(dim => dim.Key);
}

Cache the result in memory for the lifetime of the process. Dimensions don’t change often, and you don’t want to call the Catalog API on every metering cycle. A process restart picks up new dimensions.
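
A minimal way to do that caching (a sketch; getExternallyMeteredDimensions is the function above):

let cachedDimensions = null;   // module-scoped cache, lives for the lifetime of the process

async function getDimensionsCached(productEntityId) {
    if (!cachedDimensions) {
        cachedDimensions = await getExternallyMeteredDimensions(productEntityId);
    }
    return cachedDimensions;
}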

One thing to note: the Marketplace Catalog API only works in us-east-1, even if your application runs in another region (the Metering API itself is available in more regions – see Gotcha 6).

Gotcha 4: The 25-Record Batch Limit

BatchMeterUsage accepts a maximum of 25 usage records per API call. If you have 10 customers and 3 dimensions, that’s 30 records – you need two API calls.

This is easy to overlook in development when you have 1-2 test customers. It breaks in production when you have real customers:

async function sendBatchMeterUsage(usageRecords) {
    const BATCH_SIZE = 25;

    for (let i = 0; i < usageRecords.length; i += BATCH_SIZE) {
        const batch = usageRecords.slice(i, i + BATCH_SIZE);

        const command = new BatchMeterUsageCommand({
            ProductCode: PRODUCT_CODE,
            UsageRecords: batch
        });

        const response = await meteringClient.send(command);
        // Handle response...
    }
}

Send batches sequentially, not in parallel. AWS rate limits the metering API, and you don’t want to deal with throttling errors on top of everything else.
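
If you do get throttled anyway, a bounded retry with backoff is usually enough. A sketch – ThrottlingException is the error documented for BatchMeterUsage, and the delays here are arbitrary:

async function sendWithBackoff(command, maxAttempts = 5) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await meteringClient.send(command);
        } catch (error) {
            // Retry only on throttling; give up after maxAttempts
            if (error.name !== 'ThrottlingException' || attempt === maxAttempts) {
                throw error;
            }
            await new Promise(resolve => setTimeout(resolve, 2 ** attempt * 250));   // 0.5s, 1s, 2s, ...
        }
    }
}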

Gotcha 5: Failures Are Silent and Varied

A successful API call doesn’t mean all records were accepted. BatchMeterUsage has three distinct failure modes, and you need to handle all of them:

1. Per-record failures in Results:

The response includes a Results array where each record has a Status. A Status of "Success" means the record was accepted. Anything else – "CustomerNotSubscribed", "DuplicateRecord", etc. – means it wasn’t.

for (const result of response.Results) {
    if (result.Status === 'Success') {
        // Save the MeteringRecordId for your audit trail
        saveSuccessReport(result);
    } else {
        // This record was rejected -- log it, save it, alert on it
        saveFailureReport(result);
    }
}

The MeteringRecordId returned on success is your receipt. Save it. If there’s ever a billing dispute, this is your proof.

2. Unprocessed records:

The response may include an UnprocessedRecords array – records that AWS didn’t even attempt to process. This happens under load or transient issues:

if (response.UnprocessedRecords && response.UnprocessedRecords.length > 0) {
    // These need to be retried
    logger.warn(`${response.UnprocessedRecords.length} records were not processed`);
}
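
UnprocessedRecords entries are the original usage records, so they can simply be resubmitted. One way to handle them is to fold a bounded retry into the batch send itself – a sketch, reusing meteringClient and PRODUCT_CODE from earlier:

async function sendWithUnprocessedRetry(batch, maxAttempts = 3) {
    let pending = batch;
    for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
        const response = await meteringClient.send(new BatchMeterUsageCommand({
            ProductCode: PRODUCT_CODE,
            UsageRecords: pending
        }));
        // Handle response.Results for this attempt (as shown above), then retry whatever AWS skipped
        pending = response.UnprocessedRecords || [];
        if (pending.length > 0) {
            logger.warn(`${pending.length} records unprocessed after attempt ${attempt}`);
        }
    }
    return pending;   // anything still left here should go to the audit table as failed
}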

3. Full batch failure (network/API error):

The entire send() call throws. None of the records were submitted:

try {
    const response = await meteringClient.send(command);
    // handle Results + UnprocessedRecords
} catch (error) {
    // None of the 25 records in this batch were submitted
    // Log all of them as failed
    for (const record of batch) {
        saveErrorReport(record, error.message);
    }
}

The key insight: you need an audit table. Every record you submit should be saved with its outcome – success (with MeteringRecordId), failure (with status), or error (with error message). Without this, you’re flying blind.

CREATE TABLE metering_reports (
    id INTEGER PRIMARY KEY,
    customer_identifier TEXT NOT NULL,
    dimension TEXT NOT NULL,
    quantity INTEGER NOT NULL DEFAULT 0,
    metering_timestamp DATETIME NOT NULL,
    metering_record_id TEXT,
    status TEXT NOT NULL DEFAULT 'success',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

This table is your billing truth. Query it to answer “what did we report to AWS for customer X in March?” or “which records failed last week?”
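
For example, answering the first question with better-sqlite3 – a sketch; the table and column names match the DDL above, while the database file, customer id, and dates are placeholders:

const Database = require('better-sqlite3');
const db = new Database('metering.db');

// "What did we report to AWS for customer X in March?"
const rows = db.prepare(`
    SELECT dimension, SUM(quantity) AS total_reported
    FROM metering_reports
    WHERE customer_identifier = ?
      AND status = 'success'
      AND metering_timestamp >= '2025-03-01'
      AND metering_timestamp <  '2025-04-01'
    GROUP BY dimension
`).all('cust-abc-123');

console.log(rows);   // e.g. [ { dimension: 'api_calls', total_reported: 120000 } ]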

Gotcha 6: Not all regions support the Metering API

As of this writing, BatchMeterUsage is supported in the following AWS Regions (from here):

Commercial Regions:

eu-north-1, me-south-1, ap-south-1, eu-west-3, ap-southeast-3, us-east-2, af-south-1, eu-west-1, me-central-1, eu-central-1, sa-east-1, ap-east-1, ap-south-2, us-east-1, ap-northeast-2, ap-northeast-3, eu-west-2, ap-southeast-4, eu-south-1, ap-northeast-1, us-west-2, us-west-1, ap-southeast-1, ap-southeast-2, il-central-1, ca-central-1, eu-south-2, eu-central-2

China Regions:

cn-northwest-1

Gotcha 7: Two versions of the API

There are actually two versions of the BatchMeterUsage API, both with the same name. One version takes CustomerIdentifier and ProductCode; the other takes CustomerAWSAccountId (instead of CustomerIdentifier) and LicenseArn (instead of ProductCode). The first version is the widely used one, but starting June 1, 2026 you have to use the second version if you want to use Concurrent Agreements. Read about both of them, including code samples, here.

Putting It All Together

Here’s the overall structure of a production metering job:

Every hour:
  1. Compute the time window (previous hour start/end)
  2. Fetch all subscribed customers from your database
  3. Aggregate usage from the usage table, grouped by (customer, dimension)
  4. Build usage records:
     a. Real usage records for customers who had activity
     b. Zero-usage records for idle customers (for all ExternallyMetered dimensions)
  5. Split into batches of 25
  6. Send each batch sequentially
  7. Save every result to the audit table (success, failure, or error)

And separately:

On new subscription (subscribe-success SQS event):
  - Send an immediate zero-usage record for all dimensions
  - This registers the customer with AWS Metering without waiting for the next hourly job
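
A sketch of that immediate registration, reusing getPreviousHourStart and sendBatchMeterUsage from the earlier gotchas plus the dimension cache sketched under Gotcha 3 (PRODUCT_ENTITY_ID is a placeholder for your Marketplace entity id):

// Called when the subscribe-success SQS message for a new customer is processed
async function registerNewCustomer(customerIdentifier) {
    const dimensions = await getDimensionsCached(PRODUCT_ENTITY_ID);   // Gotcha 3
    const hourStart = getPreviousHourStart();                          // Gotcha 2: within the 6-hour window

    const usageRecords = dimensions.map(dimension => ({
        Timestamp: hourStart,
        CustomerIdentifier: customerIdentifier,
        Dimension: dimension,
        Quantity: 0                                                    // Gotcha 1: zero-usage heartbeat
    }));

    await sendBatchMeterUsage(usageRecords);                           // Gotcha 4: batching + audit trail
}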

Takeaway

BatchMeterUsage is deceptively simple. The API call is one function. But the operational concerns around it – zero-usage heartbeats, hourly timestamp semantics, batch limits, three failure modes, dimension discovery, identifier mapping – are where the real complexity lives.

If you’re implementing this for the first time, budget more time than you think. And build the audit table from day one. When a customer questions their bill three months from now, you’ll be glad you did.


I’ve packaged a production-tested implementation of all of this – metering, auth, entitlements, and the rest of the AWS Marketplace plumbing – into a self-hosted Node.js gateway kit. If you’re listing a SaaS product on AWS Marketplace, check it out here.


Essofore Semantic Search — Self-Hosted RAG Infrastructure That Keeps Your Data in Your VPC

Upload documents. Search in plain English. Your data never leaves your AWS account.

Most vector search infrastructure has a hidden problem: your data lives somewhere else. Whether it’s Pinecone’s servers or a managed Elasticsearch cluster, your proprietary documents are outside your control.

Essofore is different. It deploys as an AMI directly into your AWS VPC. Your documents never touch our servers or anyone else’s. And unlike Elasticsearch — which charges more as your data grows because vectors must be loaded into RAM — Essofore’s cost on a given EC2 instance stays flat no matter how much you index.

Who this is for:

  • Developers at healthcare, fintech, or legaltech companies who can’t send proprietary documents to a SaaS vendor
  • Teams getting unexpected bills from Elasticsearch or managed vector DBs as their corpus grows
  • Developers who want RAG search without becoming an ML expert — no embedding knowledge required

What makes it different:

  • Data sovereignty: 100% self-hosted in your AWS account. No third party ever sees your data.
  • Transparent pricing: $0.55/hour plus the cost of the EC2 instance, independent of data volume
  • Zero ML expertise required: Upload PDFs, Word docs, HTML. Essofore handles chunking, embeddings, vector storage, and search automatically.
  • Enterprise-grade simplicity: One AMI. One systemctl command. Ready to serve queries.

Typical use cases:

  • Internal enterprise search over company knowledge bases
  • RAG retrieval layer for LLM-powered applications
  • HIPAA/GDPR-compliant AI search where data residency matters

Get Started on AWS.


Note to self: For God’s sake stop renaming things

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

I don’t think there is anyone on the planet who renames things as much as I do. Stop it. Seriously. It’s not worth it.

It looks innocuous. But it isn’t.

I had a project that was working just fine, but being me I renamed the directory and all of a sudden everything stopped working. Turns out it was a Python project using uv (you run into the same problem if you use poetry). uv hardcodes the path in .venv/bin/activate. Look for this line:

VIRTUAL_ENV='/path/to/.venv'

So if you rename a directory and then run source .venv/bin/activate or any other uv command, it will stop working!

It doesn’t end there. I was using uvicorn and the path is hardcoded in that too (the console script’s shebang points at the old venv path). What makes it hard to debug is that the errors you get are extremely cryptic – like File not found – when you can see the file right in front of your eyes.


ETFs vs. Mutual Funds

This post is AI-generated.

Are ETFs more tax-efficient than their equivalent mutual funds?

Yes, ETFs are generally more tax-efficient than their equivalent mutual funds, especially in taxable brokerage accounts. This advantage holds for both passive (index-tracking) and many active strategies, though the gap is most pronounced in equity funds.

Why ETFs Tend to Be More Tax-Efficient

The key difference comes down to structure and how shares are created/redeemed, not the underlying investments or tax rules themselves (both are taxed the same way on dividends and realized capital gains when you sell).

  • Mutual funds: When investors redeem shares, the fund often sells underlying securities for cash to pay them out. If those sales generate gains (common in rising markets or with high turnover), the fund must distribute those capital gains to all shareholders at year-end—even if you didn’t sell anything. You owe taxes on those distributions (usually long-term capital gains rates).
  • ETFs: Most use an in-kind creation/redemption process. Large institutional investors (authorized participants) exchange baskets of securities for ETF shares (creation) or vice versa (redemption). No cash changes hands at the fund level in these large blocks, so the ETF avoids selling securities and realizing gains that would need to be distributed. Investor trading on the exchange also doesn’t force the fund to sell holdings.

This results in far fewer (or zero) capital gains distributions from ETFs.

Real-World Data on Tax Efficiency

Recent figures confirm the structural edge:

  • In 2025, only 7% of ETFs distributed capital gains, compared to 52% of mutual funds. Long-term averages (since 2016) are about 9% for ETFs vs. 53% for mutual funds.
  • Among passive funds: Just 4% of passive ETFs distributed gains vs. 41% of passive mutual funds.
  • Active ETFs also outperform active mutual funds on this metric (9% vs. 53% in 2025).

Even when ETFs do distribute gains, the amounts are typically much smaller. Over multi-year periods, this can mean meaningful after-tax return differences for buy-and-hold investors in taxable accounts.

Other Tax Considerations

  • Dividends and interest: Both ETFs and mutual funds distribute these, and they’re taxed similarly (qualified dividends at favorable rates for many investors).
  • Your own sales: You pay capital gains tax only when you sell shares in either vehicle. ETFs trade like stocks (intraday), giving you more control over timing, but this doesn’t create fund-level distributions.
  • Exceptions and caveats:
    • Vanguard’s structure: Many Vanguard mutual funds have an ETF share class that’s essentially the same fund; tax efficiency benefits can flow to both.
    • Bond/fixed-income funds: The advantage is smaller because income (interest) is the main taxable event anyway, and turnover is often lower.
    • Highly active or niche strategies: Some mutual funds use tax-management techniques (loss harvesting, etc.) that can narrow the gap, but ETFs still usually win on average.
    • Tax-advantaged accounts (IRAs, 401(k)s, Roths): Tax efficiency doesn’t matter here—gains and distributions aren’t taxed until withdrawal (or never, in Roths). Choose based on fees, liquidity, or other features.
    • ETFs aren’t immune: They can still have capital gains in rare cases (e.g., major index changes, mergers, or certain commodity/actively managed funds).

Bottom Line

For taxable accounts, equivalent ETFs (same index or strategy) are typically the more tax-efficient choice due to minimized unwanted capital gains distributions. This can compound into higher after-tax returns over time without any extra effort on your part. In tax-sheltered accounts, the difference is irrelevant.

Always check a specific fund’s history of capital gains distributions (available on fund fact sheets or tax guides) and consult a tax advisor for your situation, as individual results depend on your tax bracket, holding period, and market conditions.
