TypeScript scripts as DuckDB Table Functions

Tobias Müller

What if you could query any REST API, GraphQL endpoint, or web page directly from DuckDB using SQL? No ETL pipelines, no intermediate files, no complex setup - just write a TypeScript script and use it as a table function.

In this post, I'll show you how to combine DuckDB's shellfs and arrow extensions with Bun's zero-dependency scripts to create a powerful, flexible data pipeline that lets you query the world with SQL.

This post was inspired by Python Scripts as DuckDB Table Functions and Self-Contained TypeScript Programs Using Bun.

The “Magic”

The setup is surprisingly simple. DuckDB's shellfs extension can read data from shell command output, and the arrow extension can parse Apache Arrow IPC format. Combine this with Bun's ability to run TypeScript files directly (with automatic dependency installation), and you get a powerful pattern:

-- One-time setup in DuckDB
INSTALL shellfs FROM community;
INSTALL arrow FROM community;
LOAD shellfs;
LOAD arrow;

-- Create a macro to run Bun scripts as table functions
CREATE OR REPLACE MACRO bun(script, args := '') AS TABLE
SELECT * FROM read_arrow('bun ' || script || ' ' || args || ' |');

Now any TypeScript script that outputs Arrow IPC data becomes a queryable table:

SELECT * FROM bun('my-script.ts');
SELECT * FROM bun('fetch-data.ts', '--limit=100');

The Secret Sauce: json-to-arrow-ipc

The key ingredient is the json-to-arrow-ipc npm package. This package converts JSON to Arrow IPC format without dictionary encoding, making it fully compatible with DuckDB.

A minimal script looks like this:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const data = [{ name: 'Alice', age: 30 }, { name: 'Bob', age: 25 }];
process.stdout.write(jsonToArrowIPC(data));

Real-World Examples

Let's look at some practical examples that demonstrate the power of this approach.

Example 1: Query GitHub Repositories

Want to analyze the most popular TypeScript repositories on GitHub? Create github-repos.ts:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const args = process.argv.slice(2);
const language = args.find(a => a.startsWith('--language='))?.split('=')[1] ?? 'typescript';
const perPage = parseInt(args.find(a => a.startsWith('--per-page='))?.split('=')[1] ?? '30');

const response = await fetch(
  `https://api.github.com/search/repositories?q=language:${language}&sort=stars&order=desc&per_page=${perPage}`,
  {
    headers: {
      'Accept': 'application/vnd.github.v3+json',
      'User-Agent': 'duckdb-bun-script'
    }
  }
);

const { items } = await response.json();

const repos = items.map((repo: any) => ({
  name: repo.name,
  full_name: repo.full_name,
  description: repo.description,
  stars: repo.stargazers_count,
  forks: repo.forks_count,
  open_issues: repo.open_issues_count,
  language: repo.language,
  created_at: repo.created_at,
  updated_at: repo.updated_at,
  homepage: repo.homepage,
  topics: JSON.stringify(repo.topics),
}));

process.stdout.write(jsonToArrowIPC(repos));

Now query it with SQL:

-- Top 10 TypeScript repos by stars
SELECT name, stars, forks, description
FROM bun('github-repos.ts', '--language=typescript --per-page=100')
ORDER BY stars DESC
LIMIT 10;

-- Compare languages
SELECT
  'TypeScript' as language, stars, name
FROM bun('github-repos.ts', '--language=typescript --per-page=10')
UNION ALL
SELECT
  'Rust' as language, stars, name
FROM bun('github-repos.ts', '--language=rust --per-page=10')
ORDER BY stars DESC;

Example 2: Fetch Blog Posts from Hashnode GraphQL API

Hashnode provides a GraphQL API that's perfect for this pattern. Create hashnode-posts.ts:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const args = process.argv.slice(2);
const host = args.find(a => a.startsWith('--host='))?.split('=')[1] ?? 'tobilg.com';

const query = `
  query GetPublicationPosts($host: String!, $first: Int!) {
    publication(host: $host) {
      posts(first: $first) {
        edges {
          node {
            title
            slug
            brief
            publishedAt
            views
            reactionCount
            responseCount
            readTimeInMinutes
            tags {
              name
            }
            url
            author {
              username
            }
          }
        }
      }
    }
  }
`;

const response = await fetch('https://gql.hashnode.com/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    query,
    variables: { host, first: 20 }
  }),
});

const { data } = await response.json();

const posts = data?.publication?.posts?.edges?.map((edge: any) => ({
  title: edge.node.title,
  slug: edge.node.slug,
  brief: edge.node.brief,
  published_at: edge.node.publishedAt,
  views: edge.node.views ?? 0,
  reactions: edge.node.reactionCount ?? 0,
  responses: edge.node.responseCount ?? 0,
  read_time_minutes: edge.node.readTimeInMinutes,
  tags: JSON.stringify(edge.node.tags?.map((t: any) => t.name) ?? []),
  url: edge.node.url,
  author: edge.node.author?.username ?? null,
})) ?? [];

process.stdout.write(jsonToArrowIPC(posts));

Query blog analytics with SQL:

-- All posts sorted by views
SELECT title, views, reactions, read_time_minutes, published_at
FROM bun('hashnode-posts.ts', '--host=tobilg.com')
ORDER BY views DESC;

-- Posts with most engagement
SELECT
  title,
  views,
  reactions,
  responses,
  (reactions + responses) as total_engagement
FROM bun('hashnode-posts.ts', '--host=tobilg.com')
ORDER BY total_engagement DESC
LIMIT 5;

-- Average read time
SELECT
  AVG(read_time_minutes) as avg_read_time,
  SUM(views) as total_views,
  SUM(reactions) as total_reactions
FROM bun('hashnode-posts.ts', '--host=tobilg.com');

Example 3: GitHub User Activity

Create github-user-events.ts to analyze recent GitHub activity:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const args = process.argv.slice(2);
const username = args.find(a => a.startsWith('--user='))?.split('=')[1] ?? 'tobilg';

const response = await fetch(
  `https://api.github.com/users/${username}/events/public?per_page=100`,
  {
    headers: {
      'Accept': 'application/vnd.github.v3+json',
      'User-Agent': 'duckdb-bun-script'
    }
  }
);

const events = await response.json();

const processed = events.map((event: any) => ({
  id: event.id,
  type: event.type,
  repo: event.repo?.name,
  created_at: event.created_at,
  action: event.payload?.action ?? null,
  ref: event.payload?.ref ?? null,
  ref_type: event.payload?.ref_type ?? null,
}));

process.stdout.write(jsonToArrowIPC(processed));

Analyze activity patterns:

-- Event type breakdown
SELECT type, COUNT(*) as count
FROM bun('github-user-events.ts', '--user=tobilg')
GROUP BY type
ORDER BY count DESC;

-- Most active repositories
SELECT repo, COUNT(*) as events
FROM bun('github-user-events.ts', '--user=tobilg')
GROUP BY repo
ORDER BY events DESC
LIMIT 10;

-- Activity by day of week
SELECT
  strftime(created_at::timestamp, '%A') as day_of_week,
  COUNT(*) as events
FROM bun('github-user-events.ts', '--user=tobilg')
GROUP BY day_of_week
ORDER BY events DESC;

Example 4: NPM Package Statistics

Create npm-package.ts to fetch package metadata:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const args = process.argv.slice(2);
const packageName = args.find(a => a.startsWith('--package='))?.split('=')[1] ?? 'duckdb';

const registryResponse = await fetch(`https://registry.npmjs.org/${packageName}`);

const registry = await registryResponse.json();

const versions = Object.entries(registry.time || {})
  .filter(([key]) => key !== 'created' && key !== 'modified')
  .map(([version, time]) => ({
    package: packageName,
    version,
    published_at: time as string,
    description: registry.description,
    license: registry.license,
    homepage: registry.homepage,
  }));

process.stdout.write(jsonToArrowIPC(versions));

Create npm-downloads.ts for download stats:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const args = process.argv.slice(2);
const packageName = args.find(a => a.startsWith('--package='))?.split('=')[1] ?? 'duckdb';

const response = await fetch(
  `https://api.npmjs.org/downloads/range/last-month/${packageName}`
);

const data = await response.json();

const downloads = data.downloads?.map((d: any) => ({
  package: packageName,
  date: d.day,
  downloads: d.downloads,
})) ?? [];

process.stdout.write(jsonToArrowIPC(downloads));

Query package analytics:

-- Version history
SELECT version, published_at
FROM bun('npm-package.ts', '--package=apache-arrow')
ORDER BY published_at DESC
LIMIT 10;

-- Download trends
SELECT
  date,
  downloads,
  AVG(downloads) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as rolling_avg
FROM bun('npm-downloads.ts', '--package=duckdb')
ORDER BY date;

-- Compare package downloads
SELECT 'duckdb' as package, SUM(downloads) as total
FROM bun('npm-downloads.ts', '--package=duckdb')
UNION ALL
SELECT 'better-sqlite3' as package, SUM(downloads) as total
FROM bun('npm-downloads.ts', '--package=better-sqlite3');

Example 5: Hacker News Front Page

Create hn-frontpage.ts using the official Hacker News API:

#!/usr/bin/env bun

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const args = process.argv.slice(2);
const limit = parseInt(args.find(a => a.startsWith('--limit='))?.split('=')[1] ?? '30');

// Fetch top story IDs
const topStoriesResponse = await fetch('https://hacker-news.firebaseio.com/v0/topstories.json');
const topStoryIds: number[] = await topStoriesResponse.json();

// Fetch story details in parallel (limited to avoid rate limiting)
const storyIds = topStoryIds.slice(0, limit);
const stories = await Promise.all(
  storyIds.map(async (id) => {
    const response = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
    return response.json();
  })
);

const processed = stories.map((story: any, index: number) => ({
  rank: index + 1,
  id: story.id,
  title: story.title,
  url: story.url ?? null,
  score: story.score,
  author: story.by,
  comments: story.descendants ?? 0,
  created_at: new Date(story.time * 1000).toISOString(),
  type: story.type,
}));

process.stdout.write(jsonToArrowIPC(processed));

Analyze the front page:

-- Current top stories
SELECT rank, title, score, comments, author
FROM bun('hn-frontpage.ts', '--limit=30')
ORDER BY rank;

-- Stories with high engagement ratio
SELECT
  title,
  score,
  comments,
  ROUND(comments::float / NULLIF(score, 0), 2) as comment_per_point
FROM bun('hn-frontpage.ts', '--limit=50')
WHERE score > 10
ORDER BY comment_per_point DESC
LIMIT 10;

-- Top domains
SELECT
  regexp_extract(url, 'https?://([^/]+)', 1) as domain,
  COUNT(*) as stories,
  AVG(score) as avg_score
FROM bun('hn-frontpage.ts', '--limit=100')
WHERE url IS NOT NULL
GROUP BY domain
ORDER BY stories DESC
LIMIT 10;

Configuration Options

The json-to-arrow-ipc package accepts options that control schema inference and how messy, real-world data is handled:

import { jsonToArrowIPC } from 'json-to-arrow-ipc';

const ipc = jsonToArrowIPC(data, {
  // Sample rows for schema inference (default: 100)
  schemaSampleSize: 50,

  // Handle schema mismatches: 'error' | 'skip' | 'coerce' (default: 'coerce')
  onSchemaMismatch: 'coerce',

  // Flatten nested objects with dot notation (default: false)
  flattenNestedObjects: true,

  // Serialize arrays as JSON strings (default: true)
  serializeArrays: true,
});
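To make the flattenNestedObjects option concrete, here is an illustrative sketch of dot-notation flattening, written from scratch for this post; it mirrors the behavior the option describes but is not the package's actual implementation:

```typescript
// Illustrative sketch: turn nested objects into flat records with
// dot-notation keys, similar in spirit to flattenNestedObjects.
// Arrays are left as-is (serializeArrays would handle those separately).
function flattenObject(
  obj: Record<string, unknown>,
  prefix = ""
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Recurse into plain nested objects, extending the key path
      Object.assign(out, flattenObject(value as Record<string, unknown>, path));
    } else {
      // Scalars (and arrays) are emitted under the accumulated path
      out[path] = value;
    }
  }
  return out;
}

// { user: { name: 'Alice', address: { city: 'Berlin' } } }
// flattens to { 'user.name': 'Alice', 'user.address.city': 'Berlin' }
```

Flattened keys like user.address.city then surface as ordinary columns in DuckDB, which you can reference with quoted identifiers in SQL.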

Tips and Best Practices

  1. Handle pagination: For APIs with large datasets, implement pagination in your script and use command-line arguments to control limits.

  2. Error handling: Add try-catch blocks and output empty arrays on error to prevent DuckDB query failures.

  3. Rate limiting: Be mindful of API rate limits. Add delays between requests when fetching many resources.
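All three tips can be combined in one small helper. This is an illustrative sketch, not part of json-to-arrow-ipc: fetchPage is a hypothetical callback standing in for your actual API call, and the maxPages/delayMs parameters are made up for this example.

```typescript
// Sketch: paginate an API with a delay between requests, and stop
// gracefully on errors so the script always produces valid output.
type PageFetcher<T> = (page: number) => Promise<T[]>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchAllPages<T>(
  fetchPage: PageFetcher<T>,
  { maxPages = 5, delayMs = 200 }: { maxPages?: number; delayMs?: number } = {}
): Promise<T[]> {
  const rows: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    try {
      const batch = await fetchPage(page);
      if (batch.length === 0) break; // no more data
      rows.push(...batch);
    } catch {
      // On error, return what we have so DuckDB still gets valid output
      break;
    }
    if (page < maxPages) await sleep(delayMs); // stay under API rate limits
  }
  return rows;
}
```

In a real script you would finish with process.stdout.write(jsonToArrowIPC(rows)); on total failure, writing jsonToArrowIPC([]) gives DuckDB a valid, empty table instead of a broken stream that fails the whole query.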

Conclusion

With this pattern, any data source with an API becomes a SQL table. Combined with DuckDB's analytical capabilities - window functions, CTEs, JSON extraction, and more - you have a lightweight but powerful tool for ad-hoc data analysis.

The combination of Bun's TypeScript execution, automatic dependency management, and DuckDB's extensibility creates a developer experience that's both powerful and simple.


The json-to-arrow-ipc package and example scripts are available at github.com/tobilg/duckdb-bun-scripts.