TypeScript scripts as DuckDB Table Functions
What if you could query any REST API, GraphQL endpoint, or web page directly from DuckDB using SQL? No ETL pipelines, no intermediate files, no complex setup - just write a TypeScript script and use it as a table function.
In this post, I'll show you how to combine DuckDB's shellfs and arrow extensions with Bun's self-contained TypeScript scripts to create a powerful, flexible data pipeline that lets you query the world with SQL.
This post was inspired by Python Scripts as DuckDB Table Functions and Self-Contained TypeScript Programs Using Bun.
The “Magic”
The setup is surprisingly simple. DuckDB's shellfs extension can read data from shell command output, and the arrow extension can parse Apache Arrow IPC format. Combine this with Bun's ability to run TypeScript files directly (with automatic dependency installation), and you get a powerful pattern:
-- One-time setup in DuckDB
INSTALL shellfs FROM community;
INSTALL arrow FROM community;
LOAD shellfs;
LOAD arrow;
-- Create a macro to run Bun scripts as table functions
CREATE OR REPLACE MACRO bun(script, args := '') AS TABLE
SELECT * FROM read_arrow('bun ' || script || ' ' || args || ' |');
Now any TypeScript script that outputs Arrow IPC data becomes a queryable table (the trailing `|` in the macro tells shellfs to execute the string as a command and read its stdout):
SELECT * FROM bun('my-script.ts');
SELECT * FROM bun('fetch-data.ts', '--limit=100');
The Secret Sauce: json-to-arrow-ipc
The key ingredient is the json-to-arrow-ipc npm package. This package converts JSON to Arrow IPC format without dictionary encoding, making it fully compatible with DuckDB.
A minimal script looks like this:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const data = [{ name: 'Alice', age: 30 }, { name: 'Bob', age: 25 }];
process.stdout.write(jsonToArrowIPC(data));
Real-World Examples
Let's look at some practical examples that demonstrate the power of this approach.
Example 1: Query GitHub Repositories
Want to analyze the most popular TypeScript repositories on GitHub? Create github-repos.ts:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const args = process.argv.slice(2);
const language = args.find(a => a.startsWith('--language='))?.split('=')[1] ?? 'typescript';
const perPage = parseInt(args.find(a => a.startsWith('--per-page='))?.split('=')[1] ?? '30', 10);
const response = await fetch(
`https://api.github.com/search/repositories?q=language:${language}&sort=stars&order=desc&per_page=${perPage}`,
{
headers: {
'Accept': 'application/vnd.github.v3+json',
'User-Agent': 'duckdb-bun-script'
}
}
);
const { items } = await response.json();
const repos = (items ?? []).map((repo: any) => ({
name: repo.name,
full_name: repo.full_name,
description: repo.description,
stars: repo.stargazers_count,
forks: repo.forks_count,
open_issues: repo.open_issues_count,
language: repo.language,
created_at: repo.created_at,
updated_at: repo.updated_at,
homepage: repo.homepage,
topics: JSON.stringify(repo.topics),
}));
process.stdout.write(jsonToArrowIPC(repos));
Now query it with SQL:
-- Top 10 TypeScript repos by stars
SELECT name, stars, forks, description
FROM bun('github-repos.ts', '--language=typescript --per-page=100')
ORDER BY stars DESC
LIMIT 10;
-- Compare languages
SELECT
'TypeScript' as language, stars, name
FROM bun('github-repos.ts', '--language=typescript --per-page=10')
UNION ALL
SELECT
'Rust' as language, stars, name
FROM bun('github-repos.ts', '--language=rust --per-page=10')
ORDER BY stars DESC;
Example 2: Fetch Blog Posts from Hashnode GraphQL API
Hashnode provides a GraphQL API that's perfect for this pattern. Create hashnode-posts.ts:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const args = process.argv.slice(2);
const host = args.find(a => a.startsWith('--host='))?.split('=')[1] ?? 'tobilg.com';
const query = `
query GetPublicationPosts($host: String!, $first: Int!) {
publication(host: $host) {
posts(first: $first) {
edges {
node {
title
slug
brief
publishedAt
views
reactionCount
responseCount
readTimeInMinutes
tags {
name
}
url
author {
username
}
}
}
}
}
}
`;
const response = await fetch('https://gql.hashnode.com/', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
query,
variables: { host, first: 20 }
}),
});
const { data } = await response.json();
const posts = data?.publication?.posts?.edges?.map((edge: any) => ({
title: edge.node.title,
slug: edge.node.slug,
brief: edge.node.brief,
published_at: edge.node.publishedAt,
views: edge.node.views ?? 0,
reactions: edge.node.reactionCount ?? 0,
responses: edge.node.responseCount ?? 0,
read_time_minutes: edge.node.readTimeInMinutes,
tags: JSON.stringify(edge.node.tags?.map((t: any) => t.name) ?? []),
url: edge.node.url,
author: edge.node.author?.username ?? null,
})) ?? [];
process.stdout.write(jsonToArrowIPC(posts));
Query blog analytics with SQL:
-- All posts sorted by views
SELECT title, views, reactions, read_time_minutes, published_at
FROM bun('hashnode-posts.ts', '--host=tobilg.com')
ORDER BY views DESC;
-- Posts with most engagement
SELECT
title,
views,
reactions,
responses,
(reactions + responses) as total_engagement
FROM bun('hashnode-posts.ts', '--host=tobilg.com')
ORDER BY total_engagement DESC
LIMIT 5;
-- Average read time
SELECT
AVG(read_time_minutes) as avg_read_time,
SUM(views) as total_views,
SUM(reactions) as total_reactions
FROM bun('hashnode-posts.ts', '--host=tobilg.com');
Example 3: GitHub User Activity
Create github-user-events.ts to analyze recent GitHub activity:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const args = process.argv.slice(2);
const username = args.find(a => a.startsWith('--user='))?.split('=')[1] ?? 'tobilg';
const response = await fetch(
`https://api.github.com/users/${username}/events/public?per_page=100`,
{
headers: {
'Accept': 'application/vnd.github.v3+json',
'User-Agent': 'duckdb-bun-script'
}
}
);
const events = await response.json();
const processed = (Array.isArray(events) ? events : []).map((event: any) => ({
id: event.id,
type: event.type,
repo: event.repo?.name,
created_at: event.created_at,
action: event.payload?.action ?? null,
ref: event.payload?.ref ?? null,
ref_type: event.payload?.ref_type ?? null,
}));
process.stdout.write(jsonToArrowIPC(processed));
Analyze activity patterns:
-- Event type breakdown
SELECT type, COUNT(*) as count
FROM bun('github-user-events.ts', '--user=tobilg')
GROUP BY type
ORDER BY count DESC;
-- Most active repositories
SELECT repo, COUNT(*) as events
FROM bun('github-user-events.ts', '--user=tobilg')
GROUP BY repo
ORDER BY events DESC
LIMIT 10;
-- Activity by day of week
SELECT
strftime(created_at::timestamp, '%A') as day_of_week,
COUNT(*) as events
FROM bun('github-user-events.ts', '--user=tobilg')
GROUP BY day_of_week
ORDER BY events DESC;
Example 4: NPM Package Statistics
Create npm-package.ts to fetch package metadata:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const args = process.argv.slice(2);
const packageName = args.find(a => a.startsWith('--package='))?.split('=')[1] ?? 'duckdb';
const registryResponse = await fetch(`https://registry.npmjs.org/${packageName}`);
const registry = await registryResponse.json();
const versions = Object.entries(registry.time || {})
.filter(([key]) => key !== 'created' && key !== 'modified')
.map(([version, time]) => ({
package: packageName,
version,
published_at: time as string,
description: registry.description,
license: registry.license,
homepage: registry.homepage,
}));
process.stdout.write(jsonToArrowIPC(versions));
Create npm-downloads.ts for download stats:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const args = process.argv.slice(2);
const packageName = args.find(a => a.startsWith('--package='))?.split('=')[1] ?? 'duckdb';
const response = await fetch(
`https://api.npmjs.org/downloads/range/last-month/${packageName}`
);
const data = await response.json();
const downloads = data.downloads?.map((d: any) => ({
package: packageName,
date: d.day,
downloads: d.downloads,
})) ?? [];
process.stdout.write(jsonToArrowIPC(downloads));
Query package analytics:
-- Version history
SELECT version, published_at
FROM bun('npm-package.ts', '--package=apache-arrow')
ORDER BY published_at DESC
LIMIT 10;
-- Download trends
SELECT
date,
downloads,
AVG(downloads) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as rolling_avg
FROM bun('npm-downloads.ts', '--package=duckdb')
ORDER BY date;
-- Compare package downloads
SELECT 'duckdb' as package, SUM(downloads) as total
FROM bun('npm-downloads.ts', '--package=duckdb')
UNION ALL
SELECT 'better-sqlite3' as package, SUM(downloads) as total
FROM bun('npm-downloads.ts', '--package=better-sqlite3');
Example 5: Hacker News Front Page
Create hn-frontpage.ts using the official Hacker News API:
#!/usr/bin/env bun
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const args = process.argv.slice(2);
const limit = parseInt(args.find(a => a.startsWith('--limit='))?.split('=')[1] ?? '30', 10);
// Fetch top story IDs
const topStoriesResponse = await fetch('https://hacker-news.firebaseio.com/v0/topstories.json');
const topStoryIds: number[] = await topStoriesResponse.json();
// Fetch story details in parallel (limited to avoid rate limiting)
const storyIds = topStoryIds.slice(0, limit);
const stories = await Promise.all(
storyIds.map(async (id) => {
const response = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
return response.json();
})
);
const processed = stories.map((story: any, index: number) => ({
rank: index + 1,
id: story.id,
title: story.title,
url: story.url ?? null,
score: story.score,
author: story.by,
comments: story.descendants ?? 0,
created_at: new Date(story.time * 1000).toISOString(),
type: story.type,
}));
process.stdout.write(jsonToArrowIPC(processed));
Analyze the front page:
-- Current top stories
SELECT rank, title, score, comments, author
FROM bun('hn-frontpage.ts', '--limit=30')
ORDER BY rank;
-- Stories with high engagement ratio
SELECT
title,
score,
comments,
ROUND(comments::float / NULLIF(score, 0), 2) as comment_per_point
FROM bun('hn-frontpage.ts', '--limit=50')
WHERE score > 10
ORDER BY comment_per_point DESC
LIMIT 10;
-- Top domains
SELECT
regexp_extract(url, 'https?://([^/]+)', 1) as domain,
COUNT(*) as stories,
AVG(score) as avg_score
FROM bun('hn-frontpage.ts', '--limit=100')
WHERE url IS NOT NULL
GROUP BY domain
ORDER BY stories DESC
LIMIT 10;
Configuration Options
The json-to-arrow-ipc package handles various data scenarios:
import { jsonToArrowIPC } from 'json-to-arrow-ipc';
const ipc = jsonToArrowIPC(data, {
// Sample rows for schema inference (default: 100)
schemaSampleSize: 50,
// Handle schema mismatches: 'error' | 'skip' | 'coerce' (default: 'coerce')
onSchemaMismatch: 'coerce',
// Flatten nested objects with dot notation (default: false)
flattenNestedObjects: true,
// Serialize arrays as JSON strings (default: true)
serializeArrays: true,
});
Tips and Best Practices
Handle pagination: For APIs with large datasets, implement pagination in your script and use command-line arguments to control limits.
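A minimal sketch of such a pagination loop, assuming an API that takes page-number and page-size parameters (`fetchPage` is a placeholder for your actual API call, not part of any library):

```typescript
// Generic pagination loop: keeps requesting pages until the API returns
// fewer items than the page size (or a safety cap is hit), then returns
// everything at once, ready to hand to jsonToArrowIPC.
type Fetcher<T> = (page: number, perPage: number) => Promise<T[]>;

async function fetchAllPages<T>(
  fetchPage: Fetcher<T>,
  perPage = 100,
  maxPages = 10, // safety cap so a misbehaving API can't loop forever
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const items = await fetchPage(page, perPage);
    all.push(...items);
    if (items.length < perPage) break; // short page means we reached the end
  }
  return all;
}
```

Expose `perPage` and `maxPages` as command-line flags so the limits stay controllable from the SQL side via the macro's `args` parameter.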
Error handling: Add try-catch blocks and output empty arrays on error to prevent DuckDB query failures.
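One way to sketch this is a small wrapper (`safeRows` is a hypothetical helper, not from any package): errors go to stderr, stdout stays reserved for the Arrow IPC stream, and DuckDB sees an empty table instead of a failed query.

```typescript
// Wrap the data-fetching logic in try/catch and fall back to an empty
// array, so the script always produces a valid (possibly empty) result.
// The returned rows would then be passed to jsonToArrowIPC as usual.
async function safeRows<T>(fetchRows: () => Promise<T[]>): Promise<T[]> {
  try {
    return await fetchRows();
  } catch (err) {
    console.error(`fetch failed: ${err}`); // stderr only; keep stdout clean
    return [];
  }
}
```

Inside the callback, also throw on non-2xx responses (e.g. `if (!res.ok) throw new Error(\`HTTP ${res.status}\`)`) so rate-limit and auth errors take the same fallback path.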
Rate limiting: Be mindful of API rate limits. Add delays between requests when fetching many resources.
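For example, instead of a `Promise.all` that fires every request at once (as in the Hacker News script), a sequential loop with a fixed delay is gentler on the API. This is a sketch; `fetchWithDelay` is a made-up helper and `delayMs` is a knob to tune per API:

```typescript
// Resolve after the given number of milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Process items one at a time, pausing between requests to stay under
// the API's rate limit. Slower than Promise.all, but far more polite.
async function fetchWithDelay<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  delayMs = 200,
): Promise<R[]> {
  const results: R[] = [];
  for (const item of items) {
    results.push(await worker(item));
    await sleep(delayMs);
  }
  return results;
}
```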
Conclusion
With this pattern, any data source with an API becomes a SQL table. Combined with DuckDB's analytical capabilities - window functions, CTEs, JSON extraction, and more - you have a lightweight but powerful tool for ad-hoc data analysis.
The combination of Bun's TypeScript execution, automatic dependency management, and DuckDB's extensibility creates a developer experience that's both powerful and simple.
The json-to-arrow-ipc package and example scripts are available at github.com/tobilg/duckdb-bun-scripts.