<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>tobilg.com</title><description>Developer blog featuring posts about cloud computing, containers, and open source.</description><link>https://tobilg.com/</link><language>en-us</language><atom:link href="https://tobilg.com/rss.xml" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/><image><url>https://tobilg.com/images/og.png</url><title>tobilg.com</title><link>https://tobilg.com</link></image><item><title>Introducing polyglot - A Rust/Wasm SQL transpilation library</title><link>https://tobilg.com/posts/introducing-polyglot-a-rust-wasm-sql-transpilation-library/</link><guid isPermaLink="true">https://tobilg.com/posts/introducing-polyglot-a-rust-wasm-sql-transpilation-library/</guid><description>What if you could take a SQL query written for PostgreSQL and transpile it to BigQuery, Snowflake, or any of 33 database dialects — entirely in the browser, with no server round-trip? That&apos;s what polyglot does.
SQL dialect fragmentation is a real pro...</description><pubDate>Sat, 14 Feb 2026 16:56:46 GMT</pubDate><content:encoded>&lt;p&gt;What if you could take a SQL query written for PostgreSQL and transpile it to BigQuery, Snowflake, or any of &lt;strong&gt;33 database dialects&lt;/strong&gt; — entirely in the browser, with no server round-trip? That&apos;s what polyglot does.&lt;/p&gt;
&lt;p&gt;SQL dialect fragmentation is a real problem. Every database has its own flavor: &lt;code&gt;LIMIT&lt;/code&gt; vs &lt;code&gt;TOP&lt;/code&gt;, &lt;code&gt;ILIKE&lt;/code&gt; vs &lt;code&gt;LOWER() LIKE&lt;/code&gt;, &lt;code&gt;STRING&lt;/code&gt; vs &lt;code&gt;VARCHAR&lt;/code&gt; vs &lt;code&gt;TEXT&lt;/code&gt;. If you&apos;re building tools that work across databases, you end up writing dialect-specific code paths or maintaining parallel query sets. It&apos;s tedious and error-prone.&lt;/p&gt;
&lt;p&gt;polyglot tackles this head-on. It&apos;s a SQL transpiler inspired by Python&apos;s &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobymao/sqlglot&quot;&gt;sqlglot&lt;/a&gt;, but built from scratch in Rust and compilable to WebAssembly. The result is a fast, portable library that works in the browser, in Node.js, or natively in any Rust project. It has 100% test-fixture compliance with sqlglot, backed by more than 9,000 tests.&lt;/p&gt;
&lt;h2 id=&quot;heading-what-is-polyglot&quot;&gt;What is polyglot?&lt;/h2&gt;
&lt;p&gt;The architecture is straightforward: a core Rust library handles all the parsing, AST manipulation, and SQL generation. That core compiles to Wasm, which powers a TypeScript SDK (&lt;code&gt;@polyglot-sql/sdk&lt;/code&gt;) for use in web and Node.js environments.&lt;/p&gt;
&lt;p&gt;It supports &lt;strong&gt;33 SQL dialects&lt;/strong&gt;, including PostgreSQL, MySQL, BigQuery, Snowflake, DuckDB, SQLite, ClickHouse, Redshift, Spark, Trino, TSQL (SQL Server), Oracle, Databricks, Hive, Athena, Teradata, and more.&lt;/p&gt;
&lt;p&gt;The key capabilities are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transpilation&lt;/strong&gt; — convert SQL between any pair of supported dialects&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parsing&lt;/strong&gt; — turn SQL strings into a fully-typed AST&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt; — produce SQL from AST nodes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Formatting&lt;/strong&gt; — pretty-print SQL with proper indentation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt; — check SQL for syntax and semantic errors&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Builder API&lt;/strong&gt; — construct queries programmatically with a fluent interface&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-the-rust-crate&quot;&gt;The Rust Crate&lt;/h2&gt;
&lt;p&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://crates.io/crates/polyglot-sql&quot;&gt;polyglot-sql&lt;/a&gt; crate exposes a clean, high-level API. The core functions cover the most common operations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-rust&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;use&lt;/span&gt; polyglot_sql::{transpile, parse, generate, validate, DialectType};

&lt;span class=&quot;hljs-comment&quot;&gt;// Transpile from PostgreSQL to BigQuery&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; result = transpile(
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;SELECT id, name FROM users WHERE created_at &amp;gt; NOW() - INTERVAL &apos;7 days&apos;&quot;&lt;/span&gt;,
    DialectType::Postgres,
    DialectType::BigQuery,
)?;

&lt;span class=&quot;hljs-built_in&quot;&gt;println!&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&quot;{}&quot;&lt;/span&gt;, result[&lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;]);
&lt;span class=&quot;hljs-comment&quot;&gt;// SELECT id, name FROM users WHERE created_at &amp;gt; CURRENT_TIMESTAMP() - INTERVAL 7 DAY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also parse SQL into an AST, manipulate it, and generate SQL back:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-rust&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; ast = parse(&lt;span class=&quot;hljs-string&quot;&gt;&quot;SELECT 1 + 2&quot;&lt;/span&gt;, DialectType::Generic)?;
&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; sql = generate(&amp;amp;ast[&lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;], DialectType::Postgres)?;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-builder-api&quot;&gt;Builder API&lt;/h3&gt;
&lt;p&gt;For constructing queries programmatically, the fluent builder API avoids string concatenation entirely:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-rust&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;use&lt;/span&gt; polyglot_sql::builder::*;

&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; query = select([&lt;span class=&quot;hljs-string&quot;&gt;&quot;id&quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&quot;email&quot;&lt;/span&gt;])
    .from(&lt;span class=&quot;hljs-string&quot;&gt;&quot;users&quot;&lt;/span&gt;)
    .where_(col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;age&quot;&lt;/span&gt;).gte(lit(&lt;span class=&quot;hljs-number&quot;&gt;18&lt;/span&gt;)).and(col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;active&quot;&lt;/span&gt;).eq(lit(&lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;))))
    .order_by(col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;).asc())
    .limit(&lt;span class=&quot;hljs-number&quot;&gt;100&lt;/span&gt;)
    .build()?;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces a proper AST that can be generated into any dialect. The builder supports joins, subqueries, aggregations, window functions, CTEs, and set operations (&lt;code&gt;UNION&lt;/code&gt;, &lt;code&gt;INTERSECT&lt;/code&gt;, &lt;code&gt;EXCEPT&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Beyond the core API, the crate also includes scope analysis, column lineage tracking, AST traversal utilities, and optimizer passes — useful for building more sophisticated SQL tooling.&lt;/p&gt;
&lt;h2 id=&quot;heading-the-typescript-sdk&quot;&gt;The TypeScript SDK&lt;/h2&gt;
&lt;p&gt;The TypeScript SDK (&lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/@polyglot-sql/sdk&quot;&gt;@polyglot-sql/sdk&lt;/a&gt; on npm) wraps the Wasm module and provides the same capabilities in JavaScript environments. Install it from npm:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm install @polyglot-sql/sdk
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-transpilation&quot;&gt;Transpilation&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { transpile, Dialect } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&quot;@polyglot-sql/sdk&quot;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; result = transpile(
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;SELECT TOP 10 id, name FROM users WHERE name LIKE &apos;%test%&apos;&quot;&lt;/span&gt;,
  Dialect.TSQL,
  Dialect.PostgreSQL
);

&lt;span class=&quot;hljs-built_in&quot;&gt;console&lt;/span&gt;.log(result.sql);
&lt;span class=&quot;hljs-comment&quot;&gt;// SELECT id, name FROM users WHERE name LIKE &apos;%test%&apos; LIMIT 10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-builder-api-1&quot;&gt;Builder API&lt;/h3&gt;
&lt;p&gt;The TypeScript builder mirrors the Rust API:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { select, col, lit, Dialect } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&quot;@polyglot-sql/sdk&quot;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; query = select(col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;id&quot;&lt;/span&gt;), col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;), col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;email&quot;&lt;/span&gt;))
  .from(&lt;span class=&quot;hljs-string&quot;&gt;&quot;users&quot;&lt;/span&gt;)
  .where(col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;active&quot;&lt;/span&gt;).eq(lit(&lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;)))
  .orderBy(col(&lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;).asc())
  .limit(&lt;span class=&quot;hljs-number&quot;&gt;25&lt;/span&gt;)
  .toSql(Dialect.PostgreSQL);

&lt;span class=&quot;hljs-built_in&quot;&gt;console&lt;/span&gt;.log(query);
&lt;span class=&quot;hljs-comment&quot;&gt;// SELECT id, name, email FROM users WHERE active = TRUE ORDER BY name ASC LIMIT 25&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-formatting&quot;&gt;Formatting&lt;/h3&gt;
&lt;p&gt;Pretty-printing SQL is a single function call:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { format, Dialect } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&quot;@polyglot-sql/sdk&quot;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; result = format(
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;SELECT a.id, b.name, COUNT(*) FROM orders a JOIN users b ON a.user_id = b.id WHERE a.status = &apos;active&apos; GROUP BY a.id, b.name HAVING COUNT(*) &amp;gt; 5&quot;&lt;/span&gt;,
  Dialect.PostgreSQL
);

&lt;span class=&quot;hljs-built_in&quot;&gt;console&lt;/span&gt;.log(result.sql);
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  a.id,
  b.name,
  &lt;span class=&quot;hljs-keyword&quot;&gt;COUNT&lt;/span&gt;(*)
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; orders &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; a
&lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;users&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; b
  &lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt; a.user_id = b.id
&lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt;
  a.status = &lt;span class=&quot;hljs-string&quot;&gt;&apos;active&apos;&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt;
  a.id,
  b.name
&lt;span class=&quot;hljs-keyword&quot;&gt;HAVING&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;COUNT&lt;/span&gt;(*) &amp;gt; &lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For quick browser integration, the SDK can also be loaded via CDN without a build step.&lt;/p&gt;
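&lt;p&gt;As a rough sketch (not an official snippet), assuming a generic npm CDN such as esm.sh serves the package, a dynamic import could look like this; the exact CDN URL and any Wasm initialization step depend on the SDK bundle you load:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;// Sketch only: inside a &amp;lt;script type=&quot;module&quot;&amp;gt; tag; esm.sh is assumed here as a generic npm CDN.
// Depending on the bundle, an explicit async init step for the Wasm module may be required.
const { transpile, Dialect } = await import(&quot;https://esm.sh/@polyglot-sql/sdk&quot;);

const result = transpile(
  &quot;SELECT * FROM t LIMIT 10&quot;,
  Dialect.PostgreSQL,
  Dialect.TSQL
);
console.log(result.sql); // e.g. SELECT TOP 10 * FROM t
&lt;/code&gt;&lt;/pre&gt;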
&lt;h2 id=&quot;heading-integration-paths&quot;&gt;Integration Paths&lt;/h2&gt;
&lt;h3 id=&quot;heading-browser-based-sql-editors&quot;&gt;Browser-based SQL Editors&lt;/h3&gt;
&lt;p&gt;Since &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/@polyglot-sql/sdk&quot;&gt;@polyglot-sql/sdk&lt;/a&gt; can run as Wasm, it works directly in the browser with no backend. This makes it a natural fit for web-based SQL editors, data tools, and notebook interfaces that need dialect-aware transpilation or formatting.&lt;/p&gt;
&lt;h3 id=&quot;heading-cicd-amp-migration-pipelines&quot;&gt;CI/CD &amp;amp; Migration Pipelines&lt;/h3&gt;
&lt;p&gt;We can integrate the Rust crate or TypeScript SDK into build pipelines to validate SQL syntax, enforce dialect compatibility, or automatically convert queries during database migrations. The &lt;code&gt;validate()&lt;/code&gt; function catches errors early, before they hit production.&lt;/p&gt;
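&lt;p&gt;As a sketch of such a pipeline step (assuming the TypeScript SDK exposes a &lt;code&gt;validate()&lt;/code&gt; function analogous to the Rust crate&apos;s, and that it throws on invalid SQL; the exact result shape may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;import { validate, Dialect } from &quot;@polyglot-sql/sdk&quot;;

// Hypothetical list of queries collected by the CI pipeline.
const queries = [
  &quot;SELECT id, name FROM users WHERE active = TRUE&quot;,
  &quot;SELECT FROM users WHERE&quot;, // intentionally broken
];

let failed = 0;
for (const sql of queries) {
  try {
    // Assumption: validate() throws for syntactically invalid SQL.
    validate(sql, Dialect.PostgreSQL);
  } catch (err) {
    failed++;
    console.error(`Invalid SQL: ${sql}`, err);
  }
}

// Fail the build if any query is invalid.
if (failed &amp;gt; 0) process.exit(1);
&lt;/code&gt;&lt;/pre&gt;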
&lt;h3 id=&quot;heading-multi-database-orms-amp-query-builders&quot;&gt;Multi-database ORMs &amp;amp; Query Builders&lt;/h3&gt;
&lt;p&gt;The builder API can serve as a dialect-aware SQL backend for ORMs or custom query builders. Construct queries once using the fluent interface, then generate dialect-specific SQL at runtime.&lt;/p&gt;
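&lt;p&gt;For example, with the builder calls shown above, a query can be defined once and rendered for different targets by calling &lt;code&gt;toSql()&lt;/code&gt; with different dialects (a sketch; error handling omitted):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;import { select, col, lit, Dialect } from &quot;@polyglot-sql/sdk&quot;;

// Build the query once, independent of the target database.
const activeUsers = () =&amp;gt;
  select(col(&quot;id&quot;), col(&quot;email&quot;))
    .from(&quot;users&quot;)
    .where(col(&quot;active&quot;).eq(lit(true)))
    .limit(50);

// Render dialect-specific SQL at runtime, e.g. based on configuration.
console.log(activeUsers().toSql(Dialect.PostgreSQL));
console.log(activeUsers().toSql(Dialect.TSQL));
&lt;/code&gt;&lt;/pre&gt;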
&lt;h3 id=&quot;heading-data-catalogs-amp-documentation-tools&quot;&gt;Data Catalogs &amp;amp; Documentation Tools&lt;/h3&gt;
&lt;p&gt;Parsing SQL into a full AST enables extraction of table references, column lineage, and query structure. This is useful for data catalogs, documentation generators, and governance tools that need to understand what a query touches.&lt;/p&gt;
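&lt;p&gt;As a minimal sketch (assuming the SDK exposes &lt;code&gt;parse()&lt;/code&gt; like the Rust crate does, and that the returned AST is JSON-serializable), you can inspect the parsed structure; a real lineage tool would walk the tree and collect table and column nodes instead of just printing it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;import { parse, Dialect } from &quot;@polyglot-sql/sdk&quot;;

const ast = parse(
  &quot;SELECT o.id, u.name FROM orders o JOIN users u ON o.user_id = u.id&quot;,
  Dialect.PostgreSQL
);

// Inspect the parsed structure; node and property names depend on the SDK&apos;s AST shape.
console.log(JSON.stringify(ast, null, 2));
&lt;/code&gt;&lt;/pre&gt;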
&lt;h3 id=&quot;heading-cli-tools-amp-developer-utilities&quot;&gt;CLI Tools &amp;amp; Developer Utilities&lt;/h3&gt;
&lt;p&gt;The Rust crate can power command-line tools for SQL formatting, linting, or batch transpilation. It&apos;s fast enough for interactive use and handles large query files without issues.&lt;/p&gt;
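&lt;p&gt;As an illustration, here is a tiny batch-transpilation script using the TypeScript SDK and plain Node.js file APIs (a sketch: the file paths are hypothetical, and it assumes one statement per file):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;import { readFileSync, writeFileSync } from &quot;node:fs&quot;;
import { transpile, Dialect } from &quot;@polyglot-sql/sdk&quot;;

// Hypothetical input/output paths passed on the command line.
const [, , inFile, outFile] = process.argv;

const source = readFileSync(inFile, &quot;utf-8&quot;);
const result = transpile(source, Dialect.TSQL, Dialect.PostgreSQL);

writeFileSync(outFile, result.sql);
console.log(`Transpiled ${inFile} -&amp;gt; ${outFile}`);
&lt;/code&gt;&lt;/pre&gt;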
&lt;h2 id=&quot;heading-use-cases&quot;&gt;Use Cases&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database migration&lt;/strong&gt;: Converting thousands of queries from Oracle to PostgreSQL (or any other dialect pair) without manual rewrites&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-cloud analytics&lt;/strong&gt;: Writing SQL once and transpiling to BigQuery, Snowflake, or Redshift depending on the target warehouse&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL formatting &amp;amp; linting&lt;/strong&gt;: Enforcing consistent SQL style across a codebase with pretty-printing&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query analysis &amp;amp; lineage&lt;/strong&gt;: Parsing SQL to track which columns flow through transformations and which tables are referenced&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Educational tools&lt;/strong&gt;: Showing how the same query looks across different SQL dialects, side by side&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-try-it-out&quot;&gt;Try It Out&lt;/h2&gt;
&lt;p&gt;The fastest way to see polyglot in action is the &lt;a target=&quot;_blank&quot; href=&quot;https://polyglot-playground.gh.tobilg.com/&quot;&gt;playground&lt;/a&gt;, where you can transpile and format SQL directly in the browser.&lt;/p&gt;
&lt;p&gt;To get started in a JavaScript or TypeScript project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm install @polyglot-sql/sdk
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { transpile, Dialect } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&quot;@polyglot-sql/sdk&quot;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; result = transpile(&lt;span class=&quot;hljs-string&quot;&gt;&quot;SELECT * FROM t LIMIT 10&quot;&lt;/span&gt;, Dialect.PostgreSQL, Dialect.TSQL);
&lt;span class=&quot;hljs-built_in&quot;&gt;console&lt;/span&gt;.log(result.sql); &lt;span class=&quot;hljs-comment&quot;&gt;// SELECT TOP 10 * FROM t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Rust projects, add the crate to your &lt;code&gt;Cargo.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-toml&quot;&gt;&lt;span class=&quot;hljs-section&quot;&gt;[dependencies]&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;polyglot-sql&lt;/span&gt; = &lt;span class=&quot;hljs-string&quot;&gt;&quot;0.1&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can find the source code on &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/polyglot&quot;&gt;GitHub&lt;/a&gt;, the Rust docs on &lt;a target=&quot;_blank&quot; href=&quot;https://docs.rs/polyglot-sql/latest/polyglot_sql/&quot;&gt;docs.rs&lt;/a&gt;, the TypeScript docs at &lt;a target=&quot;_blank&quot; href=&quot;https://polyglot.gh.tobilg.com/&quot;&gt;polyglot.gh.tobilg.com&lt;/a&gt;, and the npm package at &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/@polyglot-sql/sdk&quot;&gt;@polyglot-sql/sdk&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;polyglot brings SQL transpilation to any environment — browser, server, or CLI — without depending on a Python runtime or external service. With &lt;strong&gt;33 supported dialects&lt;/strong&gt;, a fluent builder API, and full parsing/generation capabilities, it covers a wide range of SQL tooling needs.&lt;/p&gt;
&lt;p&gt;The project is open source and actively developed. If you&apos;re working on SQL tooling, multi-database support, or migration pipelines, give it a try and let us know how it goes. Contributions, bug reports, and feature requests are all welcome on &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/polyglot&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>Rust</category><category>SQL</category><category>transpiler</category><category>wasm</category><category>data-engineering</category><author>Tobias Müller</author></item><item><title>Custom DuckDB Wasm builds for Cloudflare Workers</title><link>https://tobilg.com/posts/custom-duckdb-wasm-builds-for-cloudflare-workers/</link><guid isPermaLink="true">https://tobilg.com/posts/custom-duckdb-wasm-builds-for-cloudflare-workers/</guid><description>What if you could run full SQL queries, including JOINs, aggregations, and even remote Parquet file reads, directly inside a Cloudflare Worker? No database server, no connection pool, no cold-start latency from external services. Just DuckDB, compile...</description><pubDate>Tue, 27 Jan 2026 19:47:43 GMT</pubDate><content:encoded>&lt;p&gt;What if you could run full SQL queries, including JOINs, aggregations, and even remote Parquet file reads, directly inside a Cloudflare Worker? No database server, no connection pool, no cold-start latency from external services. Just DuckDB, compiled to WebAssembly, running at the edge.&lt;/p&gt;
&lt;p&gt;That&apos;s now possible with &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/ducklings&quot;&gt;Ducklings&lt;/a&gt;, a minimal DuckDB WASM build I created specifically for browsers and serverless environments. In this post, I&apos;ll focus on the &lt;code&gt;@ducklings/workers&lt;/code&gt; package, which makes it possible to deploy DuckDB to Cloudflare Workers for the first time (to my knowledge).&lt;/p&gt;
&lt;h2 id=&quot;heading-the-challenge-duckdb-in-resource-limited-serverless-environments-like-workers&quot;&gt;The challenge: DuckDB in resource-limited serverless environments like Workers&lt;/h2&gt;
&lt;p&gt;DuckDB has had a WASM build for a while through &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-wasm&quot;&gt;duckdb-wasm&lt;/a&gt;, but it was designed for browsers. Cloudflare Workers are a different beast: they don&apos;t have access to &lt;code&gt;XMLHttpRequest&lt;/code&gt;, &lt;code&gt;SharedArrayBuffer&lt;/code&gt;, or threads. Most critically, the Workers runtime doesn&apos;t support synchronous I/O: everything must go through async &lt;code&gt;fetch()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This matters because DuckDB&apos;s httpfs extension, the one that lets you query remote Parquet, CSV, and JSON files over HTTP, relies on synchronous HTTP calls internally. In a browser, that works via &lt;code&gt;XMLHttpRequest&lt;/code&gt;. In a Worker, it simply doesn&apos;t.&lt;/p&gt;
&lt;p&gt;The solution is &lt;a target=&quot;_blank&quot; href=&quot;https://emscripten.org/docs/porting/asyncify.html&quot;&gt;Emscripten&apos;s Asyncify&lt;/a&gt;, a compile-time transformation that lets synchronous C/C++ code call asynchronous JavaScript functions. Ducklings uses Asyncify to bridge DuckDB&apos;s synchronous HTTP calls to the Workers &lt;code&gt;fetch()&lt;/code&gt; API, making httpfs work transparently.&lt;/p&gt;
&lt;h2 id=&quot;heading-what-ducklings-provides&quot;&gt;What Ducklings provides&lt;/h2&gt;
&lt;p&gt;Ducklings is a from-scratch WASM build of DuckDB, optimized for minimal size:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Package&lt;/td&gt;&lt;td&gt;Size (gzipped)&lt;/td&gt;&lt;td&gt;API&lt;/td&gt;&lt;td&gt;HTTP transport&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@ducklings/workers&lt;/code&gt;&lt;/td&gt;&lt;td&gt;~9.6 MB&lt;/td&gt;&lt;td&gt;Async&lt;/td&gt;&lt;td&gt;fetch() via Asyncify&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;@ducklings/workers&lt;/code&gt; package includes the Parquet, JSON, and httpfs extensions compiled in statically. There is no runtime extension loading: just what you need, baked into the binary.&lt;/p&gt;
&lt;p&gt;The Workers build is larger because Asyncify adds instrumentation to every function in the call chain between DuckDB&apos;s HTTP layer and the JavaScript &lt;code&gt;fetch()&lt;/code&gt; call. That&apos;s the price for making synchronous C++ code work in an async-only environment.&lt;/p&gt;
&lt;h2 id=&quot;heading-getting-started&quot;&gt;Getting started&lt;/h2&gt;
&lt;h3 id=&quot;heading-prerequisites&quot;&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;You&apos;ll need a &lt;strong&gt;Cloudflare Workers paid plan&lt;/strong&gt; ($5/month). The DuckDB WASM binary is ~9.6 MB gzipped, which exceeds the free tier&apos;s 3 MB code size limit. The paid plan allows up to 10 MB.&lt;/p&gt;
&lt;h3 id=&quot;heading-project-setup&quot;&gt;Project setup&lt;/h3&gt;
&lt;p&gt;Create a new project and install dependencies:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;mkdir duckdb-worker &amp;amp;&amp;amp; &lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; duckdb-worker
npm init -y
npm install @ducklings/workers
npm install -D @cloudflare/vite-plugin vite wrangler typescript @cloudflare/workers-types
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-configuration-files&quot;&gt;Configuration files&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;wrangler.jsonc&lt;/strong&gt; - Cloudflare Workers configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;$schema&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;node_modules/wrangler/config-schema.json&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;duckdb-worker&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;main&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;src/index.ts&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;compatibility_date&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2026-01-11&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;compatibility_flags&quot;&lt;/span&gt;: [&lt;span class=&quot;hljs-string&quot;&gt;&quot;nodejs_compat&quot;&lt;/span&gt;]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;nodejs_compat&lt;/code&gt; flag is required because Emscripten&apos;s glue code uses some Node.js APIs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;vite.config.ts&lt;/strong&gt; - Build configuration with the Ducklings Vite plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { defineConfig } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;vite&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { cloudflare } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;@cloudflare/vite-plugin&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { ducklingsWorkerPlugin } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;@ducklings/workers/vite-plugin&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;default&lt;/span&gt; defineConfig({
  plugins: [
    ducklingsWorkerPlugin(),
    cloudflare(),
  ],
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;ducklingsWorkerPlugin()&lt;/code&gt; handles resolving the WASM module import so that Vite and Wrangler can bundle it correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tsconfig.json&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;compilerOptions&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;target&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;ES2022&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;module&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;ESNext&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;moduleResolution&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;bundler&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;strict&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;skipLibCheck&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;types&quot;&lt;/span&gt;: [&lt;span class=&quot;hljs-string&quot;&gt;&quot;@cloudflare/workers-types/2026-01-11&quot;&lt;/span&gt;],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;jsx&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;react-jsx&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;lib&quot;&lt;/span&gt;: [&lt;span class=&quot;hljs-string&quot;&gt;&quot;ES2022&quot;&lt;/span&gt;]
  },
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;include&quot;&lt;/span&gt;: [&lt;span class=&quot;hljs-string&quot;&gt;&quot;src&quot;&lt;/span&gt;]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;package.json&lt;/strong&gt; scripts:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;module&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;scripts&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;dev&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;vite dev&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;build&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;vite build&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;deploy&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;vite build &amp;amp;&amp;amp; wrangler deploy&quot;&lt;/span&gt;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-writing-the-worker&quot;&gt;Writing the Worker&lt;/h2&gt;
&lt;p&gt;Here&apos;s a complete Worker that exposes a SQL query API:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;src/index.ts&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; {
  init,
  DuckDB,
  version,
  tableToIPC,
  sanitizeSql,
  DuckDBError,
  &lt;span class=&quot;hljs-keyword&quot;&gt;type&lt;/span&gt; Connection,
} &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;@ducklings/workers&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; wasmModule &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;@ducklings/workers/wasm&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; db: DuckDB | &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt; = &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; conn: Connection | &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt; = &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; initialized = &lt;span class=&quot;hljs-literal&quot;&gt;false&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;hljs-title&quot;&gt;ensureInitialized&lt;/span&gt;(&lt;span class=&quot;hljs-params&quot;&gt;&lt;/span&gt;): &lt;span class=&quot;hljs-title&quot;&gt;Promise&lt;/span&gt;&amp;lt;&lt;span class=&quot;hljs-title&quot;&gt;void&lt;/span&gt;&amp;gt; &lt;/span&gt;{
  &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; (initialized &amp;amp;&amp;amp; db &amp;amp;&amp;amp; conn) &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt;;

  &lt;span class=&quot;hljs-comment&quot;&gt;// Initialize the WASM module&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; init({ wasmModule });

  &lt;span class=&quot;hljs-comment&quot;&gt;// Create database and connection&lt;/span&gt;
  db = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; DuckDB();
  conn = db.connect();

  &lt;span class=&quot;hljs-comment&quot;&gt;// Create sample data&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.execute(&lt;span class=&quot;hljs-string&quot;&gt;`
    CREATE TABLE readings (
      sensor_id INTEGER,
      temperature DOUBLE,
      humidity DOUBLE,
      recorded_at TIMESTAMP
    )
  `&lt;/span&gt;);

  &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.execute(&lt;span class=&quot;hljs-string&quot;&gt;`
    INSERT INTO readings VALUES
      (1, 22.5, 45.0, &apos;2026-01-27 10:00:00&apos;),
      (1, 23.1, 44.2, &apos;2026-01-27 11:00:00&apos;),
      (2, 19.8, 62.1, &apos;2026-01-27 10:00:00&apos;),
      (2, 20.3, 60.5, &apos;2026-01-27 11:00:00&apos;)
  `&lt;/span&gt;);

  initialized = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;;
}

&lt;span class=&quot;hljs-keyword&quot;&gt;export&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;default&lt;/span&gt; {
  &lt;span class=&quot;hljs-keyword&quot;&gt;async&lt;/span&gt; fetch(request: Request): &lt;span class=&quot;hljs-built_in&quot;&gt;Promise&lt;/span&gt;&amp;lt;Response&amp;gt; {
    &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; ensureInitialized();

    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; url = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; URL(request.url);

    &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; (url.pathname === &lt;span class=&quot;hljs-string&quot;&gt;&apos;/query&apos;&lt;/span&gt; &amp;amp;&amp;amp; request.method === &lt;span class=&quot;hljs-string&quot;&gt;&apos;POST&apos;&lt;/span&gt;) {
      &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { sql } = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; request.json() &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; { sql: &lt;span class=&quot;hljs-built_in&quot;&gt;string&lt;/span&gt; };

      &lt;span class=&quot;hljs-comment&quot;&gt;// Block dangerous queries&lt;/span&gt;
      &lt;span class=&quot;hljs-keyword&quot;&gt;try&lt;/span&gt; {
        sanitizeSql(sql);
      } &lt;span class=&quot;hljs-keyword&quot;&gt;catch&lt;/span&gt; (e) {
        &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; (e &lt;span class=&quot;hljs-keyword&quot;&gt;instanceof&lt;/span&gt; DuckDBError) {
          &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; Response.json({ error: e.message }, { status: &lt;span class=&quot;hljs-number&quot;&gt;400&lt;/span&gt; });
        }
        &lt;span class=&quot;hljs-keyword&quot;&gt;throw&lt;/span&gt; e;
      }

      &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; rows = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn!.query(sql);
      &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; Response.json({ data: rows, rowCount: rows.length });
    }

    &lt;span class=&quot;hljs-keyword&quot;&gt;if&lt;/span&gt; (url.pathname === &lt;span class=&quot;hljs-string&quot;&gt;&apos;/stats&apos;&lt;/span&gt;) {
      &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; stats = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn!.query(&lt;span class=&quot;hljs-string&quot;&gt;`
        SELECT
          sensor_id,
          AVG(temperature) AS avg_temp,
          AVG(humidity) AS avg_humidity,
          COUNT(*) AS reading_count
        FROM readings
        GROUP BY sensor_id
      `&lt;/span&gt;);
      &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; Response.json({ data: stats });
    }

    &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; Response.json({
      name: &lt;span class=&quot;hljs-string&quot;&gt;&apos;DuckDB Worker&apos;&lt;/span&gt;,
      version: version(),
      endpoints: {
        &lt;span class=&quot;hljs-string&quot;&gt;&apos;/query&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;POST SQL queries&apos;&lt;/span&gt;,
        &lt;span class=&quot;hljs-string&quot;&gt;&apos;/stats&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;GET sensor statistics&apos;&lt;/span&gt;,
      },
    });
  },
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A few things to note about the Workers API:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;init()&lt;/code&gt; takes a pre-compiled WASM module: In Workers, you import the WASM binary directly and pass it to &lt;code&gt;init()&lt;/code&gt;. There&apos;s no fetch step; the module is bundled into the Worker.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;All query methods are async&lt;/strong&gt;: &lt;code&gt;query()&lt;/code&gt;, &lt;code&gt;execute()&lt;/code&gt;, &lt;code&gt;queryArrow()&lt;/code&gt; all return Promises. This is because of Asyncify: under the hood, DuckDB&apos;s C++ code may need to pause and wait for an async &lt;code&gt;fetch()&lt;/code&gt; call.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sanitizeSql()&lt;/code&gt; is built in: It blocks &lt;code&gt;duckdb_secrets()&lt;/code&gt;, &lt;code&gt;PRAGMA&lt;/code&gt;, &lt;code&gt;COPY ... TO&lt;/code&gt;, and &lt;code&gt;EXPORT DATABASE&lt;/code&gt; patterns. Use it when accepting SQL from untrusted sources.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-querying-remote-files&quot;&gt;Querying remote files&lt;/h2&gt;
&lt;p&gt;This is where it gets interesting. Because httpfs works in Workers, you can query remote Parquet, CSV, and JSON files directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;// Remote Parquet file&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; data = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.query(&lt;span class=&quot;hljs-string&quot;&gt;`
  SELECT *
  FROM &apos;https://example.com/data.parquet&apos;
  LIMIT 100
`&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;// Remote CSV&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; csv = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.query(&lt;span class=&quot;hljs-string&quot;&gt;`
  SELECT *
  FROM read_csv(&apos;https://example.com/data.csv&apos;)
`&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;// Remote JSON&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; json = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.query(&lt;span class=&quot;hljs-string&quot;&gt;`
  SELECT *
  FROM read_json(&apos;https://example.com/data.json&apos;)
`&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No additional configuration is needed, as httpfs is compiled in and initialized automatically.&lt;/p&gt;
&lt;h2 id=&quot;heading-accessing-s3-and-r2-storage&quot;&gt;Accessing S3 and R2 storage&lt;/h2&gt;
&lt;p&gt;For private data, you can configure secrets to access S3-compatible storage. Here&apos;s how to set up Cloudflare R2:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;interface&lt;/span&gt; Env {
  R2_ACCESS_KEY_ID: &lt;span class=&quot;hljs-built_in&quot;&gt;string&lt;/span&gt;;
  R2_SECRET_ACCESS_KEY: &lt;span class=&quot;hljs-built_in&quot;&gt;string&lt;/span&gt;;
  R2_ACCOUNT_ID: &lt;span class=&quot;hljs-built_in&quot;&gt;string&lt;/span&gt;;
}

&lt;span class=&quot;hljs-keyword&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;hljs-title&quot;&gt;ensureInitialized&lt;/span&gt;(&lt;span class=&quot;hljs-params&quot;&gt;env: Env&lt;/span&gt;): &lt;span class=&quot;hljs-title&quot;&gt;Promise&lt;/span&gt;&amp;lt;&lt;span class=&quot;hljs-title&quot;&gt;void&lt;/span&gt;&amp;gt; &lt;/span&gt;{
  &lt;span class=&quot;hljs-comment&quot;&gt;// ... init code ...&lt;/span&gt;

  &lt;span class=&quot;hljs-comment&quot;&gt;// Create R2 secret for accessing private buckets&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.execute(&lt;span class=&quot;hljs-string&quot;&gt;`
    CREATE SECRET r2 (
      TYPE R2,
      KEY_ID &apos;&lt;span class=&quot;hljs-subst&quot;&gt;${env.R2_ACCESS_KEY_ID}&lt;/span&gt;&apos;,
      SECRET &apos;&lt;span class=&quot;hljs-subst&quot;&gt;${env.R2_SECRET_ACCESS_KEY}&lt;/span&gt;&apos;,
      ACCOUNT_ID &apos;&lt;span class=&quot;hljs-subst&quot;&gt;${env.R2_ACCOUNT_ID}&lt;/span&gt;&apos;
    )
  `&lt;/span&gt;);
}

&lt;span class=&quot;hljs-comment&quot;&gt;// Then query R2 files directly&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; data = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.query(&lt;span class=&quot;hljs-string&quot;&gt;`
  SELECT * FROM &apos;r2://my-bucket/data/events.parquet&apos;
`&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set the secrets via Wrangler:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;wrangler secret put R2_ACCESS_KEY_ID
wrangler secret put R2_SECRET_ACCESS_KEY
wrangler secret put R2_ACCOUNT_ID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The same approach works for AWS S3 (&lt;code&gt;TYPE S3&lt;/code&gt;) and Google Cloud Storage (&lt;code&gt;TYPE GCS&lt;/code&gt;).&lt;/p&gt;
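&lt;p&gt;For example, an S3 secret follows the same pattern as the R2 secret above (a sketch: the &lt;code&gt;env&lt;/code&gt; bindings and bucket path are hypothetical, and the key names follow DuckDB&apos;s &lt;code&gt;CREATE SECRET&lt;/code&gt; syntax):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;// Sketch: same pattern as the R2 secret above, but for AWS S3.
await conn.execute(`
  CREATE SECRET s3 (
    TYPE S3,
    KEY_ID &apos;${env.AWS_ACCESS_KEY_ID}&apos;,
    SECRET &apos;${env.AWS_SECRET_ACCESS_KEY}&apos;,
    REGION &apos;us-east-1&apos;
  )
`);

// Then query a (hypothetical) private S3 path directly.
const events = await conn.query(`
  SELECT COUNT(*) AS cnt FROM &apos;s3://my-bucket/data/events.parquet&apos;
`);
&lt;/code&gt;&lt;/pre&gt;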
&lt;h2 id=&quot;heading-returning-arrow-ipc&quot;&gt;Returning Arrow IPC&lt;/h2&gt;
&lt;p&gt;If your client understands Apache Arrow, you can return results in Arrow IPC format for efficient data transfer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { tableToIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;@ducklings/workers&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;// In your request handler:&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; table = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.queryArrow(&lt;span class=&quot;hljs-string&quot;&gt;&apos;SELECT * FROM readings&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; ipcBytes = tableToIPC(table, { format: &lt;span class=&quot;hljs-string&quot;&gt;&apos;stream&apos;&lt;/span&gt; });

&lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; Response(ipcBytes, {
  headers: {
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;application/vnd.apache.arrow.stream&apos;&lt;/span&gt;,
  },
});
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Consumers can read the result with any Arrow-compatible library:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-python&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# Python&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; pyarrow &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; pa
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; requests

response = requests.get(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://your-worker.dev/arrow&apos;&lt;/span&gt;)
reader = pa.ipc.open_stream(response.content)
table = reader.read_all()
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;// JavaScript (Flechette)&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { tableFromIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;@ducklings/workers&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; response = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://your-worker.dev/arrow&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; bytes = &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;Uint8Array&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; response.arrayBuffer());
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; table = tableFromIPC(bytes);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-deploying&quot;&gt;Deploying&lt;/h2&gt;
&lt;p&gt;With everything in place, deploy to Cloudflare:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# Local development&lt;/span&gt;
npm run dev

&lt;span class=&quot;hljs-comment&quot;&gt;# Deploy to production&lt;/span&gt;
npm run deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s it. Wrangler handles bundling the WASM binary into the Worker.&lt;/p&gt;
&lt;p&gt;Test it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# Check the API&lt;/span&gt;
curl https://duckdb-worker.&amp;lt;your-subdomain&amp;gt;.workers.dev/

&lt;span class=&quot;hljs-comment&quot;&gt;# Run a query&lt;/span&gt;
curl -X POST https://duckdb-worker.&amp;lt;your-subdomain&amp;gt;.workers.dev/query \
  -H &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type: application/json&apos;&lt;/span&gt; \
  -d &lt;span class=&quot;hljs-string&quot;&gt;&apos;{&quot;sql&quot;: &quot;SELECT 42 AS answer&quot;}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-prepared-statements&quot;&gt;Prepared statements&lt;/h2&gt;
&lt;p&gt;For parameterized queries, use prepared statements to avoid SQL injection:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; stmt = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.prepare(
  &lt;span class=&quot;hljs-string&quot;&gt;&apos;SELECT * FROM readings WHERE sensor_id = ? AND temperature &amp;gt; ?&apos;&lt;/span&gt;
);
stmt.bindInt32(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;);
stmt.bindDouble(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;22.0&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; results = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; stmt.run();
&lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; stmt.close();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Available binding methods: &lt;code&gt;bindBoolean&lt;/code&gt;, &lt;code&gt;bindInt32&lt;/code&gt;, &lt;code&gt;bindInt64&lt;/code&gt;, &lt;code&gt;bindFloat&lt;/code&gt;, &lt;code&gt;bindDouble&lt;/code&gt;, &lt;code&gt;bindString&lt;/code&gt;, &lt;code&gt;bindBlob&lt;/code&gt;, &lt;code&gt;bindNull&lt;/code&gt;, &lt;code&gt;bindDate&lt;/code&gt;, &lt;code&gt;bindTimestamp&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-streaming-large-results&quot;&gt;Streaming large results&lt;/h2&gt;
&lt;p&gt;For large result sets, avoid loading everything into memory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; stream = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; conn.queryStreaming(&lt;span class=&quot;hljs-string&quot;&gt;&apos;SELECT * FROM large_table&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; (&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; chunk &lt;span class=&quot;hljs-keyword&quot;&gt;of&lt;/span&gt; stream) {
  &lt;span class=&quot;hljs-comment&quot;&gt;// Process each chunk individually&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; (&lt;span class=&quot;hljs-keyword&quot;&gt;let&lt;/span&gt; row = &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;; row &amp;lt; chunk.rowCount; row++) {
    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; id = chunk.getInt32(row, &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;);
    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; name = chunk.getString(row, &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;);
    &lt;span class=&quot;hljs-comment&quot;&gt;// ...&lt;/span&gt;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-limitations&quot;&gt;Limitations&lt;/h2&gt;
&lt;p&gt;A few things to be aware of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt;: Workers have a 128 MB memory limit. DuckDB itself uses a portion of that, so you&apos;ll want to keep datasets and query results reasonably sized.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WASM size&lt;/strong&gt;: The ~9.6 MB gzipped binary requires a paid Workers plan ($5/month). The free tier caps at 3 MB.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No dynamic extension loading&lt;/strong&gt;: Only Parquet, JSON, and httpfs are available. You can&apos;t &lt;code&gt;INSTALL&lt;/code&gt; or &lt;code&gt;LOAD&lt;/code&gt; other extensions at runtime.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-threaded&lt;/strong&gt;: No parallel query execution. Workers are single-threaded by nature, and the WASM build has threading disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In-memory only&lt;/strong&gt;: No persistent storage. Each new Worker instance starts fresh. Use httpfs to read external data, or populate tables on init.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-wrapping-up&quot;&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;Ducklings makes it possible to run DuckDB at the edge on Cloudflare Workers, something that wasn&apos;t feasible before due to the async I/O constraint. The &lt;code&gt;@ducklings/workers&lt;/code&gt; package handles the Asyncify plumbing, provides a TypeScript API, and includes httpfs, Parquet, and JSON extensions out of the box.&lt;/p&gt;
&lt;p&gt;This opens up some interesting use cases: SQL APIs without a database server, edge data transformations, Parquet-to-JSON converters, analytics endpoints that query directly from object storage, or lightweight ETL pipelines running globally.&lt;/p&gt;
&lt;p&gt;The project is open source at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/ducklings&quot;&gt;github.com/tobilg/ducklings&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/ducklings&quot;&gt;Ducklings GitHub repository&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/@ducklings/workers&quot;&gt;&lt;code&gt;@ducklings/workers&lt;/code&gt; on npm&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://ducklings-api.serverless-duckdb.com&quot;&gt;Ducklings API documentation&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://emscripten.org/docs/porting/asyncify.html&quot;&gt;Emscripten Asyncify&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/workers/&quot;&gt;Cloudflare Workers documentation&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/&quot;&gt;DuckDB documentation&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>duckDB</category><category>cloudflare-worker</category><category>wasm</category><category>async</category><author>Tobias Müller</author></item><item><title>Using Iceberg Catalogs in the Browser with DuckDB-Wasm</title><link>https://tobilg.com/posts/using-iceberg-catalogs-in-the-browser-with-duckdb-wasm/</link><guid isPermaLink="true">https://tobilg.com/posts/using-iceberg-catalogs-in-the-browser-with-duckdb-wasm/</guid><description>With recent updates of DuckDB itself, DuckDB-Wasm and the Iceberg extension, it is now possible to query Iceberg catalogs directly from the browser, with no backend involved.
Example clients that work:

SQL Workbench

SQL Workbench Embedded

DuckDB T...</description><pubDate>Tue, 16 Dec 2025 15:42:18 GMT</pubDate><content:encoded>&lt;p&gt;With recent updates of &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org&quot;&gt;DuckDB&lt;/a&gt; itself, &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/@duckdb/duckdb-wasm&quot;&gt;DuckDB-Wasm&lt;/a&gt; and the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-iceberg&quot;&gt;Iceberg extension&lt;/a&gt;, it is now possible to query Iceberg catalogs directly from the browser, with no backend involved.&lt;/p&gt;
&lt;p&gt;Example clients that work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;SQL Workbench&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://embedded.sql-workbench.com&quot;&gt;SQL Workbench Embedded&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://terminal.sql-workbench.com&quot;&gt;DuckDB Terminal&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://shell.duckdb.org&quot;&gt;DuckDB Shell&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;Be aware that the credentials you’re using should not be shared. The examples below are for demo purposes, not for public or production usage. It’s possible to set this up in a safe manner though, for example by granting a fine-grained AWS IAM role to users of a frontend application when they log in via Cognito. This grants the users temporary permissions to use the respective services.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;heading-disable-the-built-in-http-client-in-duckdb-wasm&quot;&gt;Disable the built-in http client in DuckDB-Wasm&lt;/h2&gt;
&lt;p&gt;To query Iceberg data remotely, we first need to disable the internal HTTP client of DuckDB-Wasm and use the official httpfs extension build instead:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Load the httpfs extension; this disables the built-in http client&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LOAD&lt;/span&gt; httpfs;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-load-the-iceberg-extension&quot;&gt;Load the Iceberg extension&lt;/h2&gt;
&lt;p&gt;You can then load the Iceberg extension:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Load Iceberg extension&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LOAD&lt;/span&gt; iceberg;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-query-iceberg-data-in-s3-tables&quot;&gt;Query Iceberg data in S3 Tables&lt;/h2&gt;
&lt;p&gt;I published another blog post, &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/posts/query-s3-tables-with-duckdb/&quot;&gt;“Query S3 Tables with DuckDB”&lt;/a&gt;, in Q1/2025 that outlines the basic setup of an S3 Tables bucket, as well as a namespace and an actual table; please refer to it when setting this up in your environment.&lt;/p&gt;
&lt;p&gt;So, if you’d like to attach an S3 Tables bucket as a database in DuckDB-Wasm, you can do so as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;ATTACH &apos;arn:aws:s3tables:us-east-1:12345678912:bucket/duckdb-test&apos;
  AS test_db (
    TYPE iceberg,
    ENDPOINT_TYPE s3_tables
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The attach target is the ARN of the S3 Tables bucket.&lt;/p&gt;
&lt;p&gt;You can then check which tables are available:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SHOW&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLES&lt;/span&gt;;

┌──────────────┬─────────┬─────────────┬─────────────────────────────────────────────┬─────────────────────────┬───────────┐
│   database   │ schema  │    name     │                column_names                 │      column_types       │ temporary │
│   varchar    │ varchar │   varchar   │                  varchar[]                  │        varchar[]        │  boolean  │
├──────────────┼─────────┼─────────────┼─────────────────────────────────────────────┼─────────────────────────┼───────────┤
│ test_db      │ test    │ daily_sales │ [sale_date, product_category, sales_amount] │ [DATE, VARCHAR, DOUBLE] │ false     │
└──────────────┴─────────┴─────────────┴─────────────────────────────────────────────┴─────────────────────────┴───────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To query the table, just use normal SQL statements, as you would with any other local table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; test_db.test.daily_sales;
┌────────────┬──────────────────┬──────────────┐
│ sale_date  │ product_category │ sales_amount │
│    date    │     varchar      │    double    │
├────────────┼──────────────────┼──────────────┤
│ 2024-01-15 │ Laptop           │        900.0 │
│ 2024-01-15 │ Monitor          │        250.0 │
│ 2024-01-16 │ Laptop           │       1350.0 │
│ 2024-02-01 │ Monitor          │        300.0 │
│ 2024-02-01 │ Keyboard         │         60.0 │
│ 2024-02-02 │ Mouse            │         25.0 │
│ 2024-02-02 │ Laptop           │       1050.0 │
│ 2024-02-03 │ Laptop           │       1200.0 │
│ 2024-02-03 │ Monitor          │        375.0 │
└────────────┴──────────────────┴──────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-query-iceberg-data-in-r2-data-catalog-r2&quot;&gt;Query Iceberg data in R2 Data Catalog / R2&lt;/h2&gt;
&lt;p&gt;There are a few steps to set up an R2 bucket for use as an Iceberg catalog.&lt;/p&gt;
&lt;h3 id=&quot;heading-creating-a-r2-bucket&quot;&gt;Creating an R2 bucket&lt;/h3&gt;
&lt;p&gt;To create an R2 bucket, go to your account’s Cloudflare Dashboard, and in the left sidebar click on “Storage &amp;amp; databases” → “R2 Object Storage” and then on “Overview”:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765735621132/6a9dd988-3d62-41e9-a666-3b883e8951e6.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;After that, click on “Create bucket” in the top right corner. You’ll see the next screen below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765735700785/70c15f45-0626-4bee-a192-118b3e35e473.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-setting-a-cors-policy&quot;&gt;Setting a CORS policy&lt;/h3&gt;
&lt;p&gt;To set a CORS policy, which is necessary to allow requests from specific browser origins, you need to go to the “CORS Policy” section and click on “Add”. Then, you can copy &amp;amp; paste the CORS policy below and click on “Save”.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;Make sure to add your frontend domain to the AllowedOrigins, otherwise this will NOT work.&lt;/div&gt;
&lt;/div&gt;

&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;[
  {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;AllowedOrigins&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://sql-workbench.com&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://embedded.sql-workbench.com&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://terminal.sql-workbench.com&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://shell.duckdb.org&quot;&lt;/span&gt;
    ],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;AllowedMethods&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;GET&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;HEAD&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;PUT&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;POST&quot;&lt;/span&gt;
    ],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;AllowedHeaders&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Authorization&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Cache-Control&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Content-Range&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Content-Type&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Content-Length&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Range&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;X-Amz-Acl&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;X-Amz-Content-Sha256&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;X-Amz-Date&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;X-Amz-Security-Token&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;X-Iceberg-Access-Delegation&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;X-Host-Override&quot;&lt;/span&gt;
    ],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;ExposeHeaders&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Content-Type&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;Access-Control-Allow-Origin&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;ETag&quot;&lt;/span&gt;
    ]
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-enabling-r2-data-catalog&quot;&gt;Enabling R2 Data Catalog&lt;/h3&gt;
&lt;p&gt;Now, enable the R2 Data Catalog in your R2 bucket’s settings:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765735862782/c383055e-8e6e-42ee-b1b1-b50c43b529d9.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;After clicking on “Enable”, it takes a few seconds, and you should see your Catalog URI and your Warehouse Name. Those are both needed when querying the data later.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765735883577/6a2554aa-12de-46dd-9780-e780be91a8fc.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-creating-a-account-token-for-r2-r2-data-catalog-access&quot;&gt;Creating an Account Token for R2 / R2 Data Catalog access&lt;/h3&gt;
&lt;p&gt;To access the data via queries, DuckDB needs an Account API Token. In the left sidebar menu, click on “Manage account” → “Account API Tokens”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765736124390/2499381f-5631-4fc0-9d92-2b1068df3692.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then, on the next screen, click on “Create Token”, and on the following screen, click on “Get started” in the Custom Token section.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765736186732/d7794836-fb8e-4258-83d2-df59f8448b3f.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Next, enter a name for your token, and create the two Permission entries as outlined below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1765736045094/9abe8ff4-7049-40f5-b7be-1911c0415601.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Click on “Continue to summary” and then create your access token.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;Save the newly created Account API Token somewhere safe, as you will not be able to retrieve it again. If you lose the token, you can only rotate it or create a new one.&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;heading-querying-data&quot;&gt;Querying data&lt;/h3&gt;
&lt;p&gt;You now have all the pieces you need for querying the data from your R2 Data Catalog.&lt;/p&gt;
&lt;p&gt;As the R2 Data Catalog doesn’t send CORS headers, you need to use our &lt;a target=&quot;_blank&quot; href=&quot;https://cors.sqlqry.run&quot;&gt;cors.sqlqry.run&lt;/a&gt; CORS proxy. It provides a /iceberg route that needs to be suffixed with the R2 Data Catalog base URL, concatenated with your Cloudflare Account ID and the name of the R2 bucket you created.&lt;/p&gt;
&lt;p&gt;Create a DuckDB secret for the R2 Data Catalog as shown below. All subsequent requests will use this secret for authorization:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;REPLACE&lt;/span&gt; SECRET r2secret (
    &lt;span class=&quot;hljs-keyword&quot;&gt;TYPE&lt;/span&gt; iceberg, 
    ENDPOINT &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://cors.sqlqry.run/iceberg/catalog.cloudflarestorage.com/YOUR_CLOUDFLARE_ACCOUNT_ID/YOUR_BUCKET_NAME&apos;&lt;/span&gt;, 
    TOKEN &lt;span class=&quot;hljs-string&quot;&gt;&apos;YOUR_ACCOUNT_API_TOKEN&apos;&lt;/span&gt;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, attach the remote Iceberg catalog as a database by supplying the Warehouse Name from the R2 Data Catalog settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;ATTACH IF NOT EXISTS &apos;YOUR_WAREHOUSE_NAME&apos; AS r2lake (TYPE iceberg);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you haven’t created a schema (namespace) yet, you can do so with plain SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SCHEMA&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;IF&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; EXISTS r2lake.test;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create an example table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;IF&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;EXISTS&lt;/span&gt; r2lake.test.sales (
    sale_id &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    product_id &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    quantity &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    price &lt;span class=&quot;hljs-built_in&quot;&gt;DECIMAL&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;),
    sale_date &lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Insert data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; r2lake.test.sales 
    (sale_id, product_id, quantity, price, sale_date) 
&lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt;
    (&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;100.00&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;2021-01-01&apos;&lt;/span&gt;),
    (&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;20&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;234.45&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;2021-01-02&apos;&lt;/span&gt;),
    (&lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;30&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;30.99&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;2021-01-03&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Query the data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; r2lake.test.sales;
&lt;/code&gt;&lt;/pre&gt;
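&lt;p&gt;Since the attached catalog behaves like any other DuckDB database, regular analytical SQL works on top of it as well. As a small, hypothetical example based on the sample table created above, you could aggregate the revenue per sale date:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Hypothetical aggregation over the example table from above
SELECT
    sale_date,
    SUM(quantity * price) AS revenue
FROM r2lake.test.sales
GROUP BY sale_date
ORDER BY sale_date;
&lt;/code&gt;&lt;/pre&gt;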
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Thanks to the innovations of the DuckDB team, it’s now possible to query Iceberg catalogs directly from the browser. Possible use cases include in-browser SQL clients, or demo applications that demonstrate how Iceberg itself works.&lt;/p&gt;
</content:encoded><category>duckDB</category><category>iceberg</category><category>wasm</category><author>Tobias Müller</author></item><item><title>TypeScript scripts as DuckDB Table Functions</title><link>https://tobilg.com/posts/typescript-scripts-as-duckdb-table-functions/</link><guid isPermaLink="true">https://tobilg.com/posts/typescript-scripts-as-duckdb-table-functions/</guid><description>What if you could query any REST API, GraphQL endpoint, or web page directly from DuckDB using SQL? No ETL pipelines, no intermediate files, no complex setup - just write a TypeScript script and use it as a table function.
In this post, I&apos;ll show you...</description><pubDate>Wed, 10 Dec 2025 17:07:28 GMT</pubDate><content:encoded>&lt;p&gt;What if you could query any REST API, GraphQL endpoint, or web page directly from DuckDB using SQL? No ETL pipelines, no intermediate files, no complex setup - just write a TypeScript script and use it as a table function.&lt;/p&gt;
&lt;p&gt;In this post, I&apos;ll show you how to combine DuckDB&apos;s &lt;code&gt;shellfs&lt;/code&gt; and &lt;code&gt;arrow&lt;/code&gt; extensions with Bun&apos;s zero-dependency scripts to create a powerful, flexible data pipeline that lets you query the world with SQL.&lt;/p&gt;
&lt;p&gt;This post was inspired by &lt;a target=&quot;_blank&quot; href=&quot;https://sidequery.dev/blog/uv-run-duckdb&quot;&gt;Python Scripts as DuckDB Table Functions&lt;/a&gt; and &lt;a target=&quot;_blank&quot; href=&quot;https://danielmiessler.com/blog/executable-typescript-programs-using-bun&quot;&gt;Self-Contained TypeScript Programs Using Bun&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-the-magic&quot;&gt;The “Magic”&lt;/h2&gt;
&lt;p&gt;The setup is surprisingly simple. DuckDB&apos;s &lt;code&gt;shellfs&lt;/code&gt; extension can read data from shell command output, and the &lt;code&gt;arrow&lt;/code&gt; extension can parse Apache Arrow IPC format. Combine this with Bun&apos;s ability to run TypeScript files directly (with automatic dependency installation), and you get a powerful pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- One-time setup in DuckDB&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; shellfs &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; community;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; arrow &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; community;
&lt;span class=&quot;hljs-keyword&quot;&gt;LOAD&lt;/span&gt; shellfs;
&lt;span class=&quot;hljs-keyword&quot;&gt;LOAD&lt;/span&gt; arrow;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create a macro to run Bun scripts as table functions&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;REPLACE&lt;/span&gt; MACRO bun(script, args := &lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_arrow(&lt;span class=&quot;hljs-string&quot;&gt;&apos;bun &apos;&lt;/span&gt; || script || &lt;span class=&quot;hljs-string&quot;&gt;&apos; &apos;&lt;/span&gt; || args || &lt;span class=&quot;hljs-string&quot;&gt;&apos; |&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now any TypeScript script that outputs Arrow IPC data becomes a queryable table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;my-script.ts&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;fetch-data.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--limit=100&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-the-secret-sauce-json-to-arrow-ipc&quot;&gt;The Secret Sauce: json-to-arrow-ipc&lt;/h2&gt;
&lt;p&gt;The key ingredient is the &lt;code&gt;json-to-arrow-ipc&lt;/code&gt; npm package. This package converts JSON to Arrow IPC format without dictionary encoding, making it fully compatible with DuckDB.&lt;/p&gt;
&lt;p&gt;A minimal script looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; data = [{ name: &lt;span class=&quot;hljs-string&quot;&gt;&apos;Alice&apos;&lt;/span&gt;, age: &lt;span class=&quot;hljs-number&quot;&gt;30&lt;/span&gt; }, { name: &lt;span class=&quot;hljs-string&quot;&gt;&apos;Bob&apos;&lt;/span&gt;, age: &lt;span class=&quot;hljs-number&quot;&gt;25&lt;/span&gt; }];
process.stdout.write(jsonToArrowIPC(data));
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-real-world-examples&quot;&gt;Real-World Examples&lt;/h2&gt;
&lt;p&gt;Let&apos;s look at some practical examples that demonstrate the power of this approach.&lt;/p&gt;
&lt;h3 id=&quot;heading-example-1-query-github-repositories&quot;&gt;Example 1: Query GitHub Repositories&lt;/h3&gt;
&lt;p&gt;Want to analyze the most popular TypeScript repositories on GitHub? Create &lt;code&gt;github-repos.ts&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; args = process.argv.slice(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; language = args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--language=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;typescript&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; perPage = &lt;span class=&quot;hljs-built_in&quot;&gt;parseInt&lt;/span&gt;(args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--per-page=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;30&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; response = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(
  &lt;span class=&quot;hljs-string&quot;&gt;`https://api.github.com/search/repositories?q=language:&lt;span class=&quot;hljs-subst&quot;&gt;${language}&lt;/span&gt;&amp;amp;sort=stars&amp;amp;order=desc&amp;amp;per_page=&lt;span class=&quot;hljs-subst&quot;&gt;${perPage}&lt;/span&gt;`&lt;/span&gt;,
  {
    headers: {
      &lt;span class=&quot;hljs-string&quot;&gt;&apos;Accept&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;application/vnd.github.v3+json&apos;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&apos;User-Agent&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;duckdb-bun-script&apos;&lt;/span&gt;
    }
  }
);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { items } = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; response.json();

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; repos = items.map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;repo: &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;&lt;/span&gt;) =&amp;gt;&lt;/span&gt; ({
  name: repo.name,
  full_name: repo.full_name,
  description: repo.description,
  stars: repo.stargazers_count,
  forks: repo.forks_count,
  open_issues: repo.open_issues_count,
  language: repo.language,
  created_at: repo.created_at,
  updated_at: repo.updated_at,
  homepage: repo.homepage,
  topics: &lt;span class=&quot;hljs-built_in&quot;&gt;JSON&lt;/span&gt;.stringify(repo.topics),
}));

process.stdout.write(jsonToArrowIPC(repos));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now query it with SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Top 10 TypeScript repos by stars&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, stars, forks, description
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;github-repos.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--language=typescript --per-page=100&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; stars &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Compare languages&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;hljs-string&quot;&gt;&apos;TypeScript&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;language&lt;/span&gt;, stars, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;github-repos.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--language=typescript --per-page=10&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;hljs-string&quot;&gt;&apos;Rust&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;language&lt;/span&gt;, stars, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;github-repos.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--language=rust --per-page=10&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; stars &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-example-2-fetch-blog-posts-from-hashnode-graphql-api&quot;&gt;Example 2: Fetch Blog Posts from Hashnode GraphQL API&lt;/h3&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://hashnode.com&quot;&gt;Hashnode&lt;/a&gt; provides a GraphQL API that&apos;s perfect for this pattern. Create &lt;code&gt;hashnode-posts.ts&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; args = process.argv.slice(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; host = args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--host=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;tobilg.com&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; query = &lt;span class=&quot;hljs-string&quot;&gt;`
  query GetPublicationPosts($host: String!, $first: Int!) {
    publication(host: $host) {
      posts(first: $first) {
        edges {
          node {
            title
            slug
            brief
            publishedAt
            views
            reactionCount
            responseCount
            readTimeInMinutes
            tags {
              name
            }
            url
            author {
              username
            }
          }
        }
      }
    }
  }
`&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; response = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://gql.hashnode.com/&apos;&lt;/span&gt;, {
  method: &lt;span class=&quot;hljs-string&quot;&gt;&apos;POST&apos;&lt;/span&gt;,
  headers: { &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;application/json&apos;&lt;/span&gt; },
  body: &lt;span class=&quot;hljs-built_in&quot;&gt;JSON&lt;/span&gt;.stringify({
    query,
    variables: { host, first: &lt;span class=&quot;hljs-number&quot;&gt;20&lt;/span&gt; }
  }),
});

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; { data } = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; response.json();

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; posts = data?.publication?.posts?.edges?.map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;edge: &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;&lt;/span&gt;) =&amp;gt;&lt;/span&gt; ({
  title: edge.node.title,
  slug: edge.node.slug,
  brief: edge.node.brief,
  published_at: edge.node.publishedAt,
  views: edge.node.views ?? &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;,
  reactions: edge.node.reactionCount ?? &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;,
  responses: edge.node.responseCount ?? &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;,
  read_time_minutes: edge.node.readTimeInMinutes,
  tags: &lt;span class=&quot;hljs-built_in&quot;&gt;JSON&lt;/span&gt;.stringify(edge.node.tags?.map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;t: &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;&lt;/span&gt;) =&amp;gt;&lt;/span&gt; t.name) ?? []),
  url: edge.node.url,
  author: edge.node.author?.username ?? &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
})) ?? [];

process.stdout.write(jsonToArrowIPC(posts));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Query blog analytics with SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- All posts sorted by views&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; title, views, reactions, read_time_minutes, published_at
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hashnode-posts.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--host=tobilg.com&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; views &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Posts with most engagement&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  title,
  views,
  reactions,
  responses,
  (reactions + responses) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total_engagement
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hashnode-posts.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--host=tobilg.com&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; total_engagement &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Average read time&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;AVG&lt;/span&gt;(read_time_minutes) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; avg_read_time,
  &lt;span class=&quot;hljs-keyword&quot;&gt;SUM&lt;/span&gt;(views) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total_views,
  &lt;span class=&quot;hljs-keyword&quot;&gt;SUM&lt;/span&gt;(reactions) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total_reactions
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hashnode-posts.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--host=tobilg.com&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-example-3-github-user-activity&quot;&gt;Example 3: GitHub User Activity&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;github-user-events.ts&lt;/code&gt; to analyze recent &lt;a target=&quot;_blank&quot; href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt; activity:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; args = process.argv.slice(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; username = args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--user=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;tobilg&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; response = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(
  &lt;span class=&quot;hljs-string&quot;&gt;`https://api.github.com/users/&lt;span class=&quot;hljs-subst&quot;&gt;${username}&lt;/span&gt;/events/public?per_page=100`&lt;/span&gt;,
  {
    headers: {
      &lt;span class=&quot;hljs-string&quot;&gt;&apos;Accept&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;application/vnd.github.v3+json&apos;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&apos;User-Agent&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;duckdb-bun-script&apos;&lt;/span&gt;
    }
  }
);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; events = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; response.json();

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; processed = events.map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;event: &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;&lt;/span&gt;) =&amp;gt;&lt;/span&gt; ({
  id: event.id,
  &lt;span class=&quot;hljs-keyword&quot;&gt;type&lt;/span&gt;: event.type,
  repo: event.repo?.name,
  created_at: event.created_at,
  action: event.payload?.action ?? &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
  ref: event.payload?.ref ?? &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
  ref_type: event.payload?.ref_type ?? &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
}));

process.stdout.write(jsonToArrowIPC(processed));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Analyze activity patterns:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Event type breakdown&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;type&lt;/span&gt;, &lt;span class=&quot;hljs-keyword&quot;&gt;COUNT&lt;/span&gt;(*) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;github-user-events.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--user=tobilg&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;type&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Most active repositories&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; repo, &lt;span class=&quot;hljs-keyword&quot;&gt;COUNT&lt;/span&gt;(*) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;github-user-events.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--user=tobilg&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; repo
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Activity by day of week&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  strftime(created_at::&lt;span class=&quot;hljs-built_in&quot;&gt;timestamp&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;%A&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; day_of_week,
  &lt;span class=&quot;hljs-keyword&quot;&gt;COUNT&lt;/span&gt;(*) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;events&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;github-user-events.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--user=tobilg&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; day_of_week
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;events&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-example-4-npm-package-statistics&quot;&gt;Example 4: NPM Package Statistics&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;npm-package.ts&lt;/code&gt; to fetch package metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; args = process.argv.slice(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; packageName = args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--package=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;duckdb&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; registryResponse = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(&lt;span class=&quot;hljs-string&quot;&gt;`https://registry.npmjs.org/&lt;span class=&quot;hljs-subst&quot;&gt;${packageName}&lt;/span&gt;`&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; registry = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; registryResponse.json();

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; versions = &lt;span class=&quot;hljs-built_in&quot;&gt;Object&lt;/span&gt;.entries(registry.time || {})
  .filter(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;[key]&lt;/span&gt;) =&amp;gt;&lt;/span&gt; key !== &lt;span class=&quot;hljs-string&quot;&gt;&apos;created&apos;&lt;/span&gt; &amp;amp;&amp;amp; key !== &lt;span class=&quot;hljs-string&quot;&gt;&apos;modified&apos;&lt;/span&gt;)
  .map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;[version, time]&lt;/span&gt;) =&amp;gt;&lt;/span&gt; ({
    package: packageName,
    version,
    published_at: time &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;string&lt;/span&gt;,
    description: registry.description,
    license: registry.license,
    homepage: registry.homepage,
  }));

process.stdout.write(jsonToArrowIPC(versions));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create &lt;code&gt;npm-downloads.ts&lt;/code&gt; for download stats:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; args = process.argv.slice(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; packageName = args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--package=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;duckdb&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; response = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(
  &lt;span class=&quot;hljs-string&quot;&gt;`https://api.npmjs.org/downloads/range/last-month/&lt;span class=&quot;hljs-subst&quot;&gt;${packageName}&lt;/span&gt;`&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; data = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; response.json();

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; downloads = data.downloads?.map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;d: &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;&lt;/span&gt;) =&amp;gt;&lt;/span&gt; ({
  package: packageName,
  date: d.day,
  downloads: d.downloads,
})) ?? [];

process.stdout.write(jsonToArrowIPC(downloads));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Query package analytics:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Version history&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;version&lt;/span&gt;, published_at
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;npm-package.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--package=apache-arrow&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; published_at &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Download trends&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  &lt;span class=&quot;hljs-built_in&quot;&gt;date&lt;/span&gt;,
  downloads,
  &lt;span class=&quot;hljs-keyword&quot;&gt;AVG&lt;/span&gt;(downloads) &lt;span class=&quot;hljs-keyword&quot;&gt;OVER&lt;/span&gt; (&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;date&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ROWS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BETWEEN&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;6&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;PRECEDING&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;CURRENT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ROW&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; rolling_avg
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;npm-downloads.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--package=duckdb&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;date&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Compare package downloads&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;duckdb&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;package&lt;/span&gt;, &lt;span class=&quot;hljs-keyword&quot;&gt;SUM&lt;/span&gt;(downloads) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;npm-downloads.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--package=duckdb&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;better-sqlite3&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;package&lt;/span&gt;, &lt;span class=&quot;hljs-keyword&quot;&gt;SUM&lt;/span&gt;(downloads) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; total
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;npm-downloads.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--package=better-sqlite3&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-example-5-hacker-news-front-page&quot;&gt;Example 5: Hacker News Front Page&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;hn-frontpage.ts&lt;/code&gt; using the official Hacker News API:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bun&lt;/span&gt;

&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; args = process.argv.slice(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; limit = &lt;span class=&quot;hljs-built_in&quot;&gt;parseInt&lt;/span&gt;(args.find(&lt;span class=&quot;hljs-function&quot;&gt;&lt;span class=&quot;hljs-params&quot;&gt;a&lt;/span&gt; =&amp;gt;&lt;/span&gt; a.startsWith(&lt;span class=&quot;hljs-string&quot;&gt;&apos;--limit=&apos;&lt;/span&gt;))?.split(&lt;span class=&quot;hljs-string&quot;&gt;&apos;=&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] ?? &lt;span class=&quot;hljs-string&quot;&gt;&apos;30&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;// Fetch top story IDs&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; topStoriesResponse = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://hacker-news.firebaseio.com/v0/topstories.json&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; topStoryIds: &lt;span class=&quot;hljs-built_in&quot;&gt;number&lt;/span&gt;[] = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; topStoriesResponse.json();

&lt;span class=&quot;hljs-comment&quot;&gt;// Fetch story details in parallel (limited to avoid rate limiting)&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; storyIds = topStoryIds.slice(&lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;, limit);
&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; stories = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;Promise&lt;/span&gt;.all(
  storyIds.map(&lt;span class=&quot;hljs-keyword&quot;&gt;async&lt;/span&gt; (id) =&amp;gt; {
    &lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; response = &lt;span class=&quot;hljs-keyword&quot;&gt;await&lt;/span&gt; fetch(&lt;span class=&quot;hljs-string&quot;&gt;`https://hacker-news.firebaseio.com/v0/item/&lt;span class=&quot;hljs-subst&quot;&gt;${id}&lt;/span&gt;.json`&lt;/span&gt;);
    &lt;span class=&quot;hljs-keyword&quot;&gt;return&lt;/span&gt; response.json();
  })
);

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; processed = stories.map(&lt;span class=&quot;hljs-function&quot;&gt;(&lt;span class=&quot;hljs-params&quot;&gt;story: &lt;span class=&quot;hljs-built_in&quot;&gt;any&lt;/span&gt;, index: &lt;span class=&quot;hljs-built_in&quot;&gt;number&lt;/span&gt;&lt;/span&gt;) =&amp;gt;&lt;/span&gt; ({
  rank: index + &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;,
  id: story.id,
  title: story.title,
  url: story.url ?? &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
  score: story.score,
  author: story.by,
  comments: story.descendants ?? &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;,
  created_at: &lt;span class=&quot;hljs-keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;Date&lt;/span&gt;(story.time * &lt;span class=&quot;hljs-number&quot;&gt;1000&lt;/span&gt;).toISOString(),
  &lt;span class=&quot;hljs-keyword&quot;&gt;type&lt;/span&gt;: story.type,
}));

process.stdout.write(jsonToArrowIPC(processed));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Analyze the front page:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Current top stories&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;rank&lt;/span&gt;, title, score, comments, author
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hn-frontpage.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--limit=30&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;rank&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Stories with high engagement ratio&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  title,
  score,
  comments,
  &lt;span class=&quot;hljs-keyword&quot;&gt;ROUND&lt;/span&gt;(comments::&lt;span class=&quot;hljs-built_in&quot;&gt;float&lt;/span&gt; / &lt;span class=&quot;hljs-keyword&quot;&gt;NULLIF&lt;/span&gt;(score, &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;), &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; comment_per_point
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hn-frontpage.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--limit=50&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt; score &amp;gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; comment_per_point &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Top domains&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
  regexp_extract(&lt;span class=&quot;hljs-keyword&quot;&gt;url&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;https?://([^/]+)&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;domain&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;COUNT&lt;/span&gt;(*) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; stories,
  &lt;span class=&quot;hljs-keyword&quot;&gt;AVG&lt;/span&gt;(score) &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; avg_score
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; bun(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hn-frontpage.ts&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;--limit=100&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;domain&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; stories &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-configuration-options&quot;&gt;Configuration Options&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;json-to-arrow-ipc&lt;/code&gt; package handles various data scenarios:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;import&lt;/span&gt; { jsonToArrowIPC } &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;json-to-arrow-ipc&apos;&lt;/span&gt;;

&lt;span class=&quot;hljs-keyword&quot;&gt;const&lt;/span&gt; ipc = jsonToArrowIPC(data, {
  &lt;span class=&quot;hljs-comment&quot;&gt;// Sample rows for schema inference (default: 100)&lt;/span&gt;
  schemaSampleSize: &lt;span class=&quot;hljs-number&quot;&gt;50&lt;/span&gt;,

  &lt;span class=&quot;hljs-comment&quot;&gt;// Handle schema mismatches: &apos;error&apos; | &apos;skip&apos; | &apos;coerce&apos; (default: &apos;coerce&apos;)&lt;/span&gt;
  onSchemaMismatch: &lt;span class=&quot;hljs-string&quot;&gt;&apos;coerce&apos;&lt;/span&gt;,

  &lt;span class=&quot;hljs-comment&quot;&gt;// Flatten nested objects with dot notation (default: false)&lt;/span&gt;
  flattenNestedObjects: &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;,

  &lt;span class=&quot;hljs-comment&quot;&gt;// Serialize arrays as JSON strings (default: true)&lt;/span&gt;
  serializeArrays: &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;,
});
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-tips-and-best-practices&quot;&gt;Tips and Best Practices&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handle pagination&lt;/strong&gt;: For APIs with large datasets, implement pagination in your script and use command-line arguments to control limits.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error handling&lt;/strong&gt;: Add try-catch blocks and output empty arrays on error to prevent DuckDB query failures (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;: Be mindful of API rate limits. Add delays between requests when fetching many resources.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
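&lt;p&gt;Here is a minimal, hypothetical sketch of the error-handling pattern mentioned above. The API URL is just a placeholder; the important part is that the script always writes a valid (possibly empty) Arrow IPC stream to stdout, so the surrounding DuckDB query returns zero rows instead of failing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;#!/usr/bin/env bun

import { jsonToArrowIPC } from &apos;json-to-arrow-ipc&apos;;

// Hypothetical endpoint - replace it with the API you actually want to query
const API_URL = &apos;https://example.com/api/items&apos;;

try {
  const response = await fetch(API_URL);

  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const items = await response.json();
  process.stdout.write(jsonToArrowIPC(items));
} catch (error) {
  // Log to stderr for debugging, but still emit an empty result
  // so the DuckDB query doesn&apos;t fail.
  console.error(`Fetch failed: ${error}`);
  process.stdout.write(jsonToArrowIPC([]));
}
&lt;/code&gt;&lt;/pre&gt;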
&lt;h2 id=&quot;heading-conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Any data source with an API becomes a SQL table. Combined with DuckDB&apos;s analytical capabilities - window functions, CTEs, JSON extraction, and more - you have a lightweight but powerful tool for ad-hoc data analysis.&lt;/p&gt;
&lt;p&gt;The combination of Bun&apos;s TypeScript execution, automatic dependency management, and DuckDB&apos;s extensibility creates a developer experience that&apos;s both powerful and simple.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;The&lt;/em&gt; &lt;code&gt;json-to-arrow-ipc&lt;/code&gt; package and example scripts are available at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-bun-scripts&quot;&gt;github.com/tobilg/duckdb-bun-scripts&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>duckDB</category><category>TypeScript</category><category>arrow</category><author>Tobias Müller</author></item><item><title>Use your favorite AI tool to read the latest AWS News</title><link>https://tobilg.com/posts/use-ai-tools-to-read-aws-news/</link><guid isPermaLink="true">https://tobilg.com/posts/use-ai-tools-to-read-aws-news/</guid><description>You can use the unofficial AWS News MCP Server, which combines the news articles, blog posts and updates of more than 40 different AWS Newsfeeds into a single source.
Connecting to the AWS News MCP Server
You can use different protocols to access it:

...</description><pubDate>Tue, 14 Oct 2025 17:16:50 GMT</pubDate><content:encoded>&lt;p&gt;You can use the &lt;a target=&quot;_blank&quot; href=&quot;https://awsnews.remotemcp.directory/&quot;&gt;unofficial AWS News MCP Server&lt;/a&gt;, which combines the news articles, blog posts and updates of more than 40 different AWS Newsfeeds into a single source.&lt;/p&gt;
&lt;h2 id=&quot;heading-connecting-to-the-aws-news-mcp-server&quot;&gt;Connecting to the AWS News MCP Server&lt;/h2&gt;
&lt;p&gt;You can use different protocols to access it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://awsnews.remotemcp.directory/mcp&quot;&gt;&lt;strong&gt;https://awsnews.remotemcp.directory/mcp&lt;/strong&gt;&lt;/a&gt; (Streaming HTTP, recommended)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://awsnews.remotemcp.directory/sse&quot;&gt;&lt;strong&gt;https://awsnews.remotemcp.directory/sse&lt;/strong&gt;&lt;/a&gt; (SSE, deprecated)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The MCP Server is available &lt;strong&gt;without authentication&lt;/strong&gt; and is &lt;strong&gt;free to use&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-available-tools&quot;&gt;Available tools&lt;/h2&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Parameters&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;countNews&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Get the number of AWS news articles&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;getNewsStats&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Get the AWS news statistics by date&lt;/td&gt;&lt;td&gt;Number of Days (optional, default: 90)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;getNewsByDate&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Get the AWS news articles by date&lt;/td&gt;&lt;td&gt;Date (in format YYYY-MM-DD)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;getLatestNews&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Get the latest AWS news articles&lt;/td&gt;&lt;td&gt;Limit (optional, default: 10)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;searchNews&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Search the AWS news articles&lt;/td&gt;&lt;td&gt;Query (required), Limit (optional, default: 10), SortBy (optional, allowed values: relevance, date, default: relevance)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h2 id=&quot;heading-using-claude&quot;&gt;Using Claude&lt;/h2&gt;
&lt;p&gt;To add the AWS News MCP Server to &lt;a target=&quot;_blank&quot; href=&quot;https://claude.ai&quot;&gt;claude.ai&lt;/a&gt;, you can follow the &lt;a target=&quot;_blank&quot; href=&quot;https://support.claude.com/en/articles/11175166-getting-started-with-custom-connectors-using-remote-mcp#h_3d1a65aded&quot;&gt;connector documentation&lt;/a&gt;. Navigate to &lt;a target=&quot;_blank&quot; href=&quot;https://claude.ai/settings/connectors&quot;&gt;Settings → Connectors&lt;/a&gt; while logged-in, and enter the details as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760459708499/00dec957-098e-43ce-9487-e666ca03a3f9.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Once you’ve done this, you can click on the “Configure” button to allow unsupervised access if you’d like to, but this isn’t necessary to use the MCP Server itself. Otherwise, you’ll be asked whether Claude is allowed to access it each time you use it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760459766276/666e15f2-5ac3-46ff-a201-8e144613167c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then just create a new chat, and ask a question like&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Show the latest 5 AWS news&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;and the answer from Claude will look like this (after you may have had to confirm the MCP tool usage):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760459974071/c284f9b1-77bd-419f-bcb9-64863b4f5af2.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can also ask about specific AWS services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What are the latest 3 AWS news articles about EKS?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760460128165/7a0c39f3-d41f-48e2-8dc3-e8dde9845469.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Or, ask about AWS News statistics:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Give me the AWS News stats for October 2025&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760460277647/4ef1d3a8-c7c5-4996-b3bc-a402e1a2572d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can even use Claude to analyze the data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What were the top 5 services of the AWS news of October 13th 2025?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760461293692/f8f9f464-0c47-484b-a484-f1869e09591e.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-using-claude-code&quot;&gt;Using Claude Code&lt;/h2&gt;
&lt;p&gt;To add the AWS News MCP Server to &lt;a target=&quot;_blank&quot; href=&quot;https://www.claude.com/product/claude-code&quot;&gt;Claude Code&lt;/a&gt;, you can check the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.claude.com/en/docs/claude-code/mcp#option-1%3A-add-a-remote-http-server&quot;&gt;documentation&lt;/a&gt; on how to add remote MCP Servers. It supports different &lt;a target=&quot;_blank&quot; href=&quot;https://docs.claude.com/en/docs/claude-code/mcp#mcp-installation-scopes&quot;&gt;installation scopes&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# Local scope (default)&lt;/span&gt;
claude mcp add --transport http aws-news https://awsnews.remotemcp.directory/mcp

&lt;span class=&quot;hljs-comment&quot;&gt;# Project scope&lt;/span&gt;
claude mcp add --transport http aws-news --scope project https://awsnews.remotemcp.directory/mcp

&lt;span class=&quot;hljs-comment&quot;&gt;# User scope&lt;/span&gt;
claude mcp add --transport http aws-news --scope user https://awsnews.remotemcp.directory/mcp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-using-cursor&quot;&gt;Using Cursor&lt;/h2&gt;
&lt;p&gt;To add the AWS News MCP Server to &lt;a target=&quot;_blank&quot; href=&quot;https://cursor.com&quot;&gt;Cursor&lt;/a&gt;, you can just &lt;a target=&quot;_blank&quot; href=&quot;https://cursor.com/en-US/install-mcp?name=aws-news&amp;amp;config=eyJ1cmwiOiJodHRwczovL2F3c25ld3MucmVtb3RlbWNwLmRpcmVjdG9yeS9tY3AifQ%3D%3D&quot;&gt;click on this link&lt;/a&gt;, which will add it automatically, or you can install it manually via an &lt;a target=&quot;_blank&quot; href=&quot;https://cursor.com/docs/context/mcp#using-mcpjson&quot;&gt;mcp.json&lt;/a&gt; file, which you can place &lt;a target=&quot;_blank&quot; href=&quot;https://cursor.com/docs/context/mcp#configuration-locations&quot;&gt;either in a project or in a user’s home directory&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;mcpServers&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;aws-news&quot;&lt;/span&gt;: {
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;url&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://awsnews.remotemcp.directory/mcp&quot;&lt;/span&gt;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-using-vs-code-amp-github-copilot&quot;&gt;Using VS Code &amp;amp; GitHub Copilot&lt;/h2&gt;
&lt;p&gt;To add the AWS News MCP Server to &lt;a target=&quot;_blank&quot; href=&quot;https://code.visualstudio.com/&quot;&gt;VS Code&lt;/a&gt; for the usage with &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/features/copilot&quot;&gt;GitHub Copilot&lt;/a&gt;, you can install it manually via a &lt;a target=&quot;_blank&quot; href=&quot;https://code.visualstudio.com/docs/copilot/customization/mcp-servers#_add-an-mcp-server&quot;&gt;mcp.json&lt;/a&gt; file, which supports different locations and scopes (e.g. workspace or user):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;servers&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;aws-news&quot;&lt;/span&gt;: {
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;url&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://awsnews.remotemcp.directory/mcp&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;http&quot;&lt;/span&gt;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then use the MCP Server in Agent Mode as described in &lt;a target=&quot;_blank&quot; href=&quot;https://code.visualstudio.com/docs/copilot/customization/mcp-servers#_use-mcp-tools-in-agent-mode&quot;&gt;the docs&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-using-amazon-q-developer&quot;&gt;Using Amazon Q Developer&lt;/h2&gt;
&lt;p&gt;To add the AWS News MCP Server to Amazon Q Developer, you can follow the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-mcp-config-CLI.html&quot;&gt;documentation&lt;/a&gt; and add an mcp.json file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;mcpServers&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;aws-news&quot;&lt;/span&gt;: {
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;url&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://awsnews.remotemcp.directory/mcp&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;http&quot;&lt;/span&gt;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that, use the &lt;code&gt;/mcp&lt;/code&gt; command to activate the AWS News MCP server.&lt;/p&gt;
&lt;h2 id=&quot;heading-using-cloudflare-ai-playground&quot;&gt;Using Cloudflare AI Playground&lt;/h2&gt;
&lt;p&gt;The AI Playground is a free UI (useful if you don’t have an AI subscription) that allows users to explore different models and integrations like MCP Servers. Just head to &lt;a target=&quot;_blank&quot; href=&quot;https://playground.ai.cloudflare.com/&quot;&gt;https://playground.ai.cloudflare.com/&lt;/a&gt; and configure the MCP server:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760461832003/0d82fdf4-64f9-43cc-80c4-2233de124ffd.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then, click on the “Connect” button and you should see the following output with the available tools, and some debug output:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760461863103/21617589-d41a-4535-aad6-1a97ddd0b672.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then ask questions in the main chat window, like&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Show the latest 3 AWS News articles&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1760461925094/cc4cd014-ae39-4c4b-bab1-07933e700f4a.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;With the freely available remote AWS News MCP Server, it’s possible to access and analyze the latest AWS News articles with your favorite AI tools, such as Claude, Cursor, GitHub Copilot and others.&lt;/p&gt;
</content:encoded><category>AWS</category><category>mcp</category><category>AI</category><category>#agent</category><author>Tobias Müller</author></item><item><title>Using Amazon SageMaker Lakehouse with DuckDB</title><link>https://tobilg.com/posts/using-amazon-sagemaker-lakehouse-with-duckdb/</link><guid isPermaLink="true">https://tobilg.com/posts/using-amazon-sagemaker-lakehouse-with-duckdb/</guid><description>Preconditions
To use the Amazon SageMaker Lakehouse with DuckDB, you first have to create a S3 Table bucket, a namespace and an actual S3 Table. All those steps are described in my other blog post “Query S3 Tables with DuckDB”, so please make sure yo...</description><pubDate>Sun, 08 Jun 2025 17:48:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;heading-preconditions&quot;&gt;Preconditions&lt;/h2&gt;
&lt;p&gt;To use the &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/sagemaker/lakehouse/&quot;&gt;Amazon SageMaker Lakehouse&lt;/a&gt; with DuckDB, you first have to create a S3 Table bucket, a namespace and an actual S3 Table. All those steps are described in my other blog post &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/query-s3-tables-with-duckdb&quot;&gt;“Query S3 Tables with DuckDB”&lt;/a&gt;, so please make sure you followed the outlined (manual) steps of the &lt;strong&gt;Setting up a S3 Table&lt;/strong&gt; section before continuing with this blog post.&lt;/p&gt;
&lt;p&gt;The setup of the pretty complicated permissions on the AWS side is implemented according to their blog post &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/blogs/storage/access-data-in-amazon-s3-tables-using-pyiceberg-through-the-aws-glue-iceberg-rest-endpoint/&quot;&gt;“Access data in Amazon S3 Tables using PyIceberg through the AWS Glue Iceberg REST endpoint”&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-create-iam-role-and-policy&quot;&gt;Create IAM role and policy&lt;/h2&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;As outlined in the previous blog post, you already created a S3 Table bucket called “duckdb-test” and a namespace “test”. In the following examples, I will not use my real AWS Account ID, but the artificial “123456789012” instead. Please keep in mind to use your actual AWS Account ID if you want to implement this in your own account.&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;heading-check-your-arns&quot;&gt;Check your ARNs&lt;/h3&gt;
&lt;p&gt;Before creating your IAM policy, you need to check the ARNs of your S3 Table bucket, and also note the namespace.&lt;/p&gt;
&lt;p&gt;So, if your S3 Table bucket lives in us-east-1, go to the &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/s3/table-buckets?region=us-east-1&quot;&gt;AWS Console&lt;/a&gt; and check the ARN for it:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749392102095/8de287eb-9017-42a9-8ded-8efd4c352496.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In our case it’s &lt;code&gt;arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog/duckdb-test&lt;/code&gt; and the namespace we created as of the referenced blog post is &lt;code&gt;test&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&quot;heading-create-the-policy&quot;&gt;Create the policy&lt;/h3&gt;
&lt;p&gt;Now, go to the &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/policies&quot;&gt;IAM Policies&lt;/a&gt; screen of the AWS Console, and click on the &lt;strong&gt;Create policy&lt;/strong&gt; button. Click on the &lt;strong&gt;JSON&lt;/strong&gt; button, and paste the following JSON policy into the editor.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Version&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2012-10-17&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Statement&quot;&lt;/span&gt;: [
        {
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Sid&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;SagemakerLakehouseTest&quot;&lt;/span&gt;,
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Effect&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Allow&quot;&lt;/span&gt;,
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Action&quot;&lt;/span&gt;: [
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:GetCatalog&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:GetDatabase&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:GetDatabases&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:GetTable&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:GetTables&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:CreateTable&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;glue:UpdateTable&quot;&lt;/span&gt;
            ],
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Resource&quot;&lt;/span&gt;: [
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;arn:aws:glue:us-east-1:123456789012:catalog&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;arn:aws:glue:us-east-1:123456789012:catalog/s3tablescatalog/duckdb-test&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;arn:aws:glue:us-east-1:123456789012:table/s3tablescatalog/duckdb-test/test/*&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;arn:aws:glue:us-east-1:123456789012:database/s3tablescatalog/duckdb-test/test&quot;&lt;/span&gt;
            ]
        },
        {
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Effect&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Allow&quot;&lt;/span&gt;,
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Action&quot;&lt;/span&gt;: [
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;lakeformation:GetDataAccess&quot;&lt;/span&gt;
            ],
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Resource&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;*&quot;&lt;/span&gt;
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then update your AWS Account ID, the S3 Table bucket name and the namespace accordingly if you chose other values for them.&lt;/p&gt;
&lt;p&gt;Then press &lt;strong&gt;Next&lt;/strong&gt;, and use &lt;code&gt;s3tables-duckdb-test&lt;/code&gt; as policy name and save the policy with &lt;strong&gt;Save policy&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&quot;heading-create-the-role&quot;&gt;Create the role&lt;/h3&gt;
&lt;p&gt;Go to the &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-1#/roles&quot;&gt;IAM Roles&lt;/a&gt; screen of the AWS Console, and click on the &lt;strong&gt;Create role&lt;/strong&gt; button.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;This will require an existing IAM role or IAM user. Normally, this will be a role you’re assuming via STS, or a user whose credentials you stored on the machine you’re using. The setup of this is not part of this blog post.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749392888669/db63bb0e-6715-4255-9c59-e9611298b500.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now click on &lt;strong&gt;Next&lt;/strong&gt;. Select your previously created policy &lt;code&gt;s3tables-duckdb-test&lt;/code&gt; as the permissions policy by clicking the checkbox in front of the policy and clicking &lt;strong&gt;Next&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749393012909/1f75327f-33ad-4e22-96aa-c9e6240266b1.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Enter the role name &lt;code&gt;s3tables-duckdb-test-role&lt;/code&gt; and click on &lt;strong&gt;Create role&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-define-access-control-using-lake-formation&quot;&gt;Define access control using Lake Formation&lt;/h2&gt;
&lt;h3 id=&quot;heading-application-integration-setup&quot;&gt;Application integration setup&lt;/h3&gt;
&lt;p&gt;In Lake Formation, enable full table access for external engines to access data. This allows third-party applications to get Lake Formation temporary credentials using an IAM role that has full permissions (ALL) on the requested table.&lt;/p&gt;
&lt;p&gt;Go to the &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/lakeformation/home?region=us-east-1#dashboard&quot;&gt;AWS Lake Formation Dashboard&lt;/a&gt;, and then on the Left Pane, expand the &lt;strong&gt;Administration&lt;/strong&gt; section, then &lt;strong&gt;Application integration settings&lt;/strong&gt; and choose &lt;strong&gt;Allow external engines to access data in Amazon S3 locations with full table access&lt;/strong&gt; and click on &lt;strong&gt;Save:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749395676996/2a39f805-9416-44da-b50b-fc99f562dd5d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-set-up-lake-formation-grants&quot;&gt;Set up Lake Formation grants&lt;/h3&gt;
&lt;p&gt;For the newly created role &lt;code&gt;s3tables-duckdb-test-role&lt;/code&gt; to be able to actually access the database and the tables, you need to provide database and table level permissions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Database-level permissions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the Lake Formation &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/lakeformation/home?region=us-east-1#permissions-list&quot;&gt;Data permissions&lt;/a&gt;, click on &lt;strong&gt;Grant&lt;/strong&gt;. Then, in the &lt;strong&gt;Principals&lt;/strong&gt; section, choose the radio button &lt;strong&gt;IAM users and roles&lt;/strong&gt;, and from the drop-down choose &lt;code&gt;s3tables-duckdb-test-role&lt;/code&gt;. In the &lt;strong&gt;LF-Tags&lt;/strong&gt; or &lt;strong&gt;catalog resources&lt;/strong&gt; section, choose &lt;strong&gt;Named Data Catalog&lt;/strong&gt; resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;code&gt;123456789012:s3tablescatalog/duckdb-test&lt;/code&gt; for Catalog&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;code&gt;test&lt;/code&gt; for Databases&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;strong&gt;CREATE TABLE, DESCRIBE&lt;/strong&gt; for database permissions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on &lt;strong&gt;Grant&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749396349796/5e19c3b3-1630-408a-843a-b63c2131433d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749396364252/bf0897a1-63f2-4a02-9914-37f5e7b6e6ec.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table-level permissions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In the Lake Formation &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/lakeformation/home?region=us-east-1#permissions-list&quot;&gt;Data permissions&lt;/a&gt;, click on &lt;strong&gt;Grant&lt;/strong&gt;. Then, in the &lt;strong&gt;Principals&lt;/strong&gt; section, choose the radio button &lt;strong&gt;IAM users and roles&lt;/strong&gt;, and from the drop-down choose &lt;code&gt;s3tables-duckdb-test-role&lt;/code&gt;. In the &lt;strong&gt;LF-Tags&lt;/strong&gt; or &lt;strong&gt;catalog resources&lt;/strong&gt; section, choose &lt;strong&gt;Named Data Catalog&lt;/strong&gt; resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;code&gt;123456789012:s3tablescatalog/duckdb-test&lt;/code&gt; for Catalog&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;code&gt;test&lt;/code&gt; for Databases&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;strong&gt;All Tables&lt;/strong&gt; for tables&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose &lt;strong&gt;SUPER&lt;/strong&gt; for table permissions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on &lt;strong&gt;Grant&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749396555724/a9a408f8-144b-42a2-86c4-177374c5819b.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749396666587/93993c4f-9757-4b71-8465-8600cb68a7fe.jpeg&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-quick-check-in-athena&quot;&gt;Quick check in Athena&lt;/h2&gt;
&lt;p&gt;We verify in Athena that the table &lt;code&gt;daily_sales&lt;/code&gt; exists, and that we have data, before going to the DuckDB CLI to test the catalog integration:&lt;/p&gt;
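&lt;p&gt;A simple preview query in the Athena query editor (with the &lt;code&gt;s3tablescatalog/duckdb-test&lt;/code&gt; catalog and the &lt;code&gt;test&lt;/code&gt; database selected) is enough for this check, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Preview the S3 Table created in the referenced blog post
SELECT * FROM daily_sales LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;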
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749397072106/788cd335-0966-4a87-a85a-14b044eefbd8.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-validate-iam-role&quot;&gt;Validate IAM role&lt;/h2&gt;
&lt;p&gt;You can validate the created IAM role on the target machine by issuing the following command (this assumes you have installed the AWS CLI):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;aws sts assume-role --role-arn &lt;span class=&quot;hljs-string&quot;&gt;&quot;arn:aws:iam::123456789012:role/s3tables-duckdb-test-role&quot;&lt;/span&gt; --role-session-name s3tables-duckdb-test-role
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-usage-with-duckdb&quot;&gt;Usage with DuckDB&lt;/h2&gt;
&lt;h3 id=&quot;heading-create-duckdb-secret-for-catalog-access&quot;&gt;Create DuckDB secret for catalog access&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; SECRET (
    &lt;span class=&quot;hljs-keyword&quot;&gt;TYPE&lt;/span&gt; s3,
    PROVIDER credential_chain,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CHAIN&lt;/span&gt; sts,
    ASSUME_ROLE_ARN &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:iam::123456789012:role/s3tables-duckdb-test-role&apos;&lt;/span&gt;,
    REGION &lt;span class=&quot;hljs-string&quot;&gt;&apos;us-east-1&apos;&lt;/span&gt;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-attach-the-lake-formation-glue-catalog&quot;&gt;Attach the Lake Formation (Glue) catalog&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;ATTACH &apos;123456789012:s3tablescatalog/duckdb-test&apos; AS s3tables_datalake (
    TYPE ICEBERG,
    ENDPOINT_TYPE &apos;GLUE&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-list-tables&quot;&gt;List tables&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SHOW&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLES&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-query-table&quot;&gt;Query table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; s3tables_datalake.test.daily_sales &lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-cli-screenshot&quot;&gt;CLI screenshot&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1749404173596/9346a02e-a6ed-412f-8747-ffb756931f02.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;We were able to show that the integration with the &lt;strong&gt;AWS Sagemaker Catalog&lt;/strong&gt; (a.k.a. &lt;strong&gt;AWS Glue Iceberg REST endpoint&lt;/strong&gt;) works with DuckDB. Once the quite tedious IAM role/policy setup, as well as the Lake Formation permissions, are done, the usage is pretty straightforward with simple &lt;code&gt;CREATE SECRET&lt;/code&gt; and &lt;code&gt;ATTACH&lt;/code&gt; statements.&lt;/p&gt;
&lt;h2 id=&quot;heading-references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DuckDB docs: &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/core_extensions/iceberg/amazon_sagemaker_lakehouse.html&quot;&gt;Amazon Sagemaker Lakehouse&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Blog Post: &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/blogs/storage/access-data-in-amazon-s3-tables-using-pyiceberg-through-the-aws-glue-iceberg-rest-endpoint/&quot;&gt;Access data in Amazon S3 Tables using PyIceberg through the AWS Glue Iceberg REST endpoint&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blog post: &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/query-s3-tables-with-duckdb&quot;&gt;Query S3 Tables with DuckDB&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>sagemaker </category><category>duckDB</category><category>glue</category><author>Tobias Müller</author></item><item><title>Welcome to the age of $10/month Lakehouses</title><link>https://tobilg.com/posts/the-age-of-10-dollar-a-month-lakehouses/</link><guid isPermaLink="true">https://tobilg.com/posts/the-age-of-10-dollar-a-month-lakehouses/</guid><description>Recap: Data Warehouses, Data Lakes, Lakehouses?
As a short recap, what do these mean, and how are they differentiated?
Modern Data Warehouses, like Amazon Redshift, Google BigQuery, and Snowflake, offer fast, SQL-optimized performance for structured ...</description><pubDate>Fri, 30 May 2025 18:22:31 GMT</pubDate><content:encoded>&lt;h1 id=&quot;heading-recap-data-warehouses-data-lakes-lakehouses&quot;&gt;Recap: Data Warehouses, Data Lakes, Lakehouses?&lt;/h1&gt;
&lt;p&gt;As a short recap, what do these mean, and how are they differentiated?&lt;/p&gt;
&lt;p&gt;Modern &lt;strong&gt;Data Warehouses&lt;/strong&gt;, like &lt;strong&gt;Amazon Redshift&lt;/strong&gt;, &lt;strong&gt;Google BigQuery&lt;/strong&gt;, and &lt;strong&gt;Snowflake&lt;/strong&gt;, offer fast, SQL-optimized performance for structured data and BI workloads. Their columnar storage, advanced indexing, and automatic optimization make them ideal for analytics. However, they come at a premium: storage and compute are tightly coupled, and costs can rise sharply with large-scale, always-on workloads or frequent queries over massive datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Lakes&lt;/strong&gt;, typically built on &lt;strong&gt;Amazon S3&lt;/strong&gt;, &lt;strong&gt;Azure Data Lake Storage (ADLS)&lt;/strong&gt;, &lt;strong&gt;Google Cloud Storage (GCS)&lt;/strong&gt;, or &lt;strong&gt;Cloudflare R2&lt;/strong&gt;, provide a low-cost solution for storing raw structured and unstructured data. They scale effortlessly and are the foundation of many batch and streaming pipelines. While storage is cheap, performance suffers without significant investment in tooling, such as &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Presto&lt;/strong&gt;, or &lt;strong&gt;Hive&lt;/strong&gt;, to support transformation and query execution. Data governance and consistency also become challenges as systems scale.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Lakehouse&lt;/strong&gt; architecture, popularized by platforms like &lt;strong&gt;Databricks&lt;/strong&gt; or &lt;strong&gt;Snowflake&lt;/strong&gt; with table formats like &lt;strong&gt;Delta Lake&lt;/strong&gt; and &lt;strong&gt;Apache Iceberg&lt;/strong&gt;, offers a compelling middle ground. Lakehouses combine the low-cost, schema-flexible storage of data lakes with the performance and ACID transaction support of warehouses. Technologies like &lt;strong&gt;Databricks Lakehouse Platform&lt;/strong&gt;, &lt;strong&gt;Snowflake’s Unistore&lt;/strong&gt;, or &lt;strong&gt;Dremio&lt;/strong&gt; provide unified data management, allowing teams to run SQL analytics, data science, and streaming workloads directly on cloud object storage, without needing separate ETL into a warehouse.&lt;/p&gt;
&lt;p&gt;From a cost-performance perspective, Lakehouses reduce redundancy and simplify infrastructure by eliminating the need to maintain both a lake and a warehouse. They support open formats (e.g., Parquet, ORC) and bring advanced features like time travel (see Apache Iceberg), schema evolution, and fine-grained access control, which were previously reserved for traditional Data Warehouses, into the Data Lake domain.&lt;/p&gt;
&lt;p&gt;Ultimately, the choice depends on your workload characteristics, query latency requirements, data volume, and ecosystem alignment. But as the major cloud providers and vendors converge around the Lakehouse paradigm, it&apos;s becoming an increasingly attractive option for teams seeking agility, scalability, and cost control in modern data platforms.&lt;/p&gt;
&lt;h1 id=&quot;heading-existing-open-table-formats&quot;&gt;Existing open table formats&lt;/h1&gt;
&lt;p&gt;There are different “table formats” for the Lakehouse architecture, such as &lt;strong&gt;Apache Iceberg&lt;/strong&gt; and &lt;strong&gt;Delta Lake&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; is an open table format designed for huge analytic datasets in distributed data processing systems like Apache Spark, Trino, Presto, Flink, and Hive. It was developed at &lt;strong&gt;Netflix&lt;/strong&gt; and later donated to the &lt;strong&gt;Apache Software Foundation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt; is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified streaming and batch processing to Apache Spark and other big data engines. Originally developed by &lt;strong&gt;Databricks&lt;/strong&gt;, it is now part of the &lt;strong&gt;Linux Foundation&lt;/strong&gt;.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;Because I’m a bit lazy sometimes, I asked ChatGPT the following: “You are an expert data engineer. Can you write a comparison summary of open table formats of both Apache Iceberg and Delta Lake? Use the same criteria while doing so”. The shortened output is found below. I tried to verify the claims as far as I could. If you find an error, please leave a comment.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;heading-comparison&quot;&gt;Comparison&lt;/h2&gt;
&lt;h3 id=&quot;heading-acid-transactions&quot;&gt;ACID Transactions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Fully supports ACID transactions using &lt;strong&gt;snapshot isolation&lt;/strong&gt;. Transactions are handled via metadata manifests and atomic commits to the metadata file. Suitable for both streaming and batch operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Implements ACID transactions via a &lt;strong&gt;write-ahead log&lt;/strong&gt; (&lt;code&gt;_delta_log&lt;/code&gt;). Also supports snapshot isolation, enabling reliable concurrent read/write operations in batch and streaming.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-schema-evolution&quot;&gt;Schema Evolution&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Offers &lt;strong&gt;flexible schema evolution&lt;/strong&gt; and supports adding, renaming, reordering, and deleting columns. Field IDs are used to track schema changes safely.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Supports &lt;strong&gt;additive schema evolution&lt;/strong&gt; (e.g., adding columns). Reordering and renaming are partially supported and can be more restrictive than Iceberg. Dropping columns may require rewriting the underlying Parquet files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-partitioning&quot;&gt;Partitioning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Uses &lt;strong&gt;hidden partitioning&lt;/strong&gt;, abstracting physical layout from logical queries. Automatically rewrites queries for efficient partition pruning.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Relies on &lt;strong&gt;explicit partition columns&lt;/strong&gt;, which are visible to users. Partition pruning depends on query awareness of partition structure.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-time-travel&quot;&gt;Time Travel&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Native time travel support via &lt;strong&gt;snapshots&lt;/strong&gt;. You can query historical versions based on snapshot ID or timestamp.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Also supports time travel using version numbers or timestamps. Retains transaction logs that allow rollback and historical queries.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-performance-and-metadata-scalability&quot;&gt;Performance and Metadata Scalability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Uses a &lt;strong&gt;metadata tree structure&lt;/strong&gt; (manifests and manifest lists) to avoid file listing and support petabyte-scale datasets. More scalable with large numbers of files and partitions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Stores metadata in a &lt;strong&gt;log of JSON files&lt;/strong&gt;, which can become large over time and may require compaction (&lt;code&gt;OPTIMIZE&lt;/code&gt; and &lt;code&gt;VACUUM&lt;/code&gt;) for performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-streaming-support&quot;&gt;Streaming Support&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Has native support for streaming writes and reads, especially with Apache Flink and Spark Structured Streaming. Streaming write support is still evolving in some engines.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Designed with &lt;strong&gt;stream-batch unification&lt;/strong&gt; in mind. Seamless support for Structured Streaming in Spark.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-file-format-support&quot;&gt;File Format Support&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Supports &lt;strong&gt;Parquet&lt;/strong&gt;, &lt;strong&gt;Avro&lt;/strong&gt;, and &lt;strong&gt;ORC&lt;/strong&gt;. File format agnostic, with clean decoupling of metadata and data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Primarily based on &lt;strong&gt;Parquet&lt;/strong&gt;. No support for other formats as of now.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-ecosystem-and-engine-support&quot;&gt;Ecosystem and Engine Support&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Integrates with &lt;strong&gt;Apache Spark, Flink, Trino, Presto, Hive, Dremio, Snowflake&lt;/strong&gt;, and more. Supported by &lt;strong&gt;AWS Glue, Snowflake, and others&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Strong integration with &lt;strong&gt;Databricks&lt;/strong&gt;, &lt;strong&gt;Apache Spark&lt;/strong&gt;, and supported in &lt;strong&gt;Presto, Trino, Flink&lt;/strong&gt; (via Delta-RS). Emerging ecosystem via Delta Kernel and Delta-RS.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-operations-and-tooling&quot;&gt;Operations and Tooling&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;:&lt;br /&gt;  Growing support for table maintenance (compaction, expiration, snapshot management). Still evolving CLI tools.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;:&lt;br /&gt;  Mature operational tooling via Databricks (e.g., &lt;code&gt;OPTIMIZE&lt;/code&gt;, &lt;code&gt;VACUUM&lt;/code&gt;, Delta Live Tables). Better UX for managed environments.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-summary&quot;&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/h2&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Criteria&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ACID Transactions&lt;/td&gt;&lt;td&gt;✅ Yes (Snapshot-based)&lt;/td&gt;&lt;td&gt;✅ Yes (Log-based)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;&lt;td&gt;✅ Flexible (Field IDs, reorder, rename)&lt;/td&gt;&lt;td&gt;⚠️ Limited (mostly additive)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning&lt;/td&gt;&lt;td&gt;✅ Hidden, automatic pruning&lt;/td&gt;&lt;td&gt;⚠️ Explicit, user-managed&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time Travel&lt;/td&gt;&lt;td&gt;✅ Snapshots, timestamps&lt;/td&gt;&lt;td&gt;✅ Versioned log, timestamps&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata Performance&lt;/td&gt;&lt;td&gt;✅ Scalable (manifest-based)&lt;/td&gt;&lt;td&gt;⚠️ Requires periodic compaction&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming Support&lt;/td&gt;&lt;td&gt;⚠️ Evolving (good Flink/Spark support)&lt;/td&gt;&lt;td&gt;✅ Mature (best with Spark)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File Format Support&lt;/td&gt;&lt;td&gt;✅ Parquet, Avro, ORC&lt;/td&gt;&lt;td&gt;⚠️ Parquet only&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine/Ecosystem Support&lt;/td&gt;&lt;td&gt;✅ Broad, vendor-neutral&lt;/td&gt;&lt;td&gt;⚠️ Strong Spark/Databricks focus&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational Tooling&lt;/td&gt;&lt;td&gt;⚠️ Growing&lt;/td&gt;&lt;td&gt;✅ Advanced in Databricks&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h1 id=&quot;heading-the-new-kid-on-the-block-ducklake&quot;&gt;The new kid on the block: DuckLake&lt;/h1&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://ducklake.select/&quot;&gt;DuckLake&lt;/a&gt; was introduced on 2025-05-27 by the founders of DuckDB. There’s an interesting &lt;a target=&quot;_blank&quot; href=&quot;https://www.youtube.com/watch?v=zeonmOO9jm4&quot;&gt;podcast&lt;/a&gt; where Mark &amp;amp; Hannes explain more about the motivation and goals behind it; viewing it is strongly recommended! Also, they wrote a manifesto: &lt;a target=&quot;_blank&quot; href=&quot;https://ducklake.select/manifesto/&quot;&gt;SQL as a Lakehouse format&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Its key features are:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supports snapshots, time travel queries, schema evolution and partitioning&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Can have as many lightweight snapshots as you want without frequent compacting steps&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allows concurrent access with ACID transactional guarantees over multi-table operations&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uses statistics for filter pushdown, enabling fast queries even on large datasets&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There’s also a nice introduction and overview in Jordan Tigani’s blog post in the MotherDuck blog: &lt;a target=&quot;_blank&quot; href=&quot;https://motherduck.com/blog/ducklake-motherduck/&quot;&gt;“A Duck walks into a lake”&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The big difference compared to Apache Iceberg and Delta Lake is that DuckLake uses a database as a backing store for the metadata, instead of keeping it in complex JSON and/or Avro files on object storage, like the other two do, only to then have a data catalog put on top of that.&lt;/p&gt;
&lt;p&gt;DuckLake stores data in Parquet, and metadata in relational tables, plain and simple. This has many benefits, e.g. faster lookups (a single SQL query vs. a potentially long cascade of S3 HEAD/GET requests), and the catalog is already built in.&lt;/p&gt;
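&lt;p&gt;To give a rough idea of what this looks like in practice, here is a minimal, local sketch using the DuckDB &lt;code&gt;ducklake&lt;/code&gt; extension. It follows the DuckLake documentation at the time of writing; the file names and the example table are placeholders, and the exact syntax may still change while DuckLake evolves:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Install and load the DuckLake extension
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake catalog backed by a local DuckDB file;
-- the data itself is written as Parquet files to DATA_PATH
ATTACH &apos;ducklake:metadata.ducklake&apos; AS my_ducklake (DATA_PATH &apos;data_files/&apos;);

-- Every committed transaction creates a new, lightweight snapshot
CREATE TABLE my_ducklake.demo (id INTEGER, payload VARCHAR);
INSERT INTO my_ducklake.demo VALUES (1, &apos;hello&apos;), (2, &apos;lake&apos;);

-- Time travel: query the table as of an earlier snapshot
SELECT * FROM my_ducklake.demo AT (VERSION =&gt; 1);
&lt;/code&gt;&lt;/pre&gt;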
&lt;p&gt;The &lt;strong&gt;catalog database&lt;/strong&gt; should be chosen upon the following criteria (according to the &lt;a target=&quot;_blank&quot; href=&quot;https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database&quot;&gt;DuckLake website&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If you would like to perform &lt;strong&gt;local data warehousing&lt;/strong&gt; with a &lt;strong&gt;single client,&lt;/strong&gt; use &lt;strong&gt;DuckDB&lt;/strong&gt; as the catalog database&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you would like to perform &lt;strong&gt;local data warehousing&lt;/strong&gt; using &lt;strong&gt;multiple local clients&lt;/strong&gt;, use &lt;strong&gt;SQLite&lt;/strong&gt; as the catalog database&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you would like to operate a &lt;strong&gt;multi-user Lakehouse&lt;/strong&gt; with &lt;strong&gt;potentially remote clients&lt;/strong&gt;, choose a transactional client-server database system as the catalog database: &lt;strong&gt;MySQL&lt;/strong&gt; or &lt;strong&gt;PostgreSQL&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though it’s a very early version that is still missing some functionality, it’s already looking very useful from a (metadata) complexity and performance perspective.&lt;/p&gt;
&lt;p&gt;The relational database required for multi-user scenarios adds some complexity, but there are SaaS solutions on the market that reduce or even get rid of the operational burden of managing relational databases. The guiding principle is still the separation of storage and compute. More on that in the next chapter.&lt;/p&gt;
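&lt;p&gt;As a rough sketch of what the multi-user variant looks like on the DuckDB side, the following attaches a DuckLake catalog whose metadata lives in a Postgres database, while the Parquet files go to an S3-compatible bucket. Again, this follows the DuckLake documentation at the time of writing; the connection string, bucket, and credential values are placeholders you would replace with your own:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;INSTALL ducklake;
INSTALL postgres;
LOAD ducklake;

-- Credentials for the S3-compatible object storage (e.g. Cloudflare R2)
CREATE SECRET object_storage (
    TYPE s3,
    KEY_ID &apos;your-access-key-id&apos;,
    SECRET &apos;your-secret-access-key&apos;,
    ENDPOINT &apos;your-account-id.r2.cloudflarestorage.com&apos;
);

-- Metadata in Postgres, data as Parquet files in the bucket
ATTACH &apos;ducklake:postgres:host=your-postgres-host dbname=ducklake user=your-user password=your-password&apos; AS lakehouse (DATA_PATH &apos;s3://your-bucket/&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiple clients can attach to the same catalog database and work on the same tables, which is exactly the concurrent, multi-user scenario described above.&lt;/p&gt;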
&lt;h1 id=&quot;heading-is-big-data-dead&quot;&gt;Is Big Data dead?&lt;/h1&gt;
&lt;p&gt;Coming back to the article’s title, where’s the promised $10/month Lakehouse at? And why $10/month?&lt;/p&gt;
&lt;p&gt;In his article &lt;a target=&quot;_blank&quot; href=&quot;https://motherduck.com/blog/big-data-is-dead/&quot;&gt;“Big Data is dead”&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://motherduck.com/authors/jordan-tigani/&quot;&gt;Jordan Tigani&lt;/a&gt; shared some interesting insights from his days at Google’s BigQuery. One was that the 90th percentile of all queries only queried 100 MB of data, and the 99th percentile no more than 10 GB.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1748538334295/3d067b59-5a86-4ffc-96b8-73fc6a76ed85.webp&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Another one was that most people don’t actually have that much data, but some have real Big Data. “Among customers who were using the service heavily, the median data storage size was much less than 100 GB”:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1748538578119/04ad0388-fc08-436a-aca1-00da3515dad8.webp&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Amazon released a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/amazon-science/redset&quot;&gt;dataset&lt;/a&gt; about the Redshift usage and query sizes, which Jordan analyzed in another &lt;a target=&quot;_blank&quot; href=&quot;https://motherduck.com/blog/redshift-files-hunt-for-big-data/&quot;&gt;blog post&lt;/a&gt; that supports his original article.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;So, what’s the relation to this blog post?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Basically, it validates that most people/projects/companies neither have “real” Big Data, nor run queries that require actual “real” Big Data tools that are distributed, horizontally scaling, and so on.&lt;/p&gt;
&lt;p&gt;Many of them, though, have established systems that are much too complex and/or much too expensive for their actual needs.&lt;/p&gt;
&lt;h1 id=&quot;heading-the-10month-lakehouse&quot;&gt;The $10/month Lakehouse&lt;/h1&gt;
&lt;h2 id=&quot;heading-requirements&quot;&gt;Requirements&lt;/h2&gt;
&lt;p&gt;If we apply the “separation of storage and compute” paradigm, we need to find a supplier for the (object) storage part, and one for the compute. For the storage, we want to use an S3-compatible API for flexibility reasons.&lt;/p&gt;
&lt;p&gt;We want to leverage the new DuckLake format, and we eventually want to have multiple reader and writer processes. This means we do not “only” have to search for a compute provider, but also find a provider for a relational database for the DuckLake metadata, which enables the desired multi-user capabilities.&lt;/p&gt;
&lt;p&gt;Furthermore, we’d like to spend as little time on managing actual infrastructure as possible. This leads us to using serverless services for both compute/querying, and metadata storage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Going forward, we assume the following:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We have 250GB of data we want to analyze (as monthly average)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We would like to query the remote data via HTTPS endpoints, as well as from our local machines&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want the whole Lakehouse to run in “auto-pilot mode”, meaning we don’t want to manage the underlying infrastructure&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want SQL to be our primary/sole interface to the data, so we want to use the new DuckLake, based on DuckDB&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-a-suggested-solution&quot;&gt;A suggested solution&lt;/h2&gt;
&lt;h3 id=&quot;heading-storage&quot;&gt;Storage&lt;/h3&gt;
&lt;p&gt;For storage, there are many providers on the market, be it the big public cloud providers like &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/s3/&quot;&gt;AWS&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://azure.microsoft.com/en-us/products/storage/blobs&quot;&gt;Azure&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://www.cloudflare.com/developer-platform/products/r2/&quot;&gt;Cloudflare&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://cloud.google.com/storage&quot;&gt;Google Cloud&lt;/a&gt;, or others, like &lt;a target=&quot;_blank&quot; href=&quot;https://www.hetzner.com/storage/object-storage/&quot;&gt;Hetzner&lt;/a&gt;, or, more specialized, &lt;a target=&quot;_blank&quot; href=&quot;https://www.backblaze.com/docs/cloud-storage-s3-compatible-api&quot;&gt;Backblaze&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If we want to optimize on costs, Cloudflare R2 comes into focus, mainly because of the free egress, a generous free tier, and generally cheap standard storage costs of $0.015 / GB-month.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R2 Free Tier&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1748539143973/c6aedc4b-dd2b-485e-93c3-865bcf05e19c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/pricing/#class-a-operations&quot;&gt;Class A operations&lt;/a&gt; are basically write or list operations, while &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/pricing/#class-b-operations&quot;&gt;Class B operations&lt;/a&gt; are mainly read operations, which will occur much more often than write operations in a Lakehouse setting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R2 Pricing after Free Tier&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1748539159811/1b79d952-0689-4e82-a5b3-b3d6f8da4d10.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The calculated cost per month, based on our requirements for the storage part would be the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Storage: (250 GB-month - 10 GB-month free tier) * $0.015 / GB-month = &lt;strong&gt;$3.60&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write requests: We assume that the 1 million requests / month are sufficient, &lt;strong&gt;$0&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read requests: We assume that the 10 million requests / month are sufficient, &lt;strong&gt;$0&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Egress is free, &lt;strong&gt;$0&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Overall costs&lt;/strong&gt; for storage for a month&lt;strong&gt;: $3.60&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-compute&quot;&gt;Compute&lt;/h3&gt;
&lt;p&gt;When we want to run DuckDB in a serverless fashion, several options come to mind. E.g. we could use AWS services such as &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/fargate/pricing/&quot;&gt;Fargate&lt;/a&gt;, running DuckDB in a container, or AWS &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/lambda/pricing/&quot;&gt;Lambda&lt;/a&gt;, running it in a serverless function. Google Cloud has &lt;a target=&quot;_blank&quot; href=&quot;https://cloud.google.com/run?hl=en#pricing&quot;&gt;CloudRun&lt;/a&gt;, Azure has &lt;a target=&quot;_blank&quot; href=&quot;https://azure.microsoft.com/en-us/pricing/details/container-instances/&quot;&gt;Container Instances&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Cloudflare will launch its Containers service in June 2025, which has a &lt;a target=&quot;_blank&quot; href=&quot;https://blog.cloudflare.com/cloudflare-containers-coming-2025/#pay-for-what-you-use-and-use-the-right-tool&quot;&gt;pricing model&lt;/a&gt; of $0.072 per vCPU-hour, and $0.009 per GB-hour. This is not necessarily cheaper than the other providers, but it comes with 1 TB / month of free egress from the containers. Because we’ll already be using R2, we’ll use it anyway, despite being able to save maybe $1 with other providers, based on our assumed usage (see below).&lt;/p&gt;
&lt;p&gt;The monthly costs, based on some assumptions (containers use 4GB and 2 vCPUs when run, meaning $0.18 / hour / container), would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;25,000 queries with an average runtime of 3s, i.e. 25,000 × 3 / 3,600 ≈ 20.8 container-hours: 20.8 hours × $0.18 = &lt;strong&gt;$3.75&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Container will spin down after 1 minute of inactivity, so we add another &lt;strong&gt;$2&lt;/strong&gt; to accommodate this behavior&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Overall costs&lt;/strong&gt; for compute for a month: &lt;strong&gt;$5.75&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-metadata-storage&quot;&gt;Metadata storage&lt;/h3&gt;
&lt;p&gt;Because we’d like to use DuckLake in multi-user/writer mode, we’ll need either a MySQL or Postgres compatible database to store the metadata. There are many options on the market, like &lt;a target=&quot;_blank&quot; href=&quot;https://supabase.com/pricing&quot;&gt;Supabase&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://neon.com/pricing&quot;&gt;Neon&lt;/a&gt;, which both provide serverless Postgres services.&lt;/p&gt;
&lt;p&gt;We’ll choose Neon because it has a more generous free tier, which offers up to 0.5 GB of database storage and auto-scaling of up to 2 vCPUs and 8 GB of memory, with 190 compute-hours included. This should also fit our assumed compute usage as outlined above:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1748625134932/48bc4417-1b96-4b94-a557-ad998ceb11f8.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overall costs&lt;/strong&gt; for metadata storage for a month: &lt;strong&gt;$0&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-cumulative-costs&quot;&gt;Cumulative costs&lt;/h3&gt;
&lt;p&gt;From the costs we derived for each part of the stack, we can calculate the cumulative costs of the stack per month: storage &lt;strong&gt;$3.60&lt;/strong&gt;, compute &lt;strong&gt;$5.75&lt;/strong&gt;, and metadata storage &lt;strong&gt;$0&lt;/strong&gt;. This means overall monthly costs of &lt;strong&gt;$9.35&lt;/strong&gt;!&lt;/p&gt;
&lt;h1 id=&quot;heading-deploy-a-serverless-ducklake&quot;&gt;Deploy a Serverless DuckLake&lt;/h1&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;The Cloudflare Containers service is currently in closed beta, but I got access a few weeks ago. It will GA in June 2025, as stated in the &lt;a target=&quot;_self&quot; href=&quot;https://blog.cloudflare.com/cloudflare-containers-coming-2025/&quot;&gt;introduction article&lt;/a&gt;.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;heading-github-repo&quot;&gt;GitHub repo&lt;/h2&gt;
&lt;p&gt;The repository with the code can be found at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/cloudflare-ducklake/&quot;&gt;tobilg/cloudflare-ducklake&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/cloudflare-ducklake/&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1748627995667/ed823314-5704-4749-b37d-a06a28a66484.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Once you&apos;ve cloned the code to your local machine, you can start with the deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;git &lt;span class=&quot;hljs-built_in&quot;&gt;clone&lt;/span&gt; git@github.com:tobilg/cloudflare-ducklake.git &amp;amp;&amp;amp; &lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; cloudflare-ducklake
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-preconditions&quot;&gt;Preconditions&lt;/h2&gt;
&lt;p&gt;To deploy the project, the following preconditions need to be present on the machine you want to use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://nodejs.org/en/download&quot;&gt;Node&lt;/a&gt; 20 or higher&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.docker.com/products/docker-desktop/&quot;&gt;Docker Desktop&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/workers/wrangler/install-and-update/&quot;&gt;wrangler&lt;/a&gt; (newest version)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An IDE like &lt;a target=&quot;_blank&quot; href=&quot;https://code.visualstudio.com/download&quot;&gt;VS Code&lt;/a&gt; with TypeScript support&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-installing-dependencies&quot;&gt;Installing dependencies&lt;/h2&gt;
&lt;p&gt;To install the dependencies, please run&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm i
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-build-image-locally&quot;&gt;Build image locally&lt;/h2&gt;
&lt;p&gt;Before you can build the Docker image locally, you have to download the DuckDB extensions we&apos;d like to package into the image, so that they don&apos;t need to be downloaded on each container start:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/download_extensions.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once this is done, you can run the following to build the image locally:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm run build:docker
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-run-the-image-locally&quot;&gt;Run the image locally&lt;/h2&gt;
&lt;p&gt;To run the newly built image locally, run&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm run dev:docker
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To query the DuckDB API within the locally running container, use&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;curl --location &lt;span class=&quot;hljs-string&quot;&gt;&apos;http://localhost:8080/query&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type: application/json&apos;&lt;/span&gt; \
--data &lt;span class=&quot;hljs-string&quot;&gt;&apos;{
  &quot;query&quot;: &quot;SELECT * FROM &apos;&lt;/span&gt;\&lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt;https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet&lt;span class=&quot;hljs-string&quot;&gt;&apos;\&apos;&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;&apos; LIMIT 1000&quot;
}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;Currently it&apos;s &lt;strong&gt;not possible&lt;/strong&gt; to use &lt;code&gt;wrangler dev&lt;/code&gt; during local development. I guess this will eventually change once Containers become GA.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;heading-duckdb-api&quot;&gt;DuckDB API&lt;/h2&gt;
&lt;p&gt;DuckDB is exposed via a Hono.js-based API that offers a few endpoints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;GET /&lt;/code&gt;: Will show a JSON welcome message&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;GET /_health&lt;/code&gt;: Enables potential container health checking (currently not used)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;POST /query&lt;/code&gt;: Takes an &lt;code&gt;application/json&lt;/code&gt; body with a &lt;code&gt;query&lt;/code&gt; property that contains the (encoded) SQL query. Returns the query result as &lt;code&gt;application/json&lt;/code&gt; as well (see example above)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-securing-the-api&quot;&gt;Securing the API&lt;/h2&gt;
&lt;p&gt;You can generate a unique API token, e.g. with a tool like the &lt;a target=&quot;_blank&quot; href=&quot;https://1password.com/password-generator&quot;&gt;Strong Password Generator&lt;/a&gt;, and deploy it as a Workers secret named &lt;code&gt;API_TOKEN&lt;/code&gt; with the script below. Once the secret is present, it will automatically be used to secure the &lt;code&gt;/query&lt;/code&gt; endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/setup_api_secrets.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to test this locally too, you need to create a &lt;code&gt;.dev.vars&lt;/code&gt; file in the root directory of this project. You&apos;ll append to this file later for R2 Data Catalog &amp;amp; DuckLake usage if you want to try these locally before deployment as well.&lt;/p&gt;
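&lt;p&gt;As a sketch of what this could look like (the variable names are the ones used throughout this post, the values are placeholders, and the exact set depends on which integrations you deploy):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# .dev.vars: local-only secrets, do not commit this file
API_TOKEN=your-generated-api-token
# The Postgres and R2 variables from the following steps get appended here as well,
# e.g. POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_HOST, POSTGRES_DB,
# R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_ACCOUNT_ID, R2_ENDPOINT, R2_CATALOG
&lt;/code&gt;&lt;/pre&gt;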
&lt;h2 id=&quot;heading-creating-a-r2-bucket&quot;&gt;Creating a R2 bucket&lt;/h2&gt;
&lt;p&gt;If you want to deploy either the DuckLake or the Apache Iceberg integration via R2 Data Catalog, you have to set up an R2 bucket first. To create a new R2 bucket, you can run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/setup_r2.sh YOUR-BUCKET-NAME
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;where &lt;code&gt;YOUR-BUCKET-NAME&lt;/code&gt; is your desired name for the bucket. It will automatically add the respective environment variables to &lt;code&gt;.dev.vars&lt;/code&gt;. You&apos;ll also need the bucket name later for setting up the DuckLake secrets.&lt;/p&gt;
&lt;p&gt;If you want to set a &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/reference/data-location/#location-hints&quot;&gt;location hint&lt;/a&gt; or a &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/reference/data-location/#available-jurisdictions&quot;&gt;jurisdiction&lt;/a&gt;, please edit the script accordingly before running it.&lt;/p&gt;
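&lt;p&gt;For reference, creating an R2 bucket by hand boils down to a single wrangler command (a sketch; the setup script presumably wraps something similar, and additionally writes the matching variables to &lt;code&gt;.dev.vars&lt;/code&gt; as described above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Manual alternative to scripts/setup_r2.sh: create the R2 bucket via wrangler
npx wrangler r2 bucket create YOUR-BUCKET-NAME
&lt;/code&gt;&lt;/pre&gt;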
&lt;h2 id=&quot;heading-setting-up-a-catalog-database&quot;&gt;Setting up a catalog database&lt;/h2&gt;
&lt;p&gt;As described above, we chose &lt;a target=&quot;_blank&quot; href=&quot;https://neon.tech&quot;&gt;Neon’s&lt;/a&gt; free tier to act as the metadata storage provider. Once you&apos;ve signed up, you can choose the project name, cloud provider, and region:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/cloudflare-ducklake/blob/main/docs/neon-config.png&quot;&gt;&lt;img src=&quot;https://github.com/tobilg/cloudflare-ducklake/raw/main/docs/neon-config.png&quot; alt=&quot;Neon configuration&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;After you click &quot;Create project&quot;, you&apos;re taken directly to your Dashboard. The next step is to get the connection details: click the &quot;Connect&quot; button in the upper-right corner:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/cloudflare-ducklake/blob/main/docs/neon-parameters.png&quot;&gt;&lt;img src=&quot;https://github.com/tobilg/cloudflare-ducklake/raw/main/docs/neon-parameters.png&quot; alt=&quot;Neon parameters&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Note the following connection parameters, and their equivalent environment variable/secret names (and add them to the &lt;code&gt;.dev.vars&lt;/code&gt; file):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The user (&lt;code&gt;POSTGRES_USER&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The password (&lt;code&gt;POSTGRES_PASSWORD&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The hostname (&lt;code&gt;POSTGRES_HOST&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The database name (&lt;code&gt;POSTGRES_DB&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also create another user and use that one if you don&apos;t want to use the main instance user.&lt;/p&gt;
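&lt;p&gt;Before wiring these values into the deployment, you can quickly verify them with a trivial query against the Neon instance, e.g. via &lt;code&gt;psql&lt;/code&gt; (a sketch; Neon requires TLS, hence &lt;code&gt;sslmode=require&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Sanity check: connect to the Neon database and run a trivial query
# (replace the placeholders with the values noted above)
psql &quot;postgresql://POSTGRES_USER:POSTGRES_PASSWORD@POSTGRES_HOST/POSTGRES_DB?sslmode=require&quot; -c &quot;SELECT 1;&quot;
&lt;/code&gt;&lt;/pre&gt;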
&lt;h2 id=&quot;heading-getting-a-r2-access-token&quot;&gt;Getting a R2 Access Token&lt;/h2&gt;
&lt;p&gt;Please follow the instructions in the R2 docs on &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/data-catalog/get-started/#3-create-an-api-token&quot;&gt;how to create an API token&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You&apos;ll need to store this token in a secure location (and eventually in &lt;code&gt;.dev.vars&lt;/code&gt; if you want to use the R2 Data Catalog when running the Docker image locally), as you&apos;ll need it later for the R2 Data Catalog deployment.&lt;/p&gt;
&lt;p&gt;Also, please note the S3 API Access Key (&lt;code&gt;R2_ACCESS_KEY_ID&lt;/code&gt;) and Secret Key (&lt;code&gt;R2_SECRET_ACCESS_KEY&lt;/code&gt;), as well as your Cloudflare account ID (&lt;code&gt;R2_ACCOUNT_ID&lt;/code&gt;); you&apos;ll need them in the next step. Put them in &lt;code&gt;.dev.vars&lt;/code&gt; as well if you want to run it locally.&lt;/p&gt;
&lt;h2 id=&quot;heading-create-secrets-for-ducklake&quot;&gt;Create secrets for DuckLake&lt;/h2&gt;
&lt;p&gt;Running the following script will create the eight new Workers secrets needed for deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/setup_ducklake_secrets.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-deployment&quot;&gt;Deployment&lt;/h2&gt;
&lt;p&gt;If you followed the above steps, you can now run the actual deployment. This will create a Worker and a Durable Object, build the Docker image locally, and then upload it to the Cloudflare container registry.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;You need access to Cloudflare containers beta to be able to deploy it to Cloudflare!&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm run deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-running-queries&quot;&gt;Running queries&lt;/h2&gt;
&lt;p&gt;Replace the &lt;code&gt;WORKERS_URL&lt;/code&gt; with the real URL, and the &lt;code&gt;API_TOKEN&lt;/code&gt; with your real API token:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;curl --location &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://WORKERS_URL/query&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type: application/json&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Authorization: Bearer API_TOKEN&apos;&lt;/span&gt; \
--data &lt;span class=&quot;hljs-string&quot;&gt;&apos;{
  &quot;query&quot;: &quot;CREATE TABLE ducklake.orders AS SELECT * FROM &apos;&lt;/span&gt;\&lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt;https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet&lt;span class=&quot;hljs-string&quot;&gt;&apos;\&apos;&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;&apos;&quot;
}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The response should look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;[
  {&lt;span class=&quot;hljs-string&quot;&gt;&quot;Count&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;15000&quot;&lt;/span&gt;}
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Get the first row of the new table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;curl --location &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://WORKERS_URL/query&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type: application/json&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Authorization: Bearer API_TOKEN&apos;&lt;/span&gt; \
--data &lt;span class=&quot;hljs-string&quot;&gt;&apos;{
  &quot;query&quot;: &quot;SELECT * FROM ducklake.orders LIMIT 1&quot;
}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The response should look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;[
  {
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_orderkey&quot;&lt;/span&gt;: 1,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_custkey&quot;&lt;/span&gt;: 370,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_orderstatus&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;O&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_totalprice&quot;&lt;/span&gt;: 172799.49,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_orderdate&quot;&lt;/span&gt;: {
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;days&quot;&lt;/span&gt;: 9497
    },
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_orderpriority&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;5-LOW&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_clerk&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Clerk#000000951&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_shippriority&quot;&lt;/span&gt;: 0,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;o_comment&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;nstructions sleep furiously among &quot;&lt;/span&gt;
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-deployment-with-the-additional-r2-data-catalog-iceberg-integration&quot;&gt;Deployment with the additional R2 Data Catalog / Iceberg integration&lt;/h2&gt;
&lt;p&gt;With the &lt;code&gt;v1.3.0&lt;/code&gt; release of DuckDB, it became possible to connect to the R2 Data Catalog. This means that you can now also read Iceberg data from R2&apos;s Object Storage directly from a SQL statement issued by DuckDB.&lt;/p&gt;
&lt;h3 id=&quot;heading-enable-r2-data-catalog-for-existing-r2-bucket&quot;&gt;Enable R2 Data Catalog for existing R2 bucket&lt;/h3&gt;
&lt;p&gt;You can enable the Apache Iceberg support via the R2 Data Catalog by running the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/setup_r2_data_catalog.sh YOUR-BUCKET-NAME
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;where &lt;code&gt;YOUR-BUCKET-NAME&lt;/code&gt; is the name for the bucket you used in the R2 bucket creation step before.&lt;/p&gt;
&lt;h3 id=&quot;heading-getting-the-r2-data-catalog-information&quot;&gt;Getting the R2 Data Catalog information&lt;/h3&gt;
&lt;p&gt;The information about the R2 Data Catalog URI (&lt;code&gt;R2_ENDPOINT&lt;/code&gt; secret) and warehouse name (&lt;code&gt;R2_CATALOG&lt;/code&gt; secret) can be gathered by running:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/get_catalog_info.sh YOUR-BUCKET-NAME
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;where &lt;code&gt;YOUR-BUCKET-NAME&lt;/code&gt; is the name for the bucket you used in the R2 bucket creation step before.&lt;/p&gt;
&lt;p&gt;Please also store this information, because you&apos;ll need it in the next step. Also, please add the three variables to &lt;code&gt;.dev.vars&lt;/code&gt; if you want to use the R2 Data Catalog when running the Docker image locally.&lt;/p&gt;
&lt;h3 id=&quot;heading-creating-secrets-for-r2-data-catalog&quot;&gt;Creating secrets for R2 Data Catalog&lt;/h3&gt;
&lt;p&gt;To create the necessary Workers secrets, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/create_r2_data_catalog_secrets.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and copy &amp;amp; paste the respective values you noted in the last two steps.&lt;/p&gt;
&lt;h3 id=&quot;heading-writing-iceberg-example-data&quot;&gt;Writing Iceberg example data&lt;/h3&gt;
&lt;p&gt;As we need some example data to test the new Iceberg capabilities, we have to create this data manually with a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/cloudflare-ducklake/blob/main/scripts/python/create-iceberg-data.py&quot;&gt;Python script&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For convenience, we&apos;ll use &lt;a target=&quot;_blank&quot; href=&quot;https://marimo.io/&quot;&gt;Marimo&lt;/a&gt; for this. It requires a working Python installation on your machine.&lt;/p&gt;
&lt;h4 id=&quot;heading-setup-marimo&quot;&gt;&lt;strong&gt;Setup Marimo&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;To set up Marimo, run the following npm task:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm run iceberg:setup
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a new directory and install Marimo and some dependencies in a virtual environment.&lt;/p&gt;
&lt;h4 id=&quot;heading-create-the-iceberg-example-data&quot;&gt;&lt;strong&gt;Create the Iceberg example data&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;To create the Iceberg example data, run the following npm task:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm run iceberg:create
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will start Marimo, and load the respective Python script. You&apos;ll need to edit the variables for &lt;code&gt;WAREHOUSE&lt;/code&gt;, &lt;code&gt;CATALOG_URI&lt;/code&gt; and &lt;code&gt;TOKEN&lt;/code&gt; with the values gathered in the last steps. This is also described in the &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/data-catalog/get-started/#6-create-a-python-notebook-to-interact-with-the-data-warehouse&quot;&gt;R2 Data Catalog docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After you&apos;ve done that, you can run the Python cells in the Marimo notebook and should end up with some data created in the R2 bucket. You can check this in the Cloudflare Dashboard, or via &lt;code&gt;wrangler&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&quot;heading-running-duckdb-with-iceberg-support&quot;&gt;Running DuckDB with Iceberg support&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve created the secrets as outlined above (all of them!), the application will automatically create a DuckDB secret for accessing the R2 Data Catalog, and attach the catalog under the name &lt;code&gt;r2lake&lt;/code&gt; when it starts. The example table created in the last step is called &lt;code&gt;people&lt;/code&gt; and was created in the &lt;code&gt;default&lt;/code&gt; schema.&lt;/p&gt;
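&lt;p&gt;Conceptually, attaching the R2 Data Catalog from DuckDB uses the Iceberg REST catalog syntax along the following lines. This is a sketch you could run manually against a local DuckDB; the deployed application performs the equivalent steps automatically at startup, and its exact statements may differ:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Sketch: attach the R2 Data Catalog from a local DuckDB (v1.3.0 or newer).
# Replace the placeholders with the warehouse name (R2_CATALOG), the catalog
# URI (R2_ENDPOINT), and the R2 API token created earlier.
duckdb -c &quot;
INSTALL iceberg; LOAD iceberg;
CREATE SECRET r2_secret (TYPE ICEBERG, TOKEN &apos;YOUR-R2-API-TOKEN&apos;);
ATTACH &apos;YOUR-WAREHOUSE-NAME&apos; AS r2lake (TYPE ICEBERG, ENDPOINT &apos;YOUR-CATALOG-URI&apos;);
SELECT * FROM r2lake.default.people LIMIT 10;
&quot;
&lt;/code&gt;&lt;/pre&gt;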
&lt;h3 id=&quot;heading-deployment-1&quot;&gt;Deployment&lt;/h3&gt;
&lt;p&gt;If you followed the above steps, you can now run the deployment. This will update the existing deployment if you have already deployed the DuckLake configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hint: You need access to Cloudflare containers beta to be able to deploy it to Cloudflare!&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;npm run deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;wrangler&lt;/code&gt; deployment output shows the workers URL where the service can be reached. Note it down if you want to run test queries (&lt;code&gt;WORKERS_URL&lt;/code&gt; in the step below).&lt;/p&gt;
&lt;h3 id=&quot;heading-running-queries-1&quot;&gt;Running queries&lt;/h3&gt;
&lt;p&gt;Replace the &lt;code&gt;WORKERS_URL&lt;/code&gt; with the real URL, and the &lt;code&gt;API_TOKEN&lt;/code&gt; with your real API token:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;curl --location &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://WORKERS_URL/query&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type: application/json&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Authorization: Bearer API_TOKEN&apos;&lt;/span&gt; \
--data &lt;span class=&quot;hljs-string&quot;&gt;&apos;{
  &quot;query&quot;: &quot;SELECT * FROM r2lake.default.people&quot;
}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This should return the response below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;[
  {
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;1&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Alice&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;score&quot;&lt;/span&gt;: 80
  },
  {
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Bob&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;score&quot;&lt;/span&gt;: 92.5
  },
  {
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;3&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Charlie&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-string&quot;&gt;&quot;score&quot;&lt;/span&gt;: 88
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-summary-1&quot;&gt;Summary&lt;/h1&gt;
&lt;p&gt;We’ve demonstrated that it’s possible to build (and deploy) a Lakehouse infrastructure based on DuckLake and DuckDB that stays below $10 / month while running completely serverless. This can be a suitable way to run a flexible Lakehouse for many smaller projects or teams.&lt;/p&gt;
&lt;p&gt;There’s no longer a need to build up complex and costly infrastructure on the big public cloud providers.&lt;/p&gt;
&lt;p&gt;Let me know if you have any feedback in the comments!&lt;/p&gt;
</content:encoded><category>lakehouse</category><category>Data-lake</category><category>duckDB</category><author>Tobias Müller</author></item><item><title>Using DuckDB databases as lightweight Data Lake access layer</title><link>https://tobilg.com/posts/using-duckdb-databases-as-lightweight-data-lake-access-layer/</link><guid isPermaLink="true">https://tobilg.com/posts/using-duckdb-databases-as-lightweight-data-lake-access-layer/</guid><description>Data Lakes come in a broad variety and lots of different flavors. AWS, Azure, Google Cloud, Snowflake, DataBricks, etc. they all have their specialties, strong and weak sides. Common among them is that the most, if not all, of them use Object Storage...</description><pubDate>Sat, 17 May 2025 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Data Lakes come in a broad variety and lots of different flavors. AWS, Azure, Google Cloud, Snowflake, Databricks, and others all have their specialties, strengths, and weaknesses. Common among them is that most, if not all, of them use Object Storage services such as Amazon S3 or Cloudflare R2 for storing their vast amounts of data. Accessing those Data Lakes can involve pretty complicated IAM setups, OAuth, OIDC, and other technologies in this space.&lt;/p&gt;
&lt;p&gt;But if we take a few steps back from all the complexity: &lt;strong&gt;What’s the most lightweight and flexible way to provide access to data in Data Lakes?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, what if we’d like to share open public transportation data, with a focus on costs and ease of use? With an interface that has existed for more than 50 years?&lt;/p&gt;
&lt;p&gt;Introducing: &lt;strong&gt;DuckDB databases as Data Lake access layer!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Jokes aside, I’m not the first one to ever suggest or use it, but I’ve hardly seen any written explanation of this yet, and also got some questions from the community about it (looking at you, David!), so I thought it’d make sense to write a short blog post about this technique.&lt;/p&gt;
&lt;h2 id=&quot;heading-what-are-our-goals&quot;&gt;&lt;strong&gt;What are our goals?&lt;/strong&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We want to query the data with SQL&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want the data to be accessible via HTTP(S)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to download as little data as possible when querying it&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to (or already do) have our data stored in Object Storage services like S3 or R2&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We don’t need authentication and authorization to access the data &lt;em&gt;(yet)&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-how-does-this-work&quot;&gt;&lt;strong&gt;How does this work?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The basis of the approach is that we store our datasets on Object Storage. For example, you can store a table’s data as a &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/data/parquet/overview#examples&quot;&gt;single Parquet file&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/data/multiple_files/overview#parquet&quot;&gt;multiple Parquet files&lt;/a&gt; in a prefix with a naming pattern, or even as &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/data/partitioning/hive_partitioning.html&quot;&gt;Hive-partitioned&lt;/a&gt; data. DuckDB can read (and write) them all. The latter two work only via the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/extensions/httpfs/s3api#reading&quot;&gt;S3-compatible API&lt;/a&gt; though, which needs some form of credentials, so to keep our example simple, we focus on single Parquet files per table.&lt;/p&gt;
&lt;p&gt;The other puzzle pieces are that DuckDB database files can contain &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/sql/statements/create_view&quot;&gt;view definitions&lt;/a&gt; that can &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/guides/network_cloud_storage/http_import&quot;&gt;reference remote Parquet files&lt;/a&gt;, and can be shared and attached from remote clients either via &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/guides/network_cloud_storage/duckdb_over_https_or_s3#attaching-to-a-database-over-https&quot;&gt;HTTP(S)&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/guides/network_cloud_storage/duckdb_over_https_or_s3#attaching-to-a-database-over-the-s3-api&quot;&gt;S3&lt;/a&gt;.&lt;/p&gt;
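&lt;p&gt;To give a sense of how simple the consumer side gets: attaching such a remote database file from any DuckDB client is a one-liner (a sketch with placeholder names and URL; remote databases are attached in read-only mode):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Sketch: attach a remote &quot;view database&quot; over HTTPS and query one of its views
duckdb -c &quot;
ATTACH &apos;https://your-domain.example/path/to/database.duckdb&apos; AS remote_db (READ_ONLY);
SELECT count(*) FROM remote_db.some_view;
&quot;
&lt;/code&gt;&lt;/pre&gt;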
&lt;p&gt;From a high-level perspective, our solution looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1747608549101/e8e394fa-3411-4c45-a0e9-abe7ccddfa7b.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-how-can-this-be-implemented&quot;&gt;How can this be implemented?&lt;/h2&gt;
&lt;p&gt;The process to create such a &quot;view database&quot; consists of four main steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Exporting the data from a pre-made DuckDB database as Parquet files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uploading the exported Parquet files to an Object Storage service&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating the database file containing the views to the Object Storage files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uploading the new database file to Object Storage as well, to be able to use the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/sql/statements/attach.html#attach&quot;&gt;ATTACH&lt;/a&gt; statement of DuckDB&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;heading-example-use-case&quot;&gt;Example Use Case&lt;/h2&gt;
&lt;p&gt;I recently published another blog post about &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/handling-gtfs-data-with-duckdb&quot;&gt;Handling GTFS data with DuckDB&lt;/a&gt;. The accompanying GitHub repository can be found at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs&quot;&gt;tobilg/duckdb-gtfs&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;heading-preparation&quot;&gt;Preparation&lt;/h3&gt;
&lt;p&gt;As preparation, you’ll need to perform the steps outlined in the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs?tab=readme-ov-file#usage&quot;&gt;Usage&lt;/a&gt; chapter of the README. If you follow these instructions, it will generate a 1.4GB uncompressed DuckDB database file on your local machine, after downloading the example dataset’s raw data and loading it into the default GTFS table schemas.&lt;/p&gt;
&lt;h3 id=&quot;heading-export-single-parquet-files&quot;&gt;Export single Parquet files&lt;/h3&gt;
&lt;p&gt;The database can be exported as one Parquet file per table with a simple command (also available as &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/scripts/providers/gtfs-de/full/export_data.sh&quot;&gt;bash script&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;duckdb exported-data/providers/gtfs-de/full/data.duckdb -c &lt;span class=&quot;hljs-string&quot;&gt;&quot;EXPORT DATABASE &apos;exported-data/providers/gtfs-de/full&apos; (FORMAT parquet, COMPRESSION zstd);&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ tree exported-data 
exported-data
└── providers
    └── gtfs-de
        └── full
            ├── agency.parquet
            ├── areas.parquet
            ├── attributions.parquet
            ├── booking_rules.parquet
            ├── calendar.parquet
            ├── calendar_dates.parquet
            ├── fare_attributes.parquet
            ├── fare_leg_join_rules.parquet
            ├── fare_leg_rules.parquet
            ├── fare_media.parquet
            ├── fare_products.parquet
            ├── fare_rules.parquet
            ├── fare_transfer_rules.parquet
            ├── feed_info.parquet
            ├── frequencies.parquet
            ├── levels.parquet
            ├── load.sql
            ├── location_group_stops.parquet
            ├── location_groups.parquet
            ├── networks.parquet
            ├── pathways.parquet
            ├── rider_categories.parquet
            ├── route_networks.parquet
            ├── routes.parquet
            ├── schema.sql
            ├── shapes.parquet
            ├── stop_areas.parquet
            ├── stop_times.parquet
            ├── stops.parquet
            ├── timeframes.parquet
            ├── transfers.parquet
            ├── translations.parquet
            └── trips.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not only have the Parquet files been produced, but also the &lt;code&gt;load.sql&lt;/code&gt; and &lt;code&gt;schema.sql&lt;/code&gt; scripts. We don’t need those for our use case though, so we can either ignore or delete them.&lt;/p&gt;
&lt;h3 id=&quot;heading-upload-parquet-files-to-object-storage&quot;&gt;Upload Parquet files to Object Storage&lt;/h3&gt;
&lt;p&gt;The next step is to upload the newly created Parquet files to your Object Storage provider. In this example, I have already created a new R2 bucket, and also assigned a custom domain to it (&lt;code&gt;data.openrailway.dev&lt;/code&gt;), which is possible with a few clicks because the domain itself is hosted on Cloudflare as well.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;R2 offers free egress, compared to S3, which for example charges $0.09 per GB in us-east-1. For storage, S3 charges $0.023 / GB / month, whereas R2 charges $0.015 / GB / month. Egress can incur large costs if your data is used a lot (reminder: our use case is providing public transportation datasets for analysis, so we don’t really know how many people will use them, and how). I’d recommend making an explicit decision here based on your use case.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;If you want to enable browser-based clients as well, make sure that you also set up a &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/buckets/cors/&quot;&gt;CORS policy&lt;/a&gt;. With R2, this can be configured via API or via the Dashboard, and looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;[
  {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;AllowedOrigins&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;http://localhost:3000&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;http://localhost:5174&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://sql-workbench.com&quot;&lt;/span&gt;
    ],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;AllowedMethods&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;GET&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;HEAD&quot;&lt;/span&gt;
    ],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;AllowedHeaders&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;*&quot;&lt;/span&gt;
    ],
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;ExposeHeaders&quot;&lt;/span&gt;: [
      &lt;span class=&quot;hljs-string&quot;&gt;&quot;ETag&quot;&lt;/span&gt;
    ]
  }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables local development, and also &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;SQL Workbench&lt;/a&gt; as a browser-based query interface (more on this later). Add your own domains as appropriate.&lt;/p&gt;
&lt;p&gt;Once you’ve prepared this, you can go ahead and upload the data to your Object Storage bucket, either with a tool like &lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/r2/examples/rclone/&quot;&gt;rclone&lt;/a&gt; or directly via the Dashboard of R2 or the S3 Console.&lt;/p&gt;
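&lt;p&gt;With rclone, the upload can look roughly like this (a sketch assuming you have already configured an rclone remote named &lt;code&gt;r2&lt;/code&gt; for your account, as described in the linked Cloudflare docs):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Upload the exported Parquet files to the R2 bucket (remote &quot;r2&quot; configured beforehand)
rclone copy exported-data/providers/gtfs-de/full r2:YOUR-BUCKET-NAME/providers/gtfs-de/full \
  --include &quot;*.parquet&quot; --progress
&lt;/code&gt;&lt;/pre&gt;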
&lt;h3 id=&quot;heading-creating-the-view-database&quot;&gt;Creating the “view database”&lt;/h3&gt;
&lt;p&gt;As the data is now available via HTTP(S), we can now create a new DuckDB database that contains the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/sql/statements/create_view.html&quot;&gt;VIEWs&lt;/a&gt; that reference the data, namely the Parquet files, on Object Storage. This leverages the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/data/parquet/overview#read_parquet-function&quot;&gt;read_parquet()&lt;/a&gt; function, with which you can load remote data via HTTP(S).&lt;/p&gt;
&lt;p&gt;An example for our use case (&lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com/#queries=v0,SELECT-*-FROM-read_parquet\(&apos;https%3A%2F%2Fdata.openrailway.dev%2Fproviders%2Fgtfs%20de%2Ffull%2Fagency.parquet&apos;\)~&quot;&gt;try it with SQL Workbench&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/agency.parquet&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output will look like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1747657609014/4188a86b-2a4e-402b-a290-b94343fe1be5.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We now prepare a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/queries/providers/gtfs-de/full/create_view_database.sql&quot;&gt;SQL script&lt;/a&gt; that will create a view for each remote Parquet file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; agency &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/agency.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; areas &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/areas.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; attributions &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/attributions.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; booking_rules &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/booking_rules.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; calendar &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/calendar.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; calendar_dates &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/calendar_dates.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_leg_join_rules &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_leg_join_rules.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_leg_rules &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_leg_rules.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_media &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_media.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_products &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_products.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_transfer_rules &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_transfer_rules.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; feed_info &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/feed_info.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; frequencies &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/frequencies.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;levels&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/levels.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; location_groups &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/location_groups.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; networks &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/networks.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; rider_categories &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/rider_categories.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; shapes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/shapes.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; timeframes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/timeframes.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; translations &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/translations.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_attributes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_attributes.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; routes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/routes.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; route_networks &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/route_networks.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; stops &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/stops.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; stop_areas &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/stop_areas.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; trips &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/trips.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; fare_rules &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/fare_rules.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; location_group_stops &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/location_group_stops.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; pathways &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/pathways.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; stop_times &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/stop_times.parquet&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; transfers &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_parquet(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/transfers.parquet&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create the actual “view database”, we can use the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;duckdb exported-data/providers/gtfs-de/full/database.duckdb &amp;lt; queries/providers/gtfs-de/full/create_view_database.sql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a new database in the &lt;code&gt;exported-data/providers/gtfs-de/full&lt;/code&gt; directory named &lt;code&gt;database.duckdb&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We can now inspect its size:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ ls -la exported-data/providers/gtfs-de/full/*.duckdb
-rw-r--r--@ 1 tmueller  staff  274432 18 Mai 14:32 exported-data/providers/gtfs-de/full/database.duckdb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This means the “view database” is &lt;strong&gt;ONLY 268 KB in size&lt;/strong&gt;!&lt;/p&gt;
&lt;h3 id=&quot;heading-testing-the-view-database&quot;&gt;Testing the “view database”&lt;/h3&gt;
&lt;p&gt;We can now test if it works locally by doing a &lt;code&gt;SHOW TABLES&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ duckdb exported-data/providers/gtfs-de/full/database.duckdb
v1.2.2 7c039464e4
Enter &lt;span class=&quot;hljs-string&quot;&gt;&quot;.help&quot;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; usage hints.
D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, we can run some test queries on the remote data, and show the time they take:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ duckdb exported-data/providers/gtfs-de/full/database.duckdb
v1.2.2 7c039464e4
Enter &lt;span class=&quot;hljs-string&quot;&gt;&quot;.help&quot;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; usage hints.
D .timer on
D SELECT count(*) FROM trips;
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1630671     │
│ (1.63 million) │
└────────────────┘
Run Time (s): real 0.554 user 0.175864 sys 0.089233
D SELECT count(*) FROM stops;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    678388    │
└──────────────┘
Run Time (s): real 0.336 user 0.006173 sys 0.002544
D SELECT count(*) FROM stop_times;
┌─────────────────┐
│  count_star()   │
│      int64      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘
Run Time (s): real 0.367 user 0.051620 sys 0.009298
D SELECT * FROM stops LIMIT 10;
┌─────────┬───────────┬──────────────────────┬───────────────┬───────────┬───────────┬───┬──────────┬───────────────┬────────────────┬───────────────┬─────────────────────┬──────────┬───────────────┐
│ stop_id │ stop_code │      stop_name       │ tts_stop_name │ stop_desc │ stop_lat  │ … │ stop_url │ location_type │ parent_station │ stop_timezone │ wheelchair_boarding │ level_id │ platform_code │
│ varchar │  varchar  │       varchar        │    varchar    │  varchar  │  double   │   │ varchar  │     int32     │    varchar     │    varchar    │        int32        │ varchar  │    varchar    │
├─────────┼───────────┼──────────────────────┼───────────────┼───────────┼───────────┼───┼──────────┼───────────────┼────────────────┼───────────────┼─────────────────────┼──────────┼───────────────┤
│ 199269  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Goud…  │ NULL          │ NULL      │  51.87225 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 358723  │ NULL      │ &apos;&lt;/span&gt;s-Heerenberg Goud…  │ NULL          │ NULL      │  51.87228 │ … │ NULL     │          NULL │ 199269         │ NULL          │                NULL │ NULL     │ NULL          │
│ 536901  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Mole…  │ NULL          │ NULL      │  51.87632 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 532065  │ NULL      │ &apos;&lt;/span&gt;s-Heerenberg Mole…  │ NULL          │ NULL      │  51.87632 │ … │ NULL     │          NULL │ 536901         │ NULL          │                NULL │ NULL     │ NULL          │
│ 666544  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Muzi…  │ NULL          │ NULL      │  51.87479 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 269106  │ NULL      │ &apos;&lt;/span&gt;s-Heerenberg Muzi…  │ NULL          │ NULL      │ 51.874844 │ … │ NULL     │          NULL │ 666544         │ NULL          │                NULL │ NULL     │ NULL          │
│ 540312  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Muzi…  │ NULL          │ NULL      │ 51.874737 │ … │ NULL     │          NULL │ 666544         │ NULL          │                NULL │ NULL     │ NULL          │
│ 615572  │ NULL      │ &apos;&lt;/span&gt;s-Hertogenbosch     │ NULL          │ NULL      │  51.69054 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 525421  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Hertogenbosch     │ NULL          │ NULL      │  51.69054 │ … │ NULL     │          NULL │ 615572         │ NULL          │                NULL │ NULL     │ NULL          │
│ 198290  │ NULL      │ 12 Apostel           │ NULL          │ NULL      │ 52.052963 │ … │ NULL     │          NULL │ 655876         │ NULL          │                NULL │ NULL     │ NULL          │
├─────────┴───────────┴──────────────────────┴───────────────┴───────────┴───────────┴───┴──────────┴───────────────┴────────────────┴───────────────┴─────────────────────┴──────────┴───────────────┤
│ 10 rows                                                                                                                                                                       15 columns (13 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.909 user 0.214754 sys 0.023443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you might have noticed, &lt;strong&gt;we counted the records of a remote Parquet file with more than 32 million records in less than 0.4 seconds&lt;/strong&gt;! This is possible because DuckDB can answer a &lt;code&gt;count(*)&lt;/code&gt; query from the Parquet metadata alone, fetched via HTTP range requests, instead of downloading the whole file.&lt;/p&gt;
&lt;p&gt;We can also switch on the &lt;code&gt;enable_http_logging&lt;/code&gt; flag to see the HTTP(S) requests performed by DuckDB when querying the remote files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;D SET enable_http_logging = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;;
Run Time (s): real 0.001 user 0.000300 sys 0.000365
D SELECT * FROM stops LIMIT 10;
HTTP Request:
  HEAD /providers/gtfs-de/full/stops.parquet
  Accept: */*
  Host: data.openrailway.dev
  User-Agent: cpp-httplib/0.14.3

HTTP Response:
  200 OK 
  Accept-Ranges: bytes
  alt-svc: h3=&lt;span class=&quot;hljs-string&quot;&gt;&quot;:443&quot;&lt;/span&gt;; ma=86400
  cf-cache-status: DYNAMIC
  CF-RAY: 9423a96b0dd4d38c-FRA
  Connection: keep-alive
  Content-Length: 14580912
  Date: Sun, 18 May 2025 12:40:15 GMT
  ETag: &lt;span class=&quot;hljs-string&quot;&gt;&quot;5d654fe45888665fcaff7f2831180088&quot;&lt;/span&gt;
  Last-Modified: Sun, 18 May 2025 10:42:40 GMT
  NEL: {&lt;span class=&quot;hljs-string&quot;&gt;&quot;success_fraction&quot;&lt;/span&gt;:0,&lt;span class=&quot;hljs-string&quot;&gt;&quot;report_to&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;cf-nel&quot;&lt;/span&gt;,&lt;span class=&quot;hljs-string&quot;&gt;&quot;max_age&quot;&lt;/span&gt;:604800}
  Report-To: {&lt;span class=&quot;hljs-string&quot;&gt;&quot;endpoints&quot;&lt;/span&gt;:[{&lt;span class=&quot;hljs-string&quot;&gt;&quot;url&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;https:\/\/a.nel.cloudflare.com\/report\/v4?s=8J9y/6mVjzeKKD/RHRLvPc53+xxuxj+n8m6FugMHTbB0N8cD0l18HfCieD6GeWrGlFB1vCMf7PnzSxj8oCYJYgsQHvcJCxnw5OEIkgv9C4FWDrkv1jgTKj3N4IfoIwJGLMqAMAQ+LaEssRo5FN9DPSmWsg==&quot;&lt;/span&gt;}],&lt;span class=&quot;hljs-string&quot;&gt;&quot;group&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;cf-nel&quot;&lt;/span&gt;,&lt;span class=&quot;hljs-string&quot;&gt;&quot;max_age&quot;&lt;/span&gt;:604800}
  Server: cloudflare
  server-timing: cfL4;desc=&lt;span class=&quot;hljs-string&quot;&gt;&quot;?proto=TCP&amp;amp;rtt=9270&amp;amp;min_rtt=8571&amp;amp;rtt_var=3023&amp;amp;sent=6&amp;amp;recv=7&amp;amp;lost=0&amp;amp;retrans=0&amp;amp;sent_bytes=2869&amp;amp;recv_bytes=556&amp;amp;delivery_rate=469577&amp;amp;cwnd=33&amp;amp;unsent_bytes=0&amp;amp;cid=6800b4b191d1dbe2&amp;amp;ts=165&amp;amp;x=0&quot;&lt;/span&gt;
  Vary: Accept-Encoding

HTTP Request:
  GET /providers/gtfs-de/full/stops.parquet
  Accept: */*
  Host: data.openrailway.dev
  Range: bytes=14580904-14580911
  User-Agent: cpp-httplib/0.14.3

HTTP Response:
  206 Partial Content 
  Accept-Ranges: bytes
  alt-svc: h3=&lt;span class=&quot;hljs-string&quot;&gt;&quot;:443&quot;&lt;/span&gt;; ma=86400
  cf-cache-status: DYNAMIC
  CF-RAY: 9423a96bbf92d38c-FRA
  Connection: keep-alive
  Content-Length: 8
  Content-Range: bytes 14580904-14580911/14580912
  Date: Sun, 18 May 2025 12:40:16 GMT
  ETag: &lt;span class=&quot;hljs-string&quot;&gt;&quot;5d654fe45888665fcaff7f2831180088&quot;&lt;/span&gt;
  Last-Modified: Sun, 18 May 2025 10:42:40 GMT
  NEL: {&lt;span class=&quot;hljs-string&quot;&gt;&quot;success_fraction&quot;&lt;/span&gt;:0,&lt;span class=&quot;hljs-string&quot;&gt;&quot;report_to&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;cf-nel&quot;&lt;/span&gt;,&lt;span class=&quot;hljs-string&quot;&gt;&quot;max_age&quot;&lt;/span&gt;:604800}
  Report-To: {&lt;span class=&quot;hljs-string&quot;&gt;&quot;endpoints&quot;&lt;/span&gt;:[{&lt;span class=&quot;hljs-string&quot;&gt;&quot;url&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;https:\/\/a.nel.cloudflare.com\/report\/v4?s=095gOTFBeOfulP4iVVnDyvqk/FLW3exKsuA0e2t3/kNqb2QNnBtswCgTkKoGlL3F8qnNYZjdlljl2wTDb1+KWcrhAYrVAC5sUOHEnjkVzYixd1OQq/WFlHh3/u9HRTqo+UbBxBJKXIS+yxJHwlIHZDzFXQ==&quot;&lt;/span&gt;}],&lt;span class=&quot;hljs-string&quot;&gt;&quot;group&quot;&lt;/span&gt;:&lt;span class=&quot;hljs-string&quot;&gt;&quot;cf-nel&quot;&lt;/span&gt;,&lt;span class=&quot;hljs-string&quot;&gt;&quot;max_age&quot;&lt;/span&gt;:604800}
  Server: cloudflare
  server-timing: cfL4;desc=&lt;span class=&quot;hljs-string&quot;&gt;&quot;?proto=TCP&amp;amp;rtt=9152&amp;amp;min_rtt=8571&amp;amp;rtt_var=1901&amp;amp;sent=9&amp;amp;recv=10&amp;amp;lost=0&amp;amp;retrans=0&amp;amp;sent_bytes=4253&amp;amp;recv_bytes=737&amp;amp;delivery_rate=469577&amp;amp;cwnd=35&amp;amp;unsent_bytes=0&amp;amp;cid=6800b4b191d1dbe2&amp;amp;ts=245&amp;amp;x=0&quot;&lt;/span&gt;
  Vary: Accept-Encoding

// SNIP //

┌─────────┬───────────┬──────────────────────┬───────────────┬───────────┬───────────┬───┬──────────┬───────────────┬────────────────┬───────────────┬─────────────────────┬──────────┬───────────────┐
│ stop_id │ stop_code │      stop_name       │ tts_stop_name │ stop_desc │ stop_lat  │ … │ stop_url │ location_type │ parent_station │ stop_timezone │ wheelchair_boarding │ level_id │ platform_code │
│ varchar │  varchar  │       varchar        │    varchar    │  varchar  │  double   │   │ varchar  │     int32     │    varchar     │    varchar    │        int32        │ varchar  │    varchar    │
├─────────┼───────────┼──────────────────────┼───────────────┼───────────┼───────────┼───┼──────────┼───────────────┼────────────────┼───────────────┼─────────────────────┼──────────┼───────────────┤
│ 199269  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Goud…  │ NULL          │ NULL      │  51.87225 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 358723  │ NULL      │ &apos;&lt;/span&gt;s-Heerenberg Goud…  │ NULL          │ NULL      │  51.87228 │ … │ NULL     │          NULL │ 199269         │ NULL          │                NULL │ NULL     │ NULL          │
│ 536901  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Mole…  │ NULL          │ NULL      │  51.87632 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 532065  │ NULL      │ &apos;&lt;/span&gt;s-Heerenberg Mole…  │ NULL          │ NULL      │  51.87632 │ … │ NULL     │          NULL │ 536901         │ NULL          │                NULL │ NULL     │ NULL          │
│ 666544  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Muzi…  │ NULL          │ NULL      │  51.87479 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 269106  │ NULL      │ &apos;&lt;/span&gt;s-Heerenberg Muzi…  │ NULL          │ NULL      │ 51.874844 │ … │ NULL     │          NULL │ 666544         │ NULL          │                NULL │ NULL     │ NULL          │
│ 540312  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Heerenberg Muzi…  │ NULL          │ NULL      │ 51.874737 │ … │ NULL     │          NULL │ 666544         │ NULL          │                NULL │ NULL     │ NULL          │
│ 615572  │ NULL      │ &apos;&lt;/span&gt;s-Hertogenbosch     │ NULL          │ NULL      │  51.69054 │ … │ NULL     │             1 │ NULL           │ NULL          │                NULL │ NULL     │ NULL          │
│ 525421  │ NULL      │ &lt;span class=&quot;hljs-string&quot;&gt;&apos;s-Hertogenbosch     │ NULL          │ NULL      │  51.69054 │ … │ NULL     │          NULL │ 615572         │ NULL          │                NULL │ NULL     │ NULL          │
│ 198290  │ NULL      │ 12 Apostel           │ NULL          │ NULL      │ 52.052963 │ … │ NULL     │          NULL │ 655876         │ NULL          │                NULL │ NULL     │ NULL          │
├─────────┴───────────┴──────────────────────┴───────────────┴───────────┴───────────┴───┴──────────┴───────────────┴────────────────┴───────────────┴─────────────────────┴──────────┴───────────────┤
│ 10 rows                                                                                                                                                                       15 columns (13 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.578 user 0.026387 sys 0.018349&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see the &lt;a target=&quot;_blank&quot; href=&quot;https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Range_requests&quot;&gt;range requests&lt;/a&gt; in action: DuckDB first sends a &lt;code&gt;HEAD&lt;/code&gt; request, followed by multiple &lt;code&gt;GET&lt;/code&gt; range requests (the logs are truncated for brevity).&lt;/p&gt;
&lt;p&gt;To get even more insight, we can prefix the query with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; and have a look at how the query is executed, along with some stats:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;D EXPLAIN ANALYZE SELECT * FROM stops LIMIT 10;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM stops LIMIT 10;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││            &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt;: 2.5 MiB            ││
││            out: 0 bytes           ││
││              &lt;span class=&quot;hljs-comment&quot;&gt;#HEAD: 1             ││&lt;/span&gt;
││              &lt;span class=&quot;hljs-comment&quot;&gt;#GET: 3              ││&lt;/span&gt;
││              &lt;span class=&quot;hljs-comment&quot;&gt;#PUT: 0              ││&lt;/span&gt;
││              &lt;span class=&quot;hljs-comment&quot;&gt;#POST: 0             ││&lt;/span&gt;
││             &lt;span class=&quot;hljs-comment&quot;&gt;#DELETE: 0            ││&lt;/span&gt;
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.490s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      STREAMING_LIMIT      │
│    ────────────────────   │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         Function:         │
│        READ_PARQUET       │
│                           │
│        Projections:       │
│          stop_id          │
│         stop_code         │
│         stop_name         │
│       tts_stop_name       │
│         stop_desc         │
│          stop_lat         │
│          stop_lon         │
│          zone_id          │
│          stop_url         │
│       location_type       │
│       parent_station      │
│       stop_timezone       │
│    wheelchair_boarding    │
│          level_id         │
│       platform_code       │
│                           │
│         4096 Rows         │
│          (0.22s)          │
└───────────────────────────┘
Run Time (s): real 0.491 user 0.194019 sys 0.022184
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-uploading-the-view-database-to-object-storage&quot;&gt;Uploading the “view database” to Object Storage&lt;/h3&gt;
&lt;p&gt;As the final step, to be able to use the DuckDB database file containing the views on the remote Parquet files, we need to upload the database file itself to Object Storage as well.&lt;/p&gt;
&lt;p&gt;For our example, we’ll upload it to the same directory as the Parquet files, meaning that it’s available at:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://data.openrailway.dev/providers/gtfs-de/full/database.duckdb&quot;&gt;https://data.openrailway.dev/providers/gtfs-de/full/database.duckdb&lt;/a&gt;&lt;/p&gt;
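&lt;p&gt;How the upload itself is done depends on your Object Storage provider. As a minimal sketch, with any S3-compatible storage you could use the AWS CLI (the bucket name and endpoint URL below are placeholders for your own setup):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Hypothetical example: copy the database file to an S3-compatible bucket.
# Replace the bucket name and endpoint URL with the values of your own setup.
aws s3 cp exported-data/providers/gtfs-de/full/database.duckdb \
  s3://your-bucket/providers/gtfs-de/full/database.duckdb \
  --endpoint-url https://your-s3-compatible-endpoint.example.com
&lt;/code&gt;&lt;/pre&gt;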
&lt;h3 id=&quot;heading-using-the-duckdb-database&quot;&gt;Using the DuckDB database&lt;/h3&gt;
&lt;p&gt;DuckDB is able to &lt;code&gt;ATTACH&lt;/code&gt; remote databases via &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/guides/network_cloud_storage/duckdb_over_https_or_s3#attaching-to-a-database-over-https&quot;&gt;HTTP(S)&lt;/a&gt; or the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/stable/guides/network_cloud_storage/duckdb_over_https_or_s3#attaching-to-a-database-over-the-s3-api&quot;&gt;S3 API&lt;/a&gt;, meaning that you can use it either in the local DuckDB CLI or in another tool, for example an in-browser one like &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;SQL Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ duckdb                                                     
v1.2.2 7c039464e4
Enter &lt;span class=&quot;hljs-string&quot;&gt;&quot;.help&quot;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; usage hints.
Connected to a transient in-memory database.
Use &lt;span class=&quot;hljs-string&quot;&gt;&quot;.open FILENAME&quot;&lt;/span&gt; to reopen on a persistent database.
D ATTACH &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.openrailway.dev/providers/gtfs-de/full/database.duckdb&apos;&lt;/span&gt; as gtfs_de_full (READ_ONLY);
D USE gtfs_de_full;
D SHOW TABLES;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘
D SELECT count(*)::INT AS cnt FROM gtfs_de_full.stop_times;
┌─────────────────┐
│       cnt       │
│      int32      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To query the data in your browser, just click on this link to &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com/#queries=v0,ATTACH-&apos;https%3A%2F%2Fdata.openrailway.dev%2Fproviders%2Fgtfs%20de%2Ffull%2Fdatabase.duckdb&apos;-as-gtfs_de_full-\(READ_ONLY\)~,SELECT-count\(*\)-FROM-gtfs_de_full.stop_times~&quot;&gt;SQL Workbench&lt;/a&gt;. The result will look like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1747660044340/bb26a63a-c9c2-4e90-b32a-0e8a8d6d1915.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;In this blog post, we showed that DuckDB database files containing views on static Parquet files are a very lightweight way to provide a single, highly available entry point to datasets that need to be queryable with SQL.&lt;/p&gt;
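&lt;p&gt;In essence, each view in the published database file is just a thin wrapper around the corresponding remote Parquet file. A minimal sketch of such a view definition, using the &lt;code&gt;stops&lt;/code&gt; table as an example, could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Simplified sketch: define a view that points at a remote Parquet file
duckdb exported-data/providers/gtfs-de/full/database.duckdb -c &quot;
CREATE OR REPLACE VIEW stops AS
SELECT * FROM read_parquet(&apos;https://data.openrailway.dev/providers/gtfs-de/full/stops.parquet&apos;);
&quot;
&lt;/code&gt;&lt;/pre&gt;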
&lt;p&gt;This post intentionally doesn’t cover authentication and authorization, which will be the topic of a subsequent blog post.&lt;/p&gt;
</content:encoded><category>duckDB</category><category>Data-lake</category><category>analytics</category><author>Tobias Müller</author></item><item><title>Handling GTFS data with DuckDB</title><link>https://tobilg.com/posts/handling-gtfs-data-with-duckdb/</link><guid isPermaLink="true">https://tobilg.com/posts/handling-gtfs-data-with-duckdb/</guid><description>The General Transit Feed Specification (GTFS) is a standardized, open data format for public transportation schedules and geographic information. In practice, a GTFS feed is simply a ZIP archive of text (CSV) tables - such as stops.txt, routes.txt, a...</description><pubDate>Fri, 16 May 2025 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;strong&gt;General Transit Feed Specification&lt;/strong&gt; (&lt;a target=&quot;_blank&quot; href=&quot;https://gtfs.org/documentation/schedule/reference/&quot;&gt;GTFS&lt;/a&gt;) is a standardized, open data format for public transportation schedules and geographic information. In practice, a GTFS feed is simply a ZIP archive of text (CSV) tables - such as &lt;code&gt;stops.txt&lt;/code&gt;, &lt;code&gt;routes.txt&lt;/code&gt;, and &lt;code&gt;trips.txt&lt;/code&gt; - that together describe an agency’s routes, stops, schedules, and fare rules.&lt;/p&gt;
&lt;p&gt;By following this common schema, transit agencies can publish their data so that any GTFS-compatible application (trip planners, mapping tools, or analytics pipelines) can consume it easily. Because GTFS is an open standard, it has become the go-to format for thousands of transit agencies worldwide.&lt;/p&gt;
&lt;p&gt;This broad adoption means developers can integrate multiple feeds into a single pipeline without writing custom parsers. For example, many journey-planning apps and mapping services rely on GTFS feeds to deliver route and schedule information to users.&lt;/p&gt;
&lt;p&gt;The specification even separates static and dynamic data: the &lt;strong&gt;Schedule&lt;/strong&gt; portion provides planned timetables (in CSV), while the &lt;strong&gt;Realtime&lt;/strong&gt; extension (using Protocol Buffers) adds live trip updates, vehicle positions, and service alerts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The main benefits of GTFS are:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It defines a consistent schema (stops, routes, trips, etc.), so users and developers only need to learn one format for all agencies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interoperability: Any GTFS-compliant tool can read feeds from any transit provider, enabling seamless integration across different platforms and systems&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Plain-text and open: The format is based on simple CSV files, which are easy to generate, ingest (e.g. via SQL or pandas), and maintain without proprietary software&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Static + real-time: GTFS distinguishes fixed schedules from live updates. The static GTFS-Schedule files cover routes and timetables, while GTFS-Realtime (Protocol Buffers) provides real-time trip updates, vehicle positions, and alerts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ecosystem and tooling: A large community supports GTFS with open documentation and many libraries/validators (e.g. GTFS-kit in Python), so developers can leverage existing tools rather than building custom parsers&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-creating-a-duckdb-database-for-gtfs-schedule-data&quot;&gt;Creating a DuckDB database for GTFS Schedule data&lt;/h2&gt;
&lt;p&gt;As the GTFS Schedule standard uses ZIP-compressed CSV files, it’s not possible to directly analyze the datasets out of the box with most tools.&lt;/p&gt;
&lt;p&gt;Also, a lot of GTFS datasets don’t include all &lt;a target=&quot;_blank&quot; href=&quot;https://gtfs.org/documentation/schedule/reference/#dataset-files&quot;&gt;dataset files&lt;/a&gt;, as some of the data files are optional; additionally, not all fields in each of those files are required:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Name&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Presence&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;agency.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Required&lt;/td&gt;&lt;td&gt;Transit agencies with service represented in this dataset.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;stops.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Conditionally Required&lt;/td&gt;&lt;td&gt;Stops where vehicles pick up or drop off riders. Also defines stations and station entrances. Optional if demand-responsive zones are defined in &lt;strong&gt;locations.geojson&lt;/strong&gt;, required otherwise.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;routes.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Required&lt;/td&gt;&lt;td&gt;Transit routes. A route is a group of trips that are displayed to riders as a single service.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;trips.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Required&lt;/td&gt;&lt;td&gt;Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;stop_times.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Required&lt;/td&gt;&lt;td&gt;Times that a vehicle arrives at and departs from stops for each trip.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;calendar.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Conditionally Required&lt;/td&gt;&lt;td&gt;Service dates specified using a weekly schedule with start and end dates. Required unless all dates of service are defined in &lt;strong&gt;calendar_dates.txt&lt;/strong&gt;, optional otherwise.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;calendar_dates.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Conditionally Required&lt;/td&gt;&lt;td&gt;Exceptions for the services defined in &lt;strong&gt;calendar.txt&lt;/strong&gt;. Required if &lt;strong&gt;calendar.txt&lt;/strong&gt; is omitted (must then contain all dates of service), optional otherwise.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;fare_attributes.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Fare information for a transit agency’s routes.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;fare_rules.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Rules to apply fares for itineraries.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;timeframes.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Date and time periods to use in fare rules for fares that depend on date/time factors. Forbidden if &lt;code&gt;network_id&lt;/code&gt; exists in &lt;strong&gt;routes.txt&lt;/strong&gt;, optional otherwise.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shapes.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Rules for mapping vehicle travel paths (route alignments).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;frequencies.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Headway (time between trips) for headway-based service or compressed fixed-schedule service.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;transfers.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Rules for making connections at transfer points between routes.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pathways.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Pathways linking together locations within stations.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;levels.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Conditionally Required&lt;/td&gt;&lt;td&gt;Levels within stations. Required when describing pathways with elevators (&lt;code&gt;pathway_mode=5&lt;/code&gt;), optional otherwise.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;location_groups.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;A group of stops that together indicate locations where a rider may request pickup or drop off.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;location_group_stops.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Rules to assign stops to location groups.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;locations.geojson&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Zones for rider pickup/drop-off requests by on-demand services, represented as GeoJSON polygons.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;booking_rules.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Booking information for rider-requested services.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;translations.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Translations of customer-facing dataset values.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;feed_info.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Conditionally Required&lt;/td&gt;&lt;td&gt;Dataset metadata, including publisher, version, and expiration information. Required if &lt;strong&gt;translations.txt&lt;/strong&gt; is provided, recommended otherwise.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;attributions.txt&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Optional&lt;/td&gt;&lt;td&gt;Dataset attributions.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;p&gt;Taking that into account, the plan for using GTFS Schedule datasets with DuckDB is to create a DuckDB database that contains all tables with all columns. The DDL for the tables is derived from the &lt;a target=&quot;_blank&quot; href=&quot;https://gtfs.org/documentation/schedule/reference/#field-definitions&quot;&gt;official standard’s website&lt;/a&gt;. We want to use the foreign key relationships outlined in the standard, and thus need to make sure that the tables are created in the right order.&lt;/p&gt;
&lt;p&gt;The database creation script can be found at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/queries/create_gtfs_database.sql&quot;&gt;queries/create_gtfs_database.sql&lt;/a&gt;.&lt;/p&gt;
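&lt;p&gt;To give an idea of what the DDL looks like, here’s a simplified, hypothetical excerpt (the actual script creates all tables with every column from the GTFS reference, in dependency order):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Simplified, hypothetical excerpt of the DDL: parent tables are created first,
# so that tables referencing them via foreign keys can be created afterwards
duckdb exported-data/providers/gtfs-de/full/data.duckdb &lt;&lt; &apos;SQL&apos;
CREATE TABLE agency (
  agency_id        VARCHAR PRIMARY KEY,
  agency_name      VARCHAR,
  agency_url       VARCHAR,
  agency_timezone  VARCHAR,
  agency_lang      VARCHAR
);

CREATE TABLE routes (
  route_id         VARCHAR PRIMARY KEY,
  agency_id        VARCHAR,
  route_short_name VARCHAR,
  route_long_name  VARCHAR,
  route_type       INTEGER,
  FOREIGN KEY (agency_id) REFERENCES agency (agency_id)
);
SQL
&lt;/code&gt;&lt;/pre&gt;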
&lt;p&gt;This results in a database that has the following Entity Relationship Diagram (ERD):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1747569246715/9364377a-35a0-42c0-a730-70b0bcda98b9.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-creating-a-duckdb-gtfs-database-from-an-example-dataset&quot;&gt;Creating a DuckDB GTFS database from an example dataset&lt;/h2&gt;
&lt;p&gt;To load some sample dataset, we need to choose one of the many available data sources. For German data, the &lt;a target=&quot;_blank&quot; href=&quot;https://gtfs.de&quot;&gt;gtfs.de&lt;/a&gt; website provides some interesting datasets.&lt;/p&gt;
&lt;p&gt;There’s one dataset that contains the full train and local traffic for Germany: &lt;a target=&quot;_blank&quot; href=&quot;https://gtfs.de/de/feeds/de_full/&quot;&gt;de_full&lt;/a&gt;, and that’s the one we’ll use. It contains 1.6 million trips, 663 thousand stops, and more than 32 million stop times.&lt;/p&gt;
&lt;p&gt;The accompanying GitHub repository &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs&quot;&gt;tobilg/duckdb-gtfs&lt;/a&gt; has the necessary scripts available. You’ll need a locally installed &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/installation/?version=stable&amp;amp;environment=cli&amp;amp;platform=macos&amp;amp;download_method=direct&quot;&gt;DuckDB CLI&lt;/a&gt; to be able to follow the next steps on your machine.&lt;/p&gt;
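&lt;p&gt;If you don’t have the DuckDB CLI installed yet, one way to install it on macOS is via Homebrew (see the linked installation docs for other platforms and methods):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Install the DuckDB CLI on macOS via Homebrew
brew install duckdb
&lt;/code&gt;&lt;/pre&gt;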
&lt;h3 id=&quot;heading-preparations&quot;&gt;Preparations&lt;/h3&gt;
&lt;p&gt;You need to clone the repository from GitHub first:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;git &lt;span class=&quot;hljs-built_in&quot;&gt;clone&lt;/span&gt; git@github.com:tobilg/duckdb-gtfs.git &amp;amp;&amp;amp; &lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; duckdb-gtfs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-downloading-the-dataset&quot;&gt;Downloading the dataset&lt;/h3&gt;
&lt;p&gt;To download the example dataset (218MB zipped, around 1.4GB unzipped), you can run the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/scripts/providers/gtfs-de/full/download_data.sh&quot;&gt;download script&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/providers/gtfs-de/full/download_data.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-loading-the-dataset-in-a-new-gtfs-database&quot;&gt;Loading the dataset in a new GTFS database&lt;/h3&gt;
&lt;p&gt;As outlined above, not all dataset providers use the full set of tables in their datasets, so we’ll have to write a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/scripts/providers/gtfs-de/full/create_database.sh&quot;&gt;custom database creation script&lt;/a&gt; for each provider. In our case, the SQL for the example dataset looks like this (see &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/queries/providers/gtfs-de/full/load_data.sql&quot;&gt;load_data.sql&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; agency (agency_id,agency_name,agency_url,agency_timezone,agency_lang) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/agency.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; attributions (attribution_id,organization_name,is_producer,attribution_url,attribution_email) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/attributions.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; calendar (monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date,service_id) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/calendar.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;, dateformat = &lt;span class=&quot;hljs-string&quot;&gt;&apos;%Y%m%d&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; calendar_dates (service_id,exception_type,&lt;span class=&quot;hljs-string&quot;&gt;&quot;date&quot;&lt;/span&gt;) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/calendar_dates.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;, &lt;span class=&quot;hljs-keyword&quot;&gt;columns&lt;/span&gt; = {
  &lt;span class=&quot;hljs-string&quot;&gt;&apos;service_id&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;VARCHAR&apos;&lt;/span&gt;,
  &lt;span class=&quot;hljs-string&quot;&gt;&apos;exception_type&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;INTEGER&apos;&lt;/span&gt;,
  &lt;span class=&quot;hljs-string&quot;&gt;&apos;date&apos;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&apos;DATE&apos;&lt;/span&gt;
}, dateformat = &lt;span class=&quot;hljs-string&quot;&gt;&apos;%Y%m%d&apos;&lt;/span&gt;, ignore_errors = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; feed_info (feed_publisher_name,feed_publisher_url,feed_lang,feed_version,feed_contact_email,feed_contact_url) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/feed_info.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;, dateformat = &lt;span class=&quot;hljs-string&quot;&gt;&apos;%Y%m%d&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; routes (route_long_name,route_short_name,agency_id,route_type,route_id) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/routes.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; trips (route_id,service_id,trip_id) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/trips.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;, ignore_errors = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; stops (stop_name,parent_station,stop_id,stop_lat,stop_lon,location_type) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/stops.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; stop_times (trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type) 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; read_csv(&lt;span class=&quot;hljs-string&quot;&gt;&apos;source-data/providers/gtfs-de/full/stop_times.txt&apos;&lt;/span&gt;, delim = &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;, header = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;, ignore_errors = &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, run the database creation script (this will take some time depending on your machine):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/providers/gtfs-de/full/create_database.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting DuckDB database file can be found at &lt;code&gt;exported-data/providers/gtfs-de/full/data.duckdb&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&quot;heading-querying-the-new-gtfs-database&quot;&gt;Querying the new GTFS database&lt;/h3&gt;
&lt;p&gt;To query the new GTFS database, you can just use the newly created database file with your local DuckDB CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;duckdb exported-data/providers/gtfs-de/full/data.duckdb
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;v1.2.2 7c039464e4
Enter &lt;span class=&quot;hljs-string&quot;&gt;&quot;.help&quot;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; usage hints.
D show tables;
┌──────────────────────┐
│         name         │
│       varchar        │
├──────────────────────┤
│ agency               │
│ areas                │
│ attributions         │
│ booking_rules        │
│ calendar             │
│ calendar_dates       │
│ fare_attributes      │
│ fare_leg_join_rules  │
│ fare_leg_rules       │
│ fare_media           │
│ fare_products        │
│ fare_rules           │
│ fare_transfer_rules  │
│ feed_info            │
│ frequencies          │
│ levels               │
│ location_group_stops │
│ location_groups      │
│ networks             │
│ pathways             │
│ rider_categories     │
│ route_networks       │
│ routes               │
│ shapes               │
│ stop_areas           │
│ stop_times           │
│ stops                │
│ timeframes           │
│ transfers            │
│ translations         │
│ trips                │
├──────────────────────┤
│       31 rows        │
└──────────────────────┘
D SELECT count(*) FROM stop_times;
┌─────────────────┐
│  count_star()   │
│      int64      │
├─────────────────┤
│    32228711     │
│ (32.23 million) │
└─────────────────┘
D SELECT count(*) FROM stops;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    678388    │
└──────────────┘
D SELECT count(*) FROM trips;
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1630671     │
│ (1.63 million) │
└────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-export-the-data-in-parquet-format&quot;&gt;Export the data in Parquet format&lt;/h3&gt;
&lt;p&gt;If you want to export the data in an optimized format like Parquet, this is very straightforward with DuckDB (as demonstrated with the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-gtfs/blob/main/scripts/providers/gtfs-de/full/export_data.sh&quot;&gt;export_data.sh&lt;/a&gt; script):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;duckdb exported-data/providers/gtfs-de/full/data.duckdb -c &lt;span class=&quot;hljs-string&quot;&gt;&quot;EXPORT DATABASE &apos;exported-data/providers/gtfs-de/full&apos; (FORMAT parquet);&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will export each table as a Parquet file in the &lt;code&gt;exported-data/providers/gtfs-de/full/&lt;/code&gt; directory.&lt;/p&gt;
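&lt;p&gt;If you only need a single table, or want to control the compression codec, you can alternatively use a plain &lt;code&gt;COPY&lt;/code&gt; statement instead of exporting the whole database, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Export a single table as a ZSTD-compressed Parquet file
duckdb exported-data/providers/gtfs-de/full/data.duckdb -c &quot;COPY stop_times TO &apos;exported-data/providers/gtfs-de/full/stop_times.parquet&apos; (FORMAT parquet, COMPRESSION zstd);&quot;
&lt;/code&gt;&lt;/pre&gt;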
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Using DuckDB for storing and analyzing GTFS Schedule data seems like a great choice. For example, the CSV reader of DuckDB can handle a lot of different error scenarios, which are quite common with data providers.&lt;/p&gt;
&lt;p&gt;We successfully created a default GTFS Schedule database that can be loaded with data from different providers via custom loading scripts. These scripts take some effort to create, but since the result is a standardized database schema, doing so can create real value for data-driven organizations.&lt;/p&gt;
&lt;p&gt;Being able to export the data to storage-optimized formats out-of-the-box is also a great benefit of DuckDB. We were able to shrink the 1.4GB unzipped data to 118MB in compressed Parquet files.&lt;/p&gt;
&lt;p&gt;If you want to query the data live and in your browser, you can use &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com/#queries=v0,ATTACH-&apos;https%3A%2F%2Fdata.openrailway.dev%2Fproviders%2Fgtfs%20de%2Ffull%2Fdatabase.duckdb&apos;-as-gtfs_de_full-\(READ_ONLY\)~,SELECT-count\(*\)%3A%3AINT-AS-cnt-FROM-gtfs_de_full.stop_times~&quot;&gt;SQL Workbench&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>gtfs</category><category>duckDB</category><category>data-engineering</category><author>Tobias Müller</author></item><item><title>Cost-efficient event ingestion into Iceberg S3 Tables on AWS</title><link>https://tobilg.com/posts/cost-efficient-event-ingestion-into-iceberg-s3-tables-on-aws/</link><guid isPermaLink="true">https://tobilg.com/posts/cost-efficient-event-ingestion-into-iceberg-s3-tables-on-aws/</guid><description>Amazon S3 Tables was launched on December 3rd 2024, and provides you “storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format”.
While S3 Tables can be queried ...</description><pubDate>Thu, 10 Apr 2025 19:22:36 GMT</pubDate><content:encoded>&lt;p&gt;Amazon S3 Tables was &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/&quot;&gt;launched&lt;/a&gt; on December 3rd 2024, and provides you “storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format”.&lt;/p&gt;
&lt;p&gt;While &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html&quot;&gt;S3 Tables&lt;/a&gt; can be queried with a variety of AWS services, such as Amazon Athena, Amazon EMR, Apache Spark or, even better, &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/query-s3-tables-with-duckdb&quot;&gt;DuckDB&lt;/a&gt;, the ingestion of data is currently pretty limited. You can insert data via SQL queries in Athena, EMR Spark clusters, and as a pretty new feature, &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-firehose.html&quot;&gt;Amazon Data Firehose&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For use cases that involve (mass) event ingestion and also need to be cost-efficient, using Athena would be impractical (and slow), whereas doing so via EMR Spark clusters is very costly. In contrast, &lt;strong&gt;Amazon Data Firehose&lt;/strong&gt;’s &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/firehose/pricing/&quot;&gt;pricing&lt;/a&gt; is strictly pay-per-use for Apache Iceberg tables (not sure why they use a different name for S3 Tables, though):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For Firehose streams that is configured with Apache Iceberg Tables as a destination, you will be billed for the amount of data processed to the destination. Pricing is billed per GB ingested with no 5KB increments. If data processed bytes before delivery is more than the ingested bytes due to custom Lambda processing, then the additional bytes are also billed. Additional bytes are billed at the same rate as shown in Kinesis Data Streams as a source to Apache Iceberg tables as a destination for all sources of ingestion including Direct PUT.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So that’s a pretty sweet deal at $0.075/GB (for the first 250TB/month) one might say!&lt;/p&gt;
&lt;h1 id=&quot;heading-example-use-case&quot;&gt;Example Use Case&lt;/h1&gt;
&lt;p&gt;What would an event ingestion pipeline that lands the data in S3 Tables look like? A typical example use case would be web analytics, where client browsers send requests containing information about the browser, the device used, the user’s location, and potentially many more details to a backend. There can be different event types, such as normal pageviews or tracking events of varying structure.&lt;/p&gt;
&lt;p&gt;The backend then ingests the events, validates, cleans, transforms, and stores them, so that different metrics can be calculated in later steps, for example visitor counts, pageview counts, session length, etc.&lt;/p&gt;
&lt;p&gt;For sending the events, we will use the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats/tree/main/client&quot;&gt;OwnStats Client&lt;/a&gt;, which is a plugin for &lt;a target=&quot;_blank&quot; href=&quot;https://getanalytics.io/&quot;&gt;getanalytics.io&lt;/a&gt;. Alternatively, you can also send sample events via an HTTP client.&lt;/p&gt;
&lt;h1 id=&quot;heading-architectural-implementation-options&quot;&gt;Architectural Implementation Options&lt;/h1&gt;
&lt;p&gt;But how should the actual events be “landed” on AWS, if we consider them as typical JSON payloads? Below you’ll find some architectural variants that achieve the same or similar goals, but with different services involved. As we’d like to optimize for costs, we want to focus on services that are pay-per-use, but also discuss break-even points compared to fixed-price services.&lt;/p&gt;
&lt;h2 id=&quot;heading-variant-1-api-gateway-lambda-and-kinesis-data-firehose-to-s3&quot;&gt;Variant 1: API Gateway, Lambda and Kinesis Data Firehose to S3&lt;/h2&gt;
&lt;p&gt;The usual suspect for event ingestion would be an API created with API Gateway fronting Lambda functions, then publishing the events to a Kinesis Data Firehose stream that forwards the data to S3 Tables, or, more “classically”, to S3 as Hive-partitioned Parquet files:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744197239687/ddaf1a5b-0c60-465a-8f32-7d4bf8da1cf0.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-variant-2-cloudfront-realtime-logs-kinesis-data-streams-amp-firehose-to-s3-via-lambda&quot;&gt;Variant 2: CloudFront, Realtime Logs, Kinesis Data Streams &amp;amp; Firehose to S3 via Lambda&lt;/h2&gt;
&lt;p&gt;CloudFront basically replaces API Gateway here by hosting a 1×1 pixel GIF file to which event payloads can be passed via GET request query strings, which is the same method a lot of web analytics providers (such as Google Analytics) use.&lt;/p&gt;
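&lt;p&gt;Such a tracking call is just a regular GET request for the pixel, with the event data encoded in the query string. A purely hypothetical example (the distribution domain, path and parameter names are made up for illustration) could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Hypothetical tracking request: the event data travels as query string parameters
curl &quot;https://d1234example.cloudfront.net/hello.gif?event=pageview&amp;amp;url=https%3A%2F%2Fexample.com%2F&amp;amp;ref=https%3A%2F%2Fexample.org%2F&quot;
&lt;/code&gt;&lt;/pre&gt;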
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html&quot;&gt;CloudFront Realtime Logs&lt;/a&gt; forward a 100% “sample” of the logs containing the actual event payloads to a Kinesis Data Stream, which in turn feeds the events to a Kinesis Data Firehose stream. The Firehose stream uses a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html&quot;&gt;processing Lambda function&lt;/a&gt; to decode the event payloads into a tabular form and to add Hive partitioning info back to the stream. Firehose then writes the data in batches to S3.&lt;/p&gt;
&lt;p&gt;Unfortunately, it’s currently not possible for the Realtime Logs to write directly to Kinesis Data Firehose instead of going through Kinesis Data Streams first, which imposes a &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/kinesis/data-streams/pricing/&quot;&gt;fixed cost&lt;/a&gt; “tax” of $0.015/hour on this architecture, i.e. at least $10.80/month.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744198048167/1f790d99-a8e9-4c7a-93a9-74c22d0bf548.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This architecture is implemented at &lt;a target=&quot;_blank&quot; href=&quot;https://ownstats.com&quot;&gt;OwnStats&lt;/a&gt;, which is also further &lt;a target=&quot;_blank&quot; href=&quot;https://docs.ownstats.com/architecture/introduction/&quot;&gt;documented&lt;/a&gt; and available on &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats&quot;&gt;GitHub&lt;/a&gt; as well.&lt;/p&gt;
&lt;h2 id=&quot;heading-variant-3-cloudfront-cloudwatch-logs-kinesis-data-firehose-to-s3-via-lambda&quot;&gt;Variant 3: CloudFront, CloudWatch Logs, Kinesis Data Firehose to S3 via Lambda&lt;/h2&gt;
&lt;p&gt;In this architecture variant, the CloudFront Realtime Logs and the Kinesis Data Stream are replaced by a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AWS-logs-and-resource-policy.html#AWS-logs-infrastructure-V2-Firehose&quot;&gt;CloudWatch Logs Delivery&lt;/a&gt;, which forwards the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/standard-logging.html#send-logs-kinesis&quot;&gt;standard logs&lt;/a&gt; (v2!) from CloudFront to the Kinesis Data Firehose.&lt;/p&gt;
&lt;p&gt;This gets rid of the fixed costs for the Kinesis Data Stream (as outlined above), and replaces them with pay-per-use &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/cloudwatch/pricing/&quot;&gt;pricing&lt;/a&gt; of $0.25/GB for the Logs Delivery. The break-even point against the $10.80/month fixed cost is roughly $10.80 / $0.25 ≈ 43GB per month, so if you ingest less than around 40GB/month of events, you’ll definitely spend less with this variant.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744199388067/ae59ff4f-d0ec-4134-8397-7bd02f10e3a7.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The data doesn’t land in S3 as Hive-partitioned Parquet files in this case, but in the self-optimizing Apache Iceberg format in an S3 Table. The processing Lambda function is updated to support the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-format-input-record-different.html#apache-iceberg-route-lambda&quot;&gt;routing of the events&lt;/a&gt; to the correct S3 Table, as well as parsing the raw logs from CloudFront and transforming them into the right tabular format.&lt;/p&gt;
&lt;h3 id=&quot;heading-caveats&quot;&gt;Caveats&lt;/h3&gt;
&lt;p&gt;There are currently multiple downsides of this approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There is no complete CloudFormation support for S3 Tables, meaning that you can use it to manage &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-s3tables-tablebucket.html&quot;&gt;S3 Table Buckets&lt;/a&gt;, but not &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-namespace-create.html&quot;&gt;Namespaces&lt;/a&gt; or actual &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-create.html&quot;&gt;S3 Tables&lt;/a&gt;. This works only with the CLI, SDKs or the API. This means that we can’t deploy this as one single stack, but have to take a multi-step/stack approach with manual interactions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The forwarding of the data from the Kinesis Data Firehose to the S3 Table only works if a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/lake-formation/latest/dg/initial-lf-config.html#create-data-lake-admin&quot;&gt;LakeFormation Data Lake Administrator has been set up&lt;/a&gt;. Furthermore, the IAM role you deploy the CloudFormation stacks with needs to have the correct &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLakeFormationDataAdmin.html&quot;&gt;permissions&lt;/a&gt; / &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/lake-formation/latest/dg/permissions-reference.html#lf-managed-policies&quot;&gt;managed policy&lt;/a&gt; as well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When creating the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lakeformation-permissions.html&quot;&gt;LakeFormation Permission&lt;/a&gt; via CloudFormation, there seems to be a strange bug: even when you define a &lt;em&gt;DependsOn&lt;/em&gt; dependency so that the permission is created before the Kinesis Data Firehose resource (which references the actual Namespace and S3 Table), you’ll see the following error:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Role arn:aws:iam::112233445566:role/rolename is not authorized to perform: glue:GetTable for the given table or the table does not exist&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If the Kinesis Data Firehose resource is created in a subsequent CloudFormation stack, it works. This seems to be some kind of race condition in the CloudFormation service.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S3 Tables currently only supports the Apache Iceberg primitive types as per the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/cli/latest/reference/s3tables/create-table.html&quot;&gt;docs&lt;/a&gt;. As we’d like to store and query JSON payloads, it’d be much nicer if it also supported semi-structured types like &lt;code&gt;variant&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;S3 Tables supports all Apache Iceberg primitive types. For more information, see the &lt;a target=&quot;_blank&quot; href=&quot;https://iceberg.apache.org/spec/#primitive-types&quot;&gt;Apache Iceberg documentation.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Nevertheless, we want to move forward to the implementation, while taking the caveats into account.&lt;/p&gt;
&lt;h1 id=&quot;heading-implementation-of-the-event-ingestion-pipeline&quot;&gt;Implementation of the Event Ingestion Pipeline&lt;/h1&gt;
&lt;p&gt;As &lt;strong&gt;Variant 1&lt;/strong&gt; has been implemented by many other people already, and &lt;strong&gt;Variant 2&lt;/strong&gt; is implemented for example by OwnStats as well (while producing fixed costs), &lt;strong&gt;Variant 3&lt;/strong&gt; shall be the focus, because it offers an overall pay-per-use approach, is technically novel, and is thus potentially more interesting for the community, even though it has the outlined caveats.&lt;/p&gt;
&lt;p&gt;The focus lies on implementing the Example Use Case of web analytics events, as outlined above, and builds upon the insights gained while creating OwnStats, a completely serverless web analytics solution.&lt;/p&gt;
&lt;h2 id=&quot;heading-prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;The following prerequisites need to be available on the machine you want to deploy from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS IAM credentials that have the respective permissions to deploy the different stack’s resources&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A &lt;a target=&quot;_blank&quot; href=&quot;https://nodejs.org&quot;&gt;Node.js&lt;/a&gt; installation (&amp;gt; v18, including npm)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A global installation of the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/serverless/serverless&quot;&gt;Serverless Framework&lt;/a&gt; (v3, not v4!) via &lt;code&gt;npm i -g serverless@3&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-structure-of-the-project&quot;&gt;Structure of the project&lt;/h2&gt;
&lt;p&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/iceberg-event-pipeline&quot;&gt;&lt;strong&gt;Iceberg Event Pipeline project&lt;/strong&gt;&lt;/a&gt; consists of three sequentially deployable CloudFormation stacks, as well as a set of scripts that create the Namespace and the actual S3 Table resources, as they currently have no CloudFormation support themselves.&lt;/p&gt;
&lt;h3 id=&quot;heading-base-stack&quot;&gt;Base Stack&lt;/h3&gt;
&lt;p&gt;The Base Stack contains the CloudFront Distribution, the S3 Bucket that acts as backup location for data the Kinesis Data Firehose could not forward to the S3 Table, and the S3 Table Bucket.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744300534039/9347f553-9803-4c3f-9cbd-7df3e67ff97c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-processing-stack&quot;&gt;Processing Stack&lt;/h3&gt;
&lt;p&gt;The Processing Stack contains the Kinesis Data Firehose Policy, the LakeFormation permissions, the Glue Database Resource Link, and the Lambda function for the event processing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744300550061/d0386f74-84aa-40e7-a56a-6eb04c557f7d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-storage-stack&quot;&gt;Storage Stack&lt;/h3&gt;
&lt;p&gt;The Storage Stack contains the actual Kinesis Data Firehose, as well as the CloudWatch Logs Delivery with its Source and Destination.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744300584335/2a756a47-fa55-434f-addf-579899f56c93.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-configuration&quot;&gt;Configuration&lt;/h3&gt;
&lt;p&gt;In the root folder of the project, there’s a file called &lt;code&gt;config.json&lt;/code&gt;, which contains the project’s configuration that’s used during the stack deployments. It looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;profile&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;default&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;stage&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;prd&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;region&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;s3&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;tableBucket&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;iceberg-event-analytics-data&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;namespace&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;iceberg_event_analytics&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;tableName&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;event_data&quot;&lt;/span&gt;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can configure the AWS credentials profile name, the stage, and the AWS region you want to deploy the stacks to. In the &lt;code&gt;s3&lt;/code&gt; object, the S3 Tables-related resource names can be configured. You can also just use the defaults.&lt;/p&gt;
&lt;h3 id=&quot;heading-s3-table-structure&quot;&gt;S3 Table Structure&lt;/h3&gt;
&lt;p&gt;The structure of the target table is defined in a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/iceberg-event-pipeline/blob/main/manual-setup/event-table.json&quot;&gt;JSON file&lt;/a&gt; as well, which is used during the manual creation of the S3 Table resources (see below). The table structure is reused from the OwnStats &lt;a target=&quot;_blank&quot; href=&quot;https://docs.ownstats.com/architecture/catalog-layer/#incoming_events-table&quot;&gt;incoming_events&lt;/a&gt; table, though slightly adapted/simplified.&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Column name&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Data type&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Is partition key?&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;event_id&lt;/td&gt;&lt;td&gt;uuid&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The unique ID of the event&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_date&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The date of the event (YYYY-MM-DD), as string&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The exact event timestamp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arrival_timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The exact timestamp when the event arrived in the Kinesis Data Stream&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_city&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the edge city (all edge location info is derived from the &lt;code&gt;x-edge-location&lt;/code&gt; field in the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat&quot;&gt;&lt;strong&gt;logs&lt;/strong&gt;&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_state&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The state of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_country&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The country of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_country_code&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The country code of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_latitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The latitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_longitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The longitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The original id of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The domain name where the event was recorded&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;referrer&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The referrer&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;referrer_domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The domain name of the referrer&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_version&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The version of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_os_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The OS name of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_os_version&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The OS version of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_timezone&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The timezone of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_language&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The language of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_type&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The device type&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_vendor&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The device vendor&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_outer_resolution&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The outer resolution of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_inner_resolution&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The inner resolution of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_color_depth&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The color depth of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_platform&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The platform of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_memory&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The memory of the device (in MB)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_cores&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The number of cores of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_source&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies which site sent the traffic&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_campaign&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies a specific product promotion or strategic campaign&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_medium&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies what type of link was used, such as cost per click or email&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_content&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies what specifically was clicked to bring the user to the site&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_term&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies search terms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_url&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The full requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_path&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The path of the requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_query_string&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The query string of the requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_bytes&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The size of the request in bytes&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_status_code&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The HTTP status code of the request&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_cache_status&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The CloudFront cache status&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_delivery_time_ms&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The time in ms it took for CloudFront to complete the request&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_asn&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://www.arin.net/resources/guide/asn/&quot;&gt;&lt;strong&gt;ASN&lt;/strong&gt;&lt;/a&gt; of the requestor&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_type&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The type of the event (&lt;code&gt;pageview&lt;/code&gt; or &lt;code&gt;track&lt;/code&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the event for tracking events&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_data&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The stringified event payload for tracking events&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h2 id=&quot;heading-deployment&quot;&gt;Deployment&lt;/h2&gt;
&lt;p&gt;To be able to deploy the project, you must have set up all of the above-mentioned prerequisites on the machine you want to deploy from.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;You need to deploy the stacks and perform the manual steps in the order outlined below, otherwise it won’t work, as they depend on each other.&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h3 id=&quot;heading-deployment-of-the-base-stack&quot;&gt;Deployment of the Base Stack&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; base-stack

&lt;span class=&quot;hljs-comment&quot;&gt;# Deploy Base Stack&lt;/span&gt;
sls deploy

&lt;span class=&quot;hljs-comment&quot;&gt;# Return to project base directory&lt;/span&gt;
&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-manual-creation-of-s3-table-resources-amp-asset-sync-for-cloudfront&quot;&gt;Manual creation of S3 Table resources &amp;amp; Asset sync for CloudFront&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; manual-setup

&lt;span class=&quot;hljs-comment&quot;&gt;# Install dependencies&lt;/span&gt;
npm i 

&lt;span class=&quot;hljs-comment&quot;&gt;# Create S3 Table resources&lt;/span&gt;
npm run create-resources

&lt;span class=&quot;hljs-comment&quot;&gt;# Sync assets to S3 Bucket that backs the Cloudfront Distribution&lt;/span&gt;
npm run sync-s3

&lt;span class=&quot;hljs-comment&quot;&gt;# Return to project base directory&lt;/span&gt;
&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-deployment-of-the-processing-stack&quot;&gt;Deployment of the Processing Stack&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; processing-stack

&lt;span class=&quot;hljs-comment&quot;&gt;# Install dependencies&lt;/span&gt;
npm i

&lt;span class=&quot;hljs-comment&quot;&gt;# Deploy Processing Stack&lt;/span&gt;
sls deploy

&lt;span class=&quot;hljs-comment&quot;&gt;# Return to project base directory&lt;/span&gt;
&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-deployment-of-the-storage-stack&quot;&gt;Deployment of the Storage Stack&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; storage-stack

&lt;span class=&quot;hljs-comment&quot;&gt;# Deploy Storage Stack&lt;/span&gt;
sls deploy

&lt;span class=&quot;hljs-comment&quot;&gt;# Return to project base directory&lt;/span&gt;
&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-sending-test-events&quot;&gt;Sending test events&lt;/h2&gt;
&lt;p&gt;If everything went smoothly, you can now send events to your pipeline, which will land in the S3 Table. To do so, you can either use the OwnStats client as stated in the Use Case description, or use the provided script that produces the same request signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-built_in&quot;&gt;cd&lt;/span&gt; manual-setup

&lt;span class=&quot;hljs-comment&quot;&gt;# Send event&lt;/span&gt;
npm run send-event -- &lt;span class=&quot;hljs-string&quot;&gt;&quot;My Event&quot;&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;{&quot;testevent&quot;:true,&quot;message&quot;:&quot;foobar&quot;}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-querying-the-data-in-the-s3-table&quot;&gt;Querying the data in the S3 Table&lt;/h2&gt;
&lt;h3 id=&quot;heading-using-athena&quot;&gt;Using Athena&lt;/h3&gt;
&lt;p&gt;You can query the data with Athena in the AWS Console. Please make sure that you select the correct catalog and database before you run a sample query on the &lt;code&gt;event_data&lt;/code&gt; table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; event_data &lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; arrival_timestamp &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything worked, you should see your events in the result browser:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1744311924345/0b083c8e-d3fe-43ed-bde6-5e02ac015e90.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
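&lt;p&gt;Beyond a simple &lt;code&gt;SELECT *&lt;/code&gt;, you can of course also run aggregations. A minimal sketch that counts events per day and type, assuming the column names from the table structure above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Count events per day and event type
SELECT event_date,
       event_type,
       COUNT(*) AS event_count
FROM event_data
GROUP BY event_date, event_type
ORDER BY event_date DESC, event_count DESC
&lt;/code&gt;&lt;/pre&gt;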
&lt;h3 id=&quot;heading-using-duckdb&quot;&gt;Using DuckDB&lt;/h3&gt;
&lt;p&gt;I recently wrote a blog post about &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/query-s3-tables-with-duckdb&quot;&gt;querying S3 Tables with DuckDB&lt;/a&gt;, please have a look to learn how you can do it.&lt;/p&gt;
&lt;h1 id=&quot;heading-verdict&quot;&gt;Verdict&lt;/h1&gt;
&lt;p&gt;Creating low-cost, pay-per-use event data pipelines that store their data in the Apache Iceberg format on AWS is possible. Currently, it still requires a lot of configuration effort, and single-stack IaC is supported neither via CloudFormation nor via the AWS CDK.&lt;/p&gt;
&lt;p&gt;This is a major hurdle for the DX from my current point of view, especially in light of the very recent product releases from &lt;a target=&quot;_blank&quot; href=&quot;https://cloudflare.com&quot;&gt;Cloudflare&lt;/a&gt; on 2025-04-10:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://blog.cloudflare.com/cloudflare-acquires-arroyo-pipelines-streaming-ingestion-beta/&quot;&gt;Streaming ingestion with Pipelines&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://blog.cloudflare.com/r2-data-catalog-public-beta/&quot;&gt;Managed Iceberg Tables with R2 Data Catalog&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://developers.cloudflare.com/pipelines/&quot;&gt;Pipelines documentation&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://blog.cloudflare.com/cloudflare-containers-coming-2025/&quot;&gt;Containers on Workers&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While those are still beta releases, it’s clear that innovation in the event processing / efficient Data Lake storage space is accelerating. Once Cloudflare enables Pipelines to write Iceberg tables directly to R2, together with the R2 Data Catalog and the upcoming containers product (e.g. running DuckDB as an API), you’ll be able to host your complete Lakehouse data stack on Cloudflare, with potentially much less complexity compared to AWS at the moment.&lt;/p&gt;
&lt;p&gt;Also from a cost perspective, this is bound to gain some share of the market, especially for smaller, non-enterprise companies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;Workers Paid ($5 / month)&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ingestion&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;First 50 GB per month included&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delivery to R2&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;First 50 GB per month included&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;In this scenario, you’d pay $5 for ingesting 50GB of data to R2, while on AWS you’d already have to pay $12.50 for the CloudWatch Logs delivery to Kinesis Data Firehose alone.&lt;/p&gt;
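&lt;p&gt;(Assuming the standard rate of $0.25 per GB for CloudWatch Logs delivery to Kinesis Data Firehose, that’s 50 GB × $0.25/GB = $12.50, before any Firehose ingestion or S3 storage charges.)&lt;/p&gt;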
&lt;p&gt;Also, there’s much less complexity from an IaC perspective compared to the pretty huge required infrastructure on AWS. I’m curious how / if AWS will react.&lt;/p&gt;
</content:encoded><category>apache iceberg</category><category>AWS</category><category>cloudflare</category><category>serverless</category><author>Tobias Müller</author></item><item><title>Query S3 Tables with DuckDB</title><link>https://tobilg.com/posts/query-s3-tables-with-duckdb/</link><guid isPermaLink="true">https://tobilg.com/posts/query-s3-tables-with-duckdb/</guid><description>DuckDB has gained a new feature in preview, that allows querying of Iceberg data in AWS S3 Tables.
Setting up a S3 Table
There are multiple steps which need to be performed to set up a S3 Table that can be then queried with tools like DuckDB. As the ...</description><pubDate>Fri, 14 Mar 2025 18:35:31 GMT</pubDate><content:encoded>&lt;p&gt;DuckDB has gained a new preview feature that allows querying Iceberg data in AWS S3 Tables.&lt;/p&gt;
&lt;h1 id=&quot;heading-setting-up-a-s3-table&quot;&gt;Setting up a S3 Table&lt;/h1&gt;
&lt;p&gt;There are multiple steps which need to be performed to set up a S3 Table that can then be queried with tools like DuckDB. As the Infrastructure as Code (IaC) support from AWS is not yet complete, a manual setup in the AWS Console will be used.&lt;/p&gt;
&lt;h2 id=&quot;heading-creating-a-s3-table-bucket&quot;&gt;Creating a S3 Table Bucket&lt;/h2&gt;
&lt;p&gt;Open the AWS Console, and go to &lt;a target=&quot;_blank&quot; href=&quot;https://us-east-1.console.aws.amazon.com/s3/table-buckets?region=us-east-1&quot;&gt;“Amazon S3” → “Table Buckets”&lt;/a&gt;, and then click on the “Create table bucket” button. Enter a name for the table bucket.&lt;/p&gt;
&lt;p&gt;If you are doing this for the first time, also activate the “Integration with AWS analytics services” for the specific region.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741974721915/02ae26e3-7962-4433-a4dc-226a50f0cc54.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Press the “Create table bucket” button again, and the Table Bucket will be created.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741974864331/832ee11b-08a6-409d-9848-76062abe387b.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-creating-a-s3-table-via-athena&quot;&gt;Creating a S3 Table via Athena&lt;/h2&gt;
&lt;p&gt;As a next step, you can create a S3 Table with Athena. Click on the “Create table with Athena” button:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741974971270/2f1df18e-c4cb-4786-89d1-42383a06441c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you haven’t set up a Namespace yet, you need to do this in the next step, before you can actually create the S3 Table:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741975044652/1381bddd-ce42-4288-b541-915bb8c9dec3.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you created a namespace before, you can select it and finally continue to the S3 Table creation, by clicking on “Create table with Athena”:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741975156669/5ba70129-dcc0-4dea-a33c-f600a3fe53ad.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This will open a new tab with the Athena console:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741975262660/035701dd-1cbc-47a5-9a5f-e26dd8a63790.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;You need to create an Athena query result location before you can actually run queries, but this is not covered in this blog post&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;For our testing purposes, we can just use the example table DDL that Athena provides. Select the CREATE TABLE statement, and click on the “Run” button:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741975653210/8eda8921-5778-4280-8e9e-047b511ed803.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Yay, we successfully created a new S3 Table with Iceberg support!&lt;/strong&gt;&lt;/p&gt;
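&lt;p&gt;For reference, the example DDL that Athena pre-fills is roughly of the following shape; the column names match the sample data used later in this post, but the exact statement and table properties Athena generates may differ slightly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Rough sketch of the generated example DDL
CREATE TABLE daily_sales (
  sale_date date,
  product_category string,
  sales_amount double
)
TBLPROPERTIES (&apos;table_type&apos; = &apos;iceberg&apos;) -- may already be implied by the S3 Tables catalog
&lt;/code&gt;&lt;/pre&gt;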
&lt;h2 id=&quot;heading-inserting-some-test-data-via-athena&quot;&gt;Inserting some test data via Athena&lt;/h2&gt;
&lt;p&gt;As we want to query some data from DuckDB later, we need to insert some sample data into our newly created table as well. To do so, just uncomment the INSERT statement below the CREATE TABLE statement, and click on the “Run” button:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741975783283/42140023-5bd2-418b-ad32-97f9eb4896c2.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This takes surprisingly long for those few records, but if successful, it looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741975844220/87d489fb-6e6a-4c6c-a08b-6f0eeb3e3073.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
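&lt;p&gt;The INSERT is a plain multi-row insert; a sketch that matches the rows we’ll query later with DuckDB (the statement Athena pre-fills may differ slightly):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;INSERT INTO daily_sales VALUES
  (DATE &apos;2024-01-15&apos;, &apos;Laptop&apos;, 900.0),
  (DATE &apos;2024-01-15&apos;, &apos;Monitor&apos;, 250.0),
  (DATE &apos;2024-01-16&apos;, &apos;Laptop&apos;, 1350.0),
  (DATE &apos;2024-02-01&apos;, &apos;Monitor&apos;, 300.0),
  (DATE &apos;2024-02-01&apos;, &apos;Keyboard&apos;, 60.0),
  (DATE &apos;2024-02-02&apos;, &apos;Mouse&apos;, 25.0),
  (DATE &apos;2024-02-02&apos;, &apos;Laptop&apos;, 1050.0),
  (DATE &apos;2024-02-03&apos;, &apos;Laptop&apos;, 1200.0),
  (DATE &apos;2024-02-03&apos;, &apos;Monitor&apos;, 375.0)
&lt;/code&gt;&lt;/pre&gt;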
&lt;p&gt;Now, we have established the infrastructure and data we need to actually start querying the S3 Table with DuckDB!&lt;/p&gt;
&lt;h2 id=&quot;heading-iam-permissions&quot;&gt;IAM permissions&lt;/h2&gt;
&lt;p&gt;But before that, you need to make sure that the IAM role you’re planning to use for the test has the appropriate permissions. Otherwise, you won’t be able to perform the queries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Version&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2012-10-17&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Statement&quot;&lt;/span&gt;: [
        {
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Effect&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Allow&quot;&lt;/span&gt;,
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Action&quot;&lt;/span&gt;: [
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;s3tables:*&quot;&lt;/span&gt;,
                &lt;span class=&quot;hljs-string&quot;&gt;&quot;YOUR OTHER IAM permissions&quot;&lt;/span&gt;
            ],
            &lt;span class=&quot;hljs-attr&quot;&gt;&quot;Resource&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;*&quot;&lt;/span&gt;
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;s3tables:*&lt;/code&gt; action is the relevant part, as it will enable you to use all IAM actions of the &lt;code&gt;s3tables&lt;/code&gt; service.&lt;/p&gt;
&lt;p&gt;Going forward, it’s assumed that you have set up your AWS credentials on the machine you want to use DuckDB from to query the S3 Table. The details are not covered in this article, but a good starting point in the docs can be found &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&quot;heading-using-the-latest-duckdb-version&quot;&gt;Using the latest DuckDB version&lt;/h1&gt;
&lt;p&gt;Before continuing, please make sure that you have installed version 1.2.1 of DuckDB. Otherwise this will not work!&lt;/p&gt;
&lt;h1 id=&quot;heading-installing-extensions&quot;&gt;Installing extensions&lt;/h1&gt;
&lt;p&gt;Open DuckDB from your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ duckdb --version
v1.2.1 8e52ec4395

$ duckdb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first step is to install the relevant extensions (aws, httpfs and iceberg) in their nightly versions (latest):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;FORCE &lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; aws &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; core_nightly;
FORCE &lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; httpfs &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; core_nightly;
FORCE &lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; iceberg &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; core_nightly;
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-creating-a-s3-secret&quot;&gt;Creating a S3 secret&lt;/h1&gt;
&lt;p&gt;As the next step, you need to create a secret, so that DuckDB can automatically authenticate you via the credential chain provider (you need to have set this up beforehand, see above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; SECRET (
  &lt;span class=&quot;hljs-keyword&quot;&gt;TYPE&lt;/span&gt; s3,
  PROVIDER credential_chain
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-attach-the-database&quot;&gt;Attach the database&lt;/h1&gt;
&lt;p&gt;Next, we need to attach the actual remote database to our local DuckDB instance. Please make sure that the ARN is correct; you can copy it from the Table Bucket’s details:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1741976968157/e6085841-be67-4b42-a4a8-70fe2952d27d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Replace your ARN in the statement below, and run it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;ATTACH &apos;arn:aws:s3tables:us-east-1:12345678912:bucket/duckdb-test&apos;
  AS test_db (
    TYPE iceberg,
    ENDPOINT_TYPE s3_tables
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-show-all-tables-in-the-table-buckets-namespace&quot;&gt;Show all tables in the Table Bucket’s Namespace&lt;/h1&gt;
&lt;p&gt;Run the below statement, which will show you a list of your S3 Tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SHOW&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLES&lt;/span&gt;;

┌──────────────┬─────────┬─────────────┬─────────────────────────────────────────────┬─────────────────────────┬───────────┐
│   database   │ schema  │    name     │                column_names                 │      column_types       │ temporary │
│   varchar    │ varchar │   varchar   │                  varchar[]                  │        varchar[]        │  boolean  │
├──────────────┼─────────┼─────────────┼─────────────────────────────────────────────┼─────────────────────────┼───────────┤
│ test_db      │ test    │ daily_sales │ [sale_date, product_category, sales_amount] │ [DATE, VARCHAR, DOUBLE] │ false     │
└──────────────┴─────────┴─────────────┴─────────────────────────────────────────────┴─────────────────────────┴───────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-querying-the-s3-table&quot;&gt;Querying the S3 Table&lt;/h1&gt;
&lt;p&gt;Finally, you can now query your S3 Table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; test_db.test.daily_sales;
┌────────────┬──────────────────┬──────────────┐
│ sale_date  │ product_category │ sales_amount │
│    date    │     varchar      │    double    │
├────────────┼──────────────────┼──────────────┤
│ 2024-01-15 │ Laptop           │        900.0 │
│ 2024-01-15 │ Monitor          │        250.0 │
│ 2024-01-16 │ Laptop           │       1350.0 │
│ 2024-02-01 │ Monitor          │        300.0 │
│ 2024-02-01 │ Keyboard         │         60.0 │
│ 2024-02-02 │ Mouse            │         25.0 │
│ 2024-02-02 │ Laptop           │       1050.0 │
│ 2024-02-03 │ Laptop           │       1200.0 │
│ 2024-02-03 │ Monitor          │        375.0 │
└────────────┴──────────────────┴──────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-references&quot;&gt;References&lt;/h1&gt;
&lt;p&gt;The following article is a great start:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/2025/03/14/preview-amazon-s3-tables.html&quot;&gt;DuckDB Blog: Amazon S3 Tables in DuckDB&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are some videos from the Amazon Pi Day where this was introduced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.youtube.com/watch?v=z-DVBcV2JQM&amp;amp;t=1141s&quot;&gt;Keynote segment on DuckDB&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.youtube.com/watch?v=z-DVBcV2JQM&amp;amp;t=5326s&quot;&gt;Demo with the DuckDB UI&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following PRs are relevant for this new feature:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-iceberg/pull/112&quot;&gt;Add Glue support&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-iceberg/pull/98&quot;&gt;Add Iceberg Catalog read support&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A test with some example code can be found at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-iceberg/blob/main/test/sql/cloud/glue/test_glue.test&quot;&gt;test_glue.test&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>s3-tables</category><category>AWS</category><category>apache iceberg</category><category>iceberg</category><category>S3</category><author>Tobias Müller</author></item><item><title>Querying IP addresses and CIDR ranges with DuckDB</title><link>https://tobilg.com/posts/querying-ip-addresses-and-cidr-ranges-with-duckdb/</link><guid isPermaLink="true">https://tobilg.com/posts/querying-ip-addresses-and-cidr-ranges-with-duckdb/</guid><description>I had a use case that eventually required performing IP address lookups in a given list of CIDR ranges, as I maintain an open source project that gathers IP address range data from public cloud providers, and also wrote an article in my blog about an...</description><pubDate>Fri, 20 Sep 2024 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I had a use case that eventually required performing IP address lookups in a given list of CIDR ranges, as I maintain an &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/public-cloud-provider-ip-ranges&quot;&gt;open source project&lt;/a&gt; that gathers IP address range data from public cloud providers, and also wrote an &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/gathering-and-analyzing-public-cloud-provider-ip-address-data-with-duckdb-observerable&quot;&gt;article in my blog&lt;/a&gt; about analyzing this data as well.&lt;/p&gt;
&lt;p&gt;I had a look at the DuckDB docs, and found the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/inet.html&quot;&gt;inet extension&lt;/a&gt; that provides some functionality around IP addresses. Unfortunately, it offers no direct way to determine whether a given IP address is part of a CIDR range.&lt;/p&gt;
&lt;p&gt;After some research, I found a &lt;a target=&quot;_blank&quot; href=&quot;https://stackoverflow.com/a/56121277/1603357&quot;&gt;StackOverflow answer&lt;/a&gt; that showed how this could be done with basic functions in Postgres. I decided to create three DuckDB macros to implement the needed IP address lookup functionality.&lt;/p&gt;
&lt;h3 id=&quot;heading-the-plan&quot;&gt;The plan&lt;/h3&gt;
&lt;p&gt;Basically, both the starting (network) and ending (broadcast) IP addresses of a CIDR range need to be cast to integers, to be able to determine if a given IP address (also cast to an integer) lies within the derived integer value boundaries.&lt;/p&gt;
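&lt;p&gt;As a quick worked example: for the range &lt;code&gt;4.0.0.0/8&lt;/code&gt;, the network address converts to 4 * 256 * 256 * 256 = 67,108,864, and the broadcast address to 67,108,864 + 16,777,215 = 83,886,079. Any IP address whose integer value falls between these two numbers is part of the range.&lt;/p&gt;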
&lt;h3 id=&quot;heading-deriving-the-network-address-value&quot;&gt;Deriving the network address value&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;REPLACE&lt;/span&gt; MACRO network_from_cidr(cidr_range) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;) * (&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt; * &lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt; * &lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;) +
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;) * (&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt; * &lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;      ) +
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;) * (&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;            ) +
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-deriving-the-broadcast-address-value&quot;&gt;Deriving the broadcast address value&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;REPLACE&lt;/span&gt; MACRO broadcast_from_cidr(cidr_range) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;) * (&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt; * &lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt; * &lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;) +
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;) * (&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt; * &lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;      ) +
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;) * (&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;            ) +
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;], &lt;span class=&quot;hljs-string&quot;&gt;&apos;.&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;)) + 
    &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;pow&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;256&lt;/span&gt;, (&lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt; - &lt;span class=&quot;hljs-keyword&quot;&gt;cast&lt;/span&gt;(string_split(cidr_range, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;)) / &lt;span class=&quot;hljs-number&quot;&gt;8&lt;/span&gt;) - &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;bigint&lt;/span&gt;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-final-macro-for-the-ip-address-lookup&quot;&gt;Final macro for the IP address lookup&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;REPLACE&lt;/span&gt; MACRO ip_within_cidr(ip, cidr_range) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
    network_from_cidr(ip || &lt;span class=&quot;hljs-string&quot;&gt;&apos;/32&apos;&lt;/span&gt;) &amp;gt;= network_from_cidr(cidr_range) &lt;span class=&quot;hljs-keyword&quot;&gt;AND&lt;/span&gt; network_from_cidr(ip || &lt;span class=&quot;hljs-string&quot;&gt;&apos;/32&apos;&lt;/span&gt;) &amp;lt;= broadcast_from_cidr(cidr_range)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-sample-sqls&quot;&gt;Sample SQLs&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Transform to network start address&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; network_from_cidr(&lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0/8&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-comment&quot;&gt;-- Transform to broadcast address&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; broadcast_from_cidr(&lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0/8&apos;&lt;/span&gt;)
&lt;span class=&quot;hljs-comment&quot;&gt;-- Checks&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; ip_within_cidr(&lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0/8&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-comment&quot;&gt;-- true&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; ip_within_cidr(&lt;span class=&quot;hljs-string&quot;&gt;&apos;4.255.255.255&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0/8&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-comment&quot;&gt;-- true&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; ip_within_cidr(&lt;span class=&quot;hljs-string&quot;&gt;&apos;3.255.255.255&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0/8&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-comment&quot;&gt;-- false&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;select&lt;/span&gt; ip_within_cidr(&lt;span class=&quot;hljs-string&quot;&gt;&apos;5.0.0.0&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;4.0.0.0/8&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-comment&quot;&gt;-- false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
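&lt;p&gt;In practice, you’d typically check an IP address against a whole table of ranges. Here’s a minimal sketch, assuming you’ve loaded the provider data into a (hypothetical) table called &lt;code&gt;ip_ranges&lt;/code&gt; with the columns &lt;code&gt;cidr&lt;/code&gt; and &lt;code&gt;provider&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Find all provider ranges that contain a given IP address
SELECT provider, cidr
FROM ip_ranges
WHERE ip_within_cidr(&apos;3.5.140.2&apos;, cidr);
&lt;/code&gt;&lt;/pre&gt;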
&lt;h3 id=&quot;heading-try-it-out-yourself&quot;&gt;Try it out yourself&lt;/h3&gt;
&lt;p&gt;You can try the described functionalities in my free online &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;SQL Workbench&lt;/a&gt; by &lt;a target=&quot;_blank&quot; href=&quot;https://dub.sh/duckdb-ip-lookups&quot;&gt;clicking on this link&lt;/a&gt;. It will open a new browser window, and create all three macros and run a test query accordingly.&lt;/p&gt;
</content:encoded><category>duckDB</category><category>ip address</category><category>CIDR</category><category>data analysis</category><author>Tobias Müller</author></item><item><title>Chat with a Duck</title><link>https://tobilg.com/posts/chat-with-a-duck/</link><guid isPermaLink="true">https://tobilg.com/posts/chat-with-a-duck/</guid><description>A while ago I published sql-workbench.com and the accompanying blog post called &quot;Using DuckDB-WASM for in-browser Data Engineering&quot;. The SQL Workbench enables its users to analyze local or remote data directly in the browser.
This lowers the bar rega...</description><pubDate>Tue, 16 Apr 2024 23:24:58 GMT</pubDate><content:encoded>&lt;p&gt;A while ago I published &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;sql-workbench.com&lt;/a&gt; and the accompanying blog post called &quot;&lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/using-duckdb-wasm-for-in-browser-data-engineering&quot;&gt;Using DuckDB-WASM for in-browser Data Engineering&lt;/a&gt;&quot;. The SQL Workbench enables its users to analyze local or remote data directly in the browser.&lt;/p&gt;
&lt;p&gt;This lowers the bar regarding the infrastructure needed to get started with Data Analysis and Data Engineering. No databases need to be installed on servers or developer machines, and no data is sent to the internet or to SaaS or Cloud providers. The sole interface to the data you want to analyze is SQL.&lt;/p&gt;
&lt;p&gt;But what if a user is new in this space, and has no SQL skills yet? Or has SQL skills, but wants to query the data without having to understand the data model first?&lt;/p&gt;
&lt;h2 id=&quot;heading-enter-duckdb-nsql&quot;&gt;Enter DuckDB-NSQL&lt;/h2&gt;
&lt;p&gt;LLMs (Large Language Models) can be used to generate code from natural language questions. Popular examples include &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/features/copilot&quot;&gt;GitHub Copilot&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4&quot;&gt;OpenAI&apos;s GPT-4&lt;/a&gt; or Meta&apos;s &lt;a target=&quot;_blank&quot; href=&quot;https://ai.meta.com/blog/code-llama-large-language-model-coding/&quot;&gt;Code Llama&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DuckDB-NSQL is a Text-to-SQL model created by Numbers Station for MotherDuck. It&apos;s hosted on &lt;a target=&quot;_blank&quot; href=&quot;https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1&quot;&gt;HuggingFace&lt;/a&gt; and on &lt;a target=&quot;_blank&quot; href=&quot;https://ollama.com/library/duckdb-nsql&quot;&gt;Ollama&lt;/a&gt;, and can be used with different LLM runtimes. There are also some nice blog posts that are worth a read:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://numbersstation.ai/duckdb-nsql-how-to-quack-in-sql/&quot;&gt;How to quack in SQL&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://motherduck.com/blog/duckdb-text2sql-llm/&quot;&gt;AI that quacks&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The model was specifically trained for the DuckDB SQL syntax with 200k DuckDB Text-to-SQL pairs, and is based on the &lt;a target=&quot;_blank&quot; href=&quot;https://huggingface.co/meta-llama/Llama-2-7b&quot;&gt;&lt;strong&gt;Llama-2 7B model&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-bring-your-own-ai&quot;&gt;Bring your own AI&lt;/h2&gt;
&lt;p&gt;If you want to enable SQL Workbench&apos;s privacy first AI integration for SQL generation, you first have to install &lt;a target=&quot;_blank&quot; href=&quot;https://ollama.com/download&quot;&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/a&gt; on your local machine.&lt;/p&gt;
&lt;p&gt;Once you have installed Ollama, you can either download the relevant &lt;a target=&quot;_blank&quot; href=&quot;https://ollama.com/library/duckdb-nsql&quot;&gt;&lt;strong&gt;DuckDB-NSQL model&lt;/strong&gt;&lt;/a&gt; beforehand, or have it downloaded automatically on first usage.&lt;/p&gt;
&lt;p&gt;If you want to pull the model yourself, you can do this in your terminal (after you installed Ollama) by issuing the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;ollama pull duckdb-nsql:7b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Please be aware that the default model has a size of &lt;strong&gt;3.8GB&lt;/strong&gt;, which can take a while to download, depending on your internet connection speed. There are &lt;a target=&quot;_blank&quot; href=&quot;https://ollama.com/library/duckdb-nsql/tags&quot;&gt;smaller quantized models&lt;/a&gt; as well, but be aware that the answer quality might be lower with them.&lt;/p&gt;
&lt;p&gt;Once the model is downloaded, Ollama can be started from your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;OLLAMA_ORIGINS=&lt;span class=&quot;hljs-string&quot;&gt;&quot;https://sql-workbench.com&quot;&lt;/span&gt; ollama serve
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting the &lt;code&gt;OLLAMA_ORIGINS&lt;/code&gt; environment variable to &lt;code&gt;https://sql-workbench.com&lt;/code&gt; is necessary to &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ollama/ollama/issues/669&quot;&gt;enable CORS&lt;/a&gt; from the SQL Workbench running in your browser for your locally running Ollama server.&lt;/p&gt;
&lt;p&gt;You can enable the AI feature in SQL Workbench:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968764454/f60a0ddd-54bf-47f4-b460-415ecf24e896.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then, you can ask your questions after a specific comment string like below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai your natural language question&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To execute the prompt, you have two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Press &lt;strong&gt;ALT + g&lt;/strong&gt; to generate the SQL&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Press &lt;strong&gt;ALT + r&lt;/strong&gt; to run the generated SQL directly&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So if you created a table named &quot;locations&quot; beforehand that has a column named &quot;country&quot;, the following would generate an appropriate SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai distinct country from locations&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;distinct&lt;/span&gt; country &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; locations
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The generated SQL is automatically inserted below the closest prompt comment string. In case you have multiple comment strings in the current SQL Workbench tab, the one closest to the actual cursor position is used.&lt;/p&gt;
&lt;p&gt;You can also ask questions about remote files (Parquet, CSV), but don&apos;t expect the answer quality to be very good, because the schema will be unknown to the model.&lt;/p&gt;
&lt;h2 id=&quot;heading-explore-aws-iam-data-with-the-help-of-ai&quot;&gt;Explore AWS IAM data with the help of AI&lt;/h2&gt;
&lt;p&gt;AWS publishes its &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html&quot;&gt;Service Authorization Reference&lt;/a&gt; documentation, and there&apos;s a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/aws-iam-data&quot;&gt;Github repository&lt;/a&gt; that transforms the published data automatically to Parquet, CSV and JSON data formats every night at 4AM UTC.&lt;/p&gt;
&lt;p&gt;Each Parquet file represents a table in a relational data model that looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1713303490653/93837698-c257-4baa-b36b-5f6ba02eafd4.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With the help of SQL Workbench and its underlying DuckDB-WASM instance, it&apos;s possible to load the remote data from GitHub into our browser&apos;s memory.&lt;/p&gt;
&lt;p&gt;To do this, the following SQL statements need to be executed, either by copying &amp;amp; pasting the code, or automatically via the SQL Workbench&apos;s shareable query feature by clicking on the duck below (this will open in a new browser window/tab, redirected via &lt;a target=&quot;_blank&quot; href=&quot;https://dub.co&quot;&gt;dub.co&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://dub.sh/query-aws-iam-data-with-ai&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1713304001847/a7cce793-194c-48d1-8804-0ed10163bb27.avif&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The created schema will then be used as input for the prompt, which is sent to Ollama in the background by the browser.&lt;/p&gt;
&lt;h3 id=&quot;heading-sql-script&quot;&gt;SQL script&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; services (
  service_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  prefix &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  reference_url &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt; 
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; actions (
  action_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  service_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  reference_url &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  permission_only_flag &lt;span class=&quot;hljs-built_in&quot;&gt;BOOLEAN&lt;/span&gt;, 
  access_level &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (service_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; services (service_id)
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; condition_keys (
  condition_key_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  reference_url &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  description &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;type&quot;&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; resource_types (
  resource_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  service_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  &lt;span class=&quot;hljs-string&quot;&gt;&quot;name&quot;&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  reference_url &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, 
  arn_pattern &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (service_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; services (service_id)
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; resource_types_condition_keys (
  resource_type_condition_key_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  resource_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  condition_key_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (resource_type_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; resource_types (resource_type_id),
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (condition_key_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; condition_keys (condition_key_id)
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; actions_resource_types (
  action_resource_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;BIGINT&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  action_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  resource_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  required_flag &lt;span class=&quot;hljs-built_in&quot;&gt;BOOLEAN&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (action_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; actions (action_id)
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; actions_condition_keys (
  action_condition_key_id &lt;span class=&quot;hljs-built_in&quot;&gt;BIGINT&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  action_resource_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;BIGINT&lt;/span&gt;, 
  action_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  condition_key_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (action_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; actions (action_id),
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (condition_key_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; condition_keys (condition_key_id)
);

&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; actions_dependant_actions (
  action_dependent_action_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;, 
  action_resource_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;BIGINT&lt;/span&gt;, 
  action_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
  dependent_action_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (action_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; actions (action_id),
  &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (action_resource_type_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; actions_resource_types (action_resource_type_id)
);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; services &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; resource_types &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_resource_types.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; condition_keys &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_condition_keys.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; actions &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_actions.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; resource_types_condition_keys &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_resource_types_condition_keys.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; actions_resource_types &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_actions_resource_types.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; actions_condition_keys &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_actions_condition_keys.parquet&apos;&lt;/span&gt;;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; actions_dependant_actions &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_actions_dependant_actions.parquet&apos;&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
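&lt;p&gt;Once the script has run, a quick sanity check can confirm that the remote Parquet files were actually loaded. This is just an optional helper query, not part of the original script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Optional: row counts of some of the loaded tables
SELECT &apos;services&apos; AS table_name, count(*) AS row_cnt FROM services
UNION ALL
SELECT &apos;actions&apos;, count(*) FROM actions
UNION ALL
SELECT &apos;condition_keys&apos;, count(*) FROM condition_keys;
&lt;/code&gt;&lt;/pre&gt;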
&lt;h3 id=&quot;heading-example-prompts-amp-results&quot;&gt;Example prompts &amp;amp; results&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai distinct service names that contain &apos;S3&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968797665/d5d82884-5796-4fef-b699-791f5a1a03a8.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai count distinct action names for service name contains S3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968825767/a075244a-df9b-4db1-afbf-8ecc99b4b40e.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai show all &apos;Write&apos; access level action for service &apos;Amazon S3&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968887070/f0f58029-e3c6-44ab-ba5e-6de1a8379209.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai first 10 actions for service &apos;Amazon CloudFront&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968900218/eda5866d-8e71-49f5-9d17-83d932ffe511.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai service name with least actions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968924079/a6087662-4309-4f61-82ee-31ce31e90620.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai service name with most resource types&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968937854/eec34862-95c1-4347-8f1b-948a655fc382.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai count actions names of service &apos;Amazon S3&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968951986/f6cfa3d8-02d0-40ae-9ab8-5f63b377f3b5.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;--ai top 10 resource types by services&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968988534/3b9fd788-216c-41fc-a6aa-02972e04d409.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-demo-video&quot;&gt;Demo video&lt;/h2&gt;
&lt;div class=&quot;embed-wrapper&quot;&gt;&lt;div class=&quot;embed-loading&quot;&gt;&lt;div class=&quot;loadingRow&quot;&gt;&lt;/div&gt;&lt;div class=&quot;loadingRow&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;a class=&quot;embed-card&quot; href=&quot;https://youtu.be/rTuCec_fhlk&quot;&gt;https://youtu.be/rTuCec_fhlk&lt;/a&gt;&lt;/div&gt;
&lt;p&gt; &lt;/p&gt;
&lt;h2 id=&quot;heading-summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Using a locally hosted LLM together with an in-browser SQL Workbench enables a cost-effective and privacy-friendly way to use state-of-the-art tools without relying on third-party services.&lt;/p&gt;
</content:encoded><category>duckDB</category><category>llm</category><category>AI</category><category>SQL</category><author>Tobias Müller</author></item><item><title>Using DuckDB-WASM for in-browser Data Engineering</title><link>https://tobilg.com/posts/using-duckdb-wasm-for-in-browser-data-engineering/</link><guid isPermaLink="true">https://tobilg.com/posts/using-duckdb-wasm-for-in-browser-data-engineering/</guid><description>Introduction
DuckDB, the in-process DBMS specialized in OLAP workloads, had a very rapid growth during the last year, both in functionality, but also popularity amongst its users, but also with developers that contribute many projects to the Open Sou...</description><pubDate>Sat, 27 Jan 2024 23:00:00 GMT</pubDate><content:encoded>&lt;h1 id=&quot;heading-introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/&quot;&gt;DuckDB&lt;/a&gt;, the in-process DBMS specialized in OLAP workloads, has seen very rapid growth during the last year, both in functionality and in popularity amongst its users, as well as with developers who contribute many projects to the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/davidgasquez/awesome-duckdb&quot;&gt;Open Source DuckDB ecosystem&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706462929992/c9f633a0-d1c7-420a-b019-5aff0a8ee1de.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;DuckDB can not &quot;only&quot; be run on a variety of operating systems and architectures; there&apos;s also a &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-wasm&quot;&gt;DuckDB-WASM version&lt;/a&gt; that allows running DuckDB in a browser. This opens up some very interesting use cases, and has gained a lot of traction over the last 12 months.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706463261970/895fcefd-ff43-445a-b5fe-cc78d085125c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h1 id=&quot;heading-use-case-building-a-sql-workbench-with-duckdb-wasm&quot;&gt;Use Case: Building a SQL Workbench with DuckDB-WASM&lt;/h1&gt;
&lt;p&gt;One of the first things that came to my mind once I learned about the existence of DuckDB-WASM was that it could be used to create an online SQL Workbench, where people could interactively run queries, view their results, and also visualize them. DuckDB-WASM sits at its core, providing the storage layer, query engine &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/why_duckdb.html#standing-on-the-shoulders-of-giants&quot;&gt;and many things more&lt;/a&gt;...&lt;/p&gt;
&lt;p&gt;You can find the project at&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1708297858847/6857669b-1992-4279-93ef-9ceeb560c594.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&apos;s built with the following core technologies / frameworks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://react.dev/&quot;&gt;React&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://vitejs.dev/&quot;&gt;Vite&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/duckdb/duckdb-wasm&quot;&gt;DuckDB-WASM&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://perspective.finos.org/&quot;&gt;Perspective.js&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&apos;s hosted as a static website export / single page application on AWS using&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/cloudfront/&quot;&gt;CloudFront&lt;/a&gt; as CDN&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/s3/&quot;&gt;S3&lt;/a&gt; as file hosting service&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/certificate-manager/&quot;&gt;ACM&lt;/a&gt; for managing certificates&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/route53/&quot;&gt;Route53&lt;/a&gt; for DNS&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you&apos;re interested in the hosting setup, you can have a look at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-aws-static-websites&quot;&gt;https://github.com/tobilg/serverless-aws-static-websites&lt;/a&gt; which can deploy such static websites on AWS via IaC with minimum effort.&lt;/p&gt;
&lt;h1 id=&quot;heading-using-the-sql-workbench&quot;&gt;Using the SQL Workbench&lt;/h1&gt;
&lt;p&gt;There are many ways to use the SQL Workbench; some of them are described below.&lt;/p&gt;
&lt;h2 id=&quot;heading-overview&quot;&gt;Overview&lt;/h2&gt;
&lt;p&gt;When you open &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;sql-workbench.com&lt;/a&gt; for the first time, you can see that the workbench is divided in three different areas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;On the left, there&apos;s the &lt;strong&gt;&quot;Local Tables&quot; area&lt;/strong&gt;, which displays the tables created when you run queries such as &lt;code&gt;CREATE TABLE names (name VARCHAR)&lt;/code&gt;, or use the drag-and-drop area in the lower left corner to drop any CSV, Parquet or Arrow file onto it (see details below).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The upper main &lt;strong&gt;editor area&lt;/strong&gt; is the SQL editor, where you can type your SQL queries. You&apos;re already presented with some example queries for different types of data once the page is loaded.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lower main &lt;strong&gt;result area&lt;/strong&gt;, where the results of the executed queries are shown, or alternatively, the visualizations of these results.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728967963523/d4e48ff0-818c-46af-9e20-5e7d2fc59429.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;You can adjust the respective heights of the main areas by dragging the lever in the middle.&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;heading-running-sql-queries&quot;&gt;Running SQL queries&lt;/h2&gt;
&lt;p&gt;To run your first query, select the first line of SQL, either with your keyboard or with your mouse, and press the key combination &lt;code&gt;CMD + Enter&lt;/code&gt; if you&apos;re on a Mac, or &lt;code&gt;Ctrl + Enter&lt;/code&gt; if you&apos;re on a Windows or Linux machine.&lt;/p&gt;
&lt;p&gt;The result of the query that was executed can then be found in the lower main area as a table:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968086306/b2db05fb-f821-4387-a394-26036103cef2.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;Queries will only be executed if one or more queries are selected. If you want to execute multiple queries, make sure each query ends with a semicolon. Otherwise an error will be displayed.&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;heading-running-multiple-queries&quot;&gt;Running multiple queries&lt;/h2&gt;
&lt;p&gt;You can also run multiple queries sequentially, e.g. to create a table, insert some records, and display the results:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;, birth_cnt &lt;span class=&quot;hljs-built_in&quot;&gt;integer&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Liam&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;20456&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Noah&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;18621&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Oliver&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;15076&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;James&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;12028&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Elijah&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;11979&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;William&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;11282&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Henry&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;11221&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Lucas&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;10909&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Benjamin&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;10842&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; first_names (&lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, birth_cnt) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-string&quot;&gt;&apos;Theodore&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;10754&lt;/span&gt;);

&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; first_names;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you copy &amp;amp; paste the above SQLs, select them and run them, the result looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968117345/256bf29e-5107-4acb-9be4-cd18c0a74115.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;On the left-hand side, you can see the newly created table &lt;code&gt;first_names&lt;/code&gt;, which can be reused for other queries without having to reload the data again.&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;Only the result of the last-run query will be displayed in the lower main result area!&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;If you want to open a new SQL Workbench and directly run the above query, please click on the image below:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com/#queries=v0&amp;amp;config=%7B%22plugin%22%3A%22Datagrid%22%2C%22plugin_config%22%3A%7B%22columns%22%3A%7B%7D%2C%22editable%22%3Afalse%2C%22scroll_lock%22%3Afalse%7D%2C%22title%22%3A%22Export%22%2C%22group_by%22%3A%5B%5D%2C%22split_by%22%3A%5B%5D%2C%22columns%22%3A%5B%22name%22%2C%22birth_cnt%22%5D%2C%22filter%22%3A%5B%5D%2C%22sort%22%3A%5B%5D%2C%22expressions%22%3A%5B%5D%2C%22aggregates%22%3A%7B%7D%7D,CREATE-TABLE-first_names-(name-VARCHAR%2C-birth_cnt-integer)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Liam&apos;%2C-20456)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Noah&apos;%2C-18621)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Oliver&apos;%2C-15076)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;James&apos;%2C-12028)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Elijah&apos;%2C-11979)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;William&apos;%2C-11282)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Henry&apos;%2C-11221)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Lucas&apos;%2C-10909)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Benjamin&apos;%2C-10842)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-(&apos;Theodore&apos;%2C-10754)~,SELECT-*-FROM-first_names~&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;The data is persisted until you reload the overall SQL Workbench page.&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;
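&lt;p&gt;Since the &lt;code&gt;first_names&lt;/code&gt; table stays in the browser&apos;s DuckDB instance, you can keep querying it without reloading the data. A small follow-up example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Top 3 first names by birth count from the table created above
SELECT name, birth_cnt FROM first_names ORDER BY birth_cnt DESC LIMIT 3;
&lt;/code&gt;&lt;/pre&gt;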

&lt;h2 id=&quot;heading-querying-data-you-have-on-your-machine&quot;&gt;Querying data you have on your machine&lt;/h2&gt;
&lt;p&gt;To try this, you can for example download a list of AWS Services as a CSV from&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/csv/aws_services.csv&quot;&gt;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/csv/aws_services.csv&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This file has four columns: &lt;code&gt;service_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;prefix&lt;/code&gt; and &lt;code&gt;reference_url&lt;/code&gt;. Once you have downloaded the file, you can simply drag-and-drop it from your download folder onto the area in the lower left corner of the SQL Workbench:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968154608/ea0a76f3-9a37-47bc-8cfe-fb7b927ef7a1.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A table called &lt;code&gt;aws_services.csv&lt;/code&gt; has now been automatically created, which you can query via SQL, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, prefix &lt;span class=&quot;hljs-keyword&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;aws_services.csv&apos;&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to open a new SQL Workbench and directly run the above query, please click on the image below:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com/#queries=v0&amp;amp;config=%7B%7D,CREATE-TABLE-&apos;aws_services.csv&apos;-AS-(SELECT-*-FROM-&apos;https%3A%2F%2Fraw.githubusercontent.com%2Ftobilg%2Faws%20iam%20data%2Fmain%2Fdata%2Fcsv%2Faws_services.csv&apos;)~,SELECT-name%2C-prefix-from-&apos;aws_services.csv&apos;~&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-querying-and-visualizing-remote-data&quot;&gt;Querying and visualizing remote data&lt;/h2&gt;
&lt;p&gt;DuckDB-WASM supports the loading of compatible data in different formats (e.g. CSV, Parquet or Arrow) &lt;strong&gt;from remote http(s) sources&lt;/strong&gt;. Other data formats that can be used include JSON, but this requires the loading of so-called &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/overview&quot;&gt;DuckDB extensions&lt;/a&gt;.&lt;/p&gt;
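&lt;p&gt;As a small sketch of how that could look (the URL below is just a placeholder), loading the JSON extension and reading a remote JSON file works like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Install and load the JSON extension, then read a remote JSON file
INSTALL json;
LOAD json;
SELECT * FROM read_json_auto(&apos;https://example.com/data.json&apos;);
&lt;/code&gt;&lt;/pre&gt;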
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;It&apos;s necessary that the websites hosting the data add the relevant CORS headers, otherwise the browser (not DuckDB-WASM or the SQL Workbench) will forbid the loading of the files and show an error message instead.&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;In this example, we will use data about AWS CloudFront Edge Locations, which is available at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/aws-edge-locations/tree/main/data&quot;&gt;tobilg/aws-edge-locations&lt;/a&gt;, using this query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet&apos;&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result will look like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968183325/455a65c3-9ee6-4ec4-8e4e-bc7990d21706.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We now want to create a bar chart of the data, showing the number of Edge Locations by country and city. This can be done by &lt;strong&gt;hovering over the result table&lt;/strong&gt;, and clicking on the small &lt;strong&gt;&quot;configure&quot; button&lt;/strong&gt; that &lt;strong&gt;looks like a wrench&lt;/strong&gt;, which then appears in the &lt;strong&gt;upper right corner of the table&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968225733/9ca6f21b-d43e-4157-8fde-c4bcaeadd78a.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You then see the overview of the available columns, and the current visualization type (Datagrid):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968248434/d602c72b-27d8-4f21-ba4f-03c73568e4ee.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To get an overview of the possible visualization types click on the Datagrid icon:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968274440/c816825e-294a-40be-ba9b-8a6cf6f19c95.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then select &quot;Y Bar&quot;. This will give you an initial bar chart:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968288263/0bf5d067-a001-4117-9267-77c578af3f3c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But as we want to display the count of Edge Locations by country and city, we need to drag-and-drop the columns &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;city&lt;/code&gt; to the &quot;Group By&quot; area:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968303582/136389d8-6c43-4fc0-b70a-2641b3f82812.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We can now close the configuration menu to see the chart in its full size:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968318448/ba6b1c69-0350-4283-8897-d6957a20b0f0.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There are many other visualization types to choose from, such as Treemaps and Sunbursts, as well as Map Scatters:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968339621/e189c35e-ca66-4185-bd46-aadaa3c16bad.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968355463/2b050619-08ae-4cdf-ac7e-c67ef18d320a.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968365037/ee053723-e3d7-4bc7-9b4a-cb5fbf6074da.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
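&lt;p&gt;If you prefer a plain result table over a chart, the grouping configured in the visualization above can also be expressed directly in SQL against the remote Parquet file (a sketch of the equivalent aggregation):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Equivalent aggregation in SQL: Edge Location count by country and city
SELECT country, city, count(*) AS edge_location_cnt
FROM &apos;https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet&apos;
GROUP BY country, city
ORDER BY edge_location_cnt DESC;
&lt;/code&gt;&lt;/pre&gt;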
&lt;h2 id=&quot;heading-exporting-visualizations-and-data&quot;&gt;Exporting visualizations and data&lt;/h2&gt;
&lt;p&gt;You can also export the visualizations, as well as the data. Just click on &quot;Export&quot;, type in a &quot;Save as&quot; name, and select the output format you want to download:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968391639/285b9add-4f43-46e6-9e24-2deb93b8add8.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The data can be downloaded as CSV, JSON or Arrow file. Here&apos;s the CSV example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&quot;country (Group by 1)&quot;,&quot;count&quot;
,517
&quot;Argentina&quot;,3
&quot;Australia&quot;,10
&quot;Austria&quot;,3
&quot;Bahrain&quot;,2
&quot;Belgium&quot;,1
&quot;Brazil&quot;,21
&quot;Bulgaria&quot;,3
&quot;Canada&quot;,8
&quot;Chile&quot;,6
&quot;China&quot;,8
&quot;Colombia&quot;,3
&quot;Croatia&quot;,1
&quot;Czech Republic&quot;,1
&quot;Denmark&quot;,3
&quot;Finland&quot;,4
&quot;France&quot;,17
&quot;Germany&quot;,37
&quot;Greece&quot;,1
&quot;Hungary&quot;,1
&quot;India&quot;,48
&quot;Indonesia&quot;,5
&quot;Ireland&quot;,2
&quot;Israel&quot;,2
&quot;Italy&quot;,16
&quot;Japan&quot;,27
&quot;Kenya&quot;,1
&quot;Korea&quot;,8
&quot;Malaysia&quot;,2
&quot;Mexico&quot;,4
&quot;Netherlands&quot;,5
&quot;New Zealand&quot;,2
&quot;Nigeria&quot;,1
&quot;Norway&quot;,2
&quot;Oman&quot;,1
&quot;Peru&quot;,2
&quot;Philippines&quot;,2
&quot;Poland&quot;,5
&quot;Portugal&quot;,1
&quot;Romania&quot;,1
&quot;Singapore&quot;,7
&quot;South Africa&quot;,2
&quot;Spain&quot;,12
&quot;Sweden&quot;,4
&quot;Switzerland&quot;,2
&quot;Taiwan&quot;,3
&quot;Thailand&quot;,2
&quot;UAE&quot;,4
&quot;UK&quot;,30
&quot;United States&quot;,179
&quot;Vietnam&quot;,2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here&apos;s the exported PNG:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968419596/e7ef6b0f-3253-47b3-a144-09329b698c13.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Exporting the HTML version will give you an interactive graph with hovering etc. Furthermore, you can also change the theme for the different visualizations:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968452361/91baacbb-8616-44f2-8ac1-28ba767e6cb7.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is also reflected in the exported graphs:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968466943/6236529c-b48e-443a-a4c5-cb7e56a2632c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-using-the-schema-browser&quot;&gt;Using the schema browser&lt;/h2&gt;
&lt;p&gt;The schema browser can be found on the left-hand side. It&apos;s automatically updated after each executed query, so that all schema operations are captured. At table level, the columns, constraints and indexes are shown:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968497663/177d8402-07b9-46ae-a9a3-d7733b56fb5b.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
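&lt;p&gt;If you prefer SQL over the UI, the same kind of metadata the schema browser shows can also be queried directly from DuckDB (a small sketch, using the table from the earlier example):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Column names and types of a table
DESCRIBE first_names;

-- Constraints known to DuckDB (primary keys, foreign keys, etc.)
SELECT * FROM duckdb_constraints();
&lt;/code&gt;&lt;/pre&gt;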
&lt;p&gt;If you right-click on a table name, a context menu is shown that has different options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Generating scripts based on the table definition&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Truncating, deleting or summarizing the table&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Viewing table data (all records, first 10 and first 100)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once clicked, those menu items will create a new tab (see below), and generate and execute the appropriate SQL statements:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968531599/66db6a14-58fe-485b-b653-e3bc3deeb4ef.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
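&lt;p&gt;For example, a &quot;Summarize table&quot; or &quot;View first 10 records&quot; action presumably generates statements along these lines (a sketch, shown here for the &lt;code&gt;first_names&lt;/code&gt; table):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Summarize a table (per-column statistics)
SUMMARIZE first_names;

-- View the first 10 records
SELECT * FROM first_names LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;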
&lt;h2 id=&quot;heading-using-query-tabs&quot;&gt;Using query tabs&lt;/h2&gt;
&lt;p&gt;Another new feature is the possibility to have multiple query tabs. Those are either created automatically by context menu actions, or by the user clicking on the plus icon:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968548014/087c932b-dad1-43d0-b6d1-43085db2f885.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Each tab can be closed by clicking on the &quot;x&quot; icon next to the tab name.&lt;/p&gt;
&lt;h2 id=&quot;heading-generating-data-models&quot;&gt;Generating data models&lt;/h2&gt;
&lt;p&gt;If users have created some tables, it&apos;s then possible to create a data model from the schema metadata. If the tables also have foreign key relationships, those are also shown in the diagram. Just click on the &quot;Data Model&quot; menu entry in the lower left corner.&lt;/p&gt;
&lt;p&gt;Under the hood, this feature generates &lt;a target=&quot;_blank&quot; href=&quot;https://mermaid.js.org/syntax/entityRelationshipDiagram.html&quot;&gt;Mermaid Entity Relationship Diagram&lt;/a&gt; code, that is dynamically rendered as a graph.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968567301/4ec55815-bfa5-462d-a0d9-b7b6da96aeb7.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-using-the-query-history&quot;&gt;Using the query history&lt;/h2&gt;
&lt;p&gt;Each query that is issued in the current version of the SQL Workbench is recorded in the so-called query history. It can be accessed by clicking on the &quot;Query History&quot; menu entry in the lower left corner. Once clicked, there&apos;s an overlay on the right-hand side with the list of the issued queries.&lt;/p&gt;
&lt;p&gt;The newest queries can be found on top of the list, and with each query listed, there&apos;s also an indication when the query was run, and how long it took to execute.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968584985/0fd57061-02db-4277-b677-8ab12fdd3c59.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With the trash icon in the top-right corner, the complete query history can be cleared. Single query history entries can also be deleted, and specific queries can be re-run in a new tab by clicking &quot;Replay query&quot; in the menu that&apos;s present for each query history entry.&lt;/p&gt;
&lt;h1 id=&quot;heading-example-data-engineering-pipeline&quot;&gt;Example Data Engineering pipeline&lt;/h1&gt;
&lt;h2 id=&quot;heading-dataset-amp-goals&quot;&gt;Dataset &amp;amp; Goals&lt;/h2&gt;
&lt;p&gt;A well-known dataset is the NYC TLC Trip Record dataset. It is freely available on the &lt;a target=&quot;_blank&quot; href=&quot;https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page&quot;&gt;NYC Taxi and Limousine Commission website&lt;/a&gt;. It also comes with some explanations and additional lookup data. In this example, we focus on the yellow taxi data.&lt;/p&gt;
&lt;p&gt;The goal of this example pipeline is to create a clean Data Mart from the given trip records and location data, being able to support some basic analysis of the data via OLAP patterns.&lt;/p&gt;
&lt;h2 id=&quot;heading-source-data-analysis&quot;&gt;Source Data analysis&lt;/h2&gt;
&lt;p&gt;On the NYC TLC website, there&apos;s a &lt;a target=&quot;_blank&quot; href=&quot;https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf&quot;&gt;PDF file&lt;/a&gt; explaining the structure and contents of the data. The table structure can be found below; the highlighted columns indicate dimensional values, for which we&apos;ll build dimension tables in the later steps.&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Column name&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;mark&gt;VendorID&lt;/mark&gt;&lt;/td&gt;&lt;td&gt;A code indicating the TPEP provider that provided the record.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tpep_pickup_datetime&lt;/td&gt;&lt;td&gt;The date and time when the meter was engaged.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tpep_dropoff_datetime&lt;/td&gt;&lt;td&gt;The date and time when the meter was disengaged.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passenger_count&lt;/td&gt;&lt;td&gt;The number of passengers in the vehicle. This is a driver-entered value.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trip_distance&lt;/td&gt;&lt;td&gt;The elapsed trip distance in miles reported by the taximeter.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;mark&gt;PULocationID&lt;/mark&gt;&lt;/td&gt;&lt;td&gt;TLC Taxi Zone in which the taximeter was engaged&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;mark&gt;DOLocationID&lt;/mark&gt;&lt;/td&gt;&lt;td&gt;TLC Taxi Zone in which the taximeter was disengaged&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;mark&gt;RateCodeID&lt;/mark&gt;&lt;/td&gt;&lt;td&gt;The final rate code in effect at the end of the trip.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;mark&gt;Store_and_fwd_flag&lt;/mark&gt;&lt;/td&gt;&lt;td&gt;This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;mark&gt;Payment_type&lt;/mark&gt;&lt;/td&gt;&lt;td&gt;A numeric code signifying how the passenger paid for the trip.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fare_amount&lt;/td&gt;&lt;td&gt;The time-and-distance fare calculated by the meter.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extra&lt;/td&gt;&lt;td&gt;Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTA_tax&lt;/td&gt;&lt;td&gt;$0.50 MTA tax that is automatically triggered based on the metered rate in use.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Improvement_surcharge&lt;/td&gt;&lt;td&gt;$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tip_amount&lt;/td&gt;&lt;td&gt;Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tolls_amount&lt;/td&gt;&lt;td&gt;Total amount of all tolls paid in trip.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total_amount&lt;/td&gt;&lt;td&gt;The total amount charged to passengers. Does not include cash tips.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Congestion_Surcharge&lt;/td&gt;&lt;td&gt;Total amount collected in trip for NYS congestion surcharge.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Airport_fee&lt;/td&gt;&lt;td&gt;$1.25 for pick up only at LaGuardia and John F. Kennedy Airports&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;p&gt;There&apos;s an additional &lt;a target=&quot;_blank&quot; href=&quot;https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv&quot;&gt;CSV file&lt;/a&gt; for the so-called Taxi Zones, as well as a &lt;a target=&quot;_blank&quot; href=&quot;https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip&quot;&gt;SHX shapefile&lt;/a&gt; containing the same info, but with additional geo information. The structure is the following:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Column name&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LocationID&lt;/td&gt;&lt;td&gt;TLC Taxi Zone, corresponding to the PULocationID and DOLocationID columns in the trip dataset&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Borough&lt;/td&gt;&lt;td&gt;The name of the NYC borough this Taxi Zone is in&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zone&lt;/td&gt;&lt;td&gt;The name of the Taxi Zone&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;service_zone&lt;/td&gt;&lt;td&gt;Can either be &quot;Yellow Zone&quot; or &quot;Boro Zone&quot;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h2 id=&quot;heading-target-data-model&quot;&gt;Target Data Model&lt;/h2&gt;
&lt;p&gt;The target data model is derived from the original trip record data, with extracted dimension tables plus a new date hierarchy dimension. Also, the naming schema gets unified and cleaned up.&lt;/p&gt;
&lt;p&gt;It is modeled as a so-called &lt;a target=&quot;_blank&quot; href=&quot;https://en.wikipedia.org/wiki/Snowflake_schema&quot;&gt;Snowflake Schema&lt;/a&gt; (check the Mermaid source):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706571275733/932ced37-0a11-4eb1-a08c-cf123529ec8c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-loading-amp-transforming-the-data&quot;&gt;Loading &amp;amp; Transforming the data&lt;/h2&gt;
&lt;p&gt;The loading &amp;amp; transforming of the data is divided into multiple steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Generating the dimensional tables and values from the given dataset information&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generating a date hierarchy dimension&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loading and transforming the trip data&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Replacing values with dimension references&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cleaning up the column naming&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unifying values&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-generating-the-dimensional-tables&quot;&gt;&lt;strong&gt;Generating the dimensional tables&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;We use the given dataset information from the PDF file to manually create dimension tables and their values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Install and load the spatial extension&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; spatial;
&lt;span class=&quot;hljs-keyword&quot;&gt;LOAD&lt;/span&gt; spatial;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create temporary table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; tmp_service_zones &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
        &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; service_zone 
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
        &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv&apos;&lt;/span&gt; 
    &lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt; 
        service_zone != &lt;span class=&quot;hljs-string&quot;&gt;&apos;N/A&apos;&lt;/span&gt; 
    &lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; 
        service_zone
); 

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_zone_type table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_zone_type (
    zone_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Insert dim_zone_type table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_zone_type
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    &lt;span class=&quot;hljs-number&quot;&gt;-1&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; zone_type_id, 
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;N/A&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    (&lt;span class=&quot;hljs-keyword&quot;&gt;rowid&lt;/span&gt; + &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;)::&lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; zone_type_id, 
    service_zone &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
    tmp_service_zones 
; 

&lt;span class=&quot;hljs-comment&quot;&gt;-- Drop table tmp_service_zones&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;DROP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; tmp_service_zones; 

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create temporary table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; tmp_borough &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
        &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; borough 
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
        &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv&apos;&lt;/span&gt; 
    &lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt;
        borough != &lt;span class=&quot;hljs-string&quot;&gt;&apos;Unknown&apos;&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; 
        borough
); 

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_borough table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_borough (
    borough_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Insert dim_borough table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_borough
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    &lt;span class=&quot;hljs-number&quot;&gt;-1&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; borough_id, 
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;N/A&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    (&lt;span class=&quot;hljs-keyword&quot;&gt;rowid&lt;/span&gt; + &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;)::&lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; borough_id, 
    borough &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
    tmp_borough 
; 

&lt;span class=&quot;hljs-comment&quot;&gt;-- Drop temporary table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;DROP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; tmp_borough;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_zone table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_zone (
    zone_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    zone_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
    borough_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;,
    geojson &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (zone_type_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; dim_zone_type (zone_type_id),
    &lt;span class=&quot;hljs-keyword&quot;&gt;FOREIGN&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt; (borough_id) &lt;span class=&quot;hljs-keyword&quot;&gt;REFERENCES&lt;/span&gt; dim_borough (borough_id)
);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Insert dim_zone table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_zone 
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;CASE&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; csv.LocationID &lt;span class=&quot;hljs-keyword&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; csv.LocationID::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;ELSE&lt;/span&gt; raw.LocationID
    &lt;span class=&quot;hljs-keyword&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; zone_id, 
    zt.zone_type_id, 
    &lt;span class=&quot;hljs-keyword&quot;&gt;CASE&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; b.borough_id &lt;span class=&quot;hljs-keyword&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; b.borough_id
        &lt;span class=&quot;hljs-keyword&quot;&gt;ELSE&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;-1&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; borough_id,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CASE&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; csv.Zone &lt;span class=&quot;hljs-keyword&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;NULL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; csv.Zone
        &lt;span class=&quot;hljs-keyword&quot;&gt;ELSE&lt;/span&gt; raw.zone
    &lt;span class=&quot;hljs-keyword&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;, 
    raw.geojson 
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
    (
        &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
            LocationID, 
            borough,
            zone,  
            geojson 
        &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
            (
                &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
                    LocationID, 
                    borough, 
                    zone, 
                    &lt;span class=&quot;hljs-keyword&quot;&gt;rank&lt;/span&gt;() &lt;span class=&quot;hljs-keyword&quot;&gt;OVER&lt;/span&gt; (&lt;span class=&quot;hljs-keyword&quot;&gt;PARTITION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; LocationID &lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; Shape_Leng) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ranked, 
                    ST_AsGeoJSON(ST_Transform(geom, &lt;span class=&quot;hljs-string&quot;&gt;&apos;ESRI:102718&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;EPSG:4326&apos;&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; geojson 
                &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; ST_Read(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.shx&apos;&lt;/span&gt;) 
            ) sub
        &lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt; 
            sub.ranked = &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt; 
    ) &lt;span class=&quot;hljs-keyword&quot;&gt;raw&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;FULL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OUTER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt; 
    (
        &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; 
            LocationID,
            Zone, 
            service_zone 
        &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
            &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv&apos;&lt;/span&gt; 
    ) csv 
&lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt; 
    csv.LocationId = raw.LocationId 
&lt;span class=&quot;hljs-keyword&quot;&gt;FULL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OUTER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt; 
    dim_zone_type zt 
&lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt; 
    csv.service_zone = zt.name 
&lt;span class=&quot;hljs-keyword&quot;&gt;FULL&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;OUTER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt; 
    dim_borough b 
&lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt; 
    b.name = raw.borough
&lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt;
    zone_id &lt;span class=&quot;hljs-keyword&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;NULL&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt;
    zone_id
;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_rate_code table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_rate_code (
    rate_code_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Standard rate&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;JFK&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Newark&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Nassau or Westchester&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Negotiated fare&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;6&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Group ride&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_rate_code (rate_code_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;99&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;N/A&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_payment_type table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_payment_type (
    payment_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_payment_type (payment_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Credit card&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_payment_type (payment_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Cash&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_payment_type (payment_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;No charge&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_payment_type (payment_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Dispute&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_payment_type (payment_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;5&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Unknown&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_payment_type (payment_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;6&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Voided trip&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_vendor table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_vendor (
    vendor_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_vendor (vendor_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Creative Mobile Technologies&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_vendor (vendor_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;VeriFone Inc.&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_stored_type table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_stored_type (
    stored_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_stored_type (stored_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Store and forward trip&apos;&lt;/span&gt;);
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_stored_type (stored_type_id, &lt;span class=&quot;hljs-keyword&quot;&gt;name&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;VALUES&lt;/span&gt; (&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-string&quot;&gt;&apos;Not a store and forward trip&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-generating-a-date-hierarchy-dimension&quot;&gt;&lt;strong&gt;Generating a date hierarchy dimension&lt;/strong&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Create dim_date table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; dim_date (
    day_dt &lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt; PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    day_name &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;,
    day_of_week &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    day_of_month &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    day_of_year &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    week_of_year &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    month_of_year &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;,
    month_name &lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt;,
    &lt;span class=&quot;hljs-keyword&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt;
);

&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; dim_date
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    date_key &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; day_dt,
    &lt;span class=&quot;hljs-keyword&quot;&gt;DAYNAME&lt;/span&gt;(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; day_name,
    ISODOW(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; day_of_week,
    &lt;span class=&quot;hljs-keyword&quot;&gt;DAYOFMONTH&lt;/span&gt;(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; day_of_month,
    &lt;span class=&quot;hljs-keyword&quot;&gt;DAYOFYEAR&lt;/span&gt;(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; day_of_year, 
    &lt;span class=&quot;hljs-keyword&quot;&gt;WEEKOFYEAR&lt;/span&gt;(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; week_of_year,
    &lt;span class=&quot;hljs-keyword&quot;&gt;MONTH&lt;/span&gt;(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; month_of_year,
    MONTHNAME(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;VARCHAR&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; month_name,
    &lt;span class=&quot;hljs-keyword&quot;&gt;YEAR&lt;/span&gt;(date_key)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;year&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
    (
        &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
            &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;RANGE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; date_key 
        &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
            &lt;span class=&quot;hljs-keyword&quot;&gt;RANGE&lt;/span&gt;(&lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;2005-01-01&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;2030-12-31&apos;&lt;/span&gt;, &lt;span class=&quot;hljs-built_in&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DAY&lt;/span&gt;)
    ) generate_date
;
&lt;/code&gt;&lt;/pre&gt;
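&lt;p&gt;Before moving on, you can quickly verify that the generated calendar covers the intended range (a minimal sketch, using the &lt;code&gt;dim_date&lt;/code&gt; table created above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Check the extent and size of the generated date dimension
SELECT
    min(day_dt) AS first_day,
    max(day_dt) AS last_day,
    count(*) AS day_count
FROM
    dim_date;
&lt;/code&gt;&lt;/pre&gt;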
&lt;h3 id=&quot;heading-loading-and-transforming-the-trip-data&quot;&gt;&lt;strong&gt;Loading and transforming the trip data&lt;/strong&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Create sequence for generating trip_ids&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;SEQUENCE&lt;/span&gt; trip_id_sequence &lt;span class=&quot;hljs-keyword&quot;&gt;START&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Create fact_trip table&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; fact_trip (
    trip_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;  &lt;span class=&quot;hljs-keyword&quot;&gt;DEFAULT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;nextval&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&apos;trip_id_sequence&apos;&lt;/span&gt;) PRIMARY &lt;span class=&quot;hljs-keyword&quot;&gt;KEY&lt;/span&gt;,
    pickup_zone_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
    pickup_dt &lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt;,
    pickup_ts &lt;span class=&quot;hljs-built_in&quot;&gt;TIMESTAMP&lt;/span&gt;,
    dropoff_zone_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;,
    dropoff_dt &lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt;,
    dropoff_ts &lt;span class=&quot;hljs-built_in&quot;&gt;TIMESTAMP&lt;/span&gt;,
    rate_code_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
    stored_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
    payment_type_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
    vendor_id &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;, 
    passenger_count &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    trip_distance_miles &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    fare_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    extra_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    mta_tax_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    improvement_surcharge_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    tip_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    tolls_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    congestion_surcharge_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    airport_fee_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;,
    total_amount &lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt;
);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Deactivating FK relationships for now, due to performance issues when inserting 3 million records&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (pickup_zone_id) REFERENCES dim_zone (zone_id),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (dropoff_zone_id) REFERENCES dim_zone (zone_id),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (pickup_dt) REFERENCES dim_date (day_dt),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (dropoff_dt) REFERENCES dim_date (day_dt),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (rate_code_id) REFERENCES dim_rate_code (rate_code_id),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (stored_type_id) REFERENCES dim_stored_type (stored_type_id),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (payment_type_id) REFERENCES dim_payment_type (payment_type_id),&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;-- FOREIGN KEY (vendor_id) REFERENCES dim_vendor (vendor_id)&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Insert transformed fact data&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;INSERT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;INTO&lt;/span&gt; fact_trip
&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
    &lt;span class=&quot;hljs-keyword&quot;&gt;nextval&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&apos;trip_id_sequence&apos;&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_id,
    PULocationID::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; pickup_zone_id,
    tpep_pickup_datetime::&lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; pickup_dt,
    tpep_pickup_datetime &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; pickup_ts,
    DOLocationID::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; dropoff_zone_id,
    tpep_dropoff_datetime::&lt;span class=&quot;hljs-built_in&quot;&gt;DATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; dropoff_dt,
    tpep_dropoff_datetime &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; dropoff_ts,
    RatecodeID::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; rate_code_id,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CASE&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; store_and_fwd_flag = &lt;span class=&quot;hljs-string&quot;&gt;&apos;Y&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; store_and_fwd_flag = &lt;span class=&quot;hljs-string&quot;&gt;&apos;N&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; stored_type_id,
    payment_type::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; payment_type_id,
    VendorID::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; vendor_id,
    passenger_count,
    trip_distance &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_distance_miles,
    fare_amount,
    extra &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; extra_amount,
    mta_tax &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; mta_tax_amount,
    improvement_surcharge &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; improvement_surcharge_amount,
    tip_amount,
    tolls_amount,
    congestion_surcharge &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; congestion_surcharge_amount,
    airport_fee &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; airport_fee_amount,
    total_amount
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://data.quacking.cloud/nyc-taxi-data/yellow_tripdata_2023-01.parquet&apos;&lt;/span&gt;
;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-data-analysis&quot;&gt;Data Analysis&lt;/h2&gt;
&lt;p&gt;The following analyses are just examples of how you could analyze the dataset. Feel free to think about your own questions for the dataset, and try to build the queries yourself!&lt;/p&gt;
&lt;h3 id=&quot;heading-preparation&quot;&gt;Preparation&lt;/h3&gt;
&lt;p&gt;DuckDB supports the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/meta/summarize&quot;&gt;SUMMARIZE&lt;/a&gt; command, which can help you understand the final data in the fact table before querying it. It launches a query that computes a number of aggregates over all columns, including &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, &lt;code&gt;avg&lt;/code&gt;, &lt;code&gt;std&lt;/code&gt; and &lt;code&gt;approx_unique&lt;/code&gt;.&lt;/p&gt;
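&lt;p&gt;For example, the whole fact table can be summarized with a single statement (a minimal sketch, assuming the &lt;code&gt;fact_trip&lt;/code&gt; table built above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Compute summary statistics for all columns of the fact table
SUMMARIZE fact_trip;
&lt;/code&gt;&lt;/pre&gt;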
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706635916309/a55a933c-cf39-49e2-adcd-f159a619ec65.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The output already shows some &quot;interesting&quot; things, such as &lt;code&gt;pickup_dt&lt;/code&gt; values far in the past (e.g. &lt;code&gt;2008-12-31&lt;/code&gt;), with &lt;code&gt;dropoff_dt&lt;/code&gt; showing similar outliers (&lt;code&gt;2009-01-01&lt;/code&gt;).&lt;/p&gt;
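&lt;p&gt;If you want to take a closer look at these outliers, a query along the following lines can help (a sketch, assuming that the source file should only contain January 2023 data):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- List pickup dates outside the January 2023 reporting period
SELECT
    pickup_dt,
    count(*) AS trip_count
FROM
    fact_trip
WHERE
    pickup_dt &lt; DATE &apos;2023-01-01&apos;
    OR pickup_dt &gt; DATE &apos;2023-01-31&apos;
GROUP BY
    pickup_dt
ORDER BY
    pickup_dt;
&lt;/code&gt;&lt;/pre&gt;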
&lt;h3 id=&quot;heading-most-utilized-trip-locations&quot;&gt;Most utilized trip locations&lt;/h3&gt;
&lt;p&gt;With this analysis, we want to have a look at the 20 most frequented trips from pickup zone to dropoff zone:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
    pz.name || &lt;span class=&quot;hljs-string&quot;&gt;&apos; -&amp;gt; &apos;&lt;/span&gt; || dz.name &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_description,
    &lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; ft.trip_id)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_count,
    &lt;span class=&quot;hljs-keyword&quot;&gt;sum&lt;/span&gt;(ft.passenger_count)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; passenger_count,
    &lt;span class=&quot;hljs-keyword&quot;&gt;sum&lt;/span&gt;(ft.total_amount)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; total_amount,
    &lt;span class=&quot;hljs-keyword&quot;&gt;sum&lt;/span&gt;(ft.trip_distance_miles)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_distance_miles,
    (&lt;span class=&quot;hljs-keyword&quot;&gt;sum&lt;/span&gt;(trip_distance_miles)/&lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; ft.trip_id))::&lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_distance_miles_avg,
    (&lt;span class=&quot;hljs-keyword&quot;&gt;sum&lt;/span&gt;(ft.total_amount)/&lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; ft.trip_id))::&lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; total_amount_avg,
    (&lt;span class=&quot;hljs-keyword&quot;&gt;sum&lt;/span&gt;(ft.passenger_count)/&lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; ft.trip_id))::&lt;span class=&quot;hljs-keyword&quot;&gt;DOUBLE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; passenger_count_avg
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
    fact_trip ft
&lt;span class=&quot;hljs-keyword&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt;
    dim_zone pz
&lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt;
    pz.zone_id = ft.pickup_zone_id
&lt;span class=&quot;hljs-keyword&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt;
    dim_zone dz
&lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt;
    dz.zone_id = ft.dropoff_zone_id
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt;
    trip_description
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt;
    trip_count &lt;span class=&quot;hljs-keyword&quot;&gt;DESC&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;20&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On an M2 Mac Mini with 16GB of RAM, aggregating the 3 million trips takes around 900ms. The result looks like the following:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706652583933/219edfc9-9f2e-4e57-976f-99c1f01cceb8.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We can now also create a Y Bar chart showing the (total) trip count, passenger count and trip distance for the top 20 most frequented trips:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706652794872/03f96773-09a8-4dd4-8318-7a272f6b0ab1.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;End result:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706652997260/f9a98328-d4b1-4d3a-8ffa-bf16eb11a998.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h3 id=&quot;heading-trip-frequency-by-weekday-and-time-of-day&quot;&gt;Trip frequency by weekday and time of day&lt;/h3&gt;
&lt;p&gt;To inspect the traffic patterns, we analyze the trip frequency by weekday and time of day (aggregated at an hourly level). For this, we make use of DuckDB&apos;s advanced timestamp/time handling functions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
    dd.day_name,
    dd.day_of_week,
    &lt;span class=&quot;hljs-keyword&quot;&gt;datepart&lt;/span&gt;(&lt;span class=&quot;hljs-string&quot;&gt;&apos;hour&apos;&lt;/span&gt;, time_bucket(&lt;span class=&quot;hljs-built_in&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;1 HOUR&apos;&lt;/span&gt;, ft.pickup_ts)) day_hour,
    &lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt; ft.trip_id)::&lt;span class=&quot;hljs-built_in&quot;&gt;INT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; trip_count,
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
    fact_trip ft
&lt;span class=&quot;hljs-keyword&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt;
    dim_date dd
&lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt;
    dd.day_dt = ft.pickup_dt
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt;
    dd.day_name,
    dd.day_of_week,
    day_hour
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt;
    dd.day_of_week,
    day_hour;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we can configure a Y Bar chart that can show us the number of trips by weekday and hour:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706654378057/28b1a67c-6a18-4544-af3a-942840b4a18a.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;End result:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706654434855/47d1005d-6466-47c2-8ce3-80c8c626c83e.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h1 id=&quot;heading-sharing-of-data-pipelines-amp-visualizations&quot;&gt;Sharing of Data Pipelines &amp;amp; Visualizations&lt;/h1&gt;
&lt;p&gt;With the latest version of &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;sql-workbench.com&lt;/a&gt; it&apos;s possible to share both queries and the (customized) visualization of the last executed query. To do this, you write your queries, run them to check that they work, and then update the visualization configuration.&lt;/p&gt;
&lt;p&gt;Once you&apos;ve done that, you can click on &quot;Share queries&quot; in the lower left corner of the SQL Workbench. A toggle lets you choose whether you want to copy the visualization configuration as well.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706620864709/aa314815-a44d-428d-99c6-449b4aede72e.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you want to run the complete pipeline to build our dimensional data model, you can click on the link below (this can take from 10 to 120 seconds depending on your machine and internet connection speed, as approximately 50MB of data will be downloaded):&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://dub.sh/run-data-pipeline&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div data-node-type=&quot;callout&quot;&gt;
&lt;div data-node-type=&quot;callout-emoji&quot;&gt;💡&lt;/div&gt;
&lt;div data-node-type=&quot;callout-text&quot;&gt;&lt;strong&gt;The above link unfortunately has to be routed through https://dub.co as a URL shortener, because HashNode doesn&apos;t support very long URLs as links!&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h1 id=&quot;heading-attaching-remote-databases&quot;&gt;Attaching remote databases&lt;/h1&gt;
&lt;p&gt;Since DuckDB v1.0.0, it is possible to &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/network_cloud_storage/duckdb_over_https_or_s3.html&quot;&gt;attach a remote database via HTTPS or S3&lt;/a&gt;. As an example, you could use the below statements to load and query a remote dataset of AWS IAM data from GitHub:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1728968630673/ea696998-9434-4215-8ac3-cd926f26536f.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;ATTACH &apos;https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/db/iam.duckdb&apos; AS iam (READ_ONLY);

&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
  s.name &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; service_name, 
  &lt;span class=&quot;hljs-keyword&quot;&gt;count&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;distinct&lt;/span&gt; a.name)::&lt;span class=&quot;hljs-built_in&quot;&gt;int&lt;/span&gt; action_cnt 
&lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; 
  iam.services s 
&lt;span class=&quot;hljs-keyword&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;JOIN&lt;/span&gt; 
  iam.actions a &lt;span class=&quot;hljs-keyword&quot;&gt;ON&lt;/span&gt; s.service_id = a.service_id 
&lt;span class=&quot;hljs-keyword&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; action_cnt &lt;span class=&quot;hljs-keyword&quot;&gt;desc&lt;/span&gt; 
&lt;span class=&quot;hljs-keyword&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;25&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Click on the image below to execute the queries:&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com/#queries=v0,ATTACH-&apos;https%3A%2F%2Fraw.githubusercontent.com%2Ftobilg%2Faws%20iam%20data%2Fmain%2Fdata%2Fdb%2Fiam.duckdb&apos;-AS-iam-(READ_ONLY)~,SELECT-s.name-as-service_name%2C-count(distinct-a.name)%3A%3Aint-action_cnt-FROM-iam.services-s-INNER-JOIN-iam.actions-a-ON-s.service_id-%3D-a.service_id-GROUP-BY-ALL-ORDER-BY-action_cnt-desc-LIMIT-25&quot;&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1 id=&quot;heading-conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;With DuckDB-WASM and some common web frameworks, it&apos;s pretty easy and fast to create custom data applications that can handle datasets of millions of records.&lt;/p&gt;
&lt;p&gt;Those applications are able to provide a very lightweight approach to working with different types of data (such as &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/import/parquet_import&quot;&gt;Parquet&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/import/s3_iceberg_import&quot;&gt;Iceberg&lt;/a&gt;, Arrow, &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/import/csv_import&quot;&gt;CSV&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/json&quot;&gt;JSON&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/spatial.html#st_read---read-spatial-data-from-files&quot;&gt;spatial data formats&lt;/a&gt;), whether local or remote (via HTTP(S) or S3-compatible storage services), thanks to the versatile DuckDB engine.&lt;/p&gt;
&lt;p&gt;Users can interactively work with the data, create data pipelines by using raw SQL, and iterate until the final desired state has been achieved. The generated data pipeline queries can easily be shared with simple links to &lt;a target=&quot;_blank&quot; href=&quot;https://sql-workbench.com&quot;&gt;sql-workbench.com&lt;/a&gt;, so that other collaborators can continue to iterate on the existing work, or even create new solutions with it.&lt;/p&gt;
&lt;p&gt;Once a data pipeline has been finalized, it could for example be deployed to DuckDB instances running in the users&apos; own cloud accounts. A great example would be &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-nodejs-layer&quot;&gt;running DuckDB in AWS Lambda&lt;/a&gt;, e.g. for &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner&quot;&gt;repartitioning Parquet data in S3 nightly&lt;/a&gt;, or automatically running reports based on aggregation pipelines etc.&lt;/p&gt;
&lt;p&gt;The possibilities are nearly endless, so I&apos;m very curious what you all build with this great technology! Thanks for reading this lengthy article; I&apos;m happy to answer any questions in the comments.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1706547031708/c6cefe3e-c142-4cf4-acd2-da65d2a3140b.jpeg&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
</content:encoded><category>data-engineering</category><category>wasm</category><category>SQL</category><category>duckDB</category><category>data analytics</category><author>Tobias Müller</author></item><item><title>Retrieving Lambda@Edge CloudWatch Logs</title><link>https://tobilg.com/posts/retrieving-lambda-at-edge-cloudwatch-logs/</link><guid isPermaLink="true">https://tobilg.com/posts/retrieving-lambda-at-edge-cloudwatch-logs/</guid><description>What is Lambda@Edge
AWS Lambda@Edge is an extension of the traditional AWS Lambda service, but with a crucial twist – it brings serverless computing capabilities closer to the end-users.
In essence, Lambda@Edge empowers developers to run custom code ...</description><pubDate>Fri, 26 Jan 2024 20:11:22 GMT</pubDate><content:encoded>&lt;h3 id=&quot;heading-what-is-lambdaedge&quot;&gt;What is Lambda@Edge&lt;/h3&gt;
&lt;p&gt;AWS Lambda@Edge is an extension of the traditional AWS Lambda service, but with a crucial twist – it brings serverless computing capabilities closer to the end-users.&lt;/p&gt;
&lt;p&gt;In essence, Lambda@Edge empowers developers to run custom code in response to &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-cloudfront-trigger-events.html&quot;&gt;specific CloudFront events&lt;/a&gt;. Whether it&apos;s tailoring content based on user location, device type, or handling real-time data processing at the edge, Lambda@Edge functions open up a world of possibilities for optimizing content delivery.&lt;/p&gt;
&lt;p&gt;Lambda@Edge functions have the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They can access the public internet&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They can be run before or after your cache&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They allow developers to view and modify not only the client request/response but also the origin request/response&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They can read the body of the request&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They can use a much bigger package size for code compared to CloudFront Functions (1MB for a client request/response trigger and 50MB for an origin request/response trigger)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They allow execution times of up to 5 seconds for a client request/response trigger and up to 30 seconds for an origin request/response trigger.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-how-can-you-retrieve-the-lambdaedge-logs&quot;&gt;How can you retrieve the Lambda@Edge logs?&lt;/h3&gt;
&lt;p&gt;Because Lambda@Edge functions run in the CloudFront Regional Edge Caches, the CloudWatch logs can be found not only in &lt;code&gt;us-east-1&lt;/code&gt;, but potentially in all other AWS regions that support &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/cloudfront/features/&quot;&gt;Regional Edge Caches&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-edge-testing-debugging.html&quot;&gt;Lambda@Edge testing and debugging guide&lt;/a&gt; states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When you review CloudWatch log files or metrics when you&apos;re troubleshooting errors, be aware that they are displayed or stored in the Region closest to the location where the function executed. So, if you have a website or web application with users in the United Kingdom, and you have a Lambda function associated with your distribution, for example, you must change the Region to view the CloudWatch metrics or log files for the London AWS Region.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, one simple solution to automatically get the logs from all regions would be to write a script that searches for relevant CloudWatch Logs groups in each region, and displays their respective LogStream entries.&lt;/p&gt;
&lt;p&gt;Such a script could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/bin/bash&lt;/span&gt;

FUNCTION_NAME=&lt;span class=&quot;hljs-variable&quot;&gt;$1&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; region &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; $(aws --output text ec2 describe-regions | cut -f 4) 
&lt;span class=&quot;hljs-keyword&quot;&gt;do&lt;/span&gt;
  &lt;span class=&quot;hljs-built_in&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&quot;Checking &lt;span class=&quot;hljs-variable&quot;&gt;$region&lt;/span&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; loggroup &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; $(aws --output text logs describe-log-groups --log-group-name-prefix &lt;span class=&quot;hljs-string&quot;&gt;&quot;/aws/lambda/us-east-1.&lt;span class=&quot;hljs-variable&quot;&gt;$FUNCTION_NAME&lt;/span&gt;&quot;&lt;/span&gt; --region &lt;span class=&quot;hljs-variable&quot;&gt;$region&lt;/span&gt; --query &lt;span class=&quot;hljs-string&quot;&gt;&apos;logGroups[].logGroupName&apos;&lt;/span&gt;)
  &lt;span class=&quot;hljs-keyword&quot;&gt;do&lt;/span&gt;
    &lt;span class=&quot;hljs-built_in&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&quot;Found &apos;&lt;span class=&quot;hljs-variable&quot;&gt;$loggroup&lt;/span&gt;&apos; in region &lt;span class=&quot;hljs-variable&quot;&gt;$region&lt;/span&gt;&quot;&lt;/span&gt;
    &lt;span class=&quot;hljs-keyword&quot;&gt;for&lt;/span&gt; logstream &lt;span class=&quot;hljs-keyword&quot;&gt;in&lt;/span&gt; $(aws --output text logs describe-log-streams --log-group-name &lt;span class=&quot;hljs-variable&quot;&gt;$loggroup&lt;/span&gt; --region &lt;span class=&quot;hljs-variable&quot;&gt;$region&lt;/span&gt; --query &lt;span class=&quot;hljs-string&quot;&gt;&apos;logStreams[].logStreamName&apos;&lt;/span&gt;)
    &lt;span class=&quot;hljs-keyword&quot;&gt;do&lt;/span&gt;
      aws --output text logs get-log-events --log-group-name &lt;span class=&quot;hljs-variable&quot;&gt;$loggroup&lt;/span&gt; --region &lt;span class=&quot;hljs-variable&quot;&gt;$region&lt;/span&gt; --log-stream-name &lt;span class=&quot;hljs-variable&quot;&gt;$logstream&lt;/span&gt; | cat
    &lt;span class=&quot;hljs-keyword&quot;&gt;done&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;done&lt;/span&gt;
&lt;span class=&quot;hljs-keyword&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you save this script as &lt;code&gt;cf-logs.sh&lt;/code&gt; and give it execution rights with &lt;code&gt;chmod +x cf-logs.sh&lt;/code&gt;, it can then be started with &lt;code&gt;./cf-logs.sh YOUR_FUNCTION_NAME&lt;/code&gt;, where &lt;code&gt;YOUR_FUNCTION_NAME&lt;/code&gt; is the real Lambda@Edge function name.&lt;/p&gt;
&lt;p&gt;This solution is great for quick log viewing while you&apos;re still developing your edge-enabled application, but surely is not a sustainable one when running in production.&lt;/p&gt;
&lt;h3 id=&quot;heading-aggregating-lambdaedge-logs-with-kinesis&quot;&gt;Aggregating Lambda@Edge logs with Kinesis&lt;/h3&gt;
&lt;p&gt;For a production setup, you&apos;d probably want to be able to aggregate and store the individual logs coming from the different regions in one place. A possible solution is to stream them to a Kinesis Firehose Delivery Stream, which then stores the logs in S3.&lt;/p&gt;
&lt;p&gt;An example implementation can be found at &lt;a target=&quot;_blank&quot; href=&quot;https://gist.github.com/heitorlessa/5d2295655f9d76483969d215986e53b0&quot;&gt;https://gist.github.com/heitorlessa/5d2295655f9d76483969d215986e53b0&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Please be aware that this will incur additional fixed and variable costs, so please review the AWS pricing of the services used.&lt;/p&gt;
</content:encoded><category>serverless</category><category>lambda@edge</category><category>AWS</category><author>Tobias Müller</author></item><item><title>List of free AWS Knowledge Badges</title><link>https://tobilg.com/posts/list-of-free-aws-knowledge-badges/</link><guid isPermaLink="true">https://tobilg.com/posts/list-of-free-aws-knowledge-badges/</guid><description>As the Skillbuilder website is sometimes a bit hard to navigate, here&apos;s the full list of free badges you can do on AWS Skillbuilder:

AWS Knowledge: Cloud Essentials

AWS Knowledge: Architecting

AWS Knowledge: Serverless

AWS Knowledge: Object Stora...</description><pubDate>Fri, 01 Sep 2023 10:32:10 GMT</pubDate><content:encoded>&lt;p&gt;As the Skillbuilder website is sometimes a bit hard to navigate, here&apos;s the full list of free badges you can do on &lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/&quot;&gt;AWS Skillbuilder&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/lp/82/Cloud%2520Essentials%2520-%2520Knowledge%2520Badge%2520Readiness%2520Path&quot;&gt;AWS Knowledge: Cloud Essentials&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/lp/1044/Solutions%2520Architect%2520-%2520Knowledge%2520Badge%2520Readiness%2520Path&quot;&gt;AWS Knowledge: Architecting&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/lp/92/Serverless%2520-%2520Knowledge%2520Badge%2520Readiness%2520Path&quot;&gt;AWS Knowledge: Serverless&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/lp/51/object-storage-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Object Storage&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/93/block-storage-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Block Storage&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/95/file-storage-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: File Storage&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/94/storage-data-migration-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Data Migration&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/54/storage-data-protection-and-disaster-recovery-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Data Protection &amp;amp; Disaster Recovery&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/1944/networking-core-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Networking Core&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/public/learning_plan/view/1985/compute-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Compute&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/public/learning_plan/view/1931/amazon-eks-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: EKS&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/public/learning_plan/view/1927/events-and-workflows-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Events and Workflows&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/public/learning_plan/view/1986/amazon-braket-badge-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Braket&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/1570/aws-for-games-cloud-game-development-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: AWS for Games: Cloud Game Development&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://explore.skillbuilder.aws/learn/learning_plan/view/1722/media-entertainment-direct-to-consumer-and-broadcast-foundations-knowledge-badge-readiness-path&quot;&gt;AWS Knowledge: Media &amp;amp; Entertainment: Direct-to-Consumer and Broadcast Foundations&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>AWS</category><category>skills</category><category>learning</category><category>free</category><category>badge</category><author>Tobias Müller</author></item><item><title>Serverless Maps for fun and profit</title><link>https://tobilg.com/posts/serverless-maps-for-fun-and-profit/</link><guid isPermaLink="true">https://tobilg.com/posts/serverless-maps-for-fun-and-profit/</guid><description>Introduction
In today&apos;s data-driven world, interactive and visually appealing web-based maps have become an integral part of countless applications and services. Whether it&apos;s for navigation, location-based services, or data visualization, delivering ...</description><pubDate>Mon, 07 Aug 2023 13:10:07 GMT</pubDate><content:encoded>&lt;h1 id=&quot;heading-introduction&quot;&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In today&apos;s data-driven world, interactive and visually appealing web-based maps have become an integral part of countless applications and services. Whether it&apos;s for navigation, location-based services, or data visualization, delivering seamless and lightning-fast user experiences is paramount. In this article, we explore how to host web-based maps on Amazon Web Services (AWS) in a serverless manner, and show how this approach not only boosts speed but also brings significant cost advantages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web-Based Maps: A Key to Engaging User Experiences&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The ubiquity of smartphones and the proliferation of internet-connected devices have transformed how we interact with the digital world. Whether we&apos;re exploring a new city, tracking our fitness activities, or monitoring real-time data, web-based maps provide a dynamic and immersive way to engage users. However, delivering smooth and responsive map experiences can be challenging, especially as data sizes and user demands continue to grow.&lt;/p&gt;
&lt;p&gt;Traditionally, hosting web-based maps required dedicated servers or complex infrastructure setups, which often resulted in high operational costs and scalability limitations. Fortunately, the advent of serverless computing and the capabilities offered by AWS have fundamentally changed how we can deploy and manage web applications, including mapping services.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Serverless map hosting: A Game-Changer&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;AWS&apos; serverless services allow developers to build and run applications without worrying about provisioning and managing servers. This approach not only simplifies development but also offers numerous advantages, particularly when it comes to hosting web-based maps.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt; and &lt;strong&gt;Responsiveness&lt;/strong&gt;: With serverless hosting via CloudFront and S3, the infrastructure caches data at the edge. This means that users will experience near-instantaneous map loading and seamless interactions, translating to heightened satisfaction and engagement.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: One of the most compelling aspects of the serverless approach is its cost-effectiveness. Traditional hosting methods often require paying for idle server resources, which can quickly add up and strain budgets. In contrast, serverless computing on AWS charges you only for the actual compute resources consumed during map requests. As a result, you can significantly reduce operational costs and allocate resources more efficiently.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In this comprehensive guide, we&apos;ll walk you through the steps of setting up a serverless web-based map hosting solution on AWS. From leveraging AWS Lambda for dynamic map tile querying to utilizing S3 for scalable and cost-effective data storage and using CloudFront for edge caching, you&apos;ll learn how to harness the full potential of AWS services for a top-notch mapping experience.&lt;/p&gt;
&lt;h1 id=&quot;heading-choosing-a-map-tile-format&quot;&gt;Choosing a map tile format&lt;/h1&gt;
&lt;p&gt;In the realm of digital maps, efficiency and performance are paramount. Whether you&apos;re navigating through city streets or exploring remote terrains, quick loading times and seamless zooming can make all the difference in delivering a superior user experience. Enter &lt;a target=&quot;_blank&quot; href=&quot;https://protomaps.com/docs/pmtiles&quot;&gt;PMTiles&lt;/a&gt;, a cutting-edge map tile format that is redefining how we interact with digital maps.&lt;/p&gt;
&lt;h2 id=&quot;heading-what-are-pmtiles&quot;&gt;What are PMTiles?&lt;/h2&gt;
&lt;p&gt;Traditional map tile formats, such as the popular PNG or JPEG images, can be bulky and slow to load, especially when dealing with intricate cartography or high-resolution imagery. PMTiles was specifically designed to address these limitations, offering a lightweight and performant solution for delivering map tiles.&lt;/p&gt;
&lt;p&gt;According to the &lt;a target=&quot;_blank&quot; href=&quot;https://protomaps.com/docs/pmtiles&quot;&gt;protomaps.com&lt;/a&gt; website:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;PMTiles is a single-file archive format for pyramids of tiled data. A PMTiles archive can be hosted on a storage platform like S3, and enables low-cost, zero-maintenance map applications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Concepts include:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A general format for tiled data addressable by Z/X/Y coordinates, which can be cartographic basemap vector tiles, remote sensing observations, JPEG images, or more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Readers use &lt;a target=&quot;_blank&quot; href=&quot;https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests&quot;&gt;HTTP Range Requests&lt;/a&gt; to fetch only the relevant tile or metadata inside a PMTiles archive on-demand.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The arrangement of tiles and directories is designed to minimize the number of overhead requests when panning and zooming.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The current specification of PMTiles v3 can be found on &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/protomaps/PMTiles/blob/main/spec/v3/spec.md&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-why-pmtiles-matter&quot;&gt;Why do PMTiles matter?&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Faster Loading Times:&lt;/strong&gt; PMTiles leverages the power of vector data to store map tiles, which means that the tiles are smaller in size compared to raster image formats like PNG or JPEG. This smaller size translates to quicker loading times, enabling a seamless map browsing experience even in areas with limited network connectivity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient Storage:&lt;/strong&gt; Due to their compact size, PMTiles require less storage space than traditional map tile formats. This advantage is particularly significant for applications with large map datasets, as it reduces the infrastructure costs associated with storing and serving the maps.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Styling:&lt;/strong&gt; PMTiles offer the flexibility of dynamic styling, allowing developers to modify the appearance of map tiles on the fly. By adjusting colors, labels, and map features in real-time, developers can create customized maps that cater to the specific needs of their users.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offline Accessibility:&lt;/strong&gt; One of the most impressive features of PMTiles is its ability to support offline access. By preloading PMTiles on a user&apos;s device, applications can ensure uninterrupted map access, even when an internet connection is not available. This functionality is invaluable for users in remote locations or areas with unstable network coverage. In addition, cache headers like &lt;code&gt;Cache-Control&lt;/code&gt; can save bandwidth and rendering time when the tiles are consumed in a browser.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;heading-how-can-pmtiles-be-used&quot;&gt;How can PMTiles be used?&lt;/h2&gt;
&lt;p&gt;PMTiles can be used with common map rendering engines like Leaflet, MapLibre GL or OpenLayers by using the &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/protomaps&quot;&gt;protomaps.js&lt;/a&gt; library. Examples of how to integrate it can be found on the website, or in the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/serverlessmaps/serverlessmaps/blob/main/website/index.html&quot;&gt;index.html&lt;/a&gt; and &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/serverlessmaps/serverlessmaps/blob/main/website/basemap.html&quot;&gt;basemap.html&lt;/a&gt; examples in our &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/serverlessmaps/serverlessmaps&quot;&gt;repo&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-generating-pmtiles-from-openstreetmap-data&quot;&gt;Generating PMTiles from OpenStreetMap data&lt;/h2&gt;
&lt;p&gt;The first step towards self-hosted maps is downloading publicly available map data from OpenStreetMap. This data is then transformed into a compatible basemap layer for use with PMTiles.&lt;/p&gt;
&lt;p&gt;See the documentation about &lt;a target=&quot;_blank&quot; href=&quot;https://protomaps.com/docs/frontends/basemap-layers&quot;&gt;vector basemap layers&lt;/a&gt; to get an idea of how this works in general. The following caveat (from &lt;a target=&quot;_blank&quot; href=&quot;https://protomaps.com/docs/frontends/basemap-layers&quot;&gt;protomaps.com&lt;/a&gt;) is worth mentioning, though:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The organization of features with layers and tags is specific to Protomaps services; this means that map styles are not directly portable with other systems such as &lt;a target=&quot;_blank&quot; href=&quot;https://openmaptiles.org/schema/&quot;&gt;OpenMapTiles&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://tilezen.readthedocs.io/en/latest/&quot;&gt;Mapzen tiles&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This step can be automated to a large extent.&lt;/p&gt;
&lt;h1 id=&quot;heading-solution&quot;&gt;Solution&lt;/h1&gt;
&lt;p&gt;Enter &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/serverlessmaps/serverlessmaps&quot;&gt;ServerlessMaps&lt;/a&gt;! The project is outlined in the following paragraphs.&lt;/p&gt;
&lt;h1 id=&quot;heading-architecture&quot;&gt;Architecture&lt;/h1&gt;
&lt;p&gt;In our example implementation, we rely on only a few services:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS Lambda (proxying the map tile requests to the S3 origin)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon CloudFront (globally distributed CDN)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon S3 (storing the PMTiles files, and the map example website)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon CloudWatch (keeping the logs of the Lambda function)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overall architecture looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1689947287518/920eba80-8993-47b8-a513-69437fef3cc9.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h1 id=&quot;heading-deployment&quot;&gt;Deployment&lt;/h1&gt;
&lt;p&gt;The deployment of &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/serverlessmaps/serverlessmaps&quot;&gt;ServerlessMaps&lt;/a&gt; must be done in multiple steps. From a high-level perspective, they are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Setting up the project from GitHub locally&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting up the local environment to prepare the basemaps&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building the desired basemaps as a PMTiles file&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploying the serverless infrastructure on AWS&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uploading the basemaps and example websites to S3 (done automatically)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-setting-up-the-project-locally&quot;&gt;Setting up the project locally&lt;/h3&gt;
&lt;p&gt;To clone the project to your local machine, please use&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;git &lt;span class=&quot;hljs-built_in&quot;&gt;clone&lt;/span&gt; https://github.com/serverlessmaps/serverlessmaps.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;in your desired directory. Then, run &lt;code&gt;cd serverlessmaps&lt;/code&gt; from the directory in which you ran the above command.&lt;/p&gt;
&lt;h3 id=&quot;heading-setting-up-the-local-environment&quot;&gt;Setting up the local environment&lt;/h3&gt;
&lt;p&gt;This step assumes that you&apos;re using macOS as your operating system. If so, you can run&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/install_macos.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to install the dependencies needed to build the basemaps as PMTiles file. This will install the latest OpenJDK and Maven, which are both necessary for the basemaps build.&lt;/p&gt;
&lt;p&gt;If you&apos;re not using macOS, you can also install those two dependencies manually.&lt;/p&gt;
&lt;p&gt;The next step is to compile the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/onthegomap/planetiler&quot;&gt;Planetiler&lt;/a&gt; build profile, which can later generate the PMTiles file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/compile_basemaps_builder.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a new directory &lt;code&gt;builder&lt;/code&gt; containing the runnable builder JAR.&lt;/p&gt;
&lt;h3 id=&quot;heading-build-the-desired-basemaps&quot;&gt;Build the desired basemaps&lt;/h3&gt;
&lt;p&gt;To build a basemap with the previously compiled builder, run&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/build_pmtiles.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will build a map for the default area of Hamburg, Germany. If you want to generate maps for other areas, have a look at the OSM (sub-)region names, e.g. on the &lt;a target=&quot;_blank&quot; href=&quot;https://download.geofabrik.de/&quot;&gt;https://download.geofabrik.de/&lt;/a&gt; server.&lt;/p&gt;
&lt;p&gt;For example, if you&apos;d like to generate a map for the whole of Europe, you could run&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;scripts/build_pmtiles.sh europe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Please be aware that this will run for several hours depending on your machine, and will generate a PMTiles file of around 45GB. This file will take some time to upload to S3 in the next step as well. If you just want to try out this project, the recommendation is to use the default &lt;a target=&quot;_blank&quot; href=&quot;https://download.geofabrik.de/europe/germany/hamburg.html&quot;&gt;hamburg&lt;/a&gt; sub-region, which is around 35MB.&lt;/p&gt;
&lt;h3 id=&quot;heading-deploy-the-serverless-infrastructure&quot;&gt;Deploy the serverless infrastructure&lt;/h3&gt;
&lt;p&gt;This project assumes that you have already set up your AWS credentials locally so that the &lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com&quot;&gt;Serverless framework&lt;/a&gt; can use them accordingly.&lt;/p&gt;
&lt;p&gt;To deploy the serverless AWS infrastructure, you can do a &lt;code&gt;cd iac&lt;/code&gt; from the project&apos;s root directory, and then use&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;sls deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to deploy the necessary stack.&lt;/p&gt;
&lt;p&gt;You can customize some parameters for the deployment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;region&lt;/code&gt;: The AWS region you want to deploy your stack to (default: &lt;code&gt;us-east-1&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;stage&lt;/code&gt;: The stage name (default: &lt;code&gt;prd&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;cors&lt;/code&gt;: The allowed hostname for the CORS header (default: &lt;code&gt;*&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following will deploy the stack to the &lt;code&gt;eu-central-1&lt;/code&gt; region with the stage name &lt;code&gt;dev&lt;/code&gt; and the allowed CORS hostname &lt;a target=&quot;_blank&quot; href=&quot;http://mymapservice.xyz&quot;&gt;&lt;code&gt;mymapservice.xyz&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;sls deploy --region eu-central-1 --stage dev --cors mymapservice.xyz
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;heading-stack-output&quot;&gt;Stack output&lt;/h4&gt;
&lt;p&gt;The deployment of the stack will generate an output like this on the console:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-text&quot;&gt;---------------------------------------------------------------------------------

-&amp;gt; The map can be viewed at https://d1b056iuztreqte.cloudfront.net/#13.54/53.54958/9.99286

-&amp;gt; The basemap themes can be viewed at https://d1b056iuztreqte.cloudfront.net/basemap.html#13.54/53.54958/9.99286

-&amp;gt; Please manually do a &apos;npm run website:sync&apos; on your console to sync the static website assets if you changed them after the last deployment

-&amp;gt; Afterwards, run &apos;npm run website:invalidate&apos; to invalidate the website&apos;s CloudFront distribution cache

---------------------------------------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;heading-automatically-created-files&quot;&gt;Automatically created files&lt;/h4&gt;
&lt;p&gt;Two files are created automatically, based on the settings you chose before:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;website/urlConfig.js&lt;/code&gt;: This contains the CloudFront Distribution hostname (variable &lt;code&gt;tilesDistributionHostname&lt;/code&gt;) for the caching of the PMTiles. This is assigned by CloudFront during deployment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;website/tilePath.js&lt;/code&gt;: This contains the needed &lt;code&gt;tilePath&lt;/code&gt; variable, which depends on the area you chose for the basemap. This is generated by the &lt;code&gt;scripts/build_pmtiles.sh&lt;/code&gt; script automatically.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-upload-the-basemaps-and-example-websites&quot;&gt;Upload the basemaps and example websites&lt;/h3&gt;
&lt;p&gt;The basemap that was generated in the step before the deployment, as well as the two example websites, are synced automatically to the website S3 bucket.&lt;/p&gt;
&lt;p&gt;If you change your web application/website after the initial deployment, you need to run the sync manually via &lt;code&gt;npm run website:sync&lt;/code&gt;. After that, the CloudFront cache needs to be invalidated as well to serve the new content. This can be done via &lt;code&gt;npm run website:invalidate&lt;/code&gt;; both commands are run from the &lt;code&gt;iac&lt;/code&gt; directory.&lt;/p&gt;
&lt;h2 id=&quot;heading-result&quot;&gt;Result&lt;/h2&gt;
&lt;p&gt;If everything went well, you can access the URL (&lt;a target=&quot;_blank&quot; href=&quot;https://d1b056iuztreqte.cloudfront.net/#13.54/53.54958/9.99286&quot;&gt;&lt;code&gt;https://d1b056iuztreqte.cloudfront.net/#13.54/53.54958/9.99286&lt;/code&gt;&lt;/a&gt; in the above example output) to view your basic map:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1691413129462/db36e396-d763-48b7-aa30-70f74d45d1b7.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h1 id=&quot;heading-conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Hosting web-based maps on AWS in a serverless manner opens up a world of opportunities for delivering blazing-fast, interactive, and cost-efficient map services. By taking advantage of AWS&apos;s scalable infrastructure and pay-as-you-go pricing model, you can create an outstanding user experience without breaking the bank.&lt;/p&gt;
</content:encoded><category>serverless</category><category>maps</category><category>CDN</category><author>Tobias Müller</author></item><item><title>Gathering and analyzing public cloud provider IP address data with DuckDB &amp; Observable</title><link>https://tobilg.com/posts/gathering-and-analyzing-public-cloud-provider-ip-address-data-with-duckdb-observerable/</link><guid isPermaLink="true">https://tobilg.com/posts/gathering-and-analyzing-public-cloud-provider-ip-address-data-with-duckdb-observerable/</guid><description>As organizations increasingly adopt the public cloud, managing the networking and security aspects of cloud computing becomes more complex. One of the challenges that cloud administrators face is, especially in a hybrid cloud environment, keeping tra...</description><pubDate>Wed, 26 Apr 2023 16:06:39 GMT</pubDate><content:encoded>&lt;p&gt;As organizations increasingly adopt the public cloud, managing the networking and security aspects of cloud computing becomes more complex. One of the challenges that cloud administrators face is, especially in a hybrid cloud environment, keeping track of the IP address ranges of the public cloud providers, which all use different file formats to publish their IP address range data. The formats include deeply nested JSONs, CSVs as well as plain text.&lt;/p&gt;
&lt;p&gt;The goal of this article is to outline how this data can be unified, cleaned and made available on a platform that makes it easy for users to consume. Furthermore, some interesting statistics can be derived from those public datasets.&lt;/p&gt;
&lt;p&gt;The data and the source code can be found at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/public-cloud-provider-ip-ranges&quot;&gt;https://github.com/tobilg/public-cloud-provider-ip-ranges&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;heading-data-sources&quot;&gt;Data sources&lt;/h1&gt;
&lt;p&gt;The (incomplete) list of public cloud providers that publish their &lt;a target=&quot;_blank&quot; href=&quot;https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#IPv4_CIDR_blocks&quot;&gt;IPv4 CIDR blocks&lt;/a&gt; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://ip-ranges.amazonaws.com/ip-ranges.json&quot;&gt;AWS&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20230417.json&quot;&gt;Azure&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.cloudflare.com/ips-v4&quot;&gt;CloudFlare&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://digitalocean.com/geo/google.csv&quot;&gt;DigitalOcean&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://api.fastly.com/public-ip-list&quot;&gt;Fastly&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.gstatic.com/ipranges/cloud.json&quot;&gt;Google Cloud&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json&quot;&gt;Oracle Cloud&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can click on the individual links to view or download the data manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HINT: The published lists of IP address ranges don&apos;t represent the overall IP address space that each of these providers possesses.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To get the complete IP ranges, data from organizations like ARIN would need to be added as well. For simplicity and brevity, only the publicly downloadable info was used.&lt;/p&gt;
&lt;h1 id=&quot;heading-retrieving-cleaning-storing-and-exporting-data&quot;&gt;Retrieving, cleaning, storing and exporting data&lt;/h1&gt;
&lt;p&gt;The overall data engineering process is divided into multiple steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Identifying the data sources (see above) and defining the common schema&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieving the raw data from public data sources (via HTTP)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cleaning the retrieved data (e.g. remove duplicates)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storing the data in a common schema, so that it can be aggregated and analyzed for different public cloud providers at once&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exporting the stored data into different file formats, so that as many different types of clients as possible can make use of it&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, we&apos;d like to keep the costs low, and the infrastructure as simple as possible. That&apos;s why DuckDB was chosen as the database layer: it offers a rich and advanced set of features to handle (read and write) different file formats, and it can read directly from remote data sources via HTTP, using nothing but SQL. That saves the additional effort of out-of-band ETL.&lt;/p&gt;
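&lt;p&gt;As a minimal illustration of that capability (using the CloudFlare list from the sources above, and assuming the &lt;code&gt;httpfs&lt;/code&gt; extension is installed and loaded as shown further below), a single SQL statement is enough to query a remote file over HTTP:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Count the IPv4 CIDR blocks published by CloudFlare, straight from the remote file
SELECT COUNT(*) AS cidr_blocks
FROM read_csv_auto(&apos;https://www.cloudflare.com/ips-v4&apos;);
&lt;/code&gt;&lt;/pre&gt;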
&lt;p&gt;Furthermore, to share the data, we chose GitHub, which is free to use for the scope of our use case. Most importantly, it allows us to store the exported data files in our &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/public-cloud-provider-ip-ranges/tree/main/data/providers&quot;&gt;repository&lt;/a&gt;. To run the overall process, &lt;a target=&quot;_blank&quot; href=&quot;https://docs.github.com/en/actions&quot;&gt;GitHub Actions&lt;/a&gt; are used as they also offer a free usage tier, and have everything we need to &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/public-cloud-provider-ip-ranges/blob/main/.github/workflows/main.yml&quot;&gt;create the described data pipeline&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;heading-common-data-schema&quot;&gt;Common data schema&lt;/h2&gt;
&lt;p&gt;After inspecting the data source files, we derived the following unified schema for all loaded data sources:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Column name&lt;/td&gt;&lt;td&gt;Data type&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cloud_provider&lt;/td&gt;&lt;td&gt;VARCHAR&lt;/td&gt;&lt;td&gt;The public cloud provider&apos;s name&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cidr_block&lt;/td&gt;&lt;td&gt;VARCHAR&lt;/td&gt;&lt;td&gt;The CIDR block, e.g. &lt;code&gt;10.0.0.0/32&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ip_address&lt;/td&gt;&lt;td&gt;VARCHAR&lt;/td&gt;&lt;td&gt;The IP address, e.g. &lt;code&gt;10.0.0.0&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ip_address_mask&lt;/td&gt;&lt;td&gt;INTEGER&lt;/td&gt;&lt;td&gt;The IP address mask, e.g. &lt;code&gt;32&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ip_address_cnt&lt;/td&gt;&lt;td&gt;INTEGER&lt;/td&gt;&lt;td&gt;The number of IP addresses in this CIDR block&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;region&lt;/td&gt;&lt;td&gt;VARCHAR&lt;/td&gt;&lt;td&gt;The public cloud provider region information (if given)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h2 id=&quot;heading-creating-the-cloud-provider-tables&quot;&gt;Creating the cloud provider tables&lt;/h2&gt;
&lt;p&gt;First, we&apos;ll create a table in DuckDB for each of the public cloud providers. If DuckDB is installed (see &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/installation/&quot;&gt;docs&lt;/a&gt;) and available in the PATH, we can execute SQL scripts like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;# $DATA_PATH is the location of the DuckDB database file (this is important, because otherwise an in-memory database will be created automatically, which cannot persist the data)&lt;/span&gt;
&lt;span class=&quot;hljs-comment&quot;&gt;# $SCRIPT is the path to the SQL script that shall be executed&lt;/span&gt;
duckdb &lt;span class=&quot;hljs-variable&quot;&gt;$DATA_PATH&lt;/span&gt; &amp;lt; &lt;span class=&quot;hljs-variable&quot;&gt;$SCRIPT&lt;/span&gt;.sql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each table is created with a different SQL statement, as the data sources (contents and formats) of each provider are different.&lt;/p&gt;
&lt;p&gt;Before starting, we need to make sure that the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/httpfs&quot;&gt;httpfs extension&lt;/a&gt; is &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/public-cloud-provider-ip-ranges/blob/main/queries/install_extensions.sql&quot;&gt;installed&lt;/a&gt; and loaded (as we use remote datasets):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;INSTALL&lt;/span&gt; httpfs;
&lt;span class=&quot;hljs-keyword&quot;&gt;LOAD&lt;/span&gt; httpfs;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-aws-table&quot;&gt;AWS table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; aws_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes.cidr_block,
    prefixes.ip_address,
    prefixes.ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;POW&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-prefixes.ip_address_mask) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    prefixes.region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
      prefix_object.ip_prefix &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
      STR_SPLIT(prefix_object.ip_prefix, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
      &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefix_object.ip_prefix, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
      prefix_object.region
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
      &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;UNNEST&lt;/span&gt;(prefixes) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefix_object &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://ip-ranges.amazonaws.com/ip-ranges.json&apos;&lt;/span&gt;
    )
  ) prefixes
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-azure-table&quot;&gt;Azure table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; azure_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
    STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;POW&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-&lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CASE&lt;/span&gt;
      &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; region = &lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;No region&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-keyword&quot;&gt;ELSE&lt;/span&gt; region
    &lt;span class=&quot;hljs-keyword&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
      prop.region &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region,
      &lt;span class=&quot;hljs-keyword&quot;&gt;UNNEST&lt;/span&gt;(prop.addressPrefixes) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefixes
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
      &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
        values.properties &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prop
      &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
        &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
          &lt;span class=&quot;hljs-keyword&quot;&gt;UNNEST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;values&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;values&lt;/span&gt;
        &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
          read_json_auto(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20230417.json&apos;&lt;/span&gt;, maximum_object_size=&lt;span class=&quot;hljs-number&quot;&gt;10000000&lt;/span&gt;)
      )
    )
  )
  &lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt;
    prefixes &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;%::%&apos;&lt;/span&gt;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-cloudflare-table&quot;&gt;CloudFlare table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; cloudflare_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
    STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;POW&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-&lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;No region&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
      column0 &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefixes
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
      read_csv_auto(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://www.cloudflare.com/ips-v4&apos;&lt;/span&gt;)
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-digitalocean-table&quot;&gt;DigitalOcean table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; digitalocean_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
    STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;POW&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-&lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;No region&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
      column0 &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefixes
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
      read_csv_auto(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://digitalocean.com/geo/google.csv&apos;&lt;/span&gt;)
    &lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt;
      column0 &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;%::%&apos;&lt;/span&gt;
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-fastly-table&quot;&gt;Fastly table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; fastly_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
    STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;POW&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-&lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    &lt;span class=&quot;hljs-string&quot;&gt;&apos;No region&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
      &lt;span class=&quot;hljs-keyword&quot;&gt;UNNEST&lt;/span&gt;(addresses) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefixes
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
      read_json_auto(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://api.fastly.com/public-ip-list&apos;&lt;/span&gt;)
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-google-cloud-table&quot;&gt;Google Cloud table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; google_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes.cidr_block,
    prefixes.ip_address,
    prefixes.ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;pow&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-prefixes.ip_address_mask) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    prefixes.region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt;
      prefix_object.ipv4Prefix &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
      str_split(prefix_object.ipv4Prefix, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
      &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(str_split(prefix_object.ipv4Prefix, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
      prefix_object.scope &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; region
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
      &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;unnest&lt;/span&gt;(prefixes) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefix_object &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://www.gstatic.com/ipranges/cloud.json&apos;&lt;/span&gt;
    )
    &lt;span class=&quot;hljs-keyword&quot;&gt;WHERE&lt;/span&gt;
      prefix_object.ipv4Prefix &lt;span class=&quot;hljs-keyword&quot;&gt;IS&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;NULL&lt;/span&gt;
  ) prefixes
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-oracle-cloud-table&quot;&gt;Oracle Cloud table&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;TABLE&lt;/span&gt; oracle_ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
    prefixes.cidr &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidr_block,
    STR_SPLIT(prefixes.cidr, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes.cidr, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_mask,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(&lt;span class=&quot;hljs-keyword&quot;&gt;POW&lt;/span&gt;(&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;32&lt;/span&gt;-&lt;span class=&quot;hljs-keyword&quot;&gt;CAST&lt;/span&gt;(STR_SPLIT(prefixes.cidr, &lt;span class=&quot;hljs-string&quot;&gt;&apos;/&apos;&lt;/span&gt;)[&lt;span class=&quot;hljs-number&quot;&gt;2&lt;/span&gt;] &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;hljs-built_in&quot;&gt;INTEGER&lt;/span&gt;) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; ip_address_cnt,
    &lt;span class=&quot;hljs-keyword&quot;&gt;CASE&lt;/span&gt;
      &lt;span class=&quot;hljs-keyword&quot;&gt;WHEN&lt;/span&gt; region = &lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;THEN&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;No region&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-keyword&quot;&gt;ELSE&lt;/span&gt; region
    &lt;span class=&quot;hljs-keyword&quot;&gt;END&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region
  &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
    &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;DISTINCT&lt;/span&gt;
      region,
      &lt;span class=&quot;hljs-keyword&quot;&gt;UNNEST&lt;/span&gt;(cidrs) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; prefixes
    &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
      &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
        regions.region &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; region,
        regions.cidrs &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; cidrs
      &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; (
        &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; 
          &lt;span class=&quot;hljs-keyword&quot;&gt;UNNEST&lt;/span&gt;(regions) &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; regions
        &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt;
          read_json_auto(&lt;span class=&quot;hljs-string&quot;&gt;&apos;https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json&apos;&lt;/span&gt;, maximum_object_size=&lt;span class=&quot;hljs-number&quot;&gt;10000000&lt;/span&gt;)
      )
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-create-a-combined-view&quot;&gt;Create a combined view&lt;/h2&gt;
&lt;p&gt;The next step is to create a new view (&lt;code&gt;ip_data&lt;/code&gt;) that combines our tables for the individual cloud providers. We can then use this view later to compare the different cloud providers.&lt;/p&gt;
&lt;p&gt;The view definition looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-keyword&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;VIEW&lt;/span&gt; ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;AS&lt;/span&gt; (
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AWS&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; aws_ip_data
  &lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;Azure&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; azure_ip_data
  &lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;CloudFlare&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; cloudflare_ip_data
  &lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;DigitalOcean&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; digitalocean_ip_data
  &lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;Fastly&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; fastly_ip_data
  &lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;Google Cloud&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; google_ip_data
  &lt;span class=&quot;hljs-keyword&quot;&gt;UNION&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;Oracle&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;as&lt;/span&gt; cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; oracle_ip_data
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-export-the-data&quot;&gt;Export the data&lt;/h2&gt;
&lt;p&gt;To be able to use the data with other tools, we need to export the data to different formats, in our case &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/import/csv_export&quot;&gt;CSV&lt;/a&gt; and &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/guides/import/parquet_export&quot;&gt;Parquet&lt;/a&gt;. You can review the executed queries in the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/public-cloud-provider-ip-ranges/blob/main/queries/export_provider_data.sql&quot;&gt;repository&lt;/a&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;&lt;span class=&quot;hljs-comment&quot;&gt;-- Only an example, this needs to be done for all providers as well!&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;-- Export complete data as CSV&lt;/span&gt;
COPY (&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; cloud_provider, cidr_block) &lt;span class=&quot;hljs-keyword&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;data/providers/all.csv&apos;&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;WITH&lt;/span&gt; (HEADER &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;, DELIMITER &lt;span class=&quot;hljs-string&quot;&gt;&apos;,&apos;&lt;/span&gt;);

&lt;span class=&quot;hljs-comment&quot;&gt;-- Export complete data as Parquet&lt;/span&gt;
COPY (&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; ip_data &lt;span class=&quot;hljs-keyword&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;hljs-keyword&quot;&gt;BY&lt;/span&gt; cloud_provider, cidr_block) &lt;span class=&quot;hljs-keyword&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;data/providers/all.parquet&apos;&lt;/span&gt; (&lt;span class=&quot;hljs-keyword&quot;&gt;FORMAT&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;parquet&apos;&lt;/span&gt;, COMPRESSION &lt;span class=&quot;hljs-string&quot;&gt;&apos;SNAPPY&apos;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
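&lt;p&gt;The per-provider files in the repository are produced in the same way, just filtered accordingly. A minimal sketch of what such an export could look like (the target file name is chosen for illustration and may differ from the actual repository layout):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Illustrative per-provider export (here: AWS) as CSV
COPY (SELECT * FROM ip_data WHERE cloud_provider = &apos;AWS&apos; ORDER BY cidr_block) TO &apos;data/providers/aws.csv&apos; WITH (HEADER 1, DELIMITER &apos;,&apos;);
&lt;/code&gt;&lt;/pre&gt;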
&lt;h2 id=&quot;heading-analyze-the-data&quot;&gt;Analyze the data&lt;/h2&gt;
&lt;p&gt;Now that we have prepared our data, we can start analyzing it. For this, we&apos;ll use an &lt;a target=&quot;_blank&quot; href=&quot;https://observablehq.com/@duckdb-projects/public-cloud-provider-ip-ranges&quot;&gt;ObservableHQ notebook&lt;/a&gt;, to which we upload the &lt;a target=&quot;_blank&quot; href=&quot;https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv&quot;&gt;all.csv&lt;/a&gt; file.&lt;/p&gt;
&lt;h3 id=&quot;heading-overall-ip-address-counts&quot;&gt;Overall IP address counts&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;514&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=Overall&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-total-number-and-value-of-ip-addresses&quot;&gt;Total number and value of IP addresses&lt;/h3&gt;
&lt;p&gt;An astonishing insight is that both &lt;strong&gt;AWS&lt;/strong&gt; and &lt;strong&gt;Azure&lt;/strong&gt; have more than &lt;strong&gt;six times as many&lt;/strong&gt; IP addresses available as their next competitor.&lt;/p&gt;
&lt;p&gt;Also, the &lt;strong&gt;market values&lt;/strong&gt; of their IP addresses are &lt;strong&gt;nearly four billion dollars&lt;/strong&gt;, according to a &lt;a target=&quot;_blank&quot; href=&quot;https://circleid.com/posts/20220610-recent-ipv4-pricing-trends-may-2022&quot;&gt;market analysis&lt;/a&gt;.&lt;/p&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;348.09375&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=NumberAndValueOfIPAddresses&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-cidr-masks-distribution-by-public-cloud-provider&quot;&gt;CIDR masks distribution by public cloud provider&lt;/h3&gt;
&lt;p&gt;It&apos;s remarkable that, although AWS and Azure have similar absolute numbers of IP addresses, the sizes of their CIDR blocks / IP ranges differ strongly: AWS owns a small number of very large IP ranges, whereas Azure owns a great many rather small IP ranges and just a few very large ones:&lt;/p&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;514&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=IPRangeSizeFrequencyGraph&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Another view is the filterable table for this data:&lt;/p&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;479&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=IPRangeSizeFrequency&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-aws-cidr-masks&quot;&gt;AWS CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;479&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=AWSbyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-azure-cidr-masks&quot;&gt;Azure CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;479&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=AzurebyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-cloudflare-cidr-masks&quot;&gt;CloudFlare CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;393.6875&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=CloudFlarebyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-digitalocean-cidr-masks&quot;&gt;DigitalOcean CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;445&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=DigitalOceanbyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-fastly-cidr-masks&quot;&gt;Fastly CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;420&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=FastlybyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-google-cloud-cidr-masks&quot;&gt;Google Cloud CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;479&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=GoogleCloudbyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;heading-oracle-cloud-cidr-masks&quot;&gt;Oracle Cloud CIDR masks&lt;/h3&gt;
&lt;iframe width=&quot;100%&quot; height=&quot;479&quot; src=&quot;https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=OracleCloudbyCIDRMask&quot;&gt;&lt;/iframe&gt;

&lt;h1 id=&quot;heading-conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this article, we described a simple and straightforward way to gather and transform data in different formats with DuckDB, and to export it to a common schema as CSV and Parquet files.&lt;/p&gt;
&lt;p&gt;Furthermore, we leveraged DuckDB on &lt;a target=&quot;_blank&quot; href=&quot;https://observablehq.com/&quot;&gt;Observable&lt;/a&gt; to analyze and display the data in beautiful and interactive graphs.&lt;/p&gt;
&lt;p&gt;By using the GitHub Actions free tier as our &quot;runtime&quot; for the data processing via Bash and SQL scripts, and hosting our data in a GitHub repo (also covered by the free tier), we were able to show that data pipelines like the covered use case can be built without accruing infrastructure costs. Also, the &lt;a target=&quot;_blank&quot; href=&quot;https://observablehq.com/@observablehq/team-and-individual-workspaces?collection=@observablehq/accounts-and-workspaces#freeTeam&quot;&gt;Observable pricing model&lt;/a&gt; supports our analyses for free.&lt;/p&gt;
</content:encoded><category>duckDB</category><category>dataengineering</category><category>free</category><category>dataanalytics</category><category>observable</category><author>Tobias Müller</author></item><item><title>Casual data engineering, or: A poor man&apos;s Data Lake in the cloud - Part I</title><link>https://tobilg.com/posts/casual-data-engineering-or-a-poor-mans-data-lake-in-the-cloud-part-i/</link><guid isPermaLink="true">https://tobilg.com/posts/casual-data-engineering-or-a-poor-mans-data-lake-in-the-cloud-part-i/</guid><description>In the age of big data, organizations of all sizes are collecting vast amounts of information about their operations, customers, and markets. To make sense of this data, many are turning to data lakes - centralized repositories that store and manage ...</description><pubDate>Mon, 24 Apr 2023 06:00:40 GMT</pubDate><content:encoded>&lt;p&gt;In the age of big data, organizations of all sizes are collecting vast amounts of information about their operations, customers, and markets. To make sense of this data, many are turning to data lakes - centralized repositories that store and manage data of all types and sizes, from structured to unstructured. However, building a data lake can be a daunting task, requiring significant resources and expertise.&lt;/p&gt;
&lt;p&gt;For enterprises, this often means using SaaS solutions like &lt;a target=&quot;_blank&quot; href=&quot;https://www.snowflake.com&quot;&gt;Snowflake&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://dremio.com&quot;&gt;Dremio&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://www.databricks.com&quot;&gt;DataBricks&lt;/a&gt; or the like, or going all-in on the public cloud provider offerings from AWS, Azure and Google Cloud. But what if, as recent studies show, the data sizes aren&apos;t as big as commonly thought? Is it really necessary to spend so much money on usage and infrastructure?&lt;/p&gt;
&lt;p&gt;In this blog post, we&apos;ll walk you through the steps to create a &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;cost-effective&lt;/strong&gt; data lake on AWS. Whether you&apos;re a startup, a small business, or a large enterprise, this guide will help you unlock the power of big data without breaking the bank (also see the excellent &lt;a target=&quot;_blank&quot; href=&quot;https://motherduck.com/blog/big-data-is-dead/&quot;&gt;&quot;Big data is dead&quot;&lt;/a&gt; blog post by Jordan Tigani).&lt;/p&gt;
&lt;h1 id=&quot;heading-modern-data-lake-basics&quot;&gt;Modern Data Lake basics&lt;/h1&gt;
&lt;p&gt;The definition of what a Data Lake is probably differs slightly depending on whom you ask (see &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/&quot;&gt;AWS&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://cloud.google.com/learn/what-is-a-data-lake&quot;&gt;Google Cloud&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://azure.microsoft.com/en-us/solutions/data-lake/&quot;&gt;Azure&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://www.databricks.com/discover/data-lakes/introduction&quot;&gt;DataBricks&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://www.ibm.com/topics/data-lake&quot;&gt;IBM&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://en.wikipedia.org/wiki/Data_lake&quot;&gt;Wikipedia&lt;/a&gt;). What is common to all these definitions and explanations is that a data lake consists of different layers, such as ingestion, storage, processing and consumption. There can be several other layers as well, like cataloging and search, as well as a security and governance layer.&lt;/p&gt;
&lt;p&gt;This is outlined in the excellent AWS article &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/&quot;&gt;&quot;AWS serverless data analytics pipeline reference architecture&quot;&lt;/a&gt;, which shall be the basis for this blog post:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/02/19/data-analytics-update-1-final.jpg&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-separation-of-storage-amp-compute&quot;&gt;Separation of storage &amp;amp; compute&lt;/h2&gt;
&lt;p&gt;Modern data lakes have revolutionized the way organizations handle big data. A data lake is a central repository that allows organizations to store all types of data, both structured and unstructured, at any scale. The flexibility and scalability of data lakes enable organizations to perform advanced analytics and gain insights that can drive business decisions. One of the key architectural patterns that modern data lakes follow is the separation of storage and compute.&lt;/p&gt;
&lt;p&gt;Traditionally, data storage and processing were tightly coupled in data warehouses. However, in modern data lakes, data is stored in a separate layer from the computational layer that processes it. Data storage is handled by a data storage layer, while data processing is done by a compute layer. This approach allows organizations to scale storage and compute independently, enabling them to process vast amounts of data without incurring significant costs.&lt;/p&gt;
&lt;p&gt;This has several advantages, which include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Scalability: It allows organizations to scale each layer independently. The storage layer can be scaled up or down depending on the amount of data being stored, while the compute layer can be scaled up or down depending on the processing requirements.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost Savings: Decoupling storage and compute can significantly reduce costs. In traditional data warehouses, organizations must provision sufficient storage and processing power to handle peak loads. This results in underutilized resources during periods of low demand, leading to the wastage of resources and increased costs. In modern data lakes, organizations can store data cheaply and only provision the necessary compute resources when required, leading to significant cost savings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility: Organizations can use a range of storage options, including object storage, file storage, and block storage, to store their data. This flexibility allows organizations to choose the most appropriate storage option for their data, depending on factors such as cost, performance, and durability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance: In traditional data warehouses, data is moved from storage to processing, which can be slow and time-consuming, leading to performance issues. In modern data lakes, data is stored in a central repository, and processing is done where the data resides. This approach eliminates the need for data movement, leading to faster processing and improved performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;heading-optimized-file-formats&quot;&gt;Optimized file formats&lt;/h2&gt;
&lt;p&gt;As an example, Parquet is an open-source columnar storage format for data lakes that is widely used in modern data lakes. Parquet stores data in columns rather than rows, which enables it to perform selective queries faster and more efficiently than traditional row-based storage formats.&lt;/p&gt;
&lt;p&gt;Additionally, Parquet supports compression, which reduces storage requirements and improves data processing performance. It&apos;s supported by many big data processing engines, including Apache Hadoop, Apache Spark, Apache Drill and many services of public cloud providers, such as Amazon Athena and AWS Glue.&lt;/p&gt;
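&lt;p&gt;As a minimal sketch of this (the file names are hypothetical, and any engine with Parquet support would work similarly), DuckDB can convert CSV data to compressed Parquet and then read back only the columns a query actually references:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Convert a hypothetical CSV file to a Snappy-compressed Parquet file
COPY (SELECT * FROM read_csv_auto(&apos;events.csv&apos;))
  TO &apos;events.parquet&apos; (FORMAT &apos;parquet&apos;, COMPRESSION &apos;SNAPPY&apos;);

-- A columnar engine only needs to scan the two referenced columns,
-- instead of reading every row in full
SELECT event_date, count(*) AS cnt
FROM read_parquet(&apos;events.parquet&apos;)
GROUP BY event_date;
&lt;/code&gt;&lt;/pre&gt;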
&lt;h2 id=&quot;heading-hive-partitioning-amp-query-filter-pushdown&quot;&gt;Hive partitioning &amp;amp; query filter pushdown&lt;/h2&gt;
&lt;p&gt;The so-called &quot;Hive partitioning&quot; is a technique used in data lakes that involves dividing data into smaller, more manageable parts, called partitions, based on specific criteria such as date, time, or location.&lt;/p&gt;
&lt;p&gt;Partitioning can help improve query performance and reduce data processing time by allowing users to select only the relevant partitions, rather than scanning the entire dataset.&lt;/p&gt;
&lt;p&gt;Query filter pushdown is another optimization technique used in Apache Hive and other services that involves pushing down query filters into the storage layer, allowing it to eliminate irrelevant data before processing the query.&lt;/p&gt;
&lt;p&gt;Combining Hive partitioning and query filter pushdown can result in significant performance gains in data processing, as the query filters can eliminate large amounts of irrelevant data at the partition level, reducing the amount of data that needs to be processed. Therefore, Hive partitioning and query filter pushdown are essential techniques for optimizing data processing performance in data lakes.&lt;/p&gt;
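&lt;p&gt;To make this a bit more concrete, here is a sketch of how a Hive-partitioned layout could be queried with DuckDB (the bucket name, partition keys and S3 setup are assumptions for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Hypothetical Hive-partitioned layout:
--   s3://my-bucket/incoming/event_date=2023-04-21/event_type=pageview/part-0001.parquet
--   s3://my-bucket/incoming/event_date=2023-04-22/event_type=pageview/part-0002.parquet

-- Assumes the DuckDB httpfs extension and S3 credentials are configured.
-- The partition columns are derived from the paths, so the WHERE clause
-- can be pushed down and non-matching partitions are skipped entirely.
SELECT count(*) AS pageviews
FROM read_parquet(&apos;s3://my-bucket/incoming/*/*/*.parquet&apos;, hive_partitioning=1)
WHERE event_date = &apos;2023-04-21&apos;
  AND event_type = &apos;pageview&apos;;
&lt;/code&gt;&lt;/pre&gt;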
&lt;h2 id=&quot;heading-repartitioning-of-data&quot;&gt;Repartitioning of data&lt;/h2&gt;
&lt;p&gt;Repartitioning Parquet data in data lakes is a useful technique that involves redistributing data across partitions based on specific criteria. This technique can help optimize query performance and reduce data shuffling during big data processing.&lt;/p&gt;
&lt;p&gt;For instance, if a large amount of data is stored in a single partition, querying that data may take longer than if the data were spread across several partitions. Or, you could write aggregation queries whose output contains much less data, which could improve query speeds significantly.&lt;/p&gt;
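&lt;p&gt;As a sketch under the same assumptions (hypothetical paths and partition keys), DuckDB&apos;s partitioned write feature can be used to rewrite existing Parquet data into a new partitioning scheme:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Read unpartitioned (or differently partitioned) Parquet data and
-- rewrite it, partitioned by domain_name and event_date
COPY (
  SELECT * FROM read_parquet(&apos;s3://my-bucket/incoming/*.parquet&apos;)
)
TO &apos;s3://my-bucket/repartitioned&apos;
(FORMAT PARQUET, PARTITION_BY (domain_name, event_date));
&lt;/code&gt;&lt;/pre&gt;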
&lt;h1 id=&quot;heading-the-use-case&quot;&gt;The use case&lt;/h1&gt;
&lt;p&gt;Data privacy and GDPR have been much-discussed topics in recent years. A lot of existing web tracking solutions were deemed non-compliant, especially in the EU. Thus, individuals and companies eventually had to change their Web Analytics providers, which led to a rise of new, privacy-focused companies in this space (e.g. &lt;a target=&quot;_blank&quot; href=&quot;https://usefathom.com/&quot;&gt;Fathom Analytics&lt;/a&gt;, &lt;a target=&quot;_blank&quot; href=&quot;https://www.simpleanalytics.com&quot;&gt;SimpleAnalytics&lt;/a&gt;, and &lt;a target=&quot;_blank&quot; href=&quot;https://www.plausible.io&quot;&gt;Plausible&lt;/a&gt;, just to name a few).&lt;/p&gt;
&lt;p&gt;The pricing of those providers can get relatively steep quite fast if you have a higher number of pageviews ($74/mo for 2m at Fathom, €99/mo for 1m at SimpleAnalytics, €89/mo for 2m at Plausible). Also, if you&apos;re using a provider, you normally don&apos;t own your data.&lt;/p&gt;
&lt;p&gt;So, &lt;strong&gt;let&apos;s try to build a web tracking and analytics service on AWS for the cheap, while owning our data, adhering to data privacy laws and using scalable serverless cloud services to avoid having to manage infrastructure by ourselves.&lt;/strong&gt; And have some fun and learn a bit while doing it :-)&lt;/p&gt;
&lt;h1 id=&quot;heading-high-level-architecture&quot;&gt;High-level architecture&lt;/h1&gt;
&lt;p&gt;The overall architecture for the outlined use case looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1681729029658/08a0a6e0-b1d1-4001-896c-1ffd88b98205.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The details will be described for each layer in the coming paragraphs. For brevity, the focus lies on the main data processing layers. Other layers, such as cataloging, consumption and security &amp;amp; governance, will be covered in upcoming blog posts.&lt;/p&gt;
&lt;h1 id=&quot;heading-serving-layer&quot;&gt;Serving layer&lt;/h1&gt;
&lt;p&gt;The serving layer is not a part of the data lake. Its main goal is to serve static assets, such as the tracking JavaScript libraries (those will be covered in more detail in another part of this series), and the 1x1 pixel GIF files that are used as endpoints that the tracking library can push its gathered data to. This is done by sending the JSON payload as URL-encoded query strings.&lt;/p&gt;
&lt;p&gt;In our use case, we want to leverage existing AWS services and optimize our costs, while providing great response times. From an architectural perspective, there are many ways we could set up this data-gathering endpoint. &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/cloudfront/&quot;&gt;Amazon CloudFront&lt;/a&gt; is a CDN that currently has &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/cloudfront/features/?whats-new-cloudfront.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cloudfront.sort-order=desc&quot;&gt;over 90 edge locations worldwide&lt;/a&gt;, thus providing better latencies than classical webservers or APIs that are deployed in one or more regions.&lt;/p&gt;
&lt;p&gt;It also has a &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/cloudfront/pricing/&quot;&gt;very generous free tier&lt;/a&gt; (1TB of outgoing traffic and 10M requests), and its &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html&quot;&gt;real-time logs feature&lt;/a&gt; offers a very cost-effective way ($0.01 for every 1M log lines) to set up such an endpoint by just storing a 1x1px GIF with appropriate caching headers, to which the &lt;strong&gt;JavaScript tracking library will send its payload as an encoded query string&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;CloudFront can use S3 as a so-called origin (where the assets will be loaded from if they aren&apos;t yet in the edge caches), and that&apos;s where the static asset data will be located. Between the CloudFront distribution and the S3 bucket, an &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-s3.html&quot;&gt;Origin Access Identity&lt;/a&gt; will be created, which enables secure communication between both services, so that the S3 bucket doesn&apos;t need to be publicly accessible.&lt;/p&gt;
&lt;p&gt;To configure CloudFront real-time logs that contain the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html#understand-real-time-log-config-fields&quot;&gt;necessary information&lt;/a&gt;, a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cloudfront-realtimelogconfig.html&quot;&gt;RealtimeLogConfig&lt;/a&gt; needs to be &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats/blob/main/backend/resources/cf-distribution.yml#L8-L33&quot;&gt;created&lt;/a&gt;. This acts as &quot;glue&quot; between the CloudFront distribution and the Kinesis Data Stream that consumes the logs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;CFRealtimeLogsConfig:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;AWS::CloudFront::RealtimeLogConfig&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Properties:&lt;/span&gt; 
    &lt;span class=&quot;hljs-attr&quot;&gt;EndPoints:&lt;/span&gt; 
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;StreamType:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Kinesis&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;KinesisStreamConfig:&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;RoleArn:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisDataRole.Arn&apos;&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;StreamArn:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisStream.Arn&apos;&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Fields:&lt;/span&gt; 
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;timestamp&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;c-ip&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;sc-status&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cs-uri-stem&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cs-bytes&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;x-edge-location&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;time-taken&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cs-user-agent&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cs-referer&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;cs-uri-query&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;x-edge-result-type&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;asn&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Name:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:service}-cdn-realtime-log-config&apos;&lt;/span&gt;
    &lt;span class=&quot;hljs-comment&quot;&gt;# IMPORTANT: This setting makes sure we receive all the log lines, otherwise they are just sampled!&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;SamplingRate:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-ingestion-layer&quot;&gt;Ingestion layer&lt;/h1&gt;
&lt;p&gt;The ingestion layer mainly consists of two services: A &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/kinesis/data-streams/&quot;&gt;Kinesis Data Stream&lt;/a&gt;, which is the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html#real-time-log-consumer-guidance&quot;&gt;consumer&lt;/a&gt; of the real-time logs feature of CloudFront, and a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html&quot;&gt;Kinesis Data Firehose Delivery Stream&lt;/a&gt;, which will back up the raw data in S3, and also store the data as partitioned parquet files in another S3 bucket. Both S3 buckets are part of the storage layer.&lt;/p&gt;
&lt;p&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/kinesis/data-streams/pricing/?nc=sn&amp;amp;loc=3&quot;&gt;Kinesis Data Stream&lt;/a&gt; (one shard in provisioned mode) provides an ingest capacity of 1 MB/second or 1,000 records/second, for a price of $0.015/hour in us-east-1, and $0.014 per 1M PUT payload units. It forwards the incoming data to the Kinesis Data Firehose Delivery Stream, whose &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/kinesis/data-firehose/pricing/?nc=sn&amp;amp;loc=3&quot;&gt;pricing&lt;/a&gt; is more complex. The ingestion costs $0.029/GB, the format conversion $0.018/GB, and the dynamic partitioning $0.02/GB. That sums up to $0.067/GB ingested and written to S3, plus the S3 costs of $0.005/1k PUT object calls.&lt;/p&gt;
&lt;p&gt;The Kinesis Data Firehose Delivery Stream uses &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html&quot;&gt;data transformation&lt;/a&gt; and &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html&quot;&gt;dynamic partitioning&lt;/a&gt; with a Lambda function, which cleans, transforms and enriches the data so that it can be stored in S3 as parquet files with appropriate Hive partitions.&lt;/p&gt;
&lt;p&gt;The Delivery Stream has so-called &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-bufferinghints.html&quot;&gt;BufferingHints&lt;/a&gt;, which define either at which size (from 1 to 128MB) or in which interval (between 60 and 900 seconds) the data is flushed to S3. The buffering interval defines the minimum latency at which the data gets persisted in the data lake. The Lambda function is part of the processing layer and is discussed below.&lt;/p&gt;
&lt;p&gt;The CloudFormation &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats/blob/main/backend/resources/kinesis.yml#L13-L100&quot;&gt;resource definition&lt;/a&gt; for the Kinesis Data Firehose Delivery Stream can be found below. It sources its variables from the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats/blob/main/backend/serverless.yml#L34-L42&quot;&gt;serverless.yml&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;AnalyticsKinesisFirehose:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AWS::KinesisFirehose::DeliveryStream&apos;&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Properties:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;DeliveryStreamName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.kinesis.delivery.name}&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;DeliveryStreamType:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;KinesisStreamAsSource&lt;/span&gt;
    &lt;span class=&quot;hljs-comment&quot;&gt;# Source configuration&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;KinesisStreamSourceConfiguration:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;KinesisStreamARN:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisStream.Arn&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;RoleARN:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisFirehoseRole.Arn&apos;&lt;/span&gt;
    &lt;span class=&quot;hljs-comment&quot;&gt;# Necessary configuration to transform and write data to S3 as parquet files&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;ExtendedS3DestinationConfiguration:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;BucketARN:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;CleanedBucket.Arn&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;BufferingHints:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;IntervalInSeconds:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.kinesis.delivery.limits.intervalInSeconds}&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;SizeInMBs:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.kinesis.delivery.limits.sizeInMB}&lt;/span&gt;
      &lt;span class=&quot;hljs-comment&quot;&gt;# This enables logging to CloudWatch for better debugging possibilities&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;CloudWatchLoggingOptions:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Enabled:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;True&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;LogGroupName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.logs.groupName}&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;LogStreamName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.logs.streamName}&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;DataFormatConversionConfiguration:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Enabled:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;True&lt;/span&gt;
        &lt;span class=&quot;hljs-comment&quot;&gt;# Define the input format&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;InputFormatConfiguration:&lt;/span&gt; 
          &lt;span class=&quot;hljs-attr&quot;&gt;Deserializer:&lt;/span&gt; 
            &lt;span class=&quot;hljs-attr&quot;&gt;OpenXJsonSerDe:&lt;/span&gt; 
              &lt;span class=&quot;hljs-attr&quot;&gt;CaseInsensitive:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;True&lt;/span&gt;
        &lt;span class=&quot;hljs-comment&quot;&gt;# Define the output format&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;OutputFormatConfiguration:&lt;/span&gt; 
          &lt;span class=&quot;hljs-attr&quot;&gt;Serializer:&lt;/span&gt; 
            &lt;span class=&quot;hljs-attr&quot;&gt;ParquetSerDe:&lt;/span&gt; 
              &lt;span class=&quot;hljs-attr&quot;&gt;Compression:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;SNAPPY&lt;/span&gt;
              &lt;span class=&quot;hljs-attr&quot;&gt;WriterVersion:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;V1&lt;/span&gt;
        &lt;span class=&quot;hljs-comment&quot;&gt;# The schema configuration based on Glue tables&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;SchemaConfiguration:&lt;/span&gt; 
          &lt;span class=&quot;hljs-attr&quot;&gt;RoleArn:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisFirehoseRole.Arn&apos;&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;DatabaseName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.glue.database}&apos;&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;TableName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;incoming_events&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-comment&quot;&gt;# Enable dynamic partitioning&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;DynamicPartitioningConfiguration:&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;Enabled:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;True&lt;/span&gt;
      &lt;span class=&quot;hljs-comment&quot;&gt;# Enable Lambda function for pre-processing the Kinesis records&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;ProcessingConfiguration:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Enabled:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;True&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Processors:&lt;/span&gt; 
          &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Lambda&lt;/span&gt;
            &lt;span class=&quot;hljs-attr&quot;&gt;Parameters:&lt;/span&gt; 
              &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;ParameterName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;NumberOfRetries&lt;/span&gt;
                &lt;span class=&quot;hljs-attr&quot;&gt;ParameterValue:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;
              &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;ParameterName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;BufferIntervalInSeconds&lt;/span&gt;
                &lt;span class=&quot;hljs-attr&quot;&gt;ParameterValue:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;60&lt;/span&gt;
              &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;ParameterName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;BufferSizeInMBs&lt;/span&gt;
                &lt;span class=&quot;hljs-attr&quot;&gt;ParameterValue:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;
              &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;ParameterName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;LambdaArn&lt;/span&gt;
                &lt;span class=&quot;hljs-attr&quot;&gt;ParameterValue:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;ProcessKinesisRecordsLambdaFunction.Arn&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-comment&quot;&gt;# Enable backups for the raw incoming data&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;S3BackupMode:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Enabled&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;S3BackupConfiguration:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;BucketARN:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;RawBucket.Arn&apos;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;BufferingHints:&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;IntervalInSeconds:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.kinesis.delivery.limits.intervalInSeconds}&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;SizeInMBs:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.kinesis.delivery.limits.sizeInMB}&lt;/span&gt;
        &lt;span class=&quot;hljs-comment&quot;&gt;# Disable logging to CloudWatch for raw data&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;CloudWatchLoggingOptions:&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;Enabled:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;false&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;CompressionFormat:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;GZIP&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Prefix:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.prefixes.raw}&apos;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;ErrorOutputPrefix:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.prefixes.error}&apos;&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;RoleARN:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisFirehoseRole.Arn&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;RoleARN:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!GetAtt&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;AnalyticsKinesisFirehoseRole.Arn&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-comment&quot;&gt;# Define output S3 prefixes&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;Prefix:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.prefixes.incoming}/domain_name=!{partitionKeyFromLambda:domain_name}/event_type=!{partitionKeyFromLambda:event_type}/event_date=!{partitionKeyFromLambda:event_date}/&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;ErrorOutputPrefix:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.prefixes.error}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h1 id=&quot;heading-processing-layer&quot;&gt;Processing layer&lt;/h1&gt;
&lt;p&gt;The processing layer consists of two parts, the Lambda function that is used for the dynamic partitioning of the incoming data, and a Lambda function that uses the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/data/partitioning/partitioned_writes.html&quot;&gt;COPY TO PARTITION BY&lt;/a&gt; feature of DuckDB to aggregate and repartition the ingested, enriched and stored page views data.&lt;/p&gt;
&lt;h2 id=&quot;heading-data-transformation-amp-dynamic-partitioning-lambda&quot;&gt;Data transformation &amp;amp; Dynamic partitioning Lambda&lt;/h2&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html&quot;&gt;Data transformation&lt;/a&gt; is a Kinesis Data Firehose Delivery Stream feature that enables the cleaning, transformation and enrichment of incoming records in a batched manner. In combination with the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html#dynamic-partitioning-partitioning-keys&quot;&gt;dynamic partitioning feature&lt;/a&gt;, this provides powerful data handling capabilities with the data still being &quot;on stream&quot;. When writing data to S3 as parquet files, a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-schemaconfiguration.html&quot;&gt;schema configuration in the form of a Glue Table&lt;/a&gt; needs to be defined as well to make it work (see &quot;Cataloging &amp;amp; search layer&quot; below).&lt;/p&gt;
&lt;p&gt;It&apos;s necessary to define some &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-processorparameter.html&quot;&gt;buffer configuration&lt;/a&gt; for the Lambda function, meaning that you need to specify the time interval of 60 seconds (this will add a max delay of one minute to the stream data), the size in MB (between 0.2 and 3), and the number of retries (3 is the only usable default).&lt;/p&gt;
&lt;p&gt;The input records coming from the Kinesis Data Firehose Delivery Stream are base64-encoded strings that contain the loglines coming from the CloudFront distribution:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;MTY4MjA4NDI0MS40NjlcdDIwMDM6ZTE6YmYxZjo3YzAwOjhlYjoxOGY4OmExZmI6OWRhZFx0MzA0XHQvaGVsbG8uZ2lmP3Q9cHYmdHM9MTY4MjA4MzgwNDc2OCZ1PWh0dHBzJTI1M0ElMjUyRiUyNTJGbXlkb21haW4udGxkJTI1MkYmaG49bXlkb21haW4udGxkJnBhPSUyNTJGJnVhPU1vemlsbGElMjUyRjUuMCUyNTIwKE1hY2ludG9zaCUyNTNCJTI1MjBJbnRlbCUyNTIwTWFjJTI1MjBPUyUyNTIwWCUyNTIwMTBfMTVfNyklMjUyMEFwcGxlV2ViS2l0JTI1MkY1MzcuMzYlMjUyMChLSFRNTCUyNTJDJTI1MjBsaWtlJTI1MjBHZWNrbyklMjUyMENocm9tZSUyNTJGMTEyLjAuMC4wJTI1MjBTYWZhcmklMjUyRjUzNy4zNiZpdz0xMjkyJmloPTkyNiZ0aT1NeSUyNTIwRG9tYWluJnc9MzQ0MCZoPTE0NDAmZD0yNCZsPWRlLURFJnA9TWFjSW50ZWwmbT04JmM9OCZ0ej1FdXJvcGUlMjUyRkJlcmxpblx0Nzg5XHRIQU01MC1QMlx0MC4wMDFcdE1vemlsbGEvNS4wJTIwKE1hY2ludG9zaDslMjBJbnRlbCUyME1hYyUyME9TJTIwWCUyMDEwXzE1XzcpJTIwQXBwbGVXZWJLaXQvNTM3LjM2JTIwKEtIVE1MLCUyMGxpa2UlMjBHZWNrbyklMjBDaHJvbWUvMTEyLjAuMC4wJTIwU2FmYXJpLzUzNy4zNlx0LVx0dD1wdiZ0cz0xNjgyMDgzODA0NzY4JnU9aHR0cHMlMjUzQSUyNTJGJTI1MkZteWRvbWFpbi50bGQlMjUyRiZobj1teWRvbWFpbi50bGQmcGE9JTI1MkYmdWE9TW96aWxsYSUyNTJGNS4wJTI1MjAoTWFjaW50b3NoJTI1M0IlMjUyMEludGVsJTI1MjBNYWMlMjUyME9TJTI1MjBYJTI1MjAxMF8xNV83KSUyNTIwQXBwbGVXZWJLaXQlMjUyRjUzNy4zNiUyNTIwKEtIVE1MJTI1MkMlMjUyMGxpa2UlMjUyMEdlY2tvKSUyNTIwQ2hyb21lJTI1MkYxMTIuMC4wLjAlMjUyMFNhZmFyaSUyNTJGNTM3LjM2Jml3PTEyOTImaWg9OTI2JnRpPU15JTI1MjBEb21haW4mdz0zNDQwJmg9MTQ0MCZkPTI0Jmw9ZGUtREUmcD1NYWNJbnRlbCZtPTgmYz04JnR6PUV1cm9wZSUyNTJGQmVybGluXHRIaXRcdDMzMjBcbg==
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After decoding, the logline becomes visible and contains the info from the real-time log fields, which are tab-separated and terminated by a newline:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;1682084241.469\t2003:e1:bf1f:7c00:8eb:18f8:a1fb:9dad\t304\t/hello.gif?t=pv&amp;amp;ts=1682083804768&amp;amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;amp;hn=mydomain.tld&amp;amp;pa=%252F&amp;amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;amp;iw=1292&amp;amp;ih=926&amp;amp;ti=My%2520Domain&amp;amp;w=3440&amp;amp;h=1440&amp;amp;d=24&amp;amp;l=de-DE&amp;amp;p=MacIntel&amp;amp;m=8&amp;amp;c=8&amp;amp;tz=Europe%252FBerlin\t789\tHAM50-P2\t0.001\tMozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_15_7)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/112.0.0.0%20Safari/537.36\t-\tt=pv&amp;amp;ts=1682083804768&amp;amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;amp;hn=mydomain.tld&amp;amp;pa=%252F&amp;amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;amp;iw=1292&amp;amp;ih=926&amp;amp;ti=My%2520Domain&amp;amp;w=3440&amp;amp;h=1440&amp;amp;d=24&amp;amp;l=de-DE&amp;amp;p=MacIntel&amp;amp;m=8&amp;amp;c=8&amp;amp;tz=Europe%252FBerlin\tHit\t3320\n
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During transformation and enrichment, the following steps are performed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Validating the source record&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enriching the &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/ua-parser-js&quot;&gt;browser and device data&lt;/a&gt; from the user agent string&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Determining whether the record was generated by a &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/isbot&quot;&gt;bot&lt;/a&gt; (by user agent string)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding the nearest geographical information based on &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/aws-edge-locations&quot;&gt;edge locations&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computing the referer&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deriving the requested URI&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computing UTM data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Getting the event type (either a page view or a tracking event)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building the time hierarchy (year, month, day, event timestamp)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Computing data arrival delays (data/process metrics)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generating hashes for page view, daily page view and daily visitor ids (later used to calculate page views and visits)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding metadata with the partition key values (in our case, the partition keys are &lt;strong&gt;domain_name&lt;/strong&gt;, &lt;strong&gt;event_date&lt;/strong&gt;, and &lt;strong&gt;event_type&lt;/strong&gt;), to be able to use the dynamic partitioning feature&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The generated JSON looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;{
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;result&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Ok&quot;&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;error&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;data&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_year&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;2023&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_month&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;4&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_day&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;21&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_timestamp&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2023-04-21T13:30:04.768Z&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;arrival_timestamp&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2023-04-21T13:37:21.000Z&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;arrival_delay_ms&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;-436232&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_city&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Hamburg&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_state&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_country&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Germany&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_country_code&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;DE&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_latitude&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;53.630401611328&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_longitude&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;9.9882297515869&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;edge_id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;HAM&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;referer&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;referer_domain_name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Direct / None&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;browser_name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Chrome&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;browser_version&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;112.0.0.0&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;browser_os_name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Mac OS&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;browser_os_version&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;10.15.7&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;browser_timezone&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Europe/Berlin&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;browser_language&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;de-DE&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_type&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Desktop&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_vendor&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Apple&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_outer_resolution&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;3440x1440&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_inner_resolution&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;1292x926&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_color_depth&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;24&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_platform&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;MacIntel&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_memory&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;8&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;device_cores&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;8&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;utm_source&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;utm_campaign&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;utm_medium&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;utm_content&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;utm_term&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_url&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;https://mydomain.tld/&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_path&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;/&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_query_string&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;t=pv&amp;amp;ts=1682083804768&amp;amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;amp;hn=mydomain.tld&amp;amp;pa=%252F&amp;amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;amp;iw=1292&amp;amp;ih=926&amp;amp;ti=My%2520Domain&amp;amp;w=3440&amp;amp;h=1440&amp;amp;d=24&amp;amp;l=de-DE&amp;amp;p=MacIntel&amp;amp;m=8&amp;amp;c=8&amp;amp;tz=Europe%252FBerlin\t789\tHAM50-P2\t0.001\tMozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_15_7)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/112.0.0.0%20Safari/537.36\t-\tt=pv&amp;amp;ts=1682083804768&amp;amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;amp;hn=mydomain.tld&amp;amp;pa=%252F&amp;amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;amp;iw=1292&amp;amp;ih=926&amp;amp;ti=My%2520Domain&amp;amp;w=3440&amp;amp;h=1440&amp;amp;d=24&amp;amp;l=de-DE&amp;amp;p=MacIntel&amp;amp;m=8&amp;amp;c=8&amp;amp;tz=Europe%252FBerlin&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_bytes&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;789&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_status_code&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;304&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_cache_status&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;Hit&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_delivery_time_ms&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_asn&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;3320&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;request_is_bot&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;0&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_data&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-literal&quot;&gt;null&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;page_view_id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;f4e1939bc259131659b00cd5f73e55a5bed04fbfa63f095b561fd87009d0a228&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;daily_page_view_id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;7c82d13036aa2cfe04720e0388bb8645eb90de084bd50cf69356fa8ec9d8b407&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;daily_visitor_id&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;9f0ac3a2560cfa6d5c3494e1891d284225e15f088414390a40fece320021a658&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;domain_name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;mydomain.tld&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_date&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2023-04-21&quot;&lt;/span&gt;,
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_type&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;pageview&quot;&lt;/span&gt;
  },
  &lt;span class=&quot;hljs-attr&quot;&gt;&quot;metadata&quot;&lt;/span&gt;: {
    &lt;span class=&quot;hljs-attr&quot;&gt;&quot;partitionKeys&quot;&lt;/span&gt;: {
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;domain_name&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;mydomain.tld&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_date&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;2023-04-21&quot;&lt;/span&gt;,
      &lt;span class=&quot;hljs-attr&quot;&gt;&quot;event_type&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-string&quot;&gt;&quot;pageview&quot;&lt;/span&gt;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, the Lambda function performs the following steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Encode the JSON stringified records in base64 again&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return them to the Kinesis Data Firehose Delivery Stream, which will then persist the data based on the defined &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html#dynamic-partitioning-namespaces&quot;&gt;prefix&lt;/a&gt; in the S3 bucket for incoming data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-aggregation-lambda&quot;&gt;Aggregation Lambda&lt;/h2&gt;
&lt;p&gt;As the ingested data contains information at the single-request level, it makes sense to aggregate it so that queries can run optimally and query response times are reduced.&lt;/p&gt;
&lt;p&gt;The aggregation Lambda function is based on &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner&quot;&gt;tobilg/serverless-parquet-repartitioner&lt;/a&gt;, which also has an &lt;a target=&quot;_blank&quot; href=&quot;https://tobilg.com/using-duckdb-to-repartition-parquet-data-in-s3&quot;&gt;accompanying blog post&lt;/a&gt; that explains in more detail how the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-nodejs-layer&quot;&gt;DuckDB Lambda Layer&lt;/a&gt; can be used to repartition or aggregate existing data in S3.&lt;/p&gt;
&lt;p&gt;The Lambda function is scheduled to run each night at 00:30 AM, which ensures that all the Kinesis Firehose Delivery Stream output files of the last day have been written to S3 (because the maximum buffer time is 15 minutes).&lt;/p&gt;
&lt;p&gt;When it runs, it does three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a session aggregation that derives the session information and whether individual requests were bounces or not&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the pageviews and visitor numbers, broken down by several dimensions which are later needed for querying (see &lt;code&gt;stats&lt;/code&gt; table below)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store the extracted event data separately, repartitioned by &lt;code&gt;event_name&lt;/code&gt; to speed up queries&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats/blob/main/backend/functions/utils/queryRenderer.ts&quot;&gt;queries&lt;/a&gt; can be inspected in the accompanying repository to get an idea about the sophisticated query patterns DuckDB supports.&lt;/p&gt;
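&lt;p&gt;To give an impression of the idea without reproducing the real queries, a heavily simplified daily stats aggregation could look like the following sketch (the column names follow the example payload above, while the bucket names and chosen dimensions are illustrative assumptions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;-- Simplified sketch: aggregate one day of cleaned pageview data into the curated zone.
-- Assumes the DuckDB httpfs extension and S3 credentials are configured.
COPY (
  SELECT
    domain_name,
    event_date,
    referer_domain_name,
    browser_name,
    count(DISTINCT page_view_id)     AS page_views,
    count(DISTINCT daily_visitor_id) AS visitors
  FROM read_parquet(&apos;s3://cleaned-bucket/incoming/*/*/*/*.parquet&apos;, hive_partitioning=1)
  WHERE event_type = &apos;pageview&apos;
    AND event_date = &apos;2023-04-21&apos;
  GROUP BY 1, 2, 3, 4
)
TO &apos;s3://curated-bucket/stats&apos;
(FORMAT PARQUET, PARTITION_BY (domain_name, event_date));
&lt;/code&gt;&lt;/pre&gt;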
&lt;h1 id=&quot;heading-storage-layer&quot;&gt;Storage layer&lt;/h1&gt;
&lt;p&gt;The storage layer consists of three S3 buckets, where each conforms to a zone outlined in the reference architecture diagram (see above):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;raw&lt;/strong&gt; bucket, where the raw incoming data to the Kinesis Firehose Delivery Stream is backed up to (partitioned by &lt;code&gt;event_date&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;cleaned&lt;/strong&gt; bucket, where the data is stored by the Kinesis Firehose Delivery Stream (partitioned by &lt;code&gt;domain_name&lt;/code&gt;, &lt;code&gt;event_date&lt;/code&gt; and &lt;code&gt;event_type&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;curated&lt;/strong&gt; bucket, where the aggregated pageviews and visitors data are stored (partitioned by &lt;code&gt;domain_name&lt;/code&gt; and &lt;code&gt;event_date&lt;/code&gt;), as well as the aggregated and filtered events (partitioned by &lt;code&gt;domain_name&lt;/code&gt;, &lt;code&gt;event_date&lt;/code&gt; and &lt;code&gt;event_name&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;heading-cataloging-amp-search-layer&quot;&gt;Cataloging &amp;amp; search layer&lt;/h1&gt;
&lt;p&gt;The Kinesis Data Firehose Delivery Stream needs a &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/glue/latest/dg/tables-described.html&quot;&gt;Glue table&lt;/a&gt; that holds the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-schemaconfiguration.html&quot;&gt;schema&lt;/a&gt; of the parquet files to be able to produce them (&lt;code&gt;incoming_events&lt;/code&gt; table). The &lt;code&gt;stats&lt;/code&gt; and the &lt;code&gt;events&lt;/code&gt; tables are aggregated daily from the base &lt;code&gt;incoming_events&lt;/code&gt; table via cron jobs scheduled by Amazon &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html&quot;&gt;EventBridge Rules&lt;/a&gt; at 00:30 AM.&lt;/p&gt;
&lt;h2 id=&quot;heading-incomingevents-table&quot;&gt;incoming_events table&lt;/h2&gt;
&lt;p&gt;This table stores the events that are the result of the data transformation and dynamic partitioning Lambda function. The schema for the table &lt;code&gt;incoming_events&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Column name&lt;/td&gt;&lt;td&gt;Data type&lt;/td&gt;&lt;td&gt;Is partition key?&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The domain name&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_date&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The date of the event (YYYY-MM-DD), as string&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_type&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The type of the event (&lt;code&gt;pageview&lt;/code&gt; or &lt;code&gt;track&lt;/code&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_year&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The year of the event_date (YYYY)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_month&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The month of the event (MM)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_day&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The day of the event (DD)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The exact event timestamp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arrival_timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The exact timestamp when the event arrived in the Kinesis Data Stream&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arrival_delay_ms&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The difference between event_timestamp and arrival_timestamp in milliseconds&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_city&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the edge city (all edge location info is derived from the &lt;code&gt;x-edge-location&lt;/code&gt; field in the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat&quot;&gt;logs&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_state&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The state of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_country&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The country of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_country_code&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The country code of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_latitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The latitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_longitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The longitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The original id of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;referrer&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The referrer&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;referrer_domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The domain name of the referrer&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_version&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The version of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_os_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The OS name of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_os_version&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The OS version of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_timezone&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The timezone of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_language&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The language of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_type&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The device type&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_vendor&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The device vendor&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_outer_resolution&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The outer resolution of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_inner_resolution&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The inner resolution of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_color_depth&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The color depth of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_platform&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The platform of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_memory&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The memory of the device (in MB)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_cores&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The number of cores of the device&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_source&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies which site sent the traffic&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_campaign&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies a specific product promotion or strategic campaign&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_medium&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies what type of link was used, such as cost per click or email&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_content&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies what specifically was clicked to bring the user to the site&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_term&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies search terms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_url&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The full requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_path&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The path of the requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_query_string&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The query string of the requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_bytes&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The size of the request in bytes&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_status_code&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The HTTP status code of the request&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_cache_status&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The CloudFront cache status&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_delivery_time_ms&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The time in ms it took for CloudFront to complete the request&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_asn&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The &lt;a target=&quot;_blank&quot; href=&quot;https://www.arin.net/resources/guide/asn/&quot;&gt;ASN&lt;/a&gt; of the requestor&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_is_bot&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;If the request is &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/isbot&quot;&gt;categorized as a bot&lt;/a&gt;, the value will be &lt;code&gt;1&lt;/code&gt;, if not &lt;code&gt;0&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the event for tracking events&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_data&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The stringified event payload for tracking events&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;page_view_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The unique pageview id&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_page_view_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The unique daily pageview id&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_visitor_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The unique daily visitor id&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h2 id=&quot;heading-stats-table&quot;&gt;stats table&lt;/h2&gt;
&lt;p&gt;The pageviews and visitor aggregation table. Its schema looks like this:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Column name&lt;/td&gt;&lt;td&gt;Data type&lt;/td&gt;&lt;td&gt;Is partition key?&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The domain name&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_date&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The date of the event (YYYY-MM-DD), as string&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_hour&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The hour part of the event timestamp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_city&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the edge city (all edge location info is derived from the &lt;code&gt;x-edge-location&lt;/code&gt; field in the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat&quot;&gt;logs&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_country&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The country of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_latitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The latitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_longitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The longitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;referrer_domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The domain name of the referrer&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;browser_os_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The OS name of the browser&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_type&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The device type&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;device_vendor&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The device vendor&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_source&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies which site sent the traffic&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_campaign&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies a specific product promotion or strategic campaign&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_medium&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies what type of link was used, such as cost per click or email&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_content&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies what specifically was clicked to bring the user to the site&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;utm_term&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;Identifies search terms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_path&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The path of the requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;page_view_cnt&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The number of page views&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;visitor_cnt&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The number of daily visitors&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bounces_cnt&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The number of bounces (visited only one page)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;visit_duration_sec_avg&lt;/td&gt;&lt;td&gt;int&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The average duration of a visit (in seconds)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h2 id=&quot;heading-events-table&quot;&gt;events table&lt;/h2&gt;
&lt;p&gt;The schema for the table &lt;code&gt;events&lt;/code&gt; looks like this:&lt;/p&gt;
&lt;div class=&quot;hn-table&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;td&gt;Column name&lt;/td&gt;&lt;td&gt;Data type&lt;/td&gt;&lt;td&gt;Is partition key?&lt;/td&gt;&lt;td&gt;Description&lt;/td&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;domain_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The domain name&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_date&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The date of the event (YYYY-MM-DD), as string&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_name&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;yes&lt;/td&gt;&lt;td&gt;The name of the event for tracking events&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The exact event timestamp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_city&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The name of the edge city (all edge location info is derived from the &lt;code&gt;x-edge-location&lt;/code&gt; field in the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat&quot;&gt;logs&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_country&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The country of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_latitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The latitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;edge_longitude&lt;/td&gt;&lt;td&gt;float&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The longitude of the edge location&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;request_path&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The path of the requested URL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;page_view_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The unique pageview id&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_visitor_id&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The unique daily visitor id&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;event_data&lt;/td&gt;&lt;td&gt;string&lt;/td&gt;&lt;td&gt;no&lt;/td&gt;&lt;td&gt;The stringified event payload for tracking events&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;&lt;h1 id=&quot;heading-consumption-layer&quot;&gt;Consumption layer&lt;/h1&gt;
&lt;p&gt;The consumption layer will be part of another blog post in this series. Stay tuned! Until it&apos;s released, you can have a look at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-duckdb&quot;&gt;tobilg/serverless-duckdb&lt;/a&gt; to get an idea of how the data could potentially be queried in a serverless manner.&lt;/p&gt;
&lt;h1 id=&quot;heading-wrapping-up&quot;&gt;Wrapping up&lt;/h1&gt;
&lt;p&gt;This article first introduced some basic principles of modern data lakes. It then described how to build a serverless, near-realtime data pipeline on top of these principles, leveraging AWS services and &lt;a target=&quot;_blank&quot; href=&quot;https://www.duckdb.org&quot;&gt;DuckDB&lt;/a&gt;, using a web analytics application as an example.&lt;/p&gt;
&lt;p&gt;The example implementation of this article can be found on GitHub at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/ownstats/ownstats/tree/main/analytics-backend&quot;&gt;ownstats/ownstats&lt;/a&gt;. Feel free to open an issue in case something doesn&apos;t work as expected, or if you&apos;d like to add a feature request.&lt;/p&gt;
&lt;p&gt;The next posts in this series will be&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part II&lt;/strong&gt;: Building a lightweight JavaScript library for the gathering of web analytics data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part III&lt;/strong&gt;: Consuming the gathered web analytics data by building a serverless query layer&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Part IV&lt;/strong&gt;: Building a frontend for web analytics data&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>AWS</category><category>serverless</category><category>datalake</category><category>duckDB</category><category>SQL</category><author>Tobias Müller</author></item><item><title>Using DuckDB to repartition parquet data in S3</title><link>https://tobilg.com/posts/using-duckdb-to-repartition-parquet-data-in-s3/</link><guid isPermaLink="true">https://tobilg.com/posts/using-duckdb-to-repartition-parquet-data-in-s3/</guid><description>Since release v0.7.1, DuckDB has the ability to repartition data stored in S3 as parquet files by a simple SQL query, which enables some interesting use cases.
Why not use existing AWS services?
If your data lake lives in AWS, a natural choice for ET...</description><pubDate>Sun, 26 Feb 2023 17:45:04 GMT</pubDate><content:encoded>&lt;p&gt;Since release v0.7.1, DuckDB has the ability to repartition data stored in S3 as parquet files by a simple SQL query, which enables some interesting use cases.&lt;/p&gt;
&lt;h1 id=&quot;heading-why-not-use-existing-aws-services&quot;&gt;Why not use existing AWS services?&lt;/h1&gt;
&lt;p&gt;If your data lake lives in AWS, a natural choice for ETL pipelines would be existing AWS services such as Amazon Athena. Unfortunately, Athena has &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html&quot;&gt;pretty tight limits&lt;/a&gt; on the number of partitions that can be written by a single query (only up to 100). You would therefore need to create your own workaround logic to stay within these limits while still performing the operations you want. Additionally, Athena in some cases takes several hundred milliseconds to even start a query.&lt;/p&gt;
&lt;p&gt;This is where DuckDB comes into play because it (theoretically) supports an unlimited amount of Hive partitions, and offers very fast queries on partitioned parquet files.&lt;/p&gt;
&lt;h1 id=&quot;heading-use-case&quot;&gt;Use case&lt;/h1&gt;
&lt;p&gt;A common pattern to ingest streaming data and store it in S3 is to use &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html&quot;&gt;Kinesis Data Firehose Delivery Streams&lt;/a&gt;, which can write the incoming stream data as batched parquet files to S3. You can use &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html&quot;&gt;custom S3 prefixes&lt;/a&gt; with it when using Lambda processing functions, but by default, you can only partition the data by the timestamp (the timestamp the event reached the Kinesis Data Stream, not the event timestamp!).&lt;/p&gt;
&lt;p&gt;So, a few common use cases for data repartitioning could include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Repartitioning the written data for the real event timestamp if it&apos;s included in the incoming data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repartitioning the data for other query patterns, e.g. to support query filter pushdown and optimize query speeds and costs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregation of raw or preprocessed data, and storing them in an optimized manner to support analytical queries&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We want to achieve this without having to manage our own infrastructure or services built on top of it, which is why we want to be as serverless as possible.&lt;/p&gt;
&lt;h1 id=&quot;heading-solution&quot;&gt;Solution&lt;/h1&gt;
&lt;p&gt;As described before, Amazon Athena only has a partition limit of 100 partitions when writing data. DuckDB doesn&apos;t have this limitation, that&apos;s why we want to be able to use it for repartitioning.&lt;/p&gt;
&lt;p&gt;This requires that we can deploy DuckDB in a serverless manner. The choice is to run it in Lambda functions, which can be provisioned with up to 10GB of memory (meaning 6 vCPUs), and a maximum runtime of 900 seconds / 15 minutes. This should be enough for most repartitioning needs, because the throughput from/to S3 is pretty fast. Also, we want to be able to run our repartition queries on flexible schedules, which is why we&apos;ll use &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html&quot;&gt;EventBridge Rules with a schedule&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The project can be found at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner&quot;&gt;https://github.com/tobilg/serverless-parquet-repartitioner&lt;/a&gt;, and just needs to be configured and deployed.&lt;/p&gt;
&lt;h2 id=&quot;heading-architecture-overview&quot;&gt;Architecture overview&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1677692338278/13d3f454-6d20-4a3b-bc79-8344f54b1af7.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;heading-configuration&quot;&gt;Configuration&lt;/h2&gt;
&lt;h3 id=&quot;heading-mandatory-configuration-settings&quot;&gt;Mandatory configuration settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L18&quot;&gt;S3 bucket name&lt;/a&gt;: You need to specify the S3 bucket where the data that you want to repartition resides (e.g. &lt;code&gt;my-source-bucket&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L77&quot;&gt;Custom repartitioning query&lt;/a&gt;: You can write flexible repartitioning queries in the DuckDB syntax. Have a look at the examples in the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/httpfs&quot;&gt;httpfs extension docs&lt;/a&gt;. You &lt;strong&gt;need&lt;/strong&gt; to update this, as the template uses only example values!&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-optional-configuration-settings&quot;&gt;Optional configuration settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L79&quot;&gt;S3 region&lt;/a&gt;: The AWS region your S3 bucket is deployed to (if different from the region the Lambda function is deployed to)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L84&quot;&gt;The schedule&lt;/a&gt;: The actual schedule on which the Lambda function is run. Have a look at the &lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com/framework/docs/providers/aws/events/schedule&quot;&gt;Serverless Framework docs&lt;/a&gt; to find out what the potential settings are.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L48&quot;&gt;DuckDB memory limit&lt;/a&gt;: The memory limit is automatically derived from the function&apos;s memory setting&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L75&quot;&gt;DuckDB threads count&lt;/a&gt;: Optionally set the maximum thread count (on Lambda, this is derived automatically from the amount of memory the function has assigned). With this setting, you can influence how many files are written per partition. Keep in mind that setting a lower thread count than available means the computation will not use all available resources, just to control the number of generated files. Ideally, adjust the amount of memory you assign to the Lambda function instead (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L50&quot;&gt;Lambda timeout&lt;/a&gt;: The maximum time a Lambda function can run is currently 15min / 900sec. This means that if your query takes longer than that, it will be terminated by the underlying Firecracker engine.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
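&lt;p&gt;For illustration, this is roughly how such settings could be applied in the Lambda handler before the repartitioning query runs. This is a sketch with hypothetical environment variable names, not the actual project code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;// Sketch: applying resource-related DuckDB settings before running the query.
// DUCKDB_MEMORY_LIMIT and DUCKDB_THREADS are hypothetical environment variables.
import duckdb from &apos;duckdb&apos;;

const db = new duckdb.Database(&apos;:memory:&apos;);
const exec = (sql: string) =&gt;
  new Promise((resolve, reject) =&gt; db.exec(sql, (err) =&gt; (err ? reject(err) : resolve(null))));

export const handler = async () =&gt; {
  // Cap DuckDB&apos;s memory usage below the Lambda memory setting
  await exec(`SET memory_limit=&apos;${process.env.DUCKDB_MEMORY_LIMIT ?? &apos;9GB&apos;}&apos;;`);
  // Fewer threads than available means fewer (but larger) files per partition
  await exec(`SET threads=${process.env.DUCKDB_THREADS ?? 6};`);
  // ... run the configured repartitioning query here
};
&lt;/code&gt;&lt;/pre&gt;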
&lt;h3 id=&quot;heading-using-different-sourcetarget-s3-buckets&quot;&gt;Using different source/target S3 buckets&lt;/h3&gt;
&lt;p&gt;If you&apos;re planning to use different S3 buckets as sources and targets for the data repartitioning, you need to adapt the &lt;code&gt;iamRoleStatements&lt;/code&gt; settings of the function.&lt;/p&gt;
&lt;p&gt;Here&apos;s an example with minimal privileges:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;iamRoleStatements:
  &lt;span class=&quot;hljs-comment&quot;&gt;# Source S3 bucket permissions&lt;/span&gt;
  - Effect: Allow
    Action:
      - s3:ListBucket
    Resource: &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:s3:::my-source-bucket&apos;&lt;/span&gt;
  - Effect: Allow
    Action:
      - s3:GetObject
    Resource: &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:s3:::my-source-bucket/*&apos;&lt;/span&gt;
  &lt;span class=&quot;hljs-comment&quot;&gt;# Target S3 bucket permissions&lt;/span&gt;
  - Effect: Allow
    Action:
      - s3:ListBucket
      - s3:AbortMultipartUpload
      - s3:ListMultipartUploadParts
      - s3:ListBucketMultipartUploads
    Resource: &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:s3:::my-target-bucket&apos;&lt;/span&gt;
  - Effect: Allow
    Action:
      - s3:GetObject
      - s3:PutObject
    Resource: &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:s3:::my-target-bucket/*&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A query for this use case would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-sql&quot;&gt;COPY (&lt;span class=&quot;hljs-keyword&quot;&gt;SELECT&lt;/span&gt; * &lt;span class=&quot;hljs-keyword&quot;&gt;FROM&lt;/span&gt; parquet_scan(&lt;span class=&quot;hljs-string&quot;&gt;&apos;s3://my-source-bucket/input/*.parquet&apos;&lt;/span&gt;, HIVE_PARTITIONING = &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;)) &lt;span class=&quot;hljs-keyword&quot;&gt;TO&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;s3://my-target-bucket/output&apos;&lt;/span&gt; (&lt;span class=&quot;hljs-keyword&quot;&gt;FORMAT&lt;/span&gt; PARQUET, PARTITION_BY (column1, column2, column3), ALLOW_OVERWRITE &lt;span class=&quot;hljs-literal&quot;&gt;TRUE&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-deployment&quot;&gt;Deployment&lt;/h2&gt;
&lt;p&gt;After you have cloned this repository to your local machine and cd&apos;ed into its directory, the application can be deployed like this (don&apos;t forget an &lt;code&gt;npm i&lt;/code&gt; to install the dependencies!):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ sls deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will deploy the stack to the default AWS region &lt;code&gt;us-east-1&lt;/code&gt;. In case you want to deploy the stack to a different region, you can specify a &lt;code&gt;--region&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ sls deploy --region eu-central-1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The deployment should take 2-3 minutes.&lt;/p&gt;
&lt;h2 id=&quot;heading-checks-and-manual-triggering&quot;&gt;Checks and manual triggering&lt;/h2&gt;
&lt;p&gt;You can &lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com/framework/docs/providers/aws/cli-reference/invoke&quot;&gt;manually invoke&lt;/a&gt; the deployed Lambda function by running&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ sls invoke -f repartitionData
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that, you can &lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com/framework/docs/providers/aws/cli-reference/logs&quot;&gt;check the generated CloudWatch logs&lt;/a&gt; by issuing&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ sls logs -f repartitionData
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don&apos;t see any &lt;code&gt;DUCKDB_NODEJS_ERROR&lt;/code&gt; in the logs, everything ran successfully, and you can have a look at your S3 bucket for the newly generated parquet files.&lt;/p&gt;
&lt;h2 id=&quot;heading-costs&quot;&gt;Costs&lt;/h2&gt;
&lt;p&gt;Using this repository will generate costs in your AWS account. Please refer to the AWS pricing docs for the respective services before deploying and running it.&lt;/p&gt;
&lt;h1 id=&quot;heading-summary&quot;&gt;Summary&lt;/h1&gt;
&lt;p&gt;We were able to show a possible serverless solution to repartition data that is stored in S3 as parquet files, without the limitations imposed by certain AWS services. With the solution shown, we can use plain and simple SQL queries instead of having to rely on external libraries.&lt;/p&gt;
&lt;h1 id=&quot;heading-references&quot;&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Serverless parquet repartitioner repo: &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-parquet-repartitioner&quot;&gt;https://github.com/tobilg/serverless-parquet-repartitioner&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>duckDB</category><category>Amazon S3</category><category>Data-lake</category><category>analytics</category><category>Parquet</category><author>Tobias Müller</author></item><item><title>Using DuckDB in AWS Lambda</title><link>https://tobilg.com/posts/using-duckdb-in-aws-lambda/</link><guid isPermaLink="true">https://tobilg.com/posts/using-duckdb-in-aws-lambda/</guid><description>Prelude
DuckDB is an open-source in-process SQL OLAP database management system that has recently gained significant public interest due to its unique architecture and impressive performance benchmarks.
Unlike traditional databases that are designed ...</description><pubDate>Sun, 12 Feb 2023 16:41:41 GMT</pubDate><content:encoded>&lt;h2 id=&quot;heading-prelude&quot;&gt;Prelude&lt;/h2&gt;
&lt;p&gt;DuckDB is an open-source in-process SQL OLAP database management system that has recently gained significant public interest due to its unique architecture and impressive performance benchmarks.&lt;/p&gt;
&lt;p&gt;Unlike traditional databases that are designed to handle a wide variety of use cases, DuckDB is built specifically for analytical queries and is optimized to perform extremely well in these scenarios. This focus on analytics has allowed DuckDB to outperform traditional databases by several orders of magnitude, making it a popular choice for data scientists and analysts who need to process large datasets quickly and efficiently.&lt;/p&gt;
&lt;p&gt;DuckDB is designed to be a highly efficient and scalable database system, which makes it a perfect fit for serverless architectures that allow developers to build and run applications and services without having to manage infrastructure.&lt;/p&gt;
&lt;p&gt;DuckDB&apos;s ability to handle large datasets in a memory-efficient manner makes it an ideal choice for serverless environments. Being able to read columnar storage formats like Parquet or Apache Arrow tables from local, S3 or HTTP sources, DuckDB can quickly scan and aggregate large amounts of data without having to load it all into memory, reducing the amount of memory required to perform complex analytical queries. This allows for cost savings, as serverless environments typically charge for both compute and memory resources.&lt;/p&gt;
&lt;p&gt;Existing AWS services, such as Athena or RDS, don&apos;t provide the same functionalities, and also have different scaling and pricing models. That&apos;s why it makes sense to explore ways to run DuckDB as an analytical service on AWS.&lt;/p&gt;
&lt;h2 id=&quot;heading-how-to-run-duckdb-in-aws-lambda&quot;&gt;How to run DuckDB in AWS Lambda?&lt;/h2&gt;
&lt;p&gt;The goal of this article is to use DuckDB on Node.js runtimes (12, 14, 16 and 18), so it is necessary to find a way to make DuckDB usable with Lambda. The first idea was to simply use the existing &lt;a target=&quot;_blank&quot; href=&quot;https://www.npmjs.com/package/duckdb&quot;&gt;DuckDB npm package&lt;/a&gt; and the default packaging mechanisms when deploying Lambda functions. Unfortunately, this idea proved impossible due to &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/mapbox/node-pre-gyp/issues/661#issuecomment-1347186316&quot;&gt;downstream package problems&lt;/a&gt; and differing build operating systems (Lambda needs statically linked binaries built on Amazon Linux).&lt;/p&gt;
&lt;p&gt;Generally, there are several ways to package dependencies in AWS Lambda functions with Node.js runtimes. Using bundlers like WebPack or ESbuild is probably the most used option at the moment.&lt;/p&gt;
&lt;p&gt;Another one is using AWS Lambda layers for distributing dependencies to Lambda functions, allowing developers to manage common components and libraries across multiple functions in a centralized manner. By creating a layer for the dependencies, developers can avoid having to include them in each function’s deployment package. This helps reduce the size of the deployment package and makes it easier to manage updates to the dependencies. Moreover, using a Lambda layer can also improve the performance of Lambda functions.&lt;/p&gt;
&lt;h2 id=&quot;heading-building-duckdb-for-aws-lambda&quot;&gt;Building DuckDB for AWS Lambda&lt;/h2&gt;
&lt;p&gt;So, how can we build a DuckDB version that can be used with Node.js runtimes on AWS Lambda?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We have to use a compatible environment when compiling the DuckDB binary, to avoid GLIBC incompatibilities etc. This means that we have to use an Amazon Linux distribution to build DuckDB, and enable static linking.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have to solve the downstream package problems stated above, which make it impossible to use the default package.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On a side note, we want to enable the current features like COPY TO PARTITIONED BY and improved Parquet file reading, thus requiring a build from the master branch of DuckDB.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I created &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-nodejs-layer&quot;&gt;https://github.com/tobilg/duckdb-nodejs-layer&lt;/a&gt; to achieve this. It uses GitHub Actions to automatically trigger a build of DuckDB from the current master, package it as an AWS Lambda layer, and upload it to all AWS regions. Feel free to have a look at the source code, and open a GitHub issue in case you find any errors or have ideas for improvement.&lt;/p&gt;
&lt;h2 id=&quot;heading-using-the-duckdb-lambda-layer&quot;&gt;Using the DuckDB Lambda layer&lt;/h2&gt;
&lt;p&gt;Depending on your preferred framework, the methods to use a Lambda layer are different. You can find the respective docs of the most common frameworks below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com/framework/docs/providers/aws/guide/serverless.yml/#functions&quot;&gt;Serverless Framework&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/blogs/compute/working-with-aws-lambda-and-lambda-layers-in-aws-sam/&quot;&gt;SAM&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/cdk/api/v1/docs/aws-lambda-readme.html#layers&quot;&gt;CDK&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html#cfn-lambda-function-layers&quot;&gt;CloudFormation&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The ARNs follow this pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;arn:aws:lambda:&lt;span class=&quot;hljs-variable&quot;&gt;$REGION&lt;/span&gt;:041475135427:layer:duckdb-nodejs-layer:&lt;span class=&quot;hljs-variable&quot;&gt;$VERSION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can find the list of ARNs for all regions at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/duckdb-nodejs-layer#usage&quot;&gt;https://github.com/tobilg/duckdb-nodejs-layer#usage&lt;/a&gt;.&lt;/p&gt;
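&lt;p&gt;For example, assuming the default region &lt;code&gt;us-east-1&lt;/code&gt; and layer version &lt;code&gt;3&lt;/code&gt; (the version used in the example configuration below), the resulting ARN would be &lt;code&gt;arn:aws:lambda:us-east-1:041475135427:layer:duckdb-nodejs-layer:3&lt;/code&gt;.&lt;/p&gt;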
&lt;p&gt;I created an example repository with the Serverless Framework at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-duckdb&quot;&gt;https://github.com/tobilg/serverless-duckdb&lt;/a&gt; that uses API Gateway and Lambda to provide an endpoint to which SQL queries can be issued, which the built-in DuckDB then executes. Let&apos;s walk through it!&lt;/p&gt;
&lt;h3 id=&quot;heading-requirements&quot;&gt;Requirements&lt;/h3&gt;
&lt;p&gt;You&apos;ll need a current v3 version installation of the &lt;a target=&quot;_blank&quot; href=&quot;https://serverless.com/&quot;&gt;Serverless Framework&lt;/a&gt; on the machine you&apos;re planning to deploy the application from.&lt;/p&gt;
&lt;p&gt;Also, you&apos;ll have to set up your AWS credentials according to the &lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com/framework/docs/providers/aws/guide/credentials/&quot;&gt;Serverless docs&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;heading-configuration&quot;&gt;Configuration&lt;/h3&gt;
&lt;p&gt;DuckDB is automatically configured to use the &lt;a target=&quot;_blank&quot; href=&quot;https://duckdb.org/docs/extensions/httpfs&quot;&gt;HTTPFS extension&lt;/a&gt; and uses the AWS credentials that are given to your Lambda function by its execution role. This means you can potentially query data that is available via HTTP(S) or in AWS S3 buckets.&lt;/p&gt;
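&lt;p&gt;Conceptually, a query handler can wire up the httpfs extension and the execution role&apos;s temporary credentials roughly like in the following sketch. This is a simplified, generic illustration and not the actual repository code; the Lambda layer may already handle parts of this automatically:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;// Simplified sketch of a query handler that hands the Lambda execution role&apos;s
// credentials to DuckDB&apos;s httpfs extension. Not the actual serverless-duckdb code.
import duckdb from &apos;duckdb&apos;;

const db = new duckdb.Database(&apos;:memory:&apos;);
const query = (sql: string) =&gt;
  new Promise((resolve, reject) =&gt; db.all(sql, (err, rows) =&gt; (err ? reject(err) : resolve(rows))));

export const handler = async (event: any) =&gt; {
  // Extensions need a writable home directory on Lambda
  await query(`SET home_directory=&apos;/tmp&apos;;`);
  await query(`INSTALL httpfs;`);
  await query(`LOAD httpfs;`);
  // The execution role&apos;s temporary credentials are exposed as environment variables
  await query(`SET s3_region=&apos;${process.env.AWS_REGION}&apos;;`);
  await query(`SET s3_access_key_id=&apos;${process.env.AWS_ACCESS_KEY_ID}&apos;;`);
  await query(`SET s3_secret_access_key=&apos;${process.env.AWS_SECRET_ACCESS_KEY}&apos;;`);
  await query(`SET s3_session_token=&apos;${process.env.AWS_SESSION_TOKEN}&apos;;`);

  // Run the SQL statement sent in the request body
  const rows = await query(JSON.parse(event.body).query);
  return { statusCode: 200, body: JSON.stringify(rows) };
};
&lt;/code&gt;&lt;/pre&gt;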
&lt;p&gt;If you want to also query data (e.g. Parquet files) that resides in one or more S3 buckets, you&apos;ll have to adjust the &lt;code&gt;iamRoleStatements&lt;/code&gt; part of the function configuration in the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/serverless-duckdb/blob/main/serverless.yml#L45&quot;&gt;serverless.yml&lt;/a&gt; file. Just replace the &lt;code&gt;YOUR-S3-BUCKET-NAME&lt;/code&gt; with your actual S3 bucket name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-yaml&quot;&gt;  &lt;span class=&quot;hljs-attr&quot;&gt;query:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;handler:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;src/functions/query.handler&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;memorySize:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;10240&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;timeout:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;30&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;iamRoleStatements:&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;Effect:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Allow&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Action:&lt;/span&gt;
          &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;s3:GetObject&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Resource:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:s3:::YOUR-S3-BUCKET-NAME/*&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;Effect:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Allow&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Action:&lt;/span&gt;
          &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;s3:ListBucket&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Resource:&lt;/span&gt;
          &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:s3:::YOUR-S3-BUCKET-NAME&apos;&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;layers:&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;arn:aws:lambda:${self:provider.region}:041475135427:layer:duckdb-nodejs-layer:3&apos;&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;events:&lt;/span&gt;
      &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;http:&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;path:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;${self:custom.api.version}/query&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;method:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;post&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;cors:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;private:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-deployment&quot;&gt;Deployment&lt;/h3&gt;
&lt;p&gt;After you have cloned this repository to your local machine and cd&apos;ed into its directory, the application can be deployed like this (don&apos;t forget an &lt;code&gt;npm i&lt;/code&gt; to install the dependencies):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ sls deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will deploy the stack to the default AWS region &lt;code&gt;us-east-1&lt;/code&gt;. In case you want to deploy the stack to a different region, you can specify a &lt;code&gt;--region&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;$ sls deploy --region eu-central-1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The deployment should take 2-3 minutes. Once the deployment is finished, you should find some output in your console that indicates the API Gateway endpoint URL and the API Key:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;api keys:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;DuckDBKey:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;REDACTED&lt;/span&gt;
&lt;span class=&quot;hljs-attr&quot;&gt;endpoint:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;POST&lt;/span&gt; &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;https://REDACTED.execute-api.us-east-1.amazonaws.com/prd/v1/query&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;heading-usage&quot;&gt;Usage&lt;/h3&gt;
&lt;p&gt;You can now query your DuckDB endpoint via HTTP requests (don&apos;t forget to exchange &lt;code&gt;REDACTED&lt;/code&gt; with your real URL and API Key), e.g.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;curl --location --request POST &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://REDACTED.execute-api.us-east-1.amazonaws.com/prd/v1/query&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;x-api-key: REDACTED&apos;&lt;/span&gt; \
--header &lt;span class=&quot;hljs-string&quot;&gt;&apos;Content-Type: application/json&apos;&lt;/span&gt; \
--data-raw &lt;span class=&quot;hljs-string&quot;&gt;&apos;{
    &quot;query&quot;: &quot;SELECT avg(c_acctbal) FROM &apos;&lt;/span&gt;\&lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt;https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet&lt;span class=&quot;hljs-string&quot;&gt;&apos;\&apos;&lt;/span&gt;&lt;span class=&quot;hljs-string&quot;&gt;&apos;;&quot;
}&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The query results will look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-json&quot;&gt;[
    {
        &lt;span class=&quot;hljs-attr&quot;&gt;&quot;avg(c_acctbal)&quot;&lt;/span&gt;: &lt;span class=&quot;hljs-number&quot;&gt;4454.577060000001&lt;/span&gt;
    }
]
&lt;/code&gt;&lt;/pre&gt;
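&lt;p&gt;The same call can of course also be issued programmatically, e.g. with the built-in &lt;code&gt;fetch&lt;/code&gt; on Node.js 18+ (the endpoint URL and API key below are placeholders for your own deployment output values):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;// Calling the deployed query endpoint from Node.js 18+ using the built-in fetch.
// Replace the URL and the API key with the values from your deployment output.
const response = await fetch(&apos;https://REDACTED.execute-api.us-east-1.amazonaws.com/prd/v1/query&apos;, {
  method: &apos;POST&apos;,
  headers: {
    &apos;x-api-key&apos;: &apos;REDACTED&apos;,
    &apos;Content-Type&apos;: &apos;application/json&apos;,
  },
  body: JSON.stringify({
    query: &quot;SELECT avg(c_acctbal) FROM &apos;https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet&apos;;&quot;,
  }),
});

console.log(await response.json());
&lt;/code&gt;&lt;/pre&gt;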
&lt;h3 id=&quot;heading-example-queries&quot;&gt;Example queries&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;Remote Parquet Scans:
  SELECT count(*) FROM &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet&apos;&lt;/span&gt;;
  SELECT count(*) FROM &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet&apos;&lt;/span&gt;;
  SELECT avg(c_acctbal) FROM &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet&apos;&lt;/span&gt;;
  SELECT * FROM &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet&apos;&lt;/span&gt; LIMIT 10;

Remote Parquet/Parquet Join:
  SELECT n_name, count(*)
  FROM &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet&apos;&lt;/span&gt;,
       &lt;span class=&quot;hljs-string&quot;&gt;&apos;https://shell.duckdb.org/data/tpch/0_01/parquet/nation.parquet&apos;&lt;/span&gt;
  WHERE c_nationkey = n_nationkey GROUP BY n_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&quot;heading-conclusion&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We were able to show that it&apos;s possible to package and use DuckDB in a Lambda function, as well as to run performant queries on remote data with this setup.&lt;/p&gt;
&lt;p&gt;But we need to keep in mind that this is just a showcase. The example application doesn&apos;t solve a lot of issues we&apos;d have to solve if we&apos;d want to run this in a distributed manner:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A query planner and router, that scales DuckDB instances and redistributes the queries to the &quot;query backend&quot; functions, as well as unites the query results before passing them back to the query issuer&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&quot;Query stickiness&quot;: The example is stateless, meaning that even if you loaded data into an in-memory table, you couldn&apos;t be sure that a subsequent query would reach the same function instance, due to Lambda&apos;s scaling/execution model&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Running DuckDB &quot;only&quot; in Lambda functions may not be the most performant way when AWS Fargate and very large EC2 instances exist&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The example application uses API Gateway as the event source for the Lambda function, which means the maximum runtime of the queries can be 30 seconds, which is unrealistic for large datasets or complicated queries. In real-world scenarios, the Lambda function would need to be triggered asynchronously, e.g. via SNS or SQS. This also means that the queries probably can&apos;t follow a strict request/response model.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;heading-references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://www.boilingdata.com/&quot;&gt;BoilingData&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://stoic.com/&quot;&gt;STOIC&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://twitter.com/ghalimi&quot;&gt;Ismael Ghalimi&apos;s Twitter feed&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>AWS</category><category>duckDB</category><category>serverless</category><category>data analytics</category><category>lambda</category><author>Tobias Müller</author></item><item><title>Building a global reverse proxy with on-demand SSL support</title><link>https://tobilg.com/posts/building-a-global-reverse-proxy-with-on-demand-ssl-support/</link><guid isPermaLink="true">https://tobilg.com/posts/building-a-global-reverse-proxy-with-on-demand-ssl-support/</guid><description>Motivation
Who needs a reverse proxy with on-demand SSL support? Well, think about services as Hashnode, which also runs this blog, or Fathom and SimpleAnalytics. A feature that all those services have in common? They all are enabling their customers...</description><pubDate>Tue, 10 Jan 2023 23:25:26 GMT</pubDate><content:encoded>&lt;h1 id=&quot;heading-motivation&quot;&gt;Motivation&lt;/h1&gt;
&lt;p&gt;Who needs a reverse proxy with on-demand SSL support? Well, think about services such as &lt;a target=&quot;_blank&quot; href=&quot;http://hashnode.com&quot;&gt;Hashnode&lt;/a&gt;, which also runs this blog, or &lt;a target=&quot;_blank&quot; href=&quot;http://usefathom.com&quot;&gt;Fathom&lt;/a&gt; and &lt;a target=&quot;_blank&quot; href=&quot;http://simpleanalytics.com&quot;&gt;SimpleAnalytics&lt;/a&gt;. A feature that all those services have in common? They all enable their customers to bring their own custom domain names. The latter two services use them to &lt;a target=&quot;_blank&quot; href=&quot;https://usefathom.com/blog/bypass-adblockers&quot;&gt;bypass&lt;/a&gt; &lt;a target=&quot;_blank&quot; href=&quot;https://docs.simpleanalytics.com/bypass-ad-blockers&quot;&gt;adblockers&lt;/a&gt;, so that customers can track all their pageviews and events, which potentially wouldn&apos;t be possible otherwise because the services&apos; own domains are prone to DNS block lists. Hashnode uses them to enable their customers to host their blogs under their own domain names.&lt;/p&gt;
&lt;h1 id=&quot;heading-requirements&quot;&gt;Requirements&lt;/h1&gt;
&lt;p&gt;What are the functional &amp;amp; non-functional requirements to build such a system? Let&apos;s try to recap:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We want to be able to use a custom (&quot;external&quot;) domain, e.g. &lt;code&gt;subdomain.customdomain.tld&lt;/code&gt; to redirect to another domain, such as &lt;code&gt;targetdomain.tld&lt;/code&gt; via a &lt;code&gt;CNAME&lt;/code&gt; DNS record&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The custom domains need to have SSL/TLS support&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The custom domains need to be configurable, without changing the underlying infrastructure&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to make sure that the service will only create and provide the certificates for whitelisted custom domains&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to optimize the latency of individual requests, thus need to support a scalable and distributed infrastructure&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to be as Serverless as possible&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to optimize for infrastructure costs (variable and fixed)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We want to build this on AWS&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The service needs to be deployable/updateable/removable via Infrastructure as Code (IaC) in a repeatable manner&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;heading-architectural-considerations&quot;&gt;Architectural considerations&lt;/h1&gt;
&lt;p&gt;Looking at the above requirements, what are the implications from an architectural point of view? Which tools and services are already on the market? What does AWS as Public Cloud Provider offer for our use case?&lt;/p&gt;
&lt;p&gt;For the main requirements of a &lt;a target=&quot;_blank&quot; href=&quot;https://caddyserver.com/docs/quick-starts/reverse-proxy&quot;&gt;reverse proxy&lt;/a&gt; server with &lt;a target=&quot;_blank&quot; href=&quot;https://caddyserver.com/docs/automatic-https&quot;&gt;automatic SSL/TLS&lt;/a&gt; support, &lt;a target=&quot;_blank&quot; href=&quot;https://caddyserver.com/&quot;&gt;Caddy&lt;/a&gt; seems to be an optimal candidate. As it is written in Go, it can run in Docker containers very well and can be used on numerous operating systems. This means we have the options to either run it on EC2 instances or ECS/Fargate if we decide to run it in containers. The latter would cater to the requirement to run as Serverless as possible. It has modules to store the generated SSL/TLS on-demand certificates in &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/silinternational/certmagic-storage-dynamodb&quot;&gt;DynamoDB&lt;/a&gt; or &lt;a target=&quot;_blank&quot; href=&quot;https://caddyserver.com/docs/modules/caddy.storage.s3#github.com/ss098/certmagic-s3&quot;&gt;S3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, the whitelisting of custom domains is possible for those certificates, by providing an additional backend service that Caddy can ask whether a requested custom domain is &lt;a target=&quot;_blank&quot; href=&quot;https://caddyserver.com/docs/caddyfile/options#on-demand-tls&quot;&gt;allowed to be used or not&lt;/a&gt;.&lt;/p&gt;
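&lt;p&gt;A minimal sketch of such an &quot;ask&quot; backend as a Lambda function behind API Gateway could look like this. Caddy calls it with the requested domain as a query parameter, and only a 2xx response allows the certificate to be issued; the in-memory whitelist below is just a placeholder for a real lookup (e.g. in DynamoDB):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-typescript&quot;&gt;// Sketch of an on-demand TLS &quot;ask&quot; endpoint: Caddy requests it with
// ?domain=subdomain.customdomain.tld and issues a certificate only on a 200 response.
// The in-memory Set is a placeholder for a real whitelist lookup (e.g. DynamoDB).
const allowedDomains = new Set([&apos;subdomain.customdomain.tld&apos;]);

export const handler = async (event: { queryStringParameters?: { domain?: string } }) =&gt; {
  const domain = event.queryStringParameters?.domain ?? &apos;&apos;;

  if (allowedDomains.has(domain)) {
    return { statusCode: 200, body: &apos;ok&apos; };
  }
  return { statusCode: 404, body: &apos;domain not allowed&apos; };
};
&lt;/code&gt;&lt;/pre&gt;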
&lt;p&gt;A challenge is that none of those modules are contained in the official Caddy builds, meaning that we&apos;d have to build a custom version of Caddy to be able to use those storage backends.&lt;/p&gt;
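&lt;p&gt;For illustration, the on-demand TLS check is just an HTTP request that Caddy makes before issuing a certificate: it calls the configured &lt;code&gt;ask&lt;/code&gt; endpoint with the requested domain as a query parameter, and only proceeds on a 2xx response. A minimal sketch against a hypothetical whitelisting endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Simulate Caddy&apos;s on-demand TLS &quot;ask&quot; check (endpoint URL is hypothetical)
# A 200 response means the certificate may be issued, any other status blocks it
curl -i &quot;https://domains.example.com/verify?domain=test.myexistingdomain.com&quot;
&lt;/code&gt;&lt;/pre&gt;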
&lt;p&gt;Regarding the requirement of global availability and short response times, &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/global-accelerator/&quot;&gt;AWS Global Accelerator&lt;/a&gt; is a viable option, as it can provide a global single static IP address endpoint for multiple, regionally distributed services. In our use case, those services would be our Caddy installations.&lt;/p&gt;
&lt;p&gt;As mentioned before, Caddy itself can run in containers or on EC2 instances / VMs. Since the service runs continuously and presumably doesn&apos;t need a lot of resources when not under heavy load, we assume that 1 vCPU and 1 GB of memory should be enough.&lt;/p&gt;
&lt;p&gt;When projecting this on the necessary infrastructure, the cost comparison between Containers and VMs looks like the following (for simplification, we just compare the fixed costs, ignore variable costs such as &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/global-accelerator/pricing/&quot;&gt;egress traffic&lt;/a&gt;, and assume the &lt;code&gt;us-east-1&lt;/code&gt; region is used):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Containers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fargate task with 1 vCPU and 1 GB of memory for each Caddy regional instance&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: ($0.04048/vCPU hour + $0.004445/GB hour) * 720 hours (30 days) = $32.35 / 30 days&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ALB to make the Fargate tasks available to the outside world&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: ($0.0225 per ALB-hour + $0.008 per LCU-hour) * 720 hours (30 days) = $21.96 / 30 days&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In combination, it would cost &lt;strong&gt;$54.31&lt;/strong&gt; to run this setup for 30 days.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;EC2 instances / VMs&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A t2.micro instance with 1 vCPU and 1 GB of memory for each Caddy regional instance&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: $0.0116/hour on-demand * 720 hours (30 days) = $8.35&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There&apos;s no need for a Load Balancer in front of the EC2 instances, as Global Accelerator can directly use them&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: &lt;em&gt;$0&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In combination, it would cost &lt;strong&gt;&lt;em&gt;$8.35&lt;/em&gt;&lt;/strong&gt; to run this setup for 30 days.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Additional costs are:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS Global Accelerator&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: $0.025 / hour * 720 hours (30 days) = &lt;strong&gt;$18&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DynamoDB table&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: (5000 reads/day * $0.25/million + 50 writes/day * $1.25/million) * 30 days = $0.0375 reads + 0.001875 writes = &lt;strong&gt;$0.04&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lambda function (128 MB / 0.25 sec avg. duration / 5000 req./day)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Price: ($0.0000166667 / GB-second + $0.20 per 1M req.) = basically &lt;strong&gt;$0&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;heading-resulting-architecture&quot;&gt;&lt;strong&gt;Resulting architecture&lt;/strong&gt;&lt;/h1&gt;
&lt;p&gt;Based on the calculated fixed costs, we decided to use EC2 instances instead of Fargate tasks, which will save us a decent amount of money even for one Caddy instance. &lt;strong&gt;The estimated costs for running this architecture for 30 days are $26.39&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;As one of our requirements was that we can roll out this infrastructure potentially on a global scale, we need to be able to deploy the EC2 instances with the Caddy servers in different AWS regions, as well as having multiple instances in the same region.&lt;/p&gt;
&lt;p&gt;Furthermore, we could use &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/dynamodb/global-tables/&quot;&gt;DynamoDB Global Tables&lt;/a&gt; to achieve a global distribution of the certificates for faster response times, but deem this out of scope for this article.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The final architecture:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn.hashnode.com/res/hashnode/image/upload/v1673307403175/75a9ddcc-76f5-4f5e-81cb-e5604a09953c.png&quot; alt class=&quot;image--center mx-auto&quot; /&gt;&lt;/p&gt;
&lt;h1 id=&quot;heading-implementation&quot;&gt;Implementation&lt;/h1&gt;
&lt;p&gt;To implement the described architecture, we must take several steps. First of all, we must build a custom version of Caddy that includes the DynamoDB module, which then enables us to use DynamoDB as a certificate store.&lt;/p&gt;
&lt;h2 id=&quot;heading-custom-caddy-build&quot;&gt;Custom Caddy build&lt;/h2&gt;
&lt;p&gt;This can be achieved via a custom build process leveraging Docker images of AmazonLinux 2, as found at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/aws-caddy-build&quot;&gt;tobilg/aws-caddy-build&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/aws-caddy-build/blob/main/build.sh&quot;&gt;build.sh&lt;/a&gt; (parametrized custom build of Caddy via Docker)&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;&lt;span class=&quot;hljs-meta&quot;&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class=&quot;hljs-built_in&quot;&gt;set&lt;/span&gt; -e

&lt;span class=&quot;hljs-comment&quot;&gt;# Set OS (first script argument)&lt;/span&gt;
OS=&lt;span class=&quot;hljs-variable&quot;&gt;${1:-linux}&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;# Set Caddy version (second script argument)&lt;/span&gt;
CADDY_VERSION=&lt;span class=&quot;hljs-variable&quot;&gt;${2:-v2.6.2}&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;# Create release folders&lt;/span&gt;
mkdir -p &lt;span class=&quot;hljs-variable&quot;&gt;$PWD&lt;/span&gt;/releases &lt;span class=&quot;hljs-variable&quot;&gt;$PWD&lt;/span&gt;/temp_release

&lt;span class=&quot;hljs-comment&quot;&gt;# Run build&lt;/span&gt;
docker build --build-arg OS=&lt;span class=&quot;hljs-variable&quot;&gt;$OS&lt;/span&gt; --build-arg CADDY_VERSION=&lt;span class=&quot;hljs-variable&quot;&gt;$CADDY_VERSION&lt;/span&gt; -t custom-caddy-build .

&lt;span class=&quot;hljs-comment&quot;&gt;# Copy release from image to temporary folder&lt;/span&gt;
docker run -v &lt;span class=&quot;hljs-variable&quot;&gt;$PWD&lt;/span&gt;/temp_release:/opt/mount --rm -ti custom-caddy-build bash -c &lt;span class=&quot;hljs-string&quot;&gt;&quot;cp /tmp/caddy-build/* /opt/mount/&quot;&lt;/span&gt;

&lt;span class=&quot;hljs-comment&quot;&gt;# Copy release to releases&lt;/span&gt;
cp &lt;span class=&quot;hljs-variable&quot;&gt;$PWD&lt;/span&gt;/temp_release/* &lt;span class=&quot;hljs-variable&quot;&gt;$PWD&lt;/span&gt;/releases/

&lt;span class=&quot;hljs-comment&quot;&gt;# Cleanup&lt;/span&gt;
rm -rf &lt;span class=&quot;hljs-variable&quot;&gt;$PWD&lt;/span&gt;/temp_release
&lt;/code&gt;&lt;/pre&gt;
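&lt;p&gt;Assuming Docker is installed, building e.g. a Linux binary of Caddy v2.6.2 then boils down to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Build a custom Caddy v2.6.2 binary for Linux (OS and version are the two script arguments)
./build.sh linux v2.6.2

# The resulting binary ends up in the releases folder
ls releases/
&lt;/code&gt;&lt;/pre&gt;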
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/aws-caddy-build/blob/main/Dockerfile&quot;&gt;Dockerfile&lt;/a&gt; (builds Caddy with the DynamoDB and S3 modules)&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;FROM amazonlinux:2

ARG CADDY_VERSION=v2.6.2
ARG OS=linux

&lt;span class=&quot;hljs-comment&quot;&gt;# Install dependencies&lt;/span&gt;
RUN yum update -y &amp;amp;&amp;amp; \
  yum install golang -y

RUN GOBIN=/usr/&lt;span class=&quot;hljs-built_in&quot;&gt;local&lt;/span&gt;/bin/ go install github.com/caddyserver/xcaddy/cmd/xcaddy@latest

RUN mkdir -p /tmp/caddy-build &amp;amp;&amp;amp; \
  GOOS=&lt;span class=&quot;hljs-variable&quot;&gt;${OS}&lt;/span&gt; xcaddy build &lt;span class=&quot;hljs-variable&quot;&gt;${CADDY_VERSION}&lt;/span&gt; --with github.com/ss098/certmagic-s3 --with github.com/silinternational/certmagic-storage-dynamodb/v3 --output /tmp/caddy-build/aws_caddy_&lt;span class=&quot;hljs-variable&quot;&gt;${CADDY_VERSION}&lt;/span&gt;_&lt;span class=&quot;hljs-variable&quot;&gt;${OS}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
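&lt;p&gt;To double-check that the storage modules were actually compiled in, you can list the modules of the resulting binary (a quick sanity check; the exact module names may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# List the compiled-in modules and look for the S3 / DynamoDB storage backends
./releases/aws_caddy_v2.6.2_linux list-modules | grep -i storage
&lt;/code&gt;&lt;/pre&gt;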
&lt;p&gt;That&apos;s it for the custom Caddy build. You don&apos;t need to build this yourself, as the further steps use the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/aws-caddy-build/tree/main/releases&quot;&gt;release&lt;/a&gt; I built and uploaded to GitHub.&lt;/p&gt;
&lt;h2 id=&quot;heading-reverse-proxy-service&quot;&gt;Reverse Proxy Service&lt;/h2&gt;
&lt;p&gt;The implementation of the reverse proxy service can be found at &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy&quot;&gt;tobilg/global-reverse-proxy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Just clone it via &lt;code&gt;git clone&lt;/code&gt; &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy.git&quot;&gt;&lt;code&gt;https://github.com/tobilg/global-reverse-proxy.git&lt;/code&gt;&lt;/a&gt; to your local machine, and configure it as described below.&lt;/p&gt;
&lt;h3 id=&quot;heading-prerequisites&quot;&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Serverless Framework&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You need to have a recent (&amp;gt;=3.1.2) version of the &lt;a target=&quot;_blank&quot; href=&quot;https://goserverless.com/&quot;&gt;Serverless Framework&lt;/a&gt; installed globally on your machine. If you haven&apos;t, you can run &lt;code&gt;npm i -g serverless&lt;/code&gt; to install it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Valid AWS credentials&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Serverless Framework relies on already configured AWS credentials. Please refer to the &lt;a target=&quot;_blank&quot; href=&quot;https://www.serverless.com/framework/docs/providers/aws/guide/credentials/&quot;&gt;docs&lt;/a&gt; to learn how to set them up on your local machine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;EC2 key already configured&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you want to interact with the deployed EC2 instance(s), you need to add your existing public SSH key or create a new one. Please have a look at the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws&quot;&gt;AWS docs&lt;/a&gt; to learn how you can do that.&lt;/p&gt;
&lt;p&gt;Please also note the name you have given to the newly created key, as you will have to update the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L15&quot;&gt;configuration of the proxy server(s) stack&lt;/a&gt;.&lt;/p&gt;
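&lt;p&gt;If you prefer the CLI over the console, importing an existing public key as an EC2 key pair could look like this (key name and path are just examples):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Import an existing public SSH key as an EC2 key pair (example name and path)
aws ec2 import-key-pair \
  --key-name caddy-proxy-key \
  --public-key-material fileb://~/.ssh/id_rsa.pub \
  --region us-east-1
&lt;/code&gt;&lt;/pre&gt;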
&lt;h3 id=&quot;heading-infrastructure-as-code-overview&quot;&gt;Infrastructure as Code overview&lt;/h3&gt;
&lt;p&gt;The infrastructure consists of three different stacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A stack for the domain whitelisting service, and the certificate table in DynamoDB&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A stack for the proxy server(s) itself, which can be deployed multiple times if you want high (global) availability and fast latencies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A stack for the Global Accelerator, and the according DNS records&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heading-most-important-parts&quot;&gt;Most important parts&lt;/h3&gt;
&lt;p&gt;The main functionality, the Caddy-based reverse proxy, is deployed on an EC2 instance. Its configuration, the so-called Caddyfile, and the CloudFormation resource for the EC2 instance are the most important parts.&lt;/p&gt;
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/caddy-config/Caddyfile&quot;&gt;Caddyfile&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This configuration enables the reverse proxy, the on-demand TLS feature, and DynamoDB storage for certificates. It&apos;s automatically parametrized via the generated &lt;code&gt;/etc/caddy/environment&lt;/code&gt; file (see ec2.yml below). A systemd service for Caddy (managed via &lt;code&gt;systemctl&lt;/code&gt;) is generated as well, based on the configuration derived from the serverless.yml.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;{
        admin off
        on_demand_tls {
                ask {&lt;span class=&quot;hljs-variable&quot;&gt;$DOMAIN_SERVICE_ENDPOINT&lt;/span&gt;}
        }

        storage dynamodb {&lt;span class=&quot;hljs-variable&quot;&gt;$TABLE_NAME&lt;/span&gt;} {
                aws_region {&lt;span class=&quot;hljs-variable&quot;&gt;$TABLE_REGION&lt;/span&gt;}
        }
}

:80 {
       respond /health &lt;span class=&quot;hljs-string&quot;&gt;&quot;Im healthy&quot;&lt;/span&gt; 200
}

:443 {
        tls {&lt;span class=&quot;hljs-variable&quot;&gt;$LETSENCRYPT_EMAIL_ADDRESS&lt;/span&gt;} {
                on_demand
        }

        reverse_proxy https://{&lt;span class=&quot;hljs-variable&quot;&gt;$TARGET_DOMAIN&lt;/span&gt;} {
                header_up Host {&lt;span class=&quot;hljs-variable&quot;&gt;$TARGET_DOMAIN&lt;/span&gt;}
                header_up User-Custom-Domain {host}
                header_up X-Forwarded-Port {server_port}

                health_timeout 5s
        }
}
&lt;/code&gt;&lt;/pre&gt;
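&lt;p&gt;For reference, the generated &lt;code&gt;/etc/caddy/environment&lt;/code&gt; file that fills in these placeholders would look roughly like this (all values are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# /etc/caddy/environment (illustrative values)
TABLE_REGION=us-east-1
TABLE_NAME=prd-certificate-table
DOMAIN_SERVICE_ENDPOINT=https://abc123.lambda-url.us-east-1.on.aws/
LETSENCRYPT_EMAIL_ADDRESS=admin@mygreatproxyservice.com
TARGET_DOMAIN=targetdomain.tld
&lt;/code&gt;&lt;/pre&gt;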
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/resources/ec2.yml&quot;&gt;ec2.yml&lt;/a&gt; (extract)&lt;/p&gt;
&lt;p&gt;The interesting part is the &lt;a target=&quot;_blank&quot; href=&quot;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html&quot;&gt;UserData&lt;/a&gt; script, which is run automatically when the EC2 instance starts. It does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Download the custom Caddy build with DynamoDB support&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prepare a group and a user for Caddy&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the &lt;code&gt;caddy.service&lt;/code&gt; file for &lt;code&gt;systemctl&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the &lt;code&gt;Caddyfile&lt;/code&gt; (as outlined above)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the environment file (&lt;code&gt;/etc/caddy/environment&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reload systemd, then enable &amp;amp; start the Caddy service via &lt;code&gt;systemctl&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;lang-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;Resources:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;EC2Instance:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;AWS::EC2::Instance&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Properties:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;InstanceType:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.ec2.instanceType}&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;KeyName:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.ec2.keyName}&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;SecurityGroups:&lt;/span&gt; 
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!Ref&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;InstanceSecurityGroup&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;ImageId:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;ami-0b5eea76982371e91&apos;&lt;/span&gt; &lt;span class=&quot;hljs-comment&quot;&gt;# Amazon Linux 2 AMI&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;IamInstanceProfile:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!Ref&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;InstanceProfile&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;UserData:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!Base64&lt;/span&gt; 
        &lt;span class=&quot;hljs-attr&quot;&gt;&apos;Fn::Join&apos;:&lt;/span&gt;
          &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;&apos;&lt;/span&gt;
          &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              #!/bin/bash -xe
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo wget -O /usr/bin/caddy &quot;https://github.com/tobilg/aws-caddy-build/raw/main/releases/aws_caddy_v2.6.2_linux&quot;
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo chmod +x /usr/bin/caddy
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo groupadd --system caddy
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo useradd --system --gid caddy --create-home --home-dir /var/lib/caddy --shell /usr/sbin/nologin --comment &quot;Caddy web server&quot; caddy
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo mkdir -p /etc/caddy
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo echo -e &apos;${file(./configs.js):caddyService}&apos; | sudo tee /etc/systemd/system/caddy.service
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo printf &apos;${file(./configs.js):caddyFile}&apos; | sudo tee /etc/caddy/Caddyfile
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo echo -e &quot;TABLE_REGION=${self:custom.caddy.dynamoDBTableRegion}\nTABLE_NAME=${self:custom.caddy.dynamoDBTableName}\nDOMAIN_SERVICE_ENDPOINT=${self:custom.caddy.domainServiceEndpoint}\nLETSENCRYPT_EMAIL_ADDRESS=${self:custom.caddy.letsEncryptEmailAddress}\nTARGET_DOMAIN=${self:custom.caddy.targetDomainName}&quot; | sudo tee /etc/caddy/environment
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo systemctl daemon-reload
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|
              sudo systemctl enable caddy
&lt;/span&gt;            &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;|&lt;/span&gt;
              &lt;span class=&quot;hljs-string&quot;&gt;sudo&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;systemctl&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;--now&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;caddy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
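&lt;p&gt;If something doesn&apos;t work as expected, you can SSH into the instance (see the prerequisites above) and inspect the Caddy service directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Check whether the Caddy service is running and inspect its recent logs
sudo systemctl status caddy
sudo journalctl -u caddy --no-pager -n 100
&lt;/code&gt;&lt;/pre&gt;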
&lt;p&gt;&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/accelerator-stack/resources/global-accelerator.yml&quot;&gt;global-accelerator.yml&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Global Accelerator CloudFormation resources wire the EC2 instance(s) to the Accelerator, which acts as a kind of global load balancer. It is then referenced by dns-record.yml, which assigns the configured domain name to the Global Accelerator.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-yaml&quot;&gt;&lt;span class=&quot;hljs-attr&quot;&gt;Resources:&lt;/span&gt;
  &lt;span class=&quot;hljs-attr&quot;&gt;Accelerator:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;AWS::GlobalAccelerator::Accelerator&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Properties:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;Name:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;External-Accelerator&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;Enabled:&lt;/span&gt; &lt;span class=&quot;hljs-literal&quot;&gt;true&lt;/span&gt;

  &lt;span class=&quot;hljs-attr&quot;&gt;Listener:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;AWS::GlobalAccelerator::Listener&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Properties:&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;AcceleratorArn:&lt;/span&gt;
        &lt;span class=&quot;hljs-attr&quot;&gt;Ref:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;Accelerator&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;Protocol:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;TCP&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;ClientAffinity:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;NONE&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;PortRanges:&lt;/span&gt;
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;FromPort:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;443&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;ToPort:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;443&lt;/span&gt;

  &lt;span class=&quot;hljs-attr&quot;&gt;EndpointGroup1:&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Type:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;AWS::GlobalAccelerator::EndpointGroup&lt;/span&gt;
    &lt;span class=&quot;hljs-attr&quot;&gt;Properties:&lt;/span&gt; 
      &lt;span class=&quot;hljs-attr&quot;&gt;EndpointConfigurations:&lt;/span&gt; 
        &lt;span class=&quot;hljs-bullet&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;hljs-attr&quot;&gt;EndpointId:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.ec2.instance1.id}&apos;&lt;/span&gt;
          &lt;span class=&quot;hljs-attr&quot;&gt;Weight:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;EndpointGroupRegion:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;${self:custom.ec2.instance1.region}&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;HealthCheckIntervalSeconds:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;30&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;HealthCheckPath:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;/health&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;HealthCheckPort:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;80&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;HealthCheckProtocol:&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;HTTP&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;ListenerArn:&lt;/span&gt; &lt;span class=&quot;hljs-type&quot;&gt;!Ref&lt;/span&gt; &lt;span class=&quot;hljs-string&quot;&gt;&apos;Listener&apos;&lt;/span&gt;
      &lt;span class=&quot;hljs-attr&quot;&gt;ThresholdCount:&lt;/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
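&lt;p&gt;After deployment, you can look up the Accelerator&apos;s static IP addresses and DNS name, e.g. via the AWS CLI (note that the Global Accelerator API is, as far as I know, served from the us-west-2 region):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# List the deployed accelerators incl. their static IPs and DNS names
aws globalaccelerator list-accelerators --region us-west-2
&lt;/code&gt;&lt;/pre&gt;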
&lt;h3 id=&quot;heading-detailed-configuration&quot;&gt;Detailed configuration&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Stack configurations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Please configure the following values for the different stacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The target domain name where you want your reverse proxy to send the requests to (&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L7&quot;&gt;targetDomainName&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The email address to use for automatic certificate generation via LetsEncrypt (&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L8&quot;&gt;letsEncryptEmailAddress&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The domain name of the proxy service itself, which is then used by GlobalAccelerator (&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/accelerator-stack/serverless.yml#L6&quot;&gt;domain&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optionally: The current IP address from which you want to access the EC2 instance(s) via SSH (&lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L18&quot;&gt;sshClientIPAddress&lt;/a&gt;). If you want to use SSH, you&apos;ll also need to uncomment the respective &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/resources/ec2.yml#L56-L59&quot;&gt;SecurityGroup settings&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Whitelisted domain configuration&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You need to make sure that not everyone can use your reverse proxy with every domain. Therefore, you need to configure the whitelist of domains that can be used by Caddy&apos;s &lt;a target=&quot;_blank&quot; href=&quot;https://caddyserver.com/docs/automatic-https#on-demand-tls&quot;&gt;on-demand TLS feature&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is done with the Domain Verifier Lambda function, which is exposed via a Lambda Function URL endpoint.&lt;/p&gt;
&lt;p&gt;The configuration can be changed &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/domain-service-stack/src/domainVerifier.js#L3-L6&quot;&gt;here&lt;/a&gt; before deploying the service.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HINT&lt;/strong&gt;: To use this dynamically, as you&apos;d probably want in a production setting, you could rewrite the Lambda function to read the custom domains from a DynamoDB table, and have another Lambda function run recurrently to issue DNS checks for the CNAME entries the customers would need to create (see below).&lt;/p&gt;
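&lt;p&gt;Once the domain service stack is deployed, you can test the verifier directly against its Function URL, which receives the same &lt;code&gt;domain&lt;/code&gt; query parameter that Caddy sends (the URL below is just a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Whitelisted domain: expect HTTP 200, so Caddy will issue a certificate
curl -i &quot;https://YOUR_FUNCTION_URL_ID.lambda-url.us-east-1.on.aws/?domain=test.myexistingdomain.com&quot;

# Unknown domain: expect a non-2xx response, so certificate issuance is blocked
curl -i &quot;https://YOUR_FUNCTION_URL_ID.lambda-url.us-east-1.on.aws/?domain=not.whitelisted.tld&quot;
&lt;/code&gt;&lt;/pre&gt;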
&lt;p&gt;&lt;strong&gt;DNS / Nameserver configurations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you use an external domain provider, such as &lt;a target=&quot;_blank&quot; href=&quot;https://www.namecheap.com/support/knowledgebase/article.aspx/767/10/how-to-change-dns-for-a-domain/&quot;&gt;Namecheap&lt;/a&gt; or GoDaddy, make sure that you point the nameserver settings in your domain&apos;s configuration to those assigned to your HostedZone by AWS. You can look these up in the AWS Console or via the AWS CLI.&lt;/p&gt;
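&lt;p&gt;You can verify the delegation afterwards, e.g. with &lt;code&gt;dig&lt;/code&gt;; the result should list the Route 53 nameservers of your HostedZone:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Check which nameservers your domain currently delegates to
dig NS myexistingdomain.com +short
&lt;/code&gt;&lt;/pre&gt;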
&lt;p&gt;&lt;strong&gt;CNAME configuration for proxying&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You also need to &lt;a target=&quot;_blank&quot; href=&quot;https://aws.amazon.com/premiumsupport/knowledge-center/route-53-create-alias-records/&quot;&gt;add CNAME records&lt;/a&gt; to the domains you want to proxy for. E.g. if your proxy service domain is &lt;a target=&quot;_blank&quot; href=&quot;http://external.mygreatproxyservice.com&quot;&gt;&lt;code&gt;external.mygreatproxyservice.com&lt;/code&gt;&lt;/a&gt;, you need to add a CNAME record to your existing domain (e.g. &lt;a target=&quot;_blank&quot; href=&quot;http://test.myexistingdomain.com&quot;&gt;&lt;code&gt;test.myexistingdomain.com&lt;/code&gt;&lt;/a&gt;) that points to the proxy service domain:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;CNAME test.myexistingdomain.com external.mygreatproxyservice.com
&lt;/code&gt;&lt;/pre&gt;
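&lt;p&gt;Once the record has propagated, a quick check with &lt;code&gt;dig&lt;/code&gt; should resolve the custom domain to your proxy service domain:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Should return external.mygreatproxyservice.com.
dig CNAME test.myexistingdomain.com +short
&lt;/code&gt;&lt;/pre&gt;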
&lt;p&gt;&lt;strong&gt;Passing options during deployment&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When running &lt;code&gt;sls deploy&lt;/code&gt; for each stack, you can specify the following options to customize the deployments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;--stage&lt;/code&gt;: This will configure the so-called stage, which is part of the stack name (default: &lt;code&gt;prd&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;--region&lt;/code&gt;: This will configure the AWS region where the stack is deployed (default: &lt;code&gt;us-east-1&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
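&lt;p&gt;For example, deploying the proxy server stack to a non-default stage and region could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;lang-bash&quot;&gt;# Deploy the proxy server stack to the &quot;dev&quot; stage in eu-west-1
cd proxy-server-stack
sls deploy --stage dev --region eu-west-1
&lt;/code&gt;&lt;/pre&gt;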
&lt;h3 id=&quot;heading-deployment&quot;&gt;Deployment&lt;/h3&gt;
&lt;p&gt;You need to follow a specific deployment order to be able to run the overall service:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Domain whitelisting service: &lt;code&gt;cd domain-service-stack &amp;amp;&amp;amp; sls deploy &amp;amp;&amp;amp; cd ..&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Proxy server(s): &lt;code&gt;cd proxy-server-stack &amp;amp;&amp;amp; sls deploy &amp;amp;&amp;amp; cd ..&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Global Accelerator &amp;amp; HostedZone / DNS : &lt;code&gt;cd accelerator-stack &amp;amp;&amp;amp; sls deploy &amp;amp;&amp;amp; cd ..&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;heading-removal&quot;&gt;Removal&lt;/h3&gt;
&lt;p&gt;To remove the individual stacks, you can run &lt;code&gt;sls remove&lt;/code&gt; in the individual subfolders.&lt;/p&gt;
&lt;h1 id=&quot;heading-wrapping-up&quot;&gt;Wrapping up&lt;/h1&gt;
&lt;p&gt;We were able to build a POC of a (potentially) globally distributed reverse proxy service with on-demand TLS support. We decided against Fargate and in favor of EC2 for cost reasons, thereby prioritizing costs over running as Serverless as possible. In another setting or environment, or with different experience, you might come to a different conclusion, which is completely fine.&lt;/p&gt;
&lt;p&gt;For a more production-like setup, you&apos;d probably need to amend the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/tobilg/global-reverse-proxy/blob/main/domain-service-stack/src/domainVerifier.js&quot;&gt;Domain Verifier Lambda function&lt;/a&gt;, so that it dynamically looks up the custom domains that are configured e.g. by your customers via a UI, and stored in another DynamoDB table via another Lambda function. Deleting or updating those custom domains should probably be possible, too.&lt;/p&gt;
&lt;p&gt;Furthermore, you should write an additional Lambda function that recurrently checks, for each stored custom domain, whether:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The CNAME record points to your &lt;code&gt;external.$YOUR_DOMAIN_NAME.tld&lt;/code&gt;, and updates the status accordingly&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An actual redirect from the custom domain to your domain is possible via HTTPS&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><category>AWS</category><category>Reverse Proxy</category><category>Caddy</category><category>serverless framework</category><category>cloudformation</category><author>Tobias Müller</author></item></channel></rss>