<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[tobilg.com]]></title><description><![CDATA[Blog posts about AWS, Serverless, data engineering, databases and everything in between.]]></description><link>https://tobilg.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 23 Feb 2024 07:55:06 GMT</lastBuildDate><atom:link href="https://tobilg.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><atom:link rel="next" href="https://tobilg.com/rss.xml?page=1"/><item><title><![CDATA[Using DuckDB-WASM for in-browser Data Engineering]]></title><description><![CDATA[<h1 id="heading-introduction">Introduction</h1><p><a target="_blank" href="https://duckdb.org/">DuckDB</a>, the in-process DBMS specialized in OLAP workloads, has seen very rapid growth during the last year, both in functionality and in popularity amongst its users, as well as with developers who contribute many projects to the <a target="_blank" href="https://github.com/davidgasquez/awesome-duckdb">Open Source DuckDB ecosystem</a>.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706462929992/c9f633a0-d1c7-420a-b019-5aff0a8ee1de.png" alt class="image--center mx-auto" /></p><p>DuckDB can not only be run on a variety of Operating Systems and Architectures; there's also a <a target="_blank" href="https://github.com/duckdb/duckdb-wasm">DuckDB-WASM version</a> that allows running DuckDB in a browser. 
This opens up some very interesting use cases, and has also gained a lot of traction over the last 12 months.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706463261970/895fcefd-ff43-445a-b5fe-cc78d085125c.png" alt class="image--center mx-auto" /></p><h1 id="heading-use-case-building-a-sql-workbench-with-duckdb-wasm">Use Case: Building a SQL Workbench with DuckDB-WASM</h1><p>One of the first things that came to my mind once I learned about the existence of DuckDB-WASM was that it could be used to create an online SQL Workbench, where people could interactively run queries, view their results, and also visualize them. DuckDB-WASM sits at its core, providing the storage layer, query engine <a target="_blank" href="https://duckdb.org/why_duckdb.html#standing-on-the-shoulders-of-giants">and many things more</a>...</p><p>You can find the project at</p><p><a target="_blank" href="https://sql-workbench.com"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708297858847/6857669b-1992-4279-93ef-9ceeb560c594.png" alt class="image--center mx-auto" /></a></p><p>It's built with the following core technologies / frameworks:</p><ul><li><p><a target="_blank" href="https://react.dev/">React</a></p></li><li><p><a target="_blank" href="https://vitejs.dev/">Vite</a></p></li><li><p><a target="_blank" href="https://github.com/duckdb/duckdb-wasm">DuckDB-WASM</a></p></li><li><p><a target="_blank" href="https://perspective.finos.org/">Perspective.js</a></p></li></ul><p>It's hosted as a static website export / single page application on AWS using</p><ul><li><p><a target="_blank" href="https://aws.amazon.com/cloudfront/">CloudFront</a> as CDN</p></li><li><p><a target="_blank" href="https://aws.amazon.com/s3/">S3</a> as file hosting service</p></li><li><p><a target="_blank" href="https://aws.amazon.com/certificate-manager/">ACM</a> for managing certificates</p></li><li><p><a target="_blank" href="https://aws.amazon.com/route53/">Route53</a> for 
DNS</p></li></ul><p>If you're interested in the hosting setup, you can have a look at <a target="_blank" href="https://github.com/tobilg/serverless-aws-static-websites">https://github.com/tobilg/serverless-aws-static-websites</a>, which can deploy such static websites on AWS via IaC with minimal effort.</p><h1 id="heading-using-the-sql-workbench">Using the SQL Workbench</h1><p>There are many ways to use the SQL Workbench; some of them are described below.</p><h2 id="heading-overview">Overview</h2><p>When you open <a target="_blank" href="https://sql-workbench.com">sql-workbench.com</a> for the first time, you can see that the workbench is divided into three different areas:</p><ul><li><p>On the left, there's the <strong>"Local Tables" area</strong>, which will display the created tables if you ran queries such as <code>CREATE TABLE names (name VARCHAR)</code>, or used the drag-and-drop area in the lower left corner to drop any CSV, Parquet or Arrow file on it (see details below).</p></li><li><p>The upper main <strong>editor area</strong> is the SQL editor, where you can type your SQL queries. 
You're already presented with some example queries for different types of data once the page is loaded.</p></li><li><p>The lower main <strong>result area</strong> is where the results of executed queries will be shown, or alternatively, the visualizations of these results.</p></li></ul><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706465207010/2bf496dd-c6e6-4440-ae92-5c1a11ebf28f.png" alt class="image--center mx-auto" /></p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>You can adjust the respective heights of the main areas by dragging the lever in the middle.</strong></div></div><h2 id="heading-running-sql-queries">Running SQL queries</h2><p>To run your first query, select the first line of SQL, either with your keyboard or with your mouse, and press the key combination <code>CMD + Enter</code> if you're on a Mac, or <code>Ctrl + Enter</code> if you're on a Windows or Linux machine.</p><p>The result of the query that was executed can then be found in the lower main area as a table:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706465576400/946928fc-a9cb-4792-a5ef-f6ab6477095f.png" alt class="image--center mx-auto" /></p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>Queries will only be executed if one or more queries are selected. If multiple queries are to be executed, make sure to use a semicolon at the end of each query. Otherwise, an error will be displayed.</strong></div></div><h2 id="heading-running-multiple-queries">Running multiple queries</h2><p>You can also run multiple queries sequentially, e.g. 
to create a table, insert some records, and display the results:</p><pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> first_names (<span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>, birth_cnt <span class="hljs-built_in">integer</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Liam'</span>, <span class="hljs-number">20456</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Noah'</span>, <span class="hljs-number">18621</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Oliver'</span>, <span class="hljs-number">15076</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'James'</span>, <span class="hljs-number">12028</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Elijah'</span>, <span class="hljs-number">11979</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'William'</span>, <span class="hljs-number">11282</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Henry'</span>, <span class="hljs-number">11221</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Lucas'</span>, <span class="hljs-number">10909</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Benjamin'</span>, <span class="hljs-number">10842</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> first_names (<span class="hljs-keyword">name</span>, birth_cnt) <span class="hljs-keyword">VALUES</span> (<span class="hljs-string">'Theodore'</span>, <span class="hljs-number">10754</span>);
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> first_names;</code></pre><p>When you copy &amp; paste the above SQL statements, select them, and run them, the result looks like this:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706466464268/106e3256-a0d1-4b1c-90fc-6745fd2fab61.png" alt class="image--center mx-auto" /></p><p>On the left-hand side, you can see the newly created table <code>first_names</code>, which can be reused for other queries without having to reload the data again.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>Only the result of the last-run query will be displayed in the lower main result area!</strong></div></div><p>If you want to open a new SQL Workbench and directly run the above query, please click on the image below:</p><p><a target="_blank" 
href="https://sql-workbench.com/#queries=v0&amp;config=%7B%22plugin%22%3A%22Datagrid%22%2C%22plugin_config%22%3A%7B%22columns%22%3A%7B%7D%2C%22editable%22%3Afalse%2C%22scroll_lock%22%3Afalse%7D%2C%22title%22%3A%22Export%22%2C%22group_by%22%3A%5B%5D%2C%22split_by%22%3A%5B%5D%2C%22columns%22%3A%5B%22name%22%2C%22birth_cnt%22%5D%2C%22filter%22%3A%5B%5D%2C%22sort%22%3A%5B%5D%2C%22expressions%22%3A%5B%5D%2C%22aggregates%22%3A%7B%7D%7D,CREATE-TABLE-first_names-(name-VARCHAR%2C-birth_cnt-integer)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Liam'%2C-20456)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Noah'%2C-18621)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Oliver'%2C-15076)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('James'%2C-12028)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Elijah'%2C-11979)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('William'%2C-11282)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Henry'%2C-11221)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Lucas'%2C-10909)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Benjamin'%2C-10842)~,INSERT-INTO-first_names-(name%2C-birth_cnt)-VALUES-('Theodore'%2C-10754)~,SELECT-*-FROM-first_names~"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png" alt class="image--center mx-auto" /></a></p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>The data is persisted until you reload the overall SQL Workbench page.</strong></div></div><h2 id="heading-querying-data-you-have-on-your-machine">Querying data you have on your machine</h2><p>To try this, you can for example download a list of AWS Services as a CSV from</p><p><a target="_blank" 
href="https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/csv/aws_services.csv">https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/csv/aws_services.csv</a></p><p>This file has four columns: <code>service_id</code>, <code>name</code>, <code>prefix</code> and <code>reference_url</code>. Once you've downloaded the file, you can simply drag-and-drop it from its download folder onto the area in the lower left corner of the SQL Workbench:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706467248994/276eb9de-580f-433f-82d5-a48d2ab50c66.png" alt class="image--center mx-auto" /></p><p>A table called <code>aws_services.csv</code> has now been automatically created, which you can query via SQL, for example:</p><pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">name</span>, prefix <span class="hljs-keyword">from</span> <span class="hljs-string">'aws_services.csv'</span>;</code></pre><p>If you want to open a new SQL Workbench and directly run the above query, please click on the image below:</p><p><a target="_blank" href="https://sql-workbench.com/#queries=v0&amp;config=%7B%7D,CREATE-TABLE-'aws_services.csv'-AS-(SELECT-*-FROM-'https%3A%2F%2Fraw.githubusercontent.com%2Ftobilg%2Faws%20iam%20data%2Fmain%2Fdata%2Fcsv%2Faws_services.csv')~,SELECT-name%2C-prefix-from-'aws_services.csv'~"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png" alt class="image--center mx-auto" /></a></p><h2 id="heading-querying-and-visualizing-remote-data">Querying and visualizing remote data</h2><p>DuckDB-WASM supports the loading of compatible data in different formats (e.g. CSV, Parquet or Arrow) <strong>from remote http(s) sources</strong>. 
Other data formats that can be used include JSON, but this requires the loading of so-called <a target="_blank" href="https://duckdb.org/docs/extensions/overview">DuckDB extensions</a>.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>The websites hosting the data must send the relevant CORS headers; otherwise, the browser (not DuckDB-WASM or the SQL Workbench) will forbid the loading of the files and show an error message instead.</strong></div></div><p>In this example, we will use data about AWS CloudFront Edge Locations, which is available at <a target="_blank" href="https://github.com/tobilg/aws-edge-locations/tree/main/data">tobilg/aws-edge-locations</a>, with this query:</p><pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> <span class="hljs-string">'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet'</span>;</code></pre><p>The result will look like this:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706468112264/decfe33c-b961-4cd0-a1f3-6956eee511b7.png" alt class="image--center mx-auto" /></p><p>We now want to create a bar chart of the data, showing the number of Edge Locations by country and city. 
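</p><p>As a side note, the same count by country and city can also be computed directly in SQL before visualizing; here's a quick sketch (it assumes the <code>country</code> and <code>city</code> columns shown in the result above):</p><pre><code class="lang-sql">-- Sketch: count Edge Locations per country and city, largest first
SELECT
    country,
    city,
    count(*) AS edge_location_count
FROM 'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet'
GROUP BY country, city
ORDER BY edge_location_count DESC;</code></pre><p>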
This can be done by <strong>hovering over the result table</strong>, and clicking on the small <strong>"configure" button</strong> that <strong>looks like a wrench</strong>, which subsequently appears in the <strong>upper right corner of the table</strong>:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706468950491/f63e77f0-53c6-4111-b1a4-dea216ec402f.png" alt class="image--center mx-auto" /></p><p>You then see the overview of the available columns, and the current visualization type (Datagrid):</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469067571/9825cca0-1819-4788-8562-41f86a82058f.png" alt class="image--center mx-auto" /></p><p>To get an overview of the possible visualization types, click on the Datagrid icon:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469390505/c202d902-9089-4724-a344-ad47e51743ea.png" alt class="image--center mx-auto" /></p><p>Then select "Y Bar". This will give you an initial bar chart:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469436147/c54ed46b-4227-4a3d-b5c4-885f02e16b88.png" alt class="image--center mx-auto" /></p><p>But as we want to display the count of Edge Locations by country and city, we need to drag-and-drop the columns <code>country</code> and <code>city</code> to the "Group By" area:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469647166/d8b9e733-5dfb-43d1-b792-dfd50bb49795.png" alt class="image--center mx-auto" /></p><p>We can now close the configuration menu to see the chart in its full size:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469733791/c197240d-9546-4c3e-889b-6c3b39fc1525.png" alt class="image--center mx-auto" /></p><p>There are many other visualization types you can choose from, such as Treemaps and Sunbursts:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469922003/55e21f77-e12f-4488-a5e2-bb1039a7a2a3.png" alt 
class="image--center mx-auto" /></p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706469943029/9018027a-3939-4b5b-ac50-99b6eea60e4d.png" alt class="image--center mx-auto" /></p><h2 id="heading-exporting-visualizations-and-data">Exporting visualizations and data</h2><p>You can also export the visualizations, as well as the data. Just click on "Export", type in a "Save as" name, and select the output format you want to download:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706471941733/96e72d1a-841b-4ec8-8e00-b4c5e008aa25.png" alt class="image--center mx-auto" /></p><p>The data can be downloaded as a CSV, JSON or Arrow file. Here's the CSV example:</p><pre><code class="lang-csv">"country (Group by 1)","count"
,517
"Argentina",3
"Australia",10
"Austria",3
"Bahrain",2
"Belgium",1
"Brazil",21
"Bulgaria",3
"Canada",8
"Chile",6
"China",8
"Colombia",3
"Croatia",1
"Czech Republic",1
"Denmark",3
"Finland",4
"France",17
"Germany",37
"Greece",1
"Hungary",1
"India",48
"Indonesia",5
"Ireland",2
"Israel",2
"Italy",16
"Japan",27
"Kenya",1
"Korea",8
"Malaysia",2
"Mexico",4
"Netherlands",5
"New Zealand",2
"Nigeria",1
"Norway",2
"Oman",1
"Peru",2
"Philippines",2
"Poland",5
"Portugal",1
"Romania",1
"Singapore",7
"South Africa",2
"Spain",12
"Sweden",4
"Switzerland",2
"Taiwan",3
"Thailand",2
"UAE",4
"UK",30
"United States",179
"Vietnam",2</code></pre><p>And here's the exported PNG:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706472092565/29c50315-1da7-41b7-ac28-a118967f9022.png" alt class="image--center mx-auto" /></p><p>Exporting the HTML version will give you an interactive graph with hovering etc. 
Furthermore, you can also change the theme for the different visualizations:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706472743243/ec7b9d1f-3057-4555-b6ce-f020b46543ab.png" alt class="image--center mx-auto" /></p><p>This is also reflected in the exported graphs:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706472777829/0f3113ef-4870-4cd7-a85a-f3e59a02ba8f.png" alt class="image--center mx-auto" /></p><h2 id="heading-using-the-schema-browser">Using the schema browser</h2><p>The schema browser can be found on the left-hand side. It's automatically updated after each executed query, so that all schema operations can be captured. On table-level, the columns, constraints and indexes are shown:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707832278650/3b881e67-8837-46b7-90f3-f22cb43f8846.png" alt class="image--center mx-auto" /></p><p>If you right-click on a table name, a context menu is shown that has different options:</p><ul><li><p>Generating scripts based on the table definition</p></li><li><p>Truncating, deleting or summarizing the table</p></li><li><p>Viewing table data (all records, first 10 and first 100)</p></li></ul><p>Once clicked, those menu items will create a new tab (see below), and generate and execute the appropriate SQL statements:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707834354631/1f1f95ec-9451-4bd5-a517-da2813ccfb13.png" alt class="image--center mx-auto" /></p><h2 id="heading-using-query-tabs">Using query tabs</h2><p>Another new feature is the possibility to have multiple query tabs. 
Those are either created automatically by context menu actions, or created by the user clicking on the plus icon:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707834585353/bfe88242-7ca4-4823-9283-1e12fcd8e280.png" alt class="image--center mx-auto" /></p><p>Each tab can be closed by clicking on the "x" icon next to the tab name.</p><h2 id="heading-generating-data-models">Generating data models</h2><p>If users have created some tables, it's then possible to create a data model from the schema metadata. If the tables also have foreign key relationships, those are also shown in the diagram. Just click on the "Data Model" menu entry in the lower left corner.</p><p>Under the hood, this feature generates <a target="_blank" href="https://mermaid.js.org/syntax/entityRelationshipDiagram.html">Mermaid Entity Relationship Diagram</a> code, which is dynamically rendered as a graph.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707834843187/2365a08a-f76c-4aeb-bc75-ae1aa5584a2f.png" alt class="image--center mx-auto" /></p><h2 id="heading-using-the-query-history">Using the query history</h2><p>Each query that is issued in the current version of the SQL Workbench is recorded in the so-called query history. It can be accessed by clicking on the "Query History" menu entry in the lower left corner. Once clicked, there's an overlay on the right-hand side with the list of the issued queries.</p><p>The newest queries can be found at the top of the list, and with each query listed, there's also an indication of when the query was run, and how long it took to execute.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707835247505/42f710ad-3eee-41b6-91ca-8eafa8166bc3.png" alt class="image--center mx-auto" /></p><p>With the trash icon in the top-right corner, the complete query history can be truncated. 
Single query history entries can also be deleted, and specific queries can be re-run in a new tab by clicking "Replay query" in the menu that's present for each query history entry.</p><h1 id="heading-example-data-engineering-pipeline">Example Data Engineering pipeline</h1><h2 id="heading-dataset-amp-goals">Dataset &amp; Goals</h2><p>A well-known dataset is the NYC TLC Trip Record dataset. It is freely available on the <a target="_blank" href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">NYC Taxi and Limousine Commission website</a>. It also comes with some explanations and additional lookup data. In this example, we focus on the yellow taxi data.</p><p>The goal of this example pipeline is to create a clean Data Mart from the given trip records and location data, which is able to support some basic analysis of the data via OLAP patterns.</p><h2 id="heading-source-data-analysis">Source Data analysis</h2><p>On the NYC TLC website, there's a <a target="_blank" href="https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf">PDF file</a> explaining the structure and contents of the data. The table structure can be found below; the highlighted columns indicate dimensional values, for which we'll build dimension tables in later steps.</p><div class="hn-table"><table><thead><tr><td>Column name</td><td>Description</td></tr></thead><tbody><tr><td><mark>VendorID</mark></td><td>A code indicating the TPEP provider that provided the record.</td></tr><tr><td>tpep_pickup_datetime</td><td>The date and time when the meter was engaged.</td></tr><tr><td>tpep_dropoff_datetime</td><td>The date and time when the meter was disengaged.</td></tr><tr><td>Passenger_count</td><td>The number of passengers in the vehicle. 
This is a driver-entered value.</td></tr><tr><td>Trip_distance</td><td>The elapsed trip distance in miles reported by the taximeter.</td></tr><tr><td><mark>PULocationID</mark></td><td>TLC Taxi Zone in which the taximeter was engaged</td></tr><tr><td><mark>DOLocationID</mark></td><td>TLC Taxi Zone in which the taximeter was disengaged</td></tr><tr><td><mark>RateCodeID</mark></td><td>The final rate code in effect at the end of the trip.</td></tr><tr><td><mark>Store_and_fwd_flag</mark></td><td>This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka store and forward, because the vehicle did not have a connection to the server.</td></tr><tr><td><mark>Payment_type</mark></td><td>A numeric code signifying how the passenger paid for the trip.</td></tr><tr><td>Fare_amount</td><td>The time-and-distance fare calculated by the meter.</td></tr><tr><td>Extra</td><td>Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.</td></tr><tr><td>MTA_tax</td><td>$0.50 MTA tax that is automatically triggered based on the metered rate in use.</td></tr><tr><td>Improvement_surcharge</td><td>$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.</td></tr><tr><td>Tip_amount</td><td>This field is automatically populated for credit card tips. Cash tips are not included.</td></tr><tr><td>Tolls_amount</td><td>Total amount of all tolls paid in the trip.</td></tr><tr><td>Total_amount</td><td>The total amount charged to passengers. Does not include cash tips.</td></tr><tr><td>Congestion_Surcharge</td><td>Total amount collected in the trip for the NYS congestion surcharge.</td></tr><tr><td>Airport_fee</td><td>$1.25 for pick-up only at LaGuardia and John F. 
Kennedy Airports</td></tr></tbody></table></div><p>There's an additional <a target="_blank" href="https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv">CSV file</a> for the so-called Taxi Zones, as well as a <a target="_blank" href="https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip">SHX shapefile</a> containing the same info, but with additional geo information. The structure is the following:</p><div class="hn-table"><table><thead><tr><td>Column name</td><td>Description</td></tr></thead><tbody><tr><td>LocationID</td><td>TLC Taxi Zone, corresponding to the PULocationID and DOLocationID columns in the trip dataset</td></tr><tr><td>Borough</td><td>The name of the NYC borough this Taxi Zone is in</td></tr><tr><td>Zone</td><td>The name of the Taxi Zone</td></tr><tr><td>service_zone</td><td>Can either be "Yellow Zone" or "Boro Zone"</td></tr></tbody></table></div><h2 id="heading-target-data-model">Target Data Model</h2><p>The target data model is derived from the original trip record data, with extracted dimension tables plus a new date hierarchy dimension. 
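</p><p>As a minimal sketch, such a date hierarchy dimension can be generated directly in DuckDB via <code>generate_series</code> (the column set and date range shown here are illustrative assumptions, not the pipeline's actual definition):</p><pre><code class="lang-sql">-- Hedged sketch: build a simple date hierarchy dimension.
-- Column set and date range are assumptions for illustration.
CREATE TABLE dim_date AS
SELECT
    CAST(strftime(d, '%Y%m%d') AS INTEGER) AS date_id, -- surrogate key, e.g. 20240101
    d::DATE AS full_date,
    year(d) AS year,
    quarter(d) AS quarter,
    month(d) AS month,
    day(d) AS day
FROM generate_series(DATE '2009-01-01', DATE '2024-12-31', INTERVAL 1 DAY) t(d);</code></pre><p>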
Also, the naming schema gets unified and cleaned up.</p><p>It is modeled as a so-called <a target="_blank" href="https://en.wikipedia.org/wiki/Snowflake_schema">Snowflake Schema</a> (check the Mermaid source):</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706571275733/932ced37-0a11-4eb1-a08c-cf123529ec8c.png" alt class="image--center mx-auto" /></p><h2 id="heading-loading-amp-transforming-the-data">Loading &amp; Transforming the data</h2><p>The loading &amp; transforming of the data is divided into multiple steps:</p><ul><li><p>Generating the dimensional tables and values from the given dataset information</p></li><li><p>Generating a date hierarchy dimension</p></li><li><p>Loading and transforming the trip data</p><ul><li><p>Replacing values with dimension references</p></li><li><p>Cleaning the column naming</p></li><li><p>Unifying values</p></li></ul></li></ul><h3 id="heading-generating-the-dimensional-tables"><strong>Generating the dimensional tables</strong></h3><p>We use the given dataset information from the PDF file to manually create dimension tables and their values:</p><pre><code class="lang-sql"><span class="hljs-comment">-- Install and load the spatial extension</span><span class="hljs-keyword">INSTALL</span> spatial;<span class="hljs-keyword">LOAD</span> spatial;<span class="hljs-comment">-- Create temporary table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> tmp_service_zones <span class="hljs-keyword">AS</span> (    <span class="hljs-keyword">SELECT</span>         <span class="hljs-keyword">DISTINCT</span> service_zone     <span class="hljs-keyword">FROM</span>         <span class="hljs-string">'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv'</span>     <span class="hljs-keyword">WHERE</span>         service_zone != <span class="hljs-string">'N/A'</span>     <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>         service_zone); <span 
class="hljs-comment">-- Create dim_zone_type table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_zone_type (    zone_type_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>);<span class="hljs-comment">-- Insert dim_zone_type table</span><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_zone_type<span class="hljs-keyword">SELECT</span>     <span class="hljs-number">-1</span> <span class="hljs-keyword">as</span> zone_type_id,     <span class="hljs-string">'N/A'</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">name</span> <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span><span class="hljs-keyword">SELECT</span>     (<span class="hljs-keyword">rowid</span> + <span class="hljs-number">1</span>)::<span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">as</span> zone_type_id,     service_zone <span class="hljs-keyword">as</span> <span class="hljs-keyword">name</span> <span class="hljs-keyword">FROM</span>     tmp_service_zones ; <span class="hljs-comment">-- Drop table tmp_service_zones</span><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> tmp_service_zones; <span class="hljs-comment">-- Create temporary table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> tmp_borough <span class="hljs-keyword">AS</span> (    <span class="hljs-keyword">SELECT</span>         <span class="hljs-keyword">DISTINCT</span> borough     <span class="hljs-keyword">FROM</span>         <span class="hljs-string">'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv'</span>     <span class="hljs-keyword">WHERE</span>        borough != <span class="hljs-string">'Unknown'</span>    <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>       
  borough); <span class="hljs-comment">-- Create dim_borough table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_borough (    borough_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>);<span class="hljs-comment">-- Insert dim_borough table</span><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_borough<span class="hljs-keyword">SELECT</span>     <span class="hljs-number">-1</span> <span class="hljs-keyword">as</span> borough_id,     <span class="hljs-string">'N/A'</span> <span class="hljs-keyword">as</span> <span class="hljs-keyword">name</span> <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span><span class="hljs-keyword">SELECT</span>     (<span class="hljs-keyword">rowid</span> + <span class="hljs-number">1</span>)::<span class="hljs-built_in">INTEGER</span> <span class="hljs-keyword">as</span> borough_id,     borough <span class="hljs-keyword">as</span> <span class="hljs-keyword">name</span> <span class="hljs-keyword">FROM</span>     tmp_borough ; <span class="hljs-comment">-- Drop temporary table</span><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> tmp_borough;<span class="hljs-comment">-- Create dim_zone table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_zone (    zone_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    zone_type_id <span class="hljs-built_in">INTEGER</span>,    borough_id <span class="hljs-built_in">INTEGER</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>,    geojson <span class="hljs-built_in">VARCHAR</span>,    <span class="hljs-keyword">FOREIGN</span> <span class="hljs-keyword">KEY</span> (zone_type_id) <span class="hljs-keyword">REFERENCES</span> 
dim_zone_type (zone_type_id),    <span class="hljs-keyword">FOREIGN</span> <span class="hljs-keyword">KEY</span> (borough_id) <span class="hljs-keyword">REFERENCES</span> dim_borough (borough_id));<span class="hljs-comment">-- Insert dim_zone table</span><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_zone <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span>    <span class="hljs-keyword">CASE</span>        <span class="hljs-keyword">WHEN</span> csv.LocationID <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">THEN</span> csv.LocationID::<span class="hljs-built_in">INT</span>        <span class="hljs-keyword">ELSE</span> raw.LocationID    <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> zone_id,     zt.zone_type_id,     <span class="hljs-keyword">CASE</span>        <span class="hljs-keyword">WHEN</span> b.borough_id <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">THEN</span> b.borough_id        <span class="hljs-keyword">ELSE</span> <span class="hljs-number">-1</span>    <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> borough_id,    <span class="hljs-keyword">CASE</span>        <span class="hljs-keyword">WHEN</span> csv.Zone <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span> <span class="hljs-keyword">THEN</span> csv.Zone        <span class="hljs-keyword">ELSE</span> raw.zone    <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">name</span>,     raw.geojson <span class="hljs-keyword">FROM</span>     (        <span class="hljs-keyword">SELECT</span>             LocationID,             borough,            zone,              geojson         <span 
class="hljs-keyword">FROM</span>             (                <span class="hljs-keyword">SELECT</span>                     LocationID,                     borough,                     zone,                     <span class="hljs-keyword">rank</span>() <span class="hljs-keyword">OVER</span> (<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> LocationID <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> Shape_Leng) <span class="hljs-keyword">AS</span> ranked,                     ST_AsGeoJSON(ST_Transform(geom, <span class="hljs-string">'ESRI:102718'</span>, <span class="hljs-string">'EPSG:4326'</span>)) <span class="hljs-keyword">AS</span> geojson                 <span class="hljs-keyword">FROM</span> ST_Read(<span class="hljs-string">'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.shx'</span>)             ) sub        <span class="hljs-keyword">WHERE</span>             sub.ranked = <span class="hljs-number">1</span>     ) <span class="hljs-keyword">raw</span> <span class="hljs-keyword">FULL</span> <span class="hljs-keyword">OUTER</span> <span class="hljs-keyword">JOIN</span>     (        <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span>             LocationID,            Zone,             service_zone         <span class="hljs-keyword">FROM</span>             <span class="hljs-string">'https://data.quacking.cloud/nyc-taxi-metadata/taxi_zones.csv'</span>     ) csv <span class="hljs-keyword">ON</span>     csv.LocationId = raw.LocationId <span class="hljs-keyword">FULL</span> <span class="hljs-keyword">OUTER</span> <span class="hljs-keyword">JOIN</span>     dim_zone_type zt <span class="hljs-keyword">ON</span>     csv.service_zone = zt.name <span class="hljs-keyword">FULL</span> <span class="hljs-keyword">OUTER</span> <span class="hljs-keyword">JOIN</span>     dim_borough b <span class="hljs-keyword">ON</span>     b.name = raw.borough<span 
class="hljs-keyword">WHERE</span>    zone_id <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span><span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>    zone_id;<span class="hljs-comment">-- Create dim_rate_code table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_rate_code (    rate_code_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'Standard rate'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'JFK'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">3</span>, <span class="hljs-string">'Newark'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">4</span>, <span class="hljs-string">'Nassau or Westchester'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">5</span>, <span class="hljs-string">'Negotiated fare'</span>);<span 
class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">6</span>, <span class="hljs-string">'Group ride'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_rate_code (rate_code_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">99</span>, <span class="hljs-string">'N/A'</span>);<span class="hljs-comment">-- Create dim_payment_type table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_payment_type (    payment_type_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_payment_type (payment_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'Credit card'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_payment_type (payment_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'Cash'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_payment_type (payment_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">3</span>, <span class="hljs-string">'No charge'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_payment_type (payment_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">4</span>, <span 
class="hljs-string">'Dispute'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_payment_type (payment_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">5</span>, <span class="hljs-string">'Unknown'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_payment_type (payment_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">6</span>, <span class="hljs-string">'Voided trip'</span>);<span class="hljs-comment">-- Create dim_vendor table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_vendor (    vendor_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_vendor (vendor_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'Creative Mobile Technologies'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_vendor (vendor_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'VeriFone Inc.'</span>);<span class="hljs-comment">-- Create dim_stored_type table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_stored_type (    stored_type_id <span class="hljs-built_in">INTEGER</span> PRIMARY <span class="hljs-keyword">KEY</span>,    <span class="hljs-keyword">name</span> <span class="hljs-built_in">VARCHAR</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_stored_type (stored_type_id, <span 
class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'Store and forward trip'</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_stored_type (stored_type_id, <span class="hljs-keyword">name</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'Not a store and forward trip'</span>);</code></pre><h3 id="heading-generating-a-date-hierarchy-dimension"><strong>Generating a date hierarchy dimension</strong></h3><pre><code class="lang-sql"><span class="hljs-comment">-- Create dim_date table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> dim_date (    day_dt <span class="hljs-built_in">DATE</span> PRIMARY <span class="hljs-keyword">KEY</span>,    day_name <span class="hljs-built_in">VARCHAR</span>,    day_of_week <span class="hljs-built_in">INT</span>,    day_of_month <span class="hljs-built_in">INT</span>,    day_of_year <span class="hljs-built_in">INT</span>,    week_of_year <span class="hljs-built_in">INT</span>,    month_of_year <span class="hljs-built_in">INT</span>,    month_name <span class="hljs-built_in">VARCHAR</span>,    <span class="hljs-keyword">year</span> <span class="hljs-built_in">INT</span>);<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> dim_date<span class="hljs-keyword">SELECT</span>     date_key <span class="hljs-keyword">AS</span> day_dt,    <span class="hljs-keyword">DAYNAME</span>(date_key)::<span class="hljs-built_in">VARCHAR</span> <span class="hljs-keyword">AS</span> day_name,    ISODOW(date_key)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> day_of_week,    <span class="hljs-keyword">DAYOFMONTH</span>(date_key)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> day_of_month,    <span class="hljs-keyword">DAYOFYEAR</span>(date_key)::<span 
class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> day_of_year,     <span class="hljs-keyword">WEEKOFYEAR</span>(date_key)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> week_of_year,    <span class="hljs-keyword">MONTH</span>(date_key)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> month_of_year,    MONTHNAME(date_key)::<span class="hljs-built_in">VARCHAR</span> <span class="hljs-keyword">AS</span> month_name,    <span class="hljs-keyword">YEAR</span>(date_key)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">year</span><span class="hljs-keyword">FROM</span>     (        <span class="hljs-keyword">SELECT</span>             <span class="hljs-keyword">CAST</span>(<span class="hljs-keyword">RANGE</span> <span class="hljs-keyword">AS</span> <span class="hljs-built_in">DATE</span>) <span class="hljs-keyword">AS</span> date_key         <span class="hljs-keyword">FROM</span>             <span class="hljs-keyword">RANGE</span>(<span class="hljs-built_in">DATE</span> <span class="hljs-string">'2005-01-01'</span>, <span class="hljs-built_in">DATE</span> <span class="hljs-string">'2030-12-31'</span>, <span class="hljs-built_in">INTERVAL</span> <span class="hljs-number">1</span> <span class="hljs-keyword">DAY</span>)    ) generate_date;</code></pre><h3 id="heading-loading-and-transforming-the-trip-data"><strong>Loading and transforming the trip data</strong></h3><pre><code class="lang-sql"><span class="hljs-comment">-- Create sequence for generating trip_ids</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">SEQUENCE</span> trip_id_sequence <span class="hljs-keyword">START</span> <span class="hljs-number">1</span>;<span class="hljs-comment">-- Create fact_trip table</span><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> fact_trip (    trip_id <span class="hljs-built_in">INTEGER</span>  
<span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">nextval</span>(<span class="hljs-string">'trip_id_sequence'</span>) PRIMARY <span class="hljs-keyword">KEY</span>,    pickup_zone_id <span class="hljs-built_in">INTEGER</span>,    pickup_dt <span class="hljs-built_in">DATE</span>,    pickup_ts <span class="hljs-built_in">TIMESTAMP</span>,    dropoff_zone_id <span class="hljs-built_in">INTEGER</span>,    dropoff_dt <span class="hljs-built_in">DATE</span>,    dropoff_ts <span class="hljs-built_in">TIMESTAMP</span>,    rate_code_id <span class="hljs-built_in">INTEGER</span>,     stored_type_id <span class="hljs-built_in">INTEGER</span>,     payment_type_id <span class="hljs-built_in">INTEGER</span>,     vendor_id <span class="hljs-built_in">INTEGER</span>,     passenger_count <span class="hljs-keyword">DOUBLE</span>,    trip_distance_miles <span class="hljs-keyword">DOUBLE</span>,    fare_amount <span class="hljs-keyword">DOUBLE</span>,    extra_amount <span class="hljs-keyword">DOUBLE</span>,    mta_tax_amount <span class="hljs-keyword">DOUBLE</span>,    improvement_surcharge_amount <span class="hljs-keyword">DOUBLE</span>,    tip_amount <span class="hljs-keyword">DOUBLE</span>,    tolls_amount <span class="hljs-keyword">DOUBLE</span>,    congestion_surcharge_amount <span class="hljs-keyword">DOUBLE</span>,    airport_fee_amount <span class="hljs-keyword">DOUBLE</span>,    total_amount <span class="hljs-keyword">DOUBLE</span>);<span class="hljs-comment">-- Deactivating FK relationships for now, due to performance issues when inserting 3 million records</span><span class="hljs-comment">-- FOREIGN KEY (pickup_zone_id) REFERENCES dim_zone (zone_id),</span><span class="hljs-comment">-- FOREIGN KEY (dropoff_zone_id) REFERENCES dim_zone (zone_id),</span><span class="hljs-comment">-- FOREIGN KEY (pickup_dt) REFERENCES dim_date (day_dt),</span><span class="hljs-comment">-- FOREIGN KEY (dropoff_dt) REFERENCES dim_date (day_dt),</span><span 
class="hljs-comment">-- FOREIGN KEY (rate_code_id) REFERENCES dim_rate_code (rate_code_id),</span><span class="hljs-comment">-- FOREIGN KEY (stored_type_id) REFERENCES dim_stored_type (stored_type_id),</span><span class="hljs-comment">-- FOREIGN KEY (payment_type_id) REFERENCES dim_payment_type (payment_type_id),</span><span class="hljs-comment">-- FOREIGN KEY (vendor_id) REFERENCES dim_vendor (vendor_id)</span><span class="hljs-comment">-- Insert transformed fact data</span><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> fact_trip<span class="hljs-keyword">SELECT</span>     <span class="hljs-keyword">nextval</span>(<span class="hljs-string">'trip_id_sequence'</span>) <span class="hljs-keyword">AS</span> trip_id,    PULocationID::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">as</span> pickup_zone_id,    tpep_pickup_datetime::<span class="hljs-built_in">DATE</span> <span class="hljs-keyword">as</span> pickup_dt,    tpep_pickup_datetime <span class="hljs-keyword">AS</span> pickup_ts,    DOLocationID::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">as</span> dropoff_zone_id,    tpep_dropoff_datetime::<span class="hljs-built_in">DATE</span> <span class="hljs-keyword">as</span> dropoff_dt,    tpep_dropoff_datetime <span class="hljs-keyword">AS</span> dropoff_ts,    RatecodeID::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> rate_code_id,    <span class="hljs-keyword">CASE</span>        <span class="hljs-keyword">WHEN</span> store_and_fwd_flag = <span class="hljs-string">'Y'</span> <span class="hljs-keyword">THEN</span> <span class="hljs-number">1</span>        <span class="hljs-keyword">WHEN</span> store_and_fwd_flag = <span class="hljs-string">'N'</span> <span class="hljs-keyword">THEN</span> <span class="hljs-number">2</span>    <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> stored_type_id,    payment_type::<span 
class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> payment_type_id,    VendorID::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> vendor_id,    passenger_count,    trip_distance <span class="hljs-keyword">AS</span> trip_distance_miles,    fare_amount,    extra <span class="hljs-keyword">AS</span> extra_amount,    mta_tax <span class="hljs-keyword">AS</span> mta_tax_amount,    improvement_surcharge <span class="hljs-keyword">AS</span> improvement_surcharge_amount,    tip_amount,    tolls_amount,    congestion_surcharge <span class="hljs-keyword">AS</span> congestion_surcharge_amount,    airport_fee <span class="hljs-keyword">AS</span> airport_fee_amount,    total_amount<span class="hljs-keyword">FROM</span>     <span class="hljs-string">'https://data.quacking.cloud/nyc-taxi-data/yellow_tripdata_2023-01.parquet'</span>;</code></pre><h2 id="heading-data-analysis">Data Analysis</h2><p>The following analyses are just examples of how you could analyze the data set. Feel free to think about your own questions for the dataset, and try to build queries yourself!</p><h3 id="heading-preparation">Preparation</h3><p>DuckDB supports the <a target="_blank" href="https://duckdb.org/docs/guides/meta/summarize">SUMMARIZE</a> command, which can help you understand the final data in the fact table before querying it. It launches a query that computes a number of aggregates over all columns, including <code>min</code>, <code>max</code>, <code>avg</code>, <code>std</code> and <code>approx_unique</code>:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706635916309/a55a933c-cf39-49e2-adcd-f159a619ec65.png" alt class="image--center mx-auto" /></p><p>The output already shows some "interesting" things, such as <code>pickup_dt</code> being far in the past, e.g. 
<code>2008-12-31</code>, and the <code>dropoff_dt</code> with similar values (<code>2009-01-01</code>).</p><h3 id="heading-most-utilized-trip-locations">Most utilized trip locations</h3><p>With this analysis, we want to have a look at the 20 most frequented trips from pickup zone to dropoff zone:</p><pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>    pz.name || <span class="hljs-string">' -&gt; '</span> || dz.name <span class="hljs-keyword">AS</span> trip_description,    <span class="hljs-keyword">count</span>(<span class="hljs-keyword">DISTINCT</span> ft.trip_id)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> trip_count,    <span class="hljs-keyword">sum</span>(ft.passenger_count)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> passenger_count,    <span class="hljs-keyword">sum</span>(ft.total_amount)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> total_amount,    <span class="hljs-keyword">sum</span>(ft.trip_distance_miles)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> trip_distance_miles,    (<span class="hljs-keyword">sum</span>(trip_distance_miles)/<span class="hljs-keyword">count</span>(<span class="hljs-keyword">DISTINCT</span> ft.trip_id))::<span class="hljs-keyword">DOUBLE</span> <span class="hljs-keyword">AS</span> trip_distance_miles_avg,    (<span class="hljs-keyword">sum</span>(ft.total_amount)/<span class="hljs-keyword">count</span>(<span class="hljs-keyword">DISTINCT</span> ft.trip_id))::<span class="hljs-keyword">DOUBLE</span> <span class="hljs-keyword">AS</span> total_amount_avg,    (<span class="hljs-keyword">sum</span>(ft.passenger_count)/<span class="hljs-keyword">count</span>(<span class="hljs-keyword">DISTINCT</span> ft.trip_id))::<span class="hljs-keyword">DOUBLE</span> <span class="hljs-keyword">AS</span> passenger_count_avg<span class="hljs-keyword">FROM</span>    fact_trip ft<span 
class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span>    dim_zone pz<span class="hljs-keyword">ON</span>    pz.zone_id = ft.pickup_zone_id<span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span>    dim_zone dz<span class="hljs-keyword">ON</span>    dz.zone_id = ft.dropoff_zone_id<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span>    trip_description<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>    trip_count <span class="hljs-keyword">DESC</span><span class="hljs-keyword">LIMIT</span> <span class="hljs-number">20</span>;</code></pre><p>On an M2 Mac Mini with 16GB RAM, aggregating the 3 million trips takes around 900ms. The result looks like the following:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706652583933/219edfc9-9f2e-4e57-976f-99c1f01cceb8.png" alt class="image--center mx-auto" /></p><p>We can now also create a Y Bar chart showing the (total) trip count, passenger count and trip distance for the top 20 most frequented trips:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706652794872/03f96773-09a8-4dd4-8318-7a272f6b0ab1.png" alt class="image--center mx-auto" /></p><p>End result:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706652997260/f9a98328-d4b1-4d3a-8ffa-bf16eb11a998.png" alt class="image--center mx-auto" /></p><h3 id="heading-trip-frequency-by-weekday-and-time-of-day">Trip frequency by weekday and time of day</h3><p>To inspect the traffic patterns, we want to analyze the trip frequency by weekday and time of day (aggregated on hourly level). 
To do so, we make use of DuckDB's advanced timestamp/time handling functions:</p><pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>    dd.day_name,    dd.day_of_week,    <span class="hljs-keyword">datepart</span>(<span class="hljs-string">'hour'</span>, time_bucket(<span class="hljs-built_in">INTERVAL</span> <span class="hljs-string">'1 HOUR'</span>, ft.pickup_ts)) day_hour,    <span class="hljs-keyword">count</span>(<span class="hljs-keyword">DISTINCT</span> ft.trip_id)::<span class="hljs-built_in">INT</span> <span class="hljs-keyword">AS</span> trip_count,<span class="hljs-keyword">FROM</span>    fact_trip ft<span class="hljs-keyword">INNER</span> <span class="hljs-keyword">JOIN</span>    dim_date dd<span class="hljs-keyword">ON</span>    dd.day_dt = ft.pickup_dt<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span>    dd.day_name,    dd.day_of_week,    day_hour<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span>    dd.day_of_week,    day_hour;</code></pre><p>Then, we can configure a Y Bar chart that can show us the number of trips by weekday and hour:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706654378057/28b1a67c-6a18-4544-af3a-942840b4a18a.png" alt class="image--center mx-auto" /></p><p>End result:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706654434855/47d1005d-6466-47c2-8ce3-80c8c626c83e.png" alt class="image--center mx-auto" /></p><h1 id="heading-sharing-of-data-pipelines-amp-visualizations">Sharing of Data Pipelines &amp; Visualizations</h1><p>With the latest version of <a target="_blank" href="https://sql-workbench.com">sql-workbench.com</a> it's possible to share both queries and the (customized) visualization of the last executed query. 
To do so, you write your queries, run them to check whether they work, and then update the visualization configuration.</p><p>Once you've done that, you can click on "Share queries" in the lower left corner of the SQL Workbench. The toggle will let you choose whether you want to copy the visualization configuration as well, or not.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706620864709/aa314815-a44d-428d-99c6-449b4aede72e.png" alt class="image--center mx-auto" /></p><p>If you want to run the complete pipeline to build our dimensional data model, you can click on the link below (this can take from 10 to 120 seconds depending on your machine and internet connection speed, as approximately 50MB of data will be downloaded):</p><p><a target="_blank" href="https://dub.sh/run-data-pipeline"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706466737102/9a07966d-0e5e-4c80-b319-8a104091948d.png" alt class="image--center mx-auto" /></a></p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>The above link unfortunately has to be routed through https://dub.co as a URL shortener, because HashNode doesn't support very long URLs as links!</strong></div></div><h1 id="heading-conclusion">Conclusion</h1><p>With DuckDB-WASM and some common web frameworks, it's pretty easy and fast to create custom data applications that can handle datasets with millions of records.</p><p>Those applications are able to provide a very lightweight approach to working with different types of data (such as <a target="_blank" href="https://duckdb.org/docs/guides/import/parquet_import">Parquet</a>, <a target="_blank" href="https://duckdb.org/docs/guides/import/s3_iceberg_import">Iceberg</a>, Arrow, <a target="_blank" href="https://duckdb.org/docs/guides/import/csv_import">CSV</a>, <a target="_blank" href="https://duckdb.org/docs/extensions/json">JSON</a> or <a target="_blank" 
href="https://duckdb.org/docs/extensions/spatial.html#st_read---read-spatial-data-from-files">spatial data formats</a>), whether local or remote (via HTTP(S) or S3-compatible storage services), thanks to the versatile DuckDB engine.</p><p>Users can interactively work with the data, create data pipelines by using raw SQL, and iterate until the final desired state has been achieved. The generated data pipeline queries can easily be shared with simple links to <a target="_blank" href="https://sql-workbench.com">sql-workbench.com</a>, so that other collaborators can continue to iterate on the existing work, or even create new solutions with it.</p><p>Once a data pipeline has been finalized, it could for example be deployed to DuckDB instances running in the users' own cloud accounts. A great example would be <a target="_blank" href="https://github.com/tobilg/duckdb-nodejs-layer">running DuckDB in AWS Lambda</a>, e.g. for <a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner">repartitioning Parquet data in S3 nightly</a>, or automatically running reports based on aggregation pipelines etc.</p><p>The possibilities are nearly endless, so I'm very curious what you all build with this great technology! 
Thanks for reading this lengthy article. I'm happy to answer any questions in the comments.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1706547031708/c6cefe3e-c142-4cf4-acd2-da65d2a3140b.jpeg" alt class="image--center mx-auto" /></p>]]></description><link>https://tobilg.com/using-duckdb-wasm-for-in-browser-data-engineering</link><guid isPermaLink="true">https://tobilg.com/using-duckdb-wasm-for-in-browser-data-engineering</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Sat, 27 Jan 2024 23:00:00 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/kCa6cRVtgWI/upload/ccc915852cab85dbec5c58fdb857d6d5.jpeg</cover_image></item><item><title><![CDATA[Retrieving Lambda@Edge CloudWatch Logs]]></title><description><![CDATA[<h3 id="heading-what-is-lambdaedge">What is Lambda@Edge</h3><p>AWS Lambda@Edge is an extension of the traditional AWS Lambda service, but with a crucial twist: it brings serverless computing capabilities closer to the end-users.</p><p>In essence, Lambda@Edge empowers developers to run custom code in response to <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-cloudfront-trigger-events.html">specific CloudFront events</a>. 
Whether it's tailoring content based on user location or device type, or handling real-time data processing at the edge, Lambda@Edge functions open up a world of possibilities for optimizing content delivery.</p><p>Lambda@Edge functions have the following features:</p><ul><li><p>They can access the public internet</p></li><li><p>They can be run before or after your cache</p></li><li><p>They allow developers to view and modify not only the client request/response but also the origin request/response</p></li><li><p>They can read the body of the request</p></li><li><p>They can use a much bigger package size for code compared to CloudFront Functions (1MB for a client request/response trigger and 50MB for an origin request/response trigger)</p></li><li><p>They allow up to 5 seconds for a client request/response trigger and up to 30 seconds for an origin request/response trigger.</p></li></ul><h3 id="heading-how-can-you-retrieve-the-lambdaedge-logs">How can you retrieve the Lambda@Edge logs?</h3><p>Because Lambda@Edge functions run in the CloudFront Regional Edge Caches, the CloudWatch logs can not only be found in <code>us-east-1</code>, but potentially also in all other AWS regions that support <a target="_blank" href="https://aws.amazon.com/cloudfront/features/">Regional Edge Caches</a>.</p><p>The <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/lambda-edge-testing-debugging.html">Lambda@Edge testing and debugging guide</a> states:</p><blockquote><p>When you review CloudWatch log files or metrics when you're troubleshooting errors, be aware that they are displayed or stored in the Region closest to the location where the function executed. 
So, if you have a website or web application with users in the United Kingdom, and you have a Lambda function associated with your distribution, for example, you must change the Region to view the CloudWatch metrics or log files for the London AWS Region.</p></blockquote><p>So, one simple solution to automatically get the logs from all regions would be to write a script that searches for relevant CloudWatch log groups in each region, and displays their respective LogStream entries.</p><p>This script could look like this:</p><pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>FUNCTION_NAME=<span class="hljs-variable">$1</span><span class="hljs-keyword">for</span> region <span class="hljs-keyword">in</span> $(aws --output text ec2 describe-regions | cut -f 4) <span class="hljs-keyword">do</span>  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Checking <span class="hljs-variable">$region</span>"</span>  <span class="hljs-keyword">for</span> loggroup <span class="hljs-keyword">in</span> $(aws --output text logs describe-log-groups --log-group-name-prefix <span class="hljs-string">"/aws/lambda/us-east-1.<span class="hljs-variable">$FUNCTION_NAME</span>"</span> --region <span class="hljs-variable">$region</span> --query <span class="hljs-string">'logGroups[].logGroupName'</span>)  <span class="hljs-keyword">do</span>    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Found '<span class="hljs-variable">$loggroup</span>' in region <span class="hljs-variable">$region</span>"</span>    <span class="hljs-keyword">for</span> logstream <span class="hljs-keyword">in</span> $(aws --output text logs describe-log-streams --log-group-name <span class="hljs-variable">$loggroup</span> --region <span class="hljs-variable">$region</span> --query <span class="hljs-string">'logStreams[].logStreamName'</span>)    <span class="hljs-keyword">do</span>      aws --output text logs get-log-events --log-group-name <span 
class="hljs-variable">$loggroup</span> --region <span class="hljs-variable">$region</span> --log-stream-name <span class="hljs-variable">$logstream</span> | cat    <span class="hljs-keyword">done</span>  <span class="hljs-keyword">done</span><span class="hljs-keyword">done</span></code></pre><p>If you save this script as <code>cf-logs.sh</code>, after giving it execution rights with <code>chmod +x cf-logs.sh</code>, can then be started with <code>./cf-logs.sh YOUR_FUNCTION_NAME</code>, where <code>YOUR_FUNCTION_NAME</code> is the real Lambda@Edge function name.</p><p>This solution is great for quick log viewing while you're still developing your edge-enabled application, but surely is not a sustainable one when running in production.</p><h3 id="heading-aggregating-lambdaedge-logs-with-kinesis">Aggregating Lambda@Edge logs with Kinesis</h3><p>For a production setup, you'd probably want to be able to aggregate and store the individual logs coming from the different regions in one place. A possible solution is to stream them to a Kinesis Firehose Delivery Stream, which then stores the logs in S3.</p><p>An example implementation can be found at <a target="_blank" href="https://gist.github.com/heitorlessa/5d2295655f9d76483969d215986e53b0">https://gist.github.com/heitorlessa/5d2295655f9d76483969d215986e53b0</a></p><p>Please be aware that this will incur additional fixed and variable costs, so please review AWS' pricing of the used services.</p>]]></description><link>https://tobilg.com/retrieving-lambda-at-edge-cloudwatch-logs</link><guid isPermaLink="true">https://tobilg.com/retrieving-lambda-at-edge-cloudwatch-logs</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Fri, 26 Jan 2024 20:11:22 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Q1p7bh3SHj8/upload/70961ce3e5e4d59bce33151e44053b96.jpeg</cover_image></item><item><title><![CDATA[List of free AWS Knowledge Badges]]></title><description><![CDATA[<p>As the 
Skillbuilder website is sometimes a bit hard to navigate, here's the full list of free badges you can do on <a target="_blank" href="https://explore.skillbuilder.aws/">AWS Skillbuilder</a>:</p><ul><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/lp/82/Cloud%2520Essentials%2520-%2520Knowledge%2520Badge%2520Readiness%2520Path">AWS Knowledge: Cloud Essentials</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/lp/1044/Solutions%2520Architect%2520-%2520Knowledge%2520Badge%2520Readiness%2520Path">AWS Knowledge: Architecting</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/lp/92/Serverless%2520-%2520Knowledge%2520Badge%2520Readiness%2520Path">AWS Knowledge: Serverless</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/lp/51/object-storage-knowledge-badge-readiness-path">AWS Knowledge: Object Storage</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/93/block-storage-knowledge-badge-readiness-path">AWS Knowledge: Block Storage</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/95/file-storage-knowledge-badge-readiness-path">AWS Knowledge: File Storage</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/94/storage-data-migration-knowledge-badge-readiness-path">AWS Knowledge: Data Migration</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/54/storage-data-protection-and-disaster-recovery-knowledge-badge-readiness-path">AWS Knowledge: Data Protection &amp; Disaster Recovery</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/1944/networking-core-knowledge-badge-readiness-path">AWS Knowledge: Networking Core</a></p></li><li><p><a target="_blank" 
href="https://explore.skillbuilder.aws/learn/public/learning_plan/view/1985/compute-knowledge-badge-readiness-path">AWS Knowledge: Compute</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/public/learning_plan/view/1931/amazon-eks-knowledge-badge-readiness-path">AWS Knowledge: EKS</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/public/learning_plan/view/1927/events-and-workflows-knowledge-badge-readiness-path">AWS Knowledge: Events and Workflows</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/public/learning_plan/view/1986/amazon-braket-badge-knowledge-badge-readiness-path">AWS Knowledge: Braket</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/1570/aws-for-games-cloud-game-development-knowledge-badge-readiness-path">AWS Knowledge: AWS for Games: Cloud Game Development</a></p></li><li><p><a target="_blank" href="https://explore.skillbuilder.aws/learn/learning_plan/view/1722/media-entertainment-direct-to-consumer-and-broadcast-foundations-knowledge-badge-readiness-path">AWS Knowledge: Media &amp; Entertainment: Direct-to-Consumer and Broadcast Foundations</a></p></li></ul>]]></description><link>https://tobilg.com/list-of-free-aws-knowledge-badges</link><guid isPermaLink="true">https://tobilg.com/list-of-free-aws-knowledge-badges</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Fri, 01 Sep 2023 10:32:10 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/VMKsKFSuEg8/upload/f842a4e8f33ff2165cfa32901df073aa.jpeg</cover_image></item><item><title><![CDATA[Serverless Maps for fun and profit]]></title><description><![CDATA[<h1 id="heading-introduction">Introduction</h1><p>In today's data-driven world, interactive and visually appealing web-based maps have become an integral part of countless applications and services. 
Whether it's for navigation, location-based services, or data visualization, delivering seamless and lightning-fast user experiences is paramount. In this article, we explore how to host web-based maps on Amazon Web Services (AWS) in a serverless manner, and show how this approach not only boosts speed but also brings significant cost advantages.</p><p><strong>Web-Based Maps: A Key to Engaging User Experiences</strong></p><p>The ubiquity of smartphones and the proliferation of internet-connected devices have transformed how we interact with the digital world. Whether we're exploring a new city, tracking our fitness activities, or monitoring real-time data, web-based maps provide a dynamic and immersive way to engage users. However, delivering smooth and responsive map experiences can be challenging, especially as data sizes and user demands continue to grow.</p><p>Traditionally, hosting web-based maps required dedicated servers or complex infrastructure setups, which often resulted in high operational costs and scalability limitations. Fortunately, the advent of serverless computing and the capabilities offered by AWS have greatly simplified how we deploy and manage web applications, including mapping services.</p><p><strong>Serverless map hosting: A Game-Changer</strong></p><p>AWS' serverless services allow developers to build and run applications without worrying about provisioning and managing servers. This approach not only simplifies development but also offers numerous advantages, particularly when it comes to hosting web-based maps.</p><ol><li><p><strong>Speed</strong> and <strong>Responsiveness</strong>: With serverless hosting via CloudFront and S3, the infrastructure caches data on the edge. 
This means that users will experience near-instantaneous map loading and seamless interactions, translating to heightened satisfaction and engagement.</p></li><li><p><strong>Cost Efficiency</strong>: One of the most compelling aspects of the serverless approach is its cost-effectiveness. Traditional hosting methods often require paying for idle server resources, which can quickly add up and strain budgets. In contrast, serverless computing on AWS charges you only for the actual compute resources consumed during map requests. As a result, you can significantly reduce operational costs and allocate resources more efficiently.</p></li></ol><p>In this comprehensive guide, we'll walk you through the steps of setting up a serverless web-based map hosting solution on AWS. From leveraging AWS Lambda for dynamic map tile querying to utilizing S3 for scalable and cost-effective data storage and using CloudFront for edge caching, you'll learn how to harness the full potential of AWS services for a top-notch mapping experience.</p><h1 id="heading-choosing-a-map-tile-format">Choosing a map tile format</h1><p>In the realm of digital maps, efficiency and performance are paramount. Whether you're navigating through city streets or exploring remote terrains, quick loading times and seamless zooming can make all the difference in delivering a superior user experience. Enter <a target="_blank" href="https://protomaps.com/docs/pmtiles">PMTiles</a>, a cutting-edge map tile format that is redefining how we interact with digital maps.</p><h2 id="heading-what-are-pmtiles">What are PMTiles?</h2><p>Traditional map tile formats, such as the popular PNG or JPEG images, can be bulky and slow to load, especially when dealing with intricate cartography or high-resolution imagery. 
PMTiles was specifically designed to address these limitations, offering a lightweight and performant solution for delivering map tiles.</p><p>According to the <a target="_blank" href="https://protomaps.com/docs/pmtiles">protomaps.com</a> website:</p><blockquote><p>PMTiles is a single-file archive format for pyramids of tiled data. A PMTiles archive can be hosted on a storage platform like S3, and enables low-cost, zero-maintenance map applications.</p></blockquote><p><strong>Concepts include</strong></p><ul><li><p>A general format for tiled data addressable by Z/X/Y coordinates, which can be cartographic basemap vector tiles, remote sensing observations, JPEG images, or more.</p></li><li><p>Readers use <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests">HTTP Range Requests</a> to fetch only the relevant tile or metadata inside a PMTiles archive on-demand.</p></li><li><p>The arrangement of tiles and directories is designed to minimize the number of overhead requests when panning and zooming.</p></li></ul><p>The current specification of PMTiles v3 can be found on <a target="_blank" href="https://github.com/protomaps/PMTiles/blob/main/spec/v3/spec.md">GitHub</a>.</p><h2 id="heading-why-pmtiles-matter">Why do PMTiles matter?</h2><ol><li><p><strong>Faster Loading Times:</strong> PMTiles leverages the power of vector data to store map tiles, which means that the tiles are smaller in size compared to raster image formats like PNG or JPEG. This smaller size translates to quicker loading times, enabling a seamless map browsing experience even in areas with limited network connectivity.</p></li><li><p><strong>Efficient Storage:</strong> Due to their compact size, PMTiles require less storage space than traditional map tile formats. 
This advantage is particularly significant for applications with large map datasets, as it reduces the infrastructure costs associated with storing and serving the maps.</p></li><li><p><strong>Dynamic Styling:</strong> PMTiles offer the flexibility of dynamic styling, allowing developers to modify the appearance of map tiles on the fly. By adjusting colors, labels, and map features in real-time, developers can create customized maps that cater to the specific needs of their users.</p></li><li><p><strong>Offline Accessibility:</strong> One of the most impressive features of PMTiles is its ability to support offline access. By preloading PMTiles on a user's device, applications can ensure uninterrupted map access, even when an internet connection is not available. This functionality is invaluable for users in remote locations or areas with unstable network coverage. Additionally, cache headers like <code>Cache-Control</code> can save bandwidth and rendering time when used in a browser.</p></li></ol><h2 id="heading-how-can-pmtiles-be-used">How can PMTiles be used?</h2><p>PMTiles can be used with common map rendering engines like Leaflet, MapLibre GL or OpenLayers by using the <a target="_blank" href="https://www.npmjs.com/package/protomaps">protomaps.js</a> library. Examples of how to integrate it can be found on the website, or in the <a target="_blank" href="https://github.com/serverlessmaps/serverlessmaps/blob/main/website/index.html">index.html</a> and <a target="_blank" href="https://github.com/serverlessmaps/serverlessmaps/blob/main/website/basemap.html">basemap.html</a> examples in our <a target="_blank" href="https://github.com/serverlessmaps/serverlessmaps">repo</a>.</p><h2 id="heading-generating-pmtiles-from-openstreetmap-data">Generating PMTiles from OpenStreetMap data</h2><p>The first step towards self-hosted maps is downloading publicly available map data from OpenStreetMap. 
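</p><p>The HTTP Range Request mechanism mentioned above can be sketched from the command line. This is a minimal, hedged example: the helper name and URL are hypothetical, and it assumes (per the PMTiles v3 spec) that the header and root directory sit at the very start of the archive:</p>

```shell
# Build a Range header asking for the first N bytes of a file.
# range_header is a hypothetical helper, not part of PMTiles.
range_header() {
  local length="$1"
  echo "Range: bytes=0-$((length - 1))"
}

# A reader would fetch only the archive's first bytes instead of the
# whole (potentially multi-GB) file, e.g.:
#   curl -s -H "$(range_header 16384)" "https://example.com/hamburg.pmtiles" -o head.bin
range_header 16384   # -> Range: bytes=0-16383
```

<p>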
This data then needs to be transformed into a compatible basemap layer for use with PMTiles.</p><p>See the documentation about <a target="_blank" href="https://protomaps.com/docs/frontends/basemap-layers">vector basemap layers</a> to get an idea of how this works in general. However, the following note (from <a target="_blank" href="https://protomaps.com/docs/frontends/basemap-layers">protomaps.com</a>) is important:</p><blockquote><p>The organization of features with layers and tags is specific to Protomaps services; this means that map styles are not directly portable with other systems such as <a target="_blank" href="https://openmaptiles.org/schema/">OpenMapTiles</a> or <a target="_blank" href="https://tilezen.readthedocs.io/en/latest/">Mapzen tiles</a></p></blockquote><p>This step can be automated to a large extent.</p><h1 id="heading-solution">Solution</h1><p>Enter <a target="_blank" href="https://github.com/serverlessmaps/serverlessmaps">ServerlessMaps</a>! The project is outlined in the following paragraphs.</p><h1 id="heading-architecture">Architecture</h1><p>In our example implementation, we rely on only a few services:</p><ul><li><p>AWS Lambda (proxying the map tile requests to the S3 origin)</p></li><li><p>Amazon CloudFront (globally distributed CDN)</p></li><li><p>Amazon S3 (storing the PMTiles files, and the map example website)</p></li><li><p>Amazon CloudWatch (keeping the logs of the Lambda function)</p></li></ul><p>The overall architecture looks like this:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1689947287518/920eba80-8993-47b8-a513-69437fef3cc9.png" alt class="image--center mx-auto" /></p><h1 id="heading-deployment">Deployment</h1><p>The deployment of <a target="_blank" href="https://github.com/serverlessmaps/serverlessmaps">ServerlessMaps</a> must be done in multiple steps. 
From a high-level perspective, they are:</p><ul><li><p>Setting up the project from GitHub locally</p></li><li><p>Setting up the local environment to prepare the basemaps</p></li><li><p>Building the desired basemaps as a PMTiles file</p></li><li><p>Deploying the serverless infrastructure on AWS</p></li><li><p>Uploading the basemaps and example websites to S3 (done automatically)</p></li></ul><h3 id="heading-setting-up-the-project-locally">Setting up the project locally</h3><p>To clone the project to your local machine, please use</p><pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/serverlessmaps/serverlessmaps.git</code></pre><p>in your desired directory. Then, do a <code>cd serverlessmaps</code> in the directory you ran the above command.</p><h3 id="heading-setting-up-the-local-environment">Setting up the local environment</h3><p>This step assumes that you're using macOS as the operating system. If so, you can run</p><pre><code class="lang-bash">scripts/install_macos.sh</code></pre><p>to install the dependencies needed to build the basemaps as a PMTiles file. This will install the latest OpenJDK and Maven, which are both necessary for the basemaps build.</p><p>If your system is not running macOS, you can also install those two dependencies manually.</p><p>The next step is to compile the <a target="_blank" href="https://github.com/onthegomap/planetiler">Planetiler</a> build profile, which can later generate the PMTiles file:</p><pre><code class="lang-bash">scripts/compile_basemaps_builder.sh</code></pre><p>This will create a new directory <code>builder</code> that will contain the JAR with the runnable builder.</p><h3 id="heading-build-the-desired-basemaps">Build the desired basemaps</h3><p>To build a basemap with the previously compiled builder, run</p><pre><code class="lang-bash">scripts/build_pmtiles.sh</code></pre><p>This will build a map of the default area of Hamburg, Germany. 
If you want to generate maps for other areas, have a look at the OSM (sub-)region names, e.g. on the <a target="_blank" href="https://download.geofabrik.de/">https://download.geofabrik.de/</a> server.</p><p>For example, if you'd like to generate a map for the whole of Europe, you could run</p><pre><code class="lang-bash">scripts/build_pmtiles.sh europe</code></pre><p>Please be aware that this will run for several hours depending on your machine, and will generate a PMTiles file of around 45GB. This file will take some time to upload to S3 in the next step as well. If you just want to try out this project, it is recommended to use the default <a target="_blank" href="https://download.geofabrik.de/europe/germany/hamburg.html">hamburg</a> sub-region, which is around 35MB.</p><h3 id="heading-deploy-the-serverless-infrastructure">Deploy the serverless infrastructure</h3><p>This project assumes that you have already set up your AWS credentials locally so that the <a target="_blank" href="https://www.serverless.com">Serverless framework</a> can use them.</p><p>To deploy the serverless AWS infrastructure, you can do a <code>cd iac</code> from the project's root directory, and then use</p><pre><code class="lang-bash">sls deploy</code></pre><p>to deploy the necessary stack.</p><p>You can customize some parameters for the deployment:</p><ul><li><p><code>region</code>: The AWS region you want to deploy your stack to (default: <code>us-east-1</code>)</p></li><li><p><code>stage</code>: The stage name (default: <code>prd</code>)</p></li><li><p><code>cors</code>: The allowed hostname for the CORS header (default: <code>*</code>)</p></li></ul><p>The following will deploy the stack to the <code>eu-central-1</code> region with the stage name <code>dev</code> and the allowed CORS hostname <a target="_blank" href="http://mymapservice.xyz"><code>mymapservice.xyz</code></a>:</p><pre><code class="lang-bash">sls deploy --region eu-central-1 --stage dev --cors 
mymapservice.xyz</code></pre><h4 id="heading-stack-output">Stack output</h4><p>The deployment of the stack will generate an output like this on the console:</p><pre><code class="lang-text">---------------------------------------------------------------------------------
-&gt; The map can be viewed at https://d1b056iuztreqte.cloudfront.net/#13.54/53.54958/9.99286
-&gt; The basemap themes can be viewed at https://d1b056iuztreqte.cloudfront.net/basemap.html#13.54/53.54958/9.99286
-&gt; Please manually do a 'npm run website:sync' on your console to sync the static website assets if you changed them after the last deployment
-&gt; Afterwards, run 'npm run website:invalidate' to invalidate the website's CloudFront distribution cache
---------------------------------------------------------------------------------</code></pre><h4 id="heading-automatically-created-files">Automatically created files</h4><p>There will be two automatically created files based on the settings you chose before:</p><ul><li><p><code>website/urlConfig.js</code>: This contains the CloudFront Distribution hostname (variable <code>tilesDistributionHostname</code>) for the caching of the PMTiles. This is assigned by CloudFront during deployment.</p></li><li><p><code>website/tilePath.js</code>: This contains the needed <code>tilePath</code> variable, which depends on the area you chose for the basemap. This is generated by the <code>scripts/build_pmtiles.sh</code> script automatically.</p></li></ul><h3 id="heading-upload-the-basemaps-and-example-websites">Upload the basemaps and example websites</h3><p>The basemap that was generated before the deployment, as well as the two example websites, are synced automatically to the website S3 bucket.</p><p>If you change your web application/website afterwards, you need to run the sync manually via <code>npm run website:sync</code>. After that, the CloudFront cache needs to be invalidated as well to show the new content. 
This can be done via <code>npm run website:invalidate</code>, both from the <code>iac</code> directory.</p><h2 id="heading-result">Result</h2><p>If everything went well, you can access the URL (<a target="_blank" href="https://d1b056iuztreqte.cloudfront.net/#13.54/53.54958/9.99286"><code>https://d1b056iuztreqte.cloudfront.net/#13.54/53.54958/9.99286</code></a> in the above example output) to view your basic map:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1691413129462/db36e396-d763-48b7-aa30-70f74d45d1b7.png" alt class="image--center mx-auto" /></p><h1 id="heading-conclusion">Conclusion</h1><p>Hosting web-based maps on AWS in a serverless manner opens up a world of opportunities for delivering blazing-fast, interactive, and cost-efficient map services. By taking advantage of AWS's scalable infrastructure and pay-as-you-go pricing model, you can create an outstanding user experience without breaking the bank.</p>]]></description><link>https://tobilg.com/serverless-maps-for-fun-and-profit</link><guid isPermaLink="true">https://tobilg.com/serverless-maps-for-fun-and-profit</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Mon, 07 Aug 2023 13:10:07 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/AFB6S2kibuk/upload/42964de7c80e3b113951bed72c3255ca.jpeg</cover_image></item><item><title><![CDATA[Gathering and analyzing public cloud provider IP address data with DuckDB & Observable]]></title><description><![CDATA[<p>As organizations increasingly adopt the public cloud, managing the networking and security aspects of cloud computing becomes more complex. One of the challenges that cloud administrators face, especially in a hybrid cloud environment, is keeping track of the IP address ranges of the public cloud providers, which all use different file formats to publish their IP address range data. 
The formats include deeply nested JSON, CSV, as well as plain text.</p><p>The goal of this article is to outline how this data can be unified, cleaned and made available on a platform that makes it easy for users to consume. Furthermore, some interesting statistics can be derived from those public datasets.</p><p>The data and the source code can be found at:</p><ul><li><a target="_blank" href="https://github.com/tobilg/public-cloud-provider-ip-ranges">https://github.com/tobilg/public-cloud-provider-ip-ranges</a></li></ul><h1 id="heading-data-sources">Data sources</h1><p>The (incomplete) list of public cloud providers that publish their <a target="_blank" href="https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#IPv4_CIDR_blocks">IPv4 CIDR blocks</a> is:</p><ul><li><p><a target="_blank" href="https://ip-ranges.amazonaws.com/ip-ranges.json">AWS</a></p></li><li><p><a target="_blank" href="https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20230417.json">Azure</a></p></li><li><p><a target="_blank" href="https://www.cloudflare.com/ips-v4">CloudFlare</a></p></li><li><p><a target="_blank" href="https://digitalocean.com/geo/google.csv">DigitalOcean</a></p></li><li><p><a target="_blank" href="https://api.fastly.com/public-ip-list">Fastly</a></p></li><li><p><a target="_blank" href="https://www.gstatic.com/ipranges/cloud.json">Google Cloud</a></p></li><li><p><a target="_blank" href="https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json">Oracle Cloud</a></p></li></ul><p>You can click on the individual links to view or download the data manually.</p><p><strong>HINT: The published lists of IP address ranges don't represent the overall IP address space that each of these providers possesses.</strong></p><p>To get the complete IP ranges, data from organizations like ARIN would need to be added as well. 
For simplicity and brevity, only the publicly downloadable info was used.</p><h1 id="heading-retrieving-cleaning-storing-and-exporting-data">Retrieving, cleaning, storing and exporting data</h1><p>The overall data engineering process is divided into multiple steps:</p><ul><li><p>Identifying the data sources (see above) and defining the common schema</p></li><li><p>Retrieving the raw data from public data sources (via HTTP)</p></li><li><p>Cleaning the retrieved data (e.g. removing duplicates)</p></li><li><p>Storing the data in a common schema, so that it can be aggregated and analyzed for different public cloud providers at once</p></li><li><p>Exporting the stored data into different file formats, so that many different types of clients can make use of it</p></li></ul><p>Additionally, we'd like to keep the costs low, and the infrastructure as simple as possible. That's why DuckDB was chosen as the database layer: it offers a rich set of features for handling (reading and writing) different file formats, and it can read directly from remote data sources via HTTP, using only SQL. This saves the additional effort of out-of-band ETL.</p><p>Furthermore, to share the data, we chose GitHub, which is free to use for the scope of our use case. Most importantly, it allows us to store the exported data files in our <a target="_blank" href="https://github.com/tobilg/public-cloud-provider-ip-ranges/tree/main/data/providers">repository</a>. 
To run the overall process, <a target="_blank" href="https://docs.github.com/en/actions">GitHub Actions</a> are used as they also offer a free usage tier, and have everything we need to <a target="_blank" href="https://github.com/tobilg/public-cloud-provider-ip-ranges/blob/main/.github/workflows/main.yml">create the described data pipeline</a>.</p><h2 id="heading-common-data-schema">Common data schema</h2><p>After inspecting the data source files, the derived unified schema for all loaded data sources will look like this:</p><div class="hn-table"><table><thead><tr><td>Column name</td><td>Data type</td><td>Description</td></tr></thead><tbody><tr><td>cloud_provider</td><td>VARCHAR</td><td>The public cloud provider's name</td></tr><tr><td>cidr_block</td><td>VARCHAR</td><td>The CIDR block, e.g. <code>10.0.0.0/32</code></td></tr><tr><td>ip_address</td><td>VARCHAR</td><td>The IP address, e.g. <code>10.0.0.0</code></td></tr><tr><td>ip_address_mask</td><td>INTEGER</td><td>The IP address mask, e.g. <code>32</code></td></tr><tr><td>ip_address_cnt</td><td>INTEGER</td><td>The number of IP addresses in this CIDR block</td></tr><tr><td>region</td><td>VARCHAR</td><td>The public cloud provider region information (if given)</td></tr></tbody></table></div><h2 id="heading-creating-the-cloud-provider-tables">Creating the cloud provider tables</h2><p>At first, we'll create a table in DuckDB for each of the public cloud providers. 
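</p><p>The <code>ip_address_cnt</code> column from the schema above can be derived from the CIDR mask alone: a <code>/n</code> block contains 2^(32-n) IPv4 addresses. The same logic as a quick bash sketch (<code>cidr_count</code> is a hypothetical helper for illustration, not part of the project):</p>

```shell
# Number of IPv4 addresses in a CIDR block: 2^(32 - mask).
# cidr_count is a hypothetical helper, not taken from the project.
cidr_count() {
  local mask="${1#*/}"           # "10.0.0.0/24" -> "24"
  echo $(( 2 ** (32 - mask) ))
}

cidr_count "10.0.0.0/24"   # -> 256
cidr_count "10.0.0.0/32"   # -> 1
```

<p>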
If DuckDB is installed (see <a target="_blank" href="https://duckdb.org/docs/installation/">docs</a>) and available in the PATH, we can execute SQL scripts like this:</p><pre><code class="lang-bash"># $DATA_PATH is the location of the DuckDB database file (this is important,
# because otherwise an in-memory database will be created automatically,
# which will not be able to persist the data)
# $SCRIPT is the path to the SQL script that shall be executed
duckdb $DATA_PATH &lt; $SCRIPT.sql</code></pre><p>Each table needs a different SQL statement, as the data sources (contents and formats) of each provider are different.</p><p>Before starting, we need to make sure that the <a target="_blank" href="https://duckdb.org/docs/extensions/httpfs">httpfs extension</a> is <a target="_blank" href="https://github.com/tobilg/public-cloud-provider-ip-ranges/blob/main/queries/install_extensions.sql">installed</a> and loaded (as we use remote datasets):</p><pre><code class="lang-sql">INSTALL httpfs;
LOAD httpfs;</code></pre><h3 id="heading-aws-table">AWS table</h3><pre><code class="lang-sql">CREATE TABLE aws_ip_data AS (
  SELECT DISTINCT
    prefixes.cidr_block,
    prefixes.ip_address,
    prefixes.ip_address_mask,
    CAST(POW(2, 32-prefixes.ip_address_mask) AS INTEGER) AS ip_address_cnt,
    prefixes.region
  FROM (
    SELECT
      prefix_object.ip_prefix AS cidr_block,
      STR_SPLIT(prefix_object.ip_prefix, '/')[1] AS ip_address,
      CAST(STR_SPLIT(prefix_object.ip_prefix, '/')[2] AS INTEGER) AS ip_address_mask,
      prefix_object.region
    FROM (
      SELECT UNNEST(prefixes) AS prefix_object FROM 'https://ip-ranges.amazonaws.com/ip-ranges.json'
    )
  ) prefixes
);</code></pre><h3 id="heading-azure-table">Azure table</h3><pre><code class="lang-sql">CREATE TABLE azure_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    CASE
      WHEN region = '' THEN 'No region'
      ELSE region
    END AS region
  FROM (
    SELECT DISTINCT
      prop.region AS region,
      UNNEST(prop.addressPrefixes) AS prefixes
    FROM (
      SELECT
        values.properties AS prop
      FROM (
        SELECT
          UNNEST(values) AS values
        FROM
          read_json_auto('https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20230417.json', maximum_object_size=10000000)
      )
    )
  )
  WHERE
    prefixes NOT LIKE '%::%'
);</code></pre><h3 id="heading-cloudflare-table">CloudFlare table</h3><pre><code class="lang-sql">CREATE TABLE cloudflare_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    'No region' AS region
  FROM (
    SELECT
      column0 AS prefixes
    FROM
      read_csv_auto('https://www.cloudflare.com/ips-v4')
  )
);</code></pre><h3 id="heading-digitalocean-table">DigitalOcean table</h3><pre><code class="lang-sql">CREATE TABLE digitalocean_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, <span 
class="hljs-string">'/'</span>)[<span class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_mask,    <span class="hljs-keyword">CAST</span>(<span class="hljs-keyword">POW</span>(<span class="hljs-number">2</span>, <span class="hljs-number">32</span>-<span class="hljs-keyword">CAST</span>(STR_SPLIT(prefixes, <span class="hljs-string">'/'</span>)[<span class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>)) <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_cnt,    <span class="hljs-string">'No region'</span> <span class="hljs-keyword">AS</span> region  <span class="hljs-keyword">FROM</span> (    <span class="hljs-keyword">SELECT</span>      column0 <span class="hljs-keyword">AS</span> prefixes    <span class="hljs-keyword">FROM</span>      read_csv_auto(<span class="hljs-string">'https://digitalocean.com/geo/google.csv'</span>)    <span class="hljs-keyword">WHERE</span>      column0 <span class="hljs-keyword">NOT</span> <span class="hljs-keyword">LIKE</span> <span class="hljs-string">'%::%'</span>  ));</code></pre><h3 id="heading-fastly-table">Fastly table</h3><pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> fastly_ip_data <span class="hljs-keyword">AS</span> (  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span>    prefixes <span class="hljs-keyword">AS</span> cidr_block,    STR_SPLIT(prefixes, <span class="hljs-string">'/'</span>)[<span class="hljs-number">1</span>] <span class="hljs-keyword">AS</span> ip_address,    <span class="hljs-keyword">CAST</span>(STR_SPLIT(prefixes, <span class="hljs-string">'/'</span>)[<span class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span 
class="hljs-keyword">AS</span> ip_address_mask,    <span class="hljs-keyword">CAST</span>(<span class="hljs-keyword">POW</span>(<span class="hljs-number">2</span>, <span class="hljs-number">32</span>-<span class="hljs-keyword">CAST</span>(STR_SPLIT(prefixes, <span class="hljs-string">'/'</span>)[<span class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>)) <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_cnt,    <span class="hljs-string">'No region'</span> <span class="hljs-keyword">AS</span> region  <span class="hljs-keyword">FROM</span> (    <span class="hljs-keyword">SELECT</span>      <span class="hljs-keyword">UNNEST</span>(addresses) <span class="hljs-keyword">AS</span> prefixes    <span class="hljs-keyword">FROM</span>      read_json_auto(<span class="hljs-string">'https://api.fastly.com/public-ip-list'</span>)  ));</code></pre><h3 id="heading-google-cloud-table">Google Cloud table</h3><pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> google_ip_data <span class="hljs-keyword">AS</span> (  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span>    prefixes.cidr_block,    prefixes.ip_address,    prefixes.ip_address_mask,    <span class="hljs-keyword">CAST</span>(<span class="hljs-keyword">pow</span>(<span class="hljs-number">2</span>, <span class="hljs-number">32</span>-prefixes.ip_address_mask) <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_cnt,    prefixes.region  <span class="hljs-keyword">FROM</span> (    <span class="hljs-keyword">SELECT</span>      prefix_object.ipv4Prefix <span class="hljs-keyword">AS</span> cidr_block,      str_split(prefix_object.ipv4Prefix, <span class="hljs-string">'/'</span>)[<span class="hljs-number">1</span>] <span 
class="hljs-keyword">AS</span> ip_address,      <span class="hljs-keyword">CAST</span>(str_split(prefix_object.ipv4Prefix, <span class="hljs-string">'/'</span>)[<span class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_mask,      prefix_object.scope <span class="hljs-keyword">as</span> region    <span class="hljs-keyword">FROM</span> (      <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">unnest</span>(prefixes) <span class="hljs-keyword">AS</span> prefix_object <span class="hljs-keyword">FROM</span> <span class="hljs-string">'https://www.gstatic.com/ipranges/cloud.json'</span>    )    <span class="hljs-keyword">WHERE</span>      prefix_object.ipv4Prefix <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>  ) prefixes);</code></pre><h3 id="heading-oracle-cloud-table">Oracle Cloud table</h3><pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> oracle_ip_data <span class="hljs-keyword">AS</span> (  <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span>    prefixes.cidr <span class="hljs-keyword">AS</span> cidr_block,    STR_SPLIT(prefixes.cidr, <span class="hljs-string">'/'</span>)[<span class="hljs-number">1</span>] <span class="hljs-keyword">AS</span> ip_address,    <span class="hljs-keyword">CAST</span>(STR_SPLIT(prefixes.cidr, <span class="hljs-string">'/'</span>)[<span class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_mask,    <span class="hljs-keyword">CAST</span>(<span class="hljs-keyword">POW</span>(<span class="hljs-number">2</span>, <span class="hljs-number">32</span>-<span class="hljs-keyword">CAST</span>(STR_SPLIT(prefixes.cidr, <span class="hljs-string">'/'</span>)[<span 
class="hljs-number">2</span>] <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>)) <span class="hljs-keyword">AS</span> <span class="hljs-built_in">INTEGER</span>) <span class="hljs-keyword">AS</span> ip_address_cnt,    <span class="hljs-keyword">CASE</span>      <span class="hljs-keyword">WHEN</span> region = <span class="hljs-string">''</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">'No region'</span>      <span class="hljs-keyword">ELSE</span> region    <span class="hljs-keyword">END</span> <span class="hljs-keyword">AS</span> region  <span class="hljs-keyword">FROM</span> (    <span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">DISTINCT</span>      region,      <span class="hljs-keyword">UNNEST</span>(cidrs) <span class="hljs-keyword">AS</span> prefixes    <span class="hljs-keyword">FROM</span> (      <span class="hljs-keyword">SELECT</span>         regions.region <span class="hljs-keyword">AS</span> region,        regions.cidrs <span class="hljs-keyword">AS</span> cidrs      <span class="hljs-keyword">FROM</span> (        <span class="hljs-keyword">SELECT</span>           <span class="hljs-keyword">UNNEST</span>(regions) <span class="hljs-keyword">AS</span> regions        <span class="hljs-keyword">FROM</span>          read_json_auto(<span class="hljs-string">'https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json'</span>, maximum_object_size=<span class="hljs-number">10000000</span>)      )    )  ));</code></pre><h2 id="heading-create-a-combined-view">Create a combined view</h2><p>The next step is to create a new view (<code>ip_data</code>) that combines our tables for the individual cloud providers. 
We can then use this view later to compare the different cloud providers.</p><p>The view definition looks like this:</p><pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">VIEW</span> ip_data <span class="hljs-keyword">AS</span> (  <span class="hljs-keyword">SELECT</span> <span class="hljs-string">'AWS'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> aws_ip_data  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>  <span class="hljs-keyword">SELECT</span> <span class="hljs-string">'Azure'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> azure_ip_data  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>  <span class="hljs-keyword">SELECT</span> <span class="hljs-string">'CloudFlare'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> cloudflare_ip_data  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>  <span class="hljs-keyword">SELECT</span> <span class="hljs-string">'DigitalOcean'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> digitalocean_ip_data  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>  <span class="hljs-keyword">SELECT</span> <span class="hljs-string">'Fastly'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> fastly_ip_data  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>  <span class="hljs-keyword">SELECT</span> 
<span class="hljs-string">'Google Cloud'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> google_ip_data  <span class="hljs-keyword">UNION</span> <span class="hljs-keyword">ALL</span>  <span class="hljs-keyword">SELECT</span> <span class="hljs-string">'Oracle'</span> <span class="hljs-keyword">as</span> cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region <span class="hljs-keyword">FROM</span> oracle_ip_data);</code></pre><h2 id="heading-export-the-data">Export the data</h2><p>To be able to use the data with other tools, we need to export the data to different formats, in our case <a target="_blank" href="https://duckdb.org/docs/guides/import/csv_export">CSV</a> and <a target="_blank" href="https://duckdb.org/docs/guides/import/parquet_export">Parquet</a>. You can review the executed queries in the <a target="_blank" href="https://github.com/tobilg/public-cloud-provider-ip-ranges/blob/main/queries/export_provider_data.sql">repository</a>.</p><pre><code class="lang-sql"><span class="hljs-comment">-- Only an example, this needs to be done for all providers as well!</span><span class="hljs-comment">-- Export complete data as CSV</span>COPY (<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> ip_data <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> cloud_provider, cidr_block) <span class="hljs-keyword">TO</span> <span class="hljs-string">'data/providers/all.csv'</span> <span class="hljs-keyword">WITH</span> (HEADER <span class="hljs-number">1</span>, DELIMITER <span class="hljs-string">','</span>);<span class="hljs-comment">-- Export complete data as Parquet</span>COPY (<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> ip_data <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> cloud_provider, cidr_block) <span 
class="hljs-keyword">TO</span> <span class="hljs-string">'data/providers/all.parquet'</span> (<span class="hljs-keyword">FORMAT</span> <span class="hljs-string">'parquet'</span>, COMPRESSION <span class="hljs-string">'SNAPPY'</span>);</code></pre><h2 id="heading-analyze-the-data">Analyze the data</h2><p>Now that we've prepared our data, we can start analyzing it. To do so, we'll use an <a target="_blank" href="https://observablehq.com/@duckdb-projects/public-cloud-provider-ip-ranges">ObservableHQ notebook</a>, to which we'll upload the <a target="_blank" href="https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv">all.csv</a> file.</p><h3 id="heading-overall-ip-address-counts">Overall IP address counts</h3><iframe width="100%" height="514" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=Overall"></iframe><h3 id="heading-total-number-and-value-of-ip-addresses">Total number and value of IP addresses</h3><p>An astonishing insight is that both <strong>AWS</strong> and <strong>Azure</strong> have more than <strong>six times as many</strong> IP addresses available as their next competitor.</p><p>Also, the <strong>market values</strong> of their IP addresses are <strong>nearly four billion US dollars</strong> according to a <a target="_blank" href="https://circleid.com/posts/20220610-recent-ipv4-pricing-trends-may-2022">market analysis</a>.</p><iframe width="100%" height="348.09375" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=NumberAndValueOfIPAddresses"></iframe><h3 id="heading-cidr-masks-distribution-by-public-cloud-provider">CIDR masks distribution by public cloud provider</h3><p>It's remarkable that, although AWS and Azure have similar absolute numbers of IP addresses, the types of CIDR blocks / IP ranges differ strongly: AWS owns very few very large IP ranges, whereas Azure owns very many rather small IP ranges, and just a few
very large ones:</p><iframe width="100%" height="514" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=IPRangeSizeFrequencyGraph"></iframe><p>Another view is the filterable table for this data:</p><iframe width="100%" height="479" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=IPRangeSizeFrequency"></iframe><h3 id="heading-aws-cidr-masks">AWS CIDR masks</h3><iframe width="100%" height="479" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=AWSbyCIDRMask"></iframe><h3 id="heading-azure-cidr-masks">Azure CIDR masks</h3><iframe width="100%" height="479" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=AzurebyCIDRMask"></iframe><h3 id="heading-cloudflare-cidr-masks">CloudFlare CIDR masks</h3><iframe width="100%" height="393.6875" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=CloudFlarebyCIDRMask"></iframe><h3 id="heading-digitalocean-cidr-masks">DigitalOcean CIDR masks</h3><iframe width="100%" height="445" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=DigitalOceanbyCIDRMask"></iframe><h3 id="heading-fastly-cidr-masks">Fastly CIDR masks</h3><iframe width="100%" height="420" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=FastlybyCIDRMask"></iframe><h3 id="heading-google-cloud-cidr-masks">Google Cloud CIDR masks</h3><iframe width="100%" height="479" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=GoogleCloudbyCIDRMask"></iframe><h3 id="heading-oracle-cloud-cidr-masks">Oracle Cloud CIDR masks</h3><iframe width="100%" height="479" src="https://observablehq.com/embed/@duckdb-projects/public-cloud-provider-ip-ranges@latest?cells=OracleCloudbyCIDRMask"></iframe><h1 
id="heading-conclusion">Conclusion</h1><p>In this article, we described a simple and straightforward way to gather and transform data in different formats with DuckDB, as well as export it to a common schema as CSV and Parquet files.</p><p>Furthermore, we leveraged DuckDB on <a target="_blank" href="https://observablehq.com/">Observable</a> to analyze and display the data in beautiful and interactive graphs.</p><p>By using the GitHub Actions free tier as our "runtime" for the data processing via Bash and SQL scripts, and hosting our data in a GitHub repo (also covered by the free tier), we were able to show that data pipelines like the covered use case can be built without accruing infrastructure costs. Also, the <a target="_blank" href="https://observablehq.com/@observablehq/team-and-individual-workspaces?collection=@observablehq/accounts-and-workspaces#freeTeam">Observable pricing model</a> supports our analyses for free.</p>]]></description><link>https://tobilg.com/gathering-and-analyzing-public-cloud-provider-ip-address-data-with-duckdb-observerable</link><guid isPermaLink="true">https://tobilg.com/gathering-and-analyzing-public-cloud-provider-ip-address-data-with-duckdb-observerable</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Wed, 26 Apr 2023 16:06:39 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/M5tzZtFCOfs/upload/e969453744ce4250b0e5de76b917c34a.jpeg</cover_image></item><item><title><![CDATA[Casual data engineering, or: A poor man's Data Lake in the cloud - Part I]]></title><description><![CDATA[<p>In the age of big data, organizations of all sizes are collecting vast amounts of information about their operations, customers, and markets. To make sense of this data, many are turning to data lakes - centralized repositories that store and manage data of all types and sizes, from structured to unstructured.
However, building a data lake can be a daunting task, requiring significant resources and expertise.</p><p>For enterprises, this often means using SaaS solutions like <a target="_blank" href="https://www.snowflake.com">Snowflake</a>, <a target="_blank" href="https://dremio.com">Dremio</a>, <a target="_blank" href="https://www.databricks.com">DataBricks</a> or the like. Or, go all-in on the public cloud provider offerings from AWS, Azure and Google Cloud. But what if, as recent studies show, the data sizes aren't as big as commonly thought? Is it really necessary to spend so much money on usage and infrastructure?</p><p>In this blog post, we'll walk you through the steps to create a <strong>scalable</strong>, <strong>cost-effective</strong> data lake on AWS. Whether you're a startup, a small business, or a large enterprise, this guide will help you unlock the power of big data without breaking the bank (also see the excellent <a target="_blank" href="https://motherduck.com/blog/big-data-is-dead/">"Big data is dead"</a> blog post by Jordan Tigani).</p><h1 id="heading-modern-data-lake-basics">Modern Data Lake basics</h1><p>The definition of what a Data Lake is, is probably slightly different depending on whom you're asking (see <a target="_blank" href="https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/">AWS</a>, <a target="_blank" href="https://cloud.google.com/learn/what-is-a-data-lake">Google Cloud</a>, <a target="_blank" href="https://azure.microsoft.com/en-us/solutions/data-lake/">Azure</a>, <a target="_blank" href="https://www.databricks.com/discover/data-lakes/introduction">DataBricks</a>, <a target="_blank" href="https://www.ibm.com/topics/data-lake">IBM</a> or <a target="_blank" href="https://en.wikipedia.org/wiki/Data_lake">Wikipedia</a>). What is common to all these definitions and explanations is that it consists of different layers, such as ingestion, storage, processing and consumption. 
There can be several other layers as well, like cataloging and search, as well as a security and governance layer.</p><p>This is outlined in the excellent AWS article <a target="_blank" href="https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/">"AWS serverless data analytics pipeline reference architecture"</a>, which shall be the basis for this blog post:</p><p><img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/02/19/data-analytics-update-1-final.jpg" alt class="image--center mx-auto" /></p><h2 id="heading-separation-of-storage-andamp-compute">Separation of storage &amp; compute</h2><p>Modern data lakes have revolutionized the way organizations handle big data. A data lake is a central repository that allows organizations to store all types of data, both structured and unstructured, at any scale. The flexibility and scalability of data lakes enable organizations to perform advanced analytics and gain insights that can drive business decisions. One of the key architectural patterns that modern data lakes follow is the separation of storage and compute.</p><p>Traditionally, data storage and processing were tightly coupled in data warehouses. However, in modern data lakes, data is stored in a separate layer from the computational layer that processes it. Data storage is handled by a data storage layer, while data processing is done by a compute layer. This approach allows organizations to scale storage and compute independently, enabling them to process vast amounts of data without incurring significant costs.</p><p>This has several advantages, which include:</p><ol><li><p>Scalability: It allows organizations to scale each layer independently. 
The storage layer can be scaled up or down depending on the amount of data being stored, while the compute layer can be scaled up or down depending on the processing requirements.</p></li><li><p>Cost Savings: Decoupling storage and compute can significantly reduce costs. In traditional data warehouses, organizations must provision sufficient storage and processing power to handle peak loads. This results in underutilized resources during periods of low demand, leading to the wastage of resources and increased costs. In modern data lakes, organizations can store data cheaply and only provision the necessary compute resources when required, leading to significant cost savings.</p></li><li><p>Flexibility: Organizations can use a range of storage options, including object storage, file storage, and block storage, to store their data. This flexibility allows organizations to choose the most appropriate storage option for their data, depending on factors such as cost, performance, and durability.</p></li><li><p>Performance: In traditional data warehouses, data is moved from storage to processing, which can be slow and time-consuming, leading to performance issues. In modern data lakes, data is stored in a central repository, and processing is done where the data resides. This approach eliminates the need for data movement, leading to faster processing and improved performance.</p></li></ol><h2 id="heading-optimized-file-formats">Optimized file formats</h2><p>As an example, Parquet is an open-source columnar storage format for data lakes that is widely used in modern data lakes. Parquet stores data in columns rather than rows, which enables it to perform selective queries faster and more efficiently than traditional row-based storage formats.</p><p>Additionally, Parquet supports compression, which reduces storage requirements and improves data processing performance. 
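</p><p>As a brief illustrative sketch (using DuckDB SQL; the file and column names are hypothetical), a columnar engine only has to read the columns a query actually references:</p><pre><code class="lang-sql">-- Only the page_url and visitor_id columns are read from the file;
-- all other columns of the Parquet file are skipped entirely
SELECT page_url, COUNT(DISTINCT visitor_id) AS visitors
FROM 'pageviews.parquet'
GROUP BY page_url;</code></pre><p>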
It's supported by many big data processing engines, including Apache Hadoop, Apache Spark, Apache Drill and many services of public cloud providers, such as Amazon Athena and AWS Glue.</p><h2 id="heading-hive-partitioning-andamp-query-filter-pushdown">Hive partitioning &amp; query filter pushdown</h2><p>The so-called "Hive partitioning" is a technique used in data lakes that involves dividing data into smaller, more manageable parts, called partitions, based on specific criteria such as date, time, or location.</p><p>Partitioning can help improve query performance and reduce data processing time by allowing users to select only the relevant partitions, rather than scanning the entire dataset.</p><p>Query filter pushdown is another optimization technique used in Apache Hive and other services that involves pushing down query filters into the storage layer, allowing it to eliminate irrelevant data before processing the query.</p><p>Combining Hive partitioning and query filter pushdown can result in significant performance gains in data processing, as the query filters can eliminate large amounts of irrelevant data at the partition level, reducing the amount of data that needs to be processed. Therefore, Hive partitioning and query filter pushdown are essential techniques for optimizing data processing performance in data lakes.</p><h2 id="heading-repartitioning-of-data">Repartitioning of data</h2><p>Repartitioning Parquet data in data lakes is a useful technique that involves redistributing data across partitions based on specific criteria. This technique can help optimize query performance and reduce data shuffling during big data processing.</p><p>For instance, if a large amount of data is stored in a single partition, querying that data may take longer than if the data were spread across several partitions. 
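</p><p>As a hedged sketch of such a repartitioning step (DuckDB syntax; the file and column names are hypothetical), a flat Parquet file can be rewritten into a Hive-partitioned directory tree with a single statement:</p><pre><code class="lang-sql">-- Assumes pageviews.parquet contains year and month columns; this writes
-- one folder per partition combination, e.g. pageviews/year=2023/month=4/
COPY (SELECT * FROM 'pageviews.parquet')
TO 'pageviews' (FORMAT 'parquet', PARTITION_BY (year, month));

-- A filter on the partition columns now skips all non-matching folders
SELECT COUNT(*)
FROM read_parquet('pageviews/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2023 AND month = 4;</code></pre><p>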
Or, you could write aggregation queries whose output contains much less data, which could improve query speeds significantly.</p><h1 id="heading-the-use-case">The use case</h1><p>Data privacy and the GDPR have been much-discussed topics in recent years. A lot of existing web tracking solutions were deemed non-compliant, especially in the EU. Thus, individuals and companies eventually had to change their Web Analytics providers, which led to the rise of new, data privacy-focused companies in this space (e.g. <a target="_blank" href="https://usefathom.com/">Fathom Analytics</a>, <a target="_blank" href="https://www.simpleanalytics.com">SimpleAnalytics</a>, and <a target="_blank" href="https://www.plausible.io">Plausible</a> just to name a few).</p><p>The pricing of those providers can get relatively steep quite fast if you have a higher number of pageviews ($74/mo for 2m at Fathom, $99/mo for 1m at SimpleAnalytics, $89/mo for 2m at Plausible). Also, if you're using a provider, you normally don't own your data.</p><p>So, <strong>let's try to build a web tracking and analytics service on AWS on the cheap, while owning our data, adhering to data privacy laws and using scalable serverless cloud services to avoid having to manage infrastructure by ourselves.</strong> And have some fun and learn a bit while doing it :-)</p><h1 id="heading-high-level-architecture">High-level architecture</h1><p>The overall architecture for the outlined use case looks like this:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1681729029658/08a0a6e0-b1d1-4001-896c-1ffd88b98205.png" alt class="image--center mx-auto" /></p><p>The details will be described further for each layer in the coming paragraphs. For brevity, the focus lies on the main data processing layers.
Other layers, such as cataloging, consumption and security &amp; governance, will be covered in upcoming blog posts.</p><h1 id="heading-serving-layer">Serving layer</h1><p>The serving layer is not a part of the data lake. Its main goal is to serve static assets, such as the tracking JavaScript libraries (those will be covered in more detail in another part of this series), and the 1x1 pixel GIF files that serve as endpoints to which the tracking library can push its gathered data. This is done by sending the JSON payload as URL-encoded query strings.</p><p>In our use case, we want to leverage existing AWS services and optimize our costs, while providing great response times. From an architectural perspective, there are many ways we could set up this data-gathering endpoint. <a target="_blank" href="https://aws.amazon.com/cloudfront/">Amazon CloudFront</a> is a CDN that currently has <a target="_blank" href="https://aws.amazon.com/cloudfront/features/?whats-new-cloudfront.sort-by=item.additionalFields.postDateTime&amp;whats-new-cloudfront.sort-order=desc">over 90 edge locations worldwide</a>, thus providing great latencies compared to classical webservers or APIs that are deployed in one or more regions.</p><p>It also has a <a target="_blank" href="https://aws.amazon.com/cloudfront/pricing/">very generous free tier</a> (1TB of outgoing traffic and 10M requests), and its <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html">real-time logs feature</a> provides a great and very cost-effective way ($0.01 for every 1M log lines) to set up such an endpoint by just storing a 1x1px GIF with appropriate caching headers, to which the <strong>JavaScript tracking library will send its payload as an encoded query string</strong>.</p><p>CloudFront can use S3 as a so-called origin (where the assets will be loaded from if they aren't yet in the edge caches), and that's where the static asset data will be
located. Between the CloudFront distribution and the S3 bucket, an <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-s3.html">Origin Access Identity</a> will be created, which enables secure communication between both services and avoids that the S3 bucket needs to be publicly accessible.</p><p>To configure CloudFront real-time logs that contain the <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html#understand-real-time-log-config-fields">necessary information</a>, a <a target="_blank" href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cloudfront-realtimelogconfig.html">RealtimeLogConfig</a> needs to be <a target="_blank" href="https://github.com/ownstats/ownstats/blob/main/analytics-backend/resources/cf-distribution.yml#L8-L33">created</a>. This acts as "glue" between the CloudFront distribution and the Kinesis Data Stream that consumes the logs:</p><pre><code class="lang-yaml"><span class="hljs-attr">CFRealtimeLogsConfig:</span>  <span class="hljs-attr">Type:</span> <span class="hljs-string">AWS::CloudFront::RealtimeLogConfig</span>  <span class="hljs-attr">Properties:</span>     <span class="hljs-attr">EndPoints:</span>       <span class="hljs-bullet">-</span> <span class="hljs-attr">StreamType:</span> <span class="hljs-string">Kinesis</span>        <span class="hljs-attr">KinesisStreamConfig:</span>          <span class="hljs-attr">RoleArn:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisDataRole.Arn'</span>          <span class="hljs-attr">StreamArn:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisStream.Arn'</span>    <span class="hljs-attr">Fields:</span>       <span class="hljs-bullet">-</span> <span class="hljs-string">timestamp</span>      <span class="hljs-bullet">-</span> <span 
class="hljs-string">c-ip</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">sc-status</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">cs-uri-stem</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">cs-bytes</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">x-edge-location</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">time-taken</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">cs-user-agent</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">cs-referer</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">cs-uri-query</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">x-edge-result-type</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">asn</span>    <span class="hljs-attr">Name:</span> <span class="hljs-string">'${self:service}-cdn-realtime-log-config'</span>    <span class="hljs-comment"># IMPORTANT: This setting makes sure we receive all the log lines, otherwise it's just sampled!</span>    <span class="hljs-attr">SamplingRate:</span> <span class="hljs-number">100</span></code></pre><h1 id="heading-ingestion-layer">Ingestion layer</h1><p>The ingestion layer mainly consists of two services: a <a target="_blank" href="https://aws.amazon.com/kinesis/data-streams/">Kinesis Data Stream</a>, which is the <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/real-time-logs.html#real-time-log-consumer-guidance">consumer</a> of the real-time logs feature of CloudFront, and a <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html">Kinesis Data Firehose Delivery Stream</a>, which will back up the raw data in S3, and also store the data as partitioned parquet files in another S3 bucket.
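To give an idea of what the consuming side has to do with such records, the following sketch splits one tab-separated real-time log line back into the fields configured above (the helper names are illustrative and not part of the project's code):

```javascript
// Field order must match the "Fields" list of the RealtimeLogConfig above.
const LOG_FIELDS = [
  'timestamp', 'c-ip', 'sc-status', 'cs-uri-stem', 'cs-bytes',
  'x-edge-location', 'time-taken', 'cs-user-agent', 'cs-referer',
  'cs-uri-query', 'x-edge-result-type', 'asn',
];

// Split one newline-terminated, tab-separated log line into named fields.
function parseLogLine(line) {
  const values = line.trimEnd().split('\t');
  return Object.fromEntries(LOG_FIELDS.map((name, i) => [name, values[i]]));
}

// Kinesis hands each record to a consumer base64-encoded, so decode first.
function parseKinesisRecord(base64Data) {
  return parseLogLine(Buffer.from(base64Data, 'base64').toString('utf-8'));
}

const record = parseLogLine(
  '1682084241.469\t2003:e1::9dad\t304\t/hello.gif\t789\tHAM50-P2\t0.001\tMozilla/5.0\t-\tt=pv\tHit\t3320\n'
);
// record['x-edge-location'] === 'HAM50-P2', record['asn'] === '3320'
```

The transformation Lambda described further below starts from exactly this kind of decoding step before it enriches the data.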
Both S3 buckets are part of the storage layer.</p><p>The <a target="_blank" href="https://aws.amazon.com/kinesis/data-streams/pricing/?nc=sn&amp;loc=3">Kinesis Data Stream</a> (one shard in provisioned mode) provides an ingest capacity of 1 MB/second or 1,000 records/second, for a price of $0.015/hour in us-east-1, plus $0.014 per 1M PUT payload units. It forwards the incoming data to the Kinesis Data Firehose Delivery Stream, whose <a target="_blank" href="https://aws.amazon.com/kinesis/data-firehose/pricing/?nc=sn&amp;loc=3">pricing</a> is more complex: the ingestion costs $0.029/GB, the format conversion $0.018/GB, and the dynamic partitioning $0.02/GB. That sums up to $0.067/GB ingested and written to S3, plus the S3 costs of $0.005 per 1,000 PUT object calls.</p><p>The Kinesis Data Firehose Delivery Stream uses <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html">data transformation</a> and <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html">dynamic partitioning</a> with a Lambda function, which cleans, transforms and enriches the data so that it can be stored in S3 as parquet files with appropriate Hive partitions.</p><p>The Delivery Stream has so-called <a target="_blank" href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-bufferinghints.html">BufferingHints</a>, which define the size (1 to 128 MB) and the interval (60 to 900 seconds) at which the data is flushed to S3, whichever threshold is reached first. The interval therefore defines the maximum latency until the data gets persisted in the data lake. The Lambda function is part of the processing layer and is discussed below.</p><p>The CloudFormation <a target="_blank" href="https://github.com/ownstats/ownstats/blob/main/analytics-backend/resources/kinesis.yml#L13-L100">resource definition</a> for the Kinesis Data Firehose Delivery Stream can be found below.
It sources its variables from the <a target="_blank" href="https://github.com/ownstats/ownstats/blob/main/analytics-backend/serverless.yml#L34-L42">serverless.yml</a>:</p><pre><code class="lang-yaml"><span class="hljs-attr">AnalyticsKinesisFirehose:</span>  <span class="hljs-attr">Type:</span> <span class="hljs-string">'AWS::KinesisFirehose::DeliveryStream'</span>  <span class="hljs-attr">Properties:</span>    <span class="hljs-attr">DeliveryStreamName:</span> <span class="hljs-string">${self:custom.kinesis.delivery.name}</span>    <span class="hljs-attr">DeliveryStreamType:</span> <span class="hljs-string">KinesisStreamAsSource</span>    <span class="hljs-comment"># Source configuration</span>    <span class="hljs-attr">KinesisStreamSourceConfiguration:</span>      <span class="hljs-attr">KinesisStreamARN:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisStream.Arn'</span>      <span class="hljs-attr">RoleARN:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisFirehoseRole.Arn'</span>    <span class="hljs-comment"># Necessary configuration to transform and write data to S3 as parquet files</span>    <span class="hljs-attr">ExtendedS3DestinationConfiguration:</span>      <span class="hljs-attr">BucketARN:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'CleanedBucket.Arn'</span>      <span class="hljs-attr">BufferingHints:</span>        <span class="hljs-attr">IntervalInSeconds:</span> <span class="hljs-string">${self:custom.kinesis.delivery.limits.intervalInSeconds}</span>        <span class="hljs-attr">SizeInMBs:</span> <span class="hljs-string">${self:custom.kinesis.delivery.limits.sizeInMB}</span>      <span class="hljs-comment"># This enables logging to CloudWatch for better debugging possibilities</span>      <span class="hljs-attr">CloudWatchLoggingOptions:</span>        <span class="hljs-attr">Enabled:</span> <span class="hljs-literal">True</span>        
<span class="hljs-attr">LogGroupName:</span> <span class="hljs-string">${self:custom.logs.groupName}</span>        <span class="hljs-attr">LogStreamName:</span> <span class="hljs-string">${self:custom.logs.streamName}</span>      <span class="hljs-attr">DataFormatConversionConfiguration:</span>        <span class="hljs-attr">Enabled:</span> <span class="hljs-literal">True</span>        <span class="hljs-comment"># Define the input format</span>        <span class="hljs-attr">InputFormatConfiguration:</span>           <span class="hljs-attr">Deserializer:</span>             <span class="hljs-attr">OpenXJsonSerDe:</span>               <span class="hljs-attr">CaseInsensitive:</span> <span class="hljs-literal">True</span>        <span class="hljs-comment"># Define the output format</span>        <span class="hljs-attr">OutputFormatConfiguration:</span>           <span class="hljs-attr">Serializer:</span>             <span class="hljs-attr">ParquetSerDe:</span>               <span class="hljs-attr">Compression:</span> <span class="hljs-string">SNAPPY</span>              <span class="hljs-attr">WriterVersion:</span> <span class="hljs-string">V1</span>        <span class="hljs-comment"># The schema configuration based on Glue tables</span>        <span class="hljs-attr">SchemaConfiguration:</span>           <span class="hljs-attr">RoleArn:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisFirehoseRole.Arn'</span>          <span class="hljs-attr">DatabaseName:</span> <span class="hljs-string">'${self:custom.glue.database}'</span>          <span class="hljs-attr">TableName:</span> <span class="hljs-string">'incoming_events'</span>      <span class="hljs-comment"># Enable dynamic partitioning</span>      <span class="hljs-attr">DynamicPartitioningConfiguration:</span>          <span class="hljs-attr">Enabled:</span> <span class="hljs-literal">True</span>      <span class="hljs-comment"># Enable Lambda function for pre-processing the 
Kinesis records</span>      <span class="hljs-attr">ProcessingConfiguration:</span>        <span class="hljs-attr">Enabled:</span> <span class="hljs-literal">True</span>        <span class="hljs-attr">Processors:</span>           <span class="hljs-bullet">-</span> <span class="hljs-attr">Type:</span> <span class="hljs-string">Lambda</span>            <span class="hljs-attr">Parameters:</span>               <span class="hljs-bullet">-</span> <span class="hljs-attr">ParameterName:</span> <span class="hljs-string">NumberOfRetries</span>                <span class="hljs-attr">ParameterValue:</span> <span class="hljs-number">3</span>              <span class="hljs-bullet">-</span> <span class="hljs-attr">ParameterName:</span> <span class="hljs-string">BufferIntervalInSeconds</span>                <span class="hljs-attr">ParameterValue:</span> <span class="hljs-number">60</span>              <span class="hljs-bullet">-</span> <span class="hljs-attr">ParameterName:</span> <span class="hljs-string">BufferSizeInMBs</span>                <span class="hljs-attr">ParameterValue:</span> <span class="hljs-number">3</span>              <span class="hljs-bullet">-</span> <span class="hljs-attr">ParameterName:</span> <span class="hljs-string">LambdaArn</span>                <span class="hljs-attr">ParameterValue:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'ProcessKinesisRecordsLambdaFunction.Arn'</span>      <span class="hljs-comment"># Enable backups for the raw incoming data</span>      <span class="hljs-attr">S3BackupMode:</span> <span class="hljs-string">Enabled</span>      <span class="hljs-attr">S3BackupConfiguration:</span>        <span class="hljs-attr">BucketARN:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'RawBucket.Arn'</span>        <span class="hljs-attr">BufferingHints:</span>          <span class="hljs-attr">IntervalInSeconds:</span> <span 
class="hljs-string">${self:custom.kinesis.delivery.limits.intervalInSeconds}</span>          <span class="hljs-attr">SizeInMBs:</span> <span class="hljs-string">${self:custom.kinesis.delivery.limits.sizeInMB}</span>        <span class="hljs-comment"># Disable logging to CloudWatch for raw data</span>        <span class="hljs-attr">CloudWatchLoggingOptions:</span>          <span class="hljs-attr">Enabled:</span> <span class="hljs-literal">false</span>        <span class="hljs-attr">CompressionFormat:</span> <span class="hljs-string">GZIP</span>        <span class="hljs-attr">Prefix:</span> <span class="hljs-string">'${self:custom.prefixes.raw}'</span>        <span class="hljs-attr">ErrorOutputPrefix:</span> <span class="hljs-string">'${self:custom.prefixes.error}'</span>        <span class="hljs-attr">RoleARN:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisFirehoseRole.Arn'</span>      <span class="hljs-attr">RoleARN:</span> <span class="hljs-type">!GetAtt</span> <span class="hljs-string">'AnalyticsKinesisFirehoseRole.Arn'</span>      <span class="hljs-comment"># Define output S3 prefixes</span>      <span class="hljs-attr">Prefix:</span> <span class="hljs-string">'${self:custom.prefixes.incoming}/domain_name=!{partitionKeyFromLambda:domain_name}/event_type=!{partitionKeyFromLambda:event_type}/event_date=!{partitionKeyFromLambda:event_date}/'</span>      <span class="hljs-attr">ErrorOutputPrefix:</span> <span class="hljs-string">'${self:custom.prefixes.error}'</span></code></pre><h1 id="heading-processing-layer">Processing layer</h1><p>The processing layer consists of two parts, the Lambda function that is used for the dynamic partitioning of the incoming data, and a Lambda function that uses the <a target="_blank" href="https://duckdb.org/docs/data/partitioning/partitioned_writes.html">COPY TO PARTITION BY</a> feature of DuckDB to aggregate and repartition the ingested, enriched and stored page views data.</p><h2 
id="heading-data-transformation-andamp-dynamic-partitioning-lambda">Data transformation &amp; Dynamic partitioning Lambda</h2><p><a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html">Data transformation</a> is a Kinesis Data Firehose Delivery Stream feature that enables the cleaning, transformation and enrichment of incoming records in a batched manner. In combination with the <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html#dynamic-partitioning-partitioning-keys">dynamic partitioning feature</a>, this provides powerful data handling capabilities while the data is still "on stream". When writing data to S3 as parquet files, a <a target="_blank" href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-schemaconfiguration.html">schema configuration in the form of a Glue Table</a> needs to be defined as well to make it work (see "Cataloging &amp; search layer" below).</p><p>It's also necessary to define a <a target="_blank" href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-processorparameter.html">buffer configuration</a> for the Lambda function: the buffer interval (set to 60 seconds here, which adds a maximum delay of one minute to the stream data), the buffer size in MB (between 0.2 and 3), and the number of retries (3 is a reasonable default).</p><p>The input records coming from the Kinesis Data Firehose Delivery Stream are base64-encoded strings that contain the log lines from the CloudFront distribution:</p><pre><code
class="lang-bash">MTY4MjA4NDI0MS40NjlcdDIwMDM6ZTE6YmYxZjo3YzAwOjhlYjoxOGY4OmExZmI6OWRhZFx0MzA0XHQvaGVsbG8uZ2lmP3Q9cHYmdHM9MTY4MjA4MzgwNDc2OCZ1PWh0dHBzJTI1M0ElMjUyRiUyNTJGbXlkb21haW4udGxkJTI1MkYmaG49bXlkb21haW4udGxkJnBhPSUyNTJGJnVhPU1vemlsbGElMjUyRjUuMCUyNTIwKE1hY2ludG9zaCUyNTNCJTI1MjBJbnRlbCUyNTIwTWFjJTI1MjBPUyUyNTIwWCUyNTIwMTBfMTVfNyklMjUyMEFwcGxlV2ViS2l0JTI1MkY1MzcuMzYlMjUyMChLSFRNTCUyNTJDJTI1MjBsaWtlJTI1MjBHZWNrbyklMjUyMENocm9tZSUyNTJGMTEyLjAuMC4wJTI1MjBTYWZhcmklMjUyRjUzNy4zNiZpdz0xMjkyJmloPTkyNiZ0aT1NeSUyNTIwRG9tYWluJnc9MzQ0MCZoPTE0NDAmZD0yNCZsPWRlLURFJnA9TWFjSW50ZWwmbT04JmM9OCZ0ej1FdXJvcGUlMjUyRkJlcmxpblx0Nzg5XHRIQU01MC1QMlx0MC4wMDFcdE1vemlsbGEvNS4wJTIwKE1hY2ludG9zaDslMjBJbnRlbCUyME1hYyUyME9TJTIwWCUyMDEwXzE1XzcpJTIwQXBwbGVXZWJLaXQvNTM3LjM2JTIwKEtIVE1MLCUyMGxpa2UlMjBHZWNrbyklMjBDaHJvbWUvMTEyLjAuMC4wJTIwU2FmYXJpLzUzNy4zNlx0LVx0dD1wdiZ0cz0xNjgyMDgzODA0NzY4JnU9aHR0cHMlMjUzQSUyNTJGJTI1MkZteWRvbWFpbi50bGQlMjUyRiZobj1teWRvbWFpbi50bGQmcGE9JTI1MkYmdWE9TW96aWxsYSUyNTJGNS4wJTI1MjAoTWFjaW50b3NoJTI1M0IlMjUyMEludGVsJTI1MjBNYWMlMjUyME9TJTI1MjBYJTI1MjAxMF8xNV83KSUyNTIwQXBwbGVXZWJLaXQlMjUyRjUzNy4zNiUyNTIwKEtIVE1MJTI1MkMlMjUyMGxpa2UlMjUyMEdlY2tvKSUyNTIwQ2hyb21lJTI1MkYxMTIuMC4wLjAlMjUyMFNhZmFyaSUyNTJGNTM3LjM2Jml3PTEyOTImaWg9OTI2JnRpPU15JTI1MjBEb21haW4mdz0zNDQwJmg9MTQ0MCZkPTI0Jmw9ZGUtREUmcD1NYWNJbnRlbCZtPTgmYz04JnR6PUV1cm9wZSUyNTJGQmVybGluXHRIaXRcdDMzMjBcbg==</code></pre><p>After decoding, the logline is visible and contains the info from the real-time log fields, which are tab-separated and contain newlines:</p><pre><code 
class="lang-bash">1682084241.469\t2003:e1:bf1f:7c00:8eb:18f8:a1fb:9dad\t304\t/hello.gif?t=pv&amp;ts=1682083804768&amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;hn=mydomain.tld&amp;pa=%252F&amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;iw=1292&amp;ih=926&amp;ti=My%2520Domain&amp;w=3440&amp;h=1440&amp;d=24&amp;l=de-DE&amp;p=MacIntel&amp;m=8&amp;c=8&amp;tz=Europe%252FBerlin\t789\tHAM50-P2\t0.001\tMozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_15_7)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/112.0.0.0%20Safari/537.36\t-\tt=pv&amp;ts=1682083804768&amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;hn=mydomain.tld&amp;pa=%252F&amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;iw=1292&amp;ih=926&amp;ti=My%2520Domain&amp;w=3440&amp;h=1440&amp;d=24&amp;l=de-DE&amp;p=MacIntel&amp;m=8&amp;c=8&amp;tz=Europe%252FBerlin\tHit\t3320\n</code></pre><p>During transformation and enrichment, the following steps are performed:</p><ul><li><p>Validate the source record</p></li><li><p>Enrich the <a target="_blank" href="https://www.npmjs.com/package/ua-parser-js">browser and device data</a> from the user agent string</p></li><li><p>Determine whether the record was generated by a <a target="_blank" href="https://www.npmjs.com/package/isbot">bot</a> (by user agent string)</p></li><li><p>Add the nearest geographical information based on <a target="_blank" href="https://www.npmjs.com/package/aws-edge-locations">edge locations</a></p></li><li><p>Compute the referer</p></li><li><p>Derive the requested URI</p></li><li><p>Compute UTM data</p></li><li><p>Get the event type (either a page view or a tracking event)</p></li><li><p>Build the time hierarchy (year, month, day, event
timestamp)</p></li><li><p>Compute data arrival delays (data/process metrics)</p></li><li><p>Generate hashes for page view, daily page view and daily visitor ids (later used to calculate page views and visits)</p></li><li><p>Add metadata with the partition key values (in our case, the partition keys are <strong>domain_name</strong>, <strong>event_date</strong>, and <strong>event_type</strong>), to be able to use the dynamic partitioning feature</p></li></ul><p>The generated JSON looks like this:</p><pre><code class="lang-json">{  <span class="hljs-attr">"result"</span>: <span class="hljs-string">"Ok"</span>,  <span class="hljs-attr">"error"</span>: <span class="hljs-literal">null</span>,  <span class="hljs-attr">"data"</span>: {    <span class="hljs-attr">"event_year"</span>: <span class="hljs-number">2023</span>,    <span class="hljs-attr">"event_month"</span>: <span class="hljs-number">4</span>,    <span class="hljs-attr">"event_day"</span>: <span class="hljs-number">21</span>,    <span class="hljs-attr">"event_timestamp"</span>: <span class="hljs-string">"2023-04-21T13:30:04.768Z"</span>,    <span class="hljs-attr">"arrival_timestamp"</span>: <span class="hljs-string">"2023-04-21T13:37:21.000Z"</span>,    <span class="hljs-attr">"arrival_delay_ms"</span>: <span class="hljs-number">-436232</span>,    <span class="hljs-attr">"edge_city"</span>: <span class="hljs-string">"Hamburg"</span>,    <span class="hljs-attr">"edge_state"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"edge_country"</span>: <span class="hljs-string">"Germany"</span>,    <span class="hljs-attr">"edge_country_code"</span>: <span class="hljs-string">"DE"</span>,    <span class="hljs-attr">"edge_latitude"</span>: <span class="hljs-number">53.630401611328</span>,    <span class="hljs-attr">"edge_longitude"</span>: <span class="hljs-number">9.9882297515869</span>,    <span class="hljs-attr">"edge_id"</span>: <span class="hljs-string">"HAM"</span>,    <span 
class="hljs-attr">"referer"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"referer_domain_name"</span>: <span class="hljs-string">"Direct / None"</span>,    <span class="hljs-attr">"browser_name"</span>: <span class="hljs-string">"Chrome"</span>,    <span class="hljs-attr">"browser_version"</span>: <span class="hljs-string">"112.0.0.0"</span>,    <span class="hljs-attr">"browser_os_name"</span>: <span class="hljs-string">"Mac OS"</span>,    <span class="hljs-attr">"browser_os_version"</span>: <span class="hljs-string">"10.15.7"</span>,    <span class="hljs-attr">"browser_timezone"</span>: <span class="hljs-string">"Europe/Berlin"</span>,    <span class="hljs-attr">"browser_language"</span>: <span class="hljs-string">"de-DE"</span>,    <span class="hljs-attr">"device_type"</span>: <span class="hljs-string">"Desktop"</span>,    <span class="hljs-attr">"device_vendor"</span>: <span class="hljs-string">"Apple"</span>,    <span class="hljs-attr">"device_outer_resolution"</span>: <span class="hljs-string">"3440x1440"</span>,    <span class="hljs-attr">"device_inner_resolution"</span>: <span class="hljs-string">"1292x926"</span>,    <span class="hljs-attr">"device_color_depth"</span>: <span class="hljs-number">24</span>,    <span class="hljs-attr">"device_platform"</span>: <span class="hljs-string">"MacIntel"</span>,    <span class="hljs-attr">"device_memory"</span>: <span class="hljs-number">8</span>,    <span class="hljs-attr">"device_cores"</span>: <span class="hljs-number">8</span>,    <span class="hljs-attr">"utm_source"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"utm_campaign"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"utm_medium"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"utm_content"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"utm_term"</span>: <span class="hljs-literal">null</span>,    <span 
class="hljs-attr">"request_url"</span>: <span class="hljs-string">"https://mydomain.tld/"</span>,    <span class="hljs-attr">"request_path"</span>: <span class="hljs-string">"/"</span>,    <span class="hljs-attr">"request_query_string"</span>: <span class="hljs-string">"t=pv&amp;ts=1682083804768&amp;u=https%253A%252F%252Fmydomain.tld%252F&amp;hn=mydomain.tld&amp;pa=%252F&amp;ua=Mozilla%252F5.0%2520(Macintosh%253B%2520Intel%2520Mac%2520OS%2520X%252010_15_7)%2520AppleWebKit%252F537.36%2520(KHTML%252C%2520like%2520Gecko)%2520Chrome%252F112.0.0.0%2520Safari%252F537.36&amp;iw=1292&amp;ih=926&amp;ti=My%2520Domain&amp;w=3440&amp;h=1440&amp;d=24&amp;l=de-DE&amp;p=MacIntel&amp;m=8&amp;c=8&amp;tz=Europe%252FBerlin"</span>,    <span class="hljs-attr">"request_bytes"</span>: <span class="hljs-number">789</span>,    <span class="hljs-attr">"request_status_code"</span>: <span class="hljs-number">304</span>,    <span class="hljs-attr">"request_cache_status"</span>: <span class="hljs-string">"Hit"</span>,    <span class="hljs-attr">"request_delivery_time_ms"</span>: <span class="hljs-number">1</span>,    <span class="hljs-attr">"request_asn"</span>: <span class="hljs-number">3320</span>,    <span class="hljs-attr">"request_is_bot"</span>: <span class="hljs-number">0</span>,    <span class="hljs-attr">"event_name"</span>: <span class="hljs-literal">null</span>,    <span
class="hljs-attr">"event_data"</span>: <span class="hljs-literal">null</span>,    <span class="hljs-attr">"page_view_id"</span>: <span class="hljs-string">"f4e1939bc259131659b00cd5f73e55a5bed04fbfa63f095b561fd87009d0a228"</span>,    <span class="hljs-attr">"daily_page_view_id"</span>: <span class="hljs-string">"7c82d13036aa2cfe04720e0388bb8645eb90de084bd50cf69356fa8ec9d8b407"</span>,    <span class="hljs-attr">"daily_visitor_id"</span>: <span class="hljs-string">"9f0ac3a2560cfa6d5c3494e1891d284225e15f088414390a40fece320021a658"</span>,    <span class="hljs-attr">"domain_name"</span>: <span class="hljs-string">"mydomain.tld"</span>,    <span class="hljs-attr">"event_date"</span>: <span class="hljs-string">"2023-04-21"</span>,    <span class="hljs-attr">"event_type"</span>: <span class="hljs-string">"pageview"</span>  },  <span class="hljs-attr">"metadata"</span>: {    <span class="hljs-attr">"partitionKeys"</span>: {      <span class="hljs-attr">"domain_name"</span>: <span class="hljs-string">"mydomain.tld"</span>,      <span class="hljs-attr">"event_date"</span>: <span class="hljs-string">"2023-04-21"</span>,      <span class="hljs-attr">"event_type"</span>: <span class="hljs-string">"pageview"</span>    }  }}</code></pre><p>Then, the following steps are done by the Lambda function:</p><ul><li><p>Encode the JSON stringified records in base64 again</p></li><li><p>Return them to the Kinesis Data Firehose Delivery Stream, which will then persist the data based on the defined <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html#dynamic-partitioning-namespaces">prefix</a> in the S3 bucket for incoming data.</p></li></ul><h2 id="heading-aggregation-lambda">Aggregation Lambda</h2><p>As the ingested data contains information on a single request level, it makes sense to aggregate the data so that queries can be run optimally, and query response times are reduced.</p><p>The aggregation Lambda function is based on <a target="_blank" 
href="https://github.com/tobilg/serverless-parquet-repartitioner">tobilg/serverless-parquet-repartitioner</a>, which also has an <a target="_blank" href="https://tobilg.com/using-duckdb-to-repartition-parquet-data-in-s3">accompanying blog post</a> that explains in more detail how the <a target="_blank" href="https://github.com/tobilg/duckdb-nodejs-layer">DuckDB Lambda Layer</a> can be used to repartition or aggregate existing data in S3.</p><p>The Lambda function is scheduled to run each night at 00:30, which makes sure that all the Kinesis Firehose Delivery Stream output files of the last day have been written to S3 (the maximum buffer time is 15 minutes).</p><p>When it runs, it does three things:</p><ul><li><p>Create a session aggregation that derives the session information and whether individual requests were bounces or not</p></li><li><p>Calculate the pageviews and visitor numbers, broken down by several dimensions which are later needed for querying (see <code>stats</code> table below)</p></li><li><p>Store an extraction of the event data separately, repartitioned by <code>event_name</code> to speed up queries</p></li></ul><p>The <a target="_blank" href="https://github.com/ownstats/ownstats/blob/main/analytics-backend/functions/utils/queryRenderer.js#L5-L184">queries</a> can be inspected in the accompanying repository to get an idea about the sophisticated query patterns DuckDB supports.</p><h1 id="heading-storage-layer">Storage layer</h1><p>The storage layer consists of three S3 buckets, where each conforms to a zone outlined in the reference architecture diagram (see above):</p><ul><li><p>A <strong>raw</strong> bucket, where the raw data incoming to the Kinesis Firehose Delivery Stream is backed up (partitioned by <code>event_date</code>)</p></li><li><p>A <strong>cleaned</strong> bucket, where the data is stored by the Kinesis Firehose Delivery Stream (partitioned by <code>domain_name</code>, <code>event_date</code> and
<code>event_type</code>)</p></li><li><p>A <strong>curated</strong> bucket, where the aggregated pageviews and visitors data are stored (partitioned by <code>domain_name</code> and <code>event_date</code>), as well as the aggregated and filtered events (partitioned by <code>domain_name</code>, <code>event_date</code> and <code>event_name</code>)</p></li></ul><h1 id="heading-cataloging-andamp-search-layer">Cataloging &amp; search layer</h1><p>The Kinesis Data Firehose Delivery Stream needs a <a target="_blank" href="https://docs.aws.amazon.com/glue/latest/dg/tables-described.html">Glue table</a> that holds the <a target="_blank" href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-schemaconfiguration.html">schema</a> of the parquet files to be able to produce them (<code>incoming_events</code> table). The <code>stats</code> and the <code>events</code> tables are aggregated daily from the base <code>incoming_events</code> table via cron jobs scheduled by Amazon <a target="_blank" href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html">EventBridge Rules</a> at 00:30.</p><h2 id="heading-incomingevents-table">incoming_events table</h2><p>This table stores the events that are the result of the data transformation and dynamic partitioning Lambda function. 
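For illustration, an abbreviated sketch of what such a Glue table resource could look like in CloudFormation (the logical ID, the S3 location and the shown column subset are assumptions for this sketch, not the project's actual definition):

```yaml
# Hypothetical, abbreviated Glue table for the Firehose schema configuration
IncomingEventsGlueTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref 'AWS::AccountId'
    DatabaseName: '${self:custom.glue.database}'
    TableInput:
      Name: 'incoming_events'
      TableType: EXTERNAL_TABLE
      # Partition keys are defined separately from the regular columns
      PartitionKeys:
        - { Name: domain_name, Type: string }
        - { Name: event_date, Type: string }
        - { Name: event_type, Type: string }
      StorageDescriptor:
        Location: 's3://my-cleaned-bucket/incoming/'  # placeholder bucket/prefix
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
        # Column subset; the full column list corresponds to the schema table below
        Columns:
          - { Name: event_timestamp, Type: timestamp }
          - { Name: request_url, Type: string }
          - { Name: page_view_id, Type: string }
```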
The schema for the table <code>incoming_events</code> looks like this:</p><div class="hn-table"><table><thead><tr><td>Column name</td><td>Data type</td><td>Is partition key?</td><td>Description</td></tr></thead><tbody><tr><td>domain_name</td><td>string</td><td>yes</td><td>The domain name</td></tr><tr><td>event_date</td><td>string</td><td>yes</td><td>The date of the event (YYYY-MM-DD), as string</td></tr><tr><td>event_type</td><td>string</td><td>yes</td><td>The type of the event (<code>pageview</code> or <code>track</code>)</td></tr><tr><td>event_year</td><td>int</td><td>no</td><td>The year of the event_date (YYYY)</td></tr><tr><td>event_month</td><td>int</td><td>no</td><td>The month of the event (MM)</td></tr><tr><td>event_day</td><td>int</td><td>no</td><td>The day of the event (DD)</td></tr><tr><td>event_timestamp</td><td>timestamp</td><td>no</td><td>The exact event timestamp</td></tr><tr><td>arrival_timestamp</td><td>timestamp</td><td>no</td><td>The exact timestamp when the event arrived in the Kinesis Data Stream</td></tr><tr><td>arrival_delay_ms</td><td>int</td><td>no</td><td>The difference between event_timestamp and arrival_timestamp in milliseconds</td></tr><tr><td>edge_city</td><td>string</td><td>no</td><td>The name of the edge city (all edge location info is derived from the <code>x-edge-location</code> field in the <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat">logs</a>)</td></tr><tr><td>edge_state</td><td>string</td><td>no</td><td>The state of the edge location</td></tr><tr><td>edge_country</td><td>string</td><td>no</td><td>The country of the edge location</td></tr><tr><td>edge_country_code</td><td>string</td><td>no</td><td>The country code of the edge location</td></tr><tr><td>edge_latitude</td><td>float</td><td>no</td><td>The latitude of the edge location</td></tr><tr><td>edge_longitude</td><td>float</td><td>no</td><td>The longitude of the edge 
location</td></tr><tr><td>edge_id</td><td>string</td><td>no</td><td>The original id of the edge location</td></tr><tr><td>referrer</td><td>string</td><td>no</td><td>The referrer</td></tr><tr><td>referrer_domain_name</td><td>string</td><td>no</td><td>The domain name of the referrer</td></tr><tr><td>browser_name</td><td>string</td><td>no</td><td>The name of the browser</td></tr><tr><td>browser_version</td><td>string</td><td>no</td><td>The version of the browser</td></tr><tr><td>browser_os_name</td><td>string</td><td>no</td><td>The OS name of the browser</td></tr><tr><td>browser_os_version</td><td>string</td><td>no</td><td>The OS version of the browser</td></tr><tr><td>browser_timezone</td><td>string</td><td>no</td><td>The timezone of the browser</td></tr><tr><td>browser_language</td><td>string</td><td>no</td><td>The language of the browser</td></tr><tr><td>device_type</td><td>string</td><td>no</td><td>The device type</td></tr><tr><td>device_vendor</td><td>string</td><td>no</td><td>The device vendor</td></tr><tr><td>device_outer_resolution</td><td>string</td><td>no</td><td>The outer resolution of the device</td></tr><tr><td>device_inner_resolution</td><td>string</td><td>no</td><td>The inner resolution of the device</td></tr><tr><td>device_color_depth</td><td>int</td><td>no</td><td>The color depth of the device</td></tr><tr><td>device_platform</td><td>string</td><td>no</td><td>The platform of the device</td></tr><tr><td>device_memory</td><td>float</td><td>no</td><td>The memory of the device (in GB)</td></tr><tr><td>device_cores</td><td>int</td><td>no</td><td>The number of cores of the device</td></tr><tr><td>utm_source</td><td>string</td><td>no</td><td>Identifies which site sent the traffic</td></tr><tr><td>utm_campaign</td><td>string</td><td>no</td><td>Identifies a specific product promotion or strategic campaign</td></tr><tr><td>utm_medium</td><td>string</td><td>no</td><td>Identifies what type of link was used, such as cost per click or 
email</td></tr><tr><td>utm_content</td><td>string</td><td>no</td><td>Identifies what specifically was clicked to bring the user to the site</td></tr><tr><td>utm_term</td><td>string</td><td>no</td><td>Identifies search terms</td></tr><tr><td>request_url</td><td>string</td><td>no</td><td>The full requested URL</td></tr><tr><td>request_path</td><td>string</td><td>no</td><td>The path of the requested URL</td></tr><tr><td>request_query_string</td><td>string</td><td>no</td><td>The query string of the requested URL</td></tr><tr><td>request_bytes</td><td>int</td><td>no</td><td>The size of the request in bytes</td></tr><tr><td>request_status_code</td><td>int</td><td>no</td><td>The HTTP status code of the request</td></tr><tr><td>request_cache_status</td><td>string</td><td>no</td><td>The CloudFront cache status</td></tr><tr><td>request_delivery_time_ms</td><td>int</td><td>no</td><td>The time in ms it took for CloudFront to complete the request</td></tr><tr><td>request_asn</td><td>int</td><td>no</td><td>The <a target="_blank" href="https://www.arin.net/resources/guide/asn/">ASN</a> of the requestor</td></tr><tr><td>request_is_bot</td><td>int</td><td>no</td><td>If the request is <a target="_blank" href="https://www.npmjs.com/package/isbot">categorized as a bot</a>, the value will be <code>1</code>, if not <code>0</code></td></tr><tr><td>event_name</td><td>string</td><td>no</td><td>The name of the event for tracking events</td></tr><tr><td>event_data</td><td>string</td><td>no</td><td>The stringified event payload for tracking events</td></tr><tr><td>page_view_id</td><td>string</td><td>no</td><td>The unique pageview id</td></tr><tr><td>daily_page_view_id</td><td>string</td><td>no</td><td>The unique daily pageview id</td></tr><tr><td>daily_visitor_id</td><td>string</td><td>no</td><td>The unique daily visitor id</td></tr></tbody></table></div><h2 id="heading-stats-table">stats table</h2><p>The pageviews and visitor aggregation table. 
Its schema looks like this:</p><div class="hn-table"><table><thead><tr><td>Column name</td><td>Data type</td><td>Is partition key?</td><td>Description</td></tr></thead><tbody><tr><td>domain_name</td><td>string</td><td>yes</td><td>The domain name</td></tr><tr><td>event_date</td><td>string</td><td>yes</td><td>The date of the event (YYYY-MM-DD), as string</td></tr><tr><td>event_hour</td><td>int</td><td>no</td><td>The hour part of the event timestamp</td></tr><tr><td>edge_city</td><td>string</td><td>no</td><td>The name of the edge city (all edge location info is derived from the <code>x-edge-location</code> field in the <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat">logs</a>)</td></tr><tr><td>edge_country</td><td>string</td><td>no</td><td>The country of the edge location</td></tr><tr><td>edge_latitude</td><td>float</td><td>no</td><td>The latitude of the edge location</td></tr><tr><td>edge_longitude</td><td>float</td><td>no</td><td>The longitude of the edge location</td></tr><tr><td>referrer_domain_name</td><td>string</td><td>no</td><td>The domain name of the referrer</td></tr><tr><td>browser_name</td><td>string</td><td>no</td><td>The name of the browser</td></tr><tr><td>browser_os_name</td><td>string</td><td>no</td><td>The OS name of the browser</td></tr><tr><td>device_type</td><td>string</td><td>no</td><td>The device type</td></tr><tr><td>device_vendor</td><td>string</td><td>no</td><td>The device vendor</td></tr><tr><td>utm_source</td><td>string</td><td>no</td><td>Identifies which site sent the traffic</td></tr><tr><td>utm_campaign</td><td>string</td><td>no</td><td>Identifies a specific product promotion or strategic campaign</td></tr><tr><td>utm_medium</td><td>string</td><td>no</td><td>Identifies what type of link was used, such as cost per click or email</td></tr><tr><td>utm_content</td><td>string</td><td>no</td><td>Identifies what specifically was clicked to bring the user to the site</td></tr><tr><td>utm_term</td><td>string</td><td>no</td><td>Identifies search terms</td></tr><tr><td>request_path</td><td>string</td><td>no</td><td>The path of the requested URL</td></tr><tr><td>page_view_cnt</td><td>int</td><td>no</td><td>The number of page views</td></tr><tr><td>visitor_cnt</td><td>int</td><td>no</td><td>The number of daily visitors</td></tr><tr><td>bounces_cnt</td><td>int</td><td>no</td><td>The number of bounces (visited only one page)</td></tr><tr><td>visit_duration_sec_avg</td><td>int</td><td>no</td><td>The average duration of a visit (in seconds)</td></tr></tbody></table></div><h2 id="heading-events-table">events table</h2><p>The schema for the table <code>events</code> looks like this:</p><div class="hn-table"><table><thead><tr><td>Column name</td><td>Data type</td><td>Is partition key?</td><td>Description</td></tr></thead><tbody><tr><td>domain_name</td><td>string</td><td>yes</td><td>The domain name</td></tr><tr><td>event_date</td><td>string</td><td>yes</td><td>The date of the event (YYYY-MM-DD), as string</td></tr><tr><td>event_name</td><td>string</td><td>yes</td><td>The name of the event for tracking events</td></tr><tr><td>event_timestamp</td><td>timestamp</td><td>no</td><td>The exact event timestamp</td></tr><tr><td>edge_city</td><td>string</td><td>no</td><td>The name of the edge city (all edge location info is derived from the <code>x-edge-location</code> field in the <a target="_blank" href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat">logs</a>)</td></tr><tr><td>edge_country</td><td>string</td><td>no</td><td>The country of the edge location</td></tr><tr><td>edge_latitude</td><td>float</td><td>no</td><td>The latitude of the edge location</td></tr><tr><td>edge_longitude</td><td>float</td><td>no</td><td>The longitude of the edge location</td></tr><tr><td>request_path</td><td>string</td><td>no</td><td>The path of the requested
URL</td></tr><tr><td>page_view_id</td><td>string</td><td>no</td><td>The unique pageview id</td></tr><tr><td>daily_visitor_id</td><td>string</td><td>no</td><td>The unique daily visitor id</td></tr><tr><td>event_data</td><td>string</td><td>no</td><td>The stringified event payload for tracking events</td></tr></tbody></table></div><h1 id="heading-consumption-layer">Consumption layer</h1><p>The consumption layer will be part of another blog post in this series. Stay tuned! Until it's released, you can have a look at <a target="_blank" href="https://github.com/tobilg/serverless-duckdb">tobilg/serverless-duckdb</a> to get an idea of how the data could potentially be queried in a serverless manner.</p><h1 id="heading-wrapping-up">Wrapping up</h1><p>This article introduced some basic principles of modern data lakes, and then showed how to build a serverless, near-realtime data pipeline on top of these principles, leveraging AWS services and <a target="_blank" href="https://www.duckdb.org">DuckDB</a>, using a web analytics application as the example.</p><p>The example implementation of this article can be found on GitHub at <a target="_blank" href="https://github.com/ownstats/ownstats/tree/main/analytics-backend">ownstats/ownstats</a>.
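</p><p>To give an idea of how the <code>stats</code> table could be consumed, here's a sketch of a daily aggregation query in DuckDB SQL (the S3 path and partition layout are assumptions for illustration, not the project's actual layout):</p>

```sql
-- Illustrative sketch: the bucket name and glob are assumptions.
-- Daily pageviews, visitors and bounce rate per domain from the stats table,
-- with domain_name/event_date read from the Hive-style partition paths.
SELECT
  domain_name,
  event_date,
  SUM(page_view_cnt) AS page_views,
  SUM(visitor_cnt)   AS visitors,
  ROUND(SUM(bounces_cnt) * 100.0 / NULLIF(SUM(visitor_cnt), 0), 2) AS bounce_rate_pct
FROM parquet_scan('s3://my-ownstats-bucket/stats/*/*/*.parquet', HIVE_PARTITIONING = 1)
GROUP BY domain_name, event_date
ORDER BY event_date;
```

<p>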
Feel free to open an issue in case something doesn't work as expected, or if you'd like to add a feature request.</p><p>The next posts in this series will be</p><ul><li><p><strong>Part II</strong>: Building a lightweight JavaScript library for the gathering of web analytics data</p></li><li><p><strong>Part III</strong>: Consuming the gathered web analytics data by building a serverless query layer</p></li><li><p><strong>Part IV</strong>: Building a frontend for web analytics data</p></li></ul>]]></description><link>https://tobilg.com/casual-data-engineering-or-a-poor-mans-data-lake-in-the-cloud-part-i</link><guid isPermaLink="true">https://tobilg.com/casual-data-engineering-or-a-poor-mans-data-lake-in-the-cloud-part-i</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Mon, 24 Apr 2023 06:00:40 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_WjhfEzRDak/upload/9ed4008b65cd31e083a643acd7016585.jpeg</cover_image></item><item><title><![CDATA[Using DuckDB to repartition parquet data in S3]]></title><description><![CDATA[<p>Since release v0.7.1, DuckDB has the ability to repartition data stored in S3 as parquet files with a simple SQL query, which enables some interesting use cases.</p><h1 id="heading-why-not-use-existing-aws-services">Why not use existing AWS services?</h1><p>If your data lake lives in AWS, a natural choice for ETL pipelines would be existing AWS services such as Amazon Athena. Unfortunately, Athena has <a target="_blank" href="https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html">pretty tight limits</a> on the number of partitions that can be written by a single query (only up to 100). You would therefore need to create your own workaround logic to adhere to the limits, while still being able to do the operations you want.
Additionally, Athena can in some cases take several hundred milliseconds to even start a query.</p><p>This is where DuckDB comes into play, because it (theoretically) supports an unlimited number of Hive partitions, and offers very fast queries on partitioned parquet files.</p><h1 id="heading-use-case">Use case</h1><p>A common pattern to ingest streaming data and store it in S3 is to use <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html">Kinesis Data Firehose Delivery Streams</a>, which can write the incoming stream data as batched parquet files to S3. You can use <a target="_blank" href="https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html">custom S3 prefixes</a> with it when using Lambda processing functions, but by default, you can only partition the data by the timestamp (the timestamp the event reached the Kinesis Data Stream, not the event timestamp!).</p><p>So, a few common use cases for data repartitioning could include:</p><ul><li><p>Repartitioning the written data by the real event timestamp, if it's included in the incoming data</p></li><li><p>Repartitioning the data to match other query patterns, e.g. to support query filter pushdown and optimize query speeds and costs</p></li><li><p>Aggregating raw or preprocessed data, and storing the results in an optimized manner to support analytical queries</p></li></ul><p>We want to be able to achieve this without having to manage our own infrastructure or services built upon it, which is the reason we want to be as serverless as possible.</p><h1 id="heading-solution">Solution</h1><p>As described before, Amazon Athena only has a limit of 100 partitions when writing data. DuckDB doesn't have this limitation, which is why we want to use it for repartitioning.</p><p>This requires that we can deploy DuckDB in a serverless manner.
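</p><p>The first use case — repartitioning by the real event timestamp — can be sketched as a single DuckDB statement (the bucket names, the Firehose prefix layout and the <code>event_timestamp</code> column are illustrative assumptions):</p>

```sql
-- Hypothetical sketch: bucket names, glob depth and column names are assumptions.
-- Read the parquet files Firehose wrote under its default yyyy/MM/dd/HH prefixes,
-- then rewrite them partitioned by the event's own date instead of the arrival time.
COPY (
  SELECT *, strftime(event_timestamp, '%Y-%m-%d') AS event_date
  FROM parquet_scan('s3://my-firehose-bucket/raw/*/*/*/*/*.parquet')
) TO 's3://my-curated-bucket/by-event-date'
  (FORMAT PARQUET, PARTITION_BY (event_date), ALLOW_OVERWRITE TRUE);
```

<p>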
The choice is to run it in Lambda functions, which can be provisioned with up to 10GB of memory (meaning 6 vCPUs) and a maximum runtime of 900 seconds / 15 minutes. This should be enough for most repartitioning needs, because the throughput from/to S3 is pretty fast. Also, we want to be able to run our repartitioning queries on flexible schedules, which is why we'll use <a target="_blank" href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html">EventBridge Rules with a schedule</a>.</p><p>The project can be found at <a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner">https://github.com/tobilg/serverless-parquet-repartitioner</a>, and just needs to be configured and deployed.</p><h2 id="heading-architecture-overview">Architecture overview</h2><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1677692338278/13d3f454-6d20-4a3b-bc79-8344f54b1af7.png" alt class="image--center mx-auto" /></p><h2 id="heading-configuration">Configuration</h2><h3 id="heading-mandatory-configuration-settings">Mandatory configuration settings</h3><ul><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L18">S3 bucket name</a>: You need to specify the S3 bucket where the data that you want to repartition resides (e.g. <code>my-source-bucket</code>)</p></li><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L77">Custom repartitioning query</a>: You can write flexible repartitioning queries in the DuckDB syntax. Have a look at the examples in the <a target="_blank" href="https://duckdb.org/docs/extensions/httpfs">httpfs extension docs</a>.
You <strong>need</strong> to update this, as the template uses only example values!</p></li></ul><h3 id="heading-optional-configuration-settings">Optional configuration settings</h3><ul><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L79">S3 region</a>: The AWS region your S3 bucket is deployed to (if different from the region the Lambda function is deployed to)</p></li><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L84">The schedule</a>: The schedule on which the Lambda function is run. Have a look at the <a target="_blank" href="https://www.serverless.com/framework/docs/providers/aws/events/schedule">Serverless Framework docs</a> to find out what the potential settings are.</p></li><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L48">DuckDB memory limit</a>: The memory limit is derived automatically from the function memory setting</p></li><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L75">DuckDB threads count</a>: Optionally set the max thread limit (on Lambda, this is otherwise set automatically based on the amount of memory the function has assigned). With this setting, you can influence how many files are written per partition. If you set a lower thread count than available, the computation will not use all available resources, trading performance for control over the number of generated files. Ideally, rather adjust the amount of memory you assign to the Lambda function.</p></li><li><p><a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner/blob/main/serverless.yml#L50">Lambda timeout</a>: The maximum time a Lambda function can run is currently 15min / 900sec.
This means that if your query takes longer than that, it will be terminated by the underlying Firecracker engine.</p></li></ul><h3 id="heading-using-different-sourcetarget-s3-buckets">Using different source/target S3 buckets</h3><p>If you're planning to use different S3 buckets as sources and targets for the data repartitioning, you need to adapt the <code>iamRoleStatements</code> settings of the function.</p><p>Here's an example with minimal privileges:</p><pre><code class="lang-yaml">iamRoleStatements:  <span class="hljs-comment"># Source S3 bucket permissions</span>  - Effect: Allow    Action:      - s3:ListBucket    Resource: <span class="hljs-string">'arn:aws:s3:::my-source-bucket'</span>  - Effect: Allow    Action:      - s3:GetObject    Resource: <span class="hljs-string">'arn:aws:s3:::my-source-bucket/*'</span>  <span class="hljs-comment"># Target S3 bucket permissions</span>  - Effect: Allow    Action:      - s3:ListBucket      - s3:AbortMultipartUpload      - s3:ListMultipartUploadParts      - s3:ListBucketMultipartUploads    Resource: <span class="hljs-string">'arn:aws:s3:::my-target-bucket'</span>  - Effect: Allow    Action:      - s3:GetObject      - s3:PutObject    Resource: <span class="hljs-string">'arn:aws:s3:::my-target-bucket/*'</span></code></pre><p>A query for this use case would look like this:</p><pre><code class="lang-sql">COPY (<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> parquet_scan(<span class="hljs-string">'s3://my-source-bucket/input/*.parquet'</span>, HIVE_PARTITIONING = <span class="hljs-number">1</span>)) <span class="hljs-keyword">TO</span> <span class="hljs-string">'s3://my-target-bucket/output'</span> (<span class="hljs-keyword">FORMAT</span> PARQUET, PARTITION_BY (column1, column2, column3), ALLOW_OVERWRITE <span class="hljs-literal">TRUE</span>);</code></pre><h2 id="heading-deployment">Deployment</h2><p>After you cloned this repository to your local machine and cd'ed in its directory, the
application can be deployed like this (don't forget an <code>npm i</code> to install the dependencies!):</p><pre><code class="lang-bash">$ sls deploy</code></pre><p>This will deploy the stack to the default AWS region <code>us-east-1</code>. In case you want to deploy the stack to a different region, you can specify a <code>--region</code> argument:</p><pre><code class="lang-bash">$ sls deploy --region eu-central-1</code></pre><p>The deployment should take 2-3 minutes.</p><h2 id="heading-checks-and-manual-triggering">Checks and manual triggering</h2><p>You can <a target="_blank" href="https://www.serverless.com/framework/docs/providers/aws/cli-reference/invoke">manually invoke</a> the deployed Lambda function by running</p><pre><code class="lang-bash">$ sls invoke -f repartitionData</code></pre><p>After that, you can <a target="_blank" href="https://www.serverless.com/framework/docs/providers/aws/cli-reference/logs">check the generated CloudWatch logs</a> by issuing</p><pre><code class="lang-bash">$ sls logs -f repartitionData</code></pre><p>If you don't see any <code>DUCKDB_NODEJS_ERROR</code> in the logs, everything ran successfully, and you can have a look at your S3 bucket for the newly generated parquet files.</p><h2 id="heading-costs">Costs</h2><p>Using this repository will generate costs in your AWS account. Please refer to the AWS pricing docs for the respective services before deploying and running it.</p><h1 id="heading-summary">Summary</h1><p>We were able to show a possible serverless solution to repartition data that is stored in S3 as parquet files, without limitations imposed by certain AWS services.
With the solution shown, we can use plain and simple SQL queries, instead of having to rely on external libraries etc.</p><h1 id="heading-references">References</h1><ul><li>Serverless parquet repartitioner repo: <a target="_blank" href="https://github.com/tobilg/serverless-parquet-repartitioner">https://github.com/tobilg/serverless-parquet-repartitioner</a></li></ul>]]></description><link>https://tobilg.com/using-duckdb-to-repartition-parquet-data-in-s3</link><guid isPermaLink="true">https://tobilg.com/using-duckdb-to-repartition-parquet-data-in-s3</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Sun, 26 Feb 2023 17:45:04 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/59yg_LpcvzQ/upload/eff4313bb1f600e6d9da6afcd8a6860f.jpeg</cover_image></item><item><title><![CDATA[Using DuckDB in AWS Lambda]]></title><description><![CDATA[<h2 id="heading-prelude">Prelude</h2><p>DuckDB is an open-source in-process SQL OLAP database management system that has recently gained significant public interest due to its unique architecture and impressive performance benchmarks.</p><p>Unlike traditional databases that are designed to handle a wide variety of use cases, DuckDB is built specifically for analytical queries and is optimized to perform extremely well in these scenarios. This focus on analytics has allowed DuckDB to outperform traditional databases by several orders of magnitude, making it a popular choice for data scientists and analysts who need to process large datasets quickly and efficiently.</p><p>DuckDB is designed to be a highly efficient and scalable database system, which makes it a perfect fit for serverless architectures that allow developers to build and run applications and services without having to manage infrastructure.</p><p>DuckDB's ability to handle large datasets in a memory-efficient manner makes it an ideal choice for serverless environments.
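</p><p>For example, a single DuckDB statement can aggregate a remote Parquet file directly, without loading it into a table first (using the public TPC-H sample dataset from the DuckDB shell demo):</p>

```sql
-- Scans the remote Parquet file over HTTP and only fetches what the query needs.
SELECT count(*)
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';
```

<p>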
Being able to read columnar storage formats like Parquet or Apache Arrow tables from local, S3 or HTTP sources, DuckDB can quickly scan and aggregate large amounts of data without having to load it all into memory, reducing the amount of memory required to perform complex analytical queries. This allows for cost savings, as serverless environments typically charge for both compute and memory resources.</p><p>Existing AWS services, such as Athena or RDS, don't provide the same functionalities, and also have different scaling and pricing models. That's why it makes sense to explore ways to run DuckDB as an analytical service on AWS.</p><h2 id="heading-how-to-run-duckdb-in-aws-lambda">How to run DuckDB in AWS Lambda?</h2><p>The goal of this article is to use DuckDB on Node.js runtimes (12, 14, 16 and 18), so it's necessary to find a way to make DuckDB usable with Lambda. The first idea was to simply use the existing <a target="_blank" href="https://www.npmjs.com/package/duckdb">DuckDB npm package</a> and use the default packaging mechanisms when deploying Lambda functions. Unfortunately, this idea proved impossible due to <a target="_blank" href="https://github.com/mapbox/node-pre-gyp/issues/661#issuecomment-1347186316">downstream package problems</a> and different build Operating Systems (Lambda needs statically linked binaries built on Amazon Linux).</p><p>Generally, there are several ways to package dependencies in AWS Lambda functions with Node.js runtimes. Using bundlers like webpack or esbuild is probably the most used option at the moment.</p><p>Another one is using AWS Lambda layers for distributing dependencies to Lambda functions, allowing developers to manage common components and libraries across multiple functions in a centralized manner. By creating a layer for the dependencies, developers can avoid having to include them in each function's deployment package.
This helps reduce the size of the deployment package and makes it easier to manage updates to the dependencies. Moreover, using a Lambda layer can also improve the performance of Lambda functions.</p><h2 id="heading-building-duckdb-for-aws-lambda">Building DuckDB for AWS Lambda</h2><p>So, how can we build a DuckDB version that can be used with Node.js runtimes on AWS Lambda?</p><ul><li><p>We have to use a compatible environment when compiling the DuckDB binary, to avoid GLIBC incompatibilities etc. This means that we have to use an Amazon Linux distribution to build DuckDB, and enable static linking.</p></li><li><p>We have to solve the downstream package problems stated above, which make it impossible to use the default package.</p></li><li><p>On a side note, we want to enable current features like COPY TO PARTITIONED BY and improved Parquet file reading, thus requiring a build from the master branch of DuckDB.</p></li></ul><p>I created <a target="_blank" href="https://github.com/tobilg/duckdb-nodejs-layer">https://github.com/tobilg/duckdb-nodejs-layer</a> to achieve this. It uses GitHub Actions to automatically trigger a build of DuckDB from the current master, package it as an AWS Lambda layer, and automatically upload it to all AWS regions. Feel free to have a look at the source code, and open a GitHub issue in case you find any errors or have ideas for improvement.</p><h2 id="heading-using-the-duckdb-lambda-layer">Using the DuckDB Lambda layer</h2><p>Depending on your preferred framework, the methods to use a Lambda layer are different.
You can find the respective docs of the most common frameworks below:</p><ul><li><p><a target="_blank" href="https://www.serverless.com/framework/docs/providers/aws/guide/serverless.yml/#functions">Serverless Framework</a></p></li><li><p><a target="_blank" href="https://aws.amazon.com/blogs/compute/working-with-aws-lambda-and-lambda-layers-in-aws-sam/">SAM</a></p></li><li><p><a target="_blank" href="https://docs.aws.amazon.com/cdk/api/v1/docs/aws-lambda-readme.html#layers">CDK</a></p></li><li><p><a target="_blank" href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html#cfn-lambda-function-layers">CloudFormation</a></p></li></ul><p>The layer ARNs follow this pattern:</p><pre><code class="lang-bash">arn:aws:lambda:<span class="hljs-variable">$REGION</span>:041475135427:layer:duckdb-nodejs-layer:<span class="hljs-variable">$VERSION</span></code></pre><p>You can find the list of ARNs for all regions at <a target="_blank" href="https://github.com/tobilg/duckdb-nodejs-layer#usage">https://github.com/tobilg/duckdb-nodejs-layer#usage</a>.</p><p>I created an example repository with the Serverless Framework at <a target="_blank" href="https://github.com/tobilg/serverless-duckdb">https://github.com/tobilg/serverless-duckdb</a> that uses API Gateway and Lambda to provide an endpoint to which SQL queries can be issued, which the built-in DuckDB then executes.
Let's walk through it!</p><h3 id="heading-requirements">Requirements</h3><p>You'll need a current v3 installation of the <a target="_blank" href="https://serverless.com/">Serverless Framework</a> on the machine you're planning to deploy the application from.</p><p>Also, you'll have to set up your AWS credentials according to the <a target="_blank" href="https://www.serverless.com/framework/docs/providers/aws/guide/credentials/">Serverless docs</a>.</p><h3 id="heading-configuration">Configuration</h3><p>DuckDB is automatically configured to use the <a target="_blank" href="https://duckdb.org/docs/extensions/httpfs">HTTPFS extension</a> and uses the AWS credentials that are given to your Lambda function by its execution role. This means you can potentially query data that is available via HTTP(S) or in AWS S3 buckets.</p><p>If you want to also query data (e.g. Parquet files) that resides in one or more S3 buckets, you'll have to adjust the <code>iamRoleStatements</code> part of the function configuration in the <a target="_blank" href="https://github.com/tobilg/serverless-duckdb/blob/main/serverless.yml#L45">serverless.yml</a> file.
Just replace the <code>YOUR-S3-BUCKET-NAME</code> with your actual S3 bucket name.</p><pre><code class="lang-yaml">  <span class="hljs-attr">query:</span>    <span class="hljs-attr">handler:</span> <span class="hljs-string">src/functions/query.handler</span>    <span class="hljs-attr">memorySize:</span> <span class="hljs-number">10240</span>    <span class="hljs-attr">timeout:</span> <span class="hljs-number">30</span>    <span class="hljs-attr">iamRoleStatements:</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">Effect:</span> <span class="hljs-string">Allow</span>        <span class="hljs-attr">Action:</span>          <span class="hljs-bullet">-</span> <span class="hljs-string">s3:GetObject</span>        <span class="hljs-attr">Resource:</span> <span class="hljs-string">'arn:aws:s3:::YOUR-S3-BUCKET-NAME/*'</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">Effect:</span> <span class="hljs-string">Allow</span>        <span class="hljs-attr">Action:</span>          <span class="hljs-bullet">-</span> <span class="hljs-string">s3:ListBucket</span>        <span class="hljs-attr">Resource:</span>          <span class="hljs-bullet">-</span> <span class="hljs-string">'arn:aws:s3:::YOUR-S3-BUCKET-NAME'</span>    <span class="hljs-attr">layers:</span>      <span class="hljs-bullet">-</span> <span class="hljs-string">'arn:aws:lambda:${self:provider.region}:041475135427:layer:duckdb-nodejs-layer:3'</span>    <span class="hljs-attr">events:</span>      <span class="hljs-bullet">-</span> <span class="hljs-attr">http:</span>          <span class="hljs-attr">path:</span> <span class="hljs-string">${self:custom.api.version}/query</span>          <span class="hljs-attr">method:</span> <span class="hljs-string">post</span>          <span class="hljs-attr">cors:</span> <span class="hljs-literal">true</span>          <span class="hljs-attr">private:</span> <span class="hljs-literal">true</span></code></pre><h3 
id="heading-deployment">Deployment</h3><p>After you cloned this repository to your local machine and cd'ed in its directory, the application can be deployed like this (don't forget an <code>npm i</code> to install the dependencies):</p><pre><code class="lang-bash">$ sls deploy</code></pre><p>This will deploy the stack to the default AWS region <code>us-east-1</code>. In case you want to deploy the stack to a different region, you can specify a <code>--region</code> argument:</p><pre><code class="lang-bash">$ sls deploy --region eu-central-1</code></pre><p>The deployment should take 2-3 minutes. Once the deployment is finished, you should find some output in your console that indicates the API Gateway endpoint URL and the API Key:</p><pre><code class="lang-yaml"><span class="hljs-attr">api keys:</span>  <span class="hljs-attr">DuckDBKey:</span> <span class="hljs-string">REDACTED</span><span class="hljs-attr">endpoint:</span> <span class="hljs-string">POST</span> <span class="hljs-bullet">-</span> <span class="hljs-string">https://REDACTED.execute-api.us-east-1.amazonaws.com/prd/v1/query</span></code></pre><h3 id="heading-usage">Usage</h3><p>You can now query your DuckDB endpoint via HTTP requests (don't forget to exchange <code>REDACTED</code> with your real URL and API Key), e.g.</p><pre><code class="lang-bash">curl --location --request POST <span class="hljs-string">'https://REDACTED.execute-api.us-east-1.amazonaws.com/prd/v1/query'</span> \--header <span class="hljs-string">'x-api-key: REDACTED'</span> \--header <span class="hljs-string">'Content-Type: application/json'</span> \--data-raw <span class="hljs-string">'{    "query": "SELECT avg(c_acctbal) FROM '</span>\<span class="hljs-string">''</span>https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet<span class="hljs-string">'\'</span><span class="hljs-string">';"}'</span></code></pre><p>The query results will look like this:</p><pre><code class="lang-json">[    {        <span
class="hljs-attr">"avg(c_acctbal)"</span>: <span class="hljs-number">4454.577060000001</span>    }]</code></pre><h3 id="heading-example-queries">Example queries</h3><pre><code class="lang-bash">Remote Parquet Scans:  SELECT count(*) FROM <span class="hljs-string">'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet'</span>;  SELECT count(*) FROM <span class="hljs-string">'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet'</span>;  SELECT avg(c_acctbal) FROM <span class="hljs-string">'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet'</span>;  SELECT * FROM <span class="hljs-string">'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'</span> LIMIT 10;Remote Parquet/Parquet Join:  SELECT n_name, count(*)  FROM <span class="hljs-string">'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet'</span>,       <span class="hljs-string">'https://shell.duckdb.org/data/tpch/0_01/parquet/nation.parquet'</span>  WHERE c_nationkey = n_nationkey GROUP BY n_name;</code></pre><h2 id="heading-conclusion">Conclusion</h2><p>We were able to show that it's possible to package and use DuckDB in a Lambda function, as well as to run performant queries on remote data with this setup.</p><p>But we need to keep in mind that this is just a showcase. 
The example application leaves open a number of issues we'd have to solve if we wanted to run this in a distributed manner:</p><ul><li><p>A query planner and router that scales DuckDB instances and redistributes the queries to the "query backend" functions, and unifies the query results before passing them back to the query issuer</p></li><li><p>"Query stickiness": The example is stateless, meaning that even if you loaded data into an in-memory table, you couldn't be sure that a subsequent query would reach the same function instance, due to Lambda's scaling/execution model</p></li><li><p>Running DuckDB "only" in Lambda functions may not be the most performant option, given that AWS Fargate and very large EC2 instances exist</p></li><li><p>The example application uses API Gateway as the event source for the Lambda function, which means the maximum runtime of the queries can be 30 seconds, which is unrealistic for large datasets or complicated queries. In real-world scenarios, the Lambda function would need to be triggered asynchronously, e.g. via SNS or SQS.
This also means that the queries probably can't follow a strict request/response model.</p></li></ul><h2 id="heading-references">References</h2><ul><li><p><a target="_blank" href="https://www.boilingdata.com/">BoilingData</a></p></li><li><p><a target="_blank" href="https://stoic.com/">STOIC</a></p></li><li><p><a target="_blank" href="https://twitter.com/ghalimi">Ismael Ghalimi's Twitter feed</a></p></li></ul>]]></description><link>https://tobilg.com/using-duckdb-in-aws-lambda</link><guid isPermaLink="true">https://tobilg.com/using-duckdb-in-aws-lambda</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Sun, 12 Feb 2023 16:41:41 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/VTvnoNBowZs/upload/e930e7fd571fb4bae4510f6986b305f6.jpeg</cover_image></item><item><title><![CDATA[Building a global reverse proxy with on-demand SSL support]]></title><description><![CDATA[<h1 id="heading-motivation">Motivation</h1><p>Who needs a reverse proxy with on-demand SSL support? Well, think about services such as <a target="_blank" href="http://hashnode.com">Hashnode</a>, which also runs this blog, or <a target="_blank" href="http://usefathom.com">Fathom</a> and <a target="_blank" href="http://simpleanalytics.com">SimpleAnalytics</a>. What do all those services have in common? They all enable their customers to bring their own custom domain names. The latter two services use them to <a target="_blank" href="https://usefathom.com/blog/bypass-adblockers">bypass</a> <a target="_blank" href="https://docs.simpleanalytics.com/bypass-ad-blockers">adblockers</a>, so that customers can track all their pageviews and events, which otherwise potentially wouldn't be possible because the services' own domains are prone to DNS block lists. 
Hashnode uses them to enable their customers to host their blogs under their own domain names.</p><h1 id="heading-requirements">Requirements</h1><p>What are the functional &amp; non-functional requirements to build such a system? Let's try to recap:</p><ul><li><p>We want to be able to use a custom ("external") domain, e.g. <code>subdomain.customdomain.tld</code>, to redirect to another domain, such as <code>targetdomain.tld</code>, via a <code>CNAME</code> DNS record</p></li><li><p>The custom domains need to have SSL/TLS support</p></li><li><p>The custom domains need to be configurable, without changing the underlying infrastructure</p></li><li><p>We want to make sure that the service will only create and provide the certificates for whitelisted custom domains</p></li><li><p>We want to optimize the latency of individual requests, and thus need to support a scalable and distributed infrastructure</p></li><li><p>We want to be as Serverless as possible</p></li><li><p>We want to optimize for infrastructure costs (variable and fixed)</p></li><li><p>We want to build this on AWS</p></li><li><p>The service needs to be deployable/updateable/removable via Infrastructure as Code (IaC) in a repeatable manner</p></li></ul><h1 id="heading-architectural-considerations">Architectural considerations</h1><p>Looking at the above requirements, what are the implications from an architectural point of view? Which tools and services are already on the market? What does AWS as a Public Cloud Provider offer for our use case?</p><p>For the main requirements of a <a target="_blank" href="https://caddyserver.com/docs/quick-starts/reverse-proxy">reverse proxy</a> server with <a target="_blank" href="https://caddyserver.com/docs/automatic-https">automatic SSL/TLS</a> support, <a target="_blank" href="https://caddyserver.com/">Caddy</a> seems to be an optimal candidate. As it is written in Go, it runs very well in Docker containers and can be used on numerous operating systems. 
This means we have the option of running it either on EC2 instances, or on ECS/Fargate if we decide to run it in containers. The latter would cater to the requirement to run as Serverless as possible. It has modules to store the generated on-demand SSL/TLS certificates in <a target="_blank" href="https://github.com/silinternational/certmagic-storage-dynamodb">DynamoDB</a> or <a target="_blank" href="https://caddyserver.com/docs/modules/caddy.storage.s3#github.com/ss098/certmagic-s3">S3</a>.</p><p>Whitelisting custom domains for those certificates is also possible, by providing an additional backend service, which Caddy can ask whether a requested custom domain is <a target="_blank" href="https://caddyserver.com/docs/caddyfile/options#on-demand-tls">allowed to be used or not</a>.</p><p>A challenge is that none of those modules are contained in the official Caddy builds, meaning that we'd have to build a custom version of Caddy to be able to use those storage backends.</p><p>Regarding the requirement of global availability and short response times, <a target="_blank" href="https://aws.amazon.com/global-accelerator/">AWS Global Accelerator</a> is a viable option, as it can provide a single global static IP address endpoint for multiple, regionally distributed services. In our use case, those services would be our Caddy installations.</p><p>Running Caddy itself, as said before, is possible via Containers or EC2 instances / VMs. 
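</p><p>As a brief aside on the domain whitelisting mentioned above: the "ask" backend that Caddy consults can be very small. A Node.js Lambda-style sketch (the hard-coded whitelist and domain names are placeholder assumptions; a real deployment would look the allowed domains up dynamically):</p><pre><code class="lang-javascript">// Caddy's on-demand TLS "ask" endpoint receives a GET request with a
// ?domain= query parameter. A 2xx response allows certificate issuance
// for that domain, any other status code denies it.
const ALLOWED_DOMAINS = new Set([
  'test.myexistingdomain.com', // placeholder entries
  'blog.myexistingdomain.com',
]);

exports.handler = async (event) => {
  // Tolerate missing query parameters in the invocation event
  const params = event.queryStringParameters || {};
  if (ALLOWED_DOMAINS.has(params.domain)) {
    return { statusCode: 200, body: 'OK' };
  }
  return { statusCode: 404, body: 'Domain not allowed' };
};</code></pre><p>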
As the services will run continuously and presumably don't need a lot of resources if not under heavy load, we assume that 1 vCPU and 1 GB of memory should be enough.</p><p>When projecting this onto the necessary infrastructure, the cost comparison between Containers and VMs looks like the following (for simplification, we just compare the fixed costs, ignore variable costs such as <a target="_blank" href="https://aws.amazon.com/global-accelerator/pricing/">egress traffic</a>, and assume the <code>us-east-1</code> region is used):</p><p><strong>Containers</strong></p><ul><li><p>Fargate task with 1 vCPU and 1 GB of memory for each Caddy regional instance</p><ul><li>Price: ($0.04048/vCPU hour + $0.004445/GB hour) * 720 hours (30 days) = $32.35 / 30 days</li></ul></li><li><p>ALB to make the Fargate tasks available to the outside world</p><ul><li>Price: ($0.0225 per ALB-hour + $0.008 per LCU-hour) * 720 hours (30 days) = $21.96 / 30 days</li></ul></li></ul><p>In combination, it would cost <strong>$54.31</strong> to run this setup for 30 days.</p><p><strong>EC2 instances / VMs</strong></p><ul><li><p>A t2.micro instance with 1 vCPU and 1 GB of memory for each Caddy regional instance</p><ul><li>Price: $0.0116/hour on-demand * 720 hours (30 days) = $8.35</li></ul></li><li><p>There's no need for a Load Balancer in front of the EC2 instances, as Global Accelerator can directly use them</p><ul><li>Price: <em>$0</em></li></ul></li></ul><p>In combination, it would cost <strong><em>$8.35</em></strong> to run this setup for 30 days.</p><p><strong>Additional costs are:</strong></p><ul><li><p>AWS Global Accelerator</p><ul><li>Price: $0.025 / hour * 720 hours (30 days) = <strong>$18</strong></li></ul></li><li><p>DynamoDB table</p><ul><li>Price: (5000 reads/day * $0.25/million + 50 writes/day * $1.25/million) * 30 days = $0.0375 reads + 0.001875 writes = <strong>$0.04</strong></li></ul></li><li><p>Lambda function (128 MB / 0.25 sec avg. 
duration / 5000 req./day)</p><ul><li>Price: ($0.0000166667 / GB-second + $0.20 per 1M req.) = basically <strong>$0</strong></li></ul></li></ul><h1 id="heading-resulting-architecture"><strong>Resulting architecture</strong></h1><p>Based on the calculated fixed costs, we decided to use EC2 instances instead of Fargate tasks, which will save us a decent amount of money even for one Caddy instance. <strong>The estimated costs for running this architecture for 30 days are $26.39</strong>.</p><p>As one of our requirements was that we can roll out this infrastructure potentially on a global scale, we need to be able to deploy the EC2 instances with the Caddy servers in different AWS regions, as well as having multiple instances in the same region.</p><p>Furthermore, we could use <a target="_blank" href="https://aws.amazon.com/dynamodb/global-tables/">DynamoDB Global Tables</a> to achieve a global distribution of the certificates to get faster response times, but deem it as out of scope for this article.</p><p><strong>The final architecture:</strong></p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1673307403175/75a9ddcc-76f5-4f5e-81cb-e5604a09953c.png" alt class="image--center mx-auto" /></p><h1 id="heading-implementation">Implementation</h1><p>To implement the described architecture, we must take several steps. 
First of all, we must build a custom version of Caddy that includes the DynamoDB module, which then enables us to use DynamoDB as a certificate store.</p><h2 id="heading-custom-caddy-build">Custom Caddy build</h2><p>This can be achieved via a custom build process leveraging Docker images of AmazonLinux 2, as found at <a target="_blank" href="https://github.com/tobilg/aws-caddy-build">tobilg/aws-caddy-build</a>.</p><p><a target="_blank" href="https://github.com/tobilg/aws-caddy-build/blob/main/build.sh">build.sh</a> (parametrized custom build of Caddy via Docker)</p><pre><code class="lang-bash"><span class="hljs-meta">#!/usr/bin/env bash</span>
<span class="hljs-built_in">set</span> -e

<span class="hljs-comment"># Set OS (first script argument)</span>
OS=<span class="hljs-variable">${1:-linux}</span>
<span class="hljs-comment"># Set Caddy version (second script argument)</span>
CADDY_VERSION=<span class="hljs-variable">${2:-v2.6.2}</span>
<span class="hljs-comment"># Create release folders</span>
mkdir -p <span class="hljs-variable">$PWD</span>/releases <span class="hljs-variable">$PWD</span>/temp_release
<span class="hljs-comment"># Run build</span>
docker build --build-arg OS=<span class="hljs-variable">$OS</span> --build-arg CADDY_VERSION=<span class="hljs-variable">$CADDY_VERSION</span> -t custom-caddy-build .
<span class="hljs-comment"># Copy release from image to temporary folder</span>
docker run -v <span class="hljs-variable">$PWD</span>/temp_release:/opt/mount --rm -ti custom-caddy-build bash -c <span class="hljs-string">"cp /tmp/caddy-build/* /opt/mount/"</span>
<span class="hljs-comment"># Copy release to releases</span>
cp <span class="hljs-variable">$PWD</span>/temp_release/* <span class="hljs-variable">$PWD</span>/releases/
<span class="hljs-comment"># Cleanup</span>
rm -rf <span class="hljs-variable">$PWD</span>/temp_release</code></pre><p><a target="_blank" href="https://github.com/tobilg/aws-caddy-build/blob/main/Dockerfile">Dockerfile</a> (will build Caddy with the 
DynamoDB and S3 modules)</p><pre><code class="lang-bash">FROM amazonlinux:2
ARG CADDY_VERSION=v2.6.2
ARG OS=linux

<span class="hljs-comment"># Install dependencies</span>
RUN yum update -y &amp;&amp; \
  yum install golang -y
RUN GOBIN=/usr/<span class="hljs-built_in">local</span>/bin/ go install github.com/caddyserver/xcaddy/cmd/xcaddy@latest
RUN mkdir -p /tmp/caddy-build &amp;&amp; \
  GOOS=<span class="hljs-variable">${OS}</span> xcaddy build <span class="hljs-variable">${CADDY_VERSION}</span> --with github.com/ss098/certmagic-s3 --with github.com/silinternational/certmagic-storage-dynamodb/v3 --output /tmp/caddy-build/aws_caddy_<span class="hljs-variable">${CADDY_VERSION}</span>_<span class="hljs-variable">${OS}</span></code></pre><p>That's it for the custom Caddy build. <mark>You don't need to build this yourself, as the further steps use the </mark> <a target="_blank" href="https://github.com/tobilg/aws-caddy-build/tree/main/releases"><mark>release</mark></a> <mark>I built and uploaded to GitHub.</mark></p><h2 id="heading-reverse-proxy-service">Reverse Proxy Service</h2><p>The implementation of the reverse proxy service can be found at <a target="_blank" href="https://github.com/tobilg/global-reverse-proxy">tobilg/global-reverse-proxy</a>.</p><p>Just clone it via <code>git clone</code> <a target="_blank" href="https://github.com/tobilg/global-reverse-proxy.git"><code>https://github.com/tobilg/global-reverse-proxy.git</code></a> to your local machine, and configure it as described below.</p><h3 id="heading-prerequisites">Prerequisites</h3><p><strong>Serverless Framework</strong></p><p>You need to have a recent (&gt;=3.1.2) version of the <a target="_blank" href="https://goserverless.com/">Serverless Framework</a> installed globally on your machine. If you haven't, you can run <code>npm i -g serverless</code> to install it.</p><p><strong>Valid AWS credentials</strong></p><p>The Serverless Framework relies on already configured AWS credentials. 
Please refer to the <a target="_blank" href="https://www.serverless.com/framework/docs/providers/aws/guide/credentials/">docs</a> to learn how to set them up on your local machine.</p><p><strong>EC2 key already configured</strong></p><p>If you want to interact with the deployed EC2 instance(s), you need to add your existing public SSH key or create a new one. Please have a look at the <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws">AWS docs</a> to learn how you can do that.</p><p>Please also note the name you have given to the newly created key, as you will have to update the <a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L15">configuration of the proxy server(s) stack</a>.</p><h3 id="heading-infrastructure-as-code-overview">Infrastructure as Code overview</h3><p>The infrastructure consists of three different stacks:</p><ul><li><p>A stack for the domain whitelisting service, and the certificate table in DynamoDB</p></li><li><p>A stack for the proxy server(s) itself, which can be deployed multiple times if you want high (global) availability and fast latencies</p></li><li><p>A stack for the Global Accelerator, and the according DNS records</p></li></ul><h3 id="heading-most-important-parts">Most important parts</h3><p>The main functionality, the reverse proxy based on Caddy, is deployed via an EC2 instance. Its configuration, the so-called Caddyfile, is, together with the CloudFormation resource for the EC2 instance, the most important part.</p><p><a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/caddy-config/Caddyfile">Caddyfile</a></p><p>This configuration enables the reverse proxy, the on-demand TLS feature and DynamoDB storage for certificates. 
It's automatically parametrized via the generated <code>/etc/caddy/environment</code> file (see ec2.yml below). A <code>systemd</code> service for Caddy is generated as well, based on our configuration derived from the serverless.yml.</p><pre><code class="lang-bash">{
        admin off
        on_demand_tls {
                ask {<span class="hljs-variable">$DOMAIN_SERVICE_ENDPOINT</span>}
        }
        storage dynamodb {<span class="hljs-variable">$TABLE_NAME</span>} {
                aws_region {<span class="hljs-variable">$TABLE_REGION</span>}
        }
}
:80 {
       respond /health <span class="hljs-string">"Im healthy"</span> 200
}
:443 {
        tls {<span class="hljs-variable">$LETSENCRYPT_EMAIL_ADDRESS</span>} {
                on_demand
        }
        reverse_proxy https://{<span class="hljs-variable">$TARGET_DOMAIN</span>} {
                header_up Host {<span class="hljs-variable">$TARGET_DOMAIN</span>}
                header_up User-Custom-Domain {host}
                header_up X-Forwarded-Port {server_port}
                health_timeout 5s
        }
}</code></pre><p><a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/resources/ec2.yml">ec2.yml</a> (extract)</p><p>The interesting part is the <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html">UserData</a> script, which is run automatically when the EC2 instance starts. 
It does the following:</p><ul><li><p>Download the custom Caddy build with DynamoDB support</p></li><li><p>Prepare a group and a user for Caddy</p></li><li><p>Create the <code>caddy.service</code> unit file for <code>systemd</code></p></li><li><p>Create the <code>Caddyfile</code> (as outlined above)</p></li><li><p>Create the environment file (<code>/etc/caddy/environment</code>)</p></li><li><p>Reload <code>systemd</code>, then enable &amp; start the Caddy service</p></li></ul><pre><code class="lang-yaml"><span class="hljs-attr">Resources:</span>
  <span class="hljs-attr">EC2Instance:</span>
    <span class="hljs-attr">Type:</span> <span class="hljs-string">AWS::EC2::Instance</span>
    <span class="hljs-attr">Properties:</span>
      <span class="hljs-attr">InstanceType:</span> <span class="hljs-string">'${self:custom.ec2.instanceType}'</span>
      <span class="hljs-attr">KeyName:</span> <span class="hljs-string">'${self:custom.ec2.keyName}'</span>
      <span class="hljs-attr">SecurityGroups:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">'InstanceSecurityGroup'</span>
      <span class="hljs-attr">ImageId:</span> <span class="hljs-string">'ami-0b5eea76982371e91'</span> <span class="hljs-comment"># Amazon Linux 2 AMI</span>
      <span class="hljs-attr">IamInstanceProfile:</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">'InstanceProfile'</span>
      <span class="hljs-attr">UserData:</span> <span class="hljs-type">!Base64</span>
        <span class="hljs-attr">'Fn::Join':</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">''</span>
          <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-string">|
              #!/bin/bash -xe</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo wget -O /usr/bin/caddy "https://github.com/tobilg/aws-caddy-build/raw/main/releases/aws_caddy_v2.6.2_linux"</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo chmod +x /usr/bin/caddy</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo groupadd --system caddy</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo useradd --system --gid caddy --create-home --home-dir /var/lib/caddy --shell /usr/sbin/nologin --comment "Caddy web server" caddy</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo mkdir -p /etc/caddy</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo echo -e '${file(./configs.js):caddyService}' | sudo tee /etc/systemd/system/caddy.service</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo printf '${file(./configs.js):caddyFile}' | sudo tee /etc/caddy/Caddyfile</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo echo -e "TABLE_REGION=${self:custom.caddy.dynamoDBTableRegion}\nTABLE_NAME=${self:custom.caddy.dynamoDBTableName}\nDOMAIN_SERVICE_ENDPOINT=${self:custom.caddy.domainServiceEndpoint}\nLETSENCRYPT_EMAIL_ADDRESS=${self:custom.caddy.letsEncryptEmailAddress}\nTARGET_DOMAIN=${self:custom.caddy.targetDomainName}" | sudo tee /etc/caddy/environment</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo systemctl daemon-reload</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              sudo systemctl enable caddy</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|</span>
              <span class="hljs-string">sudo</span> <span class="hljs-string">systemctl</span> <span class="hljs-string">start</span> <span class="hljs-string">caddy</span></code></pre><p><a target="_blank" 
href="https://github.com/tobilg/global-reverse-proxy/blob/main/accelerator-stack/resources/global-accelerator.yml">global-accelerator.yml</a></p><p>The Global Accelerator CloudFormation resource wires the EC2 instance(s) into the Accelerator, which acts as a kind of global load balancer. This is then referenced by the dns-record.yml, which assigns the configured domain name to the Global Accelerator.</p><pre><code class="lang-yaml"><span class="hljs-attr">Resources:</span>
  <span class="hljs-attr">Accelerator:</span>
    <span class="hljs-attr">Type:</span> <span class="hljs-string">AWS::GlobalAccelerator::Accelerator</span>
    <span class="hljs-attr">Properties:</span>
      <span class="hljs-attr">Name:</span> <span class="hljs-string">'External-Accelerator'</span>
      <span class="hljs-attr">Enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">Listener:</span>
    <span class="hljs-attr">Type:</span> <span class="hljs-string">AWS::GlobalAccelerator::Listener</span>
    <span class="hljs-attr">Properties:</span>
      <span class="hljs-attr">AcceleratorArn:</span>
        <span class="hljs-attr">Ref:</span> <span class="hljs-string">Accelerator</span>
      <span class="hljs-attr">Protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">ClientAffinity:</span> <span class="hljs-string">NONE</span>
      <span class="hljs-attr">PortRanges:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">FromPort:</span> <span class="hljs-number">443</span>
          <span class="hljs-attr">ToPort:</span> <span class="hljs-number">443</span>
  <span class="hljs-attr">EndpointGroup1:</span>
    <span class="hljs-attr">Type:</span> <span class="hljs-string">AWS::GlobalAccelerator::EndpointGroup</span>
    <span class="hljs-attr">Properties:</span>
      <span class="hljs-attr">EndpointConfigurations:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">EndpointId:</span> <span class="hljs-string">'${self:custom.ec2.instance1.id}'</span>
          <span class="hljs-attr">Weight:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">EndpointGroupRegion:</span> <span class="hljs-string">'${self:custom.ec2.instance1.region}'</span>
      <span class="hljs-attr">HealthCheckIntervalSeconds:</span> <span class="hljs-number">30</span>
      <span class="hljs-attr">HealthCheckPath:</span> <span class="hljs-string">'/health'</span>
      <span class="hljs-attr">HealthCheckPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">HealthCheckProtocol:</span> <span class="hljs-string">'HTTP'</span>
      <span class="hljs-attr">ListenerArn:</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">'Listener'</span>
      <span class="hljs-attr">ThresholdCount:</span> <span class="hljs-number">3</span></code></pre><h3 id="heading-detailed-configuration">Detailed configuration</h3><p><strong>Stack configurations</strong></p><p>Please configure the following values for the different stacks:</p><ul><li><p>The target domain name where you want your reverse proxy to send the requests to (<a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L7">targetDomainName</a>)</p></li><li><p>The email address to use for automatic certificate generation via LetsEncrypt (<a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L8">letsEncryptEmailAddress</a>)</p></li><li><p>The domain name of the proxy service itself, which is then used by GlobalAccelerator (<a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/accelerator-stack/serverless.yml#L6">domain</a>)</p></li><li><p>Optionally: The current IP address from which you want to access the EC2 instance(s) via SSH (<a target="_blank" 
href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/serverless.yml#L18">sshClientIPAddress</a>). If you want to use SSH, you'll need to uncomment the respective <a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/proxy-server-stack/resources/ec2.yml#L56-L59">SecurityGroup settings</a></p></li></ul><p><strong>Whitelisted domain configuration</strong></p><p>You need to make sure that not everyone can use your reverse proxy with every domain. Therefore, you need to configure the whitelist of domains that may be used by Caddy's <a target="_blank" href="https://caddyserver.com/docs/automatic-https#on-demand-tls">on-demand TLS feature</a>.</p><p>This is done with the Domain Verifier Lambda function, which is deployed at a Function URL endpoint.</p><p>The configuration can be changed <a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/domain-service-stack/src/domainVerifier.js#L3-L6">here</a> before deploying the service.</p><p><strong><mark>HINT</mark></strong><mark>: To use this dynamically, as you'd probably wish in a production setting, you could rewrite the Lambda function to read the custom domains from a DynamoDB table, and have another Lambda function run recurrently to issue DNS checks for the CNAME entries the customers would need to make (see below).</mark></p><p><strong>DNS / Nameserver configurations</strong></p><p>If you use an external domain provider, such as <a target="_blank" href="https://www.namecheap.com/support/knowledgebase/article.aspx/767/10/how-to-change-dns-for-a-domain/">Namecheap</a> or GoDaddy, make sure that you point the DNS settings in your domain's configuration to the nameservers assigned to your HostedZone by Amazon. 
You can look these up in the AWS Console or via the AWS CLI.</p><p><strong>CNAME configuration for proxying</strong></p><p>You also need to <a target="_blank" href="https://aws.amazon.com/premiumsupport/knowledge-center/route-53-create-alias-records/">add CNAME records</a> to the domains you want to proxy for, e.g. if your proxy service domain is <a target="_blank" href="http://external.mygreatproxyservice.com"><code>external.mygreatproxyservice.com</code></a>, you need to add a CNAME record to your existing domain (e.g. <a target="_blank" href="http://test.myexistingdomain.com"><code>test.myexistingdomain.com</code></a>) to redirect to the proxy service domain:</p><pre><code class="lang-bash">CNAME test.myexistingdomain.com external.mygreatproxyservice.com</code></pre><p><strong>Passing options during deployment</strong></p><p>When running <code>sls deploy</code> for each stack, you can specify the following options to customize the deployments:</p><ul><li><p><code>--stage</code>: This will configure the so-called stage, which is part of the stack name (default: <code>prd</code>)</p></li><li><p><code>--region</code>: This will configure the AWS region where the stack is deployed (default: <code>us-east-1</code>)</p></li></ul><h3 id="heading-deployment">Deployment</h3><p>You need to follow a specific deployment order to be able to run the overall service:</p><ol><li><p>Domain whitelisting service: <code>cd domain-service-stack &amp;&amp; sls deploy &amp;&amp; cd ..</code></p></li><li><p>Proxy server(s): <code>cd proxy-server-stack &amp;&amp; sls deploy &amp;&amp; cd ..</code></p></li><li><p>Global Accelerator &amp; HostedZone / DNS : <code>cd accelerator-stack &amp;&amp; sls deploy &amp;&amp; cd ..</code></p></li></ol><h3 id="heading-removal">Removal</h3><p>To remove the individual stacks, you can run <code>sls remove</code> in the individual subfolders.</p><h1 id="heading-wrapping-up">Wrapping up</h1><p>We were able to build a POC for a (potentially) globally 
distributed reverse proxy service, with on-demand TLS support. We decided against using Fargate and in favor of EC2 for cost reasons, thereby prioritizing costs over running as Serverless as possible. In another setting or environment, you might come to a different conclusion, which is completely fine.</p><p>For a more production-like setup, you'd probably need to amend the <a target="_blank" href="https://github.com/tobilg/global-reverse-proxy/blob/main/domain-service-stack/src/domainVerifier.js">Domain Verifier Lambda function</a>, so that it dynamically looks up the custom domains that are configured e.g. by your customers via a UI, and stored in another DynamoDB table via another Lambda function. Deleting or updating those custom domains should probably be possible, too.</p><p>Furthermore, you should then write an additional Lambda function that recurrently checks each stored custom domain to verify that:</p><ul><li><p>The CNAME record points to your <code>external.$YOUR_DOMAIN_NAME.tld</code>, updating the status accordingly</p></li><li><p>An actual redirect from the custom domain to your domain is possible, verified via an HTTPS request</p></li></ul>]]></description><link>https://tobilg.com/building-a-global-reverse-proxy-with-on-demand-ssl-support</link><guid isPermaLink="true">https://tobilg.com/building-a-global-reverse-proxy-with-on-demand-ssl-support</guid><dc:creator><![CDATA[Tobias Müller]]></dc:creator><pubDate>Tue, 10 Jan 2023 23:25:26 GMT</pubDate><cover_image>https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/40XgDxBfYXM/upload/c2bf9426ba8c99f30c9e990e150c08b9.jpeg</cover_image></item></channel></rss>