Gathering and analyzing public cloud provider IP address data with DuckDB & Observerable
As organizations increasingly adopt the public cloud, managing the networking and security aspects of cloud computing becomes more complex. One of the challenges that cloud administrators face is, especially in a hybrid cloud environment, keeping track of the IP address ranges of the public cloud providers, which all use different file formats to publish their IP address range data. The formats include deeply nested JSONs, CSVs as well as plain text.
The goal of this article is to outline how this data can be unified, cleaned and made available on a platform that makes it easy for users to consume. Furthermore, some interesting statistics can be derived from those public datasets.
The data and the source code can be found at:
Data sources
The (incomplete) list of public cloud providers that are publishing their IPv4 CIDR blocks are:
You can click on the individual links to view or download the data manually.
HINT: The published lists of IP address ranges don't represent the overall IP address space that each of these providers are possessing.
To get the complete IP range, data from organizations like ARIN would need to be added as well. For simplicity and brevity, only the publically downloadable info was used.
Retrieving, cleaning, storing and exporting data
The overall data engineering process is divided into multiple steps:
Identify data sources (see above) and define the common schema
Retrieving the raw data from public data sources (via HTTP)
Cleaning the retrieved data (e.g. remove duplicates)
Storing the data in a common schema, so that it can be aggregated and analyzed for different public cloud providers at once
Exporting the stored data into different file formats, so that as many different types of clients can make use of it
Additionally, we'd like to keep the costs low, and the infrastructure as simple as possible. That's why DuckDB is chosen as the database layer, which offers a rich and advanced set of features to handle (read and write) different file formats, as well as it can read directly from remote data sources via HTTP, only by using SQL. That saves additional effort for out-of-band ETL.
Furthermore, to share the data, we chose GitHub, which is free to use for the scope of our use case. Most importantly, it allows us to store the exported data files in our repository. To run the overall process, GitHub Actions are used as they also offer a free usage tier, and have everything we need to create the described data pipeline.
Common data schema
After inspecting the data source files, the derived unified schema for all loaded data sources will look like this:
Column name | Data type | Description |
cloud_provider | VARCHAR | The public cloud provider's name |
cidr_block | VARCHAR | The CIDR block, e.g. 10.0.0.0/32 |
ip_address | VARCHAR | The IP address, e.g. 10.0.0.0 |
ip_address_mask | INTEGER | The IP address mask, e.g. 32 |
ip_address_cnt | INTEGER | The number of IP addresses in this CIDR block |
region | VARCHAR | The public cloud provider region information (if given) |
Creating the cloud provider tables
At first, we'll create a table in DuckDB for each of the public cloud providers. If DuckDB is installed (see docs) and in the PATH, we can execute SQL scripts like this:
# $DATA_PATH is the location of the DuckDB database file (this is important, because otherwise an in-memory table will be automatically created that will not be able to persist the data
# $SCRIPT is the path to the SQL script that shall be executed
duckdb $DATA_PATH < $SCRIPT.sql
Each table has different SQLs, as the data sources (contents and formats) of each provider are different.
Before starting, we need to make sure that the httpfs extension is installed and loaded (as we use remote datasets):
INSTALL httpfs;
LOAD httpfs;
AWS table
CREATE TABLE aws_ip_data AS (
SELECT DISTINCT
prefixes.cidr_block,
prefixes.ip_address,
prefixes.ip_address_mask,
CAST(POW(2, 32-prefixes.ip_address_mask) AS INTEGER) AS ip_address_cnt,
prefixes.region
FROM (
SELECT
prefix_object.ip_prefix AS cidr_block,
STR_SPLIT(prefix_object.ip_prefix, '/')[1] AS ip_address,
CAST(STR_SPLIT(prefix_object.ip_prefix, '/')[2] AS INTEGER) AS ip_address_mask,
prefix_object.region
FROM (
SELECT UNNEST(prefixes) AS prefix_object FROM 'https://ip-ranges.amazonaws.com/ip-ranges.json'
)
) prefixes
);
Azure table
CREATE TABLE azure_ip_data AS (
SELECT DISTINCT
prefixes AS cidr_block,
STR_SPLIT(prefixes, '/')[1] AS ip_address,
CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
CASE
WHEN region = '' THEN 'No region'
ELSE region
END AS region
FROM (
SELECT DISTINCT
prop.region AS region,
UNNEST(prop.addressPrefixes) AS prefixes
FROM (
SELECT
values.properties AS prop
FROM (
SELECT
UNNEST(values) AS values
FROM
read_json_auto('https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20230417.json', maximum_object_size=10000000)
)
)
)
WHERE
prefixes NOT LIKE '%::%'
);
CloudFlare table
CREATE TABLE cloudflare_ip_data AS (
SELECT DISTINCT
prefixes AS cidr_block,
STR_SPLIT(prefixes, '/')[1] AS ip_address,
CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
'No region' AS region
FROM (
SELECT
column0 AS prefixes
FROM
read_csv_auto('https://www.cloudflare.com/ips-v4')
)
);
DigitalOcean table
CREATE TABLE digitalocean_ip_data AS (
SELECT DISTINCT
prefixes AS cidr_block,
STR_SPLIT(prefixes, '/')[1] AS ip_address,
CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
'No region' AS region
FROM (
SELECT
column0 AS prefixes
FROM
read_csv_auto('https://digitalocean.com/geo/google.csv')
WHERE
column0 NOT LIKE '%::%'
)
);
Fastly table
CREATE TABLE fastly_ip_data AS (
SELECT DISTINCT
prefixes AS cidr_block,
STR_SPLIT(prefixes, '/')[1] AS ip_address,
CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
'No region' AS region
FROM (
SELECT
UNNEST(addresses) AS prefixes
FROM
read_json_auto('https://api.fastly.com/public-ip-list')
)
);
Google Cloud table
CREATE TABLE google_ip_data AS (
SELECT DISTINCT
prefixes.cidr_block,
prefixes.ip_address,
prefixes.ip_address_mask,
CAST(pow(2, 32-prefixes.ip_address_mask) AS INTEGER) AS ip_address_cnt,
prefixes.region
FROM (
SELECT
prefix_object.ipv4Prefix AS cidr_block,
str_split(prefix_object.ipv4Prefix, '/')[1] AS ip_address,
CAST(str_split(prefix_object.ipv4Prefix, '/')[2] AS INTEGER) AS ip_address_mask,
prefix_object.scope as region
FROM (
SELECT unnest(prefixes) AS prefix_object FROM 'https://www.gstatic.com/ipranges/cloud.json'
)
WHERE
prefix_object.ipv4Prefix IS NOT NULL
) prefixes
);
Oracle Cloud table
CREATE TABLE oracle_ip_data AS (
SELECT DISTINCT
prefixes.cidr AS cidr_block,
STR_SPLIT(prefixes.cidr, '/')[1] AS ip_address,
CAST(STR_SPLIT(prefixes.cidr, '/')[2] AS INTEGER) AS ip_address_mask,
CAST(POW(2, 32-CAST(STR_SPLIT(prefixes.cidr, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
CASE
WHEN region = '' THEN 'No region'
ELSE region
END AS region
FROM (
SELECT DISTINCT
region,
UNNEST(cidrs) AS prefixes
FROM (
SELECT
regions.region AS region,
regions.cidrs AS cidrs
FROM (
SELECT
UNNEST(regions) AS regions
FROM
read_json_auto('https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json', maximum_object_size=10000000)
)
)
)
);
Create a combined view
The next step is to create a new view (ip_data
) that combines our tables for the individual cloud providers. We can then use this view later to compare the different cloud providers.
The view definition looks like this:
CREATE VIEW ip_data AS (
SELECT 'AWS' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM aws_ip_data
UNION ALL
SELECT 'Azure' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM azure_ip_data
UNION ALL
SELECT 'CloudFlare' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM cloudflare_ip_data
UNION ALL
SELECT 'DigitalOcean' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM digitalocean_ip_data
UNION ALL
SELECT 'Fastly' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM fastly_ip_data
UNION ALL
SELECT 'Google Cloud' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM google_ip_data
UNION ALL
SELECT 'Oracle' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM oracle_ip_data
);
Export the data
To be able to use the data with other tools, we need to export the data to different formats, in our case CSV and Parquet. You can review the executed queries in the repository.
-- Only an example, this needs to be done for all providers as well!
-- Export complete data as CSV
COPY (SELECT * FROM ip_data ORDER BY cloud_provider, cidr_block) TO 'data/providers/all.csv' WITH (HEADER 1, DELIMITER ',');
-- Export complete data as Parquet
COPY (SELECT * FROM ip_data ORDER BY cloud_provider, cidr_block) TO 'data/providers/all.parquet' (FORMAT 'parquet', COMPRESSION 'SNAPPY');
Analyze the data
As we now prepared our data, we can start analyzing it. Therefore, we'll use an ObservableHQ notebook, where we'll upload the all.csv file to.
Overall IP address counts
Total number and value of IP addresses
An astonishing insight is that both AWS and Azure have more than six times as many IP addresses available as their next competitor.
Also, the market values of their IP addresses are nearly four billion Dollars according to a market analysis.
CIDR masks distribution by public cloud provider
It's remarkable that, although AWS and Azure have similar absolute numbers of IP addresses, the type of CIDR blocks / IP ranges strongly differ: AWS owns very few very large IP ranges, whereas Azure owns very many rather small IP ranges, and just a few very large ones:
Another view is the filterable table for this data:
AWS CIDR masks
Azure CIDR masks
CloudFlare CIDR masks
DigitalOcean CIDR masks
Fastly CIDR masks
Google Cloud CIDR masks
Oracle Cloud CIDR masks
Conclusion
In this article, we described a simple and straightforward way to gather and transform data in different formats with DuckDB, as well as export it to a common schema as CSV and Parquet files.
Furthermore, we leveraged DuckDB on Observable to analyze and display the data in beautiful and interactive graphs.
By using GitHub Actions free tier as our "runtime" for the data processing via Bash and SQL scripts, and hosting our data in a GitHub repo (also covered by the free tier), we were able to show that data pipelines like covered use case can be built without accruing infrastructure costs. Also, the Observable pricing model supports our analyses for free.