Gathering and analyzing public cloud provider IP address data with DuckDB & Observable

As organizations increasingly adopt the public cloud, managing the networking and security aspects of cloud computing becomes more complex. One of the challenges cloud administrators face, especially in hybrid cloud environments, is keeping track of the IP address ranges of the public cloud providers, which all publish their IP address range data in different file formats. These formats include deeply nested JSON, CSV, and plain text.

The goal of this article is to outline how this data can be unified, cleaned and made available on a platform that makes it easy for users to consume. Furthermore, some interesting statistics can be derived from those public datasets.

The data and the source code can be found at:

Data sources

The (incomplete) list of public cloud providers that publish their IPv4 CIDR blocks is:

You can click on the individual links to view or download the data manually.

HINT: The published lists of IP address ranges don't represent the complete IP address space that each of these providers owns.

To get the complete IP ranges, data from registries like ARIN would need to be added as well. For simplicity and brevity, only the publicly downloadable information was used.

Retrieving, cleaning, storing and exporting data

The overall data engineering process is divided into multiple steps:

  • Identify the data sources (see above) and define the common schema

  • Retrieve the raw data from the public data sources (via HTTP)

  • Clean the retrieved data (e.g. remove duplicates)

  • Store the data in a common schema, so that it can be aggregated and analyzed across the different public cloud providers at once

  • Export the stored data into different file formats, so that as many different types of clients as possible can make use of it

Additionally, we'd like to keep the costs low and the infrastructure as simple as possible. That's why DuckDB was chosen as the database layer: it offers a rich set of features for reading and writing different file formats, and it can read directly from remote data sources via HTTP using nothing but SQL. That saves the additional effort of out-of-band ETL.

Furthermore, to share the data, we chose GitHub, which is free to use for the scope of our use case. Most importantly, it allows us to store the exported data files in our repository. To run the overall process, GitHub Actions are used as they also offer a free usage tier, and have everything we need to create the described data pipeline.

Common data schema

After inspecting the data source files, the derived unified schema for all loaded data sources will look like this:

Column name       Data type   Description
cloud_provider    VARCHAR     The public cloud provider's name
cidr_block        VARCHAR     The CIDR block, e.g. 10.0.0.0/32
ip_address        VARCHAR     The IP address, e.g. 10.0.0.0
ip_address_mask   INTEGER     The IP address mask, e.g. 32
ip_address_cnt    INTEGER     The number of IP addresses in this CIDR block
region            VARCHAR     The public cloud provider region information (if given)
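To make the schema concrete, here's a small Python sketch (the `cidr_to_row` helper is hypothetical, not part of the actual pipeline) that derives each column from a CIDR block, using the same 2^(32 - mask) formula the SQL scripts below use, cross-checked against the stdlib ipaddress module:

```python
import ipaddress

def cidr_to_row(cidr: str, provider: str, region: str = "No region") -> dict:
    """Split a CIDR block into the columns of the unified schema."""
    ip, mask = cidr.split("/")
    return {
        "cloud_provider": provider,
        "cidr_block": cidr,
        "ip_address": ip,
        "ip_address_mask": int(mask),
        # Same formula as in the SQL scripts: 2^(32 - mask) addresses per block
        "ip_address_cnt": 2 ** (32 - int(mask)),
        "region": region,
    }

row = cidr_to_row("10.0.0.0/24", "AWS", "us-east-1")
# Cross-check the address count against the stdlib ipaddress module
assert row["ip_address_cnt"] == ipaddress.ip_network("10.0.0.0/24").num_addresses
```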

Creating the cloud provider tables

At first, we'll create a table in DuckDB for each of the public cloud providers. If DuckDB is installed (see docs) and in the PATH, we can execute SQL scripts like this:

# $DATA_PATH is the location of the DuckDB database file (this is important, because
# otherwise an in-memory database is used and the data will not be persisted)
# $SCRIPT is the path to the SQL script that shall be executed
duckdb $DATA_PATH < $SCRIPT.sql

Each table is created with its own SQL script, as the contents and formats of each provider's data source differ.

Before starting, we need to make sure that the httpfs extension is installed and loaded (as we use remote datasets):

INSTALL httpfs;
LOAD httpfs;

AWS table

CREATE TABLE aws_ip_data AS (
  SELECT DISTINCT
    prefixes.cidr_block,
    prefixes.ip_address,
    prefixes.ip_address_mask,
    CAST(POW(2, 32-prefixes.ip_address_mask) AS INTEGER) AS ip_address_cnt,
    prefixes.region
  FROM (
    SELECT
      prefix_object.ip_prefix AS cidr_block,
      STR_SPLIT(prefix_object.ip_prefix, '/')[1] AS ip_address,
      CAST(STR_SPLIT(prefix_object.ip_prefix, '/')[2] AS INTEGER) AS ip_address_mask,
      prefix_object.region
    FROM (
      SELECT UNNEST(prefixes) AS prefix_object FROM 'https://ip-ranges.amazonaws.com/ip-ranges.json'
    )
  ) prefixes
);
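To clarify what the nested query does, here's a hedged Python sketch of the same extraction. The inline `sample` is a tiny hypothetical snippet shaped like ip-ranges.json (the real file has many more fields and entries); the set mimics SELECT DISTINCT, which matters because the same prefix can appear once per AWS service:

```python
import json

# Tiny sample shaped like https://ip-ranges.amazonaws.com/ip-ranges.json
sample = json.loads("""
{
  "prefixes": [
    {"ip_prefix": "3.5.140.0/22", "region": "ap-northeast-2", "service": "AMAZON"},
    {"ip_prefix": "3.5.140.0/22", "region": "ap-northeast-2", "service": "EC2"}
  ]
}
""")

rows = set()  # SELECT DISTINCT: the same prefix appears once per service
for prefix_object in sample["prefixes"]:  # UNNEST(prefixes)
    ip, mask = prefix_object["ip_prefix"].split("/")
    rows.add((prefix_object["ip_prefix"], ip, int(mask),
              2 ** (32 - int(mask)), prefix_object["region"]))

print(rows)  # a single row; duplicates across services are collapsed
```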

Azure table

CREATE TABLE azure_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    CASE
      WHEN region = '' THEN 'No region'
      ELSE region
    END AS region
  FROM (
    SELECT DISTINCT
      prop.region AS region,
      UNNEST(prop.addressPrefixes) AS prefixes
    FROM (
      SELECT 
        values.properties AS prop
      FROM (
        SELECT 
          UNNEST(values) AS values
        FROM
          read_json_auto('https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20230417.json', maximum_object_size=10000000)
      )
    )
  )
  WHERE
    prefixes NOT LIKE '%::%'
);
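The Azure file nests the prefixes two levels deep, and the SQL also filters out IPv6 ranges and empty regions. The same logic as a Python sketch, over a hypothetical, heavily trimmed sample shaped like the ServiceTags file:

```python
import json

# Tiny sample shaped like the Azure ServiceTags JSON (heavily trimmed)
sample = json.loads("""
{
  "values": [
    {"properties": {"region": "westeurope",
                    "addressPrefixes": ["13.69.0.0/17", "2603:1020::/47"]}},
    {"properties": {"region": "",
                    "addressPrefixes": ["13.104.129.64/26"]}}
  ]
}
""")

rows = []
for value in sample["values"]:              # UNNEST(values)
    prop = value["properties"]
    for prefix in prop["addressPrefixes"]:  # UNNEST(prop.addressPrefixes)
        if "::" in prefix:                  # WHERE prefixes NOT LIKE '%::%'
            continue
        region = prop["region"] or "No region"  # CASE WHEN region = '' ...
        rows.append((prefix, region))

print(rows)  # IPv6 block dropped, empty region replaced
```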

CloudFlare table

CREATE TABLE cloudflare_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    'No region' AS region
  FROM (
    SELECT
      column0 AS prefixes
    FROM
      read_csv_auto('https://www.cloudflare.com/ips-v4')
  )
);

DigitalOcean table

CREATE TABLE digitalocean_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    'No region' AS region
  FROM (
    SELECT
      column0 AS prefixes
    FROM
      read_csv_auto('https://digitalocean.com/geo/google.csv')
    WHERE
      column0 NOT LIKE '%::%'
  )
);
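The DigitalOcean file is a CSV whose first column holds the CIDR block, mixed IPv4 and IPv6. A minimal Python sketch of the same filtering, over hypothetical sample lines shaped like that file:

```python
# Hypothetical sample lines shaped like the DigitalOcean geo CSV:
# the first column is the CIDR block, followed by geo metadata.
lines = [
    "5.101.96.0/21,NL,NL-NH,Amsterdam,",
    "2400:6180::/32,SG,SG-01,Singapore,",  # IPv6, filtered out below
]

prefixes = []
for line in lines:
    cidr = line.split(",")[0]  # column0 in the SQL above
    if "::" in cidr:           # WHERE column0 NOT LIKE '%::%'
        continue
    prefixes.append(cidr)

print(prefixes)  # only the IPv4 block remains
```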

Fastly table

CREATE TABLE fastly_ip_data AS (
  SELECT DISTINCT
    prefixes AS cidr_block,
    STR_SPLIT(prefixes, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    'No region' AS region
  FROM (
    SELECT
      UNNEST(addresses) AS prefixes
    FROM
      read_json_auto('https://api.fastly.com/public-ip-list')
  )
);

Google Cloud table

CREATE TABLE google_ip_data AS (
  SELECT DISTINCT
    prefixes.cidr_block,
    prefixes.ip_address,
    prefixes.ip_address_mask,
    CAST(pow(2, 32-prefixes.ip_address_mask) AS INTEGER) AS ip_address_cnt,
    prefixes.region
  FROM (
    SELECT
      prefix_object.ipv4Prefix AS cidr_block,
      str_split(prefix_object.ipv4Prefix, '/')[1] AS ip_address,
      CAST(str_split(prefix_object.ipv4Prefix, '/')[2] AS INTEGER) AS ip_address_mask,
      prefix_object.scope as region
    FROM (
      SELECT unnest(prefixes) AS prefix_object FROM 'https://www.gstatic.com/ipranges/cloud.json'
    )
    WHERE
      prefix_object.ipv4Prefix IS NOT NULL
  ) prefixes
);

Oracle Cloud table

CREATE TABLE oracle_ip_data AS (
  SELECT DISTINCT
    prefixes.cidr AS cidr_block,
    STR_SPLIT(prefixes.cidr, '/')[1] AS ip_address,
    CAST(STR_SPLIT(prefixes.cidr, '/')[2] AS INTEGER) AS ip_address_mask,
    CAST(POW(2, 32-CAST(STR_SPLIT(prefixes.cidr, '/')[2] AS INTEGER)) AS INTEGER) AS ip_address_cnt,
    CASE
      WHEN region = '' THEN 'No region'
      ELSE region
    END AS region
  FROM (
    SELECT DISTINCT
      region,
      UNNEST(cidrs) AS prefixes
    FROM (
      SELECT 
        regions.region AS region,
        regions.cidrs AS cidrs
      FROM (
        SELECT 
          UNNEST(regions) AS regions
        FROM
          read_json_auto('https://docs.oracle.com/en-us/iaas/tools/public_ip_ranges.json', maximum_object_size=10000000)
      )
    )
  )
);

Create a combined view

The next step is to create a new view (ip_data) that combines our tables for the individual cloud providers. We can then use this view later to compare the different cloud providers.

The view definition looks like this:

CREATE VIEW ip_data AS (
  SELECT 'AWS' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM aws_ip_data
  UNION ALL
  SELECT 'Azure' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM azure_ip_data
  UNION ALL
  SELECT 'CloudFlare' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM cloudflare_ip_data
  UNION ALL
  SELECT 'DigitalOcean' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM digitalocean_ip_data
  UNION ALL
  SELECT 'Fastly' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM fastly_ip_data
  UNION ALL
  SELECT 'Google Cloud' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM google_ip_data
  UNION ALL
  SELECT 'Oracle' as cloud_provider, cidr_block, ip_address, ip_address_mask, ip_address_cnt, region FROM oracle_ip_data
);
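Conceptually, the view just concatenates the per-provider tables and stamps each row with a literal provider name. A sketch of that UNION ALL, using hypothetical in-memory row lists standing in for the DuckDB tables:

```python
# Hypothetical per-provider row lists standing in for the DuckDB tables;
# each row follows the unified schema minus the cloud_provider column.
aws_rows = [("3.5.140.0/22", "3.5.140.0", 22, 1024, "ap-northeast-2")]
azure_rows = [("4.145.74.52/30", "4.145.74.52", 30, 4, "No region")]

# UNION ALL with a literal provider column, as in the view definition
ip_data = (
    [("AWS", *row) for row in aws_rows]
    + [("Azure", *row) for row in azure_rows]
)

for row in ip_data:
    print(row)
```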

Export the data

To be able to use the data with other tools, we need to export the data to different formats, in our case CSV and Parquet. You can review the executed queries in the repository.

-- Only an example, this needs to be done for all providers as well!

-- Export complete data as CSV
COPY (SELECT * FROM ip_data ORDER BY cloud_provider, cidr_block) TO 'data/providers/all.csv' WITH (HEADER 1, DELIMITER ',');

-- Export complete data as Parquet
COPY (SELECT * FROM ip_data ORDER BY cloud_provider, cidr_block) TO 'data/providers/all.parquet' (FORMAT 'parquet', COMPRESSION 'SNAPPY');

Analyze the data

Now that we've prepared our data, we can start analyzing it. For this, we'll use an ObservableHQ notebook, to which we'll upload the all.csv file.

Overall IP address counts

Total number and value of IP addresses

An astonishing insight is that both AWS and Azure have more than six times as many IP addresses available as their next competitor.

Also, the market value of their IP addresses is nearly four billion US dollars, according to a market analysis.

CIDR masks distribution by public cloud provider

It's remarkable that, although AWS and Azure have similar absolute numbers of IP addresses, their CIDR block sizes differ strongly: AWS owns a few very large IP ranges, whereas Azure owns many rather small ranges and only a few very large ones:
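This kind of mask distribution is easy to compute from the exported data. A hedged sketch, using hypothetical (provider, mask) pairs in place of the real all.csv rows:

```python
from collections import Counter

# Hypothetical (provider, mask) pairs standing in for the exported rows
rows = [
    ("AWS", 10), ("AWS", 11), ("Azure", 24), ("Azure", 24),
    ("Azure", 25), ("Azure", 8),
]

# Count how many CIDR blocks each provider has per mask length
distribution: dict[str, Counter] = {}
for provider, mask in rows:
    distribution.setdefault(provider, Counter())[mask] += 1

print(distribution["Azure"].most_common())
```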

Another view is the filterable table for this data:

AWS CIDR masks

Azure CIDR masks

CloudFlare CIDR masks

DigitalOcean CIDR masks

Fastly CIDR masks

Google Cloud CIDR masks

Oracle Cloud CIDR masks

Conclusion

In this article, we described a simple and straightforward way to gather and transform data in different formats with DuckDB, as well as export it to a common schema as CSV and Parquet files.

Furthermore, we leveraged DuckDB on Observable to analyze and display the data in beautiful and interactive graphs.

By using the GitHub Actions free tier as our "runtime" for the data processing via Bash and SQL scripts, and hosting our data in a GitHub repo (also covered by the free tier), we were able to show that data pipelines like the one in this use case can be built without incurring infrastructure costs. Also, the Observable pricing model supports our analyses for free.