New parquet conversion in IP2Location Python CSV Converter

Parquet is an open source, column-oriented data file format, designed for streamlined data handling and analysis. It is enhanced with high efficiency of processing complex data structures. It also supports performance-driven data compression and encoding techniques. Parquet is mainly used in big data processing framework, for example, Apache Spark and Presto. Parquet is also supported by mainstream cloud data storage and data lakes, such as Google Big Query, Amazon Athena and Snowflake.

There are various methods and tools to create a parquet file from scratch, or convert an existing CSV file into a parquet file. In this tutorial, we are going to show how to convert the IP2Location and IP2Proxy CSV data files into parquet files. Before we start, please ensure that you have installed or updated the IP2Location Python CSV Converter tool, as we are going to use this tool to make the conversion. This tutorial will also use IP2Location and IP2Proxy CSV data files as examples. You can purchase one at here, or register a free account to get a LITE version of the CSV file.

How to convert the IP2Location CSV database to parquet

  1. In your command prompt, go to the folder of the CSV file, and execute the following command:
ip2location-csv-converter -parquet <database_type> <input_csv_filename> <output_parquet_filename>
  1. The database_type will determine the columns in the parquet file. You can get the database_type from here. The valid database_type value for IP2Location CSV file shall be from DB1 to DB26.
  2. You shall see a parquet file converted once the process is done. You can then use the parquet file to perform queries.

How to convert the IP2Proxy CSV database to parquet

  1. In your command prompt, go to the folder of the CSV file and execute the following command:
  2. The database_type will determine the columns in the parquet file. You can find the database_type of the CSV file from here. The valid database_type value for the IP2Proxy CSV file shall be between PX1 and PX12.
  3. Once the command is executed successfully, you shall see the converted parquet file in the folder. You can then use the parquet file and query from the file as you like.

Note for IPv6 parquet

For Parquet, due to the current limitation of the Decimal data type, IP2Location Python CSV converter is unable to preset the ip_from and ip_to as the Decimal. Both are encoded and stored as varchar instead. In order to query using these columns after the conversion, you will need to first convert the IPv6 address into the corresponding IPv6 number, and then convert to zero-padded 32-character lowercase hex string. Below are an example Python code to demonstrate the process:

import ipaddress

# Example IPv6 address
ipv6_addr = "2001:db8::1"

# Convert to integer
ipv6_int = int(ipaddress.IPv6Address(ipv6_addr))

# Convert to zero-padded 32-character lowercase hex string
ipv6_hex = format(ipv6_int, "032x")

After that, you can query from the converted parquet file using the converted hex string, for example,

import duckdb

result = duckdb.query(f"""
    SELECT * FROM '<ipv6_parquet_filename>'
    WHERE ipv6_hex = '{ipv6_hex}'
""").to_df()

Was this article helpful?

Related Articles