
Exporting BigQuery Tables

Exporting a table from BigQuery to, say, a single local file can be a complicated process depending on how large the table is.

Among the various limitations documented on the Exporting Table Data page, I find the following one particularly problematic:

You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.

This means that if one wants to export a large BigQuery table into a single file, one has to either take a sample of up to 1 GB (which may not be desirable) or export the table into multiple files and merge them afterwards. After some experimentation, I have established a procedure for exporting large BigQuery tables. Take the publicly available US Natality sample dataset hosted on BigQuery as an example: the table amounts to 21.94 GB in size.
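
For reference, the table size can be confirmed programmatically. Below is a minimal sketch, assuming the google-cloud-bigquery client library is installed and credentials are configured:

    from google.cloud import bigquery

    client = bigquery.Client()

    # num_bytes is the table's size in bytes; divide to get a human-readable figure.
    table = client.get_table('bigquery-public-data.samples.natality')
    print(f'{table.num_bytes / 1e9:.2f} GB')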

To export the table (bigquery-public-data:samples.natality) to a single parquet file, I did the following in Google Cloud Shell:

  1. Create a Google Cloud Storage bucket to serve as intermediate storage for the exported files (I randomly made up the name good-tale-child-natality for the target bucket; replace it with your own bucket name):

    gsutil mb gs://good-tale-child-natality
    
  2. Export the table into the bucket as multiple parquet files:

    bq extract \
    --destination_format PARQUET \
    --compression SNAPPY \
    'bigquery-public-data:samples.natality' \
    gs://good-tale-child-natality/natality_*.parquet

    We can also check the sizes of the exported files:

    zhangxiubo@cloudshell:~ gsutil ls -l gs://good-tale-child-natality
    	19036336  ...	gs://good-tale-child-natality/natality_000000000000.parquet
    	19006398  ...	gs://good-tale-child-natality/natality_000000000001.parquet
    	18973139  ...	gs://good-tale-child-natality/natality_000000000002.parquet
    	19090872  ...	gs://good-tale-child-natality/natality_000000000003.parquet
    	18961658  ...	gs://good-tale-child-natality/natality_000000000004.parquet
    	...

    Note that each parquet file is only around 19 MB after compression, well below the 1 GB limit.

  3. Retrieve the individual parquet files and pack them into a zip file (using the atool package):

    gsutil -m cp -r gs://good-tale-child-natality/* .
    apack natality.zip *.parquet

    Then download the zip archive to our local machine:

    cloudshell download natality.zip

  4. Once downloaded, unpack the archive:

    aunpack natality.zip
  5. Use duckdb to merge the parquet files into a single parquet file called natality.parquet (a quick sanity check on the merge is sketched after this list):

    import duckdb

    # Merge all the downloaded parquet files into a single parquet file.
    con = duckdb.connect(':memory:')
    con.execute("COPY (SELECT * FROM parquet_scan('*.parquet')) TO 'natality.parquet' (FORMAT 'parquet');")
  6. One can then query natality.parquet using duckdb, retrieving the results as pandas.DataFrame objects:

    samples = con.execute("SELECT * FROM 'natality.parquet' USING SAMPLE 5%;").fetch_df()
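
As a sanity check on step 5, one can verify that the merged file contains the same number of rows as the original shards. Here is a minimal sketch using duckdb; it assumes the natality_*.parquet shards are still present in the working directory alongside natality.parquet:

    import duckdb

    con = duckdb.connect(':memory:')

    # Row count of the merged file.
    merged = con.execute("SELECT COUNT(*) FROM 'natality.parquet'").fetchone()[0]

    # Total row count across the original shards (the glob does not match natality.parquet).
    shards = con.execute("SELECT COUNT(*) FROM parquet_scan('natality_*.parquet')").fetchone()[0]

    assert merged == shards, f'row counts differ: {merged} != {shards}'
    print(f'{merged} rows in natality.parquet')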

In the end, we may still want to work with a smaller sample of the dataset as a pandas.DataFrame object so that we can explore it efficiently in memory; this can also be achieved using BigQuery’s official client libraries, but I find the approach above more reliable in general and sometimes even faster.
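
For completeness, here is a minimal sketch of the client-library route, assuming the google-cloud-bigquery package (with its pandas extras) is installed and credentials are configured; the RAND()-based predicate is just one way to take an approximate 5% sample server-side:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Sample roughly 5% of the rows server-side and fetch them as a pandas DataFrame.
    query = """
        SELECT *
        FROM `bigquery-public-data.samples.natality`
        WHERE RAND() < 0.05
    """
    samples = client.query(query).to_dataframe()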