Exporting a table from BigQuery to, say, a single local file can be a complicated process, depending on how large the table is.
Among the various limitations documented on the Exporting Table Data page, I often find the following one quite problematic:
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
This means that if one wants to export a large BigQuery table into a single file, one has to either take a sample of up to 1 GB (which may not be desirable) or export the table into multiple files and merge them afterwards. After some experimentation, I have established a procedure for exporting large BigQuery tables. Take the publicly available US Natality sample dataset hosted on BigQuery as an example: the table amounts to 21.94 GB in size.
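(If you want to confirm the table size programmatically rather than from the BigQuery console, a minimal sketch along the following lines should work, assuming the google-cloud-bigquery client library is installed and the environment is authenticated.)

    from google.cloud import bigquery

    # Look up the table's metadata and report its size.
    client = bigquery.Client()
    table = client.get_table("bigquery-public-data.samples.natality")
    print(f"{table.num_rows} rows, {table.num_bytes / 1e9:.2f} GB")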
To export the table (bigquery-public-data:samples.natality) to a single parquet file, I did the following in a Google Cloud Shell:
Create a Google Cloud Storage bucket to be our intermediate storage for the exported files (I randomly made up the name good-tale-child-natality for the target bucket; replace it with your own bucket name):

    gsutil mb gs://good-tale-child-natality
Export the table into the bucket as multiple parquet files:

    bq extract \
      --destination_format PARQUET \
      --compression SNAPPY \
      'bigquery-public-data:samples.natality' \
      gs://good-tale-child-natality/natality_*.parquet
We can also check the size of the exported files:

    zhangxiubo@cloudshell:~ gsutil ls -l gs://good-tale-child-natality
    19036336  ...  gs://good-tale-child-natality/natality_000000000000.parquet
    19006398  ...  gs://good-tale-child-natality/natality_000000000001.parquet
    18973139  ...  gs://good-tale-child-natality/natality_000000000002.parquet
    19090872  ...  gs://good-tale-child-natality/natality_000000000003.parquet
    18961658  ...  gs://good-tale-child-natality/natality_000000000004.parquet
    ...
Note how the size of each parquet file is only around 19 MB after compression, which is much smaller than the 1 GB limit.

Retrieve the individual parquet files and pack them into a zip file (using the atool package):

    gsutil -m cp -r gs://good-tale-child-natality/* .
    apack natality.zip *.parquet
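(Cloud Shell already ships with gsutil, but if you are running this step from an arbitrary Python environment instead, a small sketch using the google-cloud-storage client should retrieve the shards just as well; the bucket name is again the made-up one from above.)

    from google.cloud import storage

    client = storage.Client()

    # Download every exported parquet shard from the intermediate bucket
    # into the current working directory.
    for blob in client.list_blobs("good-tale-child-natality"):
        if blob.name.endswith(".parquet"):
            blob.download_to_filename(blob.name)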
Then download the zip archive to our local machine:

    cloudshell download natality.zip
Once downloaded, unpack the archive:
    aunpack natality.zip
Use duckdb to merge the parquet files into a single parquet file called natality.parquet:

    import duckdb

    con = duckdb.connect(':memory:')
    con.execute("COPY (SELECT * FROM parquet_scan('*.parquet')) TO 'natality.parquet' (FORMAT 'parquet');")
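(A quick sanity check I find worthwhile after the merge, sketched below with duckdb itself, is to confirm that the merged file contains the same number of rows as the downloaded shards; the file names follow the natality_*.parquet pattern used above.)

    import duckdb

    con = duckdb.connect(':memory:')

    # Row count across the downloaded shards. The merged file would also match
    # '*.parquet', so count the shards explicitly by their prefix.
    shard_rows = con.execute(
        "SELECT count(*) FROM parquet_scan('natality_*.parquet');"
    ).fetchone()[0]

    # Row count in the merged file.
    merged_rows = con.execute(
        "SELECT count(*) FROM parquet_scan('natality.parquet');"
    ).fetchone()[0]

    assert shard_rows == merged_rows, (shard_rows, merged_rows)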
One can then query natality.parquet using duckdb, retrieving the results as pandas.DataFrame objects:

    samples = con.execute("SELECT * FROM 'natality.parquet' USING SAMPLE 5%;").fetch_df()
In the end, we may still want to work with a smaller sample of the dataset as a pandas.DataFrame object in order to explore it efficiently in memory; this can also be achieved using BigQuery's official client libraries, but I find the above approach more reliable in general, and sometimes even faster.
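(For comparison, a minimal sketch of the client-library route might look like the following: it samples the table server-side with a query and pulls the result straight into a DataFrame. It assumes google-cloud-bigquery is installed, along with the pandas extras that to_dataframe relies on.)

    from google.cloud import bigquery

    client = bigquery.Client()

    # Sample roughly 5% of the table server-side, then download it as a DataFrame.
    query = """
        SELECT *
        FROM `bigquery-public-data.samples.natality`
        WHERE RAND() < 0.05
    """
    samples = client.query(query).to_dataframe()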