Exporting a table from BigQuery to, say, a single local file can be a complicated process, depending on how large the table is.
Among the various limitations documented on the Exporting Table Data page, I often find the following one quite problematic:
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
This means that if one wants to export a large BigQuery table into a single file, one has to either take a sample of up to 1 GB (which may not be desirable) or export the table into multiple files and merge them afterwards. After some experimentation, I have established a procedure for exporting large BigQuery tables. Take the publicly available US Natality sample dataset hosted on BigQuery as an example: the table amounts to 21.94 GB in size.
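(If you want to confirm the table size programmatically rather than from the BigQuery console, a minimal sketch along the following lines should work, assuming the google-cloud-bigquery client library is installed and the environment is authenticated.)

    from google.cloud import bigquery

    # Look up the table's metadata and report its size.
    client = bigquery.Client()
    table = client.get_table("bigquery-public-data.samples.natality")
    print(f"{table.num_rows} rows, {table.num_bytes / 1e9:.2f} GB")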
To export the table (bigquery-public-data:samples.natality) to a single parquet file, I did the following in a Google Cloud Shell:
Create a Google Cloud Storage bucket to be our intermediate storage for the exported files (I randomly made up the name good-tale-child-natality for the target bucket; replace it with your own bucket name):

    gsutil mb gs://good-tale-child-natality
Export the table into the bucket as multiple parquet files:

    bq extract \
      --destination_format PARQUET \
      --compression SNAPPY \
      'bigquery-public-data:samples.natality' \
      gs://good-tale-child-natality/natality_*.parquet
We can also check the size of the exported files:

    zhangxiubo@cloudshell:~ gsutil ls -l gs://good-tale-child-natality
    19036336  ...  gs://good-tale-child-natality/natality_000000000000.parquet
    19006398  ...  gs://good-tale-child-natality/natality_000000000001.parquet
    18973139  ...  gs://good-tale-child-natality/natality_000000000002.parquet
    19090872  ...  gs://good-tale-child-natality/natality_000000000003.parquet
    18961658  ...  gs://good-tale-child-natality/natality_000000000004.parquet
    ...
Note how the size of each parquet file is only around 19 MB after compression, which is much smaller than the 1 GB limit.

Retrieve the individual parquet files and pack them into a zip file (using the atool package):

    gsutil -m cp -r gs://good-tale-child-natality/* .
    apack natality.zip *.parquet
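(Cloud Shell already ships with gsutil, but if you are running this step from an arbitrary Python environment instead, a small sketch using the google-cloud-storage client should retrieve the shards just as well; the bucket name is again the made-up one from above.)

    from google.cloud import storage

    client = storage.Client()

    # Download every exported parquet shard from the intermediate bucket
    # into the current working directory.
    for blob in client.list_blobs("good-tale-child-natality"):
        if blob.name.endswith(".parquet"):
            blob.download_to_filename(blob.name)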
Then download the zip archive to our local machine:

    cloudshell download natality.zip
Once downloaded, unpack the archive:
    aunpack natality.zip
Use duckdb to merge the parquet files into a single parquet file called natality.parquet:

    import duckdb

    con = duckdb.connect(':memory:')
    con.execute("COPY (SELECT * FROM parquet_scan('*.parquet')) TO 'natality.parquet' (FORMAT 'parquet');")
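(A quick sanity check I find worthwhile after the merge, sketched below with duckdb itself, is to confirm that the merged file contains the same number of rows as the downloaded shards; the file names follow the natality_*.parquet pattern used above.)

    import duckdb

    con = duckdb.connect(':memory:')

    # Row count across the downloaded shards. The merged file would also match
    # '*.parquet', so count the shards explicitly by their prefix.
    shard_rows = con.execute(
        "SELECT count(*) FROM parquet_scan('natality_*.parquet');"
    ).fetchone()[0]

    # Row count in the merged file.
    merged_rows = con.execute(
        "SELECT count(*) FROM parquet_scan('natality.parquet');"
    ).fetchone()[0]

    assert shard_rows == merged_rows, (shard_rows, merged_rows)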
One can then query natality.parquet using duckdb, retrieving the results as pandas.DataFrame objects:

    samples = con.execute("SELECT * FROM 'natality.parquet' USING SAMPLE 5%;").fetch_df()
In the end, we may still want to work with a smaller sample of the dataset as a pandas.DataFrame object in order to explore it efficiently in memory; this can also be achieved using BigQuery's official client libraries, but I find the above approach more reliable in general, and sometimes even faster.
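(For comparison, a minimal sketch of the client-library route might look like the following: it samples the table server-side with a query and pulls the result straight into a DataFrame. It assumes google-cloud-bigquery is installed, along with the pandas extras that to_dataframe relies on.)

    from google.cloud import bigquery

    client = bigquery.Client()

    # Sample roughly 5% of the table server-side, then download it as a DataFrame.
    query = """
        SELECT *
        FROM `bigquery-public-data.samples.natality`
        WHERE RAND() < 0.05
    """
    samples = client.query(query).to_dataframe()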