33
Yugabyte COPY batch size
COPY is the fastest way to insert data in PostgreSQL and also in YugabyteDB. The usage is mostly the the same, because the query layer reuses PostgreSQL. In PostgreSQL, the COPY can run as one transaction even for very large table. However, YugabyteDB writes are going though a distributed consensus, storing intents that are then applied. Long transactions may be problematic here and this is why we introduced intermediate commits with ROWS_PER_TRANSACTION
. To avoid surprises when loading a pg_dump, we also set a default value for it with yb_default_copy_from_rows_per_transaction=1000
. You can set it to 0 do revert to the PostgreSQL behaviour, or set another value depending on your system size and settings.
Is 1000 the right size? In order to give an idea, I've run a small data ingest, 10 millions rows, with different values of ROWS_PER_TRANSACTION. Here is the code I used to generate a 10 million rows csv:
for i in {1..10000000} ; do
echo -e "$i\t$RANDOM$SECONDS\t$SECONDS"
done > /var/tmp/10millions.tsv
du -h /var/tmp/10millions.tsv
Here is the COPY with ROWS_PER_TRANSACTION in powers of two:
for p in {22..0} ; do
psql -c "truncate table demo"
echo PGOPTIONS="-c yb_default_copy_from_rows_per_transaction=$((2** $p))" \
timeout 36000 psql \
-c "show yb_default_copy_from_rows_per_transaction" \
-c "\\timing on" \
-c "\\copy demo from '/var/tmp/10millions.tsv'" \
2>&1
sleep 60
done | ts | tee 10millions.log
I've gathered the elapsed time from the logs:
awk '/ [0-9]+$/{n=$NF}/Time:/{printf "%10d rows per tx, %5.1f min, %s\n",n,$5/1000/60,$NF}' 10millions.log
Here is the result:
4194304 rows per tx, 20.3 min, (20:16.073)
2097152 rows per tx, 15.8 min, (15:49.707)
1048576 rows per tx, 14.3 min, (14:20.843)
524288 rows per tx, 12.0 min, (11:57.360)
262144 rows per tx, 12.3 min, (12:16.458)
131072 rows per tx, 11.4 min, (11:22.775)
65536 rows per tx, 11.4 min, (11:26.464)
32768 rows per tx, 10.7 min, (10:40.638)
16384 rows per tx, 11.4 min, (11:22.026)
8192 rows per tx, 12.0 min, (11:57.187)
4096 rows per tx, 12.2 min, (12:09.036)
2048 rows per tx, 12.8 min, (12:49.271)
1024 rows per tx, 12.7 min, (12:43.371)
512 rows per tx, 12.7 min, (12:39.496)
256 rows per tx, 15.6 min, (15:33.997)
128 rows per tx, 21.9 min, (21:52.231)
64 rows per tx, 29.0 min, (29:00.760)
32 rows per tx, 44.3 min, (44:16.720)
16 rows per tx, 74.8 min, (01:14:46.988)
8 rows per tx, 125.2 min, (02:05:12.894)
4 rows per tx, 205.1 min, (03:25:06.307)
2 rows per tx, 452.1 min, (07:32:04.131)
Transactions that are too large have to spill to disk (RocksDB first level is MemTable, then, when full, is flushed to SST files). Transactions that are too short suffer from many short calls to DocDB. The optimal batch size here is between 500 and 500000 rows with the best thoughput at 30000. Our default of 1000 is conservative to be acceptable when memory is short, or when rows are very large. 1000 is enough to reduce the roundtrip overhead.
There are other optimizations possible, like ysql_non_txn_copy=true
that I mentioned in a previous post, and some other to come in future versions (I'm writing this in YugabyteDB 2.11). But those are reserved for special cases (initial load, migrations, IoT, upserts) where we can bypass some operations without compromising the database. The YSQL INSERT or COPY, with the PostgreSQL syntax and semantic, has to read before writing, in order to detect duplicate key violation. And it has to obey to ACID properties.
If fast ingest is your goal, the YCQL INSERT or COPY bypasses most of these, and an insert is just one write operations, with no reads. This is really fast on YugabyteDB which stores data in LSM Tree: the new values are appended into the MemTable (and WAL). And the MemTable is flushed to disk asynchronously by sequential writes. Duplicates are resolved later by asynchronous compaction. This is why a bulk load is faster in YCQL than YSQL.
If you compare YSQL COPY with other SQL databases, the comparison must be with the same level of High Availabilty. YugabteDB is build for High Availability: synchronous replication in the cluster for a RTO=0 / RPO=0 resilience. A monolithic PostgreSQL without any synchronous replica can ingest data faster, but is subject to data loss and service outage. The performance difference will decrease in the future versions, because there are many optimizations to come. And because all nodes are active, you can scale out to increase the thoughput.
33