For context, I have a bucket where I store CSV files and a function that loads that data into a database whenever a new CSV is uploaded to the bucket.
I tried to upload 100 CSV files at the same time, 581,100 records in total (70 MB).
All of those files appear in my bucket and a new table is created.
But when I run a “select count”, I only find 267,306 records (46% of the total).
I tried again with a different bucket, function, and table, uploading another 100 files, 4,779,100 records this time (312 MB).
When I check the table in BigQuery, only 2,293,920 records exist (about 48% of the records that should be there).
So my question is: is there a way to upload all the CSV files I want without losing data? Or does GCP have some restriction for that task?
Thank you.
As pointed out in your last comment:
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too many table update operations for this table
This error shows that you have reached the limit for maximum rate of table metadata update operations per table for Standard tables, according to the documentation. You can review the limits that may apply here. Note that this quota cannot be increased.
In the diagnosis section, it says:
Metadata table updates can originate from API calls that modify a table's metadata or from jobs that modify a table's content.
As a resolution, you can do the following:
Reduce the update rate for the table metadata.
Add a delay between jobs or table operations to make sure that the update rate is within the limit.
For data inserts or modification, consider using DML operations. DML operations are not affected by the Maximum rate of table metadata update operations per table rate limit.
DML operations have other limits and quotas. For more information, see Using data manipulation language (DML).
If you frequently load data from multiple small files stored in Cloud Storage that uses a job per file, then combine multiple load jobs into a single job. You can load from multiple Cloud Storage URIs with a comma-separated list (for example, gs://my_path/file_1,gs://my_path/file_2), or by using wildcards (for example, gs://my_path/*).
For more information, see Batch loading data.
If you use single-row queries (that is, INSERT statements) to write data to a table, consider batching multiple queries into one to reduce the number of jobs. BigQuery doesn't perform well when used as a relational database, so single-row INSERT statements executed at a high speed is not a recommended best practice.
If you intend to insert data at a high rate, consider using BigQuery Storage Write API. It is a recommended solution for high-performance data ingestion. The BigQuery Storage Write API has robust features, including exactly-once delivery semantics. To learn about limits and quotas, see Storage Write API and to see costs of using this API, see BigQuery data ingestion pricing.
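A rough sketch of the "combine multiple load jobs into a single job" suggestion, using the google-cloud-bigquery Python client. The bucket name, prefix, and table ID are placeholders, and the header-row and schema-autodetect settings are assumptions about the CSV files, not something stated in the question.

```python
def wildcard_uri(bucket: str, prefix: str = "", suffix: str = ".csv") -> str:
    """Build a gs:// URI with a single wildcard that matches every CSV under a prefix."""
    prefix = prefix.strip("/")
    path = f"{prefix}/*{suffix}" if prefix else f"*{suffix}"
    return f"gs://{bucket}/{path}"


def load_all_csvs(bucket: str, table_id: str, prefix: str = "") -> int:
    """Load every matching CSV with ONE load job instead of one job per file."""
    # Imported lazily so the URI helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: each file has a header row
        autodetect=True,      # assumption: let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_uri(
        wildcard_uri(bucket, prefix), table_id, job_config=job_config
    )
    job.result()  # wait for completion; raises if the job failed
    return job.output_rows


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(wildcard_uri("my-bucket", "incoming/"))
```

With this approach the load would be started once per batch of files (for example on a schedule) rather than once per uploaded file, so 100 uploads produce one table update instead of 100.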
I have a large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?
The constraints are:
The table needs to remain readable while the job is running
The table needs to be writable in case one of the jobs runs overtime and two jobs end up running at the same time
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run DataFrame.to_sql
Load data into a csv in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load data in memory, break it into chunks, and run a multi-threaded process in which each thread updates the table chunk by chunk, using a shared connection pool to avoid exhausting connection limits
My recent Airflow-related ETL project could be a reference for you.
Input DB: large DB (billion-row Oracle database)
Interim DB: medium DB (HDF5 file with tens of millions of rows)
Output DB: medium DB (MySQL with tens of millions of rows)
In my experience, writing to the database is the main bottleneck in such an ETL process, so I set it up as follows.
For the interim stage, I use HDF5 as the interim DB or file for data transformation. The pandas to_hdf function is fast on large data: in my case, writing 20 million rows to HDF5 took less than 3 minutes.
The pandas IO performance benchmark is linked below. HDF5 is among the three fastest formats there and is widely used. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map each column type to the database column type and length, especially for string (varchar) columns. Without manual mapping, to_sql maps them to a blob format or varchar(1000). The default mode is about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the DB via to_sql (chunksize mode) took about 20 minutes.
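A minimal sketch of the manual type mapping described above, assuming pandas and SQLAlchemy are installed. The helper names, the padding rule, and the default chunk size are illustrative choices, not the values used in the project.

```python
def varchar_length(values, padding=8, minimum=16):
    """Pick a VARCHAR length that fits the longest value plus some padding."""
    longest = max((len(str(v)) for v in values if v is not None), default=0)
    return max(minimum, longest + padding)


def write_frame(df, engine, table, text_columns, chunksize=10_000):
    """Append a DataFrame to a table, declaring explicit VARCHAR sizes."""
    # Imported lazily so varchar_length() can be used without SQLAlchemy.
    from sqlalchemy.types import VARCHAR

    # Map each text column to a sized VARCHAR instead of the driver default.
    dtype = {col: VARCHAR(varchar_length(df[col])) for col in text_columns}
    df.to_sql(
        table,
        engine,
        if_exists="append",
        index=False,
        chunksize=chunksize,  # rows written per batch
        dtype=dtype,
    )
```

The dtype mapping only takes effect when to_sql creates the table. If the table already exists, its existing column definitions are used.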
If you like the answer, please vote it up.
Another idea for your reference, based on PostgreSQL partitioned tables. It needs some DDL operations to define the partitioned table.
Currently, your main constraints are:
the table needs to remain readable while the job is running
This means no locks are allowed.
the table needs to be writable in case one of the jobs runs overtime and two jobs end up running at the same time
It should be capable of handling multiple writes at the same time.
I would add one more thing for you to consider:
reasonable read performance while writing.
Performance and user experience are key.
A partitioned table could meet all of these requirements. It is transparent to the client application.
At present you are doing ETL, and you will soon face performance issues as the table grows. In my view, a partitioned table is the only solution.
The main steps are:
Create a partitioned table using list partitioning.
Normal reads and writes to the table run as usual.
ETL process (can run in parallel):
- ETL the data and load it into a new table (very slow, minutes to hours, but no impact on the main table).
- Attach the new table to the main table's partition list (super fast, on the order of microseconds, to make the data available in the main table).
Normal reads and writes to the main table continue as usual, now with the new data.
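The steps above can be sketched as the SQL an ETL task would run, generated here from Python. The table names and the batch_id partition key are hypothetical, and the identifiers are interpolated directly, so this assumes they come from trusted code rather than user input.

```python
def create_staging_sql(main_table: str, staging_table: str) -> str:
    """SQL for a standalone table with the same columns as the partitioned parent."""
    return (
        f"CREATE TABLE {staging_table} "
        f"(LIKE {main_table} INCLUDING DEFAULTS INCLUDING CONSTRAINTS);"
    )


def attach_sql(main_table: str, staging_table: str, batch_id: str) -> list:
    """SQL to run after the slow bulk load into the staging table has finished."""
    return [
        # A matching CHECK constraint lets PostgreSQL skip the validation scan
        # when the table is attached.
        f"ALTER TABLE {staging_table} ADD CONSTRAINT {staging_table}_check "
        f"CHECK (batch_id = '{batch_id}');",
        # Makes the loaded rows visible through the main table.
        f"ALTER TABLE {main_table} ATTACH PARTITION {staging_table} "
        f"FOR VALUES IN ('{batch_id}');",
    ]


if __name__ == "__main__":
    # Assumes the parent was created once with something like:
    #   CREATE TABLE events (...) PARTITION BY LIST (batch_id);
    print(create_staging_sql("events", "events_b42"))
    for statement in attach_sql("events", "events_b42", "b42"):
        print(statement)
```

Each statement can be executed with any PostgreSQL driver. The bulk load into the staging table happens between the two functions and does not touch the main table.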
If you like the answer, please vote it up.
Best Regards,
WY
A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on the server and the connecting application.
Cloud Composer has no limitations when it comes to your ability to interface with CloudSQL. Therefore, either of the first 2 options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.
I am working with Google BigQuery for the first time on a client project and have created packages in SSIS to insert data into tables (an odd combination but one required by my client), using an SSIS plugin (CData).
I am looking to insert around 100k rows into a BigQuery table, however, when I look to do further update queries on this table, these cannot be performed because the data is still in the buffer. How does one know how long this will take in BigQuery and are there ways to speed up the process?
It doesn't matter if the data is still in the buffer. If you query the table, the data in the buffer will be included too. Just one of the many awesome things about BigQuery.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert
A record that arrives in the streaming buffer will remain there for
some minimum amount of time (minutes). During this period while the
record is buffered, it's possible that you may issue a query that will
reference the table. The Instant Availability Reader allows workers
from the query engine to read the buffered records prior to being
committed to managed storage.
data is still in the buffer. How does one know how long this will take in BigQuery?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table.
Data can take up to 90 minutes to become available for copy and export operations. See more in documentation
Meanwhile, keep in mind that tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. So, as stated above, this can take up to 90 minutes.
are there ways to speed up the process?
The only way in your case is to load the data instead of streaming it. As I understand your case, the data is in MS SQL, so you could make your SSIS package batch-aware and load it batch by batch through Cloud Storage.
I am now trying to use Google Cloud SQL as a solution for my project. However, I found no way to scale writes there. It uses InnoDB, so I can't use the features of NDB Cluster. I wanted to do some sharding, but I am unable to find any information about this. Are there any ways to scale writes to Google Cloud SQL?
A single INSERT statement with 100 rows will run 10 times as fast as 100 single-row INSERTs.
LOAD DATA is probably the fastest way to do inserts.
In some situations it can be faster to use a staging table - put the rows to insert in it, then copy to the 'real' table(s).
Please give more details about the data flow, the table schema, and scale (eg, rows/second).
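To illustrate the multi-row INSERT point, here is a small sketch that sends one INSERT per 100 rows. It uses Python's built-in sqlite3 purely as a stand-in so that it runs anywhere. The items table and the batch size are made up, and a MySQL driver would typically use %s placeholders instead of ?.

```python
import sqlite3


def insert_rows(conn, rows, batch_size=100):
    """Insert rows using one multi-row INSERT per batch instead of one per row."""
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One "(?, ?)" group per row in this batch.
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        flat_values = [value for row in batch for value in row]
        conn.execute(
            f"INSERT INTO items (id, name) VALUES {placeholders}", flat_values
        )
        statements += 1
    conn.commit()  # a single commit for the whole load
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
rows = [(i, f"item-{i}") for i in range(250)]
statements = insert_rows(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(statements, count)
```

Here 250 rows go in with 3 statements rather than 250. The actual speedup depends on the server, the network round-trip time, and the row size.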
I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: a single request contains 10,000 to 100,000 data sets of information
(Each data set needs to be inserted separately as a row in the database)
I may get multiple requests on this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me, LOAD DATA or BATCH INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting it into the database straight away, but yes, if this approach is not feasible we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Here are the answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
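A minimal sketch of the in-memory queue idea. The answer describes a Java app with a ConcurrentLinkedQueue; this is a Python analogue using queue.Queue and worker threads, and the "processing" step is just a placeholder that collects records.

```python
import queue
import threading


def run_pipeline(batches, workers=2, max_pending=10_000):
    """Feed records from incoming batches to worker threads through a bounded queue."""
    pending = queue.Queue(maxsize=max_pending)  # bounded: producers block when full
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            record = pending.get()
            if record is None:  # sentinel: no more work for this worker
                break
            with lock:
                processed.append(record)  # placeholder for real processing

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for batch in batches:  # each batch corresponds to one incoming "request"
        for record in batch:
            pending.put(record)
    for _ in threads:  # one sentinel per worker
        pending.put(None)
    for thread in threads:
        thread.join()
    return processed


result = run_pipeline([[1, 2, 3], [4, 5]])
print(sorted(result))
```

Unlike a database table or a message broker, an in-memory queue loses its contents if the process dies, so this only fits when that risk is acceptable.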
I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an InnoDB table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (2.5 GB in size) to an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow; it feels like 1000-1500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source telling how many records one insert statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the inserts help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the insert ... values ... syntax to insert multiple rows in a single request, your query size is limited by the max_allowed_packet value rather than by the number of rows.
Concerning keys: it's a good practice to define keys before any data manipulations. Actually, when you build a model you must think of keys, relations, indexes etc.
It's better to define indexes before you insert data as well. CREATE INDEX works quite slowly on huge datasets. But postponing index creation is not a huge disadvantage.
To make your inserts faster, try turning autocommit mode off (so that the inserts run inside a single transaction that you commit at the end) and do not run concurrent requests on your tables.
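Since the statement size is bounded by max_allowed_packet rather than by row count, a conversion script can pack rows into each INSERT until a byte budget is reached. The following is a sketch only. The 1,000,000-byte default is an arbitrary placeholder to be set below the server's actual max_allowed_packet, and the escaping is deliberately naive, assuming trusted input and MySQL's default backslash-escape mode.

```python
def sql_literal(value):
    """Very naive quoting: None becomes NULL, everything else a quoted string."""
    if value is None:
        return "NULL"
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return f"'{text}'"


def batched_inserts(table, columns, rows, max_bytes=1_000_000):
    """Yield multi-row INSERT statements, each roughly within max_bytes."""
    head = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    head_size = len(head.encode("utf-8"))
    current, size = [], head_size
    for row in rows:
        tuple_sql = "(" + ", ".join(sql_literal(v) for v in row) + ")"
        tuple_size = len(tuple_sql.encode("utf-8")) + 2  # +2 for the ", " separator
        if current and size + tuple_size > max_bytes:
            # Budget reached: emit the statement built so far and start a new one.
            yield head + ", ".join(current) + ";"
            current, size = [], head_size
        current.append(tuple_sql)
        size += tuple_size
    if current:
        yield head + ", ".join(current) + ";"


# Tiny budget to force a split; real use would leave headroom below the server limit.
sample = [("1", "a"), ("2", "b"), ("3", "c")]
for statement in batched_inserts("t", ["id", "name"], sample, max_bytes=60):
    print(statement)
```

Rows can be fed straight from csv.reader, so the 2.5 GB file never has to be held in memory.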