I am working with Google BigQuery for the first time on a client project and have created packages in SSIS to insert data into tables (an odd combination but one required by my client), using an SSIS plugin (CData).
I am looking to insert around 100k rows into a BigQuery table; however, when I then try to run further update queries on this table, they cannot be performed because the data is still in the streaming buffer. How does one know how long this will take in BigQuery, and are there ways to speed up the process?
It doesn't matter if the data is still in the buffer. If you query the table, the data in the buffer will be included too. Just one of the many awesome things about BigQuery.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert
A record that arrives in the streaming buffer will remain there for some minimum amount of time (minutes). During this period while the record is buffered, it's possible that you may issue a query that will reference the table. The Instant Availability Reader allows workers from the query engine to read the buffered records prior to being committed to managed storage.
data is still in the buffer. How does one know how long this will take in BigQuery?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table.
Data can take up to 90 minutes to become available for copy and export operations. See more in the documentation.
Meanwhile, keep in mind that tables that have been written to recently via BigQuery streaming (tabledata.insertAll) cannot be modified using UPDATE or DELETE statements. So, as stated above: up to 90 minutes.
are there ways to speed up the process?
The only way to avoid this in your case is to load the data instead of streaming it. As I understand your case, the data is in MS SQL Server, so you could potentially make your SSIS package batch-aware and load it batch by batch through Cloud Storage.
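For illustration, here is a minimal sketch of loading one exported batch with a load job instead of streaming, using the Python BigQuery client library; the table ID and file name are placeholders, and rows loaded this way are not subject to the streaming-buffer restriction described above.

```python
# Minimal sketch: load one exported CSV batch into BigQuery as a load job
# instead of streaming it. The table ID and file name are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"   # placeholder target table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                       # skip the CSV header row
    autodetect=True,                           # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("batch_0001.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()   # block until the load job completes
```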
Related
To put it in context, I have a bucket where I store CSV files and a function that loads that data into a database whenever a new CSV is uploaded to the bucket.
I tried to upload 100 CSVs at the same time: 581,100 records in total (70 MB).
All of those files appear in my bucket and a new table is created.
But when I do a SELECT COUNT I only find 267,306 records (46% of the total).
I tried it again with a different bucket, function, and table, uploading another 100 files: 4,779,100 records this time (312 MB).
When I check the table in BigQuery I see that only 2,293,920 records (47.9%) of the rows that should exist are there.
So my question is: is there a way to upload all the CSVs I want without losing data, or does GCP have some restriction on this kind of task?
Thank you.
As pointed out in your last comment:
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too many table update operations for this table
This error shows that you have reached the limit for maximum rate of table metadata update operations per table for Standard tables, according to the documentation. You can review the limits that may apply here. Note that this quota cannot be increased.
In the diagnosis section, it says:
Metadata table updates can originate from API calls that modify a table's metadata or from jobs that modify a table's content.
As a resolution, you can do the following:
Reduce the update rate for the table metadata.
Add a delay between jobs or table operations to make sure that the update rate is within the limit.
For data inserts or modification, consider using DML operations. DML operations are not affected by the Maximum rate of table metadata update operations per table rate limit.
DML operations have other limits and quotas. For more information, see Using data manipulation language (DML).
If you frequently load data from multiple small files stored in Cloud Storage using a job per file, then combine multiple load jobs into a single job. You can load from multiple Cloud Storage URIs with a comma-separated list (for example, gs://my_path/file_1,gs://my_path/file_2), or by using wildcards (for example, gs://my_path/*); a minimal sketch of the wildcard variant follows this list.
For more information, see Batch loading data.
If you use single-row queries (that is, INSERT statements) to write data to a table, consider batching multiple queries into one to reduce the number of jobs. BigQuery doesn't perform well when used as a relational database, so single-row INSERT statements executed at high speed are not a recommended best practice.
If you intend to insert data at a high rate, consider using BigQuery Storage Write API. It is a recommended solution for high-performance data ingestion. The BigQuery Storage Write API has robust features, including exactly-once delivery semantics. To learn about limits and quotas, see Storage Write API and to see costs of using this API, see BigQuery data ingestion pricing.
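As a rough sketch of the wildcard approach with the Python client library (the bucket path and table ID are placeholders): a single load job over many files modifies the table's content once, instead of once per file.

```python
# Minimal sketch: one load job for many Cloud Storage CSV files via a wildcard
# URI, instead of one job per file. Bucket path and table ID are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"
uri = "gs://my_bucket/my_path/*.csv"           # matches every CSV under my_path/

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()   # one job, one table-content update
```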
I have a large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?
The constraints are:
The table needs to remain readable while the job is running
The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run pd.to_sql
Load the data into a CSV in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load the data in memory, break it into chunks, and run a multi-threaded process in which each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits
My recent Airflow-related ETL project could be a reference for you.
Input DB: large DB (billion-row-level Oracle)
Interim store: medium store (tens-of-millions-level HDF5 files)
Output DB: medium DB (tens-of-millions-level MySQL)
In my experience, writing to the database is the main bottleneck in such an ETL process, so as you can see:
For the interim stage, I use HDF5 as the interim store/file for data transformation. The pandas to_hdf function provides seconds-level performance for large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes.
Below is the performance benchmarking for pandas IO; the HDF5 format is among the top three fastest and is a popular format: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
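As a small sketch of that interim stage (the file name and DataFrame are placeholders; to_hdf needs the PyTables package installed):

```python
# Minimal sketch: HDF5 as the interim store between transform steps.
# Requires the PyTables ("tables") package; names are placeholders.
import pandas as pd

df = pd.DataFrame({"id": range(1_000_000), "value": "x"})   # stand-in for extracted data

# Write the interim result; format="table" keeps it appendable/queryable.
df.to_hdf("interim_store.h5", key="stage1", mode="w", format="table")

# Read it back in the next transform step.
df_stage1 = pd.read_hdf("interim_store.h5", key="stage1")
```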
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns. Without that manual mapping, to_sql maps strings to BLOB or VARCHAR(1000), and the default mode is about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the database via to_sql (chunksize mode) took about 20 minutes.
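Here is a rough sketch of that manual type mapping with to_sql; the connection string, table, and column names are placeholders, and the dtype mapping is applied when to_sql creates the table.

```python
# Minimal sketch: chunked to_sql with an explicit column-type mapping, so the
# table is created with sized VARCHAR columns instead of the slow defaults.
# Connection string, table, and column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine, types

engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")

df = pd.DataFrame({"user_name": ["alice", "bob"], "score": [1.5, 2.0]})

dtype_map = {
    "user_name": types.VARCHAR(64),  # explicit length instead of a generic text/blob type
    "score": types.FLOAT(),
}

df.to_sql(
    "output_table",
    engine,
    if_exists="replace",   # dtype is applied when to_sql creates the table
    index=False,
    chunksize=10_000,      # write in chunks rather than one huge statement
    dtype=dtype_map,
)
```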
If you like the answer, please vote it up.
One idea for your reference, based on PostgreSQL partitioned tables, though it needs some DDL to define the partitioned table.
Currently, your main constraints are:
the table needs to remain readable while the job is running
This means no locking is allowed.
the table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time
It should support multiple writers at the same time.
I would add one thing you may want to consider as well:
Reasonable read performance while writing.
Performance and user experience are key.
A partitioned table can meet all of these requirements, and it is transparent to the client application.
At present you are doing ETL; you will soon face performance issues as the table size grows quickly. The partitioned table is the only solution.
The main steps are:
Create a partitioned table with a partition list.
Normal reads and writes to the table run as usual.
ETL process (can run in parallel):
- ETL the data and upload it to a new table (very slow, minutes to hours, but with no impact on the main table).
- Attach the new table to the main table's partition list (super fast, microsecond-level, to make it live in the main table).
Normal reads and writes to the main table continue as usual, now including the new data. A rough sketch of this flow follows.
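The sketch below assumes Cloud SQL for PostgreSQL (10+) with declarative list partitioning; the connection details, table names, and partition key are all placeholders.

```python
# Rough sketch of the partition-swap flow above for PostgreSQL 10+ declarative
# partitioning. Connection details, table, and partition names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=etl password=secret host=127.0.0.1")
cur = conn.cursor()

# One-time setup: the main table is list-partitioned on the batch date.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id        bigint,
        batch_day date,
        payload   text
    ) PARTITION BY LIST (batch_day);
""")

# ETL step: load into a standalone staging table (slow, but the main table is untouched).
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_2024_01_01
        (LIKE events INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
""")
cur.execute("INSERT INTO events_2024_01_01 VALUES (1, '2024-01-01', 'example row');")

# Attach step: make the staging table a partition of the main table (very fast).
cur.execute("""
    ALTER TABLE events
        ATTACH PARTITION events_2024_01_01
        FOR VALUES IN ('2024-01-01');
""")

conn.commit()
cur.close()
conn.close()
```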
If you like the answer, please vote it up.
Best Regards,
WY
A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on the server and on the connecting application.
Cloud Composer has no limitations when it comes to your ability to interface with CloudSQL. Therefore, either of the first 2 options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.
I have a problem with high-frequency inserts in MySQL. I've searched a lot on the Internet but haven't found a good answer to my problem.
I need to log a lot of events at a very high frequency (~3,000 inserts/s => 260 million rows per day). These events are stored in an InnoDB table like this:
log_events :
- id_user : BIGINT
- id_event : SMALLINT
- date : INT
- data : BIGINT (data associated to this event)
My problems are:
- How to speed up inserts? Events are sent by thousands of visitors and we are not able to bulk insert.
- How to limit write IO? We are on 6 x 600 GB SSD drives and have write IO problems.
Do you have any ideas for this kind of problem?
Thanks
François
Do you have any foreign keys on that table? If so, I would consider removing them and adding indexes only on the columns that are used for reads. This should improve writes.
The second idea is to use an in-memory store (e.g. Redis or memcached) as a queue, with a worker that pulls data from it and bulk-inserts it into MySQL (for example, every 2 seconds); see the sketch below.
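As a rough sketch of that queue-and-worker idea (assuming Redis plus the redis and mysql-connector-python packages; the connection details and queue name are placeholders, and the insert targets the log_events table described in the question):

```python
# Minimal sketch of the queue + bulk-insert worker idea.
# Assumes Redis plus the redis and mysql-connector-python packages;
# connection details are placeholders. Schema matches log_events above.
import json
import time

import mysql.connector
import redis

r = redis.Redis(host="localhost", port=6379)
db = mysql.connector.connect(host="localhost", user="app", password="secret", database="logs")

BATCH_WINDOW_SECONDS = 2
INSERT_SQL = (
    "INSERT INTO log_events (id_user, id_event, date, data) "
    "VALUES (%s, %s, %s, %s)"
)

while True:
    # Drain whatever accumulated in the Redis list since the last pass.
    rows = []
    while True:
        raw = r.lpop("log_events_queue")
        if raw is None:
            break
        e = json.loads(raw)
        rows.append((e["id_user"], e["id_event"], e["date"], e["data"]))

    if rows:
        cur = db.cursor()
        cur.executemany(INSERT_SQL, rows)  # one bulk insert instead of thousands of single ones
        db.commit()
        cur.close()

    time.sleep(BATCH_WINDOW_SECONDS)
```

On the web side, each visitor event would simply be pushed with something like r.rpush("log_events_queue", json.dumps(event)), which avoids a database round trip per event.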
Another option, if you don't need frequent reads, is to use the ARCHIVE storage engine instead of InnoDB: http://dev.mysql.com/doc/refman/5.5/en/archive-storage-engine.html. But it looks like that's not an option for you, since ARCHIVE has no indexes at all (which means full-table-scan reads).
Another option is to reorganize your DB structure, e.g. use partitioning (http://dev.mysql.com/doc/refman/5.5/en/partitioning.html). But it depends on what your SELECTs look like.
My additional questions are:
could you show the whole table definition?
which fields are used for reads? could you show them?
do you need all the data for your reads, or only recent data? If so, how recent does the data need to be? (e.g. only from the last day/week/month/year)
id_event is an event type, right? Is the number of possible events static, or could it change in the future?
Event are send by thousands of visitors and we are not able to bulk insert
You need to either bulk insert or shard the data. I would be tempted to try the bulk insert route first.
That you think you can't suggests these events are being created by autonomous processes - you just need to funnel them through an intermediary rather than direct to the database. And it would be easiest to implement that funnel as an event based server (rather than a threaded or forking server).
You don't say what the events are nor where they originate - which has some impact on the details of implementing a solution.
Both rsyslog and syslog-ng will talk to a MySQL backend - hence you can eliminate the overhead of establishing a new connection per message - but I don't know whether either implements buffering / bulk inserts. It would certainly be possible to tail the files they produce with a single process and create bulk inserts from there.
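Purely as an illustration of that tail-and-bulk-insert idea (the log path, line format, table, and credentials are all made up; it assumes each event is written as one complete line):

```python
# Rough sketch: tail a log file with a single process and turn appended lines
# into bulk inserts. Path, line format, table, and credentials are placeholders.
import time

import mysql.connector

db = mysql.connector.connect(host="localhost", user="app", password="secret", database="logs")
INSERT_SQL = "INSERT INTO log_events (id_user, id_event, date, data) VALUES (%s, %s, %s, %s)"

batch = []
with open("/var/log/events.log") as f:
    f.seek(0, 2)                                  # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if line:
            id_user, id_event, date, data = line.split(",")
            batch.append((int(id_user), int(id_event), int(date), int(data)))
        if len(batch) >= 1000 or (not line and batch):
            cur = db.cursor()
            cur.executemany(INSERT_SQL, batch)    # one multi-row round trip per batch
            db.commit()
            cur.close()
            batch = []
        if not line:
            time.sleep(0.5)                       # nothing new yet; poll again shortly
```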
It would be relatively simple to write a funnel using an event-based server and a buffering tool, along with a bit of code to implement async mysqli calls and a watchdog. Or you could use node.js with an async MySQL library. There are also tools like statsd (again using node.js) which can perform some aggregation on the data.
Or you could just write something from scratch.
A write-only database is a useless piece of hardware though. You've not provided any details of how this data will be used - which has some relevance to designing a solution. Also, since ideally the data feed would be a single process / DB session, it might be a better idea to use MyISAM rather than InnoDB (I see in your later comment you said you had problems with MyISAM - presumably this was with multiple clients).
I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: a single request contains 10,000 to 100,000 data sets of information
(Each data set needs to be inserted separately as a row in the database)
I may get multiple requests to this application concurrently, so it's necessary for me to make the inserts fast.
I am using MySQL database, Which approach is better for me, LOAD DATA or BATCH INSERT or is there a better way than these two?
How will your application retrieve this information?
- There will be another background, thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting the data into the database straight away, but yes, if this approach is not feasible enough we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Here are some answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
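A rough sketch of that combination, assuming each incoming batch has already been written as a CSV on the MySQL server's own file system (the path, table, and credentials are placeholders):

```python
# Rough sketch: bulk-load one incoming batch ("request") with LOAD DATA INFILE.
# The CSV must sit on the MySQL server's own file system (e.g. under
# secure_file_priv); path, table, and credentials are placeholders.
import mysql.connector

db = mysql.connector.connect(host="localhost", user="app", password="secret", database="queue_db")
cur = db.cursor()

cur.execute("""
    LOAD DATA INFILE '/var/lib/mysql-files/request_0001.csv'
    INTO TABLE incoming_rows
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")

db.commit()
cur.close()
db.close()
```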
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with live data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
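The answer describes a Java app with a ConcurrentLinkedQueue; purely as an illustration of the same in-RAM queue idea, here is an analogous sketch using Python's standard library (process_record is a placeholder for whatever the per-record processing is):

```python
# Analogous sketch of the in-RAM queue idea in Python (the answer suggests a
# Java ConcurrentLinkedQueue); process_record is a hypothetical placeholder.
import queue
import threading

work_queue = queue.Queue()          # thread-safe in-RAM queue

def process_record(record):
    pass                             # placeholder for the real per-record work

def loader(batch):
    """Deposit each record of an incoming batch into the queue."""
    for record in batch:
        work_queue.put(record)

def worker():
    """Take records off the queue one at a time and process them."""
    while True:
        record = work_queue.get()
        process_record(record)
        work_queue.task_done()

for _ in range(4):                   # a few worker threads draining the queue
    threading.Thread(target=worker, daemon=True).start()

loader([{"row": i} for i in range(10)])   # example batch
work_queue.join()                          # wait until all deposited records are processed
```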
I am in the process of setting up a MySQL server to store some data, but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although each row is very small: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server); can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I could load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if that helps), but nothing is finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before, but was looking for a MySQL solution first; it seems more widely used and easier to get help with in case I have problems.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use the MEMORY storage engine for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
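As a rough sketch of that last suggestion, applied to the per-day MyISAM table idea above (the table name, file path, and credentials are placeholders; DISABLE KEYS suspends non-unique index maintenance on MyISAM tables):

```python
# Rough sketch of "disable indexing, load, re-enable indexing" for a daily
# MyISAM table. Table name, file path, and credentials are placeholders.
import mysql.connector

db = mysql.connector.connect(host="localhost", user="loader", password="secret", database="daily")
cur = db.cursor()

cur.execute("ALTER TABLE events_20240101 DISABLE KEYS")   # stop maintaining non-unique indexes

cur.execute("""
    LOAD DATA INFILE '/var/lib/mysql-files/events_20240101.csv'
    INTO TABLE events_20240101
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
""")

cur.execute("ALTER TABLE events_20240101 ENABLE KEYS")     # rebuild the indexes in one pass
db.commit()
cur.close()
db.close()
```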