Update large amount of data in SQL database via Airflow

I have a large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?
The constraints are:
The table must remain readable while the job is running
The table must remain writable in case a job overruns and two jobs end up running at the same time
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run DataFrame.to_sql
Load the data into a CSV file in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load the data into memory, break it into chunks, and run multiple threads that each update the table chunk by chunk, using a shared connection pool so the connection limit is not exhausted

My recent Airflow-related ETL project could be a reference for you.
Input DB: large DB (Oracle, billions of rows)
Interim DB: medium DB (HDF5 file, tens of millions of rows)
Output DB: medium DB (MySQL, tens of millions of rows)
In my experience, writing to the database is the main bottleneck in this kind of ETL process, which is why the pipeline is laid out as above.
For the interim stage, I use an HDF5 file as the interim store for data transformation. The pandas to_hdf function is fast on large data: in my case, writing 20 million rows to HDF5 took less than 3 minutes.
The pandas documentation includes an IO performance benchmark; HDF5 is among the three fastest formats there and is widely used. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map each column to a database column type and length, especially for string (varchar) columns. Without a manual mapping, to_sql maps them to a blob type or varchar(1000). In my case the default mode was about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the database via to_sql (with chunksize) took about 20 minutes.
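A minimal sketch of the manual type mapping described above. It assumes pandas is installed and uses an in-memory SQLite connection as a stand-in for the MySQL target so that it runs without a server; the table name `events`, the columns, and the `VARCHAR(32)` length are hypothetical. With a SQLAlchemy engine for MySQL, the `dtype` values would be SQLAlchemy types rather than strings.

```python
import sqlite3

import pandas as pd


def write_frame(conn, df, table="events", chunksize=1000):
    """Write df with explicit column types instead of pandas' inferred defaults."""
    # Hypothetical mapping: size the string column explicitly so it does not
    # fall back to a generic text/blob type.
    column_types = {"user_id": "INTEGER", "name": "VARCHAR(32)"}
    df.to_sql(
        table,
        conn,
        if_exists="append",
        index=False,
        dtype=column_types,
        chunksize=chunksize,  # rows written per batch
    )


conn = sqlite3.connect(":memory:")
frame = pd.DataFrame(
    {"user_id": range(5000), "name": ["u%d" % i for i in range(5000)]}
)
write_frame(conn, frame)
```

The chunk size is worth tuning against the target database; 1000 here is only a placeholder.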

Here is one idea for your reference, based on PostgreSQL partitioned tables. It requires some DDL to define the partitioned table.
Currently, your main constraints are:
The table must remain readable while the job is running
This means no locks that block readers are allowed.
The table must remain writable in case a job overruns and two jobs end up running at the same time
It must support multiple writers at the same time.
I would add one more thing for you to consider:
Reasonable read performance while writing.
** Performance and user experience are key
A partitioned table can meet all of these requirements, and it is transparent to client applications.
At present you are doing ETL, and you will soon face performance issues as the table grows quickly. In my view a partitioned table is the only solution.
The main steps are:
Create a partitioned table using list partitioning.
Normal reads and writes against the table continue as usual.
ETL process (can run in parallel):
-. Run the ETL and load the data into a new table. (Very slow, minutes to hours, but with no impact on the main table.)
-. Add the new table to the main table's partition list. (Super fast, on the order of microseconds, to make the data available in the main table.)
Normal reads and writes against the main table continue as usual, now with the new data.
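The steps above can be sketched as a small helper that builds the PostgreSQL statements for the load-then-attach pattern. This is only an illustration: the table names and partition value are hypothetical, identifiers are assumed to come from trusted code (they are not escaped), and the actual data load between the two statements is left out.

```python
def partition_swap_statements(main_table, staging_table, key_value):
    """Return PostgreSQL statements for loading a new table and attaching it.

    The main table is assumed to already exist and to be declared with
    PARTITION BY LIST on the column that key_value belongs to.
    """
    literal = "'" + str(key_value).replace("'", "''") + "'"
    return [
        # 1. A standalone table with the same columns; the slow ETL load
        #    targets this table, so the main table is untouched meanwhile.
        f"CREATE TABLE {staging_table} (LIKE {main_table} INCLUDING DEFAULTS);",
        # 2. After the load finishes, make the new rows visible in the main table.
        f"ALTER TABLE {main_table} ATTACH PARTITION {staging_table} "
        f"FOR VALUES IN ({literal});",
    ]


statements = partition_swap_statements("events", "events_2024_01_01", "2024-01-01")
for statement in statements:
    print(statement)
```

In an Airflow DAG, each statement would typically be its own task, with the load task in between.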
Best Regards,
WY

A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on both the server and the connecting application.
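As a rough illustration of bounded connection usage, here is a minimal pool sketch. It is not a Cloud SQL or Airflow API: `ConnectionPool` is a hypothetical helper, and SQLite stands in for the real database driver so the example runs on its own.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hands out at most `size` connections; callers wait until one is free."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks if all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


pool = ConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False), size=2
)
with pool.connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
```

Because the pool size is fixed, worker threads that share it can never open more than `size` connections, no matter how many chunks are in flight.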
Cloud Composer has no limitations when it comes to your ability to interface with CloudSQL. Therefore, either of the first 2 options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.
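A small sketch of the CSV route: write the rows to a file on the client and build the matching MySQL statement. The table and column names are hypothetical, the `REPLACE` keyword is an assumption chosen because the job updates existing rows (it overwrites rows that share a primary or unique key), and the client connection that runs the statement must have local-infile enabled.

```python
import csv
import os
import tempfile


def write_csv(rows, header):
    """Write rows to a temporary CSV file on the client and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
    return path


def load_data_statement(path, table, columns):
    """Build a MySQL LOAD DATA LOCAL INFILE statement for the CSV above."""
    return (
        f"LOAD DATA LOCAL INFILE '{path}' REPLACE INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' IGNORE 1 LINES "
        f"({', '.join(columns)})"
    )


csv_path = write_csv([(1, "a"), (2, "b")], ["id", "label"])
statement = load_data_statement(csv_path, "my_table", ["id", "label"])
```

Note that `LOCAL` reads the file from the machine running the client, so a file kept in Cloud Storage would first have to be downloaded to the worker.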

Related

How to do a one-time load for 4 billion records from MySQL to SQL Server

We have a need to do the initial data copy on a table that has 4+ billion records to target SQL Server (2014) from source MySQL (5.5). The table in question is pretty wide with 55 columns, however none of them are LOB. I'm looking for options for copying this data in the most efficient way possible.
We've tried loading via Attunity Replicate (which has worked wonderfully for tables not this large) but if the initial data copy with Attunity Replicate fails then it starts over from scratch ... losing whatever time was spent copying the data. With patching and the possibility of this table taking 3+ months to load Attunity wasn't the solution.
We've also tried smaller batch loads with a linked server. This is working but doesn't seem efficient at all.
Once the data is copied we will be using Attunity Replicate to handle CDC.

For something like this I think SSIS would be the simplest option. It's designed for large inserts, as big as 1TB. In fact, I'd recommend this MSDN article: We loaded 1TB in 30 Minutes and so can you.
Doing simple things like dropping indexes and performing other optimizations like partitioning would make your load faster. While 30 minutes isn't a feasible time to shoot for, it would be a very straightforward task to have an SSIS package run outside of business hours.
My business doesn't have a load on the scale you do, but we do refresh our databases of more than 100M nightly which doesn't take more than 45 minutes, even with it being poorly optimized.

One of the most efficient ways to load a huge amount of data is to read it in chunks.
I have answered many similar questions for SQLite, Oracle, Db2 and MySQL. You can refer to one of them for more information on how to do that using SSIS:
Reading Huge volume of data from Sqlite to SQL Server fails at pre-execute (SQLite)
SSIS failing to save packages and reboots Visual Studio (Oracle)
Optimizing SSIS package for millions of rows with Order by / sort in SQL command and Merge Join (MySQL)
Getting top n to n rows from db2 (DB2)
Beyond that, there are many other suggestions, such as dropping the indexes on the destination table and recreating them after the insert, creating the needed indexes on the source table, and using the fast-load option to insert the data.
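Outside SSIS, the same chunked-read idea can be sketched in a few lines. This example pages on an indexed key so that a failed run can resume from the last key it processed; SQLite and the table `src` are stand-ins chosen only so the sketch is self-contained.

```python
import sqlite3


def read_in_chunks(conn, chunk_size, start_after=0):
    """Yield lists of (id, payload) rows from the hypothetical table `src`."""
    last_id = start_after
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM src WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume point for the next chunk or a restart


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO src VALUES (?, ?)", [(i, f"row{i}") for i in range(1, 11)]
)
chunk_sizes = [len(chunk) for chunk in read_in_chunks(conn, chunk_size=4)]
```

Persisting `last_id` after each chunk is what makes the copy restartable instead of starting over from scratch.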

SSIS to insert non-matching data on non-linked server

This is regarding SQL Server 2008 R2 and SSIS.
I need to update dozens of history tables on one server with new data from production tables on another server.
The two servers are not, and will not be, linked.
Some of the history tables have 100's of millions of rows and some of the production tables have dozens of millions of rows.
I currently have a process in place for each table that uses the following data flow components:
OLEDB Source task to pull the appropriate production data.
Lookup task to check whether the production data's key already exists in the history table, using the "Redirect to error output" option for keys that are not found.
OLEDB Destination to transfer the missing data to the history table.
The process is too slow for the large tables. There has to be a better way. Can someone help?
I know if the servers were linked a single set based query could accomplish the task easily and efficiently, but the servers are not linked.

Segment your problem into smaller problems. That's the only way you're going to solve this.
Let's examine the problems.
You're inserting and/or updating existing data. At a database level, rows are packed into pages. Rarely is it an exact fit and there's usually some amount of free space left in a page. When you update a row, pretend the Name field went from "bob" to "Robert Michael Stuckenschneider III". That row needs more room to live and while there's some room left on the page, there's not enough. Other rows might get shuffled down to the next page just to give this one some elbow room. That's going to cause lots of disk activity. Yes, it's inevitable given that you are adding more data but it's important to understand how your data is going to grow and ensure your database itself is ready for that growth. Maybe, you have some non-clustered indexes on a target table. Disabling/dropping them should improve insert/update performance. If you still have your database and log set to grow at 10% or 1MB or whatever the default values are, the storage engine is going to spend all of its time trying to grow files and won't have time to actually write data. Take away: ensure your system is poised to receive lots of data. Work with your DBA, LAN and SAN team(s)
You have tens of millions of rows in your OLTP system and hundreds of millions in your archive system. Starting with the OLTP data, you need to identify what does not exist in your historical system. Given your data volumes, I would plan for this package to hit a hiccup in processing, so it needs to be "restartable." I would have a package with a data flow that selects only the business keys from the OLTP system and matches them against the target table. Write the unmatched keys into a table that lives on the OLTP server (ToBeTransfered). Have a second package that uses a subset of those keys (N rows) joined back to the original table as the Source. It's wired directly to the Destination, so no lookup is required. That fat data row flows over the network only one time. Then have an Execute SQL Task go in and delete the batch you just sent to the Archive server. This batching method also allows you to run the second package on multiple servers. The SSIS team describes it better in their paper: We loaded 1TB in 30 minutes
Ensure the Lookup is a query of the form SELECT key1, key2 FROM MyTable. Better yet, can you provide a filter to the lookup? For example, WHERE ProcessingYear = 2013, as there's no need to waste cache on 2012 if the OLTP only contains 2013 data.
You might need to modify your PacketSize on your Connection Manager and have a network person set up Jumbo frames.
Look at your queries. Are you getting good plans? Are your tables over-indexed? Remember, each index is going to result in an increase in the number of writes performed. If you can dump them and recreate after the processing is completed, you'll think your SAN admins bought you some FusionIO drives. I know I did when I dropped 14 NC indexes from a billion row table that only had 10 total columns.
If you're still having performance issues, establish a theoretical baseline (under ideal conditions that will never occur in the real world, I can push 1GB from A to B in N units of time) and work your way from there to what your actual is. You must have a limiting factor (IO, CPU, Memory or Network). Find the culprit and throw more money at it or restructure the solution until it's no longer the lagging metric.
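The restartable key-staging approach described above can be sketched outside SSIS as follows. Two SQLite connections stand in for the unlinked OLTP and archive servers; the `production` and `history` tables and their columns are hypothetical, and loading all archive keys into a Python set is a simplification that would not scale to the volumes in the question.

```python
import sqlite3


def stage_missing_keys(oltp, archive):
    """Record in ToBeTransfered the OLTP keys that the archive lacks."""
    existing = {row[0] for row in archive.execute("SELECT id FROM history")}
    missing = [
        (row[0],)
        for row in oltp.execute("SELECT id FROM production")
        if row[0] not in existing
    ]
    oltp.execute("CREATE TABLE IF NOT EXISTS ToBeTransfered (id INTEGER PRIMARY KEY)")
    oltp.executemany("INSERT OR IGNORE INTO ToBeTransfered VALUES (?)", missing)
    oltp.commit()


def transfer_batch(oltp, archive, batch_size):
    """Send one batch of full rows, then delete its keys; returns rows sent."""
    rows = oltp.execute(
        "SELECT p.id, p.payload FROM production p "
        "JOIN (SELECT id FROM ToBeTransfered ORDER BY id LIMIT ?) k ON k.id = p.id",
        (batch_size,),
    ).fetchall()
    if rows:
        # OR IGNORE keeps a re-run harmless if a crash hit before the delete.
        archive.executemany("INSERT OR IGNORE INTO history VALUES (?, ?)", rows)
        archive.commit()
        oltp.executemany(
            "DELETE FROM ToBeTransfered WHERE id = ?", [(r[0],) for r in rows]
        )
        oltp.commit()
    return len(rows)


oltp = sqlite3.connect(":memory:")
archive = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE production (id INTEGER PRIMARY KEY, payload TEXT)")
oltp.executemany(
    "INSERT INTO production VALUES (?, ?)", [(i, f"p{i}") for i in range(1, 8)]
)
archive.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, payload TEXT)")
archive.executemany("INSERT INTO history VALUES (?, ?)", [(1, "p1"), (2, "p2")])

stage_missing_keys(oltp, archive)
sent = []
while True:
    count = transfer_batch(oltp, archive, batch_size=2)
    if count == 0:
        break
    sent.append(count)
```

Since the remaining work is always whatever is left in ToBeTransfered, the loop can be stopped and restarted at any point.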

Step 1. Incremental bulk import of the appropriate production data to the new server.
Ref: Importing Data from a Single Client (or Stream) into a Non-Empty Table
http://msdn.microsoft.com/en-us/library/ms177445(v=sql.105).aspx
Step 2. Use Merge Statement to identify new/existing records and operate on them.
I realize that it will take a significant amount of disk space on the new server, but the process would run faster.
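A minimal sketch of Step 2, assuming the bulk import has already landed in a staging table on the history server. It uses SQLite and a plain INSERT ... SELECT ... WHERE NOT EXISTS as a stand-in for the T-SQL MERGE statement, covering only the "insert non-matching rows" case; the table and column names are hypothetical.

```python
import sqlite3


def insert_missing(conn):
    """Copy staged rows whose key is not yet in history; returns rows inserted."""
    before = conn.total_changes
    conn.execute(
        "INSERT INTO history (id, payload) "
        "SELECT s.id, s.payload FROM staging s "
        "WHERE NOT EXISTS (SELECT 1 FROM history h WHERE h.id = s.id)"
    )
    conn.commit()
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO history VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(2, "b"), (3, "c"), (4, "d")])
conn.commit()
inserted = insert_missing(conn)
```

The point of the staging table is that the comparison becomes a single set-based statement on one server, which is what the unlinked servers otherwise prevent.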

High frequency insert in MySQL

I have a problem with high-frequency inserts in MySQL. I've searched a lot on the Internet but haven't found a good answer to my problem.
I need to log a lot of events at a very high frequency (~3000 inserts/s => 260 million rows per day). These events are stored in an InnoDB table like this:
log_events :
- id_user : BIGINT
- id_event : SMALLINT
- date : INT
- data : BIGINT (data associated to this event)
My problems are:
- How can I speed up inserts? Events are sent by thousands of visitors and we are not able to bulk insert
- How can I limit write IO? We are on 6*600 GB SSD drives and have write IO problems
Do you have any ideas for this kind of problem?
Thanks
François

Do you have any foreign keys on that table? If so, I would consider removing them and adding indexes only on the columns that are used for reads. This should improve writes.
The second idea is to use an in-memory store (e.g. Redis, memcache) as a queue, and have a worker pull data from it and insert it in bulk (for example, every 2 seconds) into MySQL.
Another option, if you don't need frequent reads, is to use the ARCHIVE storage engine instead of InnoDB: http://dev.mysql.com/doc/refman/5.5/en/archive-storage-engine.html. But it looks like it's not an option for you, since it has essentially no index support (which means full-table-scan reads).
Another option is to reorganize your DB structure, e.g. use partitioning (http://dev.mysql.com/doc/refman/5.5/en/partitioning.html). But it depends on what your SELECTs look like.
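The queue-plus-worker idea can be sketched like this. An in-process `queue.Queue` stands in for Redis and SQLite stands in for MySQL, purely so the example is self-contained; `flush_queue` is a hypothetical helper that a worker would call on a timer (for example every 2 seconds).

```python
import queue
import sqlite3


def flush_queue(events, conn, max_batch=1000):
    """Drain up to max_batch queued events and insert them as one batch."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break
    if batch:
        conn.executemany("INSERT INTO log_events VALUES (?, ?, ?, ?)", batch)
        conn.commit()  # one commit per batch instead of one per event
    return len(batch)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_events (id_user INTEGER, id_event INTEGER, date INTEGER, data INTEGER)"
)
events = queue.Queue()
for i in range(2500):
    events.put((i, i % 10, 1700000000 + i, i * 2))

flushed = []
while True:
    written = flush_queue(events, conn)
    if written == 0:
        break
    flushed.append(written)
```

Visitors only ever touch the queue, so the database sees a few large writes per second instead of thousands of single-row inserts.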
My additional questions are:
Could you show the whole table definition?
Which fields are used for reads? Could you show them?
Do you need all the data for your reads, or only recent data? If so, how recent must the data be? (e.g. only from the last day/week/month/year)
id_event is an event type, right? Is the number of possible events static, or could it change in the future?

Events are sent by thousands of visitors and we are not able to bulk insert
You need to either bulk insert or shard the data. I would be tempted to try the bulk insert route first.
That you think you can't suggests these events are being created by autonomous processes - you just need to funnel them through an intermediary rather than direct to the database. And it would be easiest to implement that funnel as an event based server (rather than a threaded or forking server).
You don't say what the events are nor where they originate - which has some impact on the details of implementing a solution.
Both rsyslog and syslog-ng will talk to a MySQL backend, hence you can eliminate the overhead of establishing a new connection per message, but I don't know if either implements buffering / bulk inserts. It would certainly be possible to tail the files they produce with a single process and create bulk inserts from there.
It would be relatively simple to write a funnel using this event based server, this buffer tool along with a bit of code to implement async mysqli calls and a watchdog. Or you could use node.js with an async mysql lib. There are also tools like statsd (again using node.js) which can also perform some aggregation on the data.
Or you could just write something from scratch.
A write-only database is a useless piece of hardware though. You've not provided any details of how this data will be used, which has some relevance to designing a solution. Also, since ideally the data feed would be a single process / DB session, it might be a better idea to use MyISAM rather than InnoDB (I see in your later comment you said you had problems with MyISAM; presumably this was with multiple clients).

Executing SSIS package creates huge no. of temp files which makes me run out of disk space

I have an SSIS package which I run using a SQL job to bulk copy data from one database to another. The destination is our integration server, where we have enough space for the database. But when I run this job (i.e. the package), it creates a huge number of temp files in the localsettings/temp folder: for a 1GB mdf file it creates some 20GB of temp files. I created this package manually and did not use the Import/Export Wizard. Can anyone help me avoid these huge temp files during execution? If any further details are needed, please mention it.
Note: many people have said that this happens if you create a package using the Import/Export Wizard and set "optimize for many tables" to true. But in this package I query only one table, and I created it manually without using the Import/Export Wizard.

Why is the package creating temp files?
SSIS is an in-memory ETL solution, except when it can't keep everything in memory and begins swapping to disk.
Why would restructuring the package as @jeff hornby suggested help?
Fully and partially blocking transformations force memory copies in your data flow. Assume you have 10 buckets carrying 1MB of data each. When you use a blocking transformation, as those buckets arrive at a transformation the data has to be copied from one memory location to another one. You've now doubled your package's total memory consumption, as you have 10MB of data used before the union all transformation and then another 10MB after it.
Only use columns that you need. If a column is not in your destination, don't add it to the data flow. Use the database to perform sorts and merges. Cast your data to the appropriate types before it ever hits the data flow.
What else could be causing temp file usage?
Lookup transformations. I've seen people crush their ETL server when they use SELECT * FROM dbo.BillionRowTable when all they needed was one or two columns for the current time period. The default behaviour of a lookup operation is to execute that source query and cache the results in memory. For large tables, wide and/or deep, this can make it look like your data flow isn't even running as SSIS is busy streaming and caching all of that data as part of the pre-execute phase.
Binary/LOB data. Have an (n)varchar(max)/varbinary(max) or classic BLOB data type in your source table? Sorry, that's not going to be in memory. Instead, the data flow is going to carry a pointer along and write a file out for each one of those objects.
Too much parallel processing. SSIS is awesome in that you get free parallelization of your processing. Except you can have too much of a good thing. If you have 20 data flows all floating in space with no precedence between them, the Integration Services engine may try to run all of them at once. Add a precedence constraint between them, even if it's just on completion (on success/on fail) to force some serialization of operations. Inside a data flow, you can introduce the same challenge by having unrelated operations going on. My rule of thumb is that starting at any source or destination, I should be able to reach all the other source/destinations.
What else can I do?
Examine what else is using memory on the box. Have you set a sane (non-default) maximum memory value for SQL Server? SSIS loves RAM like a fat kid loves cake, so you need to balance the memory needs of SSIS against the database itself; they have completely separate memory spaces.
Each data flow has the ability to set the BufferTempStoragePath and BlobTempStoragePath. Take advantage of this and point them at a drive with sufficient storage.
Finally, add more RAM. If you can't make the package better by doing the above, throw more hardware at it and be done.

If you're getting that many temp files, then you probably have a lot of blocking transforms in your data flow. Try to eliminate the following types of transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Row Sampling, Sort, Term Extraction. Also, partially blocking transformations can create the same problems, but not on the same scale: Data Mining Query, Merge, Merge Join, Pivot, Term Lookup, Union All, Unpivot. You might want to try to minimize these transformations.
Probably the problem is a sort transformation somewhere in your data flow (this is the most common). You might be able to eliminate this by using an ORDER BY clause in your SQL statement. Just remember to set the sorted property in the data source.

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a MySQL server to store some data but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60k to 100k rows/second, which would take over 10 hours. We need the data loaded very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server). Can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I can load the data onto each of them and then somehow merge all the separate data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if it helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before but was looking for a MySQL solution first, since it seems more widely used and easier to get help with in case I have problems.)

Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple mySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your mySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using myISAM rather than InnoDB for this table; myISAM's lack of transaction semantics makes it faster to load. myISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the mySQL server to read a file from the mySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the mySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
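Two of the suggestions above, a separate table per day and indexing only after the load, can be sketched together. SQLite is used here as a stand-in so the example runs on its own; the table naming scheme and columns are hypothetical, and on MySQL the bulk insert step would be LOAD DATA INFILE rather than `executemany`.

```python
import sqlite3


def load_daily_table(conn, day, rows):
    """Bulk-load one day's rows into its own table, indexing only after the load."""
    table = f"data_{day.replace('-', '_')}"  # hypothetical naming scheme
    conn.execute(f"CREATE TABLE {table} (id INTEGER, payload TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    # Build the index once, after all rows are in, rather than on every insert.
    conn.execute(f"CREATE INDEX idx_{table}_id ON {table} (id)")
    conn.commit()
    return table


def drop_daily_table(conn, table):
    """Discard a whole day's data with a single statement."""
    conn.execute(f"DROP TABLE {table}")
    conn.commit()


conn = sqlite3.connect(":memory:")
table = load_daily_table(conn, "2024-01-01", [(1, "a"), (2, "b"), (3, "c")])
row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

As the answer notes, deferring the index only makes sense if nothing needs to query the table while it is partially loaded.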
