How to update the MySQL database when a large CSV file gets modified - mysql

Initially, I created a database called "sample" and loaded the data from a massive CSV file.
Whenever there are small changes in the .csv file (some data added/deleted/modified), I have to apply them in the database too. Reloading the entire (large) .csv file every time is not efficient.
Is there an efficient way to update only the modified data from the .csv file into the database?

There's no simple way of doing this.
One plausible way would be to store the old version of the CSV somewhere, run a diff program between the old and new versions, and then use the resulting output to determine what has been added, changed, or removed, and update the database accordingly.
This is, however, a bit unreliable and slow, and would take some effort to implement. If you can, it would probably be better to adapt the source of the CSV file to update the database directly.
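For illustration, here is a minimal Python sketch of that diff-style approach. It assumes mysql-connector-python, a table named sample_data whose primary key is an id column that also appears in the CSV, two extra columns col1/col2, and that you kept a copy of the previously imported file; all of those names are made up and need adjusting to the real schema.

import csv
import mysql.connector

def load_rows(path):
    """Read a CSV file into a dict keyed by the id column."""
    with open(path, newline="") as f:
        return {row["id"]: row for row in csv.DictReader(f)}

old_rows = load_rows("data_old.csv")   # the previously imported version
new_rows = load_rows("data_new.csv")   # the freshly received version

conn = mysql.connector.connect(user="user", password="pw", database="sample")
cur = conn.cursor()

# Rows only in the new file -> INSERT; rows whose content changed -> UPDATE.
for key, row in new_rows.items():
    if key not in old_rows:
        cur.execute(
            "INSERT INTO sample_data (id, col1, col2) VALUES (%s, %s, %s)",
            (row["id"], row["col1"], row["col2"]),
        )
    elif row != old_rows[key]:
        cur.execute(
            "UPDATE sample_data SET col1 = %s, col2 = %s WHERE id = %s",
            (row["col1"], row["col2"], row["id"]),
        )

# Rows only in the old file -> DELETE.
for key in old_rows.keys() - new_rows.keys():
    cur.execute("DELETE FROM sample_data WHERE id = %s", (key,))

conn.commit()
cur.close()
conn.close()

Whatever language you use, the key point is keeping the old copy of the file around so there is something to diff against.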

Since you also want to delete entries that no longer exist in the CSV file, you will have to load the complete CSV file every time (and truncate the table first) in order to get a 1:1 copy.
For a more convenient synchronization you will probably have to use some scripting language (PHP, Python, etc.).
Sorry, that's all I know...
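As a rough sketch of what that full reload could look like from a script (assuming mysql-connector-python, a hypothetical table sample_data, and a MySQL server with local_infile enabled; the field/line options must match the real CSV):

import mysql.connector

conn = mysql.connector.connect(
    user="user", password="pw", database="sample",
    allow_local_infile=True,   # needed for LOAD DATA LOCAL INFILE
)
cur = conn.cursor()

cur.execute("TRUNCATE TABLE sample_data")        # throw away the old copy
cur.execute(
    "LOAD DATA LOCAL INFILE '/path/to/data.csv' "
    "INTO TABLE sample_data "
    "FIELDS TERMINATED BY ',' "
    "IGNORE 1 LINES"                             # skip the header row, if any
)
conn.commit()
cur.close()
conn.close()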

In my experience it's almost impossible to treat a data file that changes regularly as your "master data set": unless you can somehow generate a diff file that shows where the master data changed, you will always be forced to run through the entire CSV file, query the database for the corresponding record, and then either do nothing (if identical), insert (if new), or update (if modified). In many cases it will even be faster to just drop the table and reload the entire thing, but that can lead to serious operational problems.
Therefore, if it's at all possible for you, I'd treat the database as the master data and generate the CSV file from there.
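If you do end up walking the whole CSV against the database as described above, MySQL's INSERT ... ON DUPLICATE KEY UPDATE can at least fold the "insert if new, update if modified" branches into a single statement and saves the per-row SELECT. A rough sketch, again assuming a hypothetical sample_data table with an id primary key and col1/col2 columns (note that this cannot detect rows that were deleted from the CSV):

import csv
import mysql.connector

conn = mysql.connector.connect(user="user", password="pw", database="sample")
cur = conn.cursor()

with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        cur.execute(
            "INSERT INTO sample_data (id, col1, col2) VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2)",
            (row["id"], row["col1"], row["col2"]),
        )

conn.commit()
cur.close()
conn.close()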

Related

SSIS package design, where 3rd party data is replacing existing data

I have created many SSIS packages in the past, though the need for this one is a bit different than the others which I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to arrive, and what happens is that we truncate the old data but don't get the fresh data set. This leaves us with no data at all, when we would prefer to keep yesterday's data over having nothing.
Desired Solution: I would like some sort of mechanism to verify that the new data (these files) truly exists before truncating our current data.
What I have tried: I tried to capture the data from the files, add it to an ADO recordset, and only proceed if that part was successful. This doesn't seem to work for me, as I have all the data capture activities in one data flow and I don't see a way to reuse that data. It would also seem wasteful of resources to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If the files are not present, set flags such as IsFile1Found to false and pass these flags to a stored procedure that truncates on a conditional basis.
To check whether a file is empty, you can use PowerShell through an Execute Process Task to extract the first two rows; if there are two rows (header + data row), the data file is not empty. Then you can truncate the table and import the data.
Another approach could be:
Load the data into staging tables, insert from the staging tables into the destination tables using a SQL stored procedure, and truncate the staging tables after the data has been moved to all the destination tables. That way, before truncating a destination table, you can check whether the staging tables are empty.
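Outside of SSIS, that guard could look roughly like this (a sketch assuming SQL Server reached via pyodbc and made-up names stg_vendor_data and dbo.vendor_data; inside the package the same check could live in an Execute SQL Task):

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cur = conn.cursor()

cur.execute("SELECT COUNT(*) FROM stg_vendor_data")
staged_rows = cur.fetchone()[0]

if staged_rows > 0:
    # Only replace yesterday's data when the new feed actually arrived.
    cur.execute("TRUNCATE TABLE dbo.vendor_data")
    cur.execute("INSERT INTO dbo.vendor_data SELECT * FROM stg_vendor_data")
    cur.execute("TRUNCATE TABLE stg_vendor_data")
    conn.commit()
else:
    print("Staging table is empty; keeping yesterday's data.")

cur.close()
conn.close()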
I looked around and found that others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count the records and save the count to a variable. If a file isn't there, the package fails and you can stop execution at that point. Some of these files have record counts that actually interest me, though for the most part I don't care. If you don't care what the counts are, you can keep recycling the same variable; this reduces the number of variables you have to create (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.

Erasing records from text file after importing it to MySQL database

I know how to import a text file into a MySQL database by using the command
LOAD DATA LOCAL INFILE '/home/admin/Desktop/data.txt' INTO TABLE data
The above command writes the records of the file "data.txt" into the MySQL table. My question is that I want to erase the records from the .txt file once they are stored in the database.
For example: if there are 10 records and, at the current point in time, 4 of them have been written into the database table, I want those 4 records to be erased from data.txt at the same time. (In a way, the text file acts as a "queue".) How can I accomplish this? Can it be done with Java code, or should a scripting language be used?
Automating this is not too difficult, but it is also not trivial. You'll need something (a program, a script, ...) that can do the following (a rough sketch follows the list):
Read the records from the original file,
Check whether they were inserted and, if they were not, copy them to another file,
Rename or delete the original file, and rename the new file to replace the original one.
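A minimal Python sketch of those three steps, assuming the id of each record is the first field of each comma-separated line in data.txt (adjust the delimiter to your real format), that the target table data has a matching id column, and that mysql-connector-python is available; these are assumptions, not part of the question.

import csv
import os
import mysql.connector

conn = mysql.connector.connect(user="user", password="pw", database="mydb")
cur = conn.cursor()

kept = []
with open("/home/admin/Desktop/data.txt", newline="") as src:
    for record in csv.reader(src):
        # Step 2: check whether this record already made it into the table.
        cur.execute("SELECT COUNT(*) FROM data WHERE id = %s", (record[0],))
        if cur.fetchone()[0] == 0:
            kept.append(record)   # not inserted yet, so it stays in the "queue"

# Steps 1 and 3: write the leftover records to a new file and swap it in.
with open("/home/admin/Desktop/data.txt.new", "w", newline="") as dst:
    csv.writer(dst).writerows(kept)
os.replace("/home/admin/Desktop/data.txt.new", "/home/admin/Desktop/data.txt")

cur.close()
conn.close()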
There might be better ways of achieving what you want to do, but, that's not something I can comment on without knowing your goal.

SSIS read text file to temp table before reaching ADO NET Destination?

I have a text file that's uploaded to a SQL Server table. It's straightforward, but due to some inconsistencies, I need to upload this file to a #tempTable or @tableVariable, clean it there, and then upload it to the physical table.
So essentially, I have a Flat File Source that reads the .txt file with the inconsistencies and uploads it to the table in an ADO NET Destination. If all the columns in the destination table were varchar, I would be able to save the data in the physical table and run some T-SQL script to clean it, but I don't want to do that.
I can also run a T-SQL script before and after to create/drop the temp table, but if there's a way to do it with a #temp table or @table variable, then great.
If this is something you do more than once, a standard ETL practice is to have a persistent all-VARCHAR staging table into which the file is loaded as the first step.
Then the data is inspected for suitability before being added to the production table.
This does the following:
Ensures bad data stays out of the production tables where it can break things.
Allows the ETL process to be standardized and repeatable. You no longer have to remember how they broke the data last time, you just add it to the logic that promotes from the staging table to the production table and forget about it.
Allows the easy transition of the process to someone else, as well as the opportunity for proper source control.
I can't think of a single benefit to using a temp table/table variable, unless you are somehow unable to create a physical table.
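As a minimal sketch of the promotion step (assuming SQL Server 2012+ reached via pyodbc, an all-VARCHAR staging table stg_import already bulk-loaded by the package, and a made-up production table dbo.orders; the real cleaning rules depend on your data):

import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
cur = conn.cursor()

# Everything in stg_import is VARCHAR, so the load itself can never fail on
# data types. Promote only the rows that survive the cleaning rules.
cur.execute("""
    INSERT INTO dbo.orders (order_id, amount)
    SELECT CAST(order_id AS INT),
           CAST(amount AS DECIMAL(10, 2))
    FROM stg_import
    WHERE TRY_CAST(order_id AS INT) IS NOT NULL
      AND TRY_CAST(amount AS DECIMAL(10, 2)) IS NOT NULL
""")
cur.execute("TRUNCATE TABLE stg_import")   # ready for the next file
conn.commit()
cur.close()
conn.close()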

Copying rows from one database file to another

I have 2 separate SQL databases; they both have the same fields but are not attached and are completely separate files. One of the database files has a few hundred rows of data, and I want to copy a few of those rows into the other database file. Some people have said to use SQL statements to copy the data, but the databases are not linked in any way, so I am not sure how those statements would work. Is there any software where I can just select the correct rows and copy them over, or create a new database with the selected rows?
I hope this makes sense, thanks.
Regardless of the database platform you are using, there should be commands/tools that will allow you to perform bulk data imports and exports from/to a file (e.g. a CSV file). Try exporting the rows you wish to copy from the database on the first server into an intermediate file, copying that file to the second server, and then importing it into that database.
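For instance, if the "database files" happen to be SQLite, a small sketch using only Python's standard library could look like the following; the table name items, its columns, and the WHERE clause are made up, and the same export-to-CSV / import-from-CSV idea works with any platform's bulk tools.

import csv
import sqlite3

# Export the selected rows from the first database file to a CSV file.
src = sqlite3.connect("first.db")
rows = src.execute("SELECT id, name, value FROM items WHERE value > 100").fetchall()
with open("rows_to_copy.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
src.close()

# Import that CSV into the matching table in the second database file.
dst = sqlite3.connect("second.db")
with open("rows_to_copy.csv", newline="") as f:
    dst.executemany("INSERT INTO items (id, name, value) VALUES (?, ?, ?)",
                    csv.reader(f))
dst.commit()
dst.close()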

Cutting down the data in a dump file

I am using MySQL.
I got a MySQL dump file (large_data.sql); I can create a database and load the data from this dump file into the created database. No problem with this.
Now I feel the data in the dump file is too large (for example, it contains 300,000 rows/objects in one table, and other tables also contain large amounts of data).
So I decided to make another dump (based on the large dump) which contains a small amount of data (for example, 30 rows/objects per table).
With only that big dump file, what is the correct and efficient way to cut down the data in that dump and create a new dump file which contains a small amount of data?
------------------------- More -----------------------------------
(Using a text editor to open the large dump is not a good option, since the dump is very large and takes a long time to open.)
If you want to work only on the textual dump files, you could use some text-processing tools (like awk or sed, or perhaps a Perl, Python, or OCaml script) to handle them.
But maybe your big database was already loaded from the big dump file, and you want to work with MySQL incremental backups instead?
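As an illustration of the scripted route, here is a rough Python sketch that streams the big dump and keeps only the first 30 rows of each table. It assumes one row per INSERT statement (as produced by mysqldump --skip-extended-insert); with the default extended INSERTs you would also have to split the multi-row lines.

KEEP = 30
counts = {}

with open("large_data.sql") as src, open("small_data.sql", "w") as dst:
    for line in src:
        if line.startswith("INSERT INTO "):
            table = line.split("`")[1]      # the table name is back-quoted
            counts[table] = counts.get(table, 0) + 1
            if counts[table] > KEEP:
                continue                    # drop rows beyond the limit
        dst.write(line)                     # keep DDL, comments and kept rows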
I recommend the free File Splitter: http://www.filesplitter.org/ .
Only problem: it may cut a query into two parts. You need to edit the file manually afterwards, but it works like a charm.
Example :
My file is :
BlaBloBluBlw
BlaBloBluBlw
BlaBloBluBlw
Result will be :
File 1:
BlaBloBluBlw
BlaBloBl
File 2:
uBlw
BlaBloBluBlw
So you need to do a little editing, but it works like a charm and is very quick. I used it today on a table with 9.5 million rows.
BUT!! The best argument: the time you spend doing this is small compared to the time it takes to import something big or to wait for it... This is quick and efficient even though you need to edit the file manually, since you only need to rebuild the first and last queries.