Performing a bulk insert into two tables - MySQL

So I have some statistical studies that I would like to import into a MySQL database. The studies have numerous variables, each of which becomes a column in my database. I also have a CSV file of all the data in my studies that I would like to import into my database.
Some of the studies have more than 1,000 variables, which means more than 1,000 columns in my table, and that runs up against MySQL's column-per-table limit. Because of this I have to create multiple tables for my study and combine them using a view to see all the variables at once.
Does this mean that I will have to have multiple CSV files as well (one for each 1,000-column table), or is there some way to perform a bulk insert from a CSV file into two tables?

You will certainly need multiple tables. I would write a script to read the CSV files and then write the data to the database.
However, before just dumping the data in, I would look for opportunities to normalize/rationalize the dataset. You may find that normalization actually helps your analysis.
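That said, if you do keep one wide CSV per study, it is possible to bulk load the same file into two tables by running LOAD DATA INFILE once per table and routing the columns that table doesn't need into throwaway user variables. A sketch with made-up column names (a script could generate the long column lists for the real 1,000+ variable files):

-- First pass: the study id plus the first block of variables goes to table 1;
-- the remaining columns are discarded into @skip.
LOAD DATA LOCAL INFILE '/path/to/study.csv'
INTO TABLE study_part1
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES
(study_id, var1, var2, var3, @skip, @skip, @skip);

-- Second pass over the same file: the study id plus the second block goes to table 2.
LOAD DATA LOCAL INFILE '/path/to/study.csv'
INTO TABLE study_part2
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES
(study_id, @skip, @skip, @skip, var4, var5, var6);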

Related

MySQL: Automate Data Ingestion from regular txt/csv files to a Database

Intro
I've searched all around for this problem, but I didn't really find a good source of information about it, so I'm sorry if it seems basic to you; for me it is rather intriguing, because I'm having a hard time guessing which keywords to use on Google to retrieve the right information.
Problem Description:
I have two issues that I don't know how to deal with in a MySQL instance installed on a laptop in a Windows environment.
I have a MySQL database with 50 tables, of which 15 or 20 hold original data. The other tables are ones I generated from the original data tables in order to create tables suitable for analyzing the data in Power BI. The original data tables are fed by dumps from an ERP database.
My issues are the following:
How would one automate the process of receiving cumulative txt/csv files (via pen drive or any other transfer mechanism), storing those files in a folder, and then updating the existing tables with the new information? Is there any reference on best practices for dealing with such a scenario?
How can I keep my database in good shape through these successive data integrations? I mean, how can I make my database scalable and responsive?
Can you point me to some sources that would help me with this?
At the moment I have imported the data into tables in two steps:
1st - I created the table structure with the help of the Workbench import wizard (I had to do it this way because the tables have a lot of fields - literally dozens of them - and those fields need to be in the database). I also added primary keys and indexes to those tables;
2nd - I managed to load the data from the files into those tables using the LOAD DATA INFILE command.
Some of the fields in the tables created with the import wizard were given the data type TEXT, which is not necessary in this scenario. I would like to convert those fields to NVARCHAR(255) or something similar. However, there are a lot of fields to alter, across multiple tables, at this point, and I was wondering if I can write a query that generates all the ALTER TABLE statements I need.
So my issue here is: is it safe to alter the data type of multiple fields across multiple tables (in this case I would like to change fields with data type TEXT to NVARCHAR(255))? What is the best way to do this? Can you point me to some sources or best practices for this, please?
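For illustration, this is the kind of statement-generating query I have in mind, sketched against information_schema (the schema name is a placeholder, and I've used plain VARCHAR(255) since that is MySQL's native type):

SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
              '` MODIFY `', COLUMN_NAME, '` VARCHAR(255);') AS alter_stmt
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_database'   -- placeholder: the target schema
  AND DATA_TYPE = 'text';            -- only columns currently typed TEXT
-- Note: MODIFY replaces the whole column definition, so any NOT NULL or
-- DEFAULT clauses would need to be carried over into the generated statements.

Each row of the result would be one ALTER TABLE statement that I could review and then execute.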
Thank you, in advance, for your help.
Cheers
You need a scripting language, not a UI. See the mysql command-line tool, the shell of your OS, etc.
1. DROP DATABASE and re-CREATE it
2. LOAD DATA
3. Massage the data to get the columns cleaner than what the LOAD DATA provided
4. Sic the BI tool on the data.
If you want to discuss Step 3, we need details about what transformations are needed between step 2 and step 4. That includes providing the format or schema for steps 2 and 4.
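As a minimal sketch of steps 1-3, run from the OS shell with something like mysql --local-infile < reload.sql (the database, table, column, and file names below are all placeholders):

DROP DATABASE IF EXISTS analytics;
CREATE DATABASE analytics;
USE analytics;

CREATE TABLE sales (
  id       INT NOT NULL PRIMARY KEY,
  sold_at  DATE,
  amount   DECIMAL(10,2),
  region   VARCHAR(255)
);

LOAD DATA LOCAL INFILE '/path/to/drops/sales.csv'
INTO TABLE sales
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

-- Step 3: massage the loaded data, e.g. strip stray whitespace
UPDATE sales SET region = TRIM(region);

The idea, per steps 1-4, is that the database is rebuilt from the accumulated files on each run, so a new pen-drive delivery only needs to be copied into the drop folder before re-running the script.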

SSIS Script component - Reference data validation

I am in the process of extending an SSIS package, which takes in data from a text file, 600,000 lines of data or so, modifies some of the values in each line based on a set of business rules and persists the data to a database, database B. I am adding in some reference data validation, which needs to be performed on each row before writing the data to database B. The reference data is stored in another database, database A.
The reference data in database A is stored in seven different tables; each table only has 4 or 5 columns of type varchar. Six of the tables contain < 1 million records and the seventh has 10+ million rows. I don't want to keep hammering the database for each line in the file, and I just want to get some feedback on my proposed approach and ideas on how best to manage the largest table.
The reference data checks will need to be performed in the script component, which acts as a source in the data flow. It has an ADO.NET connection. On pre-execute, I am going to retrieve the reference data from database A (the tables which have < 1 million rows) using the ADO.NET connection, loop through them using a SqlDataReader, convert the rows to .NET objects (one type per table), and add them to dictionaries.
As I process each line in the file, I can use the dictionaries to perform the reference data validation. Is this a good approach? Anybody got any ideas on how best to manage the largest table?
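For the 10+ million row table, I am leaning towards not caching full rows at all and instead pulling back only the key column I validate against at pre-execute. The queries would be something along these lines - the table and column names are just placeholders:

-- Key-only preload: keeps the in-memory structure small because only the
-- column used for validation comes back, not the full rows.
SELECT DISTINCT ref_code
FROM dbo.LargeReference;

-- Fallback if even the key set is too large to cache: an indexed probe,
-- assuming an index on ref_code so each call is a seek rather than a scan.
SELECT TOP (1) 1
FROM dbo.LargeReference
WHERE ref_code = @ref_code;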

MySQL: mass delete of rows with their relations with individual rollback

I am looking for the best way to perform a mass delete of MySQL rows while being able to roll them back individually later if needed.
Use Case 1:
Input: a CSV file with a list of ids of a table
Goal: delete the rows with the given ids from the table, plus all their relations in the other tables of the database. Generate a script (per id?) so the data can be rolled back individually on demand later. There are no foreign key constraints, so the order doesn't matter.
Use Case 2:
Input: a CSV file with a list of ids of a table (same table as use case 1)
Goal: use the scripts generated in use case 1 to roll back all the data (relation tables as well) for the ids given in the CSV file.
Volume: about 10k identifiers per CSV; 1 table with 5 direct relations and 5 indirect ones (1 table in between).
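To make use case 1 concrete, here is the kind of per-table archive-then-delete pair I am imagining, sketched for the main table and one relation (all table and column names are placeholders; the id lists would be loaded from the CSVs into small work tables):

-- Archive tables keep the deleted rows, so use case 2 is just an INSERT ... SELECT back.
CREATE TABLE archive_orders LIKE orders;
CREATE TABLE archive_order_items LIKE order_items;

-- ids_to_delete is a one-column work table loaded from the CSV.
INSERT INTO archive_orders
SELECT o.* FROM orders o JOIN ids_to_delete d ON d.id = o.id;
INSERT INTO archive_order_items
SELECT i.* FROM order_items i JOIN ids_to_delete d ON d.id = i.order_id;

DELETE i FROM order_items i JOIN ids_to_delete d ON d.id = i.order_id;
DELETE o FROM orders o JOIN ids_to_delete d ON d.id = o.id;

-- Use case 2: restore the rows for the ids listed in another work table.
INSERT INTO orders
SELECT a.* FROM archive_orders a JOIN ids_to_restore r ON r.id = a.id;
INSERT INTO order_items
SELECT a.* FROM archive_order_items a JOIN ids_to_restore r ON r.id = a.order_id;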
My questions are:
What would be the easiest/best way to perform these actions? SQL dumps/queries? A shell script to generate them? A PHP script? Something else?
Today I already have PHP scripts that run on a server with a copy of the database to perform other actions (exports), so they don't interfere with the production database. Would it be good practice to run the script that generates the SQL delete and rollback scripts on this server as well, and then launch them on production at night? Or is 10k small enough that it wouldn't interfere with production too much?
How do I generate individual rollback scripts so they can be run in bulk later if needed (use case 2)? Using mysqldump? Merging the scripts into one depending on what's in the CSV?
Thank you for your help and advice on this.
Regards,

Import Only Matched Data From CSV to MySQL

I need to import some data from a very huge CSV file, which is about 1 GB.
Instead of importing everything, I want to import only the matched data; I think it will be easier and faster than importing all of it.
I need to search the "Post Code District" column of the CSV file and, if it contains LS1, LS2, or LS10, import the matched rows into a table in SQL.
Misconception. You think that filtering the text file first is going to be faster than just loading the entire file into the database.
I suppose there are extreme cases where this might be true. But, in general, the safest way to handle this type of situation is:
Import the file into a staging table.
Add indexes to the staging table, as necessary, for performance.
Run a query to copy the data you want from the staging table.
I could phrase this a different way: in the time it would take you to figure out how to efficiently combine information from the file and a database table, you could probably go through the above process 10-50 times.
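A sketch of that process for this particular file, with placeholder table and column names:

-- 1. Staging table that mirrors the CSV layout (columns are hypothetical).
CREATE TABLE postcode_staging (
  post_code          VARCHAR(10),
  post_code_district VARCHAR(10),
  town               VARCHAR(100)
);

-- 2. Bulk load the whole 1 GB file; this is what LOAD DATA is fast at.
LOAD DATA LOCAL INFILE '/path/to/postcodes.csv'
INTO TABLE postcode_staging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES;

-- 3. Index the filter column so the copy below is cheap.
CREATE INDEX idx_staging_district ON postcode_staging (post_code_district);

-- 4. Copy only the districts of interest into the real table.
INSERT INTO postcodes (post_code, post_code_district, town)
SELECT post_code, post_code_district, town
FROM postcode_staging
WHERE post_code_district IN ('LS1', 'LS2', 'LS10');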

Importing a MySQL database to Neo4j

I have a MySQL database on a remote server which I am trying to migrate to a Neo4j database. To do this, I dumped the individual tables into CSV files and am now planning to use the LOAD CSV functionality to create a graph from the tables.
How does loading each table preserve the relationship between tables?
In other words, how can I generate a graph for the entire database and not just a single table?
Load each table as a CSV
Create indexes on your relationship field (Neo4j only does single property indexes)
Use MATCH to locate related records between the tables
Use MERGE (a)-[:RELATIONSHIP]->(b) to create the relationship between the tables.
If you run this "all at once", it will create one large transaction, will not go to completion, and will most likely crash with a heap error. Getting around that issue requires loading the CSVs first, then creating the relationships in batches of 10K-100K per transaction.
One way to accomplish that goal is:
MATCH (a:LabelA)
MATCH (b:LabelB {id: a.id}) WHERE NOT (a)-[:RELATIONSHIP]->(b)
WITH a, b LIMIT 50000
MERGE (a)-[:RELATIONSHIP]->(b)
What this does is find :LabelA/:LabelB pairs (matched on id) that don't yet have the relationship, and create it for the first 50,000 pairs it finds. Running this repeatedly will eventually create all the relationships you want.