I am going to be creating some CSV files via MySQL with code like the following, for example:
SELECT id, name, email INTO OUTFILE '/tmp/result.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM users WHERE 1
But I was just wondering: if the resulting file is going to be rather large (possibly several GB), are there things I should be concerned about or precautions I should take, like memory problems etc.?
RAM shouldn't be an issue, but you'll want to make sure the volume you are writing to can handle that large a file. I've seen a lot of people get stuck at 2 GB or 4 GB because their file system couldn't handle files larger than that.
Also, I recommend writing it to a local drive on the MySQL machine and then copying it over the network or other means. Writing that large of a file could take quite a while if your network isn't at least Gigabit.
One more suggestion... try it on about 1000 rows or so first, and then test your CSV against your target environment. Sometimes it takes a few tries to get the formatting where you want it.
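If it helps, a minimal version of that sample run could look like the following, reusing the query from the question but capped with LIMIT so the output stays small (the sample path is just a placeholder, and INTO OUTFILE refuses to overwrite an existing file, so use a fresh name for each test):

-- Small test export to check delimiters, quoting, and escaping before the full run
SELECT id, name, email INTO OUTFILE '/tmp/result_sample.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM users
LIMIT 1000;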
So I have the following issue: I have a medium-sized database (1.6 gigabytes), and I am running the following statement in RDS MySQL.
LOAD DATA LOCAL INFILE "E:/csms/Query/Tratado_interno/Dataset_Raw.csv"
INTO TABLE Eventtracks_Take.Prod
FIELDS TERMINATED BY '|'
ENCLOSED BY '"'
IGNORE 1 ROWS;
It has come to my attention that such loads are supposed to run very fast; at least that is what people in tutorials show, around a second or so. But the ones I am handling take around 3.6 minutes. Any ideas on what the probable cause could be? My EC2 instance is a t3.large, so I would expect that is not the issue. Sometimes I also run into a few specific warnings telling me a column of data was truncated, even though I can guarantee I gave it enough characters in the schema. Nonetheless, I would appreciate any theory.
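In case it helps with the truncation part, the server keeps the details of those warnings for the current session, so running the statements below right after the load should show exactly which column and row were cut off and how the column is actually declared (just a diagnostic sketch; it won't explain the speed by itself):

-- List the truncation warnings from the LOAD DATA that just ran in this session
SHOW WARNINGS;
-- Compare against the actual column definitions and character set of the target table
SHOW CREATE TABLE Eventtracks_Take.Prod;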
I have a CSV file around 20 GB in size with about 60 million rows that I would like to load into a table within MySQL.
I have defined my table in mysql with a composite primary key of (col_a, col_b) prior to starting any load.
I have initiated my load as below:
LOAD DATA LOCAL INFILE '/mypath/mycsv.csv'
INTO TABLE mytable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 0 LINES
(@v1v, @v2v, @v3v, @v4v, @v5v, etc...)
SET
col_a = nullif(@v1v,''),
col_b = nullif(@v2v,''),
col_c = nullif(@v3v,''),
col_d = nullif(@v4v,''),
col_e = nullif(@v5v,''),
etc...,
load_dttm = NOW();
This seemed to work fine until the dataset got to around 10 GB in size, at which point the loading slowed significantly, and what looked like it might take an hour has been running all night and not got much larger.
Are there more efficient ways of loading (depending on your definition of that word) "large" CSVs into MySQL?
My immediate thoughts are:
1) Should I remove my composite primary key and only apply it after the load (sketched at the end of this question)?
2) Should I break the CSV down into smaller chunks?
As I understand it, MySQL is mainly limited by system constraints, which should not be an issue in my case; I am using a Red Hat Linux server with "MemTotal: 396779348 kB" and terabytes of space.
This is my first time using MySQL, so please bear that in mind in any answers.
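For what it's worth, idea 1) sketched out with the table and column names from the statement above would look something like this (just a sketch; re-adding the key afterwards rebuilds the table, so that step is not free either):

-- Load without the composite key, then add it back afterwards
ALTER TABLE mytable DROP PRIMARY KEY;

-- ... run the same LOAD DATA LOCAL INFILE statement as above ...

-- Rebuilds the table; on 60 million rows this will itself take a while
ALTER TABLE mytable ADD PRIMARY KEY (col_a, col_b);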
My issue, it turns out, was due to the /var/lib/mysql directory not having enough space allocated. It seems that MySQL will slow down rather than throw an error when space becomes low while processing a LOAD DATA command. To resolve this I have moved the datadir, following How to change MySQL data directory?
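For anyone hitting the same wall, the location MySQL is actually writing to can be confirmed from SQL before digging further (nothing here is specific to my setup):

-- The volume holding datadir (and tmpdir, which a load may also touch for sorting)
-- is what needs the free space, not wherever the CSV itself lives
SELECT @@datadir;
SHOW VARIABLES LIKE 'tmpdir';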
I need to import 400 MB of data into a MySQL table from 10 different .txt files loosely formatted in a CSV manner. I have 4 hours tops to do it.
Is this AT ALL possible?
Is it better to chop the files into smaller files?
Is it better to upload them all simultaneously or to upload them sequentially?
Thanks
EDIT:
I currently uploaded some of the files to the server, and am trying to use "load data infile" from phpMyAdmin to achieve the import, using the following syntax and parameters:
load data infile 'http://example.com/datafile.csv' replace into table database.table fields terminated by ',' lines terminated by ';' (`id`,`status`);
It throws an Access denied error; I was wondering if I could achieve this through a PHP file instead, as mentioned here.
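For reference, LOAD DATA INFILE reads from the database server's own filesystem (and needs the FILE privilege); it cannot fetch an HTTP URL. The LOCAL variant reads the file from the client machine instead. A rough sketch of both, with placeholder paths:

-- Option A: file copied onto the MySQL server (needs the FILE privilege and,
-- if secure_file_priv is set, a path inside that directory)
LOAD DATA INFILE '/tmp/datafile.csv'
REPLACE INTO TABLE `database`.`table`
FIELDS TERMINATED BY ',' LINES TERMINATED BY ';'
(`id`, `status`);

-- Option B: file left on the client machine (local_infile must be enabled
-- on both client and server)
LOAD DATA LOCAL INFILE '/path/on/client/datafile.csv'
REPLACE INTO TABLE `database`.`table`
FIELDS TERMINATED BY ',' LINES TERMINATED BY ';'
(`id`, `status`);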
Thanks again
FINAL EDIT
(or My Rookie Mistakes)
In the end, it could be done. Easily.
Paranoia was winning the war in my mind as I wrote this, due to a time limit I thought impossible to meet. I wasn't reading the right things, paying attention to nothing but the pressure.
The first mistake, an easy one to make and solve, was keeping the indexes while trying to import that first batch of data. It was pointed out in the comments, and I read it in many other places, but I dismissed it as unimportant; it wouldn't change a thing. Well, it definitely does.
Still, the BIG mistake was using LOAD DATA INFILE on a table running on the InnoDB engine. I eventually ran into a post somewhere (whose link I've lost) claiming that the command doesn't perform well on InnoDB tables, giving errors such as lock wait timeouts and the like.
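For reference, the commonly suggested workarounds are either disabling keys (MyISAM) or relaxing InnoDB's per-row checks for the duration of the load; roughly like this, with placeholder file and table names:

-- InnoDB: turn off per-row commit and constraint checks for one bulk load
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

LOAD DATA INFILE '/path/to/file.csv' INTO TABLE my_table
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;

-- MyISAM equivalent: ALTER TABLE my_table DISABLE KEYS; ... load ...; ALTER TABLE my_table ENABLE KEYS;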
I'm guessing this should probably be declared a duplicate of some InnoDB vs MyISAM question (even if only somewhat related), or someone could perhaps provide a more elaborate answer explaining what I only mention and know superficially, and I'd gladly select it as the correct answer (B-B-BONUS points for adding the relative size of an index compared to the table data, or something of the sort).
Thanks to all those who were involved.
PS: I've replaced HeidiSQL with SQLyog since I read the recommendation in the comments and the later answer; it's pretty decent and a bit faster/lighter. If I can get used to the SQL Server-like interface I might keep it as my default DB manager.
I want to load a set of large RDF triple files into Neo4j. I have already written map-reduce code to read all the input N-Triples and output two CSV files: nodes.csv (7 GB, 90 million rows) and relationships.csv (15 GB, 120 million rows).
I tried the batch-import command from Neo4j v2.2.0-M01, but it crashes after loading around 30M rows of nodes. I have 16 GB of RAM on my machine, so I set wrapper.java.initmemory=4096 and wrapper.java.maxmemory=13000. I then decided to split nodes.csv and relationships.csv into smaller parts and run batch-import for each part, but I don't know how to merge the databases created from the multiple imports.
I appreciate any suggestion on how to load large CSV files into Neo4j.
I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was having \" in some values, which was interpreted as a quotation character to be included in the field value, and this threw everything off from that point forward.
Why don't you try this approach (using groovy): http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
You will create a uniqueness constraint on the nodes, so duplicates won't be created.
I have a table in MySQL that I manage using phpMyAdmin. Currently it's sitting at around 960,000 rows.
I have a boss who likes to look at the data in Excel, which means weekly, I have to export the data into Excel.
I am looking for a more efficient way to do this, since I can't actually export the entire table at once because it times out. So I have been stuck 'chunking' the table into smaller queries and exporting it that way.
I have tried connecting Excel (and Access) directly to my database, but same problem; it times out. Is there any way to extend the connection limit?
Is your boss Rain Man? Does he just spot 'information' in raw 'data'?
Or does he build functions in excel rather than ask for what he really needs?
Sit with him for an hour and see what he's actually doing with the data. What questions are being asked? What patterns are being detected (manually)? Write a real tool to find that information and then plug it into your monitoring/alerting system.
Or, get a new boss. Seriously. You can tell him I said so.
Honestly, for this size of data, I would suggest doing a mysqldump and then importing the table into another copy of MySQL installed somewhere else, maybe on a virtual machine dedicated to this task. From there, you can set timeouts and such as high as you want and not worry about resource limitations blowing up your production database. Using nice on Unix-based OSes or process priorities on Windows-based systems, you should be able to do this without too much impact on the production system.
Alternatively, you can set up a replica of your production database and pull the data from there. Having a so-called "reporting database" that replicates various tables or even entire databases from your production system is actually a fairly common practice in large environments to ensure that you don't accidentally kill your production database pulling numbers for someone. As an added advantage, you don't have to wait for a mysqldump backup to complete before you start pulling data for your boss; you can do it right away.
The rows are not too many. Run a query to export your data into a csv file:
SELECT column,column2 INTO OUTFILE '/path/to/file/result.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM yourtable WHERE columnx = 'condition';
One option would be to use SELECT ... INTO OUTFILE, or mysqldump with the --tab option, to export your table to CSV, which Excel can then open directly.