Talend - Process large delimited file - csv

I've got a question about how to process a delimited file with a large number of columns (>3000).
I tried to extract the fields with the standard delimited file input component, but creating the schema takes hours, and when I run the job I get an error because the generated toString() method exceeds the 65535-byte limit. After that I can run the job, but all the columns are messed up and I can't really work with them anymore.
Is it possible to split that CSV file with Talend? Is there any other way to handle this, maybe with some Java code? If you have any further questions, don't hesitate to comment.
Cheers!

You can create the schema of the delimited file in Metadata, right? I tested 3k columns with a few million records and it didn't even take 5 minutes to load all the column names with data types. Obviously you can't split that file by taking each row as one cell; it could exceed Talend's string limit. But you can do it in Java using a BufferedReader.
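For the splitting itself, here is a minimal sketch of that BufferedReader idea: it streams the file once and writes the columns out in chunks of 500 per output file, so each part stays far below the schema and string limits. The file name, delimiter and chunk size are assumptions to adjust, and the plain split() does not handle delimiters inside quoted values.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;

public class WideCsvSplitter {

    private static final String DELIMITER = ";";      // assumption: adjust to your file
    private static final int COLUMNS_PER_FILE = 500;  // assumption: columns per output file

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("wide_input.csv"))) {
            String header = reader.readLine();
            if (header == null) {
                return; // empty file, nothing to split
            }
            int totalColumns = header.split(DELIMITER, -1).length;
            int fileCount = (totalColumns + COLUMNS_PER_FILE - 1) / COLUMNS_PER_FILE;

            BufferedWriter[] writers = new BufferedWriter[fileCount];
            for (int i = 0; i < fileCount; i++) {
                writers[i] = new BufferedWriter(new FileWriter("part_" + i + ".csv"));
            }

            writeChunks(header, writers);       // header row first
            String line;
            while ((line = reader.readLine()) != null) {
                writeChunks(line, writers);     // then every data row, streamed
            }
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
    }

    // Splits one physical line into column chunks and appends each chunk to its
    // output file. Note: a plain split() breaks if the delimiter can appear
    // inside quoted values.
    private static void writeChunks(String line, BufferedWriter[] writers) throws IOException {
        String[] cols = line.split(DELIMITER, -1);
        for (int i = 0; i < writers.length; i++) {
            int from = i * COLUMNS_PER_FILE;
            if (from >= cols.length) {          // short row: nothing left for this chunk
                writers[i].newLine();
                continue;
            }
            int to = Math.min(from + COLUMNS_PER_FILE, cols.length);
            writers[i].write(String.join(DELIMITER, Arrays.copyOfRange(cols, from, to)));
            writers[i].newLine();
        }
    }
}

Each of the resulting part files can then be read with a normal delimited-input component and joined back on a key or row number if needed.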

To deal with a big delimited file, you need something designed for big data. I think a good choice is to load your file into a MongoDB collection using the command below; there is no need to create a 3k-column collection before importing the file:
mongoimport --db users --collection contacts --type csv --headerline --file /opt/backups/contacts.csv
After that, you can process your data easily using an ETL tool.
See the mongoimport reference.

Maybe you could have a go with uniVocity. It is built to handle all sorts of extreme situations when processing data.
Check out the tutorial and see if it suits your needs.
Here's a simple project which works with CSV inputs: https://github.com/uniVocity/worldcities-import/
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
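As a very rough sketch of how a file this wide could be parsed with it (the settings shown here are assumptions from memory, so check the tutorial for the exact option names):

import java.io.FileReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class UnivocityWideCsv {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter(';');  // assumption: adjust to your file's delimiter
        settings.setMaxColumns(5000);            // assumption: raise the column limit for >3000 columns
        settings.setMaxCharsPerColumn(10000);    // assumption: raise if individual values are long

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(new FileReader("wide_input.csv"));  // hypothetical file name

        String[] row;
        while ((row = parser.parseNext()) != null) {
            // process one row at a time; the whole file is never held in memory
        }
    }
}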

Related

How to convert Excel to SQL (I have 143,864 rows and 100 columns in Excel, 48,316 KB in total)

I converted the Excel file to CSV first, then imported it into phpMyAdmin, but it only imported 100 rows. I changed the buffer size in config.inc, but that still didn't change the result. Could you please help me?
My main reason for doing this is to compare two tables in MySQL Workbench. I already have one table in SQL; I need to convert the Excel data to SQL so I can use "compare schemas" after creating an EER model of the existing database.
It's good that you described the purpose of this approach, because I can tell you in advance that converting that Excel data to a MySQL table will not help.
The model features (sync, compare, etc.) work on metadata only; they do not consider any table content. So instead you should do a textual comparison by converting the table you have on the server to CSV.
Comparing such large documents is, however, a challenge. If you only have a few changes, then a diff tool (a visual one like Araxis Merge, or diff on the command line) may help. For larger changesets a small utility app (perhaps self-written) might be necessary.

Big data migration from Oracle to MySQL

I received over 100 GB of data with 67 million records from one of the retailers. My objective is to do some market-basket analysis and CLV. The data is a direct SQL dump of one of their tables, with 70 columns. I'm trying to find a way to extract information from this data, as managing it on a small laptop/desktop setup is becoming time-consuming. I considered the following options:
Parse the data and convert it to CSV format. The file size might come down to around 35-40 GB, as more than half of the information in each record is column names. However, I may still have to use a database, as I can't use R or Excel with 66 million records.
Migrate the data to a MySQL database. Unfortunately I don't have the schema for the table, so I'm trying to recreate it by looking at the data. I may also have to replace to_date() in the dump with str_to_date() to match the MySQL format.
Is there any better way to handle this? All I need to do is extract the data from the SQL dump by running some queries. Hadoop etc. are options, but I don't have the infrastructure to set up a cluster. I'm considering MySQL, as I have storage space and some memory to spare.
Supposing I go the MySQL route, how would I import the data? I'm considering one of the following:
Use sed to replace to_date() with the appropriate str_to_date() inline. Note that I need to do this on a 100 GB file. Then import the data using the mysql CLI.
Write a Python/Perl script that reads the file, converts the data and writes to MySQL directly.
What would be faster? Thank you for your help.
In my opinion writing a script will be faster, because you skip the sed step.
I think you should set up the server on a separate PC and run the script from your laptop.
Also use tail to quickly grab a chunk from the bottom of this large file, so you can test your script on that part before you run it on the full 100 GB file.
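A rough sketch of that script idea, written here in Java (the question mentions Python/Perl; the logic is the same). It assumes every to_date() call in the dump uses a single known format mask, 'YYYY-MM-DD HH24:MI:SS', which you must verify against your dump, and the file names are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Pattern;

public class OracleDumpRewriter {
    public static void main(String[] args) throws IOException {
        // Matches to_date('<value>', 'YYYY-MM-DD HH24:MI:SS') and captures the value.
        Pattern toDate = Pattern.compile(
                "to_date\\('([^']*)',\\s*'YYYY-MM-DD HH24:MI:SS'\\)",
                Pattern.CASE_INSENSITIVE);

        try (BufferedReader in = new BufferedReader(new FileReader("oracle_dump.sql"));
             BufferedWriter out = new BufferedWriter(new FileWriter("mysql_dump.sql"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Rewrite each Oracle to_date(...) into the MySQL str_to_date(...) equivalent.
                out.write(toDate.matcher(line)
                        .replaceAll("str_to_date('$1', '%Y-%m-%d %H:%i:%s')"));
                out.newLine();
            }
        }
    }
}

The rewritten file can then be fed to the mysql CLI exactly as in option 1.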
I decided to go with the MySQL path. I created the schema by looking at the data (I had to increase a few of the column sizes, as there were unexpected variations in the data) and wrote a Python script using the MySQLdb module. The import completed in 4 hr 40 min on my 2011 MacBook Pro, with 8154 failures out of 67 million records; those failures were mostly data issues. Both client and server are running on my MBP.
@kpopovbg, yes, writing a script was faster. Thank you.

cut off data in a dump file

I am using MySQL.
I have a MySQL dump file (large_data.sql). I can create a database and load the data from this dump file into it; no problem there.
Now, I feel the data in the dump file is too large (for example, it contains 300,000 rows/objects in one table, and other tables also contain large amounts of data).
So I decided to make another dump (based on the large one) which contains a small amount of data (for example, 30 rows/objects per table).
Given only that big dump file, what is the correct and efficient way to cut down the data in that dump and create a new dump file which contains a small amount of data?
------------------------- More -----------------------------------
(Opening the large dump with a text tool is not practical; since the dump is very large, it takes a long time to open.)
If you want to work only on the textual dump files, you could use some text-processing tools (like awk or sed, or perhaps a Perl, Python or OCaml script) to handle them.
But maybe your big database was already loaded from the big dump file, and you want to work with MySQL incremental backups?
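A rough sketch of that small-script route, written here in Java: it copies the dump but keeps only the first 30 INSERT statements per table. It assumes the dump was produced with --skip-extended-insert, i.e. one row per INSERT and each statement on its own line; with multi-row INSERTs you would need an SQL-aware parser instead. File names are placeholders.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpTrimmer {
    private static final int ROWS_TO_KEEP = 30;
    private static final Pattern INSERT =
            Pattern.compile("^INSERT INTO `([^`]+)`", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        Map<String, Integer> keptPerTable = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("large_data.sql"));
             BufferedWriter out = new BufferedWriter(new FileWriter("small_data.sql"))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = INSERT.matcher(line);
                if (m.find()) {
                    // Count INSERT statements per table and skip everything past the first 30.
                    int n = keptPerTable.merge(m.group(1), 1, Integer::sum);
                    if (n > ROWS_TO_KEEP) {
                        continue;
                    }
                }
                // Schema statements and the first 30 rows of each table are copied as-is.
                out.write(line);
                out.newLine();
            }
        }
    }
}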
I recommend the free File Splitter: http://www.filesplitter.org/.
The only problem: it may cut a query into two parts. You need to edit the file manually afterwards, but it works like a charm.
Example:
My file is:
BlaBloBluBlw
BlaBloBluBlw
BlaBloBluBlw
Result will be:
File 1:
BlaBloBluBlw
BlaBloBl
File 2:
uBlw
BlaBloBluBlw
So you need to do that editing, but it works like a charm and is very quick. I used it today on a 9.5-million-row table.
BUT!! The best argument: the time you spend doing this is small compared to the time spent trying to import something big, or waiting for it. This is quick and efficient, even though you need to edit the file manually to rebuild the last and first queries.

Importing 10 billion rows into mysql

I have a .csv file with 10 billion rows. I want to check that each row is unique. Is there an easy way to do this? I was thinking that perhaps importing it into MySQL would let me check uniqueness quickly. How can I upload this huge file to MySQL? I have already tried row-by-row INSERT statements and also the 'LOAD DATA INFILE' command, but both failed.
Thanks
I wouldn't use a database for this purpose, unless it needed to end up in the database eventually. Assuming you have the same formatting for each row (so that you don't have "8.230" and "8.23", or extra spaces at the start/end of lines for equal values), use a few of the text utilities included with most POSIX environments (Linux, Mac OS X), or available for Windows via the GnuWin32 coreutils.
Here is the sequence of steps to do from your system shell. First, sort the file (this step is required):
sort ten.csv > ten_sorted.csv
Then find unique rows from sorted data:
uniq ten_sorted.csv > ten_uniq.csv
Now you can check to see how many rows there are in the final file:
wc ten_uniq.csv
Or you can just use pipes to combine the three steps into one command line:
sort ten.csv | uniq | wc
Does the data have a unique identifier? Make that column the primary key in your MySQL table, and when you import the data, MySQL will throw an error if you have duplicates.
As for how to go about doing it: just read the file row by row and do an insert for each row.
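A minimal sketch of that row-by-row approach using JDBC, assuming a table whose primary (or unique) key covers the whole row so MySQL rejects duplicates for you; the table, column and connection details are hypothetical. INSERT IGNORE skips duplicates silently, whereas a plain INSERT would stop with an error at the first one:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RowByRowImporter {
    public static void main(String[] args) throws Exception {
        // Assumed schema (hypothetical): CREATE TABLE rows_check (line VARCHAR(255), PRIMARY KEY (line));
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/test", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT IGNORE INTO rows_check (line) VALUES (?)");
             BufferedReader reader = new BufferedReader(new FileReader("ten.csv"))) {

            String line;
            long count = 0;
            while ((line = reader.readLine()) != null) {
                ps.setString(1, line);
                ps.addBatch();
                if (++count % 10000 == 0) {
                    ps.executeBatch();  // send rows in batches to cut down round trips
                }
            }
            ps.executeBatch();          // flush the final partial batch
        }
    }
}

With 10 billion rows, though, the sort/uniq approach above will almost certainly finish sooner than any insert loop.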
If you are importing from Excel or other such programs, see here for how to cleanse the CSV file before importing it into MySQL. Regarding row uniqueness, as long as your table schema is right, MySQL should be able to take care of it.
EDIT:
Whether the source is Excel or not, LOAD DATA LOCAL INFILE appears to be the way to go.
10 billion rows, and LOAD DATA LOCAL gives you an error? Are you sure there is no problem with the CSV file?
You have to break your dump into separate, small, bite-size chunks. Use BigDump.
http://www.ozerov.de/bigdump.php
If you do have 10 billion rows then you will struggle working with this data.
You would need to look at partitioning your database (ref here: about MySQL partitioning).
However, even then, with numbers that large you would need some serious hardware to cut through the work involved.
Also, what would you do if a row was found to be non-unique? Would you want to continue importing the data? If so, would you import the identical row or flag it as a duplicate? Would you stop processing?
This is the kind of job Linux is "made for".
First you have to split the file into many smaller files:
split -l 100 filename
After this you have a few options with the two commands sort and uniq. I timed 8 different variants on a file of 1 million IP addresses from an ad-exchange log file and found an almost 20x difference between using LC_ALL=C and not. For example:
LC_ALL=C sort IP_1m_rows.txt > temp_file
LC_ALL=C uniq temp_file > IP_unique_rows.txt
real 0m1.283s
user 0m1.121s
sys 0m0.088s
Whereas the same commands without LC_ALL=C:
sort IP_1m_rows.txt > temp_file
uniq temp_file > IP_unique_rows.txt
real 0m24.596s
user 0m24.065s
sys 0m0.201s
Piping the commands together while using LC_ALL=C was 2x slower than the fastest option:
LC_ALL=C sort IP_1m_rows.txt | uniq > IP_unique_rows.txt
real 0m3.532s
user 0m3.677s
sys 0m0.106s
Databases are not useful for one-off jobs like this, and flat files will get you surprisingly far, even with more challenging or longer-term objectives.

MySQL export to MongoDB

I am looking to export an existing MySQL database table to seed a MongoDB database.
I would have thought this was a well-trodden path, but it appears not to be, as I am coming up blank looking for a simple mysqldump -> MongoDB JSON converter.
It won't take much effort to code up such a conversion utility.
There is a method that doesn't require any software other than the mysql and mongodb utilities. The disadvantage is that you have to go table by table, but in your case you only need to migrate one table, so it won't be painful.
I followed this tutorial. Relevant parts are:
Get a CSV with your data. You can generate one with the following query in MySQL:
SELECT [fields] INTO outfile 'user.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' FROM [table]
Finally, import the file using mongoimport.
That's all
If you're using Ruby, you can also try: Mongify
It will read your mysql database, build a translation file and allow you to map the information.
It supports:
Updating internal IDs (to BSON ObjectID)
Updating referencing IDs
Type Casting values
Embedding Tables into other documents
Before filters (to change data manually)
and much much more...
Read more about it at: http://mongify.com/getting_started.html
MongoVue is a new project that contains a MySQL import:
MongoVue. I have not used that feature.
If you are a Mac user you can use MongoHub, which has a built-in feature to import (and export) data from MySQL databases.
If you are using Java you can try this:
http://code.google.com/p/sql-to-nosql-importer/
For a powerful conversion utility, check out Tungsten Replicator
I'm still looking into this one, called SQLToNoSQLImporter, which is written in Java.
I've put a little something up on GitHub - it's not even 80% there, but it's growing for work and it might be something others of you could help me out on!
https://github.com/jaredwa/mysqltomongo