Merging big CSV files

I have 4 really big CSV files.
1.- The first is 22 GB, with more than 65,000 rows, with these columns:
'fid','file_since_dt','rpted_member_kob','rpted_member','rpted_rfc','rpted_rfc_last3','paternal','maternal','additionl_surname','first','middle','prefix','suffix','marital_status','resident_status','country_code','natlity','sex','other_tax_num','other_tax_natlity','num_dependents','birth_dt','deceased_dt','drivers_license','profes_license','voter_registr','watch_flag','dont_display','no_promote','merge_freeze','officer_flag',
2.- The second is 57 GB, with more than 65,000 rows, with these columns:
'fid','line1','line2','colonia','municipality','city','state','postal_section','postal_last2','postal_plus5','phone_area_code','phone_number','phone_num','phone_last5','phone_ext','fax_area_code','fax_phone_number','fax_phone_num','fax_phone_last5','special_indic','use_cnt','last_used_dt','residence_dt','rept_member_kob','rept_member','rpted_dt','type','soundex_paternal','soundex_maternal','soundex_addt_surnm','first_initial','patnl_patnl_cnt','patnl_matnl_cnt','matnl_patnl_cnt','matnl_matnl_cnt','country_code',
3.- The trade file, which is the biggest at 112 GB, with these columns:
'fid','serial_num','file_since_dt','bureau_id','member_kob','member_code','member_short_name','member_area_code','member_phone_num','acct_num','account_status','owner_indic','posted_dt','pref_cust_code','acct_type','contract_type','terms_num_paymts','terms_frequency','terms_amt','opened_dt','last_paymt_dt','last_purchased_dt','closed_dt','reporting_dt','reporting_mode','paid_off_dt','collateral','currency_code','high_credit_amt','cur_balance_amt','credit_limit','amt_past_due','paymt_pat_hst','paymt_pat_str_dt','paymt_pat_end_dt','cur_mop_status','remarks_code','restruct_dt','suppress_set_dt','suppress_expir_dt','max_delinqncy_amt','max_delinqncy_dt','max_delinqncy_mop','num_paymts_late','num_months_review','num_paymts_30_day','num_paymts_60_day','num_paymts_90_day','num_paymts_120_day','appraise_value','first_no_payment_dt','saldo_insoluto','last_paymt_amt','crc_indic','plazo_meses','monto_credito_original','last_past_due_dt','interest_amt','cur_interest_mop','days_past_due','email',
4.- The fourth is 22 GB and has the same content as file 3, as it's more like the second partition of file 3.
All of them share the fid key. I have never dealt with anything like this before, where I need to merge all of them to create a single 200 GB file, and I don't have any clue how to handle it. Has anybody experimented with this in the past? If so, would you mind sharing a solution?

Dump everything into a real database (after making sure its data partition has enough room). Then, if you actually need CSV, you can easily export what you need.
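For example, a minimal sketch of that route using SQLite from Python. The file names, column names, and tiny sample rows here are placeholders, not the real 22-112 GB files; a real import would stream the big files through the same `executemany` loop.

```python
import csv
import sqlite3

# Minimal sketch of the "dump it into a real database" route, using SQLite.
# File and column names are placeholders standing in for the real files.

def load_csv(conn, path, table):
    """Stream a CSV into a fresh table and index it on fid."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" * len(header))
        conn.execute(f"CREATE TABLE {table} ({cols})")
        conn.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)
    conn.execute(f"CREATE INDEX idx_{table}_fid ON {table} (fid)")
    conn.commit()

# Tiny stand-in files so the sketch runs end to end.
with open("names.csv", "w", newline="") as f:
    csv.writer(f).writerows([["fid", "first"], ["1", "Ana"], ["2", "Luis"]])
with open("addresses.csv", "w", newline="") as f:
    csv.writer(f).writerows([["fid", "city"], ["1", "Puebla"], ["2", "Leon"]])

conn = sqlite3.connect(":memory:")   # a real run would use an on-disk file
load_csv(conn, "names.csv", "names")
load_csv(conn, "addresses.csv", "addresses")

# Join on fid and stream the result back out as the merged CSV.
with open("merged.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    cur = conn.execute("SELECT * FROM names JOIN addresses USING (fid)")
    writer.writerow(d[0] for d in cur.description)
    writer.writerows(cur)
```

The index on fid is what makes the join tolerable at this scale; without it the database would fall back to scanning.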

Related

Manipulating 100gb Table

I have a particular dataset in TSV format (tab-separated) that is one big text file of around 100 GB (roughly 255 million rows). I have to filter and extract the relevant rows so I can work on them easily. So far I know that Excel can't handle that many rows, and familiar text editors either can't open the file or are very painful for working with tables. I've tried LogParser; a 36-minute query gave me a CSV output, but unfortunately the number of exported rows is far below what I believe is present in the data. I also get some parsing errors, and some columns in the exported sets are shifted. Do you have any other alternatives? Could I perhaps turn the data into an SQL database? Is that possible?

Performing joins on very large data sets

I have received several CSV files that I need to merge into a single file, all with a common key I can use to join them. Unfortunately, each of these files is about 5 GB in size (several million rows, about 20-100+ columns), so it's not feasible to just load them up in memory and execute a join against each one, but I do know that I don't have to worry about column conflicts between them.
I tried making an index mapping each ID to its row in each file, so I could compute the result without using much memory, but of course that's slow as time itself when I actually try to look up each row, pull the rest of the CSV data from the row, concatenate it to the in-progress data, and then write out to a file. That simply isn't feasible, even on an SSD, against the millions of rows in each file.
I also tried simply loading some of the smaller sets into memory and running a Parallel.ForEach against them to match up the necessary data and dump it back out to a temporary merged file. While this was faster than the previous method, I simply don't have the memory to do it with the larger files.
Ideally, I'd like to do a full left join starting from the largest of the files, then left join each subsequently smaller file so it all merges.
How might I otherwise go about approaching this problem? I've got 24 GB of memory and six cores on this system to work with.
While this might just be a problem to load into a relational database and do the join there, I thought I'd reach out before going that route to see if there are any ideas for solving this on my local system.
Thanks!
A relational database is the first thing that comes to mind and probably the easiest, but barring that...
Build a hash table mapping key to file offset. Parse the rows on demand as you're joining. If your keyspace is still too large to fit in available address space, you can put that in a file too. This is exactly what a database index does (though typically with a B-tree).
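A minimal sketch of that offset-index idea, assuming the join key is the first column and keys are unique within a file (the file name and sample rows are made up):

```python
import csv
import io

# One pass records where each key's row starts; afterwards any row can be
# fetched with a seek instead of a scan. Binary mode keeps tell()/seek()
# offsets exact.

def build_offset_index(path):
    index = {}
    with open(path, "rb") as f:
        f.readline()                      # skip the header row
        pos = f.tell()
        line = f.readline()
        while line:
            key = line.split(b",", 1)[0].decode()
            index[key] = pos
            pos = f.tell()
            line = f.readline()
    return index

def fetch_row(path, offset):
    with open(path, "rb") as f:
        f.seek(offset)
        return next(csv.reader(io.StringIO(f.readline().decode())))

# Tiny stand-in file so the sketch runs end to end.
with open("right.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id", "city"], ["a", "x"], ["b", "y"]])

idx = build_offset_index("right.csv")
print(fetch_row("right.csv", idx["b"]))   # ['b', 'y']
```

In a real run you would keep the file handle open and reuse it across lookups instead of reopening per fetch.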
You could also pre-sort the files based on their keys and do a merge join.
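A sketch of such a merge join over two already-sorted inputs, assuming unique keys in the first column (the sample rows are placeholders; real files would be wrapped in `csv.reader`):

```python
# Sort-merge join: once both inputs are sorted by key, a single forward
# pass through each is enough -- no index and almost no memory.

def merge_join(left_rows, right_rows):
    left_rows, right_rows = iter(left_rows), iter(right_rows)
    l = next(left_rows, None)
    r = next(right_rows, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l + r[1:]               # emit the joined row
            l, r = next(left_rows, None), next(right_rows, None)
        elif l[0] < r[0]:
            l = next(left_rows, None)     # left key has no match yet
        else:
            r = next(right_rows, None)    # right key has no match yet

left = [["a", "1"], ["b", "2"], ["d", "4"]]
right = [["a", "x"], ["b", "y"], ["c", "z"]]
print(list(merge_join(left, right)))
# [['a', '1', 'x'], ['b', '2', 'y']]
```

Pre-sorting a file larger than memory would itself need an external sort (split, sort chunks, merge), which `sort -t, -k1,1` on a Unix box already does well.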
The good news is that "several" 5 GB files is not a tremendous amount of data. I know it's relative, but given the way you describe your system... I still think it's not a big deal. If you didn't need to join, you could use Perl or a bunch of other command-line tools.
Are the column names known in each file? Do you care about the column names?
My first thoughts:
Spin up Amazon Web Services (AWS) Elastic MapReduce (EMR) instance (even a pretty small one will work)
Upload these files
Import the files into Hive (as managed or not).
Perform your joins in Hive.
You can spin up an instance in a matter of minutes and be done with the work within an hour or so, depending on your comfort level with the material.
I don't work for Amazon, and can't even use their stuff during my day job, but I use it quite a bit for grad school. It works like a champ when you need your own big data cluster. Again, this isn't "Big Data (R)", but Hive will kill this for you in no time.
This article doesn't do exactly what you need (it copies data from S3); however, it will help you understand table creation, etc.
http://aws.amazon.com/articles/5249664154115844
Edit:
Here's a link to the EMR overview:
https://aws.amazon.com/elasticmapreduce/
I'm not sure if you are manipulating the data, but if you're just combining the CSVs you could try this...
http://www.solveyourtech.com/merge-csv-files/
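If the files really do share identical columns and only need stacking rather than joining, a streaming concatenation is trivial to script. A sketch (the glob pattern, output name, and stand-in rows are placeholders):

```python
import csv
import glob

# Stack CSVs that share the same columns: write one header, then every
# data row from each file in turn, never holding more than a row in memory.

# Tiny stand-in files so the sketch runs end to end.
for name, rows in [("part1.csv", [["id", "v"], ["1", "a"]]),
                   ("part2.csv", [["id", "v"], ["2", "b"]])]:
    with open(name, "w", newline="") as f:
        csv.writer(f).writerows(rows)

with open("combined.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    header_written = False
    for path in sorted(glob.glob("part*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            if not header_written:        # keep only the first header
                writer.writerow(header)
                header_written = True
            writer.writerows(reader)
```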

One table access much bigger than original files

I have 12 Excel files which are around 400 MB in total, but when importing them into Access I had already reached the 2 GB limit by file 9... is that normal?
I have tried to clean and compact, but it only reduced to 1.8 GB.
I do not have any queries or reports, just one big plain table (2.2 MM records x 30 columns so far).
If that's just how it is, do you think the options below would work as a solution?
1) Would linking to the Excel files instead of importing them reduce the file size considerably?
2) I could normalize the table a bit, reducing some records by moving the very repetitive fields into their own tables.
Thanks for any ideas...
(I could test the above by trial and error, but it would take me more than a couple of hours.)
Linking - if possible - is much preferable.
xlsx files are zipped. Rename one to .zip, open it, and look at the size of the main file in the xl\worksheets folder.
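Since an xlsx is an ordinary zip archive, the same inspection can be scripted without renaming anything. A sketch (the tiny stand-in workbook is created only so the snippet runs end to end; a real file would skip that step):

```python
import zipfile

# Build a tiny stand-in "workbook" so the sketch is self-contained.
with zipfile.ZipFile("workbook.xlsx", "w") as z:
    z.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# List the uncompressed size of each sheet's XML inside the archive.
with zipfile.ZipFile("workbook.xlsx") as z:
    sheet_sizes = {i.filename: i.file_size
                   for i in z.infolist()
                   if i.filename.startswith("xl/worksheets/")}
print(sheet_sizes)
```

The uncompressed sheet sizes give a rough idea of how much raw data Access actually has to swallow, which can dwarf the zipped .xlsx size on disk.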

Random records extracted from a large CSV file

I have 50 CSV files, with up to 2 million records in each.
Every day I need to get 10,000 random records from each of the 50 files and make a new CSV file with all the info (10,000 x 50).
I can't do it manually because it would take a lot of time. I've also tried Access, but because the database is larger than 2 GB, I cannot use it.
I've also tried CSVed, a good tool, but it still did not help me.
Could someone please suggest an idea or a tool to get random records from the files and make a new CSV file?
There are many languages you could use; I would use C# and do this:
1) Get the number of lines in a file.
Lines in text file
2) Generate the 10,000 random numbers (unique if you need that) based on the maximum being the count from step 1.
Random without duplicates
3) Pull the records from step 2 from the file and write to new file.
4) Repeat for each file.
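Those steps might look like this in Python (rather than C#), scaled down so it runs as written; `SAMPLE_SIZE` would be 10,000 and `big.csv` stands in for each real file:

```python
import csv
import random

SAMPLE_SIZE = 3  # would be 10_000 for the real files

# Tiny stand-in file so the sketch runs end to end.
with open("big.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id"]] + [[str(i)] for i in range(100)])

# Step 1: count the data rows (excluding the header).
with open("big.csv", newline="") as f:
    total = sum(1 for _ in f) - 1

# Step 2: draw unique random row numbers up to that count.
picks = set(random.sample(range(total), SAMPLE_SIZE))

# Step 3: single pass through the file, keeping only the chosen rows.
with open("big.csv", newline="") as src, \
     open("sample.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))          # keep the header
    for i, row in enumerate(reader):
        if i in picks:
            writer.writerow(row)

# Step 4 would simply repeat this loop for each of the 50 files,
# appending the sampled rows (minus repeated headers) to one output.
```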
Other options if you want to consider a database other than Access are MySQL or SQL Server Express to name a couple.

MySQL Whats better for speed one table with millions of rows or managing multiple tables?

I'm reworking an existing PHP/MySQL/JS/Ajax web app that processes a LARGE number of table rows for users. Here's how the page currently works.
A user uploads a LARGE CSV file. The test one I'm working with has 400,000 rows (each row has 5 columns).
PHP creates a brand new table for this data and inserts the hundreds of thousands of rows.
The page then sorts / processes / displays this data back to the user in a useful way. Processing includes searching, sorting by date and other columns, and redisplaying them without a huge load time (that's where the JS/Ajax comes in).
My question is: should this app be placing the data into a new table for each upload, or into one large table with an id for each file? I think the original developer was adding separate tables for speed purposes. Speed is very important here.
Is there a faster way? Is there a better mousetrap? Has anyone ever dealt with this?
Remember, every .csv can contain hundreds of thousands of rows, and hundreds of .csv files can be uploaded daily. They can be deleted about 24 hours after they were last used (I'm thinking a cron job; any opinions?)
Thank you all!
A few notes based on comments:
All data is unique to each user and changes, so the user won't be re-accessing the data after a couple of hours. Only if they accidentally close the window and then come right back would they really revisit the same .csv.
No foreign keys required; all CSVs are private to each user and don't need to be cross-referenced.
I would shy away from putting all the data into a single table for the simple reason that you cannot change the data structure.
Since the data is being deleted anyway and you don't have a requirement to combine data from different loads, there isn't an obvious reason for putting the data into a single table. The other argument is that the application now works. Do you really want to discover some requirement down the road that implies separate tables after you've done the work?
If you do decide on a single table, then use table partitioning. Since each user is using their own data, you can use partitions to separate each user's load into a separate partition. Although there are limits on partitions (such as no foreign keys), this will make accessing the data in a single table as fast as accessing the original data.
Given 10^5 rows and 10^2 CSVs per day, you're looking at 10 million rows per day (and you say you'll clear that data down regularly). That doesn't look like a scary figure for a decent DB (especially given that you can index within tables, and not across multiple tables).
Obviously the most regularly used CSVs could very easily be held in memory for speed of access, perhaps even all of them (a very simple calculation based on next to no data gives me a figure of 1 GB if you flush everything more than 24 hours old; 1 GB is not an unreasonable amount of memory these days).
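A back-of-envelope version of that calculation (the ~100 bytes per 5-column row is my own assumption, not a figure from the thread):

```python
# 10^5 rows per file x 10^2 files per day, at an assumed ~100 bytes/row.
rows_per_day = 10**5 * 10**2
bytes_per_row = 100          # rough assumption for a 5-column row
total_gb = rows_per_day * bytes_per_row / 10**9
print(total_gb)              # 1.0
```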