I have 4 really big csv files.
1.- is 22 GB with more than 65,000 rows and these columns:
'fid','file_since_dt','rpted_member_kob','rpted_member','rpted_rfc','rpted_rfc_last3','paternal','maternal','additionl_surname','first','middle','prefix','suffix','marital_status','resident_status','country_code','natlity','sex','other_tax_num','other_tax_natlity','num_dependents','birth_dt','deceased_dt','drivers_license','profes_license','voter_registr','watch_flag','dont_display','no_promote','merge_freeze','officer_flag',
2.- is 57 GB with more than 65,000 rows and these columns:
'fid','line1','line2','colonia','municipality','city','state','postal_section','postal_last2','postal_plus5','phone_area_code','phone_number','phone_num','phone_last5','phone_ext','fax_area_code','fax_phone_number','fax_phone_num','fax_phone_last5','special_indic','use_cnt','last_used_dt','residence_dt','rept_member_kob','rept_member','rpted_dt','type','soundex_paternal','soundex_maternal','soundex_addt_surnm','first_initial','patnl_patnl_cnt','patnl_matnl_cnt','matnl_patnl_cnt','matnl_matnl_cnt','country_code',
3.- is the trade file, which is the biggest at 112 GB, with these columns:
'fid','serial_num','file_since_dt','bureau_id','member_kob','member_code','member_short_name','member_area_code','member_phone_num','acct_num','account_status','owner_indic','posted_dt','pref_cust_code','acct_type','contract_type','terms_num_paymts','terms_frequency','terms_amt','opened_dt','last_paymt_dt','last_purchased_dt','closed_dt','reporting_dt','reporting_mode','paid_off_dt','collateral','currency_code','high_credit_amt','cur_balance_amt','credit_limit','amt_past_due','paymt_pat_hst','paymt_pat_str_dt','paymt_pat_end_dt','cur_mop_status','remarks_code','restruct_dt','suppress_set_dt','suppress_expir_dt','max_delinqncy_amt','max_delinqncy_dt','max_delinqncy_mop','num_paymts_late','num_months_review','num_paymts_30_day','num_paymts_60_day','num_paymts_90_day','num_paymts_120_day','appraise_value','first_no_payment_dt','saldo_insoluto','last_paymt_amt','crc_indic','plazo_meses','monto_credito_original','last_past_due_dt','interest_amt','cur_interest_mop','days_past_due','email',
4.- is 22 GB and has the same columns as file 3; it is essentially the second partition of file 3.
All of them share the fid key. I have never had to do anything like this before: I need to merge all of them into a single 200 GB file, and I have no clue how to handle it. Has anybody dealt with this in the past? If so, would you mind sharing a solution?
Dump everything into a real database (after making sure its data partition has enough room). Then, if you actually need CSV, you can easily export what you need.
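A minimal sketch of that approach using the sqlite3 command-line shell, assuming the four files correspond to person, address, and trade data keyed on fid (the database, file, and table names are placeholders, and a reasonably recent sqlite3 is assumed):

-- Run inside the sqlite3 shell, e.g.:  sqlite3 merge.db
-- (.import creates each table from the file's header row)
.mode csv
.import persons.csv persons
.import addresses.csv addresses
.import trades_part1.csv trades
-- File 4 appends to the same trades table; skip its header row (needs sqlite3 3.32+).
.import --skip 1 trades_part2.csv trades

-- Index the join key so the merge is not a repeated full scan.
create index idx_persons_fid   on persons(fid);
create index idx_addresses_fid on addresses(fid);
create index idx_trades_fid    on trades(fid);

-- Export one merged CSV keyed on fid (adjust the joins to how the rows should
-- line up, since addresses and trades can have several rows per fid).
.headers on
.output merged.csv
select *
from persons
join addresses using (fid)
join trades    using (fid);
.output stdout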
I have a dataset in TSV (tab-separated) format: one big txt file of around 100 GB (somewhere around 255 million rows). I have to filter and extract the relevant rows so I can work with them easily. So far I know that Excel can't handle that many rows, and the text editors I'm familiar with either can't open the file or make working with it as a table very painful. I've tried LogParser; a 36-minute query gave me a CSV output, but unfortunately the number of exported rows is well below what I believe is present in the data. I also get some parsing errors, and some columns in the exported sets are shifted. Do you have any other alternatives? Could I somehow turn the data into an SQL database? Is that possible?
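Along the lines of that last idea, a minimal sketch of loading the TSV into SQLite and filtering there (database, table, and column names and the filter condition are placeholders; this assumes the file's first row holds column names, otherwise create the table first):

-- Run inside the sqlite3 shell, e.g.:  sqlite3 bigdata.db
.mode tabs
.import bigfile.txt raw_data

-- Filter the relevant rows and export them as CSV.
.mode csv
.headers on
.output filtered.csv
select * from raw_data where some_column = 'some_value';
.output stdout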
I have 50 CSV files, with up to 2 million records in each.
Every day I need to get 10,000 random records from each of the 50 files and build a new CSV file with all that data (10,000 × 50 rows).
I cannot do it manually because it would take a lot of time. I've also tried to use Access, but because the database is larger than 2 GB, I cannot use it.
I've also tried CSVed, which is a good tool, but it still did not help me.
Could someone please suggest an idea or a tool for getting random records from the files and building a new CSV file?
There are many languages you could use; I would use C# and do this:
1) Get the number of lines in a file.
Lines in text file
2) Generate 10,000 random numbers (unique if you need that), with the maximum being the line count from step 1.
Random without duplicates
3) Pull the records from step 2 from the file and write to new file.
4) Repeat for each file.
Other options if you want to consider a database other than Access are MySQL or SQL Server Express to name a couple.
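If you do end up putting the files into one of those databases, the random-sample step itself is short. A sketch for SQL Server (csv_file_01 is a placeholder for one imported file's table; repeat per file or loop in a script):

select top (10000) *
from csv_file_01
order by newid();   -- newid() gives every row a random sort key

-- MySQL equivalent:
-- select * from csv_file_01 order by rand() limit 10000;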
I have been running a mammoth SQL query that looks like this:
select SessionID, PID, RespondentID
from BIG_Sessions (nolock)
where RespondentID in (
'1407718',
'1498288',
/* ETC ETC */
)
I have heard that Excel has a maximum of 1 million rows, and I'm not sure how to approach this.
Table BIG_Sessions is huge, and the query pulls multiple SessionIDs for a given RespondentID, but I only want one per respondent.
I don't know how to winnow this down. Any tips appreciated.
It depends on what version of Excel you are using: 2010 apparently supports over one million rows, while 2003 only supports a little over 65,000.
Personally I would export it to a CSV file. Just right click on your result set and select "Save Results As...". No limit there.
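On the winnowing part of the question: if any one SessionID per RespondentID will do, one option is ROW_NUMBER() over the same filter as the original query (SQL Server syntax; pick a different ORDER BY if a specific session is wanted):

select SessionID, PID, RespondentID
from (
    select SessionID, PID, RespondentID,
           row_number() over (partition by RespondentID
                              order by SessionID) as rn
    from BIG_Sessions (nolock)
    where RespondentID in (
        '1407718',
        '1498288'
        /* ETC ETC */
    )
) t
where rn = 1;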
An answer purely related to Excel:
I have tried to put data in excess of 50,000 rows into Excel before. When I try, one of two things happens.
1) It actually works, but Excel is extremely slow, unresponsive, and often crashes. The data is basically unusable.
2) I fill up my RAM and Excel crashes, sometimes taking other programs with it.
If you are trying to copy 1,000,000 rows... I seriously doubt Excel could handle it!
Databases were created for handling exactly this situation: organizing large amounts of data. See if you can't do what you are trying to accomplish with Excel from within your database.
I want to build an LP whose parameters come from between 5 and 10 .csv files, each with 25,000,000 to 50,000,000 rows (approx. 500 MB to 1 GB each).
My model is currently coded in AMPL and reads the parameter values directly from the .csv files. The Windows XP machine with 1 GB of RAM I am using runs out of memory trying to build a model based on the data from just one 500 MB .csv file.
My question:
Is there a way to manage my data so that I can build the LP using less memory?
I appreciate all feedback from anyone with experience building massive LPs.
It is hard to see how you will ever be able to load and solve such large problems, where a single .csv file alone is 500 MB or more, if you only have 1 GB of RAM on your computer.
If adding considerably more RAM is not an option, you will need to analyze your LP problem to see if it can be separated into smaller independent parts. For example, if you have a problem with 10,000 variables and 10,000,000 rows, perhaps it is possible to break the main problem up into, say, 100 independent sub-problems with 100 variables and 100,000 rows each?
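As a rough sketch of why such a split helps: if the constraint matrix is block diagonal, i.e. the LP has the form

$$\min_{x_1,\dots,x_k} \sum_{i=1}^{k} c_i^\top x_i \quad \text{subject to} \quad A_i x_i \le b_i,\; x_i \ge 0, \quad i = 1,\dots,k,$$

with no constraints coupling the blocks, then it is equivalent to solving the $k$ sub-problems $\min\{c_i^\top x_i : A_i x_i \le b_i,\ x_i \ge 0\}$ independently, and each solve only needs its own slice of the .csv data in memory. If there are coupling constraints, decomposition methods such as Dantzig-Wolfe exist for that case.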
Here is a link to an albeit dated book chapter that discusses separation of a large LP problem into manageable sub-problems.