Manipulating 100gb Table - csv

I have a dataset in TSV format (tab-separated) that is one big txt file of around 100 GB (somewhere around 255 million rows). I have to filter out and extract the relevant rows so I can work on them more easily. So far I know that Excel can't handle that many rows, and the text editors I'm familiar with either can't open the file or are very painful to use for tabular work. I've tried LogParser; a 36-minute query gave me a CSV output, but unfortunately the number of exported rows is way below what I believe is present in the data. I also get some parsing errors, and some columns in the exported sets are shifted. Do you have any other alternatives? Could I somehow turn the data into an SQL database? Is that possible?
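For reference, one route that avoids Excel and text editors entirely is to stream the file into SQLite and then filter with ordinary SQL. Below is a minimal sketch, assuming a tab-separated file called data.tsv with a header row; the file name, the filter column, and the filter value are placeholders to replace with your own:

import csv
import sqlite3

conn = sqlite3.connect("bigdata.db")
with open("data.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    conn.execute("CREATE TABLE raw (%s)" % ", ".join('"%s" TEXT' % c for c in header))
    insert_sql = "INSERT INTO raw VALUES (%s)" % ", ".join("?" for _ in header)
    batch = []
    for row in reader:
        if len(row) == len(header):        # skip malformed lines instead of letting columns shift
            batch.append(row)
        if len(batch) >= 100000:           # insert in batches to keep memory use flat
            conn.executemany(insert_sql, batch)
            batch = []
    if batch:
        conn.executemany(insert_sql, batch)
conn.commit()

# then filter with plain SQL, e.g.:
for row in conn.execute('SELECT * FROM raw WHERE "some_column" = ?', ("some_value",)):
    pass  # write the matching rows wherever you need them

The one-time load is slow, but afterwards every filter is a quick indexed or sequential query instead of another full pass through the 100 GB text file.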

Related

Merging big csv files

I have four really big CSV files.
1. The first is 22 GB with more than 65,000 rows and these columns:
'fid','file_since_dt','rpted_member_kob','rpted_member','rpted_rfc','rpted_rfc_last3','paternal','maternal','additionl_surname','first','middle','prefix','suffix','marital_status','resident_status','country_code','natlity','sex','other_tax_num','other_tax_natlity','num_dependents','birth_dt','deceased_dt','drivers_license','profes_license','voter_registr','watch_flag','dont_display','no_promote','merge_freeze','officer_flag',
2. The second is 57 GB with more than 65,000 rows and these columns:
'fid','line1','line2','colonia','municipality','city','state','postal_section','postal_last2','postal_plus5','phone_area_code','phone_number','phone_num','phone_last5','phone_ext','fax_area_code','fax_phone_number','fax_phone_num','fax_phone_last5','special_indic','use_cnt','last_used_dt','residence_dt','rept_member_kob','rept_member','rpted_dt','type','soundex_paternal','soundex_maternal','soundex_addt_surnm','first_initial','patnl_patnl_cnt','patnl_matnl_cnt','matnl_patnl_cnt','matnl_matnl_cnt','country_code',
3. The third is the trade file, which is the biggest at 112 GB, with these columns:
'fid','serial_num','file_since_dt','bureau_id','member_kob','member_code','member_short_name','member_area_code','member_phone_num','acct_num','account_status','owner_indic','posted_dt','pref_cust_code','acct_type','contract_type','terms_num_paymts','terms_frequency','terms_amt','opened_dt','last_paymt_dt','last_purchased_dt','closed_dt','reporting_dt','reporting_mode','paid_off_dt','collateral','currency_code','high_credit_amt','cur_balance_amt','credit_limit','amt_past_due','paymt_pat_hst','paymt_pat_str_dt','paymt_pat_end_dt','cur_mop_status','remarks_code','restruct_dt','suppress_set_dt','suppress_expir_dt','max_delinqncy_amt','max_delinqncy_dt','max_delinqncy_mop','num_paymts_late','num_months_review','num_paymts_30_day','num_paymts_60_day','num_paymts_90_day','num_paymts_120_day','appraise_value','first_no_payment_dt','saldo_insoluto','last_paymt_amt','crc_indic','plazo_meses','monto_credito_original','last_past_due_dt','interest_amt','cur_interest_mop','days_past_due','email',
4. The fourth is 22 GB and has the same columns as file 3; it is essentially the second partition of file 3.
All of them share the key fid. I have never had to do anything like this before, where I need to merge all of them into a single 200 GB file, and I don't have any clue how to handle it. Has anybody experimented with this in the past? If so, would you mind sharing a solution?
Dump everything into a real database (after making sure its data partition has enough room). Then, if you actually need CSV, you can easily export what you need.
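As a rough illustration of that approach (not a drop-in solution), here is a sketch using SQLite from Python. It assumes the four files have already been loaded into tables named person, address, and trade (hypothetical names, one per dataset, with files 3 and 4 loaded into the same trade table); it indexes fid and streams the joined result out to CSV so the 200 GB merge never has to fit in memory:

import csv
import sqlite3

conn = sqlite3.connect("merged.db")
# hypothetical table names -- one per input dataset, loaded beforehand
# (e.g. with the sqlite3 command-line tool's .import command)
for t in ("person", "address", "trade"):
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{t}_fid ON {t}(fid)")
conn.commit()

with open("merged.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    cur = conn.execute("""
        SELECT p.*, a.*, t.*
        FROM person p
        JOIN address a ON a.fid = p.fid
        JOIN trade   t ON t.fid = p.fid
    """)
    writer.writerow([d[0] for d in cur.description])   # header built from the joined columns
    for row in cur:                                     # rows stream out one at a time
        writer.writerow(row)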

Sample-Gene-Expression data storage problem in mysql

I have 50 GB of sample-gene expression data that I want to store in MySQL. The data is divided into three txt files: one for samples, a second for genes, and a third for the sample-gene matrix that stores their expression values.
I tried a design with three tables: one for samples, a second for genes, and a third with two foreign keys (sample id, gene id) and a field exp_value. But the problem is how I can store that matrix in this table.
Please read
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
You have data in text files; hopefully it is already formatted with separators. If it is, importing is easy.
If you are using Linux, use a terminal such as Konsole; if you are using Windows, use CMD. The import will take a while for files of that size, so you just have to wait. Expect a lot of trial and error at first.
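One detail LOAD DATA does not solve by itself: the wide sample-by-gene matrix has to be reshaped into one row per (sample, gene, value) before it fits your third table. A minimal sketch of that reshaping step in Python, assuming the matrix file is tab-separated with gene IDs in the header row and a sample ID in the first column of each data row (file names are placeholders):

import csv

# melt matrix.txt (samples x genes) into long.tsv with columns: sample_id, gene_id, exp_value
with open("matrix.txt", newline="") as fin, open("long.tsv", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    gene_ids = next(reader)[1:]            # header row: first cell is the sample-ID label
    writer.writerow(["sample_id", "gene_id", "exp_value"])
    for row in reader:
        sample_id, values = row[0], row[1:]
        for gene_id, value in zip(gene_ids, values):
            writer.writerow([sample_id, gene_id, value])

# long.tsv can then be bulk-loaded into the (sample_id, gene_id, exp_value) table
# with LOAD DATA LOCAL INFILE.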

Splitting Large CSV files by Column

I have a very large (4 GB) CSV file that cannot be opened in Excel or other editors. The number of lines (rows) is nearly 3,000 and the number of columns is nearly 320,000.
One solution is to split the original file into smaller ones and be able to open these small ones in Excel or other editors.
The second solution is to take the transpose of the original data and then open it in Excel.
I could not find a tool or script for the transpose. I've found some scripts and free software for splitting, but each of them splits the CSV by rows.
Is there a way to split the original file into smaller ones that each contain at most 15,000 columns?
I tried to use:
import pandas as pd
pd.read_csv(file_path).T.to_csv(new_file_path, header=False)  # file_path / new_file_path are placeholders
But it takes ages to complete.
In the meantime I tried some Python scripts, but all of them failed because of memory issues.
The trial version of Delimit (http://www.delimitware.com/) handled the data perfectly.
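If you would rather avoid an extra tool, here is a rough single-pass sketch in plain Python that splits the file into pieces of at most 15,000 columns each without ever holding more than one row in memory (file names are placeholders):

import csv

CHUNK = 15000   # maximum number of columns per output file
with open("wide.csv", newline="") as fin:
    reader = csv.reader(fin)
    writers, files = None, []
    for row in reader:
        if writers is None:
            # decide how many output files are needed from the first row's width
            n_parts = (len(row) + CHUNK - 1) // CHUNK
            files = [open("part_%d.csv" % i, "w", newline="") for i in range(n_parts)]
            writers = [csv.writer(f) for f in files]
        for i, w in enumerate(writers):
            w.writerow(row[i * CHUNK:(i + 1) * CHUNK])
    for f in files:
        f.close()

With 320,000 columns this keeps roughly 22 output files open at once, which is well within normal OS limits.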

ETL: how to guess data types for messy CSVs with lots of nulls

I often have to cleanse and import messy CSV and Excel files into my MS SQL Server 2014 (but the question would be the same if I were using Oracle or another database).
I have found a way to do this with Alteryx. Can you help me understand if I can do the same with Pentaho Kettle or SSIS? Alternatively, can you recommend another ETL software which addresses my points below?
I often have tables of, say, 100,000 records where the first 90,000 records may be null. Most ETL tools scan only the first few hundred records to guess data types and therefore fail to guess the types of these fields. Can I force Pentaho or SSIS to scan the WHOLE file before guessing types? I understand this may not be efficient for huge files of many GBs, but for the files I handle, scanning the entire file is much better than wasting a lot of time trying to guess each field manually.
As above, but with the length of a string. If the first 10,000 records are, say, a 3-character string but the subsequent ones are longer, SSIS and Pentaho tend to guess nvarchar(3) and the import will fail. Can I force them to scan all rows before guessing the length of the strings? Or, alternatively, can I easily force all strings to be nvarchar(x), where I set x myself?
Alteryx has a multi-field tool, which is particularly convenient when cleansing or converting multiple fields. E.g. I have 10 date columns whose data type was not guessed automatically; I can use the multi-field formula to get Alteryx to convert all 10 fields to date and create new fields called $oldfield_reformatted. Do Pentaho and SSIS have anything similar?
Thank you!
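For what it's worth, if the tool won't scan everything, you can pre-scan the whole file yourself and feed the results into your column mappings. A minimal sketch that reads every row once and reports, per column, the maximum string length and whether every non-empty value parses as a number (date detection and other edge cases are deliberately left out):

import csv

def scan(path, delimiter=","):
    """One full pass over the file: max length and 'all numeric?' per column."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        max_len = [0] * len(header)
        numeric = [True] * len(header)
        for row in reader:
            for i, value in enumerate(row):
                if i >= len(header) or value == "":
                    continue              # nulls (and stray overflow fields) should not decide the type
                max_len[i] = max(max_len[i], len(value))
                if numeric[i]:
                    try:
                        float(value)
                    except ValueError:
                        numeric[i] = False
        return {h: ("numeric" if numeric[i] else "nvarchar(%d)" % max(max_len[i], 1))
                for i, h in enumerate(header)}

print(scan("messy.csv"))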
A silly suggestion: in Excel, add a row at the top of the list containing a formula that creates a text string the same length as the longest value in the column.
This formula, entered as an array formula, would do it:
=REPT("X",MAX(LEN(A:A)))
You could also use a more advanced VBA function to create other dummy values to force datatypes in SSIS.
I've not used SSIS or anything like it, but in the past I would load the file into a table where ALL the columns are, say, varchar(1000) so that all the data loads, then process it across into the main table using SQL that casts or removes values as required.
This gives YOU ultimate control, not a package or driver. I was very surprised to hear how this works!
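To make that staging idea concrete, here is a small sketch of the same pattern using SQLite from Python (the answer is about SQL Server, so treat the database, table, and column names as placeholders): load everything as plain text first, then cast into a typed table on your own terms.

import csv
import sqlite3

conn = sqlite3.connect("staging.db")
# 1) staging table: every column is plain text, so nothing fails to load
conn.execute("CREATE TABLE staging (id TEXT, amount TEXT, created TEXT)")
with open("messy.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                   # skip the header row
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", reader)

# 2) typed table, populated with explicit casts that you control
conn.execute("CREATE TABLE clean (id INTEGER, amount REAL, created TEXT)")
conn.execute("""
    INSERT INTO clean
    SELECT CAST(id AS INTEGER), CAST(amount AS REAL), created
    FROM staging
    WHERE id <> ''
""")
conn.commit()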

Random records extracted from a large CSV file

I have 50 CSV files, with up to 2 million records in each.
Every day I need to get 10,000 random records from each of the 50 files and make a new CSV file with all the info (10,000 × 50).
I cannot do it manually because it would take a lot of time. I've also tried Access, but because the database would be larger than 2 GB, I cannot use it.
I've also tried CSVed - a good piece of software, but it still did not help me.
Could someone please suggest an idea or a tool to get random records from these files and build a new CSV file?
There are many languages you could use; I would use C# and do this:
1) Get the number of lines in a file.
Lines in text file
2) Generate 10,000 random numbers (unique if you need that), with the maximum being the count from step 1.
Random without duplicates
3) Pull the records selected in step 2 from the file and write them to a new file.
4) Repeat for each file.
Other options if you want to consider a database other than Access are MySQL or SQL Server Express to name a couple.
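The answer above is framed for C#; for comparison, here is a rough sketch of the same four steps in Python, with placeholder file names and no handling of headers beyond skipping them:

import csv
import random

SAMPLE_SIZE = 10000
input_files = ["file_%02d.csv" % i for i in range(1, 51)]   # placeholder names for the 50 files

with open("daily_sample.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in input_files:
        # step 1: count the data lines (excluding the header)
        with open(path, newline="") as f:
            n_rows = sum(1 for _ in f) - 1
        # step 2: 10,000 unique random row numbers in [0, n_rows)
        wanted = set(random.sample(range(n_rows), min(SAMPLE_SIZE, max(n_rows, 0))))
        # step 3: pull those rows and append them to the combined output
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)                    # skip header
            for i, row in enumerate(reader):
                if i in wanted:
                    writer.writerow(row)
        # step 4: the loop repeats this for each of the 50 files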