CSV parser - evaluate header for each file

I have multiple CSV files in a directory. They may have different column combinations, but I would like to COPY them all with a single command, since there are a lot of them and they all go into the same table. However, FDelimitedParser only evaluates the header row of the first file, then rejects all rows that do not fit it, i.e. all rows from most of the other files. I've been using FDelimitedParser, but anything else is fine.
1 - Is this expected behavior, and if so, why?
2 - I want it to evaluate the headers for each file; is there a way?
Thanks
(Vertica 7.2)

Looks like you need a flex table for that; see http://vertica-howto.info/2014/07/how-to-load-csv-files-into-flex-tables/
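As a rough sketch of that approach (table and path names here are illustrative; see the linked article for the details), a flex table stores each row as a key/value map, so files with different column combinations can be loaded together and each file's own header is interpreted on load:

```sql
-- Hedged sketch, assuming Vertica 7.2 flex-table support;
-- 'all_csv' and the path are placeholder names.
CREATE FLEX TABLE all_csv();

-- fcsvparser reads the header row of each file it parses
COPY all_csv FROM '/path/to/dir/*.csv' PARSER fcsvparser();

-- optionally materialize the discovered keys as a queryable view
SELECT compute_flextable_keys_and_build_view('all_csv');
```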

Here's a small workaround that I use when I need to load a bunch of files in at once. This assumes all your files have the same column order.
Download and run Cygwin
Navigate to folder with csv files
cd your_folder_name_with_csv_files
Combine all csv files into a new file
cat *.csv >> new_file_name.csv
Run a COPY statement in Vertica from the new file. If the file headers are an issue, you can follow the instructions on this link and run them through Cygwin to remove the first line from every file.
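The header removal can also be scripted directly; here's a minimal sketch (file names are illustrative, and it assumes every file shares the same column order):

```shell
# Work in a scratch directory with two tiny sample CSVs
cd "$(mktemp -d)"
printf 'id,name\n1,alice\n' > part1.csv
printf 'id,name\n2,bob\n'   > part2.csv

# Keep the header from the first file only; skip line 1 of the rest.
# The output deliberately does not end in .csv so the glob never re-reads it.
out=combined.txt
first=1
for f in *.csv; do
    if [ "$first" -eq 1 ]; then
        cat "$f" > "$out"
        first=0
    else
        tail -n +2 "$f" >> "$out"
    fi
done
cat "$out"
```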

Related

Delete rows of a CSV file based off a column value on command line

I have a large file that I cannot open on my computer. I am trying to delete rows of information that are unneeded.
My file looks like this:
NODE,107983_gene,382,666,-,cd10161,8,49,9.0E-100,49.4,0.52,domain
NODE,107985_gene,24,659,-,PF09699.9,108,148,6.3E-500,22.5,0.8571428571428571,domain
NODE,33693_gene,213,1433,-,PF01966.21,92,230,9.0E-10,38.7,0.9344262295081968,domain
NODE,33693_gene,213,1433,-,PRK04926,39,133,1.0E-8,54.5,0.19,domain
NODE,33693_gene,213,1433,-,cd00077,88,238,4.0E-6,44.3,0.86,domain
NODE,33693_gene,213,1433,-,smart00471,88,139,9.0E-7,41.9,0.42,domain
NODE,33694_gene,1430,1912,-,cd16326,67,135,4.0E-50,39.5,0.38,domain
I am trying to remove all lines that have an e-value greater than 1.0E-10. This information is located in column 9. I have tried on the command line:
awk '$9 >=1E-10' file name > outputfile
This has given me a smaller file, but the e-values are all over the place and nothing above 1E-10 is actually being removed. I want small e-values only.
Does anyone have any suggestions?
Almost there; you need to specify the field delimiter:
$ awk -F, '$9<1E-10' file > small.values
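Checking against a few of the sample rows above (written to a scratch file here), only the 9.0E-100 line survives the filter:

```shell
cd "$(mktemp -d)"
cat > file <<'EOF'
NODE,107983_gene,382,666,-,cd10161,8,49,9.0E-100,49.4,0.52,domain
NODE,33693_gene,213,1433,-,PF01966.21,92,230,9.0E-10,38.7,0.9344262295081968,domain
NODE,33693_gene,213,1433,-,cd00077,88,238,4.0E-6,44.3,0.86,domain
EOF

# -F, makes awk split on commas, so $9 is the e-value column;
# without it the whole line is a single field and the comparison misfires
awk -F, '$9 < 1E-10' file > small.values
cat small.values
```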

Running specific table only in large sql files

I have a very large SQL file (14 GB). Currently I am not able to open this file in my browser or VS Code because it is too huge; it keeps crashing and would take too long. However, there is a single table that I want from this huge SQL file.
Is there a way of splitting the SQL file to get the specific table that I am searching for? Any helpful answer, please?
You can do:
Step 1: grep -rni "${YourTableName}" path/to/your/file
In the output you'll see the strings matching ${YourTableName} together with their line numbers.
Step 2: tail -n +25 path/to/your/file > path/to/your/fileChunk (where 25 must be replaced with the line number from the grep command; tail -n +N prints from line N to the end of the file). Now the file path/to/your/fileChunk will have the stuff related to your table at the top.
Step 3 (optional): The file path/to/your/fileChunk has stuff related to your table at the top, but in the middle and at the bottom of the file you may have stuff related to other tables, so please repeat steps 1 & 2 on path/to/your/fileChunk and delete the needless info.
PS: This is only an idea of how to split your huge file into chunks; you have to adapt these commands to your values.
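On a toy dump the two steps look like this (the table and file names are just stand-ins):

```shell
cd "$(mktemp -d)"
cat > dump.sql <<'EOF'
CREATE TABLE users (id INT);
INSERT INTO users VALUES (1);
CREATE TABLE orders (id INT);
INSERT INTO orders VALUES (7);
CREATE TABLE logs (id INT);
EOF

# Step 1: find the line number where the table's DDL starts
start=$(grep -n 'CREATE TABLE orders' dump.sql | cut -d: -f1)

# Step 2: keep everything from that line to the end of the file
tail -n +"$start" dump.sql > fileChunk
head -n 1 fileChunk
```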

Windows batch command to remove last line from a csv file created from sqlplus spooling

I am creating a CSV file from an Oracle DB using SQL*Plus spooling. The last line of the CSV file contains a spool summary of how many rows were selected; e.g. for a CSV with 1641 rows in it (including the header) the last line says
1641 rows selected.
I want to remove this line from the CSV. Not sure if this can be achieved with a SQL*Plus parameter or by a Windows batch script.
I'd appreciate any input to help me remove this last line (or not create it at all) from the CSV file.
I believe that in SQL*Plus you need to set feedback off:
SET FEEDBACK OFF
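If changing the spool settings isn't possible, the trailing line can also be stripped after the fact, e.g. with sed through Cygwin or Git Bash (file names here are illustrative):

```shell
cd "$(mktemp -d)"
# Stand-in for a spooled CSV ending in the feedback line
printf 'col1,col2\n1,a\n1641 rows selected.\n' > spool.csv

# sed's $ address matches the last line; d deletes it
sed '$d' spool.csv > clean.csv
cat clean.csv
```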

10+GB file conversion from txt to csv

I have a txt file which has 1400 columns and 3.1M rows.
I want to convert this file into CSV.
I tried doing it from Excel's Data - From Text option.
The file was created, but it had only 120k rows (and all 1400 columns).
I am not sure how I should convert the whole file into CSV.
It would be great to have help on this.
Thanks
I see you selected the "notepad" tag. You should try gVim (https://gvim.en.softonic.com/). I used it to open 2 GB files and it worked like a charm.
You can find more programs that can open big files here: https://stackoverflow.com/a/159537/1564840
On the other hand, I suggest you split that big txt file into multiple smaller txt files. Then you can convert the smaller txt files one by one.
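The splitting step can be done with the standard split utility; a tiny stand-in file is used here, but the same -l flag (lines per chunk) applies to the real 3.1M-row file:

```shell
cd "$(mktemp -d)"
seq 1 10 > big.txt          # stand-in for the real file

# Split into chunks of 4 lines each: part_aa, part_ab, part_ac
split -l 4 big.txt part_
wc -l part_*
```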

Importing .csv files and saving as .dta

I have a folder containing a number of csv files, e.g. "leeds dz.csv", "leeds gh.csv", "leeds fr.csv". The first part of the file names is constant (i.e. always "leeds").
I want to import each to Stata individually, convert to .dta file and save it. Currently I have this code:
cd "etcetc"
clear
local myfilelist : dir . files"*.csv"
foreach file of local myfilelist {
    drop _all
    insheet using `file', comma
    local outfile = subinstr("`file'",".csv","",.)
    save "`outfile'", replace
}
The code works fine if I rename all the .csv files manually to delete the "leeds" part, i.e. if each .csv is named "dz.csv" instead of "leeds dz.csv", etc.
However, if I do not do this deletion, I receive the error "invalid 'dz.csv'".
I'm guessing this has something to do with the 3rd line of my code, in particular the "*.csv". But I'm unsure how to adapt the code, or why it won't let me import files with a space in the name.
The line
insheet using `file', comma
will be problematic with any filename containing spaces.
Try
insheet using "`file'", comma
The help for insheet is quite explicit on this:
If filename is specified without an extension, .raw is assumed. If your
filename contains embedded spaces, remember to enclose it in double
quotes.