I have a very large SQL file (14 GB). Currently I am not able to open this file in my browser or VS Code because it is too big; it keeps crashing and takes too long. However, there is a single table that I want from this huge SQL file.
Is there a way of splitting the SQL file to get the specific table that I am searching for? Any helpful answer, please?
You can do:
Step 1: grep -ni "${YourTableName}" path/to/your/file
In the output you'll see the lines matching ${YourTableName}, each prefixed with its line number.
Step 2: tail -n +25 path/to/your/file > path/to/your/fileChunk (where 25 must be replaced with the line number from the grep command); now the file path/to/your/fileChunk starts with the stuff related to your table.
Step 3 (optional): path/to/your/fileChunk starts with the stuff related to your table, but further down it may still contain stuff related to other tables, so repeat steps 1 & 2 on path/to/your/fileChunk and delete the needless info.
PS: This is only an idea of how to split your huge file into chunks; you have to adapt these commands to your own values.
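For example, a minimal sketch assuming a mysqldump-style file and a table called my_table (both the table name and the line number 123456 are placeholders):
grep -n 'CREATE TABLE `my_table`' path/to/your/file            # note the line number printed in front of the match
tail -n +123456 path/to/your/file > path/to/your/fileChunk     # 123456 = the line number grep printed
sed -n '/CREATE TABLE `my_table`/,/CREATE TABLE `/p' path/to/your/file > my_table.sql   # or grab everything up to the next CREATE TABLE in one go
The sed variant copies one extra line (the next table's CREATE TABLE) at the end of my_table.sql, which you can simply delete.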
I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.
The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000 - 10,000 rows of data, depending on the month.
The error I am getting from the job history in the Big Query API is:
Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2949; errors: 1. Please look into the
errors[] collection for more details.
When I am uploading the CSV files, I am selecting the following:
file format: csv
table type: native table
auto detect: tried automatic and manual
partitioning: no partitioning
write preference: WRITE_EMPTY (cannot change this)
number of errors allowed: 0
ignore unknown values: unchecked
field delimiter: comma
header rows to skip: 1 (also tried 0 and manually deleting the header rows from the csv files).
Any help would be greatly appreciated.
This usually points to an error in the structure of the data source (in this case your CSV file). Since your CSV file is small, you can run a little validation script to check that the number of columns is exactly the same across all rows in the CSV before running the export.
Maybe something like:
awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }' myfile.csv
Or you can bind it to a condition (let's say your number of columns should be 5):
ncols=$(awk -F, '{ print NF }' myfile.csv | sort -nu); if [ "$ncols" = "5" ]; then python myexportscript.py; else echo "number of columns invalid: $ncols"; fi
It's impossible to point out the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted. As a result, one typo can confuse BQ into reporting thousands of errors. Let's say you have the following CSV file:
Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48
With the following schema:
Name(String) Class(Int64) Dorm(String) Room(String) GPA(Float64)
Since the second row is missing a comma, everything in that row is shifted one column over. If you have a large file, this results in thousands of errors as BigQuery attempts to insert Strings into Ints/Floats.
I suggest you run your CSV file through a CSV validator before uploading it to BQ. It might find something that breaks it. It's even possible that one of your fields has a comma inside its value, which breaks everything.
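If you don't have a validator at hand, a quick hedged check with awk (assuming 5 comma-separated columns and no quoted commas inside values) prints the rows that are off:
awk -F, 'NF != 5 { print "line " NR ": expected 5 columns, found " NF }' myfile.csv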
Another thing to check is that all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly, which returns a null value for a specific field in every row.
As mentioned by Scicrazed, this issue seems to occur because some file rows have an incorrect format, in which case you need to validate the content in order to figure out the specific error that is causing the issue.
I recommend checking the errors[] collection, which might contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus Stackdriver logs, which contains the same complete error data that is reported by the service.
I'm probably too late for this, but it seems the file has some errors (it can be a character that cannot be parsed or just a string in an int column) and BigQuery cannot upload it automatically.
You need to understand what the error is and fix it somehow. An easy way to do it is by running this command on the terminal:
bq --format=prettyjson show -j <JobID>
and you will be able to see additional logs for the error to help you understand the problem.
If the error happens only a few times, you can just increase the number of errors allowed.
If it happens many times, you will need to manipulate your CSV file before you upload it.
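In case it helps, the error tolerance mentioned above can also be set from the command line; a hedged sketch of a bq load call (the dataset, table, bucket and schema file names are placeholders):
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=10 mydataset.mytable gs://mybucket/myfile.csv ./myschema.json
--max_bad_records sets how many bad rows BigQuery may skip before failing the whole job.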
Hope it helps
Hello, and thank you for taking the time to read my issue.
I have the following issue with Cassandra cqlsh:
When I use the COPY command to load a .csv into my table, the command prompt never finishes executing, and nothing is loaded into the table if I stop it with Ctrl+C.
I'm using .csv files from: https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales
specifically from ukTrafficAADF.csv.
I put the code below:
CREATE TABLE first_query ( AADFYear int, RoadCategory text,
LightGoodsVehicles text, PRIMARY KEY(AADFYear, RoadCategory));
I'm trying this:
COPY first_query (AADFYear, RoadCategory, LightGoodsVehicles) FROM '..\ukTrafficAADF.csv' WITH DELIMITER=',' AND HEADER=TRUE;
This gives me the error below repeatedly:
Failed to import 5000 rows: ParseError - Invalid row length 29 should be 3, given up without retries
And never finishes.
I should add that the .csv file has more columns than I need, and trying the previous COPY command with the SKIPCOLS reserved word to list the unused columns gives the same result.
Thanks in advance.
In the cqlsh COPY command, all columns in the CSV must be present in the table schema.
In your case, your CSV ukTrafficAADF has 29 columns but the table first_query has only 3 columns; that's why it's throwing a parse error.
So you have to remove all the unused columns from the CSV in some way; then you can load it into the Cassandra table with the cqlsh COPY command, for example as sketched below.
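A hedged sketch, assuming the three columns you need happen to be fields 1, 5 and 9 in ukTrafficAADF.csv (check the real header and adjust the field numbers):
cut -d, -f1,5,9 ukTrafficAADF.csv > first_query.csv     # keep only AADFYear, RoadCategory, LightGoodsVehicles
Then, in cqlsh:
COPY first_query (AADFYear, RoadCategory, LightGoodsVehicles) FROM 'first_query.csv' WITH DELIMITER=',' AND HEADER=TRUE;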
I have multiple CSV files in a directory. They may have different column combinations, but I would like to COPY them all with a single command, as there are a lot of them and they all go into the same table. But the FDelimitedParser only evaluates the header row of the first file, then rejects all rows that do not fit, i.e. all rows from most of the other files. I've been using FDelimitedParser, but anything else is fine.
1 - Is this expected behavior, and if so, why?
2 - I want it to evaluate the headers of each file; is there a way?
Thanks
(Vertica 7.2)
Looks like you need flex tables for that; see http://vertica-howto.info/2014/07/how-to-load-csv-files-into-flex-tables/
Here's a small workaround that I use when I need to load a bunch of files in at once. This assumes all your files have the same column order.
Download and run Cygwin
Navigate to folder with csv files
cd your_folder_name_with_csv_files
Combine all csv files into a new file
cat *.csv >> new_file_name.csv
Run a COPY statement in Vertica from the new file. If file headers are an issue, you can follow the instructions on this link and run them through Cygwin to remove the first line from every file, for example as sketched below.
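A hedged sketch of that header-stripping step, assuming every file has a single header row (run from the folder with the CSV files):
for f in *.csv; do tail -n +2 "$f" >> ../new_file_name.csv; done     # drop each file's header line
Writing the combined file outside the folder avoids accidentally concatenating it into itself on a second run.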
I need to process multivariate time series given as multiline, multirow *.csv files with Apache Pig. I am trying to use a custom UDF (EvalFunc) to solve my problem. However, all loaders I tried for loading the data from my CSV files and passing it to the UDF (except org.apache.pig.impl.io.ReadToEndLoader, which I cannot get to work) return one line of the file as one record. What I need, however, is one column (or the content of the complete file) so that I can process a complete time series. Processing one value at a time is obviously useless because I need longer sequences of values...
The data in the CSV files looks like this (30 columns, the 1st is a datetime, all others are double values; here are 3 sample lines):
17.06.2013 00:00:00;427;-13.793273;2.885583;-0.074701;209.790688;233.118828;1.411723;329.099170;331.554919;0.077026;0.485670;0.691253;2.847106;297.912382;50.000000;0.000000;0.012599;1.161726;0.023110;0.952259;0.024673;2.304819;0.027350;0.671688;0.025068;0.091313;0.026113;0.271128;0.032320;0
17.06.2013 00:00:01;430;-13.879651;3.137179;-0.067678;209.796500;233.141233;1.411920;329.176863;330.910693;0.071084;0.365037;0.564816;2.837506;293.418550;50.000000;0.000000;0.014108;1.159334;0.020250;0.954318;0.022934;2.294808;0.028274;0.668540;0.020850;0.093157;0.027120;0.265855;0.033370;0
17.06.2013 00:00:02;451;-15.080651;3.397742;-0.078467;209.781511;233.117081;1.410744;328.868437;330.494671;0.076037;0.358719;0.544694;2.841955;288.345883;50.000000;0.000000;0.017203;1.158976;0.022345;0.959076;0.018688;2.298611;0.027253;0.665095;0.025332;0.099996;0.023892;0.271983;0.024882;0
Has anyone an idea how I could process this as 29 time series?
Thanks in advance!
What do you want to achieve?
If you want to read all rows in all files as a single record, this can work:
a = LOAD '...' USING PigStorage(';') as <schema> ;
b = GROUP a ALL;
b will contain all the rows in a bag.
If you want to read each CSV file as a single record, this can work:
a = LOAD '...' USING PigStorage(';','-tagsource') as <schema> ;
b = GROUP a BY $0; --$0 is the filename
b will contain all the rows per file in a bag.
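To actually hand the grouped rows to your EvalFunc, a hedged sketch (myudfs.jar and MyTimeSeriesUDF are hypothetical names for your own jar and UDF class):
REGISTER 'myudfs.jar';
c = FOREACH b GENERATE myudfs.MyTimeSeriesUDF(a); -- the whole bag of rows is passed to the UDF as a single argument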
I have a list of words in a text file, each word separated by a new line. I want to read all of the words and then, for each word, look up the DB and remove rows that contain the words that were read from the text file. How do I do that? I am a newbie to DB programming, and I guess we don't have loops in SQL, right?
1 - Read all the words from the text file
2 - For each word from the text file
3 - Remove the entry from the DB, e.g. delete from TABLE where ITEMNAME is like 'WORDFROMFILE'
Thanks
Here's the general idea:
Step 1: Import the text file into a table.
Step 2: Write a query that DELETEs from the target table where its ITEMNAME matches a keyword in the imported table, using an INNER JOIN, for example as sketched below.
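A minimal sketch, assuming SQL Server/MySQL-style DELETE ... JOIN syntax and hypothetical table names (Words is the one-column table imported in Step 1, Items is the table you want to clean up):
-- delete every row whose ITEMNAME matches a word from the file
DELETE t
FROM Items AS t
INNER JOIN Words AS w
    ON t.ITEMNAME = w.Word;
If you need the LIKE-style matching from the question (the word can appear anywhere inside ITEMNAME), change the join condition to ON t.ITEMNAME LIKE CONCAT('%', w.Word, '%').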
You could use this technique to read text from a file. If you want to do more complicated stuff, I'd suggest doing it from the front end (e.g. C#/VB etc.) rather than the DB.