Is there a work-around that allows missing data to equal NULL for LOAD DATA INFILE in MySQL? - mysql

I have a lot of large csv files with NULL values stored as ,, (i.e., no entry). Using LOAD DATA INFILE turns these missing values into zeros, even if I create the table with a column definition like var DOUBLE DEFAULT NULL. After a lot of searching I found that this is a known "bug", although it may be a feature for some users. Is there a way I can fix this on the fly without pre-processing? These data are all numeric, so a zero value is very different from NULL.
Or, if I have to pre-process, is there an approach that is most promising for dealing with tens of csv files of 100MB to 1GB? Thanks!

With minimal preprocessing with sed, you can have your data ready for import.
for csvfile in *.csv
do
sed -i -e 's/^,/\\N,/' -e 's/,$/,\\N/' -e 's/,,/,\\N,/g' -e 's/,,/,\\N,/g' "$csvfile"
done
That should do an in-place edit of your CSV files and replace the blank values with \N. Update the glob, *.csv, to match your needs.
The reason there are two identical regular expressions matching ,, is that I couldn't find another way to make it replace consecutive blank values, e.g. ,,,.

"\N" (without quotes) in a data file signifies that the value should be null when the file is imported into MySQL. Can you edit the files to replace ",," with ",\N,"?

Related

Replace a value of attribute in json in SQL dump

I have a table in MariaDB with five columns; one of them is of type longtext (Compressed String) holding JSON.
I also have a dump of that table, and in it I need to change the value of the salary attribute in the nested JSON string below, based on the value of the empId attribute.
INSERT INTO Employee VALUES (1,"ram","1243","19-03-14",{"name":"ram",age:"23","empId":"1234","address":{"city":{"name":"",.*},"country":{"name":"",.*}},"gender":"male","hobbies":"travel","salary":"40000","qualification":"BE","marrital-status":"married",.*),
(2,"komal","1243","19-03-14",{"name":"komal",age:"21","empId":"1534","address":{"city":{"name":"",.*},"country":{"name":"",.*}},"gender":"male","hobbies":"music","salary":"30000","qualification":"BE","marrital-status":"married",.*),
(3,"ramya","1243","19-03-14",{"name":"ramya",age:"22","empId":"1754","address":{"city":{"name":"",.*},"country":{"name":"",.*}},"gender":"male","hobbies":"travel","salary":"40000","qualification":"BE","marrital-status":"married",.*),
(4,"raj","1243","19-03-14",{"name":"raj",age:"23","empId":"1364","address":{"city":{"name":"",.*},"country":{"name":"",.*}},"gender":"male","hobbies":"playing","salary":"40000","qualification":"BE","marrital-status":"married",.*);
I have a csv file with empId mapped to the revised salary.
I will loop over the csv file and, based on empId, replace the salary in the dump with the value from the csv.
I tried to do the replacement with the below substitute pattern:
:%s/\("empId":"1243"\),\(.*\),\("address":{"city":{\),\(.*\),\(}\),\("country":{\),\(.*\),\(}}\),\(.*\),"salary":[0-9\"]*/\1,\2,\3,\4,5,\6,\7,\8,\9,"salaray":"50000"/
but I am getting the below error messages:
E872: (NFA regexp) Too many '('
E51: Too many \(
E476: Invalid command
How can I parse the JSON in the dump and change the value using sed, or any other way in a shell script?
I'm not 100% sure about your use case, but you can try something like this for each empId/salary pair in the csv (here empId 1234 gets salary 50000):
sed -E 's/("empId":"1234".*"salary":")[0-9]+/\150000/' input.txt
Test it first; then add -i to the command to edit the file in place, otherwise only the standard output changes.
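If you do want to drive it from the csv, a rough sketch of that loop (the file names, the empId,salary column order, and the absence of a header row are all assumptions; it also relies on each record sitting on its own line in the dump):
# For every empId,salary pair in the csv, rewrite the salary of the
# matching record in the dump (edits dump.sql in place).
while IFS=, read -r empid salary
do
    sed -i -E "s/(\"empId\":\"$empid\".*\"salary\":\")[0-9]+/\\1$salary/" dump.sql
done < salaries.csv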
Using node or python to do this might be easier. You could make a small script that reads the JSON from a file and then outputs the new JSON to a new file. Or, if you're restricted in your tech stack, another option could be loading the csv into the SQL database itself and building a query from it, with your input data, to produce the right rows to insert.
Some resources I used, which might explain this better than my short answer:
https://linuxize.com/post/how-to-use-sed-to-find-and-replace-string-in-files/
https://phoenixnap.com/kb/grep-regex
using extended regular expressions in sed

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), it is capable of inferring the column names from the header, because the header fields are all strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the name of your variables.
I know I can specify the column names manually, but I would prefer not doing that, as I have a lot of tables. Is there another solution ?
If your data is all of "string" type and the first row of your CSV file contains the column names, then it is easy to write a quick script that parses the first line of your CSV and generates the corresponding "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
14/02/2018: Update thanks to Felipe's comment:
The above command can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
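The load step that follows could then look like this (dataset, table and file names as above; --skip_leading_rows=1 so the header row itself is not loaded as data):
bq load --source_format=CSV --skip_leading_rows=1 mydataset.myNewTable myData.csv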
It's not possible with the current API. You can file a feature request in the public BigQuery tracker: https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is you'll get an extra column in your table. If having an extra column is not acceptable, you can use query instead of load, and SELECT * EXCEPT the added extra column, but query is not free.
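For the query variant, a hedged sketch (the dataset, table and column names are placeholders):
bq query --use_legacy_sql=false --destination_table=mydataset.clean_table \
  'SELECT * EXCEPT (extra_col) FROM mydataset.my_table'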

How to bulk load into Cassandra other than the COPY method?

I am using the COPY method for copying the .csv file into the Cassandra tables, but I am getting a record-mismatch error:
Record 41(Line 41) has mismatched number of records (85 instead of 82)
This is happening for all the .csv files, and all the .csv files are system generated.
Is there any workaround for this error?
Based on your error message, it sounds like the copy command is working for you, until record 41. What are you using as a delimiter? The default delimiter for the COPY command is a comma, and I'll bet that your data has some additional commas in it on line 41.
A few options:
Edit your data and remove the extra commas.
Alter your .csv file to encapsulate the values of all of your fields in double-quotes, as COPY's default QUOTE value is ". This will allow you to leave the in-text commas.
Alter your .csv file to delimit with pipes | instead of a comma, and set the COPY command's DELIMITER option to | (see the sketch after this list).
Try using either the Cassandra bulk loader or json2sstable utility to import your data. I've never used them, but I would bet you'll have similar problems if you have commas in your data set.
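A rough sketch of the pipe-delimited COPY (keyspace, table, column and file names are placeholders):
COPY mykeyspace.mytable (col1, col2, col3)
FROM '/path/to/data.csv'
WITH DELIMITER = '|' AND QUOTE = '"';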

Sybase ASE 12.0 CSV Table Export

What I'm trying to do is export a view/table from Sybase ASE 12.0 into a CSV file, but I'm having a lot of difficulty with it.
We want to import it into IDEA or MS-Access. The way that these programs operate is with the text-field encapsulation character and a field separator character, along with new lines being the record separator (without being able to modify this).
Well, using bcp to export it is ultimately fruitless with its built-in options: it doesn't allow you to define a text-field encapsulation character (as far as I can tell). So we tried to create another view, reading from the original view/table, that concatenates the fields that have newlines in them (text fields). However, you cannot do that without losing some precision, because it forces the field into a varchar of 8000 characters/bytes, while our largest field is 16000 characters (so there's definitely some truncation).
So, we decided to create columns in the new view that had the text field delimiters. However, that put our column count for the view at 320 -- 70 more than the 250 column limit in ASE 12.0.
bcp can only work on existing tables and views, so what can we do to export this data? We're pretty much open to anything.
If it's only the newline character that is causing problems, can you not just do a replace:
create view new_view as
select field1, field2, replace(text_field_with_char, 'new line char', ' ')
from old_view
You may have to consider exporting as 2 files, importing into your target as 2 tables and then combining them again in the target. If both files have a primary key this is simple.
It sounds like bcp is the right tool, but process the output via awk or perl.
But are those tools you have and know? That might be a little overhead for you.
If you're on Windows you can get ActivePerl for free, and it could be quick.
something like:
perl -F, -lane 'print "\"$F[0]\",$F[1],\"$F[2]\",$F[3]";' bcp-output-file
How's that? @F is the array of fields; the text ones you wrap in escaped double quotes (\").
You can use BCP format files for this.
bcp .... -f XXXX.fmt
BCP can also produce these format files interactively if you don't specify any of the -c, -n, or -f flags. Then you can save the format file and experiment with it, editing it and re-running BCP.
To save time while exporting and debugging, use the -F and -L flags, like "-F 1 -L 10" -- this exports only the first 10 rows.
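A hedged sketch of that flow (server, database, view, login and file names are all placeholders):
# First run: no -c, -n or -f, so bcp prompts for each column's storage type,
# length and terminator, and offers to save the answers as a format file.
bcp mydb..myview out myview.dat -F 1 -L 10 -U mylogin -S MYSERVER

# Later runs: reuse (and hand-edit) the saved format file.
bcp mydb..myview out myview.dat -f myview.fmt -U mylogin -S MYSERVER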

DB load CSV into multiple tables

UPDATE: added an example to clarify the format of the data.
Considering a CSV with each line formatted like this:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,[tbl2.col1:tbl2.col2]+
where [tbl2.col1:tbl2.col2]+ means that there could be any number of these pairs repeated
ex:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2
The tables would relate to each other using the line number as a key, which would have to be created in addition to any columns mentioned above.
Is there a way to use mysql load data infile to load the data into two separate tables?
If not, what Unix command line tools would be best suited for this?
No, not directly; load data can only insert into one table or partitioned table.
What you can do is load the data into a staging table, then use insert into ... select to move the individual columns into the 2 final tables. You may also need substring_index if you're using different delimiters for tbl2's values. The line number is handled by an auto-incrementing column in the staging table (the easiest way is to make the auto column last in the staging table definition).
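A rough sketch of that staging step for the first five columns (table, column and file names are placeholders; extra fields on each line are simply dropped with warnings, and the variable-length tbl2 pairs are easier to split with the shell commands in the next answer):
-- The auto-increment column, last in the definition, supplies the line number.
CREATE TABLE staging (
    c1 VARCHAR(255), c2 VARCHAR(255), c3 VARCHAR(255),
    c4 VARCHAR(255), c5 VARCHAR(255),
    line_no INT AUTO_INCREMENT PRIMARY KEY
);

LOAD DATA INFILE '/path/to/file.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(c1, c2, c3, c4, c5);

INSERT INTO tbl1 (line_no, col1, col2, col3, col4, col5)
SELECT line_no, c1, c2, c3, c4, c5 FROM staging;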
The format is not exactly clear, and this is best done with perl/php/python, but if you really want to use shell tools:
cut -d , -f 1-5 file | awk -F, '{print NR "," $0}' > table1
cut -d , -f 6- file | sed 's,\:,\,,g' | \
awk -F, '{i=1; while (i<=NF) {print NR "," $(i) "," $(i+1); i+=2;}}' > table2
This creates table1 and table2 files with these contents:
1,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
2,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
3,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
and
1,tbl2.col1,tbl2.col2
1,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
As you say, the problematic part is the unknown number of [tbl2.col1:tbl2.col2] pairs declared in each line. I would be tempted to solve this through sed: split the one file into two files, one for each table. Then you can use load data infile to load each file into its corresponding table.