How to deal with unescaped commas in string fields when using `store.format`='csv' in a Drill table - apache-drill

When using a Drill script to convert a set of Parquet files to CSV, I am running into a problem where some of the fields have commas in them. This causes a problem because Drill does not appear to automatically add any quoting or escaping ("<field>" or '<field>') around those fields in the converted files (e.g. for string values that look like "Soft-drink, Large").
The script looks like this:
/opt/mapr/drill/drill-1.11.0/bin/sqlline \
-u jdbc:drill:zk=node001:5181,node002:5181,node003:5181 \
-n $(tail -n+1 $basedir/src/drill-creds.txt | head -1) \
-p $(tail -n+2 $basedir/src/drill-creds.txt | head -1) \
--run=$sqldir
where the SQL being run by Drill looks like this:
alter session set `store.format`='csv';
create table dfs.myworkspace.`/path/to/csv/destination` as
select ....
from dfs.myworkspace.`/path/to/origin/files`
Does anyone have a common way to work around this? Is there a way to add escaping characters to the converted CSV files? (I tried checking the docs (https://drill.apache.org/docs/create-table-as-ctas/) but could find nothing related.)

You can use the TSV or PSV formats, or update the delimiter symbol for CSV cells by configuring the format plugin of your file storage plugin:
In the Drill UI choose Storage, then update the dfs storage plugin, find csv under formats, and add a delimiter property with the desired value.
"csv": {
"type": "text",
"extensions": [
"csv2"
],
"skipFirstLine": false,
"extractHeader": true,
"delimiter": "^"
},
Also, please check the escape property.
See more:
https://drill.apache.org/docs/text-files-csv-tsv-psv/#use-a-distributed-file-system
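For example, switching the output to PSV only requires changing the session option in the question's SQL; a minimal sketch, with hypothetical paths in the same workspace and select * standing in for the original column list:
alter session set `store.format`='psv';
create table dfs.myworkspace.`/path/to/psv/destination` as
select *
from dfs.myworkspace.`/path/to/origin/files`;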

Related

NBSP creeping inside mySQL data [duplicate]

I have a spreadsheet which really has only one complicated table. I basically convert the spreadsheet to a CSV and use a groovy script to generate the INSERT scripts.
However, I cannot do this with a table that has 28 fields, where data within some of the fields on the spreadsheet makes the conversion to CSV even more complicated. So the fields in the new CSV are not differentiated properly, or my script has not accounted for it.
Does anyone have any suggestions on a better approach to do this? Thanks.
Have a look at the LOAD DATA INFILE statement. It will help you to import data from the CSV file into a table.
This is a recurrent question on Stack Overflow. Here is an updated answer.
There are actually several ways to import an Excel file into a MySQL database with varying degrees of complexity and success.
Excel2MySQL or Navicat utilities. Full disclosure, I am the author of Excel2MySQL. These two utilities aren't free, but they are the easiest option and have the fewest limitations. They also include additional features to help with importing Excel data into MySQL. For example, Excel2MySQL automatically creates your table and automatically optimizes field data types like dates, times, floats, etc. If you're in a hurry or can't get the other options to work with your data, then these utilities may suit your needs.
LOAD DATA INFILE: This popular option is perhaps the most technical and requires some understanding of MySQL command execution. You must manually create your table before loading and use appropriately sized VARCHAR field types. Therefore, your field data types are not optimized. LOAD DATA INFILE has trouble importing large files that exceed 'max_allowed_packet' size. Special attention is required to avoid problems importing special characters and foreign unicode characters. Here is a recent example I used to import a csv file named test.csv (see the sketch after this list).
phpMyAdmin: Select your database first, then select the Import tab. phpMyAdmin will automatically create your table and size your VARCHAR fields, but it won't optimize the field types. phpMyAdmin has trouble importing large files that exceed 'max_allowed_packet' size.
MySQL for Excel: This is a free Excel Add-in from Oracle. This option is a bit tedious because it uses a wizard and the import is slow and buggy with large files, but this may be a good option for small files with VARCHAR data. Fields are not optimized.
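The test.csv example referenced above is not reproduced in this excerpt; a rough sketch of such a LOAD DATA INFILE call, assuming a table named test already exists with columns matching the file, and that the file has a header row and double-quoted fields:
-- adjust the path; if the server rejects it, check secure_file_priv or use LOAD DATA LOCAL INFILE
LOAD DATA INFILE '/path/to/test.csv'
INTO TABLE test
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;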
For comma-separated values (CSV) files, the results view panel in Workbench has an "Import records from external file" option that imports CSV data directly into the result set. Execute that and click "Apply" to commit the changes.
For Excel files, consider using the official MySQL for Excel plugin.
A while back I answered a very similar question on the EE site and offered the following block of Perl as a quick and dirty example of how you could directly load an Excel sheet into MySQL, bypassing the need to export / import via CSV, hopefully preserving more of those special characters, and eliminating the need to worry about escaping the content.
#!/usr/bin/perl -w
# Purpose: Insert each Worksheet, in an Excel Workbook, into an existing MySQL DB, of the same name as the Excel(.xls).
# The worksheet names are mapped to the table names, and the column names to column names.
# Assumes each sheet is named and that the first ROW on each sheet contains the column(field) names.
#
use strict;
use Spreadsheet::ParseExcel;
use DBI;
use Tie::IxHash;
die "You must provide a filename to $0 to be parsed as an Excel file" unless #ARGV;
my $sDbName = $ARGV[0];
$sDbName =~ s/\.xls//i;
my $oExcel = new Spreadsheet::ParseExcel;
my $oBook = $oExcel->Parse($ARGV[0]);
my $dbh = DBI->connect("DBI:mysql:database=$sDbName;host=192.168.123.123","root", "xxxxxx", {'RaiseError' => 1,AutoCommit => 1});
my ($sTableName, %hNewDoc, $sFieldName, $iR, $iC, $oWkS, $oWkC, $sSql);
print "FILE: ", $oBook->{File} , "\n";
print "DB: $sDbName\n";
print "Collection Count: ", $oBook->{SheetCount} , "\n";
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
{
    $oWkS = $oBook->{Worksheet}[$iSheet];
    $sTableName = $oWkS->{Name};
    print "Table(WorkSheet name):", $sTableName, "\n";
    for(my $iR = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++)
    {
        tie ( %hNewDoc, "Tie::IxHash");
        for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++)
        {
            $sFieldName = $oWkS->{Cells}[$oWkS->{MinRow}][$iC]->Value;
            $sFieldName =~ s/[^A-Z0-9]//gi; # Strip non alpha-numerics from the column name
            $oWkC = $oWkS->{Cells}[$iR][$iC];
            $hNewDoc{$sFieldName} = $dbh->quote($oWkC->Value) if($oWkC && $sFieldName);
        }
        if ($iR == $oWkS->{MinRow}){
            #eval { $dbh->do("DROP TABLE $sTableName") };
            $sSql = "CREATE TABLE IF NOT EXISTS $sTableName (".(join " VARCHAR(512), ", keys (%hNewDoc))." VARCHAR(255))";
            #print "$sSql \n\n";
            $dbh->do("$sSql");
        } else {
            $sSql = "INSERT INTO $sTableName (".(join ", ",keys (%hNewDoc)).") VALUES (".(join ", ",values (%hNewDoc)).")\n";
            #print "$sSql \n\n";
            eval { $dbh->do("$sSql") };
        }
    }
    print "Rows inserted(Rows):", ($oWkS->{MaxRow} - $oWkS->{MinRow}), "\n";
}
# Disconnect from the database.
$dbh->disconnect();
Note:
Change the connection string in the DBI->connect call to suit, and if needed adjust the user-id and password arguments.
If you need XLSX support, a quick switch to Spreadsheet::XLSX is all that's needed. Alternatively, it only takes a few lines of code to detect the filetype and call the appropriate library.
The above is a simple hack that assumes everything in a cell is a string / scalar. If preserving type is important, a little function with a few regexps can be used in conjunction with a few if statements to ensure numbers / dates remain in the applicable format when written to the DB.
The above code depends on a number of CPAN modules, which you can install (assuming outbound FTP access is permitted) via:
cpan YAML Data::Dumper Spreadsheet::ParseExcel Tie::IxHash Encode Scalar::Util File::Basename DBD::mysql
It should return something along the following lines (it's rather slow, due to the auto commit):
# ./Excel2mysql.pl test.xls
FILE: test.xls
DB: test
Collection Count: 1
Table(WorkSheet name):Sheet1
Rows inserted(Rows):9892

Postgres export via copy script doesn't like json with quotes on import

I am trying to migrate some json blocks stored inside a Postgres database using the scripts below. They work fine in most cases but fail on a few data sets (I think the ones where there are double quotes in the json). Basically I am trying to promote some individual approved blocks of json to the prod database, using an export script that copies the data out to a file and an import script that locates the destination node of json in the destination database and rewrites that node with my new data appended.
I think the data sets that fail are the ones where the json data contains double quotes. An example of how the json with quotes is exported is shown below. I think the issue is the way the copy function is writing out (escaping) this data. My reasons for this guess are: 1) only records that fail seem to be ones with quotes in the data, and 2) a few json tools failed when parsing the json in the area where the quotes were escaped.
"description": "See the chapter \\"Key fields\\" for a general description of keys.
How can I adjust my scripts to be more tolerant of whatever data is stored in the json?
Export
psql -h demo_ip -c "\COPY (select f_values from view_export_f where id = '$id' and f_kind='$f_kind') TO '/tmp/demo.json'"
Import
psql -h prod_ip -ef "import.psql" -v ex="$ex"
import.psql
\set json_block `cat /tmp/demo.json`
UPDATE data
SET value= jsonb_set(value, '{json_location}', value->'json_location' || (:'json_block'), true) where id=(:'ex')
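One thing worth trying (an untested sketch, borrowing the CSV-mode trick discussed under the next question below): export in CSV mode with quote and delimiter characters that never occur in the data, so COPY performs no backslash escaping of its own and the import script can read the file unchanged:
psql -h demo_ip -c "\COPY (select f_values from view_export_f where id = '$id' and f_kind='$f_kind') TO '/tmp/demo.json' csv quote e'\x01' delimiter e'\x02'"
Because the \x01 and \x02 bytes never appear in the jsonb text, nothing in the output gets quoted or escaped, and the \set json_block `cat /tmp/demo.json` step in import.psql picks up the raw JSON as before.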

Loading json data from a file into Postgres

I need to load data from multiple JSON files, each having multiple records within them, into a Postgres table. I am using the following code but it does not work (I am using pgAdmin III on Windows):
COPY tbl_staging_eventlog1 ("EId", "Category", "Mac", "Path", "ID")
from 'C:\\SAMPLE.JSON'
delimiter ','
;
The content of the SAMPLE.JSON file is like this (showing two records out of many):
[{"EId":"104111","Category":"(0)","Mac":"ABV","Path":"C:\\Program Files (x86)\\Google","ID":"System.Byte[]"},{"EId":"104110","Category":"(0)","Mac":"BVC","Path":"C:\\Program Files (x86)\\Google","ID":"System.Byte[]"}]
Try this:
BEGIN;
-- let's create a temp table to bulk the data into
create temporary table temp_json (doc text) on commit drop;
copy temp_json from 'C:\SAMPLE.JSON';
-- uncomment the insert line below to load the parsed records into your table
-- insert into tbl_staging_eventlog1 ("EId", "Category", "Mac", "Path", "ID")
select doc->>'EId' as EId,
       doc->>'Category' as Category,
       doc->>'Mac' as Mac,
       doc->>'Path' as Path,
       doc->>'ID' as ID
from (
  -- re-double the backslashes that text-mode COPY collapsed, then unpack the JSON array
  select json_array_elements(replace(doc,'\','\\')::json) as doc
  from temp_json
) a;
COMMIT;
As mentioned in Andrew Dunstan's PostgreSQL and Technical blog
In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.
so we have to turn to the CSV format mode instead.
copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
The official documentation for COPY (sql-copy) lists some of the relevant parameters:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]
where option can be one of:
FORMAT format_name
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
HEADER [ boolean ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL ( column_name [, ...] )
FORCE_NULL ( column_name [, ...] )
ENCODING 'encoding_name'
FORMAT
Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text.
QUOTE
Specifies the quoting character to be used when a data value is quoted. The default is double-quote. This must be a single one-byte character. This option is allowed only when using CSV format.
DELIMITER
Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format, a comma in CSV format. This must be a single one-byte character. This option is not allowed when using binary format.
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
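Putting those options together with the temp-table idea from the first answer, a hedged end-to-end sketch for the question's SAMPLE.JSON and target table; in CSV mode with unused quote/delimiter characters the backslashes survive intact, so the replace() workaround is no longer needed:
BEGIN;
-- stage the raw JSON text; the quote/delimiter bytes are chosen so they never occur in the data
create temporary table temp_json (doc text) on commit drop;
copy temp_json from 'C:\SAMPLE.JSON' csv quote e'\x01' delimiter e'\x02';
-- unpack the array and insert one row per element
insert into tbl_staging_eventlog1 ("EId", "Category", "Mac", "Path", "ID")
select elem->>'EId', elem->>'Category', elem->>'Mac', elem->>'Path', elem->>'ID'
from temp_json,
     json_array_elements(doc::json) as elem;
COMMIT;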
You can use spyql.
Running the following command would generate INSERT statements that you can pipe into psql:
$ jq -c .[] *.json | spyql -Otable=tbl_staging_eventlog1 "SELECT json->EId, json->Category, json->Mac, json->Path, json->ID FROM json TO sql"
INSERT INTO "tbl_staging_eventlog1"("EId","Category","Mac","Path","ID") VALUES ('104111','(0)','ABV','C:\Program Files (x86)\Google','System.Byte[]'),('104110','(0)','BVC','C:\Program Files (x86)\Google','System.Byte[]');
jq is used to transform the json arrays from all json files in the current directory into json lines (1 json object per line) and then spyql takes care of converting json lines into INSERT statements.
To import the data into PostgreSQL:
$ jq -c .[] *.json | spyql -Otable=tbl_staging_eventlog1 "SELECT json->EId, json->Category, json->Mac, json->Path, json->ID FROM json TO sql" | psql -U your_user_name -h your_host your_database
Disclaimer: I am the author of spyql.

Change the column data delimiter on mysqldump output

I'm looking to change the formatting of the output produced by the mysqldump command in the following way:
(data_val1,data_val2,data_val3,...)
to
(data_val1|data_val2|data_val3|...)
The change here is a different delimiter. This would then allow me to parse the data lines in Python using line.split("|") and end up with the values correctly split (as opposed to doing line.split(",") and having values that contain commas be split into multiple values).
I've tried using the --fields-terminated-by flag, but this requires the --tab flag to be used as well. I don't want to use the --tab flag as it splits the dump into several files. Does anyone know how to alter the delimiter that mysqldump uses?
This is not a good idea. Instead of using string.split() in Python, use the csv module to properly parse CSV data, which may be enclosed in quotes and may contain internal commas that aren't delimiters.
import csv
MySQL dump files are intended to be used as input back into MySQL. If you really want pipe-delimited output, use the SELECT INTO OUTFILE syntax instead with the FIELDS TERMINATED BY '|' option.
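A minimal sketch of that route, with a hypothetical table name and output path (the MySQL server must be allowed to write to the path; see the secure_file_priv setting):
SELECT *
INTO OUTFILE '/tmp/my_table.psv'
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM my_table;
The resulting lines can then be parsed in Python with the csv module using delimiter='|', which also copes with any quoted values.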

Is there a work-around that allows missing data to equal NULL for LOAD DATA INFILE in MySQL?

I have a lot of large csv files with NULL values stored as ,, (i.e., no entry). Using LOAD DATA INFILE turns these NULL values into zeros, even if I create the table with a definition like var DOUBLE DEFAULT NULL. After a lot of searching I found that this is a known "bug", although it may be a feature for some users. Is there a way that I can fix this on the fly without pre-processing? These data are all numeric, so a zero value is very different from NULL.
Or, if I have to pre-process, is there one approach that is most promising for dealing with tens of csv files of 100 MB to 1 GB? Thanks!
With minimal preprocessing with sed, you can have your data ready for import.
for csvfile in *.csv
do
sed -i -e 's/^,/\\N,/' -e 's/,$/,\\N/' -e 's/,,/,\\N,/g' -e 's/,,/,\\N,/g' $csvfile
done
That should do an in-place edit of your CSV files and replace the blank values with \N. Update the glob, *.csv, to match your needs.
The reason there are two identical regular expressions matching ,, is because I couldn't figure out another way to make it replace two consecutive blank values. E.g. ,,,.
"\N" (without quotes) in a data file signifies that the value should be null when the file is imported into MySQL. Can you edit the files to replace ",," with ",\N,"?