Importing data from CSV files into CrateDB

I have created a table in Crate 0.38.x with columns having integer, string, and timestamp data types. I want to load data into this table from delimited text files. Is there a utility to do a bulk import? Sorry, but I could not find one in the documentation or on GitHub.

For bulk imports from files, the COPY FROM statement can be used (see https://crate.io/docs/stable/sql/reference/copy_from.html). However, only JSON-formatted files are supported, so you'll probably need to convert the text files first.
I'm not sure if there are any plans to add support for other formats, but if you create a GitHub issue requesting the feature, you'll get feedback once it has been implemented.
There are also docs available on how to migrate from MySQL and MongoDB.

I quickly imported data from MySQL into Crate 0.40 by installing Ruby on Rails on the same server as the MySQL DB and then using the Mysql2JSON gem (see the Mysql2xxx part).
Crate requires a JSON file with one record per line, so you have to edit the output in the mysql2xxx gem source, replacing the tokens [", ",", and "] with ", "\n", and " respectively, in order to get a format like this in the output:
{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Would it save you a lot of time if I just gave up and went mad now?"}
After exporting the MySQL data to JSON with the Mysql2Json gem, you have to upload the file to the Crate server and run the following in the Crate console:
COPY table_name FROM 'file:///tmp/import_data/quotes.json'
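If you'd rather not patch the gem, a minimal Python sketch (the file names here are hypothetical) that converts a JSON array file into the one-object-per-line format Crate expects:
import json

# Convert a JSON array file into newline-delimited JSON (one object per
# line), which is the format Crate's COPY FROM statement expects.
with open('quotes_array.json') as src, open('quotes.json', 'w') as dst:
    for record in json.load(src):             # parse the whole array
        dst.write(json.dumps(record) + '\n')  # write one object per line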

Read this:
https://crate.io/docs/crate/reference/en/latest/general/dml.html#import-and-export
Just make sure that you have created the table with its schema beforehand when using the COPY function to import a dataset from JSON or CSV.
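As a rough sketch of that flow using the crate Python client (the table name and columns below are made up for illustration):
from crate import client  # pip install crate

conn = client.connect('http://localhost:4200')
cursor = conn.cursor()
# Create the table with an explicit schema first...
cursor.execute("CREATE TABLE quotes (id INTEGER, quote STRING, created TIMESTAMP)")
# ...then bulk-import the newline-delimited JSON file.
cursor.execute("COPY quotes FROM 'file:///tmp/import_data/quotes.json'")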

Related

How to export a GA360 table from BigQuery to Snowflake through GCS as a JSON file without data loss?

I am exporting a GA360 table from BigQuery to Snowflake in JSON format using the bq CLI command, and I am losing some fields when I load it as a table in Snowflake. I use the COPY command to load my JSON data from a GCS external stage into Snowflake tables, but I am missing some fields that are part of a nested array. I even tried compressing the file when exporting to GCS, but I still lose data. Can someone suggest how I can do this? I don't want to flatten the table in BigQuery and transfer that. My daily table size ranges from 1.5 GB to 4 GB.
bq extract \
--project_id=myproject \
--destination_format=NEWLINE_DELIMITED_JSON \
--compression GZIP \
datasetid.ga_sessions_20191001 \
gs://test_bucket/ga_sessions_20191001-*.json
I have set up my integration, file format, and stage in Snowflake. I am copying data from this bucket to a table that has one variant field. The row count matches BigQuery, but the fields are missing.
I am guessing this is due to Snowflake's limit that each variant column must be under 16 MB. Is there some way I can compress each variant field to stay under 16 MB?
I had no problem exporting GA360, and getting the full objects into Snowflake.
First I exported the demo table bigquery-public-data.google_analytics_sample.ga_sessions_20170801 into GCS, JSON formatted.
Then I loaded it into Snowflake:
create or replace table ga_demo2(src variant);
COPY INTO ga_demo2
FROM 'gcs://[...]/ga_sessions000000000000'
FILE_FORMAT=(TYPE='JSON');
And then to find the transactionIds:
SELECT src:visitId, hit.value:transaction.transactionId
FROM ga_demo2, lateral flatten(input => src:hits) hit
WHERE src:visitId='1501621191'
LIMIT 10
Cool things to notice:
I read the GCS files easily from Snowflake deployed in AWS.
JSON manipulation in Snowflake is really cool.
See https://hoffa.medium.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1 for more.

How to convert dBASE III files to MySQL?

Is it possible to convert .DBF files to any other format?
Does anybody know of a script that can be used to convert .DBF files to a MySQL query?
It would also be fine to convert the DBF files to CSV files.
I always have problems with the encoding of the DBF files.
Konstantin
https://www.dbase.com/Knowledgebase/faq/import_export_data.asp
Q: How do I export data from a dBASE table to a text file?
A: Exporting data from dBASE to a text file is handled through the COPY TO command.
Like the APPEND FROM command, there are a number of ways to use this command. Here we are only interested in its most basic use. Once you understand how to use this command, you can go to your on-line help for further details on what can be accomplished with the COPY TO command.
In order to export data you must first be using the table from which the data will be exported. As before, you will be employing the USE command in the command window.
USE <tablename>
For example:
USE Mytest.dbf
Once the table is in use, all you need to do is type the following command in the command window:
COPY TO <filename> TYPE DELIMITED
For example:
COPY TO Myexport.txt TYPE DELIMITED
This would result in a file being created in the current directory called Myexport.txt which would be in the DELIMITED or *.CSV format.
If we had wanted to export the data in the *.SDF format, we would have typed:
COPY TO Myexport.txt TYPE SDF
This would result in a file being created in the current directory called Myexport.txt which would be in the System Delimited or *.SDF format.
Those are the basics on how to import and export text data into a dBASE table. For further information consult the on-line help for the APPEND FROM and COPY TO commands.
I converted old (circa 1997) DBF files to CSV using Python and the dbfread module.
After installing Python, install the dbfread module from a command prompt (pip is a shell command, not a Python statement):
pip install dbfread
The module has many methods for reading DBF files and excellent documentation.
Then a Python script does the job (or you can type it directly into the interpreter):
import csv
from dbfread import DBF

# Read the DBF file (the Windows-1252 code page is 'cp1252' in Python)
table = DBF('C:/my_dbf_file.dbf', encoding='cp1252')

outFileName = 'C:/my_export.csv'
with open(outFileName, 'w', newline='', encoding='cp1252') as file:
    writer = csv.writer(file)
    writer.writerow(table.field_names)          # first row: column names
    for record in table:
        writer.writerow(list(record.values()))  # one row per record
Note that each record in the database is read and saved one at a time, and that the first line of the CSV file contains the column names.
Encoding can be problematic. The dbfread.DBF() method tries to guess the encoding but is not perfect; this is why the code above specifies the encoding parameter in both DBF() and open().
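If you don't know the code page, one rough approach (the candidate list and file name are just examples) is to try a few common encodings until the whole file decodes without errors:
from dbfread import DBF

# Try a few common code pages until the DBF decodes cleanly.
for enc in ['cp1252', 'cp850', 'cp437', 'latin-1']:
    try:
        table = DBF('C:/my_dbf_file.dbf', encoding=enc)
        rows = [list(record.values()) for record in table]  # force a full decode
        print('Decoded OK with', enc)
        break
    except UnicodeDecodeError:
        print(enc, 'failed, trying the next one')
Note that this only catches hard decode errors; a wrong but byte-compatible code page (e.g. latin-1) will still "succeed", so inspect the output visually.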

Why does PyAthena generate a CSV and a CSV metadata file in the S3 location when reading a Glue table?

I started pulling Glue tables via PyAthena last week. However, one annoying thing I noticed is that with the code shown below, sometimes it works and returns a pandas DataFrame, but other times it creates a CSV and a CSV metadata file in the folder where the physical data (Parquet) is stored in S3 and registered in Glue.
I know that if you use the pandas cursor it may end up creating these two files, but I wonder if I can access the data without them, since every time these two files are generated in S3, my read process fails.
Thank you!
import os
import pandas as pd
from pyathena import connect

access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')

connect1 = connect(s3_staging_dir='s3://xxxxxxxxxxxxx')
df = pd.read_sql("select * from abc.table_name", connect1)
df.head()
1. Go to Athena.
2. Click Settings -> workgroup name -> edit workgroup.
3. Update "Query result location".
4. Click "Override client-side settings".
Note: If you have not set up any other workgroups for your Athena environment, you should only find one workgroup, named "Primary".
This should resolve your problem: Athena always writes its query results (a CSV plus a metadata file) to the configured query result location, so keeping that location away from your table's data folder keeps those files out of it. For more information you can read:
https://docs.aws.amazon.com/athena/latest/ug/querying.html
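Alternatively, you can keep the client-side setting but point s3_staging_dir at a dedicated results location instead of the table's data folder (the bucket name below is hypothetical):
import pandas as pd
from pyathena import connect

# Stage Athena's result files (the CSV and its metadata) in a dedicated
# bucket so they never land next to the table's Parquet data.
conn = connect(s3_staging_dir='s3://my-athena-query-results/')
df = pd.read_sql("select * from abc.table_name", conn)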

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc., and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset.
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use an MR job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete @rahul's answer, you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to give it the storage plugin (dfs). The "root" config can read from the whole disk but is not writable, while the tmp config (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But the problem is that if the JSON is nested or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"} I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
I assume the reason is that Parquet is a columnar store, so you need columns; JSON isn't columnar by default, so you need to convert it.
The problem is that I have to know my JSON schema and build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.
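For what it's worth, outside Drill a quick way to do the same conversion is pandas plus pyarrow, assuming newline-delimited JSON and fields that pyarrow can map to columns:
import pandas as pd  # to_parquet needs pyarrow (or fastparquet) installed

# Read newline-delimited JSON, then write Parquet; nested objects become
# struct columns when pyarrow can infer them, otherwise flatten first.
df = pd.read_json('sample-file.json', lines=True)
df.to_parquet('sample-file.parquet')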

How to copy a CSV data file to Amazon Redshift?

I'm trying to migrate some MySQL tables to Amazon Redshift, but I've run into some problems.
The steps are simple:
1. Dump the MySQL table to a csv file
2. Upload the csv file to S3
3. Copy the data file to RedShift
Error occurs in step 3:
The SQL command is:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' csv;
The error info:
An error occurred when executing the SQL command: copy TABLE_A from
's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx ERROR: COPY CSV is
not supported [SQL State=0A000] Execution time: 0.53s 1 statement(s)
failed.
I don't know if there are any limitations on the format of the CSV file, such as the delimiters and quotes; I cannot find this in the documentation.
Can anyone help?
The problem is finally resolved by using:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ','
removequotes;
More information can be found here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
Amazon Redshift now supports the CSV option for the COPY command. It's better to use this option to import CSV-formatted data correctly. The format is shown below.
COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV;
The default delimiter is ( , ) and the default quote character is ( " ). You can also import TSV-formatted data using the CSV and DELIMITER options, like this:
COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV DELIMITER '\t';
There is a disadvantage to the old way (DELIMITER and REMOVEQUOTES): REMOVEQUOTES does not support having a newline or a delimiter character within an enclosed field. If the data can include such characters, you should use the CSV option.
See the following link for the details.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
If you want to save yourself some code and you have a very basic use case, you can use Amazon Data Pipeline.
It starts a spot instance and performs the transformation within the Amazon network, and it's a really intuitive tool (but very simple, so you can't do complex things with it).
You can try with this
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' csv;
CSV itself means comma-separated values, so there is no need to provide a delimiter with this option. Please refer to this link:
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-format
I always use this code:
COPY clinical_survey
FROM 's3://milad-test/clinical_survey.csv'
iam_role 'arn:aws:iam::123456789123:role/miladS3xxx'
CSV
IGNOREHEADER 1
;
Description:
1- COPY is followed by the name of the target table
2- FROM gives the address of the file stored in S3
3- iam_role is a substitute for CREDENTIALS. Note that the IAM role should be defined in the IAM management menu of your console, and then assigned to the user in the trust menu as well (that is the hardest part!)
4- CSV uses the comma delimiter
5- IGNOREHEADER 1 is a must! Otherwise it will throw an error (it skips one row of my CSV, treating it as the header)
Since the resolution has already been provided, I'll not repeat the obvious.
However, in case you receive more errors that you're not able to figure out, simply execute the following on your workbench while connected to any of the Redshift accounts:
select * from stl_load_errors [where ...];
stl_load_errors contains all the Amazon Redshift load errors in historical fashion; a normal user can view details corresponding to his or her own account, while a superuser has full access.
The details are captured elaborately at:
Amazon STL Load Errors Documentation
A little late to comment, but it can be useful:
You can use an open-source project, sqlshift, to copy tables directly from MySQL to Redshift.
It only requires Spark, and if you have YARN, that can also be used.
Benefit: it automatically decides the distkey and interleaved sortkey using the primary key.
It looks like you are trying to load a local file into a Redshift table.
The CSV file has to be on S3 for the COPY command to work.
If you can extract data from the table to a CSV file, you have one more scripting option: you can use a Python/boto/psycopg2 combo to script your CSV load to Amazon Redshift.
In my MySQL_To_Redshift_Loader I do the following:
Extract the data from MySQL into a temp file:
from subprocess import Popen, PIPE

loadConf=[db_client_dbshell, '-u', opt.mysql_user, '-p%s' % opt.mysql_pwd, '-D', opt.mysql_db_name, '-h', opt.mysql_db_server]
...
q="""
%s %s
INTO OUTFILE '%s'
FIELDS TERMINATED BY '%s'
ENCLOSED BY '%s'
LINES TERMINATED BY '\r\n';
""" % (in_qry, limit, out_file, opt.mysql_col_delim, opt.mysql_quote)
p1 = Popen(['echo', q], stdout=PIPE, stderr=PIPE, env=env)
p2 = Popen(loadConf, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
...
...
Compress and load the data to S3 using the boto Python module and multipart upload:
import boto
from boto.s3.key import Key

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                         reduced_redundancy=use_rr)
Use the psycopg2 COPY command to append the data to the Redshift table:
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)