I need to load data from multiple JSON files, each containing multiple records, into a Postgres table. I am using the following code, but it does not work (I am using pgAdmin III on Windows):
COPY tbl_staging_eventlog1 ("EId", "Category", "Mac", "Path", "ID")
from 'C:\\SAMPLE.JSON'
delimiter ','
;
The content of the SAMPLE.JSON file looks like this (showing two records out of many):
[{"EId":"104111","Category":"(0)","Mac":"ABV","Path":"C:\\Program Files (x86)\\Google","ID":"System.Byte[]"},{"EId":"104110","Category":"(0)","Mac":"BVC","Path":"C:\\Program Files (x86)\\Google","ID":"System.Byte[]"}]
Try this:
BEGIN;
-- let's create a temp table to bulk data into
create temporary table temp_json (values text) on commit drop;
copy temp_json from 'C:\SAMPLE.JSON';
-- uncomment the line below to insert the records into your table
-- insert into tbl_staging_eventlog1 ("EId", "Category", "Mac", "Path", "ID")
select values->>'EId' as EId,
values->>'Category' as Category,
values->>'Mac' as Mac,
values->>'Path' as Path,
values->>'ID' as ID
from (
select json_array_elements(replace(values,'\','\\')::json) as values
from temp_json
) a;
COMMIT;
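Since the question mentions multiple JSON files, note that COPY appends rows, so inside the same transaction you can repeat the COPY once per file before running the SELECT/INSERT; a minimal sketch (the second file name is hypothetical):
copy temp_json from 'C:\SAMPLE.JSON';
copy temp_json from 'C:\SAMPLE2.JSON';
-- ...one COPY per file, then run the insert/select above once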
As mentioned in Andrew Dunstan's PostgreSQL and Technical blog
In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.
so we have to fall back to the CSV format mode.
copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
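Once the whole document is loaded as a single text value, it still has to be expanded into rows and columns; a minimal sketch, assuming the question's sample file was loaded into a staging table like the earlier temp_json(values) using the CSV-mode command above (no backslash doubling is needed, because CSV mode with those control-character quote/delimiter settings leaves each line untouched):
insert into tbl_staging_eventlog1 ("EId", "Category", "Mac", "Path", "ID")
select values->>'EId', values->>'Category', values->>'Mac', values->>'Path', values->>'ID'
from (
    select json_array_elements(values::json) as values
    from temp_json
) a;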
In the official documentation for COPY (sql-copy), some of the parameters are listed here:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]
where option can be one of:
FORMAT format_name
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
HEADER [ boolean ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL ( column_name [, ...] )
FORCE_NULL ( column_name [, ...] )
ENCODING 'encoding_name'
FORMAT
Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text.
QUOTE
Specifies the quoting character to be used when a data value is quoted. The default is double-quote. This must be a single one-byte character. This option is allowed only when using CSV format.
DELIMITER
Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format, a comma in CSV format. This must be a single one-byte character. This option is not allowed when using binary format.
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
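For reference, the single-column JSON load shown earlier can also be written with this newer option syntax; a sketch equivalent to the csv quote e'\x01' delimiter e'\x02' command above:
COPY the_table (jsonfield)
FROM '/path/to/jsondata'
WITH (FORMAT csv, QUOTE E'\x01', DELIMITER E'\x02');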
You can use spyql.
Running the following command would generate INSERT statements that you can pipe into psql:
$ jq -c .[] *.json | spyql -Otable=tbl_staging_eventlog1 "SELECT json->EId, json->Category, json->Mac, json->Path, json->ID FROM json TO sql"
INSERT INTO "tbl_staging_eventlog1"("EId","Category","Mac","Path","ID") VALUES ('104111','(0)','ABV','C:\Program Files (x86)\Google','System.Byte[]'),('104110','(0)','BVC','C:\Program Files (x86)\Google','System.Byte[]');
jq is used to transform the JSON arrays from all JSON files in the current directory into JSON lines (one JSON object per line), and then spyql takes care of converting the JSON lines into INSERT statements.
To import the data into PostgreSQL:
$ jq -c .[] *.json | spyql -Otable=tbl_staging_eventlog1 "SELECT json->EId, json->Category, json->Mac, json->Path, json->ID FROM json TO sql" | psql -U your_user_name -h your_host your_database
Disclaimer: I am the author of spyql.
Related
I am working on some benchmarks and need to compare the ORC, Parquet, and CSV formats. I have exported TPC-H (SF1000) to ORC-based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other data types (i.e. bigint) to text and store the column format in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable for the SF1000 scale factor. Another disadvantage is that my Hive metastores wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain varchar columns only.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar) AS regionkey, CAST(name AS varchar) AS name, CAST(comment AS varchar) AS comment
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
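For example, a benchmark query that filters on the bigint regionkey would need a cast back; a sketch against the region_csv table created above:
SELECT name, comment
FROM region_csv
WHERE CAST(regionkey AS bigint) = 1;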
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw things up!
It is possible to create typed CSV-ish tables when using the TEXTFILE format and use ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. It normally uses the ^A character (ASCII 1) as the field separator, but when set to ',' it resembles the structure of a CSV file.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be stored correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
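If you still want to use this trick, a quick sanity check can tell you whether any values would break the fake CSV; a sketch in Trino SQL, assuming the source table from the question:
SELECT count(*)
FROM hive.tpch_sf1_orc.region
WHERE name LIKE '%,%' OR name LIKE '%"%' OR name LIKE '%' || chr(10) || '%'
   OR comment LIKE '%,%' OR comment LIKE '%"%' OR comment LIKE '%' || chr(10) || '%';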
I am trying to import a CSV file into a table in Postgres using the COPY command. The problem is that one column is of the json data type. I tried to escape the JSON data in the CSV using dollar quoting ($$...$$), as in section 4.1.2.2 of the documentation.
This is first line of csv:
3f382d8c-bd27-4092-bd9c-8b50e24df7ec;370038757|PRIMARY_RESIDENTIAL;$${"CustomerData": "{}", "PersonModule": "{}"}$$
This is command used for import:
psql -c "COPY table(id, name, details) FROM '/path/table.csv' DELIMITER ';' ENCODING 'UTF-8' CSV;"
This is error I get:
ERROR: invalid input syntax for type json
DETAIL: Token "$" is invalid.
CONTEXT: JSON data, line 1: $...
COPY table, line 1, column details: "$${CustomerData: {}, PersonModule: {}}$$"
How should I escape/import a JSON value using COPY? Should I give up and use something like pgloader instead? Thank you
If importing the JSON data fails, give the following setup a try - it worked for me even for quite complicated data:
COPY "your_schema_name.yor_table_name" (your, column_names, here)
FROM STDIN
WITH CSV DELIMITER E'\t' QUOTE '\b' ESCAPE '\';
--here rows data
\.
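If regenerating the file is an option, another route (a sketch, not part of the original answer) is to drop the $$...$$ and instead quote the JSON column according to the CSV rules, i.e. wrap the field in double quotes and double every embedded quote, so that a plain CSV COPY can parse it; your_table stands in for the real table name:
-- reworked first CSV line (JSON field quoted per CSV rules):
-- 3f382d8c-bd27-4092-bd9c-8b50e24df7ec;370038757|PRIMARY_RESIDENTIAL;"{""CustomerData"": ""{}"", ""PersonModule"": ""{}""}"
COPY your_table (id, name, details)
FROM '/path/table.csv'
WITH (FORMAT csv, DELIMITER ';');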
I'm trying to import a JSON file into a table. I'm using the solution mentioned here: https://stackoverflow.com/a/33130304/1663462:
create temporary table temp_json (values text) on commit drop;
copy temp_json from 'data.json';
select
values->>'annotations' as annotationstext
from (
select json_array_elements(replace(values,'\','\\')::json) as values
from temp_json
) a;
Json file content is:
{"annotations": "<?xml version=\"1.0\"?>"}
I have verified that this is a valid JSON file.
The json file contains a \" which I presume is responsible for the following error:
CREATE TABLE
COPY 1
psql:insertJson2.sql:13: ERROR: invalid input syntax for type json
DETAIL: Expected "," or "}", but found "1.0".
CONTEXT: JSON data, line 1: {"annotations": "<?xml version="1.0...
Are there any additional characters that need to be escaped?
Because the COPY command processes escape ('\') characters in text format and there is no option to disable this, there are two ways to import such data.
1) Process file using external utility via copy ... from program, for example using sed:
copy temp_json from program 'sed -e ''s/\\/\\\\/g'' data.json';
It will replace all backslashes with doubled backslashes, which will be converted back to single ones by COPY.
2) Use csv import:
copy temp_json from 'data.json' with (format csv, quote '|', delimiter E'\t');
Here you should set the quote and delimiter characters to ones that do not occur anywhere in your file.
And after that just use direct conversion:
select values::json->>'annotations' as annotationstext from temp_json;
Valid JSON can naturally have the backslash character: \. When you insert data in a SQL statement like so:
sidharth=# create temp table foo(data json);
CREATE TABLE
sidharth=# insert into foo values( '{"foo":"bar", "bam": "{\"mary\": \"had a lamb\"}" }');
INSERT 0 1
sidharth=# select * from foo;
data
------------------------------------------------------
{"foo":"bar", "bam": "{\"mary\": \"had a lamb\"}" }
(1 row)
Things work fine.
But if I copy the JSON to a file and run the copy command I get:
sidharth=# \copy foo from './tests/foo' (format text);
ERROR: invalid input syntax for type json
DETAIL: Token "mary" is invalid.
CONTEXT: JSON data, line 1: {"foo":"bar", "bam": "{"mary...
COPY foo, line 1, column data: "{"foo":"bar", "bam": "{"mary": "had a lamb"}" }"
It seems like Postgres is not processing the backslashes. I think that, because of http://www.postgresql.org/docs/8.3/interactive/sql-syntax-lexical.html, I am forced to use double backslashes. And that works, i.e. when the file contents are:
{"foo":"bar", "bam": "{\\"mary\\": \\"had a lamb\\"}" }
The copy command works. But is it correct to expect special treatment for the json data type? After all, the above is not valid JSON.
http://adpgtech.blogspot.ru/2014/09/importing-json-data.html
copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
PostgreSQL's default bulk-load format, text, is a tab-separated format in which backslashes must be escaped, because they have special meaning, e.g. for the \N null placeholder.
Observe what PostgreSQL generates:
regress=> COPY foo TO stdout;
{"foo":"bar", "bam": "{\\"mary\\": \\"had a lamb\\"}" }
This isn't a special case for json at all, it's true of any string. Consider, for example, that a string - including json - might contain embedded tabs. Those must be escaped to prevent them from being seen as another field.
You'll need to generate your input data properly escaped. Rather than trying to use the PostgreSQL specific text format, it'll generally be easier to use format csv and use a tool that writes correct CSV, with the escaping done for you on writing.
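To illustrate, a sketch for the foo table from the transcript (the file name and contents are hypothetical): if the file stores the value as a CSV-quoted field, with every embedded double quote doubled and the backslashes left alone, a CSV-mode copy loads it as valid json without any extra escaping:
-- ./tests/foo.csv would contain the single line:
-- "{""foo"":""bar"", ""bam"": ""{\""mary\"": \""had a lamb\""}"" }"
\copy foo from './tests/foo.csv' with (format csv)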
I want to import a CSV file into PostgreSQL 9.2, but the CSV file has "" (an empty quoted string) in the final column position to represent a NULL value:
"2","1001","9","2","0","0","130","","2012-10-22 09:33:07.073000000",""
which is mapped to a column of type timestamp. PostgreSQL doesn't like the "". I've tried to set the NULL option but maybe I'm not doing it correctly? I've tried NULL as '""', NULL '', NULL as '', and NULL "", but without success; here's my command:
COPY SCH.DEPTS
FROM 'H:/backups/DEPTS.csv'
WITH (
FORMAT CSV,
DELIMITER ',' ,
NULL '',
HEADER TRUE,
QUOTE '"'
)
but it fails with an error:
ERROR: invalid input syntax for type timestamp: ""
CONTEXT: COPY depts, line 2, column expirydate: ""
P.S. Is there a way to specify the string representation of Booleans to the COPY command? The utility that produced the CSVs (of which there are many) used "false" and "true".
The empty string ("") isn't a valid timestamp, and COPY doesn't appear to offer a FORCE NULL or FORCE EMPTY TO NULL mode; it has the reverse, FORCE NOT NULL, but that won't do what you want.
You probably need to COPY the data into a table with a text field for the timestamp (probably an UNLOGGED or TEMPORARY table), then use an INSERT INTO real_table SELECT col1, col2, col3, NULLIF(tscol,'') FROM temp_table;.
COPY should accept true and false as booleans, so you shouldn't have any issues there.
Alternately, read the CSV with a simple Python script and the csv module, and then use psycopg2 to COPY rows into Pg. Or just write new cleaned up CSV out and feed that into COPY. Or use an ETL tool that does data transforms like Pentaho Kettle or Talend.
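A minimal sketch of that staging-table route; every name except expirydate (taken from the error message) and SCH.DEPTS is hypothetical, so extend the column list to match all ten columns in the real file:
-- load everything as text first
CREATE UNLOGGED TABLE depts_staging (
    deptid     text,
    deptname   text,
    expirydate text
    -- ...remaining columns, all as text
);
COPY depts_staging
FROM 'H:/backups/DEPTS.csv'
WITH (FORMAT CSV, DELIMITER ',', HEADER TRUE, QUOTE '"');
-- convert '' to NULL while moving the rows into the real table
INSERT INTO SCH.DEPTS
SELECT deptid, deptname, NULLIF(expirydate, '')::timestamp
FROM depts_staging;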
This still seems to be an issue 5 years later. I ran into it today running PostgreSQL 9.6.8. As a workaround, before running the COPY command I use sed to replace all occurrences of "" with null, and then add NULL as 'null' to my COPY command, i.e.:
sed -i 's/""/null/g' myfile.csv
PGPASSWORD=<pwd> psql -h <host> -p <port> -d <db> -U <user>
-c "\copy mytable from myfile.csv WITH CSV DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL as 'null';"