PostgreSQL JSON to columns error: Character with value must be escaped

I am trying to load some data from a table containing JSON rows.
One field can contain special characters such as \t and \r, and I want to keep them as is in the new table.
Here is my file:
{"text_sample": "this is a\tsimple test", "number_sample": 4}
Here is what I do:
Drop table if exists temp_json;
Drop table if exists test;
create temporary table temp_json (values text);
copy temp_json from '/path/to/file';
create table test as (select
(values->>'text_sample') as text_sample,
(values->>'number_sample') as number_sample
from (
select replace(values,'\','\\')::json as values
from temp_json
) a);
I keep getting this error:
ERROR: invalid input syntax for type json
DETAIL: Character with value 0x09 must be escaped.
CONTEXT: JSON data, line 1: ...g] Objection to PDDRP Mediation (was Re: Call for...
How do I need to escape those characters?
Thanks a lot

Copy the file as csv with a different quoting character and delimiter:
drop table if exists test;
create table test (values jsonb);
\copy test from '/path/to/file.csv' with (format csv, quote '|', delimiter ';');
select values ->> 'text_sample', values ->> 'number_sample'
from test;
?column? | ?column?
-----------------------------+----------
this is a simple test | 4

As mentioned in Andrew Dunstan's PostgreSQL and Technical blog:
In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.
So we have to turn to CSV format mode instead:
copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
The official sql-copy documentation lists the following parameters:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]
where option can be one of:
FORMAT format_name
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
HEADER [ boolean ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL ( column_name [, ...] )
FORCE_NULL ( column_name [, ...] )
ENCODING 'encoding_name'
FORMAT
Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text.
QUOTE
Specifies the quoting character to be used when a data value is quoted. The default is double-quote. This must be a single one-byte character. This option is allowed only when using CSV format.
DELIMITER
Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format, a comma in CSV format. This must be a single one-byte character. This option is not allowed when using binary format.
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
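The effect of picking quote and delimiter characters that never occur in the data can be illustrated outside PostgreSQL. The following is a minimal sketch using Python's standard csv module, not COPY itself; the sample line and the variable names are assumptions made for illustration. With \x01 as the quote and \x02 as the delimiter, the whole JSON line comes back as one untouched field, backslash included.

```python
import csv
import io
import json

# Hypothetical one-line input file: JSON with an escaped tab (backslash + t).
raw_line = r'{"text_sample": "this is a\tsimple test", "number_sample": 4}'

# Quote and delimiter are control characters that should never occur in the data,
# mirroring quote e'\x01' delimiter e'\x02' in the COPY command.
reader = csv.reader(io.StringIO(raw_line + "\n"), delimiter="\x02", quotechar="\x01")
rows = list(reader)

# The whole line arrives as a single field and is still valid JSON.
parsed = json.loads(rows[0][0])
print(rows[0][0] == raw_line)
print(repr(parsed["text_sample"]))
```

Because no backslash processing happens in this mode, the JSON parser is the first thing to interpret the \t escape, which is why the tab survives in the extracted value.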

Cast the json to text instead of extracting the text value from the json. E.g.:
t=# with j as (
select '{"text_sample": "this is a\tsimple test", "number_sample": 4}'::json v
)
select v->>'text_sample' your, (v->'text_sample')::text better
from j;
your | better
-----------------------------+--------------------------
this is a simple test | "this is a\tsimple test"
(1 row)
To avoid the 0x09 error, try using
replace(values,chr(9),'\t')
because your example replaces literal backslashes, not the actual chr(9) character.
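Why the raw tab is rejected, and why re-escaping it helps, can be sketched in Python. This is only an analogy, assuming that COPY in text mode has already turned backslash + t into a real tab; the helper is_valid_json and the variable names are hypothetical, and Python's strict JSON parser stands in for PostgreSQL's.

```python
import json

# What the file contains: JSON with an escaped tab (backslash + t).
line_in_file = r'{"text_sample": "this is a\tsimple test", "number_sample": 4}'

# Assumption for illustration: text-mode loading decoded backslash + t into chr(9).
# Only this one escape is mimicked here.
after_text_copy = line_in_file.replace("\\t", "\t")


def is_valid_json(text):
    """Return True if text parses as JSON under Python's default strict rules."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Re-escape the raw tab, the counterpart of replace(values, chr(9), '\t') in SQL.
repaired = after_text_copy.replace("\t", "\\t")

print(is_valid_json(after_text_copy))  # raw control character inside a string
print(is_valid_json(repaired))
```

The same idea would apply to \r with chr(13), since any raw control character inside a JSON string has to be escaped.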

Related

com.univocity.parsers.common.TextParsingException

Trying to read the below data from a CSV results in a com.univocity.parsers.common.TextParsingException exception:
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
Here's the PySpark (3.1.2) code used to read the data:
from pyspark.sql.dataframe import DataFrame
df = (spark.read.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header","false")
.option("multiline","true")
.option("quote",'"')
.option("escape",'\"')
.option("delimiter",",")
.option("unescapedQuoteHandling", "RAISE_ERROR")
.load('/mnt/source/analysis/error_in_csv.csv'))
This is the exception that I'm getting.
Caused by: com.univocity.parsers.common.TextParsingException: com.univocity.parsers.common.TextParsingException - Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=true
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=field selection: []
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=RAISE_ERROR
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character="
Quote escape escape character=null
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:623)
at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:389)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:335)
... 33 more
Caused by: com.univocity.parsers.common.TextParsingException: Unescaped quote character '"' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.
Internal state when error was thrown: line=2, column=3, record=1, charIndex=165, headers=[B1456741975-266, , {"m": {"difference": 60}}, , , , 2022-02-04T17:03:59.566Z]
at com.univocity.parsers.csv.CsvParser.handleValueSkipping(CsvParser.java:241)
at com.univocity.parsers.csv.CsvParser.handleUnescapedQuote(CsvParser.java:319)
at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:393)
at com.univocity.parsers.csv.CsvParser.parseSingleDelimiterRecord(CsvParser.java:177)
at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:109)
at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:581)
... 39 more
Can someone please advise? It looks like the quoted delimiter in the second line is causing this. Is there a way to avoid it without changing the source data itself?
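As a sanity check on the data itself, the sample rows can be parsed with a reader that treats a doubled quote as an escaped quote. This is a minimal sketch using Python's standard csv module rather than Spark or univocity, so it says nothing about which Spark option triggers the exception; it only suggests that the rows are well-formed under that convention. The variable names are assumptions.

```python
import csv
import io
import json

# The three sample rows from the question, unchanged.
sample = (
    'B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
    'B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z\n'
    'B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z\n'
)

# Defaults: delimiter ',', quotechar '"', doublequote=True (a doubled quote is a literal quote).
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(len(row), repr(row[2]), repr(row[3]))

# The JSON column of the first row decodes once the doubled quotes are collapsed.
first_json = json.loads(rows[0][2])
```

Under these rules every row has seven fields and the quoted comma in the second row is an ordinary one-character value.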

Cannot read geojson string with single quotes into Postgres table

I'm trying to import a large geojson file into a Postgres table. In order to do so, I first convert the json into csv with python:
import pandas as pd
df = pd.read_json('myjson.txt')
df.to_csv('myjson.csv',sep='\t')
The resulting csv looks like:
name type features
0 geobase FeatureCollection {'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
The first three lines in json file were:
{"name":"geobase","type":"FeatureCollection"
,"features":[
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[-73.7408048408216,45.5189595588608],[-73.7408749973688,45.5189893490944],[-73.7409267622838,45.5190212771795],[-73.7418867072278,45.5196401086022],[-73.7419636417947,45.5196917400376]]},"properties":{"ID_TRC":1010001,"DEB_GCH":12320,"FIN_GCH":12340}}
Following that, the copy command into my postgres table is:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,name,type,features) FROM '.../myjson.csv' with (format csv,header true, delimiter E'\t');"
This results in my table being filled with name, type and features. The first feature (a text field) is, for example, the following string:
{'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
In Postgres, when I try to read from this tmp table into another one:
SELECT features::json AS fc FROM geometries.geobase_tmp
I get the error :
SQL Error [22P02]: ERROR: invalid input syntax for type json
Detail : Token "'" is invalid.
Where : JSON data, line 1: {'...
It seems that Postgres expects double quotes, not single quotes, when parsing the JSON text. What can I do to avoid this problem?
EDIT: I followed the procedure described here (datatofish.com/json-string-to-csv-python) to convert json to csv. The source (the json txt file) is a valid json and contains only double quotes. After conversion, it's not a valid json anymore (it contains single quotes instead of double quotes). Is there a way to output a csv while keeping the double quotes?
I figured it out:
Json to csv:
import pandas as pd
import json
import csv
df = pd.read_json('myjson.txt')
df['geom'] = df['features'].apply(lambda x:json.dumps(x['geometry']))
df['properties'] = df['features'].apply(lambda x:json.dumps(x['properties']))
df[['geom','properties']].to_csv('myjson.csv',sep='\t',quoting=csv.QUOTE_ALL)
Now CSV file looks like:
"" "geom" "properties"
"0" "{""type"": ""LineString"", ""coordinates"": [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}" "{""ID_TRC"": 1010001, ""DEB_GCH"": 12320, ""FIN_GCH"": 12340}"
...
Postgres tmp table created with:
CREATE TABLE geometries.geobase_tmp (
id int,
geom TEXT,
properties TEXT
)
Copy CSV content into tmp table:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,geom,properties) FROM 'myjson.csv' with (format csv,header true, delimiter E'\t');"
Creation of final postgres table which contains geometry and properties (each property in its own field):
drop table if exists geometries.troncons;
SELECT
row_number() OVER () AS gid,
ST_GeomFromGeoJSON(geom) as geom,
properties::json->'ID_TRC' AS ID_TRC,
properties::json->'DEB_GCH' AS DEB_GCH,
properties::json->'FIN_GCH' AS FIN_GCH
INTO TABLE geometries.troncons
FROM geometries.geobase_tmp
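The difference between the two conversions can be reproduced without pandas. The following is a minimal sketch using only the standard library, with a shortened, hypothetical feature; it assumes the single quotes came from the nested dict being written out via its Python string form, whereas json.dumps produces real JSON that survives a CSV round trip.

```python
import csv
import io
import json

# Shortened, hypothetical feature in the shape of the question's data.
feature = {
    "type": "Feature",
    "geometry": {"type": "LineString", "coordinates": [[-73.7408048408216, 45.5189595588608]]},
    "properties": {"ID_TRC": 1010001, "DEB_GCH": 12320, "FIN_GCH": 12340},
}

as_repr = str(feature)                     # Python repr: single quotes, not JSON
as_json = json.dumps(feature["geometry"])  # real JSON: double quotes

# Write one tab-separated row with every field quoted, then read it back.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_ALL).writerow(["0", as_json])
written = buf.getvalue()
read_back = next(csv.reader(io.StringIO(written), delimiter="\t"))

print(as_repr[:20])
print(written[:40])
```

The doubled quotes in the written file are CSV escaping only; a CSV-aware reader collapses them again, so the value that reaches the table is the original JSON text.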

CSV Parsing Issue with Attoparsec

Here is my code that does CSV parsing, using the text and attoparsec
libraries:
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T
-- | Parse a field of a record.
field :: A.Parser T.Text -- ^ parser
field = fmap T.concat quoted <|> normal A.<?> "field"
where
normal = A.takeWhile (A.notInClass "\n\r,\"") A.<?> "normal field"
quoted = A.char '"' *> many between <* A.char '"' A.<?> "quoted field"
between = A.takeWhile1 (/= '"') <|> (A.string "\"\"" *> pure "\"")
-- | Parse a block of text into a CSV table.
comma :: T.Text -- ^ CSV text
-> Either String [[T.Text]] -- ^ error | table
comma text
| T.null text = Right []
| otherwise = A.parseOnly table text
where
table = A.sepBy1 record A.endOfLine A.<?> "table"
record = A.sepBy1 field (A.char ',') A.<?> "record"
This works well for a variety of inputs but does not work when there
is a trailing \n at the end of the input.
Current behaviour:
> comma "hello\nworld"
Right [["hello"],["world"]]
> comma "hello\nworld\n"
Right [["hello"],["world"],[""]]
Wanted behaviour:
> comma "hello\nworld"
Right [["hello"],["world"]]
> comma "hello\nworld\n"
Right [["hello"],["world"]]
I have been trying to fix this issue but I ran out of ideas. I am almost
certain that it will have to be something with A.endOfInput as that is the
significant anchor and the only "bonus" information we have. Any ideas on how
to work that into the code?
One possible idea is to look at the end of the string before running the
Attoparsec parser and remove the last character (or two in case of \r\n),
but that seems to be a hacky solution that I would like to avoid in my code.
Full code of the library can be found here: https://github.com/lovasko/comma

hive sql, serde how to not quote my fields?

Since the serde quotes fields with " by default, how can I leave my fields unquoted using serde?
I tried:
row format serde "org.apache.hadoop.hive.serde2.OpenCSVSerde"
with serdeproperties(
"separatorChar" = ",",
"quoteChar" = "")
But I'm getting
FAILED: SemanticException java.lang.StringIndexOutOfBoundsException: String index out of range: 0
You could achieve this by specifying \u0000 as the quote character. Since quoteChar expects a string, you should use this unicode version of NULL.
ROW FORMAT SERDE
"org.apache.hadoop.hive.serde2.OpenCSVSerde"
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\u0000")
This unicode NULL \u0000 is what is used by the CSV writer class as the value for NO_QUOTE_CHARACTER: http://www.java2s.com/Code/Java/Development-Class/AverysimpleCSVwriterreleasedunderacommercialfriendlylicense.htm
For some reason "quoteChar" = "\u0000" didn't work for me as suggested in Nirmal's answer above.
When saving to file without quotes around the fields, I use:
-- saving to file
INSERT OVERWRITE LOCAL DIRECTORY 'file:/home/sidazhou/temp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT *
FROM temp_table
;
PS. I know this isn't what's being asked, which concerns ROW FORMAT SERDE instead of ROW FORMAT DELIMITED FIELDS.

HIVE escaped by not working '\\'

I have a data-set in S3
123, "some random, text", "", "", 236
I built an external table on this dataset:
CREATE EXTERNAL TABLE db1.myData(
field1 bigint,
field2 string,
field3 string,
field4 string,
field5 bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LOCATION 's3n://thisMyData/';
Problem/ Issue :
when I do
select * from db1.myData
field2 is shown as
some random
I need the field to be
some random, text
Gotchas:
1. I cannot change the delimiter as there are over ~300 .csv files at this location
2. ESCAPED BY is not escaping the '\\'
3. I'm using Hive 0.13, so I cannot use the CSV SerDe, and I'm not allowed to import new jars to the cluster either (it's a complicated process to add a new jar, as I have to go through Director-level approvals)
Question:
Is there a workaround for making 'ESCAPED BY' work?
Are there any other workarounds for this?
All suggestions are welcome!
N.B.: This is not a repeat question. If you think it's a repeat, please point me to the right page and I will take this down :)
I had to use: ESCAPED BY '\134' which translates to: ESCAPED BY '\'.
Additionally, because I was calling the Athena create table statement by passing in the statement from a JSON file I had to add an extra \ to mask the original \ in JSON. So my final statement within the JSON file looked like this: ESCAPED BY '\\134'.
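The two layers of escaping described above can be checked with a short sketch. It uses Python only to decode the values; the JSON key "ddl" and the variable names are hypothetical, and nothing here talks to Hive or Athena.

```python
import json

# Octal 134 is the code point of the backslash character.
backslash = chr(0o134)

# Text as it would sit inside the JSON file: the backslash is doubled for JSON.
json_file_text = r'''{"ddl": "ESCAPED BY '\\134'"}'''

# After JSON decoding, a single backslash remains in the statement.
statement = json.loads(json_file_text)["ddl"]

print(repr(backslash))
print(statement)
```

JSON decoding removes one level of escaping, so the statement that is finally submitted contains the octal form with a single backslash.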
If you are using Hive 0.14 or later, you can use the CSV SerDe like this:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Refer to the link below for details:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde