Cannot read geojson string with single quotes into Postgres table - json

I'm trying to import a large GeoJSON file into a Postgres table. In order to do so, I first convert the JSON into CSV with Python:
import pandas as pd
df = pd.read_json('myjson.txt')
df.to_csv('myjson.csv',sep='\t')
The resulting CSV looks like:
name type features
0 geobase FeatureCollection {'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
The first three lines in the JSON file were:
{"name":"geobase","type":"FeatureCollection"
,"features":[
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[-73.7408048408216,45.5189595588608],[-73.7408749973688,45.5189893490944],[-73.7409267622838,45.5190212771795],[-73.7418867072278,45.5196401086022],[-73.7419636417947,45.5196917400376]]},"properties":{"ID_TRC":1010001,"DEB_GCH":12320,"FIN_GCH":12340}}
Following that, the COPY command into my Postgres table is:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,name,type,features) FROM '.../myjson.csv' with (format csv,header true, delimiter E'\t');"
This results in my table being filled with name, type and features. The first feature (a text field) is, for example, the following string:
{'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
In Postgres, when I try to read from this tmp table into another one:
SELECT features::json AS fc FROM geometries.geobase_tmp
I get the error:
SQL Error [22P02]: ERROR: invalid input syntax for type json
Detail : Token "'" is invalid.
Where : JSON data, line 1: {'...
It's as if Postgres expects double quotes rather than single quotes when parsing the JSON text. What can I do to avoid this problem?
EDIT: I followed the procedure described here (datatofish.com/json-string-to-csv-python) to convert the JSON to CSV. The source (the JSON txt file) is valid JSON and contains only double quotes. After conversion, it is no longer valid JSON (it contains single quotes instead of double quotes). Is there a way to output a CSV while keeping the double quotes?
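The single quotes come from pandas keeping the nested objects as Python dicts: to_csv then writes each dict's repr, while json.dumps produces the double-quoted form Postgres can parse. A minimal illustration (the feature is shortened):
import json

feature = {"type": "Feature", "properties": {"ID_TRC": 1010001}}
print(str(feature))         # {'type': 'Feature', 'properties': {'ID_TRC': 1010001}} -- Python repr, not valid JSON
print(json.dumps(feature))  # {"type": "Feature", "properties": {"ID_TRC": 1010001}} -- valid JSON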

I figured it out:
Json to csv:
import pandas as pd
import json
import csv
df = pd.read_json('myjson.txt')
# json.dumps serializes each nested dict with double quotes, i.e. as valid JSON
df['geom'] = df['features'].apply(lambda x: json.dumps(x['geometry']))
df['properties'] = df['features'].apply(lambda x: json.dumps(x['properties']))
# QUOTE_ALL wraps every field in quotes and doubles any embedded quotes,
# which the Postgres CSV format understands
df[['geom', 'properties']].to_csv('myjson.csv', sep='\t', quoting=csv.QUOTE_ALL)
Now the CSV file looks like:
"" "geom" "properties"
"0" "{""type"": ""LineString"", ""coordinates"": [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}" "{""ID_TRC"": 1010001, ""DEB_GCH"": 12320, ""FIN_GCH"": 12340}"
...
Postgres tmp table created with:
CREATE TABLE geometries.geobase_tmp (
    id int,
    geom TEXT,
    properties TEXT
);
Copy CSV content into tmp table:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,geom,properties) FROM 'myjson.csv' with (format csv,header true, delimiter E'\t');"
Creation of the final Postgres table, which contains the geometry and the properties (each property in its own field):
DROP TABLE IF EXISTS geometries.troncons;
SELECT
    row_number() OVER () AS gid,
    ST_GeomFromGeoJSON(geom) AS geom,
    properties::json->'ID_TRC' AS ID_TRC,
    properties::json->'DEB_GCH' AS DEB_GCH,
    properties::json->'FIN_GCH' AS FIN_GCH
INTO TABLE geometries.troncons
FROM geometries.geobase_tmp;

Related

Parsing a Pandas column in JSON format

I am parsing a Pandas column of type string that is in JSON format, such as the following:
kafka_data["MESSAGE_DATA__C"].iloc[0]
Out[20]: '{"userId":"af33f42e","trackingCategory":"ACTION","trackedItem":{"id":"PERSONAL_IDENTIFICATION_STARTED","category":"PERSONAL_IDENTIFICATION","title":"Personal Identification Started"}}'
When I parse a single row everything works
json.loads(kafka_data["MESSAGE_DATA__C"].iloc[0])
Out[25]:
{'userId': 'af33f42e',
'trackingCategory': 'ACTION',
'trackedItem': {'id': 'PERSONAL_IDENTIFICATION_STARTED',
'category': 'PERSONAL_IDENTIFICATION',
'title': 'Personal Identification Started'}}
But when I try to parse the whole column at once, an error is raised:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 212 (char 211)
Am I missing anything?
I need to read this column into a new dataframe.
When applying a function to an entire column, use the axis=1 parameter.
Please try:
kafka_data[["MESSAGE_DATA__C"]].apply(lambda row: json.loads(str(row["MESSAGE_DATA__C"])), axis=1)

ERROR: invalid input syntax for type json DETAIL: Token "'" is invalid. while importing csv in pgadmin

I have made a new table with three columns, customer_id, media_urls and survey_taste, in a database in pgAdmin, with the types int, text[] and jsonb respectively.
I have a CSV that I was trying to import into this table using pgAdmin, and the contents of that file are like this:
1,"{'http://example.com','http://example.com'}","{'taste':[1,2,3,4]}"
but I am getting this error
ERROR: invalid input syntax for type json
DETAIL: Token "'" is invalid.
CONTEXT: JSON data, line 1: '...
COPY survey_taste, line 2, column survey_taste: "{'taste': [-0.19101654669350904, 0.08575981750112513, 0.07133783942655376, -0.10579014363010293, 0.0..."
To address your comments in reverse order: to have this entered in one field, you would need to have it as:
'[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]'
Per:
select '[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]'::json;
json
---------------------------------------------------
[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]
As to the quoting issue:
When you pass a dict to csv you will get its Python repr:
d = {"taste": [1, 2, 3, 4]}
print(d)
{'taste': [1, 2, 3, 4]}
What you need is:
import json
json.dumps(d)
'{"test": [1, 2, 3, 4]}'
Using json.dumps will turn the dict into a proper JSON string representation.
Putting it all together:
# Create list of dicts
l = [{'http': 'abc', 'http': 'abc'}, {'taste': [1,2,3,4]}]
# Create JSON string representation
json.dumps(l)
'[{"http": "abc"}, {"taste": [1, 2, 3, 4]}]'

How to import specific fields from objects from a JSON file to MYSQL 8.0

With MySQL 8 you can import JSON data directly using the jsondata option of the --import command. This is the official link: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-json-import-command.html
But how do you import specific fields from the JSON data into specific columns of an already defined table?
For example: if you have a countries table with (id, code, name) and a JSON file with a structure like
[
  {
    "id": 1,
    "code": "foo",
    "name": "bar",
    "otherField1": "baz",
    "otherField2": "baz2"
  },
  {
    ...
  }
]
Do it in two steps. First you import the JSON file into a staging table, then you insert into your final table from the staging table.
Something like
CREATE TABLE countries_staging (json_text VARCHAR(1000));
mysqlsh user@localhost:33062 --import /jsonpath/countries.json countries_staging jsondata --schema=mydb
INSERT INTO countries (id, code, name, otherfield1, otherfield2)
SELECT JSON_EXTRACT(json_text, "$.id"),
JSON_EXTRACT(json_text, "$.code"),
JSON_EXTRACT(json_text, "$.name"),
JSON_EXTRACT(json_text, "$.otherfield1"),
JSON_EXTRACT(json_text, "$.otherfield2")
FROM countries_staging;
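If the load can be driven from Python instead of MySQL Shell, another option is to parse the file and insert only the wanted fields directly. A sketch, assuming the mysql-connector-python package and placeholder connection parameters:
import json
import mysql.connector  # assumes mysql-connector-python is installed

conn = mysql.connector.connect(user='user', password='...', host='localhost', database='mydb')
cur = conn.cursor()

with open('/jsonpath/countries.json') as f:
    countries = json.load(f)

# Pick only the JSON fields that map to columns of the countries table
cur.executemany(
    'INSERT INTO countries (id, code, name, otherfield1, otherfield2) VALUES (%s, %s, %s, %s, %s)',
    [(c['id'], c['code'], c['name'], c['otherField1'], c['otherField2']) for c in countries],
)
conn.commit()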

Select (ignore if does not exists) for JSON logs Spark SQL

I am new to Apache Spark and trying out a few POCs around it. I am trying to read JSON logs which are structured, but a few fields are not always guaranteed; for example:
{
"item": "A",
"customerId": 123,
"hasCustomerId": true,
.
.
.
},
{
"item": "B",
"hasCustomerId": false,
.
.
.
}
Assume I want to transform these JSON logs into CSV. I was trying out Spark SQL to get hold of all the fields with simple SELECT statements, but since the second JSON record is missing a field (although it does have an identifier), I am not sure how to handle this.
I want to transform the above JSON logs into:
item, customerId, ....
A , 123 , ....
B , null/0 , ....
You should use SQLContext to read the JSON file: sqlContext.read.json("file/path"). But if you want to convert it into CSV first and then read it with missing values, your CSV file should look like:
item,customerId,hasCustomerId, ....
A,123,, .... // hasCustomerId is null
B,,888, .... // customerId is null
i.e. an empty field. Then you have to read it like this:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("file/path")

HIVE escaped by not working '\\'

I have a dataset in S3:
123, "some random, text", "", "", 236
I build an external table on this dataset:
CREATE EXTERNAL TABLE db1.myData(
  field1 bigint,
  field2 string,
  field3 string,
  field4 string,
  field5 bigint)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
LOCATION 's3n://thisMyData/';
Problem/Issue:
When I do
select * from db1.myData
field2 is shown as
some random
I need the field to be
some random, text
Gotchas:
1. I cannot change the delimiter, as there are ~300 .csv files at this location.
2. ESCAPED BY is not escaping the '\\'.
3. I'm using Hive 0.13, so I cannot use the CSV SerDe, and I'm not allowed to import new jars to the cluster (it's a complicated process to add a new jar, as I have to go through Director-level approvals).
Question:
Is there a workaround for making 'ESCAPED BY' work?
Any other workarounds for this?
All suggestions are welcome!
N.B.: This is not a repeat question. If you think it's a repeat, please guide me to the right page and I will take it off this portal :)
I had to use: ESCAPED BY '\134' which translates to: ESCAPED BY '\'.
Additionally, because I was calling the Athena CREATE TABLE statement by passing in the statement from a JSON file, I had to add an extra \ to escape the original \ in the JSON. So my final statement within the JSON file looked like this: ESCAPED BY '\\134'.
If you are using Hive 0.14, you can use CSV Serde like this:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Refer to the link below for details:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde