I am parsing a Pandas column of type string that contains JSON, such as the following:
kafka_data["MESSAGE_DATA__C"].iloc[0]
Out[20]: '{"userId":"af33f42e","trackingCategory":"ACTION","trackedItem":{"id":"PERSONAL_IDENTIFICATION_STARTED","category":"PERSONAL_IDENTIFICATION","title":"Personal Identification Started"}}'
When I parse a single row, everything works:
json.loads(kafka_data["MESSAGE_DATA__C"].iloc[0])
Out[25]:
{'userId': 'af33f42e',
'trackingCategory': 'ACTION',
'trackedItem': {'id': 'PERSONAL_IDENTIFICATION_STARTED',
'category': 'PERSONAL_IDENTIFICATION',
'title': 'Personal Identification Started'}}
But when I try to parse the whole column at once, an error is raised:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 212 (char 211)
Am I missing anything?
I need to read this column into a new dataframe.
When applying a function across a DataFrame row by row, pass the axis=1 parameter. Please try:
kafka_data[["MESSAGE_DATA__C"]].apply(lambda row: json.loads(str(row["MESSAGE_DATA__C"])), axis=1)
I have made a new table with three columns,
customer_id, media_urls, survey_taste,
in a database in pgAdmin, with the types
int, text[], jsonb
respectively.
I have a CSV that I was trying to import into this table using pgAdmin, and the contents of that file are like this:
1,"{'http://example.com','http://example.com'}","{'taste':[1,2,3,4]}"
but I am getting this error:
ERROR: invalid input syntax for type json
DETAIL: Token "'" is invalid.
CONTEXT: JSON data, line 1: '...
COPY survey_taste, line 2, column survey_taste: "{'taste': [-0.19101654669350904, 0.08575981750112513, 0.07133783942655376, -0.10579014363010293, 0.0..."
To address your comments in reverse order: to have this entered in one field, you would need to have it as:
'[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]'
Per:
select '[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]'::json;
json
---------------------------------------------------
[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]
As to the quoting issue:
When you write a dict to CSV, its Python string representation is used, so you will get:
d = {"taste":[1,2,3,4]}
print(d)
{'taste': [1, 2, 3, 4]}
What you need is:
import json
json.dumps(d)
'{"test": [1, 2, 3, 4]}'
Using json.dumps will turn the dict into a proper JSON string representation.
Putting it all together:
# Create list of dicts
l = [{'http': 'abc', 'http': 'abc'}, {'taste': [1,2,3,4]}]
# Create JSON string representation
json.dumps(l)
'[{"http": "abc"}, {"taste": [1, 2, 3, 4]}]'
I'm trying to import a large geojson file into a Postgres table. In order to do so, I first convert the json into csv with python:
import pandas as pd
df = pd.read_json('myjson.txt')
df.to_csv('myjson.csv',sep='\t')
The resulting csv looks like:
name type features
0 geobase FeatureCollection {'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
The first three lines in the json file were:
{"name":"geobase","type":"FeatureCollection"
,"features":[
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[-73.7408048408216,45.5189595588608],[-73.7408749973688,45.5189893490944],[-73.7409267622838,45.5190212771795],[-73.7418867072278,45.5196401086022],[-73.7419636417947,45.5196917400376]]},"properties":{"ID_TRC":1010001,"DEB_GCH":12320,"FIN_GCH":12340}}
Following that, the copy command into my postgres table is:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,name,type,features) FROM '.../myjson.csv' with (format csv,header true, delimiter E'\t');"
This results in my table being filled with name, type and features. The first feature (a text field) is, for example, the following string:
{'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
In Postgres, when I try to read from this tmp table into another one:
SELECT features::json AS fc FROM geometries.geobase_tmp
I get the error :
SQL Error [22P02]: ERROR: invalid input syntax for type json
Detail : Token "'" is invalid.
Where : JSON data, line 1: {'...
It's as if Postgres expects double quotes rather than single quotes when parsing the JSON text. What can I do to avoid this problem?
EDIT: I followed the procedure described here (datatofish.com/json-string-to-csv-python) to convert JSON to CSV. The source (the json txt file) is valid JSON and contains only double quotes. After conversion, it's not valid JSON anymore (it contains single quotes instead of double quotes). Is there a way to output a CSV while keeping the double quotes?
I figured it out:
JSON to CSV:
import pandas as pd
import json
import csv
df = pd.read_json('myjson.txt')
# re-serialize geometry and properties with json.dumps so they keep double quotes
df['geom'] = df['features'].apply(lambda x: json.dumps(x['geometry']))
df['properties'] = df['features'].apply(lambda x: json.dumps(x['properties']))
# QUOTE_ALL quotes every field, so the embedded JSON survives the round trip
df[['geom','properties']].to_csv('myjson.csv', sep='\t', quoting=csv.QUOTE_ALL)
Now the CSV file looks like:
"" "geom" "properties"
"0" "{""type"": ""LineString"", ""coordinates"": [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}" "{""ID_TRC"": 1010001, ""DEB_GCH"": 12320, ""FIN_GCH"": 12340}"
...
Postgres tmp table created with:
CREATE TABLE geometries.geobase_tmp (
id int,
geom TEXT,
properties TEXT
)
Copy CSV content into tmp table:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,geom,properties) FROM 'myjson.csv' with (format csv,header true, delimiter E'\t');"
Creation of the final Postgres table, which contains the geometry and the properties (each property in its own field):
drop table if exists geometries.troncons;
SELECT
row_number() OVER () AS gid,
ST_GeomFromGeoJSON(geom) as geom,
properties::json->'ID_TRC' AS ID_TRC,
properties::json->'DEB_GCH' AS DEB_GCH,
properties::json->'FIN_GCH' AS FIN_GCH
INTO TABLE geometries.troncons
FROM geometries.geobase_tmp;
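One note on the final SELECT: the -> operator returns json, so the ID_TRC, DEB_GCH and FIN_GCH columns end up typed as json. If plain text or numeric columns are wanted instead, ->> returns text and can be cast, e.g.:
properties::json->>'ID_TRC' AS id_trc,            -- text
(properties::json->>'DEB_GCH')::int AS deb_gch    -- integer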
I'm trying to construct JSON with strings that contain "\n" in them, like this:
ver_str= 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
json_content = json.loads()
d =json.dumps(json_content )
I'm getting this error:
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Dev/python/new_tester/simple_main.py", line 18, in <module>
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
KeyError: '"r_content"'
The error arises not because of newlines in your values, but because of { and } characters in your format string other than the placeholders {0} and {1}. If you want to have an actual { or a } character in your string, double them.
Try replacing the line
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
with
comb = '{{"r_content": {0}, "s_version": {1}}}'.format(ver_str,proj_ver_str)
However, this will give you a different error on the next line, loads() missing 1 required positional argument: 's'. This is because you presumably forgot to pass comb to json.loads().
Replacing json.loads() with json.loads(comb) gives you another error: json.decoder.JSONDecodeError: Expecting value: line 1 column 15 (char 14). This tells you that you've given json.loads malformed JSON to parse. If you print out the value of comb, you see the following:
{"r_content": Package ID: version_1234
Build
number: 154
Built
, "s_version": Version_123}
This isn't valid JSON, because the string values aren't surrounded by quotes. So a JSON parsing error is to be expected.
At this point, let's take a look at what your code is doing and what you seem to want it to do. You want to construct a JSON string from your data, but your code builds a JSON string by hand, parses it into a dict, and then formats it back as a JSON string.
If you want to create a JSON string from your data, it's far simpler to create a dict with your values and use json.dumps on that:
d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
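With ver_str and proj_ver_str from the question, the embedded newlines are escaped automatically, so the result is valid JSON:
print(d)
# {"r_content": "Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n", "s_version": "Version_123"}
json.loads(d) then round-trips it back to the original dict, real newlines included.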
I have a pandas dataframe df. The dataframe has a field in it, json_field, that contains JSON data (I think it's actually YAML format). I'm trying to parse the values into their own columns. I have example data below. When I run the code below, I get an error when it hits the name field. The code and the error are below. It seems like it's having trouble because the value associated with the name field isn't quoted, or has spaces. Does anyone see what the issue might be, or can suggest how to fix it?
example:
print(df['json_field'][0])
output:
- start:"test"
name: Price Unit
field: priceunit
type: number
operator: between
value:
- '40'
- '60'
code:
import yaml
pd.io.json.json_normalize(yaml.load(df['json_field'][0]), 'name','field','type','operator','value').head()
error:
An error was encountered:
mapping values are not allowed here
in "<unicode string>", line 3, column 7:
name: Price Unit
^
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 94, in safe_load
return load(stream, SafeLoader)
File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 72, in load
return loader.get_single_data()
File "/usr/local/lib64/python3.6/site-packages/yaml/constructor.py", line 35, in get_single_data
node = self.get_single_node()
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 110, in compose_sequence_node
while not self.check_event(SequenceEndEvent):
File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 382, in parse_block_sequence_entry
if self.check_token(BlockEntryToken):
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 220, in fetch_more_tokens
return self.fetch_value()
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 580, in fetch_value
self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
in "<unicode string>", line 3, column 7:
name: Price Unit
desired output:
name field type operator value
Price Unit priceunit number between - '40' - '60'
Update:
I tried the suggestion below
import yaml
df['json_field']=df['json_field'].str.replace('"test"', "")
pd.io.json.json_normalize(yaml.safe_load(df['json_field'][0].lstrip()), 'name','field','type','operator','value').head()
and got the output below:
operator0 typefield
0 P priceunit
1 r priceunit
2 i priceunit
3 c priceunit
4 e priceunit
It seems to me that it's having an issue because your YAML is improperly formatted. That "test" should not be there at all if you want a dictionary named "start" with 5 mappings inside it.
import yaml
a = """
- start:
name: Price Unit
field: priceunit
type: number
operator: between
value:
- '40'
- '60'
""".lstrip()
# Single dictionary named start, with 5 entries
yaml.safe_load(a)[0]
{'start':
{'name': 'Price Unit',
'field': 'priceunit',
'type': 'number',
'operator': 'between',
'value': ['40', '60']}
}
To do this with your data, try:
data = yaml.safe_load(df['json_field'][0].replace('"test"', ""))
df = pd.json_normalize(data[0]["start"])
print(df)
name field type operator value
0 Price Unit priceunit number between [40, 60]
It seems that your real problem is one line earlier.
Note that your YAML fragment starts with start:"test". One of the key principles of YAML is that a key is separated from its value by a colon and a space, whereas your sample does not contain this (mandatory) space.
Maybe you should manually edit your input, i.e. add this missing space.
Another solution is to write a "specialized pre-parser" which adds such missing spaces and feeds its result into the YAML parser. This can be an option if the number of such cases is big; see the sketch below.
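A minimal sketch of such a pre-parser (assuming, as in your fragment, that keys sit at the start of a line; values containing colons, such as URLs, would need a more careful rule):
import re
def add_missing_spaces(text):
    # insert the mandatory space after "key:" when a value follows immediately
    return re.sub(r'^(\s*-?\s*\w+):(?=\S)', r'\1: ', text, flags=re.MULTILINE)
print(add_missing_spaces('- start:"test"\n  name: Price Unit'))
# - start: "test"
#   name: Price Unit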
I'm trying to design a load test for some APIs. I need to read the data from a CSV file. Here is my CSV file:
amount,description
100,"100 Added"
-150,"-150 removed"
20, "20 added"
The amount is a number.
My CSV data config looks like this:
(screenshot)
When I put the body data like this :
(screenshot)
I have this error :
{"timestamp":1529427563867,"status":400,"error":"Bad Request","message":"JSON parse error: Unrecognized token 'amount': was expecting ('true', 'false' or 'null'); nested exception is com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'amount': was expecting ('true', 'false' or 'null')\n
and when I change it to this:
(screenshot)
I have this error:
"timestamp":1529427739395,"status":400,"error":"Bad Request","message":"JSON parse error: Cannot deserialize value of type `java.math.BigDecimal` from String \"amount\": not a valid representation; nested exception is com.fasterxml.jackson.databind.exc.InvalidFormatException: Cannot deserialize value of type `java.math.BigDecimal` from String \"amount\": not a valid representation\n at [Source: (PushbackInputStream); line: 2, column: 13]
How should I pass the parameters to be able to read from the CSV file?
P.S. It works fine when I do everything without the CSV.
In your config, you have "Ignore first line" set to False.
Options:
1: Set "Ignore First Line" to True, for source files with header data included.
2: Leave the config setting as is, but remove the headers from the CSV source file.
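With the header handled either way, the JSON body can then reference the CSV columns via JMeter variables, for example (assuming your header names amount and description, and keeping amount unquoted since it is a number):
{"amount": ${amount}, "description": "${description}"}
That also explains your two errors: in the first attempt the literal header row was consumed as the first data row (hence the unrecognized token 'amount'), and in the second the quoted string "amount" could not be deserialized into a BigDecimal.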