Writing JSON column to Postgres using Pandas .to_sql - json

During an ETL process I needed to extract and load a JSON column from one Postgres database to another. We use Pandas for this, since it has so many ways to read and write data from different sources and destinations, and all the transformations can be written with Python and Pandas. We're quite happy with the approach, to be honest, but we hit a problem.
Usually it's quite easy to read and write the data: you use pandas.read_sql_table to read from the source and DataFrame.to_sql to write to the destination. But since one of the source tables had a column of the Postgres JSON type, to_sql crashed with the following error message.
df.to_sql(table_name, analytics_db)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 1201, in to_sql
chunksize=chunksize, dtype=dtype)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/pandas/io/sql.py", line 470, in to_sql
chunksize=chunksize, dtype=dtype)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/pandas/io/sql.py", line 1147, in to_sql
table.insert(chunksize)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/pandas/io/sql.py", line 663, in insert
self._execute_insert(conn, keys, chunk_iter)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/pandas/io/sql.py", line 638, in _execute_insert
conn.execute(self.insert_statement(), data)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
return meth(self, multiparams, params)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
compiled_sql, distilled_params
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
context)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
exc_info
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
context)
File "/home/ec2-user/python-virtual-environments/etl/local/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 459, in do_executemany
cursor.executemany(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'

I've been searching the web for a solution but couldn't find any, so here is what we came up with (there might be better ways, but at least this is a start if someone else runs into this).
Specify the dtype parameter in to_sql. We went from:
df.to_sql(table_name, analytics_db)
to:
df.to_sql(table_name, analytics_db, dtype={'name_of_json_column_in_source_table': sqlalchemy.types.JSON})
and it just works.
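For context, here is a minimal, self-contained sketch of that fix (the connection string and the table/column names are made up; adapt them to your setup):

import pandas as pd
import sqlalchemy

# hypothetical engine; replace with your real connection string
analytics_db = sqlalchemy.create_engine('postgresql://user:password@localhost:5432/analytics')

df = pd.DataFrame([{'id': 1, 'payload': {'a': 1, 'b': 2}}])

# dtype tells SQLAlchemy to bind the dict column as JSON instead of
# letting psycopg2 choke on the raw Python dict
df.to_sql('events', analytics_db, index=False,
          dtype={'payload': sqlalchemy.types.JSON})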

If you (re-)create the JSON column using json.dumps(), you're all set.
This way the data can be written not only with pandas' .to_sql() method, but also with PostgreSQL's much faster COPY (via psycopg2's copy_expert() or SQLAlchemy's raw_connection()).
For the sake of simplicity, let's assume that we have a column of dictionaries that should be written into a JSON(B) column:
import json
import pandas as pd

df = pd.DataFrame([['row1', {'a': 1, 'b': 2}],
                   ['row2', {'a': 3, 'b': 4, 'c': 'some text'}]],
                  columns=['r', 'kv'])

# conversion function:
def dict2json(dictionary):
    return json.dumps(dictionary, ensure_ascii=False)

# overwrite the dict column with json-strings
df['kv'] = df.kv.map(dict2json)
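With kv now holding JSON strings, both write paths look like this (a sketch: the engine URL is made up, and the COPY variant assumes a pre-existing table my_table(r text, kv jsonb)):

import io
import sqlalchemy

engine = sqlalchemy.create_engine('postgresql://user:password@localhost:5432/analytics')

# plain pandas path: the JSON strings are inserted as-is
df.to_sql('my_table', engine, index=False, if_exists='append')

# faster path: stream the frame through PostgreSQL's COPY
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

raw = engine.raw_connection()
try:
    cur = raw.cursor()
    cur.copy_expert("COPY my_table (r, kv) FROM STDIN WITH (FORMAT csv)", buf)
    raw.commit()
finally:
    raw.close()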

I am unable to comment on peralmq's answer, but in the case of PostgreSQL JSONB you can use:
from sqlalchemy.dialects import postgresql
dataframe.to_sql(..., dtype={"json_column": postgresql.JSONB})

Related

How to resolve coreferences without Internet using AllenNLP and coref-spanbert-large?

I want to resolve coreferences without Internet using AllenNLP and the coref-spanbert-large model.
I am trying to do it the way described here: https://demo.allennlp.org/coreference-resolution
My code:
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
predictor = Predictor.from_path(r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz")
example = 'Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen.Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates, two years younger, with whom he shared an enthusiasm for computers.'
pred = predictor.predict(document=example)
coref_res = predictor.coref_resolved(example)
print(pred)
print(coref_res)
When I have access to the internet, the code works correctly.
But when I don't have access to the internet, I get the following errors:
Traceback (most recent call last):
File "C:/Users/aap/Desktop/CoreNLP/Coref_AllenNLP.py", line 14, in <module>
predictor = Predictor.from_path(r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz")
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\predictors\predictor.py", line 361, in from_path
load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\models\archival.py", line 206, in load_archive
config.duplicate(), serialization_dir
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\models\archival.py", line 232, in _load_dataset_readers
dataset_reader_params, serialization_dir=serialization_dir
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 604, in from_params
**extras,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 632, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 391, in construct_arg
**extras,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 604, in from_params
**extras,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\token_indexers\pretrained_transformer_mismatched_indexer.py", line 63, in __init__
**kwargs,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\token_indexers\pretrained_transformer_indexer.py", line 58, in __init__
model_name, tokenizer_kwargs=tokenizer_kwargs
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\tokenizers\pretrained_transformer_tokenizer.py", line 71, in __init__
model_name, add_special_tokens=False, **tokenizer_kwargs
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\cached_transformers.py", line 110, in get_tokenizer
**kwargs,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 362, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\models\auto\configuration_auto.py", line 368, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\configuration_utils.py", line 424, in get_config_dict
use_auth_token=use_auth_token,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\file_utils.py", line 1087, in cached_path
local_files_only=local_files_only,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\file_utils.py", line 1268, in get_from_cache
"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
Process finished with exit code 1
Please tell me, what do I need to do so that my code works without the Internet?
You will need a local copy of transformer model's configuration file and vocabulary so that the tokenizer and token indexer don't need to download those:
from transformers import AutoConfig, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model_name)
config = AutoConfig.from_pretrained(transformer_model_name)
tokenizer.save_pretrained(local_config_path)
config.to_json_file(local_config_path + "/config.json")
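In that snippet, transformer_model_name and local_config_path are placeholders. For coref-spanbert-large they might look like this (the transformer name is my assumption; verify it against the archive's config.json):

transformer_model_name = "SpanBERT/spanbert-large-cased"  # assumption: check the model config
local_config_path = r"C:\Users\aap\Desktop\local_spanbert"  # any writable folder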
You will then need to override the transformer model name in the configuration file to the local directory (local_config_path) where you saved these things:
predictor = Predictor.from_path(
    r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz",
    overrides={
        "dataset_reader.token_indexers.tokens.model_name": local_config_path,
        "validation_dataset_reader.token_indexers.tokens.model_name": local_config_path,
        "model.text_field_embedder.tokens.model_name": local_config_path,
    },
)
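As an extra safety net, recent versions of the transformers library honor an offline switch, so any accidental download attempt fails fast instead of stalling (assuming your transformers version supports it):

import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # must be set before transformers is imported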
I ran into a similar problem when using structured-prediction-srl-bert without internet, and I saw four downloads in the logs:
dataset_reader.bert_model_name = bert-base-uncased, Downloading 4 files
model INFO vocabulary.py - Loading token dictionary from data/structured-prediction-srl-bert.2020.12.15/vocabulary. Downloading... 4x smaller files
Spacy model 'en_core_web_sm' not found
later on, [nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary failure in name resolution> [nltk_data] Error loading wordnet: <urlopen error [Errno -3] Temporary failure in name resolution>
I solved it with these steps:
structured-prediction-srl-bert:
I downloaded structured-prediction-srl-bert.2020.12.15.tar.gz from https://demo.allennlp.org/semantic-role-labeling (Model Card tab):
https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz
I unzipped it into ./data/structured-prediction-srl-bert.2020.12.15
The code:
pip install allennlp==2.10.0 allennlp-models==2.10.0
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("./data/structured-prediction-srl-bert.2020.12.15/")
bert-base-uncased
I created a folder ./data/bert-base-uncased and downloaded these files into it from https://huggingface.co/bert-base-uncased/tree/main:
config.json
tokenizer.json
tokenizer_config.json
vocab.txt
pytorch_model.bin
Additionally, I had to change "bert_model_name" from "bert-base-uncased" to the path "./data/bert-base-uncased", since the former triggers the download. This has to be done in ./data/structured-prediction-srl-bert.2020.12.15/config.json, where there are two occurrences (a sketch of that edit follows below).
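A sketch of that edit done programmatically (the two key paths are my assumption based on the log above; adjust them to wherever bert_model_name actually appears in your config.json):

import json

cfg_path = "./data/structured-prediction-srl-bert.2020.12.15/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# assumed locations of the two occurrences:
cfg["dataset_reader"]["bert_model_name"] = "./data/bert-base-uncased"
cfg["model"]["bert_model_name"] = "./data/bert-base-uncased"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)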
python -m spacy download en_core_web_sm
python -c 'import nltk; nltk.download("punkt"); nltk.download("wordnet")'
After these steps, AllenNLP did not need the internet anymore.

MySQL unicode decode error for emojis when the Django queryset is converted to a list

I am on Django 1.11, Python 3.6 and mysqlclient 1.4.6. I need to convert a Django queryset into a list. The objects in the queryset have a field with a unicode emoji value (e.g. field=u'✅ This is the field value'). That particular table and field in MySQL use the utf8mb4_unicode_ci collation.
When I run qs_list = list(qs), MySQL throws a Unicode error which I think is caused by the emoji in the field when the queryset is evaluated.
File encoding:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
Queryset:
qs = Fake.objects.filter(..)
qs_list = list(qs)
Error:
<class 'UnicodeDecodeError'>
Exception: 'utf-8' codec can't decode byte 0xed in position 11: invalid continuation byte
File "/home/mysite/lib/python3.6/site-packages/django/db/models/query.py", line 250, in __iter__
self._fetch_all()
File "/home/mysite/lib/python3.6/site-packages/django/db/models/query.py", line 1121, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/home/mysite/lib/python3.6/site-packages/django/db/models/query.py", line 53, in __iter__
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
File "/home/mysite/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
raise original_exception
File "/home/mysite/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
cursor.execute(sql, params)
File "/home/mysite/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
return self.cursor.execute(sql, params)
File "/home/mysite/lib/python3.6/site-packages/django/db/backends/mysql/base.py", line 101, in execute
return self.cursor.execute(query, args)
File "/home/mysite/lib/python3.6/site-packages/MySQLdb/cursors.py", line 209, in execute
res = self._query(query)
File "/home/mysite/lib/python3.6/site-packages/MySQLdb/cursors.py", line 317, in _query
self._post_get_result()
File "/home/mysite/lib/python3.6/site-packages/MySQLdb/cursors.py", line 352, in _post_get_result
self._rows = self._fetch_row(0)
File "/home/mysite/lib/python3.6/site-packages/MySQLdb/cursors.py", line 325, in _fetch_row
return self._result.fetch_row(size, self._fetch_type)
Is there a simple solution or workaround for this? How can I get the list of objects in the queryset without this error when it is evaluated?
Most probably this is related to the encoding; you need to check other encodings, as suggested in this article.
Regarding the list of objects: try iterating over the result to get (for example) each name and store the values in a list:
qs = Fake.objects.filter(..)
qs_list = [x.name for x in qs]
where name is the model field you need to iterate over.
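If the iteration still hits the decode error, the more fundamental fix is usually to make the Django connection itself speak utf8mb4 (a sketch, assuming the stored data really is utf8mb4; database names and credentials are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'OPTIONS': {'charset': 'utf8mb4'},  # lets mysqlclient decode emoji correctly
    }
}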

JSON Parsing with Nao robot - AttributeError

I'm using a NAO robot with NAOqi version 2.1 and Choregraphe on Windows. I want to parse JSON from a file attached to the behavior. I attached the file as in that link.
Code:
def onLoad(self):
    self.filepath = os.path.join(os.path.dirname(ALFrameManager.getBehaviorPath(self.behaviorId)), "fileName.json")

def onInput_onStart(self):
    with open(self.filepath, "r") as f:
        self.data = self.json.load(f.get_Response())
    self.dataFromFile = self.data['value']
    self.log("Data from file: " + str(self.dataFromFile))
But when I run this code on the robot (connected via a router) I get an error:
[ERROR] behavior.box :_safeCallOfUserMethod:281 _Behavior__lastUploadedChoregrapheBehaviorbehavior_1136151280__root__AbfrageKontostand_3__AuslesenJSONDatei_1: Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/naoqi.py", line 271, in _safeCallOfUserMethod
func()
File "<string>", line 20, in onInput_onStart
File "/usr/lib/python2.7/site-packages/inaoqi.py", line 265, in <lambda>
__getattr__ = lambda self, name: _swig_getattr(self, behavior, name)
File "/usr/lib/python2.7/site-packages/inaoqi.py", line 55, in _swig_getattr
raise AttributeError(name)
AttributeError: json
I already tried to understand the code at the corresponding lines, but I couldn't fix the error. I do know that the type of my object f is 'file'. How can I open the JSON file as JSON?
Your problem comes from this:
self.json.load(f.get_Response())
...there is no such thing as "self.json" on a Choregraphe box; import json and then call json.load. And what is get_Response? That method doesn't exist on anything in Python that I know of.
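A minimal corrected version of the box method under that advice (a sketch; json.load reads straight from the file object, and get_Response is dropped entirely):

import json  # at the top of the box script

def onInput_onStart(self):
    with open(self.filepath, "r") as f:
        self.data = json.load(f)
    self.dataFromFile = self.data['value']
    self.log("Data from file: " + str(self.dataFromFile))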
You might first want to make a standalone Python script (one that doesn't use the robot) that can read your JSON file before you try it in Choregraphe. It will be easier.

Flask TypeError 'is not JSON serializable' - nested dictionary

I am using Flask as the framework for my server, and while returning a response I get the following error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\flask\app.py", line 1612, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Python27\lib\site-packages\flask\app.py", line 1598, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Python27\lib\site-packages\flask_restful\__init__.py", line 480, in wrapper
resp = resource(*args, **kwargs)
File "C:\Python27\lib\site-packages\flask\views.py", line 84, in view
return self.dispatch_request(*args, **kwargs)
File "C:\Python27\lib\site-packages\flask_restful\__init__.py", line 595, in dispatch_request
resp = meth(*args, **kwargs)
File "rest.py", line 27, in get
return jsonify(**solution)
File "C:\Python27\lib\site-packages\flask\json.py", line 263, in jsonify
(dumps(data, indent=indent, separators=separators), '\n'),
File "C:\Python27\lib\site-packages\flask\json.py", line 123, in dumps
rv = _json.dumps(obj, **kwargs)
File "C:\Python27\lib\json\__init__.py", line 251, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "C:\Python27\lib\json\encoder.py", line 209, in encode
chunks = list(chunks)
File "C:\Python27\lib\json\encoder.py", line 434, in _iterencode
for chunk in _iterencode_dict(o, _current_indent_level):
File "C:\Python27\lib\json\encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 332, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 332, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 442, in _iterencode
o = _default(o)
File "C:\Python27\lib\site-packages\flask\json.py", line 80, in default
return _json.JSONEncoder.default(self, o)
File "C:\Python27\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: {'origin': u'porto', 'dest': u'lisboa', 'price': '31', 'date': '2017-12-23', 'url': u'https://www.google.pt/flights/#search;f=opo;t=lis;d=2017-12-23;r=2017-12-24'} is not JSON serializable
I have the following function:
from flask import Flask, request, jsonify
from flask_restful import Resource, Api
from flask_cors import CORS, cross_origin
from json import dumps
import flights
import solveProblem

app = Flask(__name__)
api = Api(app)
CORS(app)

class Flights(Resource):
    def get(self, data):
        print 'received data from client: ' + data
        solution = solveProblem.solve(data)
        print 'got the solution from the script! \nSOLUTION: \n'
        print solution
        return jsonify(solution)

api.add_resource(Flights, '/flights/<string:data>')

if __name__ == '__main__':
    app.run()
While debugging the problem, I found the following solutions, which did not work:
1) return solution instead of {'solution': solution}
2) do jsonify(solution)
3) do jsonify(**solution)
None of the above worked for me.
I wonder why this happens when I am trying to return a valid dictionary:
{'flights': [[{'origin': u'porto', 'dest': u'lisboa', 'price': '31', 'date': '2017-12-23', 'url': u'https://www.google.pt/flights/#search;f=opo;t=lis;d=2017-12-23;r=2017-12-24'}]], 'cost': '31'}
Any help is appreciated.
Thanks
My guess is that when you were creating solution, the data that got assigned to it was an incorrectly formatted dictionary:
{'item', 'value'}
instead of:
{'item': 'value'}
thus creating a set instead of a dict.
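You can check the difference directly; json can serialize a dict but not a set:

import json

json.dumps({'item': 'value'})  # fine: a dict
json.dumps({'item', 'value'})  # TypeError: set is not JSON serializable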
You cannot use jsonify directly when you are trying to convert a list of data to JSON.
There are two approaches: you can convert the list into a dictionary, which means writing a conversion function, and that is the complicated route.
The smarter option is to use the Marshmallow library: it serializes your list data, and after that you can use jsonify.
In flask-restful, a Resource class's get method just needs to return a Python data structure, so remove jsonify. For a user-defined object, you can use the marshal_with() decorator, as in the sketch below.
See more: https://flask-restful.readthedocs.io/en/latest/quickstart.html#a-minimal-api
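A sketch of what that could look like here (the field mapping is my guess at the shape of the flight dictionaries; adjust it to the real keys):

from flask_restful import Resource, fields, marshal_with
import solveProblem  # the OP's module

flight_fields = {
    'origin': fields.String,
    'dest': fields.String,
    'price': fields.String,
    'date': fields.String,
    'url': fields.String,
}

class Flights(Resource):
    @marshal_with(flight_fields)
    def get(self, data):
        # return the plain structure; flask-restful serializes it
        return solveProblem.solve(data)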
Since most of your functions are declared elsewhere, I wrote a toy Flask program just to pass the dictionary you got stuck with.
[Edit] Before, I was using the standard Python json module. I edited the program to use Flask's own jsonify, and it still works when given the dictionary directly, so the error is not where the OP is looking.
{'flights': [[{'origin': u'porto', 'dest': u'lisboa', 'price': '31', 'date': '2017-12-23', 'url': u'https://www.google.pt/flights/#search;f=opo;t=lis;d=2017-12-23;r=2017-12-24'}]], 'cost': '31'}
The following program runs and returns the dictionary as a JSON object:
import flask

app = flask.Flask(__name__)

@app.route('/')
def hello():
    # the dictionary from the question, passed straight to jsonify
    jdic = flask.jsonify({'origin': u'porto', 'dest': u'lisboa', 'price': '31', 'date': '2017-12-23', 'url': u'https://www.google.pt/flights/#search;f=opo;t=lis;d=2017-12-23;r=2017-12-24'})
    return jdic

if __name__ == '__main__':
    app.run()
As I found out, this error generally occurs when the response is not a pure Python dictionary. It happened to me because I was trying to pass a class object. To solve the problem, I created a class method that returns a dictionary describing the object and used that to create the JSON response.
Conclusion: use pure Python objects, which are easily translated to JSON.
I had the same problem with a three-level nested dictionary; it was valid and JSON serializable, and json.dumps on the command line had no issue with it. However, Flask did not want to output it: TypeError, not JSON serializable. The only difference is that I am using Python 3.5.
So I made a copy of it as a string (which on the command line was JSON serializable!) and passed that to the Flask output, and it worked.
Try to pass the nested JSON as:
eval(str(solution))
and see the error. It's not a definitive solution, more a workaround.
Hope it helps.

Python CSV Has No Attribute 'Writer'

There's a bit of code giving me trouble. It was working great in another script I had, but I must have messed it up somehow.
The if csv: is there primarily because I was relying on a -csv option in an argparser. But even if I run this with proper indents outside the if statement, it still returns the same error.
import csv

if csv:
    with open('output.csv', 'wb') as csvfile:
        csvout = csv.writer(csvfile, delimiter=',',
                            quotechar=',', quoting=csv.QUOTE_MINIMAL)
        csvout.writerow(['A', 'B', 'C'])
    csvfile.close()
Gives me:
Traceback (most recent call last):
File "import csv.py", line 34, in <module>
csvout = csv.writer(csvfile, delimiter=',',
AttributeError: 'str' object has no attribute 'writer'
If I remove the if statement, I get:
Traceback (most recent call last):
File "C:\import csv.py", line 34, in <module>
csvout = csv.writer(csvfile, delimiter=',',
AttributeError: 'NoneType' object has no attribute 'writer'
What silly thing am I doing wrong? I did try changing the file name to things like test.py, as I saw suggested in another SO post, but it didn't work.
In my case, I had named my file csv.py, so when I imported csv from that file I was essentially trying to import the file itself.
If you've assigned something to the name csv (it looks like a string here), then you're shadowing the module import. The simplest fix is to change whatever is assigning to csv that isn't the module and call it something else...
In effect, what's happening is:
import csv
csv = 'bob'
csvout = csv.writer(somefile)
Remove the further assignment to csv and go from there (a sketch of the repaired pattern follows below).
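A sketch of the repaired pattern, assuming the stray assignment came from the -csv argparse option (the flag and file names are illustrative):

import argparse
import csv  # the module keeps its name; the flag lives on args

parser = argparse.ArgumentParser()
parser.add_argument('-csv', action='store_true')
args = parser.parse_args()

if args.csv:
    with open('output.csv', 'wb') as csvfile:  # 'wb' on Python 2; use 'w' with newline='' on Python 3
        csvout = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csvout.writerow(['A', 'B', 'C'])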
In my case, my function name happened to be csv(). Once I renamed the function, the error disappeared.