Null values when migrating from MySQL to MongoDB - mysql

I need to migrate some tables from MySQL to MongoDB. After searching the web, it looks to me like a MySQL export to CSV followed by an import of that CSV into MongoDB should be the fastest and easiest way.
I'm exporting from MySQL using this query:
select * into outfile '/tmp/feed.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
from feeds;
But there is one problem.
If a MySQL field is NULL, the MySQL export writes \N (or \\N) into the CSV file.
When importing that file, MongoDB imports the \\N as a string instead of a NULL value.
The mongoimport option --ignoreBlanks will not work, because \\N is not "blank" from MongoDB's point of view.
So my questions:
1.) How can I avoid exporting NULL as \\N?
or
2.) How can mongoimport read/interpret \\N as NULL or an empty value?
By the way: it's not an option to post-process the CSV to search and replace the \\N.
One possible answer for 1.) could be modifying the select statement: SELECT IFNULL( field1, "" ). But in that case I have to define and check every column, and an export script would not be as flexible if all columns have to be listed in the select statement.
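One way around listing every column by hand would be to build that SELECT dynamically from information_schema. The following is only a sketch, assuming the pymysql driver; the host, credentials, schema/table names and the outfile path are placeholders:

import pymysql

# Sketch: wrap every column in IFNULL() so NULLs are exported as empty strings.
conn = pymysql.connect(host='localhost', user='user', password='secret', db='mydb')
with conn.cursor() as cur:
    cur.execute(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s "
        "ORDER BY ordinal_position",
        ('mydb', 'feeds'))
    cols = [row[0] for row in cur.fetchall()]
    select_list = ", ".join("IFNULL(`{0}`, '') AS `{0}`".format(c) for c in cols)
    cur.execute(
        "SELECT " + select_list + " INTO OUTFILE '/tmp/feed.csv' "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "FROM feeds")
conn.close()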
//Edit: while playing around with that import/export I found another problem: date fields, which are also imported as strings by mongoimport.

I would comment rather than adding an answer, but my reputation is still quite low...
What I've done in a project I'm working on is to do the migration using a Python script. I have the exported table in a CSV file. The code I use looks like this:
import csv
import pymongo
f = open( filename )
reader = csv.reader( f )
destinationItems = []
The following reads the column names (first row in CSV)
columns = next( reader )
The columns can be put in a tuple, which I call 'keys' here; the code is oblivious of the actual column names. Each row is then converted to a dictionary, ready to be amended to remove (or do something else with) NULLs.
keys = tuple( columns )
for row in reader:
    entry = dict( zip( keys, row ) )
and the following deals with NULL; in this case I drop a key altogether if its value is 'NULL' in the exported CSV:
    entry = { k: v for k, v in entry.items() if v != 'NULL' }
    destinationItems.append( entry )
Finally, update the MongoDB instance:
mongoClient = pymongo.MongoClient()
mongoClient['mydb'].mycollection.insert_many( destinationItems )
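To also cope with the \N markers from the question and the date columns mentioned in the edit, the per-row dictionary could be cleaned up a little further before appending it. This is only a sketch; the 'created_at' column name and the '%Y-%m-%d' format are assumptions to adjust to your schema:

from datetime import datetime

DATE_COLUMNS = ('created_at',)   # assumption: replace with your real date column names
DATE_FORMAT = '%Y-%m-%d'         # assumption: MySQL's default DATE format

def clean_value(key, value):
    # Treat MySQL's NULL marker in the CSV as a missing value.
    if value in ('\\N', 'NULL', ''):
        return None
    # Parse date columns so MongoDB stores real datetimes instead of strings.
    if key in DATE_COLUMNS:
        return datetime.strptime(value, DATE_FORMAT)
    return value

# Inside the loop, after building `entry`:
entry = { k: clean_value(k, v) for k, v in entry.items() }
entry = { k: v for k, v in entry.items() if v is not None }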

Related

Error parsing JSON: more than one document in the input (Redshift to Snowflake SQL)

I'm trying to convert a query from Redshift to Snowflake SQL.
The Redshift query looks like this:
SELECT
cr.creatives as creatives
, JSON_ARRAY_LENGTH(cr.creatives) as creatives_length
, JSON_EXTRACT_PATH_TEXT(JSON_EXTRACT_ARRAY_ELEMENT_TEXT (cr.creatives,0),'previewUrl') as preview_url
FROM campaign_revisions cr
The Snowflake query looks like this:
SELECT
cr.creatives as creatives
, ARRAY_SIZE(TO_ARRAY(ARRAY_CONSTRUCT(cr.creatives))) as creatives_length
, PARSE_JSON(PARSE_JSON(cr.creatives)[0]):previewUrl as preview_url
FROM campaign_revisions cr
It seems like JSON_EXTRACT_PATH_TEXT isn't converted correctly, as the Snowflake query results in an error:
Error parsing JSON: more than one document in the input
cr.creatives is formatted like this:
"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]"
It seems to me that you are not working with valid JSON data inside Snowflake.
Please review the file format used for the COPY INTO command.
If you open the "JSON" text provided in a text editor, note that the information is not parsed or formatted as JSON because of the quoting you have. Once your issue with double quotes / escaped quotes is handled, you should be able to make good progress.
(Image: proper JSON on the left, original data on the right.)
If you are not inclined to reload your data, see if you can create a JavaScript user-defined function to remove the quotes from your string; then you can use Snowflake to process the variant column.
The following code is a working proof of concept that removes the double quotes for you.
var textOriginal = '[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]';
function parseText(input) {
    var a = input.replaceAll('""', '"');
    a = JSON.parse(a);
    return a;
}
var x = parseText(textOriginal);
console.log(x);
For anyone else seeing this double double quote issue in JSON fields coming from CSV files in a Snowflake external stage (slightly different issue than the original question posted):
The issue is likely that you need to use the FIELD_OPTIONALLY_ENCLOSED_BY setting. Specifically, FIELD_OPTIONALLY_ENCLOSED_BY = '"' when setting up your fileformat.
(docs)
Example of creating such a file format:
create or replace file format mydb.myschema.my_tsv_file_format
type = CSV
field_delimiter = '\t'
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
And example of querying from a stage using this file format:
select
  $1 field_one,
  $2 field_two
  -- ...and so on
from '@my_s3_stage/path/to/file/my_tab_separated_file.csv' (file_format => 'my_tsv_file_format')

Django / PostgreSQL jsonb (JSONField) - convert select and update into one query

Versions: Django 1.10 and Postgres 9.6
I'm trying to modify a nested JSONField's key in place without a roundtrip to Python. The reason is to avoid race conditions and multiple queries overwriting the same field with different updates.
I tried to chain the methods in the hope that Django would make a single query but it's being logged as two:
Original field value (demo only, real data is more complex):
from exampleapp.models import AdhocTask
record = AdhocTask.objects.get(id=1)
print(record.log)
> {'demo_key': 'original'}
Query:
from django.db.models import F
from django.db.models.expressions import RawSQL
(AdhocTask.objects.filter(id=25)
    .annotate(temp=RawSQL(
        # `jsonb_set` takes the current JSON value of the `log` field,
        # takes the nominated key ("demo_key" in this example)
        # and replaces its value with the JSON provided ("new value").
        # The raw SQL is wrapped in triple quotes to avoid escaping each quote.
        """jsonb_set(log, '{"demo_key"}','"new value"', false)""", []))
    # Finally, take the temp field and overwrite the original JSONField
    .update(log=F('temp'))
)
Query history (shows this as two separate queries):
from django.db import connection
print(connection.queries)
> [{'sql': 'SELECT "exampleapp_adhoctask"."id", "exampleapp_adhoctask"."description", "exampleapp_adhoctask"."log" FROM "exampleapp_adhoctask" WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'},
> {'sql': 'UPDATE "exampleapp_adhoctask" SET "log" = (jsonb_set(log, \'{"demo_key"}\',\'"new value"\', false)) WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'}]
It would be much nicer without RawSQL.
Here's how to do it:
from django.db.models.expressions import Func

class ReplaceValue(Func):
    function = 'jsonb_set'
    template = "%(function)s(%(expressions)s, '{\"%(keyname)s\"}','\"%(new_value)s\"', %(create_missing)s)"
    arity = 1

    def __init__(
        self, expression: str, keyname: str, new_value: str,
        create_missing: bool=False, **extra,
    ):
        super().__init__(
            expression,
            keyname=keyname,
            new_value=new_value,
            create_missing='true' if create_missing else 'false',
            **extra,
        )
AdhocTask.objects.filter(id=25) \
    .update(log=ReplaceValue(
        'log',
        keyname='demo_key',
        new_value='another value',
        create_missing=False,
    ))
ReplaceValue.template is the same as your raw SQL statement, just parametrized.
(jsonb_set(log, '{"demo_key"}','"another value"', false)) from your query is now jsonb_set("exampleapp_adhoctask"."log", '{"demo_key"}','"another value"', false). The parentheses are gone (you can get them back by adding them to the template) and log is referenced in a different way.
Anyone interested in more details regarding jsonb_set should have a look at table 9-45 in postgres' documentation: https://www.postgresql.org/docs/9.6/static/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE
Rubber duck debugging at its best - in writing the question, I realised the solution. Leaving the answer here in the hope of helping someone in the future:
Looking at the queries, I realised that the RawSQL was actually being deferred until query two, so all I was doing was storing the RawSQL as a subquery for later execution.
Solution:
Skip the annotate step altogether and pass the RawSQL expression straight into the .update() call. This allows you to dynamically update PostgreSQL jsonb sub-keys on the database server without overwriting the whole field:
(AdhocTask.objects.filter(id=25)
    .update(log=RawSQL(
        """jsonb_set(log, '{"demo_key"}','"another value"', false)""", [])
    )
)
> 1 # Success
print(connection.queries)
> {'sql': 'UPDATE "exampleapp_adhoctask" SET "log" = (jsonb_set(log, \'{"demo_key"}\',\'"another value"\', false)) WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'}]
print(AdhocTask.objects.get(id=1).log)
> {'demo_key': 'another value'}

Insert or update if exists in mysql using pandas

I am trying to insert data from an xlsx file into a MySQL table. I want to insert the data into the table and, if there is a duplicate on the primary keys, update the existing row; otherwise insert. I have already written a script for this, but I realized it is too much work, and with pandas it would be quick. How can I achieve this with pandas?
#!/usr/bin/env python3
import pandas as pd
import sqlalchemy
engine_str = 'mysql+pymysql://admin:mypass@localhost/mydb'
engine = sqlalchemy.create_engine(engine_str, echo=False, encoding='utf-8')
file_name = "tmp/results.xlsx"
df = pd.read_excel(file_name)
I can think of two options, but number 1 might be cleaner/faster:
1) Make SQL decide on the update/insert. Check this other question. You can iterate over the rows of your df, and inside the loop, for the insertion, you can write something like the following (a fuller sketch of the loop comes after the snippet):
query = """INSERT INTO table (id, name, age) VALUES(%s, %s, %s)
ON DUPLICATE KEY UPDATE name=%s, age=%s"""
engine.execute(query, (df.id[i], df.name[i], df.age[i], df.name[i], df.age[i]))
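For completeness, the surrounding loop could look roughly like this; it keeps the engine.execute style of the snippet above (SQLAlchemy 1.x style), and my_table plus the id/name/age columns are placeholders:

query = """INSERT INTO my_table (id, name, age) VALUES (%s, %s, %s)
           ON DUPLICATE KEY UPDATE name = VALUES(name), age = VALUES(age)"""
for row in df.itertuples(index=False):
    engine.execute(query, (row.id, row.name, row.age))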
2) Define a Python function that returns True or False depending on whether the record exists, and then use it in your loop:
def check_existence(user_id):
    query = "SELECT EXISTS (SELECT 1 FROM your_table WHERE user_id_str = %s);"
    return list(engine.execute(query, (user_id,)))[0][0] == 1
You could iterate over the rows and do this check before inserting, for example:
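A rough sketch of that loop, reusing check_existence and placeholder id/name/age columns (update the row if it exists, otherwise insert it):

# Placeholder columns; row.id is assumed to map to user_id_str.
insert_q = "INSERT INTO your_table (user_id_str, name, age) VALUES (%s, %s, %s)"
update_q = "UPDATE your_table SET name = %s, age = %s WHERE user_id_str = %s"
for row in df.itertuples(index=False):
    if check_existence(row.id):
        engine.execute(update_q, (row.name, row.age, row.id))
    else:
        engine.execute(insert_q, (row.id, row.name, row.age))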
Please also check the solution in this question and this one too which might work in your case.
Pangres is the tool for this job.
Overview here:
https://pypi.org/project/pangres/
Use the function pangres.fix_psycopg2_bad_cols to "clean" the columns in the DataFrame.
Code/usage here:
https://github.com/ThibTrip/pangres/wiki
https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres
Example code:
# From: <https://github.com/ThibTrip/pangres/wiki/Fix-bad-column-names-postgres>
import pandas as pd
from pangres import fix_psycopg2_bad_cols

# fix bad col/index names with default replacements (empty string for '(', ')' and '%'):
df = pd.DataFrame({'test()':[0],
                   'foo()%':[0]}).set_index('test()')
print(df)
        foo()%
test()
0            0
# clean cols, index w/ no replacements
df_fixed = fix_psycopg2_bad_cols(df)
print(df_fixed)
      foo
test
0       0
# fix bad col/index names with custom replacements - you MUST provide replacements for '(', ')' and '%':
# reset df
df = pd.DataFrame({'test()':[0],
                   'foo()%':[0]}).set_index('test()')
# clean cols, index w/ user-specified replacements
df_fixed = fix_psycopg2_bad_cols(df, replacements={'%':'percent', '(':'', ')':''})
print(df_fixed)
      foopercent
test
0              0
It will only fix/correct some of the bad characters:
Replaces '%', '(' and ')' (characters that won't play nicely, or even at all)
But it is useful in that it handles both the cleanup and the upsert (see the sketch below).
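For the upsert itself, usage looks roughly like the following. This is only a sketch based on the pangres docs; the exact signature (e.g. con vs. engine as the first keyword) differs between pangres versions, and the DataFrame index must be the table's primary key:

from pangres import upsert
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://admin:mypass@localhost/mydb')
# 'id' is assumed to be the primary key column of the target table
df = pd.read_excel('tmp/results.xlsx').set_index('id')

upsert(con=engine, df=df, table_name='results',
       if_row_exists='update',   # update rows whose primary key already exists
       create_table=True)        # create the table first if it does not exist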
(p.s., I know this post is over 4 years old, but still shows up in Google results when searching for "pangres upsert determine number inserts and updates" as the top SO result, dated May 13, 2020.)
When using pandas, no iteration is needed. Isn't that faster?
df = pd.read_csv(csv_file,sep=';',names=['column'])
df.to_sql('table', con=con, if_exists='append', index=False, chunksize=20000)
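Note that if_exists='append' on its own only inserts; it will not update rows that already exist. To keep the no-iteration style but still get insert-or-update behaviour, one option (a sketch, assuming pandas >= 0.24, a MySQL table with a primary/unique key, and simple column names) is to pass a custom method callback to to_sql:

import sqlalchemy

def upsert_rows(table, conn, keys, data_iter):
    # Custom insertion method for DataFrame.to_sql: emits
    # INSERT ... ON DUPLICATE KEY UPDATE instead of a plain INSERT.
    columns = ", ".join("`{}`".format(k) for k in keys)
    placeholders = ", ".join(":{}".format(k) for k in keys)
    updates = ", ".join("`{0}` = VALUES(`{0}`)".format(k) for k in keys)
    stmt = sqlalchemy.text(
        "INSERT INTO `{}` ({}) VALUES ({}) ON DUPLICATE KEY UPDATE {}".format(
            table.name, columns, placeholders, updates))
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

df.to_sql('table', con=con, if_exists='append', index=False,
          chunksize=20000, method=upsert_rows)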

Could anybody check my SQL code using SET, SELECT and CAST all at the same time

I'm trying to import some CSV files into a table (via phpMyAdmin). The problem is that the import wasn't really correct (the date wasn't showing, and neither were my values, although they showed correctly when the column type was VARCHAR, etc.), so I thought about converting my values to the right type during the import. When I tried it for my date, it worked using only the line below at the end:
SET Date = STR_TO_DATE(@dte, '%m.%d.%Y')
SQL;
But when I wanted to convert my values into a float or a decimal or whatever, it didn't work. It always says 'Syntax error or access violation' (and I guess it starts from = ( Select CAST(@tot AS decimal(10,2)) etc.).
Do you have any suggestions please?
The query
Select CAST(@tot AS decimal(10,2))
from `bss_154`
normally works in MySQL.
Thanks in advance.
PS: Here is my code :
$sql=<<<SQL
LOAD DATA INFILE '{$filename}'
IGNORE INTO TABLE nsui.bss_154
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n' IGNORE 1 LINES
(@dte, `BSC_Name`, `Segment_Name`, `SEGMENT`, `UP_QUAL`, `UP_LEVEL`,
`DOWN_QUAL`, `DOWN_LEV`, `DISTANCE`, `MSC_INVOC`, `INTFER_UP`,
`INTFER_DWN`, `UMBR`, `PBDGT`, `OMC`, `DIR_RETRY`, `PRE_EMPTION`,
`FIELD_DROP`, `LOW_DISTANCE`, `BAD_CI`, `GOOD_CI`, `HO_DUE_SLOW_MOV_MS`,
`HO_DUE_MS_SLOW_SPEED`, `HO_DUE_MS_HIGH_SPEED`,
`HO_ATT_DUE_SWITCH_CIRC_POOL`, `HO_ATT_DUE_ERFD`,
`HO_ATT_DUE_TO_BSC_CONTR_TRHO`, `HO_ATT_DUE_TO_DADLB`,
`HO_ATT_DUE_TO_GPRS`, `HO_ATT_DUE_TO_HSCSD`, `HO_ATT_DUE_BAD_SUPER_RXLEV`,
`HO_ATT_DUE_GOOD_REGULAR_RXLEV`, `HO_ATT_DUE_DIRECT_ACCESS`,
`HO_ATTEMPT_INTERBAND_DUE_LEVEL`, `HO_ATTEMPT_DUE_TO_ISHO`,
`HO_ATT_DUE_INTERSYS_DIRECT_ACC`, `HO_ATT_FOR_AMR_TO_HR`,
`HO_ATT_FOR_AMR_TO_FR`, `HO_ATT_INTER_BAND_SDCCH`,
`HO_ATT_INTER_BAND_TCH`, `HO_ATT_INTER_BTS_TYPE_SDCCH`,
`HO_ATT_INTER_BTS_TYPE_TCH`, @tot)
SET Date = STR_TO_DATE(@dte, '%m.%d.%Y'),
Hoattempts_outgoing_and_intra-cell = (
Select CAST(@tot AS decimal(10,2))
from `bss_154`)
SQL;
Also, perhaps you want to create the proper data types (VARCHAR) in the table definition first, import the data that you can, and then create another script that imports and fixes the leftover field data using the proper FORMAT and CAST statement syntaxes:
FORMAT ( value, format [, culture ] )
See: https://msdn.microsoft.com/en-us/library/hh213505.aspx
CAST ( expression AS data_type [ ( length ) ] )
See: https://msdn.microsoft.com/en-us/library/ms187928.aspx
With some code like:
CAST(FORMAT(@tot, 'N', 'en-us') as varchar) As SOME_TOTAL

MySQL Dynamic Query Statement in Python with Dictionary

Very similar to this question MySQL Dynamic Query Statement in Python
However, what I am looking to do, instead of using two lists, is to use a dictionary.
Let's say I have this dictionary:
instance_insert = {
    # sql column       variable value
    'instance_id' : 'instance.id',
    'customer_id' : 'customer.id',
    'os'          : 'instance.platform',
}
And I want to populate a MySQL database with an INSERT statement, using the 'sql column' entries as the column names and the 'variable value' entries as the variables holding the values to be inserted into the table.
I'm kind of lost because I don't understand exactly what this statement does, but it was pulled from the question I linked, where he was using two lists to do what he wanted:
sql = "INSERT INTO instance_info_test VALUES (%s);" % ', '.join('?' for _ in instance_insert)
cur.execute (sql, instance_insert)
Also, I would like it to be dynamic in the sense that I can add/remove columns from the dictionary.
Before you post, you might want to try searching for something more specific to your question. For instance, when I Googled "python mysqldb insert dictionary", I found a good answer on the first page, at http://mail.python.org/pipermail/tutor/2010-December/080701.html. Relevant part:
Here's what I came up with when I tried to make a generalized version
of the above:
def add_row(cursor, tablename, rowdict):
    # XXX tablename not sanitized
    # XXX test for allowed keys is case-sensitive
    # filter out keys that are not column names
    cursor.execute("describe %s" % tablename)
    allowed_keys = set(row[0] for row in cursor.fetchall())
    keys = allowed_keys.intersection(rowdict)
    if len(rowdict) > len(keys):
        unknown_keys = set(rowdict) - allowed_keys
        print >> sys.stderr, "skipping keys:", ", ".join(unknown_keys)
    columns = ", ".join(keys)
    values_template = ", ".join(["%s"] * len(keys))
    sql = "insert into %s (%s) values (%s)" % (
        tablename, columns, values_template)
    values = tuple(rowdict[key] for key in keys)
    cursor.execute(sql, values)
filename = ...
tablename = ...
db = MySQLdb.connect(...)
cursor = db.cursor()
with open(filename) as instream:
    row = json.load(instream)
    add_row(cursor, tablename, row)
Peter
If you know your inputs will always be valid (the table name is valid, the columns are present in the table), and you're not importing from a JSON file as the example does, you can simplify this function; a rough sketch of such a simplified version follows below. Either way it'll accomplish what you want to accomplish. While it may initially seem like DictCursor would be helpful, it looks like DictCursor is useful for returning a dictionary of values, but it can't execute from a dict.
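A rough sketch of that simplified version, assuming the dictionary keys are trusted column names (so no describe/intersection step) and that the dictionary values are the actual values to insert (instance_insert is the dictionary from the question); the connection details are placeholders:

import MySQLdb

def insert_dict(cursor, tablename, rowdict):
    # Assumes rowdict keys are valid, trusted column names.
    columns = ", ".join(rowdict.keys())
    placeholders = ", ".join(["%s"] * len(rowdict))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, columns, placeholders)
    cursor.execute(sql, list(rowdict.values()))

db = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cursor = db.cursor()
insert_dict(cursor, "instance_info_test", instance_insert)
db.commit()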