Python3, Pandas and MySQL index issue

I'm new to Python and trying to learn some basic data manipulation (the main focus is Data Science), so I'm still grasping Pandas and everything else.
What I'm trying to achieve is to create a DataFrame and store it in a MySQL database. This is my script (which doesn't work):
from sqlalchemy.types import VARCHAR
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.random((4, 4)),
                     index=['val1', 'val2', 'val3', 'val4'],
                     columns=['col1', 'col2', 'col3', 'col4'])
engine = create_engine('mysql+pymysql://user:password@localhost/python_samples')
frame.to_sql('rnd_vals', engine, dtype={'index': VARCHAR(5)})
When I try to execute this, I get an error saying that MySQL won't allow creating a TEXT/BLOB index without a key length:
InternalError: (pymysql.err.InternalError) (1170, "BLOB/TEXT column 'index' used in key specification without a key length") [SQL: 'CREATE INDEX ix_rnd_vals_index ON rnd_vals (`index`)']
I believed I could fix this by specifying the dtype option in the to_sql() function, but it didn't help.
I found a way to make this work by joining two DataFrames, one with the values and the other with the index:
from sqlalchemy.types import VARCHAR
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
frame = pd.DataFrame(np.random.random(25).reshape(5, 5),
                     columns=['Jan', 'Feb', 'Mar', 'Apr', 'May'])
idxFrame = pd.DataFrame({'index': ['exp1', 'exp2', 'exp3', 'exp4', 'exp5']})
frame = frame.join(idxFrame)
frame = frame.set_index('index')
engine = create_engine('mysql+pymysql://user:password@localhost/python_samples')
frame.to_sql('indexes', engine, if_exists='replace', index_label='index',
             dtype={'index': VARCHAR(5)})
This works as expected, but I really doubt this is the correct way to do it. Can someone help me? What did I do wrong?
Thank you

For whoever is having this issue, Ilja Everilä solved it in the comments. The index name was actually 'None' rather than 'index', so when I changed the dtype from
dtype={'index':VARCHAR(5)}
to
dtype={'None':VARCHAR(5)}
it solved the issue, and the table was created in MySQL as:
CREATE TABLE `rnd_vals` (
  `index` text,
  `col1` double DEFAULT NULL,
  `col2` double DEFAULT NULL,
  `col3` double DEFAULT NULL,
  `col4` double DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
as expected.
Thank you all!
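For reference, here is one way the whole corrected script can look: a minimal sketch that gives the index an explicit name, so the dtype mapping unambiguously refers to the index column and it is created as VARCHAR(5) instead of TEXT (an alternative to keying the dict on None):
from sqlalchemy import create_engine
from sqlalchemy.types import VARCHAR
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.random((4, 4)),
                     index=['val1', 'val2', 'val3', 'val4'],
                     columns=['col1', 'col2', 'col3', 'col4'])

# Name the index so the dtype dict can reference it by that name.
frame.index.name = 'index'

engine = create_engine('mysql+pymysql://user:password@localhost/python_samples')
frame.to_sql('rnd_vals', engine, if_exists='replace',
             dtype={'index': VARCHAR(5)})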

I tried to find a direct way to get pandas to import the index. In the end, reset_index() seems the simplest method:
my_df = my_df.reset_index()  # moves the index into a regular column
my_df.to_sql(name='my_table', con=engine, index=False, if_exists='replace')

With:
frame.to_sql('rnd_vals', engine, dtype={'None': VARCHAR(5)})
it was still giving:
1170, "BLOB/TEXT column 'index' used in key specification without a key length") [SQL: 'CREATE INDEX ix_indexes_index ON indexes (index)'] (Background on this error at: http://sqlalche.me/e/e3q8)
This resolved the issue:
frame.to_sql('indexes', engine, if_exists='replace', index_label='index',
             dtype={frame.index.name: VARCHAR(5)})

You are trying to create an index on a column of type TEXT/BLOB. MySQL cannot index such a column without a key length, because these types have no fixed length. You can either provide the column type when saving the DataFrame to MySQL, or (if you don't need an index) simply pass index=False.
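A short sketch of both options, reusing the frame and engine objects from the question at the top (the dtype key is None here because that frame's index is unnamed):
from sqlalchemy.types import VARCHAR

# Option 1: give the index column an explicit, length-bounded type so the
# CREATE INDEX statement has a key length to work with.
frame.to_sql('rnd_vals', engine, if_exists='replace',
             dtype={frame.index.name: VARCHAR(5)})

# Option 2: don't write the DataFrame index at all; pandas then never
# issues the CREATE INDEX statement that triggers the error.
frame.to_sql('rnd_vals', engine, if_exists='replace', index=False)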

Use VARCHAR(...) instead of TEXT whenever practical.
In general, it is not useful to index TEXT columns.
I can't provide the SQLAlchemy syntax; I am not familiar with how it obfuscates the SQL code.

Related

to_mysql inserts more rows in SQL table than there are in pandas dataframe

So I have a MySQL database, let's call it "MySQLDB". When trying to create a new table (let's call it datatable) and insert data from a pandas dataframe, my code keeps adding rows to the SQL table, and I'm not sure if they are duplicates or not. For reference, there are around 50,000 rows in my pandas dataframe, but after running my code, the SQL table contains over 1 million rows. Note that I am using XAMPP to run a local MySQL server on which the database "MySQLDB" is stored. Below is a simplified/generic version of what I am running. Note I have removed the port number and replaced it with a generic [port] in this post.
import pandas as pd
from sqlalchemy import create_engine
import mysql.connector
pandas_db = pd.read_csv('filename.csv', index_col = [0])
engine = create_engine('mysql+mysqlconnector://root:@localhost:[port]/MySQLDB', echo=False)
pandas_db.to_sql(name='datatable', con=engine, if_exists = 'replace', chunksize = 100, index=False)
Is something wrong with the code? Or could it be something to do with XAMPP or the way I set up my database? If there is anything I could improve, please let me know.
I haven't found any other good posts that describe having the same issue.
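One quick way to narrow this down (a sketch, reusing the pandas_db and engine objects from the snippet above) is to compare the DataFrame's row count with the table's row count right after the load:
from sqlalchemy import text

# Row count on the pandas side.
print(len(pandas_db))

# Row count on the MySQL side, right after to_sql() has finished.
with engine.connect() as conn:
    print(conn.execute(text('SELECT COUNT(*) FROM datatable')).scalar())
If the counts match immediately after the load but grow later, the extra rows are coming from repeated runs or another writer rather than from to_sql() itself.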

How to avoid encoding warning when inserting binary data into a blob column in MySQL using Python 2.7 and MySQLdb

I'm trying to insert binary data into a longblob column in MySQL using MySQLdb from Python 2.7, but I'm getting an encoding warning that I don't know how to get around:
./test.py:11: Warning: Invalid utf8 character string: '8B0800'
curs.execute(sql, (blob,))
Here is the table definition:
CREATE TABLE test_table (
  id int(11) NOT NULL AUTO_INCREMENT,
  gzipped longblob,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
And the test code:
#!/usr/bin/env python
import sys
import MySQLdb
blob = open("/tmp/some-file.gz", "rb").read()
sql = "INSERT INTO test_table (gzipped) VALUES (%s)"
conn = MySQLdb.connect(db="unprocessed", user="some_user", passwd="some_pass", charset="utf8", use_unicode=True)
curs = conn.cursor()
curs.execute(sql, (blob,))
I've searched here and elsewhere for the answer, but unfortunately although many questions seem like they are what I'm looking for, the posters don't appear to be having encoding issues.
Questions:
What is causing this warning?
How do I get rid of it?
After some more searching I've found the answers.
It is actually MySQL generating this warning.
It can be avoided by using _binary before the binary parameter.
https://bugs.mysql.com/bug.php?id=79317
So the Python code needs to be updated as follows:
sql = "INSERT INTO test_table (gzipped) VALUES (_binary %s)"

pandas dataframe index datetime.date converts to object KeyError

I retrieve some data from my MySQL database. This data has the date (not datetime) in one column and some other random data in the other columns. Let's say dtf is my dataframe. There is no index yet, so I set one:
dtf.set_index('date', inplace=True)
Now I would like to get the data for a specific date, so I write, for example,
dtf.loc['2000-01-03']
or just
dtf['2000-01-03']
This gives me a KeyError:
KeyError: '2000-01-03'
But I know it's in there; dtf.head() shows me that.
So I took a look at the type of the index in the first row:
type(dtf.index[0])
and it tells me datetime.date. All good; now what happens if I just type
dtf.index
Index([2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07, 2000-01-10,
2000-01-11, 2000-01-12, 2000-01-13, 2000-01-14,
...
2015-09-09, 2015-09-10, 2015-09-11, 2015-09-14, 2015-09-15, 2015-09-16,
2015-09-17, 2015-09-18, 2015-09-21, 2015-09-22],
dtype='object', name='date', length=2763)
I am a bit confused about the dtype='object'. Shouldn't this read datetime.date?
If I use datetime in my MySQL table instead of date, everything works like a charm. Is this a bug or a feature? I would really like to use datetime.date because it describes my data best.
My pandas version is 0.17.0
I am using python 3.5.0
My os is arch linux
You should use datetime64/Timestamp rather than datetime.date:
dtf.index = pd.to_datetime(dtf.index)
will give you a DatetimeIndex and let you do nifty things like .loc lookups by string:
dtf.loc['2000-01-03']
You won't be able to do that with plain datetime.date objects.
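A tiny self-contained illustration of the difference, with made-up data just to show the lookup behaviour:
import datetime

import pandas as pd

dtf = pd.DataFrame({'value': [1, 2]},
                   index=[datetime.date(2000, 1, 3), datetime.date(2000, 1, 4)])

# With plain datetime.date objects the index dtype is 'object', and a string
# lookup such as dtf.loc['2000-01-03'] raises a KeyError.
print(dtf.index.dtype)            # object

dtf.index = pd.to_datetime(dtf.index)

# Now it is a DatetimeIndex, so label lookups by string work.
print(dtf.loc['2000-01-03'])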

Circular Dependency Error with SQLAlchemy using autoload for table creation

I am attempting to use the script found here.
I am connecting to an MS SQL database and attempting to copy it into a MySQL database. When the script gets to this line:
table.metadata.create_all(dengine)
I get the error of:
sqlalchemy.exc.CircularDependencyError
I researched this error and found that it occurs when using autoload=True to create a table. The suggested solution, though, doesn't help me: it is to not use autoload=True and to set the use_alter=True flag when defining the foreign key, but since I'm not defining the tables manually, I can't set that flag.
Any help on how to correct this issue, or on a better way to accomplish what I am trying to do would be greatly appreciated. Thank you.
You can iterate through all constraints and set use_alter on them:
from sqlalchemy.schema import ForeignKeyConstraint

for table in metadata.tables.values():
    for constraint in table.constraints:
        if isinstance(constraint, ForeignKeyConstraint):
            constraint.use_alter = True
Or, similarly, iterate through them and emit each one as an AddConstraint operation bound to fire after the whole metadata has been created:
from sqlalchemy import event
from sqlalchemy.schema import AddConstraint

for table in metadata.tables.values():
    for constraint in table.constraints:
        event.listen(
            metadata,
            "after_create",
            AddConstraint(constraint)
        )
See Controlling DDL Sequences.
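In the context of the copy script from the question, the first variant would slot in between reflecting the source schema and creating the tables on the target. A rough sketch (the connection strings and engine names here are placeholders, not taken from the original script):
from sqlalchemy import MetaData, create_engine
from sqlalchemy.schema import ForeignKeyConstraint

# Placeholder connection strings for the source (MS SQL) and target (MySQL).
sengine = create_engine('mssql+pymssql://user:password@source-host/source_db')
dengine = create_engine('mysql+pymysql://user:password@target-host/target_db')

# Reflect the source schema (this is what autoload does under the hood).
metadata = MetaData()
metadata.reflect(bind=sengine)

# Mark every foreign key constraint as use_alter so create_all can order
# the CREATE TABLE statements without hitting the circular dependency.
for table in metadata.tables.values():
    for constraint in table.constraints:
        if isinstance(constraint, ForeignKeyConstraint):
            constraint.use_alter = True

metadata.create_all(dengine)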

Unpickle fields using SQLAlchemy

I need SQLAlchemy to check a database table column for occurrences of python-pickled strings (such as S'foo'\np0\n.), unpickle them (which in this example would yield foo), and write them back. How do I do that (efficiently)? (Can I somehow abuse SQLAlchemy's PickleType?)
Okay, found a way using sqlalchemy.sql.expression.func.substr:
from sqlalchemy import and_
from sqlalchemy.sql.expression import func

# Strip the leading S' and the trailing '\np0\n. wrapper, keeping only the payload.
table.update().where(
    and_(table.c.column.startswith("S'"),
         table.c.column.endswith("'\np0\n."))
).values({
    table.c.column: func.substr(table.c.column,
                                3,
                                func.char_length(table.c.column) - 8)
}).execute()
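If the pickled payloads are more varied than this fixed S'...'\np0\n. pattern, an alternative is to do the unpickling on the Python side: select the rows, call pickle.loads, and write the plain values back. A rough sketch in the same legacy SQLAlchemy style as above; the primary-key column id and the connection conn are assumptions, not from the original code:
import pickle

from sqlalchemy import select

# Assumes `table` is the Table used above, `table.c.id` is its primary key,
# and `conn` is a connection from the same engine.
rows = conn.execute(select([table.c.id, table.c.column])).fetchall()
for row_id, value in rows:
    if value is not None and value.startswith("S'"):
        # Protocol-0 pickles are plain ASCII, so latin-1 round-trips them safely.
        plain = pickle.loads(value.encode('latin-1'))
        conn.execute(
            table.update()
            .where(table.c.id == row_id)
            .values({table.c.column: plain})
        )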