I'm currently trying to build a database interface with Python to store stock data. The data comes in as a list of tuples, each consisting of date, ticker symbol, open, high, low, close, and volume. date is a UNIX timestamp and has to be unique in combination with the ticker symbol in the database. Below is an example of a typically processed output (company_stock):
[(1489780560, 'NYSE:F', 12.5, 12.505, 12.49, 12.495, 567726),
(1489780620, 'NYSE:F', 12.495, 12.5, 12.48, 12.48, 832487),
(1489780680, 'NYSE:F', 12.485, 12.49, 12.47, 12.475, 649818),
(1489780740, 'NYSE:F', 12.475, 12.48, 12.47, 12.47, 700579),
(1489780800, 'NYSE:F', 12.47, 12.48, 12.47, 12.48, 567798)]
I'm using the pymysql package to insert this list into a local MySQL database (version 5.5). While the code runs through and the values get inserted, the database will crash - or rather stop - after reaching about 250k rows. Below is the export part of the stock data processing function, which gets called about once every 20 seconds and inserts about 400 values.
import pymysql
from datetime import datetime

# SQL Export
def tosql(company_stock, ticker, interval, amount_period, period):
    try:
        conn = pymysql.connect(host = "localhost", user = "root",
                               passwd = "pw", db = "db", charset = "utf8",
                               autocommit = True,
                               cursorclass = pymysql.cursors.DictCursor)
        cur = conn.cursor()

        # To temp table
        query = "INSERT INTO stockdata_import "
        query += "(date, tickersymbol, open, high, low, close, volume) "
        query += "VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cur.executemany(query, company_stock)

        # Duplicate check with temp table and existing database storage
        query = "INSERT INTO stockdata (date, tickersymbol, open, high, low, close, volume) "
        query += "SELECT i.date, i.tickersymbol, i.open, i.high, i.low, "
        query += "i.close, i.volume FROM stockdata_import i "
        query += "WHERE NOT EXISTS(SELECT dv.date, dv.tickersymbol FROM "
        query += "stockdata dv WHERE dv.date = i.date "
        query += "AND dv.tickersymbol = i.tickersymbol)"
        cur.execute(query)

        print(": ".join([datetime.now().strftime("%d.%m.%Y %H:%M:%S"),
                         "Data stored in Vault. Ticker", str(ticker),
                         "Interval", str(interval),
                         "Last", str(amount_period), str(period)]))
    finally:
        # Clear temp import table and close connection
        query = "DELETE FROM stockdata_import"
        cur.execute(query)
        cur.close()
        conn.close()
I suspect that the check for already existing values takes too long as the database grows, and that it eventually breaks down because of table locks (?) while checking the uniqueness of the date/ticker combination. Since I expect this database to grow rather fast (about 1 million rows per week), it seems that a different solution is required to ensure that there is only one row per date/ticker pair. This is the SQL CREATE statement for the import table (the real table it gets compared against looks the same):
CREATE TABLE stockdata_import (id_stock_imp BIGINT(12) NOT NULL AUTO_INCREMENT,
date INT(10),
tickersymbol VARCHAR(16),
open FLOAT(12,4),
high FLOAT(12,4),
low FLOAT(12,4),
close FLOAT(12,4),
volume INT(12),
crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY(id_stock_imp));
I have already looked into setting a constraint for the date/tickersymbol pair and handling the resulting exceptions in Python, but my research so far suggested that this would be even slower, and I am not even sure whether it would work with the bulk insert done by pymysql's executemany(query, data).
Context information:
The SQL export shown above is the final part of a python script handling the stock data response. This script, in turn, gets called by another script which is timed by a crontab to run at a specific time each day.
Once the crontab starts the control script, this will call the subscript about 500 times with a sleep time of about 20-25 seconds between each run.
The error which I see in the logs is: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
Questions:
How can I optimize the query or alter the storage table to ensure uniqueness for a given date/ticker combination?
Is this even the problem or do I fail to see some other problem here?
Any further advice is also welcome.
If you would like to ensure uniqueness of your data, just add a unique index on the relevant date and ticker fields. A unique index prevents duplicate values from being inserted, so there is no need to check for existing data before the insertion.
Since you do not want to insert duplicate data, use INSERT IGNORE instead of a plain INSERT to suppress duplicate-key errors. Based on the number of affected rows, you can still detect and log duplicate insertions.
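A minimal sketch of what that looks like for the stockdata table from the question; the index name is arbitrary, and the %s placeholders are the pymysql ones already filled by executemany(query, company_stock):

-- Enforce one row per (date, tickersymbol); the index name is arbitrary.
ALTER TABLE stockdata
    ADD UNIQUE INDEX uq_date_ticker (date, tickersymbol);

-- With the unique key in place, duplicate rows are silently skipped instead of raising an error.
INSERT IGNORE INTO stockdata
    (date, tickersymbol, open, high, low, close, volume)
VALUES (%s, %s, %s, %s, %s, %s, %s);

With this in place the staging table and the NOT EXISTS sub-query are no longer needed; executemany() can write straight into stockdata, and the affected-row count reported by the cursor tells you how many rows were actually new.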
Related
I have connected to Spotify's API in Python to extract the top twenty tracks of a searched artist. I am trying to store the data in MySQL Workbench, in a database named 'spotify_api' with a table I created called 'spotify'. Before I added the code to connect to MySQL Workbench, my code worked correctly and was able to extract the list of tracks, but I have run into issues getting my code to connect to my database. Below is the code I have written to both extract the data and store it in my database:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import mysql.connector
mydb = mysql.connector.connect(
    host = "localhost",
    user = "root",
    password = "(removed for question)",
    database = "spotify_api"
)

mycursor = mydb.cursor()

sql = 'DROP TABLE IF EXISTS spotify_api.spotify;'
mycursor.execute(sql)

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="(removed for question)",
                                                           client_secret="(removed for question)"))

results = sp.search(q='sza', limit=20)

for idx, track in enumerate(results['tracks']['items']):
    print(idx, track['name'])
    sql = "INSERT INTO spotify_api.spotify (tracks, items) VALUES (" + \
          str(idx) + ", '" + track['name'] + "');"
    mycursor.execute(sql)
    mydb.commit()
    print(mycursor.rowcount, "record inserted.")

mycursor.execute("SELECT * FROM spotify_api.spotify;")
myresult = mycursor.fetchall()

for x in myresult:
    print(x)

mycursor.close()
Every time I run my code in the VS Code terminal, I receive an error stating that my table doesn't exist. This is what it states:
"mysql.connector.errors.ProgrammingError: 1146 (42S02): Table 'spotify_api.spotify' doesn't exist"
I'm not sure what I need to fix in my code or in my database in order to eliminate this error and get my data stored into my table. In my table I have created two columns 'tracks' and 'items', but I'm not sure if my issues lie in my database or in my code.
Well, it seems pretty clear. You ran
DROP TABLE IF EXISTS spotify_api.spotify;
...
INSERT INTO spotify_api.spotify (tracks, items) VALUES ...
We won't even raise the spectre of the Chuck Berry
track titled little ol' Bobby Tables here.
You DROP'd it, then tried to INSERT into it.
That won't work.
You'll need to CREATE TABLE prior to the INSERT.
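For example, something along these lines, executed once before the loop; the column types are only a guess based on what the INSERT stores (an index number and a track name):

CREATE TABLE IF NOT EXISTS spotify_api.spotify (
    tracks INT,            -- the enumerate() index written by the INSERT
    items  VARCHAR(255)    -- the track name
);

You can run it through mycursor.execute(...) right after the DROP TABLE statement, or leave the DROP out entirely if you want the rows to survive between runs.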
@Test
public void transaction() throws Exception {
    Connection conn = null;
    PreparedStatement ps = null;
    try {
        String sql = "insert into `1` values(?, ?, ?, ?)";
        conn = JDBCUtils.getConnection();
        ps = conn.prepareStatement(sql);
        conn.setAutoCommit(false);
        for (int i = 1; i <= 10000; i++) {
            ps.setObject(1, i);
            ps.setObject(2, 10.12345678);
            ps.setObject(3, "num_" + i);
            ps.setObject(4, "2021-12-24 19:00:00");
            ps.addBatch();
        }
        ps.executeBatch();
        ps.clearBatch();
        conn.commit();
    } catch (Exception e) {
        conn.rollback();
        e.printStackTrace();
    } finally {
        JDBCUtils.closeResources(conn, ps);
    }
}
When setAutoCommit = true, local MySQL and distributed MySQL insert speeds are very slow.
When I set the transaction to commit manually, as in the code above, the local MySQL speed increased a lot, but the insert speed on distributed MySQL is still very slow.
Are there any additional parameters I need to set?
Setting parameters probably won't help (much).
There are a couple of reasons for the slowness:
With autocommit=true you are committing on every insert statement. That means each new row must be written to disk before the database server returns the response to the client.
With autocommit=false there is still a client -> server -> client round trip for each insert statement. Those round trips add up to a significant amount of time.
One way to make this faster is to insert multiple rows with each insert statement, but that is messy because you would need to generate complex (multi-row) insert statements.
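That is, statements of this shape; the table and column names here are just placeholders for illustration:

-- One round trip inserts several rows at once.
INSERT INTO employees (first_name, last_name)
VALUES ('John', 'Doe'),
       ('Dave', 'Smith'),
       ('Ann', 'Lee');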
A better way is to use JDBC's batch feature to reduce the number of round-trips. For example:
PreparedStatement ps = c.prepareStatement("INSERT INTO employees VALUES (?, ?)");
ps.setString(1, "John");
ps.setString(2,"Doe");
ps.addBatch();
ps.clearParameters();
ps.setString(1, "Dave");
ps.setString(2,"Smith");
ps.addBatch();
ps.clearParameters();
int[] results = ps.executeBatch();
(Attribution: above code copied from this answer by @Tusc)
If that still isn't fast enough, you should get even better performance using MySQL's native bulk insert mechanism; e.g. load data infile; see High-speed inserts with MySQL
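A rough sketch of that route, assuming the rows have first been written out to a CSV file; the path, table and column names are placeholders, and LOCAL must be allowed by both server and driver:

-- Bulk-load a CSV file in a single statement instead of many INSERTs.
LOAD DATA LOCAL INFILE '/tmp/employees.csv'
INTO TABLE employees
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(first_name, last_name);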
For completeness, I am adding this suggestion from @Wilson Hauck
"In your configuration [mysqld] section, innodb_change_buffer_max_size=50 # from 25 (percent) for improved INSERT rate per second. SHOW FULL PROCESSLIST; to monitor when the instance has completed adjustment, then do your inserts and put it back to 25 percent for typical processing speed."
This may increase the insert rate depending on your table and its indexes, and on the order in which you are inserting the rows.
But the flip-side is that you may be able to achieve the same speedup (or more!) by other means; e.g.
by sorting your input so that rows are inserted in index order, or
by dropping the indexes, inserting the records and then recreating the indexes.
You can read about the change buffer here and make your own judgements.
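For reference, the change-buffer tweak and the drop/recreate-index option boil down to statements like these; the index and column names are placeholders:

-- Temporarily raise the change buffer for the bulk load, then set it back to 25 afterwards.
SET GLOBAL innodb_change_buffer_max_size = 50;

-- Or drop a secondary index before the load and rebuild it once the rows are in.
ALTER TABLE employees DROP INDEX idx_last_name;
-- ... run the bulk insert ...
ALTER TABLE employees ADD INDEX idx_last_name (last_name);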
I am using celery to achieve async jobs in Python. My code flow is as follows:
1. a celery task gets some data from a remote API
2. celery beat gets the celery task result from the celery backend (which is Redis) and then inserts the result into MySQL
But in step 2, before I insert the result data into MySQL, I check whether the data already exists. Although I do the check, duplicate data still gets inserted.
My code is as follows:
def get_task_result(logger=None):
    db = MySQLdb.connect(host=MYSQL_HOST, port=MYSQL_PORT, user=MYSQL_USER, passwd=MYSQL_PASSWD, db=MYSQL_DB, cursorclass=MySQLdb.cursors.DictCursor, use_unicode=True, charset='utf8')
    cursor = db.cursor()
    ....
    ....
    store_subdomain_result(db, cursor, asset_id, celery_task_result)
    ....
    ....
    cursor.close()
    db.close()

def store_subdomain_result(db, cursor, top_domain_id, celery_task_result, logger=None):
    subdomain_list = celery_task_result.get('result').get('subdomain_list')
    source = celery_task_result.get('result').get('source')
    for domain in subdomain_list:
        query_subdomain_sql = f'SELECT * FROM nw_asset WHERE domain="{domain}"'
        cursor.execute(query_subdomain_sql)
        sub_domain_result = cursor.fetchone()
        if sub_domain_result:
            asset_id = sub_domain_result.get('id')
            existed_source = sub_domain_result.get('source')
            if source not in existed_source:
                new_source = f'{existed_source},{source}'
                update_domain_sql = f'UPDATE nw_asset SET source="{new_source}" WHERE id={asset_id}'
                cursor.execute(update_domain_sql)
                db.commit()
        else:
            insert_subdomain_sql = f'INSERT INTO nw_asset(domain) values("{domain}")'
            cursor.execute(insert_subdomain_sql)
            db.commit()
I first select to check whether the data exists; if it does not exist, I do the insert. The relevant code is:
query_subdomain_sql = f'SELECT * FROM nw_asset WHERE domain="{domain}"'
cursor.execute(query_subdomain_sql)
sub_domain_result = cursor.fetchone()
I do this, but duplicate data still gets inserted, and I can't understand why.
I googled this question and some people say to use INSERT IGNORE or REPLACE INTO or a unique index, but I want to know why the code does not work as expected.
Also, could there be some cache in MySQL, so that when I do the select the data is not really in MySQL yet but only waiting to be flushed, and the select therefore returns None?
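For reference, the unique index and INSERT IGNORE that get suggested would look something like this; the index name and the sample value are placeholders:

-- Let the database itself enforce one row per domain.
ALTER TABLE nw_asset ADD UNIQUE INDEX uq_nw_asset_domain (domain);

-- The insert is then skipped (instead of creating a second row) when the domain already exists.
INSERT IGNORE INTO nw_asset (domain) VALUES ("sub.example.com");

Without a constraint like this, two tasks that both run the SELECT before either has committed its INSERT will each see no existing row and both insert, which is the usual way duplicates slip past an application-level check.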
I have made an application in Node.js that calls an endpoint every minute and gets a JSON array of about 100,000 elements. I need to upsert these elements into my database such that, if an element doesn't exist, I insert it with the "Point" column set to 0.
So far I have a cron job and a simple upsert query, but it's very slow:
var q = async.queue(function (data, done) {
    db.query('INSERT INTO stat(`user`, `user2`, `point`) ' + data.values + ' ON DUPLICATE KEY UPDATE point=point+ 10', function (err, result) {
        if (err) throw err;
    });
}, 100000);

// Cron job here: every 1 minute execute the lines below
var values = '';
for (var v = 0; v < stats.length; v++) {
    values = '("JACK","' + stats[v] + '", 0)';
    q.push({values: values});
}
How can I do such a task in a very short amount of time? Is using MySQL the wrong decision? I'm open to any other architecture or solution. Note that I have to do this every minute.
I fixed this problem by using a bulk upsert (from the documentation)! I managed to upsert over 24k rows in less than 3 seconds. Basically, I built the query first and then ran it:
INSERT INTO table (a,b,c) VALUES (1,2,3),(4,5,6)
ON DUPLICATE KEY UPDATE c=VALUES(a)+VALUES(b);
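Applied to the stat table from the question, the generated statement ends up looking like this (the values shown are placeholders), so a whole minute's worth of rows goes to the server in one round trip instead of one query per element:

INSERT INTO stat (`user`, `user2`, `point`)
VALUES ('JACK', 'stat_1', 0),
       ('JACK', 'stat_2', 0),
       ('JACK', 'stat_3', 0)
ON DUPLICATE KEY UPDATE point = point + 10;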
In FoxPro, using native tables, I usually do this when inserting new data:
Sele Table
If Seek(lcIndex)
    Update Record
Else
    Insert New Record
EndIf
If I use MySQL as my database, what is the best and fastest way to do this in FoxPro code using SPT? I will be updating a large number of records, up to 80,000 transactions.
Thanks,
Herbert
I would only take what Jerry supplied one step further. When trying to deal with any insert, update, delete with SQL pass through, it can run into terrible debugging problems based on similar principles of SQL-injection.
What if your "myValue" field had a single quote, double quote, double hyphen (indicating comment)? You would be hosed.
Parameterize your statement, for example by using VFP variable references, and use "?" in your SQL statement to mark where each value should be used. VFP passes the values properly. This also helps with data types, such as not having to convert numbers into strings when building "myStatement".
Also, in VFP, you can use TEXT/ENDTEXT to improve the readability of the commands:
lcSomeStringVariable = "My Test Value"
lnANumericValue = 12.34
lnMyIDKey = 389
TEXT to lcSQLCmd NOSHOW PRETEXT 1+2+8
update [YourSchems].[YourTable]
set SomeTextField = ?lcSomeStringVariable,
SomeNumberField = ?lnANumericValue
where
YourPKColumn = ?lnMyIDKey
ENDTEXT
=sqlexec( yourHandle, lcSQLCmd, "localCursor" )
You can use SQL pass through in your Visual FoxPro application. Take a look at SQLCONNECT() or SQLSTRINGCONNECT() for connecting to your database, and at SQLEXEC() for executing your SQL statement.
For Example:
myValue = 'Test'
myHandle = SQLCONNECT('sqlDBAddress','MyUserId','MyPassword')
myStatement = "UPDATE [MySchema].[Mytable] SET myField = '" + myValue + "' WHERE myPk = 1"
=SQLEXEC(myHandle, myStatement,"myCursor")
=SQLEXEC(myHandle, "SELECT * FROM [MySchema].[Mytable] WHERE myPk = 1","myCursor")
SELECT myCursor
BROWSE LAST NORMAL
This would be your statement string for SQLEXEC:
INSERT INTO SOMETABLE
SET KEYFIELD = ?M.KEYFIELD,
FIELD1 = ?M.FIELD1
...
FIELDN = ?M.FIELDN
ON DUPLICATE KEY UPDATE
FIELD1 = ?M.FIELD1
...
FIELDN = ?M.FIELDN
Notice that the ON DUPLICATE KEY UPDATE part does not contain the key field; otherwise it would normally be identical to the insert (or not, if you want to do something else when the record already exists).