I have 14 million rows and 20 columns in a table named weather, and 1,900 rows and 15 columns in a table named incident, on a MySQL server. I am trying to set the active column in weather to 1 where the weather date column falls between the start and end date columns of the incident table and where the weather location column equals the incident location column. I have the following query, and I am not sure it is the most efficient way to do this. It has currently been running for an hour on an AWS RDS db.m5.4xlarge (16 vCPU and 64 GB RAM) and is only using 8% CPU according to the AWS Console.
UPDATE dev.weather, dev.incident
SET weather.active = 1
WHERE weather.location = incident.location AND weather.DATE BETWEEN dev.incident.start_date AND dev.incident.end_date
Is there a better way to accomplish this?
By the time we come up with a satisfactory solution, your query will be finished. But here are some thoughts on it.
UPDATE, especially if lots of rows are modified, is very time-consuming. (This is because of the need to save old rows in case of rollback.)
Without seeing the indexes, I cannot advise completely.
This is a one-time query, correct? Future "incidents" will do the update as the incident is stored, correct? That will probably run reasonably fast.
Given that you have a way to update for a single incident, use that as the basis for the initial UPDATE (the one you are asking about now). That is, write a special, one-time program that runs through the 1,900 incidents, performing the necessary UPDATE for each. (Advantage: only one UPDATE ever needs to be written.)
Be sure to COMMIT after each UPDATE. (Or run with autocommit=ON.) Otherwise the 1,900 updates will be a big burden on the system, perhaps worse than the single UPDATE that started this discussion.
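A sketch of what that one-time program could look like, kept entirely inside MySQL as a stored procedure. The column types (VARCHAR(64), DATE) are guesses at your schema, and the composite index is only worth adding if you don't already have an equivalent one:

-- guessed index; makes each per-incident UPDATE a quick range scan
ALTER TABLE dev.weather ADD INDEX idx_loc_date (location, `DATE`);

DELIMITER //
CREATE PROCEDURE dev.flag_active_weather()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_loc VARCHAR(64);   -- guessed type
  DECLARE v_start DATE;        -- guessed type
  DECLARE v_end DATE;          -- guessed type
  DECLARE cur CURSOR FOR
    SELECT location, start_date, end_date FROM dev.incident;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN cur;
  incident_loop: LOOP
    FETCH cur INTO v_loc, v_start, v_end;
    IF done THEN LEAVE incident_loop; END IF;
    UPDATE dev.weather
       SET active = 1
     WHERE location = v_loc
       AND `DATE` BETWEEN v_start AND v_end;
    COMMIT;                    -- keep each incident's changes small
  END LOOP;
  CLOSE cur;
END //
DELIMITER ;

CALL dev.flag_active_weather();

Each incident's UPDATE then only touches the rows for one location and date range, and committing per incident keeps the undo log small.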
I have a table with 27 columns and 300,000 rows of data, of which 8 columns contain 0, 1, or NULL. Using LabVIEW, I get the total count of each of these columns using the following query:
select
d_1_result,
d_2_value_1_result,
de_2_value_2_result,
d_3_result,
d_4_value_1_result,
d_4_value_2_result,
d_5_result
from Table_name_vp
where ( insp_time between
"15-02-02 06:00:00" and "15-02-02 23:59:59" or
inspection_time between "15-02-03 00:00:00" and "15-02-03 06:00:00")
and partname = "AbvQuene";
This query runs for the number of days the user inputs, for example 120 days.
I found that the total time taken by the query is 8 seconds, which is not good.
I want to reduce the time to 8 milliseconds.
I have also changed the engine to MyISAM.
Any suggestions to reduce the time consumed by the query? (The LabVIEW processing is not what takes the time.)
It depends on the data, and how many rows out of the 300,000 are actually selected by your WHERE clause. Obviously, if all 300,000 are included, the whole table will need to be read. If it's a smaller number of rows, an index on insp_time or inspection_time (is one of these a typo - are they actually the same field?) and/or partname might help. The exact index will depend on your data.
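For example, if partname and insp_time are the real column names (and inspection_time is just another spelling of the same field), a composite index with the equality column first lets the range test on the time column use the index:

-- assumes partname and insp_time are the actual columns used in the WHERE clause
ALTER TABLE Table_name_vp
  ADD INDEX idx_part_time (partname, insp_time);

With only a couple of days of rows for one partname matching, that should cut the 8 seconds substantially, though 8 milliseconds may still be optimistic for a disk-based query.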
Update 2:
I can't see any reason why you wouldn't be able to load your whole DB into memory, because it should be less than 60MB. Do you agree with this?
Please post your answers to the following questions (you can edit a question after you have asked it - that's easier than commenting).
Next steps:
I should have mentioned this before: before you run a query in LabVIEW, I would always test it first using your DB admin tool (e.g. MySQL Workbench). Please post whether that worked or not.
Post your LabVIEW code.
You can try running your query with less than 300K rows - say 50K - and see how much your memory increases. If there's some limitation on how many rows you can query at one time, then you can break your giant query into smaller ones pretty easily and just add up the result sets. I can post an example if needed.
Update:
It sounds like there's something wrong with your schema.
For example, if you had 27 columns of doubles and datetimes (both are 8 bytes each), your total DB size would only be about 60MB (300K * 27 * 8 / 1048576).
Please post your schema for further help (you can use SHOW CREATE TABLE tablename).
8 milliseconds is an extremely low time - I assume that's being driven by some kind of hardware timing issue? If not, please explain that requirement, as a typical user requirement is around 1 second.
To get the response time that low you will need to do the following:
Query the DB at the start of your app and load all 300,000 rows into memory (e.g. a LabVIEW array)
Update the array with new values (e.g. array append)
Run the "query" against he array (e.g. using a for loop with a case select)
On a separate thread (i.e. LabVIEW "loop") insert the new records into to the database or do it write before the app closes
This approach assumes that only one instance of the app is running at a time because synchronizing database changes across multiple instances will be very hard with that timing requirement.
I will probably cross-post this but I assume it's a common problem - didn't find anything quite right in my search.
My task is this: I have a data source streaming a feed of "updates" and "new tickets" into a MySQL collection of tickets. Each ticket has a unique Ticket_ID.
Essentially, I need to take these updates and do an insert/update against a table of tickets. Simple enough.
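For concreteness, the kind of statement I mean is a plain upsert. A rough sketch, assuming Ticket_ID is the primary key and that status and updated_at stand in for my real columns:

INSERT INTO tickets (Ticket_ID, status, updated_at)
VALUES (?, ?, NOW())
ON DUPLICATE KEY UPDATE
    status     = VALUES(status),      -- status/updated_at are placeholders for the real columns
    updated_at = VALUES(updated_at);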
The problem is that every day there are 3,000 rows to insert/update against this ever-growing list, which is currently at 300,000 rows and can probably expand to 1 million.
Luckily, I know that after 30 days a ticket cannot be updated again, so perhaps I can move those to an archive. Still, that's 3,000 rows a day against a month of data, which is usually 90,000 rows. That simply takes a lot of time. I haven't checked the exact figure, but maybe 30 minutes to an hour, maybe longer.
How do I optimize this insert/update process, which by definition has to look up each incoming Ticket_ID against the database of existing Ticket_IDs?
Is the insert/update process simply the best that there is?
The main problem is that the insert/update process can only have one instance running at a time - as opposed to having 10+ insert-only processes running simultaneously and then one delete-duplicates process afterwards.
Or is there a creative way - like finding ticket_ids where count(*) > 1 and deleting the one with the older timestamp - that could save me some time?
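To make that concrete, the delete-duplicates pass I'm imagining would look something like this; it assumes the insert-only design, where Ticket_ID is not declared unique, plus a hypothetical updated_at column to tell older rows from newer ones:

-- removes every copy of a ticket except the one with the newest timestamp
DELETE older
  FROM tickets AS older
  JOIN tickets AS newer
    ON newer.Ticket_ID = older.Ticket_ID
   AND newer.updated_at > older.updated_at;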
I know this is a complicated question but it seems like a common problem. Thanks.
I have a table which has tens of thousands of new rows added every hour.
Based on certain events, I mark a given row as complete by setting its status field to 1 and updating its status_timestamp; then, when querying the table with selects, I ignore all rows with a status of 1.
But this leads to a huge number of rows that I no longer need, all with a status of 1. I may also need them at a later point for logging purposes, but for the everyday purposes of my application such rows aren't needed.
I could delete the row instead of updating the field to 1, but I figure a delete is more costly than an update, and many inserts are happening per second.
Ultimately I would like a way to move all the rows with status 1 into some kind of log table, without impacting the current table, which has many inserts and updates happening per second.
This is a difficult question to answer. Indeed, updates (on a non-indexed field) should be faster than deletes. In a simple environment, you would do the delete, along with a trigger that logged the information that you wanted.
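For illustration, a minimal sketch of that trigger. The table and column names (work_items, status_log, id, status, status_timestamp) are placeholders, since the question doesn't name the table:

DELIMITER //
CREATE TRIGGER log_completed_rows
BEFORE DELETE ON work_items
FOR EACH ROW
BEGIN
  -- copy whatever you may need later for logging before the row disappears
  INSERT INTO status_log (item_id, status, status_timestamp)
  VALUES (OLD.id, OLD.status, OLD.status_timestamp);
END //
DELIMITER ;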
I find it hard to believe that there is no downtime for the database. Can't you do the deletes at 2:00 a.m. once per week on Sunday, in some time zone?
Normally if this isn't the case, then you have high-availability requirements. And, in such a circumstance, you would have a replicated database. Most times, inserts, updates, and queries would go to both databases. During database maintenance periods, only one might be up while the other "does maintenance". Then that database catches up with the transactions, takes over the user load, and the other database "does maintenance". In your case "does maintenance" means doing the delete and logging.
If you have a high availability requirement and you are not using replication of some sort, your system has bigger vulnerabilities than simply accumulating to-be-deleted data.
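Whichever setup you end up with, the "does maintenance" step itself can stay simple. Reusing the placeholder names from the trigger sketch above, and letting that trigger handle the logging:

-- weekly job in the quiet window; rerun with a LIMIT (e.g. LIMIT 10000) if one pass locks too much
DELETE FROM work_items
 WHERE status = 1
   AND status_timestamp < NOW() - INTERVAL 7 DAY;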
I have a web application that has a MySql database with a device_status table that looks something like this...
deviceid | ... various status cols ... | created
This table gets inserted into many times a day: 2,000+ rows per device per day, and we estimate having 100+ devices by the end of the year.
Basically this table gets a record when just about anything happens on the device.
My question is how should I deal with a table that is going to grow very large very quickly?
Should I just relax and hope the database will be fine in a few months when this table has over 10 million rows, and then in a year when it has 100 million rows? This is the simplest option, but it seems like a table that large would have terrible performance.
Should I just archive older data after some time period (a month, a week) and then make the web app query the live table for recent reports, and query both the live and archive tables for reports covering a larger time span?
Should I have an hourly and/or daily aggregate table that sums up the various statuses for a device? If I do this, what's the best way to trigger the aggregation? Cron? A DB trigger? Also, I would probably still need to archive.
There must be a more elegant solution to handling this type of data.
I had a similar issue in tracking the number of views seen for advertisers on my site. Initially I was inserting a new row for each view, and as you predict here, that quickly led to the table growing unreasonably large (to the point that it was indeed causing performance issues which ultimately led to my hosting company shutting down the site for a few hours until I had addressed the issue).
The solution I went with is similar to your #3 solution. Instead of inserting a new record when a new view occurs, I update the existing record for the timeframe in question. In my case, I went with daily records for each ad. What timeframe to use for your app would depend entirely on the specifics of your data and your needs.
Unless you need to specifically track each occurrence over the last hour, you might be overdoing it to even store them and aggregate later. Instead of bothering with a cron job to perform regular aggregation, you could simply check for an entry with matching specs. If you find one, you update a count field on the matching row instead of inserting a new row.
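If you take that route, the write side can be a single statement. A minimal sketch against your device_status example, assuming a hypothetical daily table device_status_daily with a UNIQUE key on (deviceid, day) and a counter column event_count (those names are made up, not from your schema):

INSERT INTO device_status_daily (deviceid, day, event_count)
VALUES (?, CURDATE(), 1)
ON DUPLICATE KEY UPDATE event_count = event_count + 1;

Each device then contributes one row per day instead of 2,000+, and the raw events only need to be stored at all if you genuinely need them later.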
I'm designing a system, and by digging into the numbers, I realize that it could reach a point where there could be a table with 54,240,211,584 records/year (approximately). WOW!!!!
So I broke it down and down to 73,271,952 records/year (approximately).
I got the numbers by building an Excel model of what would happen if:
a) no success = 87 users,
b) low moderated success = 4300 users,
c) high moderated success = 13199 users,
d) success = 55100 users
e) incredible success = nah
Taking into account that the table is used for SELECT, INSERT, UPDATE & JOIN statements and that these statements would be executed by any user logged into the system hourly/daily/weekly (historical data is not an option):
Question 1: is the 2nd quantity suitable/manageable for the MySQL engine, such that performance would suffer little impact???
Question 2: I set the table as InnoDB, but given the fact that I handle all of the statements with JOINs and that I'm likely to run into the 4GB limit problem, is InnoDB useful???
Quick overview of the tables:
table #1: user/event purchase. Up to 15 columns, some of them VARCHAR.
table #2: tickets by purchase. Up to 8 columns, only TINYINT. Primary key INT. From 4 to 15 rows are inserted for each table #1 insertion.
table #3: items by ticket. 4 columns, only TINYINT. Primary key INT. 3 rows are inserted for each table #2 insertion. I want to keep it as a separate table, but if someone has to die...
Table #3 is the target of the question. The way I reduced to the 2nd quantity was by turning each table #3 row into a table #2 column.
Something that I don't want to do, but would if necessary, is to partition the tables by week and add more logic to the application.
Every answer helps, but something like the following would be most helpful:
i) 33,754,240,211,584: No, so let's drop the last digit.
ii) 3,375,424,021,158: No, so let's drop the last digit.
iii) 337,542,402,115: No, so let's drop the last digit. And so on, until we get something like "well, it depends on many factors..."
What would I consider "little performance impact"??? Up to 1,000,000 records, it takes no more than 3 seconds to execute the queries. If 33,754,240,211,584 records take around 10 seconds, that's excellent to me.
Why don't I just test it myself??? I don't think I'm capable of doing such a test. All I would do is insert that quantity of rows and see what happens. I would prefer FIRST to hear from someone who has already dealt with something similar. Remember, I'm still in the design stage.
Thanks in advance.
54,240,211,584 is a lot. I only have experience with MySQL tables up to 300 million rows, and it handles that with little problem. I'm not sure what you're actually asking, but here are some notes:
Use InnoDB if you need transaction support, or are doing a lot of inserts/updates.
MyISAM tables are bad for transactional data, but ok if you're very read heavy and only do bulk inserts/updates every now and then.
There's no 4GB limit with MySQL if you're using recent releases and a recent operating system. My biggest table is 211GB now.
Purging data in large tables is very slow; e.g. deleting all records for a month takes me a few hours. (Deleting single records is fast, though.)
Don't use INT/TINYINT if you're expecting many billions of records; they'll wrap around.
Get something working, fix the scaling after the first release. An unrealized idea is pretty much useless, something that works(for now) might be very useful.
Test. There's no real substitute - your app and DB usage might be wildly different from someone else's huge database.
Look into partitioned tables, a recent feature in MySQL that can help you scale in many ways.
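To make that last note concrete, here is a sketch of a range-partitioned table; the table and column names are invented, not taken from your design:

-- hypothetical event table partitioned by month
CREATE TABLE events (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  payload    VARCHAR(255),
  PRIMARY KEY (id, created_at)      -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
  PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- purging a whole month becomes a near-instant metadata change instead of the hours-long DELETE mentioned above
ALTER TABLE events DROP PARTITION p2012_01;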
Start at the level you're at. Build from there.
There are plenty of people out there who will sell you services you don't need right now.
If $10/month shared hosting isn't working anymore, then upgrade, and eventually hire someone to help you get around the record limitations of your DB.
There is no 4GB limit, but of course there are limits. Don't plan too far ahead. If you're just starting up and you plan to be the next Facebook, that's great, but you have no resources.
Get something working so you can show your investors :)