MySQL: mass update data into existing table

I have a huge performance issue uploading data into a MySQL DB. As an example, I have special tools that mine, say, personal information for thousands of people.
One tool mines phone numbers, another mines home addresses, and another mines photos. For this example, say there are 100,000 people in Country A; I will have to mine data from other countries later on. The tools finish at different times: mining the phone numbers takes 20 minutes, mining the addresses takes 3 days, and mining the photos takes 1 week.
The customer wants to see the data in an existing table/DB as soon as possible. I wrote some scripts that detect when a tool finishes and then start uploading its data row by row. However, this seems to take a REALLY long time (using UPDATE ...).
Is there a faster way to do this?
The existing table in the DB is structured like this:
Columns: ID_COUNTRY,ID_PERSON,FULL NAME,PHONE,BLOB_PHOTO,ADDRESS
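For reference, here is a rough sketch of that table as DDL (the table name and the column types are assumptions; only the column names are given above):
CREATE TABLE people (
    ID_COUNTRY INT NOT NULL,          -- country the person belongs to
    ID_PERSON  INT NOT NULL,          -- person identifier within that country
    FULL_NAME  VARCHAR(255),
    PHONE      VARCHAR(50),
    BLOB_PHOTO LONGBLOB,
    ADDRESS    VARCHAR(500),
    PRIMARY KEY (ID_COUNTRY, ID_PERSON)
);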

Yes, there is a faster way. Put the data from each of the processes into its own separate table, by inserting into that table.
You will then have to create a query to gather the data:
select *
from people p left outer join
phones ph
on p.personid = ph.personid left outer join
addresses a
on p.personid = a.personid left outer join
photos pho
on p.personid = pho.personid;
Each individual table should start off empty. When the results are available, the table can be loaded using insert. This has at least two advantages: (1) inserts are faster than updates, and bulk inserts may be faster still; (2) the data is available in some tables without blocking inserts into the rest of the tables.
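As a rough sketch of that layout (table and column names here are assumptions, loosely following the query above), each tool gets its own narrow table and loads it the moment it finishes:
-- One table per mining tool, loaded independently.
CREATE TABLE phones    (personid INT NOT NULL PRIMARY KEY, phone   VARCHAR(50));
CREATE TABLE addresses (personid INT NOT NULL PRIMARY KEY, address VARCHAR(500));
CREATE TABLE photos    (personid INT NOT NULL PRIMARY KEY, photo   LONGBLOB);

-- Bulk insert (multi-row VALUES, or LOAD DATA INFILE) instead of row-by-row UPDATEs.
INSERT INTO phones (personid, phone) VALUES
    (1, '555-0100'),
    (2, '555-0101'),
    (3, '555-0102');
The customer-facing report can then run the join query above, so data from whichever tools have already finished shows up immediately.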

Related

MySQL 1:N Data Mapping

Something really bugs me and I'm not sure what the "correct" approach is.
When I run a select to get contacts from my database, there are a decent number of joins involved.
It looks something like this (around 60-70 columns):
SELECT *
FROM contacts
LEFT JOIN company
LEFT JOIN person
LEFT JOIN address
LEFT JOIN person_communication
LEFT JOIN company_communication
LEFT JOIN categories
LEFT JOIN notes
company and person have 1:1 cardinality, so they are straightforward.
But "address", "communication" and "categories" have 1:n cardinality.
So depending on the number of rows in the 1:n tables, I get a lot of "duplicate" rows (I don't know the real term for that; the rows are not literal duplicates, since the address or phone number etc. differs). For myself as a contact, a fairly filled-in contact, I get 85 rows back.
How do you guys work with that?
In my PHP application I always wrote some "data mapper" where the array key was the contact ID (the primary key), checked whether it already existed, and then pushed the additional data into it. PHP not being strictly typed makes that easy.
Now I'm learning Go and I thought: screw that LOOOONG select and the data mapping, just write separate selects for all the 1:n tables... yeah, no: not enough connections to load a table full of contacts. I know I can increase the connection limit, but the error seems to imply that this would be the wrong approach.
I use the following driver: https://github.com/go-sql-driver/mysql
I also tried GROUP_CONCAT, but then I ran into trouble parsing it back.
Do I have to use my mapping approach again, or is there some nicer solution out there? I found it quite dirty at points.
The solution is simple: you need to execute more than one query!
The cause of all the "duplicate" rows is that you're generating a result called a Cartesian product. You are trying to join to several tables with 1:n relationships, but each of these has no relationship to the other, so there's no join condition restricting them with respect to each other.
Therefore you get a result with every combination of all the 1:n relationships. If you have 3 matches in address, 5 matches in communication, and 5 matches in categories, you'd get 3x5x5 = 75 rows.
So you need to run a separate SQL query for each of your 1:n relationships. Don't be afraid—MySQL can handle a few queries. You need them.
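A minimal sketch of that split (table and key column names are assumptions based on the question) is one query for the 1:1 data plus one small query per 1:n table, all keyed by the contact id:
-- 1:1 data in a single query.
SELECT c.*, co.*, p.*
FROM contacts c
LEFT JOIN company co ON co.id = c.company_id
LEFT JOIN person  p  ON p.id  = c.person_id
WHERE c.id = ?;

-- One query per 1:n relationship.
SELECT * FROM address              WHERE contact_id = ?;
SELECT * FROM person_communication WHERE person_id  = ?;
SELECT * FROM categories           WHERE contact_id = ?;
In Go these can run one after another on the same *sql.DB pool; as long as each *sql.Rows is closed before the next query runs, a handful of queries per contact should not exhaust the connection pool.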

Combine data across dozens of DB's in a non-expensive query?

I run a site where companies create accounts and have their users sign up for their accounts. A couple of the companies I work with have sensitive data and for that reason the decision was made a while back that we would maintain separate databases for each company that registers with our site.
To be clear, the DB's look similar to the below for companies A, B and C.
db_main /* Stores site info for each company */
db_a
db_b
db_c
Now, I'm finding that sometimes a user creates an account with both company A and company B, so it would be nice if I could combine their progress from the two sites (A and B). For example, if the user earns 5 points on site A, and 5 points on site B, I would like for their total site points to read "10" (their combined total from 5 + 5).
There are hundreds of these databases, though, and I'm worried that it will be rough on the server to be constantly running queries across all databases. The user's score, for instance, is calculated on each page load.
Is there an efficient way to run queries in this manner?
Joining across 100 DBs should never be an option, and to answer your question: no, it won't be efficient.
What I would suggest instead is to create a global table that stores a cache of the points you are after. From the sounds of it, points are not 'sensitive' in any way, and I assume a userID is not either. Given that a customer should never have direct query access to this table, it should be a non-issue.
Scenario:
User joins siteA
earns 5 points
dbA gets updated
dbGlobalPoints gets upsert'ed (if exists (it won't), update points+5, else insert userID, 5)
User then joins siteB with same username (this may be your biggest issue if you don't have unique id's across systems)
profile query pulls/joins dbGlobalPoints for display
earns 10 points.
dbB gets updated
dbGlobalPoints gets upsert'ed (if exists (it will), update points+10, else insert userID, 10)
On initial run, a 'rebuild' process of sorts will need to be run which steps through each company table and populates the global table. This will also be useful later for a 'recount' process (say, you drop dbA and don't want those points to count anymore)
You could also make this a subroutine that fires once per user (in the background) if they don't yet have a record in the global points table.
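A minimal sketch of the upsert from the scenario above, assuming the global table lives in db_main and that the user id is unique across all company databases:
-- Global points cache, written alongside each per-company update.
CREATE TABLE db_main.global_points (
    user_id INT NOT NULL PRIMARY KEY,
    points  INT NOT NULL DEFAULT 0
);

-- Insert on first sight, otherwise add to the running total.
INSERT INTO db_main.global_points (user_id, points)
VALUES (42, 5)
ON DUPLICATE KEY UPDATE points = points + VALUES(points);
The profile page then reads a single row from this table instead of summing across every company database on each page load.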

member action table data model suggestion

I'm trying to add an action table, but I'm currently at odds over how to approach the problem.
Before I go into more detail, some background:
We have members who can do different actions on our website
add an image
update an image
rate an image
post a comment on image
add a blog post
update a blog post
comment on a blog post
etc, etc
The action table allows our users to "watch" other members' activities if they want to add them to their watch list.
I have created a table called member_actions with the following columns:
[UserID] [actionDate] [actionType] [refID]
[refID] can reference either the image ID in the DB, the blog post ID, or the id column of another actionable table (e.g. event)
[actionType] is an Enum column with action names such as (imgAdd,imgUpdate,blogAdd,blogUpdate, etc...)
[actionDate] will decide which records get deleted every 90 days... so we won't be keeping the actions forever
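For reference, a rough DDL sketch of that table (the column types and indexes are assumptions; the enum values are taken from the query below):
CREATE TABLE member_actions (
    UserID     INT NOT NULL,
    actionDate DATETIME NOT NULL,
    actionType ENUM('imgAdd','imgUpdate','imgRate','imgComment','imgCommentReply','imgFavorite',
                    'blogAdd','blogUpdate','blogComment','blogCommentReply',
                    'eventAdd','eventUpdate') NOT NULL,
    refID      INT NOT NULL,
    KEY idx_action_date (actionDate),         -- used by the 90-day purge
    KEY idx_user_date   (UserID, actionDate)  -- used when listing a member's activity
);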
The current MySQL query I came up with is:
SELECT act.*,
img.Title, img.FileName, img.Rating, img.isSafe, img.allowComment AS allowimgComment,
blog.postTitle, blog.firstImageSRC AS blogImg, blog.allowComments AS allowBlogComment,
event.Subject, event.image AS eventImg, event.stimgs, event.ends,
imgrate.Rating
FROM member_action act
LEFT JOIN member_img img ON (act.actionType="imgAdd" OR act.actionType="imgUpdate")
AND img.imgID=act.refID AND img.isActive AND img.isReady
LEFT JOIN member_blogpost blog ON (act.actionType="blogAdd" OR act.actionType="blogUpdate")
AND blog.id=act.refID AND blog.isPublished AND blog.isPublic
LEFT JOIN member_event event ON (act.actionType="eventAdd" OR act.actionType="eventUpdate")
AND event.id=act.refID AND event.isPublished
LEFT JOIN img_rating imgrate ON act.actionType="imgRate" AND imgrate.UserID=act.UserID AND imgrate.imgID=act.refID
LEFT JOIN member_favorite imgfav ON act.actionType="imgFavorite" AND imgfav.UserID=act.UserID AND imgfav.imgID=act.refID
LEFT JOIN img_comment imgcomm ON (act.actionType="imgComment" OR act.actionType="imgCommentReply") AND imgcomm.imgID=act.refID
LEFT JOIN blogpost_comment blogcomm ON (act.actionType="blogComment" OR act.actionType="blogCommentReply") AND blogcomm.blogPostID=act.refID
ORDER BY act.actionDate DESC
LIMIT XXXXX,20
OK, so basically, given that I'll be deleting actions older than 90 days every week or so... would it make sense to go with this query for displaying the member action history?
OR should I add a new text column to the member_actions table called [actionData], where I can store a few details in JSON or XML format for fast querying of the member_actions table?
It adds to the table size and reduces query complexity, but the table will be purged of old entries periodically anyway.
The assumption is that eventually we'll have no more than a few hundred thousand members, so should I be concerned about the size of the member_actions table with its text [actionData] column that will contain some specific details?
I'm leaning towards the [actionData] model, but any recommendations or considerations will be appreciated.
Another consideration is that the table entries for an image or blog post could get deleted... so I could have an action but no reference record... which certainly adds to the problem.
Thanks in advance.
Because you are dealing with a user interface, performance is key. All the joins do take time, even with indexes. And querying the database is likely to lock records in all the tables (or indexes), which can slow down inserts.
So, I lean towards denormalizing the data, by maintaining the text in the record.
However, a key consideration is whether the text can be updated after the fact. That is, you will load the data when it is created. Can it then change? The problem of maintaining the data in light of changes (which could involve triggers and stored procedures) could introduce a lot of additional complexity.
If the data is static, this is not an issue. As for table size, I don't think you should worry about that too much. Databases are designed to manage memory: the table is maintained in a page cache, which should hold the pages for currently active members. You can always increase memory, and 100,000 users is well within the realm of today's servers.
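A minimal sketch of the denormalized route, assuming the [actionData] column from the question holds a small JSON payload written once when the action is logged:
-- Add the denormalized detail column.
ALTER TABLE member_actions ADD COLUMN actionData TEXT NULL;

-- Write the display details at logging time...
INSERT INTO member_actions (UserID, actionDate, actionType, refID, actionData)
VALUES (123, NOW(), 'imgAdd', 456, '{"title":"Sunset","fileName":"sunset.jpg","rating":4.5}');

-- ...so the activity feed reads from a single table, with no joins.
SELECT UserID, actionDate, actionType, refID, actionData
FROM member_actions
ORDER BY actionDate DESC
LIMIT 0, 20;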
I'd be wary of this approach: as you add kinds of actions that you want to monitor, the join is going to keep growing (and so will the sparse extra columns in the select statement).
I don't think it would be that scary to have a couple of extra columns in this table, and this query sounds like it would run fairly frequently, so making it efficient seems like a good idea.

best practice database design for tracking progress through related tables (multiple left joins)

I have a django database application, which is constantly evolving.
We want to track the progress of samples as they progress from
sample -> library -> machine -> statistics, etc.
Generally it is a one-to-many relationship from each stage to the next, left to right.
Here is a simplified version of my database schema
table sample
id
name
table library
id
name
sample_id (foreign key to sample table)
table machine
id
name
status
library_id (foreign key to library table)
table sample_to_projects
sample_id
project_id
table library_to_subprojects
library_id
subproject_id
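For concreteness, here is that schema as rough DDL (column types and the foreign keys are assumptions based on the listing):
CREATE TABLE sample  (id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE library (id INT PRIMARY KEY, name VARCHAR(100), sample_id INT,
                      FOREIGN KEY (sample_id) REFERENCES sample(id));
CREATE TABLE machine (id INT PRIMARY KEY, name VARCHAR(100), status VARCHAR(50), library_id INT,
                      FOREIGN KEY (library_id) REFERENCES library(id));
CREATE TABLE sample_to_projects     (sample_id INT, project_id INT);
CREATE TABLE library_to_subprojects (library_id INT, subproject_id INT);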
So far it has been going ok, except now, everything needs to be viewed by projects. Each of the stages can belong to one or more projects. I have added a many_to_many relation between project and the existing tables.
I am trying to create some views that do the multiple left joins and show the progress of samples for a project.
sample A
sample B library_1 machine_1
sample B library_2 machine_2
sample C library_3
My first try at the query was like this:
SELECT fields FROM
sample_to_projects ,
sample
LEFT JOIN library ON sample.id = library.sample_id ,
library_to_project
LEFT JOIN machine ON machine.library_id = library.id
WHERE
sample_to_project.project_id = 30
AND sample_to_project.sample_id = sample.id
AND library_to_project.project_id = 30
AND library_to_project.library_id = library.id
The problem here is that the LEFT JOIN is done before the WHERE clause.
So suppose we have a sample that belongs to project_A and project_B.
If the sample only has a library for project_B, but we want to filter on project_A, the LEFT JOIN does not add a row with NULLs for the library columns (since there are libraries). However, those rows then get filtered back out by the WHERE clause, and the sample does not show up at all.
Results when filtering on project_A:
sample_1(project_A, project_B) library_A (project_A)
sample_1(project_A, project_B) library_B (project_A, project_B)
sample_2(project_A, project_B) library_C (project_B) *this row gets filtered out, it should show only the sample details*
So my solution is to create a subquery to join the other (right hand side) tables before the LEFT JOIN is done.
SELECT fields FROM
sample_to_projects ,
sample
LEFT JOIN (
SELECT library.id AS lib_id, library.sample_id AS sample_id, library.name AS lib_name, machine.name AS machine_name
FROM library,
lib_to_projects,
machine
WHERE lib_to_projects.library_id = library.id
AND lib_to_projects.project_id = 30
AND machine.library_id = library.id
)
AS join_table ON sample.id = join_table.sample_id
WHERE
sample_to_project.project_id = 30
AND sample_to_project.sample_id = sample.id
The problem is that there are a few more stages in the real version of my database, so I will need to do a nested subquery for each LEFT JOIN. The SQL will get pretty large and difficult to read, and I wondered if there is a better solution at the design level? Also it won't play nicely with Django models (though if I can get the SQL working I will be happy enough).
Or can anyone suggest some sort of best practice for this type of problem? I am sure it must be relatively common, with showing users in groups or something similar. If anyone knows a way that would fit well with Django models, that would be even better.
What about creating separate views for each Project_Id?
If you leave the database structure as is and add to it as the application progresses, you can create a separate view for each stage or Project_Id. If there are 30 of them (Project_Id 1..30), then create 30 separate views.
When you add a new stage... create a new view.
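A rough sketch of one such view (it follows the question's schema; filtering only on the sample-to-project link so the LEFT JOINs are preserved):
CREATE VIEW project_30_progress AS
SELECT s.id  AS sample_id,  s.name   AS sample_name,
       l.id  AS library_id, l.name   AS library_name,
       m.id  AS machine_id, m.status AS machine_status
FROM sample_to_projects sp
JOIN sample s       ON s.id = sp.sample_id
LEFT JOIN library l ON l.sample_id = s.id
LEFT JOIN machine m ON m.library_id = l.id
WHERE sp.project_id = 30;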
I'm not precisely clear on what you're using this for, but it looks like your use-case could benefit from Pivot Tables. Microsoft Excel and Microsoft Access have these, probably the easiest to set up as well.
Basically, you set up a query that joins all your related data together, possibly with some parameters a user would fill in (would make things faster if you have large amounts of data), then feed the result to the Pivot Table, and then you can group things any way you want. You could, on the fly, see subprojects by library, samples by machine, libraries by samples, and filter on any of those fields as well. So you could quickly make a report of Samples by Machine, and filter it so only samples for machine 1 show up.
The benefit is that you make one query that includes all the data you might want, and then you can focus on just arranging the groups and filtering. There are more heavy-duty systems for this sort of stuff (OLAP servers), but you may not need that if you don't have huge amounts of data.

Empty Rows in MySQL SELECT with LEFT JOIN

I have three tables called: users, facilities, and staff_facilities.
users contains average user data, the most important fields in my case being users.id, users.first, and users.last.
facilities also contains a fair amount of data, but none of it is necessarily pertinent to this example except facilities.id.
staff_facilities consists of staff_facilities.id (int, auto_inc, NOT NULL), staff_facilities.users_id (int, NOT NULL), and staff_facilities.facilities_id (int, NOT NULL). (That's a mouthful!)
staff_facilities references the ids for the other two tables, and we are calling this table to look up users' facilities and facilities' users.
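For reference, a rough DDL sketch of that join table (types follow the description above; the key layout is an assumption):
CREATE TABLE staff_facilities (
    id            INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    users_id      INT NOT NULL,   -- references users.id
    facilities_id INT NOT NULL    -- references facilities.id
);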
This is my select query in PHP:
SELECT users.id, users.first, users.last FROM staff_facilities LEFT JOIN users ON staff_facilities.users_id=users.id WHERE staff_facilities.facilities_id=$id ORDER BY users.last
This query works great on our development server, but when I drop it into the client's production environment, blank rows often appear in the result set. Our development server uses replicated tables and data that already exist on the client's production server, but the hardware and software vary quite a bit.
These rows are devoid of any information, including the three id fields that require NOT NULL values to be entered into the database. Running the query through the MySQL management tools on the backend returns the same results. Searching the table for NULL fields has not turned up anything.
The other strange thing is that the number of empty rows is changing based on the varying results caused by the WHERE clause id check. It's usually around one to three empty rows, but they are consistent when using the same parameter.
I've many times dealt with the returning of nearly duplicate rows due to LEFT JOINS, but I've never had this happen before. As far as displaying the information goes, I can easily hide it from the end user. My concern is primarily that this problem will be compounded as time passes and the number of records grows larger. As it sits, this system has just been installed, and we already have 2000+ records in the staff_facilities table.
Any insight or direction would be appreciated. I can provide further more detailed examples and information as well.
You are only selecting columns from the table on the right side of the join (users), and of course some of the result rows are completely null: you did a LEFT JOIN. Those staff_facilities records have a users_id that matches no id in the users table, so they match a row on the left side of the join but no data on the right side. Since you aren't returning any columns from the left table, those rows appear completely empty.
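Two quick ways to check that (names taken from the question; the facility id is a placeholder): find the staff_facilities rows that point at a missing user, or switch to an INNER JOIN so the unmatched rows drop out.
-- Orphaned staff_facilities rows: users_id with no matching users row.
SELECT sf.id, sf.users_id, sf.facilities_id
FROM staff_facilities sf
LEFT JOIN users u ON u.id = sf.users_id
WHERE u.id IS NULL;

-- Or require a match so the blank rows disappear from the listing.
SELECT u.id, u.first, u.last
FROM staff_facilities sf
INNER JOIN users u ON u.id = sf.users_id
WHERE sf.facilities_id = 123   -- placeholder facility id
ORDER BY u.last;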