MySQL GROUP BY slow across three tables with spatial search

I am adding some further First World War records to my astreetnearyou.org site.
I have three tables:
people - contains full details of over 1 million people who died
addresses - contains about 700,000 different addresses for about 600,000 of these people
cemeteries - a new table which has records of about 15,000 cemeteries
In terms of relationships, every address has the ID of the person it relates to, and every person in the people table has the name of the cemetery they are buried in. (As an aside, these can be long varchar values - would it be better to give them unique integer IDs for the join? Answer: I tried it and it shaved about 0.5 seconds off the query time.)
I want to run a query that essentially says "give me a unique list of all the people who lived or are buried in this map area (bounding box)"
An example query is:
SELECT people.id, people.rank, people.forename, people.surname, people.regiment, people.date_of_death, people.cemeteryname, cemeteries.country, cemeteries.link
FROM people
JOIN cemeteries ON people.cemeteryId=cemeteries.id
LEFT JOIN addresses ON addresses.personId=people.id
WHERE MBRContains(GeomFromText('LINESTRING(-0.35 51.50,-0.32 51.51)'), cemeteries.point)
   OR MBRContains(GeomFromText('LINESTRING(-0.35 51.50,-0.32 51.51)'), addresses.point)
GROUP BY people.id
This returns 276 results but takes about 6 seconds. Without the GROUP BY it's 296 results, including the duplicate IDs, but takes well under a second. If I remove the LEFT JOIN and the associated WHERE condition (so I only get matches by cemetery, not address) it is also very quick.
I have spatial indexes on both point fields and indexes on all the fields that are in the JOIN conditions. Based on another post on here, I've also added indexes across the id and point fields in the addresses table, and the cemetery and point fields in the cemeteries table.
I'm no SQL expert, so any advice on making this more efficient and thereby quicker would be much appreciated. Also I guess some more table info would probably be of use, but can you tell me what would be helpful and how to produce it?!

ALTER TABLE people ADD INDEX IdCemIdIdx (id, cemeteryId);
If possible, apply the change with pt-online-schema-change so the table stays available while the index is built:
https://www.percona.com/doc/percona-toolkit/LATEST/pt-online-schema-change.html

Related

Optimal MySQL table schema for given use case

I have two tables - books and images. The books table has many columns, including id (primary key), name (which is not unique), releasedate, etc. The images table has two columns: id (which is not unique, i.e. one book id may have multiple images associated with it, and we need all those images; this column has a non-unique index), and poster (which is the unique primary key; all images lie in the same bucket, hence cannot have duplicate names). My requirement is: given a book name, find all images associated with it (along with the year of release and the bucketname for each image, the bucketname being just a number in this case).
I am running this query:
select books.id,poster,bucketname,year(releasedate) from books
inner join images where images.bookId = books.id and books.name = "<name>";
The sample result set shows two matching books: one with id 2 and year 1989, which has 5 images, and another with id 261009 and year 2013, which has a single image.
The problem is, the query is extremely slow. It takes around 0.14 seconds from the MySQL console itself, under zero load (in production there may be several concurrent requests and they may be queued, leading to further delay), which is unacceptable for autocomplete. Can anyone tell me how to optimize the query by adding the correct indices/keys to the tables? If it is not possible in MySQL, suggestions regarding a proper Redis schema would be useful as well.
Edit: Approx. no. of rows in images - 480k, in books - 285k. In the future, autocomplete will show results for book authors as well as book names, hence the query will need to expand to take into account a separate authors table where each author will have an id and name, just like a book.
For optimal performance, you want suitable covering indexes available. For example:
... on `books` (`name`,`id`,`releasedate`)
... on `images` (`bookid`,`poster`,`bucketname`)
We want name as the leading column in the index, because of the equality predicate in the WHERE clause. We want id and releasedate also included in the index to make it a "covering index", so the query can be satisfied from the index, without a need to visit pages of the underlying table to retrieve values.
We want bookid as the leading column because of the reference in the ON clause. Again, having poster and bucketname available right in the index makes it a "covering" index.
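For reference, those two indexes could be created like this (the index names are just illustrative):
CREATE INDEX books_name_id_releasedate ON books (name, id, releasedate);
CREATE INDEX images_bookid_poster_bucketname ON images (bookid, poster, bucketname);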
Use EXPLAIN to see the query execution plan.
Also, note that the inner join operation won't return a row from books if a matching row in images is not found. If we want to return a row from books even when no image is available, we could use an outer join.
I'd write the query like this:
SELECT b.id
, i.poster
, i.bucketname
, YEAR(b.releasedate)
FROM books b
LEFT
JOIN images i
ON i.bookid = b.id
WHERE b.name = ?
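To check the plan, and whether the covering indexes are actually used, you can prefix the query with EXPLAIN (substituting a literal for the ? placeholder), for example:
EXPLAIN
SELECT b.id
     , i.poster
     , i.bucketname
     , YEAR(b.releasedate)
  FROM books b
  LEFT
  JOIN images i
    ON i.bookid = b.id
 WHERE b.name = 'some book name';
If the index is used as a covering index, the Extra column of the EXPLAIN output shows "Using index".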

To create MySQL tables for each specific user, or generalize the tables?

I'm running into all kinds of design questions while planning my database:
Outline:
The database is a patient database with a large number of patients.
Each patient has tons of data, e.g. blood pressure values on different dates.
Questions:
Would it be easier to create tables per patient, e.g.
"bob_builder_BPvalues", or to create one table for the BP values, e.g. "BP_values", and then have all the patients' values in there, linked via foreign keys?
As I have so much data per patient, it does not seem to make sense to mix the blood pressure values of every patient into one single table, as this would look very messy to a human. Which approach would be faster in terms of processing and sorting through the data?
Let's say you have 10 patients:
With your first approach, you'd end up with 10 different tables always containing the same type of data.
For each query on a single patient, you would have to build a dynamic query joining to the right table:
SELECT ...
FROM patients
INNER JOIN bobby_measures ON ... -- this has to be crafted dynamically each time
WHERE patients.name = 'bobby'
And what if you want to compute some statistics on some kind of data, for a range of dates, across all patients? Querying this becomes a nightmare, even with 10 patients. So guess what happens when you have 1000...
On the other hand, your second choice makes (arguably) human reading of the database more difficult. But being read directly by a human is not one of the objectives of a database.
With a single patientData table (or as many tables as you want, one per datatype if needed - bloodPressure and so on), everything becomes simpler. You can query any patient using the same query, changing only the patient id, and you can run all the queries you want for a range of dates, filtering on some datatype, or whatever.
SELECT ...
FROM patients
INNER JOIN patientData ON ...
WHERE patients.name in ('bobby', 'joe'...)
AND patientData.type = 'blood pressure'
AND patientData.date BETWEEN ... AND ...
-- and so on
Using the right indices on the patientData table(s) and an appropriate presentation layer, all this data becomes totally readable by an average user.
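As a rough sketch of what "the right indices" might look like, assuming the patientData table has patientId, type and date columns (the column names here are assumptions):
CREATE INDEX ix_patientdata_patient_type_date ON patientData (patientId, type, date);
-- for statistics across all patients, filtered by datatype and date range:
CREATE INDEX ix_patientdata_type_date ON patientData (type, date);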
Have a single table for all patients. This can then link to a BloodPressureResults table using a foreign key. The relationship between them is:
Patient 1----* BloodPressureResults
So a single patient can have many blood pressure results.
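A minimal sketch of that one-to-many relationship (the column names and types beyond Patient_Id are assumptions):
CREATE TABLE Patients (
  Id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Name VARCHAR(100) NOT NULL
);

CREATE TABLE BloodPressureResults (
  Id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Patient_Id INT NOT NULL,
  Systolic INT,
  Diastolic INT,
  MeasuredOn DATETIME,
  FOREIGN KEY (Patient_Id) REFERENCES Patients (Id)
);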
You would then be able to view the blood pressure results for a specific patient by using a simple query...
SELECT * FROM BloodPressureResults
WHERE Patient_Id = '1'
This would then return you all of the blood pressure results for the patient with an Id of 1.
You would then also be able to add other tables like WeightResults or BloodTestResults in the same way as the BloodPressureResults table.
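For example, a WeightResults table could follow exactly the same pattern (the columns here are illustrative):
CREATE TABLE WeightResults (
  Id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Patient_Id INT NOT NULL,
  WeightKg DECIMAL(5,2),
  MeasuredOn DATETIME,
  FOREIGN KEY (Patient_Id) REFERENCES Patients (Id)
);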

Best structure for tables with more than 10000 columns

I am applying a group of data mining algorithms to a dataset comprising a set of customers along with a large number of descriptive attributes that summarize various aspects of their past behavior. There are more than 10,000 attributes, each stored as a column in a table with the customer id as the primary key. For several reasons, it is necessary to pre-compute these attributes rather than calculating them on the fly. I generally try to select customers with a specified attribute set. The algorithms can combine any arbitrary number of these attributes together in a single SELECT statement and join the required tables. All the tables have the same number of rows (one per customer).
I am wondering what's the best way to structure these tables of attributes. Is it better to group the attributes into tables of 20-30 columns, requiring more joins on average but fewer columns per SELECT, or have tables with the maximum number of columns to minimize the number of joins, but having potentially all 10K columns joined at once?
I also thought of using one giant 3-column customerID-attribute-value table and storing all the info there, but it would be harder to structure the "select all customers with these attributes"-type query that I need.
I'm using MySQL 5.0+, but I assume this is a general SQL-ish question.
From my experience, using tables with 10,000 columns is a very, very bad idea. What if this number increases in the future?
If there are a lot of attributes, you shouldn't use horizontally scaled tables (with a large number of columns). You should create a new attributes table and place all the attribute values in it, then connect it to the main entity table with a many-to-one relationship.
A second option would be to use a NoSQL system (like MongoDB).
As @odiszapc said, you have to use a meta-model structure, for instance:
CREATE TABLE customer(ID INT NOT NULL PRIMARY KEY, NAME VARCHAR(64));
CREATE TABLE customer_attribute(ID INT NOT NULL, ID_CUSTOMER INT NOT NULL, NAME VARCHAR(64), VALUE VARCHAR(1024));
Return basic information for a given customer:
SELECT * FROM customer WHERE name = 'John';
Return customer(s) matching certain attributes:
SELECT c.*
FROM customer c
INNER JOIN customer_attribute a1 ON a1.id_customer = c.id
AND a1.name = 'address'
AND a1.value = '1078, c/ los gatos madrileños'
INNER JOIN customer_attribute a2 ON a2.id_customer = c.id
AND a2.name = 'age'
AND a2.value = '27'
Your generator should generate the inner joins on the fly.
Proper indexes on the tables should allow all of this to run relatively fast (if we assume 10k attributes per customer and 10k customers, that's actually quite a challenge...).
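For example (the index names are illustrative, and value gets a prefix because a full VARCHAR(1024) is too long to index in many MySQL configurations):
CREATE INDEX ix_attr_name_value ON customer_attribute (name, value(191), id_customer);
CREATE INDEX ix_attr_customer ON customer_attribute (id_customer);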
10,000 columns is a lot. The SELECT statement will be very long and messy unless you use *. I think you can narrow the attributes down to the most useful and meaningful ones and eliminate the others.

Intersection of two (very) big tables

I have two tables: all_users and vip_users
The all_users table has a list of all users (you don't say?) in my system and currently has around 57k records, while the vip_users table has around 37k records.
The primary key in both tables is an auto-increment id field. The all_users table is big in terms of attribute count (around 20, one of them being email), while the vip_users table has only an email attribute (along with the id).
I wanted to query out the "nonVip" users by doing this (with help of this question here on SO):
SELECT all_users.id, all_users.email
FROM all_users
LEFT OUTER JOIN vip_users
ON (all_users.email=vip_users.email)
WHERE vip_users.email IS NULL
And now, finally coming to the problem: I ran this query in phpMyAdmin and even after 20 minutes I was forced to close it and restart the httpd service, as it was taking too long to complete, my server load jumped over 2, and the site (which also queries the database) became useless because it was loading too slowly. So, my question is: how do I make this query work? Do I write some script and run it overnight, not using phpMyAdmin (is this maybe where the problem lies?), or do I need to use a different SQL query?
Please help with your thoughts on this.
Try indexing the email field on both tables; that should speed up the query:
CREATE INDEX useremail ON all_users(email);
CREATE INDEX vipemail ON vip_users(email);
As written, you're not getting the results you're looking for. You're looking for vip_users rows where the email matches an email in all_users, and is also NULL.
Is there a reason you want vip_users to have a separate id from all_users? If you change the vip_users id field to a foreign key referencing the all_users id field, you would then change your select to:
SELECT all_users.id, all_users.email
FROM all_users
LEFT OUTER JOIN vip_users
ON (all_users.id=vip_users.id)
WHERE vip_users.email IS NULL;
There's no reason this query should take any discernible amount of time. 37k records is not a very big table...
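If you do add the foreign key, it could be declared like this (a sketch; the constraint name is illustrative, and both tables need to be InnoDB for the constraint to be enforced):
ALTER TABLE vip_users
  ADD CONSTRAINT fk_vip_users_all_users
  FOREIGN KEY (id) REFERENCES all_users (id);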
I think NOT IN is faster and uses fewer resources than a LEFT OUTER JOIN.
Can you try this (matching on email, since the id columns in the two tables are independent auto-increments)?
SELECT *
FROM all_users
WHERE email NOT IN (SELECT email
                    FROM vip_users
                    WHERE email IS NOT NULL);

Is there something more efficient than joining tables in MySQL?

I have a table with an entity-attribute-value structure. As an example, the entities could be different countries, and I could have the following attributes: "located in", "has border with", "capital".
Then I want to find all the countries which are "located in Asia" and "have a border with Russia". The straightforward way to do that is to join the table with itself, using the entity as the join column, and then to filter with WHERE.
However, if I have 20 rows where Russia is in the entity column, then in the joined table I will have 20*20 = 400 rows with Russia as the entity. And it is so for every country. So the joined table is going to be huge.
Would it not be more efficient to use the original table to extract all the countries which are located in Asia, then to extract all the countries which have a border with Russia, and then to take the elements which are in both sets of countries?
You shouldn't end up having a huge number of records, so this should work:
SELECT a.entity,
a.located_in,
a.border
FROM my_table a
WHERE a.border IN (SELECT b.entity FROM my_table b WHERE b.entity = 'RUSSIA')
AND a.located_in = 'ASIA'
You are confusing a join with a Cartesian product. There can never be more rows in the join than there are in the actual data; the only thing being altered is which elements/rows are taken.
So if you have 20 Russian rows, the table resulting from the join could never have more than 20 Russian entries.
The operation you suggest using is exactly what a join does. Just make sure you have the appropriate indices and let MySQL do the rest.
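As a sketch, the self-join on the entity-attribute-value table described in the question could look like this (the attribute and value column names are assumptions):
SELECT a.entity
FROM my_table a
JOIN my_table b ON b.entity = a.entity
WHERE a.attribute = 'located in' AND a.value = 'Asia'
  AND b.attribute = 'has border with' AND b.value = 'Russia';
An index on (attribute, value, entity) lets MySQL resolve each side of that join from the index instead of scanning the whole table:
CREATE INDEX ix_attribute_value_entity ON my_table (attribute, value, entity);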