SQL tree traversal - MySQL

I am not totally sure I am naming this right, but please bear with me.
I am wondering if it is possible to do something like this in SQL (MySQL specifically):
Let's say we have tree-like data that is persisted in the database in the following table:
mysql> desc data_table;
+-----------+------------------+------+-----+---------+----------------+
| Field     | Type             | Null | Key | Default | Extra          |
+-----------+------------------+------+-----+---------+----------------+
| id        | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| parent_id | int(10) unsigned | YES  | MUL | NULL    |                |
| value     | text             | YES  |     | NULL    |                |
+-----------+------------------+------+-----+---------+----------------+
So each row has a parent, except for the 'root' row, and each row has children, except for leaf rows.
Is it possible to find all descendants of a given row using SQL alone?

It's possible to fetch all descendants utilizing solely SQL, but not in a single query. But I'm sure you figured that out; I assume you mean you want to do it in a single query.
You might be interested in reading about some alternative designs to store tree structures, that do enable you to fetch all descendants using a single SQL query. See my presentation Models for Hierarchical Data with SQL and PHP.
You can also use recursive SQL queries with other brands of database (e.g. PostgreSQL), but MySQL does not currently support this feature.
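For reference, in a database that does support recursive queries, fetching all descendants of a row is a single statement. A sketch against the table from the question (id = 1 is a placeholder for the starting row; MySQL has since added WITH RECURSIVE in version 8.0):

```sql
WITH RECURSIVE descendants AS (
    -- Anchor: the row whose subtree we want.
    SELECT id, parent_id, value
    FROM data_table
    WHERE id = 1
    UNION ALL
    -- Recursive step: rows whose parent is already in the result set.
    SELECT d.id, d.parent_id, d.value
    FROM data_table AS d
    JOIN descendants AS p ON d.parent_id = p.id
)
SELECT * FROM descendants;
```

The anchor row itself is included; add WHERE id <> 1 to the final SELECT to exclude it.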

You're probably better off with the nested set model instead (see http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/, further down). It's far more efficient for selects, and you can get the complete path to each node with a simple self join.
However, in practice it is a good idea to pre-cache path and depth if you want to do things like "where depth = 3", or want to show the complete path for multiple nodes, once you have more than 1000 records in your table.
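To illustrate, in a nested set design each row carries two boundary columns, conventionally named lft and rgt (names assumed here, following the linked article), and all descendants of a node come out of one range query:

```sql
-- Fetch every descendant of node 1 (placeholder id) in a single query.
-- 'lft' and 'rgt' are the nested-set boundary columns from the article.
SELECT child.id, child.value
FROM data_table AS parent
JOIN data_table AS child
  ON child.lft BETWEEN parent.lft AND parent.rgt
WHERE parent.id = 1
  AND child.id <> parent.id;  -- exclude the starting node itself
```

The trade-off is that inserts and moves must renumber lft/rgt values, which is where the pre-cached path/depth columns mentioned above help.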

I was just asking myself the same question.
This is what I found by googling:
http://moinne.com/blog/ronald/mysql/manage-hierarchical-data-with-mysql-stored-procedures
It works with stored procedures.
But in my opinion, putting that much logic in the database is not a good thing.

Related

Data model question: how to qualify 3rd objects

I'm dealing with slightly different types, hence for clarity of what I'm trying to achieve I have decided to use a metaphor.
Let's say you need to create tables that describe projects by two architectural bureaus:
1st only deals with 3D plans
2nd only deals with 2D sketches
I have the following table
mysql> describe sketch;
+------------+-----------------------+------+-----+---------+
| Field      | Type                  | Null | Key | Default |
+------------+-----------------------+------+-----+---------+
| project_id | binary(16)            | NO   | PRI | NULL    |
| company_id | binary(16)            | NO   | PRI | NULL    |
| type       | enum('2D','3D','N/A') | YES  |     | N/A     |
+------------+-----------------------+------+-----+---------+
As you can see, project_id & company_id form the PRIMARY KEY.
The issue arises when, in some exceptional circumstances, the same company takes on both a 2D and a 3D task under the same project ID.
Or the same company starts working on two or more sub-projects of the same type (e.g. both are 2D sketches), but within the realm of, let's call it, a parent project with exactly the same ID.
One quick and dirty fix would be simply to add a unique ID to the above table, but it wouldn't work for me, because there are various reports and other functions which basically do this: SELECT blah FROM sketch WHERE project_id=XXX AND company_id
I could add code to filter the results from the above SQL, but I can't really change the structure of the table.
Any ideas on what options I have?
Appreciate any ideas!
And thank you very much beforehand!
As you describe the problem, company/project is not a primary key: you describe circumstances in which its uniqueness is violated.
company/project/type, however, does seem to be a unique key and a candidate primary key. I would say that you should add a numeric primary key and declare the tripartite key as unique.
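If the table's structure can be changed after all, a sketch of that design (constraint name is illustrative):

```sql
-- Replace the two-column primary key with a numeric surrogate key,
-- and keep the tripartite key enforced as unique.
ALTER TABLE sketch
    DROP PRIMARY KEY,
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ADD UNIQUE KEY uq_project_company_type (project_id, company_id, type);
```

Existing queries filtering on project_id and company_id keep working; they may simply return more than one row, which matches the business reality you describe.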

Removing duplicate TEXTS from large mysql table

I have a MySQL table with the following structure:
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| content | longtext | NO | | NULL | |
| valid | tinyint(1) | NO | | NULL | |
| created_at | timestamp | YES | | NULL | |
| updated_at | timestamp | YES | | NULL | |
+------------+------------------+------+-----+---------+----------------+
I need to remove duplicate entries by the content column. Everything would be easy if it weren't LONGTEXT: entries in that column vary in length from 1 character to over 12,000, and I have over 4,000,000 rows. Even a simple query like select id from table where content like "%stackoverflow%"; takes 15 seconds to execute. What would be the best approach to remove the duplicate entries without waiting two days for the query to finish?
MD5 is your friend here. Make a separate hashvalues table (to avoid locking/contention with this table in production) with columns for the id and the hash. This table should be indexed on the hash column rather than on id; note the hash cannot be a unique key on its own, or the duplicates you are looking for could never be inserted.
Once the new empty table is created, use MySql's md5() function to populate the new table from your original data, with the original id and the md5(content) for the field values. If necessary you can even populate the table in batches, if it would take too long or slow things down too much to do it all at once.
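Those two steps might look like this (table and column names are illustrative):

```sql
-- Side table keyed by (hash, id), so duplicate hashes are cheap to find
-- and the same hash can appear for many ids.
CREATE TABLE hashvalues (
    hash CHAR(32)     NOT NULL,
    id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (hash, id)
);

-- Populate it from the original data; add WHERE id BETWEEN ... AND ...
-- to do this in batches on a busy production server.
INSERT INTO hashvalues (hash, id)
SELECT MD5(content), id
FROM mytable;
```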
When the new table is fully populated with data, you can JOIN it to itself like this:
SELECT h1.*
FROM hashvalues h1
INNER JOIN hashvalues h2 on h1.hash = h2.hash and h1.id <> h2.id
This should be MUCH faster than comparing the content directly, since the database only has to compare pre-computed hash values; I'd expect it to run almost instantly. It will tell you which records are potential duplicates. There is still a potential for hash collisions, so you also need to compare the matches back to the original data to be sure, or include an originalcontent column in the new table that you can use with the query above. Once that's done, you will know which records to remove.
This system can be even better if you can add a column to the original table to keep the md5() hash of your content field up to date every time it changes. A Generated Column will work well for this if you have the right storage engine. Otherwise, you can use a trigger. This column will allow you to re-run your duplicates check as needed, without all the extra work with the separate table.
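On MySQL 5.7+ the generated-column variant might look like this (column and index names assumed):

```sql
-- Keep an always-current MD5 of content on the row itself.
-- STORED means the value is materialized, so it can be indexed cheaply.
ALTER TABLE mytable
    ADD COLUMN content_md5 CHAR(32)
        GENERATED ALWAYS AS (MD5(content)) STORED,
    ADD INDEX idx_content_md5 (content_md5);
```

After that, the duplicates check is a self join on content_md5 with no separate table to maintain.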
Finally, there are also SHA(), SHA1(), and SHA2() functions that are more collision-resistant. However, MD5() is faster, and the additional collision resistance doesn't remove the need to compare the original data anyway. This also isn't a security situation where deliberate collisions matter, so MD5() is the better choice here. These aren't passwords, after all.

Constrain database to hold one value or the other never both

Is it possible to add a database constraint to limit a row to have a single value in one of two columns, never more and never less? Let me illustrate:
Sales Order Table
---------------------------------
id | person_id | company_id
Rows for this would look like:
id | person_id | company_id
---|-----------|-----------
 1 |         1 |       null
 2 |         2 |       null
 3 |      null |          1
 4 |      null |          2
In this illustration, the source of the sales order is either a person or a company. It is one or the other, no more or less. My question is: is there a way to constrain the database so that 1) both fields can't be null and 2) both fields can't be not-null? i.e., one has to be null and one has to be not-null...
I know the initial reaction from some may be to combine the two tables (person, company) into one customer table. But, the example I'm giving is just a very simple example. In my application the two fields I'm working with cannot be combined into one.
The DBMS I'm working with is MySQL.
I hope the question makes sense. Thank you in advance for your help!
This may come as a shock...
MySQL doesn't support CHECK constraints. It allows you to define them, but it totally ignores them.
They are allowed in the syntax only to provide compatibility with other databases' syntax.
You could use a trigger on update/insert, and use SIGNAL to raise an exception.
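A sketch of such a trigger for the sales order example (the table name sales_order is assumed; a matching BEFORE UPDATE trigger would be needed as well):

```sql
DELIMITER //

CREATE TRIGGER sales_order_xor_check
BEFORE INSERT ON sales_order
FOR EACH ROW
BEGIN
    -- Both NULL or both non-NULL makes the two IS NULL tests equal;
    -- reject the row in either case.
    IF (NEW.person_id IS NULL) = (NEW.company_id IS NULL) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Exactly one of person_id and company_id must be set';
    END IF;
END//

DELIMITER ;
```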

Django: What's the fastest way to order a QuerySet based on the count of a related field?

I've got an Item model in my Django app with a ManyToMany field that's handled through an intermediate Favorite model. Here are simplified versions of the models in question:
class Item(models.Model):
    name = models.CharField(max_length=200)

class Favorite(models.Model):
    user = models.ForeignKey(User)
    item = models.ForeignKey(Item)
I'm trying to get a list of items ordered by the number of favorites. The query below works, however there are thousands of records in the Item table, and the query is taking up to a couple of minutes to complete.
items = Item.objects.annotate(num_favorites=Count('favorite')).order_by('-num_favorites')
Not sure if this is relevant to any potential answers, but I'm paginating the results using Django's built-in Paginator:
paginator = Paginator(items, 100)
I know I can add a favorites field to my Item model and increment that every time an item is favorited, but I'm wondering if there's another cleaner, more efficient way to retrieve this data in a reasonable time.
Below is the output of the MySQL EXPLAIN function:
+----+-------------+--------------------+------+-----------------------------+-----------------------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+-----------------------------+-----------------------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | appname_item | ALL | NULL | NULL | NULL | NULL | 566 | Using temporary; Using filesort |
| 1 | SIMPLE | appname_favorite | ref | appname_favorite_67b70d25 | appname_favorite_67b70d25 | 4 | appname.appname_item.id | 1 | |
+----+-------------+--------------------+------+-----------------------------+-----------------------------+---------+-------------------------------+------+---------------------------------+
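For reference, the annotated queryset compiles to roughly this SQL (simplified), which is why the EXPLAIN output shows a full scan of appname_item plus "Using temporary; Using filesort":

```sql
SELECT appname_item.*,
       COUNT(appname_favorite.id) AS num_favorites
FROM appname_item
LEFT JOIN appname_favorite
       ON appname_favorite.item_id = appname_item.id
GROUP BY appname_item.id
ORDER BY num_favorites DESC;
```

The sort key is an aggregate computed at query time, so MySQL must materialize all the counts before it can order them.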
Whenever you use an ORDER BY clause on something that has no index, try to define an index that supports it; it makes the ordering process much faster. In your case, the sort on 'num_favorites' is what needs index support to speed up the query execution time. The technique you are using is otherwise fine; there is nothing bad in it.
Please see the related question I asked on the same issue, where saverio gave an excellent answer tackling the problem. Hope this helps.
That is already the best way. If it's slow, the problem is probably with your database - specifically, you don't have the right indexes for that query so the db is having to do too much sorting.
The Django debug toolbar is great for diagnosing this sort of thing - it will show you how long each query is taking, and allow you to run the db's EXPLAIN function on each one. The MySQL docs will tell you what the output from EXPLAIN means. However, after that it's up to you to optimise the db.

MySQL: Appending records: find then append or append only

I'm writing a program, in C++, to access tables in MySQL via the MySQL C++ Connector.
I retrieve a record from the user (via GUI or XML file).
Here are my questions:
1. Should I search the table first for the given record, then append it if it doesn't exist, or
2. Append the record, and let MySQL reject it if it is not unique?
Here is my example table:
mysql> describe ing_titles;
+----------+----------+------+-----+---------+-------+
| Field    | Type     | Null | Key | Default | Extra |
+----------+----------+------+-----+---------+-------+
| ID_Title | int(11)  | NO   | PRI | NULL    |       |
| Title    | char(32) | NO   |     | NULL    |       |
+----------+----------+------+-----+---------+-------+
Ultimately, I am looking for a solution that will enable my program to respond quickly to the user.
During development I have small tables (fewer than 5 records), but I am expecting them to grow once I formally release the application.
FYI: I am using Visual Studio 2008, C++, wxWidgets, and the MySQL C++ Connector on Windows XP and Vista.
Mark the field in question with a UNIQUE constraint and use INSERT ... ON DUPLICATE KEY UPDATE or INSERT IGNORE.
The former will update the record if it already exists; the latter will just do nothing.
Searching the table first is not efficient, since it requires two round trips to the server: the first one to search, the second one to insert (or update).
The syntaxes above do the same thing in one statement.
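Applied to the example table, the two variants might look like this (the titles are placeholders):

```sql
-- Ensure uniqueness is enforced by the server, not by application code.
ALTER TABLE ing_titles ADD UNIQUE KEY uq_title (Title);

-- Variant 1: update the existing row if the title is already there...
INSERT INTO ing_titles (ID_Title, Title)
VALUES (1, 'Example Title')
ON DUPLICATE KEY UPDATE Title = VALUES(Title);

-- Variant 2: ...or silently skip the duplicate.
INSERT IGNORE INTO ing_titles (ID_Title, Title)
VALUES (2, 'Example Title');
```

Either way the decision happens inside the server in a single round trip, so it stays fast as the tables grow.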