Suggest Sphinx index scheme - mysql

In a MySQL database I have documents of different types: some have text content, meta keys, and descriptions; others have code, an SKU number, size, brand name, and so on. The problem is, I have to search all of these documents and then display a single page where the results are grouped by document type, such as help page, blog post, item... It's not clear to me how to implement the Sphinx index: I want a single index to speed up queries, but since different docs have different structures, how can I group them? I was thinking about simply concatenating them, but it doesn't feel right.

Sphinx does not actually return documents, concatenated or not; it returns the primary keys of the items you have indexed (plus any stored attributes). In this snippet from a sphinx.conf, the SQL is used to build an index. When the index is subsequently searched, product.id is returned while text2search is what gets searched.
sql_query = SELECT id, CONCAT_WS( ' ', field1, field2 ) as text2search FROM product
If your documents/products reside in the same database table, this is very straightforward. You are able to retrieve and recreate your data structure on the database side when given the primary key(s) to work with.
If you are indexing items of different types in one Sphinx index, where each type is mapped to a different table, it's a little more involved; one possible approach is sketched below.
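One common approach (a sketch of my own, not taken from the original answer; the table and column names are assumptions) is to UNION the per-type queries inside sql_query, synthesize non-overlapping document IDs, and store the document type as an integer attribute:

source all_docs
{
    # ... connection settings ...
    # Two tables merged into one index: IDs are spread so they never collide
    # (original PK = id DIV 2, source table = id MOD 2).
    sql_query = \
        SELECT id * 2 + 0 AS id, 1 AS doc_type, \
               CONCAT_WS(' ', title, description) AS text2search FROM help_page \
        UNION ALL \
        SELECT id * 2 + 1 AS id, 2 AS doc_type, \
               CONCAT_WS(' ', name, sku, brand) AS text2search FROM item
    sql_attr_uint = doc_type
}

At search time you can then filter or group results by the doc_type attribute, and map each returned id back to the right table's primary key.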

Related

How to optimize MySQL indexes/query for modular search?

I have a large MyISAM table with ~200,000,000 rows of data representing events which are meant to be searchable via a web UI. The UI is a page of the "Advanced Search" variety, so the user can enter search criteria for any combination of searchable fields.
A simplified version of the table structure is something like this:
events
event_id, event_code, place_id, person_id, date, type
people
person_id, person_name
and the user is allowed to search using a required type, an optional date range, and zero or more of event_codes, place_ids, and/or person_names.
In order to optimize the table for this type of search, do I need an index on every permutation of columns, except the ones that don't include type? Or is there a more efficient way to index the table?
Currently, the table has a primary index which covers event_code, place_id, person_id, date, and type, so when you search using all fields the response is acceptable. But if you try to search using e.g. only a date range, the query essentially never returns.
MySQL uses indexes from the first field to the last (the leftmost-prefix rule), so if you are not using the earlier fields of an index, the index will not be used (as you observed with date).
I wouldn't necessarily say you need an index on every permutation, but at least a few. Since you've stated that type is required, I would start the indexes with it, and build a relatively small initial set of two-field indexes, each with type first and one of the other searchable fields second.
Also of great importance is how you form the actual search conditions. If you are applying functions to indexed fields, like DAYOFWEEK(date), no amount of indexes will help.
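As a hedged illustration of both points, using the tables from the question (the index names and date literals are made up):

ALTER TABLE events
    ADD INDEX idx_type_date (type, date),
    ADD INDEX idx_type_code (type, event_code),
    ADD INDEX idx_type_place (type, place_id),
    ADD INDEX idx_type_person (type, person_id);

-- Can use idx_type_date: both predicates are sargable
SELECT event_id FROM events
WHERE type = 'x' AND date BETWEEN '2013-01-01' AND '2013-01-31';

-- The function call hides the column, so the date part of the index is useless
SELECT event_id FROM events
WHERE type = 'x' AND DAYOFWEEK(date) = 2;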

What is the fastest way to group my records?

My site shows collections of links on different subjects. These links are divided into two types: web and images. My database will have millions (probably more than ten million) of these records. When the page loads, I need to show the user the web and image links for the particular subject of that page. So the first question is:
Do I create two separate, smaller tables, one each for the web and image links, and then make a query against each, or do I create one huge table (with the correct indexes) for both and make one query? Where will I get better performance? Should the one table and one query prove more efficient, my next question is:
What would be the most efficient way to subdivide the two types for presentation? Should I use GROUP BY, or should I use PHP to divide my result array into the two types?
TIA!
You can get similar performance using a single table for all objects, or one table each for image links and web links. If you have two separate tables, a UNION of the results would return all of the results you need.
The main reason to divide the tables is whether the types are really different (from your application's point of view). That is, if you are going to end up running a lot of queries like
select * from objects where type='image';
then it might make sense to have two tables.
Note that GROUP BY is not a way of grouping different results; it is a way of aggregating them.
So, for instance, you can use
select type, count(*) from objects group by type
to get
| image | 100000 |
| web | 2000000 |
but it will not return the objects separated. To get them "grouped", you can either use a query for each one, or use an ordering and then have logic in the application divide the results.
It's possible you'll get slightly better performance from just one table, but this decision should be primarily guided by whether the nature of data or constraints is different or not.
There is another (more important from the performance perspective) decision you'll have to make: how do you want to cluster the data (all InnoDB tables are clustered)?
If you want to have an excellent performance getting all the links of a given page, use an identifying relationship, producing a natural key in the link table(s):
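A minimal DDL sketch of such an identifying relationship (the column types and the PAGE table are assumptions):

CREATE TABLE LINK (
    PAGE_ID INT          NOT NULL,
    URL     VARCHAR(255) NOT NULL,
    PRIMARY KEY (PAGE_ID, URL),  -- InnoDB clusters the rows on this key
    FOREIGN KEY (PAGE_ID) REFERENCES PAGE (PAGE_ID)
) ENGINE = InnoDB;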
The LINK table is effectively just a single B-tree, with the page PK1 at its leading edge, which physically groups together the rows that belong to the same page. The following query can be satisfied by a simple index range scan and minimal I/O:
SELECT URL
FROM LINK
WHERE PAGE_ID = <whatever>
If you use separate tables, you can simply run two different queries. Many client APIs support executing two queries in a single database round-trip. If PHP doesn't, you can UNION the two queries to save one round-trip:
SELECT *
FROM (
    SELECT 1 AS LINK_TYPE, URL
    FROM IMAGE_LINK
    WHERE PAGE_ID = <whatever>
    UNION ALL
    SELECT 2, URL
    FROM WEB_LINK
    WHERE PAGE_ID = <whatever>
) AS t  -- MySQL requires an alias for the derived table
ORDER BY LINK_TYPE
The above query will give you...
LINK_TYPE URL
1 http://somesite.com/foo.jpeg
1 http://somesite.com/bar.jpeg
1 http://somesite.com/baz.jpeg
...
2 http://somesite.com/foo.html
2 http://somesite.com/bar.html
2 http://somesite.com/baz.html
...
...which will be very easy to separate at the client level.
If you don't use separate tables, you can then separate the URLs by their extension at the client level, or introduce an additional field in the LINK PK: {PAGE_ID, LINK_TYPE, URL}, which should make the following query very efficient:
SELECT LINK_TYPE, URL
FROM LINK
WHERE PAGE_ID = <whatever>
ORDER BY LINK_TYPE
Note that the order of fields in the PK matters: placing LINK_TYPE at the end (after URL) would mean the rows no longer come back sorted by LINK_TYPE, so the DBMS could not satisfy the ORDER BY with a simple index range scan.
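For instance, a hypothetical continuation of the LINK sketch above:

ALTER TABLE LINK
    ADD COLUMN LINK_TYPE TINYINT NOT NULL,  -- e.g. 1 = image, 2 = web
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (PAGE_ID, LINK_TYPE, URL);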
1 Whatever it may be; I just used the PAGE_ID as an example.
It depends on how close the web data is to the image data. If the data basically consists of just the link, one table fits better, with a column to differentiate between web and image (and possibly other types later, like css, js, ...):
Links: (id, link, type)
Adding an index on type, or on (type, link), will help grouping (by type) and matching searches by (type, link), as in the sketch below.
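For example (the index name is hypothetical, and assumes link is a reasonably short VARCHAR):

ALTER TABLE Links ADD INDEX idx_type_link (type, link);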
If, however, the web and image data are different enough that you don't want to mix apples and oranges, like
Web: (wid, wlink, rating, ...)
Img: (iid, ilink, width, height, mbsize, camera, datetaken, hasexif...)
then, besides the link, the two tables don't have much in common. With image links and web links being different, there is not even a "gain" in storing the same link for both kinds of data. Another advantage (which is also possible with one table, but makes more sense here) is linking both kinds of data in another table:
Relations: (wid,iid)
which lets you maintain the relation between websites and images, since an image may be used by several websites, and a website may use several images. Index on wid and on iid.
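A possible shape for that junction table (a sketch; the FK targets assume the Web and Img tables above):

CREATE TABLE Relations (
    wid INT NOT NULL,
    iid INT NOT NULL,
    PRIMARY KEY (wid, iid),   -- covers lookups from the website side
    KEY idx_iid (iid),        -- covers lookups from the image side
    FOREIGN KEY (wid) REFERENCES Web (wid),
    FOREIGN KEY (iid) REFERENCES Img (iid)
) ENGINE = InnoDB;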
My preference goes to the two tables (with optional Relations link).
Regarding queries from PHP: using a UNION, you can obtain the data from the two tables in one query.
Do I create two separate, smaller tables or one huge table?
Go for one table.
What would be the most efficient way to subdivide the two types for presentation?
It depends on the particular search criteria.

Is it good practice to consolidate small static tables in a database?

I am developing a database to store test data. Each piece of data has 11 tags of metadata. Currently I have a separate table for each of the metadata options. I have seen a few questions on here regarding best practices for numerous small tables, but I thought I'd pose the question for my own project because I didn't get a clear answer from the other questions asked.
Here is my table list, with the fields in each table:
Source Type - id, name, description
For Flight - id, name, description
Site - id, name, abrv, description
Stand - id, site (FK site table), name, abrv, description
Sensor Type - id, name, channels, description
Vehicle - id, name, abrv, description
Zone - id, vehicle (FK vehicle table), name, abrv, description
Event Type - id, name, description
Event - id, event type (FK to event type table), name, description
Analysis - id, name, description
Bandwidth - id, name, description
You can see the fields are more or less the same in each of these tables. There are three tables that reference another table.
Would it be better to have just one large table called something like Meta with the following fields:
Meta: id, metavalue, name, abrv, FK, value, description
where metavalue = one of the above table names
and FK = a reference to another row in the Meta table in place of a foreign key?
I am new to databases; multiple tables seem most intuitive to me, but one table makes the programming easier.
So questions are:
Is it good practice to reduce the number of tables and put all static values in one table?
Is it bad to have a self-referencing table?
FYI, I am making this web database using Django and MySQL on a Windows server with NTFS formatting.
Tips and best practices appreciated.
Thanks.
"Would it be better to have just one large table" - emphatically and categorically, NO!
This anti-pattern is sometimes referred to as "the one table to rule them all"!
From "Ten Common Database Design Mistakes" (the "One table to hold all domain values" mistake):

Using the data in a query is much easier.

Data can be validated using foreign key constraints very naturally, something not feasible for the other solution unless you implement ranges of keys for every table – a terrible mess to maintain.

If it turns out that you need to keep more information about a ShipViaCarrier than just the code, 'UPS', and description, 'United Parcel Service', then it is as simple as adding a column or two. You could even expand the table to be a full-blown representation of the businesses that are carriers for the item.

All of the smaller domain tables will fit on a single page of disk. This ensures a single read (and likely a single page in cache). In the other case, you might have your domain table spread across many pages, unless you cluster on the referring table name, which could then make it more costly to use a non-clustered index if you have many values.

You can still have one editor for all rows, as most domain tables will likely have the same base structure/usage. And while you would lose the ability to easily query all domain values in one query, why would you want to? (A union query over the tables could easily be created if needed, but this seems an unlikely need.)
Most of these look like they won't do anything but expand codes into descriptions. Do you even need the tables? Just define a bunch of constants, or codes, and then have a dictionary of long descriptions for the codes.
The field in the referring table just stores the code, e.g. "SRC_FOO", "EVT_BANG", etc.
This is also often known as the One True Lookup Table (OTLT) - see my old blog entry OTLT and EAV: the two big design mistakes all beginners make.

How to best get 3 prior image and 3 later image records in MySQL query?

I'll explain briefly what I want to accomplish from a functional perspective. I'm working on an image gallery. On the page that shows a single image, I want the sidebar to show thumbnails of images uploaded by the same user. At a maximum, there should be 6, 3 that were posted before the current main image, and 3 that were posted after the main image. You could see it as a stream of images by the same user through which you can navigate. I believe Flickr has a similar thing.
Technically, my image table has an autoincremented id, a user id and a date_uploaded field, amongst many other columns.
What would your advice be on how to implement such a query? Can I combine this into a single query? Are there any handy MySQL utilities that can deal with offsets and such?
PS: I prefer not to create an extra "rank" column, since that would make managing deletions difficult. Also, using the autoincrement id seems risky, I might change it for a GUID later on. Finally, I'm of course looking for a query that performs and scales.
I know I'm asking for a lot, but it seems simpler than it is.
The query could look like the following.
With a UserID+image_id index (and possibly additional fields for covering purposes), this should perform relatively well.
SELECT field1, field2, whatever
FROM myTable
WHERE UserID = some_id
-- AND image_id > id_of_the_previously_first_image
ORDER BY image_id
LIMIT 7;
To help with scaling, you should consider using a bigger LIMIT value and cache accordingly.
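If you want exactly 3 before and 3 after the image currently shown, a hedged variant combines two range scans in one round-trip (thumb_url and current_image_id are placeholders, not names from the question):

(SELECT image_id, thumb_url
 FROM myTable
 WHERE UserID = some_id AND image_id < current_image_id
 ORDER BY image_id DESC
 LIMIT 3)
UNION ALL
(SELECT image_id, thumb_url
 FROM myTable
 WHERE UserID = some_id AND image_id > current_image_id
 ORDER BY image_id ASC
 LIMIT 3)
ORDER BY image_id;

Both halves can use the same (UserID, image_id) index.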
Edit (answering remarks/questions):
The combined index...
is made of several fields, specifically
CREATE [UNIQUE] INDEX UserId_Image_id_idx
ON myTable (UserId, image_id [, field1 ...] )
Note that the optional elements of this statement are in brackets ([]). I would assume the UNIQUE constraint would be a good thing. The additional "covering" fields (field1, ...) may be beneficial, but that depends on the "width" of those fields as well as on the overall setup and usage patterns (since [large] indexes slow down INSERTs/UPDATEs/DELETEs, one may wish to limit the number and size of such indexes, etc.).
Such an index's data "type" is neither numeric nor string; it is simply made of the individual data types. For example, if UserId is VARCHAR(10) and Image_id is INT, the resulting index would use these two types for the underlying search criteria, i.e.
... WHERE UserId = 'JohnDoe' AND image_id > 12389
in other words one needn't combine these criteria into a single key.
On image_id
when you say image_id, you mean the combined user/image id, right?
No, I mean only image_id. I'm assuming this field is a separate field in the table. The UserID is taken care of in the other predicate of the WHERE clause.
The original question's write-up indicates that this field is auto-generated, and I'm assuming we can rely on it for sorting purposes. Alternatively, we could rely on other fields, such as the timestamp of when the image was uploaded.
Also, as an afterthought: whether ordering by a [monotonically increasing] Image_id or by the Timestamp_of_upload, we may want to use DESC order, to show the latest "stuff" first.

Generate number id from text/url for fast "SELECT"

I have the following problem:
I have a feed capturer that captures news from different sources every half an hour.
I only insert entries whose URLs are not already in the database (the URL is used to check whether the record is already in the database).
Even with that, I get some repeated entries, because some sites report the same news (usually items that come from a news agency like Reuters). I could look for these repeated entries during insertion, but I think this would slow the insertion down even more.
So, I can later find these repeated entries by the title. But I think this search is slow. Then, my idea is to generate a numeric field from the title and then search by this number for repeated titles.
What kind of encoding could I use (I thought of something like the reverse of base64) to encode the titles?
I'm supposing that searching for repeated numbers is a lot faster than searching for repeated words. Is that true or not?
Do you suggest a better solution for this problem?
Well, I don't mind having the repeated entries in the database; I just don't want to show them to the user – like Google, which filters out repeated results but shows them if you want.
I hope I explained it well. Thanks in advance.
Store the MD5 hashes of the URL and title and build a UNIQUE index on them:
CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)
INSERT
INTO mytable (url, title, url_hash, title_hash)
VALUES ('url', 'title', MD5('url'), MD5('title'))
To select like Google (one result per title), use this query:
SELECT *
FROM (
    SELECT DISTINCT title_hash
    FROM mytable
) md
JOIN mytable mo
    ON mo.title_hash = md.title_hash
    AND mo.url_hash =
    (
        SELECT url_hash
        FROM mytable mi
        WHERE mi.title_hash = md.title_hash
        ORDER BY mi.title_hash, mi.url_hash
        LIMIT 1
    )
You can use a new table containing only the encoded keys based on the title and URL; you then have to add a key on it to accelerate searches. But I don't think you can find an efficient algorithm to transform strings to numbers.
For the hashing, use
SELECT MD5(CONCAT('title', 'url'));
Before every insertion, test whether the encoded concatenation of the title and URL already exists in this table, as sketched below.
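A minimal sketch of that check, assuming a hypothetical seen_hashes table holding the combined hash:

SELECT 1
FROM seen_hashes
WHERE hash = MD5(CONCAT('some title', 'http://example.com/article'))
LIMIT 1;

Alternatively, with a UNIQUE index on hash, an INSERT IGNORE skips duplicates without the extra round-trip.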
@Quassnoi can explain this better than I can, but I think there is no visible difference in performance between using a VARCHAR/CHAR and an INT in an index for later GROUPing or other ways of finding the duplicates. That way you could use the solution he proposed, but with a normal INDEX instead of a UNIQUE index, and keep the duplicates in the database, filtering them out only when showing results to users.