How to optimize MySQL indexes/query for modular search?

I have a large MyISAM table with ~200,000,000 rows of data representing events which are meant to be searchable via a web UI. The UI is a page of the "Advanced Search" variety, so the user can enter search criteria for any combination of searchable fields.
A simplified version of the table structure is something like this:
events
event_id, event_code, place_id, person_id, date, type
people
person_id, person_name
and the user is allowed to search using a required type, an optional date range, and zero or more of event_codes, place_ids, and/or person_names.
In order to optimize the table for this type of search, do I need an index on every permutation of columns, except the ones that don't include type? Or is there a more efficient way to index the table?
Currently, the table has a primary index which covers event_code, place_id, person_id, date, and type, so when you search using all fields the response is acceptable. But if you try to search using e.g. only a date range, the query essentially never returns.

MySQL uses indexes from the first field to the last (the leftmost prefix), so if your query does not use the earlier fields of an index, that index will not be used (as you observed with date).
I wouldn't necessarily say you need an index on every permutation, but at least a few. Since you've stated that type is required, I would start the indexes with that and build a relatively small initial set of two-column indexes, each with one of the other searchable fields as the second column.
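A sketch of such a set, reusing the simplified column names from the question (the index names are made up):
ALTER TABLE events
  ADD INDEX idx_type_date (type, `date`),
  ADD INDEX idx_type_event_code (type, event_code),
  ADD INDEX idx_type_place (type, place_id),
  ADD INDEX idx_type_person (type, person_id);
MySQL can then pick whichever of these matches the extra criteria the user supplied, and even a search on type plus only a date range can use idx_type_date.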
Also of great importance is how you form the actual search conditions. If you are performing operations on indexed fields, like DAYOFWEEK(date), no amount of indexing will help.
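For example, a year filter written with a function on the column cannot use the (type, date) index for the date part, while the equivalent range condition can (the values are only illustrative; a DAYOFWEEK condition has no such simple range rewrite, but the principle of keeping the indexed column bare on one side of the comparison still applies):
SELECT event_id FROM events WHERE type = 'X' AND YEAR(`date`) = 2012;        -- date part of the index unusable
SELECT event_id FROM events WHERE type = 'X' AND `date` >= '2012-01-01' AND `date` < '2013-01-01';  -- index usable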

Related

Making unique records in MS Access

I have 1.7 million records in an Access table, sorted A to Z. The records are not unique and there are repeated records. I want to make them unique based on their frequency: if a record has been repeated 4 times, I want the first one to get "-1" appended to its value, the second to get "-2", and so on. In this way similar records will become unique. All similar records are beside each other because of the sorting. In Excel I do this task with an IF formula (if this cell value <> the cell value above, then "1", else the repeat number above plus 1), but in Access I don't know what to do (I'm a beginner).
Finally, I want to add a column to the original table which is (original record value - repeat number).
I appreciate your help.
Note about sort order:
Sort order in a relational database is not concrete like in a spreadsheet. There is no concept of rows being "next to each other", unless in context of an index. An index is largely a tool for the database to handle the data more efficiently (and to aid in defining uniqueness). The order itself is still largely dynamic because the order of a particular query can be specified differently from the index (or from storage order) and this does not change how the data is actually stored. Being "next to each other" is essentially a useless concept in SQL queries, unless you mean "next to each other numerically", for instance with an AutoNumber field or with the "repeat numbers" you want to add. Unlike in a spreadsheet, you cannot refer to the row "just above this row" or the "row offset by 2 from the 'current' row".
Solution
Regardless of whether or not you will use the AutoNumber column later, add a Long Integer AutoNumber column anyway. This column is named [ID] in the example code. Why? Because until you add something that allows the database to differentiate between the rows, there is technically no way using standard SQL to reliably reference individual duplicates, since there is no way to distinguish individual rows. Even though you say that there are other differentiating columns, your own description rules out using them as a reliable key for referring to specific rows. (Even without such a differentiating column, Access can technically distinguish between rows. Iterating through a DAO.Recordset object in VBA would work, but it would perhaps not be very elegant or efficient.)
Also add a new integer column for counting repeats, which below is named [DupeIndex]. A separate field is preferred (necessary?) because it allows continued reference to the original, unaltered duplicate values. If the reference number were directly updated, it would no longer match other fields and so would not be easily detected as a duplicate anymore. The following solution relies on grouping of ALL duplicate values, even those already "marked" with a [DupeIndex] number.
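A minimal sketch of the two new columns in Access DDL, assuming the table is named [Data] as in the queries below (they can equally be added in the table designer):
ALTER TABLE Data ADD COLUMN ID COUNTER;
ALTER TABLE Data ADD COLUMN DupeIndex LONG;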
You should also realize that, when comparing different data sets, having separate fields allows more flexibility in matching the data. Having the values appended to the reference number complicates comparison, since you likely do not only want to compare rows with the same duplication index; rather, you will want to compare all possible combinations. For example, comparing record 123-1 in one set to 123-4 in another... how do you select such rows in an automated fashion? You don't want to have to manually code all combinations, but that's what you'll end up doing if you don't keep them separate like {123,1} and {123,4}.
Create and save this as a named query [Duplicates]. This query is referenced by later queries. It could instead be embedded as a subquery, but my preference is to use saved queries for easier visualization and debugging in Access:
SELECT Data.RefNo, Count(Data.ID) AS Dupes, Max(Data.DupeIndex) AS IndexMax
FROM Data
GROUP BY Data.RefNo
HAVING Count(Data.ID) > 1
Execute the following to create a temporary table with new duplicate index values:
SELECT D1.ID, D1.RefNo,
IIf([Duplicates].[IndexMax] Is Null,0,[Duplicates].[IndexMax])
+ 1
+ (SELECT Count(D2.ID) FROM Data As D2
WHERE D2.[RefNo]=[D1].[RefNo]
And [D2].[DupeIndex] Is Null
And [D2].[ID]<[D1].[ID]) AS NewIndex
INTO TempIndices
FROM Data AS D1 INNER JOIN Duplicates ON D1.RefNo = Duplicates.RefNo
WHERE (D1.DupeIndex Is Null);
Execute the update query to set the new duplicate index values:
UPDATE Data
INNER JOIN TempIndices ON Data.ID = TempIndices.ID
SET Data.DupeIndex = [NewIndex]
Optionally remove the AutoNumber field and now assign the combined [RefNo] and new [DupeIndex] as primary key. The temporary table can also be deleted.
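A sketch of that cleanup step, again assuming the table is named [Data] (note that rows that were never duplicated still have a Null [DupeIndex], and primary key columns cannot be Null, so those rows are set to 0 first):
UPDATE Data SET DupeIndex = 0 WHERE DupeIndex Is Null;
ALTER TABLE Data DROP COLUMN ID;
ALTER TABLE Data ADD CONSTRAINT pk_RefNo_DupeIndex PRIMARY KEY (RefNo, DupeIndex);
DROP TABLE TempIndices;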
Comments about the queries:
The solution assumes that [DupeIndex] is Null for unprocessed duplicates.
The solution correctly handles existing duplicate index numbers, only updating duplicate rows without a unique index.
Access has rather strict conditions for UPDATE queries, namely that updates must not be based on circular references and that joins must not produce multiple updates for the same row, etc. The temporary table is necessary in this case, since the query determining the new index values refers multiple times in subqueries to the very column that is being updated. (If the update is attempted using joins on the subqueries, for example, Access complains that the "Operation must use an updatable query".)

How to search either on id or name for certain purchase orders

We would like to filter purchase orders based on either the purchase order id (primary key) or the name of the purchase order, using a single search box.
We used the LIKE operator to search on the name field, but it doesn't seem to work on the primary key. It works only when we use the equality operator for id(s). But it would be preferable if we could filter purchase orders using LIKE for id(s) as well. How can we do this?
create table purchase_orders (
id int(11) primary key,
name varchar(255),
...
)
Option 1
SELECT *
FROM purchase_orders
WHERE id LIKE '%123%'; -- tribute to TemporaryNickName
This is horrible, performance-wise :)
Option 2a
Add a text column which receives a string version of id. Maybe add some triggers to populate it automatically.
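A sketch of that idea, with hypothetical column and trigger names (note that if id were generated with AUTO_INCREMENT, its value is not yet known in a BEFORE INSERT trigger, so the column would have to be filled in afterwards instead):
ALTER TABLE purchase_orders ADD COLUMN id_text VARCHAR(11);
UPDATE purchase_orders SET id_text = CAST(id AS CHAR);
CREATE TRIGGER purchase_orders_bi BEFORE INSERT ON purchase_orders
FOR EACH ROW SET NEW.id_text = CAST(NEW.id AS CHAR);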
Option 2b
Change the type of id column to CHAR or VARCHAR (I believe CHAR should be preferred for a primary key).
In both 2a. and 2b. cases, add an index (maybe a FULLTEXT one) to this column.
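For instance (the index name is made up):
CREATE INDEX ix_purchase_orders_id_text ON purchase_orders (id_text);
Keep in mind that a pattern with a leading wildcard, such as LIKE '%123%', still cannot use an ordinary B-tree index; only patterns anchored at the start (LIKE '123%') can.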
I think LIKE should work. I assume that your SQL wasn't written correctly.
Let's assume that you have an order named "ABCDEF"; then you can find it using the following query structure.
SELECT id FROM purchase_orders WHERE name LIKE '%CD%';
To explain it, the % sign is a wildcard. As a result, this query will select any string that contains "CD" inside it.
According to the table structure, the name varchar can contain 255 characters. I think this is quite a large string, and searching it using SQL functions like LIKE is probably going to consume a lot of resources and take more time. You can always search by id with WHERE id = something, which is a much faster way by the way, but I don't think the order id is user-friendly data; instead, I would let users search by the product name. My recommendation is to use Apache Lucene or MySQL's full-text search feature (which can improve search performance).
Apache Lucene
MySQL full-text search function
These are tools built to search for a certain pattern or word through a large list of strings in a much faster way. Many websites use them to build their own mini search engines. I found that MySQL's full-text search function requires pretty much no learning curve and is straightforward to use =D
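As a rough sketch of the MySQL full-text route (FULLTEXT indexes work on MyISAM tables, and on InnoDB from MySQL 5.6; note that full-text search matches whole words rather than arbitrary substrings like LIKE '%CD%'):
ALTER TABLE purchase_orders ADD FULLTEXT INDEX ft_po_name (name);
SELECT id, name FROM purchase_orders WHERE MATCH(name) AGAINST('ABCDEF');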

Suggest Sphinx index scheme

In a MySQL database I have documents of different types: some have text content, meta keys, and descriptions; others have code, an SKU number, a size, a brand name, and so on. The problem is, I have to search across all of these documents and then display a single page where the results are grouped by document type, such as help page, blog post, item... It's not clear to me how to implement the Sphinx index: I want to have a single index to speed up queries, but since different docs have different structures, how can I group them? I was thinking about just concatenating them, but it just doesn't feel right.
Sphinx does not actually return documents, concatenated or not; it returns the primary keys of the items or attributes you have indexed. In this snippet from a sphinx.conf, the SQL is used to build an index. When the index is subsequently searched, product.id will be returned whilst text2search will be searched.
sql_query = SELECT id, CONCAT_WS( ' ', field1, field2 ) as text2search FROM product
If your documents/products reside in the same database table, this is very straightforward. You are able to retrieve and recreate your data structure on the database side when given the primary key(s) to work with.
If you are indexing items of different types in one sphinx index when each type is mapped to a different table, it's a little more involved.
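As a rough sketch of that more involved case (all table names, column names, and ID offsets below are hypothetical), one common approach is to UNION the different tables into a single source and store the document type as an attribute, so the application can group results per type:
sql_query = \
    SELECT id, 1 AS doc_type, CONCAT_WS(' ', title, body) AS text2search FROM help_pages \
    UNION ALL SELECT id + 1000000, 2, CONCAT_WS(' ', title, content) FROM blog_posts \
    UNION ALL SELECT id + 2000000, 3, CONCAT_WS(' ', name, sku, brand) FROM items
sql_attr_uint = doc_type
The offsets keep document IDs unique across the tables, and the doc_type attribute returned with each match tells you which table (and hence which result group: help page, blog post, item) the primary key belongs to.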

How to best get 3 prior image and 3 later image records in MySQL query?

I'll explain briefly what I want to accomplish from a functional perspective. I'm working on an image gallery. On the page that shows a single image, I want the sidebar to show thumbnails of images uploaded by the same user. At a maximum, there should be 6, 3 that were posted before the current main image, and 3 that were posted after the main image. You could see it as a stream of images by the same user through which you can navigate. I believe Flickr has a similar thing.
Technically, my image table has an autoincremented id, a user id and a date_uploaded field, amongst many other columns.
What would your advice be on how to implement such a query? Can I combine this in a single query? Are there any handy MySQL utilities that can deal with offsets and such?
PS: I prefer not to create an extra "rank" column, since that would make managing deletions difficult. Also, using the autoincrement id seems risky; I might change it to a GUID later on. Finally, I'm of course looking for a query that performs and scales.
I know I'm asking for a lot, but it seems like it should be simpler than it is.
The query could look like the following.
With a UserID+image_id index (and possibly additional fields for covering purposes), this should perform relatively well.
SELECT field1, field2, whatever
FROM myTable
WHERE UserID = some_id
-- AND image_id > id_of_the_previously_first_image
ORDER BY image_id
LIMIT 7;
To help with scaling, you should consider using a bigger LIMIT value and cache accordingly.
Edit (answering remarks/questions):
The combined index...
is made of several fields, specifically
CREATE [UNIQUE] INDEX UserId_Image_id_idx
ON myTable (UserId, image_id [, field1 ...] )
Note that optional elements of this query are in brackets ([]). I would assume the UNIQUE constraint would be a good thing. The additional "covering" fields (field1, ...) may be beneficial, but that would depend on the "width" of such additional fields as well as on the overall setup and usage patterns (since [large] indexes slow down INSERTs/UPDATEs/DELETEs, one may wish to limit the number and size of such indexes, etc.)
Such an index's data "type" is neither numeric nor string; it is simply made up of the individual data types of its columns. For example, if UserId is VARCHAR(10) and Image_id is INT, the resulting index would use these two types for the underlying search criteria, i.e.
... WHERE UserId = 'JohnDoe' AND image_id > 12389
in other words one needn't combine these criteria into a single key.
On image_id
when you say image_id, you mean the combined user/image id, right?
No, I mean only image_id. I'm assuming this field is a separate field in the table. The UserID is taken care of in the other predicate of the WHERE clause.
The original question write up indicates that this field is auto-generated, and I'm assuming we can rely on this field for sorting purposes. Alternatively we could rely on other fields such as the timestamp when the image was uploaded and such.
Also, an afterthought, whether ordered by a [monotonically increasing] Image_id or by the Timestamp_of_upload, we may want to use a DESC order, to show the latest "stuff" first.
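For example, the earlier query with the newest images first would simply become (still illustrative only):
SELECT field1, field2, whatever
FROM myTable
WHERE UserID = some_id
ORDER BY image_id DESC
LIMIT 7;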

Generate number id from text/url for fast "SELECT"

I have the following problem:
I have a feed capturer that captures news from different sources every half an hour.
I only insert entries whose URLs are not already in the database (the URL is used to check whether the record is already there).
Even with that, I get some repeated entries, because some sites report the same news (which usually comes from a news source like Reuters). I could look for these repeated entries during insertion, but I think this would slow the insertion down even more.
So I can later find these repeated entries by the title, but I think that search is slow. My idea, then, is to generate a numeric field from the title and search by this number for repeated titles.
What kind of encoding could I use (I was thinking of something like a reverse of base64) to encode the titles?
I'm supposing that searching for repeated numbers is a lot faster than searching for repeated words. Is that true or not?
Can you suggest a better solution for this problem?
Well, I don't mind having the repeated entries in the database; I just don't want to show them to the user. Like Google, which filters out repeated results but shows them if you want.
I hope I explained it well. Thanks in advance.
Store the MD5 hashes of the URL and the title and build a UNIQUE index on them:
CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)
INSERT
INTO mytable (url, title, url_hash, title_hash)
VALUES ('url', 'title', MD5('url'), MD5('title'))
To select like Google (one result per title), use this query:
SELECT *
FROM (
SELECT DISTINCT title_hash
FROM mytable
) md
JOIN mytable mo
ON mo.title_hash = md.title_hash
AND mo.url_hash =
(
SELECT url_hash
FROM mytable mi
WHERE mi.title_hash = md.title_hash
ORDER BY
mi.title_hash, mi.url_hash
LIMIT 1
)
So you can use a new table containing only the encoded keys based on title and url; you then have to add a key on it to accelerate the search. But I don't think you can use an efficient algorithm to transform strings into numbers.
For the hashing, use
SELECT MD5(CONCAT('title', 'url'));
and before every insertion, test whether the encoded concatenation of title and url already exists in this table.
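A minimal sketch of that pre-insert check, assuming a lookup table named title_url_keys with a single indexed column named hash (both names are made up):
SELECT COUNT(*) FROM title_url_keys
WHERE hash = MD5(CONCAT('some title', 'http://example.com/article'));
If the count is zero, insert the new row together with its hash; otherwise skip it or flag it as a duplicate.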
@Quassnoi can explain it better than I can, but I think there is no visible difference in performance whether you use a VARCHAR/CHAR or an INT in an index used later for GROUPing or some other method of finding the duplicates. That way you could use the solution he proposed, but with a normal INDEX instead of a UNIQUE index, and keep the duplicates in the database, filtering them out only when showing results to users.