Generate number id from text/url for fast "SELECT" - mysql

I have the following problem:
I have a feed capturer that captures news from different sources every half an hour.
I only insert entries that don't have their URLs already in the database (URL is used to see if the record is already in database).
Even with that, I get some repeated entries, because some sites report the same news (usually from a news source like Reuters). I could look for these repeated entries during insertion, but I think this would slow the insertion even more.
So, I can find these repeated entries later by title. But I think this search is slow. My idea is to generate a numeric field from the title and then search by this number for repeated titles.
What kind of encoding could I use to turn the titles into numbers (I thought of something like the reverse of base64)?
I'm supposing that searching for repeated numbers is a lot faster than searching for repeated words. Is that true or not?
Do you suggest a better solution for this problem?
Well, I don't mind having the repeated entries in the database, I just don't want to show them to the user. Like Google, which filters out the repeated results but shows them if you want.
I hope I explained It well. Thanks in advance.

Store the MD5 hashes of the URL and the title in their own columns and build a UNIQUE index on them:
CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)
INSERT
INTO mytable (url, title, url_hash, title_hash)
VALUES ('url', 'title', MD5('url'), MD5('title'))
To select like Google (one result per title), use this query:
SELECT *
FROM (
SELECT DISTINCT title_hash
FROM mytable
) md
JOIN mytable mo
ON mo.title_hash = md.title_hash
AND mo.url_hash =
(
SELECT url_hash
FROM mytable mi
WHERE mi.title_hash = md.title_hash
ORDER BY
mi.title_hash, mi.url_hash
LIMIT 1
)
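As a rough, runnable illustration of the hash-plus-unique-index idea above, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL (an assumption made only so the example is self-contained), so the duplicate-skipping insert is spelled INSERT OR IGNORE rather than MySQL's INSERT IGNORE. The helper names md5_hex, build_db and one_row_per_title, the sample URLs and the MIN(url_hash) shortcut for picking one row per title are all illustrative, not part of the answer.

```python
import hashlib
import sqlite3


def md5_hex(text):
    """Return the MD5 hex digest of a string (same value MySQL's MD5() gives for UTF-8 text)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_db(entries):
    """Create the table and unique index, then insert entries, skipping exact duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (url TEXT, title TEXT, url_hash TEXT, title_hash TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)"
    )
    for url, title in entries:
        # The unique index rejects a repeated (title, url) pair; OR IGNORE skips it silently.
        conn.execute(
            "INSERT OR IGNORE INTO mytable (url, title, url_hash, title_hash) "
            "VALUES (?, ?, ?, ?)",
            (url, title, md5_hex(url), md5_hex(title)),
        )
    return conn


def one_row_per_title(conn):
    """Return one (url, title) row per distinct title_hash: the row with the smallest url_hash."""
    return conn.execute(
        """
        SELECT mo.url, mo.title
        FROM mytable mo
        WHERE mo.url_hash = (
            SELECT MIN(mi.url_hash)
            FROM mytable mi
            WHERE mi.title_hash = mo.title_hash
        )
        ORDER BY mo.title
        """
    ).fetchall()


entries = [
    ("http://a.example/1", "Markets rally"),
    ("http://b.example/9", "Markets rally"),  # same story on another site
    ("http://a.example/1", "Markets rally"),  # exact repeat, ignored on insert
    ("http://c.example/5", "Rain expected"),
]
conn = build_db(entries)
rows = one_row_per_title(conn)
```

All same-titled stories stay in the table, and only the display query collapses them, which matches the "filter like Google" requirement in the question.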

You can use a new table containing only the hashed keys based on title and URL, and add a key on it to speed up the search. But I don't think there is an efficient algorithm to transform strings into numbers.
For the hashing, use
SELECT MD5(CONCAT('title', 'url'));
and before every insertion, test whether the hash of the concatenated title and URL already exists in this table.

@Quassnoi can explain better than I can, but I think there is no visible difference in performance whether you use a VARCHAR/CHAR or an INT in an index that is later used for GROUP BY or another method of finding the duplicates. So you could use his solution with a normal INDEX instead of a UNIQUE index, keep the duplicates in the database, and filter them out only when showing results to users.

Related

Count the number of times keywords (in a table) appear in a field in another table

I will simplify my problem in order to explain it.
I have a table which contains text messages posted by users and another table which contains keywords.
I want to display, for each user, the number of times keywords are found in text messages.
I don't want the result to display a keyword if it's not found in text messages.
I want it to be case INSENSITIVE. All keywords are lowercase, but messages can contain both lower and upper case characters.
Because I'm not sure that my explanation is clear enough, here comes the SQLFiddle.
http://sqlfiddle.com/#!2/c402a
Hope anyone can help me.
I found what I was looking for. It wasn't easy for me but here is my query :
SELECT t_msg.msg_usr,
t_list.list_word,
count(t_list.list_word),
t_msg.msg_text
FROM t_msg
INNER JOIN t_list
ON LOWER(t_msg.msg_text) LIKE CONCAT("%", t_list.list_word, "%")
GROUP BY t_msg.msg_usr, t_list.list_word;
The SQLFiddle is there : http://sqlfiddle.com/#!2/ba052/8
The recommendation would be not to try solving this with a query. It's possible to write a query that does it, but such a query will scan the messages table for each keyword separately and produce a count (or a row that you can group by). This won't scale, and it won't be reliable as a language-aware search.
Here is what you might want to do:
Create a table to map (user_id, keyword_id) to a count of this keyword in messages of this user. Let's call it t_keyword_count.
Each time you receive a message, before you save the message into the database, search it for all the keywords you care about (using whatever good text search libraries that account for misspellings, etc.). You should know the (user_id) for this message.
You will, at that point, be ready to add the message to the database, and will have an array of (keyword_id) with keywords that this message will have.
In a transaction, insert the message into the t_msg table, and run update/insert for (user_id,keyword_id) to have value=value+1 (or +n, if you need to count the same keyword more than once in the same message) for the t_keyword_count table.
If you are trying to solve the problem of having to do the above on existing data, you can do this manually, just to build up that t_keyword_count table first (depends on how many keywords you have in total, but even if there are a lot, this can be scripted). But you should change (or mirror) the t_msg.msg_text field to be a field suitable for text search, and use SQL text search functionality to find the keywords.
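The counting-table approach described above could be sketched roughly as follows in Python, with an in-memory SQLite database standing in for MySQL. The KEYWORDS mapping, the cnt column name, the helper names and the simple whole-word tokenizer are assumptions for illustration; a real system would use a proper text search library as the answer suggests. Note that whole-word matching counts differently from the LIKE '%word%' query earlier in the thread.

```python
import re
import sqlite3

# Hypothetical keyword table contents: keyword_id -> lowercase keyword.
KEYWORDS = {1: "hello", 2: "world"}


def count_keywords(text):
    """Return {keyword_id: occurrences} for keywords found in text, case-insensitively."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = {}
    for keyword_id, keyword in KEYWORDS.items():
        n = words.count(keyword)
        if n:
            counts[keyword_id] = n
    return counts


def save_message(conn, user_id, text):
    """Insert the message and bump the per-user keyword counters in one transaction."""
    counts = count_keywords(text)
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT INTO t_msg (msg_usr, msg_text) VALUES (?, ?)", (user_id, text)
        )
        for keyword_id, n in counts.items():
            cur = conn.execute(
                "UPDATE t_keyword_count SET cnt = cnt + ? "
                "WHERE user_id = ? AND keyword_id = ?",
                (n, user_id, keyword_id),
            )
            if cur.rowcount == 0:  # no counter row yet for this pair
                conn.execute(
                    "INSERT INTO t_keyword_count (user_id, keyword_id, cnt) "
                    "VALUES (?, ?, ?)",
                    (user_id, keyword_id, n),
                )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_msg (msg_usr INTEGER, msg_text TEXT)")
conn.execute(
    "CREATE TABLE t_keyword_count ("
    "user_id INTEGER, keyword_id INTEGER, cnt INTEGER, "
    "PRIMARY KEY (user_id, keyword_id))"
)
save_message(conn, 7, "Hello world, HELLO again")
save_message(conn, 7, "world")
totals = conn.execute(
    "SELECT user_id, keyword_id, cnt FROM t_keyword_count ORDER BY keyword_id"
).fetchall()
```

Reading the counts for display is then a plain lookup on t_keyword_count, with no scan of the messages table.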

Optimized SELECT query in MySQL

I have a very large number of rows in my table, table_1. Sometimes I just need to retrieve a particular row.
I assume, when I use SELECT query with WHERE clause, it loops through the very first row until it matches my requirement.
Is there any way to make the query jump to a particular row and then start from that row?
Example:
Suppose there are 50,000,000 rows and the id which I want to search for is 53750. What I need is: the search can start from 50000 so that it can save time for searching 49999 rows.
I don't know the exact term since I am not expert of SQL!
You need to create an index : http://dev.mysql.com/doc/refman/5.1/en/create-index.html
ALTER TABLE table_1 ADD UNIQUE INDEX (id);
The way I understand it, you want to select a row with id 53750. If you have a field named id you could do this:
SELECT * FROM table_1 WHERE id = 53750
Along with indexing the id field, that's the fastest way to do so, as far as I know.
ALTER TABLE table_1 ADD UNIQUE INDEX (<column>)
would be a great first step if the index has not been generated automatically. You can also use:
EXPLAIN <your query here>
to see which kind of query works best in this case. Note that if you later change the WHERE clause to filter on another column that keeps recurring in your queries, it is a good idea to put an index on that column as well.
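To see the effect of an index on the query plan without a MySQL server at hand, here is a small Python sketch using SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN (the output format differs between the two, and the plan helper, the payload column and the row count are assumptions for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?)",
    ((i, f"row {i}") for i in range(1, 60001)),
)


def plan(sql):
    """Return the planner's description of how it would execute sql."""
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM table_1 WHERE id = 53750"

plan_before = plan(query)  # without an index: a scan of the whole table
conn.execute("CREATE UNIQUE INDEX index_1 ON table_1 (id)")
plan_after = plan(query)   # with the index: a direct search through index_1

row = conn.execute(query).fetchone()
print(plan_before)
print(plan_after)
```

The query text is identical in both cases; only the presence of the index changes how the engine locates the row.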
Create an index on the column you want to do the SELECT on:
CREATE INDEX index_1 ON table_1 (id);
Then, select the row just like you would before.
But also, please read up on databases, database design and optimization. Your question is full of false assumptions. Don't just copy and paste our answers verbatim. Get educated!
There are several things to know about optimizing SELECT queries, such as range and WHERE clause optimization. The documentation is pretty informative about this; read the section "Optimizing SELECT Statements". Creating an index on the column you evaluate also helps performance a lot.
One possible solution: you can create a view and then query the view. Here are the details of creating a view and obtaining data from it:
http://www.w3schools.com/sql/sql_view.asp
Now you just split that huge number of rows into many views (i.e. rows 1-10000 in one view, then 10001-20000 in another view),
then query from the appropriate view.
I am pretty sure that any self-respecting SQL database does not start looping from the first row to get the desired row. But I am not sure exactly how they make it work, so I can't give an exact answer.
You could check what's in your WHERE clause and how the table is indexed. Do you have a proper primary key, for example one using a numeric data type? Do you have indexes on the other columns that are used in your queries?
There is also a lot to consider when installing the database server, like where to put the data and log files, how much memory to give the server, and how to configure file growth. There's a lot you can do to tune your server.
You could try and split your tables in partitions
More about alter tables to add partitions
Selecting from a specific partition
In your case you could create a partition on ID for every 50,000 rows, and when you want to skip the first 50,000 you just select from partition 2. How to do this is explained quite well in the MySQL documentation.
You may try something as simple as this:
query = "SELECT * FROM tblname LIMIT 50000, 1"
I just tried it with phpMyAdmin. Here 50000 is the offset of the starting row to look up, and the second number is how many rows to return.
EDIT:
But if I were you I wouldn't use this one, because it skips records 1 to 49999, so they are left out of the search.

MySQL 5.5 Database design. Problem with friendly URLs approach

I have a maybe stupid question but I need to ask it :-)
My Friendly URL (furl) database design approach is fairly summarized in the following diagram (InnoDB at MySQL 5.5 used)
Each item will generate as many furls as languages available on the website. The furl_redirect table represents the controller path for each item. I show you an example:
item.id = 1000
item.title = 'Example title'
furl_redirect = 'item/1000'
furl.url = 'en/example-title-1000'
furl.url = 'es/example-title-1000'
furl.url = 'it/example-title-1000'
When you insert a new item, its furl_redirect and furls must also be inserted. The problem appears because of the (necessary) unique constraint in the furl table. As you see above, in order to get unique URLs, I use the title of the item (which is not necessarily unique) plus the id to create the unique URL. That means the order of inserting rows should be as follows:
1. Insert item -- (and get the id of the newly inserted item) ERROR!! furl_redirect_id must not be null!!
2. Insert furl_redirect -- (needs the item id to create the path)
3. Insert furl -- (needs the item id to create the URL)
I would like an elegant solution to this problem, but I can not get it!
Is there a way of getting the next AutoIncrement value on an InnoDB Table?, and is it recommended to use it?
Can you think of another way to ensure the uniqueness of the friendly urls that is independent of the items' id? Am I missing something crucial?
Any solution is welcome!
Thanks!
You can get an auto-increment in InnoDB, see here. Whether you should use it or not depends on what kind of throughput you need and can achieve. Any auto-increment/identity type column, when used as a primary key, can create a "hot spot" which can limit performance.
Another option would be to use an alphanumeric ID, like bit.ly or other URL shorteners. The advantage of these is that you can have short IDs that use base 36 (a-z plus 0-9) instead of base 10. Why is this important? Because you can use a random number generator to pick a number out of a fairly big domain: 6 characters gets you about 2.2 billion combinations (36^6). You convert the number to base 36, and then check whether you have already assigned it. If not, you have your new ID and off you go; otherwise generate a new random number. This helps to avoid hot spots if that turns out to be necessary for your system. Auto-increment is easier, and I'd try that first to see if it works under the loads that you're anticipating.
You could also use the base 36 ID and the auto-increment together so that your friendly URLs are shorter, which is often the point.
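A minimal Python sketch of the random base 36 ID idea follows. The names to_base36 and new_id are hypothetical, and the in-memory set called taken stands in for what would be a lookup against a unique index on the ID column in a real database.

```python
import random
import string

ALPHABET = string.digits + string.ascii_lowercase  # "0-9" then "a-z": 36 symbols


def to_base36(n, width=6):
    """Encode a non-negative integer below 36**width as a zero-padded base 36 string."""
    digits = []
    for _ in range(width):
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def new_id(taken, rng=random, width=6):
    """Draw random IDs until one is unused, record it in taken, and return it."""
    while True:
        candidate = to_base36(rng.randrange(36 ** width), width)
        if candidate not in taken:  # in a database this would be a unique-key check
            taken.add(candidate)
            return candidate


taken = set()
ids = [new_id(taken) for _ in range(1000)]
```

With 36**6 (2,176,782,336) possible values, retries are rare until a sizeable fraction of the space has been used.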
You might consider other ways to approach your project.
First, you are using "en/", "de/", etc. for changing the language. May I ask how that works in the script? If you have different folders for different languages, your script and your users must suffer a lot. Try to use gettext or another localisation method (depending on the size of your project).
About the friendly url's. My favorite method is to have only one extra column in item's table. For example:
Table picture
id, path, title, alias, created
Values:
1, uploads/pics/mypicture.jpg, Great holidays, great-holidays, 2011-11-11 11:11:11
2, uploads/pics/anotherpic.jpg, Great holidays, great-holidays-1, 2011-12-12 12:12:12
Now in the script, while inserting the item, create the alias from the title and check whether the alias already exists. If it does, you can append the id, a random number, or a count (depending on how many identical titles you already have).
After you store the alias like this, it's very simple. The user tries to access
http://www.mywebsite.com/picture/great-holidays
So in your script you just see that the user wants a picture, namely the picture with the alias great-holidays. Find it in the DB and show it.
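The alias step can be sketched in a few lines of Python. The function names slugify and unique_alias are hypothetical, the set called existing stands in for a lookup on the alias column, and the counter suffix is just one of the options mentioned above.

```python
import re


def slugify(title):
    """Lowercase the title and replace every run of non-alphanumerics with one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def unique_alias(title, existing):
    """Return an alias for title that is not in existing, adding -1, -2, ... if needed."""
    base = slugify(title)
    alias = base
    n = 0
    while alias in existing:  # in a database: SELECT 1 FROM picture WHERE alias = ?
        n += 1
        alias = f"{base}-{n}"
    existing.add(alias)
    return alias


existing = set()
first = unique_alias("Great holidays", existing)   # "great-holidays"
second = unique_alias("Great holidays", existing)  # "great-holidays-1"
```

A unique index on the alias column would still be worth having, so that two concurrent inserts cannot end up with the same alias.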

Consistent random ordering in a MySQL query

I have a database of pictures and I want to let visitors browse the pictures. I have one "next" and one "previous" link.
But what I want is to show every visitor a different order of the pictures. How can I do that? If I use ORDER BY RAND(), I will sometimes show duplicate images.
Can someone help me please? Thank you!
You can try to use seed in random function:
SELECT something
FROM somewhere
ORDER BY rand(123)
123 is a seed. With the same seed, RAND() returns the same sequence of values, so the resulting order is repeatable.
The problem arises from the fact that each page will run RAND() again and has no way of knowing if the returned pictures have already been returned before. You would have to compose your query in such a way that you can filter out the pictures already presented on the previous pages, so that RAND() will have fewer options to choose from.
An idea would be to randomize the pictures, select the IDs, store the IDs in the session, then SELECT using those IDs. This way, each user will have the pictures randomized, but they will be able to paginate through them without re-randomizing them on each page.
So, something like:
SELECT id FROM pictures ORDER BY RAND() LIMIT x if you don't have the IDs in the session already
Store the IDs in the session
SELECT ... FROM pictures WHERE id IN (IDs from session) LIMIT x
Another idea is to store in session the IDs that the user already saw and filter them out. For example:
SELECT ... FROM pictures ORDER BY RAND() LIMIT x if the session doesn't contain any ID
Append the IDs from the current query to the session
SELECT ... FROM pictures WHERE id NOT IN (IDs from session) ORDER BY RAND() LIMIT x
Another way seems to be to use a seed, as izi points out. I have to say I didn't know about the seed, but it seems to return the exact same results for the exact same value of the seed. So, run your usual query and use RAND(seed) instead of RAND(), where "seed" is a number. You can derive the seed from the session ID (for example, a numeric hash of it), since the session ID is unique for each visitor. Note that RAND() expects an integer seed, so a raw non-numeric session string will not work as-is.
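The "randomize once per visitor, then paginate" idea can be illustrated outside the database with a short Python sketch. It uses Python's random.Random with a per-visitor seed rather than MySQL's RAND(seed), so it shows the principle, not the same number sequence; visitor_order and page are hypothetical helper names.

```python
import random


def visitor_order(picture_ids, seed):
    """Return the picture IDs in a shuffled order that is repeatable for a given seed."""
    ids = sorted(picture_ids)          # start from a fixed order so the seed alone decides
    random.Random(seed).shuffle(ids)   # private generator: does not disturb global state
    return ids


def page(order, page_number, per_page):
    """Return one page (0-based page_number) of an already shuffled order."""
    start = page_number * per_page
    return order[start:start + per_page]


picture_ids = range(1, 11)
order = visitor_order(picture_ids, seed=351)   # e.g. a number derived from the session
pages = [page(order, n, 4) for n in range(3)]  # pages of 4, 4 and 2 pictures
```

Because every page is a slice of the same shuffled list, no picture can appear twice for one visitor, which is the problem the question describes.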
You can seed the random function as suggested by izi, or keep track of visited images vs non-visited images as suggested by rdineiu.
I'd like to stress that neither option will perform well, however. Either will lead you to sort your entire table (or the part of it that is of interest) using an arbitrary criterion and extract the top n rows, possibly with an offset. It'll be dreadfully slow.
Thus, consider for a moment how important it is that every visitor should get a different image order. Probably, it'll be not that important, as long as things look random. Assuming this is the case, consider this alternative...
Add an extra float field to your table, call it sort_ord. Add an index on it. On every insert or update, assign it a random value. The point here is to end up with a seemingly random order (from the visitor's standpoint) without compromising performance.
Such a setup will allow you to grab the top n rows and paginate your images using an index, rather than by sorting your entire table.
At your option, have a cron job periodically set a new value:
update yourtable
set sort_ord = rand();
Also at your option, create several such fields and assign one to visitors when they visit your site (cookie or session).
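A rough Python sketch of the sort_ord approach follows, using an in-memory SQLite database in place of MySQL. The table name pictures, the index name, the next_page helper and the choice to page by "greater than the last row seen" (with id as a tie-breaker) are assumptions for illustration; the answer itself only specifies the random, indexed sort_ord column.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pictures (id INTEGER PRIMARY KEY, sort_ord REAL)")

rng = random.Random(42)
conn.executemany(
    "INSERT INTO pictures (id, sort_ord) VALUES (?, ?)",
    [(i, rng.random()) for i in range(1, 21)],  # random value assigned at insert time
)
conn.execute("CREATE INDEX ix_pictures_sort_ord ON pictures (sort_ord, id)")


def next_page(conn, last, per_page):
    """Return the next rows in sort_ord order; last is (sort_ord, id) of the last row seen."""
    if last is None:
        sql = "SELECT id, sort_ord FROM pictures ORDER BY sort_ord, id LIMIT ?"
        args = (per_page,)
    else:
        sql = (
            "SELECT id, sort_ord FROM pictures "
            "WHERE sort_ord > ? OR (sort_ord = ? AND id > ?) "
            "ORDER BY sort_ord, id LIMIT ?"
        )
        args = (last[0], last[0], last[1], per_page)
    return conn.execute(sql, args).fetchall()


seen = []
last = None
while True:
    rows = next_page(conn, last, 6)
    if not rows:
        break
    seen.extend(picture_id for picture_id, _ in rows)
    last = (rows[-1][1], rows[-1][0])
```

Each page is read in index order starting just after the previous page's last row, so the table never has to be re-sorted per request.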
This will solve:
SELECT DISTINCT RAND() as rnd, [rest of your query] ORDER BY rnd;
Use RAND(SEED). From the docs: "If a constant integer argument N is specified, it is used as the seed value." (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand).
In the example below the result order is always the same. You simply change the seed (351) and you get a new random order.
SELECT * FROM your_table ORDER BY RAND(351);
You can change the seed every time the user hits the first page.
Without seeing the SQL I'd guess you could try SELECT DISTINCT...

How to best get 3 prior image and 3 later image records in MySQL query?

I'll explain briefly what I want to accomplish from a functional perspective. I'm working on an image gallery. On the page that shows a single image, I want the sidebar to show thumbnails of images uploaded by the same user. At a maximum, there should be 6, 3 that were posted before the current main image, and 3 that were posted after the main image. You could see it as a stream of images by the same user through which you can navigate. I believe Flickr has a similar thing.
Technically, my image table has an autoincremented id, a user id and a date_uploaded field, amongst many other columns.
What would your advice be on how to implement such a query? Can I combine this in a single query? Are there any handy MySQL utilities that can deal with offsets and such?
PS: I prefer not to create an extra "rank" column, since that would make managing deletions difficult. Also, using the autoincrement id seems risky, I might change it for a GUID later on. Finally, I'm of course looking for a query that performs and scales.
I know I ask for a lot, but it seems simpler than it is?
The query could look like the following.
With a UserID+image_id index (and possibly additional fields for covering purposes), this should perform relatively well.
SELECT field1, field2, whatever
FROM myTable
WHERE UserID = some_id
-- AND image_id > id_of_the_previously_first_image
ORDER BY image_id
LIMIT 7;
To help with scaling, you should consider using a bigger LIMIT value and cache accordingly.
Edit (answering remarks/questions):
The combined index...
is made of several fields, specifically
CREATE [UNIQUE] INDEX UserId_Image_id_idx
ON myTable (UserId, image_id [, field1 ...] )
Note that optional elements of this statement are in brackets ([]). I would assume the UNIQUE constraint would be a good thing. The additional "covering" fields (field1, ...) may be beneficial, but that depends on the "width" of such additional fields as well as on the overall setup and usage patterns (since [large] indexes slow down INSERTs/UPDATEs/DELETEs, one may wish to limit the number and size of such indexes, etc.)
Such an index data "type" is neither numeric nor string etc. It is simply made of the individual data types. For example if UserId is VARCHAR(10) and Image_id is INT, the resulting index would use these two types for the underlying search criteria, i.e.
... WHERE UserId = 'JohnDoe' AND image_id > 12389
in other words one needn't combine these criteria into a single key.
On image_id
when you say image_id, you mean the combined user/image id, right?
No, I mean only image_id. I'm assuming this field is a separate field in the table. The UserID is taken care of in the other predicate of the WHERE clause.
The original question write up indicates that this field is auto-generated, and I'm assuming we can rely on this field for sorting purposes. Alternatively we could rely on other fields such as the timestamp when the image was uploaded and such.
Also, an afterthought, whether ordered by a [monotonically increasing] Image_id or by the Timestamp_of_upload, we may want to use a DESC order, to show the latest "stuff" first.
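To make the "3 before, 3 after" requirement from the question concrete, here is a minimal Python sketch that runs two index-friendly queries against an in-memory SQLite database standing in for MySQL. It extends the single-direction query above with a mirrored query for the earlier images; the neighbours helper and the sample data are hypothetical, and image_id is assumed to be a usable sort key, as the answer assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (image_id INTEGER PRIMARY KEY, UserID INTEGER)")
conn.execute("CREATE UNIQUE INDEX UserId_Image_id_idx ON myTable (UserID, image_id)")
conn.executemany(
    "INSERT INTO myTable (image_id, UserID) VALUES (?, ?)",
    [(i, 1) for i in range(10, 100, 10)] + [(15, 2), (25, 2)],
)


def neighbours(conn, user_id, image_id, n=3):
    """Return (earlier, later): up to n image IDs on each side, both in ascending order."""
    earlier = conn.execute(
        "SELECT image_id FROM myTable "
        "WHERE UserID = ? AND image_id < ? "
        "ORDER BY image_id DESC LIMIT ?",  # closest earlier images first
        (user_id, image_id, n),
    ).fetchall()
    later = conn.execute(
        "SELECT image_id FROM myTable "
        "WHERE UserID = ? AND image_id > ? "
        "ORDER BY image_id ASC LIMIT ?",
        (user_id, image_id, n),
    ).fetchall()
    return [r[0] for r in reversed(earlier)], [r[0] for r in later]


before, after = neighbours(conn, user_id=1, image_id=50)
```

Deleted images need no special handling here, because both queries only look at the rows that currently exist for that user. If the sort key later becomes an upload timestamp, the same pattern applies with that column in the index and in both comparisons.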