How to extract relevant data from MySQL? - mysql

I'm using a table named "url2" with the MySQL InnoDB engine. It holds a lot of data: the full HTML of a page, the URL of the page, and so on. When I use the following SQL query I get a lot of results:
SELECT url FROM url2 WHERE html LIKE '%Yuva%' OR url LIKE '%Yuva%'
The search term "yuva" can change depending on the user's request.
It selects a lot of data, most of which I don't need. How can I avoid that?
The output of the above query is
www.123musiq.com
www.123musiq.com/home.html
www.123musiq.com/yuva.html
www.sensongs.com/
www.sensongs.com/hindi.html
www.sensongs.com/yuva.html
The output I need is sorted according to relevance, like this:
www.123musiq.com/yuva.html
www.sensongs.com/yuva.html
www.sensongs.com/hindi.html
Following a comment from my friend, I changed the table to MyISAM, but now I get about 25 results from 123musiq.com first and only after that the sensongs.com results. How can I get 2 from 123musiq.com and 2 from sensongs.com, ordered by relevance?

It seems you're asking for a full-text index, which in MySQL is only available on MyISAM tables.
Since you're using InnoDB tables, the easiest solution is to create a new (MyISAM) table with only the text content and an index to join with the original table (this also helps with seek efficiency in some common cases).
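A rough sketch of that setup might look like this (the table and column names here are assumptions, not your actual schema):

CREATE TABLE url2_text (
    url_id INT NOT NULL PRIMARY KEY,   -- joins back to url2
    url    VARCHAR(255) NOT NULL,
    html   MEDIUMTEXT NOT NULL,
    FULLTEXT KEY ft_url_html (url, html)
) ENGINE=MyISAM;

SELECT url, MATCH(url, html) AGAINST ('yuva') AS relevance
FROM url2_text
WHERE MATCH(url, html) AGAINST ('yuva')
ORDER BY relevance DESC;

MATCH ... AGAINST returns a relevance score, so ordering by it gives the "most relevant first" listing you're after.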

Perhaps you want to use LIMIT?
SELECT * FROM url2 WHERE html LIKE '%Yuva%' OR url LIKE '%Yuva%' LIMIT 2

Related

Solr index multiple tables from MySQL

I have following mysql tables
1. user(user_id,email)
2. tweets(tweet_id,user_id,tweet)
3. tags(tag_id,tag)
4. tweets_tags(tweet_id,tag_id)
I want to show the current user's tweets under a "My Tweets" tab in the application. I want to get the following data from Solr:
user_id
email
tweet where user_id=x
tags where tweet_id=xx
How do I index those MySQL tables in Solr? I only want to know the schema.xml and data-config.xml code for full/delta import.
Note: I am not asking about the MySQL connector etc.; I have already done that part.
The use case you've described doesn't seem to justify using Solr. Just make sure you have proper keys and indexes and do it in MySQL directly.
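For example, the "My Tweets" data could come straight from a join like this (the 1 below is just a placeholder for the current user's id):

SELECT u.user_id, u.email, t.tweet_id, t.tweet, tg.tag
FROM user u
JOIN tweets t ON t.user_id = u.user_id
LEFT JOIN tweets_tags tt ON tt.tweet_id = t.tweet_id
LEFT JOIN tags tg ON tg.tag_id = tt.tag_id
WHERE u.user_id = 1;   -- substitute the current user's id

With indexes on the user_id and tweet_id foreign keys this stays fast without any extra infrastructure.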
If for some reason you MUST use Solr, you could prepare all the data and feed it to Solr in a user/tweet/tag structure like this:
user1 - tweet1 - tag1
user1 - tweet1 - tag2
user1 - tweet2 - tag1
and so on.
Then from Solr you query by user, and sort and group by tweet and then by tag.
However, I must say again that what I just described can be implemented far more safely, and with more confidence in the result, using plain SQL.
Should you provide more details on your desired outcome, I'd be happy to suggest the database structure along with the necessary foreign keys and indexes and the queries you need to get your data out.
If you are using DIH (the DataImportHandler), I guess this link should be the solution for you:
Import with sub entities
If you have problems writing the exact configuration, please let me know and I can assist you.

Optimized SELECT query in MySQL

I have a very large number of rows in my table, table_1. Sometimes I just need to retrieve a particular row.
I assume that when I use a SELECT query with a WHERE clause, it loops from the very first row until it matches my requirement.
Is there any way to make the query jump to a particular row and then start from that row?
Example:
Suppose there are 50,000,000 rows and the id I want to search for is 53750. What I need is for the search to start from row 50000, saving the time of searching the first 49999 rows.
I don't know the exact term since I am not an SQL expert!
You need to create an index: http://dev.mysql.com/doc/refman/5.1/en/create-index.html
ALTER TABLE table_1 ADD UNIQUE INDEX (id);
The way I understand it, you want to select a row with id 53750. If you have a field named id you could do this:
SELECT * FROM table_1 WHERE id = 53750
Along with indexing the id field, that's the fastest way to do it, as far as I know.
ALTER TABLE table_1 ADD UNIQUE INDEX (<column>);
Would be a great first step if it has not been generated automatically. You can also use:
EXPLAIN <your query here>
to see which query plan works best in this case. Note that if you later change the WHERE clause to filter on another column that keeps coming back in your queries, it is a good idea to put an index on that column as well.
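For example, on the query from this question (the plan details below are the usual outcome, not guaranteed):

EXPLAIN SELECT * FROM table_1 WHERE id = 53750;
-- with an index on id, the plan should show type const (or ref) and rows 1,
-- instead of type ALL, which would mean scanning all 50,000,000 rows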
Create an index on the column you want to do the SELECT on:
CREATE INDEX index_1 ON table_1 (id);
Then, select the row just like you would before.
But also, please read up on databases, database design and optimization. Your question is full of false assumptions. Don't just copy and paste our answers verbatim. Get educated!
There are several things to know about optimizing SELECT queries, like range and WHERE clause optimization. The documentation is pretty informative about this; read the section "Optimizing SELECT Statements". Creating an index on the column you filter on is very helpful for performance too.
One possible solution: you can create a view and then query the view. Here are the details of creating a view and obtaining data from it:
http://www.w3schools.com/sql/sql_view.asp
You just split that huge number of rows into many views (i.e. rows 1-10000 in one view, then 10001-20000 in another view) and then query the appropriate view.
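A minimal sketch of that idea (the view names are made up, and note that without an index on id the view still scans the underlying table, so this alone does not make the lookup fast):

CREATE VIEW view_rows_1_10000 AS
    SELECT * FROM table_1 WHERE id BETWEEN 1 AND 10000;

CREATE VIEW view_rows_50001_60000 AS
    SELECT * FROM table_1 WHERE id BETWEEN 50001 AND 60000;

-- id 53750 falls into the second range, so query that view
SELECT * FROM view_rows_50001_60000 WHERE id = 53750;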
I am pretty sure that any SQL database with a little self-respect does not start looping from the first row to find the desired row. But I am also not sure how they make it work, so I can't give an exact answer.
You could check what's in your WHERE clause and how the table is indexed. Do you have a proper primary key, e.g. one using a numeric data type? Do you have indexes on the other columns that are used in your queries?
There is also a lot to consider when installing the database server, like where to put the data and log files, how much memory to give the server, and how to configure file growth. There's a lot you can do to tune your server.
You could try splitting your table into partitions.
More about alter tables to add partitions
Selecting from a specific partition
In your case you could create a partition on id for every 50,000 rows, and when you want to skip the first 50,000 you just select from partition 2. How to do this is explained quite well in the MySQL documentation.
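A sketch of that approach (the partition boundaries and names below are just illustrative; note that RANGE partitioning on id requires id to be part of every unique key of the table, and the explicit PARTITION selection syntax needs MySQL 5.6 or later):

ALTER TABLE table_1 PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (50000),
    PARTITION p1 VALUES LESS THAN (100000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);

-- id 53750 falls into p1, so only that partition is read
SELECT * FROM table_1 PARTITION (p1) WHERE id = 53750;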
You may try something as simple as this:
SELECT * FROM tblname LIMIT 50000, 1
I just tried it with phpMyAdmin. The first number (50000) is the starting row (the offset), and the second is how many rows to return from there.
EDIT :
But if I were you I wouldn't use this one, because it leaves records 1 - 49999 out of the search entirely.

mysql match urls

I am inserting URLs into a MySQL table. For example, I have inserted the 8 entries below:
url
-----------------------------
http://example.com
http://www.example.com
http://example.com/
http://www.example.com/
http://example.com/sports
http://www.example.com/sports
http://example.com/sports/
http://www.example.com/sports/
Now how can I write a query to match example.com that returns the first 4 entries, since they are the same URL? Similarly, how do I write a query to get the last 4 entries, as they are the same? Even if I have a huge number of entries the query should be fast; is that possible?
Well, if you have those links in a single table, you could get them like:
SELECT * FROM table WHERE url LIKE '%example.com%'
Is this fast? NO - it will require a full table scan.
If I were you, I would model my DB to hold those URLs in 2 tables:
links
id
base_url - holds example.com
related_links
id
link_id - FK on links
subdomain - holds www.
relative_url - holds /sports/
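In SQL the proposed structure could look roughly like this (the column types are assumptions):

CREATE TABLE links (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    base_url VARCHAR(255) NOT NULL           -- e.g. example.com
);

CREATE TABLE related_links (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    link_id      INT NOT NULL,               -- FK on links
    subdomain    VARCHAR(64),                -- e.g. www.
    relative_url VARCHAR(255),               -- e.g. /sports/
    FOREIGN KEY (link_id) REFERENCES links (id)
);

-- all rows for example.com, regardless of subdomain or trailing slash
SELECT l.base_url, r.subdomain, r.relative_url
FROM links l
JOIN related_links r ON r.link_id = l.id
WHERE l.base_url = 'example.com';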
Edit - to answer comment:
Your DB is not normalized right now. You hold multiple records for "the same thing" - you are not getting the advantages of a DB. DBs are useful when working with structured data; with the current design your query has to do string operations, and pretty complex ones. So, while it would probably be possible to return the results you need with the current form of the DB, it won't be a trivial task, and performance would definitely suck.
My recommendation: modify the DB. At least add the columns subdomain and relative_path to your table and keep this information as separate as possible, so you can run aggregated queries on it.

Search page engine PHP with MYSQL database?

I'm going to generalize this question so that other people can use the answers.
Let's say I have a website driven by a MYSQL database.
The database contains 5 tables: events, news, books, articles, tips.
Every table has, among others, 2 fields, Title and Details, in which I want to search.
On every page of the site I have a search form (text field and button).
After I type a word or phrase I want to be redirected to a page called search where I should see the results as a list with links from the entire database.
e.g.
Book X (link on it to the book found in the database)
Event Y
Article Z
HELP: The tables use the InnoDB engine, so full-text search didn't work. I'm also having trouble building a SELECT statement that searches multiple fields across multiple tables with LIKE. I've succeeded with one table, but with multiple tables and multiple fields I get errors, no data, or duplicated data in some cases. Some help with this SELECT statement, please.
Question: How do I build a search engine for all the tables in my MYSQL DB? Some SQL injection or other hacking prevention advice would be appreciated also.
My approach to the situation is to create a view based on all the tables, using only the similar columns (the columns we need to search) plus one more alias column holding the table name / entity name (Books, Event, etc.).
It should look like this:
EntityName Title Details
Books xxxx xxxxx
...............................
I am not explaining how to create the view with UNION here (use UNION rather than UNION ALL if you don't want duplicates).
The next step would be to search it with LIKE statements:
select * from vwSearchData where Title like '%keyword%' or Details like '%keyword%'
The next step is to display the data along with the entity names.
Of course, you need to sanitize the keyword coming from the search form (e.g. strip or escape HTML entities) before using it.
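A rough sketch of such a view and the search over it (only three of the five tables are spelled out; articles and tips follow the same pattern):

CREATE VIEW vwSearchData AS
    SELECT 'Books'  AS EntityName, Title, Details FROM books
    UNION
    SELECT 'Events' AS EntityName, Title, Details FROM events
    UNION
    SELECT 'News'   AS EntityName, Title, Details FROM news;

SELECT EntityName, Title, Details
FROM vwSearchData
WHERE Title LIKE '%keyword%' OR Details LIKE '%keyword%';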
You can use UNION:
(SELECT * FROM events WHERE title LIKE '$key' OR details LIKE '$key')
UNION
(SELECT * FROM news WHERE title LIKE '$key' OR details LIKE '$key')
and so on.

Generate number id from text/url for fast "SELECT"

I have the following problem:
I have a feed capturer that captures news from different sources every half an hour.
I only insert entries whose URLs are not already in the database (the URL is used to check whether the record is already in the database).
Even with that, I get some repeated entries, because some sites report the same news (usually stories from a wire source like Reuters). I could look for these repeated entries during insertion, but I think this would slow insertion down even more.
So I can later find these repeated entries by title, but I think this search is slow. My idea is to generate a numeric field from the title and then search by this number for repeated titles.
What kind of encoding could I use to encode the titles (I was thinking of something like the reverse of base64)?
I'm supposing that searching for repeated numbers is a lot faster than searching for repeated words. Is that true or not?
Do you suggest a better solution for this problem?
Well, I don't mind having the repeated entries in the database, I just don't want to show them to the user. Like Google, which filters out repeated results but shows them if you want.
I hope I explained it well. Thanks in advance.
Store the MD5 hashes of the URL and title and build a UNIQUE index on them:
CREATE UNIQUE INDEX ux_mytable_title_url ON mytable (title_hash, url_hash)
INSERT
INTO mytable (url, title, url_hash, title_hash)
VALUES ('url', 'title', MD5('url'), MD5('title'))
To select like Google (one result per title), use this query:
SELECT *
FROM (
SELECT DISTINCT title_hash
FROM mytable
) md
JOIN mytable mo
ON mo.title_hash = md.title_hash
AND mo.url_hash =
(
SELECT url_hash
FROM mytable mi
WHERE mi.title_hash = md.title_hash
ORDER BY
mi.title_hash, mi.url_hash
LIMIT 1
)
So you can use a new table containing only the encoded keys based on title and URL; you then have to add a key on it to accelerate the search. But I don't think there is an efficient algorithm to transform strings into numbers.
For the hash, use:
SELECT MD5(CONCAT('title', 'url'));
and before every insertion, test whether the encoded concatenation of title and URL already exists in this table.
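That pre-insert check could look something like this (entry_hashes is a hypothetical name for the table holding the hashes):

SELECT COUNT(*)
FROM entry_hashes
WHERE hash = MD5(CONCAT('some title', 'http://example.com/news'));
-- insert the new entry only when this returns 0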
@Quassnoi can explain this better than I can, but I think there is no visible difference in performance between using a VARCHAR/CHAR or an INT in an index that you later use for GROUPing or some other method of finding the duplicates. That way you could use the solution he proposed, but with a normal INDEX instead of a UNIQUE index, keep the duplicates in the database, and filter them out only when showing results to users.
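For example, finding the duplicated titles later with a plain (non-unique) index would just be a GROUP BY over the hash column from that answer:

SELECT title_hash, COUNT(*) AS copies
FROM mytable
GROUP BY title_hash
HAVING COUNT(*) > 1;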