How to create a small search engine [closed] - mysql

I'm aiming to create a small in-app search engine (something like the Google Maps address search bar). The requirement is quite simple: an item consists of many keywords; the user types in a keyword and it gives the corresponding results; the user then types in another keyword and it continues to filter the results.
The first thing that comes to my mind is to use MySQL: create a keywords table to store every keyword and link it to the item table, and when the user types in a keyword, search through every record in the keywords table to give results. Am I thinking in the right way? Could you give me some help? I'm a total novice in MySQL (I only learned it in a high school lesson). Is there any open-source platform for this?

Note: If you don't need to store keyword frequency, then go with Marmik Bhatt's LIKE suggestion.
If you have large amount of data and you want to do a keyword search only (i.e. you are not going to be searching for phrases or use concepts like "near"), then you can simply create a keyword table:
CREATE TABLE address
(
id INT(10) PRIMARY KEY,
/* ... */
);
CREATE TABLE keyword
(
word VARCHAR(255),
address_id INT(10),
frequency INT(10),
PRIMARY KEY(word, address_id)
);
You then scan through the text that you are "indexing" and count each word that you find there.
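A minimal sketch of that counting step, assuming your application code splits the text into words and issues one upsert per (word, address) pair; the word and the id are made up:
INSERT INTO keyword (word, address_id, frequency)
VALUES ('bakery', 42, 1)
ON DUPLICATE KEY UPDATE frequency = frequency + 1;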
If you want to do several keywords:
SELECT address.*, SUM(frequency) frequency_sum
FROM address
INNER JOIN keyword ON keyword.address_id = address.id
WHERE keyword.word IN ('keyword1', 'keyword2', /*...*/)
GROUP BY address.id;
Here I've done a frequency sum, which is a quick-and-dirty way to compare the usefulness of the results when many are returned.
Things to think about:
Do you want to insert all keywords into the database, or only those that have a frequency higher than a specific value? If you insert all of them, your table may become huge; if you insert only the higher-frequency ones, you will not find the one article that mentions a specific word but does so only once.
Do you want to insert all the available keywords for a specific article or only the "top" ones? The danger here is that frequent words that add nothing to the meaning will start pushing others out. Consider the word "however": it may appear in your article many more times than "mysql", but it is the latter that defines the article, not the former.
Do you want to exclude words shorter than a specific number of characters?
Do you want to exclude known "meaningless" words? (A rough sketch of the last two exclusions follows below.)
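If you decide to prune along those lines, a rough sketch; the stopword table and the length threshold of 3 are assumptions, not part of the schema above:
CREATE TABLE stopword
(
word VARCHAR(255) PRIMARY KEY /* filled with words like 'however', 'the', 'and' */
);

DELETE FROM keyword
WHERE CHAR_LENGTH(word) < 3
OR word IN (SELECT word FROM stopword);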

For a search engine, I use 'LIKE' to search the parameters...
The query would look like:
SELECT * FROM tbl_keywords
INNER JOIN tbl_addresses ON tbl_addresses.id = tbl_keywords.address_id
WHERE tbl_keywords.keywords LIKE "% $keyword %";
$keyword is a variable retrieved from the GET or POST request sent by the search bar.
You can also return your search results as JSON so that, using jQuery, you can display them quickly.
Full Text Search
You can also use full-text search for searching for places and related keywords; see this link: SQL Full Search Tutorial.
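If you go the full-text route in MySQL itself, a FULLTEXT index is the usual starting point. A sketch using the same tables as the LIKE example above (MySQL 5.6+ for InnoDB, or MyISAM):
ALTER TABLE tbl_keywords ADD FULLTEXT INDEX ft_keywords (keywords);

SELECT tbl_addresses.*
FROM tbl_keywords
INNER JOIN tbl_addresses ON tbl_addresses.id = tbl_keywords.address_id
WHERE MATCH(tbl_keywords.keywords) AGAINST('$keyword' IN NATURAL LANGUAGE MODE);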

One thing you can implement is to break the user's input down on spaces; that will fetch the most relevant results.
For example, the user types Create search engine,
then you explode it on spaces,
then query the DB for each word.
A REGEXP might be more efficient, but you'd have to benchmark it to be sure, e.g.
SELECT * from fiberbox where field REGEXP 'Create|search|engine';
Use jQuery Autocomplete to make an auto-suggest search like Google does

Related

Should I use another column to show whether LONGTEXT contains data? [closed]

I have a (probably very basic) doubt about how to know whether a field has been stored with information in the database.
I'm working with Laravel and MySQL as DB engine.
The situation is as follows:
I have a field in the database that stores information (in this case it is a LONGTEXT field with a large amount of information in it). This field stores information in an automated way (by means of a CRON).
When listing the information related to the records of that table, I need to know if the field in question contains information or not.
At first I had thought of including another field (column) in the same table that tells me whether the field is empty or not. Although I consider that this would be a correct way to do it, I also think I could save this column by simply checking whether the field in question is empty. However, I'm not sure whether this would be the right way to do it, or whether it could affect the performance of the application (I don't know exactly how MySQL performs this comparison, or whether it could be optimised by making use of the new field).
I hope I have explained myself correctly.
Schematically, the options are:
Option 1:
Have a single field (very large amount of information).
When obtaining the list with the records, check in the corresponding search if the field in question contains information.
Option 2:
Have two fields: one of them contains the information and the other is a boolean that indicates if the first one contains information.
When obtaining the list of records, look at the boolean.
The aim of the question is to use good practices as well as to optimise both the search and minimise the impact on the performance of the process.
Thank you very much in advance.
It takes extra work for MySQL to retrieve the contents of a LONGTEXT or any other BLOB / CLOB column. You'll incur that work even if your query says:
SELECT id FROM tbl WHERE `longtext` IS NOT NULL /* slow! */
or
SELECT id FROM tbl WHERE CHAR_LENGTH(`longtext`) >= 1 /* slow! */
So, yes, you should also use another column to indicate whether the LONGTEXT column is populated if you need to run a lot of queries like that.
You could consider using a generated column like this for the purpose.
textlen BIGINT GENERATED ALWAYS AS (CHAR_LENGTH(`longtext`)) STORED
The generated column will get its value at the time you INSERT or UPDATE the row. Then WHERE textlen >= 1 will be fast. You can even put an index on it.
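For example, a sketch assuming the table and column names from the queries above (the column is back-quoted because LONGTEXT is a reserved word):
ALTER TABLE tbl
ADD COLUMN textlen BIGINT GENERATED ALWAYS AS (CHAR_LENGTH(`longtext`)) STORED,
ADD INDEX idx_textlen (textlen);

-- This predicate can now use the index instead of touching the LONGTEXT value.
SELECT id FROM tbl WHERE textlen >= 1;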
Go for the length rather than the Boolean value. It doesn't take significantly more space, it gives you slightly more information, and it gives you a nice way to sanity-check your logic during testing.

How to manage entity duplication in database table [closed]

I am working on a simple database design for an application.
I have a Book, an Illustrator, and an Editor table.
Modelling 1
With this model, I think there is duplication of the name column across the author, editor, and illustrator tables.
What if a book's author, illustrator, and editor are the same person? In that case the data gets duplicated across the 3 tables.
But searching will be faster, I guess, as the number of items per table will be smaller.
Modelling 2
With this modelling, all the author, illustrator, and editor info gets saved in a single table, and I am confused about what the name of this table should be.
With this approach the data won't get duplicated, but the searching will be double compared to model 1.
Can anyone suggest which model I should choose? I feel modelling 2 is better.
It is purely up to your taste which model you should use. The second one has the advantage that you won't get duplicates. With both models you can get the results with one query:
select * from books
left join names auth ON (auth.id = author_id)
left join names ill ON (ill.id = illustrator_id)
left join names ed ON (ed.id = editor_id)
where books.id = 1;
SQLFiddle gives an example of model 2. If you want to obtain the data from model one, just change the 3 joins to the right table.
If you want to display a list of authors, I would not recommend adding it as a new field in the names table; just use a join query:
select auth.* from books
left join names auth ON (auth.id = author_id)
As long as you set the indexes on the id, author_id, illustrator_id and editor_id, you are fine.
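For reference, a sketch of those indexes, assuming the table and column names from the query above (names.id is presumably already the primary key):
ALTER TABLE books
ADD INDEX idx_author (author_id),
ADD INDEX idx_illustrator (illustrator_id),
ADD INDEX idx_editor (editor_id);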
Edit: my preference would go to model 2. I think it might also be a bit faster:
The database only needs to open one file (not 3).
There are fewer records in the table (compared to the 3 tables combined) because you don't have duplicates.
The database only needs to search through one index set (not 3) and might do some optimised stuff in the back because it is looking for 3 keys in the same set (instead of 3 keys in 3 index sets) - it's my gut feeling, not sure if this is exactly correct...
You can make one amendment to the 2nd design you have proposed by also keeping a user type column, which describes whether the user is an author, illustrator, and/or editor. Its value will vary from 0 to 7: you store the decimal value of the bitwise flags. For example, if a person is Editor & Author then:
1 (Editor), 0 (Illustrator), 1 (Author) => binary 101 => 5
So when you perform any select/search on that table you can add a filter on the user type in the query.
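A sketch of that kind of filter, assuming the combined table is called names and the bitmask column user_type (both names are made up), with Author = 1, Illustrator = 2, Editor = 4:
SELECT * FROM names WHERE user_type & 1;     /* everyone flagged as an author */
SELECT * FROM names WHERE user_type & 5 = 5; /* editor and author, e.g. the value 5 above */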
Do you need to validate, for example, that the author is defined as an author in "Author" before you link them to a book as author?
Do you care to run a query to find out who all the authors/editors/illustrators defined in your database are?
You have created an N-N link between the entities; however, you have the "authorId", "editorId" and "illustratorId" in the "Book" entity!
The proper way would be to resolve the many-to-many relationship with another table, and end up with something like this:
BOOK, has ID, TITLE, DESC, etc.
PARTICIPANT (suggested name for all people), has ID, NAME, BIO, etc
AUTHOR, has BOOK_ID, PARTICIPANT_ID
EDITOR, has BOOK_ID, PARTICIPANT_ID
ILLUSTRATORS, has BOOK_ID, PARTICIPANT_ID
OR, instead of (3, 4, 5), BOOK_PARTICIPANT, which has BOOK_ID, PARTICIPANT_ID, PARTICIPATION_TYPE (a code for author, editor, illustrator), or even flags (IS_AUTHOR, IS_EDITOR, IS_ILLUSTRATOR, where at least one is required to be set); this variant is sketched below.
If you need to validate the participant as author, editor, or illustrator before being able to link them to a book, you need to add three flags to PARTICIPANT: IS_AUTHOR, IS_EDITOR, IS_ILLUSTRATOR.
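A rough sketch of the single BOOK_PARTICIPANT variant mentioned above; the column types and the one-letter codes are assumptions:
CREATE TABLE BOOK_PARTICIPANT
(
BOOK_ID INT NOT NULL,
PARTICIPANT_ID INT NOT NULL,
PARTICIPATION_TYPE CHAR(1) NOT NULL, /* 'A' author, 'E' editor, 'I' illustrator */
PRIMARY KEY (BOOK_ID, PARTICIPANT_ID, PARTICIPATION_TYPE),
FOREIGN KEY (BOOK_ID) REFERENCES BOOK (ID),
FOREIGN KEY (PARTICIPANT_ID) REFERENCES PARTICIPANT (ID)
);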
Hope this helps

Count the number of times keywords (in a table) appear in a field in another table

I will simplify my problem in order to explain it.
I have a table which contains text messages posted by users and another table which contains keywords.
I want to display, for each user, the number of times keywords are found in text messages.
I don't want the result to display a keyword if it's not found in text messages.
I want it to be case INSENSITIVE. All keywords are lowercase, but messages can contain both lower- and uppercase characters.
Because I'm not sure that my explanation is clear enough, here comes the SQLFiddle.
http://sqlfiddle.com/#!2/c402a
Hope anyone can help me.
I found what I was looking for. It wasn't easy for me, but here is my query:
SELECT t_msg.msg_usr,
t_list.list_word,
count(t_list.list_word),
t_msg.msg_text
FROM t_msg
INNER JOIN t_list
ON LOWER(t_msg.msg_text) LIKE CONCAT("%", t_list.list_word, "%")
GROUP BY t_msg.msg_usr, t_list.list_word;
The SQLFiddle is there : http://sqlfiddle.com/#!2/ba052/8
The recommendation would be not to try solving this with a query. It's possible to write a query that will do it, but such a query will scan the messages table for each keyword separately and produce a count (or a row that you can group by); this won't scale, nor will it be reliable as a language-aware search.
Here is what you might want to do:
Create a table to map (user_id, keyword_id) to a count of this keyword in messages of this user. Let's call it t_keyword_count.
Each time you receive a message, before you save the message into the database, search it for all the keywords you care about (using whatever good text search libraries that account for misspellings, etc.). You should know the (user_id) for this message.
You will, at that point, be ready to add the message to the database, and will have an array of (keyword_id) with keywords that this message will have.
In a transaction, insert the message into the t_msg table, and run an update/insert for (user_id, keyword_id) to set value = value + 1 (or + n, if you need to count the same keyword more than once in the same message) in the t_keyword_count table (see the sketch below).
If you are trying to solve the problem of having to do the above on existing data, you can do it manually, just to build up that t_keyword_count table first (it depends on how many keywords you have in total, but even if there are a lot, this can be scripted). But you should change (or mirror) the t_msg.msg_text field to a field suitable for text search, and use SQL text-search functionality to find the keywords.
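A sketch of the per-message update from the transactional step, assuming t_keyword_count has columns (user_id, keyword_id, cnt) with a primary key on (user_id, keyword_id); the ids and the count are made up:
-- Run inside the same transaction as the INSERT into t_msg.
INSERT INTO t_keyword_count (user_id, keyword_id, cnt)
VALUES (42, 7, 1) /* 1 = occurrences of this keyword in the new message */
ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt);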

Is it ever "ok" to use multiple queries instead of one? [closed]

In building a web app recently, I started thinking about the information returned from a query I was making:
Find the user information and (for simplicity's sake) the associated phone numbers tied to this user. Something as simple as:
SELECT a.fname, a.lname, b.phone
FROM users a
JOIN users_phones b
ON (a.userid = b.userid)
WHERE a.userid = 12345;
No problem here (yes, I'm preventing injection, etc.; that's not the point of this question). When I think about the data that is returned, though, I am returning (potentially) several rows of information with that user's name on each one. Let's say that single user has 1000 phone numbers associated with it. That's a first name and last name being returned with every single row, which adds up. Let's also assume I want to return a lot more than just the first name and last name of that user, and in fact I'm starting to return quite a bit of extra data that I really only needed once.
Are there circumstances in which it is "more appropriate" to make multiple calls to a database?
e.g.
SELECT firstname, lastname
FROM users
WHERE userid = 12345;
SELECT phone
FROM users_phones
WHERE userid = 12345;
If the answer is yes, is there a good/proper method of determining when to use multiple queries versus a single one?
I think that really depends on your use case. In the example you gave, it seems to make sense to return it as two queries, especially if you're passing that info back to a mobile device where you want to send as little data as possible (not everyone has unlimited data...).
I'd probably stick a DISTINCT in those queries as well if that's going to make a difference based on your tables.
A query with a JOIN may be slower than two independent queries. It really depends on the type of access you're doing.
For your example, I'd go with the two query approach. These queries could be executed in parallel, they could be cached, and there's no real reason to JOIN other than for arbitrary presentation concerns.
You'll also want to be concerned about returning duplicate data. In your example it looks like fname and lname would be repeated for each and every phone number, resulting in a lot of data being transmitted that's actually not useful. This is because of the one-to-many relationship you've described.
Generally you'll want to JOIN if it means sending less data, or because the two queries are not independent.
This should be driven by the application. Basically, you retrieve in one query all the information needed in one place. If you take this question page as an example, you see your user ID, the reputation counter, and the badge counters. There's no need to retrieve the rest of the user's profile information when the question page is first displayed.
Only when one clicks on the user ID is the rest of the profile queried, and maybe not even all of it, as there are several tabs on the profile page.
However, if your application is guaranteed to access all 1000 phone numbers at once, along with the user's name, then you probably should fetch them all together.

What is the most efficient way to store a sort-order on a group of records in a database? [closed]

Assume PHP/MySQL, but I don't necessarily need actual code; I'm just interested in the theory behind it.
A good use-case would be Facebook's photo gallery page. You can drag and drop a photo on the page, which fires an Ajax event to save the new sort order. I'm implementing something very similar.
For example, I have a database table "photos" with about a million records:
photos
id : int,
userid : int,
albumid : int,
sortorder : int,
filename : varchar,
title : varchar
Let's say I have an album with 100 photos. I drag/drop a photo into a new location and the Ajax event fires off to save on the server.
Should I be passing the entire array of photo ids back to the server and updating every record? (Assume input validation by "WHERE userid=loggedin_id", so malicious users can only mess with the sort order of their own photos.)
Or should I pass the photo id, its previous sortorder index, and its new sortorder index, retrieve all records between these 2 indices, sort them, and then update their orders?
What happens if there are thousands of photos in a single gallery and the sort order is changed?
What about just using an integer column which defines the order? By default you assign numbers in multiples of 1000, like 1000, 2000, 3000..., and if you move 3000 between 1000 and 2000 you change it to 1500. So in most cases you don't need to update the other numbers at all. I use this approach and it works well. You could also use a double, but then you don't have control over the precision and rounding errors, so it's better not to.
So the algorithm would look like this: say you move B to the position after A. First perform a select to see the order of the record next to A. If it is at least 2 higher than the order of A, you just set the order of B to fit in between. But if it's only 1 higher (there is no space after A), you select the bordering records of B to see how much space is on that side, divide it by 2, and then add this value to the order of all the records between A and B. That's it!
(Note that you should use transactions/locking for any algorithm which contains more than a single query, so this applies to this case too. The easiest way is to use an InnoDB transaction.)
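A sketch of the common case (there is still a gap), assuming the photos table from the question and made-up ids and sortorder values:
START TRANSACTION;

-- Find the order of the record currently following A (A has sortorder 2000 here).
SELECT sortorder
FROM photos
WHERE albumid = 7 AND sortorder > 2000
ORDER BY sortorder
LIMIT 1
FOR UPDATE; /* returns 4000 in this example, so there is room in between */

-- Move B into the gap: (2000 + 4000) / 2 = 3000.
UPDATE photos
SET sortorder = 3000
WHERE id = 555; /* B's id */

COMMIT;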
Store it as a linked list, where sortorder is a foreign key reference to the next photo_id in the set.
This would probably be a 'linked list' construct.
To me the second method of updating is the way to go (update only the range that changes). You mention "What happens if there are thousands of photos in a single gallery...", and to me that is never going to happen. Let's take your Facebook example: Facebook doesn't show thousands of photos on one page; it splits them up into about 10-20 per page.
The way I'd do this in a nonrelational database is to store a list of photo IDs on the 'album' entity/record, in the order desired. Reordering the photos results in reordering the list, and only a single database write.
Some SQL databases (e.g. PostgreSQL) have native list datatypes, but MySQL doesn't. You could serialize the list as a string or binary on MySQL.
3rd-normal-form-trained database gurus will scream at you that this is a terrible approach, but RDBMSes are optimized for OLAP-type queries, where query flexibility is more important than read performance. Web apps are best written with a 'write heavy, read light' strategy in mind, and this sort of denormalization is exactly in line with that.
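A minimal sketch of that approach on MySQL; the albums table and the photo_order column are assumptions, not something from the question's schema:
ALTER TABLE albums ADD COLUMN photo_order TEXT;

-- Reordering the album is then a single write of the whole id list.
UPDATE albums
SET photo_order = '17,3,42,8' /* photo ids in the new display order */
WHERE id = 7;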