implementing a blacklist of usernames - mysql

I have a block list of names/words, has about 500,000+ entries. The use of the data is to prevent people from entering these words as their username or name. The table structure is simple: word_id, word, create_date.
When the user clicks submit, I want the system to lookup whether the entered name is an exact match or a word% match.
Is this the only way to implement a block or is there a better way? I don't like the idea of doing lookups of this many rows on a submit as it slows down the submit process.

Consider a few points:
Keep your blacklist (business logic) checking in your application, and perform the comparison in your application. That's where it most belongs, and you'll likely have richer programming languages to implement that logic.
Load your half million records into your application, and store it in a cache of some kind. On each signup, perform your check against the cache. This will avoid hitting your table on each signup. It'll be all in-memory in your application, and will be much more performant.
Ensure myEnteredUserName doesn't have a blacklisted word at the beginning, end, and anywhere in between. Your question specifically had a begins-with check, but ensure that you don't miss out on 123_BadWord999.
Caching bring its own set of new challenges; consider reloading from the database everyday n minutes, or at a certain time or event. This will allow new blacklisted words to be loaded, and old ones to be thrown out.

You can't do where 'loginName' = word%. % can only be used in the literal string, not as part of the column data.
You would need to say where 'logi' = word or 'login' = word or ... where you compare substrings of the login name with the bad words. You'll need to test each substring whose length is between the shortest and longest bad word, inclusive.
Make sure you have an index on the word column of your table, and see what performance is like.
Other ways to do this would be:
Use Lucene, it's good at quickly searching text, espacially if you just need to know whether or not your substring exists. Of course Lucene might not fit technically in your environment -- it's a Java library.
Take a hash of each bad word, and record them in a bitset in memory -- this will be small and fast to look up, and you'll only need to go to the database to make sure that a positive isn't false.

Related

Matching 2 databases of names, given first, last, gender and DOB?

I collect a list of Facebook friends from my users including First, Last, Gender and DOB. I am then attempting to compare that database of names (stored as a table in MySQL) to another database comprised of similar information.
What would be the best way to conceptually link these results, with the second database being the much larger set of records (>500k rows)?
Here was what I was proposing:
Iterate through Facebook names
Search Last + DOB - if they match, assume a "confident" match
Search Last + First - if they match, assume a "probable" match
Search Last + Lichtenstein(First) above a certain level, assume a "possible" match
Are there distributed computing concepts that I am missing that may make this faster than a sequential mySQL approach? What other pitfalls may spring up, noting that it is much more important to not have a false-positive rather than miss a record?
Yes, your idea seems like a better algorithm.
Assuming performance is your concern, you can use caching to store the values that are just being searched. You can also start indexing the results in a NoSQL database such that the results will be very faster, so that you will have better read performance. If you have to use MySQL, read about polyglot persistence.
Assuming simplicity is your concern, you can still use indexing in a NoSQL database, so over the time you don't have to do myriad of joins will spoil the experience of the user and the developer.
There could be much more concerns, but it all depends on where would you like to use it, to use in a website, or to such data analytic purpose.
If you want to operate on the entire set of data (as opposed to some interactive thing), this data set size might be small enough to simply slurp into memory and go from there. Use a List to hang on to the data then create a Map> that for each unique last name points (via integer index) to all the places in the list where it exists. You'll also set yourself up to be able to perform more complex matching logic without getting caught up trying to coerce SQL into doing it. Especially since you are spanning two different physical databases...

MySQL: Best way to do a backwords full text search?

I am trying to do basically a reverse full test search but have no clue of the best way to go about doing it.
Basically I have a table of key phrases laid out like this:
id - phrase
1 - "hello world"
2 - "goodbye world"
3 - "this is my world"
I then have a set string, such as "Welcome to the hello world group". I want to find the ID of all rows in my table that has an exact match for phrase. Meaning "o the" would not match because the word is "to the". Also "ello" would not match because the world is "hello".
Using Full Text Search, this can easily be achieved by doing a search of:
AGAINST ('"hello world"' IN BOOLEAN MODE);
Problem is, I don't believe I can use a full text search, since a full text search would find all rows that contains a single phrase. I want all phrases (from a known set of phrases) that match a single set.
I know how to do this using RegEx using the following, however this is way to slow. On a table with 400,000 key phrases it took over 40 seconds:
WHERE "the data I know I want to search goes here" REGEXP CONCAT('[[:<:]]', phrases, '[[:>:]]')
What I need is a more optimized way to do this. How would I possibly go about doing this as a full text search, even if i have to temporarily add it to a table without actually doing a LOOP individually checking each keyword.
I really appreciate the feedback as this is really causing my site to lag on adding new data.
If you are willing to consider a solution that reads the phrases out of the database and constructs a separate data structure used for optimized phrase detection, there are two main techniques that solve the problem. Which one is best for you depends on a number of factors, in particular:
How frequently the phrase list is updated
Whether and how you tokenise the text before running the phrase detection
How long the target strings are
Option 1: Hash-table of the phrases This means you simply insert each of the phrases as key into a hash table (aka dictionary or hash map in many programming languages). The phrase id becomes the value. Updates are fast and easy, but detecting the phrases in a given string can be hard: Firstly you need to tokenise the string and be sure that phrases only occur between token boundaries. Secondly, you need to make a lookup in the hash not only for every token, but also for every pair, triple, quadruple etc. of consecutive tokens. This still works well if the target strings are generally short. You can also maintain a copy of the hash table on disk, e.g. using the Berkeley DB. There are ready-to-use modules in the standard library of most programming languages for this.
Option 2: Search trie (or, slightly more advanced, a minimised search trie or a finite state machine). This can be implemented in very space-efficient ways but is generally larger than a hash table (although 400k entries will not be a problem at all). The big advantage during phrase detection is that you need not cut out tokens (or candidate phrases between token boundaries) before making look-ups. Instead you perform a longest-match look-up at each candidate start position in the text. Storing on disk is possible, although in most programming languages there won't be a standard-library module for this. Updates are quite easy in a trie, but can get difficult (and potentially time-consuming) in a minimised trie or FST.
Both options allow the data structure to be maintained on disk (or a copy of it to be stored on disk, while the actual look-ups happen memory). But you won't get transaction safety or fault-tolerance (which I understand you are not looking for).
You can use search engine. For example solr. You can set specific search filters against text. + search for words only. + It will be blindingly fast.
Or, second idea you can create your own table that stores all words and id of phrase. and search that table maching words only. It will be faster because you can add index on words better then phrases altogether.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

MySQL is SELECT with LIKE expensive?

The following question is regarding the speed between selecting an exact match (example: INT) vs a "LIKE" match with a varchar.
Is there much difference? The main reason I'm asking this is because I'm trying to decide if it's a good idea to leave IDs out of my current project.
For example Instead of:
http://mysite.com/article/391239/this-is-an-entry
Change to:
http://mysite.com/article/this-is-an-entry
Do you think I'll experience any performance problems on the long run? Should I keep the ID's?
Note:
I would use LIKE to keep it easier for users to remember. For example, if they write "http://mysite.com/article/this-is-an" it would redirect to the correct.
Regarding the number of pages, lets say I'm around 79,230 and the app. is growing fast. Like lets say 1640 entries per day
An INT comparison will be faster than a string (varchar) comparison. A LIKE comparison is even slower as it involves at least one wildcard.
Whether this is significant in your application is hard to tell from what you've told us. Unless it's really intensive, ie. you're doing gazillions of these comparisons, I'd go with clarity for your users.
Another thing to think about: are users always going to type the URL? Or are they simply going to use a search engine? These days I simply search, rather than try and remember a URL. Which would make this a non-issue for me as a user. What are you users like? Can you tell from your application how they access your site?
Firstly I think it doesn't really matter either way, yes it will be slower as a LIKE clause involves more work than a direct comparison, however the speed is negligible on normal sites.
This can be easily tested if you were to measure the time it took to execute your query, there are plenty of examples to help you in this department.
To move away slighty from your question, you have to ask yourself whether you even need to use a LIKE for this query, because 'this-is-an-entry' should be unique, right?
SELECT id, friendly_url, name, content FROM articles WHERE friendly_url = 'this-is-an-article';
A "SELECT * FROM x WHERE = 391239" query is going to be faster than "SELECT * FROM x WHERE = 'some-key'" which in turn is going to be faster than "SELECT * FROM x WHERE LIKE '%some-key%'" (presence of wild-cards isn't going to make a heap of difference.
How much faster? Twice as fast? - quite likely. Ten times as fast? stretching it but possible. The real questions here are 1) does it matter and 2) should you even be using LIKE in the first place.
1) Does it matter
I'd probably say not. If you indeed have 391,239+ unique articles/pages - and assuming you get a comparable level of traffic, then this is probably just one of many scaling problems you are likely to encounter. However, I'd warrant this is not the case, and therefore you shouldn't worry about a million page views until you get to 1 million and one.
2) Should you even be using LIKE
No. If the page/article title/name is part of the URL "slug", it has to be unique. If it's not, then you are shooting yourself in the foot in term of SEO and writing yourself a maintanence nightmare. If the title/name is unique, then you can just use a "WHERE title = 'some-page'", and making sure the title column has a unique index on.
Edit
You plan of using LIKE for the URL's is utterly utterly crazy. What happens if someone visits
yoursite.com/articles/the
Do you return a list of all the pages starting "the" ? What then happens if:
Author A creates
yoursite.com/articles/stackoverflow-is-massive
2 days later Author B creates
yoursite.com/articles/stackoverflow-is-massively-flawed
Not only will A be pretty angry that his article has been hi-jacked, all the perma-links he may have been sent out will be broken, and Google is going never going to give your articles any reasonable page-rank because the content keeps changing and effectively diluting itself.
Sometimes there is a pretty good reason you've never seen your amazing new "idea/feature/invention/time-saver" anywhere else before.
INT is much more faster.
In the string case I think you should not select query with LIKE but just with = because you look for this-is-an-entry, not for this-is-an-entry-and-something.
There are a few things to consider:
The type of search performed on the DataBase will be an "index seek", search for single row using an index, most of the time.
This type of exact match operation on a single row is not significantly faster using ints than strings, they are basically the same cost, for any practical purpose.
What you can do is the following optimization, search the database using a exact match (no wildcards), this is as fast as using an int index. If there is no match do a fuzzy search (search using wildcards) this is more expensive, but on the other hand is more rare and can produce more than one result. A form of ranking results is needed if you want to go for best match.
Pseudocode:
Search for an exact match using a string: Article Like 'entry'
if (match is found) display page
if (match is not found) Search using wildcards
If (one apropriate match is found) display page
If (more relevant matches) display a "Did you tried to find ... page"
If (no matches) display error page
Note: keep in mind that fuzzy URLs are not recommended from a SEO perspective, because people can link your site using multiple URLs which will split your page rank instead of increase it.
If you put an index on the varchar field it should be ok (performance wise), really depends on how many pages you are going to have. Also you have to be more careful and sanitize the string to prevent sql injections, e.g. only allow a-z, 0-9, -, _, etc in your query.
I would still prefer an integer id as it is faster and safer, change the format to something nicer like:
http://mysite.com/article/21-this-is-an-entry.html
As said, comparing INT < VARCHAR, and if the table is indexed on the field you're searching then that'll help too, as the server won't have to create a manual index on the fly.
One thing which will help validate your queries for speed and sense is EXPLAIN. You can use this to show which indexes your query is using, as well as execution times.
To answer your question, if it's possible to build your system using exact matches on the article ID (ie an INT) then it'll be much "lighter" than if you're trying to match the whole url using a LIKE statement. LIKE will obviously work, but I wouldn't want to run a large, high traffic site on it.

high load on mysql DB how to avoid?

I have a table contain the city around the worlds it contain more than 70,000 cities.
and also have auto suggest input in my home page - which used intensively in my home page-, that make a sql query (like search) for each input in the input (after the second letter)..
so i afraid from that heavily load,,...,, so I looking for any solution or technique can help in such situation .
Cache the table, preferably in memory. 70.000 cities is not that much data. If each city takes up 50 bytes, that's only 70000 * 50 / (1024 ^ 2) = 3MByte. And after all, a list of cities doesn't change that fast.
If you are using AJAX calls exclusively, you could cache the data for every combination of the first two letters in JSON. Assuming a Latin-like alphabet, that would be around 680 combinations. Save each of those to a text file in JSON format, and have jQuery access the text files directly.
Create an index on the city 'names' to begin with. This speeds up queries that look like:
SELECT name FROM cities WHERE name LIKE 'ka%'
Also try making your auto complete form a little 'lazy'. The more letters a user enters, lesser the number of records your database has to deal with.
What resources exist for Database performance-tuning?
You should cache as much data as you can on the web server. Data that does not change often like list of Countries, Cities, etc is a good candidate for this. Realistically, how often do you add a country? Even if you change the list, a simple refresh of the cache will handle this.
You should make sure that your queries are tuned properly to make best use of Index and Join techniques.
You may have load on your DB from other queries as well. You may want to look into techniques to improve performance of MySQL databases.
Just get your table to fit in memory, which should be trivial for 70k rows.
Then you can do a scan very easily. Maybe don't even use a sql database for this (as it doesn't change very often), just dump the cities into a text file and scan that. That'd definitely be better if you have many web servers but only one db server as each could keep its own copy of the file.
How many queries per second are you seeing peak? I can't imagine there being that many people typing city names in, even if it is a very busy site.
Also you could cache the individual responses (e.g. in memcached) if you get a good hit rate (e.g. because people tend to type the same things in)
Actually you could also probably precalculate the responses for all one-three letter combinations, that's only 26*26*26 (=17k) entries. As a four or more letter input must logically be a subset of one of those, you could then scan the appropriate one of the 17k entries.
If you have an index on the the city name it should be handled by the database efficiently. This statement is wrong, see comments below
To lower the demands on your server resources you can offer autocompletion only after n more characters. Also allow for some timeout, i.e. don't do a request when a user is still typing.
Once the user stopped typing for a while you can request autocompletion.