I am currently planning to create a big database (2+ million rows) with a variety of data from separate sources. I would like to avoid structuring the database around auto_increment ids, both to help prevent sync issues with replication and because each item inserted will have an alphanumeric product code that is guaranteed to be unique - it seems to make more sense to me to use that instead.
I am looking at a search engine to index this database, and Sphinx looks rather appealing due to its design around indexing relational databases. However, various tutorials and documentation seem to show database designs depending on an auto_increment field in one form or another, and the documentation makes the rather bold statement that document ids must be 32/64-bit integers only, or things break.
Is there a way to have a database indexed by Sphinx without auto_increment fields as the id?
Sure - that's easy to work around. If you need to make up your own IDs just for Sphinx and you don't want them to collide, you can do something like this in your sphinx.conf (example code for MySQL):
source products {
    # Use a MySQL user variable to store a throwaway ID value
    sql_query_pre = SELECT @id := 0
    # Keep incrementing the throwaway ID for each row.
    # "code" is present twice because Sphinx does not full-text index attributes
    sql_query = SELECT @id := @id + 1, code AS code_attr, code, description FROM products
    # Return the code so that your app will know which records were matched
    # (this will only work in Sphinx 0.9.10 and higher!)
    sql_attr_string = code_attr
}
The only problem is that you still need a way to know which records were matched by your search. Sphinx will return the id (which is now meaningless) plus any columns that you mark as "attributes".
Sphinx 0.9.10 and above can return your product code to you as part of the search results, because it supports string attributes.
0.9.10 is not an official release yet, but it is looking great. It looks like Zawodny is running it over at Craigslist, so I wouldn't be too nervous about relying on this feature.
Sphinx only requires ids to be unique integers; it doesn't care whether they are auto-incremented or not, so you can roll out your own logic. For example, generate integer hashes for your string keys, as in the sketch below.
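A minimal sketch of that idea as a Sphinx source query (assuming a products table with a unique code column; CRC32() is a built-in MySQL function, but it only produces 32-bit values, so at millions of rows collisions become likely - verify before relying on it):
# Hypothetical sketch: derive the document ID from the string key
sql_query = SELECT CRC32(code) AS id, code AS code_attr, code, description FROM products
# Collision check to run in MySQL first:
# SELECT CRC32(code) AS h, COUNT(*) AS c FROM products GROUP BY h HAVING c > 1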
Sphinx doesn't depend on auto_increment; it just needs unique integer document ids. Maybe you can have a surrogate unique integer id in the tables to work with Sphinx. It is well known that integer searches are way faster than alphanumeric searches. By the way, how long is your alphanumeric product code? Any samples?
I think it's possible to generate an XML stream from your data and then create the IDs via software (Ruby, Java, PHP).
Take a look at
http://github.com/burke/mongosphinx
I am creating an application which uses both MySQL and Neo4j. I think that listing the many node properties in a table will make it faster to read all those properties after querying for a specific set of nodes (or even before), but I am open to being proven wrong. After all, finding the properties of a row is what relational DBs are for.
To ensure consistency, I have created a property on each node which is the auto_increment ID in my SQL table.
I wish Neo4j would allow indexing a property regardless of labels, but that's not the case, and I struggle to understand why this is not possible at all.
The question is: do you think that performance in Neo4j would be much better if the index is on a number versus a string? I am considering whether to drop the numeric id and just stick with node.name.
You can configure indexes on properties without referring to particular labels. You do this by enabling node_auto_indexing in conf/neo4j.properties.
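For example, a sketch of those legacy auto-index settings (the auto_id and name property keys here are taken from your description):
# conf/neo4j.properties
node_auto_indexing=true
# comma-separated list of node properties to index automatically
node_keys_indexable=auto_id,name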
If you're looking to compare simple equality, I'd guess that indexing on numbers might be slightly faster, but I doubt the difference is big enough to be very meaningful, unless the string alternatives are very large.
Another option would be to put an AutoInc label on every node and index your auto_id property on that label.
Assuming that auto_id is the property you added to all nodes to reference the MySQL auto_increment ID column, then:
CREATE INDEX ON :AutoInc(auto_id)
MATCH (n)
SET n:AutoInc
I am trying to make a database of products that can be searched by many facets (like Newegg or Amazon). At first I was going to try to do the whole thing with MySQL, but further research has led me to believe that is a bad idea, so instead I am thinking about using Sphinx.
My question is: how would I set up the MySQL tables for this? Would I just have one table for the products and another one for all the facets, with a couple of large varchar fields and a foreign key to the product?
I am not a huge Sphinx expert, but I'd say that you don't have to stick all your data in one table. Sphinx can handle associations just fine. If you are planning to use Rails for your front-end, then take a look at the thinking_sphinx gem. It definitely allows you to specify attributes based on data spread out over many tables. In my experience I didn't have to change my data structure to accommodate Sphinx.
I'll pipe in.
You don't really need to, actually. Facets in Sphinx are just IDs (at least in 0.9.9, the current stable release). I am going to assume that you have a standard product table with your different facets stored as foreign keys to other tables.
So, assuming you have this, you can just select over the main product table and set up the facets in Sphinx as per the documentation.
I would really need to see your table structure to comment further. It sounds like you have your products spread over multiple tables; in that case, as you mentioned, I would go with a single table to index on, populated with the contents of all the others.
The great thing about Sphinx is that you can use a MySQL query to get your data into Sphinx. This allows you to structure your database in a way that's optimized for your business logic, without having to worry about how search will perform. As long as you're creative with the query you write for sql_query, you can normalize your database however you'd like, and still be able to grab all the text to be indexed with a single query. For example, if you need to get strings from a many-to-one relationship into your index, you can do so using a subquery.
sql_query = SELECT p.*, (SELECT pa.text FROM products_attr pa WHERE pa.product_id = p.id) AS attr_text \
    FROM products p
Additionally, if you have drop-downs where you search on attribute IDs, you can use Sphinx's multi-value attributes. This way, you can search by attribute ID as well as by the text of the attribute.
sql_attr_multi = uint attributes from query; \
SELECT product_id AS id, id AS attribute FROM product_attributes ;
I am currently building a website with multiple pages and in order to beautify the site's URLs I am using addresses like http://mydomain.com/category/item-name
I am using MySQL tables so in order to fetch the current item from my MySQL I have two options:
1) Add the item's ID to the URL: http://mydomain.com/category/28745/item-name (where 28745 is the ID in the table). That way I can run the query SELECT * FROM products WHERE id=28745. An easy approach, but the problem is that the URL is a bit uglier.
2) Fetch the item using a text search. In that case I will index item_name as FULLTEXT (using MyISAM), so the query will be SELECT * FROM products WHERE item_name='some-text'.
I am trying to find out if there are any downsides to the second approach. Does using FULLTEXT instead of an index on an INT field cost performance? Does it really matter to search engines if the URL contains the ID and is a bit uglier?
Thanks,
Meir
You don't need a FULLTEXT index; that's the first thing.
A FULLTEXT index is an index used for searching through bodies of text. What you're doing is exact matching; you're not searching within the entries.
That said, what's the downside of having an index on a textual column rather than an integer one?
The first thing is size. Integers require less storage space, and so do their indexes. Storing an integer takes 4 bytes (a range of 2^32 values), while a single ASCII character takes 1 byte. So a word longer than 4 letters takes up more space than a number as large as about 4.3 billion.
The second thing is that you're forced to use MyISAM if you want fulltext indexes.
There are advantages and disadvantages of MyISAM over InnoDB and that's a topic well-covered here at SO.
In short - unless you have 100k+ categories and growing, and unless you need advanced search options for your categories - don't use a fulltext index; use a regular one (see the sketch below).
Table engine is up to you to decide.
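A minimal sketch of the regular-index approach (assuming a products table with an item_name column, and that item names are unique as your URL scheme implies):
ALTER TABLE products ADD UNIQUE INDEX idx_item_name (item_name);
-- The lookup then uses the index for a fast exact match:
SELECT * FROM products WHERE item_name = 'some-item-name';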
For a small amount of data it will all work without any issue.
String searching does impact performance, but having friendly names also matters to search engines and is more descriptive for the user when shared. Use an index on your item_name field in the database to speed the searching up a little.
I recommend putting the page ID in a separate field.
Forget about using a fulltext index.
Make your table like this:
CREATE TABLE tableurl (
    pageid INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(1000),
    pagetext TEXT
);
Now you can just retrieve the page text by doing:
$pageid = mysql_real_escape_string(.....);
....
SELECT pagetext FROM tableurl WHERE pageid = '$pageid'
This will make your searches much faster, speed up your inserts, and keep your DB design clean, as well as prevent retrieving duplicate results.
Maybe using a date in your addresses instead of an ID is a cleaner approach?
Edit:
If this is just about products, I think displaying them as text as in the second approach is better, because you probably have unique product names within a category? And if this is not the case, you can perhaps append the ID to the address:
http://mydomain.com/category/normal-item
http://mydomain.com/category/item-that-appears-multiple-times/1
http://mydomain.com/category/item-that-appears-multiple-times/2
http://mydomain.com/category/item-that-appears-multiple-times/3
Which is the best primary key for storing website addresses and page URLs?
To avoid using an autoincremental id (which is not really tied to the data), I designed the schema to use a SHA1 signature of the URL as the primary key.
This approach is useful in many ways: for example, I don't need to read the last_id from the database, so I can prepare all table updates by calculating the key, and then do the real update in a single transaction. No constraint violations.
Anyway, I read two books which tell me I am wrong. In "High Performance MySQL" it is said that a random key is not good for the DB optimizer. Moreover, in each of Joe Celko's books he says the primary key should be some part of the data.
The question is: the natural key for a URL is... the URL itself. The trouble is that while for some sites it is short (www.something.com), there is no imposed limit on the length of a URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
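A sketch of that starting point (table and column names are illustrative):
CREATE TABLE urls (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hash CHAR(32) NOT NULL,
    url  TEXT NOT NULL,
    UNIQUE KEY idx_hash (hash)
);
-- Atomic insert-if-absent through the unique hash column:
INSERT IGNORE INTO urls (hash, url)
    VALUES (MD5('http://www.example.com/page'), 'http://www.example.com/page');
-- Fast lookup of a URL by its full text:
SELECT id, url FROM urls WHERE hash = MD5('http://www.example.com/page');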
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes troubleshooting fairly obscure. I once had to use indexes on hashes to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one: index the URL text directly. It will actually work well in MySQL, and it has advantages such as simplicity and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc., you're looking at index entries which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed, as sketched below. This may lead to a more efficiently accessed index.
(The Oracle table server product happens to have a built-in way of doing this, with a so-called reverse key index.)
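A sketch of the reversed-storage idea in MySQL (assuming a urls table with a url column; the other names are illustrative, REVERSE() is a built-in string function, and the prefix index covers the end of the URL, which is where the variation lives):
ALTER TABLE urls ADD COLUMN url_rev VARCHAR(1000);
UPDATE urls SET url_rev = REVERSE(url);
CREATE INDEX idx_url_rev ON urls (url_rev(255));
-- Exact lookups then go through the reversed column:
SELECT * FROM urls WHERE url_rev = REVERSE('http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls');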
If I were you, I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaning.
It depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify a URL in all places in your app, then use the autoincrement.
Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How does the hash table make that search fast?
Isn't it the same as using SELECT * FROM .. in MySQL? I have read a lot, and people say SELECT * searches from beginning to end, but a hash table does not. Why, and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how to insert into and retrieve from a hash table in MySQL query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * FROM table WHERE value LIKE 'S%'
Isn't that the same, and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key: the thing we need in order to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
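As a small illustration of a hash index in MySQL (the MEMORY engine supports explicit HASH indexes; the table and data here just mirror the example above):
CREATE TABLE lookup (
    id   CHAR(3) NOT NULL,
    name VARCHAR(50),
    PRIMARY KEY (id) USING HASH
) ENGINE = MEMORY;
INSERT INTO lookup VALUES ('001', 'Alex'), ('002', 'Micheal'), ('003', 'Daniel');
-- The equality lookup is resolved through the hash index rather than a scan:
SELECT * FROM lookup WHERE id = '001';
-- EXPLAIN confirms the index is used:
EXPLAIN SELECT * FROM lookup WHERE id = '001';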
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select, and you can do an exact query. Since you know that the row will always have the same id - because of the input you build the hash value from - you can always recreate this id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this, you should have a deterministic strategy to apply each time you get two values with the same id. You should check Wikipedia for more info on this; the strategy is called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hash tables somewhere, because of the O(1) lookup cost that norheim.se mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of MySQL. If there is a structure in there called a "hash table", it is probably a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. That means approximately one comparison, with the disclaimer of the special handling just mentioned.