Solr indexing structure with MySQL - mysql

I have three to five search fields in my application and am planning to integrate it with Apache Solr. I tried the same with a single table and it is working fine. Here are my questions.
1. Can we index multiple tables in the same core? Or should I create a separate core for each index (I guess this concept is wrong)?
2. Suppose I have 4 tables: users, careers, education and location. I have two search boxes in a PHP page: one searches simple locations (like an autocomplete box) and the other searches for a keyword, which should check the careers and education tables. If multiple indexes are possible under a single core:
2.1 How do we define the query here?
2.2 Can we specify the index name in the query (like a table name in MySQL)?
Links which can answer my concerns are enough.

If you're expecting to query the same data as part of the same request, such as auto-completing users, educations and locations at the same time, indexing them to the same core is probably what you want.
The term "core" is probably identical to the term "index" in your usage. Having multiple sets of data in the same index is usually achieved with a field that indicates the type of document, and then applying a filter query if you want documents of only one type, such as fq=type:location. You can also use the grouping feature of Solr to get a separate result set back for each type of document.
If you're only ever going to query the data separately, having them in separate indexes is probably the way to go, as you'll be able to scale, analyze and tune each index independently in that case (and avoid always needing a filter query to get the type of content you're looking for).
Specifying the index name is the same as specifying the core, and is part of the URL to Solr: http://localhost:8983/solr/index1/ or http://localhost:8983/solr/index2/.
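As a rough illustration of the points above, here is a minimal Python sketch that builds query URLs against Solr's standard /select handler. The core name index1 comes from the URL above; the field names (type, name, text) and the type values (location, career, education) are assumptions about how the documents might be indexed, not something Solr defines for you.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # host and port as in the answer above


def build_select_url(core, query, doc_types=None, rows=10):
    """Build a /select URL; doc_types becomes a filter query on the
    (assumed) `type` field that marks which table a document came from."""
    params = [("q", query), ("rows", rows)]
    if doc_types:
        params.append(("fq", "type:(" + " OR ".join(doc_types) + ")"))
    # The core (index) is chosen by the URL path, not by the query itself.
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


# Autocomplete box: only location documents.
location_url = build_select_url("index1", "name:lond*", ["location"])

# Keyword box: careers and education documents in the same core.
keyword_url = build_select_url("index1", "text:engineer", ["career", "education"])

print(location_url)
print(keyword_url)
```

If the data lived in separate cores instead, the fq parameter would disappear and only the core segment of the path would change.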

Related

MariaDB dynamic columns for activity stream?

I have the following problem:
We have a lot of different, yet similar types of data items that we want to record in a (MariaDB) database. All data items have some common parameters such as id, username, status, file glob, type, comments, start & end time stamps. In addition there are many (let's say between 40 and 100) parameters that are specific to each type of data item.
We would prefer to have the different data item types in the same table because they will be displayed along with several other data, as they happen, in one single list in the web application. This will appear like an activity stream or "Facebook wall".
It seems that the normalised approach with a top-level generic table joined with specific tables underneath will lead to bad performance. We will have to do both a lot of joins and unions in order to display the activity stream, and the application will frequently poll with this query, so it's important that the query runs fast.
So, which is the better solution in terms of performance and storage optimization?
to utilize MariaDB's dynamic columns
to just add in all the different kinds of columns we need in one table, and just accept that each data item type will only use a few of the columns, i.e. the rest will be null.
something else?
Does it matter if we use regular columns when a lot of the data in them will be null?
When should we use dynamic columns and when is it better to use regular columns?
I believe you should have separate columns for the values you are filtering by. However, you might have some unfiltered values. For those it might be a good idea to store them in a single column as a json object (simple to encode/decode).
A few columns: the main ones used in WHERE and ORDER BY clauses (but not necessarily all the columns you might filter on).
A JSON column or MariaDB Dynamic columns.
See my blog on why not to use EAV schema. I focus on how to do it in JSON, but MariaDB's Dynamic Columns is arguably better.
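A minimal sketch of the "few real columns plus one JSON column" layout might look like the following. It uses Python's built-in sqlite3 module purely as a stand-in for MariaDB so that it runs anywhere; the table name activity, the extra column and the sample parameters are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MariaDB connection
conn.execute("""
    CREATE TABLE activity (
        id         INTEGER PRIMARY KEY,
        username   TEXT NOT NULL,
        type       TEXT NOT NULL,
        status     TEXT NOT NULL,
        start_time TEXT NOT NULL,
        extra      TEXT NOT NULL  -- JSON-encoded type-specific parameters
    )
""")
# Index only the columns the activity stream filters and sorts on.
conn.execute("CREATE INDEX idx_activity_stream ON activity (start_time, type)")


def add_item(username, item_type, status, start_time, **extra):
    conn.execute(
        "INSERT INTO activity (username, type, status, start_time, extra) "
        "VALUES (?, ?, ?, ?, ?)",
        (username, item_type, status, start_time, json.dumps(extra)),
    )


def latest_items(limit=20, item_type=None):
    """Single-table query for the stream; the JSON is decoded in the application."""
    sql = "SELECT id, username, type, status, start_time, extra FROM activity"
    args = []
    if item_type is not None:
        sql += " WHERE type = ?"
        args.append(item_type)
    sql += " ORDER BY start_time DESC LIMIT ?"
    args.append(limit)
    return [
        {"id": r[0], "username": r[1], "type": r[2], "status": r[3],
         "start_time": r[4], **json.loads(r[5])}
        for r in conn.execute(sql, args)
    ]


add_item("alice", "render", "done", "2024-01-01T10:00:00", frames=240, codec="h264")
add_item("bob", "backup", "running", "2024-01-01T11:00:00", target="nas01", size_gb=12)
add_item("alice", "render", "failed", "2024-01-01T12:00:00", frames=10, error="disk full")

stream = latest_items(limit=2)
print(stream)
```

The stream query touches one table with no joins or unions; only the values that are never filtered on live inside the JSON text.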

Building search engine for large database

I'm building a fairly large database where I will have a lot of tables with various data.
But each table has similar fields, for example video title or track title.
Now the problem I'm facing is how to build a query which would look for a keyword match across five or more tables. Keep in mind that each table can potentially have from 100k to 1 million rows, or in some cases even a couple of million rows.
I think using joins or separate queries for each table would be very slow, so what I thought of is to make one separate table where I would store search data.
For example I think it could have fields like these,
id ---- username ---- title ---- body ---- date ---- belongs_to ---- post_id
This way I think it would perform a lot faster searches, or am I totally wrong?
The only problem with this approach that I can think of is that it would be hard to manage this table: if the original record is deleted from one of the tables, I would also need to delete the matching record from the 'search' table.
Don't use MySQL for searching across lots of joined tables. I would suggest you take a look at Apache Solr, used alongside your RDBMS.
Take a look at some information retrieval systems. They also require their own indices, so you need to index the data after each update (or in regular intervals) to keep the search index up to date. But they offer the following advantages:
much faster, because they use special algorithms and data structures designed specifically for that purpose
ability to search for documents based on a set of terms (and maybe also a set of negative terms that must not appear in the result)
search for phrases (i.e. terms that appear after each other in a specific order)
automatic stemming (i.e. stripping the endings of words like "s", "ed", "ing" ...)
detection of spelling mistakes (i.e. "Did you mean ...?")
stopwords to avoid indexing really common meaningless words ("a", "the", etc.)
wildcard queries
advanced ranking strategies (i.e. rank by relevance, based on the number and the position of the occurrences of the search terms)
I have used Xapian in the past for my projects and I was quite happy with it. Lucene, Solr and Elasticsearch are some other really popular projects that might fit your needs.
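To give a feel for the kind of data structure these systems are built on, here is a toy inverted index in Python. It is only a sketch: the stopword list and the suffix-stripping "stemmer" are deliberately crude assumptions, and real engines such as Xapian or Lucene do far more (ranking, phrase search, spelling suggestions). Document IDs are (table, row id) pairs so that one index can cover several tables.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and", "in"}  # tiny illustrative list


def stem(word):
    """Very crude suffix stripping; real engines use proper stemmers."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


class InvertedIndex:
    def __init__(self):
        # term -> set of document IDs containing that term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query, exclude=""):
        """Documents containing all query terms and none of the excluded ones."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set.intersection(*(self.postings.get(t, set()) for t in terms))
        for term in tokenize(exclude):
            result -= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add(("videos", 1), "Learning the guitar")
index.add(("tracks", 7), "Guitar solos played live")
index.add(("videos", 2), "Piano lessons")

print(index.search("guitars"))
print(index.search("guitar", exclude="live"))
```

A lookup is a handful of set operations on precomputed postings rather than a scan over every row of every table, which is where most of the speed advantage comes from.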

Which of these 2 MySQL DB Schema approaches would be most efficient for retrieval and sorting?

I'm confused as to which of the two db schema approaches I should adopt for the following situation.
I need to store multiple attributes for a website, e.g. page size, word count, category, etc., where the number of attributes may increase in the future. The purpose is to display this table to the user, who should be able to quickly filter/sort the data (so the table structure should support fast querying & sorting). I also want to keep a log of previous data to maintain a timeline of changes. So the two table structure options I've thought of are:
Option A
website_attributes
id, website_id, page_size, word_count, category_id, title_id, ...... (going up to 18 columns and have to keep in mind that there might be a few null values and may also need to add more columns in the future)
website_attributes_change_log
same table structure as above with an added column for "change_update_time"
I feel the advantage of this schema is the queries will be easy to write even when some attributes are linked to other tables and also sorting will be simple. The disadvantage I guess will be adding columns later can be problematic with ALTER TABLE taking very long to run on large data tables + there could be many rows with many null columns.
Option B
website_attribute_fields
attribute_id, attribute_name (e.g. page_size), attribute_value_type (e.g. int)
website_attributes
id, website_id, attribute_id, attribute_value, last_update_time
The advantage out here seems to be the flexibility of this approach, in that I can add columns whenever and also I save on storage space. However, as much as I'd like to adopt this approach, I feel that writing queries will be especially complex when needing to display the tables [since I will need to display records for multiple sites at a time and there will also be cross referencing of values with other tables for certain attributes] + sorting the data might be difficult [given that this is not a column based approach].
A sample output of what I'd be looking at would be:
Site-A.com, 232032 bytes, 232 words, PR 4, Real Estate [linked to category table], ..
Site-B.com, ..., ..., ... ,...
And the user needs to be able to sort by all the number based columns, in which case approach B might be difficult.
So I want to know if I'd be doing the right thing by going with Option A or whether there are other better options that I might have not even considered in the first place.
I would recommend using Option A.
You can mitigate the pain of long-running ALTER TABLE by using pt-online-schema-change.
The upcoming MySQL 5.6 supports non-blocking ALTER TABLE operations.
Option B is called Entity-Attribute-Value, or EAV. This breaks rules of relational database design, so it's bound to be awkward to write SQL queries against data in this format. You'll probably regret using it.
I have posted several times on Stack Overflow describing pitfalls of EAV.
Also in my blog: EAV FAIL.
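To make the awkwardness concrete, the sketch below runs the same "sort sites by word count" request against both layouts. It uses Python's sqlite3 module as a stand-in for MySQL, trims the tables to two attributes, and the _wide / _eav table name suffixes and sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.executescript("""
    -- Option A: one column per attribute
    CREATE TABLE website_attributes_wide (
        website_id INTEGER PRIMARY KEY,
        page_size  INTEGER,
        word_count INTEGER
    );
    INSERT INTO website_attributes_wide VALUES
        (1, 232032, 232), (2, 1000, 900), (3, 50000, 40);

    -- Option B: Entity-Attribute-Value
    CREATE TABLE website_attribute_fields (
        attribute_id   INTEGER PRIMARY KEY,
        attribute_name TEXT
    );
    CREATE TABLE website_attributes_eav (
        website_id      INTEGER,
        attribute_id    INTEGER,
        attribute_value TEXT  -- every value is stored as a string
    );
    INSERT INTO website_attribute_fields VALUES (1, 'page_size'), (2, 'word_count');
    INSERT INTO website_attributes_eav VALUES
        (1, 1, '232032'), (1, 2, '232'),
        (2, 1, '1000'),   (2, 2, '900'),
        (3, 1, '50000'),  (3, 2, '40');
""")

# Option A: sorting is a plain ORDER BY on a typed column.
wide_sql = """
    SELECT website_id, page_size, word_count
    FROM website_attributes_wide
    ORDER BY word_count DESC
"""

# Option B: each attribute must be pivoted back into a column and cast
# to a number before it can be sorted correctly.
eav_sql = """
    SELECT v.website_id,
           MAX(CASE WHEN f.attribute_name = 'page_size'
                    THEN CAST(v.attribute_value AS INTEGER) END) AS page_size,
           MAX(CASE WHEN f.attribute_name = 'word_count'
                    THEN CAST(v.attribute_value AS INTEGER) END) AS word_count
    FROM website_attributes_eav v
    JOIN website_attribute_fields f ON f.attribute_id = v.attribute_id
    GROUP BY v.website_id
    ORDER BY word_count DESC
"""

wide_rows = conn.execute(wide_sql).fetchall()
eav_rows = conn.execute(eav_sql).fetchall()
print(wide_rows)
print(eav_rows)
```

Both queries return the same rows, but the EAV version grows by one CASE expression per attribute, and without the cast the values would sort as strings.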
Option A is the better way. Although ALTER TABLE may take a long time when adding an extra column, querying and sorting are quicker. I have used a design like Option A before, and ALTER TABLE did not take too long even with millions of records in the table.
You should go with Option B because it is more flexible and uses less RAM. With Option A you have to fetch a lot of content into RAM, which increases the chance of page faults. If you want to improve the query time of the database, you should definitely index it to get fast results.
I think Option A is not a good design. When you design a good data model, you should not have to change the tables in the future. If you have a good command of SQL, writing queries for Option B will not be difficult. It is also the solution to your real problem: "you need to store some attributes (an open-ended number, not a final set) of some web pages; therefore, there should be an entity that represents those attributes".
Use Option A, as the attributes are fixed. It will be difficult to query and process data from the second model, as there will be queries based on multiple attributes.

mysql table design, large column size vs large number of rows

Please help me understand which of the following is better for scaling and performance.
Table: test
columns: id <int, primary key>, doc <int>, keyword <string>
The data I want to store is a pointer to the documents containing a particular keyword.
Design 1:
have unique constraint on the keyword column and store the list of documents as an array
e.g id: 1, doc: [4,5,6], keyword: google
Design 2:
insert a row for each document
1 4 google
2 5 google
3 6 google
Let's say the average number of documents a particular keyword is found in is close to 100000. There may be no maximum number of documents a keyword appears in.
You can forget about option 1 because there's no array data type in mysql.
To be honest, if you want a scalable solution for this type of data I think you should look into a different type of database. Research more on NoSQL and 'key-value pair store database'.
With mysql, the best I can think of is your 2nd option, with the exception that you should create another table with a numeric ID and a list of unique keywords. That way, when you do your search you'll first look up the ID, then filter the big table by the ID instead of string. Numeric comparison is faster than string comparison.
A lot of factors come into scaling and performance so it's not usually a good idea to try to optimise unknowns early in development.
For database design I find it's usually best to go with the more correct normalised approach (your design 2) and then worry about the scaling and performance if it becomes an issue. You can then de-normalise certain areas or take other approaches depending on what issues you face.
Your design option 1 is likely to hit other issues more immediately: the doc column cannot be joined with another table, and updating and searching it is more complex.
Design 1 is potentially limited by MySQL's row size limit.
Design 2 makes the most sense to me. What if you need to remove one of those values? You just delete a row rather than having to search through and update an array. It's also nice because it allows you to limit the size of your results if necessary (e.g., for pagination).
You might also consider creating a many-to-many relationship between this table and a keywords table instead of storing keywords as a field here.
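A small sketch of Design 2 with a separate keywords table could look like this. It uses Python's sqlite3 module in place of MySQL so that it is runnable as-is; the keyword_docs table name is hypothetical, and INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.executescript("""
    CREATE TABLE keywords (
        id      INTEGER PRIMARY KEY,
        keyword TEXT NOT NULL UNIQUE
    );
    -- One row per (keyword, document) pair, keyed by the numeric keyword ID.
    CREATE TABLE keyword_docs (
        keyword_id INTEGER NOT NULL REFERENCES keywords(id),
        doc        INTEGER NOT NULL,
        PRIMARY KEY (keyword_id, doc)
    );
""")


def add_occurrence(keyword, doc):
    conn.execute("INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (keyword,))
    (keyword_id,) = conn.execute(
        "SELECT id FROM keywords WHERE keyword = ?", (keyword,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO keyword_docs (keyword_id, doc) VALUES (?, ?)",
        (keyword_id, doc),
    )


def docs_for(keyword, limit=2, offset=0):
    """Look up the numeric ID once, then filter the big table by that ID."""
    row = conn.execute(
        "SELECT id FROM keywords WHERE keyword = ?", (keyword,)
    ).fetchone()
    if row is None:
        return []
    cursor = conn.execute(
        "SELECT doc FROM keyword_docs WHERE keyword_id = ? "
        "ORDER BY doc LIMIT ? OFFSET ?",
        (row[0], limit, offset),
    )
    return [doc for (doc,) in cursor]


for doc in (4, 5, 6):
    add_occurrence("google", doc)

first_page = docs_for("google", limit=2)
second_page = docs_for("google", limit=2, offset=2)

# Removing one document is a single-row delete, not an array rewrite.
conn.execute(
    "DELETE FROM keyword_docs "
    "WHERE keyword_id = (SELECT id FROM keywords WHERE keyword = ?) AND doc = ?",
    ("google", 5),
)
after_delete = docs_for("google", limit=10)
print(first_page, second_page, after_delete)
```

The composite primary key doubles as the index for the keyword lookup, so no extra index is needed for this access pattern.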

How to structure mysql database for use with sphinx?

I am trying to make a database of products that can be searched by many facets (like Newegg or Amazon). At first I was going to try to do the whole thing with MySQL, but further research has led me to believe that is a bad idea, so instead I am thinking about using Sphinx.
My question is how would I set up the mysql tables for this? Would I just have one table for the products and another one with all the facets that would just have a couple large varchar fields and foreign key to the product?
I am not a huge Sphinx expert, but I'd say that you don't have to stick all your data in one table. Sphinx can handle associations just fine. If you are planning to use Rails for your front-end then take a look at thinking_sphinx gem. It definitely allows you to specify attributes based on data spread out into many tables. In my experience I didn't have to change my data structure to accommodate Sphinx.
I'll pipe in.
You don't really need to actually. Facets in Sphinx are just ID's (at least in 0.9.9 the current stable release). I am going to assume that you have a standard product table with your different facets stored as foreign keys to other tables.
So assuming you have this you can just select over the main product table and set up the facets in sphinx as per the documentation.
I would really need to see your table structure to comment further. It sounds like you have your products spread over multiple tables. In this case as you mentioned I would go with a single table which you index on which is populated with the contents of all the others.
The great thing about Sphinx is that you can use a MySQL query to get your data into Sphinx. This allows you to structure your database in a way that's optimized for your business logic, without having to worry about how search will perform. As long as you're creative with the query you write for sql_query, you can normalize your database however you'd like, and still be able to grab all the text to be indexed with a single query. For example, if you need to get strings from a many-to-one relationship into your index, you can do so using a subquery.
sql_query = SELECT p.*, (SELECT GROUP_CONCAT(pa.text SEPARATOR ' ') FROM products_attr pa WHERE pa.product_id = p.id) AS attr_text \
FROM products p
Additionally, if you have drop-downs where you search on attribute IDs, you can use Sphinx's multi-value attributes (MVA). This way, you can search by attribute ID as well as by the text of the attribute.
sql_attr_multi = uint attributes from query; \
SELECT product_id AS id, id AS attribute FROM product_attributes ;
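To see what a flattening query of this kind hands to the indexer, the sketch below reproduces the idea with Python's sqlite3 module standing in for MySQL. It assumes the related strings are collapsed into one value per product with an aggregate (SQLite's group_concat here; MySQL's equivalent is GROUP_CONCAT), since a scalar subquery must return a single row. The sample products and attribute texts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL source database
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products_attr (product_id INTEGER, text TEXT);
    INSERT INTO products VALUES (1, 'Laptop'), (2, 'Mouse');
    INSERT INTO products_attr VALUES (1, '15 inch'), (1, 'SSD'), (2, 'wireless');
""")

# One row per product, with all related attribute strings joined into one field.
rows = conn.execute("""
    SELECT p.id,
           p.name,
           (SELECT group_concat(pa.text, ' ')
              FROM products_attr pa
             WHERE pa.product_id = p.id) AS attr_text
    FROM products p
    ORDER BY p.id
""").fetchall()

# What a full-text indexer would see for each document ID.
documents = {pid: f"{name} {attr_text or ''}".strip() for pid, name, attr_text in rows}
print(documents)
```

The database itself stays normalized; only the query that feeds the search index is denormalized.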