Basic Solr query & data structure - MySQL

I need to use Solr for a very quick demo. I have a MySQL database containing 37k records of online products (like Gmail, Google Analytics) with information such as name, description, and keywords.
I managed to store the data in a structure like this:
{
  "keywords": "[\"music-streaming,streaming,internet-radio,audio-scrobbling\"]",
  "description": "Last.fm is a music community website that offers personalized internet radio, using a recommendation system called \"Audioscrobbler\" to build a detailed profile of users based on their music tastes and interests. The service...",
  "operatingSystem": "[\"Mac,Windows,Linux,Web/Cloud,Android,iPhone,WindowsPhone,KindleFire\"]",
  "meta": "[\"Freemium\", \"Mac\", \"Windows\", \"Linux\", \"Web/Cloud\", \"Android\", \"iPhone\", \"...\", \"WindowsPhone\", \"KindleFire\"]",
  "name": "Last.fm",
  "id": 39145,
  "category": "audio-and-music"
}
The meta and operatingSystem fields are JSON arrays (stored as strings), while the remaining fields are text fields.
I need help with three things:
1. Is this data structure (schema) good in terms of structure, searching, and indexing?
2. How do I build a query that shows relevant products based on keywords?
3. How can I turn the meta and operatingSystem fields into filters rather than search keywords?
My final goal is to have a search bar where a user can type a specific keyword and then filter the results by operating system and meta.

The fields with multiple values should be indexed as separate terms in a multiValued field, so that you can query or filter for a specific value: index the field as 'Mac', 'Windows', 'Linux', 'Web/Cloud', etc., and not as a single value with everything embedded.
Depending on your exact requirements, similar / relevant documents can be found using the MoreLikeThis component.
When the fields are properly multivalued (as they should be), you can generate a facet on each field for filtering (and then use fq to filter the result set accordingly).
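As a minimal sketch, assuming the classic schema.xml and the field names from the question, the multivalued fields could be declared like this:

<field name="operatingSystem" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="meta" type="string" indexed="true" stored="true" multiValued="true"/>

A keyword search with facets and an operating-system filter would then look something like this (handler paths and values here are illustrative):

/select?q=keywords:music&facet=true&facet.field=operatingSystem&facet.field=meta&fq=operatingSystem:"Mac"

and related products for a given document could be fetched through the MoreLikeThis handler, e.g. /mlt?q=id:39145&mlt.fl=keywords,description&mlt.mintf=1&mlt.mindf=1.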

Related

How to use key-value pairs in a relational database (MySQL)?

I wanted to use a relational database (MySQL) to store my data as key-value pairs, and the number of key-value pairs I get is dynamic.
I can create a simple table that stores them in separate columns. Values can be of type int, varchar, text, or date.
The problem I am facing is this: sometimes I need to run a query on a key whose value is an integer and use greater-than or less-than comparisons on it, and likewise I need to use BETWEEN queries on date fields.
How can I achieve this?
Edit:
For greater clarity, I am providing the background for this question which I have divided into three parts:
1. Data, 2. Use Case, 3. Possible Designs
1. Data
Suppose I'm creating a data store for a country's census (just an example). The fields for storing data differ for males, females, boys, and girls, and also vary according to the person's profession. The number of fields depends on the requirements and can grow to 500 or more.
2. Use Case
Show a paginated list of persons whose monthly income is between $7000 and $10000. The user can click on any page number and the database should fetch the data for that page directly. For example, if we show 10 results per page and the user clicks page 5, we should show persons 41 to 50.
Some of the values in a particular group store descriptions, which can be large, so they should be stored as TEXT.
3. Possible Designs
I can create a separate table for each different type and store the data in its respective fields. The problem I see with this approach is that a MySQL table has a maximum row size limit of 65,535 bytes; storing all data horizontally might cross that limit, since the number of fields is not fixed and can change as requirements change.
Instead of storing data horizontally, I can store it vertically using the Entity-Attribute-Value design (key-value pairs). For now, the increase in the number of rows due to this design is not a problem, and it lets me store the data of all males, females, or children in the same table. But the problems with this approach are:
I lose the datatype of certain important fields; I cannot query for, say, the list of persons whose income is more than 1000.
To store all fields in a single value column, I would need to make it varchar, but some fields hold large data that requires TEXT.
Considering the above, I thought that instead of creating a single value field, I would create multiple typed value fields: value_int, value_varchar, value_date, and value_text.
(Figure: DB structure)
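A minimal sketch of that typed-value EAV table (table, column, and index names here are illustrative assumptions, not from the original post):

CREATE TABLE attribute_value (
  entity_id     INT NOT NULL,
  attr_name     VARCHAR(64) NOT NULL,
  value_int     INT NULL,
  value_varchar VARCHAR(255) NULL,
  value_date    DATE NULL,
  value_text    TEXT NULL,
  PRIMARY KEY (entity_id, attr_name),
  KEY idx_attr_int (attr_name, value_int),
  KEY idx_attr_date (attr_name, value_date)
);

-- A range query then uses the typed column, e.g. monthly income between 7000 and 10000:
SELECT entity_id
FROM attribute_value
WHERE attr_name = 'monthly_income'
  AND value_int BETWEEN 7000 AND 10000;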
For this problem I will be using MySQL, and I cannot change the DB due to certain restrictions, so I am looking for a MySQL-only design.
Is the key-value approach a good idea, or is there another design that could be used?
In very general terms: if you know the entities and attributes of your problem domain, and the data is relational, I'd use a relational schema (your "possible design 1"). If you actually run into problems with the maximum row width, your problem domain probably contains logical subgroupings of attributes, so you can split them into separate tables.
For instance:
Person (id, name, ...)
Person_demographics (person_id, age, location, ...)
Person_finance (person_id, income, wealth...)
If you don't know the entities and attributes in advance, I recommend using MySQL's JSON support or XML support. This gives you access to much better query options than EAV.
The problem with EAV-like solutions in your scenario is that any non-trivial query ends up incredibly complicated: "find all responses where salary is between x and y, the age is z, and the location is in (a, b, c)" turns into a horrible mess of SQL, whereas with XPath (or JSON path expressions) it is pretty straightforward.
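A sketch of the JSON route, assuming MySQL 5.7+ and illustrative table/column names:

CREATE TABLE response (
  id   INT PRIMARY KEY AUTO_INCREMENT,
  data JSON NOT NULL
);

-- "salary between x and y, in locations (a, b, c)" stays one readable query:
SELECT id
FROM response
WHERE CAST(JSON_UNQUOTE(JSON_EXTRACT(data, '$.salary')) AS UNSIGNED) BETWEEN 7000 AND 10000
  AND JSON_UNQUOTE(JSON_EXTRACT(data, '$.location')) IN ('a', 'b', 'c');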

Algorithm for analyzing description before saving into DB: MongoDB

The idea of the application is to show all "E-number" ingredients within a product (E100, E200, etc.).
Imagine we have a list of products coming into our database (JSON, scraped or received via APIs). Each product contains a description listing its ingredients.
Sometimes those ingredients already come as numbers (like E100), sometimes as names (Octyl gallate), and sometimes both.
We are going to store all of this data in MongoDB (collection products).
The question: the application queries a given product and has to show all the E-numbers the product contains. How would you solve the problem that the descriptions take different forms (sometimes direct E-numbers, sometimes E-names, sometimes both)? Moreover, some E-names in the products' descriptions are written incorrectly (with missing letters).
I do not think it would be good to do this on the fly; it would be better if the data were already stored in the DB (but I'm not sure). So my general solution could be like this:
preprocess the description field when receiving product data, before saving the product into the DB (this could be done in any programming language, node.js for instance)
during preprocessing, analyse the description field by searching an existing e-collection (e-id, e-name, e-category, array of e-different-names); for instance, if the description contains E100, greens, and Octyl gallate, preprocessing would produce the array "E100, E140, E311"
then create an "e-list" field for the product document
save the product in the DB
Does this seem logical? I've never worked with MongoDB.
Yes, it would make sense to process it while inserting and prepare the data for quick queries. The ingredients could be normalized into a separate collection, and the ingredient ids then added to each product.
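A minimal sketch in the mongo shell (collection and field names are illustrative assumptions):

// Normalized ingredient reference, one document per E-number:
db.enumbers.insertOne({ _id: "E311", name: "Octyl gallate", names: ["octyl gallate", "octylgallate"] })

// Product saved with the precomputed e-list after preprocessing:
db.products.insertOne({ name: "Some snack", description: "E100, greens, Octyl gallate", e_list: ["E100", "E140", "E311"] })

// Query time then becomes a simple indexed lookup:
db.products.createIndex({ e_list: 1 })
db.products.find({ e_list: "E311" })

For the misspelled ingredient names, the preprocessing step could match description tokens against the names arrays with a fuzzy comparison (e.g. Levenshtein distance) before the product is saved.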

Advanced search on a concatenated string

I am creating search functionality for a website: I need to take the user's full address from an input, e.g. "Address 32, City, Region, Country, Postal Code" (not necessarily in this order), and return the available restaurants around that area.
I have a table "address" with a field for each of the above elements.
I was thinking of concatenating the address from the database and comparing it with the user's input with the help of SQL REGEXP.
Is there any other approximate SQL search that can do this, or can you suggest a different approach?
A friend suggested using http://www.simonemms.com/2011/02/08/codeigniter-solr/, but after a little research on it the problem still remains.
The trouble with concatenating the address together in SQL is that you miss out on indexes, so it will be slow. Added to which, if you do not know the order of the input elements, the chances of the input matching what was concatenated from the database (in a likely different order) are slim.
I would suggest splitting many of the address items into different tables (i.e. a table of regions, another of countries, etc.) and just storing the ids in columns in the users table.
For a search, identify which of the search fields maps to which actual field, then join on those to find the real address.
This also means you can identify typos more easily.
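A sketch of that normalized layout (all names here are illustrative):

CREATE TABLE country (id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(100) UNIQUE);
CREATE TABLE region  (id INT PRIMARY KEY AUTO_INCREMENT, country_id INT, name VARCHAR(100));
CREATE TABLE city    (id INT PRIMARY KEY AUTO_INCREMENT, region_id INT, name VARCHAR(100));
CREATE TABLE address (
  id          INT PRIMARY KEY AUTO_INCREMENT,
  street      VARCHAR(255),
  postal_code VARCHAR(20),
  city_id     INT
);

-- Once the input is split into tokens and each token is matched to a lookup table,
-- the search is an indexed join rather than a comparison against one big string:
SELECT a.*
FROM address a
JOIN city    ci ON ci.id = a.city_id
JOIN region  r  ON r.id  = ci.region_id
JOIN country co ON co.id = r.country_id
WHERE ci.name = 'City' AND r.name = 'Region' AND co.name = 'Country';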

Tridion 2009 embedded metadata storage format in the broker

I'm fairly new to Tridion and I have to implement functionality that allows a content editor to create a component and assign multiple date ranges (available dates) to it. These need to be queried from the broker to provide search functionality.
Originally, this required only a single start and end date, so these were implemented as individual metadata fields.
I am proposing to use an embedded schema within the schema's 'available dates' metadata field to allow multiple start and end dates to be assigned.
However, now that the field allows multiple values, the data is stored in the broker as comma-separated values in the 'KEY_STRING_VALUE' column, rather than as date values in the 'KEY_DATE_VALUE' column, as it was when only single start and end values were allowed.
e.g.
KEY_NAME | KEY_STRING_VALUE
end_date | 2012-04-30T13:41:00, 2012-06-30T13:41:00
start_date | 2012-04-21T13:41:00, 2012-06-01T13:41:00
This is now causing issues with my broker querying, as I can no longer use simple query logic to retrieve the items I require for the search based on the dates.
Before I start writing C# logic to parse these comma-separated dates and search on those, I was wondering if anyone has had similar requirements/experiences in the past and implemented this differently, to reduce the amount of parsing code and let the broker query do the search.
I'm developing this on Tridion 2009 but using the 5.3 broker (for legacy reasons), so the query currently looks like this (for the single start/end dates):
query.SetCustomMetaQuery("(KEY_NAME='end_date' AND KEY_DATE_VALUE>'" + startDateStr + "') AND (ITEM_ID IN (SELECT ITEM_ID FROM CUSTOM_META WHERE KEY_NAME='start_date' AND KEY_DATE_VALUE<'" + endDateStr + "'))");
Any help is greatly appreciated.
Just wanted to come back and give some details on how I finally approached this, should anyone else face the same scenario.
I proposed a set number of fields to the client (as suggested by Miguel), but the client wasn't happy with that level of restriction.
Therefore, I ended up implementing the embeddable schema containing the start and end dates, which gave the most flexibility. However, limitations in the Broker API meant that I had to access the broker DB directly - not ideal, but the client agreed to the approach to get the required functionality. Obviously this would need to be revisited should any upgrades be made in the future.
All the processing of dates and available periods was done in C#, which means the performance of the solution is actually pretty good.
One thing I discovered that caused some issues: if you have multiple values for the field using the embedded schema (i.e. in this case, multiple start and end dates), the metadata is stored in the KEY_STRING_VALUE column of the CUSTOM_META table. However, if you have only a single value in the field (i.e. one start and end date), it is stored as a date in the KEY_DATE_VALUE column, just as if you'd used single fields rather than an embeddable schema. It seems a sensible approach for Tridion to take, but it makes the queries and the parsing code slightly more complicated!
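For reference, a minimal sketch of the comma-separated date parsing described above (purely illustrative, not the original code):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

class DateRangeParser
{
    // Parses a KEY_STRING_VALUE such as
    // "2012-04-30T13:41:00, 2012-06-30T13:41:00" into DateTime values.
    public static List<DateTime> Parse(string raw)
    {
        return raw.Split(',')
                  .Select(s => DateTime.Parse(s.Trim(), CultureInfo.InvariantCulture))
                  .ToList();
    }
}

Pairing the parsed start_date and end_date lists by position then yields the available date ranges to check against the search criteria.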
This is a complex scenario, as you would have to go through all the DCPs and parse those strings to determine whether they match the search criteria.
There is a way you could convert that comma-separated metadata into single values in the broker, but the field names would need to be different: Range1, Range2, ..., RangeN.
You can do that with a deployer extension, where you change the XML structure of the package and convert each of those strings into different values (1, 2, ..., n).
This extension can take some time if you are not familiar with deployer extensions, and it doesn't solve your scenario 100%.
The problem is that you still have to apply several conditions to retrieve those values, and there is always a limit you have to set (versus the user, who can add as many values as they want).
Sample (one condition per numbered field, following the pattern of the query above; in practice the conditions would be combined into one custom meta query):
query.SetCustomMetaQuery("(KEY_NAME='end_date1' AND KEY_DATE_VALUE>'" + startDateStr + "')");
query.SetCustomMetaQuery("(KEY_NAME='end_date2' AND KEY_DATE_VALUE>'" + startDateStr + "')");
query.SetCustomMetaQuery("(KEY_NAME='end_date3' AND KEY_DATE_VALUE>'" + startDateStr + "')");
query.SetCustomMetaQuery("(KEY_NAME='end_date4' AND KEY_DATE_VALUE>'" + startDateStr + "')");
Probably the fastest and easiest way to achieve this is to use separate fields instead of a multi-value field. I understand that is not the most generic scenario and that there are business-requirement implications, but it can simplify the development.
My previous comments are in the context of using only the Broker API, but you can take advantage of a search engine if one is part of your architecture.
You can index the broker database and massage the data.
Using the search engine's API you can extract the ids of the components/component templates and then use the Broker API to retrieve the proper information.

Array, EAV, Serialized LOB for custom fields?

I've been trying to work out a complex MySQL data-structure problem for custom fields in an online app. I'm fairly new to MySQL, so any input is appreciated.
The current database is relational, and every user of the service shares the same database and tables.
Here is an example of what I'm trying to do.
Let's say I'm trying to create a list. This list can contain up to 30 custom fields. The user can choose between 12 unique element types, and each element can have up to 15 user-defined attributes.
Each list can be unique within an account as well as between accounts. Accounts can have numerous lists and each list could have different quantities of elements as well as different attributes per element.
An element can be many things, for example: multiple choice, radio button, phone field, address, single line text, multi-line text, etc.
An example of attributes for a multiple choice (checkbox) element could be: red, green, blue, orange, white, black
An example of a single line text element could be: First Name input field.
Each element must also have a user-defined title field and a tag field, which can be referenced and used in other features of the app.
Segmentation is very important as well. A user needs to be able to segment a list based on any element. For example, a user may want to segment list "ABC" on all records where "red" is selected in multiple-choice element #1 (a list may have more than one multiple-choice element).
In this example I would assume that arrays, EAV, or a serialized LOB would all work. However, I'm not sure which structure would be best for my needs at my scale.
In reality, there will most likely be up to 50,000 records per list, and there is a real possibility of 20,000+ accounts, each with numerous lists. Therefore, I'm looking for the most efficient and flexible structure.
To make matters even more complex, I also need an efficient way to add and delete elements on any particular list at any given time. For example, if a user creates a list with the maximum allowed number of custom fields (30) and then three months later decides to delete a field, I need a way to find that list and all associated values for that custom field, and then delete all the values, the element type, and its attributes. The user would then be allowed to add a new element to the list.
I've reviewed many of the EAV posts on this site, as well as http://www.martinfowler.com/eaaCatalog/serializedLOB.html. It doesn't seem that EAV would be very efficient for my needs, due to its data-retrieval downsides.
I was also wondering how well a multi-dimensional (serialized) array would work at this scale; I believe WordPress uses this approach for its custom fields.
Any input would be greatly appreciated as to how best to structure the database for this situation. Thank you!
You can read about how FriendFeed implements custom fields:
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
They use a combination of a serialized LOB and extra tables containing inverted indexes. You don't need an extra table for every possible attribute in your LOB, only for the ones you want to search with the assistance of an index.
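A sketch of that FriendFeed-style pattern (all table and column names here are illustrative):

-- One row per record, with the custom fields serialized into a blob:
CREATE TABLE entities (
  id   BINARY(16) PRIMARY KEY,   -- entity uuid
  body MEDIUMBLOB NOT NULL       -- serialized record (JSON, etc.)
);

-- One inverted-index table per searchable attribute, e.g. the value of
-- multiple-choice element #1:
CREATE TABLE index_choice1 (
  choice    VARCHAR(32) NOT NULL,
  entity_id BINARY(16) NOT NULL,
  PRIMARY KEY (choice, entity_id)
);

-- Segmenting a list ("all records where 'red' is selected") is then an indexed join:
SELECT e.body
FROM index_choice1 i
JOIN entities e ON e.id = i.entity_id
WHERE i.choice = 'red';

Deleting a custom field maps to dropping (or clearing) its index table and rewriting the serialized blobs, which keeps the schema itself stable as fields come and go.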
You can use JSON encoding and decoding (I'm assuming you're using PHP) to store the list definition in a table, with one column for the user and another storing this data as text. The answers would be stored in another table (with an FK and CASCADE ON DELETE).
If you can bound the size of the definition, use a varchar field instead.
This may not be the best approach (it needs some profiling tests to make sure it's robust enough), but it can certainly be used.
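A minimal sketch of that two-table layout (names are illustrative assumptions):

CREATE TABLE list_definition (
  id         INT PRIMARY KEY AUTO_INCREMENT,
  account_id INT NOT NULL,
  fields     TEXT NOT NULL            -- JSON-encoded element/attribute spec
);

CREATE TABLE list_answer (
  id      INT PRIMARY KEY AUTO_INCREMENT,
  list_id INT NOT NULL,
  record  TEXT NOT NULL,              -- JSON-encoded values for one record
  FOREIGN KEY (list_id) REFERENCES list_definition(id) ON DELETE CASCADE
);

Deleting a list then cascades to its records automatically; the trade-off, as with any serialized approach, is that segmentation queries must decode the JSON in the application unless paired with index tables like the ones above.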