MySQL --> MongoDB: Keep IDs or create mapping?

We are going to migrate our database from MySQL to MongoDB.
Some URLs pointing at our web application use database IDs (e.g. http://example.com/post/5)
At the moment I see two possibilities:
1) Keep existing MySQL IDs and use them as MongoDB IDs. IDs of new documents will get new MongoDB ObjectIDs.
2) Generate new MongoDB ObjectIDs for all documents and create a mapping with MySQLId --> MongoDBId for all external links with old IDs in it.
#2 will mess up my PHP app a little, but I could imagine that #1 will cause problems with indexes or sharding?
What is the best practice here to avoid problems?

1) Keep existing MySQL IDs and use them as MongoDB IDs. IDs of new
documents will get new MongoDB ObjectIDs.
ObjectIDs are very useful when you don't want/have a natural primary key for your documents, but mixing ObjectIDs and numerical IDs as primary keys can only cause you problems later on with queries. I would suggest a different route: keep the existing MySQL IDs and use them as MongoDB IDs, and create new documents with numerical IDs as well, as you would do in MySQL. This way you don't have to mix data types in one field.
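One common way to keep handing out numerical IDs is a small counter document; a minimal sketch in the mongo shell (the counters collection and the posts collection name are assumptions, not something from the question):
// Seed the counter just past the highest migrated MySQL ID
db.counters.insert({_id: "posts", seq: 5000})
// Atomically reserve the next numerical ID
var next = db.counters.findAndModify({
    query: {_id: "posts"},
    update: {$inc: {seq: 1}},
    new: true
}).seq
// New documents keep using numerical IDs, just like in MySQL
db.posts.insert({_id: next, title: "first post after the migration"})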
2) Generate new MongoDB ObjectIDs for all documents and create a
mapping with MySQLId --> MongoDBId for all external links with old IDs
in it.
This can also work, but, as you said, you need to map your new and old IDs. That is extra work which you can avoid if you leave your IDs unchanged.
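For illustration, a hedged sketch of that mapping in the mongo shell (the id_map collection and its field names are hypothetical):
// One mapping document per migrated row
db.id_map.insert({_id: 5, new_id: ObjectId("53aeb2dcb9f8955d1a927b97")})
// Resolving an old URL such as http://example.com/post/5
var mapping = db.id_map.findOne({_id: 5})
db.posts.findOne({_id: mapping.new_id})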
I could imagine that #1 will cause problems with indexes or sharding?
ObjectIDs and MySQL AUTO_INCREMENT IDs are both monotonically increasing, so there wouldn't be much difference if they are used as shard keys (you would probably use hashed shard keys in that case; you can read more details here).
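For example, a sketch of a hashed shard key on _id (the blog database and posts collection names are placeholders):
sh.enableSharding("blog")
sh.shardCollection("blog.posts", {_id: "hashed"})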
Edit
Which problems could occur when mixing ObjectIDs and numeric IDs?
If you're doing simple equality checks (e.g. getting a document with {_id: 5} or {_id: ObjectId("53aeb2dcb9f8955d1a927b97")}), you will have no problems. However, range queries will be more complicated.
As an example:
db.coll.find({_id : { $gt : 5}})
This query will return only the documents with numeric IDs.
This query:
db.coll.find({_id : { $gt : ObjectId("53aeb2dcb9f8955d1a927b97")}});
will return only documents that have ObjectIds.
Obviously, you can use $or to find both, but my point is that your queries won't be as straightforward as with non-mixed IDs.
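For completeness, the combined query hinted at above would look something like this:
db.coll.find({$or : [
    {_id : { $gt : 5}},
    {_id : { $gt : ObjectId("53aeb2dcb9f8955d1a927b97")}}
]})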

Related

SQL Query to find duplicates where a String contains a specific id

I have a database table where one field (payload) is a string in which a JSON object is stored. This JSON has multiple attributes. I would like to find a way to query all entries where the payload JSON object contains the same value for the attribute id_o, in order to find duplicates.
So, for example, if there are multiple entries where id_o in the payload string is "id_o: 100", I want to get those rows back.
How can I do this?
Thanks in advance!
I have faced a similar issue before.
I used regexp_substr:
SELECT regexp_substr(yourJSONcolumn, '"id_o":"([^,]*)', 1, 1, 'e') AS give_it_a_name FROM your_table
The comma in ([^,]*) can be replaced with a "." or whatever character follows the id_o value in your payload, so that the match stops there.
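The question also asks for the duplicates themselves. A hedged sketch of that grouping step, using the JSON functions available in MySQL 5.7+ instead of a regular expression (your_table and payload are placeholder names):
-- Count how often each id_o value occurs and keep only the duplicated ones
SELECT JSON_UNQUOTE(JSON_EXTRACT(payload, '$.id_o')) AS id_o, COUNT(*) AS cnt
FROM your_table
GROUP BY JSON_UNQUOTE(JSON_EXTRACT(payload, '$.id_o'))
HAVING COUNT(*) > 1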
I think storing JSON in the database is not a good approach here. Your DB needs normalization: it would be better to add a dedicated, indexed column and store this id_o property there.
UPDATE
Here is what I found in another question:
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
I guess your JSON looks like this: {..,"id_o":"100",..}
SELECT * FROM your_table WHERE your_column LIKE '%"id_o":"100"%'

Solr indexing structure with MySQL

I have three to five search fields in my application and planning to integrate this with Apache Solr. I tried to do the sams with a single table and is working fine. Here are my questions.
1. Can we index multiple tables in the same core? Or should I create a separate core for each index (I guess this concept is wrong)?
2. Suppose I have 4 tables: users, careers, education and location. I have two search boxes in a PHP page: one searches locations (just like an autocomplete box), and the other searches for a keyword across the careers and education tables. If multiple indexes are possible under a single core:
2.1 How do we define the query here?
2.2 Can we specify the index name in the query (like a table name in MySQL)?
Links which can answer my concerns are enough.
If you're expecting to query the same data as part of the same request, such as auto-completing users, educations and locations at the same time, indexing them to the same core is probably what you want.
The term "core" is probably identical to the term "index" in your usage, and having multiple sets of data in the same index will usually be achieved through having a field that indicates the type of document (and then applying a filter query if you want to get documents of only one type, such as fq=type:location. You can use the grouping feature of Solr to get separate result sets of documents back for each query as well.
If you're only ever going to query the data separately, having them in separate indexes is probably the way to go, as you'll be able to scale, analyze and tune each index independently in that case (and avoid always having to add a filter query for the type of content you're looking for).
Specifying the index name is the same as specifying the core, and is part of the URL to Solr: http://localhost:8983/solr/index1/ or http://localhost:8983/solr/index2/.

Best approach for having unique row IDs in the whole database rather than just in one table?

I'm designing a database for a project of mine, and in the project I have many different kinds of objects.
Every object might have comments on it - which it pulls from the same comments table.
I noticed I might run into problems when two different kinds of objects have the same ID: when pulling from the comments table, they would pull each other's comments.
I could just solve it by adding an object_type column, but that would be harder to maintain when querying, etc.
What is the best approach to have unique row IDs across my whole database?
I noticed Facebook numbers its objects with really, really large IDs, and probably determines the type by id mod a trillion or some other really big number.
Though that might work, are there any more options to achieve the same thing, or is relying on big enough number ranges fine?
Thanks!
You could use something like what Twitter uses for their unique IDs.
http://engineering.twitter.com/2010/06/announcing-snowflake.html
For every object you create, you will have to make some sort of API call to this service, though.
Why not tweak your object_type concept by integrating it into the ID column? For example, an ID could be a concatenation of the object type, a separator, and an ID that is unique within that type.
This approach might scale better, as a unique ID generator for the whole database might lead to a performance bottleneck.
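A hedged sketch of that type-prefixed ID idea in MySQL (table, column and type names are placeholders):
-- Store the object type as a prefix of the key itself
INSERT INTO comments (object_id, body)
VALUES (CONCAT('post:', 5), 'Nice article!');
-- Comments for post 5 no longer collide with comments for photo 5
SELECT * FROM comments WHERE object_id = 'post:5';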
If you only have one database instance, you can create a new table to allocate IDs:
CREATE TABLE id_gen (
id BIGINT PRIMARY KEY AUTO_INCREMENT NOT NULL
);
Now you can easily generate new unique IDs and use them to store your rows:
INSERT INTO id_gen () VALUES ();
INSERT INTO foo (id, x) VALUES (LAST_INSERT_ID(), 42);
Of course, the moment you have to shard this, you're in a bit of trouble. You could set aside a single database instance that manages this table, but then you have a single point of failure for all writes and a significant I/O bottleneck (that only grows worse if you ever have to deal with geographically disparate datacenters).
Instagram has a wonderful blog post on their ID generation scheme, which leverages PostgreSQL's awesomeness and some knowledge about their particular application to generate unique IDs across shards.
Another approach is to use UUIDs, which are extremely unlikely to exhibit collisions. You get global uniqueness for "free", with some tradeoffs:
slightly larger size: a BIGINT is 8 bytes, while a UUID is 16 bytes;
indexing pains: INSERT is slower for unsorted keys. (UUIDs are actually preferable to hashes, as they contain a timestamp-ordered segment.)
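A small sketch of the UUID route in MySQL, packing the value into 16 bytes (table and column names are placeholders):
CREATE TABLE comments (
    id BINARY(16) PRIMARY KEY,
    object_id BINARY(16) NOT NULL,
    body TEXT
);
INSERT INTO comments (id, object_id, body)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), UNHEX(REPLACE(UUID(), '-', '')), 'hi');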
Yet another approach (which was mentioned previously) is to use a scalable ID generation service such as Snowflake. (Of course, this involves installing, integrating, and maintaining said service; the feasibility of doing that is highly project-specific.)
I use tables as object classes, rows as objects, and columns as object parameters. Everything starts with the class "techname", in which every object gets its identifier, which is unique across the whole database. The object classes themselves are registered as objects in the object-classes table, and the parameters for each object class are linked to it.

Do we have to use GUIDs in LinqToSql application to uniquely identify objects

I have inherited a LinqToSql application which is making use of GUID keys for objects.
I'd rather use conventional identity fields - much easier for people to use, understand and communicate. However, there is some business logic that requires the application to identify unique objects before they're persisted to the DB, which is why GUIDs were used in the first place.
Another issue we're having is with fragmented indexes - AFAIK we can't create sequential GUIDs in .Net code.
As this is my first exercise in LinqToSql I'd like to know how others have addressed this issue.
BTW there is no need for the data between multiple servers to be combined - the main (only) reason that I've used GUID keys in the past.
No, you don't have to use Guids, you can use any key type you'd like.
If you are stuck with Guids, consider having the database generate them sequentially for you by making the default binding for the PK field newsequentialid(). This will at least eliminate fragmentation in your clustered index. You need to make a few modifications to the .dbml if you do this: on the key field in the .dbml, set Auto Generated Value = true and Auto-Sync = OnInsert.
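A hedged sketch of that default binding in T-SQL (table, column and constraint names are placeholders):
ALTER TABLE dbo.Orders
    ADD CONSTRAINT DF_Orders_Id DEFAULT NEWSEQUENTIALID() FOR Id;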
As far as generating the value before you insert into the database, I don't see how using an identity field helps you. You will still have to insert into the database to reliably get the correct value. (Identity columns will have the same Auto Generated/Auto-Sync settings as above.)
Ints or Guids, you should be able to wrap the insert in a transaction: insert the record, grab the new key value, run your business logic and, if it fails, roll back the newly inserted record.

Sphinx without using an auto_increment id

I am currently planning to create a big database (2+ million rows) with a variety of data from separate sources. I would like to avoid structuring the database around auto_increment IDs, to help prevent sync issues with replication and also because each item inserted will have an alphanumeric product code that is guaranteed to be unique - it seems to make more sense to use that instead.
I am looking at a search engine to index this database, with Sphinx looking rather appealing due to its design around indexing relational databases. However, various tutorials and the documentation seem to show database designs that depend on an auto_increment field in one form or another, and there is a rather bold statement in the documentation saying that document IDs must be 32/64-bit integers only, or things break.
Is there a way to have a database indexed by Sphinx without auto_increment fields as the id?
Sure - that's easy to work around. If you need to make up your own IDs just for Sphinx and you don't want them to collide, you can do something like this in your sphinx.conf (example code for MySQL):
source products {
    # Use a session variable to store a throwaway ID value
    sql_query_pre = SELECT @id := 0
    # Keep incrementing the throwaway ID; the first column becomes the Sphinx document ID.
    # "code" is present twice because Sphinx does not full-text index attributes
    sql_query = SELECT @id := @id + 1, code AS code_attr, code, description FROM products
    # Return the code so that your app will know which records were matched
    # this will only work in Sphinx 0.9.10 and higher!
    sql_attr_string = code_attr
}
The only problem is that you still need a way to know what records were matched by your search. Sphinx will return the id (which is now meaningless) plus any columns that you mark as "attributes".
Sphinx 0.9.10 and above will be able to return your product code to you as part of the search results because it has string attributes support.
0.9.10 is not an official release yet but it is looking great. It looks like Zawodny is running it over at Craig's List so I wouldn't be too nervous about relying on this feature.
Sphinx only requires IDs to be unique integers; it doesn't care whether they are auto-incremented or not, so you can roll your own logic. For example, generate integer hashes for your string keys.
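A hedged sketch of that hashing idea for the sphinx.conf sql_query (CRC32 only yields 32-bit values, so collisions become likely on large datasets; treat this as an illustration rather than a recommendation):
sql_query = SELECT CRC32(code) AS id, code, description FROM products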
Sphinx doesn't depend on auto_increment; it just needs unique integer document IDs. Maybe you can have a surrogate unique integer ID in the tables to work with Sphinx, since integer searches are also much faster than alphanumeric searches. BTW, how long is your alphanumeric product code? Any samples?
I think it's possible to generate an XML stream from your data.
Then create the ID in software (Ruby, Java, PHP).
Take a look at
http://github.com/burke/mongosphinx