Solr or Lucene: adding search to an existing web application (MySQL)

I want to add simple search functionality into an existing Java web application.
Search should be done on existing database fields.
It is a web application deployed on WildFly, with REST services and a MySQL database.
After some research, my first impression was that Solr would give me what I want.
But as I am not allowed to deploy an additional application to customers' environments, a separately deployed Solr no longer fits.
As I understand it, there are two ways around this:
Using EmbeddedSolr
"Self-build solr" (http://javaskeleton.blogspot.de/2011/07/adding-solr-to-existing-web-application.html)
Which way should I go to implement search in my web app?
Or should I switch to Lucene?

The second way seems dated, and although the post appears to have been removed, I think I understand what the author meant from the title.
IMO the first way is better because you will be using Solr as it should be used: as a black box, without mixing it up with your webapp.
Having said that, keep in mind that embedded Solr isn't a good choice for a production environment because it is a standalone module and, above all, not scalable.
I suggest you write your Solr client code in a decoupled way: your webapp should deal only with the SolrServer abstract class (named SolrClient from Solr 5 on). Behind the scenes you'll instantiate an EmbeddedSolrServer for the moment. Later, if you want to scale your search service, this design will let you switch to another implementation (LBHttpSolrServer, or CloudSolrServer for SolrCloud) with little refactoring effort.
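A minimal sketch of that decoupling, using only the JDK. SearchBackend, InMemoryBackend and ProductSearchService are hypothetical names invented for illustration; in a real application the role of the abstraction would be played by SolrJ's SolrServer and the first implementation by EmbeddedSolrServer, neither of which is called here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical abstraction: the only search type the webapp is allowed to see.
interface SearchBackend {
    void index(String id, String text);
    List<String> search(String term);
}

// Stand-in for an embedded engine; a remote implementation could replace it later.
class InMemoryBackend implements SearchBackend {
    private final Map<String, String> docs = new LinkedHashMap<>();

    @Override
    public void index(String id, String text) {
        docs.put(id, text.toLowerCase(Locale.ROOT));
    }

    @Override
    public List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (e.getValue().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

// Webapp-side code depends on the interface only, never on a concrete engine.
class ProductSearchService {
    private final SearchBackend backend;

    ProductSearchService(SearchBackend backend) {
        this.backend = backend;
    }

    List<String> findIds(String term) {
        return backend.search(term);
    }
}

class DecouplingDemo {
    public static void main(String[] args) {
        // The single line that would change when switching implementations.
        SearchBackend backend = new InMemoryBackend();
        backend.index("p1", "Red garden chair");
        backend.index("p2", "Blue office desk");
        System.out.println(new ProductSearchService(backend).findIds("chair"));
    }
}
```

Only the line that constructs the backend knows the concrete class, so moving to a scaled-out implementation later touches that one place.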

So I will describe the way I have chosen.
First of all: yes, Lucene is my friend.
In my web app I created a @WebListener. At startup of the web app it creates one index, deleting it first if it already exists.
The index contains some database field values of the three object types that have to be searchable.
In my SearchService (REST) I build up my query and access this index.
Additionally, I want to extend the existing REST services (not yet done), so that when objects included in the index are created, updated or deleted (CUD), the index is updated as well.
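The CUD handling described above could look roughly like this. It is only a sketch with a plain in-memory inverted index from the JDK, not Lucene itself; SimpleIndex and its methods are hypothetical names. With Lucene, the corresponding calls would be IndexWriter.updateDocument and IndexWriter.deleteDocuments, keyed on a Term that holds the database id.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory inverted index keyed by database id.
class SimpleIndex {
    private final Map<String, Set<Long>> postings = new HashMap<>();   // token -> ids
    private final Map<Long, Set<String>> tokensById = new HashMap<>(); // id -> tokens

    // Lower-cases the text and splits on anything that is not a letter or digit.
    private static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Create and Update share one path: drop the old entry, then add the new one.
    void upsert(long id, String text) {
        delete(id);
        Set<String> tokens = tokenize(text);
        tokensById.put(id, tokens);
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashSet<>()).add(id);
        }
    }

    // Delete removes the id from every posting list it appeared in.
    void delete(long id) {
        Set<String> old = tokensById.remove(id);
        if (old == null) {
            return; // id was never indexed
        }
        for (String t : old) {
            Set<Long> ids = postings.get(t);
            ids.remove(id);
            if (ids.isEmpty()) {
                postings.remove(t);
            }
        }
    }

    // Returns the ids whose text contains the given single token, in ascending order.
    Set<Long> search(String token) {
        Set<Long> ids = postings.get(token.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<>() : new TreeSet<>(ids);
    }
}

class IndexDemo {
    public static void main(String[] args) {
        SimpleIndex index = new SimpleIndex();
        index.upsert(1L, "Green bicycle"); // initial build at startup
        index.upsert(2L, "Green tent");
        index.upsert(1L, "Red bicycle");   // later edit through a REST service
        index.delete(2L);                  // later delete through a REST service
        System.out.println(index.search("bicycle"));
    }
}
```

Updating single entries this way avoids rebuilding the whole index on every edit; the full rebuild at startup then only serves to bring the index back in line with the database.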
Feel free to give me some suggestions or best practices.

Downside of having string properties in service contracts that can contain a full json model

We are working with a DDD framework in our company. We are changing a lot of core things in our API because we are still growing and still in our infancy when it comes to designing a good API.
The problem is that there are already a lot of flows in the same API which are not compatible with each other.
We have an order service and a product service.
Normally, when the product model changes radically, it has a major impact on the order model.
Now I'm listing all kinds of red flags here that should never happen, but I simply don't have control over how it gets done. It is pretty much management pushing for a fast solution, which leads to bad shortcuts...
Here is what has been decided so that Order no longer needs to adapt constantly: a property called productConfiguration was added to the order line. It is part of the service contract and is stored as is in the DB tables. It contains the product model, which can change, in JSON format.
To me it is very clear that this is dangerous, because in the end you need to turn this JSON into an actual object. So you just move the restrictions from the service contract into code logic, which makes things worse because problems will only show up at run time...
Are there other major downsides I should know about, so I can bring them to the table and avoid this way of working?
Storing strings that are translated directly into DB tables is not a bad design only in your opinion. It's an opinion shared by a lot of us.
What do you do when an object changes? For example, the new one requires an attribute that the old one didn't have. How do you manage this situation? I suppose that you have to change everything, including the objects stored before, or build a kind of transformation layer where you translate objects from the old to the new design. A lot of extra work.
Anyway, given that the two domains are separated, what is the information that changes so much that it requires such a design? I mean, for most things you could know from the start what you need for your part of the domain. For the rest, I would prefer a service that, given an ID, returns the information from the other domain. You can change this service (it could even return a JSON object, if nothing more than display is required) and adapt it to your/their needs. But this is just a solution that comes from my limited knowledge of your processes.
Other ways are also possible, as long as you can always tell which version of the design you are using.
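To illustrate what such a transformation layer with explicit versions could look like, here is a minimal sketch. It assumes the JSON has already been parsed into a Map; schemaVersion, currency, title and name are hypothetical field names, and ProductConfigUpgrader is an invented class, not part of any framework mentioned in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical upgrader for a stored productConfiguration payload.
class ProductConfigUpgrader {
    static final int CURRENT_VERSION = 3;

    // Applies one migration step per version until the payload is current.
    static Map<String, Object> upgrade(Map<String, Object> stored) {
        Map<String, Object> cfg = new HashMap<>(stored); // never mutate the stored copy
        // Payloads written before versioning existed are treated as version 1.
        int version = ((Number) cfg.getOrDefault("schemaVersion", 1)).intValue();
        if (version > CURRENT_VERSION) {
            throw new IllegalArgumentException("Unknown schemaVersion " + version);
        }
        while (version < CURRENT_VERSION) {
            switch (version) {
                case 1:
                    // Assumed change: v2 requires an attribute that v1 payloads lack.
                    cfg.putIfAbsent("currency", "EUR");
                    break;
                case 2:
                    // Assumed change: v3 renamed an attribute.
                    Object title = cfg.remove("title");
                    if (title != null) {
                        cfg.put("name", title);
                    }
                    break;
                default:
                    throw new IllegalStateException("No migration from version " + version);
            }
            version++;
            cfg.put("schemaVersion", version);
        }
        return cfg;
    }
}

class UpgradeDemo {
    public static void main(String[] args) {
        Map<String, Object> old = new HashMap<>();
        old.put("title", "Chair"); // a payload stored before versioning existed
        System.out.println(ProductConfigUpgrader.upgrade(old));
    }
}
```

Every product model change then costs one more migration step, which is the extra work mentioned above, but at least it is concentrated in one place instead of scattered across the order code.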

AS3-Spod Example or tutorial? or any other AS3 ORM

Does anybody have any experience with as3-spod?
I downloaded the source code from GitHub, along with as3-signals, and started to try it out, but it'll take ages to get to know the framework by trial and error, and I'll probably miss a lot of best practices. The framework looks good but lacks examples. The git page doesn't have a lot of info on that...
If anybody knows some other ORM for AIR with at least a bit of documentation that I can use in pure AS3 projects, I'd be more than thankful!
I was hoping to do a question-comment asking for clarification, but I don't have enough reputation yet! So I will answer as best I may.
I am using as3-spod for my application. It's been pretty reliable and mostly given me what I want. It's not really ideal, though. What I'd really like is something more ActiveRecord-like, or something original that lets you generate queries by concatenating conditions in a fluid syntax.
But if you're not using Flex (as I'm not, and you're not) then your options are pretty thin, as most of the other AS3 ORMs out there rely on some part of the Flex framework. Apart from as3-spod, the only possibility I could find was Christophe Coenraets' proof-of-concept but as he points out, it would need a lot of work to develop it into a fully-fledged ORM:
This is still a simplistic proof of concept and is by no means a production ready ORM solution.
And I haven't had time for that.
You are right that as3-spod is quite poorly documented. I guess the main class you want to look at is SpodTable. It's from that one you do inserts, selects, etc. An update on a single object can be done from the object itself. Look out for the various signals on SpodTable (select, selectAll, etc). To get going with it, just mark up a model class with metadata, then from your SpodDatabase instance call createTable(MyModelClass).
My main gripes with as3-spod are these (I'm listing them so you don't look for features that don't exist, which I wasted a fair bit of time doing!):
It works asynchronously. Doesn't matter if your actual SQLConnection has been opened synchronously or asynchronously; you have to listen to signals. That means you can't retrieve records and then use them straight away in the same method, you have to listen to signals. What I tend to do is to do large selects when the app starts, then filter the data in memory rather than doing complex queries. Pretty annoying.
Be careful with null values for numeric columns. I can't see a way of setting NULL or NOT NULL for columns using as3-spod; it always seems to make them NOT NULL, which will cause errors if you try to insert a row from an object with null fields.
There's no migration system (a la Rails). I am working on rolling my own as that's an essential feature for my purposes (it's a mobile app I'm developing).
Good luck! Let me know in comments if there's anything else specific you'd like me to cover and I can expand this answer.
EDIT
I've just noticed the existence of AS3SQLite. I haven't used it yet, but it looks like there are other possibilities out there :)

geo spatial application: mySql vs CouchDB vs others

I am developing an application on Google Maps and checking out various options to store and retrieve spatial information within a bounding box.
Initially I thought MySQL was not a good option, but after checking http://dev.mysql.com/doc/refman/5.6/en/spatial-analysis-functions.html and http://code.google.com/apis/maps/articles/phpsqlsearch.html, it looks like I can use MySQL and that it supports my use cases.
I was also evaluating node.js and CouchDB with GeoCouch. With modules like socket.io, geo etc., this also looks like a good choice. Check out the book "Getting Started with GEO, CouchDB, and Node.js". My application would be a single-page application, and I do not foresee needing an RDBMS anytime in the future.
I have also seen http://nodeguide.com/convincing_the_boss.html and it makes me a little apprehensive about going with node.js and GeoCouch:
If the architecture for your next apps reads like the cookbook of
NoSQL ingredients, please pause for a second and read this.
Yes, Redis, CouchDB, MongoDB, Riak, Casandra, etc. all look really
tempting, but so did that red apple Eve couldn't resist. If you're
already taking a technological risk with using node.js, you shouldn't
multiply it with more technology you probably don't fully understand
yet.
Sure, there are legitimate use cases for choosing a document oriented
database. But if you are trying to build a business on top of your
software, sticking to conservative database technology (like postgres
or mysql) might just outweigh the benefits of satisfying your inner
nerd and impressing your friends.
What is your opinion?
GeoCouch sounds like a good solution in your case. If you want an easy installation, you can have a look at Couchbase Single Server, which is basically a CouchDB with GeoCouch included (check out the Developer Preview for 2.0).
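Whichever store you end up with, the query it has to answer for a map viewport is a bounding-box containment test. A minimal sketch of that test follows, independent of MySQL and GeoCouch; BoundingBox is a hypothetical class, and the only subtlety it handles is a viewport that crosses the 180th meridian.

```java
// Hypothetical value type for a map viewport given by its south-west and north-east corners.
final class BoundingBox {
    final double south, west, north, east;

    BoundingBox(double south, double west, double north, double east) {
        if (south > north) {
            throw new IllegalArgumentException("south must not exceed north");
        }
        this.south = south;
        this.west = west;
        this.north = north;
        this.east = east;
    }

    // True if the point lies inside the box, borders included.
    boolean contains(double lat, double lng) {
        if (lat < south || lat > north) {
            return false;
        }
        if (west <= east) {
            return lng >= west && lng <= east;
        }
        // west > east means the box crosses the 180th meridian.
        return lng >= west || lng <= east;
    }
}

class BoundingBoxDemo {
    public static void main(String[] args) {
        BoundingBox viewport = new BoundingBox(52.3, 13.0, 52.7, 13.8);
        System.out.println(viewport.contains(52.52, 13.405));
    }
}
```

A spatial index in the database does the same comparison without scanning every row, which is the part worth comparing between the candidates.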

How do you access the database (I/O) to/from Magento Commerce?

So, I want to import, export and modify the database. I have read that I have to do that by XML, but I don't really understand their doc system and I haven't found any good tutorials out there that explain this. I am slowly reading the very expensive and short book which is somewhat answering my questions, but I crave more.
As a second question, I want to have an order system where I can send out information or emails with my own code. I assume this would be some type of plug-in that would override existing behaviour or be called at a certain time. Any info would be helpful.
Some parts of the Magento data can be imported/exported via the backend (System->Import/Export), namely products and customers.
If you want to deal with the complete DB, use your DB tool of choice (I prefer mysqldump).
When dealing with exported CSV, use OpenOffice; in my experience it handles the separator characters better than Excel.
As for your second question: as far as I understand, you will have to develop a module if you want to do something different from the existing functionality and keep the original mail functions. If you don't want to or don't have to keep the original functions, you can opt to override the module, which is much easier as far as I can see. A Google search for "overriding magento module" should turn up at least one decent tutorial.
I found what I was looking for here:
(on magento site: Resources -> Magento Core API -> Product API or whichever API you want)
The problem is there is no Order API yet (or none that I've seen)
http://www.magentocommerce.com/wiki/doc/webservices-api/api/catalog_product#examples
This details how you'd write an external PHP script to obtain, edit or delete products (or anything else with an API).
Modules still look daunting, but I am reading through the (very thin) magento book (the only one available).
I hope this helps someone else.

Best open source, extendable crawler to use for image crawling

We are in the starting phase of a project, and we are currently wondering
which crawler is the best choice for us.
Our project:
Basically, we're going to set up Hadoop and crawl the web for images.
We will then run our own indexing software on the images stored in HDFS
based on the Map/Reduce facility in Hadoop. We will not use other indexing
than our own.
Some particular questions:
Which crawler will handle crawling for images best?
Which crawler will best adapt to a distributed crawling system, in which we
use many servers conducting crawling together?
Right now these look like the 3 best options-
Nutch: Known to scale. Doesn't look like the best option because it seems that it is tied closely to their text searching software.
Heritrix: Also scales. This one currently looks like the best option.
Scrapy: Has not been used on a large scale (not sure though). I don't know if it has the basic stuff like URL canonicalization. I would like to use this one because it is a Python framework (I like Python more than Java), but I don't know if they have implemented the advanced features of a web crawler.
Summary:
We need to get as many images as possible from the web. Which existing crawling framework is both scalable and efficient, but also the easiest to modify to get only images?
Thanks!
http://lucene.apache.org/nutch/
I would think going with something with the broadest use and support (community support) would be the better approach.
Nutch may be a good option because you want to end up on HDFS. It may be useful to look into the HBase integration that is currently in the works (NUTCH-650).
You may be able to get the data you need by skipping the index step at the end and looking at the segments themselves instead.
However, for flexibility another option may be Droids: http://incubator.apache.org/droids/. It's still in the incubator phase at Apache, but worth looking at.
You may get some ideas by looking at the SimpleRuntime example in the org.apache.droids.examples package. Perhaps replacing the Sysout handler with one that stores the images on HDFS would give you what you want.
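Whatever crawler you pick, a handler like that first has to decide which fetched links are images. A minimal sketch with the JDK only; ImageLinkFilter is a hypothetical name and not part of Droids or Nutch, and matching on the file extension is an assumption (a real crawler would probably also check the Content-Type response header).

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps only links whose path ends in a known image extension.
class ImageLinkFilter {
    private static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif", "bmp", "webp"));

    static boolean isImage(String link) {
        String path;
        try {
            path = URI.create(link).getPath(); // path only, so query strings are ignored
        } catch (IllegalArgumentException e) {
            return false; // malformed link
        }
        if (path == null) {
            return false; // opaque URI such as mailto:
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot < path.lastIndexOf('/')) {
            return false; // no extension in the last path segment
        }
        return EXTENSIONS.contains(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}

class ImageLinkFilterDemo {
    public static void main(String[] args) {
        System.out.println(ImageLinkFilter.isImage("http://example.com/a/cat.JPG?size=large"));
        System.out.println(ImageLinkFilter.isImage("http://example.com/index.html"));
    }
}
```

Links that pass the filter would go to the handler that writes to HDFS; everything else is only followed for further links.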