Creating RapidMiner Subprocess to Reduce Repetition - rapidminer

I'm new to RapidMiner, but am using it to do some text mining and have what is probably a simple question that I can't seem to find an answer to anywhere.
I'm extracting some data from a database into RapidMiner. From there, I'm filtering the data using an attribute_value_filter to get several subsets from the original data.
After having all of my subsets ready to go, I need to do the exact same operations on each subset (extracting one column of data, creating a document, tokenizing, stemming, removing stop words, etc.).
Is there a way to make a "function", so to speak, in RapidMiner so I don't have to duplicate the exact same set of operations for each subset of data? I've tried using the subprocess operator, but I can't seem to get an independent output for each independent input. Perhaps I'm not setting it up right? I don't have enough reputation points to post images of my processes, but if more details are needed, please let me know.
Any feedback would be very much appreciated!

The Execute Process operator is a good one to use. This allows for the creation of a library of processes to be used in a modular way.
There's an example at my blog here.

What to use JSON file or SQL

I want to create a website that will have an AJAX search. It will fetch the data either from a JSON file or from a database. I do not know which technology to use to store the data: a JSON file or MySQL. Based on some quick research it is going to be about 60,000 entries, so the file size if I use JSON will be around 30-50 MB, and if I use MySQL the table will have 60,000 rows. What are the limitations of each technique and what are the benefits?
Thank you
I can't seem to comment since I need 50 rep. for commenting, so I will give it as an answer:
MySQL will be preferable for many reasons, not the least of which being you do not want your web server process to have write access to the filesystem (except for possibly logging) because that is an easy way to get exploited.
Also, the MySQL team has put a lot of engineering effort into things such as replication, concurrent access to data, ACID compliance, and data integrity.
Imagine if, for instance, you add a new field that is required in whatever data structure you are storing. If you store in JSON files, you will have to have some process that opens each file, adds the field, then saves it. Compare this to the difficulty of using ALTER TABLE with a DEFAULT value for the field, something like the sketch below. (A bit of a contrived example, but how many hacks do you want to leave in your codebase for dealing with old data?)
To be really blunt about it: MySQL is a database while JSON is not, so the correct answer is MySQL, without hesitation. JSON is just a data format, and barely even that. JSON was never designed to handle anything like concurrent connections or any sort of data manipulation, since its only function is to represent data, not to manage it.
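For illustration, a minimal sketch of that kind of schema change (the table and column names are hypothetical):

    -- add a required field across every existing row in one statement;
    -- old rows pick up the default, new rows can rely on it too
    ALTER TABLE entries
        ADD COLUMN category VARCHAR(50) NOT NULL DEFAULT 'uncategorized';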
So go with MySQL for storing the data. Then you should use some programming language to read that database, and send that information as JSON, rather than actually storing anything in JSON.
If you store the data in files, whether in JSON format or anything else, you will have all sorts of problems that people have stopped worrying about since databases started being used for the same thing: size limitations, locking, you name it. It's good enough when you have one user, but the moment you add more of them, you'll be solving so many problems that you would probably end up writing an entire database engine just to handle the files for you, when all along you could have simply used an actual database.
Do note: don't take my word for it, I am not an expert in this field, so let others post their answers and then judge by those. I think enough people here on Stack Overflow have more experience than I do. These are not entirely my words; I have kept the parts I knew to be true and added some of my own knowledge. :) Have a great time making your website!
For MySQL: you can select specific rows or specific columns using queries, filter data based on a key, and order results alphabetically (see the sketch after this list).
Downside: you need backend code (PHP, Python, or whatever language you prefer) exposing a REST API, because the database can't be accessed directly from the browser.
For a JSON file: benefit: no backend code; the file can be fetched directly with an HTTP GET request.
Downside: no filtering, ordering, or queries of any kind; you have to do all of that manually on the client.
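As a rough illustration of what that filtering buys you for an AJAX search, here is a minimal sketch of the kind of query the backend would run; the entries table and its columns are made up for the example:

    -- return the first 20 matches for the user's search term, alphabetically
    SELECT id, title
    FROM entries
    WHERE title LIKE CONCAT('%', ?, '%')   -- the search term is bound as a parameter
    ORDER BY title
    LIMIT 20;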

How to migrate data from mongodb to mysql?

I am currently working on an analytics-like application. It has an AngularJS app which communicates with a Spring REST client app, from which the user creates a token (trackingID) and puts a generated script with this id on his website to collect information about visitors' actions through another Spring REST tracking app. For the tracking app I am using MongoDB to collect visitor actions and visitor info, for fast insertion, but for the REST client app I use MySQL with user/account details.
My question is how to migrate the Mongo data from the tracking app to MySQL, mainly to gain the possibility of joins as the easiest and fastest way to analyze the data with any kind of filters from the AngularJS client app. Should I manually create workers that periodically transfer data from the last checkpoint to the present state from Mongo to MySQL, or are there existing tools that can be set up for this transfer?
There is no official library to do this.
But you can use the mongoexport feature of MongoDB to export the data in CSV format and mysqlimport to import it into MySQL.
Here are links to the documentation MySQL import and MongoDB Export.
One more method you can try: write a program in one of your favorite languages that reads from MongoDB and writes into MySQL.
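For the import half, mysqlimport is essentially a command-line wrapper around LOAD DATA INFILE, so a hedged sketch of that step might look like this (the file, table, and column names are hypothetical, and the CSV is assumed to come from mongoexport with a header line):

    -- load the CSV produced by mongoexport into a staging table
    LOAD DATA LOCAL INFILE 'visitor_actions.csv'
    INTO TABLE visitor_actions
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES                      -- skip the header row
    (tracking_id, action, page_url, created_at);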
MySQL 5.7 has a new JSON data type that can be very convenient.
You can create a table in MySQL to receive the JSON documents as-is, and then use SQL to query them, or do some post-processing to load the data into a structured set of tables.
Check this out: https://dev.mysql.com/doc/refman/5.7/en/json.html
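A minimal sketch of what that looks like, assuming a hypothetical staging table for the raw tracking documents:

    -- land the Mongo documents as-is in a JSON column
    CREATE TABLE tracking_events (
        id  BIGINT AUTO_INCREMENT PRIMARY KEY,
        doc JSON NOT NULL
    );

    -- then query fields inside the documents with the JSON functions
    SELECT JSON_UNQUOTE(JSON_EXTRACT(doc, '$.trackingId')) AS tracking_id,
           JSON_UNQUOTE(JSON_EXTRACT(doc, '$.action'))     AS action
    FROM   tracking_events
    WHERE  JSON_UNQUOTE(JSON_EXTRACT(doc, '$.action')) = 'click';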
I realise this question is a few years old - but recently I've had a number of people enquiring whether a tool I developed (https://virtual.blue/apps/json-converter) can do exactly what the OP is asking (convert MongoDB to SQL) so I am guessing it is still something people want. Keep reading to find out why I am honestly not surprised by this.
The short answer to whether the tool can help you is: perhaps. If your existing data relationships are not too complicated, and your database is not enormous, it may well be worth a try.
However, I thought it might help to try and explain what the issues are with this kind of conversion, since all the answers I have seen so far are along the lines of "try tool X" or "first convert to format Y and then you can slurp it into MySQL using utility Z", i.e. there is no thought to whether what you get at the end of doing this is going to make sense in terms of data relationships and integrity.
For example, you could just stick your entire database dump in a single field of a single SQL table (ok space limitations might prevent this in reality, but hopefully you get my point). Then your database would be "in MySQL format", but it would be absolutely no use to anyone.
The point is, what you actually want is a fully defined database model, correctly encapsulating all of the intrinsic data relationships. ("Database normalization" as it is known.) If your conversion process gets those relationships wrong, then you have a broken model, and any queries you try to run over it are likely to return nonsense. Unfortunately there is no magic tool that is just going to "know" the best way to represent your data in MySQL, and closing your eyes and shovelling it into a bunch of random tools is unlikely to miraculously get you what you want.
And herein lies the fundamental problem with the "NoSQL" philosophy (fad). They sold people the bogus notion of "non-relational data". My first thought when I heard this was, "How does that work? Surely all data is relational?" By the looks of things we are steadily getting more and more evidence that my instincts were right. ("NoSQL? Why stop there? I go with 'NoDatabase'. It returns no results at all, but it sure is fast!")
The NoSQL madness throws several important fundamental engineering principles to the wind. We shouted "don't hard code!", "DRY!" (Don't Repeat Yourself) because these actions infuse inflexibility into systems. Traditional wisdom makes precisely the same flexibility argument when it advises "create a fully described model with all the data relationships represented". Then you can execute any arbitrary query over it and expect meaningful results. "Yes but there are a whole bunch of queries we are never going to need to run," says the NoSQL proponent. But surely we learnt our lesson on things we are "never going to need to do"? ("I hard code liberally, because I know I am never going to want to change my code." Hmm...)
The arguments about speed are largely moot. Say it turns out you are frequently doing a complex 9 table join, with unsurprisingly sluggish performance. So create an index. Cache it. Swap some disk space for speed. The NoSQL philosophy is to swap data integrity for speed, which makes no sense at all.
When you generate your fast lookup index (cache/table/map/whatever) what you are really doing is creating a view over your model. If your model changes, you can readily update your view. Going from a model to a view is easy - it's a one to many operation and you are on the right side of entropy.
However, when you went with MongoDB you effectively decided to create views without bothering to describe your fundamental model. Now you discover there are queries you want to run, but can't - and so it's no wonder you want to move over to SQL and actually have your data modelled correctly. The problem is you now want to go from a view to a model. Now you're on the wrong side of entropy. Your view is a lossy representation of the model's fundamental relationships. You can't expect a tool to "translate" your database, because you are asking it to insert new relationships which were not originally defined. These are real world relationships that are not machine-guessable. The tool cannot know what relationships were intended.
In short the only way you can do this reliably is to get your hands dirty. An intelligent human, with complete understanding of the system you are modelling needs to sit down and carefully come up with (possibly a substantial amount of) code which effectively picks through the data and resolves all of the insufficiently represented data relationships. If your data is complex then it's going to be a headache and there is no way to cheat.
If your data is still relatively simple then I would suggest making the conversion as soon as possible, before it becomes difficult. In this case my tool (https://virtual.blue/apps/json-converter) may be able to help.
(They really should have asked a Physicist before they came up with all this nonsense...!)
You can download a trial version of Studio 3T for Mongo and export your database to SQL (or JSON) directly.

SQL Assemblies vs Application code for complicated queries on large XML columns

I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple webservice which uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, a list of all the distinct values for a certain element, things like that.
I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this ( XPath | XQuery | XSLT ) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
My plan was to embed an assembly in SQL Server (I'm using 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking entirely up the wrong tree here, I guess).
Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.
Any advice here would be appreciated.
Thanks
Edit:
Thanks guys, you've all been a big help. The issue was that we were generating a row in the table for each file, and each file could have multiple results, and we would do this each time we ran a particular build job. I wanted to flatten this out into a table view.
Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests was generating thousands of results (MSIVAL tests were the worst culprit).
The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result of each test on each file, with the XML holding just the details of that one result; this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool, but the performance is much better. I now have a view which creates a flattened version of one of the classes of results emitted by the build job; this returns >200,000 rows and takes <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
I now have some issues with the number of times I connect, but I have an idea of how to fix that.
Thanks again! +1's all round
I suggest using the standard xml tools in TSQL. (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use this I would recommend processing the xml on another machine.
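To give a feel for those built-in tools, here is a minimal sketch of shredding an XML column with nodes() and value(); the table, column, and element names are made up for illustration:

    -- one output row per <result> element found in each stored XML document
    SELECT  t.Id,
            r.n.value('(@name)[1]', 'varchar(100)')  AS ResultName,
            r.n.value('(value)[1]', 'varchar(200)')  AS ResultValue
    FROM    dbo.BuildResults AS t
    CROSS APPLY t.XmlData.nodes('/results/result') AS r(n)
    WHERE   r.n.value('(@name)[1]', 'varchar(100)') = 'MSIVAL';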
SQLCLR is perfect for smaller functions, but with the restrictions on the usable methods it tends to become an exercise in frustration once you are trying to do more advanced things.
What you're asking about is really a huge balancing act and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap. Secondly, the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
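If the shredding does stay in the database, an XML index is the usual first mitigation; a hedged sketch, with hypothetical object names:

    -- a primary XML index persists a pre-shredded node table so queries
    -- don't have to re-parse the XML blob on every row
    CREATE PRIMARY XML INDEX PXML_BuildResults_XmlData
        ON dbo.BuildResults (XmlData);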
Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML which will destroy your performance. Maybe you're not running the most powerful database hardware and shredding XML is going to be performance prohibitive for the CPU you've got on your database machine.
At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web services machine will almost undoubtedly be easier, as LINQ to XML is a more elegant way of parsing through XML than XQuery shoehorned into T-SQL is. My inclination, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run, because you're doing XML parsing on every row, or at least most rows, in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said for taking the easier and more maintainable route of doing all the parsing on the application server.

How can I integrate advanced computations into a database field?

My biological research involves the measurement of a cellular structure as it changes length throughout the course of observation (capturing images every minute for several hours). As my data sets have become larger I am trying to store them in an Access database, from which I would like to perform various queries about their changes in size.
I know that the SELECT statement can incorporate some mathematical operations, but I have been unable to incorporate many of my necessary calculations (probably due to my lack of knowledge). For example, one calculation involves determining the rate of change during specifically defined periods of growth. This calculation is entirely dependent on the raw data saved in the table, so I didn't think it would be appropriate to just calculate it in Excel prior to entry into the field.
So my question is: what would be the most appropriate method of performing this calculation? Should I attempt to string together a huge SELECT calculation in my query, or is there a way to use another language (I know Perl) which can be called to populate the new query field?
I'm not looking for someone to write the code, just for where it is appropriate to incorporate each step. Also, I am currently using Office Access but would be interested in any MySQL answers, as I may be moving to this platform at a later date. Thanks all!
You could encapsulate your logic and maths into a custom function in VBA and then call that in your SELECT statement. This approach would also work with other database engines, but the exact wording might be slightly different.
Doing it in SQL will be a lot faster, but much harder to debug (I'm guessing that you're looking at things like ANOVA, t-tests, chi^2, etc.).
Having said that, you may want to store and calculate interim values like the delay since the previous measurement and the change in measurement.
OTOH, the metrics you describe are very simple to do in SQL:
one calculation involves determining the rate of change during specifically defined periods of growth
C.
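For the rate-of-change calculation quoted above, a minimal sketch in MySQL flavour (Access would use DateDiff rather than TIMESTAMPDIFF); the Measurements table and its columns are hypothetical:

    -- growth rate between each observation and the previous one for the same structure
    SELECT  cur.StructureID,
            cur.ObservedAt,
            (cur.LengthUm - prev.LengthUm)
              / TIMESTAMPDIFF(MINUTE, prev.ObservedAt, cur.ObservedAt) AS GrowthPerMinute
    FROM    Measurements AS cur
    JOIN    Measurements AS prev
            ON prev.StructureID = cur.StructureID
    WHERE   prev.ObservedAt = (SELECT MAX(p.ObservedAt)
                               FROM   Measurements AS p
                               WHERE  p.StructureID = cur.StructureID
                                 AND  p.ObservedAt  < cur.ObservedAt);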
If you need to populate a field with the result of a complex calculation, it will be easier to do the calculation in a full-fledged programming language, such as Perl, instead of trying to do it in SQL.
Perl has a very good database API, "DBI", with drivers for just about every database engine known to man. Here's a good short article on DBI:
http://www.perl.com/pub/a/1999/10/DBI.html
You would be at the mercy of the SQL implementation for calculation options, precision, etc. Better to use a separate language where you have control, extensibility and flexibility to store the results you want and need.
I'd propose that you use Access as a front-end tool for entering, editing and printing your data. You can store the data in any back-end database engine (MySQL, SQL Server, etc.), though Jet/ACE (the default Access database engine) is likely to be completely adequate unless your data set gets very, very large (it's limited to 2GBs but you don't really want to continue using it if your data grows to much over 1GB during regular usage).
For complex statistical analysis, though, I'd recommend considering exporting the data and using a proper statistics package for doing the analysis. This means your reporting might all be done from there.
In that situation, you could leverage Access's capabilities in allowing you to create an interface for selecting the datasets you wanted to export for analysis. The last time I did this for a client, they were using SPSS for the data analysis and I built them a very flexible export interface (they could choose any variables they liked for analysis).
Whether this is a helpful alternative depends on the extent and type of the analysis you're going to do. If you're using a lot of functions that Access VBA lacks and have to borrow them from Excel or write replacements for them, then you might be better off doing all of that in some other program.
Also, it may be that some or many or all of your calculations belong in the presentation layer and not in SQL. Access reports have a lot of capabilities here, and if you're summarizing data, it may be best done at that level, rather than in the SQL recordsources underlying your reports.