Open Source Data Mining Software [closed] - open-source

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I was wondering: what is the best open source software I can use for non-binary association rule generation? I need a non-binary implementation because converting my currently non-binary data to binary data would not give the desired results.
Thanks, and I can't wait to hear your comments!

Also take a look at Weka

Check out:
RapidMiner
and
R with Rattle

Try the Orange data mining toolkit.
http://www.ailab.si/orange/

Try Data Mining SDK.

These days I like Knime. See http://knime.org.

You could even try another one called Tanagra: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
It's mainly for research purposes, but it works well and has good tutorials here:
http://data-mining-tutorials.blogspot.com

I have an open-source software named SPMF with more than 130 algorithms related to association rules mining, frequent itemset mining, sequential rule mining and sequential pattern mining. You can check my webpage for more details and to download it:
It is Java source code. It has a simple graphical user interface. It also has many specialized algorithms that you will not find in other data mining software.
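
To make the "non-binary" requirement in the question concrete, here is a minimal Python sketch that scores a rule directly over categorical records, without first building a one-hot (binary) matrix. It assumes that "non-binary" means attributes with more than two categorical values; the toy records, the attribute names, and the `rule_stats` helper are all hypothetical and are not taken from any of the tools mentioned above.

```python
# Hypothetical toy records: each row maps an attribute to a categorical value.
records = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
]


def matches(row, condition):
    """True if the row has every attribute=value pair in the condition."""
    return all(row.get(attr) == value for attr, value in condition.items())


def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Both arguments are dicts of attribute -> categorical value.
    """
    n_total = len(rows)
    n_antecedent = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(
        1 for r in rows if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


if __name__ == "__main__":
    print(rule_stats(records, {"outlook": "sunny"}, {"play": "no"}))
```

This only scores a rule you already have in mind; a real miner such as Apriori or FP-Growth would also enumerate candidate rules and prune them by minimum support.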

Related

What are some good data cleanup tools? [closed]

Closed 10 years ago.
I am parsing large amounts of complex files (mostly CSV files, but some are not) and I need to structure/parse them into some standard formats. This involves not only row-wise cleanup of data but also some simple cell-level logic. I want a tool that a non-programmer can also use, so that a business team member can write simple drag-and-drop logic without taking up engineering time. So far, I have looked at Google Refine and Data Wrangler, and the latter looks great. Are there any other such tools out there?
ETL tools are oriented more towards relational databases, but also have support for XML and CSV file input/output. Examples:
http://www.talendforge.org/
http://kettle.pentaho.com/
They could easily be too complicated for your requirements, though. Also, see this similar question on SO (with additional links): What software is availible for data quality checking.

ETL interview questions? [closed]

Closed 11 years ago.
In 5 days I have an ETL interview. It's my first interview on this subject. What questions am I likely to be asked? Most likely they will be about MS SQL Server Integration Services.
If possible, provide the answers. =)
Keep it high-level if you have to, but don't ask a question that you couldn't answer yourself.
I agree with Brad that syntax is not important, it's the thought process.
Another idea is to ask them about how they would pack up and move an office. It gives you insight into the same kinds of decisions needed in ETL (prep, actual moving of stuff, and validation), and you might be more comfortable talking about that than the details of SSIS
Think practically. Hand them a printout of a sample file that might need to be imported (possibly simplified to save time). Have them talk about database design, considerations, concerns, and possible ways to improve the data. Then bring out a second printout of somehow related data and see if they can figure out how to validate the one from the other.
Make sure you talk about how much time is available to perform the ETL processes based on business rules and environment.
Require as much pseudo-code as you like, but I personally subscribe to the idea that syntax can be taught cheaply, but learning how to think is a very expensive thing to teach someone; and sometimes it's not even successful.
Also, ask them what standards they would implement if they were to design the optimum layout of the source data. Make sure you consider data distribution beyond your company (if applicable).
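
The "validate the one from the other" exercise above can be sketched in a few lines of Python. This is only an illustration under assumptions: the two sample files, the `customer_id` key, and the `find_orphans` helper are hypothetical, and a real SSIS package would more likely do this with a Lookup transformation or a SQL join.

```python
import csv
import io

# Hypothetical sample files: a master list and a related detail file.
CUSTOMERS_CSV = """customer_id,name
1,Alice
2,Bob
"""

ORDERS_CSV = """order_id,customer_id,amount
100,1,25.00
101,2,10.50
102,3,99.99
"""


def find_orphans(master_text, detail_text, key):
    """Return detail rows whose key value does not appear in the master file."""
    master_keys = {row[key] for row in csv.DictReader(io.StringIO(master_text))}
    return [
        row
        for row in csv.DictReader(io.StringIO(detail_text))
        if row[key] not in master_keys
    ]


if __name__ == "__main__":
    # Orders that reference a customer missing from the master file.
    for row in find_orphans(CUSTOMERS_CSV, ORDERS_CSV, "customer_id"):
        print(row["order_id"], row["customer_id"])
```

A candidate who proposes a check like this, and then asks what should happen to the orphaned rows (reject, quarantine, or load with a placeholder), is showing the thought process the answers above are looking for.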

Best tutorial to learn SSIS [closed]

Closed 10 years ago.
Which book is best for learning SSIS? In my project we need to take input from a CSV file and, after processing the data in SQL Server 2008, export it back to an Excel file. ASP.NET is used for the UI.
Thanks,
Nabin
I completely agree with Cade in terms of simply working with it. I found that trying to follow specific "tutorials" to try and learn the package didn't really help but having a number of useful resources definitely came in handy.
At work, we had this book kicking around but really it just went over the flow objects available without going into any real-world examples. Jamie Thomson's blogs (here and here) are both excellent online resources though and have been really helpful for me personally.
Try this book:
http://www.amazon.com/Professional-Microsoft-Integration-Services-Programmer/dp/0470247959
The best way to learn SSIS is just to do it. It is probably best to start and then refer to the book. Because the tool is so GUI-intensive, I tended to get more out of the book when I read it later, once I was already somewhat familiar with the environment.
Reading the material sometimes won't solve your real-world migration problem, because it may not cover some particular functionality your project needs. I worked on a scenario like yours, migrating a database to SQL Server using intermediate CSV or text files.
http://msdn.microsoft.com/en-us/library/dd537533(SQL.100).aspx
We migrated nearly 1 TB in 30 minutes using SSIS 2008.
This could help you find information on specific properties of the source file according to your requirements.
thanks
prav

how do free online OCR programs compare to commercial ones? [closed]

Closed 10 years ago.
How much better would commercial OCR software be compared to the stuff that's available online for free?
More specifically: reading text in pictures (things like book covers, etc.).
I work with OCR quite a lot and can definitely vouch that the commercial offerings are much better than what you can find out there for free. Yes, you can make a free one 'work', but it will take a lot of effort for sub-optimal results.
I recommend finding a product that uses ABBYY FineReader: it does a great job with little configuration.
You may want to consider whether you need an SDK provided by the OCR supplier or an end-user application. The SDK will provide position details, etc., of what it finds and offer a lot more in-depth control, but will be more expensive. The end-user package will basically just read everything it finds, but you may be able to set it to automatic or control it in a rudimentary way. It might be good enough for what you're trying to do, and may be a lot cheaper.
Get a trial version and give it a go!
Google's OCRopus is free and open source, and one of the best.

What data mining tools do you use? [closed]

Closed 10 years ago.
Besides the two well-known Open Source tools RapidMiner and Weka, are there any other good tools (either Open Source or Commercial), which you can recommend for data mining?
Thanks in advance!
My money is on R, see e.g. the Machine Learning task view.
How about the open source Orange data mining toolkit?
http://www.ailab.si/orange/
You can look at my project - Data Mining SDK.
According to the KDnuggets Poll 2011, RapidMiner once more is the most widely used data mining solution world-wide:
http://www.kdnuggets.com/2011/05/tools-used-analytics-data-mining.html
If it is commercial software, the following two are awesome:
SAS
SPSS
Another very powerful open source tool is Knime. In some respects it is better than RapidMiner. As for commercial tools, here's what I've tried:
1. Polyanalyst
2. SPSS Clementine
3. Kxen
4. Statistica Data Miner
5. MATLAB
I like Polyanalyst the best. But it's just my opinion.
According to the yearly KDnuggets Polls 2007, 2008, and 2009, RapidMiner is the most widely used Open Source Data Mining Solution among data mining experts world-wide:
KDnuggets Data Mining Tool Poll 2009
RapidMiner is open source and 100% Java. It is much more flexible and offers significantly more functionality than Weka and KNIME.
JDMP http://www.jdmp.org/
The data mining tools I have used (also machine learning tools):
Weka: classification, clustering, association rules, decision trees, ...
Cluto: clustering
libsvm: classification
And from many posts, I find there are still other well-known tools which I haven't used:
Orange
R
RapidMiner
SAS
SPSS
There must be other useful tools that I'm not aware of.