Weka J48 Gets stuck on Building Model on Training Data - csv

I'm trying to use Weka to look at my data sets. When I load my data set, go to Classify, choose J48, and click Start, it begins normally: the bird in the bottom right-hand corner walks back and forth and there is an "x 1" next to it. The status updates to "Building model on training data", but after a second or two the bird stops and sits back down, the counter changes to "x 0", and no further progress is made after that.
The file I am looking at is a CSV file with 5 columns. The first row is a row of labels, and there are 1971 rows in total.
I have done some research on this and found no solutions. Possibly I'm looking in the wrong place? Any guidance or resolutions to this issue would be much appreciated!
Img of Screen when stopped

I may have found a solution myself... It could have been due to the size of the data. I reduced the amount of data and it started loading what look like large matrices. I'll provide two screenshots. Does this seem like a safe assumption to make, that it was due to data size? Confirmation would be appreciated.
Screenshot of J48 Results 1
Screenshot of J48 Results 2
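If you want to see why the model build stops, rather than guessing from the GUI, one option is to run J48 on the same CSV from a few lines of Java, where any exception or out-of-memory error is printed instead of the run just ending silently. Below is a minimal sketch using Weka's Java API; the file name mydata.csv and the choice of the last column as the class are assumptions for illustration, not taken from the question.

```java
import java.io.File;

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class J48FromCsv {
    public static void main(String[] args) throws Exception {
        // Load the CSV; the first row is used for the attribute names.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("mydata.csv"));      // hypothetical file name
        Instances data = loader.getDataSet();

        // Tell Weka which column is the class; here the last one is assumed.
        data.setClassIndex(data.numAttributes() - 1);

        // Build the same classifier the Explorer uses and print the resulting tree.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```

If data size really is the culprit, the usual remedy is to give the JVM a larger heap, for example starting Weka with java -Xmx2g -jar weka.jar, though with only 1971 rows and 5 columns running out of memory would be unusual.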

Related

Data pipeline proposal

Our product has been growing steadily over the last few years and we are now at a turning point as far as the data size of some of our tables is concerned: we expect those tables to double or triple in size in the next few months, and to grow even more in the next few years. We are talking about roughly 1.4M rows now, so over 3M by the end of the summer and, since we expect growth to be exponential, around 10M by the end of the year (M being million, not mega/1000).
The table we are talking about is essentially a logging table. The application receives data files (CSV/XLS) on a daily basis and the data is transferred into that table. It is then used in the application for a certain amount of time - a couple of weeks or months - after which it becomes rather redundant. That is, if all goes well: if there is some problem down the road, the data in those rows can be useful to inspect for troubleshooting.
What we would like to do is periodically clean up the table, removing rows based on certain criteria - but instead of actually deleting the rows, moving them 'somewhere else'.
We currently use MySQL as the database, and the 'somewhere else' could be MySQL as well, but it can be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here: it's just a few tables, where the Master table would only get shorter and the Slave only bigger - not a one-to-one sync.
The main requirement for the secondary store is that the data should be easy to inspect and query when needed, either with SQL or another DSL, or with visual tooling. So we are not interested in backing the data up to one or more CSV files or another plain-text format, since that is not as easy to inspect: the logs would then sit somewhere on S3 and we would need to download them and grep/sed/awk through them. We would much rather have something database-like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want Apache Kafka (for example), but then we'd have to learn it, install it, and maintain it. Every new piece of technology adds to our stack, and the lighter it remains, the more we like it ;).
Thanks!
PS: we are not just being lazy here; we have done some research, but we thought it would be a good idea to get some more insight into the problem.
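For what it's worth, one pattern that stays entirely inside MySQL is a periodic "copy then delete" into an archive table with the same schema, wrapped in a single transaction so rows never end up in both places or in neither. Below is a rough sketch in Java/JDBC; the table names app_log and app_log_archive and the created_at cutoff column are assumptions for illustration, not taken from your schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class LogArchiver {
    // Move rows older than the cutoff from the live table into an archive table
    // with the same schema, inside one transaction.
    public static void archiveOldRows(Connection con, Timestamp cutoff) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement copy = con.prepareStatement(
                 "INSERT INTO app_log_archive SELECT * FROM app_log WHERE created_at < ?");
             PreparedStatement purge = con.prepareStatement(
                 "DELETE FROM app_log WHERE created_at < ?")) {
            copy.setTimestamp(1, cutoff);
            purge.setTimestamp(1, cutoff);
            int copied = copy.executeUpdate();
            int deleted = purge.executeUpdate();
            if (copied != deleted) {
                // Something changed between the two statements; give up and retry later.
                throw new SQLException("copied " + copied + " rows but deleted " + deleted);
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }
}
```

The archive table remains queryable with ordinary SQL, which meets the "easy to inspect" requirement, and the job can be run from cron or whatever scheduler is already in place, without adding new infrastructure.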

Is dc.js (used with crossfilter and d3.js) still a good option for Big data visualization on a browser page?

I'm trying to build an online dashboard to visualize a large CSV dataset and I want to be sure I'm following the right path.
Thank you everybody.
Crossfilter will speedily handle up to around 500K rows of data, maybe more depending on column complexity.
Around that size you also have to consider the time that it will take to download the data to the browser, affecting when the charts will appear in the page.
If your data is bigger than that, and you still want to use dc.js, you have two options:
Pre-aggregate your data: instead of counting rows with group.reduceCount(), use group.reduceSum() and have a column with pre-summed integers. Of course, you will not be able to drill into pre-aggregated data, so this is only effective if you can accept a coarser granularity for the dimensions of your charts (a rough sketch of the idea follows after this list).
Use a server-side replacement for crossfilter, such as elastic-dc. There are other solutions floating around but I think Deepak has the most complete solution.
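As a concrete illustration of the first option, the pre-aggregation itself can happen in whatever backend you already have, before the rows ever reach the browser. The sketch below uses Java purely as an example (the category and value fields are assumptions): it collapses many raw rows into one pre-summed row per category, which group.reduceSum() can then consume on the client.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreAggregate {
    // A raw record as it might appear in the large CSV.
    record Row(String category, int value) {}

    // Collapse raw rows into one summed row per category, so the browser
    // downloads (and crossfilter indexes) far fewer records.
    static Map<String, Integer> sumByCategory(List<Row> rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Row r : rows) {
            totals.merge(r.category(), r.value(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> raw = List.of(new Row("A", 3), new Row("A", 4), new Row("B", 10));
        System.out.println(sumByCategory(raw));   // {A=7, B=10}
    }
}
```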

MySQL query to VB.NET line chart

I'm creating a trend (line chart) in VB.NET, populated with data from a MySQL database. Currently I have datetime on the x-axis and just a test value (integer) on the y-axis. As in most HMI/SCADA trends, you can pause and scroll backwards/forwards through the trend (I just have buttons for this). My y-axis interval is changeable. What I do is take the Id from the DB as a min/max; when I want to scroll left/right I just increment/decrement the min and max and then use these new values in the WHERE clause of the SQL query. All of this works very well: it has very little overhead and is very fast and responsive. So far I have tested with over 10 million rows and don't see any noticeable performance difference. However, at some point my Id (being auto-increment) will run out. I know it's a large number, but depending on my poll rate it could fill up in just a few years.
So, does anyone have a better approach? Basically I'm trying to eliminate the need to query more points than I can view in the chart at one time. Also, if I have millions of rows, I don't want to load those points if I don't have to. I also need this to be somewhat future-proof. Right now I just don't feel comfortable with it.
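One possible way around the auto-increment worry is to window on the datetime column itself (with an index on it) rather than on the Id, so scrolling just shifts a time window. Below is a minimal sketch in Java/JDBC, purely for illustration; the table trend_data and the columns ts and reading are assumptions, and the same parameterized query can be issued from VB.NET with the MySQL connector.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class TrendWindow {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details and window bounds.
        String url = "jdbc:mysql://localhost:3306/scada";
        Timestamp windowStart = Timestamp.valueOf("2024-01-01 00:00:00");
        Timestamp windowEnd   = Timestamp.valueOf("2024-01-01 01:00:00");

        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT ts, reading FROM trend_data WHERE ts BETWEEN ? AND ? ORDER BY ts")) {
            ps.setTimestamp(1, windowStart);
            ps.setTimestamp(2, windowEnd);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // In the real application each point would be added to the chart series.
                    System.out.println(rs.getTimestamp("ts") + " -> " + rs.getInt("reading"));
                }
            }
        }
    }
}
```

Scrolling left or right then becomes shifting windowStart and windowEnd by a fixed interval, and the query only ever touches the points that are visible in the chart.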

Waiting in Pentaho Kettle (Spoon)

I am new to Pentaho Kettle and I want to do multiple operations in one transformation.
First I am:
inserting data from a text file into a main table;
then loading some of the columns from the main table into a second table, based on some conditions.
But the problem is that I can only do the second step after the first step has completed, because the second step depends on the output of the first.
My first step alone takes almost 20 minutes.
Also, in that same transformation I have to load other data from different tables too.
I don't know whether Kettle provides a dedicated option for this, like a switch or something similar; I have searched the web a lot but didn't find anything.
So can anyone help me solve this problem?
That's exactly what the "Blocking Step" does - give it a try.
http://www.nicholasgoodman.com/bt/blog/2008/06/25/ordered-rows-in-kettle/
Or split your transform into multiple transforms and orchestrate them in a Job. If your transforms are simple, I would tend towards using the blocking steps, but I find that using them too much makes the transforms messy and complex. Wrapping transforms in Jobs usually gives you more control.
Brian

MySQL - Saving items

This is a follow-up to my last question: MySQL - Best method to saving and loading items
Anyway, I've looked at some other examples and sources, and most of them use the same method of saving items: first they delete all the rows already in the database that reference the character, then they insert new rows according to the items the character currently has.
I just wanted to ask whether this is a good way to do it, and whether it would cause a performance hit if I were to save around 500 items per character. If you have a better solution, please tell me!
Thanks in advance, AJ Ravindiran.
It would help if you talked about your game so we could get a better idea of your data requirements.
I'd say it depends. :)
Are the slot/bank updates happening constantly as the person plays, or only when the person saves their game and leaves? Also, does the order of the slots really matter for the bank slots? Constantly deleting and inserting 500 records certainly can have a performance hit, but there may be a better way to do it: possibly you could just update the 500 records without deleting them. Possibly your first idea of 0=4151:54;1=995:5000;2=521:1;
wasn't so bad, if the database is only being used for storing that information and the game itself manages it once it's loaded. But if you might want to use the database for other things, like "Which players have item X?" or "What is the total value of the items in player Y's bank?", then storing it that way won't let you ask the database; the answer would have to be computed by the game.
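To make the "update instead of delete and re-insert" idea concrete, here is a rough sketch using Java/JDBC and MySQL's INSERT ... ON DUPLICATE KEY UPDATE, assuming a table like character_items(character_id, slot, item_id, amount) with a primary key on (character_id, slot); all of these names are illustrative assumptions, not taken from the question.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ItemSaver {
    // One entry per occupied slot: (slot, itemId, amount).
    public record SlotItem(int slot, int itemId, int amount) {}

    public static void saveItems(Connection con, int characterId, List<SlotItem> items)
            throws SQLException {
        String sql = "INSERT INTO character_items (character_id, slot, item_id, amount) " +
                     "VALUES (?, ?, ?, ?) " +
                     "ON DUPLICATE KEY UPDATE item_id = VALUES(item_id), amount = VALUES(amount)";
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (SlotItem item : items) {
                ps.setInt(1, characterId);
                ps.setInt(2, item.slot());
                ps.setInt(3, item.itemId());
                ps.setInt(4, item.amount());
                ps.addBatch();            // batch all the rows instead of sending them one by one
            }
            ps.executeBatch();
            con.commit();                 // one transaction instead of 500 separate writes
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }
}
```

Slots that were emptied since the last save would still need a targeted DELETE, but that usually touches far fewer rows than wiping and re-inserting all 500 every time.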