I am new to KNIME and I have a question about the GroupBy node.
I have a data set representing a shopping cart, with the following columns:
Session Number (integer)
CustomerID (String)
Start Hour
Duration
ClickedProducts
AgeAddress
LastOrder
Payments
CustomerScore
Order
where Order is a char (Y = purchase, N = no purchase).
I saw in my data set that a Session Number can appear on more than one row, so I used the GroupBy node and grouped by SessionID. But in the resulting table, I only see the column I grouped by.
I would like some advice on whether I need another node to bring the other columns back as aggregations.
Thank you
What exactly is your question? Whether there is a KNIME example similar to this problem? I don't know of any.
The grouping and the prediction can of course be done in KNIME. Use the GroupBy node to group by CustomerID and Session. Values of the other fields can be aggregated in various ways. Then use the Partitioning node to split your data into training and test sets. Then use a learner, e.g. the Decision Tree Learner node, to train a model on the training data. Use the Decision Tree Predictor to apply the trained model to the test data. Finally, use the Scorer node to calculate accuracy and other quality measures. Of course you can also do cross-validation in KNIME to score your models.
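For comparison, here is a rough scikit-learn sketch of that same workflow. It is only a sketch: the file name, the chosen aggregations, and the feature columns are assumptions based on the question, not taken from the actual data set.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("shopping_cart.csv")  # hypothetical file

# GroupBy node: one row per customer and session, aggregating the rest
grouped = df.groupby(["CustomerID", "Session Number"], as_index=False).agg(
    Duration=("Duration", "sum"),
    ClickedProducts=("ClickedProducts", "sum"),
    Order=("Order", "first"),  # the class label: Y / N
)

# Partitioning node: split into training and test sets
X = grouped[["Duration", "ClickedProducts"]]
y = grouped["Order"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Decision Tree Learner + Decision Tree Predictor nodes
model = DecisionTreeClassifier().fit(X_train, y_train)
predictions = model.predict(X_test)

# Scorer node: accuracy
print("Accuracy:", accuracy_score(y_test, predictions))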
Hope this helps.
I am trying to understand how the FFT works, so to get a grip on the underlying math I tried to build an example. I saw that there is a formula for calculating a frequency value from the FFT output: you take the index k of the complex number in the FFT list, the sampling rate fs, and the total number of items N in the FFT list, and from those you obtain the frequency as f = k * fs / N.
But in this calculation we seem to ignore the complex numbers themselves, and I don't understand why. Can someone give me a clue about that?
This video shows lots of complex-valued FFT outputs. But the presenter simply ignores the complex values and finds the index (k), since the sampling rate and the length of the FFT result (N) are already known. From that he obtains the frequency value. Is it normal to ignore all of the complex values, or am I missing something about this calculation?
These are my complex values, and I want to calculate the frequency value by hand using the formula. How can I do that?
Thanks in advance for all comments.
I tried the FFT calculation, but ignoring the complex numbers got me stuck.
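For concreteness, here is a minimal NumPy sketch of the relationship f = k * fs / N; the signal and parameters are invented for illustration. Note that the complex values are not truly ignored: their magnitudes determine which index k matters, and their phases carry each component's phase offset.

import numpy as np

fs = 100.0                       # sampling rate in Hz (made up)
N = 1000                         # number of samples
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 7.0 * t)  # a pure 7 Hz tone

X = np.fft.fft(x)                # the complex FFT values
mags = np.abs(X[:N // 2])        # magnitude of each complex value
k = int(np.argmax(mags))         # index of the strongest bin
print(k * fs / N)                # f = k * fs / N  ->  7.0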
I am working on parking-occupancy prediction using machine learning, specifically random forest regression. I have 6 features. I have tried to implement the random forest model, but the results are not good. As I am very new to this, I do not know what kind of model is suitable for this kind of problem. My dataset is huge: 47 million rows. I have also used RandomizedSearchCV, but I cannot improve the model. Kindly have a look at the code below and help me improve it or suggest another model.
Random forest regression
The features were extracted from the location data of the parking lots using a buffer. Kindly help me improve this.
So, the variables you use are:
['restaurants_pts','population','res_percent','com_percent','supermarkt_pts', 'bank_pts']
The thing I see is that, for a given parking lot, those variables never change, so the regression will just predict the "average" occupancy of that lot. A key part of your problem seems to be that occupancy is not the same at 5pm and at 4am...
I'd suggest you work on a time variable (e.g. arrival time) so that it becomes usable.
By itself the variable cannot be understood by the model, but you can derive categories from it. For example, preprocess it to keep only the HOUR, and then bucket the hours into categories (either each hour being its own category, or larger buckets like ['midnight - 6am', '6am - 10am', '10am - 2pm', '2pm - 6pm', '6pm - midnight']), as in the sketch below.
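A minimal pandas sketch of that preprocessing; the DataFrame contents and the "arrival" column name are hypothetical.

import pandas as pd

df = pd.DataFrame({"arrival": pd.to_datetime([
    "2023-05-01 04:30", "2023-05-01 08:15", "2023-05-01 13:45",
    "2023-05-01 17:20", "2023-05-01 22:05",
])})

# Keep only the HOUR, then bucket the hours into categories
hour = df["arrival"].dt.hour
df["time_of_day"] = pd.cut(
    hour,
    bins=[0, 6, 10, 14, 18, 24],  # [0,6), [6,10), [10,14), [14,18), [18,24)
    labels=["midnight-6am", "6am-10am", "10am-2pm", "2pm-6pm", "6pm-midnight"],
    right=False,                  # include the left edge of each bin
)
print(df)

The resulting categorical column can then be one-hot encoded and fed to the random forest alongside the static location features.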
I am wondering if it is possible to get a random set of IDs from a table while making sure one ID in particular is included.
Say I have 200 rows and my script outputs 20 of them; one of those 20 must be the row with id 2 (for example).
Not sure if this is possible; I would appreciate any help.
-- Give id 2 a sort value lower than any random value, so it always makes the cut
SELECT id, IF(id = 2, -1, RAND()) AS sort
FROM my_table
ORDER BY sort
LIMIT 20;
Not the final solution, but maybe this thread helps you:
MySQL select 10 random rows from 600K rows fast
By the way: I'd handle the random selection within the script (e.g. PHP) with cached (e.g. Memcached) datasets. But that depends on your goal.
I am working on a project where I need to calculate some average values based on user interaction on a site.
The number of records whose total average needs to be calculated can range from a few to thousands.
My question is: at which threshold would it be wise to store the aggregated data in a separate table, updated by a stored procedure every time a new record is generated, instead of just calculating it every time it is needed?
Thanks in advance.
Don't do it until you start having performance problems caused by the time it takes to aggregate your data.
Then do it.
If discovering this bottleneck in production is unacceptable, then run the system in a test environment that accurately matches your production environment and load in test data that accurately matches production data. If you hit a performance bottleneck in that environment that is caused by aggregation time, then do it.
You need to weigh the need for current data against the need for quick data. If you absolutely need current data, then you have to live with longer delays in your queries. If you absolutely need your data as fast as possible, then you will have to deal with slightly stale data.
You can time your queries and time the insertion into the separate table, and evaluate which best fits your needs.
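One more point worth noting: if you do precompute, the stored average never has to be rebuilt from scratch, since a running average can be updated in O(1) from a stored count and sum. A minimal Python sketch of the idea, purely illustrative and assuming no particular schema:

# Maintaining a running average incrementally instead of
# re-aggregating every row on each read.
class RunningAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        # What the stored procedure would do on each new record:
        # bump the count and the sum kept in the summary table.
        self.count += 1
        self.total += value

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0

avg = RunningAverage()
for v in [4.0, 8.0, 15.0]:
    avg.add(v)
print(avg.average)  # 9.0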
Is there an algorithm for performing reservoir sampling when the points in the data stream have associated weights?
The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.
The algorithm works as follows. First, observe that another way to solve unweighted reservoir sampling is to assign each element a random key R between 0 and 1 and incrementally (say, with a min-heap) keep track of the top k keys. Now consider the weighted version, and say the i-th element has weight w_i. Then we modify the algorithm by choosing the key of the i-th element to be R^(1/w_i), where R is again uniformly distributed in (0, 1); see the sketch below.
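Here is a minimal Python sketch of that weighted version (a hypothetical helper, not the authors' reference implementation):

import heapq
import random

def weighted_reservoir_sample(stream, k):
    # Keep the k items with the largest keys R**(1/w),
    # where R is uniform in (0, 1)  (Efraimidis & Spirakis).
    heap = []  # min-heap of (key, item): the smallest key is evicted first
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Example: sample 2 items; "b" and "d" are favored by their weights
print(weighted_reservoir_sample([("a", 1.0), ("b", 5.0), ("c", 0.5), ("d", 3.0)], 2))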
Another article talking about this algorithm is this one by the Cloudera folks.
You can try the A-ES algorithm from this paper by S. Efraimidis. It's quite simple to code and very efficient.
Hope this helps,
Benoit