Split Data operator in RapidMiner always generates the same splits

I am using RapidMiner Studio 7.0.001 on the Mac OS X platform.
While using the Split Data operator, I noticed that it always generates the same splits for my data. I am not using a local random seed, and all of the sampling types show the same problem.
Any help is appreciated.

In the top-level process there is a parameter called random seed. Set it to -1 so that a new random seed is generated from the system time on each run. Also make sure the use local random seed parameter in the Split Data operator is unchecked.

Related

SSIS reading csv produced by My Sql, code page conflict

I've got a 3rd party file coming in - utf-8 encoded, 56 columns, csv export from MySql. My intent is to load it into a Sql Server 2019 instance - a table layout I do not have control over.
Sql Server Import Wizard will automatically do the code page conversions to latin 1 (and a couple of string-to-int conversions), but it will not handle the MySql "\N"-for-null convention, so I thought I'd try my hand at SSIS to see if I could get the data cleaned up on ingestion.
I got a number of components set up to do various filtering and transforming (like the "\N" stuff) and that was all working fine. Then I tried to save the data using an OLE DB destination, and the wheels kinda fall off the cart.
SSIS appears to drop all of the automatic conversions Import Wizard would do and force you to make the conversions explicit.
I added a Data Conversion transformation into the flow and edited all 56 columns to be explicit about the various conversions - only, while it lets me edit the "Copy of" output column code pages, it will not save them, either in the Editor or in the Advanced Editor.
I saw another article here saying "Use the Derived Column Transformation" but that seems to be on a column-by-column basis (so I'd have to add 56 of them).
It seems kinda crazy that SSIS is such a major step backwards in this regard from Import Wizard, bcp, or BULK INSERT.
Is there a way to get it to work through the code page switch in SSIS using SSIS components? None of the components I've seen recommended seem to work, and all of the other articles say "make another table using different code pages or NVARCHAR and then copy one table to the other", which kinda defeats the purpose.
It took synthesizing a number of different posts on tangentially related issues, but I think I've finally gotten SSIS to do a lot of what Import Wizard and BULK INSERT gave for free.
It seems that to read a utf-8 csv file in with SSIS and to process it all the way through to a table that's in 1252 and not using NVARCHAR involves the following:
1) Create a Flat File Source component and set the incoming code page to 65001 (utf-8). In the Advanced Editor, convert all string columns from DT_STR/65001 to DT_WSTR (essentially NVARCHAR). It's easier to work with those outputs the rest of the way through your workflow, and (most importantly) a Data Conversion transform won't let you convert from 65001 to any other code page - but it will let you convert from DT_WSTR to DT_STR in a different code page.
1a) SSIS is pretty annoying about putting a default length of 50 on everything, and about not carrying lengths through as defaults from one component/transform to the next. So you have to go through and set the appropriate lengths on all the "Column 0" input columns from the Flat File Source and on all the WSTR output columns you create in that component.
1b) If your input file contains, as mine apparently does, the occasional invalid utf-8 sequence, choose "RD_RedirectRow" as the Truncation error handling for every column, then add a Flat File Destination to your workflow and attach the red line coming out of your Flat File Source to it. That's if you want to see which rows were bad. You can just choose "RD_IgnoreError" if you don't care about bad input. But leaving the default means your whole package will blow up if it hits any bad data.
2) Create a Script transformation component; in that script you can check each column for the MySql "\N" and change it to null.
3) Create a Data Conversion transformation component and add it to your workflow. Because of the DT_WSTR conversion in step 1, you can now change that output back to DT_STR in a different code page here. (If you don't change to DT_WSTR from the get-go, the Data Conversion component will not let you change the code page at this step.) 99% of the data I'm getting in just has latinate characters, utf-8 encoded (the accents), but there is a smattering of kanji characters in a small subset of the data, so to reproduce what Import Wizard does for you, you must change the Truncation error handling on every column here that might be impacted to RD_IgnoreError. Contrary to some documentation I read, RD_IgnoreError does not put null in the column; it puts in the text with the non-mapping characters replaced with "?", like we're all used to.
4) Add your OLE DB destination component and map all of the output columns from step 3 to the columns of your database table.
So, a lot of work just to get back to where Import Wizard started, before you begin to get the extra things SSIS can do for you. And SSIS can be kind of annoying about snapping column widths back to the default 50 when you change something; if you've got a lot of columns, this can get pretty tedious.

Convert comma to dot in Python or MySQL

I have a Python script which collects data and sends it to my MySQL table.
I noticed that the "Cost" sometimes is 0,95 which results in 0 in my table since my table use "0.95" instead of "0,95".
I assume the best solution is to convert the , to . in my Python script by using:
variable.replace(",", ".")
However, couldn't one solution be to change the format in my MySQL table, so that I store numbers in this format:
1100
0,95
0,1
150000
My Django Model
cost = models.DecimalField(max_digits=10, decimal_places=4, default=None)
Any feedback on how to best solve this issue?
Thanks
Your first instinct is correct: convert the "unusual" (comma-decimal) input into the standard format that MySQL uses by default (dot-decimal) at the first point where you receive it.
There are lots of ways to write numbers
Be careful, though, that you don't get stung by people using commas as thousands separators, like "3,203,907.23", or the European form "3.203.907,23", the Swiss "3'203'907,23", or even this form, which is widely used in India: "32,03,907.71" (yes, I did mean to type only two digits in that group!).
To make your life easier, the rule for currencies is relatively simple:
where a dot or comma is followed by only two digits at the end of the string, that character is acting as the decimal separator.
Once you know which character is the decimal separator, you can safely remove all other non-digits from the string, change the decimal separator you found to ., and then use any standard library string-to-number conversion.
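Here is a minimal Python sketch of that rule (the function name, and the fallback behaviour for strings with no two-digit decimal tail, are my own assumptions rather than part of this answer):

import re
from decimal import Decimal

def parse_amount(text):
    # Rule from above: if the last '.' or ',' is followed by exactly two
    # digits at the end of the string, treat it as the decimal separator.
    text = text.strip()
    match = re.search(r"[.,](\d{2})$", text)
    if match:
        # drop every grouping character from the integer part
        integer_part = re.sub(r"\D", "", text[:match.start()])
        return Decimal(f"{integer_part}.{match.group(1)}")
    # no two-digit decimal tail: treat every other non-digit as grouping noise
    return Decimal(re.sub(r"\D", "", text) or "0")

With this, "3,203,907.23", "3.203.907,23", "3'203'907,23" and "32,03,907.23" all come out as Decimal('3203907.23'), "0,95" becomes Decimal('0.95'), and "1100" stays Decimal('1100').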
Storage format isn't presentation format
Yes, you can tell MySQL to use comma as its decimal separator, but doing that will break so much of your code - including the parts of the framework that read from the database and expect dot-decimal numbers - that you'll regret doing it that way very quickly...
There's a general principle at work here: you should do your data storage and processing using a format that is easy to process, interchangeable with other systems, and understood by other software developers.
Consider what happens if you need to allow a different framework to access your MySQL database to generate reports... whoever develops that software (and it may be you) will be glad that the numbers are all stored the way numbers are "always" stored in databases.
Convert on the way in, re-convert on the way out
Where you need to accept input in a different format, convert that input into your standardised format as early as possible.
When you need to use an output format, do the conversion to that format as late as possible.
The idea is to keep as much of your system "unexceptional" as possible. A programmer who has to remember which numeric format will be in force at the time a given method is called is not a happy programmer.
P.S.
The option you're talking about in MySQL is an example of this pattern: it doesn't change how numeric data is stored. All that changes is how you pass numbers to MySQL and how it presents them back to you.

Passing a path as a parameter in Pentaho

In a Job I am checking whether the file that I want to read is available or not. If this csv exists, I want to read the data and save it to a database table within a transformation.
This is what I have done so far:
1) I have created the job, 2) I have defined some parameters, one of them holding the path to the file, 3) I have indicated that I am going to pass this value to the transformation.
Now, the thing is, I am sure this should be something very simple to implement, but even though I have followed some blog posts, I have not succeeded with this part of the process. I've tried to follow this example:
http://diethardsteiner.blogspot.com.co/2011/03/pentaho-data-integration-scheduling-and.html
My question remains the same: how can I indicate to the transformation that it has to use the parameter that I am passing to it from the job?
You just mixed up the columns
Parameter should be the name of the parameter in the transformation you are running.
Value is the value you are passing.
Since you are passing a variable, and not a constant value, you use the ${} syntax to indicate this.
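For illustration (the parameter name filePath is invented here, not taken from the question), the Parameters grid of the Transformation job entry would then contain something like:
Parameter: filePath
Value: ${filePath}
so that the transformation's own filePath parameter receives the value held by the job's filePath variable.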

Random selection from CSV file in Jmeter

I have a very large CSV file (8000+ items) of URLs that I'm reading with a CSV Data Set Config element. It is populating the path of an HTTP Request sampler and iterating through with a while controller.
This is fine, except that I want each user (thread) to pick a random URL from the CSV URL list. What I don't want is each thread using the CSV items sequentially.
I was able to achieve this with a Random Order Controller with multiple HTTP Request samplers; however, 8000+ HTTP samplers really bogged JMeter down to an unusable state. That is why I put the HTTP sampler URLs in the CSV file. It doesn't appear that I can use the Random Order Controller with the CSV file data, however. So how can I achieve random CSV data item selection per thread?
There is another way to achieve this:
create a separate thread group
depending on what you want to achieve, either:
add a (random) loop count -> this will set a start offset for the thread group that does the work, or
add a loop count (or forever) plus a timer, and let it loop while the other thread group is running; this thread group will read a 'pseudo'-random line
It's not really random - the file is still read sequentially - but your work thread makes jumps in the file. It worked for me ;-)
There's no random selection function when reading csv data. The reason is you would need to read the whole file into memory first to do this and that's a bad idea with a load test tool (any load test tool).
Other commercial tools solve this problem by automatically re-processing the data. In JMeter you can achieve the same manually by simply sorting the data on an arbitrary field: if you sort by, say, Surname, the result is an effectively random distribution.
Note: if you keep the default Sharing mode of All Threads for the CSV Data Set Config, the data will be unique within the scope of the JMeter process.
The new Random CSV Data Set Config from the BlazeMeter plugin should fit your needs perfectly.
As other answers have stated, the reason you're not able to select a line at random is because you would have to read the whole file into memory which is inefficient.
Rather than trying to get JMeter to handle this on the fly, why not just randomise the file order itself before you start the test?
A scripting language such as perl makes short work of this:
cat unrandom.csv | perl -MList::Util=shuffle -e 'print shuffle<STDIN>' > random.csv
For my case:
single column
small dataset
non-changing CSV
I just discarded the CSV and, following https://stackoverflow.com/a/22042337/6463291, used a BeanShell PreProcessor instead, something like this:
// Hard-code the values that used to live in the CSV file
String[] query = new String[]{"csv_element1", "csv_element2", "csv_element3"};
// Pick a random entry and expose it as the JMeter variable ${randomOption}
Random random = new Random();
int i = random.nextInt(query.length);
vars.put("randomOption", query[i]);
Performance seems OK; if you have the same issue you can try this out.
I am not sure if this will work, but I will suggest it anyway.
Why not divide your URLs into 100 different CSV files? Then in each thread you generate a random number and use it to decide which CSV file to read with the __CSVRead function.
CSVRead">http://jmeter.apache.org/usermanual/functions.html#_CSVRead
The only part I am not sure about is whether the __CSVRead function reopens the file every time or shares the same file handle across the threads.
You may want to try it. Please share your findings.
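If it helps, here is a rough Python sketch (mine, not from the answer; the file names are made up) of splitting one big single-column URL list into 100 smaller CSV files that the approach above could then pick from at random:

NUM_FILES = 100

# read the original single-column URL list
with open("urls.csv") as src:
    urls = [line.strip() for line in src if line.strip()]

# write urls_1.csv ... urls_100.csv, dealing the URLs out round-robin
for n in range(NUM_FILES):
    chunk = urls[n::NUM_FILES]
    with open(f"urls_{n + 1}.csv", "w") as out:
        out.write("\n".join(chunk) + "\n")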
A much more straightforward solution:
In the CSV file, add another column (say B).
Apply the =RAND() function in the first cell of column B (say B1). This will create a random float number.
Drag the corner of that cell (B1) down to apply it to all the corresponding URLs.
Sort by column B.
Your URLs will now be sorted randomly.
Delete column B.
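The same shuffle can also be scripted rather than done by hand in a spreadsheet; a minimal Python sketch, equivalent in spirit to the steps above (file names assumed):

import random

with open("urls.csv") as src:
    lines = src.readlines()

random.shuffle(lines)

with open("urls_shuffled.csv", "w") as out:
    out.writelines(lines)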

SQL Server Integration Services - Incremental data load hash comparison

I am using SQL Server Integration Services (SSIS) to perform an incremental data load, comparing a hash of the to-be-imported row data against a hash of the existing row data. I am using this:
http://ssismhash.codeplex.com/
to create the SHA512 hash for comparison. When trying to compare data import hash and existing hash from database using a Conditional Split task (expression is NEW_HASH == OLD_HASH) I get the following error upon entering the expression:
The data type "DT_BYTES" cannot be used with binary operator "==". The type of one or both of the operands is not supported for the operation. To perform this operation, one or both operands need to be explicitly cast with a cast operator.
Attempts at casting each column to a string (DT_WSTR, 64) before comparison have resulted in a truncation error.
Is there a better way to do this, or am I missing some small detail?
Thanks
Have you tried expanding the length beyond 64? I believe DT_BYTES is valid up to 8000 bytes. I verified that the following are legal cast destinations for DT_BYTES, based on the Books Online article:
DT_I4
DT_UI4
DT_I8
DT_UI8
DT_STR
DT_WSTR
DT_GUID
DT_IMAGE
I also ran a test in BIDS and verified it had no problem comparing the values once I cast them to a sufficiently long data type.
SHA512 is a bit much, as your chances of an actual collision are about 1 in 2^256. SHA512 always outputs 512 bits, which is 64 bytes. I have a similar situation where I check the hash of an incoming binary file; I use a Lookup Transformation instead of a Conditional Split.
This post is older but in order to help other users...
The answer is that in SSIS you cannot compare binary data using the == operator.
What I've seen is that people will most often convert (and store) the hashed value as varchar or nvarchar which can be compared in SSIS.
I believe the other users have answered your issue with "truncation" correctly.