Calling other feature and reading data from csv not in examples - csv

I am usually calling other feature and reading data from csv in the examples, like below.
Scenario Outline:
* call read('classpath:controller/Controller.feature')
Examples:
|read('classpath:com/testdata/Test.csv')|
This time I still want to read data from csv, but use Examples for other purpose, like below. Is it possible to read data from csv still? Maybe passing as parameter?
Scenario Outline:
* call read('classpath:controller/Controller.feature'){read('classpath:com/testdata/Test.csv')}
Examples:
|gain |spend |
|12000| 12008 |
|3400 | 4655 |
I know it works this way but I have to pass index [0], and if I have more test data in csv it won't work
Scenario Outline:
* def testData = read('classpath:com/testdata/Test.csv')
* call read('classpath:controller/Controller.feature'){ "name": "#(testData[0].name)", "age": "#(testData[0].age)"}
Examples:
|gain |spend |
|12000| 12008 |
|3400 | 4655 |

I'll just give one tip. When you use Examples the row index is available as a variable called __num: https://github.com/karatelabs/karate#scenario-outline-enhancements
So you can do things like this:
Feature:
Scenario Outline:
* def data = [{ id: 0 }, { id: 1 }]
* match (data[__num].id) == temp
Examples:
| temp! |
| 0 |
| 1 |

Related

Loading quoted numbers into snowflake table from CSV with COPY TO <TABLE>

I have a problem with loading CSV data into snowflake table. Fields are wrapped in double quote marks and hence there is problem with importing them into table.
I know that COPY TO has CSV specific option FIELD_OPTIONALLY_ENCLOSED_BY = '"'but it's not working at all.
Here are some pices of table definition and copy command:
CREATE TABLE ...
(
GamePlayId NUMBER NOT NULL,
etc...
....);
COPY INTO ...
FROM ...csv.gz'
FILE_FORMAT = (TYPE = CSV
STRIP_NULL_VALUES = TRUE
FIELD_DELIMITER = ','
SKIP_HEADER = 1
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "ABORT_STATEMENT"
;
Csv file looks like this:
"3922000","14733370","57256","2","3","2","2","2019-05-23 14:14:44",",00000000",",00000000",",00000000",",00000000","1000,00000000","1000,00000000","1317,50400000","1166,50000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000"
I get an error
'''Numeric value '"3922000"' is not recognized '''
I'm pretty sure it's because NUMBER value is interpreted as string when snowflake is reading "" marks, but since I use
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
it shouldn't even be there... Does anyone have some solution to this?
Maybe something is incorrect with your file? I was just able to run the following without issue.
1. create the test table:
CREATE OR REPLACE TABLE
dbNameHere.schemaNameHere.stacko_58322339 (
num1 NUMBER,
num2 NUMBER,
num3 NUMBER);
2. create test file, contents as follows
1,2,3
"3922000","14733370","57256"
3,"2",1
4,5,"6"
3. create stage and put file in stage
4. run the following copy command
COPY INTO dbNameHere.schemaNameHere.STACKO_58322339
FROM #stageNameHere/stacko_58322339.csv.gz
FILE_FORMAT = (TYPE = CSV
STRIP_NULL_VALUES = TRUE
FIELD_DELIMITER = ','
SKIP_HEADER = 0
ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "CONTINUE";
4. results
+-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| stageNameHere/stacko_58322339.csv.gz | LOADED | 4 | 4 | 4 | 0 | NULL | NULL | NULL | NULL |
+-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 2.436s
5. view the records
>SELECT * FROM dbNameHere.schemaNameHere.stacko_58322339;
+---------+----------+-------+
| NUM1 | NUM2 | NUM3 |
|---------+----------+-------|
| 1 | 2 | 3 |
| 3922000 | 14733370 | 57256 |
| 3 | 2 | 1 |
| 4 | 5 | 6 |
+---------+----------+-------+
Can you try with a similar test as this?
EDIT: A quick look at your data shows many of your numeric fields appear to start with commas, so something definitely amiss with the data.
Assuming your numbers are European formatted , decimal place, and . thousands, reading the numeric formating help, it seems Snowflake does not support this as input. I'd open a feature request.
But if you read the column in as text then use REPLACE like
SELECT '100,1234'::text as A
,REPLACE(A,',','.') as B
,TRY_TO_DECIMAL(b, 20,10 ) as C;
gives:
A B C
100,1234 100.1234 100.1234000000
safer would be to strip placeholders first like
SELECT '1.100,1234'::text as A
,REPLACE(A,'.') as B
,REPLACE(B,',','.') as C
,TRY_TO_DECIMAL(C, 20,10 ) as D;

Returning a new Dataframe (by transforming an existing one) using a function - spark/scala

I am new to Spark. I am trying to read a JSONArray into a Dataframe and perform some transformations on it. I am trying to cleanse my data by removing some html tags and some newline characters. for example:
Initial dataframe read from JSON:
+-----+---+-----+-------------------------------+
|index| X|label| date |
+-----+---+-----+-------------------------------+
| 1| 1| A|<div>&quot2017-01-01&quot</div>|
| 2| 3| B|<div>2017-01-02</div> |
| 3| 5| A|<div>2017-01-03</div> |
| 4| 7| B|<div>2017-01-04</div> |
+-----+---+-----+-------------------------------+
Should be transformed to :
+-----+---+-----+------------+
|index| X|label| date |
+-----+---+-----+------------+
| 1| 1| A|'2017-01-01'|
| 2| 3| B|2017-01-02 |
| 3| 5| A|2017-01-03 |
| 4| 7| B|2017-01-04 |
+-----+---+-----+------------+
I know that we can perform these transformations using:
df.withColumn("col_name",regexp_replace("col_name",pattern,replacement))
I am able to cleanse my data using the withColumn as shown above. However, I have a large number of columns and writing a .withColumn method for every column doesn't seem to be elegant, concise or efficient. So I tried doing something like this:
val finalDF = htmlCleanse(intialDF, columnsArray)
def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame = {
var retDF = hiveContext.emptyDataFrame
for(i <- 0 to columns.size-1){
val name = columns(i)
retDF = df.withColumn(name,regexp_replace(col(name),"<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>",""))
.withColumn(name,regexp_replace(col(name),""","'"))
.withColumn(name,regexp_replace(col(name)," "," "))
.withColumn(name,regexp_replace(col(name),":",":"))
}
retDF
}
I defined a new function htmlCleanse and I am passing the Dataframe to be transformed and the columns array to the function. The function creates a new emptyDataFrame and iterates over the columns list performing the cleansing on a column for a single iteration and assigns the transformed df to the retDF variable.
This gave me no errors, but it doesn't seem to remove the html tags from all the columns while some of the columns appear to be cleansed. Not sure what's the reason for this inconsistent behavior(any ideas on this?).
So, what would be an efficient way to cleanse my data? Any help would be appreciated. Thank you!
The first issue is that initializing for an empty frame does nothing, you just create something new. You can't then "add" things to it from another dataframe without a join (which would be a bad idea performance wise).
The second issue is that retDF is always defined from df. This means that you throw away everything you did except for cleaning the last column.
Instead you should initialize retDF to df and in every iteration fix a column and overwrite retDF as follows:
def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame = {
var retDF = df
for(i <- 0 to columns.size-1){
val name = columns(i)
retDF = retDF.withColumn(name,regexp_replace(col(name),"<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>",""))
.withColumn(name,regexp_replace(col(name),""","'"))
.withColumn(name,regexp_replace(col(name)," "," "))
.withColumn(name,regexp_replace(col(name),":",":"))
}
retDF
}

spark rdd fliter by query mysql

I use spark streaming to stream data from Kafka and I want to filter data judge by data in MySql.
For example, I get data from kafka just like:
{"id":1, "data":"abcdefg"}
and there are data in MySql like this:
id | state
1 | "success"
I need to query the MySql to get the state of term id.
I can define a connect to MySql in the function of filter, and it works. The code like this:
def isSuccess(x):
id = x["id"]
sql = """
SELECT *
FROM Test
WHERE id = "{0}"
""".format(id)
conn = mysql_connection(......)
result = rdbi.query_one(sql)
if result == None:
return False
else:
return True
successRDD = rdd.filter(isSuccess)
But it will define connection for every row of the RDD, and will waste a lot of computing resource.
How to do in filter?
I suggest you go for using mapPartition available in Apache Spark to prevent initialization of MySQL connection for every RDD.
This is the MySQL table that I created:
create table test2(id varchar(10), state varchar(10));
With the following values:
+------+---------+
| id | state |
+------+---------+
| 1 | success |
| 2 | stopped |
+------+---------+
Use the following PySpark Code as reference:
import MySQLdb
data1=[["1", "afdasds"],["2","dfsdfada"],["3","dsfdsf"]] #sampe data, in your case streaming data
rdd = sc.parallelize(data1)
def func1(data1):
con = MySQLdb.connect(host="127.0.0.1", user="root", passwd="yourpassword", db="yourdb")
c=con.cursor()
c.execute("select * from test2;")
data=c.fetchall()
dict={}
for x in data:
dict[x[0]]=x[1]
list1=[]
for x in data1:
if x[0] in dict:
list1.append([x[0], x[1], dict[x[0]]])
else:
list1.append([x[0], x[1], "none"]) # i assign none if id in table and one received from streaming dont match
return iter(list1)
print rdd.mapPartitions(func1).filter(lambda x: "none" not in x[2]).collect()
The output that i got was:
[['1', 'afdasds', 'success'], ['2', 'dfsdfada', 'stopped']]

Why does reading csv file with empty values lead to IndexOutOfBoundException?

I have a csv file with the foll struct
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. I tried to create a schema and then a Dataframe from it and get an indexOutOfBound error.
Code is something like this ...
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6) )
When I tried to perform an action on rowRDD, gives the error.
Any help is greatly appreciated.
This is not answer to your question. But it may help to solve your problem.
From the question I see that you are trying to create a dataframe from a CSV.
Creating dataframe using CSV can be easily done using spark-csv package
With the spark-csv below scala code can be used to read a CSV
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also inferSchema with latest version. See this answer
Empty values are not the issue if the CSV file contains fixed number of columns and your CVS looks like this (note the empty field separated with it's own commas):
David,1,2,10,,11
The problem is your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6) )
You try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5))
And Spark will take care of the rest.
The possible solution to that problem is replacing missing value with Double.NaN. Suppose I have a file example.csv with columns in it
David,1,2,10,,11
You can read the csv file as text file as follow
fileRDD=sc.textFile(example.csv).map(x=> {val y=x.split(","); val z=y.map(k=> if(k==""){Double.NaN}else{k.toDouble()})})
And then you can use your code to create dataframe from it
You can do it as follows.
val df = sqlContext
.read
.textfile(csvFilePath)
.map(_.split(delimiter_of_file, -1)
.map(
p =>
Row(
p(0),
p(1),
p(2),
p(3),
p(4),
p(5),
p(6))
Split using delimiter of your file. When you set -1 as limit it consider all the empty fields.

Copy previous values kettle pentaho

I have an issue and i'm looping on it! :| I hope someone can help me..
So i have an input file (.xls), that is simple but there are a row (lets say its "ROW1") that is like this:
ROW1 | ROW2 | ROW3 | ROW_N
765 | 1 | AAAA-MM-DD | ...
null | 1 | AAAA-MM-DD | ...
null | 1 | AAAA-MM-DD | ...
944 | 2 | AAAA-MM-DD | ...
null | 2 | AAAA-MM-DD | ...
088 | 7 | AAAA-MM-DD | ...
555 | 2 | AAAA-MM-DD | ...
null | 2 | AAAA-MM-DD | ...
There are no stardard here, like you can see.. There are some lines null (ROW1) and in ROW2, there are equal numbers, with different association to ROW1 (like in line 5 and 6, then in line 8 and 9).
My objective is to copy and paste the values from ROW1, in the ROW1 after when is null, till isn't null. Basically is to copy form previous step, when is null...
I'm trying to use the "Formula" step, by using something like:
=IF(AND(ISBLANK([ROW1]);NOT(ISBLANK([ROW2]));ROW_n=ROW1;IF(AND(NOT(ISBLANK([ROW1]));NOT(ISBLANK([ROW2]));ROW_n=ROW1;ROW_n=""));
But nothing yet..
I've tried "Analytic Query" but nothing too..
I'm using just stream a xls file input..
Tks very much, any help is very much appreciiated!!
Best Regardsd!
Well i discover a solution, adding a "User Defined Java Class" with the code below:
import java.util.HashMap;
private FieldHelper output_field, card_field;
private RowSet out, log;
private String previou_card =null;
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
if (first)
{
first = false;
out = findTargetRowSet("out");
output_field = get(Fields.Out, "previous_card");
} else {
Object[] r = getRow();
if (r == null) {
setOutputDone();
return false;
}
r = createOutputRow(r, data.outputRowMeta.size());
if (previous_card != null) {
output_field.setValue(r, previous_card);
}
if (card_field == null) {
card_field = get(Fields.In, "Grupo de Cartões");
}
String card = card_field.getString(r);
if (card != null && !card.isEmpty()) {
previous_card = card;
}
// Send the row on to the next step.
putRowTo(data.outputRowMeta, r, out);
}
return true;
After this i have to put a few steps but this help very much.
Thank you mates!!
Finally i got result. Please follow below steps
Below image is full transformation screen.
Data Grid Data will be like these. Sorry for that in my local i don't have Microsoft because of that i took Data Grid. Instead of Data Grid you can drag and drop Microsoft Excel Input step.
Drag and Drop one java script step and write below code.
Last step of transformation, drag and drop Select values step and select the columns.( These step is no necessary)
Final result will be like these.
Hope this helps.