Error and output issues for Python matrix function?

I am building a function that takes a list of lists (e.g. [['a'],['b'],['c']]) and outputs it as a table. I cannot use PrettyTable because I need a specific output (e.g. | a | b |) with the pipes and the spaces exactly as shown.
Here is my function:
def show_table(table):
    if table is None:
        table=[]
        new_table=""
        for row in range(table):
            for val in row:
                new_table+= ("| "+val+" ")
            new_table+= "|\n"
    return new_table.strip("\n")
I keep getting the error:
show_table([['a'],['b'],['c']])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in show_table
TypeError: 'list' object cannot be interpreted as an integer
I'm not sure why there is an issue. I've also gotten an output error where it only outputs the first item in the first list and nothing more. Could someone explain how to use the format function to get rid of this error and output what I want correctly?
Fixed error but still failing tests:
FAIL: test_show_table_12 (main.AllTests)
Traceback (most recent call last):
  File "testerl7.py", line 116, in test_show_table_12
    def test_show_table_12 (self): self.assertEqual (show_table([['10','2','300'],['4000','50','60'],['7','800','90000']]),'| 10 | 2 | 300 |\n| 4000 | 50 | 60 |\n| 7 | 800 | 90000 |\n')
AssertionError: '| 10| 2| 300|\n| 4000| 50| 60|\n| 7| 800| 90000|' != '| 10 | 2 | 300 |\n| 4000 | 50 | 60 |\n| 7 | 800 | 90000 |\n'
- | 10| 2| 300|
+ | 10 | 2 | 300 |
?     +   +     +
- | 4000| 50| 60|
+ | 4000 | 50 | 60 |
?       +    +    +
- | 7| 800| 90000|
+ | 7 | 800 | 90000 |
?    +     +       +

The problem is here:
for row in range(table):
range takes 1, 2, or 3 integers as arguments. It does not take a list.
You want to use:
for row in table:
Also, check your indents; it looks like the newline addition should be indented more.

Your traceback tells you that the problem occurs on line 5:
for row in range(table):
… so something on that line is trying, without success, to interpret something else as an integer. If we take a look at the docs for range(), we see this:
The arguments to the range constructor must be integers (either built-in int or any object that implements the __index__ special method).
… but table is not an integer; it's a list. If you want to iterate over a list (or something similar), you don't need a special function – simply
for row in table:
will work just fine.
There's another problem with your function apart from the misuse of range(), which is that you've indented too much of your code. This:
if table is None:
    table=[]
    new_table=""
    for row in range(table):
        for val in row:
            new_table+= ("| "+val+" ")
        new_table+= "|\n"
… will only execute any of the indented code if table is None, whereas what you really want is just to set table=[] if that is the case. Fixing up both those problems gives you this:
def show_table(table):
    if table is None:
        table = []
    new_table = ""
    for row in table:
        for val in row:
            new_table += ("| " + val + " ")
        new_table += "|\n"
    return new_table.strip("\n")
(I've also changed all your indents to four spaces, and added spaces here and there, to improve the style).
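As a quick check against the failing test case from the question (my own demo, not part of the original answer): note that test_show_table_12 expects the result to end with '\n', so if your test suite requires the trailing newline, return new_table without stripping it.

def show_table(table):
    if table is None:
        table = []
    new_table = ""
    for row in table:
        for val in row:
            new_table += "| " + val + " "
        new_table += "|\n"
    return new_table  # no .strip("\n"): the test expects a trailing newline

print(show_table([['10', '2', '300'], ['4000', '50', '60'], ['7', '800', '90000']]))
# | 10 | 2 | 300 |
# | 4000 | 50 | 60 |
# | 7 | 800 | 90000 |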

AWK: How to merge CSV files and eliminate rows that contain certain values?

I have hundreds of CSV files. Each CSV file is similar to this:
| KEYWORD | NUMBER OF COMPS | AVGE M E (K) | GS/M | EST. A SE/M | C CORE |
|---------|-----------------|--------------|------|-------------|--------|
| Apples | 311 | 12 | N/A | <100 | 10 |
| Bananas | >1,200 | 737 | N/A | 490 | 88 |
| Oranges | 48 | 184 | N/A | N/A | 1 |
| Fruits | 161 | 94 | N/A | - | 6 |
(I have posted this in table format, to make it more readable, but the CSV data is at the bottom of this post).
All the CSV files have the same header row. Only the data is different.
I would like to do the following:
Merge all the CSV files together, but only have 1 header row.
Omit any rows where EST. A SE/M (Column 5) contains any of the following data: <100, N/A or -
Notes about the Data
Sometimes some or even all cells in the CSV file are wrapped in quotation marks.
Other times they are not.
Sometimes the first column (keyword) may contain multiple words or accented characters.
My code so far
This code merges all the CSV files into one, with only one heading:
awk '(NR == 1) || (FNR > 1)' *.csv > ^0-output.csv
This works perfectly.
However, I am not sure how to delete the unwanted rows after the merge.
So far I have this:
awk '$5 !~ /(<100|N\/A|-)/' ^0-output.csv > ^0-output.csv
But when I use this code, it just produces a blank file.
Plus, I am not sure if there is a way to integrate it in the first line, so it does everything with a single command.
Notes
Here is how the data looks in CSV format
Sample1.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Apples,311,12,N/A,<100,10
Bananas,">1,200",737,N/A,490,88
Oranges,48,184,N/A,N/A,1
Fruits,161,94,N/A,-,63
Sample2.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Dino,588,67,N/A,888,234
Thunder,">1,200",211,N/A,<100,77
Ninja,95,37,N/A,-,878
Sample3.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Blur,84,2454,N/A,-,234
Sample4.csv
"KEYWORD","NUMBER OF COMPS","AVGE M E (K)","GS/M","EST. A SE/M","C CORE"
"hedgehog rolls ròund",32,481,N/A,"878",13
"Clever Fox jumps Hîgh",233,83,N/A,"<100",12
"Bear à lot",122,35,N/A,"-",11
"kitten hîgh life","121","673","32","N/A","15"
Please note: The actual files that the finished script will be used on will have a variety of file names. They will NOT always follow the pattern of sample 1, sample 2 etc.
Expected Output
Expected output: (CSV format)
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Bananas,">1,200",737,N/A,490,88
Dino,588,67,N/A,888,234
"hedgehog rolls ròund",32,481,N/A,"878",13
(Note: It doesn't matter if the expected output keeps the wrapping quote marks as the final CSV file is opened in Apple Numbers)
Expected output: (Readable format)
| KEYWORD | NUMBER OF COMPS | AVGE M E (K) | GS/M | EST. A SE/M | C CORE |
|---------|-----------------|--------------|------|-------------|--------|
| Bananas | >1,200 | 737 | N/A | 490 | 88 |
| Dino | 588 | 67 | N/A | 888 | 234 |
| hedgehog rolls ròund | 32 | 481 | N/A | 878 | 13 |
Environment:
I am using Mac OS X 10.14.6. I am unable to install other versions of awk.
You may just merge the 2 conditions into one using &&:
awk -F, 'NR==1 || (FNR>1 && $5 !~ /^(<100|N\/A|-)$/)' *.csv > output.csv
(Incidentally, your second attempt produced a blank file because the shell truncates ^0-output.csv for the output redirection before awk ever reads it; never redirect output back to your input file.)
Here $5 !~ /^(<100|N\/A|-)$/ will skip a row if $5 is <100, - or N/A. It is important to use the regex anchors ^ and $ to avoid skipping rows on partial matches, such as a field <1000 (which contains <100) or AB-123 (which contains -).
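A quick illustration with two made-up rows (not from the sample files): the unanchored pattern also drops a row whose fifth field merely contains one of the strings, while the anchored pattern only drops exact matches.
$ printf '%s\n' 'a,b,c,d,AB-123,f' 'a,b,c,d,-,f' | awk -F, '$5 !~ /(<100|N\/A|-)/'       # unanchored: both rows dropped
$ printf '%s\n' 'a,b,c,d,AB-123,f' 'a,b,c,d,-,f' | awk -F, '$5 !~ /^(<100|N\/A|-)$/'    # anchored: AB-123 survives
a,b,c,d,AB-123,f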
It seems you also have a comma inside double quotes in file1.csv. In that case the following gnu-awk command should work for you:
awk -v FPAT='"[^"]*"|[^,]*' '
NR == 1 || (FNR > 1 && $5 !~ /^(<100|N\/A|-)$/)' *.csv > output.csv
EDIT: As per the OP's comments there could be a comma inside the quotes too, so to handle that it's better to use FPAT; written and tested with GNU awk.
awk -v FPAT='[^,]*|"[^"]+"' '
{ sub(/\r$/,"") }
FNR==1{
  if(NR==1){ print }
  next
}
$5=="<100" || $5=="N/A" || $5=="-"{
  next
}
1
' *.csv
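To see what FPAT does with a quoted field that contains a comma, here is a quick check with gawk on the Bananas row from Sample1.csv:
$ echo 'Bananas,">1,200",737,N/A,490,88' | gawk -v FPAT='[^,]*|"[^"]+"' '{ print NF, $2, $5 }'
6 ">1,200" 490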
Could you please try the following, written and tested with GNU awk on the shown samples only.
awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  if(NR==1){ print }
  next
}
$5=="<100" || $5=="N/A" || $5=="-"{ next }
1
' *.csv
Or, in case your values can contain something else too and you want to use a regex to match the values you want to drop, try the following.
awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  if(NR==1){ print }
  next
}
$5~/<100/ || $5~/N\/A/ || $5~/-/{ next }
1
' *.csv
Explanation: a detailed explanation of the above.
awk '                                       ##Start the awk program.
BEGIN{                                      ##Start the BEGIN section of the program.
  FS=OFS=","                                ##Set the field separator to a comma.
}
FNR==1{                                     ##If this is the first line of the current Input_file, do the following.
  if(NR==1){ print }                        ##If it is the very first line of the very first Input_file, print it.
  next                                      ##next skips all further statements from here.
}
$5=="<100" || $5=="N/A" || $5=="-"{ next }  ##If the 5th field is <100, N/A or -, skip all further statements.
1                                           ##awk shorthand to print the current line.
' *.csv                                     ##Pass all .csv files to the awk program.
It looks to me like you're only interested in testing the 2nd-last field, and neither that nor the last field can contain commas, so just count field numbers from the end instead of from the beginning of each line; then you don't care whether earlier fields contain commas or not. Given that, this will work using any awk:
$ awk -F',' '(NR==1) || (FNR>1 && $(NF-1)!~/^"?(<100|N\/A|-)"?$/)' *.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Bananas,">1,200",737,N/A,490,88
Dino,588,67,N/A,888,234
"hedgehog rolls ròund",32,481,N/A,"878",13

Loading quoted numbers into a Snowflake table from CSV with COPY INTO <table>

I have a problem with loading CSV data into a Snowflake table. Fields are wrapped in double quote marks and hence there is a problem with importing them into the table.
I know that COPY INTO has the CSV-specific option FIELD_OPTIONALLY_ENCLOSED_BY = '"' but it's not working at all.
Here are some pieces of the table definition and the copy command:
CREATE TABLE ...
(
    GamePlayId NUMBER NOT NULL,
    etc...
....);

COPY INTO ...
FROM '...csv.gz'
FILE_FORMAT = (TYPE = CSV
               STRIP_NULL_VALUES = TRUE
               FIELD_DELIMITER = ','
               SKIP_HEADER = 1
               ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
               FIELD_OPTIONALLY_ENCLOSED_BY = '"'
              )
ON_ERROR = "ABORT_STATEMENT"
;
The CSV file looks like this:
"3922000","14733370","57256","2","3","2","2","2019-05-23 14:14:44",",00000000",",00000000",",00000000",",00000000","1000,00000000","1000,00000000","1317,50400000","1166,50000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000"
I get an error
Numeric value '"3922000"' is not recognized
I'm pretty sure it's because the NUMBER value is interpreted as a string when Snowflake reads the "" marks, but since I use
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
it shouldn't even be there... Does anyone have some solution to this?
Maybe something is incorrect with your file? I was just able to run the following without issue.
1. create the test table:
CREATE OR REPLACE TABLE dbNameHere.schemaNameHere.stacko_58322339 (
    num1 NUMBER,
    num2 NUMBER,
    num3 NUMBER);
2. create test file, contents as follows
1,2,3
"3922000","14733370","57256"
3,"2",1
4,5,"6"
3. create stage and put file in stage
4. run the following copy command
COPY INTO dbNameHere.schemaNameHere.STACKO_58322339
FROM @stageNameHere/stacko_58322339.csv.gz
FILE_FORMAT = (TYPE = CSV
               STRIP_NULL_VALUES = TRUE
               FIELD_DELIMITER = ','
               SKIP_HEADER = 0
               ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
               FIELD_OPTIONALLY_ENCLOSED_BY = '"'
              )
ON_ERROR = "CONTINUE";
5. results
+-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| stageNameHere/stacko_58322339.csv.gz | LOADED | 4 | 4 | 4 | 0 | NULL | NULL | NULL | NULL |
+-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 2.436s
6. view the records
>SELECT * FROM dbNameHere.schemaNameHere.stacko_58322339;
+---------+----------+-------+
| NUM1 | NUM2 | NUM3 |
|---------+----------+-------|
| 1 | 2 | 3 |
| 3922000 | 14733370 | 57256 |
| 3 | 2 | 1 |
| 4 | 5 | 6 |
+---------+----------+-------+
Can you try with a similar test as this?
EDIT: A quick look at your data shows that many of your numeric fields appear to start with commas, so something is definitely amiss with the data.
Assuming your numbers are European-formatted (comma for the decimal separator, period for thousands), reading the numeric formatting help it seems Snowflake does not support this as input. I'd open a feature request.
But if you read the column in as text and then use REPLACE, like:
SELECT '100,1234'::text as A
      ,REPLACE(A, ',', '.') as B
      ,TRY_TO_DECIMAL(B, 20, 10) as C;
gives:
A         B         C
100,1234  100.1234  100.1234000000
Safer would be to strip the thousands placeholders first, like:
SELECT '1.100,1234'::text as A
      ,REPLACE(A, '.') as B
      ,REPLACE(B, ',', '.') as C
      ,TRY_TO_DECIMAL(C, 20, 10) as D;
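which should give something like this (my expected result based on the documented behavior of REPLACE and TRY_TO_DECIMAL, not output shown in the original answer):
A           B          C          D
1.100,1234  1100,1234  1100.1234  1100.1234000000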

Output Error for Matrix in Python?

I am trying to create a function that outputs a matrix that contains each item in a list on a separate line with lines in between. The only output I'm getting is an empty string (''). I do not understand why. I think I set it all up correctly to output what is needed, but there has to be something missing?
I included examples below my code.
def show_table(table):
    table=[]
    s=[[str(e) for e in row] for row in table]
    lens= [max(map(len, col)) for col in zip(*s)]
    fmt= '\t'.join('{{:{}}}'.format(x) for x in lens)
    table= [fmt.format(*row) for row in s]
    return '\n'.join(table)
show_table([['A','BB'],['C','DD']])
output:
'| A | BB |\n| C | DD |\n'
print(show_table([['A','BB'],['C','DD']]))
output:
| A | BB |
| C | DD |
The issue is on the second line, where you are initialising your list to an empty list. Instead try:
if table is None:
    table = []
Perhaps a better way to accomplish this could be:
def show_table(table):
    if table is None:
        table = []
    data = ""
    for row in table:
        for val in row:
            data += "| " + val + " "
        data += "|\n"
    return data.strip("\n")
print(show_table([['a','bb'],['c','dd']]))
Output:
| a | bb |
| c | dd |
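If you do want the column-width alignment your original code was attempting, the same fix applies there: take table as given instead of resetting it. A minimal sketch (note it produces tab-separated aligned columns, not the pipe format shown above):

def show_table(table):
    if table is None:
        table = []
    s = [[str(e) for e in row] for row in table]
    lens = [max(map(len, col)) for col in zip(*s)]      # widest entry per column
    fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)  # e.g. '{:1}\t{:2}'
    rows = [fmt.format(*row) for row in s]
    return '\n'.join(rows)

print(show_table([['A', 'BB'], ['C', 'DD']]))
# A     BB
# C     DD
# (tab-separated)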

Why does reading a CSV file with empty values lead to an IndexOutOfBoundsException?

I have a CSV file with the following structure:
Name  | Val1 | Val2 | Val3 | Val4 | Val5
John  | 1    | 2    |
Joe   | 1    | 2    |
David | 1    | 2    | 10   | 11
I am able to load this into an RDD fine. I tried to create a schema and then a DataFrame from it and get an IndexOutOfBounds error.
Code is something like this:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
This is not an answer to your question, but it may help to solve your problem.
From the question I see that you are trying to create a dataframe from a CSV.
Creating a dataframe from a CSV can be easily done using the spark-csv package.
With spark-csv, the below Scala code can be used to read a CSV:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result:
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use inferSchema with the latest version. See this answer.
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field separated with its own commas):
David,1,2,10,,11
The problem is that your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
you try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
and Spark will take care of the rest.
A possible solution to that problem is to replace the missing values with Double.NaN. Suppose I have a file example.csv with these columns in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv").map { x =>
  val y = x.split(",")
  // assumes numeric fields; a non-numeric column such as Name would need separate handling
  y.map(k => if (k == "") Double.NaN else k.toDouble)
}
And then you can use your code to create a dataframe from it.
You can do it as follows:
val df = sqlContext
  .read
  .textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5),
      p(6)))
Split using the delimiter of your file. When you set -1 as the limit, it considers all the empty fields.
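The -1 limit matters because Scala's split (like Java's) drops trailing empty strings by default, which is exactly what makes p(3) and beyond fail for short rows like John's. A quick illustration in the Scala REPL:
scala> "John,1,2,,,".split(",").length
res0: Int = 3

scala> "John,1,2,,,".split(",", -1).length
res1: Int = 6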

Copy previous values kettle pentaho

I have an issue and I'm looping on it! :| I hope someone can help me.
So I have an input file (.xls) that is simple, but there is a column (let's say it's "ROW1") that is like this:
ROW1 | ROW2 | ROW3       | ROW_N
765  | 1    | AAAA-MM-DD | ...
null | 1    | AAAA-MM-DD | ...
null | 1    | AAAA-MM-DD | ...
944  | 2    | AAAA-MM-DD | ...
null | 2    | AAAA-MM-DD | ...
088  | 7    | AAAA-MM-DD | ...
555  | 2    | AAAA-MM-DD | ...
null | 2    | AAAA-MM-DD | ...
There is no standard here, as you can see. There are some null lines (in ROW1), and in ROW2 there are equal numbers with different associations to ROW1 (like in lines 5 and 6, then in lines 8 and 9).
My objective is to copy the ROW1 value down into the null ROW1 cells that follow it, until the next non-null value. Basically, it is to copy from the previous row when the current value is null.
I'm trying to use the "Formula" step, with something like:
=IF(AND(ISBLANK([ROW1]);NOT(ISBLANK([ROW2]));ROW_n=ROW1;IF(AND(NOT(ISBLANK([ROW1]));NOT(ISBLANK([ROW2]));ROW_n=ROW1;ROW_n=""));
But nothing yet.
I've tried the "Analytic Query" step, but nothing too.
I'm using just a stream from an .xls file input.
Thanks very much, any help is very much appreciated!!
Best regards!
Well, I discovered a solution: adding a "User Defined Java Class" step with the code below.
import java.util.HashMap;

private FieldHelper output_field, card_field;
private RowSet out, log;
private String previous_card = null;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    if (first)
    {
        first = false;
        out = findTargetRowSet("out");
        output_field = get(Fields.Out, "previous_card");
    } else {
        Object[] r = getRow();
        if (r == null) {
            setOutputDone();
            return false;
        }
        r = createOutputRow(r, data.outputRowMeta.size());
        if (previous_card != null) {
            output_field.setValue(r, previous_card);
        }
        if (card_field == null) {
            card_field = get(Fields.In, "Grupo de Cartões");
        }
        String card = card_field.getString(r);
        if (card != null && !card.isEmpty()) {
            previous_card = card;
        }
        // Send the row on to the next step.
        putRowTo(data.outputRowMeta, r, out);
    }
    return true;
}
After this I had to add a few more steps, but this helped very much.
Thank you, mates!!
Finally I got the result. Please follow the steps below.
The image below shows the full transformation screen.
The Data Grid data will be like this. (Sorry, I don't have Microsoft Excel on my local machine, so I used a Data Grid; instead of the Data Grid you can drag and drop a Microsoft Excel Input step.)
Drag and drop one JavaScript step and write code along the lines of the sketch after these steps.
As the last step of the transformation, drag and drop a Select values step and select the columns. (This step is not necessary.)
The final result will be like this.
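The JavaScript from the screenshot is not reproduced here, but the usual carry-forward pattern in a Modified Java Script Value step looks roughly like this (an untested sketch; it assumes previous_value is declared in the step's Start Script tab so that it survives across rows, and that the new field is mapped in the step's output fields grid):

// Start Script tab: runs once, before the first row
var previous_value = null;

// Main script: runs for every row
var filled_row1 = ROW1;
if (filled_row1 == null && previous_value != null) {
    filled_row1 = previous_value;   // fill the gap with the last non-null value
} else if (filled_row1 != null) {
    previous_value = filled_row1;   // remember the latest non-null value
}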
Hope this helps.