Dealing with currency values in Pig - PigStorage - CSV

I have a 2-column CSV file loaded in HDFS. Column 1 is a model name, column 2 is a price in $. Example - Model: IE33, Price: $52678.00
When I run the following script, the price values all come back as two-digit results, e.g. $52.
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (Model, Price);
dump ultraPrice;
All my values are between $20000 and $60000, and I don't know why they are being cut off.
If I change the CSV file and remove the $ from the price values, everything works fine, but I know there has to be a better way.

Note that in your load statement you are not specifying the datatypes. By default, Model and Price will be of type bytearray, hence the discrepancy.
You can either remove the $ from the CSV file, or load the data as chararray, strip the $ sign, and cast the result to float.
-- read each line as a whole string
A = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING TextLoader() AS (line:chararray);
-- drop every character that is not a letter, digit, dot, comma or whitespace (this removes the $)
A1 = FOREACH A GENERATE REPLACE(line, '([^a-zA-Z0-9.,\\s]+)', '');
-- split on the comma delimiter and flatten the result into separate fields
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0, ','));
B1 = FOREACH B GENERATE $0 AS Model, (float)$1 AS Price;
DUMP B1;
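If the leading $ is the only stray character, an alternative (a minimal sketch under that assumption, not part of the original answer) is to keep PigStorage and clean just the price field:

ultra = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (Model:chararray, Price:chararray);
-- strip the leading $ (assumed to be the only non-numeric character) and cast
priced = FOREACH ultra GENERATE Model, (float)REPLACE(Price, '\\$', '') AS Price;
DUMP priced;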

Related

Lua - How to analyse a .csv export to show the highest, lowest and average values etc

Using Lua, I'm downloading a .csv file and then taking the first and last lines to help me validate the time period visually, via the start and end date/times provided.
I'd also like to scan through the values and create a variety of variables, e.g. the highest, lowest and average value reported during that period.
The .csv is formatted in the following way:
created_at,entry_id,field1,field2,field3,field4,field5,field6,field7,field8
2021-04-16 20:18:11 UTC,6097,17.5,21.1,20,20,19.5,16.1,6.7,15.10
2021-04-16 20:48:11 UTC,6098,17.5,21.1,20,20,19.5,16.3,6.1,14.30
2021-04-16 21:18:11 UTC,6099,17.5,21.1,20,20,19.6,17.2,5.5,14.30
2021-04-16 21:48:11 UTC,6100,17.5,21,20,20,19.4,17.9,4.9,13.40
2021-04-16 22:18:11 UTC,6101,17.5,20.8,20,20,19.1,18.5,4.4,13.40
2021-04-16 22:48:11 UTC,6102,17.5,20.6,20,20,18.7,18.9,3.9,12.40
2021-04-16 23:18:11 UTC,6103,17.5,20.4,19.5,20,18.4,19.2,3.5,12.40
And my code to get the first and last line is as follows:
print("Part 1")
print("Start : check 2nd and last row of csv")
local ctr = 0
local i = 0
local csvfilename = "/home/pi/shared/feed12hr.csv"
local hFile = io.open(csvfilename, "r")
for _ in io.lines(csvfilename) do ctr = ctr + 1 end
print("...... Count : Number of lines downloaded = " ..ctr)
local linenumbera = 2
local linenumberb = ctr
for line in io.lines(csvfilename) do i = i + 1
if i == linenumbera then
secondline = line
print("...... 2nd Line is = " ..secondline) end
if i == linenumberb then
lastline = line
print("...... Last line is = " ..lastline)
-- return line
end
end
print("End : Extracted 2nd and last row of csv")
But I now plan to pick a column, ideally by name (as I'd like to be able to use this against other .csv exports with a similar structure), and get the .csv into a table/array...
I've found an option for that here - Csv file to a Lua table and access the lines as new table or function()
See below:
#!/usr/bin/lua
print("Part 2")
print("Start : Convert .csv to table")
local csvfilename = "/home/pi/shared/feed12hr.csv"
local csv = io.open(csvfilename, "r")
local items = {} -- Store our values here
local headers = {} -- Store the header names here
local first = true
for line in csv:gmatch("[^\n]+") do
if first then -- this is to handle the first line and capture our headers.
local count = 1
for header in line:gmatch("[^,]+") do
headers[count] = header
count = count + 1
end
first = false -- set first to false to switch off the header block
else
local name
local i = 2 -- start at 2 because the first field is the row name, not a data column
for field in line:gmatch("[^,]+") do
name = name or field -- check if we know the name of our row
if items[name] then -- if the name is already in the items table then this is a field
items[name][headers[i]] = field -- assign our value at the header in the table with the given name.
i = i + 1
else -- if the name is not in the table we create a new index for it
items[name] = {}
end
end
end
end
print("End : .csv now in table/array structure")
But I'm getting the following error:
pi@raspberrypi:/ $ lua home/pi/Documents/csv_to_table.lua
Part 2
Start : Convert .csv to table
lua: home/pi/Documents/csv_to_table.lua:12: attempt to call method 'gmatch' (a nil value)
stack traceback:
home/pi/Documents/csv_to_table.lua:12: in main chunk
[C]: ?
pi@raspberrypi:/ $
Any ideas on that? I can confirm that the .csv file is there.
Once everything (hopefully) is in a table, I then want to be able to generate a list of variables based on the information in a chosen column, which I can then use and send within a push notification or email (which I already have the code for).
The following is what I've been able to create so far, but I would appreciate any/all help to do more analysis of the values within the chosen column, so I can get things like the highest, lowest and average values.
print("Part 3")
print("Start : Create .csv analysis values/variables")
local total = 0
local count = 0
for name, item in pairs(items) do
for field, value in pairs(item) do
if field == "cabin" then
print(field .. " = ".. value)
total = total + value
count = count + 1
end
end
end
local average = tonumber(total/count)
local roundupdown = math.floor(average * 100)/100
print(count)
print(total)
print(total/count)
print(roundupdown)
print("End : analysis values/variables created")
io.open returns a file handle on success, not a string.
Hence
local csv = io.open(csvfilename, "r")
--...
for line in csv:gmatch("[^\n]+") do
--...
will raise an error.
You need to read the file into a string first.
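A minimal sketch of that first option (assuming the file exists; error handling omitted):

local f = assert(io.open(csvfilename, "r"))
local content = f:read("*a") -- read the whole file into one string
f:close()
for line in content:gmatch("[^\n]+") do
  -- ... process each line
end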
Alternatively, you can iterate over the lines of a file using file:lines(...) or io.lines, as you already do in your code:
local csv = io.open(csvfilename, "r")
if csv then
for line in csv:lines() do
-- ...
You're iterating over the file more often than you need to.
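For example, a single pass (a sketch based on your Part 1 code, not from the original answer) can collect the line count, the 2nd line and the last line at once:

local ctr = 0
local secondline, lastline
for line in io.lines(csvfilename) do
  ctr = ctr + 1
  if ctr == 2 then secondline = line end
  lastline = line -- ends up holding the final line
end
print("...... Count : Number of lines downloaded = " .. ctr)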
Edit:
This is how you could fill a data table while calculating the maxima for each column on the fly. This assumes you always have valid lines! A proper solution should verify the data.
-- prepare a table to store the minima and maxima in
local colExtrema = {min = {}, max = {}}
local rows = {}
-- open the file (assumed readable; add error handling as needed)
local csvFile = assert(io.open(csvfilename, "r"))
-- go over the file linewise
for line in csvFile:lines() do
  -- split the line into 3 parts: timestamp, id and the data fields
  local timeStamp, id, dataStr = line:match("([^,]+),(%d+),(.*)")
  -- create a row container
  local row = {timeStamp = timeStamp, id = id, data = {}}
  -- fill the row data
  for val in dataStr:gmatch("[%d%.]+") do
    local num = tonumber(val)
    table.insert(row.data, num)
    -- find the biggest value so far;
    -- our initial value is the smallest number possible
    local oldMax = colExtrema.max[#row.data] or -math.huge
    -- store the bigger value as the new maximum
    colExtrema.max[#row.data] = math.max(num, oldMax)
  end
  -- insert row data
  table.insert(rows, row)
end
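To then get the highest, lowest and average of one column picked by name, you can walk the rows table afterwards. A sketch, assuming headers comes from your Part 2 code (so headers[3] = "field1" lines up with row.data[1]) and that the column of interest is "field7":

-- map the wanted header name to a data index
local wanted = "field7"
local colIdx
for i, h in ipairs(headers) do
  if h == wanted then colIdx = i - 2 end -- shift past created_at and entry_id
end

local sum, n = 0, 0
local lowest, highest = math.huge, -math.huge
for _, row in ipairs(rows) do
  local v = row.data[colIdx]
  if v then
    sum = sum + v
    n = n + 1
    lowest = math.min(lowest, v)
    highest = math.max(highest, v)
  end
end
print(("%s: min = %.2f, max = %.2f, average = %.2f"):format(wanted, lowest, highest, sum / n))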

How to iterate over xlsx data in Octave with mixed types

I am trying to read a simple xlsx file with xlsread in Octave. Its CSV version is shown below:
2,4,abc,6
8,10,pqr,12
14,16,xyz,18
I am trying to read and write the contents with this code:
[~, ~, RAW] = xlsread('file.xlsx');
allData = cell2mat(RAW); # error with cell2mat()
printf('data nrows=%d, ncolms=%d\n', rows(allData), columns(allData));
for i=1:rows(allData)
for j=1:columns(allData)
printf('data(%d,%d) = %d\n', i,j, allData(i,j));
endfor
endfor
and I am getting the following error:
error: cell2mat: wrong type elements or mixed cells, structs, and matrices
I have experimented with several variations of this problem:
(A) If I delete the column with the text data, i.e. the xlsx file contains only numbers, then this code works fine.
(B) On the other hand, if I delete the cell2mat() call, even for the purely numeric xlsx, I get an error during the cell access:
error: printf: wrong type argument 'cell'
(C) If I use cell2mat() during printf, like this:
printf('data(%d,%d) = %d\n', i,j, cell2mat(allData(i,j)));
I get correct data for the integers, and garbage for the text items.
So, how can I access and print each cell of the xlsx data when it contains mixed-type data?
In other words, given a column index, and given that I know what type of data I am expecting there (integer or string), how can I re-format the cell type before using it?
A numeric array cannot hold multi-class data, hence cell2mat fails. Cell arrays are meant to hold such data, and you already have one, so there is no need for conversion; just skip that line (allData = cell2mat(RAW);).
Within the loop, you have this line:
printf('data(%d,%d) = %d\n', i, j, allData(i,j) );
%                     ↑              ↑          ↑
%                     1              2a         2b
The problems are represented by up-arrows.
You've mixed data in your cell array but you're using %d as the data specifier. You can fix this by converting all of your data to strings and then using %s as the specifier.
If you use parentheses ( ) for indexing a cell array, you get a cell. What you need here is the content of that cell, and braces { } are used for that.
So it will be:
printf('data(%d,%d) = %s\n', i,j, num2str(RAW{i,j}));
Note that instead of all that, you can simply enter RAW to get this:
octave:1> RAW
RAW =
{
[1,1] = 2
[2,1] = 8
[3,1] = 14
[1,2] = 4
[2,2] = 10
[3,2] = 16
[1,3] = abc
[2,3] = pqr
[3,3] = xyz
[1,4] = 6
[2,4] = 12
[3,4] = 18
}
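And if you know per column what type to expect, you can pull a whole column out with the matching index type; a small sketch (the column choices just follow the sample file above):

% numeric column: parentheses give a cell array slice, cell2mat flattens it
nums = cell2mat(RAW(:,1));   % -> [2; 8; 14]
% text column: keep it as a cell array of strings
strs = RAW(:,3);             % -> {'abc'; 'pqr'; 'xyz'}
% or branch on the actual class of a single cell
if ischar(RAW{1,3})
  printf('text cell: %s\n', RAW{1,3});
end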

Pig decimal value not working

I am studying the Pig language in Cloudera, and I have a problem with decimal values.
I have a CSV file with a lot of data of different types.
I have a data column named "petrol_average" with values like "5,78524512".
I want to load this data from my CSV file.
My script is:
a = LOAD 'myfile.csv' USING PigStorage(';') AS (country: chararray, petrol_average: double);
b = FOREACH a GENERATE country, petrol_average;
DUMP b;
The dumped result is like:
(Canada,)
(Brazil,5.0)
(France,)
(United States,8.0)
...
In my CSV file I do have petrol_average values for Canada and France, but my Pig script is not showing them. And the value for Brazil is 5,78524512 in the file; it is automatically rounded.
Do you have an answer for my problem?
Sorry for my English.
sample of myfile.csv
a,578524512
b,8596243
c,15424685
d,14253685
code
A = LOAD 'data/MyFile.txt' USING PigStorage(',') AS (country:chararray, petrol_average:long);
NOTE:
You created the schema with double, but your data here is a plain integer, so it drops the data after the first digit; that is why I used long.
grunt> dump A;
grunt> B = FOREACH A generate country, petrol_average;
grunt> dump B;
result
(a,578524512)
(b,8596243)
(c,15424685)
(d,14253685)
Works fine. Happy Hadoop :)
@MaheshGupta
Thank you for your answer. When I use float or long, I get a result like this:
()
(8.0)
()
()
()
()
()
()
()
()
()
When I declare it in my schema as chararray, I get this result:
(9,100000381)
(8,199999809)
(8,399999619)
(8,100000381)
(8,399999619)
(8,399999619)
(8,399999619)
(8,100000381)
(8,5)
(8,199999809)
(9)
My script is this one:
a = LOAD 'myfile.csv' USING PigStorage(';') AS (country: chararray, petrol_average: chararray);
b = FOREACH a GENERATE petrol_average;
DUMP b;
My big problem is that I can't do division or addition, because the type is chararray.
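Since the underlying issue is the decimal comma, one way out (a sketch I am adding, not from the original thread) is to load the column as chararray, swap the comma for a dot with REPLACE, and cast to double:

a = LOAD 'myfile.csv' USING PigStorage(';') AS (country:chararray, petrol_average:chararray);
-- turn '5,78524512' into '5.78524512' before casting (assumes ',' only appears as the decimal separator)
b = FOREACH a GENERATE country, (double)REPLACE(petrol_average, ',', '.') AS petrol_average;
DUMP b;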

How do I compare two CSV files in PIG, and output only the different columns?

Between an older csv file and a newer one, I want to find what fields have changed on rows with the same key. For example, if the unique key is in field $2, and we have two files:
Old csv file:
FIELD1,FIELD2,ID,FIELD4
a,a,key1,a
b,b,key2,b
New csv file:
FIELD1,FIELD2,ID,FIELD4
a,a2,key1,a2
b,b,key2,b
Desired output something like:
{FIELD2:a2,ID:key1,FIELD4:a2}
Or in other words: on the row with ID = key1, the 2nd and 4th fields changed, and these are the changed values.
A Pig script that outputs the whole row if any field has changed is:
old = load '$old' using PigStorage('\n') as (line:chararray);
new = load '$new' using PigStorage('\n') as (line:chararray);
cg = cogroup old by line, new by line;
new_only = foreach (filter cg by IsEmpty(old)) generate flatten(new);
store new_only into '$changes';
My initial idea (I'm not sure how to complete it) is:
old = LOAD '$old' USING PigStorage('|');
new = LOAD '$new' USING PigStorage('|');
cogroup_data = COGROUP old BY $2, new BY $2; -- 3rd column ($2) is the unique key
diff_data = FOREACH cogroup_data GENERATE DIFF(old, new);
-- ({(a,a,key2,a),(a,a2,key2,a2)})
-- ? what goes here ?
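One way to finish the idea (a sketch I am adding, not from the original post, assuming a fixed 4-column layout with the key in field 3) is to JOIN by the key and blank out the fields that did not change:

old = LOAD '$old' USING PigStorage(',') AS (f1:chararray, f2:chararray, id:chararray, f4:chararray);
new = LOAD '$new' USING PigStorage(',') AS (f1:chararray, f2:chararray, id:chararray, f4:chararray);
joined = JOIN old BY id, new BY id;
-- keep only rows where something changed
changed = FILTER joined BY (old::f1 != new::f1) OR (old::f2 != new::f2) OR (old::f4 != new::f4);
-- emit the key plus the new value of each changed field, empty otherwise
diffs = FOREACH changed GENERATE new::id AS id,
    ((old::f1 != new::f1) ? new::f1 : '') AS field1,
    ((old::f2 != new::f2) ? new::f2 : '') AS field2,
    ((old::f4 != new::f4) ? new::f4 : '') AS field4;
STORE diffs INTO '$changes';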

How to import comma delimited text file into datawindow (powerbuilder 11.5)

Hi, good day. I'm very new to PowerBuilder and I'm using PB 11.5.
Does anyone know how to import a comma-delimited text file into a datawindow?
Example text file:
"1234","20141011","Juan, Delacruz","Usa","001992345456"...
"12345","20141011","Arc, Ino","Newyork","005765753256"...
How can I import the third column, which is the full name, and the last column, which is the account number? I want to transfer the name and account number into my external datawindow. I've tried ImportString(), but all the rows end up in one column. I have three fields in my external datawindow, among them the name and the account number.
Here's the code:
ls_File = dw_2.Object.file_name[1]
li_FileHandle = FileOpen(ls_File)
li_FileRead = FileRead(li_FileHandle, ls_Text)
DO WHILE li_FileRead > 0
li_Count ++
li_FileRead = FileRead(li_FileHandle, ls_Text)
ll_row = dw_1.ImportString(ls_Text,1)
LOOP
Please help me with the code! Thank you.
It seems that PB expects a tab-separated file by default (while the 'c' in 'csv' stands for 'comma'...).
Add the csv! enumerated value to the arguments of ImportString() and that should fix it (it does in my test box).
Also, the columns defined in your dataobject must match the columns in the csv file (at least for the first columns you are interested in). If there are more columns in the csv file, they will be ignored. But if you want to get the 1st (or 2nd) and 3rd columns, you need to define the first 3 columns. You can always hide column #1 or #2 if you do not need it.
BTW, your code has some issues:
you should always test the return value of calls like FileOpen(), so you can stop processing on a non-existent or unreadable file
you are reading the first row of the text file twice: once before the loop and again inside it. Or maybe that is intended, to skip a first line with column headers?
FWIW, here is working code based on yours:
string ls_file = "c:\dev\powerbuilder\experiment\data.csv"
string ls_text
int li_FileHandle, li_fileread, li_count
long ll_row
li_FileHandle = FileOpen(ls_File)
if li_FileHandle < 1 then
return
end if
li_FileRead = FileRead(li_FileHandle, ls_Text)
DO WHILE li_FileRead > 0
li_Count ++
ll_row = dw_1.ImportString(csv!, ls_Text, 1)
li_FileRead = FileRead(li_FileHandle, ls_Text) // read next line
Loop
fileclose(li_fileHandle)
Alternatively, use the datawindow_name.ImportFile(CSV!, file_path) method.
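A minimal sketch of that one-call route (the file path is just an example):

// let the datawindow parse the whole csv file itself
long ll_rows
ll_rows = dw_1.ImportFile(CSV!, "c:\dev\powerbuilder\experiment\data.csv")
if ll_rows < 0 then
    // ImportFile returns a negative error code on failure
    MessageBox("Import", "ImportFile failed with code " + String(ll_rows))
end if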