Manipulate a CSV file using Pig

I tried to load a CSV file in Pig with the following command:
A = LOAD '/USER/XYZ/PIG/FILENAME.ASC' USING PigStorage(',');
It ran without error, but cat a then gave me a "Directory does not exist" error. I'm new to Pig and know I did something very wrong there. How do I check whether it is indeed loaded? Or is "loaded" a misnomer, and the file just exists on HDFS?
Next, I'd like to cut a few columns of data from the CSV file and store it in another file. How can I go about it?
I don't necessarily need the script/code, but if you could point me to the right functions that will accomplish what I want to do, that would be great. Thanks!

To see the current content of A you can use DUMP A;. To see the schema of the relation you can use DESCRIBE A;. Note that LOAD is evaluated lazily: nothing is actually read until a DUMP or STORE forces execution, and the relation A is not a file on HDFS, which is why cat a finds nothing.
Once you know the schema of A you can project out the fields you want, e.g. B = FOREACH A GENERATE $0 AS foo, $4 AS bar; to select only the 1st and 5th columns, naming them foo and bar respectively.
Storing can be done with STORE B INTO 'myoutdir' USING PigStorage('|');, where the delimiter can be any single character.
So, for example this is how the script would look while I am testing it:
A = LOAD '/USER/XYZ/PIG/FILENAME.ASC' USING PigStorage(',') ;
DESCRIBE A ;
DUMP A ;
B = FOREACH A GENERATE $0, $4;
DESCRIBE B ;
DUMP B ;
STORE B INTO 'myoutdir' USING PigStorage('|') ;
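If you want to refer to columns by name rather than by position, you can also give LOAD an explicit schema. A minimal sketch, with hypothetical field names and types that you would adjust to your actual file:
A = LOAD '/USER/XYZ/PIG/FILENAME.ASC' USING PigStorage(',')
        AS (foo:chararray, col2:int, col3:chararray, col4:int, bar:chararray);
-- project by name instead of by position ($0, $4)
B = FOREACH A GENERATE foo, bar;
STORE B INTO 'myoutdir' USING PigStorage('|');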

Related

SAS - Change the format of a column in a proc imported CSV

I cannot quite figure out how to change the format of a column in my data file. I imported the data set with PROC IMPORT, and it guessed the format of a specific column as numeric; I would like it to be character-based.
This is where I'm currently at, and it does not change the format of my NUMBER column:
proc import
    datafile = 'datapath'
    out = dataname
    dbms = CSV
    replace;
    format NUMBER $8.;
    guessingrows = 20000;
run;
You could import the data and then convert the column afterwards. I believe the following would work:
proc sql;
    create table want as
    select *,
           put(Number, 4.) as CharacterVersion
    from data;
quit;
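If you want the character version to take over the original column name, a data step variant would be (a sketch; NUMBER and the table names are the ones assumed above):
data want;
    set data;                              /* the imported table */
    CharacterVersion = put(Number, 4.);    /* numeric -> character */
    drop Number;                           /* remove the numeric original */
    rename CharacterVersion = Number;      /* reuse the original name */
run;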
You cannot change the type/format via PROC IMPORT. However, you can write a data step to read in the file and then customize everything. If you're not sure how to start with that, check the log after you run a PROC IMPORT and it will have the 'skeleton' code. You can copy that code, edit it, and run to get what you need. Writing from scratch also works using an INFILE and INPUT statement.
From the help file (search for "Processing Delimited Files in SAS")
If you need to revise your code after the procedure runs, issue the RECALL command (or press F4) to the generated DATA step. At this point, you can add or remove options from the INFILE statement and customize the INFORMAT, FORMAT, and INPUT statements to your data.
Granted, the grammar in this section is horrific! The idea is that the IMPORT Procedure generates source code that can be recalled and modified for subsequent submission.
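For illustration, here is a minimal hand-written data step of the kind PROC IMPORT generates; the path and delimiter are assumptions, and you would list all of your columns, but it shows NUMBER being read as character with $8.:
data dataname;
    infile 'datapath' dsd dlm=',' firstobs=2 truncover;  /* skip the header row */
    informat NUMBER $8.;   /* read as character instead of numeric */
    format NUMBER $8.;
    input NUMBER $;        /* add the remaining columns here */
run;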

CSV File Processing - SAS

I am pretty new to SAS programming and am trying to find the most efficient approach for my current initiative. Basically, I need to modify an existing .csv file stored on the SAS server and save it in my folder on the same server.
Modification required:
keep .csv as format
use "|" instead of "," as delimiter
have the following output name: filename_YYYYMMDDhhmmss.csv
keep only 4 variables from the original file
rename some of the variables we keep
Here is the script I am currently using, but there are a few issues with it:
PROC IMPORT OUT = libname.original_file (drop=var0)
    DATAFILE = "/.../file_on_server.csv"
    DBMS = CSV
    REPLACE;
RUN;
%PUT date_human = %SYSFUNC(PUTN(%sysevalf(%SYSFUNC(TODAY())-1), datetime20.));
proc export data = libname.original_file ( rename= ( var1=VAR11 var2=VAR22 Type=VAR33 ))
outfile = '/.../filename_&date_human..csv' label dbms=csv replace;
delimiter='|';
run;
I also have an issue with the variable called "Type" when renaming it, as it looks like there is a conflict with some system keywords. The date format is not good either, and I was not able to find the exact format on the SAS forums, unfortunately.
Any advice on how to make this script more efficient is greatly appreciated.
I wouldn't bother with trying to actually read the data into a SAS dataset. Just process it and write it back out. If the input structure is consistent then it is pretty simple. Just read everything as character strings and output the columns that you want to keep.
Let's assume that the data has 12 columns and the last of the four that you want to keep is the 10th column. So you only need to read in 10 of them.
First setup your input and output filenames in macro variables to make it easier to edit. You can use your logic for generating the filename for the new file.
%let infile=/.../file_on_server.csv;
%let outfile=/.../filename_&date_human..csv;
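For the filename_YYYYMMDDhhmmss.csv pattern the question asks for, one possible way to build &date_human (a sketch using standard SAS formats; check the result in your log):
%let date_human=%sysfunc(putn(%sysfunc(today()),yymmddn8.))%sysfunc(compress(%sysfunc(putn(%sysfunc(time()),tod8.)),%str(:)));
%put date_human=&date_human;  /* e.g. 20240131143005 */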
Then use a simple DATA _NULL_ step to read the data as character strings and write it back out. You can even change the relative order of the four columns if you want. So this program will copy the 2nd, 5th, 4th and 10th columns and change the column headers to NewName1, NewName2, NewName3 and NewName4.
data _null_;
    infile "&infile" dsd dlm=',' truncover;  /* read the source CSV */
    file "&outfile" dsd dlm='|';             /* write pipe-delimited output */
    length var1-var10 $200 ;
    input var1-var10;                        /* read everything as character */
    if _n_=1 then do;                        /* replace the header-row values */
        var2='NewName1';
        var5='NewName2';
        var4='NewName3';
        var10='NewName4';
    end;
    put var2 var5 var4 var10 ;               /* write only the four columns, reordered */
run;
If some of the data for the four columns you want to keep are longer than 200 characters then just update the LENGTH statement.
So let's try a little experiment. First let's make a dummy CSV file.
filename example temp;
data _null_;
file example ;
input;
put _infile_;
cards4;
a,b,c,d,e,f,g,h,i,j,k,l,m
1,2,3,4,5,6,7,8,9,10,11,12,13
o,p,q,r,s,t,u,v,w,x,y,z
;;;;
Now let's try running it. I will modify the INFILE and FILE statements to read from my temp file and write the result to the log.
infile example /* "&infile" */ dsd dlm=',' truncover;
file log /* "&outfile" */ dsd dlm='|';
Here are the resulting rows written.
NewName1|NewName2|NewName3|NewName4
2|5|4|10
p|s|r|x

Handling delimiters in Apache Pig

I have a comma-separated value file.
Data example:
1001,Laptop,beautify,laptop amazing price,<HTML>XYZ</HTML>,1345
1002,Camera,Best Mega Pixel,<HTML>ABC</HTML>,4567
1003,TV,Best Price,<HTML>DEF</HTML>,8791
We have only 5 columns: id, Device, Description, HTML Code, Identifier.
For a few of the records there is an extra , in the Description column.
For example, the first record in the sample data above has the extra comma [beautify,laptop amazing price], which I want to eliminate.
While loading the data into Pig:
INFILE1 = LOAD 'file1.csv' using PigStorage(',') as (id,Device,Description,HTML Code,Identifier)
The extra comma shifts the remaining fields, so the data ends up in the wrong columns.
Could you please suggest how to handle this data issue in a Pig script?
If the file is a correct CSV, the field that contains the comma should be enclosed in double quotes. Then you just have to load your data using CSVLoader: https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
register 'piggybank.jar';
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
INFILE1 = LOAD 'file1.csv' USING CSVLoader() AS (id, Device, Description, HTML_Code, Identifier);
(Note that field names in a Pig schema cannot contain spaces, hence HTML_Code.)
If you don't have any double quotes, maybe you could try a regex, knowing that the HTML field always starts with "<" (use the regex functions in Pig: https://pig.apache.org/docs/r0.11.1/func.html#regex-extract-all). Tell me if you need more info.
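A sketch of that regex approach (the pattern and field names are assumptions based on the sample rows; REGEX_EXTRACT_ALL returns a tuple of the captured groups, and only the Description group is allowed to contain commas):
-- load each line whole, then split it with a pattern that confines
-- stray commas to the Description group
RAW = LOAD 'file1.csv' AS (line:chararray);
PARSED = FOREACH RAW GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(line, '^([^,]+),([^,]+),(.*),(<HTML>.*</HTML>),([^,]+)$')
    ) AS (id:chararray, Device:chararray, Description:chararray,
          HTML_Code:chararray, Identifier:chararray);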

Updating a local SAS table from a CSV file

I have a table which I need to update. From my understanding, you can do something like the following:
data new_table;
update old_table update_table;
by some_key;
run;
My issue (well I have a few...) is that I'm importing the "update_table" from a CSV file and the formats aren't matching the "old_table", so the update fails.
I've tried creating the "update_table" from the "old_table" using proc sql create table with zero observations, which created the correct types/formats, but then I was unable to insert data into it without replacing it.
The other major issue I have is that there are a large number of columns (480), and custom formats, and I've run up against a 6000 character limit for the script.
I'm very new to SAS and any help would be greatly appreciated :)
It sounds like you need to use a data step to read in your CSV. There are lots of papers out there explaining how to do this, so I won't cover it here. This will allow you to specify the format (numeric/character) for each field. The nice thing here is you already know what formats they need to be in (from your existing dataset), so you can create this read in fairly easily.
Let's say your data is so:
data have;
informat x date9.;
input x y z $;
datalines;
10JAN2010 1 Base
11JAN2010 4 City
12JAN2010 8 State
;;;;
run;
Now, if you have a CSV of the same format, you can read it in by generating the input code from the above dataset. You can use PROC CONTENTS to do this, or you can generate it from dictionary.columns, which has the same information as PROC CONTENTS.
proc sql;
select catx(' ', name, ifc(type='char', '$', ' ')) into :inputlist
separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE';
select catx(' ',name,informat) into :informatlist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not missing(informat);
quit;
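For the HAVE example above, the generated macro variables should look roughly like this (a sanity check; dictionary.columns returns the columns in variable order):
%put inputlist=&inputlist;       /* expect: x y z $  */
%put informatlist=&informatlist; /* expect: x date9. */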
The above are two examples; they may or may not be sufficient for your particular needs.
Then you use them like so:
data want;
infile datalines dlm=',';
informat &informatlist.;
input &inputlist.;
datalines;
13JAN2010,9,REGION
;;;;
run;
(obviously you would use your CSV file instead of datalines, just used here as example).
The point is you can write the data step code using the metadata from your original dataset.
I needed this today, so I made a macro out of it: https://core.sasjs.io/mp__csv2ds_8sas.html
It doesn't wrap the input statement so it may break with a large number of columns if you have batch line length limits. If anyone would like me to fix that, just raise an issue: https://github.com/sasjs/core/issues/new

How to process multivariate time series given as multiline, multirow *.csv files with Apache Pig?

I need to process multivariate time series given as multiline, multirow *.csv files with Apache Pig. I am trying to use a custom UDF (EvalFunc) to solve my problem. However, all loaders I tried (except org.apache.pig.impl.io.ReadToEndLoader, which I cannot get to work) return one line of the file as one record when loading my csv files and passing the data to the UDF. What I need, however, is one column (or the content of the complete file) so that I can process a complete time series. Processing one value at a time is obviously useless because I need longer sequences of values...
The data in the csv-files looks like this (30 columns, 1st is a datetime, all others are double values, here 3 sample lines):
17.06.2013 00:00:00;427;-13.793273;2.885583;-0.074701;209.790688;233.118828;1.411723;329.099170;331.554919;0.077026;0.485670;0.691253;2.847106;297.912382;50.000000;0.000000;0.012599;1.161726;0.023110;0.952259;0.024673;2.304819;0.027350;0.671688;0.025068;0.091313;0.026113;0.271128;0.032320;0
17.06.2013 00:00:01;430;-13.879651;3.137179;-0.067678;209.796500;233.141233;1.411920;329.176863;330.910693;0.071084;0.365037;0.564816;2.837506;293.418550;50.000000;0.000000;0.014108;1.159334;0.020250;0.954318;0.022934;2.294808;0.028274;0.668540;0.020850;0.093157;0.027120;0.265855;0.033370;0
17.06.2013 00:00:02;451;-15.080651;3.397742;-0.078467;209.781511;233.117081;1.410744;328.868437;330.494671;0.076037;0.358719;0.544694;2.841955;288.345883;50.000000;0.000000;0.017203;1.158976;0.022345;0.959076;0.018688;2.298611;0.027253;0.665095;0.025332;0.099996;0.023892;0.271983;0.024882;0
Has anyone an idea how I could process this as 29 time series?
Thanks in advance!
What do you want to achieve?
If you want to read all rows in all files as a single record, this can work:
a = LOAD '...' USING PigStorage(';') as <schema> ;
b = GROUP a ALL;
b will contain all the rows in a bag.
If you want to read each CSV file as a single record, this can work:
a = LOAD '...' USING PigStorage(';','-tagsource') as <schema> ;
b = GROUP a BY $0; --$0 is the filename
b will contain all the rows per file in a bag.
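A minimal sketch of the second variant, with hypothetical names ts and col1 for the first two data columns (extend the schema to all 30 columns):
-- '-tagsource' prepends the source filename as the first field
a = LOAD 'data/*.csv' USING PigStorage(';', '-tagsource')
        AS (filename:chararray, ts:chararray, col1:double);
b = GROUP a BY filename;   -- one bag per file = one complete series
c = FOREACH b GENERATE group AS filename,
                       AVG(a.col1) AS mean_col1;  -- e.g. summarize each series
DUMP c;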