Pig decimal value not working - csv

I am studying the Pig language in Cloudera and I have a problem with decimal values.
I have a CSV file with a lot of data of different types, including a column named "petrol_average" with values like "5,78524512".
I want to load this data from my CSV file.
My script is:
a = LOAD 'myfile.csv' USING PigStorage(';') AS (country: chararray,
petrol_average: double);
b = FOREACH a generate country, petrol_average;
DUMP b;
The dumped result looks like:
(Canada,)
(Brazil,5.0)
(France,)
(United States,8.0)
...
In my CSV file I do have petrol_average values for Canada and France, but my Pig script is not showing them, and the value for Brazil is really 5,78524512: it is being rounded automatically.
Do you have an answer to my problem?
Sorry for my English.

sample of myfile.csv
a,578524512
b,8596243
c,15424685
d,14253685
Code:
A = LOAD 'data/MyFile.txt' USING PigStorage(',') AS (country: chararray, petrol_average: long);
NOTE:
You created your schema with double, but your data here is a plain integer, so the value gets cut off after the first digit; that is why I loaded it as long.
grunt> dump A;
grunt> B = FOREACH A generate country, petrol_average;
grunt> dump B;
Result:
(a,578524512)
(b,8596243)
(c,15424685)
(d,14253685)
Works fine. Happy Hadoop :)

@MaheshGupta
Thank you for your answer. When I use float or long, I get a result like this:
()
(8.0)
()
()
()
()
()
()
()
()
()
When I declare it in my schema as chararray, I get this result:
(9,100000381)
(8,199999809)
(8,399999619)
(8,100000381)
(8,399999619)
(8,399999619)
(8,399999619)
(8,100000381)
(8,5)
(8,199999809)
(9)
My script is this one:
a = LOAD 'myfile.csv' USING PigStorage(';') AS
(country: chararray,
petrol_average: chararray);
b = FOREACH a generate petrol_average;
DUMP b;
My big problem is with division and addition: I can't do them, because the type is chararray.
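A minimal workaround sketch, assuming the file really uses ';' as the field separator and ',' as the decimal separator (the exact format isn't confirmed in the post): load petrol_average as chararray, turn the decimal comma into a dot with REPLACE, and cast the result to double so arithmetic becomes possible.
a = LOAD 'myfile.csv' USING PigStorage(';') AS (country: chararray, petrol_average: chararray);
-- "5,78524512" becomes "5.78524512", and the cast makes it a real double
b = FOREACH a GENERATE country, (double)REPLACE(petrol_average, ',', '.') AS petrol_average;
-- division and addition now work on the numeric column
c = FOREACH b GENERATE country, petrol_average / 2.0;
DUMP c;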

Related

Importing data from VERY large text file into Mysql [duplicate]

I have a very large CSV file (150 MB). What is the best way to import it to MySQL?
I have to do some manipulation in PHP before inserting it into the MySQL table.
You could take a look at LOAD DATA INFILE in MySQL.
You might be able to do the manipulations once the data is loaded into MySQL, rather than first reading it into PHP. First store the raw data in a temporary table using LOAD DATA INFILE, then transform the data to the target table using a statement like the following:
INSERT INTO targettable (x, y, z)
SELECT foo(x), bar(y), z
FROM temptable
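For the staging load itself, a rough sketch of the LOAD DATA INFILE step (file path, table name, and delimiter settings are placeholders to adapt to your CSV):
LOAD DATA INFILE '/path/to/file.csv'
INTO TABLE temptable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- drop this clause if the file has no header row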
I would just open it with fopen and use fgetcsv to read each line into an array.
Pseudo-PHP follows:
mysql_connect(/* connection details */);
$filehandle = fopen("/path/to/file.csv", "r");
while (($data = fgetcsv($filehandle, 1000, ",")) !== FALSE) {
    // $data is an array of the fields on this line
    // do your parsing here and insert into the table
}
fclose($filehandle);

Error parsing JSON: more than one document in the input (Redshift to Snowflake SQL)

I'm trying to convert a query from Redshift to Snowflake SQL.
The Redshift query looks like this:
SELECT
cr.creatives as creatives
, JSON_ARRAY_LENGTH(cr.creatives) as creatives_length
, JSON_EXTRACT_PATH_TEXT(JSON_EXTRACT_ARRAY_ELEMENT_TEXT (cr.creatives,0),'previewUrl') as preview_url
FROM campaign_revisions cr
The Snowflake query looks like this:
SELECT
cr.creatives as creatives
, ARRAY_SIZE(TO_ARRAY(ARRAY_CONSTRUCT(cr.creatives))) as creatives_length
, PARSE_JSON(PARSE_JSON(cr.creatives)[0]):previewUrl as preview_url
FROM campaign_revisions cr
It seems like JSON_EXTRACT_PATH_TEXT isn't converted correctly, as the Snowflake query results in error:
Error parsing JSON: more than one document in the input
cr.creatives is formatted like this:
"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]"
It seems to me that you are not working with valid JSON data inside Snowflake.
Please review the file format used for your COPY INTO command.
If you open the "JSON" text provided in a text editor, note that it is not parsed or formatted as JSON because of the quoting it contains. Once your issue with the double / escaped quotes is handled, you should be able to make good progress.
(Screenshot: proper JSON on the left, the original data on the right.)
If you are not inclined to reload your data, see if you can create a Javascript User Defined Function to remove the quotes from your string, then you can use Snowflake to process the variant column.
The following code is a working proof of concept that removes the double quotes for you.
var textOriginal = '[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]';
function parseText(input) {
    var a = input.replaceAll('""', '"');
    a = JSON.parse(a);
    return a;
}
var x = parseText(textOriginal);
console.log(x);
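If you take the UDF route, a hedged sketch of wrapping that cleanup into a Snowflake JavaScript UDF could look like the following (the function name clean_creatives is illustrative, not from the original post):
create or replace function clean_creatives(creatives string)
returns variant
language javascript
as
$$
    // Snowflake exposes the argument in uppercase inside the JavaScript body.
    // Collapse the doubled quotes, then parse the result as JSON.
    return JSON.parse(CREATIVES.replace(/""/g, '"'));
$$;

-- the cleaned variant can then be queried with the usual path syntax
select clean_creatives(cr.creatives)[0]:previewUrl::string as preview_url
from campaign_revisions cr;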
For anyone else seeing this double double quote issue in JSON fields coming from CSV files in a Snowflake external stage (slightly different issue than the original question posted):
The issue is likely that you need to use the FIELD_OPTIONALLY_ENCLOSED_BY setting, specifically FIELD_OPTIONALLY_ENCLOSED_BY = '"', when setting up your file format.
Example of creating such a file format:
create or replace file format mydb.myschema.my_tsv_file_format
type = CSV
field_delimiter = '\t'
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
And an example of querying from a stage using this file format:
select
    $1 field_one,
    $2 field_two
    -- ...and so on
from '@my_s3_stage/path/to/file/my_tab_separated_file.csv' (file_format => 'my_tsv_file_format')

Dealing with currency values in PIG - pigstorage

I have a 2-column CSV file loaded in HDFS. Column 1 is a model name, column 2 is a price in $. Example - Model: IE33, Price: $52678.00
When I run the following script, the price values all come back as a two-digit result, for example $52.
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
dump ultraPrice;
All my values are between $20000 and $60000. I don't know why it is being cut off.
If I change the CSV file and remove the $ from the price values everything works fine, but I know there has to be a better way.
Note that in your load statement you are not specifying the data types. By default, Model and Price will be of type bytearray, hence the discrepancy.
You can either remove the $ from the CSV file, or load the data as chararray, strip the $ sign, and cast it to float:
A = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING TextLoader() as (line:chararray);
A1 = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z0-9.,\\s]+)','');
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0,','));
B1 = FOREACH B GENERATE $0 as Model,(float)$1 as Price;
DUMP B1;

SSIS write DT_NTEXT into an UTF-8 csv file

I need to write the result of an SQL query to a CSV file encoded in UTF-8 (I need this encoding because there are French letters). One of the columns is too large (more than 20,000 characters), so I can't use DT_WSTR for it. The input type is DT_TEXT, so I use a Data Conversion to change it to DT_NTEXT. But when I then try to write it to the file, I get this error message:
Error 2 Validation error. The data type for "input column" is
DT_NTEXT, which is not supported with ANSI files. Use DT_TEXT instead
and convert the data to DT_NTEXT using the data conversion component
Is there a way I can write the data to my file?
Thank you
I have had this kind of issue too. When working with data larger than 255 characters, SSIS sees it as blob data and will always handle it as such.
I then converted this blob stream data to readable text with a script component; after that, other transformations become possible.
This was the case in the SSIS that came with SQL Server 2008, and I believe it hasn't changed since.
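As a rough sketch of what that conversion can look like inside the script component (the column name LargeText is an assumption, not from the original post): read the blob bytes of the DT_NTEXT column and decode them as Unicode.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // DT_NTEXT arrives as a blob; read all of its bytes...
    byte[] blobBytes = Row.LargeText.GetBlobData(0, (int)Row.LargeText.Length);

    // ...and decode them as UTF-16, which is how DT_NTEXT is stored
    string text = System.Text.Encoding.Unicode.GetString(blobBytes);

    // "text" can now be written out or assigned to a non-blob output column
}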
I ended up doing just what Samyne says and used a script.
First I modified my SQL stored procedure: instead of having several columns, I put all the info into one single column, like this:
Select Column1 + '^' + Column2 + '^' + Column3 ...
Then I used this code in a script:
string fileName = Dts.Variables["SLTemplateFilePath"].Value.ToString();
using (var stream = new FileStream(fileName, FileMode.Truncate))
{
    using (var sw = new StreamWriter(stream, Encoding.UTF8))
    {
        OleDbDataAdapter oleDA = new OleDbDataAdapter();
        DataTable dt = new DataTable();
        oleDA.Fill(dt, Dts.Variables["FileData"].Value);
        foreach (DataRow row in dt.Rows)
        {
            foreach (DataColumn column in dt.Columns)
            {
                sw.WriteLine(row[column]);
            }
        }
        sw.WriteLine();
    }
}
Putting all the info into one column is optional; I just wanted to avoid handling it in the script, so that if my stored procedure changes I don't need to modify the SSIS package.

Create a SAS function that takes as input and output a dataset

I am applying the same transformation of 10 sub-steps to multiple data sets. Let's call this transformation flag_price_change.
This transformation takes as input a dataset and a threshold (real) and creates 10 sub-datasets in order to come up with the final one with some added columns. As I said before, I repeat this transformation on multiple datasets.
As I am processing multiple data tables the same way, I would like to know if I can create something like this function in SAS:
flag_price_change(input_table, column_name1, column_name2, threshold, output_table)
where column_name1 and column_name2 are just the names of the columns the algorithm focuses on, and output_table is the table created after flag_price_change is executed.
Questions:
What's the procedure to define such a function?
Can I store it in a separate SAS file?
How do I call this function from another SAS program?
SAS functions are for individual observations of data. What you want is a macro (check out a starter guide here), which is defined like this:
%macro flag_price_change(input_table, column_name1, column_name2, threshold, output_table);
/** Inside the macro, you can refer to each parameter/argument
with an ampersand in front of it. So for example, to add
column_name1 to column_name2, you would do the following:
**/
DATA &output_table;
set &input_table;
new_variable = &column_name1 + &column_name2;
RUN;
%mend;
To call the macro, you would do this:
%flag_price_change(
input_table = data1,
column_name1 = var1,
column_name2 = var2,
threshold = 0.5,
output_table = output1);
To call the same code on another data set with different variable names and threshold:
%flag_price_change(
input_table = data2,
column_name1 = var3,
column_name2 = var4,
threshold = 0.25,
output_table = output2);
There are a lot of tricks and catches with macro programming to be aware of, so do check your work at each step.