Remove uneven blank spaces from a txt file in pig - csv

I have a text file with uneven blank spaces and I want to store it as a CSV file using Pig. My file has the following format:
2013 210 0 2878 -7543 4 29 20 116
2013 210 10 2875 -7538 4 32 20 116
2013 210 20 2872 -7533 4 29 20 116
2013 210 30 2870 -7527 4 29 20 115
2013 210 40 2867 -7522 4 30 20 115
2013 210 50 2864 -7516 4 29 20 115
2013 210 60 2861 -7511 4 29 20 115

If the spacing is uneven, read each row as a single line first, squeeze runs of whitespace down to one space with a regular expression, then use STRSPLIT to split the single-space-separated data.
text_data = load 'file.txt' as (line:chararray);
squeezed_data = foreach text_data generate REPLACE(line, '\\s+', ' ') as line;
split_data = foreach squeezed_data generate FLATTEN(STRSPLIT(line, ' '));
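For comparison, the squeeze-then-split idea from this answer can be sketched in Python rather than Pig. This is only an illustration of the logic (collapse whitespace runs, split on single spaces, write comma-separated rows); the function name `squeeze_to_csv` is hypothetical and not part of the original answer.

```python
import csv
import io
import re


def squeeze_to_csv(lines):
    """Collapse whitespace runs in each line and return the rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        # Same regular expression as the Pig REPLACE call: any run of
        # whitespace becomes a single space.
        squeezed = re.sub(r"\s+", " ", line.strip())
        if squeezed:
            writer.writerow(squeezed.split(" "))
    return out.getvalue()


if __name__ == "__main__":
    sample = ["2013  210    0 2878 -7543   4  29  20 116"]
    print(squeeze_to_csv(sample), end="")
```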

importing a table from a webpage using pd.read_html

I am trying to use pd.read_html to import the table under "Daily Observations" from https://www.wunderground.com/history/monthly/us/mi/ann-arbor/date/2020-1
I tried this, but it raised the error "HTTPError: HTTP Error 403: Forbidden".
Jan = pd.read_html('https://www.wunderground.com/history/monthly/us/mi/ann-arbor/date/2020-1')
Alternatively, I copied the source code of the table and saved it as an html file.
The html file looks like this:
https://i.stack.imgur.com/YTkF9.jpg
When I use pd.read_html to import this html file, the imported dataset does not seem to be a dataframe. The rows and columns come out jumbled like this:
[ Time \
0 Jan 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
1 Jan
2 1
3 2
4 3
.. ...
220 0.02
221 0.00
222 0.00
223 0.00
224 0.00
Temperature (° F) \
0 Max Avg Min 38 30.8 26 47 41.8 35 45 42.8 39 3...
1 NaN
2 NaN
3 NaN
4 NaN
.. ...
220 NaN
221 NaN
222 NaN
223 NaN
224 NaN
How can I solve it?

Join query returns duplicate rows

purchase_request_master
prm_voucher_no| project_id| status_id| request_date
17 46 3 11-6-2016 0:00
18 46 3 20-6-2016 0:00
19 46 3 21-6-2016 0:00
purchase_request_details
prm_voucher_no| item_id| request_quantity
17 80 50
17 81 100
18 80 75
19 83 10
19 81 35
19 82 120
purchase_order_master
pom_voucher_no| prm_request_id |supplier_id
16 17 14
17 18 14
18 19 15
purchase_order_details
pom_voucher_no| approved_quantity| rate
16 50 1000
16 100 1500
17 75 150
18 10 2500
18 35 3000
18 120 1700
When I run the query below it returns 14 rows (duplicate rows are returned). The expected output is 6 rows. Please refer to the output tables below.
select prm.prm_voucher_no,prm.project_id,prm.status_id,prd.request_quantity,prd.item_id,pom.pom_voucher_no,pom.supplier_id,pod.rate,pod.approved_quantity
from purchase_request_master prm
left join purchase_request_details prd on prd.prm_voucher_no=prm.prm_voucher_no
left join purchase_order_master pom on prm.prm_voucher_no=pom.prm_request_id
left join purchase_order_details pod on pom.pom_voucher_no=pod.pom_voucher_no
where prm.project_id=46 and ( EXTRACT(MONTH FROM prm.request_Date)=6) and (EXTRACT(YEAR FROM prm.request_Date)=2016)
group by prm.prm_voucher_no,prm.project_id,prm.status_id,prd.request_quantity,prd.item_id,pom.pom_voucher_no,pom.supplier_id,pod.rate,pod.approved_quantity
order by prm.prm_voucher_no
I tried inner join, distinct, distinct least, group by, a temporary table, and a with clause, but none of them helped; every one of them returns duplicate rows.
How can I solve this problem?
OUTPUT
prm_voucher_no| project_id| status_id|item_id|request_quantity |pom_voucher_no| supplier_id|approved_quantity | rate
17 46 3 80 50 16 14 100 1000
17 46 3 81 100 16 14 75 1500
17 46 3 80 75 16 15 10 150
17 46 3 81 10 16 14 35 10
18 46 3 81 35 17 14 120 35
19 46 3 80 120 18 15 50 120
19 46 3 81 50 18 14 100 1000
19 46 3 82 100 18 14 75 1500
19 46 3 80 75 18 15 10 150
19 46 3 81 10 18 14 35 10
19 46 3 82 35 18 14 120 35
19 46 3 80 120 18 15 35 120
19 46 3 81 35 18 14 50 1500
19 46 3 82 50 18 15 100 1700
EXPECTED OUTPUT
prm_voucher_no| project_id| status_id| item_id| request_quantity| pom_voucher_no| supplier_id|approved_quantity| rate
17 46 3 80 50 16 14 100 1000
17 46 3 81 100 16 14 75 1500
18 46 3 81 35 17 14 120 35
19 46 3 80 120 18 15 50 120
19 46 3 81 50 18 14 100 1000
19 46 3 82 100 18 14 75 1500
I think the problem is in your data model itself. Ideally, you would have a line_number field in both of your "detail" tables, and this would be used in the join:
create table purchase_request_details (
prm_voucher_no integer,
prm_voucher_line integer, -- add this
item_id integer,
request_quantity integer
);
create table purchase_order_details (
pom_voucher_no integer,
pom_voucher_line integer, -- and this
approved_quantity integer,
rate integer
);
And then this query would give you the results you seek:
select
prm.prm_voucher_no,prm.project_id,prm.status_id,prd.request_quantity,
prd.item_id,pom.pom_voucher_no,pom.supplier_id,pod.rate,pod.approved_quantity
from
purchase_request_master prm
left join purchase_request_details prd on
prd.prm_voucher_no=prm.prm_voucher_no
left join purchase_order_master pom on
prm.prm_voucher_no=pom.prm_request_id
left join purchase_order_details pod on
pom.pom_voucher_no=pod.pom_voucher_no and
prd.prm_voucher_line = pod.pom_voucher_line -- this is the key
where
prm.project_id=46 and
EXTRACT(MONTH FROM prm.request_Date) = 6 and
EXTRACT(YEAR FROM prm.request_Date) = 2016
order by prm.prm_voucher_no
If you have no ability to control the data model, then I think the best you can do is artificially add a line number. I don't recommend this at all, as you are presupposing a lot of things, most notably that the order of records in the one table automatically correlates to the order of records in the other -- and I'm betting that's far from a guarantee.
Adding a line number would be done using the row_number() analytic function. PostgreSQL has that, but MySQL only gained it in version 8.0, and you have both tags in your question. Which DBMS are you using?
If you can't add line numbers, can you add item_id to your purchase_order_details table? This would likely handle your issue, unless you can have the same item on multiple lines within a purchase request/order.
In the data you have above, a join on the requested quantity (prd.request_quantity = pod.approved_quantity) fixes your issue, but I am highly confident that this would burn you when you started running it against real data.
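To see why the row count jumps from 6 to 14, and how a line-number column brings it back, here is a minimal sketch using Python's built-in sqlite3 module with the sample data from the question. It is an illustration only: the line numbers are assigned here in the order the rows appear in the question, which is exactly the assumption the answer warns about, and the helper names `build_db` and `count_rows` are hypothetical.

```python
import sqlite3


def build_db():
    """Load the question's detail rows into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table purchase_request_details (
            prm_voucher_no integer, prm_voucher_line integer,
            item_id integer, request_quantity integer);
        create table purchase_order_master (
            pom_voucher_no integer, prm_request_id integer, supplier_id integer);
        create table purchase_order_details (
            pom_voucher_no integer, pom_voucher_line integer,
            approved_quantity integer, rate integer);
    """)
    # Line numbers (second column) are assumed, not present in the real tables.
    conn.executemany(
        "insert into purchase_request_details values (?,?,?,?)",
        [(17, 1, 80, 50), (17, 2, 81, 100), (18, 1, 80, 75),
         (19, 1, 83, 10), (19, 2, 81, 35), (19, 3, 82, 120)])
    conn.executemany(
        "insert into purchase_order_master values (?,?,?)",
        [(16, 17, 14), (17, 18, 14), (18, 19, 15)])
    conn.executemany(
        "insert into purchase_order_details values (?,?,?,?)",
        [(16, 1, 50, 1000), (16, 2, 100, 1500), (17, 1, 75, 150),
         (18, 1, 10, 2500), (18, 2, 35, 3000), (18, 3, 120, 1700)])
    return conn


def count_rows(conn, match_lines):
    """Count joined rows, optionally matching detail lines one to one."""
    sql = """
        select count(*)
        from purchase_request_details prd
        join purchase_order_master pom on pom.prm_request_id = prd.prm_voucher_no
        join purchase_order_details pod on pod.pom_voucher_no = pom.pom_voucher_no
    """
    if match_lines:
        sql += " and pod.pom_voucher_line = prd.prm_voucher_line"
    return conn.execute(sql).fetchone()[0]


if __name__ == "__main__":
    conn = build_db()
    print(count_rows(conn, match_lines=False))  # every request line pairs with every order line
    print(count_rows(conn, match_lines=True))   # one order line per request line
```

Without the line match, each voucher contributes (request lines) x (order lines) rows, i.e. 2x2 + 1x1 + 3x3; with it, each request line pairs with exactly one order line.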

Percentage by Row Group

I have a matrix with rows grouped by Dept (Department). I am trying to get the actual hours / required hours percentage in a column for each row group, but I can only get the total %, not the % by group. Ex:
I should get this:
Total Employee Req Hrs Rep Hrs % Billable hrs % NonBill Hrs % Time Off %
Dept A Total 672 680 101 575 85 140 21 8 1
Emp1 168 170 101 150 89 50 29 0 0
Emp2 168 165 98 120 71 20 12 8 4
Emp3 168 175 104 155 92 20 12 0 0
Emp4 168 170 101 150 89 50 29 0 0
Dept B Total 420 428 102 365 87 80 19 4 .1
Emp5 168 170 101 150 89 50 29 0 0
Emp6 84 84 98 60 71 10 12 4 4
Emp7 168 175 104 155 92 20 12 0 0
G Total 1092 1108 101 940 86 190 17 12 1
But I get this:
Total Employee Req Hrs Rep Hrs % Billable hrs % NonBill Hrs % Time Off %
Dept A Total 1684 1675 101 1250 86 225 17 12 1
Emp1 168 170 101 150 89 50 29 0 0
Emp2 168 165 98 120 71 20 12 8 4
Emp3 168 175 104 155 92 20 12 0 0
Emp4 168 170 101 150 89 50 29 0 0
Dept B Total 1092 1108 101 1250 86 225 17 12 1
Emp5 168 170 101 150 89 50 29 0 0
Emp6 84 84 98 60 71 10 12 4 4
Emp7 168 175 104 155 92 20 12 0 0
G Total 1092 1108 101 940 86 190 17 12 1
The totals are correct but the % is wrong.
I have several Datasets because the report only runs the department you are in, except for the VPs who can see all departments.
I inserted the percentage columns into the matrix and have tried several expressions with no success, including:
=Fields!ActHrs.Value/Fields!ReqHrs.Value
=Sum(Fields!ActHrs.Value, "Ut_Query")/Sum(Fields!ReqHrs.Value, "Ut_Query")
=Sum(Fields!ActHrs.Value, "Ut_Query","Dept")/Sum(Fields!ReqHrs.Value, "Ut_Query","Dept")
=Sum(Fields!ActHrs.Value,"Dept", "Ut_Query")/Sum(Fields!ReqHrs.Value, "Dept","Ut_Query")
Plus more I can't even remember.
I tried creating new groups, and even a new matrix.
There must be a simple way to get the percentage by group, but I have not found an answer on any of the internet boards.
OK, I figured this out, but it doesn't make much sense. If I try:
=Textbox29/TextBox28 I get error messages about undefined variables.
If I go to the textbox properties, rename the textboxes to Act and Req, and use:
=Act/Req I get the right answer.
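The calculation the question is after (sum the actual hours and the required hours within each group, then divide) can be sketched outside SSRS. The Python below is only an illustration of that arithmetic using the employee rows from the question; it is not an SSRS expression, and the function name `percent_by_group` is hypothetical.

```python
from collections import defaultdict

# (department, employee, required hours, reported hours), copied from the
# employee rows in the question.
ROWS = [
    ("Dept A", "Emp1", 168, 170),
    ("Dept A", "Emp2", 168, 165),
    ("Dept A", "Emp3", 168, 175),
    ("Dept A", "Emp4", 168, 170),
    ("Dept B", "Emp5", 168, 170),
    ("Dept B", "Emp6", 84, 84),
    ("Dept B", "Emp7", 168, 175),
]


def percent_by_group(rows):
    """Return reported/required as a rounded percentage for each department."""
    required = defaultdict(int)
    reported = defaultdict(int)
    for dept, _employee, req_hours, rep_hours in rows:
        required[dept] += req_hours
        reported[dept] += rep_hours
    # Divide the group sums, rather than the report-wide sums.
    return {dept: round(100 * reported[dept] / required[dept]) for dept in required}


if __name__ == "__main__":
    print(percent_by_group(ROWS))
```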

SSRS Moving average

Please, I need help calculating a moving average in SSRS. See the example from Excel below. I have searched the net and looked at other forums without success.
The moving average is something like Avg(A1:A3), Avg(A2:A4), Avg(A3:A5), etc.
Count of ID BAND
-20 >20 <12 Grand Total Three Months Avg
Year Month <15 15-20 >20 Total
2012/2013 Jan 35 9 13 57 57
Feb 34 23 20 77 67
Mar 25 33 8 66 =AVERAGE(F5:F7)
Apr 7 31 13 51 65
May 6 10 13 29 49
Jun 19 14 18 51 44
Jul 34 16 6 56 45
Aug 26 21 30 77 =AVERAGE(F10:F12)
Sep 13 53 21 =AVERAGE(F11:F13)
Oct 1 34 33 68 =AVERAGE(F12:F14)
Nov 35 16 19 70 53
Dec 33 23 36 92 77
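As a reference for the arithmetic only (not an SSRS expression), here is a minimal Python sketch of a three-month trailing average. It assumes the window looks back from the current month and is simply shorter for the first two months, which matches the first rows of the Excel example; the function name `trailing_average` is hypothetical.

```python
def trailing_average(values, window=3):
    """Average each value with up to window-1 values before it."""
    averages = []
    for i in range(len(values)):
        # Early positions have fewer than `window` values available.
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


if __name__ == "__main__":
    totals = [57, 77, 66, 51, 29, 51, 56]  # Jan..Jul from the "Total" column
    print([round(v) for v in trailing_average(totals)])
```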

Code-golf: Output multiplication table to the Console

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I recently pointed a student doing work experience to an article about dumping a multiplication table to the console. It used a nested for loop and multiplied the step value of each.
This looked like a .NET 2.0 approach. I was wondering, with the use of Linq and extension methods, for example, how many lines of code it would take to achieve the same result.
Is the stackoverflow community up to the challenge?
The challenge:
In a console application, write code to generate a table like this example:
01 02 03 04 05 06 07 08 09
02 04 06 08 10 12 14 16 18
03 06 09 12 15 18 21 24 27
04 08 12 16 20 24 28 32 36
05 10 15 20 25 30 35 40 45
06 12 18 24 30 36 42 48 54
07 14 21 28 35 42 49 56 63
08 16 24 32 40 48 56 64 72
09 18 27 36 45 54 63 72 81
As this turned into a language-agnostic code-golf battle, I'll go with the community's decision about which is the best solution for the accepted answer.
There's been a lot of talk about the spec and the format that the table should be in. I purposefully added the 00 format, but the double new-line was originally only there because I didn't know how to format the text when creating the post!
J - 8 chars - 24 chars for proper format
*/~1+i.9
Gives:
1 2 3 4 5 6 7 8 9
2 4 6 8 10 12 14 16 18
3 6 9 12 15 18 21 24 27
4 8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
6 12 18 24 30 36 42 48 54
7 14 21 28 35 42 49 56 63
8 16 24 32 40 48 56 64 72
9 18 27 36 45 54 63 72 81
This solution found by @earl:
'r(0)q( )3.'8!:2*/~1+i.9
Gives:
01 02 03 04 05 06 07 08 09
02 04 06 08 10 12 14 16 18
03 06 09 12 15 18 21 24 27
04 08 12 16 20 24 28 32 36
05 10 15 20 25 30 35 40 45
06 12 18 24 30 36 42 48 54
07 14 21 28 35 42 49 56 63
08 16 24 32 40 48 56 64 72
09 18 27 36 45 54 63 72 81
MATLAB - 10 characters
a=1:9;a'*a
... or 33 characters for stricter output format
a=1:9;disp(num2str(a'*a,'%.2d '))
Brainf**k - 185 chars
>---------[++++++++++>---------[+<[-<+>>+++++++++[->+>>---------[>-<++++++++++<]<[>]>>+<<<<]>[-<+>]<---------<]<[->+<]>>>>++++[-<++++>]<[->++>+++>+++<<<]>>>[.[-]<]<]++++++++++.[-<->]<+]
cat - 252 characters
01 02 03 04 05 06 07 08 09
02 04 06 08 10 12 14 16 18
03 06 09 12 15 18 21 24 27
04 08 12 16 20 24 28 32 36
05 10 15 20 25 30 35 40 45
06 12 18 24 30 36 42 48 54
07 14 21 28 35 42 49 56 63
08 16 24 32 40 48 56 64 72
09 18 27 36 45 54 63 72 81
Assuming that a trailing newline is wanted; otherwise, 251 chars.
* runs *
Python - 61 chars
r=range(1,10)
for y in r:print"%02d "*9%tuple(y*x for x in r)
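The golfed version above relies on the Python 2 print statement. For readers on Python 3, an ungolfed sketch of the same idea (format each product with two digits, nine per row) might look like this; the function name `multiplication_table` is hypothetical.

```python
def multiplication_table(size=9):
    """Return the table as a list of lines with zero-padded two-digit cells."""
    factors = range(1, size + 1)
    return [" ".join("%02d" % (y * x) for x in factors) for y in factors]


if __name__ == "__main__":
    print("\n".join(multiplication_table()))
```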
C#
This is only 2 lines. It uses lambdas not extension methods
var nums = new List<int>() { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
nums.ForEach(n => { nums.ForEach(n2 => Console.Write((n * n2).ToString("00 "))); Console.WriteLine(); });
and of course it could be done in one long unreadable line
new List<int>() { 1, 2, 3, 4, 5, 6, 7, 8, 9 }.ForEach(n => { new List<int>() { 1, 2, 3, 4, 5, 6, 7, 8, 9 }.ForEach(n2 => Console.Write((n * n2).ToString("00 "))); Console.WriteLine(); });
All of this assumes you consider a lambda one line.
K - 12 characters
Let's take the rosetta-stoning seriously, and compare Kdb+'s K4 with the canonical J solution (*/~1+i.9):
a*/:\:a:1+!9
1 2 3 4 5 6 7 8 9
2 4 6 8 10 12 14 16 18
3 6 9 12 15 18 21 24 27
4 8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
6 12 18 24 30 36 42 48 54
7 14 21 28 35 42 49 56 63
8 16 24 32 40 48 56 64 72
9 18 27 36 45 54 63 72 81
J's "table" operator (/) equals the K "each-left each-right" (/:\:) idiom. We don't have J's extremely handy "reflexive" operator (~) in K, so we have to pass a as both left and right argument.
Fortran95 - 40 chars (beating perl by 4 chars!)
This solution does print the leading zeros as per the spec.
print"(9(i3.2))",((i*j,i=1,9),j=1,9);end
Oracle SQL, 103 characters:
select n, n*2, n*3, n*4, n*5, n*6, n*7, n*8, n*9 from (select rownum n from dual CONNECT BY LEVEL < 10)
C# - 117, 113, 99, 96, 95 89 characters
updated based on NickLarsen's idea
for(int x=0,y;++x<10;)
for(y=x;y<x*10;y+=x)
Console.Write(y.ToString(y<x*9?"00 ":"00 \n"));
99, 85, 82 81 characters
... If you don't care about the leading zeros and would allow tabs for alignment.
for(int x=0,y;++x<10;)
{
var w="";
for(y=0;++y<10;)
w+=x*y+" ";
Console.WriteLine(w);
}
COBOL - 218 chars -> 216 chars
PROGRAM-ID.P.DATA DIVISION.WORKING-STORAGE SECTION.
1 I PIC 9.
1 N PIC 99.
PROCEDURE DIVISION.PERFORM 9 TIMES
ADD 1 TO I
SET N TO I
PERFORM 9 TIMES
DISPLAY N' 'NO ADVANCING
ADD I TO N
END-PERFORM
DISPLAY''
END-PERFORM.
Edit
216 chars (probably a different compiler)
PROGRAM-ID.P.DATA DIVISION.WORKING-STORAGE SECTION.
1 I PIC 9.
1 N PIC 99.
PROCEDURE DIVISION.
PERFORM B 9 TIMES
STOP RUN.
B.
ADD 1 TO I
set N to I
PERFORM C 9 TIMES
DISPLAY''.
C.
DISPLAY N" "NO ADVANCING
Add I TO N.
Not really a one-liner, but the shortest LINQ I can think of:
var r = Enumerable.Range(1, 9);
foreach (var z in r.Select(n => r.Select(m => n * m)).Select(a => a.Select(b => b.ToString("00 "))))
{
foreach (var q in z)
Console.Write(q);
Console.WriteLine();
}
In response to combining this and SRuly's answer:
Enumerable.Range(1,9).ToList().ForEach(n => { Enumerable.Range(1,9).ToList().ForEach(n2 => Console.Write((n * n2).ToString("00 "))); Console.WriteLine(); });
Ruby - 42 Chars (including one linebreak, interactive command line only)
This method is two lines of input and only works in irb (because irb gives us _), but shortens the previous method by a scant 2 characters.
1..9
_.map{|y|puts"%02d "*9%_.map{|x|x*y}}
Ruby - 44 Chars (tied with perl)
(a=1..9).map{|y|puts"%02d "*9%a.map{|x|x*y}}
Ruby - 46 Chars
9.times{|y|puts"%02d "*9%(1..9).map{|x|x*y+x}}
Ruby - 47 Chars
And back to a double loop
(1..9).map{|y|puts"%02d "*9%(1..9).map{|x|x*y}}
Ruby - 54 chars!
Using a single loop saves a couple of chars!
(9..89).map{|n|print"%02d "%(n/9*(x=n%9+1))+"\n"*(x/9)}
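The single-loop trick is worth spelling out: one counter n runs from 9 to 89, the row is n/9 (integer division) and the column is n%9+1. A Python sketch of that index arithmetic, written for readability rather than length, with a hypothetical function name:

```python
def table_single_loop(size=9):
    """Build the table with one counter instead of two nested loops."""
    pieces = []
    for n in range(size, size * size + size):
        row = n // size      # 1..size, constant for `size` consecutive values of n
        col = n % size + 1   # cycles through 1..size
        pieces.append("%02d" % (row * col))
        # End the line after the last column, otherwise separate with a space.
        pieces.append("\n" if col == size else " ")
    return "".join(pieces)


if __name__ == "__main__":
    print(table_single_loop(), end="")
```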
Ruby - 56 chars
9.times{|x|puts (1..9).map{|y|"%.2d"%(y+x*y)}.join(" ")}
Haskell — 85 84 79 chars
r=[1..9]
s x=['0'|x<=9]++show x
main=mapM putStrLn[unwords[s$x*y|x<-r]|y<-r]
If double spacing is required (89 81 chars),
r=[1..9]
s x=['0'|x<=9]++show x
main=mapM putStrLn['\n':unwords[s$x*y|x<-r]|y<-r]
F# - 61 chars:
for y=1 to 9 do(for x=1 to 9 do printf"%02d "(x*y));printfn""
If you prefer a more applicative/LINQ-y solution, then in 72 chars:
[1..9]|>Seq.iter(fun y->[1..9]|>Seq.iter((*)y>>printf"%02d ");printfn"")
c# - 125, 123 chars (2 lines):
var r=Enumerable.Range(1,9).ToList();
r.ForEach(n=>{var s="";r.ForEach(m=>s+=(n*m).ToString("00 "));Console.WriteLine(s);});
C - 97 79 characters
#define f(i){int i=0;while(i++<9)
main()f(x)f(y)printf("%.2d ",x*y);puts("");}}
Perl, 44 chars
(No hope of coming anywhere near J, but languages with matrix ops are in a class of their own here...)
for$n(1..9){printf"%3d"x9 .$/,map$n*$_,1..9}
R (very similar to Matlab on this level): 12 characters.
> 1:9%*%t(1:9)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 2 3 4 5 6 7 8 9
[2,] 2 4 6 8 10 12 14 16 18
[3,] 3 6 9 12 15 18 21 24 27
[4,] 4 8 12 16 20 24 28 32 36
[5,] 5 10 15 20 25 30 35 40 45
[6,] 6 12 18 24 30 36 42 48 54
[7,] 7 14 21 28 35 42 49 56 63
[8,] 8 16 24 32 40 48 56 64 72
[9,] 9 18 27 36 45 54 63 72 81
PHP, 71 chars
for($x=0;++$x<10;print"\n"){for($y=0;++$y<10;){printf("%02d ",$x*$y);}}
Output:
$ php -r 'for($x=0;++$x<10;print"\n"){for($y=0;++$y<10;){printf("%02d ",$x*$y);}}'
01 02 03 04 05 06 07 08 09
02 04 06 08 10 12 14 16 18
03 06 09 12 15 18 21 24 27
04 08 12 16 20 24 28 32 36
05 10 15 20 25 30 35 40 45
06 12 18 24 30 36 42 48 54
07 14 21 28 35 42 49 56 63
08 16 24 32 40 48 56 64 72
09 18 27 36 45 54 63 72 81
C#, 135 chars, nice and clean:
var rg = Enumerable.Range(1, 9);
foreach (var rc in from r in rg
from c in rg
select (r * c).ToString("D2") + (c == 9 ? "\n\n" : " "))
Console.Write(rc);
PostgreSQL: 81 74 chars
select array(select generate_series(1,9)*x)from generate_series(1,9)as x;
Ruby - 56 chars :D
9.times{|a|9.times{|b|print"%02d "%((a+1)*(b+1))};puts;}
C - 66 Chars
This resolves the complaint about the second parameter of main :)
main(x){for(x=8;x++<89;)printf("%.2d%c",x/9*(x%9+1),x%9<8?32:10);}
C - 77 chars
Based on dreamlax's 97 char answer. His current answer somewhat resembles this one now :)
Compiles ok with gcc, and main(x,y) is fair game for golf i reckon
#define f(i){for(i=0;i++<9;)
main(x,y)f(x)f(y)printf("%.2d ",x*y);puts("");}}
XQuery 1.0 (96 bytes)
string-join(for$x in 1 to 9 return(for$y in 1 to 9 return concat(0[$x*$y<10],$x*$y,' '),'
'),'')
Run (with XQSharp) with:
xquery table.xq !method=text
Scala - 77 59 58 chars
print(1 to 9 map(p=>1 to 9 map(q=>"%02d "format(p*q))mkString)mkString("\n"))
Sorry, I had to do this, the Scala solution by Malax was way too readable...
[Edit] For comprehension seems to be the better choice:
for(p<-1 to 9;q<-{println;1 to 9})print("%02d "format p*q)
[Edit] A much longer solution, but without multiplication, and much more obfuscated:
val s=(1 to 9).toSeq
(s:\s){(p,q)=>println(q.map("%02d "format _)mkString)
q zip(s)map(t=>t._1+t._2)}
PHP, 62 chars
for(;$x++<9;print"\n",$y=0)while($y++<9)printf("%02d ",$x*$y);
Java - 155 137 chars
Update 1: replaced string building by direct printing. Saved 18 chars.
class M{public static void main(String[]a){for(int x,y=0,z=10;++y<z;System.out.println())for(x=0;++x<z;System.out.printf("%02d ",x*y));}}
More readable format:
class M{
public static void main(String[]a){
for(int x,y=0,z=10;++y<z;System.out.println())
for(x=0;++x<z;System.out.printf("%02d ",x*y));
}
}
Another attempt using C#/Linq with GroupJoin:
Console.Write(
String.Join(
Environment.NewLine,
Enumerable.Range(1, 9)
.GroupJoin(Enumerable.Range(1, 9), y => 0, x => 0, (y, xx) => String.Join(" ", xx.Select(x => x * y)))
.ToArray()));
Ruby — 47 chars
puts (a=1..9).map{|i|a.map{|j|"%2d"%(j*i)}*" "}
Output
1 2 3 4 5 6 7 8 9
2 4 6 8 10 12 14 16 18
3 6 9 12 15 18 21 24 27
4 8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
6 12 18 24 30 36 42 48 54
7 14 21 28 35 42 49 56 63
8 16 24 32 40 48 56 64 72
9 18 27 36 45 54 63 72 81
(If we ignore spacing, it becomes 39: puts (a=1..9).map{|i|a.map{|j|j*i}*" "} And anyway, I feel like there's a bit of room for improvement with the wordy map stuff.)