Apache Pig GROUP BY is not giving the expected output - CSV

I have data in CSV format with the following columns:
"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"
The sample data is stored in User.csv, which contains the rows below.
"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk"
When I load it using PigStorage:
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;
The output is:
("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")
I want to group by city, so I wrote:
grp = group user by $4;
dump grp;
The output I get is:
( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
The company_name and address fields cause the problem because they contain ',' as part of the value, for example "14, Taylor St" in address or "Elliott, John W Esq" in company_name.
So $4 resolves to "Taylor St" and not to "St. Stephens Ward".
Because of the extra delimiter in the address or company_name data, the fields are not separated properly when loaded, and the group by does not give the correct result.
How can I get the group by output shown below?
("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")})
grp = group a by $5 ;
That won't be a solution for me. I already thought of it.

The problem is that PigStorage does not take quoting into account: it splits on every comma, so an entry that contains a comma inside the quotes is broken into extra columns.
Using CSVExcelStorage (org.apache.pig.piggybank.storage.CSVExcelStorage, shipped in piggybank) will solve this, as that storage handles quoted fields and so produces the right number and order of columns.
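To see the difference outside Pig, here is a minimal Python sketch. It is not Pig code, only an analogy for what a quote-aware loader does compared with a plain delimiter split. It uses Python's standard csv module on the first six fields of the question's sample rows; the helper names naive_city and group_by_city are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# First six fields of the three sample rows from the question.
SAMPLE = (
    '"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent"\n'
    '"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire"\n'
    '"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth"\n'
)


def naive_city(line):
    """Split on every comma, ignoring quotes, and take field 4."""
    return line.split(",")[4]


def group_by_city(text):
    """Parse with a quote-aware CSV reader, then group rows by field 4."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        groups[row[4]].append(row)
    return dict(groups)


naive_keys = [naive_city(line) for line in SAMPLE.splitlines()]
groups = group_by_city(SAMPLE)

print(naive_keys)      # keys shifted by the embedded commas
print(sorted(groups))  # the actual city values
```

The naive split lands on a fragment of the address or on the address itself, which is the same shift seen in the Pig output, while the quote-aware reader keeps "5, Binney St" as one field.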

Related

Split New Line - MS Access

I would appreciate any help with this problem in MS Access.
I'd like to split the values of one field (Main Address) into two separate fields (Address 1 and Address 2), where Address 1 gets the first line and Address 2 gets the second and any remaining lines.
ex #1
Main Address | Address 1 | Address 2
----------------------------------------
1 Main Road | 1 Main Road | San Jose CA
San Jose CA
ex #2
Main Address | Address 1 | Address 2
----------------------------------------
1 Main Road | 1 Main Road | San Jose CA Drop at Front
San Jose CA
Drop at Front
Thanks all!
I hope the representation of the samples makes sense; if not, let me know and I'll clarify. TA
Does the [Main Address] data have Cr and Lf characters to force new lines? If it doesn't, what you want is virtually impossible. If it does, use these expressions in a query or textbox, the first for Address 1 and the second for Address 2:
Replace(Left([Main Address] & "", Instr([Main Address] & Chr(13), Chr(13))), Chr(13), "")
Trim(Replace(Mid([Main Address] & "", Instr([Main Address] & Chr(13), Chr(13))), Chr(13) & Chr(10), " "))
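For comparison, the same split can be sketched in Python, assuming the address lines are separated by CR/LF as the expressions above require. The function name split_address is hypothetical; this only illustrates the logic and is not Access code.

```python
def split_address(main_address):
    """Return (first line, remaining lines joined by single spaces)."""
    if not main_address:
        # Mirrors the [Main Address] & "" guard against Null values.
        return "", ""
    lines = main_address.splitlines()
    first = lines[0].strip()
    rest = " ".join(part.strip() for part in lines[1:] if part.strip())
    return first, rest


print(split_address("1 Main Road\r\nSan Jose CA"))
print(split_address("1 Main Road\r\nSan Jose CA\r\nDrop at Front"))
```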

Replace just one line start of string in sql

I'm new to SQL and couldn't find out how to change just the first line in a cell.
This is the value of the cell:
[B]Ynt: Hello I'm Jack[/B]
2 lines
3 lines
4 lines
I want to change it to
2 lines
3 lines
4 lines
Could you please help me with the query? Every first line begins with [B]Ynt: and ends with [/B]. There is one blank line after the first line.
UPDATE xf_post SET message = REPLACE(message, 'Ynt:%', '');
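I want to delete only the first line in cells that begin with "Ynt:".
Try this, but on sample data first.
POSITION returns the index where '[/b]' starts, so +3 extends the substring to the end of that four-character tag; adjust the +3 as you need.
If anything is unclear, let me know.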
UPDATE xf_post SET
message = REPLACE(message,SUBSTRING(message,1,POSITION( '[/b]' IN message)+3) , '')
where message like "[b]Ynt:%"
I replaced message with your sample text and it gives the result you want.
Check the query below.
select REPLACE("[b]Ynt: 80'li yıllarda çocuk olmak..[/b] Yeğenim henüz dört yaşında.. [b][SIZE=16px]1990 lı olmakta böyleydi işte....[/SIZE][/b] 1980li yıllarda hayatının ilk tecrübelerini yaşamış, ilkokula gitmiş, Kenan Evren´i, Erdal İnönü´yü, Özal'ı tanımış olmak, Ajda Pekkan´ın Alo, Michael Jackson´ın Pepsi reklamlarını hatırlayacak kadar şanslı olmak demek.. [b]Türkiye'de yaşamış son mutlu kuşak olduğunu hüzünle hissetmek demek.. [/b] [b]Katılıyorum. 1990 lardada öyle[/b]",SUBSTRING("[b]Ynt: 80'li yıllarda çocuk olmak..[/b] Yeğenim henüz dört yaşında.. [b][SIZE=16px]1990 lı olmakta böyleydi işte....[/SIZE][/b] 1980li yıllarda hayatının ilk tecrübelerini yaşamış, ilkokula gitmiş, Kenan Evren´i, Erdal İnönü´yü, Özal'ı tanımış olmak, Ajda Pekkan´ın Alo, Michael Jackson´ın Pepsi reklamlarını hatırlayacak kadar şanslı olmak demek.. [b]Türkiye'de yaşamış son mutlu kuşak olduğunu hüzünle hissetmek demek.. [/b] [b]Katılıyorum. 1990 lardada öyle[/b]",1,POSITION( '[/b]' IN "[b]Ynt: 80'li yıllarda çocuk olmak..[/b] Yeğenim henüz dört yaşında.. [b][SIZE=16px]1990 lı olmakta böyleydi işte....[/SIZE][/b] 1980li yıllarda hayatının ilk tecrübelerini yaşamış, ilkokula gitmiş, Kenan Evren´i, Erdal İnönü´yü, Özal'ı tanımış olmak, Ajda Pekkan´ın Alo, Michael Jackson´ın Pepsi reklamlarını hatırlayacak kadar şanslı olmak demek.. [b]Türkiye'de yaşamış son mutlu kuşak olduğunu hüzünle hissetmek demek.. [/b] [b]Katılıyorum. 1990 lardada öyle[/b]")+3) , '')
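The same idea can be sketched in Python to check the logic on sample text before running the UPDATE. This is only an illustration, not the SQL itself: the function name strip_ynt_header is hypothetical, and unlike REPLACE (which removes every occurrence of the extracted substring) it removes only the leading header plus the blank line after it.

```python
import re

# Leading "[b]Ynt: ...[/b]" header and any whitespace after it, case-insensitive.
HEADER = re.compile(r"\[b\]Ynt:.*?\[/b\]\s*", re.IGNORECASE | re.DOTALL)


def strip_ynt_header(message):
    """Drop the first '[b]Ynt: ...[/b]' line if the message starts with one."""
    match = HEADER.match(message)
    if match is None:
        return message  # no header, leave the message untouched
    return message[match.end():]


sample = "[B]Ynt: Hello I'm Jack[/B]\n\n2 lines\n3 lines\n4 lines"
print(strip_ynt_header(sample))
```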

insert xml data in mysql table

I have this type of XML.
<Players>
<TeamA name="Kings XI Punjab" Id="1107">
<Player1 Id="270">Virender Sehwag </Player1>
<Player2 Id="10114">Mandeep Singh </Player2>
<Player3 Id="10085">Glenn Maxwell </Player3>
<Player4 Id="5313">David Miller </Player4>
<Player5 Id="4961">George Bailey (C) </Player5>
<Player6 Id="4508">Wriddhiman Saha (W)</Player6>
<Player7 Id="62576">Akshar Patel </Player7>
<Player8 Id="3736">Mitchell Johnson </Player8>
<Player9 Id="4610">Rishi Dhawan </Player9>
<Player10 Id="4997">Parvinder Awana </Player10>
<Player11 Id="10116">Sandeep Sharma </Player11>
</TeamA>
<TeamB name="Kolkata Knight Riders" Id="1106">
<Player1 Id="3723">Robin Uthappa (W)</Player1>
<Player2 Id="3478">Gautam Gambhir (C) </Player2>
<Player3 Id="4276">Manish Pandey </Player3>
<Player4 Id="141">Jacques Kallis </Player4>
<Player5 Id="11803">Suryakumar Yadav </Player5>
<Player6 Id="3724">Yusuf Pathan </Player6>
<Player7 Id="3766">Ryan ten Doeschate </Player7>
<Player8 Id="3729">Piyush Chawla </Player8>
<Player9 Id="11229">Sunil Narine </Player9>
<Player10 Id="3874">Morne Morkel </Player10>
<Player11 Id="5221">Umesh Yadav </Player11>
</TeamB>
</Players>
I want this type of insertion into a MySQL database:
Team | Player | TeamId | PlayerId
----------------------------------------
Kings XI Punjab | Virender Sehwag | 1107 | 270
Kolkata Knight Riders | Robin Uthappa | 1106 | 3723
and so on. Every player and team entry, with both IDs, should be in this format.
How can I do this?
TeamA | Id | Player1 | Player2 | Player3
----------------------------------------
Kings XI Punjab | 5313 | Virender Sehwag | Mandeep Singh | Glenn Maxwell
This is the type of entry I am getting in my DB. How do I get the ID and name of each player in its own row instead of in columns?
I am unable to do this with
LOAD XML LOCAL INFILE '/pathtofile/file.xml' INTO TABLE my_tablename ROWS IDENTIFIED BY '<tagname>';
Please suggest a proper solution.
Thanks in advance.
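One possible approach, sketched here in Python rather than with LOAD XML, is to flatten the XML into one row per player before inserting. This is a minimal sketch under assumptions: the function name flatten_players is hypothetical, the sample is a shortened copy of the XML above, and the trailing (C)/(W) markers are stripped because the desired output shows names without them.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question.
SAMPLE_XML = """<Players>
<TeamA name="Kings XI Punjab" Id="1107">
<Player1 Id="270">Virender Sehwag </Player1>
<Player5 Id="4961">George Bailey (C) </Player5>
</TeamA>
<TeamB name="Kolkata Knight Riders" Id="1106">
<Player1 Id="3723">Robin Uthappa (W)</Player1>
<Player2 Id="3478">Gautam Gambhir (C) </Player2>
</TeamB>
</Players>"""


def flatten_players(xml_text):
    """Return (team, player, team_id, player_id) tuples, one per player."""
    rows = []
    root = ET.fromstring(xml_text)
    for team in root:        # TeamA, TeamB, ...
        for player in team:  # Player1, Player2, ...
            # Remove a trailing "(C)" or "(W)" marker and surrounding spaces.
            name = re.sub(r"\s*\([A-Z]\)\s*$", "", player.text or "").strip()
            rows.append((team.get("name"), name,
                         int(team.get("Id")), int(player.get("Id"))))
    return rows


rows = flatten_players(SAMPLE_XML)
for row in rows:
    print(row)
```

The resulting tuples could then be passed to a parameterized INSERT through whichever MySQL driver is in use.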

Transpose clumps of cell on openoffice calc

Is there a function that can help turn this:
A B C D E
asa fafa ada sawewf
wefw ff rwf fw
rww er rr 23
into this:
A B
asa fafa
wefw ff
rww er
ada sawewf
rwf fw
rr 23
in another worksheet preferably?
This has been solved. The .ods file posted here: http://forum.openoffice.org/en/forum/viewtopic.php?f=9&t=62453 does what I asked for.

How to match and assign data the pythonic way?

I have a list (mysql table) of People and their titles as shown in the table below. I also have a list of titles and their categories. How do I assign their categories to the person? The problem arises when there are multiple titles for a person. What is the pythonic way of mapping the title to the category and assigning it to the person?
People Table
Name Title
--------------------
John D CEO, COO, CTO
Mary J COO, MD
Tim C Dev Ops, Director
Title Category table
Title Executive IT Other
-----------------------------
CEO 1
COO 1
CTO 1 1
MD 1
Dev Ops 1
Director 1
Desired output :
Name Title Executive IT Other
---------------------------------------------
John D CEO, COO, CTO 1 1
Mary J COO, MD 1
Tim C Dev Ops, Director 1 1
from functools import reduce  # reduce is not a builtin in Python 3

name_title = (("John D", ("CEO", "COO", "CTO")),
              ("Mary J", ("COO", "MD")),
              ("Tim C", ("Dev Ops", "Director")))
title_cat = {"CEO": set(["Executive"]),
             "COO": set(["Executive"]),
             "CTO": set(["Executive"]),
             "MD": set(["Executive"]),
             "Dev Ops": set(["IT"]),
             "Director": set(["Other"])}
# For each person, union the category sets of all their titles.
name_cat = [(name, reduce(lambda x, y: x | y, [title_cat[title] for title in titles]))
            for name, titles in name_title]
It would be nice if there were a union that behaved like sum on sets.
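Something close to that does exist: set().union() accepts any number of iterables. A minimal sketch, reusing the title-to-category mapping from the snippet above (where CTO counts only as Executive, which is that snippet's assumption) and adding a hypothetical CATEGORIES tuple to fix the column order:

```python
name_title = (("John D", ("CEO", "COO", "CTO")),
              ("Mary J", ("COO", "MD")),
              ("Tim C", ("Dev Ops", "Director")))
title_cat = {"CEO": {"Executive"}, "COO": {"Executive"}, "CTO": {"Executive"},
             "MD": {"Executive"}, "Dev Ops": {"IT"}, "Director": {"Other"}}
CATEGORIES = ("Executive", "IT", "Other")  # assumed column order


def categories_for(titles):
    """Union the category sets of all titles; empty input gives an empty set."""
    return set().union(*(title_cat[title] for title in titles))


name_cat = {name: categories_for(titles) for name, titles in name_title}

# Print one row per person with a 1 under each matching category.
for name, titles in name_title:
    flags = ["1" if cat in name_cat[name] else "" for cat in CATEGORIES]
    print(name, ", ".join(titles), *flags, sep=" | ")
```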
people=['john','Mary','Tim']
Title=[['CEO','COO','CTO'],['COO','MD'],['DevOps','Director']]
title_des={'CEO':'Executive','COO':'Executive','CTO':'Executive',
           'MD':'Executive','DevOps':'IT','Director':'Others'
           }
people_des={}
for i,x in enumerate(people):
    people_des[x]={}
    for y in Title[i]:
        if title_des[y] not in people_des[x]:
            people_des[x][title_des[y]]=[y]
        else:
            people_des[x][title_des[y]].append(y)
print(people_des)
output:
{'Tim': {'IT': ['DevOps'], 'Others': ['Director']}, 'john': {'Executive': ['CEO', 'COO', 'CTO']}, 'Mary': {'Executive': ['COO', 'MD']}}
Start by arranging your input data in a dictionary-of-lists form:
>>> name_to_titles = {
...     'John D': ['CEO', 'COO', 'CTO'],
...     'Mary J': ['COO', 'MD'],
...     'Tim C': ['Dev Ops', 'Director']
... }
Then loop over the input dictionary to create the reverse mapping:
>>> title_to_names = {}
>>> for name, titles in name_to_titles.items():
...     for title in titles:
...         title_to_names.setdefault(title, []).append(name)
...
>>> import pprint
>>> pprint.pprint(title_to_names)
{'CEO': ['John D'],
 'COO': ['John D', 'Mary J'],
 'CTO': ['John D'],
 'Dev Ops': ['Tim C'],
 'Director': ['Tim C'],
 'MD': ['Mary J']}
I propose this if you mean you have the data as a string:
s = '''Name Title
--------------------
John D CEO, COO, CTO
Mary J COO, MD
Tim C Dev Ops, Director

Title Executive IT Other
-----------------------------
CEO 1
COO 1
CTO 1
MD 1
Dev Ops 1
Director 1
'''
lines = s.split('\n')
it = iter(lines)
# Skip ahead to the header line of the people table.
for line in it:
    if line.startswith('Name'):
        break
next(it)  # '--------------------'
# Read people rows until the blank line that ends the table.
for line in it:
    if not line:
        break
    split = line.split()
    titles = split[2:]
    name = split[:2]  # assumes every name is exactly two words
    print(' '.join(name), titles)
# John D ['CEO,', 'COO,', 'CTO']
# Mary J ['COO,', 'MD']
# Tim C ['Dev', 'Ops,', 'Director']