Apache pig group by function is not giving expected output - csv
I have data in CSV format with the following columns:
"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"
The sample data is stored in User.csv. The file contains the following rows:
"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk"
When I load this file using PigStorage:
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;
The output looks like this:
("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")
I want to group by city, so I wrote:
grp = group user by $4;
dump grp;
I get this output:
( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
The company_name and address fields cause the problem because they contain ',' as part of their values, for example "14, Taylor St" in address or "Elliott, John W Esq" in company_name.
So $4 is treated as "Taylor St" and not as "St. Stephens Ward".
Because of the extra delimiter inside the address or company_name data, the fields are not separated properly and the GROUP BY does not give the correct result.
How can I get the GROUP BY output shown below?
("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")})
grp = group a by $5 ;
That won't be a solution for me. I already thought of it.
The problem is that PigStorage does not take quoting or escaping into account: it splits on every comma, so each comma inside a quoted value creates an extra column that should not exist.
Using CSVExcelStorage (shipped with Piggybank) will solve this, because that storage handles quoted fields and therefore produces the right number and order of columns.
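The difference between a plain delimiter split and a quote-aware parse can be reproduced outside Pig. The following is only an illustrative Python sketch, not Pig code and not how either storage class is implemented; the helper names `naive_split` and `quote_aware_rows` are made up for this example. It uses two rows from the question and shows why field index 4 lands on the wrong value when quotes are ignored.

```python
import csv
import io

# Two sample rows from User.csv; quoted fields may contain commas.
SAMPLE = (
    '"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward",'
    '"Buckinghamshire","HP11 2AX","01937-864715","01714-737668",'
    '"evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk"\n'
    '"France","Andrade","Elliott, John W Esq","8 Moor Place",'
    '"East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222",'
    '"01935-821636","france.andrade#hotmail.com",'
    '"http://www.elliottjohnwesq.co.uk"\n'
)


def naive_split(line):
    """Split on every comma and ignore quotes (hypothetical helper)."""
    return line.split(",")


def quote_aware_rows(text):
    """Parse with the csv module, which keeps commas inside quoted fields."""
    return list(csv.reader(io.StringIO(text)))


first_line = SAMPLE.splitlines()[0]
naive_fields = naive_split(first_line)
rows = quote_aware_rows(SAMPLE)

# The naive split yields one column too many, so index 4 is a fragment
# of the address; the quote-aware parse yields 11 columns with the city at 4.
print(len(naive_fields), repr(naive_fields[4]))
print(len(rows[0]), repr(rows[0][4]))
```

The fragment at index 4 of the naive split is the same kind of broken group key that appears in the DUMP output in the question.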
Related
Split New Line - MS Access
I would appreciate any help with this problem. In MS Access I'd like to split the values of one field (Main Address) into 2 separate fields (Address 1 and Address 2), where Address 1 gets the first line and Address 2 gets the second and any further lines.
ex #1
Main Address | Address 1 | Address 2
----------------------------------------
1 Main Road | 1 Main Road | San Jose CA
San Jose CA
ex #2
Main Address | Address 1 | Address 2
----------------------------------------
1 Main Road | 1 Main Road | San Jose CA Drop at Front
San Jose CA
Drop at Front
Thanks all! I hope the representation of the samples makes sense. If not, let me know if you have questions and I'll clarify.
Does the [Main Address] data have Cr and Lf characters to force new lines? If it doesn't, what you want is virtually impossible. If it does, use these expressions in a query or textbox.
For Address 1 (everything up to the first Cr):
Replace(Left([Main Address] & "", Instr([Main Address] & Chr(13), Chr(13))), Chr(13), "")
For Address 2 (everything after it, with the remaining line breaks turned into spaces):
Trim(Replace(Mid([Main Address] & "", Instr([Main Address] & Chr(13), Chr(13))), Chr(13) & Chr(10), " "))
Replace just one line start of string in sql
I'm new to SQL and couldn't find out how to change just the first line in a cell. This is the value of the cell:
[B]Ynt: Hello I'm Jack[/B]

2 lines
3 lines
4 lines
I want to change it to:
2 lines
3 lines
4 lines
Could you please help me with the query? Every first line begins with [B]Ynt: and ends with [/B]. There is one blank line after the first line.
UPDATE xf_post SET message = REPLACE(message, 'Ynt:%', '');
I want to delete just the first line in each cell that begins with "Ynt:".
Try this, but on a sample first. You can adjust the +3 to suit your needs. If anything is unclear, let me know.
UPDATE xf_post SET message = REPLACE(message, SUBSTRING(message, 1, POSITION('[/b]' IN message) + 3), '') WHERE message LIKE "[b]Ynt:%";
I replaced message with your given text and it gives your desired result. Check the query below (the sample text is stored once in a user variable instead of being repeated three times):
SET @msg = "[b]Ynt: 80'li yıllarda çocuk olmak..[/b] Yeğenim henüz dört yaşında.. [b][SIZE=16px]1990 lı olmakta böyleydi işte....[/SIZE][/b] 1980li yıllarda hayatının ilk tecrübelerini yaşamış, ilkokula gitmiş, Kenan Evren´i, Erdal İnönü´yü, Özal'ı tanımış olmak, Ajda Pekkan´ın Alo, Michael Jackson´ın Pepsi reklamlarını hatırlayacak kadar şanslı olmak demek.. [b]Türkiye'de yaşamış son mutlu kuşak olduğunu hüzünle hissetmek demek.. [/b] [b]Katılıyorum. 1990 lardada öyle[/b]";
SELECT REPLACE(@msg, SUBSTRING(@msg, 1, POSITION('[/b]' IN @msg) + 3), '');
insert xml data in mysql table
I have this type of XML.
<Players>
  <TeamA name="Kings XI Punjab" Id="1107">
    <Player1 Id="270">Virender Sehwag </Player1>
    <Player2 Id="10114">Mandeep Singh </Player2>
    <Player3 Id="10085">Glenn Maxwell </Player3>
    <Player4 Id="5313">David Miller </Player4>
    <Player5 Id="4961">George Bailey (C) </Player5>
    <Player6 Id="4508">Wriddhiman Saha (W)</Player6>
    <Player7 Id="62576">Akshar Patel </Player7>
    <Player8 Id="3736">Mitchell Johnson </Player8>
    <Player9 Id="4610">Rishi Dhawan </Player9>
    <Player10 Id="4997">Parvinder Awana </Player10>
    <Player11 Id="10116">Sandeep Sharma </Player11>
  </TeamA>
  <TeamB name="Kolkata Knight Riders" Id="1106">
    <Player1 Id="3723">Robin Uthappa (W)</Player1>
    <Player2 Id="3478">Gautam Gambhir (C) </Player2>
    <Player3 Id="4276">Manish Pandey </Player3>
    <Player4 Id="141">Jacques Kallis </Player4>
    <Player5 Id="11803">Suryakumar Yadav </Player5>
    <Player6 Id="3724">Yusuf Pathan </Player6>
    <Player7 Id="3766">Ryan ten Doeschate </Player7>
    <Player8 Id="3729">Piyush Chawla </Player8>
    <Player9 Id="11229">Sunil Narine </Player9>
    <Player10 Id="3874">Morne Morkel </Player10>
    <Player11 Id="5221">Umesh Yadav </Player11>
  </TeamB>
</Players>
I want this type of insertion into the MySQL database:
Team Player TeamId PlayerId
Kings XI Punjab Virender Sehwag 1107 270
Kolkata Knight Riders Robin Uthappa 1106 3723
and so on. Every entry for player and team, with both ids, should be in this format. How can I do this?
TeamA Id Player1 Player2 Player3
Kings XI Punjab 5313 Virender Sehwag Mandeep Singh Glenn Maxwell
I am getting this type of entry in my db. How do I get the player id and name of all players in rows instead of columns? I am unable to do this with:
LOAD XML LOCAL INFILE '/pathtofile/file.xml' INTO TABLE my_tablename ROWS IDENTIFIED BY '<tagname>';
Please give me a proper solution. Thanks in advance.
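One possible approach, sketched here in Python rather than with LOAD XML, is to flatten the XML into one (Team, Player, TeamId, PlayerId) tuple per player before inserting. This is a minimal sketch under assumptions: the function name `player_rows` is hypothetical, the sample XML is trimmed to a few players, and the "(C)" and "(W)" suffixes are assumed to be role markers that should be dropped, since the desired output shows "Robin Uthappa" without "(W)".

```python
import re
import xml.etree.ElementTree as ET

# Trimmed sample with the same shape as the XML in the question.
XML = """<Players>
  <TeamA name="Kings XI Punjab" Id="1107">
    <Player1 Id="270">Virender Sehwag </Player1>
    <Player6 Id="4508">Wriddhiman Saha (W)</Player6>
  </TeamA>
  <TeamB name="Kolkata Knight Riders" Id="1106">
    <Player1 Id="3723">Robin Uthappa (W)</Player1>
    <Player2 Id="3478">Gautam Gambhir (C) </Player2>
  </TeamB>
</Players>"""

# Assumption: a trailing "(C)" or "(W)" is a role marker, not part of the name.
ROLE_SUFFIX = re.compile(r"\s*\((?:C|W)\)\s*$")


def player_rows(xml_text):
    """Return one (team, player, team_id, player_id) tuple per player element."""
    root = ET.fromstring(xml_text)
    rows = []
    for team in root:          # <TeamA>, <TeamB>, ...
        for player in team:    # <Player1>, <Player2>, ...
            name = ROLE_SUFFIX.sub("", (player.text or "").strip())
            rows.append(
                (team.get("name"), name, int(team.get("Id")), int(player.get("Id")))
            )
    return rows


rows = player_rows(XML)
for row in rows:
    print(row)
```

Tuples in this shape could then be passed to a parameterized INSERT through a Python DB-API cursor's `executemany`; the target table and its column names would depend on your schema.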
Transpose clumps of cell on openoffice calc
Is there a function that can help turn this:
A B C D E
asa fafa ada sawewf
wefw ff rwf fw
rww er rr 23
into this:
A B
asa fafa
wefw ff
rww er
ada sawewf
rwf fw
rr 23
in another worksheet, preferably?
It has been solved. The .ods file posted here: http://forum.openoffice.org/en/forum/viewtopic.php?f=9&t=62453 works as I requested.
How to match and assign data the pythonic way?
I have a list (mysql table) of People and their titles as shown in the table below. I also have a list of titles and their categories. How do I assign their categories to the person? The problem arises when there are multiple titles for a person. What is the pythonic way of mapping the title to the category and assigning it to the person?
People Table
Name Title
--------------------
John D CEO, COO, CTO
Mary J COO, MD
Tim C Dev Ops, Director
Title Category table
Title Executive IT Other
-----------------------------
CEO 1
COO 1
CTO 1 1
MD 1
Dev Ops 1
Director 1
Desired output:
Name Title Executive IT Other
---------------------------------------------
John D CEO, COO, CTO 1 1
Mary J COO, MD 1
Tim C Dev Ops, Director 1 1
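from functools import reduce  # reduce is not a builtin in Python 3

# Each person with a tuple of their titles.
name_title = (("John D", ("CEO", "COO", "CTO")),
              ("Mary J", ("COO", "MD")),
              ("Tim C", ("Dev Ops", "Director")))
# Each title with the set of categories it belongs to.
title_cat = {"CEO": set(["Executive"]),
             "COO": set(["Executive"]),
             "CTO": set(["Executive"]),
             "MD": set(["Executive"]),
             "Dev Ops": set(["IT"]),
             "Director": set(["Other"])}
# Union the category sets of all of a person's titles.
name_cat = [(name, reduce(lambda x, y: x | y, [title_cat[title] for title in titles]))
            for name, titles in name_title]
It would be nice if there was a union which behaved like sum on sets.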
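On that last wish: `set.union` accepts any number of iterables, so `set().union(*sets)` behaves much like a sum over sets and needs no `reduce`. A minimal sketch using the same data as above; the helper name `categories_for` is made up for this example.

```python
name_title = (("John D", ("CEO", "COO", "CTO")),
              ("Mary J", ("COO", "MD")),
              ("Tim C", ("Dev Ops", "Director")))
title_cat = {"CEO": {"Executive"}, "COO": {"Executive"}, "CTO": {"Executive"},
             "MD": {"Executive"}, "Dev Ops": {"IT"}, "Director": {"Other"}}


def categories_for(titles):
    """Union the category sets of all given titles (empty set if no titles)."""
    return set().union(*(title_cat[title] for title in titles))


name_cat = [(name, categories_for(titles)) for name, titles in name_title]
for name, cats in name_cat:
    print(name, sorted(cats))
```

Unlike `reduce` without an initial value, this also works when a person has no titles, because the union starts from an empty set.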
people=['john','Mary','Tim']
Title=[['CEO','COO','CTO'],['COO','MD'],['DevOps','Director']]
title_des={'CEO':'Executive','COO':'Executive','CTO':'Executive',
           'MD':'Executive','DevOps':'IT','Director':'Others'
}
people_des={}
for i,x in enumerate(people):
    people_des[x]={}
    for y in Title[i]:
        if title_des[y] not in people_des[x]:
            people_des[x][title_des[y]]=[y]
        else:
            people_des[x][title_des[y]].append(y)
print(people_des)
output:
{'Tim': {'IT': ['DevOps'], 'Others': ['Director']}, 'john': {'Executive': ['CEO', 'COO', 'CTO']}, 'Mary': {'Executive': ['COO', 'MD']}}
Start by arranging your input data in a dictionary-of-lists form:
>>> name_to_titles = {
        'John D': ['CEO', 'COO', 'CTO'],
        'Mary J': ['COO', 'MD'],
        'Tim C': ['Dev Ops', 'Director']
    }
Then loop over the input dictionary to create the reverse mapping:
>>> title_to_names = {}
>>> for name, titles in name_to_titles.items():
        for title in titles:
            title_to_names.setdefault(title, []).append(name)
>>> import pprint
>>> pprint.pprint(title_to_names)
{'CEO': ['John D'],
 'COO': ['John D', 'Mary J'],
 'CTO': ['John D'],
 'Dev Ops': ['Tim C'],
 'Director': ['Tim C'],
 'MD': ['Mary J']}
I propose this if you mean you have the string:
s = '''Name Title
--------------------
John D CEO, COO, CTO
Mary J COO, MD
Tim C Dev Ops, Director

Title Executive IT Other
-----------------------------
CEO 1
COO 1
CTO 1
MD 1
Dev Ops 1
Director 1
'''
lines = s.split('\n')
it = iter(lines)
# Skip ahead to the header of the people table.
for line in it:
    if line.startswith('Name'):
        break
next(it)  # '--------------------'
# Read rows until the blank line that ends the people table.
for line in it:
    if not line:
        break
    split = line.split()
    titles = split[2:]
    name = split[:2]
    print(' '.join(name), titles)
# John D ['CEO,', 'COO,', 'CTO']
# Mary J ['COO,', 'MD']
# Tim C ['Dev', 'Ops,', 'Director']