Pig: count occurrences of strings in text messages - csv

I've got two files, venues.csv and tweets.csv. For each of the venues, I want to count the number of times it occurs in the tweet messages from the tweets file.
I've imported the csv files into HCatalog.
What I managed to do so far:
I know how to filter the text field and get the tuples whose tweet messages contain 'Shell'. I want to do the same, but not with a hard-coded 'Shell'; rather, for each name from the venuesNames bag. How can I do that? And then, how can I use GENERATE properly to produce a new bag that pairs the counts with the venue names?
a = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
b = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();
venuesNames = foreach a generate name;
countX = FILTER b BY (text matches '.*Shell.*');
grouped = GROUP countX ALL;
venueToCount = FOREACH grouped GENERATE 'Shell' AS venue, COUNT(countX) AS countVenues;
DUMP venueToCount;
The files that I'm using are:
tweets.csv
created_at,text,location
Sat Nov 03 13:31:07 +0000 2012, Sugar rush dfsudfhsu, Glasgow
Sat Nov 03 13:31:07 +0000 2012, Sugar rush ;dfsosjfd HAHAHHAHA, London
Sat Apr 25 04:08:47 +0000 2009, at Sugar rush dfjiushfudshf, Glasgow
Thu Feb 07 21:32:21 +0000 2013, Shell gggg, Glasgow
Tue Oct 30 17:34:41 +0000 2012, Shell dsiodshfdsf, Edinburgh
Sun Mar 03 14:37:14 +0000 2013, Shell wowowoo, Glasgow
Mon Jun 18 07:57:23 +0000 2012, Shell dsfdsfds, Glasgow
Tue Jun 25 16:52:33 +0000 2013, Shell dsfdsfdsfdsf, Glasgow
venues.csv
city,name
Glasgow, Sugar rush
Glasgow, ABC
Glasgow, University of Glasgow
Edinburgh, Shell
London, Big Ben
I know that these are basic questions but I'm just getting started with Pig and any help will be appreciated!

I presume that your list of venue names is unique. If not, then you have more problems anyway because you will need to disambiguate which venue is being talked about (perhaps by reference to the city fields). But disregarding that potential complication, here is what you can do:
You have described a fuzzy join. In Pig, if there is no way to coerce your records to contain standard values (and in this case, there isn't without resorting to a UDF), you need to use the CROSS operator. Use this with caution because if you cross two relations with M and N records, the result will be a relation with M*N records, which might be more than your system can handle.
The general strategy is 1) CROSS the two relations, 2) Create a custom regex for each record*, and 3) Filter those that pass the regex.
venues = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();
/* Create the Cartesian product of venues and tweets */
crossed = CROSS venues, tweets;
/* For each record, create a regex like '.*name.*' */
regexes = FOREACH crossed GENERATE *, CONCAT('.*', CONCAT(venues::name, '.*')) AS regex;
/* Keep tweet-venue pairs where the tweet contains the venue name */
venueMentions = FILTER regexes BY text MATCHES regex;
venueCounts = FOREACH (GROUP venueMentions BY venues::name) GENERATE group, COUNT($1);
The sum of all venueCounts might be more than the number of tweets, if some tweets mention multiple venues.
*Note that you have to be a little careful with this technique, because if the venue name contains characters that have special interpretations in Java regular expressions, you'll need to escape them.
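In a Pig UDF you could lean on Java's Pattern.quote for that. To make the pitfall concrete, here is a small illustration in Python, where re.escape plays the same role (the venue name is made up):
# Illustration of the escaping pitfall: a venue name containing regex
# metacharacters silently breaks the naive '.*name.*' pattern.
import re

tweet = "Lunch at Sugar Rush (West End) today"
venue = "Sugar Rush (West End)"  # hypothetical name with parentheses

naive = ".*" + venue + ".*"            # '(West End)' is parsed as a capture group
safe = ".*" + re.escape(venue) + ".*"  # metacharacters are matched literally

print(re.fullmatch(naive, tweet) is not None)  # False: the literal parens never match
print(re.fullmatch(safe, tweet) is not None)   # True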

Related

Cut a line in multiple parts delimited by patterns, read and re-write .json files

This one's difficult and I haven't found any answers after hours of searching; I hope you can help me. I'm not a native English speaker, so I apologize in advance.
I arrived at a company last week and am working with .json files which are all stored in directories, one per company.
e.g.
d/Company1/comport/enregistrement_sessionhash1
enregistrement_sessionhash2
enregistrement_sessionhash3
d/Company2/comport/enregistrement_sessionhashX
d/Company3/comport/enregistrement_sessionhashY...
Each of them can contain [0-n] characters.
We use these files to calculate data.
The person before me didn't think to organize them by /year/month, so it takes a lot of time when we run algorithms on a specific month's data: everything in the directory gets read, and files have been stored every 10 seconds per website-company and website-user for approximately 2 years.
Sadly, we can't use the filesystem's creation/modification times, only the text information inside the .json files, since there was a server problem and my coworkers had to re-paste the files, resetting the creation times.
Here is a template of the .json files
BEGIN OF FILE
{"session":"session_hash","enregistrements":[{"session":"session_hash",[...]{"data2":"xxx"}],"timedate_saved":"27 04 2020 12:39:21"},{"session":"session_hash",[...],"timedate_saved":"17 06 2020 11:01:08"},{"data1":"session_hash"[...],{"data2":"xxx"}],"timedate_saved":"27 04 2020 18:01:14"}]}
END OF FILE
Within one file there can't be a different "session" value. This value is a hash, also used in the filename, e.g. d/Company1/comport/enregistrement_session_hash.
I would like to read the files and cut out every "enregistrements" sub-array (starting with [{"session"... and ending with "timedate_saved":"01 01 1970 00:00:00"}]}). I then want the cut-out text written to files with the same filename (session_hash), stored as company/comport/year/month/enregistrement_sessionhash, where year and month come from the "timedate_saved" data. And of course I need to be able to reuse these files afterwards, so they must still parse as .json.
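To make the goal concrete, here is a minimal Python sketch of what I am after, assuming each file parses as JSON with a top-level "session" hash and an "enregistrements" list whose entries carry "timedate_saved" as "dd mm yyyy HH:MM:SS" (the template above is abridged, so treat the exact structure as an assumption):
# Minimal sketch: split one file's "enregistrements" by year/month of
# "timedate_saved" and rewrite them under company/comport/year/month/.
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path

def split_by_month(src: Path) -> None:
    data = json.loads(src.read_text(encoding="utf-8"))
    session = data["session"]
    by_month = defaultdict(list)
    for record in data["enregistrements"]:
        ts = datetime.strptime(record["timedate_saved"], "%d %m %Y %H:%M:%S")
        by_month[(ts.year, ts.month)].append(record)
    for (year, month), records in by_month.items():
        out_dir = src.parent / f"{year:04d}" / f"{month:02d}"
        out_dir.mkdir(parents=True, exist_ok=True)
        payload = {"session": session, "enregistrements": records}
        (out_dir / src.name).write_text(json.dumps(payload), encoding="utf-8")

for path in Path("d").glob("*/comport/enregistrement_*"):
    split_by_month(path)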
That's a lot; I hope someone has time on their hands to help me get through it.

How to insert multiline data into a MySQL database

I have a command in my script that goes like this
MESSAGE=`grep -Po 'MSG.\K[\w\s,.'\'',.:]*' < $FILENAME`
Now when this command is run, I get output which looks like this:
Kohl's EDI will be down for scheduled maintenance starting at 12:30 am until
approximately 4:00 am central time on Wednesday June 22nd. Kohl's will not be
able to send or receive EDI or AS2 transmissions during this time. If your
company's AS2 software has an automated process to resend a file after a
failure, Kohl's encourages your company to enable the resend process. This is
also a reminder for AS2 trading partners that Kohl's AS2 certificate will be
changing at 11:00 am central time on Tuesday June 21st.
Now, after grepping the whole thing out, I pass the result of the command to a variable so that I can store it in a MySQL database.
The question is: how do I do that?
Make sure you have a connection to MySQL from the server; then you can pass the MESSAGE variable in the insert statement as \"$MESSAGE\". The backslashes are needed because the message must be wrapped in double quotes to form a valid insert statement.
Test: I did not have a column big enough to store your entire message, so I trimmed it a little to fit:
> MESSAGE="Kohl's EDI will be down for scheduled maintenance starting at 12:30 am until
approximately 4:00 am central time on Wednesday June 22nd. Kohl's will not be
able to send or receive EDI or AS2 transmissions during this time. If your
company's AS2 software has an automated process to resend a file after a
"
> sql "insert into at_test_run (run_id,run_error) values ('10000111',\"$MESSAGE\");"
> sql "select run_id,run_error from at_test_run where run_id='10000111'"
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| run_id | run_error |
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10000111 | Kohl's EDI will be down for scheduled maintenance starting at 12:30 am until
approximately 4:00 am central time on Wednesday June 22nd. Kohl's will not be
able to send or receive EDI or AS2 transmissions during this time. If your
company's AS2 software has an automated process to resend a file after a
|
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
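One caveat: if the message itself contains double quotes or backslashes, the \"$MESSAGE\" interpolation produces a broken (or injectable) statement. A parameterized insert sidesteps the quoting entirely; here is a minimal sketch in Python using mysql-connector-python (an assumption; any client that supports parameterized queries will do):
# Minimal sketch: let the driver quote the multi-line message itself.
import mysql.connector

message = open("message.txt").read()  # hypothetical file holding the grep output

conn = mysql.connector.connect(
    host="localhost", user="user", password="secret", database="test"  # placeholders
)
cur = conn.cursor()
cur.execute(
    "INSERT INTO at_test_run (run_id, run_error) VALUES (%s, %s)",
    ("10000111", message),  # %s placeholders handle quotes and newlines
)
conn.commit()
conn.close()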

Skipping lines that contain only whitespace in R

I have a problem reading some of the HTML subpages. Most of them work just fine, but some, e.g. http://www-history.mcs.st-andrews.ac.uk/Biographies/De_Morgan.html, have empty lines in H1 and H3. Because of that, my data.frame is a total mess for those people, e.g.:
data frame example. The frame contains 4 columns: "Name", "Date and place of birth", "Date and place of death", "Link". I'm supposed to make a table in LaTeX, but because of those whitespace-only rows my table at some points shifts in the wrong direction, and a guy's name becomes his date of birth, and so on. To read the sites I simply use a loop from j=1 to length(LinkWlasciwy):
matematyk=LinkWlasciwy[j] %>%
read_html() %>%
html_nodes(selektor1) %>%
html_text()
where selektor1 = "h3 font , h1". After that I save the contents to a .txt file and read it in another script, where I'm supposed to build a .tex file out of these data. In my opinion it would be best to just delete the lines in the file that contain only whitespace (space, \n, etc.). In my txt file, for example:
Marie-Sophie Germain| 1 April 1776
in Paris, France| 27 June 1831
in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|
As a separator I use " | ", though the spacing is not always the same: some records have only one space, some two, and so on. All I want is to bring every broken record back to this form:
Marie-Sophie Germain| 1 April 1776 in Paris, France| 27 June 1831 in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|
I had to delete http:// from the text samples because I don't have 10 reputation yet and the samples are counted as links.
You can use the stringi library:
library(stringi)
line<-c("Marie-Sophie Germain| 1 April 1776",
" ",
"in Paris, France| 27 June 1831",
" ",
"in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|")
line2<- line[stri_count_regex(line, "^[ \\t]+$") ==0]
line2
stri_paste(line2, collapse="")
Result:
[1] "Marie-Sophie Germain| 1 April 1776in Paris, France| 27 June 1831in Paris, France|www-history.mcs.st-andrews.ac.uk/Biographies/Germain.html|"

Cannot determine why a CSV file is invalid

I have the following CSV file:
textbox6,textbox10,textbox35,textbox17,textbox43,textbox20,textbox39,textbox23,textbox9,textbox16
"Monday, March 02, 2015",Water Front Lodge,"Tuesday, September 23, 2014",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,Critical Item,4 - Hand Washing Facilities/Practices
"Monday, March 02, 2015",Water Front Lodge,"Thursday, August 01, 2013",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,General Item,11 - Accurate Thermometer Available to Monitor Food Temperatures
"Monday, March 02, 2015",Water Front Lodge,"Wednesday, February 08, 2012",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,Critical Item,1 - Refrigeration/Cooling/Thawing (must be 4°C/40°F or lower)
"Monday, March 02, 2015",Water Front Lodge,"Wednesday, February 08, 2012",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,General Item,12 - Construction/Storage/Cleaning of Equipment/Utensils
And here's what file tells me:
Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
I was trying to use Scala-csv to parse it, but I always get Malformed CSV exceptions. I've uploaded it to CSV Lint and got 5 "unknown errors".
Eyeballing the file, I cannot determine why two separate parsers would fail. It seems to be perfectly ordinary, valid CSV. What about it is malformed?
And yes, I'm aware that it's terrible CSV. I didn't create it -- I just have to parse it.
EDIT: Of note is that this parser also fails.
It is definitely the newline. See the Lint results here:
CSV Lint Validation
I copied your CSV and made sure the newline characters were CRLF.
I used Notepad++ (Edit => EOL Conversion => Windows Format) to do the conversion.
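If you have to handle such files programmatically rather than fixing them by hand, you can normalize both the encoding and the line endings before parsing. A minimal sketch in Python (the filename is a placeholder):
# Decode as UTF-16 (the codec consumes the BOM) and open with
# newline="" so the csv module can cope with mixed CRLF/CR terminators.
import csv

with open("inspections.csv", encoding="utf-16", newline="") as f:  # placeholder name
    rows = list(csv.reader(f))

print(len(rows), "rows; header:", rows[0][:3])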

Re-programming the output on a Magtek MagWedge

I am trying to reprogram the output of my Magtek MagWedge, and I can't find any documentation on the syntax needed to output just the credit card number from my swipe reader and none of the other data.
Below is the example configuration; however, I have no clue what to change these values to.
Comment:Set up IntelliPIN to Required Configuration
/rawxact 50B01001011
/rawxact 50E10000000
/rawxact 940101010101010101
/rawxact 564
Comment:99{{SN}}
/rawsend 52
Comment:50Z00000110
/rawsend 42Setup Done
Thanks!
It turns out I needed to get the USBMSR Demo program and send message 01 03, then send 02; then restart the application, send 01 03 again, then send 02, and that fixed it for me.