I have the following CSV file:
textbox6,textbox10,textbox35,textbox17,textbox43,textbox20,textbox39,textbox23,textbox9,textbox16
"Monday, March 02, 2015",Water Front Lodge,"Tuesday, September 23, 2014",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,Critical Item,4 - Hand Washing Facilities/Practices
"Monday, March 02, 2015",Water Front Lodge,"Thursday, August 01, 2013",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,General Item,11 - Accurate Thermometer Available to Monitor Food Temperatures
"Monday, March 02, 2015",Water Front Lodge,"Wednesday, February 08, 2012",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,Critical Item,1 - Refrigeration/Cooling/Thawing (must be 4°C/40°F or lower)
"Monday, March 02, 2015",Water Front Lodge,"Wednesday, February 08, 2012",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,General Item,12 - Construction/Storage/Cleaning of Equipment/Utensils
And here's what the file command tells me:
Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
I was trying to use scala-csv to parse it, but I always get Malformed CSV exceptions. I've uploaded it to CSV Lint and get 5 "unknown errors".
Eyeballing the file, I cannot determine why two separate parsers would fail. It seems to be perfectly ordinary, valid CSV. What about it is malformed?
And yes, I'm aware that it's terrible CSV. I didn't create it -- I just have to parse it.
EDIT: Of note is that this parser also fails.
It is definitely the newline. See the Lint results here:
CSV Lint Validation
I copied your CSV and made sure the newline characters were all CRLF.
I used Notepad++ (Edit => EOL Conversion => Windows Format) to do the conversion.
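For what it's worth, the same conversion can be scripted instead of done by hand. A minimal Python sketch, with placeholder file names (inspections.csv / inspections-clean.csv): it reads the UTF-16 text, rewrites every lone CR or CRLF as a uniform CRLF, and re-encodes as UTF-8 so the parser sees one consistent line terminator.

# Read the UTF-16 file; newline="" keeps the raw \r and \r\n intact.
with open("inspections.csv", "r", encoding="utf-16", newline="") as f:
    text = f.read()

# Collapse CRLF and lone CR to LF, then emit uniform CRLF throughout.
text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")

with open("inspections-clean.csv", "w", encoding="utf-8", newline="") as f:
    f.write(text)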
This one's difficult and I haven't found an answer in hours; I hope you can help me. I'm not a native English speaker, so I apologize in advance.
I started at a company last week and am working with .json files, which are all stored in directories, one per company.
e.g.
d/Company1/comport/enregistrement_sessionhash1
                   enregistrement_sessionhash2
                   enregistrement_sessionhash3
d/Company2/comport/enregistrement_sessionhashX
d/Company3/comport/enregistrement_sessionhashY...
Each of them can contain [0-n] characters.
We use these files to calculate data.
The person before me didn't think to organize them by /year/month, so it takes a lot of time when we run algorithms on the data for a specific month, because we have to read every file in the directory, and files have been stored every 10 seconds per website-company and website-user for approximately 2 years.
Sadly, we can't use the filesystem's creation/modification times, only the text information inside the .json files: there was a server problem and my coworkers had to paste the files back, which reset the creation times.
Here is a template of the .json files
BEGIN OF FILE
{"session":"session_hash","enregistrements":[{"session":"session_hash",[...]{"data2":"xxx"}],"timedate_saved":"27 04 2020 12:39:21"},{"session":"session_hash",[...],"timedate_saved":"17 06 2020 11:01:08"},{"data1":"session_hash"[...],{"data2":"xxx"}],"timedate_saved":"27 04 2020 18:01:14"}]}
END OF FILE
Within a given file, the "session" value is always the same. This value is a hash, used as well in the filename, e.g. d/Company1/comport/enregistrement_session_hash.
I would like to read the files and cut out each "enregistrements" sub-array (starting with [{"session"... and ending with "timedate_saved":"01 01 1970 00:00:00"}]}). I then want the cut-out text written to files with the same filename (session_hash), stored under company/comport/year/month/enregistrement_sessionhash, where year and month come from the "timedate_saved" data. And of course I need to be able to reuse these files later, so the output must still parse as .json.
That's a lot; I hope someone has time on their hands to help me get through it.
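Here is only a minimal Python sketch of the reorganization described above, under two assumptions: the real files parse as regular JSON (the template above is abbreviated), and "timedate_saved" always uses the day month year format shown ("27 04 2020 12:39:21"). The names split_by_month, dest_root and company are hypothetical.

import json
import os
from datetime import datetime

def split_by_month(src_path, dest_root, company):
    # Read one source file and regroup its records by (year, month).
    with open(src_path, encoding="utf-8") as f:
        doc = json.load(f)

    session = doc["session"]
    by_month = {}
    for record in doc["enregistrements"]:
        # e.g. "27 04 2020 12:39:21" -> day month year hour:minute:second
        ts = datetime.strptime(record["timedate_saved"], "%d %m %Y %H:%M:%S")
        by_month.setdefault((ts.year, ts.month), []).append(record)

    # One output file per month: company/comport/year/month/enregistrement_<session>
    for (year, month), records in by_month.items():
        out_dir = os.path.join(dest_root, company, "comport", str(year), f"{month:02d}")
        os.makedirs(out_dir, exist_ok=True)
        out_path = os.path.join(out_dir, "enregistrement_" + session)
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"session": session, "enregistrements": records}, f)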
I've got two files - venues.csv and tweets.csv. I want to count, for each of the venues, the number of times it occurs in the tweet messages from the tweets file.
I've imported the csv files in HCatalog.
What I managed to do so far:
I know how to filter the text field and get the tuples whose tweet message contains 'Shell'. I want to do the same, but not with a hard-coded 'Shell': rather, for each name from the venuesNames bag. How can I do that? And how can I then use the GENERATE command properly to produce a new bag that matches the counts with the venue names?
a = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
b = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();
venuesNames = foreach a generate name;
countX = FILTER b BY (text matches '.*Shell.*');
venueToCount = generate ('Shell' as venue, COUNT(countX) as countVenues);
DUMP venueToCount;
The files that I'm using are:
tweets.csv
created_at,text,location
Sat Nov 03 13:31:07 +0000 2012, Sugar rush dfsudfhsu, Glasgow
Sat Nov 03 13:31:07 +0000 2012, Sugar rush ;dfsosjfd HAHAHHAHA, London
Sat Apr 25 04:08:47 +0000 2009, at Sugar rush dfjiushfudshf, Glasgow
Thu Feb 07 21:32:21 +0000 2013, Shell gggg, Glasgow
Tue Oct 30 17:34:41 +0000 2012, Shell dsiodshfdsf, Edinburgh
Sun Mar 03 14:37:14 +0000 2013, Shell wowowoo, Glasgow
Mon Jun 18 07:57:23 +0000 2012, Shell dsfdsfds, Glasgow
Tue Jun 25 16:52:33 +0000 2013, Shell dsfdsfdsfdsf, Glasgow
venues.csv
city,name
Glasgow, Sugar rush
Glasgow, ABC
Glasgow, University of Glasgow
Edinburgh, Shell
London, Big Ben
I know that these are basic questions but I'm just getting started with Pig and any help will be appreciated!
I presume that your list of venue names is unique. If not, then you have more problems anyway because you will need to disambiguate which venue is being talked about (perhaps by reference to the city fields). But disregarding that potential complication, here is what you can do:
You have described a fuzzy join. In Pig, if there is no way to coerce your records to contain standard values (and in this case, there isn't without resorting to a UDF), you need to use the CROSS operator. Use this with caution because if you cross two relations with M and N records, the result will be a relation with M*N records, which might be more than your system can handle.
The general strategy is 1) CROSS the two relations, 2) Create a custom regex for each record*, and 3) Filter those that pass the regex.
venues = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();
/* Create the Cartesian product of venues and tweets */
crossed = CROSS venues, tweets;
/* For each record, create a regex like '.*name.*' */
regexes = FOREACH crossed GENERATE *, CONCAT('.*', CONCAT(venues::name, '.*')) AS regex;
/* Keep tweet-venue pairs where the tweet contains the venue name */
venueMentions = FILTER regexes BY text MATCHES regex;
venueCounts = FOREACH (GROUP venueMentions BY venues::name) GENERATE group, COUNT($1);
The sum of all venueCounts might be more than the number of tweets, if some tweets mention multiple venues.
*Note that you have to be a little careful with this technique, because if the venue name contains characters that have special interpretations in Java regular expressions, you'll need to escape them.
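To illustrate that footnote (Pig's MATCHES uses Java regexes, which share the same metacharacters): the point is simply that a literal name must be escaped before being embedded in a pattern. A sketch of the idea in Python, with a made-up venue name:

import re

name = "Joe's Bar + Grill (West End)"          # hypothetical venue name
pattern = ".*" + re.escape(name) + ".*"        # re.escape handles (, ), + etc.
print(re.match(pattern, "lunch at Joe's Bar + Grill (West End) today") is not None)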
I crawled Twitter JSON from the Streaming API and got a file with thousands of lines of JSON data. However, this data contains lots of elements, such as "creation date", "source", "tweet text", etc.
I actually want to filter on the word "iphone" in the tweet text. However, if I filter using Unix grep, it matches not only the "tweet text" field but also the "source" field. That means a tweet that does not contain the word "iphone", but was tweeted from Twitter for iPhone (as stated in the "source" field), will also be matched.
Is there any way to filter this JSON on only one particular field (in my case, the "tweet text" field)?
Here's the example of one JSON line:
{"created_at":"Tue Aug 20 03:48:27 +0000 2013","id":369667218608369666,"id_str":"369667218608369666","text":"#Mattyb_chyeah_ yeah I'm only watching him! :)","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":369666992334073856,"in_reply_to_status_id_str":"369666992334073856","in_reply_to_user_id":1557571363,"in_reply_to_user_id_str":"1557571363","in_reply_to_screen_name":"Mattyb_chyeah_","user":{"id":1325959333,"id_str":"1325959333","name":"MattyBRapsTexas","screen_name":"MattyBRapsTexas","location":"Atlanta,Georgia","url":"http:\/\/www.instagram.com\/mattybrapstexas","description":"3 RT 6 Mentions He followed me on 4\/15\/13 6\/17\/13 Maddi Jane followed me on 6\/18\/13 #8:25pm! Cimorelli also follows Pizza Hut mentioned me 2 times on 7\/26\/13","protected":false,"followers_count":1095,"friends_count":426,"listed_count":8,"created_at":"Thu Apr 04 02:34:56 +0000 2013","favourites_count":226,"utc_offset":-14400,"time_zone":"Eastern Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":3447,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/378800000313651225\/afee0cc2286882eeb15f21ed7fae334a_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/378800000313651225\/afee0cc2286882eeb15f21ed7fae334a_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1325959333\/1376759786","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"Mattyb_chyeah_","name":"MattyB (\u2661_\u2661\u2740)","id":1557571363,"id_str":"1557571363","indices":[0,15]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"
What are you using for your grep regex? If you are just using 'iphone' for the regex, then yes, you'll get multiple hits. You can expand your regex to match 'iphone' only in the text section, before the source:
grep '"text":".*iphone.*","source":' myfile.txt
will search for the pattern iphone after "text" but before "source". It will ignore iphone in the rest of the line.
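A JSON-aware filter is another option if the regex ever misfires (for example if the text itself contains quotes, or the field order changes). A minimal Python sketch, reusing myfile.txt from the grep example:

import json

with open("myfile.txt", encoding="utf-8") as f:
    for line in f:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or malformed lines
        # Test only the tweet text; "iphone" in "source" is ignored.
        # lower() also catches "iPhone", like grep -i would.
        if "iphone" in tweet.get("text", "").lower():
            print(line, end="")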
I am writing a data clean-up script (MS Smart Quotes, etc.) that will operate on MySQL tables encoded in Latin1. While scanning the data I noticed a ton of 0D 0A where the line breaks are.
Since I am cleaning the data, should I also address all of the 0D, too, by removing them? Is there ever a good reason to keep 0D (carriage return) anymore?
Thanks!
0D 0A (\r\n) and 0A (\n) are line terminators; \r\n is mostly used on Windows, \n on Unix systems.
Is there ever a good reason to keep 0D anymore?
I think you should answer this question yourself.
You could remove '\r' from the data, but make sure that the programs that will use this data understand very well that '\n' alone means end of line. In most cases it is taken into account, but check just in case.
The CR/LF combination is a Windows thing. *NIX operating systems just use LF. So based on the application that uses your data, you'll need to decide whether you want or need to filter out CRs. See the Wikipedia entry on newline for more info.
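If you do decide to strip them, the conversion is a one-liner in most languages. A minimal Python sketch (the sample string is made up):

def normalize_newlines(text: str) -> str:
    # Convert CRLF first, then any lone CR, so all breaks end up as LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_newlines("one\r\ntwo\rthree\n")))  # 'one\ntwo\nthree\n'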
Python's readline() returns each line terminated with '\012'. The '\0' prefix means octal, and octal 12 is decimal 10; the ASCII table shows that decimal 10 is NL or LF, newline or line feed: the standard end-of-line in a Unix text or script file.
http://www.asciitable.com/
So be aware that len() will include the NL; unless you read past the EOF, len() will never be zero.
Therefore, if you INSERT a line of text obtained from Python's readline() into a MySQL table, it will by default include the NL character at the end.
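A small illustration of that point (data.txt is a placeholder name); strip the terminator before the INSERT if you don't want it stored:

with open("data.txt") as f:
    line = f.readline()

print(len(line))             # counts the trailing '\n' too (except at EOF)
clean = line.rstrip("\r\n")  # drop the line terminator before inserting
print(len(clean))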
I've read that ZIP files start with the following bytes:
50 4B 03 04
Reference: http://www.garykessler.net/library/file_sigs.html
Question: Is there a certain sequence of bytes that indicate a ZIP file has been password-protected?
It's not true that ZIP files must start with
50 4B 03 04
Entries within zip files start with 50 4B 03 04, and often a pure zip file starts with a zip entry as the first thing in the file. But there is no requirement that zip files start with those bytes. All files that start with those bytes are probably zip files, but not all zip files start with those bytes.
For example, you can create a self-extracting archive: a PE-COFF file, a regular EXE, for which there actually is a file signature, 4D 5A. Then, later in the exe file, you can store zip entries, beginning with 50 4B 03 04. The file is both an .exe and a .zip.
A self-extracting archive is not the only class of zip file that does not start with 50 4B 03 04. You can "hide" arbitrary data in a zip file this way. WinZip and other tools should have no problems reading a zip file formatted this way.
If you find the 50 4B 03 04 signature within a file, either at the start of the file or somewhere else, you can look at the next few bytes to determine whether that particular entry is encrypted. Normally it looks something like this:
50 4B 03 04 14 00 01 00 08 00 ...
The first four bytes are the entry signature. The next two bytes are the "version needed to extract". In this case it is 0x0014, which is 20. According to the pkware spec, that means version 2.0 of the pkzip spec is required to extract the entry. (The latest zip "feature" used by the entry is described by v2.0 of the spec). You can find higher numbers there if more advanced features are used in the zip file. AES encryption requires v5.1 of the spec, hence you should find 0x0033 in that header. (Not all zip tools respect this).
The next 2 bytes represents the general purpose bit flag (the spec calls it a "bit flag" even though it is a bit field), in this case 0x0001. This has bit 0 set, which indicates that the entry is encrypted.
Other bits in that bit flag have meaning and may also be set. For example bit 6 indicates that strong encryption was used - either AES or some other stronger encryption. Bit 11 says that the entry uses UTF-8 encoding for the filename and the comment.
All this information is available in the PKWare AppNote.txt spec.
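If you'd rather not read the header bytes by hand, Python's zipfile module exposes this same general purpose bit flag as ZipInfo.flag_bits, so a per-entry check takes only a few lines (archive.zip is a placeholder name):

import zipfile

with zipfile.ZipFile("archive.zip") as zf:
    for info in zf.infolist():
        # Bit 0 of the general purpose bit flag marks an encrypted entry.
        encrypted = bool(info.flag_bits & 0x0001)
        print(info.filename, "encrypted" if encrypted else "plain")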
It's the individual files within the zip archive that are password-protected. You can have a mix of password-protected and unprotected files in an archive (e.g. a readme file and then the contents).
If you followed the links describing ZIP files in the URL you reference, you'd find that this one discusses the bit that indicates whether a file in the ZIP archive is encrypted or not. It seems that each file in the archive can be independently encrypted or not.