How do I combine Wikipedia Commons maps into one display? - google-maps

I am currently working on a project where I am looking to map out a set of state routes. After picking x amount, I want to construct one aggregate map of those routes so that I can visualize them together. I see that on the Wikipedia page for US Interstate Routes, it's possible to do this: https://en.wikipedia.org/wiki/Interstate_Highway_System#/map/0.
Each of the individual red lines on the map links to a data source in Wikimedia Commons, for example "Data:Interstate 70.map": https://commons.wikimedia.org/wiki/Data:Interstate_70.map. So, I can search up all of these individual routes, but I'd like to be able to find them all and combine them into one (zoom-able) image...
Additionally, if the route doesn't already exist as a data source in Wikimedia Commons: I've noticed that existing routes are sourced from OpenStreetMap. I've tried messing around with the site a little, and I'm not sure how Wikimedia Commons contributors are constructing the route as a whole (I can select small segments just fine) in the first place and exporting/uploading it to Wikimedia Commons. For example, PA Route 50 shows up on the map in OpenStreetMap, but I am unable to select it as a whole.
If there is a better service to do this with, that also works! I just want to visualize the routes together, and I don't care how it's done (other than manually taking screenshots and editing the photos together)
Any help would be appreciated! Thank you!

Wikipedia has its own special way of organising geo-data for these maps embedded in articles. Let's have a poke around...
If you mouse over the bottom credits of your map link, you can see references to lots of different .map files. Likewise if you view the wikitext source of this article.
Here is the full list of referenced .map file names (one for each interstate):
Interstate 4.map
Interstate 5.map
Interstate 8.map
Interstate 10.map
Interstate 11.map
Interstate 12.map
Interstate 14.map
Interstate 15.map
Interstate 16.map
Interstate 17.map
Interstate 19.map
Interstate 20.map
Interstate 22.map
Interstate 24.map
Interstate 25.map
Interstate 26.map
Interstate 27.map
Interstate 29.map
Interstate 30.map
Interstate 35.map
Interstate 37.map
Interstate 39.map
Interstate 40.map
Interstate 41.map
Interstate 43.map
Interstate 44.map
Interstate 45.map
Interstate 49 1.map
Interstate 55.map
Interstate 57.map
Interstate 59.map
Interstate 64.map
Interstate 65.map
Interstate 66.map
Interstate 68.map
Interstate 69.map
Interstate 70.map
Interstate 71.map
Interstate 72.map
Interstate 73.map
Interstate 74.map
Interstate 75.map
Interstate 76 (Ohio–New Jersey).map
Interstate 76 (Colorado–Nebraska).map
Interstate 77.map
Interstate 78.map
Interstate 79.map
Interstate 80.map
Interstate 81.map
Interstate 82.map
Interstate 83.map
Interstate 84 (Oregon–Utah).map
Interstate 84 (Pennsylvania–Massachusetts).map
Interstate 85.map
Interstate 86 (Idaho).map
Interstate 86 (Pennsylvania–New York).map
Interstate 87 (North Carolina).map
Interstate 87 (New York).map
Interstate 88 (Illinois).map
Interstate 88 (New York).map
Interstate 89.map
Interstate 90.map
Interstate 91.map
Interstate 93.map
Interstate 94.map
Interstate 95.map
Interstate 96.map
Interstate 97.map
Interstate 99.map
Interstate 35W (Texas).map
Interstate 35W (Minnesota).map
Each of these is a file name of a .map file which you can download from wikimedia commons with a URL such as:
https://commons.wikimedia.org/wiki/Data:Interstate%204.map?action=raw
This gives you a JSON file. If you strip off the outer element containing metadata fields and take only the data field's contents, that's valid GeoJSON. (Maybe there's a way to request GeoJSON from Wikimedia Commons more directly.)
So you could download all of those and set up that data (manually, or did you want code to do that automatically?), and then I suppose you are left with a question a bit like this one: how to present a GeoJSON file on a map.
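To sketch the download-and-combine step in code: a minimal Python example, assuming each raw response is JSON with the GeoJSON payload under a data key (as described above). fetch_map and merge_map_files are hypothetical helper names, not an existing API.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def fetch_map(name):
    """Download one raw .map file from Wikimedia Commons as JSON."""
    url = "https://commons.wikimedia.org/wiki/Data:" + quote(name) + "?action=raw"
    with urlopen(url) as resp:
        return json.load(resp)

def merge_map_files(raw_maps):
    """Merge the 'data' payloads of several .map files into one
    GeoJSON FeatureCollection."""
    features = []
    for raw in raw_maps:
        geo = raw.get("data", raw)  # the GeoJSON lives under "data"
        if geo.get("type") == "FeatureCollection":
            features.extend(geo.get("features", []))
        else:
            features.append(geo)  # a bare Feature
    return {"type": "FeatureCollection", "features": features}

# e.g.:
# names = ["Interstate 4.map", "Interstate 5.map"]
# combined = merge_map_files(fetch_map(n) for n in names)
```

The resulting FeatureCollection can be dropped into any GeoJSON viewer to get the zoomable combined display.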

Related

Does this text-based file format similar to csv/tsv that seems to contain multiple sheets have a name?

I have a text-based file format that is similar to csv/tsv with separators that are pipes |, and the first column of each row seems to be the "sheet/table" name, but there are no headers.
See the example below...I'd like to put a name to it so that I can import it into a tool and work with the data.
TABLEORSHEET1|Lar|Lafard|113 North Dakota Ln.|Johnstown|PA|15905
TABLEORSHEET1|Nancy|Lafard|114 North Dakota Ln.|Johnstown|PA|15905
TABLEORSHEET1|Tommy|Lafard|115 North Dakota Ln.|Johnstown|PA|15905
TABLEORSHEET2|1|Tea Cup|1.42|0
TABLEORSHEET2|1|Coffee Cup|3.42|1
TABLEORSHEET3|1|EDIT|LNAME|Laffer|Lafard
TABLEORSHEET3|1|EDIT|FNAME|Larry|Lar
I've seen this file format used twice before and both were either an import or an export to/from an Oracle system.
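Whatever the format is called, pulling it back apart is straightforward; a minimal Python sketch (split_multitable is a hypothetical helper, and it assumes no pipes appear inside field values):

```python
import csv
import io
from collections import defaultdict

def split_multitable(text):
    """Group the rows of a pipe-delimited multi-table dump by their
    first column (the table/sheet name), dropping that column."""
    tables = defaultdict(list)
    for row in csv.reader(io.StringIO(text), delimiter="|"):
        if row:  # skip blank lines
            tables[row[0]].append(row[1:])
    return dict(tables)
```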

Importing CSV file to Google maps format

I built some software that generates trails for my own use.
I would like to test the software, so I created a CSV file that contains the longitude and latitude of the trail points.
What is the format of a CSV file that can be imported into Google Maps?
The documentation isn't very specific about CSV files, so I just tried a bunch of formats.
Option 1 is to have separate latitude and longitude columns. You will be able to specify columns in the upload wizard.
lon,lat,title
-20.0390625,53.27835301753182,something
-17.841796875,53.27835301753182,something
Option 2 is to have a single coordinate column with the coordinates separated by a space. You will be able to choose the order of the coordinate pair in the upload wizard.
lonlat,title
-20.0390625 53.27835301753182,something
-17.841796875 53.27835301753182,something
You'll also need one column that acts as the description for your points; it is, again, selectable in the wizard.
There seems to be no way to import CSVs as line geometries and no way to convert points to lines later on. Well-known-text (WKT) in the coordinate column fails to import.
The separator needs to be a comma (,). Semicolons (;), spaces and tabs don't work.
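As a worked example of those rules, here is a minimal Python sketch that writes the Option 1 layout with its required title column (trail_to_csv is a hypothetical helper name):

```python
import csv
import io

def trail_to_csv(points):
    """Render (lon, lat, title) trail points as CSV in the
    'Option 1' layout above: separate lon/lat columns plus a title."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # comma separator, which the importer requires
    writer.writerow(["lon", "lat", "title"])
    for lon, lat, title in points:
        writer.writerow([lon, lat, title])
    return buf.getvalue()
```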

Cannot determine why a CSV file is invalid

I have the following CSV file:
textbox6,textbox10,textbox35,textbox17,textbox43,textbox20,textbox39,textbox23,textbox9,textbox16
"Monday, March 02, 2015",Water Front Lodge,"Tuesday, September 23, 2014",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,Critical Item,4 - Hand Washing Facilities/Practices
"Monday, March 02, 2015",Water Front Lodge,"Thursday, August 01, 2013",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,General Item,11 - Accurate Thermometer Available to Monitor Food Temperatures
"Monday, March 02, 2015",Water Front Lodge,"Wednesday, February 08, 2012",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,Critical Item,1 - Refrigeration/Cooling/Thawing (must be 4°C/40°F or lower)
"Monday, March 02, 2015",Water Front Lodge,"Wednesday, February 08, 2012",,Routine,#1 Johnson Street,Low,Northern Health - Mamaw/Keewa/Athab,General Item,12 - Construction/Storage/Cleaning of Equipment/Utensils
And here's what file tells me:
Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
I was trying to use Scala-csv to parse it but always get Malformed CSV exceptions. I've uploaded it to CSV Lint and get 5 "unknown errors".
Eyeballing the file, I cannot determine why two separate parsers would fail. It seems to be perfectly ordinary and valid CSV. What about it is malformed?
And yes, I'm aware that it's terrible CSV. I didn't create it -- I just have to parse it.
EDIT: Of note is that this parser also fails.
It is definitely the newlines. See the Lint results here:
CSV Lint Validation
I copied your CSV and made sure the newline characters were CRLF.
I used Notepad++ and its Edit => EOL Conversion => Windows Format option to do the conversion.
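The same fix can be scripted instead of done in an editor; a minimal Python sketch, assuming the input really is little-endian UTF-16 as file reported (normalize_csv_bytes is a hypothetical helper):

```python
def normalize_csv_bytes(raw: bytes) -> str:
    """Decode UTF-16 and unify the mixed CR / CRLF line endings
    into CRLF, like the Notepad++ EOL conversion above."""
    text = raw.decode("utf-16")                            # strips the BOM too
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # collapse all endings
    return text.replace("\n", "\r\n")                      # re-emit as CRLF
```

The normalized string can then be handed to any CSV parser.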

Pig count occurrence of strings in text messages

I've got two files - venues.csv and tweets.csv. I want to count, for each of the venues, the number of times its name occurs in the tweet messages from the tweets file.
I've imported the csv files in HCatalog.
What I managed to do so far:
I know how to filter the text fields and get the tuples whose tweet messages contain 'Shell'. I want to do the same but not with a hard-coded 'Shell', rather for each name from the venuesNames bag. How can I do that? Also, how can I then use the GENERATE command properly to produce a new bag that matches the counts with the names of the venues?
a = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
b = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();
venuesNames = foreach a generate name;
countX = FILTER b BY (text matches '.*Shell.*');
venueToCount = generate ('Shell' as venue, COUNT(countX) as countVenues);
DUMP venueToCount;
The files that I'm using are:
tweets.csv
created_at,text,location
Sat Nov 03 13:31:07 +0000 2012, Sugar rush dfsudfhsu, Glasgow
Sat Nov 03 13:31:07 +0000 2012, Sugar rush ;dfsosjfd HAHAHHAHA, London
Sat Apr 25 04:08:47 +0000 2009, at Sugar rush dfjiushfudshf, Glasgow
Thu Feb 07 21:32:21 +0000 2013, Shell gggg, Glasgow
Tue Oct 30 17:34:41 +0000 2012, Shell dsiodshfdsf, Edinburgh
Sun Mar 03 14:37:14 +0000 2013, Shell wowowoo, Glasgow
Mon Jun 18 07:57:23 +0000 2012, Shell dsfdsfds, Glasgow
Tue Jun 25 16:52:33 +0000 2013, Shell dsfdsfdsfdsf, Glasgow
venues.csv
city,name
Glasgow, Sugar rush
Glasgow, ABC
Glasgow, University of Glasgow
Edinburgh, Shell
London, Big Ben
I know that these are basic questions but I'm just getting started with Pig and any help will be appreciated!
I presume that your list of venue names is unique. If not, then you have more problems anyway because you will need to disambiguate which venue is being talked about (perhaps by reference to the city fields). But disregarding that potential complication, here is what you can do:
You have described a fuzzy join. In Pig, if there is no way to coerce your records to contain standard values (and in this case, there isn't without resorting to a UDF), you need to use the CROSS operator. Use this with caution because if you cross two relations with M and N records, the result will be a relation with M*N records, which might be more than your system can handle.
The general strategy is 1) CROSS the two relations, 2) Create a custom regex for each record*, and 3) Filter those that pass the regex.
venues = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();
/* Create the Cartesian product of venues and tweets */
crossed = CROSS venues, tweets;
/* For each record, create a regex like '.*name.*' */
regexes = FOREACH crossed GENERATE *, CONCAT('.*', CONCAT(venues::name, '.*')) AS regex;
/* Keep tweet-venue pairs where the tweet contains the venue name */
venueMentions = FILTER regexes BY text MATCHES regex;
venueCounts = FOREACH (GROUP venueMentions BY venues::name) GENERATE group, COUNT($1);
The sum of all venueCounts might be more than the number of tweets, if some tweets mention multiple venues.
*Note that you have to be a little careful with this technique, because if the venue name contains characters that have special interpretations in Java regular expressions, you'll need to escape them.

What is the byte signature of a password-protected ZIP file?

I've read that ZIP files start with the following bytes:
50 4B 03 04
Reference: http://www.garykessler.net/library/file_sigs.html
Question: Is there a certain sequence of bytes that indicate a ZIP file has been password-protected?
It's not true that ZIP files must start with
50 4B 03 04
Entries within zip files start with 50 4B 03 04, and often pure zip files start with a zip entry as the first thing in the file. But there is no requirement that zip files start with those bytes. All files that start with those bytes are probably zip files, but not all zip files start with those bytes.
For example, you can create a self-extracting archive, which is a PE-COFF file, a regular EXE; that format does have a file signature, 4D 5A .... Then, later in the exe file, you can store zip entries, beginning with 50 4B 03 04.... The file is both an .exe and a .zip.
A self-extracting archive is not the only class of zip file that does not start with 50 4B 03 04. You can "hide" arbitrary data at the front of a zip file this way. WinZip and other tools should have no problems reading a zip file formatted this way.
If you find the 50 4B 03 04 signature within a file, either at the start of the file or somewhere else, you can look at the next few bytes to determine whether that particular entry is encrypted. Normally it looks something like this:
50 4B 03 04 14 00 01 00 08 00 ...
The first four bytes are the entry signature. The next two bytes are the "version needed to extract". In this case it is 0x0014, which is 20. According to the pkware spec, that means version 2.0 of the pkzip spec is required to extract the entry. (The latest zip "feature" used by the entry is described by v2.0 of the spec). You can find higher numbers there if more advanced features are used in the zip file. AES encryption requires v5.1 of the spec, hence you should find 0x0033 in that header. (Not all zip tools respect this).
The next 2 bytes represent the general purpose bit flag (the spec calls it a "bit flag" even though it is a bit field), in this case 0x0001. This has bit 0 set, which indicates that the entry is encrypted.
Other bits in that bit flag have meaning and may also be set. For example bit 6 indicates that strong encryption was used - either AES or some other stronger encryption. Bit 11 says that the entry uses UTF-8 encoding for the filename and the comment.
All this information is available in the PKWare AppNote.txt spec.
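To illustrate reading those fields, here is a minimal Python sketch that checks the encryption bit of a local file header (entry_is_encrypted is a hypothetical helper; it only inspects the first ten bytes):

```python
import struct

def entry_is_encrypted(header: bytes) -> bool:
    """Check bit 0 of the general-purpose bit flag in a zip
    local file header (the entry-is-encrypted bit)."""
    if header[:4] != b"PK\x03\x04":  # 50 4B 03 04
        raise ValueError("not a local file header")
    # bytes 4-5: version needed to extract; bytes 6-7: bit flag
    _version, flags = struct.unpack_from("<HH", header, 4)
    return bool(flags & 0x0001)
```

Fed the example bytes above (50 4B 03 04 14 00 01 00 08 00 ...), it reports the entry as encrypted.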
It's the individual files within the zip archive that are password-protected. You can have a mix of password-protected and unprotected files in one archive (e.g. a readme file and then the contents).
If you followed the links describing ZIP files in the URL you reference, you'd find that this one discusses the bit that indicates whether a file in the ZIP archive is encrypted or not. It seems that each file in the archive can be independently encrypted or not.