Why does my GTFS data contain "invisible" line breaks? - csv

I've been looking for a way to import GTFS data into an SQL database for my application, and I found a solution on GitHub.
However, it is written in Python, and I don't think I can use it directly in my Windows application. Please correct me if I'm wrong.
I have no trouble understanding the logic behind the solution, though, so I could write my own parser.
So I opened the GTFS data file "calendar_dates.txt" in Notepad and found its contents confusing. It looked like this:
service_id,date,exception_type1,20151012,11,20151111,12,20150822,12,20150829,12.....
You can see that it's confusing when there are no line breaks.
But when I paste the data here to show it to you, it is automatically formatted as:
service_id,date,exception_type
1,20151012,1
1,20151111,1
2,20150822,1
2,20150829,1
2
Now it clearly makes sense!! (There are spaces in between for parsing)..
But I don't understand. Is Notepad showing it wrong? How do I see the data "properly" then, in order to write my own parser?

Most likely your GTFS data is written with UNIX end-of-line characters (linefeed only) as opposed to MS-DOS/Windows characters (carriage return followed by linefeed). This is permitted by the GTFS spec, which says:
Each line must end with a CRLF or LF linebreak character.
Most application software available for Windows, including Notepad, recognizes only Windows end-of-line characters, so opening a file created on UNIX shows the entire contents as a single line, as you've observed. However, tools like Notepad++ that are meant for developers, as well as most programming libraries (such as those meant to parse CSV files), are usually smart enough to recognize both formats and handle them appropriately.
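If you want to check which convention a particular feed uses, you can count the end-of-line bytes yourself. The following is only a minimal Python sketch, not part of any GTFS tool: the names count_line_endings and parse_gtfs are made up for illustration, and it assumes the file is UTF-8 encoded. It also shows that Python's csv module reads both conventions the same way.

```python
import csv
import io


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
    }


def parse_gtfs(data: bytes) -> list:
    """Parse GTFS CSV bytes into a list of dicts, accepting CRLF or LF."""
    # utf-8-sig drops a leading byte-order mark if one is present.
    text = data.decode("utf-8-sig")
    # newline="" leaves line endings untranslated so the csv module handles them.
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

To try it on a real file, read the file in binary mode (open(path, "rb").read()) and pass the bytes to either function. A feed that reports only "LF" endings is exactly the kind that older versions of Notepad display as one long line.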
Wikipedia has more information about end-of-line representations across operating systems if you're interested.
Finally, I'll mention that I've recently posted to GitHub my own GTFS-to-SQLite loading tool, which is written in C and uses libcsv to parse GTFS data. If you're developing in a language lower-level than Python you may find it useful as an example.

First, copy the contents of your GTFS files (routes, shapes, etc.) and paste them into an online text editor (for example, http://www.editpad.org/).
Then copy the text back out of the online editor and paste it into your original .txt file.

Related

TCL no such file or directory

I am running the script nmrCube.tcl to generate a 3D box from NMR data.
I initially had a problem with a library, which is now sorted out.
When I run the script I get the following error, even though the file is indeed there:
Error in startup script: couldn't read file "“./nmrCube.tcl”": no such file or directory
Tcl regards “curly quotes” as entirely ordinary characters. They're not alphanumerics or one of Tcl's metacharacters, so they follow the same basic rules as characters like / and . and so on.
You probably don't want to use them in a Tcl script except in text for display to the user. You might want to use the "straight quotes" instead, which are metacharacters for Tcl. If your editor insists on converting those to fancy quotes, find another text editor. (You'd have problems using it for virtually any other programming language as well.)
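If the curly quotes came from a document or web page that the command was copied from, a small filter can put the straight ones back. This is only an illustrative Python sketch; the name straighten_quotes is hypothetical. Note that in Tcl only the double quote is a metacharacter, so the single-quote mappings are included purely for convenience.

```python
# Map typographic quotes to their plain ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(CURLY_TO_STRAIGHT)
```

Applied to the file name from the error message, straighten_quotes("\u201c./nmrCube.tcl\u201d") gives "./nmrCube.tcl" in ordinary double quotes, which is what the shell or Tcl expects.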

View the innards of a .ppt file?

I need to figure out what is going on inside a client's .ppt files. What is a good way to get started?
My eventual hope is to convert it to HTML. But if I just export the .ppt to HTML, I get a lot of images (as opposed to text), which is not a Good Thing.
EDIT: software that automatically converts .ppt to HTML would be terrific, provided that it preserves as much information as possible in text format. If that doesn't exist, the next best thing would be to understand the innards of the .ppt and write my own code to do a partial conversion.
EDIT: I used OfficeConvert as recommended by Michiel Leenaars. It got me text all right. My 50-page, 8MB test file turned into 40MB of text. The fact that I got text is good. The fact that the amount went way up is moving in the wrong direction. And there is an awful lot of repetition in there. The word "style" appeared 410815 times; the word "draw" appeared 351229 times.
I think a safe way would be to use OfficeConvert to automatically convert to ODF programmatically with Microsoft Office. Run it with /? to get help. There are some dependencies (see below).
Then use a good ODF library like lpod to look inside it.
You can view some interesting code examples here.
Dependencies:
Microsoft .NET Framework Version 2.0 Redistributable Package (x86)
Primary Interop Assemblies for Office 2007 or Office 2010 (whichever you are using).
I like the Aspose products. (I'm not associated with them other than as a customer.) I've used the PPT one specifically to write code that pokes around in the insides of a PPT. Overkill if you just want to convert it to HTML, but invaluable for the sorts of things I use it for.
If you know Java, Apache has the POI project, which lets you look at the innards of a PPT file. You could extract all the information you want from the presentation (images, text) and then convert it to HTML however you like.
It's free too.

Find and Replace in Files - UTF8

I'm searching for a free application, licensed for commercial use, that allows find/replace across multiple files (regular expressions would be nice but are not a must) and that supports opening and saving in UTF-8.
I tried a few, like BKReplaceEm, but the application ends up saving all the files as ASCII, which causes some problems with web rendering.
Please advise.
[UPDATE] To further clarify, I am searching for a Windows utility.
[UPDATE #2] This is going to be used to run through our 450-page site and replace all French characters with the much-needed HTML entities.
Notepad++ supports this feature, and is a great little editor in its own right.
Edit : Actually, Notepad++ does support replace in files. Click Search -> Find in Files, then select "Replace in files" in the dialog.
In the spirit of the previous answer, you can use Perl (which has seamless native Unicode support and whose regex capabilities are unparalleled). There are Windows Perl distributions available (ActivePerl, Strawberry, or you can use Cygwin), and you can even slap a GUI on top of it; for the latter, see the answers to my very recent SO question :)
Plus, Perl can gather a pretty much arbitrarily complex collection of files, using globs for simple cases, File::Find for more complicated ones, and grep on the resulting file list to refine further if you need something fancier, e.g. filtering by content or modification time.
UPDATE: For a Windows editor, you can use UltraEdit. It has a free evaluation period and, to be perfectly honest, I find the purchase price WELL worth paying for this very nice and powerful editor. Among its other features, it supports Unicode and has pretty fancy search/replace abilities, including Perl regex support and S/R in multiple files.
Use sed.
jEdit has a feature called "HyperSearch" (just open the find dialog). You can specify a directory, a file name pattern and jEdit (being based on Java) does support lots of different encodings (and is often smart enough to figure out the correct one).
You could try my editor, Code Trowel
If it doesn't do what you want I'd probably fix it :-)
For Windows, Notepad++ is awesome. It's licensed under the GPL. It does search and replace in files and does support regular expressions.
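If none of the editors fit, the specific job described in the question (turning accented characters into HTML entities without breaking the UTF-8 encoding) can also be scripted. The following is a rough Python sketch under some assumptions: the pages are UTF-8 files, the "*.html" pattern is a guess at how the site is laid out, and the names to_entities and convert_tree are made up for illustration. It relies on the standard library's html.entities.codepoint2name table and leaves ASCII characters such as < and & untouched.

```python
from html.entities import codepoint2name
from pathlib import Path


def to_entities(text: str) -> str:
    """Replace non-ASCII characters that have a named HTML entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            out.append(ch)  # ASCII, or no named entity: keep as-is
    return "".join(out)


def convert_tree(root: str, pattern: str = "*.html") -> int:
    """Rewrite matching files under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob(pattern):
        # Read and write bytes so line endings are left exactly as they were.
        original = path.read_bytes().decode("utf-8")
        converted = to_entities(original)
        if converted != original:
            path.write_bytes(converted.encode("utf-8"))
            changed += 1
    return changed
```

Since convert_tree overwrites files in place, it would be wise to run it on a copy of the site first.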

How to analyze binary file?

I have a binary file. I don't know how it's formatted; I only know it comes from Delphi code.
Is there any way to analyze a binary file?
Is there any "pattern" for analyzing and deserializing the binary content of a file with an unknown format?
Try these:
Deserialize data: find out how your exe was compiled (try File Analyzer). Try to deserialize the binary data with the language you discover. Then serialize it to XML, a language-independent format that every programming language can read.
Analyze the binary data: save several versions of the file with small variations and use a diff program, together with a hex editor, to work out the meaning of every bit. Use this in conjunction with binary hacking techniques (like How to crack a Binary File Format by Frans Faase).
Reverse engineer the application: try to recover the code using reverse engineering tools for the programming language used to build the app (found with File Analyzer). Otherwise use a disassembler such as IDA Pro.
For my hobby project I had to reverse engineer some old game files. My approaches were:
Have a good hex editor.
Look for readable words in the binary file and note how they are distributed. If the distance between them is constant, you know it is a listing of fixed-size entries.
Look for 2-3 consecutive zeros. They might indicate an int32 value.
Some dwords might be pointers into the file.
Try to identify reoccurring patterns in the file.
Seeing lots of C0-CF might indicate RLE compressed data.
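The "readable words at a constant distance" check from the list above is easy to automate. Here is a small Python sketch of that idea; the function names find_strings and constant_stride are hypothetical, and it only looks for printable ASCII, so text in other encodings (UTF-16, for instance) would be missed.

```python
import re


def find_strings(data: bytes, min_len: int = 4) -> list:
    """Return (offset, text) for each run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


def constant_stride(offsets: list):
    """Return the gap between consecutive offsets if it is constant, else None."""
    gaps = {b - a for a, b in zip(offsets, offsets[1:])}
    return gaps.pop() if len(gaps) == 1 else None
```

If constant_stride returns a number for the offsets of the strings you found, that number is a good first guess at the record size.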
I've developed Hexinator (Windows & Linux) and Synalyze It! (macOS) exactly for this purpose. These applications allow you to see the binary files like in other hex editors but additionally you can create a "grammar" with the specifics of a binary file format. The grammar contains all the building blocks and is used to parse the file automatically.
Thus you can keep the knowledge you gain in the analysis and apply it to multiple files simultaneously. You can also color-code the bits and pieces of file formats for a quick overview in the hex editor.
The parsing results are displayed in a tree view where you can also modify the files easily (applying endianness et cetera).
Reverse engineering a binary file when you have some idea of what it represents is a very time-consuming process. If you have no idea what it is, it will be even harder.
It is possible, but you need a pretty good reason for doing so.
The first step would be to open it in a hex editor of your choice and see if you can find any English text that points to what the file is supposed to represent. From there, Google "Reverse Engineering binary files"; much more knowledgeable people than me have written guides about it.
The "strings" program from GNU binutils is very useful. It will print the strings of printable characters in a file, quite often giving a clue to what a file contains or a program does.
If the data represents serialized Delphi objects, you should start reading about the Delphi serialization process. If that's the case, I think your best bet would be to load it using Delphi and continue your analysis from the IDE. Some information about Delphi serialization can be found here.
EDIT: if the file does contain serialized Delphi objects, then you should write a small Delphi program that loads it and "converts" the data to something neutral, like XML. If you manage to do this, you should check whether Delphi supports serializing to XML. Then you could access those objects from any language.
The Unix "file" command is really useful; I don't know if there is anything like it on Windows. You run it like this:
file myfile.ext
And it spits out a text description based on the magic numbers and data contained therein.
It is probably included with Cygwin.
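The idea behind magic numbers is easy to try by hand. What follows is a rough Python sketch with a handful of well-known signatures; the table is deliberately tiny and the identify function is hypothetical, whereas the real file command uses a far larger database and also looks beyond the first few bytes.

```python
# A few well-known signatures found at the very start of a file.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also used by .docx, .odt, .jar)"),
    (b"\x1f\x8b", "gzip compressed data"),
    (b"SQLite format 3\x00", "SQLite 3 database"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .doc/.ppt/.xls)"),
    (b"MZ", "DOS/Windows executable"),
]


def identify(data: bytes) -> str:
    """Return a description for the first matching signature, or 'unknown'."""
    for magic, description in SIGNATURES:
        if data.startswith(magic):
            return description
    return "unknown"
```

Reading the first 16 bytes of the file in binary mode and passing them to identify is enough for every entry in this table.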
If you have access to the application that creates the file, you can apply changes to the application, then save the file and see the effects (Keep in mind that numbers are probably stored in little endian):
First, create the file repeatedly. If the files are not byte-for-byte identical, the current date/time is probably stored in the area where the differences occur.
Maybe you want to repeat that with the software running under different environments, to see if OS version etc are stored, but this is rather unusual.
Next you can try to change single variables and create several files that only differ in the value of this variable. This helps you identify where this variable is stored.
That way you can also exclude variables that are not stored in the file: If you change them, but the files created are identical, they are not stored.
In order to test the hypotheses you worked out with the steps above, edit one of the files and have the application read it.
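Comparing two saved versions can be done with any binary diff tool, but a few lines of script are often enough. The sketch below is in Python and assumes the two files fit in memory; the names diff_ranges and read_u32_le are made up for illustration. The second helper reads a 32-bit little-endian number at a given offset, which is handy for checking a guess about where a variable is stored.

```python
import struct


def diff_ranges(a: bytes, b: bytes) -> list:
    """Return (start, end) offset ranges, end exclusive, where a and b differ."""
    common = min(len(a), len(b))
    ranges = []
    start = None
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # Everything past the shorter file counts as different.
        ranges.append((common, max(len(a), len(b))))
    return ranges


def read_u32_le(data: bytes, offset: int) -> int:
    """Read an unsigned 32-bit little-endian integer at offset."""
    return struct.unpack_from("<I", data, offset)[0]
```

If you change one setting from 100 to 101 and diff_ranges reports a single one-byte range, read_u32_le at the start of that range (or a few bytes before it) should give back 101 if the value really is a little-endian 32-bit integer.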
If you don't have access to the application itself, I suggest that you forget about it and find another way to solve your problem. There is a very high probability that it will be faster...
If file does not give a meaningful answer, you may want to try TRiD by Marco Pontello to determine whether your data is stored in a known format.
Get the Delphi application, open it in the freeware version of IDA Pro, find where it writes the file, and work out from there how the file is written.
Unless it's plain text.
Do you know the program that uses it? If so, you can hook that program's write-to-file function and get an idea of what data it's writing, the size of the data, and where it goes.
More Info: http://www.codeproject.com/KB/DLL/Win32APIHooking_Trouble.aspx
Unlike traditional hex editors which only display the raw hex bytes of a file, 010 Editor can also parse a file into a hierarchical structure using a Binary Template. The results of running a Binary Template are much easier to understand and edit than using just the raw hex bytes.
http://www.sweetscape.com/010editor/
Try to open it in a hex editor and analyse.

How can I analyze a closed format (e.g. doc or vce)?

I want to study the .vce format. It's a binary format and it seems more complicated than a simple object serialization. Is there any tool or technique for analyzing a binary format?
You might need to "Reverse-Code-Engineer" a program that uses this file format (http://www.openrce.org/). The tools used for this kind of analysis are: brain, disassembler (IDA Pro, for example), and debugger (OllyDbg, for example). But beware: the road to successfully reverse engineering a file format is veeeeeerrry hard.
And reversing an application might be illegal depending on where you live!
You'll have to get a library that can read the format (or create one yourself).
Here are some of the Microsoft Office binary format specifications.
I believe it would only be possible through some nasty reverse engineering. It would be very useful to have access to an application that uses the format in question, so that you can generate a few simple files and compare them in a hex editor. You won't get far with this method, but you might be able to figure out the header.
It would also be useful to study some mechanisms used in binary formats, such as encryption and compression. If you're talking about the Visual CertExam file format, then it is likely that the useful data is strongly encrypted.
My 2 cents:
Start by reversing the applications that read the files. Android applications are particularly helpful, as the resulting Java source is easier to read (you might want to try A+ vce reader for Android, for example). That program indicates that vce uses/embeds SQLite in the file (in line with what is hinted here: Reverse Engineer a File Format).
Where to go from here? You might want to explore SQLite file carving tools to see if there is a way to programmatically identify the patterns in the file. Good luck!
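As a first check before reaching for a carving tool, you can simply search the file for the SQLite header. Every SQLite 3 database starts with the 16 bytes "SQLite format 3" followed by a zero byte, and the page size is stored right after it as a 2-byte big-endian number. The Python sketch below is only that, a sketch: the function names are hypothetical, and if the embedded database is encrypted or compressed the search will find nothing.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"


def find_sqlite_headers(data: bytes) -> list:
    """Return every offset at which the SQLite 3 header string occurs."""
    offsets = []
    pos = data.find(SQLITE_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SQLITE_MAGIC, pos + 1)
    return offsets


def page_size(data: bytes, header_offset: int) -> int:
    """Read the page size stored 16 bytes into a SQLite header."""
    raw = int.from_bytes(data[header_offset + 16:header_offset + 18], "big")
    # The file format uses the value 1 to mean a page size of 65536.
    return 65536 if raw == 1 else raw
```

If find_sqlite_headers returns an offset, copying the bytes from that offset onward into a new file and opening it with the sqlite3 command-line tool is a quick way to see whether it is a usable database.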