Definition of a text file

Definition of a text file - binary

I'm not really a professional programmer (I just do some number crunching); I'm just trying to learn more about computing.
I'm here to ask for a reference for reading about the basic aspects of a 'file'. I'm having difficulty understanding the difference between text files and binary files. With my current understanding, an image file is no more 'binary' than a text file. I'd like to understand what makes a file a text file. Is it a special sequence of bits?
Please, I just need a good reading reference (although some clarification would be welcome); I'm not trying to ask a vague, generic question.
Preferably, I'd like to be pointed to technical reading containing definitions such as "a text file is a sequence of bits whose etc..."
Thanks,
Seneika.
BTW: what one finds on Wikipedia, for example, is not what I want.
Edit: horrible grammar mistake corrected...

A text file is a computer file that stores a typed document as a
series of alphanumeric characters, usually without visual formatting
information. The content may be a personal note or list, a journal or
newspaper article, a book, or any other text that can be rendered
accurately in typewritten form. Text files are similar to word
processing files in that the content of both is primarily textual;
they differ in that text files usually do not record information such
as character style and size, pagination, or other details that would
specify the appearance of a finished document. Some computer operating
systems make a basic distinction between a text file, which is
intended to be translated directly into human-readable text, and a
binary file, which is interpreted directly by the computer.
Source : Binary File & Text File
More details on WikiPedia

This should help you in your quest to answer your question :-)
http://www.wisegeek.com/what-is-a-binary-file.htm
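
To make the point from the question concrete: at the storage level both kinds of file are just bytes, and "text" is a convention about how those bytes decode. The following is only a rough heuristic sketch, not a formal definition; the function name looks_like_text, the choice of UTF-8, and the NUL-byte check are all assumptions made for illustration.

```python
def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Heuristic only: no NUL bytes, and the bytes decode cleanly."""
    if b"\x00" in data:
        return False
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The same kind of storage (bytes), two different conventions.
text_bytes = "Hello, world\n".encode("utf-8")
# PNG signature followed by the first four bytes of a chunk length.
png_header = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A,
                    0x00, 0x00, 0x00, 0x0D])

print(looks_like_text(text_bytes))  # True
print(looks_like_text(png_header))  # False
```

Nothing in the bytes themselves marks a file as text; the check only asks whether the bytes happen to be valid under one chosen encoding.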

Related

Is it safe to use numbers in your web page file names?

Someone recently told me that using numbers in web page file names is not good practice. For example, say I was making a website about Samara Morgan and I had a file named 7days.html - would it be bad to start the file name with a number? Is it riskier than having numbers put later in the file name (e.g. day7.html)?
I'm just a tad confused on whether it's generally discouraged to use numbers in file names or not.
EDIT: After asking them to explain a bit more, this is what they said to me:
.... the simplest way I can explain it is that certain programming
languages and operating systems might be confused by putting the
number as the first character. In other words, it has a higher
potential for error, so it's not recommended. That being said, it IS
acceptable to use a number AFTER the first character. By the way, a
domain name (like 4chan.org) is a little different because it's not a
file.
Here are some more tips/best practices (you'll see it as #3):
https://ed.fnal.gov/lincon/tech_web_naming.shtml

I think you need to go back to this someone and ask them for more information - are they saying there's a security problem? a usability problem because of something users might want to do with it? a Search Engine Optimisation trick you're missing that would make it easier for people to find?
I can't actually think of why numbers in URLs would matter for any of these, however. It seems most likely they were thinking of SEO, because that's a constant battle between search engines (who want users to get the results they want) and publishers (who want to get their brand higher up the results) and full of half-understood experiments and dodgy advice.
It's also worth noting that URLs don't exactly have "filenames" at all - they're just a string that the browser sends to the server, and the server may or may not map to a file on disk. Look at the URL of this page, for instance - it contains enough information for the server to look up the right question in a database, plus some human-readable text which is mostly for SEO.
Your server has filenames, of course, but I can't think of any reason why having numbers in those would be a problem, let alone why it would apply particularly to web pages.
Edit based on additional information supplied:
Two things I notice about the link you've added: one, it's twenty years old; two, it includes detailed reasoning for every single point, except point 3. I can't think of any "programming languages and operating systems" that would have a problem with a leading digit. It's actually quite common in some (non-web) contexts, as a way of forcing files to be listed or run in the desired order (e.g. 01-contents.txt, 02-introduction.txt, etc).
I can imagine problems if you began the filename with a ., -, or _, because sometimes there are entrenched conventions that those are hidden, or backups, etc. Either the advice made sense 20 years ago, or the author was being overly conservative to keep the rule simple.
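
The ordering trick with leading digits can be sketched as follows. The file names are made up for illustration; the point is only that plain string sorting puts zero-padded numeric prefixes in the intended order, while unpadded ones sort character by character.

```python
# Hypothetical file names, with and without zero padding.
unpadded = ["10-appendix.txt", "2-introduction.txt", "1-contents.txt"]
padded = ["10-appendix.txt", "02-introduction.txt", "01-contents.txt"]

# Plain string sort compares character by character.
print(sorted(unpadded))
# ['1-contents.txt', '10-appendix.txt', '2-introduction.txt']

print(sorted(padded))
# ['01-contents.txt', '02-introduction.txt', '10-appendix.txt']
```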

To be precise: your question is whether it is permissible or appropriate to begin the name of a file with one or more numeric characters. According to the file-naming conventions of the main operating systems, this type of naming is allowed and does not present any problem of interpretation.
Windows: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
Linux: https://www.cyberciti.biz/faq/linuxunix-rules-for-naming-file-and-directory-names/
The situation is slightly different for programming languages. The most common case is C/C++, where identifiers such as variable names cannot begin with a digit, so names that are entirely numeric or that start with a numeric character are rejected by the compiler.
(See this SO question for C/C++ variable naming samples and problems: Is it safe to use numbers in your web page file names?)
Therefore, in your case, which concerns file names, the limitations you were told about do not apply.
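
A small sketch of the difference between identifiers and file names. Python is used here as a stand-in: like C and C++, it does not allow an identifier to start with a digit, and str.isidentifier() reports that rule. The file name 7days.html comes from the question; the temporary directory is only there so the example cleans up after itself.

```python
import os
import tempfile

# Identifier rules belong to the programming language ...
for name in ("7days", "day7"):
    print(name, name.isidentifier())
# 7days False
# day7 True

# ... while the file system accepts a leading digit without complaint.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "7days.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<!doctype html><title>7 days</title>")
    created = os.path.exists(path)

print(created)  # True
```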

No, just keep it like that; it doesn't affect anything.

Pros and Cons of using HTML Codes vs Special Characters

When building websites for non-English-speaking countries
you have tons of characters that fall outside the basic ASCII range.
For the database I usually encode it as either utf-8 or latin-1.
I would like to know if there is any issue with performance, speed resolution, space optimization, etc.
For the fixed text in the HTML, the choice is between using, for example,
&aacute; or á
which look exactly the same when rendered: á or á
The things that I have so far for using it with utf-8:
Pros:
Easy to read for the developers and the web administrator
Takes up only one character in the code instead of 4-5
Easier to extract an excerpt from a text
1 byte against 8 bytes (according to my testings)
Cons:
When sending files to other developers, depending on the IDE, software, etc. that they use to read the code, the accents may break into things like: Ã©
Automatic minification of the code sometimes breaks it too
Usually breaks when it is inside an encoding
From my perspective the cons carry more weight than the pros, because they affect the visitor.
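
The size comparison and the broken accent from the lists above can be reproduced in a few lines. This is a sketch using only the Python standard library; the characters are written as \u escapes so the example does not depend on the encoding of the file it is pasted into.

```python
import html

literal = "\u00e1"   # the character á
entity = "&aacute;"  # the named HTML entity for the same character

print(len(literal.encode("latin-1")))    # 1 byte in Latin-1
print(len(literal.encode("utf-8")))      # 2 bytes in UTF-8
print(len(entity.encode("utf-8")))       # 8 bytes as an entity
print(html.unescape(entity) == literal)  # True: both denote the same character

# The broken accent appears when UTF-8 bytes are read as Latin-1.
mojibake = "\u00e9".encode("utf-8").decode("latin-1")
print(mojibake)  # Ã©
```

So the 1 byte figure holds for Latin-1; in UTF-8 the literal takes 2 bytes, still well under the 8 of the entity.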

Just use the actual character á.
This is for many reasons.
First: separation of concerns. The database shouldn't know about HTML. Just imagine if at a later date you want to create an API to use it in another service or a mobile app.
Second: just use UTF-8 for your database, not Latin-1. Again, think ahead: what if your app suddenly needs to support Japanese? How would you store あ?
You always have the chance to convert it to HTML codes if you really have to... in a view. HTML is an implementation detail, not core to your app.
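
A minimal sketch of what "convert it in a view" could look like, assuming the stored value is an ordinary Unicode string. The helper name to_entities is hypothetical; codepoint2name is the standard library's table of named HTML entities. Note that this only rewrites non-ASCII characters and is not a substitute for escaping <, > and &.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with HTML entities at output time."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity if one exists
        else:
            out.append(f"&#{cp};")               # numeric reference otherwise
    return "".join(out)


stored = "caf\u00e9 \u3042"  # what the database holds: real characters
print(to_entities(stored))   # caf&eacute; &#12354;
```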
If your concern is the user, all major browsers in this day and age support UTF-8. Just use the right meta tag. Easy.
If your problem is developers and their tools, take a look at http://editorconfig.org/ to enforce and automate line endings and the usage of UTF-8 in your files.
Maybe add some git attributes to the mix, and why not go the extra mile and have a git pre-commit hook running some checker to make super sure everyone commits UTF-8 files.
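
The "checker" could be as small as the sketch below. It is an assumption about what such a hook might run, not a standard tool: the function names are made up, and how the hook collects the list of staged files is left out.

```python
import tempfile
from pathlib import Path


def non_utf8_files(paths):
    """Return the paths whose contents do not decode as UTF-8."""
    bad = []
    for p in paths:
        try:
            Path(p).read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(str(p))
    return bad


def main(argv):
    """Print offenders and return a shell-style exit code (0 means OK)."""
    bad = non_utf8_files(argv)
    for name in bad:
        print(f"not valid UTF-8: {name}")
    return 1 if bad else 0


# Demo with two throwaway files: one saved as UTF-8, one as Latin-1.
with tempfile.TemporaryDirectory() as tmp:
    good_file = Path(tmp, "good.html")
    bad_file = Path(tmp, "bad.html")
    good_file.write_bytes("caf\u00e9".encode("utf-8"))
    bad_file.write_bytes("caf\u00e9".encode("latin-1"))
    exit_code = main([good_file, bad_file])

print(exit_code)  # 1, because bad.html is not valid UTF-8
```

A hook script would pass the file names it receives to main and exit with the returned code, so a non-zero result blocks the commit.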
Computer time is cheap, developer time is expensive: á is easier to change and understand, just use it.

Can you recognize this file format as parseable?

I'm a technologist at a news company and inherited systems that never migrated our old stories out of the first CMS iteration. I'd like to do so, but the cost of software support is many thousands of dollars. No matter: they are plaintext-ish and it appears I can extract the contents.
Can anyone recognize this file format as parseable for programmatic extraction?
Here are the original files, per @Jongware's help in the comments:
https://github.com/mcnaughton/mystery/blob/master/1?raw=true
https://github.com/mcnaughton/mystery/blob/master/2?raw=true
https://github.com/mcnaughton/mystery/blob/master/3?raw=true
Here are the hex dumps:
https://raw.githubusercontent.com/mcnaughton/mystery/master/hexdumps/1
https://raw.githubusercontent.com/mcnaughton/mystery/master/hexdumps/2
https://raw.githubusercontent.com/mcnaughton/mystery/master/hexdumps/3
Here is an example of a useless ASCII version: https://gist.github.com/buley/3acd74d2bf418520c309
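
As a first pass at files that are "plaintext-ish", one option is to pull out runs of printable ASCII and see how much of a story survives. This is a generic sketch, similar in spirit to the Unix strings utility; the function name and the sample bytes are invented for illustration and are not taken from the files linked above.

```python
import re


def printable_runs(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Invented sample: text fragments separated by control bytes.
sample = b"\x00\x01HEADLINE\x02\x03City council votes\x00ab\x04"
print(printable_runs(sample))  # ['HEADLINE', 'City council votes']
```

If the control bytes between the runs turn out to follow a pattern, they are probably field markers, which is what a real parser would key on.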

Resources on generating equivalent phrases (same language translation)?

I'm interested in building a program that takes some text (an article, for example) and then generates a new text with equivalent meaning, but I'm not sure how to get started on such a problem.
Can anyone recommend some code/books/papers/techniques that would help me tackle this?

Did you check this one:
fastsubs: a program to generate most likely substitutes for words in a given text based on an n-gram language model.
More info is available at:
http://denizyuret.blogspot.be/2012/05/fastsubs-efficient-admissible-algorithm.html

What data storage model is used to store articles in Wikipedia?

Articles in Wikipedia get edited. They can grow, shrink, be updated, etc. What file system or database storage layout is used underneath to support this? In a database course I read a bit about variable-length records, but that seemed to be more for small strings and not for whole documents. In a file system, files can grow and shrink, and I think it's done by chaining blocks together: each time we update a file, not the whole file is rewritten. Perhaps something similar is done here.
I am looking for specific names and terminology, maybe even how the schema in MySQL is defined. (I think Wikipedia uses MySQL.)
Below are links to some write-ups on Wikipedia's architecture, but I have not been able to answer my question from these:
http://swe.web.cs.unibo.it/twiki/pub/WikiFactory/AntonelloDiMuroThesis/Wikipedia-cheapandexplosivescalingwithLAMP.pdf
http://dom.as/uc/workbook2007.pdf
Thanks,

See:
http://www.mediawiki.org/wiki/Manual:Database_layout
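
For orientation before reading that page: as far as I know, MediaWiki does not rewrite an article in place. Roughly, a page row points at its latest revision, and each edit adds a new revision with its own full copy of the text. The sketch below imitates that idea in SQLite; the table and column names are simplified for illustration and are not the real MediaWiki schema, which the manual page above documents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id    INTEGER PRIMARY KEY,
    title      TEXT UNIQUE NOT NULL,
    latest_rev INTEGER              -- points at the current revision
);
CREATE TABLE revision (
    rev_id  INTEGER PRIMARY KEY,
    page_id INTEGER NOT NULL REFERENCES page(page_id),
    body    TEXT NOT NULL            -- full text of this revision
);
""")


def save_edit(conn, title, body):
    """Store an edit as a new revision; old revisions are never changed."""
    conn.execute("INSERT OR IGNORE INTO page (title) VALUES (?)", (title,))
    page_id = conn.execute(
        "SELECT page_id FROM page WHERE title = ?", (title,)
    ).fetchone()[0]
    rev_id = conn.execute(
        "INSERT INTO revision (page_id, body) VALUES (?, ?)", (page_id, body)
    ).lastrowid
    conn.execute(
        "UPDATE page SET latest_rev = ? WHERE page_id = ?", (rev_id, page_id)
    )
    return rev_id


def current_text(conn, title):
    """Fetch the text of the latest revision of a page, or None."""
    row = conn.execute(
        "SELECT r.body FROM page p JOIN revision r ON r.rev_id = p.latest_rev "
        "WHERE p.title = ?",
        (title,),
    ).fetchone()
    return row[0] if row else None


save_edit(conn, "Sandbox", "first draft")
save_edit(conn, "Sandbox", "first draft, expanded")
print(current_text(conn, "Sandbox"))  # first draft, expanded
```

The design trades disk space for simplicity: history, diffs and reverts all fall out of keeping every revision, and only the small page row is ever updated.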
