Text parser and output converter - language-agnostic

I want to be more efficient and save some time when coding. Here is an idea for which I don't know of any existing solution:
(Note: I am a beginner and I am open to any programming language you suggest.)
Let's assume we have some text data in which special characters mark the beginning and the end of each keyword. First I need to parse the text data, and then insert the keywords into another text file.
For example like this:
I have a certain text
$method1$
§text1§
$method2$
§text2§
The text between the $...$ characters (here method1 and method2) and the text between §...§ (here text1 and text2) would be found by the program and then inserted into a template:
method1() { print.text1};
method2() { print.text2};
Does such a program already exist?
If not, I really have no idea how to approach making one. I appreciate every hint and any help.

I believe you can easily build this yourself in almost any programming language.
I prefer Ruby for this, but Perl is also great for parsing this type of thing. It would help if you could give us a sample of the actual file you will be parsing. Whichever language you choose, you can google "regular expressions" plus the name of the language to figure out how to do it.
If you chose Ruby, you could do something like this (each match is a [method, text] pair):
text.scan(/\$(\w+)\$\s*§(\w+)§/)
Ruby reference:
Parsing text in Ruby
Parsing strings and regular expressions in Perl (good tutorial)
http://perldoc.perl.org/perlretut.html
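For comparison, here is a minimal sketch of the whole parse-and-fill step in Python. It assumes each $method$ marker is directly followed by its §text§ marker, as in the question's example; the names PAIR and render are made up for illustration, and the template string is copied from the desired output in the question.

```python
import re

# One $name$ marker followed (possibly on the next line) by one §name§ marker.
PAIR = re.compile(r"\$(\w+)\$\s*§(\w+)§")

def render(text, template="{method}() {{ print.{text}}};"):
    """Return one filled-in template line per ($method$, §text§) pair found."""
    lines = []
    for method, label in PAIR.findall(text):
        lines.append(template.format(method=method, text=label))
    return "\n".join(lines)

if __name__ == "__main__":
    sample = "I have a certain text\n$method1$\n§text1§\n$method2$\n§text2§\n"
    print(render(sample))
```

Writing the result to another file is then a matter of opening that file for writing and passing it the string returned by render.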

Related

Rails, HTML to JSON?

Given a static HTML page, is there an automated way to generate json?
For a large website that contains a lot of static HTML I am wanting to generate json for RSS feeds and search functionality and am looking for a way to convert HTML to json.
I could obviously write JSON templates for every page and every language, but that would be unmaintainable. It would double an 800-page website to 1600 pages, and that is not an option.
One approach I thought of is to write a bot that loops through the routes, indexes the pages and saves the data to a database. That would give me every search option I could wish for, such as Solr, Elasticsearch, Thinking Sphinx, etc.
I could use Capybara to help with this by visiting each path and extracting text to save to a database, run from a rake task as a background job. I am not sure how that would work in a production environment, though, and this seems like such a common requirement that it might already have been solved, but for the life of me I can't find a solution.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON.
Any ideas? Has this already been done? Are there any gems that might help? Or is there built-in functionality that I have not thought of, maybe a way to get HTML into a hash that could then be converted into JSON? Whatever the approach, it needs to be automated. I'm just stuck on finding the best approach.
Basically, HTML looks a lot like XML, but with strong tag meanings, so you could use an XML-to-JSON conversion, since it all ends up as a tree of HTML tags nested inside each other.
So your question becomes a question about converting XML to JSON, except that you might run into problems with single tags that have no closing tag. You could find all of these and add a closing tag after each one before trying to load the document as a hash from XML. By the way, for parsing text data in general you should look at regular expressions.
I chose to go with a Nokogiri solution in the end and wrote a parser to meet my needs.
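The parser that was written with Nokogiri is not shown in the thread. As a rough illustration of the same idea (walk the HTML, keep the text, emit JSON), here is a sketch using only Python's standard library. The class name TextExtractor, the function html_to_json and the output keys "title" and "text" are assumptions made for this example, not part of any existing tool.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script and style."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags
        # such as <br>, which HTML allows but XML does not.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIP.intersection(self._stack):
            return
        if "title" in self._stack:
            self.title += text
        else:
            self.chunks.append(text)

def html_to_json(markup):
    """Return a JSON string with the page title and its text content."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return json.dumps({"title": parser.title, "text": " ".join(parser.chunks)})
```

A background job could call html_to_json on each rendered page and hand the result to whichever search index is in use.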

Word formatting to equivalent HTML

Sorry for not being clear the first time. Here is what I need.
I am trying to write a VBA script to convert simple Word text formatting to HTML.
Now I know that Word can already convert documents to HTML, but it adds far too much junk code for the end result to be of use to me.
Basically all I need is a very, very simple text formatting conversion. I have several Word documents that I need to upload to my website, and the only text formatting in my documents is "Bold", "Underline" and "Italics".
I simply want a VBA script that will run through the document and convert all text (words or sentences) that has this formatting to HTML.
for example
The cat was sleepy .... changed to .... The cat was sleepy
The cat was sleepy .... changed to .... The cat was sleepy
I wish to save the end result as a plain text file.
P.S. I am a novice at VBA programming.
I would want to do this in MS Word 2007.
The topic starter apparently dislikes the number of needless extra tags that Microsoft's own converter always inserts. To write a macro, you just need to learn the appropriate features of Word's Document Object Model (DOM). The DOM is described in the built-in help system, which can be invoked from within the VBA editor that comes with MS Office. But this task is more difficult than it looks at first glance, because you will inevitably run into some of MS Word's oddities when processing certain documents. So a better solution would be to use a third-party tool for this task.
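The core of such a conversion is small once the document is viewed as a sequence of runs, each with its own bold, italic and underline flags. The sketch below shows that logic in Python rather than VBA, on a made-up Run type; it does not read Word files, so extracting the runs from the document is still up to the macro or tool you use.

```python
import html
from dataclasses import dataclass

@dataclass
class Run:
    """Hypothetical stand-in for one stretch of uniformly formatted text."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False

def runs_to_html(runs):
    """Wrap each run in <b>, <i> and <u> tags according to its flags."""
    pieces = []
    for run in runs:
        piece = html.escape(run.text)  # keep literal <, > and & intact
        if run.underline:
            piece = "<u>{}</u>".format(piece)
        if run.italic:
            piece = "<i>{}</i>".format(piece)
        if run.bold:
            piece = "<b>{}</b>".format(piece)
        pieces.append(piece)
    return "".join(pieces)
```

The returned string contains only these three kinds of tags, so it can be saved directly as a plain text file.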

HTML entities to Hex equivalent

I have legacy XML files with HTML entities such as &mdash; etc. How can I convert these entities to their hex equivalents, such as &#x2014;? Is there any easy way to do this using a batch command or something else? I am not a high-level programmer, so any detailed help will be appreciated.
A simple find and replace may be all you need. Most, if not all, text/code editors have a find/replace function.
Chances are that only a few character strings make up the majority of what you need to replace and, fortunately, they're all pretty unique, so it's unlikely that you'll have any accidental replacements.
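If there turn out to be too many different entities for manual find and replace, the same replacement can be scripted. This is a minimal sketch in Python that relies on the standard library's table of HTML entity names; the function name entities_to_hex is made up, and leaving the five entities predefined by XML (amp, lt, gt, quot, apos) untouched is an assumption about what the files need.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself defines; these are left as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def entities_to_hex(text):
    """Replace named HTML entities with hexadecimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or XML-native: keep unchanged
        return "&#x{:X};".format(name2codepoint[name])
    return _ENTITY.sub(replace, text)
```

To process a file, read its contents into a string, pass it through entities_to_hex and write the result back out, ideally to a copy until the output has been checked.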

Which technology should I use to transform my latex documents into html documents

I want to write a little program that transforms my TeX files into HTML. I want to parse the documents and turn the macros (the built-in ones and of course my own) into HTML pieces. Here are my requirements:
predefined rules (e.g. \begin{itemize} \item text \end{itemize} => <br> <p>text </p> <br/>)
defining my own CSS styles
ability to convert formulas (extract the formulas, load them into an image creator and then save the jpg/png)
easy to maintain and concise
I know there are several technologies out there, but I don't know exactly which is best for me. Here are the technologies that come to mind:
Ruby (I/O is easy, formula loading via webrat),
XML/XSLT (I think this is just overhead that I don't need)
Perl (there are many libs out there, but I'm not very familiar with it)
bash (I worked with sed and was surprised how easy it was to work with regular expressions)
latex2html ... (these converters won't work for me and they don't give me any freedom in parsing)
Any suggestions, hints and comments are welcome.
Thanks for your time, folks.
Have a look at pandoc. It can also be installed on Linux or OS X, though it won't handle your custom macros. The only thing I've seen that can do a decent job with custom macros is tex4ht, but for it to really work well you need to be producing .DVI files. If you have a ton of custom macros, writing your own converter is going to take an awful lot of time. Even if you only have a few custom macros, it's still going to be a pain. Good luck!
Six: TeX
Seven: Haskell
(I gave up trying to persuade SO to start numbering my list from 6).
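To give a feel for what a single "predefined rule" from the question involves, here is a small Python sketch that rewrites itemize environments. It maps them to <ul>/<li> rather than the <br>/<p> output in the question, which is a choice made for this example, and it handles neither nested lists nor macros inside items. The function name itemize_to_html is hypothetical.

```python
import html
import re

# Matches one non-nested itemize environment, including line breaks.
_ITEMIZE = re.compile(r"\\begin\{itemize\}(.*?)\\end\{itemize\}", re.DOTALL)

def itemize_to_html(tex):
    """Rewrite each itemize environment in tex as an HTML unordered list."""
    def convert_block(match):
        body = match.group(1)
        # Everything before the first \item is dropped; each later part
        # becomes one list entry.
        items = [part.strip() for part in body.split(r"\item")[1:]]
        entries = "".join("<li>{}</li>".format(html.escape(item)) for item in items)
        return "<ul>{}</ul>".format(entries)
    return _ITEMIZE.sub(convert_block, tex)
```

Every further macro needs a rule of its own along these lines, which is why the effort grows quickly with the number of custom macros.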

storing code snippets in a database

I want to make a code snippet database web application. Would the best way to store the snippets in the database be to HTML-encode everything, to prevent XSS when displaying them on the web page?
Thanks for the help!
The database has nothing to do with this; you simply need to escape the snippets when they are rendered as HTML.
At minimum, you need to encode all & characters as &amp; and all < characters as &lt;.
However, your server-side language already has a built-in HTML encoding function; you should use it instead of re-inventing the wheel. For more details, please tell us what language your server-side code is in.
Based on your previous questions, I assume you're using PHP.
If so, you're looking for the htmlspecialchars or htmlentities functions.
You would either have to escape it when you store it, or escape it when you display it. It'd probably be better to do it on display so that if you need to edit it later on, you don't have to decode it then re-encode it.
Also, you'll want to make sure you escape it properly when you store it in the database, otherwise you'd be leaving yourself open to SQL injection. Parameterized statements would be the best method, you shouldn't have to change the raw data at all.
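The answers above assume PHP, but the same two steps look much alike in most languages. As an illustration only, here is a sketch in Python with an in-memory SQLite database: the snippet is stored unmodified through a parameterized statement and escaped only when it is rendered. The table layout and the function names save_snippet and render_snippet are made up for this example.

```python
import html
import sqlite3

def save_snippet(conn, code):
    """Store the raw snippet; the ? placeholder prevents SQL injection."""
    cursor = conn.execute("INSERT INTO snippets (code) VALUES (?)", (code,))
    conn.commit()
    return cursor.lastrowid

def load_snippet(conn, snippet_id):
    """Return the snippet exactly as it was stored."""
    row = conn.execute(
        "SELECT code FROM snippets WHERE id = ?", (snippet_id,)
    ).fetchone()
    return row[0]

def render_snippet(conn, snippet_id):
    """Escape the snippet at display time, just before it goes into HTML."""
    escaped = html.escape(load_snippet(conn, snippet_id))
    return "<pre><code>{}</code></pre>".format(escaped)

# Hypothetical schema for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snippets (id INTEGER PRIMARY KEY, code TEXT)")
```

Because the stored text is never altered, editing a snippet later needs no decode and re-encode round trip.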
The best thing to do is to not store it in the database. I have seen people store stored procedures in databases as a row. Just because you can doesn't mean you should.
It doesn't matter how you store it; what matters is how you render it in the HTML representation. I'd guess you'll need to do some sort of sanitization before rendering the bytes. Another option might be to convert every character to an HTML entity; this might suffice to prevent any code or tags from actually being interpreted.
As an example, view the source of a Stack Overflow page with some example code, and see how they're representing the code in the HTML.