Word formatting to equivalent HTML


Sorry for not being clear the first time. Here is what I need.
I am trying to write a VBA script to convert simple Word text formatting to HTML.
Now I know that Word can already convert documents to HTML, but it adds far too much junk code for the end result to be of use to me.
Basically, all I need is a very simple text formatting conversion. I have several Word documents I need to upload to my website; the only text formatting in my documents is "Bold", "Underline" and "Italics".
I simply want a VBA script that will run through the document and convert all text (words or sentences) that has this formatting to HTML.
For example:
The cat was sleepy .... changed to .... The cat was sleepy
The cat was sleepy .... changed to .... The cat was sleepy
I wish to save the end result as a plain text file.
P.S. I am a novice at VBA programming.
I want to do this in MS Word 2007.

The original poster apparently objects to the needless extra tags that Microsoft's own converter always inserts. To write such a macro, you just need to learn the relevant features of the Document Object Model (DOM). The DOM is described in the built-in help system, which can be invoked from within the VBA editor that comes with MS Office. But this task is more difficult than it looks at first glance, because you will inevitably run into some of MS Word's quirks when processing certain documents. So a better solution may be to use a third-party tool for this task.
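For readers who are not tied to VBA, here is a minimal sketch of the same conversion done outside Word. It assumes the document is saved as .docx (the Word 2007 format), which is a zip archive whose main part, word/document.xml, stores each run of text together with its bold, italic and underline flags. The function names are made up for illustration, and only those three properties are handled; styles, tables and everything else are ignored.

```python
import html
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# (run property element, HTML tag) pairs handled by this sketch
TAGS = (("b", "b"), ("i", "i"), ("u", "u"))


def _is_on(prop):
    """A property such as <w:b/> counts as on unless w:val switches it off."""
    if prop is None:
        return False
    return prop.get(W + "val", "true") not in ("0", "false", "none")


def run_to_html(run):
    """Convert one <w:r> run to escaped text wrapped in b/i/u tags."""
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    out = html.escape(text)
    props = run.find(W + "rPr")
    if props is None or not text:
        return out
    for word_name, html_tag in TAGS:
        if _is_on(props.find(W + word_name)):
            out = f"<{html_tag}>{out}</{html_tag}>"
    return out


def document_xml_to_html(xml_data):
    """Turn the contents of word/document.xml into one <p> per paragraph."""
    root = ET.fromstring(xml_data)
    paragraphs = []
    for para in root.iter(W + "p"):
        body = "".join(run_to_html(r) for r in para.iter(W + "r"))
        paragraphs.append(f"<p>{body}</p>")
    return "\n".join(paragraphs)


def docx_to_html(path):
    """Read a .docx file and return the simplified HTML as a string."""
    with zipfile.ZipFile(path) as archive:
        return document_xml_to_html(archive.read("word/document.xml"))
```

Word often splits a single bold phrase into several adjacent runs, so the output may contain sequences like `<b>The </b><b>cat</b>`. Merging those is one of the quirks a complete solution would have to deal with.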

Related

Merge JSON data inside dynamic PDF templates

My problem is rather simple:
I need a tool to merge some medium complex JSON input data inside a template PDF.
Then based on the data:
Some sections of the template could be replicated.
Some sections may be deleted and the gap created should disappear.
Tables could be filled by N elements without messing the formatting.
Tables could have merged cells.
Also, templates should be easily adjustable without rewriting the code, which means they could be editable PDFs, Word files, spreadsheets, some tool's own templates... (HTML?)
In all the tools I tried (a lot! :(( ), one or more of the points above was always a nightmare.
So far I have tried:
LaTeX -> pdflatex -> PDF: this probably reduced my life expectancy by 1 to 1.5 years. Unfortunately it is the most powerful tool, because... LaTeX? Anyway, not maintainable at all.
pdfminer\pdfjs\whatever npm package: coding always ends up in a low-level mess or huge workarounds.
Google Sheets -> PDF: the APIs are kind of hard, and anyway they are cell\row based, so it's difficult to manage dynamic sections.
pdfgeneratorapi.com: basically what I needed, with a rich editor in it, but formatting and aligning tables is buggy, so results are always ugly. Also, things like merged cells are not possible.
So the question is: is there a tool or package out there in 2022 capable of handling all these requirements at once?

I would recommend using LaTeX for your PDF generation requirements. LaTeX comes with a huge number of available packages and chances are that someone has already solved your issue in LaTeX.
Specifically, the DynamicDocs API by ADVICEment might be a good option for your requirements. DynamicDocs is a JSON to PDF API based on LaTeX. Here are some features:
Ready-made JSON to PDF templates (no need to understand LaTeX)
Ability to write your own template in LaTeX and merge JSON to PDF using the R language layer, making the templates and their content dynamic
Excel to PDF Add-in (currently in Beta) to generate PDFs based on data in Excel
Each account is given a FREE plan with a limited number of monthly API calls
Disclaimer: I am involved in developing DynamicDocs API.

Rails, HTML to JSON?

Given a static HTML page, is there an automated way to generate json?
For a large website that contains a lot of static HTML, I want to generate JSON for RSS feeds and search functionality, and I am looking for a way to convert HTML to JSON.
I could obviously write JSON templates for every page and every language, but that would be unmaintainable. It would double an 800-page website to 1600 pages, and that is not an option.
One approach I thought of is to write a bot that loops through the routes to index the pages and saves the data to a database, which would give me all the choices I could wish for when it comes to searching, such as Solr, Elasticsearch, Thinking Sphinx, etc.
I could use Capybara to help with this, visiting each path and extracting text to save to a database in a rake task run as a background job, but I am not sure how that would work in a production environment. It seems like such a common requirement that it might already have been solved, but for the life of me I can't find a solution.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON.
Any ideas? Has this already been done? Are there any gems that might help? Or is there built-in functionality that I have not thought of, maybe a way to get HTML into a hash that could then be converted into JSON? Whatever the approach, it needs to be automated. I'm just stuck on the best approach.

Basically, HTML looks a lot like XML, but with strong tag meanings, so you could use XML to JSON conversion if it all ends up as a tree of HTML tags nested in each other.
And so your question becomes this question. Except you might get problems with single tags that have no closing tag. So you might find all of these and put a closing tag after each one before trying to load the XML into a hash. Oh, early answer. By the way, in general for parsing text data you should look at regular expressions.
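The tree-of-tags idea can be sketched without any XML conversion step. The example below is in Python rather than Ruby and uses only the standard library's html.parser. It builds a nested dictionary of tags, attributes and text, and it handles single tags such as br by not waiting for a closing tag. The class and function names are illustrative, and the output shape (tag, attrs, children) is just one possible choice.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Collect parsed HTML into nested dicts that json.dumps can serialize."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"tag": "document", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)  # later content nests under this node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closers
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth]["tag"] == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(text)


def html_to_json(markup):
    """Return a JSON string describing the tag tree of the given HTML."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return json.dumps(builder.root, indent=2)
```

Calling `html_to_json(page_source)` on each rendered page would give a structure that can be filtered down to headings and paragraphs before it is sent to a search index or a feed.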

I chose to go with a Nokogiri solution in the end and wrote a parser to meet my needs.

Text parser and output converter

I want to be more efficient and save some time when coding. Here is an idea that I do not know how to implement:
(Note: I am a beginner and I am open to any programming language you suggest.)
Let's assume we have some text data. I have special characters at the beginning and at the end of each keyword. First I need to parse the text data and then insert the keywords into another text file.
For example, like this:
I have a certain text
$method1$
§text1§
$method2$
§text2§
The text between the $...$ characters (here method1 and method2) and the text between §...§ (here text1 and text2) would be found by the program and then inserted into a template:
method1() { print.text1};
method2() { print.text2};
Does such a program already exist?
If not, I really have no idea how to approach making one. I appreciate every hint and any help.

You can easily make this with a programming language I believe.
Really, you can use any language you like. I prefer Ruby to do this, but Perl is also great for parsing this type of thing. It would be great if you could give us a sample of the actual file you will be parsing. Really, programming language choice is up to you, and whichever you choose you can google "regular expressions" and the name of the language to figure out how to do it.
If you use Ruby, you could do something like this:
text.scan(/\$(.+?)\$\s*§(.+?)§/m)  # each match is a [method, text] pair
Ruby reference:
Parsing text in Ruby
Parsing strings and regular expressions in Perl (good tutorial)
http://perldoc.perl.org/perlretut.html
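As a rough sketch of the whole round trip, from marked-up text to the filled template, here is a Python version using only the standard re module. It assumes every $name$ marker is followed by exactly one §text§ block, as in the sample in the question. The names PAIR, TEMPLATE and render are made up for this example.

```python
import re

# $name$ followed (after optional whitespace) by §text§; DOTALL lets text span lines
PAIR = re.compile(r"\$(.+?)\$\s*§(.+?)§", re.DOTALL)

# Output line for each pair; doubled braces are literal braces for str.format
TEMPLATE = "{name}() {{ print.{body}}};"


def render(source):
    """Find every $name$ / §text§ pair and fill the template once per pair."""
    lines = [TEMPLATE.format(name=name, body=body)
             for name, body in PAIR.findall(source)]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "I have a certain text\n$method1$\n§text1§\n$method2$\n§text2§\n"
    print(render(sample))
```

Writing the result to another file is then a single `open(path, "w").write(render(text))` call.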

Extracting data from PDF or Word using PHP, Java

I need help on this...
Especially since I don't know where to start..
I am an IT undergraduate and, along with my groupmates, am now undergoing on-the-job training in a company.
SCENARIO:
The company asked us to create a program that will generate a report and store it in a database.
The database that will be used is MySQL.
As for what language to use, we are considering VB.Net, Java, PHP.
The program must be able to :
generate a report that will be sent through email to an office
store in a database
collect all reports, collate those reports
generate a new report which will then be sent to their main office
then store it in their own database
For now,
we are still trying to determine how the program will run and what language will be used that has the capability of reading and extracting data from a text file (can either be a word document or a PDF file).
The company also wants the program to be online-ready for future expansion.
Now, our problem is
Is there a way to extract data from a PDF or Word file using Java, PHP, or VB and then store it in the MySQL DB?
If there is, can it be implemented without using any 3rd party software?
The reason we chose either a PDF or Word file type is that the file should be printable for archive purposes.
Which programming language would make it easiest to solve the problem above?
I would like to apologize if the info I am giving is a bit messed up. I will give additional information once we are able to talk with the company this week.
If there is a problem with the way I posted this, please forgive me. I am just trying my best to provide you with the information the best I could.

I'll answer for Java as it is what I use at work.
You can easily extract text from Word files or build a new Word file with Apache POI.
As for PDF, iText and PDFBox both do a pretty nice job.

Why can't you use 3rd party software? If you could, I would recommend something like "How to read PDF files using Java?".
Or, to read a .doc file: http://www.roseindia.net/tutorial/java/poi/readDocFile.html
Anyway, if you can't use 3rd party tools, why not read the specifications and figure out how to extract the text from PDF, DOC, and DOCX files?
Here you can find DOC specifications: http://msdn.microsoft.com/en-us/library/cc313118.aspx
Here you can find the PDF format specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
Good luck!
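To give an idea of what working from the format itself looks like, here is a small sketch in Python using only the standard library. It covers .docx only, which is a zip archive of XML parts; the older binary .doc format and PDF are much harder to read by hand. The sqlite3 module stands in for MySQL so the example stays self-contained, and the table and function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the plain text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def store_report(connection, name, paragraphs):
    """Insert one report row; sqlite3 is a stand-in for MySQL here."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS reports (name TEXT, body TEXT)"
    )
    connection.execute(
        "INSERT INTO reports (name, body) VALUES (?, ?)",
        (name, "\n".join(paragraphs)),
    )
    connection.commit()
```

With a MySQL driver the structure would stay the same, although the parameter placeholder style usually differs (many drivers use %s instead of ?).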

View the innards of a .ppt file?

I need to figure out what is going on inside a client's .ppt files. What is a good way to get started?
My eventual hope is to convert it to HTML. But if I just export the .ppt to HTML, I get a lot of images (as opposed to text), which is not a Good Thing.
EDIT: software that automatically converts .ppt to HTML would be terrific, provided that it preserves as much information as possible in text format. If that doesn't exist, the next best thing would be to understand the innards of the .ppt and write my own code to do a partial conversion.
EDIT: I used OfficeConvert as recommended by Michiel Leenaars. It got me text all right. My 50-page, 8MB test file turned into 40MB of text. The fact that I got text is good. The fact that the amount went way up is moving in the wrong direction. And there is an awful lot of repetition in there. The word "style" appeared 410815 times; the word "draw" appeared 351229 times.

I think a safe way would be to use OfficeConvert to automatically convert to ODF programmatically with Microsoft Office. Run it with /? to get help. There are some dependencies (see below).
Then use a good ODF library like lpod to look inside it.
You can view some interesting code examples here.
Dependencies:
Microsoft .NET Framework Version 2.0 Redistributable Package (x86)
Primary Interop Assemblies for Office 2007 or Office 2010 (whichever you are using).
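Once a presentation has been converted to ODF, it can also be inspected without a dedicated library. This is only a sketch, assuming the converter produced a regular zipped .odp file with a content.xml part. In Python, the standard library is enough to list the text found on each slide while skipping the style and drawing markup around it. The function name is made up for the example.

```python
import xml.etree.ElementTree as ET
import zipfile

# OpenDocument namespaces for drawing (slides) and text elements
DRAW = "{urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}"
TEXT = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def slide_texts(odp_file):
    """Map each slide name to the list of its non-empty text paragraphs."""
    with zipfile.ZipFile(odp_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    slides = {}
    for number, page in enumerate(root.iter(DRAW + "page"), start=1):
        # Fall back to a generated name when the slide has none
        name = page.get(DRAW + "name", f"page{number}")
        lines = ["".join(p.itertext()).strip() for p in page.iter(TEXT + "p")]
        slides[name] = [line for line in lines if line]
    return slides
```

The resulting dictionary keeps only slide names and text, which is a far smaller starting point for an HTML export than the full converted markup.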

I like the Aspose products. (I'm not associated with them other than as a customer.) I've used the PPT one specifically to write code that pokes around in the insides of a PPT. Overkill if you just want to convert it to HTML, but invaluable for the sorts of things I use it for.

If you know Java, Apache has the POI project, which lets you take a look at the innards of a PPT file. You could get all the info you want about the presentation (images, text) and then convert it to HTML however you like.
It's free too.
