Can someone help me strip down HTML code and populate different columns in excel?
For example, if my HTML code is:
<p></p>10-16-2013 22:35<br/>I love pizza! Ordering was a breeze!<p></p>10-16-2013 13:19:46<br />this has time stamps too!<p></p>10-21-2013 11:55<br />This is a test<br />
How can I output it as separate columns in Excel like this?
Column A               Column B
10-16-2013 22:35       I love pizza! Ordering was a breeze!
10-16-2013 13:19:46    this has time stamps too!
10-21-2013 11:55       This is a test
I would be extremely grateful if someone could help me out!
There are three different options you might try for parsing the html:
1. Combine InStr, Mid and/or Replace as mehow suggests.
2. Use VBScript's RegExp library. You would need to add it to your VBA project by clicking "Tools" ---> "References" and then checking the box next to "Microsoft VBScript Regular Expressions 5.5". Regular Expressions are a very powerful text parsing tool, but it does take some time to get used to the syntax. I found that this pattern allowed me to get the dates/comments as submatches: <p></p>([^<]*)<br />([^<]*). Note that your sample mixes <br/> and <br />, so replace the <br /> part with <br\s*/?> if you need to match both. I assume you are pulling that example out of a full webpage, so you would need to tweak that pattern to match exactly the parts of it that you are looking for. This site has a good tutorial on using the VBScript RegExp library.
3. Use a higher level HTML parser. I suggest the MSHTML library, which you can add to your VBA project by clicking "Tools" ---> "References" and then checking the box next to "Microsoft HTML Object Library". This parser is aware of constructs like HTML paragraphs, breaks and tables.
In my opinion, if you're willing to take the time to learn it, Regular Expressions would be your best bet. The InStr/Replace method may not be able to account for the variability in the webpage content and the HTML method would probably be overkill, especially given the lack of formatting in the example HTML.
Once you've parsed it, you can tackle the second part of the question using Excel Worksheet and Range objects. As mehow noted, if you can put together some code it will be easier to help you.
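To make the regular-expression idea concrete, here is a minimal sketch in Python rather than VBA (the pattern carries over to VBScript's RegExp unchanged). It assumes the fragment looks like the sample in the question; the names ENTRY_PATTERN, parse_entries and entries_to_csv are made up for illustration, and the CSV output is just one convenient way to get two columns that Excel can open.

```python
import csv
import io
import re

# Same idea as the pattern above, but <br\s*/?> accepts both <br/> and <br />,
# because the sample HTML mixes the two spellings.
ENTRY_PATTERN = re.compile(r"<p></p>([^<]*)<br\s*/?>([^<]*)")


def parse_entries(html):
    """Return a list of (timestamp, comment) pairs found in the fragment."""
    return [(stamp.strip(), comment.strip())
            for stamp, comment in ENTRY_PATTERN.findall(html)]


def entries_to_csv(entries):
    """Render the pairs as CSV text with one row per entry (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Column A", "Column B"])
    writer.writerows(entries)
    return buffer.getvalue()


SAMPLE = (
    "<p></p>10-16-2013 22:35<br/>I love pizza! Ordering was a breeze!"
    "<p></p>10-16-2013 13:19:46<br />this has time stamps too!"
    "<p></p>10-21-2013 11:55<br />This is a test<br />"
)

if __name__ == "__main__":
    print(entries_to_csv(parse_entries(SAMPLE)))
```

On a full webpage the pattern would still need tweaking, as noted above; a real HTML parser is the safer choice once the markup gets less regular.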
I have a large text file, which is an Italian-English dictionary. A typical line is:
Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles, and also to make fish to stirre. Also a kind of meate or custard in some parts of Italie made with milke and egges.
(Yes, it's a 17th-century dictionary.)
I'm looking for the best/easiest way to turn this into a searchable database.
The search would need to ignore the diacritics; with everything up to the first comma as the 'entry'. There are some cross-references, e.g.: Mefíte, as Mephíte.
My first thought is simply to turn it into HTML, with anchor tags for the word/phrase up to the first comma. That should be easy enough with a bit of Grep. I could also add links to the crossrefs in the same way (using BBEdit to confirm each change). It would then be easy to query just using a browser's search field.
However, ideally, I'd like something that returned only (all) the matching results. XML/HTML Tagging is the easy bit: the problem is the front-end to access/query it.
I'm on MacOS. (I'm also investigating Apple's Dictionary format...)
Any ideas on how to proceed would be welcome. Thanks.
This is a huge question. So many choices at so many areas.
A small start:
A searchable database: look at Apache Solr, https://solr.apache.org/
PHP to handle the front-end interaction with Solr and to serve your HTML search form and results.
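Before setting up a search server, it may help to see how small the core requirement is. The sketch below is not Solr; it is a plain in-memory Python index, assuming one entry per line with the headword being everything up to the first comma, as the question describes. The function names are made up for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks so that 'Mazzapícchio' compares equal to 'Mazzapicchio'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(headword):
    """Search key: no diacritics, case-insensitive, surrounding spaces removed."""
    return strip_diacritics(headword.strip()).casefold()


def build_index(lines):
    """Map each normalized headword to the full dictionary lines that start with it."""
    index = {}
    for line in lines:
        headword, comma, _definition = line.partition(",")
        if not comma:
            continue  # no comma, so no recognizable headword on this line
        index.setdefault(normalize(headword), []).append(line.strip())
    return index


def lookup(index, query):
    """Return all entries whose headword matches the query, ignoring diacritics."""
    return index.get(normalize(query), [])
```

A call such as lookup(build_index(open("dictionary.txt", encoding="utf-8")), "mefite") would return only the matching lines (the file name here is hypothetical). Prefix or full-text search is where a tool like Solr starts to earn its keep.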
Using the HP ALM REST API, we get the Memo fields embedded with HTML tags such as <html>, <span>, <body>, etc. Is there an option to suppress these tags?
Using the earlier OTA API, we had the option to use tdconnection.IgnoreHtmlFormat=True, which used to suppress these tags, but using REST API, I am unable to find an equivalent one. Any suggestions or should I build a parser myself after reading the output?
I personally don't know of a switch like that.
Alternatively you might try this:
How to Parse Only Text from HTML
On paper this seems quite nice.
This requires an extra step though. After getting the response you'd have to run it through the proposed library to get the flat text. Shouldn't be more than a line of code, I think.
The downside is that some things might go south because you dump any formatting stored as HTML. Usually that isn't much though. It depends on the project and the people, of course.
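As a rough idea of what that extra step could look like, here is a minimal Python sketch using only the standard library's html.parser. It is not an ALM feature and not the library from the linked question; the class and function names are made up, and the set of tags treated as line breaks is an assumption you may want to adjust.

```python
from html.parser import HTMLParser

# Tags that usually separate blocks of text; treated as whitespace (assumption).
BLOCK_TAGS = {"br", "p", "div", "li", "tr"}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return the plain text of a memo field with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling html_to_text on each Memo value after reading the REST response keeps the rest of the client code unchanged.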
So does anyone know if there is an mso-number-format to mimic the Accounting format in Excel (negatives are put in parens, zeros are dashes, all have 2 decimals and everything gets a dollar sign waaaay to the left)?
I have an HTML table that I am opening in Excel that I would love to have in this format.
I found the following one online, but it doesn't seem to work:
mso-number-format:\#\,\#\#0\.00_\)\;\[Black\]\\(\#\,\#\#0\.00\\)
Thanks
This is the string that I return from C# that gives a true Accounting format in Excel:
"\"_-$* #,##0.00_-;-$* #,##0.00_-;_-$* \\\" - \\\"??_-;_-#_-\""
Some important things to note when dealing with mso-number-format:
All custom formats must be surrounded in double quotes.
Any double quotes 'inside' the custom format must be replaced by literal "escaped double quotes", for Excel to interpret later.
So, you want the actual markup to be output like this:
"_-$* #,##0.00_-;-$* #,##0.00_-;_-$* \" - \"??_-;_-#_-"
Hope this helps
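The two quoting rules above are easy to get wrong by hand, so here is a small Python sketch (the answer itself uses C#) that applies them mechanically. The format string is copied from this answer; the helper names and the choice of a single-quoted style attribute on a <td> are assumptions for illustration, not something Excel documents.

```python
# Accounting format from the answer above, written with plain inner quotes.
ACCOUNTING_FORMAT = '_-$* #,##0.00_-;-$* #,##0.00_-;_-$* " - "??_-;_-#_-'


def mso_number_format(custom_format):
    """Apply both rules: escape inner double quotes, then wrap in double quotes."""
    escaped = custom_format.replace('"', '\\"')
    return '"' + escaped + '"'


def accounting_cell(value):
    """Build a table cell carrying the format (single-quoted style is an assumption)."""
    style = "mso-number-format:" + mso_number_format(ACCOUNTING_FORMAT)
    return "<td style='" + style + "'>" + format(value, ".2f") + "</td>"


if __name__ == "__main__":
    print(mso_number_format(ACCOUNTING_FORMAT))
    print(accounting_cell(-12.5))
```

The first print should reproduce the markup string shown above character for character, which is a quick way to check the escaping before opening the file in Excel.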
MSO is pretty proprietary to Microsoft, and hopefully this stuff won't be supported in the future so that UI Devs like me don't have to un-do it....fingers crossed.
My first inclination would be to dynamically build the Excel spreadsheet with a tool like PHPExcel so that you have 100% control over formatting, calculations, etc., in the way that Excel expects. There are certainly equivalents of this software for whichever technology you have at your disposal (.NET, Java, etc.).
Absent that solution, there are wonderful jQuery plugins such as this one. However, I'm not entirely sure this would suffice when you pull the HTML natively into Excel, since the script might not fire.
Do you have back-end technology available? Something like a regex replacement could quickly solve the problem with no ill effects on Excel.
From Microsoft's own website, CSS can't format numbers.
I'll ask my question first, then give some background for those who are interested:
I would like to know whether there is a command in HTML that will automatically generate a bibliography from a .bib file. That is, throughout the text I would add something like <cite name="Jones2010">, and then at the bottom of the HTML (or CSS) file I would write something like <makebib file="biblist.bib", format="APA">, and a bibliography would be generated from my .bib file and formatted according to the APA style. The functionality would be quite similar to footnotes, except that each footnote is populated by some script that extracts the information from (essentially) an XML file and outputs the content in the desired format. It is not difficult to imagine somebody creating a tool to do just that; however, my Google search skills have not enabled me to find one. It is easy to find tools that convert .bib files to HTML or XML, but that is not sufficient for my needs. I do not want to publish my entire .bib file online. Rather, for each document that I generate, I want several of the entries in the .bib file to be included as footnotes. Any pointers will be greatly appreciated.
Now, the reason behind the question:
I have recently begun switching from writing all my manuscripts in LaTeX to writing them in HTML/CSS. The advantages of this approach are vast: there is only one file to version (instead of .dvi, .ps, .aux, .blg, etc.), it is much smaller to share, other people can edit the HTML file and compile it much more easily, it is more configurable to my tastes, it is easier to read on screen, etc. The disadvantage for me, however, is that while I've been writing in LaTeX for years, I've only just begun using HTML and CSS for scientific document creation. The main impetus for the switch was MathJax, which enables me to embed LaTeX equations in my HTML files and therefore allows me to combine the advantages of LaTeX with the advantages of CSS. I imagine that nearly all my colleagues will switch away from LaTeX to this simpler format, assuming a few remaining issues get resolved, like the ease of creating bibliographies.
Many thanks.
What you're asking isn't possible, unless when you specify html/css you really mean html/css/php or html/css/python or some other combination that includes an actual programming language, rather than just a markup language.
I understand your motivation, I'd love to switch to html instead of latex! However, I suspect an html-based solution would involve so much extra processing added on top to sort out bibliographies etc that the complexity would start approaching that of LaTeX by the time you got it all worked out.
I'd be pleased to be proven wrong on this!
I've done this in the past, using XSLT and BibTeX. In outline, the steps are:
1. Mark up your document using some convention or other: I used <span class='citation'>Smith99</span>
2. Write an XSLT script to transform that file into a .aux file with \citation commands in it
3. Use BibTeX along with a .bst file which spits out HTML rather than LaTeX
4. Use another XSLT script (or the same one, in a different mode) to pull the bibliography in
It's not quite as fiddly as it sounds, but you can look at how I did it on google code. In particular, see structure.xslt and plainhtml.bst.
If there's a more direct way, I'd be quite interested to hear about it.
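For readers who don't want to write XSLT, the second step can be sketched in Python instead. This is not the code from the project mentioned above; it only assumes the <span class='citation'> convention from step 1 and the fact that BibTeX reads \citation, \bibstyle and \bibdata lines from the .aux file. The names CitationCollector and make_aux are made up, and the style and database names are passed in by the caller.

```python
from html.parser import HTMLParser


class CitationCollector(HTMLParser):
    """Collects the keys inside <span class='citation'>...</span> in document order."""

    def __init__(self):
        super().__init__()
        self.keys = []
        self._in_citation = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "citation":
            self._in_citation = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_citation = False

    def handle_data(self, data):
        key = data.strip()
        if self._in_citation and key and key not in self.keys:
            self.keys.append(key)


def make_aux(html, bib_name, style_name):
    """Return the text of a minimal .aux file that BibTeX can process."""
    collector = CitationCollector()
    collector.feed(html)
    collector.close()
    lines = ["\\citation{%s}" % key for key in collector.keys]
    lines.append("\\bibstyle{%s}" % style_name)
    lines.append("\\bibdata{%s}" % bib_name)
    return "\n".join(lines) + "\n"
```

Writing the result to, say, paper.aux and running bibtex on it would then produce a .bbl file whose contents depend entirely on the .bst file, which is where the HTML-emitting style from step 3 comes in.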
Both answers so far are somewhat correct, although not quite what you were asking for. Part of the problem is that the question as it's phrased doesn't necessarily make sense.
HTML is just markup; you need something to process the markup, be it python, php, ruby, etc.
And you probably want to write in XML (or XHTML), not HTML.
XSLT may work for you (once your document is in XML), but remember that an XSLT document only defines a set of rules. You would need an XSLT engine to apply those rules to your XML document.
You can create an html bibliography from a .bib file using bibtex2html. This package takes a series of command line arguments and extracts the info from the BibTeX source and outputs a file with html markup.
As far as I know you cannot get it to read and parse the html document like the LaTeX \cite command but there are several ways to indicate the references you want. I find that the easiest way is to just maintain a text file of the BibTeX keys I use in my manuscript and then call this using the --citefile option. There is also a tool called bib2bib included that will take search commands.
It is a very flexible package and there are a lot of options so it works in a lot of situations. For example you can get it to omit the <html> headers from the output file so that you can directly paste into an existing html document.
The documentation is useful but make sure you look at the pdf documentation file and the man pages.
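The text file of keys described above can be generated rather than maintained by hand. The Python sketch below only prepares the inputs and does not run anything: it assumes the cite file holds one BibTeX key per line, and it uses the --citefile spelling given in this answer. The function names and the example file names are hypothetical, so check the man page before relying on them.

```python
def citefile_text(keys):
    """Return cite-file contents: one key per line, duplicates dropped, order kept."""
    unique = list(dict.fromkeys(key.strip() for key in keys if key.strip()))
    return "\n".join(unique) + "\n"


def bibtex2html_command(citefile, bibfile):
    """Build the argument list for a bibtex2html run restricted to the cited keys."""
    return ["bibtex2html", "--citefile", citefile, bibfile]


if __name__ == "__main__":
    # Example only: print what would be written and what would be executed.
    print(citefile_text(["Jones2010", "Smith99", "Jones2010"]), end="")
    print(" ".join(bibtex2html_command("cited.txt", "biblist.bib")))
```

Passing the list to subprocess.run would execute it, assuming bibtex2html is installed and on the PATH.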
I would like to convert doc/docx documents to semantic HTML.
Some wishes/requirements:
• Semantic HTML, such that headers in the document are <h1>, <h2> etc., tables are <table> and so forth.
• Should preferably be able to handle headings, lists, tables and images. Graphs and math formulas are a nice extra.
• Doesn't have to be converted straight from doc/docx to HTML; could use an intermediary format, such as XML or DocBook.
• Should work programmatically, and with a large number of documents.
The closest thing to a solution I've found so far is http://holloway.co.nz/docvert/index.html, but unfortunately it has quite a few bugs and a small user base, and it can't handle a lot of documents. It is more of a proof of concept.
"headers in the document are <h1>, <h2> etc."
I think this is impossible.
MS Word only writes down the result, as <p> elements with different styles, just like printed text on paper; the original structural information is not recorded.
Your other wishes could be met.
There are two commercial tools that can do this
(don't trust the free tools or online tools, they don't do the real work):
1. Word Cleaner by Zapadoo
www.zapadoo.com
2. HTML Cleaner for Word by wonder Studio
www.htmlcleaner.com
I prefer the second one, which was released just last year. You can try them both.
There's a tool called upCast which is able to convert Word documents into XML.
docx4j (for docx only, not doc) writes clean HTML output. You'd need to change things a bit if you wanted <h1> instead of <p class="h1">, but it's open source so you can do that.
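If changing the converter itself is too much work, one alternative is to post-process its output. The Python sketch below is not part of docx4j; it assumes headings arrive exactly as <p class="h1"> ... </p> (the form quoted above) with no other attributes and no nested <p>, and the name promote_headings is made up.

```python
import re

# Matches <p class="h1"> through <p class="h6"> and captures level and content.
STYLED_HEADING = re.compile(r'<p class="h([1-6])">(.*?)</p>', re.DOTALL)


def promote_headings(html):
    """Rewrite styled paragraphs as real heading elements of the same level."""
    return STYLED_HEADING.sub(r"<h\1>\2</h\1>", html)
```

For example, promote_headings('<p class="h2">Scope</p>') gives '<h2>Scope</h2>'. If the real output carries extra attributes, a proper HTML parser would be more robust than a regular expression.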
I wrote a utility which implements the requirements you listed, excluding images, graphs and maths formulas. It's beta quality (i.e., it works on my machine). I published it at http://www.modeltext.com/word
Just more ideas.
Use Gmail to convert word docs
http://www.oreillynet.com/mac/blog/2006/05/use_gmail_to_convert_word_docs.html