What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?
Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.
Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how successful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.
Postscript the second: To be clear, after Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.

Extraction can mean different things to different people. It's one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won't tell you what is cruft and what is meat.
Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I'm interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can't begin to do the interesting stuff -- co-occurrence analysis, phrase discovery, weighted attribute vector generation, etc. -- until you have gotten rid of the cruft.
The first paper referenced by the OP indicates that this was what they were trying to achieve -- analyze a site, determine the overall structure, then subtract that out and Voila! you have just the meat -- but they found it was harder than they thought. They were approaching the problem from an improved accessibility angle, whereas I was an early search engine guy, but we both came to the same conclusion:
Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine 'author intent' of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.
BTW, if you're dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a prof at UC Berkeley). Her PhD thesis and other papers on doing subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.
For the few who might be interested, here's some backstory (probably Off Topic, but I'm in that kind of mood tonight):
In the 80's and 90's our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the 'meaning' of the document. Right. They found us because we were this weird little company doing "content similarity searching" in 1986. We gave them a couple of demos (real, not faked) which freaked them out.
One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs its own special scanner to deal with those differences. For example, if all you're doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in pyramid style - the first paragraph or two has the meat of who/what/where/when, and then following paras expand on that. Like I said, this is the easy stuff.
How about magazine articles? Oh God, don't get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful 1 paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don't.
Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.
Strangely enough, a big reason for Google being as successful as it is (from a search engine perspective) is that they place a lot of weight on the words in and surrounding a link from another site. That link-text represents a sort of mini-summary done by a human of the site/page it's linking to, exactly what you want when you are searching. And it works across nearly all genre/layout styles of information. It's a positively brilliant insight and I wish I had had it myself. But it wouldn't have done my customers any good because there were no links from last night's Moscow TV listings to some random teletype message they had captured, or to some badly OCR'd version of an Egyptian newspaper.
/mini-rant-and-trip-down-memory-lane

One word: boilerpipe.
For the news domain, on a representative corpus, we're now at 98% / 99% extraction accuracy (avg/median)
Demo: http://boilerpipe-web.appspot.com/
Code: http://code.google.com/p/boilerpipe/
Presentation: http://videolectures.net/wsdm2010_kohlschutter_bdu/
Dataset and slides: http://www.l3s.de/~kohlschuetter/boilerplate/
PhD thesis: http://www.kohlschutter.com/pdf/Dissertation-Kohlschuetter.pdf
Also quite language independent (today, I've learned it works for Nepali, too).
Disclaimer: I am the author of this work.

Have you seen boilerpipe? Found it mentioned in a similar question.

I have come across http://www.keyvan.net/2010/08/php-readability/
Last year I ported Arc90’s Readability to use in the Five Filters project. It’s been over a year now and Readability has improved a lot, thanks to Chris Dary and the rest of the team at Arc90.
As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online.
For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.
It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place.

There are a few open source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose which was open sourced by Gravity.com
It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.

I've worked with Peter Rowell down through the years on a wide variety of information retrieval projects, many of which involved very difficult text extraction from a diversity of markup sources.
Currently I'm focused on knowledge extraction from "firehose" sources such as Google, including their RSS pipes that vacuum up huge amounts of local, regional, national and international news articles. In many cases titles are rich and meaningful, but are only "hooks" used to draw traffic to a Web site where the actual article is a meaningless paragraph. This appears to be a sort of "spam in reverse" designed to boost traffic ratings.
To rank articles even with the simplest metric of article length you have to be able to extract content from the markup. The exotic markup and scripting that dominates Web content these days breaks most open source parsing packages such as Beautiful Soup when applied to large volumes characteristic of Google and similar sources. I've found that 30% or more of mined articles break these packages as a rule of thumb. This has caused us to refocus on developing very low level, intelligent, character based parsers to separate the raw text from the markup and scripting. The more fine grained your parsing (i.e. partitioning of content) the more intelligent (and hand made) your tools must be. To make things even more interesting, you have a moving target as web authoring continues to morph and change with the development of new scripting approaches, markup, and language extensions. This tends to favor service based information delivery as opposed to "shrink wrapped" applications.
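To make the "separate the raw text from the markup and scripting" step concrete, here is a minimal sketch using only Python's standard-library html.parser. It is not the hand-made character-level parser described above, just an illustration of the idea; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of an HTML string with markup and scripting removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The unclosed <p> and <b> tags do not stop the parser.
sample = "<p>Hello <b>world<script>var x = '<p>not text';</script><p>Second"
print(visible_text(sample))  # Hello world Second
```

Article length, the simplest ranking metric mentioned above, is then just len(visible_text(page).split()). Real pages will of course need far more care than this.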
Looking back over the years there appear to have been very few scholarly papers written about the low level mechanics (i.e. the "practice of the former" you refer to) of such extraction, probably because it's so domain and content specific.

Beautiful Soup is a robust HTML parser written in Python.
It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access `<foo><bar/></foo>` using `doc.foo.bar`) and seamless unicode.

If you are out to extract content from pages that heavily utilize JavaScript, Selenium Remote Control can do the job. It works for more than just testing. The main downside of doing this is that you'll end up using a lot more resources. The upside is you'll get a much more accurate data feed from rich pages/apps.

Related

Can tags be a replacement for taxonomy?

My question is about usability. In most of the sites I have seen and developed, I see taxonomy as the way a user finds what he is looking for on the site. But quite recently I have seen the concept of tagging, where products, services, and questions are tagged and can be found by tag name. Is tagging an alternative to taxonomy, or should they work together?

I'd say that like most things, it depends on what kind of information you're trying to organize.
For example, here on Stack Overflow, there isn't really a rigid hierarchy by which to sort the questions. They're much more organic in the sense that they can span multiple, and even unrelated, disciplines or fields and create a whole host of dynamic connections. For organizing this type of information, I think tags are an appropriate replacement for traditional, hierarchical taxonification. The decentralized, dehierarchized nature of tagging dovetails perfectly with the general organization of the site's content, especially when the site's users/community is encouraged to participate in cataloguing and organizing the information. Many blogs and social networking sites like Delicious organize their content with a series of tags as well.
Conversely, if you're trying to sell products or provide technical support, you'll probably find that tagging is not a suitable replacement for traditional taxonomic organization. If you're familiar with MSDN, which provides online documentation for developers in the Microsoft ecosystem, you'll observe that most of its content is organized into a natural hierarchy by technology/language, feature, sub-feature, etc. If you want to buy a computer from Dell, you start by narrowing down your choices: do you want a desktop, notebook, or tablet? Do you want a performance-oriented notebook, a desktop-replacement notebook, or an ultra-portable? Etc. Of course, that doesn't mean that you shouldn't consider implementing tags as an alternative way for users to explore the information that you have available, but in the best of cases, they will work together.
Think about the type of content you plan to host on your site and consider the most natural way to organize that information. Your users will appreciate more than anything a site that is intuitive and where they feel it is easy to locate exactly what they're looking for.

That is an argument I have always found interesting, and basically I reduce it to this question:
In order to find something, is it better to have a hierarchical taxonomy or a flat tag-based taxonomy (maybe a collaborative one, i.e. a folksonomy)?
Well, there's no unique answer, but, depending on the search context, sometimes the former is more convenient and sometimes the latter is.
The best thing would be to have both kinds of taxonomy, but that could be difficult to manage, in particular if the content is created by people and so the classification is up to them.
One solution could be to have tag inheritance, as in the Drupal taxonomy system.
So for instance when you want to classify a picture of your dog, you just have to select the tag: 'dogs' and automatically your picture will belong to tags: 'dogs' --> 'animals' --> 'living beings' and so on.
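The inheritance idea can be sketched in a few lines of Python. This is only an illustration and not how Drupal implements it; the PARENT mapping and the function names are invented for the example.

```python
# Hypothetical parent links: each tag points to its broader tag.
# Tags without an entry are roots of the hierarchy.
PARENT = {
    "dogs": "animals",
    "cats": "animals",
    "animals": "living beings",
}


def expand_tag(tag, parent=PARENT):
    """Return the tag followed by all of its ancestors, most specific first."""
    chain = []
    seen = set()
    while tag is not None and tag not in seen:  # 'seen' guards against cycles
        chain.append(tag)
        seen.add(tag)
        tag = parent.get(tag)
    return chain


def expand_tags(tags, parent=PARENT):
    """Expand every selected tag; return the de-duplicated union in order."""
    result = []
    for tag in tags:
        for t in expand_tag(tag, parent):
            if t not in result:
                result.append(t)
    return result


print(expand_tag("dogs"))  # ['dogs', 'animals', 'living beings']
```

The user only picks 'dogs'; the broader tags are derived, so a search for 'animals' still finds the picture.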

This question is an issue related to human thinking:
Sure, it is better if you can find something by a tagged word. But if you don't know the word/tag exactly, you are not able to find it. Others may have tagged the thing you are searching for with a similar, but different, tag. In this case a (binary) tag search will not give you the correct (or whole) answer.
Anyway, there is a possibility to extract a taxonomy out of tags (as long as the words/tags are related). This concept (combined with a vector-oriented search) can be presented to the user and will help him find what he needs.
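One way such an extraction could work is a simple co-occurrence (subsumption) heuristic: tag A is proposed as a parent of tag B when nearly everything tagged B is also tagged A, and A is the more widely used tag. The answer does not say which method it has in mind, so treat the following as an assumed sketch; the function name and the 0.8 threshold are arbitrary choices.

```python
from collections import defaultdict


def infer_parents(items, threshold=0.8):
    """Guess a parent for each tag from co-occurrence.

    items: iterable of tag sets, one per tagged resource.
    Tag A becomes the parent of tag B when at least `threshold` of the
    resources tagged B are also tagged A and A is used more often than B.
    """
    tagged = defaultdict(set)  # tag -> indices of resources carrying it
    for idx, tags in enumerate(items):
        for tag in tags:
            tagged[tag].add(idx)

    parents = {}
    for child, child_items in tagged.items():
        best = None
        for cand, cand_items in tagged.items():
            if cand == child or len(cand_items) <= len(child_items):
                continue
            overlap = len(child_items & cand_items) / len(child_items)
            # Prefer the most specific (least used) qualifying candidate.
            if overlap >= threshold and (
                best is None or len(cand_items) < len(tagged[best])
            ):
                best = cand
        if best is not None:
            parents[child] = best
    return parents


items = [
    {"dogs", "animals"},
    {"dogs", "animals", "puppies"},
    {"cats", "animals"},
    {"animals"},
]
print(infer_parents(items))
```

On this toy data the result maps 'puppies' to 'dogs' and both 'dogs' and 'cats' to 'animals'. Real folksonomies are much noisier, so the threshold would need tuning.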

Although I'd just upvote Cody's answer (I did), I would also like to add something:
The field of usability used to be within the realm of ergonomics before it grew up. So I think it is appropriate to refer to one of ergonomics' core principles.
Every person has a unique set of dimensions, so there is no single set of “correct dimensions” for e.g. a chair. The best dimensions are adjustable dimensions that provide a reasonable range of variability.
It is possible to apply this principle to website navigation as well and provide multiple ways of reaching the same content, so that people with different habits can find stuff using the way they are most comfortable with.

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract News Title, News Abstract (First Paragraph), etc
I plugged into the WebKit parser code to navigate each web page easily as a tree. To eliminate navigation and other non-news content I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they have in common. This gives me the content minus the common navigation content, etc.
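For readers who want to try the diff step described above, here is a rough sketch using Python's difflib on the plain-text versions of two pages. It is not the WebKit-based code from the question; the function name and the sample pages are invented for illustration.

```python
import difflib


def strip_shared_lines(page, other):
    """Return the lines of `page` that are not in a block shared with `other`."""
    a = page.splitlines()
    b = other.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    shared = set()
    for block in matcher.get_matching_blocks():
        shared.update(range(block.a, block.a + block.size))
    return [line for i, line in enumerate(a) if i not in shared and line.strip()]


page1 = (
    "Home | World | Sport\n"
    "Floods hit the coast\n"
    "Heavy rain caused flooding on Monday.\n"
    "Copyright Example News"
)
page2 = (
    "Home | World | Sport\n"
    "Election results announced\n"
    "The count finished late on Tuesday.\n"
    "Copyright Example News"
)
print(strip_shared_lines(page1, page2))
# ['Floods hit the coast', 'Heavy rain caused flooding on Monday.']
```

Note that a diff like this only removes text that is identical across pages. Cruft that changes from article to article (related-story lists, comment counts, dated banners) survives it, which may be one source of the remaining junk.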
Despite the above approach I am still getting quite a lot of junk in my final text. This results in an incorrect News Abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in
Can you
1) Suggest an alternative strategy for extraction of pure content?
2) Would/can learning Natural Language Processing help in extracting the correct abstract from these articles?
3) How would you approach the above problem?
4) Are there any research papers on the same?
Regards
Ankur Gupta

You might have a look at my boilerpipe project on Google Code and test it on pages of your choice using the live web app on Google AppEngine (linked from there).
I am researching this area and have written some papers about content extraction/boilerplate removal from HTML pages. See for example "Boilerplate Detection using Shallow Text Features" and watch the corresponding video on VideoLectures.net. The paper should give you a good overview of the state of the art in this area.
Cheers,
Christian
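To give a feel for what "shallow text features" can mean, here is a toy sketch that scores each text block by its word count and its link density (the share of words inside anchor tags). It is emphatically not boilerpipe's code or its trained rules; the block-splitting tag list, the names, and both thresholds are illustrative guesses.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "ol", "table", "tr"}


class BlockFeatures(HTMLParser):
    """Split a page into text blocks and count words and linked words."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []  # finished blocks: (text, word_count, linked_words)
        self._words = []
        self._linked = 0
        self._in_link = 0

    def flush_block(self):
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._linked)
            )
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)


def content_blocks(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text.

    Both thresholds are illustrative guesses, not values from the paper.
    """
    parser = BlockFeatures()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    return [
        text
        for text, n, linked in parser.blocks
        if n >= min_words and linked / n <= max_link_density
    ]


page = (
    "<ul><li><a href='/'>Home</a></li><li><a href='/news'>News</a></li></ul>"
    "<p>The river burst its banks on Monday after three days of heavy rain, "
    "and several roads were closed.</p>"
    "<div><a href='/subscribe'>Subscribe to our newsletter for daily "
    "updates and offers today</a></div>"
)
print(content_blocks(page))
```

On this sample only the paragraph about the river is kept: the navigation items are too short and the subscribe banner is all link text.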

For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.
For question (2), automatic creation of abstracts is not a developed field. It is usually referred to as 'sentence selection', because the typical approach right now is to just select entire sentences.
For question (3), the basic way to create abstracts with machine learning would be to:
1) Create a corpus of existing abstracts.
2) Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
3) Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.
My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).
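The three steps above can be sketched with a tiny Naive Bayes classifier. Naive Bayes is just my pick for illustration (the answer only says "a classifier of some sort"), and the six-sentence corpus, the feature set, and all names below are made up; a real corpus would need to be far larger.

```python
import math
from collections import Counter


def features(sentence, position):
    """Tiny hypothetical feature set: lowercased words plus a lead flag."""
    feats = [w.strip(".,;:!?\"'").lower() for w in sentence.split()]
    feats = [f for f in feats if f]
    if position < 2:  # sentence is one of the first two in the article
        feats.append("__LEAD__")
    return feats


class NaiveBayes:
    def fit(self, examples):
        """examples: (feature_list, label) pairs; label True if chosen."""
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter()
        for feats, label in examples:
            self.counts[label].update(feats)
            self.docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def score(self, feats, label):
        total = sum(self.counts[label].values())
        logp = math.log(self.docs[label] / sum(self.docs.values()))
        for f in feats:
            # Add-one smoothing so unseen features do not zero out the score.
            logp += math.log(
                (self.counts[label][f] + 1) / (total + len(self.vocab) + 1)
            )
        return logp

    def predict(self, feats):
        return self.score(feats, True) > self.score(feats, False)


# Steps 1 and 2: an annotated corpus of (sentence, position, chosen?).
corpus = [
    ("The council approved the new bridge on Tuesday.", 0, True),
    ("Ministers announced a tax cut on Monday.", 0, True),
    ("The vote followed months of debate.", 1, True),
    ("Click here to read more stories.", 5, False),
    ("Follow us for more stories and updates.", 6, False),
    ("Read more about our cookie policy here.", 7, False),
]

# Step 3: train, then classify sentences from a new article.
model = NaiveBayes().fit([(features(s, p), y) for s, p, y in corpus])
print(model.predict(features("The mayor announced the plan on Friday.", 0)))  # True
print(model.predict(features("Click here for more updates.", 4)))  # False
```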
For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.

I don't know how it works, but check out Readability. It does exactly what you wanted.

Should I make it a priority to semantically mark up my pages? Or is the Semantic Web a good idea that will never really get off the ground?

The Semantic Web is an awesome idea. And there are a lot of really cool things that have been done using the semantic web concept. But after all this time I am beginning to wonder if it is all just a pipe dream in the end. Will we ever truly succeed in making a fully semantic web? And if we are not going to be able to use the semantic web to give our users a deeper experience on the web, is it worth spending the time and extra effort to ensure FULLY semantic web pages are created by myself or my team?
I know that semantic pages usually just turn out better (more from attention to detail than anything, I would think), so I am not questioning attempting semantic page design. What I am currently mulling over is dropping the review and revision process that turns a partially semantic page into a fully semantic one in hopes of some return in the future.

On a practical level, some aspects of the semantic web are taking off:
1) Semantic markup helps search engines identify key content and improves keyword results.
2) Online identity is a growing concern, and semantic markup in links like rel='me' helps to disambiguate these things. Autodiscovery of social connections is definitely upcoming. (Twitter uses XFN markup for all of your information and your friends, for example)
3) Google (and possibly others) are starting to pay attention to microformats like hCard and hCalendar to gather greater information about people and events going on. This feature is still on the "very new" list, but these microformats are useful examples of the semantic web.
It may take some time for it all to get there, but there are definite possible benefits. I wouldn't put a huge amount of effort into it these days, but it's definitely worth keeping in mind when you're developing a site.
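As a small illustration of why this kind of markup is cheap for a machine to consume, here is a sketch that pulls rel='me' links out of a page with Python's standard-library html.parser. The class name and the example.com URLs are placeholders, and this is not how any particular search engine or service does it.

```python
from html.parser import HTMLParser


class RelMeLinks(HTMLParser):
    """Collect href values of <a>/<link> elements whose rel includes 'me'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel is a space-separated list, so match whole tokens only.
        rel_values = (attr.get("rel") or "").lower().split()
        if "me" in rel_values and attr.get("href"):
            self.links.append(attr["href"])


def rel_me_links(html):
    parser = RelMeLinks()
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<a rel="me" href="https://example.com/alice">Alice</a> '
    '<a rel="friend met" href="https://example.org/bob">Bob</a> '
    '<a rel="nofollow me" href="https://example.net/alice">elsewhere</a>'
)
print(rel_me_links(page))
# ['https://example.com/alice', 'https://example.net/alice']
```

Without the rel attribute, a crawler would have to guess from the surrounding text which links point to the same person.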

Yahoo and Google have both announced support for RDFa annotations in your HTML content. Check out Yahoo SearchMonkey and Google Rich Snippets. If you care about SEO and driving traffic to your site, these are good ways to get better search engine coverage today.
Additionally, the Common Tag vocabulary is an RDFa vocabulary for annotating and organizing your content using semantic tags. Yahoo and Google will make use of these annotations, and existing publishing platforms such as Drupal 7 are investigating adopting the Common Tag format.

I would say no.
The reason I would say this is that the current return for creating a fully semantic web page right now is practically zero. You will have to spend extra time and effort, and there is very little to show for it now.
Effort is not like investing, however, so doing it now has no practical advantage. If the semantic web does start to show potential, then you can always revisit it and tap into that potential later.

It should be friendly to search engines, but going further is not going to provide good ROI.
Furthermore, what are you selling? A lot of the purpose behind being semantic beyond being indexable is easier 3rd party integration and data mining (creating those ontologies). Are these desirable traits for your data sets? If you are selling advertisement, making it easier for others to pull in your content is probably not going to be helpful.
It's all about where you want to spend your time.

You shouldn't do anything without a requirement. Otherwise, how do you know if you've succeeded? Do you have a requirement for being semantic? How much? How do you measure success? How do you measure return on investment?
Don't do anything just because of fads, unless keeping up with fads is a requirement.

Let me ask you a question - would you live in a house or buy a car that wasn't built according to a spec?
"So is this 4x4 lumber, upheld with a steel T-Beam?"
"Nope...we managed to rig the foundation on PVC Piping...pretty cool, huh."

Giving presentation on software project to non-programmers [closed]

Soon I will need to give a presentation on my honours project for the engineering faculty and a large group of engineering and technology students at my university. While all the people attending will be technical-minded, not all of them will be programmers and most will be from other engineering disciplines.
I have given presentations before, and I am confident speaking to a crowd, but I realize now all the presentations I have given before have been to fellow CS/SE majors and teaching staff. I wonder if my presentation style assumes that I am presenting to other software geeks, so they will know what I am talking about and I can put on a more interactive demo involving the audience.
My honours project isn't terribly complex or theoretical, I have a prototype C# Winforms app but it is designed to be extensible and operate with different data sources (ODBC or WS) in the future, and some research to how it could be extended with a rule engine and DSL and turned into a marketable product. The organization that is testing my prototype is saving tens of thousands of dollars a year by automating a critical business function.
I had planned to show off how extensible it was by some live coding and UML-style diagrams. I really enjoy doing demos and live coding but I don't know if that kind of presentation will be as accessible to non-programmers, and I am worried if I get too geeky and technical I may alienate the audience and judges.
What are the effective techniques you have found to present software projects in a way that is also interesting to non-programmers?

When I was working on my doctorate, the faculty gave us this rule for seminars - and it has proved very useful since:
1) Tell 'em what you're gonna tell 'em. (E.g., brief introductory problem description and results abstract)
2) Tell 'em. (E.g., technical details comprising the bulk of the time)
3) Tell 'em what you told 'em. (E.g., brief summary and conclusions)
4) Open the floor for questions.
In your position, I would take about 10-20% of your allotted time to do #1 in a largely non-technical way. So you might describe the business function your code automates, why that's important, what things were like before and after applying your solution, how it's saving money, that kind of thing.
Then I'd launch into a highly technical discussion aimed at the CS/SE crowd. Even if the rest of the folks don't understand it and their eyes glaze over, your introduction at least will have given them a sense of what it's all about, and they might recognize a bit here or there.
For the third part, I'd briefly recap the problem and describe how you solved it in non-technical language, and then do your live-coding extensibility whiz-bang demo. Even if the non-CS/SE folks don't understand the demo, they'll see eye candy flying by and your professional peers and faculty all nodding and smiling, so they'll think it's cool.
I once attended a seminar by a guy who won the Nobel Prize for applying chaos theory to chemical systems. He applied this approach, so even though all the non-theoreticians like my fellow organic chemists and I were all completely out of our depth, the fact that the theoreticians were all excited left us feeling like it was a great seminar even though we didn't have a clue about what he'd said.

To appeal to both audiences, I sometimes give the technical explanation and then follow it up with my "in English, please" explanation. CSI and other dramas with science in them do this all the time, to good effect.
In other words, [insert plain english explanation here].

Let's attack this as a refactoring problem.
I.e., instead of adding more to your presentation, is there a way you can take stuff out?
For example, I don't think showing off that your demo app can use multiple data sources is essential, much less that it warrants programming right there during the presentation.
I know it took care in the design of your app to reach that point, but still most people are more interested in the OUTPUTS, not the INPUTS, of an app. And even more in the BENEFITS of said app.
Some guiding points:
Make the presentation about them. If the audience has felt the pain that your program solves, remind them of that pain. If they are other researchers like yourself, then ask them to put themselves in the shoes of the organization you helped.
Compare the old way vs the new way of doing things. Why is the new way more efficient? Will it lead to more sales? Will it reduce inventory? Or save money? Will someone lose his/her job because your solution makes his/her task irrelevant? Note: When making technological presentations I've observed it is important to address what happens to the people who were doing the task previously. Fortunately, most of the time people don't lose their jobs; in most cases the same people can manage a much larger volume of work thanks to technology.
Show results. What are the real results your demo company has observed?
Use meaningful visuals. If you can make some animations that explain your algorithm, even better.
Tell your point at the beginning and the end. Most people will forget what happened in the middle, so make sure to tell the most important thing at the beginning and the end of your talk.
Practice, Practice. Yeah it sounds ridiculous but do your whole presentation in front of a mirror or video recorded at least twice. The more the better.
Don't give one of the most important presentations of your life without a rehearsal.
Breathe and be positive; you will do fine :-D
PS: My suggestions are derived from this webpage. It has guided me several times:
6 Stimuli to reach the old brain

You're already working on knowing your audience, which I think is awesome. You just need to take it a step further and ask yourself: if I were person X in the audience, what would I get out of this presentation?
I'd question the validity and how much effort should go into the technical/coding demo, if the group you're presenting to is never likely to use your specific implementation. It may be more important to portray how you approached the extensibility, so that you garner ideas within the peers on how they can approach it in the future, as well as hit on points throughout that are important to all of your audience members, and maybe shortcut the demo a bit to just show that, yes, indeed it does work.
I don't know about you, but personally I've always got more value out of these types of presentations when they are based around how the project appeals to everyone: how you are managing to save tens of thousands of dollars per year for this company, theoretically why other companies might want to use it as well, what the market and other factors are, and what giant technological hurdles you had to overcome. Even if it's a simple project, there were things you must have thought about ahead of time to avoid getting backed into a corner.
I think if you're a really good presenter, and the purpose of the presentation is to be broad and appealing to the entire group, and not a talk on the chaos theory and application to chemical systems, which has that stated purpose, you should appeal to the lowest common denominator of the audience, and the entire audience can be entertained and appreciate what you have achieved at every step along the way, and to do this, they don't necessarily have to understand every step taken either.

I've been in the same situation
(presenting a software engineering/image processing/recognition project in an EE faculty competition).
Start with the issue (the problem)
Then the background (a BIT of technical background)
The solution:
Start with block-charts (all engineers read those)
Then explain the technologies and, briefly, how complicated the implementation was
(don't underestimate the complicated part - otherwise you may make your work seem too simple to engineers from other fields - they won't appreciate your effort)
Results:
Show short visual examples (try to make them intriguing)
(short code examples can go here)
Short user interface demo
Show impressive graphs
Bibliography, thanks, possible future improvements/research
Questions (if the forum is large, tell them in advance that the time for questions will be at the end)
General advice:
Practice presenting (over and over)
Leave 45-60 seconds per slide
No more than 5 points per slide
1 line per point
Add jokes
No animations except for demonstrating complicated issues faster
Use clear fonts (Arial or Calibri for regular text, 1 different font for titles)
Use high contrast colors
(bright on black or dark on white if you must - no dark on dark or bright on bright)

Well first of all, I would suggest talking to your faculty advisers about what they expect from your presentation. If there's any question about how you should balance technical details understandable to only CS people versus more general concepts understandable to the larger audience, I think it would really help to get input from those who will be evaluating you.
One thing I really like to see from a presentation is a "take home message". What is the one thing you want everyone in that audience to remember long after they've left the room? Tell them the take-home message at the very beginning. Tell them you will spend the rest of the presentation explaining why they should care and why they should believe you. Even if people get lost in some of the technicalities, if you at least drive home that one message, you've delivered one thing to a lot of people.
Another suggestion: don't forget about format. Presentation slides should be readable from anywhere in the auditorium/lecture hall. Don't overwhelm people with too much text on one slide. Keep bullets short and easy to scan. Do you want people to spend their time reading your slides or do you want them to listen to what you have to say? Don't use acronyms, but if you must, explain what they mean--and put the definitions on your slides--unless you are sure they are common knowledge. If people are sitting there wondering what the heck that acronym means, they aren't listening.
As to whether you should show actual code or do live coding, my gut feeling is that you shouldn't unless it's absolutely critical to the point you're making. If your project were actually about some coding construct (e.g., if you had invented the concept of an "extension method"), okay, it would make sense to get into some actual code. But it sounds like the significance of what you've done is definitely up a level from that. You might want to show how little code it takes to, say, hook up a different data source, but I wouldn't actually get into walking through the code itself unless you feel you can't make your point otherwise. One thing I probably would like to see if I were in the audience is a demo of your code in action. Show me what it does, and tell me why that's cool.
I hope it goes well!

Here is my advice:
Be clear who your audience is and what your message is - are you trying to impress six faculty members who are marking your project, or proving you can entertain the whole audience?
Have a Contents page early on - that way the audience know what to expect.
Put the geek stuff in an appendix to your main presentation. That way you can dip into it for questions, but you will not lose the main point of your talk.
Make sure your presentation flows and tells a story - limit slide numbers and don't clutter them, e.g. project goals, possible uses, design challenges, software choice, what you did (limit techie), results (demo), results and limitations, next steps, questions.
Have a Conclusions page at the end -- make sure you circle back and cross refer to your original contents page.
Leave 15-20% of your time for questions. This will reveal what the audience is interested in, and allow you to display a deeper understanding of the topic i.e. only do live coding if they ask for it.
Rehearse out loud even if you feel stupid doing it.
Good luck.

A few tips
Use a common technical language. Only use terms that the audience will recognize.
Link what you present to examples the audience will recognize.
You can also read these great articles.
11 Top Tips for a Successful Technical Presentation
Tips for a Successful Technical Presentation
Bye.

Mix in some topic everybody knows. It has helped me to theme slides with images ranging from the Divine Comedy to the Simpsons. I don't know how formal your presentation is, but it's a common constructivist technique to hook onto something your audience already knows to make your point.
I once attended a presentation by Larry Wall where he explained Perl 6 features using examples from golf mixed with The Lord of the Rings.

What I do is talk in analogies: try to translate the terms you are explaining into real-world equivalents.
BTW, why are you talking about software tech aspects to non-tech people? You have to target the content to your audience first. Who is your main audience? The techies or the non-techies? Choose one.
Regards,

I'd be inclined to not use code (unless you actually have to), and use some form of generic (and straightforward) pseudocode.
Also, if you are doing the talk with prompt cards, put 'Breathe!' at the top of the cards. It helped me...

Focus on the user interface (aka how it makes their lives easier) and how it is different from similar products (why they should listen.)

I think Simon Peyton Jones gives excellent talks. See the How to give a good research talk section on this page. In particular, check out the video of his talk about the subject linked to in that section. You can find other videos out there of his talks on Haskell, functional programming, etc. to see how he practices what he preaches.

Please listen to the following podcast: Manager Tools - Presentation basic
It will cover all the basics you need to do effective presentations.
Now when doing project presentations do the following:
Create a high-level architecture model ... see this model; you can probably do better (note: the model image is from my blog).
Create a high-level requirement list.
Create an application workflow process diagram (once again pretty colors, arrows and blocks). This model will show how a user is expected to work with the application in order to solve its main task.
To present the application, first show them the requirement list and talk about it, then the high-level architecture, and finally the application workflow process diagram, which can be followed by a live demo.
The most important rule is to present at a fairly high level with lots of diagrams and models to show what you are talking about.

An online resource to back an argument for cleaner design

I am working in the web dept of a large legal firm, and among other things am responsible for maintaining a professional look for all our email communications (over 600 pieces per year).
Right now I am in a rut. Using a lot of pressure and manipulation, a person in management got to "art direct" a couple of HTML emails working directly with a member of my team and I caught the design at the last moment.
Her "designs" introduced background images behind the text of the emails along with additional, high-contrast imagery sitting behind the title in the header.
I ended up mandating a design change; however, she is very insistent on "her" design and is questioning all my reasoning for simplifying the look.
Basically she is questioning my expertise and asking for "proof" that her design is not user friendly.
I have the meeting in a couple of hours and was wondering if anyone here could point me to resources that discuss these specific items:
Argument against background images positioned behind the copy of an email. The images are at about 10% opacity, which makes them incomprehensible, and makes the design busy and ugly (my perspective).
Argument against high-contrast images behind titles.
Now, I am aware of the technical implications of including images in HTML emails, Outlook 2007 not loading background images, etc. This is not necessarily a technical issue, but a serious aesthetic/usability step in the wrong direction.
Thank you!

Facts:
Common sense in communicating dictates that anything that distracts from the message (the content of the emails) is not a good idea.
Are there images in the background of your letterheads and on all your invoices? Why so? Why not?
What do background images contribute to the value and perception of the message, and to the image of your corporation? Are their impacts clearly known?
Go take a look at email newsletter sites. They are covered in guides and tutorials on how to email market effectively.
www.icontact.com
www.constantcontact.com
and so on..
Opinions:
Emails are not meant to be flyers. They are meant to communicate, clearly, simply and concisely while bringing a professional image. Making it look like a cartoon, or a flower shop, or whatever else you are dealing with probably doesn't add to it.
The issue you may run into is she is taking it personally because she is attached to the design for personal reasons and not designing for the needs of the business. So an attack on the design is an attack on her. She is too involved with her ego of looking good and avoiding looking bad or wanting some kind of glory.
Simply put, she should be the one qualifying to you why it IS good design, not the other way around. If she doesn't know, why is she asking you to prove it to her? How would she understand?
There's a book called Dealing with difficult people that may be of use to you.
Of course, if common sense was really common we wouldn't have to point it out as being common sense.
Update us on what happens!

http://www.asciiribbon.org/.
They have a lot of points on why not to use HTML etc. in emails.
Quite a few e-mail clients do not support HTML e-mail.
Other clients have a very poor or broken HTML rendering, causing the messages to be unreadable as well.
Sending HTML e-mails causes great overhead, and is very inefficient.
People that are limited to a text-only terminal, people with disabilities, blind people, basically anyone that cannot use a graphical interface easily or at all, are likely unable to read your mail.
(Extract from link)

In addition to what others have said, consider legal accessibility requirements. I found one example of the US Department of Education accessibility requirements. I'm sure one can find more examples by searching for this.
Although it doesn't really apply, you may be able to reference the Americans With Disabilities Act, assuming you're in America.
Also, since you're sending HTML-formatted mail, maybe the Web Content Accessibility Guidelines 1.0 are of interest. For example, checkpoint 2.2 of this document states "Ensure that foreground and background color combinations provide sufficient contrast when viewed by someone having color deficits or when viewed on a black and white screen."
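If you want a number to bring to the meeting, here is a minimal sketch of a contrast check. Note the assumption: WCAG 1.0 (quoted above) does not define a formula, so this uses the relative-luminance and contrast-ratio definitions from the later WCAG 2.x guidelines, and the function names are my own. Sample a text color and a color from the background image under it, and see what ratio you get.

```python
# Sketch only: contrast ratio as defined in WCAG 2.x (not in WCAG 1.0).
# Colors are (R, G, B) tuples with 8-bit channels, 0..255.

def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to its linear-light value (0..1)."""
    c = channel_8bit / 255.0
    if c <= 0.03928:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an sRGB color: 0 for black, 1 for white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Contrast ratio between two colors, from 1 (identical) to 21."""
    l1 = relative_luminance(fg)
    l2 = relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    # Black text on a white background: the maximum possible contrast.
    print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))
```

As far as I recall, WCAG 2.x asks for a ratio of at least 4.5:1 for normal body text at level AA; check the current guideline text before quoting a threshold. A busy image behind the copy will give different ratios at different points, so test the worst spot, not the average.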

A professional email should only be graphic-intensive if those graphics emulate the look of the company stationery, emulate the look of the company site, or are interactive. A common example where this makes reasonable sense would be billing emails from Amazon.com. Note, however, that the content itself does not actually have any graphics; only the frame above, below, and to the sides of the content uses graphics. Similar stuff shows up in banking emails and PayPal emails. This sort of thing makes it easier for people to associate the email with the site and makes for nicer printed records that match the online version of the same records.
For standard communication, I'd just go with a header and/or footer graphic.

What I did was escalate to a person who has both the position and an understanding of the importance of the issue. I also presented a version with a cleaner design, made sure to address all objectives, and did not imply that design decisions are open to discussion.

I HATE email with designs and pictures on it. It is unprofessional IMO.
I hate them.
The desirability/niceness of designs and art is subjective. So how can you be sure all the people who receive them will appreciate them?
For me they are a big turnoff.

Also consider looking up some references in Human Factors engineering texts that show readability studies. I bet a quick library search in this area would yield plenty of scientific data showing that her way causes reading errors, eye strain and/or slower reading speed.
