Is there any parallel HTML parser existing or in design? [closed] - html

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
As far as I know, HTML parsing is difficult to parallelize due to its strong dependencies.
Is there any parallel HTML parser existing or in design, so that a single HTML document can be parsed in parallel and a single DOM tree would be produced finally?
It could be either for earlier HTML versions, or the latest HTML5.

From a parsing point of view, the "strong dependencies" in HTML aren't much different from the strong dependencies in any other language you might parse. The real issue is that parsing one part of the file usually depends on the left context. The problem for a parallel parser is: how do you get that left context?
There's general theory about how to build parallel parsers, by breaking the text into chunks, parsing them separately, and stitching the parts together. McKeeman's paper (referenced) claimed .85N speedup for N processors.
I seem to remember a paper that proposed to parse a file from both ends, meeting in the middle. The right-going parser generated left context; the left-going parser generated right context. You can do the bi-directional scanning relatively easily by reversing the grammar, and feed the forward and backward grammars to parser generators. Gluing it together likely requires the kinds of techniques sketched in referenced paper.
Our DMS Software Reengineering Toolkit has a GLR parser that uses pipelining to separate the lexing stages from parsing, and has a full HTML4 parser available. (DMS is built on parallel foundations; it is relatively easy to configure it to parse individual files in parallel, too.) That HTML4 parser is likely extendable to HTML5 using DMS's support for language dialects.
As a general rule, if you are only parsing one program (or HTML) file, this kind of parallelism doesn't matter much, as it won't noticeably affect your overall performance. Most parsers are pretty fast, and their time is largely dominated by the effort of processing the individual characters. You'd probably get much of the available speedup by breaking the file into chunks and lexing the chunks individually, especially since much of an HTML file is wasted whitespace.
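The chunk-and-stitch idea for the lexing stage can be sketched in a few lines. This is purely illustrative: the token rule and splitting strategy are invented, and Python threads won't actually buy CPU parallelism because of the GIL; a real implementation would use processes or another language.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Crude token rule, invented for illustration: a tag, or a run of non-tag text.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")

def lex_chunk(chunk):
    """Tokenize one chunk independently; lexing needs no left context here."""
    return TOKEN.findall(chunk)

def parallel_lex(html, n_chunks=4):
    # Split on '<' so no tag straddles a chunk boundary (a simplification
    # that ignores '<' inside scripts, comments, and attribute values).
    pieces = html.split("<")
    size = max(1, len(pieces) // n_chunks)
    groups = [pieces[i:i + size] for i in range(0, len(pieces), size)]
    chunks = ["<".join(g) for g in groups]
    # Re-add the '<' that split() consumed at every chunk boundary.
    chunks = chunks[:1] + ["<" + c for c in chunks[1:]]
    with ThreadPoolExecutor() as pool:
        token_lists = list(pool.map(lex_chunk, chunks))
    # Stitch the per-chunk token streams back together in order.
    return [tok for toks in token_lists for tok in toks]
```

The stitched result matches a sequential pass over the whole string; parsing the resulting token stream into a single DOM tree is the part that still needs left context.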
If you had to process lots of HTML files, you'd probably be better off with one thread per file being parsed. Then you can use pretty conventional parser technology in each thread.

Related

scalable persistence for startup [closed]

Closed 10 years ago.
What database would you suggest for a startup that might possibly grow very fast?
To be more specific:
We are using JSON to interchange data with mobile clients, so the data should be stored ideally in this format
The data model is relatively simple, like users, categories, history of
actions...
The users interact in "real time" (5 second propagation delay is still OK)
The queries are known beforehand (can cache results or use mapreduce)
The system would have up to 10000 concurrent users (just guessing...)
Transactions are a plus but can live without them I think
Spatially enabled is a plus
The data replication between nodes should be easy to administer
Open source
Hosting services available (we'd like to outsource the sysadmin part)
We now have a functional private prototype on standard relational PostgreSQL/PostGIS. But scalability questions aside, I have to convert relational data to JSON and vice versa, which seems like an overhead under high load.
I did a little research but I lack experience with all the new NoSQL stuff.
So far, I think of these solutions:
Couchbase: master-master replication, a native JSON document store, a spatial extension, and CouchApps; and although I don't know the IrisCouch hosting service, they seem like good technologies.
The downsides I see so far are JavaScript debugging and disk usage.
MongoDB: has only one master, but safe failover. Uses binary JSON (BSON).
Cluster MySQL: the evergreen of web (one master I think)
PostgreSQL & Slony: because I just love Postgres :-)
But there are plenty of others, Cassandra, Membase...
Do you guys have some real life experience? The bad one counts too!
Thanks in advance,
Karel
Unless you are already having problems with scaling, you can't really have a good idea of what you will actually need in the future. You should base your design decisions on what you need now, not on your best estimate of future customer numbers. Remember, you have to impress your first few customers with how well your product solves their problems before you can worry about impressing your 10,000th.
That said, I've found that it's almost always necessary to have basically everything:
A smart/powerful database for the important data and queries that are part of the current application. For this, my first choice is PostgreSQL/PostGIS.
A document database (sometimes called NoSQL) to record forever anything that has passed through your system. A request that looked invalid or useless a year ago may be exactly what a new application needs today, once the vendor finally gives you the API spec to parse it - I hope you've kept it around in a form you can work with. At my current organization we are using CouchDB for this, and it has proven to be a great choice so far.
I have to convert relational data to JSON and vice versa which seems like an overhead in high load.
Not really; the expensive stuff is IO and poorly written queries. The marshalling/unmarshalling is pure CPU, which is about the cheapest thing in the world to grow. Don't worry about it.
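A quick way to sanity-check that claim (this microbenchmark is illustrative only; absolute numbers depend on your machine, and the record shape is made up to resemble the data described above):

```python
import json
import time

# A made-up record shaped roughly like the users/history data described above.
record = {"user": "karel",
          "history": [{"action": "login", "ts": i} for i in range(50)]}

start = time.perf_counter()
for _ in range(1_000):
    blob = json.dumps(record)    # relational row -> JSON text
    parsed = json.loads(blob)    # JSON text -> dict again
elapsed = time.perf_counter() - start

assert parsed == record
# On commodity hardware the loop finishes in a small fraction of a second:
# the round-trip is pure CPU work, dwarfed by any disk or network IO.
print(f"1,000 round-trips took {elapsed:.3f}s")
```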

What is Functional Decomposition? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Functional Decomposition, what is it useful for and what are its pros/cons? Where are there some worked examples of how it is used?
Functional Decomposition is the process of taking a complex process and breaking it down into its smaller, simpler parts.
For instance, think about using an ATM. You could decompose the process into:
Walk up to the ATM
Insert your bank card
Enter your pin
well...you get the point.
You can think of programming the same way. Think of the software running that ATM:
Code for reading the card
PIN verification
Transfer Processing
Each of which can be broken down further. Once you've reached the most decomposed pieces of a subsystem, you can think about how to start coding those pieces. You then compose those small parts into the greater whole. Check out this Wikipedia Article:
Decomposition (programming)
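To make that concrete, here is a minimal Python sketch of the ATM software example; the function names and stubbed logic are invented for illustration.

```python
def read_card(card):
    # Stand-in for talking to the card reader hardware.
    return card["account_id"]

def verify_pin(card, pin):
    # Stand-in for PIN verification against the bank.
    return card["pin"] == pin

def process_withdrawal(balances, account_id, amount):
    if balances[account_id] < amount:
        raise ValueError("insufficient funds")
    balances[account_id] -= amount
    return balances[account_id]

def atm_session(balances, card, pin, amount):
    """Compose the small decomposed pieces into the greater whole."""
    account_id = read_card(card)
    if not verify_pin(card, pin):
        raise PermissionError("invalid PIN")
    return process_withdrawal(balances, account_id, amount)

balances = {"acct-1": 100}
card = {"account_id": "acct-1", "pin": "1234"}
print(atm_session(balances, card, "1234", 40))  # prints 60
```

Each small function can be developed and tested on its own, which is exactly the payoff described below.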
The benefit of functional decomposition is that once you start coding, you are working on the simplest components you can possibly work with for your application. Therefore developing and testing those components becomes much easier (not to mention you are better able to architect your code and project to fit your needs).
The obvious downside is the time investment. To perform functional decomposition on a complex system takes more than a trivial amount of time BEFORE coding begins.
Personally, I think that amount of time is well worth it.
It's the same idea as Work Breakdown Structures (WBS), mind mapping, and top-down development - basically breaking a large problem into smaller, more comprehensible sub-parts.
Pros
allows for a proactive approach to programming (resisting the urge to code)
helps identify the complex and/or risky areas of a project (in the ATM example, security is probably the most complex component)
helps identify ALL components of a project - the #1 cause of project/code failure (per Capers Jones) is missing pieces - things not thought of until late in the project (gee, I didn't realize I had to check the person's balance prior to handing out the money)
allows for decoupling of components for better programming, sharing of code and distribution of work
Cons - there are no real CONS to doing a decomposition, however there are some common mistakes
not breaking down far enough, or breaking down too far - each person needs to determine the happy level of detail needed to give them insight into the component without overdoing it (don't break things down into individual lines of code...)
not taking pre-existing patterns/code modules into consideration (rework)
not reviewing with clients to ensure the scope is correct
not using the breakdown when actually coding (like designing a house, then forgetting the plan and just starting to nail some boards together)
Here's an example: your C compiler.
First there's the preprocessor: it handles #include and #define and all the macros. You give it a file name and some options and it returns a really long string. Let's call this function preprocess(filename).
Then there's the lexical analyzer. It takes a string and breaks it into tokens. Call it lex(string). The parser takes tokens and turns them into a tree, call it parse(tokens). Then there's a function for converting a tree to a DAG of blocks, call it dag(tree). Call the code emitter emit(dag), which takes a DAG of blocks and spits out assembler.
The compiler is then:
emit(dag(parse(lex(preprocess(filename)))));
We've decomposed a big, difficult to understand function (the compile function) into a bunch of smaller, easier to understand functions. You don't have to do it as a pipeline, you could write your program as:
process_data(parse_input(), parse_config())
This is more typical; compilers are fairly deep programs, most programs are broad by comparison.
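A toy, runnable rendering of that compiler decomposition - every stage here is a trivial invented stand-in, just to show how the composition on the line above fits together:

```python
def preprocess(source):
    # Stand-in for #include/#define handling: strip comment lines.
    return "\n".join(line for line in source.splitlines()
                     if not line.lstrip().startswith("//"))

def lex(text):
    return text.split()            # string -> tokens

def parse(tokens):
    return ("program", tokens)     # tokens -> (toy) tree

def dag(tree):
    return [tree]                  # tree -> (toy) DAG of one basic block

def emit(blocks):
    # DAG of blocks -> "assembler"
    return [f"PUSH {tok}" for _, tokens in blocks for tok in tokens]

def compile_source(source):
    # The whole compiler is just the composition of its parts.
    return emit(dag(parse(lex(preprocess(source)))))

print(compile_source("// a comment\nmov ax bx"))
# prints ['PUSH mov', 'PUSH ax', 'PUSH bx']
```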
Functional decomposition is a way of breaking down a complex problem into simpler problems based on the tasks that need to be performed, rather than on the data relationships. The term is usually associated with the older procedure-oriented design.
A short description about the difference between procedure-oriented and object-oriented design.
Functional decomposition is helpful prior to creating functional requirements documents. If you need software for something, functional decomposition answers the question "What are the functions this software must provide". Decomposing is needed to define fine-grain functions. "I need software for energy efficiency measurement" is too general. That's why we break this into smaller pieces until the point where we clearly understand all the functions the systems need to provide. This can be later used as a checklist for completeness of a system.
A functional requirements document (FD) is basically a textual representation of a functional decomposition. Coding directly from the FD may be OK for procedural languages, but it is not good enough for object-oriented solutions, because it doesn't identify objects. Nor is it good for usability planning and testing.
My opinion is that you should take some time to create an FD, but not spend too much time on it. Consult every person who knows the process your system will support, to find all the functions needed.
I have a lot of experience in software design, development, and selling, and I use functional decomposition as the first step of development. I use it as a base for the contract, so the client knows what they will get and I know what I must provide.

What tools do you use for outlining projects? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Whenever I start working on projects that are complex enough that I can't keep it all in my head at once I like to outline how the app should work... I usually hack something like this out in a text editor:
# Program is run
# check to see if database exists
# create database
# complain on error, exit
# ensure database is writable
# complain to user, exit
# check to see if we have stored user credentials
# present dialog asking for credentials
# verify credentials and reshow dialog if they're invalid
# show currently stored data
# start up background thread to check for new data
# update displayed data if new data becomes available
# ...
#
# Background service
# Every 15min update data from server
# Every 24 hours do a full sync w/ server
Et cetera (note: this is commented so SO won't parse it, not because I include it as comments in code).
What I'm wondering is how you guys do this. Are there any tools for outlining a program's flow? How do you describe complex projects so that when it comes time to code you can concentrate on the code and not the design/architecture of all the little pieces?
I use GraphViz if I need to sketch out such simple diagrams - the DOT language is lightweight and diffs very nicely when I compare versions of the diagrams.
I blogged about this with an example a few months ago with an example showing a more complex architecture diagram.
I've also just added a blog post with a zoomed-out diagram that shows a large program flow, to give an idea of how a GraphViz flow might be composed. I haven't the time to obfuscate all the text so just put it up there as a picture at low res to give the impression of the architecture without being able to zoom in to see readable details.
This diagram was composed by hand after a bunch of grepping to get launches. To avoid taunting you too much, here are some excerpts of the DOT text that generates the diagram.
digraph windows {
rankdir=LR
label="Windows Invoked\nby controls and menu items"
node[fontsize=12]
/* ENTRY POINTS */
wndMainMenu[shape=box color=red fontcolor=red]
DEFAULT_WINDOW[label="DEFAULT\nWINDOW" shape=box color=red fontcolor=red]
/* WINDOWS */
node[shape=box color=black fontcolor=black style=solid]
App
wndAddBill [label="Add Payable\nwndAddBill"]
wndAddCustomer [label="Add a Customer\nwndAddCustomer"]
...
/* WINDOW INVOCATION */
node[shape=oval color=blue fontcolor=blue style=solid]
edge[fontsize=10 style=solid color=blue fontcolor=blue]
wndPayBills_bvlNewBill -> wndAddBill
wndAddCustomer -> wndAddCustomer_save001
wndManageDrivers_bvlNewCustomer -> wndAddCustomer
(Zoomed-out diagram: http://www.aussiedesignedsoftware.com/img/WindowLaunchesZoomedOut.png)
Emacs M-x outline-mode
Or, paper.
p.s. this is a serious answer.
Basically, what you are trying to do is express the information and use cases in Given-When-Then format; see http://wiki.github.com/aslakhellesoy/cucumber/given-when-then. This approach solves both problems:
comprehension of the domain and edge cases
outlining of the solution, so you know what to work on next as well as where to start
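The Given-When-Then structure maps directly onto an ordinary unit test; an invented example:

```python
def test_withdrawal_reduces_balance():
    # Given an account with a balance of 100
    balance = 100
    # When the user withdraws 40
    balance -= 40
    # Then the remaining balance is 60
    assert balance == 60

test_withdrawal_reduces_balance()  # runs silently when the behaviour holds
```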
Are there any tools for outlining a program's flow?
Your top comments ("Program is run") could be expressed using a "flow chart".
Your bottom comments ("Background service") could be expressed using a "data flow diagram".
I don't use flow charts (I don't find they add value compared to the corresponding pseudo-code/text, as you wrote it), but I do like data flow diagrams for showing a top-level view of a system (i.e. the data stores/formats/locations, and the data processing stages/IO). Data flow diagrams predate UML, though, so there aren't very many descriptions of them on the 'net.
For anything related to documentation: Wikis, wikis and more wikis!
Easy to read and most important, easy for anyone to update.
My favourite one: Trac (much more than just a wiki anyway)
I like sequence diagrams for anything in the OO realm. There are several nice ways to create sequence diagrams without spending all your time pushing polygons around.
First, there are some online sequence diagram generators that take textual input. For one example, see WebSequenceDiagrams.com.
There's also a nice Java based tool that takes textual input and creates diagrams. This is well-suited for integration into your build process, because it can be invoked directly from ant.
If something is complex I like pictures, but I tend to do these by hand on paper, so I can visualize it better. Whiteboards are great for this.
I break the large, or complex app, into smaller parts, and design those out on paper, so I can better understand the flow between the parts.
Once I have the flow between parts done, then I can better design each part separately, as each part is its own subsystem, so I can change languages or platforms if I desire.
At that point, I just start working on the application, and just work on one subsystem at a time, even though the subsystem may need to be decomposed, until I have a part that I can keep in my head.
Use Cases
Activity Diagrams
Sequence Diagrams
State Machine Diagrams
Class Diagrams
Database Diagrams
Finally, after those are done and the project is looking well defined, it goes into Microsoft Project.
I like to keep this flow as it keeps things well documented, well defined, and easily explainable - not to mention it's simply a good process. If you are unsure what these are, look at my answer here, which gives more information as well as some links out.
I recommend using UML
There are various depths you can go into when designing. If you take UML far enough, most UML applications can auto generate the basic framework of your code for you.
Typically I rely on loose UML, generating use cases, use case diagram, class diagram, component diagram, and have started using sequence diagrams more.
Depending on the project a whiteboard or notepad works, but for a project of reasonable size and time, I'll do everything using ArgoUML
I have enjoyed StarUML in the past, but it's Win32 only, which is now useless to me.
A great book on the subject is
Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development (3rd Edition) - [978-0131489066]
I had to pick it up for a college course which did a crummy job of teaching UML, but I kept the book and have read it a time or two since.
This is also worth checking out: Learning UML 2.0 - O'Reilly - [978-0596009823]

Semantic Diff Utilities [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
I'm trying to find some good examples of semantic diff/merge utilities. The traditional paradigm of comparing source code files works by comparing lines and characters, but are there any utilities out there (for any language) that actually consider the structure of the code when comparing files?
For example, existing diff programs will report "difference found at character 2 of line 125. File x contains v-o-i-d, where file y contains b-o-o-l". A specialized tool should be able to report "Return type of method doSomething() changed from void to bool".
I would argue that this type of semantic information is actually what the user is looking for when comparing code, and should be the goal of next-generation programming tools. Are there any examples of this in available tools?
We've developed a tool that is able to precisely deal with this scenario. Check http://www.semanticmerge.com
It merges (and diffs) based on code structure rather than text-based algorithms, which basically allows it to handle cases like the following, involving heavy refactoring. It is also able to render both the differences and the merge conflicts.
And instead of getting confused by text blocks being moved, since it parses first, it is able to display the conflicts on a per-method basis (per element, in fact). A case like the previous one won't even have manual conflicts to solve.
It is a language-aware merge tool and it has been great to be finally able to answer this SO question :-)
Eclipse has had this feature for a long time. It's called "Structure Compare", and it's very nice. Here is a sample screenshot for Java, followed by another for an XML file:
(Note the minus and plus icons on methods in the upper pane.)
To do "semantic comparisons" well, you need to compare the syntax trees of the languages, and take into account the meaning of symbols. A really good semantic diff would understand the language semantics, and realize when one block of code was equivalent in function to another. Going this far requires a theorem prover, and while it would be extremely cute, isn't presently practical for a real tool.
A workable approximation of this is simply comparing syntax trees, and reporting changes in terms of structures inserted, deleted, moved, or changed. Getting somewhat closer to a "semantic comparison", one could report when an identifier is changed consistently across a block of code.
See our http://www.semanticdesigns.com/Products/SmartDifferencer/index.html for a syntax-tree-based comparison engine that works with many languages and does the above approximation.
EDIT Jan 2010: Versions available for C++, C#, Java, PHP, and COBOL.
The website shows specific examples for most of these.
EDIT May 2010: Python and JavaScript added.
EDIT Oct 2010: EGL added.
EDIT Nov 2010: VB6, VBScript, VB.net added
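Python's standard ast module is enough to demonstrate the syntax-tree idea on Python source. Below is a toy differ (unrelated to the products above; it only looks at return annotations, and needs Python 3.9+ for ast.unparse) that reports the question's exact example at the semantic level rather than as a character offset:

```python
import ast

def return_types(source):
    """Map each function name to its return annotation, read off the syntax tree."""
    tree = ast.parse(source)
    return {node.name: ast.unparse(node.returns) if node.returns else None
            for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef)}

def semantic_diff(old_src, new_src):
    old, new = return_types(old_src), return_types(new_src)
    return [f"Return type of {name}() changed from {old[name]} to {new[name]}"
            for name in old
            if name in new and old[name] != new[name]]

report = semantic_diff("def doSomething() -> void: ...",
                       "def doSomething() -> bool: ...")
print(report)  # ['Return type of doSomething() changed from void to bool']
```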
What you're groping for is a "tree diff". It turns out that this is much harder to do well than a simple line-oriented textual diff, which is really just the comparison of two flat sequences.
"A Fine-Grained XML Structural Comparison Approach" concludes, in part with:
Our theoretical study as well as our experimental evaluation showed that the proposed method yields improved structural similarity results with respect to existing alternatives, while having the same time complexity (O(N^2))
(emphasis mine)
Indeed, if you're looking for more examples of tree differencing I suggest focusing on XML since that's been driving practical developments in that area.
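To give a feel for the shape of such algorithms, here is a deliberately naive recursive tree diff over XML in Python. Real tools match subtrees with edit-distance algorithms and detect moves; this sketch only walks the two trees in lockstep:

```python
import xml.etree.ElementTree as ET

def tree_diff(a, b, path=""):
    """Recursively report structural differences between two element trees."""
    here = f"{path}/{a.tag}"
    if a.tag != b.tag:
        return [f"{here}: tag changed to {b.tag}"]
    diffs = []
    if (a.text or "").strip() != (b.text or "").strip():
        diffs.append(f"{here}: text changed")
    if a.attrib != b.attrib:
        diffs.append(f"{here}: attributes {a.attrib} -> {b.attrib}")
    for child_a, child_b in zip(a, b):          # lockstep walk: no move detection
        diffs.extend(tree_diff(child_a, child_b, here))
    if len(a) != len(b):
        diffs.append(f"{here}: child count {len(a)} -> {len(b)}")
    return diffs

old = ET.fromstring("<fn ret='void'><name>doSomething</name></fn>")
new = ET.fromstring("<fn ret='bool'><name>doSomething</name></fn>")
print(tree_diff(old, new))
# ["/fn: attributes {'ret': 'void'} -> {'ret': 'bool'}"]
```

Even this naive version reports the change at the level of the document's structure instead of a line/character offset.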
Shameless plug for my own project:
HTML Tree Diff does structure-aware comparison of xml and html documents, written in python.
http://pypi.python.org/pypi/html-tree-diff/0.1.0
The solution to this would be on a per-language basis. That is, unless it's designed with a plugin architecture that defers the parsing of the code into a tree and the semantic comparison to a language-specific plugin, it will be very difficult to support multiple languages. What language(s) are you interested in having such a tool for? Personally, I'd love one for C#.
For C# there is an assembly diff add-in to Reflector but it only does a diff on the IL not the C#.
You can download the diff add-in here [zip] or go to the project on the codeplex site here.
A company called Zynamics offers a binary-level semantic diff tool. It uses a meta-assembly language called REIL to perform graph-theoretic analysis of 2 versions of a binary, and produces a color-coded graph to illustrate differences between them. I am not sure of the price, but I doubt it is free.
http://prettydiff.com/
Pretty Diff minifies each input to remove comments and unnecessary white space, and then beautifies the code prior to the diff algorithm. I cannot think of any way to be more code-semantic than that. And it's written in JavaScript, so it runs directly in the browser.

How do you maintain your program vocabulary? [closed]

Closed 9 years ago.
In a not-so-small program with not-so-few entities, in order to maintain code readability, keep terms consistent, and otherwise improve mutual understanding between team members, one has to define and maintain a program vocabulary.
How do you (or your company) deal with this task, what discipline do you have, what arrangements do you introduce?
Most projects of reasonable size should have a programming/coding standards document that dictates common conventions and naming guidelines that should be followed.
Another way to help with this is through code reviews. Obviously some coordination among reviewers is required (the document helps with that, too). Code reviews help keep the greener devs and senior devs alike on track and act as an avenue to enforce the coding standards.
@Ilya Ryzhenkov,
I'm afraid most companies don't have such a practice :) I've worked at a not-so-small company with a multimillion-LOC code base, and it didn't have any such documentation at all (besides a common coding guideline).
On one of my projects we maintained a thesaurus of common terms used in our application domain and used it during code reviews. I analyzed .NET XML documentation diffs from time to time to decide which entities/terms should be added to the thesaurus. The only means of enforcing compliance with the thesaurus was the coding guideline.
The wiki approach proved to be non-applicable, because nobody cared to update it regularly :)
I'm wondering what methods you use at JetBrains? I've inspected ReSharper's code in Reflector and was amazed by the number and names of its entities :)
Divide your packages/modules into logical groups and use descriptive, concise names. Avoid generic names except where things really are just counters, etc. Create conventions for groups of functions or functionality, and stick to them.
Domain-Driven Design is interesting here, since it encourages programmers to embrace the domain vocabulary. On top of that, it has some design conventions which allow you to refer to parts of your application using well-known terms, like services, repositories, factories, etc.
Combining the domain vocabulary with those technical conventions on top of it could be a good solution.
My team keeps this kind of information (conventions/vocabulary etc.) on a wiki. This makes it easy to keep up to date and share.