Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I want to highlight C/C++/Java/C# etc. source code on my website.
How can I do this?
Is highlighting the source code a CPU-intensive job?
You can either do this server-side or client-side. It's not very processor-intensive, but if you do it client-side (using JavaScript) there will be a noticeable lag. Most client-side solutions revolve around Google Code's syntax highlighting engine. This seems to be the most popular one: SyntaxHighlighter
Server-side solutions tend to be more flexible, especially in the way of defining new languages and configuring how they are highlighted (e.g. the colors used). I use GeSHi, which is a PHP solution with a moderately nice plugin for WordPress. There are also a few libraries built for Java, and even some that are based upon Vim (usually requiring a Perl module to be installed from CPAN).
In short: you have quite a few options; what are your criteria? It's hard to make a solid recommendation without knowing your requirements.
I use GeSHi ("Generic Syntax Highlighter") on pastebin.com
pastebin has high traffic, so I do cache the results of the transformation, which certainly reduces the load.
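For what it's worth, a minimal sketch of that caching idea in Python; highlight_source() here is a hypothetical stand-in for whatever highlighter you actually call:

import hashlib

_cache = {}

def cached_highlight(source, language):
    # Key the cache on content and language so identical pastes are only highlighted once.
    key = hashlib.sha1((language + "\0" + source).encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = highlight_source(source, language)  # hypothetical highlighter call
    return _cache[key]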
Personally, I prefer offline tools: I don't see the point of parsing the code (particularly large files) over and over, for each served page, or even worse, in each browser (for JS libraries), because, as pointed out above, these libraries often lag (you often see the raw source before it is formatted).
There are a number of tools for this job, some mentioned above. I just use the export feature of my favorite editor (SciTE) because it respects the color choices I carefully set up... :-) And it can output XML, PDF, RTF and LaTeX too.
Pygments is a good Python library for generating HTML, RTF, ANSI (terminal-style) or LaTeX output. It supports a large range of languages (C, C++, Lua, Erlang, ...) and you can even write your own output formatter.
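For example, a minimal Pygments sketch (assuming the pygments package is installed) that renders a C++ snippet as a standalone HTML page:

from pygments import highlight
from pygments.lexers import CppLexer
from pygments.formatters import HtmlFormatter

code = 'int main() { return 0; }'
# full=True embeds the CSS, so the result is a complete HTML page
html = highlight(code, CppLexer(), HtmlFormatter(full=True, linenos=True))
print(html)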
I use google-code-prettify. It is the simplest to set up and works great with all C-style languages.
If you use jEdit, you might want to use the Code2HTML plugin.
I use SyntaxHighlighter on my blog.
Just run it through a tool like: http://www.gnu.org/software/src-highlite/
If you are using PHP, you can use GeSHi to highlight many different languages. I've used it before and it works quite well. A quick Google search will also uncover GeSHi plugins for WordPress and Drupal.
I wouldn't consider highlighting to be CPU intensive unless you are intending to display megabytes of it all at once. And even then, the CPU load would be minimal and your main problem would be transfer speed for it all.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
I often feel that after iterating over my code a number of times, I am left with functions, classes, or other lines of code that made sense in a previous revision but are not very useful in the new one. I know that a profiler can tell you what parts of your code were called when you run your test cases. But how does one go about identifying what part of the code never got called, so that it can be removed and what's left is more readable? For example, is there a quick way to know which functions in your code are not being called from anywhere and can be safely removed? It may sound like a trivial question for a small code base, but when your code base grows over the years, this becomes an important and not so trivial question.
To summarize the question: for different languages, what is the best approach to removing dead code? Are there any language-agnostic solutions or strategies for this, or does each language provide its own tool for identifying dead code?
We normally program in Java or Python or Objective-C.
The term you're looking for is "code coverage" and there are various tools that will generate that information. You would have to make sure that you exercise every possible path through your code in order to be able to detect "dead code" with such a tool though, which is only possible with a really extensive set of tests.
Most compilers have some level of dead-code detection, but that only detects code that cannot possibly be called, not code that will never be called due to program logic, etc.
edit:
for Python specifically: How can you find unused functions in Python code?
for Java: How to find unused/dead code in java projects, Java: Dead code elimination
for Objective-C: Xcode -- finding dead methods in a project, Cleaning up Objective-C code
For functions, try a global search on the function name, and analyze what you get. Dead code inside a function will usually be findable.
If you suspect a function of being unused, you can remove it, or comment it out, and see if what you've got still compiles.
This only works on functions that are unused because they are no longer called. Functionality that is never used because the control path through the code is no longer active is harder to find, and code analysis tools won't do well at finding it either.
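As a rough, hedged illustration of automating that global search for Python, the standard ast module can list functions that are defined in a file but never referenced anywhere in it. It knows nothing about dynamic calls (getattr, reflection, callbacks registered by name), so treat the output strictly as candidates to investigate:

import ast
import sys

def unused_function_candidates(path):
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    defined, referenced = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            defined.add(node.name)       # every function definition
        elif isinstance(node, ast.Name):
            referenced.add(node.id)      # plain names, including call targets
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)    # method-style references
    return sorted(defined - referenced)

if __name__ == "__main__":
    for name in unused_function_candidates(sys.argv[1]):
        print(name)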
You can use a code coverage report to find functions that are never used, or parts of functions that are never executed.
Based on the program logic, you can treat them as dead/unused code. (A minimal Coverage.py sketch follows the tool list below.)
Popular code coverage tools that can be used:
C/C++: gcov & lcov
Python: Coverage.py
Java: JCov
Objective-C: xccov
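A minimal Coverage.py sketch using its Python API; my_module and run_test_suite are hypothetical placeholders for your own code and tests:

import coverage

cov = coverage.Coverage()
cov.start()

import my_module              # hypothetical module under test
my_module.run_test_suite()    # hypothetical entry point that exercises the code

cov.stop()
cov.save()
cov.report(show_missing=True)  # lines that never ran are dead-code candidates

Keep in mind the caveat above: this only flags code your tests never reached, not code that is provably unreachable.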
I'm wrapping a C++ library in PHP using SWIG and there have been some occasions where I want to modify the generated code (both generated C++ and PHP):
Fix code-generation errors
Add code that makes sense in PHP, but not in C++ (e.g. type checking)
Add documentation tags (e.g. phpDoc)
I'm currently automating these modifications with patch. This approach works, but it seems high-maintenance and fragile. Is there a better way of doing this?
The best bet is to have your code generator generate correct code for your needs. Hand-tweaking generated output is unsustainable. You'll have to tweak it again any time the input changes.
If a tool is producing flatly erroneous output, the ideal fix is to repair it and submit patches back to the maintainer. If the output is correct for some circumstances but wrong for yours, I'd suggest adding an option that changes the behavior to what you need.
Sometimes, you can use a short program that automatically does an intelligent job of patching your generated code, so that you don't need a manual process to make patches.
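For instance, a hedged sketch of such a post-processing step in Python; the regex and the generated-code convention it assumes (plain "function foo(...)" lines in the generated PHP) are hypothetical, so adapt them to what SWIG actually emits for you:

import re

def add_phpdoc_stubs(path):
    # Insert a phpDoc stub above every generated top-level PHP function.
    with open(path, encoding="utf-8") as f:
        source = f.read()
    patched = re.sub(
        r"(?m)^(function\s+(\w+)\s*\()",
        r"/** Auto-generated wrapper for \2. */\n\1",
        source,
    )
    with open(path, "w", encoding="utf-8") as f:
        f.write(patched)

add_phpdoc_stubs("generated_wrapper.php")  # hypothetical output file from SWIG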
Alternatively, you could write your own code generator, but I suspect that's much more work than you want. It also depends on what you're doing. Sometimes code-generation is really just macro-expansion, and there are plenty of good tools for that in the wild.
Good luck!
You may end up with a maintenance nightmare later on. Instead of SWIG, you might consider using another generative approach that:
Lets you add your custom code directly in the model (so that you won't need to add it post-generation)
Lets you define your own generator. This feature alone could remove the need to add custom code at all.
The problem of using third-party generators is that they never really generate what you want. The problem of writing your own code generators is that it's much more work. You choose.
But correcting an automation with another automation...
Code generation is quite a wide topic and there are definitely many other approaches, some of which might be more interesting to you, as mentioned above.
But if you do not want to use another tool, then depending on what code is generated and on PHP's OO capabilities, you might use the Generation Gap pattern.
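A minimal sketch of the Generation Gap idea, shown in Python for brevity rather than the generated PHP (all names here are hypothetical): the generator owns a base class that is regenerated freely, and the hand-written subclass carries the custom additions, so they are never overwritten.

# generated_wrapper.py -- regenerated by the tool, never edited by hand
class GeneratedImageFilter:                      # hypothetical generated wrapper
    def apply(self, image, strength):
        ...                                      # raw generated binding code lives here

# image_filter.py -- hand-written, survives regeneration
class ImageFilter(GeneratedImageFilter):
    def apply(self, image, strength):
        if not isinstance(strength, (int, float)):   # the extra type checking
            raise TypeError("strength must be a number")
        return super().apply(image, strength)

Callers use ImageFilter; only the base class is ever regenerated.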
How does one study open-source libraries code, particularly standard libraries?
The code base is often vast and hard to navigate. How do you find a particular function or class definition?
Do I search through downloaded source files?
Do I need cvs/svn for that?
Maybe web-search?
Should I just know the structure of the standard library?
Is there any reference on it?
Or do some IDEs have such features? Or some other tools?
How do you do it effectively without one?
What are the best practices of doing this in any open-source libraries?
Is there any convention for how sources are organized on Linux/Unix systems?
What are the differences for specific programming languages?
A broad presentation of the subject is highly encouraged.
I mark this 'community wiki' so everyone can rephrase and expand my awkward formulations!
Update: I probably didn't express the problem clearly enough. What I want is to view just the source code of some specific library class or function. The problem is mostly about work organization and usability: how do I navigate the huge pile of sources to find the thing? Maybe there are specific tools or approaches? It feels like some solution(s) for this should have existed for a long time.
One thing to note is that standard libraries are sometimes (often?) optimized more than is good for most production code.
Because they are widely used, they have to perform well over a wide variety of conditions, and may be full of clever tricks and special logic for corner cases.
Maybe they are not the best thing to study as a beginner.
Just a thought.
Well, I think it's insane to just sit down and read a library's code. My approach is to search whenever I come across the need to implement something myself, and then study the way it's implemented in those libraries.
And there are also a lot of projects/libraries with excellent documentation, which I find more important to read than the code. On Unix-based systems you often find valuable information in the man pages.
Wow, that's a big question.
The short answer: it depends.
The long answer:
Some libraries provide documentation while others don't. Standard libraries are usually pretty well documented, whether or not your chosen implementation of the library includes documentation. For instance, you may have found an implementation of the C standard library without documentation, but the C standard has been around long enough that there are hundreds of good reference books available. Documentation with hyperlinks is a very useful way to learn a new API. In any case, the first place I would look is the library's main website.
For less well-known libraries lacking documentation, I find two different approaches very helpful.
First is a doc generator. Nearly every language I know of has one. It basically parses a source tree and creates documentation (usually as HTML or XML) which can be used to learn a library. Some use specially formatted comments in the code to create more complete documentation. JavaDoc is one good example of this. Doc generators for many other languages borrow from JavaDoc.
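To make the idea concrete, a minimal Python example of the same style of doc comment: the generator reads the docstring straight out of the source, and running python -m pydoc on the module (or pydoc -w to write HTML) renders it as browsable documentation.

def clamp(value, low, high):
    """Return value limited to the closed interval [low, high].

    A doc generator (pydoc, Sphinx, javadoc for Java, ...) picks this text
    up directly from the source and turns it into reference documentation.
    """
    return max(low, min(high, value))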
Second, an IDE with a class browser. These act as a sort of on-the-fly documentation. Some display just the library's interface. Others include description comments from the library's source.
Both of these require access to the library's source (which will come in handy if you intend to actually use the library).
Many of these tools and techniques work equally well for closed/proprietary libraries.
The standard Java libraries' source code is available. For a beginning Java programmer this can be a great read. The Collections framework in particular is a good place to start. Take for instance the implementation of ArrayList and learn how you can implement a resizable array in Java. Most of the source even has useful comments.
The best parts to read are probably those whose purpose you can understand immediately. Start with the easy pieces and try to follow all the steps that are hidden behind that single call you make from your own code.
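As a toy illustration of what such a read teaches, here is a hedged sketch of the grow-by-doubling idea behind a resizable array, in Python rather than the actual ArrayList source:

class DynamicArray:
    # Toy resizable array: not ArrayList itself, just the same growth strategy.
    def __init__(self):
        self._capacity = 4
        self._size = 0
        self._items = [None] * self._capacity

    def append(self, value):
        if self._size == self._capacity:          # backing store full: double it
            self._capacity *= 2
            bigger = [None] * self._capacity
            bigger[:self._size] = self._items[:self._size]
            self._items = bigger
        self._items[self._size] = value
        self._size += 1

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        return self._items[index]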
Something I do from time to time:
apt-get source foo
Then create a new C++ project (or whatever) in Eclipse and import the source.
=> Wow! Browsable! (use F3)
How does Stack Overflow show the revision changes in the diff-like format they use?
I don't care about Stack Overflow per se; it's just a convenient way to describe my requirement. I have an audit history of changes to a text field. I'd like to show the changes the same way Stack Overflow shows revision history changes. I recall a Stack Overflow podcast where Jeff Atwood discussed it, but I can't find it in the transcripts and have no idea what podcast. IIRC, it's not .NET based, maybe Python?
This is for end-user consumption, so anything that looks like a Unix-style diff is out. It is tempting to show two blocks of text, old and new, and let them figure it out, but the Stack Overflow revision history is sooo much nicer.
The Python difflib standard library provides this sort of capability:
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.
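A minimal difflib sketch that writes a side-by-side HTML view, similar in spirit to a revision-history diff:

import difflib

old = "The quick brown fox\njumps over the lazy dog\n".splitlines()
new = "The quick brown fox\nleaps over the sleepy dog\n".splitlines()

# HtmlDiff produces a complete HTML page containing a side-by-side table
# with intra-line changes highlighted.
html = difflib.HtmlDiff(wrapcolumn=60).make_file(old, new, "old", "new")
with open("diff.html", "w", encoding="utf-8") as f:
    f.write(html)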
Since you didn't really specify a language: I've done this using the PHP PEAR package Text_Diff.
They used Beyond Compare. As far as I can tell, it's not a native .NET program, but you can use it as a command-line tool.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 months ago.
I would like to get data from different webpages, such as addresses of restaurants or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?
If using Python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).
It's an extremely capable library and makes scraping a breeze.
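A minimal Beautiful Soup sketch (the modern package name is bs4; the URL below is just a placeholder):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("http://example.com").read()
soup = BeautifulSoup(html, "html.parser")

# Print every hyperlink target and its visible text.
for link in soup.find_all("a", href=True):
    print(link["href"], link.get_text(strip=True))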
The HTML Agility Pack is awesome for .NET programmers. It turns webpages into XML documents that can be queried with XPath.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// DocumentNode is the document root; FixLink() is the author's own helper.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");
You can find it here: http://www.codeplex.com/htmlagilitypack
I think the general answer here is to use any language + http library + html/xpath parser. I find that using ruby + hpricot gives a nice clean solution:
require 'rubygems'
require 'hpricot'
require 'open-uri'
sites = %w(http://www.google.com http://www.stackoverflow.com)
sites.each do |site|
  doc = Hpricot(open(site))
  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end
For more on Hpricot see http://code.whytheluckystiff.net/hpricot/
I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser, (i.e. you can follow links, fill out forms, or use the "back button" by calling methods on it).
For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
I think Watir or Selenium are the best choices. Most of the other libraries mentioned are actually HTML parsers, and that is not what you want... You are scraping; if the owner of the website wanted you to get at his data, he'd put a dump of his database or site on a torrent and avoid all the HTTP requests and expensive traffic.
Basically, you need to parse HTML, but more importantly you need to automate a browser, to the point of being able to move the mouse and click, really mimicking a user. You need to use a screen-capture program to get at the CAPTCHAs and send them off to decaptcha.com (which solves them for a fraction of a cent) to circumvent that. Forget about saving that CAPTCHA file by parsing the HTML without rendering it in a browser 'as it is supposed to be seen'. You are screen scraping, not HTTP-request scraping.
Watir did the trick for me in combination with AutoItX (for moving the mouse and entering keys in fields; sometimes this is necessary to set off the right JavaScript events) and a simple screen-capture utility for the CAPTCHAs. This way you will be most successful; it's quite useless to write a great HTML parser only to find out that the owner of the site has turned some of the text into graphics. (Problematic? No, just get an OCR library and feed it the JPEG; the text will be returned.) Besides, I have rarely seen them go that far, although on Chinese sites there is a lot of text in graphics.
XPath saved my day all the time; it's a great domain-specific language (IMHO, I could be wrong) and you can get to any tag in the page, although sometimes you need to tweak it.
What I did miss was 'reverse templates' (the robot framework of Selenium has this). Perl had this in the CPAN module Template::Extract, very handy.
The HTML parsing, or the creation of the DOM, I would leave to the browser; yes, it won't be as fast, but it'll work all the time.
Also, libraries that pretend to be user agents are useless; sites are protected against scraping nowadays, and rendering the site on a real screen is often necessary to get beyond the CAPTCHAs, but also to trigger the JavaScript events needed for information to appear, etc.
Watir if you're into Ruby, Selenium for the rest, I'd say. The 'Human Emulator' (or 'Web Emulator' in Russia) is really made for this kind of scraping, but then again it's a Russian product from a company that makes no secret of its intentions.
I also think that one of these weeks Wiley has a new book coming out on scraping; that should be interesting. Good luck...
I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.
The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.
I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html
It can be useful as a base, you'd probably want to create your own module that fits your restaurant mining needs.
LWP would give you a basic crawler for you to build on.
There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends which run on top of Ruby Mechanize which make things even easier.
What language do you want to use?
curl with awk might be all you need.
You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
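For example, once tidy has produced valid XHTML, Python's lxml can query it with XPath (a hedged sketch; the file name, namespace prefix, and the div class used here are hypothetical):

from lxml import etree

# cleaned.xhtml is assumed to be tidy's XHTML output
tree = etree.parse("cleaned.xhtml")
ns = {"x": "http://www.w3.org/1999/xhtml"}   # bind the XHTML namespace to a prefix for XPath

for addr in tree.xpath("//x:div[@class='address']", namespaces=ns):
    print(etree.tostring(addr, method="text", encoding="unicode").strip())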
I'd recommend BeautifulSoup. It isn't the fastest, but it copes really well with the non-well-formedness of (X)HTML pages that most parsers choke on.
What someone else said: use ANY LANGUAGE.
As long as you have a good parser library and an HTTP library, you are set.
The tree stuff is slower than just using a good parsing library.