I have the below document in MongoDB.
I am using the below Python code to save it to a .json file.
from bson.json_util import dumps  # assuming dumps comes from pymongo's bson.json_util

file = 'employee'
json_cur = find_document(file)
count_document = emp_collection.count_documents({})
with open(file_path, 'w') as f:
    f.write('[')
    for i, document in enumerate(json_cur, 1):
        print("document : ", document)
        f.write(dumps(document))
        if i != count_document:
            f.write(',')
    f.write(']')
The output is:
{
"_id":{
"$oid":"611288c262c5c14df84f649b"
},
"Lname":"Borg",
"Fname":"James",
"Dname":"Headquarters",
"Projects":"[{"HOURS": 5.0, "PNAME": "Reorganization", "PNUMBER": 20}]"
}
But I need it like this (the Projects value without quotes):
{
"_id":{
"$oid":"611288c262c5c14df84f649b"
},
"Lname":"Borg",
"Fname":"James",
"Dname":"Headquarters",
"Projects":[{"HOURS": 5.0, "PNAME": "Reorganization", "PNUMBER": 20}]
}
Could anyone please help me to resolve this?
Thanks,
Jay
You should parse the JSON stored in the Projects field, like this:
from json import loads
document['Projects'] = loads(document['Projects'])
So the full loop becomes:
file = 'employee'
json_cur = find_document(file)
count_document = emp_collection.count_documents({})
with open(file_path, 'w') as f:
    f.write('[')
    for i, document in enumerate(json_cur, 1):
        document['Projects'] = loads(document['Projects'])
        print("document : ", document)
        f.write(dumps(document))
        if i != count_document:
            f.write(',')
    f.write(']')
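As a side note, if the collection fits in memory you can skip the manual bracket and comma bookkeeping by collecting the documents into a list and serializing once. This is only a sketch; it assumes dumps/loads come from pymongo's bson.json_util and reuses the emp_collection and file_path names from above:
from bson.json_util import dumps, loads

documents = []
for document in emp_collection.find({}):
    # Projects is stored as a JSON string, so decode it before re-serializing
    document['Projects'] = loads(document['Projects'])
    documents.append(document)

with open(file_path, 'w') as f:
    f.write(dumps(documents))  # one call writes a valid JSON array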
I was able to take a text file, read each line, create a dictionary per line, update (append) each entry, and store the result in a JSON file. The issue is that the JSON file will not read back correctly; does the error point to an issue with how the file is stored?
The text file looks like:
84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley
98.txt; A Tale of Two Cities; Charles Dickens
...
import json
import re

path = "C:\\...\\data\\"
books = {}
books_json = {}
final_book_json = {}

file = open(path + 'books\\set_of_books.txt', 'r')
json_list = file.readlines()
open(path + 'books\\books_json.json', 'w').close()  # used to clean each test

json_create = []
i = 0
for line in json_list:
    line = line.replace('#', '')
    line = line.replace('.txt', '')
    line = line.replace('\n', '')
    line = line.split(';', 4)
    BookNumber = line[0]
    BookTitle = line[1]
    AuthorName = line[-1]
    if BookNumber == ' 2701':
        BookNumber = line[0]
        BookTitle1 = line[1]
        BookTitle2 = line[2]
        AuthorName = line[3]
        BookTitle = BookTitle1 + ';' + BookTitle2  # needed to combine the title into one string to fit the dict format
    books = json.dumps({'AuthorName': AuthorName, 'BookNumber': BookNumber, 'BookTitle': BookTitle})
    books_json = json.loads(books)
    final_book_json.update(books_json)
    with open(path + 'books\\books_json.json', 'a') as out_put:
        json.dump(books_json, out_put)

with open(path + 'books\\books_json.json', 'r') as out_put:
    print(json.load(out_put))
The reported error is JSONDecodeError: Extra data: line 1 column 133 (char 132), which points right between the first "}{" in the file. I'm not sure how JSON should look in a flat-file format. The output file, as seen in an editor, looks like:
{"AuthorName": " Mary Wollstonecraft (Godwin) Shelley", "BookNumber": " 84", "BookTitle": " Frankenstein, or the Modern Prometheus"}{"AuthorName": " Charles Dickens", "BookNumber": " 98", "BookTitle": " A Tale of Two Cities"}...
I ended up changing the approach and used pandas to read the text and then split the single-cell input.
books = pd.read_csv(path + 'books\\set_of_books.txt', sep='\t', names=('r', 't', 'a'))
# print(books.head(10))

# Function to clean the 'raw (r)' input data
def clean_line(cell):
    ...
    return cell

books['r'] = books['r'].apply(clean_line)
books = books['r'].str.split(';', expand=True)
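From there, one way to end up with a JSON file that loads cleanly is to name the split columns and dump all rows as a single array. The column names below are assumptions about what the split produces:
# Hypothetical names for the split columns; any extra columns keep their numeric labels
books = books.rename(columns={0: 'BookNumber', 1: 'BookTitle', 2: 'AuthorName'})

# orient='records' writes one JSON array of row objects, which json.load accepts
books.to_json(path + 'books\\books_json.json', orient='records')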
I have a CSV file like below
COL1,COL2,COL3,COL4
3920,10163,"ST. PAUL, MN",TWIN CITIES
I want to read the file and split the fields on commas that sit outside double quotes, WITHOUT using any external libraries. For example, the above CSV needs to be split into 4 parts:
1. 3920
2. 10163
3. ST. PAUL, MN
4. TWIN CITIES
I tried using a regex with the following code, but it never worked. I want to make this work using Groovy; I tried various solutions given for Java, but couldn't get any of them to work.
NOTE: I don't want to use any external Grails plugins/JARs to make this work.
def staticCSV = new File("staticMapping.csv")
staticCSV.eachLine { line ->
    def parts = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\${1})")
    parts.each {
        println "${it}"
    }
}
Got the solution:
def getcsvListofListFromFile(String fileName) {
    def lol = []
    def r1 = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\$)"
    try {
        def csvf = new File(fileName)
        csvf.eachLine { line ->
            def c1 = line.split(r1)
            def c2 = []
            c1.each { e1 ->
                def s = e1.toString()
                s = s.replaceAll('^"', "").replaceAll('"\$', "")
                c2.add(s)
            }
            lol.add(c2)
        }
        return (lol)
    } catch (Exception e) {
        def eMsg = "Error Reading file [" + fileName + "] --- " + e.getMessage()
        throw new RuntimeException(eMsg)
    }
}
Using a ready-made library is a better idea. But you certainly have your reasons.
Here is an alternative solution to yours. It splits the lines with commas and reassembles the parts that originally belonged together (see multipart).
def content =
"""COL1,COL2,COL3,COL4
3920,10163, "ST. PAUL, MN" ,TWIN CITIES
3920,10163, " ST. PAUL, MN " ,TWIN CITIES, ,"Bla,Bla, Bla" """
content.eachLine { line ->
    def multiPart
    for (part in line.split(/,/)) {
        if (!part.trim()) continue // for empty parts
        if (part =~ /^\s*\"/) { // beginning of a multipart
            multiPart = part
            continue
        } else if (part =~ /"\s*$/) { // end of the multipart
            multiPart += "," + part
            println multiPart.replaceAll(/"/, "").trim()
            multiPart = null
            continue
        }
        if (multiPart) {
            multiPart += "," + part
        } else {
            println part.trim()
        }
    }
}
Output (you can copy the code above directly into the GroovyConsole to run it):
COL1
COL2
COL3
COL4
3920
10163
ST. PAUL, MN
TWIN CITIES
3920
10163
ST. PAUL, MN
TWIN CITIES
Bla,Bla, Bla
I have a file which looks exactly as below.
{"eventid" : "12345" ,"name":"test1","age":"18"}
{"eventid" : "12346" ,"age":"65"}
{"eventid" : "12336" ,"name":"test3","age":"22","gender":"Male"}
Think of the above file as event.json. The number of data objects may vary per line. I would like the following CSV output, written to output.csv:
eventid,name,age,gender
12345,test1,18
12346,,65
12336,test3,22,Male
Could someone kindly help me? I can accept an answer in any scripting language (JavaScript, Python, etc.).
This code will collect all the headers dynamically and write the file to CSV.
Read comments in code for details:
import json
# Load data from file
data = '''{"eventid" : "12345" ,"name":"test1","age":"18"}
{"eventid" : "12346" ,"age":"65"}
{"eventid" : "12336" ,"name":"test3","age":"22","gender":"Male"}'''
# Store records for later use
records = []

# Keep track of headers in a set
headers = set()

for line in data.split("\n"):
    line = line.strip()
    # Parse each line as JSON
    parsedJson = json.loads(line)
    records.append(parsedJson)
    # Make sure all found headers are kept in the headers set
    for header in parsedJson.keys():
        headers.add(header)

# You only know what headers were there once you have read all the JSON once.
# Now we have all the information we need, like what all the possible headers are.
outfile = open('output_json_to_csv.csv', 'w')

# Write headers to the file in order
outfile.write(",".join(sorted(headers)) + '\n')

for record in records:
    # Write each record based on available fields
    curLine = []
    # For each header in alphabetical order
    for header in sorted(headers):
        # If that record has the field
        if header in record:
            # Then write that value to the line
            curLine.append(record[header])
        else:
            # Otherwise put an empty value as a placeholder
            curLine.append('')
    # Write the line to file
    outfile.write(",".join(curLine) + '\n')

outfile.close()
Here is a solution using jq.
If filter.jq contains the following filter
(reduce (.[]|keys_unsorted[]) as $k ({};.[$k]="")) as $o # object with all keys
| ($o | keys_unsorted), (.[] | $o * . | [.[]]) # generate header and data
| join(",") # convert to csv
and data.json contains the sample data then
$ jq -Mrs -f filter.jq data.json
produces
eventid,name,age,gender
12345,test1,18,
12346,,65,
12336,test3,22,Male
Here's a Python solution (should work in both Python 2 & 3).
I'm not proud of the code, as there's probably a better way to do this (using the csv module) but this gives you the desired output.
I've taken the liberty of naming your JSON data data.json and I'm naming the output csv file output.csv.
import json
header = ['eventid', 'name', 'age', 'gender']
with open('data.json', 'r') as infile, \
     open('output.csv', 'w+') as outfile:
    # Writes header row
    outfile.write(','.join(header))
    outfile.write('\n')
    for row in infile:
        line = ['', '', '', '']  # I'm sure there's a better way
        datarow = json.loads(row)
        for key in datarow:
            line[header.index(key)] = datarow[key]
        outfile.write(','.join(line))
        outfile.write('\n')
Hope this helps.
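For completeness, the "better way using the csv module" mentioned above might look like the following Python 3 sketch: csv.DictWriter with the same data.json / output.csv names, where restval='' fills in any missing fields.
import csv
import json

header = ['eventid', 'name', 'age', 'gender']

with open('data.json', 'r') as infile, \
     open('output.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=header, restval='')
    writer.writeheader()                   # header row
    for row in infile:
        writer.writerow(json.loads(row))   # missing keys become empty cells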
Using AngularJS with the ngCsv plugin, we can generate a CSV file from the desired JSON with dynamic headers.
Run it in Plunker
// Code goes here
var myapp = angular.module('myapp', ["ngSanitize", "ngCsv"]);
myapp.controller('myctrl', function($scope) {
$scope.filename = "test";
$scope.getArray = [{
label: 'Apple',
value: 2,
x:1,
}, {
label: 'Pear',
value: 4,
x:38
}, {
label: 'Watermelon',
value: 4,
x:38
}];
$scope.getHeader = function() {
var vals = [];
for( var key in $scope.getArray ) {
for(var k in $scope.getArray[key]){
vals.push(k);
}
break;
}
return vals;
};
});
<!DOCTYPE html>
<html>
<head>
<link href="https://netdna.bootstrapcdn.com/bootstrap/3.0.0/css/bootstrap.min.css" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.7/angular.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.7/angular-sanitize.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/ng-csv/0.3.6/ng-csv.min.js"></script>
</head>
<body>
<div ng-app="myapp">
<div class="container" ng-controller="myctrl">
<div class="page-header">
<h1>ngCsv <small>example</small></h1>
</div>
<button class="btn btn-default" ng-csv="getArray" csv-header="getHeader()" filename="{{ filename }}.csv" field-separator="," decimal-separator=".">Export to CSV with header</button>
</div>
</div>
</body>
</html>
var arr = $.map(obj, function(el) { return el });
var content = "";
for (var element in arr) {
    // for...in yields indices, so look the value up in the array
    content += arr[element] + ",";
}
var filePath = "someFile.csv";
var fso = new ActiveXObject("Scripting.FileSystemObject");
var fh = fso.OpenTextFile(filePath, 8, false, 0);
fh.WriteLine(content);
fh.Close();
How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Language: JavaScript
Library: jQuery
$.each($('a[href]'), function(){
console.debug(this.href);
});
(using firebug console.debug for output...)
And loading any html page:
$.get('http://stackoverflow.com/', function(page){
$(page).find('a[href]').each(function(){
console.debug(this.href);
});
});
I used another each function for this one; I think it's cleaner when chaining methods.
Language: C#
Library: HtmlAgilityPack
class Program
{
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
}
language: Python
library: BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)  # find <a> with a defined href attribute
print links
output:
[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]
also possible:
for link in links:
    print link['href']
output:
http://foo.com
http://bar.com
http://baz.com
Language: Perl
Library: pQuery
use strict;
use warnings;
use pQuery;
my $html = join '',
"<html><body>",
(map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
"</body></html>";
pQuery( $html )->find( 'a' )->each(
sub {
my $at = $_->getAttribute( 'href' );
print "$at\n" if defined $at;
}
);
language: shell
library: lynx (well, it's not a library, but in shell every program is kind of a library)
lynx -dump -listonly http://news.google.com/
language: Ruby
library: Hpricot
#!/usr/bin/ruby
require 'hpricot'
html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'
doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }
language: Python
library: HTMLParser
#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
language: Perl
library: HTML::Parser
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
my $find_links = HTML::Parser->new(
start_h => [
sub {
my ($tag, $attr) = @_;
if ($tag eq 'a' and exists $attr->{href}) {
print "$attr->{href}\n";
}
},
"tag, attr"
]
);
my $html = join '',
"<html><body>",
(map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
"</body></html>";
$find_links->parse($html);
Language: Perl
Library: HTML::LinkExtor
The beauty of Perl is that you have modules for very specific tasks, like link extraction.
Whole program:
#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;
use LWP::Simple;
my $url = 'http://www.google.com/';
my $content = get( $url );
my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );
exit;
sub process_link {
my ( $tag, %attr ) = @_;
return unless $tag eq 'a';
return unless defined $attr{ 'href' };
print "- $attr{'href'}\n";
return;
}
Explanation:
use strict - turns on "strict" mode -
eases potential debugging, not fully
relevant to the example
use HTML::LinkExtor - loads the interesting module
use LWP::Simple - just a simple way to get some html for tests
my $url = 'http://www.google.com/' - which page we will be extracting urls from
my $content = get( $url ) - fetches page html
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates the LinkExtor object, giving it a reference to the function that will be used as a callback on every url, and $url to use as the BASEURL for relative urls
$p->parse( $content ) - pretty obvious I guess
exit - end of program
sub process_link - begin of function process_link
my ($tag, %attr) - get the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skip processing if the tag is not <a>
return unless defined $attr{'href'} - skip processing if the <a> tag doesn't have an href attribute
print "- $attr{'href'}\n"; - pretty obvious I guess :)
return; - finish the function
That's all.
Language: Ruby
Library: Nokogiri
#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO
(shown using DOM API, without using XPATH or STP API)
(defvar *html*
(who:with-html-output-to-string (stream)
(:html
(:body (loop
for site in (list "foo" "bar" "baz")
do (who:htm (:a :href (format nil "http://~A.com/" site))))))))
(defvar *dom*
(chtml:parse *html* (cxml-dom:make-dom-builder)))
(loop
for tag across (dom:get-elements-by-tag-name *dom* "a")
collect (dom:get-attribute tag "href"))
=>
("http://foo.com/" "http://bar.com/" "http://baz.com/")
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)
Selector expression:
(def test-select
(html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
Now we can do the following at the REPL (I've added line breaks in test-select):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
{:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
{:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
You'll need the following to try it out:
Preamble:
(require '[net.cgrand.enlive-html :as html])
Test HTML:
(def test-html
(apply str (concat ["<html><body>"]
(for [link ["foo" "bar" "baz"]]
(str "" link ""))
["</body></html>"])))
language: Perl
library: XML::Twig
#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';
use LWP::Simple;
use XML::Twig;
#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;
my $twig = XML::Twig->new();
$twig->parse_html($content);
my @hrefs = map {
$_->att('href');
} $twig->get_xpath('//*[@href]');
print "$_\n" for @hrefs;
caveat: Can get wide-character errors with pages like this one (changing the url to the one commented out will get this error), but the HTML::Parser solution above doesn't share this problem.
Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?
Language: Java
Libraries: XOM, TagSoup
I've included intentionally malformed and inconsistent XML in this sample.
import java.io.IOException;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;
public class HtmlTest {
public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
final Parser parser = new Parser();
parser.setFeature(Parser.namespacesFeature, false);
final Builder builder = new Builder(parser);
final Document document = builder.build("<html><body><ul><li>google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
final Element root = document.getRootElement();
final Nodes links = root.query("//a[@href]");
for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
final Node node = links.get(linkNumber);
System.out.println(((Element) node).getAttributeValue("href"));
}
}
}
TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:
root.query("//xhtml:a[#href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI())
Language: C#
Library: System.XML (standard .NET)
using System.Collections.Generic;
using System.Xml;
public static void Main(string[] args)
{
List<string> matches = new List<string>();
XmlDocument xd = new XmlDocument();
xd.LoadXml("<html>...</html>");
FindHrefs(xd.FirstChild, matches);
}
static void FindHrefs(XmlNode xn, List<string> matches)
{
if (xn.Attributes != null && xn.Attributes["href"] != null)
matches.Add(xn.Attributes["href"].InnerXml);
foreach (XmlNode child in xn.ChildNodes)
FindHrefs(child, matches);
}
Language: PHP
Library: SimpleXML (and DOM)
<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);
$links = $xml->xpath('//a[@href]');
foreach($links as $link)
echo $link['href']."\n";
Language: JavaScript
Library: DOM
var links = document.links;
for(var i in links){
var href = links[i].href;
if(href != null) console.debug(href);
}
(using firebug console.debug for output...)
Language: Racket
Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)
(require net/url
(planet ashinn/html-parser:1)
(planet clements/sxml2:1))
(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))
Above example using packages from the new package system: html-parsing and sxml
(require net/url
html-parsing
sxml)
(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))
Note: Install the required packages with 'raco' from a command line, with:
raco pkg install html-parsing
and:
raco pkg install sxml
language: Python
library: lxml.html
import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link
lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using JQuery:
for a in tree.cssselect('a[href]'):
    print a.get('href')
Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest
ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
NSData *response = [request responseData];
NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
[request release];
}
else
@throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];
...
- (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
if (nodes != nil)
return nodes;
return nil;
}
Language: Perl
Library : HTML::TreeBuilder
use strict;
use HTML::TreeBuilder;
use LWP::Simple;
my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;
for my $a ($document->find('a')) {
print $a->attr('href'), "\n" if $a->attr('href');
}
Language: PHP
Library: DOM
<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);
$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
echo $links->item($i)->getAttribute('href'), "\n";
Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress warnings about parsing invalid HTML.
Language: Python
Library: HTQL
import htql;
page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
query="<a>:href,tx";
for url, text in htql.HTQL(page, query):
print url, text;
Simple and intuitive.
language: Ruby
library: Nokogiri
#!/usr/bin/env ruby
require "nokogiri"
require "open-uri"
doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }
puts hrefs
Which outputs:
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
This is a minor spin on the one above, resulting in an output that is usable for a report. I only return the first and last elements in the list of hrefs:
#!/usr/bin/env ruby
require "nokogiri"
require "open-uri"
doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }
puts hrefs
.each_with_index # add an array index
.minmax{ |a,b| a.last <=> b.last } # find the first and last element
.map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output
1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com
Language: Java
Library: jsoup
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HtmlTest {
public static void main(final String[] args) throws IOException {
final Document document = Jsoup.parse("<html><body><ul><li>google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
final Elements links = document.select("a[href]");
for (final Element element : links) {
System.out.println(element.attr("href"));
}
}
}
Using phantomjs, save this file as extract-links.js:
var page = new WebPage(),
url = 'http://www.udacity.com';
page.open(url, function (status) {
if (status !== 'success') {
console.log('Unable to access network');
} else {
var results = page.evaluate(function() {
var list = document.querySelectorAll('a'), links = [], i;
for (i = 0; i < list.length; i++) {
links.push(list[i].href);
}
return links;
});
console.log(results.join('\n'));
}
phantom.exit();
});
run:
$ ../path/to/bin/phantomjs extract-links.js
Language: Coldfusion 9.0.1+
Library: jSoup
<cfscript>
function parseURL(required string url){
var res = [];
var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
//var dom = jSoupClass.parse(html); // if you already have some html to parse.
var dom = jSoupClass.connect( arguments.url ).get();
var links = dom.select("a");
for(var a=1;a LT arrayLen(links);a++){
var s={};s.href= links[a].attr('href'); s.text= links[a].text();
if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
}
return res;
}
//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">
Returns an array of structures; each struct contains HREF and TEXT keys.
Language: JavaScript/Node.js
Library: Request and Cheerio
var request = require('request');
var cheerio = require('cheerio');
var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html);
var anchorTags = $('a');
anchorTags.each(function(i,element){
console.log(element["attribs"]["href"]);
});
}
});
The Request library downloads the HTML document and Cheerio lets you use jQuery-style CSS selectors to target parts of it.