How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual answers here will be linked to from answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.
For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:
Language: [language name]
Library: [library name]
[example code]
Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:
Purpose: [what the parse does]
Language: JavaScript
Library: jQuery
$.each($('a[href]'), function(){
    console.debug(this.href);
});
(using Firebug's console.debug for output...)
And to load any HTML page:
$.get('http://stackoverflow.com/', function(page){
    $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});
I used another each() call for this one; I think it's cleaner when chaining methods.
Language: C#
Library: HtmlAgilityPack
class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
language: Python
library: BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links
output:
[<a href="http://foo.com">foo</a>, <a href="http://bar.com">bar</a>, <a href="http://baz.com">baz</a>]
also possible:
for link in links:
    print link['href']
output:
http://foo.com
http://bar.com
http://baz.com
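For reference, the same extraction with the newer Beautiful Soup 4 API looks like this (a minimal sketch, assuming the bs4 package and Python 3 print syntax):
from bs4 import BeautifulSoup  # Beautiful Soup 4 package name

html = '<html><body><a href="http://foo.com">foo</a><a href="http://bar.com">bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')   # naming the parser explicitly avoids a warning
for link in soup.find_all('a', href=True):  # findAll is spelled find_all in BS4
    print(link['href'])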
Language: Perl
Library: pQuery
use strict;
use warnings;
use pQuery;
my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    }
);
language: shell
library: lynx (well, it's not a library, but in the shell every program is a kind of library)
lynx -dump -listonly http://news.google.com/
language: Ruby
library: Hpricot
#!/usr/bin/ruby
require 'hpricot'
html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'
doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }
language: Python
library: HTMLParser
#!/usr/bin/python
from HTMLParser import HTMLParser
class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
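In Python 3 the same standard-library parser lives in html.parser; a minimal sketch of the equivalent (assuming Python 3):
from html.parser import HTMLParser  # Python 3 location of the module

class FindLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print(at['href'])

FindLinks().feed('<a href="http://foo.com">foo</a>')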
language: Perl
library: HTML::Parser
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
Language: Perl
Library: HTML::LinkExtor
The beauty of Perl is that you have modules for very specific tasks, like link extraction.
Whole program:
#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;
use LWP::Simple;
my $url = 'http://www.google.com/';
my $content = get( $url );
my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );
exit;
sub process_link {
    my ( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };
    print "- $attr{'href'}\n";
    return;
}
Explanation:
use strict - turns on "strict" mode - eases potential debugging, not fully relevant to the example
use HTML::LinkExtor - loads the interesting module
use LWP::Simple - just a simple way to get some HTML for tests
my $url = 'http://www.google.com/' - which page we will be extracting URLs from
my $content = get( $url ) - fetches the page's HTML
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback on every URL, and $url to use as the base URL for relative URLs
$p->parse( $content ) - pretty obvious I guess
exit - end of program
sub process_link - beginning of the process_link function
my ($tag, %attr) - get the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skip processing if the tag is not <a>
return unless defined $attr{'href'} - skip processing if the <a> tag doesn't have an href attribute
print "- $attr{'href'}\n"; - pretty obvious I guess :)
return; - finish the function
That's all.
Language: Ruby
Library: Nokogiri
#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO
(shown using DOM API, without using XPATH or STP API)
(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=>
("http://foo.com/" "http://bar.com/" "http://baz.com/")
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)
Selector expression:
(def test-select
  (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
Now we can do the following at the REPL (I've added line breaks in test-select):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
{:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
{:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
You'll need the following to try it out:
Preamble:
(require '[net.cgrand.enlive-html :as html])
Test HTML:
(def test-html
  (apply str (concat ["<html><body>"]
                     (for [link ["foo" "bar" "baz"]]
                       (str "<a href=\"http://" link ".com/\">" link "</a>"))
                     ["</body></html>"])))
language: Perl
library: XML::Twig
#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';
use LWP::Simple;
use XML::Twig;
#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;
my $twig = XML::Twig->new();
$twig->parse_html($content);
my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;
caveat: you can get wide-character errors with pages like this one (changing the URL to the one commented out will trigger the error), but the HTML::Parser solution above doesn't share this problem.
Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?
Language: Java
Libraries: XOM, TagSoup
I've included intentionally malformed and inconsistent XML in this sample.
import java.io.IOException;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;
public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li>google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}
TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:
root.query("//xhtml:a[#href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI())
Language: C#
Library: System.XML (standard .NET)
using System.Collections.Generic;
using System.Xml;
public static void Main(string[] args)
{
    List<string> matches = new List<string>();
    XmlDocument xd = new XmlDocument();
    xd.LoadXml("<html>...</html>");
    FindHrefs(xd.FirstChild, matches);
}

static void FindHrefs(XmlNode xn, List<string> matches)
{
    if (xn.Attributes != null && xn.Attributes["href"] != null)
        matches.Add(xn.Attributes["href"].InnerXml);

    foreach (XmlNode child in xn.ChildNodes)
        FindHrefs(child, matches);
}
Language: PHP
Library: SimpleXML (and DOM)
<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);
$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";
Language: JavaScript
Library: DOM
var links = document.links;
for(var i in links){
    var href = links[i].href;
    if(href != null) console.debug(href);
}
(using Firebug's console.debug for output...)
Language: Racket
Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)
(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))
The same example using packages from the new package system, html-parsing and sxml:
(require net/url
         html-parsing
         sxml)

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))
Note: Install the required packages with 'raco' from a command line, with:
raco pkg install html-parsing
and:
raco pkg install sxml
language: Python
library: lxml.html
import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link
lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using JQuery:
for a in tree.cssselect('a[href]'):
    print a.get('href')
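lxml can also fetch and parse a URL directly, which is handy when extracting links from a live page (a sketch, assuming network access; the URL is only an example):
import lxml.html

# parse() accepts a filename, URL, or file-like object; getroot() returns the <html> element
root = lxml.html.parse('http://stackoverflow.com/').getroot()
root.make_links_absolute('http://stackoverflow.com/')  # resolve relative hrefs against the base URL
for a in root.cssselect('a[href]'):
    print(a.get('href'))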
Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest
ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id)query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}
Language: Perl
Library: HTML::TreeBuilder
use strict;
use HTML::TreeBuilder;
use LWP::Simple;
my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;
for my $a ($document->find('a')) {
print $a->attr('href'), "\n" if $a->attr('href');
}
Language: PHP
Library: DOM
<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');

$xpath = new DOMXpath($doc);
$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";
Sometimes it's useful to put the @ error-suppression operator in front of $doc->loadHTMLFile (i.e. @$doc->loadHTMLFile(...)) to suppress the warnings emitted while parsing invalid HTML.
Language: Python
Library: HTQL
import htql

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
query = "<a>:href,tx"

for url, text in htql.HTQL(page, query):
    print url, text
Simple and intuitive.
language: Ruby
library: Nokogiri
#!/usr/bin/env ruby
require "nokogiri"
require "open-uri"
doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }
puts hrefs
Which outputs:
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
This is a minor spin on the one above, resulting in an output that is usable for a report. I only return the first and last elements in the list of hrefs:
#!/usr/bin/env ruby
require "nokogiri"
require "open-uri"
doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }
puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output
1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com
Language: Java
Library: jsoup
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    public static void main(final String[] args) throws IOException {
        final Document document = Jsoup.parse("<html><body><ul><li>google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}
Using PhantomJS, save this file as extract-links.js:
var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});
run:
$ ../path/to/bin/phantomjs extract-links.js
Language: Coldfusion 9.0.1+
Library: jSoup
<cfscript>
function parseURL(required string url){
    var res = [];
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some html to parse.
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    for(var a=1;a LT arrayLen(links);a++){
        var s={};s.href= links[a].attr('href'); s.text= links[a].text();
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
    }
    return res;
}
//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">
Returns an array of structures; each struct contains HREF and TEXT keys.
Language: JavaScript/Node.js
Library: Request and Cheerio
var request = require('request');
var cheerio = require('cheerio');
var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');
        anchorTags.each(function(i, element){
            console.log(element["attribs"]["href"]);
        });
    }
});
The Request library downloads the HTML document and Cheerio lets you use jQuery-style CSS selectors to query it.
Related
I have a problem with a method in my Jenkinsfile when I try to convert XML to JSON. Below are the method and the pipeline.
I tried to pass the method's result directly to echo, but it gives me an error and the pipeline fails.
Sorry, but I don't know what details I can give about the error, because I'm just starting to learn and this is the first time I've seen this code.
ERROR: org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 1; Content is not allowed in prolog.
I edited my question and added a bat step in the 'OWASP dependencies testing' stage. This bat creates the XML automatically; I ran it through an XML validator and it reported no errors. So I don't know if the problem is in the Jenkinsfile code or in the XML, because the error is the same. I've included only part of the XML because it's very long, but the error is still on the second line.
XML Code:
<?xml version="1.0" encoding="UTF-8"?>
<analysis xmlns="https://jeremylong.github.io/DependencyCheck/dependency-check.2.2.xsd">
<scanInfo>
<engineVersion>5.2.2</engineVersion>
<dataSource>
<name>NVD CVE Checked</name>
<timestamp>2019-11-25T09:01:51</timestamp>
</dataSource>
<datasource>...</datasource>
</scanInfo>
....................
</analysis>
import groovy.json.*;
def getDependencyResumeFromXML(pathReport){
    def xml = bat(script:'type ' + pathReport, returnStdout:true);
    def x = new XmlParser().parseText(xml);
    def nDep = x.dependencies.dependency.size();
    def dependencies = [:];

    for(def i=0;i<nDep;i++){
        dependencies[i] = [fileName: x.dependencies.dependency[i].fileName.text(),
                           description: x.dependencies.dependency[i].description.text(),
                           vulnerabilities: [:]];
        def nVul = x.dependencies.dependency[i].vulnerabilities.vulnerability.size();
        for(def j=0;j<nVul;j++){
            dependencies[i].vulnerabilities[j] = [
                name: x.dependencies.dependency[i].vulnerabilities.vulnerability[j].name.text(),
                cvssScore: x.dependencies.dependency[i].vulnerabilities.vulnerability[j].cvssScore.text(),
                severity: x.dependencies.dependency[i].vulnerabilities.vulnerability[j].severity.text(),
                cwe: x.dependencies.dependency[i].vulnerabilities.vulnerability[j].cwe.text(),
                description: x.dependencies.dependency[i].vulnerabilities.vulnerability[j].description.text(),
            ];
        }
    }
    return dependencies;
}
pipeline{
    .......
    stages{
        stage('OWASP dependencies testing'){
            steps{
                script{
                    bat 'mvn org.owasp:dependency-check-maven:check';
                    def pathReport = 'C:\\tmp\\workspace\\umbrella-pipeline-prueba\\target\\dependency-check\\dependency-check-report.xml';
                    def xml = bat(script:'type ' + pathReport, returnStdout:true);
                    echo '------------------ 1';
                    echo xml;
                    echo '------------------ 2';
                    echo '--------------------------------'
                    def dependencias = getDependencyResumeFromXML(pathReport);
                    echo '------------- 3';
                    echo dependencias;
                    echo '------------- 4';
                }
            }
        }
    }
}
I'm using pug/jade with gulp-data & gulp-merge-json. I can't find a way to insert a line break in my JSON.
I have tried inserting these in the JSON:
\n
\"
\\
\/
\b
\f
\r
\u
None of them works. I managed to get a space using \t, but I really need to be able to use line breaks.
var data = require('gulp-data'),
    fs = require('fs'),
    path = require('path'),
    merge = require('gulp-merge-json');
var pug = require('gulp-pug');

gulp.task('pug:data', function() {
    return gulp.src('assets/json/**/*.json')
        .pipe(merge({
            fileName: 'data.json',
            edit: (json, file) => {
                // Extract the filename and strip the extension
                var filename = path.basename(file.path),
                    primaryKey = filename.replace(path.extname(filename), '');

                // Set the filename as the primary key for our JSON data
                var data = {};
                data[primaryKey.toUpperCase()] = json;

                return data;
            }
        }))
        .pipe(gulp.dest('assets/temp'));
});
Here is the pug code :
#abouttxt.hide.block.big.centered.layer_top
| #{ABOUT[langage]}
Here is the Json named About.json :
{"en": "Peace, Love, Unity and Having fun* !",
"fr": "Ecriture et Réalisation\nVina Hiridjee et David Boisseaux-Chical\n\nDirection artistique et Graphisme\nNicolas Dali\n\nConception et Développement\nRomain Malauzat
Production\nKomet Prod ...... etc"
}
I didn't read the pug documentation well enough, my bad.
You can do unescaped string interpolation like this:
!{ABOUT[langage]}
I am very new to SketchUp and Ruby; I have worked with Java and C#, but this is my first time with Ruby.
Now I have a problem: I need to serialize the whole scene into one JSON (scene hierarchy, object name, object material and position for each object). How can I do this?
I have already done this for Unity3D (C#) without a problem.
I tried this:
def main
  avr_entities = Sketchup.active_model.entities # all objects
  ambiens_dictionary = {}
  ambiens_list = []

  avr_entities.each do |root|
    if root.is_a?(Sketchup::Group) || root.is_a?(Sketchup::ComponentInstance)
      if root.name == ""
        UI.messagebox("this is a group #{root.definition.name}")
        if root.entities.count > 0
          root.entities.each do |leaf|
            if leaf.is_a?(Sketchup::Group) || leaf.is_a?(Sketchup::ComponentInstance)
              UI.messagebox("this is a leaf #{leaf.definition.name}")
            end
          end
        end
      else
        # UI.messagebox("this is a leaf #{root.name}")
      end
    end
  end
end
Have you tried the JSON library?
require 'json'
source = { a: [ { b: "hello" }, 1, "world" ], c: 'hi' }
source.to_json # => "{\"a\":[{\"b\":\"hello\"},1,\"world\"],\"c\":\"hi\"}"
I used the code below to answer a question here, but it might also work for this one.
The code can run outside of SketchUp for testing in the terminal. Just make sure to follow these steps:
Copy the code below and paste it into a Ruby file (example: file.rb)
Run the script in a terminal: ruby file.rb.
The script will write data to JSON file and also read the content of JSON file.
The path to the JSON file is relative to the ruby file created in step one. If the script can't find the path it will create the JSON file for you.
module DeveloperName
  module PluginName
    require 'json'
    require 'fileutils'

    class Main
      def initialize
        path = File.dirname(__FILE__)
        @json = File.join(path, 'file.json')
        @content = { 'hello' => 'hello world' }.to_json
        json_create(@content)
        json_read(@json)
      end

      def json_create(content)
        File.open(@json, 'w') { |f| f.write(content) }
      end

      def json_read(json)
        if File.exist?(json)
          file = File.read(json)
          data_hash = JSON.parse(file)
          puts "Json content: #{data_hash}"
        else
          msg = 'JSON file not found'
          UI.messagebox(msg, MB_OK)
        end
      end
      # # #
    end
    DeveloperName::PluginName::Main.new
  end
end
I have a file which looks exactly as below.
{"eventid" : "12345" ,"name":"test1","age":"18"}
{"eventid" : "12346" ,"age":"65"}
{"eventid" : "12336" ,"name":"test3","age":"22","gender":"Male"}
Think of the above file as event.json
The number of data objects may vary per line.
I would like the following CSV output, and it would be output.csv:
eventid,name,age,gender
12345,test1,18
12346,,65
12336,test3,22,Male
Could someone kindly help me? I can accept an answer in any scripting language (JavaScript, Python, etc.).
This code will collect all the headers dynamically and write the file to CSV.
Read comments in code for details:
import json
# Load data from file
data = '''{"eventid" : "12345" ,"name":"test1","age":"18"}
{"eventid" : "12346" ,"age":"65"}
{"eventid" : "12336" ,"name":"test3","age":"22","gender":"Male"}'''
# Store records for later use
records = []
# Keep track of headers in a set
headers = set([])

for line in data.split("\n"):
    line = line.strip()
    # Parse each line as JSON
    parsedJson = json.loads(line)
    records.append(parsedJson)

    # Make sure all found headers are kept in the headers set
    for header in parsedJson.keys():
        headers.add(header)

# You only know what headers were there once you have read all the JSON once.
# Now we have all the information we need, like what all possible headers are.
outfile = open('output_json_to_csv.csv','w')

# write headers to the file in order
outfile.write(",".join(sorted(headers)) + '\n')

for record in records:
    # write each record based on available fields
    curLine = []
    # For each header in alphabetical order
    for header in sorted(headers):
        # If that record has the field
        if header in record:
            # Then write that value to the line
            curLine.append(record[header])
        else:
            # Otherwise put an empty value as a placeholder
            curLine.append('')
    # Write the line to file
    outfile.write(",".join(curLine) + '\n')

outfile.close()
Here is a solution using jq.
If filter.jq contains the following filter
(reduce (.[]|keys_unsorted[]) as $k ({};.[$k]="")) as $o # object with all keys
| ($o | keys_unsorted), (.[] | $o * . | [.[]]) # generate header and data
| join(",") # convert to csv
and data.json contains the sample data then
$ jq -Mrs -f filter.jq data.json
produces
eventid,name,age,gender
12345,test1,18,
12346,,65,
12336,test3,22,Male
Here's a Python solution (should work in both Python 2 & 3).
I'm not proud of the code, as there's probably a better way to do this (using the csv module) but this gives you the desired output.
I've taken the liberty of naming your JSON data data.json and I'm naming the output csv file output.csv.
import json

header = ['eventid', 'name', 'age', 'gender']

with open('data.json', 'r') as infile, \
     open('output.csv', 'w+') as outfile:
    # Writes header row
    outfile.write(','.join(header))
    outfile.write('\n')

    for row in infile:
        line = ['', '', '', '']  # I'm sure there's a better way
        datarow = json.loads(row)
        for key in datarow:
            line[header.index(key)] = datarow[key]
        outfile.write(','.join(line))
        outfile.write('\n')
Hope this helps.
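As noted above, the csv module can also handle the bookkeeping; a minimal sketch with csv.DictWriter (assuming Python 3 and the same data.json/output.csv names with a fixed header order):
import csv
import json

header = ['eventid', 'name', 'age', 'gender']

with open('data.json') as infile, open('output.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=header, restval='')  # restval fills in missing fields
    writer.writeheader()
    for line in infile:
        writer.writerow(json.loads(line))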
Using AngularJS with the ngCsv plugin we can generate a CSV file from the desired JSON with dynamic headers.
Run in plunkr
// Code goes here
var myapp = angular.module('myapp', ["ngSanitize", "ngCsv"]);

myapp.controller('myctrl', function($scope) {
    $scope.filename = "test";
    $scope.getArray = [{
        label: 'Apple',
        value: 2,
        x: 1,
    }, {
        label: 'Pear',
        value: 4,
        x: 38
    }, {
        label: 'Watermelon',
        value: 4,
        x: 38
    }];

    $scope.getHeader = function() {
        var vals = [];
        for (var key in $scope.getArray) {
            for (var k in $scope.getArray[key]) {
                vals.push(k);
            }
            break;
        }
        return vals;
    };
});
<!DOCTYPE html>
<html>
<head>
<link href="https://netdna.bootstrapcdn.com/bootstrap/3.0.0/css/bootstrap.min.css" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.7/angular.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.7/angular-sanitize.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/ng-csv/0.3.6/ng-csv.min.js"></script>
</head>
<body>
<div ng-app="myapp">
<div class="container" ng-controller="myctrl">
<div class="page-header">
<h1>ngCsv <small>example</small></h1>
</div>
<button class="btn btn-default" ng-csv="getArray" csv-header="getHeader()" filename="{{ filename }}.csv" field-separator="," decimal-separator=".">Export to CSV with header</button>
</div>
</div>
</body>
</html>
var arr = $.map(obj, function(el) { return el });

var content = "";
for (var i in arr) {
    content += arr[i] + ",";
}

var filePath = "someFile.csv";
var fso = new ActiveXObject("Scripting.FileSystemObject");
var fh = fso.OpenTextFile(filePath, 8, false, 0);
fh.WriteLine(content);
fh.Close();
Just wondering if these two functions are to be done using Nokogiri or via more basic Ruby commands.
require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"
doc = Nokogiri.parse(open("example.html"))

doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end

doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end
etc...
Question 1: I'm just trying to parse a directory of .html documents, get the information from the meta HTML tags, and output the results to a text file if possible. I tried a simple *.html wildcard replacement, but that didn't seem to work (at least not with Nokogiri.parse(open()); maybe it works with ::HTML or ::XML).
Question 2: But more important, is it possible to output all of those meta content outputs into a text file to replace the puts command?
Also forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / xpath / Ruby.
Thanks.
I have some similar code. Please refer to:
module MyParser
  HTML_FILE_DIR = 'your html file dir'

  def self.run(options = {})
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }

    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end

    write_csv(result)
  end

  def self.parse_to_hash(doc)
    array = []
    array << doc.css('your select conditions').first.content
    ... # add your selector code, css or xpath
    array
  end

  def self.write_csv(result)
    ::CSV.open('your output file name', 'w') do |csv|
      result.each { |row| csv << row }
    end
  end
end

MyParser.run
You can output to a file like so:
File.open('results.txt','w') do |file|
  file.puts "output" # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end
Alternatively, you could do something like:
authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")

results = authors.map{ |x| "Author: #{x}" }.join("\n") +
          keywrds.map{ |x| "Keywords: #{x}" }.join("\n")

File.open('results.txt','w'){ |f| f << results }