Apache Tika Document Content Extraction Per Page


I am using Apache Tika 1.9, and content extraction is working well.
The problem I am facing is with pages. I can read the total page count from the document metadata, but I can't find any way to extract the content of each page separately.
I have searched a lot and tried some solutions suggested by other users, but none of them worked for me, maybe because of the newer Tika version.
Please suggest a solution or a direction for further research.
NOTE: I am using JRuby for the implementation.

Here is the custom content handler class that I created, which solved my issue.
java_import 'org.apache.tika.sax.ToXMLContentHandler'

# Collects the text of each page, keyed by page number.
# It assumes each page arrives wrapped in a <div class="page"> element.
class PageContentHandler < ToXMLContentHandler
  attr_accessor :page_tag
  attr_accessor :page_number
  attr_accessor :page_class
  attr_accessor :page_map

  def initialize
    super() # initialize the Java superclass
    @page_number = 0
    @page_tag = 'div'
    @page_class = 'page'
    @page_map = Hash.new
  end

  def startElement(uri, local_name, q_name, atts)
    start_page if @page_tag == q_name && atts.getValue('class') == @page_class
  end

  def endElement(uri, local_name, q_name)
    end_page if @page_tag == q_name
  end

  def characters(ch, start, length)
    if length > 0
      builder = java.lang.StringBuilder.new(length)
      # Append only the slice [start, start + length) of the shared buffer.
      builder.append(ch, start, length)
      @page_map[@page_number] << builder.to_s if @page_number > 0
    end
  end

  def start_page
    @page_number = @page_number + 1
    @page_map[@page_number] = String.new
  end

  def end_page
    return
  end
end
And to use this content handler, here is the code:
parser = AutoDetectParser.new
handler = PageContentHandler.new
parser.parse(input_stream, handler, @metadata_java, ParseContext.new)
puts handler.page_map
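
The page-splitting logic of the handler can be tried without Tika or JRuby. The following is a minimal plain-Ruby sketch, not the Tika API: the class name `PageCollector`, its snake_case methods, and the hand-written event list are all hypothetical stand-ins for the SAX callbacks that a parser would fire.

```ruby
# Hypothetical plain-Ruby stand-in for the JRuby handler above; no Tika needed.
class PageCollector
  attr_reader :page_map

  def initialize(page_tag: 'div', page_class: 'page')
    @page_tag = page_tag
    @page_class = page_class
    @page_number = 0
    @page_map = {}
  end

  # A new page starts at every <div class="page"> element.
  def start_element(name, attrs = {})
    return unless name == @page_tag && attrs['class'] == @page_class

    @page_number += 1
    @page_map[@page_number] = +''
  end

  # Text seen before the first page marker is ignored.
  def characters(text)
    @page_map[@page_number] << text if @page_number > 0
  end
end

# Hand-written events standing in for what a SAX parser would emit.
events = [
  [:start_element, 'div', { 'class' => 'page' }],
  [:characters, 'First page text'],
  [:start_element, 'p', {}],
  [:characters, ' continues.'],
  [:start_element, 'div', { 'class' => 'page' }],
  [:characters, 'Second page text']
]

collector = PageCollector.new
events.each { |name, *args| collector.public_send(name, *args) }
p collector.page_map
```

Nested elements such as the `p` above do not start a new page, so their text is appended to the page that is currently open.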

Related

Remove conditions from rails front-end html slim

I have the following code in my Rails front-end html.slim file. I want to remove these nested if-else conditions. Can I do this by moving the conditions into a helper class?
- if @current_task.task_type == 'econsent'
  - if @patient_organization.organization.identity_verification
    - if @patient_organization.manual_verified
      - if session['kiosk_token']
        = render "#{@current_task.task_type}_tasks"
      - else
        - if @reauthenticated
          = render "#{@current_task.task_type}_tasks"
        - else
          = render 'relogin_required_screen'
    - else
      = render 'manual_verification_required_screen'
  - else
    - if @patient.self_verified
      - if session['kiosk_token']
        = render "#{@current_task.task_type}_tasks"
      - else
        - if @reauthenticated
          = render "#{@current_task.task_type}_tasks"
        - else
          = render 'relogin_required_screen'
    - else
      - if @patient.self_verification_req_sent
        = render 'verify_email_after_sent_screen'
      - else
        = render 'verify_email_screen'
- else
  = render "#{@current_task.task_type}_tasks"

I think you need to refactor those conditions, not just move them to another place. For example, five different branches end with render "#{@current_task.task_type}_tasks"; find what those branches have in common and you won't need so many conditionals. Take a look at the usage of if, elsif, else and unless.

I believe you could simply assign a @to_render variable in your controller action, something like:
class FooController < ApplicationController
  def bar_action
    ...
    @to_render = get_to_render
    ...
  end

  private

  # Returns the name of the partial to render, as a symbol.
  def get_to_render
    if current_task_type == 'econsent'
      if @patient_organization.organization.identity_verification
        if @patient_organization.manual_verified
          return :relogin_required_screen if !@reauthenticated && !kiosk_token?
        else
          return :manual_verification_required_screen
        end
      else
        if @patient.self_verified
          return :relogin_required_screen if !@reauthenticated && !kiosk_token?
        else
          return @patient.self_verification_req_sent ? :verify_email_after_sent_screen : :verify_email_screen
        end
      end
    end
    "#{current_task_type}_tasks".to_sym
  end

  # Assumed helper: the code above calls current_task_type without defining it.
  def current_task_type
    @current_task.task_type
  end

  def kiosk_token?
    session['kiosk_token']
  end
end
Then in your html.slim file, do:
= render @to_render
I can't remember, but you may need to do:
= render "#{@to_render}"
since @to_render will be a symbol. Doing the string interpolation will automatically convert the symbol to a string.
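
The same decision can also be flattened with guard clauses, which is one way to act on the advice about finding what the branches have in common. This is only a sketch in plain Ruby: `partial_for` and its keyword arguments are hypothetical names that stand in for the instance variables and session lookup used in the template.

```ruby
# Hypothetical helper: picks the partial name from plain boolean inputs.
# Every keyword argument is an assumed stand-in for state in the question.
def partial_for(task_type:, identity_verification: false, manual_verified: false,
                self_verified: false, self_verification_req_sent: false,
                kiosk: false, reauthenticated: false)
  tasks = "#{task_type}_tasks"
  return tasks unless task_type == 'econsent'

  if identity_verification
    return 'manual_verification_required_screen' unless manual_verified
  elsif !self_verified
    return self_verification_req_sent ? 'verify_email_after_sent_screen' : 'verify_email_screen'
  end

  # Verified one way or the other: only the kiosk/re-login check is left.
  (kiosk || reauthenticated) ? tasks : 'relogin_required_screen'
end

puts partial_for(task_type: 'econsent', self_verified: true, kiosk: true)
puts partial_for(task_type: 'survey')
```

Because the helper takes plain values, each branch can be checked in isolation before it is wired into a controller or view helper.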

I couldn't show article on view rails

I'm a beginner at programming and am working on a Rails project.
My problem is that I can't show the data in the view.
The code is listed below.
# routes.rb
scope module: :mobile do
  scope module: :home do
    get "/", action: :index

# index.html.slim
- if @pickup_links.present?
  .user-posts-area
    .inner-headline
      h2 Pickup Link
      h3 ピックアップリンク
    .top-user-posts
      - pl = @pickup_links
        a.post href=pl.page_path
          img.lazy data-original=pl.picture
          .post-descs
            h3 = pl.title_or_notitle
            h4 = pl.name_or_no_name
            .date-area
              .right-date = pl.created_at.to_s(:md_dot_en)

# home_controller.rb
def index
  @pickup_links = PickupLink.limit(1)
end
I tested "@pickup_links = PickupLink.limit(1)" in the terminal and could get the data from the database.
Could someone please give me a hand?

I am not familiar with Slim, but it looks like Haml. My guess is that your line
- pl = @pickup_links
is not a block, so the lines that follow it should not be nested under it.
Another point (I know this is only a test project): why don't you do
# why the plural "links"?
@pickup_links = PickupLink.first
Then you would only need to test like this
- if @pickup_links
and you would not need to set
- pl = @pickup_links
but could just use @pickup_links. By the way, with limit(1) "pl" is still a relation of PickupLink, not a single record, so it has none of the methods you are calling.

How to post-process HTML to add "target blank" to all links in Ruby?

I am currently using Rinku (gem) to auto-link text, and that works great.
However, I am post-processing HTML and some links are already links, and therefore are not processed with Rinku.
How could I add the target blank attribute to those?
application_controller.rb
def text_renderer text
  AutoHTML.new(text).render
end
auto_html.rb
class AutoHTML
  include ActionView::Helpers

  def initialize text
    @text = text
  end

  def render
    text = prepare @text
    text = auto_link(text)
    text.html_safe
  end

  private

  def prepare text
    if text.nil? || text.empty?
      ""
    else
      text
    end
  end

  def auto_link text
    Rinku.auto_link(text, :all, 'target="_blank"')
  end
end

I implemented a solution with Nokogiri:
require 'nokogiri'

def self.a_with_target_blank(body)
  doc = Nokogiri::HTML(body)
  doc.css('a').each do |link|
    link['target'] = '_blank'
    # Worried about @spickermann's security concerns in the comment? Then
    # consider also adding:
    #
    # link['rel'] = 'noopener'
    #
    # In any case, this security hole has been fixed in modern browsers (see
    # https://github.com/whatwg/html/issues/4078), so unless you're supporting
    # very old browsers, there's not much to worry about.
  end
  doc.to_s
end

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
  Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')
page_content.each_line do |i|
  puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?

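When you require 'open-uri', you don't need to redefine open with Net::HTTP. On current Ruby versions, call URI.open; since Ruby 3.0, open-uri no longer makes the plain open method accept URLs.
require 'open-uri'
page_content = URI.open('http://www.stackoverflow.com').read
histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end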
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
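
To combine the two steps, the tags can be stripped first and only letters counted afterwards. The sketch below is one possible way to do that with the standard library alone; `letter_histogram` is a hypothetical helper name, and the regular expression is the simple tag-stripping pattern quoted elsewhere in this thread, which is a rough approximation and not a real HTML parser.

```ruby
# Hypothetical helper: letter counts for an HTML string, ignoring markup.
def letter_histogram(html)
  # Rough tag removal; attributes containing ">" would confuse this pattern.
  text = html.gsub(/<\/?[^>]*>/, "")

  histogram = Hash.new(0)
  text.downcase.each_char do |c|
    histogram[c] += 1 if c.match?(/[a-z]/)
  end
  histogram
end

sample = "<body><h1>Object Moved</h1>This document may be found here</body>"
p letter_histogram(sample)
```

Using Hash.new(0) removes the need for the `||= 0` initialization, and lowercasing first means "O" and "o" land in the same bucket.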

See the section "Following Redirection" in the Net::HTTP documentation.
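
The "Object Moved" body in the question is what a redirect response looks like when it is not followed. A minimal sketch of following redirects with Net::HTTP is shown here; the name `fetch_following_redirects` and the injectable `getter` parameter are assumptions made for illustration (the latter only so the logic can be exercised without a network), and the sketch assumes the Location header holds an absolute URL.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a URL, following at most `limit` redirects.
# `getter` defaults to Net::HTTP.get_response and can be swapped in tests.
def fetch_following_redirects(url, limit: 5, getter: Net::HTTP.method(:get_response))
  raise ArgumentError, 'too many HTTP redirects' if limit.zero?

  response = getter.call(URI(url))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    # Assumes an absolute URL in the Location header.
    fetch_following_redirects(response['location'], limit: limit - 1, getter: getter)
  else
    response.value # raises an error for any other non-2xx response
  end
end
```

With a network connection, `fetch_following_redirects('http://www.stackoverflow.com').body` would then return the body of the final page instead of the redirect notice.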

Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615

Nokogiri prevent converting entities

def wrap(content)
  require "nokogiri"
  doc = Nokogiri::HTML.fragment("<div>" + content + "</div>")
  chunks = doc.at("div").traverse do |p|
    if p.is_a?(Nokogiri::XML::Text)
      input = p.content
      p.content = input.scan(/.{1,5}/).join("&shy;")
    end
  end
  doc.at("div").inner_html
end
wrap("aaaaaaaaaa")
gives me
"aaaaa&amp;shy;aaaaa"
instead of
"aaaaa&shy;aaaaa"
How can I get the second result?

Return
doc.at("div").text
instead of
doc.at("div").inner_html
This, however, strips all HTML from the result. If you need to retain other markup, you can probably get away with using CGI.unescapeHTML:
CGI.unescapeHTML(doc.at("div").inner_html)
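
The effect of CGI.unescapeHTML can be checked on its own with the standard library. The input string below is an assumed example of serialized output in which the ampersand of an entity has been escaped; it is not taken from a real Nokogiri run.

```ruby
require 'cgi'

# Assumed sample: an entity whose "&" was escaped during serialization.
escaped = "aaaaa&amp;shy;aaaaa"

# unescapeHTML turns "&amp;" back into "&" in a single pass and leaves
# the named entity it does not recognize ("&shy;") untouched.
restored = CGI.unescapeHTML(escaped)
puts restored
```

Note that this also unescapes any `&lt;`, `&gt;` and `&quot;` in the markup, so it is only safe when the content is known not to rely on those escapes.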
