How to add more XPATH in parsefilter.json in stormcrawler - json

I am using StormCrawler (v1.16) and Elasticsearch (v7.5.0) to extract data from about 5k news websites. I have added some XPath patterns to parsefilter.json for extracting the author name.
parsefilter.json is shown below:
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content",
        "parse.datePublished": "//META[@itemprop=\"datePublished\"]/@content",
        "parse.author": [
          "//META[@itemprop=\"author\"]/@content",
          "//input[@id=\"authorname\"]/@value",
          "//META[@name=\"article:author\"]/@content",
          "//META[@name=\"author\"]/@content",
          "//META[@name=\"byline\"]/@content",
          "//META[@name=\"dc.creator\"]/@content",
          "//META[@name=\"byl\"]/@content",
          "//META[@itemprop=\"authorname\"]/@content",
          "//META[@itemprop=\"article:author\"]/@content",
          "//META[@itemprop=\"byline\"]/@content",
          "//META[@itemprop=\"dc.creator\"]/@content",
          "//META[@rel=\"authorname\"]/@content",
          "//META[@rel=\"article:author\"]/@content",
          "//META[@rel=\"byline\"]/@content",
          "//META[@rel=\"dc.creator\"]/@content",
          "//META[@rel=\"author\"]/@content",
          "//META[@id=\"authorname\"]/@content",
          "//META[@id=\"byline\"]/@content",
          "//META[@id=\"dc.creator\"]/@content",
          "//META[@id=\"author\"]/@content",
          "//META[@class=\"authorname\"]/@content",
          "//META[@class=\"article:author\"]/@content",
          "//META[@class=\"byline\"]/@content",
          "//META[@class=\"dc.creator\"]/@content",
          "//META[@class=\"author\"]/@content"
        ]
      }
    },
I have also changed crawler-conf.yaml as shown below:
indexer.md.mapping:
  - parse.author=author
metadata.persist:
  - author
The issue I am facing: I only get a result for the first pattern (i.e. "//META[@itemprop="author"]/@content") of "parse.author". What changes should I make so that all patterns are taken as input?

"What changes should I make so that all patterns are taken as input?"
I read this as "How can I make a single XPath expression that tries all the different ways an author can appear in the document?"
Simplest approach: join all the expressions you already have into a single one with the XPath union operator |:
input[...]|meta[...]|meta[...]|meta[...]
And since this potentially selects more than one node, we could state explicitly that we only care for the first match:
(input[...]|meta[...]|meta[...]|meta[...])[1]
This probably works but it will be very long and hard to read. XPath can do better.
Your expressions are all pretty repetitive, which is a good starting point for reducing the size of the expression. For example, these two are the same except for the attribute value:
//meta[@class='author']/@content|//meta[@class='authorname']/@content
We could use or and it would get shorter already:
//meta[@class='author' or @class='authorname']/@content
But when you have 5 or 6 potential values, it is still pretty long. Next try, a predicate for the attribute:
//meta[@class[.='author' or .='authorname']]/@content
A little shorter, as we don't need to type @class all the time. But still pretty long with 5 or 6 potential values. How about a value list and a substring search (I'm using / as a delimiter character):
//meta[contains(
  '/author/authorname/',
  concat('/', @class, '/')
)]/@content
Now we can easily expand the list of valid values, and even look at different attributes, too:
//meta[contains(
  '/author/authorname/article:author/',
  concat('/', @class|@id, '/')
)]/@content
And since we're looking for almost the same possible strings across multiple possible attributes, we could use a fixed list of values that all possible attributes are checked against:
//meta[
  contains(
    '/author/article:author/authorname/dc.creator/byline/byl/',
    concat('/', @name|@itemprop|@rel|@id|@class, '/')
  )
]/@content
Combined with the first two points, we could end up with this:
(
  //meta[
    contains(
      '/author/article:author/authorname/dc.creator/byline/byl/',
      concat('/', @name|@itemprop|@rel|@id|@class, '/')
    )
  ]/@content
  |
  //input[
    @id='authorname'
  ]/@value
)[1]
Caveat: This only works as expected when a <meta> never has more than one of these attributes (e.g. both @name and @rel), or when all of them carry the same value. Otherwise concat('/', @name|@itemprop|@rel|@id|@class, '/') might pick the wrong one. It's a calculated risk; I think it's unusual for this to happen in HTML. But you need to decide, since you're the one who knows your input data.
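To see why the delimiters in the value list matter, here is a rough Python analogue of the combined expression. It is only a sketch using the standard library, not an XPath engine, and the names AUTHOR_KEYS, ATTRS, AuthorFinder and find_author are made up for this illustration. It assumes that taking the first listed attribute as written in the tag is a fair stand-in for how XPath 1.0 turns the attribute node-set into a string.

```python
from html.parser import HTMLParser

# Same delimiter-wrapped value list as in the XPath expression.
AUTHOR_KEYS = "/author/article:author/authorname/dc.creator/byline/byl/"
# Attributes that may carry one of the values above.
ATTRS = ("name", "itemprop", "rel", "id", "class")


class AuthorFinder(HTMLParser):
    """Remembers the first author-like <meta> or <input> in document order."""

    def __init__(self):
        super().__init__()
        self.author = None

    def handle_starttag(self, tag, attrs):
        if self.author is not None:
            return  # mirrors the trailing [1]: keep only the first match
        values = dict(attrs)
        if tag == "meta":
            # First listed attribute present on the tag (assumption: this
            # stands in for the string value of @name|@itemprop|...).
            key = next((v or "" for k, v in attrs if k in ATTRS), "")
            # contains(list, concat('/', key, '/')): wrapping the key in
            # delimiters stops partial values such as "auth" from matching.
            if f"/{key}/" in AUTHOR_KEYS and "content" in values:
                self.author = values["content"]
        elif tag == "input" and values.get("id") == "authorname":
            self.author = values.get("value")


def find_author(html):
    finder = AuthorFinder()
    finder.feed(html)
    return finder.author


page = '<meta name="auth" content="nope"><meta itemprop="byline" content="Jane Doe">'
print(find_author(page))  # Jane Doe
```

The first tag is skipped because "/auth/" is not a substring of the list, while "/byline/" is.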


Sorting on a field with an additional filter in Vega-Lite

This might be a somewhat obscure use case.
As you can see below, the bars (count) are overlaid. I want to sort the background bars (where is_overview is set to 1), but currently the sort is applied to all of count, which includes rows where is_overview is 0.
I need the sort to be on a filtered field.
I went through the sorting documentation but I cannot figure out a way to support this use case. If you might have ideas, I would really appreciate the help!
Editor code
If you want custom sort behavior, often the best approach is to use a calculate transform, which makes available all of the vega expression syntax, and define a new custom field on which to sort.
In your example, you could do something like this:
"transform": [
  {"calculate": "datum.is_overview ? datum.count : null", "as": "order"}
],
and then sort on the order property.
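As a rough illustration of what that calculate expression produces, here is the same derivation in plain Python. The sample rows and the category field are invented for this sketch; only count, is_overview and order come from the question and answer.

```python
# Hypothetical sample data; only count and is_overview come from the question.
rows = [
    {"category": "a", "count": 5, "is_overview": 1},
    {"category": "a", "count": 2, "is_overview": 0},
    {"category": "b", "count": 9, "is_overview": 1},
]


def with_order(row):
    # Equivalent of: datum.is_overview ? datum.count : null
    order = row["count"] if row["is_overview"] else None
    return {**row, "order": order}


derived = [with_order(r) for r in rows]
print([r["order"] for r in derived])  # [5, None, 9]
```

Only the overview rows carry a value in order, so rows with is_overview set to 0 no longer contribute a count to the field being sorted on.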
The result looks like this (vega editor):

What is the difference between "" and nothing

Hello, I would like to know the difference between "" and nothing. For instance, take this example:
"test":[""]
and
"test":[""]
Though you have given identical examples, I think you meant to compare these two:
"test":[""]
and this: "test":[]
In the first case, the test array has length 1, so accessing test[0] gives you "".
In the second case, its length is 0, so the array has no elements at all, whereas the first one contains the empty string "".
That's it.
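A quick way to check this yourself, sketched in Python with the standard json module (the variable names are just for illustration):

```python
import json

with_empty_string = json.loads('{"test": [""]}')
empty_array = json.loads('{"test": []}')

print(len(with_empty_string["test"]))      # 1
print(with_empty_string["test"][0] == "")  # True
print(len(empty_array["test"]))            # 0
```

Indexing the first element of the empty array would raise an IndexError in Python, because there is nothing at position 0.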

How to select from a selection box with a variable in the name?

I am having trouble selecting from this select element.
<select name="vehicle_attrs[position_count]" class="mb1"><option>Position / Quantity</option><option>Front</option><option>Rear</option></select>
I have tried
select('Front', :from=>'mb1')
select('Front', :from=>'vehicle_attrs[position_count]')
select('Front', :from=>'vehicle_attrs[1]')
All of them result in a "cannot find selection box" error.
I've never liked how restrictive Capybara's concept of a 'locator' is (i.e. must have a name/id/label), but if you dig into the source code, those helpful methods like select, click_on, and fill_in are just wrappers for find and some native method of Element, which takes arbitrary CSS, and works in almost all situations. In this case, you could use:
find('[name="vehicle_attrs[position_count]"]').find('option', text: 'Front').select_option
Since dropdowns often have multiple similar options, where one is a substring of the other, you might consider using an exact string match, like
find('[name="vehicle_attrs[position_count]"]').find('option', text: /\AFront\z/).select_option
From the docs for select - https://www.rubydoc.info/github/teamcapybara/capybara/Capybara/Node/Actions#select-instance_method - we can see that the from option takes "The id, Capybara.test_id attribute, name or label of the select box".
Neither 'mb1' nor 'vehicle_attrs[1]' is any of those, so they would be expected to fail.
'vehicle_attrs[position_count]' is the name, so assuming the box is actually visible on the page (not replaced with a JS-driven select widget, etc.), that should work. If it doesn't, edit your question and add the full, exact error message you get when trying to use it. Of course, if there is only one select box on the page with an option of 'Front', then you don't need to specify the from option at all and can just do
select 'Front'

SSRS - How to indent rows in a table with a given content

In SSRS I want to indent certain rows when they start with e.g. 'aa'. See this example:
What is the best practice in this case? As I don't have a parent-child situation here (to use recursive hierarchy group), do I have an option e.g. via the properties to set something like an IIf to solve this? If yes, could you please provide some information where to set this?
Every info is welcome! I'm new to SSRS.
This is simple to do...
Click on the cell that you want to indent.
In the properties panel, expand the Indent properties and then click the drop-down in the Left Indent property and choose Expression.
Then set the expression to something like
=SWITCH (
  LEFT(Fields!FieldIwantToCheck.Value, 2) = "aa", "10pt",
  LEFT(Fields!FieldIwantToCheck.Value, 2) = "bb", "30pt",
  True, "0pt"
)
You could do this with an IIF expression, but if you need to handle more than one or two cases, SWITCH is much easier to read and manage.
All we are doing here is checking the left 2 characters of the FieldIwantToCheck field and setting an indent value accordingly. If none of the criteria match, the final True, "0pt" pair acts like an ELSE and sets the indent to 0pt.
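If the first-match-wins behaviour is unfamiliar, the same rule set can be sketched in Python. This is only an analogy for reading the expression, not SSRS code, and left_indent is a made-up name.

```python
def left_indent(value):
    # Condition/result pairs are checked top to bottom, like SWITCH.
    rules = [
        (value[:2] == "aa", "10pt"),
        (value[:2] == "bb", "30pt"),
        (True, "0pt"),  # catch-all, plays the role of ELSE
    ]
    # Return the result of the first pair whose condition is true.
    return next(result for matched, result in rules if matched)


print(left_indent("aa-row"))  # 10pt
print(left_indent("zz-row"))  # 0pt
```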

MDX Children of Several Members

The Children function returns the set of children of a member.
But I need the children of several members.
The problem is that I can't just use Union like this:
Union([Geography].[Geography].[USA].children,[Geography].[Geography].[Canada].children)
because I don't know how many members there will be. So I actually need all children of a set of members,
like:
([Geography].[Geography].[USA],[Geography].[Geography].[Canada],[Geography].[Geography].[GB]).children
Is there a function like that?
I couldn't answer my own question, so I am editing it instead. With the help of DHN's answer and some brain work, I found a solution I could use:
Except(DRILLDOWNLEVEL( {[Geography].[Geography].[USA],[Geography].[Geography].[Canada]},,0 ),
{[Geography].[Geography].[USA],[Geography].[Geography].[Canada]})
That works for me.
Explanation: I drill down on the elements the tool provides, which returns the children plus the parents, and then I use DHN's idea and Except the parents to clean up the list.
Hopefully that is understandable.
You can use the Descendants function (the fourth form in the linked description takes a set as its first argument). Thus,
Descendants(
  {
    [Geography].[Geography].[USA],
    [Geography].[Geography].[Canada],
    [Geography].[Geography].[GB]
  },
  1,
  SELF
)
should deliver exactly what you want.
Well actually, you could use a Crossjoin to get the set you want.
Something like
[Geography].[Geography].[USA] * [Geography].[Geography].[Canada] * [Geography].[Geography].[GB]
But this is only a proper solution, if you have only a few different search criteria.
Alternatively, you could use Except to remove those criteria you're not interested in. E.g.
Except([Geography].[Geography].children, [Geography].[Geography].[Germany])
This would give you the whole content of the [Geography] dimension, except the one of [Germany].
Hope this helps a bit.
Edit after comment of TO
Ok, this wasn't part of your question, but I think what you need is the MemberToStr() function. Please find the doc here.
I think something like this should do the trick.
with member [Measures].[Cities]
  as membertostr([Geography].[Geography].members.children)
select [Measures].[Cities] on 0
from [WhatEverYourCubeNameIs]
where (
  [Geography].[Geography].[USA],
  [Geography].[Geography].[Canada]
)
Please note that this query is totally untested. I may also have lost some of my skills, because it's been a while since I used MDX. You will also have to create the query dynamically, since the selection seems to be user-dependent. But I'm sure that you're aware of it. ;)