I have a pretty huge JSON file of short lines from a screenplay. I am trying to match keywords against the lines in the JSON file so I can pull matching lines out of it.
The json file structure is like this:
[
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
"Ok, yeah, you guys got to put a negative spin on everything. ",
"No no I'm not ready, things are starting to happen. ",
"Ok, it's forgotten. ",
"Yeah, ok. ",
"Hey hey, whoa come on give me a hug... "
]
(plus lots more...2444 lines in total)
So far I have this, but it's not making any matches.
require 'json'

# screenplay is read in from a json file
@screenplay_lines = JSON.parse(@jsonfile.read)
@text_to_find = ["relationship", "negative", "hug"]
@matching_results = []
@screenplay_lines.each do |line|
  if line.match(Regexp.union(@text_to_find))
    @matching_results << line
  end
end
puts "found #{@matching_results.length} matches..."
puts @matching_results
I'm not getting any hits, so I'm not sure what's going wrong. Plus I'm sure it's a pretty expensive process doing it this way with a large amount of data. Any ideas? Thanks.
Yes, Regexp matching is slower than just checking whether a String is included in a line of text. But this also depends on the number of keywords, the length of the lines, and much more, so it is best to run at least a micro-benchmark.
lines = [
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
"Ok, yeah, you guys got to put a negative spin on everything. ",
"No no I'm not ready, things are starting to happen. ",
"Ok, it's forgotten. ",
"Yeah, ok. ",
"Hey hey, whoa come on give me a hug... "
]
keywords = ["relationship","negative","hug"]
def find1(lines, keywords)
  regexp = Regexp.union(keywords)
  lines.select { |line| regexp.match(line) }
end

def find2(lines, keywords)
  lines.select { |line| keywords.any? { |keyword| line.include?(keyword) } }
end

def find3(lines, keywords)
  regexp = Regexp.union(keywords)
  lines.select { |line| regexp.match?(line) }
end

require 'benchmark/ips'

Benchmark.ips do |x|
  x.compare!
  x.report('match') { find1(lines, keywords) }
  x.report('include?') { find2(lines, keywords) }
  x.report('match?') { find3(lines, keywords) }
end
In this setup the include? variant is way faster:
Comparison:
include?: 288083.4 i/s
match?: 91505.7 i/s - 3.15x slower
match: 65866.7 i/s - 4.37x slower
Please note:
- I've moved creation of the regexp out of the loop. It does not need to be created for every line; creating a regexp is an expensive operation (your variant, which rebuilt it for each line, clocked in at about 1/5 of the speed of the version with the regexp outside the loop).
- match? is only available in Ruby 2.4+; it is faster because it does not assign any match results (it is side-effect free).
- I would not worry too much about performance for ~2500 lines of text. If it is fast enough, then stop searching for a better solution.
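For comparison, here is a sketch of a hypothetical find0 that mirrors the question's code, rebuilding the Regexp for every line; adding it to the benchmark above should reproduce that roughly fivefold slowdown:

def find0(lines, keywords)
  # Regexp.union runs once per line here, so the regexp is recompiled
  # on every iteration -- this is the expensive part
  lines.select { |line| line.match(Regexp.union(keywords)) }
end

x.report('match-in-loop') { find0(lines, keywords) } would be the corresponding benchmark entry.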
There is another possible solution to try: the json_expressions gem.
My JSON file looks something like:
{
    "generator": {
        "name": "Xfer Records Serum",
        ....
    },
    "generator": {
        "name": "Lennar Digital Sylenth1",
        ....
    }
}
I ask the user for a search term, and the input is searched for in the name key only. All matching results are returned; for example, if I input just 's', both of the entries above would be returned. Please also explain how to return all the object names which are generators. The simpler the method, the better for me. I use the json library, but if another library is required, that's not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    # re.escape() keeps any regex metacharacters in the search term literal
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + re.escape(term) + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names listed in your example.
The search_names() function works by building a regular expression that matches any line starting with "name": (with a varying amount of whitespace), followed by your search term with any other characters around it, then terminated by " and an optional , at the end of the string. It applies that expression to each line from the file, filters out the non-matching lines, and returns the value of the name property (the capture-group contents) for each match.
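If the data can be loaded as actual JSON, a minimal sketch of the same search done with the json library follows; note this assumes the duplicate "generator" keys in the example are made unique first, because json.load() silently keeps only the last duplicate:

import json

def search_names_parsed(term, path):
    # Parse the whole document, then scan every top-level object's
    # "name" value for the (case-insensitive) search term
    with open(path) as f:
        data = json.load(f)
    return [obj['name'] for obj in data.values()
            if isinstance(obj, dict) and 'name' in obj
            and term.lower() in obj['name'].lower()]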
A CSV style quoted string, for the purposes of this question, is a string in which:
The string starts and ends with exactly one ".
Two double quotes inside the string are collapsed to one double quote. "Alo""ha"→Alo"ha.
"" on its own is an empty string.
Error inputs, such as "A""" e", cannot be parsed. It's an A", followed by junk e".
I've tried several things, none of which have worked fully.
The closest I've gotten, thanks to some help from user pinkieval in #nom on the Mozilla IRC:
use std::error as stderror; /* Avoids needing nightly to compile */
named!(csv_style_string<&str, String>, map_res!(
    terminated!(tag!("\""), not!(peek!(char!('"')))),
    csv_string_to_string
));

fn csv_string_to_string(s: &str) -> Result<String, Box<stderror::Error>> {
    Ok(s.to_string().replace("\"\"", "\""))
}
This does not catch the end of the string correctly.
I've also attempted to use the re_match! macro with r#""([^"]|"")*""#, but that always results in an Err::Incomplete(1).
I've determined that the given CSV example for Nom 1.0 doesn't work for a quoted CSV string as I'm describing it, but I do know implementations differ.
Here is one way of doing it:
use nom::types::CompleteStr;
use nom::*;
named!(csv_style_string<CompleteStr, String>,
delimited!(
char!('"'),
map!(
many0!(
alt!(
// Eat a " delimiter and the " that follows it
tag!("\"\"") => { |_| '"' }
| // Normal character
none_of!("\"")
)
),
// Make a string from a vector of chars
|v| v.iter().collect::<String>()
),
char!('"')
)
);
fn main() {
    println!(r#""Alo\"ha" = {:?}"#, csv_style_string(CompleteStr(r#""Alo""ha""#)));
    println!(r#""" = {:?}"#, csv_style_string(CompleteStr(r#""""#)));
    println!(r#"bad format: {:?}"#, csv_style_string(CompleteStr(r#""A""" e""#)));
}
(I wrote it in full nom, but a solution like yours, based on an external function instead of map!()-ing each character, would work too, and may be more efficient.)
The magic here, that would also solve your regexp issue, is to use CompleteStr. This basically tells nom that nothing will come after that input (otherwise, nom assumes you're doing a streaming parser, so more input may follow).
This is needed because we need to know what to do with a " if it is the last character fed to nom. Depending on the character that comes after it (another ", a normal character, or EOF), we have to take a different decision -- hence the Incomplete result, meaning nom does not have enough input to make the decision. Telling nom that EOF comes next solves this indecision.
Further reading on Incomplete on nom's author's blog: http://unhandledexpression.com/general/2018/05/14/nom-4-0-faster-safer-simpler-parsers.html#dealing-with-incomplete-usage
You may note that this parser does not actually reject the invalid input, but parses the beginning and returns the rest. If you use this parser as a subparser in another parser, the latter would then feed the remainder to the next subparser, which would fail as well (because it would expect a comma), causing the overall parser to fail.
If you don't want that, you could make csv_style_string check what follows the closing quote with peek!(alt!(char!(',') | char!('\n') | eof!())).
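For illustration, an untested sketch of that stricter variant (the name csv_style_string_strict is mine):

// Sketch: accept the quoted string only when it is followed by a
// comma, a newline, or the end of input; peek!() checks without
// consuming, so the delimiter stays available to the outer parser.
named!(csv_style_string_strict<CompleteStr, String>,
    terminated!(
        csv_style_string,
        peek!(alt!(
            char!(',')  => { |_| () } |
            char!('\n') => { |_| () } |
            eof!()      => { |_| () }
        ))
    )
);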
I am trying to get a JSON response from an API:
library(httr)

test <- GET(url, add_headers(`api_key` = key))
content(test, 'parsed')
When I run content(test, 'parsed'), I get the following error:
# Error: lexical error: invalid string in json text. .Note: Final passage of the "fiscal cliff bill" on January 1
I think this is because of the double quotations. How can I either replace the double quotes or if this is not the problem, how can I fix this issue?
Thanks!
So I had run into a similar problem before, and I had intended to write a quick function to use Jeroen's fix to try to repair the JSON. Since I intended to do it anyway, here's a quick hack attempt.
NB: repairing a structured format like this is speculative at best and most certainly prone to errors. The good news is that I tried to keep this specific enough so that it will not produce false results: it'll either fix what it knows it can, or fail. The "unit-testing" really needs to check other corner-cases. If you find something that this does not fix (and should) or that this breaks (gasp!), please comment!
fix_json_quotes <- function(s) {
  if (length(s) != 1) {
    warning("the argument has length > 1 and only the first element will be used")
    s <- s[[1]]
  }
  stopifnot(is.character(s))
  val <- jsonlite::validate(s)
  while (! val) {
    ind <- attr(val, "offset") - 1
    snew <- gsub("(.*)(['\"])([[:space:],]*)$", "\\1\\\\\\2\\3", substr(s, 1, ind))
    if (snew != substr(s, 1, ind)) {
      s <- paste0(snew, substr(s, ind + 1, nchar(s)))
    } else {
      break
    }
    val <- jsonlite::validate(s)
  }
  if (! val) {
    # still not validating
    stop("unable to fix quotes")
  }
  return(s)
}
Some sample data, unit-testing if you will (testthat is not required for use of the function):
library(testthat)
library(jsonlite)   # provides toJSON() and validate()
lst <- list(a="final \"cliff bill\" on")
json <- as.character(toJSON(lst))
json
# [1] "{\"a\":[\"final \\\"cliff bill\\\" on\"]}"
Okay, there should be no change:
expect_equal(json, fix_json_quotes(json))
Some bad data:
# un-escape the double quotes
badlst <- "{\"a\":[\"final \"cliff bill\" on\"]}"
expect_error(jsonlite::fromJSON(badlst))
expect_equal(json, fix_json_quotes(badlst))
PS: this looks specifically for quote characters, nothing more. However, I believe that there are related errors this might also be able to fix. I "left room" for this in the second group within the regex ((['\"])), which already covers single quotes in case they can also cause a problem; I don't know if that's useful or even necessary.
I get three different entries, "10576.53012.46344.35174", "10" and "Doc-15", in a foreach loop. Out of these 3 entries, I want 10576.53012.46344.35174. How can I verify that the current string contains numbers separated by multiple .s?
I'm new to Tcl; any suggestions are welcome.
This is the sort of task that is a pretty good fit for regular expressions.
The string 10576.53012.46344.35174 is matched by an RE like this: ^\d{5}(?:\.\d{5}){3}$, though you might want something a little less strict (e.g., with more flexibility in the number of digits per group, here 5, or in the number of groups following a ., here 3).
You test if a string matches a regular expression with the regexp command:
if {[regexp {^\d{5}(?:\.\d{5}){3}$} $theVarWithTheString]} {
    puts "the regular expression matched $theVarWithTheString"
}
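If you want the looser form mentioned above, a sketch (assuming at least two digit groups, with any number of digits each) could be:

# Digits separated by one or more "." -- closer to the question's
# "multiple . and numbers"
if {[regexp {^\d+(?:\.\d+)+$} $theVarWithTheString]} {
    puts "the string looks like a dotted number: $theVarWithTheString"
}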
An alternative approach is to split the string by . and check that each group is what you want:
set goodBits 0
set badString 0
foreach group [split $theVarWithTheString "."] {
    if {![string is integer -strict $group]} {
        set badString 1
        break
    }
    incr goodBits
}
if {!$badString && $goodBits == 4} {
    puts "the string was OK"
}
I greatly prefer the regular expression approach myself (with occasional help from string is as appropriate). Writing non-RE validators can be challenging and tends to require a lot of code.
Earlier I asked about gathering information from an API link, and I have managed to get out most of the details by using the answer I got.
Now my problem is using another API to get more information.
This time the file will contain this information:
{
    "username": "UserName",
    "confirmed_rewards": "0",
    "round_estimate": "0.00000000",
    "total_hashrate": "0.000",
    "payout_history": "0",
    "round_shares": "0",
    "workers": {
        "UserName.1": {
            "alive": "0",
            "hashrate": "0.000"
        },
        "UserName.2": {
            "alive": "0",
            "hashrate": "0.000"
        },
        "UserName.3": {
            "alive": "1",
            "hashrate": "1517.540",
            "last_share_timestamp": 1369598007
        },
        "UserName.4": {
            "alive": "0",
            "hashrate": "0.000"
        }
    }
}
And I want to gather each of the workers and print them out. This "workers" object can contain multiple entries, but the keys always start with "UserName.x", where the username comes from the "username" parameter each time.
The numbers will always vary, from 0 and up.
I want to gather the information in the same way, by accessing the document, decoding it, and printing out all the workers, however many there are.
By using the script provided in my last question (see the link at the start), I was thinking it would be something like:
local t = json.decode( txt )
print("Workers: ".. t["workers.UserName.1"])
But this was not the way.
Due to the username changing all the time, I was also thinking of something like:
print("Workers: ".. t["workers" .. "." .. "username" .. "." .. "1"])
From here I have no clue how I should gather the information, given that the names and numbers vary.
Thanks in advance
Here is the perfect solution:
local json = require "json"

-- jsonFile() is assumed to be the file-reading helper from the Corona
-- JSON tutorial linked below
local t = json.decode( jsonFile( "data.json" ) )
local workers = t.workers

for name, user in pairs(workers) do
    print("--------------------")
    print(name)
    for tag, value in pairs(user) do
        print(tag, value)
    end
end
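To address the t["workers.UserName.1"] attempt from the question: there is no single key named "workers.UserName.1"; workers is a nested table whose keys are strings like "UserName.1", so one worker is fetched with t.workers["UserName.1"]. A sketch, building that key from the decoded username field (the ".1" suffix is just an example):

-- Build the worker key from the username, e.g. "UserName.1"
local key = t.username .. ".1"
local worker = t.workers[key]
if worker then
    print(key, "alive:", worker.alive, "hashrate:", worker.hashrate)
end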
Here is some more info:
http://www.coronalabs.com/blog/2011/08/03/tutorial-exploring-json-usage-in-corona/
http://www.coronalabs.com/blog/2011/06/21/understanding-lua-tables-in-corona-sdk/
http://lua-users.org/wiki/TablesTutorial