Delimiting text string based on multiple criteria instead of character in MS Access Query - ms-access

I have a field with varying strings of concatenated text that I need to delimit. I need the phrase and the count of how many times that phrase appeared in two separate fields, and then to repeat the same process for every additional phrase.
Example of table field text:
"some text":2; some:other NEAR text:1;
Desired Results:
[Field 1]: "Some Text", [Field 2]: 2, [Field 3]: some:other NEAR text, [Field 4]: 1
The problem I am having is that when I use ":" and ";" to delimit the field using Len, Instr, InstrRev, Left, Right and Mid functions it is delimiting the "some:other NEAR text" string into "some" and "other NEAR text". Is there a way around this or should I go about this in another way? Any help is appreciated.

Is this a one-time fix of bad data to parse into discrete fields? You should show your attempted code.
Assuming every record has a value in the example structure, try the following (x represents your concatenated data field):
Field1: Left(x, InStr(x, ":")-1)
Field2: Val(Mid(Left(x, InStr(x, ";")),InStrRev(Left(x, InStr(x, ";")),":")+1))
Field3: Mid(x, InStr(x, ";")+2, Len(Mid(x, InStr(x, ";")+2))-Len(Mid(x,InStrRev(x,":"))))
Field4: Val(Mid(x,InStrRev(x,":")+1))
Otherwise, you might have to build a custom VBA function.
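A quick way to sanity-check the expected output before wrestling with the Access expressions is to prototype the logic elsewhere. This Python sketch (using the sample value from the question) splits on ";" first and then only on the last ":" of each chunk, which is exactly what the InStrRev-based expressions above emulate:

```python
# Split on ";" to get "phrase:count" chunks, then split each chunk
# on its LAST colon only, so colons inside the phrase are preserved.
raw = '"some text":2; some:other NEAR text:1;'

pairs = []
for chunk in raw.split(';'):
    chunk = chunk.strip()
    if not chunk:
        continue  # skip the empty piece after the trailing ";"
    phrase, count = chunk.rsplit(':', 1)  # rsplit = split from the right
    pairs.append((phrase.strip('"'), int(count)))

# pairs -> [('some text', 2), ('some:other NEAR text', 1)]
```

The key idea carried back to Access is InStrRev: searching for the colon from the right ensures "some:other NEAR text" is not broken apart.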

Related

How can I loop with multiple conditional statements in OpenRefine (GREL)

I am geocoding using OpenRefine. I pulled data from OpenStreetMap into my dataset.
I am adding a "column based on this column" for the coordinates. I want to check that the display_name contains "Rheinland-Pfalz" and, if it does, extract the latitude and longitude, i.e. pair.lat + ',' + pair.lon. I want to do this iteratively but I don't know how. I have tried the following:
if(display_name[0].contains("Rheinland-Pfalz"), with(value.parseJson()[0], pair, pair.lat + ',' + pair.lon),"nothing")
but I want to do this for each index [0] up to however many there are. I would appreciate if anyone could help.
Edit: Thanks for your answer b2m.
How would I extract the display_name corresponding to the coordinates that we get? I want the output to be display_name lat,lon for each match (i.e. each one containing "Rheinland-Pfalz"), because I have a different column containing a piece of string that I want to match against the matches generated already.
For example, using b2m's code and incorporating the display_name in the output we get 2 matches:
Schaumburg, Balduinstein, Diez, Rhein-Lahn-Kreis, Rheinland-Pfalz, Deutschland 50.33948155,7.9784308849342604
Schaumburg, Horhausen, Flammersfeld, Landkreis Altenkirchen, Rheinland-Pfalz, Deutschland 52.622319,14.5865283
For each row, I have another string in a different column. Here the entry is "Rhein-Lahn-Kreis". I want to filter the two matches above to keep only those containing the string from the other column, in this case "Rhein-Lahn-Kreis", but the other column's entry is different for each row. I hope this is clear, and I would greatly appreciate any help.
Assuming we have the following json data
[
{"display_name": "BW", "lat": 0, "lon": 1},
{"display_name": "NRW 1", "lat": 2, "long": 3},
{"display_name": "NRW 2", "lat": 4, "lon": 5}
]
You can extract the combined elements lat and lon with forEach and filter, using the following GREL expression, e.g. in the Add column based on this column dialog:
forEach(
  filter(
    value.parseJson(), geodata, geodata.display_name.contains("NRW")
  ),
  el, el.lat + "," + el.lon
).join(";")
This will result in a new field with the value 2,3;4,5.
You can then split the new multi valued field on the semicolon ";" to obtain separated values (2,3 and 4,5).
Another approach would be to split the JSON Array elements into separate rows, avoiding the forEach and filter functions.
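For readers more comfortable in Python, the same filter-then-join logic can be sketched outside OpenRefine. This uses the sample data from the answer; note that the second record's key is "long" rather than "lon", so the sketch falls back to "long" when "lon" is missing:

```python
import json

# Sample data from the answer above (second record uses "long", not "lon").
data = '''[
    {"display_name": "BW", "lat": 0, "lon": 1},
    {"display_name": "NRW 1", "lat": 2, "long": 3},
    {"display_name": "NRW 2", "lat": 4, "lon": 5}
]'''

rows = json.loads(data)
# Keep only records whose display_name contains "NRW",
# then join each record's "lat,lon" pair with ";".
result = ";".join(
    f'{r["lat"]},{r.get("lon", r.get("long"))}'
    for r in rows
    if "NRW" in r["display_name"]
)
# result -> "2,3;4,5"
```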

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
"generator": {
"name": "Xfer Records Serum",
....
},
"generator": {
"name: "Lennar Digital Sylenth1",
....
}
}
I ask the user for a search term, and the input is searched for in the name key only. All matching results are returned, which means that if I input just 's', both of the above would be returned. Please also explain how to return all the object names which are generators. The simpler the method, the better it will be for me. I use the json library, but if another library is required, that is not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    # Raw string avoids invalid escape warnings; re.escape keeps regex
    # metacharacters in the user's term from breaking the pattern.
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + re.escape(term) + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The search_names() function builds a regular expression that matches any line starting with "name": (with a varying amount of whitespace), followed by your search term with any other characters around it, terminated by a closing quote, an optional comma, and the end of the string. It applies that expression to each line from the file, filters out the non-matching lines, and returns the value of the name property (the capture-group contents) for each match.
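Since the question also asks about the json library: the snippet in the question is not valid JSON (an object cannot have two "generator" keys), so a dict-based approach only works once the data is reshaped. This sketch assumes the generators are collected into a list under a hypothetical "generators" key:

```python
import json

# Hypothetical reshaping of the question's data into valid JSON:
# the duplicate "generator" keys become entries in a list.
doc = json.loads('''{
    "generators": [
        {"name": "Xfer Records Serum"},
        {"name": "Lennar Digital Sylenth1"}
    ]
}''')

def search_names(term, doc):
    """Return every generator name containing the term (case-insensitive)."""
    term = term.lower()
    return [g["name"] for g in doc["generators"] if term in g["name"].lower()]

print(search_names("s", doc))
```

With this shape there is no regex at all, and searching "s" returns both names, matching the behaviour described in the question.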

Parse JSON request with REGEX

I'd like to parse the JSON output from an IEX Cloud stock quote query: https://cloud.iexapis.com/stable/stock/aapl/quote?token=YOUR_TOKEN_HERE
I have tried to use Regex101 to solve the issue:
https://regex101.com/r/y8i01T/1/
Here is the regex that I tried: "([^"]+)":"?([^",\s]+)
Here is the example of a IEX Cloud stock quote output for Apple:
{
"symbol":"AAPL",
"companyName":"Apple, Inc.",
"calculationPrice":"close",
"open":204.86,
"openTime":1556285400914,
"close":204.3,
"closeTime":1556308800303,
"high":205,
"low":202.12,
"latestPrice":204.3,
"latestSource":"Close",
"latestTime":"April 26, 2019",
"latestUpdate":1556308800303,
"latestVolume":18604306,
"iexRealtimePrice":204.34,
"iexRealtimeSize":48,
"iexLastUpdated":1556308799763,
"delayedPrice":204.3,
"delayedPriceTime":1556308800303,
"extendedPrice":204.46,
"extendedChange":0.16,
"extendedChangePercent":0.00078,
"extendedPriceTime":1556310657637,
"previousClose":205.28,
"change":-0.98,
"changePercent":-0.00477,
"iexMarketPercent":0.030716437366704246,
"iexVolume":571458,
"avgTotalVolume":27717780,
"iexBidPrice":0,
"iexBidSize":0,
"iexAskPrice":0,
"iexAskSize":0,
"marketCap":963331704000,
"peRatio":16.65,
"week52High":233.47,
"week52Low":142,
"ytdChange":0.29512900000000003
}
I want to save the key-value pairs in the JSON response without quotes around the key, and gather the value starting after the colon (:). I need to exclude any quotes around text values and the comma at the end of each line, and include the last key-value pair, which does not have a comma at the end of the line.
For example, "peRatio":16.65, should have the key equal to peRatio and the value equal to 16.65. Or another example, "changePercent":-0.00477, should have a key equal to changePercent and a value of -0.00477. If it's a text such as "companyName":"Apple, Inc.",, it should have a key equal to companyName and a value equal to Apple, Inc.
Also, the last JSON key-value entry: "ytdChange":0.29512900000000003 does not have a comma and that needs to be accounted for.
You most likely do not need to parse your data using regex. However, if you wish/have to do so, maybe for practicing regular expressions, you could do so by defining a few boundaries in your expression.
This RegEx might help you to do that, which divides your input JSON values into three categories of string, numeric, and last no-comma values:
"([^"]+)":("(.+)"|(.+))(,{1}|\n\})
Then, you can use the \n} boundary for the last value, the "" boundary for your string values, and no boundary for the numeric values.

Removing \n \\n and other unwanted characters from a json unicode dictionary with python

I've tried a couple of different solutions to fix my problem with some "funny" newlines within my JSON dictionary, and none of them work, so I thought I might make a post. The dictionary is obtained by scraping a website.
I have a json dictionary:
my_dict = {
    u"Danish title": u"Avanceret",
    u"Course type": u"MScTechnol",
    u"Type of": u"assessmen",
    u"Date": u"\nof exami",
    u"Evaluation": u"7 step sca",
    u"Learning objectives": u"\nA studen",
    u"Participants restrictions": u"Minimum 10",
    u"Aid": u"No Aid",
    u"Duration of Course": u"13 weeks",
    u"name": u"Advanced u",
    u"Department": u"31\n",
    u"Mandatory Prerequisites": u"31545",
    u"General course objectives": u"\nThe cour",
    u"Responsible": u"\nMartin C",
    u"Location": u"Campus Lyn",
    u"Scope and form": u"Lectures, ",
    u"Point( ECTS )": u"10",
    u"Language": u"English",
    u"number": u"31548",
    u"Content": u"\nThe cour",
    u"Schedule": u"F4 (Tues 1"
}
I have stripped the value content to [:10] to reduce clutter, but some of the values have a length of 300 characters. It might not be portrayed well here, but some of values have a lot of newline characters in them and I've tried a lot of different solutions to remove them, such as str.strip and str.replace but without success because my 'values' are unicode. And by values I mean key, value in my_dict.items().
How do I remove all the newlines appearing in my dictionary? (With the values in focus, as some of the newlines are trailing, some are leading, and others are in the middle of the content, e.g. \nI have a\ngood\n idea\n.)
EDIT
I am using Python v. 2.7.11 and the following piece of code doesn't produce what I need. I want all the newlines to be changed to a single whitespace character.
for key, value in test.items():
    value = str(value[:10]).replace("\n", " ")
    print key, value
If you're trying to remove all \n or any junk characters apart from numbers and letters, then use a regex:
import re

for key in my_dict.keys():
    my_dict[key] = my_dict[key].replace('\n', '')
    my_dict[key] = re.sub('[^A-Za-z0-9 ]+', '', my_dict[key])
print my_dict
If you wish to keep anything apart from those, then add it to the character class inside the regex.
To remove '\n', try this:
for key, value in my_dict.items():
    my_dict[key] = ''.join(value.split('\n'))
You need to put the updated value back into your dictionary (similar to a "by value vs. by reference" situation ;) )...
To remove the "\n", this one-liner may be more "pythonic":
new_test = {k: v.replace("\n", "") for k, v in test.iteritems()}
To do what you try to do in your loop, try something like:
new_test = {k: str(v[:10]).replace("\n", " ") for k, v in test.iteritems()}
In your code, value takes the new value, but you never write it back...
So, for example, this would work (but be slower; also, you would be changing the values inside the loop, which should not cause problems, but the interpreter might not like it...):
for key, value in test.items():
    value = str(value[:10]).replace("\n", " ")
    # now put it back into the dictionary...
    test[key] = value
    print key, value
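Since the asker wants every run of newlines replaced by a single space (not just stripped from the ends), a regex-based comprehension handles leading, trailing, and embedded newlines in one pass. A minimal sketch on a trimmed-down version of the question's dictionary:

```python
import re

# Trimmed-down sample with leading and embedded newlines.
my_dict = {
    u"Date": u"\nof exami",
    u"Content": u"\nThe cour\nmore",
}

# Collapse each run of newlines to one space, then trim the ends.
cleaned = {k: re.sub(r'\n+', ' ', v).strip() for k, v in my_dict.items()}
# cleaned -> {u'Date': u'of exami', u'Content': u'The cour more'}
```

Unlike str.strip, which only touches the ends of a string, re.sub also fixes the newlines in the middle of the content.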

Logstash - Substring from CSV column

I want to import a lot of information from a CSV file into Elasticsearch.
My issue is that I don't know how to use an equivalent of substring to select information from a CSV column.
In my case, I have a date field (YYYYMMDD) and I want to have (YYYY-MM-DD).
I use filter, mutate, gsub like:
filter {
  mutate {
    gsub => ["date", "[0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789]", "[0123456789][0123456789][0123456789][0123456789]-[0123456789][0123456789]-[0123456789][0123456789]"]
  }
}
But my result is wrong.
I can identify my string, but I don't know how to extract part of it.
My target it's to have something like:
gsub => ["date", "[0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789]","%{date}(0..3}-%{date}(4..5)-%{date}"(6..7)]
%{date}(0..3} : selects the first four characters of the CSV column date
You can use the ruby filter plugin to do the conversion. As you say, you have a date field, so we can use it directly in Ruby:
filter {
  ruby {
    code => "
      date = Time.strptime(event['date'], '%Y%m%d')
      event['date_new'] = date.strftime('%Y-%m-%d')
    "
  }
}
The date_new field is in the format you want.
First, you can use a regexp range to match a sequence, so rather than [0123456789], you can do [0-9]. If you know there will be 4 numbers, you can do [0-9]{4}.
Second, you want to "capture" parts of your input string and reorder them in the output. For that, you need capture groups:
([0-9]{4})([0-9]{2})([0-9]{2})
where parens define the groups. Then you can reference those on the right side of your gsub:
\1-\2-\3
\1 is the first capture group, etc.
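The capture-group substitution can be tried out in Python before putting it into the Logstash config, since Logstash's gsub uses the same \1 backreference syntax:

```python
import re

# Reorder YYYYMMDD into YYYY-MM-DD using three capture groups,
# referenced as \1, \2, \3 in the replacement.
date = "20190426"
new_date = re.sub(r'([0-9]{4})([0-9]{2})([0-9]{2})', r'\1-\2-\3', date)
# new_date -> "2019-04-26"
```

The equivalent mutate/gsub entry would then pair the grouped pattern with the "\1-\2-\3" replacement string.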
You might also consider extracting these three fields when you do the grok{}, and then putting them together again later (perhaps with add_field).