I'm wondering how to achieve the following:
I have a string version of a BibTeX file (obtained through requests) in the following representation:
'#article{blablabla,\n key={\n\t1234567\n\t},\n title={\n\tblablabla\n\t},\n author={\n\t Name_of_the_authors \n\t},\n journal={\n\t Name_of_the_journal \n\t},\n volume={\n\t\n\t},\n pages={\n\t\n\t},\n year={\n\t 2020 \n\t},\n url={\n\t DOI URL \n\t}\n}'
From this I would like to obtain a dict holding the information that I need from the above string, for example:
dict1 = {'author': 'Name_of_the_author', 'year': 2020, 'url': 'DOI URL'}
Maybe I could exploit the curly brackets for getting the information?
Many thanks,
James
This is not the most beautiful code to get there, but assuming your sample string in the question is representative of the actual data, one way to handle your problem is through some string manipulation:
data = """#article{blablabla,\n key={\n\t1234567\n\t},\n title={\n\tblablabla\n\t},\n author={\n\t Name_of_the_authors \n\t},\n journal={\n\t Name_of_the_journal \n\t},\n volume={\n\t\n\t},\n pages={\n\t\n\t},\n year={\n\t 2020 \n\t},\n url={\n\t DOI URL \n\t}\n}"""
data2 = data.split(',')[1:]  # drop the '#article{blablabla' header
targets = [2, -2, -1]  # indices of author, year and url: only these 3 items are of interest
dict1 = {}
for target in targets:
    item = data2[target].replace('{\n\t', '').replace('\n\t}', '').strip().split("=")
    # the last field also carries the entry's closing '}', so strip that too
    dict1[item[0]] = item[1].strip().rstrip('}').strip()
dict1
Output:
{'author': 'Name_of_the_authors', 'year': '2020', 'url': 'DOI URL'}
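Since the question suggests exploiting the curly brackets, a regular expression is another option. The following is only a minimal sketch, assuming every field has the form name={...} with no nested braces inside the value; parse_bibtex_fields is a hypothetical helper name, not part of any library.

```python
import re

def parse_bibtex_fields(entry, wanted=None):
    """Return {field name: value} for a single BibTeX-like entry string."""
    # Match `name={ ... }` where the value contains no nested braces (assumption).
    pairs = re.findall(r'(\w+)\s*=\s*\{([^{}]*)\}', entry)
    fields = {name: value.strip() for name, value in pairs}
    if wanted is not None:
        # Keep only the requested fields that are actually present.
        fields = {name: fields[name] for name in wanted if name in fields}
    return fields

data = '#article{blablabla,\n key={\n\t1234567\n\t},\n title={\n\tblablabla\n\t},\n author={\n\t Name_of_the_authors \n\t},\n journal={\n\t Name_of_the_journal \n\t},\n volume={\n\t\n\t},\n pages={\n\t\n\t},\n year={\n\t 2020 \n\t},\n url={\n\t DOI URL \n\t}\n}'

dict1 = parse_bibtex_fields(data, wanted=['author', 'year', 'url'])
print(dict1)
```

All values come back as strings, so the year would still need int(...) if a number is wanted. For real-world BibTeX with nested braces, a dedicated parser is probably the safer choice.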
I am trying to write my custom API view and I am struggling a bit with querysets and JSON. It shouldn't be that complicated but I am stuck still. Also I am confused by some strange behaviour of the loop I coded.
Here is my view:
@api_view()
def BuildingGroupHeatYear(request, pk, year):
    passed_year = str(year)
    building_group_object = get_object_or_404(BuildingGroup, id=pk)
    buildings = building_group_object.buildings.all()
    for item in buildings:
        demand_heat_item = item.demandheat_set.filter(year=passed_year).values('building_id', 'year', 'demand')
        print(demand_heat_item)
        print(type(demand_heat_item))
    return Response(demand_heat_item)
Ok so this actually gives me back exactly what I want. Namely that:
{'building_id': 1, 'year': 2019, 'demand': 230.3}{'building_id': 1, 'year': 2019, 'demand': 234.0}
Ok, great, but why? Shouldn't the data be overwritten each time the loop goes over it?
Also, when I check the type of demand_heat_item I get back a queryset: <class 'django.db.models.query.QuerySet'>
But this is an API view, so I would like to get JSON back. Shouldn't that throw an error?
And how could I do this so that I get the same data structure back as JSON?
I tried to rewrite it like this, but without success, because I can't serialize it:
@api_view()
def BuildingGroupHeatYear(request, pk, year):
    passed_year = str(year)
    building_group_object = get_object_or_404(BuildingGroup, id=pk)
    buildings = building_group_object.buildings.all()
    demand_list = []
    for item in buildings:
        demand_heat_item = item.demandheat_set.filter(year=passed_year).values('building_id', 'year', 'demand')
        demand_list.append(demand_heat_item)
    json_data = json.dumps(demand_list)
    return Response(json_data)
I also tried with JSON Response and Json decoder.
But maybe there is a better way to do this?
Or maybe my question is clearer when formulated like this: how can I get the data out of the loop and return it as JSON?
Any help is much appreciated. Thanks in advance!!
Also, I tried the following:
for item in buildings:
    demand_heat_item = item.demandheat_set.filter(year=passed_year).values('building_id', 'year', 'demand')
json_data = json.dumps(list(demand_heat_item))
return Response(json_data)
That gives me this weird response, which I don't really want:
"[{\"building_id\": 1, \"year\": 2019, \"demand\": 230.3}, {\"building_id\": 1, \"year\": 2019, \"demand\": 234.0}]"
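The escaped string in the last output is what double encoding looks like: json.dumps already returns a JSON string, and the response then serialises that string a second time. Below is a minimal sketch of the effect in plain Python. Here render is a hypothetical stand-in for the JSON rendering that a response performs, and the per-building lists are assumed data standing in for list(queryset.values(...)).

```python
import json

# Assumed data: one list of dicts per building, as list(queryset.values(...)) would give.
rows_per_building = [
    [{"building_id": 1, "year": 2019, "demand": 230.3},
     {"building_id": 1, "year": 2019, "demand": 234.0}],
    [{"building_id": 2, "year": 2019, "demand": 120.5}],
]

def render(data):
    """Hypothetical stand-in for the JSON rendering a response performs."""
    return json.dumps(data)

# Collect the rows of every building into one flat list of dicts.
demand_list = []
for rows in rows_per_building:
    demand_list.extend(rows)

double_encoded = render(json.dumps(demand_list))  # a JSON string wrapped in a JSON string
single_encoded = render(demand_list)              # a JSON array of objects

print(double_encoded)
print(single_encoded)
```

In the view, this should correspond to extending demand_list with list(demand_heat_item) inside the loop and passing demand_list itself to Response, without calling json.dumps first.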
I have written the code below to concatenate two arrays and save them as a JSON file.
In this code, "seg" is an array of numbers that is produced elsewhere in my code. "info" is also an array; it contains some data that is followed by the "seg" array.
Defining variable types:
seg: Array<any> = [];
info: Array<any>=[];
final: Array<{info:any, Seg:any}>=[];
Pushing values into the arrays and combining them:
this.info.push({date_created: 25 , description: 'aaa', year:'2015'});
this.final.push({info: this.info ,Seg:this.seg});
this.file.writeFile(this.file.externalApplicationStorageDirectory, 'test.json', JSON.stringify(this.final));
The produced file looks something like this:
[{"info":[{"date_created":25,"description":"aaa","year":"2015"}],"Seg":[2,3,4,5]}]
As you can see, the info data is placed between two brackets, so the JSON file treats it as a list, not a record.
Does anyone know how I can remove these brackets from around the info array?
Should I change the type of the variable from an array to something else?
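You can declare final as a single object instead of an array of objects, so that it is stored as a record. If info itself should be a record rather than a list, assign the object to it directly instead of pushing it into an array.
seg: Array<any> = [];
info: Array<any> = [];
final: {info: any, Seg: any} = {info: null, Seg: null};
this.final.Seg = this.seg;
this.final.info = this.info;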
Good morning.
I want to use the following REST endpoint: https://rest.ensembl.org/documentation/info/sequence_id_post
I have the vector object (ids) in R:
> ids
[1] "NM_007294.3:c.932_933insT" "NM_007294.3:c.1883C>T" "NM_007294.3:c.2183A>C"
[4] "NM_007294.3:c.2321C>T" "NM_007294.3:c.4585G>A" "NM_007294.3:c.4681C>A"
I have to pass this vector (ids), which has more than 200 elements, as the body argument of the request, following the example code below, so that it works:
Code:
library(httr)
library(jsonlite)
library(xml2)
server <- "https://rest.ensembl.org"
ext <- "/vep/human/hgvs"
r <- POST(paste(server, ext, sep = ""), content_type("application/json"), accept("application/json"), body = '{ "hgvs_notations" : ["NM_007294.3:c.932_933insT", "NM_007294.3:c.1883C>T"] }')
stop_for_status(r)
head(fromJSON(toJSON(content(r))))
I know it's a json format, but when I convert my variable ids to json it's not in the correct format.
Do you have any suggestions?
Thanks for any help.
Leandro
I think that NM_007294.3:c.2321C>T is not a valid query for the /sequence/id REST endpoint. It contains a sequence id (NM_007294.3) and a variant (c.2321C>T); taken literally, you would be asking the server for the letter T, since this call returns sequences.
A valid query would contain only sequence ids, and you can build it like this (provided you have your ids in a vector):
r <- POST(paste(server, ext, sep = ""), content_type("application/json"), accept("application/json"), body = paste0('{ "ids" :', jsonlite::toJSON(ids), ' }'))
Depending on the downstream scenario, making your ids unique might help/speed things up.
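For comparison, here is the same idea sketched in Python rather than R: let a JSON library serialise the whole body instead of pasting strings together. build_body is a hypothetical helper, and the key name is a parameter because the question's example uses "hgvs_notations" while this answer uses "ids".

```python
import json

def build_body(values, key="ids"):
    """Serialise a list of identifiers into a JSON request body (hypothetical helper)."""
    # dict.fromkeys drops duplicates while keeping the original order.
    unique_values = list(dict.fromkeys(values))
    return json.dumps({key: unique_values})

ids = [
    "NM_007294.3:c.932_933insT",
    "NM_007294.3:c.1883C>T",
    "NM_007294.3:c.1883C>T",  # duplicate, removed before sending
]

body = build_body(ids, key="hgvs_notations")
print(body)
```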
I am new to Spark and want to read a log file and create a DataFrame out of it. My data is half JSON, and I cannot convert it into a DataFrame properly. Here is the first row in the file:
[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}
As you can see, the first part is plain text and the last part, between { }, is JSON. I tried a few things: converting it first to an RDD, then map and split, then converting back to a DataFrame. But I cannot extract the values from the JSON part of the row. Is there a trick to extract the fields in this context?
The final output should look like this:
TimeStamp | userid | ip | artist | album | song | id | service
2017-01-06 07:00:01 | 444444 | 11.11.111.0 | Tears For Fears | Songs From The Big Chair | Everybody Wants To Rule The World | S4555 | pandora
You just need to parse each line into a tuple with a Python function, then tell Spark to convert the RDD to a DataFrame. The easiest way to do this is probably a regular expression. For example:
import re
import json
def parse(row):
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'), match.group('level'), parsed_json['artist'], parsed_json['song'], parsed_json['service'])
lines = [
    '[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]
rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()
This prints
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|                 ts|userid|         ip|level|         artist|                song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
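The parse function above leaves out album and id, which the question's expected output lists, and it raises on lines that do not match. The following is a minimal sketch of a variant that covers both points; it runs without Spark, and parse_line, LOG_PATTERN and FIELDS are hypothetical names chosen for illustration.

```python
import json
import re

# Same layout as the sample line: [timestamp] userid:N ip level {json}
LOG_PATTERN = re.compile(
    r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\] '
    r'userid:(?P<userid>\d+) '
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) '
    r'(?P<level>\w+) '
    r'(?P<json>\{.*\})$'
)
FIELDS = ('artist', 'album', 'song', 'id', 'service')

def parse_line(line):
    """Return (ts, userid, ip, artist, album, song, id, service), or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group('json'))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Missing keys become None instead of raising KeyError.
    json_values = tuple(payload.get(name) for name in FIELDS)
    return (match.group('ts'), match.group('userid'), match.group('ip')) + json_values

sample = '[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
print(parse_line(sample))
```

With an RDD of lines, this could be applied as rdd.map(parse_line).filter(lambda r: r is not None) before calling toDF with the eight column names.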
I used the following, just some parsing that makes use of PySpark:
from pyspark.sql.types import StructField, StringType

# r1 is not shown here; x.value suggests rows of the log file read as text
parts = r1.map(lambda x: x.value.replace('[', '').replace('] ', '###')
               .replace(' userid:', '###').replace('null', '"null"').replace('""', '"NA"')
               .replace(' music_info {"artist":"', '###').replace('","album":"', '###')
               .replace('","song":"', '###').replace('","id":"', '###')
               .replace('","service":"', '###').replace('"}', '###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))
schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
With this I got almost what I want, and performance was super fast.
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|          timestamp|              mac|           userid_ip|              artist|               album|                song|                  id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00| 111122 22.235.17...|The United States...|  This Is Christmas!|Do You Hear What ...|            S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11| 123123 108.252.2...|                  NA|  Dinner Party Radio|                  NA|                null|pandora|
I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed by using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use 'Loader' based approach since I use HCatalog Loader to load data from Hive.)
But the problem is the structure of my data: each value of the Map holds a complex JSON document as a string. For example,
1. My table looks like (WARNING: type of 'complex_data' is not STRING, a MAP of <STRING, STRING>!)
TABLE temp_table
(
user_id BIGINT COMMENT 'user ID.',
complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains (a value that I want to get is marked with two *s, so basically #'d'#'f' from each PARSED_STRING(complex_data#'c') )
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD temp_table USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data per each row
-- eg) ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
--     ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- IT ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value is not a complete JSON document. However, when I CONCAT as in "attempt2", an additional \ mark is generated (so each line starts with {\"key\": ). I'm not sure whether these additional marks break the parsing rules. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a Jython UDF.
I know that I could also solve it with Java or other languages.
But I think that Jython with jyson is the simplest answer to this issue.
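The UDF itself is not shown above, so the following is only a sketch of the parsing step it would need, written in plain Python with the standard json module standing in for jyson. extract_f_values is a hypothetical name, and the sample is a shortened, valid-JSON version of the value of complex_data#'c' (the trailing comma before the closing bracket in the question's sample would have to go).

```python
import json

def extract_f_values(c_value):
    """Parse the JSON array held as a string and return d.f of every element."""
    if not c_value:
        return []
    items = json.loads(c_value)
    values = []
    for item in items:
        # Skip elements that do not have the expected {"d": {"f": ...}} shape.
        d = item.get('d') if isinstance(item, dict) else None
        if isinstance(d, dict) and 'f' in d:
            values.append(d['f'])
    return values

sample = (
    '[{"d":{"e":"sdfsdf","f":"sdfs","g":"qweqweqwe"},'
    '"c":[{"d":21321,"e":"ewrwer"}]},'
    '{"d":{"e":"sdfsdf","f":"sdfs2","g":"qweqweqwe"},'
    '"c":[{"d":21321,"e":"ewrwer"}]}]'
)
print(extract_f_values(sample))
```

Inside Pig, a function like this would additionally need to be registered as a Jython UDF with an output schema so that the returned list is treated as a bag.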