ml kit - Text Recognition output issues - firebase-mlkit

I am creating a business card app using the ML Kit text recognition. I have an app that is up and working, but I have found that when I upload a business card and extract the text, it comes back as a jumble of clumps with no spaces.
I need to literally extract the text line by line.
Is there a way to fix this?

When the recognition operation succeeds, a FirebaseVisionText object will be passed to the success listener. A FirebaseVisionText object contains the full text recognized in the image and zero or more TextBlock objects.
Each TextBlock represents a rectangular block of text, which contains zero or more Line objects. Each Line object contains zero or more Element objects, which represent words and word-like entities (dates, numbers, and so on).
For each TextBlock, Line, and Element object, you can get the text recognized in the region and the bounding coordinates of the region.
For example:
val resultText = result.text
for (block in result.textBlocks) {
    val blockText = block.text
    val blockConfidence = block.confidence
    val blockLanguages = block.recognizedLanguages
    val blockCornerPoints = block.cornerPoints
    val blockFrame = block.boundingBox
    for (line in block.lines) {
        val lineText = line.text
        val lineConfidence = line.confidence
        val lineLanguages = line.recognizedLanguages
        val lineCornerPoints = line.cornerPoints
        val lineFrame = line.boundingBox
        for (element in line.elements) {
            val elementText = element.text
            val elementConfidence = element.confidence
            val elementLanguages = element.recognizedLanguages
            val elementCornerPoints = element.cornerPoints
            val elementFrame = element.boundingBox
        }
    }
}
Source: ML Kit documentation
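For the line-by-line extraction the question asks about, you can flatten the Line objects out of every TextBlock and join their text. A minimal sketch, assuming `result` is the FirebaseVisionText from the success listener as above:

```kotlin
// Collect every recognized line, in block order, and join them with newlines.
val lineTexts: List<String> = result.textBlocks
    .flatMap { block -> block.lines }
    .map { line -> line.text }
val lineByLine: String = lineTexts.joinToString(separator = "\n")
```

Note that ML Kit does not guarantee that blocks come back in reading order, so for a business card you may still want to sort the lines by their boundingBox coordinates before joining.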

Related

JValue in Json4s (Scala): How to create an empty JValue

I want to create an empty JValue to be able to parse JSON objects together.
For now I am creating a JValue containing {}, parsing the other objects into it, and at the end removing the first row using an RDD, but I would like to create
var JValue: JValue = JValue.empty
from the beginning to be able to skip the removing part.
Is it possible to create an empty JValue?
import org.json4s._
import org.json4s.jackson.JsonMethods._
var JValue: JValue = parse("{}")
val a = parse(""" {"name":"Scott", "age":33} """)
val b = parse(""" {"name":"Scott", "location":"London"} """)
JValue = JValue.++(a)
JValue = JValue.++(b)
val df = spark.read.json(Seq(compact(render(JValue))).toDS())
val rdd = df.rdd.first()
val removeFirstRow = df.rdd.filter(row => row != rdd)
val newDataFrame = spark.createDataFrame(removeFirstRow,df.schema)
If I understand correctly what you are trying to achieve, you can start from an empty array like so:
var JValue: JValue = JArray(List.empty)
Calling the ++ method on the empty array will result in the items being added to that array, as defined here.
The final result is the following object:
[ {
"name" : "Scott",
"age" : 33
}, {
"name" : "Scott",
"location" : "London"
} ]
If you want to play around with the resulting code, you can have a look at this worksheet on Scastie (please bear in mind that I did not pull in the Spark dependency there and I'm not 100% sure that would work anyway in Scastie).
As you can notice in the code I linked above, you can also just do a ++ b to obtain the same result, so you don't necessarily have to start from the empty array.
As a further note, you may want to rename JValue to something different to avoid weird errors in which you cannot tell apart the variable and the JValue type. Usually in Scala types are capitalized and variables are not. But of course, try to work towards the existing practices of your codebase.
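To make that last point concrete, a sketch using the same json4s imports and the a/b values from above; note also that JNothing acts as the identity for ++, so it is another viable "empty" starting value:

```scala
// a ++ b wraps the two objects in a JArray; no explicit empty array needed
val combined: JValue = a ++ b

// JNothing ++ x == x, so this produces the same two-element array
val same: JValue = JNothing ++ a ++ b

// compact(render(combined)) renders the two-object array shown above
```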

JsonNode with Array elements

The node fields contain an array. Currently there is only one element, so I can use get(0) to get the first element and parse the string to find the value of valid.
The problem with this solution is that if tomorrow more k:v pairs are added to the array, this will fail. Also, is there a more elegant way to parse the value of valid?
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
val response = """{"myTable":{"fields":["valid:true"]}}"""
val mapper = new ObjectMapper()
val node = mapper.readTree(response)
val result = node.get("myTable").get("fields").get(0).toString.contains("valid:true")
println(s"valid=$result")
Result:
valid=true
You can iterate over array items using elements method. You just need to convert Iterator to Stream:
import java.util.Spliterators
import java.util.stream.StreamSupport

val fieldsValues = node.get("myTable").get("fields").elements()
// anyMatch scans all items, so this keeps working when more "k:v" strings are added;
// the findAny().orElse(false) variant would only test an arbitrary first element
val value = StreamSupport.stream(Spliterators.spliteratorUnknownSize(fieldsValues, 0), false)
  .filter(item => item.isTextual)
  .anyMatch(item => item.textValue().equals("valid:true"))
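If you are calling this from Scala anyway, the Java stream plumbing can arguably be skipped entirely. A sketch using the Scala collection converters (scala.jdk.CollectionConverters on Scala 2.13+; on earlier versions use scala.collection.JavaConverters):

```scala
import scala.jdk.CollectionConverters._

// elements() returns a java.util.Iterator[JsonNode]; asScala lets us use exists,
// which checks every element of the array, however many are added later
val valid = node.get("myTable").get("fields").elements().asScala
  .exists(item => item.isTextual && item.textValue() == "valid:true")
```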

akka-stream - How to treat the last element of a stream differently in a Flow/Graph

I'm trying to implement an Akka Streams Flow that will convert a stream of JSON objects to a stream of a single array of JSON objects. I can use Concat to add an "[" before and "]" after, as well as Zip to insert commas in between elements, but I can't figure out how to not insert the final comma.
The code I have so far is:
import akka.NotUsed
import akka.stream.FlowShape
import akka.stream.scaladsl.{Concat, Flow, GraphDSL, Source, Zip}
import play.api.libs.json.{Json, Writes}

trait JsonStreamSupport {

  protected def toJsonArrayString[T : Writes] =
    Flow[T].map(Json.toJson(_)).map(_.toString()).via(jsonArrayWrapper)

  private[this] val jsonArrayWrapper: Flow[String, String, NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit b =>
      import GraphDSL.Implicits._
      val start = Source.single("[")
      val comma = Source.repeat(",")
      val end = Source.single("]")
      val concat = b.add(Concat[String](3))
      val zip = b.add(Zip[String, String])
      comma ~> zip.in1
      start ~> concat.in(0)
      zip.out.map { case (msg, delim) => msg + delim } ~> concat.in(1)
      end ~> concat.in(2)
      FlowShape(zip.in0, concat.out)
    })
}
Currently the output is:
[{"key":"value"},{"key":"value"},]
but I need it to be
[{"key":"value"},{"key":"value"}] (without the final comma), where each element of the array is still a distinct element of the stream, so it can, for example, be sent over chunked HTTP separately.
I just found out about intersperse, which is exactly what you need, and much simpler than what I suggested in the first place:
http://doc.akka.io/api/akka/2.4.4/index.html#akka.stream.scaladsl.Flow#intersperse[T%3E:Out]%28start:T,inject:T,end:T%29:FlowOps.this.Repr[T]
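With intersperse, the whole jsonArrayWrapper graph above collapses to a one-liner. A sketch, assuming Akka Streams 2.4.4+, where intersperse(start, inject, end) is available:

```scala
import akka.NotUsed
import akka.stream.scaladsl.Flow

// Emits "[" first, the elements separated by ",", and "]" on upstream completion;
// each array element is still a distinct stream element, so chunked HTTP still works.
val jsonArrayWrapper: Flow[String, String, NotUsed] =
  Flow[String].intersperse("[", ",", "]")
```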

Play 2.1 Reading JSON Objects in order

JSON to Parse: http://www.dota2.com/jsfeed/heropickerdata?v=18874723138974056&l=english
Hero Class and JSON Serialization
case class Hero(
var id:Option[Int],
name: String,
bio: String,
var trueName:Option[String]
){}
implicit val modelReader: Reads[Hero] = Json.reads[Hero]
Reading Data
val future: Future[play.api.libs.ws.Response] = WS.url("http://www.dota2.com/jsfeed/heropickerdata?v=18874723138974056&l=english").get()
val json = Json.parse(Await.result(future,5 seconds).body).as[Map[String, Hero]]
var i = 1
json.foreach(p => {
p._2.trueName = Some(p._1)
p._2.id = Some(i)
p._2.commitToDatabase
i += 1
})
I need to get the id of each hero. The order of heroes in the JSON matches their id. Obviously a map is unordered and won't work. Does anyone have any other ideas?
I have tried to use a LinkedHashMap. I even tried to make an implicit Reads for LinkedHashMap, but I failed. If anyone thinks that this is the answer, would you please give me some guidance?
It keeps saying "No Json deserializer found for type scala.collection.mutable.LinkedHashMap[String,models.Hero]. Try to implement an implicit Reads or Format for this type.". I have the trait imported into the file I'm trying to read from. I have a funny feeling that the last line in my Reads is the problem: I think I can't just do the asInstanceOf, but I have no other ideas of how to write this Reads.
LinkedHashMap Implicit Reads Code: http://pastebin.com/cf5NpSCX
You can try extracting data in order from the JsObject returned by Json.parse directly, possibly like this:
val json = Json.parse(Await.result(future, 5.seconds).body)
val heroes: Map[String, Hero] = json match {
  case obj: JsObject =>
    obj.fields.zipWithIndex.flatMap { case ((name, heroJson), idx) =>
      // ids in the feed start at 1, zipWithIndex at 0
      heroJson.asOpt[Hero].map(hero => name -> hero.copy(id = Some(idx + 1)))
    }.toMap
  case _ => Map.empty
}
I don't believe you'll need an order-preserving map anymore since the ids are generated and fixed.

Parse HTML in Scala

Task: an HTML parser in Scala. I'm pretty new to Scala.
So far, I have written a little parser in Scala to parse a random HTML document.
import scala.xml.Elem
import scala.xml.Node
import scala.collection.mutable.Queue
import scala.xml.Text
import scala.xml.PrettyPrinter
object Reader {

  def loadXML = {
    val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    val parser = parserFactory.newSAXParser()
    val source = new org.xml.sax.InputSource("http://www.randomurl.com")
    val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
    val feed = adapter.loadXML(source, parser)
    feed
  }

  def proc(node: Node): String =
    node match {
      case <body>{ txt }</body> => "Partial content: " + txt
      case _ => "grmpf"
    }

  def main(args: Array[String]): Unit = {
    val content = Reader.loadXML
    Console.println(content)
    Console.println(proc(content))
  }
}
The problem is that proc does not work. Basically, I would like to get exactly the content of one node. Or is there another way to achieve that without matching?
Does the feed in the loadXML function give me back the right format for parsing, or is there a better way to achieve that? feed gives me back the root node, right?
Thanks in advance
You're right: adapter.loadXML(source, parser) gives you the root node. The problem is that that root node probably isn't going to match the body case in your proc method. Even if the root node were body, it still wouldn't match unless the element contained nothing but text.
You probably want something more like this:
def proc(node: Node): String = (node \\ "body").text
Where \\ is a selector method that's roughly equivalent to XPath's //; that is, it returns all the descendants of node named body. If you know that body is a child (as opposed to a deeper descendant) of the root node, which is probably the case for HTML, you can use \ instead of \\.
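A quick illustration with an inline XML literal (assuming the scala-xml module, the same node type that NoBindingFactoryAdapter returns):

```scala
import scala.xml.Node

val doc: Node = <html><head><title>t</title></head><body>Hello <b>world</b></body></html>

// \\ selects all descendants named "body"; .text concatenates their text nodes
val bodyText = (doc \\ "body").text
// bodyText == "Hello world"
```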