Created
December 6, 2021 20:12
Rik Van Bruggen created this gist on Dec 6, 2021.
# Revisiting contact tracing with Neo4j 4.4's transaction batching capabilities

<img src="https://icon2.cleanpng.com/20180419/vsw/kisspng-sinterklaas-netherlands-zwarte-piet-surprise-dutch-st-vector-5ad81a547e09d4.5770211115241119565163.jpg" align="right" width="150"></img>

Yes! It's been a few months, but Saint Nicholas just brought us a brand new and shiny release of [Neo4j 4.4](https://neo4j.com/blog/neo4j-4-4-the-fastest-path-to-graph-database-productivity-generally-available/) to play with. One of the key features is a _generic transaction batching_ capability, similar to what we have been using in `apoc.periodic.iterate` but now built right into the core of the database. It is referred to as the [CALL in Transaction](https://neo4j.com/docs/cypher-manual/current/introduction/transactions/) capability - and of course it is a really interesting feature.

So in this article I will be revisiting [this blogpost](http://blog.bruggen.com/2021/06/revisiting-covid-19-contact-tracing.html), but without the need for [APOC's `apoc.periodic.iterate` feature](https://neo4j.com/labs/apoc/4.2/overview/apoc.periodic/apoc.periodic.iterate/). Let's see how that goes.

---

## Create a synthetic contact tracing graph - size of Antwerp

The first step of course is going to be similar to, if not exactly the same as, the work I did in 2020 on contact tracing. Take a look at [that post](http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html) to see how that went. The key thing to recall there is that I was using the fantastic `faker` plugin. You can download it yourself from the [github page](https://github.com/neo4j-contrib/neo4j-faker).
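
Because the plugin exposes its procedures and functions under the `fkr` prefix, Neo4j has to be told that it may load them. A minimal sketch of the relevant `neo4j.conf` lines, assuming the Neo4j 4.4 setting names and that APOC and GDS are installed alongside it (the plugin's README is the place to check which entries it really needs):

```properties
# Assumption: Neo4j 4.4 setting names; trim the lists to the plugins you actually run.
# Allow the fkr.* procedures and functions to be loaded, next to apoc.* and gds.*
dbms.security.procedures.allowlist=apoc.*,gds.*,fkr.*
# Assumption: only needed if a plugin requires access to internal APIs
dbms.security.procedures.unrestricted=apoc.*,gds.*,fkr.*
```

These settings are read at startup, so the database needs a restart before the `fkr.*` functions become available.
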
Installation is super easy. You just need to make sure the config is updated too, so that `fkr.*` is whitelisted just like `gds.*` and `apoc.*`.

As with the previous post, I will be pushing the scale up to the size of my home city of [Antwerp](https://www.antwerpen.be), Belgium. And critically, we will not even use APOC - we will use the transaction batching instead.

---

## Create 500000 `(Person)` nodes

Previously we did this in one transaction - which is probably at the limits of what I would normally do. But since we now have this _transaction batching_ mechanism in Cypher, let's use it:

```cypher
:auto UNWIND range(1,500000) AS id
CALL {
    WITH id
    CREATE (p:Person {id: id})
    // Add the fake person properties generated by the faker plugin
    SET p += fkr.person('1950-01-01','2021-12-01')
    SET p.healthstatus = fkr.stringElement("Sick,Healthy")
    // Random confirmation time: up to 100 days and 10 hours in the past
    SET p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    SET p.birthDate = datetime(p.birthDate)
    // Random location close to the centre of Antwerp
    SET p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    SET p.name = p.fullName
    REMOVE p.fullName
} IN TRANSACTIONS OF 25000 ROWS;
```

This returns a little more slowly than a single-shot transaction would, but that is to be expected. Here's the result:

Then, we will create the `(Place)` nodes.

---

## Create 10000 `(Place)` nodes

Adding the places is instantaneous, even with two batches of 5000:

```cypher
:auto UNWIND range(1,10000) AS id
CALL {
    WITH id
    CREATE (p:Place {id: id, name: "Place nr "+id})
    SET p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
    SET p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
} IN TRANSACTIONS OF 5000 ROWS;
```

The result looks like this:

---

## Put in place some indexes on the NODES and future RELATIONSHIPS

We don't really need them for this demo - but they could be useful for other queries.
Note that we are using the relationship-centric model here - as we proved in the last blogpost that this is at least as capable as, and much simpler than, the reified model that used `(Visit)` nodes. So here we add the node indexes:

```cypher
CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodenam FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
```

And we also add the index on the `-[:VISITS]->` relationship property:

```cypher
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
```

Now we can add the 1.5M relationships - the real test of the new transaction batching functionality.

## Add 1500000 random visits to places

It's pretty straightforward and similar to the previous examples, so let's just dive in:

```cypher
:auto UNWIND range(1,1500000) AS iteration
CALL {
    WITH iteration
    // Pick a random person and a random place by id
    MATCH (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id: toInteger(rand()*10000)+1})
    CREATE (p)-[virel:VISITS]->(pl)
    // Random start time: up to 100 days and 10 hours in the past
    SET virel.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    // Random visit length: up to 10 hours and 60 minutes
    SET virel.endtime = virel.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
    SET virel.visittime = duration.between(virel.starttime,virel.endtime)
    SET virel.visittimeinseconds = virel.visittime.seconds
} IN TRANSACTIONS OF 25000 ROWS;
```

The result was pretty quick: not even 75 seconds!

---

## Query on VISITS relationships

Just for completeness, I will revisit the main query that we explored in the previous blogpost here as well.
This is what that query looks like:

```cypher
MATCH (p:Person)-[v:VISITS]->(pl:Place)
// Visits that started between 20 days 17 hours and 20 days 10 hours ago
WHERE v.starttime > datetime()-duration("P20DT17H")
AND v.starttime < datetime()-duration("P20DT10H")
RETURN p.name, sum(v.visittime) AS totalvisittime, sum(v.visittimeinseconds) AS totalvisittimeinseconds
ORDER BY totalvisittime DESC
LIMIT 10;
```

---

## Conclusion

The new transaction batching functionality makes for a great addition to our toolbox - and clearly performs quite well. I am already looking forward to using it in other use cases!

Cheers

Rik Van Bruggen

- [Twitter](https://twitter.com/rvanbruggen)
- [Blog](http://blog.bruggen.com/)
- [LinkedIn](https://www.linkedin.com/in/rikvanbruggen/)