@rvanbruggen
Created December 6, 2021 20:12
Revisions

  1. Rik Van Bruggen created this gist Dec 6, 2021.
    129 changes: 129 additions & 0 deletions transactionbatching.mdx
    # Revisiting contact tracing with Neo4j 4.4's transaction batching capabilities

    <img src="https://icon2.cleanpng.com/20180419/vsw/kisspng-sinterklaas-netherlands-zwarte-piet-surprise-dutch-st-vector-5ad81a547e09d4.5770211115241119565163.jpg" align="right" width="150"></img>

    Yes! It's been a few months, but Saint Nicholas just brought us a brand new and shiny release of [Neo4j 4.4](https://neo4j.com/blog/neo4j-4-4-the-fastest-path-to-graph-database-productivity-generally-available/) to play with. One of the key features is a _generic transaction batching_ capability, similar to what we have been using in `apoc.periodic.iterate`, but now built right into the core of the database. It is referred to as the [`CALL {} IN TRANSACTIONS`](https://neo4j.com/docs/cypher-manual/current/introduction/transactions/) capability - and of course it is a really interesting feature.

    So in this article I will be revisiting [this blogpost](http://blog.bruggen.com/2021/06/revisiting-covid-19-contact-tracing.html), but without the need for [APOC's `apoc.periodic.iterate` feature](https://neo4j.com/labs/apoc/4.2/overview/apoc.periodic/apoc.periodic.iterate/). Let's see how that goes.

    ---

    ## Create a synthetic contact tracing graph - size of Antwerp

    The first step of course is going to be similar to, if not exactly the same as, the work I did in 2020 on contact tracing. Take a look at [that blogpost](http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html) to see how that went. The key thing to recall there is that I was using the fantastic `faker` plugin. You can download it yourself from the [github page](https://github.com/neo4j-contrib/neo4j-faker). Installation is super easy: you just need to make sure the config is updated too, and that you have whitelisted `fkr.*` just like you do with `gds.*` and `apoc.*`.

    As with the previous post, I will be pushing the scale up to the size of my home city of [Antwerp](https://www.antwerpen.be), Belgium. And critically, we will not even use APOC this time: we will use the built-in transaction batching instead.

    ---

    ## Create 500000 `(Person)` nodes
    Previously we did this in one transaction - which is probably at the limits of what I would normally do. But since we now have this _transaction batching_ mechanism in Cypher, let's use it:

    ```cypher
    :auto UNWIND range(1,500000) AS id
    CALL {
      WITH id
      CREATE (p:Person {id: id})
      // merge the generated person properties into the node
      SET p += fkr.person('1950-01-01','2021-12-01')
      SET p.healthstatus = fkr.stringElement("Sick,Healthy")
      // confirmation time: a random moment up to roughly 100 days and 10 hours ago
      SET p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
      SET p.birthDate = datetime(p.birthDate)
      // address: the reference coordinates plus a random offset of up to 0.01 on each axis
      SET p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
      SET p.name = p.fullName
      REMOVE p.fullName
    } IN TRANSACTIONS OF 25000 ROWS;
    ```
    This returns a little more slowly than a single-shot transaction would, but that is to be expected: the 500000 rows are committed in 20 batches of 25000 rather than all at once. Here's the result:

    ![](https://drive.google.com/uc?id=11ad66omLXh0yLNTMH2tLLEhx98RjIMhA)
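
    If you want to double-check that all the batches were committed, a quick count should do. This is just a minimal sketch, and it assumes that the database did not contain any `Person` nodes before the import:

    ```cypher
    // count the Person nodes and look at the range of ids that were created
    MATCH (p:Person)
    RETURN count(p) AS persons, min(p.id) AS lowestId, max(p.id) AS highestId;
    ```

    Under that assumption, `persons` should match the upper bound of the `range()` that we unwound in the statement above.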

    Next, we will create the `(Place)` nodes.

    ---

    ## Create 10000 `(Place)` nodes
    Adding the places is instantaneous, even with two batches of 5000:
    ```cypher
    :auto UNWIND range(1,10000) AS id
    CALL {
      WITH id
      CREATE (p:Place {id: id, name: "Place nr "+id})
      SET p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
      SET p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    } IN TRANSACTIONS OF 5000 ROWS;
    ```
    The result looks like this:
    ![](https://drive.google.com/uc?id=11cmuG7dUv9fVDfYr60JExpXAUmPCM0A9)
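
    A small side note on the batch size: as far as I understand the documentation, the `OF ... ROWS` part is optional, and if you leave it out Neo4j falls back to a default batch size of 1000 rows. Here's a minimal sketch of that, using a throwaway `Demo` label that I made up purely for this illustration (it is not part of the contact tracing model), followed by a batched cleanup:

    ```cypher
    // no explicit batch size: the default (1000 rows, per the docs) applies
    :auto UNWIND range(1,10000) AS id
    CALL {
      WITH id
      CREATE (:Demo {id: id})
    } IN TRANSACTIONS;

    // remove the throwaway nodes again, also in batches
    :auto MATCH (d:Demo)
    CALL {
      WITH d
      DETACH DELETE d
    } IN TRANSACTIONS OF 1000 ROWS;
    ```

    The same pattern works for deletes as for creates, which is handy when you need to clean up a large graph without one giant transaction.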

    ---

    ## Put in place some indexes on the NODES and future RELATIONSHIPS
    We don't strictly need all of them for this demo, although the `id` indexes should help with the node lookups when we create the relationships below, and the others could be useful for other queries. Note that we are using the relationship-centric model here - as we proved in the last blogpost that this is at least as capable as, and much simpler than, the reified model that used `(Visit)` nodes.

    So here we add the node indexes:
    ```cypher
    CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
    CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
    CREATE INDEX placenodename FOR (p:Place) ON (p.name);
    CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
    CREATE INDEX personnodename FOR (p:Person) ON (p.name);
    CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
    CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
    ```
    And we also add the index on the `starttime` property of the `-[:VISITS]->` relationship:
    ```cypher
    CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
    ```

    ![](https://drive.google.com/uc?id=11knLKRCXebDwfjoKuEx42gy0RaspX7hq)
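
    Indexes are populated in the background, so before loading more data it can be worth checking that they are all online. A minimal sketch of how I would check that, assuming the column names that `SHOW INDEXES` exposes in Neo4j 4.4:

    ```cypher
    // list the indexes with their state; they should all report ONLINE once populated
    SHOW INDEXES YIELD name, state, populationPercent, entityType, labelsOrTypes, properties;
    ```

    The `entityType` column shows which indexes are on nodes and which one is on the `VISITS` relationship.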

    Now we can add the 1.5M relationships - the real test of the new transaction batching functionality.

    ## Add 1500000 random visits to places
    It's pretty straightforward and similar to the previous examples, so let's just dive in:

    ```cypher
    :auto UNWIND range(1,1500000) AS iteration
    CALL {
      WITH iteration
      // pick a random person (id 1..500000) and a random place (id 1..10000)
      MATCH (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id: toInteger(rand()*10000)+1})
      CREATE (p)-[virel:VISITS]->(pl)
      // the visit started at a random moment up to roughly 100 days and 10 hours ago
      SET virel.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
      // and it lasted a random 0 to 10 hours plus 0 to 60 minutes
      SET virel.endtime = virel.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
      SET virel.visittime = duration.between(virel.starttime,virel.endtime)
      SET virel.visittimeinseconds = virel.visittime.seconds
    } IN TRANSACTIONS OF 25000 ROWS;
    ```
    The result came back pretty quickly: not even 75 seconds!

    ![](https://drive.google.com/uc?id=11lSa8gocymgvI_l5_pFy0c9U-SAZVJo8)
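
    To sanity-check the load, we can count the relationships and look at the spread of the generated timestamps. This is only a sketch: the count will equal the number of iterations only if every random `id` lookup found both a `Person` and a `Place`, which I am assuming here because the random ids fall inside the ranges we created earlier.

    ```cypher
    // count the VISITS relationships and show the earliest and latest start times
    MATCH ()-[v:VISITS]->()
    RETURN count(v) AS visits, min(v.starttime) AS earliestStart, max(v.starttime) AS latestStart;
    ```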

    ---

    ## Query on VISITS relationships
    Just for completeness, I will revisit the main query that we explored in the previous blogpost here as well. This is what that query looks like:

    ```cypher
    // visits that started between 20 days 17 hours and 20 days 10 hours ago
    MATCH (p:Person)-[v:VISITS]->(pl:Place)
    WHERE v.starttime > datetime()-duration("P20DT17H")
    AND v.starttime < datetime()-duration("P20DT10H")
    RETURN p.name, sum(v.visittime) AS totalvisittime, sum(v.visittimeinseconds) AS totalvisittimeinseconds
    ORDER BY totalvisittime DESC
    LIMIT 10;
    ```

    ![](https://drive.google.com/uc?id=11mmOgfGkZYok-1NSPMfgK8Zcb03kWRcg)
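
    If you are curious whether the planner actually picks up the `visitrelstarttime` relationship index for this 7-hour window, you can prefix the query with `PROFILE` and inspect the plan. This is a sketch of that check, slightly simplified to sort on the seconds only; whether an index seek shows up will depend on your Neo4j version and on the planner's choices, so I am not promising any particular plan here.

    ```cypher
    // same time window as the query above, but profiled so that the execution plan is shown
    PROFILE
    MATCH (p:Person)-[v:VISITS]->(pl:Place)
    WHERE v.starttime > datetime()-duration("P20DT17H")
    AND v.starttime < datetime()-duration("P20DT10H")
    RETURN p.name, sum(v.visittimeinseconds) AS totalvisittimeinseconds
    ORDER BY totalvisittimeinseconds DESC
    LIMIT 10;
    ```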

    ---

    ## Conclusion:
    The new transaction batching functionality makes for a great addition to our toolbox - and clearly performs quite well. I am already looking forward to using it in other use cases!

    Cheers

    Rik Van Bruggen
    - [Twitter](https://twitter.com/rvanbruggen)
    - [Blog](http://blog.bruggen.com/)
    - [LinkedIn](https://www.linkedin.com/in/rikvanbruggen/)