Graph Databases – The graph database market is thriving, and demand for connected data analysis is rising rapidly. IT users, however, wonder which graph database is the most powerful and which feature set suits them best.
This brief overview examines common graph databases that can now be run in the cloud: Neo4j, Amazon Neptune and Apache TinkerPop. TigerGraph is a newcomer to the selection.
Apache TinkerPop is an open-source project. The first version of the graph traversal engine was released in 2011, and in 2015 the project entered the Apache Incubator. Because it integrates easily, offers flexible storage options and comes with a permissive licence, TinkerPop has become the preferred choice for NoSQL vendors looking to add a graph interface to their products.
According to the independent publication DB-Engines.com, Neo4j already accounts for about half of the entire graph market. Products based on TinkerPop account for about 40 per cent of the total market. The remaining 10 per cent is distributed among more than 20 different providers, including well-known ones like Apache Spark, young ones like Amazon Neptune and proprietary ones like SAP HANA.
Neo4j has a native graph platform. “Native means that we developed the graph database from scratch for the graph data model,” the vendor explained at its customer event in Berlin, describing the difference from non-native graph databases. “There, the graph data model is implemented via an adapter layer, placed on the underlying data model, whether relational, JSON-based or key-value-based.”
In contrast, the Neo4j database comes from a single source: the manufacturer owns the entire technology stack and can optimize all components for a specific purpose, including particular workloads. It also controls the indexes, which it or the customer can tune for particular purposes. The Neo4j platform is available in the public cloud under the name “Aura”.
Cypher is now a widely used query language for graphs, for example on Apache Spark. “Cypher is intended to replace Spark’s graph tool over time, with the Cypher implementation for Spark being called CAPS (Cypher for Apache Spark).”
Amazon Neptune is a fast, resilient, fully managed graph database service for building applications that work with highly connected datasets. The core is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. “The database is optimized for OLTP requests, that is, for many parallel requests with very low latency.”
Amazon Neptune supports two popular graph data models, the property graph and the W3C’s RDF, together with their associated query languages, Apache TinkerPop Gremlin and SPARQL. Users can work with them like open APIs to build queries that efficiently navigate highly linked datasets.
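The difference between the two data models can be illustrated with a minimal sketch. It is not Neptune code: the vertex ids, property names and the `to_triples` helper are assumptions made up for illustration, and real RDF would use IRIs rather than short strings. The two query strings show roughly how the same question ("whom does Alice know?") might look in Gremlin and in SPARQL; they are not executed here.

```python
# Property graph: vertices and edges carry labels and key/value properties.
vertices = {
    "p1": {"label": "Person", "name": "Alice"},
    "p2": {"label": "Person", "name": "Bob"},
}
edges = [{"from": "p1", "to": "p2", "label": "knows"}]


def to_triples(vertices, edges):
    """Flatten the property graph into (subject, predicate, object) triples."""
    triples = []
    for vertex_id, properties in vertices.items():
        for key, value in properties.items():
            triples.append((vertex_id, key, value))
    for edge in edges:
        triples.append((edge["from"], edge["label"], edge["to"]))
    return triples


triples = to_triples(vertices, edges)

# Illustrative query strings only (not sent to any server).
gremlin_query = "g.V().has('name', 'Alice').out('knows').values('name')"
sparql_query = (
    'SELECT ?name WHERE { ?a :name "Alice" . ?a :knows ?b . ?b :name ?name }'
)
```

In the triple form every statement has the same shape, whereas the property graph attaches properties directly to vertices and edges.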
Amazon Neptune is optimized to work with graph data in memory, but the database is not limited to the size of main memory. An Amazon Neptune database can grow to a maximum of 64 terabytes, regardless of how much main memory is available. Users should make sure that their frequently read data, the working set, fits in memory, because Neptune optimizes read access and read queries for it. Users therefore have to scale up or down as needed. “We differentiate between writing and reading access.” The working sets can be sized differently depending on the request load.
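Why a working set that fits in memory matters can be shown with a small sketch. This is a generic least-recently-used cache, not Neptune's actual buffer management; the class name, the capacity and the `load_from_storage` callback are hypothetical.

```python
from collections import OrderedDict


class WorkingSetCache:
    """Tiny LRU cache standing in for an in-memory working set (illustrative)."""

    def __init__(self, capacity, load_from_storage):
        self.capacity = capacity
        self.load_from_storage = load_from_storage  # hypothetical slow path
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.pages:
            self.pages.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.pages[key]
        self.misses += 1
        value = self.load_from_storage(key)
        self.pages[key] = value
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used entry
        return value


storage = {f"vertex-{i}": {"id": i} for i in range(100)}
cache = WorkingSetCache(capacity=3, load_from_storage=storage.__getitem__)

# A hot set of three vertices fits in memory, so only the first pass misses.
for _ in range(2):
    for key in ("vertex-1", "vertex-2", "vertex-3"):
        cache.read(key)
```

Once the hot data exceeds the capacity, entries start to be evicted and reads fall back to storage, which is the situation that scaling up to an instance with more memory is meant to avoid.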
“Customers want reliability and data security, performance and reasonable costs per workload. They also want high availability. Enterprise customers want support and compliance with SLAs.”
Scalability And Performance
At the same time, customers demand scalability from the graph database, for example when a larger data model is to be implemented, such as at Facebook. “There are then also many connections between the entities and concepts.”
There are different ways to ensure sufficient scalability. “You could create a fairly long table in the relational model, for example, for customers and their contacts.” But with a dense network of relationships, users quickly reach the limits of a relational model, because every additional level of relationship requires another join. In such cases, a graph database makes sense.
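The contrast between the long relational table and a graph structure can be sketched as follows. The table name, the sample contacts and the variable names are invented for illustration. SQLite stands in for the relational model, and a plain dictionary stands in for a graph store.

```python
import sqlite3

contacts = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]

# Relational form: one row per relationship; each extra hop needs another self-join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contact (person TEXT, knows TEXT)")
db.executemany("INSERT INTO contact VALUES (?, ?)", contacts)
two_hops_sql = sorted(
    row[0]
    for row in db.execute(
        "SELECT DISTINCT c2.knows FROM contact c1 "
        "JOIN contact c2 ON c1.knows = c2.person WHERE c1.person = ?",
        ("alice",),
    )
)

# Graph form: each vertex keeps its neighbors, so a hop is a direct lookup.
adjacency = {}
for person, knows in contacts:
    adjacency.setdefault(person, []).append(knows)

two_hops_graph = sorted(
    {
        friend_of_friend
        for friend in adjacency.get("alice", [])
        for friend_of_friend in adjacency.get(friend, [])
    }
)
```

Both variants return the same contacts of contacts. The difference shows up as the depth grows: the SQL query gains one join per level, while the graph traversal simply follows one more set of neighbor links.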
“The high availability of Neptune is primarily guaranteed with read replicas.” These can be placed in different Availability Zones (AZs) and automatically synchronize their data from the Amazon Neptune cluster volume. “Read access for queries is done via these replicas, which is optimal for OLTP, so the latency is in the tens of milliseconds.”
There is one primary instance and up to 15 read replicas. Write access is limited to the primary instance of the entire database, but read access can be scaled significantly: “The user can distribute this access over a maximum of 15 replicas of the entire database.” The primary instance is also responsible for serializing the transactions for which Neptune was designed. Many working sets are held in main memory to increase throughput and reduce latency.
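The split between a single writing instance and several read replicas can be sketched as a small request router. This is an illustration only, not an AWS API: the `ClusterRouter` class and the endpoint names are hypothetical, and the round-robin strategy is an assumption rather than a description of how Neptune balances reads.

```python
from itertools import cycle

MAX_READ_REPLICAS = 15  # limit stated in the text above


class ClusterRouter:
    """Routes writes to the single writer and spreads reads over replicas."""

    def __init__(self, writer, replicas):
        if len(replicas) > MAX_READ_REPLICAS:
            raise ValueError("at most 15 read replicas are supported")
        self.writer = writer
        # Fall back to the writer if no replica exists.
        self._readers = cycle(replicas or [writer])

    def endpoint_for(self, is_write):
        return self.writer if is_write else next(self._readers)


# Hypothetical endpoint names, used only for this example.
router = ClusterRouter("writer.example", ["replica-1.example", "replica-2.example"])
targets = [router.endpoint_for(is_write=False) for _ in range(3)]
write_target = router.endpoint_for(is_write=True)
```

Reads rotate across the replicas while every write lands on the same instance, which is why read throughput scales with the number of replicas and write throughput does not.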
Data is distributed to the replicas via Amazon Neptune’s Virtual Storage Layer (VSL), the cluster volume. This logical layer is based on a storage cluster that Neptune manages. The VSL keeps a transaction log distributed across the Availability Zones (AZs), which adds to data protection alongside IAM and key management. This provides the high availability and durability of data that demanding customers expect.
Distributing replicas across multiple AZs maximizes availability in the event of a failover. The size of the replicas depends on the workload, but it can be defined statically in advance. The decisive factor is the workload, for example how many applications access the database.
Performance can be increased along with scalability. Instances “can be moved to larger EC2 instances to give them more performance: more resources, more simultaneous connections, and more data held in the cache.” Load balancing can be implemented across all replicas. “Amazon CloudWatch, which monitors the database, provides suitable metrics, such as CPU or main memory utilization.”
In March 2020, version 3.0 of the TigerGraph graph database was released. As TigerGraph Cloud, it is also available as a graph database-as-a-service. The strengths of this version, which has a visual query builder, are linear scalability and, above all, analysis. With “analysis at the click of a mouse”, the user should be able to draw relevant conclusions from complex data relationships simply by moving the nodes and edges in a diagram and specifying the analysis levels.
TigerGraph is particularly proud of its ability to examine and display up to ten levels of relationships. This requires a high degree of scalability in a suitably equipped cloud instance. In a demonstration, TigerGraph ran on AWS. For this scalability to grow linearly, appropriate cluster management and massively parallel processing are required, two capabilities that TigerGraph emphasizes.
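What examining "up to ten levels of relationships" means can be sketched as a depth-limited breadth-first traversal. This is a generic, single-machine illustration in Python, not TigerGraph's query language or its parallel engine; the function name and the toy graph are assumptions.

```python
from collections import deque


def neighborhood(adjacency, start, max_hops):
    """Return {vertex: hop distance} for all vertices within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return distance


# A simple chain v0 -> v1 -> ... -> v12 as toy data.
chain = {f"v{i}": [f"v{i + 1}"] for i in range(12)}
reachable = neighborhood(chain, "v0", max_hops=10)
```

In a real, densely connected graph the number of vertices reached tends to grow very quickly with each additional hop, which is why deep traversals depend on the cluster management and parallel processing mentioned above.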
The data is stored and secured in the cluster. Custom indexing is intended to let users improve database performance for specific questions, analogous to the working set in Neptune. “Similar to the index at the back of a textbook, a user-defined or secondary index in a database contains references that allow the user to directly access the data they need at the moment.”
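The textbook-index analogy can be made concrete with a minimal sketch. It does not use TigerGraph's indexing syntax; the sample vertices, the `build_index` helper and the attribute names are invented for illustration.

```python
from collections import defaultdict

vertices = {
    1: {"name": "Ada", "city": "Berlin"},
    2: {"name": "Ben", "city": "Hamburg"},
    3: {"name": "Cleo", "city": "Berlin"},
}


def build_index(vertices, attribute):
    """Map each attribute value to the ids of the vertices that carry it."""
    index = defaultdict(set)
    for vertex_id, properties in vertices.items():
        if attribute in properties:
            index[properties[attribute]].add(vertex_id)
    return index


city_index = build_index(vertices, "city")

# Without the index: scan every vertex. With it: one dictionary lookup.
scan_result = {vid for vid, props in vertices.items() if props.get("city") == "Berlin"}
index_result = city_index["Berlin"]
```

Both lookups return the same vertex ids, but the scan touches every vertex while the index jumps straight to the matching entries, at the cost of extra storage and of keeping the index up to date on writes.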