Graph database use cases (10 examples)

You are using graph databases every day.

How is it possible that LinkedIn can show all your 1st-, 2nd-, and 3rd-degree connections, and the mutual contacts you share with your 2nd-degree connections, in real time? The answer: LinkedIn organizes its entire contact network of 660+ million users as a graph!

Did you know that Google’s original search ranking is also based on a graph algorithm called “PageRank”?

Why are the recommendations on always so spot-on? Well, they use a graph database — and, by the way, so do many other e-commerce giants such as 

Instagram, Twitter, Facebook, Amazon, and practically every application that must rapidly query information scattered across an exponentially growing and highly dynamic network of data already take advantage of Graph Databases.

Why are companies moving from Relational databases to Graph technology? Read our overview of the 10 most prominent Graph Database use-cases to see the advantages!

If you want to find out how to deploy Graph Database in your case, don’t hesitate to contact us!

What’s a Graph Database?

A Graph Database presents data as entities, or nodes. Nodes can have properties that hold further information. Nodes are connected to other nodes with edges, and each edge between two nodes can itself carry properties.

Here is a very simple Graph Database example: 

Node A: John, Node B: ACME Inc., Node C: Austin, Edge 1: works_in, Edge 2: lives_in. This database tells you that John works in ACME Inc and he lives in Austin. 
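As a minimal sketch, the example above can be written down in Python as a list of (source, edge, target) triples. The names `nodes`, `edges`, and `neighbors` are illustrative only and do not come from any particular graph database product.

```python
# Nodes A, B and C from the example, keyed by their identifier.
nodes = {"A": "John", "B": "ACME Inc.", "C": "Austin"}

# Each edge is a (source, label, target) triple.
edges = [
    ("A", "works_in", "B"),
    ("A", "lives_in", "C"),
]


def neighbors(node_id, label):
    """Return the names of nodes reachable from node_id over edges with this label."""
    return [nodes[dst] for src, lbl, dst in edges if src == node_id and lbl == label]


print(neighbors("A", "works_in"))  # ['ACME Inc.']
print(neighbors("A", "lives_in"))  # ['Austin']
```

A real graph database adds indexing, persistence, and a query language on top, but the underlying idea of following labeled edges from node to node is the same.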


If you draw this database into a picture to illustrate the relationship between nodes A, B and C, you will end up with the above graph structure. That’s why it is called the Graph Database. 

The Graph Database might not be the best option for each and every application. However, it certainly is a strong alternative in increasingly many database use-cases. Here’s a list of the ten most prominent use-cases for Graph Databases.

The Ten Most Common Graph Database Use-cases You Should Know

There is a good reason why the world’s leading businesses are increasingly using Graph databases. This modern technology offers unprecedented agility, scalability, and performance for managing vast amounts of highly dynamic and exponentially growing data for various use-cases, which is precisely what today’s applications require.

And as organizations continue implementing Graph technology, the Graph database is being adopted for ever more use-cases and applications.

The Graph database reveals the complex and hidden relationships between separate data sets and lets you analyze them, so you can improve your business processes and make smarter business decisions faster.

Words are just words until put to practice. 

All these use-cases have been successfully implemented in a real business environment — Profium has deployed most of them.

Graph Database for Recommendation Engines in E-commerce

Recommendation engines in E-commerce are a perfect use-case for Graph database. The benefits are obvious – with the Graph technology you can provide your customers with accurate recommendations and maximize your online sales and customer satisfaction.  

In e-commerce, Graph-based recommendation engines are used in web shops, various types of comparison portals, and for example, in hotel and flight booking services.

How to use Graph Database in E-commerce?

Graph databases map networked objects and provide relationships between different objects. The objects are referred to as nodes, and the connections between them are edges. Typical examples of nodes in an e-commerce application include customers, products, searches, purchases, and reviews. Each node represents some piece of information in the Graph, whereas each edge represents a contextual connection between two nodes.

This enables you to retrieve relevant information about your customers, the channels they use, the searches they make, and, for example, their purchase history. Based on this data, you can provide accurately personalized recommendations that draw on both the customer’s own data and that of other similar users.

Both nodes and edges can be assigned any number of properties, and these properties can be queried as well, e.g. the price, rating, and genre of an article, or how long a product has been on a “watch list”.

Example Graph for E-commerce

Here’s an example of how you could apply Graph in an e-commerce business selling skateboards:

“Customer” and “skateboard” are represented as nodes that are linked together by edges (e.g. “searched”, “bought”, “reviewed”).

The edge “reviewed” can be given the attribute “1 star”, “2 stars” or “3 stars”.

If you want to use this information for referrals, you can follow a customer’s connections to find other customers who have made skateboard related searches, or likes, and use this data to provide referrals.
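To make the idea of following a customer’s connections concrete, here is a small Python sketch. It assumes hypothetical “bought” edges only (searches and likes would work the same way), and the customer and product names are invented for illustration.

```python
from collections import Counter

# Hypothetical "bought" edges: customer -> set of skateboards.
bought = {
    "alice": {"ramp_board", "cruiser"},
    "bob": {"ramp_board", "longboard"},
    "carol": {"cruiser", "longboard", "mini_board"},
}


def recommend(customer):
    """Rank products bought by customers who share at least one purchase."""
    own = bought[customer]
    scores = Counter()
    for other, items in bought.items():
        if other == customer or not (own & items):
            continue  # skip the customer themselves and unrelated customers
        for item in items - own:
            scores[item] += 1  # one vote per similar customer
    return [item for item, _ in scores.most_common()]


print(recommend("alice"))  # ['longboard', 'mini_board']
```

The sketch walks two hops: from the customer to their products, then from those products to other customers and on to what they bought. This is the kind of traversal a graph database is built to run quickly.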

In addition to representing known facts as nodes and edges in the graph, additional information can be inferred based on these facts.

For example, if the Graph contains the information that certain skateboards are meant for ramps whereas another skateboard is meant for commuting, you can infer that customers who buy the ramp skateboards use them at ramps, and add nodes and edges to the graph to represent this.

Inferred data enriches the graph making it easier to make connections between related things and easier to query the data by removing levels of indirection. Customizable rules define how inferred data is dynamically generated and added to, or removed from, the graph as it changes.
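A rule of this kind can be sketched in a few lines of Python. The triples, the edge name `uses_boards_for`, and the function `infer_usage` are assumptions made for this illustration; a real rule engine would also retract inferred edges when the underlying facts change.

```python
# Known facts as (subject, edge, object) triples. All names are hypothetical.
facts = {
    ("ramp_board", "meant_for", "ramps"),
    ("cruiser", "meant_for", "commuting"),
    ("alice", "bought", "ramp_board"),
    ("bob", "bought", "cruiser"),
}


def infer_usage(triples):
    """Rule: customer bought X, and X meant_for Y  =>  customer uses_boards_for Y."""
    purpose = {s: o for s, p, o in triples if p == "meant_for"}
    return {
        (s, "uses_boards_for", purpose[o])
        for s, p, o in triples
        if p == "bought" and o in purpose
    }


# The enriched graph holds the original facts plus the inferred edges.
enriched = facts | infer_usage(facts)
print(sorted(infer_usage(facts)))
# [('alice', 'uses_boards_for', 'ramps'), ('bob', 'uses_boards_for', 'commuting')]
```

With the inferred edge in place, a query for “customers who skate ramps” needs one hop instead of two.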

Benefits of Graph Database in E-commerce Recommendation Engines

Graph databases are just perfect for e-commerce applications and recommendation engines. Depending on the case, they can perform much faster than alternative systems.

You can link information from browser sessions, search queries, click histories, and social channels to user profiles to build up a rich and complete picture of your customers. The more clicks, searches, purchases, and other events you accumulate, the richer the customer profiles become. New customers, pre-marked articles, comments, and assessments are added and immediately considered in the next recommendation, so the data record remains current.

Master Data Management (MDM)

Master Data Management enables you to link all your company’s critical data to one location – a.k.a. the master file – to provide a single point of reference to all data. 

If your MDM is appropriately implemented, it streamlines data sharing among your personnel and departments and aggregates data located in silos, i.e., in multiple separate systems, platforms, and applications. With an excellent MDM system, the employees and applications across your organization get consistent and accurate data always.

Graph or Relational Database in MDM?

The Master Data Management system is constantly performing several functions: collecting, aggregating, matching, consolidating, and distributing data, and ensuring quality and persistence throughout your organization. 

However, because master data consists of a series of connections, managing your MDM on a relational database becomes complex and slow. Besides, your master data often integrates with cross-enterprise applications, which makes real-time querying a burdensome process.

Luckily, there’s a better alternative for building an efficient MDM: Graph databases are optimized for handling contextual relationships between multiple data objects. So, Graph technology offers you a much faster and more effective way to organize the master data.

Compliance with GDPR, HIPAA and Other Regulations 

Companies are struggling to comply with privacy regulations such as the GDPR (General Data Protection Regulation). After just one year of GDPR being in effect, 90,000 data breaches had already been reported, 500 investigations were ongoing, and several companies had been punished with fines, the highest of which was up to € 50 million.

And, more and more international regulations are enforced, which puts a strain on companies – especially those organizations that store sensitive customer data. 

Besides GDPR in the EU, the California Consumer Privacy Act, which is based on GDPR, is set to go into effect at the beginning of 2020. Privacy standards in Japan, Brazil, Argentina, and many other countries have been aligned with GDPR. The American HIPAA (Health Insurance Portability and Accountability Act) regulates the flow of information in healthcare and insurance.

If so many organizations fail to comply with GDPR, could the outdated database technologies be the root-cause?

Relational Databases do Not Scale for GDPR 

Relational Databases are great for managing relatively static and structured data, with uniform connections between different data entities. But, that’s not what you will be up against with GDPR!

Personal data is spread across several applications on your own servers, data centers, and external cloud services. According to GDPR, you must be able to track the movement of the data that is in your possession: where you acquired it, whether consent was obtained, how it moves over time, where it is located, and how it is used.

Therefore, the connections between different data entities are crucial for tracking the complex path that personal data follows across your domain. Additionally, you must be able to access, report, and remove all this data if required by consumers or authorities.
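The tracking problem described above is essentially a graph traversal. The following Python sketch assumes a hypothetical set of systems and data flows (none of these names come from a real deployment) and lists every system that personal data can reach from a given entry point, which is what you need to know to answer an access or erasure request.

```python
from collections import deque

# Hypothetical data flows: each system passes personal data on to the listed systems.
flows = {
    "signup_form": ["crm"],
    "crm": ["billing", "newsletter_tool"],
    "billing": ["data_warehouse"],
    "newsletter_tool": [],
    "data_warehouse": [],
}


def downstream(start):
    """Breadth-first traversal: every system the data can reach from start."""
    reached, visited, queue = [], {start}, deque([start])
    while queue:
        current = queue.popleft()
        for nxt in flows.get(current, []):
            if nxt not in visited:
                visited.add(nxt)
                reached.append(nxt)
                queue.append(nxt)
    return reached


print(downstream("signup_form"))
# ['crm', 'billing', 'newsletter_tool', 'data_warehouse']
```

In a relational schema, each extra hop in this chain would typically be another JOIN; in a graph the traversal simply follows edges until there are none left.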

If you try to track GDPR compliance with a relational database, you will end up with a massive constellation of JOIN tables, thousands of lines of SQL code, and complex queries. Maintenance becomes a headache as you add more systems and data relationships, and executing the queries will drain your computing resources as the system grows.

The Graph is the Best Database for Regulatory Compliance Systems 

Regulatory Compliance Systems are one of the most deployed use-cases for Graph Databases. 

The Graph Database is optimized for connected data applications such as GDPR, where data relationships are crucial. The Graph tracks and stores contextual connections between vast amounts of heterogeneous data points, and this model, in fact, perfectly resembles the regulatory systems such as GDPR and HIPAA.

Graph systems enable single queries that can offer a visual representation of the results. In this way, they help organizations maintain compliance by tracing data throughout enterprise systems in a more organized manner than a relational database.

Symbolic AI (Symbolic Reasoning)

According to the analyst firm Gartner, Business Intelligence and Analytics will be based on Artificial Intelligence and Machine Learning in the future.

In Machine Learning, the algorithm learns rules based on system inputs and outputs. Symbolic Reasoning, in contrast, requires human intervention: to build a Symbolic Reasoning system, humans have to learn the rules first, and then enter those rules and relationships into a static program.

So, why is Symbolic Reasoning a use-case for Graph Databases?

Because, to create new rules, you must understand the relationships between different entities, and that isn’t very easy for humans if a visual representation of the data is not available.

The Graph Database solves this problem. It shows the data entities and how they connect and relate to each other. This visual representation allows humans to understand the data intuitively, which then makes it a lot easier to create meaningful new rules.

Graph Database Example:

The following image provides a snapshot view from a Graph Database. Just by looking at it for a few seconds, you can understand that Limerick is a city and also a county in Ireland. Additionally, you can see that Limerick is related to eight entities (nodes) in the database, and five data items define what kind of city Limerick is.


On a Graph Database, you can intuitively understand all this in a few seconds; in a Relational Database, it would take several minutes.

Digital Asset Management (DAM)

An overabundance of digital content is one of the biggest problems for most enterprises today. They are managing unprecedented amounts of documents, images, product descriptions, video material, audio files, and everything in between. 

DAM systems store, organize and share all these digital assets in a central location in your company. DAM helps your teams accomplish their goals and quickly find the right files when needed. DAM unleashes the full potential of your organization – but, only if the database behind it scales up with the rapidly growing data volume, ever-diversifying content types, and delivers your employees the right files quickly. 

The Graph Database provides just this: a simple, scalable, and cost-efficient database to track how your company’s digital assets, such as documents, contracts, and reports, relate to the employees: who created the files and when, who is allowed to access which files, and so on. With the Graph Database model, Digital Asset Management becomes intuitive.
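As a rough sketch of this model, the Python snippet below represents three kinds of edges (created_by, member_of, can_read) as dictionaries and answers the question “who may read this file?”. All employee, group, and file names are hypothetical.

```python
# Hypothetical edges in a small digital asset graph.
created_by = {"contract.pdf": "dana", "report.docx": "eli"}
member_of = {"dana": {"legal"}, "eli": {"finance"}, "fay": {"legal", "finance"}}
can_read = {"legal": {"contract.pdf"}, "finance": {"report.docx"}}


def readable_files(employee):
    """Follow employee -> group -> file edges to collect accessible files."""
    files = set()
    for group in member_of.get(employee, set()):
        files |= can_read.get(group, set())
    return files


def readers(file_name):
    """Return, in alphabetical order, every employee who may read the file."""
    return sorted(e for e in member_of if file_name in readable_files(e))


print(created_by["contract.pdf"])  # dana
print(readers("contract.pdf"))     # ['dana', 'fay']
```

Access is derived from the path between an employee and a file, so changing a group membership immediately changes what that employee can reach.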

Netflix uses Graph Database for its Digital Asset Management because it is a perfect way to track which movies (assets) each viewer has already watched, and which movies they are allowed to watch (access management). Note that Identity and Access Management (IAM) also has an essential role in DAM.

Context-aware Services

Context-aware Services use information about the user’s Context – such as the location – to provide him or her with relevant services and information at the right moments. 

There are countless examples of context-aware services. These include delivering real-time traffic updates around the user’s current whereabouts, streaming live video from the route the user has planned, sending farmers timely and relevant pest and disease observations from nearby farms, or alerting about high pollen exposure along children’s routes to school.

The same Context-aware services are also used for contextual marketing, that is, delivering customers relevant information and offers based on their location or profiled interests, rather than spamming them with ads randomly.

Context can refer to real-world characteristics such as temperature, time or location. This information can be updated by the user manually, or by other mobile devices, applications or sensors.

Why use Graph Database for Context-aware services? 

The basic idea in Context-aware Services is to look for past contexts similar to the user’s current Context, and use that information to decide which services or information to deliver to the user.

Graph Database is a natural solution for implementing Context-aware Services. The Graph consists of nodes representing contexts and edges connecting the nodes. 

The Graph structure enables you to retrieve related contexts similar to the current Context much faster than a Relational Database would.
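One simple way to picture this, sketched below in Python, is to connect each context node to attribute nodes (location, time of day, weather) and to treat two contexts as similar when they share enough attribute nodes. The attribute names and the `minimum_shared` threshold are assumptions made for the illustration, not a prescribed similarity measure.

```python
# Hypothetical past contexts, each linked to a set of attribute nodes.
past_contexts = {
    "ctx1": {"location:downtown", "time:morning", "weather:rain"},
    "ctx2": {"location:downtown", "time:evening", "weather:clear"},
    "ctx3": {"location:suburb", "time:morning", "weather:rain"},
}


def similar_contexts(current, minimum_shared=2):
    """Return past contexts sharing at least minimum_shared attributes, best first."""
    scored = [(len(current & attrs), name) for name, attrs in past_contexts.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for shared, name in ranked if shared >= minimum_shared]


current = {"location:downtown", "time:morning", "weather:clear"}
print(similar_contexts(current))  # ['ctx1', 'ctx2']
```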

Fraud Detection

Hacking has been the most common cause of data breaches in recent years. Approximately 5,000 major incidents were discovered in 2018 alone – 39% of them were carried out through the Web. As a result of online fraud, billions of sensitive data records are exposed yearly, and the economic losses account for billions of dollars.

Online fraud is extremely difficult to combat – the techniques evolve rapidly, fraud rings change constantly, and they can grow quickly.


To prevent modern, advanced fraud rings, you must be able to detect when and where these rings of false accounts emerge – it can happen suddenly, anytime, and anywhere in the world.

The rings can grow to cover thousands of nodes quickly. Analyzing these records alone is not enough: you must be able to detect how they link to other data points such as credit card records, addresses, or transactions, and analyze these highly complex data relationships.

Why Choose Graph Database for Fraud Detection? 

To limit the damages of fraud, you must detect and prevent incidents as they happen, in real-time. 

Conventional fraud detection techniques based on relational databases are optimized to analyze discrete data records, but they do not scale up to analyze how the records relate to each other.   

Preventing advanced online fraud requires highly scalable, real-time link analysis across large interconnected data – and, that’s exactly why you should build your Fraud Prevention based on Graph databases!  

Applying Graph Database for Fraud Detection

The Graph structure allows you to look beyond discrete data points to the connections that link them. By understanding the connections between data and deriving meaning from these links, you can reframe the problem and draw better insights from the data.

Unlike most other ways of looking at data, graphs are designed to express relatedness. Graph databases uncover patterns that are difficult to detect using traditional representations such as relational databases.
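A simplified illustration of link analysis: the Python sketch below groups accounts that are connected, directly or indirectly, through shared credit cards or addresses. The accounts and identifiers are invented, and a connected-components search is only one of several techniques that could be used here.

```python
from collections import defaultdict

# Hypothetical accounts and the identifiers (cards, addresses) they use.
uses = {
    "acct1": {"card:1111", "addr:5 Main St"},
    "acct2": {"card:1111", "addr:9 Oak Ave"},
    "acct3": {"card:2222", "addr:9 Oak Ave"},
    "acct4": {"card:3333", "addr:1 Elm Rd"},
}


def find_rings(min_size=2):
    """Group accounts linked, directly or indirectly, by shared identifiers."""
    by_identifier = defaultdict(set)
    for account, identifiers in uses.items():
        for identifier in identifiers:
            by_identifier[identifier].add(account)

    rings, visited = [], set()
    for start in uses:
        if start in visited:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over account -> identifier -> account
            account = stack.pop()
            if account in component:
                continue
            component.add(account)
            for identifier in uses[account]:
                stack.extend(by_identifier[identifier] - component)
        visited |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


print(find_rings())  # [['acct1', 'acct2', 'acct3']]
```

Note that acct1 and acct3 share nothing directly; they only end up in the same ring through acct2. That indirect link is exactly what record-by-record analysis misses.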

An increasing number of companies use graph databases to solve a variety of connected data problems, including fraud detection. 

Semantic Search

Keyword-based search tools are a nightmare for enterprises! Why? Because the vast majority of organizational information is stored in an unstructured format, and keyword-based search tools do not comprehend unstructured data.

Bad search results frustrate employees and decrease working efficiency. And, in highly competitive markets, you can’t afford to miss out on leveraging the valuable data insights that drive your business growth! That’s why enterprises are turning to semantic search tools.

How does Semantic Search work?

Semantic search is search with meaning, as opposed to “normal” search where the search engine looks for literal matches of the queried words without understanding the overall meaning of the query. 

Semantic search takes into account the context of search, location and the intent of queries. It understands the searcher’s intent and the contextual meaning of terms in the Web, or on an enterprise data storage, and provides more relevant results. 

Natural language can be ambiguous, but semantic search exposes the meaning behind the words. Rather than using ranking algorithms to predict relevancy, semantic search uses meanings to produce highly relevant search results. 

It provides your organization with fast, relevant answers to complex questions based on user data and metadata, and other information about your business domain.  

Network management for Telecom, IT, Power grids & Sewers

If you are familiar with network management in telecommunications, power grids, or IT, you probably know how complex it can be. 

The complexity accumulates in networks over time: different business units are not aligned, companies grow through mergers and acquisitions, systems from different vendors do not communicate, and so on. Separate silos, layers, and domains are created, and each has its own relational database to store the network information.

If you want to aggregate all the siloed data into a central location to create a unified management view across the whole network, you must link multiple relational databases together, and by far the easiest way to do that is a Graph database. Before long, you will be able to visualize bottlenecks and other issues in your network.

Graph Databases for networks

Graph databases are a perfect fit for modeling, storing and querying network and IT operational data. Networks are essentially graphs linked together. 

As with master data, a graph database is used to bring together information from disparate management systems and data inventories, providing a single view of the network and the users – from the smallest network element all the way to the applications, services and the users. 

A graph representation of a network enables managers to catalog assets, visualize their deployment and identify the dependencies between the nodes.

Graphs help in ensuring end-to-end redundancy on a network: if a network element becomes unavailable, or is taken down for maintenance, you can see whether alternative routes are available and which services and customers are impacted.
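The redundancy check can be sketched as a reachability query that ignores the elements that are down. The topology below is hypothetical, and `reachable` is an illustrative helper rather than a feature of any specific product.

```python
from collections import deque

# Hypothetical network topology as an undirected adjacency map.
links = {
    "core": {"switch_a", "switch_b"},
    "switch_a": {"core", "access_1"},
    "switch_b": {"core", "access_1"},
    "access_1": {"switch_a", "switch_b", "customer_x"},
    "customer_x": {"access_1"},
}


def reachable(source, target, down=frozenset()):
    """Breadth-first search that skips every element listed in down."""
    down = set(down)
    if source in down or target in down:
        return False
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in links[node] - down - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


print(reachable("core", "customer_x", down={"switch_a"}))  # True: switch_b is an alternative route
print(reachable("core", "customer_x", down={"access_1"}))  # False: access_1 is a single point of failure
```

Running the same check for every element in turn gives a list of single points of failure and the customers each one would affect.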

Graph databases store configuration information to alert administrators in real-time about potential failures, and reduce the time needed for problem analysis and resolution.

Situational Awareness

Situational Awareness allows you to monitor environmental elements and events, such as the weather or traffic, in real time with respect to time or space, understand their meaning, and project their future status to make smarter business decisions.

For example, a logistics company can monitor the weather and traffic, plot the situation on a map, and proactively manage their fleet based on real-time data.

Situational Awareness analysis requires you to track a vast amount of data points describing the situation – temperature, humidity, probability of rain, and many other details, and their relation to the desired outcome to determine the best possible business decisions. 

The technology of situation awareness

Situation Awareness consists of advanced semantic technology-based tools for modeling even the most complex business domains. These advanced tools are optimized for modeling business domains, query-based analysis of business domain data and several sophisticated visualizations for better business context and situation awareness.

Sense Situation Awareness includes leading GIS features and technologies like geosemantic queries, GML, WFS, WFS-T, GPS positioning, and SIM tracking.

Janne Saarela


Founder and CEO of Profium. In addition to being one of the leading experts of Semantic Web, he has an instrumental role in Profium's research and development to identify and implement new network and content technologies to benefit Profium's customers.


Capabilities of the Neo4j Graph Database with Real-life Examples

  • Jul 05, 2018

Gleb B.

Ruby/JS Developer


A database is an integral part of any application.

Not only does a database store information, it also impacts the overall performance of software. So selecting a database suitable for your project is crucial. Lots of applications rely on a relational database such as MySQL or PostgreSQL. Despite the many advantages of relational databases, however, they aren’t efficient at coping with ever-growing amounts of connected data.

To handle a growing volume of connected data, you can go for Neo4j, a non-relational graph database that’s optimized for managing relationships. The Neo4j database can help you build high-performance and scalable applications that use large volumes of connected data.

Many software developers know little about the capabilities of graph databases and Neo4j in particular. In this article, we explain the essence of this graph database, show when you can use it, and give examples of how to implement Neo4j in your project.

Neo4j database: Concepts and principles

Before taking an in-depth look at how to implement the Neo4j database in a real project, you should clearly understand how this technology works, what business purposes you can use it for, and what differentiates Neo4j from other databases.

Graph databases are the best solution for handling connected data

If you’ve worked only with relational databases in your career as a developer, you might be asking whether there’s any point in going for a non-relational model. Everything seems clear and familiar in the relational databases you’re used to, doesn’t it? Yet relational databases have several substantial drawbacks:

  • Volume limitations − Relational data stores aren’t optimized to handle large amounts of data.
  • Velocity − The performance of relational stores suffers when they need to deal with huge numbers of read/write operations.
  • Lack of relationships − Relational data stores can’t describe relationships other than standard one-to-one, one-to-many, and many-to-many.
  • Variety − Relational databases lack flexibility when dealing with types of data that can’t be described using the database schema. They also aren’t efficient when it comes to handling big binary and semi-structured data (JSON and XML).
  • Scalability − Horizontal scaling is inefficient for relational data stores.

To overcome these limitations, a number of different non-relational databases have been created. Most of them lack relationships, however, because they often associate pieces of data with each other through references (just like foreign keys in the relational model). References make it difficult to query data (particularly, connected data) as they struggle to describe relationships between entities.

Unlike all other data storage and management technologies, graph databases are focused on relationships and store already connected data. That’s why graph databases prove the most efficient for handling large amounts of connected data.

Types of Databases

Neo4j as a graph database

Graph databases are based on graph theory from mathematics. Graphs are structures that contain vertices (which represent entities, such as people or things) and edges (which represent connections between vertices). Edges can have numerical values called weights.

This structure enables developers to model any scenario defined by relationships. For instance, a graph database allows you to model a social network where nodes are users and relationships are connections between them. Or you can build a road network where vertices are cities, towns, or villages, while edges are roads that connect them with weights indicating distances.
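To illustrate the road network example, here is a short Python sketch that stores towns as vertices and roads as weighted edges, then finds the shortest distance between two towns with Dijkstra’s algorithm. The towns and distances are made up, and this is plain Python rather than anything Neo4j-specific.

```python
import heapq

# Hypothetical road network: roads[town][neighbor] = distance in km.
roads = {
    "A": {"B": 7, "C": 3},
    "B": {"A": 7, "C": 2, "D": 4},
    "C": {"A": 3, "B": 2, "D": 10},
    "D": {"B": 4, "C": 10},
}


def shortest_distance(start, goal):
    """Dijkstra's algorithm over the weighted graph; returns None if unreachable."""
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        dist, town = heapq.heappop(heap)
        if town == goal:
            return dist
        if dist > best.get(town, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for nxt, length in roads[town].items():
            candidate = dist + length
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return None


print(shortest_distance("A", "D"))  # 9, via A -> C -> B -> D
```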

Neo4j provides its own implementation of graph theory concepts. Let’s take an in-depth look at the Labeled Property Graph Model in the Neo4j database. It has the following components:

  • Nodes (equivalent to vertices in graph theory). These are the main data elements that are interconnected through relationships. A node can have one or more labels (that describe its role) and properties (i.e. attributes).
  • Relationships (equivalent to edges in graph theory). A relationship connects two nodes that, in turn, can have multiple relationships. Relationships can have one or more properties.
  • Labels. These are used to group nodes, and each node can be assigned multiple labels. Labels are indexed to speed up finding nodes in a graph.
  • Properties. These are attributes of both nodes and relationships. Neo4j allows for storing data as key-value pairs, which means properties can have any value (string, number, or boolean).
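The four components above can be mirrored in a few Python classes. This is only a toy model to make the terms concrete: it is not Neo4j’s API or storage format, and the class names, the `KNOWS` relationship, and the property values are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)        # roles of the node
    properties: dict = field(default_factory=dict)  # key-value attributes


@dataclass
class Relationship:
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []
        self.label_index = defaultdict(set)  # label -> node ids, for fast lookup

    def add_node(self, node_id, labels=(), **properties):
        node = Node(node_id, set(labels), properties)
        self.nodes[node_id] = node
        for label in node.labels:
            self.label_index[label].add(node_id)
        return node

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def by_label(self, label):
        return [self.nodes[i] for i in sorted(self.label_index[label])]


graph = PropertyGraph()
graph.add_node(1, labels=["Person"], name="Alice")
graph.add_node(2, labels=["Person"], name="Bob", age=30)  # hypothetical properties
graph.relate(1, "KNOWS", 2, since=2015)                   # hypothetical relationship

print([n.properties["name"] for n in graph.by_label("Person")])  # ['Alice', 'Bob']
```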

The graph data structure might seem unusual, but it’s simple and natural. Here’s an example of a simple graph data model in Neo4j:

Graph Data Model

As you can see, this graph contains two nodes (Alice and Bob) that are connected by relationships. Both nodes share the same label, Person. In the graph, only Bob’s node has properties, but in Neo4j every node and relationship can have properties.

A graph model is intuitive and easy for people to interpret. After all, the human brain doesn’t think in terms of tables and rows but in terms of abstract objects and connections. In fact, anything you can draw on a blackboard can be displayed with a graph.

How Neo4j compares to relational and other NoSQL databases

Having learned about the graph data model and the Neo4j database, you’re probably wondering how this data store differs from relational data stores. And although Neo4j belongs to the category of NoSQL tools, it’s quite different from other NoSQL databases.

So let’s briefly compare Neo4j to relational and other non-relational databases:

Advantages of Neo4j databases

Designed specifically to deal with huge amounts of connected data, the Neo4j database provides the following advantages:

In relational databases, performance suffers as the number and depth of relationships increases. In graph databases like Neo4j, performance remains high even if the amount of data grows significantly.

Neo4j is flexible, as the structure and schema of a graph model can be easily adjusted to the changes in an application. Also, you can easily upgrade the data structure without damaging existing functionality.

The structure of a Neo4j database is easy-to-upgrade, so the data store can evolve along with your application.

Neo4j database use cases

Now that you know how a Neo4j database works, you’re probably wondering what you can use this data store technology for. It might seem that graph databases can be applied to solve any problem, but that isn’t quite the case. Just like any technology, Neo4j should be used when it’s suitable.

Let’s take a look at several Neo4j database use cases:

Fraud detection and analytics

Businesses lose billions of dollars every year because of fraud. Despite extensive fraud prevention methods, fraudsters come up with increasingly sophisticated ways to steal money and identities. Thanks to its graph data model, a Neo4j database allows you to enhance your application’s fraud detection capabilities and detect financial crimes such as credit card fraud, ecommerce fraud, and money laundering.

Network and database infrastructure monitoring

As the complexity of your network and IT infrastructure grows, you need a more powerful configuration management database than a relational database can provide. The Neo4j graph database allows you to connect your network, data center, and IT assets in order to get important insights into the relationships between different operations within your network. For example, Neo4j can help you manage dependencies and monitor microservices.

Recommendation engines

It’s hard to find an online business that doesn’t use a recommendation engine to recommend relevant products or services to customers. A good recommendation engine should correlate a lot of data and be able to quickly detect new interests shown by clients. Being focused on entities and relations between them, a Neo4j database can easily handle recommendations and can significantly outperform relational and other non-relational databases on this kind of connected-data workload.

Social networks

Social networks are about connections between people, so basically they have graph structures. Needless to say, graph databases like Neo4j are perfectly tailored to social networks. They speed up the development of social network applications, enhance an app’s overall performance, and allow you to better understand your data.

Knowledge graph

As your business grows, it requires a more powerful contextual search solution. Neo4j can enhance your application’s search capabilities to deliver relevant results. The graph data model can improve simple keyword search and provide additional results related to keywords.

Identity and access management

Managing constantly changing roles, groups, and identities can be a complex task for businesses. A graph database like Neo4j allows you to monitor identity and access authorizations.

Privacy and risk compliance

Neo4j facilitates personal data storage and management: it allows you to track where private information is stored and which systems, applications, and users access it. The graph data model helps visualize personal data and allows for data analysis and pattern detection. Neo4j also comes in handy for financial risk reporting and compliance.

Master data management

To deliver the most pleasant customer experience, businesses need to analyze lots of data. Graph databases help to unify master data, such as information about customers, products, suppliers, and logistics. Neo4j allows you to organize master data and model it in a graph, revealing connections and relationships. Neo4j can provide important insights so that you can make relevant business decisions.

Building an email targeting system with Neo4j

Now that you know what the Neo4j database is and what opportunities it provides to businesses, you’re ready to take a look at a real-life example of how you can apply this data storage technology. We’ve decided to build a simple email targeting system with a Neo4j database, as an email targeting system is an important feature for lots of online businesses, namely online stores and marketplaces.

Our email targeting system will help analyze customer behavior and decide which offers to target audiences with. Thanks to this targeting system, businesses can offer relevant products to people and, therefore, increase conversions and contribute to overall customer satisfaction (since people expect to receive relevant offers).

Step #1: Installing Neo4j

For our sample email targeting system, we only need to download and install Neo4j Server. We could use Neo4j Desktop, but it contains extra functions, most of which we don’t need.

Installing Neo4j Server is quite simple. We’re going to use Neo4j 3.4.1 Community Edition for Ubuntu.

Step #2: Launching the Neo4j Browser

After installing the Neo4j Server, it’s time to run it using the command <NEO4J_HOME>/bin/neo4j start (the top level directory is referred to as NEO4J_HOME). After that, you can launch your web browser and start an interactive console called Neo4j Browser (it’s installed by default with Neo4j Server). To access the Neo4j Browser, go to http://localhost:7474/browser/ in your web browser and sign in with the default login and password (neo4j for both).

Once you’ve signed in, change the password. Then sign in with your new password to establish a connection with the database server.

The Neo4j Browser has an interactive console with a number of commands (:play start, :play concepts, and :play cypher). You can take a training tour and learn more about how to use the Neo4j database, check out sample graphs (such as the Movie Graph), and examine the state of the active database.

Now you have the toolkit for building an email targeting system.

Step #3: Data modeling

Before you start modeling your data, spend some time analyzing the business purpose of your email targeting system. Such systems are used to offer the most relevant products or services to customers, so marketers and analysts need to monitor customer behavior to launch efficient email marketing campaigns.

Our email targeting system is going to have the following entities (with attributes in parentheses):

  • Category (title)
  • Product (title, description, price, availability, shippability)
  • Customer (name, email, registration date)
  • Promotional Offer (type, content)

In our graph database, each of these entities is going to have nodes with respective labels. All entities will be connected via relationships (for the sake of simplicity, we’re going to consider only relationships between two entities). Note that we’re using the singular naming for all entities even though one-to-many connections, which are commonly used in relational databases, are also possible. Also, just like entities, some relationships in our model are going to have properties. Let’s write these properties in parentheses.

So here’s what we’ve got:

  • Product is_in Category
  • Customer added_to_wish_list Product
  • Customer bought Product
  • Customer viewed (clicks_count) Product
  • Promotional Offer used_to_promote Product

It doesn’t matter where the information about clicks comes from; let’s just assume we have this data.

Note that in Neo4j, there’s no need to model bidirectional relationships (such as Product is_in Category and Category has_many Product). Graph databases allow us to follow edges in both directions.
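To make the point about direction concrete, here’s a minimal Python sketch of the model above. It isn’t Neo4j code: the Graph class, the node IDs, and the sample products and customer are all hypothetical and exist only for illustration. Each relationship is stored once, and we can still walk it from either end.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph (illustrative only, not Neo4j)."""

    def __init__(self):
        self.nodes = {}
        self.out_edges = defaultdict(list)  # source id -> edges leaving it
        self.in_edges = defaultdict(list)   # target id -> edges entering it

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, rel_type, dst, **props):
        # The edge is stored once but indexed from both endpoints.
        edge = (src, rel_type, dst, props)
        self.out_edges[src].append(edge)
        self.in_edges[dst].append(edge)

    def targets(self, src, rel_type):
        """Follow relationships of rel_type forward from src."""
        return [dst for _, rel, dst, _ in self.out_edges[src] if rel == rel_type]

    def sources(self, dst, rel_type):
        """Follow relationships of rel_type backward into dst."""
        return [src for src, rel, _, _ in self.in_edges[dst] if rel == rel_type]


g = Graph()
g.add_node("notebooks", "Category", title="Notebooks")
g.add_node("p1", "Product", title="Hypothetical Notebook A", price=999.0)
g.add_node("p2", "Product", title="Hypothetical Notebook B", price=1299.0)
g.add_node("c1", "Customer", name="Jane Doe", email="jane@example.com")

g.add_edge("p1", "IS_IN", "notebooks")
g.add_edge("p2", "IS_IN", "notebooks")
g.add_edge("c1", "VIEWED", "p1", clicks_count=3)  # relationship with a property

print(g.targets("p1", "IS_IN"))         # which category is p1 in?
print(g.sources("notebooks", "IS_IN"))  # which products does the category hold?
```

Only IS_IN is modeled; the "category has many products" view falls out of reading the same edges backward.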

And… that’s it.

Modeling entities and relationships in a graph database is that simple and intuitive, as we don’t need to switch from a logical model (how entities are connected from the perspective of a task we need to solve) to a physical model (how we store data in our database). It’s also easy to add, modify, or delete entities and relationships in a graph database without bothering with foreign keys (as in relational databases) or links (as in NoSQL databases).

That’s an amazing advantage of graph databases.

Step #4: Working with the database

Now it’s time to fill our Neo4j database according to the model we defined in the previous step. There are two ways to do it:

  • Use Gremlin, a graph traversal language from the Apache TinkerPop project that was originally implemented in Groovy. Gremlin is concise and has a narrow focus, but its traversal-oriented style leans heavily on concepts from graph theory. For Neo4j, Cypher is the native and more commonly used choice.
  • Use Cypher, a declarative language like SQL that has distinctive semantics and allows you to write flexible and easy-to-read queries. Cypher syntax emphasizes directions in relationships between entities. Through the openCypher project, the language has an open specification that’s maintained with input from a community of contributors.

Needless to say, we’re going to use Cypher to work with the Neo4j database.

For the sake of convenience, we’re going to add nodes and relationships step by step. First, let’s introduce Categories and Products:

Now we should add customers and establish relationships between them and the products in our database (this part is a continuation of the previous query):

Now the database contains all necessary entities and relationships. As you can see, Cypher is so declarative that you can guess exactly what every piece of code does.

To visualize the graph, execute the MATCH (n) RETURN n query, which returns all nodes in our graph. If everything is correct, you’ll get this graph:

Full Graph Example in Neo4j

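The same model carries over to far bigger datasets: a traversal touches only the nodes and relationships reachable from its starting point, so query time depends mostly on the size of the neighborhood explored rather than on the total size of the graph.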

As you can see, the Neo4j Browser allows you not only to create a graph but also to visualize data; this is really helpful when it comes to creating an efficient email targeting campaign. Neo4j helps you model your data and gain valuable insights.

Now it’s time to apply our data to a real-life business problem.

Example #1: Using Neo4j to determine customer preferences

Suppose we need to learn preferences of our customers to create a promotional offer for a specific product category, such as notebooks. First, Neo4j allows us to quickly obtain a list of notebooks that customers have viewed or added to their wish lists. We can use this code to select all such notebooks:

Now that we have a list of notebooks, we can easily include them in a promotional offer. Let’s make a few modifications to the code above:

We can track the changes in the graph with the following query:

Products on Graph Visualization

There’s no need to link a promotional offer to specific customers, as the structure of the graph lets you reach any node easily. We can collect emails for a newsletter by starting from the products in our promotional offer and following their relationships to customers.

When creating a promotional offer, it’s important to know what products customers have viewed or added to their wish lists. We can find out with this query:

Promotional Offer on Graph

This example is simple, and we could have implemented the same functionality in a relational database. But our goal is to show the intuitiveness of Cypher and to demonstrate how simple it is to write queries in Neo4j.

Example #2: Using Neo4j to devise promotional offers

Now let’s imagine that we need to develop a more efficient promotional campaign. To increase conversion rates, we should offer alternative products to our customers. For example, if a customer shows interest in a certain product but doesn’t buy it, we can create a promotional offer that contains alternative products.

To show how this works, let’s create a promotional offer for a specific customer:

This query searches for products that don’t have an ADDED_TO_WISH_LIST, VIEWED, or BOUGHT relationship with a customer named Alex McGyver. Next, we perform the opposite query, which finds all products that Alex McGyver has viewed, added to his wish list, or bought. It’s also crucial to narrow down recommendations, so we make sure that these two queries select products in the same categories. Finally, we specify that only products priced within 20 percent (above or below) of a product Alex has shown interest in should be recommended to him.

Now let’s check if this query works correctly.

The product variable is supposed to contain the following items:

  • Xiaomi Mi Mix 2 (price: $420.87). Price range for recommendations: from $336.70 to $505.04.
  • Sony Xperia XA1 Dual G3112 (price: $229.50). Price range for recommendations: from $183.60 to $275.40.

The free_product variable is expected to have these items:

  • Apple iPhone 8 Plus 64GB (price: $874.20)
  • Huawei P8 Lite (price: $191.00)
  • Samsung Galaxy S8 (price: $784.00)
  • Sony Xperia Z22 (price: $765.00)

Note that both product and free_product variables contain items that belong to the same category, which means that the [:IS_IN]->()<-[:IS_IN] constraint has worked.

As you can see, none of the products except for the Huawei P8 Lite fits in the price range for recommendations, so only the P8 Lite will be shown on the recommendations list after the query is executed.
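The price check is easy to verify by hand. As a sanity check, here’s a short Python sketch (not the Cypher query itself) that recomputes the ±20 percent windows and filters the candidates using the prices listed above. The function and variable names are our own, chosen for illustration.

```python
def price_window(price, tolerance=0.20):
    """Return the (low, high) price range within `tolerance` of `price`."""
    return round(price * (1 - tolerance), 2), round(price * (1 + tolerance), 2)


# Products Alex has shown interest in (the `product` variable in the text).
viewed = {
    "Xiaomi Mi Mix 2": 420.87,
    "Sony Xperia XA1 Dual G3112": 229.50,
}

# Products Alex has no relationship with (the `free_product` variable).
candidates = {
    "Apple iPhone 8 Plus 64GB": 874.20,
    "Huawei P8 Lite": 191.00,
    "Samsung Galaxy S8": 784.00,
    "Sony Xperia Z22": 765.00,
}

windows = [price_window(price) for price in viewed.values()]

# Keep a candidate if its price falls inside at least one window.
recommended = [
    title
    for title, price in candidates.items()
    if any(low <= price <= high for low, high in windows)
]

print(windows)
print(recommended)
```

Only the Huawei P8 Lite ($191.00) lands inside a window (the Sony Xperia XA1 range of $183.60 to $275.40), which matches the result described above.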

Now we can create our promotional offer. It’s going to be different from the previous one (personal_replacement_offer instead of discount_offer), and this time we’re going to store a customer’s email as a property of the USED_TO_PROMOTE relationship, because the products contained in the free_product variable aren’t connected to specific customers. Here’s the full code for the promotional offer:

Let’s take a look at the result of this query:

  • In the form of a graph

Query Result on Graph

  • In the form of a table

Query Result in the Form of Table

Example #3: Building a recommendation system with Neo4j

The Neo4j database proves useful for building a recommendation system.

Imagine we want to recommend products to Alex McGyver according to his interests. Neo4j allows us to easily track the products Alex is interested in and find other customers who also have expressed interest in these products. Afterward, we can check out these customers’ preferences and suggest new products to Alex.

First, let’s take a look at all customers and the products they’ve viewed, added to their wish lists, and bought:

Recommendation System on Graph

As you can see, Alex has two touch points with other customers: the Sony Xperia XA1 Dual G3112 (purchased by Allison York) and the Nikon D7500 Kit 18–105mm VR (viewed by Joe Baxton). Therefore, in this particular case, our product recommendation system should offer to Alex those products that Allison and Joe are interested in (but not the products Alex is also interested in). We can implement this simple recommendation system with the help of the following query:
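The logic behind this recommendation can be sketched in a few lines of Python. This is an illustration of the idea rather than the Cypher query: the two touch points come from the graph above, while the other products and the extra customer are hypothetical placeholders, since we’re not reproducing the full dataset here.

```python
# Products each customer has viewed, wish-listed, or bought.
interests = {
    "Alex McGyver": {
        "Sony Xperia XA1 Dual G3112",
        "Nikon D7500 Kit 18–105mm VR",
        "Xiaomi Mi Mix 2",
    },
    # Items marked "Hypothetical" are placeholders, not data from the article.
    "Allison York": {"Sony Xperia XA1 Dual G3112", "Hypothetical Product A"},
    "Joe Baxton": {"Nikon D7500 Kit 18–105mm VR", "Hypothetical Product B"},
    "Hypothetical Customer": {"Hypothetical Product C"},
}


def recommend(customer, interests):
    """Suggest products liked by customers who share an interest with `customer`."""
    own = interests[customer]
    suggestions = set()
    for other, products in interests.items():
        if other == customer:
            continue
        if own & products:                  # at least one touch point
            suggestions |= products - own   # skip what the customer already knows
    return sorted(suggestions)


print(recommend("Alex McGyver", interests))
```

The customer with no touch point contributes nothing, and products Alex is already interested in are filtered out.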

Recommendation Graph in Neo4j

We can further improve this recommendation system by adding new conditions, but the takeaway is that Neo4j helps you build such systems quickly and easily.

Step #5: Using Neo4j with Ruby

We’ve shown just some capabilities of the Neo4j database, but so far we’ve been using the interactive console. However, you might be wondering how to add data to a real-life application. There are two options:

  • Drivers and libraries for different programming languages

The list of libraries includes Neo4j.rb, a library for using Neo4j with Ruby applications. Neo4j.rb contains several gems:

  • neo4j − an Object-Graph-Mapper (OGM) for Neo4j that tries to follow API conventions that are established by ActiveRecord and therefore known to most Ruby developers.
  • neo4j-core − a low-level API that can access both a server and an embedded Neo4j database; this library is automatically included in the neo4j gem.
  • neo4j-rake_tasks − a set of rake tasks for starting, stopping, and configuring a Neo4j database in your project; this gem is used by the neo4j-core library.

These gems allow you to easily wrap Cypher code and use it in your Ruby applications.

Final thoughts

Modern applications face a challenge of handling large amounts of interconnected data, and you need to pick an efficient technology to cope with it. Neo4j allows you to build applications capable of providing valuable real-time insights into connected data for further analysis and decision-making.


An overview of graph databases and their applications in the biomedical domain

Santiago Timón-Reina

Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain

Mariano Rincón

Rafael Martínez-Tomás

Over the past couple of decades, the explosion of densely interconnected data has stimulated the research, development and adoption of graph database technologies. From early graph models to more recent native graph databases, the landscape of implementations has evolved to cover enterprise-ready requirements. Because of the interconnected nature of its data, the biomedical domain has been one of the early adopters of graph databases, enabling more natural representation models and better data integration workflows, exploration and analysis facilities. In this work, we survey the literature to explore the evolution, performance and how the most recent graph database solutions are applied in the biomedical domain, compiling a great variety of use cases. With this evidence, we conclude that the available graph database management systems are fit to support data-intensive, integrative applications, targeted at both basic research and exploratory tasks closer to the clinic.


Nowadays, the generation, consumption and, more importantly, analysis of highly interconnected data have become ubiquitous. In this situation, where the relationships among data grow both in quantity and in significance, graph models become an appealing solution, as graphs are mathematical entities in which objects are connected. Formally, a graph G(V, E) is composed of an ordered pair of two disjoint sets: vertices V (also referred to as nodes) and edges (or links) E ( 1 ). The graph abstraction directly translates concepts and instances into nodes and their relationships into edges, making it intuitive for data modeling. However, storing graph data is not straightforward in conventional Database Management Systems (DBMSs), and the physical implementation of a given data model and how the relations are treated ultimately depend on the database type.

For example, Relational Database Management Systems (RDBMSs) are built on tables (relations) ( 2–4 ), where each row represents a single data element of an entity and a single column usually defines a particular data attribute. The standard mechanism to create relationships between entities is by defining unique IDs (primary keys) that can be copied into referencing tables (foreign keys). To exploit these references and include different tables in a database query, the Structured Query Language (SQL) ( 5 ) provides the JOIN clause. The relational paradigm is very appropriate for well-defined data structures that are unlikely to change, that translate naturally to tables, and whose relations are neither numerous nor as relevant as the entities’ attributes. Hence, given their maturity and technological development, RDBMSs are widely used for data storage, with countless examples experienced in everyday life, like user data, inventory tracking, blog posts and many more. However, when most relationships are many-to-many, as is prevalent in densely connected data, querying the database requires multiple expensive JOIN operations, impacting the performance ( 6 ).
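The key mechanism can be illustrated with a small, self-contained sketch using Python's built-in sqlite3 module. The schema, table names and rows below are hypothetical and serve only to show how a single many-to-many relationship already requires an intermediate table and two JOIN operations to traverse.

```python
import sqlite3

# In-memory database with two entity tables and one join table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE pathway (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE gene_pathway (
        gene_id INTEGER REFERENCES gene(id),       -- foreign key to gene
        pathway_id INTEGER REFERENCES pathway(id)  -- foreign key to pathway
    );
    INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B');
    INSERT INTO pathway VALUES (1, 'pathway_x'), (2, 'pathway_y');
    INSERT INTO gene_pathway VALUES (1, 1), (1, 2), (2, 2);
    """
)

# One hop (gene -> pathway) costs two JOINs; every further hop adds more.
rows = conn.execute(
    """
    SELECT g.symbol, p.name
    FROM gene AS g
    JOIN gene_pathway AS gp ON gp.gene_id = g.id
    JOIN pathway AS p ON p.id = gp.pathway_id
    WHERE g.symbol = 'GENE_A'
    ORDER BY p.name
    """
).fetchall()

print(rows)
conn.close()
```

Extending the query by one more relationship (for example, from pathways to a further entity) would add another join table and two more JOINs, which is the cost pattern described above.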

Although graphs can be modeled with tables representing vertices and edges, complex queries or graph algorithms (like path traversals) are challenging to optimize without implementing complementary structures, such as adjacency lists ( 7 ). These modeling and performance limitations have increased the interest in Graph Database Management Systems (GDBMSs). GDBMSs, in contrast to regular DBMSs, allow working directly with a graph model, avoiding sophisticated engineering to represent relationships efficiently, and provide straightforward ways to store, access and operate graph data, especially for traversing paths and matching subgraphs. Furthermore, the schema-less or schema-optional approach that most GDBMSs follow grants a high degree of flexibility, allowing applications to adapt and evolve quickly and introduce abstraction, specialization of entities and relations among them more easily.

Graph models are present in multiple formal representations and become very powerful when the problem model exhibits varied relations among the entities or concepts. Consequently, the trend in graph databases has permeated into many disparate domains, and we can find applications in Energy Management Systems (EMS) ( 8 ), Power Grid Modeling ( 9 ) and even less technologically driven fields like Digital Humanities ( 10 ). The biomedical domain is a complex area that is inevitably studied in many different sub-domains that are inherently related and connected. For instance, the study of human metabolism requires identifying hundreds of concepts (e.g. metabolites, proteins, complexes and metabolic reaction names) and the relations among them (e.g. consumption, production and catalysis), and graph models provide a valuable framework in this situation. Moreover, the amount of data produced in the ‘omics’ era results in large graphs that become difficult to manage without a database optimized for the task.

We can illustrate the differences between the relational and graph-based paradigms with the example depicted in Figures 1 and 2, a stripped-down biological model describing subject diagnoses and their related phenotype–genotype and pathway implications. For most GDBMSs, the physical design resulting from the logical model described in Figure 1 would be almost equivalent. However, in the case of RDBMSs, the implementation from the logical to the final physical design requires dealing with the many-to-many cardinality that most of the model’s relations will have. A typical normalized relational design, at least to the Third Normal Form (3NF) ( 11 ), prevents data redundancy by introducing intermediate tables for each relationship between two entities, as shown in Figure 2. For searching heavily connected entities, like genes, this layout would require referencing (joining and sub-querying) several tables multiple times, potentially with various filters, ultimately eroding the query’s performance. Also, complicated queries may end up being rather cumbersome. Thus, designing a relational model for highly interconnected data poses an engineering challenge, especially when the model requires fine-grained semantics, which involves a trade-off between implementing specialized relations (more tables) or limiting the expressiveness at the expense of semantics.

Figure 1. Graph model of diagnoses and its related phenotype–genotype and pathway implications.

Figure 2. The equivalent normalized relational physical design, with entity tables (white) to store attributes, and join tables (yellow) to implement the relationships.

GDBMSs treat relationships as first-class objects, improving the data model’s semantics and easing the adoption of knowledge models and ontologies, which are computer science constructs that provide well-defined vocabularies that allow the precise and machine-readable description of knowledge about a particular domain ( 12 ). The biomedical domain has driven and benefited from advances in Knowledge Representation (KR) and storage, being one of the early adopters of ontological research. As a result, there exists a significant number of formal biomedical ontologies ( 13 ) that capture and model knowledge from disparate sub-fields, giving rise to initiatives like the Open Biological and Biomedical Ontology (OBO) Foundry ( 14 ) and the National Center for Biomedical Ontology ( 15 ) to promote harmonization and interoperability. These controlled vocabularies and ontologies support the research in several ways, mainly in data annotation ( 16–19 ) and biomedical text mining ( 20 , 21 ).

In this paper, we survey the adoption of GDBMSs in the biomedical domain to present a summary review from an ‘application perspective’ with categorization and description of biomedical applications employing GDBMSs as storage systems. The applications presented are selected from a broad literature search complying with the following characteristics: (i) are biomedical applications using GDBMSs, (ii) are well documented with papers and websites (iii) have been peer-reviewed. Our coverage of biological graph-powered systems is by no means exhaustive, focusing on recent developments that are high quality, publicly available and expected to be of interest to experts and developers in the community. It is worth noting that, given the overlapping nature of biomedical knowledge, some systems can be classified into more than one category. First, we provide a technological background by exploring the different database models and designs and examining the performance through benchmark studies from the literature. Afterward, we highlight the use of GDBMSs within different applications in a wide variety of biomedical contexts, describing the implications and impact of graph technology in these settings. Finally, we discuss the current state, limitations and possible future lines.

Graph database models and design

Graph database models may be defined as those in which the data structures are modeled as a directed, possibly labeled, graph, or its generalizations. The data manipulation is done using graph-oriented operations and type constructors, and appropriate integrity constraints can be defined over the graph structure ( 22 ). Over the past decade, graph database implementations have grown from prototypical, application-driven approaches to fully developed products, providing external interfaces, database languages, query optimizers, storage and transaction engines, and management features. This evolution has been actively reviewed ( 23–28 ), showing how deficiencies such as the lack of integrity constraints, partition and scalability limitations, or the need for standard graph database languages have been addressed throughout the version history. Besta et al. describe the contemporary technological landscape of graph database solutions through a taxonomy of six key design aspects: type of backend technology, data modeling approach, internal data organization, data distribution, query execution and type of transactions ( 29 ).

As far as backend technology is concerned, we can see that, at present, most graph database systems are built upon existing storage designs from both relational and NoSQL ( 30 ) paradigms, such as key-value, document, wide-column, tuple and object-oriented stores. Key-value stores allocate items as (key, value) pairs, usually in standalone hash tables. Document stores extend key-value stores so that the values are ‘documents’, encoded in standard semi-structured formats such as XML, JSON or BSON (Binary JSON). Wide-column stores represent data through a tabular format of rows with a fixed number of column families (an arbitrary number of columns that are logically related to each other and usually accessed together). Triple stores [also known as Resource Description Framework (RDF) databases] work with the notion of triples (subject–predicate–object), and tuple stores generalize these systems to collect tuples of arbitrary size. Object-oriented stores store data as true objects, identified by object IDs (OIDs) and following a class hierarchy. Using existing engines delivers the advantage of mature and well-tested technology but at the expense of obtaining non-optimized graph data representations and queries. In contrast, native graph databases like TigerGraph ( 31 ) and Neo4j are specifically built to maintain and process graphs. Table 1 provides a list of different GDBMSs, which many of the reviewed applications use, with their internal database engines.

Table 1. Summary of available implementations by core database engine

Regarding data modeling, Labeled Property Graphs (LPG) and RDF are the most common graph models found in graph database systems ( 32–34 ). LPG augments the simple graph model to allow defining labels for nodes and edges, as well as an arbitrary number of properties (also called attributes) for both. RDF, a World Wide Web Consortium (W3C) standard, was conceived as a collection of specifications for representing information to allow easy data exchange between different data formats, and graphs arise from the collection of triples in the form of subject, predicate and object (s, p, o). The RDF format is widely used in biomedical setups, due mainly to the fact that RDF is a serialization and data instantiation format for OWL-based bio-ontologies, and new systems using native graph databases rely on transformations between models to fully exploit their features.
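The relation between the two models can be sketched with a toy transformation in Python. This is a simplified illustration under our own assumptions: the identifiers and the lpg_to_triples function are hypothetical, plain strings stand in for the IRIs that real RDF would use, and edge properties are left out because a single triple cannot carry them directly.

```python
# A tiny labeled property graph: nodes carry a label and properties.
nodes = {
    "g1": {"label": "Gene", "symbol": "GENE_A"},
    "p1": {"label": "Protein", "name": "PROTEIN_A"},
}
# Edges as (source, relationship type, target); edge properties omitted.
edges = [("g1", "ENCODES", "p1")]


def lpg_to_triples(nodes, edges):
    """Flatten a property graph into (subject, predicate, object) triples."""
    triples = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            # The node label becomes a 'type' statement; properties become literals.
            predicate = "type" if key == "label" else key
            triples.append((node_id, predicate, value))
    for src, rel_type, dst in edges:
        triples.append((src, rel_type, dst))
    return triples


for triple in lpg_to_triples(nodes, edges):
    print(triple)
```

In the LPG form, a label and its properties live on the node itself; in the triple form, each of them becomes a separate statement about the same subject.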

Likewise, systems need to define data structures to represent graphs in the storage layer. The most common representation formats are the adjacency matrix (AM), the adjacency list (AL) and the edge list (EL). Figure 3 shows a graphical representation of these formats. The AM is a square matrix whose cells indicate whether vertex pairs are adjacent (connected) or not. In the AL format, each vertex has an associated adjacency list containing the IDs of all adjacent vertices. The difference with AL is that EL explicitly stores each edge with its source and destination vertices. The AL format is efficient on traversal operations, and many graph databases use it. Other features, such as index support, are also relevant for the overall performance.

Figure 3. Graphic description of the most common graph representation formats. (a) Original directed graph; (b) adjacency matrix; (c) adjacency list; and (d) edge list.
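The three formats can be built side by side for a small directed graph, as in the following Python sketch. The four-vertex graph is a hypothetical example chosen for brevity and is not necessarily the one drawn in the figure.

```python
vertices = ["A", "B", "C", "D"]

# Edge list (EL): every edge stored explicitly as (source, destination).
edge_list = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Adjacency matrix (AM): cell [i][j] is 1 when there is an edge i -> j.
index = {vertex: position for position, vertex in enumerate(vertices)}
adjacency_matrix = [[0] * len(vertices) for _ in vertices]
for src, dst in edge_list:
    adjacency_matrix[index[src]][index[dst]] = 1

# Adjacency list (AL): each vertex maps to the IDs of its out-neighbors.
adjacency_list = {vertex: [] for vertex in vertices}
for src, dst in edge_list:
    adjacency_list[src].append(dst)

for row in adjacency_matrix:
    print(row)
print(adjacency_list)
```

The matrix needs one cell per vertex pair regardless of how many edges exist, whereas the adjacency list grows with the number of edges and gives direct access to a vertex's neighbors, which is what a traversal needs.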

Data distribution may be achieved through data replication or sharding. With replication, each instance maintains a copy of the dataset, while sharding fragments the data across instances. Distribution becomes essential when dealing with large amounts of data, and query execution is directly linked to it. Multi-server query execution can be enabled in several ways. The concurrent execution allows the execution of different queries at the same time, providing higher throughput. With parallelization, a single query can be executed across servers to obtain lower latencies. Because managing large amounts of data can compromise the system’s performance or availability, these features can become essential for projects in this situation.
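The contrast between the two strategies can be sketched in Python as follows. Hash-based placement is only one possible sharding scheme and is an assumption made here for illustration; the vertex identifiers, instance counts and the shard_for function are likewise hypothetical.

```python
import zlib


def shard_for(vertex_id, shard_count):
    """Pick a shard deterministically from a checksum of the vertex ID."""
    return zlib.crc32(vertex_id.encode("utf-8")) % shard_count


vertices = ["gene:1", "gene:2", "protein:1", "pathway:1", "disease:1"]
INSTANCE_COUNT = 3

# Replication: every instance keeps a full copy of the dataset.
replicas = [set(vertices) for _ in range(INSTANCE_COUNT)]

# Sharding: every vertex is stored on exactly one instance.
shards = [set() for _ in range(INSTANCE_COUNT)]
for vertex in vertices:
    shards[shard_for(vertex, INSTANCE_COUNT)].add(vertex)

print([len(replica) for replica in replicas])  # each copy holds everything
print([sorted(shard) for shard in shards])     # the data is split across instances
```

With replication, any instance can answer any read, which favors concurrent execution; with sharding, a query that crosses shard boundaries has to be coordinated across instances.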

Finally, GDBMSs can be evaluated by their support for transactions and workload types, specifically Atomicity, Consistency, Isolation, Durability (ACID) guarantees, Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). OLTP systems focus on smaller transactional queries, while OLAP systems execute more expensive analytic queries that span whole graphs.

The literature reveals that the field is evolving rapidly and many referenced databases have either already been discontinued or greatly improved at the time of writing.

Performance and benchmarking

Because of their innate capabilities in dealing with highly interconnected data, graph databases have been attracting attention in the past years. As different technological implementations of graph database engines have emerged, so has the need for accurate, quantitative performance comparisons between them by using standardized queries and workloads. Furthermore, the differences in relational and graph-based paradigms also raised questions about how they would behave in different contexts. Table 2 summarizes the surveyed benchmark studies.

Table 2. Relevant benchmarking studies

Within standard benchmarks, the Linked Data Benchmark Council (LDBC) ( 35 ) is one of the most consistent works in this topic, and its workloads have been employed and adapted in many benchmarking studies. The library currently includes three kinds of workloads: interactive, business intelligence and graph analytics. Interactive workloads focus on general graph database operations, executing read-only (short and complex) and transactional update queries. Business Intelligence workloads are designed to stress different performance aspects, employing read-only aggregation operations over significant volumes of data that span large parts of the graph. The last workload, ‘graphalytics’ ( 36 ), proposes six graph algorithms to enable the objective comparison of graph analysis platforms: Breadth-First Search ( 37 ), PageRank ( 38 ), weakly connected components ( 39 ), community detection using label propagation ( 40 ), deriving the local clustering coefficient ( 41 ), and computing single-source shortest paths.
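As an illustration of the first of these algorithms, the following Python sketch runs a Breadth-First Search over an adjacency list. It is a textbook formulation, not the Graphalytics reference implementation, and the graph and function name are hypothetical.

```python
from collections import deque


def bfs_order(adjacency_list, start):
    """Return vertices in the order a breadth-first search visits them."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        # Only the direct neighbors of the current vertex are inspected.
        for neighbor in adjacency_list.get(vertex, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}
print(bfs_order(graph, "A"))
```

Vertex E is never reached from A, so it never enters the queue: the work done depends on the part of the graph that is reachable from the start vertex.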

GDBMSs have been assessed in studies from different contexts, like data provenance ( 42 ), biomedical settings ( 43–46 ) and social networks ( 47–52 ). Most of the social network benchmarks use or adapt the LDBC’s Social Network Benchmark (SNB) ( 53 ). In parallel with technological surveys, these studies show how GDBMS technology has matured and grown into a competitive and heterogeneous environment, with its weaknesses and strengths.

The number of edges involved in a query has a big impact on performance ( 44 , 46 ). Likewise, subgraph-matching queries are more challenging to handle in large datasets, in contrast to the traversal queries employed in some of the works. Lastly, GDBMSs are, in general, less optimized for aggregate operations ( 25 , 51 , 52 , 54 ). In contrast, all the studies acknowledge that a schema-less design provides a high degree of flexibility to accommodate new nodes or relations, avoiding the need to restructure the schema. GDBMSs are more efficient at traversing large graph instances, with lower computational cost than RDBMSs ( 42 , 43 , 45 , 47 , 52 , 55 , 56 ), because the search space is reduced to directly connected nodes, which avoids scanning the entire graph to find the nodes that meet the search criteria. Furthermore, graph algorithms (e.g. pathfinding, community detection, centrality or similarity) are more natural to implement and are even available out of the box, as in the case of Neo4j’s Graph Data Science Library or TigerGraph’s ( 31 ) GSQL Graph Algorithm Library.
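
The traversal argument can be sketched with a toy comparison. The first function below mimics a worst-case, unindexed relational step that scans every row of an edge table; the second follows only the edges stored with each node, in the spirit of an adjacency-based graph store. This is a deliberately simplified illustration: real RDBMSs normally use indexes rather than full scans, and the data and function names are assumptions.

```python
def neighbors_by_scan(edge_table, frontier):
    """Relational-style step without an index: examine every edge row."""
    rows_examined = 0
    found = set()
    for src, dst in edge_table:
        rows_examined += 1
        if src in frontier:
            found.add(dst)
    return found, rows_examined


def neighbors_by_adjacency(adjacency, frontier):
    """Graph-style step: follow only the edges attached to frontier nodes."""
    edges_examined = 0
    found = set()
    for node in frontier:
        for dst in adjacency.get(node, ()):
            edges_examined += 1
            found.add(dst)
    return found, edges_examined


# Hypothetical graph: a chain 0 -> 1 -> ... -> 999 plus one shortcut 0 -> 500.
edge_table = [(i, i + 1) for i in range(999)] + [(0, 500)]
adjacency = {}
for src, dst in edge_table:
    adjacency.setdefault(src, []).append(dst)

scan_result, scan_cost = neighbors_by_scan(edge_table, {0})
adj_result, adj_cost = neighbors_by_adjacency(adjacency, {0})
print(scan_result == adj_result, scan_cost, adj_cost)
```

Both functions return the same neighbors, but the work done by the adjacency version depends only on the degree of the frontier nodes and not on the total number of edges in the graph.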

Comparing different paradigms requires extra effort in the benchmark implementation to address the peculiarities of each one. In the case of RDBMSs vs. GDBMSs ( 52 ), Cheng et al. propose a unified benchmark that extends the TPC-H standard RDBMS benchmark and LDBC using transformation mechanisms between relational and graph data, making it possible to evaluate different systems on the same datasets, query workloads and metrics. The query workloads fall into three main categories. Firstly, atomic relational queries (Projection, Aggregation, Join and Order by) evaluate the performance of primitive relational operations implemented in GDBMSs. Secondly, TPC-H query workloads evaluate the performance of GDBMSs on operations that legacy RDBMSs perform well. Lastly, graph query workloads, composed of five graph algorithms from the LDBC benchmark, evaluate the performance of RDBMSs in situations where GDBMSs are supposed to be efficient.

Nevertheless, regarding the dichotomy between RDBMSs and GDBMSs, recent benchmarks show equivalent or even better performance of the former in different settings, questioning whether it is appropriate to favor GDBMSs over RDBMSs without a proper evaluation of the context. One example is real-life high-throughput scenarios, like those with critical concurrent access ( 59 ) or streaming transactional workloads ( 50 ), where GDBMSs are less prevalent. In these settings, RDBMSs can deliver competitive performance for OLTP-like online social networking applications, especially in single-node setups. Moreover, the implementation and optimization of graph analytics in RDBMSs are growing areas of research ( 63–66 ).

The physical data persistence strategy impacts the overall performance in both paradigms. For example ( 50 ), Pacaci et al. show how similar SQL queries over the same database schema yield different performance in PostgreSQL and Virtuoso (SQL). The difference is attributable to the fact that Virtuoso employs columnar storage, which is known to suffer under transactional workloads with frequent updates, while PostgreSQL implements row-oriented storage. In the case of GDBMSs, adjacency lists are common in native graph storage, as they enable index-free adjacency access and provide evident advantages for read operations. However, other storage approaches offer better performance for write operations, as is the case of key-value storage engines implementing the LSM-tree ( 67 ) index. Moreover, tuning procedures, like optimizing indexing or tablespaces, are of utmost importance to achieve the best possible performance regardless of the system, as some studies report.
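
As a rough illustration of why LSM-tree storage favors writes, the sketch below buffers updates in an in-memory table and flushes them as immutable sorted runs, while reads must check the buffer and then each run from newest to oldest. It is a heavily simplified model (no compaction, no write-ahead log, no deletions), and the class name, the edge-key format and the tiny flush threshold are assumptions chosen for the example, not the design of any particular engine.

```python
import bisect


class TinyLSM:
    """Minimal LSM-style key-value store: memtable plus sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # oldest first; each run is (sorted_keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the in-memory table, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.runs.append((keys, values))
        self.memtable = {}

    def get(self, key):
        # Reads may need to consult several structures, newest data first.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.runs):
            index = bisect.bisect_left(keys, key)
            if index < len(keys) and keys[index] == key:
                return values[index]
        return None


# Hypothetical usage: edges stored under "source->target" keys.
store = TinyLSM(memtable_limit=2)
store.put("n1->n2", "KNOWS")
store.put("n1->n3", "KNOWS")   # triggers a flush to a sorted run
store.put("n1->n2", "BLOCKS")  # newer value shadows the flushed one
print(store.get("n1->n2"), store.get("n1->n3"), store.get("n9->n1"))
```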

Graph database applications in the biomedical domain

Biomedical research produces large amounts of densely interconnected data belonging to many different domains, and storing such data has always presented a technological challenge. Storing graphs using traditional relational databases presents several drawbacks. Relational databases rely on fixed schemas and usually require redesigns when introducing new data structures, affecting flexibility, efficiency and scalability. More generic data models would require many intermediate tables to represent many-to-many relationships, degrading the overall performance because of the need for multiple join operations to traverse interconnected networks. As graph databases matured, they started to gain more attention in the bioinformatics community, given the ubiquity of graphs in this domain. Consequently, many tools emerged to interoperate between formats and paradigms. Table 3 brings together some of the most relevant ones.

A short list of useful graph-oriented open-source tools and utilities

The evolution of Knowledge Representation technologies and, more specifically, ontology languages like OWL, enables more complex and interconnected models. Although many of these tools do not necessarily use an explicit graph model, it is commonly implicit in the semantics, opening the door to exploit graph features. One remarkable example of this approach is the Open Biomedical Ontologies ( 14 ), which many of the works we are about to describe employ as foundational models. Table 4 summarizes publicly available graph-powered systems.

Publicly available graph-powered Biomedical data systems

Applications in systems biology

Intrinsically, systems biology models encode networks of entities and biological processes, such as reactions. As advances in molecular biology produce more extensive and complex networks, the computational demand for analyzing those increases drastically. Consequently, the use of in-house software and desktop solutions started to become a bottleneck. GDBMSs allow decoupling a significant part of the computational needs to dedicated server machines, providing improved tuning of resources for optimal query and algorithm execution performance. One good example is cyNeo4j ( 68 ), a Cytoscape ( 69 , 70 ) app to link this popular network analysis desktop program to a server environment using Neo4j. It enables the user to upload network data and run algorithms both locally and on the Neo4j server, creating an interactive workflow that uses the computational strength of the Neo4j server without interrupting the typical workflow in Cytoscape.

Standard formats of the domain, like Systems Biology Markup Language (SBML) ( 71 ) or CellML ( 72 ), enable modeling biological systems in terms of functional, behavioral or structural aspects, including meta-data and semantic annotations to relate model entities to external resources describing the underlying biology. These meta-data are of great importance to facilitate model reuse and reproducibility, but this introduces heterogeneity, which complicates the design in fixed-schema database systems ( 73 ). Henkel, Wolkenhauer and Waltemath employed Neo4j to store SBML and CellML models, including ontology terms and relations from the semantic annotations that these formats support, effectively combining computational models, semantic annotations and simulation experiments. The approach integrated widely adopted bio-ontologies, adding all classes and relations as nodes and edges but leaving out cross-references between concepts of different ontologies. This integration allows querying the information hidden in the semantic annotations of in-model representations and simulation descriptions. Furthermore, it allows defining flexible connections between the data domains, incorporating links between annotations, whole models and model entities.

The Systems Biology Graphical Notation (SBGN) ( 90 ) is another standard for visual representation of biological networks. It is composed of three orthogonal languages for representing different views of biological systems: Process Descriptions (PDs), Entity Relationships (ERs) and Activity Flows (AFs). SBGN-to-Neo4j (STON) ( 74 ) is a Java framework to transform SBGN markup language files into a Neo4j graph representation, focused only on the PD and AF sub-languages. The authors report that the persistent graph representation yields several benefits, e.g. efficient management and querying of networks, identification of subgraphs in networks, merging of SBGN diagrams/existing pathways into more extensive systems, or the comparison of different layers of granularity in SBGN languages.

Applications in biological and medicinal chemistry

The fields of Biology and Biochemistry have been pioneers in the development of new data standards and knowledge representation paradigms, such as ontologies, to foster reuse, integration and translation of research data. These standards enable publicly available data resources such as UniProt ( 91 ), KEGG ( 92 ) and NCBI Taxonomy ( 93 ) to soft-link entities to each other, allowing the user to follow such links by manual browsing or through specialized workflows. The introduction of graph databases made it easier to integrate these resources explicitly. Built on Neo4j, Biochem4j ( 82 ) provides an integrated, queryable database that warehouses chemical, reaction, enzyme and taxonomic data from ChEBI ( 94 ), MNXref ( 95 ), Rhea ( 96 ), KEGG, UniProt and the NCBI Taxonomy resources. Biochem4j translates ontology entities and raw biological data into an integrated graph representation, which, leveraged through the Cypher query language, allows performing queries and detecting patterns across the whole range of available information.

Logically, graph representations also apply to lower-level chemistry and related fields, like drug discovery research. One example is fragment-based drug discovery (FBDD) ( 97 ), in which the validation stage of a project involves testing sensible close analogs of a fragment hit. This process needs adequate search tools to mine the many millions of similar compounds that are currently available in the fragment space from corporate collections or commercial suppliers. The Fragment Network ( 98 ) employs Neo4j to allow the user to search the chemical space around a compound of interest. The graph model treats each compound as a set of rings, linkers and substituents, with a resulting network containing a total of 23 million nodes and 107 million edges.

Applications in the omics domain

In the last five years, the usage of graph databases to support the integration of genomic, proteomic, metabolomic and phenotypic data has substantially increased. Most of the authors conclude that GDBMSs are valuable tools to deal with heterogeneity and loosely structured data models because they provide a high degree of flexibility and lay the foundations for building integrated solutions.

Biological pathways

Repositories of metabolic maps, reconstructions, pathways and interactions provide fundamental tools for biomedical investigation. Examples of these repositories are the Reactome Knowledgebase ( 99 ), Recon2 ( 100 ) and its latest development, Recon3D ( 101 ).

Reactome is a comprehensive repository of molecular reactions that include signal transduction, transport, DNA replication, protein synthesis and intermediary metabolism. Reactome contains a detailed representation of cellular processes as an ordered network of molecular reactions, interconnecting terms to form a graph of biological knowledge. This structure serves both as an archive of biological processes and as a tool for discovering unexpected functional relationships in data. Reactome’s data model initially followed a frame-based design stored in a relational MySQL database. Overcoming the relational model’s intrinsic limitations required an increased level of abstraction in its physical design to accommodate new concepts, ultimately affecting query complexity and execution time. As graph database systems matured, the limitations of storing pathway data in relational databases became more evident, motivating the project to develop tools to migrate the content into a Neo4j database ( 86 , 87 ). The Reactome case is especially relevant because it provides a detailed description of the process of adopting a native graph database and of how it improved the performance and capabilities of the whole system. On the one hand, the average query time dropped from 173.11 ms to 12.56 ms, a 93% reduction. On the other hand, the new graph model provides more straightforward ways to perform complex queries over metabolic pathways.

Recon2 is another large community-driven reconstruction of the human metabolic network, with thousands of reactions, unique metabolites and proteins, included in an SBML model. A model of this size and complexity poses a challenge for advanced exploration involving associations between multiple concepts (e.g. network neighborhood of metabolites, shortest pathways between metabolites, proteins and complexes). Recon2Neo4j ( 76 ) is a Neo4j-based metabolic framework that models the relevant concepts involved in the metabolic reactions as nodes in the graph database and the relationships among them as connecting edges, facilitating the exploration of comprehensive and highly connected human metabolic data and the identification of metabolic subnetworks of interest.

HRGRN ( 80 ) is an integrative database for plant signal transduction, metabolism and gene regulation networks that is also backed by Neo4j. The solution, implemented as a web platform, provides users with a graph-centered search interface to explore these biological systems, allowing them to find potential paths or build either node-centralized or nodes-of-interest subnetworks. The data model follows an ad hoc approach, where biological entities (such as genes, proteins, small compounds and RNAs) are represented as nodes. For the relations between these entities, the authors defined eight types of edges that link the above nodes based on their biological functions. The Property Graph model is employed to attach to each edge a property indicating whether the relationship was validated or predicted.
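
The property-graph idea described above can be sketched as follows: each edge carries a type plus a dictionary of properties, and a node-centered subnetwork can be restricted to validated relationships. The node names, edge types and the `evidence` property key are hypothetical placeholders, not HRGRN's actual schema.

```python
from collections import namedtuple

# A property-graph edge: endpoints, a type, and free-form properties.
Edge = namedtuple("Edge", ["source", "target", "type", "properties"])

# Hypothetical toy data.
edges = [
    Edge("geneA", "proteinA", "encodes", {"evidence": "validated"}),
    Edge("proteinA", "compoundX", "binds", {"evidence": "predicted"}),
    Edge("proteinA", "geneB", "regulates", {"evidence": "validated"}),
    Edge("geneC", "geneB", "regulates", {"evidence": "predicted"}),
]


def node_centered_subnetwork(edges, center, evidence=None):
    """Edges touching `center`, optionally filtered by the evidence property."""
    return [
        edge
        for edge in edges
        if center in (edge.source, edge.target)
        and (evidence is None or edge.properties.get("evidence") == evidence)
    ]


for edge in node_centered_subnetwork(edges, "proteinA", evidence="validated"):
    print(edge.source, edge.type, edge.target)
```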

BioGraphDB ( 102 ) is a bioinformatics database that combines different types of data from ten online public resources related to genes, microRNAs (miRNAs), proteins, pathways and diseases. To integrate these disparate resources, it builds on an Extract-Transform-Load (ETL) ecosystem capable of dealing with several formats (Tab delimited, XML, EBML and SQL) with a precise execution order to satisfy dependencies between the integrated resources. This process maps each biological entity and its properties into a vertex and its attributes, and relationships between two biological entities into edges. In this case, the GDBMS of choice was OrientDB. When operating in graph mode, referenced relationships are like edges, accessible as first-class objects having start and end vertices and properties. This feature allows representing a relational model as a document-graph model, maintaining the relationships. With the end-user in mind, the Biograph web application ( 103 ) allows users to query, visualize and analyze biological data belonging to the sources available in BioGraphDB. However, the system leans toward a technical, graph-knowledgeable audience, with explicit Gremlin query interfaces.

Similarly ( 104 ), Lysenko et al. illustrate how to build a graph structure to relate biomedical information at different levels and provide biological context to disease-related genes and proteins. The work integrated genomic and proteomic data along with disease concepts to investigate possible relations between specific protein interactions, pathways, and typical phenotypes associated with asthma. In this case, the modeling strategy follows a protein-centric approach without a rigid schema or upper model (such as an ontology). This approach provides a higher degree of flexibility to integrate many semi-structured data sources and eases the development of ad hoc solutions, but at the expense of data standardization. The study provides good insight into how graph databases can facilitate hypothesis generation. Another relevant contribution is to show how targeted Cypher queries exploit known structures, as well as graph algorithms like network neighborhood analysis, to provide biological context. An example of a structural query is obtaining the proteins common to asthma and other related respiratory diseases, where protein nodes are connected to health conditions with a concrete ‘associated’ relation. The authors also demonstrate how simple graph traversal queries have the potential to assist in hypothesis generation by exploring relationships between concepts. For instance, to explore the relationship between asthma and alterations in circadian rhythm, they identify all shortest paths in the graph between the asthma disease node and a subset of protein-coding genes that generate and regulate circadian rhythms.
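
The all-shortest-paths exploration mentioned above can be sketched with a breadth-first search that records every predecessor lying on a shortest route. This is an illustrative Python stand-in for the kind of query the authors express in Cypher, not their implementation; the graph below is hypothetical toy data with placeholder node names.

```python
from collections import deque


def all_shortest_paths(adjacency, source, target):
    """Enumerate every shortest path between two nodes of an undirected graph."""
    distance = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                parents[neighbor] = [node]
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                # Another equally short route reaches this neighbor.
                parents[neighbor].append(node)
    if target not in distance:
        return []

    def unwind(node):
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in unwind(parent)]

    return unwind(target)


# Hypothetical links between a disease node, proteins and circadian genes.
links = [
    ("asthma", "protein_1"), ("asthma", "protein_2"), ("asthma", "protein_3"),
    ("protein_1", "circadian_gene_A"), ("protein_2", "circadian_gene_A"),
    ("protein_3", "protein_4"), ("protein_4", "circadian_gene_B"),
]
adjacency = {}
for a, b in links:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

for gene in ("circadian_gene_A", "circadian_gene_B"):
    for path in all_shortest_paths(adjacency, "asthma", gene):
        print(" - ".join(path))
```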


Epigenetics is a growing area of research within the biomedical domain, and it is being used in many different contexts, such as the study of cancer. Existing relational databases that focus on various features of cancer pathways are restricted because the integration of multiple data types in relational databases is nontrivial, and the concept linking needed in the exploration of cancer-related hypotheses is limited. EpiGeNet ( 83 ) is a graph database that stores conditional relationships between molecular (genetic and epigenetic) events observed at different stages of colorectal oncogenesis. It integrates statistical data on molecular interdependencies recognized in colorectal cancer development, mined from StatEpigen ( 105 ) (a manually curated and annotated database) into a Neo4j instance. For the data model, ‘MolecularEvent’ nodes represent molecular events of conditional relationships, modeled as edges in the graph. The edge type is determined by phenotype information and the direction by the conditionality of the relationship. Attributes of ‘MolecularEvent’ are used to store event type and gene information, and the probability value is stored as a property of the edge. The resulting graph makes it possible to explore path connections associated with the highest ‘incidence score’ and employ Cypher queries in tasks like identifying genetic–epigenetic modifications, or molecular phenomena observed and reported in the specialized literature.


The transcriptome is the complete set of all RNA molecules in a cell, a population of cells, or an organism ( 106 ). Transcriptomics studies generate large amounts of data, raw or processed, that may be deposited in public databases to make them available to a broader scientific community ( 107 ). These data can be expressed as gene expression and interaction networks, which may additionally be integrated with other biological datasets, such as protein–protein interactions (PPIs), transcription factors (TFs) and gene annotations. In this context, and to evaluate the performance of Neo4j ( 46 ), Wiese et al. constructed Genome Regulatory Networks (GRNs) based on known enhancer–promoter interactions (EPIs) and their shared regulatory processes by focusing on cooperative TFs. Among the platforms exploiting these data, we find the non-coding RNA Human Interaction Database (ncRNA-DB), which later evolved into Arena-Idb ( 79 ), miTALOS v2 ( 81 ), GeNNet ( 84 ), the Association Network Integration for Multiscale Analysis (ANIMA) ( 77 ) and the Gene Regulation Graph Database (GREG) ( 89 ). Except for ncRNA-DB, all these platforms employ Neo4j as the GDBMS.

The ncRNA-DB is built on top of OrientDB, which translates class instances into nodes, making it possible to follow an object-oriented design consisting of four main classes and their specializations: BioEntity, Alias, DataSource and Relation. The database imported and integrated associations among non-coding RNAs (miRNAs, circulating miRNAs, long non-coding RNAs (lncRNAs) and other non-coding RNAs), genes, RNAs and associated diseases from 10 online databases. ncRNA-DB provides three alternative interfaces: a Cytoscape app named ncINetView, a web interface, and a command-line interface for raw resource queries. Later, ncRNA-DB evolved into Arena-Idb, introducing several improvements like a mapping procedure for managing entities, an accurate integration process and reconstructed data storage. The updated dataset included seven new sources [such as Disease Ontology ( 108 ), lnc2cancer ( 109 ), lncACTdb ( 110 ), PSMIR ( 111 ), StarBase ( 112 ) or TarBase ( 113 )]. Arena-Idb follows a hybrid RDBMS and GDBMS implementation, using MySQL to store names, annotations and sequences and Neo4j to handle the construction and visualization of the networks of thousands of biological entities.

To provide a tool to identify pathways regulated by miRNAs in a tissue-specific manner, miTALOS v2 employs Neo4j to integrate several heterogeneous data sources and directly model molecular entities and their interaction networks. This graph model represents miRNAs, genes, pathways and tissues as nodes. miRNAs are connected to genes with ‘REGULATES’ relationships, genes to tissues with ‘EXPRESSED’ relationships and genes to pathways with ‘MEMBER’ relationships. The graph structure allows querying, for instance, the target genes of a miRNA expressed in a tissue or the pathways in which the target genes are involved. Furthermore, the schema-less approach enables the platform to stay up to date and to integrate new aspects, like lncRNAs as regulators of gene expression or disease-specific expression profiles that extend tissue-specific gene expression.
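
The two example queries can be illustrated with a small in-memory version of this graph model. The relationship names (REGULATES, EXPRESSED, MEMBER) come from the description above; the node identifiers, the triple-based storage and the function names are assumptions made for this sketch, which stands in for what would be a Cypher pattern match in Neo4j.

```python
# Edges stored as (source, relationship, target) triples; toy data only.
edges = [
    ("mir-1", "REGULATES", "gene-1"),
    ("mir-1", "REGULATES", "gene-2"),
    ("mir-2", "REGULATES", "gene-3"),
    ("gene-1", "EXPRESSED", "liver"),
    ("gene-2", "EXPRESSED", "brain"),
    ("gene-3", "EXPRESSED", "liver"),
    ("gene-1", "MEMBER", "pathway-A"),
    ("gene-2", "MEMBER", "pathway-B"),
    ("gene-3", "MEMBER", "pathway-A"),
]


def targets(edges, source, relationship):
    """Nodes reached from `source` through edges of the given type."""
    return {dst for src, rel, dst in edges if src == source and rel == relationship}


def regulated_genes_in_tissue(edges, mirna, tissue):
    """Target genes of a miRNA that are expressed in the given tissue."""
    return {
        gene
        for gene in targets(edges, mirna, "REGULATES")
        if tissue in targets(edges, gene, "EXPRESSED")
    }


def regulated_pathways_in_tissue(edges, mirna, tissue):
    """Pathways containing the tissue-specific target genes of a miRNA."""
    pathways = set()
    for gene in regulated_genes_in_tissue(edges, mirna, tissue):
        pathways |= targets(edges, gene, "MEMBER")
    return pathways


print(regulated_genes_in_tissue(edges, "mir-1", "liver"))
print(regulated_pathways_in_tissue(edges, "mir-1", "liver"))
```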

GeNNet is an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. The framework consists of three main components: the Scientific Workflow (GeNNet-Wf), the Graph database (GeNNet-DB) and the web interface (GeNNet-Web). GeNNet-DB uses an in-house data model to group nodes and edges into classes, according to the nature of the objects [e.g. GENE, BP (Biological Process), CLUSTER, EXPERIMENT and ORGANISM], and preloads a set of specified organisms to serve as the initial layout. Along with other associated elements, it includes genes annotated/described from ENTREZ ( 114 ) and their relationships integrated from STRING-DB ( 60 ), which contribute to subsequent transcriptome analysis. The study provides analyses from the hepatocellular carcinoma (HCC) use case, demonstrating how concise graph operations through Cypher queries are capable of solving relatively complex topological questions, like finding the most connected genes that establish known connections to the PPI network. These genes act as hubs and may be associated with relevant pathways in the experimental context.
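
Finding the most connected genes amounts to ranking nodes by degree. The sketch below shows that ranking over a toy interaction list in plain Python; it only illustrates the topological question and is not GeNNet's Cypher query. The gene identifiers and the function name are hypothetical.

```python
from collections import Counter


def top_hubs(interactions, k):
    """Rank genes by number of interactions (degree) and keep the top k."""
    degree = Counter()
    for gene_a, gene_b in interactions:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree.most_common(k)


# Hypothetical undirected protein-protein interactions between genes.
interactions = [
    ("g1", "g2"),
    ("g1", "g3"),
    ("g1", "g4"),
    ("g2", "g3"),
]

print(top_hubs(interactions, 1))
```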

ANIMA allows the summarization and visualization of different views of the state of the immune system under different conditions and at multiple scales. The framework generates a multiscale association network from multiple data types by executing a comprehensive analytic workflow, enumerating bipartite graphs from the results and merging all graphs into a single network in Neo4j. ANIMA is architecturally and conceptually similar to GeNNet, differing mainly in the details of the implementation, the containerization approach, and the complexity of the model.

GREG is an integrative database that merges numerous source databases providing different scopes (e.g. DNA–DNA interaction, PPIs, bindings, DNA annotations or human cell data). It follows an in-house data model and takes advantage of the graph model to tackle challenges like integrating EPIs (with a DNA binning strategy) or harmonizing data from chromatin interaction technologies with very different resolutions. When using small bins, its graph comprises more than 2 M nodes and more than 19 M edges, and the main limitation is that, because it includes all non-coding regions, search time grows with the size of the genomic range. GREG provides both direct access to the Neo4j database (via Cypher) and a user-friendly web platform. Through the web interface, the user can specify search parameters and access typical network analysis algorithms.
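
A generic fixed-width binning scheme helps to see why search cost grows with the genomic range and why small bins inflate the graph. The sketch below assumes that each bin becomes one node and that a range query must visit every overlapped bin; this is an assumption made for illustration, since the exact binning used by GREG is not described here, and the bin sizes and function names are hypothetical.

```python
def bin_index(position, bin_size):
    """Index of the fixed-width bin containing a genomic position."""
    return position // bin_size


def bins_for_range(start, end, bin_size):
    """Bin indices overlapped by the half-open genomic interval [start, end)."""
    if end <= start:
        return range(0)
    return range(bin_index(start, bin_size), bin_index(end - 1, bin_size) + 1)


# The number of bin nodes to visit grows linearly with the queried range,
# and shrinking the bins multiplies it further.
for bin_size in (1_000, 5_000):
    for span in (10_000, 1_000_000):
        visited = len(bins_for_range(0, span, bin_size))
        print(f"bin size {bin_size}, range {span}: {visited} bins")
```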

Biological knowledge graphs

While there exist multiple definitions of Knowledge Graphs (KGs) that depend on the application context ( 115 ), we can define them as large, heterogeneous knowledgebases modeled through graphs and ontologies, which derive new knowledge from existing datasets ( 116 ). KGs are enjoying renewed interest not only in academia but in industry as well ( 117 ). In addition to storing structured, contextual data, the principal reasons are the capability of obtaining new conclusions from existing data through reasoning ( 118 ), and the possibility of enriching machine-learning models by providing context and producing extra information through derived measures or embedding strategies ( 119–124 ). Lastly, advances in machine learning create new opportunities for automating the construction and exploitation of biological KGs ( 125 ). We summarize several platforms that, due to their broad integrative scopes, can be seen as Biological Knowledge Graphs.

The Monarch Initiative ( 85 ) is an ambitious endeavor that uses an ontology-based strategy to deeply integrate genotype–phenotype data from many species and sources, enabling computational interrogation of disease models and revealing complex genotype–phenotype relationships. Monarch employs RDF to ingest a variety of external data sources, modeling several complex data types and connecting entities from different databases. SciGraph is its central database engine, which provides the means to represent ontologies and data described using ontologies as a Neo4j graph. The resulting combined corpus of graphs, from ontologies and ingested data, constitutes the Monarch Knowledge Graph. The platform provides several means of data access for graph querying, application population and phenotype matching, as well as a web portal. The Monarch Web Portal exploits the graph to provide users with several powerful features, such as basic search, integrated information on entities of interest, search by phenotype profile, or text annotation.

Similarly, on a smaller scale, Pheno4J ( 75 ) provides a Java-based solution that loads annotated genetic variants and well-phenotyped patients into Neo4j. In order to build the database, Pheno4J requires user-generated files with the patient’s genetic variant and phenotype relations on the one hand, and both the Human Phenotype Ontology (HPO) ( 126 ) and a gene-to-HPO file on the other.

Focused on the analysis and discovery of comorbid diseases in humans, GenCoNet ( 127 ) proposes a semi-automatic pipeline that provides the import, fusion and analysis of stable disease, gene, variant and drug data in a Neo4j database, resulting in a KG for network analysis of gene–disease associations. The workflow consists of four concrete steps. The first step determines comorbidities of high interest and obtains Disease Ontology terms associated with genes. Secondly, the workflow obtains genes associated with disease variants from HPO, MalaCards ( 128 ), DisGeNet ( 129 ) and OMIM. The third step determines the gene controlled by eQTL and associated with the disease. Lastly, it finds the drugs, extracted from DrugBank ( 130 ), that target the genes and treat or are contraindicated for the disease. GenCoNet showcases the KG by employing network analysis to detect drug-induced diseases or contraindications of drugs.

We can also find hybrid approaches that utilize different database implementations to build the KG ( 131 ). Canevet et al. build on the Ondex software platform ( 132 ) and employ both triple stores and Neo4j, which supports gene-evidence graph patterns by making the KGs accessible via Cypher. The data integration is harmonized through the Bio-Knowledge Network Ontology (BioKNO), a lightweight and general ontology. Likewise, focused on bacterial whole-genome sequencing (WGS), Spfy ( 88 ) employs ontologies and different database paradigms to integrate disparate data sources and formats. Spfy primarily uses Blazegraph for storage along with MongoDB to cache a hash table for duplicate checking, arguing that this is more efficient than searching the graph structure. The graph allows retrospective comparisons across stored results as more genomes are sequenced or populations change.
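
The duplicate-checking idea can be sketched with a content hash used as a lookup key: a constant-time dictionary lookup replaces a search through the graph. How Spfy actually computes and stores its hashes is not detailed here, so the SHA-256 digest, the in-memory dictionary standing in for the cached table, and all names below are assumptions.

```python
import hashlib


class DuplicateChecker:
    """Detect repeated submissions by hashing their content."""

    def __init__(self):
        # Stand-in for the cached hash table: digest -> first genome id.
        self._seen = {}

    def register(self, genome_id, sequence_bytes):
        """Return the id of an identical earlier submission, or None if new."""
        digest = hashlib.sha256(sequence_bytes).hexdigest()
        existing = self._seen.get(digest)
        if existing is None:
            self._seen[digest] = genome_id
        return existing


checker = DuplicateChecker()
print(checker.register("genome-1", b"ACGTACGT"))  # first time seen
print(checker.register("genome-2", b"ACGTACGT"))  # same content as genome-1
```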

As mentioned before, ontological and semantic approaches have proved their utility in knowledge-intensive domains like the biomedical domain. Exploiting semantic and logic descriptions is natural for graph databases and triple stores and can be of great importance in KG implementations. In contrast to other similar efforts, BioGrakn ( 133 ) builds upon Grakn to deliver a KG with deductive reasoning capabilities. It employs almost the same data sources as BioGraphDB, but its model is designed through an ontology implemented in Graql, Grakn’s declarative, knowledge-oriented graph query language. In the same vein as the OWL and SWRL standards, Graql allows categorizing objects and relationships into distinct types, enabling inference and validation, which are used for searching genes linked to a particular Gene Ontology annotation, pathways linked to a particular gene, or finding all the upregulated differentially expressed (DE) miRNAs that also have validated mutations.

The body of literature shows several advantages when biomedical systems and applications employ a graph model in the storage layer. The graph model is especially useful for representing and accessing biological data because path-based queries are intuitive in biological networks and closer to real-world conceptualizations. RDF Schema or OWL bio-ontologies easily translate into a graph because they are already based on triples, which can be further expanded by identifying implied relations between classes through logical reasoning ( 134 ). Also, exploiting graph theory algorithms and subgraph-matching queries enables the inspection and discovery of patterns of interest within the graph structure. The schema-less/schema-optional nature of GDBMSs grants a high degree of flexibility in research settings, allowing applications to adapt and evolve quickly and to introduce abstraction and specialization of entities and the relations among them more easily. This adaptability eases data integration tasks, as we have seen in many of the integrative platforms.

Specialized, industry-ready GDBMSs are relatively new, and well-established biological systems are built upon conventional databases, typically RDBMSs. Relevant examples are the protein databases ( 135 ), which have to deal with millions of protein/complex interactions, as is the case for PPI databases ( 136 , 137 ). As described in the technical background, the underlying design of relational systems can lead to a trade-off between data integrity and performance. BioGRID ( 138 ), for instance, approaches this problem by utilizing a suite of tables specifically engineered to optimize query time while maintaining a structured normalized form that does not compromise fundamental design principles. Other relevant databases like DIP ( 139 ), IntAct ( 140 ) and STRING ( 141 ) maintain their relational model to fulfill the storage needs without further considerations concerning performance.

As seen in section 2 ( 43 ), Have and Jensen employed STRING as the use case to evaluate GDBMSs in biomedical settings and compare them against RDBMSs, generally finding better performance of the former in usual tasks in the context of PPI networks. In section 3, we also see that many applications integrate PPI databases by explicitly transforming protein entries into nodes and intermediate relationship tables directly into edges, reporting performance improvements with GDBMSs over RDBMSs in some of the reviewed works ( 45 , 87 ). Still, it is important to remark that redesigning the data storage/access layer usually involves a notable development effort, which may discourage research teams (usually short of human and economic resources). Since most of the protein databases are freely available, it would be relevant to compare their current implementation with a GDBMS implementation through formal benchmarks in that specific scenario, to determine whether such a development effort is justified.

There are limitations and potential issues of which developers need to be aware. While ontologies avoid designing specific problem-oriented data models and minimize reliability issues, they may increase the model’s complexity, jeopardizing performance and integration time. If more relaxed schema approaches are adopted, the main trade-offs are deciding when certain data items become nodes or attributes and keeping both model complexity and integrity under control. Regarding performance, comparative benchmarks and more ad hoc studies are quite heterogeneous and show disparate findings in some cases, making it challenging to identify a performance baseline that favors a concrete technology. Those focused on specific problems, like biological questions, report better GDBMS performance and qualitative features for managing networks ( 42 , 43 , 45 , 47 , 52 , 55 , 56 ). More formal benchmarks ( 50 , 52 ) report superior RDBMS results in several categories, especially for grouping, sorting, aggregating and set operations. However, in graph analytics workloads that mainly consist of multi-table joins, pattern matching or path identification, GDBMSs still perform better, and the gap widens as the size of the dataset increases. Yet, some benchmarks report problems when the graph is large. In the case of Neo4j, the number of edges to evaluate and the size of the subgraph pattern to match may become a performance pitfall. This situation requires GDBMSs to provide proper mechanisms, like node replication or partitioning, or to forgo features like schema-less design, as TigerGraph does. All in all, GDBMSs are not necessarily superior in all graph queries, and, like any development, the aims and operational context should dictate the technological choices.

From a development point of view, big projects naturally tend to adopt traditional relational databases because they require industry-level tools and libraries that ensure code quality and architectural features such as scalability, integration and standard design patterns. Both industry and communities back RDBMS implementations with reliable frameworks that ease their adoption with, for instance, database-to-object abstraction layers. However, at this point, many current GDBMS implementations also offer proper frameworks, programming interfaces and Object-Graph Mapping that fulfill such needs.

Another important consideration is the current lack of standardization of query languages and data access methods across GDBMS implementations at both syntactic and theoretical levels ( 142 ). Apache TinkerPop provides a high-level framework and the functional graph traversal language Gremlin, but not all GDBMSs integrate it, and this approach implies more coupling with the application code. Neo4j’s Cypher is a declarative language with similarities to common query languages and provides a clear graph path description syntax with full Create, Read, Update, Delete capabilities, making it one of the best solutions for graph querying. Cypher is the root of openCypher, a fully specified and open query language for property graph databases with more than 10 implementations across GDBMS solutions, even non-native ones like RedisGraph. TigerGraph follows a different approach with GSQL, another powerful graph query language. It maintains backward compatibility with SQL, imposing a strict schema declaration in the query definition, and the queries behave as stored procedures, consisting of multiple SELECT clauses and imperative instructions such as branches and loops. This design targets enterprise applications, where the concern is not the number and heterogeneity of external sources but size and performance, by optimizing storage format and query execution strategy, obtaining exciting results, as seen in Rusu and Huang ( 51 ). Fortunately, at the time of writing, the international committees that develop the SQL standard have voted to initiate Graph Query Language (GQL) and intend to develop a declarative graph query language that builds on the foundations of SQL and integrates proven ideas from the existing openCypher, Oracle’s PGQL, GSQL and G-CORE ( 143 ) languages, a move that ensures the future of GDBMSs.

We have seen different technologies come and go, and choosing a GDBMS that satisfies the requirements may also become a time-consuming task that can be seen as five steps or stages: problem analysis, requirements analysis, GDBMS analysis, benchmarking and GDBMS selection ( 144 ). Database ranking sites provide useful rankings, comparative tables and insights that help in the selection process. From what we have seen in the literature, Neo4j stands out in its adoption, and not only in the biomedical domain, mainly due to the powerful Cypher query language, decent performance and ease of implementation. We foresee that this lead will be less evident in the near future, given the number of competitive developments in the field.

In this work, we have followed the evolution and current landscape of GDBMSs, reviewed the bibliography looking for methods to evaluate their performance in different contexts and explored their applications in the biomedical domain. While RDBMSs and other NoSQL engines still provide better scalability options, more standardized query languages and more efficiency on typical data aggregation operations, most of the comparative analyses note that their performance suffers in densely connected datasets dominated by many-to-many relations. Scenarios with a significant volume of complex relationships may benefit from GDBMSs for the following reasons: (i) graphs provide more natural modeling of many-to-many relationships; (ii) graph-oriented query languages provide more intuitive means for writing complex network traversal and graph algorithm queries than table-oriented ones like SQL, which require joining tables explicitly and referencing columns; (iii) the schema-less/schema-optional design grants flexibility and (iv) in most situations, GDBMSs present higher performance for relationship-centric searches, like path traversals. These features yield several advantages for the biomedical domain, like easing the communication between domain experts, providing tools for discovering entities/clusters/patterns within the graph structure and facilitating data integration tasks, all of them very common when the investigation involves multiple sub-domains.
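Point (ii) can be illustrated with a minimal Python sketch, written for this text and not taken from any of the reviewed systems. It answers the same two-hop question twice: once in a table-oriented way, by self-joining an edge table, and once in a graph-oriented way, by following adjacency lists. The protein identifiers and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical many-to-many data: pairs of interacting proteins.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")]


def two_hop_by_join(rows, start):
    """Table-style: self-join the edge table on the shared middle column."""
    undirected = rows + [(b, a) for a, b in rows]
    return {
        c
        for a, b in undirected
        for b2, c in undirected
        if a == start and b == b2 and c != start
    }


def two_hop_by_traversal(rows, start):
    """Graph-style: follow adjacency lists two steps from the start node."""
    adjacency = defaultdict(set)
    for a, b in rows:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return {c for b in adjacency[start] for c in adjacency[b] if c != start}


print(two_hop_by_join(interactions, "P1"))
print(two_hop_by_traversal(interactions, "P1"))
```

Both functions return the same set. The join version has to mention the join condition explicitly for every extra hop, whereas the traversal version only adds one more step along the adjacency lists.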

GDBMS technology is rapidly evolving to tackle scalability and similar operational weaknesses, offering a wide range of reliable choices to support the storage layer for either small prototypes or large, production-ready projects. The collection of described use cases and author experiences provides evidence that GDBMSs are well suited to biomedical data, as an individual storage system or as part of a hybrid, partitioned architecture. Moreover, by providing direct access to a graph model, recent GDBMSs enable the use of graph algorithms and analytics in a very transparent way, improving hypothesis generation and testing.

Contributor Information

Santiago Timón-Reina, Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain.

Mariano Rincón, Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain.

Rafael Martínez-Tomás, Departamento de Inteligencia Artificial, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal, 16 Ciudad Universitaria, Madrid 28040, Spain.

This work was supported by the Norwegian Research Council (Dementia Disease Initiation) [grant number 217780]; Helse Sør-øst, NASATS (Dementia Disease Initiation) under [grant number 2013131]; Research project PID2019-110686RB-I00 of the State Research Program Oriented to the Challenges of Society.

Conflict of interest.

None declared.

Introducing Graph Databases For Dummies

Jim Webber & Rik Van Bruggen, Chief Scientist & Regional Vice President. Sep 16, 2020. 7 mins read.

Check out highlights of Chapter 1.

Introducing Graph Databases

Exploring graph database basics, understanding who uses graph databases and why, seeing the benefits of graph databases, explaining labeled property graphs.

Remember! In the labeled property graph model, we use naming conventions to distinguish elements at a glance:

  • Node labels are PascalCase. Every word starts with an uppercase letter, with no spaces.
  • Relationships are SNAKE_CASE_ALL_CAPS. Replace all the spaces with an underscore character and convert all the letters to capitals.
  • Properties on nodes and relationships are snake_case. Replace all spaces with an underscore character and lowercase all the words.
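The three conventions above can be sketched as small Python helpers. This is an illustration only: the function names (node_label, relationship_type, property_key) are hypothetical and are not part of any Neo4j library.

```python
import re


def _words(text):
    # Split on anything that is not a letter or digit and drop empty pieces.
    return [word for word in re.split(r"[^A-Za-z0-9]+", text) if word]


def node_label(text):
    """PascalCase: every word capitalized, no separators."""
    return "".join(word[:1].upper() + word[1:].lower() for word in _words(text))


def relationship_type(text):
    """SNAKE_CASE_ALL_CAPS: underscores between words, all capitals."""
    return "_".join(word.upper() for word in _words(text))


def property_key(text):
    """snake_case: underscores between words, all lowercase."""
    return "_".join(word.lower() for word in _words(text))


print(node_label("credit card"))      # CreditCard
print(relationship_type("works in"))  # WORKS_IN
print(property_key("First Name"))     # first_name
```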

Defining Nodes

Explaining relationships, enforcing constraints, climbing the graph learning curve, going all-in on graphs.

Graph Database Use Cases

One of the primary advantages of using a graph database is the ability to present the relationships that exist between datasets and files. Much of the data is connected, and graph database use cases are increasingly helping to find and explore these relationships and develop new conclusions. Additionally, graph databases are designed for quick data retrieval. 

For highly connected data, graph databases offer a much faster and more intuitive method of modeling and querying than traditional relational databases do.

Algorithms can be used when analyzing graphs. They can explore the paths and distances between vertices, the clustering of vertices, and the relevance of the vertices. The algorithms often examine incoming edges and the importance of neighboring vertices. 

Applying algorithms to graphs allows researchers to use pattern recognition, machine learning, and statistical analysis. When massive amounts of data are processed, this approach provides a more efficient analysis.
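As one illustration of exploring paths and distances between vertices, here is a minimal breadth-first search in Python. It is a sketch over a hypothetical five-vertex graph stored as an adjacency dictionary, not code from any particular graph database.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Return the number of edges on the shortest path from source to each reachable vertex."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[vertex] + 1
                queue.append(neighbor)
    return distances


# Hypothetical directed graph: each key lists the vertices it points to.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(hop_distances(graph, "A"))
```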

In a DATAVERSITY® interview, Gaurav Deshpande, vice president of marketing for TigerGraph, said,

“Whenever customers ask me about graph databases, I keep it very simple. When you hear the word ‘graph,’ graph is equal to ‘relationship.’ So, any time you are trying to do analysis of relationships, that’s where you should use the graph database. And given that all of us are increasingly more connected to each other – both as people and as organizations, as entities – it just makes sense that graph databases would become more prominent and more important as time goes by.”

Graph databases are designed to store relationships, so algorithms and queries can complete their tasks in under a second rather than in minutes or hours. Users aren’t required to perform countless joins, and machine learning and data analytics operate more efficiently. While not known for being user-friendly, graph databases tend to operate more efficiently than SQL systems on relationship-heavy queries.

The Two Types of Data Graphs

There are two basic types of data graphs: property graphs and RDF graphs. The property graph focuses on analytics and querying, while the RDF graph emphasizes data integration. Both forms of graph are made up of points (vertices) and the connections between those points (edges). However, there are several differences.

Property graphs are used to model relationships between the data. They support query and data analytics based on these relationships. A property graph’s vertices can contain detailed information on a subject, while the edges express relationships between the vertices.

The Resource Description Framework (RDF) model is designed to represent statements. A statement contains three elements: two vertices that are connected by an edge. Each vertex and edge is identified by a Uniform Resource Identifier (URI). The RDF model offers a way to publish the data using a standardized format with well-defined semantics. Pharmaceutical businesses, health care companies, and government agencies working with statistics are examples of organizations that have begun using RDF graphs.

RDF graphs are especially useful for showing master data (that is, essential data such as names, addresses, and phone numbers that provide context for transactions) and complex metadata. RDF graphs are commonly used to express complex ideas in a domain, or when circumstances require rich semantics.
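The three-element statement described above can be sketched in plain Python without an RDF library. The namespace http://example.org/ and the people, companies, and predicates below are hypothetical, and every element is modeled as a URI string for simplicity.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str    # first vertex
    predicate: str  # edge
    obj: str        # second vertex


EX = "http://example.org/"  # hypothetical namespace

statements = [
    Statement(EX + "John", EX + "worksIn", EX + "ACME"),
    Statement(EX + "John", EX + "livesIn", EX + "Austin"),
    Statement(EX + "Mary", EX + "worksIn", EX + "ACME"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the statements that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Who works in ACME?
for statement in match(statements, predicate=EX + "worksIn", obj=EX + "ACME"):
    print(statement.subject)
```

Pattern matching with wildcards of this kind is the basic idea behind RDF query languages, although real systems add indexing, literals, and inference on top.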

Because SQL databases and graph databases have significantly different designs, each comes with its own strengths and weaknesses. Graph databases can be used to resolve a variety of problems. Below are just a few popular graph database use cases.

Detecting Bank Fraud:  One form of bank fraud is called “mule fraud,” and involves a person who is called the “money mule.” This person transfers or deposits money into their own account, and then the money is transferred to a partner in the scam, who is often in another country. 

Traditional SQL systems will create alerts regarding suspicious accounts, which are then flagged by a human. Unfortunately, because of the limited information SQL systems communicate about these accounts, questionable behavior can go unrecognized.

Often these accounts will share similar information (addresses and telephone numbers) that is required for opening the accounts. While criminals may use two or three names, they typically use one phone number and one mailing address. With graph-based queries, bank security can quickly identify accounts with the same phone numbers, addresses, or similar connections, and flag them for further investigation.

This method can use machine learning models that have been trained to identify money mules and their fraud behaviors.
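The shared-attribute check described above can be sketched as follows. This is an in-memory illustration with made-up account records, not a bank's actual method: accounts are linked whenever they share a phone number or address, and each connected group of two or more accounts is returned for investigation.

```python
from collections import defaultdict

# Hypothetical account records.
accounts = [
    {"id": "acct-1", "name": "A. Smith", "phone": "555-0101", "address": "1 Main St"},
    {"id": "acct-2", "name": "B. Jones", "phone": "555-0101", "address": "9 Oak Ave"},
    {"id": "acct-3", "name": "C. Brown", "phone": "555-0199", "address": "9 Oak Ave"},
    {"id": "acct-4", "name": "D. White", "phone": "555-0142", "address": "3 Elm Rd"},
]


def shared_attribute_rings(records, keys=("phone", "address")):
    """Group accounts that are connected through shared attribute values."""
    # Attribute value -> accounts that use it (the edges of the graph).
    holders = defaultdict(set)
    for record in records:
        for key in keys:
            holders[(key, record[key])].add(record["id"])

    # Union-find over accounts: merge every pair that shares a value.
    parent = {record["id"]: record["id"] for record in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in holders.values():
        first, *rest = sorted(ids)
        for other in rest:
            parent[find(other)] = find(first)

    rings = defaultdict(set)
    for account_id in parent:
        rings[find(account_id)].add(account_id)
    return [ring for ring in rings.values() if len(ring) > 1]


print(shared_attribute_rings(accounts))
```

Note that acct-1 and acct-3 share nothing directly; they end up in the same group only through acct-2, which is the kind of indirect connection a graph query surfaces.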

Customer Marketing: A key aspect of marketing is determining what the customer wants. In a data-driven business environment, marketers study the relationships customers have with each other and with various products, as well as the relationships that exist between different products. For example, an individual purchases a pregnancy test and, the next day, purchases three books on how to have a healthy baby from the same store. Such relationships help marketers determine what the customers want. Marketers attempt to offer the customers what they want before they have purchased it, with the goal of making a profit.

Today, many companies have collected the following information about their customers.

  • Master data:  age, name, gender, and address
  • Customer research:  web click streams, traffic lines, call logs, etc.
  • Transaction history:  purchases, purchase time, types of purchases
  • Customer predictions:  purchase histories, search histories, cart abandonment, and social media profiles

While many businesses collect this information, they often are unable to use it comprehensively, because the data is not interconnected. However, this data can be integrated using graph technology, allowing researchers to view all the information surrounding a customer. 

With the use of graphs, marketers can develop a better understanding of their customers and the customers’ relationships with each other and with various products.

After identifying relationships the customers have with each other, and with purchased products, the graph researchers can run algorithms that provide more finely tuned predictions about the customer.

Data Lineage:  As data continues to grow in volume, managing it while ensuring data privacy and compliance with laws and regulations has become increasingly difficult. Data can be extremely difficult to track, and locating the source of unwanted changes can also be difficult. Discovering what data is stored in each database as it is moved around and transformed can be extremely problematic.

Graph databases are excellent for tracking data lineage. The data’s life cycle moves through a variety of steps, and graph databases can follow it, vertex by vertex, by tracking the edges. With graphs, it is possible to see how the information was used, where it was copied, and its original source.
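Following lineage edges back to the original source can be sketched like this. The dataset names are hypothetical, and the lineage is held in a plain dictionary instead of a graph database.

```python
# Each dataset maps to the datasets it was derived from (hypothetical names).
derived_from = {
    "quarterly_report": ["sales_clean"],
    "sales_clean": ["sales_raw", "currency_rates"],
    "sales_raw": [],
    "currency_rates": [],
}


def trace_sources(lineage, dataset):
    """Walk the edges backwards and return the original sources of a dataset."""
    sources, seen, stack = set(), set(), [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents:
            # Nothing upstream: this vertex is an original source.
            sources.add(current)
        stack.extend(parents)
    return sources


print(trace_sources(derived_from, "quarterly_report"))
```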

Manufacturing Traceability:  Manufacturers find traceability to be a very useful process. For example, a flashlight manufacturer might need to issue a recall on a flashlight model because it has a defective component that was purchased from multiple sources. But locating the source of the problem and the specific flashlights affected can be a challenge.

Many manufacturing companies use a production database that manages the product’s lot information, but they also have a retail database, a purchase database, and a shipping database. This complicated situation makes all the relevant information hard to find and organize.

A graph database is ideal for connecting all the relationships, and graph algorithms can be used to highlight the connections and relevant information.

Criminal Investigations:  Graph databases have recently been used to revolutionize criminal activity analysis. This is generally not used for small, opportunistic crimes, but for crimes involving many interconnected people, businesses, gangs, and locations. 

Graphs can provide an efficient way of identifying criminals and their networks. Graph-based algorithms (such as PageRank, a centrality measure) can be used to discover insights regarding locations, look for important people, and identify potential criminal gangs. Researchers can find the “weakest link” in the graph, meaning the vertex that holds the graph together. If that vertex is removed, the graph as a whole may fall apart. Such a vertex does not indicate a problem with the data; it indicates that the linchpin of a criminal organization has been found.
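The “weakest link” idea corresponds to what graph theory calls a cut vertex or articulation point. The sketch below finds such vertices by brute force: it removes each vertex in turn and checks whether the rest of the graph is still connected. This is not PageRank, and the network of names is hypothetical; for large graphs a linear-time algorithm would be the better choice.

```python
def is_connected(adjacency, removed=None):
    """Check whether the graph stays in one piece after dropping one vertex."""
    vertices = [v for v in adjacency if v != removed]
    if not vertices:
        return True
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(vertices)


def linchpins(adjacency):
    """Vertices whose removal disconnects an otherwise connected graph."""
    return sorted(v for v in adjacency if not is_connected(adjacency, removed=v))


# Hypothetical undirected network: two tight groups joined through cat and dan.
network = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat", "eve", "fay"},
    "eve": {"dan", "fay"},
    "fay": {"dan", "eve"},
}
print(linchpins(network))
```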

The Graph Database Mission

The mission of graph databases and graph database use cases is to provide an understanding of the relationships that exist between data elements, offering analytics that can identify business opportunities and support a foundation for AI/ML projects. Graph databases are one of the most significant innovations to evolve from NoSQL databases: they store the relationships between data objects inside the objects themselves, in turn supporting analytics that are almost impossible to produce with other databases.

Ideally, graph databases will work alongside a SQL database – which is still the data workhorse of choice for most organizations.

Graph Databases Use Cases

“Big data” grows bigger every year, but today’s enterprise leaders need to do more than manage larger volumes of data: they critically need to generate insight from their existing data. Businesses need to stop merely collecting data points and start connecting them. In other words, the relationships between data points matter almost more than the individual points themselves. In order to leverage those data relationships, your organization needs a database technology that stores relationship information as a first-class entity. That technology is a graph database.

While traditional relational databases have served the industry well in the past by enabling service and process models that handle these complexities, in most deployments they still demand significant overhead and expert levels of administration to adapt to change. Relational databases require cumbersome indexing when faced with the non-hierarchic relationships that are becoming ever more prevalent in complex IT ecosystems, with partners, suppliers and service providers, as well as the more dynamic infrastructures associated with cloud and agile.

Unlike relational databases, graph databases are designed to store interconnected data that’s not purely hierarchic. They make it easier to make sense of that data by not forcing intermediate indexing at every turn, and they make it easier to evolve models of real-world infrastructures, business services, social relationships, or business behaviors that are both fluid and multi-dimensional.

Use Case: Master Data Management

The world of master data is changing. Data architects and application developers are swapping their relational databases with graph databases to store their master data. This switch enables them to use a data store optimized to discover new insights in existing data, provide a 360-degree view of master data and answer questions about data relationships in real time.

Your Master Data Management (MDM) program likely uses the same database technology as your transactional application: a mature, highly-tuned, relational database (RDBMS). You excel at relational databases because you have many years of experience working with them and most of your data live there, so it makes sense to keep master data there. Traditionally, MDM has included Customer, Product, Accounts, Vendor, Partners and any other highly shareable data in an enterprise.

Master data is, by definition, highly shared, so any struggle to manage it tends to cost business agility in a way that ripples throughout the organization. Our architectures have focused on getting data to fit a single definition of the truth, something most of us realize is not a feasible solution in the long run.

The Future of Master Data Management

MDM programs that attempted to physically persist all data in one single location continue to struggle with the realities of modern information technology. Most enterprise organizations use vendor applications: customer relationship management (CRM) systems, work management systems, accounts payable, accounts receivable, point-of-sale systems, etc. Due to this approach, it’s not always feasible to move all master data to a single location. Even with a CRM system in place, we typically end up with customer information maintained in several systems. The same goes for product and accounting data as well.

The most successful programs will not strive to find a single physical location for all data but will provide the standards, tools, and services necessary to provide a consistent vision of enterprise data. There will be data we can store in one place, using the technologies that best fit its data story. Data will also likely be found in multiple physical systems due to the increasing use of packaged applications, as well as for performance and geographically distributed processing needs. Once we understand our environment, we can architect solutions that build upon those needs.

The future of master data management will derive value from data and its relationships to other data. MDM will be about supplying consistent, meaningful views of master data. In many cases, we will be able to unify data into one location, especially to optimize for query performance and data fit. Graph databases offer exactly that type of data/performance fit, as we will see below. In this paper, we discuss why your master data is a graph and how graph databases like Neo4j are the best technologies for master data.

Today’s enterprises are drowning in “big data” – most of which is mission-critical master data – and managing its complex relationships can be quite a challenge. Here are some of the most difficult hurdles in MDM that enterprises must face:

Complex and hierarchical data sets

Master data, such as organizational and product data, has deep hierarchies with top-down, lateral and diagonal connections. Managing such data models with a relational database results in complex and unwieldy code that is slow to run, expensive to build and time-consuming to maintain.

Real-time query performance

Master data systems must integrate with and provide data to a host of applications within the enterprise – sometimes in real time. However, traversing a complex and highly interconnected dataset to provide real-time information is a challenge.

Dynamic structure

Master data is highly dynamic with constant addition and re-organization of nodes, making it harder for your developers to design systems that accommodate both current and future requirements.

The best data-driven business decisions aren’t based on stale information silos. Instead, you need real-time master data with information about data relationships. Graph databases are built from the ground up to support data relationships. With more efficient modeling and querying, organizing your master data in a graph yields relevant answers faster and with more flexibility than ever before.

Use Case: Network and IT Operations

By their nature, networks are graphs. Graph databases are, therefore, an excellent fit for modeling, storing and querying network and IT operational data no matter which side of the firewall your business is on – whether it’s a communications network or a data center. Today, graph databases are being successfully employed in the areas of telecommunications, network management, impact analysis, cloud platform management and data center and IT asset management. In all of these domains, graph databases store configuration information to alert operators in real time to potential shared failure modes in the infrastructure and to reduce problem analysis and resolution times from hours to seconds.

Network analysts and data center professionals face greater challenges than ever before as the volume of data and the size of networks continue to grow. Here are just a few of their most difficult challenges:

Highly interrelated elements

Whether you’re managing a major network change; providing more efficient security-related access; or optimizing a network, application infrastructure or data center – the physical and human interdependencies are extremely complex and challenging to manage.

Non-linear and non-hierarchical relationships

Relationships among the various nodes in your network are neither purely linear nor hierarchical, making them difficult to model with a traditional RDBMS. In addition, when two or more systems are brought together, these relationships become even more complex to describe.

Growing physical and virtual nodes

With the rapid growth in network sizes and both the number and types of elements added to support new network users and services, your IT organization must develop systems that accommodate both current and future requirements.

As with master data, a graph database is used to bring together information from disparate inventory systems, providing a single view of the network and its consumers – from the smallest network element all the way to the applications, services and customers who use them. A graph representation of a network enables IT managers to catalog assets, visualize their deployment and identify the dependencies between the two. The graph’s connected structure enables network managers to conduct sophisticated impact analyses, answering questions like:

• Which parts of the network – which applications, services, virtual machines, physical machines, data centers, routers, switches and fiber – do particular customers depend on? (Top-down analysis)

• Conversely, which applications and services, and ultimately, customers in the network will be affected if a particular network element – such as a router or switch – fails? (Bottom-up analysis)

• Is there redundancy throughout the network for the most important customers?
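The first two questions can be sketched as reachability queries over a dependency graph. The example below is a minimal in-memory illustration with hypothetical element names; it does not cover the redundancy question, which would require comparing alternative paths.

```python
from collections import defaultdict

# (dependent, dependency) pairs; all names are hypothetical.
depends_on = [
    ("customer-a", "billing-app"),
    ("customer-b", "crm-app"),
    ("billing-app", "vm-1"),
    ("crm-app", "vm-2"),
    ("vm-1", "router-1"),
    ("vm-2", "router-2"),
]


def reachable(edges, start):
    """Return every vertex that can be reached from start by following edges."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    seen, stack = set(), [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def dependencies_of(element):
    """Top-down analysis: everything the element relies on."""
    return reachable(depends_on, element)


def impacted_by(element):
    """Bottom-up analysis: everything affected if the element fails."""
    return reachable([(b, a) for a, b in depends_on], element)


print(dependencies_of("customer-a"))
print(impacted_by("router-2"))
```

The bottom-up query reuses the same traversal on reversed edges, which is why a single graph model can serve both directions of impact analysis.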

A graph database representation of the network can also be used to enrich operational intelligence based on event correlations. Whenever an event correlation engine (such as a Complex Event Processor) infers a complex event from a stream of low-level network events, it assesses the impact of that event against the graph model and triggers any necessary compensating or mitigating actions.

Discovering, capturing and making sense of complex interdependencies is central to effectively running network and IT operations, which are a critical part of running an enterprise. Whether it’s optimizing a network or application infrastructure or providing more efficient security-related access, these problems involve a complex set of physical and human interdependencies that are a challenge to manage. The relationships between network and infrastructure elements are rarely linear or purely hierarchical. Graph databases are designed to store that interconnected data, making it easy to translate network and IT data into actionable insights.

Use Case: Real-Time Recommendation Engine

Graph databases have revolutionized the way people discover new products, services, information, and people. Recommendation engines powered by graph databases help companies personalize products, content, and services by leveraging the connections between data — all in real time.

Whether your enterprise operates in the retail, social, services or media sector, offering your users highly targeted, real-time recommendations is essential to maximizing customer value and staying competitive. Unlike other business data, recommendations must be inductive and contextual to be considered relevant by your end consumers.

With a graph database, you capture a customer’s browsing behavior and demographics and combine those with their buying history to instantly analyze their current choices and then immediately provide relevant recommendations – all before a potential customer clicks to a competitor’s website.

With so much data to track and process in a short amount of time, creating a recommendation engine capable of relevant, real-time suggestions isn’t easy. Here are some of the biggest challenges involved:

Process large amounts of data and relationships for context

Collaborative and content-based filtering algorithms rely on rapid traversal of a continually growing and highly interconnected dataset.

Offering relevant recommendations in real time

The power of a suggestion system lies in its ability to make a recommendation in real time using immediate purchase or browsing history.

Accommodate new data and relationships continually

The rapid growth in the size and number of data elements means your suggestion system needs to accommodate both current and future requirements.

Real-time recommendation engines provide a key differentiating capability for enterprises in retail, logistics, recruitment, media, sentiment analysis, search and knowledge management.

The key technology in enabling real-time recommendations is the graph database. Graph databases also outclass other database technologies for connecting masses of buyer and product data (or connected data in general).

Making effective real-time recommendations depends on a database that understands the relationships between entities, as well as the quality and strength of those connections.

Only a graph database efficiently tracks these relationships according to user purchase history, interactions and reviews to give you the most meaningful insight into customer needs and product trends.

Graph-powered recommendation engines can take two major approaches:

• Identifying resources of interest to individuals

• Identifying individuals likely to be interested in a given resource

With either approach, graph databases make the necessary correlations and connections to serve up the most relevant results for the individual or resource in question.
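
Both approaches can be sketched as short traversals over customer-to-product purchase edges. The code below is an illustrative, in-memory approximation in Python, not a graph database query; the `PURCHASES` data and both function names are hypothetical, and the scoring (counting shared purchases) is one simple choice among many.

```python
from collections import Counter

# Hypothetical purchase edges: customer -> set of products bought.
PURCHASES = {
    "ann": {"tent", "stove"},
    "bob": {"tent", "stove", "lantern"},
    "cat": {"tent", "lantern", "boots"},
    "dan": {"novel"},
}


def recommend_products(purchases, customer):
    """Resources of interest to an individual: rank products bought by
    customers who share at least one purchase with `customer`."""
    own = purchases[customer]
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(own & items)  # strength of the connection
        if overlap:
            for product in items - own:
                scores[product] += overlap
    return scores.most_common()


def likely_buyers(purchases, product):
    """Individuals likely to be interested in a resource: rank non-buyers
    by how many purchases they share with existing buyers."""
    buyers = {c for c, items in purchases.items() if product in items}
    scores = Counter()
    for candidate, items in purchases.items():
        if candidate in buyers:
            continue
        for buyer in buyers:
            scores[candidate] += len(items & purchases[buyer])
    return [(c, s) for c, s in scores.most_common() if s > 0]
```

With this sample data, `recommend_products(PURCHASES, "ann")` ranks the lantern first because both customers who overlap with Ann bought it, and `likely_buyers(PURCHASES, "lantern")` returns Ann as the only candidate.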

Storing and querying recommendation data using a graph database allows your application to provide real-time results rather than precalculated, stale data. As consumer expectations increase and their patience decreases, providing these sorts of relevant, real-time suggestions will become a greater competitive advantage than ever before.

Use Case: Fraud Detection

Banks and insurance companies lose billions of dollars every year to fraud.

Traditional methods of fraud detection fail to minimize these losses since they perform discrete analyses that are susceptible to false positives (and false negatives). Knowing this, increasingly sophisticated fraudsters have developed a variety of ways to exploit the weaknesses of discrete analysis.

Graph databases, on the other hand, offer new methods of uncovering fraud rings and other complex scams with a high level of accuracy through advanced contextual link analysis, and they are capable of stopping advanced fraud scenarios in real time.

Between the enormous amounts of data available for analysis and today’s experienced fraud rings (and solo fraudsters), fraud detection professionals are beset with challenges. Here are some of their biggest:

Complex link analysis to discover fraud patterns

Uncovering fraud rings requires you to traverse data relationships with high computational complexity – a problem that’s exacerbated as a fraud ring grows.

Detect and prevent fraud as it happens

To prevent a fraud ring, you need real-time link analysis on an interconnected dataset, from the time a false account is created to when a fraudulent transaction occurs.

Evolving and dynamic fraud rings

Fraud rings are continuously changing in shape and size, and your application needs to detect these fraud patterns in this highly dynamic and emerging environment.

While no fraud prevention measures are perfect, significant improvements occur when you look beyond individual data points to the connections that link them. Understanding the relationships between data, and deriving meaning from these links doesn’t necessarily mean gathering new data. You can draw significant insights from your existing data simply by reframing the problem in a new way: as a graph.

Unlike most other ways of looking at data, graphs are designed to express relatedness. Graph databases uncover patterns that are difficult to detect using traditional representations such as tables. An increasing number of companies use graph databases to solve a variety of connected data problems, including fraud detection.

Now let’s consider how graph databases can help solve this problem. Uncovering rings with traditional relational database technologies requires modeling the graph above as a set of tables and columns and then carrying out a series of complex joins and self-joins. Such queries are very complex to build and expensive to run. Scaling them in a way that supports real-time access poses significant technical challenges, with performance becoming exponentially worse not only as the size of the ring increases but as the total data set grows.

Graph databases have emerged as an ideal tool for overcoming these hurdles. Languages like Cypher provide simple semantics for detecting rings in the graph, navigating connections in memory and in real time.

The graph data model in Diagram 4 below represents how the data looks to the graph database and illustrates how one can find rings by simply walking the graph.
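
To make "walking the graph" concrete, here is a minimal Python sketch under one stated assumption: a potential ring shows up as account holders who are linked through shared identifiers such as a phone number or address. The `HOLDERS` data and the `find_rings` helper are hypothetical; a graph database would express the same walk as a pattern-matching query.

```python
from collections import defaultdict

# Hypothetical account holders and the contact details they registered.
HOLDERS = {
    "h1": {"phone:555-0101", "addr:12 Oak St"},
    "h2": {"phone:555-0101", "addr:9 Elm St"},
    "h3": {"phone:555-0199", "addr:9 Elm St"},
    "h4": {"phone:555-0142", "addr:3 Pine Rd"},
}


def find_rings(holders, min_size=2):
    """Group holders that are connected through shared identifiers."""
    # Build the bipartite graph: identifier -> holders that use it.
    by_identifier = defaultdict(set)
    for holder, identifiers in holders.items():
        for ident in identifiers:
            by_identifier[ident].add(holder)

    seen, rings = set(), []
    for start in holders:
        if start in seen:
            continue
        # Walk holder -> identifier -> holder until nothing new turns up.
        component, stack = set(), [start]
        while stack:
            holder = stack.pop()
            if holder in component:
                continue
            component.add(holder)
            for ident in holders[holder]:
                stack.extend(by_identifier[ident] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


rings = find_rings(HOLDERS)
```

Here `h1` and `h3` never share an identifier directly; they are grouped only because both are linked to `h2`. Shared details are not proof of fraud on their own, so a real system would treat such groups as candidates for further scoring or review.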

Augmenting one’s existing fraud detection infrastructure to support ring detection can be done by running appropriate entity link analysis queries using a graph database, and running checks during critical stages in the customer & account lifecycle, such as:

1. at the time the account is created,

2. during an investigation,

3. as soon as a credit balance threshold is hit, or

4. when a check is bounced.

Real-time graph traversals tied to the right kinds of events can help banks identify probable fraud rings during or even before the bust-out occurs.

Whether it is bank fraud, insurance fraud, e-commerce fraud, or another type of fraud, two points are very clear. The first is the importance of detecting fraud as quickly as possible so that criminals can be stopped before they have an opportunity to do too much damage. As business processes become faster and more automated, the time margins for detecting fraud are becoming narrower and narrower, increasing the call for real-time solutions.

The second is the value of connected analysis. Sophisticated criminals have learned to attack systems where they are weak. Traditional technologies, while still suitable and necessary for certain types of prevention, are not designed to detect elaborate fraud rings. This is where graph databases can add value.

Graph databases are a strong enabler for efficient and manageable fraud detection solutions. From fraud rings and collusive groups to professional criminals operating on their own, graph databases provide a unique ability to uncover a variety of significant fraud patterns in real time. Collusion that was previously hidden becomes evident when examined with a system designed to manage connected data, using real-time graph queries as a powerful tool for detecting a variety of high-impact fraud scenarios.

Use Case: Social Network

Whether you’re leveraging declared social connections or inferring relationships based on activity, graph databases offer a world of fresh possibility when it comes to creating innovative social networks.

Here are some of the biggest challenges:

Highly dynamic networks

Social networks change and evolve quickly, so your application must be able to detect early trends and adapt accordingly.

High density of connections

Social networks are densely connected and become more so over time, requiring you to parse this relationship data quickly for better business insights.

Relationships are equally important

When you’re striving to understand user behavior in social networks, relationships between users are as important as the individual users themselves. Your social network application must be able to process data relationships as quickly as it processes individual data entities.

Complex queries

Navigating a social graph and understanding both individuals and their relationships requires complex and deep queries. These particular queries bring most relational databases to their knees. Likewise, other types of NoSQL databases struggle to handle high degrees of relatedness. Graph databases are both easy and quick at traversing relationships, and they return instantaneous query results, making them an ideal choice for your social application.
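
The kind of deep query described here, such as listing 1st-, 2nd- and 3rd-degree connections or mutual contacts, is a bounded breadth-first traversal. The sketch below uses a small hypothetical `FRIENDS` graph held in a Python dictionary; it illustrates the traversal itself, not how any particular social network implements it.

```python
from collections import deque

# Hypothetical undirected friendship graph.
FRIENDS = {
    "ada": {"ben", "cy"},
    "ben": {"ada", "dee"},
    "cy": {"ada", "dee"},
    "dee": {"ben", "cy", "eli"},
    "eli": {"dee"},
}


def connections_by_degree(graph, person, max_degree=3):
    """Return {degree: set of people} for everyone within max_degree hops."""
    distance = {person: 0}
    queue = deque([person])
    while queue:
        current = queue.popleft()
        if distance[current] == max_degree:
            continue  # do not expand beyond the requested depth
        for friend in graph[current]:
            if friend not in distance:
                distance[friend] = distance[current] + 1
                queue.append(friend)
    by_degree = {}
    for name, hops in distance.items():
        if hops > 0:
            by_degree.setdefault(hops, set()).add(name)
    return by_degree


def mutual_friends(graph, a, b):
    """People directly connected to both a and b."""
    return graph[a] & graph[b]
```

In a relational schema, each extra degree typically adds another self-join on the friendship table, whereas here it is one more step of the same loop.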

When exploring relational database options, it became clear that a graph database was a better and safer choice for this project. One important factor was the so-called impedance mismatch. The data and queries were clearly graph-oriented, and it was clear that “bending” the data into a tabular format would incur significant project cost and performance overhead. A graph database solution was able to meet both operational and analytic requirements. Graph databases fit the use case much better than relational databases because they are a natural fit for the social domain.

Use Case: Identity and Access Management

Identity and access management (IAM) solutions store information about parties (e.g., administrators, business units, end-users) and resources (e.g., files, shares, network devices, products, agreements), along with the rules governing access to those resources. IAM solutions apply these rules to determine who can or can’t access or manipulate a resource. Traditionally, identity and access management has been implemented either by using directory services or by building a custom solution inside an application’s backend. Hierarchical directory structures, however, can’t cope with the complex dependency structures found in multi-party distributed supply chains. Custom solutions that use non-graph databases to store identity and access data become slow and unresponsive as their datasets grow.

A graph database can store complex, densely connected access control structures spanning billions of parties and resources. Its richly and variably structured data model supports both hierarchical and non-hierarchical structures, while its extensible property model allows for capturing rich metadata regarding every element in the system.

With a query engine that can traverse millions of relationships per second, graph database access lookups over large, complex structures execute in milliseconds, not minutes or hours.

As with network and IT operations, a graph database access control solution allows for both top-down and bottom-up queries:

• Which resources – company structures, products, services, agreements and end users – can a particular administrator manage? (Top-down)

• Given a particular resource, who can modify its access settings? (Bottom-up)

• Which resource can an end-user access?
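
These questions can be sketched as traversals over two kinds of edges: membership edges that nest parties inside groups, and access edges that grant groups rights on resources. The structures `MEMBER_OF` and `CAN_ACCESS` and the function names below are hypothetical, chosen only to illustrate the top-down and bottom-up directions in plain Python.

```python
# Hypothetical access graph.
# MEMBER_OF edges nest parties inside groups (groups may nest further).
MEMBER_OF = {
    "alice": ["admins"],
    "bob": ["support"],
    "support": ["staff"],
    "admins": ["staff"],
}
# CAN_ACCESS edges grant a group access to resources.
CAN_ACCESS = {
    "staff": {"wiki"},
    "support": {"tickets"},
    "admins": {"billing", "tickets"},
}


def groups_of(party):
    """All groups a party belongs to, directly or through nesting."""
    found, stack = set(), [party]
    while stack:
        for group in MEMBER_OF.get(stack.pop(), []):
            if group not in found:
                found.add(group)
                stack.append(group)
    return found


def resources_for(party):
    """Top-down: which resources can this party reach?"""
    resources = set(CAN_ACCESS.get(party, set()))
    for group in groups_of(party):
        resources |= CAN_ACCESS.get(group, set())
    return resources


def parties_for(resource, parties):
    """Bottom-up: which of the given parties can reach this resource?"""
    return sorted(p for p in parties if resource in resources_for(p))
```

The bottom-up helper here simply re-runs the top-down walk per party for brevity; a graph database would instead follow the same edges in reverse, starting from the resource.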

Access control and authorization solutions powered by graph databases are particularly applicable in the areas of content management, federated authorization services, social networking preferences and software as a service (SaaS) offerings, where they realize minutes-to-milliseconds increases in performance over their relational database predecessors.

Today’s enterprise data professionals face greater challenges than ever before when it comes to storing and managing user identities and authorization. Not only must data architects deal with user access fraud, but they also must manage all of these changing relationships in real time. Here are some of their biggest challenges:

Highly interconnected identity and access permissions data

To verify an accurate identity and its access permissions, the system needs to traverse through a highly interconnected dataset that is growing in size and complexity.

Productivity and customer satisfaction

As users, products and permissions grow, traditional systems no longer deliver responsive query performance, resulting in diminished user experience and frustration for users.

Dynamic structure and environment

With the rapid growth in the size of users and their associated metadata, your application needs to accommodate both current and future identity management requirements.

For your enterprise organization, managing multiple changing roles, groups, products and authorizations is an increasingly complex task. Relational databases often struggle with identity and access workloads because deeply nested queries become slow and unresponsive. Using a graph database, you seamlessly track all of your identity and access relationships with real-time results, connecting your data along intuitive relationships. With an interconnected view of your data, you have better insights and controls than ever before.

Use Case: Graph-Based Search

Managing your organization’s growing library of digital assets requires a more robust search solution. With graph-based search tools, your queries return more accurate and relevant real-time results. A graph-based search is a new approach to data and digital asset management originally pioneered by Facebook and Google.

Search powered by a graph database delivers relevant information that you may not have specifically asked for – offering a more proactive and targeted search experience, allowing you to triangulate the data points of the greatest interest quickly.

The key to this enhanced search capability is that on the very first query, a graph-based search engine takes into account the entire structure of available connected data. And because graph systems understand how data is related, they return much richer and more precise results.

Think of graph-based search more as a “conversation” with your data, rather than a series of one-off searches. It’s search and discovery, rather than search and retrieval.

As a cutting-edge technology, graph-based search is beset with challenges. Here are some of the biggest:

The size and connectedness of asset metadata

The usefulness of a digital asset increases with the associated rich metadata describing the asset and its connections. However, adding more metadata increases the complexity of managing and searching for an asset.

The power of a graph-based search application lies in its ability to search and retrieve data in real time. Yet, traversing such complex and highly interconnected data in real time is a significant challenge.

Growing number of data nodes

With the rapid growth in the size of assets and their associated metadata, your application needs to be able to accommodate both the current and future requirements.

Graph-based search would be impossible without a graph database to power it. In essence, graph-based search is intelligent: you can ask much more precise and useful questions and get back the most relevant and meaningful information, whereas traditional keyword-based search delivers results that are more random, diluted and low-quality.

With graph-based search, you can easily query all of your connected data in real time, then focus on the answers provided and launch new real-time searches prompted by the insights you’ve discovered.

Graph databases make advanced search-and-discovery possible because:

Enterprises can structure their data exactly as it occurs and carry out searches based on their own inherent structure. Graph databases provide the model and query language to support the natural structure of data.

Users receive fast, accurate search results in real time. With a graph database, a variety of rich metadata is assigned to all content for rapid search and discovery.

Data architects and developers can easily change their data and its structure as well as add a wide variety of new data. The built-in flexibility of a graph database model allows for agile changes to search capabilities.

In contrast, information held in a relational database is much more inflexible to future change: If you want to add new kinds of content or make structural changes, you are forced to re-work the relational model in a way that you don’t need to do with the graph model.

The graph model is much more easily extensible and over 1,000 times faster than a relational database when working with connected data.

For businesses that have huge volumes of products, content or digital assets, graph-based search provides a better way to make this data available to users, as corporate giants Google and Facebook have clearly demonstrated.

The valuable uses of graph-based search in the enterprise are endless; customer support portals, product catalogs, content portals and social networks are just a few.

Graph-based search offers numerous competitive advantages, including better customer experience, more targeted content and increased revenue opportunities. Enterprises that tap into the power of graph-based search today will be well ahead of their peers tomorrow.

The ultimate guide to graph databases

Unite your data and get a highly scalable, performant, native GraphQL graph database in the cloud that delivers blazingly fast query speeds.

Supercharge your app development.

Don’t get stuck in the same old pitfalls by developing apps in legacy ways. Instead, use Dgraph to supercharge your app development by building apps the modern way.

We built Dgraph with the idea of giving developers a way to build better apps, quickly. As part of this, we saw a more modern approach which includes:

Graph databases vs relational databases: Why do graph databases exist?

You would think they would, but that’s not the case. Let’s take a look at the most common type of database, relational databases.

Associative Entity, JOIN table or Lookup table

The relational database model, used by databases such as PostgreSQL, MySQL, or SQL Server, uses a table format to store data (as seen above). Each column represents a table, with the arrows between representing relationships.

Relational databases use a table format, consisting of rows and columns to represent data. Each table (usually) represents a specific entity. For example, the Employees table represents employees, with columns holding data such as ID, name, department, etc. So how would you represent relationships between entities (tables)?

There are two ways to create relationships between tables. The first way is for one entity to directly refer to another via its primary key. For example, Alice has a nice office on the second floor. The room’s ID is 812, and we can store that ID in Alice’s record.

For many-to-many relationships, the above method won’t cut it. We need to use a special table commonly referred to as a “JOIN table” or a Lookup table. We would create a new table that only holds IDs, and use that table to store relationships between entities.
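
The two techniques can be sketched with Python's built-in `sqlite3` module. Alice and room 812 come from the example above; the `projects` and `employee_projects` tables, Bob, and the project titles are hypothetical additions used only to show a many-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rooms (id INTEGER PRIMARY KEY, floor INTEGER);
    -- Direct reference: the employee row stores the room's primary key.
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        room_id INTEGER REFERENCES rooms(id)
    );
    CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT);
    -- Many-to-many: a JOIN table that holds only IDs.
    CREATE TABLE employee_projects (
        employee_id INTEGER REFERENCES employees(id),
        project_id INTEGER REFERENCES projects(id),
        PRIMARY KEY (employee_id, project_id)
    );
    INSERT INTO rooms VALUES (812, 2);
    INSERT INTO employees VALUES (1, 'Alice', 812), (2, 'Bob', NULL);
    INSERT INTO projects VALUES (10, 'Search'), (11, 'Billing');
    INSERT INTO employee_projects VALUES (1, 10), (1, 11), (2, 10);
""")

# Direct reference: Alice's record already holds her room's ID.
alice_room = conn.execute(
    "SELECT room_id FROM employees WHERE name = 'Alice'"
).fetchone()[0]

# Many-to-many: "who works on Search?" needs two JOINs through the lookup table.
search_team = [
    row[0]
    for row in conn.execute("""
        SELECT e.name
        FROM employees e
        JOIN employee_projects ep ON ep.employee_id = e.id
        JOIN projects p ON p.id = ep.project_id
        WHERE p.title = 'Search'
        ORDER BY e.name
    """)
]
```

Every additional hop in a question (for example, colleagues of colleagues on shared projects) adds further JOINs through `employee_projects`, which is where query cost and complexity start to grow.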

As you can see from the example above, graph databases allow us to model relationships in a much more natural way. In addition, JOIN operations in relational databases are very costly. This is usually addressed by denormalizing data and breaking data integrity.

In a graph system, relationships are stored with data, which means graph databases are much more performant when querying highly connected datasets. Considering that highly connected data is increasing across industries at a rapid rate, it makes sense why you’re reading about graph databases right now. In short, if relationships between data are more important than the data itself, graph databases are the undisputed winners.

So, if graph databases are more intuitive and more performant, why are they not the go-to databases? To understand that, we need to learn a bit about the history of databases.

A very brief history of databases

The beginning.

Back in the 1960s, the first general-use database systems started appearing. Charles W. Bachman created the “Integrated Data Store” (IDS), widely regarded as the first DBMS (Database Management System). Around the same time, IBM created its own DBMS, known as IMS. By the mid-60s, many other general-use database systems had started appearing.

The lack of standardization led many customers of these systems to demand a standard, which led to the formation of the “Database Task Group” (very creative name 😉). In 1971, the Database Task Group published the “CODASYL” approach, which wasn’t great and left many people unhappy.

SQLing with delight

One such person was Edgar Codd, who wrote A Relational Model of Data for Large Shared Data Banks, which described a method for storing data in a “table with fixed-length records”. Sadly, IBM, where Codd was working, was heavily invested in IMS and wasn’t interested in it.

Luckily, two individuals at UC Berkeley decided to research relational database systems and created INGRES, which proved that a relational model could be efficient and practical. This pressured IBM to improve on QUEL (INGRES’ query language), and in 1974, IBM launched SQL, which soon overtook QUEL as a more functional query language. By the 1990s, SQL had become the de facto language of data and the SQL database reigned supreme. What if you didn’t want to use SQL? Too bad, you didn’t have a choice.

Say NO to SQL

Relational databases up to this point were architected on the assumption that they would be run on a single machine. However, as the internet exploded, some workloads grew so heavy that no single computer could handle the load. Enter the NoSQL database.

NoSQL, at least in the beginning, was developed as a solution to this scaling problem, as well as to the need to store unstructured data. One of the pioneers of NoSQL was Google, which, as one could imagine, ran into this scaling problem rather quickly. In 2006, Google published a paper describing BigTable, a storage system built on top of its distributed file system. BigTable could run distributed across many physical nodes (servers). Other organizations such as Facebook and Apache later followed suit and created their own distributed NoSQL systems.

The “NoSQL” term might mislead you into thinking that NoSQL databases are all similar in design, but that is not the case. Today, there are hundreds, if not thousands of NoSQL databases, which can be split into one (or more) of the following categories:

  • Key/Value Stores: Simplest NoSQL model. They store pairs of keys and values and retrieve a value for a given key. Very high performance compared to SQL. Some store these values entirely in memory (e.g. Redis, Dgraph Badger).
  • Document Stores: Handle semi-structured data (XML, JSON, etc). They store key/document pairs, but internal document data can be processed and indexed as well. Can contain nested structures such as arrays. Very flexible. The most popular example is MongoDB.
  • Graph Databases: Data represented as a graph, composed of nodes and edges. Optimized for complex data models. Some are distributed like Dgraph.
  • Wide Column Stores: Inspired by BigTable. Cells can be individually accessed. Lots of optimization techniques for splitting data across files and the network. Very high performance, very scalable.
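
As a rough illustration of how the same facts look in each of the four models, the sketch below uses plain Python data structures. It is a conceptual analogy only, with hypothetical keys and values; it does not reflect the storage format or API of Redis, MongoDB, Dgraph or BigTable.

```python
import json

# Key/value store: an opaque value looked up by its key.
key_value = {"user:1": json.dumps({"name": "Alice"})}

# Document store: the value is structured, so nested fields can be queried.
documents = {
    "user:1": {"name": "Alice", "skills": ["go", "sql"], "office": {"room": 812}},
}

# Graph database: nodes plus labeled edges between them.
nodes = {"alice": {"type": "person"}, "room812": {"type": "room"}}
edges = [("alice", "WORKS_IN", "room812")]

# Wide column store: each row key maps to sparse, individually
# addressable cells, named here as "family:column".
wide_column = {
    "user:1": {"profile:name": "Alice", "profile:room": "812"},
    "user:2": {"profile:name": "Bob"},
}


def neighbors(node, label):
    """Follow edges with the given label out of a node."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


# The same question ("where does Alice work?") asked of each model.
room_from_kv = json.loads(key_value["user:1"]).get("room")  # not stored here
room_from_doc = documents["user:1"]["office"]["room"]
room_from_graph = neighbors("alice", "WORKS_IN")
room_from_wide = wide_column["user:1"]["profile:room"]
```

Only the graph model stores the relationship as a first-class item that can be followed onward to further nodes; the other three store the room as a value inside Alice's own record.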

Graphs to the Rescue?

NoSQL provided a great alternative to companies that needed to store massive amounts of data. Great. Now, let’s say you want to look through that data, process it, and provide an answer to your team or your customers? Suddenly NoSQL solutions are less appealing.

If you want to do anything interesting with your data, like recommend purchases based on what friends or friends-of-friends buy, you will be processing many records – think terabytes of data. With a graph database? Let me get back to you in a millisecond (or less).

Graph Technology Drawbacks

If graph databases are so great, why haven’t they caught on?

Graph databases are relatively new technology.

Neo4j, Inc., creator of the Neo4j database and the first graph database company, was founded in 2000. Graph databases themselves only started gaining commercial acceptance in the early 2010s.

Learning a new query language sounds like a hassle.

Even innovative companies like Google have fallen into this trap. Funny enough, Dgraph was founded because of Google’s resistance to migrating to a graph database. Read more about why Google needed a graph database and the founding of Dgraph.

New query languages and a different paradigm.

Almost all relational databases use some variant of SQL for querying. Graph databases, on the other hand, don’t have a standard language.

But the main thing to notice is that, in fact, they are catching on! This is a graph of “Popularity Changes per Database Categories” from DB-Engines. Since 2013, the popularity of graph databases has risen by almost 1200%!

Who needs a graph database? What are some graph database use cases?

Knowledge graphs.

Yeah, it has graph in the name. But there’s more to it than that. A knowledge graph organizes the relationships between entities to get a more human-centered data understanding. By capturing real-world context in your data’s knowledge graph, you’ll connect internal data silos.

For example, purchase data may be useless on its own, but once you connect it to browsing data and a social network, you can now start to see what users are buying, what their friends are buying, and most importantly, what products they are likely to buy in the future. Interested in an example? Take a look at how KE Holdings used Dgraph to create a searchable knowledge graph from 48 billion tuples.

Social Networks

Social networks are inherently graphs. Whether declared or implied, you (an entity) are friends with entities, coworkers with other entities, possibly married to an entity, etc. You are interested in specific entities (interests), live in an entity (city), and post entities (images, posts, statuses), which are liked, commented on, and shared by other entities (friends). You go out to eat in entities (restaurants) with other entities (people), and you enjoy eating specific entities (food).

Facebook created its own graph system and graph storage named TAO.

Fraud Detection

Fraudsters have become more sophisticated, and traditional fraud measures that focus on discrete data points, once state of the art, are now much more easily fooled. Cutting-edge fraud detection systems look beyond individual data points to patterns of relationships that are difficult to notice without using a graph database.

See how Feedzai, which scores trillions of dollars a year for fraud, built their graph database to improve their fraud detection efforts.

Recommendation Engines

Real-time recommendations are at the heart of any online store. Making relevant real-time recommendations means taking into account product, customer, inventory, delivery, and sentiment data – and that’s without including any new interests in the customer’s current session. Using a graph database you can provide real-time content, service, and product recommendations by uniting all user data for a truly personalized and engaging experience that increases revenue.

Read how you could build an Amazon-like recommendation engine for yourself.

Machine Learning/AI

Many companies are now dabbling in AI and machine learning to achieve their goals. Like the use cases above, machine learning depends heavily on uncovering patterns in data. You can make better predictions about data by using relationships in the data than you can from just the data alone. For example, the most powerful predictor of whether someone will start smoking is whether they have friends that smoke. And what kind of database helps unlock relational data? You guessed it, a graph database.
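
To illustrate what "using relationships in the data" can mean in practice, the sketch below derives a simple feature from a friendship graph: the share of a person's friends who smoke. The names and data are hypothetical, and this is only one basic example of a graph-derived feature that could be fed into a model.

```python
# Hypothetical friendship edges and a known attribute per person.
FRIENDS = {
    "ana": {"bo", "cy", "di"},
    "bo": {"ana"},
    "cy": {"ana"},
    "di": {"ana"},
    "eve": set(),
}
SMOKES = {"ana": False, "bo": True, "cy": False, "di": True, "eve": False}


def friend_smoking_rate(person):
    """Fraction of a person's direct friends who smoke (0.0 if no friends)."""
    friends = FRIENDS.get(person, set())
    if not friends:
        return 0.0
    return sum(SMOKES[f] for f in friends) / len(friends)


# Combine a node's own attribute with a feature derived from its neighbors.
features = {
    person: {"smokes": SMOKES[person], "friend_smoking_rate": friend_smoking_rate(person)}
    for person in FRIENDS
}
```

The per-person row alone says Ana does not smoke; the relationship-derived column adds that two of her three friends do, which is the kind of signal the paragraph above describes.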

When do I need to switch to a graph database?

That’s a tough question. If you answer yes to one or more of the following questions, it might be time to start using a graph database for at least a portion of your data.

Is your data heavily connected?

Do you have many many-to-many relationships? Are relationships as important as, or more important than, the data itself? If so, strongly consider using a graph database. As your dataset grows larger, relationships will become more and more cumbersome to maintain, and much harder to query and understand.

Is query speed more important than write speed?

If you think that querying and analyzing data fast is much more important to you than optimizing writing and storing data, then a graph database is a good fit for you. As your dataset and relationships between data grow, a graph database will become far more performant for complex queries and data analytics.

How do I choose a graph database? Which graph database is best?

Just use Dgraph. It’s that easy!

You’re still here? OK, let’s get serious. Two main points differentiate some graph databases from others: performance and scalability. Other features are nice to have, but a lack of performance and scalability is most often the dealbreaker.

Every graph database claims that it is “blazing-fast”. Due to a relative lack of consistent and unbiased benchmarks in the space, it’s tough to crown a specific database the winner in the performance category. Additionally, every database is built differently and excels in slightly different areas, which makes things more confusing.

That said, KE Holdings ran its own benchmarks and assessments for performance and scalability. The team found that Dgraph was able to load and query a dataset of 48 billion tuples and still maintain 15,000 queries per second!

It’s also worth looking at how Dgraph has previously run benchmarks against Neo4j. We encourage others to follow similar steps when benchmarking competitors against each other.

Third-Party Benchmarks

While unbiased benchmarking is scarce, there are still third-party sources of information that can reduce bias and help in your assessment.

Forrester evaluates the “Graph Data Platform” category. While the methodology may not be applicable for all companies in the category, it can be a helpful starting point.

Jepsen Test

Jepsen is a framework designed for testing whether distributed systems live up to their consistency guarantees. Dgraph is the first and only graph database to have been Jepsen tested.

G2 collects honest reviews from verified product users – people have to share screenshots of the product to prove that they’re actual users and not just bots.


GitHub + Docker

Dgraph is the most popular graph database on GitHub, with over 15k stars and more than 1000 forks. Dgraph also has over five million Docker pulls!

Get Started Today!

Dgraph gives you the scalability and performance you need with the pricing and transparency you expect. Start building today with the world’s most advanced and performant graph database with native GraphQL.


  • Open access
  • Published: 12 December 2022

Graph4Med: a web application and a graph database for visualizing and analyzing medical databases

  • Jero Schäfer,
  • Ming Tang,
  • Danny Luu,
  • Anke Katharina Bergmann &
  • Lena Wiese

BMC Bioinformatics volume 23, Article number: 537 (2022)


Medical databases normally contain large amounts of data in a variety of forms. Although they grant significant insights into diagnosis and treatment, implementing data exploration into current medical databases is challenging since these are often based on a relational schema and cannot be used to easily extract information for cohort analysis and visualization. As a consequence, valuable information regarding cohort distribution or patient similarity may be missed. With the rapid advancement of biomedical technologies, new forms of data from methods such as Next Generation Sequencing (NGS) or chromosome microarray (array CGH) are constantly being generated; hence it can be expected that the amount and complexity of medical data will rise and bring relational database systems to a limit.


We present Graph4Med, a web application that relies on a graph database obtained by transforming a relational database. Graph4Med provides a straightforward visualization and analysis of a selected patient cohort. Our use case is a database of pediatric Acute Lymphoblastic Leukemia (ALL). Along with routine patient health records, it also contains results from recent technologies such as NGS. We developed a suitable graph data schema to convert the relational data into a graph data structure and store it in Neo4j. We used NeoDash to build a dashboard for querying and displaying patient cohort analyses. This way our tool (1) quickly displays an overview of patient cohort information such as the distributions of gender, age, mutations (fusions) and diagnoses; (2) provides mutation (fusion) based similarity search and display in a maneuverable graph; (3) generates an interactive graph of any selected patient and facilitates the identification of interesting patterns among patients.

We demonstrate the feasibility and advantages of a graph database for storing and querying medical databases. Our dashboard allows a fast and interactive analysis and visualization of complex medical data. It is especially useful for patient similarity search based on mutations (fusions), of which vast amounts of data have been generated by NGS in recent years. It can discover relationships and patterns in patient cohorts that are normally hard to grasp. Expanding Graph4Med to more medical databases will bring novel insights into diagnostics and research.

Medical databases are not only vitally important for providing accurate and timely health services but also crucial for an improvement of the work flow for doctors, researchers and health care providers. Managing a health database system is challenging because it needs to ensure (1) real-time access and analysis, (2) data security and sharing, (3) patient privacy while having to deal with very different data formats and users [ 1 ]. Traditional medical databases are usually relational or network-based. They are designed to manage the information that stores different data regarding a single entity. However as the volume and diversity of medical data continue to expand exponentially, people realize that a relational model actually keeps “healthcare data locked, isolated and unused” [ 2 ]. More and more healthcare providers are migrating from relational to non-relational database systems like graph databases and document data stores [ 3 , 4 , 5 ].

Medical graph databases

With the increasing amount of heterogeneous biological data obtained by novel technologies in the medical sector, graph databases have gained more attention as flexible and feasible storage systems [ 6 ] that help to find and understand complex hidden relationships [ 7 ]. Biological pathways can also be modeled more efficiently in a graph database than in a traditional relational database, which results in an increased query performance when traversing the knowledge graph [ 8 ]. The integration of multi-omics data provides the ability to extract new knowledge from data but is challenging due to the high diversity and complexity of such data and requires novel approaches (e.g., as provided by graph database systems) [ 9 ]. One of the most popular graph database systems is the Neo4j graph data platform [ 10 ] (cf. Graph Databases). It was chosen for realizing medical data applications around the management of biological knowledge bases [ 8 , 11 ] or the integration of data from multiple sources [ 12 ].

The usage of a graph database such as Neo4j to store, manage and query medical data often serves the purpose of building a backbone for a web application with easy user access. A web-based dashboard is a powerful tool for visualizing and analyzing the graph data as it makes the stored data available to the users in a comprehensible fashion by abstracting from the underlying graph database technology. Bukhari et al. [ 12 ] used a Neo4j database as the basis of such an intuitive dashboard, LinkedImm, for browsing and visualizing immunological data in plots; it supports the automatic translation of natural language queries to Cypher queries. Their web application lets users view immunology-related data, e.g., age distributions of subjects in a study, via a graph-based and natural language query interface for a more intuitive usage. Our approach, in contrast, uses a dashboard that retrieves the data for rendering via Cypher queries and shows multiple visual representations of the data on one page; users do not need to interact with these queries but can do so if they are familiar with Cypher. The purpose of LinkedImm is to integrate different data sources into a linked graph, whereas our tool focuses on the analysis of a cohort of patients obtained from a relational database and transformed into a graph model.

The graph database BioGraphDB [ 13 , 14 , 15 ] also relies on Neo4j as one of its core technologies and has been used to build the BioGraph [ 16 ] web application, which allows users to interactively query and analyze the integrated biological data, e.g., microRNA or protein sequence data. Similar to LinkedImm, BioGraph integrates data from multiple sources into a data graph with a manually derived graph schema to model the relationships between biological entities like genes and proteins. The schema was developed according to the results of the ETL processes, whereas our proposal gives a generally applicable methodology to transform a relational schema into a graph data schema. The web application on top of BioGraph also offers only a single visualization of data at a time, which can be retrieved with several predefined, parameterized queries or with freely entered queries. Our dashboard is capable of visualizing an aggregation of different aspects for a more comfortable work flow. Another platform displaying multi-omics data in a web application is Graphomics [ 11 ]. It combines a Neo4j database that maps the multi-omics data to a graph of connected entities, reactions and pathways with a relational SQLite database that stores the final results.

Medical background

The integration of Next Generation Sequencing (NGS) and related technologies into medicine has revolutionized the field and made personalized medicine possible. As large amounts of data are being generated and added to medical databases, their analysis and visualization become increasingly challenging. This greatly hampers our efforts to take full advantage of these new technologies. A specific example is the pediatric acute lymphoblastic leukemia (ALL) database that we used in this study. In acute lymphoblastic leukemia the most common genetic drivers are gene fusions, while mutations and copy number variations may also contribute [ 17 ]. Over the last years, diagnostic samples of ALL were analyzed by RNA sequencing, panel sequencing and/or arrayCGH. RNA sequencing detects gene fusions much more efficiently than traditional methods such as karyotyping or fluorescence in situ hybridization (FISH). Panel sequencing identifies mutations in selected DNA regions of interest. ArrayCGH is a powerful method that detects losses or gains of genomic regions. All these new technologies play critical roles in providing diagnostic, prognostic and treatment information. For example, the fusion result is one of the main criteria to stratify leukemia patients and identify patient similarities [ 17 ]. However, the current relational databases lack the capacity to search and analyze patients based on fusion/mutation types. In contrast, similarity searches are simplified by using a graph-based database structure.

Our contribution

We introduce Graph4Med, a user-friendly, graph-based visualization tool for the analysis of a cohort of patients. In particular, we extracted pediatric ALL cases from a relational database and transformed them into a graph schema tailored to our use case, which was derived from the relational schema. The extracted cohort was then stored in a Neo4j graph database and a web-based dashboard was built with NeoDash, a Neo4j dashboard building tool [ 18 ]. The rest of this work is organized as follows: First, we outline the limitations of the current relational system and the benefits of using a graph database in our use case in section “ Use case scenario ”. Then the section “ Implementation ” describes the process of modeling the graph database schema from the transformation of the relational schema into a graph schema, which is further converted into the final graph model. Then in the “ Results ” section the system architecture and the built application dashboard are elucidated. We further comment on the usability, significance and limitations of our implementation in the “ Discussion ”. Finally, a summary of our contribution is given in the “ Conclusion ”.

Use case scenario

Current relational system.

Currently our medical partner uses a system that employs a relational database and a graphical interface to interact with the patient data. According to the relational schema, the user can browse different concepts and navigate through the case of an individual patient by viewing data about the diagnoses, samples, tests, analyses or prescriptions in the form of tabular or unstructured data. Although it is possible to alter the stored records, there are no straightforward visualizations other than tables or sheets to obtain a comprehensive overview. This leads to overly complex and time-intensive work for the user to identify key aspects for the diagnosis or treatment of the subject. To this end, a dashboard improves the work flow by displaying the required and most valuable information concisely and in a user-friendly manner.

In particular, the desired information in many scenarios is often scattered across multiple tables due to database normalization. The information inside these tables is only shown separately in the interface. To address certain questions, clinicians and researchers have to acquire information from different tables. However, combining multiple tables via joins in queries, which should help to derive the same answer directly, often also impedes the work flow as the joined tables are quite overloaded and, thus, inefficient to work with. Additionally, the resulting joined table might contain redundant information, inflating the amount of information the user has to deal with. In particular, when querying for a subgroup of patients, the non-redundant information usually needs to be aggregated carefully as otherwise records with redundant information are returned. To overcome these issues, the proposed system provides access to the information in an implicitly non-redundant fashion by using a graph database.

Further use cases not addressed by the former system are to examine common features or correlations inside subgroups of patients and not just the individuals, which is an important aspect for research in general. Particularly regarding ALL, we believe that it is crucial to investigate common fusions or mutations to gain valuable insights for diagnostics and treatment opportunities. Hence, the system is required to support the search for a specific subgroup by fusions and an appropriate display of the gathered information, for instance, by applying additional filters based on the age at diagnosis, sex or aneuploidy of the patients. If the case of an individual patient is identified to be of high interest, the next step usually is to find other cases similar to the target patient. This functionality is also not in the scope of the old relational system as it would require a set of complex SQL queries or a separate application program to measure similarity between patients.

Graph databases

Graph databases manage data by employing a graph data model using graph structures for the logical representation of data and their schema [ 19 ]. The essential part comes from the mathematical formulation of graph theory that supplies the abstract data type. Formally, a graph consists of a set of nodes (or vertices) and a set of edges to connect pairs of nodes. Edges can have a direction pointing from one node to another or be undirected. In a labeled-property graph, both the nodes and edges are labeled and can have additional properties in the form of key-value pairs further characterizing the entities (nodes) and relationships (edges). In the context of a database, data are then obtained by formulating queries against the graph and data manipulations are performed by graph transformation operations.
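The formal definition above can be made concrete with a minimal sketch of a labeled-property graph. The class names, the helper `neighbors` and the sample values are hypothetical and chosen only for illustration; they are not part of Neo4j or of Graph4Med. For brevity the sketch scans a plain edge list, so it does not reproduce the storage layout of a native graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set                      # e.g. {"Patient"}
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int                      # id of the start node
    target: int                      # id of the end node
    label: str                       # relationship type, e.g. "HasDiagnosis"
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)

    def add_edge(self, source, target, label, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already exist")
        self.edges.append(Edge(source, target, label, properties))

    def neighbors(self, node_id, label=None):
        """Follow outgoing edges of a node, optionally filtered by edge label."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == node_id and (label is None or e.label == label)]


# Invented toy data: one patient node, one diagnosis node, one labeled edge.
graph = PropertyGraph()
graph.add_node(1, {"Patient"}, gender="f")
graph.add_node(2, {"Diagnosis"}, name="ALL")
graph.add_edge(1, 2, "HasDiagnosis", date="2020-01-01")
diagnoses = [n.properties["name"] for n in graph.neighbors(1, "HasDiagnosis")]
```

Both nodes and edges carry a label plus key-value properties, which is the distinguishing feature of the labeled-property model compared with a plain mathematical graph.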

One of the benefits of using a graph data model is that related information can be queried more easily. Starting from an individual patient or a common concept, e.g., the detection of a certain fusion or mutation, the graph can be traversed along multiple relationships step-by-step without having to consider unrelated data. The index-free adjacency ensures that the neighbors of a node (i.e., connected via an edge) can be directly accessed from the node itself and, thus, the lookup performance does not depend on the size of the graph resulting in a better scalability. Furthermore, queries are more flexible regarding the grouping and aggregation of data by returning subgraphs as query results. This capability implicitly eliminates duplicate answers on-the-fly as the same node is just returned once but with possibly multiple relationships/paths to other nodes. For example, consider a scenario of multiple chained one-to-many relationships, e.g., various analyses, that themselves might have multiple results, can be done for each patient. In relational databases the grouping for findings per patient and analysis would lead to a table with potentially redundant information as each patient occurs once for each analysis. The graph database, in contrast, would return a subgraph of patient nodes with paths via analysis nodes to result nodes. The graph structure inherently yields a powerful possibility for visualization of such subgraphs that facilitates the identification of complex relationships in the data by the user.
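The redundancy argument can be illustrated with a small, self-contained sketch. The table layout and the data are invented for this example (they are a simplification, not the schema of the medical database), and the graph side is emulated with a plain adjacency dictionary rather than a real graph database.

```python
import sqlite3

# Relational side: chained one-to-many relationships patient -> analysis -> result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE analysis(id INTEGER PRIMARY KEY, patient_id INTEGER, type TEXT);
CREATE TABLE result(id INTEGER PRIMARY KEY, analysis_id INTEGER, value TEXT);
INSERT INTO patient VALUES (1, 'P1');
INSERT INTO analysis VALUES (10, 1, 'RNASeq'), (11, 1, 'ArrayCGH');
INSERT INTO result VALUES (100, 10, 'r1'), (101, 10, 'r2'), (102, 11, 'r3');
""")
rows = conn.execute("""
SELECT p.name, a.type, r.value
FROM patient p
JOIN analysis a ON a.patient_id = p.id
JOIN result r ON r.analysis_id = a.id
ORDER BY r.id
""").fetchall()
# Every joined row repeats the patient (and often the analysis) columns.

# Graph side: the same data as adjacency lists; each node exists exactly once.
adjacency = {
    "P1": ["A10", "A11"],
    "A10": ["R100", "R101"],
    "A11": ["R102"],
}


def subgraph(start):
    """Collect all nodes and edges reachable from the start node."""
    nodes, edges, stack = {start}, [], [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency.get(current, []):
            edges.append((current, neighbor))
            if neighbor not in nodes:
                nodes.add(neighbor)
                stack.append(neighbor)
    return nodes, edges


nodes, edges = subgraph("P1")
```

In the joined table the patient appears in every row, whereas the traversal returns the patient node once together with the paths leading to its analyses and results.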

Neo4j is our choice of graph database management system that “stores and manages data in its more natural, connected state, maintaining data relationships that deliver lightning-fast queries, deeper context for analytics, and a pain-free modifiable data model” [ 10 ]. Neo4j is available in an open-source version and comes with a native graph database, the graph query language Cypher and libraries for graph analytics. It has a vibrant community, a broad variety of tools and extensions, and is one of the most popular systems for storing data with a graph data model. The corresponding query language comes with an intuitive and logical syntax that is easy to learn and understand (see the Cypher documentation). This is important as our dashboard will use Cypher queries to populate the reports with data directly obtained from the database.


Relational database schema.

The data underlying the dashboard application was extracted from a relational database and mapped into a suitable graph data model. Figure 1 shows a simplified version of the relational database schema that was restricted to only comprise the features of the entities and core relationships between the different entities present that are relevant for the visualization of the ALL cases in the dashboard. Formally, the relational database schema \(\textbf{R}\) consists of the following main relations: Patient(id, name, gender, dob), Project(id, name), Family(id, name), Order(id, type), Diagnosis(id, name, icd), DiagnosisAddition(id, description), AnalysisMaster(id, type), Analysis(id, order_id, master_id, material_id, result), DynamicField(analysis_id, field, name, value), Material(id, type), MaterialNumber(id, master_id, sub_type, sub_number) and Result(id, description, value).

figure 1

Relational database schema. The schema of the relational database with primary key (PK) attributes indicated by bold names and foreign key (FK) relationships indicated by the grey lines connecting FK and PK attributes

These relations model the logical entities of our use case and the other relations are associative relations that implement the logical one-to-many or many-to-many relationships. Such relationships occur for the membership of patients in families and projects, the assignment of the patients’ diagnoses (including additions), orders and materials, or the relation of analyses to results, which may aggregate multiple analytical results. The relation that represents the patient entity plays a central role (marked red) in the designed relational model. The other relations modeling logical entities can be differentiated into patient-specific or patient-agnostic relations. The relations Project, Family, Diagnosis, DiagnosisAddition and AnalysisMaster are patient-agnostic as they (generally) do not store information that depends on a specific patient but potentially link to every patient. The tuple (12345, ’Leukemia Research’) in the relation Project, for instance, represents a research project named ’Leukemia Research’ with id 12345 and multiple patients as subjects. In contrast to this, the relations Order, Analysis, Result, DynamicField, Material and MaterialNumber are patient-specific and store information that belongs to a specific patient. For instance, the relation Material can contain a tuple (98765, ’DNA’) representing a DNA sample with id 98765 that was obtained from a patient (as mapped by the MaterialPatient relation).

Schema graph transformation

Similar to Definition 1 in [ 20 ], we constructed a relational schema graph \(\textbf{RG} = \langle N, E\rangle\) for the relational database schema \(\textbf{R}\) with a node \(n_a\) for each of the attributes \(a \in X_i\) of each relation schema \(R_i(X_i)\) in \(\textbf{R}\). Each node \(n_a\) is labeled with the name of the relation \(R_i\) followed by a dot and the name of the attribute a, i.e. \(R_i.a\). Before introducing edges, we merge the nodes of composite PK attributes into single nodes which are then labeled with \(R_i.PK\). This reduced the number of nodes and edges for a simpler representation of the schema. Furthermore, the transformation of the schema graph to the graph data schema is simplified by the merge of PK attributes when creating the nodes representing the different entities. In the following, we refer to both single and composite PK attribute nodes as PK nodes. Also, we did not consider composite FKs but they could be handled in the same way as composite PKs by merging them into one single FK node. The general transformation steps towards the final graph data schema are:

1. Create nodes labeled with \(R_i.a\) for relations \(R_i\) and their attributes a.

2. Merge nodes of composite PK attributes into single PK nodes.

3. Create directed edges from PK nodes to nodes of other attributes a of the same relation \(R_i\).

4. Create directed edges from FK nodes to the respective PK nodes.

5. Merge sinks (i.e. nodes without any outgoing edges) that are all connected to the same PK node and have only one incoming edge into one node labeled \(R_i.attributes\).

6. Merge PK hubs (i.e. nodes with incoming and outgoing edges) labeled \(R_i.a\) with only one outgoing edge to a (merged) sink with these sinks. The new entity nodes are labeled \(R_i\) and contain all attributes from the previously merged hub and sink.

7. Replace sources n (i.e. nodes without incoming edges) that are connected to exactly two entity nodes p and q and these two edges by an undirected edge e connecting p and q directly.

   7.1. Case 1: If n has no other edges, no other actions are required.

   7.2. Case 2: If n has an edge to a (merged) sink m, add the attribute(s) represented by m as property to e and remove m, too.

   7.3. Case 3: If n has an edge to a hub m with only one other edge to an entity node r containing only one additional attribute a next to the identifying attribute(s), add a as property to e and remove m and r from N.

   If none of the above cases is applicable, no merge is performed.

8. Resolve FK relations by edges:

   8.1. Case 1: Replace FK relations indicated by hubs \(n_h\) with one incoming edge from an entity node m and one outgoing edge to an entity node o by an undirected edge directly connecting m and o.

   8.2. Case 2: If the FK relation is a source \(n_s\) labeled \(R_i.a\) with outgoing edges to an entity node m and to a (merged) sink o, first merge \(n_s\) and o (with all attributes except the FK attribute) into an entity node \(R_i\). Then, connect the entity nodes m and \(R_i\) with an undirected edge.
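The first transformation steps can be sketched in a few lines of Python on a reduced schema. This is an illustrative simplification, not the implementation used for Graph4Med: the dictionary layout, the function names and the attribute names of ProjectPatient are assumptions, steps 5 and 6 are collapsed into one pass, and only the simplest source case (step 7.1) is handled.

```python
from collections import defaultdict

# Reduced schema; the ProjectPatient attributes are assumed for illustration.
schema = {
    "Patient": {"attrs": ["id", "name", "gender", "dob"], "pk": ["id"], "fks": {}},
    "Project": {"attrs": ["id", "name"], "pk": ["id"], "fks": {}},
    "ProjectPatient": {"attrs": ["patient_id", "project_id"],
                       "pk": ["patient_id", "project_id"],
                       "fks": {"patient_id": "Patient", "project_id": "Project"}},
}


def pk_node(schema, rel):
    """Step 2: a composite PK becomes one node R.PK, a single PK keeps its name."""
    pk = schema[rel]["pk"]
    return f"{rel}.PK" if len(pk) > 1 else f"{rel}.{pk[0]}"


def node_of(schema, rel, attr):
    return pk_node(schema, rel) if attr in schema[rel]["pk"] else f"{rel}.{attr}"


def build_schema_graph(schema):
    nodes, edges = set(), set()
    for rel, spec in schema.items():
        for attr in spec["attrs"]:
            nodes.add(node_of(schema, rel, attr))                    # steps 1 and 2
            if attr not in spec["pk"]:
                edges.add((pk_node(schema, rel), f"{rel}.{attr}"))   # step 3
        for attr, target in spec["fks"].items():
            edges.add((node_of(schema, rel, attr), pk_node(schema, target)))  # step 4
    return nodes, edges


def merge_entities(nodes, edges):
    """Steps 5 and 6 in one pass: a PK hub absorbs its private sinks."""
    out_edges, in_edges = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_edges[src].add(dst)
        in_edges[dst].add(src)
    rename, entities = {}, {}
    for node in nodes:
        targets = out_edges[node]
        if not in_edges[node] or not targets:
            continue  # not a hub
        if all(not out_edges[t] and in_edges[t] == {node} for t in targets):
            rel = node.split(".", 1)[0]
            entities[rel] = [node.split(".", 1)[1]] + sorted(
                t.split(".", 1)[1] for t in targets)
            rename.update({n: rel for n in targets | {node}})
    new_nodes = {rename.get(n, n) for n in nodes}
    new_edges = {(rename.get(s, s), rename.get(d, d)) for s, d in edges}
    return new_nodes, {(s, d) for s, d in new_edges if s != d}, entities


def resolve_sources(nodes, edges, entities, names):
    """Step 7.1: a source between exactly two entity nodes becomes a relationship."""
    relationships = []
    for node in sorted(nodes):
        has_incoming = any(d == node for _, d in edges)
        targets = sorted(d for s, d in edges if s == node)
        if not has_incoming and len(targets) == 2 and all(t in entities for t in targets):
            relationships.append((targets[0], names.get(node, node), targets[1]))
    return relationships


nodes, edges = build_schema_graph(schema)
merged_nodes, merged_edges, entities = merge_entities(nodes, edges)
relationships = resolve_sources(merged_nodes, merged_edges, entities,
                                {"ProjectPatient.PK": "InProject"})
```

On this reduced schema the sketch ends with the entity nodes Patient and Project connected by a single InProject relationship, matching the outcome described for ProjectPatient in the text.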

The obtained schema graph \(\textbf{RG}\) after applying steps 1-5 is shown in Fig. 2 but was split into two parts \(\mathbf {RG_1} \cup \mathbf {RG_2}\) for a less overloaded visualization. The left part of the figure displays the schema graph around the Patient, Family, Project, Diagnosis and DiagnosisAddition relations and the right part displays the schema graph restricted to the Patient and the other relations. The PK property of attributes represented by the nodes is highlighted by a thicker outline (e.g., the node labeled Patient.id). The directed edges of \(\textbf{RG}\), visualized by arrows between the nodes, indicate either the FK relationships between attributes or the relationship of a PK to non-key attributes. For example, the node OrderPatient.PK comprises the two attributes OrderPatient.patient_id and OrderPatient.order_id, which are references to patients and orders, respectively; its outgoing edges link to the referenced PK nodes and to the OrderPatient.order_date attribute, which records the date of the order request. The (merged) sinks are colored green and the sources are colored red.

figure 2

Schema graph. The constructed and compressed schema graph \(\textbf{RG}\) which is split into two parts \(\mathbf {RG_1} \cup \mathbf {RG_2}\) (left and right, respectively). Nodes with a thick outline represent PK attributes. Sinks are colored green and sources are colored red

Figure 3 shows the application of the transformation steps 6 and 7 to the first part \(\mathbf {RG_1}\) of the schema graph (left graph in Fig. 2) from left (step 6) to right (step 7). At the first stage, the nodes representing the patient, project, family, diagnosis and diagnosis addition entities were created by merging the hubs, which are PK nodes, with the connected attributes. This aggregated the entity-specific properties in one node. The source nodes between exactly two of the new entity nodes (blue colored nodes in Fig. 3) were resolved as edges modeling the relationship between objects of the two entities. The simplest scenario (step 7.1) occurred for transforming the sources labeled ProjectPatient.PK and FamilyPatient.PK into the relationships InProject and InFamily between patients and projects/families, stating that a patient was participating in a research project or was a member of a family, respectively.

figure 3

Schema graph \(\mathbf {RG_1}\) transformation. The transformation of the schema graph \(\mathbf {RG_1}\) is depicted in two steps. The left panel shows how hubs and directly connected sinks were merged into new entity nodes (blue) in \(\mathbf {RG_1}\) according to transformation step 6 (e.g., nodes Patient.id and Patient.attributes were merged into node Patient). Sources were transformed into edges connecting the previously created nodes directly (e.g., ProjectPatient.PK became the edge labeled InProject) as shown in the right panel (step 7)

For the third source labeled DiagnosisPatient.PK the transformation steps 7.2 and 7.3 were applicable. In the absence of the DiagnosisAddition relation, the direct relationship between the Patient and Diagnosis node would simply be established by transforming the source according to step 7.2. The relationship HasDiagnosis would then also incorporate the additional property for the date at which the diagnosis was made for the specific patient. Considering also the diagnosis addition, the ternary relationship could not be modeled by a single relationship between two of the affected entities if the DiagnosisAddition relation stored more than just an id and description (i.e. further attributes regarding the characteristics of the addition or FK attributes linking to other relations). However, the attribute DiagnosisPatient.addition and the attribute it references, which constitute the FK relationship between a patient’s diagnosis and the (optional) addition to the diagnosis, were consumed as defined in step 7.3 because the DiagnosisAddition.description was pulled into HasDiagnosis as an additional attribute of the relationship, too.

The same transformations were applied to the other half of \(\textbf{RG}\). This is visualized by the two graphs in Fig. 4 after subsequent application of each of the two previously described transformation steps 6 and 7. From the transformation step 6, we obtained the nodes Order, Analysis, Result, AnalysisMaster, MaterialNumber, Material and of course Patient (which was already shown in Fig. 3) with the respective attributes. Following the rules for transformation step 7, the sources OrderPatient.PK, ResultAnalysis.PK and MaterialPatient.PK were converted into edges labeled HasOrder, HasResult and HasMaterial, respectively.

figure 4

Schema graph \(\mathbf {RG_2}\) transformation. The transformation of the schema graph \(\mathbf {RG_2}\) is depicted in two steps. The left panel displays the merge of hubs and directly connected sinks into new entity nodes (blue) in \(\mathbf {RG_2}\) according to transformation step 6 (e.g., nodes Patient.id and Patient.attributes were merged into node Patient). Sources were transformed into edges connecting the previously created nodes directly (e.g., OrderPatient.PK and \(OrderPatient.order\_date\) became the edge labeled HasOrder with attribute \(order\_date\)) as shown in the right panel (step 7)

Figure 5 visualizes the further processing of \(\mathbf {RG_2}\) by the previously defined conversion of FK relations into edges (step 8). In particular, the four hubs labeled \(MaterialNumber.master\_id\), \(Analysis.material\_id\), \(Analysis.master\_id\) and \(Analysis.order\_id\) fulfill the condition of linking two entities (i.e. they represent a FK relation between them), and, thus, were resolved as relationships in the graph data model according to step 8.1. These new edges, which were labeled CreatedFrom, OnMaterial, HasMaster and HasAnalysis, replace the four hub nodes in \(\mathbf {RG_2}\) depicted in the left-hand panel of Fig. 5. Furthermore, based on the source DynamicField.PK, which covers the analysis_id and field attributes of the DynamicField relation, and the sink DynamicField.attributes, a new entity DynamicField and a new relationship HasDynamicField going out from it towards the entity Analysis were added to \(\mathbf {RG_2}\) (step 8.2).

figure 5

Schema graph \(\mathbf {RG_2}\) transformation. The further transformation of the schema graph \(\mathbf {RG_2}\) is depicted in two steps. The left panel depicts how the remaining FK relations between some entities were resolved and how the respective nodes were replaced by relationships between the entities in \(\mathbf {RG_2}\) according to transformation step 8. For instance, the reference from Analysis to Order via the FK \(Analysis.order\_id\) was transformed to HasAnalysis . The graph data model was further refined and simplified by structural changes (e.g., pulling the AnalysisMaster into the Analysis entity) as depicted in the right panel

Use case specific data model adaptation

Given that the AnalysisMaster relation only contained a single attribute description next to the id attribute, whose sole purpose was to identify each record uniquely, the same information could also be incorporated into each Analysis entity directly: the description of a certain type of analysis would become an attribute of the Analysis entity, which logically replaces the AnalysisMaster entity. As an alternative to this strategy, we specified a node label for each of the different types of analysis, all of which inherit attributes and relationships from the general Analysis node label, for a more fine-grained and intuitive modeling. On the one hand, this allowed the formulation of Cypher queries that either traverse analysis nodes regardless of the specific type of analysis or traverse nodes of a specific analysis type only. On the other hand, a data model defined in this way could also accommodate even more sophisticated, deeper taxonomies. The effect of this restructuring is also shown in the right panel of Fig. 5 where the AnalysisMaster entity was “absorbed” by the Analysis entity that now also represented the node labels of the different analysis types (not visualized in this figure for the sake of simplicity).
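The effect of combining a generic label with type-specific labels can be emulated with a short sketch. The node records and the `match` helper are hypothetical stand-ins for label matching in a graph database, and the label RNASeqAnalysis is an assumed name chosen by analogy with ArrayCGHAnalysis.

```python
# Each analysis node carries the generic label plus one type-specific label.
nodes = [
    {"id": 1, "labels": {"Analysis", "RNASeqAnalysis"}},    # assumed label name
    {"id": 2, "labels": {"Analysis", "ArrayCGHAnalysis"}},
    {"id": 3, "labels": {"Patient"}},
]


def match(label):
    """Return the ids of all nodes carrying the given label."""
    return [node["id"] for node in nodes if label in node["labels"]]


all_analyses = match("Analysis")            # any analysis, regardless of type
array_cgh_only = match("ArrayCGHAnalysis")  # one specific analysis type
```

Matching on the generic label returns every analysis node, while matching on a specific label restricts the traversal to that analysis type, without storing the type as a separate entity.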

Another transformation was applied to the entity nodes representing the material hierarchy obtained as biological samples from a patient (e.g., DNA or RNA material, from which sub-materials are produced by cultivation or preparation). We did not further distinguish between the subtypes of materials such as preparation or cultivation but rather modeled them as materials themselves, obtained from a main material. Hence, the MaterialNumber entity was merged into the Material entity, which then represented both main materials and sub-materials produced from main materials. A node in the Neo4j database with the label Material is then either a main material with a unique number as id and a type value or it is a sub-material with an additional subnumber attribute. The optionality of such properties as well as the introduced self-reference (i.e. the edge labeled CreatedFrom in the graph on the right panel in Fig. 5) appropriately mirrored the native dependence in the material hierarchy. Here, we decided not to model the materials analogously to the analyses with additional labels for the different types of material. This avoids an inflation of the model by too many node labels for the more than 50 material types.

The different types of analyses recorded in the medical database produce diverse results (e.g., mutations indicated by potentially multiple fusions as detected in the context of an RNA sequence analysis, or the karyotype information determined by an Array-CGH analysis). To model these peculiarities, the relational schema contains multiple relations storing the analysis-dependent information as key-value pairs. For the implementation of our system, we first aggregated them logically into one placeholder relation called DynamicField, representing all the different relations with specific key-value pairs for the results of the analyses. In the transformation of the relational schema to the graph data model, this relation resulted in an entity DynamicField that is related to the Analysis entity via the relation HasDynamicField as depicted in Fig. 5.

As our dashboard is focused on the analytical results, we then further refined our graph data model at this point. There were multiple ways to restructure the model by replacing the DynamicField placeholder, depending on the value type of the analytical result:

A dynamic field can be restructured by pulling the respective entity node into the specific analysis node, using the stored name or key of the field as the new attribute name. Such a restructuring is suitable for dedicated fields with a huge or even infinite active domain that are also independent of other fields (e.g., a field with name q for some measured numerical value would be a reasonable additional attribute of the corresponding analysis entity node). Creating an individual node for each possible value, in contrast, is obviously not feasible if there are too many possible values.

If the data type of the dynamic field is multi-valued (e.g., a list of identified mutations), or if multiple dynamic fields constitute one logical entity that should reasonably be grouped together (e.g., a combination of fields representing an estimated result, a likelihood for the estimation and the next most likely estimation), the placeholder entity can be replaced with a more specific entity or set of entities. In our use case, one new node label named Fusion was inserted into the database; it stands for a certain mutation or group of mutations. The information about each analysis that detected a fusion matching one of the mutations was then stored by establishing the relation HasFusion between the specific analysis node and the corresponding fusion node. This was an improvement because it resolved the dynamic fields even across different types of analysis that identify mutations. Additionally, it simplified queries and the graph structure, as patients with the same mutation(s) are related to the same fusion node(s).
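The two restructuring options can be sketched in plain Python. This is a minimal illustration under stated assumptions: the key-value rows, the field name `fusion`, and the container names are invented for the example, and plain dictionaries stand in for graph nodes and relationships.

```python
from collections import defaultdict

# Hypothetical key-value rows: (analysis id, field name, value).
dynamic_fields = [
    ("a1", "q", 0.93),
    ("a1", "fusion", "CRLF2-P2RY8"),
    ("a2", "fusion", "CRLF2-P2RY8"),
    ("a2", "fusion", "PAX5"),
]

# Assumption: only this field is multi-valued and becomes its own node type.
MULTI_VALUED = {"fusion"}

analysis_attrs = defaultdict(dict)  # analysis id -> attributes pulled into the node
fusion_nodes = {}                   # fusion name -> one shared Fusion node
has_fusion = set()                  # (analysis id, fusion name) relationships

for analysis_id, name, value in dynamic_fields:
    if name in MULTI_VALUED:
        # Option 2: replace the placeholder with a dedicated, shared node.
        fusion_nodes.setdefault(value, {"name": value})
        has_fusion.add((analysis_id, value))
    else:
        # Option 1: the field key becomes an attribute name on the analysis.
        analysis_attrs[analysis_id][name] = value


def analyses_with(fusion_name):
    """All analyses linked to the given fusion node."""
    return sorted(a for a, f in has_fusion if f == fusion_name)
```

Because each fusion exists only once, `analyses_with("CRLF2-P2RY8")` finds every analysis sharing that fusion through a single node, which is the effect described above for patients with the same mutation.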

Architecture and ETL

The final graph database model as obtained from the transformations of \(\textbf{RG}\) (cf. Implementation) and implemented in the Neo4j graph database as backbone of our Graph4Med system is shown in Fig.  6 . The nodes model the different domains (i.e., the classes of objects stored in the graph) and the edges model the relationships between objects. The implemented model was equal to the union \(\mathbf {RG_1} \cup \mathbf {RG_2}\) after the applied transformations (right graphs from Figs.  3 and 5 ). The Analysis node was still a generic node label from which other labels for more specific analysis types (e.g., the label ArrayCGHAnalysis for Array-CGH analyses) were derived. For the sake of a less overloaded graph these are not shown in Fig.  6 . Here, we also show the Fusion node as a restructuring of a dynamic field that was detected for different types of analysis (e.g., RNA sequence or Array-CGH analysis).

figure 6

Graph database model. The final graph database model as applied in the Neo4j database behind the web-based dashboard in Graph4Med. The nodes of this graph represent the different domains of the objects stored in the database and the edges indicate the relationships between two objects

We used Python scripts to extract the patient cohort of pediatric ALL cases from the relational MS-SQL database server incrementally via SQL queries. The obtained data were then transformed into the graph data model using Neomodel [ 21 ], an Object Graph Mapper (OGM) for the Neo4j database system, and loaded into the Neo4j database instance. First, the patients’ personal information (i.e., id, name, age, gender) was queried and added as nodes to mark the cohort. Subsequently, all other related data (e.g., projects, orders, analyses) were queried for each patient of the cohort and linked to the related nodes according to the schema. We focused on the retrieval of general information about the patient and related entities, e.g., diagnoses or lab results. The analytics-related data concentrated on the results of various assays and methods rather than on raw data (e.g., raw NGS data or raw karyotype data). Our system is not intended to store the raw data, but rather the analytical results and related information (e.g., the materials on which the specific analyses were carried out) that contribute to the visualization of the ALL cases in our proposed visualization tool. The setup of the Neo4j database took approx. 50–55 minutes and was scheduled to be updated once every night.
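The two-step extraction (cohort first, related data per patient second) can be sketched as follows. This is not the authors' ETL code: an in-memory SQLite database stands in for the MS-SQL server, dictionaries stand in for the Neomodel node classes, and the table names, columns, and the relationship type `HAS_ANALYSIS` are hypothetical.

```python
import sqlite3

# Stand-in relational source (assumption: SQLite instead of MS-SQL).
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE patient (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE analysis (id INTEGER PRIMARY KEY, patient_id INTEGER, kind TEXT);
    INSERT INTO patient VALUES (1, 'f'), (2, 'm');
    INSERT INTO analysis VALUES (10, 1, 'RNASeq'), (11, 1, 'ArrayCGH'), (12, 2, 'RNASeq');
""")

nodes = {}  # (label, id) -> properties
edges = []  # (source key, relationship type, target key)

# Step 1: query the cohort and add one Patient node per row.
for pid, gender in src.execute("SELECT id, gender FROM patient ORDER BY id"):
    nodes[("Patient", pid)] = {"gender": gender}

# Step 2: for each cohort patient, query related rows and link them.
for _label, pid in list(nodes):
    rows = src.execute(
        "SELECT id, kind FROM analysis WHERE patient_id = ? ORDER BY id", (pid,)
    )
    for aid, kind in rows:
        nodes[("Analysis", aid)] = {"kind": kind}
        edges.append((("Patient", pid), "HAS_ANALYSIS", ("Analysis", aid)))
```

In a real pipeline the second step would repeat for every related entity (projects, orders, materials), and an OGM such as Neomodel would persist the nodes and relationships instead of the in-memory containers used here.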

NeoDash dashboard builder

Based on this Neo4j database, the web-based dashboard building tool NeoDash [ 18 ] was deployed on a web server next to Neo4j and used to build a dashboard for the visualization and analysis of the ALL cohort. Once the NeoDash web application is set up, it can be connected to any Neo4j database instance for convenient development and use of dashboards that display and analyze the stored data. The main components of a dashboard are one or more pages that each contain a collection of reports, i.e., some sort of visualization of information. The reports are populated with the results of Cypher queries that are executed via the dashboard against the database on the fly. A built-in query editor lets users formulate these queries for the reports based on the selected chart type, e.g., a bar chart. The definition of the dashboard structure itself can also be stored in Neo4j for convenient versioning, sharing and on-demand loading of the developed dashboards. The modularity of dashboard pages with multiple reports makes it easy to extend the dashboard by adding reports or new pages on demand.

Our analysis dashboard consists of several pages to grant different views on the data being investigated. Each page comprises multiple reports presenting tabular, graph-based, or chart-based visualizations of the patient data. In particular, our dashboard incorporates pages for the general overview and statistics of the full cohort, the detailed analysis of a specific subgroup of patients and the analysis of a single patient including the similarity comparison to other patients.

Cohort view

In the previously used relational database of this pediatric ALL cohort, it was always cumbersome to retrieve overall statistics. Doctors and researchers were required to perform the counting and selections from the exported tables themselves, which is time-consuming and error-prone. It was almost impossible for them to generate visualizations of cohort statistics. In contrast, with our tool Graph4Med, they can now obtain up-to-date statistics and visualizations immediately. For example, in the cohort view page of our dashboard, the top two bar plots show the distribution of current age (top) and age at diagnosis (middle) (Fig. 7), both additionally grouped by the gender of the cohort patients. (Note: due to data privacy reasons, we do not show the plots from the real data, but instead from artificially generated data.) The number at the top left indicates the size of the cohort. In these bar plots, the stacked colors indicate the distribution grouped by gender: green and orange indicate female and male, respectively. The bottom bar plot shows various non-ALL diagnoses grouped by gender. Because our cohort had a small percentage of non-ALL cases such as Acute Myeloid Leukemia (AML), pediatric and non-pediatric Myelodysplastic Syndrome (MDS), or Trisomy 21, it is very helpful to have the number and gender distribution of these comorbidity cases. These plots are useful for identifying relationships between age/gender and disease. Moreover, our dashboard can display other useful information in addition to age/gender: the three drop-down fields below the bar plot allow users to choose the values to plot on the X-axis, Y-axis and color group. For example, it is possible to show frequencies instead of patient counts. Similarly, it is possible to display distributions of MRD (minimal residual disease) or material type instead of gender.
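The counting behind such a stacked bar plot is simple to state in code. The sketch below uses artificial patients and hypothetical helper names; it only illustrates the grouping by age and gender and the switch from counts to frequencies, not how NeoDash computes its reports.

```python
from collections import Counter

# Artificial patients: (age at diagnosis, gender).
patients = [(3, "female"), (3, "male"), (5, "male"), (5, "male"), (12, "female")]

counts = Counter(patients)  # (age, gender) -> number of patients
total = len(patients)       # cohort size shown next to the plots


def stacked(age):
    """Heights of the stacked segments of one bar (one age value)."""
    return {gender: counts[(age, gender)] for gender in ("female", "male")}


def frequency(age, gender):
    """Share of the whole cohort instead of the absolute patient count."""
    return counts[(age, gender)] / total
```

Here `stacked(5)` gives the two segments of the bar for age 5, and `frequency(3, "male")` gives the corresponding value when the Y-axis is switched to frequencies.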

figure 7

Cohort view. Dashboard page for general cohort information and statistics such as gender and age distribution. The three drop-down fields below each bar plot allow users to choose distributions other than gender and age. Note: due to data privacy reasons, the plots shown here were made from artificially generated data

Subgroup view

As mentioned previously, leukemia is mostly driven by gene fusions and aberrant chromosome numbers. They are the main factors for deciding B-ALL subgroups and play important roles in risk stratification [ 17 ]. Fig. 8a shows the number of fusions per patient in the left panel and the distribution of major B-ALL subgroups in the middle panel. The table on the right lists all the names of fusions and their aliases used in the database.

figure 8

Subgroup distribution. a The left panel shows the number of fusions per patient. The middle panel shows the distribution of the 10 most frequent B-ALL subgroups. The table on the right lists all the fusion names and their aliases. b Age and gender distribution for the selected fusion, “CRLF2-P2RY8” in this case. Note: due to data privacy reasons, the plots shown here do not reflect the real distributions

Because gene fusion and aneuploidy information are important for leukemia, we implemented a function to select a certain fusion or aneuploidy type and then visualize the distribution of patients within this subgroup. In the left panel of Fig. 8b, there is an auto-completing selection field where users can enter the name of a subgroup; CRLF2-P2RY8 was chosen in this case. In the middle and right panels of Fig. 8b, we display the age and gender distributions for the selected subgroup.

We also developed a graph view to show the relationships among patients. After selecting a certain subgroup, all the patients of this subgroup are displayed in a table (right panel of Fig. 9a). A maneuverable graph showing the relationships between patients (green nodes) and subgroups (yellow nodes) is displayed on the left. This graph can show different nodes depending on the user's needs. For example, in Fig. 9b, material information (orange nodes) was added in relation to the patients and subgroup. Similarly, any other feature (color of nodes) mentioned in Fig. 6 can be added. The exact value to display inside each node can be chosen via the drop-down menu at the bottom, which has the same color scheme as the nodes (more explanation in the next section and Table 1). Our graph is flexible: it can display any feature (color of nodes) of interest, and users may also (1) drag the nodes for easier visualization; (2) hover over or click on nodes and relationships to inspect their properties; (3) zoom in or out on any region of the graph.

figure 9

Graph view of relationships. a Graph view of a selected subgroup. It shows the relationship between patients and subgroups. The table on the right lists all the patients belonging to this subgroup with additional information such as age, karyotype, chromosomes etc. b Different features of patients (color of node) can be integrated; for example, patient materials (orange nodes) were chosen here as additional information. Note: due to data privacy reasons, the plots and table shown here were made from artificially generated data

Individual patient view

The traditional relational database was designed around the management and health care of an individual patient. Its functionality is sufficient to retrieve information about a particular patient (e.g., a doctor checks the lab results and plans the treatment for one patient). Here, we enhance this functionality by displaying an individual’s information in a compact, maneuverable graph. At a single glance, users can grasp most of the details instead of going through the cumbersome and error-prone tabular process. The bottom left of Fig. 10 shows such a graph, where different colors of nodes represent different types of information (features) for the selected patient. The user can dynamically adapt the value displayed on the nodes through the small drop-down menus at the bottom of the graph. In this way, the node text can be set to any of the node attributes or to some basic property such as the node label according to the data schema. For instance, in Fig. 10 the dark blue node presents the patient’s gender; the brown nodes depict a single material linked to the patient and analysis; the light grey nodes represent the fusions identified, which are PAX5 and CRLF2-P2RY8 in this case. Table 1 gives an overview of the alternative values that could be displayed on the nodes of each type (color). Furthermore, the tabular report in Fig. 10 gives an overview of the subgroup with one row per patient. The columns can be sorted and filtered, and the user can also decide which columns to display or hide (e.g., the column with the patient ID is hidden here).

figure 10

Individual patient view. Presentation of an individual patient’s information in a table and graph (top right and bottom left panel). Bottom right is the result of the similarity search across the entire cohort with the selected patient in the middle. Note: due to privacy reasons, the plots shown here were made from artificially generated data

Graph4Med can also generate a patient similarity graph, which is shown on the bottom right of Fig. 10. Here, we implemented a very simple similarity algorithm based on diagnosis, gene fusions, and aneuploidy, which can be treated as a fusion. Depending on the needs of the specific medical use case, a more sophisticated similarity measure could be developed in the future. In our use case, the Jaccard index \(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\), where \(A\) and \(B\) are the feature sets of the two patients, was applied to the target patient and all other candidates. It yields a value between 0 and 1, where 1 indicates maximum similarity between target and candidate patient and 0 indicates no similarity between them. In the example shown on the bottom right of Fig. 10, the target patient (the node in the middle) is connected to all the patients whose similarity exceeds a certain threshold. The width and color of the edges between nodes scale with the level of similarity: bolder and darker connections represent a stronger similarity, thinner and lighter connections a weaker one. In this example, we see two levels of similarity, thin light-green and thick blue.
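The similarity search can be sketched in a few lines of Python. The patient identifiers, feature sets, and threshold values below are artificial, and returning 0 for two empty feature sets is an assumption of this sketch rather than something stated above.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty feature sets count as not similar
    return len(a & b) / len(union)


# Artificial feature sets: diagnosis plus fusions/aneuploidy per patient.
features = {
    "target": {"B-ALL", "CRLF2-P2RY8", "PAX5"},
    "p1": {"B-ALL", "CRLF2-P2RY8"},
    "p2": {"B-ALL", "ETV6-RUNX1"},
    "p3": {"AML"},
}


def similar_patients(target, threshold):
    """Candidates whose similarity to the target exceeds the threshold."""
    target_features = features[target]
    scores = {
        patient: jaccard(target_features, candidate_features)
        for patient, candidate_features in features.items()
        if patient != target
    }
    return {patient: score for patient, score in scores.items() if score > threshold}
```

With a threshold of 0.2, `p1` (two of three features shared) and `p2` (one of four) would be connected to the target, while `p3` shares nothing and is left out; the returned scores could then drive the edge width and color.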

In Graph4Med, we converted a relational database to a graph database in Neo4j and built a dashboard on top of it. This tool is very well liked by our end users: clinicians, researchers and lab scientists. It not only provides detailed visualizations beyond tables, together with statistics of the whole cohort, but also enables searching for patient subgroups based on fusions and aneuploidy, which are the most important factors in stratifying leukemia patients. The former system supported neither the search for patient subgroups nor the computation of statistics, whereas these important aspects were easily included in the dashboard reports. We also implemented a small algorithm to display patient similarities. The success of Graph4Med with the pediatric ALL database has already prompted interest among clinicians from other disease areas.

Table  2 summarizes the implementation of key features for the old and new system that have driven the development of Graph4Med. Our dashboard is designed to be user-friendly with flexibility and interactivity to improve the efficiency of the current clinical practice. To this end, it overcomes several limitations of the former system, e.g., by providing various visualizations, broad overviews and summarizing statistics. The key feature to navigate individual cases is still kept while Graph4Med also enables the detection of complex relationships among patients and subpopulations in the dashboard pages. It allows each user to individually modify the visualizations and adapt them to his/her specific needs. (1) Users can select the contents to be displayed in the nodes on-demand. (2) Because there are very complex data structures in the original database even for one patient (see Fig.  9 b), it is impossible to display all the nodes when we have more than 5 patients. Our tool made it possible to choose what type (color) of nodes to display. (3) The Cypher queries for plots and tables are embedded in the dashboard and the query language is the only prerequisite for adding additional reports. It is possible for non-IT users to modify parameters and obtain the plots they desire.

As the structure of the dashboard itself is stored in the Neo4j database, it can be updated and extended easily. Users can share and load different versions of the dashboard at any time. Furthermore, this gives the option of having multiple independent dashboards that focus on different use cases or research questions on the same underlying data set. The graph structure also implicitly removes redundant answers, and data aggregation becomes more feasible than with complex SQL queries and overloaded tables. This results in less overhead for the users, who can skip the time-intensive step of filtering the redundant information from the tables.

Medical databases are constantly updated. Therefore, we update the Graph4Med database every night to keep the dashboard up-to-date. The Graph4Med database currently holds 2,185 patients with 4,919 analyses and 723 detected fusions, amounting to 64,677 nodes and 77,129 relationships with a total size of approx. 400 MB. It takes about 50–55 min to update the whole database.

Comparison with related medical dashboards

In Table 3 we compare Graph4Med with other medical visualization systems or dashboards on several high-level features. The column “System” lists the different medical visualization systems and dashboards. For each system, the table states the actual database management system (col. “DBMS”) as well as the offered types of interfaces (col. “Interface”), e.g., a web application or command line interface (CLI). We evaluated the interactivity of the dashboards, i.e., the possibility to interact with the plots of the dashboard, and their flexibility, i.e., the option to easily extend the dashboard with new visualizations or reports; the results are shown in the columns “Int” and “Flex”, respectively. The table column “Coll” refers to the capability of the system for collaboration in the sense of versioning or sharing extended dashboard versions with other users, and “Exp” to the ability to export charts or download data from the tool. Table 3 also contains a column “Data” that summarizes the types of data dealt with in the corresponding system.


During our implementation, we also observed some technical limitations of building the dashboard with NeoDash and Neo4j. For example, (1) one cannot export the report charts, graphs or tables directly from the dashboard; (2) one cannot select a certain sub-population after choosing a fusion type. It would be desirable if users could choose which patients from the table to include in the graph report (Fig. 9a). We would also like to mention that, due to the complexity of our previously used database, we did not convert all the different analysis types to our graph database. However, the missing analyses or any novel techniques can be integrated seamlessly in the same way as the previously incorporated analyses. For instance, the results of FISH analyses, which are also employed to identify gene fusions, are not incorporated yet; this leads to smaller numbers of ETV6-RUNX1 and BCR-ABL cases in the cohort statistics than actually exist.

As Neo4j and NeoDash are constantly improving, we should be able to improve the functionality of our dashboard, such as exporting reports, in the future. We are also interested in expanding Graph4Med to other disease databases or integrating additional sources of information. One future direction would be to integrate more data for the diagnostic pipeline, such as gene expression and point mutation results. Incorporating these results or additional resources with bioinformatics data into Graph4Med could further strengthen the analytical capabilities of the system. We could also develop more complex algorithms for measuring patient similarity when the use case is bigger and contains a rich diversity of diseases. For example, our collaborator at the Department of Human Genetics of the Medizinische Hochschule Hannover also has a database for various rare genetic diseases. We could design a new similarity algorithm that considers various factors including age, gender, symptoms and genetic factors. In the case of rare diseases, suggestions from a patient similarity search will be especially useful in pinpointing treatment options.

In our work, we developed a flexible medical visualization tool including a web-based dashboard on top of a Neo4j graph database storing the application data. We presented a method on how to convert a relational database schema to a graph data model for the easy implementation of our dashboards with Cypher queries against the stored data graph. The visualizations provide the analytical capabilities in a convenient and interactive fashion that were not possible in the old system. Our work proves the flexibility and feasibility of a graph database for managing medical data as it allows for an intuitive representation of the structure of the medical data schema.

Availability of data and materials

The clinical data used in our approach are not shared due to the preservation of the patients’ privacy. A public demonstrator with artificial data can be found on the Graph4Med website. Source code is located in the project repository.


Abbreviations

AML: Acute myeloid leukemia

Array-CGH: Array-based comparative genomic hybridization

ALL: Acute lymphoblastic leukemia

CLI: Command line interface

DBMS: Database management system

DNA: Deoxyribonucleic acid

FISH: Fluorescence in situ hybridization

FK: Foreign key

MDS: Myelodysplastic syndrome

miRNA: Micro ribonucleic acid

MRD: Minimal residual disease

NGS: Next generation sequencing

OGM: Object graph mapper

PK: Primary key

SQL: Structured query language

References

1. Ismail L, Materwala H, Karduck AP, Adem A. Requirements of health data management systems for biomedical care and research: scoping review. J Med Internet Res. 2020;22(7):e17508.

2. Lawrence R. How RDBMS delays the healthcare data revolution. 2019.

3. Have CT, Jensen LJ. Are graph databases ready for bioinformatics? Bioinformatics. 2013;29(24):3107.

4. Wiese L. Advanced data management. Berlin: De Gruyter; 2015.

5. Tomar D, Bhati JP, Tomar P, Kaur G. Migration of healthcare relational database to NoSQL cloud database for healthcare analytics and management. In: Dey N, Ashour AS, Bhatt C, James Fong S, editors. Healthcare data analytics and management. Advances in ubiquitous sensing applications for healthcare. London: Academic Press; 2019. p. 59–87.

6. Rodriguez MA, Neubauer P. The graph traversal pattern. In: Sakr S, Pardede E, editors. Graph data management: techniques and applications. IGI Global: Hershey; 2012. p. 29–46.

7. Yoon BH, Kim SK, Kim SY. Use of graph database for the integration of heterogeneous biological data. Genom Inform. 2017;15(1):19.

8. Fabregat A, Korninger F, Viteri G, Sidiropoulos K, Marin-Garcia P, Ping P, et al. Reactome graph database: efficient access to complex pathway data. PLoS Comput Biol. 2018;14(1):e1005968.

9. Thapa I, Ali H. A new graph database system for multi-omics data integration and mining complex biological information. In: International conference on computational advances in bio and medical sciences. Springer; 2019. p. 171–83.

10. Neo4j, Inc. The Neo4j graph data platform. 2022.

11. Wandy J, Daly R. GraphOmics: an interactive platform to explore and integrate multi-omics data. BMC Bioinform. 2021;22(1):1–19.

12. Bukhari SAC, Pawar S, Mandell J, Kleinstein SH, Cheung KH. LinkedImm: a linked data graph database for integrating immunological data. BMC Bioinform. 2021;22(9):1–14.

13. Fiannaca A, La Rosa M, La Paglia L, Messina A, Urso A. BioGraphDB: a new GraphDB collecting heterogeneous data for bioinformatics analysis. In: Proceedings of BIOTECHNO. 2016.

14. Fiannaca A, La Paglia L, La Rosa M, Messina A, Storniolo P, Urso A. Integrated DB for bioinformatics: a case study on analysis of functional effect of MiRNA SNPs in cancer. In: International conference on information technology in bio-and medical informatics. Springer; 2016. p. 214–22.

15. Fiannaca A, La Paglia L, La Rosa M, Messina A, Rizzo R, Stabile D, et al. Gremlin language for querying the BiographDB integrated biological database. In: International conference on bioinformatics and biomedical engineering. Springer; 2017. p. 303–313.

16. Messina A, Fiannaca A, La Paglia L, La Rosa M, Urso A. BioGraph: a web application and a graph database for querying and analyzing bioinformatics resources. BMC Syst Biol. 2018;12(5):75–89.

17. Iacobucci I, Mullighan CG. Genetic basis of acute lymphoblastic leukemia. J Clin Oncol. 2017;35(9):975.

18. de Jong N. NeoDash - Neo4j Dashboard Builder. 2022.

19. Angles R, Gutierrez C. Survey of graph database models. ACM Comput Surv (CSUR). 2008;40(1):1–39.

20. De Virgilio R, Maccioni A, Torlone R. Converting relational to graph databases. In: First international workshop on graph data management experiences and systems. 2013. p. 1–6.

21. Edwards R. Neomodel documentation. 2019.

22. Koumakis L, Schera F, Parker H, Bonotis P, Chatzimina M, Argyropaidas P, et al. Fostering palliative care through digital intervention: a platform for adult patients with hematologic malignancies. Front Digital Health. 2021;3:730722.

23. Gütebier L, Bleimehl T, Henkel R, Munro J, Müller S, Morgner A, et al. CovidGraph: a graph to fight COVID-19. Bioinformatics. 2022;38(20):4843–5.

24. Chakravarty D, Gao J, Phillips S, Kundra R, Zhang H, Wang J, et al. OncoKB: a precision oncology knowledge base. JCO Precis Oncol. 2017;1:1–16.


We acknowledge publication support by the OAF of the University Library of Goethe University Frankfurt.

Availability and requirements

Project name: Graph4Med. Project home page: . Project repository: . Operating system(s): Platform independent. Programming language: Python, JavaScript. Other requirements: Neo4j, NeoDash, Browser. LICENSE: MIT. Restrictions for non-academics: None.

Open Access funding enabled and organized by Projekt DEAL. This work was supported by Else Kröner-Fresenius-Stiftung (Promotionsprogramm DigiStrucMed 2020_EKPK.20) and the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003). No funding body played any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

J. Schäfer and M. Tang have contributed equally to this work.

Authors and Affiliations

Institute of Computer Science, Goethe-Universität Frankfurt, Frankfurt am Main, Germany

Jero Schäfer & Lena Wiese

Department of Human Genetics, Hannover Medical School, Hannover, Germany

Ming Tang, Danny Luu & Anke Katharina Bergmann

L3S Research Center, Leibniz Universität Hannover, Hannover, Germany

Bioinformatics Group, Fraunhofer ITEM, Hannover, Germany



JS: project conception, system design, implementation, discussion, manuscript editing; MT: project conception, system design, discussion, manuscript editing, assessment, result validation, supervision; AB: project conception, system design, discussion, manuscript editing, result validation, supervision; DL: project conception, system design, discussion, result validation; LW: project conception, system design, discussion, manuscript editing, assessment, technical support, supervision. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Jero Schäfer .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit . The Creative Commons Public Domain Dedication waiver ( ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Schäfer, J., Tang, M., Luu, D. et al. Graph4Med: a web application and a graph database for visualizing and analyzing medical databases. BMC Bioinformatics 23 , 537 (2022).


Received: 24 March 2022

Accepted: 01 December 2022

Published: 12 December 2022



  • Graph database
  • Medical database
  • Data exploration
  • Visualization
  • Web application

BMC Bioinformatics

ISSN: 1471-2105

graph database case study

Digital Contact Tracing Based on a Graph Database Algorithm for Emergency Management During the COVID-19 Epidemic: Case Study


  • 1 College of Public Administration, Huazhong University of Science and Technology, Wuhan, China.
  • 2 Non-traditional Security Research Center, Huazhong University of Science and Technology, Wuhan, China.
  • 3 School of Law and Humanities, China University of Mining and Technology, Beijing, China.
  • PMID: 33460389
  • PMCID: PMC7837510
  • DOI: 10.2196/26836

Background: The COVID-19 epidemic is still spreading globally. Contact tracing is a vital strategy in epidemic emergency management; however, traditional contact tracing faces many limitations in practice. The application of digital technology provides an opportunity for local governments to trace the contacts of individuals with COVID-19 more comprehensively, efficiently, and precisely.

Objective: Our research aimed to provide new solutions to overcome the limitations of traditional contact tracing by introducing the organizational process, technical process, and main achievements of digital contact tracing in Hainan Province.

Methods: A graph database algorithm, which can efficiently process complex relational networks, was applied in Hainan Province; this algorithm relies on a governmental big data platform to analyze multisource COVID-19 epidemic data and build networks of relationships among high-risk infected individuals, the general population, vehicles, and public places to identify and trace contacts. We summarized the organizational and technical process of digital contact tracing in Hainan Province based on interviews and data analyses.

Results: An integrated emergency management command system and a multi-agency coordination mechanism were formed during the emergency management of the COVID-19 epidemic in Hainan Province. The collection, storage, analysis, and application of multisource epidemic data were realized based on the government's big data platform using a centralized model. The graph database algorithm is compatible with this platform and can analyze multisource and heterogeneous big data related to the epidemic. These practices were used to quickly and accurately identify and trace 10,871 contacts among hundreds of thousands of epidemic data records; 378 closest contacts and a number of public places with high risk of infection were identified. A confirmed patient was found after quarantine measures were implemented by all contacts.

Conclusions: During the emergency management of the COVID-19 epidemic, Hainan Province used a graph database algorithm to trace contacts in a centralized model, which can identify infected individuals and high-risk public places more quickly and accurately. This practice can provide support to government agencies to implement precise, agile, and evidence-based emergency management measures and improve the responsiveness of the public health emergency response system. Strengthening data security, improving tracing accuracy, enabling intelligent data collection, and improving data-sharing mechanisms and technologies are directions for optimizing digital contact tracing.

Keywords: COVID-19; China; big data; digital contact tracing; emergency management; graph database; visualization.
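
The abstract does not spell out the algorithm itself, so the following is only an illustrative sketch of graph-based contact tracing, not the method used in Hainan Province. It assumes hypothetical visit records that link people to places or vehicles, and it treats anyone who shares a location with a confirmed case as a contact. Timestamps and exposure duration, which a real system would need, are ignored.

```python
from collections import defaultdict

# Hypothetical visit records: (person, place or vehicle).
visits = [
    ("case_1", "market"),
    ("p2", "market"),
    ("p2", "bus_17"),
    ("p3", "bus_17"),
    ("p4", "cinema"),
]

def trace_contacts(records, confirmed_case):
    """Return the locations a confirmed case visited and the people seen there."""
    people_at = defaultdict(set)   # location -> people seen there
    places_of = defaultdict(set)   # person -> locations visited
    for person, location in records:
        people_at[location].add(person)
        places_of[person].add(location)

    risky_places = places_of[confirmed_case]
    contacts = set()
    for place in risky_places:
        contacts |= people_at[place]
    contacts.discard(confirmed_case)   # the case is not their own contact
    return sorted(risky_places), sorted(contacts)

places, contacts = trace_contacts(visits, "case_1")
print(places)    # ['market']
print(contacts)  # ['p2']
```

Repeating the same step from each contact would extend the trace to second-degree contacts (here, p3 via bus_17).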

©Zijun Mao, Hong Yao, Qi Zou, Weiting Zhang, Ying Dong. Originally published in JMIR mHealth and uHealth, 22.01.2021.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • COVID-19 / epidemiology
  • COVID-19 / prevention & control*
  • China / epidemiology
  • Computer Graphics
  • Contact Tracing / methods*
  • Data Visualization
  • Databases, Factual
  • Digital Technology*
  • Epidemics / prevention & control*


Graph Database (Use Cases, Examples and Properties)

A graph database is a database designed to store and query data represented in the form of a graph. A graph consists of vertices (also called nodes) and edges, which represent the relationships between the vertices.

What is a Graph Database?

In a graph database, the data is stored as a set of vertices and edges, with each vertex representing an entity (such as a person or a business) and each edge representing a relationship between two vertices (such as a friendship or a business partnership). The graph structure allows for flexible and efficient querying, as it allows for traversing relationships between entities in various ways.
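
As a minimal sketch of this data model, the vertices and edges can be held in an in-memory adjacency structure. The class name `PropertyGraph`, the edge labels, and the sample data are hypothetical and chosen only for illustration; real graph databases add persistence, indexing, and a query language on top of this idea.

```python
from collections import defaultdict

class PropertyGraph:
    """Tiny in-memory property graph: vertices plus labeled, directed edges."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> property dict
        self.out_edges = defaultdict(list)   # vertex id -> [(label, target id, props)]

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, source, label, target, **props):
        self.out_edges[source].append((label, target, props))

    def neighbors(self, vid, label=None):
        """Ids reachable over one outgoing edge, optionally filtered by edge label."""
        return [target for (lbl, target, _) in self.out_edges[vid]
                if label is None or lbl == label]

g = PropertyGraph()
g.add_vertex("alice", kind="person")
g.add_vertex("bob", kind="person")
g.add_vertex("acme", kind="business")
g.add_edge("alice", "FRIEND_OF", "bob", since=2019)
g.add_edge("alice", "PARTNER_OF", "acme")

print(g.neighbors("alice"))               # ['bob', 'acme']
print(g.neighbors("alice", "FRIEND_OF"))  # ['bob']
```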

Graph databases are particularly useful for storing and querying data that has complex relationships, such as social networks, recommendation engines, and fraud detection systems. They are also often used in areas such as bioinformatics and supply chain management, where the data has a complex, interconnected structure.

Graph Database Use Cases

Because graph databases are well suited to data with complex relationships, they are used across many domains. Some common use cases include:

Social networks

Graph databases can be used to store and query data about relationships between people, such as friendships, family relationships, and professional connections. This can be used to build social networking platforms, recommendation engines, and other applications.
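
A typical social-network query is "people you may know": users two hops away, together with the mutual friends that connect them. The sketch below assumes a small, made-up friendship graph stored as a dictionary of sets; a graph database would run the same two-hop traversal over its stored edges.

```python
# Hypothetical undirected friendship graph: person -> set of friends.
friends = {
    "ann": {"ben", "cara"},
    "ben": {"ann", "dan"},
    "cara": {"ann", "dan", "eve"},
    "dan": {"ben", "cara"},
    "eve": {"cara"},
}

def friends_of_friends(graph, person):
    """Map each person two hops away to the mutual friends that connect them."""
    direct = graph[person]
    suggestions = {}
    for friend in direct:
        for candidate in graph[friend]:
            if candidate != person and candidate not in direct:
                suggestions.setdefault(candidate, set()).add(friend)
    return suggestions

result = friends_of_friends(friends, "ann")
for name in sorted(result):
    print(name, sorted(result[name]))
# dan ['ben', 'cara']
# eve ['cara']
```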

Fraud detection

Graph databases can be used to identify patterns of fraudulent activity by analyzing the relationships between entities such as individuals, businesses, and transactions.
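
One common way to look for such patterns, sketched here under simplifying assumptions, is to link accounts that share identifiers (a phone number, a device, an address) and then find the connected groups. The account data and the function name `linked_groups` are hypothetical, and a group of linked accounts is only a lead for further review, not proof of fraud.

```python
from collections import defaultdict

# Hypothetical records: each account and the identifiers it uses.
accounts = {
    "acct1": {"phone:555-0101", "device:A"},
    "acct2": {"phone:555-0101", "device:B"},
    "acct3": {"device:B", "addr:12 Elm St"},
    "acct4": {"phone:555-0199", "device:C"},
}

def linked_groups(records):
    """Group accounts that are connected through any chain of shared identifiers."""
    # Build an undirected graph with two node types: accounts and identifiers.
    adjacency = defaultdict(set)
    for account, identifiers in records.items():
        for ident in identifiers:
            adjacency[account].add(ident)
            adjacency[ident].add(account)

    seen, groups = set(), []
    for account in records:
        if account in seen:
            continue
        # Depth-first traversal collects everything connected to this account.
        stack, component = [account], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in records:        # keep accounts, skip identifier nodes
                component.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(sorted(component))
    return groups

groups = linked_groups(accounts)
suspicious = [group for group in groups if len(group) > 1]
print(suspicious)  # [['acct1', 'acct2', 'acct3']]
```

Note that acct1 and acct3 share no identifier directly; they are linked only through acct2, which is the kind of indirect connection a traversal surfaces.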

Recommendation engines

Graph databases can be used to store and query data about users and their interactions with products or content. This can be used to build recommendation engines that suggest products or content to users based on their interests and past behavior.
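
A simple graph-style recommendation walks from a user to the items they bought, on to other users who bought the same items, and then to items those users bought. The sketch below scores each new item by the number of such paths; the purchase data and the `recommend` function are made up for illustration, and production systems typically use more elaborate scoring.

```python
from collections import Counter

# Hypothetical purchase graph: user -> set of items bought.
purchases = {
    "u1": {"laptop", "mouse"},
    "u2": {"laptop", "mouse", "keyboard"},
    "u3": {"mouse", "monitor"},
    "u4": {"desk"},
}

def recommend(user, data, top_n=3):
    """Rank items the user lacks by the number of user -> item -> user -> item paths."""
    owned = data[user]
    scores = Counter()
    for other, items in data.items():
        if other == user:
            continue
        overlap = len(owned & items)   # shared items = paths connecting the two users
        if overlap:
            for item in items - owned:
                scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("u1", purchases))  # ['keyboard', 'monitor']
```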

Supply chain management

Graph databases can be used to store and query data about the relationships between different entities in a supply chain, such as suppliers, manufacturers, and retailers. This can be used to optimize logistics and supply chain management.
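
For example, an impact analysis asks which entities are affected if one supplier fails. The sketch below runs a breadth-first traversal over a small, hypothetical supply graph in which each edge points from a supplier to its customer.

```python
from collections import deque

# Hypothetical supply graph. Edge direction: supplier -> customer.
supplies = {
    "ore_mine": ["steel_mill"],
    "steel_mill": ["frame_factory", "bolt_factory"],
    "bolt_factory": ["frame_factory"],
    "frame_factory": ["bike_retailer"],
    "bike_retailer": [],
}

def downstream_impact(graph, start):
    """Return every entity reachable from start, with its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for customer in graph.get(node, []):
            if customer not in distance:
                distance[customer] = distance[node] + 1
                queue.append(customer)
    del distance[start]   # report only the affected entities, not the origin
    return distance

impact = downstream_impact(supplies, "steel_mill")
print(impact)  # {'frame_factory': 1, 'bolt_factory': 1, 'bike_retailer': 2}
```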


Bioinformatics

Graph databases can be used to store and query data about the relationships between different biological entities, such as genes, proteins, and diseases. This can be used to study the relationships between different biological processes and to develop new drugs and treatments.

Graph Database List

Some popular graph databases are:

Neo4j

Neo4j is a widely used open-source graph database that is optimized for storing and querying large amounts of data. It is written in Java and supports ACID transactions, making it suitable for use in enterprise applications.

JanusGraph

JanusGraph is an open-source, distributed graph database that runs on pluggable storage and index backends such as Apache Cassandra and Elasticsearch. It is designed to handle large-scale graph data and is optimized for high performance and scalability.

Amazon Neptune

Amazon Neptune is a fully managed graph database service that is optimized for storing and querying graph data. It is designed to be easy to use and is backed by the reliability and security of the Amazon Web Services (AWS) cloud platform.

OrientDB

OrientDB is an open-source, multi-model database that supports graph, document, key-value, and object data models. It is designed to be scalable and efficient, and it supports ACID transactions and an SQL-like query language.

ArangoDB

ArangoDB is an open-source, multi-model database that supports graph, document, and key-value data models. It is designed to be flexible and easy to use, and it supports ACID transactions and a powerful query language.

Graph Database Properties

Graph databases have several properties that make them unique and well-suited for storing and querying data that has complex relationships:

Flexible data model

Graph databases use a flexible data model that allows for the representation of complex relationships between entities. This makes it easy to store and query data that has many different types of relationships and connections.

Efficient querying

Graph databases are optimized for efficient querying of data, particularly when it comes to traversing relationships between entities. This makes it easy to find and retrieve data about specific entities and their relationships with other entities.


Scalability

Graph databases are designed to scale well as the size of the data increases. This makes them suitable for storing and querying large amounts of data.

ACID transactions

Many graph databases support ACID (Atomicity, Consistency, Isolation, Durability) transactions, which ensure that data is stored and accessed in a consistent and reliable manner. This is important for applications that require high levels of data integrity and reliability.
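
To illustrate just the atomicity part, the sketch below wraps a multi-step graph update in a hypothetical `transaction` helper that restores a snapshot if any step fails. It is a toy in-memory model; real graph databases implement transactions with logging and locking or versioning rather than full copies, and they also provide the isolation and durability guarantees that this example does not attempt.

```python
import copy
from contextlib import contextmanager

# Hypothetical graph: two account nodes and a list of payment edges.
graph = {
    "nodes": {"acct_a": {"balance": 100}, "acct_b": {"balance": 0}},
    "edges": [],
}

@contextmanager
def transaction(g):
    """Apply all changes made in the block, or none of them if an exception is raised."""
    snapshot = copy.deepcopy(g)
    try:
        yield g
    except Exception:
        g.clear()
        g.update(snapshot)   # roll back to the state before the block
        raise

def transfer(g, src, dst, amount):
    with transaction(g):
        g["nodes"][dst]["balance"] += amount          # step 1: credit the target
        if g["nodes"][src]["balance"] < amount:
            raise ValueError("insufficient funds")    # failure after step 1
        g["nodes"][src]["balance"] -= amount          # step 2: debit the source
        g["edges"].append((src, "PAID", dst, amount)) # step 3: record the edge

transfer(graph, "acct_a", "acct_b", 30)       # succeeds
try:
    transfer(graph, "acct_a", "acct_b", 500)  # fails midway and is rolled back
except ValueError:
    pass

print(graph["nodes"])  # {'acct_a': {'balance': 70}, 'acct_b': {'balance': 30}}
print(graph["edges"])  # [('acct_a', 'PAID', 'acct_b', 30)]
```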

High performance

Graph databases are optimized for high performance and can handle a large number of queries and updates in real-time. This makes them suitable for use in high-traffic applications.

Graph Database Vs. Relational Database

Graph databases are based on graph theory and are designed to store and process complex relationships and connections between data. They are particularly well suited for data that is hierarchical, connected, or has complex relationships.

Relational databases are based on the relational model and are designed to store and process structured data. They are particularly well suited for data that can be organized into tables with well-defined relationships between the rows and columns.
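
The contrast can be sketched with the same question asked both ways: "who reports, directly or indirectly, to ann?" The relational version uses Python's built-in `sqlite3` module and a recursive query that repeatedly joins a table to itself; the graph version follows references from node to node. The `manages` table and the sample names are hypothetical.

```python
import sqlite3

edges = [("ann", "ben"), ("ben", "cara"), ("cara", "dan")]

# Relational form: relationships are rows, followed by joining the table to itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE manages (manager TEXT, report TEXT)")
conn.executemany("INSERT INTO manages VALUES (?, ?)", edges)
rows = conn.execute(
    """
    WITH RECURSIVE chain(name) AS (
        SELECT report FROM manages WHERE manager = ?
        UNION
        SELECT m.report FROM manages AS m JOIN chain ON m.manager = chain.name
    )
    SELECT name FROM chain ORDER BY name
    """,
    ("ann",),
).fetchall()
sql_result = [name for (name,) in rows]
conn.close()

# Graph form: each node holds direct references to its neighbors.
adjacency = {}
for manager, report in edges:
    adjacency.setdefault(manager, []).append(report)

def all_reports(graph, start):
    """Collect every node reachable from start by following edges."""
    found, stack = [], [start]
    while stack:
        for report in graph.get(stack.pop(), []):
            if report not in found:
                found.append(report)
                stack.append(report)
    return sorted(found)

graph_result = all_reports(adjacency, "ann")
print(sql_result)    # ['ben', 'cara', 'dan']
print(graph_result)  # ['ben', 'cara', 'dan']
```

Both approaches give the same answer here; the difference is in how the relationship is stored and followed, which tends to matter more as traversals get deeper.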

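The key difference lies in how relationships are represented: a graph database stores them directly as edges between nodes, whereas a relational database expresses them through foreign keys and joins between tables.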

In general, graph databases and relational databases each have their own strengths and weaknesses, and the choice between the two will depend on the specific needs of your application.


Graph Database Use Cases & Real-life Examples

Graph databases are incredibly flexible. Companies such as Walmart and eBay recognized early on the competitive advantage graph technology could provide, simplifying the complexities of online customer behaviour and the relationships between customer and product data. 

If you’ve only recently been introduced to graph databases, you may not even realize how prevalent the use of graph technology already is across every industry. As the trend continues, new applications of it continue to emerge. 

We have curated a list of graph database use cases and examples so you can explore how graph database systems are put to use in real-world applications. Unleash your imagination!


  • Is graph technology the fuel that’s missing for data-based government?
  • 5 graph technology applications in healthcare
  • Graph database: how graph is being utilised for data analytics
  • How graph databases improve fraud detection
  • ICIJ turns to big data tech to unravel FinCEN Files
  • Graph database use in logistics
  • Do graphs have the power to stop coronavirus-related cybercrime?
  • AI use cases in healthcare for COVID-19 and beyond
  • Your life is a graph, look at it that way
  • Open source code is on the front lines of COVID-19 fight
  • Graph database use in fraud detection




  1. Graph Database Case Studies

    Graph Database Case Studies Basecamp Research Knowledge Graph See Case Study→ Novo Nordisk Novo Nordisk: Leveraging the Graph for Clinical Trials and Standards Knowledge Graph See Case Study→ CSX CSX: Building a Digital Twin of the Largest Railroad in the Eastern U.S. with Neo4j Digital Twin See Case Study→ Scoutbee

  2. Graph database use cases (10 examples)

    Here's a list of the ten most prominent use-cases for Graph Databases. The Ten Most Common Graph Database Use-cases You Should Know There is a good reason why the world's forerunner-businesses are increasingly using Graph databases.

  3. Graph Database Use Cases & Solutions: Where to Use a Graph Database

    Graph Database Use Cases & Solutions: Where to Use a Graph Database When Connected Data Matters Most Early graph innovators have already pioneered the most popular use cases - fraud detection, personalization, customer 360, knowledge graphs, network management, and more.

  4. PDF 17 Use Cases for Graph Databases and Graph Analytics

    Introduction Let's say you have to perform social network analysis, uncover fraudulent bank transactions, or provide product recommendations. Often, discovering the answer to each of these questions can be complicated and possibly time consuming too. But with graph database, you can view the data landscape in a completely new way.

  5. Neo4j Graph Database: Use Cases and Real-life Examples

    Here's an example of a simple graph data model in Neo4j: As you can see, this graph contains two nodes (Alice and Bob) that are connected by relationships. Both nodes share the same label, Person. In the graph, only Bob's node has properties, but in Neo4j every node and relationship can have properties.

  6. Supply Chain Graph Database Use Cases: Data Management & Visualization

    Supply Chain Graph Database Use Cases: Data Management & Visualization Use Case: Supply Chain Future-Proof Your Supply Chain With Graph Technology Enable true end-to-end supply chain visibility and build a more agile, resilient supply network.

  7. An overview of graph databases and their applications in the biomedical

    Database (Oxford). 2021; 2021: baab026. Published online 2021 May 18. doi: 10.1093/database/baab026 PMCID: PMC8130509 PMID: 34003247 An overview of graph databases and their applications in the biomedical domain Santiago Timón-Reina, Mariano Rincón, and Rafael Martínez-Tomás

  8. Graph Databases for Beginners: An Introduction to Graph Databases

    Graph Databases for Beginners: Why Graph Technology Is the Future Neo4j Jan 19 7 mins read The world of graph technology has changed (and is still changing), so we're rebooting our "Graph Databases for Beginners" series to reflect what's new in the world of graph tech - while also helping newcomers catch up to speed with the graph paradigm.

  9. Top 10 Use Cases: Knowledge Graphs

    Top 10 Use Cases: Knowledge Graphs Jim Webber, Chief Scientist, Neo4j Feb 01, 2021 2 mins read Graph technology is the future. Not only do graph databases effectively store relationships between data points, but they're also flexible in adding new kinds of relationships or adapting a data model to new business requirements.

  10. Introducing Graph Databases For Dummies

    There are good reasons for the emergence of new data models. Document databases optimize for ease of storage and retrieval with a file cabinet metaphor of document-in, document-out. Column store databases optimize for scale and the ability to scan many records rapidly.

  11. Machine Learning within a Graph Database: A Case Study on Link

    This paper prototypes on a state-of-the-art graph database (Neo4j) an in-database ML-driven case study on link prediction, and identifies bulk feature calculation as the most time consuming task, at both the model building and inference stages, and defines it as a focus area for improving how graph databases support ML workloads. In the combination of data management and ML tools, a common ...

  12. PDF Machine Learning within a Graph Database: A Case Study on Link

    In this work we aim to identify how a general case study might look like. We also seek to evaluate from a practical perspective, the pipeline of running classifiers on graph databases,...

  13. Graph Database Use Cases

    By Keith D. Foote on March 23, 2023 One of the primary advantages of using a graph database is the ability to present the relationships that exist between datasets and files. Much of the data is connected, and graph database use cases are increasingly helping to find and explore these relationships and develop new conclusions.

  14. Graph Databases Use Cases

    In many cases, we will be able to unify data into one location, especially to optimize for query performance and data fit. Graph databases offer exactly that type of data/performance fit, as we will see below. In this paper, we discuss why your master data is a graph and how graph databases like Neo4j are the best technologies for master data.

  15. The ultimate guide to graph databases

    Graph databases are relatively new technology. Neo4j, Inc, creators of the Neo4J database and the first graph database company, was founded in 2000. Graph databases themselves only started gaining commercial acceptance in the early 2010s. Learning a new query language sounds like a hassle.

  16. Graph4Med: a web application and a graph database for visualizing and

    We developed a suitable graph data schema to convert the relational data into a graph data structure and store it in Neo4j. ... Storniolo P, Urso A. Integrated DB for bioinformatics: a case study on analysis of functional effect of MiRNA SNPs in cancer. In: International conference on information technology in bio-and medical informatics ...

  17. Digital Contact Tracing Based on a Graph Database Algorithm for

    Digital Contact Tracing Based on a Graph Database Algorithm for Emergency Management During the COVID-19 Epidemic: Case Study JMIR Mhealth Uhealth. 2021 Jan 22 ... The graph database algorithm is compatible with this platform and can analyze multisource and heterogeneous big data related to the epidemic. These practices were used to quickly and ...

  18. Case Study: Graph Databases Help Track Ill-Gotten Assets

    Case Study: Graph Databases Help Track Ill-Gotten Assets (The New Stack). If you want to find oligarchs' dirty money, or reveal connections hidden in any data, you will need a graph, not a map. Jun 12th, 2023, by Joe Fay.

  19. 247 Graph Database Case Studies

    247 Graph Database Case Studies | Graph Database Software Companies. ArangoDB: ArangoDB, Inc. is a U.S.-based privately held company founded in Cologne in 2014 to meet the increasing demand for professional services around ArangoDB. ArangoDB can operate as a distributed &…

  20. Graph Database (Use Cases, Examples and Properties)

    A graph database is a type of database that uses graph structures with nodes, edges, and properties to represent and store data. A relational database, on the other hand, is a type of database that stores data in the form of tables with rows and columns. Each row in a table represents a record, and each column represents a field within that record.

  21. Graph Database Use Cases & Real-life Examples

    Graph Database Use Cases & Real-life Examples. Graph databases are incredibly flexible. Companies such as Walmart and eBay recognized early on the competitive advantage graph technology could provide, simplifying the complexities of online customer behaviour and the relationships between customer and product data.

  22. Framework for constructing multimodal transport networks and routing

    As a case study, a database was constructed for London transport networks, and routing tests were performed under various conditions. The constructed multimodal graph database showed stable performance in processing iterative queries, and efficient multi-stop routing was particularly enhanced.

  23. graph database case study

    Case Study: Graph Databases Help Track Ill-Gotten Assets. If you want to find oligarchs' dirty money — or reveal connections hidden in any data... Graph database has promised many advantages over relational databases: fast query, flexible schema, efficient storage, etc.

  24. AOPWIKI-EXPLORER: An Interactive Graph-based Query Engine ...

    In addition, the solution provides both database-specific query and natural language query interfaces to support retrieval and integration of the graph data, including the ability to chain stepwise queries. To evaluate the platform, a case study is presented, with three levels of use case scenarios (simple, moderate, and complex query).

  25. Vibration characteristics, attenuation law and prediction method in the

    To compare the near- and far-field vibration velocity prediction formulas, the near- and far-field vibration velocity fitting curves and near-field monitoring data were plotted in Fig. 5. Fig. 5 shows that the coefficient of determination (R²) of using formula (1) to predict the near-field PPV is 0.792.