Why Your Code Is A Graph

Graph structures and how they are used in security code analysis

Graphs structures are a natural representation of many kinds of data. They are a good way to represent relationships between objects, such as the relationship between users on social media sites, and the distance between different locations.

Today, let’s explore graphs and how they are used in security code analysis. Then, we will dive into how we can use graphs to analyze code for vulnerabilities.

A brief intro to graphs

You’ve probably already heard of graph structures and graph databases. In mathematics, graphs are a set of nodes and edges that connects those nodes. Like this graph right here:

Nodes represent objects, and edges that connect objects represent their relationship. For instance, a graph representation of the relationships between users on a social media site can be represented like this:

Edges can be directed, and each edge can have a value, such as a label or a number associated with it to represent the nature of the edge. Using graph structures, we can represent complex relationships and interactions between different objects. For instance, we can represent which users block another user on a website as well.

This is how graph databases work. By representing objects and their relationships in graphs, you can easily find answers to these questions by querying the graph and tracing associated edges: How many users block user C? And How many friends does user B have?

Using graphs to represent code

Relationships in code can also be represented in graphs. For instance, Abstract Syntax Trees (ASTs) are graphs that represent the syntactic structure of program code.

Here is an abstract syntax tree for the code for the Euclidean algorithm taken from https://en.wikipedia.org/wiki/Abstract_syntax_tree.

This pseudocode represents the logic flow of the Euclidean algorithm.

**while** b ≠ 0
  **if** a > b
    a := a − b
  **else**
    b := b − a
**return** a

You can see that this graph translates the program code into a graph structure while preserving individual elements of its syntax.

There are other graph representations of code too. Control Flow Graphs (CFGs) are designed to represent the order in which code is executed and conditions that need to be met for that piece of code to be executed. In a CFG, each node represents instructions of the program while edges represent the flow of control in the program. With CFGs, you can traverse the graph to determine which instructions can be called at various points during the execution. Here are two examples CFGs that represent (a) an if-then-else structure and (b) a while loop.

Another useful graph representation of code is the Program Dependence Graph (PDG). PDGs represent data and control dependencies of a piece of code. In a PDG, data dependency edges specify the influence of a piece of data, and control dependency edges specify the influence of predicates on other nodes. In this context, predicates are logical expressions that evaluate TRUE or FALSE used to direct the execution path. Take a look at this piece of code:

void example(){
  int x = 1;
  int y = 2;
  if (x > 1){
    int z = x + y;
    print(z);
  }
}

It can be represented with the following PDG. In this graph, D(variable) indicates a data dependency node on the variable, and C(predicate) indicates a control dependency node on the predicate. You can see that the integer declaration of z is dependent on x > 1 being true and also dependent on the values of x and y.

Using PDGs, we can see how data travels through the code and if a particular piece of data can reach a specific statement.

As you can see, different graph representations of code are optimized for different purposes. Many security analysis tools use one of these graphs or a combination of these graphs to map out the relationships in code. For instance, for security purposes, PDGs can help us map out how attacker-controlled data can travel to security-sensitive functions and whether the data has to pass through validation or transformation in the process.

But the real power of the graph representation of code comes through when you combine the different representations. The AST, CFG, and PDG all represent the same code from a different perspective, and some properties of the program are easier to express in one representation over the other. Combining these graphs representing different properties of a program allows you to construct a master graph that merges the different perspectives and takes them all into account. This is what the Code Property Graph (CPG) is. By using the CPG, we can express complex behaviors of the same program in a single graph.

Using graphs to automate security code analysis

Once we have a graph representing the program’s behavior, patterns in code can often be expressed as patterns in these graphs. This means that we can search for patterns in the source code by querying the graph.

Vulnerabilities often have common patterns in code. For example, the signature for an XXE vulnerability is passing user-supplied XML to a parser without disabling DTDs or external entities. (More about XXEs here.) By using a CPG, we can look for these precise patterns in code. This makes the CPG an intuitive yet powerful engine of vulnerability discovery since every vulnerability is just a graph traversal away.

Let’s take a look at some examples of querying the CPG for patterns that indicate vulnerabilities. Joern (https://joern.io/) is an open-source tool that utilizes the CPG to conduct source code analysis. Let’s learn a few Joern queries to see how the CPG can simplify the logical process of looking for vulnerabilities in code.

For instance, detecting a command injection vulnerability in a Java application with Joern looks like this. In these Joern shell queries, we first ask the graph for all identifiers with the name that contains “HttpServletRequest” to find the user-controlled HTTP parameters in the program. We then find any calls to the methods exec or eval, the functions in Java that execute system commands. Finally, we ask if that user input can reach the dangerous functions as an input argument. Together, these three queries will identify if any user input is passed into functions that execute arbitrary system commands.

> def source = cpg.identifier.typeFullName(".*HttpServletRequest.*")

> def sink = cpg.call("exec|eval").argument

> sink.reachableBy(source)

Next up, let’s search for patterns that indicate a reflected XSS. The first query here searches for user-supplied URL parameters. The second query searches where the program prints data to HTTP responses. Finally, the query sees if any of the URL parameters get passed into the print statements. Together, these three queries identify if any URL input parameter gets reflected into an HTTP response, the hallmark of reflected XSS vulnerabilities.

> def source = cpg.call.name("getParameter")

> def sink = cpg.call(".*print.*").where(x => x.reachableBy(cpg.identifier.typeFullName(".*HttpServletRequest.*"))).argument

> sink.reachableBy(source)

That is the basis of ShiftLeft’s code analysis platform. We’ve built a tool that translates your code in different languages into CPGs. Then, using the CPG, we traverse your codebase to accurately find vulnerabilities that are reachable by attackers, minimizing false positives and cutting down on scan time. To learn more about the background, invention, and technology of the CPG, read Dr. Fabian Yamaguchi’s post here: https://blog.shiftleft.io/semantic-code-property-graphs-and-security-profiles-b3b5933517c1.

Integrating frequent security scans into today’s fast-paced software development processes is not impossible. Using CPGs as a way of analyzing code means better speed, accuracy, comprehensiveness in tests. If you’re interested in learning more about ShiftLeft’s security platform ShiftLeft CORE, visit us here.

Thanks for reading! What is the most challenging part of developing secure software for you? I’d love to know. Feel free to connect on Twitter @vickieli7.

References

33