JPStream: Generating Scalable JSONPath Processors

Introduction

JPStream is a JSONPath query compilation system. It compiles a given a set of JSONPath queries into automata-based parallel executables with bounded memory footprints. First, it adopts a stream processing design that combines parsing and querying in one pass without generating any in-memory parse tree. This is achieved by jointly compiling the path expressions and JSON syntax into a single automaton with two stacks. Furthermore, it supports parallel processing of single large JSON file, enabled by a set of parallelization techniques customized to the stream processing automata. For more details about JPStream, please refer to our paper [1].

Publication

[1] Lin Jiang, Xiaofan Sun, Umar Farooq, and Zhijia Zhao. Stream Processing of Contemporary Semi-Structured Data on Commodity Parallel Processors - A Compilation-based Approach. In Proceedings of the 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.

Getting Started

JSONPath

JSONPath is the basic query language of JSON data. It refers to substructures of JSON data in a similar way as XPath queries are used for XML data. For the details of JSONPath syntax, please refer to Stefan Goessner's article.

Supported Operators

Operator	Description
`$`	root object
`@`	current object filtered by predicate
`.`	child object
`[]`	child array
`*`	wildcard, all objects or array members
`..`	decendant elements
`[index]`	array index
`[start:end]`	array slice operator
`[?(<expression>)]`	filter expression for evaluation

Operators Not Supported

Operator	Description
`[index1, index2, ...]`	multiple array indexes
`[-start:-end]`	last few array elements
`$..[*]`	get all arrays
`()`	script expression, using underlying script engine

Path Examples

Consider a piece of simplified Google route data

{
    "routes": [ {
        "steps": [
            {
                "loc": {
                    "lat": 32,
                    "lng": -107
                },
                "dis": {
                    "text": "92 ft",
                    "value": 28
                }
            },
            {
                "loc": {
                    "lat": 35,
                    "lng": -106
                }
            }
        ] }
    ]
}

JsonPath	Result
`$.routes[].steps[]`	all steps of each route
`$.routes[].steps[].*`	all things in steps of each route
`$..loc`	all locations
`$.routes[*]..loc`	location in each route
`$.routes[*].steps[2].dis`	distance of the third step in each route
`$.routes[0:2]`	first two routes
`$.routes[*].steps[?(@.loc&&@.dis)]`	filter all steps of each route with location and distance
`$.routes[*].steps[?(@.loc.lat==32)]`	filter all steps of each route with location at 32 degrees latitude
`$..*`	everything in JSON structure

Build

There are some requirements for building this system: cmake 3.12+ and gcc 4.8.5.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

There are several running cases in "query_test.c" under the folder demo/. New cases can be added there. To run it, after building the system, execute the following commands:

cd build/bin
./query_test

Demo

For serial execution:

    char* file_path = "../../dataset/bb.json";
    PathProcessor* path_processor = createPathProcessor("$.root.products[*].categoryPath[1:3]");
    Output* output = serialRun(path_processor, file_path);

For parallel execution:

    char* file_path = "../../dataset/bb.json";
    PathProcessor* path_processor = createPathProcessor("$.root.products[*].categoryPath[1:3]");
    int num_core = 16;
    Output* output = parallelRun(path_processor, file_path, num_core);

For parallel execution with data constraint learning (more efficient):

    PathProcessor* path_processor = createPathProcessor("$.root.products[*].categoryPath[1:3]");
    
    //collecting data constraints is optional, but can often make parallel execution more efficient
    char* train_file_path = "../../dataset/bb.json";
    ConstraintTable* ct = collectDataConstraints(path_processor, train_file_path);
    
    //parallel exeuction
    char* input_file_path = "../../dataset/bb.json";
    int num_core = 16;
    Output* output = parallelRunOpt(path_processor, input_file_path, num_core, ct);

Code Structure

JSONPath Parser

Implementations are provided in JSONPathParser/ directory:

jsonpath_evaluator.h: run a json path filtering on a predicator
jsonpath_model.h: the data structure of a json path
jsonpath_parser.h: major API of parsing a json path query
parser_j.y: bison specification for json path
scanner_j.l: flex specification for json path

DFA Builder

Implementations are provided in DFABuilder/ directory:

include/dfa.h: basic DFA data structure
include/dfa_builder.h: a list of functions that helps you create DFA

JSONPath Query Execution

Implementations are provided in JSONQuery/ directory:

utility.h: basic functions for string operations.
tuple_list.h: 2-tuple list generated by serial streaming automaton or parallel streaming automata.
output.h: list that saves the final results generated by predicate filtering.
stack.h: syntax stack, query stack and current state information.
multi_stack.h: query stack with multiple paths.
lexing.h: lexer for input JSON stream.
streaming_automaton.h: serial streaming automaton.
parallel_automata_execution.h: parallel streaming automata.
file_partitioner.h: separate input stream into several chunks.
worker_automaton.h: worker automaton used for each thread during parallel execution.
unit.h: data units in worker automaton.
predicate.h: predicate filter.
path_processor.h: interfaces for path processors (high level API).

Internal API

JSONPath Parser

ASTNode* analysisJSONPath(const char* data): parse the json path string into an AST format
void printJsonPathAST(JSONPathNode* root): print the tree structure

DFA Builder

JSONQueryDFA* buildJSONQueryDFA(const char* json_path, JSONQueryDFAContext* context): Create a JSON Query DFA from a JSON Path string
JSONQueryDFA* buildJSONQueryDFAFromAST(JSONPathNode* json_path, JSONQueryDFAContext* context): Create a JSON Query DFA from a JSON Path AST
JSONPathNode* getContextSubtree(JSONQueryDFAContext* ctx, int stop_state): get the subtree that a stop state is connected
int getContextSizeOfMapping(JSONQueryDFAContext* ctx, int stop_state): get the size of mapping
int getContextValueOfMapping(JSONQueryDFAContext* ctx, int stop_state, int idx): get the content of one mapping. Idx is the number from range (0 ~ size-1) that you get from the previous API.

Serial Streaming Automaton

void initStreamingAutomaton(StreamingAutomaton* sa, JSONQueryDFA* qa): Initialize streaming automaton based on query automaton.
void destroyStreamingAutomaton(StreamingAutomaton* sa): Free dynamic memory spaces allocated by streaming automaton.
void executeAutomaton(StreamingAutomaton* sa, char* json_stream, int data_constraint_flag): Execute streaming automaton based on input stream. To generate data constraint table, data_constraint_flag should be OPEN (otherwise it should be CLOSE).

File Loading and Partitioning

char* loadInputStream(char* file_name): Load input stream into memory.
char* loadBoundedInputStream(char* file_name, int* start_pos): Load the next available input chunk with bounded memory footprint, start_pos records the starting position of the next available chunk, the initial value is 0.
PartitionInfo partitionInputStream(char* input_stream, int num_core): Split the loaded input stream into several chunks.
PartitionInfo partitionFile(char* file_name, int num_core): Load and partition input stream into several chunks.

Parallel Streaming Automata

void initParallelAutomata(ParallelAutomata* pa, JSONQueryDFA* qa): Initialize parallel streaming automata based on query automaton.
void destroyParallelAutomata(ParallelAutomata* pa): Free dynamic memory spaces allocated by parallel streaming automata.
void executeParallelAutomata(ParallelAutomata* pa, PartitionInfo par_info, ConstraintTable* ct): Execute parallel streaming automata based on partitioned chunks. Data constraint table can be used for runtime optimization.

Predicate Filtering

void initPredicateFilter(PredicateFilter* pf, TupleList* tl, JSONQueryDFAContext* ctx): Initialize predicate filter based on 2-tuple list and query automaton.
void destroyPredicateFilter(PredicateFilter* pf): Free dynamic memory spaces allocated by predicate filtering component.
Output* generateFinalOutput(PredicateFilter* pf): Run predicate filter component and generate final output list.

Usage of Internal API

Loading Input Stream

To load input stream without partitioning:

    //load input stream
    char* stream = loadInputStream("../../dataset/wiki.json");
    
    //load input stream with bounded memory
    int start_pos = 0;
    char* stream = loadInputStream("../../dataset/wiki.json", &start_pos);

To load input stream with partitioning:

    int num_core = 16;
    PartitionInfo pInfo = partitionFile("../../dataset/bb.json", num_core);
    int num_chunk = pInfo.num_chunk;
    char** stream = pInfo.stream;

Generating Deterministic Finite Automaton (DFA) for JSONPath query

    char* path = "$.root.products[?(@.sku&&@.productId)].categoryPath[?(@.name)].id";
    JSONQueryDFAContext* ctx = (JSONQueryDFAContext*)malloc(sizeof(JSONQueryDFAContext));
    JSONQueryDFA* dfa = buildJSONQueryDFA(path, ctx);

Getting Executables and Run

For serial streaming automaton,

    //create streaming automaton
    StreamingAutomaton sa;
    initStreamingAutomaton(&sa, dfa);

    //run streaming automaton without generating data constraints
    executeAutomaton(&sa, stream, CLOSE);
    
    //run streaming automaton and generate data constraints
    executeAutomaton(&sa, stream, OPEN);
    
    //get results generated from streaming automaton
    TupleList* tl = sa.tuple_list;

For parallel streaming automata,

    //create parallel streaming automata 
    ParallelAutomata pa;
    initParallelAutomata(&pa, dfa);
    
    //run parallel automata without data constraint learning
    executeParallelAutomata(&pa, pInfo, NULL);
    
    //run parallel automata with data constraint learning
    executeParallelAutomata(&pa, pInfo, sa.constraint_table);
    
    //get results generated from parallel automata
    TupleList* tl = pa.tuple_list;

Filtering Results for JSONPath with Predicates

    PredicateFilter pf;
    initPredicateFilter(&pf, tl, ctx);
    Output* final = generateFinalOutput(&pf);