Parsers are an essential part of any programming language. Many open-source parsers are readily available for Java, and developers can usually pick one that fits their requirements, but sometimes the one you need does not exist. In that case, developers often have to create a custom parser in Java. Besides unavailability, other common reasons for developing a custom parser are performance problems or flaws in the available parsers, or their inability to meet all the requirements.
In this article, we will look at various ways to build parsers in Java yourself using available tools and Java libraries.
Before diving into details, let’s first understand what the terms ‘parsing’ and ‘parser’ mean.
Parser and parsing
Generally, parsing can be defined as the process of breaking down a block of data into smaller parts based on certain pre-defined rules. These parts are then interpreted, modified, or managed as per the requirement of the developers.
A parser is a program that performs this breakdown of data into smaller parts. It is usually composed of two parts: a lexer, also called a scanner or tokenizer, and the parser proper. The two work in sequence: the lexer scans the input data and produces a list of tokens, and the parser then scans the tokens and produces the parsing result.
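The two stages described above can be sketched in a few lines of Java. This is only a minimal illustration, assuming a toy language of whole numbers joined by "+"; the names LexerSketch, Token, tokenize, and parseSum are hypothetical and do not come from any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: all names here are illustrative only.
class LexerSketch {

    // A token pairs a category (type) with the text that was matched.
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    // Lexer stage: scan the characters and produce a list of tokens.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens but is not a token itself
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token("NUMBER", input.substring(start, i)));
            } else if (c == '+') {
                tokens.add(new Token("PLUS", "+"));
                i++;
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    // Parser stage: scan the tokens and produce a result (here, the sum).
    // Assumed rule for this sketch: sum = NUMBER { PLUS NUMBER }
    static int parseSum(List<Token> tokens) {
        int pos = 0;
        int total = expectNumber(tokens, pos++);
        while (pos < tokens.size()) {
            Token op = tokens.get(pos++);
            if (!op.type.equals("PLUS")) {
                throw new IllegalArgumentException("Expected PLUS but found " + op);
            }
            total += expectNumber(tokens, pos++);
        }
        return total;
    }

    private static int expectNumber(List<Token> tokens, int pos) {
        if (pos >= tokens.size() || !tokens.get(pos).type.equals("NUMBER")) {
            throw new IllegalArgumentException("Expected NUMBER at token index " + pos);
        }
        return Integer.parseInt(tokens.get(pos).text);
    }
}
```

Calling LexerSketch.parseSum(LexerSketch.tokenize("12 + 30")) runs both stages in order: the lexer turns the text into three tokens, and the parser only ever looks at those tokens, never at the raw characters.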
Rules and grammars
The definitions used by lexers or parsers are called rules or productions, and together these rules make up a grammar. In simple terms, a grammar is the list of rules that define how each construct or line of code should be composed. For example, a rule for an if statement must specify that it starts with the “IF” keyword, followed by a left parenthesis, an expression, a right parenthesis, and a statement.
If the code or text does not conform to the rules, the parser rejects it and reports a syntax error.
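As a rough illustration, the if-statement rule described above can be checked against a list of tokens in Java. The sketch makes strong simplifying assumptions that are not part of any real grammar: an expression is a single identifier, a statement is a single identifier followed by ";", and the input is already split into tokens. The class name IfRuleSketch and its methods are hypothetical.

```java
import java.util.List;

// Hypothetical sketch. Rule checked:
//   ifStatement = "IF" "(" expression ")" statement
// Simplifying assumptions: an expression is one identifier,
// and a statement is one identifier followed by ";".
class IfRuleSketch {
    private final List<String> tokens;
    private int pos = 0;

    IfRuleSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // Returns normally if the tokens match the rule; otherwise throws
    // an exception describing the syntax error.
    static void check(List<String> tokens) {
        IfRuleSketch parser = new IfRuleSketch(tokens);
        parser.ifStatement();
        if (parser.pos != tokens.size()) {
            throw new IllegalArgumentException("Syntax error: unexpected token '"
                    + tokens.get(parser.pos) + "' at position " + parser.pos);
        }
    }

    // One method per rule: each step consumes exactly what the rule requires.
    private void ifStatement() {
        expect("IF");
        expect("(");
        identifier("an expression");
        expect(")");
        identifier("a statement");
        expect(";");
    }

    private String current() {
        return pos < tokens.size() ? tokens.get(pos) : "end of input";
    }

    private void expect(String expected) {
        if (!expected.equals(current())) {
            throw new IllegalArgumentException("Syntax error: expected '" + expected
                    + "' but found '" + current() + "' at position " + pos);
        }
        pos++;
    }

    private void identifier(String role) {
        String token = current();
        boolean isIdentifier = pos < tokens.size()
                && token.matches("[A-Za-z_][A-Za-z0-9_]*")
                && !token.equals("IF"); // the keyword is not an identifier
        if (!isIdentifier) {
            throw new IllegalArgumentException("Syntax error: expected " + role
                    + " but found '" + token + "' at position " + pos);
        }
        pos++;
    }
}
```

With this sketch, the token list IF ( ready ) run ; is accepted, while IF ready ) run ; is rejected because the left parenthesis required by the rule is missing.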
Ways to parse in Java
If you need to parse a document in Java, there are primarily three ways to do it.
- As mentioned earlier, use an existing library that supports the specific language you want to parse: for instance, a library to parse XML.
- If you cannot find an existing parser, use a tool or library to build one: for example, ANTLR, which can be used to build parsers for Java or many other languages.
- Lastly, write your own custom parser from scratch, tailored to your requirements.
Using an Existing Library for parsing
This is a good option for parsing well-known and widely supported languages, such as XML or HTML. A good library usually also includes an API to create and modify documents in that language, an additional feature that a basic parser does not provide. The limitation is that such libraries are not that common, and they support only the most common languages.
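For XML, the JDK itself ships such a library. The following minimal sketch uses the standard DOM API (javax.xml.parsers and org.w3c.dom) to parse a string and then modify the resulting document through the same API. The class name XmlLibrarySketch, the method parseAndTag, and the attribute and element it adds are hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch built on the JDK's standard DOM parser.
class XmlLibrarySketch {

    // Parses an XML string, then uses the same library to modify the document.
    static Document parseAndTag(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Modification: set an attribute on the root element...
            Element root = doc.getDocumentElement();
            root.setAttribute("checked", "true");

            // ...and append a new child element.
            Element note = doc.createElement("note");
            note.setTextContent("added by the sketch");
            root.appendChild(note);
            return doc;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed XML surfaces here as a SAXException.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

The parsing itself is a single call to builder.parse; everything after it shows the create-and-modify side of the API that a bare parser would not offer.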
Building your own custom Java parser
You may need to go with this option if you have very particular needs, meaning that the language you need to parse cannot be handled by the available parsers and the parser-generating tools lack the features you require. There could be several reasons for this, such as needing the best possible performance or a deep integration between different components. It is, however, the most difficult and time-consuming option of the three.
You would also need strong skills in writing parsers in Java, or a developer who can write one for you.
Using existing tools to write a Parser
Of the three options, this is the most widely used and is generally considered the best way to build parsers in Java. Because it is the most flexible and least time-consuming option, the rest of this article concentrates mostly on these tools and libraries.
Libraries and tools to build parsers in Java
To start with, tools that generate the code for a parser are known as parser generators, while libraries used to build parsers directly in code are called parser combinators. Another tool you may need when writing parsers in Java is a lexer (lexical analyzer) generator, which produces lexers that analyze regular languages.
The following are different tools that can be used to write a Java parser:
JFlex is a lexer generator based on deterministic finite automata (DFA). It matches the input according to a defined grammar, called a spec, and executes the corresponding action. It can be used as a standalone tool, but as a lexer generator it is designed to work with parser generators: it is typically used with CUP or ANTLR (both are discussed below).
The spec (grammar) is divided into three parts, separated by the ‘%%’ symbol:
- user code, which will be included in the generated class,
- options and macro declarations,
- lastly, the lexer rules.
ANTLR is one of the most widely used generators of context-free parsers in Java. Besides Java, ANTLR can generate parsers in numerous other languages.
A typical ANTLR grammar is divided into two parts: lexer rules and parser rules. The division is implicit, as all the rules starting with an uppercase letter are lexer rules whereas the ones starting with a lowercase letter are parser rules. Lexer and parser grammars can also be defined in separate files.
Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. An attributed grammar is one in which the rules are written in an EBNF variant and can be annotated in several ways to change the methods of the generated parser.
Coco/R has good documentation, with numerous example grammars available for better understanding. Besides Java, it also supports C# and C++.
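To give a feel for what a recursive descent parser looks like, here is a small hand-written sketch in Java. It is not output generated by Coco/R; it only shows the general shape of the technique, in which each grammar rule becomes one method and the methods call each other the way the rules reference each other. The grammar, the class name ExprParserSketch, and the choice to read characters directly instead of using a separate scanner are all assumptions made to keep the example short.

```java
// Hypothetical hand-written sketch, not generated code.
// Grammar (EBNF):
//   expr   = term   { ("+" | "-") term } .
//   term   = factor { ("*" | "/") factor } .
//   factor = number | "(" expr ")" .
class ExprParserSketch {
    private final String input;
    private int pos = 0;

    ExprParserSketch(String input) {
        this.input = input;
    }

    // Parses the whole input and returns the value of the expression.
    static int evaluate(String input) {
        ExprParserSketch parser = new ExprParserSketch(input);
        int value = parser.expr();
        parser.skipSpaces();
        if (parser.pos != input.length()) {
            throw parser.error("unexpected input");
        }
        return value;
    }

    // expr = term { ("+" | "-") term }
    private int expr() {
        int value = term();
        while (true) {
            if (accept('+')) {
                value += term();
            } else if (accept('-')) {
                value -= term();
            } else {
                return value;
            }
        }
    }

    // term = factor { ("*" | "/") factor }
    private int term() {
        int value = factor();
        while (true) {
            if (accept('*')) {
                value *= factor();
            } else if (accept('/')) {
                value /= factor(); // integer division
            } else {
                return value;
            }
        }
    }

    // factor = number | "(" expr ")"
    private int factor() {
        if (accept('(')) {
            int value = expr();
            if (!accept(')')) {
                throw error("expected ')'");
            }
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw error("expected a number");
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    // Consumes the given character if it is next (after whitespace).
    private boolean accept(char c) {
        skipSpaces();
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException("Syntax error at index " + pos + ": " + message);
    }
}
```

Because term is called from expr and factor from term, multiplication and division bind more tightly than addition and subtraction without any explicit precedence table. A generator spares you from writing these methods by hand and keeps them in sync with the grammar.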
CUP stands for Construction of Useful Parsers. It is an LALR (Look-Ahead LR) parser generator for Java. It generates only the parser proper, not the lexer, and is well suited to be used with JFlex. It also comes with an Eclipse plugin to help developers create grammars.
JavaCC is another widely used parser generator for Java. A JavaCC grammar file contains all the actions and the custom code needed to build parsers in Java. Compared to ANTLR, the grammar file is not as clean and includes a significant amount of Java source code.
Because it is used in prominent projects such as JavaParser, its documentation contains some good content. It offers a grammar repository, but the repository does not contain many grammars.
Java libraries that parse Java: JavaParser
In the special case where you want to parse Java code in Java, a library named JavaParser is the best option available. It supports lexical preservation as well as pretty printing, which means you can parse Java code, modify it, and print it back either with its original formatting or pretty printed. It can also be used with JavaSymbolSolver. It supports all versions of Java from 1 to 9, so you do not have to worry about which version your code uses.
Pros and cons of using parsing tools
Many of these tools and libraries started as a thesis or a research project. The bright side is that their makers tend to offer them freely and make them easy to obtain. On the flip side, good documentation on how to use them is often lacking, which is understandable because they were not primarily made for a mass audience. Also, some tools are abandoned without further updates because the original authors stopped maintaining them after finishing their master’s or Ph.D. degrees.
Keeping these points in mind, you should select the right tool based on the information available about it, and go through its documentation before using it for parsing.
Wrapping it up
Parsing in Java is a vast topic, and it differs from the usual world of Java development. Because of this, it would be too much to expect every developer to write a parser in Java from scratch. This is where these tools shine, as they make it much easier for developers to generate a parser in Java.