# query-interpreter
**README LAST UPDATED: 04-24-25**
This project is under active development and is subject to change often and drastically as I am likely an idiot.
Core program to interpret query language strings into structured data, and back again.
## How to Use the Project
### Microservice for client applications
At least for now, this can be treated like a microservice. Simply query the endpoint(s)
with your SQL strings and get the structured data back.
```
POST /query
body: {
  sql: string
}
```
Be aware that the API does not currently convert enums to their string representations, so you might expect
`JoinType` to be `"INNER"`, but it is returned as `0`. Refer to the [dto](q/dto.go) for the
`string` values of these `enum`s.
> `iota` increments for each value in the `enum` and starts at `0` unless modified. `iota + 1` starts at `1`.
Right now we only parse SELECT statements. If you try anything else, it will either error out
or hang. The HTTP response should time out after 30 seconds.
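As a rough illustration of calling the endpoint from a Go client, here is a minimal sketch. It assumes the body is sent as JSON and that the server listens on `http://localhost:8080`; both are guesses, so use whatever host and PORT your `.env` sets. The helper names `encodeQueryBody`, `newQueryRequest`, and `postQuery` are made up for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// queryBody mirrors the documented request body: { sql: string }.
type queryBody struct {
	SQL string `json:"sql"`
}

// encodeQueryBody serializes one SQL string into the request payload.
// Assumption: the endpoint accepts JSON.
func encodeQueryBody(sql string) ([]byte, error) {
	return json.Marshal(queryBody{SQL: sql})
}

// newQueryRequest builds the POST /query request without sending it.
func newQueryRequest(baseURL, sql string) (*http.Request, error) {
	payload, err := encodeQueryBody(sql)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/query", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// postQuery sends the request and returns the raw response body.
func postQuery(client *http.Client, baseURL, sql string) ([]byte, error) {
	req, err := newQueryRequest(baseURL, sql)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	// The server should give up after 30 seconds, so the client waits slightly longer.
	client := &http.Client{Timeout: 35 * time.Second}
	body, err := postQuery(client, "http://localhost:8080", "SELECT id, name FROM users")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

Setting a client-side timeout is worth keeping even if the server enforces its own, since non-SELECT statements may hang.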
### Development on core logic
This project is worked on via TDD methods; it is the only way to do so, as the parsing of SQL is so janky. If
you want to add a feature to the parsing, you first need to write a unit test.
Become familiar with [select_test](q/select_test.md) to see how we are doing it. In brief:
We have the test struct where `input` is the entire SQL string you are testing, and `expected` is
the exact struct (of whichever query struct) you expect to see returned.
```go
type ParsingTest struct {
	input    string
	expected Select
}
```
Add yours to the `var testSqlStatements []ParsingTest` of the file.
If you are adding a new field to the query's struct, or modifying any fields, make sure to add or update the
conditionals in the `t.Run(testName, func(t *testing.T)` block.
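To make the shape of a table entry and its per-field checks concrete, here is a standalone sketch. It is not the project's test file: `Select` is trimmed to two fields, `diffSelect` is a hypothetical helper standing in for the conditionals inside the `t.Run` block, and the parser call is replaced by a hard-coded value.

```go
package main

import "fmt"

// Select is a trimmed stand-in for the project's Select struct.
type Select struct {
	Table      string
	IsWildcard bool
}

// ParsingTest has the same shape as the test struct described above.
type ParsingTest struct {
	input    string
	expected Select
}

// testSqlStatements gets one entry per SQL string under test.
var testSqlStatements = []ParsingTest{
	{
		input:    "SELECT * FROM users",
		expected: Select{Table: "users", IsWildcard: true},
	},
}

// diffSelect (hypothetical) compares one field at a time, so every new
// field on the struct needs a matching check added here.
func diffSelect(got, want Select) []string {
	var mismatches []string
	if got.Table != want.Table {
		mismatches = append(mismatches, fmt.Sprintf("Table: got %q, want %q", got.Table, want.Table))
	}
	if got.IsWildcard != want.IsWildcard {
		mismatches = append(mismatches, fmt.Sprintf("IsWildcard: got %v, want %v", got.IsWildcard, want.IsWildcard))
	}
	return mismatches
}

func main() {
	for _, tc := range testSqlStatements {
		// Stand-in for the parser's output; the real suite would parse tc.input here.
		got := Select{Table: "users", IsWildcard: true}
		if diffs := diffSelect(got, tc.expected); len(diffs) > 0 {
			fmt.Printf("FAIL %q: %v\n", tc.input, diffs)
			continue
		}
		fmt.Printf("ok   %q\n", tc.input)
	}
}
```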
#### Remember the TDD Process
- Write enough of a test to make sure it fails
- Write enough prod code to make sure it passes
- Repeat until finished developing the feature
### Starting the app
**Prerequisites:**
* Go installed (version X.Y or higher - check your code for specifics)
* `go mod tidy` to fetch dependencies
* `cp .env.example .env` to create your own .env file
**Running the App:**
1. `go run main.go` to start the server (PORT is determined by the .env file)
2. `go test ./q` to run the test suite if developing features (add `-v` if you want verbose output)
## Data Structure Philosophy
We are operating off of the philosophy that the first-class data is SQL statement strings.
From these strings we derive all structured data types to represent those SQL statements,
whether they are CRUD or schema operations.
All of these structs will have to implement the `Query` interface:
```go
type Query interface {
	GetFullSql() string
}
```
So every struct we create from SQL will need to be able to provide a full and valid SQL
statement of itself.
These structs are then where we are able to alter their fields programmatically to create
new statements altogether.
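A minimal sketch of that round trip, assuming a heavily trimmed struct: `simpleSelect` is hypothetical and only knows about a table and plain column names, so it is not the project's `Select`. It shows a struct satisfying `Query`, then a field being changed to produce a new statement.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is the interface every parsed statement has to implement.
type Query interface {
	GetFullSql() string
}

// simpleSelect is a hypothetical, trimmed stand-in for a parsed SELECT.
type simpleSelect struct {
	Table   string
	Columns []string
}

// GetFullSql rebuilds a complete statement from the struct's fields.
// An empty column list is rendered as a wildcard.
func (s simpleSelect) GetFullSql() string {
	cols := "*"
	if len(s.Columns) > 0 {
		cols = strings.Join(s.Columns, ", ")
	}
	return fmt.Sprintf("SELECT %s FROM %s", cols, s.Table)
}

func main() {
	s := simpleSelect{Table: "users", Columns: []string{"id", "name"}}

	var q Query = s
	fmt.Println(q.GetFullSql()) // SELECT id, name FROM users

	// Altering a field yields a new statement without touching any SQL string.
	s.Columns = append(s.Columns, "email")
	fmt.Println(s.GetFullSql()) // SELECT id, name, email FROM users
}
```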
## SQL Tokens
We are currently using DataDog's SQL tokenizer `sqllexer` to scan through SQL strings.
The general token types it defines can be found [here](/docs/SQL_Token_Types.md).
These are an OK generalizer to start with when trying to parse out SQL, but cannot be used
without some extra conditional logic that checks what the actual values are.
Currently we scan through the strings to tokenize them. When stepping through the tokens we try
to determine the type of query we are working with. At that point we assume the overall structure
of the rest of the statement fits a particular format, then parse out the details of
the statement into the struct correlating to its data type.
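The point about checking actual values, not just token types, can be sketched as follows. This example does not use `sqllexer` itself: `token` is a hypothetical stand-in with a broad type and the raw text, and the type names `"SPACE"` and `"IDENT"` are placeholders rather than the library's constants.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical stand-in for a lexer token.
type token struct {
	Type  string // broad category; not enough on its own to identify a keyword
	Value string // raw text from the SQL string
}

// queryKind looks at the value of the first non-space token to decide
// which kind of statement the rest of the tokens should be parsed as.
func queryKind(tokens []token) string {
	for _, t := range tokens {
		if t.Type == "SPACE" {
			continue // leading whitespace carries no information
		}
		if strings.EqualFold(t.Value, "SELECT") {
			return "select"
		}
		return "unsupported"
	}
	return "empty"
}

func main() {
	tokens := []token{
		{Type: "SPACE", Value: " "},
		{Type: "IDENT", Value: "select"},
		{Type: "SPACE", Value: " "},
		{Type: "IDENT", Value: "id"},
	}
	fmt.Println(queryKind(tokens)) // select
}
```

Both `select` and `id` carry the same broad type here, which is why the value comparison is needed before the statement's structure can be assumed.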
## Scan State
As stated, we scan through the strings, processing each chunk, delineated by spaces and
punctuation, as a token. To properly interpret the tokens from their broad `token.Type`s, we
have to keep state of what else we have processed so far.
This state is determined by a set of flags depending on query type.
For example, a Select query will have:
```go
passedSELECT := false
passedColumns := false
passedFROM := false
passedTable := false
passedWHERE := false
passedConditionals := false
passedOrderByKeywords := false
passedOrderByColumns := false
```
The general philosophy for these flags is to name, and use, them in the context of what has
already been processed through the scan, which makes naming and reading new flags trivial.
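Here is a small sketch of how such flags can steer a scan. It is a simplification under stated assumptions: tokens come from `strings.Fields` instead of `sqllexer`, only `SELECT ... FROM <table>` is handled, and columns must be separated by a comma followed by a space. `parsedSelect` and `parseSimpleSelect` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedSelect is a hypothetical, trimmed result type.
type parsedSelect struct {
	Columns []string
	Table   string
}

// parseSimpleSelect walks the tokens once. Each flag records what has
// already been passed, which decides how the current token is read.
func parseSimpleSelect(sql string) parsedSelect {
	passedSELECT := false
	passedFROM := false
	passedTable := false

	var out parsedSelect
	for _, tok := range strings.Fields(sql) {
		upper := strings.ToUpper(tok)
		switch {
		case !passedSELECT:
			// Nothing counts until the SELECT keyword has been seen.
			passedSELECT = upper == "SELECT"
		case !passedFROM && upper == "FROM":
			passedFROM = true
		case !passedFROM:
			// Between SELECT and FROM every token is a column name.
			out.Columns = append(out.Columns, strings.TrimSuffix(tok, ","))
		case !passedTable:
			// The first token after FROM is the table name.
			out.Table = tok
			passedTable = true
		}
	}
	return out
}

func main() {
	result := parseSimpleSelect("SELECT id, name FROM users")
	fmt.Println(result.Columns, result.Table) // [id name] users
}
```

The same token text (`users`, `id`) is interpreted differently purely because of which flags are already set, which is the role the state plays in the real scan.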
A `Select` object is shaped as follows:
```go
type Select struct {
	Table        string        `json:"table"`
	Columns      []Column      `json:"columns"`
	Conditionals []Conditional `json:"conditionals"`
	OrderBys     []OrderBy     `json:"order_bys"`
	Joins        []Join        `json:"joins"`
	IsWildcard   bool          `json:"is_wildcard"`
	IsDistinct   bool          `json:"is_distinct"`
}

type Column struct {
	Name              string                `json:"name"`
	Alias             string                `json:"alias"`
	AggregateFunction AggregateFunctionType `json:"aggregate_function"` // Changed type name to match Go naming conventions
}

type Conditional struct {
	Key       string `json:"key"`
	Operator  string `json:"operator"`
	Value     string `json:"value"`
	Extension string `json:"extension"`
}

type OrderBy struct {
	Key       string
	IsDescend bool // SQL queries with no ASC|DESC on their ORDER BY are ASC by default, hence why this bool for the opposite
}

type Join struct {
	Type  JoinType      `json:"type"`
	Table Table         `json:"table"`
	Ons   []Conditional `json:"ons"`
}

// Only used in Join.Table right now, but Select.Table will also use this soon
type Table struct {
	Name  string `json:"name"`
	Alias string `json:"alias"`
}

type AggregateFunctionType int

const (
	MIN AggregateFunctionType = iota + 1
	MAX
	COUNT
	SUM
	AVG
)
```
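Since the API currently returns these enums as integers, a client may want to map them back to names. The sketch below shows one way to do that for `AggregateFunctionType`, using the constants defined above. The `String` method, the `aggregateNames` table, and the `"NONE"` label for the zero value are assumptions for illustration and are not confirmed to exist in the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type AggregateFunctionType int

const (
	MIN AggregateFunctionType = iota + 1
	MAX
	COUNT
	SUM
	AVG
)

// aggregateNames is indexed by the enum's integer value. Index 0 is the
// zero value (no aggregate function); the "NONE" label is hypothetical.
var aggregateNames = [...]string{"NONE", "MIN", "MAX", "COUNT", "SUM", "AVG"}

// String maps the integer back to its name (hypothetical helper).
func (a AggregateFunctionType) String() string {
	if a < 0 || int(a) >= len(aggregateNames) {
		return fmt.Sprintf("AggregateFunctionType(%d)", int(a))
	}
	return aggregateNames[a]
}

type Column struct {
	Name              string                `json:"name"`
	Alias             string                `json:"alias"`
	AggregateFunction AggregateFunctionType `json:"aggregate_function"`
}

func main() {
	// encoding/json ignores String(), so the enum still serializes as a number.
	raw, err := json.Marshal(Column{Name: "price", AggregateFunction: MAX})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(raw)) // {"name":"price","alias":"","aggregate_function":2}

	var c Column
	if err := json.Unmarshal(raw, &c); err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	fmt.Println(c.AggregateFunction) // MAX
}
```

Because `iota + 1` starts the constants at `1`, the zero value stays free to mean "no aggregate function", which is what the table's first slot represents here.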
## Improvement Possibilities
- Maybe utilize the `lookBehindBuffer` more to cut down the number of state flags in the scans?