Storages

The storage defines how trees are saved on disk. PSIMiner currently supports plain text, tree-based, and path-based storage formats.

PSIMiner can also detect the structure of the dataset and save the input data in the corresponding holdout folders (train, val and test). If the data is not structured, all trees are saved in the output folder in a single file.
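To illustrate what "structured" could mean here, the following is a minimal sketch that looks for the three holdout folders under a dataset root. It is not PSIMiner's implementation: the function name `find_holdouts` and the rule "a holdout exists if a folder with that name sits directly under the root" are assumptions made for this example.

```python
from pathlib import Path
from typing import Dict

# Holdout names taken from the text above.
HOLDOUTS = ("train", "val", "test")


def find_holdouts(dataset_root: Path) -> Dict[str, Path]:
    """Return the holdout folders found directly under dataset_root.

    Hypothetical helper: an empty result means the dataset would be
    treated as unstructured in this sketch.
    """
    return {
        name: dataset_root / name
        for name in HOLDOUTS
        if (dataset_root / name).is_dir()
    }
```

With this rule, a dataset without any of these folders yields an empty mapping, which corresponds to the single-file output case described above.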

Plain text format

Saves only the method code, with tree transformations applied, to a .jsonl file.

{
  "name": "plain text"
}

Tree formats

JSON Lines

Saves each tree with its label in the JSON Lines format. The JSON format of the AST is inspired by the 150k Python dataset.

{
  "name": "json tree"
}
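Because JSON Lines stores one JSON value per line, the output can be read back incrementally. The sketch below is a generic reader and does not assume any particular field names in PSIMiner's records; `read_json_lines` is a name chosen for this example.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def read_json_lines(path: Path) -> Iterator[Any]:
    """Yield one decoded JSON value per non-empty line of the file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Reading lazily keeps memory usage flat for large datasets, since only one tree is decoded at a time.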

Path-based representations

The path-based representation was introduced by Alon et al. It is used in popular code representation models such as code2vec and code2seq.

Code2seq

Extracts paths from each AST and saves them in the code2seq format. The output is a path_context.c2s file, generated for every holdout. Each line starts with a label, followed by a sequence of space-separated triples. Each triple contains the start token, the path node types, and the end token, separated by commas.

To reduce memory usage, you can enable the nodesToNumbers option. If nodesToNumbers is set to true, all node types are converted into numbers and a node_types.csv file is added to the output files.

{
  "name": "code2seq",
  "pathWidth": 2,
  "pathLength": 9,
  "maxPathsInTrain": 1000,
  "maxPathsInTest": 200,
  "nodesToNumbers": true
}

maxPathsInTrain, maxPathsInTest, and nodesToNumbers are optional parameters.
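The line layout of path_context.c2s described above (a label followed by space-separated `start,path,end` triples) can be parsed with a few lines of code. This is a minimal sketch, assuming that tokens and paths contain no spaces or commas themselves; `PathContext` and `parse_c2s_line` are names invented for this example, and the path is kept as one opaque string because its internal layout is not specified here.

```python
from typing import List, NamedTuple, Tuple


class PathContext(NamedTuple):
    start_token: str
    path: str  # node types (or numbers), kept as one opaque string
    end_token: str


def parse_c2s_line(line: str) -> Tuple[str, List[PathContext]]:
    """Split one line into its label and its list of path contexts.

    Raises ValueError on an empty line or a triple without exactly
    three comma-separated parts.
    """
    label, *triples = line.split()
    contexts = []
    for triple in triples:
        start, path, end = triple.split(",")
        contexts.append(PathContext(start, path, end))
    return label, contexts
```

For example, with a made-up line, `parse_c2s_line("getName this,1,name")` returns the label `"getName"` and a single `PathContext("this", "1", "name")`.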

New storages

To add a new storage, follow these steps:

  1. Implement the storage interface.
  2. Add a storage config for it.
  3. Register the storage config in the tool runner.
  4. [Optional] Add tests for the storage.

Metadata collection

To enable metadata collection, add the following parameter to the storage config:

{
  "collectMetadata": true
}

This collects additional data (the file path and the range of the element represented by each tree) and saves it in the metadata folder in JSON Lines format.