data.lp


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490

This file contains the various data processing-related constants and functions referenced by the tangling and weaving processes.

@: *
@= License

@= Imports

@= Processing limits

@= Formatting keywords

@= Configuration keywords

@= Data structure types

@= Error set

@= Line splitting function

@= Configuration searching function

@= Section searching function

@= Command type detection function

@= Parsing functions

@= Code generation functions

@= Text generation function
@.

@: License
// Copyright 2022 DistressNetwork° <uplink@distress.network>
// This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.
@.

## Constants

We first import the standard library and the logging function from `log.zig`.

@: Imports
const std = @import("std");
const log = @import("log.zig").log;

const Allocator = std.mem.Allocator;
@.

We define the maximum input file size of 4GiB, and the code generation function's maximum recursion depth of 250 nested calls.

@: Processing limits
pub const input_max = 0x1_0000_0000;
pub const dereference_max = 250;
@.

We then define the recognized formatting keywords. These consist of the following:

- `@:`, which begins a new code section;
- `@+`, which appends content to a previous code section;
- `@.`, which terminates the definition of a code section;
- `@=`, which creates a reference to another code section;
- `*`, which is a reserved section name representing the root level of the source code.

@: Formatting keywords
pub const k_start	= "@: ";
pub const k_add		= "@+ ";
pub const k_end		= "@.";
pub const k_ref		= "@= ";
pub const k_root	= "*";
@.

We similarly define the recognized configuration keywords, consisting of:

- `@start`, which defines the leading formatted code delimiter when beginning new sections;
- `@add`, which defines the leading formatted code delimiter when appending to existing sections;
- `@end`, which defines the trailing formatted code delimiter;
- `@ref`, which defines the format for section references;
- `@@`, which is the escape sequence representing the current section name;
- `\n`, which is the escape sequence representing a newline.

@: Configuration keywords
pub const kc_start	= "@start ";
pub const kc_add	= "@add ";
pub const kc_end	= "@end ";
pub const kc_ref	= "@ref ";
pub const kc_esc	= "@@";
pub const kc_nl		= "\\n";
@.

We then define the data structure used for parsing the input into code sections, described as follows:

- The overall structure of the file is an array of `Section`s.
- A `Section` consists of the section name and an array of `Content` elements.
- A `Content` element may be either a range of literal lines of code or a reference to another section.
- A `LineRange` is a pair of integers indicating the starting and ending line numbers of the section.

@: Data structure types
pub const Section = struct {
	name: []const u8,
	content: []const Content,
};

pub const Content = union(enum) {
	literal: LineRange,
	reference: []const u8,
};

pub const LineRange = struct {
	start: u32,
	end: u32,
};
@.

We also define the set of errors which may be encountered by the various processing functions, consisting of:

- Unexpected section start commands,
- Unexpected section end commands, 
- Recursive dereferencing exceeding the specified depth limit,
- References to nonexistent section names or configuration commands.

@: Error set
pub const Errors = error {
	UnexpectedStart,
	UnexpectedEnd,
	DereferenceLimit,
	NotFound,
};
@.

## Preprocessing & Searching

The line splitting function is defined, which operates on a buffer as follows.

@: Line splitting function
pub fn split_lines(file: []const u8, alloc: Allocator) ![][]const u8 {
	var buffer = std.ArrayList([]const u8).init(alloc);
	defer buffer.deinit();

	@= Split file at each newline

	return buffer.toOwnedSlice();
}
@.

The function simply iteratively splits the file at each newline, and appends each resulting line to the buffer.

@: Split file at each newline
var iterator = std.mem.split(u8, file, "\n");
while (iterator.next()) |line| {
	try buffer.append(line);
}
@.

In addition, the final empty line created by the trailing newline at the end of the file (inserted automatically by some text editors) is removed, if it exists. This may only be performed if the file is non-empty, to avoid out-of-bounds indexing.

@+ Split file at each newline
if ((buffer.items.len > 0) and std.mem.eql(u8, buffer.items[buffer.items.len - 1], "")) {
	_ = buffer.pop();
}
@.

We define the configuration command searching function, which returns a list containing the segments of the split format string. The function will return from within the for loop if the declaration is found, otherwise an error is reported.

@: Configuration searching function
pub fn get_conf(lines: [][]const u8, key: []const u8, alloc: Allocator) ![][]const u8 {
	for (lines) |line| {
		if (std.mem.startsWith(u8, line, key)) {
			return try fmt_conf(line, key, alloc);
		}
	}
	log(.err, "config declaration '{s}' not found", .{std.mem.trimRight(u8, key, " \t")});
	return error.NotFound;
}

@= Auxiliary formatting function
@.

If the declaration is found, its contained format string is split along instances of the section name escape sequence, and each substring has its instances of the newline escape sequence replaced with a literal newline.

@: Auxiliary formatting function
fn fmt_conf(line: []const u8, key: []const u8, alloc: Allocator) ![][]const u8 {
	var buffer = std.ArrayList([]const u8).init(alloc);
	defer buffer.deinit();

	var iterator = std.mem.split(u8, line[(key.len)..], kc_esc);
	while (iterator.next()) |str| {
		try buffer.append(try std.mem.replaceOwned(u8, alloc, str, kc_nl, "\n"));
	}

	return buffer.toOwnedSlice();
}
@.

We define the code section searching function, which returns the index (into the section list) of the first section with a matching name, or returns an error if none exist.

@: Section searching function
fn search(list: []Section, name: []const u8) !usize {
	for (list) |section, index| {
		if (std.mem.eql(u8, section.name, name)) return index;
	}
	log(.err, "section '{s}' not found", .{name});
	return error.NotFound;
}
@.

## Parsing

We first define a function which, for a given line, determines whether it consists of a formatting command, and which type of command it contains. This is done in order to enable the use of switch statements in later functions using this routine.

@: Command type detection function
const CommandType = enum { start, add, end, ref, none };

fn command_type(line: []const u8) CommandType {
	if (std.mem.startsWith(u8, line, k_start)) {
		return .start;
	} else if (std.mem.startsWith(u8, line, k_add)) {
		return .add;
	} else if (std.mem.eql(u8, line, k_end)) {
		return .end;
	} else if (std.mem.startsWith(u8, std.mem.trimLeft(u8, line, " \t"), k_ref)) {
		return .ref;
	} else {
		return .none;
	}
}
@.

We then define the parsing functions, consisting of the main `parse` function which builds the list of `Section`s, and its auxiliary `parse_code` subroutine which builds the contents of each `CodeSection`.

@: Parsing functions
pub fn parse(lines: [][]const u8, alloc: Allocator) ![]Section {
	var sections = std.ArrayList(Section).init(alloc);
	defer sections.deinit();

	@= Main parsing routine

	return sections.toOwnedSlice();
}

fn parse_code(lines: [][]const u8, index: u32, alloc: Allocator) !CodeReturn {
	var content = std.ArrayList(Content).init(alloc);
	defer content.deinit();

	@= Code parsing subroutine

	return CodeReturn{ .content = content.toOwnedSlice(), .index = i + 1 };
}
@.

The latter function takes as arguments the list of lines and the allocator similarly to the main function, but it is also passed the index of the current line being processed, and returns the line at which the main function should resume parsing after the code section is parsed. It thus returns a struct consisting of the contents of the code section and the next line number index, as follows.

@+ Parsing functions
const CodeReturn = struct {
	content: []const Content,
	index: u32,
};
@.

The main parsing routine iterates over the list of lines, adding code sections where they occur, and otherwise ignoring text sections. If a section end command is encountered in the absence of a preceding starting command, an error is returned.

@: Main parsing routine
var i: u32 = 0;
while (i < lines.len) {
	const line = lines[i];
	switch (command_type(line)) {
		.start	=> {
			@= Add new section
		},
		.add	=> {
			@= Append to section
		},
		.end	=> {
			log(.err, "line {d}: unexpected section end", .{i + 1});
			return error.UnexpectedEnd;
		},
		else	=> {
			i += 1;
		},
	}
}
@.

To add a new section, the name (consisting of everything after the starting token) is first retrieved from the starting command. Then the code parsing subroutine is called, beginning at the line after the starting command, and it returns the resulting code section (`section.content`) and the next line at which to resume parsing (`section.index`). The code section is appended to the section list, and the parsing routine continues at the provided index.

@: Add new section
const name = line[(k_start.len)..];
log(.debug, "({d}) starting section '{s}'", .{ i + 1, name });

const section = try parse_code(lines, i + 1, alloc);
try sections.append(.{ .name = name, .content = section.content });

log(.debug, "({d}) ending section '{s}'", .{ section.index, name });
i = section.index;
@.

To append to an existing section, the section name and the code section contents to be appended are retrieved as above. The index of the section is located, along with its address within the section list. Next, the new contents of the section are created by concatenating the old contents with the newly parsed code section contents. The section list is then updated to point to the new section contents, and the parsing routine continues.

@: Append to section
const name = line[(k_add.len)..];
log(.debug, "({d}) appending to section '{s}'", .{ i + 1, name });

const section = try parse_code(lines, i + 1, alloc);
const index = try search(sections.items, name);
const old = &sections.items[index];
const new = try std.mem.concat(alloc, Content, &[_][]const Content{ old.*.content, section.content });
old.*.content = new;

log(.debug, "({d}) ending section '{s}'", .{ section.index, name });
i = section.index;
@.

The code parsing subroutine iterates over the list of lines similarly to the main routine. If a starting or appending command is encountered (lacking a matching ending command), an error is raised. Reference commands may be preceded with any amount of whitespace. The loop exits upon encountering an ending command. Otherwise, the line is appended as a literal element.

@: Code parsing subroutine
var i = index;
while (i < lines.len) {
	const line = lines[i];
	switch (command_type(line)) {
		.start, .add	=> {
			log(.err, "line {d}: unexpected section start", .{i + 1});
			return error.UnexpectedStart;
		},
		.ref	=> {
			@= Add reference
		},
		.end	=> {
			break;
		},
		else	=> {
			@= Add literal range
		},
	}
}
@.

To add a reference, the name of the referenced section is retrieved, consisting of the characters following the leading whitespace and the command token. The resulting string is appended to the section contents list, and the parser continues at the next line.

@: Add reference
const ref_name = std.mem.trimLeft(u8, line, " \t")[(k_ref.len)..];
try content.append(.{ .reference = ref_name });
log(.debug, "({d}) \tappended reference '{s}'", .{ i + 1, ref_name });
i += 1;
@.

To add a literal range, the parser either updates the end index of the previous literal element, or creates a new literal element if the last element added is a reference. This action of switching on the previous section element must only occur if the section contents list is non-empty, in order to prevent out-of-bounds indexing. Otherwise, the parser unconditionally appends a new literal element to the list. After either case, parsing continues at the next line.

@: Add literal range
if (content.items.len > 0) {
	switch (content.items[content.items.len - 1]) {
		.literal => |*range| {
			range.*.end = i;
		},
		.reference => {
			try content.append(.{ .literal = .{ .start = i, .end = i } });
			log(.debug, "({d}) \tappending literal", .{i + 1});
		},
	}
} else {
	try content.append(.{ .literal = .{ .start = i, .end = i } });
	log(.debug, "({d}) \tappending literal", .{i + 1});
}
i += 1;
@.

## Code Generation

We define the source code generation procedure which is split into two functions, consisting of a wrapper function which begins code generation at (the index of) the top-level section, and the main procedure which iterates over the current section contents, recursively resolving section references and appending literal elements to the list of source code lines.

@: Code generation functions
pub fn codegen(lines: [][]const u8, list: []Section, alloc: Allocator) ![][]const u8 {
	const root = try search(list, k_root);
	return try codegen_main(lines, list, root, 0, alloc);
}

fn codegen_main(lines: [][]const u8, list: []Section, index: usize, depth: u8, alloc: Allocator) anyerror![][]const u8 {
	var buffer = std.ArrayList([]const u8).init(alloc);
	defer buffer.deinit();

	const section = list[index];
	log(.debug, "generating section '{s}'", .{section.name});
	for (section.content) |content| switch (content) {
		.literal => |range| {
			@= Append literal range
		},
		.reference => |name| {
			@= Resolve reference
		},
	};

	log(.debug, "ending section '{s}'", .{section.name});
	return buffer.toOwnedSlice();
}
@.

To append a literal range, the range of lines is simply appended to the buffer.

@: Append literal range
log(.debug, "adding literal range {d}-{d}", .{ range.start + 1, range.end + 1 });
try buffer.appendSlice(lines[(range.start)..(range.end + 1)]);
@.

To resolve a section reference, the function must first check whether the current recursion depth has exceeded the configured limit, and return an error if this occurs. Otherwise, the index of the referenced section is retrieved, its contents are recursively parsed (with an incremented recursion depth), and the resulting source code lines are appended to the buffer.

@: Resolve reference
if (depth > dereference_max) {
	log(.err, "section dereferencing recursion depth exceeded (max {d})", .{dereference_max});
	return error.DereferenceLimit;
}
const ref = try search(list, name);
const code = try codegen_main(lines, list, ref, depth + 1, alloc);
try buffer.appendSlice(code);
@.

## Text Generation

Finally, we define the text generation function which iterates over the list of lines and produces the literate program text to be passed to an external document processor. In order to keep track of the name of the code section currently being formatted at any given point, the variable `current_name` is continually updated to contain the current name string. Configuration declarations are skipped, and lines which do not contain any formatting commands are appended as they are.

@: Text generation function
pub fn textgen(lines: [][]const u8, alloc: Allocator) ![][]const u8 {
	var buffer = std.ArrayList([]const u8).init(alloc);
	defer buffer.deinit();

	@= Process configuration declarations

	var current_name: []const u8 = undefined;
	for (lines) |line| {
		if (	std.mem.startsWith(u8, line, kc_start)
		or	std.mem.startsWith(u8, line, kc_add)
		or	std.mem.startsWith(u8, line, kc_end)
		or	std.mem.startsWith(u8, line, kc_ref)) {
			continue;
		} else switch (command_type(line)) {
			.start	=> {
				@= Format starting command
			},
			.add	=> {
				@= Format appending command
			},
			.ref	=> {
				@= Format reference command
			},
			.end	=> {
				@= Format ending command
			},
			else	=> {
				try buffer.append(line);
			},
		}
	}
	
	return buffer.toOwnedSlice();
}
@.

The formatting strings given by each configuration declaration are first retrieved. If the declaration of the format string for the section appending command is omitted, the format string for the section starting command is used in its place.

@: Process configuration declarations
const conf_start = try get_conf(lines, kc_start, alloc);
const conf_add = get_conf(lines, kc_add, alloc) catch conf_start;
const conf_end = try get_conf(lines, kc_end, alloc);
const conf_ref = try get_conf(lines, kc_ref, alloc);
@.

To process a section starting command, the current section name is updated, and the contents of the corresponding formatting command (that is, the segments of the split formatting string) are interspersed with copies of the current section name. The resulting string is then appended to the buffer.

@: Format starting command
current_name = line[(k_start.len)..];
try buffer.append(try std.mem.join(alloc, current_name, conf_start));
@.

Processing a section appending command is performed similarly.

@: Format appending command
current_name = line[(k_add.len)..];
try buffer.append(try std.mem.join(alloc, current_name, conf_add));
@.

To process a reference command, the index of the reference command keyword is first extracted. Then the formatted reference string is created, to which the reference command line's leading whitespace is prepended (to preserve indentation). 

@: Format reference command
const start = std.mem.indexOf(u8, line, k_ref).?;
const ref = try std.mem.join(alloc, line[(start + k_ref.len)..], conf_ref);
try buffer.append(try std.mem.concat(alloc, u8, &[_][]const u8{ line[0..start], ref }));
@.

Processing a section ending command is performed similarly to the starting and appending commands, however it does not require updating the current section name.

@: Format ending command
try buffer.append(try std.mem.join(alloc, current_name, conf_end));
@.