I'm trying to remove comments and strings from a C file with C code. I'll just stick to comments for the examples. I have a sliding window so I only have character n
Doing this correctly is more complicated than one may at first think, as ably pointed out by the other comments here. I would strongly recommend writing a table-driven FSM, using a state transition diagram to get the transitions right. Trying to handle more than a few states with case statements is horribly error-prone IMO.
Here's a diagram in dot/graphviz format from which you could probably directly code a state table. Note that I haven't tested this at all, so YMMV.
The semantics of the diagram are that an edge with an empty label ("") is a fall-through, taken if none of the other inputs in that state match. End of file is an error in any state except S0, and so is any character that matches neither an explicit label nor a fall-through edge. Every character scanned is printed except when in a comment (S4 and S5) and when detecting a start comment (S1). You will have to buffer characters when detecting a start comment, then print them if it's a false start, or throw them away once you're sure it really is a comment.
In the dot diagram, sq is a single quote ', dq is a double quote ".
digraph state_machine {
rankdir=LR;
size="8,5";
node [shape=doublecircle]; S0 /* init */;
node [shape=circle];
S0 /* init */ -> S1 /* begin_cmt */ [label = "'/'"];
S0 /* init */ -> S2 /* in_str */ [label = dq];
S0 /* init */ -> S3 /* in_ch */ [label = sq];
S0 /* init */ -> S0 /* init */ [label = ""];
S1 /* begin_cmt */ -> S4 /* in_slc */ [label = "'/'"];
S1 /* begin_cmt */ -> S5 /* in_mlc */ [label = "'*'"];
S1 /* begin_cmt */ -> S0 /* init */ [label = ""];
S1 /* begin_cmt */ -> S1 /* begin_cmt */ [label = "'\\n'"]; // handle "/\n/" and "/\n*"
S2 /* in_str */ -> S0 /* init */ [label = dq];
S2 /* in_str */ -> S6 /* str_esc */ [label = "'\\'"];
S2 /* in_str */ -> S2 /* in_str */ [label = ""];
S3 /* in_ch */ -> S0 /* init */ [label = sq];
S4 /* in_slc */ -> S4 /* in_slc */ [label = ""];
S4 /* in_slc */ -> S0 /* init */ [label = "'\\n'"];
S5 /* in_mlc */ -> S7 /* end_mlc */ [label = "'*'"];
S5 /* in_mlc */ -> S5 /* in_mlc */ [label = ""];
S7 /* end_mlc */ -> S7 /* end_mlc */ [label = "'*'|'\\n'"];
S7 /* end_mlc */ -> S0 /* init */ [label = "'/'"];
S7 /* end_mlc */ -> S5 /* in_mlc */ [label = ""];
S6 /* str_esc */ -> S8 /* oct */ [label = "[0-3]"];
S6 /* str_esc */ -> S9 /* hex */ [label = "'x'"];
S6 /* str_esc */ -> S2 /* in_str */ [label = ""];
S8 /* oct */ -> S10 /* o1 */ [label = "[0-7]"];
S10 /* o1 */ -> S2 /* in_str */ [label = "[0-7]"];
S9 /* hex */ -> S11 /* h1 */ [label = hex];
S11 /* h1 */ -> S2 /* in_str */ [label = hex];
S3 /* in_ch */ -> S12 /* ch_esc */ [label = "'\\'"];
S3 /* in_ch */ -> S13 /* out_ch */ [label = ""];
S13 /* out_ch */ -> S0 /* init */ [label = sq];
S12 /* ch_esc */ -> S3 /* in_ch */ [label = sq];
S12 /* ch_esc */ -> S12 /* ch_esc */ [label = ""];
}
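
As a rough illustration of the table-driven approach, here is a minimal C sketch, not a tested or complete implementation. It is deliberately simpler than the diagram: an escape inside a string or character literal is treated as a backslash plus any one character (no separate octal/hex states), backslash-newline splices are ignored, a comment is dropped rather than replaced by a space, and the newline that ends a // comment is kept. Unlike the diagram, end of input is only reported as an error inside a block comment, string, or character literal. All names (strip_comments, the S_*, C_* and action constants) are made up for this example.

```c
#include <assert.h>

/* States loosely follow the diagram: S_CODE ~ S0, S_SLASH ~ S1,
   S_LINE ~ S4, S_BLOCK ~ S5, S_STAR ~ S7, S_STR ~ S2, S_CHR ~ S3. */
enum state { S_CODE, S_SLASH, S_LINE, S_BLOCK, S_STAR,
             S_STR, S_STR_ESC, S_CHR, S_CHR_ESC, N_STATES };
/* Input classes: '/', '*', '"', '\'', '\\', '\n', anything else. */
enum cls { C_SLASH, C_STAR, C_DQ, C_SQ, C_BSL, C_NL, C_OTHER, N_CLS };
/* EMIT: print the character; SKIP: print nothing;
   FLUSH: print the buffered '/' and then the character (false start). */
enum act { EMIT, SKIP, FLUSH };

struct cell { unsigned char next, act; };

/* Same transition for every input class. */
#define ALL(s, a) { {s,a}, {s,a}, {s,a}, {s,a}, {s,a}, {s,a}, {s,a} }

/* Columns, in order: '/'  '*'  '"'  '\''  '\\'  '\n'  other */
static const struct cell table[N_STATES][N_CLS] = {
    [S_CODE]    = { {S_SLASH,SKIP}, {S_CODE,EMIT},  {S_STR,EMIT},  {S_CHR,EMIT},
                    {S_CODE,EMIT},  {S_CODE,EMIT},  {S_CODE,EMIT} },
    [S_SLASH]   = { {S_LINE,SKIP},  {S_BLOCK,SKIP}, {S_STR,FLUSH}, {S_CHR,FLUSH},
                    {S_CODE,FLUSH}, {S_CODE,FLUSH}, {S_CODE,FLUSH} },
    [S_LINE]    = { {S_LINE,SKIP},  {S_LINE,SKIP},  {S_LINE,SKIP}, {S_LINE,SKIP},
                    {S_LINE,SKIP},  {S_CODE,EMIT},  {S_LINE,SKIP} },
    [S_BLOCK]   = { {S_BLOCK,SKIP}, {S_STAR,SKIP},  {S_BLOCK,SKIP}, {S_BLOCK,SKIP},
                    {S_BLOCK,SKIP}, {S_BLOCK,SKIP}, {S_BLOCK,SKIP} },
    [S_STAR]    = { {S_CODE,SKIP},  {S_STAR,SKIP},  {S_BLOCK,SKIP}, {S_BLOCK,SKIP},
                    {S_BLOCK,SKIP}, {S_BLOCK,SKIP}, {S_BLOCK,SKIP} },
    [S_STR]     = { {S_STR,EMIT},   {S_STR,EMIT},   {S_CODE,EMIT}, {S_STR,EMIT},
                    {S_STR_ESC,EMIT}, {S_STR,EMIT}, {S_STR,EMIT} },
    [S_STR_ESC] = ALL(S_STR, EMIT),
    [S_CHR]     = { {S_CHR,EMIT},   {S_CHR,EMIT},   {S_CHR,EMIT},  {S_CODE,EMIT},
                    {S_CHR_ESC,EMIT}, {S_CHR,EMIT}, {S_CHR,EMIT} },
    [S_CHR_ESC] = ALL(S_CHR, EMIT),
};

static enum cls classify(int c)
{
    switch (c) {
    case '/':  return C_SLASH;
    case '*':  return C_STAR;
    case '"':  return C_DQ;
    case '\'': return C_SQ;
    case '\\': return C_BSL;
    case '\n': return C_NL;
    default:   return C_OTHER;
    }
}

/* Copies `in` to `out` without comments. `out` must have room for
   strlen(in) + 1 bytes. Returns 0 on success, -1 if the input ends
   inside a block comment, string, or character literal. */
int strip_comments(const char *in, char *out)
{
    enum state s = S_CODE;

    assert(in && out);
    for (; *in; in++) {
        struct cell t = table[s][classify((unsigned char)*in)];
        if (t.act == FLUSH)
            *out++ = '/';           /* the '/' we held back was real code */
        if (t.act != SKIP)
            *out++ = *in;
        s = (enum state)t.next;
    }
    if (s == S_SLASH)
        *out++ = '/';               /* input ended right after a lone '/' */
    *out = '\0';
    return (s == S_CODE || s == S_SLASH || s == S_LINE) ? 0 : -1;
}
```

Calling strip_comments("int a = 6/2; /* half */\n", buf) should leave "int a = 6/2; \n" in buf: the division slash is buffered in S_SLASH and flushed when the next character turns out not to start a comment. Adding a state is then a matter of adding one row to the table rather than threading another case through nested switch statements.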