BBCode parsing methods?

Asked by frinda

Can anyone suggest what methods of parsing exist besides regular expressions? Regular expressions, as you know, are not intended for parsing nested constructs. If you know of an implementation, please point me to it.

Answers

jo ann godshall
Nothing complicated: just build a finite state machine. By the way, Habr ran several articles on the topic not long ago (about writing compilers, if I am not mistaken).

For example:
1) An existing parser: xbb.uz/

2) My own homegrown one (reinventing the wheel):
Code: http://pastebin.mozilla-russia.org/106940
Diagram: habrastorage.org/storage/b55a4b42/f4942156/b245ccd6/9426eb87.png
(the original is in VP-UML; write to me if you need it)

It most likely has bugs (I am still debugging it).
Replies:
I thought about this myself. In theory it is the most direct approach. The only question is parsing speed.
And it is more convenient to represent finite automata IMHO with transition tables. - matthew lavin
Regular expressions also describe a finite state machine, which is exactly why they cannot parse nested constructs.
The article about compilers dealt with pushdown automata. - chris davey
AFAIK an automaton with stack memory (a pushdown automaton) is still a finite one :) - lois day
susie nee
> Regular expressions, as you know, are not intended for parsing nested constructs.

Indeed, theory tells us that a regular grammar cannot cope with the full power of BBCode.
But this does not mean that regular expressions cannot be applied to this problem at all
(look at the parsers of popular forums, for example).

In short: in one pass we match the innermost pair of tags and replace it with something that no longer contains them, and we repeat this in a loop for as long as a match is found.
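A minimal Python sketch of the loop described above. The tag set and the HTML it maps to are assumptions made up for the illustration, and escaping of the user's text is left out; an "innermost pair" is taken here to mean a pair whose body contains no "[" at all.

```python
import re

# Hypothetical tag set for the illustration: BBCode name -> HTML element.
TAGS = {"b": "strong", "i": "em", "quote": "blockquote"}

# An opening tag, a body without any "[", and the matching closing tag.
# Because the body cannot contain "[", only innermost pairs can match.
INNERMOST = re.compile(
    r"\[(" + "|".join(map(re.escape, TAGS)) + r")\]([^\[]*)\[/\1\]"
)


def _replace(m: re.Match) -> str:
    html_tag = TAGS[m.group(1)]
    return f"<{html_tag}>{m.group(2)}</{html_tag}>"


def render(text: str) -> str:
    """Replace innermost pairs repeatedly until a pass finds no match."""
    while True:
        text, count = INNERMOST.subn(_replace, text)
        if count == 0:
            return text
```

For example, `render("[b]bold [i]both[/i][/b]")` needs two passes: the first replaces the `[i]` pair, the second the `[b]` pair. Overlapping input such as `[b][i]x[/b][/i]` never matches and is returned unchanged, and a literal "[" in the body also blocks a match.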
Replies:
I have looked, and more than that, I used to do it this way myself. But it is not quite the right approach (or even a wrong one), especially if tags suddenly overlap. - shuying
jessica kintner
It depends on which language you use. If it is PHP, see the link. I use it myself: it is convenient, fast, flexible thanks to callbacks, and has no quirks :)
Replies:
A good option; I need to look at the implementation. But IMHO its use is limited by the fact that it has to be compiled and installed on the target system, which is not always possible.
As for the language: PHP, or possibly Perl. I am mostly interested in the approaches themselves. - rasha soliman
The best approach is a state machine with callback functions (because sometimes people need a solution that is not quite standard). Unfortunately, I have not seen any decent implementations yet. I am also still looking for a finite state machine for parsing HTML; I have been looking for four years now, and everything I find is a disgrace. - kara harper
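To make "a state machine with callback functions" concrete, here is a rough Python sketch, not anyone's production parser. It has two states (TEXT and TAG), keeps open tags on a stack, and hands each finished tag to a callback. The function names, the `HANDLERS` table and the `[tag=attr]` syntax are all assumptions for the illustration; unknown or mismatched tags are simply kept as literal text.

```python
import html
from typing import Callable, Dict, List, Optional

# A callback gets the tag's attribute (or None) and its already rendered body.
Handler = Callable[[Optional[str], str], str]


def _close_top(stack: List[list], handlers: Dict[str, Handler]) -> None:
    name, attr, pieces = stack.pop()
    stack[-1][2].append(handlers[name](attr, "".join(pieces)))


def _on_tag(raw: str, stack: List[list], handlers: Dict[str, Handler]) -> None:
    if raw.startswith("/"):
        if len(stack) > 1 and raw[1:].lower() == stack[-1][0]:
            _close_top(stack, handlers)
            return
    else:
        name, _, attr = raw.partition("=")
        if name.lower() in handlers:
            stack.append([name.lower(), attr or None, []])
            return
    # Unknown tag, or a closer that does not match the innermost open tag.
    stack[-1][2].append(html.escape("[" + raw + "]"))


def parse(text: str, handlers: Dict[str, Handler]) -> str:
    # Each frame is [tag name, attribute, output pieces]; frame 0 is the root.
    stack: List[list] = [["", None, []]]
    state, buf = "TEXT", ""
    for ch in text:
        if state == "TEXT":
            if ch == "[":
                state, buf = "TAG", ""
            else:
                stack[-1][2].append(html.escape(ch))
        elif ch == "]":  # state == "TAG": the tag is complete
            state = "TEXT"
            _on_tag(buf, stack, handlers)
        elif ch == "[":  # the earlier "[" was plain text; start over
            stack[-1][2].append(html.escape("[" + buf))
            buf = ""
        else:
            buf += ch
    if state == "TAG":  # input ended inside "[..."
        stack[-1][2].append(html.escape("[" + buf))
    while len(stack) > 1:  # close whatever is still open
        _close_top(stack, handlers)
    return "".join(stack[0][2])


# Hypothetical callbacks; each one is responsible for escaping its attribute.
HANDLERS: Dict[str, Handler] = {
    "b": lambda attr, body: f"<strong>{body}</strong>",
    "quote": lambda attr, body: (
        f"<blockquote><cite>{html.escape(attr or '')}</cite>{body}</blockquote>"
    ),
}
```

A non-standard tag is then just one more entry in the table, which is the point of the callback design.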
alexi
> The only question is parsing speed.
Probably not very fast; I have not tested it. It would also be interesting to compare it with other parsers.

> And it is more convenient to represent finite automata IMHO with transition tables
Maybe. But I think the diagram is more visual.

> Especially if tags suddenly overlap
With this parser you can handle that however you like; at the moment a nested unclosed BBCode is closed forcibly ([a][b][/a][/b] => [a][b][/b][/a][/b]).
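The forced closing shown in that example can be sketched in a few lines of Python. This is only an illustration of the policy, not the code from the pastebin link; the `TOKEN` pattern assumes that every `[word]` or `[/word]` is a tag, with no attributes.

```python
import re
from typing import List

# Assumption for the sketch: tags are bare lowercase words, no attributes.
TOKEN = re.compile(r"\[(/?)([a-z]+)\]")


def normalize(text: str) -> str:
    """Insert closing tags so that nested open tags are closed innermost first."""
    out: List[str] = []
    open_tags: List[str] = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        is_closing, name = m.group(1) == "/", m.group(2)
        if not is_closing:
            open_tags.append(name)
        elif name in open_tags:
            # Force-close everything that was opened after `name`.
            while open_tags[-1] != name:
                out.append("[/" + open_tags.pop() + "]")
            open_tags.pop()
        # A closer without a matching opener is passed through unchanged.
        out.append(m.group(0))
    out.append(text[pos:])
    while open_tags:  # close whatever is still open at the end
        out.append("[/" + open_tags.pop() + "]")
    return "".join(out)
```

`normalize("[a][b][/a][/b]")` gives `[a][b][/b][/a][/b]`, the same result as in the example above; what to do with the leftover `[/b]` is a separate policy decision.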
Replies:
Missed :( This was meant as a reply to divanikus's comment above. - jenny ong
> But I think the diagram is more visual.
Well, what I meant is that a table is clearer. Although it depends on the person and on the number of states.

> With this parser you can handle that however you like
That is understood; I was talking about parsing with regular expressions. - jeffrey johnson
karen woods
Use a parser generator and a formal description in EBNF =]
Replies:
And doesn't it end up generating a state machine anyway? - josh evans
raist
I will describe one more option*:
Processing the BBCodes one by one in a loop:
 1) find the opening tag "[bbcode"
 2) find "]" (everything in between is the attributes)
 3) if the tag is a single (unpaired) one, process it
 4) if not, look for the first closing tag "[/bbcode]"
 5) everything between "[bbcode ...]" and "[/bbcode]" is the body (it is formatted according to the BBCode)
 6) continue;

The main problem is that it is impossible to tell which "]" actually closes the tag. Because of this, the result depends on the order in which the BBCodes are parsed. In IPB, escaping "]" inside attributes is used to solve this problem...

* You should not use it... this is how the BBCode parser in IP.Board is written... there were (and still are) many bugs caused by different orderings of BBCodes and their attributes (including XSS and Apache crashes... a few details can be found on the IBR forum in Ritsuka's posts)
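To show why the first "]" is a problem, here is a small Python sketch of steps 1, 2, 4 and 5 for a paired tag. It is not IP.Board's code; `find_tag` is a made-up helper that does only the naive searches from the list above.

```python
from typing import Optional, Tuple


def find_tag(text: str, name: str) -> Optional[Tuple[str, str]]:
    """Return (attributes, body) of the first [name...]...[/name], found naively."""
    start = text.find("[" + name)              # 1) the opening tag
    if start == -1:
        return None
    end = text.find("]", start)                # 2) the first "]" ends the attributes
    if end == -1:
        return None
    close = text.find("[/" + name + "]", end)  # 4) the first closing tag
    if close == -1:
        return None
    attrs = text[start + len(name) + 1:end]
    body = text[end + 1:close]                 # 5) everything in between is the body
    return attrs, body
```

`find_tag("[url=http://a/b]link[/url]", "url")` returns `("=http://a/b", "link")` as expected, but with `"[url=http://a/?x[]=1]link[/url]"` the first "]" sits inside the attribute, so the attribute comes back as `"=http://a/?x["` and the rest of the URL leaks into the body.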
Replies:
I know about IPB; I looked at these methods a long time ago, and they keep producing XSS holes.

By the way, I have also been registered on the IBR since 2003, but in recent years I have had little need for it; I work in a different area now. - zelda
eleanor cook
I use the xbb.uz parser myself. I did not like the code it generates, but after some hand-tuning everything is fine.

And what speed do you need? I parse only once, on save, then cache the result and serve it from the cache whenever it is needed.
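Roughly, "parse once on save, serve from cache" could look like the Python sketch below. `PostStore` and its in-memory dictionaries are stand-ins for a real database, and `render` is whatever BBCode parser is in use; all of these names are assumptions.

```python
from typing import Callable, Dict


class PostStore:
    """Keeps the BBCode source and the HTML rendered from it at save time."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render
        self._source: Dict[int, str] = {}
        self._html: Dict[int, str] = {}

    def save(self, post_id: int, bbcode: str) -> None:
        # The only place where the parser runs.
        self._source[post_id] = bbcode
        self._html[post_id] = self._render(bbcode)

    def source(self, post_id: int) -> str:
        # The original text, e.g. for the edit form.
        return self._source[post_id]

    def html(self, post_id: int) -> str:
        # Reads never call the parser.
        return self._html[post_id]
```

With this layout the parser's speed only matters on writes, which are much rarer than reads on a typical forum.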
Replies:
The logic, in my opinion, is simple: if it runs faster, it eats less CPU time, i.e. it puts less load on the machine.
How to store the result is also an interesting question. phpBB, for example, stores both versions, the original and the parsed one. IPB, on the other hand, converts back and forth on the fly. I do not even know which is better. - anastasia
IPB2 stores HTML.
IPB3 stores BBCode, but for posts (and some other data) the HTML can be cached. - stephen ryner jr
So does IPB3 render HTML from BBCode on every request? Or only when caching is disabled? The latest version I have seen was 2.2.1, I think. - brooke moncrief
IPB3 caches the parsed BBCode. - doug hart
stacey brutger
> So does IPB3 render HTML from BBCode on every request?
Yes. (There are pluses, but probably more minuses.)
Replies:
Missed again :( This is a reply to divanikus's previous comment. - juliana es
Are you sure? I dug around in it for several weeks, and there was definitely a table where the parsed HTML was stored. - vin cius
I am sure that BBCodes are parsed on output, but there is also a "content cache" table, which stores only *some* data (posts, signatures, ...) and *only* when the cache is turned on. Everything else is parsed every time. The most "wonderful" part is that in your own applications you have to implement caching yourself. - chetan