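Origin

I've recently been reading Seven Languages in Seven Weeks: A Pragmatic Guide to Learning Programming Languages, which covers Ruby. After finishing it I had that "when you're holding a hammer, everything looks like a nail" feeling, and I happened to need to export the articles in a Zhihu collection, so I decided to do it in Ruby.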

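Finding the API

Faced with this task, my first instinct, based on experience (I've done plenty of similar things with AHK), was to script clicks on the page, download the whole HTML, and parse it with regular expressions or the DOM. That approach works but is clumsy. It is better to fetch the collection's contents directly through Zhihu's API, which should be both faster and easier to parse.

So the next question is: how do we find this API? Time to bring out Chrome.

Open Chrome, open a new window, choose Menu > Tools > Developer Tools, then load a Zhihu page. The developer tools panel at the bottom of the window shows something like this: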

tool_1

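Go to the collections page, pick any collection, and watch the resources load. Click the Document tab at the bottom of the tools panel, as shown below: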

tool_3

This shows that the page was fetched with an HTTP GET to http://www.zhihu.com/collection/20801936, where 20801936 is the collection's id. The returned HTML body:

tool_2

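Looking at the returned HTML, the body of each answer sits inside a <textarea class="content hidden"> tag, so we can pull it out through the DOM. The page holds about 20 answers at this point.

That raises a problem: the request carries no start, next, number or similar parameters, so how do we export the next page of articles?

Scroll down the collection page and it automatically loads more articles. In the developer tools, the following XHR shows up: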

tool_4

The response is JSON, and it contains an escaped block of markup holding the article bodies. It looks like we have found the API.

Now let's analyze this API:

tool_5

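Here we can see the request URL (the same as before) and the parameters the POST needs. offset and start are easy to understand (how many items to request and the starting timestamp), but where does _xsrf come from?

Looking back at the HTML returned by the earlier GET request, this appears at the very bottom: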
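To make the request shape concrete, here is a minimal sketch that builds the POST without sending it. The helper name build_page_request and the token value are made up for illustration, and I only include _xsrf and start because those are the two fields the script in this post actually sends.

```ruby
require "net/http"
require "uri"

# Build (but do not send) the form POST that the collection page issues
# when it loads more answers. The helper name and the values passed in
# are placeholders, not anything taken from Zhihu itself.
def build_page_request(collection_id, xsrf, start)
  uri = URI("http://www.zhihu.com/collection/#{collection_id}")
  request = Net::HTTP::Post.new(uri)
  request.set_form_data("_xsrf" => xsrf, "start" => start.to_s)
  request
end

request = build_page_request("20801936", "placeholder-token", 1331444491)
puts request.path
puts request.body
```

Building the request separately from sending it makes it easy to inspect the form body before pointing it at the real site.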

html_1

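This value stays the same for a given collection and only changes after a while. (I first guessed that it changed because the collection's contents changed; as a reader points out in the comments, it is actually an anti-CSRF token generated at login.)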
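As a rough illustration of pulling that value out of the page, here is a stdlib-only sketch using regular expressions. The HTML fragment and the extract_xsrf helper are hypothetical, and the full script in this post uses Nokogiri's CSS selector for the same job, which is the more robust choice.

```ruby
# Hypothetical fragment shaped like the hidden input at the bottom of the page.
html = '<form><input type="hidden" name="_xsrf" value="0123abcd"/></form>'

# Find the <input> whose name is _xsrf, then read its value attribute.
# Returns nil when no such input exists.
def extract_xsrf(html)
  tag = html[/<input\b[^>]*\bname=["']_xsrf["'][^>]*>/i]
  return nil unless tag
  tag[/\bvalue=["']([^"']*)["']/i, 1]
end

puts extract_xsrf(html)
```

Matching the tag first and the attribute second keeps the sketch independent of the order in which the attributes appear.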

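The JSON response also contains two mysterious numbers: 20 (the first element of the msg array) and 1331444491 (the last element). Testing shows that 20 is the number of articles in the current result, and 1331444491 is the start time of the next 20 articles (the start parameter). When there are no more articles, this value is -1.

With that, the whole process for crawling a collection takes shape. The hard part is done; what remains is the simpler job of parsing the content.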
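The layout of msg described above can be sketched as follows. The response body here is a made-up sample in that shape, not a captured response, and parse_page is a hypothetical helper name.

```ruby
require "json"

# Made-up response body in the shape described above:
# msg = [answer count, escaped markup, next start value].
body = '{"msg":[20,"<div>answers...<\/div>",1331444491]}'

# Split msg into named fields and flag the last page (next start == -1).
def parse_page(body)
  count, html, next_start = JSON.parse(body).fetch("msg")
  { count: count, html: html, next_start: next_start, last_page: next_start == -1 }
end

page = parse_page(body)
puts page[:count]
puts page[:next_start]
puts page[:last_page]
```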

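Organizing and writing the program

The plan in my head, written out:

get the collection's id
get the _xsrf parameter

loop while start is not -1
    send a POST request with start and _xsrf
    parse the response, extracting the start value and the article bodies
    generate each article from the template and save it to a file

    update the start parameter
end

With some effort (really, lack of practice) I implemented this plan in Ruby. The source code is as follows: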
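The loop above can be tried out without touching the network. In this sketch the real POST is replaced by a block that looks pages up in a hash; FAKE_PAGES and fetch_all are hypothetical names and every value in them is invented.

```ruby
# Stand-in for the real POST request: maps a start value to
# [answer count, content, next start]. All values are made up.
FAKE_PAGES = {
  ""  => [2, %w[a1 a2], 200],
  200 => [2, %w[a3 a4], 100],
  100 => [1, %w[a5], -1]
}.freeze

# Keep requesting pages until the next start value is -1.
# The block receives the current start value and returns one page.
def fetch_all(start = "")
  items = []
  loop do
    _count, content, next_start = yield(start)
    items.concat(content)
    break if next_start == -1
    start = next_start
  end
  items
end

all = fetch_all { |start| FAKE_PAGES.fetch(start) }
puts all.inspect
```

Passing the fetcher in as a block keeps the loop logic separate from the HTTP details, so the stopping condition can be checked on its own.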

# encoding: UTF-8
require 'fileutils'
require 'net/http'
require 'uri'
require 'json'
require 'digest'
require 'nokogiri'

# Work relative to the directory that contains this script.
Dir.chdir(File.dirname(__FILE__))

# Derive a stable local file name from a URL.
def hash_url(url)
  Digest::MD5.hexdigest(url.to_s)
end

# POST to the collection URL and return a hash with the answer count,
# the markup fragment and the next start value, or nil if parsing fails.
def fetchContent(collectionID, xsrf="", start="")
  uri = URI('http://www.zhihu.com/collection/' + collectionID)
  response = Net::HTTP.post_form(uri, {'_xsrf' => xsrf, 'start' => start})
  begin
    json = JSON.parse(response.body)

    res = Hash.new
    res["number"] = json["msg"][0]
    res["content"] = json["msg"][1]
    res["start"] = json["msg"][2]
    return res
  rescue
    puts "parse error"
    # Keep the request details and the raw body for later inspection.
    File.open("error.log", 'w') { |file| file.write(uri.to_s + "\n" + xsrf.to_s + "\n" + start.to_s + "\n" + response.body) }
    return nil
  end
end

# Split one page of markup into questions, each with its title and answers.
def parseItems(src)
  items = []
  doc = Nokogiri::HTML(src)
  #File.open("article.log", 'w') { |file| file.write(src)}
  doc.css(".zm-item").each do |zitem|
    item = Hash.new
    item["title"] = zitem.css(".zm-item-title").text.strip
    item["answers"] = zitem.css(".zm-item-fav").to_a
    items.push(item)
  end

  return items
end

# Download every image referenced in doc into ./res/<title>_file/ and
# rewrite each src attribute to point at the local copy.
def doImageCache(title, doc)
  path = "./res/#{title}_file/"
  FileUtils.mkpath(path) # does nothing if the directory already exists

  imgEntities = []

  doc.css("img").each do |img|
    uri = URI.parse(img["src"])
    filename = hash_url(uri.to_s) # hash url for save files
    img["src"] = "./#{title}_file/" + filename

    imgEntities << {'uri'=>uri, 'hash'=>filename}
  end

  # Download in batches of 6, one thread per image within a batch.
  imgEntities.each_slice(6) { |group|
    threads = group.map { |entity|
      Thread.new {
        begin
          uri = entity['uri']
          filename = entity['hash']
          Net::HTTP.start(uri.hostname) { |http|
            resp = http.get(uri.to_s)
            File.open(path + filename, "wb") { |file|
              file.write(resp.body)
              print "."
            }
          }
        rescue
          puts "error: \n    #{uri}"
        end
      }
    }

    threads.each { |t| t.join }
  }

  return doc
end

# GET the collection page and read its name and the _xsrf token from it.
def init(collectionID)
  uri = URI('http://www.zhihu.com/collection/' + collectionID)

  doc = Nokogiri::HTML(Net::HTTP.get(uri))
  xsrf = doc.css("input[name=_xsrf]")[0]["value"]

  src = Hash.new
  src["collectionName"] = doc.css("#zh-fav-head-title").text
  src["xsrf"] = xsrf

  return src
end

# Write one HTML file per collected question, based on template.html.
def toMultiFile(src, items)
  puts "downloading images."

  template = File.read("template.html", encoding: "UTF-8") # for Windows
  items.each { |item|
    buffer = ["<div><h1 class=\"title\">#{item["title"]}</h1></div>"]
    buffer.push("<div class=\"item typo typo-selection\" id=\"wrapper\">")
    buffer.push("<div class=\"answers\">")
    item["answers"].each { |fitem|
      author = fitem.css(".zm-item-answer-author-wrap").text.strip
      content = fitem.css(".content.hidden").text
      link = "http://www.zhihu.com" + fitem.css(".answer-date-link.meta-item").attr("href")

      content = doImageCache("ImageCache", Nokogiri::HTML(content).css("body").children).to_html # image cache
      buffer.push("<div class=\"author\">#{author}</div>")
      buffer.push("<div class=\"content\">#{content}</div>")
      buffer.push("<div class=\"link\"><a href=\"#{link}\">[Original link]</a></div>")
    }
    buffer.push("</div>")
    buffer.push("</div>")

    #[#{src["collectionName"].gsub(/[\x00\/\\:\*\?\"<>\|]/, "_")}]
    # sub rather than sub!, which returns nil when nothing is replaced;
    # block form so backslashes in the text are inserted literally.
    page = template.sub("<!-- this is template-->") { buffer.join("\n") }
                   .sub("<!-- this is title-->") { item["title"] }
    File.open("res/#{item["title"].gsub(/[\x00\/\\:\*\?\"<>\|]/, "_")}.html", 'w') { |file|
      file.write(page)
    }
  }
end

collectionID = "19563328"
src = init(collectionID)
puts "collectionName : #{src["collectionName"]}\nxsrf: #{src["xsrf"]}\n"

items = []
loop do
  contents = fetchContent(collectionID, src["xsrf"], src["start"])
  break unless contents # stop on a JSON parse error instead of crashing on nil
  items += parseItems(contents["content"])

  puts "collection's count : #{items.size} \n"
  break if contents["start"] == -1
  src["start"] = contents["start"]
end

toMultiFile(src, items)

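(Update: I originally used the API of Douban 9 Dian (豆瓣9点), but found it produced a lot of errors and duplicate entries, so I switched to the API of Xianguo Reader (鲜果阅读器).)

I didn't package this as a Ruby module (laziness); it just provides a toMultiFile() method, which generates one HTML file per article.

Output format

Once all the content has been saved, what comes next is fairly open-ended.

The question now is how to display these articles, or more specifically, which format and styling to choose.

For the format, considering ease of layout (really, my own limited skills), I settled on HTML, one HTML file per article. For typography, I had once bookmarked a page, "中文网页重设与排版：TYPO.CSS" (a reset and typography stylesheet for Chinese web pages), which provides a CSS file whose styling I found attractive, so I used that.

In the end I consulted the source of that page and the source of Zhihu's reading page, borrowed some CSS rules from each, and put together this template: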
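Since each file is named after its article title, the title has to be cleaned of characters that are not allowed in file names. The sketch below isolates that step using the same character class as the script above; safe_filename and the constant name are hypothetical helpers for illustration.

```ruby
# Characters replaced before a title is used as a file name
# (same character class as in the script above).
INVALID_FILENAME_CHARS = /[\x00\/\\:*?"<>|]/

# Replace every invalid character with an underscore and add an extension.
def safe_filename(title, ext = ".html")
  title.gsub(INVALID_FILENAME_CHARS, "_") + ext
end

puts safe_filename("Ruby: what/why?")
```

Note that two different titles can map to the same file name after replacement, in which case the later article overwrites the earlier one.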

<!DOCTYPE html>
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <link rel="stylesheet" href="./typo.css"><!-- Author: sofish Lin, open-sourced under the MIT License. -->
    <title><!-- this is title--></title>
    <style>
       a{color:#3a3c9c;}
       body h1{font-size:38px;line-height:1.8em;}
       code{color:#080;}
       html{font-size:110%;}
       body{width:100%;}
       #wrapper{min-width:480px;padding:1% 2%;}
       .author{color:#888;font-size:1em;margin:1em 0 2em;padding-bottom:2em;border-bottom:3px double #eee;}
       .link{color:#888;font-size:1em;margin:1em 0 2em;padding-top:2em;border-top:3px double #eee;}
       #table{margin-bottom:2em;color:#888;}

        .content  img {
            display:block;
            max-width: 100%;
            height: auto;
            margin: 10px 0;
            box-shadow: 0 1px 2px rgba(0,0,0,.3);
        }
       .title{
            font-size:130%;
            font-weight:bold;
            padding:2% 2%;
            text-shadow: 0 1px 0 white;
        }
    </style>
</head>
    <body>
<!-- this is template-->
    </body>
</html>

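When generating a page, replace the <!-- this is template--> placeholder with the article body (the source code above does exactly this).

The final result looks like this: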
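The substitution step can be sketched on its own like this. The inline template is a cut-down stand-in for the real template.html, and render is a hypothetical helper. I use the block form of sub here because a plain replacement string treats sequences such as \0 as back-references, which could mangle an article body.

```ruby
# Cut-down stand-in for template.html with the same two placeholders.
template = <<~HTML
  <html><head><title><!-- this is title--></title></head>
  <body>
  <!-- this is template-->
  </body></html>
HTML

# Fill both placeholders. The block form inserts the text literally,
# and sub (unlike sub!) never returns nil when a placeholder is missing.
def render(template, title, body)
  template
    .sub("<!-- this is template-->") { body }
    .sub("<!-- this is title-->") { title }
end

page = render(template, "Hello", "<p>hi</p>")
puts page
```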

passage_1

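Generating an e-book

Exported HTML is still inconvenient to read, so I wanted to turn it into an e-book to carry around on my phone. After reading a lot of material, I modified the code to fit, and the output can now be turned into an e-book like the following with Sigil: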

epub

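Just follow Sigil's prompts, import all the files under the res folder, and generate the book step by step.

Summary

Ruby is lightweight to write. You don't spend much time wrestling with syntax, so you can focus on the approach. That said, I still wrote this in a procedural style, and it feels like I only used some of Ruby's syntactic sugar without really experiencing its core value. Still, a scripting language like this is very convenient for solving the problem at hand.

From start to finish this took a full day, much of it spent looking things up in the Ruby docs; it will go faster with practice. The parts worth noting are the process of finding Zhihu's API and analyzing the page, and in both I still have a long way to go.

PS: All the source code above has been uploaded to GitHub.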


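风起 (11 years ago)

Nice one. So this URL is publicly accessible, which saves a step. I want to turn it into RSS output and use IFTTT to push it into WizNote (为知笔记) automatically. I'll try doing it in PHP, since I don't know Ruby...

ChiChou (11 years ago)

"Because the collection's contents changed": that's not actually the reason. The value is generated randomly when the user logs in, and the field exists to prevent cross-site request forgery (CSRF) attacks.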

OUZY (11 years ago)

ruby ZhihuCollection.rb
collectionName :

脱水干货区（欢迎投稿）

xsrf: 5e712a57dd1dc58a7ff744e1d9a1d14b
parse error
ZhihuCollection.rb:145:in `block in <main>': undefined method `[]' for nil:NilClass (NoMethodError)
from ZhihuCollection.rb:142:in `loop'
from ZhihuCollection.rb:142:in `<main>'

How do I fix this?
Running on Mac OS X 10.10

legendmohe (10 years ago)

This post was written a year ago; Zhihu's API has probably changed since then.

