
Web crawlers: a simple regex-based crawler in Python

5 May 2019

Today I used regular expressions to do a quick, simple crawl of Dianping (大众点评), scraping Beijing's food listings: shop name, per-capita spend, and address.

C# web crawler

An editor at my company needed to scrape web page content and asked me to help build a simple scraping tool.


Fetching page content like this is easy enough for most of us, but a few small tweaks are needed here. The code follows for reference:

private string GetHttpWebRequest(string url)
{
    HttpWebResponse result;
    string strHTML = string.Empty;
    try
    {
        // First attempt: set browser-like headers, then read the page as UTF-8.
        Uri uri = new Uri(url);
        HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(uri);
        myReq.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        myReq.Accept = "*/*";
        myReq.KeepAlive = true;
        myReq.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        result = (HttpWebResponse)myReq.GetResponse();
        Stream receiveStream = result.GetResponseStream();
        StreamReader readerOfStream = new StreamReader(receiveStream, System.Text.Encoding.GetEncoding("utf-8"));
        strHTML = readerOfStream.ReadToEnd();
        readerOfStream.Close();
        receiveStream.Close();
        result.Close();
    }
    catch
    {
        // Fallback: retry with the same headers but decode as GB2312, since many
        // Chinese sites are not UTF-8. A non-2xx response still carries a body,
        // so recover it from the WebException instead of failing outright.
        Uri uri = new Uri(url);
        HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(uri);
        myReq.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        myReq.Accept = "*/*";
        myReq.KeepAlive = true;
        myReq.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        try
        {
            result = (HttpWebResponse)myReq.GetResponse();
        }
        catch (WebException ex)
        {
            result = (HttpWebResponse)ex.Response;
        }
        Stream receiveStream = result.GetResponseStream();
        StreamReader readerOfStream = new StreamReader(receiveStream, System.Text.Encoding.GetEncoding("gb2312"));
        strHTML = readerOfStream.ReadToEnd();
        readerOfStream.Close();
        receiveStream.Close();
        result.Close();
    }
    return strHTML;
}

 

  那是依附url爬取网页远啊,有部分小退换,繁多网页有两样的编码格式,以至有点网站做了反爬取的幸免,那几个方法通过能够转移也能爬去


The following extracts all of the hyperlinks in a page:

/// <summary>
/// Extract the URLs from a block of HTML code.
/// </summary>
/// <param name="htmlCode"></param>
/// <returns></returns>
private static List<string> GetHyperLinks(string htmlCode, string url)
{
    bool IsGenxin = false;                            // "updated" flag (kept from the larger app)
    StringBuilder weburlSB = new StringBuilder();     // for SQL
    StringBuilder linkSb = new StringBuilder();       // for display
    List<string> Weburllistzx = new List<string>();   // newly found URLs
    List<string> Weburllist = new List<string>();     // previously known URLs
    string ProductionContent = htmlCode;
    // Grab the scheme + domain of the page itself, so relative hrefs can be made absolute.
    Regex reg = new Regex(@"http(s)?://([\w-]+\.)+[\w-]+/?");
    string wangzhanyuming = reg.Match(url, 0).Value;
    // Rewrite root-relative hrefs ("/x", "./x") as absolute URLs, then match every <a ... href=...> tag.
    MatchCollection mc = Regex.Matches(
        ProductionContent.Replace("href=\"/", "href=\"" + wangzhanyuming)
                         .Replace("href='/", "href='" + wangzhanyuming)
                         .Replace("href=/", "href=" + wangzhanyuming)
                         .Replace("href=\"./", "href=\"" + wangzhanyuming),
        @"<[aA][^>]* href=[^>]*>", RegexOptions.Singleline);
    foreach (Match m in mc)
    {
        // Does the tag contain an absolute URL?
        MatchCollection mc1 = Regex.Matches(m.Value, @"[a-zA-Z]+://[^\s]*", RegexOptions.Singleline);
        if (mc1.Count > 0)
        {
            foreach (Match m1 in mc1)
            {
                string linkurlstr = m1.Value.Replace("\"", "").Replace("'", "").Replace(">", "").Replace(";", "");
                weburlSB.Append("$-$");
                weburlSB.Append(linkurlstr);
                weburlSB.Append("$_$");
                if (!Weburllist.Contains(linkurlstr) && !Weburllistzx.Contains(linkurlstr))
                {
                    IsGenxin = true;
                    Weburllistzx.Add(linkurlstr);
                    linkSb.AppendFormat("{0}<br/>", linkurlstr);
                }
            }
        }
        else
        {
            // Otherwise treat the href as relative to the current page's directory.
            if (m.Value.IndexOf("javascript") == -1)
            {
                string wangzhanxiangduilujin = url.Substring(0, url.LastIndexOf("/") + 1);
                string amstr = m.Value.Replace("href=\"", "href=\"" + wangzhanxiangduilujin)
                                      .Replace("href='", "href='" + wangzhanxiangduilujin);
                MatchCollection mc11 = Regex.Matches(amstr, @"[a-zA-Z]+://[^\s]*", RegexOptions.Singleline);
                foreach (Match m1 in mc11)
                {
                    string linkurlstr = m1.Value.Replace("\"", "").Replace("'", "").Replace(">", "").Replace(";", "");
                    weburlSB.Append("$-$");
                    weburlSB.Append(linkurlstr);
                    weburlSB.Append("$_$");
                    if (!Weburllist.Contains(linkurlstr) && !Weburllistzx.Contains(linkurlstr))
                    {
                        IsGenxin = true;
                        Weburllistzx.Add(linkurlstr);
                        linkSb.AppendFormat("{0}<br/>", linkurlstr);
                    }
                }
            }
        }
    }
    return Weburllistzx;
}
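For comparison, here is a short Python sketch of the same link extraction (the function name extract_links is mine; the regex handles the common href forms rather than attempting full HTML parsing):

import re
from urllib.parse import urljoin

def extract_links(html, base_url):
    """Return the unique absolute URLs found in href attributes, in page order."""
    # Match href="...", href='...' and bare href=...; a real crawler would
    # prefer an HTML parser, but a regex is enough for well-formed pages.
    hrefs = re.findall(r'''href\s*=\s*["']?([^"'\s>]+)''', html, re.IGNORECASE)
    seen, links = set(), []
    for href in hrefs:
        if href.startswith('javascript') or href.startswith('#'):
            continue
        absolute = urljoin(base_url, href)   # resolve relative paths against the page URL
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links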

The trick here really is just simple regex matching. Next, the methods for extracting the page title and storing the results in an XML file:

/// <summary>
/// Write the URL list to an XML file.
/// </summary>
/// <param name="strURL"></param>
/// <param name="alHyperLinks"></param>
private static void WriteToXml(string strURL, List<string> alHyperLinks)
{
    XmlTextWriter writer = new XmlTextWriter(@"D:\HyperLinks.xml", Encoding.UTF8);
    writer.Formatting = Formatting.Indented;
    writer.WriteStartDocument(false);
    writer.WriteDocType("HyperLinks", null, "urls.dtd", null);
    writer.WriteComment("Hyperlinks extracted from " + strURL);
    writer.WriteStartElement("HyperLinks");
    writer.WriteStartElement("HyperLinks", null);
    writer.WriteAttributeString("DateTime", DateTime.Now.ToString());
    foreach (string str in alHyperLinks)
    {
        // Use the domain suffix (com/net/cn/...) as the element name.
        string title = GetDomain(str);
        string body = str;
        writer.WriteElementString(title, null, body);
    }
    writer.WriteEndElement();
    writer.WriteEndElement();
    writer.Flush();
    writer.Close();
}

/// <summary>
/// Get the domain suffix of a URL.
/// </summary>
/// <param name="strURL"></param>
/// <returns></returns>
private static string GetDomain(string strURL)
{
    string retVal;
    string strRegex = @"(\.com/|\.net/|\.cn/|\.org/|\.gov/)";
    Regex r = new Regex(strRegex, RegexOptions.IgnoreCase);
    Match m = r.Match(strURL);
    retVal = m.ToString();
    // Strip the leading dot and trailing slash, e.g. ".com/" -> "com".
    strRegex = @"\.|/$";
    retVal = Regex.Replace(retVal, strRegex, "");
    if (retVal == "")
        retVal = "other";
    return retVal;
}

/// <summary>
/// Get the page title.
/// </summary>
/// <param name="html"></param>
/// <returns></returns>
private static string GetTitle(string html)
{
    string titleFilter = @"<title>[\s\S]*?</title>";
    string h1Filter = @"<h1.*?>.*?</h1>";
    string clearFilter = @"<.*?>";

    string title = "";
    Match match = Regex.Match(html, titleFilter, RegexOptions.IgnoreCase);
    if (match.Success)
    {
        title = Regex.Replace(match.Groups[0].Value, clearFilter, "");
    }

    // The article title is usually in <h1>, which is cleaner than <title>.
    match = Regex.Match(html, h1Filter, RegexOptions.IgnoreCase);
    if (match.Success)
    {
        string h1 = Regex.Replace(match.Groups[0].Value, clearFilter, "");
        if (!String.IsNullOrEmpty(h1) && title.StartsWith(h1))
        {
            title = h1;
        }
    }
    return title;
}
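The title heuristic is easy to mirror in Python as well. This is a sketch (the name get_title is mine): prefer the <h1> text when the <title> starts with it, since <title> often carries extra site-name suffixes.

import re

def get_title(html):
    """Extract a page title, preferring a clean <h1> over the full <title>."""
    def strip_tags(fragment):
        return re.sub(r'<.*?>', '', fragment)

    title = ''
    m = re.search(r'<title>[\s\S]*?</title>', html, re.IGNORECASE)
    if m:
        title = strip_tags(m.group(0))

    m = re.search(r'<h1.*?>.*?</h1>', html, re.IGNORECASE | re.DOTALL)
    if m:
        h1 = strip_tags(m.group(0))
        if h1 and title.startswith(h1):
            title = h1
    return title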

Those are all the methods used; there is still plenty that could be improved. If you spot any shortcomings, please point them out. Thanks!

 


Preface

The implementation is extremely simple, so simple it can hardly be called a crawler, and it requires no special technique, but it makes my own life more convenient, and that is what counts.

Since I had just learned JSoup and used it for a simple crawler, this time I crawled Douban Tongcheng (豆瓣同城) for my own convenience.

The situation is this: every week I like to browse Douban Tongcheng for interesting photography or art exhibitions. Such exhibitions are relatively rare, though; finding one on the site means flipping through many pages of listings, and even then it may not interest me. So I decided to write a simple crawler to fetch the photography and art exhibition listings from Douban Tongcheng for me, which makes things much more convenient.

This post also walks through a Python crawler example that implements a simple scrape of Youdao Translate (有道翻译), shared here for reference:


Implementation

# -*- coding:utf-8 -*-
#!python3
import urllib.request
import urllib.parse
import json

while True:
    content = input("Enter the text to translate (press q to quit): ")
    if content == 'q':
        break
    # Youdao's web translation endpoint; the parameters mimic the site's own AJAX request.
    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=https://www.baidu.com/link'
    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    # Form fields copied from the browser's request to the translation API.
    data = {}
    data['type'] = 'AUTO'
    data['i'] = content
    data['doctype'] = 'json'
    data['xmlVersion'] = '1.8'
    data['keyfrom'] = 'fanyi.web'
    data['ue'] = 'UTF-8'
    data['action'] = 'FY_BY_CLICKBUTTON'
    data['typoResult'] = 'true'
    data = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url, data, head)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    target = json.loads(html)
    print("Translation: %s" % target['translateResult'][0][0]['tgt'])
And here is the Dianping crawler mentioned at the top: it walks Beijing's food listings and saves each shop's name, per-capita price, and address. (Note: the original address group was a trailing lazy .*? that matched the empty string; the fix below assumes the address sits in a <span class="addr"> tag, which matched the listing markup at the time. Adjust it if the page changes.)

import re
import urllib.request

def getPage(url):
    # Fetch a listing page with a browser-like User-Agent.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/51.0.2704.63 Safari/537.36'}
    req = urllib.request.Request(url=url, headers=headers)
    res = urllib.request.urlopen(req)
    return res.read().decode('utf-8')

# One match per shop card: name in <h4>, per-capita price in <b>, address in <span class="addr">.
com = re.compile(
    r'<div class="txt">.*?<h4>(?P<shop_name>.*?)</h4>'
    r'.*?<b>¥(?P<per_capita>\d+)</b>'
    r'.*?<span class="addr">(?P<address>.*?)</span>', re.S)

def parsePage(s):
    for i in com.finditer(s):
        yield {
            "shop name": i.group("shop_name"),
            "per-capita price": i.group("per_capita"),
            "address": i.group("address"),
        }

def main(num):
    url = "http://www.dianping.com/beijing/ch10/p%s?aid=92020785%%2C102284990&cpt=92020785%%2C102284990" % num
    response_html = getPage(url)
    # Append each shop record to a local file, one dict per line.
    with open("eat_info", "a", encoding="utf-8") as f:
        for obj in parsePage(response_html):
            print(obj)
            f.write(str(obj) + "\n")

# Crawl the first 50 pages of listings.
for page in range(1, 51):
    main(page)
Page analysis

In a simple crawler like this, page analysis is the most time-consuming step: for every piece of information you want, you have to find the element it corresponds to; not much more needs saying.
Since our main goal is to find the city's photography exhibitions, we can crawl directly from the Douban Tongcheng Beijing exhibitions front page. That reduces the amount of information to crawl and lowers the difficulty.

Pressing F12 to view the Douban page source, the console prints this line:

(Screenshot: the Douban console prints a recruitment notice for developers.)

Recruiting for programmers really does reach everywhere.


The simple crawler

Crawling the content

Use Jsoup to fetch each page's document, select the corresponding elements to get the content we want, and then display it on screen.

// Fragment from the paging loop: `index` is the page counter, and each page holds 10 entries.
document = Jsoup.connect("https://beijing.douban.com/events/week-exhibition?start=" + index + "0")
//        .proxy("222.74.225.231", 3128)   // optional proxies, for when the IP gets banned
//        .proxy("118.144.149.200", 3128)
        .timeout(10000)
        .get();
Elements li = document.select("li.list-entry");
if (li.size() == 0) {
    break;   // no more entries: the last page has been reached
}

for (Element element : li) {
    Elements meta = element.select("ul.event-meta");
    if (meta.toString().equals("")) {
        continue;
    }
    LocalBean bean = new LocalBean();
    bean.setImgURL(element.select("img").attr("data-lazy"));                // image link
    bean.setTitle(element.select("div.title").select("a").attr("title"));  // title
    bean.setURL(element.select("div.title").select("a").attr("href"));     // link

    // Both classes sit on the same <p> element, so chain them with a dot.
    Elements tagElements = element.select("p.event-cate-tag.hidden-xs").select("a");
    if (!tagElements.toString().equals("")) {
        for (int i = 0; i < tagElements.size(); i++) {
            bean.setTag(tagElements.get(i).text());
        }
    }
    bean.setTime(meta.select("li.event-time").text());
    bean.setLocation(meta.select("li").get(1).text());
    bean.setCost(meta.select("li.fee").select("strong").text());

    // Keyword filter: skip events we are not interested in (see the next section).
    if (!bean.isWanted()) {
        continue;
    }

    System.out.println("***************************");
    System.out.println(bean.getTitle());
    System.out.println(bean.getTime());
    System.out.println(bean.getImgURL());
    System.out.println(bean.getURL());
    System.out.println(bean.getLocation());
    System.out.println(bean.getCost());
    System.out.println(bean.getTag());

    mBeans.add(bean);
}



Filtering the results

Our primary target is photography exhibitions, with art exhibitions second, and strict completeness is not required; a few missed listings are acceptable.

So I kept it very simple: filter the text of the title and the description, and keep the entries that contain photography-related keywords.
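A minimal sketch of that filter in Python (the keyword list and the name is_wanted are my own choices; in the Java version this logic lives behind bean.isWanted()):

# Keywords that mark an event as interesting; extend as needed.
WANTED_KEYWORDS = ('摄影', '摄影展', '画展', 'photography', 'exhibition')

def is_wanted(title, description):
    """Return True if the title or description mentions any wanted keyword."""
    text = (title or '') + (description or '')
    return any(keyword in text for keyword in WANTED_KEYWORDS)

# Example: is_wanted('北京摄影展', '') -> True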


 

Coping with anti-crawling measures

In a real crawler, coping with anti-crawling strategies is an important part of the work. For example, if we hit Douban Tongcheng without pause, the IP gets banned after a few consecutive runs and we start seeing 403 errors; at that point we would have to go through proxy IPs.
But since this tool has no need to crawl frequently and the data volume is small, there is no need for the proxy-IP-pool, fire-one-shot-and-move approach. Normally we crawl only once a day, so it is enough just to slow down: after crawling each page, let the thread sleep for a while before fetching the next. That greatly reduces the chance of the IP being banned.

05-09 18:52:03.669 20699-25842/com.jiesean.exhibitionspider W/System.err: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://beijing.douban.com/events/week-exhibition?start=1890
05-09 18:52:03.670 20699-25842/com.jiesean.exhibitionspider W/System.err:     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
05-09 18:52:03.671 20699-25842/com.jiesean.exhibitionspider W/System.err:     at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
05-09 18:52:03.671 20699-25842/com.jiesean.exhibitionspider W/System.err:     at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
05-09 18:52:03.671 20699-25842/com.jiesean.exhibitionspider W/System.err:     at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
05-09 18:52:03.671 20699-25842/com.jiesean.exhibitionspider W/System.err:     at com.jiesean.exhibitionspider.MainActivity$3.run(MainActivity.java:104)
05-09 18:52:03.671 20699-25842/com.jiesean.exhibitionspider W/System.err:     at java.lang.Thread.run(Thread.java:761)
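A hedged Python sketch of that throttling (the delay values and the function name fetch_page_politely are mine, for illustration): sleep between pages, and back off briefly when a 403 like the one above comes back, instead of hammering the server.

import time
import urllib.request
from urllib.error import HTTPError

def fetch_page_politely(url, delay=3.0, retries=2):
    """Fetch a URL, pausing between requests and backing off on HTTP 403."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
    for attempt in range(retries + 1):
        try:
            req = urllib.request.Request(url, headers=headers)
            html = urllib.request.urlopen(req, timeout=10).read().decode('utf-8')
            time.sleep(delay)   # be polite: rest before the caller asks for the next page
            return html
        except HTTPError as e:
            if e.code == 403 and attempt < retries:
                time.sleep(delay * (attempt + 2))   # rate-limited: wait longer, then retry
            else:
                raise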

Demo


This simple little application just makes my own life more convenient; a little time spent building it has saved me quite a lot of time since.

DEMO address: GitHub link

