Scrape Twitter for Tweets

Overview


Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

Synopsis

A simple script to scrape tweets, using the Python package requests to retrieve the content and BeautifulSoup4 to parse it.

1. Motivation

Twitter provides REST APIs which developers can use to access and read Twitter data. It also provides a Streaming API which can be used to access Twitter data in real time.

Most of the software written to access Twitter data provides a library that functions as a wrapper around Twitter's Search and Streaming APIs, and is therefore constrained by the limitations of those APIs.

With Twitter's Search API you can only send 180 requests every 15 minutes. With a maximum of 100 tweets per request, you can mine 72,000 tweets per hour (4 x 180 x 100 = 72,000). By using TwitterScraper you are not limited by this number, but only by your internet speed/bandwidth and the number of instances of TwitterScraper you are willing to start.

One of the bigger disadvantages of the Search API is that you can only access Tweets written in the past 7 days. This is a major bottleneck for anyone looking for older data. With TwitterScraper there is no such limitation.

Per Tweet it scrapes the following information:
  • Tweet-id
  • Tweet-url
  • Tweet text
  • Tweet html
  • Links inside Tweet
  • Hashtags inside Tweet
  • Image URLS inside Tweet
  • Video URL inside Tweet
  • Tweet timestamp
  • Tweet Epoch timestamp
  • Tweet No. of likes
  • Tweet No. of replies
  • Tweet No. of retweets
  • Username
  • User Full Name / Screen Name
  • User ID
  • Whether the Tweet is a reply
  • Whether the Tweet is replied to
  • List of users the Tweet is a reply to
  • Tweet ID of parent tweet
In addition it can scrape for the following user information:
  • Date user joined
  • User location (if filled in)
  • User blog (if filled in)
  • User No. of tweets
  • User No. of following
  • User No. of followers
  • User No. of likes
  • User No. of lists
  • User is verified

2. Installation and Usage

To install twitterscraper:

(sudo) pip install twitterscraper

or you can clone the repository and, in the folder containing setup.py, run

python setup.py install

If you prefer more isolation you can build a docker image

docker build -t twitterscraper:build .

and run your container with:

docker run --rm -it -v/<PATH_TO_SOME_SHARED_FOLDER_FOR_RESULTS>:/app/data twitterscraper:build <YOUR_QUERY>

2.2 The CLI

You can use the command line application to get your tweets stored to JSON right away. Twitterscraper takes several arguments:

  • -h or --help Print out the help message and exit.
  • -l or --limit TwitterScraper stops scraping when at least the number of tweets indicated with --limit is scraped. Since tweets are retrieved in batches of 20, this will always be a multiple of 20. Omit the limit to retrieve all tweets. You can abort the scraping at any time by pressing Ctrl+C; the tweets scraped so far will be stored safely in your JSON file.
  • --lang Retrieves tweets written in a specific language. Currently 30+ languages are supported. For a full list of the languages print out the help message.
  • -bd or --begindate Set the date from which TwitterScraper should start scraping for your query. Format is YYYY-MM-DD. The default value is set to 2006-03-21. This does not work in combination with --user.
  • -ed or --enddate Set the enddate which TwitterScraper should use to stop scraping for your query. Format is YYYY-MM-DD. The default value is set to today. This does not work in combination with --user.
  • -u or --user Scrapes the tweets from that user's profile page. This also includes all retweets by that user. See section 2.2.3 in the examples below for more information.
  • --profiles: In addition to the tweets, Twitterscraper will also scrape the profile information of the users who have written these tweets. The results will be saved in the file userprofiles_<filename>.
  • -p or --poolsize Set the number of parallel processes TwitterScraper should initiate while scraping for your query. Default value is set to 20. Depending on the computational power you have, you can increase this number. It is advised to keep this number below the number of days you are scraping. For example, if you are scraping from 2017-01-10 to 2017-01-20, you can set this number to a maximum of 10. If you are scraping from 2016-01-01 to 2016-12-31, you can increase this number to a maximum of 150, if you have the computational resources. Does not work in combination with --user.
  • -o or --output Gives the name of the output file. If no output filename is given, the default filename 'tweets.json' or 'tweets.csv' will be used.
  • -c or --csv Write the result to a CSV file instead of a JSON file.
  • -d or --dump: With this argument, the scraped tweets will be printed to the screen instead of written to an output file. If you are using this argument, the --output argument does not need to be used.
  • -ow or --overwrite: With this argument, if the output file already exists it will be overwritten. If this argument is not set (default) twitterscraper will exit with the warning that the output file already exists.
  • -dp or --disableproxy: With this argument, proxy servers are not used when scraping tweets or user profiles from Twitter.
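
For example (the query term and file names below are placeholders, not part of the original examples), several of these arguments can be combined in a single call:

twitterscraper "climate change" --limit 500 -bd 2019-01-01 -ed 2019-06-30 -p 10 --csv --profiles -o climate_tweets.csv

This would scrape up to 500 tweets about "climate change" posted in the first half of 2019 using 10 parallel processes, write them as CSV to climate_tweets.csv, and save the authors' profile information to userprofiles_climate_tweets.csv.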

2.2.1 Examples of simple queries

Below is an example of how twitterscraper can be used:

twitterscraper Trump --limit 1000 --output=tweets.json

twitterscraper Trump -l 1000 -o tweets.json

twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json

2.2.2 Examples of advanced queries

You can use any advanced query Twitter supports. An advanced query should be placed within quotes, so that twitterscraper can recognize it as one single query.

Here are some examples:

  • search for the occurrence of 'Bitcoin' or 'BTC': twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000
  • search for the occurrence of 'Bitcoin' and 'BTC': twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000
  • search for tweets from a specific user: twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000
  • search for tweets to a specific user: twitterscraper "Blockchain to:VitalikButerin" -o blockchain_tweets.json -l 1000
  • search for tweets written from a location: twitterscraper "Blockchain near:Seattle within:15mi" -o blockchain_tweets.json -l 1000

You can construct an advanced query on Twitter Advanced Search or use one of the operators shown on this page. Also see Twitter's Standard operators.

2.2.3 Examples of scraping user pages

You can also scrape all tweets written or retweeted by a specific user. This can be done by adding the -u / --user argument. If this argument is used, the search term should be equal to the username.

Here is an example of scraping a specific user:

twitterscraper realDonaldTrump --user -o tweets_username.json

This does not work in combination with -p, -bd, or -ed.

The main difference from the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes all tweets from a profile page (including retweets). The example in 2.2.2 scrapes the results from the search page (excluding retweets).

2.3 From within Python

You can easily use TwitterScraper from within Python:

from twitterscraper import query_tweets

if __name__ == '__main__':
    list_of_tweets = query_tweets("Trump OR Clinton", 10)

    # print the retrieved tweets to the screen:
    for tweet in list_of_tweets:
        print(tweet)

    # or save the retrieved tweets to a file:
    with open("output.txt", "w", encoding="utf-8") as output:
        for tweet in list_of_tweets:
            output.write(tweet.text + "\n")

2.3.1 Examples of Python Queries

  • Query tweets from a given URL:
    Parameters:
    • query: The query search parameter of url
    • lang: Language of queried url
    • pos: Parameter passed for where to start looking in url
    • retry: Number of times to retry if error
    query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60)
  • Query all tweets that match a query (see the sketch after this list):
    Parameters:
    • query: The query search parameter
    • limit: Number of tweets returned
    • begindate: Start date of query
    • enddate: End date of query
    • poolsize: Number of parallel processes used while scraping
    • lang: Language of query
    query_tweets('query', limit=None, begindate=dt.date.today(), enddate=dt.date.today(), poolsize=20, lang='')
  • Query tweets from a specific user:
    Parameters:
    • user: Twitter username
    • limit: Number of tweets returned
    query_tweets(user, limit=None)
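
For example, here is a minimal sketch combining the query_tweets parameters listed above (the query string, date range, limit and language are arbitrary placeholders):

import datetime as dt
from twitterscraper import query_tweets

if __name__ == '__main__':
    # scrape up to 100 English tweets mentioning Bitcoin posted in January 2018
    tweets = query_tweets('Bitcoin', limit=100,
                          begindate=dt.date(2018, 1, 1),
                          enddate=dt.date(2018, 2, 1),
                          poolsize=10, lang='en')
    for tweet in tweets:
        # timestamp and text are among the scraped Tweet fields (see section 1)
        print(tweet.timestamp, tweet.text)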

2.4 Scraping for retweets

A regular search within Twitter will not show you any retweets. Twitterscraper therefore will not include any retweets in the output.

To give an example: if user1 has written a tweet containing #trump2020 and user2 has retweeted this tweet, a search for #trump2020 will only show the original tweet.

The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the -u / --user argument.

2.5 Scraping for User Profile information

By adding the argument --profiles, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets. The results will be saved in the file "userprofiles_<filename>".

Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :) It is also possible to scrape for profile information without scraping for tweets. Examples of this can be found in the examples folder.
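
A minimal sketch of doing this from Python is shown below; it assumes the query_user_info helper exposed by the package, and the attribute names in the print statement are assumptions based on the user fields listed in section 1:

from twitterscraper import query_user_info

if __name__ == '__main__':
    # scrape only the profile page of a single user (no tweets)
    profile = query_user_info(user='realDonaldTrump')
    if profile is not None:
        # followers, following and location are assumed attribute names (see section 1)
        print(profile.user, profile.followers, profile.following, profile.location)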

3. Output

All of the retrieved Tweets are stored in the indicated output file. The contents of the output file will look like:

[{"fullname": "Rupert Meehl", "id": "892397793071050752", "likes": "1", "replies": "0", "retweets": "0", "text": "Latest: Trump now at lowest Approval and highest Disapproval ratings yet. Oh, we're winning bigly here ...\n\nhttps://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "Rupert_Meehl"}, {"fullname": "Barry Shapiro", "id": "892397794375327744", "likes": "0", "replies": "0", "retweets": "0", "text": "A former GOP Rep quoted this line, which pretty much sums up Donald Trump. https://twitter.com/davidfrum/status/863017301595107329\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "barryshap"}, (...)
]

3.1 Opening the output file

In order to correctly handle all possible characters in the tweets (think of Japanese or Arabic characters), the output is saved as utf-8 encoded bytes. That is why you could see text like "\u30b1 \u30f3 \u3055 \u307e \u30fe ..." in the output file.

What you should do is open the file with the proper encoding:

[Screenshot: example of output with Japanese characters (https://user-images.githubusercontent.com/4409108/30702318-f05bc196-9eec-11e7-8234-a07aabec294f.PNG)]
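
A minimal sketch of doing the same from Python, assuming the default output file name tweets.json:

import json

# opening the file with an explicit utf-8 encoding ensures non-ASCII characters are decoded correctly
with open('tweets.json', encoding='utf-8') as f:
    tweets = json.load(f)

print(tweets[0]['text'])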

3.1.2 Opening into a pandas dataframe

After the file has been opened, it can easily be converted into a `pandas` DataFrame

import pandas as pd
df = pd.read_json('tweets.json', encoding='utf-8')
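
Continuing from the snippet above (so pd and df are already defined), the timestamp column shown in the sample output of section 3 can, for example, be parsed and used as an index:

# parse the ISO timestamps and index the DataFrame chronologically
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp').sort_index()
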
Comments
  • no results...

    ERROR: Failed to parse JSON "Expecting value: line 1 column 1 (char 0)" while requesting "https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-838177224989753344-838177234682773505&q=trump%20since%3A2016-07-25%20until%3A2017-03-05&l=None"

    I don't know why suddenly I'm getting into this problem.

    opened by usajameskwon 38
  • no results

    The following example does not return any tweet, expected?

    twitterscraper Trump -l 100 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json

    INFO: Querying Trump since:2017-01-01 until:2017-01-09 INFO: Querying Trump since:2017-01-25 until:2017-02-02 INFO: Querying Trump since:2017-01-17 until:2017-01-25 INFO: Querying Trump since:2017-02-02 until:2017-02-10 INFO: Querying Trump since:2017-02-10 until:2017-02-18 INFO: Querying Trump since:2017-02-18 until:2017-02-26 INFO: Querying Trump since:2017-02-26 until:2017-03-06 INFO: Querying Trump since:2017-01-09 until:2017-01-17 INFO: Querying Trump since:2017-03-06 until:2017-03-14 INFO: Querying Trump since:2017-03-14 until:2017-03-22 INFO: Querying Trump since:2017-03-22 until:2017-03-30 INFO: Querying Trump since:2017-03-30 until:2017-04-07 INFO: Querying Trump since:2017-04-07 until:2017-04-15 INFO: Querying Trump since:2017-04-15 until:2017-04-23 INFO: Querying Trump since:2017-04-23 until:2017-05-01 INFO: Querying Trump since:2017-05-01 until:2017-05-09 INFO: Querying Trump since:2017-05-09 until:2017-05-17 INFO: Querying Trump since:2017-05-17 until:2017-05-25 INFO: Querying Trump since:2017-05-25 until:2017-06-01 INFO: Got 0 tweets for Trump%20since%3A2017-02-10%20until%3A2017-02-18. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-03-30%20until%3A2017-04-07. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-02-18%20until%3A2017-02-26. INFO: Got 0 tweets for Trump%20since%3A2017-01-09%20until%3A2017-01-17. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-04-07%20until%3A2017-04-15. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-04-15%20until%3A2017-04-23. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-03-22%20until%3A2017-03-30. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-03-06%20until%3A2017-03-14. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-05-25%20until%3A2017-06-01. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-01-01%20until%3A2017-01-09. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-04-23%20until%3A2017-05-01. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-01-17%20until%3A2017-01-25. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-05-01%20until%3A2017-05-09. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-03-14%20until%3A2017-03-22. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-01-25%20until%3A2017-02-02. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-02-26%20until%3A2017-03-06. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-05-17%20until%3A2017-05-25. INFO: Got 0 tweets for Trump%20since%3A2017-02-02%20until%3A2017-02-10. INFO: Got 0 tweets (0 new). INFO: Got 0 tweets (0 new). INFO: Got 0 tweets for Trump%20since%3A2017-05-09%20until%3A2017-05-17. INFO: Got 0 tweets (0 new).

    opened by bartengine27 34
  • Want to retrieve the tweets account/handle wise not keyword wise

    I want to retrieve all the tweets by account wise. Say @amazon has tweeted 27.8K tweets till date, how to retrieve all the tweets made by amazon rather than the tweets in which amazon keyword is included. However i tried using latest advanced search query option in INIT_URL as "https://twitter.com/search?l=&q=from%3A{q}&src=typd&lang=en" but could not find reload url for the same. But this option does not give me whole tweets as I need to modify the script tweet.py to retrieve the tweeter data using tags by BeautifulSoup in from_soup method.

    opened by NileshJorwar 27
  • Parralel scraping doesn't seem to work

    I did a few logging modifications and if you checkout https://github.com/sils/twitterscraper/tree/sils/parallel and scrape for test or something like that you'll get like 60ish tweets sometimes for some parts of months which seems rather impossible (and doesn't check out if you put in the advanced query into the search UI)

    @taspinar if you have any idea that'd help a lot :/

    opened by sils 22
  • Rework logging management

    See #260. Please let me know if further formalities are required!

    Changelog

    • Specify constant logger name, making it easy for clients to configure loglevel of twitterscraper
    • Nuke ts_logger
    • Add loglevel argument to CLI
    opened by LinqLover 14
  • Only getting small amount of data before midnight

    I am trying to scrape tweets about Bitcoin from November to April. However, the data I obtained only contains those ones before midnight, looks like this:

    [screenshot of the scraped output]

    which misses the majority of the tweets...

    I wonder anyone has met the same issue

    help wanted critical bug 
    opened by shenyifan17 14
  • ImportError: No module named 'tweet'

    I get the following error when trying to use this. Installed in a venv via pip

    Traceback (most recent call last):
      File "collector.py", line 1, in <module>
        import twitterscraper
      File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/__init__.py", line 13, in <module>
        from twitterscraper.query import query_tweets
      File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/query.py", line 14, in <module>
        from tweet import Tweet
    ImportError: No module named 'tweet'
    
    opened by sims1253 13
  • Twitterscraper was working fine till tuesday but now it shows 0 tweets

    I am facing problem while retrieving tweets with twitterscraper as it was working fine till tuesday but now it shows 0 tweets. I even tried to clone the git repository and change header list, but still it is not working. Is there anyway to fix this issue or anything wrong on twitterscraper side. please help me out !!:((

    opened by lubhaniagarwal 12
  • Is it Possible to get Tweets for all hours of the day?

    Hello, Is it possible to get tweets for all hours of the day? If i use query_tweets it always start to scrape from 23.59 from each day. I already tried to set the limit high and go day by day to get enough tweets until it reaches 0.00, but he still dont get enough tweets most of the time to get to the start of the day. Also the API stops kinda random at high limits (sometimes just scrapes 1k sometimes 16k etc..) It would be great if anyone got a solution for this problem.

    opened by Flyinghans 12
  • SSL error

    when I tried this statement,

    twitterscraper "Trump OR Clinton" --limit 100 --output=tweets.json I obtained the following error:

    INFO: {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201'} Traceback (most recent call last): File "c:\users\amel\anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 453, in wrap_socket cnx.do_handshake() File "c:\users\amel\anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1915, in do_handshake self._raise_ssl_error(self._ssl, result) File "c:\users\amel\anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1639, in _raise_ssl_error raise SysCallError(errno, errorcode.get(errno)) OpenSSL.SSL.SysCallError: (10054, 'WSAECONNRESET')

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "c:\users\amel\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen chunked=chunked) File "c:\users\amel\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request self._validate_conn(conn) File "c:\users\amel\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 839, in validate_conn conn.connect() File "c:\users\amel\anaconda3\lib\site-packages\urllib3\connection.py", line 344, in connect ssl_context=context) File "c:\users\amel\anaconda3\lib\site-packages\urllib3\util\ssl.py", line 344, in ssl_wrap_socket return context.wrap_socket(sock, server_hostname=server_hostname) File "c:\users\amel\anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 459, in wrap_socket raise ssl.SSLError('bad handshake: %r' % e) ssl.SSLError: ("bad handshake: SysCallError(10054, 'WSAECONNRESET')",)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "c:\users\amel\anaconda3\lib\site-packages\requests\adapters.py", line 449, in send timeout=timeout File "c:\users\amel\anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen _stacktrace=sys.exc_info()[2]) File "c:\users\amel\anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='free-proxy-list.net', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(10054, 'WSAECONNRESET')")))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "c:\users\amel\anaconda3\lib\runpy.py", line 193, in run_module_as_main "main", mod_spec) File "c:\users\amel\anaconda3\lib\runpy.py", line 85, in run_code exec(code, run_globals) File "C:\Users\Amel\Anaconda3\Scripts\twitterscraper.exe_main.py", line 5, in File "c:\users\amel\anaconda3\lib\site-packages\twitterscraper_init.py", line 13, in from twitterscraper.query import query_tweets File "c:\users\amel\anaconda3\lib\site-packages\twitterscraper\query.py", line 73, in proxies = get_proxies() File "c:\users\amel\anaconda3\lib\site-packages\twitterscraper\query.py", line 43, in get_proxies response = requests.get(PROXY_URL) File "c:\users\amel\anaconda3\lib\site-packages\requests\api.py", line 75, in get return request('get', url, params=params, **kwargs) File "c:\users\amel\anaconda3\lib\site-packages\requests\api.py", line 60, in request return session.request(method=method, url=url, **kwargs) File "c:\users\amel\anaconda3\lib\site-packages\requests\sessions.py", line 533, in request resp = self.send(prep, **send_kwargs) File "c:\users\amel\anaconda3\lib\site-packages\requests\sessions.py", line 646, in send r = adapter.send(request, **kwargs) File "c:\users\amel\anaconda3\lib\site-packages\requests\adapters.py", line 514, in send raise SSLError(e, request=request) requests.exceptions.SSLError: HTTPSConnectionPool(host='free-proxy-list.net', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(10054, 'WSAECONNRESET')")))

    C:>

    opened by amelksibi2019 12
  • JSONDecodeError

    When running this: twitterscraper "ethereum OR eth" -bd 2018-01-02 -ed 2018-01-15 -o bitcoin_tweets.json

    I get this error: ERROR:root:Failed to parse JSON "Expecting value: line 1 column 1 (char 0)" while requesting "https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-949063965027569664-949067891449634816&q=ethereum%20OR%20eth%20since%3A2018-01-04%20until%3A2018-01-05&l=None". Traceback (most recent call last): File "c:\users....\appdata\local\programs\python\python36-32\lib\site-packages\twitterscraper\query.py", line 38, in query_single_page json_resp = response.json() File "c:\users....\appdata\local\programs\python\python36-32\lib\site-packages\requests\models.py", line 866, in json return complexjson.loads(self.text, **kwargs) File "c:\users....\appdata\local\programs\python\python36-32\lib\json_init_.py", line 354, in loads return _default_decoder.decode(s) File "c:\users....\appdata\local\programs\python\python36-32\lib\json\decoder.py", line 339, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "c:\users....\appdata\local\programs\python\python36-32\lib\json\decoder.py", line 357, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    Any idea why?

    opened by eimis41 12
  • Getting error with main code inside beatifulSoup

    /usr/local/lib/python3.7/dist-packages/twitterscraper/query.py in get_proxies() 47 soup = BeautifulSoup(response.text, 'lxml') 48 table = soup.find('table',id='proxylisttable') ---> 49 list_tr = table.find_all('tr') 50 list_td = [elem.find_all('td') for elem in list_tr] 51 list_td = list(filter(None, list_td))

    AttributeError: 'NoneType' object has no attribute 'find_all'

    When I run CLI or in python It get error from BeautifulSoup lib, please help

    opened by tienquyet28 1
  • query_tweet function throwing errors

    Hi, I don't know if this package still works but the query_tweet function is throwing WorkerError, which I don't understand:

    query_tweets(handle, limit=None, begindate=dt.date(2020, 5, 10), enddate=enddateset, poolsize=1, lang='')

    This is the error I got:

    Process 'ForkPoolWorker-2' pid:84180 exited with 'signal 11 (SIGSEGV)'
    raise WorkerLostError(
    billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV) Job: 1.
    
    opened by ehsong 0
  • Is this project functional?

    fixed issue reported here with:

        import pandas as pd
        ...
        def get_proxies():
            resp = requests.get(PROXY_URL)
            df = pd.read_html(resp.text)[0]
            list_ip = list(df['IP Address'].values)
            list_ports = list(df['Port'].values.astype(str))
            list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))]
    

    however, this still does not work.

    list_of_tweets = query_tweets("Trump OR Clinton", 10) returns:

    Exception: Traceback (most recent call last):
    --> 249             for new_tweets in pool.imap_unordered(partial(query_tweets_once, limit=limit_per_pool, lang=lang, use_proxy=use_proxy), queries):
        250                 all_tweets.extend(new_tweets)
        251                 logger.info('Got {} tweets ({} new).'.format(
    
    

    also, query_user_info fails as shown here:

    query_user_info(user='elonmusk')

    opened by barniker 1
  • twitter scrapper error

    Hi all,

    While using twitter scrapper,

    I have this code

    from twitterscraper import query_tweets
    import datetime as dt
    import pandas as pd

    begin_date = dt.date(2020,3,1)
    end_date = dt.date(2021,11,1)

    limit = 100
    lang = 'english'

    tweets = query_tweets('vaccinesideeffects', begindate = begin_date, enddate = end_date, limit = limit, lang = lang)
    df = pd.DataFrame(t.__dict__ for t in tweets)

    df = df['text']

    df

    Getting below error


    AttributeError Traceback (most recent call last) in ----> 1 from twitterscraper import query_tweets 2 import datetime as dt 3 import pandas as pd 4 5 begin_date = dt.date(2020,3,1)

    ~/opt/anaconda3/lib/python3.8/site-packages/twitterscraper/init.py in 11 12 ---> 13 from twitterscraper.query import query_tweets 14 from twitterscraper.query import query_tweets_from_user 15 from twitterscraper.query import query_user_info

    ~/opt/anaconda3/lib/python3.8/site-packages/twitterscraper/query.py in 74 yield start + h * i 75 ---> 76 proxies = get_proxies() 77 proxy_pool = cycle(proxies) 78

    ~/opt/anaconda3/lib/python3.8/site-packages/twitterscraper/query.py in get_proxies() 47 soup = BeautifulSoup(response.text, 'lxml') 48 table = soup.find('table',id='proxylisttable') ---> 49 list_tr = table.find_all('tr') 50 list_td = [elem.find_all('td') for elem in list_tr] 51 list_td = list(filter(None, list_td))

    AttributeError: 'NoneType' object has no attribute 'find_all'

    opened by mahajnay 7
  • Not working due to : AttributeError: 'NoneType' object has no attribute 'find_all'

    Python 3.9.7 osx Big sur.

    twitterscraper Trump --limit 1000 --output=tweets.json

    Traceback (most recent call last): File "/usr/local/bin/twitterscraper", line 33, in sys.exit(load_entry_point('twitterscraper==1.6.1', 'console_scripts', 'twitterscraper')()) File "/usr/local/bin/twitterscraper", line 25, in importlib_load_entry_point return next(matches).load() File "/usr/local/Cellar/[email protected]/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/metadata.py", line 77, in load module = import_module(match.group('module')) File "/usr/local/Cellar/[email protected]/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/init.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1030, in _gcd_import File "", line 1007, in _find_and_load File "", line 972, in _find_and_load_unlocked File "", line 228, in _call_with_frames_removed File "", line 1030, in _gcd_import File "", line 1007, in _find_and_load File "", line 986, in _find_and_load_unlocked File "", line 664, in _load_unlocked File "", line 627, in _load_backward_compatible File "", line 259, in load_module File "/usr/local/lib/python3.9/site-packages/twitterscraper-1.6.1-py3.9.egg/twitterscraper/init.py", line 13, in File "", line 1007, in _find_and_load File "", line 986, in _find_and_load_unlocked File "", line 664, in _load_unlocked File "", line 627, in _load_backward_compatible File "", line 259, in load_module File "/usr/local/lib/python3.9/site-packages/twitterscraper-1.6.1-py3.9.egg/twitterscraper/query.py", line 76, in File "/usr/local/lib/python3.9/site-packages/twitterscraper-1.6.1-py3.9.egg/twitterscraper/query.py", line 49, in get_proxies AttributeError: 'NoneType' object has no attribute 'find_all'

    opened by steeley 13
Releases(1.6.0)
  • 1.6.0(Jul 22, 2020)

    • PR234: Adds command line argument -dp or --disableproxy to disable the use of proxies when querying.
    • PR261: Improve logging; there is no ts_logger file, logger is initiated in main.py and query.py, loglevel is set via CLI.
  • 1.5.0(Jul 22, 2020)

    Fixed

    • PR304: Fixed query.py by adding 'X-Requested-With': 'XMLHttpRequest' to header value.
    • PR253: Fixed Docker build

    Added

    • PR313: Added example to README (section 2.3.1).
    • PR277: Support emojis by adding the alt text of images to the tweet text.
  • 1.4.0(Nov 3, 2019)

    Add new Tweet attributes:

    • links, hashtags
    • image urls, video_url,
    • whether or not it is a reply
    • tweet-ID of parent tweet in case of reply,
    • list of usernames that are replied to

    Deleted some Tweet attributes:

    • Tweet.retweet_id
    • Tweet.retweeter_username
    • Tweet.retweet_userid
  • 1.2.0(Jun 22, 2019)

    This version includes some additional fields in the output:

    • is_verified to indicate whether a user has verified status
    • is_retweet to indicate whether a tweet is a retweet
    • retweeter_username if it is a retweet
    • retweeter_userid if it is a retweet
    • retweet_id if it is a retweet
    • epoch timestamp

    In addition it uses billiard for multiprocessing, which makes it possible to use twitterscraper within Celery. fake-useragent is used to generate random user-agent headers.

  • 0.7.0(May 6, 2018)

    • Users can now save the tweets in CSV format by using the command line arguments "-c" or "--csv"

    • The default value of begindate is set to 2006-03-21. The previous value (2017-01-01) was chosen arbitrarily and led to questions about why not all tweets were retrieved.

    • By using linspace() instead of range() to divide the number of days into the number of parallel processes, edge cases ( p = 1 ) now also work fine.

  • 0.5.0(Feb 2, 2018)

Owner
Ahmet Taspinar
Physicist disguised as a Data Scientist. Blogging at http://www.ataspinar.com