rizqirizqi/scientific-name-scraper

scientific-name-scraper (sciscraper)

Scrape scientific-name information for plants from the internet.

Currently supported sources:

Requirements

Detailed Guide for Windows
  1. Download Python from https://www.python.org/downloads/
  2. Install Python by following the installer's instructions.
  3. Press the Win key (the key with the Windows icon), search for "env", then open "Edit the system environment variables".
  4. Click Environment Variables.
  5. In the System Variables section, edit the Path variable.
  6. Add these paths using the New button:
    # Replace <YOUR_USERNAME> with your Windows username; you can find it in the C:\Users folder
    # Replace Python310 with the folder name for your installed Python version
    C:\Users\<YOUR_USERNAME>\AppData\Local\Programs\Python\Python310
    C:\Users\<YOUR_USERNAME>\AppData\Local\Programs\Python\Python310\Scripts
    C:\Users\<YOUR_USERNAME>\AppData\Roaming\Python\Python310\Scripts
    
  7. Click OK, then OK again.
  8. Open cmd and type python --version; it should print the Python version.
  9. Type pip3 install --user pipenv to install pipenv, and make sure the installation succeeds.
  10. Type pipenv --version; it should print the pipenv version.
  11. Done! You can continue with the "How to run" section.
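
If steps 8 to 10 do not behave as described, a small script can show which commands are actually reachable through Path. This is a minimal sketch that relies only on the Python standard library; the helper name find_tools is hypothetical and is not part of sciscraper.

```python
import shutil
import sys


def find_tools(names=("python", "pip3", "pipenv")):
    """Map each command name to its resolved location, or None if it is not on Path."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    print("Running under Python", sys.version.split()[0])
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on Path'}")
```

A command reported as not found usually means the matching folder from step 6 is missing from Path, or that cmd was opened before the variables were saved.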

How to run

  1. Clone
    git clone git@github.com:rizqirizqi/scientific-name-scraper.git
    cd scientific-name-scraper
  2. Install dependencies
    pipenv --python 3
    pipenv install
  3. Put your input in input.csv; see samples/input.csv for an example. You can also use a txt or xlsx file instead.
  4. Run
    pipenv run python -m sciscraper -i input.csv
  5. The result will be placed in a file named result.*.csv
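
Because the output name contains a wildcard part (result.*.csv), it can help to locate the newest result file programmatically. The following is a rough sketch using only the standard library; the helper names latest_result and read_rows are hypothetical, the column layout of the result file is not assumed, and UTF-8 encoding is an assumption.

```python
import csv
import glob
import os


def latest_result(directory="."):
    """Return the most recently modified result.*.csv in directory, or None."""
    matches = glob.glob(os.path.join(directory, "result.*.csv"))
    return max(matches, key=os.path.getmtime) if matches else None


def read_rows(path):
    """Read every row of a CSV file as a list of strings (UTF-8 is assumed)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    path = latest_result()
    if path is None:
        print("No result.*.csv found; run the scraper first.")
    else:
        # Show only the first few rows as a quick sanity check.
        for row in read_rows(path)[:5]:
            print(row)
```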

Help

pipenv run python -m sciscraper --help

Test Shell

pipenv run scrapy shell <URL>
# Switchboard Example
pipenv run scrapy shell 'http://apps.worldagroforestry.org/products/switchboard/index.php/species_search/Acacia%20abyssinica'
# WFO Example
pipenv run scrapy shell 'http://www.worldfloraonline.org/search?query=Costus+speciosus&view=&limit=5&start=0&sort='
result = response.css("#v results > table tr")[0]
data_col = result.css("td:nth-child(2)")
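
The two example URLs above encode the species name differently: the Switchboard URL puts it in the path with %20 for the space, while the WFO URL puts it in the query string with + for the space. A minimal sketch for building such URLs for any species name, using only urllib.parse; the function names are hypothetical, and the parameter defaults simply mirror the WFO example above.

```python
from urllib.parse import quote, urlencode

SWITCHBOARD_BASE = (
    "http://apps.worldagroforestry.org/products/switchboard/"
    "index.php/species_search/"
)
WFO_BASE = "http://www.worldfloraonline.org/search"


def switchboard_url(species):
    """Percent-encode the name for use as a path segment (space becomes %20)."""
    return SWITCHBOARD_BASE + quote(species)


def wfo_url(species, limit=5, start=0):
    """Form-encode the name as a query parameter (space becomes +)."""
    params = {"query": species, "view": "", "limit": limit, "start": start, "sort": ""}
    return WFO_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(switchboard_url("Acacia abyssinica"))
    print(wfo_url("Costus speciosus"))
```

The printed URLs can be passed to pipenv run scrapy shell, quoted as in the examples above.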

Cleanup All Default Outputs

rm result.* && rm log.*
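
The rm command is not available in Windows cmd, and with && the log files are left in place when no result file exists. A possible cross-platform alternative, sketched with the standard library only; the function name cleanup_outputs is hypothetical and not part of sciscraper.

```python
import glob
import os


def cleanup_outputs(directory="."):
    """Delete files matching result.* and log.* in directory; return the removed paths."""
    removed = []
    for pattern in ("result.*", "log.*"):
        for path in glob.glob(os.path.join(directory, pattern)):
            # Skip directories that happen to match the pattern.
            if os.path.isfile(path):
                os.remove(path)
                removed.append(path)
    return sorted(removed)


if __name__ == "__main__":
    for path in cleanup_outputs():
        print("removed", path)
```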

Switchboard Special Cases

| Case | Link | Note |
| ICRAF Database Not Found | Engelhardia spicata | Need human to check ✔ |
| Genus Found | Forficula | Need human to check ✔ |
| Multiple Species Found | Alstonia spectabilis | Get the matched substring of the species ✔ |
| Similar Species Found | Costus speciosus | Need human to check ✔ |
| Similar Species Found: variant | Engelhardtia spicata | Get the exact match ✔ |
| Similar Species Found: subsp / ssp | Ailanthus integrifolia | Get the species ✔ |
| Similar Species Found: double space | Anacardium occidentale | Get the exact match ✔ |
| Duplicate Link Found | Intsia bijuga | Need human to check ✔ |
| External Link Found | Elaeocarpus petiolatus | Remove the link ✔ |
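
To make a few of these cases concrete, here is one way the "double space", "subsp / ssp", and "need human to check" outcomes could be expressed. This is an illustrative sketch only, not the scraper's actual logic; the names normalize, species_part, and pick_match are hypothetical, and the candidate lists in the usage example are made up for illustration.

```python
import re


def normalize(name):
    """Collapse runs of whitespace and lowercase the name for comparison."""
    return re.sub(r"\s+", " ", name).strip().lower()


def species_part(name):
    """Keep only the first two words (genus and species), dropping subsp./ssp. suffixes."""
    return " ".join(normalize(name).split()[:2])


def pick_match(query, candidates):
    """Return the best candidate for query, or None when a human should check."""
    wanted = normalize(query)
    # First pass: exact match once whitespace differences are ignored.
    for candidate in candidates:
        if normalize(candidate) == wanted:
            return candidate
    # Second pass: match at species level, ignoring subspecies suffixes.
    for candidate in candidates:
        if species_part(candidate) == species_part(wanted):
            return candidate
    return None


if __name__ == "__main__":
    print(pick_match("Anacardium occidentale", ["Anacardium  occidentale"]))
    print(pick_match("Ailanthus integrifolia", ["Ailanthus integrifolia subsp. calycina"]))
    print(pick_match("Costus speciosus", ["Costus spicatus"]))
```

Returning None rather than a near match keeps the ambiguous cases visible, which fits the "Need human to check" notes in the table.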

Contributing

  1. Fork this repo
  2. Develop
  3. Create pull request
  4. Tag @rizqirizqi for review
  5. Merge

License

MIT