2. Package¶
These packages provide all the functions and classes that the crawler needs.
2.1. swiftea_bot.module module¶
Define several functions used by all crawler classes.
-
swiftea_bot.module.
can_add_doc
(docs, new_doc)[source]¶ To avoid duplicate documents, compare the urls of the docs.
Parse self.infos of Crawler and return True if new_doc isn’t in it.
Parameters: - docs (list) – the documents to check
- new_doc (dict) – the doc to add
Returns: True if can add the doc
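The described check can be sketched in plain Python, assuming each doc is a dict with a ‘url’ key (the actual implementation may differ):

```python
def can_add_doc(docs, new_doc):
    """Return True if new_doc's url is not already among docs."""
    return all(doc['url'] != new_doc['url'] for doc in docs)

docs = [{'url': 'https://example.com/a'}, {'url': 'https://example.com/b'}]
print(can_add_doc(docs, {'url': 'https://example.com/c'}))  # True
print(can_add_doc(docs, {'url': 'https://example.com/a'}))  # False
```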
-
swiftea_bot.module.
convert_keys
(inverted_index)[source]¶ Convert str doc id keys of the inverted-index into int.
JSON converts doc id keys to str; they must be converted back to int.
Parameters: inverted_index (dict) – inverted-index to convert Returns: converted inverted-index
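A sketch of the conversion after a JSON round trip (the actual implementation may differ; here numeric keys are assumed to be doc ids):

```python
import json

def convert_keys(inverted_index):
    """Convert str doc id keys back to int after a JSON round trip.

    Only numeric keys (doc ids) are converted; word and letter keys stay str.
    """
    def convert(node):
        if not isinstance(node, dict):
            return node
        return {int(key) if key.isdigit() else key: convert(value)
                for key, value in node.items()}
    return convert(inverted_index)

index = {'FR': {'A': {'av': {'avion': {21: 2}}}}}
restored = convert_keys(json.loads(json.dumps(index)))
print(restored['FR']['A']['av']['avion'][21])  # 2
```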
-
swiftea_bot.module.
create_dirs
()[source]¶ Set up the crawler’s files and directories.
- Checks several things:
create the config directory
create the doc file if it doesn’t exist
create the config file if it doesn’t exist
create the links directory if it doesn’t exist
create the index directory if it doesn’t exist
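A minimal sketch of this setup using the standard library (the directory and file names here are illustrative; the real ones come from swiftea_bot.data):

```python
import os

def create_dirs(base='.'):
    """Create the crawler's directories and files if they don't exist."""
    for directory in ('config', 'links', 'index'):
        os.makedirs(os.path.join(base, directory), exist_ok=True)
    doc_file = os.path.join(base, 'config', 'doc.json')  # hypothetical name
    if not os.path.exists(doc_file):
        open(doc_file, 'w').close()  # create an empty doc file
```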
-
swiftea_bot.module.
def_links
()[source]¶ Create the links directory if it doesn’t exist.
Ask the user what to do if there are no basic links, and create a basic links file if the user wants it.
-
swiftea_bot.module.
errors
(message, error_code)[source]¶ Write the error report, with the time, to the errors file.
Normally called by tell() when an error_code parameter is given.
Parameters: - message (str) – message to print and write
- error_code (int) – error code
-
swiftea_bot.module.
is_index
()[source]¶ Check if there is a saved inverted-index file.
Returns: True if there is one
-
swiftea_bot.module.
remove_duplicates
(old_list)[source]¶ Remove duplicates from a list.
Parameters: old_list (list) – list to clean Returns: list without duplicates
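A one-line sketch that keeps the original order (the actual implementation may differ):

```python
def remove_duplicates(old_list):
    """Return a copy of old_list without duplicates, preserving order."""
    return list(dict.fromkeys(old_list))

print(remove_duplicates(['a', 'b', 'a', 'c']))  # ['a', 'b', 'c']
```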
-
swiftea_bot.module.
stats_webpages
(begining, end)[source]¶ Write the time in seconds taken to crawl 10 webpages.
Parameters: - begining (int) – time before starting to crawl 10 webpages
- end (int) – time after having crawled 10 webpages
-
swiftea_bot.module.
tell
(message, error_code='', severity=1)[source]¶ Manage the event log.
Print to the console what the program is doing and save a timestamped copy in an events file.
Parameters: - message (str) – message to print and write
- error_code (int) – (optional) error code; if given, call errors() with the given message
- severity (int) – 1 is the default severity; -1 adds 4 spaces before the message, 0 adds 2 spaces before the message, 2 uppercases and underlines the message.
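The severity formatting can be sketched as a pure helper (format_message is a hypothetical name, not part of the module; the real tell() also writes the result to the events file):

```python
def format_message(message, severity=1):
    """Format a message according to tell()'s severity levels."""
    if severity == -1:
        return '    ' + message      # 4 spaces before the message
    if severity == 0:
        return '  ' + message        # 2 spaces before the message
    if severity == 2:
        return message.upper() + '\n' + '-' * len(message)  # uppercase, underlined
    return message                   # default severity

print(format_message('start crawling', 2))
```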
2.2. swiftea_bot.data module¶
Define data required by the crawler.
2.3. swiftea_bot.file_manager module¶
Swiftea-Crawler uses a lot of files, for example to configure the app, save links… Here is a class that manages the crawler’s files.
-
class
swiftea_bot.file_manager.
FileManager
[source]¶ File manager for Swiftea-Crawler.
Save and read links, read and write configuration variables, and read the inverted-index both from the saved json file and from the files used when sending it.
Create the configuration file if it doesn’t exist, otherwise read it.
-
ckeck_size_links
(links)[source]¶ Check the number of links in the file.
Parameters: links (str) – links saved in the file
-
get_inverted_index
()[source]¶ Get the inverted-index stored locally.
Called after a connection error. Read the json file that contains the inverted-index, and delete the file after reading it.
Returns: inverted-index
-
get_lists_words
()[source]¶ Get the word lists from data.
Check for the word-list directories and create them if they don’t exist.
Returns: stopwords, badwords
-
get_url
()[source]¶ Get the url of the next webpage.
Check the size of the current links file being read and move on to the next one if it has been exceeded.
Returns: url of the webpage to crawl
-
read_inverted_index
()[source]¶ Get the inverted-index stored locally.
Called after the inverted-index was sent without error. Read all the files created when sending the inverted-index.
Returns: inverted-index
-
2.4. crawling.web_connection module¶
Connections to webpages are managed by the requests module. The following errors are handled: timeouts (from the socket and urllib3 modules) and all RequestException errors.
-
class
crawling.web_connection.
WebConnection
[source]¶ Manage the web connection with the page to crawl.
-
check_robots_perm
(url)[source]¶ Check robots.txt for permission.
Parameters: url (str) – webpage url Returns: True if the url can be crawled
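One way such a check can be done with the standard library’s urllib.robotparser (a sketch, not the crawler’s actual code; the rules are passed in as text to keep it offline, whereas the real method presumably fetches /robots.txt itself):

```python
from urllib import robotparser

def check_robots_perm(url, robots_txt, user_agent='*'):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())  # feed the rules line by line
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/"
print(check_robots_perm('https://example.com/page', rules))       # True
print(check_robots_perm('https://example.com/private/x', rules))  # False
```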
-
duplicate_content
(request1, url)[source]¶ Avoid duplicates caused by url parameters.
Compare the source codes with and without parameters, and return the url without parameters if the content is the same.
Parameters: request1 (requests.models.Response) – request Returns: url, source code
-
get_code
(url)[source]¶ Get the source code of the given url.
Parameters: url (str) – url of the webpage Returns: source code, True if links must not be taken, score, and new url (in case of redirection)
-
2.5. crawling.connection module¶
Define several functions for WebConnection.
-
crawling.connection.
all_urls
(request)[source]¶ Return all urls from request.history.
Parameters: - request (requests.models.Response) – request
- first (str) – the list starts with this url if given
Returns: list of redirected urls, first is the last one
-
crawling.connection.
check_connection
(url='https://github.com')[source]¶ Test internet connection.
Try to connect to a website.
Parameters: url – url used to test the connection Returns: True if connected to internet
2.6. crawling.site_informations module¶
After parsing the source code, the extracted data must be classified and cleaned. Here is a class that uses the html parser and manages all the results.
-
class
crawling.site_informations.
SiteInformations
[source]¶ Class to manage searches in source code.
-
clean_favicon
(favicon, base_url)[source]¶ Clean favicon.
Parameters: favicon (str) – favicon url to clean Returns: cleaned favicon
-
clean_keywords
(dirty_keywords, language)[source]¶ Clean the found keywords.
Delete stopwords, bad chars and words of two letters or fewer, and split word1-word2.
Parameters: dirty_keywords (list) – keywords to clean Returns: list of cleaned keywords
-
clean_links
(links, base_url=None)[source]¶ Clean the webpage’s links: rebuild urls with the base url and remove anchors, mailto, javascript and .index.
Parameters: links (list) – links to clean Returns: cleaned links without duplicates
-
detect_language
(keywords)[source]¶ Detect the language of the webpage if not given.
Parameters: keywords (list) – keywords of the webpage used for detection Returns: language found
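One plausible approach, counting stopword overlap per language (the stopword lists here are tiny illustrative samples; the crawler loads its own lists from files):

```python
# Hypothetical stopword lists; the real crawler reads them from data files.
STOPWORDS = {
    'FR': {'le', 'la', 'les', 'de', 'et', 'un', 'une'},
    'EN': {'the', 'a', 'an', 'of', 'and', 'to', 'in'},
}

def detect_language(keywords):
    """Return the language whose stopwords overlap most with keywords."""
    scores = {lang: sum(word in words for word in keywords)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else ''  # '' when nothing matches

print(detect_language(['the', 'plane', 'and', 'pilot']))  # EN
```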
-
get_infos
(url, code, nofollow, score)[source]¶ Manage all searches of the webpage’s information.
Parameters: - url (str) – url of the webpage
- score (int) – score of the webpage
- code (str) – source code of the webpage
- nofollow (bool) – whether to take the webpage’s links
Returns: links, title, description, keywords, language, score, number of words
-
2.7. crawling.searches module¶
Define several functions for SiteInformations.
-
crawling.searches.
capitalize
(text)[source]¶ Uppercase the first letter of the given text.
Parameters: text (str) – text Returns: text
-
crawling.searches.
clean_link
(url, base_url=None)[source]¶ Clean a link.
Rebuild the url with the base url, skip mailto and javascript links, remove anchors, skip urls with more than 5 query parameters or more than 255 chars, remove /index.xxx and the trailing /.
Parameters: - url (str) – link to clean
- base_url – base url for rebuilding; can be None
Returns: cleaned link
-
crawling.searches.
clean_text
(text)[source]¶ Clean up text by removing tabulations, blanks and carriage returns.
Parameters: text (str) – text to clean Returns: cleaned text
-
crawling.searches.
get_base_url
(url)[source]¶ Get base url using urlparse.
Parameters: url (str) – url Returns: base url of given url
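A sketch of get_base_url with urllib.parse (the actual implementation may differ):

```python
from urllib.parse import urlparse

def get_base_url(url):
    """Return scheme://netloc for the given url."""
    parts = urlparse(url)
    return parts.scheme + '://' + parts.netloc

print(get_base_url('https://example.com/path/page.html?q=1'))  # https://example.com
```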
2.8. crawling.parsers module¶
Webpage data is provided by the python html.parser. There are two parsers: the first for all the information and the second only for the encoding.
-
class
crawling.parsers.
ExtractData
[source]¶ Bases:
html.parser.HTMLParser
Html parser to extract data.
self.object: the type of text currently being read (title, description or keywords)
dict(attrs).get(‘content’): converts attrs to a dict and returns the value
- Data that could be extracted:
title
language
description
links with nofollow and noindex
stylesheet
favicon
keywords: h1, h2, h3, strong, em
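A minimal parser in the spirit of ExtractData, restricted to the title and links (TitleAndLinks is a hypothetical class for illustration, not the crawler’s own):

```python
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Minimal HTMLParser subclass: extract the title and the links."""
    def __init__(self):
        super().__init__()
        self.object = None  # type of text currently being read
        self.title = ''
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.object = 'title'
        elif tag == 'a':
            href = dict(attrs).get('href')  # convert attrs to a dict
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self.object == 'title':
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.object = None

parser = TitleAndLinks()
parser.feed('<html><head><title>Demo</title></head>'
            '<body><a href="/a">a</a></body></html>')
print(parser.title, parser.links)  # Demo ['/a']
```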
-
handle_endtag
(tag)[source]¶ Called when the parser meets an end tag.
Parameters: tag (str) – end tag
-
class
crawling.parsers.
ExtractEncoding
[source]¶ Bases:
html.parser.HTMLParser
Html parser to extract encoding from source code.
2.9. database.database module¶
Define several functions for DatabaseSwiftea.
2.10. database.database_manager module¶
-
class
database.database_manager.
DatabaseManager
(host, user, password, name)[source]¶ Class to manage queries to the database using PyMySQL.
How to: create a subclass
result, response = self.send_comand(command, data=tuple(), all=False)
if 'error' in response:
    print('An error occurred.')
where result is the requested data and response is a message.
Parameters: - host (str) – hostname of the db server
- user (str) – username to use for connection
- password (str) – password to use for connection
- name (str) – name of database
2.11. database.database_swiftea module¶
-
class
database.database_swiftea.
DatabaseSwiftea
(host, user, password, name, table)[source]¶ Bases:
database.database_manager.DatabaseManager
Class to manage Swiftea database.
Parameters: - host (str) – hostname of the db server
- user (str) – username to use for connection
- password (str) – password to use for connection
- name (str) – name of database
-
del_one_doc
(url, table=None)[source]¶ Delete document corresponding to url.
Parameters: url (str) – url of webpage Returns: status message
-
doc_exists
(url)[source]¶ Check if url is in database.
Parameters: url (str) – url corresponding to doc Returns: True if doc exists
-
get_doc_id
(url)[source]¶ Get id of a document in database.
Parameters: url (str) – url of webpage Returns: id of webpage or None if not found
-
https_duplicate
(old_url)[source]¶ Avoid https and http duplicate.
If the old url is secure (https), delete the insecure url if it exists, then return the secure url (the old one). If the old url is insecure (http), delete it if a secure url exists, then return the secure url (the new one).
Parameters: old_url (str) – old url Returns: url to add and url to delete
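The url-swapping rule can be sketched as follows (a simplification: the real method also checks the database for the counterpart url before deciding):

```python
def https_duplicate(old_url):
    """Return (url_to_add, url_to_delete) so only the https version is kept."""
    if old_url.startswith('https://'):
        # keep the secure url, delete its insecure counterpart
        return old_url, 'http://' + old_url[len('https://'):]
    # insecure url: add the secure counterpart, delete the insecure one
    return 'https://' + old_url[len('http://'):], old_url

print(https_duplicate('https://example.com'))  # ('https://example.com', 'http://example.com')
```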
-
insert
(infos)[source]¶ Insert a new document in the database.
Parameters: infos (dict) – doc infos Returns: True if an error occurred
-
send_doc
(webpage_infos)[source]¶ Send document information to the database.
Parameters: webpage_infos (list) – information to send to the database Returns: True if an error occurred
2.12. index.index module¶
Define several functions for the inverted-index.
-
index.index.
count_files_index
(index)[source]¶ Return the number of files to download or upload.
Parse the languages and letters of the given index.
Returns: int
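Assuming one file per two-first-letters group, as the InvertedIndex nesting suggests (an assumption, since the file layout is not documented here), the count can be sketched as:

```python
def count_files_index(index):
    """Count files to transfer: one per two-first-letters group."""
    return sum(len(letters)                      # two-letter groups per letter
               for first_letters in index.values()   # languages
               for letters in first_letters.values())  # first letters

index = {'FR': {'A': {'av': {}, 'ab': {}}, 'B': {'bo': {}}}}
print(count_files_index(index))  # 3
```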
2.13. index.inverted_index module¶
-
class
index.inverted_index.
InvertedIndex
[source]¶ Manage the inverted-index for the crawler.
The inverted-index is a dict: each key is a language
-> each value is a dict whose keys are first letters
-> each value is a dict whose keys are the two first letters
-> each value is a dict whose keys are words
-> each value is a dict whose keys are doc ids
-> each value is an int: the tf
example: [‘FR’][‘A’][‘av’][‘avion’][21] is the tf of the word ‘avion’ in doc 21 in French.
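The nesting can be illustrated directly (the tf value is hypothetical):

```python
inverted_index = {
    'FR': {               # language
        'A': {            # first letter
            'av': {       # two first letters
                'avion': {    # word
                    21: 2,    # doc id -> tf
                },
            },
        },
    },
}
print(inverted_index['FR']['A']['av']['avion'][21])  # 2
```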
-
add_doc
(keywords, doc_id, language)[source]¶ Add all the words of a doc to the inverted-index.
Parameters: - keywords (list) – all the words in doc_id
- doc_id (int) – id of the doc in the database
- language (str) – language of the words
-
add_word
(word_infos, doc_id, nb_words)[source]¶ Add a word to the inverted-index.
Parameters: - word_infos (dict) – word infos: word, language, occurrence, first letter and two first letters
- doc_id (int) – id of the doc in the database
- nb_words (int) – number of words in doc_id
-
delete_doc_id
(doc_id)[source]¶ Delete an id from the inverted-index.
Parameters: doc_id (int) – id to delete
-
delete_id_word
(word_infos, doc_id)[source]¶ Delete an id of a word from the inverted-index.
This method removes a word from a document.
Parameters: - word_infos (dict) – word infos: word, language, first letter and two first letters
- doc_id (int) – id of the doc in the database
-
2.14. index.ftp_manager module¶
-
class
index.ftp_manager.
FTPManager
(host, user='', password='', port=21)[source]¶ Bases:
ftplib.FTP
Class to connect to an ftp server more easily.
Parameters: - host (str) – hostname of the ftp server
- user (str) – username to use for connection
- password (str) – password to use for connection
-
cd
(path)[source]¶ Set the current directory on the server.
Parameters: path (str) – path to set Returns: server response
-
connection
()[source]¶ Connect to ftp server.
Catch all_errors of ftplib. Use utf-8 encoding.
Returns: server welcome message
-
countfiles
(path='.')[source]¶ Count the files in the given path.
Parameters: path (str) – path to count Returns: number of files
-
disconnect
()[source]¶ Quit connection to ftp server.
Close the connection if an error occurred while trying to quit it.
Returns: server goodbye message or error message
-
get
(local_filename, server_filename)[source]¶ Download a file from ftp server.
It creates the local file to download into.
Parameters: - local_filename (str) – local filename to create
- server_filename (str) – server filename to download
Returns: server response message or error message
-
infos_listdir
(path='.', facts=[])[source]¶ Return the result of the mlsd command of ftplib, or a list whose first element is the error response.
-
listdir
()[source]¶ Return the result of LIST command or a list whose first element is the error response.