https://cnap.hv.se/getapod - 193.10.203.20 - https://github.com/Telsak/podbooking
Getapod is the booking system for the Cisco lab equipment. It is a Python-based Flask application that runs via Gunicorn in a Docker container.
The database runs directly against a file on disk (not a service) and currently contains users, rooms and bookings.
Do not change --workers 1 to anything other than 1 for now; this avoids deadlocks when we work against the database.
getapod:app comes from the main file being named getapod.py.
SCRIPT_NAME: /getapod adds /getapod to the path Gunicorn serves under, so that it matches the URL path in nginx.
version: "3"
services:
  podbooking-green:
    image: podbooking:1.1.0-skills
    container_name: podbooking-green
    hostname: podbooking-green
    environment:
      TZ: Europe/Stockholm
      SCRIPT_NAME: /getapod
    command: gunicorn --bind 0.0.0.0:8000 --workers 1 getapod:app
    volumes:
      - /opt/podbooking-green:/var/www
    networks:
      wikinet:
        aliases:
          - podbooking-green
We need to define our app so that nginx can point to it with proxy_pass. This is done with an upstream <app> {} block. Note that the server name must match the internal container_name in the Docker configuration above.
upstream podbooking-green {
    server podbooking-green:8000;
}

server {
    listen 80;
    server_name cnap.hv.se;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    root /config/www;
    index index.html index.htm index.php;
    server_name cnap.hv.se;

    ssl_certificate /config/keys/cnap_hv_se.cer;
    ssl_certificate_key /config/keys/cnap_hv_se.key;

    client_max_body_size 0;

    location /getapod {
        proxy_pass http://podbooking-green;
        proxy_set_header Host "cnap.hv.se";
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
If nothing has changed in the database structure and no new third-party modules have been installed, it is enough to run cd /opt/podbooking-green && git pull && docker restart podbooking-green as root.
If new third-party modules are imported in the Python files, the image needs to be rebuilt. It is important that the repo contains an up-to-date requirements.txt file listing every module installed via pip.
The file is generated with pipreqs . --force --encoding=utf8 --ignore ".venv" on the system where the pod booking is developed. It is then pulled in by the Dockerfile line pip install -r requirements.txt when the image is built in docker-compose.
Make sure the following lines are included:
Werkzeug==2.1.2
gunicorn==20.1.0
Gunicorn is not called anywhere in the code, so it is not certain that pipreqs adds it. Werkzeug is pinned to a specific version so that a Flask bug does not appear when versions mismatch.
Rebuild the container image from the test environment with docker build -t telsak/podbooking:MAJOR.MINOR.PATCH-feature .
For example: docker build -t telsak/podbooking:1.1.1-ics .
Upload it to Docker Hub: docker push telsak/podbooking
If the database structure has changed, the new tables need to be initialized before we can use them.
docker start podbooking-blue
docker exec -it podbooking-blue /bin/bash
to get a shell inside the container, then start python3:
>>> from getapod import db
>>> db.create_all()
>>> exit()
This results in a data.sqlite file that contains the new table. (You need to run all of this with root privileges.)
cd /opt && ./prep_testbooking.sh
cd podbooking-blue && git merge --squash <branch>
In /opt/docker-compose.yml, add a block for the test container podbooking-blue that points to /opt/podbooking-blue:
  podbooking-blue:
    image: telsak/podbooking:MAJOR.MINOR.PATCH-feature
    container_name: podbooking-blue
    hostname: podbooking-blue
    environment:
      TZ: Europe/Stockholm
      SCRIPT_NAME: /testpod
    command: gunicorn --bind 0.0.0.0:8000 --workers 1 getapod:app
    volumes:
      - /opt/podbooking-blue:/var/www
    networks:
      wikinet:
        aliases:
          - podbooking-blue
docker-compose up -d podbooking-blue && docker start podbooking-blue
The pod booking system uses SQLite3 in file form, stored in /opt/podbooking-green/data.sqlite.
There is a Python module, Flask-Migrate, that is meant to be used if existing tables need to be altered, but it is not 100% tested yet. Do not touch!
Two types of backup are made of the booking system:
- /opt/podbooking-green (excluding .git/) every Monday at 05:00
- /opt/podbooking-green/data.sqlite every day at 05:05
All backups go to the curve server (193.10.237.3).
root@catch-up2:~# crontab -l
--- output omitted ---
# m h dom mon dow command
0 5 * * 1 tar czf ~/tmp/podbooking.tar.gz --exclude=".git" /opt/podbooking-green && rsync -azhP -e "ssh -i $HOME/.ssh/curve.control-server.key" /root/tmp/podbooking.tar.gz cnapbackup@193.10.237.3:~/podbooking/`date +\%Y-\%m-\%d`-complete/ && rm ~/tmp/podbooking.tar.gz
5 5 * * * tar czf ~/tmp/podbooking.tar.gz /opt/podbooking-green/data.sqlite && rsync -azhP -e "ssh -i $HOME/.ssh/curve.control-server.key" /root/tmp/podbooking.tar.gz cnapbackup@193.10.237.3:~/podbooking/`date +\%Y-\%m-\%d`-db/ && rm ~/tmp/podbooking.tar.gz
0 23 * * * if test -f ~/tmp/podbooking.tar.gz; then rm ~/tmp/podbooking.tar.gz; fi
sqlite> pragma table_info('getapod_rooms');
cid name type notnull dflt_value pk
--- ---- ----------- ------- ---------- --
0 id INTEGER 1 1
1 name VARCHAR(30) 0 0
2 pods INTEGER 0 0
sqlite> pragma table_info('users');
cid name type notnull dflt_value pk
--- ---------- ------------ ------- ---------- --
0 id INTEGER 1 1
1 username VARCHAR(64) 0 0
2 role_id INTEGER 0 0
3 password VARCHAR(128) 0 0
4 flag VARCHAR(64) 0 0
5 last_login INTEGER 0 0
6 created INTEGER 0 0
7 fullname VARCHAR(128) 0 0
8 mail VARCHAR(128) 0 0
9 profile VARCHAR(64) 0 0
sqlite> pragma table_info('roles');
cid name type notnull dflt_value pk
--- ---- ----------- ------- ---------- --
0 id INTEGER 1 1
1 name VARCHAR(64) 0 0
sqlite> pragma table_info('getapod_bookings');
cid name type notnull dflt_value pk
--- ------------ ----------- ------- ---------- --
0 id INTEGER 1 1
1 room INTEGER 0 0
2 time INTEGER 0 0
3 pod INTEGER 0 0
4 duration INTEGER 0 0
5 name1 VARCHAR(64) 0 0
6 name2 VARCHAR(64) 0 0
7 comment VARCHAR(64) 0 0
8 flag VARCHAR(64) 0 0
9 confirmation VARCHAR(64) 0 0
sqlite> pragma table_info('skillinstance');
cid name type notnull dflt_value pk
--- ----------------- ------------- ------- ---------- --
0 id INTEGER 1 1
1 created INTEGER 0 0
2 owner VARCHAR(128) 0 0
3 name VARCHAR(128) 0 0
4 course VARCHAR(128) 0 0
5 schoolyear INTEGER 0 0
6 period VARCHAR(4) 0 0
7 standard_dates VARCHAR(512) 0 0
8 standard_rooms VARCHAR(128) 0 0
9 standard_times VARCHAR(256) 0 0
10 standard_duration INTEGER 0 0
11 extra_dates VARCHAR(512) 0 0
12 extra_rooms VARCHAR(128) 0 0
13 extra_times VARCHAR(256) 0 0
14 extra_duration INTEGER 0 0
15 type VARCHAR(32) 0 0
16 status VARCHAR(32) 0 0
17 comment VARCHAR(1024) 0 0
sqlite> pragma table_info('skillbookings');
cid name type notnull dflt_value pk
--- -------- ------------ ------- ---------- --
0 id INTEGER 1 1
1 skill_id INTEGER 0 0
2 date VARCHAR(16) 0 0
3 time INTEGER 0 0
4 room VARCHAR(16) 0 0
5 timeslot INTEGER 0 0
6 student VARCHAR(16) 0 0
7 result VARCHAR(32) 0 0
8 teacher VARCHAR(32) 0 0
9 comment VARCHAR(256) 0 0
10 flag VARCHAR(64) 0 0
Right now result, teacher and comment are not used, but they are intended to hold results and any comments.
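For quick manual inspection, the tables above can be read straight from the database file with Python's sqlite3 module. A minimal sketch (the path is the production green instance; adjust as needed):

import sqlite3

# Read-only look at the booking database; adjust the path if inspecting a copy.
conn = sqlite3.connect('/opt/podbooking-green/data.sqlite')
conn.row_factory = sqlite3.Row

# The ten most recent skill bookings, including the currently unused columns.
for row in conn.execute(
        "SELECT id, skill_id, date, room, student, result, teacher, comment "
        "FROM skillbookings ORDER BY id DESC LIMIT 10"):
    print(dict(row))

conn.close()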
Updates must be well motivated, not made just because they are possible.
What:
Change how the /show endpoint displays Kronox bookings for the current room.
Today:
Every time /show is rendered, a dict{} is checked for existing data. If there is none, the room's upcoming bookings are scraped from schema.hv.se, up to 3 weeks from today's date. The result is stored together with a timestamp in the previously empty dict object and can then be fetched on the next call to /show. If the timestamp shows that the information is more than a few hours old, a new scrape is done.
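Roughly, today's behaviour corresponds to the sketch below. The names (kronox_cache, scrape_schema, get_bookings, MAX_AGE) are placeholders for illustration, not the actual identifiers in getapod.py:

import time

MAX_AGE = 3 * 3600          # treat cached data as stale after a few hours
kronox_cache = {}           # {room: {"fetched": <epoch>, "bookings": [...]}}


def scrape_schema(room):
    """Placeholder for the scraper that fetches the room's Kronox bookings
    from schema.hv.se, up to 3 weeks from today's date."""
    return []


def get_bookings(room):
    """Called from the /show view: use the cache if it is fresh, otherwise re-scrape."""
    entry = kronox_cache.get(room)
    if entry is None or time.time() - entry["fetched"] > MAX_AGE:
        kronox_cache[room] = {"fetched": time.time(),
                              "bookings": scrape_schema(room)}
    return kronox_cache[room]["bookings"]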
Why:
All of this works great as long as schema.hv.se behaves. When the schedule is down for some reason (host*kronox*host), the whole worker process that fetches the booking page crashes. Right now this is worked around by commenting out the block that renders the dict object in the HTML template used by /show.
Proposal:
Run the schedule update asynchronously in the background instead.
Summary of the ChatGPT thread on the subject (see the sketch after this list):
- To update the data in your Flask application without causing the page to crash when the URL is down or inaccessible, you can use asynchronous scraping.
- To schedule a function to run at a 60-minute interval, you can use the Flask-APScheduler library and its scheduler.add_job function.
- To make sure a dictionary variable is thread-safe, you can use locks to prevent race conditions.
- You can use the with statement in combination with the Lock() function to acquire and release the lock automatically.
- If a function is called while the lock is being held, the function will wait until the lock is released.
- When using a single worker process in the Gunicorn environment, all threads share the same data; as long as locks are handled correctly, it should work fine.
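A minimal sketch of how these points could be combined for the /show case, using Flask-APScheduler and a threading.Lock. All names below (kronox_cache, scrape_schema, refresh_kronox) are placeholders, not existing code in getapod.py:

from threading import Lock

from flask import Flask, jsonify
from flask_apscheduler import APScheduler

app = Flask(__name__)
scheduler = APScheduler()

cache_lock = Lock()
kronox_cache = {}   # shared between the background job and the /show route


def scrape_schema():
    """Placeholder for the real schema.hv.se scraper; may raise if the site is down."""
    return {"C203": ["example booking"]}


def refresh_kronox():
    """Background job: scrape and swap in new data without blocking any route."""
    try:
        new_data = scrape_schema()
    except Exception:
        # schema.hv.se is down: keep the old cached data instead of crashing the worker.
        return
    with cache_lock:
        kronox_cache.clear()
        kronox_cache.update(new_data)


@app.route('/show')
def show():
    with cache_lock:
        bookings = dict(kronox_cache)   # copy while holding the lock
    return jsonify(bookings)            # the real app renders an HTML template


# With --workers 1 in gunicorn there is exactly one scheduler instance.
scheduler.init_app(app)
scheduler.start()
scheduler.add_job(id='refresh-kronox', func=refresh_kronox,
                  trigger='interval', minutes=60)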
I have an already live Flask application that is working fine-ish right now. There is a feature though that scrapes an url whenever a certain route is visited. This works fine most of the time, but sometimes the url being scraped is down or inaccessible and then the page crashes. Can you offer a suggestion on an alternative way to get the data? Right now the data is used as a global variable to present and is updated roughly once an hour.
There are a few different ways that you could handle this situation. Here are a few ideas:
Ultimately, the best solution will depend on your specific use case and the requirements of your application. I recommend trying out a few different approaches and testing which one works best for your needs.
Asynchronous scraping looks nice to me and feels like it would best fit my needs. Isn't a Flask app mostly reactive though (based on routes visited)? Can I repeat code that will not interfere with the route functions?
Yes, a Flask application is primarily reactive and based on handling individual routes, but you can still use asynchronous techniques within the application to handle certain tasks, such as scraping data, in the background.
You can use an async library like aiohttp to make the scraping non-blocking and run it in the background while your application continues to handle other requests. For example, you could create an async task that runs in the background and periodically updates the data, and then use that data in your route handlers.
You could use an async library like asyncio and aiohttp along with the Flask-APScheduler package to schedule the scraping function to run at specific intervals in the background. The scheduled function can be a coroutine and you could use await inside the coroutine to make it non-blocking.
It's also important to note that you should be careful when using async code in a Flask application, because it can introduce complexity and additional considerations such as thread-safety. But it can be done and can be a good solution for your use case.
What do you mean by thread-safety? I have very limited knowledge of parallel programming.
In computer science, thread safety refers to the design and implementation of code that can be safely used in a concurrent environment, where multiple threads may execute the same code concurrently. In other words, it means that the code can handle multiple threads accessing and modifying shared data without any conflicts or unexpected behavior.
In a typical Flask application, each incoming request is handled by a separate thread, and if your code is not thread-safe, it can lead to race conditions, where multiple threads try to access or modify the same data at the same time, leading to unpredictable results.
When using async code, it is important to ensure that shared data is accessed in a thread-safe manner, using synchronization mechanisms such as locks or semaphores to control access to the shared data.
For example, if you have a global variable that stores the scraped data, and multiple threads are trying to access and update it at the same time, you could use a lock to ensure that only one thread can access the variable at a time, avoiding race conditions.
It's important to keep in mind that thread-safety is a complex topic and it can be a bit challenging to grasp at first. But with proper understanding and testing it can be achieved.
Fair enough. Can you give an example of how to schedule a function to run in an interval of 60 minutes? You can use the example function empty_function() that only does a return immediately.
Sure, here's an example of how you can use the Flask-APScheduler package to schedule a function to run every 60 minutes in a Flask application:
from flask import Flask
from flask_apscheduler import APScheduler

app = Flask(__name__)
scheduler = APScheduler()

def empty_function():
    return

def start_scheduler():
    scheduler.init_app(app)
    scheduler.start()
    # Flask-APScheduler requires an id for each job
    scheduler.add_job(id='empty_job', func=empty_function, trigger='interval', minutes=60)

if __name__ == '__main__':
    start_scheduler()
    app.run()
In this example, we import Flask from flask and APScheduler from the flask_apscheduler package. We create an instance of Flask and of APScheduler. In the start_scheduler function, we initialize the scheduler with our Flask app using scheduler.init_app(app) and start it with scheduler.start().
We are scheduling empty_function() to run every 60 minutes using scheduler.add_job(id='empty_job', func=empty_function, trigger='interval', minutes=60).
It's important to note that, in this example, the scheduled function would start running as soon as the server starts and continue running even after the request is completed.
You can also add more options like start_date, end_date, args, kwargs etc as per your requirement.
Also, keep in mind that this package is based on APScheduler, which is a powerful package for scheduling jobs. You can also use the schedule package, which is another option for scheduling jobs and has other features like retries and misfire handling.
So just checking my understanding here. The scheduler.add_job function makes sure empty_function() will be called every 60 minutes until the main program terminates?
Yes, that's correct. The scheduler.add_job function schedules empty_function() to run at a specific interval (in this case, every 60 minutes) until the main program is terminated. Once the scheduler is started with scheduler.start(), it will continue to run in the background and call empty_function() every 60 minutes as long as the main program is running.
It's worth noting that the scheduled function is called periodically in the background, so you will not see any difference in the response time of requests.
Also, keep in mind that if you are running multiple instances of your app, each instance will have its own scheduler and will run the scheduled function independently.
Right now I am running with 1 worker in the gunicorn environment, so all threads should share data if I understand it correctly.
Correct, if you are using Gunicorn with a single worker, all requests will be handled by the same worker process and therefore will share the same memory space. In this case, the global variable that you are using to store the scraped data will be accessible by all threads and will be shared by all requests.
It's important to keep in mind that, even if you are using a single worker, if you're not properly synchronizing access to shared data, you may still run into race conditions and other thread-safety issues.
So, it is important to use synchronization mechanisms such as locks or semaphores to control access to the shared data.
Also, keep in mind that if you are running your app in a multi-worker environment, each worker will have its own memory space, so you'll need to ensure that the data is properly shared and synchronized across all workers.
What is the difference between locks and semaphores in this context?
In the context of thread-safety, locks and semaphores are both used to synchronize access to shared resources.
A lock is a synchronization mechanism that allows only one thread to execute a specific section of code at a time. When a thread acquires a lock, other threads that try to acquire the same lock will be blocked until the thread that holds the lock releases it. Locks are typically used to protect shared data structures and resources.
A semaphore, on the other hand, is a synchronization mechanism that allows multiple threads to execute a specific section of code at the same time, but with a limit on the number of threads that can execute that code simultaneously. A semaphore has a counter that is decremented each time a thread acquires the semaphore and incremented each time a thread releases the semaphore. When the counter reaches zero, no other thread can acquire the semaphore until another thread releases it. Semaphores are typically used to control access to shared resources, such as a pool of database connections.
In summary, locks are used to provide exclusive access to a shared resource, whereas semaphores are used to control access to shared resources, with a limit on the number of threads that can access the resource simultaneously.
Keep in mind that locks and semaphores are not specific to Python, they are general concepts of parallel programming and you can find their implementation in various programming languages.
can you give an example of how I could use a lock to make sure a dictionary variable is "safe"?
Sure! Here's an example of how you could use a lock to make sure a dictionary variable is "safe" in a multi-threaded Flask application:
from threading import Lock

from flask import Flask

app = Flask(__name__)

# Create a global lock
data_lock = Lock()

# Create a global dictionary
data = {}

@app.route('/update')
def update_data():
    # Acquire the lock; it is released automatically when the with block ends
    with data_lock:
        # Perform operations on the dictionary
        data["key"] = "value"
    return "Data updated"

@app.route('/display')
def display_data():
    # Acquire the lock; it is released automatically when the with block ends
    with data_lock:
        # Perform operations on the dictionary
        return data["key"]
In this example, the data_lock is a global variable of type Lock from the threading module. The update_data and display_data functions are routes in the Flask application.
When the update_data function is called, it acquires the data_lock using a with statement before performing any operations on the data dictionary. This ensures that only one thread can execute the code inside the with block at a time, preventing race conditions and other thread-safety issues. After the operations on the dictionary are complete, the lock is automatically released.
Similarly, the display_data function also acquires the data_lock before performing any operations on the data dictionary and releases it after the operations are complete.
By using the with statement, you can ensure that the lock is released even if an exception is raised inside the block.
This way, you can be sure that the dictionary is safe and that no race conditions occur.
So with ... functions just like with open does for file objects?
Yes, the with statement in Python is similar to the using statement in C# or the try-finally block in Java. It is used to acquire a resource, such as a file or a lock, at the beginning of a block and release it automatically at the end of the block, regardless of whether the block executed successfully or raised an exception.
For example, when you open a file using the open() function, you can use the with statement to ensure that the file is closed automatically when you're done with it, even if an exception is raised:
with open("file.txt", "r") as f:
    content = f.read()
    # Do something with the file
    # ...

# The file is closed automatically here
Similarly, when the with statement is used with a Lock, the lock is acquired at the start of the block and released automatically at the end.
I see, what is the result when update_data() is called without the lock being released?
When the update_data function is called and the lock is not released, it will cause the function to block until the lock is released by the thread that has acquired it. While the function is blocked, no other thread will be able to acquire the lock and execute the code inside the with block, effectively blocking access to the data dictionary.
In a web application like Flask, this can lead to poor performance and even timeouts if a large number of requests are blocked on the same lock. It can also cause the application to become unresponsive if the lock is never released due to an error or bug in the code.
It is important to ensure that the lock is always released after it is acquired, either by using the with statement or by calling the release() method on the lock explicitly. This can be done with a try-finally block:
data_lock.acquire()
try:
    # Perform operations on the dictionary
    data["key"] = "value"
finally:
    data_lock.release()
This way you can be sure that the lock will be released, even if an exception is raised.
So until the lock is released, the function will be effectively waiting, correct?
Yes, that is correct. When a thread calls the acquire() method on a lock and the lock is already held by another thread, the calling thread will be blocked until the lock is released by the other thread. The thread will not continue execution until it has acquired the lock.
This is why it is important to keep the duration of the code inside the lock as short as possible, to minimize the time that other threads are blocked. And also make sure to release the lock as soon as you are done with the critical section of code.
It is also important to note that using locks can lead to issues like Deadlocks if not used properly.
So circling back to my background scheduling, this should work fine then assuming with lock is handled correctly and the background process correctly handles the scraped data?
Yes, using a background scheduler to periodically update the data and using a lock to synchronize access to the data should work fine, as long as the lock is used correctly and the background process is implemented correctly.
You should make sure that the background process releases the lock after it is done updating the data, so that other threads can access it. And also make sure that the background process is implemented in a way that is thread-safe and can handle any race conditions that may occur.
Additionally, you might want to consider using a more advanced library such as Celery to handle your background tasks, as it provides more features and robustness than a simple scheduler, like the ability to schedule periodic tasks, retry failed tasks, and distribute tasks across multiple worker processes.
It is important to test the application thoroughly in different scenarios, before deploying it to production environment.
Can you elaborate on how this would work with Celery? Remember I only use 1 worker process.
Celery is a task queue library that allows you to schedule and execute background tasks asynchronously. It is built on top of message brokers such as RabbitMQ or Redis, which are used to send and receive messages between the Celery worker processes and the main application.
When you use Celery with a single worker process, the worker process will be responsible for executing all the scheduled tasks. In this case, you don't need to worry about race conditions between different worker processes, since there is only one worker process running.
To use Celery in your Flask application, you would need to install it and configure it to use a message broker. You would also need to create a new Celery instance and configure it with your Flask application.
Here is an example of how you could use Celery to schedule a task that updates your data every 60 minutes:
from celery import Celery

# Create a new Celery instance
celery = Celery('tasks', broker='redis://localhost:6379/0')

# Define the task that updates the data
@celery.task
def update_data():
    with data_lock:
        # Perform scraping and updating of the data
        data = scrape_data()

# Schedule the task to run every 60 minutes
celery.conf.beat_schedule = {
    'update-data': {
        'task': 'update_data',
        'schedule': 60 * 60.0,
    },
}
In this example, I am using Redis as the message broker. You can also use other message brokers like RabbitMQ, and you'll need to configure the Celery instance accordingly.
When you run your Flask application, you will also need to start the Celery worker process, which will be responsible for executing the scheduled tasks. You can start the worker process by running celery -A your_flask_module:celery worker --loglevel=info.
It's worth noting that in case you want to scale up the number of worker processes in the future, Celery is built with that in mind, as it handles the distribution of tasks across multiple worker processes automatically, so you don't have to worry about race conditions or thread-safety issues.
I understand, thank you.
Can you summarize the things we discussed as bullet points? Do not include the Celery part, that is not going to be used.
Sure, here's a summary of the things we discussed:
What:
Make certain app variables dynamic via an administrative interface.
Today:
There are a number of settings that control triggers for various events, notifications and the like, and all of these are hard-coded in the source code. Changing them is fine, but it requires a restart of the Docker container for every small change.
Why:
There may be certain types of debug messages that are interesting for monitoring or maintenance, and then it is useful to be able to set the reporting level for these and where that information should be sent, e.g. Discord webhooks, API keys and so on.
Proposal:
Create an ordinary config file that is read at program start and sets all Flask app variables. These can then be changed while the program is running; just hook a 'save' button up to writing the change back to the config file. By changing the variables gracefully, the app can be updated without being rebuilt (since the config file lives on the persistent volume) in Docker.
from flask import current_app
current_app.config['KEY'] = 'new data'
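A minimal sketch of the proposal, assuming a JSON file on the persistent volume; the file name settings.json and the helper functions below are placeholders, not existing code in getapod.py:

import json

from flask import Flask, current_app

CONFIG_FILE = '/var/www/settings.json'   # assumed location on the persistent volume

app = Flask(__name__)


def load_settings(app):
    """Read the config file at program start and set the Flask app variables."""
    with open(CONFIG_FILE) as f:
        app.config.update(json.load(f))


def save_setting(key, value):
    """Hooked up to the admin 'save' button: update the running app and persist."""
    current_app.config[key] = value
    with open(CONFIG_FILE) as f:
        settings = json.load(f)
    settings[key] = value
    with open(CONFIG_FILE, 'w') as f:
        json.dump(settings, f, indent=2)


load_settings(app)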