How to store data
Data storage tutorial
Some indicators need to refer to earlier data while processing, such as the time of the last run or the result of the last data processing. In these cases, intermediate data or result data needs to be stored for later use.
Database-backed (DB-level) storage is not currently supported; only Redis-based storage is provided.
Set up a Redis connection
When testing and developing locally, set REDIS_URL = env.str('REDIS_URL', 'redis://127.0.0.1:6379/0') in the crawlers/config.py file so that it points to your local Redis URL.
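A minimal sketch of what the relevant lines in crawlers/config.py might look like; the use of the environs package for the env object is an assumption based on the env.str call above:

# crawlers/config.py -- sketch, assuming the environs package provides the env object
from environs import Env

env = Env()
env.read_env()  # load variables from a local .env file, if present

# Falls back to a local Redis instance when REDIS_URL is not set in the environment
REDIS_URL = env.str('REDIS_URL', 'redis://127.0.0.1:6379/0')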
Redis usage
Some commonly used Redis operations are defined in the utils/redis_conn.py file of the project library.
At present, only three storage methods (plus a locking decorator) are provided externally, as shown in the following code:
import logging
import time
from urllib.parse import urlparse

import redis

from crawlers.config import REDIS_URL

_REDIS_URL = urlparse(REDIS_URL)

pool = redis.connection.ConnectionPool(
    host=_REDIS_URL.hostname,
    port=_REDIS_URL.port,
    db=_REDIS_URL.path[1:],
)
_redis_client = redis.Redis(connection_pool=pool)
class rds:
    @classmethod
    def getex(cls, prefix, name):
        """
        Return the value at key ``prefix + ':' + name``, or None if the key doesn't exist.
        prefix: the name value of the current crawler
        name: a custom name related to the current business
        """
        key = prefix + ':' + name
        if len(key.encode()) > 1024:
            logging.warning('Key is too large')
            return None
        value = _redis_client.get(key)
        if value:
            return str(value, encoding="utf-8")
        return value
    @classmethod
    def setex(cls, prefix, name: str, value: str, ttl):
        """
        Store ``value`` at key ``prefix + ':' + name`` with an expiration time.
        Return True on success, or False if the key already exists or a size limit is exceeded.
        prefix: the name value of the current crawler
        name: a custom name related to the current business; the key string must not exceed 1 KB
        value: the stored value must not exceed 128 KB
        ttl: an expiration time must be set, chosen according to business requirements
        """
        key = prefix + ':' + name
        # Size limits: value must not exceed 128 KB, key must not exceed 1 KB
        if len(value.encode()) > 1024 * 128 or len(key.encode()) > 1024:
            logging.warning('Key or Value is too large')
            return False
        if _redis_client.get(key) is not None:
            return False
        if ttl:
            _redis_client.set(key, value, ex=ttl)
        else:
            _redis_client.set(key, value)
        return True
    @classmethod
    def get_and_set_key(cls, prefix, name: str, value: str, ttl: int = None):
        """
        Return True if the key ``prefix + ':' + name`` already exists;
        otherwise store ``value`` under it (with ``ttl`` if given) and return False.
        prefix: the name value of the current crawler
        name: a custom name related to the current business; the key string must not exceed 1 KB
        value: the stored value must not exceed 128 KB
        ttl: an expiration time must be set, chosen according to business requirements
        """
        key = prefix + ':' + name
        # Size limits: value must not exceed 128 KB, key must not exceed 1 KB
        if len(value.encode()) > 1024 * 128 or len(key.encode()) > 1024:
            logging.warning('Key or Value is too large')
            return False
        if _redis_client.get(key):
            return True
        _redis_client.set(key, value)
        if ttl:
            _redis_client.expire(key, ttl)
        return False
    @classmethod
    def thing_lock(cls, name, expiration_time=2, time_out=3):
        """
        Pessimistic code lock.
        Purpose: prevent a function from being executed concurrently, which could otherwise cause unpredictable problems.
        """
        def outer_func(func):
            def wrapper_func(*args, **kwargs):
                lock_name = f'lock:{name}'
                end_time = time.time() + time_out
                while time.time() < end_time:
                    if _redis_client.setnx(lock_name, expiration_time):
                        _redis_client.expire(lock_name, expiration_time)
                        data = func(*args, **kwargs)
                        _redis_client.delete(lock_name)
                        return data
                    time.sleep(0.001)
                # Lock could not be acquired within time_out seconds: run the function anyway
                return func(*args, **kwargs)
            return wrapper_func
        return outer_func
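The thing_lock decorator can wrap any function that must not run concurrently. A minimal usage sketch; the lock name 'refresh_indicator', the function name, and the timings are illustrative, and the import assumes the utils/redis_conn.py module shown above:

from utils.redis_conn import rds

# Hypothetical task guarded by a Redis lock named 'lock:refresh_indicator'
@rds.thing_lock('refresh_indicator', expiration_time=2, time_out=3)
def refresh_indicator():
    # Only one caller at a time runs this while the lock is held;
    # if the lock cannot be acquired within time_out seconds, the function runs anyway.
    ...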
Todo:
The prefix field is the prefix of the Redis key and is taken from the name value of the spider.

Demo
Set value:
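A minimal sketch for storing a value; the spider name 'demo_spider' and the key name 'last_run_time' are illustrative, and the import assumes the utils/redis_conn.py module shown above:

import time

from utils.redis_conn import rds

SPIDER_NAME = 'demo_spider'  # hypothetical spider name, used as the key prefix

# Store the timestamp of the current run under 'demo_spider:last_run_time' for 24 hours
ok = rds.setex(SPIDER_NAME, 'last_run_time', str(int(time.time())), 60 * 60 * 24)
print(ok)  # True if stored; False if the key already exists or a size limit is exceeded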
Get value:
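A matching sketch for reading the value back, using the same illustrative names as above:

from utils.redis_conn import rds

SPIDER_NAME = 'demo_spider'  # hypothetical spider name, used as the key prefix

# Read the value stored under 'demo_spider:last_run_time'; None if it expired or was never set
last_run_time = rds.getex(SPIDER_NAME, 'last_run_time')
if last_run_time is None:
    print('no previous run recorded')
else:
    print('last run at', last_run_time)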