Detailed presentation of the data
1 General presentation of the data
The data prepared for this project cover all housing transactions in France between 2014 and 2021, together with details about each property sold. Because the characteristics of the properties sold are not open data, the data you’ll be working with are synthetic: they have been generated to mirror the original data.
If you want to have a closer look at the data, CEREMA publishes on its website extensive documentation of the different datasets, the variables, their modalities and the overall quality of each variable.
2 List of variables
| # | Label of the variable | Full name of the variable | Explanation and remarks | Original label of the variable (French) |
|---|---|---|---|---|
| 1 | dist_tosea | Distance of the property to the nearest seashore - capped at 10km | This variable has been calculated | distance_ltm (calculated) |
| 2 | farea | Reported floor area of the property | | dsupdc |
| 3 | has_cheating | Whether the property has access to central heating | Coded as: 0 = No, 1 = Missing value, 2 = Yes | gchclc |
| 4 | has_elec | Whether the property has access to electricity | Coded as: 0 = No, 1 = Missing value, 2 = Yes | gelelc |
| 5 | has_elevator | Whether the building of the flat has an elevator (for flats only) | Coded as: 0 = No, 1 = Missing value, 2 = Yes | gasclc |
| 6 | has_gas | Whether the building is connected to the gas mains | Coded as: 0 = No, 1 = Missing value, 2 = Yes | ggazlc |
| 7 | has_mdrainage | Whether the property is connected to the mains drainage system | Coded as: 0 = No, 1 = Missing value, 2 = Yes | gteglc |
| 8 | has_rchute | Whether the building of the flat has refuse chutes (for flats only) | Coded as: 0 = No, 1 = Missing value, 2 = Yes | gvorlc |
| 9 | has_water | Whether the property has access to water | Coded as: 0 = No, 1 = Missing value, 2 = Yes | geaulc |
| 10 | n_ancrooms | Number of ancillary rooms reported in the property | Ancillary rooms include hallways, attics. They differ from n_otherannex. | dnbann |
| 11 | n_attic | Number of attics reported in the whole property | | nb_greniers (calculated) |
| 12 | n_basmt | Number of basements reported in the whole property | | nb_caves (calculated) |
| 13 | n_bath | Number of bathtubs reported in the property | | dnbbai |
| 14 | n_eatr | Number of eating rooms reported in the property | | dnbsam |
| 15 | n_floors | Number of floors in the property (building or house) | This variable is more reliable for houses than for buildings. The encoding of underground floors is not fully harmonized: it is often 81 for minus 1, 82 for minus 2, and so on. It can also be encoded as 99 or 98. A flat on the 2nd floor of a seven-floor building should be encoded with nth_floor=1 and n_floors=8 (ground floor and seven floors above ground level). | dnbniv |
| 16 | n_garage | Number of garages reported in the whole property | | nb_garages (calculated) |
| 17 | n_kit8 | Number of kitchens reported in the property with an area of less than 8 square meters | | dnbcu8 |
| 18 | n_kit9 | Number of kitchens reported in the property with an area larger than 9 square meters | | dnbcu9 |
| 19 | n_mrooms | Number of main rooms reported in the property | n_mrooms = n_eatr + n_slr + n_kit8 + n_kit9 + n_washr | dnbppr |
| 20 | n_otherannex | Number of other annexes reported in the whole property | | nb_autresdep (calculated) |
| 21 | n_pool | Number of pools reported in the whole property | | nb_piscines (calculated) |
| 22 | n_rooms | Number of rooms reported in the property | n_rooms = n_mrooms + n_annex | dnbpdc |
| 23 | n_show | Number of showers reported in the property | | dnbdou |
| 24 | n_sink | Number of sinks reported in the property | | dnblav |
| 25 | n_slr | Number of sleeping rooms reported in the property | | dnbcha |
| 26 | n_terrace | Number of terraces reported in the whole property | | nb_terrasses (calculated) |
| 27 | n_washr | Number of washing rooms reported in the property | | dnbsea |
| 28 | n_wc | Number of toilets reported in the property | | dnbwc |
| 29 | nth_floor | Reported floor of the property | The floor on which the flat is located (in France, the second floor is the first floor above ground level). This variable is set to 00 for houses. The encoding of underground floors is not fully harmonized: it is often 81 for minus 1, 82 for minus 2, and so on. It can also be encoded as 99 or 98. A flat on the 2nd floor of a seven-floor building should be encoded with nth_floor=1 and n_floors=8 (ground floor and seven floors above ground level). | dniv |
| 30 | price | Price of the transaction | Price of the transaction is in EUR | valeurfonc |
| 31 | price_sqm | Price per square meter of the transaction | | price_sqm (calculated) |
| 32 | prop_loc_citycode | Official code of the city where the property is located | see remarks above | depcom |
| 33 | prop_loc_dep | Department code where the property is located | The data do not cover the whole French territory: overseas territories are included, but Alsace-Moselle (eastern part of France) is not | ccodep |
| 34 | prop_loc_x | Longitude where the property is located | see remarks above | x |
| 35 | prop_loc_y | Latitude where the property is located | see remarks above | y |
| 36 | prop_type | Type of property | 1 represents a flat and 2 a house | dteloc |
| 37 | prop_year_harm | Year of construction of the property | This variable has been harmonized to correct typing mistakes. More details are available in the introduction. | jannath |
| 38 | s_land_agri | Agricultural land area (square meters) | Agricultural land is used for farming. It includes fields, meadows, orchards and vineyards. | dcntagri |
| 39 | s_land_artif | Artificial land area (square meters) | Artificial land includes recreational areas, land, building plots and gardens. Artificial land refers to land that has been altered by humans. | dcntsol |
| 40 | s_land_nat | Natural land area (square meters) | Natural land is land that has not been altered. This includes, for example, forests. | dcntnat |
| 41 | stair | Whether the building of the flat has stairs (for flats only) | Coded as: 0 = No, 1 = Missing value, 2 = Yes | gesclc |
| 42 | trans_date | Date of the official certified transaction | | datemut |
| 43 | trans_id | Unique identifier code of the transaction | | idmutation |
| 44 | trans_month | Month of the official certified transaction | | moismut |
| 45 | trans_type_code | Type of transaction | There are several types of transaction in the original data (sale, off-plan sale, sale of building land, tender, compulsory purchase). The original data have been filtered to keep only sales. | idnatmut |
| 46 | trans_type_label | Type of transaction | | libnatmut |
| 47 | trans_year | Year of the official certified transaction | | anneemut |
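
The 0/1/2 coding used by the `has_*` variables and by `stair` is easy to misread, because 1 stands for a missing value rather than for "yes". Below is a minimal sketch of a decoder, assuming the codes are read as integers. The helper name `decode_tristate` is hypothetical and not part of the dataset, and raising an error on any other code is an assumption.

```python
from typing import Optional

# Coding documented in the table above: 0 = No, 1 = Missing value, 2 = Yes.
_TRISTATE = {0: False, 1: None, 2: True}


def decode_tristate(code: int) -> Optional[bool]:
    """Map a 0/1/2 equipment code to False / None (missing) / True.

    Hypothetical helper; rejecting any other code is an assumption.
    """
    if code not in _TRISTATE:
        raise ValueError(f"unexpected code: {code!r}")
    return _TRISTATE[code]


# Example row with made-up values for three of the documented variables.
row = {"has_elec": 2, "has_gas": 0, "has_elevator": 1}
decoded = {name: decode_tristate(value) for name, value in row.items()}
print(decoded)
```

Keeping the missing value as `None` rather than folding it into "No" avoids treating unknown equipment as absent.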
See CEREMA’s documentation (in French) for a more detailed description.
3 Detailed note
Note that the construction-year variable has been harmonized to correct typing mistakes. The following transformations were applied to the original data:
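
The remarks on `n_floors` and `nth_floor` describe underground floors encoded as 81, 82 and so on. The sketch below shows one possible way to turn these raw values into signed floor numbers. The function name `decode_floor` is hypothetical. It also rests on two assumptions that the table does not state: that the 81, 82 pattern runs up to 89, and that 98 and 99 are best treated as unknown because their meaning is not documented here.

```python
from typing import Optional


def decode_floor(raw: int) -> Optional[int]:
    """Turn a raw nth_floor / n_floors value into a signed floor number.

    Assumptions (not stated in the source): the 81, 82, ... pattern runs
    up to 89, and 98/99 are returned as None because their exact meaning
    is not documented in the table.
    """
    if 81 <= raw <= 89:
        return -(raw - 80)  # 81 -> -1, 82 -> -2, ...
    if raw in (98, 99):
        return None  # ambiguous underground code
    return raw  # ordinary floors are kept as-is


print([decode_floor(v) for v in (0, 1, 81, 82, 99)])
# -> [0, 1, -1, -2, None]
```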
| Original modality | Correction to original modality | Example of transformation |
|---|---|---|
| 1 ≤ prop_year ≤ 22 | Add 2000 as the first digits were not entered | 8 → 2008 |
| 23 ≤ prop_year ≤ 99 | Add 1900 as the first digits were not entered | 83 → 1983 |
| 100 ≤ prop_year ≤ 119 | Set to 0 as it is unknown | 105 → 0 |
| 120 ≤ prop_year ≤ 200 | Add a 0 at the end as the last digit has not been entered | 187 → 1870 |
| 201 ≤ prop_year ≤ 299 | Set to 0 as it is unknown | 250 → 0 |
| 300 ≤ prop_year ≤ 999 | Add 1000 as the first digits have not been entered | 980 → 1980 |
| 1000 ≤ prop_year ≤ 1120 | Set to 0 as it is unknown | 1005 → 0 |
| 1120 ≤ prop_year ≤ 1199 | Replace the second 1 with 9 | 1155 → 1955 |
| 1200 ≤ prop_year ≤ 2022 | No change | 1200 → 1200 |
More information about this transformation is available online (in French).
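
The correction rules in the table above can be sketched as a single function. This is an illustration under stated assumptions, not the code used to build the dataset, and the name `harmonize_year` is hypothetical. The value 1120 appears in two rows of the table; the sketch tries the rows from top to bottom, so 1120 is set to 0. Values that match no row (for example 0 or anything above 2022) are returned unchanged, which is also an assumption.

```python
def harmonize_year(year: int) -> int:
    """Apply the correction rules from the table to a raw construction year.

    Assumptions: rows are tried top to bottom, so 1120 (listed in two
    rows) is set to 0; values outside every row are returned unchanged.
    """
    if 1 <= year <= 22:
        return year + 2000  # first digits missing: 8 -> 2008
    if 23 <= year <= 99:
        return year + 1900  # first digits missing: 83 -> 1983
    if 100 <= year <= 119:
        return 0  # unknown
    if 120 <= year <= 200:
        return year * 10  # last digit missing: 187 -> 1870
    if 201 <= year <= 299:
        return 0  # unknown
    if 300 <= year <= 999:
        return year + 1000  # first digit missing: 980 -> 1980
    if 1000 <= year <= 1120:
        return 0  # unknown
    if 1121 <= year <= 1199:
        return year + 800  # second 1 replaced with 9: 1155 -> 1955
    return year  # 1200 to 2022: no change


# The example column of the table, used as a quick check.
examples = {8: 2008, 83: 1983, 105: 0, 187: 1870, 250: 0,
            980: 1980, 1005: 0, 1155: 1955, 1200: 1200}
for raw, expected in examples.items():
    assert harmonize_year(raw) == expected
```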